Presenters: Owen Sheehy and John O’Brien, technical support specialists from VMware GSS
GSS supports the entire VMware stack, so it requires technical competencies across all disciplines. For the purposes of this session, much of the focus is on vSphere (ESXi and VMware vCenter) as the foundation building blocks of any solution. 80% of cases are logged against these vSphere foundation components!
VVDs (VMware Validated Designs) help GSS to understand complicated environments quickly. However, fixes might break a VVD.
Getting Ready to Engage with GSS
Missing logs on ESXi hosts: when ESXi is installed on flash media and similar devices, logs are stored on a RAM disk and so don’t persist through reboots. Logs can also cycle and roll over quickly.
A way around this is to configure a scratch directory, either at installation time or afterwards. If ESXi can’t find a valid log location (a local boot device, another local device or a VMFS partition), it will write to a RAM disk, which means the logs will not survive a reboot.
All of this can be avoided by using vRealize Log Insight, which comes with a 25 OSI licence for vSphere.
ESXi level performance troubleshooting via esxtop is still in use within GSS. esxtop can be run in batch mode to provide usage metrics over a period of time.
The main things that GSS look for within esxtop include %readytime, which points to either poor configuration or over commitment. For networking, %packetsdropped: anything around 1/2% might indicate a wider network problem. For datastores, DAVG and KAVG, to rule out both device and kernel latency: anything higher than 15ms may indicate a problem. This can also be monitored at an adapter level, where again anything over 15ms might indicate a problem.
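The latency check can be sketched as a small script run over a batch capture. This is only a minimal sketch: it assumes the capture is a CSV whose first row holds the counter names, and that the latency columns can be picked out by a substring of the header. The function name and the shortened header in the sample are made up for illustration (real esxtop headers are longer); the 15ms threshold is the one quoted in the session.

```python
import csv
import io


def flag_high_samples(csv_text, header_substring, threshold):
    """Return (column_name, sample_number, value) for samples above threshold.

    Assumes row 0 is a header row and every later row is one sample interval.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Columns whose counter name contains the requested substring.
    wanted = [i for i, name in enumerate(header) if header_substring in name]
    flagged = []
    for sample_number, row in enumerate(reader, start=1):
        for i in wanted:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # skip blank or non-numeric cells
            if value > threshold:
                flagged.append((header[i], sample_number, value))
    return flagged


# Hypothetical two-sample capture with a simplified header.
SAMPLE = (
    "Time,vmhba0 Average Driver MilliSec/Command\n"
    "10:00:00,4.2\n"
    "10:00:05,22.7\n"
)

if __name__ == "__main__":
    hits = flag_high_samples(SAMPLE, "Average Driver MilliSec/Command", 15.0)
    for name, sample_number, value in hits:
        print(f"sample {sample_number}: {name} = {value} ms")
```

Matching on a header substring rather than a fixed column position keeps the sketch usable when the number of adapters or devices in the capture changes.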
Storage latency: within the logs GSS will look for the string “deteriorated I/O latency increased”. Log Insight can be configured to alert on such log entries, to provide early warning of issues.
All Paths Down (APD): all paths to the volume are marked as dead and no response is returned from the array. This is viewed as a transient state, and the volume is expected to return.
All I/O is fast-failed after 140 seconds. Within the logs, “APD_Start” is a useful search string within Log Insight.
Permanent Device Loss (PDL): a code is returned from the array to say that the device has been removed and will not be returning. “device is permanently unavailable” and “0x5 0x25 0x0” are useful strings to search for within Log Insight.
Given the nature of the storage environments within vSphere, if one host encounters PDL then it is highly likely that all hosts will encounter PDL, because most host clusters share paths to the storage and SAN arrays.
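The search strings from this part of the session can be collected into a small scanner for log files that have already been exported. This is a rough sketch and not how Log Insight itself works. The grouping names, the function names and the sample lines are hypothetical. Using the shorter substring “I/O latency increased” for the latency message is also an assumption, made so that the match does not depend on the exact wording around it.

```python
# Search strings quoted in the session, grouped by condition.
# "I/O latency increased" is a shortened form (an assumption, see above).
PATTERNS = {
    "latency": ["I/O latency increased"],
    "APD": ["APD_Start"],
    "PDL": ["device is permanently unavailable", "0x5 0x25 0x0"],
}


def classify_line(line):
    """Return the sorted condition names whose search strings appear in line."""
    return sorted(
        name
        for name, needles in PATTERNS.items()
        if any(needle in line for needle in needles)
    )


def scan(lines):
    """Count how many lines match each condition."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name in classify_line(line):
            counts[name] += 1
    return counts


# Hypothetical lines; real ESXi log entries carry far more detail.
SAMPLE_LOG = [
    "example line: APD_Start for a device",
    "example line: device is permanently unavailable",
    "example line: sense data 0x5 0x25 0x0",
    "example line: I/O latency increased",
    "example line: nothing of interest",
]

if __name__ == "__main__":
    print(scan(SAMPLE_LOG))
```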
Unresponsive ESXi Hosts
PSOD: host crash, VMs offline, cannot open an SSH session, cannot ping any vmkernel interface. When a host encounters a PSOD, take a screenshot to capture the core dump information. The only method of recovery is to reboot. Capture log bundles! These will capture the core dump information.
Hung: may be a hostd hang. VMs can still be online and available, SSH sessions may be available, and the host responds to pings on vmkernel interfaces. You can attempt to restart services. If you cannot recover the host, capture the required cores, gracefully power down VMs (RDP/SSH), then reboot the host.
“localcli --plugin-dir /usr/lib/vmware/esxcli/int/ debug livedump perform” captures a live core of the ESXi system prior to reboot.
Virtual Machine snapshots
Still an issue! Snapshots can fill datastores, and multiple levels of snapshots can result in performance issues. Keep careful control of snapshots, don’t leave them hanging around! A snapshot chain can run to 32 snapshots… in the words of GSS, if this is being done on a live VM, it’s crazy 🙂
Snapshot alerting can be configured on thresholds such as how much storage is being consumed or the age of snapshots. Orphaned snapshots will require a consolidation.
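The alerting idea boils down to a simple threshold check. The sketch below assumes an inventory of snapshots has already been gathered by some other means. The `Snapshot` record, the function name and the default thresholds (20 GB, 3 days) are all invented for illustration and are not values recommended in the session.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    vm: str          # name of the VM that owns the snapshot
    size_gb: float   # storage consumed by the snapshot
    age_days: int    # days since the snapshot was taken


def snapshots_to_review(snapshots, max_size_gb=20.0, max_age_days=3):
    """Return snapshots that exceed either the size or the age threshold."""
    return [
        s for s in snapshots
        if s.size_gb > max_size_gb or s.age_days > max_age_days
    ]


# Hypothetical inventory.
INVENTORY = [
    Snapshot("app01", size_gb=2.5, age_days=1),
    Snapshot("db01", size_gb=64.0, age_days=1),
    Snapshot("web01", size_gb=1.0, age_days=30),
]

if __name__ == "__main__":
    for snap in snapshots_to_review(INVENTORY):
        print(f"{snap.vm}: {snap.size_gb} GB, {snap.age_days} days old")
```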
Troubleshoot your Network
How do GSS troubleshoot a network? They’re brought in when poor performance or dropped packets are encountered, or when they are advised “it’s the network”.
ESXi logs don’t show a great deal; vsish is much more useful for finding networking information. Stats can be captured via “get /net/portsets/vSwitch0/stats”, which provides dropped packet information very quickly.
To troubleshoot deeper, find out the port ID. This allows GSS to look at individual port stats and information with “get /net/portsets/DvsPortset-0/ports/***”, which gives visibility from a virtual switch right the way down to an individual VM. With the vmxnet3 driver this can go further and show stats and issues on the virtual NICs: “/net/portsets/DvsPortset-0/ports/****/vmxnet3/rxSummary”.
Has an aggressive release schedule; many cases are resolved by performing upgrades and staying current with releases. VMware expects consumers to keep up to date with releases, and GSS will often recommend upgrades to resolve issues.
With NSX 6.4, there have been issues observed for hosts using the ixgbe driver. The recommendation is to reload the driver to enable RSS.
Hardware compatibility! It is important to make sure that the hardware is certified and drivers are up to date, not just across the controllers but across drives as well. Make use of the vSAN health check within vCenter; its Ask VMware links point to KB articles explaining issues. Don’t ignore the warnings, keep environments.
Core dumps for vSAN enabled hosts: dump to disk. GSS often find that the logging partitions are too small for vSAN. To work around this, dumps can be written to a file on a separate VMFS partition. If local storage or VMFS is not available then use NetDump; this requires vCenter to be hosted elsewhere and has a maximum size of 10GB.
- All hardware certified and up to date
- More Disk Groups per host
- More Capacity Disks
- Network Considerations
- Stripe Width increased to 4 or 6 can improve performance, any higher does not
- IO profile has a major impact
- More writes = heavier load on cache disk
- Results in large evacuations
- More writes require more cache disks
- Resync/Rebalance are heavy write operations
- Use HCIBench for performance testing
vSAN failures to tolerate
FTT is not a backup! Multiple failures do occur. Placing a host into maintenance mode can be considered the 1st failure.
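The maintenance mode point is easy to see with a little arithmetic. The sketch below assumes mirrored (RAID-1) objects, where tolerating n failures needs 2n + 1 hosts. The function names are made up for illustration, and other protection schemes have different host counts.

```python
def min_hosts_for_mirroring(ftt):
    """Hosts needed to tolerate `ftt` failures with mirrored objects: 2n + 1."""
    if ftt < 0:
        raise ValueError("ftt must not be negative")
    return 2 * ftt + 1


def remaining_tolerance(ftt, hosts_unavailable):
    """Failures still tolerated once some hosts are already unavailable.

    A host in maintenance mode counts as unavailable, just like a failed one.
    """
    return max(ftt - hosts_unavailable, 0)


if __name__ == "__main__":
    print(min_hosts_for_mirroring(1))   # hosts needed for FTT=1
    print(remaining_tolerance(1, 1))    # FTT=1 with one host in maintenance
```

With FTT=1 and one host in maintenance mode, nothing is left in reserve, so a genuine failure at that point is effectively the second one.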
vSphere Web Client: the Flex/Flash client is still a point of pain, linked with poor performance, the Flash requirement and compatibility issues. It is deprecated post vSphere 6.7. The HTML5 client is approaching parity with the Flash client; the recommendation is to use the HTML5 client where possible and the Flash client by exception.
Embedded linked mode: starting with 6.7, instances of VCSA with an embedded PSC can now be linked during deployment. This provides a simpler architecture, and the PSC can be protected with VCSA HA. At the moment there is no path to move from external PSCs to embedded PSCs with 6.7.
vSphere 5.5 end of support. Beware that ESXi 5.5 cannot be connected to vSphere 6.7; start planning for an intermediate step to either ESXi 6 or 6.5.
Distributed architecture, multiple services: very difficult to troubleshoot. Use the VAMI to help show the status of registered services; this also gives certain error codes.
Messaging is a key component. If RabbitMQ is not working correctly or is unhealthy then vRA will be very unhappy, which can result in intermittent errors.
vRA provides Health URLs, links to health checks for each service. This provides simple status information highlighting if services are up or down.
Troubleshooting provisioning: search for the context ID, which can be found in the provisioning request. Having the context ID enables troubleshooting through the provisioning flow. Using Log Insight to search for the context ID string lets you observe a request across multiple components in chronological order.
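The same idea can be sketched outside Log Insight: filter entries from several components by context ID, then sort them by time. This sketch assumes the entries have already been parsed into (ISO timestamp, component, message) tuples. The component names, message text and context ID format are invented for illustration and are not real vRA log output.

```python
from datetime import datetime


def trace_request(entries, context_id):
    """Return entries mentioning context_id, ordered chronologically.

    entries: iterable of (iso_timestamp, component, message) tuples.
    """
    matching = [entry for entry in entries if context_id in entry[2]]
    return sorted(matching, key=lambda entry: datetime.fromisoformat(entry[0]))


# Hypothetical, already-parsed entries from two components.
ENTRIES = [
    ("2018-01-01T10:00:05", "component-b", "context=abc-123 machine allocated"),
    ("2018-01-01T10:00:01", "component-a", "context=abc-123 request received"),
    ("2018-01-01T10:00:03", "component-a", "context=zzz-999 unrelated request"),
]

if __name__ == "__main__":
    for timestamp, component, message in trace_request(ENTRIES, "abc-123"):
        print(timestamp, component, message)
```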
If during upgrades to 7.4 you encounter duplicate entries within the Postgres DB, check out vSphere KB article 54982.
For performance of vROps, it really comes down to cluster sizing. Calculations that have been conducted for day 0 or day 1 operations are often invalid come day 2 etc. The first place to start is to increase the memory of components, then CPU, before adding nodes. Re-balancing of the cluster can also help to improve performance.
vROps services are very sensitive to snapshots, so follow best practice and only keep snapshots for as long as required.
Given that vROps can require very large VMs, NUMA sizing should be considered. Crossing NUMA boundaries can result in poorer performance than keeping within them and following a scale out model.
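A quick way to sanity-check a proposed VM size against the host’s NUMA layout is sketched below. This is a rough sketch under stated assumptions: the function names and the example host figures are hypothetical, and the scale out helper only considers vCPU count, ignoring memory.

```python
import math


def fits_in_numa_node(vcpus, mem_gb, cores_per_node, mem_per_node_gb):
    """True if a VM of this size stays within a single NUMA node."""
    return vcpus <= cores_per_node and mem_gb <= mem_per_node_gb


def nodes_for_scale_out(total_vcpus, cores_per_node):
    """Smallest number of VMs so that each one's vCPU share fits one NUMA node."""
    return math.ceil(total_vcpus / cores_per_node)


if __name__ == "__main__":
    # Hypothetical host: 12 cores and 128 GB of memory per NUMA node.
    print(fits_in_numa_node(16, 48, 12, 128))  # 16 vCPUs crosses the boundary
    print(nodes_for_scale_out(24, 12))         # split 24 vCPUs across VMs instead
```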
Upgrading vROps to 6.7 will fail if any components have IPv6 disabled!
Before you upgrade
- Read the release notes
- Check supported upgrade path
- Check product interop matrix
- Check hardware matrix
- Check upgrade sequence
- Create a run-book!
- Upgrade in a test environment first
Monitoring the Cloud
- Centralized monitoring using vRealize Operations
- Use vRealize Log Insight!
Opening an SR
- There is a KB article to describe the process
- Provide a clear description of the issue, to ensure you get the right contact
- Select the correct products and build numbers
- Be preemptive and collect the logs
Thanks to Owen Sheehy and John O’Brien for a really informative session