7th September 2019

Troubleshooting and Recovery Planning

During a planned visit to a customer site, I had the fun of walking into an outage.  Working as a TAM, my primary goal is to make sure that VMware’s customers get the most out of VMware’s technology, so seeing that there was an issue I agreed to jump in and offer advice on how to fix matters.

Issue Diagnosis

When there is an outage, it is natural to panic a little and rush through troubleshooting steps.  In my experience this is often counterproductive: symptoms can be mistaken for the root cause, which has the net effect of extending the time to resolution.  So where would I suggest is the best place to start?  That depends in part on the symptoms, but the log files will tell you everything that you need to know.

The main symptoms of the issue I walked into were that vCenter was offline, ESXi hosts were very slow to respond to commands and, ultimately, VMs were offline.  I’d encountered these symptoms before and suspected that the outage was linked to an All Paths Down (APD) situation.  The best log files to check to validate this suspicion are /var/log/vmkernel.log and /var/log/vobd.log.  I find that in some cases the vmkernel.log can be a little daunting to interrogate, however grep makes things easier, especially as I believed I was looking for APD errors.

cat /var/log/vobd.log | grep -i "All Paths Down"

cat /var/log/vmkernel.log | grep -i "All Paths Down"

These commands will return all log entries that contain the string “All Paths Down”.

Sure enough, when I ran these commands I was able to confirm that six LUNs had entered an APD state.
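If you want to see at a glance which devices are affected rather than scrolling through raw log lines, the same grep can be taken a step further and grouped by device identifier.  A quick sketch, run here against a made-up log excerpt (the naa IDs and timestamps are illustrative, not from the actual incident):

```shell
# Sample vobd.log excerpt (device IDs and timestamps are made up).
cat > /tmp/vobd-sample.log <<'EOF'
2019-09-07T10:01:02Z: [scsiCorrelator] Device naa.600a0001 has entered the All Paths Down state.
2019-09-07T10:01:03Z: [scsiCorrelator] Device naa.600a0002 has entered the All Paths Down state.
2019-09-07T10:05:12Z: [scsiCorrelator] Device naa.600a0001 has entered the All Paths Down state.
EOF

# Group the APD entries by device so the affected LUNs can be counted,
# rather than eyeballing each raw log line.
grep -i "All Paths Down" /tmp/vobd-sample.log \
  | grep -o 'naa\.[0-9a-f]*' \
  | sort | uniq -c | sort -rn
```

Against a live host you would point the same pipeline at /var/log/vobd.log or /var/log/vmkernel.log instead of the sample file.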

The configuration for this environment was two sites in an active/active configuration.  The customer had been working through a change to power down Site A and fail all services across to Site B.  A quick look at the array confirmed that these LUNs were indeed unavailable.  They had not been failed over to the correct site, and in the process of the power-down had been stripped away from the hosts and the VMs running on them.

By stepping back, using my experience and knowing the correct logs to review, I’d been able to diagnose the root cause of the issue very quickly.

Recovery Planning

Working logically from the environment’s situation to the desired state, it was clear that before any VMs could be brought back online, the storage from Site A needed to be brought back online and made available to the ESXi hosts.  Once that had been completed, the situation would need to be reassessed and further steps agreed.

The customer worked across teams and with partners to bring network connectivity and the array on Site A back into service.  Almost immediately, the symptom of ESXi hosts being slow to respond disappeared.  A quick review of the vobd.log and vmkernel.log (which I had been tailing from a host) confirmed that the removed LUNs had been restored.


With the LUNs back and available, it was possible to assess the damage caused by the APD removal.  Listing all VMs on each host:

vim-cmd vmsvc/getallvms

This showed a situation in which each host claimed all VMs.  Messy.  In order to establish which VM belonged where, I needed to establish which ESXi host owned the locks on each virtual machine.  Thankfully for me, this was something I was also familiar with.  I knew that I needed to locate the VM directories and run a command against a locked file; this would give me the management interface of the ESXi host that holds the locks.  Fortunately, many of a VM’s files are locked during runtime, including the *.vmx, *.vswp, *.vmdk and the vmware.log.  I used the *.vmx.  The command to find the host with the lock is:

vmkfstools -D /vmfs/volumes/UUID/VMDIR/VMNAME.vmx

The output from this command will look similar to:

Hostname vmkernel: 17:00:38:46.977 cpu1:1033)Lock [type 10c00001 offset 13058048 v 20, hb offset 3499520
Hostname vmkernel: gen 532, mode 1, owner xxxxxxxx-xxxxxxxx-xxxx- xxxxxxxxxxxx mtime xxxxxxxxxx]
Hostname vmkernel: 17:00:38:46.977 cpu1:1033)Addr <4, 136, 2>, gen 19, links 1, type reg, flags 0x0, uid 0, gid 0, mode 600
Hostname vmkernel: 17:00:38:46.977 cpu1:1033)len 297795584, nb 142 tbz 0, zla 1, bs 2097152
Hostname vmkernel: 17:00:38:46.977 cpu1:1033)FS3: 132: <END supp167-w2k3-VC-a3112729.vswp>

The second line displays the MAC address after the word owner: the final segment of the owner UUID is the MAC address of the management vmkernel interface of the offending ESXi host.  In this example the values are masked, so the MAC address appears as xx:xx:xx:xx:xx:xx.
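Picking the MAC out of that output by eye gets error-prone across dozens of VMs, so the extraction can be scripted.  A sketch against a captured sample of the lock line (the owner value here is invented purely for illustration):

```shell
# A captured `vmkfstools -D` lock line; the owner value is invented
# for illustration.
cat > /tmp/lock-sample.txt <<'EOF'
Lock [type 10c00001 offset 13058048 v 20, hb offset 3499520
gen 532, mode 1, owner 45feb537-9c52e499-e0a9-001a64c335dc mtime 262486]
EOF

# The final segment of the owner UUID is the MAC address of the
# management vmkernel interface of the host holding the lock.
owner=$(grep -o 'owner [0-9a-f-]*' /tmp/lock-sample.txt | awk '{print $2}')

# Re-insert the colons: split on '-', take the last segment, then put
# a ':' after every two hex characters and trim the trailing one.
mac=$(echo "$owner" | awk -F- '{print $4}' | sed 's/../&:/g; s/:$//')
echo "$mac"
```

The resulting MAC can then be matched against each host’s management vmkernel interface (for example from the output of esxcfg-vmknic -l) to identify the lock holder.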

Now that I knew which VM objects were valid, I could remove those that were not:

vim-cmd vmsvc/unregister vmid

As I was already on the command line of the host, it was simple enough to match the VMs to the Vmid from the getallvms output and unregister the invalid objects.  With everything in a clean state, the VMs could be powered on and services validated.
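That matching step can also be scripted.  This sketch parses a made-up sample of getallvms output (the VM names and Vmids are illustrative) and prints the unregister commands rather than running them, so the list can be sanity-checked before anything is removed:

```shell
# Sample `vim-cmd vmsvc/getallvms` output (names and IDs are made up).
cat > /tmp/getallvms-sample.txt <<'EOF'
Vmid   Name     File                        Guest OS       Version
1      app-01   [ds01] app-01/app-01.vmx    windows9_64    vmx-14
7      db-02   [ds01] db-02/db-02.vmx       windows9_64    vmx-14
12     web-03   [ds02] web-03/web-03.vmx    otherLinux64   vmx-14
EOF

# VMs whose locks turned out to be held by another host, i.e. stale
# registrations on this host (illustrative list).
stale="db-02 web-03"

# Print the unregister commands instead of executing them, so the
# list can be reviewed before anything is removed.
for name in $stale; do
  vmid=$(awk -v n="$name" '$2 == n {print $1}' /tmp/getallvms-sample.txt)
  echo "vim-cmd vmsvc/unregister $vmid"
done
```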

At this point, with the ESXi hosts behaving as they should and all VMs back online, I could turn my attention to the VCSA.

VCSA recovery

Finding the valid VCSA object and working from the shell, I was able to establish that eth0 was stuck in “configuring”.  Speaking to the customer, they mentioned that vCenter HA was configured at the time of the outage.  With this information, I was able to quickly establish that vCenter HA had been caught in a split-brain situation: the witness was on a LUN that was removed during the APD state, so quorum had been lost and both nodes were attempting to assume the active role.

Resolving this is relatively straightforward, if perhaps a touch brutal.  First, I deleted both the witness and passive appliances.  With those objects deleted, I needed to strip the VCHA configuration out of the active node:

destroy-vcha -f

The above command achieves that; the -f switch forces it through.  With the VCHA configuration removed, the last thing to do was to reboot the VCSA and confirm that eth0 came back online as routable.

All Services Recovered

So happily a successful resolution for the customer and all services recovered!

Top tips:

  1. Don’t panic!
  2. Logs are your friend, use them to find the root cause.
  3. Prioritise the recovery to target the most important systems.
  4. Recovery steps change in accordance with the situation, keep reviewing logs and be adaptable.
  5. Take your time, don’t rush the recovery and miss something important.
  6. Keep communicating with your team, management and vendors to keep everyone on the same page.
  7. Document and learn from the experience.

All the best!