vSphere HA Advanced Settings to Consider
Working in the field I come across a lot of VMware product configurations, in addition to advising directly on how to meet architectural and business goals. After conversations this week I’d like to draw attention to two small but often overlooked configuration settings for vSphere HA, and one that folk might simply have forgotten about.
For those that are not aware, vSphere HA provides high availability for Virtual Machines. It does this by pooling the Virtual Machines and the hosts they reside on into a cluster. Hosts in the cluster are monitored and, in the event of a failure, the Virtual Machines on a failed host are restarted on alternate hosts. As Virtual Machines are by their very definition abstracted from hardware, we can treat those Virtual Machines as resources and fail them over between hosts as needed.
Respect Host Soft Affinity Rules
Advanced setting of ‘das.respectvmhostsoftaffinityrules’
This setting determines whether vSphere HA restarts a Virtual Machine in accordance with the applicable VM-Host group rules. If no host is available that satisfies the VM-Host group rule, or if the value of this option is set to “false”, vSphere HA restarts the Virtual Machine on any available host in the cluster. This behaviour is based upon maximising Virtual Machine uptime.
Depending on the version of vSphere you have installed, this setting might be applied differently. In vSphere 6.5, for example, the default value is “true”, so vSphere HA will prefer a host that satisfies the VM-Host group rules when restarting a Virtual Machine. Because this is the default, the value might not be visibly defined in the advanced HA options of the cluster. Therefore, if you want to change the default behaviour from 6.5, this option must be manually set to “false” in the advanced HA options for the cluster.
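To make the placement decision concrete, here is a minimal Python sketch of the behaviour described above. This is illustrative logic only, not a VMware API, and the host and rule names are invented:

```python
def choose_restart_host(available_hosts, rule_hosts, respect_soft_affinity=True):
    """Model how vSphere HA picks a restart host under
    das.respectvmhostsoftaffinityrules (illustrative sketch only)."""
    if respect_soft_affinity:
        # "true" (the 6.5 default): prefer a host that satisfies
        # the VM-Host group rule, if one is available.
        for host in available_hosts:
            if host in rule_hosts:
                return host
    # "false", or no compliant host available: restart on any
    # available host to maximise Virtual Machine uptime.
    return available_hosts[0] if available_hosts else None

# Hypothetical cluster: the rule prefers esx02, which is still up.
print(choose_restart_host(["esx01", "esx02"], {"esx02"}))          # esx02
print(choose_restart_host(["esx01"], {"esx02"}))                   # esx01
print(choose_restart_host(["esx01", "esx02"], {"esx02"}, False))   # esx01
```

The third call shows the effect of setting the option to “false”: the rule is simply ignored and the first available host wins.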
Respect VM Anti-Affinity Rules
Advanced setting of ‘das.respectvmvmantiaffinityrules’
Virtual Machine anti-affinity rules are created for myriad reasons: to separate resource-intensive workloads, or perhaps to separate clustered application components such as web front ends. These Virtual Machine anti-affinity rules are often set, they might get documented, and then they are forgotten. Often no consideration is given to the impact these rules will have on vSphere HA behaviour. By using the advanced setting of ‘das.respectvmvmantiaffinityrules’, a vSphere administrator can customise this behaviour in the cluster.
The setting determines if vSphere HA enforces VM-VM anti-affinity rules. The default value is set to “true”. This means that VM-VM anti-affinity rules are enforced during a vSphere HA event, even if vSphere DRS is not enabled. Therefore for a cluster configured with this advanced setting at default values, vSphere HA will not fail over a Virtual Machine if doing so violates a rule.
When a failover is blocked in this way, the cluster will issue an event reporting that there are insufficient resources to perform the failover. This error message might seem nonsensical unless the vSphere administrator is aware of both any VM-VM anti-affinity rules in place within the cluster and this default behaviour.
The advanced setting of ‘das.respectvmvmantiaffinityrules’ can also be set to “false”, with the result that vSphere HA will not enforce VM-VM anti-affinity rules during failover. Depending on the cluster, that might be the sweet spot for many in place of a 2am callout.
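As a rough illustration of the enforcement decision, here is a short Python sketch. It is not how HA is implemented internally, and the VM and host names are made up:

```python
def may_restart_on(vm, host, current_placements, anti_affinity_rules,
                   respect_rules=True):
    """Model whether vSphere HA may restart `vm` on `host` under
    das.respectvmvmantiaffinityrules (illustrative sketch only)."""
    if not respect_rules:
        # "false": anti-affinity rules are ignored during an HA failover.
        return True
    for rule in anti_affinity_rules:
        if vm in rule:
            # Block the restart if a sibling VM from the same
            # anti-affinity rule already runs on the target host.
            if any(current_placements.get(other) == host
                   for other in rule if other != vm):
                return False
    return True

# Hypothetical pair of web front ends kept apart by one rule.
placements = {"web01": "esx01"}
rules = [{"web01", "web02"}]
print(may_restart_on("web02", "esx01", placements, rules))         # False
print(may_restart_on("web02", "esx02", placements, rules))         # True
print(may_restart_on("web02", "esx01", placements, rules, False))  # True
```

The first call models the default behaviour: with only one surviving host, the restart is refused rather than the rule violated, which is exactly when the “insufficient resources” event appears.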
ESXi Host Isolation Address
Advanced setting of ‘das.isolationaddress[…]’
vSphere HA is one of those solutions that has just got better and better over the years. Back when it was first available (yes, I am old enough to remember Legato Automated Availability Manager), and before features such as datastore heartbeating had been introduced, vSphere HA relied upon the Service Console network to determine if a host was isolated. Which is fine, in theory.
When working with VI3 infrastructure, a major design consideration was how to protect against something that came to be known as ‘Split Brain’. This is an HA situation where one or more hosts become “orphaned” from the rest of the cluster because their primary Service Console network has failed. In the event of a 12-15 second outage on the Service Console network, no host could correctly tell whether it alone was impacted or every other host was. It could get very messy very quickly: the net result would be duplicated or orphaned Virtual Machines, and a Virtual Administrator unpicking and remapping Virtual Machines to the correct hosts.
A couple of methods were used to get around this back in the days of working with VirtualCenter 2.x: removing any single point of failure from the Service Console networks, and changing the host isolation response to leave Virtual Machines powered on. The rationale behind an isolation response of leaving Virtual Machines powered on is that this leaves the VMDK locks in place, making it impossible for another host to power on the workloads during a ‘Split Brain’ scenario. Another option that a Virtual Administrator could use was the ‘das.isolationaddress’ advanced setting.
By default the HA network uses the default gateway of the ESXi hosts as its isolation address, but with the ‘das.isolationaddress’ setting it is possible to configure up to 10 additional isolation addresses. When heartbeats are not received from any other host in the cluster, these configured IP addresses are pinged to help determine whether the host is isolated from the network (do not set this to 127.0.0.1!). Any address you specify has to be a reliable address that is pretty much always guaranteed to be available, so that the host can accurately determine if it is isolated. In the past I’ve used other components within the management estate that are seldom or never unavailable, such as the cluster IP for network time sources. When configuring multiple isolation addresses for use in a cluster, these must be specified as ‘das.isolationaddressX’, where X = 0-9. As you might expect, the more addresses you specify, the longer isolation detection takes, so consider carefully whether you want to specify more than one.
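If you script cluster configuration, a small helper along these lines can build the option names while guarding against the two pitfalls mentioned above. This is a hedged sketch: the function is my own invention, not part of any VMware SDK, and you would still apply the resulting options through your tooling of choice:

```python
def isolation_address_options(addresses):
    """Build das.isolationaddressX advanced option names (X = 0-9)
    from a list of extra isolation addresses (illustrative helper)."""
    if len(addresses) > 10:
        raise ValueError("vSphere HA supports at most 10 isolation addresses")
    if "127.0.0.1" in addresses:
        raise ValueError("the loopback address is not a usable isolation address")
    return {f"das.isolationaddress{i}": addr
            for i, addr in enumerate(addresses)}

# Hypothetical addresses: a management gateway and an NTP cluster IP.
print(isolation_address_options(["192.168.10.1", "192.168.10.50"]))
# {'das.isolationaddress0': '192.168.10.1', 'das.isolationaddress1': '192.168.10.50'}
```

Keeping the list short is deliberate: each extra entry is another address the host must probe before declaring itself isolated.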
All of the above ‘Split Brain’ protection approaches are as valid today as they were a decade or more ago. When designing or evaluating the design of vSphere HA, it is worth considering both its behaviour in suboptimal situations and whether the network team have remembered to permit ICMP responses for the default gateway of the management network…
As always thanks for reading and hopefully this is useful information for some!