
26th September 2017

Objective 7.5 – Troubleshoot HA and DRS Configurations and Fault Tolerance

Here we are at the end of Section 7, Objective 7.5 – Troubleshoot HA and DRS Configurations and Fault Tolerance. Sorry for the delay in getting these together; I was actually sitting a vSphere 6.5 ICM course for the past week.

As always, this article is linked to from the main VCP6.5-DCV Blueprint.

Happy Revision

Simon

 

Objective 7.5 – Troubleshoot HA and DRS Configurations and Fault Tolerance

Troubleshoot issues with:

DRS workload balancing

Load Imbalance on Cluster

A cluster has a load imbalance of resources.

A cluster might become unbalanced because of uneven resource demands from virtual machines and unequal capacities of hosts.

The following are possible reasons why the cluster has a load imbalance (a PowerCLI check sketch follows this list):

  • The migration threshold is too high.

A higher threshold makes the cluster a more likely candidate for load imbalance.

  • VM/VM or VM/Host DRS rules prevent virtual machines from being moved.
  • DRS is disabled for one or more virtual machines.
  • A device is mounted to one or more virtual machines preventing DRS from moving the virtual machine in order to balance the load.
  • Virtual machines are not compatible with the hosts to which DRS would move them. That is, at least one of the hosts in the cluster is incompatible for the virtual machines that would be migrated. For example, if host A’s CPU is not vMotion-compatible with host B’s CPU, then host A becomes incompatible for powered-on virtual machines running on host B.
  • It would be more detrimental for the virtual machine’s performance to move it than for it to run where it is currently located. This may occur when loads are unstable or the migration cost is high compared to the benefit gained from moving the virtual machine.
  • vMotion is not enabled or set up for the hosts in the cluster.
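
A quick PowerCLI sketch to check several of these causes in one go is below. It is only a rough illustration: the cluster name "Prod-Cluster" is a placeholder, it assumes an existing Connect-VIServer session, and property paths can vary slightly between PowerCLI versions.

    # Inspect the DRS settings that most often explain an imbalance
    $cluster = Get-Cluster -Name "Prod-Cluster"

    # DRS state and automation level
    $cluster | Select-Object Name, DrsEnabled, DrsAutomationLevel

    # Migration threshold (vmotionRate: 1 = aggressive ... 5 = conservative)
    $cluster.ExtensionData.Configuration.DrsConfig.VmotionRate

    # VM/VM and VM/Host DRS rules that may be pinning virtual machines
    Get-DrsRule -Cluster $cluster | Select-Object Name, Enabled, KeepTogether, VMIds

    # Which VMkernel adapters actually have vMotion enabled on each host
    Get-VMHost -Location $cluster | Get-VMHostNetworkAdapter -VMKernel |
        Select-Object VMHost, Name, IP, VMotionEnabled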

DRS Seldom or Never Performs vMotion Migrations

DRS seldom or never performs vMotion migrations.

DRS never performs vMotion migrations when one or more of the following issues is present on the cluster:

  • DRS is disabled on the cluster.
  • The hosts do not have shared storage.
  • The hosts in the cluster do not contain a vMotion network.
  • DRS is manual and no one has approved the migration (a sketch for reviewing and applying pending recommendations follows this list).
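
If the cluster is simply in manual mode and nobody has approved the moves, the outstanding recommendations can be reviewed and applied from PowerCLI as well as from the Web Client. A minimal sketch, with the cluster name below being a placeholder:

    # Refresh and review pending DRS recommendations, then apply them
    $recs = Get-DrsRecommendation -Cluster "Prod-Cluster" -Refresh
    $recs                                    # check the priority and reason first
    $recs | Apply-DrsRecommendation -Confirm:$false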

DRS seldom performs vMotion migrations when one or more of the following issues is present on the cluster:

  • Loads are unstable, or vMotion takes a long time, or both, so a move is not appropriate.
  • The DRS migration threshold is set too high.

DRS moves virtual machines for the following reasons:

  • Evacuation of a host that a user requested to enter maintenance or standby mode.
  • VM/Host DRS rules or VM/VM DRS rules.
  • Reservation violations.
  • Load imbalance.
  • Power management.

HA failover/redundancy, capacity and network configuration

HA/DRS cluster configuration

Best Practices for Networking

Observe the following best practices for the configuration of host NICs and network topology for vSphere HA. Best Practices include recommendations for your ESXi hosts, and for cabling, switches, routers, and firewalls.

Network Configuration and Maintenance

The following network maintenance suggestions can help you avoid the accidental detection of failed hosts and network isolation because of dropped vSphere HA heartbeats (a sketch for suspending Host Monitoring follows this list).

  • When changing the networks that your clustered ESXi hosts are on, suspend the Host Monitoring feature. Changing your network hardware or networking settings can interrupt the heartbeats that vSphere HA uses to detect host failures, which might result in unwanted attempts to fail over virtual machines.
  • When you change the networking configuration on the ESXi hosts themselves, for example, adding port groups, or removing vSwitches, suspend Host Monitoring. After you have made the networking configuration changes, you must reconfigure vSphere HA on all hosts in the cluster, which causes the network information to be reinspected. Then re-enable Host Monitoring.
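
As far as I'm aware there is no dedicated Set-Cluster switch for Host Monitoring, but it can be toggled through the vSphere API from PowerCLI. The sketch below is an illustration only (the cluster name is a placeholder): it suspends Host Monitoring before the network work and reconfigures vSphere HA on each host afterwards.

    $cluster = Get-Cluster -Name "Prod-Cluster"

    # Suspend Host Monitoring via the cluster's HA (DAS) configuration
    $spec = New-Object VMware.Vim.ClusterConfigSpecEx
    $spec.DasConfig = New-Object VMware.Vim.ClusterDasConfigInfo
    $spec.DasConfig.HostMonitoring = "disabled"
    $cluster.ExtensionData.ReconfigureComputeResource_Task($spec, $true)

    # ...make the networking changes, then reconfigure vSphere HA on every host
    Get-VMHost -Location $cluster | ForEach-Object {
        $_.ExtensionData.ReconfigureHostForDAS_Task()
    }

    # Finally repeat the spec with HostMonitoring = "enabled" to re-enable it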

Networks Used for vSphere HA Communications

To identify which network operations might disrupt the functioning of vSphere HA, you must know which management networks are being used for heartbeating and other vSphere HA communications.

  • On legacy ESX hosts in the cluster, vSphere HA communications travel over all networks that are designated as service console networks. VMkernel networks are not used by these hosts for vSphere HA communications. To contain vSphere HA traffic to a subset of the ESX console networks, use the allowedNetworks advanced option.
  • On ESXi hosts in the cluster, vSphere HA communications, by default, travel over VMkernel networks. With an ESXi host, if you want to use a network other than the one vCenter Server uses to communicate with the host for vSphere HA, you must explicitly enable the Management traffic check box.

To keep vSphere HA agent traffic on the networks you have specified, configure hosts so vmkNICs used by vSphere HA do not share subnets with vmkNICs used for other purposes. vSphere HA agents send packets using any pNIC that is associated with a given subnet when there is also at least one vmkNIC configured for vSphere HA management traffic. Therefore, to ensure network flow separation, the vmkNICs used by vSphere HA and by other features must be on different subnets.
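
To see at a glance which VMkernel adapters carry Management traffic (and therefore vSphere HA traffic on ESXi) and to confirm they sit on their own subnets, something like this PowerCLI one-liner helps. It is only a sketch and assumes an existing vCenter connection.

    # List every VMkernel adapter with its IP, mask and enabled traffic types
    Get-VMHost | Get-VMHostNetworkAdapter -VMKernel |
        Select-Object VMHost, Name, IP, SubnetMask, ManagementTrafficEnabled, VMotionEnabled, VsanTrafficEnabled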

Network Isolation Addresses

A network isolation address is an IP address that is pinged to determine whether a host is isolated from the network. This address is pinged only when a host has stopped receiving heartbeats from all other hosts in the cluster. If a host can ping its network isolation address, the host is not network isolated, and the other hosts in the cluster have either failed or are network partitioned. However, if the host cannot ping its isolation address, it is likely that the host has become isolated from the network and no failover action is taken.

By default, the network isolation address is the default gateway for the host. Only one default gateway is specified, regardless of how many management networks have been defined. Use the das.isolationaddress[…] advanced option to add isolation addresses for additional networks.
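
Isolation addresses are added as cluster advanced options. A hedged PowerCLI example (the cluster name and IP address are placeholders, and das.usedefaultisolationaddress is only needed if the default gateway should not be pinged at all):

    $cluster = Get-Cluster -Name "Prod-Cluster"

    # Add an extra isolation address for a second management network
    New-AdvancedSetting -Entity $cluster -Name "das.isolationaddress0" -Value "192.168.10.1" -Type ClusterHAAdvancedSetting -Confirm:$false

    # Optionally stop vSphere HA from also pinging the default gateway
    New-AdvancedSetting -Entity $cluster -Name "das.usedefaultisolationaddress" -Value "false" -Type ClusterHAAdvancedSetting -Confirm:$false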

Network Path Redundancy

Network path redundancy between cluster nodes is important for vSphere HA reliability. A single management network ends up being a single point of failure and can result in failovers although only the network has failed. If you have only one management network, any failure between the host and the cluster can cause an unnecessary (or false) failover activity if heartbeat datastore connectivity is not retained during the networking failure. Possible failures include NIC failures, network cable failures, network cable removal, and switch resets. Consider these possible sources of failure between hosts and try to minimize them, typically by providing network redundancy.

The first way you can implement network redundancy is at the NIC level with NIC teaming. Using a team of two NICs connected to separate physical switches improves the reliability of a management network. Because servers connected through two NICs (and through separate switches) have two independent paths for sending and receiving heartbeats, the cluster is more resilient. To configure a NIC team for the management network, configure the vNICs in the vSwitch configuration as Active or Standby. The recommended parameter settings for the vNICs are:

  • Default load balancing = route based on originating port ID
  • Failback = No

After you have added a NIC to a host in your vSphere HA cluster, you must reconfigure vSphere HA on that host.
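
Those teaming recommendations map to the port group's NIC teaming policy. For a standard vSwitch the PowerCLI below is one way to set them (the host and port group names are assumptions); remember to reconfigure vSphere HA on the host once the NIC changes are done.

    $vmhost = Get-VMHost -Name "esxi01.lab.local"

    # Route based on originating port ID, with failback disabled
    Get-VirtualPortGroup -VMHost $vmhost -Name "Management Network" |
        Get-NicTeamingPolicy |
        Set-NicTeamingPolicy -LoadBalancingPolicy LoadBalanceSrcId -FailbackEnabled:$false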

In most implementations, NIC teaming provides sufficient heartbeat redundancy, but as an alternative you can create a second management network connection attached to a separate virtual switch. Redundant management networking allows the reliable detection of failures and prevents isolation or partition conditions from occurring, because heartbeats can be sent over multiple networks. The original management network connection is used for network and management purposes. When the second management network connection is created, vSphere HA sends heartbeats over both management network connections. If one path fails, vSphere HA still sends and receives heartbeats over the other path.

Using IPv6 Network Configurations

Only one IPv6 address can be assigned to a given network interface used by your vSphere HA cluster. Assigning multiple IP addresses increases the number of heartbeat messages sent by the cluster’s master host with no corresponding benefit.

Best Practices for Interoperability

Observe the following best practices for allowing interoperability between vSphere HA and other features.

vSphere HA and Storage vMotion Interoperability in a Mixed Cluster

In clusters where ESXi 5.x hosts and ESX/ESXi 4.1 or earlier hosts are present and where Storage vMotion is used extensively or Storage DRS is enabled, do not deploy vSphere HA. vSphere HA might respond to a host failure by restarting a virtual machine on a host with an ESXi version different from the one on which the virtual machine was running before the failure. A problem can occur if, at the time of failure, the virtual machine was involved in a Storage vMotion action on an ESXi 5.x host, and vSphere HA restarts the virtual machine on a host with a version earlier than ESXi 5.0. While the virtual machine might power on, any subsequent attempts at snapshot operations might corrupt the vdisk state and leave the virtual machine unusable.

Using Auto Deploy with vSphere HA

You can use vSphere HA and Auto Deploy together to improve the availability of your virtual machines. Auto Deploy provisions hosts when they power on, and you can also configure it to install the vSphere HA agent on hosts during the boot process. See the Auto Deploy documentation included in vSphere Installation and Setup for details.

Upgrading Hosts in a Cluster Using vSAN

If you are upgrading the ESXi hosts in your vSphere HA cluster to version 5.5 or later, and you also plan to use vSAN, follow this process (a PowerCLI sketch of the last three steps follows the list).

  1. Upgrade all of the hosts.
  2. Disable vSphere HA.
  3. Enable vSAN.
  4. Re-enable vSphere HA.
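
In PowerCLI terms the last three steps look roughly like the following; the cluster name is a placeholder and the -VsanEnabled parameter needs PowerCLI 5.5 or later.

    $cluster = Get-Cluster -Name "Prod-Cluster"

    Set-Cluster -Cluster $cluster -HAEnabled:$false -Confirm:$false    # 2. Disable vSphere HA
    Set-Cluster -Cluster $cluster -VsanEnabled:$true -Confirm:$false   # 3. Enable vSAN
    Set-Cluster -Cluster $cluster -HAEnabled:$true -Confirm:$false     # 4. Re-enable vSphere HA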

Best Practices for Cluster Monitoring

Observe the following best practices for monitoring the status and validity of your vSphere HA cluster.

Setting Alarms to Monitor Cluster Changes

When vSphere HA or Fault Tolerance takes action to maintain availability, for example, a virtual machine failover, you can be notified about such changes. Configure alarms in vCenter Server to be triggered when these actions occur, and have alerts, such as emails, sent to a specified set of administrators.

Several default vSphere HA alarms are available (a sketch for attaching an email action to them follows this list).

  • Insufficient failover resources (a cluster alarm)
  • Cannot find master (a cluster alarm)
  • Failover in progress (a cluster alarm)
  • Host HA status (a host alarm)
  • VM monitoring error (a virtual machine alarm)
  • VM monitoring action (a virtual machine alarm)
  • Failover failed (a virtual machine alarm)
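
One way to wire email notifications onto some of these defaults from PowerCLI is sketched below. The wildcard and address are placeholders, and vCenter Server must already have its mail (SMTP) settings configured for the emails to go anywhere.

    # Attach an email action to the failover-related default alarms
    Get-AlarmDefinition -Name "*failover*" | ForEach-Object {
        New-AlarmAction -AlarmDefinition $_ -Email -To "vmware-admins@example.com" -Subject "vSphere HA alarm: $($_.Name)"
    }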

Creating a DRS Cluster

A cluster is a collection of ESXi hosts and associated virtual machines with shared resources and a shared management interface. Before you can obtain the benefits of cluster-level resource management you must create a cluster and enable DRS.

Admission Control and Initial Placement

When you attempt to power on a single virtual machine or a group of virtual machines in a DRS-enabled cluster, vCenter Server performs admission control. It checks that there are enough resources in the cluster to support the virtual machine(s).

Virtual Machine Migration

Although DRS performs initial placements so that load is balanced across the cluster, changes in virtual machine load and resource availability can cause the cluster to become unbalanced. To correct such imbalances, DRS generates migration recommendations.

DRS Cluster Requirements

Hosts that are added to a DRS cluster must meet certain requirements to use cluster features successfully.

Configuring DRS with Virtual Flash

DRS can manage virtual machines that have virtual flash reservations.

Create a Cluster

A cluster is a group of hosts. When a host is added to a cluster, the host’s resources become part of the cluster’s resources. The cluster manages the resources of all hosts within it. Clusters enable the vSphere High Availability (HA) and vSphere Distributed Resource Scheduler (DRS) solutions. You can also enable vSAN on a cluster.
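
For reference, creating such a cluster from PowerCLI could look like the sketch below; the datacenter, cluster, host name and credentials are all placeholders.

    # Create a DRS/HA cluster in an existing datacenter and add a host to it
    $dc = Get-Datacenter -Name "Lab-DC"
    $cluster = New-Cluster -Name "Prod-Cluster" -Location $dc -DrsEnabled -DrsAutomationLevel FullyAutomated -HAEnabled

    # The host's resources become part of the cluster's resources once added
    Add-VMHost -Name "esxi01.lab.local" -Location $cluster -User "root" -Password "VMware1!" -Force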

Edit Cluster Settings

When you add a host to a DRS cluster, the host’s resources become part of the cluster’s resources. In addition to this aggregation of resources, with a DRS cluster you can support cluster-wide resource pools and enforce cluster-level resource allocation policies.

Set a Custom Automation Level for a Virtual Machine

After you create a DRS cluster, you can customize the automation level for individual virtual machines to override the cluster’s default automation level.
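
A per-VM override can be set with Set-VM, for example (the VM name is hypothetical):

    # Pin a latency-sensitive VM to Manual so DRS asks before moving it
    Get-VM -Name "sql01" | Set-VM -DrsAutomationLevel Manual -Confirm:$false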

Disable DRS

You can turn off DRS for a cluster.

Restore a Resource Pool Tree

You can restore a previously saved resource pool tree snapshot.

vMotion/Storage vMotion configuration and/or migration

vMotion

The following are the actions that take place during a vMotion; understanding these steps will help in any troubleshooting efforts.

vMotion request is sent to the vCenter Server

During this stage, a call is sent to vCenter Server requesting the live migration of a virtual machine to another host. This call may be issued through the VMware vSphere Web Client, VMware vSphere Client or through an API call.

vCenter Server sends the vMotion request to the destination ESXi host

During this stage, a request is sent to the destination ESXi host by vCenter Server to notify the host of an incoming vMotion. This step also validates whether the host can receive a vMotion. If a vMotion is allowed on the host, the host replies to the request allowing the vMotion to continue. If the host is not configured for vMotion, the host replies to the request disallowing the vMotion, resulting in a vMotion failure.

Common issues (a PowerCLI check sketch follows this list):

  • Ensure that vMotion is enabled on all ESX/ESXi hosts.
  • Determine if resetting the Migrate.Enabled setting on both the source and destination ESX or ESXi hosts addresses the vMotion failure.
  • Verify that VMkernel network connectivity exists using vmkping.
  • Verify that VMkernel networking configuration is valid.
  • Verify that Name Resolution is valid on the host.
  • Verify if the ESXi/ESX host can be reconnected or if reconnecting the ESX/ESXi host resolves the issue.
  • Verify that you do not have two or more virtual machine swap files in the virtual machine directory.
  • If you are migrating the virtual machine to or from a host running a VMware ESXi version earlier than 5.5 Update 2, check for known issues with that combination.
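
A few of those checks can be scripted. A rough PowerCLI starting point is below (the host names are placeholders, and vmkping itself still has to be run from the ESXi shell):

    # Confirm a vMotion-enabled VMkernel adapter exists on source and destination
    Get-VMHost -Name "esxi01.lab.local", "esxi02.lab.local" |
        Get-VMHostNetworkAdapter -VMKernel |
        Select-Object VMHost, Name, IP, SubnetMask, VMotionEnabled

    # Enable vMotion on a specific VMkernel adapter if it is missing
    Get-VMHost -Name "esxi02.lab.local" |
        Get-VMHostNetworkAdapter -VMKernel -Name "vmk1" |
        Set-VMHostNetworkAdapter -VMotionEnabled:$true -Confirm:$false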

vCenter Server computes the specifications of the virtual machine to migrate

During this stage, details of the virtual machine are queried to notify the source and destination hosts of the vMotion task details. This may include any Fault Tolerance settings, disk size, vMotion IP address streams, and the source and destination virtual machine configuration file locations.

Common issues during this stage are:

  • Verify that the virtual machine is not configured to use a device that is not valid on the target host.

vCenter Server sends the vMotion request to the source ESXi host to prepare the virtual machine for migration

During this stage, a request is made to the source ESXi host by vCenter Server to notify the host of the pending vMotion. This step validates whether the host can send a vMotion. If a vMotion is allowed on the host, the host replies to the request allowing the vMotion to continue. If the host is not configured for vMotion, the host replies to the request disallowing the vMotion, resulting in a vMotion failure.

Once the vMotion task has been validated, the configuration file for the virtual machine is placed into read-only mode and closed with a 90 second protection timer. This prevents changes to the virtual machine while the vMotion task is in progress.

Common issues during this stage are:

  • Ensure that vMotion is enabled on all ESX/ESXi hosts.
  • Determine if resetting the Migrate.Enabled setting on both the source and destination ESX or ESXi hosts addresses the vMotion failure.
  • Verify that VMkernel network connectivity exists using vmkping.
  • Verify that VMkernel networking configuration is valid.
  • Verify that Name Resolution is valid on the host.
  • Verify if the ESXi/ESX host can be reconnected or if reconnecting the ESX/ESXi host resolves the issue.

vCenter Server initiates the destination host virtual machine

During this stage, the destination host creates, registers and powers on a new virtual machine. The virtual machine is powered on to a state that allows the virtual machine to consume resources and prepares it to receive the virtual machine state from the source host. During this time a world ID is generated that is sent to the source host as the target virtual machine for the vMotion.

Common issues:

  • Verify that the required disk space is available.
  • Verify that time is synchronized across the environment.
  • Verify that hostd is not spiking the console.
  • Verify that valid limits are set for the virtual machine being vMotioned.
  • Verify if the ESXi/ESX host can be reconnected or if reconnecting the ESX/ESXi host resolves the issue.

vCenter Server initiates the source host virtual machine

During this stage, the source host begins to migrate the memory and running state of the source virtual machine to the destination virtual machine. This information is transferred using VMkernel ports configured for vMotion. Additional resources are allocated for the destination virtual machine and additional helper worlds are created. The memory of the source virtual machine is transferred using checkpoints.

After the memory and virtual machine state transfer is completed, a stun of the source virtual machine occurs to copy any remaining changes that occurred during the last checkpoint copy. Once this is complete, the destination virtual machine resumes as the primary machine for the virtual machine that is being migrated.

Common issues (an MTU verification sketch follows this list):

  • If Jumbo Frames are enabled (MTU of 9000), ensure that vmkping uses a payload of 8972 bytes (9000 bytes minus the 8-byte ICMP header and the 20-byte IP header):

    vmkping -d -s 8972 destinationIPaddress

    You may experience problems with the trunk between two physical switches that have been misconfigured to an MTU of 1500.

  • Verify that valid limits are set for the virtual machine being vMotioned.
  • Verify the virtual hardware is not out of date.
  • This issue may be caused by SAN configuration. Specifically, this issue may occur if zoning is set up differently on different servers in the same cluster.
  • Verify and ensure that the log.rotateSize parameter in the virtual machine’s configuration file is not set to a very low value.
  • If you are migrating a 64-bit virtual machine, verify that the VT option is enabled on both the source and destination host.
  • Verify that there are no issues with the shared storage or networking.
  • If you are using NFS storage, verify whether the VMDK files of the virtual machine being migrated reside on an NFS datastore, and ensure that the datastore is not mounted differently on the source and destination hosts.
  • If you are using VMware vShield Endpoint, verify the vShield Endpoint LKM is installed on the ESX/ESXi hosts to which you are trying to vMotion the virtual machine.
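
The MTU values referenced in the Jumbo Frames point above can be checked from PowerCLI before resorting to vmkping. A sketch only; jumbo frames have to match on the vmknic, the vSwitch and the physical switch ports.

    # Report the MTU configured on each VMkernel adapter and each standard vSwitch
    Get-VMHost | Get-VMHostNetworkAdapter -VMKernel | Select-Object VMHost, Name, IP, Mtu
    Get-VMHost | Get-VirtualSwitch -Standard | Select-Object VMHost, Name, Mtu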

vCenter Server switches the virtual machine’s ESXi host from the source to destination

During this stage, the virtual machine is running on the destination ESXi host. vCenter Server updates to reflect this change by changing the virtual machine’s host and resource pool to the destination host. The source virtual machine is powered down and unregistered from the source host. Any resources that were in use are released back to the host.

vCenter Server completes the vMotion task

During this stage, the vMotion task is marked as complete.

Common issues during this stage are:

  • Verify that Console OS network connectivity exists.

Troubleshooting Storage DRS

The Storage DRS troubleshooting topics provide solutions to potential problems that you might encounter when using Storage DRS-enabled datastores in a datastore cluster.

Storage DRS is Disabled on a Virtual Disk

Even when Storage DRS is enabled for a datastore cluster, it might be disabled on some virtual disks in the datastore cluster.

Datastore Cannot Enter Maintenance Mode

You place a datastore in maintenance mode when you must take it out of usage to service it. A datastore enters or leaves maintenance mode only as a result of a user request.
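
For a datastore that is a member of a Storage DRS datastore cluster, maintenance mode can also be requested from PowerCLI. The datastore name below is a placeholder, and the parameter only applies to datastores in a datastore cluster.

    # Request Storage DRS maintenance mode for a datastore in a datastore cluster
    Get-Datastore -Name "NFS-DS-01" | Set-Datastore -MaintenanceMode:$true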

Storage DRS Cannot Operate on a Datastore

Storage DRS generates an alarm to indicate that it cannot operate on the datastore.

Moving Multiple Virtual Machines into a Datastore Cluster Fails

Migrating more than one virtual machine into a datastore cluster fails with an error message after the first virtual machine has successfully moved into the datastore cluster.

Storage DRS Generates Fault During Virtual Machine Creation

When you create or clone a virtual machine on a datastore cluster, Storage DRS might generate a fault.

Storage DRS is Enabled on a Virtual Machine Deployed from an OVF Template

Storage DRS is enabled on a virtual machine that was deployed from an OVF template that has Storage DRS disabled. This can occur when you deploy an OVF template on a datastore cluster.

Storage DRS Rule Violation Fault Is Displayed Multiple Times

When you attempt to put a datastore into maintenance mode, the same affinity or anti-affinity rule violation fault might appear to be listed more than once in the Faults dialog box.

Storage DRS Rules Not Deleted from Datastore Cluster

Affinity or anti-affinity rules that apply to a virtual machine are not deleted when you remove the virtual machine from a datastore cluster.

Alternative Storage DRS Placement Recommendations Are Not Generated

When you create, clone, or relocate a virtual machine, Storage DRS generates only one placement recommendation.

Applying Storage DRS Recommendations Fails

Storage DRS generates space or I/O load balancing recommendations, but attempts to apply the recommendations fail.

Fault Tolerance configuration and failover issues

Troubleshooting Fault Tolerant Virtual Machines

To maintain a high level of performance and stability for your fault tolerant virtual machines and also to minimize failover rates, you should be aware of certain troubleshooting issues.

The troubleshooting topics discussed focus on problems that you might encounter when using the vSphere Fault Tolerance feature on your virtual machines. The topics also describe how to resolve problems.

Hardware Virtualization Not Enabled

You must enable Hardware Virtualization (HV) before you use vSphere Fault Tolerance.
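
A quick way to see whether a host is reporting FT (and by extension HV) support is to read its capability object from PowerCLI, for example (the host name is hypothetical, and the exact properties exposed vary by vSphere version):

    # Does the host advertise Fault Tolerance support? Requires HV enabled in the BIOS/firmware.
    (Get-VMHost -Name "esxi01.lab.local").ExtensionData.Capability |
        Select-Object FtSupported, SmpFtSupported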

Compatible Hosts Not Available for Secondary VM

If you power on a virtual machine with Fault Tolerance enabled and no compatible hosts are available for its Secondary VM, you might receive an error message.

Secondary VM on Overcommitted Host Degrades Performance of Primary VM

If a Primary VM appears to be executing slowly, even though its host is lightly loaded and retains idle CPU time, check the host where the Secondary VM is running to see if it is heavily loaded.

Increased Network Latency Observed in FT Virtual Machines

If your FT network is not optimally configured, you might experience latency problems with the FT VMs.

Some Hosts Are Overloaded with FT Virtual Machines

You might encounter performance problems if your cluster’s hosts have an imbalanced distribution of FT VMs.

Losing Access to FT Metadata Datastore

Access to the Fault Tolerance metadata datastore is essential for the proper functioning of an FT VM. Loss of this access can cause a variety of problems.

Turning On vSphere FT for Powered-On VM Fails

If you try to turn on vSphere Fault Tolerance for a powered-on VM, this operation can fail.

FT Virtual Machines not Placed or Evacuated by vSphere DRS

FT virtual machines in a cluster that is enabled with vSphere DRS do not function correctly if Enhanced vMotion Compatibility (EVC) is currently disabled.

Fault Tolerant Virtual Machine Failovers

A Primary or Secondary VM can fail over even though its ESXi host has not crashed. In such cases, virtual machine execution is not interrupted, but redundancy is temporarily lost. To avoid this type of failover, be aware of some of the situations when it can occur and take steps to avoid them.

Explain the DRS Resource Distribution Graph and Target/Current Host Load Deviation

The DRS resource distribution graph displays the CPU or memory metrics for each of the hosts in a cluster. The DRS cluster can be considered load balanced when each host’s level of consumed resources is equivalent to that of all the other hosts in the cluster. Having said that, it isn’t too big a leap of faith to accept that when they are not, the cluster is considered imbalanced.

Imbalance is part of life for a DRS cluster; the main objective of DRS is not to balance the load perfectly across every host. Rather, DRS monitors resource demand and works to ensure that every VM is getting the resources to which it is entitled. When DRS determines that a better host exists for a VM, it makes a recommendation to move that VM.
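
To put a rough number on the current imbalance, the per-host CPU utilisation can be pulled with PowerCLI and its standard deviation computed by hand, which is essentially what the current host load standard deviation figure on the DRS summary expresses. A sketch only; the cluster name is a placeholder and this ignores memory and VM entitlements.

    # Gather each host's CPU utilisation ratio and compute the standard deviation
    $loads = Get-Cluster -Name "Prod-Cluster" | Get-VMHost |
        ForEach-Object { $_.CpuUsageMhz / $_.CpuTotalMhz }

    $mean = ($loads | Measure-Object -Average).Average
    $stddev = [math]::Sqrt(($loads | ForEach-Object { [math]::Pow($_ - $mean, 2) } | Measure-Object -Average).Average)

    "Current host CPU load deviation: {0:N3}" -f $stddev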

Explain vMotion Resource Maps

vMotion resource maps provide a visual representation of hosts, datastores, and networks associated with the selected virtual machine.

vMotion resource maps also indicate which hosts in the virtual machine’s cluster or datacenter are compatible with the virtual machine and are potential migration targets. For a host to be compatible, it must meet the following criteria.

  • Connect to all the same datastores as the virtual machine.
  • Connect to all the same networks as the virtual machine.
  • Have compatible software with the virtual machine.
  • Have a compatible CPU with the virtual machine.