We continue the Section 7 objectives with Objective 7.2 – Troubleshoot vSphere Storage and Networking.
As always this article is linked to from the main VCP6.5-DCV Blueprint.
Objective 7.2 – Troubleshoot vSphere Storage and Networking
Identify and isolate network and storage resource contention and latency issues
Resolving SAN Performance Problems
A number of factors can negatively affect storage performance in the ESXi SAN environment. Among these factors are excessive SCSI reservations, path thrashing, and inadequate LUN queue depth.
To monitor storage performance in real time, use the resxtop and esxtop command-line utilities. For more information, see the vSphere Monitoring and Performance documentation.
Excessive SCSI Reservations Cause Slow Host Performance
When storage devices do not support hardware acceleration, ESXi hosts use the SCSI reservations mechanism when performing operations that require a file lock or a metadata lock in VMFS. SCSI reservations lock the entire LUN. Excessive SCSI reservations by a host can cause performance degradation on other servers accessing the same VMFS.
Excessive SCSI reservations cause performance degradation and SCSI reservation conflicts.
Several operations require VMFS to use SCSI reservations.
- Creating, resignaturing, or expanding a VMFS datastore
- Powering on a virtual machine
- Creating or deleting a file
- Creating a template
- Deploying a virtual machine from a template
- Creating a new virtual machine
- Migrating a virtual machine with vMotion
- Growing a file, such as a thin provisioned virtual disk
To eliminate potential sources of SCSI reservation conflicts, follow these guidelines:
- Serialize the operations on the shared LUNs. If possible, limit the number of operations on different hosts that require a SCSI reservation at the same time.
- Increase the number of LUNs and limit the number of hosts accessing the same LUN.
- Reduce the number of snapshots. Snapshots cause numerous SCSI reservations.
- Reduce the number of virtual machines per LUN. Follow recommendations in Configuration Maximums.
- Make sure that you have the latest HBA firmware across all hosts.
- Make sure that the host has the latest BIOS.
- Ensure a correct Host Mode setting on the SAN array.
Path Thrashing Causes Slow LUN Access
If your ESXi host is unable to access a LUN, or access is very slow, you might have a problem with path thrashing, also called LUN thrashing.
The problem might be caused by path thrashing, which can occur when two hosts access the same LUN through different storage processors (SPs) and, as a result, the LUN is never available.
Path thrashing typically occurs on active-passive arrays. Path thrashing can also occur on a directly connected array with HBA failover on one or more nodes. Active-active arrays or arrays that provide transparent failover do not cause path thrashing.
- Ensure that all hosts that share the same set of LUNs on the active-passive arrays use the same storage processor.
- Correct any cabling or masking inconsistencies between different hosts and SAN targets so that all HBAs see the same targets.
- Ensure that the claim rules defined on all hosts that share the LUNs are exactly the same.
- Configure the path to use the Most Recently Used PSP, which is the default.
Increased Latency for I/O Requests Slows Virtual Machine Performance
If the ESXi host generates more commands to a LUN than the LUN queue depth permits, the excess commands are queued in VMkernel. This increases the latency, or the time taken to complete I/O requests.
The host takes longer to complete I/O requests and virtual machines display unsatisfactory performance.
The problem might be caused by an inadequate LUN queue depth. SCSI device drivers have a configurable parameter called the LUN queue depth that determines how many commands to a given LUN can be active at one time. If the host generates more commands to a LUN, the excess commands are queued in the VMkernel.
- If the sum of active commands from all virtual machines consistently exceeds the LUN queue depth, increase the queue depth.
The procedure that you use to increase the queue depth depends on the type of storage adapter the host uses.
- When multiple virtual machines are active on a LUN, change the Disk.SchedNumReqOutstanding (DSNRO) parameter, so that it matches the queue depth value.
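The "consistently exceeds" check above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a VMware tool: the function name `queue_depth_exceeded_ratio`, the VM names, and the sample numbers are all hypothetical, and in practice you would gather the per-VM active command counts yourself, for example from esxtop.

```python
from typing import Dict, List


def queue_depth_exceeded_ratio(samples: List[Dict[str, int]], lun_queue_depth: int) -> float:
    """Return the fraction of samples in which the summed active commands
    from all virtual machines exceed the LUN queue depth."""
    if not samples:
        return 0.0
    exceeded = sum(1 for sample in samples if sum(sample.values()) > lun_queue_depth)
    return exceeded / len(samples)


# Hypothetical samples: active commands per VM at four points in time.
samples = [
    {"vm-a": 20, "vm-b": 18},  # 38 > 32
    {"vm-a": 10, "vm-b": 12},  # 22 <= 32
    {"vm-a": 25, "vm-b": 15},  # 40 > 32
    {"vm-a": 30, "vm-b": 9},   # 39 > 32
]
ratio = queue_depth_exceeded_ratio(samples, lun_queue_depth=32)
print(f"Queue depth exceeded in {ratio:.0%} of samples")
```

A high ratio over a long sampling window suggests the queue depth is inadequate. What counts as "consistently" is a judgment call for your environment.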
Resolving Network Latency Issues
Networking latency issues could potentially be caused by the following infrastructure elements:
- External Switch
- NIC configuration and bandwidth (1G/10G/40G)
- CPU contention
CPU and NIC resource utilisation is viewable from within vCenter or an ESXi host. To rule out resource issues on the external switch, use the switch's own management utilities.
Verify network and storage configuration
To verify your network links, perform one or more of the following options:
- Check the network link status within vSphere/VMware Infrastructure (VI) Client:
Highlight the ESX/ESXi host and click the Configuration tab.
Click the Networking hyperlink.
The physical network adapters (vmnics) currently assigned to virtual switches are shown in the diagrams in the client. If an adapter has a red X over it, its link is currently down.
Note: This can be caused by a misconfiguration of the EtherChannel on the physical switch.
- Check the status from the ESX/ESXi host:
Run this command:
[root@server root]# esxcfg-nics -l
The output appears similar to:
Name    PCI       Driver  Link  Speed     Duplex  Description
vmnic0  04:04.00  tg3     Up    1000Mbps  Full    Broadcom BCM5780 Gigabit Ethernet
vmnic1  04:04.01  tg3     Up    1000Mbps  Full    Broadcom BCM5780 Gigabit Ethernet
- The Link column specifies the status of the link between the network adapter and the physical switch and can be either Up or Down.
- If there are several network adapters and some links are up and some links are down, you must verify that the adapters are connected to the intended physical switch ports.
- To do this, bring down each port of the ESX/ESXi host on the physical switch in turn, and then run the esxcfg-nics -l command to see which vmnic is affected.
- Utilize Cisco Discovery Protocol (CDP) to discover switch ports corresponding to vmnic connections.
- Check the status at the physical server’s network adapters: Observe the LED lights on the physical network adapter. Refer to your network adapter or server’s documentation for the meaning of the lights. If no lights are illuminated, this typically indicates a link down state or that the integrated network adapter is disabled in the BIOS settings (if applicable).
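When a host has many adapters, scanning the Link column by eye gets tedious. The sketch below parses saved esxcfg-nics -l output and lists the adapters whose link is down. It assumes the column layout shown above. The function name `find_down_links` is hypothetical, and the sample text is made up for illustration, with vmnic1 altered to show a down link.

```python
from typing import List

# Hypothetical sample based on the layout above; vmnic1 is altered to be down.
SAMPLE_OUTPUT = """\
Name    PCI      Driver Link Speed    Duplex Description
vmnic0  04:04.00 tg3    Up   1000Mbps Full   Broadcom BCM5780 Gigabit Ethernet
vmnic1  04:04.01 tg3    Down 0Mbps    Half   Broadcom BCM5780 Gigabit Ethernet
"""


def find_down_links(output: str) -> List[str]:
    """Return the names of adapters whose Link column reads Down."""
    down = []
    for line in output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 6)  # the Description column may contain spaces
        if len(fields) >= 4 and fields[3].lower() == "down":
            down.append(fields[0])
    return down


print(find_down_links(SAMPLE_OUTPUT))
```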
Verify the Connection Status of a Storage Device
Use the esxcli command to verify the connection status of a particular storage device.
Install vCLI or deploy the vSphere Management Assistant (vMA) virtual machine. See Getting Started with vSphere Command-Line Interfaces. For troubleshooting, run esxcli commands in the ESXi Shell.
- Run the command:
esxcli --server=server_name storage core device list -d=device_ID
- Review the connection status in the Status: area.
on – Device is connected.
dead – Device has entered the APD state. The APD timer starts.
dead timeout – The APD timeout has expired.
not connected – Device is in the PDL state.
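The four status values above map neatly onto a lookup table. The following sketch finds the Status: line in a saved device listing and prints the matching explanation. The helper name `describe_status` and the two-line sample listing are assumptions for illustration; real output contains many more fields.

```python
STATUS_MEANINGS = {
    "on": "Device is connected.",
    "dead": "Device has entered the APD state; the APD timer starts.",
    "dead timeout": "The APD timeout has expired.",
    "not connected": "Device is in the PDL state.",
}


def describe_status(device_listing: str) -> str:
    """Find the 'Status:' line in a device listing and explain its value."""
    for line in device_listing.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key.strip() == "Status":
            status = value.strip().lower()
            return STATUS_MEANINGS.get(status, f"Unrecognized status: {status}")
    return "No Status field found."


# Hypothetical excerpt of a device listing.
listing = """\
naa.00000000000000000000000000000703
   Status: dead timeout
"""
print(describe_status(listing))
```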
Verify Drivers, Firmware, Storage I/O Controllers Against the VMware Compatibility Guide
Use the vSAN Health Service to verify whether your hardware components, drivers, and firmware are compatible with vSAN.
Using hardware components, drivers, and firmware that are not compatible with vSAN might cause problems in the operation of the vSAN cluster and the virtual machines running on it.
The hardware compatibility health checks verify your hardware against the VMware Compatibility Guide.
Verify that a given virtual machine is configured with the correct network resources
Troubleshooting virtual machine network connection issues
- Virtual machines fail to connect to the network
- There is no network connectivity to or from a single virtual machine
- You cannot connect to the Internet
- A TCP/IP connection fails to and from a single virtual machine
- You may see one or more of the following errors:
Destination Host Unreachable
Network error: Connection Refused
Network cable is unplugged
Ping request could not find host (IP address/hostname). Please check the name and try again.
Unable to resolve target system name (IP address/hostname).
Validate that each troubleshooting step below is true for your environment. The steps provide instructions or a link to a document, for validating the step and taking corrective action as necessary. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. Do not skip a step.
- Ensure that the Port Group name(s) associated with the virtual machine’s network adapter(s) exists in your vSwitch or Virtual Distributed Switch and is spelled correctly. If not, correct it using Edit Settings on the virtual machine and ensure that the Connected checkbox is selected.
- Ensure that the virtual machine has no underlying issues with storage or it is not in resource contention, as this might result in networking issues with the virtual machine. You can do this by logging into ESX/ESXi or Virtual Center/vCenter Server using the VI/vSphere Client and logging into the virtual machine console.
- Verify that the virtual network adapter is present and connected.
- Verify that the networking within the virtual machine’s guest operating system is correct.
- Verify that the TCP/IP stack is functioning correctly.
- If this virtual machine was converted from a physical system, verify that there are no hidden network adapters present.
- Verify that the vSwitch has enough ports for the virtual machine.
- Verify that the virtual machine’s IPSec configuration is configured correctly and that it is not corrupted.
- Verify that the virtual machine is configured with two vNICs to eliminate a NIC or a physical configuration issue. To isolate a possible issue:
- If the load balancing policy is set to Default Virtual Port ID at the vSwitch or vDS level: Leave one vNIC connected with one uplink on the vSwitch or vDS, then try different vNIC and pNIC combinations until you determine which virtual machine is losing connectivity.
- If the load balancing policy is set to IP Hash:
- Ensure the physical switch ports are configured as a port channel.
- Shut down all but one of the physical ports the NICs are connected to, and toggle this between all the ports by keeping only one port connected at a time. Take note of the port/NIC combination where the virtual machines lose network connectivity.
You can also check esxtop output using the n option (for networking) to see which pNIC the virtual machine is using. Try shutting down the ports on the physical switch one at a time to determine where the virtual machine is losing network connectivity. This also rules out any misconfiguration on the physical switch port(s).
Monitor/Troubleshoot Storage Distributed Resource Scheduler (SDRS) issues
Storage DRS is Disabled on a Virtual Disk
Even when Storage DRS is enabled for a datastore cluster, it might be disabled on some virtual disks in the datastore cluster.
You have enabled Storage DRS for a datastore cluster, but Storage DRS is disabled on one or more virtual machine disks in the datastore cluster.
The following scenarios can cause Storage DRS to be disabled on a virtual disk.
- A virtual machine’s swap file is host-local (the swap file is stored in a specified datastore that is on the host). The swap file cannot be relocated and Storage DRS is disabled for the swap file disk.
- A certain location is specified for a virtual machine’s .vmx swap file. The swap file cannot be relocated and Storage DRS is disabled on the .vmx swap file disk.
- The relocate or Storage vMotion operation is currently disabled for the virtual machine in vCenter Server (for example, because other vCenter Server operations are in progress on the virtual machine). Storage DRS is disabled until the relocate or Storage vMotion operation is re-enabled in vCenter Server.
- The home disk of a virtual machine is protected by vSphere HA and relocating it will cause loss of vSphere HA protection.
- The disk is a CD-ROM/ISO file.
- If the disk is an independent disk, Storage DRS is disabled, except in the case of relocation or clone placement.
- If the virtual machine has system files on a separate datastore from the home datastore (legacy), Storage DRS is disabled on the home disk. If you use Storage vMotion to manually migrate the home disk, the system files on different datastores will all be located on the target datastore and Storage DRS will be enabled on the home disk.
- If the virtual machine has a disk whose base/redo files are spread across separate datastores (legacy), Storage DRS for the disk is disabled. If you use Storage vMotion to manually migrate the disk, the files on different datastores will all be located on the target datastore and Storage DRS will be enabled on the disk.
- The virtual machine has hidden disks (such as disks in previous snapshots, not in the current snapshot). This situation causes Storage DRS to be disabled on the virtual machine.
- The virtual machine is a template.
- The virtual machine is vSphere Fault Tolerance-enabled.
- The virtual machine is sharing files between its disks.
- The virtual machine is being Storage DRS-placed with manually specified datastores.
Storage DRS Cannot Operate on a Datastore
Storage DRS generates an alarm to indicate that it cannot operate on the datastore.
Storage DRS generates an event and an alarm and Storage DRS cannot operate.
The following scenarios can cause vCenter Server to disable Storage DRS for a datastore.
- The datastore is shared across multiple data centers.
Storage DRS is not supported on datastores that are shared across multiple data centers. This configuration can occur when a host in one data center mounts a datastore in another data center, or when a host using the datastore is moved to a different data center. When a datastore is shared across multiple data centers, Storage DRS I/O load balancing is disabled for the entire datastore cluster. However, Storage DRS space balancing remains active for all datastores in the datastore cluster that are not shared across data centers.
- The datastore is connected to an unsupported host.
Storage DRS is not supported on ESX/ESXi 4.1 and earlier hosts.
- The datastore is connected to a host that is not running Storage I/O Control.
- The datastore must be visible in only one data center. Move the hosts to the same data center or unmount the datastore from hosts that reside in other data centers.
- Ensure that all hosts associated with the datastore cluster are ESXi 5.0 or later.
- Ensure that all hosts associated with the datastore cluster have Storage I/O Control enabled.
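The three requirements above can be expressed as a simple checklist. The sketch below runs it over a hand-written host inventory. Everything about the data model is an assumption: the dictionary keys, host names, and data center names are hypothetical, and you would have to populate them from your own inventory.

```python
from typing import Dict, List


def sdrs_blockers(hosts: List[Dict]) -> List[str]:
    """List reasons Storage DRS could not operate, given the hosts that mount a datastore."""
    problems = []
    datacenters = {h["datacenter"] for h in hosts}
    if len(datacenters) > 1:
        problems.append(f"datastore is visible in multiple data centers: {sorted(datacenters)}")
    for h in hosts:
        if h["version"] < (5, 0):
            problems.append(f"{h['name']} runs a release earlier than ESXi 5.0")
        if not h["sioc_enabled"]:
            problems.append(f"{h['name']} does not have Storage I/O Control enabled")
    return problems


# Hypothetical inventory of the hosts connected to one datastore.
hosts = [
    {"name": "esx01", "datacenter": "DC-A", "version": (6, 5), "sioc_enabled": True},
    {"name": "esx02", "datacenter": "DC-B", "version": (4, 1), "sioc_enabled": False},
]
for problem in sdrs_blockers(hosts):
    print(problem)
```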
Datastore Cannot Enter Maintenance Mode
You place a datastore in maintenance mode when you must take it out of usage to service it. A datastore enters or leaves maintenance mode only as a result of a user request.
A datastore in a datastore cluster cannot enter maintenance mode. The Entering Maintenance Mode status remains at 1%.
One or more disks on the datastore cannot be migrated with Storage vMotion. This condition can occur in the following instances.
- Storage DRS is disabled on the disk.
- Storage DRS rules prevent Storage DRS from making migration recommendations for the disk.
- If Storage DRS is disabled, enable it or determine why it is disabled.
- If Storage DRS rules are preventing Storage DRS from making migration recommendations, you can remove or disable particular rules.
Browse to the datastore cluster in the vSphere Web Client object navigator.
Click the Manage tab and click Settings.
Under Configuration, select Rules and click the rule.
- Alternatively, if Storage DRS rules are preventing Storage DRS from making migration recommendations, you can set the Storage DRS advanced option IgnoreAffinityRulesForMaintenance to 1.
- Browse to the datastore cluster in the vSphere Web Client object navigator.
- Click the Manage tab and click Settings.
- Select SDRS and click Edit.
- In Advanced Options > Configuration Parameters, click Add.
- In the Option column, enter IgnoreAffinityRulesForMaintenance.
- In the Value column, enter 1 to enable the option.
- Click OK.
Recognize the impact of network and storage I/O control configurations
vSphere Storage I/O Control allows cluster-wide storage I/O prioritization, which allows better workload consolidation and helps reduce extra costs associated with over provisioning.
Storage I/O Control extends the constructs of shares and limits to handle storage I/O resources. You can control the amount of storage I/O that is allocated to virtual machines during periods of I/O congestion, which ensures that more important virtual machines get preference over less important virtual machines for I/O resource allocation.
When you enable Storage I/O Control on a datastore, ESXi begins to monitor the device latency that hosts observe when communicating with that datastore. When device latency exceeds a threshold, the datastore is considered to be congested and each virtual machine that accesses that datastore is allocated I/O resources in proportion to their shares. You set shares per virtual machine. You can adjust the number for each based on need.
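To get a feel for what "in proportion to their shares" means, here is a deliberately simplified model. It assumes a fixed pool of IOPS is available during congestion and divides it by share value. It ignores limits and says nothing about how ESXi actually enforces the split. The function name and the VM names are hypothetical.

```python
from typing import Dict


def allocate_by_shares(shares: Dict[str, int], available_iops: int) -> Dict[str, float]:
    """Split available IOPS among VMs in proportion to their shares."""
    total = sum(shares.values())
    if total == 0:
        return {vm: 0.0 for vm in shares}
    return {vm: available_iops * s / total for vm, s in shares.items()}


# Hypothetical VMs and share values on one congested datastore.
shares = {"db-vm": 2000, "web-vm": 1000, "test-vm": 1000}
allocation = allocate_by_shares(shares, available_iops=8000)
for vm, iops in allocation.items():
    print(f"{vm}: {iops:.0f} IOPS")
```

In this model, a virtual machine with twice the shares of its neighbours receives twice their portion, but only while the datastore is congested.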
The I/O filter framework (VAIO) allows VMware and its partners to develop filters that intercept I/O for each VMDK and provide the desired functionality at VMDK granularity. VAIO works alongside Storage Policy-Based Management (SPBM), which allows you to set the filter preferences through a storage policy that is attached to VMDKs.
By default, all virtual machine shares are set to Normal (1000) with unlimited IOPS. Storage I/O Control is enabled by default on Storage DRS-enabled datastore clusters.
vSphere Network I/O Control version 3 introduces a mechanism to reserve bandwidth for system traffic based on the capacity of the physical adapters on a host. It enables fine-grained resource control at the VM network adapter level similar to the model that you use for allocating CPU and memory resources.
Version 3 of the Network I/O Control feature offers improved network resource reservation and allocation across the entire switch.
Models for Bandwidth Resource Reservation
Network I/O Control version 3 supports separate models for resource management of system traffic related to infrastructure services, such as vSphere Fault Tolerance, and of virtual machines.
The two traffic categories differ in nature. System traffic is strictly associated with an ESXi host, whereas the routes of virtual machine traffic change when you migrate a virtual machine across the environment. To provide network resources to a virtual machine regardless of its host, in Network I/O Control you can configure resource allocation for virtual machines that is valid in the scope of the entire distributed switch.
Bandwidth Guarantee to Virtual Machines
Network I/O Control version 3 provisions bandwidth to the network adapters of virtual machines by using constructs of shares, reservation and limit. Based on these constructs, to receive sufficient bandwidth, virtualized workloads can rely on admission control in vSphere Distributed Switch, vSphere DRS and vSphere HA.
Recognize a connectivity issue caused by a VLAN/PVLAN
Understanding what VLANs and PVLANs are will help you identify issues that they might cause.
VLANs let you segment a network into multiple logical broadcast domains at Layer 2 of the network protocol stack. They enable a single physical LAN segment to be further divided so that groups of ports are isolated from one another as if they were on physically different segments.
The VLAN configuration in a vSphere environment provides certain benefits.
- Integrates ESXi hosts into a pre-existing VLAN topology.
- Isolates and secures network traffic.
- Reduces congestion of network traffic.
Private VLANs are used to solve VLAN ID limitations by adding a further segmentation of the logical broadcast domain into multiple smaller broadcast subdomains.
A private VLAN is identified by its primary VLAN ID. A primary VLAN ID can have multiple secondary VLAN IDs associated with it. Primary VLANs are Promiscuous, so that ports on a private VLAN can communicate with ports configured as the primary VLAN. Ports on a secondary VLAN can be either Isolated, communicating only with promiscuous ports, or Community, communicating with both promiscuous ports and other ports on the same secondary VLAN.
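When two virtual machines on a private VLAN cannot reach each other, first check whether that is by design. The sketch below encodes the rules from the paragraph above for ports within one primary VLAN. The `Port` class, the function name, and the VLAN IDs are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    kind: str          # "promiscuous", "isolated", or "community"
    secondary_id: int  # secondary VLAN ID (hypothetical values below)


def can_communicate(a: Port, b: Port) -> bool:
    """Apply the private VLAN rules to a pair of ports in one primary VLAN."""
    if a.kind == "promiscuous" or b.kind == "promiscuous":
        return True
    if a.kind == "community" and b.kind == "community":
        return a.secondary_id == b.secondary_id
    return False  # isolated ports reach only promiscuous ports


gateway = Port("promiscuous", 100)
iso1, iso2 = Port("isolated", 101), Port("isolated", 101)
com1, com2, com3 = Port("community", 102), Port("community", 102), Port("community", 103)

print(can_communicate(iso1, gateway))  # isolated to promiscuous
print(can_communicate(iso1, iso2))     # isolated to isolated
print(can_communicate(com1, com2))     # same community
print(can_communicate(com1, com3))     # different communities
```

If the function returns False for a pair of ports, the lack of connectivity between them is expected behaviour rather than a fault.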
To use private VLANs between a host and the rest of the physical network, the physical switch connected to the host needs to be private VLAN-capable and configured with the VLAN IDs being used by ESXi for the private VLAN functionality. For physical switches using dynamic MAC+VLAN ID based learning, all corresponding private VLAN IDs must be first entered into the switch’s VLAN database.
Troubleshoot common issues with:
- Storage and network
- Virtual switch and port group configuration
- Physical network adapter configuration
I believe the earlier parts of this post cover the troubleshooting methods and a few common issues with the above components in enough detail.
VMFS metadata consistency
Use vSphere On-disk Metadata Analyzer (VOMA) to identify incidents of metadata corruption that affect file systems or underlying logical volumes.
You can check metadata consistency when you experience problems with a VMFS datastore or a virtual flash resource. For example, perform a metadata check if one of the following occurs:
- You experience storage outages.
- After you rebuild RAID or perform a disk replacement.
- You see metadata errors in the vmkernel.log file similar to the following:
cpu11:268057)WARNING: HBX: 599: Volume 50fd60a3-3aae1ae2-3347-0017a4770402 (“<Datastore_name>”) may be damaged on disk. Corrupt heartbeat detected at offset 3305472: [HB state 0 offset 6052837899185946624 gen 15439450 stampUS 5 $
- You are unable to access files on a VMFS.
- You see corruption being reported for a datastore in the Events tab of vCenter Server.
To check metadata consistency, run VOMA from the CLI of an ESXi host. VOMA can be used to check and fix minor inconsistency issues for a VMFS datastore or a virtual flash resource. To resolve errors reported by VOMA, consult VMware Support.
Follow these guidelines when you use the VOMA tool:
- Make sure that the VMFS datastore you analyze does not span multiple extents. You can run VOMA only against a single-extent datastore.
- Power off any virtual machines that are running or migrate them to a different datastore.
The following example demonstrates how to use VOMA to check VMFS metadata consistency.
- Obtain the name and partition number of the device that backs the VMFS datastore that you want to check.
# esxcli storage vmfs extent list
The Device Name and Partition columns in the output identify the device. For example:
Volume Name  XXXXXXXX  Device Name                           Partition
1TB_VMFS5    XXXXXXXX  naa.00000000000000000000000000000703  3
- Check for VMFS errors.
Provide the absolute path to the device partition that backs the VMFS datastore, and provide a partition number with the device name. For example:
# voma -m vmfs -f check -d /vmfs/devices/disks/naa.00000000000000000000000000000703:3
The output lists possible errors. For example, the following output indicates that the heartbeat address is invalid.
Phase 2: Checking VMFS heartbeat region
ON-DISK ERROR: Invalid HB address
Phase 3: Checking all file descriptors.
Phase 4: Checking pathname and connectivity.
Phase 5: Checking resource reference counts.
Total Errors Found: 1
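If you save VOMA output to a file, a short script can pull out the error total and show which phase each error belongs to. The sketch below assumes the output format shown above, with "Phase N:" lines, "ON-DISK ERROR:" lines, and a final "Total Errors Found:" line. The function name `summarize_voma` is hypothetical.

```python
import re
from typing import List, Tuple


def summarize_voma(output: str) -> Tuple[int, List[str]]:
    """Return the reported error total and a list of 'phase -> error' strings."""
    errors = []
    phase = "unknown phase"
    total = 0
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Phase "):
            phase = line  # remember the phase that later errors belong to
        elif line.startswith("ON-DISK ERROR:"):
            errors.append(f"{phase} -> {line}")
        else:
            match = re.match(r"Total Errors Found:\s*(\d+)", line)
            if match:
                total = int(match.group(1))
    return total, errors


voma_output = """\
Phase 2: Checking VMFS heartbeat region
  ON-DISK ERROR: Invalid HB address
Phase 3: Checking all file descriptors.
Phase 4: Checking pathname and connectivity.
Phase 5: Checking resource reference counts.
Total Errors Found: 1
"""
total, errors = summarize_voma(voma_output)
print(total)
for error in errors:
    print(error)
```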