– Determine use cases for and apply esxtop Interactive, Batch and Replay modes
– Use vscsiStats to gather storage performance data
– Use esxtop/resxtop to collect performance data
Switch display: c:cpu i:interrupt m:memory n:network
d:disk adapter u:disk device v:disk VM p:power mgmt
fF Add or remove fields
oO Change the order of displayed fields
s Set the delay in seconds between updates
# Set the number of instances to display
W Write configuration file ~/.esxtop50rc
k Kill a world
e Expand/Rollup Cpu Statistics
V View only VM instances
L Change the length of the NAME field
l Limit display to a single group
usage: esxtop [-h] [-v] [-b] [-l] [-s] [-a] [-c config file] [-R vm-support-dir-path] [-d delay] [-n iterations]
[-export-entity entity-file] [-import-entity entity-file]
-h prints this help menu.
2 = highlight a row, moving down
8 = highlight a row, moving up
4 = remove selected row from view
Run the command below to display all fields rather than only the default ones:
~ # esxtop -a
Of course, my screen is not wide enough to show all of them. The useful trick is to press h, which takes you to the help screen. My interest here is not the help itself but how to sort each screen. For the CPU screen, I focus on the following fields:
CPU (%USED, %RDY, %CSTP)
Press h, as mentioned, to see the keys you can sort by:
U:%USED R:%RDY N:GID
When troubleshooting CPU performance for your virtual machines, the following counters are the most important.
%USED tells you how much time the virtual machine spent executing CPU cycles on the physical CPU.
%RDY is a key performance indicator; always start with this one. It shows how much time your virtual machine wanted to execute CPU cycles but could not get access to the physical CPU, that is, how much time it spent in a “queue”. I normally expect this value to stay below 5% (this equals 1000 ms in the vCenter performance graphs).
%CSTP tells you how much time a virtual machine with multiple vCPUs spends waiting for its other vCPUs to catch up. If this number is higher than 3%, you should consider lowering the number of vCPUs in your virtual machine.
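The relationship between a %RDY percentage and the millisecond value shown in vCenter can be sketched as follows. This is a minimal illustration, assuming the 20-second sampling interval of the vCenter real-time charts; the function names and the wording of the findings are hypothetical, and the 5% and 3% limits are the rules of thumb given above.

```python
def ready_percent_to_ms(ready_pct: float, interval_s: float = 20.0) -> float:
    """Convert a %RDY value into milliseconds of ready time per interval.

    Assumption: interval_s defaults to 20 s, the vCenter real-time
    chart sampling interval. pct/100 * seconds * 1000 == pct * seconds * 10.
    """
    return ready_pct * interval_s * 10.0


def cpu_findings(ready_pct: float, costop_pct: float) -> list:
    """Flag the CPU counters against the rule-of-thumb limits from the text."""
    findings = []
    if ready_pct > 5.0:
        findings.append("%RDY above 5%: VM is waiting for physical CPU")
    if costop_pct > 3.0:
        findings.append("%CSTP above 3%: consider fewer vCPUs")
    return findings


if __name__ == "__main__":
    print(ready_percent_to_ms(5.0))   # 5% of a 20 s interval
    print(cpu_findings(7.5, 1.0))
```

The conversion is linear, so a different chart interval only changes the `interval_s` argument.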
Memory (MCTL?, MCTLSZ, SWCUR, SWR/s, SWW/s)
M:MEMSZ B:MCTLSZ N:GID
When troubleshooting memory performance, these are the counters to focus on from a virtual machine perspective.
MCTL? This column is either YES or NO. YES means that the balloon driver is installed. The balloon driver is installed automatically with VMware Tools and should be present in every virtual machine. If this column says NO, figure out why.
MCTLSZ This column shows how far the balloon is inflated in the virtual machine. If it says 500MB, the balloon driver inside the guest operating system has “stolen” 500MB from Windows/Linux etc. You would expect to see a value of 0 (zero) in this column.
SWCUR tells you how much of the virtual machine's memory is in the .vswp file. If you see 500MB here, it means that 500MB currently resides in the swap file. This does not necessarily mean bad performance. To figure out whether your virtual machine is suffering from hypervisor swapping, you need to look at the next two counters. In a healthy environment you would want this value to be 0 (zero).
SWR/s This value tells you the read activity on your swap file. If you see a number here, your virtual machine is suffering from hypervisor swapping.
SWW/s This value tells you the write activity on your swap file. You want to see the number 0 (zero) here. Every number above 0 is BAD.
Sequence of a memory bottleneck
If an ESXi host comes under memory pressure, it reclaims memory in this order:
Page sharing, then ballooning (MCTLSZ), then compression (CACHEUSD and ZIP/s), and finally swapping (SWR/s and SWW/s), which is by far the worst for performance.
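The memory counters described above can be combined into a quick per-VM check. The sketch below is only an illustration: the function name, parameter names, and messages are hypothetical, and the values are assumed to have been read off the esxtop memory screen by hand or from a batch export.

```python
def memory_findings(mctl_installed: bool, mctlsz_mb: float, swcur_mb: float,
                    swr_per_s: float, sww_per_s: float) -> list:
    """Interpret the per-VM memory counters in the order the text describes."""
    findings = []
    if not mctl_installed:
        findings.append("MCTL? is NO: balloon driver missing, check VMware Tools")
    if mctlsz_mb > 0:
        findings.append("MCTLSZ > 0: balloon is inflated")
    if swr_per_s > 0 or sww_per_s > 0:
        findings.append("SWR/s or SWW/s > 0: active hypervisor swapping")
    elif swcur_mb > 0:
        # Swapped pages exist but nothing is being read or written right now.
        findings.append("SWCUR > 0 without swap I/O: past swapping, not active")
    return findings


if __name__ == "__main__":
    for line in memory_findings(True, 500.0, 200.0, 0.0, 0.0):
        print(line)
```

Separating SWCUR from SWR/s and SWW/s mirrors the point made above: swapped memory alone does not prove a current performance problem, while swap I/O does.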
Network (SPEED, FDUPLX, UP, PKTTX/s, PKTRX/s, MbTX/s, MbRX/s)
“SPEED” (Mbps) The link speed in Megabits per second. This information is only valid for a physical NIC.
“FDUPLX” ‘Y’ implies the corresponding link is operating at full duplex. ‘N’ implies it is not. This information is only valid for a physical NIC.
“UP” ‘Y’ implies the corresponding link is up. ‘N’ implies it is not. This information is only valid for a physical NIC.
“PKTTX/s” The number of packets transmitted per second.
“PKTRX/s” The number of packets received per second.
“MbTX/s” (Mbps) The MegaBits transmitted per second.
“MbRX/s” (Mbps) The MegaBits received per second.
Q: Why does MbRX/s not match PKTRX/s for different workloads?
A: This is because the packet size may not be the same. The average packet size can be computed as follows: average_packet_size = MbRX/s / PKTRX/s. A larger packet size may improve the CPU efficiency of packet processing, but it may also increase latency.
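The packet-size formula above needs a unit conversion before it gives a figure in bytes. The sketch below assumes that one megabit is 10^6 bits; the function name is hypothetical.

```python
def average_packet_size_bytes(mbits_per_s: float, packets_per_s: float) -> float:
    """Average packet size in bytes from MbRX/s (or MbTX/s) and PKTRX/s (or PKTTX/s).

    Assumption: 1 megabit = 1,000,000 bits.
    """
    if packets_per_s <= 0:
        return 0.0  # no traffic, so no meaningful average
    bytes_per_s = mbits_per_s * 1_000_000 / 8
    return bytes_per_s / packets_per_s


if __name__ == "__main__":
    # 120 Mbps carried by 10,000 packets per second
    print(average_packet_size_bytes(120.0, 10_000.0))
```

Two workloads with the same MbRX/s but different PKTRX/s will therefore report different average packet sizes.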
Storage (d:disk adapter u:disk device v:disk VM – vscsiStats )
| KAVG/cmd | Average ESXi VMkernel latency per command, in milliseconds. |
| DAVG/cmd | Average device latency per command, in milliseconds. |
| GAVG/cmd | Average virtual machine operating system latency per command, in milliseconds. |
| QAVG/cmd | Average queue latency per command, in milliseconds. |
| Metric | Threshold (ms) | What to Check |
| DAVG/cmd | >20 | Storage processor/array performance for bottleneck. |
| KAVG/cmd | >1 | Kernel driver firmware and adapter queue length. |
| GAVG/cmd | >20 | DAVG/KAVG metrics, and Guest OS performance. |
GAVG/cmd = KAVG/cmd + DAVG/cmd
DAVG/cmd is the adapter device Driver Average Latency per Command. This is the round-trip in milliseconds from the HBA to the storage array and the return acknowledgement. Typically, most admins like to see around 20ms or less, though it can vary significantly depending on your workload and its sensitivity to latency.
DAVG/cmd is a good indicator that you need to start your investigation outside of ESX at the fabric and storage array levels.
KAVG/cmd is the adapter device VMkernel Average Latency per Command. This is the average latency between when the HBA receives the data from the storage fabric and passes it along to the Guest OS, or vice versa; in other words, the round-trip time in the kernel itself. It should be a very low value, meaning that the I/O operation should spend as little time as possible in the kernel (zero or near-zero is ideal).
GAVG/cmd is the adapter device Guest OS Average Latency per Command. This is the round trip in milliseconds from the Guest OS (from its perspective) through the HBA to the storage array and back, which is why this number is the sum of DAVG/cmd + KAVG/cmd. If DAVG and KAVG are within normal thresholds but GAVG/cmd is high, this typically indicates that the VMs on that adapter, or at least one of them, are constrained by another resource and need more ESXi resources in order to process I/Os more quickly. In my experience, however, a high GAVG/cmd is typically accompanied by a high value in either DAVG or KAVG.
If KAVG/cmd is greater than 1ms or so, check a couple of things.
1) Your device drivers are up-to-date and you are using compatible firmware versions, as this can slow down the kernel IO path;
2) Your adapter optimization settings, which will be provided by the vendor (some of which we will discuss in the next post).
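The latency thresholds and the GAVG/cmd = KAVG/cmd + DAVG/cmd relationship can be put together in a small check. This is a sketch under the thresholds quoted in the table above (20 ms, 1 ms, 20 ms); the constant and function names are hypothetical, and the limits should be adjusted to your workload's sensitivity to latency.

```python
# Rule-of-thumb limits in milliseconds, taken from the threshold table in the text.
DAVG_LIMIT_MS = 20.0
KAVG_LIMIT_MS = 1.0
GAVG_LIMIT_MS = 20.0


def latency_findings(davg_ms: float, kavg_ms: float):
    """Return (gavg_ms, findings) for one adapter or device row."""
    gavg_ms = davg_ms + kavg_ms  # GAVG/cmd = KAVG/cmd + DAVG/cmd
    findings = []
    if davg_ms > DAVG_LIMIT_MS:
        findings.append("DAVG high: look at the fabric and storage array")
    if kavg_ms > KAVG_LIMIT_MS:
        findings.append("KAVG high: check drivers, firmware, adapter queue settings")
    if gavg_ms > GAVG_LIMIT_MS:
        findings.append("GAVG high: guest-visible latency exceeds the limit")
    return gavg_ms, findings


if __name__ == "__main__":
    gavg, notes = latency_findings(davg_ms=25.0, kavg_ms=0.5)
    print(gavg, notes)
```

Because GAVG is derived from the other two values, a GAVG finding on its own line is mainly a reminder of what the guest actually experiences.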
| Metric | Threshold | What to Check |
| DQLEN | n/a | For reference; configured device queue length (prior to 5.0 LQLEN) |
| BLKSZ | n/a | For reference; configured device block size (for alignment issues) |
| RESETS/s | >0 | Check paths and device availability; check storage fabric/array for bottleneck |
| ABRTS/s, QUED | >0 | Check queue depth and storage fabric/array for bottleneck |
| RESV/s | >0-1 | Compare to CONS/s |
| CONS/s | n/a | If >RESV/s, check for reservation conflicts with other ESXi hosts |
DQLEN is the configured Device Queue Length. This is really a reference point to make sure you have configured your devices correctly. A quick glance down this column may reveal a misconfigured queue.
BLKSZ is the configured Device Block Size. This is another reference point to ensure that you have the correct block size for the type of workload you are running.
RESETS/s is the number of Device SCSI Reset Commands per Second. A SCSI reset command is issued when the SCSI operation fails to reach the target, and in a SAN environment it usually indicates a path-down or multipathing issue, i.e., ESXi thinks a path is fine but in reality it is faulty. This is commonly seen on Cisco Nexus fabrics as CRC errors on a port, for example.
ABRTS/s is the number of Device SCSI Abort Commands per Second. A SCSI abort command is issued from the Guest OS when the command times out waiting for a response acknowledgement. In Windows 2008 and later, this is 60 seconds by default. Typically if you are encountering a large number of aborts, the storage fabric/array is causing a bottleneck and is the place to begin your investigation.
If you are using something such as a NetApp FAS, be sure to run the GOS Timeout Script on your VM or VM template so that the proper timeout values are set, in order to prevent a SCSI abort during a path failover or path problem.
QUED is the current Device Commands Queued in the VMkernel. As I explained previously, this number should be at zero or near zero, otherwise it is indicating that something in the kernel is throttling the IO throughput between the Guest OS and the HBA/storage fabric/array. Check firmware versions for correct revisions and other performance tuning options within ESXi, especially vendor recommendations.
RESV/s is the Device SCSI Reservations per Second. SCSI reservations are commonplace; that’s how SCSI commands work. This value is only important as it relates to CONS/s.
CONS/s is the Device SCSI Reservation Conflicts per Second. If this value is greater than RESV/s, it indicates that other ESXi hosts are holding reservations on this particular path that conflict with reservations currently held by this host. A very high value can be felt as sluggish performance in the storage subsystem, because the kernel keeps requesting SCSI locks, being denied, and retrying.
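The device-level counters above lend themselves to the same kind of quick check. The sketch below encodes only the rules stated in the text (anything above zero for RESETS/s, ABRTS/s and QUED, and CONS/s compared against RESV/s); the function name, parameter names, and messages are hypothetical.

```python
def device_findings(resets_per_s: float, aborts_per_s: float, queued: float,
                    resv_per_s: float, cons_per_s: float) -> list:
    """Flag one disk-device row using the rules of thumb from the text."""
    findings = []
    if resets_per_s > 0:
        findings.append("RESETS/s > 0: check paths and device availability")
    if aborts_per_s > 0:
        findings.append("ABRTS/s > 0: check storage fabric/array for a bottleneck")
    if queued > 0:
        findings.append("QUED > 0: commands are queuing in the VMkernel")
    if cons_per_s > resv_per_s:
        findings.append("CONS/s > RESV/s: reservation conflicts with other hosts")
    return findings


if __name__ == "__main__":
    print(device_findings(0.0, 0.0, 0.0, resv_per_s=1.0, cons_per_s=4.0))
```

RESV/s is deliberately not flagged by itself, since reservations are normal and only matter in relation to CONS/s.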
Troubleshooting SCSI reservation conflicts can be challenging. Some helpful information can be found in this VMware KB deep-dive article on Troubleshooting SCSI Reservation Conflicts, as well as in VMware KB 1005009 and VMware KB 1002293.
Virtual Machine Disk
You can output your results to a CSV file for further analysis:
vscsiStats -p all -c > /tmp/output.csv
Determine use cases for and apply esxtop/resxtop Interactive, Batch and Replay modes
Typical use cases are troubleshooting poor performance of a specific VM, or identifying issues with storage, network, or memory.
Interactive mode (the default mode): all statistics are displayed in real time.
Batch mode: statistics are collected and saved to a file (CSV), so they can be viewed and analyzed later using Windows perfmon and other tools.
~ # esxtop -b -d 20 -n 2 -a > /tmp/20secsnds2intrpts.csv
This takes 2 samples (-n 2) with a delay of 20 seconds between them (-d 20), includes all fields (-a), and writes the output as CSV.
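Besides perfmon, a batch file can be post-processed with a short script. The sketch below averages every column whose header ends with a given counter name. It assumes the perfmon-style layout of one header row followed by one row per sample; the header strings, host name, and VM names in `SAMPLE` are made up for illustration, so check them against your own export before relying on the suffix match.

```python
import csv
import io
from statistics import mean


def average_counter(csv_text: str, counter_suffix: str) -> dict:
    """Average every column whose header ends with counter_suffix."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if name.endswith(counter_suffix)]
    samples = {header[i]: [] for i in wanted}
    for row in reader:
        for i in wanted:
            samples[header[i]].append(float(row[i]))
    return {name: mean(values) for name, values in samples.items() if values}


# Hypothetical two-sample export with two VMs (names invented for this example).
SAMPLE = (
    '"(PDH-CSV 4.0) (UTC)(0)","\\\\esx01\\Group Cpu(1001:vm-a)\\% Ready",'
    '"\\\\esx01\\Group Cpu(1002:vm-b)\\% Ready"\n'
    '"04/21/2015 04:56:00","4.0","1.0"\n'
    '"04/21/2015 04:56:20","6.0","3.0"\n'
)

if __name__ == "__main__":
    for column, value in average_counter(SAMPLE, "% Ready").items():
        print(column, value)
```

For a real file, read it with `open(path, newline="").read()` and pass the text in. With -a the export can contain a very large number of columns, so filtering by suffix keeps the result manageable.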
Replay mode: this works like a record-and-replay operation. Data collected by the vm-support command is interpreted and played back as esxtop statistics, so you can view the captured performance information for a particular time period as if it were live. It is ideal for VMware support staff, who can replay the statistics to understand what was happening on the server during that time.
First let us see the vm-support switches:
So I run it with -p to collect performance data, -d for a duration of 100 seconds, -i for a 2-second interval, and -w to set the working directory:
/vmfs/volumes/4aaa440f-1a187eb4-6f5e-0000c985147e/LoGs # vm-support -p -d 100 -i 2 -w /vmfs/volumes/4aaa440f-1a187eb4-6f5e-0000c985147e/LoGs
Then reconstruct the data:
/vmfs/volumes/4aaa440f-1a187eb4-6f5e-0000c985147e/LoGs # cd esx-esx01.com-2015-04-21--04.56/
Most of the information here comes from the resources below:
Using esxtop to identify storage performance issues for ESX / ESXi (multiple versions) (1008205)
Interpreting esxtop Statistics