
vCenter can organize two or more ESXi hosts into a load balancing cluster called a Distributed Resource Scheduler (DRS) cluster. DRS clusters dynamically monitor ESXi host load and VM resource demands. Because VM resource demands change over time, it is possible that a VM that was previously receiving all of the resources it needed could become resource starved due to the changing resource demands of other VMs on the same host.
DRS looks for this very situation and will take action by either recommending VMotion migrations or initiating VMotion migrations to rebalance VMs across the cluster. In this way, your DRS cluster always runs VMs in the most resource efficient manner. DRS delivers the following benefits:
DRS clusters are vCenter objects that you can create only in the Hosts and Clusters view. To create a new cluster, select either a datacenter or a folder, right click and select New Cluster. This creates the new inventory object and lets you set the cluster's name. You would then right click the new cluster and select Edit Settings... to enable and configure DRS.
A DRS cluster needs one or more ESXi hosts in order to function. You can add ESXi hosts to a DRS cluster by dragging an ESXi host onto the DRS cluster inventory object. You can add a new ESXi host to a DRS cluster at any time. Upon receiving a new host, the cluster will reassess VM placements and migrate VMs onto the new host to even out resource consumption across all hosts.
It is the ability to hot-add ESXi hosts to DRS clusters that lets IT managers provision new servers in response to demonstrable increases in PC server workloads. In the past IT managers had to provision new PC servers for every new workload (OS + application). Now IT managers can simply create new VMs on a DRS cluster and let the cluster load balance the VM population. If the cluster's overall resource utilization rate climbs too high (say over 80%), you can simply add a new ESXi host to the cluster. DRS will re-balance across all hosts so overall resource load goes down and VM performance goes up.
When you add a host to a DRS cluster – everybody benefits!
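The effect of adding a host can be illustrated with simple arithmetic. The sketch below is only a rough model, assuming total VM demand stays constant and every host has the same capacity; the function name and the figures are hypothetical and this is not how DRS itself computes load.

```python
def cluster_utilization(total_demand_ghz, hosts, capacity_per_host_ghz):
    """Return cluster-wide utilization as a fraction of total capacity."""
    if hosts < 1:
        raise ValueError("a cluster needs at least one host")
    return total_demand_ghz / (hosts * capacity_per_host_ghz)

# Hypothetical cluster: 4 hosts of 20 GHz each, VMs demanding 68 GHz in total.
before = cluster_utilization(68.0, 4, 20.0)  # above the 80% guideline
after = cluster_utilization(68.0, 5, 20.0)   # same demand spread over 5 hosts
print(f"before: {before:.0%}, after: {after:.0%}")
```

With the same demand spread over one more host, the cluster drops back under the 80% guideline mentioned above.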
Initial Placement is the act of selecting a suitable ESXi host for VM placement and power on. When a user powers on a VM, DRS will:
Because DRS cluster resident VMs should all be VMotion compatible, any ESXi host can act as a power on host for the VM. By finding the least busy ESXi host at VM power on time, DRS attempts to place the VM on the host that will provide the VM with the best overall access to resources.
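A minimal sketch of the "least busy host" idea follows. It assumes load can be summarized as a single utilization fraction per host, which is a simplification; the host names and the function are hypothetical, and real DRS placement weighs more factors than this.

```python
def pick_power_on_host(host_load):
    """Return the name of the host with the lowest current load.

    host_load maps host name -> utilization fraction (0.0 to 1.0).
    """
    if not host_load:
        raise ValueError("no hosts available")
    return min(host_load, key=host_load.get)

# Hypothetical three-host cluster.
hosts = {"esxi01": 0.72, "esxi02": 0.35, "esxi03": 0.58}
print(pick_power_on_host(hosts))
```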
Dynamic Balancing starts monitoring ESXi host load and VM resource demands. If DRS determines that a VM is being resource starved, it will look to see if another ESXi host has free resources of the type the VM needs (CPU, RAM). If yes, then DRS will either migrate or recommend the migration of the VM over to that host. In this way, DRS can respond to changes in VM resource demands as VMs run without the need for human intervention.
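The check described above (does another host have free resources of the type the VM needs?) might be sketched as follows. This is an illustration only: the data layout, the function name and the tie-break rule are assumptions made for the example, not DRS internals.

```python
def find_target_host(vm_demand, current_host, free_capacity):
    """Return a host other than current_host with enough free CPU and RAM, or None."""
    candidates = [
        name for name, free in free_capacity.items()
        if name != current_host
        and free["cpu_mhz"] >= vm_demand["cpu_mhz"]
        and free["ram_mb"] >= vm_demand["ram_mb"]
    ]
    if not candidates:
        return None
    # Prefer the host with the most free CPU (an arbitrary choice for this sketch).
    return max(candidates, key=lambda name: free_capacity[name]["cpu_mhz"])

# Hypothetical free capacity per host.
free_capacity = {
    "esxi01": {"cpu_mhz": 200, "ram_mb": 512},
    "esxi02": {"cpu_mhz": 3000, "ram_mb": 8192},
    "esxi03": {"cpu_mhz": 1500, "ram_mb": 1024},
}
vm_demand = {"cpu_mhz": 1200, "ram_mb": 2048}
target = find_target_host(vm_demand, "esxi01", free_capacity)
print(target)
```

In Manual or Partially Automated mode the result would surface as a recommendation; in Fully Automated mode it would trigger the migration.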
DRS has three modes of operation...
In Manual mode, DRS makes Initial Placement and Migration recommendations only. However, DRS will take no action on its own.
DRS clusters in Partially Automated mode will make Initial Placement decisions at VM boot time thereby freeing users from having to deal with this task. But, DRS will only make VMotion migration recommendations when it detects signs of resource stress.
DRS clusters in Fully Automated mode make both Initial Placement and Dynamic Balancing decisions automatically. In this way, they address both power on resource contention and running VM resource contention. Fully Automated DRS clusters have additional tunables that let the ESXi administrator fine tune DRS behavior.
If you set your DRS cluster to fully automated, you must further tune the cluster to trade off load balancing vs. VMotion overhead.
VMotion takes CPU and memory resources to complete – so there is a resource penalty to pay when a VM is hot migrated to a new host. Aggressive DRS clusters move VMs when the potential VM performance improvement is just slightly more than the cost of the VMotion. This could result in unnecessary VMotions if the VM is experiencing a very short spike in resource demands.
As a DRS cluster administrator, your job is to find the setting that provides good overall load balancing while minimizing the number of VMotions. Typically this setting is either in the middle (default) or just to the right or left of the middle.
Experiment to find the best setting for your organization.
A DRS cluster in Fully Automated mode can be tuned for resource contention sensitivity. This allows ESXi administrators to set the level of resource contention that must be present before DRS will intervene.
The relative benefit a VM will receive through VMotion migration is represented in DRS by one to five stars. The more stars a recommendation receives, the more DRS believes the VM will benefit if a recommendation is accepted.
DRS recommendations are based on two factors:
In a nutshell, the length and severity of resource starvation, along with the target ESXi host's ability to satisfy the VM's resource needs, determine the number of stars a recommendation receives.
A 5-star recommendation is a special case. DRS includes VM affinity and anti-affinity settings (see slide 16). A 5-star recommendation is only made when a VM's current ESXi placement is in violation of any (anti-)affinity rules.
CPU compatibility has always been an issue with VMotion. Here's the problem:
When a VM powers on, it is serviced by a physical CPU core. At power on time, the VM probes the CPU for special capabilities (e.g.: SSE, SSE2, SSE3 or SSE4.1 instruction support, 64-bit support, Virtualization Assist technology, etc.). Once the VM learns about the special capabilities of the CPU running it – it never, ever re-probes the CPU. This can present a problem for VMotion because – if a VM moves to a host that lacks some of the capabilities the VM expects, any attempt to use those capabilities will result in application or OS failures (e.g.: Illegal instruction faults if a VM tries to execute an SSE3 instruction on a host that only supports SSE2).
VMware partially solves this problem with Enhanced VMotion Compatibility mode. With this DRS feature turned on, ESXi hosts will mask away features that are not common to all the physical CPUs in the cluster. This creates a situation where VMs see only compatible CPU features – even if the physical CPUs are not the same.
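Conceptually, masking away features that are not common to all CPUs amounts to taking the intersection of each host's feature set. The sketch below illustrates that idea only; the feature names and hosts are hypothetical, and in practice you select a predefined EVC baseline rather than computing one from arbitrary feature lists.

```python
def common_cpu_features(host_features):
    """Return the features shared by every host (the set intersection)."""
    feature_sets = [set(f) for f in host_features.values()]
    if not feature_sets:
        return set()
    return set.intersection(*feature_sets)

# Hypothetical feature sets reported by three hosts.
hosts = {
    "esxi01": {"sse", "sse2", "sse3"},
    "esxi02": {"sse", "sse2", "sse3", "sse4.1"},
    "esxi03": {"sse", "sse2"},
}
baseline = common_cpu_features(hosts)
# Features each host would have to hide from its VMs.
masked = {name: feats - baseline for name, feats in hosts.items()}
print(sorted(baseline))
```

Note that the oldest CPU in the cluster determines the baseline, which is why every VM loses access to features that only the newer hosts provide.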
EVC works within processor families – and to a limited extent, across processor families. Before turning on EVC, check to make sure that all of the CPUs in your physical hosts are EVC compatible and determine the highest common EVC processor family. You can use the VMware CPU identification tool to assist (See VMotion chapter).
To use EVC, select your CPU maker, then CPU family.
Intel and AMD improve their CPU products regularly. They add features, new instructions, new hardware capabilities, etc. - some of which are visible to the Guest OS. Any CPU mismatch could result in VMotion failures or VM failures due to CPU compatibility.
If you have Intel Xeon CPUs, EVC will mask away the differences within Xeon product families. That would allow you to mix different versions or steppings of CPUs within any one of these Xeon families:
1. Xeon Core i7 CPUs
2. Xeon 45nm Core 2 CPUs
3. Xeon Core 2 CPUs
4. 32nm Core i7 CPUs
5. Sandy Bridge CPUs
6. Ivy Bridge CPUs
7. Haswell CPUs
Click the Xeon processor family that represents your CPUs. Review the information provided and ensure that all of your CPUs match before turning on EVC.
EVC abstracts CPUs within a CPU family down to a common set of features and functions. This makes all CPUs VMotion compatible (because VMs all see the same CPU feature set). This lets you:
EVC places strict limits on hosts when they attempt to join a cluster. EVC performs a host CPU compatibility check and will refuse to allow any host to join the cluster where the new host's CPUs are older than the CPU family selected when EVC was enabled.
EVC was introduced in ESXi 3.5 Update 2 and is supported for this release of ESXi and all newer releases of ESXi (e.g.: ESXi 3.5 Update 3, Update 4, ESXi 4.x, ESXi 5.x and ESXi 6.0).
To set (Anti-)Affinity rules, right click the DRS cluster → Manage tab → Settings → VM/Host Rules
Some VMs may perform better whenever they reside on the same ESXi host. An example might be a SQL database VM and an Accounting VM that uses SQL for record storage/retrieval. If these VMs reside on different ESXi hosts then network SQL requests must flow through physical networking at 1 Gb/s. If these VMs reside on the same ESXi host then they can exchange packets at virtual network speed (which should be faster). In cases like this, you would create an Affinity rule to tell DRS to keep these VMs on the same ESXi host.
For many applications, it makes sense to create two or more VMs that perform exactly the same function. That way, you can take one VM down for maintenance and the service is still available. Applications that could benefit from this approach include E-mail servers, Web servers, DNS servers, DHCP servers, etc. that perform the same function for the same clients.
If DRS were to place two VMs that provide the same service on the same ESXi host, problems could arise because the VMs compete with each other for the same resources. If the ESXi host were to go down, both VMs would fail. In this case, create an Anti-Affinity rule and DRS will place these VMs on separate ESXi hosts.
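A simple way to picture an Anti-Affinity rule is as a check that no two VMs in the rule share a host. The sketch below assumes a plain mapping of VM to host and a list of "keep apart" groups; these structures and names are hypothetical and are not the vCenter data model.

```python
def anti_affinity_violations(placement, rules):
    """Return the rules whose VMs currently share a host.

    placement maps VM name -> host name; rules is a list of VM-name groups
    that must be kept on separate hosts.
    """
    violations = []
    for group in rules:
        hosts_used = [placement[vm] for vm in group if vm in placement]
        # A repeated host name means two VMs in the group are co-located.
        if len(hosts_used) != len(set(hosts_used)):
            violations.append(group)
    return violations

# Hypothetical placement: both DNS servers ended up on the same host.
placement = {"dns01": "esxi01", "dns02": "esxi01", "web01": "esxi02"}
rules = [("dns01", "dns02"), ("dns01", "web01")]
print(anti_affinity_violations(placement, rules))
```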
Navigation: Click your DRS cluster → Manage → Settings → VM Overrides
Click Green +, select VMs you wish to override → OK
For each VM in the roster, click the VM and select a new automation level
A DRS cluster is configured with a default automation level (either Manual, Partially Automated or Fully Automated). With no further action, this default automation level would be applied to all VMs on the cluster. However, ESXi administrators can override the default DRS cluster automation level on a per-VM basis.
For example: suppose your organization ran strict change management procedures only on critical production VMs. An ESXi administrator could set the default DRS cluster automation level to Fully Automated and then use per-VM overrides to downgrade critical production VMs to Partially Automated. That way, a person would have to approve (and log) any VMotion migrations of critical production VMs (only).
For another example, suppose a small IT department had 3 ESXi hosts: 2 for Production and one for Test. They wanted to give their production VMs the best possible performance but did not want their Test VMs to leave the test ESXi server. They could create a 3 host DRS cluster and set their production VMs to Fully Automated. They could then set their test VMs' automation level to Disabled. As a result, Test VMs would never leave the Test ESXi host. But production VMs could be migrated to/from the Test ESXi host to take advantage of the resources available on all ESXi hosts.
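The override logic in these examples reduces to "use the per-VM setting if one exists, otherwise the cluster default". A minimal sketch, with hypothetical VM names and a plain dictionary standing in for the VM Overrides roster:

```python
def effective_automation(vm_name, cluster_default, overrides):
    """Return the per-VM override if present, else the cluster default."""
    return overrides.get(vm_name, cluster_default)

# Hypothetical roster: test VMs are pinned by disabling DRS for them.
overrides = {"test01": "Disabled", "test02": "Disabled"}
for vm in ("prod01", "test01"):
    print(vm, effective_automation(vm, "Fully Automated", overrides))
```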
The Resource Allocation tab gives you a point in time view of your VMs and the resources they are receiving (as well as the host they are on and the shares they hold, reservations, limits, etc.).
This view displays values that are editable. If you see a VM whose resources, etc. don't match your needs simply click the value you wish to change and edit it directly. No need to launch the Cluster Properties window to make small changes.
The DRS tab displays the status of your cluster along with additional functions...
The Run DRS link lets you tell DRS to refresh its recommendations – now!
The Apply Recommendations button is how you give DRS permission to make the changes it suggests
The Edit... link lets you change the cluster properties directly from this screen
The Faults button lets you review past DRS actions that failed for any reason
The History button lets you review past DRS actions taken to keep the cluster in balance
DRS clusters honor all resource settings on individual VMs and also on Resource Pools including CPU/Memory reservations, shares and limits. So, VMs will not have their resource entitlements changed simply because they are now managed by a DRS cluster.
Because DRS relies on VMotion, DRS can only load balance VMs that are VMotion compatible. For best results, you should carefully plan your ESXi host deployment and configuration so that all of your ESXi hosts are VMotion compatible. Then, you should take care to configure your VMs so that they use only common storage, networking and removable media resources so that you do not inadvertently lock a VM to an ESXi host.
DRS can run VMs that are not VMotion compatible – but it cannot move them. So, if an ESXi host is running a mix of VMotion and non-VMotion compatible VMs, DRS is limited to moving only the VMotion compatible VMs. This may impair its ability to fully load balance across the cluster.
If your organization is new to VMotion and DRS, you are likely to meet resistance to the idea of automatically hot migrating VMs. If this happens to you, you can ease your organization into VMotion and DRS as follows:
Create a DRS cluster and set the automation level to Manual
Over time, VMs will experience resource starvation on their ESXi host
Finally, deal with any per-VM concerns by using DRS Rules to override the DRS cluster default for that VM.
DRS will only violate affinity rules when it has no choice. The most likely scenario is where a DRS cluster is also an HA (fail over) cluster. For example, if you have two ESXi hosts in a combined DRS+HA cluster and you have two VMs in an anti-affinity rule (keep apart) for high service availability, the following situation could lead to a 5-star recommendation:
The two VMs are running on separate ESXi hosts
High Availability clusters solve the problem of rapid VM placement and recovery if a VM should fail because the host it was running on failed (any of VMkernel panic, hardware failure, non-redundant storage failure, etc.)
HA minimizes VM down time by:
The overall objective for HA is to have VMs back in service in less than 2 minutes from the time an ESXi host fails. Users will still need to re-establish any authenticated sessions against the recovered VM... but (hopefully) the VM down time experienced will be no more than a nuisance.
For HA to place and power on VMs that fail when an ESXi host fails, those VMs must use only resources common to both the failed and the new ESXi host. This means the VM must use common
VMs can meet HA compatibility without meeting VMotion compatibility. Specifically, because the VM is being cold-booted rather than hot migrated, there is no need to maintain ESXi CPU compatibility between HA cluster peers. The reason for this is that when the VM is booted on the target ESXi host, the VM can probe for CPU properties, so there is no need for the CPUs to match exactly between the failed ESXi host and the target ESXi host.
VMware HA clusters reserve ESXi host CPU and memory resources to ensure that VMs that fail when an ESXi host fails can be placed and restarted on a surviving ESXi host. There are two factors that determine how much ESXi host resources are held back:
In the above examples, three HA clusters are illustrated. In each case the cluster is configured to tolerate a maximum of one ESXi host failure.
In a 2-node cluster, HA must hold back 50% of all ESXi host CPU and memory. That way, it can guarantee that it can place and restart all VMs from a single failed ESXi host. Consequently each host can never be more than 50% busy.
In a 3-node cluster, HA will hold back 33% of all ESXi host CPU and memory. If an ESXi host fails, half of the VMs from the failed host will be placed on each of the surviving ESXi hosts. A healthy 3-node cluster can be up to 66% busy.
In a 4-node cluster, HA only needs to hold back 25% of each host's resources. And, in a 5-node cluster only 20% of each host's resources are kept in reserve.
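The percentages above follow from dividing the number of host failures to tolerate by the number of hosts. The sketch below reproduces that arithmetic under the assumption that all hosts are identical; the function name is hypothetical, and actual HA admission control may calculate reserves differently.

```python
def ha_reserve_fraction(hosts, failures_to_tolerate=1):
    """Fraction of each host's CPU and memory held back for failover."""
    if hosts <= failures_to_tolerate:
        raise ValueError("cluster cannot tolerate that many host failures")
    return failures_to_tolerate / hosts

for n in (2, 3, 4, 5):
    reserve = ha_reserve_fraction(n)
    print(f"{n}-node cluster: reserve {reserve:.0%}, usable {1 - reserve:.0%}")
```

The reserve shrinks as the cluster grows, which is one reason larger HA clusters make more efficient use of hardware.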
vCenter is the central management console for HA clusters. vCenter is responsible for creating, monitoring and coping with host failures and recoveries.
The fundamental assumption for HA clusters is that any host can fail at any time. And, because vCenter could be running as a VM, it is possible that the vCenter VM could fail when a host fails. This is a challenge because, if the cluster depended on vCenter and vCenter has failed, then the cluster could not recover from an ESXi host failure. To circumvent this problem, vCenter publishes the cluster configuration to every ESXi host in the HA cluster. That way, each ESXi host knows:
In the event of an ESXi host failure that also causes the vCenter VM to fail, the surviving ESXi hosts would cooperate to distribute the failed VMs (including the vCenter VM). Once the VMs were distributed, each host would begin booting its newly assigned VMs according to individual VM restart priority. As a result, the vCenter VM would be placed and booted quickly thereby bringing vCenter back into service.
Best Practice
If your vCenter server is a VM, it is a best practice to give it high restart priority.
In the Virtual Machine Options page, you are presented with a roster of all of the defined VMs on the cluster. You can click the VM row under the Restart Priority column header to change the restart priority for a VM.
High Priority – AD, DNS, DHCP, DC, vCenter and other critical infrastructure VMs
Medium Priority – Critical application servers like SQL, E-mail, Business applications, file shares, etc.
Low Priority - Test, Development, QA, training, learning, experimental and other non critical workloads
Disabled - Any VMs not required during periods of reduced resource availability (select from Low Priority examples)
To ensure continued operation of all of your virtual infrastructure, it is important that you assign VM restart priority with care. By default, all VMs are placed and restarted at the HA cluster's default restart priority, which is set on the HA cluster's main settings page (right click cluster > click Edit Settings > click HA > click VMware HA). The default for HA cluster Restart Priority settings is Medium.
Note: HA will not power off VMs on healthy ESXi hosts to free up resources for High or Medium priority VMs from failed ESXi hosts. Whether it is reasonable to do this or not would be up to the local administrator.
Best Practice
Set your cluster default restart priority to low. Then individually set your critical infrastructure VMs (DNS, DHCP, AD, DCs, etc.) to high and your critical business VMs to medium.
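The effect of this best practice on restart order can be sketched as a sort by priority, with unassigned VMs falling back to the cluster default and Disabled VMs skipped. The VM names, priority strings and alphabetical tie-break below are assumptions made for the illustration, not HA's actual implementation.

```python
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}

def restart_order(vms, cluster_default="low"):
    """Return VM names in restart order; VMs marked 'disabled' are skipped.

    vms maps VM name -> priority string, or None to use the cluster default.
    """
    effective = {name: (prio or cluster_default) for name, prio in vms.items()}
    candidates = [n for n, p in effective.items() if p != "disabled"]
    # Sort by priority first, then by name to keep the order deterministic.
    return sorted(candidates, key=lambda n: (PRIORITY_ORDER[effective[n]], n))

# Hypothetical VMs from a failed host; test01 inherits the cluster default.
vms = {"dc01": "high", "sql01": "medium", "test01": None,
       "vcenter": "high", "lab01": "disabled"}
print(restart_order(vms))
```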
Isolation Response behavior is triggered after 15 seconds (tunable). If a NIC cable was pulled accidentally, 15 seconds should be sufficient time to fix the problem (re-plug the cable) and avoid a cluster failure.
Through heart beat failure, other ESXi hosts would quickly determine that a peer HA cluster node is unresponsive. They would then check their own ESXi Console physical NIC link to verify that they have network connectivity. In this way, they determine that they are not the isolated host and that they should cooperate with other surviving HA cluster nodes to implement the cluster's Isolation Response policy.
After 15 seconds, the isolated HA cluster node implements the VM's isolation response policy. If that policy is Power Down, then our VM would power crash. This is the virtual equivalent of pulling the power plug on a physical machine.
Pulling the virtual power on a VM can be traumatic but does avoid a potentially greater problem. If the VM was told to perform an orderly shutdown instead of losing its power, then:
When the VM has been successfully powered down, the isolated ESXi host removes the exclusive lock it holds on the VM's virtual disk. Healthy ESXi cluster nodes monitor the VM's lock and know that it is safe to take ownership of the VM once its lock has been removed.
HA cluster nodes would then distribute the powered off VMs from the isolated host amongst the surviving ESXi hosts. VMs would be placed and powered on according to their Restart Priority setting on the HA cluster.
In this case, the left-most ESXi host assumes ownership of the Web01 VM. This VM is added to the host's VM inventory and then immediately powered on. When the VM is powered on, its new owner establishes its own exclusive lock on Web01's virtual disk.
The isolated node would watch for the presence of a lock file for each VM it lost. Once the VM has been successfully powered on on another ESXi host, the isolated host would remove the VM from its own VM inventory.
Before you can take an ESXi cluster node out of a cluster for maintenance, you must announce to the cluster that the host is being pulled from service. You do this by putting the ESXi host into Maintenance mode as follows:
Right click ESXi host > Enter Maintenance Mode
When you place an ESXi host into Maintenance Mode, the following takes place:
Once the ESXi host is fully evacuated of VMs, you can shut down or reboot the ESXi host (right click the host), then patch or upgrade it as needed. When you boot it back up, it will automatically re-join any clusters of which it was a member.
It is critical that VMware ESXi administrators be informed of major network upgrades and/or outages... and especially if network outages may occur on switches used by High Availability cluster ESXi management ports.
The scenario above is a real possibility and will have serious results. If a switch that provides ESXi HA management networking fails (for any reason), all nodes in the HA cluster will believe they are isolated. If that happens, and the cluster Isolation Response policy is Power Off, then all ESXi hosts will power off all VMs.
To defend against this possibility, you could:
vSwitch0 is the vSwitch used to connect the default Management port to the physical (management) LAN segment. If you lose connectivity to this NIC, your ESXi host is unmanageable and your HA cluster will trigger its Isolation Response policy. You have two choices when designing your networks to maximize your management capabilities.
NIC Team vSwitch0
If you NIC Team vSwitch0, then you will have 2 or more NICs connected to the same physical LAN segment usually through the same physical switch. You are protected from a NIC failure, cable pull or switch port failure but not from a physical switch failure.
Second Management Port
You should consider adding a second management port on a completely separate physical LAN segment by making a new service console port on vSwitch1, vSwitch2, etc. If these other vSwitches uplink to different physical switches than vSwitch0, then you will have achieved management port redundancy, NIC redundancy and switch redundancy. This provides you with the maximum protection and minimizes the likelihood of a HA fail over event caused by a single hardware component failure.
If you add a second management port, please make sure that vCenter can connect to all ESXi hosts on all management ports. It is best if vCenter has 2 NICs with connections to each of the physical switch(es) used for management port connectivity.
DRS and HA clusters work best together. DRS will dynamically place and load balance VMs while HA will restart VMs that fail when a host fails. Using these tools, an IT department can deliver consistently good VM performance with very little VM down time.
In a combined DRS+HA cluster if a host fails:
VMware introduced vLockStep into vSphere 4... vLockStep is a replication technology that replicates the complete state of a VM running on one ESXi host into a VM running on a second ESXi host. In essence, the two VMs form an active/stand-by pair... They are the same in all respects; they have the same configuration (virtual hardware), share the same virtual MAC address, have the same memory contents, CPU contents, complete the same I/Os, etc. The main difference is that the Secondary VM is invisible to the network. VMware upgraded their Fault Tolerance technology to use Fast Checkpointing in vSphere 6. Fast Checkpointing provides more scalability than vLockStep.
If the Primary VM were to fail for any reason (e.g.: VMkernel failure on the machine running the Primary VM), Fault Tolerance would continue running the VM – by promoting the Secondary VM to the Primary VM on the surviving host. The new Primary would continue interacting with peers on the network, would complete all pending I/Os, etc. In most cases, peers wouldn't even know that the original Primary has failed.
To protect against a second failure, Fault Tolerance would then create a new Secondary node on another ESXi host by replicating the new Primary onto that host. So, in relatively little time, the VM is again protected and could withstand another VMkernel failure.
Note that Fault Tolerance does not protect against SAN failures
VMware recommends a minimum of 3 ESXi hosts in an HA/FT configuration so that, if one ESXi host is lost, there are 2 hosts remaining and the FT protected VM can create a new Secondary copy on the 3rd cluster host.
VMware introduced FT vLockStep into vSphere at version 4.0... and replaced it with Fast Checkpointing replication and synchronization technology that:
VMware's best practice is to build HA/FT clusters using an odd number of servers. This allows FT to protect against a second failure. In the case of a host failure, FT would create a new Secondary node on another ESXi host by replicating the new Primary onto that host. So, in relatively little time, the VM is again protected and could withstand another ESXi host failure.
Virtualization is most effective when a large number of PC server workloads can be consolidated onto a much smaller population of ESXi hosts. The key to doing this effectively is the VMkernel's ability to determine which VMs need service and then to ensure those VMs get the resources they need. Since different VMs can spike on resources at different times, the VMkernel must stay vigilant in its efforts to monitor and allocate resources.
The first thing to realize is that a VM does not need its full allocation of resources all the time. For example, a VM that is only 10% busy (CPU wise) could, in theory, get by with just 10% of a CPU – as long as it received CPU service exactly when it needed it. In this way, a single CPU resource (e.g.: a CPU core) could service many VMs and give them all the cycles they need – just as long as they didn't all need cycles at exactly the same instant in time.
Secondly, if you know where to look, an operating system will tell you when it doesn't need CPU service. All operating systems include a very low priority Idle task. This task is run when there is absolutely nothing else for the operating system to do (no tasks needing service, no I/Os to complete, etc.). VMware Tools monitors the guest OS Idle task and reports back to the VMkernel whenever the Idle task is running. In this way, the VMkernel knows which VMs are idling and can either reduce an idling VM's scheduling priority or even pull the CPU from idling VMs so that it can swing the CPU over to VMs with real work to do.
Along with memory, CPU is one of the most likely resources to experience contention. So, managing CPU effectively will have a great impact on the overall load an ESXi host can handle.
When an ESXi host boots, the VMkernel scans the hardware for CPU resources including Sockets, Cores and Hyper-Threaded Logical Processors. Of the three, an additional socket provides the highest overall increase in CPU performance. After that, adding cores to a physical CPU package is the next most effective strategy to improve CPU capabilities. Finally, if you are an Intel customer, then using Hyperthreading will modestly boost performance even further.
ESXi abstracts sockets, cores and logical processors into separately schedulable, weighted CPU resources called Hardware Execution Contexts (HECs). Since a socket is the most effective CPU resource, the VMkernel will schedule VMs against all sockets first. If there are more VMs to run, the VMkernel will then schedule remaining VMs against CPU cores across CPU sockets. If there are still more VMs that want service, then the VMkernel will schedule those (lower priority) VMs against Hyperthreaded Logical Processors.
ESXi licenses by the socket starting with a minimum of one socket per server. So, the best way to maximize the CPU performance available to your VMs is to purchase competent (large cache, high frequency) multi-core CPUs.
The VMkernel runs a VM task scheduler to assign physical CPU resources to VMs. The VMkernel task scheduler runs 50 times per second and each time it runs it must decide which VMs will run and which VMs will wait for CPU service.
The scheduler first decides which VMs get to run. To do this it uses a number of factors including:
Is the VM running real work or just its Idle task?
To run a VM, the scheduler must find physical CPU resources equal to the number of vCPUs in the VM. If it were to provide the VM with fewer than the declared number of vCPUs the Guest OS in the VM would treat this as a CPU failure and would Blue-Screen the OS.
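The all-or-nothing requirement described above can be sketched as a greedy pass over the VMs that want to run. This is a simplified illustration that assumes the list is already in priority order and that every physical CPU resource is interchangeable; the function and VM names are hypothetical and the real VMkernel scheduler is considerably more sophisticated.

```python
def schedule_tick(vms, free_cpus):
    """Run a VM only if enough physical CPU resources remain for ALL of its vCPUs.

    vms is a list of (name, vcpu_count) tuples, assumed to be in priority order.
    """
    running, waiting = [], []
    for name, vcpus in vms:
        if vcpus <= free_cpus:
            running.append(name)
            free_cpus -= vcpus
        else:
            # Never hand a VM fewer CPUs than it declares; it waits instead.
            waiting.append(name)
    return running, waiting

# Hypothetical host with 8 schedulable CPU resources and 11 vCPUs of demand.
vms = [("db01", 4), ("web01", 2), ("app01", 4), ("dns01", 1)]
running, waiting = schedule_tick(vms, free_cpus=8)
print("running:", running, "waiting:", waiting)
```

The example also shows why wide VMs are harder to schedule: app01 waits even though two CPU resources are free, while the single-vCPU dns01 runs.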
Physical to Virtual CPU Scheduling
The VMkernel abstracts physical CPU resources into independently schedulable processor resources called Hardware Execution Contexts (HEC). Depending on the capabilities of your CPU(s), an HEC can be any of:
An HEC is exactly like the physical CPU (same maker, model, speed, cache, etc.) but is presented to the VM as a single socket/single core CPU resource.
The VMkernel assigns weights to HECs according to their relative processing power. Using a physical single core CPU as a baseline, each additional CPU socket adds about 85-98% in additional performance. Each core (on a socket) adds about 65-85% in additional performance, while HyperThreaded Logical Processors might add only 5-30% of a CPU core in performance.
The VMkernel assigns the most powerful HECs to high priority VMs, thereby distributing high priority VMs across sockets. Next, the VMkernel schedules VMs across CPU cores. Finally, low priority VMs get scheduled on cores or HT Logical Processors.
If there are more Virtual CPUs than physical HECs, (as in the slide above) some VMs are forced to wait.
Concurrent vs. Sequential Tasks
Concurrent applications are applications that can process more than one request at the same time. Such applications are either multitasking or multi-threaded. Examples of concurrent applications include modern web servers (service multiple web requests concurrently), mail servers (service inbound and outbound mail requests concurrently), database servers (service queries concurrently), etc. Under heavy load, concurrent applications benefit from additional vCPUs because different tasks/threads can execute simultaneously on different vCPUs.
Sequential applications are applications that service one request at a time or do one thing at a time. These tasks run as a single process with only one thread of execution. (Legacy applications are often designed in this manner.) CPU bound sequential applications receive no additional benefit from adding vCPUs because the application can only use one CPU at a time. Adding vCPUs to a sequential application wastes the CPU resource because the guest OS has no choice but to run its Idle Task with the additional CPU resource(s). Sequential applications typically execute best on high frequency CPUs that also contain larger caches – because they will execute faster than if they were serviced by slower (frequency) CPUs with smaller caches.
If you have a CPU bound VM, add a second vCPU. If application performance doesn't improve, then you have a sequential application. In this case, remove the additional vCPU from the VM because the VM will waste these extra cycles.
Since VMs don't need 100% CPU service all the time, the VMkernel can effectively run two or more VMs on the same physical CPU resource and give these VMs all of the CPU service they need. The trick is to detect when a VM is idling and immediately steal away the CPU from that VM and give it to another VM that has real tasks to run.
Since an Idling VM would accomplish nothing, and the same VM waiting on a Run Queue for its turn to run would also accomplish nothing, there is no harm in forcing an Idling VM to give up its CPU resource.
For light duty workloads that may only need 3-5% of a CPU (e.g.: DNS, DHCP, Active Directory, Domain Controllers, light Web, File/Print, etc.), it makes sense to consolidate these VMs onto ESXi at very high ratios (up to 8 vCPUs per physical CPU resource) because, at 5% utilization, 8 VMs would only need 40% of a CPU core.
Busier VMs may still perform fine with no more than 25-35% of a CPU when they spike and less otherwise. So, if your VMs are 20% busy on average, a 2-socket quad core server (8 cores) could, in theory, run up to 40 VMs before experiencing CPU starvation.
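The consolidation arithmetic above can be checked with a short Python sketch. It is only a back-of-the-envelope model: it assumes single-vCPU VMs, treats every core as equal, and ignores VMkernel overhead. The function names are illustrative and not part of any VMware tool.

```python
def core_demand_pct(vm_count: int, avg_util_pct: int) -> int:
    """Total CPU demand of a group of VMs, as a percentage of ONE core."""
    return vm_count * avg_util_pct


def max_vms(sockets: int, cores_per_socket: int, avg_util_pct: int) -> int:
    """Theoretical VM count before average demand exceeds total core capacity."""
    total_capacity_pct = sockets * cores_per_socket * 100  # 100% per core
    return total_capacity_pct // avg_util_pct


# 8 light duty VMs at 5% each need 40% of one core.
light_duty_demand = core_demand_pct(vm_count=8, avg_util_pct=5)

# A 2-socket quad core host running VMs that average 20% busy.
theoretical_limit = max_vms(sockets=2, cores_per_socket=4, avg_util_pct=20)
```

In practice you would stay well below the theoretical limit, because averages hide the spikes mentioned above.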
ESXi loads into memory at boot time. ESXi 6.0 needs a minimum of 4GB of memory to boot and run (this is checked for at boot time). Note – the presence of any device (e.g.: shared memory video) that reduces available memory below 4GB will prevent an ESXi host with only 4GB of RAM from booting.
The VMkernel takes approximately 5% of RAM and leaves the rest for VM use. You can see current system-wide RAM statistics as follows:
Web Client > Your ESXi Host > Manage tab > Settings tab > Memory
VMs compete with each other for use of remaining physical memory. If there is more RAM in the ESXi host than VMs need, then VMs will get all of the physical RAM they attempt to use (up to their declared maximum). If VMs attempt to use more physical RAM than the ESXi host has available, then the VMkernel must step in and use its memory management skills to minimize memory contention. The VMkernel has five techniques for accomplishing this task.
Note: ESXi 6.0 is rated for a maximum of 6TB of physical host memory. However, VMware has been working with hardware OEMs to double physical memory support to 12TB. Please see VMware's hardware compatibility portal to find out which machines can handle more than 6TB of physical memory.
VMware encourages its customers to run a mix of VMs whose memory footprint (declared memory needs) exceeds physical RAM by 20-40%. This means that an ESXi host with 16GB of RAM could easily and efficiently run a mix of VMs that collectively declare 20+GB of memory (25% over commit) with no reduction of memory performance and no sign of memory stress.
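As a quick check of the over commit figure, the percentage can be computed as follows. This is a minimal Python sketch; the function names and the 20-40% guidance window as a pass/fail test are illustrative assumptions, not a VMware formula.

```python
def overcommit_pct(declared_gb: float, physical_gb: float) -> float:
    """How far declared VM memory exceeds host physical RAM, in percent."""
    return (declared_gb - physical_gb) * 100 / physical_gb


def within_guidance(declared_gb: float, physical_gb: float,
                    low_pct: float = 20, high_pct: float = 40) -> bool:
    """True if the over commit falls inside the 20-40% range quoted above."""
    return low_pct <= overcommit_pct(declared_gb, physical_gb) <= high_pct


# 20GB of declared VM memory on a 16GB host is a 25% over commit.
example_pct = overcommit_pct(declared_gb=20, physical_gb=16)
```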
The VMkernel uses paging to map physical RAM into the virtual address space of a VM. The VM can use any of
To meet a VM's memory requirements. Through demand paging, a VM is tricked into thinking that it has received a full physical allotment of RAM. It also thinks that its RAM is present and contiguous, starting at physical address zero.
Transparent Page Sharing
Transparent Page Sharing is a memory management trick built into ESXi that is not available on VMware Server, VMware Workstation or other hosted virtualization solutions. Transparent page sharing works as follows:
Transparent Page Sharing is highly effective
Ballooning is a memory management technique used by the VMkernel whenever memory becomes tight. The VMkernel receives reports (by VMware Tools) of any guest OS memory over allocation. When the VMkernel determines that memory is becoming scarce, it will use Ballooning to take back RAM from running VMs. It may take back RAM and immediately hand it over to memory starved VMs or it may simply 'bank' the RAM for future use.
If some VMs are memory starved while other VMs are over allocated with RAM, VMkernel will balloon away excess RAM from some VMs and re-assign it to the VM experiencing memory contention. It does this by:
Ballooning is an ongoing process of taking memory from over provisioned VMs and giving it to resource starved VMs. As VMs' memory demands change, they could easily transition from a net beneficiary to a net supplier of RAM to other VMs.
This mechanism is ongoing and is completely transparent to the Guest OS.
ESXi 6.0 uses memory compression as an alternative to paging memory to disk. Under extreme memory stress, the ESXi host will page VMs to free up RAM. But paging to disk is very slow and negatively impacts performance. Additionally, ESXi hosts are generally over provisioned with CPU thanks to newer 4, 6, 8 and 12 core CPUs. So, VMware engineers decided to use RAM compression to reduce physical paging to disk.
Memory compression uses up to 10% of a VM's memory as a compressed page cache. This means that memory (normally used for VM memory) is re-purposed to function as a page cache. This cache is dynamic and exists only when the VM must page.
When paging is required, memory pages are stolen from the VM and used for the page cache. These stolen pages are then compressed (using 2:1 compression), so that two compressed pages take up the same space as one normal page. In this way, pages that would normally have to go to disk are stored in RAM.
Under extreme memory stress, pages in the compression cache may be forced to disk. If that happens, they are removed from the compression cache, uncompressed and written to disk. The VMkernel tries to select the best candidates to page to avoid thrashing.
In the example above, 4 pages are stolen from our VM. At 2:1 compression, all 4 pages fit in 2 pages of the compression cache, leaving room for 4 more compressed pages. The same result is achieved as paging to disk but without the need for disk I/O and in 1/10th the time.
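The page arithmetic in this example can be sketched in Python. The sketch assumes one reading of the text: the stolen pages become the cache's page frames, and each frame holds `ratio` compressed pages. The names are illustrative only and do not reflect the VMkernel's actual implementation.

```python
def compression_cache_state(cache_frames: int, compressed_pages: int,
                            ratio: int = 2) -> dict:
    """Model a compression cache where each frame holds `ratio` compressed pages."""
    capacity = cache_frames * ratio                       # compressed pages the cache can hold
    if compressed_pages > capacity:
        raise ValueError("cache cannot hold that many compressed pages")
    frames_used = (compressed_pages + ratio - 1) // ratio  # round up to whole frames
    return {
        "frames_used": frames_used,
        "frames_free": cache_frames - frames_used,
        "room_for_more_pages": capacity - compressed_pages,
    }


# 4 stolen pages form the cache; the 4 compressed pages occupy 2 frames,
# leaving room for 4 more compressed pages.
state = compression_cache_state(cache_frames=4, compressed_pages=4)
```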
The VMkernel will never steal more than 65% of a VM's memory through Ballooning. Also, Ballooning can never force a VM to live with less memory than any memory reservation the VM holds.
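The two limits above (the 65% cap and the reservation floor) combine into a simple upper bound on how much memory Ballooning could reclaim from one VM. The Python sketch below is a simplified model of that rule with illustrative names, not the VMkernel's actual algorithm.

```python
def max_balloon_mb(configured_mb: int, reservation_mb: int = 0,
                   balloon_cap_pct: int = 65) -> int:
    """Upper bound on memory Ballooning may reclaim from one VM, in MB."""
    cap_mb = configured_mb * balloon_cap_pct // 100           # never more than 65% of VM memory
    above_reservation_mb = max(configured_mb - reservation_mb, 0)  # never dip into the reservation
    return min(cap_mb, above_reservation_mb)


# A 4096MB VM with a 2048MB reservation: the reservation is the tighter limit.
reclaimable = max_balloon_mb(configured_mb=4096, reservation_mb=2048)
```

With no reservation, the 65% cap is the only limit; with a large reservation, the reservation dominates.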
The final memory management tool used by the VMkernel is VMkernel Swap – the paging to disk of VM memory pages by the VMkernel. This is a memory management technique of last resort and is always an indicator of extreme memory stress.
It is a last-resort technique because the VMkernel:
VMkernel paging will never force a VM to give up a memory reservation.
Running large VMs across multiple NUMA nodes (to meet the VM's vCPU core and/or memory needs) can produce significant performance issues. For best results, try to size VMs so their vCPU core and vRAM needs can be met entirely within the resources available on a single NUMA node in your PC server.
For example, if your PC server has 2 NUMA nodes, each with one pCPU of 10 cores and 64GB of local memory, your VMs should (if possible) have no more than 10 vCPU cores and declare no more than 64GB of RAM. Declaring VMs larger than this would force them to span multiple NUMA nodes, where cross node memory references may incur significant delays.
For more information, please see the VMware Performance Best Practices for vSphere 6.0 guide.
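The sizing rule can be expressed as a simple check. This Python sketch uses the 10 core, 64GB node from the example; the `NumaNode` class and function name are hypothetical helpers for illustration, not a VMware API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumaNode:
    cores: int       # physical cores local to the node
    memory_gb: int   # RAM local to the node


def fits_in_one_node(vcpus: int, vram_gb: int, node: NumaNode) -> bool:
    """True if the VM's vCPU and vRAM needs can both be met by a single NUMA node."""
    return vcpus <= node.cores and vram_gb <= node.memory_gb


node = NumaNode(cores=10, memory_gb=64)
ok_vm = fits_in_one_node(vcpus=10, vram_gb=64, node=node)   # fits on one node
wide_vm = fits_in_one_node(vcpus=12, vram_gb=48, node=node)  # too many vCPUs, must span nodes
```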
If many VMs reside in the same datastore, then excessive disk traffic could result in bandwidth contention to that LUN. This can be resolved through per-VM LUN shares.
Normally a VM holds 1,000 shares against each LUN used by its virtual disk(s). If the storage path is idle, then VM disk I/O requests are handled on a first come, first served basis. But, if the disk path is over committed, then disk I/O requests will queue up as the storage sub-system struggles to keep up with demand.
If a LUN is backed up with storage requests (i.e.: requests wait more than 30ms for service), it is possible that a single VM (or a small number of VMs) could be responsible for most of the disk I/O traffic. If the storage controller handled requests on a first-come, first-served basis, then VMs performing a small amount of disk I/O could find their requests at the back of a very long disk I/O queue, and their performance would suffer.
To get around this problem, once disk I/O requests exceed 30ms of wait time, the storage controller handles disk I/O requests in proportion to the number of LUN shares held by a VM. In the above example, if each VM held 1,000 LUN shares then VM C would get 1/3 of all I/O bandwidth to the LUN even though it is generating relatively little disk traffic. This lets it jump to the head of the queue and receive consistently good disk I/O service.
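The share calculation is simply each VM's shares divided by the total shares held against the LUN. A minimal Python sketch follows; the VM names other than VM C are placeholders, and the function is illustrative rather than part of any VMware tool.

```python
from fractions import Fraction
from typing import Dict


def io_bandwidth_fractions(lun_shares: Dict[str, int]) -> Dict[str, Fraction]:
    """Fraction of LUN I/O bandwidth each VM is entitled to under contention."""
    total = sum(lun_shares.values())
    return {vm: Fraction(shares, total) for vm, shares in lun_shares.items()}


# Three VMs with the default 1,000 shares each: every VM, including the
# lightly loaded VM C, is entitled to 1/3 of the bandwidth.
equal = io_bandwidth_fractions({"VM A": 1000, "VM B": 1000, "VM C": 1000})

# Doubling VM C's shares raises its entitlement to 1/2.
boosted = io_bandwidth_fractions({"VM A": 1000, "VM B": 1000, "VM C": 2000})
```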
By default Storage I/O Control is turned off. This means that disk I/O scheduling is always done First-Come, First-Served (FCFS). The problem with FCFS is that one VM doing a lot of disk I/Os can induce queuing at the physical disk controller. Other VMs' disk I/Os go to the back of the queue and may have to wait a long time (maybe multiple seconds) before they are serviced.
To prevent this from happening, you can enable Storage I/O Control on the LUN. When you do this, the disk scheduler uses FCFS scheduling for all I/Os provided no I/O has sat in the I/O queue for more than 30ms. Once any I/O has waited 30ms or more for service, the disk scheduler changes to priority scheduling based on VM disk shares (change in VM > Edit Settings > Resources).
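The switch between the two scheduling modes can be modeled as a simple threshold test. The Python sketch below is a simplified illustration of the behavior described above; the function name, the mode strings and the treatment of exactly 30ms as over the threshold are assumptions, not the actual scheduler logic.

```python
from typing import List

SIOC_THRESHOLD_MS = 30  # default wait time threshold described above


def scheduling_mode(queue_wait_ms: List[float], sioc_enabled: bool,
                    threshold_ms: float = SIOC_THRESHOLD_MS) -> str:
    """Return 'shares' once any queued I/O has waited past the threshold, else 'fcfs'."""
    if sioc_enabled and any(wait >= threshold_ms for wait in queue_wait_ms):
        return "shares"   # priority scheduling based on VM disk shares
    return "fcfs"         # first-come, first-served


quiet_lun = scheduling_mode([5, 12, 29], sioc_enabled=True)    # no I/O has waited 30ms
busy_lun = scheduling_mode([5, 45], sioc_enabled=True)         # one I/O has waited 45ms
sioc_off = scheduling_mode([5, 45], sioc_enabled=False)        # SIOC off: always FCFS
```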
VMware does not recommend you adjust Storage I/O Control wait times (normally 30ms) unless you have a good reason to (normally that means you were advised to make a change by VMware support).
ESXi 5.x and 6.x can recognize and use Solid State Drives (SSDs) in a variety of ways to speed up your ESXi hosts. Options include:
Use SSDs for fast, local storage
You can generally connect SSDs to server Serial Attached SCSI (SAS) RAID controllers to provide fast local storage. SSDs are subject to failure so you should consider using SSDs in a redundant (i.e.: RAID-1, RAID-5, RAID-6, RAID-10, RAID-50, RAID-60) configuration.
Use SSDs as an ESXi Host Cache
Host Caches are local read cache volumes that are used to hold frequently read data in storage local to the ESXi host. This option is useful in an environment where VMs read and re-read the same data over and over. This happens in a VMware View or vCloud environment where Linked-Clone VMs read the same data from the same base/replica VM virtual disk. Booting a large number of VMs anchored to the same replica can cause I/O storms that can cripple a SAN. By using a Host Cache, re-read data is fetched from the local cache, significantly reducing the I/O load on the SAN.
VMkernel Swapfiles
Swapfiles are paging files used by the VMkernel whenever host memory is constrained. Paging to SSDs is much faster than paging to spinning disks.
VMware offers Overview and Advanced performance charts. Overview charts are performance charts of the most popular CPU, Memory, Network and Disk metrics. Simply click the Performance tab to see the Overview performance charts for the selected inventory item.
The vSphere Client has a very competent charting system. You can rapidly select from monitoring major sub-systems (CPU, Memory, Network, Disk, System) or drill down to very detailed resource specifics (e.g.: Memory Ballooning, CPU Ready time, etc.)
To help the visually impaired, you can click on any row in the Performance Chart Legend and the data plot associated with that row will bold (a stroke of genius!).
And, you can use the icons in the upper right hand corner to Reset, Tear Off, Print or save the chart in MS Excel format.
Performance problems are the result of one of two situations:
Over committing the ESXi host forces it to run a mix of VMs whose resource demands exceed the available resources on the box. The ESXi host has no choice but to force some VMs to wait.
Over Tuning the box is when an inexperienced (but well intentioned) administrator goes overboard with CPU and Memory Reservations, Shares and Limits to the point where the VMkernel CPU scheduler and memory manager can no longer move resources around freely because it must honor resource settings.
A perfect example of over tuning is assigning a large reservation to a VM that doesn't need it. The VMkernel has no choice but to allocate the reservation, possibly starving other VMs of needed resources while the over tuned VM wastes the allocation.
The ideal amount of CPU Ready time for a VM is 0ms. Any more than that is an indication that the VM is being forced to wait in a run queue for some number of milliseconds every second. This is time spent in line, waiting for CPU service rather than running. Large amounts of CPU Ready time (100's of ms or more per second) are indicative of excessive CPU stress that is directly forcing this VM (and its users) to wait.
Operating system (e.g.: Windows, Linux, etc.) performance monitoring tools were programmed under the assumption that the operating system is the exclusive owner of all hardware resources, an assumption that is no longer valid with virtualization.
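To put a CPU Ready figure in perspective, it helps to convert milliseconds of wait per second into a percentage of time spent waiting. The Python sketch below assumes the ready time is already expressed per one-second interval, as in the text; performance charts may report it over a longer sampling interval, in which case pass that interval instead. The function name is illustrative.

```python
def ready_pct(ready_ms: float, interval_ms: float = 1000.0) -> float:
    """Share of the interval a VM spent waiting in a run queue, in percent."""
    return ready_ms * 100 / interval_ms


ideal = ready_pct(0)        # 0ms ready: the VM never waited
stressed = ready_pct(200)   # 200ms ready per second: waiting one fifth of the time
```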
For example, Windows tracks CPU Busy time in Task Manager by subtracting time spent in Windows' Idle task from total available time. The result is expressed as a percent and is assumed to be the amount of time Windows spent servicing tasks.
With virtualization, it is possible that the VM could receive no CPU time (due to CPU over commit). In that case, the Windows Idle task would clock no time and Windows would incorrectly assume that it was 100% busy! This is especially troubling for the guest OS administrator because, not only is Windows reporting 100% CPU busy, but CPU performance may be poor (because the VMkernel is forcing the VM to wait).
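The misleading result falls straight out of the subtraction Task Manager performs. The Python sketch below is a deliberately simplified model of that calculation with hypothetical function and parameter names; real guest OS accounting is more involved.

```python
def guest_reported_busy_pct(interval_ms: float, scheduled_ms: float,
                            work_ms: float) -> float:
    """CPU busy % as the guest sees it: total time minus time clocked by the Idle task."""
    idle_ms = scheduled_ms - work_ms          # the Idle task only runs while the VM holds a CPU
    return (interval_ms - idle_ms) * 100 / interval_ms


def actual_busy_pct(interval_ms: float, work_ms: float) -> float:
    """CPU busy % based on the work the VM really performed."""
    return work_ms * 100 / interval_ms


# Uncontended host: the VM holds the CPU all second and works for 250ms.
normal = guest_reported_busy_pct(interval_ms=1000, scheduled_ms=1000, work_ms=250)

# Over committed host: the VM never gets the CPU, so the Idle task clocks 0ms.
starved = guest_reported_busy_pct(interval_ms=1000, scheduled_ms=0, work_ms=0)
starved_actual = actual_busy_pct(interval_ms=1000, work_ms=0)
```

In the starved case the guest reports 100% busy even though it performed no work at all, which is why host-level metrics are the more reliable guide.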
When diagnosing performance issues, it is usually best to use the vSphere Client performance analysis tools rather than relying on Guest OS performance analysis tools.
Memory stress shows up first as Memory Ballooning at the VM level. If memory becomes even more scarce, it will show up as VMkernel paging. These two metrics can be monitored by the vSphere Client.
Like CPU, it is best not to put too much stock in guest OS memory performance counters. While they should be accurate within reason, they do not tell the whole picture.
A sign that a VM may be under sized (declared memory size is too small) or experiencing VMkernel paging is that Windows Task Manager reports an excessive number of Page Faults or dramatic ongoing increases in Page Fault Delta values.
Page faults are a standard procedure for all operating systems. When an OS looks for a file or executable to put into memory, it searches through its own used pages of memory first and retrieves the file from RAM if possible. This is known as a soft page fault, and it is fast.
Most OS page fault reporting refers to hard page faults where the OS must go to disk to get the requested file (data or executable).
When reviewing Page Fault values, don't look at the total number of faults. Rather, look for large dramatic increases in the number of page faults across tasks as displayed by the PF Delta column. If you see numbers in the PF Delta column changing dramatically (up or down), then chances are good that the VM is memory starved.
You can easily review host or datacenter overall CPU and memory performance by clicking the Resource Pool/host/datacenter and then clicking the Virtual Machines tab. Pay attention to the Host CPU, Host Memory – MB and Guest Memory % columns. You can click on any of these column headers to sort by that value. You should also review the Resource Allocations tab to see how the VMkernel is handing out resources to VMs and Resource Pools on the host or cluster. The Resource Allocations tab is also useful because it will show what percentage of shares a VM holds relative to its peers (and consequently what % claim it has on the scarce resource).
VMware vSphere 6.0 is the platform businesses depend on to deploy, manage and run their virtualized Windows and Linux workloads.
In this course you will learn how to effectively manage host CPU/Memory resources with DRS clusters, to minimize VM downtime caused by ESXi host failures with HA clusters, to eliminate unplanned VM downtime with Fault Tolerance, to patch and update ESXi hosts with VUM and how to maximize ESXi host and VM performance.
Learn DRS & HA Clusters, Fault Tolerance, VMware Update Manager and Performance
This course covers five major topics that all vSphere 6 vCenter administrators must know:
The skills you will acquire in this course will help make you a more effective vSphere 6 administrator.
Added Bonus! This course is 100% downloadable. Take it with you and learn on your schedule.