VMware vSphere 6.0 Part 4 - Clusters, Patching, Performance

Learn about load balanced DRS clusters, High Availability failure recovery clusters, Fault Tolerance, VM/Host performance
4.5 (35 ratings)
733 students enrolled
  • Lectures 185
  • Length 8 hours
  • Skill Level Intermediate Level
  • Languages English
  • Includes Lifetime access
    30 day money back guarantee!
    Available on iOS and Android
    Certificate of Completion

About This Course

Published 2/2016 English

Course Description

VMware vSphere 6.0 is the platform businesses depend on to deploy, manage and run their virtualized Windows and Linux workloads.

In this course you will learn how to effectively manage host CPU/Memory resources with DRS clusters, to minimize VM downtime caused by ESXi host failures with HA clusters, to eliminate unplanned VM downtime with Fault Tolerance, to patch and update ESXi hosts with VUM and how to maximize ESXi host and VM performance.

Learn DRS & HA Clusters, Fault Tolerance, VMware Update Manager and Performance

This course covers five major topics that all vSphere 6 vCenter administrators must know:

  • We start with a thorough presentation of VMware Distributed Resource Scheduling (DRS) clusters. DRS clusters dynamically balance VM CPU and memory demands by automatically VMotion migrating VMs experiencing CPU/Memory resource stress. We also look at Enhanced VMotion Compatibility (EVC) - a feature that lets you safely mix newer and older ESXi hosts in a DRS cluster
  • Next, we will learn how to minimize VM downtime due to unplanned ESXi host failures by implementing High Availability clusters (HA). HA clusters monitor ESXi host health, detect ESXi host failures and re-assign VM ownership from failed ESXi hosts to healthy ESXi host peers. We will also learn about key HA policies like All Paths Down and Permanent Device Loss handling - new to vSphere 6
  • We move on to look at how to completely eliminate unplanned VM downtime (even if an ESXi host fails) through VMware Fault Tolerance (FT). FT hot replicates a running VM to a peer ESXi host. If the ESXi host running the primary copy of a FT protected VM fails, FT automatically places the replicated copy into service. We'll see how to configure, run and test FT protected VMs.
  • Next we will see how to use VMware Update Manager to safely and efficiently patch and update ESXi hosts. We will learn about Patch Baselines (patch sets), how to attach Baselines to an ESXi host or cluster, how to check for patch compliance (all needed patches present on a host) and how to patch ESXi hosts.
  • Finally, we will take a close look at ESXi host and VM performance. We will see what the VMkernel does to efficiently utilize physical CPU and how we can right size vCPU in VMs. We will see the five memory management techniques used by the VMkernel to efficiently manage memory and how to turn on Transparent Page Sharing to maximize memory use. We will see how to configure Storage I/O Control and how to identify and fix host and VM performance bottlenecks.

The skills you will acquire in this course will help make you a more effective vSphere 6 administrator.

What are the requirements?

  • We assume that you are familiar with ESXi host and vCenter Server management, and that you can create and use VMs, connect to shared storage and perform day to day management tasks in your vSphere environment
  • One way to acquire these skills is to take our VMware vSphere 6.0 Part 1, Part 2 and/or Part 3 classes on Udemy

What am I going to get from this course?

  • Understand ESXi host requirements for DRS/HA clusters
  • Create and edit Distributed Resource Scheduling (DRS) load balanced clusters
  • Understand and adjust DRS automation level settings
  • Understand and apply DRS placement and migration recommendations
  • Use Enhanced VMotion Compatibility (EVC) to safely grow existing DRS clusters
  • Deliver high VM service availability using VMware High Availability clusters
  • Understand and configure HA cluster settings such as Admission Control, All Paths Down and Permanent Device Loss policies
  • Design HA clusters for continued management and correct behavior using network redundancy and Heartbeat datastore redundancy
  • Configure DRS and HA clusters according to VMware's best practices for clusters
  • Understand the features and capabilities of vSphere Fault Tolerance
  • Configure ESXi hosts for FT network logging
  • Enable Fault Tolerance protection on individual VMs
  • Understand the purpose and use of VMware Update Manager
  • Configure VUM for correct behavior according to your vSphere environment
  • Create and update ESXi host patch baselines
  • Apply baselines to ESXi hosts or clusters and check for compliance
  • Patch and update non-compliant ESXi hosts
  • Understand ESXi use of physical CPU resources
  • Understand the five techniques ESXi uses to manage memory efficiently
  • Use Overview and Advanced performance charts to monitor resource use
  • Identify and correct common performance issues

Who is the target audience?

  • This course is intended for vSphere Administrators who wish to add DRS cluster, HA cluster, Fault Tolerance or VMware Update Manager capabilities to their existing vSphere environment
  • This course will also benefit vSphere Administrators who want to learn how to improve the scalability and performance of their vSphere environments


Section 1: Introduction
VMware vSphere 6
Course Goals and Objectives
Course Goals and Objectives (continued)
Presented By
New Skills
Should You Take This Course?
Let's Get Started!
Section 2: VMware Distributed Resource Scheduling (DRS) Load Balanced Clusters
VMware vSphere Distributed Resource Scheduling Clusters

vCenter can organize two or more ESXi hosts into a load balancing cluster called a Distributed Resource Scheduling cluster. DRS clusters dynamically monitor ESXi host load and VM resource demands. Because VM resource demands change over time, a VM that was previously receiving all of the resources it needed could become resource starved due to the changing resource demands of other VMs on the same host.

DRS looks for this very situation and will take action by either recommending VMotion migrations or initiating VMotion migrations to rebalance VMs across the cluster. In this way, your DRS cluster always runs VMs in the most resource efficient manner. DRS delivers the following benefits:

  • Resource contention is immediately addressed
  • ESXi hosts are always approximately equally loaded
  • As VM resource demands change, DRS responds with re-balancing decisions
  • All VMs receive the same resource availability (subject to resource settings). This means that as VM resource demands increase over time, the load is rebalanced so that all VMs receive the same level of service (subject to resource tuning)
  • New ESXi servers can be provisioned based on demonstrable resource demands across all systems rather than the need of one or a few VMs
  • Adding a new ESXi host to a DRS cluster causes the cluster to rebalance VMs across the new host thereby reducing resource stress across all VMs
DRS Cluster Maximums for vSphere 6.0

DRS clusters are vCenter objects that you can create only in the Hosts and Clusters view. To create a new cluster, select either a datacenter or a folder, right click and select New Cluster. This creates the new inventory object and lets you set the cluster's name. You would then right click the new cluster and select Edit Settings... to enable and configure DRS.

A DRS cluster needs one or more ESXi hosts in order to function. You can add ESXi hosts to a DRS cluster by dragging an ESXi host onto the DRS cluster inventory object. You can add a new ESXi host to a DRS cluster at any time. Upon receiving a new host, the cluster will reassess VM placements and migrate VMs onto the new host to even out resource consumption across all hosts.

It is the ability to hot-add ESXi hosts to DRS clusters that gives IT managers the ability to provision new servers in response to demonstrable increases on PC server work loads. In the past IT managers had to provision new PC servers for every new workload (OS + application). Now IT managers can simply create new VMs on a DRS cluster and let the cluster load balance the VM population. If the cluster's overall resource utilization rate climbs too high (say over 80%), you can simply add a new ESXi host to the cluster. DRS will re-balance across all hosts so overall resource load goes down and VM performance goes up.
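
The effect of adding a host can be illustrated with simple arithmetic. The sketch below is only an illustration: the capacities and demand are made-up numbers, it looks at CPU only, and the 80% figure is the example threshold mentioned above rather than a DRS setting.

```python
def cluster_utilization(demand_mhz, host_capacities_mhz):
    """Return overall CPU utilization of a cluster as a fraction of total capacity."""
    return demand_mhz / sum(host_capacities_mhz)


# Hypothetical cluster: three identical hosts carrying 26,000 MHz of VM demand.
hosts = [10_000, 10_000, 10_000]
demand = 26_000

before = cluster_utilization(demand, hosts)
after = cluster_utilization(demand, hosts + [10_000])  # hot-add a fourth host

print(f"before: {before:.0%}, after adding a host: {after:.0%}")
if before > 0.80:
    print("utilization was above the 80% example threshold; adding a host helps")
```

The total demand does not change; it is simply spread over more capacity, which is why every VM in the cluster benefits.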

When you add a host to a DRS cluster – everybody benefits!


Initial Placement is the act of selecting a suitable ESXi host for VM placement and power on. When a user powers on a VM, DRS will:

  • Query the current resource load across ESXi hosts
  • Look at the resource demands of the VM (# of vCPU cores, RAM declaration)
  • Cold migrate (reassign VM ownership) to the ESXi host that has the most free resources
  • Tell that ESXi host to boot the VM

Because DRS cluster resident VMs should all be VMotion compatible, any ESXi host can act as a power on host for the VM. By finding the least busy ESXi host at VM power on time, DRS attempts to place the VM on the host that will provide the VM with the best overall access to resources.
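
The Initial Placement steps above can be sketched in a few lines of Python. This is a simplified model, not how DRS is implemented: the Host class, the field names and the ranking rule (most free memory first, then most free CPU) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    free_cpu_mhz: int
    free_mem_mb: int


def pick_power_on_host(hosts, vm_cpu_mhz, vm_mem_mb):
    """Return the host with the most free resources that can fit the VM, or None."""
    candidates = [
        h for h in hosts
        if h.free_cpu_mhz >= vm_cpu_mhz and h.free_mem_mb >= vm_mem_mb
    ]
    if not candidates:
        return None
    # Assumption: rank by free memory, then free CPU. Real DRS scoring is more involved.
    return max(candidates, key=lambda h: (h.free_mem_mb, h.free_cpu_mhz))


cluster = [
    Host("esxi01", free_cpu_mhz=4000, free_mem_mb=8192),
    Host("esxi02", free_cpu_mhz=9000, free_mem_mb=32768),
    Host("esxi03", free_cpu_mhz=12000, free_mem_mb=2048),
]
target = pick_power_on_host(cluster, vm_cpu_mhz=2000, vm_mem_mb=4096)
print(target.name if target else "no host can fit this VM")
```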

Dynamic Balancing starts monitoring ESXi host load and VM resource demands. If DRS determines that a VM is being resource starved, it will look to see if another ESXi host has free resources of the type the VM needs (CPU, RAM). If yes, then DRS will either migrate or recommend the migration of the VM over to that host. In this way, DRS can respond to changes in VM resource demands as VMs run without the need for human intervention.


DRS has three modes of operation...

In Manual mode, DRS makes Initial Placement and Migration recommendations only. DRS will not start a VMotion migration on its own; an administrator must apply each recommendation.

DRS clusters in Partially Automated mode will make Initial Placement decisions at VM boot time thereby freeing users from having to deal with this task. But, DRS will only make VMotion migration recommendations when it detects signs of resource stress.

DRS clusters in Fully Automated mode make both Initial Placement and Dynamic Balancing decisions automatically. In this way, they address both power on resource contention and running VM resource contention. Fully Automated DRS clusters have additional tunables that let the ESXi administrator fine tune DRS behavior.
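
The three modes differ only in which decisions DRS applies by itself and which it leaves as recommendations. A minimal sketch of that decision table follows; the enum and function names are invented for the example and are not part of any VMware API.

```python
from enum import Enum


class Mode(Enum):
    MANUAL = "manual"
    PARTIALLY_AUTOMATED = "partially automated"
    FULLY_AUTOMATED = "fully automated"


class Action(Enum):
    INITIAL_PLACEMENT = "initial placement"
    MIGRATION = "migration"


def drs_behavior(mode, action):
    """Return 'apply' if DRS acts on its own, 'recommend' if it waits for approval."""
    if mode is Mode.FULLY_AUTOMATED:
        return "apply"
    if mode is Mode.PARTIALLY_AUTOMATED and action is Action.INITIAL_PLACEMENT:
        return "apply"
    return "recommend"


for mode in Mode:
    for action in Action:
        print(f"{mode.value:20} {action.value:18} -> {drs_behavior(mode, action)}")
```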


If you set your DRS cluster to fully automated, you must further tune the cluster to trade off load balancing vs. VMotion overhead.

VMotion takes CPU and memory resources to complete – so there is a resource penalty to pay when a VM is hot migrated to a new host. Aggressive DRS clusters move VMs when the potential VM performance improvement is just slightly more than the cost of the VMotion. This could result in unnecessary VMotions if the VM is experiencing a very short spike in resource demands.

As a DRS cluster administrator, your job is to find the setting that provides good overall load balancing while minimizing the number of VMotions. Typically this setting is either in the middle (default) or just to the right or left of the middle.

Experiment to find the best setting for your organization.


A DRS cluster in Fully Automated mode can be tuned for resource contention sensitivity. This allows ESXi administrators to set the level of resource contention that must be present before DRS will intervene.

The relative benefit a VM will receive through VMotion migration is represented in DRS by one to five stars. The more stars a recommendation receives, the more DRS believes the VM will benefit if a recommendation is accepted.

DRS recommendations are based on two factors:

  • How much resource starvation the VM is experiencing and
  • How long the VM has been starved for resources

In a nutshell, the length and severity of resource starvation, along with the target ESXi host's ability to satisfy the VM's resource needs, determine the number of stars a recommendation receives.

A 5-star recommendation is a special case. DRS includes VM affinity and anti-affinity settings (see slide 16). A 5-star recommendation is only made when a VM's current ESXi placement is in violation of any (anti-)affinity rules.

Adding ESXi Hosts to a DRS Cluster
DRS Initial Power On Placement Guidance

CPU compatibility has always been an issue with VMotion. Here's the problem:

When a VM powers on, it is serviced by a physical CPU core. At power on time, the VM probes the CPU for special capabilities (e.g., SSE, SSE2, SSE3 or SSE4.1 instruction support, 64-bit support, Virtualization Assist technology, etc.). Once the VM learns about the special capabilities of the CPU running it, it never re-probes the CPU. This can present a problem for VMotion: if a VM moves to a host that lacks some of the capabilities the VM expects, any attempt to use those capabilities will result in application or OS failures (e.g., illegal instruction faults if a VM tries to execute an SSE3 instruction on a host that only supports SSE2).

VMware partially solves this problem with Enhanced VMotion Compatibility mode. With this DRS feature turned on, ESXi hosts will mask away features that are not common to all the physical CPUs in the cluster. This creates a situation where VMs see only compatible CPU features – even if the physical CPUs are not the same.
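
Conceptually, masking away features that are not common to all hosts is a set intersection. The sketch below shows only that idea, with made-up host names and feature lists. In practice EVC works from predefined per-family baselines that you select, rather than computing an arbitrary intersection.

```python
def evc_visible_features(host_features):
    """Return the CPU features that every host in the cluster supports."""
    feature_sets = [set(features) for features in host_features.values()]
    return set.intersection(*feature_sets) if feature_sets else set()


# Hypothetical cluster with three CPU generations.
cluster = {
    "esxi01": {"sse2", "sse3", "ssse3", "sse4.1"},
    "esxi02": {"sse2", "sse3", "ssse3"},
    "esxi03": {"sse2", "sse3"},
}

visible = evc_visible_features(cluster)
print(sorted(visible))

# Features that the newest host has but VMs will not see.
print(sorted(cluster["esxi01"] - visible))
```

A VM that only ever sees the common feature set can move to any host in the cluster without hitting an unsupported instruction.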

EVC works within processor families – and to a limited extent, across processor families. Before turning on EVC, check to make sure that all of the CPUs in your physical hosts are EVC compatible and determine the highest common EVC processor family. You can use the VMware CPU identification tool to assist (See VMotion chapter).

To use EVC, select your CPU maker, then CPU family.


Intel and AMD improve their CPU products regularly. They add features, new instructions, new hardware capabilities, etc. - some of which are visible to the Guest OS. Any CPU mismatch could result in VMotion failures or VM failures due to CPU compatibility.

If you have Intel Xeon CPUs, EVC will mask away the differences within Xeon product families. That would allow you to mix different versions or steppings of these Xeon families:

1. Xeon Core i7 CPUs
2. Xeon 45nm Core 2 CPUs
3. Xeon Core 2 CPUs
4. 32nm Core i7 CPUs
5. Sandy Bridge CPUs
6. Ivy Bridge CPUs
7. Haswell CPUs

Click the Xeon processor family that represents your CPUs. Review the information provided and ensure that all of your CPUs match before turning on EVC.


EVC abstracts CPUs within a CPU family down to a common set of features and functions. This makes all CPUs VMotion compatible (because VMs all see the same CPU feature set). This lets you:

  • Mix hosts with new and older CPUs from the same CPU family
  • Mix hosts from different manufacturers as long as the CPUs meet EVC requirements
  • Mix hosts with different core/socket counts
  • Extend the life (capacity) of your cluster rather than buying new hosts

EVC places strict limits on hosts when they attempt to join a cluster. EVC performs a host CPU compatibility check and will refuse to allow any host to join the cluster if the new host's CPUs are older than the CPU family selected when EVC was enabled.
EVC was introduced in ESXi 3.5 Update 2 and is supported for this release of ESXi and all newer releases of ESXi (e.g.: ESXi 3.5 Update 3, Update 4, ESXi 4.x, ESXi 5.x and ESXi 6.0).

EVC Validation Successful
EVC Validation Failure
Mixing Server Generations within an EVC Enabled DRS Cluster

To set (Anti-)Affinity rules, right click the DRS cluster → Manage tab → Settings → VM/Host Rules

Some VMs may perform better whenever they reside on the same ESXi host. An example might be a SQL database VM and an Accounting VM that uses SQL for record storage/retrieval. If these VMs reside on different ESXi hosts then network SQL requests must flow through physical networking at 1 Gb/s. If these VMs reside on the same ESXi host then they can exchange packets at virtual network speed (which should be faster). In cases like this, you would create an Affinity rule to tell DRS to keep these VMs on the same ESXi host.

For many applications, it makes sense to create two or more VMs that perform exactly the same function. That way, you can take one VM down for maintenance and the service is still available. Applications that could benefit from this approach include E-mail servers, Web servers, DNS servers, DHCP servers, etc. that perform the same function for the same clients.

If DRS were to place two VMs that provide the same service on the same ESXi host, problems could arise because the VMs compete with each other for the same resources. If the ESXi host were to go down, both VMs would fail. In this case, create an Anti-Affinity rule and DRS will place these VMs on separate ESXi hosts.
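
To make the two rule types concrete, here is a small checker that reports which rules a given VM placement violates. It is a sketch for illustration only: the data layout and the function name are assumptions, and DRS itself evaluates rules internally.

```python
def rule_violations(placement, keep_together=(), keep_apart=()):
    """Return a list of (rule_type, vm_group) tuples for every violated rule.

    placement maps VM name -> ESXi host name.
    """
    violations = []
    for group in keep_together:
        # Affinity: all VMs in the group must share one host.
        if len({placement[vm] for vm in group}) > 1:
            violations.append(("affinity", tuple(group)))
    for group in keep_apart:
        # Anti-affinity: no two VMs in the group may share a host.
        hosts = [placement[vm] for vm in group]
        if len(set(hosts)) < len(hosts):
            violations.append(("anti-affinity", tuple(group)))
    return violations


placement = {"sql01": "esxi01", "acct01": "esxi02", "web01": "esxi01", "web02": "esxi01"}
print(rule_violations(
    placement,
    keep_together=[("sql01", "acct01")],
    keep_apart=[("web01", "web02")],
))
```

With the placement shown, both rules are violated: the SQL and Accounting VMs are on different hosts, and the two web servers share a host.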


Navigation: Click your DRS cluster → Manage → Settings → VM Overrides
Click Green +, select VMs you wish to override → OK
For each VM in the roster, click the VM and select a new automation level

A DRS cluster is configured with a default automation level (either Manual, Partially Automated or Fully Automated). With no further action, this default automation level would be applied to all VMs on the cluster. However, ESXi administrators can override the default DRS cluster automation level on a VM per VM basis.

For example: suppose your organization ran strict change management procedures only on critical production VMs. An ESXi administrator could set the default DRS cluster automation level to Fully Automated and then use per-VM overrides to downgrade critical production VMs to Partially Automated. That way, a person would have to approve (and log) any VMotion migrations of critical production VMs (only).

For another example, suppose a small IT department had three ESXi hosts: two for Production and one for Test. They wanted to give their production VMs the best possible performance but did not want their Test VMs to leave the Test ESXi host. They could create a 3 host DRS cluster and set their production VMs to Fully Automated. They could then set their Test VMs' automation level to Disabled. As a result, Test VMs would never leave the Test ESXi host. But production VMs could be migrated to/from the Test ESXi host to take advantage of the resources available on all ESXi hosts.
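
The override logic amounts to a lookup with a fallback to the cluster default. The following sketch models the Production/Test example; the VM names and the string labels for the automation levels are illustrative assumptions.

```python
def effective_automation_level(vm, cluster_default, overrides):
    """Return the per-VM override if one exists, otherwise the cluster default."""
    return overrides.get(vm, cluster_default)


def may_migrate_automatically(level):
    """Only Fully Automated VMs are moved by DRS without human approval."""
    return level == "fully automated"


cluster_default = "fully automated"
overrides = {"test01": "disabled", "test02": "disabled"}

for vm in ("prod01", "prod02", "test01", "test02"):
    level = effective_automation_level(vm, cluster_default, overrides)
    print(f"{vm}: {level}, auto-migrate={may_migrate_automatically(level)}")
```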

Distributed Resource Scheduling Clusters
CPU and RAM Host Utilization

The Resource Allocation tab gives you a point in time view of your VMs and the resources they are receiving (as well as the host they are on and the shares they hold, reservations, limits, etc.).

This view displays values that are editable. If you see a VM whose resources, etc. don't match your needs simply click the value you wish to change and edit it directly. No need to launch the Cluster Properties window to make small changes.


The DRS tab displays the status of your cluster along with additional functions...

The Run DRS link lets you tell DRS to refresh its recommendations – now!

The Apply Recommendations button is how you give DRS permission to make the changes it suggests

The Edit... link lets you change the cluster properties directly from this screen

The Faults button lets you review past DRS actions that failed for any reason

The History button lets you review past DRS actions taken to keep the cluster in balance

DRS History

DRS clusters honor all resource settings on individual VMs and also on Resource Pools including CPU/Memory reservations, shares and limits. So, VMs will not have their resource entitlements changed simply because they are now managed by a DRS cluster.

Because DRS relies on VMotion, DRS can only load balance VMs that are VMotion compatible. For best results, you should carefully plan your ESXi host deployment and configuration so that all of your ESXi hosts are VMotion compatible. Then, you should take care to configure your VMs so that they use only common storage, networking and removable media resources so that you do not inadvertently lock a VM to an ESXi host.

DRS can run VMs that are not VMotion compatible – but it cannot move them. So, if an ESXi host is running a mix of VMotion and non-VMotion compatible VMs, DRS is limited to moving only the VMotion compatible VMs. This may impair its ability to fully load balance across the cluster.


If your organization is new to VMotion and DRS, you may meet resistance to the idea of automatically hot migrating VMs. If this happens to you, you can ease your organization into VMotion and DRS as follows:

Create a DRS cluster and set the automation level to Manual

  • Let VM owners review and accept Initial Placement recommendations
  • When VM owners get tired of always accepting DRS Initial Placement recommendations, increase the automation level to Partially Automated

Over time, VMs will experience resource starvation on their ESXi host

  • Have VM owners address their own performance issues by asking them to review and accept DRS Migration recommendations
  • Once VM owners get tired of always accepting VMotion migration recommendations, increase the automation level to Fully Automated

Finally, deal with any per-VM concerns by using DRS Rules to override the DRS cluster default for that VM.


DRS will only violate affinity rules when it has no choice. The most likely scenario is where a DRS cluster is also an HA (fail over) cluster. For example, if you have two ESXi hosts in a combined DRS+HA cluster and you have two VMs in an anti-affinity rule (keep apart) for high service availability, the following situation could lead to a 5-star recommendation:

  1. The two VMs are running on separate ESXi hosts
  2. One host fails completely, causing HA to restart the VMs that died with that host on the surviving ESXi host
  3. The failed VM in the anti-affinity relationship is started on the surviving ESXi host
  4. Both VMs (in the anti-affinity relationship) are now running but in violation of the anti-affinity rule intended to keep them on separate ESXi servers
  5. The failed ESXi host is repaired and rebooted
  6. When it comes up, it rejoins the DRS+HA cluster
  7. DRS will generate a 5-star recommendation on one of the 2 anti-affinity VMs so that it can be migrated over to the restored ESXi host
DRS Review and Questions
Section 3: VMware High Availability (HA) Clusters
VMware High Availability Clusters

High Availability clusters solve the problem of rapid VM placement and recovery if a VM should fail because the host it was running on failed (any of VMkernel panic, hardware failure, non-redundant storage failure, etc.)

HA minimizes VM down time by:

  • Actively monitoring all ESXi hosts in an HA cluster
  • Immediately detecting the failure of an ESXi host
  • Re-assigning ownership of VMs that died when an ESXi host in an HA cluster dies
  • Instructing the new ESXi host to boot the VM

The overall objective for HA is to have VMs back in service in less than 2 minutes from the time an ESXi host fails. Users will still need to re-establish any authenticated sessions against the recovered VM... but (hopefully) the VM down time experienced will be no more than a nuisance.

High Availability Lab - Part 1

For HA to place and power on VMs that fail when an ESXi host fails, those VMs must use only resources common to both the failed and the new ESXi host. This means the VM must use common:

  • Networks including any production, test, NAS or IP storage networks
  • Datastores for both virtual disk storage and/or ISO/floppy image storage
  • NAS resources

VMs can meet HA compatibility without meeting VMotion compatibility. Specifically, because the VM is cold-booted rather than hot migrated, there is no need to maintain ESXi CPU compatibility between HA cluster peers. When the VM is booted on the target ESXi host, it probes the CPU properties again, so the CPUs of the failed ESXi host and the target ESXi host do not need to match exactly.
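
The HA compatibility requirement can be expressed as a subset test: a host is a valid restart target if it offers every network and datastore the VM uses. The sketch below uses invented names and a plain dictionary layout, and deliberately ignores CPU features since HA restarts are cold boots.

```python
def ha_restart_candidates(vm_needs, hosts):
    """Return names of hosts that offer every network and datastore the VM uses."""
    return sorted(
        name for name, resources in hosts.items()
        if vm_needs["networks"] <= resources["networks"]
        and vm_needs["datastores"] <= resources["datastores"]
    )


vm = {"networks": {"Production"}, "datastores": {"SAN-LUN1", "ISO-NFS"}}
hosts = {
    "esxi02": {"networks": {"Production", "Test"}, "datastores": {"SAN-LUN1", "ISO-NFS"}},
    "esxi03": {"networks": {"Production"}, "datastores": {"SAN-LUN1"}},  # no ISO-NFS
}
print(ha_restart_candidates(vm, hosts))
```

Here only esxi02 qualifies, because esxi03 cannot reach the datastore holding the VM's ISO images.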

HA is enabled on a new or existing cluster simply by checking the Enable VMware HA check box. Once you complete setting the HA cluster properties, vCenter will connect to each ESXi host in the cluster and reconfigure it to act as a peer in an HA cluster. It only takes a few minutes for this process to complete.
ESXi Host and Hardware Monitoring
Virtual Machine Monitoring
Review and Set HA Failure Policies
ESXi Host Failure vs. ESXi Host Isolation
Permanent Device Loss (PDL)
All Paths Down (APD)
VM Monitoring Sensitivity
Admission Control Policy

VMware HA clusters reserve ESXi host CPU and memory resources to ensure that VMs that fail when an ESXi host fails can be placed and restarted on a surviving ESXi host. There are two factors that determine how much ESXi host resources are held back:

  1. The number of ESXi host failures the cluster can tolerate
  2. The total number of ESXi hosts in the cluster

In the above examples, three HA clusters are illustrated. In each case the cluster is configured to tolerate a maximum of one ESXi host failure.

In a 2-node cluster, HA must hold back 50% of all ESXi host CPU and memory. That way, it can guarantee that it can place and restart all VMs from a single failed ESXi host. Consequently each host can never be more than 50% busy.

In a 3-node cluster, HA will hold back 33% of all ESXi host CPU and memory. If an ESXi host fails, half of the VMs from the failed host will be placed on each of the surviving ESXi hosts. A healthy 3-node cluster can be up to 66% busy.

In a 4-node cluster, HA only needs to hold back 25% of each host's resources. And, in a 5-node cluster only 20% of each host's resources are kept in reserve.
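
The percentages above follow from dividing the number of host failures to tolerate by the number of hosts. The sketch below reproduces that arithmetic under the simplifying assumption that all hosts are identical; the function name is made up for the example.

```python
def reserved_fraction(hosts_in_cluster, host_failures_tolerated=1):
    """Fraction of each host's CPU and memory held back for failover.

    Assumes identical hosts, so tolerating f failures out of n hosts
    means reserving f/n of the cluster's capacity.
    """
    if not 0 < host_failures_tolerated < hosts_in_cluster:
        raise ValueError("failures tolerated must be between 1 and hosts - 1")
    return host_failures_tolerated / hosts_in_cluster


for n in (2, 3, 4, 5):
    print(f"{n}-node cluster: reserve {reserved_fraction(n):.0%} per host")
# prints 50%, 33%, 25% and 20%, matching the figures above
```

The reserve shrinks as the cluster grows, so larger clusters can run each host at a higher utilization while still tolerating a host failure.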

Admission Control Settings
Admission Control - Slot Size
ESXi Host Failure Options
HA Cluster Network and Datastore Heartbeating

vCenter is the central management console for HA clusters. vCenter is responsible for creating and monitoring HA clusters and for coping with host failures and recoveries.

The fundamental assumption for HA clusters is that any host can fail at any time. And, because vCenter could be running as a VM, it is possible that the vCenter VM could fail when a host fails. This is a challenge because, if the cluster depended on vCenter and vCenter has failed, then the cluster could not recover from an ESXi host failure. To circumvent this problem, vCenter publishes the cluster configuration to every ESXi host in the HA cluster. That way, each ESXi host knows:

  • All of its ESXi peers in the HA cluster
  • Which VMs are assigned to which ESXi host
  • Specific cluster properties (like VM restart priority)

In the event of an ESXi host failure that also causes the vCenter VM to fail, the surviving ESXi hosts would cooperate to distribute the failed VMs (including the vCenter VM). Once the VMs were distributed, each host would begin booting its newly assigned VMs according to individual VM restart priority. As a result, the vCenter VM would be placed and booted quickly thereby bringing vCenter back into service.

Best Practice
If your vCenter server is a VM, it is a best practice to give it high restart priority.

HA Datastore Heartbeats
Configure Datastore Heartbeat
HA VM Restart Priority

In the Virtual Machine Options page, you are presented with a roster of all of the defined VMs on the cluster. You can click the VM row under the Restart Priority column header to change the restart priority for a VM.

High Priority – AD, DNS, DHCP, DC, vCenter and other critical infrastructure VMs
Medium Priority – Critical application servers like SQL, E-mail, Business applications, file shares, etc.
Low Priority - Test, Development, QA, training, learning, experimental and other non critical workloads
Disabled - Any VMs not required during periods of reduced resource availability (select from Low Priority examples)


To ensure continued operation of all of your virtual infrastructure, it is important that you assign VM restart priority with care. By default, all VMs are placed and restarted at the HA cluster's default restart priority, which is set on the HA cluster's main settings page (right click cluster > click Edit Settings > click HA > click VMware HA). The default for the HA cluster Restart Priority setting is Medium.

Note HA will not power off VMs on healthy ESXi hosts to free up resources for High or Medium priority VMs from failed ESXi hosts. Whether it is reasonable to do this or not would be up to the local administrator.

Best Practice
Set your cluster default restart priority to low. Then individually set your critical infrastructure VMs (DNS, DHCP, AD, DCs, etc.) to high and your critical business VMs to medium.
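
The best practice above can be modeled as a sort: VMs without an individual setting inherit the cluster default, Disabled VMs are skipped, and the rest restart in priority order. This is a simplified illustration with invented VM names, not a description of HA internals.

```python
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def restart_order(vms, cluster_default="low"):
    """Return VM names in restart order.

    vms maps VM name -> "high", "medium", "low", "disabled",
    or None to inherit the cluster default.
    """
    effective = {name: (prio or cluster_default) for name, prio in vms.items()}
    eligible = [name for name, prio in effective.items() if prio != "disabled"]
    # Sort by priority first, then by name to keep the order deterministic.
    return sorted(eligible, key=lambda name: (PRIORITY_ORDER[effective[name]], name))


vms = {
    "vcenter": "high",
    "dc01": "high",
    "sql01": "medium",
    "test01": None,        # inherits the cluster default (low)
    "lab01": "disabled",   # never restarted by HA
}
print(restart_order(vms))
```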

Adjust Individual VM Restart Priority
HA Cluster Overview
vSphere HA Cluster Summary
Impact of ESXi Host Network Isolation
An ESXi host can determine that it is isolated if it loses ESXi Console network connectivity. When that happens, the ESXi host checks the link state of the physical NIC that uplinks the ESXi Console virtual NIC with the physical switch. If the physical NIC link is down, the ESXi host knows that it is isolated.

Isolation Response behavior is triggered after 15 seconds (tunable). If a NIC cable was pulled accidentally, 15 seconds should be sufficient time to fix the problem (re-plug the cable) and avoid a cluster failure.
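
The isolation check and its delay can be summarized as a small predicate. The sketch is an assumption-laden simplification: the parameter names are invented, and the 15 second default is simply the tunable value quoted above.

```python
def isolation_response_due(mgmt_reachable, nic_link_up, seconds_since_loss, delay_s=15):
    """Return True when a host should act on its Isolation Response policy.

    The host considers itself isolated when it has lost management
    connectivity and its management uplink NIC reports link down.
    """
    isolated = (not mgmt_reachable) and (not nic_link_up)
    return isolated and seconds_since_loss >= delay_s


# Cable pulled and re-plugged after 8 seconds: no response is triggered.
print(isolation_response_due(False, False, seconds_since_loss=8))
# Link still down after 15 seconds: the isolation response is triggered.
print(isolation_response_due(False, False, seconds_since_loss=15))
```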

Through heartbeat failure, other ESXi hosts would quickly determine that a peer HA cluster node is unresponsive. They would then check their own ESXi Console physical NIC link to verify that they have network connectivity. In this way, they determine that they are not the isolated host and that they should cooperate with other surviving HA cluster nodes to implement the cluster's Isolation Response policy.


After 15 seconds, the isolated HA cluster node implements the VM's isolation response policy. If that policy is Power Down, then the VM is powered off abruptly. This is the virtual equivalent of pulling the power plug on a physical machine.

Pulling the virtual power on a VM can be traumatic but does avoid a potentially greater problem. If the VM was told to perform an orderly shutdown instead of losing its power, then:

  • The remaining cluster nodes would have no way of monitoring progress
  • The shutdown request could fail at the VM level
  • The VM could hang or lock up during shut down

When the VM has been successfully powered down, the isolated ESXi host removes the exclusive lock it holds on the VM's virtual disk. Healthy ESXi cluster nodes monitor the VM's lock and know that it is safe to take ownership of the VM once its lock has been removed.


HA cluster nodes would then distribute the powered off VMs from the isolated host among the surviving ESXi hosts. VMs would be placed and powered on according to their Restart Priority setting on the HA cluster.

In this case, the left-most ESXi host assumes ownership of the Web01 VM. This VM is added to the host's VM inventory and then immediately powered on. When the VM is powered on, its new owner establishes its own exclusive lock on Web01's virtual disk.

The isolated node would watch for the presence of a lock file for each VM it lost. Once the VM has been successfully powered on on another ESXi host, the isolated host would remove the VM from its own VM inventory.


Before you can take an ESXi cluster node out of a cluster for maintenance, you must announce to the cluster that the host is being pulled from service. You do this by putting the ESXi host into Maintenance Mode as follows:

Right click ESXi host > Enter Maintenance Mode

When you place an ESXi host into Maintenance Mode, the following takes place:

  • On a DRS cluster, the ESXi host will no longer receive new VM power on requests, nor will it be the target for a VMotion request. The DRS cluster will attempt to VMotion off all VMotionable VMs on the host that has entered Maintenance Mode. If you have non-VMotion compatible VMs, you would need to power them off yourself before you shut down the ESXi host.
  • On HA clusters, the host in Maintenance Mode will not receive VMs from failed ESXi hosts. You have to manually shut down any VMs running on the ESXi host that is going into Maintenance Mode.

Once the ESXi host is fully evacuated of VMs, you can shut down or reboot it (right click the host), then patch, upgrade or otherwise service it. When you boot it back up, it will automatically re-join any clusters of which it was a member.
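
As a rough illustration of the evacuation rules above, the following sketch splits a host's running VMs into those DRS can move on its own and those the administrator must handle. The function name and the input shape are hypothetical; it models only the behavior described in this lecture.

```python
def maintenance_plan(vms, drs_enabled):
    """Split a host's running VMs into automatic and manual actions.

    vms: list of (name, vmotion_compatible) tuples (hypothetical input).
    """
    migrate, manual = [], []
    for name, vmotion_ok in vms:
        if drs_enabled and vmotion_ok:
            migrate.append(name)   # DRS will VMotion these off the host
        else:
            manual.append(name)    # administrator must shut these down
    return {"vmotion": migrate, "manual_shutdown": manual}
```

The host is only safe to shut down once both lists have been worked through and no VMs remain on it.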


It is critical that VMware ESXi administrators be informed of major network upgrades and outages, especially when those outages may affect switches used by High Availability cluster ESXi management ports.

The scenario above is a real possibility and will have serious results. If a switch that provides ESXi HA management networking fails (for any reason), all nodes in the HA cluster will believe they are isolated. If that happens, and the cluster Isolation Response policy is Power Off, then all ESXi hosts will power off all VMs.

To defend against this possibility, you could:

  • Use multiple physical switches
  • Upgrade one switch at a time
  • Have multiple ESXi Management ports on multiple switches
  • Change the Isolation Response policy during switch maintenance windows to 'Leave Powered on'
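
To see why a single management switch is so dangerous, the sketch below lists which hosts would consider themselves isolated when one switch fails. The function and the host and switch names are hypothetical; the model simply treats a host as isolated when every one of its management uplinks sits on the failed switch.

```python
def isolated_hosts(mgmt_uplinks, failed_switch):
    """Return hosts whose every management uplink is on the failed switch.

    mgmt_uplinks: dict of host name -> set of physical switch names.
    """
    return sorted(
        host for host, switches in mgmt_uplinks.items()
        if switches <= {failed_switch}
    )
```

With every host uplinked only to "sw1", a failure of "sw1" returns the whole cluster; give each host a second management uplink on "sw2" and the same failure returns an empty list.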

vSwitch0 is the vSwitch used to connect the default Management port to the physical (management) LAN segment. If you lose connectivity to this NIC, your ESXi host is unmanageable and your HA cluster will trigger its Isolation Response policy. You have two choices when designing your networks to maximize your management capabilities.

NIC Team vSwitch0
If you NIC Team vSwitch0, then you will have 2 or more NICs connected to the same physical LAN segment usually through the same physical switch. You are protected from a NIC failure, cable pull or switch port failure but not from a physical switch failure.

Second Management Port
You should consider adding a second management port on a completely separate physical LAN segment by making a new management (VMkernel) port on vSwitch1, vSwitch2, etc. If these other vSwitches uplink to different physical switches than vSwitch0, then you will have achieved management port redundancy, NIC redundancy and switch redundancy. This provides you with the maximum protection and minimizes the likelihood of an HA failover event caused by a single hardware component failure.

If you add a second management port, please make sure that vCenter can connect to all ESXi hosts on all management ports. It is best if vCenter has 2 NICs with connections to each of the physical switch(es) used for management port connectivity.
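
The difference between the two designs can be expressed as a quick audit of a host's management uplinks. This is a hypothetical helper, not a vSphere API; the NIC and switch names in the example are placeholders.

```python
def management_redundancy(uplinks):
    """Classify a host's management uplinks.

    uplinks: list of (nic, physical_switch) pairs (hypothetical input).
    """
    nics = {nic for nic, _ in uplinks}
    switches = {switch for _, switch in uplinks}
    if len(nics) < 2:
        return "none: a single NIC, cable or port failure isolates the host"
    if len(switches) < 2:
        return "NIC only: a physical switch failure still isolates the host"
    return "NIC and switch: no single component failure isolates the host"
```

A NIC Team on one switch, such as [("vmnic0", "sw1"), ("vmnic1", "sw1")], lands in the middle category, while a second management port on a different switch reaches the last one.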


DRS and HA clusters work best together. DRS will dynamically place and load balance VMs while HA will restart VMs that fail when a host fails. Using these tools, an IT department can deliver consistently good VM performance with very little VM downtime.

In a combined DRS+HA cluster if a host fails:

  • HA will detect the loss of the host
  • HA will place and power on VMs that failed when the host failed
  • HA is now done
  • DRS will then load balance the remaining hosts in the cluster
  • Once VMs power on, DRS will move them if they can get better resource allocations on different (surviving) hosts
  • When the failed ESXi host boots up, it will be added back into the cluster
  • DRS will then VMotion VMs back onto the recovered host to rebalance the cluster
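
The sequence above can be walked through with a toy model in which VM count stands in for host load. Both functions are hypothetical simplifications: real HA and DRS placement decisions use CPU and memory demand, not VM counts.

```python
def ha_restart(load, failed_host):
    """HA step: restart the failed host's VMs on the least-loaded survivors.

    load: dict of host -> list of VM names (VM count stands in for load).
    """
    orphans = load.pop(failed_host)
    for vm in orphans:
        target = min(load, key=lambda h: (len(load[h]), h))
        load[target].append(vm)
    return load


def drs_rebalance(load):
    """DRS step: move VMs from the busiest to the idlest host until even."""
    while True:
        busiest = max(load, key=lambda h: (len(load[h]), h))
        idlest = min(load, key=lambda h: (len(load[h]), h))
        if len(load[busiest]) - len(load[idlest]) <= 1:
            return load
        load[idlest].append(load[busiest].pop())
```

Calling ha_restart when a host fails, then adding the recovered host back with an empty VM list and calling drs_rebalance, reproduces the last two bullets: VMs drift back onto the recovered host until the cluster is even again.
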
vSphere Clusters - Best Practice
What's New in vSphere 6 for HA Clusters
HA Disabled VM Handling

VMware introduced vLockStep in vSphere 4. vLockStep is replication technology that replicates the complete state of a VM running on one ESXi host into a VM running on a second ESXi host. In essence, the two VMs form an active/standby pair. They are the same in all respects: they have the same configuration (virtual hardware), share the same virtual MAC address, have the same memory contents, CPU contents, complete the same I/Os, etc. The main difference is that the Secondary VM is invisible to the network. VMware upgraded their Fault Tolerance technology to use Fast Checkpointing in vSphere 6. Fast Checkpointing provides more scalability than vLockStep.

If the Primary VM were to fail for any reason (e.g.: VMkernel failure on the machine running the Primary VM), Fault Tolerance would continue running the VM – by promoting the Secondary VM to the Primary VM on the surviving host. The new Primary would continue interacting with peers on the network, would complete all pending I/Os, etc. In most cases, peers wouldn't even know that the original Primary has failed.

To protect against a second failure, Fault Tolerance would then create a new Secondary node on another ESXi host by replicating the new Primary onto that host. So, in relatively little time, the VM is again protected and could withstand another VMkernel failure.

Note that Fault Tolerance does not protect against SAN failures.

VMware recommends a minimum of 3 ESXi hosts in an HA/FT configuration so that, if one ESXi host is lost, 2 hosts remain and the FT protected VM can create a new Secondary copy on the 3rd cluster host.
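
The reason for the three host minimum shows up clearly in a small model of FT failover. The function below is a hypothetical sketch of the promotion and re-protection steps described above, not VMware's placement logic; it simply picks the first spare host for the new Secondary.

```python
def ft_failover(primary_host, secondary_host, hosts, failed_host):
    """Return (primary, secondary) placement after one host failure.

    secondary is None when no spare host is left to rebuild protection.
    """
    survivors = [h for h in hosts if h != failed_host]
    if failed_host == primary_host:
        primary_host = secondary_host      # Secondary is promoted
        secondary_host = None
    elif failed_host == secondary_host:
        secondary_host = None              # Primary keeps running
    if secondary_host is None:
        # Re-protect the VM on any surviving host other than the Primary's.
        spares = [h for h in survivors if h != primary_host]
        secondary_host = spares[0] if spares else None
    return primary_host, secondary_host
```

With three hosts, losing the Primary's host leaves a promoted Primary plus a fresh Secondary on the third host. With only two hosts, the promoted Primary runs unprotected because no spare host remains.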

High Availability Clusters Lab
High Availability Clusters - Review and Questions
Section 4: VMware Fault Tolerance
VMware Fault Tolerance

VMware introduced FT vLockStep in vSphere 4.0 and replaced it in vSphere 6 with Fast Checkpointing, a replication and synchronization technology that:

  • Builds a duplicate VM (Secondary) on a different HA cluster host
  • Quickly and efficiently synchronizes Primary VM to the Secondary VM
  • Makes all I/O operations visible to the Secondary VM
  • Ensures that the Secondary VM is in exactly the same state as the Primary VM at all times
  • Replicates updates to the Primary's .vmdk to the Secondary's .vmdk

If the Primary VM were to fail for any reason (e.g.: VMkernel failure on the machine running the Primary VM), Fault Tolerance would continue running the VM – by promoting the Secondary VM to the Primary VM on the surviving host. The new Primary would continue interacting with peers on the network, would complete all pending I/Os, etc. In most cases, peers wouldn't even know that the original Primary has failed.

VMware's best practice is to build HA/FT clusters using an odd number of servers. This allows FT to protect against a second failure. In the case of a host failure, FT would create a new Secondary node on another ESXi host by replicating the new Primary onto that host. So, in relatively little time, the VM is again protected and could withstand another ESXi host failure.

Fault Tolerance Benefits
Fault Tolerance Fast Checkpointing
Two ESXi Host Fault Tolerance VM Protection
Three Plus ESXi Host Fault Tolerance VM Protection
Fault Tolerance - Use Cases
Fault Tolerance Lab - Part 1
What's New in vSphere 6 for Fault Tolerance
Fault Tolerance HA Cluster and ESXi Host Requirements
Fault Tolerance HA Cluster Compliance Checks
Fault Tolerance Virtual Machine Requirements
Fault Tolerance Protected Virtual Machine Restrictions
Fault Tolerance Networking - Best Practice
Fault Tolerance VMkernel Port Configuration
Enabling Fault Tolerance Protection on a Virtual Machine
Fault Tolerance Enabled - VM Compliance Checks
VM Turn On Fault Tolerance Wizard - Step 1

Instructor Biography

Larry Karnis, VMware vSphere Consultant/Mentor, VCP vSphere 2, 3, 4 and 5

Get VMware vSphere and View trained here... on Udemy!

What do you do if you need to learn VMware but can't afford the $4,000 - $6,000 charged for authorized training? Now you can enroll in my equivalent VMware training here on Udemy!

I have created six courses that together offer over 32 hours of VMware vSphere 6 lectures (about 8 days of instructor-led training at 4hrs lecture per day). With Udemy, I can provide more insight and detail, without the time constraints that a normal instructor-led training class would impose. My goal is to give you a similar or better training experience - at about 10% of the cost of classroom training.

I am an IT consultant / trainer with over 25 years of experience. I worked for 10 years as a UNIX programmer and administrator before moving to Linux in 1995. I've been working with VMware products since 2001 and now focus exclusively on VMware. I earned my first VMware Certified Professional (VCP) designation on ESX 2.0 in 2004 (VCP #: 993). I have also earned VCP in ESX 3, and in vSphere 4 and 5.

I have been providing VMware consulting and training for more than 10 years. I have led literally hundreds of classes and taught thousands of people how to use VMware. I teach both introductory and advanced VMware classes.

I even worked for VMware as a VMware Certified Instructor (VCI) for almost five years. After leaving VMware, I decided to launch my own training business focused on VMware virtualization. Prior to working for VMware, I worked as a contract consultant and trainer for RedHat, Global Knowledge and Learning Tree.

I hold a Bachelor of Science in Computer Science and Math from the University of Toronto. I also hold numerous industry certifications including VMware Certified Professional on VMware Infrastructure 2 & 3 and vSphere 4 & 5 (ret.), VMware Certified Instructor (ret.), RedHat Certified Engineer (RHCE), RedHat Certified Instructor (RHCI) and RedHat Certified Examiner (RHCX) as well as certifications from LPI, HP, SCO and others.

I hope to see you in one of my Udemy VMware classes... If you have questions, please contact me directly.



Larry Karnis
