Technical information

VMware, Inc.
Chapter 4 Virtual Infrastructure Management
VMware Fault Tolerance Best Practices
VMware Fault Tolerance (FT) provides continuous virtual machine availability in the event of a server failure.
For each virtual machine there are two FT-related actions that can be taken: turning FT on or off and
enabling or disabling FT.
“Turning on FT” prepares the virtual machine for FT by prompting for the removal of unsupported
devices, disabling unsupported features, and setting the virtual machine’s memory reservation to be
equal to its memory size (thus avoiding ballooning or swapping).
“Enabling FT” performs the actual creation of the secondary virtual machine by live-migrating the
primary.
Each of these operations has performance implications.
Don’t turn on FT for a virtual machine unless you will actually be enabling FT for that machine.
Turning on FT automatically disables some features that can help that virtual machine’s
performance, such as hardware virtual MMU (if the processor supports it).
Enabling FT for a virtual machine uses additional resources (for example, the secondary virtual
machine uses as much CPU and memory as the primary virtual machine). Therefore, make sure you
are prepared to devote the required resources before enabling FT.
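The resource cost described above can be estimated before enabling FT. The following is a minimal sketch, not a VMware API: the `VirtualMachine` class and `ft_footprint` function are hypothetical names introduced for illustration. It applies only the two rules stated in the text, that the memory reservation equals the configured memory size and that the secondary uses as much CPU and memory as the primary.

```python
from dataclasses import dataclass


@dataclass
class VirtualMachine:
    """Hypothetical description of a VM; not a VMware SDK type."""
    name: str
    vcpus: int
    memory_mb: int


def ft_footprint(vm: VirtualMachine) -> dict:
    """Estimate the cluster-wide resources used once FT is enabled for vm."""
    # Turning on FT sets the memory reservation equal to the memory size.
    reservation_mb = vm.memory_mb
    # Enabling FT creates a secondary that uses as much CPU and memory
    # as the primary, so the totals are doubled.
    return {
        "memory_reservation_mb_per_copy": reservation_mb,
        "total_memory_mb": 2 * reservation_mb,
        "total_vcpus": 2 * vm.vcpus,
    }
```

For example, a virtual machine configured with 4096 MB of memory would need 8192 MB of fully reserved memory across the two hosts that run the primary and the secondary.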
The live migration that takes place when FT is enabled can briefly saturate the VMotion network link and
can also cause spikes in CPU utilization.
If the VMotion network link is also being used for other operations, such as FT logging, the
performance of those other operations can be impacted. For this reason it is best to have separate and
dedicated NICs for FT logging traffic and VMotion, especially when multiple FT virtual machines
reside on the same host.
Because this potentially resource-intensive live migration takes place each time FT is enabled, we
recommend that FT not be frequently enabled and disabled.
Because FT logging traffic is asymmetric (the majority of the traffic flows from primary to secondary),
congestion on the logging NIC can be avoided by distributing primaries across multiple hosts. For example,
on a cluster with two ESX hosts and two virtual machines with FT enabled, placing one of the primary
virtual machines on each of the hosts allows the network bandwidth to be utilized bidirectionally.
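The placement idea above can be sketched as a simple round-robin assignment. This is an illustrative planning helper only. The function names and host names are hypothetical, and it does not call any VMware interface or account for other placement constraints.

```python
from collections import Counter


def place_ft_pairs(vm_names, hosts):
    """Spread primaries round-robin; put each secondary on the next host."""
    if len(hosts) < 2:
        raise ValueError("FT needs at least two hosts")
    placement = {}
    for i, vm in enumerate(vm_names):
        primary = hosts[i % len(hosts)]
        # The secondary must run on a different host than its primary.
        secondary = hosts[(i + 1) % len(hosts)]
        placement[vm] = (primary, secondary)
    return placement


def primaries_per_host(placement):
    """Count how many primaries each host carries."""
    return Counter(primary for primary, _ in placement.values())
```

With two hosts and two FT virtual machines, this yields one primary per host, so logging traffic flows in both directions over the link rather than in one direction only.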
FT virtual machines that receive large amounts of network traffic or perform many disk reads can generate
significant traffic on the NIC specified for FT logging. This is true of machines that routinely
do these things as well as machines doing them only intermittently, such as during a backup operation.
To avoid saturating the network link used for logging traffic, limit the number of FT virtual machines on
each host or limit the disk read bandwidth and network receive bandwidth of those virtual machines.
Make sure the FT logging traffic is carried by at least a Gigabit-rated NIC (which should in turn be
connected to at least Gigabit-rated infrastructure).
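A rough way to check whether a set of FT virtual machines is likely to fit on one logging link is to add up their disk read and network receive rates, since the text identifies these as the main drivers of logging traffic. The sketch below is an approximation under stated assumptions: the function names are hypothetical, the additive model is a simplification, and the default `headroom` factor of 1.2 is an assumed safety margin rather than a value taken from this section.

```python
def estimate_ft_logging_mbps(disk_read_mbytes_per_s, net_receive_mbps,
                             headroom=1.2):
    """Approximate FT logging traffic for one VM, in Mbit/s."""
    # Disk reads are given in MB/s, so multiply by 8 to get Mbit/s.
    disk_read_mbps = disk_read_mbytes_per_s * 8
    # headroom is an assumed safety margin, not a measured value.
    return (disk_read_mbps + net_receive_mbps) * headroom


def fits_link(vms, link_mbps=1000.0, headroom=1.2):
    """Return (estimated total Mbit/s, whether it fits the link).

    vms is a list of (disk_read_mbytes_per_s, net_receive_mbps) tuples
    for the primaries that send logging traffic over this link.
    """
    total = sum(estimate_ft_logging_mbps(d, n, headroom) for d, n in vms)
    return total, total <= link_mbps
```

Measured peak rates (for example, during a backup window) are a better input than averages, because intermittent bursts can saturate the link just as steady load can.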
Avoid placing more than four FT-enabled virtual machines on a single host. In addition to reducing the
possibility of saturating the network link used for logging traffic, this also limits the number of
live migrations needed to create new secondary virtual machines in the event of a host failure.
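The four-per-host guideline is easy to check against an inventory listing. The following sketch assumes a hypothetical mapping from host name to the FT-enabled virtual machines placed on it. How that mapping is obtained, and whether secondaries are counted alongside primaries, is left to the caller.

```python
MAX_FT_VMS_PER_HOST = 4  # guideline stated in the text above


def overloaded_hosts(ft_vms_by_host, limit=MAX_FT_VMS_PER_HOST):
    """Return {host: count} for hosts exceeding the FT VM guideline."""
    return {
        host: len(vms)
        for host, vms in ft_vms_by_host.items()
        if len(vms) > limit
    }
```

An empty result means every host is within the guideline. Any host that appears in the result is a candidate for moving FT virtual machines elsewhere.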
If the secondary virtual machine lags too far behind the primary (which usually happens when the
primary virtual machine is CPU bound and the secondary virtual machine is not getting enough CPU
cycles), the hypervisor may slow the primary to allow the secondary to catch up. The following
recommendations help avoid this situation:
Make sure the hosts on which the primary and secondary virtual machines run are relatively closely
matched, with similar CPU make, model, and frequency.
NOTE Turning on FT for a powered-on virtual machine will also automatically “Enable FT” for that
virtual machine.