Chapter 4 Virtual Infrastructure Management

VMware Fault Tolerance Best Practices

VMware Fault Tolerance (FT) provides continuous virtual machine availability in the event of a server failure.

■ For each virtual machine there are two FT-related actions that can be taken: turning FT on or off and enabling or disabling FT.

  “Turning on FT” prepares the virtual machine for FT by prompting for the removal of unsupported devices, disabling unsupported features, and setting the virtual machine’s memory reservation to be equal to its memory size (thus avoiding ballooning or swapping).

  “Enabling FT” performs the actual creation of the secondary virtual machine by live-migrating the primary.

  Each of these operations has performance implications.

  ■ Don’t turn on FT for a virtual machine unless you will actually be using (that is, enabling) FT for that machine. Turning on FT automatically disables some features for that virtual machine that can help performance, such as hardware virtual MMU (if the processor supports it).

  ■ Enabling FT for a virtual machine uses additional resources (for example, the secondary virtual machine uses as much CPU and memory as the primary virtual machine). Therefore, make sure you are prepared to devote the required resources before enabling FT.
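
The resource cost described above can be estimated before FT is enabled. The following is a minimal sketch, not a VMware tool: the `VM` class, its field names, and the `ft_footprint` helper are hypothetical, and the CPU figure is simply whatever demand you expect the primary to have.

```python
from dataclasses import dataclass


@dataclass
class VM:
    """Hypothetical description of a virtual machine (not a VMware API type)."""
    name: str
    memory_mb: int  # configured memory size
    cpu_mhz: int    # expected CPU demand of the primary


def ft_footprint(vm: VM) -> dict:
    """Rough resources consumed by an FT pair (primary plus secondary)."""
    return {
        # Turning on FT sets the memory reservation equal to the memory size.
        "primary_memory_reservation_mb": vm.memory_mb,
        # The secondary uses as much CPU and memory as the primary.
        "total_memory_mb": 2 * vm.memory_mb,
        "total_cpu_mhz": 2 * vm.cpu_mhz,
    }


if __name__ == "__main__":
    print(ft_footprint(VM("db01", memory_mb=4096, cpu_mhz=2000)))
```

In this sketch a 4096 MB virtual machine needs 8192 MB across the two hosts once FT is enabled, which is the figure to check against available capacity.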

■ The live migration that takes place when FT is enabled can briefly saturate the VMotion network link and can also cause spikes in CPU utilization.

  ■ If the VMotion network link is also being used for other operations, such as FT logging, the performance of those operations can suffer. For this reason it is best to have separate, dedicated NICs for FT logging traffic and VMotion, especially when multiple FT virtual machines reside on the same host.

  ■ Because this potentially resource-intensive live migration takes place each time FT is enabled, we recommend that FT not be frequently enabled and disabled.

■ Because FT logging traffic is asymmetric (the majority of the traffic flows from primary to secondary), congestion on the logging NIC can be avoided by distributing primaries onto multiple hosts. For example, on a cluster with two ESX hosts and two virtual machines with FT enabled, placing one of the primary virtual machines on each of the hosts allows the network bandwidth to be utilized bidirectionally.
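
The placement idea in the two-host example can be sketched as a simple round-robin assignment. This is only an illustration of the principle: `place_ft_pairs` and the host and virtual machine names are hypothetical, and real placement would also have to account for capacity and any cluster placement rules.

```python
def place_ft_pairs(vms, hosts):
    """Spread FT primaries round-robin across hosts.

    Each secondary goes on the next host in the list, so it never shares
    a host with its own primary. Returns {vm: (primary_host, secondary_host)}.
    """
    if len(hosts) < 2:
        raise ValueError("an FT pair needs at least two hosts")
    placement = {}
    for i, vm in enumerate(vms):
        primary = hosts[i % len(hosts)]
        secondary = hosts[(i + 1) % len(hosts)]
        placement[vm] = (primary, secondary)
    return placement


if __name__ == "__main__":
    # Two hosts, two FT virtual machines: one primary lands on each host,
    # so logging traffic flows in both directions over the link.
    print(place_ft_pairs(["vmA", "vmB"], ["esx1", "esx2"]))
```

With two hosts and two virtual machines, `vmA` gets its primary on `esx1` and `vmB` gets its primary on `esx2`, matching the example in the text.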

■ FT virtual machines that receive large amounts of network traffic or perform many disk reads can generate significant traffic on the NIC designated for FT logging. This is true of machines that do these things routinely as well as machines that do them only intermittently, such as during a backup operation. To avoid saturating the network link used for logging traffic, limit the number of FT virtual machines on each host, or limit the disk read bandwidth and network receive bandwidth of those virtual machines.
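
A back-of-the-envelope check of the logging link can be sketched as follows. The model is an assumption made for illustration: it treats the logging traffic of each FT virtual machine as roughly its disk read rate plus its network receive rate, and it ignores any protocol overhead. The function names and the 1000 Mbps default link speed are likewise hypothetical.

```python
def logging_mbps(disk_read_mbytes_s, net_rx_mbps):
    """Approximate logging traffic for one FT VM, in megabits per second.

    Assumption: logging traffic is about disk reads plus network receive.
    Disk reads are given in megabytes per second, hence the factor of 8.
    """
    return disk_read_mbytes_s * 8 + net_rx_mbps


def host_logging_load(vms, link_mbps=1000.0):
    """Return (total_mbps, utilization) for the FT primaries on one host.

    `vms` is a list of (disk_read_mbytes_s, net_rx_mbps) tuples.
    """
    total = sum(logging_mbps(disk, net) for disk, net in vms)
    return total, total / link_mbps


if __name__ == "__main__":
    total, util = host_logging_load([(10, 100), (5, 50)])
    print(f"{total} Mbps, {util:.0%} of the logging link")
```

Running the same calculation with peak figures (for example, the read rate during a backup window) rather than averages shows whether intermittent load could saturate the link.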

■ Make sure the FT logging traffic is carried by at least a Gigabit-rated NIC (which should in turn be connected to at least Gigabit-rated infrastructure).

■ Avoid placing more than four FT-enabled virtual machines on a single host. In addition to reducing the possibility of saturating the network link used for logging traffic, this also limits the number of live migrations needed to create new secondary virtual machines in the event of a host failure.
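
The four-per-host guideline is easy to check against an inventory listing. The sketch below assumes you already have a list with one host name per FT virtual machine resident on that host; whether to count secondaries as well as primaries is left to the caller, and `overloaded_hosts` is a hypothetical helper rather than part of any VMware tool.

```python
from collections import Counter

MAX_FT_VMS_PER_HOST = 4  # limit recommended in the text above


def overloaded_hosts(ft_vm_hosts):
    """Return the hosts that carry more FT VMs than the recommended limit.

    `ft_vm_hosts` holds one host name per FT virtual machine.
    """
    counts = Counter(ft_vm_hosts)
    return sorted(h for h, n in counts.items() if n > MAX_FT_VMS_PER_HOST)


if __name__ == "__main__":
    inventory = ["esx1"] * 5 + ["esx2"] * 4
    print(overloaded_hosts(inventory))
```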

■ If the secondary virtual machine lags too far behind the primary (which usually happens when the primary virtual machine is CPU-bound and the secondary virtual machine is not getting enough CPU cycles), the hypervisor may slow the primary to allow the secondary to catch up. The following recommendations help avoid this situation:

  ■ Make sure the hosts on which the primary and secondary virtual machines run are relatively closely matched, with similar CPU make, model, and frequency.
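
The host-matching recommendation can be expressed as a simple comparison. The text only asks for "similar" CPUs, so the tolerance used here is an illustrative placeholder and not VMware guidance; the `HostCpu` class and `closely_matched` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostCpu:
    """Hypothetical summary of a host's processor."""
    make: str
    model: str
    mhz: int


def closely_matched(a: HostCpu, b: HostCpu, max_mhz_gap: int = 400) -> bool:
    """True if two hosts have the same CPU make and model and similar frequency.

    `max_mhz_gap` is an arbitrary illustrative tolerance; choose your own.
    """
    return (
        a.make == b.make
        and a.model == b.model
        and abs(a.mhz - b.mhz) <= max_mhz_gap
    )


if __name__ == "__main__":
    primary_host = HostCpu("Intel", "Xeon X5570", 2930)
    secondary_host = HostCpu("Intel", "Xeon X5570", 2660)
    print(closely_matched(primary_host, secondary_host))
```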

NOTE Turning on FT for a powered-on virtual machine will also automatically “Enable FT” for that virtual machine.