Users Guide
You should run test recoveries as often as needed. Testing a recovery plan does not affect replication or the ongoing operations
of either site (though it might temporarily suspend the selected local virtual machines at the recovery site if recoveries are
configured to do so). You can cancel a recovery plan test at any time.
In a planned migration, a recovery stops replication after a final synchronization of the source and the target. In a disaster
recovery, virtual machines are restored to the most recent available state, as determined by the recovery point objective
(RPO). After the final replication completes, SRM makes changes at both sites that require significant time and effort to
reverse. For this reason, the privilege to test a recovery plan and the privilege to run a recovery plan must be assigned
separately.
When a test failover to the recovery site is requested, SRM performs the following steps:
1. Determines the latest recovery point for each replicated volume.
2. Creates a writable test snapshot for each recovery point, with a name in the form srannnnnn, where nnnnnn is a
monotonically increasing number.
3. Maps the test snapshots to the appropriate ESXi hosts on the recovery site.
When testing stops, the test snapshots are unmapped and deleted.
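The test-failover lifecycle above can be sketched as a small in-memory model. This is an illustration only, not SRM or SRA
code: the class FailoverTestSim, its fields, the six-digit zero padding of the snapshot number, and the choice to map every
snapshot to every host are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class FailoverTestSim:
    """Toy model of a recovery plan test (hypothetical; not an SRM or SRA API)."""
    recovery_points: dict   # volume name -> latest recovery point label
    hosts: list             # ESXi host names at the recovery site
    _counter: count = field(default_factory=lambda: count(1))
    snapshots: dict = field(default_factory=dict)   # snapshot name -> (volume, point)
    mappings: dict = field(default_factory=dict)    # snapshot name -> host names

    def start(self):
        for volume, point in self.recovery_points.items():   # step 1: latest recovery point
            name = f"sra{next(self._counter):06d}"            # step 2: padding is an assumption
            self.snapshots[name] = (volume, point)
            self.mappings[name] = list(self.hosts)            # step 3: map to the hosts
        return sorted(self.snapshots)

    def stop(self):
        self.mappings.clear()    # unmap the test snapshots first
        self.snapshots.clear()   # then delete them


sim = FailoverTestSim({"vol-a": "rp-17", "vol-b": "rp-09"}, ["esxi-01"])
created = sim.start()
sim.stop()
```

Because the counter is never reset in this sketch, a second test produces new snapshot names rather than reusing the old ones.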
Failover and failback
Failback is the process of returning the replication environment to the state it had at the protected site before failover.
Failback with SRM is an automated process that occurs after recovery, which makes failing back the protected virtual
machines relatively simple in the case of a planned migration. If the entire SRM environment remains intact after recovery,
failback is done by running the reprotect recovery steps with SRM and then running the recovery plan again, which moves
the virtual machines configured within their protection groups back to the original protected SRM site.
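The reprotect-then-recover sequence for a planned migration can be sketched as a two-state model. This is a minimal
illustration under stated assumptions, not the SRM API: RecoveryPlanSim, its fields, and the failback helper are hypothetical
names, and reprotect is modeled simply as swapping the roles of the two sites.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlanSim:
    """Toy model of one recovery plan (hypothetical; not the SRM API)."""
    protected_site: str
    recovery_site: str
    vms_running_at: str = ""
    recovered: bool = False

    def __post_init__(self):
        # The virtual machines start out at the protected site.
        self.vms_running_at = self.vms_running_at or self.protected_site

    def run_recovery(self):
        if self.recovered:
            raise RuntimeError("plan already recovered; reprotect first")
        self.vms_running_at = self.recovery_site
        self.recovered = True

    def reprotect(self):
        if not self.recovered:
            raise RuntimeError("nothing to reprotect; run a recovery first")
        # Reverse the direction of protection between the two sites.
        self.protected_site, self.recovery_site = self.recovery_site, self.protected_site
        self.recovered = False


def failback(plan):
    """Reprotect, then run the same plan again to move the VMs back."""
    plan.reprotect()
    plan.run_recovery()


plan = RecoveryPlanSim("site-a", "site-b")
plan.run_recovery()   # planned migration: VMs now run at site-b
failback(plan)        # VMs run at site-a again
```

The guard clauses reflect the ordering described above: in this sketch a plan cannot be run twice in a row without a reprotect in
between.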
In disaster scenarios, the failback steps vary with the degree of failure at the protected site. For example, the failover
could have been caused by a storage system failure or by the loss of the entire data center. Configuring failback manually is
important because the protected site may have a different hardware or SAN configuration after a disaster. After failback is
configured, SRM can manage and automate it like any planned SRM failover. The recovery steps can differ based on the
conditions of the last failover. If failback follows an unplanned failover, a full data re-mirroring between the two sites may
be required. This step usually accounts for most of the time in a failback scenario.
All recovery plans in SRM include an initial attempt to synchronize data between the protection and recovery sites, even during
a disaster recovery scenario.
During a disaster recovery, SRM first attempts to shut down the protection group's virtual machines and perform a final
synchronization between the sites. This is designed to ensure that virtual machines are static and quiescent before the
recovery plan runs, in order to minimize data loss wherever possible. If the protected site is no longer available, the recovery
plan continues to execute and runs to completion even if errors are encountered.
This behavior minimizes the possibility of data loss during a disaster recovery, balancing the requirement for virtual machine
consistency with the ability to achieve aggressive recovery-point objectives.
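The best-effort behavior described above can be sketched as follows. This is a simplified illustration, not SRM code: the
function run_disaster_recovery, the step names, and the log format are hypothetical, and the protected site is reduced to a
single reachable/unreachable flag.

```python
def run_disaster_recovery(protected_site_up, recovery_steps):
    """Return a log of (step, outcome) pairs for a simulated disaster recovery."""
    log = []
    # Initial attempt: shut down the protected VMs, then synchronize one last time.
    for step in ("shut down protected VMs", "final synchronization"):
        outcome = "ok" if protected_site_up else "skipped: protected site unavailable"
        log.append((step, outcome))
    # The plan keeps going and records errors rather than stopping on them.
    for step, action in recovery_steps:
        try:
            action()
            log.append((step, "ok"))
        except Exception as exc:
            log.append((step, f"error: {exc}"))
    return log
```

A caller would pass the remaining plan steps as (name, callable) pairs and inspect the returned log afterward to see which steps
need attention.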
Automatic failover
SRM automates the execution of recovery plans to ensure accurate and consistent execution. Through vCenter Server, you
have full visibility into and control of the process, including the status of each step, progress indicators, and detailed
descriptions of any error that occurs.
When an actual SRM failover is requested in the event of a disaster, the SRA performs the following steps:
1. Selects the replicated volumes.
2. Identifies and removes any incomplete remote copies that are in progress and presents the most recently completed Remote
Copy as a primary volume.
3. Converts remote volumes into primary volumes and configures authentication for ESXi hosts to mount them.
If an actual failover does not run to completion for any reason, the failover can be requested repeatedly until the run
completes. For example, if only one volume failed to restore because a normal snapshot was present, you can delete the
snapshot manually and request the failover again.
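The retry behavior can be sketched as a failover routine that is safe to call more than once. This is a toy model, not SRA code:
the function attempt_failover and the dictionary keys (role, incomplete_copies, blocking_snapshots) are hypothetical names
chosen for the example.

```python
def attempt_failover(volumes):
    """Try to promote every remote volume; return the names that still failed."""
    failed = []
    for vol in volumes:
        if vol["role"] == "primary":
            continue                      # promoted by an earlier attempt
        vol["incomplete_copies"] = 0      # discard remote copies still in progress
        if vol["blocking_snapshots"]:
            failed.append(vol["name"])    # cannot promote until these are removed
            continue
        vol["role"] = "primary"
    return failed


volumes = [
    {"name": "vol-a", "role": "remote", "incomplete_copies": 1, "blocking_snapshots": []},
    {"name": "vol-b", "role": "remote", "incomplete_copies": 0, "blocking_snapshots": ["manual-snap"]},
]
first = attempt_failover(volumes)          # vol-b fails on this attempt
volumes[1]["blocking_snapshots"].clear()   # operator deletes the snapshot manually
second = attempt_failover(volumes)         # the retry finishes the run
```

Volumes that were already promoted are skipped on the retry, so in this sketch repeating the request only affects the volumes
that failed earlier.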