Technical Report
Deploying VMware vCenter Site Recovery Manager 4 with NetApp FAS/V-Series Storage Systems
Larry Touchette, Julian Cates, NetApp
February 2012 | TR-3671 | Version 1.4

The following document provides information regarding the deployment of VMware vCenter Site Recovery Manager version 4 with NetApp storage systems running NetApp Data ONTAP® operating in 7-Mode.
TABLE OF CONTENTS
1 INTRODUCTION
1.1 INTENDED AUDIENCE
2 OVERVIEW OF A DISASTER RECOVERY SOLUTION
2.1 A TRADITIONAL DR SCENARIO
8.1 REQUIREMENTS AND ASSUMPTIONS
8.2 EXECUTING THE RECOVERY PLAN
9 RESYNC AFTER RECOVERY
9.1 REQUIREMENTS AND ASSUMPTIONS
2.1 A TRADITIONAL DR SCENARIO
When failing over business operations to recover from a disaster, several of the steps involved are manual, lengthy, and complex. Custom scripts are often written and used to simplify some of these processes. However, these processes affect the actual RTO that any DR solution can deliver. Consider this simplified outline of the flow of a traditional disaster recovery scenario:
1.
each other. Virtual machines can be quickly and easily collected into groups that share common storage resources and can be recovered together.

Figure 1 SRM Site Components

Upon the execution of a disaster recovery plan, SRM will:
• Quiesce and break the NetApp SnapMirror relationships.
• Connect the replicated datastores to the ESX hosts at the DR site.
• If desired, power off virtual machines, such as test/dev instances, at the DR site, freeing compute resources.
3.3 NETAPP FLEXCLONE
When NetApp FlexClone® technology is combined with SnapMirror and SRM, testing the DR solution can be performed quickly and easily, requiring very little additional storage, and without interrupting the replication process. FlexClone quickly creates a read-writable copy of a FlexVol® volume. When this functionality is used, an additional full copy of the data is not required; for example, cloning a 10GB LUN does not require another 10GB of space, only the metadata required to define the LUN.
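For illustration, the underlying 7-Mode operation that creates such a clone can also be performed manually with the vol clone create command; SRM performs this automatically during a DR test. The controller, volume, and Snapshot names below are hypothetical:

```
fas-dr> vol clone create vm_datastore_clone -s none -b vm_datastore_mirror dr_test_snapshot
```

Here -s none sets the space guarantee of the clone to none, so the clone consumes additional space only for blocks changed after it was created.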
Unified protocol support – The unified storage adapter allows for protection and recovery of virtual machines with data in datastores using different connectivity protocols at the same time. For example, with the new unified adapter a virtual machine system disk may be stored in NFS, while iSCSI or FC RDM devices are used for application data.

MultiStore vFiler support – The new adapter supports the use of MultiStore vFilers (virtual storage arrays) configured as arrays in SRM.
Figure 3 SRM Environment Layout

4.2 IP ADDRESS CHANGES
Some environments may be able to use the same network IP addresses at both the primary site and the DR site. This is referred to as a stretched VLAN or stretched network setup. Other environments may have a requirement to use different network IP addresses (in different VLANs) at the primary site than what is configured at the DR site. SRM supports both of these scenarios.
SRM and the NetApp adapter will support connecting to a datastore over a private storage network rather than the network or ports used for system administration, virtual machine access, user access, and so on. This support is provided by the addition of a new field, called NFS IP Addresses, in the NetApp adapter setup screen in the SRM array manager configuration wizard.
roles are seized it is very important that the clone never be connected to a production VM Network, and that it is destroyed after DR testing is complete.

PROVIDING ACTIVE DIRECTORY SERVICES FOR SRM FAILOVER
For a real DR failover scenario, the AD cloning process is not required. In the SRM failover scenario, the existing AD and name resolution servers at the recovery site provide those services. However, the five FSMO roles must be seized per the procedure described in the Microsoft KB at http://support.
5.3 SRA AND DATA ONTAP VERSION DEPENDENCIES
Certain versions of the SRA require specific versions of the Data ONTAP operating system for compatibility. See the table below for which versions of Data ONTAP and the NetApp SRA are supported together.

NetApp SRA Version                   NetApp Data ONTAP Version Required
v1.0 (SAN)                           7.2.2+
v1.0.X (SAN)                         7.2.2+
v1.4 to v1.4.2 (SAN)                 7.2.2+
v1.4 to v1.4.2 (NAS)                 7.2.4+
v1.4.3                               7.2.4+
v1.4.3 (using vFiler support)        7.3.
Figure 6 Supported Replication Example

Figure 7 Supported Replication Example

Relationships where the data (vmdk or RDM) owned by any individual virtual machine exists on multiple arrays (physical controller or vFiler) are not supported. In the examples below, VM5 cannot be configured for protection with SRM because VM5 has data on two arrays.
Figure 9 Unsupported Replication Example

Any replication relationship where an individual NetApp volume or qtree is replicated from one source array to multiple destination arrays is not supported by SRM. In this example, VM1 cannot be configured for protection in SRM because it is replicated to two different locations.

Figure 10 Unsupported Replication Example

5.
• In iSCSI environments, ESX hosts at the DR site must have established iSCSI sessions to the DR site vFiler prior to executing an SRM DR test or failover. The NetApp adapter does not start the iSCSI service in the vFiler or establish an iSCSI connection to the iSCSI target.
• In iSCSI environments, the DR site vFiler must have an igroup of type "vmware" prior to executing an SRM DR test or failover. The NetApp adapter does not create the "vmware" type igroup in the DR site vFiler.
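As a hedged illustration, such an igroup could be created manually in the DR site vFiler context with the 7-Mode igroup create command; the vFiler, igroup, and initiator names below are hypothetical:

```
fas-dr> vfiler run vfiler_dr igroup create -i -t vmware esx_dr_hosts iqn.1998-01.com.vmware:esx01
```

The -i flag creates an iSCSI igroup and -t vmware sets the ostype expected by the adapter; additional ESX host initiators can be added to the same igroup with igroup add.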
scheduled using NetApp software such as the built-in scheduler in Data ONTAP or Protection Manager software.

6 INSTALLATION AND CONFIGURATION
6.1 OVERVIEW
Configuring SRM to protect virtual machines replicated by SnapMirror involves these steps, summarized here but explained in detail below.
1. Install and enable the SRM plug-in on each of the vCenter servers.
2. Install the NetApp SRA on each of the vCenter servers.
6.3 USING HTTPS/SSL TO CONNECT TO THE NETAPP CONTROLLERS
By default, the NetApp storage adapter connects to the storage array via API calls over HTTP. The NetApp storage adapter can be configured to use SSL (HTTPS) for communication with the storage arrays. This requires that SSL modules be added to the instance of Perl that is shipped and installed with SRM. VMware and NetApp do not distribute these modules.
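The specific modules are not named here; a common Perl SSL stack is Net::SSLeay plus IO::Socket::SSL (an assumption in this sketch, not a NetApp or VMware statement), which could be installed into the SRM server's Perl instance via CPAN, for example:

```
perl -MCPAN -e "install Net::SSLeay"
perl -MCPAN -e "install IO::Socket::SSL"
```

Confirm the module requirements against the SRA documentation for your adapter version before changing the Perl installation.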
3. The protection-side array will be listed at the bottom of the Configure Array Managers window, along with the discovered peer array. The destination of any SnapMirror relationships is listed here, including a count of replicated LUNs discovered by SRM. 4. Click Next and configure recovery-side array managers in the same fashion. Click the Add… button and enter the information for the NetApp controller at the DR site.
Once the array is added successfully, the Protection Arrays section at the bottom of the screen will indicate the replicated storage is properly detected. 5. After finishing the configuration dialogs, the Review Mirrored LUNs page will indicate all the discovered NetApp LUNs containing ESX datastores and their associated replication targets. Click Finish. 6. The array managers must also be configured at the DR site.
Figure 11 SRM Site and Array Manager Relationships

6.5 CONFIGURING INVENTORY PREFERENCES
The VMware environments at the primary site and the DR site might have different sets of resources, such as virtual machine networks, ESX hosts, folders, and so on. In this stage of the configuration, identify a DR site resource for each corresponding resource at the primary site.
1. On the primary site vCenter server, click the Configure link next to Inventory Preferences.
2.
3. Verify that all the primary resources have corresponding secondary resources associated with them.
4. Each of the Connection, Array Managers, and Inventory Preferences steps in the Setup section of the Summary tab should now indicate Connected or Configured.
5. Proceed with building a protection group.

6.6 BUILDING A PROTECTION GROUP
Protection groups allow virtual machines to be collected into groups that will be recovered together.
.vmx files replicated from the primary site are used. 4. The next page allows the configuration of specific settings for each virtual machine. Note: In this example, as we have not separated the transient data in this environment, we finish the wizard here.
5. The Site Recovery Summary tab now indicates that one protection group has been created. Note that in vCenter at the DR site, a placeholder virtual machine has been created for each virtual machine that was in the protection group. These virtual machines have .vmx configuration files stored in vmware_temp_dr, but they have no virtual disks attached to them.

6.7 CREATING A RECOVERY PLAN
Recovery plans are created in the vCenter server at the DR site.
environment as described in the design section, then that network is selected as the test network for the virtual machine network. No test network is necessary for the DR test bubble network itself, so it remains set to Auto.
5. Next, select virtual machines to suspend. SRM allows virtual machines that normally run locally at the DR site to be suspended when a recovery plan is executed.
7 EXECUTING RECOVERY PLAN: TEST MODE
When running a test of the recovery plan, SRM creates a NetApp FlexClone of the replicated FlexVol volume at the DR site. The datastore in this volume is then mounted to the DR ESX cluster, and the virtual machines are configured and powered on. During a test, the virtual machines are connected to the DR test bubble network rather than the public virtual machine network. The following figure shows the vSphere client map of the network at the DR site before the DR test.
creates NFS exports. Then SRM initiates a storage rescan on each of the ESX hosts. If you are using NFS storage, the adapter creates the NFS exports, exporting the volumes to each VMkernel port reported by SRM, and SRM mounts each exported volume. Below is an example of NFS datastores being mounted by SRM at the DR site during a DR test. Note that SRM uses each IP address that was provided in the NFS IP Addresses field above. The first recovered datastore is mounted at the first NFS IP address provided.
Figure 13 Screenshot of VM to Network Map view in vCenter Client while in DR Test Mode

6. When DR testing is complete, click the Continue link in the Recovery Steps tab. SRM will bring down the DR test environment, disconnect the LUN, clean up the FlexClone volume, and return the environment to normal.
7. If you cloned an Active Directory/DNS server at the DR site, it can now be shut down, removed from inventory, and deleted.
- DR site VMware license services.
• The DR site has time synchronized to the same source as the primary site or to a source in sync with it.
• All required NetApp volumes are being replicated using SnapMirror to the DR site.
• The SnapMirror operations have been monitored and are up to date with respect to the designed RPO.
• Required capacity exists on the DR NetApp controller. This refers to the capacity required to support the day-to-day operations that have been planned for in the DR environment.
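The RPO requirement above can be checked programmatically. Below is a minimal sketch, assuming the replication lag has already been parsed from snapmirror status output as an "hh:mm:ss" string (the command invocation and output parsing are omitted here):

```python
# Minimal sketch: compare SnapMirror replication lag against the designed RPO.
# Assumes lag strings in "hh:mm:ss" form; obtaining them from the
# controller is left out of this example.

def lag_seconds(lag: str) -> int:
    """Convert an "hh:mm:ss" lag string to total seconds."""
    hours, minutes, seconds = (int(part) for part in lag.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def lag_within_rpo(lag: str, rpo_minutes: int) -> bool:
    """Return True if the replication lag is within the designed RPO."""
    return lag_seconds(lag) <= rpo_minutes * 60

if __name__ == "__main__":
    # A 45-minute lag meets a 60-minute RPO; a 2h10m lag does not.
    print(lag_within_rpo("00:45:12", rpo_minutes=60))
    print(lag_within_rpo("02:10:00", rpo_minutes=60))
```

A check like this could be scheduled alongside the SnapMirror monitoring already in place so that RPO violations are flagged before a failover is ever needed.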
2. Make a remote desktop connection to a system at the DR site. Launch the vSphere client, log into vCenter, and select the Site Recovery application. A login prompt appears to collect credentials for the primary vCenter server.
3. The connection fails due to the primary site disaster. The site recovery screen indicates that the paired site (in this case, the primary site) is "Not Responding" because that site is down.
4. From the tree on the left, select the recovery plan to be executed and click Run.
9. If necessary and possible, isolate the primary site to prevent any conflicts if infrastructure services should be reestablished without warning.

9 RESYNC AFTER RECOVERY
9.1 REQUIREMENTS AND ASSUMPTIONS
After a disaster has been overcome, it is usually necessary to return operations to the primary site. From a storage standpoint, when considering the process of resyncing the environment back to the primary site, there are three high-level scenarios to consider:
A.
9.2 RESYNCING: PRIMARY STORAGE RECOVERED
Recovering back to an existing primary site, where an outage occurred but the primary array data was not lost, is likely to be the most common scenario, so it is covered here first. If the disaster event did not cause the complete loss of data from the primary site, as in an unplanned power outage or a planned failover, then SnapMirror allows replication of only the delta of new data back from the DR site.
For more information about the snapmirror resync command, see the Resynchronizing SnapMirror section of the Data Protection Online Backup and Recovery Guide for your version of Data ONTAP here: http://now.netapp.com/NOW/knowledge/docs/ontap/ontap_index.shtml. 7. Check the status of the transfer periodically until it is complete. Notice that the new DR-to-primary entry is now in the list of SnapMirror relationships along with the original primary-to-DR relationship.
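To illustrate the reversed relationship, the resync is initiated on the controller that becomes the new destination, which is the original primary; the controller and volume names below are hypothetical:

```
fas-primary> snapmirror resync -S fas-dr:vol_vm_dr fas-primary:vol_vm
```

The -S option names the DR volume as the source, so only the changes written at the DR site since the common Snapshot copy are transferred back to the primary.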
hosts may be required for DR testing just as with the original failover setup. 14. Verify that the SnapMirror relationship has completed transferring and that it is up to date. 15. Perform a SnapMirror update if necessary and perform a test run of the recovery plan. 16. When the recovery plans and other necessary infrastructure have been established at the primary site, schedule an outage to perform a controlled failover. 17.
9.3 RESYNCING: PRIMARY STORAGE LOST
In this case, the outage that caused the failover of the environment to the DR site resulted in the complete loss of data from the primary storage or the complete loss of the infrastructure at the primary site. From a SnapMirror and SRM point of view, recovery from such an event is nearly identical to the steps in the section above.
18. Execute an SRM recovery plan in the opposite direction from the one used in the disaster recovery.
19. Verify the recovered environment and allow users to connect and continue normal business operations.
20. Reestablish primary-to-DR replication.

9.4 REESTABLISHING PRIMARY-TO-DR REPLICATION
Reestablishing normal operations requires the following processes:
1. Reversing the SnapMirror and SRM relationships again to establish the original primary-to-DR site replication and protection.
Another option, with regard to infrastructure server dependencies, is to put such servers in a protection group to be recovered and verified first, before application server protection groups are recovered. Make sure that each protection group/recovery plan contains the appropriate infrastructure servers, such as Active Directory servers, to facilitate testing in the DR test bubble networks.
it is necessary to configure the environment with an igroup for each ESX host, rather than creating one igroup for the entire cluster or group of hosts.

STORAGE RESCANS
In some cases, ESX might require a second storage rescan before a newly connected datastore is recognized. In this case, the SRM recovery plan might fail, indicating that a storage location cannot be found even though the storage has been properly prepared during the DR test or failover.
directories exist, it might be necessary to delete and recreate the protection group for those VMs. If these directories remain, they might prevent the ability to run a DR test or failover.

NFS SUPPORT
NFS support requires SRM 4.0 and vCenter 4.0. ESX 3.0.3, ESX/ESXi 3.5U3+, and 4.0 are also supported for SRM with NFS.

USING THE RW (READ/WRITE) FIELD IN NFS EXPORTS
NFS support requires that replicated NFS datastores be exported from the NetApp array using values in the RW field.
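For illustration, a 7-Mode /etc/exports entry that supplies explicit host values in the rw field might look like the following; the volume and host names are hypothetical:

```
/vol/vol_vm_datastore -rw=esx01:esx02,root=esx01:esx02
```

Listing the ESX hosts (or their VMkernel addresses) explicitly in rw= and root=, rather than exporting read/write to all hosts, keeps the export restricted while still satisfying the RW field requirement.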
VMware vCenter Site Recovery Manager 4.0 Performance and Best Practices for Performance
www.vmware.com/files/pdf/VMware-vCenter-SRM-WP-EN.pdf

SRM System Administration guides and release notes from the VMware vCenter Site Recovery Manager documentation site
www.vmware.com/support/pubs/srm_pubs.
APPENDIX A: VMS WITH NONREPLICATED TRANSIENT DATA
If the transient data in your environment, such as VMware virtual machine swapfiles or Windows system pagefiles, has been separated onto nonreplicated datastores, then ESX and SRM can be configured to recover virtual machines while maintaining the separation of that data. Below is an image describing the environment represented in this document, with the addition that transient data contained in Windows pagefiles and VMware *.
1. In the vCenter server at the primary site, right-click the ESX cluster and select Edit Settings. Click Swapfile Location and choose the option "Store the swapfile in the datastore specified by the host." Click OK.
2. In the vCenter configuration tab for each ESX host, click the Virtual Machine Swapfile Location link.
3. Click the Edit… link and select a shared datastore for this cluster that is located in the primary site.
4.
Delete the sched.swap.derivedName option. This line will be automatically recreated by ESX when the virtual machine is powered on. It is set to the physical path name of the .vswp file location.
c. workingDir = "."
This is the default setting for this option. Leave this setting as is.
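Putting these edits together, the relevant portion of the virtual machine's .vmx file would look like the following; the .vswp path shown is a hypothetical example of the value ESX generates:

```
Remove this line (ESX recreates it at power-on):
    sched.swap.derivedName = "/vmfs/volumes/datastore1/vm1/vm1.vswp"

Leave this line unchanged:
    workingDir = "."
```

After the edit, only the workingDir entry remains; the derivedName entry reappears automatically, pointing at the swapfile datastore configured on the host.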
configure the location of the Windows pagefile, as well as some other miscellaneous temporary data, can be found in TR-3428: NetApp and VMware Virtual Infrastructure 3 Storage Best Practices. 8. Copy the original pagefile vmdk, /vmfs/volumes/templates/pagefile_disks/pagefile.vmdk, to the DR site. This can be done using cp across a NAS share, FTP, and so on. Be sure to copy both the pagefile.vmdk and pagefile-flat.vmdk files. 9.
12. Proceed to the Storage Devices step. Select the virtual disk where the pagefile is stored. Click Browse.
13. Select the datastore where this virtual machine's DR pagefile disk is stored, select the vmdk, and click OK.
14. Confirm that the Storage Devices screen indicates the proper location for the disk to be attached on recovery. Step through the rest of the dialog to complete the settings.
15. Make the appropriate changes for each of the virtual machines until they are all configured properly. 16. Click the Summary tab and confirm the protection group status is OK, indicating it has synced with the DR site. 17. Perform a SnapMirror update, if necessary, to make sure that the virtual machines and their configuration files have been replicated to the DR site. 18. In vCenter at the DR site run a test of the recovery plan.
APPENDIX B: NON-QUIESCED SMVI SNAPSHOT RECOVERY
By default, the NetApp adapter recovers FlexVol volumes to the last replication point transferred by NetApp SnapMirror. The 1.4.3 release of the NetApp SRM adapter provides the capability to recover NetApp volume snapshots created by NetApp SnapManager for Virtual Infrastructure. This feature is not available in NetApp FAS/V-Series Storage Replication Adapter 2.0 for SRM 5.
NetApp provides no representations or warranties regarding the accuracy, reliability, or serviceability of any information or recommendations provided in this publication or with respect to any results that might be obtained by the use of the information or observance of any recommendations provided herein.