High Availability FailOver/iX Manual
HP e3000 MPE/iX Computer Systems
Edition 2
Manufacturing Part Number: 32650-90911
E0803
U.S.A.
Notice The information contained in this document is subject to change without notice. Hewlett-Packard makes no warranty of any kind with regard to this material, including, but not limited to, the implied warranties of merchantability or fitness for a particular purpose. Hewlett-Packard shall not be liable for errors contained herein or for direct, indirect, special, incidental or consequential damages in connection with the furnishing or use of this material.
Contents
1 Introduction
   Prerequisites
2 Product Description
   How Failover Works
5 Monitoring Status
   Failover Status Report
   Failover Status Descriptions
   Ready
Preface
This manual documents the High Availability FailOver/iX (HAFO) utilities for HP e3000 systems. Each chapter of this manual is described briefly below.
Chapter 1, "Introduction." Contains a brief description of the HAFO utility concept.
Chapter 2, "Product Description." Describes how a failover works, including the triggers and a generalized sequence of events after a failover. Each of the four major components is listed and described.
Chapter 3, "Installation.
1 Introduction
This manual documents the High Availability FailOver/iX (HAFO) utilities for HP e3000 systems running MPE/iX 7.0 and 7.5. HAFO protects the system and user volume sets from disk I/O path failures by allowing a System Administrator to configure a “Primary Path” for normal I/O and an “Alternate Path” for temporary use while a failed primary path is being recovered. HAFO must be used only with multi-ported disk storage array products.
Prerequisites
The reader of this document must be familiar with the following subjects:
• Performing HP e3000 system administration tasks, such as configuring disk Ldevs using SYSGEN. See the manual Performing System Management Tasks at http://www.docs.hp.com/mpeix/all/
• Creating and using HP e3000 User Volume Sets. See the manual Volume Management Reference Manual at http://www.docs.hp.
2 Product Description
Once installed and configured, the HAFO software is intended to provide continued data access to supported high availability storage arrays under a variety of conditions. There are typically five "points of failure" in any I/O subsystem:
1. Disk drive mechanisms
2. Disk power supply
3. Disk array controller
4. I/O host adapter card
5. Cabling and FC switches
HAFO event information and data structures are memory resident. This eliminates the need for disk file access to perform high availability failover, which is especially advantageous if the path to Ldev 1 should fail. To configure and manage the HAFO functions, SYSGEN has a new section for HAFO called "HA" (short for HAUTIL). The "ha" section is accessed through SYSGEN’s "io" menu.
If any other error type occurs (such as a data transmission or device error), the I/O subsystem manages the error and performs corrective action. HAFO remains idle and does not participate. In addition, HAFO remains idle when error types are received from non-high availability devices.
CAUTION The difference between a hung device and an extremely slow device can be very difficult to determine.
User Notification of Failover
The MPE System Logs will contain Type 111 log records (subsystem 900) to document the failover event. Log entries are also created as a result of HAFO activities, including configuration, path verification, and path switching. In addition, the following console error message is displayed upon failover and every five minutes thereafter. The repeating message can be turned off with an MPE “reply” ([CTRL] [A] REPLY or :REPLY).
Components
The HAFO product has multiple components:
• System Boot Failover Initialization Utility
• SYSGEN HAFO commands
• HASTAT HAFO status report
• HAFOCONF configuration file
Each of these components is briefly described below.
System Boot Failover Initialization Utility
During system boot, the HAFO configuration utility reads the HAFOCONF configuration file and arms HAFO.
High Availability Failover Status
With the installation of HAFO, a new reporting program, HASTAT, is supplied. The report lists each configured Ldev, its primary and alternate paths, and their status. This feature is documented in detail in Chapter 5, "Monitoring Status." Additionally, SYSGEN “HA” contains a STATUS (st) command, which produces the same output as the HASTAT program.
3 Installation
HAFO requires no subsystem product installation. It is an enhancement to MPE/iX FOS and is installed via operating system patches. On MPE/iX 7.0 and MPE/iX 7.5 releases, you must install MPEMXG9 and MPEMXH5 (or superseding patches). If you are updating from 6.5 with a current, active HAFO configuration, you need to recreate your HAFOCONF file (see Chapter 4). Once the patches are installed correctly, HAFO is available and requires only HAFO configuration.
System Requirements
The following is a list of HAFO supported devices and connectivity restrictions:
• SureStore E Disk Array XP256 (SCSI attached ONLY).
• SureStore E Disk Array XP512/XP48 with either native FC or Fabric Router attached.
• SureStore E Disk Array XP1024/XP128 with either native FC or Fabric Router attached.
• Virtual Array 7100 is not supported because the VA7100 contains one active port and one passive port. HAFO requires two active ports.
4 Configuration
Once HAFO software is installed, the core components are active. This means HAFO is monitoring the I/Os for any HAFO (like) reply messages, as documented in Chapter 2, "Product Description." However, for failover protection, the Ldevs must be configured for HAFO and the configuration must be activated. Configuring an Ldev creates an entry in the HAFO configuration file, HAFOCONF. The HAFOCONF file resides in the configuration group on Ldev 1.
3. The storage array’s LUN must have the same LUN address for both the primary path and alternate path. If a device is LUN 20 on the primary path, it must be LUN 20 on the alternate path. This also means that each device LUN must be unique across both paths.
4. The total number of devices assigned to a primary and alternate path pair should not exceed recommended maximums for the HBA you are using. In the case of the NIO FW/SCSI card, no more than 10 devices are supported.
The SCSI target and LUN configured in the (io) section of SYSGEN for a primary path cannot also be configured in the (io) section of SYSGEN on the alternate path (configuration restriction #3). For example, if the path for Ldev 1 were 8.0.0 and the alternate path were 48, then 48.0.0 should not be used for any of the Ldevs on that path. None of the Ldevs 101-103 could be configured as 48.0.0. A correct configuration would be to make Ldevs 1, 30, and 31 be 8.0.0, 8.1.0, and 8.2.0.
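The restriction above can be checked mechanically before any SYSGEN changes are made. The following Python sketch is illustrative only and is not part of HAFO. It assumes that the alternate path has fewer dot-separated components than the primary path and that the missing trailing components (for example, target and LUN) carry over unchanged; the function names and the dictionary layout are hypothetical.

```python
def alternate_device_path(primary_path, alternate_path):
    """Derive the device path an Ldev would occupy on its alternate path.

    Assumption (not stated in the manual): the trailing components that the
    alternate path lacks are copied from the primary path unchanged.
    """
    primary_parts = primary_path.split(".")
    alternate_parts = alternate_path.split(".")
    carried = len(primary_parts) - len(alternate_parts)
    if carried <= 0:
        raise ValueError("alternate path must be shorter than the primary path")
    return ".".join(alternate_parts + primary_parts[-carried:])


def find_path_conflicts(hafo_devices, io_paths):
    """Return (hafo_ldev, reserved_path, conflicting_ldev) tuples.

    hafo_devices: {ldev: (primary_path, alternate_path)} planned for HAFO.
    io_paths: {ldev: path} as configured in the SYSGEN io section.
    """
    conflicts = []
    for ldev, (primary, alternate) in sorted(hafo_devices.items()):
        reserved = alternate_device_path(primary, alternate)
        for other_ldev, path in sorted(io_paths.items()):
            if path == reserved:
                conflicts.append((ldev, reserved, other_ldev))
    return conflicts


# Ldev 1 at 8.0.0 with alternate path 48 reserves 48.0.0.
hafo = {1: ("8.0.0", "48")}
io_config = {1: "8.0.0", 101: "48.0.0", 102: "48.1.0"}
print(find_path_conflicts(hafo, io_config))
```

With the sample data, Ldev 101 is reported because it occupies 48.0.0, the address that Ldev 1 would need on its alternate path.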
5. Ensure that the array itself is configured to allow each primary path controller to talk to the Ldevs on its corresponding alternate path.
6. When the entire plan and map are complete, configure the primary and alternate data paths in SYSGEN via the HAutil interface.
CAUTION Mirrored Disk/iX or Cluster/iX cannot be configured as HAFO devices. Mirrored Disk/iX, Cluster/iX, and HAFO can be used on the same system with multiple volumes and multiple arrays.
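Two of the planning rules, unique LUNs across a path pair and a maximum device count per pair, lend themselves to a quick check of the plan before it is entered in SYSGEN. The Python sketch below is a hypothetical planning aid, not an HAFO utility. The tuple layout and function name are assumptions, and the default limit of 10 is the figure quoted for the NIO FW/SCSI card; other HBAs may have different recommended maximums.

```python
from collections import defaultdict


def check_plan(devices, max_devices_per_pair=10):
    """Check a HAFO plan and return a list of problem descriptions.

    devices: list of (ldev, primary_adapter, alternate_adapter, lun) tuples.
    """
    problems = []
    by_pair = defaultdict(list)
    for ldev, primary, alternate, lun in devices:
        if primary == alternate:
            problems.append(f"Ldev {ldev}: primary and alternate path are the same")
            continue
        # The pair is the same whichever adapter is primary for a given Ldev.
        by_pair[frozenset((primary, alternate))].append((ldev, lun))

    for pair, members in by_pair.items():
        name = " / ".join(sorted(pair))
        if len(members) > max_devices_per_pair:
            problems.append(
                f"pair {name}: {len(members)} devices exceeds {max_devices_per_pair}"
            )
        first_ldev_for_lun = {}
        for ldev, lun in members:
            if lun in first_ldev_for_lun:
                problems.append(
                    f"pair {name}: LUN {lun} used by Ldev "
                    f"{first_ldev_for_lun[lun]} and Ldev {ldev}"
                )
            else:
                first_ldev_for_lun[lun] = ldev
    return problems


plan = [
    (450, "0/6/2/1", "0/6/2/0", 3),
    (451, "0/6/2/1", "0/6/2/0", 4),
    (452, "0/6/2/0", "0/6/2/1", 5),
    (453, "0/6/2/0", "0/6/2/1", 3),  # clashes with Ldev 450
]
for problem in check_plan(plan):
    print(problem)
```

Grouping by an unordered pair reflects the rule that a LUN must be unique across both paths, regardless of which side is primary for a particular Ldev.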
HAFO Configuration Commands
HAFO configuration is a sub-menu of SYSGEN. The sub-menu, HAFO (ha), is found under the "io" sub-menu of SYSGEN.
:SYSGEN
>io
>ha
HAFO configuration commands:
1. ADDCONF (ad) to add each Ldev's configuration path to the HAFO configuration file.
2. LISTCONF (li) to display configurations.
3. DOHA (do) to verify and activate the configuration.
4. DELCONF (de) to correct any mistakes or to delete the entry for an Ldev.
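When the same menu navigation is repeated often, the keystroke sequence can be assembled ahead of time. The Python sketch below is a hypothetical helper, not an HP-supplied tool. It only builds the list of lines to enter after starting SYSGEN; the command names and abbreviations are those listed in this manual, and the sketch does not add any exit, hold, or keep steps.

```python
# Full names and abbreviations of the ha menu commands named in this manual.
HA_COMMANDS = {
    "addconf", "ad", "listconf", "li", "doha", "do", "delconf", "de",
    "gonext", "go", "status", "st", "redo", "re", "help", "he", "exit", "ex",
}


def build_ha_session(commands):
    """Return the lines to enter in SYSGEN to reach the ha menu and run commands."""
    lines = ["io", "ha"]  # sysgen> io, then io> ha
    for command in commands:
        words = command.split()
        if not words:
            continue  # ignore blank entries
        if words[0].lower() not in HA_COMMANDS:
            raise ValueError(f"not an ha menu command: {command!r}")
        lines.append(" ".join(words))
    return lines


for line in build_ha_session(["li", "doha"]):
    print(line)
```

Rejecting unknown command words early keeps a typing mistake from reaching the live configuration.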
ADDCONF (ad)
Once the configuration file is built for a specific SYSGEN base configuration, and the MPE/iX Volumes are initialized, the Ldevs must be configured for HAFO with their primary and alternate paths. Fully qualified paths are required. Research and obtain this information prior to configuration.
Timeout Parameter
The Timeout parameter is used to allow or disallow failovers caused by slow or slightly unresponsive disk arrays. Timeout defaults to “true”, which gives the best protection but may cause “false” failovers under heavy I/O loads. If the observed failovers have been determined to be “false” failovers caused by heavy I/O traffic, setting the Timeout parameter to “false” may reduce them, but it also reduces the protection offered by HAFO.
LISTCONF (li)
LISTCONF displays the entire configuration in the HAFOCONF file or the configuration for a specific Ldev. The syntax is LI, optionally followed by an Ldev number. For example:
ha> li
The LI command without any qualifier lists all Ldevs configured with their primary and alternate paths.
Ldev ===== 350 351 352 353 450 451 452 453 Primary Path ==================== 0/4/0/0.70954.23 0/4/0/0.70954.24 0/6/0/0.73289.25 0/6/0/0.73289.26 0/6/2/1.3.3 0/6/2/1.3.4 0/6/2/0.3.
DOHA
In most cases, HAFO configurations can be activated on-line with the DOHA command.
NOTE The exception is when a DELCONF command has been issued. An Ldev that was previously activated cannot be de-activated on-line after it is deleted; a reboot is necessary.
Configuration Troubleshooting Validation Errors ha> doha Start of validation for all HAFO configured devices. ===================================================== VALIDATING ** Ldev: 50 Pri path: 8.15.0 Alt path: 48 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Ldev 50 configuration Validated Successfully VALIDATING ** Ldev: 51 Pri path: 8.15.
Configuration Validation Errors Displayed Troubleshooting a Validation Error If DOHA returns an error state, check Appendix B, "Error Messages". Double check for SYSGEN configuration errors or HAFO configuration errors. If the Ldev, primary path, or alternate path needs to be changed, delete the configured Ldev using DELCONF, add the Ldev back in using ADDCONF, do a hold and keep, and then reboot the system. Execute an ADDCONF command, hold and keep, and execute the DOHA command.
Configuration DELCONF (de) DELCONF (de) To remove a configuration that is incorrect or is no longer desired, use the DELCONF command. The syntax is: DE For example: ha> de 350 ha> li Ldev ===== 351 352 353 450 451 452 453 NOTE Primary Path ==================== 0/4/0/0.70954.24 0/6/0/0.73289.25 0/6/0/0.73289.26 0/6/2/1.3.3 0/6/2/1.3.4 0/6/2/0.3.5 0/6/2/0.3.6 Alternate Path Timeout ================== ======= 0/6/0/0.73289 True 0/4/0/0.70954 False 0/4/0/0.
GONEXT (go)
The GONEXT command causes the Ldev to switch from the current, “Ready”, path to the “Validated” path according to the HAFOCONF file. The most important use of this command is after the repair of a “failed” primary path, when GONEXT switches the Ldev from the alternate path back to the primary path. GONEXT can also be used during preventive maintenance activities to temporarily switch from the primary path to the alternate path.
Miscellaneous HAFO Commands
REDO (re)
This is the standard MPE REDO command that re-displays the last command entered. This command may then be edited using the standard line editing commands.
HELP (he)
To get online help on one of the listed HAFO commands, type HELP and the command desired.
EXIT (ex)
To exit the HAFO command menu:
ha> ex
Configuration STATUS (st) STATUS (st) Similar to HASTAT, the STATUS command gives detailed information regarding devices covered under HAFO and status of those devices. ha> st High Availability Failover Device Status Ldev Primary Path Alternate Path Pri. Status Alt. Status ===== ==================== ================== =============== =============== 350 351 352 353 450 451 452 453 0/4/0/0.70954.23 0/4/0/0.70954.24 0/6/0/0.73289.25 0/6/0/0.73289.26 0/6/2/1.3.3 0/6/2/1.3.4 0/6/2/0.3.5 0/6/2/0.3.
Special Considerations for Configuration
Rebooting After a HAFO Event
Rebooting the system after a HAFO event has occurred, but before the primary path is repaired, is a special situation, especially if the event occurred on the path for Ldev 1. HAFO is not active until very late in the MPE/iX system boot process. Non-MPE interfaces such as the system Boot Prompt (PDC) and ISL have no knowledge of HAFO.
Then, edit the file down to just the lines that list your Ldevs along with their primary and alternate paths. This list, with “AD” inserted, can then be used as the core of an input file for SYSGEN on 7.0/7.5 to create a new HAFOCONF. For example, a file containing the following commands, passed to SYSGEN as input, would create a HAFOCONF file with Ldevs 450-453 using the HBA pair at 0/6/2/0 and 0/6/2/1.
5 Monitoring Status
Failover from the primary path to the alternate path is automatic and allows applications to continue uninterrupted. In the event of failover, repeated warnings appear on the system console indicating that a failover event has occurred. The warning specifies which high availability array Ldev has experienced a failover event.
HIGH AVAILABILITY FAILOVER IS STARTED FOR Ldev# IN DISK ARRAY. NO DATA LOSS OR CORRUPTION. SYSTEM OPERATION WILL CONTINUE.
Failover Status Report Failover Status Report A sample report is shown here. :HASTAT High Availability Failover Device Status Ldev Primary Path Alternate Path Pri. Status Alt. Status ===== ==================== ================== =============== =============== 350 351 352 353 450 451 452 453 0/4/0/0.70954.23 0/4/0/0.70954.24 0/6/0/0.73289.25 0/6/0/0.73289.26 0/6/2/1.3.3 0/6/2/1.3.4 0/6/2/0.3.5 0/6/2/0.3.6 0/6/0/0.73289 0/6/0/0.73289 0/4/0/0.70954 0/4/0/0.
Failover Status Descriptions
The following is a complete list of the failover statuses that may appear in the HASTAT HAFO Status Report:
Status shown during normal operation:
• Ready
• Validated
Status associated with a failover event:
• Array Failure
• Timeout/No Reply
• Failover Failed
Other miscellaneous status values:
• GONEXT started
• Not HAFO dev
• Not Validated
• Unknown status
Each status is described in the sections below.
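For sites that post-process HASTAT output, the three groups of status values above can be captured in a small lookup. The Python sketch below is illustrative and not part of HAFO. The group names and the needs_attention policy are assumptions made for this example; only the status strings themselves come from this manual.

```python
# Status strings as listed in this manual, grouped the same way.
STATUS_GROUPS = {
    "normal": ("Ready", "Validated"),
    "failover": ("Array Failure", "Timeout/No Reply", "Failover Failed"),
    "miscellaneous": (
        "GONEXT started", "Not HAFO dev", "Not Validated", "Unknown status",
    ),
}


def classify_status(status):
    """Return the group name for a status string, ignoring case and extra spaces."""
    wanted = " ".join(status.split()).lower()
    for group, names in STATUS_GROUPS.items():
        if wanted in (name.lower() for name in names):
            return group
    raise ValueError(f"status not listed in the manual: {status!r}")


def needs_attention(primary_status, alternate_status):
    """Hypothetical policy: flag an Ldev unless both paths show a normal status."""
    return any(
        classify_status(status) != "normal"
        for status in (primary_status, alternate_status)
    )


print(classify_status("Timeout/No Reply"))
print(needs_attention("Ready", "Validated"))
```

A status in the normal group does not by itself prove that a path is healthy, so a check like this complements the status descriptions rather than replacing them.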
3. A GONEXT command that is successful should leave the original path in Validated status and the new path in Ready status.
Since the Validated status is set ONLY at the above times, it is critical to understand that “Validated” status may be very out of date. The non-active path is NOT monitored by HAFO and may fail silently in some HAFO configurations.
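The effect of a successful GONEXT on the two path statuses can be pictured as a simple swap. The Python sketch below is a toy model for illustration only, not a description of HAFO internals. The class and function names are hypothetical, and the model covers only the case in which one path is Ready and the other is Validated.

```python
from dataclasses import dataclass


@dataclass
class LdevPaths:
    primary_status: str
    alternate_status: str

    def active_path(self):
        """Return which path carries I/O, based on which one is Ready."""
        if self.primary_status == "Ready":
            return "primary"
        if self.alternate_status == "Ready":
            return "alternate"
        return None


def gonext(paths):
    """Model a successful GONEXT: the Ready and Validated statuses trade places."""
    statuses = {paths.primary_status, paths.alternate_status}
    if statuses != {"Ready", "Validated"}:
        raise ValueError("model covers only one Ready and one Validated path")
    return LdevPaths(paths.alternate_status, paths.primary_status)


# After a repair, I/O is on the alternate path and the primary is Validated.
before = LdevPaths(primary_status="Validated", alternate_status="Ready")
after = gonext(before)
print(before.active_path(), "->", after.active_path())
```

The model deliberately refuses any other combination, because a Validated status may be out of date and the manual does not describe GONEXT from other states in this section.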
Failover Status Not HAFO dev This status message indicates that the Ldev on the path is not a high availability array. The list of supported arrays is documented in the “System Requirements” section of this document. Ldev Not Validated The Not Validated status is shown when HAFO is unable to validate both hardware paths (at boot time or during DOHA command). Unknown Status HAFO/iX has been written to account for a large number of known errors.
6 Recovering From a Failover
Once high availability failover to the alternate data path has engaged, users continue to access data on that high availability array Ldev with no restrictions. I/Os outstanding during the failure and those issued after the failover are processed on the alternate data path without interruption.
NOTE HAFO does not protect against multiple failures or cascading failure types. It is imperative that the broken path be repaired as soon as possible.
Repairs
The repair of hardware failures on disk array products, especially those connected to a Fibre Channel network, can be a complex activity. Normal hardware troubleshooting techniques must be followed to determine the source of the failover event. This may include examination of system or device error logs. The methods and procedures for these troubleshooting activities are outside the scope of the HAFO product and this manual.
GONEXT (go)
To execute GONEXT:
1. Start SYSGEN.
2. At the sysgen> prompt, enter io.
3. At the io> prompt, enter ha.
4. Execute the GONEXT (go) command, giving the Ldev number. For example:
ha> go 8
5. Then execute the DOHA command.
In this example, all I/Os for Ldev 8 will be rerouted to the configured primary path. Should this attempt fail, you will be notified and the I/Os will continue on the alternate data path. Appendix A, "Sample Failover and Recovery.
Special Considerations for Failed Paths
HAFO is not active until late in the MPE/iX system boot sequence (ISL> START). When the system boots, it first configures devices on the primary path (according to SYSGEN io>). Then it mounts all disk volumes using ONLY the primary path. After the volumes are mounted, MPE attempts to validate the HAFO configuration.
Recovering From a Failover Special Considerations 450 0/6/2/1.3.3 0/6/2/0 Ready Validated 451 0/6/2/1.3.4 452 0/6/2/0.3.5 453 0/6/2/0.3.6 0/6/2/0 0/6/2/1 0/6/2/1 Timeout/No Reply Ready Ready Validated Ready Validated High Availability Failover Device Status Ldev Primary Path Alternate Path Pri. Status Alt. Status ===== ==================== ================== =============== =============== 350 351 352 353 0/4/0/0.70954.24 0/4/0/0.70954.24 0/6/0/0.73289.25 0/6/0/0.73289.26 0/6/0/0.73289 0/6/0/0.
Rebooting With a Failed Primary Path for Ldev 1
Ldev 1 can be configured for HAFO just as any other Ldev. It is, however, a very special situation when the system needs to be rebooted while the primary path for Ldev 1 is broken. The user will need to make adjustments at the system primary boot interface before booting the system. If the primary path for Ldev 1 is broken, the system primary path needs to be changed to the alternate path for Ldev 1.
Special Considerations In order to remedy this, you must change the system primary path from 8 to 15. Please refer to the “System Startup, Configuration, and Shutdown Reference Manual” for information on changing the system primary path. After the system primary path is changed to 15.0.0, the system can find Ldev 1 and boot. NOTE In the previous example, there was only one Ldev on the broken system primary path.
7 Quick Start List
The following are directions for creating a simple HAFO environment using an XP512.
1. Create a LUN in the storage array that is accessible from 2 different array ports. In this example we will configure this LUN as MPE/iX Ldev 90.
2. Connect the appropriate fiber cables between the storage array and the server’s HBAs.
a.
A Sample Failover and Recovery
The following scenario illustrates a possible HAFO situation.
Sample Scenario
The following is a sample HAFO status report extracted from a HASTAT display. The report shows several failover statuses, which are explained in succeeding sections of this appendix. Troubleshooting advice for this sample scenario is also provided.
High Availability Failover Device Status
Ldev Primary Path Alternate Path Pri. Status Alt.
Appendix - A Corrective Action: Failure on Ldev 350 and 351 As indicated by the primary path status for Ldev 350, it is possible that the array controller has failed. This should be verified by your official support representative and the system diagnosed to verify the actual broken component. If it is the array controller, then in many cases, array controllers can be replaced on-line with the host powered-up and array powered-up.
Appendix-B B Error Messages Command Input Errors (HAFOERR — 1…27) 1 MESSAGE: Invalid Hautil Command — (HAFOERR 1) CAUSE: Command entered is not a valid command. ACTION: See Help. Enter valid command. 3 MESSAGE: Missing Ldev parameter — (HAFOERR 3) CAUSE: Parameter 1 for command is missing. ACTION: Input Parameter 1 value with command. 4 MESSAGE: Missing Primary Path parameter — (HAFOERR 4) CAUSE: Parameter 2 for command is missing. ACTION: Input Parameter 2 value with command.
Appendix - B 10 MESSAGE: Invalid character in Alternate path parameter — (HAFOERR 10) CAUSE: Parameter 3 has an invalid value. ACTION: Input correct Parameter 3 value with command. 11 MESSAGE: Ldev parameter too long — (HAFOERR 11) CAUSE: Parameter 1 has an invalid value. ACTION: Input correct Parameter 1 value with command. 12 MESSAGE: Primary Path parameter too long — (HAFOERR 12) CAUSE: Parameter 2 has an invalid value. ACTION: Input correct Parameter 2 value with command.
Appendix-B 18 MESSAGE: Ldev not mounted as MASTER or MEMBER — (HAFOERR 18) CAUSE: The volume associated with specified Ldev has not mounted as a master or member. ACTION: Check the state of the volume set using "DSTAT". The volume must be mounted as a master or member. If not, use the "VOLUTIL" utility to change to the appropriate state or remove Ldev from HAFO configuration. 19 MESSAGE: Ldev does not exist — (HAFOERR 19) CAUSE: Ldev input was not valid. ACTION: Input correct Ldev with command.
Appendix - B 25 MESSAGE: The path format for this Ldev is not correct — (HAFOERR 25) CAUSE: The path format for this Ldev is not correct. ACTION: Check for too many periods or missing periods in path specification. 26 MESSAGE: No devices are configured for HAFO — (HAFOERR 26) CAUSE: No devices have been configured for HAFO. ACTION: Verify changes made to HAFO configuration have been made permanent then try again. 27 MESSAGE: HAFO INTERNAL ERROR — (HAFOERR 27) CAUSE: Primary path is already configured.
Appendix-B 203 MESSAGE: HAFO INTERNAL ERROR — (HAFOERR 203) CAUSE: An End Of File condition was encountered while reading the HAFOCONF file. The HAFOCONF file is a pre-formatted file and an End Of File in this context is not a normal Condition and may indicate corruption. ACTION: Call your HP Support Representative for assistance. 204 MESSAGE: HAFO INTERNAL ERROR — (HAFOERR 204) CAUSE: An error occurred while executing an FREAD intrinsic call. ACTION: Call your HP Support Representative for assistance.
Appendix - B 503 MESSAGE: HAFO INTERNAL ERROR — (HAFOERR 503) CAUSE: The Physical Path Table entries are not linked. ACTION: Call your HP Support Representative for assistance. PORT ACCESS ERRORS (HAFOERR — 1000…1017) 1000 MESSAGE: HAFO INTERNAL ERROR — (HAFOERR 1000) CAUSE: An error occurred while attempting to create HAFO Utilities/Device Manager communications Port. ACTION: Call your HP Support Representative for assistance.
1006 MESSAGE: HAFO INTERNAL ERROR — (HAFOERR 1006) CAUSE: An error occurred while attempting to release a message frame allocated for HAFO Utilities. ACTION: Call your HP Support Representative for assistance.
1007 MESSAGE: HAFO INTERNAL ERROR — (HAFOERR 1007) CAUSE: An error occurred while attempting to send an I/O Request to the Device Manager. ACTION: Check to make sure that there are no mismatches between the SYSGEN (io) section configuration and the HAFO (ha) section configuration.
Appendix - B 1009 MESSAGE: HAFO INTERNAL ERROR — (HAFOERR 1009) CAUSE: An error occurred while attempting to send a message to Device Manager. ACTION: Check to make sure that there are no mismatches between the SYSGEN (io) section configuration and the HAFO (ha) section configuration. All HAFO (ha) configured primary paths must match the SYSGEN (io) path configuration for all HAFO Ldevs. Check that there are no hardware problems and make sure no failover events have occurred.
Appendix-B If the HAFO primary and SYSGEN (io) paths are in sync and all HAFO Ldevs are operating on their primary paths, then call your HP Support Representative for assistance. 1012 MESSAGE: HAFO INTERNAL ERROR — (HAFOERR 1012) CAUSE: Failover failed — Volume label verification failed.
Appendix - B Check that there are no hardware problems and make sure no failover events have occurred. If a failover event has occurred, fix the primary path and make sure all HAFO Ldevs are operating on their primary path. If the HAFO primary and SYSGEN (io) paths are in sync and all HAFO Ldevs are operating on their primary paths, then call your HP Support Representative for assistance.
Appendix-B 1017 MESSAGE: HAFO INTERNAL ERROR — (HAFOERR 1017) CAUSE: A message was received from the Device Manager with an invalid class. ACTION: Check to make sure that there are no mismatches between the SYSGEN (io) section configuration and the HAFO (ha) section configuration. All HAFO (ha) configured primary paths must match the SYSGEN (io) path configuration for all HAFO Ldevs. Check that there are no hardware problems and make sure no failover events have occurred.
1504 MESSAGE: **warning** Information display mode only — (HAFOERR 1504) CAUSE: User does not have "save" capability. You may continue but only with display capability. ACTION: User must have System Manager capability to make any configuration changes.
1505 MESSAGE: Need System Manager capability to make configuration changes — (HAFOERR 1505) CAUSE: The user attempted to make changes to the HAFOCONF configuration file without having the proper capabilities.