Technical white paper

Implement high-availability solutions—easily and effectively
HP Instant Capacity on HP Integrity Superdome 2 with HP-UX 11i v3

Table of contents
Introduction
HP iCAP program overview
HP Serviceguard overview
HP Serviceguard failover models
Determining high-availability requirements
HP iCAP for blades with processor cores and memory
HP iCAP components and high availability
HP Serviceguard considerations
HP TiCAP
HP TiCAP and high availability
Example: HP TiCAP
Performance implications
Automate member failover using core usage rights seizure
Automated (Serviceguard) member failover from a partial outage with nPartitions
Automated (Serviceguard) complete outage member failover and group manager failover
Monitor Group Manager health
Additional HP GiCAP high-availability solutions
Summary
Appendix
Command reference
Summary of usage rights seizure
Scripts for implementing failover with HP Serviceguard
Additional HP Serviceguard scripting considerations
Introduction Most enterprises like yours today depend on mission-critical applications to run their businesses. Traditionally, mission-critical applications require high levels of availability, which has proven expensive because of the need for redundancy, particularly for standby servers. Today, IT managers are challenged by the need to reduce budgets while still delivering services needed to run the business.
HP Serviceguard overview
HP Serviceguard builds on the concept of virtualization by grouping multiple servers or partitions into a cluster to provide highly available application services that ensure data integrity. High-availability clusters are created from a networked grouping of HP Integrity or HP 9000 servers (and, if necessary, partitions) as cluster nodes, configured with redundant hardware and software components to eliminate single points of failure.
Upon a Serviceguard node failure, the Serviceguard packages protecting the applications on the failed node are failed over to another node within the cluster, as defined in the package configuration. The following list shows several models for configuring Serviceguard clusters to handle package failovers:
• Active/Standby: one or more cluster nodes are reserved for failover use. Upon a failover, applications maintain their level of performance by using the spare capacity provided by the standby node.
HP iCAP for blades with processor cores and memory
iCAP enables you to purchase and install additional processing power using a two-step purchase model:
1. Purchase system components (processor cores and memory) at a fraction of the regular purchase price because the usage rights are not included. These iCAP components are inactive but installed and ready for use.
2. Add extra capacity as needed by paying the remainder of the regular purchase price for the usage rights to activate the components.
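As a minimal sketch of this second step (assuming the standard iCAP commands are already installed in /usr/sbin on the partition), you could review the current capacity and then activate additional cores once the usage rights have been purchased:

# Review active and inactive core counts and available usage rights
/usr/sbin/icapstatus

# Activate four additional cores using the newly purchased usage rights
/usr/sbin/icapmodify -a 4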
HP Serviceguard considerations After iCAP components have been purchased and installed, the replacement of failed processors by iCAP processors is automatic. You do not need to do anything to enable this feature in any Serviceguard configuration (or any other configuration).
HP TiCAP and high availability
You can use TiCAP to provide cost-effective high availability for any of the models covered in the “HP Serviceguard failover models” section. To achieve these high-availability solutions, you must first configure iCAP processors on the cluster nodes. These processors can then be activated temporarily in failover situations, using additional incremental purchases of TiCAP.
Figure 4. HP TiCAP example before failure. [Figure: two SD2 systems. Sys1: 2 blades, 16 active cores. Sys2: 2 blades, 4 active cores and 12 inactive (iCAP) cores, with 3 units (90 days) of TiCAP available.]

Now, imagine that a failure on Sys1 incapacitates the server, forcing a move of existing workloads to Sys2.
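To take on those workloads, you might activate the 12 inactive cores on Sys2 and draw on its temporary capacity balance; a sketch assuming the -t option described later in this paper, run either on the partition or through the Onboard Administrator:

Sys2> icapmodify -a 12 -t
OR
OA> icapmodify -p 2 -a 12 -t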
Figure 5. HP TiCAP example after failure and one day after activations on Sys2. [Figure: Sys1's cores are inactive due to the failure; Sys2 now has 16 active cores (12 of them activated with TiCAP), with 78 days of TiCAP remaining.]

When the failure on Sys1 is resolved, you can stop the use of TiCAP on Sys2 by deactivating the 12 processor cores:

Sys2> icapmodify -d 12
OR
OA> icapmodify -p 2 -d 12

You can also redeploy the applications to Sys1.
The application startup time is slightly longer than with a typical Serviceguard package control script that does not invoke TiCAP commands, because of the overhead of sensing the status of the active system and activating the inactive cores on the failover or failback system.

HP Global Workload Manager considerations
The TiCAP solution can also be automated using the HP Global Workload Manager (gWLM) product.
Figure 6.
you can use a GiCAP command to extract (seize) processor core usage rights from that server and transfer those usage rights to the Group Manager. Then, using normal activation commands on the failover server, you can use those transferred usage rights to activate additional processor cores and increase capacity. Because GiCAP does not require any partitions to be running on a member in order to seize its usage rights, it enables you to recover from more serious outages.
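As a hedged sketch of this sequence (the member-selection syntax for icapmanage -x is an assumption, and the member and partition names are hypothetical), the administrator might run the seizure on the active Group Manager and then activate cores on a partition of the failover server:

# On the Group Manager: seize the usage rights of the failed member
ap1> icapmanage -x -m oa1

# On the failover server: activate additional cores with the transferred rights
db2> icapmodify -a 12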
Failover involving unplanned downtime: the OA modules and the complex are up, but one or more of the partitions have failed
Recovery from a server where the OA is still running is the simplest and most straightforward case. In this situation, the iCAP software uses normal migration of usage rights, because the software can contact the still-running OA to commit the changes for the server. A partition that goes down releases all its committed rights to the OA.
These usage rights are available to be migrated to other partitions, under control of the Group Manager, as shown in figure 8. Figure 8.
Figure 9.
Consider the same GiCAP group as shown in figure 7, except this time there are two critical applications on server 1 (OA1): one running in partition Db1 and the other running in partition Db3. The application running in Db3 requires 12 active cores, so only 12 core usage rights were purchased and allocated to that partition. On server 2 (OA2), Db2 and Db4 are running low-priority tasks and have only 4 and 8 active cores, respectively. Temporary capacity is not being used in this group.
This time the rights seizure has resulted in 27 available core usage rights being held by the Group Manager (15 seized from Db1; 12 seized from Db3). As before, the failover is not complete until the administrator activates cores on server 2. In this case, the administrator chooses to activate 12 additional cores for Db2, and only eight additional cores for Db4, as that is all that is needed to run each application.
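These activations use the same command syntax as the failback commands shown next; a sketch (the OA partition numbers follow those used in the failback example below):

db2> icapmodify -a 12    OR    OA2> icapmodify -p 2 -a 12
db4> icapmodify -a 8     OR    OA2> icapmodify -p 4 -a 8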
As before, failback to server 1 can be accomplished simply by deactivating cores on server 2 and activating cores on server 1 (the normal method of usage rights migration within a GiCAP group):

db2> icapmodify -d 12    OR    OA2> icapmodify -p 2 -d 12
db4> icapmodify -d 8     OR    OA2> icapmodify -p 4 -d 8
db1> icapmodify -a 15    OR    OA1> icapmodify -p 1 -a 15
db3> icapmodify -a 12    OR    OA1> icapmodify -p 3 -a 12

To return Db3 to its original state, we activated 12 cores, even though only eight cores were activated on the failover partition (Db4).
HP TiCAP and rights seizure In the previous examples, no TiCAP was being used. If TiCAP is in use during a failover sequence, the activation step on the failover node may need to specify the use of TiCAP. iCAP is designed to stop using temporary capacity and instead take advantage of usage rights if any become available.
Fail over to a standby Group Manager
When you need to have a standby Group Manager take over control of group management, log in to the standby system and run this “take control” command:

ap2> icapmanage -Q

This command establishes ap2 as the active Group Manager for all managed groups and group members, limited only to the extent that ap2 can contact the member systems and the previously active Group Manager.
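To confirm the takeover, you can review group and member status from ap2; a sketch assuming the icapmanage status option (-s):

ap2> icapmanage -s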
Split groups and failback In the case of a split group, both Group Managers have active status and each controls a subset of the managed group members, depending on the individual member status at the time of the failover. Control operations can be carried out on both active Group Managers, each communicating with the members that it (and only it) controls.
End-to-end time for Group Manager failover consists of:
• Serviceguard failover time
• The time needed for a standby to take control

The time needed for a standby Group Manager to take control depends on the number of:
• Groups
• Members in the groups
• Members in the groups that are not contactable

Remember that during failover to a standby Group Manager, the members of the group can still operate normally with the usage rights they have; they just cannot borrow or lend usage rights within the group.
Figure 12. Initial configuration with HP GiCAP and Serviceguard. [Figure: ap1 is the Group Manager. Server 1 hosts Db1 (2 blades, 16 active cores) and Db3 (2 blades, 12 active and 4 iCAP cores). Server 2 hosts Db2 (2 blades, 4 active and 12 iCAP cores) and Db4 (2 blades, 8 active and 5 iCAP cores). Db1 and Db2 form a Serviceguard cluster.]

Imagine that Db1 has a serious failure and the partition is no longer running.
Figure 13.
Next, the script activates the additional cores on Db2, and the packages start up on Db2 with all cores active. Figure 14 shows this failover state. Figure 14.
Figure 15.
Figure 16.
Monitor Group Manager health
Starting with iCAP version 10.05, you can monitor the health of a Group Manager using Serviceguard scripts. Based on the health of the Group Manager, you can then decide whether to fail over to the standby Group Manager. The main purpose of the Serviceguard cluster containing the Group Manager is to keep the Group Manager available at all times to manage the group. However, as long as the partition remains active, Serviceguard does not itself monitor the Group Manager software.
Appendix

Command reference
The following table provides a reference for various commands used in this paper.

Command          Description
icapmodify -d    Deactivate cores and free core usage rights to make them available to another partition. (Note that even on systems with full usage rights, and thus no inactive components, you can deactivate cores.) If the server is currently using TiCAP, using the -d option stops (or reduces) TiCAP consumption before releasing core usage rights.
icapmodify -a    Activate cores.
• Maximum core usage rights are seized—all the usage rights except one will be seized from the server.
• Expiration of seized usage rights (complete server failure)—if, at the time of an attempted rights seizure, the OA module or the server complex is unreachable, the rights seizure is instead treated as a loan of usage rights from the specified member to the group. The loan expires 30 days from the first use of the icapmanage -x command.
Scripts for member failover in an nPartition environment These scripts correspond to the “Automated (Serviceguard) member failover from a partial outage with nPartitions” subsection. This example is based on a Serviceguard cluster that includes the partitions Db1 and Db2. The failover package is defined to run normally on Db1 and with AUTO_RUN enabled so that failover will occur to Db2. There is no standby Group Manager, so no Group Manager fails over.
Scripts for failover when the entire complex goes down
These scripts correspond to the “Automated (Serviceguard) complete outage member failover and group manager failover” subsection. Here, the usage rights from the entire complex are seized by the Group Manager and are available for activation on complex 2. The addition of the seized rights to the appropriate members can be similar to the activation script shown in the previous subsection.
Scripts for Group Manager failover
These scripts pertain to the Group Manager failover portion of the “Recovery from a failure involving the Group Manager” subsection. For Group Manager failover from an active Group Manager ap1 to a standby Group Manager ap2, the package definition specifies “yes” for AUTO_RUN and “manual” or “automatic” for failback_policy. It also indicates that the package should always start on ap1 before ap2, if possible.
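A minimal sketch of the matching package attributes follows; the attribute names assume the Serviceguard modular package format, and the package name is hypothetical. Listing ap1 before ap2 expresses the preferred startup order:

package_name      gmpkg
package_type      failover
node_name         ap1
node_name         ap2
auto_run          yes
failback_policy   manual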
# cat gicap_start.sh
PATH=$PATH:/usr/sbin
HOSTNAME=`/usr/bin/hostname`
GM=gmpkg02.xxx.hp.com
# The number of cores needed to run the package
NUM=8
case $HOSTNAME in
    db1) OTHER_HOST=oa2.bbb.hp.com
         OTHER=db2;;
    db2) OTHER_HOST=oa1.bbn.hp.com
         OTHER=db1;;
esac
# Check the OS status of the other node in the cluster
STATUS=`cmviewcl -f line -n $OTHER | grep '^status=' | cut -f 2 -d=`
if [[ $STATUS != "up" ]]
then
    # Our failover/failback node is down; this is a failover startup.
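    # Sketch continuation (an assumption, not part of the original script;
    # the remsh call and the icapmanage member-selection syntax are guesses):
    # Failover startup: ask the Group Manager to seize the failed member's
    # usage rights, then activate the cores this package needs locally.
    remsh $GM /usr/sbin/icapmanage -x -m $OTHER_HOST
    /usr/sbin/icapmodify -a $NUM
else
    # Failback startup: the other node is up, so usage rights can migrate
    # normally within the group; activate the cores needed on this node.
    /usr/sbin/icapmodify -a $NUM
fi
exit 0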
Scripts for monitoring the health of the Group Manager
The following monitor script should be defined as a service in the package that is running on the cluster. If the script returns 1, the Serviceguard package fails over because the service command has failed. The Group Manager can then be migrated to the standby Group Manager. The monitor_heartbeat variable is a user-tunable variable that controls how frequently the monitoring runs; by default it is set to five minutes.
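A minimal sketch of such a monitor service is shown below; the script name, the use of icapmanage -s as the health check, and the error handling are assumptions to adapt for your environment:

# cat gicap_monitor.sh
PATH=$PATH:/usr/sbin
# User-tunable monitoring interval in seconds (default: five minutes)
MONITOR_HEARTBEAT=${MONITOR_HEARTBEAT:-300}
while true
do
    # Health check: query the local Group Manager; replace with the
    # check appropriate to your configuration.
    icapmanage -s > /dev/null 2>&1
    if [ $? -ne 0 ]
    then
        # Returning 1 fails the service, so Serviceguard fails the
        # package over to the standby Group Manager node.
        exit 1
    fi
    sleep $MONITOR_HEARTBEAT
done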
Additional HP Serviceguard scripting considerations
When developing Serviceguard scripts, consider the following:
• Remember that if temporary capacity is being used in the group, it may be necessary to specify -t on the failover activation command (icapmodify -a or icapmodify -s).
• Check whether usage rights have been made available even after an attempted rights seizure returns an error such as a timeout error; see the sketch following this list.
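A hedged sketch of that check, reusing the seizure and status commands referenced in this paper (the member-selection option, the $FAILED_MEMBER variable, and the retry logic are placeholders and assumptions):

# Attempt the seizure; on error, verify whether rights still transferred
icapmanage -x -m $FAILED_MEMBER
if [ $? -ne 0 ]
then
    # Review group status before retrying; the rights may already be
    # available on the Group Manager despite the reported error.
    icapmanage -s
fi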