Cost-Effective High-Availability Solutions with HP Instant Capacity on HP-UX

Table of contents
Introduction
Instant Capacity program overview
HP Serviceguard overview
Additional GiCAP high availability solutions
Summary
Appendix
Command reference
Introduction Most enterprises today depend on mission-critical applications to run their businesses. Traditionally, mission-critical applications require high levels of availability, which has proven expensive because of the need for redundancy, particularly for standby servers. Today, IT managers are challenged by the need to reduce budgets while still delivering services needed to run the business. HP understands this problem and has developed products that make high availability more cost-effective.
HP Serviceguard overview HP Serviceguard builds on the concept of virtualization by grouping multiple servers or partitions into a cluster to provide highly available application services that ensure data integrity. High-availability clusters are created using a networked grouping of HP Integrity servers, HP 9000 servers, and, if necessary, partitions as cluster nodes that are configured with redundant hardware and software components to eliminate single points of failure.
HP Serviceguard failover models Typical HA configurations consist of multiple servers clustered together. The servers in the cluster might run one or more applications within Serviceguard packages. The cluster configuration reflects the level of redundancy and protection required. Depending on how applications are mapped to cluster members, different cluster configurations are possible, each with a different risk and cost profile.
• What are the critical failures to protect against? What is the scope of the required redundancy to eliminate single points of failure?
• Is a fully redundant system cost-effective, or can Instant Capacity be used to help reduce total ownership costs?
• Do you need to provide automation for the failover/failback of specific applications?
• What level of performance is required after a failover?
• Is additional processing capacity required after a failover to meet specific service-level objectives?
Instant Capacity components and high availability Once you have installed iCAP components on a server, you have a foundation for the additional types of iCAP. Also, you immediately have these two cost-effective features: • In the event of a processor core failure (low-priority machine check or high-priority machine check), an iCAP core replaces the failed core automatically.
Temporary Instant Capacity Temporary Instant Capacity (TiCAP), an additional, optional iCAP feature, is a prepaid block of processing time that enables the activation of cores beyond the number allowed by the core usage rights purchased for a server. This prepaid block of temporary capacity works like a phone card, providing processor core activation time instead of phone time. You must have iCAP software and hardware installed to take advantage of TiCAP.
Temporary Instant Capacity and high availability You can use TiCAP to provide cost-effective high-availability for any of the models covered in the “HP Serviceguard failover models” section. To achieve these HA solutions, you must first install iCAP processors on the cluster nodes. These processors can then be activated temporarily in failover situations, using additional incremental purchases of TiCAP.
Figure 4: TiCAP example before failure Now, imagine that a failure on Sys1 incapacitates the server, forcing a move of existing workloads to Sys2. To handle the new workload requirements on Sys2, you activate the six inactive processor cores using TiCAP by specifying the -t option to icapmodify: Sys2> icapmodify -a 6 -t These cores remain active until they are deactivated. Since three units of TiCAP were originally purchased, this is sufficient for 15 days of operation with all six cores active.
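The arithmetic works out because each TiCAP unit provides 30 days of processing time for a single core: three units amount to 90 core-days, and 90 core-days divided by six active cores gives 15 days. While the extra cores are active, the remaining balance can be watched with the per-system iCAP status report; the following is a minimal check only, and the exact wording of the report fields varies by iCAP version:

    # Show the remaining temporary capacity on Sys2 while the TiCAP cores are active.
    # The field name matched here is illustrative; see icapstatus output for your version.
    Sys2> /usr/sbin/icapstatus | grep -i "temporary capacity"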
Figure 5: TiCAP example after failure and one day after activations on Sys2 When the failure on Sys1 is resolved, you can stop the use of TiCAP on Sys2 by deactivating the six processor cores: Sys2> icapmodify -d 6 You can also redeploy the applications to Sys1. HP Serviceguard considerations The TiCAP solution can be automated using the Serviceguard product. The basic steps include: • Configuring a Serviceguard cluster with packages set up to automatically fail over to the standby server (adoptive node).
Global Workload Manager considerations The TiCAP solution can also be automated using the HP Global Workload Manager (gWLM) product. With gWLM, the steps are the same as above except for the Serviceguard scripts. Instead of using the scripts to activate TiCAP, you define gWLM policies to activate TiCAP whenever a Serviceguard package is present on a system. These gWLM policies are known as conditional policies. For more information on gWLM, see http://docs.hp.com/en/vse.html.
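Whether driven by Serviceguard scripts or gWLM conditional policies, the underlying step is the same icapmodify call. The following is a minimal sketch of how it could be embedded in the customer-defined functions of a Serviceguard package control script; the six-core count follows the example above, the function names come from the standard package control script template, and a production script would also need to check which node it is running on and add error handling:

    function customer_defined_run_cmds
    {
        # Activate six additional cores with temporary capacity (TiCAP) before
        # the application in this package starts on the adoptive node.
        /usr/sbin/icapmodify -a 6 -t
        return 0
    }

    function customer_defined_halt_cmds
    {
        # Release the temporarily activated cores when the package halts, for
        # example after failing back to the repaired primary node.
        /usr/sbin/icapmodify -d 6
        return 0
    }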
Figure 6 shows a GiCAP group with two members and an active Group Manager named ap1. There is no standby Group Manager. The group has 24 inactive cores without usage rights (8 on Server 1 and 16 on Server 2), and 24 sharing rights have been purchased for the group. Also, note that there are two cell boards without usage rights.
Using GiCAP to recover from failures (seizing core usage rights) With GiCAP, you have another tool for providing cost-effective high-availability solutions: The ability to “seize” usage rights from a partition that is down. Depending on the version of Instant Capacity, this feature can also be used to provide disaster recovery when all partitions of a GiCAP member fail.
Migrating versus seizing usage rights Here are some basic guidelines for determining when to seize usage rights or when to simply migrate those rights. • Planned downtime and load balancing: Whenever possible, migrate usage rights by deactivating cores in one partition and then activating cores in another partition. The involved partitions may or may not be part of the same member server.
Recovery from a failure involving one or more nPartitions Rights seizure from a server where at least one other nPartition is still running is the simplest and most straightforward case. In this situation, the Instant Capacity software treats the rights seizure in a manner similar to a normal migration of usage rights because the software can contact the still-running partition to commit the changes for the server.
Figure 7: Initial GiCAP configuration (partial outage example) Imagine that partition db1 has crashed, and the system administrator knows it will be hours or days until the problem can be fixed. The application running in partition db1 is critical to the business and must continue running on Server 2. The administrator chooses to seize usage rights from partition db1 to free up usage rights for Server 2.
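The seize operation itself runs on the Group Manager. As a sketch, assuming the Group Manager hostname ap1 from Figure 6 and the icapmanage -x form used in the scripts later in this paper:

    # On the Group Manager: seize the usage rights held by the failed partition db1
    # so they become available to other members of the group.
    ap1> icapmanage -x db1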
Figure 8: GiCAP failover after rights seized (partial outage example) To complete the failover operation, the administrator now needs to activate the six cores on Server 2, using the available usage rights associated with the Group Manager. To activate cores on db2, log in to that system and run the following command: db2> icapmodify -a 6 [-t] Use the -t option if you want to allow TiCAP to be used to activate the cores in the event that concurrent operations have made the just-released usage rights unavailable.
Figure 9 shows the state after the core activations on Server 2, when the failover from db1 to db2 is complete. Figure 9: GiCAP failover completed (partial outage example) Most importantly, note that because partition db3 on Server 1 was still running during the rights seizure from db1: • The rights seizure was effective immediately and without a time expiration constraint.
Recovery from a failure involving all nPartitions When all the nPartitions of a server have failed, the Instant Capacity software cannot contact the downed server to record or commit changes related to the rights seizure. Due to this, seized usage rights are only conditionally released to the group and will expire at the end of ten days.
Figure 10 displays the state after the administrator has issued the two commands to seize usage rights from each failed partition. Figure 10: GiCAP failover after usage rights seized (complete outage example) This time the rights seizure has resulted in ten available core usage rights being held by the Group Manager (six seized from db1; four seized from db3). As before, the failover is not complete until the administrator activates cores on Server 2.
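As a sketch, the full command sequence for this failover might look like the following; the partition names and core counts follow Figures 10 and 11, and the Group Manager hostname ap1 is assumed:

    # On the Group Manager: seize usage rights from both failed partitions on Server 1.
    ap1> icapmanage -x db1
    ap1> icapmanage -x db3

    # On Server 2: activate the cores needed to run the failed-over applications,
    # six on db2 and two more on db4.
    db2> icapmodify -a 6
    db4> icapmodify -a 2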
Figure 11 shows this state when the failover from Server 1 to Server 2 is complete. Figure 11: GiCAP failover completed (complete outage example) Note that the Group Manager still has two available core usage rights. This is because the rights seizure from db3 made four core usage rights available, but only two additional usage rights were needed to run the application on db4.
However, if Server 1 reconnects to the GiCAP group sometime before the end of the 10-day period, the rights seizure can be committed on Server 1 just as if the rights seizure had been a deactivation of the cores. During the reboot of Server 1, the Instant Capacity software adjusts the usage rights on Server 1 to show that usage rights have been migrated from those partitions. Server 1 will boot with the minimum number of active cores: one core for each active cell assigned to its partitions.
The following two sub-sections discuss how these ideas relate to managing usage rights—based on whether some or all virtual partitions are down. Some virtual partitions down (migrate rights with vparmodify and icapmodify) If a subset of the virtual partitions within the nPartition is down and you want to distribute those usage rights to other partitions in the group, you must use techniques other than rights seizure.
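As a rough, illustrative sketch only, the migration combines the two commands named in this sub-section's title: vparmodify to unassign cores from the down virtual partition, and icapmodify to move the corresponding usage rights. The partition names and core counts below are hypothetical, and the exact steps depend on the vPars and iCAP versions in use:

    # From a running virtual partition in the same nPartition, unassign two cores
    # from the down virtual partition vp1 (names and counts are illustrative).
    vparmodify -p vp1 -d cpu::2

    # Deactivate two cores in this nPartition so the usage rights become
    # available to the rest of the GiCAP group.
    icapmodify -d 2

    # On the target partition elsewhere in the group, activate two cores
    # using the rights that were just freed.
    icapmodify -a 2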
Example: Manual failover from a complete outage of virtual partitions Figure 12 shows an example group with two members, but this time the servers are configured with virtual partitions in the nPartitions db1 and db2. Temporary capacity is not being used in this group. Figure 12: Initial configuration (virtual partition example) Now imagine that there is a catastrophic failure on Server 1 leaving vp1, vp2, and db3 inaccessible.
Figure 13 shows the result after the rights seizure. Figure 13: GiCAP failover after usage rights seized (virtual partition example) Six usage rights have been seized from the nPartition containing vp1 and vp2, leaving one core usage right for each cell in db1. As with previous examples, completing the failover is a matter of activating cores on the failover system. In this case, three cores are activated in vp3 and in vp4 to run the applications. You can accomplish this several ways.
Figure 14 shows the result. Figure 14: GiCAP failover from Server 1 to Server 2 completed (virtual partition example) As in the previous "complete outage" example with nPartitions, the seized usage rights activated in vp3 and vp4 will expire in ten days. At that time, the usage rights revert to vp1 and vp2; unless cores are then deactivated in vp3 and vp4, TiCAP will be consumed because more cores will be active than there are available usage rights.
The restore operation (icapmanage -z) automatically returns the seized usage rights to the nPartition containing the specified host. In this example, the restore operation can designate either vp1 or vp2. Temporary capacity and rights seizure In the previous examples, no temporary capacity was being used. If TiCAP is in use during a failover sequence, the activation step on the failover node may need to specify the use of TiCAP.
Once the standby is defined, the active Group Manager transfers the Group Manager database to the standby Group Manager whenever group database changes occur, and also on a regular timed basis (using a cron job). This enables the standby Group Manager to be ready to take control of the group in case of a Group Manager failure. For a Group Manager in standby status, only limited icapmanage commands are allowed, such as showing status of the group.
Split groups and failback In the case of a split group, both Group Managers have active status and each controls a subset of the managed group members, depending on the individual member status at the time of the failover. Control operations can be carried out on both active Group Managers, each communicating with the members that it (and only it) controls.
The time needed to move usage rights depends on the time to locate the appropriate Group Manager (if a standby Group Manager is defined), the time to locate available usage rights, and the time to perform the icapmodify operations.
Figure 15: Initial configuration with GiCAP and Serviceguard Imagine that db1 has a serious failure and is no longer running. The package is defined with the AUTO_RUN option, so Serviceguard will automatically fail it over to db2. Scripts are defined to provide customized processing on the start-up and shutdown of the package.
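As an illustration, the relevant portion of a legacy-style package configuration might look like the following; the package name and control-script path are hypothetical, and only the lines relevant to this example are shown:

    PACKAGE_NAME   pkg1                             # hypothetical package name
    NODE_NAME      db1                              # primary node
    NODE_NAME      db2                              # adoptive (failover) node
    AUTO_RUN       YES                              # fail over without operator intervention
    RUN_SCRIPT     /etc/cmcluster/pkg1/pkg1.cntl    # control script containing the customized
    HALT_SCRIPT    /etc/cmcluster/pkg1/pkg1.cntl    # start-up and shutdown commands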
Figures 16 and 17 show the scripting process in two separate steps. Figure 16 shows the state after the script has seized the core usage rights from db1. These usage rights are available for migration by the Group Manager when a member of the GiCAP group needs a core activated.
Next, the script activates the additional cores on db2, and the package starts up on db2, with all cores active. Figure 17 shows this failover state. Figure 17: GiCAP/Serviceguard failover completed The package can be configured to fail back automatically to db1 when it is available, or the cmhaltpkg/cmrunpkg commands can be used to provide the failback manually. The customized scripts should be written generically to work properly with these operations.
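For the manual failback, a sketch of the commands follows, assuming the hypothetical package name pkg1:

    # Halt the package on the adoptive node; the customized halt commands release
    # the extra cores that were activated on db2.
    cmhaltpkg pkg1

    # Start the package on the repaired primary node; the customized run commands
    # reacquire usage rights there if needed.
    cmrunpkg -n db1 pkg1

    # Re-enable automatic switching for the package after the manual move.
    cmmodpkg -e pkg1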
Example: Automated (Serviceguard) partial outage member failover and Group Manager failover This configuration is similar to the previous example, except for the addition of a standby Group Manager and an additional Serviceguard cluster and package that includes the active and standby Group Managers. Note that this example shows two separate clusters, but the Group Manager could be contained in the same cluster and managed by another package.
The member failover can be accomplished in a manner similar to that described previously, but with one change: since the member does not know which Group Manager might be active, the script to do rights seizure must either try each Group Manager in turn, or use a relocatable IP address associated with the Group Manager package to identify the Group Manager that is running. Figure 19 shows the state after failover of both packages.
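With the relocatable address approach mentioned above, the member-side script does not need to know which physical node currently runs the Group Manager package. A sketch, assuming gmpkg is a hypothetical relocatable hostname assigned to that package and $OTHER_HOST holds the failed node's name:

    # gmpkg resolves to the relocatable IP address of the Group Manager package,
    # so this command reaches whichever Group Manager is currently active.
    remsh gmpkg -l root -n "/usr/sbin/icapmanage -x $OTHER_HOST"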
One final note applies only to Serviceguard configurations where the primary node and the failover node are both on the same system complex. In this situation, failover works, but you should always fail back to the primary node instead of making the failover node the new primary (do not use the rotating standby model). This is because you cannot seize rights that have already been seized from the same complex.
• In most cases, you can easily and automatically add processing power in an iCAP HA solution so that the failover solution actually has more resources available than during normal processing, if necessary. • HP Serviceguard can be used in combination with many iCAP solutions to provide a more automated and comprehensive solution. Appendix Command reference The following table provides a reference for various commands used in this paper.
Summary of usage rights seizure The following list summarizes various facts about usage rights seizure:
• iCAP 8.01 requires at least one accessible partition. The iCAP 8.01 command that seizes usage rights from a downed partition does not support a failover scenario in which an entire server becomes unavailable, because usage rights can only be seized from a server if at least one partition is accessible. Access to the GSP/MP/iLO is not sufficient.
• iCAP 8.02 supports seizure from a completely unavailable member. With iCAP 8.02, usage rights can be seized even when all of a member server's nPartitions are down; as described earlier, such rights are only conditionally released to the group and expire after ten days unless the failed server reconnects so that the change can be committed.
• An nPartition must be unavailable for usage rights to be seized. Usage rights can only be seized if the partition is unavailable, as determined by the ping command. If a Serviceguard package needs to fail over for other reasons, the script must account for those possibilities. For example, if remote operations can still be performed on the active node, the script can acquire usage rights directly by deactivating cores on the active partition, as sketched below.
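A sketch of that branch follows, using the same cmviewcl status check and six-core count that appear in the scripts below; the Group Manager hostname ap1 is assumed, and $OTHER_HOST is assumed to hold the other node's hostname:

    # Check whether the other node is still running.
    STATUS=`cmviewcl -f line -n $OTHER_HOST | grep ^status= | cut -f 2 -d=`
    if [[ $STATUS = "up" ]]
    then
        # The node is up (the package is moving for some other reason):
        # free its usage rights by deactivating cores on it directly.
        remsh $OTHER_HOST -l root -n "/usr/sbin/icapmodify -d 6"
    else
        # The node is down: seize its usage rights through the Group Manager.
        remsh ap1 -l root -n "/usr/sbin/icapmanage -x $OTHER_HOST"
    fi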
Scripts for implementing failover with Serviceguard While you can implement rights seizure as part of an automatic failover system, you must ensure resources are seized appropriately and in a manner that will not cause problems when the failures are corrected. The Instant Capacity software determines that a partition is down based on whether the ping command is unsuccessful for the partition.
        # status is not UP, run icapmanage -x on the Group Manager
        echo OS on $OTHER_HOST is not up, running icapmanage -x on ap1 >&2
        remsh ap1 -l root -n "/usr/sbin/icapmanage -x $OTHER_HOST"
    fi

    # Always activate 6 additional cores in order to run the package
    echo Executing /usr/sbin/icapmodify -a 6 >&2
    /usr/sbin/icapmodify -a 6
    return 0

The gicap_stop.sh shutdown script is invoked from the customer_defined_halt_cmds function of the package control script.
For package shutdown, the following script works well for automatic failback. In this case it will be run on the failover node before the package starts on the original primary node; it will ensure that interim database changes are transferred from the failover node to the failback node. # cat gm_stop.
    # check the OS status of the other node in the cluster
    STATUS=`cmviewcl -f line -n $OTHER | grep ^status= | cut -f 2 -d=`
    if [[ $STATUS != "up" ]]
    then
        # Our failover/failback node is down, this is a failover startup.
        # Seize core usage rights from the failed node.
        echo $OTHER is not up.
• As noted previously, failback with virtual partitions requires a restore operation before the failed virtual partitions begin the reboot process. Because Serviceguard is started after the initial boot-up of the virtual partition, manual intervention is likely required for this failback case rather than using automated scripts under the control of Serviceguard.