Managing Serviceguard Seventeenth Edition HP Part Number: B3936-90144 Published: December 2009
Legal Notices © Copyright 1995-2009 Hewlett-Packard Development Company, L.P. Confidential computer software. Valid license from HP required for possession, use, or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor’s standard commercial license. The information contained herein is subject to change without notice.
Printing History

Table 1 Printing History

Printing Date     Part Number    Edition
January 1995      B3936-90001    First
June 1995         B3936-90003    Second
December 1995     B3936-90005    Third
August 1997       B3936-90019    Fourth
January 1998      B3936-90024    Fifth
October 1998      B3936-90026    Sixth
December 2000     B3936-90045    Seventh
September 2001    B3936-90053    Eighth
March 2002        B3936-90065    Ninth
June 2003         B3936-90070    Tenth
June 2004         B3936-90076    Eleventh
June 2005         B3936-90076    Eleventh, First reprint
October 20
Preface
This seventeenth edition of the manual applies to Serviceguard Version A.11.19. Earlier versions are available at http://www.docs.hp.com -> High Availability -> Serviceguard. This guide describes how to configure Serviceguard to run on HP 9000 or HP Integrity servers under the HP-UX operating system. The contents are as follows:
• “Serviceguard at a Glance” (page 29), describes a Serviceguard cluster and provides a roadmap for using this guide.
• Appendix H (page 451) describes the Serviceguard Manager GUI.
• “Maximum and Minimum Values for Parameters” (page 457) provides a reference to the supported ranges for Serviceguard parameters.
Related Publications
Use the following URL for HP’s high availability web page: http://www.hp.com/go/ha
Use the following URL to find the latest versions of a wide variety of HP-UX documentation: http://www.docs.hp.
• From http://www.docs.hp.com -> High Availability -> Quorum Server:
— HP Serviceguard Quorum Server Version A.04.00 Release Notes
• From http://www.docs.hp.com -> High Availability -> Event Monitoring Service and HA Monitors -> Installation and User’s Guide:
— Using High Availability Monitors
— Using the Event Monitoring Service
• From http://www.docs.hp.
1 Serviceguard at a Glance
This chapter introduces Serviceguard on HP-UX, and shows where to find information in this book. It covers the following:
• What is Serviceguard?
• Using Serviceguard Manager (page 32)
• A Roadmap for Configuring Clusters and Packages (page 34)
If you are ready to start setting up Serviceguard clusters, skip ahead to Chapter 4: “Planning and Documenting an HA Cluster ” (page 121). Specific steps for setup are given in Chapter 5: “Building an HA Cluster Configuration” (page 195).
network (LAN) component. In the event that one component fails, the redundant component takes over. Serviceguard and other high availability subsystems coordinate the transfer between components. A Serviceguard cluster is a networked grouping of HP 9000 or HP Integrity servers (or both), known as nodes, having sufficient redundancy of software and hardware that a single point of failure will not significantly disrupt service. A package groups application services (individual HP-UX processes) together.
services also are used for other types of inter-node communication. (The heartbeat is explained in more detail in the chapter “Understanding Serviceguard Software.”) Failover Any host system running in a Serviceguard cluster is called an active node. Under normal conditions, a fully operating Serviceguard cluster monitors the health of the cluster's components on all its active nodes. Most Serviceguard packages are failover packages.
provide as many separate power circuits as needed to prevent a single point of failure of your nodes, disks and disk mirrors. Each power circuit should be protected by an uninterruptible power source. For more details, refer to the section on “Power Supply Planning” in Chapter 4, “Planning and Documenting an HA Cluster.
You can use Serviceguard Manager to monitor, administer, and configure Serviceguard clusters.
• You can see properties, status, and alerts of clusters, nodes, and packages.
• You can do administrative tasks such as run or halt clusters, cluster nodes, and packages.
• You can create or modify a cluster and its packages.
Monitoring Clusters with Serviceguard Manager
From the main page of Serviceguard Manager, you can see status and alerts for the cluster, nodes, and packages.
on the command line. As of HP-UX 11i v3, SAM offers a Terminal User Interface (TUI) which also acts as a gateway to the web-based System Management Homepage (SMH).
• To get to the SMH for any task area, highlight the task area in the SAM TUI and press w.
• To go directly to the SMH from the command line, enter /usr/sbin/sam -w
For more information, see the HP-UX Systems Administrator’s Guide, posted at http://docs.hp.
Figure 1-3 Tasks in Configuring a Serviceguard Cluster The tasks in Figure 1-3 are covered in step-by-step detail in chapters 4 through 7. HP recommends you gather all the data that is needed for configuration before you start. See “Planning and Documenting an HA Cluster ” (page 121) for tips on gathering data.
2 Understanding Serviceguard Hardware Configurations
This chapter gives a broad overview of how the Serviceguard hardware components work. The following topics are presented:
• Redundancy of Cluster Components
• Redundant Network Components (page 38)
• Redundant Disk Storage (page 44)
• Redundant Power Supplies (page 49)
• Larger Clusters (page 50)
Refer to the next chapter for information about Serviceguard software components.
Fibre Channel or HP StorageWorks XP or EMC Symmetrix disk technology can be configured for failover among 16 nodes. Note that a package that does not access data from a disk on a shared bus can be configured to fail over to as many nodes as you have configured in the cluster (regardless of disk technology).
NOTE: Serial (RS232) lines are no longer supported for the cluster heartbeat. Fibre Channel, Token Ring and FDDI networks are no longer supported as heartbeat or data LANs.
Rules and Restrictions
• A single subnet cannot be configured on different network interfaces (NICs) on the same node.
• In the case of subnets that can be used for communication between cluster nodes, the same network interface must not be used to route more than one subnet configured on the same node.
addresses themselves will be immediately configured into the cluster as stationary IP addresses. CAUTION: If you configure any address other than a stationary IP address on a Serviceguard network interface, it could collide with a relocatable package IP address assigned by Serviceguard. See “Stationary and Relocatable IP Addresses ” (page 90). (Oracle VIPs are an exception to this rule; such configurations require the HP add-on product Serviceguard Extension for Oracle RAC).
In the figure, a two-node Serviceguard cluster has one bridged net configured with both a primary and a standby LAN card for the data/heartbeat subnet (subnetA). Another LAN card provides an optional dedicated heartbeat LAN. Note that the primary and standby LAN segments are connected by a hub to provide a redundant data/heartbeat subnet. Each node has its own IP address for this subnet.
• You should not use the wildcard (*) for node_name in the package configuration file, as this could allow the package to fail over across subnets when a node on the same subnet is eligible, and failing over across subnets can take longer. List the nodes in order of preference rather than using the wildcard. • You should configure IP monitoring for each subnet; see “Monitoring LAN Interfaces and Detecting Failure: IP Level” (page 98).
• cmrunnode will fail if the “hostname LAN” is down on the node in question. (“Hostname LAN” refers to the public LAN on which the IP address that the node’s hostname resolves to is configured).
• If a monitored_subnet is configured for PARTIAL monitored_subnet_access in a package’s configuration file, it must be configured on at least one of the nodes on the node_name list for that package.
cluster is running; see “Changing the Cluster Networking Configuration while the Cluster Is Running” (page 332). Redundant Disk Storage Each node in a cluster has its own root disk, but each node is also physically connected to several other disks in such a way that more than one node can obtain access to the data and programs associated with a package it is configured for.
Disk Mirroring Serviceguard itself does not provide protection for data on your disks, but protection is provided by HP’s Mirrordisk/UX product for LVM storage, and by the Veritas Volume Manager for VxVM and CVM. The logical volumes used for Serviceguard packages should be mirrored; so should the cluster nodes’ root disks. When you configure logical volumes using software mirroring, the members of each mirrored set contain exactly the same data.
NOTE: 4.1 and later versions of Veritas Volume Manager (VxVM) and Dynamic Multipathing (DMP) from Symantec are supported on HP-UX 11i v3, but do not provide multipathing and load balancing; DMP acts as a pass-through driver, allowing multipathing and load balancing to be controlled by the HP-UX I/O subsystem instead.
possible to replace a disk while the cluster stays up and the application remains online. The process is described under “Replacing Disks” (page 368) . Replacing Failed I/O Cards Depending on the system configuration, it is possible to replace failed disk I/O cards while the system remains online. The process is described under “Replacing I/O Cards” (page 372). Sample SCSI Disk Configurations Figure 2-2 shows a two node cluster.
Figure 2-3 Cluster with High Availability Disk Array Details on logical volume configuration for Serviceguard are in the chapter “Building an HA Cluster Configuration.” Sample Fibre Channel Disk Configuration In Figure 2-4, the root disks are shown with simple mirroring, but the shared storage is now accessed via redundant Fibre Channel switches attached to a disk array.
Figure 2-4 Cluster with Fibre Channel Switched Disk Array This type of configuration uses native HP-UX or other multipathing software; see “About Multipathing” (page 45). Redundant Power Supplies You can extend the availability of your hardware by providing battery backup to your nodes and disks. HP-supported uninterruptible power supplies (UPS), such as HP PowerTrust, can provide this protection from momentary power loss.
Therefore, if all of the hardware in the cluster has 2 or 3 power inputs, then at least three separate power circuits will be required to ensure that there is no single point of failure in the power circuit design for the cluster. Larger Clusters You can create clusters of up to 16 nodes with Serviceguard. Clusters of up to 16 nodes may be built by connecting individual SPUs via Ethernet.
Figure 2-5 Eight-Node Active/Standby Cluster
Point to Point Connections to Storage Devices
Some storage devices allow point-to-point connection to a large number of host nodes without using a shared SCSI bus. An example is shown in Figure 2-6, a cluster consisting of eight nodes with a SCSI interconnect. The nodes access shared data on an XP or EMC disk array configured with 16 SCSI I/O ports. Each node is connected to the array using two separate SCSI channels.
Figure 2-6 Eight-Node Cluster with XP or EMC Disk Array Fibre Channel switched configurations also are supported using either an arbitrated loop or fabric login topology. For additional information about supported cluster configurations, refer to the HP Unix Servers Configuration Guide, available through your HP representative.
3 Understanding Serviceguard Software Components This chapter gives a broad overview of how the Serviceguard software components work.
NOTE: Veritas CFS may not yet be supported on the version of HP-UX you are running; see “About Veritas CFS and CVM from Symantec” (page 32).
• /usr/lbin/cmvxpingd—Serviceguard-to-Veritas Activation daemon. (Only present if Veritas CFS is installed.)
• /usr/lbin/cmdisklockd—Lock LUN daemon
• /usr/lbin/cmlockd—Utility daemon
• /opt/sgproviders/bin/cmwbemd—WBEM daemon
• /usr/lbin/cmproxyd—Proxy daemon
Each of these daemons logs to the /var/adm/syslog/syslog.log file except for /opt/cmom/lbin/cmomd, which logs to /var/opt/cmom/cmomd.log. The Quorum Server runs outside the cluster.
NOTE: Two of the central components of Serviceguard—Package Manager, and Cluster Manager—run as parts of the cmcld daemon. This daemon runs at priority 20 on all cluster nodes. It is important that user processes should have a priority lower than 20, otherwise they may prevent Serviceguard from updating the kernel safety timer, causing a system reset. File Management Daemon: cmfileassistd The cmfileassistd daemon is used by cmcld to manage the files that it needs to read from, and write to, disk.
The SNMP Master Agent and the cmsnmpd provide notification (traps) for cluster-related events. For example, a trap is sent when the cluster configuration changes, or when a Serviceguard package has failed. You must edit /etc/SnmpAgent.d/snmpd.conf to tell cmsnmpd where to send this information. You must also edit /etc/rc.config.d/cmsnmpagt to auto-start cmsnmpd. Configure cmsnmpd to start before the Serviceguard cluster comes up. For more information, see the cmsnmpd (1m) manpage.
Lock LUN Daemon: cmdisklockd If a lock LUN is being used, cmdisklockd runs on each node in the cluster and is started by cmcld when the node joins the cluster. Utility Daemon: cmlockd Runs on every node on which cmcld is running (though currently not actually used by Serviceguard on HP-UX systems).
deployed as part of the Serviceguard Storage Management Suite bundles, the file /etc/gabtab is automatically configured and maintained by Serviceguard. GAB provides membership and messaging for CVM and the CFS. GAB membership also provides orderly startup and shutdown of the cluster file system.
Heartbeat Messages Central to the operation of the cluster manager is the sending and receiving of heartbeat messages among the nodes in the cluster. Each node in the cluster exchanges UDP heartbeat messages with every other node over each monitored IP network configured as a heartbeat device. (LAN monitoring is discussed later, in the section “Monitoring LAN Interfaces and Detecting Failure: Link Level” (page 92).
IMPORTANT: When multiple heartbeats are configured, heartbeats are sent in parallel; Serviceguard must receive at least one heartbeat to establish the health of a node. HP recommends that you configure all subnets that connect cluster nodes as heartbeat networks; this increases protection against multiple faults at no additional cost. Heartbeat IP addresses are usually on the same subnet on each node, but it is possible to configure a cluster that spans subnets; see “Cross-Subnet Configurations” (page 41).
which is a permanent modification of the configuration files. Re-formation of the cluster occurs under the following conditions (not a complete list): • • • • • • • • An SPU or network failure was detected on an active node. An inactive node wants to join the cluster. The cluster manager daemon has been started on that node. A node has been added to or deleted from the cluster configuration. The system administrator halted a node. A node halts because of a package failure.
If you have a two-node cluster, you are required to configure a cluster lock. If communications are lost between these two nodes, the node that obtains the cluster lock will take over the cluster and the other node will halt (system reset). Without a cluster lock, a failure of either node in the cluster will cause the other node, and therefore the cluster, to halt. Note also that if the cluster lock fails during an attempt to acquire it, the cluster will halt.
Figure 3-2 Lock Disk or Lock LUN Operation Serviceguard periodically checks the health of the lock disk or LUN and writes messages to the syslog file if the device fails the health check. This file should be monitored for early detection of lock disk problems. If you are using a lock disk, you can choose between two lock disk options—a single or dual lock disk—based on the kind of high availability configuration you are building. A single lock disk is recommended where possible.
in two separate data centers, a single lock disk would be a single point of failure should the data center it resides in suffer a catastrophic failure. In these two cases only, a dual cluster lock, with two separately powered cluster disks, should be used to eliminate the lock disk as a single point of failure. NOTE: You must use Fibre Channel connections for a dual cluster lock; you can no longer implement it in a parallel SCSI configuration.
Figure 3-3 Quorum Server Operation The Quorum Server runs on a separate system, and can provide quorum services for multiple clusters. IMPORTANT: For more information about the Quorum Server, see the latest version of the HP Serviceguard Quorum Server release notes at http://docs.hp.com -> High Availability -> Quorum Server. No Cluster Lock Normally, you should not configure a cluster of three or fewer nodes without a cluster lock. In two-node clusters, a cluster lock is required.
Server), the quorum device (for example from one quorum server to another), and the parameters that govern them (for example the Quorum Server polling interval). For more information about the Quorum Server and lock parameters, see “Cluster Configuration Parameters ” (page 139). NOTE: If you are using the Veritas Cluster Volume Manager (CVM) you cannot change the quorum configuration while SG-CFS-pkg is running. For more information about CVM, see “CVM and VxVM Planning ” (page 133).
Package Types Three different types of packages can run in the cluster; the most common is the failover package. There are also special-purpose packages that run on more than one node at a time, and so do not failover. They are typically used to manage resources of certain failover packages.
Figure 3-4 Package Moving During Failover Configuring Failover Packages You configure each package separately. You create a failover package by generating and editing a package configuration file template, then adding the package to the cluster configuration database; see Chapter 6: “Configuring Packages and Their Services ” (page 255). For legacy packages (packages created by the method used on versions of Serviceguard earlier than A.11.
that determine failover behavior. These are the auto_run parameter, the failover_policy parameter, and the failback_policy parameter.
Figure 3-5 Before Package Switching Figure 3-6 shows the condition where Node 1 has failed and Package 1 has been transferred to Node 2 on the same subnet. Package 1’s IP address was transferred to Node 2 along with the package. Package 1 continues to be available and is now running on Node 2. Also note that Node 2 can now access both Package1’s disk and Package2’s disk.
NOTE: For design and configuration information about clusters which span subnets, including site-aware disaster-tolerant clusters, see the documents listed under “Cross-Subnet Configurations” (page 41).
Figure 3-6 After Package Switching
Failover Policy
The Package Manager selects a node for a failover package to run on based on the priority list included in the package configuration file together with the failover_policy parameter, also in the configuration file.
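In a modular package configuration file these settings are simple keyword/value pairs. The following sketch shows how a node list and the policies discussed here and under “Failback Policy” might look; the node names are illustrative:

node_name         node1
node_name         node2
node_name         node3
node_name         node4
failover_policy   min_package_node
failback_policy   automatic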
Automatic Rotating Standby Using the min_package_node failover policy, it is possible to configure a cluster that lets you use one node as an automatic rotating standby node for the cluster. Consider the following package configuration for a four node cluster. Note that all packages can run on all nodes and have the same node_name lists. Although the example shows the node names in a different order for each package, this is not required.
Figure 3-8 Rotating Standby Configuration after Failover NOTE: Using the min_package_node policy, when node 2 is repaired and brought back into the cluster, it will then be running the fewest packages, and thus will become the new standby node.
Figure 3-9 CONFIGURED_NODE Policy Packages after Failover If you use configured_node as the failover policy, the package will start up on the highest priority node in the node list, assuming that the node is running as a member of the cluster. When a failover occurs, the package will move to the next highest priority node in the list that is available.
Figure 3-10 Automatic Failback Configuration before Failover Table 3-2 Node Lists in Sample Cluster Package Name NODE_NAME List FAILOVER POLICY FAILBACK POLICY pkgA node1, node4 CONFIGURED_NODE AUTOMATIC pkgB node2, node4 CONFIGURED_NODE AUTOMATIC pkgC node3, node4 CONFIGURED_NODE AUTOMATIC node1 panics, and after the cluster reforms, pkgA starts running on node4: 76 Understanding Serviceguard Software Components
Figure 3-11 Automatic Failback Configuration After Failover After rebooting, node 1 rejoins the cluster. At that point, pkgA will be automatically stopped on node 4 and restarted on node 1.
Figure 3-12 Automatic Failback Configuration After Restart of Node 1 NOTE: Setting the failback_policy to automatic can result in a package failback and application outage during a critical production period. If you are using automatic failback, you may want to wait to add the package’s primary node back into the cluster until you can allow the package to be taken out of service temporarily while it switches back to the primary node.
Using the Event Monitoring Service Basic package resources include cluster nodes, LAN interfaces, and services, which are the individual processes within an application. All of these are monitored by Serviceguard directly. In addition, you can use the Event Monitoring Service registry through which add-on monitors can be configured. This registry allows other software components to supply monitoring of their resources for Serviceguard.
How Packages Run Packages are the means by which Serviceguard starts and halts configured applications. Failover packages are also units of failover behavior in Serviceguard. A package is a collection of services, disk volumes and IP addresses that are managed by Serviceguard to ensure they are available. There can be a maximum of 300 packages per cluster and a total of 900 services per cluster.
packages and failover packages can name some subset of the cluster’s nodes or all of them. If the auto_run parameter is set to yes in a package’s configuration file Serviceguard automatically starts the package when the cluster starts. System multi-node packages are required to have auto_run set to yes. If a failover package has auto_run set to no, Serviceguard cannot start it automatically at cluster startup time; you must explicitly enable this kind of package using the cmmodpkg command.
1. 2. 3. 4. 5. 6. 7. Before the control script starts. (For modular packages, this is the master control script.) During run script execution. (For modular packages, during control script execution to start the package.) While services are running When a service, subnet, or monitored resource fails, or a dependency is not met. During halt script execution. (For modular packages, during control script execution to halt the package.
7. 8. Starts up any EMS (Event Monitoring Service) resources needed by the package that were specially marked for deferred startup. Exits with an exit code of zero (0). Figure 3-14 Package Time Line (Legacy Package) At any step along the way, an error will result in the script exiting abnormally (with an exit code of 1). For example, if a package service is unable to be started, the control script will exit with an error. NOTE: This diagram is specific to legacy packages.
NOTE: After the package run script has finished its work, it exits, which means that the script is no longer executing once the package is running normally. After the script exits, the PIDs of the services started by the script are monitored by the package manager directly. If the service dies, the package manager will then run the package halt script or, if service_fail_fast_enabled is set to yes, it will halt the node on which the package is running.
SERVICE_RESTART[0]=" "        ; do not restart
SERVICE_RESTART[0]="-r <n>"   ; restart as many as <n> times
SERVICE_RESTART[0]="-R"       ; restart indefinitely
NOTE: If you set restarts and also set service_fail_fast_enabled to yes, the failfast will take place after restart attempts have failed. It does not make sense to set service_restart to “-R” for a service and also set service_fail_fast_enabled to yes.
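For a modular package, the equivalent behavior is controlled by the service parameters in the package configuration file rather than in a control script. A minimal sketch follows; the service name and command are illustrative:

service_name                db_service
service_cmd                 "/usr/local/bin/db_monitor"
service_restart             3
service_fail_fast_enabled   no
service_halt_timeout        300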
NOTE: If a package is dependent on a subnet, and the subnet fails on the node where the package is running, the package will start to shut down. If the subnet recovers immediately (before the package is restarted on an adoptive node), the package manager restarts the package on the same node; no package switch occurs.
Figure 3-15 Legacy Package Time Line for Halt Script Execution At any step along the way, an error will result in the script exiting abnormally (with an exit code of 1). Also, if the halt script execution is not complete before the time specified in the HALT_SCRIPT_TIMEOUT, the package manager will kill the script. During halt script execution, messages are written to a log file. For legacy packages, this is in the same directory as the run script and has the same name as the run script and the extension.
• 0—normal exit. The package halted normally, so all services are down on this node.
• 1—abnormal exit, also known as no_restart exit. The package did not halt normally. Services are killed, and the package is disabled globally. It is not disabled on the current node, however.
• Timeout—Another type of exit occurs when the halt_script_timeout is exceeded. In this scenario, the package is killed and disabled globally. It is not disabled on the current node, however.
Table 3-3 Error Conditions and Package Movement for Failover Packages (continued)
Error or Exit Code: Halt Script Timeout
Node Failfast Enabled: YES
Service Failfast Enabled: Either Setting
HP-UX Status on Primary after Error: system reset
Halt script runs after Error or Exit: N/A
Package Allowed to Run on Primary Node after Error: N/A (system reset)
Package Allowed to Run on Alternate Node: Yes, unless the timeout happened after the cmhaltpkg command was executed.
where the package is running and monitoring the health of all interfaces, switching them when necessary. NOTE: Serviceguard monitors the health of the network interfaces (NICs) and can monitor the IP level (layer 3) network. Stationary and Relocatable IP Addresses Each node (host system) should have at least one IP address for each active network interface. This address, known as a stationary IP address, is configured in the node's /etc/rc.config.d/netconf file or in the node’s /etc/rc.config.
In addition, relocatable addresses (but not stationary addresses) can be taken over by an adoptive node on the same subnet if control of the package is transferred. This means that applications can access the package via its relocatable address without knowing which node the package currently resides on.
IP addresses are configured only on each primary network interface card; standby cards are not configured with an IP address. Multiple IPv4 addresses on the same network card must belong to the same IP subnet. CAUTION: HP strongly recommends that you add relocatable addresses to packages only by editing ip_address (page 275) in the package configuration file (or IP [] entries in the control script of a legacy package) and running cmapplyconf (1m).
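For example, a relocatable address and its subnet would appear in a modular package configuration file like this sketch; the addresses are illustrative:

ip_subnet    192.10.25.0
ip_address   192.10.25.12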
NOTE: For a full discussion, see the white paper Serviceguard Network Manager: Inbound Failure Detection Enhancement at http://docs.hp.com -> High Availability -> Serviceguard -> White Papers.
• INOUT: When both the inbound and outbound counts stop incrementing for a certain amount of time, Serviceguard will declare the card as bad. (Serviceguard calculates the time depending on the type of LAN card.
• actively keep track of which neighbors are reachable, and which are not, and detect changed link-layer addresses.
• search for alternate functioning routers when the path to a router fails.
Within the Ethernet family, local switching is supported in the following configurations:
• 1000Base-SX and 1000Base-T
• 1000Base-T or 1000Base-SX and 100Base-T
On HP-UX 11i, however, Jumbo Frames can only be used when the 1000Base-T or 1000Base-SX cards are configured.
In Figure 3-17, we see what would happen if the LAN segment 2 network interface card on Node 1 were to fail. Figure 3-17 Cluster After Local Network Switching As the standby interface takes over, IP addresses will be switched to the hardware path associated with the standby interface. The switch is transparent at the TCP/IP level. All applications continue to run on their original nodes. During this time, IP traffic on Node 1 will be delayed as the transfer occurs.
Figure 3-18 Local Switching After Cable Failure Local network switching will work with a cluster containing one or more nodes. You may wish to design a single-node cluster in order to take advantage of this local network switching feature in situations where you need only one node and do not wish to set up a more complex cluster. Switching Back to Primary LAN Interfaces after Local Switching If a primary interface fails, the IP address will be switched to a standby.
cmhaltnode command is issued or on all nodes in the cluster if a cmhaltcl command is issued.
• Configurable behavior: NETWORK_AUTO_FAILBACK = NO. Serviceguard will detect and log the recovery of the interface, but will not switch the IP address back from the standby to the primary interface. You can tell Serviceguard to switch the IP address back to the primary interface by means of the cmmodnet command:
cmmodnet -e <interface>
where <interface> is the primary interface.
configured on that node, and identified as monitored subnets in the package configuration file, must be available.) Note that remote switching is supported only between LANs of the same type. For example, a remote switchover between an Ethernet interface on one machine and an IPoIB interface on the failover machine is not supported. The remote switching of relocatable IP addresses is shown in Figure 3-5 and Figure 3-6.
NOTE: This applies only to subnets for which the cluster configuration parameter IP_MONITOR is set to ON. See “Cluster Configuration Parameters ” (page 139) for more information. — Errors that prevent packets from being received but do not affect the link-level health of an interface IMPORTANT: You should configure the IP Monitor in a cross-subnet configuration, because IP monitoring will detect some errors that link-level monitoring will not. See also “Cross-Subnet Configurations” (page 41).
IPv4:
1   16.89.143.192
    16.89.120.0
…
Possible IP Monitor Subnets:
IPv4:
16.89.112.0              Polling Target 16.89.112.1
IPv6:
3ffe:1000:0:a801::       Polling Target 3ffe:1000:0:a801::254
…
The IP Monitor section of the cluster configuration file will look similar to the following for a subnet on which IP monitoring is configured with target polling.
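Here is a sketch of that entry, using the subnet and polling target reported by cmquerycl above:

SUBNET 16.89.112.0
  IP_MONITOR ON
  POLLING_TARGET 16.89.112.1

To disable IP monitoring for a subnet, the entry would instead set IP_MONITOR to OFF:

SUBNET 16.89.112.0
  IP_MONITOR OFF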
NOTE: This is the default if cmquerycl does not detect a gateway for the subnet in question; it is equivalent to having no SUBNET entry for the subnet. See SUBNET under “Cluster Configuration Parameters ” (page 139) for more information. Failure and Recovery Detection Times With the default NETWORK_POLLING_INTERVAL of 2 seconds (see “Cluster Configuration Parameters ” (page 139)) the IP monitor will detect IP failures typically within 8–10 seconds for Ethernet and within 16–18 seconds for InfiniBand.
local switching may occur, that is, a switch to a standby LAN card if one has been configured; see “Local Switching ” (page 93). The following examples show when and how a link-level failure is differentiated from an IP-level failure in the output of cmviewcl (1m). As you can see, if local switching is configured, the difference is the keyword disabled, which appears in the tabular output, and is set to true in the line output, if the IP Monitor detects the failure.
cmviewcl -v -f line will report the same failure like this:
node:gary|interface:lan2|status=down
node:gary|interface:lan2|disabled=false
node:gary|interface:lan2|failure_type=link+ip
If local switching is not configured and a failure is detected by IP monitoring, output from cmviewcl -v will look something like this:
Network_Parameters:
INTERFACE    STATUS            PATH       NAME
PRIMARY      down (IP only)    0/3/1/0    lan2
PRIMARY      up                0/5/1/0    lan3
Figure 3-19 Aggregated Networking Ports Both the Single and Dual ported LANs in the non-aggregated configuration have four LAN cards, each associated with a separate non-aggregated IP address and MAC address, and each with its own LAN name (lan0, lan1, lan2, lan3). When these ports are aggregated all four ports are associated with a single IP address and MAC address. In this example, the aggregated ports are collectively known as lan900, the name by which the aggregate is known on HP-UX 11i.
What is VLAN? Virtual LAN (or VLAN) is a technology that allows logical grouping of network nodes, regardless of their physical locations. VLAN can be used to divide a physical LAN into multiple logical LAN segments or broadcast domains, helping to reduce broadcast traffic, increase network performance and security, and improve manageability.
Additional Heartbeat Requirements
VLAN technology allows great flexibility in network configuration. To maintain Serviceguard’s reliability and availability in such an environment, the heartbeat rules are tightened as follows when the cluster is using VLANs:
1. VLAN heartbeat networks must be configured on separate physical NICs or APA aggregates, to avoid single points of failure.
2. Heartbeats are still recommended on all cluster networks, including VLANs.
3.
to agile addressing when you upgrade to 11i v3, though you should seriously consider its advantages. For instructions on migrating a system to agile addressing, see the white paper Migrating from HP-UX 11i v2 to HP-UX 11i v3 at http://docs.hp.com. NOTE: It is possible, though not a best practice, to use legacy DSFs (that is, DSFs using the older naming convention) on some nodes after migrating to agile addressing on others; this allows you to migrate different nodes at different times, if necessary.
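To see how a legacy DSF maps to its agile (persistent) counterpart on a given node, you can use ioscan on HP-UX 11i v3; the device name below is illustrative:

ioscan -m dsf /dev/dsk/c3t15d0

The output shows the persistent DSF (for example, /dev/disk/disk3) that corresponds to the legacy name; running ioscan -m dsf with no argument lists all such mappings.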
NOTE: Under agile addressing (see “About Device File Names (Device Special Files)” (page 106)), the storage units in this example would have names such as disk1, disk2, disk3, etc. Figure 3-20 Physical Disks Within Shared Storage Units Figure 3-21 shows the individual disks combined in a multiple disk mirrored configuration.
Figure 3-21 Mirrored Physical Disks Figure 3-22 shows the mirrors configured into LVM volume groups, shown in the figure as /dev/vgpkgA and /dev/vgpkgB. The volume groups are activated by Serviceguard packages for use by highly available applications.
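When a package that uses one of these volume groups starts, its control script activates the volume group, typically in exclusive mode so that only one node can write to it at a time. A sketch, using a volume group name from the figure (the volume group must already have been made cluster-aware):

vgchange -a e /dev/vgpkgA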
Examples of Storage on Disk Arrays Figure 3-23 shows an illustration of storage configured on a disk array. Physical disks are configured by an array utility program into logical units or LUNs which are then seen by the operating system. Figure 3-23 Physical Disks Combined into LUNs NOTE: LUN definition is normally done using utility programs provided by the disk array manufacturer. Since arrays vary considerably, you should refer to the documentation that accompanies your storage unit.
Figure 3-24 Multiple Paths to LUNs Finally, the multiple paths are configured into volume groups as shown in Figure 3-25.
Types of Volume Manager
Serviceguard allows a choice of volume managers for data storage:
• HP-UX Logical Volume Manager (LVM) and (optionally) Mirrordisk/UX
• Veritas Volume Manager for HP-UX (VxVM)—Base and add-on Products
• Veritas Cluster Volume Manager for HP-UX
Separate sections in Chapters 5 and 6 explain how to configure cluster storage using all of these volume managers.
• need to use software RAID mirroring or striped mirroring.
• have multiple heartbeat subnets configured.
Propagation of Disk Groups in VxVM
A VxVM disk group can be created on any node, whether the cluster is up or not. You must validate the disk group by trying to import it on each node.
Package Startup Time with VxVM
With VxVM, each disk group is imported by the package control script that uses the disk group.
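One way to validate a VxVM disk group on each node, as described under “Propagation of Disk Groups in VxVM” above, is to import it manually on that node and then deport it again so the next node can try it; the disk group name is illustrative:

vxdg import dg_01
vxdg deport dg_01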
CVM 4.1 and later can be used with Veritas Cluster File System (CFS) in Serviceguard. Several of the HP Serviceguard Storage Management Suite bundles include features to enable both CVM and CFS. CVM can be used in clusters that:
• run applications that require fast disk group activation after package failover;
• require storage activation on more than one node at a time, for example to perform a backup from one node while a package using the volume is active on another node.
Table 3-4 Pros and Cons of Volume Managers with Serviceguard
Product: Logical Volume Manager (LVM)
Advantages:
• Software is provided with all versions of HP-UX.
• Provides up to 3-way mirroring using optional Mirrordisk/UX software.
• Dynamic multipathing (DMP) is active by default as of HP-UX 11i v3.
Tradeoffs:
• Lacks flexibility and extended features of some other volume managers.
Table 3-4 Pros and Cons of Volume Managers with Serviceguard (continued)
Product: Veritas Volume Manager— Full VxVM product
Advantages:
• Disk group configuration from any node.
• DMP for active/active storage devices.
• Supports exclusive activation.
The case is covered in more detail under “What Happens when a Node Times Out” (page 117). See also “Cluster Daemon: cmcld” (page 55). A system reset is also initiated by Serviceguard itself under specific circumstances; see “Responses to Package and Service Failures ” (page 119). What Happens when a Node Times Out Each node sends a heartbeat message to all other nodes at an interval equal to one-fourth of the value of the configured MEMBER_TIMEOUT or 1 second, whichever is less.
SystemB recognizes that it has failed to get the cluster lock and so cannot re-form the cluster. To release all resources related to Package2 (such as exclusive access to volume group vg02 and the Package2 IP address) as quickly as possible, SystemB halts (system reset). NOTE: If AUTOSTART_CMCLD in /etc/rc.config.d/cmcluster ($SGAUTOSTART) is set to zero, the node will not attempt to join the cluster when it comes back up.
Serviceguard does not respond directly to power failures, although a loss of power to an individual cluster component may appear to Serviceguard like the failure of that component, and will result in the appropriate switching behavior. Power protection is provided by HP-supported uninterruptible power supplies (UPS).
executes, can examine this variable to see whether it has been restarted after a failure, and if so, it can take appropriate action such as cleanup. Network Communication Failure An important element in the cluster is the health of the network itself. As it continuously monitors the cluster, each node listens for heartbeat messages from the other nodes confirming that all nodes are able to communicate with each other.
4 Planning and Documenting an HA Cluster Building a Serviceguard cluster begins with a planning phase in which you gather information about all the hardware and software components of the configuration.
• electrical points of failure.
• application points of failure.
Serviceguard Memory Requirements
Serviceguard requires approximately 15.5 MB of lockable memory.
Planning for Expansion
When you first set up the cluster, you indicate a set of nodes and define a group of packages for the initial configuration. At a later time, you may wish to add additional nodes and packages, or you may wish to use additional disk hardware for shared data storage.
NOTE: Under agile addressing, the storage units in this example would have names such as disk1, disk2, disk3, etc. See “About Device File Names (Device Special Files)” (page 106). Figure 4-1 Sample Cluster Configuration Create a similar sketch for your own cluster.
Host Name The name to be used on the system as the host name. Memory Capacity The memory in MB. Number of I/O slots The number of slots. Network Information Serviceguard monitors LAN interfaces. NOTE: Serviceguard supports communication across routers between nodes in the same cluster; for more information, see the documents listed under “Cross-Subnet Configurations” (page 41).
An IPV6 address is a string of 8 hexadecimal values separated with colons, in this form: xxx:xxx:xxx:xxx:xxx:xxx:xxx:xxx. For more details of IPv6 address format, see the Appendix G (page 441). NETWORK_FAILURE_DETECTION When there is a primary and a standby network card, Serviceguard needs to determine when a card has failed, so it knows whether to fail traffic over to the other card.
Table 4-1 SCSI Addressing in Cluster Configuration

System or Disk       Host Interface SCSI Address
Primary System A     7
Primary System B     6
Primary System C     5
Primary System D     4
Disk #1              3
Disk #2              2
Disk #3              1
Disk #4              0
Disk #5              15
Disk #6              14
Others               13 - 8

NOTE: When a boot/root disk is configured with a low-priority address on a shared SCSI bus, a system panic can occur if there is a timeout on accessing the boot/root device.
This information is used in creating the mirrored disk configuration using Logical Volume Manager. In addition, it is useful to gather as much information as possible about your disk configuration. You can obtain information about available disks by using the following commands:
• diskinfo
• ioscan -fnC disk or ioscan -fnNC disk
• lssf /dev/*dsk/*
• bdf
• mount
• swapinfo
• vgdisplay -v
• lvdisplay -v
• lvlnboot -v
• vxdg list (VxVM and CVM)
• vxprint (VxVM and CVM)
These are standard HP-UX commands.
Bus Type _SCSI_ Slot Number _6_ Address _24_ Disk Device File __________
Bus Type ______ Slot Number ___ Address ____ Disk Device File __________
Attach a printout of the output from the ioscan -fnC disk command after installing disk hardware and rebooting the system. Mark this printout to indicate which physical volume group each disk belongs to.
Be sure to follow UPS and cabinet power limits as well as SPU power limits. Power Supply Configuration Worksheet You may find the following worksheet useful to help you organize and record your power supply configuration. This worksheet is an example; blank worksheets are in Appendix E. Make as many copies as you need.
four nodes can use only a Quorum Server as the cluster lock. In selecting a cluster lock configuration, be careful to anticipate any potential need for additional cluster nodes. For more information on lock disks, lock LUNs, and the Quorum Server, see “Choosing Cluster Lock Disks” (page 205), “Setting Up a Lock LUN” (page 206), and “Setting Up and Running the Quorum Server” (page 209).
Quorum Server Worksheet You may find it useful to use a Quorum Server worksheet, as in the example that follows, to identify a quorum server for use with one or more clusters. You may also want to enter quorum server host and timing parameters on the Cluster Configuration Worksheet. Blank worksheets are in Blank Planning Worksheets (page 429). You can use the Quorum Server worksheet to record the following: Quorum Server Host The host name (and alternate address, if any) for the quorum server.
• You must group high availability applications, services, and data, whose control needs to be transferred together, onto a single volume group or series of volume groups.
• You must not group two different high availability applications, services, or data, whose control needs to be transferred independently, onto the same volume group.
• Your root disk must not belong to a volume group that can be activated on another node.
Physical Volume Name: _____________________________________________________
Name of Second Physical Volume Group: _______bus1____________________________
Physical Volume Name: ______________/dev/dsk/c4t2d0________________________
Physical Volume Name: ______________/dev/dsk/c5t2d0________________________
Physical Volume Name: ______________/dev/dsk/c6t2d0________________________
Physical Volume Name: _____________________________________________________
Physical Volume Name: _____________________
CVM and VxVM Worksheet You may find a worksheet such as the following useful to help you organize and record your specific physical disk configuration. This worksheet is an example; blank worksheets are in Appendix E (page 429).
If the cluster has only a single heartbeat network, and a network card on that network fails, heartbeats will be lost while the failure is being detected and the IP address is being switched to a standby interface. The cluster may treat these lost heartbeats as a failure and re-form without one or more nodes. To prevent this, a minimum MEMBER_TIMEOUT value of 14 seconds is required for clusters with a single heartbeat network.
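In the cluster configuration file, MEMBER_TIMEOUT is specified in microseconds (see “Cluster Configuration Parameters ” (page 139) for the full rules), so the 14-second minimum just described corresponds to an entry like this sketch:

MEMBER_TIMEOUT 14000000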
What Is IPv4–only Mode? IPv4 is the default mode: unless you specify IPV6 or ANY (either in the cluster configuration file or via cmquerycl -a) Serviceguard will always try to resolve the nodes' hostnames (and the Quorum Server's, if any) to IPv4 addresses, and will not try to resolve them to IPv6 addresses. This means that you must ensure that each hostname can be resolved to at least one IPv4 address. NOTE: This applies only to hostname resolution.
• The node's public LAN address (by which it is known to the outside world) must be the last address listed in /etc/hosts. Otherwise there is a possibility of the address being used even when it is not configured into the cluster. • You must use $SGCONF/cmclnodelist, not ~/.rhosts or /etc/hosts.equiv, to provide root access to an unconfigured node. NOTE: This also applies if HOSTNAME_ADDRESS_FAMILY is set to ANY. See “Allowing Root Access to an Unconfigured Node” (page 197) for more information.
Recommendations for IPv6-Only Mode IMPORTANT: Check the current Serviceguard release notes for the latest instructions and recommendations. • If you decide to migrate the cluster to IPv6-only mode, you should plan to do so while the cluster is down. What Is Mixed Mode? If you configure mixed mode (HOSTNAME_ADDRESS_FAMILY set to ANY), then the addresses used by the cluster, including the heartbeat, and Quorum Server addresses if any, can be IPv4 or IPv6 addresses.
• CFS, CVM, VxVM, and VxFS are not supported.
NOTE: This also applies if HOSTNAME_ADDRESS_FAMILY is set to IPV6.
• HPVM is not supported. You cannot have a virtual machine that is either a node or a package if HOSTNAME_ADDRESS_FAMILY is set to ANY or IPV6.
Cluster Configuration Parameters
You need to define a set of cluster parameters. These are stored in the binary cluster configuration file, which is distributed to each node in the cluster.
NOTE: In addition, the following characters must not be used in the cluster name if you are using the Quorum Server: at-sign (@), equal-sign (=), or-sign (|), semicolon (;). These characters are deprecated, meaning that you should not use them, even if you are not using the Quorum Server, because they will be illegal in a future Serviceguard release. All other characters are legal. The cluster name can contain up to 39 characters (bytes).
IMPORTANT: See “About Hostname Address Families: IPv4-Only, IPv6-Only, and Mixed Mode” (page 135) for important information. See also the latest Serviceguard release notes at docs.hp.com under High Availability —> Serviceguard. FIRST_CLUSTER_LOCK_VG, SECOND_CLUSTER_LOCK_VG The volume group containing the physical disk volume on which a cluster lock is written. This parameter is used only when you employ a lock disk for tie-breaking services in the cluster.
Server” (page 224). See also “Configuring Serviceguard to Use the Quorum Server” in the latest version of the HP Serviceguard Quorum Server Version A.04.00 Release Notes, at http://www.docs.hp.com -> High Availability -> Quorum Server. IMPORTANT: See also “About Hostname Address Families: IPv4-Only, IPv6-Only, and Mixed Mode” (page 135) for important information about requirements and restrictions in an IPv6-only cluster.
IMPORTANT: For special instructions that may apply to your version of Serviceguard and the Quorum Server, see “Configuring Serviceguard to Use the Quorum Server” in the latest version of the HP Serviceguard Quorum Server Version A.04.00 Release Notes, at http://www.docs.hp.com -> High Availability -> Quorum Server. See also “About Hostname Address Families: IPv4-Only, IPv6-Only, and Mixed Mode” (page 135) for important information about requirements and restrictions in an IPv6-only cluster.
so until you have read the HP Serviceguard Quorum Server Version A.04.00 Release Notes, and in particular the following sections in that document: “About the QS Polling Interval and Timeout Extension”, “Network Recommendations”, and “Setting Quorum Server Parameters in the Cluster Configuration File”. Can be changed while the cluster is running; see “What Happens when You Change the Quorum Configuration Online” (page 66) for important information.
NODE_NAME The hostname of each system that will be a node in the cluster. CAUTION: Make sure that the node name is unique within the subnets configured on the cluster nodes; under some circumstances Serviceguard may not be able to detect a duplicate name and unexpected problems may result. Do not use the full domain name. For example, enter ftsys9, not ftsys9.cup.hp.com.
which the node identified by the preceding NODE_NAME entry belongs. Can be used only in a site-aware disaster-tolerant cluster, which requires Metrocluster (additional HP software); see the documents listed under “Cross-Subnet Configurations” (page 41) for more information. If SITE is used, it must be used for each node in the cluster (that is, all the nodes must be associated with some defined site, though not necessarily the same one).
NOTE: Any subnet that is configured in this cluster configuration file as a SUBNET for IP monitoring purposes, or as a monitored_subnet in a package configuration file (or SUBNET in a legacy package; see “Package Configuration Planning ” (page 162)) must be specified in the cluster configuration file via NETWORK_INTERFACE and either STATIONARY_IP or HEARTBEAT_IP.
• Two heartbeat subnets; or
• One heartbeat subnet with a standby; or
• One heartbeat subnet using APA with two physical ports in hot standby mode or LAN monitor mode.
You cannot configure more than one heartbeat IP address on an interface; only one HEARTBEAT_IP is allowed for each NETWORK_INTERFACE.
NOTE: The Serviceguard cmapplyconf, cmcheckconf, and cmquerycl commands check that these minimum requirements are met, and produce a warning if they are not met at the immediate network level.
subnet on another node (that is, each heartbeat path must be physically separate). See “Cross-Subnet Configurations” (page 41) for more information. NOTE: Limitations: • Because Veritas Cluster File System from Symantec (CFS) requires link-level traffic communication (LLT) among the nodes, Serviceguard cannot be configured in cross-subnet configurations with CFS alone.
NOTE: The use of a private heartbeat network is not advisable if you plan to use Remote Procedure Call (RPC) protocols and services. RPC assumes that each network adapter device or I/O card is connected to a route-able network. An isolated or private heartbeat LAN is not route-able, and an RPC request-reply directed to that LAN risks timing out without being serviced. NFS, NIS and NIS+, and CDE are examples of RPC-based applications that are frequently used on HP-UX.
monitored non-heartbeat subnets here. You can identify any number of subnets to be monitored. IMPORTANT: In a cross-subnet configuration, each package subnet configured on an interface (NIC) must have a standby interface connected to the local bridged network. See “Cross-Subnet Configurations” (page 41) for important information about requirements for such configurations. If HOSTNAME_ADDRESS_FAMILY is set to IPV4 or ANY, a stationary IP address can be either an IPv4 or an IPv6 address.
FIRST_CLUSTER_LOCK_PV, SECOND_CLUSTER_LOCK_PV The path on this node for the physical volume within the cluster-lock Volume Group that will have the cluster lock written on it (see the entry for FIRST_CLUSTER_LOCK_VG and SECOND_CLUSTER_LOCK_VG near the beginning of this list). Used only if a lock disk is used for tie-breaking services. Use FIRST_CLUSTER_LOCK_PV for the first physical lock volume and SECOND_CLUSTER_LOCK_PV for the second physical lock volume, if any.
CAPACITY_VALUE specifies a value for the CAPACITY_NAME that precedes it. It must be a floating-point value between 0 and 1000000. Capacity values are arbitrary as far as Serviceguard is concerned; they have meaning only in relation to the corresponding package weights. Capacity definition is optional, but if CAPACITY_NAME is specified, CAPACITY_VALUE must also be specified; CAPACITY_NAME must come first.
This value leads to a failover time of between approximately 18 and 22 seconds, if you are using a Quorum Server, a Fibre Channel cluster lock, or no cluster lock. Increasing the value to 25 seconds increases the failover time to between approximately 29 and 39 seconds. The time will increase by between 5 and 13 seconds if you are using a SCSI cluster lock or dual Fibre Channel cluster lock. Maximum supported value: 300 seconds (300,000,000 microseconds).
Guidelines: You need to decide whether it's more important for your installation to have fewer (but slower) cluster re-formations, or faster (but possibly more frequent) re-formations: • To ensure the fastest cluster re-formations, use the minimum value applicable to your cluster.
completes the operation. The time should be selected based on the slowest boot time in the cluster. Enter a value equal to the boot time of the slowest booting node minus the boot time of the fastest booting node plus 600 seconds (ten minutes). Default is 600,000,000 microseconds. Can be changed while the cluster is running. NETWORK_POLLING_INTERVAL Specifies how frequently the networks configured for Serviceguard are checked. Default is 2,000,000 microseconds (2 seconds).
CONFIGURED_IO_TIMEOUT_EXTENSION The number of microseconds by which to increase the time Serviceguard waits after detecting a node failure, so as to ensure that all pending I/O on the failed node has ceased. This parameter must be set for extended-distance clusters using software mirroring across data centers over links between iFCP switches; it must be set to the switches' maximum R_A_TOV value.
either HEARTBEAT_IP or STATIONARY_IP. All entries for IP_MONITOR and POLLING_TARGET apply to this subnet until the next SUBNET entry; SUBNET must be the first of each trio. By default, each of the cluster subnets is listed under SUBNET, and, if at least one gateway is detected for that subnet, IP_MONITOR is set to ON and POLLING_TARGET entries are populated with the gateway addresses, enabling target polling; otherwise the subnet is listed with IP_MONITOR set to OFF.
Can be changed while the cluster is running; must be removed if the preceding SUBNET entry is removed. POLLING_TARGET The IP address to which polling messages will be sent from all network interfaces on the subnet specified in the preceding SUBNET entry, if IP_MONITOR is set to ON. This is called target polling. Each subnet can have multiple polling targets; repeat POLLING_TARGET entries as needed.
WEIGHT_NAME are the same as those spelled out for CAPACITY_NAME earlier in this list. These parameters are optional, but if they are defined, WEIGHT_DEFAULT must follow WEIGHT_NAME, and must be set to a floating-point value between 0 and 1000000. If they are not specified for a given weight, Serviceguard will assume a default value of zero for that weight.
information, see “Setting up Access-Control Policies” (page 230). VOLUME_GROUP The name of an LVM volume group whose disks are attached to at least two nodes in the cluster. Such disks are considered cluster-aware. The volume group name can have up to 39 characters (bytes). Cluster Configuration: Next Step When you are ready to configure the cluster, proceed to “Configuring the Cluster ” (page 219). If you find it useful to record your configuration ahead of time, use the worksheet in Appendix E.
NOTE: As of the date of this manual, the Framework for HP Serviceguard Toolkits deals specifically with legacy packages. Logical Volume and File System Planning NOTE: LVM Volume groups that are to be activated by packages must also be defined as cluster-aware in the cluster configuration file. See “Cluster Configuration Planning ” (page 134). Disk groups (for Veritas volume managers) that are to be activated by packages must be defined in the package configuration file, described below.
# /dev/vg01/lvoldb4 /general   vxfs   defaults 0 2   # logical volumes that
# /dev/vg01/lvoldb5 raw_free   ignore ignore   0 0   # exist for Serviceguard's
# /dev/vg01/lvoldb6 raw_free   ignore ignore   0 0   # HA package. Do not uncomment.
Create an entry for each logical volume, indicating its use for a file system or for a raw device. Don’t forget to comment out the lines (using the # character as shown).
NOTE: Do not use /etc/fstab to mount file systems that are used by Serviceguard packages.
CVM 4.1 and later requires you to configure multiple heartbeat networks, or a single heartbeat with a standby. Using APA, Infiniband, or VLAN interfaces as the heartbeat network is not supported. You create a chain of package dependencies for application failover packages and the non-failover packages: 1. The failover package’s applications should not run on a node unless the mount point packages are already running.
Planning for Expansion You can add packages to a running cluster. This process is described in “Cluster and Package Maintenance” (page 297). When adding packages, be sure not to exceed the value of max_configured_packages as defined in the cluster configuration file; see “Cluster Configuration Parameters ” (page 139). You can modify this parameter while the cluster is running if you need to.
Table 4-2 Package Failover Behavior (continued)
Switching Behavior: All packages switch following a system reset (an immediate halt without a graceful shutdown) on the node when a specific service fails. Halt scripts are not run.
   Parameters in Configuration File:
   • service_fail_fast_enabled set to yes for a specific service.
   • auto_run set to yes for all packages.
Switching Behavior: All packages switch following a system reset on the node when any service fails.
   Parameters in Configuration File:
   • service_fail_fast_enabled set to yes for all services.
resource_name /net/interfaces/lan/status/lan1
resource_polling_interval 60
resource_start deferred
resource_up_value = up

resource_name /net/interfaces/lan/status/lan0
resource_polling_interval 60
resource_start automatic
resource_up_value = up

NOTE: For a legacy package, specify the deferred resources again in the package control script, using the DEFERRED_RESOURCE_NAME parameter:
DEFERRED_RESOURCE_NAME[0]="/net/interfaces/lan/status/lan0"
DEFERRED_RESOURCE_NAME[1]="/net/interfaces/lan/status/lan1"
If a
Make a package dependent on another package if the first package cannot (or should not) function without the services provided by the second, on the same node. For example, pkg1 might run a real-time web interface to a database managed by pkg2 on the same node. In this case it might make sense to make pkg1 dependent on pkg2. In considering whether or not to create a simple dependency between packages, use the Rules for Simple Dependencies and Guidelines for Simple Dependencies that follow.
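As an illustration of the pkg1/pkg2 scenario just described, the relevant entries in pkg1's package configuration file might look like this (a minimal sketch; the dependency name shown is hypothetical):

   dependency_name       pkg2_dep
   dependency_condition  pkg2 = UP
   dependency_location   same_node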
• If pkg1 is a failover package and pkg2 is a multi-node or system multi-node package, and pkg2 fails, pkg1 will halt and fail over to the next node on its node_name list on which pkg2 is running (and any other dependencies, such as resource dependencies or a dependency on a third package, are met).
• In the case of failover packages with a configured_node failover_policy, a set of rules governs under what circumstances pkg1 can force pkg2 to start on a given node.
NOTE: Keep the following in mind when reading the examples that follow, and when actually configuring priorities: 1. auto_run (page 265) should be set to yes for all the packages involved; the examples assume that it is. 2. Priorities express a ranking order, so a lower number means a higher priority (10 is a higher priority than 30, for example).
— If both packages have moved from node1 to node2 and node1 becomes available, pkg2 will fail back to node1 only if pkg2’s priority is higher than pkg1’s: ◦ If the priorities are equal, neither package will fail back (unless pkg1 is not running; in that case pkg2 can fail back).
because that provides the best chance for a successful failover (and failback) if pkg1 fails. But you also need to weigh the relative importance of the packages. If pkg2 runs a database that is central to your business, you probably want it to run undisturbed, no matter what happens to application packages that depend on it. In this case, the database package should have the highest priority.
IMPORTANT: If you have not already done so, read the discussion of Simple Dependencies (page 168) before you go on.
The interaction of the legal values of dependency_location and dependency_condition creates the following possibilities:
• Same-node dependency: a package can require that another package be UP on the same node. This is the case covered in the section on Simple Dependencies (page 168).
• Different-node dependency: a package can require that another package be UP on a different node.
Rules for different_node and any_node Dependencies These rules apply to packages whose dependency_condition is UP and whose dependency_location is different_node or any_node. For same-node dependencies, see Simple Dependencies (page 168); for exclusionary dependencies, see “Rules for Exclusionary Dependencies” (page 174). • • • Both packages must be failover packages whose failover_policy (page 268) is configured_node.
will allow the failed package to halt after the successor_halt_timeout number of seconds whether or not the dependent packages have completed their halt scripts. 2. Halts the failing package. After the successor halt timer has expired or the dependent packages have all halted, Serviceguard starts the halt script of the failing package, regardless of whether the dependents' halts succeeded, failed, or timed out. 3.
Package Weights and Node Capacities You define a capacity, or capacities, for a node (in the cluster configuration file), and corresponding weights for packages (in the package configuration file). Node capacity is consumed by package weights. Serviceguard ensures that the capacity limit you set for a node is never exceeded by the combined weight of packages running on it; if a node's available capacity will be exceeded by a package that wants to run on that node, the package will not run there.
CAPACITY_VALUE 10
Now all packages will be considered equal in terms of their resource consumption, and this node will never run more than ten packages at one time. (You can change this behavior if you need to by modifying the weight for some or all packages, as the next example shows.) Next, define the CAPACITY_NAME and CAPACITY_VALUE parameters for the remaining nodes, setting CAPACITY_NAME to package_limit in each case. You may want to set CAPACITY_VALUE to different values for different nodes.
Points to Keep in Mind The following points apply specifically to the Simple Method (page 177). Read them in conjunction with the Rules and Guidelines (page 184), which apply to all weights and capacities. • If you use the reserved CAPACITY_NAME package_limit, then this is the only type of capacity and weight you can define in this cluster. • If you use the reserved CAPACITY_NAME package_limit, the default weight for all packages is 1.
could be misleading to identify single resources, such as “processor”, if packages really contend for sets of interacting resources that are hard to characterize with a single name. In any case, the real-world meanings of the names you assign to node capacities and package weights are outside the scope of Serviceguard. Serviceguard simply ensures that for each capacity configured for a node, the combined weight of packages currently running on that node does not exceed that capacity.
CAPACITY_NAME A
CAPACITY_VALUE 80
CAPACITY_NAME B
CAPACITY_VALUE 50

NODE_NAME node2
CAPACITY_NAME A
CAPACITY_VALUE 60
CAPACITY_NAME B
CAPACITY_VALUE 70
...
NOTE: You do not have to define capacities for every node in the cluster. If any capacity is not defined for any node, Serviceguard assumes that node has an infinite amount of that capacity.
Example 3
WEIGHT_NAME A
WEIGHT_DEFAULT 20
WEIGHT_NAME B
WEIGHT_DEFAULT 15
This means that any package for which weight A is not defined in its package configuration file will have a weight A of 20, and any package for which weight B is not defined in its package configuration file will have a weight B of 15. Given the capacities we defined in the cluster configuration file (see “Defining Capacities”), node1 can run any three packages that use the default for both A and B.
weight_name A
weight_value 40
In pkg3's package configuration file:
weight_name B
weight_value 35
weight_name A
weight_value 0
In pkg4's package configuration file:
weight_name B
weight_value 40
IMPORTANT: weight_name in the package configuration file must exactly match the corresponding CAPACITY_NAME in the cluster configuration file. This applies to case as well as spelling: weight_name a would not match CAPACITY_NAME A.
Rules and Guidelines The following rules and guidelines apply to both the Simple Method (page 177) and the Comprehensive Method (page 179) of configuring capacities and weights. • You can define a maximum of four capacities, and corresponding weights, throughout the cluster. NOTE: But if you use the reserved CAPACITY_NAME package_limit, you can define only that single capacity and corresponding weight. See “Simple Method” (page 177).
For further discussion and use cases, see the white paper Using Serviceguard’s Node Capacity and Package Weight Feature on docs.hp.com under High Availability —> Serviceguard —> White Papers. How Package Weights Interact with Package Priorities and Dependencies If necessary, Serviceguard will halt a running lower-priority package that has weight to make room for a higher-priority package that has weight.
About External Scripts
The package configuration template for modular scripts explicitly provides for external scripts. These replace the CUSTOMER DEFINED FUNCTIONS in legacy scripts, and can be run either:
• On package startup and shutdown, as essentially the first and last functions the package performs.
NOTE: Some variables, including SG_PACKAGE, and SG_NODE, are available only at package run and halt time, not when the package is validated. You can use SG_PACKAGE_NAME at validation time as a substitute for SG_PACKAGE. IMPORTANT: For more information, see the template in $SGCONF/examples/external_script.template. A sample script follows. It assumes there is another script called monitor.sh, which will be configured as a Serviceguard service to monitor some application. The monitor.
do
  case ${SG_SERVICE_CMD[i]} in
    *monitor.
Suppose a script run by pkg1 does a cmmodpkg -d of pkg2, and a script run by pkg2 does a cmmodpkg -d of pkg1. If both pkg1 and pkg2 start at the same time, the pkg1 script now tries to cmmodpkg pkg2. But that cmmodpkg command has to wait for pkg2 startup to complete. The pkg2 script tries to cmmodpkg pkg1, but pkg2 has to wait for pkg1 startup to complete, thereby causing a command loop.
NOTE: last_halt_failed appears only in the line output of cmviewcl, not the default tabular format; you must use the -v and -f line options to see it. The value of last_halt_failed is no if the halt script ran successfully, or was not run since the node joined the cluster, or was not run since the package was configured to run on the node; otherwise it is yes.
• Deploying applications in this environment requires careful consideration; see “Implications for Application Deployment” (page 191).
• If a monitored_subnet (page 272) is configured for PARTIAL monitored_subnet_access in a package’s configuration file, it must be configured on at least one of the nodes on the node_name list for that package.
Configuring node_name First you need to make sure that pkg1 will fail over to a node on another subnet only if it has to. For example, if it is running on NodeA and needs to fail over, you want it to try NodeB, on the same subnet, before incurring the cross-subnet overhead of failing over to NodeC or NodeD.
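For example, pkg1's node_name entries could be listed so that the same-subnet node is tried first (order determines failover preference; the node names shown follow the scenario above):

   node_name NodeA
   node_name NodeB
   node_name NodeC
   node_name NodeD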
ip_address 15.244.56.100
ip_address 15.244.56.101
Configuring a Package: Next Steps
When you are ready to start configuring a package, proceed to Chapter 6: “Configuring Packages and Their Services ” (page 255); start with “Choosing Package Modules” (page 256). (If you find it helpful, you can assemble your package configuration data ahead of time on a separate worksheet for each package; blank worksheets are in Appendix E.
5 Building an HA Cluster Configuration This chapter and the next take you through the configuration tasks required to set up a Serviceguard cluster. These procedures are carried out on one node, called the configuration node, and the resulting binary file is distributed by Serviceguard to all the nodes in the cluster. In the examples in this chapter, the configuration node is named ftsys9, and the sample target node is called ftsys10.
Appendix D (page 413) provides instructions for upgrading Serviceguard without halting the cluster. Make sure you read the entire Appendix, and the corresponding section in the Release Notes, before you begin. Learning Where Serviceguard Files Are Kept Serviceguard uses a special file, /etc/cmcluster.conf, to define the locations for configuration and log files within the HP-UX filesystem. The following locations are defined in the file: ################## cmcluster.
NOTE: For more information and advice, see the white paper Securing Serviceguard at http://docs.hp.com -> High Availability -> Serviceguard -> White Papers. Allowing Root Access to an Unconfigured Node To enable a system to be included in a cluster, you must enable HP-UX root access to the system by the root user of every other potential cluster node. The Serviceguard mechanism for doing this is the file $SGCONF/cmclnodelist.
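The file is a simple list of hostname and user-name pairs, one pair per line; a minimal sketch, assuming hypothetical node names, might look like this:

   # Allow root on these nodes to configure this system into a cluster
   gryf    root
   sly     root
   bit     root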
IMPORTANT: If $SGCONF/cmclnodelist does not exist, Serviceguard will look at ~/.rhosts. HP strongly recommends that you use cmclnodelist. NOTE: When you upgrade a cluster from Version A.11.15 or earlier, entries in $SGCONF/cmclnodelist are automatically updated to Access Control Policies in the cluster configuration file. All non-root user-hostname pairs are assigned the role of Monitor.
Configuring Name Resolution Serviceguard uses the name resolution services built in to HP-UX. Serviceguard nodes can communicate over any of the cluster’s shared networks, so the network resolution service you are using (such as DNS, NIS, or LDAP) must be able to resolve each of their primary addresses on each of those networks to the primary hostname of the node in question.
15.145.162.131    gryf.uksr.hp.com     gryf1    alias-node1
10.8.0.131        gryf2.uksr.hp.com    gryf2    alias-node1
10.8.1.131        gryf3.uksr.hp.com    gryf3    alias-node1
15.145.162.132    sly.uksr.hp.com      sly1     alias-node2
10.8.0.132        sly2.uksr.hp.com     sly2     alias-node2
10.8.1.132        sly3.uksr.hp.com     sly3     alias-node2
IMPORTANT: Serviceguard does not support aliases for IPv6 addresses.
NOTE: If a NIC fails, the affected node will be able to fail over to a standby LAN so long as the node is running in the cluster. But if a NIC that is used by Serviceguard fails when the affected node is not running in the cluster, Serviceguard will not be able to restart the node. (For instructions on replacing a failed NIC, see “Replacing LAN or Fibre Channel Cards” (page 372).) 1. Edit the /etc/hosts file on all nodes in the cluster.
NOTE: HP recommends that you also make the name service itself highly available, either by using multiple name servers or by configuring the name service into a Serviceguard package. Ensuring Consistency of Kernel Configuration Make sure that the kernel configurations of all cluster nodes are consistent with the expected behavior of the cluster during failover.
NOTE: If you contact HP support regarding Serviceguard or networking, please be sure to mention any parameters that have been changed from their defaults. Serviceguard has also been tested with non-default values for these network parameters: • ip6_nd_dad_solicit_count — This network parameter enables the Duplicate Address Detection feature for IPv6 addresses. You can find more information under “IPv6 Relocatable Address and Duplicate Address Detection Feature” (page 446).
the system sends the packet through the interface on which the unbound packet was received. This means that the packet source addresses (and therefore the interfaces on a multihomed host) affect the selection of a gateway for outbound packets once ip_strong_es_model is enabled. For more information see “Using a Relocatable Address as the Source Address for an Application that is Bound to INADDR_ANY” (page 401).
lvextend -m 1 /dev/vg00/lvol3 /dev/dsk/c4t6d0
5. Update the boot information contained in the BDRA for the mirror copies of boot, root and primary swap.
/usr/sbin/lvlnboot -b /dev/vg00/lvol1
/usr/sbin/lvlnboot -s /dev/vg00/lvol2
/usr/sbin/lvlnboot -r /dev/vg00/lvol3
6. Verify that the mirrors were properly created.
Backing Up Cluster Lock Disk Information After you configure the cluster and create the cluster lock volume group and physical volume, you should create a backup of the volume group configuration data on each lock volume group. Use the vgcfgbackup command for each lock volume group you have configured, and save the backup file in case the lock configuration must be restored to a new disk with the vgcfgrestore command following a disk failure.
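For example, assuming the lock volume group is named /dev/vglock:

   vgcfgbackup /dev/vglock

By default vgcfgbackup saves the configuration data to /etc/lvmconf/vglock.conf; you can use the -f option to write the backup file to another location.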
IMPORTANT: On HP 9000 systems, there is no means of partitioning a disk or LUN, so you will need to dedicate an entire small disk or LUN for the lock LUN. This means that in a mixed cluster containing both Integrity and HP-PA systems, you must also use an entire disk or LUN; if you partition the device as described below, the HP-PA nodes will not be able to see the partitions.
NOTE: The first partition, identified by the device file /dev/dsk/c1t4d0s1 or /dev/disk/disk12_p1 in this example, is reserved by EFI and cannot be used for any other purpose. 4. Create the device files on the other cluster nodes. Use the command insf -e on each node. This will create device files corresponding to the three partitions, though the names themselves may differ from node to node depending on each node’s I/O configuration. 5. Define the lock LUN; see “Defining the Lock LUN”.
NOTE: You should not edit cmignoretypes.conf.
If Serviceguard does not recognize a device, you will see an error such as the following when you run cmquerycl:
Error reading device /dev/dsk/c0t0d0 0x8
Note: Disks were discovered which are not in use by either LVM or VxVM.
Use pvcreate(1M) to initialize a disk for LVM or, use vxdiskadm(1M) to initialize a disk for VxVM.
In this example, Serviceguard did not recognize the type of the device identified by /dev/dsk/c0t0d0.
Creating the Storage Infrastructure and Filesystems with LVM, VxVM and CVM In addition to configuring the cluster, you create the appropriate logical volume infrastructure to provide access to data from different nodes. This is done several ways: • for Logical Volume Manager, see “Creating a Storage Infrastructure with LVM” (page 210). Do this before you configure the cluster if you use a lock disk; otherwise it can be done before or after.
Availability -> Event Monitoring Service and HA Monitors -> Installation and User’s Guide). Creating Volume Groups for Mirrored Individual Data Disks The procedure described in this section uses physical volume groups for mirroring of individual disks to ensure that each logical volume is mirrored to a disk on a different I/O bus. This kind of arrangement is known as PVG-strict mirroring.
Using PV Strict Mirroring
Use the following steps to build a volume group on the configuration node (ftsys9). Later, the same volume group will be created on other nodes.
1. First, create the group directory; for example, vgdatabase:
mkdir /dev/vgdatabase
2.
NOTE: If you are using disk arrays in RAID 1 or RAID 5 mode, omit the -m 1 and -s g options.
Creating File Systems
If your installation uses filesystems, create them next. Use the following commands to create a filesystem for mounting on the logical volume just created:
1. Create the filesystem on the newly created logical volume:
newfs -F vxfs /dev/vgdatabase/rlvol1
Note the use of the raw device file for the logical volume.
2. Create a directory to mount the disk:
mkdir /mnt1
3.
2. Still on ftsys9, copy the map file to ftsys10:
rcp /tmp/vgdatabase.map ftsys10:/tmp/vgdatabase.map
3. On ftsys10, create the volume group directory:
mkdir /dev/vgdatabase
4. Still on ftsys10, create a control file named group in the directory /dev/vgdatabase, as follows:
mknod /dev/vgdatabase/group c 64 0xhh0000
Use the same minor number as on ftsys9. Use the following command to display a list of existing volume groups:
ls -l /dev/*/group
5.
NOTE: When you use PVG-strict mirroring, the physical volume group configuration is recorded in the /etc/lvmpvg file on the configuration node. This file defines the physical volume groups which are the basis of mirroring and indicate which physical volumes belong to each physical volume group. Note that on each cluster node, the /etc/lvmpvg file must contain the correct physical volume names for the physical volume groups’s disks as they are known on that node.
3. If /etc/lvmpvg on ftsys10 contains entries for volume groups that do not appear in /etc/lvmpvg.new, then copy all physical volume group entries for that volume group to /etc/lvmpvg.new.
4. Adjust any physical volume names in /etc/lvmpvg.new to reflect their correct names on ftsys10.
5. On ftsys10, copy /etc/lvmpvg to /etc/lvmpvg.old to create a backup. Copy /etc/lvmpvg.new to /etc/lvmpvg on ftsys10.
to initialize multiple disks, or use the vxdisksetup command to initialize one disk at a time, as in the following example:
/usr/lib/vxvm/bin/vxdisksetup -i c0t3d2
Initializing Disks Previously Used by LVM
If a physical disk has been previously used with LVM, you should use the pvremove command to delete the LVM header data from all the disks in the volume group.
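For example, to clear the LVM header from one such disk (the device file name shown is hypothetical):

   pvremove /dev/rdsk/c0t3d2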
Creating Volumes
Use the vxassist command to create logical volumes. The following is an example:
vxassist -g logdata make log_files 1024m
This command creates a 1024 MB volume named log_files in a disk group named logdata. The volume can be referenced with the block device file /dev/vx/dsk/logdata/log_files or the raw (character) device file /dev/vx/rdsk/logdata/log_files.
When all disk groups have been deported, you must issue the following command on all cluster nodes to allow them to access the disk groups: vxdctl enable Re-Importing Disk Groups After deporting disk groups, they are not available for use on the node until they are imported again either by a package control script or with a vxdg import command.
NOTE: You can use Serviceguard Manager to configure a cluster: open the System Management Homepage (SMH) and choose Tools-> Serviceguard Manager. See “Using Serviceguard Manager” (page 32) for more information. To use Serviceguard commands to configure the cluster, follow directions in the remainder of this section. Use the cmquerycl command to specify a set of nodes to be included in the cluster and to generate a template for the cluster configuration file.
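For example, a minimal invocation for a two-node cluster might look like this (node and file names are hypothetical):

   cmquerycl -v -C $SGCONF/clust1.conf -n ftsys9 -n ftsys10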
-w none skips network querying. If you have recently checked the networks, this option will save time. Specifying the Address Family for the Cluster Hostnames You can use the -a option to tell Serviceguard to resolve cluster node names (as well as Quorum Server hostnames, if any) to IPv4 addresses only (-a ipv4) IPv6 addresses only (-a ipv6), or both (-a any). You can also configure the address family by means of the HOSTNAME_ADDRESS_FAMILY in the cluster configuration file.
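For instance, to skip the full network probe and force IPv4-only resolution of the hostnames, a command along these lines could be used (again, node and file names are hypothetical):

   cmquerycl -v -w none -a ipv4 -C $SGCONF/clust1.conf -n ftsys9 -n ftsys10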
• If you don't use the -h option, Serviceguard will choose the best available configuration to meet minimum requirements, preferring an IPv4 LAN over IPv6 where both are available. The resulting configuration could be IPv4 only, IPv6 only, or a mix of both.
• You can override Serviceguard's default choices by means of the HEARTBEAT_IP parameter, discussed under “Cluster Configuration Parameters ” (page 139); that discussion also spells out the heartbeat requirements.
group. After you are done, do not forget to run vgchange -c y vgname to rewrite the cluster ID back to the volume group; for example:
vgchange -c y /dev/vglock
NOTE: You should not configure a second lock volume group or physical volume unless your configuration specifically requires it. See “Dual Lock Disk” (page 64).
Specifying a Quorum Server IMPORTANT: The following are standard instructions. For special instructions that may apply to your version of Serviceguard and the Quorum Server see “Configuring Serviceguard to Use the Quorum Server” in the latest version HP Serviceguard Quorum Server Version A.04.00 Release Notes, at http://www.docs.hp.com -> High Availability -> Quorum Server. A cluster lock LUN or Quorum Server is required for two-node clusters.
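To have cmquerycl include a Quorum Server in the generated template, you can use the -q option; for example (the Quorum Server hostname and file name shown are hypothetical):

   cmquerycl -q qshost -n ftsys9 -n ftsys10 -C $SGCONF/clust1.conf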
Node Names:    nodeA
               nodeB
               nodeC
               nodeD

Bridged networks (full probing performed):
1    lan3    (nodeA)
     lan4    (nodeA)
     lan3    (nodeB)
     lan4    (nodeB)
2    lan1    (nodeA)
     lan1    (nodeB)
3    lan2    (nodeA)
     lan2    (nodeB)
4    lan3    (nodeC)
     lan4    (nodeC)
     lan3    (nodeD)
     lan4    (nodeD)
5    lan1    (nodeC)
     lan1    (nodeD)
6    lan2    (nodeC)
     lan2    (nodeD)

IP subnets:
IPv4:
15.13.164.0    lan1    (nodeA)
               lan1    (nodeB)
15.13.172.0    lan1    (nodeC)
               lan1    (nodeD)
15.13.165.0    lan2    (nodeA)
               lan2    (nodeB)
15.13.182.0    lan2    (nodeC)
               lan2    (nodeD)
15.244.65.
3ffe:1111::/64    lan3    (nodeA)
                  lan3    (nodeB)
3ffe:2222::/64    lan3    (nodeC)
                  lan3    (nodeD)

Possible Heartbeat IPs:
15.13.164.0    15.13.164.1      (nodeA)
               15.13.164.2      (nodeB)
15.13.172.0    15.13.172.158    (nodeC)
               15.13.172.159    (nodeD)
15.13.165.0    15.13.165.1      (nodeA)
               15.13.165.2      (nodeB)
15.13.182.0    15.13.182.158    (nodeC)
               15.13.182.159    (nodeD)

Route connectivity (full probing performed):
1    15.13.164.0
     15.13.172.0
2    15.13.165.0
     15.13.182.0
3    15.244.65.0
4    15.244.56.
Identifying Heartbeat Subnets The cluster configuration file includes entries for IP addresses on the heartbeat subnet. HP recommends that you use a dedicated heartbeat subnet, and configure heartbeat on other subnets as well, including the data subnet. The heartbeat can be configured on an IPv4 or IPv6 subnet; see “About Hostname Address Families: IPv4-Only, IPv6-Only, and Mixed Mode” (page 135). The heartbeat can comprise multiple subnets joined by a router.
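In the cluster configuration file, heartbeat subnets are identified by HEARTBEAT_IP entries under each node. A sketch, using hypothetical interface names and addresses, might look like this:

   NODE_NAME ftsys9
     NETWORK_INTERFACE lan0
       HEARTBEAT_IP 192.168.1.18
     NETWORK_INTERFACE lan1
       STATIONARY_IP 15.13.168.18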
A Note about Terminology Although you will also sometimes see the term role-based access (RBA) in the output of Serviceguard commands, the preferred set of terms, always used in this manual, is as follows: • Access-control policies- the set of rules defining user access to the cluster. — Access-control policy - one of these rules, comprising the three parameters USER_NAME, USER_HOST, USER_ROLE. See “Setting up Access-Control Policies” (page 230).
Figure 5-1 Access Roles
Levels of Access
Serviceguard recognizes two levels of access, root and non-root:
• Root access: Full capabilities; only role allowed to configure the cluster.
As Figure 5-1 shows, users with root access have complete control over the configuration of the cluster and its packages. This is the only role allowed to use the cmcheckconf, cmapplyconf, cmdeleteconf, and cmmodnet -a commands.
IMPORTANT: Users on systems outside the cluster can gain Serviceguard root access privileges to configure the cluster only via a secure connection (rsh or ssh). • Non-root access: Other users can be assigned one of four roles: — Full Admin: Allowed to perform cluster administration, package administration, and cluster and package view operations. These users can administer the cluster, but cannot configure or create a cluster. Full Admin includes the privileges of the Package Admin role.
NOTE: For more information and advice, see the white paper Securing Serviceguard at http://docs.hp.com -> High Availability -> Serviceguard -> White Papers. Define access-control policies for a cluster in the cluster configuration file; see “Cluster Configuration Parameters ” (page 139). You can define up to 200 access policies for each cluster. A root user can create or modify access control policies while the cluster is running.
NOTE: If you set USER_HOST to ANY_SERVICEGUARD_NODE, set USER_ROLE to MONITOR; users connecting from outside the cluster cannot have any higher privileges (unless they are connecting via rsh or ssh; this is treated as a local connection). Depending on your network configuration, ANY_SERVICEGUARD_NODE can provide wide-ranging read-only access to the cluster.
Plan the cluster’s roles and validate them as soon as possible. If your organization’s security policies allow it, you may find it easiest to create group logins. For example, you could create a MONITOR role for user operator1 from ANY_CLUSTER_NODE. Then you could give this login name and password to everyone who will need to monitor your clusters.
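Such a policy would look like this in the cluster configuration file:

   USER_NAME operator1
   USER_HOST ANY_CLUSTER_NODE
   USER_ROLE MONITOR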
NOTE: Check spelling especially carefully when typing wildcards, such as ANY_USER and ANY_SERVICEGUARD_NODE. If they are misspelled, Serviceguard will assume they are specific users or nodes.
• If all nodes specified are in the same heartbeat subnet, except in cross-subnet configurations (page 41).
• If you specify the wrong configuration filename.
• If all nodes can be accessed.
• No more than one CLUSTER_NAME and AUTO_START_TIMEOUT are specified.
• The value for package run and halt script timeouts is less than 4294 seconds.
• The value for AUTO_START_TIMEOUT variables is >=0.
• Heartbeat network minimum requirement is met.
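For example, to verify a cluster configuration file before applying it (the file name is hypothetical):

   cmcheckconf -k -v -C $SGCONF/clust1.conf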
NOTE: Using the -k option means that cmapplyconf only checks disk connectivity to the LVM disks that are identified in the ASCII file. Omitting the -k option (the default behavior) means that cmapplyconf tests the connectivity of all LVM disks on all nodes. Using -k can result in significantly faster operation of the command. • Deactivate the cluster lock volume group.
NOTE: You must use the vgcfgbackup command to store a copy of the cluster lock disk's configuration data whether you created the volume group using the System Management Homepage (SMH), SAM, or HP-UX commands. If the cluster lock disk ever needs to be replaced while the cluster is running, you must use the vgcfgrestore command to restore lock information to the replacement disk.
IMPORTANT: Before you proceed, make sure you have read “Planning Veritas Cluster Volume Manager (CVM) and Cluster File System (CFS)” (page 164), which contains important information and cautions. Preparing the Cluster and the System Multi-node Package The Veritas cluster volumes are managed by a Serviceguard-supplied system multi-node package, SG-CFS-pkg, which runs on all nodes at once, and cannot fail over. The package for CVM 4.
Node            : ftsys10
Cluster Manager : up
CVM state       : up
MOUNT POINT     TYPE     SHARED VOLUME     DISK GROUP     STATUS

NOTE: Because the CVM system multi-node package automatically starts up the Veritas processes, do not edit /etc/llthosts, /etc/llttab, or /etc/gabtab.
Creating the Disk Groups
Initialize the disk group from the master node.
1. Find the master node using vxdctl or cfscluster status.
2. Initialize a new disk group, or import an existing disk group, in shared mode, using the vxdg command.
3. Activate the disk group and start up the package:
cfsdgadm activate logdata
4. To verify, you can use cfsdgadm or cmviewcl. This example shows the cfsdgadm output:
cfsdgadm display -v logdata
NODE NAME       ACTIVATION MODE
ftsys9          sw (sw)
  MOUNT POINT     SHARED VOLUME     TYPE
ftsys10         sw (sw)
  MOUNT POINT     SHARED VOLUME     TYPE
5. To view the name of the package that is monitoring a disk group, use the cfsdgadm show_package command:
cfsdgadm show_package logdata
SG-CFS-DG-1
Creating Volumes
1.
3. Verify with cmviewcl or cfsmntadm display. This example uses the cfsmntadm command:
cfsmntadm display
4.
Creating Checkpoint and Snapshot Packages for CFS The storage checkpoints and snapshots are two additional mount point package types. They can be associated with the cluster via the cfsmntadm(1m) command. Mount Point Packages for Storage Checkpoints The Veritas File System provides a unique storage checkpoint facility which quickly creates a persistent image of a filesystem at an exact point in time.
ftsys10        up        running

MULTI_NODE_PACKAGES

PACKAGE        STATUS    STATE      AUTO_RUN    SYSTEM
SG-CFS-pkg     up        running    enabled     yes
SG-CFS-DG-1    up        running    enabled     no
SG-CFS-MP-1    up        running    enabled     no
SG-CFS-CK-1    up        running    disabled    no

/tmp/check_logfiles now contains a point-in-time view of /tmp/logdata/log_files, and it is persistent.
Package name "SG-CFS-SN-1" was generated to control the resource Mount point "/local/snap1" was associated to the cluster cfsmount /local/snap1 cmviewcl CLUSTER cfs-cluster STATUS up NODE STATUS ftsys9 up ftsys10 up MULTI_NODE_PACKAGES PACKAGE SG-CFS-pkg SG-CFS-DG-1 SG-CFS-MP-1 SG-CFS-SN-1 STATUS up up up up STATE running running STATE running running running running AUTO_RUN enabled enabled enabled disabled SYSTEM yes no no no The snapshot file system /local/snap1 is now mounted and provides a poi
NOTE: You must use CFS if you want to configure file systems on your CVM disk groups; see “Creating a Storage Infrastructure with Veritas Cluster File System (CFS)” (page 237). The two sets of procedures — with and without CFS — use many of the same commands, but in a slightly different order. Before you start, make sure the directory in which VxVM commands are stored, /usr/lib/vxvm/bin, is in your path.
NOTE: Cluster configuration is described under “Configuring the Cluster ” (page 219). Check the heartbeat configuration. CVM 4.1 and later versions require that the cluster have either multiple heartbeats or a single heartbeat with a standby, and do not allow the use of Auto Port Aggregation, Infiniband, or VLAN interfaces as a heartbeat subnet. The CVM cluster volumes are managed by a Serviceguard-supplied system multi-node package which runs on all nodes at once, and cannot fail over. For CVM 4.
To initialize a disk for CVM, log on to the master node, then use the vxdiskadm program to initialize multiple disks, or use the vxdisksetup command to initialize one disk at a time, as in the following example:
/usr/lib/vxvm/bin/vxdisksetup -i c4t3d4
Creating Disk Groups
Use the following steps to create disk groups.
1. Use the vxdg command to create disk groups. Use the -s option to specify shared mode, as in the following example:
vxdg -s init logdata c0t3d2
2.
This command creates a 1024 MB volume named log_files in a disk group named logdata. The volume can be referenced with the block device file /dev/vx/dsk/logdata/log_files or the raw (character) device file /dev/vx/rdsk/logdata/log_files.
Verify the configuration with the following command:
vxdg list
Now deport the disk groups:
vxdg deport <DiskGroupName>
Adding Disk Groups to the Package Configuration
Next you need to specify the CVM disk groups for each package that uses them.
Checking Cluster Operation with Serviceguard Commands
Serviceguard also provides several commands for control of the cluster:
• cmviewcl checks status of the cluster and many of its components. A non-root user with the role of Monitor can run this command from a cluster node or see status information in Serviceguard Manager.
• Start the node. You can use Serviceguard Manager or use the cmrunnode command.
• To verify that the node has returned to operation, check in Serviceguard Manager, or use the cmviewcl command again.
4. Bring down the cluster. You can do this in Serviceguard Manager, or use the cmhaltcl -v -f command.
Additional cluster testing is described in “Troubleshooting Your Cluster” (page 365). Refer to Appendix A for a complete list of Serviceguard commands.
AUTO_START_TIMEOUT period. If neither of the other two cases becomes true in that time, startup will fail.
To enable automatic cluster start, set the flag AUTOSTART_CMCLD to 1 in /etc/rc.config.d/cmcluster on each node in the cluster; the nodes will then join the cluster at boot time.
Here is an example of the /etc/rc.config.d/cmcluster file:
#************************ CMCLUSTER ************************
# Highly Available Cluster configuration
#
# @(#) $Revision: 72.
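The fragment above is truncated; the line that actually enables automatic startup, as described earlier, is of this form:

   AUTOSTART_CMCLD=1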
You still need to have redundant networks, but you do not need to specify any heartbeat LANs, since there is no other node to send heartbeats to. In the cluster configuration file, specify all the LANs that you want Serviceguard to monitor. Use the STATIONARY_IP parameter, rather than HEARTBEAT_IP, to specify LANs that already have IP addresses. For standby LANs, all that is required is the NETWORK_INTERFACE parameter with the LAN device name.
hacl-probe stream tcp nowait root /opt/cmom/lbin/cmomd /opt/cmom/lbin/cmomd -i -f /var/opt/cmom/cmomd.log -r /var/opt/cmom
3. Restart inetd:
/etc/init.d/inetd restart
Deleting the Cluster Configuration
As root user, you can delete a cluster configuration from all cluster nodes by using Serviceguard Manager or the command line. The cmdeleteconf command prompts for a verification before deleting the files unless you use the -f option. You can delete the configuration only when the cluster is down.
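For example, to delete the configuration of a cluster named clust1 without being prompted for confirmation, a command along these lines could be used (the cluster name is hypothetical; see the cmdeleteconf manpage for the exact options supported by your release):

   cmdeleteconf -f -c clust1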
6 Configuring Packages and Their Services Serviceguard packages group together applications and the services and resources they depend on. The typical Serviceguard package is a failover package that starts on one node but can be moved (“failed over”) to another if necessary. See “What is Serviceguard? ” (page 29), “How the Package Manager Works” (page 67), and “Package Configuration Planning ” (page 162) for more information.
NOTE: This is a new process for configuring packages, as of Serviceguard A.11.18. This manual refers to packages created by this method as modular packages, and assumes that you will use it to create new packages; it is simpler and more efficient than the older method, allowing you to build packages from smaller modules, and eliminating the separate package control script and the need to distribute it manually. Packages created using Serviceguard A.11.17 or earlier are referred to as legacy packages.
them, and then start them up on another node selected from the package’s configuration list; see “node_name” (page 265). To generate a package configuration file that creates a failover package, include -m sg/failover on the cmmakepkg command line. See “Generating the Package Configuration File” (page 287).
• Multi-node packages. These packages run simultaneously on more than one node in the cluster.
NOTE: On systems that support CFS, you configure the CFS system multi-node package by means of the cfscluster command, not by editing a package configuration file; see “Creating a Storage Infrastructure with Veritas Cluster File System (CFS)” (page 237).
NOTE: The following parameters cannot be configured for multi-node or system multi-node packages:
• failover_policy
• failback_policy
• ip_subnet
• ip_address
Volume groups configured for packages of these types must be activated in shared mode.
start on the first eligible node on which an instance of the multi-node package comes up; this may not be the dependent packages’ primary node. To ensure that dependent failover packages restart on their primary node if the multi-node packages they depend on need to be restarted, make sure the dependent packages’ package switching is not re-enabled before the multi-node packages are restarted.
Table 6-1 Base Modules
Module Name: failover
Parameters (page): package_name (page 264) *, module_name (page 264) *, module_version (page 264) *, package_type (page 265), package_description (page 265) *, node_name (page 265), auto_run (page 265), node_fail_fast_enabled (page 266), run_script_timeout (page 266), halt_script_timeout (page 267), successor_halt_timeout (page 267) *, script_log_file (page 267), operation_sequence (page 268) *, log_level (page 268) *, failover_policy (page 268), failback_policy (pag
Table 6-2 Optional Modules
Module Name: dependency
Parameters (page): dependency_name (page 269) *, dependency_condition (page 270) *, dependency_location (page 271) *
Comments: Add to a base module to create a package that depends on one or more other packages.
Module Name: weight
Parameters (page): weight_name (page 271) *, weight_value (page 271) *
Comments: Add to a base module to create a package that has weight that will be counted against a node's capacity.
Table 6-2 Optional Modules (continued)
Module Name: filesystem
Parameters (page): concurrent_fsck_operations (page 282) (S), concurrent_mount_and_umount_operations (page 282) (S), fs_mount_retry_count (page 283) (S), fs_umount_retry_count (page 283) * (S), fs_name (page 283) * (S), fs_directory (page 284) * (S), fs_type (page 284) (S), fs_mount_opt (page 284) (S), fs_umount_opt (page 284) (S), fs_fsck_opt (page 284) (S)
Comments: Add to a base module to configure filesystem options for the package.
Table 6-2 Optional Modules (continued)
Module Name: multi_node_all
Parameters (page): all parameters that can be used by a multi-node package; includes multi_node, dependency, monitor_subnet, service, resource, volume_group, filesystem, pev, external_pre, external, and acp modules.
Comments: Use if you are creating a multi-node package that requires most or all of the optional parameters that are available for this type of package.
NOTE: For more information, see the comments in the editable configuration file output by the cmmakepkg command, and the cmmakepkg manpage.
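For example, to generate a configuration file that combines the failover base module with the service and filesystem optional modules, a command along these lines could be used (the directory and file names are hypothetical):

   cmmakepkg -m sg/failover -m sg/service -m sg/filesystem $SGCONF/pkg1/pkg1.conf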
package_type The type can be failover, multi_node, or system_multi_node. Default is failover. You can configure only failover or multi-node packages; see “Types of Package: Failover, Multi-Node, System Multi-Node” (page 256). package_description The application that the package runs. This is a descriptive parameter that can be set to any value you choose, up to a maximum of 80 characters. Default value is Serviceguard Package. New for 11.
This is also referred to as package switching, and can be enabled or disabled while the package is running, by means of the cmmodpkg (1m) command. auto_run should be set to yes if the package depends on another package, or is depended on; see “About Package Dependencies” (page 168). For system multi-node packages, auto_run must be set to yes. In the case of a multi-node package, setting auto_run to yes allows an instance to start on a new node joining the cluster; no means it will not.
NOTE: VxVM disk groups are imported at package run time and exported at package halt time. If a package uses a large number of VxVM disks, the timeout value must be large enough to allow all of them to finish the import or export.
NOTE: If no_timeout is specified, and the script hangs, or takes a very long time to complete, during the validation step (cmcheckconf (1m)), cmcheckconf will wait 20 minutes to allow the validation to complete before giving up.
Kept” (page 196) for more information about Serviceguard pathnames.) See also log_level (page 268). operation_sequence Defines the order in which the scripts defined by the package’s component modules will start up. See the package configuration file for details. This parameter is not configurable; do not change the entries in the configuration file. New for modular packages.
failback_policy Specifies what action the package manager should take when a failover package is not running on its primary node (the first node on its node_name list) and the primary node is once again available. Can be set to automatic or manual. The default is manual. • • manual means the package will continue to run on the current (adoptive) node.
IMPORTANT: Restrictions on dependency names in previous Serviceguard releases were less stringent. Packages that specify dependency_names that do not conform to the above rules will continue to run, but if you reconfigure them, you will need to change the dependency_name; cmcheckconf and cmapplyconf will enforce the new rules.
— Both packages must be failover packages whose failover_policy is configured_node. — At least one of the packages must specify a priority (page 269). For more information, see “About Package Dependencies” (page 168). dependency_location Specifies where the dependency_condition must be met. • If dependency_condition is UP, legal values fordependency_location are same_node, any_node, and different_node. — same_node means that the package depended on must be running on the same node.
weight_value is an unsigned floating-point value between 0 and 1000000 with at most three digits after the decimal point. You can use these parameters to override the cluster-wide default package weight that corresponds to a given node capacity. You can define that cluster-wide default package weight by means of the WEIGHT_NAME and WEIGHT_DEFAULT parameters in the cluster configuration file (explicit default).
monitored_subnet_access In cross-subnet configurations, specifies whether each monitored_subnet is accessible on all nodes in the package’s node list (see node_name (page 265)), or only some. Valid values are PARTIAL, meaning that at least one of the nodes has access to the subnet, but not all; and FULL, meaning that all nodes have access to the subnet. The default is FULL, and it is in effect if monitored_subnet_access is not specified.
ip_subnet Specifies an IP subnet used by the package for relocatable addresses; see also ip_address (page 275) and “Stationary and Relocatable IP Addresses ” (page 90). Replaces SUBNET, which is still supported in the package control script for legacy packages. CAUTION: HP recommends that this subnet be configured into the cluster.
If you want the subnet to be monitored, specify it in the monitored_subnet parameter as well. In a cross-subnet configuration, you also need to specify which nodes the subnet is configured on; see ip_subnet_node below. See also monitored_subnet_access (page 273) and “About Cross-Subnet Failover” (page 190). This parameter can be set for failover packages only. Can be added or deleted while the package is running.
The length and formal restrictions for the name are the same as for package_name (page 264). service_name must be unique among all packages in the cluster. IMPORTANT: Restrictions on service names in previous Serviceguard releases were less stringent. Packages that specify services whose names do not conform to the above rules will continue to run, but if you reconfigure them, you will need to change the name; cmcheckconf and cmapplyconf will enforce the new rules.
NOTE: Be careful when defining service run commands. Each run command is executed in the following way: • The cmrunserv command executes the run command. • Serviceguard monitors the process ID (PID) of the process the run command creates. • When the command exits, Serviceguard determines that a failure has occurred and takes appropriate action, which may include transferring the package to an adoptive node.
resource_name The name of a resource to be monitored. resource_name, in conjunction with resource_polling_interval, resource_start and resource_up_value, defines an Event Monitoring Service (EMS) dependency. In legacy packages, RESOURCE_NAME in the package configuration file requires a corresponding DEFERRED_RESOURCE_NAME in the package control script.
concurrent_vgchange_operations Specifies the number of concurrent volume group activations or deactivations allowed during package startup or shutdown. Legal value is any number greater than zero. The default is 1. If a package activates a large number of volume groups, you can improve the package’s start-up and shutdown performance by carefully tuning this parameter.
The default is vgchange -a e. The configuration file contains several other vgchange command variants; either uncomment one of these and comment out the default, or use the default. For more information, see the explanations in the configuration file, “LVM Planning ” (page 131), and “Creating the Storage Infrastructure and Filesystems with LVM, VxVM and CVM” (page 210).
in the configuration file, and uncomment the line
vxvol_cmd "vxvol -g \${DiskGroup} -o bg startall"
This allows package startup to continue while mirror re-synchronization is in progress.
vg Specifies an LVM volume group (one per vg, each on a new line) on which a file system needs to be mounted. A corresponding vgchange_cmd (page 279) specifies how the volume group is to be activated. The package script generates the necessary filesystem commands on the basis of the fs_ parameters (page 283).
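For example, a package that activates two LVM volume groups in exclusive mode might include lines like these (a sketch; the volume group names are placeholders):

vgchange_cmd    "vgchange -a e"
vg              vg01
vg              vg02

Each vg entry names one volume group; the vgchange_cmd shown is the default exclusive-activation command described above.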
Legal value is zero or any greater number. As of A.11.18, the default is 2.
kill_processes_accessing_raw_devices Specifies whether or not to kill processes that are using raw devices (for example, database applications) when the package shuts down. Default is no. See the comments in the package configuration file for more information.

File system parameters
A package can activate one or more storage groups on startup, and mount logical volumes to file systems.
If the package needs to mount and unmount a large number of filesystems, you can improve performance by carefully tuning this parameter during testing (increase it a little at a time and monitor performance each time).
fs_mount_retry_count The number of mount retries for each file system. Legal value is zero or any greater number. The default is zero. If the mount point is busy at package startup and fs_mount_retry_count is set to zero, package startup will fail.
fs_directory The root of the file system specified by fs_name. Replaces FS, which is still supported in the package control script for legacy packages; see “Configuring a Legacy Package” (page 340). See the mount (1m) manpage for more information. fs_type The type of the file system specified by fs_name. This parameter is in the package control script for legacy packages. See the mount (1m) and fstyp (1m) manpages for more information.
external_pre_script The full pathname of an external script to be executed before volume groups and disk groups are activated during package startup, and after they have been deactivated during package shutdown (that is, effectively the first step in package startup and last step in package shutdown). New for modular packages.
user_host The system from which a user specified by user_name can execute package-administration commands. Legal values are any_serviceguard_node, or cluster_member_node, or a specific cluster node. If you specify a specific node it must be the official hostname (the hostname portion, and only the hostname portion, of the fully qualified domain name). As with user_name, be careful to spell the keywords exactly as given.
Generating the Package Configuration File When you have chosen the configuration modules your package needs (see “Choosing Package Modules” (page 256)), you are ready to generate a package configuration file that contains those modules. This file will consist of a base module (usually failover, multi-node or system multi-node) plus the modules that contain the additional parameters you have decided to include.
• To generate a configuration file for a multi-node package that monitors cluster resources (enter the command all on one line):
cmmakepkg -m sg/multi_node -m sg/resource $SGCONF/pkg1/pkg1.conf
• To generate a configuration file for a failover package that runs an application that requires another package to be up (enter the command all on one line):
cmmakepkg -m sg/failover -m sg/dependency -m sg/service $SGCONF/pkg1/pkg1.conf
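After editing the generated file, you would typically verify it and then apply it to the cluster; for example (assuming the file and package name used above):

cmcheckconf -P $SGCONF/pkg1/pkg1.conf
cmapplyconf -P $SGCONF/pkg1/pkg1.conf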
NOTE: cmcheckconf and cmapplyconf check for missing mount points, volume groups, etc.
4. Halt the package.
5. Configure package IP addresses and application services.
6. Run the package and ensure that applications run as expected and that the package fails over correctly when services are disrupted. See “Testing the Package Manager ” (page 365).
• run_script_timeout and halt_script_timeout. Enter the number of seconds Serviceguard should wait for package startup and shutdown, respectively, to complete; or leave the default, no_timeout; see (page 266).
• successor_halt_timeout. Used if other packages depend on this package; see “About Package Dependencies” (page 168).
• script_log_file (page 267).
• log_level (page 268).
• failover_policy (page 268). Enter configured_node or min_package_node. (This parameter can be set for failover packages only.)
• If your package will use relocatable IP addresses, enter the ip_subnet and ip_address addresses. See the parameter descriptions (page 274) for rules and restrictions. In a cross-subnet configuration, configure the additional ip_subnet_node parameter for each ip_subnet as necessary; see “About Cross-Subnet Failover” (page 190) for more information.
See “Creating a Storage Infrastructure with Veritas Cluster File System (CFS)” (page 237).
• If you are using VxVM disk groups without CVM, enter the names of VxVM disk groups that will be imported using vxvm_dg parameters. See “How Control Scripts Manage VxVM Disk Groups” (page 294).
• If you are using mirrored VxVM disks, use (page 280) to specify the mirror recovery option to be used by vxvol.
• Specify the filesystem mount retry and unmount count options (see (page 283)).
In addition to the standard package script, you use the special script that is provided for the database. To set up these scripts, follow the instructions in the README file provided with each toolkit. • Configure the Access Control Policy for up to eight specific users or any_user. The only user role you can configure in the package configuration file is package_admin for the package in question. Cluster-wide roles are defined in the cluster configuration file.
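For example, to give a non-root user permission to administer this package from any cluster node, you might add entries like these (a sketch; the user name is a placeholder):

user_name    pkgadmin1
user_host    cluster_member_node
user_role    package_admin

Up to eight such users (or any_user) can be specified, as noted above.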
NOTE: For modular packages, you now need to distribute any external scripts identified by the external_pre_script and external_script parameters. But if you are accustomed to configuring legacy packages, note that you do not have to create a separate package control script for a modular package, or distribute it manually. (You do still have to do this for legacy packages; see “Configuring a Legacy Package” (page 340).)
This command takes over ownership of all the disks in disk group dg_01, even though the disk currently has a different host ID written on it. The command writes the current node’s host ID on all disks in disk group dg_01 and sets the noautoimport flag for the disks. This flag prevents a disk group from being automatically re-imported by a node following a reboot. If a node in the cluster fails, the host ID is still written on each disk in the disk group.
7 Cluster and Package Maintenance This chapter describes how to see cluster configuration and status information, how to start and halt a cluster or an individual node, how to perform permanent reconfiguration, and how to start, halt, move, and modify packages during routine maintenance of the cluster.
TIP: Some commands take longer to complete in large configurations. In particular, you can expect Serviceguard’s CPU usage to increase during cmviewcl -v as the number of packages and services increases. You can also specify that the output should be formatted as it was in a specific earlier release by using the -r option to specify the release format you want, for example: cmviewcl -r A.11.16 (See the cmviewcl(1m) manpage for the supported release formats.)
Node Status and State
The status of a node is either up (as an active member of the cluster) or down (inactive in the cluster), depending on whether its cluster daemon is running or not. Note that a node might be down from the cluster perspective, but still up and running HP-UX. A node may also be in one of the following states:
• Failed. Active members of the cluster will see a node in this state if that node was active in a cluster, but is no longer, and is not Halted.
• Reforming.
• reconfiguring — The node where this package is running is adjusting the package configuration to reflect the latest changes that have been applied.
• reconfigure_wait — The node where this package is running is waiting to adjust the package configuration to reflect the latest changes that have been applied.
• unknown — Serviceguard could not determine the status at the time cmviewcl was run.
A system multi-node package is up when it is running on all the active cluster nodes.
The following states are possible only for multi-node packages:
• blocked — The package has never run on this node, either because a dependency has not been met, or because auto_run is set to no.
• changing — The package is in a transient state, different from the status shown, on some nodes. For example, a status of starting with a state of changing would mean that the package was starting on at least one node, but in some other, transitory condition (for example, failing) on at least one other node.
Failover and Failback Policies
Failover packages can be configured with one of two values for the failover_policy parameter (page 268), as displayed in the output of cmviewcl -v:
• configured_node. The package fails over to the next node in the node_name list in the package configuration file (page 265).
• min_package_node. The package fails over to the node in the cluster that has the fewest running packages.
NODE           STATUS       STATE
ftsys10        up           running

  Network_Parameters:
  INTERFACE    STATUS       PATH         NAME
  PRIMARY      up           28.1         lan0
  STANDBY      up           32.1         lan1

PACKAGE        STATUS       STATE        AUTO_RUN     NODE
pkg2           up           running      enabled      ftsys10

  Policy_Parameters:
  POLICY_NAME     CONFIGURED_VALUE
  Failover        configured_node
  Failback        manual

  Script_Parameters:
  ITEM        STATUS    MAX_RESTARTS    RESTARTS    NAME
  Service     up        0               0           service2
  Subnet      up        0               0           15.13.168.0

  Node_Switching_Parameters:
  NODE_TYPE    STATUS    SWITCHING    NAME
  Primary      up        enabled      ftsys10
  Alternate    up        enabled      ftsys9
Quorum Server Status:
NAME           STATUS       STATE
lp-qs          up           running

CFS Package Status
If the cluster is using the Veritas Cluster File System (CFS), the system multi-node package SG-CFS-pkg must be running on all active nodes, and the multi-node packages for disk group and mount point must also be running on at least one of their configured nodes. If the cluster is using the Veritas Cluster Volume Manager for raw access to disk groups (that is, without CFS), SG-CFS-pkg must be running on all active nodes.
Status After Halting a Package
After we halt the failover package pkg2 with the cmhaltpkg command, the output of cmviewcl -v is as follows:

CLUSTER        STATUS
example        up

NODE           STATUS       STATE
ftsys9         up           running

  Network_Parameters:
  INTERFACE    STATUS       PATH
  PRIMARY      up           56/36.
  STANDBY      up

  Failback        manual

  Script_Parameters:
  ITEM        STATUS    NODE_NAME    NAME
  Resource    up        ftsys9       /example/float
  Subnet      up        ftsys9       15.13.168.0
  Resource    down      ftsys10      /example/float
  Subnet      up        ftsys10      15.13.168.0

  Node_Switching_Parameters:
  NODE_TYPE    STATUS    SWITCHING    NAME
  Primary      up        enabled      ftsys10
  Alternate    up        enabled      ftsys9

pkg2 now has the status down, and it is shown as unowned, with package switching disabled. Resource /example/float, which is configured as a dependency of pkg2, is down on one node.
  Node_Switching_Parameters:
  NODE_TYPE    STATUS    SWITCHING    NAME
  Primary      up        enabled      ftsys9
  Alternate    up        enabled      ftsys10

PACKAGE        STATUS       STATE        AUTO_RUN              NODE
pkg2           up           running      disabled (current)    ftsys9

  Policy_Parameters:
  POLICY_NAME     CONFIGURED_VALUE
  Failover        configured_node
  Failback        manual

  Script_Parameters:
  ITEM        STATUS    MAX_RESTARTS    RESTARTS    NAME
  Service     up        0               0           service2.1
  Subnet      up                                    15.13.168.0
Both packages are now running on ftsys9 and pkg2 is enabled for switching. ftsys10 is running the cmcld daemon but no packages.
  Node_Switching_Parameters:
  NODE_TYPE    STATUS    SWITCHING    NAME
  Primary      up        enabled      manx
  Alternate    up        enabled      burmese
  Alternate    up        enabled      tabby
  Alternate    up        enabled      persian

Viewing Information about System Multi-Node Packages
The following example shows a cluster that includes system multi-node packages as well as failover packages. The system multi-node packages are running on all nodes in the cluster, whereas the standard packages run on only one node at a time.
tmp/mnt/dev/vx/dsk/
vg_for_cvm1_dd5/lvol1      regular     lvol1     vg_for_cvm_dd5           MOUNTED
/var/opt/sgtest/
tmp/mnt/dev/vx/dsk/
vg_for_cvm1_dd5/lvol4      regular     lvol4     vg_for_cvm_dd5           MOUNTED

Node             :  ftsys8
Cluster Manager  :  up
CVM state        :  up

MOUNT POINT                TYPE        SHARED VOLUME    DISK GROUP               STATUS
/var/opt/sgtest/
tmp/mnt/dev/vx/dsk/
vg_for_cvm1_dd5/lvol1      regular     lvol1            vg_for_cvm_veggie_dd5    MOUNTED
/var/opt/sgtest/
tmp/mnt/dev/vx/dsk/
vg_for_cvm1_dd5/lvol4      regular     lvol4            vg_for_cvm_dd5           MOUNTED

Status of the Packa
Status of CFS Disk Group Packages
To see the status of the disk group, use the cfsdgadm display command. For example, for the disk group logdata, enter:
cfsdgadm display -v logdata
NODE NAME            ACTIVATION MODE
ftsys9               sw (sw)
   MOUNT POINT       SHARED VOLUME       TYPE
ftsys10              sw (sw)
   MOUNT POINT       SHARED VOLUME       TYPE
...
To see which package is monitoring a disk group, use the cfsdgadm show_package command.
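For example (a hedged sketch; the package name shown in the output is assumed, not taken from this manual):

cfsdgadm show_package logdata
SG-CFS-DG-1

The output names the multi-node disk group package (here assumed to be SG-CFS-DG-1) that is monitoring the logdata disk group.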
You can use Serviceguard Manager or the Serviceguard command line to start or stop the cluster, or to add or halt nodes. Starting the cluster means running the cluster daemon on one or more of the nodes in a cluster. You use different Serviceguard commands to start the cluster, depending on whether all nodes are currently down (that is, no cluster daemons are running), or whether you are starting the cluster daemon on an individual node.
cmruncl -v -n ftsys9 -n ftsys10 CAUTION: Serviceguard cannot guarantee data integrity if you try to start a cluster with the cmruncl -n command while a subset of the cluster's nodes are already running a cluster. If the network connection is down between nodes, using cmruncl -n might result in a second cluster forming, and this second cluster might start up the same applications that are already running on the other cluster. The result could be two applications overwriting each other's data on the disks.
NOTE: HP recommends that you remove a node from participation in the cluster (by running cmhaltnode as shown below, or Halt Node in Serviceguard Manager) before running the HP-UX shutdown command, especially in cases in which a packaged application might have trouble during shutdown and not halt cleanly. Use cmhaltnode to halt one or more nodes in a cluster. The cluster daemon on the specified node stops, and the node is removed from active participation in the cluster.
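For example, to remove node ftsys9 from the cluster, halting any packages it is running, and then shut it down for maintenance, you might enter something like this (a sketch; the node name and shutdown options are illustrative):

cmhaltnode -f ftsys9
/usr/sbin/shutdown -hy 0

The -f option halts any failover packages running on the node so that, if switching is enabled, they can start on an adoptive node.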
• Changing package switching behavior
• Maintaining a package using maintenance mode
Non-root users with the appropriate privileges can perform these tasks. See “Controlling Access to the Cluster” (page 227) for information about configuring access. You can use Serviceguard Manager or the Serviceguard command line to perform these tasks.

Starting a Package
Ordinarily, when a cluster starts up, the packages configured as part of the cluster will start up on their configured nodes.
Starting the Special-Purpose CVM and CFS Packages Use CFS administration commands to start the special-purpose multi-node packages used with CFS. For example, to start the special-purpose multi-node package for the disk group package (SG-CFS-DG-id#), use the cfsdgadm command. To start the special-purpose multi-node package for the mount package (SG-CFS-MP-id#) use the cfsmntadm command.
You cannot halt a package unless all packages that depend on it are down. If you try, Serviceguard will take no action, except to send a message indicating that not all dependent packages are down. Before you halt a system multi-node package, or halt all instances of a multi-node package, halt any packages that depend on them.

Moving a Failover Package
You can use Serviceguard Manager, or Serviceguard commands as shown below, to move a failover package from one node to another.
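A typical command sequence (a sketch, assuming a package pkg1 currently running on ftsys9 that you want to move to ftsys10) is:

cmhaltpkg pkg1
cmrunpkg -n ftsys10 pkg1
cmmodpkg -e pkg1

The final cmmodpkg -e re-enables package switching, which is disabled when you halt the package with cmhaltpkg.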
package, use the cmmodpkg command. For example, if pkg1 is currently running, and you want to prevent it from starting up on another node, enter the following: cmmodpkg -d pkg1 This does not halt the package, but will prevent it from starting up elsewhere. You can disable package switching to particular nodes by using the -n option of the cmmodpkg command.
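For example, to prevent pkg1 from starting on node ftsys9 while leaving it eligible to run on other nodes (a sketch with hypothetical names):

cmmodpkg -d -n ftsys9 pkg1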
NOTE: In order to run a package in partial-startup maintenance mode, you must first put it in maintenance mode. This means that packages in partial-startup maintenance mode share the characteristics described below for packages in maintenance mode, and the same rules and dependency rules apply. Additional rules apply to partial-startup maintenance mode, and the procedure involves more steps, as explained under “Performing Maintenance Using Partial-Startup Maintenance Mode”.
Rules for a Package in Maintenance Mode or Partial-Startup Maintenance Mode
IMPORTANT: See the latest Serviceguard release notes for important information about version requirements for package maintenance.
• The package must have package switching disabled before you can put it in maintenance mode.
• You can put a package in maintenance mode only on one node.
— The node must be active in the cluster and must be eligible to run the package (on the package's node_name list).
Dependency Rules for a Package in Maintenance Mode or Partial-Startup Maintenance Mode You cannot configure new dependencies involving a package running in maintenance mode, and in addition the following rules apply (we'll call the package in maintenance mode pkgA). • The packages that depend on pkgA must be down with package switching disabled when you place pkgA in maintenance mode.
NOTE: If you now run cmviewcl, you'll see that the STATUS of pkg1 is up and its STATE is maintenance. 3. If everything is working as expected, take the package out of maintenance mode: cmmodpkg -m off pkg1 Performing Maintenance Using Partial-Startup Maintenance Mode To put a package in partial-startup maintenance mode, you put it in maintenance mode, then restart it, running only those modules that you will not be working on. Procedure Follow this procedure to perform maintenance on a package.
NOTE: You can also use cmhaltpkg -s, which stops the modules started by cmrunpkg -m — in this case, all the modules up to and including package_ip.
6. Run the package to ensure everything is working correctly:
cmrunpkg pkg1
NOTE: The package is still in maintenance mode.
7. If everything is working as expected, bring the package out of maintenance mode:
cmmodpkg -m off pkg1
8.
NOTE: The full execution sequence for starting a package is:
1. The master control script itself
2. External pre-scripts
3. Volume groups
4. File systems
5. Package IPs
6. External scripts
7. Services

Reconfiguring a Cluster
You can reconfigure a cluster either when it is halted or while it is still running. Some operations can only be done when the cluster is halted. Table 7-1 shows the required cluster state for many kinds of changes.
Table 7-1 Types of Changes to the Cluster Configuration (continued)

Delete NICs and their IP addresses, if any, from the cluster configuration
    Cluster can be running. See “Changing the Cluster Networking Configuration while the Cluster Is Running” (page 332). If removing the NIC from the system, see “Removing a LAN or VLAN Interface from a Node” (page 337).

Change the designation of an existing interface
    Cluster can be running.
NOTE: If you are using CVM or CFS, you cannot change MEMBER_TIMEOUT or AUTO_START_TIMEOUT while the cluster is running. This is because they affect the aggregate failover time, which is only reported to the CVM stack on cluster startup. You also cannot change the quorum configuration while SG-CFS-pkg is running.
• cmhaltpkg [-t]
• cmrunpkg [-t] [-n node_name]
• cmmodpkg { -e [-t] | -d } [-n node_name]
• cmruncl -v [-t]
NOTE: You cannot use the -t option with any command operating on a package in maintenance mode; see “Maintaining a Package: Maintenance Mode” (page 318). For more information about these commands, see their respective manpages. You can also perform these preview functions in Serviceguard Manager: check the Preview [...] box for the action in question.
cmeval does not require you to be logged in to the cluster being evaluated, and in fact that cluster does not have to be running, though it must use the same Serviceguard release and patch version as the system on which you run cmeval.
IMPORTANT: For detailed information and examples, see the cmeval (1m) manpage. Updating the Cluster Lock Configuration Use the procedures that follow whenever you need to change the device file names of the cluster lock physical volumes.
Reconfiguring a Halted Cluster
You can make a permanent change in the cluster configuration when the cluster is halted. This procedure must be used for changes marked “Cluster must not be running” in Table 7-1 (page 324), but it can be used for any other cluster configuration changes as well. Use the following steps:
1. Halt the cluster on all nodes, using Serviceguard Manager’s Halt Cluster command, or cmhaltcl on the command line.
1. Use the following command to store a current copy of the existing cluster configuration in a temporary file:
cmgetconf -c cluster1 temp.ascii
2. Specify a new set of nodes to be configured and generate a template of the new configuration. Specify the node name (39 bytes or less) without its full domain name; for example, ftsys8 rather than ftsys8.cup.hp.com. Enter a command such as the following (all on one line):
cmquerycl -C clconfig.ascii -c cluster1 -n ftsys8 -n ftsys9 -n ftsys10
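The procedure typically finishes by verifying the edited file and applying it to all nodes; for example (a sketch assuming the file generated above):

cmcheckconf -C clconfig.ascii
cmapplyconf -C clconfig.ascii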
NOTE: If you want to remove a node from the cluster, run the cmapplyconf command from another node in the same cluster. If you try to issue the command on the node you want removed, you will get an error message. 1. Use the following command to store a current copy of the existing cluster configuration in a temporary file: cmgetconf -c cluster1 temp.ascii 2. Specify the new set of nodes to be configured (omitting ftsys10) and generate a template of the new configuration: cmquerycl -C clconfig.
• Change the designation of an existing interface from HEARTBEAT_IP to STATIONARY_IP, or vice versa.
• Change the NETWORK_POLLING_INTERVAL.
• Change the NETWORK_FAILURE_DETECTION parameter.
• Change the NETWORK_AUTO_FAILBACK parameter.
• Change IP Monitor parameters: SUBNET, IP_MONITOR, POLLING TARGET; see the entries for these parameters under “Cluster Configuration Parameters ” (page 139) for more information.
• You cannot delete a subnet or IP address from a node while a package that uses it (as a monitored_subnet, ip_subnet, or ip_address) is configured to run on that node. See the package networking parameter descriptions (page 272) for more information. • You cannot change the IP configuration of an interface (NIC) used by the cluster in a single transaction (cmapplyconf).
1. Run cmquerycl to get a cluster configuration template file that includes networking information for interfaces that are available to be added to the cluster configuration: cmquerycl -c cluster1 -C clconfig.ascii NOTE: As of Serviceguard A.11.18, cmquerycl -c produces output that includes commented-out entries for interfaces that are not currently part of the cluster configuration, but are available. The networking portion of the resulting clconfig.
4. Apply the changes to the configuration and distribute the new binary configuration file to all cluster nodes: cmapplyconf -C clconfig.ascii If you were configuring the subnet for data instead, and wanted to add it to a package configuration, you would now need to: 1. 2. 3. 4.
4. Verify the new configuration: cmcheckconf -C clconfig.ascii 5. Apply the changes to the configuration and distribute the new binary configuration file to all cluster nodes: cmapplyconf -C clconfig.ascii Removing a LAN or VLAN Interface from a Node You must remove a LAN or VLAN interface from the cluster configuration before removing the interface from the system. On an HP-UX 11i v3 system, you can then remove the interface without shutting down the node.
NOTE: If you are removing a volume group from the cluster configuration, make sure that you also modify any package that activates and deactivates this volume group. In addition, you should use the LVM vgexport command on the removed volume group; do this on each node that will no longer be using the volume group. From the LVM’s cluster, follow these steps: 1. 2. 3. 4. Use the cmgetconf command to store a copy of the cluster's existing cluster configuration in a temporary file.
Similarly, you can delete VxVM or CVM disk groups provided they are not being used by a cluster node at the time. CAUTION: Serviceguard manages the Veritas processes, specifically gab and LLT. This means that you should never use administration commands such as gabconfig, llthosts, and lltconfig to administer a cluster. It is safe to use the read-only variants of these commands, such as gabconfig -a. But a Veritas administrative command could potentially crash nodes or the entire cluster.
Configuring a Legacy Package IMPORTANT: You can still create a new legacy package. If you are using a Serviceguard Toolkit such as Serviceguard NFS Toolkit, consult the documentation for that product. Otherwise, use this section to maintain and rework existing legacy packages rather than to create new ones.
3. Edit each configuration file to specify package name, prioritized list of nodes (with 39 bytes or less in the name), the location of the control script, and failover parameters for each package. Include the data you recorded on the Package Configuration Worksheet.

Configuring a Package in Stages
It is a good idea to configure failover packages in stages, as follows:
1. Configure volume groups and mount points only.
2. Distribute the control script to all nodes.
• FAILBACK_POLICY. For failover packages, enter the failback_policy (page 269).
• NODE_NAME. Enter the node or nodes on which the package can run; see node_name (page 265).
• AUTO_RUN. Configure the package to start up automatically or manually; see auto_run (page 265).
• LOCAL_LAN_FAILOVER_ALLOWED. Enter the policy for local_lan_failover_allowed (page 272).
• NODE_FAIL_FAST_ENABLED. Enter the policy for node_fail_fast_enabled (page 266).
• RUN_SCRIPT and HALT_SCRIPT.
— RESOURCE_UP_VALUE
— RESOURCE_START
For more information, see “Parameters for Configuring EMS Resources” (page 167), and the resource_ parameter descriptions (page 278).
NOTE: For legacy packages, DEFERRED resources must be specified in the package control script.
• ACCESS_CONTROL_POLICY. You can grant a non-root user PACKAGE_ADMIN privileges for this package. See the entries for user_name, user_host, and user_role (page 285), and “Controlling Access to the Cluster” (page 227), for more information.
IMPORTANT: Serviceguard automatically creates the necessary control scripts when you create the multi-node or system multi-node packages for CFS/CVM (version 4.1 and later). HP strongly recommends that you never edit the configuration or control script files for these packages, although Serviceguard does not prevent it. For CFS, create and modify the information using cfs administration commands only. See “Creating a Storage Infrastructure with Veritas Cluster File System (CFS)” (page 237).
• Select the appropriate options for the storage activation command (not applicable for basic VxVM disk groups), and also include options for mounting filesystems, if desired.
• Specify the filesystem mount and unmount retry options.
• If your package uses a large number of volume groups or disk groups or mounts a large number of file systems, consider increasing the number of concurrent vgchange, mount, umount, and fsck operations.
function customer_defined_run_cmds
{
# ADD customer defined run commands.
: # do nothing instruction, because a function must contain some command.
date >> /tmp/pkg1.datelog
echo 'Starting pkg1' >> /tmp/pkg1.datelog
test_return 51
}

# This function is a place holder for customer defined functions.
# You should define all actions you want to happen here, before the service is
# halted.

function customer_defined_halt_cmds
{
# ADD customer defined halt commands.
Verifying the Package Configuration Serviceguard checks the configuration you create and reports any errors. For legacy packages, you can do this in Serviceguard Manager: click Check to verify the package configuration you have done under any package configuration tab, or to check changes you have made to the control script. Click Apply to verify the package as a whole. See the local Help for more details.
Copying Package Control Scripts with HP-UX commands
IMPORTANT: In a cross-subnet configuration, you cannot use the same package control script on all nodes if the package uses relocatable IP addresses. See “Configuring Cross-Subnet Failover” (page 348).
Use HP-UX commands to copy legacy package control scripts from the node where you created the files, to the same pathname on all nodes which can possibly run the package. Use your favorite method of file transfer (e.g., rcp or ftp).
NOTE: You cannot use Serviceguard Manager to configure cross-subnet packages. Suppose that you want to configure a package, pkg1, so that it can fail over among all the nodes in a cluster comprising NodeA, NodeB, NodeC, and NodeD. NodeA and NodeB use subnet 15.244.65.0, which is not used by NodeC and NodeD; and NodeC and NodeD use subnet 15.244.56.0, which is not used by NodeA and NodeB. (See “Obtaining Cross-Subnet Information” (page 224) for sample cmquerycl output).
NOTE: Configuring monitored_subnet_access as FULL (or not configuring monitored_subnet_access) for either of these subnets will cause the package configuration to fail, because neither subnet is available on all the nodes. Creating Subnet-Specific Package Control Scripts Now you need to create control scripts to run the package on the four nodes.
whether the package is running or not; see “Allowable Package States During Reconfiguration ” (page 354). CAUTION: Be extremely cautious about changing a package's configuration while the package is running. If you reconfigure a package online (by executing cmapplyconf on a package while the package itself is running) it is possible that the package will fail, even if the cmapplyconf succeeds, validating the changes with no errors.
1. Halt the package if necessary: cmhaltpkg pkg1 CAUTION: Make sure you read and understand the information and caveats under“Allowable Package States During Reconfiguration ” (page 354) before you decide to reconfigure a running package. 2. If it is not already available, obtain a copy of the package's configuration file by using the cmgetconf command, specifying the package name. cmgetconf -p pkg1 pkg1.conf 3. Edit the package configuration file.
cmapplyconf -P /etc/cmcluster/pkg1/pkg1conf.ascii
If this is a legacy package, remember to copy the control script to the /etc/cmcluster/pkg1 directory on all nodes that can run the package. To create the CFS disk group or mount point multi-node packages on systems that support CFS, see “Creating the Disk Group Cluster Packages” (page 239) and “Creating a File System and Mount Point Package” (page 240).
NOTE: Any form of the mount command other than cfsmount or cfsumount should be used with caution in a CFS environment. Non-CFS commands (for example, mount -o cluster, dbed_chkptmount, or sfrac_chkptmount) could cause conflicts with subsequent operations on the file system or Serviceguard packages, and will not create an appropriate multi-node package, with the result that cluster packages are not aware of file system changes. 1. Remove any dependencies on the package being deleted.
not be running, or in which the results might not be what you expect — as well as differences between modular and legacy packages. CAUTION: Be extremely cautious about changing a package's configuration while the package is running. If you reconfigure a package online (by executing cmapplyconf on a package while the package itself is running) it is possible that the package will fail, even if the cmapplyconf succeeds, validating the changes with no errors.
NOTE: All the nodes in the cluster must be powered up and accessible when you make package configuration changes.

Table 7-2 Types of Changes to Packages

Delete a package or change package name
    Package must not be running.

Change package type
    Package must not be running.

Add or delete a module: modular package
    Package can be running.

Change run script contents: legacy package
    Package can be running, but should not be starting.

Add or remove an ip_address: modular package
    Package can be running.

Add or remove an IP (in control script): legacy package
    Package must not be running. (Also applies to cross-subnet configurations.) See “ip_subnet” (page 274) and “ip_address” (page 275) for important information.

Change RESOURCE_POLLING_INTERVAL, RESOURCE_UP_VALUE: legacy package
    For AUTOMATIC resources, package can be running. For DEFERRED resources, package must not be running. Serviceguard will not allow a change to RESOURCE_UP_VALUE if it would cause the package to fail.

Change RESOURCE_START: legacy package
    Package must not be running.

Add a volume group: modular package
    Package can be running.

Remove a file system: modular package
    Package should not be running.

Remove a file system: legacy package
    Package must not be running.
    CAUTION: Removing a file system may cause problems if the file system cannot be unmounted because it's in use by a running process. In this case Serviceguard kills the process; this could cause the package to fail.

Change
    Package can be running.

Add, change, or delete external scripts and pre-scripts: modular package
    Package can be running. Changes take effect when applied, whether or not the package is running. If you add a script, Serviceguard validates it and then (if there are no errors) runs it when you apply the change. If you delete a script, Serviceguard stops it when you apply the change.

Change vxvol_cmd
    Package must not be running.
NOTE: Check the Serviceguard/SGeRAC/SMS/Serviceguard Manager Plug-in Compatibility and Feature Matrix and the latest Release Notes for your version of Serviceguard for up-to-date information on CVM and CFS support: http://www.docs.hp.com -> High Availability -> Serviceguard. Changes that Will Trigger Warnings Changes to the following will trigger warnings, giving you a chance to cancel, if the change would cause the package to fail.
Single-Node Operation In a multi-node cluster, you could have a situation in which all but one node has failed, or you have shut down all but one node, leaving your cluster in single-node operation. This remaining node will probably have applications running on it. As long as the Serviceguard daemon cmcld is active, other nodes can rejoin the cluster. If the Serviceguard daemon fails when in single-node operation, it will leave the single node up and your applications running.
NOTE: You should not disable Serviceguard on a system on which it is actually running. If you are not sure, you can get an indication by means of the command: ps -e | grep cmclconfd If there are cmclconfd processes running, it does not mean for certain that Serviceguard is running on this system (cmclconfd could simply be handling UDP queries from a Serviceguard cluster on the same subnet) but it does mean you should investigate further before disabling Serviceguard.
8 Troubleshooting Your Cluster This chapter describes how to verify cluster operation, how to review cluster status, how to add and replace hardware, and how to solve some typical cluster problems.
kill PID
3. To view the package status, enter
cmviewcl -v
The package should be running on the specified adoptive node.
4. Move the package back to the primary node (see “Moving a Failover Package ” (page 317)).

Testing the Cluster Manager
To test that the cluster manager is operating correctly, perform the following steps for each node on the cluster:
1. Turn off the power to the node SPU.
3. Verify that a local switch has taken place so that the Standby card is now the Primary card. In Serviceguard Manager, check the cluster properties. On the command line, use cmviewcl -v.
4. Reconnect the LAN to the original Primary card, and verify its status. In Serviceguard Manager, check the cluster properties. On the command line, use cmviewcl -v.
Using EMS (Event Monitoring Service) Hardware Monitors A set of hardware monitors is available for monitoring and reporting on memory, CPU, and many other system values. Some of these monitors are supplied with specific hardware products. Hardware Monitors and Persistence Requests When hardware monitors are disabled using the monconfig tool, associated hardware monitor persistent requests are removed from the persistence files.
Replacing a Faulty Mechanism in an HA Enclosure If you are using software mirroring with Mirrordisk/UX and the mirrored disks are mounted in a high availability disk enclosure, you can use the following steps to hot plug a disk mechanism: 1. Identify the physical volume name of the failed disk and the name of the volume group in which it was configured. In the following example, the volume group name is shown as /dev/vg_sg01 and the physical volume name is shown as /dev/dsk/c2t3d0.
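The remaining steps restore the LVM configuration to the replacement disk and resynchronize the mirrors. A hedged sketch of the usual LVM commands, assuming the replacement mechanism is installed at the same hardware address as the failed one, is:

vgcfgrestore -n /dev/vg_sg01 /dev/rdsk/c2t3d0
vgsync /dev/vg_sg01

vgcfgrestore writes the saved LVM configuration back onto the new disk, and vgsync resynchronizes the mirrored logical volumes in the volume group.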
Replacing a Lock Disk
You can replace an unusable lock disk while the cluster is running. You can do this without any cluster reconfiguration if you do not change the device file name (Device Special File, or DSF); or, if you need to change the DSF, you can do the necessary reconfiguration while the cluster is running.
Special File, or DSF); or, if you need to change the DSF, you can do the necessary reconfiguration while the cluster is running.
Online Hardware Maintenance with In-line SCSI Terminator In some shared SCSI bus configurations, online SCSI disk controller hardware repairs can be made if HP in-line terminator (ILT) cables are used. In-line terminator cables are supported with most SCSI-2 Fast-Wide configurations. In-line terminator cables are supported with Ultra2 SCSI host bus adapters only when used with the SC10 disk enclosure. This is because the SC10 operates at slower SCSI bus speeds, which are safe for the use of ILT cables.
Offline Replacement
Follow these steps to replace an I/O card off-line.
1. Halt the node by using the cmhaltnode command.
2. Shut down the system using /usr/sbin/shutdown, then power down the system.
3. Remove the defective I/O card.
4. Install the new I/O card. The new card must be exactly the same card type, and it must be installed in the same slot as the card you removed.
5. Power up the system.
6. If necessary, add the node back into the cluster by using the cmrunnode command.
This procedure updates the binary file with the new MAC address and thus avoids data inconsistency between the outputs of the cmviewconf and lanscan commands. Replacing a Failed Quorum Server System When a Quorum Server (QS) fails or becomes unavailable to the clusters it is providing quorum services for, this will not cause a failure on any cluster. However, the loss of the quorum server does increase the vulnerability of the clusters in case there is an additional failure.
NOTE: While the old Quorum Server is down and the new one is being set up, these things can happen:
• These three commands will not work:
cmquerycl -q
cmapplyconf -C
cmcheckconf -C
• If there is a node or network failure that creates a 50-50 membership split, the quorum server will not be available as a tie-breaker, and the cluster will fail.
NOTE: Make sure that the old Quorum Server system does not rejoin the network with the old IP address.
lan1*     1500  none            none            418623     0     55822

IPv6:
Name      Mtu   Address/Prefix                  Ipkts      Opkts
lan1*     1500  none                            0          0
lo0       4136  ::1/128                         10690      10690

Reviewing the System Log File
Messages from the Cluster Manager and Package Manager are written to the system log file. The default location of the log file is /var/adm/syslog/syslog.log. Also, package-related messages are logged into the package log file. The package log file is located in the package directory, by default.
Dec 14 14:39:27 star04 cmclconfd[2097]: Command execution message Dec 14 14:39:33 star04 cmcld[2098]: 3 nodes have formed a new cluster Dec 14 14:39:33 star04 cmcld[2098]: The new active cluster membership is: star04(id=1), star05(id=2), star06(id=3) Dec 14 17:39:33 star04 cmlvmd[2099]: Clvmd initialized successfully. Dec 14 14:39:34 star04 cmcld[2098]: Executing '/etc/cmcluster/pkg4/pkg4_run start' for package pkg4.
Information about the starting and halting of each package is found in the package’s control script log. This log provides the history of the operation of the package control script. By default, it is found at /etc/cmcluster/<package_name>/control_script.log; but another location may have been specified in the package configuration file’s script_log_file parameter. This log documents all package run and halt activities.
• linkloop verifies the communication between LAN cards at MAC address levels. For example, if you enter
linkloop -i4 0x08000993AB72
you should see displayed the following message:
Link Connectivity to LAN station: 0x08000993AB72 OK
• cmscancl can be used to verify that primary and standby LANs are on the same bridged net.
• cmviewcl -v shows the status of primary and standby LANs.
Use these commands on all nodes.

Solving Problems
Problems with Serviceguard may be of several types.
Name Server: server1.cup.hp.com Address: 15.13.168.63 Name: ftsys9.cup.hp.com Address: 15.13.172.229 If the output of this command does not include the correct IP address of the node, then check your name resolution services further. In many cases, a symptom such as Permission denied... or Connection refused... is the result of an error in the networking or security configuration. Most such problems can be resolved by correcting the entries in /etc/hosts.
What to do: If this message appears once a month or more often, increase MEMBER_TIMEOUT to more than 10 times the largest reported delay. For example, if the message that reports the largest number says that cmcld was unable to run for the last 1.6 seconds, increase MEMBER_TIMEOUT to more than 16 seconds. 2. This node is at risk of being evicted from the running cluster. Increase MEMBER_TIMEOUT.
• strings /etc/lvmtab - to ensure that the configuration is correct.
• ioscan -fnC disk - to see physical disks.
• diskinfo -v /dev/rdsk/cxtydz - to display information about a disk.
• lssf /dev/d*/* - to check logical volumes and paths.
• vxdg list - to list Veritas disk groups.
• vxprint - to show Veritas disk group details.
package on an alternate node. This might include such things as shutting down application processes, removing lock files, and removing temporary files.
2. Ensure that package IP addresses are removed from the system; use the cmmodnet(1m) command. First determine which package IP addresses are installed by inspecting the output of netstat -in.
cleans up any side effects of the package's run or halt attempt. In this case the package will be automatically restarted on any available alternate node for which it is configured. Problems with Cluster File System (CFS) NOTE: Check the Serviceguard/SGeRAC/SMS/Serviceguard Manager Plug-in Compatibility and Feature Matrix and the latest Release Notes for your version of Serviceguard for up-to-date information about support for CFS (http://www.docs.hp.com -> High Availability -> Serviceguard).
This can happen if a package is running on a node which then fails before the package control script can deport the disk group. In these cases, the host name of the node that had failed is still written on the disk group header. When the package starts up on another node in the cluster, a series of messages is printed in the package log file. Follow the instructions in the messages to use the force import option (-C) to allow the current node to import the disk group.
In the event of a TOC, a system dump is performed on the failed node and numerous messages are also displayed on the console. You can use the following commands to check the status of your network and subnets:
• netstat -in - to display LAN status and check to see if the package IP is stacked on the LAN card.
• lanscan - to see if the LAN is on the primary interface or has switched to the standby interface.
• arp -a - to check the arp tables.
• lanadmin - to display, test, and reset the LAN cards.
A message such as the following in a Serviceguard node’s syslog file indicates that the node did not receive a reply to its lock request on time. This could be because of delay in communication between the node and the Quorum Server or between the Quorum Server and other nodes in the cluster:
Attempt to get lock /sg/cluser1 unsuccessful. Reason: request_timedout

Messages
Serviceguard sometimes sends a request to the Quorum Server to set the lock state.
A Enterprise Cluster Master Toolkit The Enterprise Cluster Master Toolkit (ECMT) provides a group of example scripts and package configuration files for creating Serviceguard packages for several major database and internet software products. Each toolkit contains a README file that explains how to customize the package for your needs. The ECMT can be installed on HP-UX 11i v2, or 11i v3.
B Designing Highly Available Cluster Applications This appendix describes how to create or port applications for high availability, with emphasis on the following topics: • Automating Application Operation • Controlling the Speed of Application Failover (page 393) • Designing Applications to Run on Multiple Systems (page 396) • Using a Relocatable Address as the Source Address for an Application that is Bound to INADDR_ANY (page 401) • Restoring Client Connections (page 403) • Handling Application Failures
There are two principles to keep in mind for automating application relocation: • Insulate users from outages. • Applications must have defined startup and shutdown procedures. You need to be aware of what happens currently when the system your application is running on is rebooted, and whether changes need to be made in the application's response for high availability. Insulate Users from Outages Wherever possible, insulate your end users from outages.
To reduce the impact on users, the application should not simply abort in case of error, since aborting would cause an unneeded failover to a backup system. Applications should determine the exact error and take specific action to recover from the error rather than, for example, aborting upon receipt of any error.
is advisable to take certain actions to minimize the amount of data that will be lost, as explained in the following discussion. Minimize the Use and Amount of Memory-Based Data Any in-memory data (the in-memory context) will be lost when a failure occurs. The application should be designed to minimize the amount of in-memory data that exists unless this data can be easily recalculated.
A common example is a print job. Printer applications typically schedule jobs. When that job completes, the scheduler goes on to the next job.
the second system simply takes over the load of the first system. This eliminates the start up time of the application. There are many ways to design this sort of architecture, and there are also many issues with this sort of design. This discussion will not go into details other than to give a few examples. The simplest method is to have two applications running in a master/slave relationship where the slave is simply a hot standby application for the master.
Avoid Node-Specific Information Typically, when a new system is installed, an IP address must be assigned to each active network interface. This IP address is always associated with the node and is called a stationary IP address. The use of packages containing highly available applications adds the requirement for an additional set of IP addresses, which are assigned to the applications themselves. These are known as relocatable application IP addresses.
Avoid Using SPU IDs or MAC Addresses Design the application so that it does not rely on the SPU ID or MAC (link-level) addresses. The SPU ID is a unique hardware ID contained in non-volatile memory, which cannot be changed. A MAC address (also known as a LANIC id) is a link-specific address associated with the LAN hardware.
be avoided for the same reason. Also, the gethostbyaddr() call may return different answers over time if called with a stationary IP address. Instead, the application should always refer to the application name and relocatable IP address rather than the hostname and stationary IP address. It is appropriate for the application to call gethostbyname(2), specifying the application name rather than the hostname. gethostbyname(2) will pass in the IP address of the application.
Network applications can bind to a stationary IP address, a relocatable IP address, or INADDR_ANY. If the stationary IP address is specified, then the application may fail when restarted on another node, because the stationary IP address is not moved to the new system. If an application binds to the relocatable IP address, then the application will behave correctly when moved to another system. Many server-style applications will bind to INADDR_ANY, meaning that they will receive requests on any interface.
separate mount points. If possible, the application should not assume a specific mount point. To prevent one node from inadvertently accessing disks being used by the application on another node, HA software uses an exclusive access mechanism to enforce access by only one node at a time. This exclusive access applies to a volume group as a whole.
The procedure uses the HP-UX parameter ip_strong_es_model to enable per-interface default gateways. These default gateways are created for secondary network interfaces when you add a relocatable package IP address to the system. When the ip_strong_es_model is set to 1 and the sending socket (or communication endpoint) is bound to INADDR_ANY, IP will send the packet using the interface on which the inbound packet was received.
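On HP-UX, ip_strong_es_model is an ndd tunable. A hedged sketch of enabling it (assuming you want the change to take effect immediately) is:

ndd -set /dev/ip ip_strong_es_model 1

To make the setting persist across reboots, it would normally be added to /etc/rc.config.d/nddconf. Note that this is a system-wide setting that affects routing behavior for all applications on the node.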
allows the IP packets to go through the firewall to reach other organizations on the network. When the package halts, the route must be removed. Put a command such as the following in the customer_defined_halt_commands function of a legacy package, or the stop_command function in the external_script (page 285) for a modular package:
/usr/sbin/route delete net default 128.17.17.1 1 source 128.17.17.
session, or relog in. However, this method is not very automated. For example, a well-tuned hardware and application system may fail over in 5 minutes. But if users, after experiencing no response during the failure, give up after 2 minutes and go for coffee and don't come back for 28 minutes, the perceived downtime is actually 30 minutes, not 5.
Message queueing is useful only when the user does not need or expect a response that the request has been completed (i.e., the application is not interactive).

Handling Application Failures
What happens if part or all of an application fails? All of the preceding sections have assumed the failure in question was not a failure of the application, but of another component of the cluster. This section deals specifically with application problems.
Minimizing Planned Downtime Planned downtime (as opposed to unplanned downtime) is scheduled; examples include backups, systems upgrades to new operating system revisions, or hardware replacements. For planned downtime, application designers should consider: • Reducing the time needed for application upgrades/patches.
some of the application servers are at revision 4.0. The application must be designed to handle this type of situation. For more information about rolling upgrades, see “Software Upgrades ” (page 413), and the Release Notes for your version of Serviceguard at http://docs.hp.com -> High Availability.

Do Not Change the Data Layout Between Releases
Migration of the data to a new format can be very time intensive. It also almost guarantees that rolling upgrade will not be possible.
C Integrating HA Applications with Serviceguard The following is a summary of the steps you should follow to integrate an application into the Serviceguard environment: 1. Read the rest of this book, including the chapters on cluster and package configuration, and the Appendix “Designing Highly Available Cluster Applications.” 2.
NOTE: Check the Serviceguard/SGeRAC/SMS/Serviceguard Manager Plug-in Compatibility and Feature Matrix and the latest Release Notes for your version of Serviceguard for up-to-date information about support for CFS (http:// www.docs.hp.com -> High Availability -> Serviceguard). Checklist for Integrating HA Applications This section contains a checklist for integrating HA applications in both single and multiple systems.
a. Create the cluster configuration.
b. Create a package.
c. Create the package script.
d. Use the simple scripts you created in earlier steps as the customer defined functions in the package control script.
3. Start the cluster and verify that applications run as planned.
4. If you will be building an application that depends on a Veritas Cluster File System (CFS) and Cluster Volume Manager (CVM), then consider the following:
a. Build storage on all nodes of the cluster.
b.
15 seconds to run fsck on the filesystems, 30 seconds to start the application and 3 minutes to recover the database.
D Software Upgrades
There are five types of upgrade you can do:
• rolling upgrade
• rolling upgrade using DRD
• non-rolling upgrade
• non-rolling upgrade using DRD
• migration with cold install
Each of these is discussed below.

Special Considerations for Upgrade to Serviceguard A.11.19
Serviceguard A.11.19 introduces a new cluster manager.
CAUTION: From the time when the old cluster manager is shut down until the new cluster manager forms its first cluster, a node failure will cause the entire cluster to fail. HP strongly recommends that you use no Serviceguard commands other than cmviewcl (1m) until the new cluster manager successfully completes its first cluster re-formation.
upgrade can also be done any time one system needs to be taken offline for hardware maintenance or patch installations. This method is among the least disruptive, but your cluster must meet both general and release-specific requirements. See “Guidelines for Rolling Upgrade” (page 416). Rolling Upgrade Using DRD DRD stands for Dynamic Root Disk. Using a Dynamic Root Disk allows you to perform the update on a clone of the root disk, then halt the node and reboot it from the updated clone root disk.
Non-Rolling Upgrade Using DRD In a non-rolling upgrade with DRD, you clone each node's root disk and apply the upgrade to the clone, then halt the cluster and reboot each node from its updated clone root disk. This method involves much less cluster down time than a conventional non-rolling upgrade, and is particularly safe because the nodes can be quickly rolled back to their original (pre-upgrade) root disks. But you must make sure your cluster is eligible; see “Restrictions for DRD Upgrades” (page 415).
HP-UX/Serviceguard release. See the support matrix at docs.hp.com -> High Availability -> Support Matrixes. Performing a Rolling Upgrade Limitations of Rolling Upgrades The following limitations apply to rolling upgrades: CAUTION: Stricter limitations apply to an upgrade to A.11.19; do not proceed with an upgrade to A.11.19 until you have read and understood the Special Considerations for Upgrade to Serviceguard A.11.19 (page 413). • • • During a rolling upgrade to a release other than A.11.
Before You Start Make sure you plan sufficient system capacity to allow moving the packages from node to node during the process without an unacceptable loss of performance. CAUTION: Do not proceed with an upgrade to A.11.19 until you have read and understood the Special Considerations for Upgrade to Serviceguard A.11.19 (page 413). Running the Rolling Upgrade 1. Halt the node you want to upgrade. You can do this in Serviceguard Manager, or use the cmhaltnode command. 2.
-> Run Node... Or, on the Serviceguard command line, issue the cmrunnode command. 7. Repeat this process for each node in the cluster. If the cluster fails before the rolling upgrade is complete (because of a catastrophic power failure, for example), you can restart the cluster by entering the cmruncl command from a node which has been upgraded to the latest version of the software.
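As a minimal sketch (node names are placeholders), you could restart the whole cluster, or only the upgraded nodes, from one of the upgraded systems:

# cmruncl                        # start the cluster on all configured nodes
# cmruncl -v -n node1 -n node2   # or start it only on the named (upgraded) nodes

Nodes that are still down or not yet upgraded can rejoin the cluster later with cmrunnode.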
Performing a Rolling Upgrade Using DRD IMPORTANT: All the limitations listed under “Guidelines for Rolling Upgrade” (page 416) and “Limitations of Rolling Upgrades ” (page 417) also apply to a rolling upgrade with DRD. You should read the entire section on “Performing a Rolling Upgrade” (page 417) before you proceed. Before You Start CAUTION: Do not proceed with an upgrade to A.11.19 until you have read and understood the Special Considerations for Upgrade to Serviceguard A.11.19 (page 413).
AUTOSTART_CMCLD = 1 7. Restart the cluster on the upgraded node, using Serviceguard Manager or cmrunnode (1m). 8. Move the packages back to the upgraded node. 9. Verify that the applications are functioning properly. • If the applications do not function properly and this is not the last node to be upgraded, you can revert to the previous release on this node.
Figure D-1 Running Cluster Before Rolling Upgrade Step 1. Halt the first node, as follows # cmhaltnode -f node1 This will cause pkg1 to be halted cleanly and moved to node 2. The Serviceguard daemon on node 1 is halted, and the result is shown in Figure D-2. Figure D-2 Running Cluster with Packages Moved to Node 2 Step 2. Upgrade node 1 to the next operating system release (“HP-UX (new)”), and install the next version of Serviceguard (“SG (new)”).
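The commands for Step 2 depend on your upgrade path and media; purely as a hedged sketch, an OS update followed by a Serviceguard installation from a software depot might look like the following, where the media path, depot path, operating environment name, and bundle name are placeholders rather than values from this manual:

# update-ux -s /dvdrom HPUX11i-OE                # update the operating system from install media
# swinstall -s /var/depots/sg_depot SG_BUNDLE    # install the new Serviceguard bundle from a depot

Consult the HP-UX and Serviceguard installation documentation for the exact products and options for your release.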
Figure D-3 Node 1 Upgraded to new HP-UX version Step 3. When upgrading is finished, enter the following command on node 1 to restart the cluster on node 1: # cmrunnode node1 At this point, different versions of the Serviceguard daemon (cmcld) are running on the two nodes, as shown in Figure D-4. Figure D-4 Node 1 Rejoining the Cluster Step 4. Repeat the process on node 2. Halt the node, as follows: # cmhaltnode -f node2 This causes both packages to move to node 1.
Figure D-5 Running Cluster with Packages Moved to Node 1 Step 5. Move pkg2 back to its original node. Use the following commands: cmhaltpkg pkg2 cmrunpkg -n node2 pkg2 cmmodpkg -e pkg2 The cmmodpkg command re-enables switching of the package, which was disabled by the cmhaltpkg command. The final running cluster is shown in Figure D-6.
Guidelines for Non-Rolling Upgrade Do a non-rolling upgrade if: • Your cluster does not meet the requirements for rolling upgrade as specified in the Release Notes for the target version of Serviceguard; or • The limitations imposed by rolling upgrades make it impractical for you to do a rolling upgrade (see “Limitations of Rolling Upgrades ” (page 417)); or • For some other reason you need or prefer to bring the cluster down before performing the upgrade.
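In outline, a non-rolling upgrade halts the entire cluster, upgrades each node, and then restarts the cluster. The following is only a minimal command-level sketch, not a substitute for the detailed procedure in this appendix or in the Release Notes for your target version:

# cmhaltcl -f        # halt the entire cluster, halting any running packages
  (upgrade HP-UX and/or Serviceguard on each node)
# cmruncl            # restart the cluster after all nodes have been upgraded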
Performing a Non-Rolling Upgrade Using DRD Limitations of Non-Rolling Upgrades using DRD CAUTION: Stricter limitations apply to an upgrade to A.11.19; do not proceed with an upgrade to A.11.19 until you have read and understood the Special Considerations for Upgrade to Serviceguard A.11.19 (page 413). IMPORTANT: Not all paths that are supported for upgrade are supported for an upgrade using DRD, and there are additional requirements and restrictions for paths that are supported.
CAUTION: You must reboot all the nodes from their original disks before restarting the cluster; do not try to restart the cluster with some nodes booted from the upgraded disks and some booted from the pre-upgrade disks. Guidelines for Migrating a Cluster with Cold Install There may be circumstances when you prefer to do a cold install of the HP-UX operating system rather than an upgrade.
Matrix, at http://docs.hp.com -> High Availability -> Serviceguard. 6. Recreate any user accounts needed for the cluster applications. 7. Recreate the cluster as described in Chapter 5: “Building an HA Cluster Configuration” (page 195). 8. Restart the cluster. 9. Reinstall the applications. 10. Restore or re-import the data. 11. Recreate and run the cluster packages as described in Chapter 6: “Configuring Packages and Their Services ” (page 255).
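For steps 7 and 8, the chapters cited above give the full procedure; purely as a hedged sketch, with node and file names as placeholders, recreating and restarting a two-node cluster uses commands along these lines:

# cmquerycl -v -C /etc/cmcluster/cluster.ascii -n node1 -n node2   # generate a cluster configuration template
  (edit /etc/cmcluster/cluster.ascii to suit your configuration)
# cmcheckconf -C /etc/cmcluster/cluster.ascii                      # verify the configuration
# cmapplyconf -C /etc/cmcluster/cluster.ascii                      # distribute the binary configuration file
# cmruncl                                                          # start the cluster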
E Blank Planning Worksheets This appendix contains blank versions of the planning worksheets mentioned in Chapter 4, “Planning and Documenting an HA Cluster”. You can duplicate any of these worksheets that you find useful and fill them in as a part of the planning process.
Worksheet for Hardware Planning

HARDWARE WORKSHEET                                             Page ___ of ____
===============================================================================
Node Information:

Host Name _____________________

Series No _____________________

Memory Capacity ____________________

Number of I/O Slots ________________
===============================================================================
LAN Information:

Name of Subnet _________   Name of IP Interface __________   Addr _____________   Traffic Type ___________

Name o
Power Supply Worksheet

POWER SUPPLY WORKSHEET                                         Page ___ of ____
===============================================================================
SPU Power:

Host Name _____________________          Power Supply _______________________

Host Name _____________________          Power Supply _______________________
===============================================================================
Disk Power:

Disk Unit __________________________     Power Supply _______________________

Disk Unit __________________________     Power Supp
Quorum Server Worksheet

Quorum Server Data:
==============================================================================
QS Hostname: _____________   IP Address: _______________   IP Address _______________
==============================================================================
Quorum Services are Provided for:

Cluster Name: ___________________________________________________________

Host Names ____________________________________________

Host Names ____________________________________________

Cluster
LVM Volume Group and Physical Volume Worksheet

PHYSICAL VOLUME WORKSHEET                                      Page ___ of ____
===============================================================================
Volume Group Name: ______________________________________________________

Physical Volume Name: _____________________________________________________

Physical Volume Name: _____________________________________________________

Physical Volume Name: _____________________________________________________

Physical Volume Name: _____________________
VxVM Disk Group and Disk Worksheet

DISK GROUP WORKSHEET                                           Page ___ of ____
===========================================================================
Disk Group Name: __________________________________________________________

Physical Volume Name: ______________________________________________________

Physical Volume Name: ______________________________________________________

Physical Volume Name: ______________________________________________________

Physical Volume Name: _____________________________________
Cluster Configuration Worksheet

===============================================================================
Name and Nodes:
===============================================================================
Cluster Name: __________________________     RAC Version: _______________

Node Names: _________________________________________________________

Volume Groups (for packages): ________________________________________
===========================================================================
Subnets:
=========
Package Configuration Worksheet

Package Configuration File Data:
===============================================================
Package Name: __________________          Package Type: ______________

Primary Node: ____________________        First Failover Node: __________________

Additional Failover Nodes: __________________________________

Run Script Timeout: _____                 Halt Script Timeout: _____________

Package AutoRun Enabled? ______           Local LAN Failover Allowed? _____

Node Failfast Enabled? ________

Failover Policy: _______________
F Migrating from LVM to VxVM Data Storage This appendix describes how to migrate LVM volume groups to VxVM disk groups for use with the Veritas Volume Manager (VxVM), or with the Cluster Volume Manager (CVM) on systems that support it.
3. 4. 5. Back up the volume group’s data, using whatever means are most appropriate for the data contained on this volume group. For example, you might use a backup/restore utility such as Omniback, or you might use an HP-UX utility such as dd. Back up the volume group configuration: vgcfgbackup Define the new VxVM disk groups and logical volumes. You will need to have enough additional disks available to create a VxVM version of all LVM volume groups.
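As a hedged illustration of steps 4 and 5 only, with volume group, disk, and disk group names as placeholders, the commands might resemble the following:

# vgcfgbackup /dev/vg01                    # save the LVM volume group configuration
# vxdisksetup -i c2t3d0                    # initialize a disk for VxVM use
# vxdg init dg01 dg01_disk1=c2t3d0         # create the new VxVM disk group
# vxassist -g dg01 make lvol1 2048m        # create a VxVM volume sized to hold the LVM data

The sizes, disk names, and the number of volumes must match what your data actually requires.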
3. Edit the new script to include the names of the new VxVM disk groups and logical volumes. The new portions of the package control script that are needed for VxVM use are as follows: • The VXVM_DG[] array. This defines the VxVM disk groups that are used for this package. The first VxVM_DG[] entry should be in index 0, the second in 1, etc. For example: VXVM_DG[0]="dg01" VXVM_DG[1]="dg02" • The LV[], FS[] and FS_MOUNT_OPT[] arrays are used the same as they are for LVM.
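As an illustrative sketch only (the disk group, volume, and mount point names are placeholders), the corresponding entries in a legacy package control script might look like this, with the VxVM volume referenced through its /dev/vx/dsk device file:

VXVM_DG[0]="dg01"
LV[0]="/dev/vx/dsk/dg01/lvol1"
FS[0]="/mnt_dg01"
FS_MOUNT_OPT[0]="-o rw"

Each index in LV[], FS[], and FS_MOUNT_OPT[] describes one filesystem, exactly as it does for LVM logical volumes.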
Customizing Packages for CVM NOTE: Check the Serviceguard/SGeRAC/SMS/Serviceguard Manager Plug-in Compatibility and Feature Matrix and the latest Release Notes for your version of Serviceguard for up-to-date information about support for CVM and CFS: http://www.docs.hp.com -> High Availability -> Serviceguard. For instructions on configuring packages to use CVM disk groups for raw storage (that is, without CFS) see “Creating the Storage Infrastructure with Veritas Cluster Volume Manager (CVM)” (page 244).
G IPv6 Network Support This appendix describes some of the characteristics of IPv6 network addresses. Topics: • IPv6 Address Types • Network Configuration Restrictions • Local Primary/Standby LAN Patterns • IPv6 Relocatable Address and Duplicate Address Detection Feature (page 446) IPv6 Address Types Several types of IPv6 addressing schemes are specified in RFC 2373 (IPv6 Addressing Architecture). IPv6 addresses are 128-bit identifiers for interfaces and sets of interfaces.
can appear only once in an address, and it can be used to compress the leading, trailing, or contiguous sixteen-bit groups of zeroes in an address. Example: fec0:1:0:0:0:0:0:1234 can be represented as fec0:1::1234. • When dealing with a mixed environment of IPv4 and IPv6 nodes, an alternative form of IPv6 address can be used. It is written x:x:x:x:x:x:d.d.d.d, where the 'x's are the hexadecimal values of the higher-order 96 bits of the IPv6 address and the 'd's are the decimal values of the lower-order 32 bits.
IPv4 and IPv6 Compatibility There are a number of techniques for using IPv4 addresses within the framework of IPv6 addressing. IPv4 Compatible IPv6 Addresses The IPv6 transition mechanisms use a technique for tunneling IPv6 packets over the existing IPv4 infrastructure. IPv6 nodes that support such mechanisms use a special kind of IPv6 address that carries an IPv4 address in its lower-order 32 bits. These addresses are called IPv4 Compatible IPv6 addresses.
TLA ID = Top-level Aggregation Identifier. RES = Reserved for future use. NLA ID = Next-Level Aggregation Identifier. SLA ID = Site-Level Aggregation Identifier. Interface ID = Interface Identifier. Link-Local Addresses Link-local addresses have the following format: Table G-6: 10 bits = 1111111010; 54 bits = 0; 64 bits = interface ID. Link-local addresses are intended to be used for addressing nodes on a single link. Packets originating from or destined to a link-local address will not be forwarded by a router.
A value of ‘2’ indicates that the scope is link-local; a value of “5” indicates that the scope is site-local. The “group ID” field identifies the multicast group. Some frequently used multicast groups are the following:
All Node Addresses = FF02:0:0:0:0:0:0:1 (link-local)
All Router Addresses = FF02:0:0:0:0:0:0:2 (link-local)
All Router Addresses = FF05:0:0:0:0:0:0:2 (site-local)
Network Configuration Restrictions Serviceguard supports IPv6 for data and heartbeat IP.
NOTE: Even though link-local IP addresses are not supported in the Serviceguard cluster configuration, the primary link-local address on the Serviceguard primary interface will be switched over to the standby during a local switch. This is because of two requirements: First, the dual stack (IPv4/IPv6) kernel requires that the primary IP address associated with an interface must always be a link-local address.
Local Primary/Standby LAN Patterns The use of IPv6 allows a number of different patterns of failover among LAN cards configured in the cluster. This is true because each LAN card can support several IP addresses when a dual IPv4/IPv6 configuration is used. This section describes several ways in which local failover to a standby LAN can be configured.
Following the loss of lan0 or lan2, lan1 can adopt either address, as shown below. The same LAN card can be configured with both IPv4 and IPv6 addresses, as shown below.
This type of configuration allows failover of both addresses to the standby. This is shown below.
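A hedged sketch of how such a dual-stack interface might appear in the cluster configuration file follows; the interface names and addresses are placeholders, and whether a given address is listed as HEARTBEAT_IP or STATIONARY_IP depends on your heartbeat design and HOSTNAME_ADDRESS_FAMILY setting:

NETWORK_INTERFACE lan0
  HEARTBEAT_IP    192.10.25.18
  STATIONARY_IP   2001:db8:10:25::18
NETWORK_INTERFACE lan1        # standby interface for lan0, no IP address configured

On a local switch, both the IPv4 and the IPv6 address can move from lan0 to the standby lan1.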
H Using Serviceguard Manager HP Serviceguard Manager is a web-based HP System Management Homepage (HP SMH) tool that replaces the functionality of the earlier Serviceguard management tools. HP Serviceguard Manager allows you to monitor, administer, and configure a Serviceguard cluster from any system with a supported web browser. Serviceguard Manager does not require additional software installation.
— A user with HP SMH Administrator access has full cluster management capabilities. — A user with HP SMH Operator access can monitor the cluster and has restricted cluster management capabilities as defined by the user’s Serviceguard role-based access configuration. — A user with HP SMH User access does not have any cluster management capabilities. See the online help topic About Security for more information. • Have created the security “bootstrap” file cmclnodelist.
1. Enter the standard URL “http://<hostname>:2301/”. For example: http://clusternode1.cup.hp.com:2301/ 2. When the System Management Homepage login screen appears, enter your login credentials and click Sign In. The System Management Homepage for the selected server appears. 3. From the Serviceguard Cluster box, click the name of the cluster. NOTE: If a cluster is not yet configured, then you will not see the Serviceguard Cluster section on this screen.
Number   What is it?           Description
3        Tab bar               The default Tab bar allows you to view additional cluster-related information. The Tab bar displays different content when you click on a specific node or package.
4        Node information      Displays information about the Node status, alerts and general information.
5        Package information   Displays information about the Package status, alerts and general information.
Figure H-2 Cluster by Type 4. Expand HP Serviceguard, and click on a Serviceguard cluster. NOTE: If you click on a cluster running an earlier Serviceguard release, the page will display a link that will launch Serviceguard Manager A.05.01 (if installed) via Java Webstart.
I Maximum and Minimum Values for Parameters Table I-1 shows the range of possible values for cluster configuration parameters. Table I-1 Minimum and Maximum Values of Cluster Configuration Parameters
Cluster Parameter: Member Timeout
Minimum Value: See MEMBER_TIMEOUT under “Cluster Configuration Parameters” in Chapter 4.
Maximum Value: 14,000,000 microseconds; see MEMBER_TIMEOUT under “Cluster Configuration Parameters” in Chapter 4.
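For reference, MEMBER_TIMEOUT is specified in microseconds in the cluster configuration file. As a simple hedged example (the value shown is the 14,000,000-microsecond figure cited above, that is 14 seconds, not a recommendation for your cluster):

MEMBER_TIMEOUT    14000000

See the MEMBER_TIMEOUT discussion under “Cluster Configuration Parameters” in Chapter 4 for guidance on choosing a value.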
Index A Access Control Policies, 227 Access Control Policy, 161 Access roles, 161 active node, 31 adding a package to a running cluster, 352 adding cluster nodes advance planning, 193 adding nodes to a running cluster, 313 adding packages on a running cluster, 294 additional package resources monitoring, 79 addressing, SCSI, 125 administration adding nodes to a ruuning cluster, 313 cluster and package states, 298 halting a package, 316 halting the entire cluster, 314 moving a package, 317 of packages and se
configuring with commands, 220 redundancy of components, 37 Serviceguard, 30 typical configuration, 29 understanding components, 37 cluster administration, 311 solving problems, 379 cluster and package maintenance, 297 cluster configuration creating with SAM or Commands, 219 file on all nodes, 59 identifying cluster lock volume group, 222 identifying cluster-aware volume groups, 227 planning, 134 planning worksheet, 162 sample diagram, 123 verifying the cluster configuration, 234 cluster configuration file
defined, 158 Configuring clusters with Serviceguard command line, 220 configuring packages and their services, 255 control script adding customer defined functions, 345 in package configuration, 343 pathname parameter in package configuration, 286 support for additional productss, 346 troubleshooting, 377 controlling the speed of application failover, 393 creating the package configuration, 340 Critical Resource Analysis (CRA) LAN or VLAN, 337 customer defined functions adding to the control script, 345 CVM
using, 79 exclusive access relinquishing via TOC, 118 expanding the cluster planning ahead, 122 expansion planning for, 166 F failback policy used by package manager, 75 FAILBACK_POLICY parameter used by package manager, 75 failover controlling the speed in applications, 393 defined, 31 failover behavior in packages, 166 failover package, 68 failover policy used by package manager, 72 FAILOVER_POLICY parameter used by package manager, 72 failure kinds of responses, 116 network communication, 120 package, s
HA cluster defined, 37 objectives in planning, 121 host IP address hardware planning, 124, 131 host name hardware planning, 123 HOSTNAME_ADDRESS_FAMILY defined, 140 discussion and restrictions, 135 how the cluster manager works, 59 how the network manager works, 89 HP Predictive monitoring in troubleshooting, 368 I I/O bus addresses hardware planning, 126 I/O slots hardware planning, 124, 126 I/O subsystem changes as of HP-UX 11i v3, 46, 106 identifying cluster-aware volume groups, 227 in-line terminator p
M MAC addresses, 398 managing the cluster and nodes, 311 manual cluster startup, 61 MAX_CONFIGURED_PACKAGES defined, 160 maximum number of nodes, 37 MEMBER_TIMEOUT and cluster re-formation, 117 and safety timer, 55 configuring, 156 defined, 154 maximum and minimum values , 155 modifying, 227 membership change reasons for, 62 memory capacity hardware planning, 124 memory requirements lockable memory for Serviceguard, 122 minimizing planned down time, 406 mirror copies of data protection against disk failure,
primary, 31 NTP time protocol for clusters, 202 O olrad command removing a LAN or VLAN interface, 337 online hardware maintenance by means of in-line SCSI terminators, 372 OTS/9000 support, 457 outages insulating users from, 392 P package adding and deleting package IP addresses, 91 base modules, 259 basic concepts, 37 changes while the cluster is running, 356 configuring legacy, 340 failure, 116 halting, 316 legacy, 340 local interface switching, 93 modular, 259 modular and legacy, 255 modules, 259 movin
planning worksheets blanks, 429 point of failure in networking, 40 point to point connections to storage devices, 51 POLLING_TARGET defined, 160 ports dual and single aggregated, 104 power planning power sources, 128 worksheet, 129 power supplies blank planning worksheet, 430 power supply and cluster lock, 49 blank planning worksheet, 431 UPS for OPS on HP-UX, 49 Predictive monitoring, 368 primary LAN interfaces defined, 38 primary network interface, 38 primary node, 31 pvcreate creating a root mirror with,
running cluster adding or removing packages, 294 S safety timer and node TOC, 55 and syslog.
time protocol (NTP) for clusters, 202 timeout node, 117 TOC and MEMBER_TIMEOUT, 117 and package availability, 118 and safety timer, 156 and the safety timer, 55 defined, 55 when a node fails, 116 toolkits for databases, 389 traffic type LAN hardware planning, 125 troubleshooting approaches, 375 monitoring hardware, 367 replacing disks, 368 reviewing control scripts, 377 reviewing package IP addresses, 375 reviewing system log file, 376 using cmquerycl and cmcheckconf, 378 troubleshooting your cluster, 365 t