Chapter 15 Serviceguard HP-UX Handbook Revision 13.
Chapter 15 Serviceguard October 29, 2013 TERMS OF USE AND LEGAL RESTRICTIONS FOR THE HP-UX RECOVERY HANDBOOK ATTENTION: PLEASE READ THESE TERMS CAREFULLY BEFORE USING THE HP-UX HANDBOOK. USING THESE MATERIALS INDICATES THAT YOU ACCEPT THESE TERMS. IF YOU DO NOT ACCEPT THESE TERMS, DO NOT USE THE HP-UX HANDBOOK. THE HP-UX HANDBOOK HAS BEEN COMPILED FROM THE NOTES OF HP ENGINEERS AND CONTAINS HP CONFIDENTIAL INFORMATION.
TABLE OF CONTENTS
Introduction
What is Serviceguard?
What is a Serviceguard Cluster
Packages
Supported versions of Serviceguard and associated features
Obtaining Flight Recorder Logs from an HPUX crashdump (SG A.11.15 and later) using q4
Obtaining Flight Recorder Logs from a cmcld core file (SG A.11.
Introduction

When Serviceguard was first marketed, it was titled MC ServiceGuard. Today, marketing has simplified this to just Serviceguard. Any document that uses the older title was probably written last century, in the days of HPUX 10.20 and Serviceguard A.10.06. This chapter introduces High Availability solutions using Serviceguard and explains how to set up, maintain, and troubleshoot them.
What is Serviceguard?

Serviceguard allows you to create high availability clusters of HP Integrity or HP 9000 servers. A high availability cluster keeps critical business applications available in spite of a hardware or software failure. Highly available systems protect users from software failures as well as from failure of a system processing unit (SPU), disk, or local area network (LAN) component. Hardware redundancy is key to high availability.
[Figure 1: Robust Cluster Configuration. The diagram shows node1 and Node2, each with UPS power and a boot disk holding AppA and AppB, connected to the users through redundant network switches 1 and 2 and to the pkgA and pkgB data through redundant SAN switches 1 and 2.]

Figure 1 also illustrates the concept of an active/active cluster.
Packages

A cluster can operate up to 300 packages. A package identifies unique system resources, business applications that need those resources and application monitor services, and the manner in which they are activated or started when the package is started. The package also defines the manner in which to halt those applications and deactivate system resources.
using cfscluster, cfsdgadm, cfsmntadm, cfsmount and cfsumount

Serviceguard monitors node heartbeat transmission, NIC failure and package status by default. Additional monitors can be added to a package. If any of these failover triggers occurs, Serviceguard may switch traffic to a standby NIC, or it may cause a node to TOC and failover-class packages to start on adoptive nodes, so that business services are restored automatically with minimal interruption.
[Figure 2: Continuous operations in spite of multiple network or SAN failures]

Network and SAN path rerouting is automated by Serviceguard standby LAN failover or HPUX multi-pathing so that critical application operations continue unaffected.
[Figure 3: Reduced business outage due to redundant hardware and Serviceguard intervention]

Though a node failure occurred, all other components continue to operate as expected.
The following figure portrays the result of a complete site failure:

[Figure 4: Site failure]

Node failures are detected by Serviceguard (by loss of heartbeat), and cause a cluster reformation and package re-assignment to an adoptiv
Serviceguard uses TCP/IP network services for reliable inter-node communication, including the transmission of heartbeat messages: periodic signals from each functioning node that are central to the operation of the cluster. TCP/IP services are also used for other types of inter-node communication. The network hardware should include redundant LAN interfaces on each node so that redundant cluster heartbeat paths are available, which increases cluster availability.
Quorum Rules:
Nodes that continue to exchange heartbeat messages in a subset of more than 50% of the previous cluster will reform a new cluster. The arbitration device will not be used.
Example: If 2 nodes cannot exchange heartbeats (HB) with a 3rd node, the 2 nodes will reform a cluster and adopt node 3's packages. Node 3 will automatically TOC (see next rule).
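The majority rule above can be sketched as a small shell function. This is only an illustration, not Serviceguard code: the function name quorum_decision is hypothetical, and the handling of the exactly-50% case (arbitration device needed) and of the below-50% case (nodes halt) is an assumption based on the usual Serviceguard quorum behavior, since only the first rule is quoted here.

```shell
# quorum_decision PREVIOUS_NODE_COUNT SURVIVING_NODE_COUNT
# Hypothetical helper: prints what a group of nodes that can still
# exchange heartbeats is expected to do after a failure.
quorum_decision() {
    previous=$1
    surviving=$2
    # Compare 2 * surviving against previous to avoid fractions.
    doubled=$((surviving * 2))
    if [ "$doubled" -gt "$previous" ]; then
        echo "reform"      # more than 50%: reform, arbitration device not used
    elif [ "$doubled" -eq "$previous" ]; then
        echo "arbitrate"   # assumption: exactly 50% needs the arbitration device
    else
        echo "halt"        # assumption: less than 50% cannot reform (node TOCs)
    fi
}

quorum_decision 3 2   # the two nodes from the example above
quorum_decision 3 1   # the isolated third node
```

For the 3-node example, the two-node group prints "reform" and the isolated node prints "halt".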
Serviceguard commands

To understand the explanations in the rest of the manual, it is helpful to have a Serviceguard command reference. The following list of Serviceguard commands is arranged by category and typical order of operation. Most Serviceguard commands can be run from any node in the cluster.
NOTE: Some of the commands (or options) are new with Serviceguard A.11.20.
-P filename     Alternatively, you can supply a list of physical volumes to cmpreparestg in a file specified by filename.
-L lvname       The name of the new logical volume created on a new or existing VG or disk group.
-c lv_counts    The -c option is used to create multiple logical volumes on a LVM VG or a VxVM/CVM disk group.
-m mountpoint   Specify a mount point for a new logical volume, or for multiple new logical volumes specified via the -c option.
[-C cluster_ascii_file] [-q quorum_server [qs_ip2] | -L lock_lun_device|lock_vg:lock_pv] [-n node_name [-L lock_lun_device]]...
cmquerycl [-v] -N network_template_file -n node_name ...
                    Name of the package module files to include in the package configuration file.
-u pkg_ascii_file   Upgrade a package to use the most recent version of the modules.
-l                  If used by itself, lists the available package module files and their versions and brief descriptions of each module.
-f format           Select the output format to display. (line/table)
-p                  Will be obsolete in a future release of Serviceguard.
-s
-t file_name
cmmigratepkg - Migrate Serviceguard legacy Package to a Module Package.
Usage: cmmigratepkg -p [-x externalscript] [-e] [-s] -o
Option meaning:
-p package_name      Name of an existing configured legacy package to convert.
-x external_script   Name of the external script file to create.
-e                   Generate PEVs from non-Serviceguard parameters.
-s                   Comments out service attributes in the output_file.
-t                       Test only.
-m                       Modular packages: Partial package startup ends after the identified module completes.
-e exclude_module_name   Name of the module that should be excluded from the package run sequence.
-d -t          Test only. Show a list of packages that will be detached if the node is halted with the -d option.
node_name...   The name of the node(s) to halt.

cmhaltcl - halt a high availability cluster
Usage: cmhaltcl [-f | -d [ -t]] [-v]
Option meaning:
-f      Force the cluster to shut down even if packages or group members are currently running.
-v      Verbose output will be displayed.
-d -t   Test only.
-i   Required. IPv4_Address or IPv6_address
-d   Disable LAN interface configured in the cluster.
-e   Enable LAN interface configured in the cluster.

cmmodpkg - enable or disable switching attributes for a high availability package
Usage: cmmodpkg {-e[-t]|-d} [-n node_name]... [-v] package_name...
Option meaning:
-h                 Displays the usage, as listed above, and exits.
-v                 Displays the monitor version and exits.
-O log_file        Specifies a file for logging (log messages are printed to the console by default).
-D log_level       Specifies the log level.
-t poll_interval   Specifies the interval between volume probes.
volume_path        Full block device path to at least one VxVM volume or LVM logical volume device file for monitoring. Required.
-f format         Select the output format to display. (line/table)
-l type           Limit the type of data displayed. (group/node/package)
-n node_name      View info only about the specific node_name, including info about the packages that are running on these nodes.
-p package_name   View info only about the specific package_name(s).
-S site_name      View info only about the specific site_name(s), the nodes assigned to the site(s), and relevant packages.
-r release
-s config
-v
cmquerystg - Displays info about the cluster DSFs. (new with A.11.20)
Usage: cmquerystg -f format [[-p path]...] [{[-l]| -d} -n node...]
Option meaning:
-f format               The output format ('line' is the only output format)
-p dsf_path | vg_path   Absolute path name of a device special file or a volume group.
-l                      Limits the output to display only info about the VGs on the specified nodes.
-o output_file   Write configuration info to a specified output file.
-s               Display the configuration info to the screen only.

--- end of command list ---

Preparation to build a cluster

Version of Serviceguard installed

Depending on the information sought, use one of the following methods:
# cmversion
A.11.19.00
# what /usr/lbin/cmcld | grep A.11
A.11.19.00 Date: 03/17/11 Patch: PHSS_41902
# swlist | grep guard ; swlist -l product | grep guard
B5140BA A.11.31.
# cmquerycl
Cluster Name       Node Name
UNUSED             rxh17u07
rxh17u09_cluster   rxh17u09

LVM preparation

When a volume group is created, /etc/lvmtab (or for version 2.X volume groups, /etc/lvmtab_p) is loaded with references to the volume group and related physical devices. Every node that will run a package must have an lvmtab file that will allow activation of a package VG. After creating a volume group on one node, import it into the other adoptive nodes.
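The export and import of a volume group to an adoptive node can be sketched as a dry run that only prints the commands involved. The function name print_vg_import_plan, the map file location under /tmp, and the use of scp and ssh to reach the adoptive node are assumptions made for illustration. The sketch applies to version 1.0 volume groups (group file major number 64), and the minor number passed in must be unused on the adoptive node.

```shell
# print_vg_import_plan VG_NAME GROUP_MINOR ADOPTIVE_NODE
# Hypothetical dry-run helper: prints the LVM commands, nothing is executed.
print_vg_import_plan() {
    vg=$1
    minor=$2
    node=$3
    map=/tmp/$vg.map   # assumed location of the map file
    # On the node where the VG was created: preview export, write map file.
    echo "vgexport -p -s -m $map /dev/$vg"
    # Copy the map file to the adoptive node (scp is an assumption).
    echo "scp $map $node:$map"
    # On the adoptive node: create the group file, then import the VG.
    echo "ssh $node mkdir /dev/$vg"
    echo "ssh $node mknod /dev/$vg/group c 64 $minor"
    echo "ssh $node vgimport -s -m $map /dev/$vg"
}

print_vg_import_plan vgspare 0x010000 Node2
```

Reviewing the printed commands before running them by hand keeps the procedure explicit for each volume group and node.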
6. Make sure that you have deactivated the volume group on Node2. Then enable the volume group on nodeC:
root@nodeC:/ # vgchange -a y /dev/vgspare
7. Create a directory to mount the disk:
root@nodeC:/ # mkdir /spare1
8. Mount and verify the volume group on nodeC:
# mount /dev/vgspare/lvspare1 /spare1
9. Unmount the volume group on nodeC:
# umount /spare1
10.
entry to /etc/nsswitch.conf on node rxh17u07.
Finalizing /etc/nsswitch.conf file on node rxh17u07

cmdeploycl

After cmpreparecl succeeds, use cmdeploycl to quickly build a cluster. Example (building a one-node cluster):
# cmdeploycl -n rxh17u07
Running cmdeploycl on nodes rxh17u07
Saving subcommand output to /var/adm/cmcluster/sgeasy/easy_deployment.
Found 2 volume groups on node Node2
Analysis of 5 volume groups should take approximately 1 seconds
0%----10%----20%----30%----40%----50%----60%----70%----80%----90%----100%
Note: Disks were discovered which are not in use by either LVM or VxVM.
Use pvcreate(1M) to initialize a disk for LVM or,
use vxdiskadm(1M) to initialize a disk for VxVM.
The following section is declared for each node:
NODE_NAME
NETWORK_INTERFACE lan3
HEARTBEAT_IP
NETWORK_INTERFACE lan1

Optional parameters:
FIRST_CLUSTER_LOCK_VG
Or
CLUSTER_LOCK_LUN /dev/dsk/c1t2d3s1 (identified per node)
Or
QS_HOST
QS_ADDR
QS_POLLING_INTERVAL 120000000 <- default value if used
QS_TIMEOUT_EXTENSION 2000000 <- default value if used

The following 2 parameters are specified in each node section:
CAPACITY_NAME spec
If the cluster has already been configured, the cmcheckconf command compares the configuration in the cluster ASCII file against the previous configuration information stored in the binary configuration file and validates the changes. The same rules apply to the package ASCII file.
root@Node1:/ # cmcheckconf -k -v -C /etc/cmcluster/cluster.
root@Node1:/etc/cmcluster# cmcheckconf -k -v -C ./cluster.ascii -P \
./sw/swpkg.conf
Checking cluster file: ./cluster.ascii
Note : a NODE_TIMEOUT value of 2000000 was found in line 134. This value is recommended if the top priority is to reform the cluster as fast as possible in case of failure. If the top priority is to minimize reformations, consider using a higher setting.
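The NODE_TIMEOUT value in the note above is expressed in microseconds (2000000 is 2 seconds). As a minimal sketch, the following helper reads the value from a cluster ASCII file and prints it in seconds. The function name node_timeout_seconds and the sample file contents are made up for illustration; comment lines that begin with # are ignored.

```shell
# node_timeout_seconds CLUSTER_ASCII_FILE
# Hypothetical helper: prints NODE_TIMEOUT converted from microseconds to seconds.
node_timeout_seconds() {
    awk '$1 == "NODE_TIMEOUT" { printf "%g\n", $2 / 1000000 }' "$1"
}

# Illustration with a made-up two-line sample file.
sample=/tmp/cluster_sample.$$
printf 'CLUSTER_NAME cluster1\nNODE_TIMEOUT 2000000\n' > "$sample"
node_timeout_seconds "$sample"
rm -f "$sample"
```

On a cluster node the function would be pointed at the real file, for example ./cluster.ascii from the cmcheckconf run shown above.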
Before distributing the configuration, ensure that your security files permit copying among the cluster nodes. (For security file details see “Preparing Your Systems” in the Managing Serviceguard, Chapter 5 Building an HA Cluster Configuration.
the configuration file. The external script is used to start and stop applications. The Managing Serviceguard manual and a SAW document explain how to implement external scripts.
To create a legacy package, create the package configuration and control templates. File names are not consequential.
# cmmakepkg -p
# cmmakepkg -s
Edit the file.
Waiting for cluster to form .... done
Cluster successfully formed.
Check the syslog files on all nodes in the cluster to verify that no warnings occurred during startup.
When the node forms or joins a cluster, the cluster binary file is read into memory and used to govern cluster operations, and each active node starts a group of Serviceguard daemons.
Node_Switching_Parameters:
NODE_TYPE    STATUS    SWITCHING    NAME
Primary      up        enabled      Node1 (current)
Alternate    up        disabled     Node2

NODE     STATUS    STATE
Node2    up        running

Cluster_Lock_LVM:
VOLUME_GROUP    PHYSICAL_VOLUME    STATUS
/dev/vgspare    /dev/dsk/c2t8d0    down

Network_Parameters:
INTERFACE    STATUS    PATH       NAME
PRIMARY      up        0/1/2/0    lan0
STANDBY      up        0/1/2/1    lan1

For information about the different states of a cluster, a node, or a package, please refer to Managing Serviceguard,
Successfully halted all nodes specified.
Halt operation complete.
If you only want to halt the Serviceguard daemons on a subset of nodes, use the cmhaltnode command instead.

Joining a node to a running cluster

If a node is not running Serviceguard, and its sister nodes are, it can be joined to the cluster using cmrunnode. Example:
root@Node1:/# cmrunnode Node2
cmrunnode: Validating network configuration...
removed from the existing cluster.

Starting Packages

Ordinarily when a cluster starts, the packages will start on their primary configured nodes. After a package has been halted manually, you may need to start it manually using the cmrunpkg command. This command may be run on any node within the cluster and may operate on any package within the cluster. If a node is not specified, the node on which the command is run will be used.
AUTO_RUN status identifies whether a package can run on any adoptive node.
Node_Switching SWITCHING status identifies whether a package is permitted to run on a specific node.
To enable or disable switching attributes, use the cmmodpkg command. The important options for this command are -e (enable) and -d (disable).
NETWORK_INTERFACE YES - even if departing node is gone. Delete ALL Change ALL Add/Delete Change from IPV4 <-> IPv6 or vice versa 8,9 NO - remove/re-add node with new name YES - with qualifications 9 YES 7,8 NO PROVISIONAL - Delete and re-add Change IP HEARTBEAT_IP Redesignate as STATIONARY_IP or vice versa 9 8,9 YES - with qualifications NO YES - with qualifications YES - will trigger warning if the change will cause a pkg to fail.
WEIGHT_DEFAULT                    Change                 9      YES
USER_NAME, USER_HOST, USER_ROLE   Change, add, remove    ALL    YES
VOLUME_GROUP                      Add/Delete             ALL    YES

Updating the arbitration device (cluster lock VG/PV, LUN or Quorum Server)

Use the procedures that follow whenever you need to change the device file names of the cluster lock physical volumes, for example when you are migrating cluster nodes to the agile addressing scheme available as of HP-UX 11i v3.
1.
1. Use the following command to store a current copy of the existing cluster configuration in a temporary file:
root@Node1:/ # cmgetconf -c cluster1 temp.ascii
2. Specify the new set of nodes to be configured (omitting nodeC) and generate a template of the new configuration:
root@Node1:/ # cmquerycl -C cluster.ascii -c cluster1 -n Node1 -n Node2
3. Edit the file cluster.ascii to check the information about the nodes that remain in the cluster.
4.
activates and deactivates this volume group. In addition, you should use the LVM vgexport command on the removed volume group from each node that will no longer be using the volume group.

Using Serviceguard Commands to Change the LVM Configuration While the Cluster is Running

From the LVM’s cluster, follow these steps:
1. Use the cmgetconf command to store a copy of the existing cluster configuration in a temporary file. For example: cmgetconf cluster.
root@Node1:/ # cmhaltpkg sw-pkg
2. If it is not already available, you can obtain a copy of the package's ASCII configuration file by using the cmgetconf command, specifying the package name.
root@Node1:/ # cmgetconf -p sw-pkg sw-pkg.ascii
3. Edit the ASCII package configuration file.
4. Verify your changes as follows:
root@Node1:/ # cmcheckconf -v -P sw-pkg.ascii
5. Distribute your changes to all nodes:
root@Node1:/ # cmapplyconf -v -P sw-pkg.ascii
6.
The following example halts the failover package mypkg and removes the package configuration from the cluster:
root@Node1:/# cmhaltpkg mypkg
root@Node1:/# cmdeleteconf -p mypkg
The command prompts for a verification before deleting the files unless you use the -f option. The directory /etc/cmcluster/mypkg is not deleted by this command.
SUCCESSOR_HALT_TIMEOUT    Change    7-9    YES
SCRIPT_LOG_FILE           Change    7-9    NO - pkg must be halted
FAILOVER_POLICY           Change    ALL    YES
FAILBACK_POLICY           Change    ALL    YES
PRIORITY                  Change    7-9    YES
DEPENDENCY_NAME, DEPENDENCY_CONDITION, DEPENDENCY_LOCATION    Add    7-9    YES, if FAILOVER_POLICY != MIN_PACKAGE_NODE
weight_name               Change    9      YES, requires matching CAPACITY and WEIGHT parameters in the cluster configuration
                                           YES, must not exceed CAPACITY or pkg will halt
SERVICE_NAME, SERVICE_FAIL_FAST_ENABLED, SERVICE_HALT_TIMEOUT    Add/Delete    6-9    NO (legacy pkg format only)
service_name, service_cmd, service_restart, service_fail_fast_enabled, service_halt_timeout    Add/Delete    8,9    YES (modular package format only)
RESOURCE_NAME    Add/Delete    6-8    NO - halt pkg first.
fs_name               Add             8,9    YES
fs_directory          Add             8,9    YES
fs_type               Add             8,9    YES
fs_mount_opt          Add             8,9    YES
fs_umount_opt         Add             8,9    YES
fs_umount_opt         Modify/Remove   8,9    YES
fs_fsck_opt           Add/Delete      8,9    YES
pev_                  Add/Delete      8,9    YES
external_pre_script   Add/Delete      8,9    YES
external_script       Add/Delete      8,9    YES

Deleting or modifying the following parameters may cause the package to halt if the dependent application is still running.
General Troubleshooting Commands

cmcheckconf

cmcheckconf can be used to troubleshoot your cluster just as it was used to verify the configuration. The following example shows the commands used to verify the existing cluster configuration on Node1 and Node2:
# cmquerycl -v -C /etc/cmcluster/verify.ascii -n Node1 -n Node2
# cmcheckconf -v -C /etc/cmcluster/verify.ascii
The cmcheckconf command checks:
The network addresses and connections.
Serviceguard Logs

Serviceguard daemons log their actions to /var/adm/syslog/syslog.log. If nothing is found in this file, check whether syslogd still works as expected ('logger testing syslogd'). Package start and stop actions are logged in the package log, which is typically located in the package directory or in /var/adm/cmcluster/logs/ (modular packages). Modular packages offer the script_log_file parameter to identify a different path/file.
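When syslog.log is large, it can help to reduce it to the lines written by Serviceguard daemons. The following is a minimal sketch: the function name sg_log_lines is hypothetical, the daemon list is limited to daemons named in this chapter (cmcld, cmclconfd, cmlvmd, cmsrvassistd), and the sample log lines are made up to show the assumed "daemon[pid]:" syslog tag format.

```shell
# sg_log_lines SYSLOG_FILE
# Hypothetical helper: prints only lines tagged by a Serviceguard daemon.
sg_log_lines() {
    grep -E 'cm(cld|clconfd|lvmd|srvassistd)\[' "$1"
}

# Illustration with two made-up log lines; only the cmcld line is printed.
sample=/tmp/syslog_sample.$$
printf 'Apr 25 07:59:49 Node1 cmcld[2345]: sample Serviceguard message\n' > "$sample"
printf 'Apr 25 07:59:50 Node1 sshd[999]: sample unrelated message\n' >> "$sample"
sg_log_lines "$sample"
rm -f "$sample"
```

On a cluster node the function would be run against /var/adm/syslog/syslog.log.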
With the advent of a new cluster manager engine in A.11.18, Serviceguard daemons and commands offer different debug logging mechanisms. The following is taken from this WTEC page:
http://teams3.sharepoint.hp.com/teams/esssupport/InsideESSSupport/InsideWTEC/HAProducts/Pages/sg_debug_logging.aspx

Debug logging of cmcld (SG A.11.19 and later)

From SG A.11.19 onwards you can start logging for cmcld by adding the following lines in the /etc/cmcluster.
SDB = Status Database
SEC = Security Service
SES = Sessions
SRV = Service Management
STA = Status Database API
SYN = Synchronization
UNK = Unknown
UTD = Utility Daemon

Serviceguard A.11.18 and older versions use the internal function cl_log() to log messages, warnings, notices and errors. A Serviceguard message is classified by the module that issued the message, by the reason (category) the message is being logged, and by the level of detail of information.
STA = Status database API
DEV = Storage devices
ATS = Shared tape device
QSM = Quorum Devices

The cmsetlog command (obsolete with SG A.11.19)

The cmsetlog command enables users to obtain more verbose output from cmcld. This is extremely useful when a problem needs to be reproduced. cmsetlog lets you set the log level and restrict logging to specific categories and modules. It is used to enable and disable debug logging.
# cmsetlog -C PER -C ERR -C XER -C INT -C EXT -C DTH -C TRC 6

Disable cmcld debug logging with cmsetlog (obsolete with SG A.11.19)

Debug logging is automatically stopped and reset to the default once the cluster is halted. To reset debug logging to the default modules, categories and log level on a running cluster, use the command:
# cmsetlog -r
If the '-f ' option has been used with cmsetlog to redirect logging to another file than syslog.
Note: Do not try to enable cmlvmd debug logging online by changing cmcluster.conf and then sending SIGHUP to the cmlvmd PID. This will cause the node to perform a TOC! Starting with PHSS_35427 for SG A.11.17, cmlvmd ignores the signal SIGHUP.

Debug logging of cmsrvassistd (SG A.11.17 on HPUX and later)

You can start logging for cmsrvassistd by adding the /etc/cmcluster.conf and do a kill -SIGHUP .
Debug logging of cmclconfd on HPUX (up to SG A.11.17)

The '-T' option described above can also be used to instrument cmclconfd. This daemon is used to gather and send configuration data from the local and the remote nodes and is therefore started with many Serviceguard commands. To enable cmclconfd debug logging the following has to be done. Modify the following lines in the file /etc/inetd.
closes frdump.cmcld.9. Format a dump file to make a readable outfile using this example:
$ /usr/contrib/bin/cmfmtfr frdump.cmcld.0 > /tmp/frdump0_formatted
Use this syntax to identify when the flight recorder file was dumped:
$ grep Dumped /tmp/frdump0_formatted
Dumped time: 2011/04/25 07:59:49

Obtaining Flight Recorder Logs from an HPUX crashdump (SG A.11.
-o extracts the SGFR log buffer from the core file and writes it to an SGFR binary file.
2. Run the cmfmtfr command to convert the SGFR binary file, dumpfile, into readable form. The output goes to standard output.
# /usr/contrib/bin/cmfmtfr dumpfr > dumpfr_formatted

Before Logging a Serviceguard Case

Download the sginfo script from ftp://hpcu:Toolbox1@ftp.usa.hp.
Cluster Lock initialization

Serviceguard Command hangs

Serviceguard sends network messages to other nodes when SG commands involve those nodes. Many Serviceguard commands, including cmviewcl, depend on name resolution services to match the IP address of the SG network message to a valid node name. Nsswitch.conf should use /etc/nsswitch.
Heartbeat generation and transmission is delayed by increased kernel activity. The default NODE_TIMEOUT (pre-A.11.19) is often too small. A sign of this is when the 'sequence #' values skyrocket in syslog.log. Increase NODE_TIMEOUT (pre-A.11.19) from the default of 2 seconds to 8 seconds, or add 10 seconds to MEMBER_TIMEOUT (A.11.19 and newer), in the cluster ASCII configuration file and apply the file with cmapplyconf. If the problem persists, look for syslog.
is able to complete before the safety timer expires, then the TOC will not take place. In either case, packages are able to move quickly to another node.
The following may cause cmcld to cease to reset the safety timer:
1. cmcld is not given CPU time to reset the timer (system hang)
2. A crucial package such as SG-CFS-pkg has failed. The admin can identify others by setting failfast=enabled in the cluster binary (via the package configuration file).
You can use the following commands to check the status of your disks:
bdf - to see if your package's volume group is mounted.
vgdisplay - to see if all volumes are present.
lvdisplay -v - to see if the mirrors are synchronized.
strings /etc/lvmtab - to ensure that the configuration is correct.
ioscan -fnC disk - to see physical disks.
diskinfo -v /dev/rdsk/cxtydz - to display information about a disk.
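The mirror check with lvdisplay -v can be sketched as a filter that counts logical extent lines carrying a "stale" status. The function name count_stale_extents and the sample lines are made up for illustration, and the sketch assumes that the extent listing marks each mirror copy as "current" or "stale" in a separate column.

```shell
# count_stale_extents
# Hypothetical helper: reads lvdisplay -v output on stdin and prints the
# number of lines that contain a field equal to "stale".
count_stale_extents() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i == "stale") { n++; break }
    } END { print n + 0 }'
}

# Illustration with two made-up extent lines; the second one is stale.
printf '%s\n' \
    '00000 /dev/dsk/c1t0d0 00000 current /dev/dsk/c2t0d0 00000 current' \
    '00001 /dev/dsk/c1t0d0 00001 current /dev/dsk/c2t0d0 00001 stale' |
    count_stale_extents
```

On a cluster node the input would come from lvdisplay -v piped into the function; a result of 0 suggests the mirrors are synchronized.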
operation_sequence $SGCONF/scripts/sg/service.sh
operation_sequence $SGCONF/scripts/sg/resource.sh

Legacy package control scripts generally operate using the same order of operation. If toolkits are embedded in the package configuration, their entry points will also be listed. Example listing of a SGeSAP modular package configuration file:
operation_sequence $SGCONF/scripts/sg/external_pre.sh
operation_sequence $SGCONF/scripts/sg/volume_group.
package control script.

customer_defined_run_cmds

This is the place where the customer’s HA application is started. Failure analysis in this area should be started from the application side. Commenting out the faulty command may be a good strategy to get a minimum troubleshooting environment running (otherwise the complete package start fails, causing all file systems to be unmounted, etc). In modular packages, this section is handled by the external.
AUTO_RUN (automatic package switching) will be disabled (important!).
The current node will be disabled from running the package.
Following such a failure, since the control script is terminated, some of the package's resources may be left activated. Specifically:
Volume groups may be left active.
File systems may still be mounted.
IP addresses may still be installed.
Services may still be running.
# cmmodpkg -e
If, after cleaning up the node on which the timeout occurred, it is desirable to have that node as an alternate for running the package, remember to re-enable the package to run on the node:
# cmmodpkg -e -n

Package Movement Errors

Packages fail to move to an adoptive node for these reasons:
The initial package halt failed
Package startup occurred on the target node, but something failed during package sta
Power failures: not documented in /etc/shutdownlog! Check MP logs.
In the event of a TOC, a system dump is performed on the failed node and numerous messages are also displayed on the console.
You can use the following commands to check the status of your network and subnets:
netstat -in - to display LAN status and check to see if the package IP is assigned to a NIC.
cmviewcl -v [-n node] - to see if a standby NIC has been invoked.
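The netstat -in check for a package IP can be sketched as a filter that prints the interface carrying a given address. The function name ip_interface, the column layout (address in the fourth column) and the sample addresses from the 192.0.2.0 documentation range are assumptions made for illustration.

```shell
# ip_interface IP_ADDRESS
# Hypothetical helper: reads netstat -in output on stdin and prints the
# name of the interface whose Address column equals IP_ADDRESS.
ip_interface() {
    awk -v ip="$1" '$4 == ip { print $1 }'
}

# Illustration with made-up netstat -in output.
ip_interface 192.0.2.20 <<'EOF'
Name      Mtu   Network     Address      Ipkts Ierrs Opkts Oerrs Coll
lan0      1500  192.0.2.0   192.0.2.10   1000  0     900   0     0
lan0:1    1500  192.0.2.0   192.0.2.20   10    0     9     0     0
EOF
```

In this sample the package address is found on lan0:1. An empty result would mean the address is not assigned to any NIC on the node.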
# linkloop -i 2 0x00108318AFED
Link connectivity to LAN station: 0x00108318AFED
--- OK
Sometimes Serviceguard’s discovery does not match the results achieved with linkloop, because it uses a slightly different method. The HP unsupported tool dlpiping can be used in such cases (more information about dlpiping can be found under subsection “Tools”). The dlpiping tool uses exactly the same communication mechanism that Serviceguard uses.
KERNEL/tools/downloads.htm (HP internal)

dlpiping

http://teams3.sharepoint.hp.com/teams/esssupport/InsideESSSupport/InsideWTEC/HA/Pages/Tools.aspx (HP internal).
Dlpiping is an unsupported program that helps troubleshoot Serviceguard problems where errors such as non-uniform network connections are reported. The program sends messages at the link level to check link-level connectivity.
LVM related Problems

The predominant causes of LVM problems are incorrect content of /etc/lvmtab (or lvmtab_p) on each node, or incorrect activation mode assigned to a given volume group.

Example 1
If a disk is missing from lvmtab on one node, cmcheckconf may fail with:
ERROR: Volume group vgA is configured differently on node Node1 than on node Node2.
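One way to look for a disk that is missing from lvmtab on one node is to save the output of strings /etc/lvmtab from each node into a file and compare the two files. The following is a minimal sketch: the function name pvs_missing_on_second, the temporary file names and the sample entries are made up for illustration. Device file names for the same disk may legitimately differ between nodes, so a difference is a hint to investigate, not proof of an error.

```shell
# pvs_missing_on_second FILE_NODE_A FILE_NODE_B
# Hypothetical helper: prints entries present in the first listing but
# absent from the second. Each file holds `strings /etc/lvmtab` output.
pvs_missing_on_second() {
    a=/tmp/lvmtab_a.$$
    b=/tmp/lvmtab_b.$$
    sort "$1" > "$a"
    sort "$2" > "$b"
    comm -23 "$a" "$b"   # lines unique to the first listing
    rm -f "$a" "$b"
}

# Illustration with made-up listings from Node1 and Node2.
n1=/tmp/lvmtab_node1.$$
n2=/tmp/lvmtab_node2.$$
printf '/dev/vgA\n/dev/dsk/c1t0d0\n/dev/dsk/c2t0d0\n' > "$n1"
printf '/dev/vgA\n/dev/dsk/c1t0d0\n' > "$n2"
pvs_missing_on_second "$n1" "$n2"
rm -f "$n1" "$n2"
```

In this sample the disk /dev/dsk/c2t0d0 is reported as known on Node1 but not on Node2.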
contains a PVID ≠ 0.

Example 6
Error: Unable to recv initialize VG message to %s: %s
Symptom: cmcheckconf/cmapplyconf hangs or takes very long to complete.
When cmcheckconf/cmapplyconf runs, a cmclconfd helper process is launched on each node to gather configuration information. The most likely reasons for hangs or long runtimes are cmclconfd processes being blocked while trying to access disk devices.
In all reality, it is easier to leave the file generic and simply ignore the boot-time error messages than it is to try to remember this file when adding VGs.

Networking related Problems

Networking problems can be divided into 2 classes:
Serviceguard commands fail
Network connectivity fails

Serviceguard commands fail

Serviceguard commands are processed by cmclconfd, which is activated by inetd.
hacl-cfg     5302/tcp   # HA Cluster TCP configuration
hacl-cfg     5302/udp   # HA Cluster UDP configuration
hacl-probe   5303/tcp   # HA Cluster TCP probe
hacl-probe   5303/udp   # HA Cluster UDP probe
hacl-local   5304/tcp   # HA Cluster Commands
hacl-test    5305/tcp   # HA Cluster Test
hacl-dlm     5408/tcp   # HA Cluster distributed lock manager
Check this from every node to every other node of the cluster:
# telnet localhost hacl-cfg
# telnet hacl-cfg
# telnet
Note that linkloop(1M) often cannot catch such problems because its checks differ from Serviceguard's. Refer to the dlpiping tool instead. More information about the tool can be found under Troubleshooting.

• Error: Network interface lanX on node Node1 couldn't talk to itself.
This message is usually a result of a failed LAN interface check, either on link level.
and to check for errors returned by the network drivers. Discussing all possible DLPI error conditions would exceed this document's scope. It is usually best practice to check affected interfaces with tools like linkloop(1M) and lanadmin(1M). In the past many of those problems were tracked down to defective hardware components.
importing and deporting disk groups on particular nodes.

Force Import and Deport After Node Failure

After certain failures, packages configured with VxVM disk groups will fail to start, and the following error will be seen in the package log file:
vxdg: Error gd_01 may still be imported on Node1
ERROR: Function check_dg failed
This can happen if a package is running on a node which then fails before the package control script can deport the disk group.
Note: This force import procedure should only be used when you are certain the disk is not currently being accessed by another node. If you force import a disk that is already being accessed on another node, data corruption can result.

Further Problems

This section describes further well-known Serviceguard problems that do not fit the categories above.

• Sendmail version
Some companies such as AT&T install an unsupported version of sendmail.
# ls -l /dev/*random
cr--r--r-- 1 bin bin 62 0x000000 Nov 19 18:09 /dev/random
cr--r--r-- 1 bin bin 62 0x000001 Nov 19 18:09 /dev/urandom
# lsdev | grep 62
62 -1 rng pseudo

If the driver that lsdev shows for the major number of the /dev/urandom device file is not "rng", Serviceguard will not function correctly. Note that the actual major number can vary, since this number is dynamically allocated.
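The check described above can be sketched as a small POSIX shell helper. This is only an illustration: driver_for_major and is_rng_major are hypothetical names, not HP-UX or Serviceguard tools, and the sketch assumes lsdev prints its columns in the order character major, block major, driver, class, as in the listing in the text. The sample line is copied from that listing so the sketch runs without lsdev.

```shell
#!/usr/bin/sh
# Hypothetical helpers (not part of HP-UX or Serviceguard).

# Read lsdev-style output on stdin and print the driver name listed for
# the given character major number. Assumed column order:
# character-major  block-major  driver  class
driver_for_major() {
    awk -v major="$1" '$1 == major { print $3; exit }'
}

# Succeed only if the driver for the given major number is "rng".
is_rng_major() {
    [ "$(driver_for_major "$1")" = "rng" ]
}

# Sample lsdev line taken from the listing in the text.
sample_lsdev='62 -1 rng pseudo'

if printf '%s\n' "$sample_lsdev" | is_rng_major 62; then
    echo "major 62 is served by the rng driver"
else
    echo "major 62 is NOT served by the rng driver"
fi
```

On a live system the sample line would be replaced by the real output, for example `lsdev | is_rng_major 62`, using whatever major number `ls -l /dev/urandom` reports on that node.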
daemon, /usr/lbin/cmcld [####], died upon receiving signal number 11.
The kernel parameter maxssiz was set too low. Change maxssiz back to its previous setting.

• cmcld: WARNING: Cluster lock on disk /dev/dsk/cXtYdZ is missing. Until it is fixed, a single failure could cause all nodes in the cluster to crash.
This event has been known to be caused by the following:
a.
which missed the heartbeat), voting for a new cluster coordinator, and reforming the cluster (the new cluster is based upon the number of nodes which responded during the reformation. Note that if the node which missed the heartbeat is able to respond during the reformation, the reformation will end up with the same number of nodes in the cluster and your packages will not be affected).

Example: Node Node1 is missing heartbeats from Node2.
(See the section titled Serviceguard TOC)

• Active cmcld aborts with syslog messages like:
cmcld: Aborting!
cmcld: Service Guard Aborting!
cmcld: Aborting Serviceguard Daemon to preserve data integrity.
These messages are logged by cmcld before it actively aborts due to some fatal error condition, which may also be part of the error message. Typically the syslog.log looks similar to this:
Aug 5 11:05:31 Node1 cmcld: Aborting: cl_rwlock.
Serviceguard Command Problems

Sometimes Serviceguard commands fail and log messages indicating that the node is not configured into the cluster, that the binary configuration file is missing, or that some other basic problem exists.
CLUSTER      STATUS
alwayson     up
Failed to get dlm configuration.

The commands mentioned above all rely on the Serviceguard configuration daemon cmclconfd to collect information for them. If a Serviceguard command does not get a reply from the daemon, it logs messages similar to those above. There are numerous reasons why a reply may not be returned to the command. Ruling them out one after another will usually resolve the problem.
Aug 22 14:25:30 nero inetd[980]: hacl-cfg/udp: Added service, server /usr/lbin/cmclconfd
Aug 22 14:25:30 nero inetd[980]: hacl-cfg/tcp: Added service, server /usr/lbin/cmclconfd

netstat -an shows that inetd is listening on hacl-cfg/tcp.
# netstat -an | grep 5302 | grep LISTEN
tcp 0 0 *.5302 *.* LISTEN

6. Make sure that /var/adm/inetd.sec does not deny access for cluster nodes to hacl-cfg ports.
7.
sure to list the cluster IP addresses at the top of the file.

11. For Serviceguard versions SG A.11.16 and later: If there is no /etc/cmcluster/cmclconfig file: Make sure /etc/cmcluster/cmclnodelist contains the IP addresses of all cluster nodes and of all subnets the cluster nodes can potentially talk on.
# what /usr/lbin/identd
/usr/lbin/identd:
$Revision identd 2.7.4 (PHNE_26305) $

If the version is not sufficient, you need to update to a later version of sendmail. Also make sure that you run ARPA patch PHNE_31247 for HPUX 11.11 or later or PHNE_24715 on HPUX 11.00.

14. If the Strong Random Number Generator is installed, make sure you run version B.11.11.07 or later.
# swlist -l bundle | grep KRNG
KRNG11i B.11.11.09 HP-UX 11.
16. Read and adhere to the Special Installation Instructions of the Serviceguard patch you are using.

Cluster Quorum to Prevent Split-Brain Syndrome

(See the section titled “Quorum Rules and Cluster Arbitration Device”)

In general, the algorithm for cluster re-formation requires a cluster quorum of a strict majority (that is, more than 50%) of the nodes previously running.
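The strict-majority rule can be illustrated with a few lines of shell arithmetic. This is a minimal sketch of the rule as stated above, not Serviceguard's actual re-formation code; quorum_state is a hypothetical helper name. The exact 50% case is reported separately because it is the situation in which a tie-breaker (cluster lock or quorum server) is needed.

```shell
#!/usr/bin/sh
# quorum_state RESPONDING PREVIOUS
# Hypothetical helper: classifies a group of RESPONDING nodes against
# the PREVIOUS number of running nodes.
#   majority : more than 50% of the previous nodes (strict majority)
#   tie      : exactly 50%, a tie-breaker is required
#   minority : less than 50%
quorum_state() {
    if [ $(( $1 * 2 )) -gt "$2" ]; then
        echo "majority"
    elif [ $(( $1 * 2 )) -eq "$2" ]; then
        echo "tie"
    else
        echo "minority"
    fi
}

quorum_state 3 4   # 3 of 4 nodes respond
quorum_state 2 4   # 2 of 4 nodes respond
quorum_state 1 4   # 1 of 4 nodes responds
```

Comparing twice the responding count with the previous count avoids fractional arithmetic, which the shell does not provide.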
Use of an LVM Lock Disk as the Cluster Lock

Specifying a Lock Disk

The lock must be accessible to all nodes and must be powered separately from the nodes. To create a lock disk, enter the lock disk information following the cluster name. The lock disk must be in an LVM volume group that is accessible to all the nodes in the cluster.

Lock Disk Operation

When a node obtains the cluster lock, this area is marked so that other nodes will recognize the lock as “taken.
cmquerycl will not print out the re-formation time for a volume group that currently belongs to a cluster. If you want cmquerycl to print the re-formation time for a volume group, run vgchange -c n to clear the cluster ID from the volume group.
…

Backing Up Cluster Lock Disk Information

After you configure the cluster and create the cluster lock volume group and physical volume, you should create a backup of the volume group configuration data on each lock volume group.
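A minimal sketch of such a backup run follows. It assumes the standard HP-UX LVM command vgcfgbackup(1M) is used for the backup, and the volume group names are hypothetical placeholders. The helper only prints the commands (a dry run), so the list can be reviewed before anything is executed.

```shell
#!/usr/bin/sh
# Hypothetical lock volume group names; replace them with the lock
# volume groups named in your cluster configuration file.
LOCK_VGS="/dev/vglock1 /dev/vglock2"

# Dry run: print one vgcfgbackup command per lock volume group.
# print_lock_vg_backups is an illustrative helper, not an HP-UX tool.
print_lock_vg_backups() {
    for vg in "$@"; do
        echo "vgcfgbackup $vg"
    done
}

# Word splitting of $LOCK_VGS is intentional here.
print_lock_vg_backups $LOCK_VGS
```

After review, the printed lines can be run as root on a node where the lock volume groups are configured.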
operation. If you use a different subnet, you may experience network delays which may cause quorum server timeouts. To prevent these timeouts, you can use the QS_TIMEOUT_EXTENSION parameter in the cluster configuration file to increase the quorum server timeout interval. The Quorum Server 3.0 allows cluster nodes to communicate with the QS on an alternate subnet. For more information please refer to: http://docs.hp.com/en/B8467-90041/B8467-90041.
When the command is complete, the prompt appears. Verify the quorum server is running by checking the qs.log file.
root@QS:/ # cat /var/adm/qs/qs.
cmquerycl -q, cmapplyconf -C, and cmcheckconf -C
• If there is a node or network failure that creates a 50-50 membership split, the quorum server will not be available as a tie-breaker, and the cluster will fail.

Types of Volume Managers

Serviceguard allows a choice of volume managers for data storage:
• HP-UX Logical Volume Manager (LVM) and (optionally) Mirrordisk/UX
• VERITAS Volume Manager for HP-UX (VxVM)
• VERITAS Cluster Volume Manager for HP-UX (CVM)

Supp
and/or to divide it into virtual disk volumes that are transparently presented as physical devices used by the operating system and by applications.

VxVM can be used in clusters that:
• are of any size, up to 16 nodes.
• require a fast cluster startup time.
• do not require shared storage group activation. (required with CFS)
• do not have all nodes cabled to all disks. (required with CFS)
• need to use software RAID mirroring or striped mirroring.
More information about SNOR can be found at:
SLVM Online Volume Reconfiguration
http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01914684/c01914684.pdf

Comparison of Volume Managers

The following table summarizes some of the advantages and disadvantages of the volume managers that are currently available.

Product | Advantages | Tradeoffs
Logical Volume Manager (LVM)
Advantages:
• Software is provided with all versions of HP-UX.
B9116DB (11.31 VxVM 5.01.01)
B9116EB (11.31 VxVM 5.10)
Advantages:
• Supports up to 32 plexes per volume
• RAID 1+0 mirrored stripes
• RAID 1 mirroring
• RAID 5
• RAID 0+1 striped mirrors
• Supports multiple heartbeat subnets, which could reduce cluster reformation time.
Tradeoffs:
• Does not support activation on multiple nodes in either shared mode or read-only mode
• May cause delay at package startup time due to lengthy vxdg import

VERITAS Cluster Volume Manager – B9117AA (CVM 3.
To start Serviceguard Manager

• Set HPWS_APACHE_START=1 in /etc/rc.config.d/hpws_apacheconf
• Read the file into the shell:
. /etc/rc.config.d/hpws_apacheconf
• Start web services:
/sbin/init.d/hpws_apache start
• Open a browser on the server network and load: http://:2301
• Select Tools (in the top bar), then locate and click Serviceguard at the bottom of the resulting page.
• Click on Serviceguard Manager.
From there you can do the following tasks:

Monitoring Clusters with Serviceguard Manager

You can see all the clusters the server can reach, or you can list specific clusters. You can also see all the unused nodes on the subnet, that is, all the Serviceguard nodes that are not currently configured in a cluster. Note that SMH has a timeout, unless the session is set to NEVER EXPIRE (upper right corner of the GUI page).
Configuring Clusters with Serviceguard Manager

You can configure clusters and packages. You must have root (UID=0) access to the cluster nodes. Since Serviceguard A.11.17.01, Serviceguard Manager is available in two forms: a standalone utility and the new SMH-based utility.

Old: an independent management application running on an HP-UX, Linux, or Windows system. This utility is not used on current software.
To see “live” clusters, from a management station, connect to a Serviceguard node’s Cluster Object Manager (COM) daemon. (The COM is automatically installed with Serviceguard.) This node becomes the session server. It goes out over its subnets, and establishes connections with the COMs on other Serviceguard nodes.
Example for the URL of the System Management Homepage: http://node1.domain.hp.com:2301

Patch information

Patch Requirements (external Documentation)

To find the required patches for Serviceguard, check the Release Notes of your Serviceguard version on http://www.hp.com/go/hpux-SG-docs, particularly the section named “Patches and Fixes in this Version”.
For required patches for the Serviceguard Storage Management Suite check here: http://www.hp.
selected patch list.

HP-Internal Documentation

Recommended Patches for ACSL Products
http://haweb.ind.hp.com/Support/RecmdPatches.
of the Serviceguard Storage Management Suite bundles, the file /etc/gabtab is automatically configured and maintained by Serviceguard. GAB provides membership and messaging for CVM and the CFS. GAB membership also provides orderly startup and shutdown of the cluster file system.
Activation daemon. (Only present when VERITAS CFS is installed.) | VERITAS Clustered File System product

Serviceguard Related Files

Serviceguard uses a special file, /etc/cmcluster.conf, to define the locations for configuration and log files within the HP-UX filesystem.
Note: Do not edit this /etc/cmcluster.
Figure: Serviceguard file location (directory tree; the hierarchy was lost in extraction). Entries shown, in order: / etc/ opt/ cmcluster.conf inetd.conf lvmrc nsswitch.conf services cmcluster/ rc.config.d/ cmcluster nfs/ toolkit/ sap/ cmcluster/ cmclconfig cmclnodelist cmcluster.conf pkg1/ pkg1.config (legacy) pkg1.cntl (legacy) pkg1.log pkg2/ pkg2.config (modular) pkg2.log (optional loc.
External Technical Documentation

This URL is the starting point for all documentation for High Availability products:
http://www.hp.com/go/hpux-serviceguard-docs

Additionally, here is a list of direct links to the most interesting parts for Serviceguard on HPUX:

Manuals and Release Notes for Serviceguard
http://www.hp.com/go/hpux-SG-docs

Manuals and Release Notes for Serviceguard Extension for Real Application Cluster
http://www.hp.