Managing HP Serviceguard A.12.00.00 for Linux, June 2014

10 Troubleshooting Your Cluster
This chapter describes how to verify cluster operation, how to review cluster status, how to add
and replace hardware, and how to solve some typical cluster problems. Topics are as follows:
Testing Cluster Operation
Monitoring Hardware
Replacing Disks
Replacing LAN Cards
Replacing a Failed Quorum Server System
Troubleshooting Approaches
Solving Problems
Troubleshooting serviceguard-xdc package
10.1 Testing Cluster Operation
Once you have configured your Serviceguard cluster, you should verify that the various components
of the cluster behave correctly in case of a failure. The procedures in this section test that
the cluster responds properly to a package failure, a node failure, or a LAN failure.
CAUTION: The following procedures deliberately cause various components of the cluster to fail
so that you can confirm that the cluster responds correctly to failure situations. As a result,
the availability of nodes and applications may be disrupted.
10.1.1 Testing the Package Manager
To test that the package manager is operating correctly, perform the following procedure for each
package on the cluster:
1. Obtain the PID of a service in the package by entering
ps -ef | grep <service_cmd>
where service_cmd is the executable specified in the package configuration file by means
of the service_cmd parameter (page 193). The service selected must have the default
service_restart value (none).
2. To kill the service_cmd process, enter
kill <PID>
3. To view the package status, enter
cmviewcl -v
The package should be running on the specified adoptive node.
4. Halt the package, then move it back to the primary node using the cmhaltpkg, cmmodpkg,
and cmrunpkg commands:
cmhaltpkg <PackageName>
cmmodpkg -e <PrimaryNode> <PackageName>
cmrunpkg -v <PackageName>
Depending on the specific databases you are running, perform the appropriate database
recovery.
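The steps above can be collected into a small shell sketch. This is an illustration under stated assumptions, not part of Serviceguard: the function names (run, bracket_pattern, kill_service, show_status, move_package_back) and the DRY_RUN variable are hypothetical, and the cm* command lines are copied from the procedure as written. The sketch defaults to a dry run that only prints each command, because a real run disrupts the package. The bracket_pattern helper is an added convenience that keeps the grep process itself out of the ps listing.

```shell
#!/bin/sh
# Sketch of the package manager test procedure (steps 1 to 4).
# Hypothetical helpers: run, bracket_pattern, kill_service, show_status,
# move_package_back. Only ps, grep, kill and the cm* commands come from
# the procedure itself.

# Default to a dry run so that nothing is killed or halted by accident.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Turn "svc" into "[s]vc" so that "ps -ef | grep" does not match the
# grep process itself. Assumes the first character is not "]" or "^".
bracket_pattern() {
    first=$(printf '%s' "$1" | cut -c1)
    rest=$(printf '%s' "$1" | cut -c2-)
    printf '[%s]%s\n' "$first" "$rest"
}

# Steps 1 and 2: look up the PID of the service command and kill it.
kill_service() {
    pattern=$(bracket_pattern "$1")
    # Column 2 of "ps -ef" output is the PID.
    pids=$(ps -ef | grep -- "$pattern" | awk '{print $2}')
    if [ -z "$pids" ]; then
        echo "no process matches $1" >&2
        return 1
    fi
    for pid in $pids; do
        run kill "$pid"
    done
}

# Step 3: view the package status.
show_status() {
    run cmviewcl -v
}

# Step 4: halt the package and move it back to the primary node,
# stopping at the first command that fails.
# Arguments: <PrimaryNode> <PackageName>
move_package_back() {
    run cmhaltpkg "$2" &&
    run cmmodpkg -e "$1" "$2" &&
    run cmrunpkg -v "$2"
}
```

After the file is sourced, calling move_package_back with a node name and a package name prints the three commands from step 4 in order. Setting DRY_RUN=0 makes the same call execute them, which should be done only on a cluster where the disruption is acceptable.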
You can also test the package manager using generic resources. Perform the following procedure
for each package on the cluster: