Managing Serviceguard Nineteenth Edition, Reprinted June 2011

8 Troubleshooting Your Cluster
This chapter describes how to verify cluster operation, how to review cluster status, how to add
and replace hardware, and how to solve some typical cluster problems. Topics are as follows:
Testing Cluster Operation
Monitoring Hardware (page 309)
Replacing Disks (page 310)
Replacing I/O Cards (page 313)
Replacing LAN or Fibre Channel Cards (page 314)
Replacing a Failed Quorum Server System (page 315)
Troubleshooting Approaches (page 316)
Solving Problems (page 319)
Testing Cluster Operation
Once you have configured your Serviceguard cluster, you should verify that the various components
of the cluster behave correctly in case of a failure. The procedures in this section test that
the cluster responds properly to a package failure, a node failure, or a LAN failure.
CAUTION: The following procedures deliberately cause various components of the cluster to
fail so that you can confirm that the cluster responds correctly to failure situations. As a
result, the availability of nodes and applications may be disrupted.
Start the Cluster using Serviceguard Manager
If you have just finished configuring your cluster, it starts automatically. If it is halted later, restart
it: from the System Management Homepage (SMH), select the cluster and choose Administration
-> Run Cluster...
Testing the Package Manager
You can test that the package manager is operating correctly. Perform the following procedure for
each package on the cluster:
1. Obtain the PID of a service in the package by entering
ps -ef | grep <service_cmd>
where service_cmd is the executable specified in the package control script with the parameter
SERVICE_CMD. The service selected must not have SERVICE_RESTART specified.
2. To kill the service_cmd process, enter
kill PID
3. To view the package status, enter
cmviewcl -v
Because the service has no SERVICE_RESTART specified, killing it causes the package to fail.
The package should now be running on the specified adoptive node.
4. Move the package back to the primary node (see “Moving a Failover Package” (page 273)).
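Steps 1 through 3 of the procedure above can be scripted. The following is a minimal sketch, not part of Serviceguard: find_service_pids and test_package_service are hypothetical helper names, and the 30-second default wait before checking status is an assumption that you should adjust to your package's failover time. The sketch refuses to kill anything unless exactly one process matches, since a loose match could otherwise hit an unrelated process.

```shell
#!/bin/sh
# Sketch only: find_service_pids and test_package_service are hypothetical
# helper names, not Serviceguard commands.

# Read `ps -ef` output on stdin and print the PID (second column) of every
# line that contains the string given as $1.
find_service_pids() {
    awk -v cmd="$1" 'index($0, cmd) > 0 { print $2 }'
}

# Kill the single process that matches the SERVICE_CMD executable, wait,
# then display the cluster and package status.
#   $1  executable named by SERVICE_CMD in the package control script
#   $2  seconds to wait before checking status (assumed default: 30)
test_package_service() {
    service_cmd=$1
    wait_seconds=${2:-30}

    # Take the process snapshot first so the awk filter is not listed in it.
    snapshot=$(ps -ef)
    pids=$(printf '%s\n' "$snapshot" | find_service_pids "$service_cmd")

    # Split the PID list into positional parameters to count the matches.
    set -- $pids
    if [ "$#" -ne 1 ]; then
        echo "expected exactly one process matching $service_cmd, found $#" >&2
        return 1
    fi

    kill "$1" || return 1
    sleep "$wait_seconds"
    cmviewcl -v
}
```

For example, calling test_package_service /usr/local/bin/app_monitor 60 (a made-up path) kills that service, waits one minute, and prints the cmviewcl -v output so that you can check which node the package is running on. Moving the package back to the primary node (step 4) is still done by hand.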
Testing the Cluster Manager
To test that the cluster manager is operating correctly, perform the following steps for each node
on the cluster: