Managing Serviceguard Eighteenth Edition, September 2010

8 Troubleshooting Your Cluster

This chapter describes how to verify cluster operation, how to review cluster status,

how to add and replace hardware, and how to solve some typical cluster problems.

Topics are as follows:

• Testing Cluster Operation

• Monitoring Hardware

• Replacing Disks

• Replacing I/O Cards

• Replacing LAN or Fibre Channel Cards

• Replacing a Failed Quorum Server System

• Troubleshooting Approaches

• Solving Problems

Testing Cluster Operation

Once you have configured your Serviceguard cluster, you should verify that the various

components of the cluster behave correctly in case of a failure. The procedures in this

section test that the cluster responds properly to a package failure, a node failure,

or a LAN failure.

CAUTION: The following procedures deliberately cause various components of the

cluster to fail so that you can confirm that the cluster responds correctly to failure

situations. As a result, the availability of nodes and applications may be disrupted.

Start the Cluster using Serviceguard Manager

If you have just finished configuring your cluster, it starts automatically. If it is halted

later, restart it: from the System Management Homepage (SMH), select the cluster and

choose Administration -> Run Cluster...
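
The restart can also be done from a shell on a cluster node. The following is a minimal sketch rather than the documented procedure: it assumes the Serviceguard commands cmruncl and cmviewcl are installed and on the PATH, and that you are logged in with sufficient privileges. The function name start_cluster is hypothetical and is used here only for illustration.

```shell
#!/bin/sh
# Sketch: start a halted cluster from the command line, then display its status.
# Assumption: cmruncl and cmviewcl (Serviceguard commands) are on the PATH.
# start_cluster is a hypothetical helper name, not part of Serviceguard.

start_cluster() {
    # Stop early with a clear message if the Serviceguard command is missing.
    if ! command -v cmruncl >/dev/null 2>&1; then
        echo "cmruncl not found; is Serviceguard installed on this node?" >&2
        return 1
    fi

    # Start the cluster with verbose output; give up if the start fails.
    cmruncl -v || return 1

    # Show the resulting cluster, node, and package status.
    cmviewcl -v
}
```

Calling start_cluster on a configured node attempts the start and then prints the status, so you can check that the nodes and packages are up before you begin the failure tests.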

Testing the Package Manager

You can test that the package manager is operating correctly. Perform the following

procedure for each package on the cluster:

1. Obtain the PID of a service in the package by entering

ps -ef | grep <service_cmd>

where service_cmd is the executable specified in the package control script with

the parameter SERVICE_CMD. The service selected must not have

SERVICE_RESTART specified.

2. To kill the service_cmd PID, enter
