How to Fix a CL_STONITH Resource Process Stopped on a Fusion Manager in FMHA

Solution ID: 98

Created: 4/29/2009

DETAILS:

An error condition can exist in FMHA that prevents FMHA failover from working properly. If you run ibrix_fmha -i from an FM and one of the CL_stonith BMC processes (a.k.a. "resources") has a status of "Stopped" for the primary or the secondary FM, take the action outlined below to correct this before expecting FMHA to work properly.

Here is what the error condition looks like in ibrix_fmha -i output:

============

Last updated: Wed Apr 29 17:43:26 2009

Current DC: lab14-59 (9b9599db-f3b7-4f7b-8094-ec3b8efb7cb4)

2 Nodes configured.

3 Resources configured.

============

Node: lab14-59 (9b9599db-f3b7-4f7b-8094-ec3b8efb7cb4): online

Node: lab14-58 (4116f41d-0a43-451c-8e79-837fc8c17ac5): online

Resource Group: FusionManager_group

R_cluster1_10.10.114.60 (heartbeat::ocf:IPaddr): Started lab14-58

R_user_10.10.14.60 (heartbeat::ocf:IPaddr): Started lab14-58

R_FusionManager (heartbeat::ocf:fusionmanager): Started lab14-58

Clone Set: CL_stonithset_lab14-58

CL_stonith_10.10.14.158:0 (stonith:external/ibrix_ipmi): Started lab14-58

CL_stonith_10.10.14.158:1 (stonith:external/ibrix_ipmi): Stopped lab14-59 <--- here is the bad status on the BMC resource

Clone Set: CL_stonithset_lab14-59

CL_stonith_10.10.14.159:0 (stonith:external/ibrix_ipmi): Started lab14-59

CL_stonith_10.10.14.159:1 (stonith:external/ibrix_ipmi): Started lab14-58

High-Availabilty services are currently ENABLED.
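
The check described above can be sketched as a small shell filter. This is only an illustration: the function name find_stopped_stonith is hypothetical, and it assumes the status lines keep the column layout shown in the sample output (resource name, agent, state, node).

```shell
#!/bin/sh
# Hypothetical helper, not part of the IBRIX tooling.
# Reads status text on stdin and prints "RESOURCE NODE" for every
# CL_stonith instance whose state column is "Stopped".
# Assumes the column layout shown in the sample output above.
find_stopped_stonith() {
    awk '$1 ~ /^CL_stonith_/ && $3 == "Stopped" { print $1, $4 }'
}

# Example use on an FM (not executed here):
#   ibrix_fmha -i | find_stopped_stonith
```

If the filter prints nothing, no CL_stonith instance was reported as Stopped in the text it was given.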

Resolution:

Procedure to get a stopped stonith resource working again:

1) Verify that the stonith script really works:

stonith -t external/ibrix_ipmi ipminame=IPADDR hostlist=STRING username=TXT password=TXT power_state_on_reset=poweroff -S

2) Reset the resource's failcount.

crm_failcount -U NODENAME -r RESOURCE -G

crm_failcount -U NODENAME -r RESOURCE -D

3) "Clean" the resource.

crm_resource -C -r RESOURCE
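
Steps 2 and 3 above could be wrapped in a small shell function, sketched below. The function name reset_stonith_resource and the RUN variable are hypothetical and only serve this example. By default the function prints the commands so they can be reviewed; setting RUN=1 executes them. Step 1 is left out because it needs the BMC credentials.

```shell
#!/bin/sh
# Hypothetical wrapper around steps 2 and 3 of the procedure.
# Usage: reset_stonith_resource NODENAME RESOURCE
# With RUN=1 the commands are executed; otherwise they are only printed.
reset_stonith_resource() {
    node=$1
    resource=$2
    if [ -z "$node" ] || [ -z "$resource" ]; then
        echo "usage: reset_stonith_resource NODENAME RESOURCE" >&2
        return 2
    fi
    # Query the failcount, delete it, then clean the resource.
    for cmd in \
        "crm_failcount -U $node -r $resource -G" \
        "crm_failcount -U $node -r $resource -D" \
        "crm_resource -C -r $resource"
    do
        if [ "${RUN:-0}" = "1" ]; then
            # Unquoted on purpose so the string splits into command and arguments.
            $cmd || return 1
        else
            echo "$cmd"
        fi
    done
}
```

For example, reset_stonith_resource lab14-59 CL_stonith_10.10.14.158:1 prints the three commands for the stopped instance in the sample output. Whether RESOURCE should include the clone instance suffix (":1") is not stated in this article, so treat that value as an assumption and check it before running with RUN=1.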

Note: You can run "tail -f /usr/local/ibrix/log/stonith_ipmi.log | grep lan" to see when the stonith_ipmi script is triggered. You can also use the linux-ha command crm_mon -i 4, which gives a live, real-time view of the ibrix_fmha -i output. You should see the CL_stonith resource start up, giving you a healthy FMHA configuration again.

Lab Example:

stonith -t external/ibrix_ipmi ipminame=192.168.17.244 hostlist=sc2c03.ibrix.com username=ibrix password=ibrix power_state_on_reset=poweroff -S

crm_failcount -U sc2-c02.