How to Fix a CL_STONITH Resource Process Stopped on a Fusion Manager in FMHA
Solution ID: 98
Created: 4/29/2009
DETAILS:
An error condition in FMHA can prevent FMHA failover from working properly.
If you run ibrix_fmha -i from an FM and one of the CL_stonith BMC processes
(a.k.a. "resources") has a status of "Stopped" for the primary or the secondary
FM, take the action outlined below to correct this before expecting FMHA
failover to work properly.
Here is what the error condition looks like in ibrix_fmha -i output:
============
Last updated: Wed Apr 29 17:43:26 2009
Current DC: lab14-59 (9b9599db-f3b7-4f7b-8094-ec3b8efb7cb4)
2 Nodes configured.
3 Resources configured.
============
Node: lab14-59 (9b9599db-f3b7-4f7b-8094-ec3b8efb7cb4): online
Node: lab14-58 (4116f41d-0a43-451c-8e79-837fc8c17ac5): online
Resource Group: FusionManager_group
R_cluster1_10.10.114.60 (heartbeat::ocf:IPaddr): Started lab14-58
R_user_10.10.14.60 (heartbeat::ocf:IPaddr): Started lab14-58
R_FusionManager (heartbeat::ocf:fusionmanager): Started lab14-58
Clone Set: CL_stonithset_lab14-58
CL_stonith_10.10.14.158:0 (stonith:external/ibrix_ipmi): Started lab14-58
CL_stonith_10.10.14.158:1 (stonith:external/ibrix_ipmi): Stopped lab14-59 <---- here is bad status on BMC resource
Clone Set: CL_stonithset_lab14-59
CL_stonith_10.10.14.159:0 (stonith:external/ibrix_ipmi): Started lab14-59
CL_stonith_10.10.14.159:1 (stonith:external/ibrix_ipmi): Started lab14-58
High-Availabilty services are currently ENABLED.
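As a rough aid for spotting this condition, the status listing can be filtered for stopped stonith resources. The sketch below assumes the line layout shown in the sample output above (resource name, type, status, node); the function name find_stopped_stonith is hypothetical and is not part of the IBRIX tools.

```shell
# Hypothetical helper (not an IBRIX command): reads `ibrix_fmha -i` output
# on stdin and prints "RESOURCE NODE" for every CL_stonith_* line whose
# status field is "Stopped". Assumes the column layout of the sample above.
find_stopped_stonith() {
    awk '$1 ~ /^CL_stonith_/ && $3 == "Stopped" { print $1, $4 }'
}
```

It would typically be used as: ibrix_fmha -i | find_stopped_stonith. No output means no stopped CL_stonith resource was found in the listing.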
Resolution:
Procedure to get a stopped stonith resource working again:
1) Verify that the stonith script really works:
stonith -t external/ibrix_ipmi ipminame=IPADDR hostlist=STRING username=TXT password=TXT power_state_on_reset=poweroff -S
2) Reset the resource's failcount.
crm_failcount -U NODENAME -r RESOURCE -G
crm_failcount -U NODENAME -r RESOURCE -D
3) "Clean" the resource.
crm_resource -C -r RESOURCE
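
Steps 2 and 3 can be wrapped in a small shell function so they are always run in the same order. This is a minimal sketch, not a supported tool: the function name reset_stonith_resource and the DRY_RUN variable are hypothetical, and it uses only the crm_failcount and crm_resource options listed in the steps above. The source does not say whether RESOURCE should include the clone instance suffix (for example ":1"), so the value is passed through unchanged.

```shell
# Hypothetical wrapper for steps 2 and 3 (not an IBRIX command).
# Usage: reset_stonith_resource NODENAME RESOURCE
# Set DRY_RUN=1 to print the commands instead of executing them.
reset_stonith_resource() {
    node=${1:-}
    resource=${2:-}
    if [ -z "$node" ] || [ -z "$resource" ]; then
        echo "usage: reset_stonith_resource NODENAME RESOURCE" >&2
        return 1
    fi

    run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi

    # Step 2: show the current failcount, then delete it.
    $run crm_failcount -U "$node" -r "$resource" -G || return 1
    $run crm_failcount -U "$node" -r "$resource" -D || return 1

    # Step 3: "clean" the resource.
    $run crm_resource -C -r "$resource"
}
```

Running it first with DRY_RUN=1 shows exactly which commands would be issued. Afterwards, ibrix_fmha -i can be run again to check whether the resource now reports Started.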