Specifications
2
Software Architecture
27
Peer communication loss causes failover
If communication between the peer ShMCs breaks down, the first of the previous
communication mechanisms (ShMC-to-ShMC over IPMB) fails. The standby Shelf Manager
then attempts to ping its peer over the LAN interface to confirm its full high availability (HA) state.
If the ping also fails, the standby Shelf Manager effects a failover, assumes the active role,
and sends out an event notifying all event receivers of the failover. This can happen if the
active Shelf Manager was hot-swapped from the shelf improperly or if the local IPMB on the
active Shelf Manager hardware failed, causing it to de-link itself from IPMB-0.
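
The standby-side decision described above can be sketched as a small function. This is only an illustration under stated assumptions: the function name, the two probe callables, and the role and event strings are hypothetical and are not part of the Shelf Manager software. The sketch also assumes that a successful LAN ping simply leaves the standby in its current role.

```python
from typing import Callable, List


def standby_check_peer(ipmb_ok: Callable[[], bool],
                       lan_ping_ok: Callable[[], bool],
                       events: List[str]) -> str:
    """Return the role the standby Shelf Manager holds after the check.

    ipmb_ok and lan_ping_ok are hypothetical probes standing in for the
    ShMC-to-ShMC IPMB link and the LAN ping of the peer.
    """
    if ipmb_ok():
        # First mechanism still works: nothing to do.
        return "standby"
    if lan_ping_ok():
        # IPMB link is down but the peer answers on the LAN (assumed: no failover).
        return "standby"
    # Both mechanisms failed: take over and notify the event receivers.
    events.append("failover")
    return "active"
```

For example, calling `standby_check_peer(lambda: False, lambda: False, log)` returns `"active"` and appends one `"failover"` entry to `log`.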
Watchdog expiration causes failover
If the operating environment that the ShMS is running on fails, the IPMI watchdog in the
ShMC expires, causing it to reset the module hosting the ShMS so that it reboots to a good
state. If this happens in the active Shelf Manager, a failover to the standby Shelf Manager
occurs and the previously active Shelf Manager assumes the standby role after it reboots.
However, if this happens in the standby Shelf Manager, then no failover is required, but a
payload reset to a good working standby mode still happens.
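
The role changes that follow a watchdog expiration in a redundant pair can be sketched as below. The function name, the dictionary layout, and the module names are assumptions made for illustration only; the sketch models just the resulting roles, not the reset itself.

```python
from typing import Dict, Tuple


def on_watchdog_expiry(roles: Dict[str, str],
                       expired: str) -> Tuple[Dict[str, str], bool]:
    """Return (roles after the reboot, whether a failover occurred).

    roles maps a hypothetical module name to "active" or "standby".
    """
    new_roles = dict(roles)
    if roles[expired] != "active":
        # Standby expired: it is reset back to standby, no failover needed.
        return new_roles, False
    for name, role in roles.items():
        if name != expired and role == "standby":
            # Active expired: the standby takes over and the rebooted
            # module comes back in the standby role.
            new_roles[name] = "active"
            new_roles[expired] = "standby"
            return new_roles, True
    # No standby available: roles are left unchanged in this sketch.
    return new_roles, False
```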
Peer communication loss causes failover, non-redundancy
If the peer ShMSs are not able to maintain communication over the LAN interface, they
attempt to synchronize over the IPMB. If that fails as well, the standby ShMS assumes the
active role. This protects against an unlikely condition where the active ShMS has an
unrecoverable failure, but keeps running in a non-functional state. With communication over
the LAN interface lost, the Shelf Managers are no longer considered redundant, and the Shelf
Manager redundancy sensor on page 28 issues an alarm.
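
As a rough summary of this behavior, the outcome for each combination of link states could be tabulated with a function like the following. The type and field names are hypothetical; the sketch encodes only what the paragraph states (LAN loss removes redundancy and raises the alarm, and the standby takes over only if IPMB synchronization fails too).

```python
from typing import NamedTuple


class PeerLinkOutcome(NamedTuple):
    redundant: bool           # Shelf Managers still considered redundant
    alarm: bool               # redundancy sensor issues an alarm
    standby_takes_over: bool  # standby ShMS assumes the active role


def evaluate_peer_links(lan_ok: bool, ipmb_sync_ok: bool) -> PeerLinkOutcome:
    """Map the two link states to the outcome described in the text."""
    if lan_ok:
        return PeerLinkOutcome(redundant=True, alarm=False,
                               standby_takes_over=False)
    # LAN communication lost: redundancy is gone and the alarm is raised.
    # The standby takes over only when IPMB synchronization also fails.
    return PeerLinkOutcome(redundant=False, alarm=True,
                           standby_takes_over=not ipmb_sync_ok)
```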
Watchdog expiration causes reboot of non-redundant SCM
If no standby Shelf Manager is present and the communication between the active ShMS and
ShMC breaks down, the ShMC watchdog still expires, causing a software reboot of the SCM. In
this case, the Shelf Manager still reboots to a good working state and reassumes the active
role. The reboot results in downtime, because there is no ShMS to monitor and respond to
events from shelf components while the Shelf Manager payload is rebooting.
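
The watchdog mechanism and the resulting unmonitored window can be modeled in a few lines. This is a generic watchdog-timer sketch, not the ShMC implementation: the class, the timeout, and the reboot duration are all hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
from typing import Tuple


class Watchdog:
    """Generic watchdog timer: expires if not reset within timeout_s."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_reset = 0.0

    def reset(self, now: float) -> None:
        # Called periodically by healthy software (here: the ShMS).
        self.last_reset = now

    def expired(self, now: float) -> bool:
        return now - self.last_reset >= self.timeout_s


def unmonitored_window(expiry_time: float,
                       reboot_duration_s: float) -> Tuple[float, float]:
    """Interval during which no ShMS handles shelf events (assumed model)."""
    return (expiry_time, expiry_time + reboot_duration_s)
```

With a standby present, the peer would cover this window; without one, events raised inside it have no ShMS to respond to them.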
Initiating a failover manually
You can manually initiate a failover by:
• Using the platform-management CLI.
• Using HPI-Control number 0x1010 in resource 0x02. Refer to the Shelf Manager failover
control section of the SAF Mapping Specification for more information.
• Toggling (opening and then closing) the bottom ejector latch of the active SCM.
• Powering off or power cycling the active SCM using HPI with the
saHpiResourcePowerStateSet() function.
• Extracting the active SCM either manually or using HPI with the
saHpiHotSwapActionRequest() function.
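
The HPI-based options in the list above can be summarized as a lookup from method to HPI call. The sketch below only builds a description of the call and does not invoke HPI. The function name and method keys are hypothetical. `saHpiControlSet()` and the enum names `SAHPI_POWER_OFF`, `SAHPI_POWER_CYCLE`, and `SAHPI_HS_ACTION_EXTRACTION` come from the SAF HPI specification rather than from this document, and are kept as strings. The session ID and the control state value are left out on purpose, because this document does not give them.

```python
from typing import List, Tuple

FAILOVER_CONTROL_RESOURCE = 0x02   # resource ID given in this document
FAILOVER_CONTROL_NUMBER = 0x1010   # HPI control number given in this document


def plan_manual_failover(method: str,
                         active_scm_resource_id: int) -> Tuple[str, List[object]]:
    """Return (HPI function name, key arguments) for a failover method."""
    if method == "control":
        # Control state value: see the SAF Mapping Specification.
        return ("saHpiControlSet",
                [FAILOVER_CONTROL_RESOURCE, FAILOVER_CONTROL_NUMBER])
    if method == "power_off":
        return ("saHpiResourcePowerStateSet",
                [active_scm_resource_id, "SAHPI_POWER_OFF"])
    if method == "power_cycle":
        return ("saHpiResourcePowerStateSet",
                [active_scm_resource_id, "SAHPI_POWER_CYCLE"])
    if method == "extract":
        return ("saHpiHotSwapActionRequest",
                [active_scm_resource_id, "SAHPI_HS_ACTION_EXTRACTION"])
    raise ValueError(f"unknown failover method: {method}")
```

The resource ID of the active SCM is passed in by the caller, since it depends on the shelf and is not specified here.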