HP Serviceguard A.11.20- Managing Serviceguard Twentieth Edition, August 2011
application. If the application can be restarted on the same node after a failure (see “Handling
Application Failures” following), the client should keep retrying the current server for the amount
of time it takes to restart the server locally. This keeps the client from having to switch to
the second server in the event of an application failure.
• Use a transaction processing monitor or message queueing software to increase robustness.
Use transaction processing monitors such as Tuxedo or DCE/Encina, which provide an interface
between the server and the client. Transaction processing monitors (TPMs) can be useful in
creating a more highly available application. Transactions can be queued such that the client
does not detect a server failure. Many TPMs provide for the optional automatic rerouting to
alternate servers or for the automatic retry of a transaction. TPMs also help ensure
the reliable completion of transactions, although they are not the only mechanism for doing
this. After the server is back online, the transaction monitor reconnects to the new server and
continues routing transactions to it.
• Queue Up Requests
As an alternative to using a TPM, queue up requests when the server is unavailable. Rather
than notifying the user that a server is unavailable, the application queues the user request and
transmits it later when the server becomes available again. Message queueing software
ensures that messages of any kind, not necessarily just transactions, are delivered and
acknowledged.
Message queueing is useful only when the user does not need or expect confirmation that the
request has been completed (i.e., the application is not interactive).
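The queueing idea above can be illustrated with a minimal Python sketch. It is not taken from Serviceguard or from any message queueing product: the class name QueueingClient and the send callable are hypothetical, and the sketch assumes that send raises ConnectionError when the server cannot be reached.

```python
from collections import deque
from typing import Callable, Deque


class QueueingClient:
    """Hold requests while the server is down; forward them in order later."""

    def __init__(self, send: Callable[[str], None]) -> None:
        # `send` is a hypothetical transport function. It is assumed to
        # raise ConnectionError when the server cannot be reached.
        self._send = send
        self._pending: Deque[str] = deque()

    def submit(self, request: str) -> None:
        """Accept a request without reporting a server outage to the user."""
        self._pending.append(request)
        self.flush()

    def flush(self) -> int:
        """Try to deliver queued requests in order; return how many were sent."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # server still unavailable; keep the rest queued
            self._pending.popleft()  # remove only after a successful send
            sent += 1
        return sent

    @property
    def pending(self) -> int:
        """Number of requests still waiting for the server."""
        return len(self._pending)
```

A caller would invoke flush() periodically, or when it learns that the server is back. The queue here lives only in memory, so it does not survive a failure of the client itself; real message queueing software would normally be expected to store queued messages durably.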
Handling Application Failures
What happens if part or all of an application fails?
All of the preceding sections have assumed the failure in question was not a failure of the
application, but of another component of the cluster. This section deals specifically with application
problems. For instance, software bugs may cause an application to fail, or system resource issues
(such as low swap/memory space) may cause an application to die. This section deals with how
to design your application to recover after these types of failures.
Create Applications to be Failure Tolerant
An application should be tolerant of the failure of a single component. Many applications have multiple
processes running on a single node. If one process fails, what happens to the other processes? Do
they also fail? Can the failed process be restarted on the same node without affecting the remaining
pieces of the application?
Ideally, if one process fails, the other processes can wait a period of time for that component to
come back online. This is true whether the component is on the same system or a remote system.
The failed component can be restarted automatically on the same system, rejoin the waiting
processes, and continue. This type of failure can be detected, and the component restarted, within
a few seconds, so the end user would never know a failure occurred.
Another alternative is for the failure of one component to still allow the other components to be
brought down cleanly. If a database SQL server fails, it should still be possible to shut the
database down cleanly so that no database recovery is necessary.
The worst case is for a failure of one component to cause the entire system to fail. If one component
fails and all other components need to be restarted, the downtime will be high.
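The ideal case described above, in which a failed component is restarted automatically on the same system, might be sketched as follows in Python. This is an illustration only, not a Serviceguard mechanism: the function name run_with_restart and its parameters are hypothetical, and the sketch assumes that an exit code of 0 means a clean shutdown and anything else means a failure. It covers only the restart itself, not the logic by which the other processes wait for the component.

```python
import subprocess
import time
from typing import Sequence


def run_with_restart(cmd: Sequence[str], max_restarts: int = 3,
                     delay_s: float = 1.0) -> int:
    """Run cmd; if it exits abnormally, restart it on the same node.

    Returns the number of restarts performed before a clean exit.
    Raises RuntimeError if the component is still failing after
    max_restarts restarts.
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(cmd)  # blocks until the component exits
        if exit_code == 0:
            return restarts  # clean shutdown, nothing left to recover
        if restarts >= max_restarts:
            raise RuntimeError(
                f"{cmd[0]} still failing after {restarts} restarts "
                f"(last exit code {exit_code})")
        restarts += 1
        time.sleep(delay_s)  # brief pause so a crash loop does not spin
```

Capping the number of restarts is a design choice in this sketch: a component that keeps failing locally is reported to the caller, which can then decide whether to move the application elsewhere.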
Be Able to Monitor Applications
It should be possible to monitor the health of every component in a system, including applications.
A monitor might be as simple as a display command or as complicated as a SQL query. There
must be a way to ensure that the application is behaving correctly. If the application fails and the
failure is not detected automatically, it might take hours for a user to determine the cause of the
downtime and recover from it.
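As a rough illustration of automatic detection, the Python sketch below polls a health probe and returns once the probe has failed several times in a row. All names here (monitor, probe, failures_allowed) are hypothetical and are not part of Serviceguard. The probe stands in for whatever check suits the application, such as a display command or a SQL query, and is assumed to return True when the application is healthy.

```python
import time
from typing import Callable


def monitor(probe: Callable[[], bool], interval_s: float = 5.0,
            failures_allowed: int = 2,
            sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll probe until it fails more than failures_allowed times in a row.

    Returning signals that the application is unhealthy, so the caller
    (for example a restart or failover script) can act on it.
    """
    consecutive_failures = 0
    while True:
        try:
            healthy = probe()
        except Exception:
            healthy = False  # a probe that crashes counts as a failure
        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures > failures_allowed:
            return
        sleep(interval_s)
```

Tolerating a small number of consecutive failures before reporting is an assumption made here to avoid reacting to a single slow or dropped check; the right threshold and polling interval depend on the application.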