Managing Serviceguard Eighteenth Edition, September 2010

Handling Application Failures
What happens if part or all of an application fails?
All of the preceding sections have assumed the failure in question was not a failure of
the application, but of another component of the cluster. This section deals specifically
with application problems. For instance, software bugs may cause an application to
fail, or system resource issues (such as low swap or memory space) may cause an
application to die. This section describes how to design your application to recover
after these types of failures.
Create Applications to be Failure Tolerant
An application should tolerate the failure of a single component. Many applications
have multiple processes running on a single node. If one process fails, what happens
to the other processes? Do they also fail? Can the failed process be restarted on the
same node without affecting the remaining pieces of the application?
Ideally, if one process fails, the other processes can wait a period of time for that
component to come back online. This is true whether the component is on the same
system or a remote system. The failed component can be restarted automatically on
the same system, rejoin the waiting processes, and continue. This type of failure
can be detected, and the component restarted, within a few seconds, so the end user
would never know a failure occurred.
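The two behaviors described above, restarting a failed component on the same system and having its peers wait for it to return, can be sketched as follows. This is a minimal illustration rather than anything Serviceguard provides: the names supervise, wait_for_component and is_up, and all of the default limits, are hypothetical.

```python
import subprocess
import time


def supervise(cmd, max_restarts=3, delay=0.1):
    """Run cmd and restart it each time it exits with a non-zero status.

    Returns the number of restarts that were needed before a clean exit.
    Raises RuntimeError once max_restarts restarts have been used up.
    (Hypothetical helper; limits are illustrative assumptions.)
    """
    restarts = 0
    while True:
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(
                f"{cmd!r} still failing after {restarts} restarts"
            )
        restarts += 1
        time.sleep(delay)  # brief pause before restarting on the same node


def wait_for_component(is_up, timeout=5.0, interval=0.1):
    """Poll is_up() until it returns True or timeout seconds have passed.

    is_up is any zero-argument callable supplied by the caller, for
    example a function that tries to connect to the peer process.
    Returns True if the component came back, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_up():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A surviving process would call wait_for_component with a bounded timeout instead of exiting as soon as its peer disappears, and would begin a clean shutdown only if the wait returns False.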
Alternatively, the failure of one component should still allow the other components
to be brought down cleanly. If a database SQL server fails, it should still be possible
to shut the database down cleanly so that no database recovery is necessary.
The worst case is for the failure of one component to cause the entire system to fail.
If one component fails and all other components need to be restarted, the downtime
will be high.
Be Able to Monitor Applications
It should be possible to monitor the health of every component in a system, including
applications. A monitor might be as simple as a display command or as complicated
as a SQL query. There must be a way to ensure that the application is behaving correctly.
If the application fails and it is not detected automatically, it might take hours for a
user to determine the cause of the downtime and recover from it.
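As an illustration of automatic detection, the sketch below polls a health-check command and reports failure after several consecutive bad results. It assumes the application offers some command whose exit status reflects its health. The names check_health and monitor, the thresholds, and the max_checks parameter are hypothetical and are not part of any Serviceguard interface.

```python
import subprocess
import time


def check_health(cmd, timeout=5.0):
    """Return True if cmd exits with status 0 within timeout seconds.

    A hung check, a missing executable, or a non-zero exit status all
    count as unhealthy.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def monitor(cmd, interval=1.0, max_failures=3, max_checks=None):
    """Poll cmd every interval seconds.

    Returns False as soon as max_failures consecutive checks have failed.
    Returns True if max_checks checks complete without reaching that
    limit; with max_checks=None the loop runs until a failure is reported.
    """
    failures = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if check_health(cmd):
            failures = 0  # any success resets the failure streak
        else:
            failures += 1
            if failures >= max_failures:
                return False
        time.sleep(interval)
    return True
```

Requiring several consecutive failures avoids reacting to a single slow response, at the cost of a detection delay of roughly interval multiplied by max_failures. A wrapper script could exit with a non-zero status when monitor returns False so that whatever supervises the application can restart it or move it to another node.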