Managing Serviceguard Eighteenth Edition, September 2010
Minimizing Planned Downtime
Planned downtime, unlike unplanned downtime, is scheduled in advance; examples
include backups, system upgrades to new operating system revisions, and hardware
replacements. For planned downtime, application designers should consider:
• Reducing the time needed for application upgrades/patches.
Can an administrator install a new version of the application without scheduling
downtime? Can different revisions of an application operate within a system? Can
different revisions of a client and server operate within a system?
• Providing for online application reconfiguration.
Can the configuration information used by the application be changed without
bringing down the application?
• Documenting maintenance operations.
Does an operator know how to handle maintenance operations?
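As an illustration of online reconfiguration, the following minimal sketch shows one way an application might pick up configuration changes without being restarted. It is not taken from Serviceguard; the class name ReloadableConfig, the JSON file format, and the polling approach are all assumptions made for the example.

```python
import json
import os


class ReloadableConfig:
    """Hypothetical helper: re-reads a JSON file when its mtime changes."""

    def __init__(self, path):
        self.path = path
        self._mtime = None   # modification time of the last version loaded
        self.values = {}
        self.reload_if_changed()

    def reload_if_changed(self):
        """Reload the file if it was modified; return True if reloaded."""
        mtime = os.stat(self.path).st_mtime_ns
        if mtime == self._mtime:
            return False     # nothing changed, keep current values
        with open(self.path) as f:
            self.values = json.load(f)
        self._mtime = mtime
        return True
```

A long-running application could call reload_if_changed() at a convenient point in its main loop, so that an operator can edit the file while the application stays up.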
When discussing highly available systems, unplanned failures are often the main point
of discussion. However, planned downtime matters as well: if it takes two weeks to
upgrade a system to a new software revision, users are bound to complain.
The following sections discuss ways of handling the different types of planned
downtime.
Reducing Time Needed for Application Upgrades and Patches
Once a year or so, a new revision of an application is released. How long does it take
for the end user to upgrade to this new revision? The answer is the amount of planned
downtime the user must schedule to upgrade the application. The following guidelines
help reduce this time.
Provide for Rolling Upgrades
Provide for a “rolling upgrade” in a client/server environment. For a system with many
components, the typical scenario is to bring down the entire system, upgrade every
node to the new version of the software, and then restart the application on all the
affected nodes. For large systems, this could result in a long downtime.
An alternative is to provide for a rolling upgrade. A rolling upgrade rolls out the new
software in a phased approach by upgrading only one component at a time. For example,
the database server is upgraded on Monday, causing a 15-minute downtime. Then on
Tuesday, the application server on two of the nodes is upgraded, which leaves the
application servers on the remaining nodes online and causes no downtime. On
Wednesday, two more application servers are upgraded, and so on. With this approach,
you avoid changing everything at once, and you avoid long outages.
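The phased approach described above can be sketched as follows. This is a minimal illustration only: the function names, node names, and revision strings are hypothetical, and the assignment to the versions dictionary stands in for whatever real upgrade procedure a site uses.

```python
from typing import Dict, List


def plan_rolling_upgrade(nodes: List[str], batch_size: int) -> List[List[str]]:
    """Split nodes into batches to be upgraded one batch at a time."""
    if not 1 <= batch_size < len(nodes):
        # A batch covering every node would take the whole service down.
        raise ValueError("batch_size must leave at least one node online")
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]


def rolling_upgrade(versions: Dict[str, str],
                    batches: List[List[str]],
                    new_version: str) -> List[List[str]]:
    """Upgrade batch by batch; return the nodes left online at each step."""
    online_per_step = []
    for batch in batches:
        # Nodes outside the current batch keep serving requests.
        online_per_step.append([n for n in versions if n not in batch])
        for node in batch:
            versions[node] = new_version  # placeholder for the real upgrade
    return online_per_step


if __name__ == "__main__":
    app_nodes = ["app1", "app2", "app3", "app4", "app5", "app6"]
    revisions = {n: "5.0" for n in app_nodes}
    steps = plan_rolling_upgrade(app_nodes, batch_size=2)
    for batch, online in zip(steps, rolling_upgrade(revisions, steps, "5.1")):
        print("upgrading", batch, "while", online, "stay online")
```

Note that between steps some nodes run the old revision and some run the new one, so the revisions must be able to coexist.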
The trade-off is that the application must operate correctly while its components
are at different software revisions. In the above example, the database server might be at revision 5.0 while