Technical information

Redundant components - With component redundancy, a component can fail and the server remains
operational because duplicate components keep the system running. However, the system must be
taken down to repair or replace the failed component; this can usually be scheduled as managed
downtime. Do not wait too long, as a further failure of the same type of component could
cause the whole system to fail. Examples of redundant components are disks (coupled with RAID),
fans, power supply units (PSUs), and LAN cards.
Hot-swap, redundant components - Here the repair can take place whilst the system is still running,
resulting in zero downtime. For example, a failed disk can be pulled out and replaced with a new disk
while the system is running. The system then automatically rebuilds the original data on the new disk
using information from the other disks in the system. Hot-swap means no impact on users.
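The rebuild described above can be illustrated with a small sketch. The text does not say which RAID level or rebuild method the servers use, so this assumes a parity-based scheme (as in RAID 5), where one block holds the XOR of the data blocks. The disk contents and the helper name xor_blocks are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data disks, each holding one 4-byte block.
data_disks = [b"ABCD", b"EFGH", b"IJKL"]

# The parity block is the XOR of all data blocks (assumed parity scheme).
parity = xor_blocks(data_disks)

# Simulate the loss of disk 1, then rebuild it from the survivors plus parity.
failed_index = 1
survivors = [d for i, d in enumerate(data_disks) if i != failed_index]
rebuilt = xor_blocks(survivors + [parity])
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields exactly the missing block. A mirrored array would instead copy the data from the surviving mirror.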
Hot-spare components - If a second disk should fail before the first failed disk has been swapped out,
the result could be disastrous. A hot-spare disk is a standby disk that automatically cuts in as soon as
any other disk fails.
What is RAID? (see also Overview of RAID)
The loss of the use of a server is one thing; the loss of the data it contains is quite another, and one that the
modern, competitive organisation is not prepared to contemplate. One way to minimise the risk is through
RAID, or Redundant Array of Independent Disks. RAID is a key element in supporting hot-swap and
hot-spare disks.
In simple terms, RAID ensures that should a disk fail, for whatever reason, other disks are standing by to
recreate the lost data, and to guarantee a seamless operation. Many different 'levels' of RAID are available,
and choosing the right level requires a trade-off between resilience and performance.
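One side of this trade-off can be made concrete with a rough calculation. The sketch below compares usable capacity with the number of disk failures tolerated for three common levels (RAID 0, 1 and 5). It is not taken from the text: the function name raid_summary is hypothetical, performance is not modelled, and the levels actually offered by a given controller may differ.

```python
def raid_summary(level, disks, disk_gb):
    """Return (usable_gb, disk_failures_tolerated) for a simple array.

    Illustrative only: covers RAID 0, 1 and 5 with identical disks.
    """
    if level == 0:
        # Striping only: all capacity is usable, but no failure is survivable.
        if disks < 2:
            raise ValueError("RAID 0 needs at least 2 disks")
        return disks * disk_gb, 0
    if level == 1:
        # Mirroring: every disk holds the same data.
        if disks < 2:
            raise ValueError("RAID 1 needs at least 2 disks")
        return disk_gb, disks - 1
    if level == 5:
        # Striping with one disk's worth of capacity used for parity.
        if disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return (disks - 1) * disk_gb, 1
    raise ValueError("level not covered by this sketch")

# Example: four hypothetical 100 GB disks at each level.
for level, disks in ((0, 4), (1, 2), (5, 4)):
    print(level, raid_summary(level, disks, 100))
```

The more capacity a level gives back, the fewer failures it survives, which is why the choice depends on how the data is used.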
However, RAID isn't just concerned with maximum fault tolerance. Today, RAID also addresses other
priorities for the corporate IT user. For example, it meets the need for scalability by offering
substantially increased storage capacity (currently up to nearly 3TB) that is also highly resilient. It is also
designed to achieve a high level of performance through more simultaneous reads and writes and faster
data transfer.
Clusters - Despite all the best efforts to reduce the risk of failure, a system may still fail, either because
redundancy is not available or because a failure occurs on another component. In a cluster, the work being
performed by that server is then transferred to another server, so downtime is limited to the short time
it takes to transfer the application.
Clustering has long been widely used for database applications, but the increasing need for
around-the-clock availability has seen the technique expand into new areas, most notably Internet and
email servers, where access is needed at every hour of the day and night.
Clusters can also be used to allow upgrades or maintenance to be carried out on one of the servers at a time
of your choosing.
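The failover behaviour described for clusters can be sketched as a simple selection rule. This is a minimal illustration, not the cluster software's actual logic: the node names, the health map and the function choose_active are all hypothetical.

```python
def choose_active(nodes, current):
    """Return the node that should run the workload, or None if none is healthy.

    nodes maps a node name to True (healthy) or False (failed or taken
    out of service). The current node keeps the workload while healthy.
    """
    if nodes.get(current, False):
        return current
    for name, healthy in nodes.items():
        if healthy:
            return name  # fail over to the first healthy node
    return None

# Hypothetical two-node cluster: server-a fails, so the work moves to server-b.
cluster = {"server-a": True, "server-b": True}
active = choose_active(cluster, "server-a")      # stays on server-a
cluster["server-a"] = False                      # failure or planned maintenance
active_after = choose_active(cluster, active)    # moves to server-b
```

The same rule covers planned maintenance: marking a node as unavailable at a chosen time moves its work to the other server before the upgrade starts.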
Automatic Server Recovery (ASR) - If a server fails in such a way that the system ‘hangs’ (for example,
the operating system stops responding), automatic server recovery re-boots the system and restarts the
operating system and applications. This is particularly useful when the system is unattended, e.g. in the
middle of the night or at a location where there is no on-site maintenance. There will be a short period of
downtime whilst the system reboots.
The A800i, L800i and T800i support ASR, whilst the C800i and G800i support ASR if the Server Manager
Assist card is configured. ASR is turned off by default and can be activated by installing server manager
and then using the management software to turn ASR on.
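The text does not describe how ASR detects a hang. One common approach, assumed here purely for illustration, is a watchdog timer: the operating system sends a regular heartbeat, and if none arrives within a timeout the hardware triggers a reboot. The class and method names below are hypothetical and do not represent the server manager software.

```python
class Watchdog:
    """Minimal watchdog-timer sketch (assumed mechanism, not the real ASR)."""

    def __init__(self, timeout_s, clock):
        self.timeout_s = timeout_s
        self.clock = clock            # function returning the current time in seconds
        self.enabled = False          # off by default, as ASR is in the text
        self.last_heartbeat = clock()

    def enable(self):
        """Turn recovery on and start timing from now."""
        self.enabled = True
        self.last_heartbeat = self.clock()

    def heartbeat(self):
        """Called periodically by a healthy operating system."""
        self.last_heartbeat = self.clock()

    def should_reboot(self):
        """True if enabled and no heartbeat arrived within the timeout."""
        elapsed = self.clock() - self.last_heartbeat
        return self.enabled and elapsed > self.timeout_s

# Simulated clock so the example runs instantly.
now = [0.0]
wd = Watchdog(timeout_s=10, clock=lambda: now[0])
wd.enable()
now[0] = 5.0
wd.heartbeat()                        # system is alive at t=5
now[0] = 12.0
healthy_check = wd.should_reboot()    # 7 s since last heartbeat
now[0] = 16.0
hung_check = wd.should_reboot()       # 11 s since last heartbeat
```

In a real system the reboot decision would be taken by hardware or firmware independent of the hung operating system, which is why a dedicated management card can be required on some models.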