Technical information

10 Steps To Resilience
Resilience is a way of virtually eliminating unexpected system downtime. It combines redundancy,
availability and manageability features. Complete resilience would mean a system that is running
100% of the time. One definition of resilience in a business computing environment is:
Maximising system tolerance to any failure
[and]
providing the highest level of system uptime.
A server either serves a number of people or supports mission-critical operations, so the implications of
the server failing must be considered. In an ideal world, the server would never fail. The quality of the
components used by vendors and the level of quality testing the vendor performs are therefore
crucially important.
However, if a component does fail, no data should be lost and the system should be brought back up
and running as soon as possible, with minimal disruption to the users.
Resilience is an important aspect of the Fujitsu server range, and begins at the design stage. A major share of
the company's R&D budget has been invested in creating world-class technology capable of
meeting the ever-increasing demands of the real world.
This lies at the core of Fujitsu's “10 steps to resilience”.
Each additional ‘step’ builds on the solid foundation of the previous layers:
Prevent Component Failure
1. Component reliability and investment in underlying technology – stringent quality criteria are
applied to the selection of all Fujitsu's server components.
2. Reliability through design, validation and testing – rigorous testing procedures are applied to the
system, operating system and environment as well as the manufacturing process.
3. Fujitsu Software Partnerships – alliances with the world's leading software technology companies,
such as Microsoft, SCO, Novell, SAP and Citrix, enable Fujitsu to bring the best platform technology to
its customers, with compatibility pre-certified by the software vendor.
4. Server Management – analysing the status of the server, disk usage, processor temperature etc.,
allowing pre-failure problems to be dealt with before they cause a failure.
These four steps are built in at no extra cost, and all focus on minimising the chance of a component failure.
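As an illustration of the pre-failure monitoring described in step 4, the sketch below classifies sensor readings against warning and critical limits. It is a minimal example only: the metric names and threshold values are hypothetical and are not taken from any Fujitsu management tool.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    warn: float      # value at which an operator should be alerted
    critical: float  # value at which failure is considered imminent


# Hypothetical limits, chosen only for illustration.
THRESHOLDS = {
    "cpu_temp_c": Threshold(warn=70.0, critical=85.0),
    "disk_used_pct": Threshold(warn=80.0, critical=95.0),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning' or 'critical' for a single reading."""
    limits = THRESHOLDS[metric]
    if value >= limits.critical:
        return "critical"
    if value >= limits.warn:
        return "warning"
    return "ok"


def check(readings: dict) -> dict:
    """Classify every reading in a {metric: value} mapping."""
    return {name: classify(name, value) for name, value in readings.items()}
```

A "warning" result is the pre-failure signal: it gives the administrator time to act, for example by freeing disk space or checking a fan, before the component actually fails.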
Prevent System Failure
5. Redundancy against data loss – protecting your critical business data against loss or corruption
by duplicating it, e.g. ECC memory, RAID disks.
6. Redundant components (extra) – ensuring your server continues to service your users/customers even
if a component should fail, e.g. disks, ECC memory, LAN card, fans, PSU.
7. Uninterruptible Power Supplies (UPS) – protecting your server if a power cut occurs or someone
accidentally pulls out the wrong plug; the UPS will kick in to keep the server running. The UPS also
smooths out any “spikes” in the mains power to prevent any harm to your server.
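To show how redundancy can protect data in step 5, the sketch below uses XOR parity, the technique behind parity-based RAID levels such as RAID 5. The text above does not say which RAID level is meant, so this is an assumption made for illustration; mirrored RAID simply keeps a second copy instead. The function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute the parity block stored alongside the data blocks."""
    return xor_blocks(data_blocks)


def rebuild(surviving_blocks, parity):
    """Recover the single missing data block from the survivors and parity."""
    return xor_blocks(surviving_blocks + [parity])
```

For example, with three data blocks `[b"ACCT", b"0042", b"PAID"]`, losing the middle block and calling `rebuild` with the other two plus the parity block returns `b"0042"`. Only one lost block per parity group can be recovered this way.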
Zero Down-time
8. Hot-swap/spare components – replacing failed (redundant) components without impacting your
user/customer service, i.e. zero downtime, by using hot-swap disks, fans and PSUs. Hot-spare components
reduce the risk even further, by removing the chance that two component failures bring the service down:
the server automatically configures the hot spare to replace a failed component, e.g. a hot-spare redundant disk.
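The interplay of hot-spare and hot-swap in step 8 can be sketched as a small state model. This is a simplified illustration with hypothetical class and method names; a real RAID controller would also rebuild data onto the promoted spare.

```python
class DiskArray:
    """Toy model of an array with active disks and standby hot spares."""

    def __init__(self, active, spares):
        self.active = list(active)
        self.spares = list(spares)
        self.failed = []

    def report_failure(self, disk):
        """Mark a disk as failed and promote a hot spare if one is available.

        Returns the promoted spare, or None if no spare was left.
        """
        self.active.remove(disk)
        self.failed.append(disk)
        if self.spares:
            replacement = self.spares.pop(0)
            self.active.append(replacement)
            return replacement
        return None

    def hot_swap(self, failed_disk, new_disk):
        """Replace a failed disk while the server keeps running.

        The new disk becomes the next hot spare.
        """
        self.failed.remove(failed_disk)
        self.spares.append(new_disk)
```

The spare restores full redundancy as soon as the first failure is reported, and the later hot-swap restores the spare itself, so the server is never stopped and is exposed to a second failure for as short a time as possible.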
Manage Failure
9. Automatic Server Recovery – often under-rated, this constantly monitors whether the operating system
is alive and running. Once it is sure the O/S has hung, it automatically reboots the system.
10. Availability Clusters – consist of two servers where the applications on one server can be moved
across to the other, either manually (for example, to upgrade one of the servers) or automatically if a
component that is a single point of failure fails.
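Steps 9 and 10 can be sketched in a few lines. The watchdog below reboots when no heartbeat has arrived within a timeout, and `failover` moves applications from a failed server to its partner. All names and the timeout handling are assumptions for illustration, not a description of how Fujitsu implements either feature.

```python
class Watchdog:
    """Triggers a reboot callback if the O/S stops sending heartbeats."""

    def __init__(self, timeout_s, reboot):
        self.timeout_s = timeout_s  # silence tolerated before the O/S is declared hung
        self.reboot = reboot        # callable that restarts the system
        self.last_beat = 0.0

    def heartbeat(self, now):
        """Called periodically by a healthy operating system."""
        self.last_beat = now

    def poll(self, now):
        """Reboot if the heartbeat is overdue; return True if a reboot was triggered."""
        if now - self.last_beat > self.timeout_s:
            self.reboot()
            self.last_beat = now  # restart the timer after the reboot
            return True
        return False


def failover(placement, failed_node, standby_node):
    """Return a new {application: node} mapping with the failed node's
    applications moved to the standby node."""
    return {
        app: (standby_node if node == failed_node else node)
        for app, node in placement.items()
    }
```

Passing the current time in as `now` keeps the sketch easy to test. The same `failover` call covers both cases in step 10: an administrator can invoke it manually before an upgrade, or monitoring software can invoke it when a single point of failure fails.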
All of this adds up to ...
Maximum System Resilience