
CHAPTER 1 What Kind of Protection do You need?

diminishing percentage of why servers failed. The growing majority of server outages were due to software, meaning not only the software-based hardware drivers, but also the applications and the OS itself. It is because of this shift in why servers were failing that data protection and availability had to evolve.

So, let’s start by looking at what we can do to protect those hardware elements that can cause a server failure or data loss. When the server comes from a tier-one vendor that is respected in the datacenter space, I tend to dismiss the server hardware at first glance as the likely point of failure. So, storage is where we should look first.

Introducing RAID

No book on data protection would be complete without summarizing what RAID is in its first discussion of disks. Depending on when you first heard of RAID, it has been both:

• Redundant Array of Inexpensive Disks

• Redundant Array of Independent Disks

In Chapter 3, we will take an in-depth look at storage resiliency, including RAID models, but for now, the key idea is that, statistically, the most common physical component of a computer to fail is a hard drive. Because of this, the concept of strapping multiple disks together in various ways (with the assumption that multiple hard drives are unlikely to all break at once) is now standard practice. RAID comes in multiple configurations, depending on how the redundancy is achieved or the disks are aligned:

Mirroring (RAID 1). The first thing we can do is to remove the single spindle (another term for a single physical disk, referring to the axis that all the physical platters within the disk spin on) as a point of failure. In its simplest form, we mirror one disk or spindle with another. With this, the disk blocks are paired up so that when disk block number 234 is being written to the first disk, block number 234 on the second disk is receiving the same instruction at the same time. This completely removes a single spindle from being the single point of failure (SPOF), but it does so by consuming twice as much disk, which equates to at least twice the cost, power, cooling, and space within the server.
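The block pairing described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `MirroredPair` class, its in-memory "disks," and the four-byte block size are all hypothetical and do not represent how any real controller or OS implements mirroring.

```python
class MirroredPair:
    """Toy RAID 1 model: two in-memory 'disks' that receive identical writes."""

    def __init__(self, block_count: int, block_size: int = 4):
        blank = bytes(block_size)
        # Each disk is a list of blocks; a failed disk is replaced by None.
        self.disks = [[blank] * block_count, [blank] * block_count]

    def write(self, block_number: int, data: bytes) -> None:
        # The same block number receives the same data on every surviving disk.
        for disk in self.disks:
            if disk is not None:
                disk[block_number] = data

    def fail(self, disk_index: int) -> None:
        # Simulate losing one spindle.
        self.disks[disk_index] = None

    def read(self, block_number: int) -> bytes:
        # Any surviving disk can serve the read.
        for disk in self.disks:
            if disk is not None:
                return disk[block_number]
        raise IOError("both disks in the mirror have failed")


pair = MirroredPair(block_count=1024)
pair.write(234, b"data")   # block 234 lands on both disks
pair.fail(0)               # the first spindle dies
assert pair.read(234) == b"data"
```

Note that the model holds two full copies of every block, which is the doubled disk consumption the text describes.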

RAID 5, 1+0/10, and Others. Chapter 3 will take us through all of the various RAID levels and their pros and cons, but, for now, the chief takeaway is that you are still solving a spindle-level failure. The difference between straight mirroring (RAID 1) and parity-based variants such as RAID 5 is that you are not in a 1:1 ratio of production disk and redundant disk. Instead, in classic RAID 5, you might be spanning four disks where, for every N-1 (3 in this case) blocks being written, three of the disks get data and the fourth holds parity calculated from the other three (which disk holds the parity rotates from stripe to stripe). If any single spindle fails, the other three have the ability to reconstitute what was on the fourth, both in production on the fly (though performance is degraded) and in reconstituting a new fourth disk.
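To make the parity idea concrete, here is a minimal sketch of one four-disk stripe. It assumes the parity is a bytewise XOR of the data blocks, which is the conventional choice for RAID 5 but is not spelled out in the text above. The helper names `xor_blocks` and `parity_usable_fraction` and the sample block contents are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_usable_fraction(disk_count: int) -> float:
    """Fraction of raw capacity left for data when one disk's worth holds parity."""
    return (disk_count - 1) / disk_count


# One stripe across four disks: three data blocks plus one parity block.
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_blocks)
stripe = data_blocks + [parity]

# Simulate losing the second disk, then rebuild its block from the other three.
lost_index = 1
survivors = [block for i, block in enumerate(stripe) if i != lost_index]
rebuilt = xor_blocks(survivors)
assert rebuilt == data_blocks[lost_index]

# Four disks with single parity keep 3/4 of raw capacity; a mirror keeps 1/2.
assert parity_usable_fraction(4) == 0.75
```

The same XOR that produces the parity also recovers a missing block, because XOR-ing a value in twice cancels it out. With two blocks missing from the stripe there is not enough information left to recover either one.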

But it is all within the same array, storage cabinet, or shelf for the same server. What if your fancy RAID 5 disk array cabinet fails, due to two disks failing in a short timeframe, or the power failing, or whatever?

In principle, mirroring (also known as RAID 1) and most of the other RAID topologies are all attempts to keep a single hard drive failure from affecting the production server. Whether the strategy is applied at the hardware layer or within the OS, the result is that two or more disk drives act together to improve performance and/or mitigate outages. In large enterprises,
