one to exceed the 137 GB limit³. Each box provides 7 × 160 GB = 1120 GB of usable RAID-5 disk space in addition to a CPU for computations. 300 terabytes is reached with 270 boxes. Use 40 commodity 8-port gigabit ethernet switches ($800 each) to connect the 270 boxes to a 40-port, high-end, fast-backplane ethernet switch [43], [44]. This could easily fit in a room formerly occupied by a few old mainframes, say an area of about a hundred square meters. The power consumption would be 42 kilowatts. One would need to build up operational experience for smooth running. As newer disks arrive that hold yet more data, even a petabyte system would become feasible.
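The sizing above can be checked with simple arithmetic; the figures are the ones quoted in the text, nothing new is assumed:

```shell
# Each box holds 8 disks, with one disk's worth of capacity lost to
# RAID-5 parity, leaving 7 x 160 GB of usable space per box.
usable_per_box=$((7 * 160))          # GB usable per box
boxes=270
total_gb=$((usable_per_box * boxes)) # total usable space
echo "per box: ${usable_per_box} GB; total: ${total_gb} GB"
# prints: per box: 1120 GB; total: 302400 GB
```

302,400 GB is just over the 300 TB target; similarly, 42 kW across 270 boxes works out to roughly 155 W per box.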
B. Data Transfer Strategy
For small amounts of data, and to update analysis software, one can use internet file transfers, preferably via “rsync”. The program “rsync” remotely copies files and uses a remote-update protocol to greatly speed up file transfers when the destination file already exists. This remote-update protocol allows “rsync” to transfer just the differences between two sets of files across the network link, using an efficient checksum-search algorithm. Some of the additional features of “rsync” are: support for copying links, devices, owners, groups, and permissions; the ability to use any transparent remote shell, including “rsh” or “ssh”; the ability to tunnel over encrypted connections, compatible with Kerberized rsh/ssh authentication; and no requirement for root privileges. The only problem is the available bandwidth. Internet2 may ameliorate this problem, but given the prevalence of Napster-like programs competing with data transfers, this is not a certainty. The other method would be to use some form of removable, universally readable media. Two new options are hot-pluggable IDE disks in $90 FireWire cases [39], and DVD-R disks. Since FireWire works on Linux, Windows 98SE, and Macintosh OS9 and OSX, one can use hot-pluggable EIDE disks in FireWire cases as a simple method of transferring reasonable amounts of data or even full sets of analysis software. In any case, it is best not to transfer any chunk of data more than once. Local CPUs and disks are far less expensive than wide area networks.
Writable 4.7 GB DVD-R disks can be purchased for $5. They can be read by $60 DVD-ROM drives and written by the $500 Pioneer DVR-A03 drive [3]. Linux is capable of writing DVD-Rs; however, the software to do so is not available under a free license. It is an enhanced version of “cdrecord”, the free program that writes CDs, CD-Rs, and CD-RWs. A demo version that will write up to 1 GB is available from the author's FTP site [45]. An alternative, which is free, is to use the patch for cdrecord [46]. Using this patched version of “cdrecord”, we have succeeded in
writing a DVD-R using the Pioneer DVR-A03 both internally (it is an EIDE device) and over FireWire. The specific kernel used was linux 2.4.18 plus the pre1 patch from Marcelo Tosatti [47], [48], the pre1-ac2 patch from Alan Cox [49], and the ieee1394 tree [35]. We used a patched version of cdrecord 1.11a11. The image was a standard iso9660 filesystem image created with “mkisofs”, including a 2880 kB boot image. (The DVD itself contains a complete copy of the February 27, 2002 snapshot of Debian Linux's upcoming 3.0 release, which would normally take up six 700 MB CD-Rs.) The image took approximately 25 minutes to write at 2x speed. The long-term reliability of DVD-R media still needs to be explored.

³ Promise Technology's Ultra133 TX2 PCI controller card uses a wider 48-bit data address versus the older 28-bit address, which is limited to 2^28 512-byte blocks, or 137 gigabytes. The card controls four disks and has a $59 list price. (See http://www.promise.com/marketing/datasheet/file/Ultra133tx2DS.pdf)
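The 25-minute, 2x write time reported above is consistent with the DVD base rate of 1x = 1385 kB/s (a standard figure from the DVD specification, not stated in the text):

```shell
# At 2x, a DVD-R writes at 2 * 1385 kB/s. A completely full 4.7 GB
# (4.7e6 kB) image would therefore take about 28 minutes, so ~25
# minutes for the somewhat smaller Debian image is as expected.
rate_kbps=$((2 * 1385))            # kB/s at 2x
seconds=$((4700000 / rate_kbps))
echo "full-disk write at 2x: ~$((seconds / 60)) minutes"
# prints: full-disk write at 2x: ~28 minutes
```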
VI. Conclusion
We have tested redundant arrays of IDE disk drives for use in offline high energy physics data analysis and Monte Carlo simulations. Parts costs of total systems using commodity IDE disks are now at the $4000 per terabyte level, the same cost per terabyte as Storage Technology tape silos. The disks, however, offer much better granularity; even small institutions can afford them. The faster access of disk versus tape is a major added bonus. We have tested software RAID-5 systems running under Linux 2.4 using Promise Ultra 100 disk controllers. RAID-5 provides parity bits to protect data in case of a single catastrophic disk failure. Tape backup is not required for data that can be recreated with modest effort. Journaling file systems permit rapid recovery from crashes. Our data analysis strategy is to encapsulate data and CPU processing power. Data is stored on many PCs. Analysis for a particular part of a data set takes place locally on the PC where the data resides. The network is only used to put results together. Commodity 8-port gigabit ethernet switches combined with a single high-end, fast-backplane switch would allow one to connect a thousand PCs, each with a terabyte of disk space. Some tasks may need more than one CPU to go through the data even on one RAID array. For such tasks, dual CPUs and/or several boxes on one local 8-port ethernet switch should be adequate and avoid overwhelming the backbone switching fabric connecting an entire installation. Again, the backbone is only used to put results together. We successfully performed simple tests of three methods of moving data between sites: internet transfers, hot-pluggable EIDE disks in FireWire cases, and DVD-R disks.
Current high energy physics experiments, like BaBar at SLAC, feature relatively low data acquisition rates, only 3 MB/s, less than a third of the rates taken at Fermilab fixed target experiments a decade ago [1], [2]. The Large Hadron Collider experiments CMS and Atlas, with data acquisition rates starting at 100 MB/s, will be more challenging and will require physical architectures that minimize helter-skelter data movement if they are to fulfill their promise. In many cases, architectures designed to solve particular processing problems are far more cost effective than general solutions [1], [2], [40], [41]. Some of the techniques explored in this paper, to physically encapsulate data and CPUs together, may be useful.