what Red Hat and other distributions base their kernels
on, and includes drivers not in stock 2.4.12.
3. We had to enable FireWire support when configuring
the kernel. This involved turning on the following options:
IEEE 1394 (FireWire) support (EXPERIMENTAL)
OHCI-1394 support
SBP-2 support (Harddisks etc.)
(The RAWIO driver is not necessary for storage devices.
In addition, you will need the SCSI disk driver enabled in
the kernel, even if you don’t have a real SCSI interface on
the machine. This is because FireWire is treated as a SCSI
channel.)
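For reference, the selections above correspond roughly to the following entries in a 2.4-series .config (a sketch only; the exact symbol names should be confirmed with make menuconfig on the kernel version in use):
CONFIG_IEEE1394=m           # IEEE 1394 (FireWire) support
CONFIG_IEEE1394_OHCI1394=m  # OHCI-1394 support
CONFIG_IEEE1394_SBP2=m      # SBP-2 support
CONFIG_SCSI=y               # SCSI support
CONFIG_BLK_DEV_SD=y         # SCSI disk support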
4. After rebooting with the new kernel, some recent
distributions should detect the FireWire card and install the
correct drivers. If not, the following modules need to be
manually loaded, in this order:
ohci1394
sbp2
The sbp2 driver is somewhat finicky; it helps to have a few
seconds' delay between the two modprobes. The command
“cat /proc/scsi/scsi” should then list the attached storage
devices (disks, CD-ROMs, etc.):
Attached devices:
Host: scsi1 Channel: 00 Id: 00 Lun: 00
Vendor: Maxtor Model: 1394 storage Rev: 60
Type: Direct-Access ANSI SCSI revision: 02
Some of the output may not make sense if an IDE-FireWire
(1394) bridge is in use; we noticed the non-Maxtor drive
had strange output.
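Putting the manual path together, a minimal loading sequence (run as root) might look like the following; the five-second pause is our own illustrative choice, not a prescribed value:
modprobe ohci1394
sleep 5
modprobe sbp2
cat /proc/scsi/scsi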
At the moment, the devices are added in more-or-less
random order. The only way to guarantee ordering is to
manually hot-plug them. We don't know if this is a soft-
ware limitation or an artifact of the plug-and-play nature of
FireWire (there is no permanent ID setting like IDE or SCSI
have). Presumably, if one writes a volume label
(e.g. with tune2fs -L) to each disk, one could get around
this problem.
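As a sketch of that workaround (the labels, device names, and mount points below are hypothetical), one could label each ext2 file system and then mount by label rather than by device name:
tune2fs -L fwdisk1 /dev/sda1
tune2fs -L fwdisk2 /dev/sdb1
mount -L fwdisk1 /mnt/fwdisk1
mount -L fwdisk2 /mnt/fwdisk2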
Hot plugging seems to work, with the following caveat:
do not unplug a FireWire device without unmounting it
first. While you do not have to shut down the computer
to remove the device, you do have to unmount it. Once
unmounted, disconnect the device physically and then run
“rescan-scsi-bus.sh -r”. For new devices, plug them in and
run “rescan-scsi-bus.sh”. The script can be downloaded at
http://www.garloff.de/kurt/linux/rescan-scsi-bus.sh
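For example, removing one disk and adding another might look like this (the mount point is hypothetical; the script options are those given above):
umount /mnt/fwdisk1        # unmount before unplugging
# ... physically disconnect the device ...
rescan-scsi-bus.sh -r      # remove the stale SCSI entry
# ... plug in the new device ...
rescan-scsi-bus.sh         # register the new device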
We successfully configured two FireWire disks, after for-
matting the disks using ext2 (though any common file sys-
tem, such as ext3 or ReiserFS, would work), as a RAID-5
array. One of the disks used the new Oxford 911 FireWire-
to-EIDE interface chip [36], [37], [38], [39]. We have suc-
ceeded in writing a DVD-R using the Pioneer DVR-A03
over FireWire.
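As a rough sketch of the RAID-5 configuration mentioned above, e.g. with the mdadm tool (the device names and mount point are hypothetical, and a two-member RAID-5 set degenerates to mirroring):
mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/sda1 /dev/sdb1
mke2fs /dev/md0            # format the array with ext2
mount /dev/md0 /mnt/raid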
V. High Energy Physics Strategy
A. Data Storage Strategy – Event Persistence
We encapsulate data and CPU processing power. A
block of real or Monte Carlo simulated data for an analy-
sis is broken up into groups of events and distributed once
to a set of RAID disk boxes, each of which may also serve a
few additional processors via a local 8-port gigabit ethernet
switch.² Dual-processor boxes would also add more local
CPU power. Events are kept physically contiguous on disks
to minimize I/O. Events are only built once. Event-paral-
lel processing has a long history of success in high energy
physics [1], [2], [40], [41], [42]. The data from each anal-
ysis are distributed among all the RAID arrays so all the
computing power can be brought to bear on each analysis.
For example, in the case of an important analysis (such
as a Higgs analysis), one could put 50 GB of data onto
each of 100 RAID arrays and then bring the full comput-
ing power of 700 CPUs into play. Instances of an analysis
job are run on each local cluster in parallel. Several anal-
ysis jobs may be running in memory or queued on each
local cluster to level loads. The data volume of the results
(e.g. histograms) is small and is gathered together over the
network backbone. Results are examined and the analysis
is rerun. The system is inherently fault tolerant. If three
of a hundred clusters are down, one still gets 97% of the
data and analysis is not impeded.
RAID-5 arrays should be treated as fairly secure, large,
high-speed “scratch disks”. RAID-5 just means that disk
data will be lost less frequently. Data which is very hard
to re-create still needs to reside on tape. The inefficiency
of an offline tape vault can be an advantage: it is harder to
erase your entire raw data set with a single keystroke if
thousands of tapes have to be physically mounted. Some-
one may ask why all the write-protect switches are being
reset before all is lost. It is the same reason the Air Force
has real people with keys in ICBM silos.
The granularity offered by RAID-5 arrays allows a uni-
versity or a small experiment at a laboratory to set up a
few-terabyte computer farm, while allowing a large Analy-
sis Site or Laboratory to set up a few-hundred-terabyte
or petabyte computer system. A large site would not
necessarily have to purchase the full system at once, but
could buy and install it in smaller parts. This would have
two advantages: primarily, the cost could be spread over a
few years, and secondly, given the rapid increase in both
CPU power and disk size, one could get the best “bang
for the buck”.
What would be required to build a 300 terabyte system
(the same size as a tape silo)? Start with eight 160 GB
Maxtor disks in a box. The Promise Ultra133 card allows
² D-Link DGS-1008T 8-port gigabit ethernet switch $765
Linksys EG0008 8-port gigabit ethernet switch $727
Netgear GS508T 8-port gigabit ethernet switch $770
Netgear GS524T 24-port gigabit ethernet switch $1860
D-Link DGE500T RJ45 gigabit ethernet PCI adapter $46
(See http://www.dlink.com/, http://www.linksys.com/products/,
and http://www.netgear.com/)