1 Introduction
At present, computer clusters¹ are the predominant architecture for supercomputer
installations. They are used in a wide range of applications such as web search engines [4],
weather forecasts [5], simulations of financial markets [6] and high-energy physics
experiments. For example, the data analysis of future high-energy physics experiments
like CMS² and ALICE³ is carried out on computer clusters. A driving force behind the
use of computer clusters is the growing need for cheap computing power in computational
science and commercial applications. Traditional supercomputing platforms incur high
costs and offer low availability, whereas clusters can be built from cheap
commodity-off-the-shelf (COTS) components and are readily available.
Clusters can consist of several hundred computer nodes. For example, the data center
of a government agency in Sweden operates a computer cluster of more than 2,000 nodes [9].
The management of such large computer farms therefore requires considerable
administrative effort: installation, configuration and maintenance. For instance, installing
one node and cloning its hard disk provides a fast and easy way to set up the cluster nodes.
Afterwards, the files are copied from node to node. This can be done remotely using the
network card's remote boot function. But if booting fails, one needs access to the node's
console to find the source of the error and repeat the installation. Furthermore,
commodity PCs normally do not provide remote access to the system without a running
operating system [10]. This is the drawback of using COTS components instead of
expensive server computers, which provide a wide range of remote control functions.
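
To make the remote boot step above more concrete, the following is a minimal sketch of
remotely powering on a node via Wake-on-LAN, one mechanism a commodity network card
can offer for starting a machine from a distance without a running operating system.
The MAC address shown is a placeholder, not a value from any real cluster.

    import socket

    def wake_on_lan(mac, broadcast="255.255.255.255", port=9):
        """Send a Wake-on-LAN 'magic packet' to power on a node remotely."""
        # The magic packet is 6 bytes of 0xFF followed by the MAC repeated 16 times.
        mac_bytes = bytes.fromhex(mac.replace(":", ""))
        packet = b"\xff" * 6 + mac_bytes * 16
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
            sock.sendto(packet, (broadcast, port))

    # Placeholder MAC address of the target node's network card.
    wake_on_lan("00:11:22:33:44:55")

Note that Wake-on-LAN only powers the machine on; the subsequent network boot and
installation are handled by the node's firmware and a boot server.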
There are a number of remote management tools and devices which enable remote control
features on a single computer. The following sections discuss their functions and the
drawbacks of using them in a computer cluster. However, the existing remote management
functions are either designed for a specific computer system or provide only a subset of
remote control functions. This thesis describes a remote control and maintenance facility
which was developed for the HLT⁴ cluster of the ALICE experiment at CERN. The facility
is installed in every cluster node and allows the remote control of inexpensive COTS
cluster nodes. Furthermore, it provides functions for automating node administration.
In addition, this hardware device monitors the computer and takes action when a failure
is detected. A specific feature of the device is its ability to access most of the
hardware units of the host computer. Malfunctions of computer nodes can therefore be
inspected more precisely.
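
To illustrate the monitoring behaviour described above, the sketch below shows the kind
of monitor-and-react loop such a facility can implement: check periodically whether the
host responds and trigger a reset after repeated failures. The function names, thresholds
and the ping-based health check are purely illustrative assumptions; the actual device
operates in hardware, independently of the host's operating system.

    import subprocess
    import time

    FAILURE_LIMIT = 3      # consecutive failed checks before acting (illustrative)
    CHECK_INTERVAL = 30    # seconds between health checks (illustrative)

    def host_alive(address):
        """Return True if the host answers a single ICMP ping (Linux syntax)."""
        result = subprocess.run(["ping", "-c", "1", "-W", "2", address],
                                stdout=subprocess.DEVNULL)
        return result.returncode == 0

    def reset_node(address):
        """Placeholder for the facility's hardware reset of the host computer."""
        print("resetting node", address)

    def watchdog(address):
        failures = 0
        while True:
            if host_alive(address):
                failures = 0          # node recovered; clear the failure count
            else:
                failures += 1
                if failures >= FAILURE_LIMIT:
                    reset_node(address)   # act once the node is deemed dead
                    failures = 0
            time.sleep(CHECK_INTERVAL)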
¹ A cluster is a collection of interconnected computers working together as a single system.
² Compact Muon Solenoid [7].
³ A Large Ion Collider Experiment [8].
⁴ High Level Trigger.