
6 Hardware Monitor Functionality
Computer systems are generally not fail-safe with respect to availability and reliability.
Besides the errors produced by the software running on the system, the hardware is also
error-prone. Moving parts are among the components most likely to fail over a given
period of time; in particular, these are the fans and the hard disks [76] of the computer
system. A computer system also produces heat, which can damage other parts of the system
such as the CPU. Therefore, the CHARM card was developed not only to provide remote
control functions but also to monitor the host computer system. Additionally, the CHARM
offers diagnostic functions that help locate the source of an error.
The following functions are used to detect a failure or to find the source of a failure:
POST Code Analyzer.
Host System Inspector.
Measurement of temperature, voltage and the fan speed.
Display Screen Inspector.
The following sections discuss these features in more detail. Furthermore, the CHARM
provides monitoring software which summarizes the results of the measurements of the
CHARM card or takes action when a pending error is detected. There are two monitor
clients running on the CHARM card:
Lemon - LHC Era Monitoring.
SysMES - System Management for Networked Embedded Systems and Clusters.
Section 6.4 explains the Lemon software and the SysMES framework.
6.1 Power On Self Test
The power-on self-test (POST) is a series of diagnostic routines performed when a computer
system is powered up [59]. The POST is handled by the BIOS of the computer. The principal
duties of the BIOS during POST are as follows: to verify the integrity of the BIOS code
itself; to test the operability of the CPU; to verify and size the system main memory; to
discover, initialize, and catalog all system buses and devices; and to identify and select
the devices that are available for booting. At the beginning of each POST task, the BIOS
outputs the test-point error code, normally to I/O port 0x80 [77]. On a few computer
systems, I/O port 0x300 or 0x81 is used to output the code instead [78]. A code written to
port 0x80 does not necessarily indicate a failure. Instead, it represents a checkpoint to indicate the task