Compaq AlphaServer SC RMS Reference Manual Quadrics Supercomputers World Ltd.
The information supplied in this document is believed to be correct at the time of publication, but no liability is assumed for its use or for the infringements of the rights of others resulting from its use. No license or other rights are granted in respect of any rights owned by any of the organizations mentioned herein. This document may not be copied, in whole or in part, without the prior written consent of Quadrics Supercomputers World Ltd.
Contents

1 Introduction  1-1
    1.1 Scope of Manual  1-1
    1.2 Audience  1-1
    1.3 Using this Manual  1-1
    1.4 Related Information  1-3
    1.5 Location of Online Documentation  1-3
    1.6 Reader's Comments
2 Overview of RMS
    2.4.4 RMS Configuration
3 Parallel Programs Under RMS  3-1
    3.1 Introduction  3-1
    3.2 Resource Requests  3-2
    3.3 Loading and Running Programs  3-3
4 RMS Daemons  4-1
    4.1 Introduction  4-1
    4.1.1 Startup
5 RMS Commands
    nodestatus(1)  5-8
    msqladmin(1)  5-9
    prun(1)  5-11
    rcontrol(1)  5-20
    rinfo(1)  5-32
    rmsbuild(1)
6 Access Control, Usage Limits and Accounting
7 RMS Scheduling
    7.4.5 Idle Time
8 Event Handling  8-1
    8.1 Introduction  8-1
    8.1.1 Posting Events  8-2
    8.1.2 Waiting on Events  8-2
    8.2 Event Handling  8-3
    8.3 List of Events Generated
9 Setting up RMS
10 The RMS Database  10-1
    10.1 Introduction  10-1
    10.1.1 General Information about the Tables  10-1
    10.1.2 Access to the Database  10-2
    10.1.3 Categories of Table  10-2
    10.2 Listing of Tables  10-4
    10.2.1 The Access Controls Table
A Compaq AlphaServer SC Interconnect Terms
    A.1 Introduction  A-1
    A.2 Link States  A-4
    A.3 Link Errors  A-4
B RMS Status Values  B-1
    B.1 Overview  B-1
    B.2 Generic Status Values  B-2
C
    rms_ncaps(3)  C-12
    rms_getcap(3)  C-12
    rms_prggetstats(3)  C-13
D RMS Application Interface  D-1
    D.1 Introduction  D-1
    rms_allocateResource(3)  D-2
    rms_deallocateResource(3)
List of Figures

2.1 A Network of Nodes  2-2
2.2 High Availability RMS Configuration  2-3
2.3 The Database  2-6
2.4 Partitioning a System  2-7
2.5 Distribution of Processes  2-8
2.6 Preemption of Low Priority Jobs
List of Tables

10.1 Access Controls Table  10-4
10.2 Accounting Statistics Table  10-5
10.3 Machine Attributes  10-6
10.4 Performance Statistics Attributes  10-7
10.5 Server Attributes  10-7
10.6 Scheduling Attributes
10.22 Partitions Table  10-18
10.23 Projects Table  10-19
10.24 Resources Tables  10-19
10.25 Servers Table  10-20
10.26 Services Table  10-21
1 Introduction 1.1 Scope of Manual This manual describes the Resource Management System (RMS). The manual’s purpose is to provide a technical overview of the RMS system, its functionality and programmable interfaces. It covers the RMS daemons, client applications, the RMS database, the system call interface to the RMS kernel module and the application program interface to the RMS database. 1.2 Audience This manual is intended for system administrators and developers.
1.3 Using this Manual
Chapter 1 (Introduction) explains the layout of the manual and the conventions used to present information.
Chapter 2 (Overview of RMS) overviews the functions of the RMS and introduces its components.
Chapter 3 (Parallel Programs Under RMS) shows how parallel programs are executed under RMS.
Chapter 4 (RMS Daemons) describes the functionality of the RMS daemons.
Chapter 5 (RMS Commands) describes the RMS commands.
Chapter 6 (Access Control, Usage Limits and Accounting) explains RMS access controls, usage limits and accounting.
1.4 Related Information
The following manuals provide additional information about the RMS from the point of view of either the system administrator or the user:
• Compaq AlphaServer SC User Guide
• Compaq AlphaServer SC System Administration Guide
1.5 Location of Online Documentation
Online documentation in HTML format is installed in the directory /usr/opt/rms/docs/html and can be accessed from a browser at http://rmshost:8081/html/index.html.
Conventions
italic monospace type
    Italic (slanted) monospace type denotes some meta text. This is used most often in command or parameter descriptions to show where a textual value is to be substituted.
italic type
    Italic (slanted) proportional type is used in the text to introduce new terms. It is also used when referring to labels on graphical elements such as buttons.
Ctrl/x
    This symbol indicates that you hold down the Ctrl key while you press another key or mouse button (shown here by x).
2 Overview of RMS 2.1 Introduction This chapter describes the role of the Resource Management System (RMS). The RMS provides tools for the management and use of a Compaq AlphaServer SC system. To put into context the functions that RMS performs, a brief overview of the system architecture is given first in Section 2.2. Section 2.3 outlines the main functions of the RMS and introduces the major components of the RMS: a set of UNIX daemons, a suite of command line utilities and a SQL database.
The interactive nodes of the system are also connected to an external LAN. The application nodes, used for running parallel programs, are accessed through the RMS.
Figure 2.1: A Network of Nodes (showing a QM-S16 switch network and its control network, interactive nodes with a LAN/FDDI interface, application nodes, a terminal concentrator and the management network)
All of the nodes are connected to a management network (normally, a 100 BaseT Ethernet).
Figure 2.2: High Availability RMS Configuration (showing the RMS host, the backup RMS host and the RMS database)
The RMS processes run on the node with the name rmshost, which migrates to the backup on fail-over. The database is held on a shared disk, accessible to both the primary and backup node.
2.3 The Role of the RMS
The RMS provides a single point interface to the system for resource management.
Scheduling
    deciding when and where to run parallel jobs
Audit
    maintaining an audit trail of system state changes
From the user's point of view, RMS provides tools for:
Information
    querying the resources of the system
Execution
    loading and running parallel programs on a given set of resources
Monitoring
    monitoring the execution of parallel programs
The Role of the RMS • The RMS Daemon, rmsd, runs on each node in the system. It loads and runs user processes and monitors resource usage and system performance. The RMS daemons are described in more detail in Chapter 4 (RMS Daemons). 2.3.3 The RMS Commands RMS commands call on the RMS daemons to get information about the system, to distribute work across the system, to monitor the state of programs and, in the case of administrators, to configure the system and back it up.
RMS Management Functions Section 10.2.20). Users have read access to all tables but no write access. Operator and administrative applications are granted limited write access. Password-protected administrative applications and RMS itself have full read/write access. The RMS commands are described in more detail in Chapter 5 (RMS Commands). 2.3.4 The RMS Database The database provides a platform-independent interface to the RMS system.
2.4 RMS Management Functions
The RMS gives the system administrator control over how the resources of a system are assigned to the tasks it must perform. This includes the allocation of resources (Section 2.4.1), scheduling policies (Section 2.4.2), access controls and accounting (Section 2.4.3) and system configuration (Section 2.4.4).
2.4.1 Allocating Resources
The nodes in an RMS system can be configured into mutually exclusive sets known as partitions as shown in Figure 2.4.
A further partition, the root partition, is always present. It includes all nodes. It does not have a scheduler. The root partition can only be used by administrative users (root and rms by default).
2.4.2 Scheduling
Partitions enable different scheduling policies to be put into action. On each partition, one or more of three scheduling policies can be deployed to suit the intended usage:
The RMS scheduler allocates contiguous ranges of nodes with a given number of CPUs per node. Where possible, each resource request is met by allocating a single range of nodes. If this is not possible, unconstrained requests (those that only specify the number of CPUs required) may be satisfied by allocating CPUs on disjoint nodes. This ensures that an unconstrained resource request can utilize all of the available CPUs. The scheduler attempts to find free CPUs for each request.
Access controls, usage limits and accounting are described in more detail in Chapter 6 (Access Control, Usage Limits and Accounting). Each partition, except the root partition, is managed by a Partition Manager (see Section 4.4), which mediates user requests, checking access permissions and usage limits before scheduling CPUs and starting user jobs. An accounting record is created as CPUs are allocated to each request. It is updated periodically until the resources are freed.
3 Parallel Programs Under RMS 3.1 Introduction RMS provides users with tools for running parallel programs and monitoring their execution, as described in Chapter 5 (RMS Commands). Users can determine what resources are available to them and request allocation of the CPUs and memory required to run their programs. This chapter describes the structure of parallel programs under RMS and how they are run.
3.2 Resource Requests
Having logged into the system, a user makes a request for the resources needed to run a parallel program by using the RMS commands prun (see Page 5-11) or allocate (see Page 5-3).
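For example, a user might request eight processes spread over four nodes as follows (the program name and option values are purely illustrative):

$ prun -n 8 -N 4 ./myprog

Unless immediate mode is requested, the request is queued until sufficient CPUs are free.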
Loading and Running Programs The resource request is sent to the Partition Manager, pmanager (described in Section 4.4). The Partition Manager performs access checks (described in Chapter 6 (Access Control, Usage Limits and Accounting)) and then allocates CPUs according to the policies established for the partition (see Chapter 7 (RMS Scheduling)). RMS makes a distinction between allocating resources and starting jobs on them.
When the job completes, RMS cleans up its processes, removing any core files if requested (see Page 5-11) and then deallocating the CPUs. The application processes are run from the user's current working directory with the current limits and group rights. The data and stack size limits may be reduced if RMS has applied a memory limit to the program. During execution, the processes may be suspended at any time by the scheduler to allow a program with higher priority to run.
Sometimes, it is desirable for a user to be granted more control over the use of a resource. For instance, the user may want to run several jobs concurrently or use the same nodes for a sequence of jobs. This functionality is supported by the command allocate (see Page 5-3) which allows a user to allocate CPUs in a parallel partition to a UNIX shell. These CPUs are used for subsequent parallel jobs started from this shell.
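An illustrative session of this kind might look as follows (the partition name, node count and program names are examples only):

$ allocate -N 2 -p parallel
$ prun ./setup
$ prun ./compute
$ exit

Both prun jobs run on the same two nodes; exiting the shell releases the allocated CPUs.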
4 RMS Daemons
4.1 Introduction
This chapter describes the role of the RMS daemons. There are daemons that run on the rmshost node providing services for the system as a whole:
msqld
    Manages the database (see Section 4.2).
mmanager
    Monitors the health of the machine as a whole (see Section 4.3).
pmanager
    Controls the use of resources (see Section 4.4).
swmgr
    Monitors the health of the Compaq AlphaServer SC Interconnect (see Section 4.5).
4.1.1 Startup
RMS is started as each node executes the initialization script /sbin/init.d/rms with the start argument on startup. This starts the rmsmhd daemon which, in turn, starts the other daemons on that node. The daemons can also be started, stopped and reloaded individually by rcontrol once RMS is running. See Page 5-20 for details.
4.1.2 Log Files
Output from the management daemons is logged to the directory /var/rms/adm/log. The log files are named after the daemon that generates them.
The Partition Manager 4.3 The Machine Manager The Machine Manager, mmanager, is responsible for detecting and reporting changes in the state of each node in the system. It records the current state of each node and any changes in state in the database. When a node is functioning correctly, rmsd, a daemon which runs on each node, periodically updates the database. However, if the node crashes, or IP traffic to and from the node stops, then these updates stop.
The Partition Manager The Partition Manager makes new scheduling decisions periodically and in response to incoming resource requests (see Chapter 7 (RMS Scheduling) for details). These decisions may result in jobs being suspended or resumed. Such scheduling operations, together with those performed as jobs are killed, are performed by the Partition Manager sending scheduling or signal delivery requests to the rmsds. The Partition Manager is connected to its rmsds by a tree of sockets.
The Transaction Log Manager Configuration information about each partition is held in the partitions table (see Section 10.2.16). The information is indexed by the name of the partition together with the name of the active configuration. 4.5 The Switch Network Manager The Switch Network Manager, swmgr, controls and monitors the Compaq AlphaServer SC Interconnect (see Appendix A (Compaq AlphaServer SC Interconnect Terms)).
The Process Manager Each entry in the services table specifies which command to run, who can run it and on which host. 4.6.1 Interaction with the Database The Transaction Log Manager maintains the transactions table (see Section 10.2.23). It consults the services table (see Section 10.2.20) in order to execute transactions on behalf of its clients. 4.
4.8 The Process Manager
The Process Manager, rmsmhd, runs on each node and is responsible for starting, stopping and managing the other RMS daemons that run on its node. It starts them as the node boots, stops them as the node halts and starts or stops them in response to requests from the RMS client application rcontrol (see Page 5-20).
The rmsds communicate with each other and with the Partition Manager that controls their node over a balanced tree of sockets. Requests (for example, to deliver a signal to all processes in a parallel program) are passed down this tree to the appropriate range of nodes. The results of each request are combined as they pass back up the tree. rmsd is started by the Process Manager, rmsmhd, and restarted when it exits – this happens when a partition is shut down.
5 RMS Commands
5.1 Introduction
This chapter describes the RMS commands. RMS includes utilities that enable system administrators to configure and manage the system, in addition to those that enable users to run their programs. RMS includes the following commands intended for use by system administrators:
rcontrol
    The rcontrol command is used to control the system resources.
rmsbuild
    The rmsbuild command creates and populates an RMS database for a given machine.
rmshost
    The rmshost command reports the name of the node running the RMS management daemons.
msqladmin
    The msqladmin command is used for creating and deleting databases and stopping the mSQL server.
RMS includes the following commands for all users of the system:
allocate
    The allocate command is used to reserve access to a set of CPUs either for running multiple tasks in parallel or for running a sequence of commands on the same CPUs.
allocate(1) NAME allocate – Reserves access to CPUs SYNOPSIS allocate [-hIv] [-B base] [-C CPUs] [-N nodes | all] [-n CPUs] [-p partition] [-P project] [-R request] [script [args ...]] OPTIONS -B base Specifies the number of the base node (the first node to use) in the partition. Numbering within the partition starts at 0. By default, the base node is unassigned, leaving the scheduler free to select nodes that are not in use. -C CPUs Specifies the number of CPUs required per node (default 1).
allocate(1) immediate=0 | 1 With a value of 1, this specifies that the request should fail if it cannot be met immediately (this is the same as the -I option). hwbcast=0 | 1 With a value of 1, this specifies a contiguous range of nodes and constrains the scheduler to queue the request until a contiguous range becomes available. rails=n In a multirail system, this specifies the number of rails required, where 1 ≤ n ≤ 32.
The -R option can be used with hwbcast set to 1 to ensure that the range of nodes allocated is contiguous. Before allocating resources, the Partition Manager checks the resource limits imposed on the current project. The project can be specified explicitly with the -P option. This overrides the value of the environment variable RMS_PROJECT or any default setting in the users table. (See Section 10.2.24). The script argument (with optional arguments) can be used in two different ways, as follows:
allocate(1) RMS_TIMELIMIT Specifies the execution time limit in seconds. The program will be signaled either after this time has elapsed or after any time limit imposed by the system has elapsed. The shorter of the two time limits is used. RMS_DEBUG Specifies whether to execute in verbose mode and display diagnostic messages. Setting a value of 1 or more will generate additional information that may be useful in diagnosing problems. (See Section 9.6).
allocate(1) argument, it is interpreted as -I and the user is warned that this feature should not be used anymore.
nodestatus(1) NAME nodestatus – Gets or sets the status or run level of each node SYNOPSIS nodestatus [-bhr] [status] OPTIONS -b Operate in the background. -h Display the list of options. -r Get/set run level. DESCRIPTION The nodestatus command is used to update status information in the RMS database as nodes are booted or halted. When run without arguments, nodestatus gets the status of the node on which it is running from the Machine Manager.
msqladmin(1) NAME msqladmin – Perform administrative operations on the mSQL database server SYNOPSIS msqladmin [-q] [-f confFile] [-h host] command OPTIONS -f confFile Specify a non-default configuration file to be loaded. The default action is to load the standard configuration file located in /var/rms/msql.conf. -h host Specify a remote hostname or IP address on which the mSQL server (msql2d) is running.
stats
    Displays server statistics.
Most administrative functions can only be executed by the user specified in the run-time configuration as the admin user (rms). They can also only be executed from the host on which the server process is running (for example you cannot shut down a remote server process).
EXAMPLES
# msqladmin version
Version Details :
    msqladmin version      2.0.11
    mSQL server version    2.0.11
    mSQL protocol version  23
    mSQL connection        Localhost via UNIX socket
    Target platform        OSF1-V5.
prun(1) NAME prun – Runs a parallel program SYNOPSIS prun [-hIOrstv] [-B base] [-c cpus] [-e mode] [-i mode] [-o mode] [-N nodes | all] [-n procs] [-m block | cyclic] [-P project] [-p partition] [-R request] program [args ...] OPTIONS -B base Specifies the number of the base node (the first node to use) in the partition. Numbering within the partition starts at 0. By default, the base node is unassigned, leaving the scheduler free to select nodes that are not in use.
prun(1) -n procs Specifies the number of processes required. The -n and -N options can be combined to control how processes are distributed over nodes. If neither is specified, prun starts one process. -O Allows resources to be over-committed. Set this flag to run more than one process per CPU. -P project Specifies the name of the project with which the job should be associated for scheduling and accounting purposes. -p partition Specifies the partition on which to run the program.
DESCRIPTION
The prun program executes multiple copies of the specified program on a partition. prun automatically requests resources for the program unless it is executed from a shell that already has resources allocated to it. (See Page 5-3). The way in which processes are allocated to CPUs is controlled by the -c, -n, -p, -B and -N options. The -n option specifies the total number of processes to run. The -c option specifies the number of CPUs required per process; this defaults to 1.
prun(1) Before allocating resources, prun checks the resource limits imposed on the current project. The project can be specified explicitly with the -P option. This overrides the value of the environment variable RMS_PROJECT or any default setting in the users table. (See Section 10.2.24). By default, when running a parallel program, prun forwards standard input to the process with an identifier of 0. The -i option requests a different mode of operation.
none
    Do not redirect standard output (or standard error) from any process.
file
    prun opens the named file for output and associates it with the standard output (standard error) stream so that each process writes standard output (standard error) to the file.
file.%
    prun expands the % character to generate and open for output a separate file name for each process: process 0 writes standard output (standard error) to file.0, process 1 writes to file.1 and so on.
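For instance, the following illustrative command uses the file.% form so that each of the four processes writes its standard output to its own file:

$ prun -n 4 -o out.% ./myprog

This creates out.0, out.1, out.2 and out.3, one file per process.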
prun(1) ENVIRONMENT VARIABLES The following environment variables may be used to identify resource requirements and modes of operation to prun. These environment variables are used where no equivalent command line options are given: RMS_IMMEDIATE Controls whether to exit rather than block if resources are not immediately available. The -I option overrides the value of this environment variable. By default, prun blocks until resources become available. Root resource requests are always met.
prun(1) RMS_STDOUTMODE Specifies the mode for redirecting standard output from a parallel program. The -o option overrides the value of this environment variable. Values for mode are the same as those used with the -o option. RMS_STDERRMODE Specifies the mode for redirecting standard error from a parallel program. The -e option overrides the value of this environment variable. Values for mode are the same as those used with the -e option.
$ prun -n 4 -N 2 hostname
atlas0.quadrics.com
atlas0.quadrics.com
atlas1.quadrics.com
atlas1.quadrics.com
$ prun -n 4 -N 4 hostname
atlas1.quadrics.com
atlas3.quadrics.com
atlas0.quadrics.com
atlas2.quadrics.com
The -m option controls how processes are distributed over nodes. It is used in the following example in conjunction with the -t option which tags each line of output with the identifier of the process that wrote it.
$ prun -t -n 4 -N 2 -m block hostname
0 atlas0.quadrics.com
1 atlas0.quadrics.com
2 atlas1.quadrics.com
3 atlas1.quadrics.com
$ prun -t -n 4 -N 2 -m cyclic hostname
0 atlas0.quadrics.com
2 atlas0.quadrics.com
1 atlas1.quadrics.com
3 atlas1.quadrics.com
0:  1 bytes   3.60 uSec   0.28 MB/s
0:  2 bytes   3.53 uSec   0.57 MB/s
0:  4 bytes   2.44 uSec   1.64 MB/s
0:  8 bytes   2.47 uSec   3.23 MB/s
0: 16 bytes   2.54 uSec   6.29 MB/s
0: 32 bytes   2.57 uSec  12.46 MB/s
Elapsed time    1.00 secs    Allocated time  1.99 secs
User time       0.93 secs    System time     0.13 secs
Cpus used       2
Note that the allocated time (in CPU seconds) is twice the elapsed time (in seconds) because two CPUs were allocated.
WARNINGS
In earlier versions, the -i option specified immediate mode.
rcontrol(1) NAME rcontrol – Controls use of system resources SYNOPSIS rcontrol command [args ...] [-ehs] [-r level] [command args ...] OPTIONS -e Exit on the first error. -h Display the list of options. -r level Set reporting level. -s Stop and print warning on error. command is specified as follows: create object [=] name [configuration=val] [partition=val] [attr=val] object may be one of: access_control, attribute, configuration, node, partition, project, user.
rcontrol(1) start object [=] name object may be one of: configuration, partition, server. stop object [=] name [option [=] kill | wait] object may be one of: configuration, partition, server. If server is specified as the object, no option should be given. reload object [=] name [debug [=] value] object may be one of: partition, server. suspend job [=] name [name ...] job may be one of: resource, batchid. suspend attribute [=] value [attribute [=] value ...] Attributes of the same name are ORed together.
rcontrol(1) set attribute [=] name val [=] value exit help [all | command] show object [=] name object may be one of: nodes, configuration, partition. DESCRIPTION rcontrol is used to manage the following: nodes, partitions and configurations; servers; users and their resource requests, projects and access controls; system attributes. rcontrol can create, start, stop and remove a configuration or partition. It can create, remove and set the attributes of nodes and configure them in and out of the machine.
# rcontrol configure in nodes = 'atlas[1-3]'
# rcontrol configure in nodes 'atlas[1-3]'
Creating and Removing Nodes
To create a new node description, use rcontrol with the create command and the argument node followed by the hostname of the node. Additional attribute-value pairs specify properties of the node, such as its type and position. The attributes rack and unit specify the position of the node in the system.
The timelimit attribute specifies the maximum time in seconds for which CPUs can be allocated on the partition. On expiry of the time limit, jobs will be sent the signal SIGXCPU. If they have not exited within a grace period, they will be killed. The grace period for a site is defined in the attributes table (attribute name grace-period). Its default value is 60 seconds.
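As an illustrative sketch, a partition with a one-hour time limit might be created by passing timelimit as one of the attribute-value pairs (partition name, configuration name and node list are examples only):

# rcontrol create partition = par1 configuration = day nodes = 'atlas[0-15]' timelimit = 3600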
rcontrol(1) To stop a partition in the active configuration, use rcontrol with the stop command and the partition argument followed by the name of the partition. To stop all of the partitions in the active configuration, use rcontrol with the stop command and the configuration argument followed by the name of the configuration. When stopping partitions you can optionally specify what should happen to the running jobs. The options are to leave them running, to wait for them to exit or to kill them.
To start an RMS server, use rcontrol with the start command, the server argument and the name of the server. The command rinfo (with the -s flag) can be used to show the status of the RMS servers. To instruct an RMS server to change its reporting level, use the reload command and the server argument with the name of the server. In addition, you should specify the attribute debug and a value. RMS servers write their log files to the directory /var/rms/adm/log on the rmshost. See Section 9.6.
rcontrol(1) # rcontrol set resource = 32 priority = 25 # rcontrol set batchid = 48 priority = 40 rcontrol can also be used to suspend, kill or resume jobs identified by their attributes. The attributes that can be specified are: partition, project, status and user. Attributes of the same name are ORed together, attributes with different names are ANDed.
rcontrol(1) Note that a user can be in more than one project in which case the value would be a comma-separated list: # rcontrol set user = frank projects = parallax,science To create an access control called, for example, science, in the par1 partition, use rcontrol with the create command followed by the type of the object, its name and the name of the partition. Additional attribute-value pairs specify attributes of the access control, for example, its class.
The attribute pmanager-queuedepth limits the number of resource requests that a Partition Manager will handle at any time. If the attribute is undefined or set to NULL or 0, no limit is imposed. By default, it is set to 0. If a limit is set and reached, subsequent resource requests by prun will block or, if the immediate option to prun is set, fail. The blocked requests will not appear in the RMS database. To set the pmanager-queuedepth attribute, use rcontrol with the set command.
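For example, assuming the attribute already exists in the attributes table, an illustrative queue depth of 10 could be set with:

# rcontrol set attribute = pmanager-queuedepth val = 10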
rcontrol(1) The attribute cpu-poll-stats-interval specifies the interval between successive polls for gathering node statistics. The interval is specified in seconds and must be in the range 0 to 86400 (1 day). The attribute rms-keep-core determines whether core files are deleted or saved. By default, it is set to 1 so that core files are saved. Change this to 0 to delete core files. The attribute local-corepath specifies the directory in which core files are saved. By default, it is set to /local/core/rms.
# rcontrol kill resource = 2212 2213
# rcontrol kill batchid = 44 45
To instruct a Partition Manager to reread the user, projects and access_controls tables:
# rcontrol reload partition = par1
To enable debug reporting from the RMS scheduler for the partition called par1:
# rcontrol reload partition = par1 debug = 41
rinfo(1) NAME rinfo – Displays resource usage and availability information for parallel jobs SYNOPSIS rinfo [-chjlmnpqr] [-L [partition] [statistic]] [-s daemon [hostname] | all] [-t node | name] OPTIONS -c List the configuration names. -h Display the list of options. -j List current jobs. -l Give more detailed information. -m Show the machine name. -n Show the status of each node. This can be combined with -l.
rinfo(1) -t node | name Where node is the network identifier of a node, rinfo translates it into the hostname; where name is a hostname, rinfo translates it into the network identifier. See Section A.1 for more information on network identifiers. DESCRIPTION The rinfo program displays information about resource usage and availability. Its default output is in four parts that identify: the machine, the active configuration, resource requests and the current jobs.
EXAMPLES
When used with the -q flag, rinfo prints information on the user's projects, CPU usage limits, memory limits and priorities.
$ rinfo -q
PARTITION  CLASS    NAME       CPUS   MEMLIMIT  PRIORITY
parallel   project  default    0/8    100       0
parallel   project  divisionA  16/64  none      1
In this example, the access controls allow any user to run jobs on up to 8 CPUs with a memory limit of 100MB. Jobs submitted for the divisionA project run at priority 1, have no memory limit and can use up to 64 CPUs.
rmsbuild(1) NAME rmsbuild – Creates and populates an RMS database SYNOPSIS rmsbuild [-dhv] [-I list] [-m machine] [-n nodes | -N list] [-p ports] [-t type] OPTIONS -d Create a demonstration database. -h Display the list of options. -I list Specifies the names of any interactive nodes. -m machine Specifies a name for the machine. -n nodes Specifies the number of nodes in the machine. -N list Specifies the nodes in the machine by name.
Detailed information about each node (number of CPUs, amount of memory and so on) is added later by rmsd as it starts on each node. The machine name is specified with the -m option. Machines should be given a short name that does not end in a digit. Node names are generated by appending a number to the machine name. Database entries for the nodes are generated by the -n or -N options. Use -n with a number to generate entries for nodes 0 through n-1.
rmsctl(1) NAME rmsctl – Stops, starts or shows the status of the RMS system. SYNOPSIS rmsctl [-aehv] [start | stop | restart | show] OPTIONS -a Show all servers, when used with the show command. -e Only show errors, when used with the show command. -h Display the list of options. -v Verbose operation DESCRIPTION The rmsctl script is used to start, stop or restart the RMS system on all nodes in a machine, and to show status information. rmsctl starts and stops RMS by executing the /sbin/init.
RMS service stopped on atlas0
RMS service stopped on atlas3
RMS service stopped on atlas2
RMS service stopped on atlasms
To start the RMS system, use rmsctl as follows:
# rmsctl start
RMS service started on atlas0
RMS service started on atlas1
RMS service started on atlasms
RMS service started on atlas2
RMS service started on atlas3
pmanager-parallel: cpus=16 (4 per node) maxfree=4096MB swap=5171MB no memory limits
pstartup.OSF1: general partition parallel starting
pstartup.
rmsexec(1) NAME rmsexec – Runs a sequential program on a lightly loaded node SYNOPSIS rmsexec [-hv] [-p partition] [-s stat] [hostname] program [args ...] OPTIONS -h Display the list of options. -v Specifies verbose operation. -p partition Specifies the target partition. The request will fail if load-balancing is not enabled on the partition. (See Section 10.2.16). -s stat Specifies the statistic on which to base the load-balancing calculation (see below).
freemem
    Free memory in megabytes.
users
    Lowest number of users.
By default, usercpu is used as the statistic.
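For example, the following illustrative command runs a program on whichever node of a load-balanced partition (here called login) currently has the most free memory:

$ rmsexec -p login -s freemem ./myprog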
rmshost(1) NAME rmshost – Prints the name of the node running the RMS management daemons SYNOPSIS rmshost [-hl] OPTIONS -h Display the list of options. -l Prints the fully qualified domain name. DESCRIPTION The rmshost command prints the name of the node that is running (or should run) the RMS management daemons. It is used by the RMS system.
rmsquery(1) NAME rmsquery – Submits SQL queries to the RMS database SYNOPSIS rmsquery [-huv] [-d name] [-m machine] [SQLquery] OPTIONS -d name Select database by name. -h Display the list of options. -m machine Select database by machine name. -u Print dates as seconds since January 1st 1970. The default is to print dates as a string created with localtime(3). -v Verbosely prints field names above each column of output. DESCRIPTION rmsquery is used to submit SQL queries to the RMS database.
The source is provided in /usr/opt/rms/src. Details of the SQL language can be found on the Quadrics support page http://www.quadrics.com/web/support.
EXAMPLES
An example follows of a select statement that results in a list of the names of all of the nodes in the machine. Note that the query must be quoted. This is because rmsquery expects a single argument.
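Such a query might look like the following sketch (it assumes the nodes table has a name field, as described in Chapter 10; the node names shown are illustrative):

$ rmsquery "select name from nodes"
atlas0
atlas1
atlas2
atlas3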
rmstbladm(1) NAME rmstbladm – Database administration SYNOPSIS rmstbladm [-BcdDfhmuv] [-r file] [-t table] [machine] OPTIONS -B Dump the first five rows of each table to stdout as a sequence of SQL statements. A specific table can be dumped if the -t option is used. -c Clean out old entries from the node statistics (node_stats) table, the resources table, the events table and the jobs table. (See Chapter 10 (The RMS Database).
DESCRIPTION
The command rmstbladm is used to administer the RMS database. It creates the tables and their default entries. It can be used to back up individual tables (or the whole database) to a text file, to restore tables from file or to force the recreation of tables. Unless a specific machine is specified, rmstbladm operates on the database of the host machine.
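As an illustrative sketch, a single table might be dumped to a file and later restored, assuming that -d dumps a table as SQL statements (as in the archiving example in Chapter 9, Setting up RMS) and that -r restores from a file:

# rmstbladm -d -t jobs > jobs.sql
# rmstbladm -r jobs.sql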
6 Access Control, Usage Limits and Accounting 6.1 Introduction RMS access controls and usage limits operate on a per-user or per-project basis (a project is a list of named users). Each partition may have its own controls. This mechanism allows system administrators to control the way in which the resources of a machine are allocated amongst the user community. RMS accounts for resource usage by user and by project.
When submitting requests for CPUs, users can select any project of which they are a member (by setting the RMS_PROJECT environment variable or by using the -P flag when executing prun or allocate). RMS rejects requests to use projects that do not exist or requests to use projects of which the user is not a member. Users without an RMS user record are subject to the constraints on the default project.
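For example, a user who is a member of an illustrative project called science could submit work against it in either of the following ways:

$ prun -P science -n 8 ./myprog
$ RMS_PROJECT=science prun -n 8 ./myprog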
Access Controls The access controls for individual users must set lower limits than those of the projects of which they are a member. That is to say, they must have a lower priority, smaller number of CPUs, smaller memory limit and so on than the access control record for the project. Where a memory limit exists for a user or project, it takes precedence over any default limit set on the partition (see Section 10.2.16). When the system is installed, there are no access control records.
rcontrol create access_control = design class = project partition = \
    parallel priority = 5
rcontrol create access_control = default class = project partition = \
    parallel priority = 0 memlimit = 256

name     class    partition  priority  maxcpus  memlimit
design   project  parallel   5         Null     Null
default  project  parallel   0         Null     256

Requests submitted by Jim, Mary and John run at priority 5, causing other users' jobs to be suspended if running. These requests are not subject to CPU or memory limits.
Memory limits are applied per CPU: a process with a single CPU allocated has its memory limits set to this value. A process with more than one CPU allocated has proportionately higher memory limits. The RMS_MEMLIMIT environment variable can be used to reduce the memory limit set by the system, but not to raise it. By default, the memory limit is capped by the minimum value for any node in the partition of the smaller of these two amounts:
1. The amount of memory on the node.
2. The amount of swap space.
The CPU usage limit that applies to a request is determined as follows:
1. No CPU usage limits are set on jobs run by the root user.
2. If the user has an access control record for the partition, the CPU usage limit is determined by the maxcpus field in this record.
3. The access control record for the user's current project determines the CPU usage limit.
4. The access control record for the default project determines the CPU usage limit.
CPU usage limits can be set to a higher value than the actual number of CPUs available in the partition.
Accounting records are updated periodically until the CPUs are deallocated. The running flag is set to 0 at this point. The atime statistic is summed over all CPUs allocated to the resource request. The utime and stime statistics are accumulated over all processes in all jobs running on the allocated CPUs.
Note
The memint statistics are not implemented in the current release. All values for these fields are 0.
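These records can be inspected directly with rmsquery. The following sketch assumes the accounting statistics table is named acctstats (the table name is not spelled out in this chapter) and selects the records for requests whose CPUs are still allocated:

$ rmsquery "select * from acctstats where running = 1"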
7 RMS Scheduling
7.1 Introduction
The Partition Manager (see Section 4.4) is responsible for scheduling resource requests and enforcing usage limits. This chapter describes the RMS scheduling policies and explains how the Partition Manager responds to resource requests.
7.2 Scheduling Policies
The scheduling policy in use on a partition is controlled by the type attribute of the partition. The type attribute can take one of four values:
login
    Normal UNIX time-sharing applies.
Parallel programs are gang scheduled: all of the processes in a program are scheduled together. That is to say, all of the processes in a program are either running or suspended at the same time. Gang scheduling is required for tightly coupled parallel programs which communicate frequently. It becomes increasingly important as the rate of interprocess communication increases. For example, if a program is executing a barrier synchronization, all processes must be scheduled before the barrier completes.
What Happens When a Request is Received Time Limit Jobs are normally run to completion or until they are preempted by a higher priority request. Each partition may have a time limit associated with it which restricts the amount of time the Partition Manager may allow for a parallel job. On expiry of this time limit, the job is sent a SIGXCPU signal. A period of grace is allowed following this signal for the job to clean up and exit. After this period, the job is killed and the resource deallocated.
What Happens When a Request is Received immediate The request should fail rather than block if resources are not available immediately. Note The RMS scheduler attempts to allocate CPUs on a contiguous range of nodes. If a contiguous range of nodes is not available then requests that explicitly specify a contiguous range with the hwbcast parameter will block if the requested CPUs cannot be allocated.
What Happens When a Request is Received 7.4.1 Memory Limits If memory limits are enabled (by setting the memlimit attribute of a partition or access control) then a request is only allocated CPUs on nodes that have sufficient memory available. RMS enforces memory limits by setting the data and stack size limits on a process. If the process exceeds the allowed size, it is killed (and the parallel program terminated).
Processes (including those belonging to the system) will be killed if the system runs out of swap space.
7.4.3 Time Slicing
Time slicing is enabled on a partition by setting its timeslice attribute; values of 15–120 seconds are recommended. If a timeslice is set, the Partition Manager evaluates the list of requests periodically.
8 Event Handling 8.1 Introduction RMS includes a general mechanism for posting, waiting on and handling events. This functionality is provided by the Event Manager, eventmgr (see Section 4.7). Events are specified by RMS class, name, type and description.
$ rmsquery -v "select * from events order by ctime"
id  name    class  type    ctime              handled  description
--------------------------------------------------------------
20  atlas0  node   status  05/04/01 15:53:02  1        running
21  atlas0  node   status  05/05/01 11:27:29  1        not responding
8.1.1 Posting Events
Events are normally posted by RMS servers but they can also be generated by the command line utility rmspost. This is useful for testing the response of the system to rare events.
For example, the pattern ::: in which the class, name and type are all empty matches node:atlas0:status. Note that the class, name, type and description must all be specified when posting events but one or more of the class, name and type can be null when waiting on events.
8.2 Event Handling
Event handler scripts are specified in the event_handlers table.
program=`basename $0`
id=$1
class=$2
name=$3
type=$4
description=$5
#
# format event description message
#
message()
{
    echo "`date '+%h %e %X'` OSF1 event $id $type $class $name $description"
}
#
# log the event
#
message >> /var/rms/adm/log/event.log
#
# execute OSF1 specific handler
#
/usr/opt/srasysman/bin/checkout.exp -I -R -i $id -c $class -n $name -t $type -d $description
8.3 List of Events Generated
class = module, type = temphigh (DS20, ES40, QM-S16, QM-S128)
    The description contains a temperature report of the form ambient=value. If the temperature exceeds the threshold value, the event type is temphigh and the description contains the above report and, in addition, the words threshold exceeded. In the event of multiple failures, the reports are concatenated.
class = module, type = psu
    The name field contains the name of the module.
submitted    transaction submitted
started      transaction being executed
complete     transaction completed successfully
failed       transaction failed to execute
error        transaction completed but there were errors
In the case of a transaction completing with errors (a link error test or boundary scan, for example), details of the failures are added to the transaction outputs table.
9 Setting up RMS 9.1 Introduction This chapter describes how to set up RMS and carry out routine operations. The information is organized as follows: • Planning the installation (see Section 9.2). • Starting RMS and configuring the system (see Section 9.3). • Carrying out day-to-day operations and establishing backup and archive procedures (see Section 9.4). • Customizing RMS (see Section 9.5). • Dealing with log files (see Section 9.6). 9.
Setting up RMS • Is the machine primarily for running parallel jobs or do you expect a significant workload from sequential jobs? • Will some of your users have jobs that consume all of the resources of the system for extended periods of time? If so, are you happy for other users to wait until the machine is available or do they need access to resources of their own? • How do you wish to process the accounting data? The answers to these questions should help you to determine how to configure the system.
for this command to work correctly. This should have been enabled as part of the installation.
# rmsctl start
Configure all of the nodes into the machine using rcontrol.
# rcontrol configure in 'atlas[0-63]'
Use rinfo with the -n option to check the status of the nodes. The output should show that all of the nodes are running.
# rinfo -n
running atlas[0-63], atlasms
If any of the nodes show a status other than running, restart them by running /sbin/init.d/rms on the nodes in question.
Once RMS is running on all of the nodes, you set up a single partition as follows:
# rcontrol create partition=parallel configuration=day nodes='atlas[0-63]'
# rcontrol start partition=parallel
You should now be able to run a parallel program across all 64 nodes, for example:
# prun -N64 hostname
...
# prun -N64 dping 0 32
...
9.3.3 Simple Day/Night Setup
In this example, the system is set up with two operating configurations: one called day and the other called night.
Day-to-Day Operation Note In the current release, any requests that are suspended when a partition is stopped must be resumed manually if the partition is restarted. 9.4 Day-to-Day Operation Once the system is up and running, give some thought to automating some routine or day-to-day operations: • Periodic shift changes • Backing up the database • Summarizing accounting data • Archiving data • Database maintenance You may also want to configure nodes out of the system in the event of failures. 9.4.
Day-to-Day Operation 9.4.3 Summarizing Accounting Data Accounting records accumulate in the RMS database as each job is run. By default, they are not processed as each site has its own requirements in this respect. A simple example script to produce a summary of resource usage is included in the release in /usr/opt/rms/examples/scripts/accounting_summary. See Appendix E (Accounting Summary Script) for a listing. The script produces the following output.
Day-to-Day Operation The data can be archived as a sequence of SQL statements using rmstbladm. The following example archives data from the node statistics (node_stats) table (see Section 10.2.15): $ rmstbladm -d -t node_stats > nodestats.
Old data is cleared out of the database by instructing the table administration program, rmstbladm, to remove old entries. Before running rmstbladm, archive any data you want to keep as described in Section 9.4.4. Remove old entries as follows:
# rmstbladm -c
rmstbladm clears out all entries that are older than a specified lifetime. The lifetime for job data and the lifetime for statistical data are specified in the attributes table (see Section 10.2.3).
Day-to-Day Operation 2. Change to the directory that contains the database, for example: # cd /var/rms/msqldb/rms_atlas Delete the following files: node_stats.dat, node_stats.def, node_stats.idx and node_stats.ofl. # rm node_stats.* 3. Restart the database server, as follows: # /sbin/init.d/msqld start MSQL: daemon started 4. Create a new node statistics table, as follows: # rmstbladm -u After this, rmstbladm should succeed in cleaning out old entries. 9.4.
Local Customization of RMS # rcontrol configure in node=atlas2 3. Restart the partition: # rcontrol start partition=parallel This brings the partition back up to its full complement of nodes. 9.5 Local Customization of RMS RMS can be customized to suit local operating patterns in a variety of ways. Customization is done through site-specific scripts in /usr/local/rms/etc.
A site-specific variant might copy core files from the local temporary directory to a global file system for subsequent analysis. To create a site-specific core file analysis script, copy the default script /opt/rms/etc/core_analysis to /usr/local/rms/etc and modify it as required.
9.5.3 Event Handling
The default event handlers check for the existence of a site-specific handler of the same name in /usr/local/rms/etc.
9.6 Log Files
The RMS daemons output reports to log files in the directory /var/rms/adm/log. The amount of detail is controlled for each daemon by setting a reporting level. By default, the reporting level is set to 0. The reporting level is a bitmap that turns on different reports.
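For example, the reporting level of a server can be raised with the rcontrol reload command described on Page 5-20 (the server name and level shown here are illustrative):

# rcontrol reload server = mmanager debug = 3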
10 The RMS Database
10.1 Introduction
This chapter describes the tables which make up the RMS database. Each machine has its own database, called rms_machine, where machine is the name of the machine. This allows a single database server to support multiple machines.
Introduction x-y This denotes a range of possible integer values. text This denotes a character string of arbitrary length. • Fields of type text can be selected by the field name but the text entry cannot be matched. If the text is a list of items, for example, a list of node names, the items in the list may be separated by white space. A list of names, all of which share a common base, for example, atlas0 atlas1 atlas2, may also be represented by a glob-like expression, in this example, atlas[0-2].
Operational State
The following tables hold details of the current state of the machine.
transaction outputs    contains output from requests posted to the transaction log
request types          describes output formats in the transaction outputs table
statistics             lists the performance statistics available in the current release
services               describes the services available and who can use them
10.2 Listing of Tables
This section lists the tables in alphabetical order.
10.2.1 The Access Controls Table
The access_controls table is shown in Table 10.1.
10.2.2 The Accounting Statistics Table
Entries in the accounting statistics table are updated at the end of each job by the Partition Manager, pmanager (see Section 4.4).
Table 10.2: Accounting Statistics Table
The memint field is set to 0 in AlphaServer SC Version 2.0. The number of entries in the accounting statistics table can grow rapidly. The table should be cleared periodically of old entries as described in Section 9.4.3.
10.2.3 The Attributes Table
The attributes table, shown in Table 10.3, stores information specific to the site or the release. This information is stored as attribute-value pairs.
The table administration program, rmstbladm, removes old entries, if called with the -c option (see Page 5-44). Note that the accounting statistics table is not cleared out (see Section 10.2.2).
Table 10.4: Performance Statistics Attributes
Attribute                 Default  Description
node-statistics           cpu      statistics collected per node
cpu-stats-poll-interval   120      time in seconds between CPU samples
data-lifetime             48       time in hours to keep job data
stats-lifetime            24       time in hours to keep statistical data
The server attributes are shown in Table 10.5.
Listing of Tables Table 10.
10.2.5 The Elites Table
The elites table, shown in Table 10.8, contains one entry for each switch in the network. Its entries are created and maintained by the Switch Network Manager, swmgr (see Section 4.5).
Table 10.8: Elites Table
Table 10.9: Events Table (cont.)
Field        Type      Description
class        char(16)  class of the object, such as node or partition
type         char(16)  type of event
ctime        UTC       time at which the event occurred
handled      0|1       whether the event has been handled or not
description  text      description of the event
Table 10.10 shows three typical events.
Table 10.11: Event Handlers Table (cont.)
Field    Type      Description
handler  char(32)  handler script to run
10.2.8 The Fields Table
The fields table, shown in Table 10.12, defines which RMS objects and attributes can be created and modified using rcontrol (see Page 5-20), identifying them by a table name and field name within that table.
Table 10.12: Fields Table
10.2.9 The Installed Components Table
The installed_components table, shown in Table 10.14, contains information about software components installed on each node.
Table 10.14: Installed Components Table
Table 10.15: Jobs Table (cont.)
Field       Type  Description
exitStatus  int   exit status of the job
session     int   UNIX session ID of the allocating process
cmd         text  command being executed
Job names are sequence numbers generated automatically. The status field holds one of the values shown in Table B.1. While the job is running, endTime is set to the time by which the job must end, assuming there is a timelimit on the partition. If there is no time limit, the endTime is set to 0.
10.2.12 The Modules Table
The modules table, shown in Table 10.17, contains descriptions of each hardware module in a machine. The modules may be nodes, network components or storage devices. It is created by rmsbuild. Entries are added and removed by rcontrol and updated by rmsd and the Switch Network Manager, swmgr.
Table 10.17: Modules Table
10.2.13 The Module Types Table
The module_types table, shown in Table 10.18, contains descriptions of each of the module types supported in a given release of the RMS. It is updated by the table administration program, rmstbladm (see Page 5-44), when a new release is installed.
Table 10.18: Module Types Table
10.2.14 The Nodes Table
Entries in the nodes table are updated by the Machine Manager, mmanager, when the node's status or run level changes.
To collect node statistics, the node-statistics attribute in the attributes table (see Section 10.2.3) must be set to cpu. This is the default setting. The interval at which the nodes are sampled for CPU statistics is controlled by the attribute cpu-stats-poll-interval in the attributes table; the default is to sample every 2 minutes. The node statistics (node_stats) table can grow rapidly, especially on a large machine. Running the table administration program, rmstbladm, with the -c option removes old entries.
The partitions table, shown in Table 10.22, describes how nodes are allocated to partitions in each of the configurations. It also contains scheduling parameters (see also Section 7.3) for each partition. The entries in the partitions table are created by rcontrol. The information is updated by the Partition Manager, pmanager, as it starts.
Table 10.22: Partitions Table
Listing of Tables The configured_nodes field stores the subset of nodes that were configured in when the partition was started. The timeslice field stores the interval in seconds between periodic rescheduling of parallel jobs. Time slicing is disabled when this field is null, the default. The timelimit field stores the maximum interval in seconds for which CPUs in a partition may remain allocated. Time limits are disabled when this field is null, the default.
Table 10.24: Resources Tables (cont.)
Table 10.25: Servers Table (cont.)
Listing of Tables Currently, only rms is valid. Some services, such as rcontrol, must have exclusive access to the database, requiring that other transactions wait until they complete. The sequential field should be set to 1 for these services. Others such as swctrl may run for long periods of time and should not block the execution of other transactions. sequential should be set to 0 for these services. Sample records from the services table are shown in Table 10.27. Table 10.
10.2.22 The Switch Boards Table
The switch_boards table, shown in Table 10.30, contains one entry for each switch board in the Compaq AlphaServer SC Interconnect. It is created and maintained by the Switch Network Manager, swmgr (see Section 4.5).
An example of the transaction to add a partition is shown below in Table 10.32. The handle is a unique number, generated automatically, which is passed to both the service and the client. The service uses the handle to label any output resulting from the transaction; the client uses the handle to select the resulting entries. If the service fails, the output log (conventionally in the directory /var/rms/adm/log) may contain useful diagnostics.
A Compaq AlphaServer SC Interconnect Terms A.1 Introduction RMS includes support for programs that use Compaq AlphaServer SC Interconnect. This appendix provides an introduction to Compaq AlphaServer SC Interconnect, defining terms used elsewhere in this manual. Before an application process can use Compaq AlphaServer SC Interconnect, it must be given an Elan capability (see Section C.2), describing the nodes and communications contexts that it is allowed to use.
Figure A.1: A 2-Stage, 16-Node Switch Network
The level is the index of the stage, starting with 0 at the top. Note that in a 2-stage switch network the Elans are at level 2. Each component has a network ID that describes how to reach it from the top of the network. The plane is the index of switches that have the same switch network ID.
Four such 64-node networks and an additional stage of switches can be used to construct a 256-way network. Alternatively, the unused uplinks can be used to double the number of nodes a switch can connect. This avoids the need to add an additional switch stage, but the resulting network cannot be expanded further. This technique is used in the 128-node network, shown in Figure A.3.
network, data can be broadcast directly to a contiguous range of processors: data is routed up to a node in the tree from which all processors can be reached, then routed down to all switch outputs in the broadcast range. Data can be recombined as it travels through the network to support global reduction operations and barrier synchronization. Multiple Elan network adapters may be installed per node, each connected to a different switch network.
B RMS Status Values
B.1 Overview
This appendix lists the various states that RMS objects can enter. State information is stored in the status field of the RMS table for the object in question. For example, the current state of a partition is held in the partitions table (see Section 10.2.16), and the current state of a node is entered in the nodes table (see Section 10.2.14). Status changes are recorded in the events table (see Section 10.2.6).
B.2 Generic Status Values
There are three generic status values:
ok        This state means that an object is functioning correctly as far as the relevant RMS daemon can tell.
error     This state means that one or more errors have been detected. A description of the problem will be found in the event record.
unknown   This state means that the RMS daemon responsible for an object either has not run or is unable to determine the state of the object.
B.4 Link Status Values
Each switch (see Appendix A (Compaq AlphaServer SC Interconnect Terms)) has an entry in the elites table. Each switch has eight links and the state of each of these links is recorded in the linkstate field of the elites table. The field holds eight characters, one for each link. Valid values for the characters are as shown in Table B.2. See also Section A.2.
If a node has more than one instance of each type of temperature sensor, the maximum of their values is recorded. Temperature information is recorded as a list of attribute-value pairs, for example:
ambient=15 cpu=40 psu=20
Note that not all node types support all types of thermistor reading. The environment field may contain only a subset of this information. If an error occurs, the environment string contains details of what has failed.
The current UNIX run level of a node is held in the nodes table in the runlevel field. This field is updated by the nodestatus program as the run level changes. The valid strings are shown together with their meaning in Table B.5.
jobs using the CPUs complete. While CPUs are allocated, the valid resource status strings are as shown in Table B.7.
C RMS Kernel Module C.1 Introduction The RMS kernel module supports the operation of RMS on each node in a system. It provides functions that bind together the set of processes that make up a program on each node, allowing RMS to apply scheduling, signal delivery and statistics gathering operations to them collectively. For example, the RMS kernel module allows the rmsd daemon or an administrator process to send a signal to all processes in a parallel program at the same time.
System Call Interface and the Elan hardware context numbers to be used.
rms_setcorepath(3) NAME rms_setcorepath, rms_getcorepath – Set, get the path for application core files SYNOPSIS cc [ flag ... ] file ... #include -lrmscall [ library ... ] int rms_setcorepath(caddr_t path); int rms_getcorepath(pid_t pid, caddr_t path, int maxlen); PARAMETERS path Array containing the path name. maxlen Size of the array pointed to by path. pid Process identifier. DESCRIPTION The function rms_setcorepath() sets the core file path for the current process.
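The following sketch, which is not part of the original manual page, shows a typical call sequence. The header name <rms/rmscall.h> and the core file path are assumptions; the program is compiled and linked with -lrmscall as shown in the synopsis.

    /* Usage sketch; <rms/rmscall.h> and the path are assumptions. */
    #include <sys/types.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <rms/rmscall.h>

    int main(void)
    {
        char buf[1024];

        /* Direct core files for this process to a node-local directory. */
        if (rms_setcorepath((caddr_t) "/local/core/rms") < 0)
            perror("rms_setcorepath");

        /* Read the core file path of the current process back. */
        if (rms_getcorepath(getpid(), (caddr_t) buf, sizeof(buf)) < 0)
            perror("rms_getcorepath");
        else
            printf("core path: %s\n", buf);

        return 0;
    }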
rms_prgcreate(3) NAME rms_prgcreate, rms_prgdestroy – Create, destroy program descriptions SYNOPSIS cc [ flag ... ] file ... #include -lrmscall [ library ... ] int rms_prgcreate(int id, uid_t uid, int cpus); int rms_prgdestroy(int id); PARAMETERS id Program identifier. uid Owner of the program. cpus Number of CPUs allocated. DESCRIPTION rms_prgcreate() creates a new program description with the current process as its root process.
rms_prgcreate(3)
EINVAL    Program identifier is in use or the number of CPUs is invalid.
ECHILD    Processes belonging to this program are still running.
EEXIST    Program identifier does not exist.
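A minimal sketch of how a resource management daemon might use these calls; it is not taken from the manual page. The header name <rms/rmscall.h> and the identifier, uid and CPU values are assumptions.

    /* Sketch; <rms/rmscall.h> and the id, uid and cpus values are assumptions. */
    #include <sys/types.h>
    #include <stdio.h>
    #include <rms/rmscall.h>

    int main(void)
    {
        int   id   = 42;     /* program identifier chosen by the caller */
        uid_t uid  = 1000;   /* owner of the program */
        int   cpus = 4;      /* number of CPUs allocated to it */

        /* Create a program description rooted at the current process. */
        if (rms_prgcreate(id, uid, cpus) < 0) {
            perror("rms_prgcreate");
            return 1;
        }

        /* ... create the processes that make up the program ... */

        /* Destroy the description once all of its processes have exited. */
        if (rms_prgdestroy(id) < 0)
            perror("rms_prgdestroy");

        return 0;
    }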
rms_prgids(3) NAME rms_prgids, rms_prginfo, rms_getprgid – Get information on a program or programs SYNOPSIS cc [ flag ... ] file ... #include -lrmscall [ library ... ] int rms_prgids(int maxids, int *ids, int *nids); int rms_prginfo(int id, int maxids, pid_t *pids, int nids); int rms_getprgid(int pid, int *id); PARAMETERS id Program identifier. pid Process identifier. maxids Maximum number of identifiers to be returned. ids Array of program identifiers.
rms_prgids(3)
EINVAL    Count of array elements is invalid.
EFAULT    Array address is invalid.
ENOMEM    Insufficient kernel memory to perform this operation.
ESRCH     Process or program does not exist.
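A minimal sketch, assuming the header name <rms/rmscall.h>, that lists the programs on a node and finds the program of the calling process:

    /* Sketch; <rms/rmscall.h> is an assumption. */
    #include <sys/types.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <rms/rmscall.h>

    #define MAXIDS 64

    int main(void)
    {
        int ids[MAXIDS];
        int nids = 0;
        int id, i;

        /* List the program identifiers active on this node. */
        if (rms_prgids(MAXIDS, ids, &nids) < 0) {
            perror("rms_prgids");
            return 1;
        }
        for (i = 0; i < nids; i++)
            printf("program %d\n", ids[i]);

        /* Find the program, if any, to which the calling process belongs. */
        if (rms_getprgid(getpid(), &id) == 0)
            printf("calling process belongs to program %d\n", id);

        return 0;
    }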
rms_prgsuspend(3) NAME rms_prgsuspend, rms_prgresume, rms_prgsignal – Suspend or resume the processes in a program, deliver a signal to all processes in a program SYNOPSIS cc [ flag ... ] file ... #include -lrmscall [ library ... ] int rms_prgsuspend(int id); int rms_prgresume(int id); int rms_prgsignal(int id, int signo); PARAMETERS id Program identifier. signo Signal number. DESCRIPTION rms_prgsuspend() suspends all of the processes in a program.
rms_prgsuspend(3)
EACCES    Caller is not permitted to perform this operation.
ESRCH     No such program identifier.
EINVAL    Invalid signal number.
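A minimal sketch, assuming the header name <rms/rmscall.h> and an illustrative program identifier, showing suspend, resume and signal delivery:

    /* Sketch; <rms/rmscall.h> and the program identifier are assumptions. */
    #include <stdio.h>
    #include <signal.h>
    #include <unistd.h>
    #include <rms/rmscall.h>

    int main(void)
    {
        int id = 42;   /* illustrative program identifier */

        /* Suspend every process in the program, then resume them. */
        if (rms_prgsuspend(id) < 0)
            perror("rms_prgsuspend");
        sleep(10);
        if (rms_prgresume(id) < 0)
            perror("rms_prgresume");

        /* Deliver SIGTERM to all processes in the program in one operation. */
        if (rms_prgsignal(id, SIGTERM) < 0)
            perror("rms_prgsignal");

        return 0;
    }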
rms_prgaddcap(3) NAME rms_prgaddcap, rms_setcap – Associate Elan capabilities with a program or process SYNOPSIS cc [ flag ... ] file ... #include -lrmscall [ library ... ] int rms_prgaddcap(int id, int index, ELAN_CAPABILITY *cap); int rms_setcap(int index, int context); PARAMETERS id Program identifier. index Index of the capability for this program. cap Pointer to a capability. context Context number for this process.
rms_prgaddcap(3)
EACCES    Caller is not permitted to perform this operation.
ENOMEM    There was insufficient memory to perform this operation.
ESRCH     Program does not exist.
EFAULT    Capability has invalid address.
EINVAL    Invalid context number (rms_setcap() only).
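A minimal sketch, assuming the header name <rms/rmscall.h>; in practice the capability is supplied by RMS rather than constructed locally, so a stub stands in for that step, and the identifier, index and context values are illustrative:

    /* Sketch; <rms/rmscall.h> is an assumption and the capability would
     * normally be supplied by RMS rather than fabricated locally. */
    #include <stdio.h>
    #include <string.h>
    #include <rms/rmscall.h>

    /* Stand-in for obtaining the program's Elan capability from RMS;
     * it always fails in this sketch. */
    static int fetch_capability(ELAN_CAPABILITY *cap)
    {
        memset(cap, 0, sizeof(*cap));
        return -1;
    }

    int main(void)
    {
        int id      = 42;   /* illustrative program identifier */
        int index   = 0;    /* index of the capability for this program */
        int context = 0;    /* illustrative hardware context number */
        ELAN_CAPABILITY cap;

        if (fetch_capability(&cap) < 0)
            return 1;

        /* Associate the capability with the program description ... */
        if (rms_prgaddcap(id, index, &cap) < 0)
            perror("rms_prgaddcap");

        /* ... then select that capability and a context for this process. */
        if (rms_setcap(index, context) < 0)
            perror("rms_setcap");

        return 0;
    }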
rms_ncaps(3) NAME rms_ncaps, rms_getcap – Return information on the Elan capabilities allocated to a process in a parallel program SYNOPSIS cc [ flag ... ] file ... #include -lrmscall [ library ... ] int rms_ncaps(int *ncaps); int rms_getcap(int index, ELAN_CAPABILITY *cap); PARAMETERS ncaps Number of capabilities allocated. index Index of a capability to be returned. cap Pointer to a capability.
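A minimal sketch, assuming the header name <rms/rmscall.h>, that retrieves each capability allocated to the calling process:

    /* Sketch; <rms/rmscall.h> is an assumption. */
    #include <stdio.h>
    #include <rms/rmscall.h>

    int main(void)
    {
        int ncaps = 0;
        int i;
        ELAN_CAPABILITY cap;

        /* How many capabilities have been allocated to this process? */
        if (rms_ncaps(&ncaps) < 0) {
            perror("rms_ncaps");
            return 1;
        }
        printf("%d capabilities allocated\n", ncaps);

        /* Retrieve each capability in turn. */
        for (i = 0; i < ncaps; i++) {
            if (rms_getcap(i, &cap) < 0)
                perror("rms_getcap");
            /* ... hand the capability to the communications library ... */
        }
        return 0;
    }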
rms_prggetstats(3) NAME rms_prggetstats – Return resource usage information for a program SYNOPSIS cc [ flag ... ] file ... #include -lrmscall [ library ... ] int rms_prggetstats(int id, prgstats_t *stats); PARAMETERS id Program identifier. stats Pointer to a program statistics structure. DESCRIPTION rms_prggetstats() returns resource usage information for the processes of a parallel program on the calling node.
rms_prggetstats(3)
The elapsed time statistic etime is the time in millisecs since the program was created. The allocated time statistic atime is the time in millisecs for which CPUs have been allocated, multiplied by the number of CPUs allocated. The utime and stime statistics are summed over the processes that make up the program (on this node). If one or more processes belonging to the program are still running, the flags field will contain the value PRG_RUNNING.
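A minimal sketch, assuming the header name <rms/rmscall.h>, an illustrative program identifier and the statistics fields described above:

    /* Sketch; <rms/rmscall.h>, the program identifier and the exact field
     * types of prgstats_t are assumptions; the field names are those above. */
    #include <stdio.h>
    #include <rms/rmscall.h>

    int main(void)
    {
        int id = 42;         /* illustrative program identifier */
        prgstats_t stats;

        if (rms_prggetstats(id, &stats) < 0) {
            perror("rms_prggetstats");
            return 1;
        }

        printf("elapsed time    %lu ms\n", (unsigned long) stats.etime);
        printf("allocated time  %lu cpu-ms\n", (unsigned long) stats.atime);
        printf("user time       %lu ms\n", (unsigned long) stats.utime);

        if (stats.flags & PRG_RUNNING)
            printf("some processes are still running\n");

        return 0;
    }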
D RMS Application Interface
D.1 Introduction
The RMS application interface is provided so that external scheduling modules can make inquiries about the availability of resources, allocate and deallocate CPUs, and perform job control operations. The interface is provided as a dynamic library, librmsapi.so. Function prototypes are defined in the header file .
rms_allocateResource(3) NAME rms_allocateResource, rms_deallocateResource – Allocate or deallocate a resource SYNOPSIS cc [ flag ... ] file ... #include -lrmsapi -lrms [ library ... ] int rms_allocateResource(char *partition, int cpus, int baseNode, int nodes, uid_t uid, char *project, char *requestFlags); int rms_deallocateResource(int rid); PARAMETERS partition Partition containing the resources. cpus Total number of CPUs to allocate. baseNode ID of the first node to allocate.
rms_allocateResource(3)
rid   ID of the resource to deallocate.
DESCRIPTION
rms_allocateResource() allocates CPUs from a named partition. If partition is NULL, the default partition is used; otherwise, the named partition must exist. You can optionally specify the base node and the number of nodes (as with the allocate and prun commands). Alternatively, this can be left to the scheduler by passing the value RMS_UNASSIGNED.
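A minimal sketch, assuming the header name <rms/rmsapi.h>, that the call returns the resource identifier on success, and illustrative CPU, project and flag values; the program is linked with -lrmsapi -lrms as shown in the synopsis:

    /* Sketch; <rms/rmsapi.h> is an assumption, as is the convention that the
     * call returns the resource identifier; project and flags are illustrative. */
    #include <sys/types.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <rms/rmsapi.h>

    int main(void)
    {
        int rid;

        /* Request 8 CPUs from the default partition (partition == NULL),
         * letting the scheduler place them (RMS_UNASSIGNED). */
        rid = rms_allocateResource(NULL, 8, RMS_UNASSIGNED, RMS_UNASSIGNED,
                                   getuid(), "default", NULL);
        if (rid < 0) {
            fprintf(stderr, "allocation failed\n");
            return 1;
        }

        /* ... start jobs on the resource with rms_run() ... */

        if (rms_deallocateResource(rid) < 0)
            fprintf(stderr, "deallocation failed\n");

        return 0;
    }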
rms_run(3) NAME rms_run – Run a program on an allocated resource SYNOPSIS cc [ flag ... ] file ... #include -lrmsapi -lrms [ library ... ] int rms_run(int rid, char *cmd, char **args, char *jobFlags); PARAMETERS rid Resource id. cmd Command to execute. args Arguments for the command. jobFlags The job flags currently supported are as follows: tag=0 | 1 With a value of 1, this specifies that output from each process should be tagged by the process id.
rms_run(3)
SEE ALSO
rms_allocateResource(3)
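A minimal sketch, assuming the header name <rms/rmsapi.h>, a resource identifier obtained from rms_allocateResource(), and the usual convention that the argument vector starts with the command name:

    /* Sketch; <rms/rmsapi.h>, the resource identifier and the argument
     * vector convention are assumptions. */
    #include <stdio.h>
    #include <rms/rmsapi.h>

    int main(void)
    {
        int rid = 1;                               /* illustrative resource id */
        char *args[] = { "hostname", (char *) 0 }; /* arguments for the command */

        /* Run the command on the allocated resource, tagging each line of
         * output with the id of the process that produced it. */
        if (rms_run(rid, "/bin/hostname", args, "tag=1") < 0)
            fprintf(stderr, "rms_run failed\n");

        return 0;
    }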
rms_suspendResource(3) NAME rms_suspendResource, rms_resumeResource, rms_killResource – Job control operations on allocated resources SYNOPSIS cc [ flag ... ] file ... #include -lrmsapi -lrms [ library ... ] int rms_suspendResource(int rid); int rms_resumeResource(int rid); int rms_killResource(int rid, int signo); PARAMETERS rid ID of the resource. signo Signal to send. DESCRIPTION rms_suspendResource() and rms_resumeResource() suspend and resume a resource specified by rid.
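A minimal sketch, assuming the header name <rms/rmsapi.h> and an illustrative resource identifier:

    /* Sketch; <rms/rmsapi.h> and the resource identifier are assumptions. */
    #include <stdio.h>
    #include <signal.h>
    #include <rms/rmsapi.h>

    int main(void)
    {
        int rid = 1;   /* illustrative resource id */

        /* Suspend the jobs running on the resource, then resume them. */
        if (rms_suspendResource(rid) < 0)
            fprintf(stderr, "suspend failed\n");
        if (rms_resumeResource(rid) < 0)
            fprintf(stderr, "resume failed\n");

        /* Send SIGINT to the jobs running on the resource. */
        if (rms_killResource(rid, SIGINT) < 0)
            fprintf(stderr, "kill failed\n");

        return 0;
    }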
rms_defaultPartition(3) NAME rms_defaultPartition, rms_numCpus, rms_numNodes, rms_freeCpus – Provide information on RMS partitions SYNOPSIS cc [ flag ... ] file ... #include -lrmsapi -lrms [ library ... ] char *rms_defaultPartition(); int rms_numCpus(char *partition); int rms_numNodes(char *partition); int rms_freeCpus(char *partition); PARAMETERS partition Name of an active partition.
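A minimal sketch, assuming the header name <rms/rmsapi.h> and that rms_defaultPartition() returns NULL when no default partition is defined:

    /* Sketch; <rms/rmsapi.h> and the NULL return convention are assumptions. */
    #include <stdio.h>
    #include <rms/rmsapi.h>

    int main(void)
    {
        char *partition = rms_defaultPartition();

        if (partition == NULL) {
            fprintf(stderr, "no default partition\n");
            return 1;
        }

        printf("partition %s: %d CPUs on %d nodes, %d CPUs free\n",
               partition,
               rms_numCpus(partition),
               rms_numNodes(partition),
               rms_freeCpus(partition));

        return 0;
    }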
E Accounting Summary Script E.1 Introduction This appendix describes the example accounting summary script included in /usr/opt/rms/examples/scripts/accounting_summary and referred to in Section 9.4.3. • Section E.2 describes the command line interface. • Section E.3 shows a sample of output from the script. • Section E.4 is a listing of the script. E.
-p     Sort the records by project name and then by user name. This is the default.
-M     Show time in minutes rather than seconds.
-H     Show time in hours rather than seconds.
days   Show statistics for the specified number of days. By default, statistics are shown for the previous day only.
The script processes the arguments passed to it on the command line and generates a SQL query which it passes to rmsquery.
Listing of the Script # # parse the options # while [ $# -gt 0 ]; do option=‘echo $1 | sed "s/ˆ-//"‘ if [ "$option" = "$1" ]; then break fi if [ "$option" = "p" ]; then primary="project" elif [ "$option" = "u" ]; then primary="user" elif [ "$option" = "d" ]; then delete="1" elif [ "$option" = "M" ]; then if [ "$hours" = "1" ]; then echo "$sname: ERROR : -M and -H are mutually exclusive" exit 1 fi minutes="1" elif [ "$option" = "H" ]; then if [ "$minutes" = "1" ]; then echo "$sname: ERROR : -M and -H are mut
Listing of the Script starttime=‘expr $now - $daysecs‘ if [ "$primary" = "project" ]; then primarytitle="Project" secondarytitle="User" querystr="select \ acctstats.project,resources.username, \ acctstats.atime,acctstats.utime, acctstats.stime \ from resources,acctstats \ where acctstats.started > $starttime and resources.name=acctstats.name \ order by acctstats.project,resources.username" else primarytitle="User" secondarytitle="Project" querystr="select \ resources.username,acctstats.project, \ acctstats.
Listing of the Script printf ("\t %-8.8s ", secondary) } } function printdashes() { printf ("---------------------------------------------------------------------\ ----\n") } function printvals(vals, i) { for (i=1; i<=nvalues; i++) { if (hours == 1 || minutes == 1) { printf (" %13.2f", vals[i]) } else { printf (" %13.0f", vals[i]) } } } NF > 0 { if ($1 != primary) { if (primary != "") { printsortfields() printvals(values) printf (" %6d\n", recs) printdashes() printf ("Total %-10.
Listing of the Script printf ("Name Name Secs Secs Secs Sessions\n") } } printdashes() } primary = $1 secondary = $2 for (i=1; i<=nvalues; i++) { values[i] = 0 primvalues[i] = 0 } recs = 0 primrecs = 0 printprim = 1 } else { if ($2 != secondary) { printsortfields() printvals(values) printf (" %6d\n", recs) secondary = $2 for (i=1; i<=nvalues; i++) { values[i] = 0 } recs = 0 } } for (i=1; i
Listing of the Script printf (" %6d\n", primrecs) printdashes() printf ("Grand Total ") printvals(grandvalues) printf (" %6d\n", grandrecs) printdashes() }’ primtitle="$primarytitle" sectitle="$secondarytitle" machine=$machine \ days=$days hours=$hours minutes=$minutes /bin/rm $tmpfile if [ "$delete" ]; then echo "$sname : Deleting accounting statistics records" querystr="delete from acctstats where running=0" $RMSQUERY $querystr if [ $? -ne 0 ]; then echo "$sname : ERROR : $RMSQUERY $querystr FAILED" exit
Glossary
Abbreviations
API     Application Program Interface — specification of interface to software package (library).
CFS     Cluster File System — the file system for Tru64 UNIX clusters.
CGI     Common Gateway Interface — a standard method for generating HTML pages dynamically from an application so that a Web server and a Web browser can exchange information. A CGI script can be written in any language and can access various types of data, for example, a SQL database.
HTML    HyperText Markup Language — a generic markup language, comprising a set of tags, that enables structured documents to be delivered over the World Wide Web and viewed by a browser.
HTTP    HyperText Transfer Protocol — a communications protocol commonly used between a Web server and a Web browser together with a URL (Uniform Resource Locator).
LED     Light-Emitting Diode.
Shmem   A one-sided (put/get) inter-process communication interface used on high-performance parallel systems.
SMP     Symmetric MultiProcessor — a computer whose main memory is shared by more than one processor.
SNMP    Simple Network Management Protocol — a protocol used to monitor and control devices on the Internet.
SQL     Structured Query Language — a database language.
Flit            A communications cycle unit of information.
HTTP cookies    Cookies provide a general mechanism that HTTP server-side connections use to store and to retrieve information on the client side of the connection.
main memory     The memory normally associated with the main processor, that is to say, memory on the CPU’s high speed memory bus.
main processor  The main CPU (or CPUs for a multi-processor) of a node, typically an Alpha 21264.
slice           A local copy of a global object.
switch network  The network constructed from the Elan cards and Elite cards.
thread          An independent sequence of execution. Every host process has at least one thread.
virtual memory  A feature provided by the operating system, in conjunction with the MMU, that provides each process with a private address space that may be larger than the amount of physical memory accessible to the CPU.
Index A C access controls CPU usage, 6-5, 7-2 memory limits, 6-4, 7-3, 7-5 priority, 6-5, 7-2 records, 6-2 system services, 2-5 table, 10-4 accounting record, 2-10, 6-1, 6-6 statistics, 10-4 allocate, 5-3 application node, 2-1 attributes cpu-poll-stats-interval, 5-29 default-priority, 5-29 grace-period, 5-29 node-status-poll-interval, 4-3 pmanager-idletimeout, 5-29 pmanager-queuedepth, 5-28 rms-keep-core, 5-30 rms-poll-interval, 4-3 tables, 10-6 users-to-mail, 8-3 capability, A-1, C-1 commands, 2-5, 5-1
rmsloader, 3-3 Switch Network Manager (swmgr), 4-5 Transaction Log Manager (tlogmgr), 4-5 database, 2-2, 2-6 administration, 5-44 building, 5-35 field names, 10-1 name, 10-1 SQL interface, 2-6 SQL queries, 5-42 tables, 10-2 Database Manager, 4-2 documentation feedback, 1-3 online, 1-3 E Elan, A-1 Elite, A-1 Event Manager, 4-6 eventmgr, 4-6 events, 8-1 handlers, 8-3 mail alerts, 8-3 posting, 8-2 string, 8-1 table, 10-9 waiting, 8-2 G gang scheduling, 7-1 I installed components, 10-12 interactive node, 2-1
P Partition Manager, 4-3 partitions, 2-7, 4-3, 10-17 root, 2-7 scheduling, 2-8 pmanager, 4-3 priority, 7-2 Process Manager, 4-7 project, 2-9, 6-1 default, 6-1 membership, 6-2 specifying, 10-24 table, 10-19 prun, 5-11 R rcontrol, 5-20 resources, 10-19 allocation, 2-7, 5-3 rinfo, 5-32 rms_allocateResource, D-2 rms_deallocateResource, D-2 rms_defaultPartition, D-7 rms_freeCpus, D-7 rms_getcap, C-12 rms_getcorepath, C-3 rms_getprgid, C-6 rms_killResource, D-6 rms_ncaps, C-12 rms_numCpus, D-7 rms_numNodes, D-7
adapters, A-4 barrier synchronization, A-3 boards, 10-23 control interface, 4-5, 10-9 crosspoint switch, A-1 Elan, A-1 Elans, 10-8 Elite, A-1 Elites, 10-9 fat tree network, A-1 layer, A-4 level, A-1 links, A-3 multistage network, A-1 plane, A-1 rail, A-4 reduction, A-3 top switch, A-3 uplinks, A-2 Switch Network Manager, 4-5 swmgr, 4-5 system architecture, 2-1 T tlogmgr, 4-5 Transaction Log Manager, 4-5 transactions, 10-23 U user commands, 2-5 users, 10-24 Index-4