Compaq AlphaServer SC RMS Reference Manual Quadrics Supercomputers World Ltd.
The information supplied in this document is believed to be correct at the time of publication, but no liability is assumed for its use or for the infringements of the rights of others resulting from its use. No license or other rights are granted in respect of any rights owned by any of the organizations mentioned herein. This document may not be copied, in whole or in part, without the prior written consent of Quadrics Supercomputers World Ltd.
Contents

1 Introduction  1-1
    1.1 Scope of Manual  1-1
    1.2 Audience  1-1
    1.3 Using this Manual  1-1
    1.4 Related Information  1-3
    1.5 Location of Online Documentation  1-3
    1.6 Reader's Comments
2 Overview of RMS
    2.4.4 RMS Configuration
3 Parallel Programs Under RMS  3-1
    3.1 Introduction  3-1
    3.2 Resource Requests  3-2
    3.3 Loading and Running Programs  3-3
4 RMS Daemons  4-1
    4.1 Introduction  4-1
    4.1.1 Startup
5 RMS Commands
    nodestatus(1)  5-8
    msqladmin(1)  5-9
    prun(1)  5-11
    rcontrol(1)  5-20
    rinfo(1)  5-32
    rmsbuild(1)
6 Access Control, Usage Limits and Accounting
7 RMS Scheduling
    7.4.5 Idle Time
8 Event Handling  8-1
    8.1 Introduction  8-1
    8.1.1 Posting Events  8-2
    8.1.2 Waiting on Events  8-2
    8.2 Event Handling  8-3
    8.3 List of Events Generated
9 Setting up RMS
10 The RMS Database  10-1
    10.1 Introduction  10-1
    10.1.1 General Information about the Tables  10-1
    10.1.2 Access to the Database  10-2
    10.1.3 Categories of Table  10-2
    10.2 Listing of Tables  10-4
    10.2.1 The Access Controls Table
A Compaq AlphaServer SC Interconnect Terms
    A.1 Introduction  A-1
    A.2 Link States  A-4
    A.3 Link Errors  A-4
B RMS Status Values  B-1
    B.1 Overview  B-1
    B.2 Generic Status Values  B-2
C
    rms_ncaps(3)  C-12
    rms_getcap(3)  C-12
    rms_prggetstats(3)  C-13
D RMS Application Interface  D-1
    D.1 Introduction  D-1
    rms_allocateResource(3)  D-2
    rms_deallocateResource(3)
List of Figures

2.1 A Network of Nodes  2-2
2.2 High Availability RMS Configuration  2-3
2.3 The Database  2-6
2.4 Partitioning a System  2-7
2.5 Distribution of Processes  2-8
2.6 Preemption of Low Priority Jobs
List of Tables

10.1 Access Controls Table  10-4
10.2 Accounting Statistics Table  10-5
10.3 Machine Attributes  10-6
10.4 Performance Statistics Attributes  10-7
10.5 Server Attributes  10-7
10.6 Scheduling Attributes
10.22 Partitions Table  10-18
10.23 Projects Table  10-19
10.24 Resources Tables  10-19
10.25 Servers Table  10-20
10.26 Services Table  10-21
1 Introduction 1.1 Scope of Manual This manual describes the Resource Management System (RMS). The manual’s purpose is to provide a technical overview of the RMS system, its functionality and programmable interfaces. It covers the RMS daemons, client applications, the RMS database, the system call interface to the RMS kernel module and the application program interface to the RMS database. 1.2 Audience This manual is intended for system administrators and developers.
1.3 Using this Manual
Chapter 1 (Introduction) explains the layout of the manual and the conventions used to present information.
Chapter 2 (Overview of RMS) overviews the functions of the RMS and introduces its components.
Chapter 3 (Parallel Programs Under RMS) shows how parallel programs are executed under RMS.
Chapter 4 (RMS Daemons) describes the functionality of the RMS daemons.
Chapter 5 (RMS Commands) describes the RMS commands.
Chapter 6 (Access Control, Usage Limits and Accounting) explains RMS access controls, usage limits and accounting.
1.4 Related Information
The following manuals provide additional information about the RMS from the point of view of either the system administrator or the user:
• Compaq AlphaServer SC User Guide
• Compaq AlphaServer SC System Administration Guide
1.5 Location of Online Documentation
Online documentation in HTML format is installed in the directory /usr/opt/rms/docs/html and can be accessed from a browser at http://rmshost:8081/html/index.html.
Conventions
italic monospace type
    Italic (slanted) monospace type denotes some meta text. This is used most often in command or parameter descriptions to show where a textual value is to be substituted.
italic type
    Italic (slanted) proportional type is used in the text to introduce new terms. It is also used when referring to labels on graphical elements such as buttons.
Ctrl/x
    This symbol indicates that you hold down the Ctrl key while you press another key or mouse button (shown here by x).
2 Overview of RMS 2.1 Introduction This chapter describes the role of the Resource Management System (RMS). The RMS provides tools for the management and use of a Compaq AlphaServer SC system. To put into context the functions that RMS performs, a brief overview of the system architecture is given first in Section 2.2. Section 2.3 outlines the main functions of the RMS and introduces the major components of the RMS: a set of UNIX daemons, a suite of command line utilities and a SQL database.
The interactive nodes of the system are also connected to an external LAN. The application nodes, used for running parallel programs, are accessed through the RMS.
Figure 2.1: A Network of Nodes (showing a QM-S16 switch network and its control network, interactive nodes with a LAN/FDDI interface, application nodes, a terminal concentrator and the management network)
All of the nodes are connected to a management network (normally, a 100 BaseT Ethernet).
Figure 2.2: High Availability RMS Configuration (showing the RMS host, the backup RMS host and the RMS database)
The RMS processes run on the node with the name rmshost, which migrates to the backup on fail-over. The database is held on a shared disk, accessible to both the primary and backup node.
2.3 The Role of the RMS
The RMS provides a single point interface to the system for resource management.
Scheduling
    deciding when and where to run parallel jobs
Audit
    maintaining an audit trail of system state changes
From the user's point of view, RMS provides tools for:
Information
    querying the resources of the system
Execution
    loading and running parallel programs on a given set of resources
Monitoring
    monitoring the execution of parallel programs
The Role of the RMS • The RMS Daemon, rmsd, runs on each node in the system. It loads and runs user processes and monitors resource usage and system performance. The RMS daemons are described in more detail in Chapter 4 (RMS Daemons). 2.3.3 The RMS Commands RMS commands call on the RMS daemons to get information about the system, to distribute work across the system, to monitor the state of programs and, in the case of administrators, to configure the system and back it up.
RMS Management Functions Section 10.2.20). Users have read access to all tables but no write access. Operator and administrative applications are granted limited write access. Password-protected administrative applications and RMS itself have full read/write access. The RMS commands are described in more detail in Chapter 5 (RMS Commands). 2.3.4 The RMS Database The database provides a platform-independent interface to the RMS system.
2.4 RMS Management Functions
The RMS gives the system administrator control over how the resources of a system are assigned to the tasks it must perform. This includes the allocation of resources (Section 2.4.1), scheduling policies (Section 2.4.2), access controls and accounting (Section 2.4.3) and system configuration (Section 2.4.4).
2.4.1 Allocating Resources
The nodes in an RMS system can be configured into mutually exclusive sets known as partitions as shown in Figure 2.4.
A further partition, the root partition, is always present. It includes all nodes. It does not have a scheduler. The root partition can only be used by administrative users (root and rms by default).
2.4.2 Scheduling
Partitions enable different scheduling policies to be put into action. On each partition, one or more of three scheduling policies can be deployed to suit the intended usage:
The RMS scheduler allocates contiguous ranges of nodes with a given number of CPUs per node. Where possible, each resource request is met by allocating a single range of nodes. If this is not possible, unconstrained requests (those that only specify the number of CPUs required) may be satisfied by allocating CPUs on disjoint nodes. This ensures that an unconstrained resource request can utilize all of the available CPUs. The scheduler attempts to find free CPUs for each request.
Access controls, usage limits and accounting are described in more detail in Chapter 6 (Access Control, Usage Limits and Accounting). Each partition, except the root partition, is managed by a Partition Manager (see Section 4.4), which mediates user requests, checking access permissions and usage limits before scheduling CPUs and starting user jobs. An accounting record is created as CPUs are allocated to each request. It is updated periodically until the resources are freed.
3 Parallel Programs Under RMS 3.1 Introduction RMS provides users with tools for running parallel programs and monitoring their execution, as described in Chapter 5 (RMS Commands). Users can determine what resources are available to them and request allocation of the CPUs and memory required to run their programs. This chapter describes the structure of parallel programs under RMS and how they are run.
3.2 Resource Requests
Having logged into the system, a user makes a request for the resources needed to run a parallel program by using the RMS commands prun (see Page 5-11) or allocate (see Page 5-3).
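For example, a user might request eight processes spread over four nodes as follows (the program name and option values are purely illustrative):

$ prun -n 8 -N 4 ./myprog

Unless immediate mode is requested, the request is queued until sufficient CPUs are free.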
Loading and Running Programs The resource request is sent to the Partition Manager, pmanager (described in Section 4.4). The Partition Manager performs access checks (described in Chapter 6 (Access Control, Usage Limits and Accounting)) and then allocates CPUs according to the policies established for the partition (see Chapter 7 (RMS Scheduling)). RMS makes a distinction between allocating resources and starting jobs on them.
When the job completes, RMS cleans up its processes, removing any core files if requested (see Page 5-11) and then deallocating the CPUs. The application processes are run from the user's current working directory with the current limits and group rights. The data and stack size limits may be reduced if RMS has applied a memory limit to the program. During execution, the processes may be suspended at any time by the scheduler to allow a program with higher priority to run.
Sometimes, it is desirable for a user to be granted more control over the use of a resource. For instance, the user may want to run several jobs concurrently or use the same nodes for a sequence of jobs. This functionality is supported by the command allocate (see Page 5-3) which allows a user to allocate CPUs in a parallel partition to a UNIX shell. These CPUs are used for subsequent parallel jobs started from this shell.
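An illustrative session of this kind might look as follows (the partition name, node count and program names are examples only):

$ allocate -N 2 -p parallel
$ prun ./setup
$ prun ./compute
$ exit

Both prun jobs run on the same two nodes; exiting the shell releases the allocated CPUs.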
4 RMS Daemons
4.1 Introduction
This chapter describes the role of the RMS daemons. There are daemons that run on the rmshost node providing services for the system as a whole:
msqld
    Manages the database (see Section 4.2).
mmanager
    Monitors the health of the machine as a whole (see Section 4.3).
pmanager
    Controls the use of resources (see Section 4.4).
swmgr
    Monitors the health of the Compaq AlphaServer SC Interconnect (see Section 4.5).
4.1.1 Startup
RMS is started as each node executes the initialization script /sbin/init.d/rms with the start argument on startup. This starts the rmsmhd daemon which, in turn, starts the other daemons on that node. The daemons can also be started, stopped and reloaded individually by rcontrol once RMS is running. See Page 5-20 for details.
4.1.2 Log Files
Output from the management daemons is logged to the directory /var/rms/adm/log. The log files are named after the daemon that generates them.
The Partition Manager 4.3 The Machine Manager The Machine Manager, mmanager, is responsible for detecting and reporting changes in the state of each node in the system. It records the current state of each node and any changes in state in the database. When a node is functioning correctly, rmsd, a daemon which runs on each node, periodically updates the database. However, if the node crashes, or IP traffic to and from the node stops, then these updates stop.
The Partition Manager The Partition Manager makes new scheduling decisions periodically and in response to incoming resource requests (see Chapter 7 (RMS Scheduling) for details). These decisions may result in jobs being suspended or resumed. Such scheduling operations, together with those performed as jobs are killed, are performed by the Partition Manager sending scheduling or signal delivery requests to the rmsds. The Partition Manager is connected to its rmsds by a tree of sockets.
The Transaction Log Manager Configuration information about each partition is held in the partitions table (see Section 10.2.16). The information is indexed by the name of the partition together with the name of the active configuration. 4.5 The Switch Network Manager The Switch Network Manager, swmgr, controls and monitors the Compaq AlphaServer SC Interconnect (see Appendix A (Compaq AlphaServer SC Interconnect Terms)).
The Process Manager Each entry in the services table specifies which command to run, who can run it and on which host. 4.6.1 Interaction with the Database The Transaction Log Manager maintains the transactions table (see Section 10.2.23). It consults the services table (see Section 10.2.20) in order to execute transactions on behalf of its clients. 4.
4.8 The Process Manager
The Process Manager, rmsmhd, runs on each node and is responsible for starting, stopping and managing the other RMS daemons that run on its node. It starts them as the node boots, stops them as the node halts and starts or stops them in response to requests from the RMS client application rcontrol (see Page 5-20).
The rmsds communicate with each other and with the Partition Manager that controls their node over a balanced tree of sockets. Requests (for example, to deliver a signal to all processes in a parallel program) are passed down this tree to the appropriate range of nodes. The results of each request are combined as they pass back up the tree. rmsd is started by the Process Manager, rmsmhd, and restarted when it exits – this happens when a partition is shut down.
5 RMS Commands
5.1 Introduction
This chapter describes the RMS commands. RMS includes utilities that enable system administrators to configure and manage the system, in addition to those that enable users to run their programs. RMS includes the following commands intended for use by system administrators:
rcontrol
    The rcontrol command is used to control the system resources.
rmsbuild
    The rmsbuild command creates and populates an RMS database for a given machine.
rmshost
    The rmshost command reports the name of the node running the RMS management daemons.
msqladmin
    The msqladmin command is used for creating and deleting databases and stopping the mSQL server.
RMS includes the following commands for all users of the system:
allocate
    The allocate command is used to reserve access to a set of CPUs either for running multiple tasks in parallel or for running a sequence of commands on the same CPUs.
allocate(1) NAME allocate – Reserves access to CPUs SYNOPSIS allocate [-hIv] [-B base] [-C CPUs] [-N nodes | all] [-n CPUs] [-p partition] [-P project] [-R request] [script [args ...]] OPTIONS -B base Specifies the number of the base node (the first node to use) in the partition. Numbering within the partition starts at 0. By default, the base node is unassigned, leaving the scheduler free to select nodes that are not in use. -C CPUs Specifies the number of CPUs required per node (default 1).
allocate(1) immediate=0 | 1 With a value of 1, this specifies that the request should fail if it cannot be met immediately (this is the same as the -I option). hwbcast=0 | 1 With a value of 1, this specifies a contiguous range of nodes and constrains the scheduler to queue the request until a contiguous range becomes available. rails=n In a multirail system, this specifies the number of rails required, where 1 ≤ n ≤ 32.
The -R option can be used with hwbcast set to 1 to ensure that the range of nodes allocated is contiguous. Before allocating resources, the Partition Manager checks the resource limits imposed on the current project. The project can be specified explicitly with the -P option. This overrides the value of the environment variable RMS_PROJECT or any default setting in the users table. (See Section 10.2.24). The script argument (with optional arguments) can be used in two different ways, as follows:
allocate(1) RMS_TIMELIMIT Specifies the execution time limit in seconds. The program will be signaled either after this time has elapsed or after any time limit imposed by the system has elapsed. The shorter of the two time limits is used. RMS_DEBUG Specifies whether to execute in verbose mode and display diagnostic messages. Setting a value of 1 or more will generate additional information that may be useful in diagnosing problems. (See Section 9.6).
allocate(1) argument, it is interpreted as -I and the user is warned that this feature should not be used anymore.
nodestatus(1) NAME nodestatus – Gets or sets the status or run level of each node SYNOPSIS nodestatus [-bhr] [status] OPTIONS -b Operate in the background. -h Display the list of options. -r Get/set run level. DESCRIPTION The nodestatus command is used to update status information in the RMS database as nodes are booted or halted. When run without arguments, nodestatus gets the status of the node on which it is running from the Machine Manager.
msqladmin(1) NAME msqladmin – Perform administrative operations on the mSQL database server SYNOPSIS msqladmin [-q] [-f confFile] [-h host] command OPTIONS -f confFile Specify a non-default configuration file to be loaded. The default action is to load the standard configuration file located in /var/rms/msql.conf. -h host Specify a remote hostname or IP address on which the mSQL server (msql2d) is running.
stats
    Displays server statistics.
Most administrative functions can only be executed by the user specified in the run-time configuration as the admin user (rms). They can also only be executed from the host on which the server process is running (for example you cannot shut down a remote server process).
EXAMPLES
# msqladmin version
Version Details :
    msqladmin version      2.0.11
    mSQL server version    2.0.11
    mSQL protocol version  23
    mSQL connection        Localhost via UNIX socket
    Target platform        OSF1-V5.
prun(1) NAME prun – Runs a parallel program SYNOPSIS prun [-hIOrstv] [-B base] [-c cpus] [-e mode] [-i mode] [-o mode] [-N nodes | all] [-n procs] [-m block | cyclic] [-P project] [-p partition] [-R request] program [args ...] OPTIONS -B base Specifies the number of the base node (the first node to use) in the partition. Numbering within the partition starts at 0. By default, the base node is unassigned, leaving the scheduler free to select nodes that are not in use.
prun(1) -n procs Specifies the number of processes required. The -n and -N options can be combined to control how processes are distributed over nodes. If neither is specified, prun starts one process. -O Allows resources to be over-committed. Set this flag to run more than one process per CPU. -P project Specifies the name of the project with which the job should be associated for scheduling and accounting purposes. -p partition Specifies the partition on which to run the program.
DESCRIPTION
The prun program executes multiple copies of the specified program on a partition. prun automatically requests resources for the program unless it is executed from a shell that already has resources allocated to it. (See Page 5-3). The way in which processes are allocated to CPUs is controlled by the -c, -n, -p, -B and -N options. The -n option specifies the total number of processes to run. The -c option specifies the number of CPUs required per process; this defaults to 1.
prun(1) Before allocating resources, prun checks the resource limits imposed on the current project. The project can be specified explicitly with the -P option. This overrides the value of the environment variable RMS_PROJECT or any default setting in the users table. (See Section 10.2.24). By default, when running a parallel program, prun forwards standard input to the process with an identifier of 0. The -i option requests a different mode of operation.
none
    Do not redirect standard output (or standard error) from any process.
file
    prun opens the named file for output and associates it with the standard output (standard error) stream so that each process writes standard output (standard error) to the file.
file.%
    prun expands the % character to generate and open for output a separate file name for each process: process 0 writes standard output (standard error) to file.0, process 1 writes to file.1 and so on.
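For instance, the following illustrative command uses the file.% form so that each of the four processes writes its standard output to its own file:

$ prun -n 4 -o out.% ./myprog

This creates out.0, out.1, out.2 and out.3, one file per process.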
prun(1) ENVIRONMENT VARIABLES The following environment variables may be used to identify resource requirements and modes of operation to prun. These environment variables are used where no equivalent command line options are given: RMS_IMMEDIATE Controls whether to exit rather than block if resources are not immediately available. The -I option overrides the value of this environment variable. By default, prun blocks until resources become available. Root resource requests are always met.
prun(1) RMS_STDOUTMODE Specifies the mode for redirecting standard output from a parallel program. The -o option overrides the value of this environment variable. Values for mode are the same as those used with the -o option. RMS_STDERRMODE Specifies the mode for redirecting standard error from a parallel program. The -e option overrides the value of this environment variable. Values for mode are the same as those used with the -e option.
$ prun -n 4 -N 2 hostname
atlas0.quadrics.com
atlas0.quadrics.com
atlas1.quadrics.com
atlas1.quadrics.com
$ prun -n 4 -N 4 hostname
atlas1.quadrics.com
atlas3.quadrics.com
atlas0.quadrics.com
atlas2.quadrics.com
The -m option controls how processes are distributed over nodes. It is used in the following example in conjunction with the -t option which tags each line of output with the identifier of the process that wrote it.
$ prun -t -n 4 -N 2 -m block hostname
0 atlas0.quadrics.com
1 atlas0.quadrics.com
2 atlas1.quadrics.com
3 atlas1.quadrics.com
$ prun -t -n 4 -N 2 -m cyclic hostname
0 atlas0.quadrics.com
2 atlas0.quadrics.com
1 atlas1.quadrics.com
3 atlas1.quadrics.com
0:  1 bytes   3.60 uSec   0.28 MB/s
0:  2 bytes   3.53 uSec   0.57 MB/s
0:  4 bytes   2.44 uSec   1.64 MB/s
0:  8 bytes   2.47 uSec   3.23 MB/s
0: 16 bytes   2.54 uSec   6.29 MB/s
0: 32 bytes   2.57 uSec  12.46 MB/s
Elapsed time    1.00 secs    Allocated time  1.99 secs
User time       0.93 secs    System time     0.13 secs
Cpus used       2
Note that the allocated time (in CPU seconds) is twice the elapsed time (in seconds) because two CPUs were allocated.
WARNINGS
In earlier versions, the -i option specified immediate mode.
rcontrol(1) NAME rcontrol – Controls use of system resources SYNOPSIS rcontrol command [args ...] [-ehs] [-r level] [command args ...] OPTIONS -e Exit on the first error. -h Display the list of options. -r level Set reporting level. -s Stop and print warning on error. command is specified as follows: create object [=] name [configuration=val] [partition=val] [attr=val] object may be one of: access_control, attribute, configuration, node, partition, project, user.
rcontrol(1) start object [=] name object may be one of: configuration, partition, server. stop object [=] name [option [=] kill | wait] object may be one of: configuration, partition, server. If server is specified as the object, no option should be given. reload object [=] name [debug [=] value] object may be one of: partition, server. suspend job [=] name [name ...] job may be one of: resource, batchid. suspend attribute [=] value [attribute [=] value ...] Attributes of the same name are ORed together.
rcontrol(1) set attribute [=] name val [=] value exit help [all | command] show object [=] name object may be one of: nodes, configuration, partition. DESCRIPTION rcontrol is used to manage the following: nodes, partitions and configurations; servers; users and their resource requests, projects and access controls; system attributes. rcontrol can create, start, stop and remove a configuration or partition. It can create, remove and set the attributes of nodes and configure them in and out of the machine.
# rcontrol configure in nodes = 'atlas[1-3]'
# rcontrol configure in nodes 'atlas[1-3]'
Creating and Removing Nodes
To create a new node description, use rcontrol with the create command and the argument node followed by the hostname of the node. Additional attribute-value pairs specify properties of the node, such as its type and position. The attributes rack and unit specify the position of the node in the system.
The timelimit attribute specifies the maximum time in seconds for which CPUs can be allocated on the partition. On expiry of the time limit, jobs will be sent the signal SIGXCPU. If they have not exited within a grace period, they will be killed. The grace period for a site is defined in the attributes table (attribute name grace-period). Its default value is 60 seconds.
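As an illustrative sketch, a partition with a one-hour time limit might be created by passing timelimit as one of the attribute-value pairs (partition name, configuration name and node list are examples only):

# rcontrol create partition = par1 configuration = day nodes = 'atlas[0-15]' timelimit = 3600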
rcontrol(1) To stop a partition in the active configuration, use rcontrol with the stop command and the partition argument followed by the name of the partition. To stop all of the partitions in the active configuration, use rcontrol with the stop command and the configuration argument followed by the name of the configuration. When stopping partitions you can optionally specify what should happen to the running jobs. The options are to leave them running, to wait for them to exit or to kill them.
To start an RMS server, use rcontrol with the start command, the server argument and the name of the server. The command rinfo (with the -s flag) can be used to show the status of the RMS servers. To instruct an RMS server to change its reporting level, use the reload command and the server argument with the name of the server. In addition, you should specify the attribute debug and a value. RMS servers write their log files to the directory /var/rms/adm/log on the rmshost. See Section 9.6.
rcontrol(1) # rcontrol set resource = 32 priority = 25 # rcontrol set batchid = 48 priority = 40 rcontrol can also be used to suspend, kill or resume jobs identified by their attributes. The attributes that can be specified are: partition, project, status and user. Attributes of the same name are ORed together, attributes with different names are ANDed.
rcontrol(1) Note that a user can be in more than one project in which case the value would be a comma-separated list: # rcontrol set user = frank projects = parallax,science To create an access control called, for example, science, in the par1 partition, use rcontrol with the create command followed by the type of the object, its name and the name of the partition. Additional attribute-value pairs specify attributes of the access control, for example, its class.
The attribute pmanager-queuedepth limits the number of resource requests that a Partition Manager will handle at any time. If the attribute is undefined or set to NULL or 0, no limit is imposed. By default, it is set to 0. If a limit is set and reached, subsequent resource requests by prun will block or, if the immediate option to prun is set, fail. The blocked requests will not appear in the RMS database. To set the pmanager-queuedepth attribute, use rcontrol with the set command.
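For example, assuming the attribute already exists in the attributes table, an illustrative queue depth of 10 could be set with:

# rcontrol set attribute = pmanager-queuedepth val = 10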
rcontrol(1) The attribute cpu-poll-stats-interval specifies the interval between successive polls for gathering node statistics. The interval is specified in seconds and must be in the range 0 to 86400 (1 day). The attribute rms-keep-core determines whether core files are deleted or saved. By default, it is set to 1 so that core files are saved. Change this to 0 to delete core files. The attribute local-corepath specifies the directory in which core files are saved. By default, it is set to /local/core/rms.
# rcontrol kill resource = 2212 2213
# rcontrol kill batchid = 44 45
To instruct a Partition Manager to reread the user, projects and access_controls tables:
# rcontrol reload partition = par1
To enable debug reporting from the RMS scheduler for the partition called par1:
# rcontrol reload partition = par1 debug = 41
rinfo(1) NAME rinfo – Displays resource usage and availability information for parallel jobs SYNOPSIS rinfo [-chjlmnpqr] [-L [partition] [statistic]] [-s daemon [hostname] | all] [-t node | name] OPTIONS -c List the configuration names. -h Display the list of options. -j List current jobs. -l Give more detailed information. -m Show the machine name. -n Show the status of each node. This can be combined with -l.
rinfo(1) -t node | name Where node is the network identifier of a node, rinfo translates it into the hostname; where name is a hostname, rinfo translates it into the network identifier. See Section A.1 for more information on network identifiers. DESCRIPTION The rinfo program displays information about resource usage and availability. Its default output is in four parts that identify: the machine, the active configuration, resource requests and the current jobs.
EXAMPLES
When used with the -q flag, rinfo prints information on the user's projects, CPU usage limits, memory limits and priorities.
$ rinfo -q
PARTITION  CLASS    NAME       CPUS   MEMLIMIT  PRIORITY
parallel   project  default    0/8    100       0
parallel   project  divisionA  16/64  none      1
In this example, the access controls allow any user to run jobs on up to 8 CPUs with a memory limit of 100MB. Jobs submitted for the divisionA project run at priority 1, have no memory limit and can use up to 64 CPUs.
rmsbuild(1) NAME rmsbuild – Creates and populates an RMS database SYNOPSIS rmsbuild [-dhv] [-I list] [-m machine] [-n nodes | -N list] [-p ports] [-t type] OPTIONS -d Create a demonstration database. -h Display the list of options. -I list Specifies the names of any interactive nodes. -m machine Specifies a name for the machine. -n nodes Specifies the number of nodes in the machine. -N list Specifies the nodes in the machine by name.
Detailed information about each node (number of CPUs, amount of memory and so on) is added later by rmsd as it starts on each node. The machine name is specified with the -m option. Machines should be given a short name that does not end in a digit. Node names are generated by appending a number to the machine name. Database entries for the nodes are generated by the -n or -N options. Use -n with a number to generate entries for nodes 0 through n-1.
rmsctl(1) NAME rmsctl – Stops, starts or shows the status of the RMS system. SYNOPSIS rmsctl [-aehv] [start | stop | restart | show] OPTIONS -a Show all servers, when used with the show command. -e Only show errors, when used with the show command. -h Display the list of options. -v Verbose operation DESCRIPTION The rmsctl script is used to start, stop or restart the RMS system on all nodes in a machine, and to show status information. rmsctl starts and stops RMS by executing the /sbin/init.
RMS service stopped on atlas0
RMS service stopped on atlas3
RMS service stopped on atlas2
RMS service stopped on atlasms
To start the RMS system, use rmsctl as follows:
# rmsctl start
RMS service started on atlas0
RMS service started on atlas1
RMS service started on atlasms
RMS service started on atlas2
RMS service started on atlas3
pmanager-parallel: cpus=16 (4 per node) maxfree=4096MB swap=5171MB no memory limits
pstartup.OSF1: general partition parallel starting
pstartup.
rmsexec(1) NAME rmsexec – Runs a sequential program on a lightly loaded node SYNOPSIS rmsexec [-hv] [-p partition] [-s stat] [hostname] program [args ...] OPTIONS -h Display the list of options. -v Specifies verbose operation. -p partition Specifies the target partition. The request will fail if load-balancing is not enabled on the partition. (See Section 10.2.16). -s stat Specifies the statistic on which to base the load-balancing calculation (see below).
freemem
    Free memory in megabytes.
users
    Lowest number of users.
By default, usercpu is used as the statistic.
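For example, the following illustrative command runs a program on whichever node of a load-balanced partition (here called login) currently has the most free memory:

$ rmsexec -p login -s freemem ./myprog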
rmshost(1) NAME rmshost – Prints the name of the node running the RMS management daemons SYNOPSIS rmshost [-hl] OPTIONS -h Display the list of options. -l Prints the fully qualified domain name. DESCRIPTION The rmshost command prints the name of the node that is running (or should run) the RMS management daemons. It is used by the RMS system.
rmsquery(1) NAME rmsquery – Submits SQL queries to the RMS database SYNOPSIS rmsquery [-huv] [-d name] [-m machine] [SQLquery] OPTIONS -d name Select database by name. -h Display the list of options. -m machine Select database by machine name. -u Print dates as seconds since January 1st 1970. The default is to print dates as a string created with localtime(3). -v Verbosely prints field names above each column of output. DESCRIPTION rmsquery is used to submit SQL queries to the RMS database.
The source is provided in /usr/opt/rms/src. Details of the SQL language can be found on the Quadrics support page http://www.quadrics.com/web/support.
EXAMPLES
An example follows of a select statement that results in a list of the names of all of the nodes in the machine. Note that the query must be quoted. This is because rmsquery expects a single argument.
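Such a query might look like the following sketch (it assumes the nodes table has a name field, as described in Chapter 10; the node names shown are illustrative):

$ rmsquery "select name from nodes"
atlas0
atlas1
atlas2
atlas3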
rmstbladm(1) NAME rmstbladm – Database administration SYNOPSIS rmstbladm [-BcdDfhmuv] [-r file] [-t table] [machine] OPTIONS -B Dump the first five rows of each table to stdout as a sequence of SQL statements. A specific table can be dumped if the -t option is used. -c Clean out old entries from the node statistics (node_stats) table, the resources table, the events table and the jobs table. (See Chapter 10 (The RMS Database).
DESCRIPTION
The command rmstbladm is used to administer the RMS database. It creates the tables and their default entries. It can be used to back up individual tables (or the whole database) to a text file, to restore tables from file or to force the recreation of tables. Unless a specific machine is specified, rmstbladm operates on the database of the host machine.
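As an illustrative sketch, a single table might be dumped to a file and later restored, assuming that -d dumps a table as SQL statements (as in the archiving example in Chapter 9, Setting up RMS) and that -r restores from a file:

# rmstbladm -d -t jobs > jobs.sql
# rmstbladm -r jobs.sql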
6 Access Control, Usage Limits and Accounting 6.1 Introduction RMS access controls and usage limits operate on a per-user or per-project basis (a project is a list of named users). Each partition may have its own controls. This mechanism allows system administrators to control the way in which the resources of a machine are allocated amongst the user community. RMS accounts for resource usage by user and by project.
When submitting requests for CPUs, users can select any project of which they are a member (by setting the RMS_PROJECT environment variable or by using the -P flag when executing prun or allocate). RMS rejects requests to use projects that do not exist or requests to use projects of which the user is not a member. Users without an RMS user record are subject to the constraints on the default project.
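For example, a user who is a member of an illustrative project called science could submit work against it in either of the following ways:

$ prun -P science -n 8 ./myprog
$ RMS_PROJECT=science prun -n 8 ./myprog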
Access Controls The access controls for individual users must set lower limits than those of the projects of which they are a member. That is to say, they must have a lower priority, smaller number of CPUs, smaller memory limit and so on than the access control record for the project. Where a memory limit exists for a user or project, it takes precedence over any default limit set on the partition (see Section 10.2.16). When the system is installed, there are no access control records.
rcontrol create access_control = design class = project partition = \
    parallel priority = 5
rcontrol create access_control = default class = project partition = \
    parallel priority = 0 memlimit = 256

name     class    partition  priority  maxcpus  memlimit
design   project  parallel   5         Null     Null
default  project  parallel   0         Null     256

Requests submitted by Jim, Mary and John run at priority 5, causing other users' jobs to be suspended if running. These requests are not subject to CPU or memory limits.
Memory limits are applied per CPU: a process with a single CPU allocated has its memory limits set to this value. A process with more than one CPU allocated has proportionately higher memory limits. The RMS_MEMLIMIT environment variable can be used to reduce the memory limit set by the system, but not to raise it. By default, the memory limit is capped by the minimum value for any node in the partition of the smaller of these two amounts:
1. The amount of memory on the node.
2. The amount of swap space.
The CPU usage limit that applies to a request is determined as follows:
1. No CPU usage limits are set on jobs run by the root user.
2. If the user has an access control record for the partition, the CPU usage limit is determined by the maxcpus field in this record.
3. The access control record for the user's current project determines the CPU usage limit.
4. The access control record for the default project determines the CPU usage limit.
CPU usage limits can be set to a higher value than the actual number of CPUs available in the partition.
Accounting records are updated periodically until the CPUs are deallocated. The running flag is set to 0 at this point. The atime statistic is summed over all CPUs allocated to the resource request. The utime and stime statistics are accumulated over all processes in all jobs running on the allocated CPUs.
Note
The memint statistics are not implemented in the current release. All values for these fields are 0.
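These records can be inspected directly with rmsquery. The following sketch assumes the accounting statistics table is named acctstats (the table name is not spelled out in this chapter) and selects the records for requests whose CPUs are still allocated:

$ rmsquery "select * from acctstats where running = 1"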
7 RMS Scheduling
7.1 Introduction
The Partition Manager (see Section 4.4) is responsible for scheduling resource requests and enforcing usage limits. This chapter describes the RMS scheduling policies and explains how the Partition Manager responds to resource requests.
7.2 Scheduling Policies
The scheduling policy in use on a partition is controlled by the type attribute of the partition. The type attribute can take one of four values:
login
    Normal UNIX time-sharing applies.
Parallel programs are gang scheduled: all of the processes in a program are scheduled together. That is to say, all of the processes in a program are either running or suspended at the same time. Gang scheduling is required for tightly coupled parallel programs which communicate frequently. It becomes increasingly important as the rate of interprocess communication increases. For example, if a program is executing a barrier synchronization, all processes must be scheduled before the barrier completes.
What Happens When a Request is Received Time Limit Jobs are normally run to completion or until they are preempted by a higher priority request. Each partition may have a time limit associated with it which restricts the amount of time the Partition Manager may allow for a parallel job. On expiry of this time limit, the job is sent a SIGXCPU signal. A period of grace is allowed following this signal for the job to clean up and exit. After this period, the job is killed and the resource deallocated.
What Happens When a Request is Received immediate The request should fail rather than block if resources are not available immediately. Note The RMS scheduler attempts to allocate CPUs on a contiguous range of nodes. If a contiguous range of nodes is not available then requests that explicitly specify a contiguous range with the hwbcast parameter will block if the requested CPUs cannot be allocated.
What Happens When a Request is Received 7.4.1 Memory Limits If memory limits are enabled (by setting the memlimit attribute of a partition or access control) then a request is only allocated CPUs on nodes that have sufficient memory available. RMS enforces memory limits by setting the data and stack size limits on a process. If the process exceeds the allowed size, it is killed (and the parallel program terminated).
Processes (including those belonging to the system) will be killed if the system runs out of swap space.
7.4.3 Time Slicing
Time slicing is enabled on a partition by setting its timeslice attribute; values of 15–120 seconds are recommended. If a timeslice is set, the Partition Manager evaluates the list of requests periodically.
8 Event Handling 8.1 Introduction RMS includes a general mechanism for posting, waiting on and handling events. This functionality is provided by the Event Manager, eventmgr (see Section 4.7). Events are specified by RMS class, name, type and description.
$ rmsquery -v "select * from events order by ctime"
id  name    class  type    ctime              handled  description
--------------------------------------------------------------
20  atlas0  node   status  05/04/01 15:53:02  1        running
21  atlas0  node   status  05/05/01 11:27:29  1        not responding
8.1.1 Posting Events
Events are normally posted by RMS servers but they can also be generated by the command line utility rmspost. This is useful for testing the response of the system to rare events.
For example, the pattern ::: in which the class, name and type are all empty matches node:atlas0:status. Note that the class, name, type and description must all be specified when posting events but one or more of the class, name and type can be null when waiting on events.
8.2 Event Handling
Event handler scripts are specified in the event_handlers table.
program=`basename $0`
id=$1
class=$2
name=$3
type=$4
description=$5
#
# format event description message
#
message()
{
    echo "`date '+%h %e %X'` OSF1 event $id $type $class $name $description"
}
#
# log the event
#
message >> /var/rms/adm/log/event.log
#
# execute OSF1 specific handler
#
/usr/opt/srasysman/bin/checkout.exp -I -R -i $id -c $class -n $name -t $type -d $description
8.3 List of Events Generated
class = module, type = temphigh (DS20, ES40, QM-S16, QM-S128)
    The description contains a temperature report of the form ambient=value. If the temperature exceeds the threshold value, the event type is temphigh and the description contains the above report and, in addition, the words threshold exceeded. In the event of multiple failures, the reports are concatenated.
class = module, type = psu
    The name field contains the name of the module.
submitted    transaction submitted
started      transaction being executed
complete     transaction completed successfully
failed       transaction failed to execute
error        transaction completed but there were errors
In the case of a transaction completing with errors (a link error test or boundary scan, for example), details of the failures are added to the transaction outputs table.
9 Setting up RMS 9.1 Introduction This chapter describes how to set up RMS and carry out routine operations. The information is organized as follows: • Planning the installation (see Section 9.2). • Starting RMS and configuring the system (see Section 9.3). • Carrying out day-to-day operations and establishing backup and archive procedures (see Section 9.4). • Customizing RMS (see Section 9.5). • Dealing with log files (see Section 9.6). 9.
Setting up RMS • Is the machine primarily for running parallel jobs or do you expect a significant workload from sequential jobs? • Will some of your users have jobs that consume all of the resources of the system for extended periods of time? If so, are you happy for other users to wait until the machine is available or do they need access to resources of their own? • How do you wish to process the accounting data? The answers to these questions should help you to determine how to configure the system.
for this command to work correctly. This should have been enabled as part of the installation.
# rmsctl start
Configure all of the nodes into the machine using rcontrol.
# rcontrol configure in 'atlas[0-63]'
Use rinfo with the -n option to check the status of the nodes. The output should show that all of the nodes are running.
# rinfo -n
running atlas[0-63], atlasms
If any of the nodes show a status other than running, restart them by running /sbin/init.d/rms on the nodes in question.
Once RMS is running on all of the nodes, you set up a single partition as follows:
# rcontrol create partition=parallel configuration=day nodes='atlas[0-63]'
# rcontrol start partition=parallel
You should now be able to run a parallel program across all 64 nodes, for example:
# prun -N64 hostname
...
# prun -N64 dping 0 32
...
9.3.3 Simple Day/Night Setup
In this example, the system is set up with two operating configurations: one called day and the other called night.
Day-to-Day Operation Note In the current release, any requests that are suspended when a partition is stopped must be resumed manually if the partition is restarted. 9.4 Day-to-Day Operation Once the system is up and running, give some thought to automating some routine or day-to-day operations: • Periodic shift changes • Backing up the database • Summarizing accounting data • Archiving data • Database maintenance You may also want to configure nodes out of the system in the event of failures. 9.4.
Day-to-Day Operation 9.4.3 Summarizing Accounting Data Accounting records accumulate in the RMS database as each job is run. By default, they are not processed as each site has its own requirements in this respect. A simple example script to produce a summary of resource usage is included in the release in /usr/opt/rms/examples/scripts/accounting_summary. See Appendix E (Accounting Summary Script) for a listing. The script produces the following output.
Day-to-Day Operation The data can be archived as a sequence of SQL statements using rmstbladm. The following example archives data from the node statistics (node_stats) table (see Section 10.2.15): $ rmstbladm -d -t node_stats > nodestats.
Old data is cleared out of the database by instructing the table administration program, rmstbladm, to remove old entries. Before running rmstbladm, archive any data you want to keep as described in Section 9.4.4. Remove old entries as follows:
# rmstbladm -c
rmstbladm clears out all entries that are older than a specified lifetime. The lifetime for job data and the lifetime for statistical data are specified in the attributes table (see Section 10.2.3).
Day-to-Day Operation 2. Change to the directory that contains the database, for example: # cd /var/rms/msqldb/rms_atlas Delete the following files: node_stats.dat, node_stats.def, node_stats.idx and node_stats.ofl. # rm node_stats.* 3. Restart the database server, as follows: # /sbin/init.d/msqld start MSQL: daemon started 4. Create a new node statistics table, as follows: # rmstbladm -u After this, rmstbladm should succeed in cleaning out old entries. 9.4.
Local Customization of RMS # rcontrol configure in node=atlas2 3. Restart the partition: # rcontrol start partition=parallel This brings the partition back up to its full complement of nodes. 9.5 Local Customization of RMS RMS can be customized to suit local operating patterns in a variety of ways. Customization is done through site-specific scripts in /usr/local/rms/etc.
A site-specific variant might copy core files from the local temporary directory to a global file system for subsequent analysis. To create a site-specific core file analysis script, copy the default script /opt/rms/etc/core_analysis to /usr/local/rms/etc and modify it as required.
9.5.3 Event Handling
The default event handlers check for the existence of a site-specific handler of the same name in /usr/local/rms/etc.
9.6 Log Files
The RMS daemons output reports to log files in the directory /var/rms/adm/log. The amount of detail is controlled for each daemon by setting a reporting level. By default, the reporting level is set to 0. The reporting level is a bitmap that turns on different reports.
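For example, the reporting level of a server can be raised with the rcontrol reload command described on Page 5-20 (the server name and level shown here are illustrative):

# rcontrol reload server = mmanager debug = 3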
10 The RMS Database
10.1 Introduction
This chapter describes the tables which make up the RMS database. Each machine has its own database, called rms_machine, where machine is the name of the machine. This allows a single database server to support multiple machines.
Introduction x-y This denotes a range of possible integer values. text This denotes a character string of arbitrary length. • Fields of type text can be selected by the field name but the text entry cannot be matched. If the text is a list of items, for example, a list of node names, the items in the list may be separated by white space. A list of names, all of which share a common base, for example, atlas0 atlas1 atlas2, may also be represented by a glob-like expression, in this example, atlas[0-2].
Operational State
The following tables hold details of the current state of the machine.
transaction outputs    contains output from requests posted to the transaction log
request types          describes output formats in the transaction outputs table
statistics             lists the performance statistics available in the current release
services               describes the services available and who can use them
10.2 Listing of Tables
This section lists the tables in alphabetical order.
10.2.1 The Access Controls Table
The access_controls table is shown in Table 10.1.
10.2.2 The Accounting Statistics Table
Entries in the accounting statistics table are updated at the end of each job by the Partition Manager, pmanager (see Section 4.4).
Table 10.2: Accounting Statistics Table
The memint field is set to 0 in AlphaServer SC Version 2.0. The number of entries in the accounting statistics table can grow rapidly. The table should be cleared periodically of old entries as described in Section 9.4.3.
10.2.3 The Attributes Table
The attributes table, shown in Table 10.3, stores information specific to the site or the release. This information is stored as attribute-value pairs.
The table administration program, rmstbladm, removes old entries, if called with the -c option (see Page 5-44). Note that the accounting statistics table is not cleared out (see Section 10.2.2).
Table 10.4: Performance Statistics Attributes
Attribute                 Default  Description
node-statistics           cpu      statistics collected per node
cpu-stats-poll-interval   120      time in seconds between CPU samples
data-lifetime             48       time in hours to keep job data
stats-lifetime            24       time in hours to keep statistical data
The server attributes are shown in Table 10.5.
Listing of Tables Table 10.
10.2.5 The Elites Table
The elites table, shown in Table 10.8, contains one entry for each switch in the network. Its entries are created and maintained by the Switch Network Manager, swmgr (see Section 4.5).
Table 10.8: Elites Table
Table 10.9: Events Table (cont.)
Field        Type      Description
class        char(16)  class of the object, such as node or partition
type         char(16)  type of event
ctime        UTC       time at which the event occurred
handled      0|1       whether the event has been handled or not
description  text      description of the event
Table 10.10 shows three typical events.
Table 10.11: Event Handlers Table (cont.)
Field    Type      Description
handler  char(32)  handler script to run
10.2.8 The Fields Table
The fields table, shown in Table 10.12, defines which RMS objects and attributes can be created and modified using rcontrol (see Page 5-20), identifying them by a table name and field name within that table.
Table 10.12: Fields Table
10.2.9 The Installed Components Table
The installed_components table, shown in Table 10.14, contains information about software components installed on each node.
Table 10.14: Installed Components Table
Table 10.15: Jobs Table (cont.)
Field       Type  Description
exitStatus  int   exit status of the job
session     int   UNIX session ID of the allocating process
cmd         text  command being executed
Job names are sequence numbers generated automatically. The status field holds one of the values shown in Table B.1. While the job is running, endTime is set to the time by which the job must end, assuming there is a timelimit on the partition. If there is no time limit, the endTime is set to 0.
10.2.12 The Modules Table
The modules table, shown in Table 10.17, contains descriptions of each hardware module in a machine. The modules may be nodes, network components or storage devices. It is created by rmsbuild. Entries are added and removed by rcontrol and updated by rmsd and the Switch Network Manager, swmgr.
Table 10.17: Modules Table
10.2.13 The Module Types Table
The module_types table, shown in Table 10.18, contains descriptions of each of the module types supported in a given release of the RMS. It is updated by the table administration program, rmstbladm (see Page 5-44), when a new release is installed.
Table 10.18: Module Types Table
10.2.14 The Nodes Table
Entries in the nodes table are updated by the Machine Manager, mmanager, when the node's status or run level changes.
To collect node statistics, the node-statistics attribute in the attributes table (see Section 10.2.3) must be set to cpu. This is the default setting. The interval at which the nodes are sampled for CPU statistics is controlled by the attribute cpu-stats-poll-interval in the attributes table; the default is to sample every 2 minutes. The node statistics (node_stats) table can grow rapidly, especially on a large machine. Running the table administration program, rmstbladm, with the -c option removes old entries.
The partitions table, shown in Table 10.22, describes how nodes are allocated to partitions in each of the configurations. It also contains scheduling parameters (see also Section 7.3) for each partition. The entries in the partitions table are created by rcontrol. The information is updated by the Partition Manager, pmanager, as it starts.
Table 10.22: Partitions Table
Listing of Tables The configured_nodes field stores the subset of nodes that were configured in when the partition was started. The timeslice field stores the interval in seconds between periodic rescheduling of parallel jobs. Time slicing is disabled when this field is null, the default. The timelimit field stores the maximum interval in seconds for which CPUs in a partition may remain allocated. Time limits are disabled when this field is null, the default.
Table 10.24: Resources Tables (cont.)
Table 10.25: Servers Table (cont.)
Listing of Tables Currently, only rms is valid. Some services, such as rcontrol, must have exclusive access to the database, requiring that other transactions wait until they complete. The sequential field should be set to 1 for these services. Others such as swctrl may run for long periods of time and should not block the execution of other transactions. sequential should be set to 0 for these services. Sample records from the services table are shown in Table 10.27. Table 10.
10.2.22 The Switch Boards Table
The switch_boards table, shown in Table 10.30, contains one entry for each switch board in the Compaq AlphaServer SC Interconnect. It is created and maintained by the Switch Network Manager, swmgr (see Section 4.5).
An example of the transaction to add a partition is shown below in Table 10.32. The handle is a unique number, generated automatically, which is passed to both the service and the client. The service uses the handle to label any output resulting from the transaction; the client uses the handle to select the resulting entries. If the service fails, the output log (conventionally in the directory /var/rms/adm/log) may contain useful diagnostics.
A Compaq AlphaServer SC Interconnect Terms A.1 Introduction RMS includes support for programs that use Compaq AlphaServer SC Interconnect. This appendix provides an introduction to Compaq AlphaServer SC Interconnect, defining terms used elsewhere in this manual. Before an application process can use Compaq AlphaServer SC Interconnect, it must be given an Elan capability (see Section C.2), describing the nodes and communications contexts that it is allowed to use.
Figure A.1: A 2-Stage, 16-Node Switch Network
The level is the index of the stage, starting with 0 at the top. Note that in a 2-stage switch network the Elans are at level 2. Each component has a network ID that describes how to reach it from the top of the network. The plane is the index of switches that have the same switch network ID.
Four such 64-node networks and an additional stage of switches can be used to construct a 256-way network. Alternatively, the unused uplinks can be used to double the number of nodes a switch can connect. This avoids the need to add an additional switch stage, but the resulting network cannot be expanded further. This technique is used in the 128-node network, shown in Figure A.3.
network, data can be broadcast directly to a contiguous range of processors: data is routed up to a node in the tree from which all processors can be reached, then routed down to all switch outputs in the broadcast range. Data can be recombined as it travels through the network to support global reduction operations and barrier synchronization. Multiple Elan network adapters may be installed per node, each connected to a different switch network.
B RMS Status Values
B.1 Overview
This appendix lists the various states that RMS objects can enter. State information is stored in the status field of the RMS table for the object in question. For example, the current state of a partition is held in the partitions table (see Section 10.2.16), and the current state of a node is entered in the nodes table (see Section 10.2.14). Status changes are recorded in the events table (see Section 10.2.6).
B.2 Generic Status Values
There are three generic status values:
ok        This state means that an object is functioning correctly as far as the relevant RMS daemon can tell.
error     This state means that one or more errors have been detected. A description of the problem will be found in the event record.
unknown   This state means that the RMS daemon responsible for an object either has not run or is unable to determine the state of the object.
B.4 Link Status Values
Each switch (see Appendix A (Compaq AlphaServer SC Interconnect Terms)) has an entry in the elites table. Each switch has eight links and the state of each of these links is recorded in the linkstate field of the elites table. The field holds eight characters, one for each link. Valid values for the characters are as shown in Table B.2. See also Section A.2.
If a node has more than one instance of each type of temperature sensor, the maximum of their values is recorded. Temperature information is recorded as a list of attribute-value pairs, for example:
ambient=15 cpu=40 psu=20
Note that not all node types support all types of thermistor reading. The environment field may contain only a subset of this information. If an error occurs, the environment string contains details of what has failed.
The current UNIX run level of a node is held in the nodes table in the runlevel field. This field is updated by the nodestatus program as the run level changes. The valid strings are shown together with their meaning in Table B.5.
jobs using the CPUs complete. While CPUs are allocated, the valid resource status strings are as shown in Table B.7.
C RMS Kernel Module C.1 Introduction The RMS kernel module supports the operation of RMS on each node in a system. It provides functions that bind together the set of processes that make up a program on each node, allowing RMS to apply scheduling, signal delivery and statistics gathering operations to them collectively. For example, the RMS kernel module allows the rmsd daemon or an administrator process to send a signal to all processes in a parallel program at the same time.
System Call Interface and the Elan hardware context numbers to be used.
rms_setcorepath(3) NAME rms_setcorepath, rms_getcorepath – Set, get the path for application core files SYNOPSIS cc [ flag ... ] file ... #include -lrmscall [ library ... ] int rms_setcorepath(caddr_t path); int rms_getcorepath(pid_t pid, caddr_t path, int maxlen); PARAMETERS path Array containing the path name. maxlen Size of the array pointed to by path. pid Process identifier. DESCRIPTION The function rms_setcorepath() sets the core file path for the current process.
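The following sketch, which is not part of the original manual page, shows a typical call sequence. The header name <rms/rmscall.h> and the core file path are assumptions; the program is compiled and linked with -lrmscall as shown in the synopsis.

    /* Usage sketch; <rms/rmscall.h> and the path are assumptions. */
    #include <sys/types.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <rms/rmscall.h>

    int main(void)
    {
        char buf[1024];

        /* Direct core files for this process to a node-local directory. */
        if (rms_setcorepath((caddr_t) "/local/core/rms") < 0)
            perror("rms_setcorepath");

        /* Read the core file path of the current process back. */
        if (rms_getcorepath(getpid(), (caddr_t) buf, sizeof(buf)) < 0)
            perror("rms_getcorepath");
        else
            printf("core path: %s\n", buf);

        return 0;
    }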
rms_prgcreate(3) NAME rms_prgcreate, rms_prgdestroy – Create, destroy program descriptions SYNOPSIS cc [ flag ... ] file ... #include -lrmscall [ library ... ] int rms_prgcreate(int id, uid_t uid, int cpus); int rms_prgdestroy(int id); PARAMETERS id Program identifier. uid Owner of the program. cpus Number of CPUs allocated. DESCRIPTION rms_prgcreate() creates a new program description with the current process as its root process.
rms_prgcreate(3)
EINVAL    Program identifier is in use or the number of CPUs is invalid.
ECHILD    Processes belonging to this program are still running.
EEXIST    Program identifier does not exist.
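A minimal sketch of how a resource management daemon might use these calls; it is not taken from the manual page. The header name <rms/rmscall.h> and the identifier, uid and CPU values are assumptions.

    /* Sketch; <rms/rmscall.h> and the id, uid and cpus values are assumptions. */
    #include <sys/types.h>
    #include <stdio.h>
    #include <rms/rmscall.h>

    int main(void)
    {
        int   id   = 42;     /* program identifier chosen by the caller */
        uid_t uid  = 1000;   /* owner of the program */
        int   cpus = 4;      /* number of CPUs allocated to it */

        /* Create a program description rooted at the current process. */
        if (rms_prgcreate(id, uid, cpus) < 0) {
            perror("rms_prgcreate");
            return 1;
        }

        /* ... create the processes that make up the program ... */

        /* Destroy the description once all of its processes have exited. */
        if (rms_prgdestroy(id) < 0)
            perror("rms_prgdestroy");

        return 0;
    }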
rms_prgids(3) NAME rms_prgids, rms_prginfo, rms_getprgid – Get information on a program or programs SYNOPSIS cc [ flag ... ] file ... #include -lrmscall [ library ... ] int rms_prgids(int maxids, int *ids, int *nids); int rms_prginfo(int id, int maxids, pid_t *pids, int nids); int rms_getprgid(int pid, int *id); PARAMETERS id Program identifier. pid Process identifier. maxids Maximum number of identifiers to be returned. ids Array of program identifiers.
rms_prgids(3)
EINVAL    Count of array elements is invalid.
EFAULT    Array address is invalid.
ENOMEM    Insufficient kernel memory to perform this operation.
ESRCH     Process or program does not exist.
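A minimal sketch, assuming the header name <rms/rmscall.h>, that lists the programs on a node and finds the program of the calling process:

    /* Sketch; <rms/rmscall.h> is an assumption. */
    #include <sys/types.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <rms/rmscall.h>

    #define MAXIDS 64

    int main(void)
    {
        int ids[MAXIDS];
        int nids = 0;
        int id, i;

        /* List the program identifiers active on this node. */
        if (rms_prgids(MAXIDS, ids, &nids) < 0) {
            perror("rms_prgids");
            return 1;
        }
        for (i = 0; i < nids; i++)
            printf("program %d\n", ids[i]);

        /* Find the program, if any, to which the calling process belongs. */
        if (rms_getprgid(getpid(), &id) == 0)
            printf("calling process belongs to program %d\n", id);

        return 0;
    }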
rms_prgsuspend(3) NAME rms_prgsuspend, rms_prgresume, rms_prgsignal – Suspend or resume the processes in a program, deliver a signal to all processes in a program SYNOPSIS cc [ flag ... ] file ... #include -lrmscall [ library ... ] int rms_prgsuspend(int id); int rms_prgresume(int id); int rms_prgsignal(int id, int signo); PARAMETERS id Program identifier. signo Signal number. DESCRIPTION rms_prgsuspend() suspends all of the processes in a program.
rms_prgsuspend(3)
EACCES    Caller is not permitted to perform this operation.
ESRCH     No such program identifier.
EINVAL    Invalid signal number.
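A minimal sketch, assuming the header name <rms/rmscall.h> and an illustrative program identifier, showing suspend, resume and signal delivery:

    /* Sketch; <rms/rmscall.h> and the program identifier are assumptions. */
    #include <stdio.h>
    #include <signal.h>
    #include <unistd.h>
    #include <rms/rmscall.h>

    int main(void)
    {
        int id = 42;   /* illustrative program identifier */

        /* Suspend every process in the program, then resume them. */
        if (rms_prgsuspend(id) < 0)
            perror("rms_prgsuspend");
        sleep(10);
        if (rms_prgresume(id) < 0)
            perror("rms_prgresume");

        /* Deliver SIGTERM to all processes in the program in one operation. */
        if (rms_prgsignal(id, SIGTERM) < 0)
            perror("rms_prgsignal");

        return 0;
    }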
rms_prgaddcap(3) NAME rms_prgaddcap, rms_setcap – Associate Elan capabilities with a program or process SYNOPSIS cc [ flag ... ] file ... #include -lrmscall [ library ... ] int rms_prgaddcap(int id, int index, ELAN_CAPABILITY *cap); int rms_setcap(int index, int context); PARAMETERS id Program identifier. index Index of the capability for this program. cap Pointer to a capability. context Context number for this process.
rms_prgaddcap(3)
EACCES    Caller is not permitted to perform this operation.
ENOMEM    There was insufficient memory to perform this operation.
ESRCH     Program does not exist.
EFAULT    Capability has invalid address.
EINVAL    Invalid context number (rms_setcap() only).
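A minimal sketch, assuming the header name <rms/rmscall.h>; in practice the capability is supplied by RMS rather than constructed locally, so a stub stands in for that step, and the identifier, index and context values are illustrative:

    /* Sketch; <rms/rmscall.h> is an assumption and the capability would
     * normally be supplied by RMS rather than fabricated locally. */
    #include <stdio.h>
    #include <string.h>
    #include <rms/rmscall.h>

    /* Stand-in for obtaining the program's Elan capability from RMS;
     * it always fails in this sketch. */
    static int fetch_capability(ELAN_CAPABILITY *cap)
    {
        memset(cap, 0, sizeof(*cap));
        return -1;
    }

    int main(void)
    {
        int id      = 42;   /* illustrative program identifier */
        int index   = 0;    /* index of the capability for this program */
        int context = 0;    /* illustrative hardware context number */
        ELAN_CAPABILITY cap;

        if (fetch_capability(&cap) < 0)
            return 1;

        /* Associate the capability with the program description ... */
        if (rms_prgaddcap(id, index, &cap) < 0)
            perror("rms_prgaddcap");

        /* ... then select that capability and a context for this process. */
        if (rms_setcap(index, context) < 0)
            perror("rms_setcap");

        return 0;
    }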
rms_ncaps(3) NAME rms_ncaps, rms_getcap – Return information on the Elan capabilities allocated to a process in a parallel program SYNOPSIS cc [ flag ... ] file ... #include -lrmscall [ library ... ] int rms_ncaps(int *ncaps); int rms_getcap(int index, ELAN_CAPABILITY *cap); PARAMETERS ncaps Number of capabilities allocated. index Index of a capability to be returned. cap Pointer to a capability.
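A minimal sketch, assuming the header name <rms/rmscall.h>, that retrieves each capability allocated to the calling process:

    /* Sketch; <rms/rmscall.h> is an assumption. */
    #include <stdio.h>
    #include <rms/rmscall.h>

    int main(void)
    {
        int ncaps = 0;
        int i;
        ELAN_CAPABILITY cap;

        /* How many capabilities have been allocated to this process? */
        if (rms_ncaps(&ncaps) < 0) {
            perror("rms_ncaps");
            return 1;
        }
        printf("%d capabilities allocated\n", ncaps);

        /* Retrieve each capability in turn. */
        for (i = 0; i < ncaps; i++) {
            if (rms_getcap(i, &cap) < 0)
                perror("rms_getcap");
            /* ... hand the capability to the communications library ... */
        }
        return 0;
    }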
rms_prggetstats(3) NAME rms_prggetstats – Return resource usage information for a program SYNOPSIS cc [ flag ... ] file ... #include -lrmscall [ library ... ] int rms_prggetstats(int id, prgstats_t *stats); PARAMETERS id Program identifier. stats Pointer to a program statistics structure. DESCRIPTION rms_prggetstats() returns resource usage information for the processes of a parallel program on the calling node.
rms_prggetstats(3)
The elapsed time statistic etime is the time in millisecs since the program was created. The allocated time statistic atime is the time in millisecs for which CPUs have been allocated, multiplied by the number of CPUs allocated. The utime and stime statistics are summed over the processes that make up the program (on this node). If one or more processes belonging to the program are still running, the flags field will contain the value PRG_RUNNING.
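A minimal sketch, assuming the header name <rms/rmscall.h>, an illustrative program identifier and the statistics fields described above:

    /* Sketch; <rms/rmscall.h>, the program identifier and the exact field
     * types of prgstats_t are assumptions; the field names are those above. */
    #include <stdio.h>
    #include <rms/rmscall.h>

    int main(void)
    {
        int id = 42;         /* illustrative program identifier */
        prgstats_t stats;

        if (rms_prggetstats(id, &stats) < 0) {
            perror("rms_prggetstats");
            return 1;
        }

        printf("elapsed time    %lu ms\n", (unsigned long) stats.etime);
        printf("allocated time  %lu cpu-ms\n", (unsigned long) stats.atime);
        printf("user time       %lu ms\n", (unsigned long) stats.utime);

        if (stats.flags & PRG_RUNNING)
            printf("some processes are still running\n");

        return 0;
    }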
D RMS Application Interface
D.1 Introduction
The RMS application interface is provided so that external scheduling modules can make inquiries about the availability of resources, allocate and deallocate CPUs, and perform job control operations. The interface is provided as a dynamic library, librmsapi.so. Function prototypes are defined in the header file .
rms_allocateResource(3) NAME rms_allocateResource, rms_deallocateResource – Allocate or deallocate a resource SYNOPSIS cc [ flag ... ] file ... #include -lrmsapi -lrms [ library ... ] int rms_allocateResource(char *partition, int cpus, int baseNode, int nodes, uid_t uid, char *project, char *requestFlags); int rms_deallocateResource(int rid); PARAMETERS partition Partition containing the resources. cpus Total number of CPUs to allocate. baseNode ID of the first node to allocate.
rms_allocateResource(3)
rid   ID of the resource to deallocate.
DESCRIPTION
rms_allocateResource() allocates CPUs from a named partition. If partition is NULL, the default partition is used; otherwise, the named partition must exist. You can optionally specify the base node and the number of nodes (as with the allocate and prun commands). Alternatively, this can be left to the scheduler by passing the value RMS_UNASSIGNED.
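A minimal sketch, assuming the header name <rms/rmsapi.h>, that the call returns the resource identifier on success, and illustrative CPU, project and flag values; the program is linked with -lrmsapi -lrms as shown in the synopsis:

    /* Sketch; <rms/rmsapi.h> is an assumption, as is the convention that the
     * call returns the resource identifier; project and flags are illustrative. */
    #include <sys/types.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <rms/rmsapi.h>

    int main(void)
    {
        int rid;

        /* Request 8 CPUs from the default partition (partition == NULL),
         * letting the scheduler place them (RMS_UNASSIGNED). */
        rid = rms_allocateResource(NULL, 8, RMS_UNASSIGNED, RMS_UNASSIGNED,
                                   getuid(), "default", NULL);
        if (rid < 0) {
            fprintf(stderr, "allocation failed\n");
            return 1;
        }

        /* ... start jobs on the resource with rms_run() ... */

        if (rms_deallocateResource(rid) < 0)
            fprintf(stderr, "deallocation failed\n");

        return 0;
    }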
rms_run(3) NAME rms_run – Run a program on an allocated resource SYNOPSIS cc [ flag ... ] file ... #include -lrmsapi -lrms [ library ... ] int rms_run(int rid, char *cmd, char **args, char *jobFlags); PARAMETERS rid Resource id. cmd Command to execute. args Arguments for the command. jobFlags The job flags currently supported are as follows: tag=0 | 1 With a value of 1, this specifies that output from each process should be tagged by the process id.
rms_run(3)
SEE ALSO
rms_allocateResource(3)
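A minimal sketch, assuming the header name <rms/rmsapi.h>, a resource identifier obtained from rms_allocateResource(), and the usual convention that the argument vector starts with the command name:

    /* Sketch; <rms/rmsapi.h>, the resource identifier and the argument
     * vector convention are assumptions. */
    #include <stdio.h>
    #include <rms/rmsapi.h>

    int main(void)
    {
        int rid = 1;                               /* illustrative resource id */
        char *args[] = { "hostname", (char *) 0 }; /* arguments for the command */

        /* Run the command on the allocated resource, tagging each line of
         * output with the id of the process that produced it. */
        if (rms_run(rid, "/bin/hostname", args, "tag=1") < 0)
            fprintf(stderr, "rms_run failed\n");

        return 0;
    }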
rms_suspendResource(3) NAME rms_suspendResource, rms_resumeResource, rms_killResource – Job control operations on allocated resources SYNOPSIS cc [ flag ... ] file ... #include -lrmsapi -lrms [ library ... ] int rms_suspendResource(int rid); int rms_resumeResource(int rid); int rms_killResource(int rid, int signo); PARAMETERS rid ID of the resource. signo Signal to send. DESCRIPTION rms_suspendResource() and rms_resumeResource() suspend and resume a resource specified by rid.
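A minimal sketch, assuming the header name <rms/rmsapi.h> and an illustrative resource identifier:

    /* Sketch; <rms/rmsapi.h> and the resource identifier are assumptions. */
    #include <stdio.h>
    #include <signal.h>
    #include <rms/rmsapi.h>

    int main(void)
    {
        int rid = 1;   /* illustrative resource id */

        /* Suspend the jobs running on the resource, then resume them. */
        if (rms_suspendResource(rid) < 0)
            fprintf(stderr, "suspend failed\n");
        if (rms_resumeResource(rid) < 0)
            fprintf(stderr, "resume failed\n");

        /* Send SIGINT to the jobs running on the resource. */
        if (rms_killResource(rid, SIGINT) < 0)
            fprintf(stderr, "kill failed\n");

        return 0;
    }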
rms_defaultPartition(3) NAME rms_defaultPartition, rms_numCpus, rms_numNodes, rms_freeCpus – Provide information on RMS partitions SYNOPSIS cc [ flag ... ] file ... #include -lrmsapi -lrms [ library ... ] char *rms_defaultPartition(); int rms_numCpus(char *partition); int rms_numNodes(char *partition); int rms_freeCpus(char *partition); PARAMETERS partition Name of an active partition.
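A minimal sketch, assuming the header name <rms/rmsapi.h> and that rms_defaultPartition() returns NULL when no default partition is defined:

    /* Sketch; <rms/rmsapi.h> and the NULL return convention are assumptions. */
    #include <stdio.h>
    #include <rms/rmsapi.h>

    int main(void)
    {
        char *partition = rms_defaultPartition();

        if (partition == NULL) {
            fprintf(stderr, "no default partition\n");
            return 1;
        }

        printf("partition %s: %d CPUs on %d nodes, %d CPUs free\n",
               partition,
               rms_numCpus(partition),
               rms_numNodes(partition),
               rms_freeCpus(partition));

        return 0;
    }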
E Accounting Summary Script E.1 Introduction This appendix describes the example accounting summary script included in /usr/opt/rms/examples/scripts/accounting_summary and referred to in Section 9.4.3. • Section E.2 describes the command line interface. • Section E.3 shows a sample of output from the script. • Section E.4 is a listing of the script. E.
-p     Sort the records by project name and then by user name. This is the default.
-M     Show time in minutes rather than seconds.
-H     Show time in hours rather than seconds.
days   Show statistics for the specified number of days. By default, statistics are shown for the previous day only.
The script processes the arguments passed to it on the command line and generates a SQL query which it passes to rmsquery.
Listing of the Script # # parse the options # while [ $# -gt 0 ]; do option=‘echo $1 | sed "s/ˆ-//"‘ if [ "$option" = "$1" ]; then break fi if [ "$option" = "p" ]; then primary="project" elif [ "$option" = "u" ]; then primary="user" elif [ "$option" = "d" ]; then delete="1" elif [ "$option" = "M" ]; then if [ "$hours" = "1" ]; then echo "$sname: ERROR : -M and -H are mutually exclusive" exit 1 fi minutes="1" elif [ "$option" = "H" ]; then if [ "$minutes" = "1" ]; then echo "$sname: ERROR : -M and -H are mut
Listing of the Script starttime=‘expr $now - $daysecs‘ if [ "$primary" = "project" ]; then primarytitle="Project" secondarytitle="User" querystr="select \ acctstats.project,resources.username, \ acctstats.atime,acctstats.utime, acctstats.stime \ from resources,acctstats \ where acctstats.started > $starttime and resources.name=acctstats.name \ order by acctstats.project,resources.username" else primarytitle="User" secondarytitle="Project" querystr="select \ resources.username,acctstats.project, \ acctstats.
Listing of the Script printf ("\t %-8.8s ", secondary) } } function printdashes() { printf ("---------------------------------------------------------------------\ ----\n") } function printvals(vals, i) { for (i=1; i<=nvalues; i++) { if (hours == 1 || minutes == 1) { printf (" %13.2f", vals[i]) } else { printf (" %13.0f", vals[i]) } } } NF > 0 { if ($1 != primary) { if (primary != "") { printsortfields() printvals(values) printf (" %6d\n", recs) printdashes() printf ("Total %-10.
Listing of the Script printf ("Name Name Secs Secs Secs Sessions\n") } } printdashes() } primary = $1 secondary = $2 for (i=1; i<=nvalues; i++) { values[i] = 0 primvalues[i] = 0 } recs = 0 primrecs = 0 printprim = 1 } else { if ($2 != secondary) { printsortfields() printvals(values) printf (" %6d\n", recs) secondary = $2 for (i=1; i<=nvalues; i++) { values[i] = 0 } recs = 0 } } for (i=1; i
Listing of the Script printf (" %6d\n", primrecs) printdashes() printf ("Grand Total ") printvals(grandvalues) printf (" %6d\n", grandrecs) printdashes() }’ primtitle="$primarytitle" sectitle="$secondarytitle" machine=$machine \ days=$days hours=$hours minutes=$minutes /bin/rm $tmpfile if [ "$delete" ]; then echo "$sname : Deleting accounting statistics records" querystr="delete from acctstats where running=0" $RMSQUERY $querystr if [ $? -ne 0 ]; then echo "$sname : ERROR : $RMSQUERY $querystr FAILED" exit
Glossary
Abbreviations
API     Application Program Interface — specification of interface to software package (library).
CFS     Cluster File System — the file system for Tru64 UNIX clusters.
CGI     Common Gateway Interface — a standard method for generating HTML pages dynamically from an application so that a Web server and a Web browser can exchange information. A CGI script can be written in any language and can access various types of data, for example, a SQL database.
HTML    HyperText Markup Language — a generic markup language, comprising a set of tags, that enables structured documents to be delivered over the World Wide Web and viewed by a browser.
HTTP    HyperText Transfer Protocol — a communications protocol commonly used between a Web server and a Web browser together with a URL (Uniform Resource Locator).
LED     Light-Emitting Diode.
Shmem   A one-sided (put/get) inter-process communication interface used on high-performance parallel systems.
SMP     Symmetric MultiProcessor — a computer whose main memory is shared by more than one processor.
SNMP    Simple Network Management Protocol — a protocol used to monitor and control devices on the Internet.
SQL     Structured Query Language — a database language.
Flit            A communications cycle unit of information.
HTTP cookies    Cookies provide a general mechanism that HTTP server-side connections use to store and to retrieve information on the client side of the connection.
main memory     The memory normally associated with the main processor, that is to say, memory on the CPU’s high speed memory bus.
main processor  The main CPU (or CPUs for a multi-processor) of a node, typically an Alpha 21264.
slice           A local copy of a global object.
switch network  The network constructed from the Elan cards and Elite cards.
thread          An independent sequence of execution. Every host process has at least one thread.
virtual memory  A feature provided by the operating system, in conjunction with the MMU, that provides each process with a private address space that may be larger than the amount of physical memory accessible to the CPU.
Index A C access controls CPU usage, 6-5, 7-2 memory limits, 6-4, 7-3, 7-5 priority, 6-5, 7-2 records, 6-2 system services, 2-5 table, 10-4 accounting record, 2-10, 6-1, 6-6 statistics, 10-4 allocate, 5-3 application node, 2-1 attributes cpu-poll-stats-interval, 5-29 default-priority, 5-29 grace-period, 5-29 node-status-poll-interval, 4-3 pmanager-idletimeout, 5-29 pmanager-queuedepth, 5-28 rms-keep-core, 5-30 rms-poll-interval, 4-3 tables, 10-6 users-to-mail, 8-3 capability, A-1, C-1 commands, 2-5, 5-1
rmsloader, 3-3 Switch Network Manager (swmgr), 4-5 Transaction Log Manager (tlogmgr), 4-5 database, 2-2, 2-6 administration, 5-44 building, 5-35 field names, 10-1 name, 10-1 SQL interface, 2-6 SQL queries, 5-42 tables, 10-2 Database Manager, 4-2 documentation feedback, 1-3 online, 1-3 E Elan, A-1 Elite, A-1 Event Manager, 4-6 eventmgr, 4-6 events, 8-1 handlers, 8-3 mail alerts, 8-3 posting, 8-2 string, 8-1 table, 10-9 waiting, 8-2 G gang scheduling, 7-1 I installed components, 10-12 interactive node, 2-1
P Partition Manager, 4-3 partitions, 2-7, 4-3, 10-17 root, 2-7 scheduling, 2-8 pmanager, 4-3 priority, 7-2 Process Manager, 4-7 project, 2-9, 6-1 default, 6-1 membership, 6-2 specifying, 10-24 table, 10-19 prun, 5-11 R rcontrol, 5-20 resources, 10-19 allocation, 2-7, 5-3 rinfo, 5-32 rms_allocateResource, D-2 rms_deallocateResource, D-2 rms_defaultPartition, D-7 rms_freeCpus, D-7 rms_getcap, C-12 rms_getcorepath, C-3 rms_getprgid, C-6 rms_killResource, D-6 rms_ncaps, C-12 rms_numCpus, D-7 rms_numNodes, D-7
adapters, A-4 barrier synchronization, A-3 boards, 10-23 control interface, 4-5, 10-9 crosspoint switch, A-1 Elan, A-1 Elans, 10-8 Elite, A-1 Elites, 10-9 fat tree network, A-1 layer, A-4 level, A-1 links, A-3 multistage network, A-1 plane, A-1 rail, A-4 reduction, A-3 top switch, A-3 uplinks, A-2 Switch Network Manager, 4-5 swmgr, 4-5 system architecture, 2-1 T tlogmgr, 4-5 Transaction Log Manager, 4-5 transactions, 10-23 U user commands, 2-5 users, 10-24 Index-4