HP XC System Software Administration Guide Version 3.1

Table 14-4 Output of the sinfo command for Various Transitions

| Transition Cause | sinfo shows | Meaning |
| --- | --- | --- |
| Transient network congestion | `alloc` | The node is running a job. |
| | `alloc*` | The slurmctld daemon has lost contact with the node. |
| | `alloc` | Contact between the node and the slurmctld daemon has been restored. |
| Node fails while no job is running on the node | `idle` | The node is ready to accept a job. |
| | `idle*` | The slurmctld daemon lost contact with the node. |
| | `down*` | The slurmctld daemon has removed the node from service (see `sinfo -R`). |
| | `idle` | The node has been returned to service. |
| Node fails while a job is running on the node | `alloc` | The node is running a job. |
| | `alloc*` | The slurmctld daemon lost contact with the node. |
| | `down*` | The slurmctld daemon has removed the node from service (see `sinfo -R`). |
| | `idle` | The node has been returned to service. |
| The System Administrator sets the node state to down | `idle` | The node is ready to accept a job. |
| | `down` | The slurmctld daemon has removed the node from service. |
| | `down*` | The slurmctld daemon lost contact with the node (see `sinfo -R`). |
| | `idle` | The node has been returned to service. |
| The System Administrator sets the node state to drain while a job is running on the node | `alloc` | The node is running a job. |
| | `drng` | SLURM is waiting for the job or jobs to finish. |
| | `drain` | SLURM removed the node from service. |
| | `drain*` | The slurmctld daemon lost contact with the node (see `sinfo -R`). |
| | `idle` | The node has been returned to service. |
| The System Administrator sets the node state to drain while no job is running on the node | `idle` | The node is ready to accept a job. |
| | `drain` | SLURM removed the node from service. |
| | `drain*` | The slurmctld daemon lost contact with the node (see `sinfo -R`). |
| | `idle` | The node has been returned to service. |
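
The states in Table 14-4 follow a simple pattern: a base state, optionally followed by an asterisk when the slurmctld daemon has lost contact with the node. The following is a minimal sketch of a helper that decodes such a state string. The function name `describe_state` and the wording of its output are illustrative assumptions and are not part of SLURM or the HP XC software.

```shell
# Decode a compact sinfo node state (for example "alloc*") using the
# meanings listed in Table 14-4. describe_state is a hypothetical helper.
describe_state() {
    local state="$1"
    local contact="slurmctld in contact"
    local meaning

    # A trailing asterisk means slurmctld has lost contact with the node.
    case "$state" in
        *\*)
            contact="slurmctld has lost contact"
            state="${state%\*}"
            ;;
    esac

    case "$state" in
        alloc) meaning="running a job" ;;
        idle)  meaning="ready to accept a job" ;;
        drng)  meaning="draining, waiting for jobs to finish" ;;
        drain) meaning="removed from service (drained)" ;;
        down)  meaning="removed from service (down)" ;;
        *)     meaning="state not listed in Table 14-4" ;;
    esac

    printf '%s; %s\n' "$meaning" "$contact"
}
```

For example, `describe_state 'drain*'` prints `removed from service (drained); slurmctld has lost contact`. The helper could be fed with the compact state column of the sinfo output; states that the table does not list fall through to the last case.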

14.7 Configuring the SLURM Epilog Script

SLURM can automatically kill rogue processes at the end of a job by using an epilog script.

When configured, the SLURM epilog script is launched after the user's job on the node completes. This script checks whether the user has another job assigned to this node and, if not, sends a SIGKILL signal to all the processes that belong to that user on all the nodes in the user's allocation.
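
The decision described above can be sketched as a small shell function. This is an illustration only and is not the epilog script shipped with HP XC: the function name `should_kill_user_processes` and its arguments are hypothetical, and the sketch covers only the check made on one node, not the kill across the whole allocation.

```shell
# Decide whether the user's processes on this node should be killed.
# This is a hypothetical helper, not the code in the HP XC epilog file.
#   $1  ID of the job that has just completed
#   $2  whitespace-separated IDs of the jobs the same user has on this node
# Returns 0 (kill) when no other job remains, 1 (do not kill) otherwise.
should_kill_user_processes() {
    local finished_job="$1"
    local job

    for job in $2; do
        if [ "$job" != "$finished_job" ]; then
            # The user still has another job on this node: leave it alone.
            return 1
        fi
    done
    return 0
}

# Assumed usage inside an epilog-style script (names are illustrative):
#   if should_kill_user_processes "$job_id" "$jobs_on_node"; then
#       pkill -KILL -u "$user_name"    # send SIGKILL to the user's processes
#   fi
```

Keeping the decision separate from the kill makes the check easy to exercise without touching any real processes. How the list of the user's jobs on the node is obtained is left open here, because that depends on the SLURM version in use.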

NOTE: If the user logged in from a node that is also a compute node, the epilog script also terminates the user's login. You can avoid this problem by editing the EPILOG_EXCLUDE_NODES variable in the epilog file. This variable is empty by default. Specify the host names of the login nodes, separated by spaces, so that the epilog script does not kill the user jobs on those nodes; for example:
