HP XC System Software Administration Guide Version 3.0


Table 12-4. Output of the sinfo command for Various Transitions

Transition Cause                 sinfo shows   Meaning
-------------------------------  ------------  ----------------------------------------------------------------------
Transient network congestion.    alloc         The node is running a job.
                                 alloc*        The slurmctld daemon has lost contact with the node.
                                 alloc         Contact between the node and the slurmctld daemon has been restored.

Node fails while no job is       idle          The node is ready to accept a job.
running on the node.             idle*         The slurmctld daemon lost contact with the node.
                                 down*         The slurmctld daemon has removed the node from service (see sinfo -R).
                                 idle          The node has been returned to service.

Node fails while a job is        alloc         The node is running a job.
running on the node.             alloc*        The slurmctld daemon lost contact with the node.
                                 down*         The slurmctld daemon has removed the node from service (see sinfo -R).
                                 idle          The node has been returned to service.

The System Administrator sets    idle          The node is ready to accept a job.
the node state to down.          down          The slurmctld daemon has removed the node from service.
                                 down*         The slurmctld daemon lost contact with the node (see sinfo -R).
                                 idle          The node has been returned to service.

The System Administrator sets    alloc         The node is running a job.
the node state to drain while    drng          SLURM is waiting for the job or jobs to finish.
a job is running on the node.    drain         SLURM removed the node from service.
                                 drain*        The slurmctld daemon lost contact with the node (see sinfo -R).
                                 idle          The node has been returned to service.

The System Administrator sets    idle          The node is ready to accept a job.
the node state to drain while    drain         SLURM removed the node from service.
no job is running on the node.   drain*        The slurmctld daemon lost contact with the node (see sinfo -R).
                                 idle          The node has been returned to service.
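
As a reading aid for Table 12-4, the following shell sketch splits a state string reported by sinfo into its base state and the trailing asterisk, which in the table marks a node that the slurmctld daemon has lost contact with. The function name describe_state is hypothetical; it is not part of SLURM or HP XC, and the wording of each description is taken from the table rather than from any SLURM output.

```shell
# Hypothetical helper (not part of SLURM or HP XC): describe a node state
# string as it appears in the "sinfo shows" column of Table 12-4.
# The body runs in a subshell so its variables stay local.
describe_state() (
    state=$1
    suffix=""

    # A trailing asterisk means slurmctld has lost contact with the node.
    case $state in
        *\*)
            suffix=" (slurmctld has lost contact)"
            state=${state%\*}
            ;;
    esac

    case $state in
        alloc) desc="running a job" ;;
        idle)  desc="ready to accept a job" ;;
        down)  desc="removed from service" ;;
        drng)  desc="waiting for running jobs to finish" ;;
        drain) desc="drained and removed from service" ;;
        *)     desc="state not listed in Table 12-4" ;;
    esac

    echo "${state}: ${desc}${suffix}"
)
```

Such a helper could be fed from a node-oriented listing, for example the second field of each line printed by sinfo -N -h -o "%N %t", assuming the installed sinfo supports those options.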

Configuring the SLURM Epilog Script

SLURM can automatically kill rogue processes at the end of a job by means of an epilog script.

When configured, the SLURM epilog script is launched on the node after the user's job completes. The script checks whether the user has another job assigned to this node and, if not, sends a SIGKILL signal to all the processes that belong to that user on all the nodes in the user's allocation.
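
The decision described above can be sketched in shell as follows. This is an illustration only and not the epilog script shipped with HP XC: the function names are hypothetical, the job list is passed in as plain "jobid uid" lines so that the logic can be followed without a running SLURM system, and the squeue and pkill invocations and the SLURM_JOBID and SLURM_UID variables in the trailing comment are assumptions about how such a script might obtain the list and send the signal.

```shell
# Hypothetical sketch, not the HP XC epilog script.

# Count the jobs, other than the one that just finished, that the user
# still has on this node. Reads "jobid uid" lines on standard input.
count_other_jobs() (
    uid=$1
    finished_job=$2
    awk -v u="$uid" -v j="$finished_job" \
        '$2 == u && $1 != j { n++ } END { print n + 0 }'
)

# Print "kill" if the user has no other job on the node, otherwise "keep".
# The job list is read from standard input by count_other_jobs.
epilog_decision() (
    remaining=$(count_other_jobs "$1" "$2")
    if [ "$remaining" -eq 0 ]; then
        echo kill
    else
        echo keep
    fi
)

# Possible use inside an epilog (assumed commands and variables):
#   squeue -h -w "$(hostname -s)" -o "%i %U" > /tmp/epilog.$$
#   if [ "$(epilog_decision "$SLURM_UID" "$SLURM_JOBID" < /tmp/epilog.$$)" = kill ]; then
#       pkill -KILL -U "$SLURM_UID"
#   fi
#   rm -f /tmp/epilog.$$
```

Keeping the decision separate from the commands that list jobs and send signals makes the rule easy to check by hand with a few sample lines before any process is actually killed.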

NOTE: If the user is logged in on a node that is also a compute node, the epilog script also terminates the user's login session. You can avoid this problem by editing the EPILOG_EXCLUDE_NODES variable in the epilog file; the variable is empty by default. Specify the host names of the login nodes, separated by spaces, so that the epilog script does not kill the user's processes on those nodes; for example:
