HP Scalable Visualization Array, V1.1 Readme/Release Notes
6. Job Launch: Failure to Start Due to License
• Impact
Low.
• Summary
The job fails to start and displays a license error.
• Solution
Restart the license manager. This procedure is described in the SVA System Administration Guide.
7. Job Launch: SLURM Epilog Script Excessive Clean Up
• Impact
Medium.
• Summary
The procedure in the SVA Installation Guide adds a SLURM epilog script to slurm.conf. When a
SLURM job terminates, this epilog script runs as root on each node in the job.
The epilog script terminates any rgsender processes (started when using HP Remote Graphics
Software) on the nodes used in the job, removes any X lock files (for example, /tmp/.X0-lock,
/tmp/.X1-lock), and invokes the optional XC epilog script, which kills any processes owned
by the user running the job (except for root and any other UID less than 100).
This is an effective way of ensuring that every process belonging to your job is removed when
the job exits. However, any other processes that you own on that node are also killed, even if
the job did not start them (for example, a debugging session).
• Solution
You can disable the epilog by changing /hptc/slurm/etc/slurm.conf and restarting SLURM
on all nodes using this command:
pdsh -a service slurm restart
Be aware that subsequent SVA jobs might fail to start if processes left behind by previous jobs
are not cleaned up by the epilog. If you don't want the SVA-specific behavior of killing rgsender
processes and removing X server lock files, but you do want to kill processes owned by the user
who ran the job, change the following line in slurm.conf:
Epilog=/opt/sva/sbin/sva_epilog.clean
To:
Epilog=/opt/hptc/slurm/etc/slurm.epilog.clean
If you don't want any of this cleanup behavior, delete the following line:
Epilog=/opt/sva/sbin/sva_epilog.clean
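The edits described above could be scripted along the following lines. This is only a sketch:
set_epilog is a hypothetical helper, not something shipped with SVA, XC, or SLURM, and it
assumes that slurm.conf contains a single line beginning with Epilog=. It saves a backup copy
with a .bak suffix before rewriting the file.

```shell
#!/bin/sh
# Sketch only: point the Epilog line in a slurm.conf file at a different script,
# or remove the line entirely. set_epilog is a hypothetical helper name.
set_epilog() {
    conf=$1        # path to the slurm.conf file to edit
    new_epilog=$2  # new epilog script path; an empty string deletes the Epilog line
    cp "$conf" "$conf.bak" || return 1   # keep a backup before rewriting
    if [ -n "$new_epilog" ]; then
        # Replace the whole Epilog= line with the new path.
        sed "s|^Epilog=.*|Epilog=$new_epilog|" "$conf.bak" > "$conf"
    else
        # Delete the Epilog= line to disable all epilog cleanup.
        sed '/^Epilog=/d' "$conf.bak" > "$conf"
    fi
}

# Example (run as root, then restart SLURM on all nodes as described above):
#   set_epilog /hptc/slurm/etc/slurm.conf /opt/hptc/slurm/etc/slurm.epilog.clean
#   pdsh -a service slurm restart
```

Passing an empty string as the second argument corresponds to deleting the Epilog line. Review
the resulting file before restarting SLURM.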
8. Job Termination: Failure of slurmd Daemon to Exit
• Impact
Medium. This is a rare event, but it stops the job from progressing.
• Summary
The job does not exit cleanly. Instead, it waits indefinitely for srun commands to complete. These
commands don't complete because the slurmd daemons that they launch never exit.
• Solution
On the node that launched the job, use ps to determine which node the srun command is trying
to launch processes on.
Use ssh to log in to that node and, as root, kill the slurmd processes that have a job number
after them in the ps listing.
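Picking out the affected slurmd processes could be sketched as follows. stuck_slurmd_pids is a
hypothetical helper, and the way the job number appears in the ps output is an assumption here:
the filter simply reports any process whose command line contains digits somewhere after the
word slurmd. Check the result against the real ps output on your system before killing anything.

```shell
#!/bin/sh
# Sketch only: list the PIDs of slurmd processes whose ps entry carries a job number.
# stuck_slurmd_pids is a hypothetical helper. Expected input on stdin is one
# "PID COMMAND ARGS..." line per process, as produced by: ps -eo pid=,args=
stuck_slurmd_pids() {
    awk '{
        pid = $1
        $1 = ""                          # the rest of the line is the command and arguments
        pos = index($0, "slurmd")
        if (pos == 0) next               # not a slurmd process
        tail = substr($0, pos + length("slurmd"))
        if (tail ~ /[0-9]/) print pid    # assumed: digits after "slurmd" are a job number
    }'
}

# Example (on the node where the job steps are stuck):
#   ps -eo pid=,args= | stuck_slurmd_pids
# Review the PIDs, then as root:
#   kill <pid>
```

The sketch only prints candidate PIDs and deliberately does not kill anything itself, so that the
main slurmd daemon for the node is not terminated by mistake.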