Installing LSF-HPC With SLURM Into an Existing Standard LSF Cluster

xc1
xc1
In this scenario, the srun command was not found because the user's $PATH did not include
/opt/hptc/bin, which is specific to XC. There are several standard ways to address this if
necessary. For example, you can add /opt/hptc/bin to the default $PATH on the non-XC
node, or create a symbolic link to the srun command in /usr/bin on all the XC nodes.
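The two workarounds mentioned above can be sketched as follows. This is a minimal sketch rather than part of the documented procedure: the helper name add_to_path is hypothetical, the login-script location is an assumption, and the symbolic-link command assumes that srun is installed as /opt/hptc/bin/srun and that pdsh is run as root.

```shell
# Hypothetical helper: append a directory to PATH only if it is not already there.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
    export PATH
}

# Option 1 (non-XC node): put the XC-specific directory on the PATH, for
# example from a login script under /etc/profile.d (assumed location).
add_to_path /opt/hptc/bin

# Option 2 (XC nodes), shown as a comment because it needs root and pdsh:
#   pdsh -a 'ln -s /opt/hptc/bin/srun /usr/bin/srun'
```

Guarding against duplicates keeps $PATH from growing each time the login script is sourced.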
User on the XC node, launching to a Linux ia32 resource:
[test@xc128 test]$ bsub -I -n1 -R type=LINUX86 hostname
Job <415> is submitted to default queue <normal>.
<<Waiting for dispatch ...>>
<<Starting on plain>>
plain
Launching to an XC resource:
[test@xc128 test]$ bsub -I -n6 -R type=SLINUX64 srun hostname
Job <416> is submitted to default queue <normal>.
<<Waiting for dispatch ...>>
<<Starting on xclsf>>
xc3
xc3
xc2
xc2
xc1
xc1
Troubleshooting
If your cluster does not perform or behave as expected after you have applied this XC HowTo, use the
following procedure to verify the configuration:
Use the following commands to check your configuration changes:
Confirm the firewall settings (use other command options if necessary):
iptables -L
Confirm startup script:
pdsh -a 'ls -l /etc/init.d/lsf'
Confirm that the LSF tree is properly mounted (using the example path /shared/lsf):
pdsh -a 'ls -ld /shared/lsf/'
Confirm the LSF environment scripts:
pdsh -a 'ls -l /etc/profile.d/lsf.sh'
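The checks above can be collected into one function so that they are easy to rerun after each change. This is a minimal sketch: the function names run_check and verify_lsf_setup are hypothetical, the paths are the example paths used above, and iptables -L generally requires root. The pdsh -S option is used so that a failing remote command is reflected in the exit status.

```shell
# Hypothetical helper: run a command, show its output, and report the result.
run_check() {
    label=$1
    shift
    if "$@"; then
        echo "OK:   $label"
    else
        echo "FAIL: $label"
        return 1
    fi
}

# Run the four checks from this procedure; returns non-zero if any check fails.
# The iptables output still has to be inspected by hand.
verify_lsf_setup() {
    status=0
    run_check "firewall rules listed"   iptables -L                               || status=1
    run_check "LSF startup script"      pdsh -S -a 'ls -l /etc/init.d/lsf'        || status=1
    run_check "LSF tree mounted"        pdsh -S -a 'ls -ld /shared/lsf/'          || status=1
    run_check "LSF environment script"  pdsh -S -a 'ls -l /etc/profile.d/lsf.sh'  || status=1
    return $status
}
```

Calling verify_lsf_setup on the XC head node prints the output of each command followed by an OK or FAIL line.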
When problems arise, check the LSF log files for communication complaints, unresolved host
name issues, or configuration problems.
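A quick way to scan the log files is sketched below. Everything here is an assumption for illustration: the function name scan_lsf_logs is hypothetical, the log directory must be supplied by the caller (this document does not state where it is), the file-name glob assumes names that contain ".log", and the default search pattern is only a starting point.

```shell
# Hypothetical helper: print the last few suspicious lines from each log
# file in the given directory. The second argument overrides the pattern.
scan_lsf_logs() {
    logdir=$1
    pattern=${2:-'error|fail|unknown host|cannot|refused'}
    for f in "$logdir"/*.log*; do
        [ -f "$f" ] || continue          # skip if the glob matched nothing
        matches=$(grep -Ei "$pattern" "$f" | tail -n 5)
        if [ -n "$matches" ]; then
            echo "== $f"
            echo "$matches"
        fi
    done
}
```

For example, scan_lsf_logs /path/to/lsf/log 'unknown host' narrows the search to host name resolution problems; the path shown is a placeholder.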
Ensure that the XC firewall is disabled. The firewall prevents the non-XC LSF nodes from
communicating with the XC LSF node.
Ensure that controllsf has been properly configured with the LSF alias. Run controllsf show to
confirm its settings.
Check the ifconfig output on the XC LSF node to ensure that the LSF alias was properly
established. If eth0 is the external network device, the LSF alias entry is eth0:lsf.
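The alias check can be automated with a small filter over the ifconfig output. This is a minimal sketch: the function name has_lsf_alias is hypothetical, and the pattern assumes that the alias appears at the start of a line as <device>:lsf, as in the eth0:lsf example above.

```shell
# Hypothetical helper: read ifconfig output on stdin and succeed if an
# alias entry named <device>:lsf is present (device defaults to eth0).
has_lsf_alias() {
    dev=${1:-eth0}
    grep -Eq "^${dev}:lsf([[:space:]:]|\$)"
}

# Example use on the XC LSF node:
#   ifconfig | has_lsf_alias eth0 && echo "LSF alias is established"
```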
Use the appropriate LSF commands to restart daemons when network communications have
been adjusted:
lsadmin reconfig - to restart the Load Information Manager (LIM)
badmin mbdrestart - to restart the Master Batch Daemon (mbatchd)
badmin reconfig - to reset the batch settings for the Slave Batch Daemon (sbatchd)