HP XC System Software Administration Guide Version 3.0
.
sockets 67456 0 (unused)
sdp 208008 0 [sockets]
ipoib-ud 171072 1
ats 40000 1
devucm 17808 2
q_mng 22184 0 [sdp]
ibat 67928 5 [sockets ipoib-ud devucm]
cm_2 77312 0 [sdp devucm]
gsi 64304 6 [ipoib-ud ats ibat cm_2]
adaptor-tavor 148496 1
vverbs-base 54576 0 [sdp ipoib-ud ats devucm q_mng ibat cm_2 gsi adaptor-tavor]
mlog 9792 0 [sockets sdp ipoib-ud ats devucm q_mng ibat cm_2 gsi adaptor-tavor vverbs-base]
repository 82208 0 [sockets sdp ipoib-ud ats q_mng ibat cm_2 gsi adaptor-tavor vverbs-base mlog]
hadump 2240 0 [gsi]
mod_ib_mgt 76992 0 (unused)
mod_vapi 137824 0 [adaptor-tavor mod_ib_mgt]
mod_vipkl 235776 0 [mod_vapi]
mod_thh 248064 0 [mod_vapi]
mod_hh 26896 0 [mod_vipkl mod_thh]
mod_vapi_common 74824 0 [mod_ib_mgt mod_vapi mod_vipkl mod_thh]
mod_mpga 28832 0 [mod_vapi]
mosal 140768 0 [mod_ib_mgt mod_vapi mod_vipkl mod_thh mod_vapi_common mod_mpga]
The module sizes on your system might differ from those shown in this output.
7. The InfiniBand ipoib0 IP interface should be up. Use the ifconfig command to display the interface
network configuration:
# ifconfig ipoib0
ipoib0 Link encap:Ethernet HWaddr 00:02:C9:50:FF:80
inet addr:172.22.0.3 Bcast:172.22.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:7 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:420 (420.0 b) TX bytes:240 (240.0 b)
To confirm connectivity, ping other nodes that are connected to the network.
8. You can find additional information about InfiniBand in the /proc/voltaire directory. Use the find
command to display it:
# find /proc/voltaire -type f -print -exec cat {} \;
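The module and interface checks in the preceding steps can be scripted. The following is a minimal sketch, not part of the HP XC software: the function names missing_modules and iface_is_up are hypothetical helpers. It assumes that lsmod prints the module name in the first column and that ifconfig prints the UP and RUNNING flags as shown in the sample output above.

```shell
#!/bin/sh
# Hypothetical helpers; they only parse text supplied on standard input.

# Print each expected module name that is absent from lsmod-style input.
# Usage: lsmod | missing_modules mod1 mod2 ...
missing_modules() {
    # The first column of every input line is the module name.
    loaded=$(awk '{ print $1 }')
    for mod in "$@"; do
        # -x matches the whole line; -F treats the name as a literal string.
        printf '%s\n' "$loaded" | grep -qxF -- "$mod" || printf '%s\n' "$mod"
    done
}

# Succeed only if some line of ifconfig-style input carries both the
# UP and RUNNING flags as separate words.
# Usage: ifconfig ipoib0 | iface_is_up
iface_is_up() {
    awk '{
        up = 0; running = 0
        for (i = 1; i <= NF; i++) {
            if ($i == "UP") up = 1
            if ($i == "RUNNING") running = 1
        }
        if (up && running) found = 1
    }
    END { exit !found }'
}
```

For example, after the functions are defined in the current shell (the module names are taken from the sample output above):
# lsmod | missing_modules sdp ipoib-ud gsi mosal
# ifconfig ipoib0 | iface_is_up && echo "ipoib0 is up"
The first command prints nothing when all of the named modules are loaded.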
SLURM Troubleshooting
This section discusses SLURM troubleshooting in terms of configuration issues and run-time
problems.
SLURM Configuration Issues
SLURM consists of the following primary components:
slurmctld          A master/backup daemon.
slurmd             A slave daemon.
Command binaries   The sinfo, srun, scancel, squeue, and scontrol commands.
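As a first configuration check, you might confirm that the command binaries listed above can be found on the current PATH. The following is a minimal sketch; missing_commands is a hypothetical helper name, not a SLURM or HP XC utility, and it relies only on the standard shell built-in command -v.

```shell
#!/bin/sh
# Hypothetical helper: print each named command that cannot be found
# on the current PATH. Prints nothing when every command is available.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}
```

For example, after the function is defined in the current shell:
# missing_commands sinfo srun scancel squeue scontrol
Any name that is printed identifies a command that the shell cannot locate.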