Administrator's Guide

NOTE: Remove all files when you are finished with the accelerator tests.
Running accelerator tests
GPU detection
When you start testnodes.pl -gpu, a test is launched to check all nodes for the presence of
accelerator cards (GPUs). If any GPUs are detected and they respond to communication,
the node is marked by appending /g<number of GPUs> to the node name in the nodes
monitoring window. In the example below, each node has three detected and responsive GPUs.
You should compare the number of GPUs indicated in the nodes monitoring window to the actual
number of GPUs for each node. Any discrepancies indicate a problem with GPUs on that node.
It might be helpful to run the Verify test, described below, to get more information about problem
nodes. Additional information on the nodes monitoring window is available at “The nodes
monitoring window” (page 15).
IMPORTANT: For all the accelerator tests, only nodes with detected GPUs should be selected.
Deselect any nodes that do not have GPUs.
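The comparison described above can be scripted once the node labels have been copied out of the nodes monitoring window. The following is a minimal sketch, not part of testnodes.pl. It assumes each label has the form <node name>/g<GPU count> (for example n21/g3) and that a node with no detected GPUs shows no /g suffix. The node names n22 and n23, the expected counts, and the function names are hypothetical.

```python
import re

# Assumed label form: "<node name>/g<GPU count>", e.g. "n21/g3".
# A label without a "/g" suffix is treated as a node with zero detected GPUs.
LABEL_RE = re.compile(r"^(?P<node>[^/]+)(?:/g(?P<gpus>\d+))?$")


def detected_gpus(label):
    """Return (node_name, detected_gpu_count) for one node label."""
    match = LABEL_RE.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized node label: {label!r}")
    return match.group("node"), int(match.group("gpus") or 0)


def find_discrepancies(labels, expected):
    """Return {node: (detected, expected)} for nodes whose counts differ."""
    problems = {}
    for label in labels:
        node, found = detected_gpus(label)
        want = expected.get(node, 0)
        if found != want:
            problems[node] = (found, want)
    return problems


def nodes_with_gpus(labels):
    """Return the nodes that should stay selected for accelerator tests."""
    return [node for node, count in map(detected_gpus, labels) if count > 0]


# Hypothetical data: labels as read from the window, and the installed counts.
labels = ["n21/g3", "n22/g2", "n23"]
expected = {"n21": 3, "n22": 3, "n23": 0}

print(find_discrepancies(labels, expected))  # {'n22': (2, 3)}
print(nodes_with_gpus(labels))               # ['n21', 'n22']
```

Any node reported by find_discrepancies is a candidate for the Verify test, and nodes_with_gpus gives the set that should remain selected.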
Verify
The Verify test is similar to the GPU detection run when testnodes.pl -gpu starts. Each selected
node is tested for the presence of GPUs using lspci and is then queried. The test report shows
the accelerators detected for each node and whether communication with the GPU was successful.
If a GPU is installed on a node but not detected, reseat the GPU and repeat the test. An example
test report is shown below.
----------------
n21
----------------
** The lspci command shows that there are 3 GPGPUs installed on node
** All 3 GPGPUs appear to be functional on this node
GPU  Model        Video BIOS      Link Speed  Width  Bus ID
0    Tesla S2050  70.00.2f.00.03  5GT/s,      x16,   06.00.0
1    Tesla S2050  70.00.2f.00.03  5GT/s,      x16,   14.00.0
2    Tesla S2050  70.00.2f.00.03  5GT/s,      x16,   11.00.0
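To cross-check the installed GPU count in the report by hand, lspci can be run directly on a node. The sketch below is only an approximation: how the Verify test actually filters lspci output is not documented here. It assumes the GPUs appear as NVIDIA devices with the class "3D controller" or "VGA compatible controller". The sample lspci text is illustrative, not captured from a real node, and the function names are hypothetical.

```python
import subprocess

# Assumption: GPUs show up under one of these lspci device classes.
GPU_CLASSES = ("3D controller", "VGA compatible controller")


def count_nvidia_gpus(lspci_text):
    """Count lspci lines that look like NVIDIA GPU devices."""
    count = 0
    for line in lspci_text.splitlines():
        # Default lspci lines read "<bus:dev.fn> <class>: <vendor and device>".
        _, _, rest = line.partition(" ")
        dev_class, sep, description = rest.partition(": ")
        if sep and dev_class in GPU_CLASSES and "nvidia" in description.lower():
            count += 1
    return count


def local_gpu_count():
    """Run lspci on the local node and count its NVIDIA GPUs."""
    result = subprocess.run(
        ["lspci"], capture_output=True, text=True, check=True
    )
    return count_nvidia_gpus(result.stdout)


# Illustrative input only; real device descriptions will differ.
SAMPLE = """\
00:00.0 Host bridge: Example Vendor host bridge
06:00.0 3D controller: NVIDIA Corporation example GPU
11:00.0 3D controller: NVIDIA Corporation example GPU
14:00.0 3D controller: NVIDIA Corporation example GPU
15:00.0 Ethernet controller: Example Vendor NIC
"""

print(count_nvidia_gpus(SAMPLE))  # 3
```

On a node, local_gpu_count() returns the figure to compare against the count in the first line of the report.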
To use the Verify test report:
Make sure all GPUs are listed for each node.
Verify the Model numbers.
Verify the Video BIOS.
The Link Speed can be reported as 2.5, 5, or UNKNOWN. A report of 5 or UNKNOWN
indicates the GPU is running at Gen2 speed and is acceptable. A value of 2.5 might indicate
the GPU is not properly configured. However, this test is timing sensitive, so retest any
nodes reporting 2.5. If the test consistently reports 2.5, reseat the GPU and repeat the
test. If all the GPUs report 2.5, there might be a BIOS setting error.
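The Link Speed rules above can be written as a small decision function. This is a sketch of the rules as stated in this section, not code from the test suite. It assumes the Link Speed values have already been copied out of the report as strings such as "5GT/s," or "UNKNOWN". The function names and returned messages are hypothetical.

```python
def classify_link_speed(value):
    """Classify one Link Speed field as 'ok', 'retest', or 'unexpected'."""
    speed = value.strip().rstrip(",").upper()
    if speed.endswith("GT/S"):
        speed = speed[: -len("GT/S")]
    if speed in ("5", "UNKNOWN"):
        return "ok"        # Gen2 speed, acceptable
    if speed == "2.5":
        return "retest"    # possibly misconfigured; the test is timing sensitive
    return "unexpected"


def summarize_node(speeds):
    """Suggest a next step for one node from its GPUs' Link Speed fields."""
    results = [classify_link_speed(s) for s in speeds]
    if "unexpected" in results:
        return "unexpected Link Speed value; inspect the report manually"
    slow = [gpu for gpu, result in enumerate(results) if result == "retest"]
    if slow and len(slow) == len(results):
        return "all GPUs report 2.5; retest, then check BIOS settings"
    if slow:
        return f"retest GPUs {slow}; reseat them if 2.5 persists"
    return "all link speeds acceptable"


# Values from the example report for node n21.
print(summarize_node(["5GT/s,", "5GT/s,", "5GT/s,"]))
# all link speeds acceptable
```

A single 2.5 reading only triggers a retest. Reseating or a BIOS check is suggested when the value persists, which matches the guidance above.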