Administrator Guide

RAPIDS Scaling on Dell EMC PowerEdge Servers

J Common Errors

During the tests we experienced GPU device memory issues. For more details on memory performance and the issues we encountered, see section A, “Controlling memory usage”. NVIDIA [10] has documented and explained these errors as follows:

“Running out of GPU Device Memory:

• ETL processes may create many copies of data in device memory, resulting in memory utilization spikes

• Need to budget 25% GPU device memory to account for XGBoost overhead

• Cannot exceed 24GB on 32GB GPU, or cannot exceed 12GB on 16GB GPU

• Memory utilization which exceeds available device resources will cause a Dask worker to crash

• This error can be propagated forward in the Dask task graph, and manifest in very short ETL times (sub-millisecond timescale)

• An error may be raised by another routine referring to None Type in data or similar

Running out of system memory:

• The final step of the ETL process migrates all computed results back to system memory before training, and if you do not have sufficient system memory, your program will crash. The step before training migrates a portion of the data back into device memory for XGBoost to train against”
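The 25% budgeting rule and the sub-millisecond ETL symptom quoted above can be turned into simple pre-flight and post-run checks. The following is a minimal sketch in plain Python, not part of the NVIDIA guidance or of the RAPIDS, Dask, or XGBoost APIs. The function names and the one-millisecond threshold are hypothetical and chosen only for illustration.

```python
# Illustrative helpers only: these names are assumptions, not library APIs.

# Fraction of GPU device memory to reserve for XGBoost overhead,
# taken from the NVIDIA guidance quoted above.
XGBOOST_OVERHEAD_FRACTION = 0.25


def usable_device_memory_gb(total_gb, overhead_fraction=XGBOOST_OVERHEAD_FRACTION):
    """Return the device memory (GB) left for data after reserving the overhead."""
    if not 0.0 <= overhead_fraction < 1.0:
        raise ValueError("overhead_fraction must be in [0, 1)")
    return total_gb * (1.0 - overhead_fraction)


def fits_on_device(data_gb, total_gb):
    """Check whether a per-GPU data footprint stays within the budget."""
    return data_gb <= usable_device_memory_gb(total_gb)


def etl_time_is_suspicious(elapsed_seconds, threshold_seconds=1e-3):
    """Flag ETL times below the (assumed) one-millisecond threshold.

    A sub-millisecond ETL time suggests that an upstream Dask worker
    crashed and the error propagated through the task graph.
    """
    return elapsed_seconds < threshold_seconds


if __name__ == "__main__":
    print(usable_device_memory_gb(32))     # 24.0
    print(usable_device_memory_gb(16))     # 12.0
    print(fits_on_device(25, 32))          # False
    print(etl_time_is_suspicious(0.0005))  # True
```

With the default 25% reservation, the budget works out to the 24 GB and 12 GB limits stated for 32 GB and 16 GB GPUs. The checks are advisory only: they do not measure actual memory use, which can still spike when ETL steps create temporary copies.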