Administrator Guide
RAPIDS Scaling on Dell EMC PowerEdge Servers
J Common Errors
During the tests we experienced GPU device memory issues. For more details on memory
performance and the issues we encountered, see section A, “Controlling memory usage”. NVIDIA [10]
documents and explains these errors as follows:
“Running out of GPU Device Memory:
• ETL processes may create many copies of data in device memory, resulting in memory utilization
spikes
• Need to budget 25% GPU device memory to account for XGBoost overhead
• Cannot exceed 24GB on 32GB GPU, or cannot exceed 12GB on 16GB GPU
• Memory utilization which exceeds available device resources will cause a Dask worker to crash
• This error can be propagated forward in the Dask task graph, and manifest in very short ETL times
(sub-millisecond timescale)
• An error may be raised by another routine referring to None Type in data or similar
Running out of system memory:
• The final step of the ETL process migrates all computed results back to system memory before
training, and if you do not have sufficient system memory, your program will crash. The step before
training migrates a portion of the data back into device memory for XGBoost to train against.”
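
The 25% budget quoted above can be checked with simple arithmetic. The following is a minimal sketch, not part of the NVIDIA guidance or of our test code: the helper names `usable_device_memory` and `fits_on_device` are hypothetical, and the 25% reserve is taken directly from the quoted rule of thumb. It reproduces the 24GB and 12GB limits and shows how a planned working set could be compared against them before launching a job.

```python
GIB = 1024 ** 3  # bytes per GiB; the guide's "GB" figures are treated as GiB here (assumption)


def usable_device_memory(total_bytes, reserve_fraction=0.25):
    """Return the bytes left for data after reserving a fraction of device memory.

    reserve_fraction=0.25 mirrors the quoted 25% budget for XGBoost overhead.
    """
    if not 0.0 <= reserve_fraction < 1.0:
        raise ValueError("reserve_fraction must be in [0, 1)")
    return int(total_bytes * (1.0 - reserve_fraction))


def fits_on_device(data_bytes, total_bytes, reserve_fraction=0.25):
    """Return True if data_bytes stays within the usable device memory budget."""
    return data_bytes <= usable_device_memory(total_bytes, reserve_fraction)


# Reproduce the limits quoted for 16GB and 32GB GPUs.
for total_gib in (16, 32):
    budget = usable_device_memory(total_gib * GIB)
    print(f"{total_gib} GB GPU -> {budget // GIB} GB usable")
# prints:
# 16 GB GPU -> 12 GB usable
# 32 GB GPU -> 24 GB usable
```

In practice the total would be read from the device (for example, from the output of `nvidia-smi`) rather than hard-coded. Because ETL can create temporary copies of the data, the peak working set, not the size of the input, is what should be compared against the budget.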