GTC Silicon Valley-2019 ID:S9285: GPU Resource Pooling and the Benefit of Deploying CUPTI

Lingling Jin (Alibaba), Lingjie Xu (Alibaba), Junrui Zhou (Alibaba)
We'll describe GPUPerf, a lightweight GPU counter monitoring tool that our Alibaba team developed with NVIDIA. It tracks GPU context creation and destruction and records GPU internal counter values, such as active and elapsed cycles, instructions per cycle (IPC), and memory access bandwidth, with little overhead. We'll discuss how we deployed the tool in one of our lab clusters for real-time monitoring. Combined with information collected from nvidia-smi, this data gives us a much better understanding of our GPU server workloads. We'll also explain how GPUPerf helps improve GPU cluster orchestration and scheduling algorithms.
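
As a rough illustration of how raw counter values of the kind listed above might be turned into per-interval metrics, here is a minimal Python sketch. It is not GPUPerf's implementation or the CUPTI API: the `CounterSample` type, its field names, and the `derive_metrics` helper are all hypothetical, and the choice to compute IPC against active cycles rather than elapsed cycles is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Counter deltas over one sampling interval (hypothetical field names)."""
    active_cycles: int        # cycles during which the GPU had work to execute
    elapsed_cycles: int       # total cycles in the sampling interval
    instructions: int         # instructions executed in the interval
    dram_bytes: int           # bytes read from and written to device memory
    interval_seconds: float   # wall-clock length of the interval


def derive_metrics(s: CounterSample) -> dict:
    """Derive utilization, IPC, and memory bandwidth from raw counter deltas."""
    # Fraction of the interval in which the GPU was busy.
    utilization = s.active_cycles / s.elapsed_cycles if s.elapsed_cycles else 0.0
    # Assumption: IPC is measured against active cycles, not elapsed cycles.
    ipc = s.instructions / s.active_cycles if s.active_cycles else 0.0
    # Memory traffic per second, reported in GB/s (1 GB = 1e9 bytes).
    if s.interval_seconds > 0:
        bandwidth_gb_s = s.dram_bytes / s.interval_seconds / 1e9
    else:
        bandwidth_gb_s = 0.0
    return {
        "utilization": utilization,
        "ipc": ipc,
        "bandwidth_gb_s": bandwidth_gb_s,
    }


if __name__ == "__main__":
    sample = CounterSample(
        active_cycles=500,
        elapsed_cycles=1000,
        instructions=750,
        dram_bytes=2_000_000_000,
        interval_seconds=1.0,
    )
    print(derive_metrics(sample))
```

A monitor built this way could flag a device that looks busy by one coarse measure but shows low IPC or low memory bandwidth, which is the sort of signal a scheduler could use when packing workloads onto GPUs.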

