NVIDIA DCGM

Manage and Monitor GPUs in Cluster Environments

NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA datacenter GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts and governance policies including power and clock management. It can be used standalone by infrastructure teams and easily integrates into cluster management tools, resource scheduling and monitoring products from NVIDIA partners.

DCGM simplifies GPU administration in the data center, improves resource reliability and uptime, automates administrative tasks, and helps drive overall infrastructure efficiency. DCGM supports Linux operating systems on x86_64, Arm and POWER (ppc64le) platforms. The installer packages include libraries, binaries, the NVIDIA Validation Suite (NVVS), and source examples for using the API (C, Python and Go).

DCGM also integrates into the Kubernetes ecosystem using DCGM-Exporter to provide rich GPU telemetry in containerized environments.
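As a rough illustration of consuming that telemetry, the sketch below filters Prometheus-style text output for a single metric. It assumes DCGM-Exporter is reachable at http://localhost:9400/metrics and exposes a metric named DCGM_FI_DEV_GPU_UTIL with a gpu label; both should be checked against your deployment. The helper name extract_metric is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read Prometheus text format on stdin and print
# "gpu=<index> value=<number>" for every sample of the metric named in $1.
extract_metric() {
    awk -v metric="$1" '
        # Sample lines start with the metric name followed by its label set;
        # "# HELP" and "# TYPE" comment lines do not match.
        index($0, metric "{") == 1 {
            gpu = "?"
            if (match($0, /gpu="[^"]*"/)) {
                # Skip the 5 characters of gpu=" and drop the closing quote.
                gpu = substr($0, RSTART + 5, RLENGTH - 6)
            }
            # The sample value is the last whitespace-separated field.
            print "gpu=" gpu, "value=" $NF
        }'
}

# Assumed usage (endpoint and metric name are assumptions, see above):
#   curl -s http://localhost:9400/metrics | extract_metric DCGM_FI_DEV_GPU_UTIL
```

The filter is deliberately simple: it matches on the metric name prefix and does not parse the full label set, which is enough for a quick look but not a substitute for a Prometheus query.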

DCGM is now open-source! Check us out on GitHub!

Benefits





GPU Diagnostics and System Validation

Effectively identify failures, performance degradations, power inefficiencies and their root causes.

GPU Telemetry

Gather a rich set of GPU telemetry to explain job behavior, identify opportunities to improve utilization and efficiency, and determine the root causes of potential application performance issues.

Active GPU Health Monitoring

Use low-overhead, non-invasive health monitoring while jobs are running, without impact on application behavior or performance.

Integration with Management Ecosystem

Easily deploy a DCGM-based monitoring solution in a Kubernetes cluster environment. DCGM integrates out of the box with ISV solutions such as Bright Cluster Manager and IBM Spectrum LSF, and with open-source tools such as Prometheus and collectd.
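To make the telemetry benefit above more concrete, here is a minimal sketch of sampling a few fields from the command line with dcgmi dmon. The friendly names (gpu_temp, power_usage, gpu_util, fb_used) and both helper functions are hypothetical. The numeric field IDs are taken from the DCGM field definitions as I understand them and should be confirmed against the documentation for your DCGM version.

```shell
#!/bin/sh
# Hypothetical mapping from friendly names to DCGM field IDs.
dcgm_field_id() {
    case "$1" in
        gpu_temp)    echo 150 ;;   # DCGM_FI_DEV_GPU_TEMP
        power_usage) echo 155 ;;   # DCGM_FI_DEV_POWER_USAGE
        gpu_util)    echo 203 ;;   # DCGM_FI_DEV_GPU_UTIL
        fb_used)     echo 252 ;;   # DCGM_FI_DEV_FB_USED
        *)           return 1 ;;
    esac
}

# Build the comma-separated ID list passed to `dcgmi dmon -e`.
dcgm_field_list() {
    list=""
    for name in "$@"; do
        id=$(dcgm_field_id "$name") || {
            echo "unknown field: $name" >&2
            return 1
        }
        list="${list:+$list,}$id"   # add a comma only between entries
    done
    echo "$list"
}

# Assumed usage on a host where DCGM is installed and running:
#   dcgmi dmon -e "$(dcgm_field_list gpu_util fb_used)" -c 5
```

Keeping the mapping in one function means a wrong or renamed field ID only has to be corrected in one place.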

Learn More

Installing the Latest DCGM

By downloading and using the software, you agree to fully comply with the terms and conditions of the NVIDIA DCGM License.

Using the latest NVIDIA datacenter driver (R450 or later) is recommended; it can be downloaded from the NVIDIA Driver Downloads page.
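The driver recommendation can be checked before installing. The sketch below assumes nvidia-smi is on the PATH and that the driver version string starts with the branch number (for example 470.57.02); the function name driver_is_r450_or_newer is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the driver version string in $1
# (e.g. "470.57.02") has a major version of at least 450.
driver_is_r450_or_newer() {
    major=${1%%.*}                  # keep only the part before the first dot
    case "$major" in
        ''|*[!0-9]*) return 2 ;;    # empty or non-numeric: cannot decide
    esac
    [ "$major" -ge 450 ]
}

# Query the installed driver only if nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if driver_is_r450_or_newer "$version"; then
        echo "Driver $version meets the R450+ recommendation"
    else
        echo "Driver $version is older than R450 or could not be parsed"
    fi
fi
```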

The recommended method is to install DCGM directly from the CUDA network repositories. Older DCGM releases are also available from the same repositories.

Quickstart Instructions:

Ubuntu LTS

Set up the CUDA network repository metadata and GPG key. The example below is for Ubuntu 20.04 on x86_64:

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb

$ sudo dpkg -i cuda-keyring_1.0-1_all.deb

$ sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"



Install DCGM

$ sudo apt-get update \
&& sudo apt-get install -y datacenter-gpu-manager

Red Hat

Set up the CUDA network repository metadata and GPG key. The example below is for RHEL 8 on x86_64:

$ sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo



Install DCGM

$ sudo dnf clean expire-cache \
&& sudo dnf install -y datacenter-gpu-manager

Set up the DCGM service

$ sudo systemctl --now enable nvidia-dcgm
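Once the service is enabled, a quick sanity check can confirm that the host engine is running and can see the GPUs. This is a minimal sketch, assuming the package installed the dcgmi command-line tool and the nvidia-dcgm systemd unit used above; the function name check_dcgm is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: verify that DCGM is installed, the service is active,
# and the GPUs are visible to it.
check_dcgm() {
    if ! command -v dcgmi >/dev/null 2>&1; then
        echo "dcgmi not found: is datacenter-gpu-manager installed?"
        return 1
    fi
    if ! systemctl is-active --quiet nvidia-dcgm; then
        echo "nvidia-dcgm service is not active"
        return 1
    fi
    # List the GPUs that DCGM has discovered on this host.
    dcgmi discovery -l
}
```

Running check_dcgm after the service setup should list the GPUs in the system; if it does, a short diagnostic run such as dcgmi diag -r 1 is a reasonable next step.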



See the release notes and the documentation for installation instructions on other supported distributions and platforms.

Archived Releases