Data Center / Cloud

Making GPU Clusters More Efficient with NVIDIA Data Center Monitoring

High-performance computing (HPC) customers continue to scale rapidly, with generative AI, large language models (LLMs), computer vision, and other use cases driving tremendous growth in GPU resource needs. As a result, GPU efficiency is an ever-growing focus of infrastructure optimization. With enormous GPU fleet sizes, even small inefficiencies translate into significant cluster bottlenecks.

Optimizing GPU usage helps:

  • Generate significant savings in operational costs.
  • Enable more workloads to access GPU resources.
  • Improve developer experience and throughput.

In this blog, we present our process for reducing idle GPU waste across large-scale clusters, an effort that has the potential to save millions in infrastructure costs while also improving overall developer productivity and resource utilization. In industry terms, waste means GPUs are not being used to their full potential, typically because of ineffective cluster management, missed optimization opportunities, or unresolved errors.

Understanding GPU waste

GPU waste can be classified into multiple categories, and each requires its own tailored solution. One of the most common issues is jobs that occupy GPU resources while sitting completely idle, doing no meaningful work. The following table summarizes the main waste issues.

| GPU waste issue | Solutions | Observed frequency |
| --- | --- | --- |
| Hardware unavailability caused by failures | Fleet health efficiency program for monitoring, tracking, and rolling out fixes to hardware | Low |
| GPUs are healthy but not occupied | Occupancy efficiency programs, which primarily involve scheduler efficiency | Low |
| Jobs occupy GPUs but don’t use the compute efficiently | Application optimization efforts | High |
| Jobs occupy GPUs but don’t use them | Idle waste reduction program | Moderate |
Table 1. Summary of GPU cluster efficiency challenges, targeted solutions, and their frequency in large-scale infrastructure.

Through the operation of research clusters supporting highly diverse workloads, we have encountered expected and unexpected causes of GPU idleness. Distinguishing between these factors is challenging but essential to ensure that researcher productivity remains unaffected. We have identified several recurring patterns that lead to idle GPUs. Some of these include: 

  • CPU-only data processing jobs: Running on GPU nodes without using the GPUs.
  • Misconfigured jobs: Over-provisioning GPUs due to exclusive node settings.
  • Stuck jobs: Jobs that appear active but are stalled.
  • Infrastructure overhead: Delays from container downloads or data fetching.
  • Unattended interactive sessions: Leftover jobs consuming resources.

Ways to reduce GPU resource waste

To reduce idle GPU waste at scale, emphasis was placed on observing actual cluster behavior rather than relying on theoretical utilization targets. Once the underlying patterns surfaced, it became clear that efficiency could be meaningfully improved through a focused set of operational techniques rather than sweeping architectural changes. 

From that analysis, we prioritized four techniques:

  • Data collection and analysis: Gathered utilization and job traces to identify the top contributors to GPU waste.
  • Metric development: Created a dedicated GPU idle waste metric to track baselines and measure improvements over time.
  • Customer collaboration: Resolved inefficiencies by working directly with users and teams whose workflows drove the highest idle impact.
  • Scaling solutions: Built self-serve tools and automation pipelines so improvements could scale across the entire fleet.

Building the GPU utilization metrics pipeline

To build the GPU utilization metrics pipeline, we aligned real-time telemetry from the NVIDIA Data Center GPU Manager (DCGM) with Slurm job metadata to create a unified view of how workloads actually consumed GPU resources. Although Slurm only provided data at a five-minute granularity, that was sufficient for joining with the higher-resolution DCGM fields.

A key enabler in this process was the NVIDIA DCGM Exporter’s HPC job-mapping capability, through which GPU activity could be tagged with precise job context. That established the foundation needed to measure idle periods, identify waste contributors, and attribute inefficiencies to specific workflows.
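As a rough illustration of the join, the sketch below aligns per-GPU DCGM samples (already tagged with Slurm job IDs by the exporter’s job-mapping feature) with Slurm job records and rolls them up into an hourly per-job idle signal. The file names, column names, and 5% utilization cutoff are assumptions for the example, not the production pipeline.

```python
# Sketch: join DCGM telemetry with Slurm job metadata to attribute GPU
# activity to jobs. Input sources and column names are illustrative; in
# practice the telemetry comes from dcgm-exporter (with HPC job mapping
# enabled) and the job records from Slurm accounting.
import pandas as pd

# One row per (timestamp, GPU) sample, already tagged with the Slurm job ID.
telemetry = pd.read_parquet("dcgm_samples.parquet")  # ts, hostname, gpu, job_id, gpu_util

# One row per Slurm job, at five-minute accounting granularity.
jobs = pd.read_csv("slurm_jobs.csv", parse_dates=["start", "end"])  # job_id, user, start, end

# Keep only samples that fall inside each job's scheduled window.
merged = telemetry.merge(jobs, on="job_id", how="inner")
merged = merged[(merged.ts >= merged.start) & (merged.ts <= merged.end)]

# Aggregate to an hourly per-job waste signal: a GPU-hour counts as wasted
# when every sample in that hour sits below a small utilization threshold.
IDLE_UTIL_PCT = 5  # assumed cutoff for "no meaningful activity"
hourly = (
    merged.assign(hour=merged.ts.dt.floor("h"),
                  idle=merged.gpu_util < IDLE_UTIL_PCT)
          .groupby(["job_id", "hostname", "gpu", "hour"])["idle"]
          .all()
          .reset_index(name="gpu_hour_idle")
)
waste_per_job = hourly.groupby("job_id")["gpu_hour_idle"].sum()
print(waste_per_job.sort_values(ascending=False).head(10))
```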

Figure 1. Pipeline for merging DCGM telemetry and Slurm job data to compute per-job GPU idle waste.

With the pipeline established, the next step was to examine the DCGM signals that drove the analysis and define how idle GPU behavior would be identified. The following sections outline the metrics used and the criteria applied to determine when a job was considered to be causing idle GPU time.

Tapping into DCGM

DCGM is NVIDIA’s management and monitoring framework for data-center GPUs. It provides a powerful set of tools and APIs that let you observe, control, and optimize GPU resources at scale.

At its core, DCGM provides a variety of metrics and telemetry data, organized into structures called fields. Each field has a unique identifier and field number—together, they represent everything from GPU temperature and clock speeds to utilization and power draw. You can explore the complete list of available fields in the official documentation.

Here’s what these metrics typically cover:

  • GPU utilization metrics: Measure how actively a GPU is being used. These include indicators for core compute load, memory usage, I/O throughput, and power consumption, helping you see if a GPU is doing productive work or sitting idle.
  • GPU performance metrics: Reflect how efficiently a GPU is operating. Metrics such as clock speed, thermal status, and throttling events help assess performance and detect bottlenecks.

For the GPU waste metric, the DCGM_FI_DEV_GPU_UTIL field was used as the primary indicator of high-level GPU activity. Future iterations of the analysis are planned to transition to DCGM_FI_PROF_GR_ENGINE_ACTIVE to capture a more precise view of GPU engine utilization.
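For context, DCGM_FI_DEV_GPU_UTIL is one of the fields exposed by the NVIDIA DCGM Exporter. The short sketch below shows one hedged way to read it by scraping a node’s exporter endpoint directly; it assumes the exporter is running at its default address (port 9400), and in practice these samples would normally be collected by Prometheus rather than fetched ad hoc.

```python
# Sketch: scrape DCGM_FI_DEV_GPU_UTIL from a node's dcgm-exporter endpoint.
# The URL assumes the exporter's default listen address; adjust it for your
# deployment.
import urllib.request

EXPORTER_URL = "http://localhost:9400/metrics"

def read_gpu_util(url: str = EXPORTER_URL) -> dict[str, float]:
    """Return {gpu labels -> utilization %} parsed from the exporter output."""
    utilization = {}
    with urllib.request.urlopen(url, timeout=5) as resp:
        for raw in resp.read().decode().splitlines():
            # Lines look like: DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-..."} 93
            if raw.startswith("DCGM_FI_DEV_GPU_UTIL{"):
                labels, value = raw.rsplit(" ", 1)
                utilization[labels] = float(value)
    return utilization

if __name__ == "__main__":
    for gpu, util in read_gpu_util().items():
        print(f"{gpu}: {util:.0f}% utilized")
```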

What classifies a job as idle?

AI and machine-learning (ML) workloads often include periods where the GPU is not actively used, either due to infrastructure inefficiencies or the natural behavior of the workload. Several common scenarios were observed:

  • Container downloads: Job startup can stall while containers are pulled across multiple hosts, especially under heavy load or slow registry performance.
  • Data loading and initialization: Training workflows may wait on data retrieval from storage before GPU compute begins.
  • Checkpoint reads and writes: Utilization can drop during checkpointing operations.
  • Model-specific behavior: Some model types simply do not fully utilize the GPU by design.

To account for these cases, a threshold for prolonged inactivity was established. A conservative definition was used: A workload was considered idle when a full hour of continuous GPU inactivity was detected.
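As a minimal sketch of that rule, the tracker below flags a job only after an unbroken hour of near-zero utilization samples; the one-minute sample interval and 5% cutoff are illustrative assumptions, not the production settings.

```python
# Sketch of the idle rule described above: a job is flagged only after a
# full hour of continuous near-zero GPU activity.
from dataclasses import dataclass

SAMPLE_INTERVAL_SEC = 60   # assumed DCGM sampling period
IDLE_UTIL_PCT = 5          # below this, a sample counts as inactive
IDLE_WINDOW_SEC = 3600     # one hour of continuous inactivity

@dataclass
class IdleTracker:
    idle_seconds: int = 0

    def observe(self, gpu_util: float) -> bool:
        """Feed one utilization sample; return True once the job qualifies as idle."""
        if gpu_util < IDLE_UTIL_PCT:
            self.idle_seconds += SAMPLE_INTERVAL_SEC
        else:
            self.idle_seconds = 0  # any real activity resets the window
        return self.idle_seconds >= IDLE_WINDOW_SEC
```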

Services and tools for analyzing GPU cluster efficiency

Once the GPU waste metric was established, the focus shifted toward making the data usable. The goal was not only to surface idle behavior but to expose it in a way that allowed researchers and platform teams to quickly understand the source of inefficiencies. To support this, several visualization layers and operational tools were built to turn the underlying telemetry into clear signals and automated interventions.

GPU waste metrics were surfaced through two primary interfaces:

  • User portal: An internal NVIDIA portal where ML researchers could view cluster-, user-, and job-level GPU usage, making idle patterns far easier to recognize.
  • OneLogger: A unified monitoring layer that correlated job phases with GPU telemetry, giving users clearer visibility into where inefficiencies emerged.

Together, these tools made GPU waste more transparent and actionable.

Tooling: Idle GPU job reaper

We developed a service to identify and clean up jobs that were no longer using their GPUs, essentially providing self-cleaning behavior for the fleet. Because the cluster runs highly diverse workloads with no shared abstraction layer, users were given the ability to tune the reaper’s thresholds to match the expected idle characteristics of their jobs. This allowed the system to distinguish between predictable idle phases and genuine waste.

At a high level, the service:

  • Monitored GPU utilization through DCGM metrics.
  • Flagged jobs with prolonged periods of inactivity.
  • Terminated those jobs and reclaimed the idle GPUs.
  • Logged and reported the recovered capacity and user configurations to drive further improvements.

This approach ensured that both expected and unexpected idle patterns could be handled consistently across the fleet.
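The sketch below shows the general shape of such a reaper loop, not the actual service: get_running_jobs and gpu_util_for_job are hypothetical helpers standing in for Slurm and DCGM queries, and the per-user idle limits approximate the tunable thresholds described above.

```python
# Simplified reaper-loop sketch. get_running_jobs() and gpu_util_for_job()
# are hypothetical helpers; cancellation uses Slurm's scancel command.
import subprocess
import time

DEFAULT_IDLE_LIMIT_SEC = 3600

def reap_idle_jobs(get_running_jobs, gpu_util_for_job, idle_limits: dict[str, int]):
    idle_since: dict = {}
    while True:
        for job in get_running_jobs():  # e.g. [{"job_id": 123, "user": "alice"}, ...]
            busy = any(u >= 5 for u in gpu_util_for_job(job["job_id"]))
            if busy:
                idle_since.pop(job["job_id"], None)
                continue
            started = idle_since.setdefault(job["job_id"], time.time())
            limit = idle_limits.get(job["user"], DEFAULT_IDLE_LIMIT_SEC)
            if time.time() - started >= limit:
                # Reclaim the GPUs and record the action for reporting.
                subprocess.run(["scancel", str(job["job_id"])], check=False)
                print(f"reaped idle job {job['job_id']} (user {job['user']})")
                idle_since.pop(job["job_id"], None)
        time.sleep(60)
```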

Tooling: Job linter

We created a job-linting tool to detect misconfigured workloads—for example, jobs requesting exclusive access to all GPUs on a node but only using a subset, leaving the remaining devices idle. Future versions of the linter are planned to expand coverage to a broader set of misconfiguration patterns.
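One check of this kind can be approximated by comparing the GPUs Slurm allocated to a job against the GPUs DCGM observed doing any work. The helper below is a hedged sketch of that comparison; its inputs are assumed to come from the telemetry pipeline described earlier, and the suggested remediation text is illustrative.

```python
# Sketch of a single linter check: flag jobs that hold more GPUs than they use.
def lint_exclusive_job(job_id: str, allocated_gpus: set[str], active_gpus: set[str]) -> list[str]:
    """Return human-readable findings for a job leaving allocated GPUs idle."""
    findings = []
    unused = allocated_gpus - active_gpus
    # Partially used allocations suggest misconfiguration (fully idle jobs are
    # the reaper's territory).
    if unused and len(unused) < len(allocated_gpus):
        findings.append(
            f"job {job_id}: {len(unused)} of {len(allocated_gpus)} allocated GPUs "
            f"showed no activity; consider dropping exclusive node access or "
            f"requesting only the GPUs you need (e.g. --gpus-per-node)."
        )
    return findings
```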

Tooling: Defunct jobs

Time-limited jobs in the cluster often led users to submit long chains of follow-on jobs that waited in the queue with reserved resources, even when they were no longer needed. Compounding the problem, any regression in a user’s job would be repeated across large numbers of re-runs. These defunct submissions consumed scheduling cycles and introduced unnecessary overhead. Tooling was built to automatically detect and cancel such redundant jobs, reducing waste and improving overall scheduling efficiency.
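A simplified sketch of that cleanup is shown below: it lists a user’s pending jobs with squeue, groups them by job name as a stand-in for a re-run chain, and cancels everything beyond a small queued backlog. The grouping heuristic, the keep-two policy, and the JSON field names are assumptions for illustration, not the production logic.

```python
# Sketch: cancel redundant queued follow-on jobs for one user.
import json
import subprocess
from collections import defaultdict

MAX_QUEUED_PER_CHAIN = 2  # assumed policy: keep at most two pending re-runs

def cancel_defunct_jobs(user: str, dry_run: bool = True):
    # squeue --json is available in recent Slurm releases; field names below
    # are assumptions and may differ across versions.
    out = subprocess.run(["squeue", "--user", user, "--states=PENDING", "--json"],
                         capture_output=True, text=True, check=True)
    pending = defaultdict(list)
    for job in json.loads(out.stdout)["jobs"]:
        pending[job["name"]].append(job["job_id"])
    for name, job_ids in pending.items():
        # Keep the earliest submissions in each chain, cancel the rest.
        for job_id in sorted(job_ids)[MAX_QUEUED_PER_CHAIN:]:
            print(f"cancelling redundant queued job {job_id} ({name})")
            if not dry_run:
                subprocess.run(["scancel", str(job_id)], check=False)
```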

Lessons learned and next steps

Small inefficiencies compound quickly at scale. Once the right metrics were exposed, visibility alone drove a natural shift toward accountability and better behavior across teams. Beyond metrics, researchers also require actionable guidance on how to improve the efficiency of their workloads. Broad adoption of these practices was necessary to achieve fleet-level impact. Monitoring tools needed to integrate directly into the daily workflow to be effective; making utilization insights available at job submission time and within experiment-tracking interfaces proved essential.

Through these efforts, GPU waste was reduced from roughly 5.5% to about 1%, a substantial improvement that translated into meaningful cost savings and increased availability of GPUs for high-priority workloads. These gains demonstrated how operational inefficiencies, once surfaced and addressed, can return significant capacity back to the fleet.

The measurement process also surfaced a number of infrastructure gaps that contribute to idle behavior. Several improvements are planned to further reduce waste, including faster container loading, data caching, support for long-running jobs, and enhanced debugging tools.

Start instrumenting and monitoring DCGM metrics today. These signals reveal where GPU cycles are being wasted and provide the foundation for building simple, actionable tooling that helps researchers optimize their jobs and keep GPUs consistently utilized.

Mohamed Fawzy, Mohammed Elshall, Bugra Gedik, Michael Hale, Kaiwen Shi, Vishal Patil, and Ashita Kulkarni contributed to the research described in this blog.
