Training AI models on massive GPU clusters presents significant challenges for model builders. Because manual intervention becomes impractical as job scale increases, automation is critical to maintaining high GPU utilization and training productivity. An exceptional training experience requires resilient systems that provide low-latency error attribution and automatic failover based on root cause analysis.
Automation isn’t new. Health checks, preflight checks, syslogs, and telemetry all exist for various hardware and software components. Unfortunately, most of these are opaque to end users and difficult to access and use as first-line tools.
What happens in most cases is that model builders are the first to encounter problems with a training run. They must engage with the infrastructure and operations teams to gather the data needed to triage issues: for example, to determine whether an error is due to hardware or software, and whether it is intermittent or persistent.
This costly manual intervention process slows down overall development cycles and hinders rapid experimentation. As researchers scale up their experiments, the combinatorial complexity of the systems involved also exacerbates this problem.
This post covers achieving reliable, efficient training of large language models (LLMs) on NVIDIA DGX Cloud. We introduce some of the challenges that our teams encountered while training NVIDIA Llama Nemotron and other foundation models, highlight opportunities for more resilient training, and show some of the ideas behind how DGX Cloud achieves <1% hardware downtime for training runs at 2K to 10K GPU scale.
Minimizing downtime
As a model builder, when you encounter an error during training, the key challenge is identifying the cause, locating the issue, and finding a way to keep the job moving forward to avoid delays. These delays are further exacerbated in environments where engineer intervention is required for recovery, often adding hours to triage and remediation.
Because these interventions and delays significantly impact productivity, it is critical to have a metric that reflects your actual experience and the friction you face in bringing training back online.
Traditional metrics such as model FLOPs utilization (MFU), which primarily reflects hardware utilization, and mean time to failure (MTTF), which measures the average time between failures, focus on infrastructure efficiency rather than the complete training experience. They do not fully account for your perspective: the time lost not only to crashes but also to checkpointing, work lost after errors, and restarts.
To capture these real-world costs and pain points, we focus on downtime, the total unproductive training time from the model builder’s viewpoint. This includes the following:
- Checkpoint time: The training loop is blocked to save checkpoints.
- Lost work: Iterations lost after the last checkpoint before shutdown.
- Shutdown time: From the last iteration until the system stops.
- Restart time: From job initiation until productive training begins again.
These are all affected when a hardware or infrastructure failure occurs. As we look to improve the developer experience of training, downtime is therefore the key metric to minimize.
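To make this concrete, the following minimal sketch (with hypothetical field names and example numbers) sums those four components across failure events and reports downtime as a fraction of a run's wall-clock time:

```python
from dataclasses import dataclass

@dataclass
class FailureEvent:
    """Unproductive time (in seconds) attributed to one failure."""
    checkpoint_time: float  # training loop blocked while saving checkpoints
    lost_work: float        # iterations recomputed since the last checkpoint
    shutdown_time: float    # last useful iteration until the job stops
    restart_time: float     # job resubmission until productive training resumes

def downtime_fraction(events: list[FailureEvent], wall_clock_seconds: float) -> float:
    """Total unproductive time as a fraction of the run's wall-clock time."""
    total = sum(
        e.checkpoint_time + e.lost_work + e.shutdown_time + e.restart_time
        for e in events
    )
    return total / wall_clock_seconds

# Example values only: two failures over a 24-hour run
events = [
    FailureEvent(checkpoint_time=120, lost_work=900, shutdown_time=300, restart_time=1500),
    FailureEvent(checkpoint_time=120, lost_work=300, shutdown_time=180, restart_time=900),
]
print(f"downtime: {downtime_fraction(events, 24 * 3600):.2%}")
```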
Reducing downtime requires both reactive and proactive systems throughout training. At scale, errors are inevitable, and the speed of detection and recovery is critical. For both application and hardware failures, error attribution is key.
The system must determine whether an issue requires user intervention or can be resolved automatically, for example, by excluding bad nodes and auto-restarting or by retrying multiple times before notifying the user. In this post, we focus primarily on improving error attribution, leaving recovery time and specific automation techniques for future study.
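As an illustration of that decision, the short sketch below encodes one possible policy; the error classes, retry limit, and action names are hypothetical and not the actual DGX Cloud automation:

```python
def handle_failure(error_class: str, retry_count: int, max_retries: int = 3) -> str:
    """Decide the next action for a failed training job.

    error_class is assumed to come from an error-attribution step:
    'hardware', 'transient', or 'application'.
    """
    if error_class == "hardware":
        # Exclude the suspect node(s) from the next allocation and restart.
        return "exclude_nodes_and_restart"
    if error_class == "transient" and retry_count < max_retries:
        # Intermittent issue (e.g., a brief network or storage blip): retry as-is.
        return "restart_in_place"
    # Persistent or application-level failure: escalate to the user.
    return "notify_user"
```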
Error attribution
For error attribution, we broadly categorize the kind of errors that researchers encounter into the following main buckets:
- Immediate crashes: Stem from hardware faults such as BIOS, power-supply, or thermal issues; uncorrectable ECC errors; silent data corruption (NaNs in intermediate results); or network instability (link flapping).
- Hangs in communication libraries: Often manifest as PyTorch NCCL watchdog errors and Transformer Engine communication hangs. Hangs are often due to cascading dependencies in data transfer from the filesystem (for example, for input data) and of tensors (for example, gradients and intermediate activations) across the east-west (E/W) network. This highlights the need for robust fault tolerance, containment, and early detection mechanisms within libraries and applications.
- Speed regressions: These encompass both transient slowdowns (for example, temporary network or storage issues) and persistent bottlenecks (for example, a consistently slow GPU in a large cluster). These regressions can significantly affect overall training speed and efficiency.
While such failures can stem from underlying hardware, infrastructure, or software issues, from your perspective, they typically show up as abrupt interruptions or significant slowdowns during training. By recognizing these common failure modes, we can better develop solutions and processes that enable researchers to maintain momentum and keep workflows moving forward.
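A first pass at attributing these failures from logs can be as simple as pattern matching. The sketch below maps a few illustrative log signatures to the first two buckets (speed regressions need timing telemetry rather than log text); the patterns are examples, not an exhaustive rule set:

```python
import re

# Illustrative log signatures for each failure bucket; real rules would be
# maintained alongside the training stack and telemetry pipeline.
PATTERNS = {
    "immediate_crash": [
        r"uncorrectable ECC error",
        r"CUDA error",
        r"loss is NaN",
    ],
    "communication_hang": [
        r"NCCL watchdog.*timed out",
        r"Watchdog caught collective operation timeout",
    ],
}

def classify_log_line(line: str) -> str | None:
    """Return the failure bucket a log line matches, if any."""
    for bucket, patterns in PATTERNS.items():
        if any(re.search(p, line, re.IGNORECASE) for p in patterns):
            return bucket
    return None  # speed regressions are detected from timing data, not log text
```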

Figure 1 highlights the distribution of different failure types in a 6K GPU training run. While these failures often present themselves to the researcher as a single error, identifying the root cause requires more thorough analysis. We found that correlating cluster, node, and application telemetry improves the speed and accuracy of root-cause identification, enabling effective remediation strategies.
Cluster telemetry
This telemetry covers storage servers (including metadata and read/write operations) and network switches. This visibility is crucial because a failure in one node can often spread to other nodes through communication calls, passing corrupted gradients, or overloading the storage system.
For example, if a single job overwhelms a storage node by generating excessive metadata operations, other jobs or nodes may experience performance issues as a result.
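One simple way to act on this cluster-level signal is a threshold alert on per-job metadata-operation rates, sketched below with hypothetical telemetry fields and an illustrative threshold:

```python
def find_metadata_hogs(job_metadata_ops: dict[str, float],
                       threshold_ops_per_sec: float = 50_000.0) -> list[str]:
    """Flag jobs whose metadata-operation rate on the storage system
    exceeds a threshold (field names and threshold are illustrative)."""
    return [
        job_id
        for job_id, ops_per_sec in job_metadata_ops.items()
        if ops_per_sec > threshold_ops_per_sec
    ]

# Example: per-job rates scraped from storage-server telemetry
rates = {"job-101": 1_200.0, "job-102": 85_000.0, "job-103": 4_300.0}
print(find_metadata_hogs(rates))  # ['job-102']
```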
Node telemetry
Periodic health checks at the node level ensure that key hardware and software components such as GPUs, CPUs, memory, network, storage, and services are functioning correctly. Preliminary checks before a job starts also validate hardware status, verify software dependencies, and configure the environment for the task.
This early detection of potential issues reduces debugging time and improves overall reliability. After a job completes, cleanup routines reclaim resources, store logs, and restore the system to a clean state. These pre- and post-job routines are commonly implemented as prolog and epilog scripts.
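As a rough illustration, a prolog-style health check might query a few standard GPU fields before the job launches, as in the sketch below; it assumes nvidia-smi is available on the node, and the expected GPU count and thresholds are placeholders:

```python
import subprocess
import sys

def gpu_prolog_check(expected_gpus: int = 8, max_temp_c: int = 85) -> bool:
    """Lightweight pre-job check: correct GPU count, sane temperatures,
    and no volatile uncorrectable ECC errors."""
    lines = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,ecc.errors.uncorrected.volatile.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip().splitlines()

    if len(lines) != expected_gpus:
        return False
    for line in lines:
        _, temp, ecc = (field.strip() for field in line.split(","))
        if int(temp) > max_temp_c or (ecc.isdigit() and int(ecc) > 0):
            return False
    return True

if __name__ == "__main__":
    # A non-zero exit lets the scheduler hold or drain the node before the job runs.
    sys.exit(0 if gpu_prolog_check() else 1)
```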
Application logs
Applications have critical knowledge of the key control points, invariants, and measures of progress, including system errors and performance patterns. They provide one of the strongest signals for error attribution, especially when correlated with historical data in a central repository to spot recurring failures over time.
For example, these logs help determine if there are stragglers or hangs, or if a failure is intermittent or recurring. Certain errors, such as a NaN error, that appear deterministically at the same iteration and rank but on different physical nodes are likely application errors. The same error recurring on one node without such a pattern, however, can indicate a more serious hardware failure.
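That heuristic can be written down roughly as follows. The record fields are hypothetical, but the logic mirrors the rule of thumb above: the same iteration and rank across different nodes points at the application, while the same node recurring without that pattern points at hardware:

```python
from collections import Counter
from typing import NamedTuple

class ErrorRecord(NamedTuple):
    node: str        # physical host the failing rank ran on
    rank: int        # training rank that reported the error
    iteration: int   # training iteration at which the error appeared

def attribute_nan_errors(records: list[ErrorRecord]) -> str:
    """Crude attribution heuristic for recurring NaN errors."""
    if not records:
        return "inconclusive"
    same_step_and_rank = len({(r.rank, r.iteration) for r in records}) == 1
    distinct_nodes = len({r.node for r in records}) > 1
    if same_step_and_rank and distinct_nodes:
        # Deterministic at the same iteration/rank on different hosts:
        # most likely an application or data bug.
        return "likely_application"
    node_counts = Counter(r.node for r in records)
    if node_counts.most_common(1)[0][1] > 1:
        # The same physical node keeps failing without a deterministic
        # pattern: suspect the hardware and schedule deeper checks.
        return "suspect_hardware"
    return "inconclusive"
```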
Unified telemetry
Analyzing this temporal data across both intra-job (within a single job) and inter-job (across multiple jobs) contexts helps identify recurring issues, detect patterns, and take proactive rather than reactive measures.
This unified telemetry is shared across both operations teams and researchers through recommendations, alerts, and visualizations, ensuring that both groups have a common view of system behavior and failure patterns.
This cross-pollination of telemetry means that researchers can leverage infrastructure data to improve debugging, while the operations team uses application insights to improve system automation and reduce hardware downtime.
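For the inter-job view, even a simple rollup like the hypothetical sketch below can surface nodes that keep appearing in failure reports across otherwise unrelated jobs, a pattern that is invisible from inside any single run:

```python
from collections import defaultdict

def chronic_nodes(failures: list[dict], min_jobs: int = 3) -> list[str]:
    """Return nodes implicated in failures across at least `min_jobs`
    distinct jobs (record schema and threshold are illustrative)."""
    jobs_per_node: defaultdict[str, set[str]] = defaultdict(set)
    for f in failures:
        jobs_per_node[f["node"]].add(f["job_id"])
    return sorted(
        node for node, jobs in jobs_per_node.items() if len(jobs) >= min_jobs
    )

# Example failure records aggregated in a central repository
failures = [
    {"job_id": "a", "node": "node-17"},
    {"job_id": "b", "node": "node-17"},
    {"job_id": "c", "node": "node-17"},
    {"job_id": "c", "node": "node-02"},
]
print(chronic_nodes(failures))  # ['node-17']
```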
While downtime is a function of scale, for training runs using <10K GPUs, we’ve achieved less than 1% downtime due to hardware failures with these techniques. Our results were achieved on NVIDIA DGX Cloud across Nemotron model family training runs from 2024-2025. We calculated hardware downtime as the percent of total downtime attributable to hardware failure, averaged over all jobs.
Conclusion
We’ve found that end-to-end resilience requires a holistic view. High uptime depends on a comprehensive approach that spans both infrastructure and developer experience.
This approach bridges application and infrastructure, improving debugging speed and accuracy while enabling a more proactive system. It helps researchers resolve issues efficiently at the time of failure and reduces friction in diagnosing recurring problems.
For model builders, a robust error attribution system is essential for effective automation. Such automation enables researchers to train models without monitoring jobs around the clock.
But more than keeping GPUs fully utilized, this helps you and other researchers on DGX Cloud focus on what truly matters: developing models, advancing science, and leaving the heavy lifting to us.
For more information about training resilience services on DGX Cloud, sign up for the Cloud-Native Approach to Achieving Training Resilience and Efficiency at Extreme Scale GTC session. For more information about how we can accelerate your pre- and post-training workloads, contact us.