Training AI models on massive GPU clusters presents significant challenges for model builders. Because manual intervention becomes impractical as job scale increases, automation is critical to maintaining high GPU utilization and training productivity. An exceptional training experience requires resilient systems that provide low-latency error attribution and automatic failover based on root cause analysis.
Automation isn’t new. Health checks, preflight checks, syslogs, and telemetry all exist for various hardware and software components. Unfortunately, most of these are opaque to end users and difficult to access and use as first-line tools.
What happens in most cases is that model builders are the first to encounter problems with a training run. They must engage with the infrastructure and operations teams to gather the data needed to triage issues: for example, to determine whether an error is due to hardware or software, and whether it is intermittent or persistent.
This costly manual intervention process slows down overall development cycles and hinders rapid experimentation. As researchers scale up their experiments, the combinatorial complexity of the systems involved also exacerbates this problem.
This post covers achieving reliable, efficient training of large language models (LLMs) on NVIDIA DGX Cloud. We introduce some of the challenges that our teams encountered while training NVIDIA Llama Nemotron and other foundation models, highlight opportunities for more resilient training, and show some of the ideas behind how DGX Cloud achieves <1% hardware downtime for training runs at 2K to 10K GPU scale.
Minimizing downtime
As a model builder, when you encounter an error during training, the key challenge is identifying the cause, locating the issue, and finding a way to keep the job moving forward to avoid delays. These delays are further exacerbated in environments where engineer intervention is required for recovery, often adding hours to triage and remediation.
Because these interventions and delays significantly impact productivity, it is critical to have a metric that reflects your actual experience and the friction you face in bringing training back online.
Traditional metrics such as model FLOPs utilization (MFU), which primarily reflects hardware utilization, and mean time to failure (MTTF), which measures the average time between failures, focus on infrastructure efficiency rather than the complete training experience. They do not fully account for your perspective: the time lost not only to crashes but also to checkpointing, work lost after errors, and restarts.
To capture these real-world costs and pain points, we focus on downtime, the total unproductive training time from the model builder’s viewpoint. This includes the following:
- Checkpoint time: The training loop is blocked to save checkpoints.
- Lost work: Iterations lost after the last checkpoint before shutdown.
- Shutdown time: From the last iteration until the system stops.
- Restart time: From job initiation until productive training begins again.
These are all affected when a hardware or infrastructure failure occurs. As we look to improve the developer experience of training, downtime is therefore the key metric to minimize.
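To make this concrete, the following minimal sketch (with hypothetical field names and example numbers) sums those four components across failure events and reports downtime as a fraction of a run's wall-clock time:

```python
from dataclasses import dataclass

@dataclass
class FailureEvent:
    """Unproductive time (in seconds) attributed to one failure."""
    checkpoint_time: float  # training loop blocked while saving checkpoints
    lost_work: float        # iterations recomputed since the last checkpoint
    shutdown_time: float    # last useful iteration until the job stops
    restart_time: float     # job resubmission until productive training resumes

def downtime_fraction(events: list[FailureEvent], wall_clock_seconds: float) -> float:
    """Total unproductive time as a fraction of the run's wall-clock time."""
    total = sum(
        e.checkpoint_time + e.lost_work + e.shutdown_time + e.restart_time
        for e in events
    )
    return total / wall_clock_seconds

# Example values only: two failures over a 24-hour run
events = [
    FailureEvent(checkpoint_time=120, lost_work=900, shutdown_time=300, restart_time=1500),
    FailureEvent(checkpoint_time=120, lost_work=300, shutdown_time=180, restart_time=900),
]
print(f"downtime: {downtime_fraction(events, 24 * 3600):.2%}")
```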
Reducing downtime requires both reactive and proactive systems throughout training. At scale, errors are inevitable, and the speed of detection and recovery is critical. For both application and hardware failures, error attribution is key.
The system must determine whether an issue requires user intervention or can be resolved automatically, for example, by excluding bad nodes and auto-restarting or by retrying multiple times before notifying the user. In this post, we focus primarily on improving error attribution, leaving recovery time and specific automation techniques for future study.
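As an illustration of that decision, the short sketch below encodes one possible policy; the error classes, retry limit, and action names are hypothetical and not the actual DGX Cloud automation:

```python
def handle_failure(error_class: str, retry_count: int, max_retries: int = 3) -> str:
    """Decide the next action for a failed training job.

    error_class is assumed to come from an error-attribution step:
    'hardware', 'transient', or 'application'.
    """
    if error_class == "hardware":
        # Exclude the suspect node(s) from the next allocation and restart.
        return "exclude_nodes_and_restart"
    if error_class == "transient" and retry_count < max_retries:
        # Intermittent issue (e.g., a brief network or storage blip): retry as-is.
        return "restart_in_place"
    # Persistent or application-level failure: escalate to the user.
    return "notify_user"
```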
Error attribution
For error attribution, we broadly categorize the kind of errors that researchers encounter into the following main buckets:
- Immediate crashes: Stem from hardware faults such as BIOS, power-supply, or thermal issues; uncorrectable ECC errors; silent data corruption (NaNs in intermediate results); or network instability (link flapping).
- Hangs in communication libraries: Often manifest as PyTorch NCCL watchdog errors and Transformer Engine communication hangs. Hangs are often due to cascading dependencies in data transfer from the filesystem (for example, for input data) and of tensors (for example, gradients and intermediate activations) across the east-west (E/W) network. This highlights the need for robust fault tolerance, containment, and early detection mechanisms within libraries and applications.
- Speed regressions: These encompass both transient slowdowns (for example, temporary network or storage issues) and persistent bottlenecks (for example, a consistently slow GPU in a large cluster). These regressions can significantly affect overall training speed and efficiency.
While such failures can stem from underlying hardware, infrastructure, or software issues, from your perspective, they typically show up as abrupt interruptions or significant slowdowns during training. By recognizing these common failure modes, we can better develop solutions and processes that enable researchers to maintain momentum and keep workflows moving forward.
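A first pass at attributing these failures from logs can be as simple as pattern matching. The sketch below maps a few illustrative log signatures to the first two buckets (speed regressions need timing telemetry rather than log text); the patterns are examples, not an exhaustive rule set:

```python
import re

# Illustrative log signatures for each failure bucket; real rules would be
# maintained alongside the training stack and telemetry pipeline.
PATTERNS = {
    "immediate_crash": [
        r"uncorrectable ECC error",
        r"CUDA error",
        r"loss is NaN",
    ],
    "communication_hang": [
        r"NCCL watchdog.*timed out",
        r"Watchdog caught collective operation timeout",
    ],
}

def classify_log_line(line: str) -> str | None:
    """Return the failure bucket a log line matches, if any."""
    for bucket, patterns in PATTERNS.items():
        if any(re.search(p, line, re.IGNORECASE) for p in patterns):
            return bucket
    return None  # speed regressions are detected from timing data, not log text
```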

Figure 1 highlights the distribution of different failure types in a 6K GPU training run. While these failures often present themselves to the researcher as a single error, identifying the root cause requires more thorough analysis. We found that correlating cluster, node, and application telemetry improves the speed and accuracy of root-cause identification, enabling effective remediation strategies.
Cluster telemetry
This telemetry covers storage servers (including metadata and read/write operations) and network switches. This visibility is crucial because a failure in one node can often spread to other nodes through communication calls, passing corrupted gradients, or overloading the storage system.
For example, if a single job overwhelms a storage node by generating excessive metadata operations, other jobs or nodes may experience performance issues as a result.
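One simple way to act on this cluster-level signal is a threshold alert on per-job metadata-operation rates, sketched below with hypothetical telemetry fields and an illustrative threshold:

```python
def find_metadata_hogs(job_metadata_ops: dict[str, float],
                       threshold_ops_per_sec: float = 50_000.0) -> list[str]:
    """Flag jobs whose metadata-operation rate on the storage system
    exceeds a threshold (field names and threshold are illustrative)."""
    return [
        job_id
        for job_id, ops_per_sec in job_metadata_ops.items()
        if ops_per_sec > threshold_ops_per_sec
    ]

# Example: per-job rates scraped from storage-server telemetry
rates = {"job-101": 1_200.0, "job-102": 85_000.0, "job-103": 4_300.0}
print(find_metadata_hogs(rates))  # ['job-102']
```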
Node telemetry
Periodic health checks at the node level ensure that key hardware and software components such as GPUs, CPUs, memory, network, storage, and services are functioning correctly. Preliminary checks before a job starts also validate hardware status, verify software dependencies, and configure the environment for the task.
This early detection of potential issues reduces debugging time and improves overall reliability. After a job completes, cleanup routines reclaim resources, store logs, and restore the system to a clean state. These pre- and post-job routines are commonly implemented as prolog and epilog scripts.
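As a rough illustration, a prolog-style health check might query a few standard GPU fields before the job launches, as in the sketch below; it assumes nvidia-smi is available on the node, and the expected GPU count and thresholds are placeholders:

```python
import subprocess
import sys

def gpu_prolog_check(expected_gpus: int = 8, max_temp_c: int = 85) -> bool:
    """Lightweight pre-job check: correct GPU count, sane temperatures,
    and no volatile uncorrectable ECC errors."""
    lines = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,ecc.errors.uncorrected.volatile.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip().splitlines()

    if len(lines) != expected_gpus:
        return False
    for line in lines:
        _, temp, ecc = (field.strip() for field in line.split(","))
        if int(temp) > max_temp_c or (ecc.isdigit() and int(ecc) > 0):
            return False
    return True

if __name__ == "__main__":
    # A non-zero exit lets the scheduler hold or drain the node before the job runs.
    sys.exit(0 if gpu_prolog_check() else 1)
```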
Application logs
Applications have critical knowledge of the key control points, invariants, and measures of progress, including system errors and performance patterns. They provide one of the strongest signals for error attribution, especially when correlated with historical data in a central repository to spot recurring failures over time.
For example, these logs help determine if there are stragglers or hangs, or if a failure is intermittent or recurring. Certain errors, such as a NaN error, that appear deterministically at the same iteration and rank but on different physical nodes are likely application errors. The same error recurring on one node without such a pattern, however, can indicate a more serious hardware failure.
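That heuristic can be written down roughly as follows. The record fields are hypothetical, but the logic mirrors the rule of thumb above: the same iteration and rank across different nodes points at the application, while the same node recurring without that pattern points at hardware:

```python
from collections import Counter
from typing import NamedTuple

class ErrorRecord(NamedTuple):
    node: str        # physical host the failing rank ran on
    rank: int        # training rank that reported the error
    iteration: int   # training iteration at which the error appeared

def attribute_nan_errors(records: list[ErrorRecord]) -> str:
    """Crude attribution heuristic for recurring NaN errors."""
    if not records:
        return "inconclusive"
    same_step_and_rank = len({(r.rank, r.iteration) for r in records}) == 1
    distinct_nodes = len({r.node for r in records}) > 1
    if same_step_and_rank and distinct_nodes:
        # Deterministic at the same iteration/rank on different hosts:
        # most likely an application or data bug.
        return "likely_application"
    node_counts = Counter(r.node for r in records)
    if node_counts.most_common(1)[0][1] > 1:
        # The same physical node keeps failing without a deterministic
        # pattern: suspect the hardware and schedule deeper checks.
        return "suspect_hardware"
    return "inconclusive"
```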
Unified telemetry
Analyzing this temporal data across both intra-job (within a single job) and inter-job (across multiple jobs) contexts helps identify recurring issues, detect patterns, and take proactive rather than reactive measures.
This unified telemetry is shared across both operations teams and researchers through recommendations, alerts, and visualizations, ensuring that both groups have a common view of system behavior and failure patterns.
This cross-pollination of telemetry means that researchers can leverage infrastructure data to improve debugging, while the operations team uses application insights to improve system automation and reduce hardware downtime.
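For the inter-job view, even a simple rollup like the hypothetical sketch below can surface nodes that keep appearing in failure reports across otherwise unrelated jobs, a pattern that is invisible from inside any single run:

```python
from collections import defaultdict

def chronic_nodes(failures: list[dict], min_jobs: int = 3) -> list[str]:
    """Return nodes implicated in failures across at least `min_jobs`
    distinct jobs (record schema and threshold are illustrative)."""
    jobs_per_node: defaultdict[str, set[str]] = defaultdict(set)
    for f in failures:
        jobs_per_node[f["node"]].add(f["job_id"])
    return sorted(
        node for node, jobs in jobs_per_node.items() if len(jobs) >= min_jobs
    )

# Example failure records aggregated in a central repository
failures = [
    {"job_id": "a", "node": "node-17"},
    {"job_id": "b", "node": "node-17"},
    {"job_id": "c", "node": "node-17"},
    {"job_id": "c", "node": "node-02"},
]
print(chronic_nodes(failures))  # ['node-17']
```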
While downtime is a function of scale, for training runs using <10K GPUs, we’ve achieved less than 1% downtime due to hardware failures with these techniques. Our results were achieved on NVIDIA DGX Cloud across Nemotron model family training runs from 2024-2025. We calculated hardware downtime as the percent of total downtime attributable to hardware failure, averaged over all jobs.
Conclusion
We’ve found that end-to-end resilience requires a holistic view. High uptime depends on a comprehensive approach that spans both infrastructure and developer experience.
This approach bridges application and infrastructure, improving debugging speed and accuracy while enabling a more proactive system. It helps researchers resolve issues efficiently at the time of failure and reduces friction in diagnosing recurring problems.
For model builders, a robust error attribution system is essential for effective automation. Such automation enables researchers to train models without monitoring jobs around the clock.
But more than keeping GPUs fully utilized, this helps you and other researchers on DGX Cloud focus on what truly matters: developing models, advancing science, and leaving the heavy lifting to us.
For more information about training resilience services on DGX Cloud, sign up for the Cloud-Native Approach to Achieving Training Resilience and Efficiency at Extreme Scale GTC session. For more information about how we can accelerate your pre- and post-training workloads, contact us.