In the rapidly evolving landscape of AI systems and workloads, achieving optimal model training performance extends far beyond chip speed. It requires a comprehensive evaluation of the entire stack, from compute to networking to model framework.
Navigating the complexities of AI system performance can be difficult. There are many application changes that you can make, from precision to parallelism, but they currently require significant effort and specialized knowledge to implement effectively.
NVIDIA DGX Cloud Benchmarking Recipes offer guidance for boosting training performance by sharing what good performance looks like per workload and best practices for how to get there.
For more information, see NVIDIA DGX Cloud Benchmarking Recipes.
Evaluating an AI system holistically
DGX Cloud Benchmarking Recipes are an end-to-end benchmarking suite that can both measure performance in real-world environments and identify optimization opportunities in AI training workloads. These benchmarking recipes are crafted to evaluate performance for real-world AI applications, factoring in the complete AI stack.
Chip-centric metrics, such as peak floating-point operations per second (FLOPS) and bandwidth, can be inadequate for estimating end-to-end performance. Peak FLOPS has traditionally been used to compare platforms, but it is only one of many components that affect end-to-end application performance.
In practice, the training time for an AI model is a function of many other components, such as network, software, firmware, and underlying infrastructure.
For example, the high-bandwidth NVIDIA NVLink network fabric enables scaling parallelism strategies, such as tensor parallelism, beyond the traditional single-server, 8-GPU limit. With NVIDIA Grace-class systems, the NVLink networking layer enables you to achieve higher FLOPS in real-world applications, bridging the gap between theoretical and practical performance.
Evaluating AI platforms solely through FLOPS ignores the rest of the platform and can result in inaccurate estimates of total training time and the associated costs. For modern AI workloads such as fine-tuning Llama 3.1 models, it’s more accurate to use benchmarks that measure end-to-end performance across the entire system, providing a holistic view of how a platform will perform in actual usage scenarios.
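As a rough illustration of this gap, the sketch below compares a training-time estimate that assumes every GPU sustains its peak rating against one that applies an assumed model FLOPS utilization (MFU). The parameter count, token budget, GPU count, peak rating, and MFU value are placeholder assumptions for illustration, not measured DGX Cloud results.

```python
# Back-of-envelope comparison: training-time estimates from peak FLOPS
# versus an assumed delivered utilization. All numbers are illustrative
# placeholders, not measured DGX Cloud results.

params = 70e9                      # model parameters (70B-class model, assumed)
tokens = 1.0e12                    # training token budget (assumed)
flops_per_token = 6 * params       # common approximation for dense training (fwd + bwd)

num_gpus = 1024
peak_flops_per_gpu = 1.0e15        # assumed peak dense FLOPS per GPU

# Naive estimate: assume every GPU sustains its peak rating.
naive_days = (flops_per_token * tokens) / (num_gpus * peak_flops_per_gpu) / 86400

# More realistic estimate: apply an assumed model FLOPS utilization (MFU)
# that folds in networking, data loading, and software overheads.
assumed_mfu = 0.45
realistic_days = naive_days / assumed_mfu

print(f"peak-FLOPS estimate:   {naive_days:.1f} days")
print(f"estimate at {assumed_mfu:.0%} MFU: {realistic_days:.1f} days")
```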

Infrastructure factors affecting performance include the following:
- Server hardware designs
- Operating systems
- Virtualization layers
- Software stacks
- Network architectures
- Storage implementations
AI workload factors affecting performance include the following:
- Compute-to-communication ratio
- Model scaling factors
- Batch size
- Precision format
- Data loading strategies
Tuning workloads to optimal performance
Beyond the job execution aspect of benchmarking, NVIDIA DGX Cloud Benchmarking Recipes are also playbooks for optimizing popular models and workloads. These recipes provide workload-specific strategies to maximize performance for popular models such as Llama 3.1, Grok, and Mixtral.
| Workload | Type | Description | Container version | Dataset | Max scale (# GPUs) | DType |
| --- | --- | --- | --- | --- | --- | --- |
| Nemotron4 | Training | 15B and 340B benchmarks | 24.09 | Synthetic | 2048 | FP8, BF16 |
| NeMo Megatron | Training | 175B benchmarks | 24.05 | Pile | 2048 | FP8, BF16 |
| Llama 3.1 | Training | 8B, 70B, and 405B benchmarks | 24.09 | Pile | 2304 | FP8, BF16 |
| PaXML | Training | 5B and 175B benchmarks | 24.03.04 | Synthetic | 2048 | FP8, BF16 |
| MaxText | Training | Llama 2 70B benchmarks | 2024.12.09 | Synthetic | 2048 | FP8, BF16 |
| Grok1 | Training | Grok1 314B benchmarks | 24.09 | Synthetic | 2048 | FP8, BF16 |
| Llama 2 | Fine-tuning | Hugging Face 70B benchmarks | 24.02 | HF Llama 2 | 512 | BF16 |
| Mistral | Fine-tuning | Hugging Face 7B benchmarks | 24.02 | HF Mistral | 256 | BF16 |
Table 1. Workloads covered by DGX Cloud Benchmarking Recipes
As shown in Table 1, the workloads cover both training and fine-tuning and support both FP8 and BF16 where possible.
Each training workload has a different fingerprint of how it exercises the platform. A basic question you might ask about a workload’s fingerprint is, “How much does compute time overlap with communication or networking time?”
Some models are more compute-bound and others more communication-bound, depending on the choice of parallelism and on hyperparameters such as sequence length and batch size. Scaling behavior also varies between models as the number of GPUs increases, and it depends on whether the scaling is weak or strong.
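One way to reason about this fingerprint is a first-order estimate of per-step compute time versus data-parallel gradient all-reduce time, as in the sketch below. The model size, tokens per step, sustained FLOPS, and fabric bandwidth are assumptions for illustration, not recipe values.

```python
# First-order estimate of the compute-to-communication ratio for a
# data-parallel training step. All figures are illustrative assumptions.

params = 8e9                          # model parameters (assumed)
tokens_per_step_per_gpu = 8 * 8192    # micro-batch size x sequence length (assumed)
delivered_flops_per_gpu = 4.0e14      # assumed sustained FLOPS per GPU

# Compute time: ~6 FLOPs per parameter per token (dense forward + backward).
compute_s = 6 * params * tokens_per_step_per_gpu / delivered_flops_per_gpu

# Communication time: a ring all-reduce of BF16 gradients moves roughly
# 2 bytes/param x 2 traversals per GPU across the fabric.
fabric_bandwidth_Bps = 400e9 / 8      # assumed 400 Gb/s effective link, in bytes/s
comm_s = (2 * params * 2) / fabric_bandwidth_Bps

ratio = compute_s / comm_s
print(f"compute ~{compute_s*1e3:.0f} ms, comm ~{comm_s*1e3:.0f} ms, ratio ~{ratio:.1f}")
# A high ratio suggests a compute-bound step where communication can be hidden
# behind compute; a low ratio suggests the interconnect will dominate scaling.
```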
For each workload and cluster scale, you must tune your model and system to achieve optimal performance.
On the model side, this may involve adjusting the parallelism strategy, batch sizes, precision formats, and data loading strategies, among other configurations. On the system side, make sure that the workload makes maximal use of the high NVLink bandwidth (for example, for tensor and context parallelism) and confirm that the scale-out fabric is not a bottleneck for the corresponding networking collectives (for example, for pipeline or expert parallelism).
The latter requires a fabric that provides low transport latency (RDMA), effective congestion management, and adaptive routing, as found in the reference NVIDIA Spectrum-X and InfiniBand networking architectures. These technologies are essential for efficient scaling of AI workloads because they mitigate the impact of jitter, ensuring consistent performance and reliability.
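The sketch below illustrates that mapping rule with a toy check: the bandwidth-hungry dimensions (tensor and context parallelism) are kept inside an assumed 8-GPU NVLink domain, while pipeline and data parallelism span the scale-out fabric. The layout values and the domain size are illustrative assumptions, not recipe defaults.

```python
# Toy sanity check for a parallelism layout: keep tensor/context parallelism
# inside the NVLink domain and let pipeline/data parallelism span the
# scale-out fabric. The numbers are illustrative assumptions, not defaults.

gpus_per_nvlink_domain = 8           # for example, one 8-GPU server (assumed)

layout = {
    "tensor_parallel": 4,
    "context_parallel": 2,
    "pipeline_parallel": 8,
    "data_parallel": 16,
}

total_gpus = 1
for size in layout.values():
    total_gpus *= size

intra_domain = layout["tensor_parallel"] * layout["context_parallel"]
assert intra_domain <= gpus_per_nvlink_domain, (
    "tensor x context parallelism should fit inside the NVLink domain"
)

scale_out_groups = layout["pipeline_parallel"] * layout["data_parallel"]
print(f"total GPUs: {total_gpus}")
print(f"{scale_out_groups} NVLink-domain groups of {intra_domain} GPUs "
      f"communicate over the scale-out fabric")
```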
Using FP8
DGX Cloud Benchmarking Recipes provide optimized configurations and tuning recommendations specifically for FP8 workloads, helping you achieve optimal performance with this precision format. For example, the recipe for Llama 3.1 70B training includes FP8 settings that have been carefully tested and optimized for DGX Cloud platforms.
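As a minimal sketch of what an FP8 configuration looks like in code, the example below uses NVIDIA Transformer Engine (included in recent NGC PyTorch containers) to run a linear layer under an FP8 autocast. The scaling-recipe values shown here are generic illustrations, not the tuned settings shipped in the benchmarking recipes.

```python
# Minimal FP8 sketch with NVIDIA Transformer Engine in PyTorch.
# The DelayedScaling values are generic illustrations, not the tuned
# settings from the DGX Cloud Benchmarking Recipes.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(
    margin=0,
    fp8_format=Format.HYBRID,    # E4M3 forward, E5M2 backward
    amax_history_len=16,         # illustrative; recipes may tune this
    amax_compute_algo="max",
)

# Keep parameters in BF16 here for simplicity; the GEMMs below run in FP8.
model = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = model(x)

loss = y.float().pow(2).mean()
loss.backward()
optimizer.step()
```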
Setting baselines for good performance
Understanding what constitutes good performance for a given AI workload can be complex. DGX Cloud Benchmarking Recipes provide a range of baseline performance results for various popular models, enabling you to set realistic expectations and goals for your own implementations.
These baselines include metrics such as model FLOPS utilization (MFU), which measures how efficiently a model uses the available compute resources. The published baselines show how MFU and throughput compare across popular models. By comparing your results to these benchmarks, you can gauge the effectiveness of your optimizations and identify areas for improvement.
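MFU itself can be estimated directly from observed throughput with a standard approximation: dense training costs roughly 6 FLOPs per parameter per token, so delivered FLOPS divided by aggregate peak FLOPS gives the utilization. The throughput, GPU count, and peak figures below are placeholders, not published baseline values.

```python
# Estimate model FLOPS utilization (MFU) from observed training throughput.
# Throughput, GPU count, and peak figures are placeholders, not published results.

params = 70e9                   # model parameters
tokens_per_second = 2.5e5       # measured end-to-end training throughput (assumed)
num_gpus = 256
peak_flops_per_gpu = 1.0e15     # peak dense FLOPS per GPU at the chosen dtype (assumed)

# ~6 FLOPs per parameter per token for dense forward + backward passes.
delivered_flops = 6 * params * tokens_per_second
mfu = delivered_flops / (num_gpus * peak_flops_per_gpu)

print(f"delivered: {delivered_flops:.3e} FLOPS")
print(f"MFU:       {mfu:.1%}")
```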
DeepSeek-R1 is a 671B-parameter model that runs on one NVIDIA H200 GPU node. Its high compute utilization shows how holistic optimization of compute, networking, and parallelism strategies can push delivered performance closer to theoretical limits. Systematic benchmarking enables direct comparisons, helping teams collaboratively optimize models and platforms to maximize the value of GPU systems.
Finally, these per-workload performance optimizations also show the need for further research and discussion around application tuning. For example, the recommended usage of the parallelization strategies varies across combinations of workloads and platforms.
Get started with DGX Cloud Benchmarking Recipes
The recipes for benchmarking platform performance are hosted in the NGC Catalog, NVIDIA’s public registry. For more information about the latest release of recipes, see DGX Cloud Benchmarking 24.11.1.
Within each workload recipe, you can access the following:
- Containerized benchmarks for reproducibility across environments
- Scripts that generate synthetic data as needed
- Performance metrics collection and reporting (to stdout)
- Configuration best practices for that workload per platform
- Performance data from the NVIDIA reference architecture for comparison
The recipes require Slurm cluster management. Support for Kubernetes is currently in development. To use the DGX Cloud Benchmarking Recipes, download the recipe that best matches your workload and execute the cluster setup and benchmarking scripts.
Keep moving the platform performance goalposts
In today’s AI landscape, achieving optimal performance requires looking beyond individual components to understand how entire systems work together. While raw GPU capabilities matter, full optimization comes from carefully tuning every layer of the stack, from hardware and software configuration to workload-specific parameters.
At NVIDIA, we use benchmark recipes to continuously refine every layer of the technology stack, from hardware interconnects such as NVIDIA NVLink and NVLink Switch to software libraries such as NVIDIA TensorRT-LLM, enabling substantial performance gains over time.
For example, accelerated computing performance in MLPerf Inference on NVIDIA H100 GPUs increased 3.4x in just one year through continuous software improvements alone. These ongoing optimizations enable organizations to run more complex models, reduce infrastructure requirements, and improve efficiency, driving further innovation.
These benchmarking recipes enable your team to:
- Optimize AI workloads for specific environments, including for FP8.
- Assess how close a cluster’s performance is to NVIDIA’s observed performance (see the sketch after this list).
- Identify performance bottlenecks in your current setups.
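To apply the second point, a simple check is to compare your measured throughput against the recipe’s reference result for the same workload, scale, and dtype, as sketched below. Both values are placeholders for illustration.

```python
# Compare a cluster's measured throughput against a reference result for the
# same workload, scale, and dtype. Both values below are placeholders.

reference_tokens_per_sec = 2.6e5    # from the recipe's reference data (assumed)
measured_tokens_per_sec = 2.3e5     # from your own benchmark run (assumed)

efficiency = measured_tokens_per_sec / reference_tokens_per_sec
print(f"cluster is at {efficiency:.1%} of the reference throughput")

if efficiency < 0.9:
    # Large gaps often point to fabric health, data loading, or
    # parallelism-configuration issues worth investigating first.
    print("gap > 10%: check fabric health, data loading, and parallelism settings")
```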
Training large models can take weeks or months and cost millions in compute resources, so modest performance improvements can translate into substantial time and cost savings. By using continually evolving performance optimizations and workload-specific recipes from NVIDIA, your organization can maximize AI infrastructure investments and focus engineering efforts on innovation rather than infrastructure tuning.
For more information, see DGX Cloud Benchmarking Recipes.