NVIDIA Sets New Generative AI Performance and Scale Records in MLPerf Training v4.0

Generative AI models have a variety of uses, such as helping write computer code, crafting stories, composing music, generating images, producing videos, and more. And, as these models continue to grow in size and are trained on even more data, they are producing even higher-quality outputs.

Building and deploying these more intelligent models is incredibly compute-intensive, requiring many high-performance processors working in parallel, orchestrated by efficient and versatile software.

For example, Meta announced that it trained its latest Llama 3 family of large language models (LLMs) using AI clusters featuring 24,576 NVIDIA H100 Tensor Core GPUs. The larger of the models, Llama 3 70B, required a total of 6.4 million H100 GPU-hours to train.

Once LLMs are pretrained, they can be customized through a variety of techniques, including model fine-tuning, to achieve higher accuracy for specific tasks. As enterprises move to adopt LLMs for a wide variety of applications, LLM fine-tuning is fast becoming a core industry workload.

AI training is a full-stack challenge, and delivering world-class end-to-end training performance requires the combination of powerful processors, fast memory, high-bandwidth and low-latency networking, and optimized software.

MLPerf Training has emerged as the industry-standard benchmark to measure and evaluate end-to-end AI training performance. Developed by the MLCommons consortium, MLPerf Training workloads are frequently updated to reflect the latest AI use cases. During each submission round, the results undergo a rigorous peer-review process to ensure their integrity before publication.

In MLPerf Training v4.0, NVIDIA set new generative AI training performance records and continued to deliver the highest performance on every workload. This performance was delivered using the full stack of NVIDIA software and hardware:

NVIDIA Hopper GPUs
The latest, fourth-generation NVLink interconnect combined with the latest third-generation NVSwitch chip
NVIDIA Quantum-2 InfiniBand networking
The optimized and versatile NVIDIA software stack:
- NVIDIA NeMo framework
- NVIDIA Transformer Engine library
- NVIDIA cuBLAS library
- NVIDIA cuDNN library
- NVIDIA Magnum IO
- NCCL

Each component has been optimized further since the last round of MLPerf Training to continue delivering more performance and value to users. This post provides a closer look at these outstanding results.

MLPerf Training v4.0 updates

This round of MLPerf saw the addition of two new tests to reflect popular industry workloads.

The first measures how quickly Llama 2 70B can be fine-tuned using the popular low-rank adaptation (LoRA) technique. LLM fine-tuning enables enterprises to customize LLMs using their proprietary data to improve response quality for specific use cases.

The second new test focuses on graph neural network (GNN) training, based on an implementation of RGAT (relational graph attention network). GNNs are being applied to many domains, including drug discovery, fraud detection, and recommendation systems.

The latest MLPerf Training v4.0 test suite has the following workloads:

LLM pretraining (GPT-3 175B)
LLM fine-tuning (Llama 2 70B with LoRA)
Graph neural network (GNN)
Text-to-image (Stable Diffusion v2)
Recommender (DLRM-dcnv2)
Natural language processing (BERT-Large)
Image classification (ResNet-50)
Lightweight object detection (RetinaNet)
Biomedical image segmentation (3D U-Net)

Because AI is a diverse and rapidly evolving field, with new models and applications being invented continuously, it’s important that industry benchmarks, such as MLPerf, cover a wide range of use cases and evolve in lockstep with industry trends.

NVIDIA sets new LLM pretraining performance and scale records

MLPerf incorporates an LLM pretraining benchmark based on GPT-3 175B, a 175-billion-parameter LLM developed by OpenAI. The workload is extremely demanding and is a good test of large-scale LLM training performance because it stresses the compute, networking, and software efficiency of an accelerated computing platform.

NVIDIA first submitted results on the GPT-3 175B LLM benchmark when it was introduced in MLPerf Training v3.0 last year. We achieved a time-to-train of 10.9 minutes using 3,584 H100 GPUs, representing both performance and scale records at the time.

In this round of MLPerf Training, NVIDIA has more than tripled its submission scale to 11,616 H100 GPUs and more than tripled performance to 3.4 minutes to train, delivering near-linear performance scaling. These results build upon the prior records set by NVIDIA last round with 10,752 H100 GPUs that delivered a time-to-train of just 3.9 minutes.
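As a quick check on the "near-linear" claim, the sketch below compares the growth in GPU count with the reduction in time-to-train, using only the rounded figures quoted above. The function name and the ratio it reports are illustrative, not an MLPerf metric. Note that the newer result also benefits from a year of software improvements, so the ratio is not a pure measure of scaling efficiency.

```python
# Figures quoted in the text (rounded, as published).
V3_0_GPUS, V3_0_MINUTES = 3_584, 10.9
V4_0_GPUS, V4_0_MINUTES = 11_616, 3.4


def scaling_summary(base_gpus, base_minutes, new_gpus, new_minutes):
    """Return (GPU scale factor, speedup, speedup divided by scale factor)."""
    scale = new_gpus / base_gpus
    speedup = base_minutes / new_minutes
    return scale, speedup, speedup / scale


scale, speedup, ratio = scaling_summary(
    V3_0_GPUS, V3_0_MINUTES, V4_0_GPUS, V4_0_MINUTES
)
print(f"GPU count: {scale:.2f}x, speedup: {speedup:.2f}x, ratio: {ratio:.1%}")
```

With these inputs, the GPU count grows by about 3.24x and the time-to-train shrinks by about 3.21x, a ratio of roughly 99%.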

MLPerf Training v3.0 and v4.0 results retrieved from www.mlperf.org on June 12, 2024, from the following entries: NVIDIA 3.0-2069, NVIDIA 4.0-0059. NVIDIA A100 result with 512 A100 is not verified by MLCommons. The MLPerf name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.

The exceptional results submitted by NVIDIA this round reflected both increased submission scale and significant software improvements that further enhanced delivered performance at scale.

One notable example is the first use of CUDA Graphs in NVIDIA LLM submissions. As training scales to several thousand GPUs, CPU overhead becomes more pronounced. CUDA Graphs enable multiple GPU operations to be launched with a single CPU operation, which reduces that overhead and contributed to the performance delivered at maximum scale.
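To make the launch-overhead argument concrete, here is a deliberately simplified cost model, not a measurement and not how the submissions were profiled. It assumes that without graphs every kernel needs its own CPU launch, so the slower of the launch cost and the kernel run time sets the pace, while a captured graph pays one launch for the whole sequence. All names and numbers are hypothetical.

```python
def eager_step_us(num_kernels, launch_us, kernel_us):
    """One CPU launch per kernel: the slower of launch and execution sets the pace."""
    return num_kernels * max(launch_us, kernel_us)


def graphed_step_us(num_kernels, graph_launch_us, kernel_us):
    """One CPU launch replays the whole captured sequence of kernels."""
    return graph_launch_us + num_kernels * kernel_us


# Hypothetical values, chosen only to illustrate a launch-bound training step.
NUM_KERNELS = 1_000
LAUNCH_US = 5.0         # assumed CPU cost to launch one kernel
KERNEL_US = 3.0         # assumed GPU run time of one short kernel
GRAPH_LAUNCH_US = 10.0  # assumed CPU cost to launch one captured graph

eager = eager_step_us(NUM_KERNELS, LAUNCH_US, KERNEL_US)
graphed = graphed_step_us(NUM_KERNELS, GRAPH_LAUNCH_US, KERNEL_US)
print(f"eager: {eager:.0f} us, graphed: {graphed:.0f} us")
```

In this model the benefit appears only when kernels are shorter than the per-kernel launch cost, which is more likely at large scale because each GPU handles a smaller share of the work.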

At a scale of 512 GPUs, H100 performance has increased by 27% in just one year, completing the workload in under an hour, with per-GPU throughput now reaching 904 TFLOP/s.

This exceptional result was enabled by numerous improvements to the NVIDIA software stack:

Optimized FP8 kernels
A new FP8-aware distributed optimizer
An optimized FlashAttention implementation in cuDNN
More effective overlapped execution of math operations and GPU-to-GPU communication operations
Intelligent power allocation within the H100 GPUs to maximize Tensor Core throughput

Diving further into the last optimization, a notable characteristic of LLM training is its high compute intensity. Especially for smaller-scale LLM runs, math operations can make up a much greater part of the time required to perform each training step compared to operations related to GPU-to-GPU communication. This leads to high Tensor Core utilization and can result in scenarios where Tensor Core throughput is constrained by the power available to the GPU.

In the submission with 512 H100 GPUs, we improved end-to-end performance by redirecting power from the L2 cache memory on each H100 GPU to the streaming multiprocessor (SM), which houses, among other units, NVIDIA Hopper fourth-generation Tensor Cores. This was done by setting a ratio using a boost slider managed by the NVIDIA Management Library (NVML).

This resulted in higher GPU operating frequency within the same power budget and improved end-to-end performance by 4%. The boost slider can be set through the command nvidia-smi boost-slider --vboost <value>. For more information about this command, including how to get all possible values, run nvidia-smi boost-slider --help.
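If you want to script the commands above, a thin wrapper such as the following sketch may help. The helper names are hypothetical, and the valid slider values are deliberately not hard-coded because they depend on the GPU and driver. The example only queries the help text and skips execution when nvidia-smi is not installed. Changing the slider itself may require administrator privileges.

```python
import shutil
import subprocess


def vboost_command(value):
    """Build the argument list for `nvidia-smi boost-slider --vboost <value>`."""
    return ["nvidia-smi", "boost-slider", "--vboost", str(value)]


def run_if_available(argv):
    """Run argv only when the executable is on PATH; return None otherwise."""
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, capture_output=True, text=True, check=False)


# Query the supported options first rather than guessing a value.
HELP_COMMAND = ["nvidia-smi", "boost-slider", "--help"]
result = run_if_available(HELP_COMMAND)
print(result.stdout if result is not None else "nvidia-smi not found on PATH")
```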

By improving performance with the same GPUs, you can either train models with similar computational requirements in less time and at a lower cost or train more computationally intensive models in a similar time with similar costs.
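A performance gain and a time saving are not the same percentage, so it can help to convert one into the other. The sketch below does this for the two gains quoted in this section. It assumes that training cost is proportional to GPU-hours, which is a simplification, and the function name is illustrative.

```python
def time_fraction(perf_gain):
    """Fraction of the original training time left after a throughput gain.

    perf_gain is a relative gain in throughput, for example 0.27 for +27%.
    """
    return 1.0 / (1.0 + perf_gain)


for label, gain in (
    ("27% gain at 512 GPUs", 0.27),
    ("4% gain from the boost slider", 0.04),
):
    saved = 1.0 - time_fraction(gain)
    print(f"{label}: about {saved:.1%} less time for the same work")
```

Under that assumption, a 27% performance gain corresponds to roughly 21% less training time, and a 4% gain to a little under 4%.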

NVIDIA achieves the highest LLM fine-tuning performance

The latest version of MLPerf Training includes a fine-tuning test, which applies LoRA to the Llama 2 70B model, developed by Meta. LoRA is a popular form of parameter-efficient fine-tuning, described in this post.
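For readers new to LoRA, the following is a minimal pure-Python sketch of the idea, not the benchmark implementation. A frozen weight matrix W is augmented with a trainable low-rank product B·A, scaled by alpha / r. The matrix sizes, the rank of 16, and the value of alpha are illustrative assumptions, as is the 8,192 x 8,192 layer used for the parameter count.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B train."""
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def lora_param_counts(d_out, d_in, r):
    """Return (parameters in the full matrix, trainable LoRA parameters)."""
    return d_out * d_in, r * (d_in + d_out)


random.seed(0)
d_out, d_in, r = 4, 6, 2  # tiny illustrative sizes
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: the adapter starts as a no-op
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B at zero, the adapted layer reproduces the frozen layer exactly.
assert lora_forward(W, A, B, x, alpha=4.0) == matvec(W, x)

# Assumed layer shape and rank, for scale only.
full, lora = lora_param_counts(8192, 8192, 16)
print(f"full: {full:,} params, LoRA: {lora:,} params ({lora / full:.2%})")
```

For the assumed layer, the adapter trains 262,144 parameters instead of 67,108,864, which is the source of the "parameter-efficient" label.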

The NVIDIA platform excelled on this new test, delivering the fastest single-server performance as well as scalability well beyond a single GPU server.

A single DGX H100, incorporating eight H100 GPUs, delivered outstanding performance, completing the test in just over 28 minutes. The NVIDIA H200 Tensor Core GPU, which upgrades the NVIDIA Hopper architecture with 141 GB of HBM3e memory, delivered an additional 14% speedup, reducing the time-to-train with a single node to just 24.7 minutes.

NVIDIA submissions this round also demonstrated the ability to fine-tune LLMs using up to 1,024 H100 GPUs, delivering an outstanding result of just 1.5 minutes, establishing both performance and scale records.

To enable efficient scaling to 1,024 H100 GPUs, NVIDIA submissions on the LLM fine-tuning benchmark leveraged the context parallelism capability available in the NVIDIA NeMo framework. To learn more about context parallelism and how to leverage it when using the NeMo framework, see this page.
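At its core, context parallelism splits each input sequence along the sequence dimension so that every GPU in a group holds only part of it. The sketch below shows only that partitioning step, with a hypothetical helper name and hypothetical sizes. It is not the NeMo implementation, which also has to exchange keys and values between ranks so that attention can still see the full sequence.

```python
def shard_sequence(tokens, cp_size):
    """Split one sequence into cp_size contiguous, equally sized shards."""
    if cp_size <= 0:
        raise ValueError("cp_size must be positive")
    if len(tokens) % cp_size != 0:
        raise ValueError("sequence length must be divisible by cp_size")
    chunk = len(tokens) // cp_size
    return [tokens[i * chunk:(i + 1) * chunk] for i in range(cp_size)]


# Hypothetical example: an 8,192-token sequence across 4 context-parallel ranks.
sequence = list(range(8192))
shards = shard_sequence(sequence, cp_size=4)
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {len(shard)} tokens, first token id {shard[0]}")
```

Each rank then stores activations for a quarter of the sequence, which is what lets more GPUs work on the same sample.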

In the NVIDIA LLM fine-tuning submissions this round, we used an FP8 implementation of self-attention, available through cuDNN. This improved performance by 15% at the 8-GPU scale. For more information, see Accelerating Transformers with NVIDIA cuDNN 9.

These fantastic results complement the great performance on supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) demonstrated on NVIDIA Hopper GPUs late last year.

These fine-tuning techniques can provide better accuracy compared to parameter-efficient methods such as LoRA, but at the cost of greater compute intensity. The NVIDIA NeMo framework supports many model customization techniques to provide you with the flexibility to choose the ones that best serve your needs.

NVIDIA raises the bar for text-to-image generative AI training

Generative AI is transforming visual design and is being applied to a broad range of use cases, including marketing and advertising, media and entertainment, product design and prototyping, as well as architecture visualization.

To represent visual generative AI, MLPerf Training v4.0 includes a text-to-image benchmark, based on Stable Diffusion v2.

Building upon the record-setting NVIDIA submissions in the last round, NVIDIA submissions this round deliver up to 80% more performance at the same submission scales through extensive software enhancements:

Use of full-iteration CUDA Graphs
Use of distributed optimizer for Stable Diffusion
Optimized cuDNN and cuBLAS heuristics for Stable Diffusion
…and more

MLPerf Training v3.1 and v4.0 results retrieved from www.mlperf.org on June 12, 2024, from the following entries: NVIDIA 3.1-2050, NVIDIA 4.0-0053. The MLPerf name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.

NVIDIA accelerates graph neural network training

Graph neural networks (GNNs) are used for a range of applications, including social network analysis, drug discovery, fraud detection, recommenders in retail, and even molecular chemistry. The addition of a GNN benchmark extends MLPerf workload coverage to this important class of neural networks.

NVIDIA submitted results using 8, 64, and 512 H100 GPUs, setting a new time-to-train record for the benchmark of just 1.1 minutes in the largest-scale configuration.

NVIDIA also submitted 8-GPU results using H200 Tensor Core GPUs, each featuring 141 GB of HBM3e, which delivered a 47% performance boost compared to the H100 submission at the same scale.

Key takeaways

The NVIDIA platform continues to demonstrate the highest performance and greatest versatility for the full diversity of AI workloads, spanning both generative AI and more traditional AI workloads.

The NVIDIA platform is moving fast. As NVIDIA continues to optimize its software stack, customers can enjoy more performance per GPU, which reduces the cost to train, and the ability to efficiently scale to larger numbers of GPUs to train even more demanding models.

The NVIDIA platform continues to deliver even more performance through invention across the entire stack, including new chips and systems. The NVIDIA Blackwell platform, announced at GTC 2024, is set to democratize trillion-parameter AI, with NVIDIA GB200 NVL72 delivering up to 30x faster real-time trillion-parameter inference, and up to 4x faster trillion-parameter training compared to the same number of NVIDIA Hopper GPUs.