
NVIDIA NVLink and NVIDIA NVSwitch Supercharge Large Language Model Inference


Large language models (LLMs) are getting larger, increasing the amount of compute required to process inference requests. To meet real-time latency requirements for serving today's LLMs, and to do so for as many users as possible, multi-GPU compute is a must. Low latency improves the user experience, high throughput reduces the cost of service, and both matter simultaneously.

Even if a large model can fit in the memory of a single state-of-the-art GPU, the rate at which that GPU can generate tokens depends on the total compute available to process requests. By combining the compute capabilities of multiple cutting-edge GPUs, real-time user experiences on the latest models are possible. 

To understand the need for high tokens per second, the following GIFs show two scenarios: 

  • 5 tokens/second: Below typical human reading speed and not real-time.
  • 50 tokens/second: An excellent user experience. 
Figure 1. 5 tokens/second output example (words of a Macbeth quote appear one at a time)
Figure 2. 50 tokens/second output example (entire lines of the quote appear quickly)

By using the combined compute performance of multiple GPUs with techniques such as tensor parallelism (TP) to run large models, inference requests can be processed quickly enough to enable real-time responses. By carefully selecting the number of GPUs used to run a model, cloud inference services can also simultaneously optimize both user experience and cost. 

For more information about parallelism techniques to balance user experience and cost, see Demystifying AI Inference Deployments for Trillion Parameter Large Language Models.

Multi-GPU inference is communication-intensive

Multi-GPU TP inference works by splitting the calculation of each model layer across two, four, or even eight GPUs in a server. In theory, two GPUs could run a model 2x faster, four GPUs 4x faster, and eight GPUs 8x faster. 

However, no GPU can complete its work independently. After each GPU finishes executing its portion of a model layer, all GPUs must exchange their partial results in an all-reduce operation. Only then can inference execution proceed to the next model layer.
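
As a concrete illustration of this pattern, here is a minimal sketch of one tensor-parallel layer written with PyTorch's torch.distributed API. It is not the implementation used by any particular inference framework; the sharding scheme and shapes are assumptions chosen for clarity:

```python
# Minimal sketch of one tensor-parallel (row-parallel) linear layer using
# PyTorch's torch.distributed API. Assumes the default process group has
# already been initialized (for example, via torchrun) with one rank per GPU.
import torch
import torch.distributed as dist

def row_parallel_linear(x_shard: torch.Tensor, w_shard: torch.Tensor) -> torch.Tensor:
    # Each GPU multiplies its slice of the activations by its slice of the
    # weight matrix, producing a partial result with the full output shape.
    partial = x_shard @ w_shard

    # Every GPU must then sum its partial result with every other GPU's
    # before the next layer can start. This all-reduce is the communication
    # step that NVLink/NVSwitch accelerates; Tensor Cores idle while it runs.
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)
    return partial
```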

Minimizing the time spent communicating results between GPUs is critical, as during this communication, Tensor Cores often remain idle, waiting for data to continue processing. 

During this communication step, a large amount of data must be transferred. A single query to Llama 3.1 70B (8K input tokens and 256 output tokens) requires up to 20 GB of TP synchronization data to be transferred from each GPU. And because multiple queries are processed in parallel through batching to improve inference throughput, the amount of data transferred grows proportionally.
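
The 20 GB figure can be roughly reproduced with a back-of-envelope estimate. The sketch below assumes Llama 3.1 70B's published shape (80 transformer layers, hidden size 8,192), FP16 activations, and two all-reduce operations per layer (one after attention, one after the MLP); treat it as an approximation rather than an exact accounting of what an inference framework transfers:

```python
# Rough estimate of TP synchronization traffic for one Llama 3.1 70B query.
layers = 80                  # transformer layers in Llama 3.1 70B
hidden_size = 8192           # model hidden dimension
bytes_per_value = 2          # FP16 activations
allreduces_per_layer = 2     # one after attention, one after the MLP

input_tokens = 8192          # ~8K-token prompt (prefill)
output_tokens = 256          # generated tokens (decode)

bytes_per_token = hidden_size * bytes_per_value * allreduces_per_layer * layers
total_gb = bytes_per_token * (input_tokens + output_tokens) / 1e9
print(f"~{total_gb:.0f} GB of all-reduce traffic per GPU")   # ~22 GB
```

This lands around 22 GB, in the same ballpark as the "up to 20 GB" figure above.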

This is why a high-bandwidth GPU-to-GPU interconnect is essential for multi-GPU inference. 

NVSwitch is critical for fast multi-GPU LLM inference

For good multi-GPU scaling, an AI server first requires GPUs with excellent per-GPU interconnect bandwidth. It must also provide fast connectivity to enable all GPUs to exchange data with all other GPUs as quickly as possible.

NVIDIA Hopper architecture GPUs can communicate at 900 GB/s over fourth-generation NVLink. With NVSwitch, every NVIDIA Hopper GPU in a server can communicate at that full 900 GB/s with any other NVIDIA Hopper GPU simultaneously.

The peak rate does not depend on the number of GPUs that are communicating. That is, the NVSwitch is non-blocking. Every NVIDIA HGX H100 and NVIDIA HGX H200 system with eight GPUs features four third-generation NVSwitch chips. The total bidirectional bandwidth of each NVSwitch chip is a staggering 25.6 terabits per second.
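
A quick sanity check with the figures above shows why four switch chips are enough to keep all eight GPUs communicating at full speed at once (simple unit conversion and arithmetic; it glosses over how the individual NVLink ports are distributed across the chips):

```python
# Compare aggregate NVSwitch capacity with aggregate GPU NVLink bandwidth
# for an 8-GPU HGX H100/H200 baseboard, using the figures quoted above.
num_gpus = 8
gpu_nvlink_tb_s = 0.9                 # 900 GB/s bidirectional per Hopper GPU
num_switch_chips = 4
switch_chip_tb_s = 25.6 / 8           # 25.6 Tbit/s per chip = 3.2 TB/s

gpu_demand = num_gpus * gpu_nvlink_tb_s                  # 7.2 TB/s
switch_capacity = num_switch_chips * switch_chip_tb_s    # 12.8 TB/s

print(f"Aggregate GPU NVLink bandwidth: {gpu_demand:.1f} TB/s")
print(f"Aggregate NVSwitch capacity:    {switch_capacity:.1f} TB/s")
# Capacity comfortably exceeds demand, which is why the fabric is non-blocking.
```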

Figure 3. HGX H200 8-GPU baseboard with four NVIDIA NVSwitch chips

For comparison, consider a hypothetical server with eight H200 GPUs without NVSwitch that instead uses point-to-point connections on the server motherboard (Figure 4).

Figure 4. GPU-to-GPU connectivity with and without an NVSwitch all-to-all switch topology: point-to-point links between every GPU pair (top) versus all eight GPUs connected through a centralized NVSwitch (bottom)

The point-to-point design has a lower system cost because it omits the four high-speed switch chips, but each GPU must split the same 900 GB/s of connectivity into seven dedicated 128 GB/s point-to-point connections, one to each of the other GPUs in the system. This means that the speed at which GPUs can communicate depends on the number of GPUs that are communicating.

| GPU Count | Point-to-Point Bandwidth | NVSwitch Bandwidth |
|-----------|--------------------------|--------------------|
| 2         | 128 GB/s                 | 900 GB/s           |
| 4         | 3 x 128 GB/s             | 900 GB/s           |
| 8         | 7 x 128 GB/s             | 900 GB/s           |

Table 1. GPU-to-GPU bandwidth comparison

Table 1 shows a GPU-to-GPU bandwidth comparison between GPUs connected through a point-to-point interconnect and GPUs connected with NVSwitch.
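
The relationship in Table 1 can be written down directly: with point-to-point links, a GPU can only use the links that go to the other GPUs in its tensor-parallel group, while NVSwitch always offers the full NVLink bandwidth. A small sketch based on the figures above:

```python
# Per-GPU communication bandwidth available to a tensor-parallel group,
# based on the link speeds quoted above.
POINT_TO_POINT_LINK_GB_S = 128   # one dedicated link to each peer GPU
NVSWITCH_GB_S = 900              # full NVLink bandwidth, independent of group size

def point_to_point_bw(gpus_in_group: int) -> int:
    # Only the links to the other GPUs in the group carry traffic.
    return (gpus_in_group - 1) * POINT_TO_POINT_LINK_GB_S

for gpus in (2, 4, 8):
    print(f"{gpus} GPUs: point-to-point {point_to_point_bw(gpus):>3} GB/s, "
          f"NVSwitch {NVSWITCH_GB_S} GB/s")
```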

For models such as Llama 3.1 70B, where two GPUs offer the best balance of user experience and cost, a point-to-point architecture provides only 128 GB/s of bandwidth between them. Moving 20 GB of data at that rate would take about 150 ms, accumulated across the many all-reduce operations in a single query. With such high communication overhead, Amdahl's Law limits the speed-up possible with each additional GPU.

Meanwhile, a system using NVSwitch provides the full 900 GB/s of bandwidth, taking only 22 ms to transfer the same 20 GB and dramatically reducing the time spent in GPU-to-GPU communication. This has a significant impact on overall inference throughput and user experience.
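
The latency gap follows directly from dividing data volume by bandwidth (a rough estimate that ignores protocol overhead and any overlap of communication with compute):

```python
# Time to move the ~20 GB of TP synchronization data from the earlier example.
data_gb = 20
for name, bandwidth_gb_s in (("Point-to-point (128 GB/s)", 128),
                             ("NVSwitch (900 GB/s)", 900)):
    ms = data_gb / bandwidth_gb_s * 1000
    print(f"{name:<26} ~{ms:.0f} ms")
# Roughly 156 ms vs. 22 ms, in line with the figures in the text.
```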

Figure 5. Multi-GPU communication with and without NVSwitch: with point-to-point links (top), communication makes up a large portion of execution time; with NVSwitch (bottom), it makes up only a small portion

Cloud services often set fixed response-time budgets for model serving to provide a good end-user experience, which typically means generating tokens faster than human reading speed. To maximize throughput and reduce serving cost, as many requests as possible are batched together while still staying within that response-time budget.
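
In practice, a serving team measures per-user token rates at several batch sizes and keeps the largest batch that still meets the budget. A hypothetical helper to illustrate that selection logic (the measurement data is illustrative, not taken from the tables below):

```python
# Pick the largest batch size whose measured per-user token rate still meets
# the real-time budget. `measurements` maps batch size -> tok/s/user and is a
# hypothetical stand-in for load-test data.
def pick_batch_size(measurements: dict[int, float], budget_tok_s: float):
    viable = [batch for batch, tok_s in measurements.items() if tok_s >= budget_tok_s]
    return max(viable) if viable else None

measurements = {1: 70, 2: 55, 4: 42, 8: 30}             # illustrative numbers only
print(pick_batch_size(measurements, budget_tok_s=40))   # -> 4
```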

Table 2 shows the measured Llama 3.1 70B throughput at various real-time response time budgets from 30-50 tokens/s/user.


| Real-time Response Budget (tok/s/user) | Single GPU, TP=1 | Point-to-Point, TP=2 | NVSwitch, TP=2 | NVSwitch Benefit |
|----------------------------------------|------------------|----------------------|----------------|------------------|
| 30                                     | 67 (2)           | 80 (6)               | 115 (9)        | 1.4x             |
| 35                                     | Does not meet    | 74 (5)               | 104 (7)        | 1.4x             |
| 40                                     | Does not meet    | 67 (4)               | 87 (5)         | 1.3x             |
| 45                                     | Does not meet    | 56 (3)               | 76 (4)         | 1.4x             |
| 50                                     | Does not meet    | 43 (2)               | 63 (3)         | 1.5x             |

Table 2. Throughput in tok/s/GPU (batch size in parentheses) and NVSwitch benefit for Llama 3.1 70B inference at various real-time user experience targets

Throughput modeled using internal measurements. H200 GPU, ISL/OSL = 8K/256.

As Table 2 shows, a single-GPU configuration (TP=1) struggles to achieve real-time performance. Splitting the model across two GPUs with tensor parallelism combines the compute resources of both GPUs to achieve high throughput across a wide range of real-time experience budgets. Real-time inference throughput on NVIDIA H200 GPUs with TP=2 and NVSwitch is up to 1.5x greater than on a comparable system without NVSwitch.

To show how NVSwitch benefits scenarios with greater GPU-to-GPU communication traffic, Table 3 shows overall server throughput at fixed batch sizes. Larger batch sizes mean that requests from more users can be processed at once, improving overall server utilization and reducing cost per inference.


| Batch Size | Point-to-Point Throughput (tok/s/GPU) | NVSwitch Throughput (tok/s/GPU) | NVSwitch Benefit |
|------------|---------------------------------------|---------------------------------|------------------|
| 1          | 25                                    | 26                              | 1.0x             |
| 2          | 44                                    | 47                              | 1.1x             |
| 4          | 66                                    | 76                              | 1.2x             |
| 8          | 87                                    | 110                             | 1.3x             |
| 16         | 103                                   | 142                             | 1.4x             |
| 32         | 112                                   | 168                             | 1.5x             |

Table 3. Throughput and NVSwitch benefit for Llama 3.1 70B inference at various fixed batch sizes

Throughput modeled using internal measurements. H200 GPU, TP=2, ISL/OSL = 8K/256. 

As batch size increases, GPU-to-GPU traffic increases, as does the benefit provided by NVSwitch compared to a point-to-point topology. However, even at relatively modest batch sizes, the gains can be significant.  

NVLink and NVSwitch provide high-bandwidth communication between GPUs based on the NVIDIA Hopper architecture, delivering significant benefits for real-time, cost-effective large model inference today.

As model sizes continue to grow, NVIDIA continues to innovate with both NVLink and NVSwitch to push the boundaries of real-time inference performance for even larger NVLink domains.

The NVIDIA Blackwell architecture features fifth-generation NVLink, which doubles per-GPU NVLink speeds to 1,800 GB/s. For Blackwell, a new NVSwitch chip and NVLink switch trays have also been introduced to enable even larger NVLink domain sizes.

The NVIDIA GB200 NVL72 system connects 36 NVIDIA Grace CPUs and 72 NVIDIA Blackwell GPUs in a rack-scale design and, with fifth-generation NVLink, allows all 72 GPUs to act as a single GPU, delivering 30x faster real-time trillion-parameter inference compared to the prior generation.
