
NVIDIA NVLink and NVIDIA NVSwitch Supercharge Large Language Model Inference


Large language models (LLMs) are getting larger, increasing the amount of compute required to process inference requests. To meet real-time latency requirements for serving today's LLMs, and to do so for as many users as possible, multi-GPU compute is a must. Low latency improves the user experience, high throughput reduces the cost of service, and both matter simultaneously.

Even if a large model can fit in the memory of a single state-of-the-art GPU, the rate at which that GPU can generate tokens depends on the total compute available to process requests. By combining the compute capabilities of multiple cutting-edge GPUs, real-time user experiences on the latest models are possible. 

To understand the need for high tokens per second, the following GIFs show two scenarios: 

  • 5 tokens/second: Below typical human reading speed and not real-time.
  • 50 tokens/second: An excellent user experience. 
Figure 1. 5 tokens/second output example (words of a Macbeth quote appear one at a time)
Figure 2. 50 tokens/second output example (entire lines of the quote appear quickly)

By using the combined compute performance of multiple GPUs with techniques such as tensor parallelism (TP) to run large models, inference requests can be processed quickly enough to enable real-time responses. By carefully selecting the number of GPUs used to run a model, cloud inference services can also simultaneously optimize both user experience and cost. 

For more information about parallelism techniques to balance user experience and cost, see Demystifying AI Inference Deployments for Trillion Parameter Large Language Models.

Multi-GPU inference is communication-intensive

Multi-GPU TP inference works by splitting the calculation of each model layer across two, four, or even eight GPUs in a server. In theory, two GPUs could run a model 2x faster, four GPUs 4x faster, and eight GPUs 8x faster. 

However, no GPU can complete its work independently. After each GPU finishes executing its portion of a model layer, all GPUs must exchange their partial results in an all-reduce operation. Only then can inference execution proceed to the next model layer.
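
As a concrete illustration of this pattern, here is a minimal sketch of one tensor-parallel layer written with PyTorch's torch.distributed API. It is not the implementation used by any particular inference framework; the sharding scheme and shapes are assumptions chosen for clarity:

```python
# Minimal sketch of one tensor-parallel (row-parallel) linear layer using
# PyTorch's torch.distributed API. Assumes the default process group has
# already been initialized (for example, via torchrun) with one rank per GPU.
import torch
import torch.distributed as dist

def row_parallel_linear(x_shard: torch.Tensor, w_shard: torch.Tensor) -> torch.Tensor:
    # Each GPU multiplies its slice of the activations by its slice of the
    # weight matrix, producing a partial result with the full output shape.
    partial = x_shard @ w_shard

    # Every GPU must then sum its partial result with every other GPU's
    # before the next layer can start. This all-reduce is the communication
    # step that NVLink/NVSwitch accelerates; Tensor Cores idle while it runs.
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)
    return partial
```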

Minimizing the time spent communicating results between GPUs is critical, as during this communication, Tensor Cores often remain idle, waiting for data to continue processing. 

During this communication step, a large amount of data must be transferred. A single query to Llama 3.1 70B (8K input tokens and 256 output tokens) requires up to 20 GB of TP synchronization data to be transferred from each GPU. And because multiple queries are processed in parallel through batching to improve inference throughput, the amount of data transferred grows proportionally.
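
The 20 GB figure can be roughly reproduced with a back-of-envelope estimate. The sketch below assumes Llama 3.1 70B's published shape (80 transformer layers, hidden size 8,192), FP16 activations, and two all-reduce operations per layer (one after attention, one after the MLP); treat it as an approximation rather than an exact accounting of what an inference framework transfers:

```python
# Rough estimate of TP synchronization traffic for one Llama 3.1 70B query.
layers = 80                  # transformer layers in Llama 3.1 70B
hidden_size = 8192           # model hidden dimension
bytes_per_value = 2          # FP16 activations
allreduces_per_layer = 2     # one after attention, one after the MLP

input_tokens = 8192          # ~8K-token prompt (prefill)
output_tokens = 256          # generated tokens (decode)

bytes_per_token = hidden_size * bytes_per_value * allreduces_per_layer * layers
total_gb = bytes_per_token * (input_tokens + output_tokens) / 1e9
print(f"~{total_gb:.0f} GB of all-reduce traffic per GPU")   # ~22 GB
```

This lands around 22 GB, in the same ballpark as the "up to 20 GB" figure above.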

This is why a high-bandwidth GPU-to-GPU interconnect is essential for multi-GPU inference. 

NVSwitch is critical for fast multi-GPU LLM inference

For good multi-GPU scaling, an AI server first requires GPUs with excellent per-GPU interconnect bandwidth. It must also provide fast connectivity to enable all GPUs to exchange data with all other GPUs as quickly as possible.

NVIDIA Hopper architecture GPUs can communicate at 900 GB/s over fourth-generation NVLink. With NVSwitch, every NVIDIA Hopper GPU in a server can communicate at that full 900 GB/s with any other NVIDIA Hopper GPU simultaneously.

The peak rate does not depend on the number of GPUs that are communicating. That is, the NVSwitch is non-blocking. Every NVIDIA HGX H100 and NVIDIA HGX H200 system with eight GPUs features four third-generation NVSwitch chips. The total bidirectional bandwidth of each NVSwitch chip is a staggering 25.6 terabits per second.
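
A quick sanity check with the figures above shows why four switch chips are enough to keep all eight GPUs communicating at full speed at once (simple unit conversion and arithmetic; it glosses over how the individual NVLink ports are distributed across the chips):

```python
# Compare aggregate NVSwitch capacity with aggregate GPU NVLink bandwidth
# for an 8-GPU HGX H100/H200 baseboard, using the figures quoted above.
num_gpus = 8
gpu_nvlink_tb_s = 0.9                 # 900 GB/s bidirectional per Hopper GPU
num_switch_chips = 4
switch_chip_tb_s = 25.6 / 8           # 25.6 Tbit/s per chip = 3.2 TB/s

gpu_demand = num_gpus * gpu_nvlink_tb_s                  # 7.2 TB/s
switch_capacity = num_switch_chips * switch_chip_tb_s    # 12.8 TB/s

print(f"Aggregate GPU NVLink bandwidth: {gpu_demand:.1f} TB/s")
print(f"Aggregate NVSwitch capacity:    {switch_capacity:.1f} TB/s")
# Capacity comfortably exceeds demand, which is why the fabric is non-blocking.
```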

Figure 3. HGX H200 8-GPU baseboard with four NVIDIA NVSwitch chips

For comparison, consider a hypothetical server with eight H200 GPUs without NVSwitch that instead uses point-to-point connections on the server motherboard (Figure 4).

Figure 4. GPU-to-GPU connectivity with and without an NVSwitch all-to-all switch topology: point-to-point links between every GPU pair (top) versus all eight GPUs connected through a centralized NVSwitch (bottom)

The point-to-point design has a lower system cost because it omits the four high-speed switch chips, but each GPU must split the same 900 GB/s of connectivity into seven dedicated 128 GB/s point-to-point connections, one to each of the other GPUs in the system. This means that the speed at which GPUs can communicate depends on the number of GPUs that are communicating.

| GPU Count | Point-to-Point Bandwidth | NVSwitch Bandwidth |
|-----------|--------------------------|--------------------|
| 2         | 128 GB/s                 | 900 GB/s           |
| 4         | 3 x 128 GB/s             | 900 GB/s           |
| 8         | 7 x 128 GB/s             | 900 GB/s           |

Table 1. GPU-to-GPU bandwidth comparison

Table 1 shows a GPU-to-GPU bandwidth comparison between GPUs connected through a point-to-point interconnect and GPUs connected with NVSwitch.
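
The relationship in Table 1 can be written down directly: with point-to-point links, a GPU can only use the links that go to the other GPUs in its tensor-parallel group, while NVSwitch always offers the full NVLink bandwidth. A small sketch based on the figures above:

```python
# Per-GPU communication bandwidth available to a tensor-parallel group,
# based on the link speeds quoted above.
POINT_TO_POINT_LINK_GB_S = 128   # one dedicated link to each peer GPU
NVSWITCH_GB_S = 900              # full NVLink bandwidth, independent of group size

def point_to_point_bw(gpus_in_group: int) -> int:
    # Only the links to the other GPUs in the group carry traffic.
    return (gpus_in_group - 1) * POINT_TO_POINT_LINK_GB_S

for gpus in (2, 4, 8):
    print(f"{gpus} GPUs: point-to-point {point_to_point_bw(gpus):>3} GB/s, "
          f"NVSwitch {NVSWITCH_GB_S} GB/s")
```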

For models such as Llama 3.1 70B, where two GPUs offer the best balance of user experience and cost, a point-to-point architecture provides only 128 GB/s of bandwidth between them. Moving 20 GB of data at that rate would take about 150 ms, accumulated across the many all-reduce operations in a single query. With such high communication overhead, Amdahl's Law limits the speed-up possible with each additional GPU.

Meanwhile, a system using NVSwitch provides the full 900 GB/s of bandwidth, taking only 22 ms to transfer the same 20 GB and dramatically reducing the time spent in GPU-to-GPU communication. This has a significant impact on overall inference throughput and user experience.
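
The latency gap follows directly from dividing data volume by bandwidth (a rough estimate that ignores protocol overhead and any overlap of communication with compute):

```python
# Time to move the ~20 GB of TP synchronization data from the earlier example.
data_gb = 20
for name, bandwidth_gb_s in (("Point-to-point (128 GB/s)", 128),
                             ("NVSwitch (900 GB/s)", 900)):
    ms = data_gb / bandwidth_gb_s * 1000
    print(f"{name:<26} ~{ms:.0f} ms")
# Roughly 156 ms vs. 22 ms, in line with the figures in the text.
```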

Figure 5. Multi-GPU communication with and without NVSwitch: with point-to-point links (top), communication makes up a large portion of execution time; with NVSwitch (bottom), it makes up only a small portion

Cloud services often set fixed response-time budgets for model serving to provide a good end-user experience, which typically means generating tokens faster than human reading speed. To maximize throughput and reduce serving cost, as many requests as possible are batched together while still staying within that response-time budget.
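
In practice, a serving team measures per-user token rates at several batch sizes and keeps the largest batch that still meets the budget. A hypothetical helper to illustrate that selection logic (the measurement data is illustrative, not taken from the tables below):

```python
# Pick the largest batch size whose measured per-user token rate still meets
# the real-time budget. `measurements` maps batch size -> tok/s/user and is a
# hypothetical stand-in for load-test data.
def pick_batch_size(measurements: dict[int, float], budget_tok_s: float):
    viable = [batch for batch, tok_s in measurements.items() if tok_s >= budget_tok_s]
    return max(viable) if viable else None

measurements = {1: 70, 2: 55, 4: 42, 8: 30}             # illustrative numbers only
print(pick_batch_size(measurements, budget_tok_s=40))   # -> 4
```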

Table 2 shows the measured Llama 3.1 70B throughput at various real-time response time budgets from 30-50 tokens/s/user.


| Real-time Response Budget (tok/s/user) | Single GPU, TP=1 | Point-to-Point, TP=2 | NVSwitch, TP=2 | NVSwitch Benefit |
|----------------------------------------|------------------|----------------------|----------------|------------------|
| 30                                     | 67 (2)           | 80 (6)               | 115 (9)        | 1.4x             |
| 35                                     | Does not meet    | 74 (5)               | 104 (7)        | 1.4x             |
| 40                                     | Does not meet    | 67 (4)               | 87 (5)         | 1.3x             |
| 45                                     | Does not meet    | 56 (3)               | 76 (4)         | 1.4x             |
| 50                                     | Does not meet    | 43 (2)               | 63 (3)         | 1.5x             |

Table 2. Throughput in tok/s/GPU (batch size in parentheses) and NVSwitch benefit for Llama 3.1 70B inference at various real-time user experience targets

Throughput modeled using internal measurements. H200 GPU, ISL/OSL = 8K/256.

As Table 2 shows, a single-GPU configuration (TP=1) struggles to achieve real-time performance. Splitting the model across two GPUs with tensor parallelism combines the compute resources of both GPUs to achieve high throughput across a wide range of real-time experience budgets. Real-time inference throughput on NVIDIA H200 GPUs with TP=2 and NVSwitch is up to 1.5x greater than on a comparable system without NVSwitch.

To show how NVSwitch benefits scenarios with greater GPU-to-GPU communication traffic, Table 3 shows overall server throughput at fixed batch sizes. Larger batch sizes mean that requests from more users can be processed at once, improving overall server utilization and reducing cost per inference.


| Batch Size | Point-to-Point Throughput (tok/s/GPU) | NVSwitch Throughput (tok/s/GPU) | NVSwitch Benefit |
|------------|---------------------------------------|---------------------------------|------------------|
| 1          | 25                                    | 26                              | 1.0x             |
| 2          | 44                                    | 47                              | 1.1x             |
| 4          | 66                                    | 76                              | 1.2x             |
| 8          | 87                                    | 110                             | 1.3x             |
| 16         | 103                                   | 142                             | 1.4x             |
| 32         | 112                                   | 168                             | 1.5x             |

Table 3. Throughput and NVSwitch benefit for Llama 3.1 70B inference at various fixed batch sizes

Throughput modeled using internal measurements. H200 GPU, TP=2, ISL/OSL = 8K/256. 

As batch size increases, GPU-to-GPU traffic increases, as does the benefit provided by NVSwitch compared to a point-to-point topology. However, even at relatively modest batch sizes, the gains can be significant.  

NVLink and NVSwitch provide high-bandwidth communication between GPUs based on the NVIDIA Hopper architecture, delivering significant benefits for real-time, cost-effective large model inference today.

As model sizes continue to grow, NVIDIA continues to innovate with both NVLink and NVSwitch to push the boundaries of real-time inference performance for even larger NVLink domains.

The NVIDIA Blackwell architecture features fifth-generation NVLink, which doubles per-GPU NVLink speeds to 1,800 GB/s. For Blackwell, a new NVSwitch chip and NVLink switch trays have also been introduced to enable even larger NVLink domain sizes.

The NVIDIA GB200 NVL72 system connects 36 NVIDIA Grace CPUs and 72 NVIDIA Blackwell GPUs in a rack-scale design and, with fifth-generation NVLink, allows all 72 GPUs to act as a single GPU, delivering 30x faster real-time trillion-parameter inference compared to the prior generation.
