
Low Latency Inference Chapter 2: Blackwell is Coming. NVIDIA GH200 NVL32 with NVLink Switch Gives Signs of Big Leap in Time to First Token Performance

Many of the most exciting applications of large language models (LLMs), such as interactive speech bots, coding co-pilots, and search, need to begin responding to user queries quickly to deliver positive user experiences. The time that it takes for an LLM to ingest a user prompt (and context, which can be sizable) and begin outputting a response is called time to first token (TTFT).
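As a concrete illustration of what TTFT measures, the minimal sketch below times the gap between sending a prompt and receiving the first streamed token. The streaming client here (fake_stream) and its latencies are made-up stand-ins, not a real deployment; any client that yields response tokens incrementally could be dropped in.

import time

def measure_ttft(stream_tokens, prompt):
    """Return (ttft_seconds, full_response) for any callable that yields
    response tokens for a prompt, e.g. a streaming LLM client."""
    start = time.perf_counter()
    first_token_at = None
    pieces = []
    for token in stream_tokens(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token arrives: end of TTFT window
        pieces.append(token)
    return first_token_at - start, "".join(pieces)

# Stand-in generator that simulates prefill latency followed by token decoding.
def fake_stream(prompt):
    time.sleep(0.5)           # simulated prompt-processing (prefill) time
    for tok in ["Hello", ",", " world", "!"]:
        time.sleep(0.02)      # simulated per-token decode time
        yield tok

ttft, text = measure_ttft(fake_stream, "Summarize this document...")
print(f"TTFT: {ttft * 1000:.0f} ms, response: {text!r}")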

As LLMs continue to grow in size – with the latest community models now featuring hundreds of billions of parameters – they deliver more accurate responses and also support even larger context windows to allow users to ask longer, information-rich queries. For example, the Llama 3.1 and 3.2 family of LLMs supports up to 128K token context windows, or roughly the length of a novel. These capabilities make LLMs more useful, but they also require more delivered parallel compute performance for good interactivity.

Until now, AI has scaled primarily with model pre-training. Recent advances mean it will also scale with post-training synthetic data generation and inference-time reasoning, making inference performance and scaling critically important.

In this post, we show how the NVIDIA GH200 NVL32 system, powered by 32 NVIDIA GH200 Grace Hopper Superchips connected using the NVLink Switch system, and with TensorRT-LLM improvements, scales to deliver outstanding TTFT for long-context inference using the latest Llama 3.1 70B and 405B models. 

Time-to-first-token matters for real-time use cases

Applications such as AI speech bots, digital assistants, AI NPCs in games, and more aim to simulate natural, human-like conversational capabilities. For these use cases, a TTFT in the realm of a few hundred milliseconds is crucial. 

To understand the impact of TTFT on the user experience, consider the following animations. The first represents a TTFT of about half a second, while the second represents a TTFT of about five seconds.

Figure 1. An animation simulating an interactive chat bot that takes 0.5 seconds to begin responding to user queries (0.5 second TTFT).
Figure 2. An animation simulating an interactive chat bot that takes 5 seconds to begin responding to user queries (5 second TTFT).

Fast TTFT is particularly impactful in services where up-to-date knowledge is important, such as the rapidly growing class of agentic workflows. To build useful agents, Retrieval-Augmented Generation (RAG) – which enhances LLM prompts with relevant data – is needed for accurate actions and responses. This means that contexts can be very long, spanning tens or hundreds of thousands of tokens. Having a fast TTFT, even at such long contexts, makes these services feel more interactive.

Below we will show how NVIDIA GH200 NVL32 is able to achieve the fastest published TTFT for the Llama 3.1 models, even at very long contexts.

NVIDIA GH200 NVL32 supercharges TTFT for long context inference

To generate the first new token in response to an inference request, the input tokens must be processed by the LLM. This phase of inference, known as prefill, often has a large number of tokens and thus benefits from increased aggregate compute performance. It can be accelerated by splitting the calculations across multiple GPUs using parallelism techniques, such as tensor parallelism. 
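To make the idea concrete, here is a minimal NumPy sketch of Megatron-style tensor parallelism for a single feed-forward block: the first weight matrix is sharded by columns and the second by rows, each shard is computed independently (as it would be on a separate GPU), and the partial outputs are summed – which is exactly the role AllReduce plays across real GPUs. The dimensions are toy values chosen purely for illustration.

# Minimal sketch of tensor parallelism for one feed-forward block.
import numpy as np

rng = np.random.default_rng(0)
seq, hidden, ffn, num_gpus = 8, 16, 64, 4

x  = rng.standard_normal((seq, hidden))
w1 = rng.standard_normal((hidden, ffn))
w2 = rng.standard_normal((ffn, hidden))

# Reference: un-parallelized forward pass.
reference = np.maximum(x @ w1, 0) @ w2

# Tensor parallel: shard w1 column-wise and w2 row-wise across "GPUs".
partials = []
for w1_shard, w2_shard in zip(np.split(w1, num_gpus, axis=1),
                              np.split(w2, num_gpus, axis=0)):
    partials.append(np.maximum(x @ w1_shard, 0) @ w2_shard)

# Summing the partial results is what AllReduce does across real GPUs.
assert np.allclose(sum(partials), reference)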

When computations are split across many GPUs using tensor parallelism, each GPU must exchange data with every other GPU in an AllReduce synchronization that happens twice per model layer. As the number of GPUs involved in the calculation increases, the total amount of synchronization traffic grows. Llama 3.1 405B incorporates 126 layers, yielding 252 AllReduce synchronizations per inference step. This means that running Llama 3.1 405B across 32 GPUs with an input sequence of 122,880 tokens generates 114 TB of aggregate interconnect traffic. A high-bandwidth, low-latency all-to-all GPU-to-GPU fabric is needed to minimize time spent on these synchronizations and maximize the time GPUs spend on compute.
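As a rough back-of-envelope, the sketch below counts the AllReduce operations for a 122,880-token prefill and estimates the traffic they generate. The hidden size, 16-bit activation width, and ring-AllReduce accounting are assumptions for illustration; the exact aggregate figure depends on the collective algorithm and on whether send and receive traffic are counted separately (which roughly doubles the estimate), so treat the output as an order-of-magnitude illustration of how the traffic scales rather than a reproduction of the number above.

# Back-of-envelope estimate of tensor-parallel AllReduce traffic during prefill.
# Assumptions (not from the post): hidden_size=16384, 16-bit activations,
# ring-AllReduce accounting of 2*(N-1)/N bytes moved per GPU per payload byte.

def allreduce_traffic_tb(seq_len, hidden_size, num_layers, num_gpus, bytes_per_elem=2):
    allreduces = 2 * num_layers                        # two AllReduces per transformer layer
    payload = seq_len * hidden_size * bytes_per_elem   # activation tensor per AllReduce
    per_gpu = 2 * (num_gpus - 1) / num_gpus * payload  # ring AllReduce, per GPU
    total = allreduces * per_gpu * num_gpus            # aggregate over all GPUs
    return allreduces, total / 1e12

n_allreduce, traffic = allreduce_traffic_tb(
    seq_len=122_880, hidden_size=16_384, num_layers=126, num_gpus=32)
print(f"{n_allreduce} AllReduces per prefill step, ~{traffic:.0f} TB aggregate traffic")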

GH200 NVL32 is a rack-scale solution that connects 32 NVIDIA GH200 Grace Hopper Superchips – each composed of an NVIDIA Grace CPU and an NVIDIA Hopper GPU connected via NVLink-C2C – using the NVLink Switch System. This allows each Hopper GPU to communicate with any other GPU within the NVLink domain at the full 900 GB/s bandwidth, for 28.8 TB/s of aggregate bandwidth. With the NVLink Switch System, the 32 GH200 Superchips effectively form “one mighty GPU” with up to 127 petaFLOPs of peak FP8 AI compute. This helps dramatically shorten TTFT, particularly on the most demanding models with long contexts.

In scaling from eight NVIDIA H200 Tensor Core GPUs to 32 GH200 Grace Hopper Superchips, TTFT for a 122,880-token Llama 3.1 405B query is accelerated by 3x, enabling a real-time experience. And, even for Llama 3.1 70B, the same-length query sees a 2.6x TTFT speedup.

In the following sections, we show how GH200 NVL32 makes responsive, long context Llama 3.1 70B and 405B inference possible. 

Llama 3.1 70B

A single GH200 NVL32 system achieves a TTFT of just 472 milliseconds when running Llama 3.1 70B with an input sequence length of 32,768. In practical terms, this means that Llama 3.1 70B can begin outputting a summary of a 90-page document, or coding suggestions for thousands of lines of code, in less than half a second.

Llama 3.1 70B
Time to first token (milliseconds)
(Lower is better)

Input Sequence Length | GH200 NVL32
4,096                 | 64
32,768                | 472
122,880               | 2,197
Table 1. Llama 3.1 70B time-to-first-token (TTFT) using the GH200 NVL32 rack-scale system.

Data measured between 9/6/2024 and 9/10/2024 using an internal TensorRT-LLM development branch. Batch = 1.

And, for an input sequence length of 122,880 – approximately 15K lines of code or a 330-page book – GH200 NVL32 can achieve a TTFT of just 2.2 seconds.

Llama 3.1 405B

Llama 3.1 405B requires substantially more compute to generate the first token of a response, as the model incorporates nearly 6X the parameter count of Llama 3.1 70B.

Llama 3.1 405B
Time to first token (milliseconds)
(Lower is better)

Input Sequence Length | GH200 NVL32
4,096                 | 208
32,768                | 1,627
122,880               | 7,508
Table 2. Llama 3.1 405B time-to-first-token (TTFT) using the GH200 NVL32 rack-scale system.

Data measured between 9/6/2024 and 9/10/2024 using an internal TensorRT-LLM development branch. Batch = 1.

GH200 NVL32, running Llama 3.1 405B, is able to provide a TTFT of about 1.6 seconds using a 32,768-token input. And, using a 122,880-token input – roughly the size of a small codebase – GH200 NVL32 can begin responding in just 7.5 seconds.
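For a sense of why the 405B model takes longer to produce its first token, the sketch below applies the common approximation of roughly two FLOPs per parameter per prompt token for the dense matmuls, plus the quadratic self-attention term. The layer counts and hidden sizes are the published Llama 3.1 model configurations, but the formula itself is an illustrative estimate, not the methodology behind the measurements above.

# Rough prefill compute estimate: ~2 FLOPs per parameter per token for the
# weight matmuls, plus the quadratic self-attention term (QK^T and attention*V).

def prefill_pflops(params, seq_len, num_layers, hidden_size):
    matmul = 2 * params * seq_len
    attention = 4 * num_layers * seq_len ** 2 * hidden_size
    return (matmul + attention) / 1e15

for name, params, layers, hidden in [
        ("Llama 3.1 70B", 70e9, 80, 8_192),
        ("Llama 3.1 405B", 405e9, 126, 16_384)]:
    flops = prefill_pflops(params, seq_len=122_880, num_layers=layers, hidden_size=hidden)
    print(f"{name}: ~{flops:.0f} PFLOPs to prefill a 122,880-token prompt")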

Inference continues to be a hotbed of invention

The pace of inference innovation across serving techniques, runtime optimizations, kernels and more has been extraordinary. Advancements like in-flight batching, speculative decoding, FlashAttention, key-value caching, and more have been developed by both industry and academia. Collectively, these innovations are enabling more capable models and systems to be deployed efficiently and more cost-effectively in production, making powerful AI capabilities more accessible to the entire NVIDIA ecosystem. 

To innovate quickly, researchers need a rich developer ecosystem and a productive tool stack. And, for innovations to have the greatest reach, a large platform installed base is required. The NVIDIA accelerated computing platform has more than 5 million developers, with an installed base of several hundred million GPUs across CSPs, on-prem, personal computers, and edge devices – all compatible with the CUDA programming model. Deep engagement with developers, computing providers, and customers enables and accelerates AI innovation on the NVIDIA platform.

Next up: accelerating agentic workflows

Agentic workflows perform tree search, self-reflection, and iterative inferences to reason and produce answers to complex queries. This means that the number of inferences per prompt will grow by orders of magnitude. With each successive inference, the aggregate response must be processed by the next agent as new context, so fast TTFT becomes even more important as workflows scale.
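A toy sketch of this pattern: each agent step consumes everything produced so far, so the number of prompt tokens that must be prefilled grows with every iteration. The step count and token figures below are arbitrary, chosen purely to illustrate the scaling.

# Hypothetical illustration of context growth across agentic steps: each step's
# output is appended to the context consumed by the next step, so prefill work
# (and hence the importance of fast TTFT) grows with every iteration.

def run_agentic_workflow(initial_context_tokens, steps, tokens_per_response=1_000):
    context = initial_context_tokens
    for step in range(1, steps + 1):
        # Each step must prefill the entire accumulated context before responding.
        print(f"step {step}: prefill {context:,} tokens")
        context += tokens_per_response  # this step's output becomes new context

run_agentic_workflow(initial_context_tokens=32_768, steps=5)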

Fast token generation speed is also important for agentic workflows. In a future chapter, we will provide an update on accelerating token generation speed by scaling to many more GPUs on the NVIDIA platform with NVLink and the NVLink Switch system.  

NVIDIA Blackwell GB200 NVL72 powers a new era of computing 

Looking ahead, as model sizes continue to grow rapidly, context lengths get even longer, and agentic workflows become more popular, the amount of delivered compute performance required for fast inference continues to rise.

Figure 3. An exploded view of a GB200 NVL72 compute tray, with key components such as boards and cooling visible, shown in front of the GB200 NVL72 rack.

The GB200 NVL72, based on the NVIDIA Blackwell platform, delivers the next giant leap for generative AI and accelerated computing. With second-generation Transformer Engine and fifth-generation Tensor Cores, Blackwell delivers up to 20 PFLOPS of FP4 AI compute – 5x the AI compute of NVIDIA Hopper. And, fifth-generation NVLink provides 1,800 GB/s of GPU-to-GPU bandwidth – twice that provided by Hopper – and expands NVLink domain size to 72 GPUs with the GB200 NVL72 rack-scale system, enabled by the latest NVLink Switch chip.

NVIDIA continues to innovate at every layer of the technology stack to increase performance, reduce total cost of ownership, and enable the next generation of AI.

This blog is part of a series – view Low Latency Inference Chapter 1: Up to 1.9x Higher Llama 3.1 Performance with Medusa on NVIDIA HGX H200 with NVLink Switch.
