
Optimizing Inference Efficiency for LLMs at Scale with NVIDIA NIM Microservices

As large language models (LLMs) continue to evolve at an unprecedented pace, enterprises are looking to build generative AI-powered applications that maximize throughput to lower operational costs and minimize latency to deliver superior user experiences.

This post discusses two critical performance metrics for LLMs, throughput and latency, exploring why each matters and how they trade off against each other. It also looks at how throughput and latency affect the efficiency and user experience of AI applications, and how they can be optimized with NVIDIA NIM microservices.

Key metrics for measuring cost efficiency

When a user sends a request to an LLM, the system processes this request and begins generating a response by outputting a series of tokens. There are often multiple requests sent to the system, which the system tries to process simultaneously to minimize wait time for each request. 

Throughput measures the number of successful operations per unit of time. It is an important measurement for enterprises to determine how well they can handle simultaneous user requests. For LLMs, throughput is measured in tokens per second. Because tokens are the unit of work an LLM service delivers, higher throughput can lower costs and increase revenue for enterprises.
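As a minimal sketch of the tokens-per-second definition above, the following computes system throughput from a log of completed requests. The `CompletedRequest` record and the sample numbers are hypothetical and chosen only for illustration; they are not measurements from any real deployment.

```python
from dataclasses import dataclass


@dataclass
class CompletedRequest:
    """One finished request (hypothetical record format, for illustration)."""
    start_s: float       # time the request arrived, in seconds
    end_s: float         # time the last token was delivered, in seconds
    output_tokens: int   # number of tokens generated for this request


def system_throughput(requests: list[CompletedRequest]) -> float:
    """Total output tokens divided by the wall-clock window covering all requests."""
    if not requests:
        return 0.0
    window_start = min(r.start_s for r in requests)
    window_end = max(r.end_s for r in requests)
    elapsed = window_end - window_start
    if elapsed <= 0:
        raise ValueError("measurement window must have positive duration")
    return sum(r.output_tokens for r in requests) / elapsed


# Three overlapping requests (made-up values): 600 tokens over a 5-second window.
requests = [
    CompletedRequest(start_s=0.0, end_s=4.0, output_tokens=200),
    CompletedRequest(start_s=0.5, end_s=5.0, output_tokens=300),
    CompletedRequest(start_s=1.0, end_s=3.0, output_tokens=100),
]
print(f"{system_throughput(requests):.1f} tokens/s")
```

Note that the window spans all overlapping requests, so this is a system-level number rather than a per-user generation speed.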

Furthermore, improved throughput provides a competitive edge in delivering high-performance applications that can scale with software like Kubernetes, leading to lower server costs and the ability to handle more users.

Latency, the delay before or between data transfers, is measured by time to first token (TTFT) and inter-token latency (ITL). Latency is crucial to ensure a smooth user experience while maximizing overall system efficiency.

Figure 1 shows a model receiving several concurrent requests (L1 – Ln) across a period of time (T_start – T_end), with each line representing the latency for each request. Having more lines in each row that are individually shorter equates to higher throughput and lower latency overall. 

Diagram showing the timeline of request processing from T_start to T_end, with each shorter line segment L1-Ln representing the latency of individual requests. Higher throughput is depicted by more line segments, indicating more tokens produced.
Figure 1. Timeline of request processing on a server from T_start to T_end, with each shorter line segment L1-Ln representing the latency of individual requests

TTFT measures the time it takes for the model to generate the first token after receiving a request, indicating how long an end user needs to wait before seeing the first token. This is essential for quick initial responses, from customer support to e-commerce bots. Most often, TTFT should be under a few seconds; the shorter, the better, though this imposes constraints on the overall system throughput (more on that in the next section).

ITL refers to the time interval between generating consecutive tokens, which is vital for applications that require smooth and continuous text generation. ITL should be short enough that tokens are generated faster than a person can read them, to ensure a smooth reading experience.
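To make the reading-speed guideline concrete, here is a rough back-of-the-envelope sketch. Both inputs are assumptions chosen for illustration, not measured values: a reader at 240 words per minute and a tokenizer that averages 1.25 tokens per word. Real values vary by reader, language, and tokenizer.

```python
def max_itl_for_reader(words_per_minute: float, tokens_per_word: float) -> float:
    """Largest ITL (in seconds) at which text is still generated as fast as it is read."""
    if words_per_minute <= 0 or tokens_per_word <= 0:
        raise ValueError("inputs must be positive")
    tokens_per_second = words_per_minute * tokens_per_word / 60.0
    return 1.0 / tokens_per_second


# Assumed inputs: 240 words/min reader, 1.25 tokens per word (illustrative only).
itl_budget_s = max_itl_for_reader(words_per_minute=240, tokens_per_word=1.25)
print(f"ITL budget: {itl_budget_s * 1000:.0f} ms per token")
```

Under these assumed inputs the reader consumes 5 tokens per second, so any ITL below about 200 ms keeps generation ahead of reading.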

Figure 2 shows the combination of these latency benchmarks with the user and inference service interaction. The time from the query to the first generated token is the TTFT and the time between each token is the ITL.

The image illustrates the token generation process in an inference service, highlighting TTFT and ITL. A user's query is sent to the inference service, which then generates and delivers tokens sequentially. TTFT represents the time from query submission to the delivery of the first token, while ITL indicates the time intervals between subsequent tokens. The entire process, from the first token to the last token, is referred to as the "generation time."
Figure 2. The token generation process in an inference service, highlighting the role of TTFT and ITL
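The two latency metrics can be computed directly from token arrival timestamps. The sketch below assumes a client that records when it sent the request and when each streamed token arrived; the function name and the sample timestamps are hypothetical.

```python
def ttft_and_itl(request_sent_s: float, token_times_s: list[float]) -> tuple[float, float]:
    """Return (TTFT, mean ITL) in seconds for one streamed response."""
    if not token_times_s:
        raise ValueError("at least one token timestamp is required")
    # TTFT: time from sending the query to receiving the first token.
    ttft = token_times_s[0] - request_sent_s
    # ITL: gaps between consecutive tokens, averaged over the response.
    gaps = [later - earlier for earlier, later in zip(token_times_s, token_times_s[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_itl


# Made-up timestamps: request at t=0, first token at 0.8 s, then one token every 50 ms.
arrivals = [0.8 + 0.05 * i for i in range(5)]
ttft, itl = ttft_and_itl(0.0, arrivals)
print(f"TTFT = {ttft:.2f} s, mean ITL = {itl * 1000:.0f} ms")
```

Averaging the gaps is one common convention. A percentile of the gaps may be more informative when token delivery is uneven.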

The goal for enterprises is to reduce ITL and TTFT to the extent possible, lowering latency while keeping throughput high. This ensures the overall system is efficient and the individual user’s experience is smooth.

Balancing throughput and latency

The trade-off between throughput and latency is driven by the number of concurrent requests and the latency budget, both determined by the application’s use case. By handling a large number of user requests concurrently, enterprises can increase throughput; however, this often results in higher latency for each individual request.

On the other hand, under a set latency budget, which is the amount of latency the end user will tolerate, one can maximize throughput by increasing the number of concurrent requests as far as the budget allows. The latency budget can constrain either the TTFT or the end-to-end latency.

Figure 3 illustrates the trade-off between throughput and latency. The y-axis is throughput, the x-axis is latency (TTFT in this case), and the corresponding concurrency is labeled over each marker on the curve. This curve can be used to identify the point that maximizes throughput within a specified latency budget, tailored to a specific use case.

Graph showing throughput versus TTFT with concurrency labeled on the markers. The x-axis represents "Single User: time to first token(s)" on a logarithmic scale, ranging from 0.1 to 100 seconds. The y-axis represents "Total System: tokens/s," ranging from 0 to over 1000 tokens per second. The graph shows a series of blue markers connected by a dotted line, illustrating the relationship between throughput and time to first token (TTFT) at different concurrency levels, with concurrencies printed over each marker. As TTFT increases, throughput rises and then plateaus, indicating a saturation point.
Figure 3. The relationship between TTFT and throughput in tokens per second
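Picking an operating point from a curve like this one amounts to filtering by the latency budget and taking the highest remaining throughput. The sketch below shows that selection on a hypothetical concurrency sweep. The numbers are invented to mimic the rise-then-plateau shape and are not read off Figure 3.

```python
from typing import NamedTuple, Optional


class SweepPoint(NamedTuple):
    concurrency: int      # number of concurrent requests during the run
    ttft_s: float         # single-user time to first token, in seconds
    tokens_per_s: float   # total system throughput


def best_within_budget(points: list[SweepPoint], ttft_budget_s: float) -> Optional[SweepPoint]:
    """Highest-throughput point whose TTFT fits the budget, or None if none fits."""
    eligible = [p for p in points if p.ttft_s <= ttft_budget_s]
    return max(eligible, key=lambda p: p.tokens_per_s, default=None)


# Hypothetical benchmark sweep (illustrative values only).
sweep = [
    SweepPoint(1, 0.1, 100.0),
    SweepPoint(10, 0.3, 600.0),
    SweepPoint(50, 1.0, 1000.0),
    SweepPoint(100, 5.0, 1100.0),
    SweepPoint(250, 30.0, 1120.0),
]
best = best_within_budget(sweep, ttft_budget_s=2.0)
print(best)
```

With a 2-second TTFT budget, the sketch selects the concurrency-50 point. The two higher-concurrency points offer only slightly more throughput at far higher TTFT.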

As the number of concurrent requests increases, more GPUs can be added by standing up multiple instances of the model service. This will sustain the needed level of throughput and user experience. For example, a chatbot handling shopping requests on Black Friday would need to use several GPUs to maintain throughput and latency under such peak concurrency. 
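A first-order estimate of how many model instances a peak requires is a ceiling division of peak concurrency by the concurrency one instance can serve within the latency budget. The function and numbers below are hypothetical, and the estimate assumes load is spread evenly across instances.

```python
import math


def replicas_needed(peak_concurrency: int, per_replica_concurrency: int) -> int:
    """Model instances needed so that no instance exceeds its concurrency limit."""
    if peak_concurrency < 0 or per_replica_concurrency <= 0:
        raise ValueError("peak must be >= 0 and per-replica limit must be > 0")
    return math.ceil(peak_concurrency / per_replica_concurrency)


# Made-up example: 1,200 simultaneous shoppers, 50 concurrent requests per instance.
print(replicas_needed(peak_concurrency=1200, per_replica_concurrency=50))
```

In practice some headroom above this estimate is usually kept, since traffic rarely divides evenly.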

By focusing on how throughput and latency vary with the number of concurrent users, enterprises can make informed decisions on enhancing the efficiency of their AI solutions based on their use cases. This translates to a balance between throughput and latency that avoids wasting resources and minimizes server costs.

How NVIDIA NIM optimizes throughput and latency

NVIDIA offers enterprises an optimized solution to maintain high throughput and low latency: NVIDIA NIM. NIM is a set of microservices for optimizing performance while offering security, ease of use, and the flexibility to deploy the models anywhere. NIM lowers total cost of ownership (TCO) by delivering low-latency, high-throughput AI inference that scales efficiently with infrastructure resources.

With NIM, enterprises can get optimized model performance through key techniques including runtime refinement, intelligent model representation, and tailored throughput and latency profiles. NVIDIA TensorRT-LLM optimizes model performance by leveraging parameters such as GPU count and batch size. With NIM, enterprises can have these parameters automatically tuned to best suit their use cases to reach optimal latency and throughput.

As part of the NVIDIA AI Enterprise suite of software, NIM goes through exhaustive tuning to ensure a high-performance configuration for each model. Additionally, techniques like tensor parallelism (Figure 4) and in-flight batching (IFB) further boost throughput and reduce latency by processing multiple requests in parallel and maximizing GPU utilization.

These powerful optimization techniques are widely available to increase performance in AI applications. Furthermore, NIM performance will increase over time as NVIDIA continues to refine each NIM with each new release.
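To give an intuition for why in-flight batching helps, the toy model below counts decode steps for a queue of requests under two policies. Static batching waits for the longest request in each batch to finish, while in-flight batching refills a slot as soon as its request completes. This is a simplified sketch, not the TensorRT-LLM scheduler: it ignores prefill cost and memory limits, and the function names and request lengths are hypothetical.

```python
def static_batching_steps(output_lens: list[int], batch_size: int) -> int:
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def inflight_batching_steps(output_lens: list[int], batch_size: int) -> int:
    """Decode steps when a finished request's slot is refilled immediately."""
    queue = list(output_lens)   # remaining output lengths of waiting requests
    active: list[int] = []      # tokens still to generate per occupied slot
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots before the next step.
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1
        # Every active request emits one token; drop the ones that finished.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
    return steps


# One long request followed by several short ones (made-up lengths, each >= 1).
lens = [10, 2, 2, 2, 2, 2]
print(static_batching_steps(lens, batch_size=2), inflight_batching_steps(lens, batch_size=2))
```

In this example the long request no longer holds an idle slot hostage: the short requests cycle through the second slot while the long one finishes, so the same work completes in fewer steps.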

Diagram illustrating tensor parallelism (TP) in LLMs. The diagram shows a neural network graph, with the top and bottom half split into two different colors, or sections, demonstrating how Tensor Parallelism optimizes inference across multiple GPUs.
Figure 4. Tensor parallelism shows how models can be sharded to utilize parallel computing across multiple GPUs, increasing throughput and minimizing latency by processing requests concurrently
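The sharding idea can be sketched with a single linear layer whose weight matrix is split by columns. Each shard computes part of the output independently, and the parts are concatenated. This is a pure-Python illustration under simplifying assumptions: the "devices" are simulated sequentially, the helper names are hypothetical, and real implementations also use other split patterns and GPU collectives.

```python
def matmul(x: list[float], w: list[list[float]]) -> list[float]:
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w: list[list[float]], parts: int) -> list[list[list[float]]]:
    """Split w into `parts` equal-width column shards, one per simulated device."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by the number of parts")
    width = cols // parts
    return [[row[k * width:(k + 1) * width] for row in w] for k in range(parts)]


def tensor_parallel_matmul(x: list[float], w: list[list[float]], parts: int) -> list[float]:
    """Compute x @ w by sharding w column-wise and concatenating the partial outputs."""
    # On real hardware each shard would be multiplied on its own GPU in parallel.
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Concatenating the partial outputs stands in for a gather across devices.
    return [value for part in partials for value in part]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
print(tensor_parallel_matmul(x, w, parts=2))
```

The sharded result matches the unsharded `matmul(x, w)`, which is the property that lets the work be divided without changing the model's output.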

NVIDIA NIM performance

Using NIM, throughput and latency improve significantly. Specifically, the NVIDIA Llama 3.1 8B Instruct NIM has achieved 2.5x improvement in throughput, 4x faster TTFT, and 2.2x faster ITL compared to the best open-source alternatives (Figure 5). 

This image has three charts that each show the improvement in performance metrics with NIM. With NIM, Llama 3.1 8B Instruct has a throughput of 6372 tokens/sec, TTFT of 1s, and an ITL of 30ms. With NIM Off, Llama 3.1 8B Instruct has a throughput of 2679 tokens/sec, TTFT of 4s, and an ITL of 65ms. The configuration is Llama 3.1 8B Instruct, input token length: 1,000, output token length: 1,000. Concurrent client requests: 200, on 1x NVIDIA H100 SXM.
Figure 5. The acceleration improvement for throughput and latency using Llama 3.1 8B Instruct

Figure 6 is a live demo of NIM On versus NIM Off that shows real-time chatbot generation. NIM On (right) produces an output 2.4x faster than NIM Off (left). This speedup with NIM On is provided by the optimized TensorRT-LLM and the techniques previously mentioned, such as in-flight batching and tensor parallelism.

This demo shows two chatbots generating responses when queried. It shows the performance difference between the model running with and without NIM. With NIM, the model runs 2.4x faster in terms of inter-token latency.
Figure 6. Demo of Mixtral 8x7B running with and without NIM for a 2.4x ITL gain with NIM On

Get started

NVIDIA NIM is setting a new standard in the world of enterprise AI by delivering unmatched performance, ease of use, and cost efficiency. Whether you’re looking to enhance customer service, streamline operations, or innovate in your industry, NIM provides the robust, scalable, and secure solution you need.

Experience the high throughput and low latency of the Llama 3 70B NIM

To learn more about benchmarking NIM on your machines, check out the NIM LLM Benchmarking Guide and NIM documentation.
