
Optimizing Inference Efficiency for LLMs at Scale with NVIDIA NIM Microservices

As large language models (LLMs) continue to evolve at an unprecedented pace, enterprises are looking to build generative AI-powered applications that maximize throughput to lower operational costs and minimize latency to deliver superior user experiences.

This post discusses the critical performance metrics of throughput and latency for LLMs, exploring their importance and the trade-offs between them. It also looks at how throughput and latency affect the efficiency and user experience of AI applications, and how they can be optimized with NVIDIA NIM microservices.

Key metrics for measuring cost efficiency

When a user sends a request to an LLM, the system processes this request and begins generating a response by outputting a series of tokens. There are often multiple requests sent to the system, which the system tries to process simultaneously to minimize wait time for each request. 

Throughput measures the number of successful operations per unit of time and indicates how well an enterprise can handle many user requests simultaneously. For LLMs, throughput is measured in tokens per second. Because tokens are effectively the currency of generative AI, higher throughput can lower costs and bring in more revenue for enterprises.

Furthermore, improved throughput provides a competitive edge in delivering high-performance applications that can scale with orchestration software like Kubernetes, leading to lower server costs and the ability to serve more users.

Latency, the delay before or between data transfers, is measured for LLMs by time to first token (TTFT) and inter-token latency (ITL). Keeping latency low is crucial for a smooth user experience while maximizing overall system efficiency.

Figure 1 shows a model receiving several concurrent requests (L1 – Ln) over a period of time (T_start – T_end), with each line segment representing the latency of one request. Fitting more, individually shorter segments into the same window equates to higher throughput and lower latency overall.

Figure 1. Timeline of request processing on a server from T_start to T_end, with each line segment L1–Ln representing the latency of an individual request; more, shorter segments indicate higher throughput and more tokens produced

TTFT measures the time it takes for the model to generate the first token after receiving a request, indicating how long an end user needs to wait before seeing any output. A low TTFT is essential for applications that need quick initial responses, from customer support to e-commerce bots. Most often, TTFT should be under a few seconds; the shorter, the better, though tightening it constrains the overall system throughput (more on that in the next section).

ITL refers to the time interval between consecutive generated tokens, which is vital for applications that require smooth, continuous text generation. ITL should be shorter than the time it takes a person to read a token, so that text is produced faster than the user can read it.

Figure 2 shows the combination of these latency benchmarks with the user and inference service interaction. The time from the query to the first generated token is the TTFT and the time between each token is the ITL.

Figure 2. The token generation process in an inference service: TTFT is the time from query submission to delivery of the first token, ITL is the interval between subsequent tokens, and the span from the first token to the last token is the generation time

The goal for enterprises is to reduce ITL and TTFT to the extent possible, lowering latency while keeping throughput high. This ensures the overall system is efficient and the individual user’s experience is smooth.
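
To make these metrics concrete, below is a minimal Python sketch that measures TTFT and ITL for a single request against an OpenAI-compatible streaming endpoint, such as the one a deployed NIM exposes. The base URL, API key, and model name are illustrative placeholders, and because a streamed chunk can carry more than one token, the ITL figure is a chunk-level approximation.

```python
import time
from openai import OpenAI  # pip install openai

# Illustrative placeholders: point base_url at any OpenAI-compatible server,
# such as a locally deployed NIM microservice.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

def measure_latency(prompt: str, model: str = "meta/llama-3.1-8b-instruct"):
    """Stream one completion and return (TTFT seconds, mean ITL seconds)."""
    chunk_times = []
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            chunk_times.append(time.perf_counter())

    if not chunk_times:
        raise RuntimeError("No tokens were returned")
    ttft = chunk_times[0] - start
    # Mean gap between consecutive streamed chunks after the first one.
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

if __name__ == "__main__":
    ttft, itl = measure_latency("Explain inter-token latency in two sentences.")
    print(f"TTFT: {ttft:.3f} s, mean ITL: {itl * 1000:.1f} ms")
```

For token-accurate, multi-request measurements, the benchmarking guide linked at the end of this post is the better starting point.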

Balancing throughput and latency

The trade-off between throughput and latency is driven by the number of concurrent requests and the latency budget, both determined by the application’s use case.  By handling a large number of user requests concurrently, enterprises can increase throughput; however, this often results in higher latency for each individual request. 

On the other hand, under a set latency budget, which is the acceptable amount of latency the end user will tolerate, one can maximize throughput by increasing the number of concurrent requests. Latency budget can pose a constraint for either the TTFT or the end-to-end latency.
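
One way to explore this trade-off empirically is to sweep the number of concurrent requests and record total throughput alongside the average TTFT at each level. The asyncio-based sketch below assumes the same OpenAI-compatible endpoint as before; the model name, prompt, and concurrency levels are illustrative placeholders, and tokens are approximated by streamed chunks.

```python
import asyncio
import time
from openai import AsyncOpenAI  # pip install openai

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
MODEL = "meta/llama-3.1-8b-instruct"  # placeholder model name
PROMPT = "Write a short product description for a travel mug."

async def one_request():
    """Return (ttft_seconds, completion_chunks) for one streamed request."""
    start = time.perf_counter()
    ttft, chunks = None, 0
    stream = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=200,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if ttft is None:
                ttft = time.perf_counter() - start
            chunks += 1  # approximation: one streamed chunk ~= one token
    return ttft, chunks

async def sweep(concurrencies=(1, 5, 10, 25, 50, 100)):
    for c in concurrencies:
        t0 = time.perf_counter()
        results = await asyncio.gather(*[one_request() for _ in range(c)])
        elapsed = time.perf_counter() - t0
        total_tokens = sum(n for _, n in results)
        ttfts = [t for t, _ in results if t is not None]
        avg_ttft = sum(ttfts) / len(ttfts)
        print(f"concurrency={c:4d}  throughput={total_tokens / elapsed:8.1f} tok/s"
              f"  avg TTFT={avg_ttft:.2f} s")

if __name__ == "__main__":
    asyncio.run(sweep())
```

Each printed row corresponds to one marker on a curve like the one in Figure 3.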

Figure 3 illustrates this trade-off. The y-axis is throughput, the x-axis is latency (TTFT in this case), and the corresponding concurrency is labeled over each marker on the curve. A curve like this can be used to identify the point that maximizes throughput within the latency budget of a specific use case.

Figure 3. The relationship between TTFT (single-user time to first token in seconds, x-axis, log scale) and total system throughput (tokens/s, y-axis), with the concurrency level labeled on each marker; as concurrency and TTFT increase, throughput rises and then plateaus at a saturation point
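
Given sweep results like those above, picking the operating point reduces to taking the highest-throughput configuration whose latency still fits the budget. A small illustrative helper (the function name and the sample numbers are made up for demonstration):

```python
def pick_operating_point(results, ttft_budget_s):
    """results: list of (concurrency, throughput_tok_s, avg_ttft_s) tuples.

    Returns the highest-throughput entry whose average TTFT fits the
    latency budget, or None if no configuration qualifies.
    """
    within_budget = [r for r in results if r[2] <= ttft_budget_s]
    return max(within_budget, key=lambda r: r[1], default=None)

# Made-up numbers shaped like a Figure 3 curve:
sweep_results = [(1, 150, 0.2), (10, 900, 0.4), (50, 2400, 1.1), (200, 3100, 4.5)]
print(pick_operating_point(sweep_results, ttft_budget_s=2.0))  # -> (50, 2400, 1.1)
```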

As the number of concurrent requests increases, more GPUs can be added by standing up additional instances of the model service, sustaining the needed level of throughput and user experience. For example, a chatbot handling shopping requests on Black Friday would need several GPUs to maintain throughput and latency under such peak concurrency.

By focusing on how throughput and latency vary with the number of concurrent users, enterprises can make informed decisions about improving the efficiency of their AI solutions for their use cases. Striking the right balance between throughput and latency avoids wasting resources and minimizes server costs.

How NVIDIA NIM optimizes throughput and latency

NVIDIA offers enterprises an optimized solution for maintaining high throughput and low latency: NVIDIA NIM. NIM is a set of microservices for optimizing inference performance while offering security, ease of use, and the flexibility to deploy models anywhere. NIM lowers total cost of ownership (TCO) by delivering low-latency, high-throughput AI inference that scales efficiently with infrastructure resources.

With NIM, enterprises get optimized model performance through key techniques including runtime refinement, intelligent model representation, and tailored throughput and latency profiles. NVIDIA TensorRT-LLM optimizes model performance by leveraging parameters such as GPU count and batch size, and NIM tunes these parameters automatically to best suit each use case, reaching optimal latency and throughput.

As part of the NVIDIA AI Enterprise suite of software, NIM goes through exhaustive tuning to ensure a high-performance configuration for each model. Additionally, techniques like tensor parallelism (Figure 4) and in-flight batching (IFB) further boost throughput and reduce latency by processing multiple requests in parallel and maximizing GPU utilization.

These powerful optimization techniques are widely available to increase performance in AI applications. Furthermore, NIM performance will increase over time as NVIDIA continues to refine each NIM with each new release.

Figure 4. Tensor parallelism shards a model across multiple GPUs so that portions of the network compute in parallel, increasing throughput and minimizing latency by processing requests concurrently
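
As a conceptual illustration of the sharding shown in Figure 4 (a toy sketch, not how TensorRT-LLM implements it), the NumPy snippet below splits one linear layer column-wise across two simulated devices and verifies that the gathered partial results match the unsharded computation:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, out_features = 1024, 4096

x = rng.standard_normal((8, hidden))             # a batch of activations
W = rng.standard_normal((hidden, out_features))  # full weight matrix

# Column-parallel split: each simulated "device" holds half the output columns.
W_dev0, W_dev1 = np.hsplit(W, 2)

# Each device computes its shard independently (in parallel on real GPUs);
# the shards are then gathered into the full output.
y_sharded = np.concatenate([x @ W_dev0, x @ W_dev1], axis=1)

y_full = x @ W
assert np.allclose(y_sharded, y_full)
print("Sharded and unsharded outputs match:", y_sharded.shape)
```

On real hardware, the shards live on different GPUs and the gather is a collective communication step, handled for you as part of the tuned configurations described above.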

NVIDIA NIM performance

Using NIM, throughput and latency improve significantly. Specifically, the NVIDIA Llama 3.1 8B Instruct NIM has achieved 2.5x improvement in throughput, 4x faster TTFT, and 2.2x faster ITL compared to the best open-source alternatives (Figure 5). 

Figure 5. Throughput and latency improvement for Llama 3.1 8B Instruct with NIM On versus NIM Off: 6,372 versus 2,679 tokens/s throughput, 1 s versus 4 s TTFT, and 30 ms versus 65 ms ITL (input token length: 1,000; output token length: 1,000; 200 concurrent client requests; 1x NVIDIA H100 SXM)

Figure 6 is a live demo of NIM On versus NIM Off showing real-time chatbot generation. NIM On (right) produces output 2.4x faster than NIM Off (left). This speedup comes from the optimized TensorRT-LLM runtime and the techniques mentioned previously, such as in-flight batching and tensor parallelism.

Figure 6. Demo of Mixtral 8x7B running with and without NIM, showing a 2.4x ITL improvement with NIM On

Get started

NVIDIA NIM is setting a new standard in the world of enterprise AI by delivering unmatched performance, ease of use, and cost efficiency. Whether you’re looking to enhance customer service, streamline operations, or innovate in your industry, NIM provides the robust, scalable, and secure solution you need.

Experience the high throughput and low latency of the Llama 3 70B NIM.

To learn more about benchmarking NIM on your machines, check out the NIM LLM Benchmarking Guide and NIM documentation.
