As large language models (LLMs) continue to evolve at an unprecedented pace, enterprises are looking to build generative AI-powered applications that maximize throughput to lower operational costs and minimize latency to deliver superior user experiences.
This post discusses the critical performance metrics of throughput and latency for LLMs, exploring their importance and trade-offs between the two. It also looks at how throughput and latency impact the efficiency and user experience of AI applications, and how they can be optimized with NVIDIA NIM microservices.
Key metrics for measuring cost efficiency
When a user sends a request to an LLM, the system processes this request and begins generating a response by outputting a series of tokens. There are often multiple requests sent to the system, which the system tries to process simultaneously to minimize wait time for each request.
Throughput measures the number of successful operations per unit of time, and it indicates how well a system can handle many user requests simultaneously. For LLMs, throughput is measured in tokens per second. Since tokens are the new currency, higher throughput can lower costs and bring in more revenue for enterprises.
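To make the metric concrete, here is a minimal sketch of how system throughput can be computed for a benchmark run: total output tokens divided by elapsed wall-clock time. The request counts and token totals below are hypothetical placeholders, not measured values.

```python
# Minimal sketch: computing aggregate throughput for a benchmark run.
# The numbers below are hypothetical placeholders, not measured values.

completed_requests = [
    {"output_tokens": 212, "latency_s": 1.9},
    {"output_tokens": 187, "latency_s": 1.7},
    {"output_tokens": 240, "latency_s": 2.1},
]

total_wall_clock_s = 2.3  # elapsed time for the whole run, not the sum of per-request latencies

total_tokens = sum(r["output_tokens"] for r in completed_requests)
throughput_tok_per_s = total_tokens / total_wall_clock_s

print(f"Total output tokens: {total_tokens}")
print(f"System throughput:   {throughput_tok_per_s:.1f} tokens/s")
```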
Furthermore, improved throughput provides a competitive edge in delivering high-performance applications that scale with orchestration software like Kubernetes, leading to lower server costs and the ability to serve more users.
Latency, the delay before or between data transfers, is measured by time to first token (TTFT) and inter-token latency (ITL). Latency is crucial to ensure a smooth user experience while maximizing overall system efficiency.
Figure 1 shows a model receiving several concurrent requests (L1 – Ln) across a period of time (T_start – T_end), with each line representing the latency of one request. Fitting more lines into the window, with each line individually shorter, equates to higher throughput and lower latency overall.
TTFT measures the time it takes for the model to generate the first token after receiving a request, indicating how long an end user needs to wait before seeing the first token. This is essential for quick initial responses, from customer support to e-commerce bots. Most often, TTFT should be under a few seconds; the shorter, the better, though this imposes constraints on the overall system throughput (more on that in the next section).
ITL refers to the time interval between generating consecutive tokens, which is vital for applications that require smooth and continuous text generation. ITL should be shorter than the time it takes a user to read each token, so that text is generated faster than it is read and the reading experience stays smooth.
Figure 2 shows the combination of these latency benchmarks with the user and inference service interaction. The time from the query to the first generated token is the TTFT and the time between each token is the ITL.
The goal for enterprises is to reduce ITL and TTFT to the extent possible, lowering latency while keeping throughput high. This ensures the overall system is efficient and the individual user’s experience is smooth.
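As a rough sketch of how TTFT and ITL can be measured from the client side, the snippet below times a streaming request against an OpenAI-compatible chat endpoint, which is the interface NIM exposes. The base URL, model name, and API key are placeholder assumptions for a local deployment, and each streamed chunk is treated as a single token, which is a simplification.

```python
import time
from openai import OpenAI  # pip install openai

# Placeholder endpoint, model name, and API key for a local deployment; adjust to yours.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

start = time.perf_counter()
token_times = []

stream = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # assumed model identifier
    messages=[{"role": "user", "content": "Explain TTFT and ITL in one sentence."}],
    stream=True,
)

for chunk in stream:
    # Simplification: each streamed chunk is counted as one token.
    if chunk.choices and chunk.choices[0].delta.content:
        token_times.append(time.perf_counter())

if token_times:
    ttft = token_times[0] - start
    itl = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)
    print(f"TTFT: {ttft * 1000:.0f} ms")
    print(f"Average ITL: {itl * 1000:.0f} ms")
```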
Balancing throughput and latency
The trade-off between throughput and latency is driven by the number of concurrent requests and the latency budget, both determined by the application’s use case. By handling a large number of user requests concurrently, enterprises can increase throughput; however, this often results in higher latency for each individual request.
On the other hand, under a set latency budget, which is the maximum latency the end user will tolerate, throughput can be maximized by increasing the number of concurrent requests up to that limit. The latency budget can constrain either TTFT or end-to-end latency.
Figure 3 illustrates the trade-off between throughput and latency. The y-axis is throughput and the x-axis is latency (TTFT in this case), with the corresponding concurrency labeled over each marker on the curve. This curve can be used to identify the point that maximizes throughput within a specified latency budget, tailored to a specific use case.
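One simple way to find that point programmatically is a concurrency sweep: benchmark each concurrency level, then keep the highest-throughput result that still meets the latency budget. The sketch below assumes a hypothetical run_benchmark helper that returns measured throughput and p95 TTFT for a given concurrency; the example numbers are made up.

```python
# Sketch: pick the highest-throughput concurrency that stays within a TTFT budget.
# run_benchmark() is a hypothetical helper that drives the given concurrency level
# against your endpoint and returns (tokens_per_s, p95_ttft_s).

TTFT_BUDGET_S = 2.0  # example latency budget; set this from your use case

def choose_concurrency(concurrency_levels, run_benchmark):
    best = None
    for concurrency in concurrency_levels:
        tokens_per_s, p95_ttft_s = run_benchmark(concurrency)
        if p95_ttft_s <= TTFT_BUDGET_S and (best is None or tokens_per_s > best[1]):
            best = (concurrency, tokens_per_s, p95_ttft_s)
    return best  # (concurrency, throughput, p95 TTFT) or None if nothing fits the budget

# Example with made-up measurements: concurrency -> (tokens/s, p95 TTFT in seconds)
fake_results = {1: (45, 0.3), 10: (320, 0.9), 50: (900, 1.8), 100: (1100, 3.5)}
print(choose_concurrency(fake_results, lambda c: fake_results[c]))  # -> (50, 900, 1.8)
```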
As the number of concurrent requests increases, more GPUs can be added by standing up multiple instances of the model service. This will sustain the needed level of throughput and user experience. For example, a chatbot handling shopping requests on Black Friday would need to use several GPUs to maintain throughput and latency under such peak concurrency.
By focusing on how throughput and latency vary with the number of concurrent users, enterprises can make informed decisions on enhancing the efficiency of their AI solutions based on their use cases. This translates to a balance between throughput and latency that avoids wasting resources and minimizes server costs.
How NVIDIA NIM optimizes throughput and latency
NVIDIA offers enterprises an optimized solution for maintaining high throughput and low latency: NVIDIA NIM. NIM is a set of microservices for optimizing performance while offering security, ease of use, and the flexibility to deploy models anywhere. NIM lowers total cost of ownership (TCO) by delivering low-latency, high-throughput AI inference that scales efficiently with infrastructure resources.
With NIM, enterprises can get optimized model performance through key techniques including runtime refinement, intelligent model representation, and tailored throughput and latency profiles. NVIDIA TensorRT-LLM optimizes model performance by leveraging parameters such as GPU count and batch size. With NIM, enterprises can have these parameters automatically tuned to best suit their use cases, reaching optimal latency and throughput.
As part of the NVIDIA AI Enterprise software suite, NIM goes through exhaustive tuning to ensure a high-performance configuration for each model. Additionally, techniques like tensor parallelism (Figure 4) and in-flight batching (IFB) further boost throughput and reduce latency by processing multiple requests in parallel and maximizing GPU utilization.
These powerful optimization techniques are widely available to increase performance in AI applications. Furthermore, NIM performance will increase over time as NVIDIA continues to refine each NIM with each new release.
NVIDIA NIM performance
Using NIM, throughput and latency improve significantly. Specifically, the NVIDIA Llama 3.1 8B Instruct NIM has achieved 2.5x improvement in throughput, 4x faster TTFT, and 2.2x faster ITL compared to the best open-source alternatives (Figure 5).
Figure 6 is a live demo of NIM On versus NIM Off that shows real-time chatbot generation. NIM On (right) produces output 2.4x faster than NIM Off (left). This speedup with NIM On is provided by the optimized TensorRT-LLM runtime and the techniques previously mentioned, such as in-flight batching and tensor parallelism.
Get started
NVIDIA NIM is setting a new standard in the world of enterprise AI by delivering unmatched performance, ease of use, and cost efficiency. Whether you’re looking to enhance customer service, streamline operations, or innovate in your industry, NIM provides the robust, scalable, and secure solution you need.
Experience the high throughput and low latency of the Llama 3 70B NIM.
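As a starting point, a NIM endpoint can be queried with a plain HTTP request to its OpenAI-compatible chat completions route. The URL, model name, and API key below are placeholder assumptions; substitute the values for your own deployment or the hosted API you are using.

```python
# Minimal sketch: sending a chat request to a NIM OpenAI-compatible endpoint.
# The URL, model name, and API key are placeholders; substitute your own values.
import requests

url = "http://localhost:8000/v1/chat/completions"   # hypothetical local NIM endpoint
headers = {"Authorization": "Bearer YOUR_API_KEY"}   # adjust or omit for local deployments

payload = {
    "model": "meta/llama3-70b-instruct",  # assumed model identifier
    "messages": [{"role": "user", "content": "Summarize why low latency matters for chatbots."}],
    "max_tokens": 128,
}

response = requests.post(url, headers=headers, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```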
To learn more about benchmarking NIM on your machines, check out the NIM LLM Benchmarking Guide and NIM documentation.