NVIDIA offers tools like Perf Analyzer and Model Analyzer to assist machine learning engineers with measuring and balancing the trade-off between latency and throughput, crucial for optimizing ML inference performance. Model Analyzer has been embraced by leading organizations such as Snap to identify optimal configurations that enhance throughput and reduce deployment costs.
However, when serving generative AI models, particularly large language models (LLMs), performance measurement becomes more specialized.
For LLMs, latency and throughput metrics are further broken down into token-level metrics. The following list defines the key metrics (a short worked example follows the list); additional metrics such as request latency, request throughput, and the number of output tokens are also important to track.
- Time to first token: Time between when a request is sent and when its first response is received, one value per request in the benchmark.
- Output token throughput: Total number of output tokens from benchmark divided by benchmark duration.
- Inter-token latency: Time between intermediate responses for a single request divided by the number of generated tokens of the latter response, one value per response per request in the benchmark. This metric is also known as time per output token.
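To make these definitions concrete, consider a single request sent at t = 0 ms whose first token arrives at 200 ms and whose 100th and final output token arrives at 2,180 ms. Time to first token is 200 ms, and the average inter-token latency is (2,180 - 200) / 99 = 20 ms. If a benchmark completed ten such requests, 1,000 output tokens in total, over 10 seconds, its output token throughput would be 1,000 / 10 = 100 tokens per second. (These numbers are purely illustrative.)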
When measuring LLMs, it is important to get results quickly and to measure them consistently across users and models. For many applications, time to first token is given the highest priority, followed by output token throughput and inter-token latency. However, a tool that reports all of these metrics helps you define and measure what matters most for your specific system and use case.
Achieving optimal performance for LLM inference requires balancing these metrics effectively. There is a natural trade-off between output token throughput and inter-token latency: processing multiple user queries concurrently can increase throughput but may also lead to higher inter-token latency. Choosing the right balance to achieve total cost of ownership (TCO) savings can be challenging without specialized benchmarking tools tailored for generative AI.
Introducing GenAI-Perf
The latest release of NVIDIA Triton now includes a new generative AI performance benchmarking tool, GenAI-Perf. This solution is designed to enhance the measurement and optimization of generative AI performance.
This solution empowers you in the following ways:
- Accurately measure the specific metrics crucial for generative AI to determine the optimal configurations that deliver peak performance and cost-effectiveness.
- Use industry-standard datasets such as OpenOrca and CNN_dailymail to evaluate model performance.
- Facilitate standardized performance evaluation across diverse inference engines through an OpenAI-compatible API.
GenAI-Perf serves as the default benchmarking tool for assessing performance across all NVIDIA generative AI offerings, including NVIDIA NIM, NVIDIA Triton Inference Server, and NVIDIA TensorRT-LLM. It facilitates easy comparisons among different serving solutions that support the OpenAI-compatible API.
In the following sections, we walk through how to use GenAI-Perf to measure the performance of models served through OpenAI-compatible endpoints.
Currently supported endpoints
NVIDIA GenAI-Perf currently supports three OpenAI endpoint APIs:
- Chat
- Chat Completions
- Embeddings
New endpoints will be released as new model types become popular. GenAI-Perf is also open source and accepts community contributions.
Running GenAI-Perf
For more information about getting started, see Installation in the GenAI-Perf GitHub repo. The easiest way to run GenAI-Perf is to use the latest Triton Inference Server SDK container. To do so, pull the latest SDK container version from NVIDIA GPU Cloud, or pin a specific version by replacing YY.MM with the year and month of the release (for example, 24.07 for July 2024).
To run the container, run the following command:
docker run -it --net=host --rm --gpus=all \
nvcr.io/nvidia/tritonserver:YY.MM-py3-sdk
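For example, assuming you want the 24.07 release (substitute whichever tag you intend to use):

docker run -it --net=host --rm --gpus=all \
nvcr.io/nvidia/tritonserver:24.07-py3-sdk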
Then, you must get a model server running. Each of the following examples uses a different command to serve the corresponding type of model.
Start the vLLM OpenAI server:
docker run -it --net=host --rm \
--gpus=all vllm/vllm-openai:latest \
--model gpt2 \
--dtype float16 --max-model-len 1024
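Before benchmarking, you can optionally confirm that the server is responding. This is a minimal check that assumes vLLM's default listening address of localhost:8000; adjust the host and port if your setup differs:

# List the models exposed by the OpenAI-compatible server
curl http://localhost:8000/v1/models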
The model changes with the endpoint:
- For chat and chat-completion, use gpt2.
- For embeddings, use intfloat/e5-mistral-7b-instruct (a server launch sketch follows this list).
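For the embeddings example later in this post, one option is to reuse the same vLLM container and swap in the embeddings model. This is a sketch under that assumption; depending on your vLLM version, the model may require different flags or memory settings:

docker run -it --net=host --rm \
--gpus=all vllm/vllm-openai:latest \
--model intfloat/e5-mistral-7b-instruct \
--dtype float16 --max-model-len 1024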
When the server is running, you can run the genai-perf command to collect results. The results are displayed in the console and also saved as CSV and JSON files for easy parsing. They are written to the artifacts folder, along with other generated artifacts, including graphs that visualize model performance.
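After a run completes, you can inspect what was produced. The commands below assume the default artifacts directory created in your working directory; exact subdirectory and file names vary with the model, endpoint, and run configuration, so adjust the paths as needed:

# List everything GenAI-Perf generated, then preview any exported CSV results
ls -R artifacts/
cat artifacts/*/*.csv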
Profiling OpenAI chat-compatible models
Run GenAI-Perf:
genai-perf \
-m gpt2 \
--service-kind openai \
--endpoint-type chat \
--tokenizer gpt2
Review the sample results (Table 1).
Statistic | Avg | Min | Max | P99 | P90 | P75 |
--- | --- | --- | --- | --- | --- | --- |
Request latency (ms) | 1679.30 | 567.31 | 2929.26 | 2919.41 | 2780.70 | 2214.89 |
Output sequence length | 453.43 | 162.00 | 784.00 | 782.60 | 744.00 | 588.00 |
Input sequence length | 318.48 | 124.00 | 532.00 | 527.00 | 488.00 | 417.00 |
The output shows LLM performance metrics, such as request latency, output sequence length, and input sequence length, computed from running GPT2 for chat.
- Output token throughput (per sec): 269.99
- Request throughput (per sec): 0.60
Review some of the generated graphs (Figure 1).
Profiling OpenAI chat completions-compatible models
Run GenAI-Perf:
genai-perf \
-m gpt2 \
--service-kind openai \
--endpoint-type completions \
--tokenizer gpt2 \
--generate-plots
Review the sample results (Table 2).
Statistic | Avg | Min | Max | P99 | P90 | P75 |
--- | --- | --- | --- | --- | --- | --- |
Request latency (ms) | 74.56 | 30.08 | 96.08 | 93.43 | 82.34 | 74.81 |
Output sequence length | 15.88 | 2.00 | 17.00 | 16.00 | 16.00 | 16.00 |
Input sequence length | 311.62 | 29.00 | 570.00 | 538.04 | 479.40 | 413.00 |
- Output token throughput (per sec): 218.55
- Request throughput (per sec): 13.76
The output shows LLM performance metrics, such as request latency, output sequence length, and input sequence length, computed from running GPT2 for chat completions.
Review generated graphs (Figure 2).
Profiling OpenAI embeddings-compatible models
Create a compatible JSONL file with sample texts for embedding. You can generate this file with the following command on the Linux command line:
echo '{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}' > embeddings.jsonl
Run GenAI-Perf:
genai-perf \
-m intfloat/e5-mistral-7b-instruct \
--batch-size 2 \
--service-kind openai \
--endpoint-type embeddings \
--input-file embeddings.jsonl
Review the sample results (Table 3).
Statistic | Avg | Min | Max | P99 | P90 | P75 |
--- | --- | --- | --- | --- | --- | --- |
Request latency (ms) | 41.96 | 28.16 | 302.19 | 55.24 | 47.46 | 42.57 |
- Request throughput (per sec): 23.78
Conclusion
That is all there is to benchmarking your models with GenAI-Perf. You have everything you need to get started.
You can review the other CLI arguments to see how updating the inference parameters affects performance. For example, you can pass different values to --request-rate to change the number of requests sent per second, and then see how those changes affect metrics like inter-token latency, request latency, and throughput.
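For example, a simple shell loop can sweep several request rates in one pass. This is a sketch that reuses the chat configuration from earlier and assumes the same server is still running; pick rates that make sense for your system:

# Run one profile per request rate and compare the resulting metrics
for rate in 1 2 4 8; do
  genai-perf \
    -m gpt2 \
    --service-kind openai \
    --endpoint-type chat \
    --tokenizer gpt2 \
    --request-rate ${rate}
done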
GenAI-Perf is open source and available on GitHub.