NVIDIA offers tools like Perf Analyzer and Model Analyzer to assist machine learning engineers with measuring and balancing the trade-off between latency and throughput, crucial for optimizing ML inference performance. Model Analyzer has been embraced by leading organizations such as Snap to identify optimal configurations that enhance throughput and reduce deployment costs.
However, when serving generative AI models, particularly large language models (LLMs), performance measurement becomes more specialized.
For LLMs, latency and throughput metrics are further broken down into token-level metrics. The following list defines the key metrics (a short worked example follows the list); additional metrics such as request latency, request throughput, and the number of output tokens are also important to track.
- Time to first token: Time between when a request is sent and when its first response is received, one value per request in the benchmark.
- Output token throughput: Total number of output tokens from benchmark divided by benchmark duration.
- Inter-token latency: Time between intermediate responses for a single request divided by the number of generated tokens of the latter response, one value per response per request in the benchmark. This metric is also known as time per output token.
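To make these definitions concrete, consider a single request sent at t = 0 ms whose first token arrives at 200 ms and whose 100th and final output token arrives at 2,180 ms. Time to first token is 200 ms, and the average inter-token latency is (2,180 - 200) / 99 = 20 ms. If a benchmark completed ten such requests, 1,000 output tokens in total, over 10 seconds, its output token throughput would be 1,000 / 10 = 100 tokens per second. (These numbers are purely illustrative.)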
When measuring LLMs, it is important to get results quickly and to measure them consistently across users and models. For many applications, time to first token is given the highest priority, followed by output token throughput and inter-token latency. However, a tool that reports all of these metrics helps you define and measure what matters most for your specific system and use case.
Achieving optimal performance for LLM inference requires balancing these metrics effectively. There is a natural trade-off between output token throughput and inter-token latency: processing multiple user queries concurrently can increase throughput but may also lead to higher inter-token latency. Choosing the right balance to achieve total cost of ownership (TCO) savings can be challenging without specialized benchmarking tools tailored for generative AI.
Introducing GenAI-Perf
The latest release of NVIDIA Triton now includes a new generative AI performance benchmarking tool, GenAI-Perf. This solution is designed to enhance the measurement and optimization of generative AI performance.
This solution empowers you in the following ways:
- Accurately measure the specific metrics crucial for generative AI to determine the optimal configurations that deliver peak performance and cost-effectiveness.
- Use industry-standard datasets such as OpenOrca and CNN_dailymail to evaluate model performance.
- Facilitate standardized performance evaluation across diverse inference engines through an OpenAI-compatible API.
GenAI-Perf serves as the default benchmarking tool for assessing performance across all NVIDIA generative AI offerings, including NVIDIA NIM, NVIDIA Triton Inference Server, and NVIDIA TensorRT-LLM. It facilitates easy comparisons among different serving solutions that support the OpenAI-compatible API.
In the following sections, we walk through how to use GenAI-Perf to measure the performance of models served through OpenAI-compatible endpoints.
Currently supported endpoints
NVIDIA GenAI-Perf currently supports three OpenAI endpoint APIs:
- Chat
- Chat Completions
- Embeddings
New endpoints will be released as new model types become popular. GenAI-Perf is also open source and accepts community contributions.
Running GenAI-Perf
For more information about getting started, see Installation in the GenAI-Perf GitHub repo. The easiest way to run GenAI-Perf is to use the latest Triton Inference Server SDK container. To do so, pull the latest SDK container version from NVIDIA GPU Cloud, or pin a specific version by replacing YY.MM with the year and month of the release (for example, 24.07 for July 2024).
To run the container, run the following command:
docker run -it --net=host --rm --gpus=all \
nvcr.io/nvidia/tritonserver:YY.MM-py3-sdk
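For example, assuming you want the 24.07 release (substitute whichever tag you intend to use):

docker run -it --net=host --rm --gpus=all \
nvcr.io/nvidia/tritonserver:24.07-py3-sdk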
Then, you must get a model server running. Each of the following examples uses a different command to serve the corresponding type of model.
Start the vLLM OpenAI server:
docker run -it --net=host --rm \
--gpus=all vllm/vllm-openai:latest \
--model gpt2 \
--dtype float16 --max-model-len 1024
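Before benchmarking, you can optionally confirm that the server is responding. This is a minimal check that assumes vLLM's default listening address of localhost:8000; adjust the host and port if your setup differs:

# List the models exposed by the OpenAI-compatible server
curl http://localhost:8000/v1/models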
The model changes with the endpoint:
- For chat and chat-completion, use gpt2.
- For embeddings, use intfloat/e5-mistral-7b-instruct (a server launch sketch follows this list).
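For the embeddings example later in this post, one option is to reuse the same vLLM container and swap in the embeddings model. This is a sketch under that assumption; depending on your vLLM version, the model may require different flags or memory settings:

docker run -it --net=host --rm \
--gpus=all vllm/vllm-openai:latest \
--model intfloat/e5-mistral-7b-instruct \
--dtype float16 --max-model-len 1024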
When the server is running, you can run the genai-perf command to collect results. The results are displayed in the console and also saved as CSV and JSON files for easy parsing. They are written to the artifacts folder, along with other generated artifacts, including graphs that visualize model performance.
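After a run completes, you can inspect what was produced. The commands below assume the default artifacts directory created in your working directory; exact subdirectory and file names vary with the model, endpoint, and run configuration, so adjust the paths as needed:

# List everything GenAI-Perf generated, then preview any exported CSV results
ls -R artifacts/
cat artifacts/*/*.csv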
Profiling OpenAI chat-compatible models
Run GenAI-Perf:
genai-perf \
-m gpt2 \
--service-kind openai \
--endpoint-type chat \
--tokenizer gpt2
Review the sample results (Table 1).
Statistic | Avg | Min | Max | P99 | P90 | P75 |
--- | --- | --- | --- | --- | --- | --- |
Request latency (ms) | 1679.30 | 567.31 | 2929.26 | 2919.41 | 2780.70 | 2214.89 |
Output sequence length | 453.43 | 162.00 | 784.00 | 782.60 | 744.00 | 588.00 |
Input sequence length | 318.48 | 124.00 | 532.00 | 527.00 | 488.00 | 417.00 |
The output shows LLM performance metrics, such as request latency, output sequence length, and input sequence length, computed from running GPT2 for chat.
- Output token throughput (per sec): 269.99
- Request throughput (per sec): 0.60
Review some of the generated graphs (Figure 1).
Profiling OpenAI chat completions-compatible models
Run GenAI-Perf:
genai-perf \
-m gpt2 \
--service-kind openai \
--endpoint-type completions \
--tokenizer gpt2 \
--generate-plots
Review the sample results (Table 2).
Statistic | Avg | Min | Max | P99 | P90 | P75 |
--- | --- | --- | --- | --- | --- | --- |
Request latency (ms) | 74.56 | 30.08 | 96.08 | 93.43 | 82.34 | 74.81 |
Output sequence length | 15.88 | 2.00 | 17.00 | 16.00 | 16.00 | 16.00 |
Input sequence length | 311.62 | 29.00 | 570.00 | 538.04 | 479.40 | 413.00 |
- Output token throughput (per sec): 218.55
- Request throughput (per sec): 13.76
The output shows LLM performance metrics, such as request latency, output sequence length, and input sequence length, computed from running GPT2 for chat completions.
Review generated graphs (Figure 2).
Profiling OpenAI embeddings-compatible models
Create a compatible JSONL file with sample texts for embedding. You can generate this file with the following command on the Linux command line:
echo '{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}' > embeddings.jsonl
Run GenAI-Perf:
genai-perf \
-m intfloat/e5-mistral-7b-instruct \
--batch-size 2 \
--service-kind openai \
--endpoint-type embeddings \
--input-file embeddings.jsonl
Review the sample results (Table 3).
Statistic | Avg | Min | Max | P99 | P90 | P75 |
--- | --- | --- | --- | --- | --- | --- |
Request latency (ms) | 41.96 | 28.16 | 302.19 | 55.24 | 47.46 | 42.57 |
- Request throughput (per sec): 23.78
Conclusion
That is all there is to benchmarking your models with GenAI-Perf. You have everything you need to get started.
You can review the other CLI arguments to see how updating the inference parameters affects performance. For example, you can pass different values to --request-rate to change the number of requests sent per second, and then see how those changes affect metrics like inter-token latency, request latency, and throughput.
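For example, a simple shell loop can sweep several request rates in one pass. This is a sketch that reuses the chat configuration from earlier and assumes the same server is still running; pick rates that make sense for your system:

# Run one profile per request rate and compare the resulting metrics
for rate in 1 2 4 8; do
  genai-perf \
    -m gpt2 \
    --service-kind openai \
    --endpoint-type chat \
    --tokenizer gpt2 \
    --request-rate ${rate}
done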
GenAI-Perf is open source and available on GitHub.