AI Inference
Inference can be deployed in many ways, depending on the use case. Offline processing of data is best done at larger batch sizes, which deliver optimal GPU utilization and throughput. However, increasing throughput also tends to increase latency. Generative AI and Large Language Model (LLM) deployments seek to deliver great experiences by lowering latency. Developers and infrastructure managers therefore need to strike a balance between throughput and latency that delivers a great user experience and the best possible throughput while containing deployment costs.
When deploying LLMs at scale, a typical way to balance these concerns is to set a time-to-first-token (TTFT) limit and optimize throughput within that limit. The data presented in the Large Language Model Low Latency section show the best throughput at a time limit of one second, which enables high throughput at low latency for most users while making efficient use of compute resources.
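To make the trade-off concrete, here is a minimal sketch (Python, with made-up latency and scaling numbers rather than measured data) of the tuning loop described above: sweep request concurrency and keep the highest-throughput operating point whose simulated TTFT stays inside a one-second budget.

```python
# Illustrative only: a toy serving model, not NVIDIA's benchmark harness.
# TTFT grows with load and throughput scales sublinearly; both curves are
# invented for demonstration.

TTFT_BUDGET_S = 1.0

def simulate(concurrency: int) -> tuple[float, float]:
    """Return (ttft_seconds, tokens_per_second) for a hypothetical server."""
    ttft = 0.05 + 0.02 * concurrency                              # queueing delay grows with load
    throughput = 1_200 * concurrency / (1 + 0.01 * concurrency)   # sublinear scaling
    return ttft, throughput

best = None
for concurrency in range(1, 257):
    ttft, tps = simulate(concurrency)
    if ttft <= TTFT_BUDGET_S and (best is None or tps > best[2]):
        best = (concurrency, ttft, tps)

concurrency, ttft, tps = best
print(f"concurrency={concurrency}: TTFT={ttft:.2f}s, {tps:,.0f} tokens/sec")
```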
Click here to view other performance data.
MLPerf Inference v5.0 Performance Benchmarks
Offline Scenario, Closed Division
Network | Throughput | GPU | Server | GPU Version | Target Accuracy | Dataset
---|---|---|---|---|---|---
Llama3.1 405B | 13,886 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 | NVIDIA GB200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama3.1 405B | 1,538 tokens/sec | 8x B200 | SYS-421GE-NBRT-LCC | NVIDIA B200-SXM-180GB | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama3.1 405B | 574 tokens/sec | 8x H200 | Cisco UCS C885A M8 | NVIDIA H200-SXM-141GB | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama2 70B | 98,858 tokens/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200-SXM-180GB | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | OpenOrca (max_seq_len=1024)
Llama2 70B | 35,453 tokens/sec | 8x H200 | ThinkSystem SR680a V3 | NVIDIA H200-SXM-141GB | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | OpenOrca (max_seq_len=1024)
Mixtral 8x7B | 128,795 tokens/sec | 8x B200 | SYS-421GE-NBRT-LCC | NVIDIA B200-SXM-180GB | 99% of FP16 ((OpenOrca)rouge1=45.5989, (OpenOrca)rouge2=23.3526, (OpenOrca)rougeL=30.4608, (gsm8k)Accuracy=73.66, (mbxp)Accuracy=60.16) | OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048)
Mixtral 8x7B | 63,515 tokens/sec | 8x H200 | ThinkSystem SR780a V3 | NVIDIA H200-SXM-141GB | 99% of FP16 ((OpenOrca)rouge1=45.5989, (OpenOrca)rouge2=23.3526, (OpenOrca)rougeL=30.4608, (gsm8k)Accuracy=73.66, (mbxp)Accuracy=60.16) | OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048)
Stable Diffusion XL | 30 samples/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200-SXM-180GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val
Stable Diffusion XL | 19 samples/sec | 8x H200 | AS-4125GS-TNHR2-LCC | NVIDIA H200-SXM-141GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val
RGAT | 450,175 samples/sec | 8x H200 | ThinkSystem SR780a V3 | NVIDIA H200-SXM-141GB | 99% of FP32 (72.86%) | IGBH
GPT-J | 21,626 tokens/sec | 8x H200 | ThinkSystem SR780a V3 | NVIDIA H200-SXM-141GB | 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | CNN Dailymail (v3.0.0, max_seq_len=2048)
ResNet-50 | 773,300 samples/sec | 8x H200 | ThinkSystem SR680a V3 | NVIDIA H200-SXM-141GB | 76.46% Top1 | ImageNet (224x224)
RetinaNet | 15,200 samples/sec | 8x H200 | AS-4125GS-TNHR2-LCC | NVIDIA H200-SXM-141GB | 0.3755 mAP | OpenImages (800x800)
DLRMv2 | 654,489 samples/sec | 8x H200 | HPE Cray XD670 with Cray ClusterStor | NVIDIA H200-SXM-141GB | 99% of FP32 (AUC=80.31%) | Synthetic Multihot Criteo Dataset
3D-UNET | 55 samples/sec | 8x H200 | HPE Cray XD670 with Cray ClusterStor | NVIDIA H200-SXM-141GB | 99.9% of FP32 (0.86330 mean DICE score) | KiTS 2019
Server Scenario, Closed Division
Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset
---|---|---|---|---|---|---|---
Llama3.1 405B | 8,850 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 | NVIDIA GB200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama3.1 405B | 1,080 tokens/sec | 8x B200 | SYS-A21GE-NBRT | NVIDIA B200-SXM-180GB | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama3.1 405B | 294 tokens/sec | 8x H200 | Cisco UCS C885A M8 | NVIDIA H200-SXM-141GB | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama2 70B Interactive | 62,266 tokens/sec | 8x B200 | SYS-A21GE-NBRT | NVIDIA B200-SXM-180GB | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024)
Llama2 70B Interactive | 20,235 tokens/sec | 8x H200 | G893-SD1 | NVIDIA H200-SXM-141GB | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024)
Llama2 70B | 98,443 tokens/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200-SXM-180GB | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024)
Llama2 70B | 33,072 tokens/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB-CTS | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024)
Mixtral 8x7B | 129,047 tokens/sec | 8x B200 | SYS-421GE-NBRT-LCC | NVIDIA B200-SXM-180GB | 99% of FP16 ((OpenOrca)rouge1=45.5989, (OpenOrca)rouge2=23.3526, (OpenOrca)rougeL=30.4608, (gsm8k)Accuracy=73.66, (mbxp)Accuracy=60.16) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048)
Mixtral 8x7B | 61,802 tokens/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB-CTS | 99% of FP16 ((OpenOrca)rouge1=45.5989, (OpenOrca)rouge2=23.3526, (OpenOrca)rougeL=30.4608, (gsm8k)Accuracy=73.66, (mbxp)Accuracy=60.16) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048)
Stable Diffusion XL | 29 samples/sec | 8x B200 | SYS-A21GE-NBRT | NVIDIA B200-SXM-180GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val
Stable Diffusion XL | 18 samples/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB-CTS | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val
GPT-J | 21,813 queries/sec | 8x H200 | Cisco UCS C885A M8 | NVIDIA H200-SXM-141GB | 99% of FP32 (72.86%) | 20 s | CNN Dailymail
ResNet-50 | 676,219 queries/sec | 8x H200 | G893-SD1 | NVIDIA H200-SXM-141GB | 76.46% Top1 | 15 ms | ImageNet (224x224)
RetinaNet | 14,589 queries/sec | 8x H200 | AS-4125GS-TNHR2-LCC | NVIDIA H200-SXM-141GB | 0.3755 mAP | 100 ms | OpenImages (800x800)
DLRMv2 | 590,167 queries/sec | 8x H200 | HPE Cray XD670 with Cray ClusterStor | NVIDIA H200-SXM-141GB | 99% of FP32 (AUC=80.31%) | 60 ms | Synthetic Multihot Criteo Dataset
MLPerf™ v5.0 Inference Closed: Llama3.1 405B 99% of FP16, Llama2 70B Interactive 99.9% of FP32, Llama2 70B 99.9% of FP32, Mixtral 8x7B 99% of FP16, Stable Diffusion XL, ResNet-50 v1.5, RetinaNet, RNN-T, RGAT, 3D U-Net 99.9% of FP32 accuracy target, GPT-J 99.9% of FP32 accuracy target, DLRM 99% of FP32 accuracy target: 5.0-0011, 5.0-0033, 5.0-0041, 5.0-0051, 5.0-0053, 5.0-0056, 5.0-0058, 5.0-0060, 5.0-0070, 5.0-0072, 5.0-0074. The MLPerf name and logo are trademarks of MLCommons Association. See https://mlcommons.org/ for more information.
Llama2 70B Max Sequence Length = 1,024
Mixtral 8x7B Max Sequence Length = 2,048
For MLPerf™ data across various scenarios, click here
For MLPerf™ latency constraints, click here
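The TTFT/TPOT pairs in the server table above bound per-request latency. A rough sketch, assuming the usual decomposition total latency ≈ TTFT + TPOT × (output tokens − 1), which is an assumption here rather than a quote from the MLPerf rules:

```python
# Assumed decomposition of request latency into time-to-first-token (TTFT)
# plus time-per-output-token (TPOT) for each subsequent token.

def latency_budget_s(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    return (ttft_ms + tpot_ms * (output_tokens - 1)) / 1000.0

# Llama2 70B Interactive constraint (450 ms TTFT, 40 ms TPOT), 512 output tokens:
print(f"{latency_budget_s(450, 40, 512):.1f} s per request")  # ~20.9 s
```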
LLM Inference Performance of NVIDIA Data Center Products
B200 DeepSeek R1 - Per User
Model | Attention | MoE | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
DeepSeek R1 0528 | TP8 | EP2 | 1,024 | 2,048 | 379 output tokens/sec/user | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.20 | NVIDIA B200 |
Accuracy Evaluation:
Precision FP8 (AA Ref): MMLU Pro = 85 | GPQA Diamond = 81 | LiveCodeBench = 77 | SCICODE = 40 | MATH-500 = 98 | AIME 2024 = 89
Precision FP4: MMLU Pro = 84.2 | GPQA Diamond = 80 | LiveCodeBench = 76.3 | SCICODE = 40.1 | MATH-500 = 98.1 | AIME 2024 = 91.3
More details on Accuracy Evaluation here
Attention: Tensor Parallelism = 8
MoE: Expert Parallelism = 2
Batch Size = 1
Input tokens not included in TPS calculations
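As a hedged illustration of the metric: with batch size 1, per-user throughput is just output tokens divided by end-to-end latency, with prompt tokens excluded. Generating 2,048 output tokens in roughly 5.4 s works out to about 379 tokens/sec/user, in line with the row above (the 5.4 s figure is inferred from the table, not a published measurement).

```python
# Per-user decode throughput as described in the notes above: only generated
# (output) tokens count; input (prompt) tokens are excluded.

def tokens_per_sec_per_user(output_tokens: int, total_latency_s: float) -> float:
    return output_tokens / total_latency_s

# Hypothetical single-user request: 2,048 output tokens in ~5.4 s.
print(f"{tokens_per_sec_per_user(2048, 5.4):.0f} output tokens/sec/user")  # ~379
```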
B200 DeepSeek R1 - Max Throughput
Model | Attention | MoE | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
DeepSeek R1 0528 | TP8 | EP8 | 1,024 | 2,048 | 43,146 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.20 | NVIDIA B200 |
Accuracy Evaluation:
Precision FP8 (AA Ref): MMLU Pro = 85 | GPQA Diamond = 81 | LiveCodeBench = 77 | SCICODE = 40 | MATH-500 = 98 | AIME 2024 = 89
Precision FP4: MMLU Pro = 84.2 | GPQA Diamond = 80 | LiveCodeBench = 76.3 | SCICODE = 40.1 | MATH-500 = 98.1 | AIME 2024 = 91.3
More details on Accuracy Evaluation here
Attention: Tensor Parallelism = 8
MoE: Expert Parallelism = 8
Input tokens not included in TPS calculations
B200 Inference Performance - Max Throughput
Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
Llama v3.1 405B | 1 | 8 | 128 | 128 | 9,185 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
Llama v3.1 405B | 1 | 8 | 128 | 2048 | 10,387 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
Llama v3.1 405B | 1 | 8 | 128 | 4096 | 8,742 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
Llama v3.1 405B | 1 | 8 | 2048 | 128 | 954 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
Llama v3.1 405B | 1 | 8 | 5000 | 500 | 1,332 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
Llama v3.1 405B | 1 | 8 | 500 | 2000 | 9,242 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
Llama v3.1 405B | 1 | 8 | 1000 | 1000 | 7,566 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
Llama v3.1 405B | 1 | 8 | 1000 | 2000 | 7,697 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
Llama v3.1 405B | 1 | 8 | 2048 | 2048 | 6,092 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
Llama v3.1 405B | 1 | 8 | 20000 | 2000 | 962 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
Llama v3.3 70B | 1 | 1 | 128 | 128 | 11,253 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
Llama v3.3 70B | 1 | 1 | 128 | 2048 | 9,925 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
Llama v3.3 70B | 1 | 1 | 128 | 4096 | 6,319 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
Llama v3.3 70B | 1 | 1 | 2048 | 128 | 1,375 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
Llama v3.3 70B | 1 | 1 | 5000 | 500 | 1,488 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
Llama v3.3 70B | 1 | 1 | 500 | 2000 | 7,560 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
Llama v3.3 70B | 1 | 1 | 1000 | 1000 | 6,867 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
Llama v3.3 70B | 1 | 1 | 1000 | 2000 | 6,737 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
Llama v3.3 70B | 1 | 1 | 2048 | 2048 | 4,545 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
Llama v3.3 70B | 1 | 1 | 20000 | 2000 | 581 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
TP: Tensor Parallelism
PP: Pipeline Parallelism
For more information on pipeline parallelism, please read Llama v3.1 405B Blog
Output tokens/second on Llama v3.1 405B is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency)
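Stated as code, that footnote's formula is simply total generated tokens divided by total wall-clock latency across all concurrent requests (the workload numbers below are hypothetical):

```python
# tokens/s = total generated tokens / total latency, inclusive of time to
# generate the first token. Example workload is hypothetical.

def aggregate_output_tokens_per_sec(requests: int, output_tokens_per_request: int,
                                    total_latency_s: float) -> float:
    return requests * output_tokens_per_request / total_latency_s

# e.g. 256 concurrent requests, 2,048 output tokens each, all finishing in 55 s:
print(f"{aggregate_output_tokens_per_sec(256, 2048, 55.0):,.0f} output tokens/sec")
```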
H200 Inference Performance - Max Throughput
Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
Llama v3.1 405B | 1 | 8 | 128 | 128 | 3,800 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 405B | 1 | 8 | 128 | 2048 | 5,661 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 405B | 1 | 8 | 128 | 4096 | 5,167 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 405B | 8 | 1 | 2048 | 128 | 764 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14a | NVIDIA H200 |
Llama v3.1 405B | 1 | 8 | 5000 | 500 | 656 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 405B | 1 | 8 | 500 | 2000 | 4,854 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 405B | 1 | 8 | 1000 | 1000 | 3,332 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 405B | 1 | 8 | 1000 | 2000 | 3,682 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 405B | 1 | 8 | 2048 | 2048 | 3,056 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 405B | 1 | 8 | 20000 | 2000 | 514 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 70B | 1 | 1 | 128 | 128 | 3,658 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 70B | 1 | 1 | 128 | 2048 | 4,351 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 70B | 1 | 4 | 128 | 4096 | 11,525 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 70B | 1 | 1 | 2048 | 128 | 433 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 70B | 1 | 1 | 5000 | 500 | 544 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 70B | 1 | 1 | 500 | 2000 | 3,476 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 70B | 1 | 1 | 1000 | 1000 | 2,727 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 70B | 1 | 1 | 2048 | 2048 | 1,990 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 70B | 1 | 2 | 20000 | 2000 | 618 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 8B | 1 | 1 | 128 | 128 | 28,447 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 8B | 1 | 1 | 128 | 2048 | 23,295 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 8B | 1 | 1 | 128 | 4096 | 17,481 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,531 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,852 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 8B | 1 | 1 | 500 | 2000 | 21,463 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 17,591 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 12,022 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,706 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
Mistral 7B | 1 | 1 | 128 | 128 | 31,938 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mistral 7B | 1 | 1 | 128 | 2048 | 27,409 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mistral 7B | 1 | 1 | 128 | 4096 | 18,505 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mistral 7B | 1 | 1 | 2048 | 128 | 3,834 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mistral 7B | 1 | 1 | 5000 | 500 | 4,042 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mistral 7B | 1 | 1 | 500 | 2000 | 22,355 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mistral 7B | 1 | 1 | 1000 | 1000 | 18,426 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mistral 7B | 1 | 1 | 2048 | 2048 | 12,347 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mistral 7B | 1 | 1 | 20000 | 2000 | 1,823 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x7B | 1 | 1 | 128 | 128 | 17,158 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x7B | 1 | 1 | 128 | 2048 | 15,095 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x7B | 1 | 2 | 128 | 4096 | 21,565 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x7B | 1 | 1 | 2048 | 128 | 2,010 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x7B | 1 | 1 | 5000 | 500 | 2,309 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x7B | 1 | 1 | 500 | 2000 | 12,105 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x7B | 1 | 1 | 1000 | 1000 | 10,371 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x7B | 1 | 2 | 2048 | 2048 | 14,018 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x7B | 1 | 2 | 20000 | 2000 | 2,227 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
Mixtral 8x22B | 1 | 8 | 128 | 128 | 25,179 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200 |
Mixtral 8x22B | 1 | 8 | 128 | 2048 | 32,623 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
Mixtral 8x22B | 1 | 8 | 128 | 4096 | 25,753 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x22B | 1 | 8 | 2048 | 128 | 3,095 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
Mixtral 8x22B | 1 | 8 | 5000 | 500 | 4,209 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
Mixtral 8x22B | 1 | 8 | 500 | 2000 | 27,430 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x22B | 1 | 8 | 1000 | 1000 | 20,097 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200 |
Mixtral 8x22B | 1 | 8 | 2048 | 2048 | 15,799 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200 |
Mixtral 8x22B | 1 | 8 | 20000 | 2000 | 2,897 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200 |
TP: Tensor Parallelism
PP: Pipeline Parallelism
For more information on pipeline parallelism, please read Llama v3.1 405B Blog
Output tokens/second on Llama v3.1 405B is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency)
GH200 Inference Performance - Max Throughput
Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
Llama v3.1 70B | 1 | 1 | 128 | 128 | 3,637 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Llama v3.1 70B | 1 | 4 | 128 | 2048 | 10,358 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96B |
Llama v3.1 70B | 1 | 4 | 128 | 4096 | 6,628 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96B |
Llama v3.1 70B | 1 | 1 | 2048 | 128 | 425 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Llama v3.1 70B | 1 | 1 | 5000 | 500 | 422 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Llama v3.1 70B | 1 | 4 | 500 | 2000 | 9,091 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96B |
Llama v3.1 70B | 1 | 1 | 1000 | 1000 | 1,746 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Llama v3.1 70B | 1 | 4 | 2048 | 2048 | 4,865 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96B |
Llama v3.1 70B | 1 | 4 | 20000 | 2000 | 959 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96B |
Llama v3.1 8B | 1 | 1 | 128 | 128 | 29,853 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Llama v3.1 8B | 1 | 1 | 128 | 2048 | 21,770 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Llama v3.1 8B | 1 | 1 | 128 | 4096 | 14,190 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,844 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,933 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Llama v3.1 8B | 1 | 1 | 500 | 2000 | 17,137 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 16,483 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 10,266 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,560 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Mistral 7B | 1 | 1 | 128 | 128 | 32,498 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Mistral 7B | 1 | 1 | 128 | 2048 | 23,337 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Mistral 7B | 1 | 1 | 128 | 4096 | 15,018 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Mistral 7B | 1 | 1 | 2048 | 128 | 3,813 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Mistral 7B | 1 | 1 | 5000 | 500 | 3,950 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Mistral 7B | 1 | 1 | 500 | 2000 | 18,556 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Mistral 7B | 1 | 1 | 1000 | 1000 | 17,252 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Mistral 7B | 1 | 1 | 2048 | 2048 | 10,756 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Mistral 7B | 1 | 1 | 20000 | 2000 | 1,601 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Mixtral 8x7B | 1 | 1 | 128 | 128 | 16,859 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Mixtral 8x7B | 1 | 1 | 128 | 2048 | 11,120 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Mixtral 8x7B | 1 | 4 | 128 | 4096 | 30,066 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96B |
Mixtral 8x7B | 1 | 1 | 2048 | 128 | 1,994 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Mixtral 8x7B | 1 | 1 | 5000 | 500 | 2,078 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Mixtral 8x7B | 1 | 1 | 500 | 2000 | 9,193 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Mixtral 8x7B | 1 | 1 | 1000 | 1000 | 8,849 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Mixtral 8x7B | 1 | 1 | 2048 | 2048 | 5,545 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
Mixtral 8x7B | 1 | 1 | 20000 | 2000 | 861 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96B |
TP: Tensor Parallelism
PP: Pipeline Parallelism
H100 Inference Performance - Max Throughput
Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
Llama v3.1 70B | 1 | 1 | 128 | 128 | 3,191 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB |
Llama v3.1 70B | 1 | 2 | 128 | 2048 | 5,822 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB |
Llama v3.1 70B | 1 | 4 | 128 | 4096 | 8,210 output tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB |
Llama v3.1 70B | 1 | 2 | 2048 | 128 | 748 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB |
Llama v3.1 70B | 1 | 2 | 5000 | 500 | 867 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB |
Llama v3.1 70B | 1 | 4 | 500 | 2000 | 10,278 output tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB |
Llama v3.1 70B | 1 | 2 | 1000 | 1000 | 4,191 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB |
Llama v3.1 70B | 1 | 4 | 2048 | 2048 | 5,640 output tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB |
Llama v3.1 70B | 1 | 4 | 20000 | 2000 | 911 output tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB |
Llama v3.1 8B | 1 | 1 | 128 | 128 | 27,569 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB |
Llama v3.1 8B | 1 | 1 | 128 | 2048 | 22,004 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB |
Llama v3.1 8B | 1 | 1 | 128 | 4096 | 13,640 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB |
Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,495 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB |
Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,371 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB |
Llama v3.1 8B | 1 | 1 | 500 | 2000 | 17,794 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB |
Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 15,270 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB |
Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 9,654 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB |
Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,341 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB |
TP: Tensor Parallelism
PP: Pipeline Parallelism
L40S Inference Performance - Max Throughput
Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
Llama v3.1 8B | 1 | 1 | 128 | 128 | 9,105 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
Llama v3.1 8B | 1 | 1 | 128 | 2048 | 5,366 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
Llama v3.1 8B | 1 | 1 | 128 | 4096 | 3,026 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
Llama v3.1 8B | 1 | 1 | 2048 | 128 | 1,067 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
Llama v3.1 8B | 1 | 1 | 5000 | 500 | 981 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
Llama v3.1 8B | 1 | 1 | 500 | 2000 | 4,274 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 4,055 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 2,225 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 328 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
Mixtral 8x7B | 4 | 1 | 128 | 128 | 15,278 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
Mixtral 8x7B | 2 | 2 | 128 | 2048 | 9,087 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
Mixtral 8x7B | 1 | 4 | 128 | 4096 | 5,736 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S |
Mixtral 8x7B | 4 | 1 | 2048 | 128 | 2,098 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
Mixtral 8x7B | 2 | 2 | 5000 | 500 | 1,558 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
Mixtral 8x7B | 2 | 2 | 500 | 2000 | 7,974 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
Mixtral 8x7B | 2 | 2 | 1000 | 1000 | 6,579 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
Mixtral 8x7B | 2 | 2 | 2048 | 2048 | 4,217 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S |
TP: Tensor Parallelism
PP: Pipeline Parallelism
Inference Performance of NVIDIA Data Center Products
B200 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
---|---|---|---|---|---|---|---|---|---|---|---
ResNet-50v1.5 | 8 | 18,517 images/sec | 39 images/sec/watt | 0.43 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
ResNet-50v1.5 | 128 | 57,280 images/sec | 58 images/sec/watt | 2.23 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
EfficientNet-B0 | 8 | 10,861 images/sec | 30 images/sec/watt | 0.74 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
EfficientNet-B0 | 128 | 28,889 images/sec | 41 images/sec/watt | 4.43 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
EfficientNet-B4 | 8 | 2,634 images/sec | 5 images/sec/watt | 3.04 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
EfficientNet-B4 | 128 | 4,101 images/sec | 5 images/sec/watt | 31.21 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
HF Swin Base | 8 | 6,062 samples/sec | 14 samples/sec/watt | 1.32 | 1x B200 | DGX B200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF Swin Base | 32 | 11,319 samples/sec | 19 samples/sec/watt | 2.83 | 1x B200 | DGX B200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF Swin Large | 8 | 4,742 samples/sec | 10 samples/sec/watt | 1.69 | 1x B200 | DGX B200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF Swin Large | 32 | 7,479 samples/sec | 11 samples/sec/watt | 4.28 | 1x B200 | DGX B200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF ViT Base | 8 | 11,267 samples/sec | 22 samples/sec/watt | 0.71 | 1x B200 | DGX B200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF ViT Base | 64 | 21,688 samples/sec | 29 samples/sec/watt | 2.95 | 1x B200 | DGX B200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF ViT Large | 8 | 5,171 samples/sec | 8 samples/sec/watt | 1.55 | 1x B200 | DGX B200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF ViT Large | 64 | 8,485 samples/sec | 10 samples/sec/watt | 7.54 | 1x B200 | DGX B200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA B200
QuartzNet | 8 | 7,787 samples/sec | 24 samples/sec/watt | 1.03 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
QuartzNet | 128 | 25,034 samples/sec | 47 samples/sec/watt | 5.11 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
RetinaNet-RN34 | 8 | 3,318 images/sec | 8 images/sec/watt | 2.41 | 1x B200 | DGX B200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256
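A quick sanity check on this table: the reported latency appears to equal batch size divided by throughput (an observation about the B200 rows above, not documented methodology). The snippet below verifies two rows:

```python
# Check latency (ms) ≈ 1000 * batch_size / throughput for rows taken from
# the B200 table above.

rows = [
    ("ResNet-50v1.5", 8, 18_517),    # table reports 0.43 ms
    ("ResNet-50v1.5", 128, 57_280),  # table reports 2.23 ms
]
for name, batch, images_per_sec in rows:
    latency_ms = 1000.0 * batch / images_per_sec
    print(f"{name} (batch {batch}): {latency_ms:.2f} ms")
```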
H200 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
---|---|---|---|---|---|---|---|---|---|---|---
ResNet-50v1.5 | 8 | 21,253 images/sec | 67 images/sec/watt | 0.38 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
ResNet-50v1.5 | 128 | 65,328 images/sec | 107 images/sec/watt | 1.96 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
EfficientNet-B0 | 8 | 17,243 images/sec | 77 images/sec/watt | 0.46 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
EfficientNet-B0 | 128 | 57,387 images/sec | 122 images/sec/watt | 2.23 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
EfficientNet-B4 | 8 | 4,613 images/sec | 14 images/sec/watt | 1.73 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
EfficientNet-B4 | 128 | 9,018 images/sec | 15 images/sec/watt | 14.19 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
HF Swin Base | 8 | 5,040 samples/sec | 11 samples/sec/watt | 1.59 | 1x H200 | DGX H200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA H200
HF Swin Base | 32 | 8,175 samples/sec | 12 samples/sec/watt | 3.91 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
HF Swin Large | 8 | 3,387 samples/sec | 6 samples/sec/watt | 2.36 | 1x H200 | DGX H200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA H200
HF Swin Large | 32 | 4,720 samples/sec | 7 samples/sec/watt | 6.78 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
HF ViT Base | 8 | 8,847 samples/sec | 19 samples/sec/watt | 0.9 | 1x H200 | DGX H200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA H200
HF ViT Base | 64 | 15,611 samples/sec | 23 samples/sec/watt | 4.1 | 1x H200 | DGX H200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA H200
HF ViT Large | 8 | 3,667 samples/sec | 6 samples/sec/watt | 2.18 | 1x H200 | DGX H200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA H200
HF ViT Large | 64 | 5,459 samples/sec | 8 samples/sec/watt | 11.72 | 1x H200 | DGX H200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA H200
QuartzNet | 8 | 7,012 samples/sec | 25 samples/sec/watt | 1.14 | 1x H200 | DGX H200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA H200
QuartzNet | 128 | 34,359 samples/sec | 90 samples/sec/watt | 3.73 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
RetinaNet-RN34 | 8 | 3,025 images/sec | 9 images/sec/watt | 2.64 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256
GH200 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
---|---|---|---|---|---|---|---|---|---|---|---
ResNet-50v1.5 | 8 | 21,420 images/sec | 61 images/sec/watt | 0.37 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
ResNet-50v1.5 | 128 | 66,276 images/sec | 105 images/sec/watt | 1.93 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
EfficientNet-B0 | 8 | 17,198 images/sec | 68 images/sec/watt | 0.47 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
EfficientNet-B0 | 128 | 57,736 images/sec | 116 images/sec/watt | 2.22 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
EfficientNet-B4 | 8 | 4,622 images/sec | 13 images/sec/watt | 1.73 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
EfficientNet-B4 | 128 | 9,015 images/sec | 15 images/sec/watt | 14.2 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF Swin Base | 8 | 5,023 samples/sec | 11 samples/sec/watt | 1.59 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF Swin Base | 32 | 8,046 samples/sec | 12 samples/sec/watt | 3.98 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF Swin Large | 8 | 3,351 samples/sec | 6 samples/sec/watt | 2.39 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF Swin Large | 32 | 4,502 samples/sec | 7 samples/sec/watt | 7.11 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF ViT Base | 8 | 8,746 samples/sec | 18 samples/sec/watt | 0.91 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF ViT Base | 64 | 15,167 samples/sec | 23 samples/sec/watt | 4.22 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF ViT Large | 8 | 3,360 samples/sec | 6 samples/sec/watt | 2.38 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF ViT Large | 64 | 5,165 samples/sec | 8 samples/sec/watt | 12.39 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
QuartzNet | 8 | 7,038 samples/sec | 24 samples/sec/watt | 1.14 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
QuartzNet | 128 | 34,280 samples/sec | 82 samples/sec/watt | 3.73 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
RetinaNet-RN34 | 8 | 2,955 images/sec | 5 images/sec/watt | 2.71 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256
H100 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
---|---|---|---|---|---|---|---|---|---|---|---
ResNet-50v1.5 | 8 | 21,912 images/sec | 65 images/sec/watt | 0.37 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
ResNet-50v1.5 | 128 | 56,829 images/sec | 119 images/sec/watt | 2.25 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
EfficientNet-B0 | 8 | 17,208 images/sec | 63 images/sec/watt | 0.46 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
EfficientNet-B0 | 128 | 52,455 images/sec | 191 images/sec/watt | 2.44 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
EfficientNet-B4 | 8 | 4,419 images/sec | 13 images/sec/watt | 1.81 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
EfficientNet-B4 | 128 | 8,701 images/sec | 14 images/sec/watt | 14.71 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF Swin Base | 8 | 5,124 samples/sec | 9 samples/sec/watt | 1.56 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF Swin Base | 32 | 7,348 samples/sec | 11 samples/sec/watt | 4.35 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF Swin Large | 8 | 3,147 samples/sec | 6 samples/sec/watt | 2.54 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF Swin Large | 32 | 4,392 samples/sec | 6 samples/sec/watt | 7.29 | 1x H100 | DGX H100 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF ViT Base | 8 | 8,494 samples/sec | 17 samples/sec/watt | 0.94 | 1x H100 | DGX H100 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF ViT Base | 64 | 14,968 samples/sec | 22 samples/sec/watt | 4.28 | 1x H100 | DGX H100 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF ViT Large | 8 | 3,399 samples/sec | 5 samples/sec/watt | 2.35 | 1x H100 | DGX H100 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF ViT Large | 64 | 5,195 samples/sec | 8 samples/sec/watt | 12.32 | 1x H100 | DGX H100 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
QuartzNet | 8 | 7,002 samples/sec | 23 samples/sec/watt | 1.14 | 1x H100 | DGX H100 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
QuartzNet | 128 | 34,881 samples/sec | 95 samples/sec/watt | 3.67 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
RetinaNet-RN34 | 8 | 2,764 images/sec | 15 images/sec/watt | 2.89 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256
L40S Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
---|---|---|---|---|---|---|---|---|---|---|---
ResNet-50v1.5 | 8 | 23,025 images/sec | 71 images/sec/watt | 0.35 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
ResNet-50v1.5 | 32 | 29,073 images/sec | 84 images/sec/watt | 4.4 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
EfficientDet-D0 | 8 | 4,640 images/sec | 16 images/sec/watt | 1.72 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
EfficientNet-B0 | 8 | 20,504 images/sec | 96 images/sec/watt | 0.39 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
EfficientNet-B0 | 32 | 42,553 images/sec | 127 images/sec/watt | 3.01 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
EfficientNet-B4 | 8 | 5,135 images/sec | 17 images/sec/watt | 1.56 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
EfficientNet-B4 | 16 | 4,066 images/sec | 12 images/sec/watt | 31.48 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
HF Swin Base | 8 | 3,812 samples/sec | 11 samples/sec/watt | 2.1 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
HF Swin Base | 16 | 4,236 samples/sec | 12 samples/sec/watt | 7.55 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
HF Swin Large | 8 | 1,939 samples/sec | 6 samples/sec/watt | 4.12 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA L40S
HF Swin Large | 16 | 2,027 samples/sec | 6 samples/sec/watt | 15.79 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
HF ViT Base | 8 | 6,247 samples/sec | 18 samples/sec/watt | 1.28 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
HF ViT Large | 8 | 1,979 samples/sec | 6 samples/sec/watt | 4.04 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
QuartzNet | 8 | 7,570 samples/sec | 31 samples/sec/watt | 1.06 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA L40S
QuartzNet | 128 | 22,478 samples/sec | 65 samples/sec/watt | 5.69 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
RetinaNet-RN34 | 8 | 1,477 images/sec | 6 images/sec/watt | 5.42 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256
View More Performance Data
Training to Convergence
Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology to test whether AI systems are ready to be deployed in the field to deliver meaningful results.
Learn More
AI Pipeline
NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.
Learn More