AI Inference

Inference can be deployed in many ways, depending on the use case. Offline data processing is best done at larger batch sizes, which deliver optimal GPU utilization and throughput. Increasing throughput, however, also tends to increase latency. Generative AI and Large Language Model (LLM) deployments aim to deliver great experiences by lowering latency, so developers and infrastructure managers need to strike a balance between throughput and latency that delivers a great user experience at the best possible throughput while containing deployment costs.


When deploying LLMs at scale, a typical way to balance these concerns is to set a time-to-first-token limit and optimize throughput within that limit. The data presented in the Large Language Model Low Latency section show the best throughput at a time limit of one second, which delivers high throughput at low latency for most users while making efficient use of compute resources.
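
To make this tuning loop concrete, here is a minimal sketch of a time-to-first-token-bounded search over load levels. The `measure_ttft_ms` and `measure_throughput` callbacks are hypothetical stand-ins for a real load generator driving a deployed LLM endpoint; the sketch only illustrates the balancing logic described above, not NVIDIA's benchmarking harness.

```python
# Minimal sketch of TTFT-bounded throughput tuning. The two callbacks
# are hypothetical stand-ins for a real load generator driving a
# deployed LLM endpoint; this is not NVIDIA's benchmarking harness.

TTFT_BUDGET_MS = 1000.0  # the one-second time-to-first-token limit


def best_concurrency(measure_ttft_ms, measure_throughput, max_concurrency=512):
    """Return (concurrency, tokens/sec) giving the highest throughput
    whose measured TTFT stays within the budget."""
    best = (1, 0.0)
    concurrency = 1
    while concurrency <= max_concurrency:
        ttft = measure_ttft_ms(concurrency)  # e.g. p99 TTFT at this load
        if ttft > TTFT_BUDGET_MS:
            break  # latency budget exhausted; stop raising the load
        throughput = measure_throughput(concurrency)  # aggregate tokens/sec
        if throughput > best[1]:
            best = (concurrency, throughput)
        concurrency *= 2  # sweep load levels geometrically
    return best
```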


Click here to view other performance data.

MLPerf Inference v5.0 Performance Benchmarks

Offline Scenario, Closed Division

Network | Throughput | GPU | Server | GPU Version | Target Accuracy | Dataset
Llama3.1 405B | 13,886 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 | NVIDIA GB200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama3.1 405B | 1,538 tokens/sec | 8x B200 | SYS-421GE-NBRT-LCC | NVIDIA B200-SXM-180GB | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama3.1 405B | 574 tokens/sec | 8x H200 | Cisco UCS C885A M8 | NVIDIA H200-SXM-141GB | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama2 70B | 98,858 tokens/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200-SXM-180GB | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | OpenOrca (max_seq_len=1024)
Llama2 70B | 35,453 tokens/sec | 8x H200 | ThinkSystem SR680a V3 | NVIDIA H200-SXM-141GB | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | OpenOrca (max_seq_len=1024)
Mixtral 8x7B | 128,795 tokens/sec | 8x B200 | SYS-421GE-NBRT-LCC | NVIDIA B200-SXM-180GB | 99% of FP16 ((OpenOrca) rouge1=45.5989, (OpenOrca) rouge2=23.3526, (OpenOrca) rougeL=30.4608, (gsm8k) Accuracy=73.66, (mbxp) Accuracy=60.16) | OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048)
Mixtral 8x7B | 63,515 tokens/sec | 8x H200 | ThinkSystem SR780a V3 | NVIDIA H200-SXM-141GB | 99% of FP16 ((OpenOrca) rouge1=45.5989, (OpenOrca) rouge2=23.3526, (OpenOrca) rougeL=30.4608, (gsm8k) Accuracy=73.66, (mbxp) Accuracy=60.16) | OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048)
Stable Diffusion XL | 30 samples/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200-SXM-180GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val
Stable Diffusion XL | 19 samples/sec | 8x H200 | AS-4125GS-TNHR2-LCC | NVIDIA H200-SXM-141GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val
RGAT | 450,175 samples/sec | 8x H200 | ThinkSystem SR780a V3 | NVIDIA H200-SXM-141GB | 99% of FP32 (72.86%) | IGBH
GPT-J | 21,626 tokens/sec | 8x H200 | ThinkSystem SR780a V3 | NVIDIA H200-SXM-141GB | 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | CNN Dailymail (v3.0.0, max_seq_len=2048)
ResNet-50 | 773,300 samples/sec | 8x H200 | ThinkSystem SR680a V3 | NVIDIA H200-SXM-141GB | 76.46% Top1 | ImageNet (224x224)
RetinaNet | 15,200 samples/sec | 8x H200 | AS-4125GS-TNHR2-LCC | NVIDIA H200-SXM-141GB | 0.3755 mAP | OpenImages (800x800)
DLRMv2 | 654,489 samples/sec | 8x H200 | HPE Cray XD670 with Cray ClusterStor | NVIDIA H200-SXM-141GB | 99% of FP32 (AUC=80.31%) | Synthetic Multihot Criteo Dataset
3D-UNET | 55 samples/sec | 8x H200 | HPE Cray XD670 with Cray ClusterStor | NVIDIA H200-SXM-141GB | 99.9% of FP32 (0.86330 mean DICE score) | KiTS 2019

Server Scenario, Closed Division

Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset
Llama3.1 405B | 8,850 tokens/sec | 72x GB200 | NVIDIA GB200 NVL72 | NVIDIA GB200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama3.1 405B | 1,080 tokens/sec | 8x B200 | SYS-A21GE-NBRT | NVIDIA B200-SXM-180GB | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama3.1 405B | 294 tokens/sec | 8x H200 | Cisco UCS C885A M8 | NVIDIA H200-SXM-141GB | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama2 70B Interactive | 62,266 tokens/sec | 8x B200 | SYS-A21GE-NBRT | NVIDIA B200-SXM-180GB | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024)
Llama2 70B Interactive | 20,235 tokens/sec | 8x H200 | G893-SD1 | NVIDIA H200-SXM-141GB | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024)
Llama2 70B | 98,443 tokens/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200-SXM-180GB | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024)
Llama2 70B | 33,072 tokens/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB-CTS | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024)
Mixtral 8x7B | 129,047 tokens/sec | 8x B200 | SYS-421GE-NBRT-LCC | NVIDIA B200-SXM-180GB | 99% of FP16 ((OpenOrca) rouge1=45.5989, (OpenOrca) rouge2=23.3526, (OpenOrca) rougeL=30.4608, (gsm8k) Accuracy=73.66, (mbxp) Accuracy=60.16) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048)
Mixtral 8x7B | 61,802 tokens/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB-CTS | 99% of FP16 ((OpenOrca) rouge1=45.5989, (OpenOrca) rouge2=23.3526, (OpenOrca) rougeL=30.4608, (gsm8k) Accuracy=73.66, (mbxp) Accuracy=60.16) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048)
Stable Diffusion XL | 29 samples/sec | 8x B200 | SYS-A21GE-NBRT | NVIDIA B200-SXM-180GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val
Stable Diffusion XL | 18 samples/sec | 8x H200 | NVIDIA H200 | NVIDIA H200-SXM-141GB-CTS | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val
GPT-J | 21,813 queries/sec | 8x H200 | Cisco UCS C885A M8 | NVIDIA H200-SXM-141GB | 99% of FP32 (72.86%) | 20 s | CNN Dailymail
ResNet-50 | 676,219 queries/sec | 8x H200 | G893-SD1 | NVIDIA H200-SXM-141GB | 76.46% Top1 | 15 ms | ImageNet (224x224)
RetinaNet | 14,589 queries/sec | 8x H200 | AS-4125GS-TNHR2-LCC | NVIDIA H200-SXM-141GB | 0.3755 mAP | 100 ms | OpenImages (800x800)
DLRMv2 | 590,167 queries/sec | 8x H200 | HPE Cray XD670 with Cray ClusterStor | NVIDIA H200-SXM-141GB | 99% of FP32 (AUC=80.31%) | 60 ms | Synthetic Multihot Criteo Dataset

MLPerf™ v5.0 Inference Closed: Llama3.1 405B 99% of FP16, Llama2 70B Interactive 99.9% of FP32, Llama2 70B 99.9% of FP32, Mixtral 8x7B 99% of FP16, Stable Diffusion XL, ResNet-50 v1.5, RetinaNet, RNN-T, RGAT, 3D U-Net 99.9% of FP32 accuracy target, GPT-J 99.9% of FP32 accuracy target, DLRM 99% of FP32 accuracy target: 5.0-0011, 5.0-0033, 5.0-0041, 5.0-0051, 5.0-0053, 5.0-0056, 5.0-0058, 5.0-0060, 5.0-0070, 5.0-0072, 5.0-0074. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
Llama2 70B Max Sequence Length = 1,024
Mixtral 8x7B Max Sequence Length = 2,048
For MLPerf™ data across various scenarios, click here
For MLPerf™ latency constraints, click here

LLM Inference Performance of NVIDIA Data Center Products

B200 DeepSeek R1 - Per User

Model | Attention | MoE | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
DeepSeek R1 0528 | TP8 | EP2 | 1,024 | 2,048 | 379 output tokens/sec/user | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.20 | NVIDIA B200

Accuracy Evaluation:
Precision FP8 (AA Ref): MMLU Pro = 85 | GPQA Diamond = 81 | LiveCodeBench = 77 | SCICODE = 40 | MATH-500 = 98 | AIME 2024 = 89
Precision FP4: MMLU Pro = 84.2 | GPQA Diamond = 80 | LiveCodeBench = 76.3 | SCICODE = 40.1 | MATH-500 = 98.1 | AIME 2024 = 91.3
More details on Accuracy Evaluation here
Attention: Tensor Parallelism = 8
MoE: Expert Parallelism = 2
Batch Size = 1
Input tokens not included in TPS calculations
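
As a worked example of that convention, the per-user rate divides only the generated tokens by the end-to-end time; the latency below is an illustrative assumption, not a value from the table:

```python
# Per-user TPS counts only generated (output) tokens, per the note above.
# The end-to-end latency is an assumed value for illustration, not a
# published measurement.
output_tokens = 2048             # output length from the table
end_to_end_latency_s = 5.4       # assumed
tps_per_user = output_tokens / end_to_end_latency_s
print(f"{tps_per_user:.0f} output tokens/sec/user")  # ~379 with these inputs
```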

B200 DeepSeek R1 - Max Throughput

Model | Attention | MoE | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
DeepSeek R1 0528 | TP8 | EP8 | 1,024 | 2,048 | 43,146 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.20 | NVIDIA B200

Accuracy Evaluation:
Precision FP8 (AA Ref): MMLU Pro = 85 | GPQA Diamond = 81 | LiveCodeBench = 77 | SCICODE = 40 | MATH-500 = 98 | AIME 2024 = 89
Precision FP4: MMLU Pro = 84.2 | GPQA Diamond = 80 | LiveCodeBench = 76.3 | SCICODE = 40.1 | MATH-500 = 98.1 | AIME 2024 = 91.3
More details on Accuracy Evaluation here
Attention: Tensor Parallelism = 8
MoE: Expert Parallelism = 8
Input tokens not included in TPS calculations

B200 Inference Performance - Max Throughput

Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
Llama v3.1 405B | 1 | 8 | 128 | 128 | 9,185 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200
Llama v3.1 405B | 1 | 8 | 128 | 2048 | 10,387 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200
Llama v3.1 405B | 1 | 8 | 128 | 4096 | 8,742 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200
Llama v3.1 405B | 1 | 8 | 2048 | 128 | 954 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200
Llama v3.1 405B | 1 | 8 | 5000 | 500 | 1,332 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200
Llama v3.1 405B | 1 | 8 | 500 | 2000 | 9,242 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200
Llama v3.1 405B | 1 | 8 | 1000 | 1000 | 7,566 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200
Llama v3.1 405B | 1 | 8 | 1000 | 2000 | 7,697 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200
Llama v3.1 405B | 1 | 8 | 2048 | 2048 | 6,092 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200
Llama v3.1 405B | 1 | 8 | 20000 | 2000 | 962 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200
Llama v3.3 70B | 1 | 1 | 128 | 128 | 11,253 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200
Llama v3.3 70B | 1 | 1 | 128 | 2048 | 9,925 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200
Llama v3.3 70B | 1 | 1 | 128 | 4096 | 6,319 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200
Llama v3.3 70B | 1 | 1 | 2048 | 128 | 1,375 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200
Llama v3.3 70B | 1 | 1 | 5000 | 500 | 1,488 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200
Llama v3.3 70B | 1 | 1 | 500 | 2000 | 7,560 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200
Llama v3.3 70B | 1 | 1 | 1000 | 1000 | 6,867 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200
Llama v3.3 70B | 1 | 1 | 1000 | 2000 | 6,737 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200
Llama v3.3 70B | 1 | 1 | 2048 | 2048 | 4,545 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200
Llama v3.3 70B | 1 | 1 | 20000 | 2000 | 581 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200

TP: Tensor Parallelism
PP: Pipeline Parallelism
For more information on pipeline parallelism, please read the Llama v3.1 405B Blog
Output tokens/second on Llama v3.1 405B is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency)
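
Put as a worked sketch, the reported figure divides all generated tokens by total wall-clock time, first token included; the request count and timing below are assumed values for illustration:

```python
# tokens/s = total generated tokens / total latency (time to first token
# included), per the note above. Both inputs are assumed values.
concurrent_requests = 64     # assumed number of in-flight requests
output_len = 2048            # tokens generated per request
total_latency_s = 12.6       # assumed wall-clock time for the whole batch
throughput = concurrent_requests * output_len / total_latency_s
print(f"{throughput:,.0f} output tokens/sec")
```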

H200 Inference Performance - Max Throughput

Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
Llama v3.1 405B | 1 | 8 | 128 | 128 | 3,800 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 128 | 2048 | 5,661 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 128 | 4096 | 5,167 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 405B | 8 | 1 | 2048 | 128 | 764 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14a | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 5000 | 500 | 656 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 500 | 2000 | 4,854 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 1000 | 1000 | 3,332 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 1000 | 2000 | 3,682 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 2048 | 2048 | 3,056 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 20000 | 2000 | 514 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 70B | 1 | 1 | 128 | 128 | 3,658 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 70B | 1 | 1 | 128 | 2048 | 4,351 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 70B | 1 | 4 | 128 | 4096 | 11,525 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 70B | 1 | 1 | 2048 | 128 | 433 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 70B | 1 | 1 | 5000 | 500 | 544 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 70B | 1 | 1 | 500 | 2000 | 3,476 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 70B | 1 | 1 | 1000 | 1000 | 2,727 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 70B | 1 | 1 | 2048 | 2048 | 1,990 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 70B | 1 | 2 | 20000 | 2000 | 618 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 128 | 128 | 28,447 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 128 | 2048 | 23,295 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 128 | 4096 | 17,481 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,531 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,852 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 500 | 2000 | 21,463 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 17,591 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 12,022 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,706 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Mistral 7B | 1 | 1 | 128 | 128 | 31,938 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200
Mistral 7B | 1 | 1 | 128 | 2048 | 27,409 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200
Mistral 7B | 1 | 1 | 128 | 4096 | 18,505 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200
Mistral 7B | 1 | 1 | 2048 | 128 | 3,834 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200
Mistral 7B | 1 | 1 | 5000 | 500 | 4,042 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200
Mistral 7B | 1 | 1 | 500 | 2000 | 22,355 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200
Mistral 7B | 1 | 1 | 1000 | 1000 | 18,426 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200
Mistral 7B | 1 | 1 | 2048 | 2048 | 12,347 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200
Mistral 7B | 1 | 1 | 20000 | 2000 | 1,823 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200
Mixtral 8x7B | 1 | 1 | 128 | 128 | 17,158 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200
Mixtral 8x7B | 1 | 1 | 128 | 2048 | 15,095 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200
Mixtral 8x7B | 1 | 2 | 128 | 4096 | 21,565 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200
Mixtral 8x7B | 1 | 1 | 2048 | 128 | 2,010 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200
Mixtral 8x7B | 1 | 1 | 5000 | 500 | 2,309 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200
Mixtral 8x7B | 1 | 1 | 500 | 2000 | 12,105 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200
Mixtral 8x7B | 1 | 1 | 1000 | 1000 | 10,371 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200
Mixtral 8x7B | 1 | 2 | 2048 | 2048 | 14,018 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200
Mixtral 8x7B | 1 | 2 | 20000 | 2000 | 2,227 output tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 128 | 128 | 25,179 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 128 | 2048 | 32,623 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 128 | 4096 | 25,753 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 2048 | 128 | 3,095 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 5000 | 500 | 4,209 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 500 | 2000 | 27,430 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 1000 | 1000 | 20,097 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.15.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 2048 | 2048 | 15,799 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA H200
Mixtral 8x22B | 1 | 8 | 20000 | 2000 | 2,897 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.14.0 | NVIDIA H200

TP: Tensor Parallelism
PP: Pipeline Parallelism
For more information on pipeline parallelism, please read the Llama v3.1 405B Blog
Output tokens/second on Llama v3.1 405B is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency)

GH200 Inference Performance - Max Throughput

Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
Llama v3.1 70B | 1 | 1 | 128 | 128 | 3,637 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Llama v3.1 70B | 1 | 4 | 128 | 2048 | 10,358 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB
Llama v3.1 70B | 1 | 4 | 128 | 4096 | 6,628 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB
Llama v3.1 70B | 1 | 1 | 2048 | 128 | 425 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Llama v3.1 70B | 1 | 1 | 5000 | 500 | 422 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Llama v3.1 70B | 1 | 4 | 500 | 2000 | 9,091 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB
Llama v3.1 70B | 1 | 1 | 1000 | 1000 | 1,746 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Llama v3.1 70B | 1 | 4 | 2048 | 2048 | 4,865 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB
Llama v3.1 70B | 1 | 4 | 20000 | 2000 | 959 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB
Llama v3.1 8B | 1 | 1 | 128 | 128 | 29,853 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Llama v3.1 8B | 1 | 1 | 128 | 2048 | 21,770 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Llama v3.1 8B | 1 | 1 | 128 | 4096 | 14,190 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,844 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,933 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Llama v3.1 8B | 1 | 1 | 500 | 2000 | 17,137 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 16,483 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 10,266 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,560 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Mistral 7B | 1 | 1 | 128 | 128 | 32,498 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Mistral 7B | 1 | 1 | 128 | 2048 | 23,337 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Mistral 7B | 1 | 1 | 128 | 4096 | 15,018 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Mistral 7B | 1 | 1 | 2048 | 128 | 3,813 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Mistral 7B | 1 | 1 | 5000 | 500 | 3,950 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Mistral 7B | 1 | 1 | 500 | 2000 | 18,556 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Mistral 7B | 1 | 1 | 1000 | 1000 | 17,252 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Mistral 7B | 1 | 1 | 2048 | 2048 | 10,756 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Mistral 7B | 1 | 1 | 20000 | 2000 | 1,601 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Mixtral 8x7B | 1 | 1 | 128 | 128 | 16,859 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Mixtral 8x7B | 1 | 1 | 128 | 2048 | 11,120 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Mixtral 8x7B | 1 | 4 | 128 | 4096 | 30,066 output tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.13.0 | NVIDIA GH200 96GB
Mixtral 8x7B | 1 | 1 | 2048 | 128 | 1,994 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Mixtral 8x7B | 1 | 1 | 5000 | 500 | 2,078 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Mixtral 8x7B | 1 | 1 | 500 | 2000 | 9,193 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Mixtral 8x7B | 1 | 1 | 1000 | 1000 | 8,849 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Mixtral 8x7B | 1 | 1 | 2048 | 2048 | 5,545 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB
Mixtral 8x7B | 1 | 1 | 20000 | 2000 | 861 output tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.17.0 | NVIDIA GH200 96GB

TP: Tensor Parallelism
PP: Pipeline Parallelism

H100 Inference Performance - Max Throughput

Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
Llama v3.1 70B | 1 | 1 | 128 | 128 | 3,191 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB
Llama v3.1 70B | 1 | 2 | 128 | 2048 | 5,822 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB
Llama v3.1 70B | 1 | 4 | 128 | 4096 | 8,210 output tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB
Llama v3.1 70B | 1 | 2 | 2048 | 128 | 748 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB
Llama v3.1 70B | 1 | 2 | 5000 | 500 | 867 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB
Llama v3.1 70B | 1 | 4 | 500 | 2000 | 10,278 output tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB
Llama v3.1 70B | 1 | 2 | 1000 | 1000 | 4,191 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB
Llama v3.1 70B | 1 | 4 | 2048 | 2048 | 5,640 output tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB
Llama v3.1 70B | 1 | 4 | 20000 | 2000 | 911 output tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB
Llama v3.1 8B | 1 | 1 | 128 | 128 | 27,569 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB
Llama v3.1 8B | 1 | 1 | 128 | 2048 | 22,004 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB
Llama v3.1 8B | 1 | 1 | 128 | 4096 | 13,640 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB
Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,495 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB
Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,371 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB
Llama v3.1 8B | 1 | 1 | 500 | 2000 | 17,794 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB
Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 15,270 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB
Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 9,654 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB
Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,341 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.19.0 | H100-SXM5-80GB

TP: Tensor Parallelism
PP: Pipeline Parallelism

L40S Inference Performance - Max Throughput

Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
Llama v3.1 8B | 1 | 1 | 128 | 128 | 9,105 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 128 | 2048 | 5,366 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 128 | 4096 | 3,026 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 2048 | 128 | 1,067 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 5000 | 500 | 981 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 500 | 2000 | 4,274 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 4,055 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 2,225 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 328 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S
Mixtral 8x7B | 4 | 1 | 128 | 128 | 15,278 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 2 | 2 | 128 | 2048 | 9,087 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 1 | 4 | 128 | 4096 | 5,736 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.17.0 | NVIDIA L40S
Mixtral 8x7B | 4 | 1 | 2048 | 128 | 2,098 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 2 | 2 | 5000 | 500 | 1,558 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 2 | 2 | 500 | 2000 | 7,974 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 2 | 2 | 1000 | 1000 | 6,579 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S
Mixtral 8x7B | 2 | 2 | 2048 | 2048 | 4,217 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.15.0 | NVIDIA L40S

TP: Tensor Parallelism
PP: Pipeline Parallelism

Inference Performance of NVIDIA Data Center Products

B200 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 8 | 18,517 images/sec | 39 images/sec/watt | 0.43 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
ResNet-50v1.5 | 128 | 57,280 images/sec | 58 images/sec/watt | 2.23 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
EfficientNet-B0 | 8 | 10,861 images/sec | 30 images/sec/watt | 0.74 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
EfficientNet-B0 | 128 | 28,889 images/sec | 41 images/sec/watt | 4.43 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
EfficientNet-B4 | 8 | 2,634 images/sec | 5 images/sec/watt | 3.04 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
EfficientNet-B4 | 128 | 4,101 images/sec | 5 images/sec/watt | 31.21 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
HF Swin Base | 8 | 6,062 samples/sec | 14 samples/sec/watt | 1.32 | 1x B200 | DGX B200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF Swin Base | 32 | 11,319 samples/sec | 19 samples/sec/watt | 2.83 | 1x B200 | DGX B200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF Swin Large | 8 | 4,742 samples/sec | 10 samples/sec/watt | 1.69 | 1x B200 | DGX B200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF Swin Large | 32 | 7,479 samples/sec | 11 samples/sec/watt | 4.28 | 1x B200 | DGX B200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF ViT Base | 8 | 11,267 samples/sec | 22 samples/sec/watt | 0.71 | 1x B200 | DGX B200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF ViT Base | 64 | 21,688 samples/sec | 29 samples/sec/watt | 2.95 | 1x B200 | DGX B200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF ViT Large | 8 | 5,171 samples/sec | 8 samples/sec/watt | 1.55 | 1x B200 | DGX B200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF ViT Large | 64 | 8,485 samples/sec | 10 samples/sec/watt | 7.54 | 1x B200 | DGX B200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA B200
QuartzNet | 8 | 7,787 samples/sec | 24 samples/sec/watt | 1.03 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
QuartzNet | 128 | 25,034 samples/sec | 47 samples/sec/watt | 5.11 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
RetinaNet-RN34 | 8 | 3,318 images/sec | 8 images/sec/watt | 2.41 | 1x B200 | DGX B200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA B200

HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256
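
The Efficiency column is throughput per watt, i.e. throughput divided by the average GPU power draw over the run; a minimal sketch of that calculation, with the power figure assumed for illustration:

```python
# Perf/W = measured throughput / average GPU power draw during the run.
# The power figure below is an assumption chosen for illustration; it is
# not a published measurement.
throughput = 57_280.0   # images/sec (ResNet-50 v1.5, batch 128, from the table)
avg_power_w = 990.0     # assumed average board power in watts
print(f"{throughput / avg_power_w:.0f} images/sec/watt")  # ≈ 58, as tabled
```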

H200 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 8 | 21,253 images/sec | 67 images/sec/watt | 0.38 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
ResNet-50v1.5 | 128 | 65,328 images/sec | 107 images/sec/watt | 1.96 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
EfficientNet-B0 | 8 | 17,243 images/sec | 77 images/sec/watt | 0.46 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
EfficientNet-B0 | 128 | 57,387 images/sec | 122 images/sec/watt | 2.23 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
EfficientNet-B4 | 8 | 4,613 images/sec | 14 images/sec/watt | 1.73 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
EfficientNet-B4 | 128 | 9,018 images/sec | 15 images/sec/watt | 14.19 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
HF Swin Base | 8 | 5,040 samples/sec | 11 samples/sec/watt | 1.59 | 1x H200 | DGX H200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA H200
HF Swin Base | 32 | 8,175 samples/sec | 12 samples/sec/watt | 3.91 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
HF Swin Large | 8 | 3,387 samples/sec | 6 samples/sec/watt | 2.36 | 1x H200 | DGX H200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA H200
HF Swin Large | 32 | 4,720 samples/sec | 7 samples/sec/watt | 6.78 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
HF ViT Base | 8 | 8,847 samples/sec | 19 samples/sec/watt | 0.9 | 1x H200 | DGX H200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA H200
HF ViT Base | 64 | 15,611 samples/sec | 23 samples/sec/watt | 4.1 | 1x H200 | DGX H200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA H200
HF ViT Large | 8 | 3,667 samples/sec | 6 samples/sec/watt | 2.18 | 1x H200 | DGX H200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA H200
HF ViT Large | 64 | 5,459 samples/sec | 8 samples/sec/watt | 11.72 | 1x H200 | DGX H200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA H200
QuartzNet | 8 | 7,012 samples/sec | 25 samples/sec/watt | 1.14 | 1x H200 | DGX H200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA H200
QuartzNet | 128 | 34,359 samples/sec | 90 samples/sec/watt | 3.73 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
RetinaNet-RN34 | 8 | 3,025 images/sec | 9 images/sec/watt | 2.64 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200

HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256

GH200 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 8 | 21,420 images/sec | 61 images/sec/watt | 0.37 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
ResNet-50v1.5 | 128 | 66,276 images/sec | 105 images/sec/watt | 1.93 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
EfficientNet-B0 | 8 | 17,198 images/sec | 68 images/sec/watt | 0.47 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
EfficientNet-B0 | 128 | 57,736 images/sec | 116 images/sec/watt | 2.22 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
EfficientNet-B4 | 8 | 4,622 images/sec | 13 images/sec/watt | 1.73 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
EfficientNet-B4 | 128 | 9,015 images/sec | 15 images/sec/watt | 14.2 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF Swin Base | 8 | 5,023 samples/sec | 11 samples/sec/watt | 1.59 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF Swin Base | 32 | 8,046 samples/sec | 12 samples/sec/watt | 3.98 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF Swin Large | 8 | 3,351 samples/sec | 6 samples/sec/watt | 2.39 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF Swin Large | 32 | 4,502 samples/sec | 7 samples/sec/watt | 7.11 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF ViT Base | 8 | 8,746 samples/sec | 18 samples/sec/watt | 0.91 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF ViT Base | 64 | 15,167 samples/sec | 23 samples/sec/watt | 4.22 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF ViT Large | 8 | 3,360 samples/sec | 6 samples/sec/watt | 2.38 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF ViT Large | 64 | 5,165 samples/sec | 8 samples/sec/watt | 12.39 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
QuartzNet | 8 | 7,038 samples/sec | 24 samples/sec/watt | 1.14 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
QuartzNet | 128 | 34,280 samples/sec | 82 samples/sec/watt | 3.73 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
RetinaNet-RN34 | 8 | 2,955 images/sec | 5 images/sec/watt | 2.71 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200

HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256

H100 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 8 | 21,912 images/sec | 65 images/sec/watt | 0.37 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
ResNet-50v1.5 | 128 | 56,829 images/sec | 119 images/sec/watt | 2.25 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
EfficientNet-B0 | 8 | 17,208 images/sec | 63 images/sec/watt | 0.46 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
EfficientNet-B0 | 128 | 52,455 images/sec | 191 images/sec/watt | 2.44 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
EfficientNet-B4 | 8 | 4,419 images/sec | 13 images/sec/watt | 1.81 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
EfficientNet-B4 | 128 | 8,701 images/sec | 14 images/sec/watt | 14.71 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF Swin Base | 8 | 5,124 samples/sec | 9 samples/sec/watt | 1.56 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF Swin Base | 32 | 7,348 samples/sec | 11 samples/sec/watt | 4.35 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF Swin Large | 8 | 3,147 samples/sec | 6 samples/sec/watt | 2.54 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF Swin Large | 32 | 4,392 samples/sec | 6 samples/sec/watt | 7.29 | 1x H100 | DGX H100 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF ViT Base | 8 | 8,494 samples/sec | 17 samples/sec/watt | 0.94 | 1x H100 | DGX H100 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF ViT Base | 64 | 14,968 samples/sec | 22 samples/sec/watt | 4.28 | 1x H100 | DGX H100 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF ViT Large | 8 | 3,399 samples/sec | 5 samples/sec/watt | 2.35 | 1x H100 | DGX H100 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF ViT Large | 64 | 5,195 samples/sec | 8 samples/sec/watt | 12.32 | 1x H100 | DGX H100 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
QuartzNet | 8 | 7,002 samples/sec | 23 samples/sec/watt | 1.14 | 1x H100 | DGX H100 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
QuartzNet | 128 | 34,881 samples/sec | 95 samples/sec/watt | 3.67 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
RetinaNet-RN34 | 8 | 2,764 images/sec | 15 images/sec/watt | 2.89 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB

HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256

L40S Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 8 | 23,025 images/sec | 71 images/sec/watt | 0.35 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
ResNet-50v1.5 | 32 | 29,073 images/sec | 84 images/sec/watt | 4.4 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
EfficientDet-D0 | 8 | 4,640 images/sec | 16 images/sec/watt | 1.72 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
EfficientNet-B0 | 8 | 20,504 images/sec | 96 images/sec/watt | 0.39 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
EfficientNet-B0 | 32 | 42,553 images/sec | 127 images/sec/watt | 3.01 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
EfficientNet-B4 | 8 | 5,135 images/sec | 17 images/sec/watt | 1.56 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
EfficientNet-B4 | 16 | 4,066 images/sec | 12 images/sec/watt | 31.48 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
HF Swin Base | 8 | 3,812 samples/sec | 11 samples/sec/watt | 2.1 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
HF Swin Base | 16 | 4,236 samples/sec | 12 samples/sec/watt | 7.55 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
HF Swin Large | 8 | 1,939 samples/sec | 6 samples/sec/watt | 4.12 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA L40S
HF Swin Large | 16 | 2,027 samples/sec | 6 samples/sec/watt | 15.79 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
HF ViT Base | 8 | 6,247 samples/sec | 18 samples/sec/watt | 1.28 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
HF ViT Large | 8 | 1,979 samples/sec | 6 samples/sec/watt | 4.04 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
QuartzNet | 8 | 7,570 samples/sec | 31 samples/sec/watt | 1.06 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA L40S
QuartzNet | 128 | 22,478 samples/sec | 65 samples/sec/watt | 5.69 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
RetinaNet-RN34 | 8 | 1,477 images/sec | 6 images/sec/watt | 5.42 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S

HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256

View More Performance Data

Training to Convergence

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology for testing whether AI systems are ready to be deployed in the field and deliver meaningful results.

Learn More

AI Pipeline

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.

Learn More