AI Inference

Real-world inferencing demands high throughput and low latency with maximum efficiency across use cases. NVIDIA’s industry-leading solutions let customers quickly deploy AI models into real-world production with the highest performance from data center to edge.


MLPerf Inference v4.0 Performance Benchmarks

Offline Scenario - Closed Division

Network | Throughput | GPU | Server | GPU Version | Target Accuracy | Dataset
Llama2 70B | 31,712 tokens/sec | 8x H200 | NVIDIA DGX H200 | NVIDIA H200-SXM-141GB-CTS | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca
Llama2 70B | 22,290 tokens/sec | 8x H100 | GIGABYTE G593-SD1 | NVIDIA H100-SXM-80GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca
Llama2 70B | 3,871 tokens/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200-GraceHopper-Superchip 144GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca
Llama2 70B | 15,086 tokens/sec | 8x H100 NVL | SYS-521GE-TNRT | NVIDIA H100 NVL | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | OpenOrca
Stable Diffusion XL | 13.2 samples/sec | 8x H100 | GIGABYTE G593-SD1 | H100-SXM-80GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val
Stable Diffusion XL | 1.8 samples/sec | 1x GH200 | GH200-GraceHopper-Superchip_GH200-96GB_aarch64x1_TRT | NVIDIA GH200-GraceHopper-Superchip | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val
Stable Diffusion XL | 5.04 samples/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val
ResNet-50 | 705,887 samples/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | 76.46% Top1 | ImageNet (224x224)
ResNet-50 | 369,341 samples/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | 76.46% Top1 | ImageNet (224x224)
RetinaNet | 14,291 samples/sec | 8x H100 | HPE Cray XD670 | H100-SXM-80GB | 0.3755 mAP | OpenImages (800x800)
RetinaNet | 6,401 samples/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | 0.3755 mAP | OpenImages (800x800)
BERT | 70,759 samples/sec | 8x H100 | HPE Cray XD670 | H100-SXM-80GB | 90.874% f1 | SQuAD v1.1
BERT | 26,430 samples/sec | 8x L40S | SYS-521GE-TNRT | NVIDIA L40S | 90.87% f1 | SQuAD v1.1
GPT-J | 243 samples/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN Dailymail
GPT-J | 32 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN Dailymail
GPT-J | 98 samples/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN Dailymail
DLRMv2 | 354,151 samples/sec | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | 80.31% AUC | Synthetic Multihot Criteo Dataset
DLRMv2 | 49,651 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 80.31% AUC | Synthetic Multihot Criteo Dataset
DLRMv2 | 101,691 samples/sec | 1x L40S | ESC8000-E11 | NVIDIA L40S | 80.31% AUC | Synthetic Multihot Criteo Dataset
3D-UNET | 52 samples/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | 0.863 DICE mean | KiTS 2019
3D-UNET | 32 samples/sec | 1x L40S | SYS-521GE-TNRT | NVIDIA L40S | 0.863 DICE mean | KiTS 2019
RNN-T | 191,355 samples/sec | 8x H100 | GIGABYTE G593-SD1 | H100-SXM-80GB | 7.45% WER | Librispeech dev-clean
RNN-T | 91,782 samples/sec | 1x L40S | ESC8000-E11 | NVIDIA L40S | 7.45% WER | Librispeech dev-clean

Server Scenario - Closed Division

Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraints | Dataset
Llama2 70B | 29,526 tokens/sec | 8x H200 | NVIDIA DGX H200 | NVIDIA H200-SXM-141GB-CTS | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca
Llama2 70B | 21,504 tokens/sec | 8x H100 | SYS-821GE-TNHR | NVIDIA H100-SXM-80GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca
Llama2 70B | 3,617 tokens/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200-GraceHopper-Superchip 144GB | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca
Llama2 70B | 14,275 tokens/sec | 8x H100 NVL | SYS-521GE-TNRT | NVIDIA H100 NVL | rouge1=44.4312, rouge2=22.0352, rougeL=28.6162 | TTFT/TPOT: 2000 ms/200 ms | OpenOrca
Stable Diffusion XL | 13.6 queries/sec | 8x H100 | SYS-821GE-TNHR | NVIDIA H100-SXM-80GB | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val
Stable Diffusion XL | 1.68 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200-GraceHopper-Superchip | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val
Stable Diffusion XL | 4.96 queries/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val
ResNet-50 | 630,172 queries/sec | 8x H100 | GIGABYTE G593-SD1 | H100-SXM-80GB | 76.46% Top1 | 15 ms | ImageNet (224x224)
ResNet-50 | 355,029 queries/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | 76.46% Top1 | 15 ms | ImageNet (224x224)
RetinaNet | 13,676 queries/sec | 8x H100 | HPE Cray XD670 | H100-SXM-80GB | 0.3755 mAP | 100 ms | OpenImages (800x800)
RetinaNet | 5,798 queries/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | 0.3755 mAP | 100 ms | OpenImages (800x800)
BERT | 57,293 queries/sec | 8x H100 | GIGABYTE G593-SD1 | H100-SXM-80GB | 90.87% f1 | 130 ms | SQuAD v1.1
BERT | 25,121 queries/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | 90.87% f1 | 130 ms | SQuAD v1.1
GPT-J | 240 queries/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20 s | CNN Dailymail
GPT-J | 31 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200-GraceHopper-Superchip | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20 s | CNN Dailymail
GPT-J | 98 queries/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20 s | CNN Dailymail
DLRMv2 | 333,218 queries/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset
DLRMv2 | 48,788 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | NVIDIA GH200-GraceHopper-Superchip | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset
DLRMv2 | 94,969 queries/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset
RNN-T | 179,985 queries/sec | 8x H100 | GIGABYTE G593-SD1 | H100-SXM-80GB | 7.45% WER | 1000 ms | Librispeech dev-clean
RNN-T | 87,974 queries/sec | 8x L40S | ESC8000-E11 | NVIDIA L40S | 7.45% WER | 1000 ms | Librispeech dev-clean
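
The Llama2 70B rows list a TTFT/TPOT constraint of 2000 ms/200 ms (time to first token and time per output token). As a rough illustration of what those two quantities measure, the sketch below computes them for one request from token arrival timestamps. The function names, the `LatencyLimits` class, and the TPOT formula (time after the first token divided by the remaining tokens) are assumptions made for this example; MLPerf evaluates the limits over a whole run, and the exact definition lives in the MLCommons rules.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class LatencyLimits:
    """Hypothetical container for the limits listed in the table."""
    ttft_ms: float = 2000.0  # time to first token
    tpot_ms: float = 200.0   # time per output token


def ttft_and_tpot(request_start_ms: float,
                  token_times_ms: Sequence[float]) -> Tuple[float, float]:
    """Return (TTFT, TPOT) in milliseconds for a single request.

    Assumed definition: TPOT is the time between the first and last token
    divided by the number of tokens after the first one.
    """
    if not token_times_ms:
        raise ValueError("at least one output token timestamp is required")
    ttft = token_times_ms[0] - request_start_ms
    later_tokens = len(token_times_ms) - 1
    if later_tokens == 0:
        return ttft, 0.0
    tpot = (token_times_ms[-1] - token_times_ms[0]) / later_tokens
    return ttft, tpot


def within_limits(request_start_ms: float,
                  token_times_ms: Sequence[float],
                  limits: LatencyLimits = LatencyLimits()) -> bool:
    """Check one request against both limits."""
    ttft, tpot = ttft_and_tpot(request_start_ms, token_times_ms)
    return ttft <= limits.ttft_ms and tpot <= limits.tpot_ms


# Made-up timestamps (ms) for a three-token response.
print(ttft_and_tpot(0.0, [1500.0, 1650.0, 1800.0]))
print(within_limits(0.0, [1500.0, 1650.0, 1800.0]))
```

A per-request check like this is only a building block; a benchmark harness would aggregate it across all requests in a run.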

Power Efficiency Offline Scenario - Closed Division

Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset
Llama2 70B | 17,099 tokens/sec | 2.99 tokens/sec/watt | 8x H100 | Dell PowerEdge XE9680 | H100-SXM-80GB | OpenOrca
Stable Diffusion XL | 9.65 samples/sec | 0.00203 samples/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | Subset of coco-2014 val
Stable Diffusion XL | 4.24 samples/sec | 0.00119 samples/sec/watt | 8x L40S | PRIMERGY CDI | NVIDIA L40S | Subset of coco-2014 val
ResNet-50 | 456,575 samples/sec | 113 samples/sec/watt | 8x H100 | Dell PowerEdge XE9680 | H100-SXM-80GB | ImageNet (224x224)
RetinaNet | 10,106 samples/sec | 2 samples/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | OpenImages (800x800)
BERT | 53,727 samples/sec | 11 samples/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | SQuAD v1.1
GPT-J | 174 samples/sec | 0.0377 samples/sec/watt | 8x H100 | Dell PowerEdge XE9680 | H100-SXM-80GB | CNN Dailymail
DLRMv2 | 283,714 samples/sec | 50 samples/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | Synthetic Multihot Criteo Dataset
3D-UNET | 37 samples/sec | 0.009 samples/sec/watt | 8x H100 | Dell PowerEdge XE9680 | H100-SXM-80GB | KiTS 2019
RNN-T | 139,938 samples/sec | 32 samples/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | Librispeech dev-clean
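
Because the table reports both throughput and throughput per watt, dividing one by the other gives a rough estimate of the average system power behind each result. The sketch below shows that arithmetic for three rows copied from the table. The helper name `implied_power_watts` is made up for this example, and since the per-watt column is rounded (for example "11" or "113"), the result is only an approximation, not a measured figure.

```python
def implied_power_watts(throughput: float, throughput_per_watt: float) -> float:
    """Estimate average power (W) as throughput / (throughput per watt)."""
    if throughput_per_watt <= 0:
        raise ValueError("throughput_per_watt must be positive")
    return throughput / throughput_per_watt


# (throughput, throughput per watt) copied from the offline power table.
rows = {
    "Llama2 70B": (17_099, 2.99),
    "ResNet-50": (456_575, 113),
    "BERT": (53_727, 11),
}

for network, (throughput, per_watt) in rows.items():
    watts = implied_power_watts(throughput, per_watt)
    print(f"{network}: roughly {watts:,.0f} W")
```

The same division applies to the server-scenario power table, with queries/sec in place of samples/sec.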

Power Efficiency Server Scenario - Closed Division

Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset
Llama2 70B | 15,487 tokens/sec | 2.62 tokens/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | OpenOrca
Stable Diffusion XL | 8.78 queries/sec | 0.00196 queries/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | Subset of coco-2014 val
Stable Diffusion XL | 4.12 queries/sec | 0.00117 queries/sec/watt | 8x L40S | PRIMERGY CDI | NVIDIA L40S | Subset of coco-2014 val
ResNet-50 | 400,031 queries/sec | 103 queries/sec/watt | 8x H100 | Dell PowerEdge XE9680 | H100-SXM-80GB | ImageNet (224x224)
RetinaNet | 8,794 queries/sec | 2 queries/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | OpenImages (800x800)
BERT | 42,386 queries/sec | 8 queries/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | SQuAD v1.1
GPT-J | 150 queries/sec | 0.0326 queries/sec/watt | 8x H100 | Dell PowerEdge XE9680 | H100-SXM-80GB | CNN Dailymail
DLRMv2 | 255,995 queries/sec | 44 queries/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | Synthetic Multihot Criteo Dataset
RNN-T | 123,981 queries/sec | 27 queries/sec/watt | 8x H100 | NVIDIA DGX H100 | H100-SXM-80GB | Librispeech dev-clean

MLPerf™ v4.0 Inference Closed: Llama2 70B, Stable Diffusion XL, ResNet-50 v1.5, RetinaNet, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net 99.9% of FP32 accuracy target, GPT-J 99.9% of FP32 accuracy target, DLRM 99.9% of FP32 accuracy target: 4.0-0002, 4.0-0033, 4.0-0042, 4.0-0044, 4.0-0047, 4.0-0062, 4.0-0063, 4.0-0064, 4.0-0065, 4.0-0066, 4.0-0068, 4.0-0070, 4.0-0071, 4.0-0082, 4.0-0085, 4.0-0086. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
NVIDIA H200 and NVIDIA GH200 Grace Hopper Superchip 144GB are preview submissions.
Llama2 Max Sequence Length = 1,024
BERT-Large Max Sequence Length = 384.

LLM Inference Performance of NVIDIA Data Center Products

H200 Inference Performance - High Throughput

Model | Batch Size | TP | Input Length | Output Length | Throughput/GPU | GPU | Server | Precision | Framework | GPU Version
GPT-J 6B | 1024 | 1 | 128 | 128 | 29,169 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
GPT-J 6B | 120 | 1 | 128 | 2048 | 9,472 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
GPT-J 6B | 60 | 1 | 128 | 4096 | 5,287 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
GPT-J 6B | 64 | 1 | 2048 | 128 | 2,962 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
GPT-J 6B | 64 | 1 | 2048 | 2048 | 4,149 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
Llama v2 7B | 896 | 1 | 128 | 128 | 20,548 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
Llama v2 7B | 120 | 1 | 128 | 2048 | 8,343 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
Llama v2 7B | 60 | 1 | 2048 | 4096 | 4,808 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
Llama v2 7B | 84 | 1 | 2048 | 128 | 2,430 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
Llama v2 7B | 56 | 1 | 2048 | 2048 | 3,530 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
Llama v2 70B | 512 | 1 | 128 | 128 | 3,844 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
Llama v2 70B | 512 | 2 | 128 | 2048 | 4,008 total tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
Llama v2 70B | 256 | 2 | 128 | 4096 | 2,712 total tokens/sec | 2x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
Llama v2 70B | 64 | 1 | 2048 | 128 | 422 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
Llama v2 70B | 64 | 1 | 2048 | 2048 | 1,461 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
Falcon 180B | 1024 | 4 | 128 | 128 | 1,117 total tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
Falcon 180B | 1024 | 4 | 128 | 2048 | 991 total tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
Falcon 180B | 512 | 4 | 128 | 4096 | 668 total tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
Falcon 180B | 64 | 4 | 2048 | 128 | 119 total tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
Falcon 180B | 64 | 4 | 2048 | 2048 | 269 total tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
Mistral 7B | 512 | 1 | 128 | 128 | 20,569 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
Mistral 7B | 120 | 1 | 128 | 2048 | 8,968 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
Mistral 7B | 60 | 1 | 128 | 4096 | 5,210 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
Mistral 7B | 84 | 1 | 2048 | 128 | 2,450 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200
Mistral 7B | 56 | 1 | 2048 | 2048 | 3,868 total tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA H200

TP: Tensor Parallelism
Batch size per GPU

GH200 Inference Performance - High Throughput

Model | Batch Size | TP | Input Length | Output Length | Throughput/GPU | GPU | Server | Precision | Framework | GPU Version
GPT-J 6B | 1024 | 1 | 128 | 128 | 28,946 total tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB
GPT-J 6B | 120 | 1 | 128 | 2048 | 8,882 total tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB
GPT-J 6B | 60 | 1 | 128 | 4096 | 4,938 total tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB
GPT-J 6B | 64 | 1 | 2048 | 128 | 2,783 total tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB
GPT-J 6B | 64 | 1 | 2048 | 2048 | 3,832 total tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB
Llama v2 70B | 256 | 1 | 128 | 128 | 3,401 total tokens/sec | 1x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB
Llama v2 70B | 256 | 2 | 128 | 2048 | 2,904 total tokens/sec | 2x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB
Llama v2 70B | 128 | 2 | 128 | 4096 | 1,904 total tokens/sec | 2x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB
Llama v2 70B | 96 | 2 | 2048 | 128 | 305 total tokens/sec | 2x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB
Llama v2 70B | 64 | 2 | 2048 | 2048 | 1,028 total tokens/sec | 2x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB
Falcon 180B | 1024 | 4 | 128 | 128 | 1,132 total tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB
Falcon 180B | 512 | 4 | 128 | 2048 | 946 total tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB
Falcon 180B | 256 | 4 | 128 | 4096 | 590 total tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB
Falcon 180B | 64 | 4 | 2048 | 128 | 121 total tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB
Falcon 180B | 64 | 4 | 2048 | 2048 | 277 total tokens/sec | 4x GH200 | NVIDIA Grace Hopper x4 P4496 | FP8 | TensorRT-LLM 0.8.0 | NVIDIA GH200 96GB

TP: Tensor Parallelism
Batch size per GPU

H100 Inference Performance - High Throughput

Model | Batch Size | TP | Input Length | Output Length | Throughput/GPU | GPU | Server | Precision | Framework | GPU Version
GPT-J 6B | 1024 | 1 | 128 | 128 | 27,358 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.8.0 | H100-SXM5-80GB
GPT-J 6B | 120 | 1 | 128 | 2048 | 7,832 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.8.0 | H100-SXM5-80GB
GPT-J 6B | 60 | 1 | 128 | 4096 | 4,424 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.8.0 | H100-SXM5-80GB
GPT-J 6B | 64 | 1 | 2048 | 128 | 2,661 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.8.0 | H100-SXM5-80GB
GPT-J 6B | 64 | 1 | 2048 | 2048 | 3,409 total tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.8.0 | H100-SXM5-80GB
Llama v2 70B | 1024 | 2 | 128 | 128 | 3,269 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.8.0 | H100-SXM5-80GB
Llama v2 70B | 1024 | 4 | 128 | 2048 | 2,718 total tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.8.0 | H100-SXM5-80GB
Llama v2 70B | 256 | 4 | 128 | 4096 | 1,879 total tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.8.0 | H100-SXM5-80GB
Llama v2 70B | 96 | 2 | 2048 | 128 | 347 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.8.0 | H100-SXM5-80GB
Llama v2 70B | 64 | 2 | 2048 | 2048 | 1,020 total tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 0.8.0 | H100-SXM5-80GB

TP: Tensor Parallelism
Batch size per GPU
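
The GPT-J 6B rows in the H200 and H100 high-throughput tables use the same batch sizes, sequence lengths, precision (FP8), and framework version, so their per-GPU throughputs can be compared directly. The sketch below computes the H200-to-H100 ratio for each input/output length pair. The dictionary layout and the `speedups` helper are made up for this example; only the throughput figures come from the tables.

```python
from typing import Dict, Tuple

Shape = Tuple[int, int]  # (input length, output length)

# GPT-J 6B throughput per GPU in total tokens/sec, copied from the tables.
H200_GPTJ: Dict[Shape, int] = {
    (128, 128): 29_169,
    (128, 2048): 9_472,
    (128, 4096): 5_287,
    (2048, 128): 2_962,
    (2048, 2048): 4_149,
}
H100_GPTJ: Dict[Shape, int] = {
    (128, 128): 27_358,
    (128, 2048): 7_832,
    (128, 4096): 4_424,
    (2048, 128): 2_661,
    (2048, 2048): 3_409,
}


def speedups(new: Dict[Shape, int], base: Dict[Shape, int]) -> Dict[Shape, float]:
    """Ratio new/base for every shape present in both tables."""
    return {shape: new[shape] / base[shape] for shape in new.keys() & base.keys()}


for (in_len, out_len), ratio in sorted(speedups(H200_GPTJ, H100_GPTJ).items()):
    print(f"input {in_len:>4}, output {out_len:>4}: {ratio:.2f}x")
```

Rows for other models should be compared with more care, since the tensor-parallel degree and batch size differ between the two tables for Llama v2 70B.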

H100 NVL Inference Performance - High Throughput

Model | Batch Size | TP | Input Length | Output Length | Throughput/GPU | GPU | Server | Precision | Framework | GPU Version
GPT-J 6B | 1024 | 1 | 128 | 128 | 20,484 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL
GPT-J 6B | 120 | 1 | 128 | 2048 | 7,134 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL
GPT-J 6B | 60 | 1 | 128 | 4096 | 3,990 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL
GPT-J 6B | 64 | 1 | 2048 | 128 | 2,124 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL
GPT-J 6B | 64 | 1 | 2048 | 2048 | 3,062 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL
Llama v2 7B | 896 | 1 | 128 | 128 | 15,044 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL
Llama v2 7B | 120 | 1 | 128 | 2048 | 6,153 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL
Llama v2 7B | 60 | 1 | 128 | 4096 | 3,556 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL
Llama v2 7B | 84 | 1 | 2048 | 128 | 1,736 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL
Llama v2 7B | 56 | 1 | 2048 | 2048 | 2,591 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL
Llama v2 70B | 256 | 1 | 128 | 128 | 2,335 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL
Llama v2 70B | 96 | 1 | 2048 | 128 | 264 total tokens/sec | 1x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL
Llama v2 70B | 64 | 2 | 2048 | 2048 | 846 total tokens/sec | 2x H100 | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | H100 NVL

TP: Tensor Parallelism
Batch size per GPU

L40S Inference Performance - High Throughput

Model | Batch Size | TP | Input Length | Output Length | Throughput/GPU | GPU | Server | Precision | Framework | GPU Version
GPT-J 6B | 512 | 1 | 128 | 128 | 7,993 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
GPT-J 6B | 64 | 1 | 128 | 2048 | 1,874 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
GPT-J 6B | 32 | 1 | 128 | 4096 | 992 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
GPT-J 6B | 32 | 1 | 2048 | 128 | 693 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
GPT-J 6B | 32 | 1 | 2048 | 2048 | 768 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
Llama v2 7B | 256 | 1 | 128 | 128 | 5,954 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
Llama v2 7B | 64 | 1 | 128 | 2048 | 1,654 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
Llama v2 7B | 32 | 1 | 128 | 4096 | 868 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
Llama v2 7B | 32 | 1 | 2048 | 128 | 580 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
Llama v2 7B | 16 | 1 | 2048 | 2048 | 543 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
Llama v2 70B | 512 | 8 | 128 | 128 | 1,473 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
Llama v2 70B | 512 | 8 | 128 | 2048 | 2,329 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
Llama v2 70B | 256 | 8 | 128 | 4096 | 2,003 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
Llama v2 70B | 64 | 4 | 2048 | 128 | 167 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
Llama v2 70B | 64 | 8 | 2048 | 2048 | 840 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
Falcon 180B | 512 | 8 | 128 | 128 | 1,220 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
Falcon 180B | 256 | 8 | 128 | 2048 | 1,604 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
Falcon 180B | 256 | 8 | 128 | 4096 | 1,490 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
Falcon 180B | 32 | 8 | 2048 | 128 | 127 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
Falcon 180B | 32 | 8 | 2048 | 2048 | 420 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
Mistral 7B | 896 | 1 | 128 | 128 | 9,680 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
Mistral 7B | 120 | 1 | 128 | 2048 | 4,401 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
Mistral 7B | 60 | 1 | 128 | 4096 | 2,331 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
Mistral 7B | 84 | 1 | 2048 | 128 | 979 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S
Mistral 7B | 56 | 1 | 2048 | 2048 | 1,721 total tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.8.0 | NVIDIA L40S

TP: Tensor Parallelism
Batch size per GPU

H100 Inference Performance - Low Latency

Model | Batch Size | TP | Input Length | 1st Token Latency | GPU | Server | Precision | Framework | GPU Version
GPT-J 6B | 1 | 1 | 128 | 7 ms | 1x H100 | DGX H100 | FP8 | TRT-LLM 0.5.0 | H100-SXM5-80GB
GPT-J 6B | 1 | 1 | 2048 | 29 ms | 1x H100 | DGX H100 | FP8 | TRT-LLM 0.5.0 | H100-SXM5-80GB
LLaMA 7B | 1 | 1 | 128 | 7 ms | 1x H100 | DGX H100 | FP8 | TRT-LLM 0.5.0 | H100-SXM5-80GB
LLaMA 7B | 1 | 1 | 2048 | 36 ms | 1x H100 | DGX H100 | FP8 | TRT-LLM 0.5.0 | H100-SXM5-80GB
LLaMA 70B | 1 | 4 | 128 | 26 ms | 4x H100 | DGX H100 | FP8 | TRT-LLM 0.5.0 | H100-SXM5-80GB
LLaMA 70B | 1 | 4 | 2048 | 109 ms | 4x H100 | DGX H100 | FP8 | TRT-LLM 0.5.0 | H100-SXM5-80GB
Falcon 180B | 1 | 8 | 128 | 27 ms | 8x H100 | DGX H100 | FP8 | TRT-LLM 0.5.0 | H100-SXM5-80GB
Falcon 180B | 1 | 8 | 2048 | 205 ms | 8x H100 | DGX H100 | FP8 | TRT-LLM 0.5.0 | H100-SXM5-80GB

TP: Tensor Parallelism
Batch size per GPU

L40S Inference Performance - Low Latency

Model | Batch Size | TP | Input Length | 1st Token Latency | GPU | Server | Precision | Framework | GPU Version
GPT-J 6B | 1 | 1 | 128 | 12 ms | 1x L40S | asrockrack 4u8g-rome | FP8 | TRT-LLM 0.5.0 | NVIDIA L40S
GPT-J 6B | 1 | 1 | 2048 | 71 ms | 1x L40S | asrockrack 4u8g-rome | FP8 | TRT-LLM 0.5.0 | NVIDIA L40S
LLaMA 7B | 1 | 1 | 128 | 14 ms | 1x L40S | asrockrack 4u8g-rome | FP8 | TRT-LLM 0.5.0 | NVIDIA L40S
LLaMA 7B | 1 | 1 | 2048 | 73 ms | 1x L40S | asrockrack 4u8g-rome | FP8 | TRT-LLM 0.5.0 | NVIDIA L40S

TP: Tensor Parallelism
Batch size per GPU

Inference Performance of NVIDIA Data Center Products

GH200 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
Stable Diffusion v2.1 (512x512) | 1 | 2.12 images/sec | - | 471.9 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | GH200 96GB
Stable Diffusion v2.1 (512x512) | 4 | 3.3 images/sec | - | 1210.63 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | GH200 96GB
Stable Diffusion XL | 1 | 0.35 images/sec | - | 2899.48 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | GH200 96GB
ResNet-50 | 8 | 21,350 images/sec | 78 images/sec/watt | 0.37 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB
ResNet-50 | 128 | 63,745 images/sec | 118 images/sec/watt | 2.01 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB
ResNet-50 | 542 | 77,857 images/sec | - | 6.96 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB
ResNet-50v1.5 | 128 | 61,867 images/sec | 112 images/sec/watt | 2.07 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB
ResNet-50v1.5 | 472 | 74,489 images/sec | - | 6.98 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB
BERT-BASE | 8 | 9,328 sequences/sec | 22 sequences/sec/watt | 0.86 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB
BERT-LARGE | 8 | 4,073 sequences/sec | 9 sequences/sec/watt | 1.96 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB
EfficientNet-B0 | 8 | 16,357 images/sec | 80 images/sec/watt | 0.49 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB
EfficientNet-B0 | 128 | 56,136 images/sec | 126 images/sec/watt | 2.28 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB
EfficientNet-B0 | 479 | 68,736 images/sec | - | 6.97 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB
EfficientNet-B4 | 8 | 4,521 images/sec | 15 images/sec/watt | 1.77 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB
EfficientNet-B4 | 55 | 7,911 images/sec | - | 6.95 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB
EfficientNet-B4 | 128 | 8,673 images/sec | 15 images/sec/watt | 14.76 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB
HF Swin Base | 8 | 4,109 samples/sec | 10 samples/sec/watt | 1.95 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | GH200 96GB
HF Swin Base | 32 | 6,432 samples/sec | 11 samples/sec/watt | 4.97 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB
HF Swin Large | 8 | 2,727 samples/sec | 5 samples/sec/watt | 2.93 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB
HF Swin Large | 32 | 3,926 samples/sec | 6 samples/sec/watt | 8.15 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB
HF ViT Base | 8 | 6,698 samples/sec | 12 samples/sec/watt | 1.19 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | GH200 96GB
HF ViT Large | 8 | 2,715 samples/sec | 4 samples/sec/watt | 2.95 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB
HF ViT Large | 64 | 3,819 samples/sec | 5 samples/sec/watt | 16.76 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB
Megatron BERT Large QAT | 8 | 4,987 sequences/sec | 14 sequences/sec/watt | 1.6 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB
QuartzNet | 8 | 6,415 samples/sec | 26 samples/sec/watt | 1.25 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | GH200 96GB
QuartzNet | 128 | 33,527 samples/sec | 94 samples/sec/watt | 3.82 | 1x GH200 | NVIDIA P3880 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | GH200 96GB

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
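
In most rows of the table above, the reported throughput is close to the batch size divided by the latency. The sketch below checks that relation for three rows. It is a consistency check under the assumption that one batch is processed at a time and that the published latency is rounded, not a description of how the figures were measured; the function name is made up for this example.

```python
def throughput_from_latency(batch_size: int, latency_ms: float) -> float:
    """Items per second if one batch of `batch_size` completes every `latency_ms`."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return batch_size / (latency_ms / 1000.0)


# (network, batch size, reported throughput per sec, latency in ms)
# copied from the GH200 table.
rows = [
    ("ResNet-50", 8, 21_350, 0.37),
    ("ResNet-50", 128, 63_745, 2.01),
    ("BERT-LARGE", 8, 4_073, 1.96),
]

for network, batch, reported, latency_ms in rows:
    estimate = throughput_from_latency(batch, latency_ms)
    gap = abs(estimate - reported) / reported
    print(f"{network} (batch {batch}): estimate {estimate:,.0f}/sec, "
          f"reported {reported:,}/sec, gap {gap:.1%}")
```

Gaps of a percent or two are expected from rounding of the latency column alone.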

H100 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
Stable Diffusion v2.1 (512x512) | 1 | 2.13 images/sec | - | 468.79 | 1x H100 | DGX H100 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
Stable Diffusion v2.1 (512x512) | 4 | 3.12 images/sec | - | 1284.05 | 1x H100 | DGX H100 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
Stable Diffusion XL | 1 | 0.33 images/sec | - | 3023.54 | 1x H100 | DGX H100 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
ResNet-50 | 8 | 20,766 images/sec | 73 images/sec/watt | 0.39 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
ResNet-50 | 128 | 59,967 images/sec | 101 images/sec/watt | 2.13 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
ResNet-50 | 495 | 70,882 images/sec | - | 6.98 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
ResNet-50v1.5 | 128 | 58,467 images/sec | 106 images/sec/watt | 2.19 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
ResNet-50v1.5 | 472 | 67,927 images/sec | - | 6.95 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
BERT-BASE | 8 | 9,319 sequences/sec | 22 sequences/sec/watt | 0.86 | 1x H100 | DGX H100 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
BERT-LARGE | 8 | 3,985 sequences/sec | 8 sequences/sec/watt | 2.01 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
EfficientNet-B0 | 8 | 15,995 images/sec | 63 images/sec/watt | 0.5 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
EfficientNet-B0 | 128 | 54,695 images/sec | 108 images/sec/watt | 2.34 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
EfficientNet-B0 | 467 | 66,922 images/sec | - | 6.98 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
EfficientNet-B4 | 8 | 4,479 images/sec | 12 images/sec/watt | 1.79 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
EfficientNet-B4 | 53 | 7,681 images/sec | - | 6.9 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
EfficientNet-B4 | 128 | 8,484 images/sec | 14 images/sec/watt | 15.09 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
HF Swin Base | 8 | 3,965 samples/sec | 9 samples/sec/watt | 2.02 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
HF Swin Base | 32 | 6,256 samples/sec | 10 samples/sec/watt | 5.12 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
HF Swin Large | 8 | 2,694 samples/sec | 5 samples/sec/watt | 2.97 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
HF Swin Large | 32 | 3,732 samples/sec | 5 samples/sec/watt | 8.57 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
HF ViT Base | 8 | 6,688 samples/sec | 12 samples/sec/watt | 1.2 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
HF ViT Large | 8 | 2,683 samples/sec | 4 samples/sec/watt | 2.98 | 1x H100 | DGX H100 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
HF ViT Large | 64 | 3,270 samples/sec | 5 samples/sec/watt | 19.57 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
Megatron BERT Large QAT | 8 | 4,794 sequences/sec | 13 sequences/sec/watt | 1.67 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
QuartzNet | 8 | 6,448 samples/sec | 22 samples/sec/watt | 1.24 | 1x H100 | DGX H100 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB
QuartzNet | 128 | 32,691 samples/sec | 80 samples/sec/watt | 3.92 | 1x H100 | DGX H100 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | H100 SXM 80GB

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

L40S Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
Stable Diffusion XL | 1 | 0.16 images/sec | - | 6454.46 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.12-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40S
ResNet-50 | 8 | 23,704 images/sec | 79 images/sec/watt | 0.34 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
ResNet-50 | 32 | 39,363 images/sec | 114 images/sec/watt | 0.81 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
ResNet-50v1.5 | 8 | 23,034 images/sec | 75 images/sec/watt | 0.35 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
ResNet-50v1.5 | 32 | 37,522 images/sec | 109 images/sec/watt | 0.85 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
BERT-BASE | 8 | 8,271 sequences/sec | 29 sequences/sec/watt | 0.97 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
BERT-BASE | 128 | 12,910 sequences/sec | 37 sequences/sec/watt | 9.91 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
BERT-LARGE | 8 | 3,167 sequences/sec | 10 sequences/sec/watt | 2.53 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
BERT-LARGE | 24 | 4,321 sequences/sec | 13 sequences/sec/watt | 5.55 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
EfficientDet-D0 | 2 | images/sec | 13 images/sec/watt | 0.92 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40S
EfficientDet-D0 | 8 | 4,530 images/sec | 16 images/sec/watt | 1.77 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
EfficientNet-B0 | 8 | 20,456 images/sec | 105 images/sec/watt | 0.39 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
EfficientNet-B0 | 32 | 40,357 images/sec | 137 images/sec/watt | 0.79 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
EfficientNet-B4 | 8 | 5,082 images/sec | 18 images/sec/watt | 1.57 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
EfficientNet-B4 | 16 | 5,307 images/sec | 18 images/sec/watt | 2.7 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
HF Swin Base | 8 | 3,138 samples/sec | 10 samples/sec/watt | 2.55 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
HF Swin Base | 16 | 3,624 samples/sec | 11 samples/sec/watt | 4.42 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
HF Swin Large | 8 | 1,598 samples/sec | 5 samples/sec/watt | 5.01 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
HF Swin Large | 16 | 1,778 samples/sec | 6 samples/sec/watt | 9 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
HF ViT Base | 12 | 4,019 samples/sec | 13 samples/sec/watt | 2.99 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
HF ViT Large | 8 | 1,365 samples/sec | 4 samples/sec/watt | 5.86 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
Megatron BERT Large QAT | 8 | 4,228 sequences/sec | 13 sequences/sec/watt | 1.89 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
Megatron BERT Large QAT | 24 | 5,102 sequences/sec | 15 sequences/sec/watt | 4.7 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
QuartzNet | 8 | 7,625 samples/sec | 34 samples/sec/watt | 1.05 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | NVIDIA L40S
QuartzNet | 128 | 22,232 samples/sec | 64 samples/sec/watt | 5.76 | 1x L40S | Supermicro SYS-521GE-TNRT | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L40S

1,024 x 1,024 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

L4 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
Stable Diffusion v2.1 (512x512) | 1 | 0.45 images/sec | - | 2230.89 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4
Stable Diffusion v2.1 (512x512) | 4 | 0.46 images/sec | - | 8612.55 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4
Stable Diffusion XL | 1 | 0.05 images/sec | - | 20540.47 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4
ResNet-50 | 8 | 10,164 images/sec | 141 images/sec/watt | 0.79 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4
ResNet-50 | 32 | 10,426 images/sec | 145 images/sec/watt | 3.07 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4
ResNet-50v1.5 | 8 | 9,761 images/sec | 135 images/sec/watt | 0.82 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4
ResNet-50v1.5 | 32 | 10,076 images/sec | 140 images/sec/watt | 3.18 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4
BERT-BASE | 8 | 3,511 sequences/sec | 50 sequences/sec/watt | 2.28 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4
BERT-BASE | 24 | 4,034 sequences/sec | 56 sequences/sec/watt | 5.95 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4
BERT-LARGE | 8 | 1,109 sequences/sec | 15 sequences/sec/watt | 7.22 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4
BERT-LARGE | 12 | 1,293 sequences/sec | 18 sequences/sec/watt | 9.28 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4
EfficientNet-B4 | 8 | 1,816 images/sec | 25 images/sec/watt | 4.4 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4
HF Swin Base | 8 | 1,100 samples/sec | 15 samples/sec/watt | 7.27 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4
HF Swin Large | 8 | 541 samples/sec | 8 samples/sec/watt | 14.78 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | NVIDIA L4
HF ViT Base | 8 | 1,304 samples/sec | 18 samples/sec/watt | 6.13 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4
HF ViT Large | 8 | 393 samples/sec | 5 samples/sec/watt | 20.35 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | NVIDIA L4
Megatron BERT Large QAT | 8 | 1,517 sequences/sec | 21 sequences/sec/watt | 5.28 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4
QuartzNet | 8 | 4,600 samples/sec | 64 samples/sec/watt | 1.74 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4
QuartzNet | 128 | 5,998 samples/sec | 83 samples/sec/watt | 21.34 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4
RetinaNet-RN34 | 8 | 373 images/sec | 5 images/sec/watt | 21.43 | 1x L4 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | NVIDIA L4

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
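
The latency figures in the single-GPU tables appear to track batch size divided by throughput. This is an observation from the published numbers, not a documented definition. A minimal Python sketch, using three rows from the L4 table above (the function name `implied_latency_ms` is hypothetical):

```python
def implied_latency_ms(batch_size: int, throughput_per_sec: float) -> float:
    """Time to process one batch in milliseconds, assuming batches run back to back."""
    return batch_size / throughput_per_sec * 1000.0


# (network, batch size, throughput per second, published latency in ms), from the L4 table
rows = [
    ("ResNet-50", 8, 10164, 0.79),
    ("BERT-BASE", 24, 4034, 5.95),
    ("QuartzNet", 128, 5998, 21.34),
]

for network, batch, throughput, published in rows:
    estimate = implied_latency_ms(batch, throughput)
    print(f"{network}: estimated {estimate:.2f} ms, published {published} ms")
```

A few rows elsewhere on the page deviate from this relationship, so treat it as a sanity check rather than a rule.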

 

A40 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 11,549 images/sec | 44 images/sec/watt | 0.69 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
ResNet-50 | 113 | 16,444 images/sec | - | 6.87 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
ResNet-50 | 128 | 16,247 images/sec | 54 images/sec/watt | 7.88 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
ResNet-50v1.5 | 8 | 11,116 images/sec | 41 images/sec/watt | 0.72 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
ResNet-50v1.5 | 109 | 15,626 images/sec | - | 6.91 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
ResNet-50v1.5 | 128 | 15,626 images/sec | 52 images/sec/watt | 8.19 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
BERT-BASE | 8 | 4,392 sequences/sec | 15 sequences/sec/watt | 1.82 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
BERT-BASE | 128 | 5,704 sequences/sec | 20 sequences/sec/watt | 22.44 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
BERT-LARGE | 8 | 1,596 sequences/sec | 5 sequences/sec/watt | 5.01 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
BERT-LARGE | 128 | 1,964 sequences/sec | 7 sequences/sec/watt | 65.17 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
EfficientNet-B0 | 8 | 10,900 images/sec | 59 images/sec/watt | 0.73 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
EfficientNet-B0 | 128 | 20,003 images/sec | 67 images/sec/watt | 6.4 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
EfficientNet-B0 | 138 | 19,890 images/sec | - | 6.94 | 1x A40 | GIGABYTE G482-Z52-00 | 23.12-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
EfficientNet-B4 | 8 | 2,106 images/sec | 8 images/sec/watt | 3.8 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
EfficientNet-B4 | 15 | 2,274 images/sec | - | 6.6 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
EfficientNet-B4 | 128 | 2,690 images/sec | 9 images/sec/watt | 47.58 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
HF Swin Base | 8 | 1,444 samples/sec | 5 samples/sec/watt | 5.54 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
HF Swin Base | 32 | 1,465 samples/sec | 5 samples/sec/watt | 21.84 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A40
HF Swin Large | 8 | 829 samples/sec | 3 samples/sec/watt | 9.65 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A40
HF Swin Large | 32 | 840 samples/sec | 3 samples/sec/watt | 38.1 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A40
HF ViT Base | 8 | 2,176 samples/sec | 7 samples/sec/watt | 3.68 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
HF ViT Base | 64 | 2,182 samples/sec | 7 samples/sec/watt | 29.33 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
HF ViT Large | 8 | 694 samples/sec | 2 samples/sec/watt | 11.53 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
HF ViT Large | 64 | 713 samples/sec | 2 samples/sec/watt | 89.73 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
Megatron BERT Large QAT | 8 | 2,101 sequences/sec | 8 sequences/sec/watt | 3.81 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
Megatron BERT Large QAT | 128 | 2,688 sequences/sec | 9 sequences/sec/watt | 47.62 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
QuartzNet | 8 | 4,501 samples/sec | 21 samples/sec/watt | 1.78 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
QuartzNet | 128 | 8,492 samples/sec | 28 samples/sec/watt | 15.07 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40
RetinaNet-RN34 | 8 | 706 images/sec | 2 images/sec/watt | 11.33 | 1x A40 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A40

BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

 

A30 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 10,446 images/sec | 74 images/sec/watt | 0.77 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
ResNet-50 | 117 | 17,104 images/sec | - | 6.84 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
ResNet-50 | 128 | 17,328 images/sec | 106 images/sec/watt | 7.39 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
ResNet-50v1.5 | 8 | 10,167 images/sec | 71 images/sec/watt | 0.79 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
ResNet-50v1.5 | 118 | 16,540 images/sec | - | 6.95 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
ResNet-50v1.5 | 128 | 16,759 images/sec | 102 images/sec/watt | 7.64 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server page
BERT-BASE | 2 | For Batch Size 2, please refer to Triton Inference Server page
BERT-BASE | 8 | 4,337 sequences/sec | 26 sequences/sec/watt | 1.84 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
BERT-BASE | 128 | 5,784 sequences/sec | 35 sequences/sec/watt | 22.13 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server page
BERT-LARGE | 2 | For Batch Size 2, please refer to Triton Inference Server page
BERT-LARGE | 8 | 1,488 sequences/sec | 9 sequences/sec/watt | 5.38 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
BERT-LARGE | 128 | 2,043 sequences/sec | 12 sequences/sec/watt | 62.65 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
EfficientNet-B0 | 8 | 8,928 images/sec | 82 images/sec/watt | 0.9 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
EfficientNet-B0 | 117 | 17,178 images/sec | - | 6.81 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
EfficientNet-B0 | 128 | 17,251 images/sec | 105 images/sec/watt | 7.42 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
EfficientNet-B4 | 8 | 1,866 images/sec | 13 images/sec/watt | 4.29 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
EfficientNet-B4 | 14 | 2,091 images/sec | - | 6.69 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
EfficientNet-B4 | 128 | 2,395 images/sec | 15 images/sec/watt | 53.44 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
HF Swin Base | 8 | 1,456 samples/sec | 9 samples/sec/watt | 5.49 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A30
HF Swin Base | 32 | 1,604 samples/sec | 10 samples/sec/watt | 19.96 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
HF Swin Large | 8 | 811 samples/sec | 5 samples/sec/watt | 9.87 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A30
HF Swin Large | 32 | 841 samples/sec | 5 samples/sec/watt | 38.04 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A30
HF ViT Base | 8 | 2,028 samples/sec | 12 samples/sec/watt | 3.94 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A30
HF ViT Base | 64 | 2,140 samples/sec | 13 samples/sec/watt | 29.9 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
HF ViT Large | 8 | 648 samples/sec | 4 samples/sec/watt | 12.34 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
HF ViT Large | 64 | 698 samples/sec | 4 samples/sec/watt | 91.71 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
Megatron BERT Large QAT | 8 | 1,816 sequences/sec | 13 sequences/sec/watt | 4.41 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
Megatron BERT Large QAT | 128 | 2,766 sequences/sec | 17 sequences/sec/watt | 46.28 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
QuartzNet | 8 | 3,429 samples/sec | 30 samples/sec/watt | 2.33 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
QuartzNet | 128 | 9,891 samples/sec | 71 samples/sec/watt | 12.94 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30
RetinaNet-RN34 | 8 | 695 images/sec | 4 images/sec/watt | 11.52 | 1x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A30

BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
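
Because the efficiency column is expressed as throughput per watt, dividing throughput by efficiency gives a rough figure for the average power drawn during a run. The sketch below is illustrative only: it assumes efficiency is simply throughput divided by measured power, the efficiency values are rounded to whole numbers so the result is approximate, and `implied_power_watts` is a hypothetical helper name.

```python
def implied_power_watts(throughput_per_sec: float, efficiency_per_watt: float) -> float:
    """Approximate average power, assuming efficiency = throughput / power."""
    return throughput_per_sec / efficiency_per_watt


# (network, batch size, throughput per second, efficiency per watt), from the A30 table
rows = [
    ("ResNet-50", 128, 17328, 106),
    ("BERT-BASE", 128, 5784, 35),
    ("QuartzNet", 128, 9891, 71),
]

for network, batch, throughput, efficiency in rows:
    watts = implied_power_watts(throughput, efficiency)
    print(f"{network} (batch {batch}): roughly {watts:.0f} W")
```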

 

A30 1/4 MIG Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 4,047 images/sec | 47 images/sec/watt | 1.98 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
ResNet-50 | 32 | 4,650 images/sec | - | 6.88 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
ResNet-50 | 128 | 4,788 images/sec | 54 images/sec/watt | 26.73 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
ResNet-50v1.5 | 8 | 3,894 images/sec | 47 images/sec/watt | 2.05 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
ResNet-50v1.5 | 31 | 4,463 images/sec | - | 6.95 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
ResNet-50v1.5 | 128 | 4,636 images/sec | 51 images/sec/watt | 27.61 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
BERT-BASE | 8 | 1,571 sequences/sec | 17 sequences/sec/watt | 5.09 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
BERT-BASE | 128 | 1,705 sequences/sec | 18 sequences/sec/watt | 75.05 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
BERT-LARGE | 8 | 519 sequences/sec | 6 sequences/sec/watt | 15.42 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
BERT-LARGE | 128 | 592 sequences/sec | 6 sequences/sec/watt | 216.21 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30

Sequence length=128 for BERT-BASE and BERT-LARGE

 

A30 4 MIG Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 15,541 images/sec | 94 images/sec/watt | 2.06 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
ResNet-50 | 29 | 17,253 images/sec | - | 6.75 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
ResNet-50 | 128 | 18,158 images/sec | 111 images/sec/watt | 28.26 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
ResNet-50v1.5 | 8 | 14,924 images/sec | 91 images/sec/watt | 2.15 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
ResNet-50v1.5 | 28 | 16,678 images/sec | - | 6.74 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
ResNet-50v1.5 | 128 | 17,511 images/sec | 106 images/sec/watt | 29.31 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
BERT-BASE | 8 | 5,715 sequences/sec | 35 sequences/sec/watt | 5.71 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
BERT-BASE | 128 | 6,004 sequences/sec | 37 sequences/sec/watt | 86.99 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
BERT-LARGE | 8 | 1,885 sequences/sec | 12 sequences/sec/watt | 17.06 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
BERT-LARGE | 128 | 2,090 sequences/sec | 13 sequences/sec/watt | 246.09 | 1x A30 | GIGABYTE G482-Z52-00 | 23.11-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30

Sequence length=128 for BERT-BASE and BERT-LARGE
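
Comparing the two A30 MIG tables gives a rough view of how throughput scales when all four instances are busy. The Python sketch below assumes that "1/4 MIG" means a single MIG instance running alone and "4 MIG" means four instances running concurrently; that reading comes from the table titles and is not stated explicitly. The helper name `mig_scaling_efficiency` is hypothetical.

```python
def mig_scaling_efficiency(single_instance: float, all_instances: float, instances: int = 4) -> float:
    """Fraction of ideal linear scaling reached when every MIG instance is active."""
    return all_instances / (instances * single_instance)


# (network, batch size, 1/4 MIG throughput, 4 MIG throughput), from the two A30 MIG tables
rows = [
    ("ResNet-50", 8, 4047, 15541),
    ("BERT-BASE", 8, 1571, 5715),
    ("BERT-LARGE", 128, 592, 2090),
]

for network, batch, single, combined in rows:
    efficiency = mig_scaling_efficiency(single, combined)
    print(f"{network} (batch {batch}): {combined / single:.2f}x, {efficiency:.0%} of linear")
```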

 

A10 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 8,877 images/sec | 59 images/sec/watt | 0.9 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10
ResNet-50 | 75 | 11,019 images/sec | - | 3.02 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10
ResNet-50 | 128 | 11,526 images/sec | 77 images/sec/watt | 11.11 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10
ResNet-50v1.5 | 8 | 8,469 images/sec | 57 images/sec/watt | 0.94 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10
ResNet-50v1.5 | 70 | 10,801 images/sec | - | 6.48 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10
ResNet-50v1.5 | 128 | 10,868 images/sec | 73 images/sec/watt | 11.78 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10
BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server page
BERT-BASE | 2 | For Batch Size 2, please refer to Triton Inference Server page
BERT-BASE | 8 | 3,227 sequences/sec | 22 sequences/sec/watt | 2.48 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10
BERT-BASE | 128 | 3,768 sequences/sec | 25 sequences/sec/watt | 33.97 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10
BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server page
BERT-LARGE | 2 | For Batch Size 2, please refer to Triton Inference Server page
BERT-LARGE | 8 | 1,120 sequences/sec | 7 sequences/sec/watt | 7.14 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10
BERT-LARGE | 128 | 1,267 sequences/sec | 9 sequences/sec/watt | 101.05 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10
EfficientNet-B0 | 8 | 9,496 images/sec | 64 images/sec/watt | 0.84 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10
EfficientNet-B0 | 128 | 14,315 images/sec | 96 images/sec/watt | 8.94 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10
EfficientNet-B4 | 8 | 1,592 images/sec | 11 images/sec/watt | 5.02 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10
EfficientNet-B4 | 128 | 1,853 images/sec | 12 images/sec/watt | 69.09 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10
HF Swin Base | 8 | 1,061 samples/sec | 7 samples/sec/watt | 7.54 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A10
HF Swin Base | 32 | 1,046 samples/sec | 7 samples/sec/watt | 30.61 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10
HF Swin Large | 8 | 554 samples/sec | 4 samples/sec/watt | 14.45 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10
HF Swin Large | 32 | 575 samples/sec | 4 samples/sec/watt | 55.68 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A10
HF ViT Base | 8 | 1,384 samples/sec | 9 samples/sec/watt | 5.78 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A10
HF ViT Base | 64 | 1,438 samples/sec | 10 samples/sec/watt | 44.51 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A10
HF ViT Large | 8 | 462 samples/sec | 3 samples/sec/watt | 17.32 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10
HF ViT Large | 64 | 446 samples/sec | 3 samples/sec/watt | 143.47 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | Synthetic | TensorRT 8.6.3 | A10
Megatron BERT Large QAT | 8 | 1,596 sequences/sec | 11 sequences/sec/watt | 5.01 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10
Megatron BERT Large QAT | 128 | 1,846 sequences/sec | 13 sequences/sec/watt | 69.36 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10
QuartzNet | 8 | 3,999 samples/sec | 27 samples/sec/watt | 2 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10
QuartzNet | 128 | 5,875 samples/sec | 39 samples/sec/watt | 21.79 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10
RetinaNet-RN34 | 8 | 503 images/sec | 3 images/sec/watt | 15.89 | 1x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | INT8 | Synthetic | TensorRT 8.6.3 | A10

BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

NVIDIA Performance with Triton Inference Server

H100 Triton Inference Server Performance

Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Number of Concurrent Client Requests | Latency (ms) | Throughput | Triton Container Version
BERT Base Inference | H100 SXM5-80GB | tensorrt | TensorRT | Mixed | 4 | 1 | 4 | 1.207 | 3,311 inf/sec | 24.02-py3
BERT Large Inference | H100 SXM5-80GB | tensorrt | PyTorch | Mixed | 4 | 1 | 16 | 14.784 | 1,082 inf/sec | 24.02-py3
BERT Large Inference | H100 SXM5-80GB | tensorrt | PyTorch | Mixed | 4 | 2 | 8 | 12.715 | 1,258 inf/sec | 24.02-py3
DLRM | H100 SXM5-80GB | ts-trace | PyTorch | Mixed | 1 | 1 | 32 | 0.94 | 34,027 inf/sec | 24.02-py3
DLRM | H100 SXM5-80GB | ts-trace | PyTorch | Mixed | 4 | 2 | 32 | 0.913 | 70,071 inf/sec | 24.02-py3
FastPitch Inference | H100 SXM5-80GB | ts-trace | PyTorch | Mixed | 2 | 1 | 512 | 119.531 | 4,281 inf/sec | 24.02-py3
FastPitch Inference | H100 SXM5-80GB | ts-trace | PyTorch | Mixed | 2 | 2 | 256 | 119.36 | 4,287 inf/sec | 24.02-py3
ResNet-50 v1.5 | H100 SXM5-80GB | tensorrt | PyTorch | Mixed | 4 | 1 | 16 | 1.977 | 8,090 inf/sec | 24.02-py3
ResNet-50 v1.5 | H100 SXM5-80GB | tensorrt | PyTorch | Mixed | 4 | 2 | 16 | 4.101 | 7,801 inf/sec | 24.02-py3
TFT Inference | H100 SXM5-80GB | ts-script | PyTorch | Mixed | 2 | 1 | 1024 | 33.027 | 30,996 inf/sec | 24.02-py3
TFT Inference | H100 SXM5-80GB | ts-script | PyTorch | Mixed | 2 | 2 | 512 | 25.522 | 40,114 inf/sec | 24.02-py3
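
The Triton figures above are consistent with a Little's-law style relationship: throughput is approximately client batch size times concurrent client requests, divided by latency. The Python sketch below checks three H100 rows under that assumption. The relationship is inferred from the published numbers rather than stated on the page, and `implied_throughput` is a hypothetical helper name.

```python
def implied_throughput(client_batch: int, concurrent_requests: int, latency_ms: float) -> float:
    """Inferences per second if every in-flight request completes in latency_ms."""
    return client_batch * concurrent_requests / (latency_ms / 1000.0)


# (network, client batch size, concurrent requests, latency in ms, published inf/sec)
rows = [
    ("BERT Large Inference", 1, 16, 14.784, 1082),
    ("ResNet-50 v1.5", 2, 16, 4.101, 7801),
    ("TFT Inference", 2, 512, 25.522, 40114),
]

for network, batch, concurrency, latency, published in rows:
    estimate = implied_throughput(batch, concurrency, latency)
    print(f"{network}: estimated {estimate:,.0f} inf/sec, published {published:,} inf/sec")
```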

H100 NVL Triton Inference Server Performance

Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Number of Concurrent Client Requests | Latency (ms) | Throughput | Triton Container Version
BERT Base Inference | NVIDIA H100 NVL | tensorrt | TensorRT | Mixed | 4 | 1 | 4 | 1.333 | 3,000 inf/sec | 24.02-py3
BERT Large Inference | NVIDIA H100 NVL | tensorrt | PyTorch | Mixed | 4 | 1 | 16 | 19.055 | 840 inf/sec | 24.02-py3
BERT Large Inference | NVIDIA H100 NVL | tensorrt | PyTorch | Mixed | 4 | 2 | 8 | 17.025 | 940 inf/sec | 24.02-py3
DLRM | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 2 | 1 | 32 | 0.804 | 39,745 inf/sec | 24.02-py3
DLRM | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 2 | 2 | 32 | 1.071 | 59,691 inf/sec | 24.02-py3
FastPitch Inference | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 2 | 1 | 512 | 151.079 | 3,386 inf/sec | 24.02-py3
FastPitch Inference | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 1 | 2 | 128 | 82.113 | 3,117 inf/sec | 24.02-py3
ResNet-50 v1.5 | NVIDIA H100 NVL | tensorrt | PyTorch | Mixed | 4 | 1 | 16 | 1.978 | 8,086 inf/sec | 24.02-py3
ResNet-50 v1.5 | NVIDIA H100 NVL | tensorrt | PyTorch | Mixed | 4 | 2 | 16 | 3.812 | 8,392 inf/sec | 24.02-py3
TFT Inference | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 2 | 1 | 512 | 16.846 | 30,387 inf/sec | 24.02-py3
TFT Inference | NVIDIA H100 NVL | ts-trace | PyTorch | Mixed | 2 | 2 | 256 | 13.848 | 36,966 inf/sec | 24.02-py3

L40S Triton Inference Server Performance

Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Number of Concurrent Client Requests | Latency (ms) | Throughput | Triton Container Version
BERT Base Inference | NVIDIA L40S | tensorrt | TensorRT | Mixed | 4 | 1 | 4 | 1.396 | 2,863 inf/sec | 24.02-py3
BERT Large Inference | NVIDIA L40S | tensorrt | PyTorch | Mixed | 2 | 1 | 8 | 15.677 | 510 inf/sec | 24.02-py3
BERT Large Inference | NVIDIA L40S | tensorrt | PyTorch | Mixed | 2 | 2 | 4 | 15.077 | 531 inf/sec | 24.02-py3
DLRM | NVIDIA L40S | ts-trace | PyTorch | Mixed | 1 | 1 | 64 | 1.545 | 41,403 inf/sec | 24.02-py3
DLRM | NVIDIA L40S | ts-trace | PyTorch | Mixed | 1 | 2 | 32 | 0.929 | 68,867 inf/sec | 24.02-py3
FastPitch Inference | NVIDIA L40S | ts-trace | PyTorch | Mixed | 2 | 1 | 64 | 64.515 | 2,413 inf/sec | 24.02-py3
FastPitch Inference | NVIDIA L40S | ts-trace | PyTorch | Mixed | 2 | 2 | 32 | 26.384 | 2,425 inf/sec | 24.02-py3
ResNet-50 v1.5 | NVIDIA L40S | tensorrt | PyTorch | Mixed | 4 | 1 | 16 | 2.017 | 7,928 inf/sec | 24.02-py3
ResNet-50 v1.5 | NVIDIA L40S | tensorrt | PyTorch | Mixed | 4 | 2 | 16 | 3.964 | 8,070 inf/sec | 24.02-py3
TFT Inference | NVIDIA L40S | ts-trace | PyTorch | Mixed | 2 | 1 | 256 | 10.122 | 25,288 inf/sec | 24.02-py3
TFT Inference | NVIDIA L40S | ts-trace | PyTorch | Mixed | 2 | 2 | 64 | 4.988 | 25,658 inf/sec | 24.02-py3

Inference Performance of NVIDIA GPUs in the Cloud

A100 Inference Performance in the Cloud

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 8 | 13,768 images/sec | - | 0.58 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB
ResNet-50v1.5 | 128 | 30,338 images/sec | - | 4.22 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB
BERT-LARGE | 8 | 2,308 sequences/sec | - | 3.47 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB
BERT-LARGE | 128 | 4,045 sequences/sec | - | 31.64 | 1x A100 | GCP A2-HIGHGPU-1G | 23.10-py3 | INT8 | Synthetic | - | A100-SXM4-40GB

BERT-Large: Sequence Length = 128

Training to Convergence

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology to test whether AI systems are ready to be deployed in the field to deliver meaningful results.

AI Pipeline

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.
