AI Inference
Real-world AI inference demands high throughput and low latency, with maximum efficiency across use cases. NVIDIA’s industry-leading inference solutions let customers quickly deploy AI models into production with the highest performance, from data center to edge.
MLPerf Inference v3.1 Performance Benchmarks
Offline Scenario, Closed Division
Network | Throughput | GPU | Server | GPU Version | Target Accuracy | Dataset |
---|---|---|---|---|---|---|
ResNet-50 | 707,537 samples/sec | 8x H100 | AS-8125GS-TNHR | H100-SXM-80GB | 76.46% Top1 | ImageNet (224x224) |
ResNet-50 | 93,198 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 76.46% Top1 | ImageNet (224x224) |
ResNet-50 | 12,882 samples/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 76.46% Top1 | ImageNet (224x224) |
RetinaNet | 14,196 samples/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | 0.3755 mAP | OpenImages (800x800) |
RetinaNet | 1,849 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 0.3755 mAP | OpenImages (800x800) |
RetinaNet | 226 samples/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 0.3755 mAP | OpenImages (800x800) |
BERT | 71,213 samples/sec | 8x H100 | GIGABYTE G593-SD0 | H100-SXM-80GB | 90.87% f1 | SQuAD v1.1 |
BERT | 10,163 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 90.87% f1 | SQuAD v1.1 |
BERT | 1,029 samples/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 90.87% f1 | SQuAD v1.1 |
GPT-J | 107 samples/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN Dailymail |
GPT-J | 13 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN Dailymail |
GPT-J | 1 sample/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN Dailymail |
DLRMv2 | 344,370 samples/sec | 8x H100 | Dell PowerEdge XE9680 | H100-SXM-80GB | 80.31% AUC | Synthetic Multihot Criteo Dataset |
DLRMv2 | 49,002 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 80.31% AUC | Synthetic Multihot Criteo Dataset |
DLRMv2 | 3,673 samples/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 80.31% AUC | Synthetic Multihot Criteo Dataset |
3D-UNET | 52 samples/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | 0.863 DICE mean | KiTS 2019 |
3D-UNET | 7 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 0.863 DICE mean | KiTS 2019 |
3D-UNET | 1 sample/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 0.863 DICE mean | KiTS 2019 |
RNN-T | 187,469 samples/sec | 8x H100 | Dell PowerEdge XE9680 | H100-SXM-80GB | 7.45% WER | Librispeech dev-clean |
RNN-T | 25,975 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 7.45% WER | Librispeech dev-clean |
RNN-T | 3,899 samples/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 7.45% WER | Librispeech dev-clean |
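The Offline rows mix whole-system submissions (8x H100) with single-accelerator ones. A rough, unofficial way to put them on a common footing is per-accelerator throughput — note this is not an official MLPerf metric, just the table figures divided by accelerator count:

```python
# Per-accelerator throughput from the Offline ResNet-50 rows above.
# (accelerator count, whole-system samples/sec), transcribed from the table.
submissions = {
    "8x H100":  (8, 707_537),
    "1x GH200": (1, 93_198),
    "1x L4":    (1, 12_882),
}

per_gpu = {name: total / n for name, (n, total) in submissions.items()}
for name, rate in per_gpu.items():
    print(f"{name}: {rate:,.0f} samples/sec per accelerator")
```

On this reading, a single GH200 slightly exceeds one H100's share of the 8-GPU system, which is expected given its higher memory bandwidth — though system-level effects mean the division is only indicative.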
Server Scenario - Closed Division
Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset |
---|---|---|---|---|---|---|---|
ResNet-50 | 620,874 queries/sec | 8x H100 | Dell PowerEdge XE9680 | H100-SXM-80GB | 76.46% Top1 | 15 ms | ImageNet (224x224) |
ResNet-50 | 77,018 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 76.46% Top1 | 15 ms | ImageNet (224x224) |
ResNet-50 | 12,204 queries/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 76.46% Top1 | 15 ms | ImageNet (224x224) |
RetinaNet | 13,021 queries/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | 0.3755 mAP | 100 ms | OpenImages (800x800) |
RetinaNet | 1,731 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 0.3755 mAP | 100 ms | OpenImages (800x800) |
RetinaNet | 200 queries/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 0.3755 mAP | 100 ms | OpenImages (800x800) |
BERT | 57,331 queries/sec | 8x H100 | Dell PowerEdge XE9680 | H100-SXM-80GB | 90.87% f1 | 130 ms | SQuAD v1.1 |
BERT | 7,704 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 90.87% f1 | 130 ms | SQuAD v1.1 |
BERT | 899 queries/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 90.87% f1 | 130 ms | SQuAD v1.1 |
GPT-J | 86 queries/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20,000 ms | CNN Dailymail |
GPT-J | 11 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20,000 ms | CNN Dailymail |
GPT-J | 1 query/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20,000 ms | CNN Dailymail |
DLRMv2 | 327,051 queries/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset |
DLRMv2 | 48,517 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset |
DLRMv2 | 3,305 queries/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset |
RNN-T | 180,017 queries/sec | 8x H100 | GIGABYTE G593-SD0 | H100-SXM-80GB | 7.45% WER | 1000 ms | Librispeech dev-clean |
RNN-T | 24,008 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 7.45% WER | 1000 ms | Librispeech dev-clean |
RNN-T | 3,755 queries/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 7.45% WER | 1000 ms | Librispeech dev-clean |
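In the Server scenario, the reported queries/sec is the highest sustained rate at which a tail percentile of per-query latencies still meets the benchmark's constraint (the 99th percentile for most MLPerf Inference benchmarks). A minimal sketch of that acceptance check, with made-up latency samples:

```python
import math

def meets_constraint(latencies_ms, bound_ms, percentile=0.99):
    """True if the given tail percentile of latencies is within the bound."""
    ordered = sorted(latencies_ms)
    idx = math.ceil(percentile * len(ordered)) - 1  # index of the tail percentile
    return ordered[idx] <= bound_ms

# Hypothetical measurements: 985 fast queries plus a 14 ms tail,
# so the 99th-percentile latency is 14 ms.
samples = [5.0] * 985 + [14.0] * 15
print(meets_constraint(samples, 15.0))   # within the 15 ms ResNet-50 bound
print(meets_constraint(samples, 10.0))   # would fail a 10 ms bound
```

This is why Server throughput is always below Offline throughput for the same system: the load generator must back off until the latency tail fits under the bound.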
Power Efficiency Offline Scenario - Closed Division
Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset |
---|---|---|---|---|---|---|
ResNet-50 | 474,849 samples/sec | 117 samples/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | ImageNet (224x224) |
RetinaNet | 10,114 samples/sec | 2 samples/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | OpenImages (800x800) |
BERT | 54,050 samples/sec | 11 samples/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | SQuAD v1.1 |
GPT-J | 65 samples/sec | 0.017 samples/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | CNN Dailymail |
DLRMv2 | 273,527 samples/sec | 49 samples/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | Synthetic Multihot Criteo Dataset |
3D-UNET | 38 samples/sec | 0.009 samples/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | KiTS 2019 |
RNN-T | 125,479 samples/sec | 30 samples/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | Librispeech dev-clean |
Power Efficiency Server Scenario - Closed Division
Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset |
---|---|---|---|---|---|---|
ResNet-50 | 400,094 queries/sec | 97 queries/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | ImageNet (224x224) |
RetinaNet | 8,802 queries/sec | 2 queries/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | OpenImages (800x800) |
BERT | 42,416 queries/sec | 8 queries/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | SQuAD v1.1 |
GPT-J | 49 queries/sec | 0.013 queries/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | CNN Dailymail |
DLRMv2 | 244,023 queries/sec | 42 queries/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | Synthetic Multihot Criteo Dataset |
RNN-T | 112,015 queries/sec | 25 queries/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | Librispeech dev-clean |
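Because the power-efficiency tables report both throughput and throughput per watt, dividing one by the other recovers the approximate average system power behind each row. These figures are only approximate, since the published per-watt values are rounded:

```python
# (throughput, throughput per watt) transcribed from the offline
# power-efficiency table above; implied power = throughput / efficiency.
rows = {
    "ResNet-50": (474_849, 117),
    "BERT":      (54_050, 11),
    "RNN-T":     (125_479, 30),
}

implied_watts = {net: tput / eff for net, (tput, eff) in rows.items()}
for net, watts in implied_watts.items():
    print(f"{net}: ~{watts:,.0f} W average system power")
```

All three land in the same rough band, consistent with one 8x H100 DGX system running near its power limit across workloads.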
MLPerf™ v3.1 Inference Closed: ResNet-50 v1.5, RetinaNet, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net 99.9% of FP32 accuracy target, GPT-J 99% of FP32 accuracy target, DLRM 99% of FP32 accuracy target: 3.1-0069, 3.1-0077, 3.1-0106, 3.1-0110, 3.1-0132, 3.1-0135, 3.1-0109. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
BERT-Large sequence length = 384.
LLM Inference Performance of NVIDIA Data Center Products
H200 Inference Performance
Model | Batch Size | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
LLaMA 13B | 1024 | 1 | 128 | 128 | 11,819 output tokens/sec | 1x H200 | HGX H200 | FP8 | TRT-LLM 0.5.0, TensorRT 9.1.0.4 | NVIDIA H200 |
LLaMA 13B | 128 | 1 | 128 | 2048 | 4,750 output tokens/sec | 1x H200 | HGX H200 | FP8 | TRT-LLM 0.5.0, TensorRT 9.1.0.4 | NVIDIA H200 |
LLaMA 13B | 64 | 1 | 2048 | 128 | 1,349 output tokens/sec | 1x H200 | HGX H200 | FP8 | TRT-LLM 0.5.0, TensorRT 9.1.0.4 | NVIDIA H200 |
LLaMA 70B | 512 | 4 | 128 | 2048 | 6,616 output tokens/sec | 4x H200 | HGX H200 | FP8 | TRT-LLM 0.5.0, TensorRT 9.1.0.4 | NVIDIA H200 |
LLaMA 70B | 512 | 1 | 128 | 128 | 3,014 output tokens/sec | 1x H200 | HGX H200 | FP8 | TRT-LLM 0.5.0, TensorRT 9.1.0.4 | NVIDIA H200 |
LLaMA 70B | 64 | 2 | 2048 | 128 | 682 output tokens/sec | 2x H200 | HGX H200 | FP8 | TRT-LLM 0.5.0, TensorRT 9.1.0.4 | NVIDIA H200 |
TP: Tensor Parallelism
Batch size per GPU
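Two derived quantities help interpret these rows, assuming the Throughput column is the aggregate output-token rate across the whole batch (the table does not state this explicitly, so treat both as illustrative). Using the first LLaMA 13B row:

```python
# First LLaMA 13B row above: batch 1024, output length 128,
# 11,819 output tokens/sec (assumed aggregate across the batch).
batch, output_len, tokens_per_sec = 1024, 128, 11_819

# Time to emit every output token for the whole batch.
batch_time_s = batch * output_len / tokens_per_sec
# Average generation rate seen by any single request in the batch.
per_request_rate = tokens_per_sec / batch

print(f"{batch_time_s:.1f} s per batch, {per_request_rate:.1f} tokens/sec per request")
```

Large batches maximize aggregate tokens/sec but dilute per-request speed — the usual throughput/latency trade-off in LLM serving.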
H100 Inference Performance - High Throughput
Model | Batch Size | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
GPT-J 6B | 64 | 1 | 128 | 128 | 10,907 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
GPT-J 6B | 64 | 1 | 128 | 2048 | 6,179 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
GPT-J 6B | 64 | 1 | 2048 | 128 | 2,229 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
GPT-J 6B | 64 | 1 | 2048 | 2048 | 2,980 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
LLaMA 7B | 64 | 1 | 128 | 128 | 9,193 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
LLaMA 7B | 64 | 1 | 128 | 2048 | 5,367 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
LLaMA 7B | 64 | 1 | 2048 | 128 | 2,058 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
LLaMA 7B | 32 | 1 | 2048 | 2048 | 2,230 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
LLaMA 70B | 64 | 4 | 128 | 128 | 3,317 output tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
LLaMA 70B | 64 | 4 | 128 | 2048 | 2,616 output tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
LLaMA 70B | 64 | 4 | 2048 | 128 | 843 output tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
LLaMA 70B | 64 | 4 | 2048 | 2048 | 1,583 output tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
Falcon 180B | 96 | 8 | 128 | 128 | 2,686 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
Falcon 180B | 96 | 8 | 128 | 2048 | 2,073 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
Falcon 180B | 64 | 8 | 2048 | 128 | 465 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
TP: Tensor Parallelism
Batch size per GPU
L40S Inference Performance - High Throughput
Model | Batch Size | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
GPT-J 6B | 64 | 1 | 128 | 128 | 3,630 output tokens/sec | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S |
GPT-J 6B | 64 | 1 | 128 | 2048 | 1,859 output tokens/sec | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S |
GPT-J 6B | 32 | 1 | 2048 | 128 | 616 output tokens/sec | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S |
GPT-J 6B | 32 | 1 | 2048 | 2048 | 757 output tokens/sec | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S |
LLaMA 7B | 64 | 1 | 128 | 128 | 3,240 output tokens/sec | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S |
LLaMA 7B | 64 | 1 | 128 | 2048 | 1,622 output tokens/sec | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S |
LLaMA 7B | 32 | 1 | 2048 | 128 | 581 output tokens/sec | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S |
LLaMA 7B | 16 | 1 | 2048 | 2048 | 531 output tokens/sec | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S |
TP: Tensor Parallelism
Batch size per GPU
H100 Inference Performance - Low Latency
Model | Batch Size | TP | Input Length | 1st Token Latency | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|
GPT-J 6B | 1 | 1 | 128 | 7 ms | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
GPT-J 6B | 1 | 1 | 2048 | 29 ms | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
LLaMA 7B | 1 | 1 | 128 | 7 ms | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
LLaMA 7B | 1 | 1 | 2048 | 36 ms | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
LLaMA 70B | 1 | 4 | 128 | 26 ms | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
LLaMA 70B | 1 | 4 | 2048 | 109 ms | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
Falcon 180B | 1 | 8 | 128 | 27 ms | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
Falcon 180B | 1 | 8 | 2048 | 205 ms | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB |
TP: Tensor Parallelism
Batch size per GPU
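First-token latency is dominated by the prefill pass over the input prompt. Dividing input tokens by first-token latency gives an implied prefill rate; the LLaMA 7B rows above suggest longer prompts amortize fixed per-request overheads better:

```python
# (input tokens, first-token latency in seconds) from the LLaMA 7B rows above.
rows = [(128, 0.007), (2048, 0.036)]

prefill_rates = {tokens: tokens / latency for tokens, latency in rows}
for tokens, rate in prefill_rates.items():
    print(f"{tokens}-token prompt: ~{rate:,.0f} tokens/sec implied prefill rate")
```

The millisecond-granularity latencies make these rates coarse, but the trend (higher effective prefill throughput at longer input lengths) is visible.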
L40S Inference Performance - Low Latency
Model | Batch Size | TP | Input Length | 1st Token Latency | GPU | Server | Precision | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|
GPT-J 6B | 1 | 1 | 128 | 12 ms | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S |
GPT-J 6B | 1 | 1 | 2048 | 71 ms | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S |
LLaMA 7B | 1 | 1 | 128 | 14 ms | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S |
LLaMA 7B | 1 | 1 | 2048 | 73 ms | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S |
TP: Tensor Parallelism
Batch size per GPU
Inference Performance of NVIDIA Data Center Products
H100 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
Stable Diffusion v2.1 (512x512) | 1 | 2.1 images/sec | - | 475.28 | 1x H100 | DGX H100 | - | Mixed | LAION-5B | TensorRT 8.6.0 | H100-SXM5-80GB |
Stable Diffusion v2.1 (512x512) | 4 | 3.21 images/sec | - | 1244.73 | 1x H100 | DGX H100 | - | Mixed | LAION-5B | TensorRT 8.6.0 | H100-SXM5-80GB |
ResNet-50 | 8 | 20,687 images/sec | 73 images/sec/watt | 0.39 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
ResNet-50 | 128 | 60,124 images/sec | 110 images/sec/watt | 2.13 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
ResNet-50 | 496 | 70,998 images/sec | - images/sec/watt | 6.99 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
ResNet-50v1.5 | 8 | 20,119 images/sec | 65 images/sec/watt | 0.4 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
ResNet-50v1.5 | 128 | 58,062 images/sec | 101 images/sec/watt | 2.2 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
ResNet-50v1.5 | 473 | 68,145 images/sec | - images/sec/watt | 6.99 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
BERT-BASE | 8 | 9,103 sequences/sec | 21 sequences/sec/watt | 0.88 | 1x H100 | DGX H100 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
BERT-BASE | 128 | 24,828 sequences/sec | 36 sequences/sec/watt | 5.16 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
BERT-LARGE | 8 | 3,948 sequences/sec | 9 sequences/sec/watt | 2.03 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
BERT-LARGE | 128 | 8,313 sequences/sec | 12 sequences/sec/watt | 15.4 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
EfficientNet-B0 | 8 | 15,945 images/sec | 65 images/sec/watt | 0.5 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
EfficientNet-B0 | 128 | 54,823 images/sec | 118 images/sec/watt | 2.33 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
EfficientNet-B0 | 467 | 66,972 images/sec | - images/sec/watt | 6.99 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
EfficientNet-B4 | 8 | 4,438 images/sec | 14 images/sec/watt | 1.8 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
EfficientNet-B4 | 53 | 7,686 images/sec | - images/sec/watt | 7.03 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
EfficientNet-B4 | 128 | 8,479 images/sec | 15 images/sec/watt | 15.1 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
HF Swin Base | 8 | 3,762 samples/sec | 8 samples/sec/watt | 2.13 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
HF Swin Base | 32 | 5,628 samples/sec | 9 samples/sec/watt | 5.69 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
HF Swin Large | 8 | 2,517 samples/sec | 5 samples/sec/watt | 3.18 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
HF Swin Large | 32 | 3,409 samples/sec | 5 samples/sec/watt | 9.39 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
HF ViT Base | 8 | 6,717 samples/sec | 12 samples/sec/watt | 1.19 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
HF ViT Base | 64 | 10,222 samples/sec | 15 samples/sec/watt | 6.26 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
HF ViT Large | 8 | 2,722 samples/sec | 4 samples/sec/watt | 2.94 | 1x H100 | DGX H100 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
HF ViT Large | 64 | 3,388 samples/sec | 5 samples/sec/watt | 18.89 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
Megatron BERT Large QAT | 8 | 4,839 sequences/sec | 12 sequences/sec/watt | 1.65 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
Megatron BERT Large QAT | 128 | 12,230 sequences/sec | 18 sequences/sec/watt | 10.47 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
QuartzNet | 8 | 6,339 samples/sec | 22 samples/sec/watt | 1.26 | 1x H100 | DGX H100 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
QuartzNet | 128 | 32,670 samples/sec | 88 samples/sec/watt | 3.92 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100 SXM5-80GB |
512x512 image size, 50 denoising steps for Stable Diffusion
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
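The Latency column in these TensorRT tables is per batch, so batch size divided by latency should roughly reproduce the Throughput column. A quick sanity check on two ResNet-50 rows from the H100 table above:

```python
# (batch size, per-batch latency in seconds, reported images/sec)
# from the H100 ResNet-50 rows above.
rows = [
    (8, 0.39e-3, 20_687),
    (128, 2.13e-3, 60_124),
]

derived_rates = [batch / latency for batch, latency, _ in rows]
for (batch, _, reported), derived in zip(rows, derived_rates):
    print(f"batch {batch}: derived {derived:,.0f} vs reported {reported:,} images/sec")
```

The derived and reported numbers agree to within about 1%, with the small gap attributable to the rounded latency values.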
L40S Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 23,574 images/sec | 80 images/sec/watt | 0.34 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
ResNet-50 | 32 | 39,117 images/sec | 118 images/sec/watt | 0.82 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
ResNet-50v1.5 | 8 | 22,947 images/sec | 77 images/sec/watt | 0.35 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
ResNet-50v1.5 | 32 | 37,073 images/sec | 110 images/sec/watt | 0.86 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
BERT-BASE | 8 | 8,285 sequences/sec | 28 sequences/sec/watt | 0.97 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
BERT-BASE | 128 | 13,036 sequences/sec | 38 sequences/sec/watt | 9.82 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
BERT-LARGE | 8 | 3,184 sequences/sec | 10 sequences/sec/watt | 2.51 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
BERT-LARGE | 24 | 4,214 sequences/sec | 13 sequences/sec/watt | 5.52 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
EfficientDet-D0 | 2 | 2,182 images/sec | 13 images/sec/watt | 0.92 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
EfficientDet-D0 | 8 | 4,505 images/sec | 17 images/sec/watt | 1.78 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
EfficientNet-B0 | 8 | 20,092 images/sec | 103 images/sec/watt | 0.4 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
EfficientNet-B0 | 32 | 40,149 images/sec | 140 images/sec/watt | 0.8 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
EfficientNet-B4 | 8 | 5,022 images/sec | 18 images/sec/watt | 1.59 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
EfficientNet-B4 | 16 | 5,902 images/sec | 18 images/sec/watt | 2.71 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
HF Swin Base | 8 | 3,166 samples/sec | 9 samples/sec/watt | 2.53 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
HF Swin Base | 16 | 3,605 samples/sec | 11 samples/sec/watt | 4.44 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
HF Swin Large | 8 | 1,615 samples/sec | 5 samples/sec/watt | 4.95 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
HF Swin Large | 16 | 1,756 samples/sec | 5 samples/sec/watt | 9.11 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
HF ViT Base | 12 | 3,981 samples/sec | 13 samples/sec/watt | 3.01 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
HF ViT Large | 8 | 1,368 samples/sec | 4 samples/sec/watt | 5.85 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
Megatron BERT Large QAT | 8 | 3,848 sequences/sec | 11 sequences/sec/watt | 2.08 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
Megatron BERT Large QAT | 24 | 4,874 sequences/sec | 14 sequences/sec/watt | 4.92 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
QuartzNet | 8 | 7,535 samples/sec | 32 samples/sec/watt | 1.06 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
QuartzNet | 128 | 22,297 samples/sec | 66 samples/sec/watt | 5.74 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S |
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
L40 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
Stable Diffusion v2.1 (1,024x1,024) | 1 | 0.2 images/sec | - | 5072.49 | 1x L40 | GIGABYTE G482-Z54-00 | - | Mixed | LAION-5B | TensorRT 8.5.2 | L40 |
ResNet-50 | 8 | 18,572 images/sec | 71 images/sec/watt | 0.43 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40 |
ResNet-50 | 32 | 28,637 images/sec | 96 images/sec/watt | 1.12 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40 |
ResNet-50v1.5 | 8 | 18,025 images/sec | 67 images/sec/watt | 0.44 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40 |
ResNet-50v1.5 | 32 | 27,061 images/sec | 90 images/sec/watt | 1.18 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40 |
BERT-BASE | 128 | 7,629 sequences/sec | 26 sequences/sec/watt | 16.78 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40 |
BERT-LARGE | 8 | 2,448 sequences/sec | 8 sequences/sec/watt | 3.27 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40 |
BERT-LARGE | 12 | 2,580 sequences/sec | 9 sequences/sec/watt | 3.27 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40 |
BERT-LARGE | 24 | 2,725 sequences/sec | 10 sequences/sec/watt | 8.81 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40 |
EfficientNet-B0 | 128 | 38,862 images/sec | 130 images/sec/watt | 3.29 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40 |
EfficientNet-B4 | 8 | 4,707 images/sec | 16 images/sec/watt | 1.7 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40 |
HF Swin Base | 8 | 2,354 samples/sec | 8 samples/sec/watt | 3.4 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40 |
HF Swin Base | 32 | 2,370 samples/sec | 8 samples/sec/watt | 13.5 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40 |
HF Swin Large | 8 | 1,173 samples/sec | 4 samples/sec/watt | 6.82 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | L40 |
HF ViT Base | 8 | 2,647 samples/sec | 9 samples/sec/watt | 3.02 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | L40 |
HF ViT Base | 64 | 2,653 samples/sec | 9 samples/sec/watt | 24.12 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | L40 |
HF ViT Large | 8 | 829 samples/sec | 3 samples/sec/watt | 9.64 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | L40 |
Megatron BERT Large QAT | 8 | 3,594 sequences/sec | 13 sequences/sec/watt | 2.23 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40 |
Megatron BERT Large QAT | 128 | 4,335 sequences/sec | 14 sequences/sec/watt | 29.52 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40 |
QuartzNet | 8 | 7,208 samples/sec | 31 samples/sec/watt | 1.11 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40 |
QuartzNet | 128 | 17,638 samples/sec | 59 samples/sec/watt | 7.26 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40 |
1,024x1,024 image size, 50 denoising steps for Stable Diffusion
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
L4 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
Stable Diffusion v2.1 (512x512) | 1 | 0.47 images/sec | - | 2113.07 | 1x L4 | GIGABYTE G482-Z54-00 | - | Mixed | LAION-5B | TensorRT 8.6.0 | L4 |
ResNet-50 | 8 | 10,172 images/sec | 143 images/sec/watt | 0.79 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4 |
ResNet-50 | 32 | 10,413 images/sec | 144 images/sec/watt | 3.07 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4 |
ResNet-50v1.5 | 8 | 9,663 images/sec | 134 images/sec/watt | 0.83 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4 |
ResNet-50v1.5 | 32 | 10,146 images/sec | 141 images/sec/watt | 3.15 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4 |
BERT-BASE | 8 | 3,554 sequences/sec | 50 sequences/sec/watt | 2.25 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4 |
BERT-BASE | 24 | 4,064 sequences/sec | 56 sequences/sec/watt | 5.91 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4 |
BERT-LARGE | 8 | 1,097 sequences/sec | 15 sequences/sec/watt | 7.29 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4 |
BERT-LARGE | 12 | 1,293 sequences/sec | 18 sequences/sec/watt | 9.28 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4 |
EfficientNet-B4 | 8 | 1,817 images/sec | 25 images/sec/watt | 4.4 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4 |
HF Swin Base | 8 | 1,052 samples/sec | 15 samples/sec/watt | 7.6 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4 |
HF Swin Large | 8 | 524 samples/sec | 7 samples/sec/watt | 15.28 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4 |
HF ViT Base | 8 | 1,304 samples/sec | 18 samples/sec/watt | 6.14 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4 |
HF ViT Large | 8 | 396 samples/sec | 5 samples/sec/watt | 20.21 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | L4 |
Megatron BERT Large QAT | 8 | 1,516 sequences/sec | 22 sequences/sec/watt | 5.28 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4 |
QuartzNet | 8 | 4,523 samples/sec | 63 samples/sec/watt | 1.77 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4 |
QuartzNet | 128 | 6,019 samples/sec | 84 samples/sec/watt | 21.26 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4 |
512x512 image size, 50 denoising steps for Stable Diffusion
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
A40 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 11,504 images/sec | 40 images/sec/watt | 0.7 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 |
ResNet-50 | 109 | 15,968 images/sec | - images/sec/watt | 6.89 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 |
ResNet-50 | 128 | 15,932 images/sec | 53 images/sec/watt | 8.03 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 |
ResNet-50v1.5 | 8 | 11,011 images/sec | 38 images/sec/watt | 0.73 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 |
ResNet-50v1.5 | 106 | 15,458 images/sec | - images/sec/watt | 6.92 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 |
ResNet-50v1.5 | 128 | 15,294 images/sec | 51 images/sec/watt | 8.37 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 |
BERT-BASE | 8 | 4,355 sequences/sec | 15 sequences/sec/watt | 1.84 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 |
BERT-BASE | 128 | 5,622 sequences/sec | 19 sequences/sec/watt | 22.77 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 |
BERT-LARGE | 8 | 1,572 sequences/sec | 5 sequences/sec/watt | 5.09 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 |
BERT-LARGE | 128 | 1,937 sequences/sec | 6 sequences/sec/watt | 66.08 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 |
EfficientNet-B0 | 8 | 10,848 images/sec | 56 images/sec/watt | 0.74 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 |
EfficientNet-B0 | 128 | 19,747 images/sec | 66 images/sec/watt | 6.48 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 |
EfficientNet-B0 | 136 | 19,875 images/sec | - images/sec/watt | 6.89 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 |
EfficientNet-B4 | 8 | 2,129 images/sec | 7 images/sec/watt | 3.76 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 |
EfficientNet-B4 | 15 | 2,314 images/sec | - images/sec/watt | 6.91 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 |
EfficientNet-B4 | 128 | 2,624 images/sec | 9 images/sec/watt | 48.78 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 |
HF Swin Base | 8 | 1,410 samples/sec | 5 samples/sec/watt | 5.67 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | A40 |
HF Swin Base | 32 | 1,425 samples/sec | 5 samples/sec/watt | 22.46 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | A40 |
HF Swin Large | 8 | 802 samples/sec | 3 samples/sec/watt | 9.97 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 |
HF Swin Large | 32 | 819 samples/sec | 3 samples/sec/watt | 39.07 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 |
HF ViT Base | 8 | 2,129 samples/sec | 7 samples/sec/watt | 3.76 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | A40 |
64 | 2,152 samples/sec | 7 samples/sec/watt | 29.74 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 | |
HF ViT Large | 8 | 680 samples/sec | 2 samples/sec/watt | 11.77 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | A40 |
64 | 702 samples/sec | 2 samples/sec/watt | 91.12 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 | |
Megatron BERT Large QAT | 8 | 2,094 sequences/sec | 7 sequences/sec/watt | 3.82 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 |
128 | 2,655 sequences/sec | 9 sequences/sec/watt | 48.21 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 | |
QuartzNet | 8 | 4,383 samples/sec | 18 samples/sec/watt | 1.83 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 |
128 | 8,340 samples/sec | 28 samples/sec/watt | 15.35 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40 |
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
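For the large-batch rows in these tables, the reported latency is consistent with the steady-state relationship latency ≈ batch size / throughput. A minimal sketch (the function name is my own, not from the benchmark suite):

```python
def batch_latency_ms(batch_size: int, throughput_per_sec: float) -> float:
    """Approximate per-batch latency implied by a steady-state throughput.

    For the large-batch rows above, latency and throughput are roughly
    related by latency ≈ batch_size / throughput.
    """
    return batch_size / throughput_per_sec * 1000.0

# A40 ResNet-50, batch 128 from the table above: 15,932 images/sec, 8.03 ms
print(round(batch_latency_ms(128, 15_932), 2))  # ≈ 8.03
```

The small-batch rows deviate from this relationship because at batch 8 the GPU is not saturated and per-launch overheads dominate.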
A30 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 10,394 images/sec | 74 images/sec/watt | 0.77 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
116 | 17,040 images/sec | - images/sec/watt | 6.87 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
128 | 17,303 images/sec | 105 images/sec/watt | 7.4 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
ResNet-50v1.5 | 8 | 10,144 images/sec | 71 images/sec/watt | 0.79 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
111 | 16,294 images/sec | - images/sec/watt | 6.87 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
128 | 16,680 images/sec | 101 images/sec/watt | 7.67 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
8 | 4,309 sequences/sec | 26 sequences/sec/watt | 1.86 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
128 | 5,763 sequences/sec | 35 sequences/sec/watt | 22.21 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
8 | 1,494 sequences/sec | 9 sequences/sec/watt | 5.36 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
128 | 2,057 sequences/sec | 13 sequences/sec/watt | 62.23 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
EfficientNet-B0 | 8 | 8,889 images/sec | 78 images/sec/watt | 0.9 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
116 | 16,979 images/sec | - images/sec/watt | 6.89 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
128 | 17,101 images/sec | 104 images/sec/watt | 7.48 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
EfficientNet-B4 | 8 | 1,871 images/sec | 12 images/sec/watt | 4.28 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
14 | 2,099 images/sec | - images/sec/watt | 7.15 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
128 | 2,389 images/sec | 15 images/sec/watt | 53.58 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
HF Swin Base | 8 | 1,339 samples/sec | 8 samples/sec/watt | 5.98 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
32 | 1,462 samples/sec | 9 samples/sec/watt | 21.89 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
HF Swin Large | 8 | 762 samples/sec | 5 samples/sec/watt | 10.5 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
32 | 786 samples/sec | 5 samples/sec/watt | 40.73 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | A30 | |
HF ViT Base | 8 | 2,016 samples/sec | 12 samples/sec/watt | 3.97 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | A30 |
64 | 2,177 samples/sec | 13 samples/sec/watt | 29.4 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
HF ViT Large | 8 | 645 samples/sec | 4 samples/sec/watt | 12.4 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
64 | 692 samples/sec | 4 samples/sec/watt | 92.42 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | A30 | |
Megatron BERT Large QAT | 8 | 1,805 sequences/sec | 13 sequences/sec/watt | 4.43 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
128 | 2,753 sequences/sec | 17 sequences/sec/watt | 46.49 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
QuartzNet | 8 | 3,409 samples/sec | 30 samples/sec/watt | 2.35 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
128 | 9,875 samples/sec | 72 samples/sec/watt | 12.96 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
A30 1/4 MIG Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 4,020 images/sec | 47 images/sec/watt | 1.99 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
32 | 4,634 images/sec | - images/sec/watt | 7.12 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
128 | 4,803 images/sec | 53 images/sec/watt | 26.65 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
ResNet-50v1.5 | 8 | 3,877 images/sec | 47 images/sec/watt | 2.06 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
31 | 4,487 images/sec | - images/sec/watt | 7.13 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
128 | 4,624 images/sec | 53 images/sec/watt | 27.68 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
BERT-BASE | 8 | 1,580 sequences/sec | 17 sequences/sec/watt | 5.06 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
128 | 1,711 sequences/sec | 18 sequences/sec/watt | 74.83 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
BERT-LARGE | 8 | 519 sequences/sec | 6 sequences/sec/watt | 15.4 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
128 | 593 sequences/sec | 6 sequences/sec/watt | 215.95 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
Sequence length=128 for BERT-BASE and BERT-LARGE
A30 4 MIG Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 15,416 images/sec | 94 images/sec/watt | 2.08 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
29 | 17,341 images/sec | - images/sec/watt | 6.94 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
128 | 18,166 images/sec | 111 images/sec/watt | 28.26 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
ResNet-50v1.5 | 8 | 14,883 images/sec | 90 images/sec/watt | 2.16 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
28 | 16,630 images/sec | - images/sec/watt | 7 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
128 | 17,528 images/sec | 106 images/sec/watt | 29.28 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
BERT-BASE | 8 | 5,728 sequences/sec | 35 sequences/sec/watt | 5.69 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
128 | 6,025 sequences/sec | 37 sequences/sec/watt | 86.72 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 | |
BERT-LARGE | 8 | 1,889 sequences/sec | 11 sequences/sec/watt | 17.06 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
128 | 2,094 sequences/sec | 13 sequences/sec/watt | 245.59 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30 |
Sequence length=128 for BERT-BASE and BERT-LARGE
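The "4 MIG" rows report aggregate throughput across four concurrently active 1/4-GPU slices; dividing by the instance count gives a per-slice figure comparable to the "1/4 MIG" table above. A small sketch (helper name is hypothetical):

```python
def per_instance_throughput(aggregate: float, n_instances: int) -> float:
    """Aggregate MIG throughput divided evenly across identical instances."""
    return aggregate / n_instances

# A30 "4 MIG" ResNet-50, batch 128: 18,166 images/sec across 4 instances
print(per_instance_throughput(18_166, 4))  # 4541.5 images/sec per slice
```

Note the per-slice figure (about 4,542 images/sec) is slightly below the single-slice "1/4 MIG" result (4,803 images/sec); plausibly this reflects contention for shared resources when all four slices run at once, though the tables themselves do not attribute the gap.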
A10 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 8,784 images/sec | 59 images/sec/watt | 0.91 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 |
77 | 11,346 images/sec | - images/sec/watt | 6.87 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 | |
128 | 11,331 images/sec | 76 images/sec/watt | 11.3 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 | |
ResNet-50v1.5 | 8 | 8,387 images/sec | 56 images/sec/watt | 0.95 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 |
70 | 10,368 images/sec | - images/sec/watt | 7.04 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 | |
128 | 10,766 images/sec | 72 images/sec/watt | 11.89 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 | |
BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
8 | 3,287 sequences/sec | 22 sequences/sec/watt | 2.43 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 | |
128 | 3,842 sequences/sec | 26 sequences/sec/watt | 33.32 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 | |
BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
8 | 1,136 sequences/sec | 8 sequences/sec/watt | 7.04 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 | |
128 | 1,271 sequences/sec | 9 sequences/sec/watt | 100.72 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 | |
EfficientNet-B0 | 8 | 9,401 images/sec | 63 images/sec/watt | 0.85 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 |
128 | 14,246 images/sec | 97 images/sec/watt | 8.98 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 | |
EfficientNet-B4 | 8 | 1,601 images/sec | 11 images/sec/watt | 5 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 |
128 | 1,868 images/sec | 12 images/sec/watt | 68.53 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 | |
HF Swin Base | 8 | 1,029 samples/sec | 7 samples/sec/watt | 7.77 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 |
32 | 1,013 samples/sec | 7 samples/sec/watt | 31.57 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 | |
HF Swin Large | 8 | 546 samples/sec | 4 samples/sec/watt | 14.65 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 |
32 | 563 samples/sec | 4 samples/sec/watt | 56.83 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 | |
HF ViT Base | 8 | 1,382 samples/sec | 9 samples/sec/watt | 5.79 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 |
64 | 1,433 samples/sec | 10 samples/sec/watt | 44.68 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 | |
HF ViT Large | 8 | 463 samples/sec | 3 samples/sec/watt | 17.28 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 |
64 | 444 samples/sec | 3 samples/sec/watt | 144.07 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | A10 | |
Megatron BERT Large QAT | 8 | 1,571 sequences/sec | 11 sequences/sec/watt | 5.09 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 |
128 | 1,839 sequences/sec | 12 sequences/sec/watt | 69.62 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 | |
QuartzNet | 8 | 3,968 samples/sec | 27 samples/sec/watt | 2.02 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 |
128 | 5,828 samples/sec | 39 samples/sec/watt | 21.96 | 1x A10 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A10 |
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
A2 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 2,741 images/sec | 46 images/sec/watt | 2.92 | 1x A2 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A2 |
20 | 3,012 images/sec | - images/sec/watt | 6.97 | 1x A2 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A2 | |
128 | 3,137 images/sec | 52 images/sec/watt | 40.81 | 1x A2 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A2 | |
ResNet-50v1.5 | 8 | 2,646 images/sec | 44 images/sec/watt | 3.02 | 1x A2 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A2 |
19 | 2,889 images/sec | - images/sec/watt | 6.97 | 1x A2 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A2 | |
128 | 3,002 images/sec | 50 images/sec/watt | 42.64 | 1x A2 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A2 | |
BERT-BASE | 8 | 920 sequences/sec | 15 sequences/sec/watt | 8.69 | 1x A2 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A2 |
128 | 986 sequences/sec | 16 sequences/sec/watt | 129.85 | 1x A2 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A2 | |
BERT-LARGE | 8 | 297 sequences/sec | 5 sequences/sec/watt | 26.91 | 1x A2 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A2 |
128 | 318 sequences/sec | 5 sequences/sec/watt | 402.7 | 1x A2 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A2 | |
EfficientNet-B0 | 8 | 3,189 images/sec | 61 images/sec/watt | 2.51 | 1x A2 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A2 |
25 | 3,714 images/sec | - images/sec/watt | 7 | 1x A2 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A2 | |
128 | 3,917 images/sec | 65 images/sec/watt | 32.68 | 1x A2 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A2 | |
EfficientNet-B4 | 8 | 480 images/sec | 8 images/sec/watt | 16.68 | 1x A2 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A2 |
128 | 521 images/sec | 9 images/sec/watt | 245.87 | 1x A2 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A2 |
Sequence length=128 for BERT-BASE and BERT-LARGE
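The Efficiency column is throughput per watt, so dividing throughput by efficiency recovers the approximate board power draw during the run. A quick sketch (function name is my own):

```python
def implied_power_watts(throughput: float, efficiency_per_watt: float) -> float:
    """Efficiency here is throughput per watt, so throughput / efficiency
    recovers the approximate power draw during the measurement."""
    return throughput / efficiency_per_watt

# A30 ResNet-50, batch 128: 17,303 images/sec at 105 images/sec/watt
print(round(implied_power_watts(17_303, 105)))  # ≈ 165 W
```

The result lines up with the A30's 165 W board power limit; the rounded integer efficiency values in the tables make this only an approximation.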
NVIDIA Client Batch Size 1 and 2 Performance with Triton Inference Server
A100 Triton Inference Server Performance
Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BERT Large Inference | A100-SXM4-40GB | tensorrt | TensorRT | Mixed | 2 | 1 | 1 | 24 | 32.212 | 745 inf/sec | 384 | 23.07-py3 |
BERT Large Inference | A100-SXM4-40GB | tensorrt | TensorRT | Mixed | 4 | 2 | 1 | 24 | 59.367 | 808 inf/sec | 384 | 23.07-py3 |
BERT Base Inference | A100-SXM4-80GB | tensorrt | TensorRT | Mixed | 4 | 1 | 1 | 20 | 3.464 | 5,772 inf/sec | 128 | 23.07-py3 |
BERT Base Inference | A100-SXM4-40GB | tensorrt | TensorRT | Mixed | 4 | 2 | 1 | 24 | 7.222 | 6,645 inf/sec | 128 | 23.07-py3 |
DLRM Inference | A100-SXM4-40GB | ts-trace | PyTorch | Mixed | 2 | 1 | 65,536 | 30 | 1.167 | 25,694 inf/sec | - | 23.07-py3 |
DLRM Inference | A100-SXM4-80GB | ts-trace | PyTorch | Mixed | 2 | 2 | 65,536 | 30 | 1.212 | 49,484 inf/sec | - | 23.07-py3 |
ResNet-50 v1.5 | A100-SXM4-80GB | tensorrt | PyTorch | Mixed | 4 | 1 | 128 | 512 | 31.198 | 16,400 inf/sec | - | 23.07-py3 |
ResNet-50 v1.5 | A100-SXM4-80GB | tensorrt | PyTorch | Mixed | 4 | 2 | 128 | 384 | 43.656 | 17,591 inf/sec | - | 23.07-py3 |
BERT Large Inference | A100-PCIE-80GB | tensorrt | TensorRT | Mixed | 2 | 1 | 1 | 24 | 36.777 | 652 inf/sec | 384 | 23.07-py3 |
BERT Large Inference | A100-PCIE-80GB | tensorrt | TensorRT | Mixed | 4 | 2 | 1 | 24 | 65.889 | 728 inf/sec | 384 | 23.07-py3 |
BERT Base Inference | A100-PCIE-80GB | tensorrt | TensorRT | Mixed | 2 | 1 | 1 | 24 | 4.559 | 5,262 inf/sec | 128 | 23.07-py3 |
BERT Base Inference | A100-PCIE-80GB | tensorrt | TensorRT | Mixed | 4 | 2 | 1 | 24 | 7.79 | 6,161 inf/sec | 128 | 23.07-py3 |
DLRM Inference | A100-PCIE-80GB | ts-trace | PyTorch | Mixed | 2 | 1 | 65,536 | 30 | 1.184 | 25,324 inf/sec | - | 23.07-py3 |
DLRM Inference | A100-PCIE-80GB | ts-trace | PyTorch | Mixed | 2 | 2 | 65,536 | 30 | 1.112 | 53,917 inf/sec | - | 23.07-py3 |
ResNet-50 v1.5 | A100-PCIE-80GB | tensorrt | PyTorch | Mixed | 4 | 1 | 128 | 512 | 32.524 | 15,729 inf/sec | - | 23.07-py3 |
ResNet-50 v1.5 | A100-PCIE-80GB | tensorrt | PyTorch | Mixed | 2 | 2 | 128 | 512 | 57.479 | 17,795 inf/sec | - | 23.07-py3 |
A30 Triton Inference Server Performance
Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BERT Large Inference | A30 | tensorrt | TensorRT | Mixed | 2 | 1 | 1 | 20 | 66.853 | 359 inf/sec | 384 | 23.07-py3 |
BERT Large Inference | A30 | tensorrt | TensorRT | Mixed | 2 | 2 | 1 | 20 | 105.347 | 380 inf/sec | 384 | 23.07-py3 |
BERT Base Inference | A30 | tensorrt | TensorRT | Mixed | 2 | 1 | 1 | 24 | 7.369 | 3,256 inf/sec | 128 | 23.07-py3 |
BERT Base Inference | A30 | tensorrt | TensorRT | Mixed | 4 | 2 | 1 | 24 | 13.428 | 3,574 inf/sec | 128 | 23.07-py3 |
ResNet-50 v1.5 | A30 | tensorrt | PyTorch | Mixed | 2 | 1 | 128 | 512 | 56.456 | 9,061 inf/sec | - | 23.07-py3 |
ResNet-50 v1.5 | A30 | tensorrt | PyTorch | Mixed | 2 | 2 | 128 | 512 | 113.571 | 9,004 inf/sec | - | 23.07-py3 |
A10 Triton Inference Server Performance
Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BERT Large Inference | A10 | tensorrt | TensorRT | Mixed | 4 | 1 | 1 | 24 | 101.066 | 237 inf/sec | 384 | 23.07-py3 |
BERT Large Inference | A10 | tensorrt | TensorRT | Mixed | 4 | 2 | 1 | 24 | 195.715 | 245 inf/sec | 384 | 23.07-py3 |
BERT Base Inference | A10 | tensorrt | TensorRT | Mixed | 2 | 1 | 1 | 24 | 10.726 | 2,237 inf/sec | 128 | 23.07-py3 |
BERT Base Inference | A10 | tensorrt | TensorRT | Mixed | 2 | 2 | 1 | 20 | 16.638 | 2,404 inf/sec | 128 | 23.07-py3 |
ResNet-50 v1.5 | A10 | tensorrt | PyTorch | Mixed | 2 | 1 | 128 | 512 | 87.367 | 5,855 inf/sec | - | 23.07-py3 |
ResNet-50 v1.5 | A10 | tensorrt | PyTorch | Mixed | 2 | 2 | 128 | 384 | 131.146 | 5,850 inf/sec | - | 23.07-py3 |
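The concurrency, client batch size, latency, and throughput columns in the Triton tables are roughly tied together by Little's law: throughput ≈ (concurrent requests × client batch size) / latency. A hedged sketch (some rows deviate, likely because the reported latency is a percentile rather than the mean):

```python
def littles_law_throughput(concurrency: int, client_batch: int,
                           latency_ms: float) -> float:
    """Little's law: steady-state throughput = in-flight work / latency."""
    return concurrency * client_batch / (latency_ms / 1000.0)

# A100 BERT Large: 24 concurrent requests, client batch 1, 32.212 ms latency
print(round(littles_law_throughput(24, 1, 32.212)))  # ≈ 745 inf/sec
```

This is also how to read the table: raising concurrency or client batch size lifts throughput only until the server saturates, after which added in-flight work shows up as latency instead.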
Inference Performance of NVIDIA GPUs in the Cloud
A100 Inference Performance in the Cloud
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50v1.5 | 8 | 13,486 images/sec | - images/sec/watt | 0.59 | 1x A100 | GCP A2-HIGHGPU-1G | 23.05-py3 | INT8 | Synthetic | - | A100-SXM4-40GB |
128 | 30,517 images/sec | - images/sec/watt | 4.19 | 1x A100 | GCP A2-HIGHGPU-1G | 23.05-py3 | INT8 | Synthetic | - | A100-SXM4-40GB | |
8 | 13,673 images/sec | - images/sec/watt | 0.59 | 1x A100 | AWS EC2 p4d.24xlarge | 23.07-py3 | INT8 | Synthetic | - | A100-SXM4-40GB | |
128 | 30,517 images/sec | - images/sec/watt | 4.19 | 1x A100 | AWS EC2 p4d.24xlarge | 23.07-py3 | INT8 | Synthetic | - | A100-SXM4-40GB | |
8 | 13,733 images/sec | - images/sec/watt | 0.58 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 23.06-py3 | INT8 | Synthetic | - | A100-SXM4-80GB | |
128 | 32,513 images/sec | - images/sec/watt | 3.94 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 23.06-py3 | INT8 | Synthetic | - | A100-SXM4-80GB | |
BERT-LARGE | 8 | 2,326 sequences/sec | - sequences/sec/watt | 3.44 | 1x A100 | GCP A2-HIGHGPU-1G | 23.05-py3 | INT8 | Synthetic | - | A100-SXM4-40GB |
128 | 4,055 sequences/sec | - sequences/sec/watt | 31.57 | 1x A100 | GCP A2-HIGHGPU-1G | 23.05-py3 | INT8 | Synthetic | - | A100-SXM4-40GB | |
8 | 2,349 sequences/sec | - sequences/sec/watt | 3.41 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 23.06-py3 | INT8 | Synthetic | - | A100-SXM4-80GB | |
128 | 4,177 sequences/sec | - sequences/sec/watt | 30.64 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 23.06-py3 | INT8 | Synthetic | - | A100-SXM4-80GB | |
8 | 2,318 sequences/sec | - sequences/sec/watt | 3.45 | 1x A100 | AWS EC2 p4d.24xlarge | 23.07-py3 | INT8 | Synthetic | - | A100-SXM4-80GB | |
128 | 4,037 sequences/sec | - sequences/sec/watt | 31.7 | 1x A100 | AWS EC2 p4d.24xlarge | 23.07-py3 | INT8 | Synthetic | - | A100-SXM4-80GB
BERT-Large: Sequence Length = 128
View More Performance Data
Training to Convergence
Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology for testing whether an AI system is ready to be deployed in the field and deliver meaningful results.
Learn More
AI Pipeline
NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.
Learn More