Jetson AGX Xavier: Deep Learning Inference Benchmarks
This page provides initial benchmark results for deep learning inference performance and energy efficiency on Jetson AGX Xavier, covering networks including ResNet-18 FCN, ResNet-50, VGG19, GoogleNet, and AlexNet, using JetPack 4.1.1 Developer Preview software. Performance and power characteristics will continue to improve as NVIDIA releases software updates containing additional features and optimizations for Jetson AGX Xavier.
ResNet-18 FCN (Fully Convolutional Network) for semantic segmentation operates at full HD resolution (2048x1024) and is representative of autonomous machine workloads involving perception, path planning, and navigation. ResNet-50, VGG19, GoogleNet, and AlexNet perform recognition and classification on image patches with 224x224 resolution, and are commonly used as the encoder backbones of various object detection and segmentation networks. Running a batch size of 8 or higher at the lower resolution can approximate the performance and latency of a batch size of 1 at higher resolutions. Robotic platforms and autonomous machines often incorporate multiple cameras and sensors whose inputs can be batch processed for increased performance. They also commonly detect regions-of-interest (ROIs) and then classify those ROIs in batches.
With the recent availability of substantial computational resources at the edge, applications are deploying increasingly complex networks, such as variants of ResNet and VGG, for improved accuracy. GoogleNet and AlexNet are included here for historical completeness (denoted in the tables in gray). In the future we will update this document to include additional networks for tasks like object detection and motion planning, and networks that incorporate Recurrent Neural Networks (RNNs) for processes such as speech recognition and image captioning.
The benchmark measurements below were collected with the following environment:
- Jetson AGX Xavier Developer Kit
- JetPack 4.1.1 Developer Preview
- TensorRT 5.0, using the trtexec** tool
- Concurrent use of GPU (INT8) and two DLAs (FP16)
- MAX-N and 15W power modes (nvpmodel)
Total module power consumption and energy efficiency measurements include the usage of CPU, GPU, DLAs, memory, miscellaneous SoC power, I/O, and regulator efficiency losses on all rails. Power consumption was measured using INA voltage and current monitors onboard the module.
Estimates of future performance, incorporating software enhancements such as INT8 support for the DLAs and additional GPU optimizations, are provided for various network configurations. The future performance estimates presume concurrent use of GPU (INT8) and two DLAs (INT8).
15W Mode (HD)
NETWORK | BATCH SIZE | PERF (img/sec) | LATENCY (ms) | MODULE POWER (watts) | MODULE PERFORMANCE / watt |
---|---|---|---|---|---|
ResNet-18 FCN | 1 | 34 | 29.2 | 12.3 | 2.8 |
ResNet-18 FCN | 2 | 36 | 55.4 | 12.5 | 2.9 |
ResNet-18 FCN | 4 | 41 | 96.6 | 12.7 | 3.3 |
ResNet-18 FCN | 8 | 48 | 167.4 | 12.8 | 3.7 |
ResNet-18 FCN | 16 | 49 | 323.9 | 13.0 | 3.8 |
ResNet-18 FCN | 32 | 51 | 622.6 | 13.2 | 3.9 |
MAX-N Mode* (HD)
NETWORK | BATCH SIZE | PERF (img/sec) | LATENCY (ms) | MODULE POWER (watts) | MODULE PERFORMANCE / watt |
---|---|---|---|---|---|
ResNet-18 FCN | 1 | 64 | 15.6 | 35.4 | 1.8 |
ResNet-18 FCN | 2 | 68 | 29.6 | 36.0 | 1.9 |
ResNet-18 FCN | 4 | 74 | 54.1 | 36.9 | 2.0 |
ResNet-18 FCN | 8 | 82 | 97.6 | 37.3 | 2.2 |
ResNet-18 FCN | 16 | 86 | 186.0 | 38.3 | 2.2 |
ResNet-18 FCN | 32 | 88 | 363.6 | 38.7 | 2.3 |
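The MODULE PERFORMANCE / watt column in the tables above appears to be throughput divided by total module power. The following Python sketch recomputes it for the batch-size-1 rows of both HD tables. The function name `perf_per_watt` is illustrative and not part of any NVIDIA tool. Because the published throughput and power figures are themselves rounded, recomputed values for other rows can differ from the table in the last digit.

```python
def perf_per_watt(images_per_sec: float, module_power_w: float) -> float:
    """Energy efficiency as images per second per watt of module power."""
    if module_power_w <= 0:
        raise ValueError("module power must be positive")
    return images_per_sec / module_power_w


# ResNet-18 FCN, batch size 1: (img/sec, module watts) copied from the HD tables.
rows = {
    "15W": (34, 12.3),
    "MAX-N": (64, 35.4),
}

for mode, (perf, power) in rows.items():
    print(f"{mode}: {perf_per_watt(perf, power):.1f} img/sec/watt")
```

For these two rows the result agrees with the tabulated 2.8 and 1.8, which shows that the 15W mode delivers roughly half the throughput of MAX-N at about a third of the power.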
15W Mode
NETWORK | BATCH SIZE | PERF (img/sec) | LATENCY (ms) | MODULE POWER (watts) | MODULE PERFORMANCE / watt | FUTURE PERF (img/sec) | FUTURE MODULE POWER (watts) | FUTURE MODULE PERFORMANCE / watt |
---|---|---|---|---|---|---|---|---|
ResNet-50 | 1 | 358 | 2.8 | 11.5 | 31.2 | 800 | 12 | 67 |
ResNet-50 | 2 | 508 | 3.9 | 12.8 | 39.7 | 1090 | 14 | 78 |
ResNet-50 | 4 | 634 | 6.3 | 13.6 | 46.5 | 1280 | 14 | 91 |
ResNet-50 | 8 | 717 | 11.2 | 14.4 | 49.8 | 1360 | 14 | 97 |
ResNet-50 | 16 | 767 | 20.9 | 14.9 | 51.3 | 1410 | 15 | 94 |
ResNet-50 | 32 | 841 | 38.0 | 15.1 | 55.7 | 1430 | 15 | 95 |
ResNet-50 | 64 | 869 | 73.6 | 15.1 | 57.6 | 1430 | 15 | 95 |
ResNet-50 | 128 | 879 | 145.7 | 15.2 | 57.7 | 1430 | 15 | 95 |
VGG19 | 1 | 84 | 11.9 | 14.2 | 5.9 | 230 | 12 | 19 |
VGG19 | 2 | 132 | 15.2 | 14.4 | 9.1 | 290 | 13 | 22 |
VGG19 | 4 | 174 | 22.9 | 14.6 | 11.9 | 320 | 13 | 25 |
VGG19 | 8 | 191 | 41.8 | 14.9 | 12.8 | 340 | 13 | 26 |
VGG19 | 16 | 231 | 69.4 | 15.0 | 15.3 | 350 | 13 | 27 |
VGG19 | 32 | 260 | 123.1 | 15.2 | 17.1 | 350 | 13 | 27 |
VGG19 | 64 | 269 | 238.0 | 15.3 | 17.6 | 350 | 13 | 27 |
VGG19 | 128 | 274 | 467.8 | 15.4 | 17.8 | 350 | 13 | 27 |
GoogleNet | 1 | 542 | 1.8 | 9.8 | 55.0 | 1310 | 11 | 119 |
GoogleNet | 2 | 684 | 2.9 | 10.4 | 65.8 | 1670 | 13 | 128 |
GoogleNet | 4 | 890 | 4.5 | 11.4 | 78.1 | 1920 | 15 | 128 |
GoogleNet | 8 | 1015 | 7.9 | 12.0 | 84.4 | 1940 | 15 | 129 |
GoogleNet | 16 | 1121 | 14.3 | 12.8 | 87.6 | 1950 | 15 | 130 |
GoogleNet | 32 | 1184 | 27.0 | 13.2 | 90.0 | 1980 | 15 | 132 |
GoogleNet | 64 | 1235 | 51.8 | 13.2 | 93.6 | 1980 | 15 | 132 |
GoogleNet | 128 | 1255 | 102.0 | 13.3 | 94.3 | 1980 | 15 | 132 |
AlexNet | 1 | 299 | 3.3 | 14.0 | 21.3 | 1090 | 12 | 91 |
AlexNet | 2 | 466 | 4.3 | 14.3 | 32.6 | 1790 | 12 | 149 |
AlexNet | 4 | 721 | 5.5 | 14.9 | 48.5 | 2650 | 13 | 204 |
AlexNet | 8 | 990 | 8.1 | 13.5 | 73.4 | 3510 | 13 | 270 |
AlexNet | 16 | 1291 | 12.4 | 14.2 | 90.8 | 4200 | 14 | 300 |
AlexNet | 32 | 1713 | 18.7 | 14.4 | 119.0 | 4670 | 14 | 334 |
AlexNet | 64 | 2087 | 30.7 | 14.8 | 141.3 | 4670 | 14 | 334 |
AlexNet | 128 | 2270 | 56.4 | 14.9 | 152.5 | 4670 | 14 | 334 |
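In the rows above, the reported latency is close to the batch size divided by the throughput, that is, the time to complete one batch at the measured rate. This is an observation about the published numbers, not a description of how trtexec measures latency. A small Python sketch (the helper name `batch_latency_ms` is hypothetical) makes the relationship explicit:

```python
def batch_latency_ms(batch_size: int, images_per_sec: float) -> float:
    """Approximate time to process one batch, in milliseconds."""
    if images_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return 1000.0 * batch_size / images_per_sec


# (network, batch size, img/sec, reported latency in ms) from the 15W table.
samples = [
    ("ResNet-50", 8, 717, 11.2),
    ("VGG19", 1, 84, 11.9),
    ("GoogleNet", 64, 1235, 51.8),
    ("AlexNet", 128, 2270, 56.4),
]

for network, batch, perf, reported in samples:
    estimate = batch_latency_ms(batch, perf)
    print(f"{network} batch {batch}: estimated {estimate:.1f} ms, reported {reported} ms")
```

The same estimate can help pick the largest batch size that still fits a latency budget, since throughput gains flatten at high batch sizes while latency keeps growing roughly linearly.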
MAX-N Mode*
NETWORK | BATCH SIZE | PERF (img/sec) | LATENCY (ms) | MODULE POWER (watts) | MODULE PERFORMANCE / watt | FUTURE PERF (img/sec) | FUTURE MODULE POWER (watts) | FUTURE MODULE PERFORMANCE / watt |
---|---|---|---|---|---|---|---|---|
ResNet-50 | 1 | 656 | 1.5 | 31 | 21.2 | 1390 | 29 | 48 |
ResNet-50 | 2 | 915 | 2.2 | 34.2 | 26.7 | 1970 | 34 | 58 |
ResNet-50 | 4 | 1143 | 3.5 | 37.2 | 30.7 | 2320 | 37 | 63 |
ResNet-50 | 8 | 1293 | 6.2 | 39.3 | 32.9 | 2490 | 38 | 66 |
ResNet-50 | 16 | 1388 | 11.5 | 40.7 | 34.1 | 2620 | 39 | 67 |
ResNet-50 | 32 | 1561 | 20.5 | 41.6 | 37.5 | 2710 | 40 | 68 |
ResNet-50 | 64 | 1612 | 39.7 | 42.1 | 38.3 | 2710 | 40 | 68 |
ResNet-50 | 128 | 1631 | 78.5 | 42.4 | 38.5 | 2710 | 40 | 68 |
VGG19 | 1 | 156 | 6.4 | 38.2 | 4.1 | 420 | 32 | 13 |
VGG19 | 2 | 236 | 8.5 | 40.6 | 5.8 | 550 | 35 | 16 |
VGG19 | 4 | 316 | 12.7 | 41 | 7.7 | 620 | 36 | 17 |
VGG19 | 8 | 375 | 21.4 | 43.9 | 8.5 | 660 | 37 | 18 |
VGG19 | 16 | 451 | 35.5 | 44.7 | 10.1 | 680 | 37 | 18 |
VGG19 | 32 | 502 | 63.7 | 45.2 | 11.1 | 680 | 37 | 18 |
VGG19 | 64 | 521 | 122.9 | 45.7 | 11.4 | 680 | 37 | 18 |
VGG19 | 128 | 531 | 241.1 | 45.9 | 11.6 | 680 | 37 | 18 |
GoogleNet | 1 | 1030 | 1 | 27.1 | 38 | 2290 | 29 | 79 |
GoogleNet | 2 | 1310 | 1.5 | 29.8 | 43.9 | 2980 | 34 | 88 |
GoogleNet | 4 | 1644 | 2.4 | 31.3 | 52.4 | 3560 | 39 | 91 |
GoogleNet | 8 | 1897 | 4.2 | 36 | 52.7 | 3980 | 42 | 95 |
GoogleNet | 16 | 2085 | 7.7 | 36 | 57.9 | 4250 | 45 | 94 |
GoogleNet | 32 | 2234 | 14.3 | 38 | 58.9 | 4410 | 46 | 96 |
GoogleNet | 64 | 2373 | 27 | 41.5 | 57.2 | 4410 | 46 | 96 |
GoogleNet | 128 | 2414 | 53 | 41.7 | 57.9 | 4410 | 46 | 96 |
AlexNet | 1 | 483 | 2.1 | 33.9 | 14.3 | 1810 | 27 | 67 |
AlexNet | 2 | 774 | 2.6 | 34.9 | 22.2 | 3060 | 29 | 106 |
AlexNet | 4 | 1231 | 3.2 | 37 | 33.3 | 4660 | 32 | 146 |
AlexNet | 8 | 1734 | 4.6 | 40.6 | 42.7 | 6390 | 36 | 178 |
AlexNet | 16 | 2535 | 6.3 | 39.7 | 63.8 | 7840 | 38 | 206 |
AlexNet | 32 | 3338 | 9.6 | 41.1 | 81.2 | 8860 | 41 | 216 |
AlexNet | 64 | 4129 | 15.5 | 41.7 | 99 | 8860 | 41 | 216 |
AlexNet | 128 | 4504 | 28.4 | 42.4 | 106.2 | 8860 | 41 | 216 |
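To gauge how much the projected software enhancements are expected to help, the FUTURE PERF column can be divided by the currently measured PERF. The Python sketch below does this for the batch-size-8 rows of the MAX-N table. `projected_speedup` is an illustrative helper, and the ratios are simple arithmetic on the published estimates, not additional measurements.

```python
def projected_speedup(current_perf: float, future_perf: float) -> float:
    """Ratio of estimated future throughput to currently measured throughput."""
    if current_perf <= 0:
        raise ValueError("current throughput must be positive")
    return future_perf / current_perf


# (network, current img/sec, estimated future img/sec), MAX-N mode, batch size 8.
max_n_batch8 = [
    ("ResNet-50", 1293, 2490),
    ("VGG19", 375, 660),
    ("GoogleNet", 1897, 3980),
    ("AlexNet", 1734, 6390),
]

for network, current, future in max_n_batch8:
    print(f"{network}: {projected_speedup(current, future):.2f}x projected")
```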
15W Mode (Per-Rail)
The following provides a breakdown of the individual power rails, with each rail including regulator efficiency losses, for a subset of ResNet-50:
NETWORK | BATCH SIZE | PERF (img/sec) | SoC POWER (watts) | CPU POWER (watts) | GPU POWER (watts) | DLA POWER (watts) | DRAM POWER (watts) | 5V I/O POWER (watts) | MODULE POWER (watts) |
---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 2 | 508 | 2.2 | 1.1 | 3.5 | 1.4 | 1.9 | 2.7 | 12.8 |
ResNet-50 | 8 | 717 | 2.3 | 1.1 | 4.6 | 1.4 | 2.2 | 2.8 | 14.4 |
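As a consistency check, the per-rail figures in each row should add up to the MODULE POWER column. The Python sketch below sums the rails for both rows and reports the GPU's share of the total. The dictionary keys simply mirror the column headers, and `module_power` is an illustrative name.

```python
# Per-rail power in watts for ResNet-50 in 15W mode, keyed by batch size.
rails = {
    2: {"SoC": 2.2, "CPU": 1.1, "GPU": 3.5, "DLA": 1.4, "DRAM": 1.9, "5V I/O": 2.7},
    8: {"SoC": 2.3, "CPU": 1.1, "GPU": 4.6, "DLA": 1.4, "DRAM": 2.2, "5V I/O": 2.8},
}
reported_module_power = {2: 12.8, 8: 14.4}


def module_power(rail_watts: dict) -> float:
    """Total module power as the sum of the individual rails."""
    return sum(rail_watts.values())


for batch, rail_watts in rails.items():
    total = module_power(rail_watts)
    gpu_share = 100.0 * rail_watts["GPU"] / total
    print(f"batch {batch}: rails sum to {total:.1f} W "
          f"(reported {reported_module_power[batch]} W), GPU share {gpu_share:.0f}%")
```

Both rows sum to the reported module power. Most of the increase from batch size 2 to 8 comes from the GPU rail, followed by DRAM.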
Notes
* Module power in MAX-N mode with the JetPack 4.1.1 Developer Preview release may exceed TDP for some configurations. Users should tune the power profile and configuration to stay within the TDP for their application. Future versions of JetPack will further optimize performance and power.
** Example trtexec launch commands:
For GPU
$ ./trtexec --avgRuns=100 --deploy=resnet50.prototxt --int8 --batch=8 --iterations=10000 --output=prob --useSpinWait
For DLA (Core 0)
$ ./trtexec --avgRuns=100 --deploy=resnet50.prototxt --fp16 --batch=8 --iterations=10000 --output=prob --useDLACore=0 --useSpinWait --allowGPUFallback
For DLA (Core 1)
$ ./trtexec --avgRuns=100 --deploy=resnet50.prototxt --fp16 --batch=8 --iterations=10000 --output=prob --useDLACore=1 --useSpinWait --allowGPUFallback
Multiple instances of trtexec can be launched simultaneously in this fashion for concurrent execution on the GPU and DLAs. The DLA supports a maximum batch size of 32, depending on the network, while the GPU can run higher batch sizes concurrently.
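The concurrent configuration described above can be scripted. The following Python sketch builds the three example command lines (one GPU instance and one per DLA core) using only the flags shown in these notes, and starts them together with `subprocess.Popen`. The function names, the `./trtexec` path, and the use of Python are assumptions for illustration; the original measurements may have been launched differently.

```python
import subprocess
from typing import List, Optional


def trtexec_command(precision: str, dla_core: Optional[int] = None,
                    batch: int = 8, trtexec: str = "./trtexec") -> List[str]:
    """Build one trtexec argument list using the flags from the notes above."""
    cmd = [trtexec, "--avgRuns=100", "--deploy=resnet50.prototxt",
           f"--{precision}", f"--batch={batch}", "--iterations=10000",
           "--output=prob"]
    if dla_core is not None:
        cmd.append(f"--useDLACore={dla_core}")
    cmd.append("--useSpinWait")
    if dla_core is not None:
        # DLA runs in the notes also allow unsupported layers to fall back to the GPU.
        cmd.append("--allowGPUFallback")
    return cmd


def concurrent_commands(batch: int = 8) -> List[List[str]]:
    """One INT8 GPU instance plus one FP16 instance per DLA core."""
    return [
        trtexec_command("int8", batch=batch),
        trtexec_command("fp16", dla_core=0, batch=batch),
        trtexec_command("fp16", dla_core=1, batch=batch),
    ]


def launch_all(commands: List[List[str]]) -> List[int]:
    """Start every command at once, then wait for all and return exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]
```

On a Jetson AGX Xavier with trtexec in the working directory, `launch_all(concurrent_commands())` would start all three instances together. Nothing is launched when the functions are merely defined, so the command lists can be inspected first.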
Last updated: Nov 20, 2018 | NVIDIA Corporation | Subject to Change