
Beating SOTA Inference Performance on NVIDIA GPUs with GPUNet

Aug 30, 2022

By Satish Salian, Carl (Izzy) Putterman, Linnan Wang and Krzysztof Kudrynski


Crafted by AI for AI, GPUNet is a class of convolutional neural networks designed to maximize the performance of NVIDIA GPUs using NVIDIA TensorRT.

Built using novel neural architecture search (NAS) methods, GPUNet demonstrates state-of-the-art inference performance up to 2x faster than EfficientNet-X and FBNet-V3.

The NAS methodology helps build GPUNet for a wide range of applications such that deep learning engineers can directly deploy these neural networks depending on the relative accuracy and latency targets.

GPUNet NAS design methodology

Efficient architecture search and deployment-ready models are the key goals of the NAS design methodology. This means little to no interaction with domain experts and efficient use of cluster nodes for training potential architecture candidates. Most importantly, the generated models are deployment-ready.

Crafted by AI

Finding the best-performing architecture for a target device can be time-consuming. NVIDIA built and deployed a novel NAS AI agent that efficiently makes the tough design choices required to build GPUNets that are up to 2x faster than current SOTA models.

This NAS AI agent automatically orchestrates hundreds of GPUs in the Selene supercomputer without any intervention from the domain experts.

Optimized for NVIDIA GPU using TensorRT

GPUNet picks the most relevant operations required to meet the target model accuracy, weighing each against its TensorRT inference latency cost. It promotes GPU-friendly operators (for example, larger filters) over memory-bound operators (for example, fancy activations). It delivers SOTA GPU latency and accuracy on ImageNet.

Deployment-ready

The reported GPUNet latencies include all the performance optimizations available in the shipping version of TensorRT, including fused kernels, quantization, and other optimized paths. The resulting GPUNets are ready for deployment.

Building a GPUNet: An end-to-end NAS workflow

At a high level, the neural architecture search (NAS) AI agent is split into two stages:

Categorizing all possible network architectures by the inference latency.
Using a subset of these networks that fit within the latency budget and optimizing them for accuracy.

In the first stage, because the search space is high-dimensional, the agent uses Sobol sampling to distribute the candidates more evenly. Using the latency lookup table, these candidates are then categorized into sub-search spaces, for example, a subset of networks with a total latency under 0.5 ms on NVIDIA V100 GPUs.
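The even spreading of candidates can be illustrated with a short sketch. It uses a Halton sequence, a simpler low-discrepancy sequence, as a stand-in for Sobol sampling so that the example stays dependency-free; it is not the agent's actual sampler. The three-variable search space and all function names are hypothetical.

```python
from typing import List, Sequence

PRIMES = [2, 3, 5, 7, 11, 13]  # one prime base per dimension


def radical_inverse(index: int, base: int) -> float:
    """Mirror the base-`base` digits of `index` around the radix point."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        result += (index % base) * fraction
        index //= base
        fraction /= base
    return result


def halton_points(n: int, dims: int) -> List[List[float]]:
    """Return the first n points of a Halton sequence in [0, 1)^dims."""
    if dims > len(PRIMES):
        raise ValueError("add more prime bases for higher dimensions")
    return [[radical_inverse(i, PRIMES[d]) for d in range(dims)]
            for i in range(1, n + 1)]


def to_candidate(point: Sequence[float],
                 choices_per_dim: Sequence[Sequence[int]]) -> List[int]:
    """Map each coordinate in [0, 1) to one discrete choice."""
    return [choices[min(int(u * len(choices)), len(choices) - 1)]
            for u, choices in zip(point, choices_per_dim)]


# Hypothetical three-variable slice of a search space.
SEARCH_SPACE = [
    [3, 5],                     # kernel size
    list(range(1, 11)),         # number of layers
    list(range(224, 513, 32)),  # input resolution
]

candidates = [to_candidate(p, SEARCH_SPACE)
              for p in halton_points(8, len(SEARCH_SPACE))]
for candidate in candidates:
    print(candidate)
```

Unlike independent random draws, consecutive points of a low-discrepancy sequence fill the gaps left by earlier points, so even a small sample covers each variable's range.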

The inference latency used in this stage is an approximate cost, calculated by summing the latency of each layer from the latency lookup table. The table uses the input data shape and layer configuration as keys to look up the latency of the queried layer.
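A minimal sketch of this lookup-and-sum estimate follows. The key layout, the layer names, and every latency value are made up for illustration; they are not measurements and do not reflect the actual table format used by the agent.

```python
from typing import Dict, List, Tuple

# Key: (input shape as (C, H, W), layer config as (type, kernel, stride, filters)).
# Value: latency in milliseconds. All numbers are illustrative placeholders.
LayerKey = Tuple[Tuple[int, int, int], Tuple[str, int, int, int]]

STEM = ((3, 224, 224), ("conv", 3, 2, 24))
HEAD = ((24, 112, 112), ("conv", 3, 1, 24))
FUSED_K3 = ((24, 112, 112), ("fused_irb", 3, 2, 32))
FUSED_K5 = ((24, 112, 112), ("fused_irb", 5, 2, 32))

LATENCY_TABLE_MS: Dict[LayerKey, float] = {
    STEM: 0.05,
    HEAD: 0.04,
    FUSED_K3: 0.11,
    FUSED_K5: 0.45,
}


def estimate_latency_ms(layers: List[LayerKey]) -> float:
    """Approximate network latency as the sum of per-layer table lookups."""
    return sum(LATENCY_TABLE_MS[key] for key in layers)


def within_budget(candidates: Dict[str, List[LayerKey]],
                  budget_ms: float) -> List[str]:
    """Keep the candidates whose estimated latency is under the budget."""
    return [name for name, layers in candidates.items()
            if estimate_latency_ms(layers) < budget_ms]


candidates = {
    "net_a": [STEM, HEAD, FUSED_K3],
    "net_b": [STEM, HEAD, FUSED_K5],
}
print(within_budget(candidates, budget_ms=0.5))
```

Because the estimate is a plain sum of table entries, a candidate can be assigned to a latency bucket without building or running the network.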

In the second stage, the agent sets up a Bayesian optimization loss function to find the most accurate network within the latency range of the subspace:

\(\text{loss} = \text{CrossEntropy}(\text{model weights}) + \alpha \cdot \text{latency}(\text{architecture candidate})^{\beta}\)
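To make the trade-off concrete, the objective can be evaluated for two hypothetical candidates. This sketch only computes the loss value; it does not implement the Bayesian optimization itself. The values of alpha and beta are arbitrary assumptions, since the post does not state them, and the cross-entropy and latency inputs are placeholders.

```python
def nas_loss(cross_entropy: float, latency_ms: float,
             alpha: float, beta: float) -> float:
    """loss = CrossEntropy + alpha * latency ** beta."""
    return cross_entropy + alpha * latency_ms ** beta


# Assumed weights, chosen only for illustration.
ALPHA, BETA = 0.2, 2.0

# Two placeholder candidates with equal cross-entropy but different latency.
fast = nas_loss(cross_entropy=1.0, latency_ms=0.5, alpha=ALPHA, beta=BETA)
slow = nas_loss(cross_entropy=1.0, latency_ms=1.0, alpha=ALPHA, beta=BETA)
print(fast, slow)
```

With equal cross-entropy, the lower-latency candidate gets the lower loss; alpha scales how strongly latency is penalized and beta controls how quickly the penalty grows.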

*Figure 2. NVIDIA NAS AI Agent End-to-End workflow.* Control flow block diagram of the NAS AI agent, from a baseline model to a list of best-ranked neural architectures.

The AI agent uses a client-server distributed training controller to perform NAS simultaneously across multiple network architectures. The AI agent runs on one server node and proposes network candidates, which are trained on several client nodes in the cluster.

Based on the results, only the promising network architecture candidates that meet both the accuracy and the latency targets of the target hardware get ranked, resulting in a handful of best-performing GPUNets that are ready to be deployed on NVIDIA GPUs using TensorRT.
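The dispatch, filter, and rank steps can be sketched as follows. A thread pool stands in for the client nodes, the result values are toy placeholders rather than measurements, and ranking by accuracy is an assumption; the post does not specify the ranking criterion. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, NamedTuple


class Result(NamedTuple):
    name: str
    accuracy: float
    latency_ms: float


# Toy stand-in for what a client node would report after training a candidate.
TOY_RESULTS = {
    "cand_a": (0.78, 0.45),
    "cand_b": (0.80, 0.62),
    "cand_c": (0.74, 0.40),
    "cand_d": (0.79, 0.48),
}


def evaluate_on_client(name: str) -> Result:
    """Placeholder for training and timing one candidate on a client node."""
    accuracy, latency_ms = TOY_RESULTS[name]
    return Result(name, accuracy, latency_ms)


def run_search(names: List[str], min_accuracy: float,
               max_latency_ms: float, num_clients: int = 2) -> List[Result]:
    """Evaluate candidates in parallel, keep those meeting both targets, rank."""
    with ThreadPoolExecutor(max_workers=num_clients) as pool:
        results = list(pool.map(evaluate_on_client, names))
    promising = [r for r in results
                 if r.accuracy >= min_accuracy and r.latency_ms <= max_latency_ms]
    # Assumed criterion: highest accuracy first.
    return sorted(promising, key=lambda r: r.accuracy, reverse=True)


ranked = run_search(list(TOY_RESULTS), min_accuracy=0.75, max_latency_ms=0.5)
for result in ranked:
    print(result)
```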

GPUNet model architecture

The GPUNet model architecture is an eight-stage architecture using EfficientNet-V2 as the baseline architecture.

The search space definition includes searching on the following variables:

Type of operations
Number of strides
Kernel size
Number of layers
Activation function
IRB expansion ratio
Output channel filters
Squeeze excitation (SE)

Table 1 shows the range of values for each variable in the search space.

*Table 1. Value ranges for search space variables*
Stage	Type	Stride	Kernel	Layers	Activation	ER	Filters	SE
0	Conv	2	[3,5]	1	[R,S]		[24, 32, 8]
1	Conv	1	[3,5]	[1,4]	[R,S]		[24, 32, 8]
2	F-IRB	2	[3,5]	[1,8]	[R,S]	[2, 6]	[32, 80, 16]	[0, 1]
3	F-IRB	2	[3,5]	[1,8]	[R,S]	[2, 6]	[48, 112, 16]	[0, 1]
4	IRB	2	[3,5]	[1,10]	[R,S]	[2, 6]	[96, 192, 16]	[0, 1]
5	IRB	1	[3,5]	[0,15]	[R,S]	[2, 6]	[112, 224, 16]	[0, 1]
6	IRB	2	[3,5]	[1,15]	[R,S]	[2, 6]	[128, 416, 32]	[0, 1]
7	IRB	1	[3,5]	[0,15]	[R,S]	[2, 6]	[256, 832, 64]	[0, 1]
8	Conv1x1 & Pooling & FC

The first two stages (0 and 1) search for the head configurations using convolutions. Inspired by EfficientNet-V2, stages 2 and 3 use Fused-IRBs. Because Fused-IRBs result in higher latency, stages 4 to 7 use IRBs instead.

The Layers column shows the range of layers in the stage. For example, [1, 10] in stage 4 means that the stage can have 1 to 10 IRBs. The Filters column shows the range of output channel filters for the layers in the stage. The search space also tunes the expansion ratio (ER), activation type, kernel size, and Squeeze Excitation (SE) layer inside the IRB or Fused-IRB.

Finally, the input image dimensions are searched from 224 to 512 in steps of 32.
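As a small worked example, the ranges above can be expanded into explicit value lists. The resolution range and the stage 4 layer range come from the text. Reading the three-element Filters entries in Table 1 as [min, max, step] is an assumption, and the helper name is hypothetical.

```python
from typing import List


def inclusive_range(low: int, high: int, step: int = 1) -> List[int]:
    """Return low, low + step, ... up to and including high."""
    return list(range(low, high + 1, step))


# Input resolution: 224 to 512 in steps of 32.
resolutions = inclusive_range(224, 512, 32)

# Stage 4 layers: [1, 10] means 1 to 10 IRBs.
stage4_layers = inclusive_range(1, 10)

# Stage 4 filters: [96, 192, 16], read here as [min, max, step] (assumption).
stage4_filters = inclusive_range(96, 192, 16)

print(len(resolutions), resolutions)
print(len(stage4_layers), stage4_layers)
print(len(stage4_filters), stage4_filters)
```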

Each GPUNet candidate built from the search space is encoded into a 41-wide integer vector (Table 2).

*Table 2. The encoding scheme of networks in the search space*
Stage	Type	Hyperparameters	Length
-	Resolution	[Resolution]	1
0	Conv	[#Filters]	1
1	Conv	[Kernel, Activation, #Layers]	3
2	Fused-IRB	[#Filters, Kernel, E, SE, Act, #Layers]	6
3	Fused-IRB	[#Filters, Kernel, E, SE, Act, #Layers]	6
4	IRB	[#Filters, Kernel, E, SE, Act, #Layers]	6
5	IRB	[#Filters, Kernel, E, SE, Act, #Layers]	6
6	IRB	[#Filters, Kernel, E, SE, Act, #Layers]	6
7	IRB	[#Filters, Kernel, E, SE, Act, #Layers]	6
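The lengths in Table 2 add up to 1 + 1 + 3 + 6 × 6 = 41. The sketch below checks that sum and shows one way a configuration could be flattened into, and recovered from, such a vector. The field order follows Table 2, but the dictionary layout, the stage names, and the placeholder integer codes are assumptions for illustration.

```python
from typing import Dict, List, Tuple

BLOCK_FIELDS = ["filters", "kernel", "expansion", "se", "activation", "layers"]

# (stage name, hyperparameter names), in the row order of Table 2.
ENCODING_LAYOUT: List[Tuple[str, List[str]]] = [
    ("resolution", ["resolution"]),
    ("stage0_conv", ["filters"]),
    ("stage1_conv", ["kernel", "activation", "layers"]),
    ("stage2_fused_irb", BLOCK_FIELDS),
    ("stage3_fused_irb", BLOCK_FIELDS),
    ("stage4_irb", BLOCK_FIELDS),
    ("stage5_irb", BLOCK_FIELDS),
    ("stage6_irb", BLOCK_FIELDS),
    ("stage7_irb", BLOCK_FIELDS),
]

VECTOR_WIDTH = sum(len(fields) for _, fields in ENCODING_LAYOUT)

Config = Dict[str, Dict[str, int]]


def encode(config: Config) -> List[int]:
    """Flatten a per-stage configuration into one integer vector."""
    return [int(config[stage][name])
            for stage, fields in ENCODING_LAYOUT for name in fields]


def decode(vector: List[int]) -> Config:
    """Rebuild the per-stage configuration from an encoded vector."""
    if len(vector) != VECTOR_WIDTH:
        raise ValueError(f"expected {VECTOR_WIDTH} values, got {len(vector)}")
    values = iter(vector)
    return {stage: {name: next(values) for name in fields}
            for stage, fields in ENCODING_LAYOUT}


# Placeholder configuration: every hyperparameter set to 1, resolution to 320.
example = {stage: {name: 1 for name in fields}
           for stage, fields in ENCODING_LAYOUT}
example["resolution"]["resolution"] = 320

vector = encode(example)
print(VECTOR_WIDTH, vector)
```

A fixed-width integer vector is convenient for search because every candidate has the same shape, regardless of how many layers each stage ends up with.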

At the end of the NAS search, the returned ranked candidates are a list of the best-performing encodings, which in turn correspond to the best-performing GPUNets.

Summary

ML practitioners are encouraged to read the CVPR 2022 GPUNet paper, explore the related GPUNet training code in the NVIDIA/DeepLearningExamples GitHub repo, and run inference in the Colab instance on available cloud GPUs. GPUNet inference is also available on PyTorch Hub. The Colab instance uses the GPUNet checkpoints hosted on the NGC hub. These checkpoints offer different accuracy and latency tradeoffs, so you can choose one based on the requirements of the target application.


About the Authors

About Satish Salian
Satish Salian is a principal systems software engineer at NVIDIA building end-to-end technologies and solutions for developers harnessing the power of NVIDIA GPUs. His current focus is on Neural Architecture Search (NAS) methodologies, searching for high-performance neural architectures for NVIDIA GPUs. In the recent past, Satish was involved with MLPerf-Training benchmarks, AR/VR research projects, and building the NVIDIA DGX-1 system software stack. His prior projects also include building CUDA developer tools, GFE UI control panel, NVAPIs and creating NVIDIA documentation. He has a bachelor's degree in computer engineering from the University of Pune, India.


About Carl (Izzy) Putterman
Carl (Izzy) Putterman is a senior deep learning algorithms engineer. He graduated from the University of California, Berkeley in 2021 with BAs in applied mathematics and computer science. With NVIDIA, he currently works on NIM LLM Performance and has previously worked on time series modeling and graph neural networks.


About Linnan Wang
Linnan is a senior deep learning engineer at NVIDIA. He received his Ph.D. from Brown University in 2021. His research topic is neural architecture search, and his NAS-related work has been published at ICML, NeurIPS, ICLR, CVPR, TPAMI, and AAAI. At NVIDIA, Linnan continues his R&D in NAS and ships NAS-optimized models to NVIDIA core products.


About Krzysztof Kudrynski
Krzysztof is an expert in the field of autonomous vehicles and artificial intelligence engineering with a Ph.D. in computer science. He joined NVIDIA in 2018 and currently is responsible for large-scale deep learning integration and benchmarking systems.
