
NVIDIA TensorRT 10.0 Upgrades Usability, Performance, and AI Model Support

NVIDIA today announced the latest release of NVIDIA TensorRT, an ecosystem of APIs for high-performance deep learning inference. TensorRT includes inference runtimes and model optimizations that deliver low latency and high throughput for production applications. 

This post outlines the key features and upgrades of this release, including easier installation, increased usability, improved performance, and more natively supported AI models.

Developer experience upgrades

Getting started with TensorRT 10.0 is easier, thanks to updated Debian and RPM metapackages. For example, apt-get install tensorrt or pip install tensorrt installs all relevant TensorRT libraries for C++ or Python, respectively.

In addition, the new Debug Tensors API lets you mark tensors as debug tensors at build time, making it easier to identify issues that arise in the graph. At runtime, each time the value of a marked tensor is written, a user-defined callback function is invoked with its value, type, and dimensions.
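
A minimal Python sketch of how this could look follows; the listener class, callback signature, and method names (IDebugListener, mark_debug, set_debug_listener) are assumptions that should be verified against the TensorRT 10.0 API reference.

```python
import tensorrt as trt

# Hedged sketch: mark a tensor for debugging at build time, then receive its
# values through a user-defined listener at runtime. Class and method names
# here are assumptions to verify against the TensorRT 10.0 Python API.

class PrintDebugListener(trt.IDebugListener):
    def process_debug_tensor(self, addr, location, type, shape, name, stream):
        # Invoked each time the marked tensor is written; reports its data
        # type and dimensions (the value lives behind the device address).
        print(f"debug tensor {name}: dtype={type}, shape={shape}")
        return True

# Build time: flag an intermediate tensor on an existing INetworkDefinition.
# network.mark_debug(layer.get_output(0))

# Runtime: attach the listener to the execution context.
# context.set_debug_listener(PrintDebugListener())
```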

TensorRT 10.0 includes tooling in the ONNX parser to identify unsupported nodes when the call to parse fails. This error reporting contains node name, node type, reason for failure, and the local function stack if the node is located in an ONNX local function. You can query the number of these errors with the getNbErrors function, and get information about individual errors using the getError function.
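
For example, a typical Python parsing loop can surface those errors as shown below (model.onnx is a placeholder path):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network()
parser = trt.OnnxParser(network, logger)

# Parse the model; on failure, walk the recorded errors. Each error's
# description includes the node name, node type, reason for failure, and
# the local function stack when applicable.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
```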

TensorRT 10.0 is also a major upgrade for Windows development. Windows developers can now leverage version compatibility, hardware forward compatibility, weight-stripped engines, and Stable Diffusion pipeline improvements.

Feature upgrades

TensorRT 10.0 performance highlights include INT4 Weight-Only Quantization (WoQ) with block quantization and improved memory allocation options. In addition, new features such as weight-stripped engines and weight streaming ease the process of deploying larger models to smaller GPUs. You no longer need to fit your entire model into GPU memory.

INT4 Weight-Only Quantization 

TensorRT 10.0 supports weight compression using INT4, which is hardware architecture agnostic. WoQ is useful when memory bandwidth limits GEMM operation performance or when GPU memory is scarce. In WoQ, GEMM weights are quantized to INT4 precision while GEMM input data and compute operations remain in high precision. TensorRT WoQ kernels read 4-bit weights from memory and dequantize them before calculating dot product in high precision.  

Block quantization enables finer-grained control over quantization scales. It divides the tensor into fixed-size blocks along a single dimension and defines a scale factor for each block, with all elements in a block sharing that common scale factor.
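
As a rough illustration of the arithmetic (plain numpy, not TensorRT code), the sketch below splits a weight vector into fixed-size blocks, computes one scale per block, quantizes each element to the INT4 range, and dequantizes again:

```python
import numpy as np

def quantize_int4_blockwise(weights: np.ndarray, block_size: int = 64):
    # Split the weights into fixed-size blocks along one dimension.
    blocks = weights.reshape(-1, block_size)
    # One scale per block, chosen so the largest magnitude maps to 7.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    # Quantize each element to the signed INT4 range [-8, 7].
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # WoQ kernels perform this step on the fly before the high-precision dot product.
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, scales = quantize_int4_blockwise(w)
w_hat = dequantize(q, scales)
print("max abs error:", np.abs(w - w_hat).max())
```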

Runtime allocation

createExecutionContext now accepts an argument specifying the allocation strategy (kSTATIC, kON_PROFILE_CHANGE, or kUSER_MANAGED) for execution context device memory. With user-managed allocation (kUSER_MANAGED), the new updateDeviceMemorySizeForShapes API queries the required size based on the actual input shapes.
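
A hedged Python sketch of the user-managed path follows; the engine file, tensor name, and shape are placeholders, and the enum and method names (ExecutionContextAllocationStrategy, update_device_memory_size_for_shapes) should be verified against the Python API reference.

```python
import tensorrt as trt
from cuda import cudart  # cuda-python, used here only to allocate device memory

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:          # placeholder engine file
    engine = runtime.deserialize_cuda_engine(f.read())

# Create the context with user-managed device memory instead of letting
# TensorRT allocate it up front.
context = engine.create_execution_context(
    trt.ExecutionContextAllocationStrategy.USER_MANAGED)

# Set the actual input shapes, then query how much scratch memory they need.
context.set_input_shape("input", (1, 3, 224, 224))   # placeholder name/shape
required = context.update_device_memory_size_for_shapes()

# Allocate exactly that much and hand it to the context.
err, ptr = cudart.cudaMalloc(required)
context.device_memory = ptr
```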

Weight-stripped engines

TensorRT 10.0 supports weight-stripped engines, enabling up to 99% compression of engine size. Engines are refitted with the weights at runtime, without rebuilding the engine. This is done using the new kREFIT_IDENTICAL flag, which instructs the TensorRT builder to optimize under the assumption that the engine will be refitted with weights identical to those provided at build time.

Using this flag in conjunction with kSTRIP_PLAN minimizes plan size in deployment scenarios where, for example, the plan is being shipped alongside an ONNX model containing the weights. TensorRT enables refitting only for constant weights that don’t impact the builder’s ability to optimize and produce an engine with the same runtime performance as a non-refittable engine. Those weights are then omitted from the serialized engine, resulting in a small plan file that can be refitted at runtime using the weights from the ONNX model. 
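A hedged end-to-end sketch in Python might look like the following; model.onnx is a placeholder, and the ONNX refitter class name (OnnxParserRefitter) is an assumption to verify against the TensorRT 10.0 API.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network()
parser = trt.OnnxParser(network, logger)
parser.parse_from_file("model.onnx")              # placeholder path

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.STRIP_PLAN)       # omit weights from the plan
config.set_flag(trt.BuilderFlag.REFIT_IDENTICAL)  # assume identical weights at refit
plan = builder.build_serialized_network(network, config)

# At deployment: deserialize the small plan, then refit the weights directly
# from the ONNX model that ships alongside it.
runtime = trt.Runtime(logger)
engine = runtime.deserialize_cuda_engine(plan)
refitter = trt.Refitter(engine, logger)
parser_refitter = trt.OnnxParserRefitter(refitter, logger)  # assumed class name
assert parser_refitter.refit_from_file("model.onnx")
assert refitter.refit_cuda_engine()
```
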

This feature enables you to avoid keeping an additional copy of the weights in the TensorRT plan when executing ONNX models, or when multiple engines are built from the same set of weights. This matters especially on Windows, where dozens of GeForce RTX GPUs are supported and each GPU gets its own dedicated weight-stripped engine.

Weight streaming

TensorRT can be configured to stream the network’s weights from host memory to device memory during network execution, instead of placing them in device memory at engine load time. This enables models with weights larger than free GPU memory to run, but potentially with significantly increased latency. Weight streaming is an opt-in feature at both build time and runtime. Note that this feature is only supported with strongly typed networks.
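
A hedged build-and-run sketch follows; the ONNX path is a placeholder and the weight_streaming_budget attribute name is an assumption to verify against the TensorRT 10.0 Python API.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# Weight streaming requires a strongly typed network definition.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))
parser = trt.OnnxParser(network, logger)
parser.parse_from_file("model.onnx")                # placeholder path

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.WEIGHT_STREAMING)   # opt in at build time
plan = builder.build_serialized_network(network, config)

# Opt in at runtime by capping how many bytes of weights stay resident on the
# GPU before creating the execution context; the rest streams from host memory.
runtime = trt.Runtime(logger)
engine = runtime.deserialize_cuda_engine(plan)
engine.weight_streaming_budget = 2 << 30            # assumed attribute; ~2 GiB
context = engine.create_execution_context()
```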

NVIDIA TensorRT Model Optimizer 0.11

TensorRT 10.0 also includes NVIDIA TensorRT Model Optimizer, a new comprehensive library of post-training and training-in-the-loop model optimizations. These include quantization, sparsity, and distillation to reduce model complexity, enabling compiler frameworks to optimize the inference speed of deep learning models. 

Model Optimizer produces simulated-quantized checkpoints for PyTorch and ONNX models that deploy to TensorRT-LLM or TensorRT. Its Python APIs let you apply these optimization techniques on top of the existing runtime and compiler optimizations in TensorRT to accelerate inference.

NVIDIA TensorRT Model Optimizer is public and free to use as an NVIDIA PyPI wheel.  For more in-depth information, see Accelerate Generative AI Inference Performance with NVIDIA TensorRT Model Optimizer, Now Publicly Available.

Post-Training Quantization 

Post-Training Quantization (PTQ) is one of the most popular model compression methods to reduce memory footprint and accelerate inference. While some other quantization toolkits only support WoQ or basic techniques, Model Optimizer provides advanced calibration algorithms, including INT8 SmoothQuant and INT4 AWQ. If you’re using FP8 or lower precisions such as INT8 or INT4 in TensorRT-LLM, you’re already leveraging Model Optimizer PTQ under the hood. 
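
As a sketch of what that looks like with the Model Optimizer Python API (model and calib_dataloader are assumed to already exist):

```python
import modelopt.torch.quantization as mtq

# Calibration loop: run a few representative batches through the model so the
# inserted quantizers can collect activation statistics.
def forward_loop(m):
    for batch in calib_dataloader:
        m(batch)

# Apply INT8 SmoothQuant PTQ; other predefined configs (for example
# mtq.INT4_AWQ_CFG or mtq.FP8_DEFAULT_CFG) follow the same pattern.
model = mtq.quantize(model, mtq.INT8_SMOOTHQUANT_CFG, forward_loop)
# The quantized model can then be exported for TensorRT-LLM or ONNX/TensorRT deployment.
```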

Quantization Aware Training

Quantization Aware Training (QAT) enables you to fully unlock inference speedups from 4-bit quantization without compromising accuracy. By computing scaling factors during training and incorporating simulated quantization loss into the fine-tuning process, QAT makes the neural network more resilient to quantization. The Model Optimizer QAT workflow is designed to integrate with leading training frameworks, including NVIDIA NeMo, Megatron-LM, and the Hugging Face Trainer API, giving developers the option to harness the capabilities of the NVIDIA platform across a variety of frameworks.
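
A hedged sketch of that flow, assuming an existing PyTorch model, calibration loop, and fine-tuning routine:

```python
import modelopt.torch.quantization as mtq

# Insert quantizers and calibrate first (here with the INT4 AWQ config), then
# fine-tune as usual; the simulated quantization stays active during training
# so the network learns to compensate for the quantization error.
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, calib_loop)

# Fine-tune with your existing trainer (for example Hugging Face Trainer,
# NeMo, or a plain PyTorch loop). `finetune` is a placeholder.
finetune(model)
```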

Sparsity

Sparsity reduces the size of models by selectively encouraging zero values in model parameters that can then be discarded from storage or computations. In MLPerf Inference v4.0, TensorRT-LLM used Model Optimizer post-training sparsity under the hood to showcase a further 1.3x speedup on top of FP8 quantization for a Llama 2 70B on NVIDIA H100.
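
A heavily hedged sketch of post-training sparsification with Model Optimizer follows; the mode string and config keys are assumptions about the library's sparsify API and should be checked against the Model Optimizer documentation.

```python
import modelopt.torch.sparsity as mts

# Assumed API: apply 2:4 structured sparsity post-training with SparseGPT-style
# calibration. `model` and `calib_dataloader` are placeholders.
model = mts.sparsify(
    model,
    mode="sparsegpt",
    config={"data_loader": calib_dataloader, "collect_func": lambda batch: batch},
)
```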

Nsight Deep Learning Designer 

TensorRT 10.0 also introduces support for profiling and engine building with Nsight Deep Learning Designer 2024.1 (Early Access). Nsight Deep Learning Designer is an integrated development environment for designing deep neural networks (DNNs). 

Model optimization is a careful balance of speed and accuracy. Nsight Deep Learning Designer provides a visual diagnosis of network inference performance to help tune models to meet performance targets and saturate GPU resources. 

The tool also provides visual inspection of TensorRT ONNX models. You can make adjustments to model graphs and individual operators in real time to optimize the inference process. 

Nsight Deep Learning Designer is available for free. Learn more and get access to version 2024.1.

Figure 1. Nsight Deep Learning Designer 2024.1 visualizes a TensorRT 10.0 model for examining and controlling the inference process in real time

Expanded support for AI models

NVIDIA TensorRT-LLM is an open-source library for optimizing LLM inference. The easy-to-use Python API incorporates the latest advancements in LLM inference like FP8 and INT4 AWQ with no loss in accuracy. TensorRT-LLM 0.10, which will be available in late May, supports newly released AI models, including Meta Llama 3, Google CodeGemma and Google RecurrentGemma, and Microsoft Phi-3. 

FP8 support for Mixture of Experts (MoE) models has also been added. Encoder-decoder models are supported in the C++ runtime and the NVIDIA Triton backend with in-flight batching. Weight-stripped engines, introduced in TensorRT 10.0, are also available in TensorRT-LLM.

Summary

The NVIDIA TensorRT 10.0 release offers many new features, including weight streaming, weight-stripped engines, INT4 quantization, and improved memory allocation. It also includes Model Optimizer, a comprehensive library of post-training and training-in-the-loop model optimizations that deploy to TensorRT-LLM or TensorRT. TensorRT-LLM continues LLM-specific optimizations with many new models, features, and performance improvements.

Learn more about TensorRT.
