TensorRT: What’s New
NVIDIA® TensorRT-LLM greatly speeds up the optimization of large language models (LLMs). Building on TensorRT™, FasterTransformer, and other components, TensorRT-LLM accelerates LLMs through targeted optimizations such as Flash Attention, in-flight batching, and FP8, all exposed through an open-source Python API, so developers can get optimal inference performance on GPUs.
NVIDIA TensorRT 8.6 improves cross-compatibility between GPUs and software stacks, making TensorRT more versatile across hardware deployments and upgrades.
TensorRT 8.6 GA is a free download for members of the NVIDIA Developer Program.
Ways to Get Started With NVIDIA TensorRT
TensorRT
TensorRT is available as a free binary download for multiple platforms or as a container on NVIDIA NGC™. Purchase NVIDIA AI Enterprise, an end-to-end AI software platform that includes TensorRT and TensorRT-LLM, for mission-critical AI inference with enterprise-grade security, stability, manageability, and support. Contact sales or apply for a 90-day NVIDIA AI Enterprise evaluation license to get started.
Beginner
- Getting started with NVIDIA TensorRT (video)
- Introductory blog
- Getting started notebooks (Jupyter Notebook)
- Quick-start guide
Intermediate
- Sample code (C++)
- BERT, EfficientDet inference using TensorRT (Jupyter Notebook)
- Serving model with NVIDIA Triton™ (blog, docs)
Expert
- Using quantization aware training (QAT) with TensorRT (blog)
- PyTorch-quantization toolkit (Python code)
- TensorFlow quantization toolkit (blog)
- Sparsity with TensorRT (blog)
TensorRT-LLM
TensorRT-LLM is available on GitHub. Purchase NVIDIA AI Enterprise, an end-to-end AI software platform that includes TensorRT and TensorRT-LLM, for mission-critical AI inference with enterprise-grade security, stability, manageability, and support. Contact sales to learn more.
Beginner
Intermediate
- Sample code (Python)
- Performance benchmarks
Ways to Get Started With NVIDIA TensorRT Frameworks
Torch-TensorRT
Torch-TensorRT is available in the PyTorch container from the NGC catalog. Purchase NVIDIA AI Enterprise, an end-to-end AI software platform that includes PyTorch, for mission-critical AI inference with enterprise-grade security, stability, manageability, and support. Contact sales or apply for a 90-day NVIDIA AI Enterprise evaluation license to get started.
Beginner
- Getting started with NVIDIA Torch-TensorRT (video)
- Accelerate inference up to 6X in PyTorch (blog)
- Object detection with SSD (Jupyter Notebook)
Intermediate
- Post-training quantization with Hugging Face BERT (Jupyter Notebook)
- Quantization aware training (Jupyter Notebook)
- Serving model with Triton (blog, docs)
- Using dynamic shapes (Jupyter Notebook)
TensorFlow-TensorRT
TensorFlow-TensorRT is available in the TensorFlow container from the NGC catalog. Purchase NVIDIA AI Enterprise, an end-to-end AI software platform that includes TensorFlow, for mission-critical AI inference with enterprise-grade security, stability, manageability, and support. Contact sales or apply for a 90-day NVIDIA AI Enterprise evaluation license to get started.
Beginner
- Getting started with TensorFlow-TensorRT (video)
- Leverage TF-TRT Integration for Low-Latency Inference (blog)
- Image classification with TF-TRT (video)
- Quantization with TF-TRT (sample code)
Intermediate
- Serving model with Triton (blog, docs)
- Using dynamic shapes (Jupyter Notebook)
Explore More TensorRT Resources
Large Language Models
Conversational AI
- Real-Time NLP With BERT (blog)
- Optimizing T5 and GPT-2 (blog)
- Quantize BERT with PTQ and QAT for INT8 Inference (sample)
- ASR With TensorRT (Jupyter Notebook)
- How to Deploy Real-Time TTS (blog)
- NLU With BERT Notebook (Jupyter Notebook)
- Real-Time Text-to-Speech (sample)
- Building an RNN Network Layer by Layer (sample code)
Image and Vision
- Optimize Object Detection (Jupyter Notebook)
- Estimating Depth With ONNX Models and Custom Layers (blog)
- Speeding Up Inference Using TensorFlow, ONNX, and TensorRT (blog)
- Object Detection With EfficientDet, YOLOv3 Networks (Python code samples)
- Using NVIDIA Ampere Architecture and TensorRT (blog)
- Achieving FP32 Accuracy in INT8 using Quantization-Aware Training (blog)