Date: 2023-09-11
TensorRT: What’s New
NVIDIA® TensorRT-LLM greatly accelerates the optimization of large language models (LLMs). Building on TensorRT™, FasterTransformer, and other components, TensorRT-LLM speeds up LLM inference through targeted optimizations such as Flash Attention, in-flight batching, and FP8, all exposed through an open-source Python API, so developers can get optimal inference performance on GPUs.
NVIDIA TensorRT 8.6 improves cross-compatibility between GPUs and software stacks, making TensorRT more versatile across hardware deployments and upgrades.
Ways to Get Started With NVIDIA TensorRT
TensorRT
TensorRT is available as a free binary download for multiple platforms or as a container on NVIDIA NGC™. Purchase NVIDIA AI Enterprise, an end-to-end AI software platform that includes TensorRT and TensorRT-LLM, for mission-critical AI inference with enterprise-grade security, stability, manageability, and support. Contact sales or apply for a 90-day NVIDIA AI Enterprise evaluation license to get started.
Beginner
- Getting started with NVIDIA TensorRT (video)
- Introductory blog
- Getting started notebooks (Jupyter Notebook)
- Quick-start guide
Intermediate
- Sample code (C++)
- BERT, EfficientDet inference using TensorRT (Jupyter Notebook)
- Serving a model with NVIDIA Triton™ (blog, docs)
Expert
- Using quantization-aware training (QAT) with TensorRT (blog)
- PyTorch-quantization toolkit (Python code)
- TensorFlow quantization toolkit (blog)
- Sparsity with TensorRT (blog)
TensorRT-LLM
TensorRT-LLM is available on GitHub. Purchase NVIDIA AI Enterprise, an end-to-end AI software platform that includes TensorRT and TensorRT-LLM, for mission-critical AI inference with enterprise-grade security, stability, manageability, and support. Contact sales to learn more.
Beginner
Intermediate
- Sample code (Python)
- Performance benchmarks
Ways to Get Started With NVIDIA TensorRT Frameworks
Torch-TensorRT
Torch-TensorRT is available in the PyTorch container from the NGC catalog. Purchase NVIDIA AI Enterprise, an end-to-end AI software platform that includes PyTorch, for mission-critical AI inference with enterprise-grade security, stability, manageability, and support. Contact sales or apply for a 90-day NVIDIA AI Enterprise evaluation license to get started.
Beginner
- Getting started with NVIDIA Torch-TensorRT (video)
- Accelerate inference up to 6X in PyTorch (blog)
- Object detection with SSD (Jupyter Notebook)
Intermediate
- Post-training quantization with Hugging Face BERT (Jupyter Notebook)
- Quantization-aware training (Jupyter Notebook)
- Serving a model with Triton (blog, docs)
- Using dynamic shapes (Jupyter Notebook)
TensorFlow-TensorRT
TensorFlow-TensorRT is available in the TensorFlow container from the NGC catalog. Purchase NVIDIA AI Enterprise, an end-to-end AI software platform that includes TensorFlow, for mission-critical AI inference with enterprise-grade security, stability, manageability, and support. Contact sales or apply for a 90-day NVIDIA AI Enterprise evaluation license to get started.
Beginner
- Getting started with TensorFlow-TensorRT (video)
- Leverage TF-TRT Integration for Low-Latency Inference (blog)
- Image classification with TF-TRT (video)
- Quantization with TF-TRT (sample code)
Intermediate
- Serving a model with Triton (blog, docs)
- Using dynamic shapes (Jupyter Notebook)
Explore More TensorRT Resources
Large Language Models
Conversational AI
- Real-Time NLP With BERT (blog)
- Optimizing T5 and GPT-2 (blog)
- Quantize BERT with PTQ and QAT for INT8 Inference (sample)
- ASR With TensorRT (Jupyter Notebook)
- How to Deploy Real-Time TTS (blog)
- NLU With BERT Notebook (Jupyter Notebook)
- Real-Time Text-to-Speech (sample)
- Building an RNN Network Layer by Layer (sample code)
Image and Vision
- Optimize Object Detection (Jupyter Notebook)
- Estimating Depth With ONNX Models and Custom Layers (blog)
- Speeding Up Inference Using TensorFlow, ONNX, and TensorRT (blog)
- Object Detection With EfficientDet, YOLOv3 Networks (Python code samples)
- Using NVIDIA Ampere Architecture and TensorRT (blog)
- Achieving FP32 Accuracy in INT8 using Quantization-Aware Training (blog)
