NVIDIA TensorRT

- [What's New](#whats-new)
- [Get Started With TensorRT](#tensorrt)
- [Get Started With TensorRT Frameworks](#frameworks)
- [Additional Resources](#resources)

  

# TensorRT

NVIDIA® TensorRT™ is an ecosystem of APIs for high-performance deep learning inference. The TensorRT inference library provides a general-purpose AI compiler and an inference runtime that deliver low latency and high throughput for production applications. TensorRT-LLM builds on TensorRT, adding an open-source Python API with large language model (LLM)-specific optimizations such as in-flight batching and custom attention. TensorRT Model Optimizer provides state-of-the-art techniques such as quantization and sparsity to reduce model complexity, enabling TensorRT, TensorRT-LLM, and other inference libraries to further optimize speed during deployment.
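As a rough illustration of why the in-flight batching mentioned above helps, the toy scheduler below admits waiting requests into the running batch as soon as earlier sequences finish, instead of waiting for the whole batch to drain. This is a simplified simulation, not TensorRT-LLM code; the request lengths and `max_batch` value are made up for the example.

```python
# Toy comparison of static vs. in-flight (continuous) batching.
# Each request needs to generate a fixed number of tokens; one decode
# step produces one token for every request currently in the batch.
from collections import deque

def simulate(lengths, max_batch=4, inflight=True):
    """Return the total number of decode steps to finish all requests."""
    pending = deque(lengths)
    active = []          # tokens still to generate for in-batch requests
    steps = 0
    while pending or active:
        if inflight or not active:
            # In-flight: fill free slots immediately.
            # Static: refill only once the batch has fully drained.
            while pending and len(active) < max_batch:
                active.append(pending.popleft())
        steps += 1
        # A request with 1 token left finishes on this step.
        active = [t - 1 for t in active if t > 1]
    return steps

reqs = [8, 2, 2, 2, 8, 2, 2, 2]
static_steps = simulate(reqs, inflight=False)    # 16 steps
inflight_steps = simulate(reqs, inflight=True)   # 10 steps
```

On this made-up workload the in-flight scheduler finishes in 10 decode steps versus 16 for static batching, because short requests no longer wait behind the longest sequence in their batch.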

  

TensorRT 10.0 GA is a free download for members of the [NVIDIA Developer Program](https://developer.nvidia.com/developer-program).

[Download Now](https://developer.nvidia.com/nvidia-tensorrt-download) | [Documentation](https://docs.nvidia.com/deeplearning/tensorrt/)

* * *

## Ways to Get Started With NVIDIA TensorRT

TensorRT and TensorRT-LLM are available on multiple platforms for free for development. Simplify the deployment of AI models across cloud, data center, and GPU-accelerated workstations with [NVIDIA NIM](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/) for generative AI, and [NVIDIA Triton™ Inference Server](https://www.nvidia.com/en-us/ai-data-science/products/triton-inference-server/) for every workload, both part of [NVIDIA AI Enterprise](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/).

  

## TensorRT

TensorRT is available to download for free as a binary for multiple platforms, or as a container from [NVIDIA NGC™](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorrt).

  
[Download Now](https://developer.nvidia.com/nvidia-tensorrt-download) | [Pull Container From NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorrt) | [Documentation](https://docs.nvidia.com/deeplearning/tensorrt/)

### Beginner

- [Getting Started with NVIDIA TensorRT](https://www.youtube.com/watch?v=SlUouzxBldU) (video)
- [Introductory blog](https://developer.nvidia.com/blog/speeding-up-deep-learning-inference-using-tensorrt-updated/)
- [Getting started notebooks](https://github.com/NVIDIA/TensorRT/tree/main/quickstart/IntroNotebooks) (Jupyter Notebook)
- [Quick-start guide](https://docs.nvidia.com/deeplearning/tensorrt/quick-start-guide/index.html)

### Intermediate

- [Sample code (C++)](https://github.com/NVIDIA/TensorRT/tree/main/samples)
- [BERT](https://github.com/NVIDIA/TensorRT/tree/main/demo/BERT), [EfficientDet](https://github.com/NVIDIA/TensorRT/tree/main/demo/EfficientDet/notebooks) inference using TensorRT (Jupyter Notebook)
- Serving model with NVIDIA Triton™ ([blog](https://developer.nvidia.com/blog/optimizing-and-serving-models-with-nvidia-tensorrt-and-nvidia-triton/), [docs](https://github.com/NVIDIA/TensorRT/tree/main/quickstart/deploy_to_triton))

### Expert

- [Using quantization-aware training (QAT) with TensorRT](https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/) (blog)
- [PyTorch-quantization toolkit](https://github.com/NVIDIA/TensorRT/tree/main/tools/pytorch-quantization) (Python code)
- [TensorFlow quantization toolkit](https://developer.nvidia.com/blog/accelerating-quantized-networks-with-qat-toolkit-and-tensorrt/) (blog)
- [Sparsity with TensorRT](https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/) (blog)
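The structured sparsity covered in the blog above keeps at most two nonzero weights in every contiguous group of four (the 2:4 pattern that NVIDIA Ampere sparse tensor cores accelerate). Below is a minimal, illustrative sketch of the pruning rule; the real toolchain operates on whole tensors and typically fine-tunes the model afterward to recover accuracy.

```python
# Toy 2:4 structured-sparsity pruning: in each group of 4 weights,
# keep the 2 with largest magnitude and zero out the other 2.

def prune_2_4(weights):
    """Apply the 2:4 pattern to a flat list of weights."""
    assert len(weights) % 4 == 0
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return pruned

w = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.1]
print(prune_2_4(w))  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Because the pattern is fixed (2 of every 4), the hardware can skip the zeroed weights with a compact metadata index rather than storing arbitrary sparse coordinates.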

## TensorRT-LLM

TensorRT-LLM is available for free on [GitHub](https://github.com/NVIDIA/TensorRT-LLM/tree/rel).

  
[Download Now](https://github.com/NVIDIA/TensorRT-LLM/tree/rel) | [Documentation](https://nvidia.github.io/TensorRT-LLM)

### Beginner

- [Introduction on how TensorRT-LLM supercharges inference](https://developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus) (blog)
- [How to get started with TensorRT-LLM](https://developer.nvidia.com/blog/optimizing-inference-on-llms-with-tensorrt-llm-now-publicly-available/) (blog)

### Intermediate

- [Sample code](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) (Python)
- [Performance benchmarks](https://nvidia.github.io/TensorRT-LLM/performance.html)
- [RAG chatbot on Windows reference project](https://github.com/NVIDIA/trt-llm-rag-windows/tree/release/1.0)

  

## TensorRT Model Optimizer

TensorRT Model Optimizer is available for free on NVIDIA PyPI, with examples and recipes on [GitHub](https://github.com/NVIDIA/TensorRT-Model-Optimizer).

  
[Download Now](https://github.com/NVIDIA/TensorRT-Model-Optimizer) | [Documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/)

### Beginner

- [TensorRT Model Optimizer Quick-Start Guide](https://nvidia.github.io/TensorRT-Model-Optimizer/)
- [Introduction on Model Optimizer](https://developer.nvidia.com/blog/accelerate-generative-ai-inference-performance-with-nvidia-tensorrt-model-optimizer-now-publicly-available/) (blog)
- [Optimize Generative AI Inference With Quantization](https://www.nvidia.com/en-us/on-demand/session/gtc24-s63213/) (video)
- [Optimizing Diffusion models with 8-bit quantization](https://developer.nvidia.com/blog/tensorrt-accelerates-stable-diffusion-nearly-2x-faster-with-8-bit-post-training-quantization/) (blog)
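The 8-bit quantization discussed in the resources above boils down to choosing a scale per tensor and rounding values to integers. Below is a minimal sketch of symmetric INT8 quantization with max calibration; the math is standard, but the function names are illustrative and are not the Model Optimizer API.

```python
# Symmetric INT8 quantization sketch: scale = max(|x|) / 127, then
# round-to-nearest and clamp to the signed 8-bit range.

def calibrate_scale(values):
    """Max calibration: map the largest observed magnitude to 127."""
    return max(abs(v) for v in values) / 127.0

def quantize(x, scale):
    q = round(x / scale)
    return max(-128, min(127, q))   # clamp to INT8

def dequantize(q, scale):
    return q * scale                # reconstruct an approximate float

acts = [0.05, -1.27, 0.63, 0.0]
scale = calibrate_scale(acts)           # 1.27 / 127 ≈ 0.01
q = [quantize(v, scale) for v in acts]  # [5, -127, 63, 0]
recon = [dequantize(v, scale) for v in q]
```

The dequantized values only approximate the originals; the quantization error is bounded by half a scale step, which is why calibration (picking a good scale from representative data) matters so much for accuracy.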

### Intermediate

- [Example code](https://github.com/NVIDIA/TensorRT-Model-Optimizer)

* * *

## Ways to Get Started With NVIDIA TensorRT Frameworks

Torch-TensorRT and TensorFlow-TensorRT are available for free as containers on the NGC catalog, or you can purchase [NVIDIA AI Enterprise](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/) for mission-critical AI inference with enterprise-grade security, stability, manageability, and support. [Contact sales](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/contact-sales/) or apply for a [90-day NVIDIA AI Enterprise evaluation license](https://enterpriseproductregistration.nvidia.com/?LicType=EVAL&ProductFamily=NVAIEnterprise) to get started.

  

## Torch-TensorRT

Torch-TensorRT is available in the [PyTorch container from the NGC catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch).

  
[Pull Container From NGC](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) | [Documentation](https://nvidia.github.io/Torch-TensorRT/)

### Beginner

- [Getting started with NVIDIA Torch-TensorRT](https://www.youtube.com/watch?v=TU5BMU6iYZ0) (video)
- [Accelerate inference up to 6X in PyTorch](https://developer.nvidia.com/blog/accelerating-inference-up-to-6x-faster-in-pytorch-with-torch-tensorrt/) (blog)
- [Object detection with SSD](https://github.com/NVIDIA/Torch-TensorRT/blob/master/notebooks/ssd-object-detection-demo.ipynb) (Jupyter Notebook)

### Intermediate

- [Post-training quantization with Hugging Face BERT](https://pytorch.org/TensorRT/_notebooks/Hugging-Face-BERT.html) (Jupyter Notebook)
- [Quantization-aware training](https://pytorch.org/TensorRT/_notebooks/vgg-qat.html) (Jupyter Notebook)
- Serving model with Triton ([blog](https://developer.nvidia.com/blog/optimizing-and-serving-models-with-nvidia-tensorrt-and-nvidia-triton/), [docs](https://pytorch.org/TensorRT/tutorials/serving_torch_tensorrt_with_triton.html))
- [Using dynamic shapes](https://pytorch.org/TensorRT/_notebooks/dynamic-shapes.html) (Jupyter Notebook)

## TensorFlow-TensorRT

TensorFlow-TensorRT is available in the [TensorFlow container from the NGC catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow).

  
[Pull Container From NGC](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) | [Documentation](https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html)

### Beginner

- [Getting started with TensorFlow-TensorRT](https://www.youtube.com/watch?v=w7871kMiAs8) (video)
- [Leverage TF-TRT Integration for Low-Latency Inference](https://blog.tensorflow.org/2021/01/leveraging-tensorflow-tensorrt-integration.html) (blog)
- [Image classification with TF-TRT](https://www.youtube.com/watch?v=O-_K42EAlP0) (video)
- [Quantization with TF-TRT](https://github.com/tensorflow/tensorrt/blob/master/tftrt/examples-py/PTQ_example.ipynb) (sample code)

### Intermediate

- Serving model with Triton ([blog](https://developer.nvidia.com/blog/optimizing-and-serving-models-with-nvidia-tensorrt-and-nvidia-triton/), [docs](https://github.com/tensorflow/tensorrt/tree/master/tftrt/triton))
- [Using dynamic shapes](https://github.com/tensorflow/tensorrt/blob/master/tftrt/examples-py/dynamic_shapes.ipynb) (Jupyter Notebook)

* * *

## Explore More TensorRT Resources
  

### Large Language Models

- [TensorRT-LLM Helps Sweep MLPerf Inference Benchmarks](https://blogs.nvidia.com/blog/2023/09/11/grace-hopper-inference-mlperf/) (blog)
- [TensorRT-LLM Supercharges Inference](https://developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus) (blog)
- [How to Get Started with TensorRT-LLM](https://developer.nvidia.com/blog/optimizing-inference-on-llms-with-tensorrt-llm-now-publicly-available/) (blog) 

### Conversational AI

- [Real-Time NLP With BERT](https://developer.nvidia.com/blog/real-time-nlp-with-bert-using-tensorrt-updated/) (blog)
- [Optimizing T5 and GPT-2](https://developer.nvidia.com/blog/optimizing-t5-and-gpt-2-for-real-time-inference-with-tensorrt/) (blog)
- [Quantize BERT with PTQ and QAT for INT8 Inference](https://github.com/NVIDIA/FasterTransformer/tree/main/examples/pytorch/bert/bert-quantization-sparsity) (sample)
- [ASR With TensorRT](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/QuartzNet#inference-process) (Jupyter Notebook)
- [How to Deploy Real-Time TTS](https://devblogs.nvidia.com/how-to-deploy-real-time-text-to-speech-applications-on-gpus-using-tensorrt/) (blog)
- [NLU With BERT Notebook](https://github.com/NVIDIA/TensorRT/tree/main/demo/BERT) (Jupyter Notebook)
- [Real-Time Text-to-Speech](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2/tensorrt) (sample)
- [Building an RNN Network Layer by Layer](https://github.com/NVIDIA/TensorRT/tree/main/samples/sampleCharRNN) (sample code)

### Image and Vision

- [Optimize Object Detection](https://github.com/NVIDIA/TensorRT/blob/master/demo/EfficientDet/notebooks/EfficientDet-TensorRT8.ipynb) (Jupyter Notebook)
- [Estimating Depth With ONNX Models and Custom Layers](https://developer.nvidia.com/blog/estimating-depth-beyond-2d-using-custom-layers-on-tensorrt-and-onnx-models/) (blog)
- [Speeding Up Inference Using TensorFlow, ONNX, and TensorRT](https://developer.nvidia.com/blog/speeding-up-deep-learning-inference-using-tensorflow-onnx-and-tensorrt/) (blog)
- Object Detection With [EfficientDet](https://github.com/NVIDIA/TensorRT/tree/main/samples/python/efficientdet), [YOLOv3](https://github.com/NVIDIA/TensorRT/tree/main/samples/python/yolov3_onnx) Networks (Python code samples)
- [Using NVIDIA Ampere Architecture and TensorRT](https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/) (blog)
- [Achieving FP32 Accuracy in INT8 using Quantization-Aware Training](https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/) (blog)

Stay up to date on the latest inference news from NVIDIA.

[Sign Up](https://www.nvidia.com/en-us/deep-learning-ai/triton-tensorrt-newsletter/)  
