NVIDIA TensorRT Benefits
Speed up inference by 36X
NVIDIA TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference. TensorRT lets you optimize neural network models trained on all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded platforms, or automotive product platforms.
Optimize inference performance
TensorRT, built on the NVIDIA CUDA® parallel programming model, enables you to optimize inference using techniques such as quantization, layer and tensor fusion, kernel tuning, and others on NVIDIA GPUs.
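As a rough illustration of what layer fusion means, the sketch below folds a per-output scale-and-shift (the form batch normalization takes at inference time) into the weights of the preceding linear layer, so two layers collapse into one. It is a conceptual example in plain Python, not TensorRT's implementation, and the helper names `linear`, `scale_shift`, and `fuse` are hypothetical.

```python
# Conceptual sketch of layer fusion: fold an affine layer (y = scale * x + shift)
# into the preceding linear layer so that inference runs one layer, not two.
# Helper names are hypothetical; this does not call TensorRT.

def linear(x, weights, bias):
    """Compute y[i] = sum_j weights[i][j] * x[j] + bias[i]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def scale_shift(y, scale, shift):
    """Per-output affine transform, as batch norm applies at inference time."""
    return [s * v + t for v, s, t in zip(y, scale, shift)]

def fuse(weights, bias, scale, shift):
    """Return the weights and bias of one linear layer equal to linear + scale_shift."""
    fused_w = [[s * w for w in row] for row, s in zip(weights, scale)]
    fused_b = [s * b + t for b, s, t in zip(bias, scale, shift)]
    return fused_w, fused_b

weights = [[1.0, 2.0], [3.0, 4.0]]
bias = [0.5, -0.5]
scale = [2.0, 0.5]
shift = [1.0, -1.0]
x = [1.0, -2.0]

# Two-layer path versus the single fused layer.
unfused = scale_shift(linear(x, weights, bias), scale, shift)
fused_w, fused_b = fuse(weights, bias, scale, shift)
fused = linear(x, fused_w, fused_b)
```

Both paths produce the same outputs, but the fused path does one pass over the data instead of two, which is the kind of saving fusion aims for on a GPU.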
Accelerate every workload
TensorRT provides INT8 optimizations, using quantization-aware training and post-training quantization, and floating-point 16 (FP16) optimizations for deploying deep learning inference applications such as video streaming, recommendations, fraud detection, and natural language processing. Reduced-precision inference significantly reduces latency, which many real-time services, as well as autonomous and embedded applications, require.
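To make the idea of post-training INT8 quantization concrete, here is a minimal sketch of symmetric quantization with a simple max-based calibration step. It assumes a single scale per tensor and uses hypothetical helper names (`calibrate_scale`, `quantize`, `dequantize`); TensorRT's own calibrators are more sophisticated and are not shown here.

```python
# Conceptual sketch of symmetric post-training INT8 quantization with max calibration.
# Helper names are hypothetical; this does not call TensorRT.

def calibrate_scale(samples):
    """Pick a scale so the largest observed magnitude maps to 127."""
    max_abs = max(abs(v) for v in samples)
    return max_abs / 127.0 if max_abs > 0 else 1.0

def quantize(values, scale):
    """Map floats to integers clamped to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(q_values, scale):
    """Map INT8 codes back to approximate float values."""
    return [q * scale for q in q_values]

# Sample activations standing in for a calibration batch.
activations = [0.02, -1.27, 0.504, 0.63, -0.9]
scale = calibrate_scale(activations)
q = quantize(activations, scale)
restored = dequantize(q, scale)

# Rounding error per value is bounded by half a quantization step.
max_error = max(abs(a - r) for a, r in zip(activations, restored))
```

Each value is stored in 8 bits instead of 32, at the cost of a rounding error of at most half a quantization step; calibration exists to choose a scale that keeps that error small on representative data.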
Deploy, run, and scale with Triton
TensorRT-optimized models can be deployed, run, and scaled with NVIDIA Triton™, open-source inference-serving software that includes TensorRT as one of its backends. The advantages of using Triton include high throughput with dynamic batching and concurrent model execution, along with features such as model ensembles and streaming audio/video inputs.
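The throughput benefit of dynamic batching comes from running the model once per group of requests rather than once per request. The sketch below illustrates only that grouping step in plain Python. It is not Triton's scheduler or API: the names `form_batches` and `run_model` are hypothetical, and a real scheduler would also weigh factors such as how long requests have been waiting.

```python
# Conceptual sketch of dynamic batching: queued requests are grouped into
# batches of at most max_batch_size so the model runs once per batch.
# Names are hypothetical; this is not Triton's scheduler or API.

def form_batches(request_queue, max_batch_size):
    """Split the queue, in arrival order, into batches of bounded size."""
    return [request_queue[i:i + max_batch_size]
            for i in range(0, len(request_queue), max_batch_size)]

def run_model(batch):
    """Stand-in for a single batched inference call."""
    return [f"result-for-{request}" for request in batch]

queue = ["req0", "req1", "req2", "req3", "req4"]
batches = form_batches(queue, max_batch_size=2)

# One model call per batch; results come back in arrival order.
results = [out for batch in batches for out in run_model(batch)]
model_calls = len(batches)
```

Here five requests are served with three model calls instead of five; with larger batch sizes on a GPU, the per-call overhead is amortized over more requests.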
World-Leading Inference Performance
TensorRT was behind NVIDIA’s wins across all performance tests in the industry-standard MLPerf Inference benchmark. It also accelerates every workload across the data center and edge in computer vision, automatic speech recognition, natural language understanding (BERT), text-to-speech, and recommender systems.
Supports All Major Frameworks
TensorRT is integrated with PyTorch and TensorFlow so you can achieve 6X faster inference with a single line of code. If you’re performing deep learning training in a proprietary or custom framework, use the TensorRT C++ API to import and accelerate your models. Read more in the TensorRT documentation.
Below are a few integrations with information on how to get started.
Accelerate PyTorch models using the new Torch-TensorRT integration with just one line of code. Get 6X faster inference using the TensorRT optimizations in a familiar PyTorch environment.
TensorRT and TensorFlow are tightly integrated, so you get the flexibility of TensorFlow with the powerful optimizations of TensorRT, such as 6X the performance with one line of code.
TensorRT provides an ONNX parser so you can easily import ONNX models from popular frameworks into TensorRT. It’s also integrated with ONNX Runtime, providing an easy way to achieve high-performance inference in the ONNX format.
Inference for Large Language Models
NVIDIA TensorRT-LLM is an open-source library that accelerates and optimizes inference performance of the latest large language models (LLMs) on NVIDIA GPUs. It lets developers experiment with new LLMs, offering speed-of-light performance and quick customization without deep knowledge of C++ or CUDA.
TensorRT-LLM wraps TensorRT’s deep learning compiler—which includes optimized kernels from FasterTransformer, pre- and post-processing, and multi-GPU and multi-node communication—in a simple open-source Python API for defining, optimizing, and executing LLMs for inference in production.
Accelerate Every Inference Platform
TensorRT can optimize and deploy applications to the data center, as well as embedded and automotive environments. It powers key NVIDIA solutions such as NVIDIA TAO, NVIDIA DRIVE™, NVIDIA Clara™, and NVIDIA JetPack™.
TensorRT is also integrated with application-specific SDKs, such as NVIDIA DeepStream, NVIDIA Riva, NVIDIA Merlin™, NVIDIA Maxine™, NVIDIA Morpheus, and NVIDIA Broadcast Engine, to provide developers with a unified path to deploy intelligent video analytics, speech AI, recommender systems, video conferencing, AI-based cybersecurity, and streaming apps in production.
Join the TensorRT and Triton community and stay current on the latest feature updates, bug fixes, and more.
Read Success Stories
Discover how Amazon improved customer satisfaction by accelerating its inference by 5X.
American Express improves fraud detection by analyzing tens of millions of daily transactions 50X faster. Find out how.
Widely Adopted Across Industries
Explore Introductory Resources
Read the introductory TensorRT blog
Learn how to apply TensorRT optimizations and deploy a PyTorch model to GPUs.
Watch on-demand TensorRT sessions from GTC
Learn more about TensorRT and its new features from a curated list of GTC webinars.
Experience enterprise-ready AI inference
Access to reliable support is often vital to organizations scaling AI in production. Global NVIDIA Enterprise Support for NVIDIA TensorRT is available with NVIDIA AI Enterprise, including guaranteed response times, priority security notifications, regular updates, and access to NVIDIA AI experts.