NVIDIA TensorRT™ LLM is an open-source library for high-performance, real-time inference with large language models (LLMs) on NVIDIA GPUs, whether on a desktop or in a data center. It includes a modular Python runtime, PyTorch-native model authoring, and a stable production API. Built specifically for NVIDIA platforms, TensorRT LLM helps developers maximize inference performance so they can serve more users in parallel while lowering operational costs and keeping responses fast.

How TensorRT LLM Works

The latest TensorRT LLM architecture is purpose-built to streamline the developer experience, enabling faster iteration and smoother deployment without sacrificing its industry-leading inference performance. The architecture provides easy-to-use Python APIs, a simple CLI, PyTorch-native model authoring, and an extensible Python framework to enable innovation.



Optimized for NVIDIA platforms, TensorRT LLM relies on deep hardware-software integration to deliver efficient, fast LLM inference. Kernels designed specifically for NVIDIA hardware achieve peak performance for common LLM inference operations, and runtime optimizations improve GPU utilization and end-user response times. Key optimizations include FP8 and NVFP4 quantization; disaggregated serving; parallelization techniques, including wide expert parallelism (EP); and advanced speculative decoding techniques, including EAGLE-3 and multi-token prediction.
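
To make the idea behind speculative decoding concrete, the following is a minimal, framework-free sketch of the generic draft-and-verify loop with greedy acceptance. It is not TensorRT LLM's implementation and it is not EAGLE-3 or multi-token prediction; it only illustrates why a cheap draft model can reduce the number of passes through the expensive target model without changing the output. All names here (speculative_decode, greedy_decode, toy_target, toy_draft) are hypothetical and exist only for this illustration, and the "models" are toy functions over integer tokens.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def greedy_decode(target_next: NextTokenFn, prompt: List[Token],
                  max_new_tokens: int) -> List[Token]:
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next: NextTokenFn, draft_next: NextTokenFn,
                       prompt: List[Token], max_new_tokens: int,
                       draft_len: int = 4) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify. Returns (tokens, number of verification passes)."""
    tokens = list(prompt)
    verify_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes draft_len tokens autoregressively.
        ctx = list(tokens)
        proposal = []
        for _ in range(draft_len):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)

        # 2. The target model checks the proposal. A real system would score
        #    all proposed positions in one batched forward pass; this toy
        #    counts that as a single verification pass.
        verify_passes += 1
        ctx = list(tokens)
        accepted = []
        for t in proposal:
            expected = target_next(ctx)
            if expected == t:
                accepted.append(t)         # draft token matches the target
                ctx.append(t)
            else:
                accepted.append(expected)  # replace first mismatch, stop
                break
        else:
            # Every draft token matched, so the target's next token is free.
            accepted.append(target_next(ctx))
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens], verify_passes


def toy_target(ctx: List[Token]) -> Token:
    """Toy 'large' model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Toy 'small' model: agrees with the target except after token 2."""
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    spec_tokens, passes = speculative_decode(toy_target, toy_draft, [0], 8)
    print(spec_tokens, passes)
    print(greedy_decode(toy_target, [0], 8))
```

Because every accepted token is either confirmed or supplied by the target model, the speculative output matches plain greedy decoding with the target alone; the gain comes from needing fewer verification passes than generated tokens whenever the draft model guesses correctly.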

