TensorRT for Large Language Models
Early access is now available for download.
Generative AI is powering change in every industry. From speech recognition and recommenders to computer vision, image generation, music composition, and language translation, AI lets organizations create groundbreaking applications.
As large language models (LLMs) evolve, developers need high-accuracy results and high throughput for production inference deployments. Higher performance helps decrease costs while improving user experiences, but growing LLM sizes drive up the cost and complexity of deployment.
Rapidly Expanding LLM Ecosystem
Community LLMs are growing at an explosive rate, with increased demand from companies to deploy these models into production. LLMs such as BLOOM, Dolly, Falcon, Llama, MPT, StarCoder, and others have pushed the state of the art with new architectures and operators that make it difficult to optimize models quickly. New fine-tuning and quantization techniques further complicate these efforts.
Over the past two years, NVIDIA has been working closely with leading LLM companies, including Meta, Anyscale, Cohere, Deci, Grammarly, Mistral AI, MosaicML (now part of Databricks), OctoML, Tabnine, and Together.ai, to accelerate and optimize LLM inference.
TensorRT-LLM is specifically designed to address the diverse universe of LLMs and speed time to production.
TensorRT-LLM is an open-source library that accelerates and optimizes inference performance of the latest LLMs on NVIDIA Tensor Core GPUs. Developers can use it to experiment with new LLMs while benefiting from dramatically improved performance and quick customization capabilities, without needing deep knowledge of C++ or CUDA.
TensorRT-LLM wraps TensorRT’s Deep Learning Compiler, optimized kernels from FasterTransformer, pre- and post-processing, and multi-GPU/multi-node communication in a simple, open-source Python API for defining, optimizing, and executing LLMs for inference in production.
TensorRT-LLM productizes FasterTransformer with further enhancements. Using TensorRT-LLM, AI developers can build deep learning inference applications with optimized LLMs much more simply. The software retains the core functionality of FasterTransformer while improving ease of use and extensibility through an open-source, modular Python API that supports new architectures and enhancements as LLMs evolve. With this open-source code now available, AI inference developers can deploy production-grade applications, decrease costs, reduce complexity, and improve the overall user experience.
We look forward to your use of our preview release of TensorRT-LLM and welcome feedback and contributions to help improve the product.
Note that you must be registered in the NVIDIA Developer Program to apply for the early-access release, and you must be logged in using your organization's email address. We cannot accept applications from accounts using Gmail, Yahoo, QQ, or other personal email addresses.
To participate, fill out the short application below and provide details about your use case.