NVIDIA AI Inference Software

Fast and scalable AI in every application.

Triton Inference Server | Triton Management Service | TensorRT | TensorRT-LLM

What is AI Inference?

AI inference is the process of using a trained model to make predictions on new, previously unseen data. During training, an AI model learns the patterns and relationships that enable it to generalize to new data. During inference, the model applies its learned knowledge to provide accurate predictions or generate outputs such as images, text, or video. This powerful capability allows businesses to make data-driven decisions, optimize processes, and deliver unique, personalized experiences for internal and external customers.
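To make this concrete, here is a minimal sketch of inference with PyTorch and torchvision: a model pretrained on ImageNet classifies a new image. The pretrained ResNet-18 and the file name example.jpg are stand-ins for this illustration, not part of NVIDIA's inference stack.

    import torch
    from PIL import Image
    from torchvision import models

    # Load a trained model and switch it to inference mode.
    weights = models.ResNet18_Weights.DEFAULT
    model = models.resnet18(weights=weights)
    model.eval()

    # Preprocess a new, previously unseen image ("example.jpg" is a placeholder).
    image = weights.transforms()(Image.open("example.jpg")).unsqueeze(0)

    with torch.no_grad():  # no gradients are needed at inference time
        logits = model(image)

    print("Predicted ImageNet class index:", logits.argmax(dim=1).item())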



Next-Generation AI Inference With NVIDIA AI Software

There's an increasing demand for sophisticated AI-enabled services like image and speech recognition, natural language processing, image and text generation, and personalized recommendations. Inference is the process of running trained AI models to make those predictions. NVIDIA’s inference software delivers the performance and efficiency critical to powering the next generation of AI products and services.

How Does NVIDIA AI Inference Work?

A workflow diagram showing how NVIDIA AI Inference works

NVIDIA AI Enterprise, an enterprise-grade AI software platform built for production inference, consists of key NVIDIA inference technologies and tools. NVIDIA AI inference supports models of all sizes and scales for different use cases such as speech AI, natural language processing (NLP), computer vision, generative AI, recommenders, and more.


NVIDIA TensorRT is an optimization compiler and runtime that uses multiple techniques like quantization, fusion, and kernel tuning to optimize trained deep learning models. NVIDIA TensorRT-LLM is an open-source library that accelerates and optimizes inference performance of the latest LLMs on NVIDIA GPUs. NVIDIA Triton Inference Server™ can be used to deploy, run, and scale trained models from all major frameworks in the cloud, in on-prem data centers, at the edge, or on embedded devices. NVIDIA Triton Management Service (TMS) automates the deployment of multiple Triton Inference Server instances in Kubernetes with resource-efficient model orchestration on GPUs and CPUs.
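As an illustration of the first step in this workflow, the sketch below compiles an ONNX model into a TensorRT engine with the TensorRT Python API. It assumes a TensorRT 8.x installation; the file names and the FP16 flag are example choices, not requirements.

    import tensorrt as trt

    # Build a TensorRT engine from an ONNX model (sketch; TensorRT 8.x API).
    # "model.onnx" and "model.plan" are placeholder file names.
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)

    with open("model.onnx", "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("Failed to parse the ONNX model")

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # optional reduced-precision optimization

    serialized_engine = builder.build_serialized_network(network, config)
    with open("model.plan", "wb") as f:
        f.write(serialized_engine)  # engine file that a serving layer can load

The resulting model.plan file is what a serving layer such as Triton loads at deployment time.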

Discover the modern landscape of AI inference, production use cases from companies, and real-world challenges and solutions.

Explore the NVIDIA Inference Solution


Use Multiple Frameworks

Deploy models from all major AI frameworks, such as TensorFlow, PyTorch, ONNX, XGBoost, and JAX, as well as Python and custom backends.


Power High Throughput, Low Latency

Deliver high-throughput and low-latency inference across computer vision, speech AI, NLP, recommender systems, and more.


Deploy Anywhere

Deploy, run, and scale optimized AI models consistently in the cloud, in on-prem data centers, at the edge, and on embedded devices.

NVIDIA AI Inference Software

NVIDIA AI Enterprise is an end-to-end AI software platform consisting of NVIDIA TensorRT, NVIDIA TensorRT-LLM, NVIDIA Triton Inference Server, NVIDIA Triton Management Service, and other tools to simplify building, sharing, and deploying AI applications. With enterprise-grade support, stability, manageability, and security, enterprises can accelerate time to value while eliminating unplanned downtime.




NVIDIA Triton Inference Server

NVIDIA Triton Inference Server is open-source inference serving software that helps standardize the deployment and execution of AI models from all major frameworks in production, on any GPU- or CPU-based infrastructure.
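As a hedged illustration of the client side, the snippet below sends a request to a running Triton server with the tritonclient Python package. The server address, model name, and the tensor names and shapes are assumptions for this sketch and must match the deployed model's configuration.

    import numpy as np
    import tritonclient.http as httpclient  # pip install tritonclient[http]

    # Assumes Triton is listening on localhost:8000 and serving a model named
    # "resnet50" with one FP32 input "input__0" and one output "output__0".
    # These names are illustrative; use your model's actual configuration.
    client = httpclient.InferenceServerClient(url="localhost:8000")

    batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
    inputs = [httpclient.InferInput("input__0", list(batch.shape), "FP32")]
    inputs[0].set_data_from_numpy(batch)
    outputs = [httpclient.InferRequestedOutput("output__0")]

    result = client.infer(model_name="resnet50", inputs=inputs, outputs=outputs)
    print(result.as_numpy("output__0").shape)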

Learn More

NVIDIA Triton Management Service

NVIDIA Triton Management Service automates the deployment of multiple Triton Inference Server instances in Kubernetes with resource-efficient model orchestration on GPUs and CPUs.


Learn More

NVIDIA TensorRT

NVIDIA TensorRT is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications. TensorRT-optimized models can be deployed, run, and scaled with Triton.
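For a quick local check of a built engine, the sketch below runs it with Polygraphy, a Python toolkit distributed alongside TensorRT. The engine path, input tensor name, and input shape are assumptions for this example.

    import numpy as np
    from polygraphy.backend.trt import EngineFromBytes, TrtRunner

    # Load a previously built engine ("model.plan" is a placeholder path) and run
    # one inference. The input name "input" and its shape are assumptions; they
    # must match the network the engine was built from.
    load_engine = EngineFromBytes(open("model.plan", "rb").read())

    with TrtRunner(load_engine) as runner:
        feed = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
        outputs = runner.infer(feed_dict=feed)
        for name, value in outputs.items():
            print(name, value.shape)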


Learn More

NVIDIA TensorRT-LLM

TensorRT-LLM is an open-source library for defining, optimizing, and executing large language models (LLMs) for inference in production. It maintains the core functionality of FasterTransformer, paired with TensorRT’s deep learning compiler, in an open-source Python API to quickly support new models and customizations.
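As a rough sketch of the library's high-level Python API (available in recent TensorRT-LLM releases), the example below generates text from a Hugging Face checkpoint. The model name and sampling settings are illustrative, and a CUDA-capable GPU is required.

    from tensorrt_llm import LLM, SamplingParams  # pip install tensorrt_llm

    # Sketch of the high-level LLM API in recent TensorRT-LLM releases.
    # The checkpoint name is an example; any supported Hugging Face model works.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    sampling = SamplingParams(temperature=0.8, top_p=0.95)

    outputs = llm.generate(["What is AI inference?"], sampling)
    for output in outputs:
        print(output.outputs[0].text)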


Learn More

Stay current on the latest NVIDIA AI inference software product updates, content, news, and more.