NVIDIA AI Inference Software

Fast and scalable AI in every application.

Triton Inference Server TensorRT

What is AI inference?

NVIDIA’s inference software delivers fast and scalable AI in every application

Demand is growing for sophisticated AI-enabled services such as image and speech recognition, natural language processing, visual search, and personalized recommendations. Inference is the process of running a trained AI model to make predictions; it is the production phase of AI. Successful inference must be easy to deploy, run, and scale, and must deliver the performance the application requires. NVIDIA's inference software delivers the performance, efficiency, and responsiveness critical to powering the next generation of AI products and services, whether in the cloud, in the data center, at the network's edge, or in embedded devices.

How does NVIDIA AI inference work?

A workflow diagram showing how NVIDIA AI Inference works

NVIDIA Triton Inference Server deploys, runs, and scales trained models from all major frameworks (TensorFlow, PyTorch, XGBoost, and others) in the cloud, in on-prem data centers, and on edge or embedded devices. NVIDIA TensorRT is an inference optimizer and runtime that applies techniques such as quantization, layer fusion, and kernel tuning to a trained deep learning model to deliver order-of-magnitude performance improvements. NVIDIA AI inference supports models of all sizes and scales, across use cases such as speech AI, NLP, computer vision, recommenders, and more.
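
To make one of these techniques concrete, the sketch below shows what symmetric INT8 quantization does to a weight tensor. This is an illustration in plain NumPy, not TensorRT code; TensorRT performs this (plus calibration, fusion, and kernel selection) internally when building an engine.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map FP32 weights onto the INT8 range [-127, 127] with one scale factor.

    A minimal sketch of symmetric per-tensor quantization; TensorRT's real
    implementation also calibrates scales from representative data.
    """
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.array([0.05, -1.27, 0.635], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Each weight now occupies 1 byte instead of 4, and the round trip
# reproduces the original value to within half a quantization step.
```

The payoff is smaller weights and the ability to use fast INT8 math on Tensor Cores, at the cost of a bounded rounding error per weight.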

Explore the NVIDIA inference solution.

Use multiple frameworks.

Deploy models from all major AI frameworks, including TensorFlow, PyTorch, ONNX, XGBoost, plain Python, and JAX, or even from custom frameworks.

Power high throughput, low latency.

Deliver high-throughput and low-latency inference across computer vision, speech AI, natural language processing, recommender systems, and more.

Deploy anywhere.

Deploy, run, and scale optimized AI models consistently on cloud, on prem, at the edge, and on embedded devices.

Take a closer look at NVIDIA AI inference software.

Triton Inference Server

NVIDIA Triton™ Inference Server is open-source model-serving software that delivers fast and scalable AI in every application. Triton Inference Server lets teams deploy trained AI models from any framework (TensorFlow, PyTorch, XGBoost, ONNX, Python, and more) on any GPU- or CPU-based infrastructure. It runs multiple models concurrently on a single GPU to maximize utilization and integrates with Kubernetes for orchestration, metrics, and auto-scaling.
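
In practice, Triton serves models from a model repository: each model gets a versioned directory plus a small `config.pbtxt` describing its backend and tensors. A sketch for a hypothetical ONNX image classifier follows; the model name, tensor names, and dimensions are illustrative, not taken from this page.

```protobuf
# Assumed layout (illustrative):
#   model_repository/my_classifier/1/model.onnx
#   model_repository/my_classifier/config.pbtxt
name: "my_classifier"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```

Pointing Triton at the repository (for example, `tritonserver --model-repository=/models`) then exposes the model over HTTP and gRPC endpoints.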

Learn More

TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference that unlocks the power of NVIDIA Tensor Core GPUs. It delivers up to 36X higher throughput than CPU-only platforms while minimizing latency. Using TensorRT, you can start from any deep learning framework and rapidly optimize, validate, and run trained neural networks in production.
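
One common entry point is `trtexec`, the command-line tool that ships with TensorRT, which builds an optimized engine from an exported model and can benchmark it. A sketch, with file names being illustrative:

```shell
# Build a serialized TensorRT engine from an ONNX model, enabling FP16 kernels.
trtexec --onnx=model.onnx --saveEngine=model.plan --fp16

# Load the engine and report throughput/latency on the local GPU.
trtexec --loadEngine=model.plan
```

The resulting `.plan` engine can then be executed via the TensorRT runtime API or served directly by Triton's TensorRT backend.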

Learn More

Stay current on the latest NVIDIA Triton™ Inference Server and NVIDIA TensorRT™ product updates, content, news, and more.

Sign Up