NVIDIA Triton Inference Server

NVIDIA Triton™ Inference Server, part of the NVIDIA AI platform, is an open-source inference serving software that helps standardize model deployment and execution and delivers fast and scalable AI in production.

Get Started

What Is NVIDIA Triton?

Triton Inference Server streamlines AI inference by enabling teams to deploy, run, and scale trained AI models from any framework on any GPU- or CPU-based infrastructure. It gives AI researchers and data scientists the freedom to choose the right framework for their projects without impacting production deployment. It also helps developers deliver high-performance inference across cloud, on-premises, edge, and embedded devices.

Support for Multiple Frameworks

Triton Inference Server supports all major frameworks, including TensorFlow, NVIDIA® TensorRT™, PyTorch, MXNet, ONNX, OpenVINO, RAPIDS™ FIL (for XGBoost, scikit-learn, and other tree-based models), Python, custom C++ backends, and more.
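Regardless of framework, each model is served from a common model repository layout. As an illustrative sketch (the model names below are hypothetical), each model gets a directory holding a `config.pbtxt` and one or more numbered version subdirectories:

```
model_repository/
├── densenet_onnx/           # a hypothetical ONNX model
│   ├── config.pbtxt         # model configuration
│   └── 1/                   # version 1
│       └── model.onnx
└── resnet_trt/              # a hypothetical TensorRT model
    ├── config.pbtxt
    └── 1/
        └── model.plan
```

Triton is pointed at this directory at startup (e.g. `tritonserver --model-repository=/path/to/model_repository`) and loads the models it finds there.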

High-Performance Inference

Triton supports all NVIDIA GPU-, x86-, and Arm® CPU-based inferencing. It offers features such as dynamic batching, concurrent model execution, optimal model configuration, model ensembles, and streaming inputs to maximize throughput and utilization.
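Dynamic batching and concurrent execution are enabled per model in its `config.pbtxt`. A minimal sketch, with a hypothetical model name and illustrative values:

```
name: "my_model"             # hypothetical model name
platform: "onnxruntime_onnx"
max_batch_size: 32

# Dynamic batching: Triton groups incoming requests into larger
# batches, waiting briefly to form a preferred batch size.
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 100
}

# Concurrent execution: run two instances of this model on each GPU.
instance_group [
  { count: 2, kind: KIND_GPU }
]
```

Tuning these values trades a small amount of queuing latency for significantly higher throughput and device utilization.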

Designed for DevOps and MLOps

Triton integrates with Kubernetes for orchestration and scaling, exports Prometheus metrics for monitoring, supports live model updates, and can be used on all major public cloud AI and Kubernetes platforms. It’s also integrated into many MLOps software solutions.


Fast and Scalable AI in Every Application

High Inference Throughput

Triton executes multiple models from the same or different frameworks concurrently on a single GPU or CPU. In a multi-GPU server, Triton automatically creates an instance of each model on each GPU to increase utilization.

It also optimizes serving for real-time inferencing under strict latency constraints, supports batch inferencing to maximize GPU and CPU utilization, and includes built-in support for streaming audio and video input. Triton supports model ensembles for use cases that require multiple models to perform end-to-end inference, such as conversational AI.
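An ensemble is itself defined as a model whose configuration wires the outputs of one step to the inputs of the next. As a sketch of a hypothetical two-step pipeline (all model and tensor names below are illustrative):

```
name: "preprocess_and_classify"   # hypothetical ensemble
platform: "ensemble"
max_batch_size: 8
input  [ { name: "RAW_IMAGE", data_type: TYPE_UINT8, dims: [ -1 ] } ]
output [ { name: "SCORES",    data_type: TYPE_FP32,  dims: [ 1000 ] } ]

ensemble_scheduling {
  step [
    {
      model_name: "preprocess"        # e.g. a Python backend model
      model_version: -1
      input_map  { key: "INPUT",  value: "RAW_IMAGE" }
      output_map { key: "OUTPUT", value: "image_tensor" }
    },
    {
      model_name: "classifier"        # e.g. an ONNX model
      model_version: -1
      input_map  { key: "input",  value: "image_tensor" }
      output_map { key: "output", value: "SCORES" }
    }
  ]
}
```

Clients call the ensemble like any other model; Triton routes intermediate tensors between steps on the server, avoiding extra network round trips.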

Models can be updated live in production without restarting Triton or the application. Triton enables multi-GPU, multi-node inference on very large models that cannot fit in a single GPU’s memory.


Highly Scalable Inference

Available as a Docker container, Triton integrates with Kubernetes for orchestration, metrics, and autoscaling. It also integrates with Kubeflow and Kubeflow Pipelines for end-to-end AI workflows, and it exports Prometheus metrics for monitoring GPU utilization, latency, memory usage, and inference throughput. Triton exposes standard HTTP/gRPC interfaces that connect with other applications, such as load balancers, and it can easily scale to any number of servers to handle increasing inference load for any model.
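The HTTP interface follows the KServe v2 inference protocol, so any HTTP client can talk to it. As a minimal sketch (the model and tensor names below are hypothetical), a client builds a JSON request body and posts it to the model's `infer` endpoint:

```python
import json

def build_infer_request(model_name, input_name, data):
    """Build the URL path and JSON body for a KServe v2 inference
    request: POST /v2/models/<model_name>/infer."""
    body = {
        "inputs": [
            {
                "name": input_name,       # the model's input tensor name
                "shape": [1, len(data)],  # a batch of one
                "datatype": "FP32",
                "data": data,
            }
        ]
    }
    return f"/v2/models/{model_name}/infer", json.dumps(body)

# Hypothetical model and input names, for illustration only.
url, payload = build_infer_request("my_model", "INPUT0", [0.1, 0.2, 0.3])
```

In practice the official `tritonclient` Python package wraps this protocol (and the gRPC equivalent), but the wire format is plain JSON as shown.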

Triton can serve tens or hundreds of models through a model control API. Models can be loaded into and unloaded from the inference server on demand, so only the models that currently fit in GPU or CPU memory need to be resident. Triton also supports heterogeneous clusters with both GPUs and CPUs, which helps standardize inference across platforms and allows scaling out to any CPU or GPU to handle peak loads.
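With the server started in explicit model control mode (e.g. `tritonserver --model-control-mode=explicit`), the model control API is reachable over HTTP. A sketch, assuming a server on `localhost:8000` and a hypothetical model name:

```
# Load a model into the server
curl -X POST localhost:8000/v2/repository/models/my_model/load

# Unload it when it is no longer needed
curl -X POST localhost:8000/v2/repository/models/my_model/unload

# List repository models and their load state
curl -X POST localhost:8000/v2/repository/index
```

An orchestration layer can drive these endpoints to rotate models in and out of memory as demand shifts.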

Key Triton Functionality

Triton Forest Inference Library (FIL) Backend

The new Forest Inference Library (FIL) backend provides support for high-performance inference of tree-based models with explainability (Shapley values) on CPUs and GPUs. It supports models from XGBoost, LightGBM, scikit-learn RandomForest, RAPIDS cuML RandomForest, and others in Treelite format.
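As a sketch of a hypothetical FIL deployment, the model's `config.pbtxt` selects the FIL backend and declares the serialized model type (names and dimensions below are illustrative):

```
name: "fraud_xgboost"        # hypothetical tree-based model
backend: "fil"
max_batch_size: 32768
input  [ { name: "input__0",  data_type: TYPE_FP32, dims: [ 28 ] } ]
output [ { name: "output__0", data_type: TYPE_FP32, dims: [ 2 ] } ]

parameters [
  { key: "model_type",   value: { string_value: "xgboost" } },
  { key: "output_class", value: { string_value: "true" } }
]
```

The `model_type` parameter tells the backend how the model file was serialized (e.g. XGBoost binary or JSON, LightGBM, or a Treelite checkpoint).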

Learn More

Triton Model Analyzer

Triton Model Analyzer is a tool that automatically evaluates Triton deployment configurations, such as batch size, precision, and the number of concurrent execution instances, on the target processor. It helps select the optimal configuration to meet application quality-of-service (QoS) constraints on latency, throughput, and memory, reducing the time needed to find that configuration from weeks to hours.
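As an illustrative invocation (the paths and model name are hypothetical), Model Analyzer sweeps configurations for a model from the command line:

```
model-analyzer profile \
    --model-repository /path/to/model_repository \
    --profile-models my_model \
    --output-model-repository-path /path/to/output_repository
```

It then reports measured latency, throughput, and memory for each candidate configuration so the best one can be promoted to production.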

Learn More

Ecosystem Integrations with NVIDIA Triton

AI is driving innovation across businesses of every size and scale, and NVIDIA AI is at the forefront of this innovation. An open-source software solution, Triton is the top choice for AI inference and model deployment. Triton is supported by Alibaba Cloud, Amazon Elastic Kubernetes Service (EKS), Amazon Elastic Container Service (ECS), Amazon SageMaker, Google Kubernetes Engine (GKE), Google Vertex AI, HPE Ezmeral, Microsoft Azure Kubernetes Service (AKS), and Azure Machine Learning. Discover why enterprises use Triton.

With NVIDIA LaunchPad, get immediate access to hosted infrastructure and experience Triton Inference Server through free curated labs.


Success Stories


Discover how Amazon improved customer satisfaction with 5X faster inference.

Read Blog

Learn how American Express improved fraud detection by analyzing tens of millions of daily transactions 50X faster.

Read Blog

Discover how Siemens Energy augmented physical inspections by providing AI-based remote monitoring for leaks, abnormal noises, and more.

LEARN MORE

Resources

Simplify AI Deployment at Scale

Simplify the deployment of AI models at scale in production. Learn how Triton meets the challenges of deploying AI models and review the steps to get started.

Download Overview

Watch GTC Sessions on Demand

Check out the latest on-demand sessions on Triton Inference Server from NVIDIA GTC.

Watch Now

Deploy AI Models

Get the latest news and updates, and learn more about key benefits on the NVIDIA Technical Blog.

Read Blogs

NVIDIA’s Program for Startups

NVIDIA Inception is a free program designed to help startups evolve faster through access to cutting-edge technology like NVIDIA Triton, NVIDIA experts, venture capitalists, and co-marketing support.

LEARN MORE