NVIDIA Triton Inference Server

NVIDIA Triton™ Inference Server simplifies the deployment of AI models at scale in production. Open-source inference serving software, it lets teams deploy trained AI models from any framework (TensorFlow, NVIDIA® TensorRT®, PyTorch, ONNX Runtime, or custom), from local storage or a cloud platform, on any GPU- or CPU-based infrastructure (cloud, data center, or edge).

Download Triton from NGC

Support for Multiple Frameworks

Triton Inference Server supports all major frameworks, including TensorFlow, TensorRT, PyTorch, and ONNX Runtime, as well as custom framework backends. It gives AI researchers and data scientists the freedom to choose the right framework for their projects.

High-Performance Inference

Triton runs models concurrently on GPUs to maximize utilization, supports CPU-based inferencing, and offers advanced features such as model ensembles and streaming inference. It helps developers bring models to production rapidly.

Designed for IT, DevOps, and MLOps

Available as a Docker container, Triton integrates with Kubernetes for orchestration and scaling, is part of Kubeflow, exports Prometheus metrics for monitoring, and can be used with cloud AI platforms such as Azure Machine Learning and Google Cloud AI Platform. It helps IT and DevOps teams streamline model deployment in production.


Simplified Model Deployment

Triton can load models from local storage or cloud platforms. As models are retrained with new data, developers can easily make updates without restarting the inference server or disrupting the application.
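As a minimal illustrative sketch of what querying a deployed model looks like from the client side, the example below uses the tritonclient Python package against a Triton server assumed to be running locally on its default HTTP port (8000). The model name "my_model" and the tensor names "INPUT0" and "OUTPUT0" are hypothetical placeholders, not part of any shipped example.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a locally running Triton server (default HTTP port).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build one FP32 input tensor for the hypothetical model "my_model".
input_data = np.random.rand(1, 16).astype(np.float32)
infer_input = httpclient.InferInput("INPUT0", list(input_data.shape), "FP32")
infer_input.set_data_from_numpy(input_data)

# Request the output tensor by name and run inference.
requested_output = httpclient.InferRequestedOutput("OUTPUT0")
response = client.infer("my_model", inputs=[infer_input], outputs=[requested_output])
print(response.as_numpy("OUTPUT0"))
```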

Triton runs multiple models from the same or different frameworks concurrently on a single GPU or CPU. In a multi-GPU server, it automatically creates an instance of each model on each GPU to increase utilization without extra coding.

It supports real-time inference, batch inference to maximize GPU/CPU utilization, and streaming inference with built-in support for audio streaming input. It also supports model ensembles for use cases that require multiple models to perform end-to-end inference, such as conversational AI.
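As a rough sketch of how a client can keep many independent requests in flight for the server to batch, the snippet below reuses the hypothetical "my_model" from the earlier example and assumes it has batching enabled on the server side.

```python
import numpy as np
import tritonclient.http as httpclient

# A connection pool lets several requests be outstanding at once.
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=8)

pending = []
for _ in range(32):
    data = np.random.rand(1, 16).astype(np.float32)
    infer_input = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
    infer_input.set_data_from_numpy(data)
    # async_infer returns immediately; the server is free to group the
    # in-flight requests into larger batches before running the model.
    pending.append(client.async_infer("my_model", inputs=[infer_input]))

# Block until every response has arrived and read the outputs.
results = [req.get_result().as_numpy("OUTPUT0") for req in pending]
print(f"received {len(results)} responses")
```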

Users can also take advantage of shared memory: inputs and outputs passed to and from Triton can be stored in shared memory, reducing HTTP/gRPC overhead and increasing performance.
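The sketch below registers a system shared-memory region for an input using the tritonclient shared-memory utilities (system shared memory requires a Linux host). It again assumes the hypothetical "my_model"; the region names and sizes are placeholders.

```python
import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm

client = httpclient.InferenceServerClient(url="localhost:8000")
client.unregister_system_shared_memory()  # start from a clean slate

# Place the input tensor in a system shared-memory region.
input_data = np.arange(16, dtype=np.float32).reshape(1, 16)
byte_size = input_data.size * input_data.itemsize
shm_handle = shm.create_shared_memory_region("input_data", "/input_data", byte_size)
shm.set_shared_memory_region(shm_handle, [input_data])

# Register the region with Triton so it can read the input directly from it.
client.register_system_shared_memory("input_data", "/input_data", byte_size)

infer_input = httpclient.InferInput("INPUT0", [1, 16], "FP32")
infer_input.set_shared_memory("input_data", byte_size)

response = client.infer("my_model", inputs=[infer_input])
print(response.as_numpy("OUTPUT0"))

# Clean up the region on both the server and the client side.
client.unregister_system_shared_memory("input_data")
shm.destroy_shared_memory_region(shm_handle)
```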


Dynamic Scalability

Available as a Docker container, Triton integrates with Kubernetes for orchestration, metrics, and auto-scaling. It also integrates with Kubeflow and Kubeflow Pipelines for end-to-end AI workflows. It exports Prometheus metrics for monitoring GPU utilization, latency, memory usage, and inference throughput. It supports the standard HTTP/gRPC interface to connect with other applications, such as load balancers, and can easily scale to any number of servers to handle increasing inference loads for any model.
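For example, a small sketch of pulling the Prometheus-format metrics that Triton exposes on its default metrics port (8002); the host name here is an assumption.

```python
import urllib.request

# Triton serves Prometheus-format metrics on port 8002 by default.
with urllib.request.urlopen("http://localhost:8002/metrics") as resp:
    metrics_text = resp.read().decode("utf-8")

# Show the inference-related counters (request counts, latencies, and so on).
for line in metrics_text.splitlines():
    if line.startswith("nv_inference_"):
        print(line)
```

In a Kubernetes deployment, the same endpoint is typically scraped by a Prometheus instance rather than read directly.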

Triton can serve tens or hundreds of models through its model control API. Models can be loaded into and unloaded from the inference server as requirements change, so that they fit in GPU or CPU memory. Support for heterogeneous clusters with both GPUs and CPUs helps standardize inference across platforms and dynamically scale out to any CPU or GPU to handle peak loads.
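A brief sketch of driving the model control API from the Python client, assuming the server was started in explicit model-control mode; the model name is again a hypothetical placeholder.

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Load a model into GPU/CPU memory on demand.
client.load_model("my_model")
print(client.is_model_ready("my_model"))

# Inspect everything in the model repository and its current state.
for entry in client.get_model_repository_index():
    print(entry)

# Unload the model to free memory for other models.
client.unload_model("my_model")
```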

View the latest Triton release notes on GitHub.


Triton is the Top Choice for Inference

AI is driving innovation across businesses of every size and scale. An open-source software solution, Triton is the top choice for AI inference and model deployment. Triton is supported by Alibaba Cloud, Amazon Elastic Kubernetes Service (EKS), Google Kubernetes Engine (GKE), Google Cloud AI Platform, HPE Ezmeral, Microsoft Azure Kubernetes Service (AKS), Azure Machine Learning, and Tencent Cloud. Discover why enterprises should use Triton.

Simplify AI Deployment at Scale

Simplify the deployment of AI models at scale in production. Learn how Triton meets the challenges of deploying AI models and review the steps to getting started with Triton.

Download the technical overview

Deep Learning Inference Platform

Achieve the performance, efficiency, and responsiveness critical to powering the next generation of AI products and services, whether in the cloud, in the data center, or at the edge.

Read about the AI Inference Platform

Deploy AI Deep Learning Models

Get the latest news and updates, release note highlights, and key benefits of using Triton Inference Server and NGC™ open-source software on the NVIDIA Developer Blog.

Read blogs


Download Triton Inference Server from NGC.

Download Triton from NGC