NVIDIA Triton Inference Server

NVIDIA Triton™ Inference Server delivers fast and scalable AI in production. As open-source inference serving software, Triton Inference Server streamlines AI inference by enabling teams to deploy trained AI models from any framework (TensorFlow, NVIDIA TensorRT®, PyTorch, ONNX, XGBoost, Python, custom, and more) on any GPU- or CPU-based infrastructure (cloud, data center, or edge).

Download Triton

Support for Multiple Frameworks

Triton Inference Server supports all major frameworks, such as TensorFlow, TensorRT, PyTorch, MXNet, Python, ONNX, RAPIDS FIL (for XGBoost, scikit-learn, and more), OpenVINO, custom C++, and others. Triton gives AI researchers and data scientists the freedom to choose the right framework for their projects.
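
For example, a Triton model repository is simply a directory tree with one folder per model and a config.pbtxt describing its backend and tensors. The sketch below scripts a minimal layout in Python; the model name "my_onnx_model", the ONNX Runtime backend, and the tensor names and shapes are illustrative assumptions, not part of any shipped example.

    # Minimal, illustrative Triton model repository layout.
    from pathlib import Path

    repo = Path("model_repository")              # passed to: tritonserver --model-repository=...
    version_dir = repo / "my_onnx_model" / "1"   # layout: <model-name>/<version>/<model file>
    version_dir.mkdir(parents=True, exist_ok=True)
    # Copy the trained model file here, e.g. version_dir / "model.onnx"

    config = """
    name: "my_onnx_model"
    backend: "onnxruntime"
    max_batch_size: 8
    input  [ { name: "INPUT0",  data_type: TYPE_FP32, dims: [ 3, 224, 224 ] } ]
    output [ { name: "OUTPUT0", data_type: TYPE_FP32, dims: [ 1000 ] } ]
    """
    (repo / "my_onnx_model" / "config.pbtxt").write_text(config)

Starting the server with tritonserver --model-repository=model_repository then loads every model found in the repository, whichever framework backend each one uses.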

High Performance Inference

Triton runs models concurrently on GPUs to maximize throughput and utilization, supports x86 and ARM CPU-based inferencing, and offers features like dynamic batching, Model Analyzer, model ensembles, and audio streaming. Triton helps developers deliver high-performance inference across cloud, on-prem, edge, and embedded devices.
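
As a sketch of what this looks like in practice, the fragment below extends the illustrative config.pbtxt above with concurrent model instances and dynamic batching; the instance count, preferred batch sizes, and queue delay are assumptions to be tuned per model, not recommended values.

    # Illustrative concurrency and dynamic-batching settings appended to a config.pbtxt.
    from pathlib import Path

    extra = """
    # Run two copies of the model on each available GPU so requests execute concurrently.
    instance_group [ { count: 2, kind: KIND_GPU } ]

    # Let Triton combine individual requests into server-side batches, waiting at most
    # 100 microseconds to assemble a preferred batch size of 4 or 8.
    dynamic_batching {
      preferred_batch_size: [ 4, 8 ]
      max_queue_delay_microseconds: 100
    }
    """
    config_path = Path("model_repository/my_onnx_model/config.pbtxt")
    config_path.write_text(config_path.read_text() + extra)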

Designed for DevOps and MLOps

Triton integrates with Kubernetes for orchestration and scaling, exports Prometheus metrics for monitoring, supports live model updates, and can be used in all major public cloud machine learning (ML) and managed Kubernetes platforms. Triton helps standardize model deployment in production.
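
For instance, a monitoring system, or a quick check during development, can read these metrics directly; the sketch below assumes a locally running server with the default metrics port 8002.

    # Read Triton's Prometheus metrics endpoint (default port 8002, path /metrics).
    import urllib.request

    with urllib.request.urlopen("http://localhost:8002/metrics") as resp:
        metrics = resp.read().decode("utf-8")

    # Show a couple of Triton metric families, such as successful request counts
    # and GPU utilization; a Prometheus server scrapes this same endpoint in production.
    for line in metrics.splitlines():
        if line.startswith(("nv_inference_request_success", "nv_gpu_utilization")):
            print(line)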


Simplified Model Deployment

NVIDIA Triton Inference Server simplifies and accelerates model deployment by:

  • Supporting all major deep learning and machine learning framework backends.
  • Running multiple models from the same or different frameworks concurrently on a single GPU or CPU. In a multi-GPU server, Triton automatically creates an instance of each model on each GPU to increase utilization.
  • Optimizing inference serving for real-time inferencing, batch inferencing to maximize GPU/CPU utilization, and streaming inference with built-in support for audio streaming input (see the client sketch after this list). Triton also supports model ensembles for use cases that require multiple models to perform end-to-end inference, such as conversational AI.
  • Dynamically batching input requests for high throughput and utilization under strict latency constraints.
  • Updating models live in production without restarting the inference server or disrupting the application.
  • Using Model Analyzer to automatically find the optimal model configuration to maximize performance.
  • Supporting multi-GPU, multi-node inference for large models.
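
As a concrete illustration of real-time inferencing against such a deployment, the sketch below sends one request with Triton's Python HTTP client (installed with pip install tritonclient[http]); the model name and the INPUT0/OUTPUT0 tensor names and shapes follow the illustrative config above rather than any particular shipped model.

    # One real-time inference request via Triton's HTTP/REST API.
    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Build a single request; with dynamic batching enabled, Triton can merge many
    # such requests from different clients into one server-side batch.
    infer_input = httpclient.InferInput("INPUT0", [1, 3, 224, 224], "FP32")
    infer_input.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))
    requested_output = httpclient.InferRequestedOutput("OUTPUT0")

    result = client.infer(model_name="my_onnx_model",
                          inputs=[infer_input],
                          outputs=[requested_output])
    print(result.as_numpy("OUTPUT0").shape)

The same request could be sent over gRPC with tritonclient.grpc without changing the model or the server configuration.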


Dynamic Scalability

Available as a Docker container, Triton integrates with Kubernetes for orchestration, metrics, and autoscaling. It also integrates with Kubeflow and Kubeflow Pipelines for end-to-end AI workflows and exports Prometheus metrics for monitoring GPU utilization, latency, memory usage, and inference throughput. Triton supports the standard HTTP/gRPC interface to connect with other applications, such as load balancers, and can easily scale to any number of servers to handle increasing inference loads for any model.
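
For example, a load balancer health check or a Kubernetes readiness probe can poll the standard HTTP endpoints directly; the host and default port 8000 below are assumptions for a locally running server.

    # Standard health endpoints used by load balancers and Kubernetes probes.
    import urllib.error
    import urllib.request

    def triton_ready(base_url: str = "http://localhost:8000") -> bool:
        """Return True if the server is live and ready to accept inference requests."""
        try:
            live = urllib.request.urlopen(base_url + "/v2/health/live")
            ready = urllib.request.urlopen(base_url + "/v2/health/ready")
            return live.status == 200 and ready.status == 200
        except urllib.error.URLError:
            return False

    print(triton_ready())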

Triton can serve tens or hundreds of models through its model control API. Models can be loaded into and unloaded from the inference server on demand to fit in GPU or CPU memory. Support for heterogeneous clusters with both GPUs and CPUs helps standardize inference across platforms and lets deployments dynamically scale out to any CPU or GPU to handle peak loads.
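
A sketch of what that looks like with the model control API, assuming a local server started with explicit model control (for example, tritonserver --model-control-mode=explicit) and an illustrative model named "my_onnx_model" in the repository:

    # Load and unload models at runtime without restarting the server.
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    client.load_model("my_onnx_model")           # bring the model into GPU/CPU memory
    print(client.get_model_repository_index())   # list every model and its current state
    client.unload_model("my_onnx_model")         # free its memory, leaving the server running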


Triton is the Top Choice for Inference

AI is driving innovation across businesses of every size and scale. An open-source software solution, Triton is the top choice for AI inference and model deployment. Triton is supported by Alibaba Cloud, Amazon Elastic Kubernetes Service (EKS), Amazon Elastic Container Service (ECS), Amazon SageMaker, Google Kubernetes Engine (GKE), Google Vertex AI, HPE Ezmeral, Microsoft Azure Kubernetes Service (AKS), Azure Machine Learning, and Tencent Cloud. Discover why enterprises use Triton.

Simplify AI Deployment at Scale

Simplify the deployment of AI models at scale in production. Learn how Triton meets the challenges of deploying AI models and review the steps to get started.

Download Overview

Deep Learning Inference Platform

Achieve the performance, efficiency, and responsiveness critical to powering the next generation of AI products and services—in the cloud, in the data center, and at the edge.

Learn More

Deploy AI Deep Learning Models

Get the latest news and updates, and learn more about key benefits on the NVIDIA Developer Blog.

Read blogs

Product Documentation

See what’s new and find out more about the latest features in the Triton release notes.

Read on GitHub

Download Triton Inference Server from NGC.

Download Triton from NGC