NVIDIA Triton Inference Server
NVIDIA Triton™ Inference Server is open-source inference serving software that helps standardize model deployment and execution, delivering fast and scalable AI in production.
What is NVIDIA Triton?
Triton Inference Server, part of the NVIDIA AI platform, streamlines and standardizes AI inference by enabling teams to deploy, run, and scale trained AI models from any framework on any GPU- or CPU-based infrastructure. It provides AI researchers and data scientists the freedom to choose the right framework for their projects without impacting production deployment. It also helps developers deliver high-performance inference across cloud, on-prem, edge, and embedded devices.
Explore the benefits.
Support for multiple frameworks.
Triton supports all major training and inference frameworks, such as TensorFlow, NVIDIA® TensorRT™, PyTorch, MXNet, Python, ONNX, XGBoost, scikit-learn, RandomForest, OpenVINO, custom C++, and more.
High-performance inference.
Triton supports all NVIDIA GPU-, x86-, Arm® CPU-, and AWS Inferentia-based inferencing. It offers dynamic batching, concurrent execution, optimal model configuration, model ensembles, and streaming audio/video inputs to maximize throughput and utilization.
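Dynamic batching and concurrent execution are typically enabled in a model's `config.pbtxt`. The sketch below is illustrative, not a drop-in configuration: the model name, batch sizes, queue delay, and instance count are hypothetical values you would tune for your own model and hardware.

```protobuf
name: "resnet50"             # hypothetical model name
platform: "tensorrt_plan"
max_batch_size: 32

# Let Triton group individual requests into larger batches,
# waiting up to 100 µs to form a preferred batch size.
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 100
}

# Run two concurrent execution instances of this model per GPU.
instance_group [
  { count: 2 kind: KIND_GPU }
]
```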
Designed for DevOps and MLOps.
Triton integrates with Kubernetes for orchestration and scaling, exports Prometheus metrics for monitoring, supports live model updates, and can be used in all major public cloud AI and Kubernetes platforms. It’s also integrated in many MLOps software solutions.
An integral part of NVIDIA AI.
The NVIDIA AI platform, which includes Triton, gives enterprises the compute power, tools, and algorithms they need to succeed in AI, accelerating workloads from speech recognition and recommender systems to medical imaging and improved logistics.
Fast and scalable AI in every application.
Achieve high-throughput inference.
Triton executes multiple models from the same or different frameworks concurrently on a single GPU or CPU. In a multi-GPU server, Triton automatically creates an instance of each model on each GPU to increase utilization.
It also optimizes serving for real-time inferencing under strict latency constraints with dynamic batching, supports batch inferencing to maximize GPU and CPU utilization, and includes built-in support for audio and video streaming input. Triton supports model ensemble for use cases that require a pipeline of multiple models to perform end-to-end inference, such as conversational AI.
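A model ensemble is itself declared as a model whose configuration wires the output tensors of one step into the inputs of the next. The following is a hedged sketch of a hypothetical two-step audio pipeline; every model name, tensor name, and dimension here is illustrative.

```protobuf
name: "speech_pipeline"      # hypothetical ensemble name
platform: "ensemble"
max_batch_size: 8
input [ { name: "AUDIO" data_type: TYPE_FP32 dims: [ -1 ] } ]
output [ { name: "TEXT" data_type: TYPE_STRING dims: [ 1 ] } ]

ensemble_scheduling {
  step [
    {
      # Step 1: turn raw audio into features.
      model_name: "feature_extractor"
      model_version: -1
      input_map { key: "INPUT" value: "AUDIO" }
      output_map { key: "FEATURES" value: "features" }
    },
    {
      # Step 2: feed those features to the recognition model.
      model_name: "asr_model"
      model_version: -1
      input_map { key: "INPUT" value: "features" }
      output_map { key: "OUTPUT" value: "TEXT" }
    }
  ]
}
```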
Models can be updated live in production without restarting Triton or the application. Triton enables multi-GPU, multi-node inference on very large models that cannot fit in a single GPU’s memory.
Scale inference with ease.
Available as a Docker container, Triton integrates with Kubernetes for orchestration, metrics, and autoscaling. Triton also integrates with Kubeflow and KServe for an end-to-end AI workflow and exports Prometheus metrics for monitoring GPU utilization, latency, memory usage, and inference throughput. It supports the standard HTTP/gRPC interface to connect with other applications like load balancers and can easily scale to any number of servers to handle increasing inference loads for any model.
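Triton's HTTP endpoint follows the KServe v2 inference protocol, so any HTTP client can talk to it. The sketch below only builds the JSON request body; the input name, shape handling (flat 1-D for brevity), and server address in the comment are assumptions, not a definitive client implementation.

```python
import json

def build_infer_request(input_name, data, datatype="FP32"):
    """Build a KServe v2 inference request body for Triton's HTTP endpoint.

    `data` is a flat Python list; the shape is inferred as 1-D here
    purely for brevity.
    """
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": [len(data)],
                "datatype": datatype,
                "data": data,
            }
        ]
    }

# The body would be POSTed to http://<host>:8000/v2/models/<model>/infer
body = build_infer_request("INPUT0", [1.0, 2.0, 3.0])
print(json.dumps(body))
```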
Triton can serve tens or hundreds of models through a model control API. Models can be loaded into and unloaded from the inference server on demand to fit in GPU or CPU memory. Support for heterogeneous clusters with both GPUs and CPUs helps standardize inference across platforms, and Triton dynamically scales out to any CPU or GPU to handle peak loads.
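When Triton runs in explicit model-management mode, load and unload requests are POSTs to its repository API. This helper only constructs the URL; the host, the default HTTP port 8000, and the example model name are assumptions for illustration.

```python
def model_control_url(host, model, action):
    """URL for Triton's model control API (explicit management mode).

    `action` is "load" or "unload"; the path follows Triton's
    v2 repository endpoint layout.
    """
    assert action in ("load", "unload")
    return f"http://{host}:8000/v2/repository/models/{model}/{action}"

# A POST to this URL (empty body) asks Triton to load the model.
print(model_control_url("localhost", "resnet50", "load"))
```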
Take a closer look at Triton functionality.
Model orchestration with management service.
Triton brings new model orchestration functionality for efficient multi-model inference. Running as a production service, it loads models on demand and unloads them when not in use. It allocates GPU resources efficiently by placing as many models as possible on a single GPU server and can group models from different frameworks for efficient memory use. The model orchestration feature is in private early access (EA).
Large language model inference.
Models are growing rapidly in size, especially in natural language processing, e.g., the 175B-parameter GPT-3 and 530B-parameter Megatron models. GPUs are the natural compute resource for models this large, but they can no longer fit on a single GPU. Triton can partition a model into multiple smaller pieces and execute each on a separate GPU within or across servers. The FasterTransformer backend in Triton, which enables this multi-GPU, multi-node inference, provides optimized and scalable inference for GPT-family, T5, OPT, and UL2 models today.
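The core idea behind this partitioning can be illustrated with a toy example: split one large linear operation into shards, compute each shard independently (on its own GPU, in the real system), and combine the partial results. This pure-Python sketch stands in for tensor-parallel execution and makes no claim about FasterTransformer's actual implementation.

```python
def sharded_dot(weights, activations, num_shards):
    """Toy illustration of tensor parallelism: split a dot product
    across `num_shards` workers and sum the partial results.
    In real multi-GPU inference, each shard lives on its own device."""
    n = len(weights)
    partials = []
    for s in range(num_shards):
        lo = s * n // num_shards
        hi = (s + 1) * n // num_shards
        partials.append(
            sum(w * a for w, a in zip(weights[lo:hi], activations[lo:hi]))
        )
    return sum(partials)

w = [0.5, 1.0, 2.0, 4.0]
x = [1.0, 2.0, 3.0, 4.0]
print(sharded_dot(w, x, 2))  # same result as the unsharded dot product
```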
Optimal model configuration with Model Analyzer.
Triton’s Model Analyzer is a tool that automatically evaluates Triton deployment configurations, such as batch size, precision, and the number of concurrent execution instances, on the target processor. It helps select the optimal configuration to meet application quality-of-service (QoS) constraints on latency, throughput, and memory, and reduces the time needed to find that configuration from weeks to hours.
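Conceptually, the selection step amounts to a constrained search over measured configurations. This toy sketch shows that idea only; the candidate list, field names, and numbers are made up, and Model Analyzer's real search and measurement machinery is far more involved.

```python
def pick_config(candidates, max_latency_ms):
    """Toy sketch of what Model Analyzer automates: among measured
    (batch_size, instances, latency, throughput) candidates, pick the
    highest-throughput configuration that meets the latency constraint."""
    feasible = [c for c in candidates if c["latency_ms"] <= max_latency_ms]
    return max(feasible, key=lambda c: c["throughput"]) if feasible else None

# Hypothetical measurements for one model on one GPU.
measured = [
    {"batch_size": 1,  "instances": 1, "latency_ms": 4,  "throughput": 250},
    {"batch_size": 8,  "instances": 2, "latency_ms": 12, "throughput": 1300},
    {"batch_size": 32, "instances": 2, "latency_ms": 45, "throughput": 2100},
]
print(pick_config(measured, max_latency_ms=20))  # the batch-8 configuration
```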
Tree-based model inference with a Forest Inference Library (FIL) backend.
The Forest Inference Library (FIL) backend in Triton provides high-performance inference of tree-based models with explainability (SHAP values) on CPUs and GPUs. It supports models from XGBoost, LightGBM, scikit-learn RandomForest, RAPIDS™ cuML RandomForest, and other libraries in Treelite format.
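The operation FIL accelerates across whole forests is, at its core, a tree traversal per input. The minimal sketch below shows that traversal for a single hypothetical tree; the node layout is invented for illustration and is not FIL's or Treelite's internal representation.

```python
def predict_tree(tree, x):
    """Toy traversal of one decision tree, the per-tree operation a
    forest-inference engine repeats over thousands of trees.
    Internal nodes: {"feature", "threshold", "left", "right"};
    leaves: {"leaf": value}."""
    node = tree
    while "leaf" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

# A hypothetical one-split stump: route on feature 0 at threshold 0.5.
stump = {
    "feature": 0,
    "threshold": 0.5,
    "left": {"leaf": 0.0},
    "right": {"leaf": 1.0},
}
print(predict_tree(stump, [0.3]))  # 0.0
print(predict_tree(stump, [0.9]))  # 1.0
```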
See ecosystem integrations.
AI is driving innovation across businesses of every size and scale, and NVIDIA AI is at the forefront of this innovation. An open-source software solution, Triton is the top choice for AI inference and model deployment. Triton is supported by Alibaba Cloud, Amazon Elastic Kubernetes Service (EKS), Amazon Elastic Container Service (ECS), Amazon SageMaker, Google Kubernetes Engine (GKE), Google Vertex AI, HPE Ezmeral, Microsoft Azure Kubernetes Service (AKS), and Azure Machine Learning. Discover why enterprises use Triton.
With NVIDIA LaunchPad, get immediate access to hosted infrastructure and experience Triton Inference Server through free curated labs.
Read success stories.
Discover how Amazon improved customer satisfaction with NVIDIA AI by accelerating its inference 5X.
Learn how American Express improved fraud detection by analyzing tens of millions of daily transactions 50X faster.
Discover more resources.
Simplify and standardize AI deployment at scale.
Simplify the deployment of AI models at scale in production. Learn how Triton meets the challenges of deploying AI models and review the steps to get started.
Watch Triton GTC sessions on demand.
Check out the latest on-demand sessions on Triton Inference Server from NVIDIA GTC.
Deploy AI models.
Read the latest news and blogs on NVIDIA Triton, and learn how to streamline your AI inference deployment.
Meet NVIDIA’s program for startups.
NVIDIA Inception is a free program designed to help startups evolve faster through access to cutting-edge technology like NVIDIA Triton, NVIDIA experts, venture capitalists, and co-marketing support.
Experience enterprise-ready AI inference.
Access to reliable support is often vital to organizations scaling AI in production. Global NVIDIA Enterprise Support for NVIDIA Triton is available with NVIDIA AI Enterprise, including guaranteed response times, priority security notifications, regular updates, and access to NVIDIA AI experts.
Have an NVIDIA H100? Learn how to activate your NVIDIA AI Enterprise software.