NVIDIA Triton Inference Server
NVIDIA® Triton Inference Server (formerly NVIDIA TensorRT Inference Server) simplifies the deployment of AI models at scale in production. It is open source inference serving software that lets teams deploy trained AI models from any framework (TensorFlow, TensorRT, PyTorch, ONNX Runtime, or a custom framework) and from local storage, Google Cloud Storage, or AWS S3, on any GPU- or CPU-based infrastructure (cloud, data center, or edge).
Support for Multiple Frameworks
Triton Inference Server supports all major frameworks, such as TensorFlow, TensorRT, PyTorch, and ONNX Runtime, as well as custom framework backends. It gives AI researchers and data scientists the freedom to choose the right framework for their models.
High Performance Inference
It runs models concurrently on GPUs to maximize utilization, supports CPU-based inferencing, and offers advanced features such as model ensembles and streaming inference. It helps developers bring models to production rapidly.
Designed for IT and DevOps
Available as a Docker container, it integrates with Kubernetes for orchestration and scaling, is part of Kubeflow, and exports Prometheus metrics for monitoring. It helps IT and DevOps teams streamline model deployment in production.
Simplified Model Deployment
NVIDIA Triton Inference Server can load models from local storage, Google Cloud Storage, or AWS S3. As models are continuously retrained with new data, developers can update them without restarting the inference server and without any disruption to the application.
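The repository Triton loads models from follows a simple directory convention: one directory per model, containing a configuration file and one numbered subdirectory per model version. A minimal sketch, using a hypothetical model named `my_model`:

```
model_repository/
└── my_model/
    ├── config.pbtxt      # model configuration
    ├── 1/
    │   └── model.onnx    # version 1
    └── 2/
        └── model.onnx    # version 2 (retrained)
```

Dropping a new numbered version directory into the repository is how a retrained model reaches the server without a restart; which versions are served is governed by the model's version policy.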
Triton Inference Server runs multiple models from the same or different frameworks concurrently on a single GPU using CUDA streams. In a multi-GPU server, it automatically creates an instance of each model on each GPU. All of this increases GPU utilization without any extra coding from the user.
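How many instances of a model run, and where, can be controlled in the model's `config.pbtxt` through an `instance_group` entry. A hedged fragment for a hypothetical model (the counts and GPU indices are illustrative):

```
# config.pbtxt fragment: run two instances of this model
# on each of GPUs 0 and 1 so requests execute concurrently.
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
```

Without this setting, Triton's default is one instance of the model per available GPU.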
The inference server supports both low-latency real-time inferencing and batch inferencing to maximize GPU/CPU utilization. It also has built-in support for streaming audio input, enabling streaming inference.
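Batching of individual requests on the server side is configured per model with a `dynamic_batching` entry in `config.pbtxt`. A sketch with illustrative values, trading a small queuing delay for larger batches:

```
# config.pbtxt fragment: let Triton combine incoming requests
# into batches of up to max_batch_size on the server side.
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

`max_queue_delay_microseconds` bounds how long a request may wait for batch-mates, so latency-sensitive deployments can keep it small.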
Users can take advantage of shared memory support for higher performance: inputs and outputs that need to be passed to and from Triton Inference Server are stored in system or CUDA shared memory. This reduces HTTP/gRPC overhead, increasing overall performance.
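A sketch of the client-side flow with the `tritonclient` Python package: the input tensor is placed in a system shared memory region that is registered with the server, so the request carries only a reference instead of the tensor bytes. It assumes a Triton server running on `localhost:8000` serving a hypothetical model `my_model` with an input named `INPUT0`; it will not run without one.

```python
import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.ones((1, 16), dtype=np.float32)
byte_size = data.size * data.itemsize

# Create a system shared memory region, copy the input into it,
# and register the region with the server.
handle = shm.create_shared_memory_region("input_region", "/input_key", byte_size)
shm.set_shared_memory_region(handle, [data])
client.register_system_shared_memory("input_region", "/input_key", byte_size)

# The request now references the region instead of carrying the tensor
# in the HTTP body.
inp = httpclient.InferInput("INPUT0", data.shape, "FP32")
inp.set_shared_memory("input_region", byte_size)
result = client.infer(model_name="my_model", inputs=[inp])
```

Outputs can be mapped to a registered region in the same way, avoiding the copy in both directions.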
It also supports model ensembles. An ensemble is a pipeline of one or more models and the connections between their input and output tensors (custom backends can also be included). Ensembles can be used to deploy a sequence of models for pre/post-processing, or for use cases such as conversational AI that require multiple models to perform end-to-end inference.
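An ensemble is itself defined by a `config.pbtxt` whose `ensemble_scheduling` section wires the member models together. A hedged sketch for a hypothetical two-step pipeline (`preprocess` followed by `classifier`; all model and tensor names are illustrative):

```
# config.pbtxt for a hypothetical ensemble: preprocess -> classifier
name: "my_ensemble"
platform: "ensemble"
max_batch_size: 8
input [ { name: "RAW_IMAGE", data_type: TYPE_UINT8, dims: [ -1 ] } ]
output [ { name: "SCORES", data_type: TYPE_FP32, dims: [ 1000 ] } ]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: -1
      input_map { key: "IN", value: "RAW_IMAGE" }
      output_map { key: "OUT", value: "preprocessed" }
    },
    {
      model_name: "classifier"
      model_version: -1
      input_map { key: "IN", value: "preprocessed" }
      output_map { key: "OUT", value: "SCORES" }
    }
  ]
}
```

The intermediate tensor (`preprocessed` here) never leaves the server, so clients make a single request for the whole pipeline.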
Designed for Scalability
Available as a Docker container, Triton Inference Server integrates with Kubernetes for orchestration, metrics, and autoscaling. It also integrates with Kubeflow and Kubeflow Pipelines for an end-to-end AI workflow. Triton Inference Server exports Prometheus metrics for monitoring GPU utilization, latency, memory usage, and inference throughput. It supports the standard HTTP/gRPC interface for connecting with other applications such as load balancers, and it can easily scale to any number of servers to handle increasing inference loads for any model. Kubernetes pod scaling can use the exported metrics to scale the number of Triton Inference Server instances up or down to match changing inference demand.
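One way to wire the exported metrics into pod scaling is a Kubernetes HorizontalPodAutoscaler driven by a Prometheus-derived custom metric. A sketch, assuming a deployment named `triton-inference-server` and an adapter (such as prometheus-adapter) that exposes a per-pod queue-latency metric; the metric name and threshold are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-inference-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: avg_time_queue_us   # illustrative Prometheus-derived metric
        target:
          type: AverageValue
          averageValue: "5000"      # scale out when requests queue too long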
Triton Inference Server can serve tens or hundreds of models through its model control API. Models can be explicitly loaded into and unloaded from the inference server based on changes made in the model control configuration, so that the active set fits in GPU or CPU memory.
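Explicit control is enabled when the server is started, after which individual models can be loaded and unloaded over the HTTP repository API. A sketch, assuming a repository mounted at `/models` and a hypothetical model named `my_model`:

```
# Start Triton with explicit model control (models load only on request):
tritonserver --model-repository=/models --model-control-mode=explicit

# Load and unload a model via the HTTP repository API:
curl -X POST localhost:8000/v2/repository/models/my_model/load
curl -X POST localhost:8000/v2/repository/models/my_model/unload
```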
Triton Inference Server can also serve models on CPUs. Supporting a heterogeneous cluster with both GPUs and CPUs helps standardize inference across platforms and enables dynamic scale-out to any CPU or GPU to handle peak loads.
Triton Inference Server can be used to deploy models in the cloud, in an on-premises data center, or at the edge. Its open source code can also be customized for non-container environments.
Get started with NVIDIA Triton Inference Server with this Quick Start Guide.