NVIDIA Triton Inference Server
NVIDIA® Triton Inference Server (formerly NVIDIA TensorRT Inference Server) simplifies the deployment of AI models at scale in production. It is open source inference serving software that lets teams deploy trained AI models from any framework (TensorFlow, TensorRT, PyTorch, ONNX Runtime, or a custom framework), from local storage, Google Cloud Storage, or AWS S3, on any GPU- or CPU-based infrastructure (cloud, data center, or edge).
Support for Multiple Frameworks
Triton Inference Server supports all major frameworks, including TensorFlow, TensorRT, PyTorch, and ONNX Runtime, as well as custom framework backends. This gives AI researchers and data scientists the freedom to choose the right framework for their work.
High Performance Inference
It runs models concurrently on GPUs to maximize utilization, supports CPU-based inferencing, and offers advanced features such as model ensembles and streaming inference, helping developers bring models to production rapidly.
Designed for IT and DevOps
Available as a Docker container, it integrates with Kubernetes for orchestration and scaling, is part of Kubeflow, and exports Prometheus metrics for monitoring, helping IT and DevOps teams streamline model deployment in production.
Simplified Model Deployment
NVIDIA Triton Inference Server can load models from local storage, Google Cloud Storage, or AWS S3. As models are continuously retrained with new data, developers can update them without restarting the inference server and without any disruption to the application.
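Models are served from a model repository with a simple directory layout: one directory per model, containing a configuration file and one numbered subdirectory per model version. A minimal sketch (the model name and file are illustrative):

```
model_repository/
└── densenet_onnx/
    ├── config.pbtxt       # model configuration
    └── 1/                 # version 1 of the model
        └── model.onnx
```

Adding a new model version is just a matter of adding a new numbered subdirectory, which Triton can pick up without a server restart.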
Triton Inference Server runs multiple models from the same or different frameworks concurrently on a single GPU using CUDA streams. On a multi-GPU server, it automatically creates an instance of each model on each GPU. All of this increases GPU utilization without any extra coding from the user.
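Concurrency is controlled per model through the instance_group setting in its config.pbtxt. For example, the following sketch (the count is illustrative) asks Triton to run two execution instances of the model on each available GPU:

```
instance_group [
  {
    count: 2        # two execution instances of this model
    kind: KIND_GPU  # place the instances on GPUs
  }
]
```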
The inference server supports both low-latency real-time inference and batch inference to maximize GPU/CPU utilization. It also has built-in support for streaming audio input, enabling streaming inference.
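Batching behavior can be tuned per model through the dynamic batching settings in config.pbtxt. A sketch with illustrative values, which lets Triton group individual requests into larger batches while waiting at most 100 microseconds to form one:

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]      # batch sizes Triton tries to build
  max_queue_delay_microseconds: 100   # max wait to accumulate a batch
}
```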
Users can take advantage of shared memory support for higher performance: the inputs and outputs that need to be passed to and from Triton Inference Server are stored in system or CUDA shared memory. This reduces HTTP/gRPC data-transfer overhead, increasing overall performance.
It also supports model ensembles. An ensemble is a pipeline of one or more models, with the output tensors of one model connected to the input tensors of the next (custom backends can be included as well). Ensembles can be used to deploy a sequence of models for pre/post-processing, or for use cases such as conversational AI that require multiple models to perform end-to-end inference.
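An ensemble is itself defined in a config.pbtxt, using the ensemble platform and a scheduling section that wires tensors between the steps. A minimal two-step sketch (model names, tensor names, and shapes are illustrative):

```
name: "preprocess_and_classify"
platform: "ensemble"
input [ { name: "RAW_IMAGE" data_type: TYPE_UINT8 dims: [ -1 ] } ]
output [ { name: "SCORES" data_type: TYPE_FP32 dims: [ 1000 ] } ]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: -1
      input_map { key: "INPUT" value: "RAW_IMAGE" }
      output_map { key: "OUTPUT" value: "preprocessed_image" }
    },
    {
      model_name: "classifier"
      model_version: -1
      input_map { key: "INPUT" value: "preprocessed_image" }
      output_map { key: "OUTPUT" value: "SCORES" }
    }
  ]
}
```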
Designed for Scalability
Available as a Docker container, Triton Inference Server integrates with Kubernetes for orchestration, metrics, and auto-scaling, and with Kubeflow and Kubeflow Pipelines for an end-to-end AI workflow. It exports Prometheus metrics for monitoring GPU utilization, latency, memory usage, and inference throughput, and it supports the standard HTTP/gRPC interface to connect with other applications such as load balancers. It can easily scale to any number of servers to handle increasing inference loads for any model, and Kubernetes pod autoscaling can use the exposed metrics to scale the number of Triton Inference Server instances up or down to match changing inference demand.
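By default, Triton exposes its Prometheus metrics over HTTP on port 8002 at /metrics, so a scrape configuration can be as simple as the following sketch (the target host name is an assumption):

```yaml
scrape_configs:
  - job_name: "triton"
    static_configs:
      # Triton serves Prometheus metrics on port 8002 at /metrics by default
      - targets: ["triton-service:8002"]
```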
Triton Inference Server can serve tens or hundreds of models through its model control API. Models can be explicitly loaded into and unloaded from the inference server based on changes to the model control configuration, so that they fit in GPU or CPU memory.
Triton Inference Server can be used to serve models on CPUs too. Supporting a heterogeneous cluster with both GPUs and CPUs helps standardize inference across platforms and helps dynamically scale out to any CPU or GPU to handle peak loads.
Triton Inference Server can be used to deploy models in the cloud, in an on-premises data center, or at the edge. Its open source code can also be customized for non-container environments.
Announcing Triton 2.3
- KFServing’s new community standard gRPC and HTTP/REST protocols
- Triton is the first inference serving software to adopt KFServing’s new community-standard gRPC and HTTP/REST data plane v2 protocols. With this integration, users can now easily deploy serverless inferencing with Triton in Kubernetes; see the example of BERT inference with Triton in KFServing.
- Support for the latest versions of framework backends
- TensorRT 7.1, TensorFlow 2.2, PyTorch 1.6, ONNX Runtime 1.4
- Python Custom Backend
- The new Python custom backend allows arbitrary Python code (for example, pre- and post-processing) to be executed inside Triton
- Support for A100, MIG
- Inference with Triton on the A100 GPU provides higher performance than on the V100. Users can also use Triton to serve inference on individual MIG instances with performance and fault isolation.
- Decoupled inference serving
- The decoupled mode enables Triton to engage a model once sufficient, but not all, inputs have been received. Available for C/C++ custom backends, this is useful for use cases such as speech recognition and synthesis.
- Triton Model Analyzer - A collection of tools to characterize model performance and memory footprint for efficient serving.
- Triton’s perf_client, now perf_analyzer, helps characterize the throughput and latency of a model for various batch sizes and request concurrency values.
- A new memory analyzer functionality helps characterize the memory footprint of a model for various batch sizes and request concurrency values.
- Azure Machine Learning integration
- Azure Machine Learning users can now use Triton to serve their models. Please read this blog from Microsoft to learn more about this integration and how to use Triton on Azure Machine Learning.
- DeepStream 5.0 Integration
- Triton is natively integrated in DeepStream 5.0 to support multiple deep learning frameworks for AI based multi-sensor streaming analytics.
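A model served through the Python custom backend described above is configured like any other Triton model, with the backend field set to python; the Python code itself lives in a model.py file in the version directory. A sketch with illustrative names and shapes:

```
name: "preprocessor"
backend: "python"
input [ { name: "INPUT0" data_type: TYPE_FP32 dims: [ 4 ] } ]
output [ { name: "OUTPUT0" data_type: TYPE_FP32 dims: [ 4 ] } ]
```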
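An inference request in the v2 HTTP/REST protocol adopted above is a small JSON document naming the input tensors. A minimal sketch of building such a request body in Python (the model name "simple", the tensor names, and the shape are illustrative assumptions):

```python
import json

# Illustrative v2-protocol inference request body; the tensor names,
# shape, and data are assumptions for this sketch.
request_body = {
    "inputs": [
        {
            "name": "INPUT0",
            "shape": [1, 4],
            "datatype": "FP32",  # v2-protocol datatype string
            "data": [0.1, 0.2, 0.3, 0.4],
        }
    ],
    "outputs": [{"name": "OUTPUT0"}],
}

# The body would be POSTed to an endpoint of the form
#   http://<server>:8000/v2/models/simple/infer
payload = json.dumps(request_body)
print(payload)
```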
Read the blog to learn more about 2.3 features.
Get started with NVIDIA Triton Inference Server with this Quick Start Guide.