
Fast and Scalable AI Model Deployment with NVIDIA Triton Inference Server


AI is a new way to write software, and AI inference is running this software. Machine learning is unlocking breakthrough applications in fields such as online product recommendations, image classification, chatbots, forecasting, and manufacturing quality inspection. Building a platform for production inference, however, is very hard. Here are the reasons why.


Applications have different requirements, which require different optimizations based on users’ needs. 

  • Some want to continuously process streaming data from sensors like cameras or microphones.
  • Some want optimized throughput at the lowest cost, like processing satellite images to create high-fidelity maps.
  • Some require real-time responsiveness, like a conversational AI chatbot.

Model Types

Different use cases leverage different model types, which generate very different computational graphs that should be compiled for optimal performance. Some of the popular model types are convolutional neural networks (CNNs), recurrent neural networks (RNNs), Transformers, decision trees, random forests, and graph neural networks.


Models are trained in different frameworks and come in different formats: TensorFlow, PyTorch, TensorRT, ONNX, MXNet, XGBoost, and so on.

AI platform

Applications run on top of many different AI platforms: homegrown platforms, commercial platforms like Cloudera or Domino Data Lab, or public cloud ML platforms like Amazon SageMaker, Azure ML, or Google Vertex AI.

Computing Processor

Choosing a computing processor is the next step of the long journey. Different generations of GPUs and CPUs are available. Whether to run models on bare metal or on virtualized machines is another consideration.

Deployment environments

Inference deployment environments vary depending on the application: public cloud, on-premises core (data center), enterprise edge, and embedded devices.

A given inference platform should consider all the preceding factors, which is why building an efficient, high-performance platform is hard.

NVIDIA Triton Inference Server

NVIDIA Triton Inference Server is open source inference-serving software for fast and scalable AI in applications. It can help satisfy many of the preceding considerations of an inference platform. Here is a summary of the features. For more information, see the Triton Inference Server README on GitHub.

  • NVIDIA Triton can be used to deploy models from all popular frameworks. It supports TensorFlow 1.x and 2.x, PyTorch, ONNX, TensorRT, RAPIDS FIL (for XGBoost, Scikit-learn Random Forest, LightGBM), OpenVINO, Python, and even custom C++ backends.
  • NVIDIA Triton optimizes inference for multiple query types – real time, batch, streaming, and also supports model ensembles.
  • Supports high-performance inference on both NVIDIA GPUs and x86 and ARM CPUs.
  • Runs on scale-out cloud or data center, enterprise edge, and even on embedded devices like the NVIDIA Jetson. It supports both bare metal and virtualized environments (for example, VMware vSphere) for AI inference. There are dedicated NVIDIA Triton builds for running on Windows, Jetson, and ARM SBSA.
  • Kubernetes and AI platform support:
    • It is available as a Docker container and integrates easily with Kubernetes platforms like AWS EKS, Google GKE, Azure AKS, Alibaba ACK, Tencent TKE or Red Hat OpenShift.
    • NVIDIA Triton is available in managed cloud AI workflow platforms like Amazon SageMaker, Azure ML, Google Vertex AI, Alibaba Platform for AI Elastic Algorithm Service, and Tencent TI-EMS.

Concurrent execution

GPUs are compute powerhouses capable of executing multiple workloads at the same time. NVIDIA Triton Inference Server maximizes performance and reduces end-to-end latency by running multiple models concurrently on the GPU. These models can be all the same, or different models from different frameworks. The GPU memory size is the only limitation to the number of models that can run concurrently. This results in high GPU utilization and throughput.

Dynamic batching

One factor in the optimization of inference is batch size, or how many samples you process at once. GPUs deliver high throughput at higher batch sizes. However, for real-time applications, the real constraint on services isn’t batch size or even throughput, but rather the latency required to deliver an outstanding experience for end customers. Here is a simple example. For a network of smart speakers, the latency of the BERT model used in the NLP portion of the pipeline must be 7 milliseconds or less to deliver a great experience. Figure 1 shows that, to stay under the 7 ms latency threshold, you could run up to batch size 24 and still maximize throughput.

This chart shows how input requests can be batched to increase throughput under latency constraints.
Figure 1: Throughput-Latency curve.
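
The selection described above can be sketched in a few lines of Python: given latency and throughput measured at several batch sizes, keep the ones that fit the latency budget and take the one with the highest throughput. The numbers in SAMPLE are hypothetical and are not read from Figure 1; the Measurement type and the function name are also illustrative.

```python
from typing import NamedTuple, Optional, Sequence


class Measurement(NamedTuple):
    batch_size: int
    latency_ms: float
    throughput_qps: float


def best_batch_under_budget(
    measurements: Sequence[Measurement], budget_ms: float
) -> Optional[Measurement]:
    """Return the highest-throughput measurement whose latency fits the budget."""
    feasible = [m for m in measurements if m.latency_ms <= budget_ms]
    if not feasible:
        return None  # no batch size meets the latency target
    return max(feasible, key=lambda m: m.throughput_qps)


# Hypothetical measurements for illustration only (NOT taken from Figure 1).
SAMPLE = [
    Measurement(1, 1.5, 650.0),
    Measurement(8, 3.2, 2500.0),
    Measurement(16, 5.0, 3200.0),
    Measurement(24, 6.8, 3530.0),
    Measurement(32, 8.9, 3600.0),
]

choice = best_batch_under_budget(SAMPLE, budget_ms=7.0)
print(choice.batch_size)  # 24 for the sample data above
```

With these sample values, batch size 32 has slightly higher throughput but misses the 7 ms budget, so batch size 24 is selected.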

When running inference in a production environment, there are two types of batching: client side (static) and server side (dynamic). By default, when clients send requests to a server, the server processes each request sequentially, which isn’t optimal for throughput.

When implementing inference at scale, developers may want to balance latency and throughput targets to meet the requirements of their application. Figure 2 shows how you can balance these priorities through the use of NVIDIA Triton Inference Server’s dynamic batching. As independent inference requests come into the server, NVIDIA Triton dynamically groups them on the server side to form a larger batch. Triton can manage this batching to a specified latency target, maximizing throughput while staying within that target. NVIDIA Triton manages this task automatically, so you don’t have to make any changes to your code on the client side.

For example, eight different devices might each send an image search request at the same time. Rather than performing inference on each request sequentially, NVIDIA Triton can batch them all together into a single query to the GPU, increasing throughput while still delivering results at low latency.

This chart shows NVIDIA Triton combining multiple input requests to form a batch for inference.
Figure 2: NVIDIA Triton dynamic batching.
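
To make the grouping idea concrete, here is a minimal simulation of server-side batching. It is a toy model and not Triton’s actual scheduler: it only decides which requests end up in the same batch, given their arrival times, a maximum batch size, and a maximum time the oldest request may wait. The function and parameter names are illustrative; they loosely mirror the max_batch_size and max_queue_delay_microseconds settings in the Triton model configuration.

```python
from typing import List, Sequence


def form_batches(
    arrival_times_us: Sequence[int],
    max_batch_size: int,
    max_queue_delay_us: int,
) -> List[List[int]]:
    """Group request arrival times (microseconds) into batches.

    The current batch is closed when it is full, or when the next request
    arrives more than max_queue_delay_us after the oldest queued request.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    batches: List[List[int]] = []
    current: List[int] = []
    for t in sorted(arrival_times_us):
        batch_full = len(current) == max_batch_size
        waited_too_long = bool(current) and t - current[0] > max_queue_delay_us
        if batch_full or waited_too_long:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# Eight requests arriving 10 microseconds apart (hypothetical timings).
requests = list(range(0, 80, 10))

one_batch = form_batches(requests, max_batch_size=8, max_queue_delay_us=100)
tight_delay = form_batches(requests, max_batch_size=8, max_queue_delay_us=25)
sequential = form_batches(requests, max_batch_size=1, max_queue_delay_us=100)

print([len(b) for b in one_batch])    # [8]
print([len(b) for b in tight_delay])  # [3, 3, 2]
print(len(sequential))                # 8
```

A generous delay lets all eight requests share one batch, a tight delay trades batch size for lower queueing time, and a batch size of 1 reproduces the sequential case.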

To understand how this works in practice, look at the example in Figure 3. The line shows the latency and throughput at different batch sizes for a given model. If the latency threshold for delivering a great user experience is 3 ms, you could set dynamic batching to a batch size of 16 and meet the latency target while getting over 3x the performance of running at batch size 1. This example shows how Triton helps manage batch size and throughput to meet your requirements, rather than relying on a fixed batch size to manage your service.

This chart shows how different batches impact throughput & latency.
Figure 3: Throughput-Latency curve for different batch sizes.

In conclusion, NVIDIA Triton’s dynamic batching feature can substantially increase throughput under strict latency constraints, thereby delivering high performance and high utilization of the GPUs.

Now let us look at some of the recently added features of NVIDIA Triton.

Model analyzer

For efficient inference serving, optimal model configurations such as the batch size and the number of concurrent model instances should be identified for a given processor. Today, users manually try different combinations, sometimes hundreds of them, to eventually find the optimal configuration for their throughput, latency, and memory utilization requirements. NVIDIA Triton Model Analyzer is an optimization tool that automates this selection by finding the best-performing configuration for a model. Users can specify performance requirements (such as a latency constraint, throughput target, or memory footprint), and Model Analyzer will search through different configurations (batch sizes, concurrent model instances, and so on) to find the one that provides the best performance under those constraints. It then outputs a summary report, which includes charts like the ones shown below, to help visualize the performance of the top configurations. Documentation and examples can be found in the GitHub repo.

These charts show the output from NVIDIA Triton Model Analyzer.
Figure 4: NVIDIA Triton Model Analyzer-Recommended Configurations.
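
The kind of constrained search described in this section can be illustrated with a small sketch. This is not the Model Analyzer API or its search algorithm: fake_profile is a made-up cost model standing in for a real profiling run, and all names and numbers are assumptions chosen for illustration. The sketch sweeps batch sizes and instance counts, discards configurations that violate the latency or memory constraints, and ranks the rest by throughput.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence


@dataclass(frozen=True)
class Result:
    batch_size: int
    instances: int
    latency_ms: float
    throughput_qps: float
    gpu_memory_mb: float


def fake_profile(batch_size: int, instances: int) -> Result:
    """Stand-in for a real measurement; the formulas are invented."""
    latency_ms = 2.0 + 0.25 * batch_size * (1 + 0.25 * (instances - 1))
    throughput_qps = 1000.0 * batch_size * instances / latency_ms
    gpu_memory_mb = instances * (500.0 + 20.0 * batch_size)
    return Result(batch_size, instances, latency_ms, throughput_qps, gpu_memory_mb)


def search(
    batch_sizes: Sequence[int],
    instance_counts: Sequence[int],
    max_latency_ms: float,
    max_memory_mb: float,
) -> List[Result]:
    """Return feasible configurations, best throughput first."""
    results = [fake_profile(b, i) for b, i in product(batch_sizes, instance_counts)]
    feasible = [
        r for r in results
        if r.latency_ms <= max_latency_ms and r.gpu_memory_mb <= max_memory_mb
    ]
    return sorted(feasible, key=lambda r: r.throughput_qps, reverse=True)


top = search([1, 2, 4, 8, 16], [1, 2, 4], max_latency_ms=5.0, max_memory_mb=2000.0)
best = top[0]
print(best.batch_size, best.instances)  # 8 2 under this made-up cost model
```

In a real deployment the measurements would come from profiling the model on the target hardware, which is the step the tool automates.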

Multi-GPU Multi-node inference

NLP models are growing exponentially in size (doubling every 2.5 months). For example, Turing NLG (17B parameters), which came out in early 2020, needs at least 34 GB of GPU memory; GPT-3 (175B parameters), introduced in mid-2020, needs at least 350 GB; and the latest Megatron Turing NLG model (530B parameters) needs more than 1 TB. These large Transformer models cannot fit in a single GPU.
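
The memory figures above are consistent with storing each parameter in 2 bytes (FP16). That reading is an assumption, since the text does not state the precision, and it counts only the weights, not activations or other runtime buffers. A quick check of the arithmetic:

```python
BYTES_PER_PARAM_FP16 = 2  # assumption: weights stored in half precision


def min_weight_memory_gb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Lower bound on memory for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


MODELS = {
    "Turing NLG": 17_000_000_000,
    "GPT-3": 175_000_000_000,
    "Megatron Turing NLG": 530_000_000_000,
}

for name, params in MODELS.items():
    print(f"{name}: {min_weight_memory_gb(params):.0f} GB")
# Turing NLG: 34 GB
# GPT-3: 350 GB
# Megatron Turing NLG: 1060 GB
```

At 80 GB per GPU, even the weights of the 530B-parameter model alone span more than a dozen GPUs, which is why the model has to be split.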

For these very large Transformer models, NVIDIA Triton introduces multi-GPU, multi-node inference. It uses the following model parallelism techniques to split a large model across multiple GPUs and nodes:

  • Pipeline (inter-layer) parallelism, which splits contiguous sets of layers across multiple GPUs. This maximizes GPU utilization within a single node.
  • Tensor (intra-layer) parallelism, which splits individual layers across multiple GPUs. This minimizes latency within a single node.

It uses NCCL in Magnum IO for topology-aware communication for high throughput. 
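
The difference between the two splitting strategies can be shown with a toy example in plain Python. This is only a sketch of the idea: the "GPUs" are ordinary lists, a layer is a single weight matrix with no bias or activation function, and list concatenation stands in for the collective communication that a library such as NCCL would perform. All function names are illustrative.

```python
from typing import List

Vector = List[int]
Matrix = List[List[int]]


def matvec(x: Vector, w: Matrix) -> Vector:
    """Multiply row vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Tensor (intra-layer) split: each worker holds a slice of the columns."""
    step = len(w[0]) // parts  # assumes the column count divides evenly
    return [[row[k * step:(k + 1) * step] for row in w] for k in range(parts)]


def tensor_parallel_matvec(x: Vector, w: Matrix, parts: int) -> Vector:
    """Every worker computes part of one layer's output; results are gathered."""
    out: Vector = []
    for shard in split_columns(w, parts):  # each shard would live on its own GPU
        out.extend(matvec(x, shard))
    return out


def split_layers(layers: List[Matrix], stages: int) -> List[List[Matrix]]:
    """Pipeline (inter-layer) split: contiguous groups of layers per stage."""
    per_stage = -(-len(layers) // stages)  # ceiling division
    return [layers[i:i + per_stage] for i in range(0, len(layers), per_stage)]


def pipeline_forward(x: Vector, layers: List[Matrix], stages: int) -> Vector:
    """Run stages in order; each stage hands its output to the next one."""
    for stage in split_layers(layers, stages):  # each stage would live on its own GPU
        for w in stage:
            x = matvec(x, w)
    return x


w = [[1, 2, 3, 4], [5, 6, 7, 8]]
print(matvec([1, 1], w))                     # [6, 8, 10, 12]
print(tensor_parallel_matvec([1, 1], w, 2))  # [6, 8, 10, 12]

a, b, c = [[1, 0], [0, 1]], [[2, 0], [0, 2]], [[1, 1], [0, 1]]
print(pipeline_forward([1, 2], [a, b, c], stages=2))  # [2, 6]
```

Both strategies produce the same result as the unsplit computation. They differ in what each GPU holds: a slice of every layer in the tensor case, and whole layers in the pipeline case.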

Multi-GPU inference is required for real-time inference performance. The following is a comparison between running the Megatron Turing NLG (530B) model on GPU servers and on a CPU-only server. The configuration is a single query (batch size 1) with an input sequence length of 128 tokens (an average of 102 words) and an output sequence length of 8 tokens (an average of six words).

  • GPU: ~½ second with tensor and pipeline parallelism (2 DGX-A100-80GB, batch size=1, FP16, FasterTransformer 4.0, Triton 2.15).
  • CPU: >1 minute (Xeon Platinum 8280 2S, up to 1TB/socket system memory, batch size=1, FP32, TensorFlow).

On two DGXs, the Megatron Turing NLG model can stream eight output tokens (six words) every ½ second, which reduces the latency to display or speak in a real-time conversation. Such large models need the power of a GPU system to be practical. Many GPU systems like DGX use NVLink for high-performance intra-node communication and InfiniBand/Ethernet for inter-node communication.

New RAPIDS Forest Inference Library backend

NVIDIA Triton includes the RAPIDS Forest Inference Library (FIL) as a backend for GPU or CPU inference of XGBoost, LightGBM, and Scikit-learn random forest models. With FIL integration, NVIDIA Triton is now a unified deployment engine for both deep learning and traditional machine learning workloads.

This chart shows that GPU model deployments with NVIDIA Triton FIL can provide very high throughput while maintaining minimal p99 latency.
Figure 5: Throughput and Latency Curve for Serving an Example XGBoost Model in NVIDIA Triton.

Tree models (XGBoost, random forest, LightGBM) are ubiquitous, especially in finance and recommender systems, but packages that support deep learning models rarely support tree model inference. In practice, many applications (especially recommender systems) use an ensemble of tree and DL models and require tools to support inference over these ensembles at massive scale. For use cases like re-scoring all customers against all available products as often as possible, or scoring every global credit card transaction against multiple fraud models, the FIL backend offers a highly optimized solution.

While other tree model deployment solutions often require converting models to specialized formats like ONNX, the FIL backend allows users to deploy XGBoost and LightGBM models as trained. Furthermore, FIL’s GPU-accelerated inference means that users no longer have to settle for small models to meet tight latency targets. Instead, they can take advantage of huge ensembles that offer far more sensitivity (for fraud detection, for example) without exceeding their latency budgets.

NVIDIA Triton with the FIL backend brings GPU-accelerated inference to even the largest tree models, so they can be deployed in low-latency applications, improving accuracy for fraud detection, recommender systems, and more. Inference on XGBoost models, LightGBM models with non-categorical features, and cuML/Scikit-learn random forest models for single-output regression and binary classification can be done on both GPUs and CPUs. Categorical feature support for LightGBM models is coming soon.

New Amazon SageMaker integration

NVIDIA Triton Inference Server is now natively integrated and available in Amazon SageMaker, a fully managed service for data science and machine learning (ML) workflows. AWS and NVIDIA have worked closely to bring NVIDIA Triton’s benefits to SageMaker customers. NVIDIA Triton Inference Server can now be used to serve models for inference in Amazon SageMaker, with the performance optimizations, dynamic batching, and multi-framework support that NVIDIA Triton provides. It takes two steps:

  1. Create a Triton model repository and configuration in Amazon S3.
  2. Create a SageMaker NVIDIA Triton endpoint and deploy.
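
As a rough illustration of the first step, the sketch below builds a model repository layout on local disk, following Triton’s documented convention of one directory per model that contains a config.pbtxt file and numbered version subdirectories. The model name, the ONNX Runtime backend, and the batch size are assumptions chosen for the example. Copying the model file into the version directory, uploading to Amazon S3, and creating the SageMaker endpoint are not shown.

```python
import tempfile
from pathlib import Path

# Minimal configuration text; the values are illustrative assumptions.
CONFIG_TEMPLATE = """name: "{name}"
backend: "onnxruntime"
max_batch_size: {max_batch_size}
dynamic_batching {{ }}
"""


def create_model_repo(root: Path, name: str, version: int = 1, max_batch_size: int = 8) -> Path:
    """Create <root>/<name>/config.pbtxt and <root>/<name>/<version>/."""
    model_dir = root / name
    version_dir = model_dir / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    # A real repository would also hold the model file here,
    # for example <version_dir>/model.onnx for an ONNX model.
    config_path = model_dir / "config.pbtxt"
    config_path.write_text(CONFIG_TEMPLATE.format(name=name, max_batch_size=max_batch_size))
    return config_path


with tempfile.TemporaryDirectory() as tmp:
    config = create_model_repo(Path(tmp), "my_model")
    layout = sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*"))
    config_text = config.read_text()

print(layout)  # ['my_model', 'my_model/1', 'my_model/config.pbtxt']
print(config_text)
```

The empty dynamic_batching block turns on server-side batching with default settings, so a configuration like this one would pick up the batching behavior described earlier.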

For more information, please read this post. NVIDIA Triton Inference Server containers are available in all regions where Amazon SageMaker is available and come at no additional cost.

NVIDIA Triton can be used with all the major cloud service providers (AWS, Google Cloud, Microsoft Azure, Alibaba Cloud, and Tencent Cloud) through their managed Kubernetes and AI platform services.

Customer success stories

Let us look at some new customer success stories with NVIDIA Triton:

  • Microsoft Teams
    • Microsoft Azure Cognitive Services uses Triton on GPUs for ASR in Microsoft Teams for live captioning and transcription. Please read the post for more details.
  • Siemens Energy
    • Siemens Energy uses NVIDIA Triton for autonomous operations in power plants. Computer vision models served by NVIDIA Triton Inference Server detect leaks and other abnormalities. Please read the post for more details.


NVIDIA Triton helps standardize scalable production AI in every data center, cloud, and embedded device. It supports multiple frameworks, runs models on both CPUs and GPUs, handles different types of inference queries, and integrates with Kubernetes and MLOps platforms. Download NVIDIA Triton today as a Docker container from NGC and find the documentation in our open source GitHub repo. In addition, visit the NVIDIA Triton webpage to learn more.