
Simplifying AI Model Deployment at the Edge with NVIDIA Triton Inference Server

Sep 14, 2021

By Shankar Chandrasekaran, Suhas Hariharapura Sheshadri and Mahan Salehi




AI, machine learning (ML), and deep learning (DL) are becoming effective tools for solving diverse computing problems in fields including robotics, retail, healthcare, and industrial settings. The need for low latency, real-time responsiveness, and privacy has pushed AI applications to run right at the edge.

However, deploying AI models in applications and services at the edge can be challenging for infrastructure and operations teams. Diverse frameworks, end-to-end latency requirements, and a lack of standardized implementations all complicate AI deployments. In this post, we explore how to navigate these challenges and deploy AI models in production at the edge.

Here are the most common challenges of deploying models for inference:

Multiple model frameworks: Data scientists and researchers use different AI and deep learning frameworks like TensorFlow, PyTorch, TensorRT, ONNX Runtime, or just plain Python to build models. Each of these frameworks requires an execution backend to run the model in production. Managing multiple framework backends at the same time can be costly and lead to scalability and maintenance issues.
Different inference query types: Inference serving at the edge requires handling multiple simultaneous queries of different types, such as real-time online predictions, streaming data, and complex pipelines of multiple models. Each of these requires special processing for inference.
Constantly evolving models: AI models are continuously retrained and updated based on new data and new algorithms. Models in production must be updated continuously without restarting the device. Because a typical AI application uses many different models, updating them in the field compounds the scale of the problem.

NVIDIA Triton Inference Server is open-source inference serving software that simplifies inference serving by addressing these complexities. NVIDIA Triton provides a single standardized inference platform that can support running inference on multiframework models and in different deployment environments such as data center, cloud, embedded devices, and virtualized environments. It supports different types of inference queries through advanced batching and scheduling algorithms and supports live model updates. NVIDIA Triton is also designed to increase inference performance by maximizing hardware utilization through concurrent model execution and dynamic batching.

We brought Triton Inference Server to Jetson with NVIDIA JetPack 4.6, released in August 2021. With NVIDIA Triton, AI deployment can now be standardized across cloud, data center, and edge.

Key features

Here are some key features of NVIDIA Triton that help you simplify your model deployment on Jetson.

Figure 1. Triton Inference Server architecture on NVIDIA Jetson. The chart shows the internal workings of NVIDIA Triton that provide the benefits discussed.

Embedded application integration

Triton Inference Server supports direct C API integration for communication between client applications and the server, in addition to gRPC and HTTP/REST. On Jetson, where both the client application and inference serving run on the same device, client applications can call Triton Inference Server APIs directly with zero communication overhead. NVIDIA Triton is available as a shared library with a C API that enables the full functionality to be included directly in an application. This is best suited for Jetson-based, embedded applications.
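For contrast with the in-process C API, here is a minimal Python sketch of what an HTTP/REST inference call looks like when it goes over the network path instead. It only builds the URL and JSON body and sends nothing. The model name `vehicle_classifier`, the tensor name `INPUT0`, and the helper `build_infer_request` are hypothetical; the endpoint layout and default port 8000 follow the HTTP inference protocol that Triton exposes, as we understand it.

```python
import json


def build_infer_request(model_name, input_name, shape, datatype, data,
                        host="localhost:8000"):
    """Return (url, body) for an HTTP inference call. Nothing is sent."""
    # Assumed endpoint layout: /v2/models/<model name>/infer
    url = f"http://{host}/v2/models/{model_name}/infer"
    body = {
        "inputs": [
            {
                "name": input_name,      # hypothetical tensor name
                "shape": list(shape),
                "datatype": datatype,    # for example "FP32"
                "data": list(data),      # flattened tensor values
            }
        ]
    }
    return url, json.dumps(body)


url, body = build_infer_request(
    "vehicle_classifier", "INPUT0", [1, 4], "FP32", [0.1, 0.2, 0.3, 0.4]
)
print(url)
print(body)
```

Every such call pays for serialization and a trip through the network stack, which is the overhead that linking the shared library and calling the C API directly avoids on a single Jetson device.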

Multiple framework support

NVIDIA Triton natively integrates popular framework backends, such as TensorFlow 1.x/2.x, ONNX Runtime, and TensorRT. This allows developers to run their models directly on Jetson without going through a conversion process. NVIDIA Triton also provides the flexibility to add custom backends. Developers get their choice of framework, and the infrastructure team streamlines deployment with a single inference engine.

DLA support

Triton Inference Server on Jetson can run models on both GPU and DLA. DLA is the Deep Learning Accelerator available on Jetson Xavier NX and Jetson AGX Xavier.

Concurrent model execution

Triton Inference Server maximizes performance and reduces end-to-end latency by running multiple models concurrently on Jetson. These can be multiple instances of the same model or different models from different frameworks. The GPU memory size is the only limitation to the number of models that can run concurrently.
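To see why running several model instances side by side helps, here is a small, idealized Python simulation. It is not Triton code: it assumes every request is already waiting at time zero, that each request goes to whichever instance frees up first, and that instances do not slow each other down. The function name `makespan_ms` is our own. In Triton itself, the number of instances is set per model through the `instance_group` setting in the model configuration.

```python
import heapq


def makespan_ms(request_costs_ms, instance_count):
    """Time to finish all requests when each goes to the earliest-free instance."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    # Min-heap of the times at which each model instance becomes free.
    free_at = [0.0] * instance_count
    heapq.heapify(free_at)
    finish = 0.0
    for cost in request_costs_ms:
        start = heapq.heappop(free_at)   # earliest available instance
        end = start + cost
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


requests = [10.0] * 8  # eight requests, 10 ms of compute each
print(makespan_ms(requests, 1))  # 80.0
print(makespan_ms(requests, 2))  # 40.0
```

On real hardware the gain depends on how much of the GPU a single instance leaves idle, so treat the halving here as an upper bound under these assumptions.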

Dynamic batching

Batching is a technique to improve inference throughput. There are two ways to batch inference requests: client batching and server batching. NVIDIA Triton implements server batching by combining individual inference requests to improve inference throughput. It is dynamic because it keeps adding requests to a batch until a configurable latency threshold is reached. When the threshold is met, NVIDIA Triton schedules the current batch for execution. The scheduling and batching decisions are transparent to the client requesting inference and are configured per model. Through dynamic batching, NVIDIA Triton maximizes throughput while meeting strict latency requirements.

One example of dynamic batching is an application that runs both detection and classification models, where the inputs to the classification model are the objects found by the detection model. Because the number of detections to classify varies, dynamic batching can group the detected objects into a batch on the fly and run classification as a single batched request, reducing the overall latency and improving the performance of your application.
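The batching rule described above can be sketched in a few lines of Python. This is a simplified model, not Triton's scheduler: it assumes a batch is dispatched either when it is full or when the next request arrives after the oldest queued request has waited longer than the delay threshold. The function `dynamic_batches` and its parameters are illustrative names; in a Triton model configuration, the corresponding knob is `max_queue_delay_microseconds` under `dynamic_batching`.

```python
def dynamic_batches(arrivals_us, max_batch_size, max_queue_delay_us):
    """Group sorted request arrival times (microseconds) into batches.

    A batch closes when it holds max_batch_size requests, or when a new
    request arrives after the oldest queued one has exceeded the delay.
    """
    batches = []
    current = []
    for t in arrivals_us:
        # The oldest request has waited too long: dispatch what is queued.
        if current and t - current[0] > max_queue_delay_us:
            batches.append(current)
            current = []
        current.append(t)
        # The batch is full: dispatch immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)  # flush whatever remains at the end
    return batches


arrivals = [0, 20, 40, 300, 310, 320, 330, 340, 900]
for batch in dynamic_batches(arrivals, max_batch_size=4, max_queue_delay_us=100):
    print(batch)
# [0, 20, 40]
# [300, 310, 320, 330]
# [340]
# [900]
```

A burst of detections, like the four requests arriving between 300 and 330 microseconds, ends up in one batched call, while isolated requests are not held back longer than the threshold.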

Model ensembles

The model ensemble feature is used to create a pipeline of different models and pre- or post-processing operations to handle a variety of workloads. NVIDIA Triton ensembles allow users to stitch together multiple models and pre- or post-processing operations into a single pipeline with connected inputs and outputs. NVIDIA Triton manages the execution of the entire pipeline with a single inference request to the ensemble from the client application. For example, an application that classifies vehicles can use an NVIDIA Triton model ensemble to run a vehicle detection model and then run a vehicle classification model on the detected vehicles.
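The idea of connected inputs and outputs can be illustrated with a toy Python pipeline. This is a sketch of the concept, not Triton's implementation: each step maps its model's input and output names to names in a shared tensor pool, loosely mirroring the `input_map` and `output_map` entries of an ensemble step. The two "models" are stand-in functions, and all tensor names (`IMAGE`, `DETECTIONS`, `LABELS`) are hypothetical.

```python
def run_ensemble(steps, request):
    """Run steps in order, passing named tensors through a shared pool."""
    tensors = dict(request)
    for step in steps:
        # Map pool tensors to the names this model expects as inputs.
        inputs = {model_in: tensors[pool_name]
                  for model_in, pool_name in step["input_map"].items()}
        outputs = step["model"](inputs)
        # Publish the model's outputs back into the pool.
        for model_out, pool_name in step["output_map"].items():
            tensors[pool_name] = outputs[model_out]
    return tensors


def detect_vehicles(inputs):
    # Toy detector: every positive value counts as a detected object.
    return {"boxes": [x for x in inputs["image"] if x > 0]}


def classify_vehicles(inputs):
    # Toy classifier: large crops are trucks, the rest are cars.
    return {"labels": ["truck" if crop > 5 else "car" for crop in inputs["crops"]]}


steps = [
    {"model": detect_vehicles,
     "input_map": {"image": "IMAGE"},
     "output_map": {"boxes": "DETECTIONS"}},
    {"model": classify_vehicles,
     "input_map": {"crops": "DETECTIONS"},
     "output_map": {"labels": "LABELS"}},
]

result = run_ensemble(steps, {"IMAGE": [0, 3, 9, -1, 6]})
print(result["LABELS"])  # ['car', 'truck', 'truck']
```

The client supplies only `IMAGE` and reads `LABELS`; the intermediate `DETECTIONS` tensor never leaves the pipeline, which is the point of issuing a single request to the ensemble.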

Custom backends

In addition to the popular AI backends, NVIDIA Triton also supports execution of custom C++ backends. These are useful for implementing special logic such as pre- and post-processing, or even regular models.

Dynamic model loading

NVIDIA Triton has a model control API that can be used to load and unload models dynamically. This enables the device to use the models when required by the application. Also, when a model gets retrained with new data, it can be redeployed on NVIDIA Triton seamlessly without any application restarts or disruption to the service, allowing for live model updates.
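As a rough illustration of the model control API over HTTP, the Python sketch below builds load and unload requests without sending them. The helper `model_control_request` and the model name `vehicle_classifier` are hypothetical. The endpoint layout, the empty JSON body, and the default port 8000 reflect our understanding of Triton's model repository endpoints, and we assume the server was started in explicit model control mode (`--model-control-mode=explicit`); check these against the documentation for your release.

```python
import urllib.request


def model_control_request(model_name, action, host="localhost:8000"):
    """Build (but do not send) a POST request that loads or unloads a model."""
    if action not in ("load", "unload"):
        raise ValueError("action must be 'load' or 'unload'")
    # Assumed endpoint layout: /v2/repository/models/<model name>/<action>
    url = f"http://{host}/v2/repository/models/{model_name}/{action}"
    return urllib.request.Request(url, data=b"{}", method="POST")


load_req = model_control_request("vehicle_classifier", "load")
unload_req = model_control_request("vehicle_classifier", "unload")
print(load_req.get_method(), load_req.full_url)
print(unload_req.get_method(), unload_req.full_url)

# To actually send a request to a running server:
#   urllib.request.urlopen(load_req)
```

Redeploying a retrained model then amounts to copying the new version into the model repository and issuing another load request, with no restart of the application.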

Conclusion

Triton Inference Server is released as a shared library for Jetson. NVIDIA Triton releases are made monthly, adding new features and support for the newest framework backends. For more information, see Triton Inference Server Support for Jetson and JetPack.

NVIDIA Triton helps standardize scalable production AI across every data center, cloud, and embedded device. It supports multiple frameworks, runs models on multiple computing engines like GPU and DLA, and handles different types of inference queries. With the integration in NVIDIA JetPack, NVIDIA Triton can be used for embedded applications.

For more information, see the triton-inference-server Jetson GitHub repo for documentation and attend the upcoming webinar, Simplify model deployment and maximize AI inference performance with NVIDIA Triton Inference Server on Jetson. The webinar will include demos on Jetson to showcase various NVIDIA Triton features.


About the Authors

About Shankar Chandrasekaran
Shankar is a senior product marketing manager in the data center GPU team at NVIDIA. He is responsible for GPU software infrastructure marketing to help IT and DevOps easily adopt and seamlessly integrate GPUs in their infrastructure. Before NVIDIA, he held engineering, operations, and marketing positions in both small and large technology companies. He holds business and engineering degrees.


About Suhas Hariharapura Sheshadri
Suhas Sheshadri is director of product management at NVIDIA, where he leads software product management for the Jetson and IGX product lines. He is passionate about the intersection of a thriving developer community building innovative projects on Jetson and the thousands of customers deploying Jetson and IGX in production across a wide range of industries. Suhas is driven by a relentless focus on pushing the boundaries of what's possible at the edge. Before joining NVIDIA in 2018, Suhas was a Staff Engineer at Qualcomm, where he worked on GPS, sensor fusion, and always-on location technologies. He holds a Master's in Computer Science from the University of Texas at Dallas and an MBA from the University of California, Berkeley.


About Mahan Salehi
Mahan Salehi is a product management leader at NVIDIA, where he drives the strategy and development of the company’s generative AI software portfolio. He has served as product owner for several NVIDIA flagship products, including Triton Inference Server, NIM, and NeMo microservices. Mahan also spearheads product strategy for NVIDIA BioNeMo, advancing the use of AI foundation models in life sciences and biology. Before joining NVIDIA, Mahan was the CEO and co-founder of an AI startup dedicated to improving mental health treatment. He holds an engineering degree from the University of Toronto.
