Run Multiple AI Models on the Same GPU with Amazon SageMaker Multi-Model Endpoints Powered by NVIDIA Triton Inference Server

Oct 25, 2022

By Shankar Chandrasekaran and Eliuth Triana

Last November, AWS integrated NVIDIA Triton Inference Server, an open-source inference serving software, into Amazon SageMaker. Machine learning (ML) teams can use Amazon SageMaker as a fully managed service to build and deploy ML models at scale.

With this integration, data scientists and ML engineers can combine the multi-framework, high-performance inference serving of NVIDIA Triton with the fully managed model deployment of Amazon SageMaker.

Multi-model endpoints enable higher performance at low cost on GPUs

Today, AWS announced Amazon SageMaker multi-model endpoints (MMEs) on GPUs. With Triton Inference Server, an MME can run multiple deep learning or ML models on the same GPU at the same time. For more information, see Run Multiple Deep Learning Models on GPU with Amazon SageMaker Multi-Model Endpoints.

An MME shares the GPU instances behind an endpoint across multiple models and dynamically loads and unloads models based on incoming traffic. Because the models share hardware instead of each occupying dedicated instances, you can achieve better price performance.
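As a rough sketch of how a client might address one model among the many behind an MME, the Python below builds the arguments for the SageMaker runtime `invoke_endpoint` call, where `TargetModel` names the model artifact the request should be routed to. The endpoint name, artifact name, input tensor name, and the `application/octet-stream` content type are assumptions chosen for illustration, and the JSON body follows the Triton (KServe v2) inference request layout. None of these specifics come from the announcement itself.

```python
import json


def build_mme_request(endpoint_name, target_model, input_name, data, shape,
                      datatype="FP32"):
    """Build keyword arguments for sagemaker-runtime invoke_endpoint.

    All names passed in are hypothetical placeholders; substitute the values
    from your own deployment.
    """
    # Triton accepts a KServe v2 style JSON request: a list of named input
    # tensors, each with a shape, a datatype, and its data.
    payload = {
        "inputs": [
            {
                "name": input_name,
                "shape": list(shape),
                "datatype": datatype,
                "data": data,
            }
        ]
    }
    return {
        "EndpointName": endpoint_name,
        # Assumed content type for a plain JSON body sent to Triton.
        "ContentType": "application/octet-stream",
        "Body": json.dumps(payload),
        # TargetModel selects which model artifact behind the MME serves
        # this request; the endpoint loads it on demand if needed.
        "TargetModel": target_model,
    }


def invoke(client, request_kwargs):
    """Send the request with a sagemaker-runtime client and decode the reply."""
    response = client.invoke_endpoint(**request_kwargs)
    return json.loads(response["Body"].read())


# Example usage (requires boto3, AWS credentials, and a deployed endpoint):
#   import boto3
#   client = boto3.client("sagemaker-runtime")
#   request = build_mme_request("my-gpu-mme", "model-a.tar.gz",
#                               "INPUT__0", [0.0, 1.0, 2.0, 3.0], (1, 4))
#   result = invoke(client, request)
```

Keeping request construction separate from the network call makes it easy to point the same client at a different model: only the `TargetModel` value changes, while the endpoint stays the same.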

Scaling inference with MMEs on GPUs

To harness the processing power of GPUs, MMEs use the concurrent model execution capability of Triton Inference Server, which runs multiple models in parallel on the same AWS GPU instance. This helps ML teams scale AI by running many models that serve many inference requests under stringent latency requirements. The result is improved GPU utilization and a lower cost of inference.
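In Triton, the number of copies of a model that can execute at once is set per model through the `instance_group` field of its `config.pbtxt` file. The helper below is a minimal sketch that renders such a file as text. The model name, platform string, batch size, and instance count shown are hypothetical values, and a real configuration may also need input and output tensor definitions depending on the backend.

```python
def render_triton_config(name, platform, max_batch_size=8, gpu_instances=2):
    """Render a minimal Triton config.pbtxt as a string.

    gpu_instances is the number of execution instances of this model that
    Triton may run in parallel on a GPU. All default values are illustrative.
    """
    if gpu_instances < 1:
        raise ValueError("gpu_instances must be at least 1")
    if max_batch_size < 0:
        raise ValueError("max_batch_size must not be negative")
    return (
        f'name: "{name}"\n'
        f'platform: "{platform}"\n'
        f"max_batch_size: {max_batch_size}\n"
        "instance_group [\n"
        "  {\n"
        f"    count: {gpu_instances}\n"
        "    kind: KIND_GPU\n"
        "  }\n"
        "]\n"
    )


# Hypothetical model: two GPU instances of an ONNX model.
config_text = render_triton_config("model_a", "onnxruntime_onnx")
print(config_text)
```

A higher instance count lets more requests for the same model overlap on the GPU at the price of more GPU memory per model, so the right value depends on model size and traffic.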

Support is available in all regions where Amazon SageMaker is available, at no additional cost for the Triton Inference Server container.

Start using Amazon SageMaker multi-model endpoints on GPUs today.

Join the NVIDIA Triton and NVIDIA TensorRT community and stay current on the latest products.

About the Authors

About Shankar Chandrasekaran
Shankar is a senior product marketing manager in the data center GPU team at NVIDIA. He is responsible for GPU software infrastructure marketing to help IT and DevOps easily adopt and seamlessly integrate GPUs in their infrastructure. Before NVIDIA, he held engineering, operations, and marketing positions in both small and large technology companies. He holds business and engineering degrees.

About Eliuth Triana
Eliuth Triana is a developer relations manager at NVIDIA on the Amazon team. He connects Amazon and AWS product leaders, developers, and scientists with NVIDIA technologists and product leaders to accelerate Amazon model inference and training, ML/DL workloads, Amazon EC2 products, and AWS AI services. Eliuth is a passionate mountain biker, tennis player, skier, and poker player.
