Run Multiple AI Models on the Same GPU with Amazon SageMaker Multi-Model Endpoints Powered by NVIDIA Triton Inference Server


Last November, AWS integrated NVIDIA Triton Inference Server, open-source inference serving software, into Amazon SageMaker. Machine learning (ML) teams can use Amazon SageMaker as a fully managed service to build and deploy ML models at scale.

With this integration, data scientists and ML engineers can combine NVIDIA Triton's multi-framework, high-performance inference serving with Amazon SageMaker's fully managed model deployment.

Multi-model endpoints enable higher performance at low cost on GPUs

Today, AWS announced Amazon SageMaker multi-model endpoints (MMEs) on GPUs. MMEs can run multiple deep learning or ML models on the same GPU at the same time, using Triton Inference Server. For more information, see Run Multiple Deep Learning Models on GPU with Amazon SageMaker Multi-Model Endpoints.

An MME shares the GPU instances behind an endpoint across multiple models, dynamically loading and unloading models based on incoming traffic. This makes it easy to achieve strong price performance.
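To make this concrete, the sketch below builds the request arguments for invoking one model out of many on a multi-model endpoint. With an MME, the `TargetModel` parameter of the SageMaker runtime `invoke_endpoint` call selects which model artifact to serve; SageMaker loads it onto the GPU on first use and evicts it under memory pressure. The endpoint and artifact names here are hypothetical placeholders.

```python
import json

def build_invoke_args(endpoint_name, target_model, payload):
    """Build keyword arguments for sagemaker-runtime invoke_endpoint.

    On a multi-model endpoint, TargetModel names the model artifact
    (relative to the endpoint's S3 model prefix) that should handle
    this request; the endpoint loads it dynamically if needed.
    """
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/octet-stream",
        "TargetModel": target_model,  # e.g. "resnet50_v1.tar.gz" (hypothetical)
        "Body": json.dumps(payload),
    }

# Usage (requires AWS credentials, so the call itself is shown commented out):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# args = build_invoke_args("triton-mme-gpu", "resnet50_v1.tar.gz",
#                          {"inputs": [...]})
# response = runtime.invoke_endpoint(**args)
```

Routing requests to different models is then just a matter of changing `TargetModel`; no new endpoint or instance is needed per model.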

Scaling inference with MMEs on GPUs

To harness the tremendous processing power of GPUs, MMEs use the Triton Inference Server concurrent model execution capability, which runs multiple models in parallel on the same AWS GPU instance. This functionality helps ML teams scale AI by running many models that each serve many inference requests under stringent latency requirements. Your ML team will see improvements in both GPU utilization and cost of inference.
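The setup behind this can be sketched with the SageMaker `CreateModel` API: setting the container's `Mode` to `MultiModel` tells SageMaker to treat the S3 prefix as a pool of model artifacts to load and unload on demand, rather than a single model. This is a minimal sketch assuming a hypothetical Triton container image URI, S3 prefix, and IAM role.

```python
def build_mme_model_config(model_name, triton_image_uri,
                           model_data_prefix, role_arn):
    """Build the request body for sagemaker.create_model for an MME.

    Mode="MultiModel" makes the endpoint serve any artifact under
    model_data_prefix, loading models dynamically as traffic arrives.
    """
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        "PrimaryContainer": {
            "Image": triton_image_uri,        # hypothetical Triton image URI
            "ModelDataUrl": model_data_prefix,  # s3:// prefix holding *.tar.gz
            "Mode": "MultiModel",
        },
    }

# Usage (requires AWS credentials; the API call is shown commented out):
# import boto3
# sm = boto3.client("sagemaker")
# config = build_mme_model_config(
#     "triton-mme", "<account>.dkr.ecr.<region>.amazonaws.com/triton:latest",
#     "s3://my-bucket/models/", "arn:aws:iam::<account>:role/SageMakerRole")
# sm.create_model(**config)
```

From there, an endpoint config pointing at a GPU instance type completes the deployment; all models under the prefix then share that instance.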

Support is available in all regions where Amazon SageMaker is available, at no additional cost for the Triton Inference Server container.

Start using Amazon SageMaker multi-model endpoints on GPUs today.

Join the NVIDIA Triton and NVIDIA TensorRT community and stay current on the latest products.