Deploying and integrating trained machine learning (ML) models in production remains a hard problem, both for application developers and for the infrastructure teams supporting them. How do you ensure you have right-sized compute resources to support multiple end users, serve disparate workloads at the highest level of performance, automatically balance the load, and scale up or down based on demand? And how do you do all of this while delivering the best user experience, maximizing utilization, and minimizing operational costs? It’s a tall order, to say the least.
Solving these challenges requires bringing together two things: (1) workloads optimized for inference performance that is repeatable and portable, and (2) simplified, automated cluster infrastructure management that is both secure and scalable. NVIDIA and Amazon Web Services (AWS) have collaborated to do just that: Amazon Elastic Kubernetes Service (EKS), a managed Kubernetes service that scales, load balances, and orchestrates workloads, now offers native support for the Multi-Instance GPU (MIG) feature of the NVIDIA A100 Tensor Core GPUs that power Amazon EC2 P4d instances.
This new integration gives developers access to right-sized GPU acceleration for applications big and small, and gives infrastructure managers the flexibility to efficiently scale and serve multi-user or multi-model AI inference use cases, such as Intelligent Video Analytics, Conversational AI pipelines, and recommender systems, with greater granularity.
With A100’s MIG feature enabled, each EC2 P4d instance can be partitioned into as many as 56 separate 5GB GPU instances, each with its own high-bandwidth memory, cache, and compute cores. Amazon EKS can then provision each P4d instance as a node with up to 56 schedulable GPU instances, where each GPU instance can serve an independent workload: one EC2 P4d instance, 56 accelerators. These GPU instances can be incrementally and dynamically scaled up or down on demand for optimal utilization and cost savings.
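As a sketch of what this looks like from the Kubernetes side, a pod can request a single MIG slice as a schedulable resource. This assumes the NVIDIA device plugin is deployed on the cluster with its mixed MIG strategy, which exposes each 1g.5gb GPU instance under the resource name nvidia.com/mig-1g.5gb; the pod name and container image here are illustrative placeholders.

```yaml
# Illustrative pod spec: request one 5GB MIG GPU instance on an EKS node.
# Assumes the NVIDIA device plugin is running with the "mixed" MIG strategy.
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference-pod        # illustrative name
spec:
  containers:
  - name: inference
    image: my-registry/my-inference-app:latest   # placeholder image
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # one 1g.5gb MIG slice of an A100
```

Because each of the up to 56 slices on a P4d node is advertised as its own resource, the Kubernetes scheduler can pack 56 such pods onto a single instance, each isolated on its own GPU instance.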
NVIDIA makes deployment even easier with Triton Inference Server, open-source inference serving software that simplifies deploying AI models at scale in production. It lets teams deploy trained AI models from any framework (TensorFlow, TensorRT, PyTorch, ONNX Runtime, or a custom framework), from local storage or Amazon Simple Storage Service (Amazon S3), on any GPU- or CPU-based infrastructure (cloud, data center, or edge). Triton Inference Server is available from NGC, a hub for GPU-optimized pre-trained models, scripts, Helm charts, and a wide array of AI and HPC software. With the NGC Catalog now available in AWS Marketplace, developers and infrastructure managers can leverage NVIDIA’s GPU-optimized software stack, tuned for the MIG capability of the latest A100 GPUs, without leaving the AWS portal.
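As a minimal sketch, Triton can be pulled as a container image from NGC and pointed directly at an S3 model repository using its --model-repository flag. The release tag and bucket path below are illustrative placeholders; AWS credentials are assumed to come from the standard environment variables or an attached instance role.

```shell
# Pull the Triton Inference Server container from NGC
# (the release tag here is a placeholder; use a current one)
docker pull nvcr.io/nvidia/tritonserver:21.03-py3

# Serve models straight from an S3 model repository on the default
# HTTP (8000), gRPC (8001), and metrics (8002) ports
docker run --gpus=1 --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  nvcr.io/nvidia/tritonserver:21.03-py3 \
  tritonserver --model-repository=s3://my-bucket/model_repo   # placeholder bucket
```

The same image works whether the container is granted a full GPU or a single MIG slice, which is what makes Triton a natural fit for the per-slice scheduling described above.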
Ready to get started and evaluate the Amazon EKS and A100 MIG integration? Check out the AWS blog for a step-by-step walkthrough of using Amazon EKS to scale up to 56 independent GPU-accelerated Image Super-Resolution (ISR) inference workloads in parallel on a single EC2 P4d instance. With the combination of A100 MIG, Amazon EKS, and the P4d instance, you can get a 2.5x speedup compared to processing the same workloads on the same instance without MIG enabled. Better utilization, better user experiences, and lower costs.