Run:AI
Assign the Right Amount of Compute Power to Users, Automatically
Run:AI’s Kubernetes-based software platform for orchestrating containerized AI workloads lets GPU clusters be used dynamically across different deep learning workloads - from building AI models, to training, to inference. With Run:AI, jobs at any stage automatically get access to the compute power they need.
Run:AI’s compute management platform speeds up data science initiatives by pooling available resources and then dynamically allocating them based on need - maximizing the compute accessible to each workload.
Key Features
- Fair-share scheduling to allow users to easily and automatically share clusters of GPUs
- Simplified multi-GPU distributed training
- Visibility into workloads and resource utilization to improve user productivity
- Control for cluster admins and ops teams to align resource priorities with business goals
- On-demand access to Multi-Instance GPU (MIG) instances for the A100 GPU
Key Benefits
Advanced Kubernetes-based Scheduling Eliminates Static GPU Allocation
The Run:AI Scheduler manages tasks in batches using multiple queues on top of Kubernetes, allowing system admins to define rules, policies, and requirements for each queue based on business priorities. Combined with an over-quota system and configurable fairness policies, resource allocation can be automated and optimized for maximum utilization of cluster resources.
Because it is built as a Kubernetes plug-in, the Run:AI Scheduler requires no advanced setup and is certified to integrate with Kubernetes “flavors” including Red Hat OpenShift and HPE Ezmeral.
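To make the queue model concrete, here is a minimal Python sketch of fair-share batch scheduling. It is a toy illustration, not Run:AI’s implementation: the `Queue`/`Job` names, the quota semantics, and the priority ordering are all assumptions made for the example.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    priority: int                      # lower value = scheduled first
    name: str = field(compare=False)
    gpus: int = field(compare=False)

class Queue:
    """Per-project queue with a guaranteed GPU quota and a business priority."""
    def __init__(self, name, quota, priority):
        self.name, self.quota, self.priority = name, quota, priority
        self.jobs = []        # heap of pending jobs, ordered by job priority
        self.allocated = 0    # GPUs this queue currently holds

def schedule(queues, free_gpus):
    """One scheduling pass: serve queues in business-priority order and
    admit jobs only while the queue stays within its guaranteed quota."""
    placed = []
    for q in sorted(queues, key=lambda q: q.priority):
        while (q.jobs
               and q.jobs[0].gpus <= free_gpus
               and q.allocated + q.jobs[0].gpus <= q.quota):
            job = heapq.heappop(q.jobs)
            q.allocated += job.gpus
            free_gpus -= job.gpus
            placed.append((q.name, job.name))
    return placed, free_gpus

# Example: a training queue with higher business priority than inference.
train = Queue("train", quota=6, priority=0)
infer = Queue("infer", quota=2, priority=1)
heapq.heappush(train.jobs, Job(0, "bert-large", gpus=4))
heapq.heappush(infer.jobs, Job(0, "resnet-serving", gpus=1))
print(schedule([train, infer], free_gpus=8))
# -> ([('train', 'bert-large'), ('infer', 'resnet-serving')], 3)
```

A production scheduler also handles bin-packing, consolidation, and preemption; the sketch only shows the core idea of serving per-queue workloads within guaranteed quotas.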
No More Idle Resources
Run:AI’s over-quota system lets users automatically borrow idle resources when they are available, governed by configurable fairness policies. The platform allocates resources dynamically, for full utilization of the cluster. Our customers typically see utilization improve from around 25% when we start working with them to over 75%.
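The over-quota idea can be sketched in a few lines as well. In this hypothetical policy (the proportional-by-weight split is an assumption for illustration, not Run:AI’s documented algorithm), idle GPUs are divided among queues that still have pending demand, according to their fairness weights:

```python
def over_quota_split(demands, weights, idle_gpus):
    """Hand idle GPUs to queues with pending demand, in proportion
    to each queue's configurable fairness weight (toy policy)."""
    active = {q: w for q, w in weights.items() if demands.get(q, 0) > 0}
    total = sum(active.values())
    grant = {}
    for q, w in active.items():
        # never grant more than the queue actually asks for
        grant[q] = min(demands[q], int(idle_gpus * w / total))
    return grant

# Two queues idle-borrowing from a third that is under-using its quota:
print(over_quota_split(demands={"nlp": 4, "vision": 4},
                       weights={"nlp": 2, "vision": 1, "speech": 1},
                       idle_gpus=6))
# -> {'nlp': 4, 'vision': 2}
```

The key property is that anything borrowed beyond a queue’s guaranteed quota is typically reclaimable when the owning team submits work, which is what makes this kind of sharing safe to automate.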
Bridge Between HPC and AI
By bridging the efficiency of High-Performance Computing with the simplicity of Kubernetes, the Run:AI Scheduler lets users easily consume integer (whole) GPUs, multiple nodes of GPUs, and even GPU MIG instances for distributed training on Kubernetes. In this way, AI workloads run based on need, not on static capacity.
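From the user’s point of view, a distributed training job is ordinary framework code; the scheduler’s job is placement. The skeleton below is plain PyTorch DistributedDataParallel (not a Run:AI API) and runs unchanged whether it lands on a single GPU, a full node, or several nodes:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # The launcher (e.g. torchrun) sets RANK/WORLD_SIZE/LOCAL_RANK for
    # each worker; the same script scales from one GPU to many nodes.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(1024, 10).cuda(local_rank),
                device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(100):
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(x).sum()
        opt.zero_grad()
        loss.backward()     # NCCL all-reduces gradients across all workers
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Because each worker reads its rank from the environment at launch, the same container image can be scheduled onto whatever GPU topology the cluster has free at that moment.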
More Information
- See a Demo of Run:AI
- Watch a video - The Run:AI Super-Scheduler in Action
- White paper - NVIDIA DGX Solution Stack
- Case Study - Autonomous Vehicle Company Ends GPU Scheduling ‘Horror’
- Case Study - From 28% to 73% Utilization with Run:AI
- eBook on Improving GPU Utilization
- Watch a video - NVIDIA, King’s College London and Run:AI on How to Build the Best AI Infrastructure Stack