Run:AI
Assign the Right Amount of Compute Power to Users, Automatically
Run:AI’s Kubernetes-based software platform for orchestrating containerized AI workloads lets GPU clusters be used dynamically across different deep learning workloads - from building AI models, to training, to inference. With Run:AI, jobs at any stage automatically get access to the compute power they need.
Run:AI’s compute management platform speeds up data science initiatives by pooling available resources and then dynamically allocating them based on need - maximizing the compute accessible to each workload.
Key Features
- Fair-share scheduling to allow users to easily and automatically share clusters of GPUs
- Simplified multi-GPU distributed training
- Visibility into workloads and resource utilization to improve user productivity
- Control for cluster admins and ops teams to align resource priorities with business goals
- On-demand access to Multi-Instance GPU (MIG) instances for the A100 GPU
Key Benefits
Advanced Kubernetes-based Scheduling Eliminates Static GPU Allocation
The Run:AI Scheduler manages tasks in batches using multiple queues on top of Kubernetes, allowing system admins to define rules, policies, and requirements for each queue based on business priorities. Combined with an over-quota system and configurable fairness policies, resource allocation can be automated and optimized for maximum utilization of cluster resources.
Because it is built as a Kubernetes plug-in, the Run:AI Scheduler requires no advanced setup and is certified to integrate with Kubernetes “flavors” including Red Hat OpenShift and HPE Ezmeral.
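To make the queue model concrete, here is a minimal Python sketch of fair-share batch scheduling. It is a toy illustration, not Run:AI’s implementation: the `Queue`/`Job` names, the quota semantics, and the priority ordering are all assumptions made for the example.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    priority: int                      # lower value = scheduled first
    name: str = field(compare=False)
    gpus: int = field(compare=False)

class Queue:
    """Per-project queue with a guaranteed GPU quota and a business priority."""
    def __init__(self, name, quota, priority):
        self.name, self.quota, self.priority = name, quota, priority
        self.jobs = []        # heap of pending jobs, ordered by job priority
        self.allocated = 0    # GPUs this queue currently holds

def schedule(queues, free_gpus):
    """One scheduling pass: serve queues in business-priority order and
    admit jobs only while the queue stays within its guaranteed quota."""
    placed = []
    for q in sorted(queues, key=lambda q: q.priority):
        while (q.jobs
               and q.jobs[0].gpus <= free_gpus
               and q.allocated + q.jobs[0].gpus <= q.quota):
            job = heapq.heappop(q.jobs)
            q.allocated += job.gpus
            free_gpus -= job.gpus
            placed.append((q.name, job.name))
    return placed, free_gpus

# Example: a training queue with higher business priority than inference.
train = Queue("train", quota=6, priority=0)
infer = Queue("infer", quota=2, priority=1)
heapq.heappush(train.jobs, Job(0, "bert-large", gpus=4))
heapq.heappush(infer.jobs, Job(0, "resnet-serving", gpus=1))
print(schedule([train, infer], free_gpus=8))
# -> ([('train', 'bert-large'), ('infer', 'resnet-serving')], 3)
```

A production scheduler also handles bin-packing, consolidation, and preemption; the sketch only shows the core idea of serving per-queue workloads within guaranteed quotas.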
No More Idle Resources
Run:AI’s over-quota system lets users automatically borrow idle resources when they are available, governed by configurable fairness policies. The platform allocates resources dynamically, for full utilization of the cluster. Our customers typically see utilization improve from around 25% when we start working with them to over 75%.
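The over-quota idea can be sketched in a few lines as well. In this hypothetical policy (the proportional-by-weight split is an assumption for illustration, not Run:AI’s documented algorithm), idle GPUs are divided among queues that still have pending demand, according to their fairness weights:

```python
def over_quota_split(demands, weights, idle_gpus):
    """Hand idle GPUs to queues with pending demand, in proportion
    to each queue's configurable fairness weight (toy policy)."""
    active = {q: w for q, w in weights.items() if demands.get(q, 0) > 0}
    total = sum(active.values())
    grant = {}
    for q, w in active.items():
        # never grant more than the queue actually asks for
        grant[q] = min(demands[q], int(idle_gpus * w / total))
    return grant

# Two queues idle-borrowing from a third that is under-using its quota:
print(over_quota_split(demands={"nlp": 4, "vision": 4},
                       weights={"nlp": 2, "vision": 1, "speech": 1},
                       idle_gpus=6))
# -> {'nlp': 4, 'vision': 2}
```

The key property is that anything borrowed beyond a queue’s guaranteed quota is typically reclaimable when the owning team submits work, which is what makes this kind of sharing safe to automate.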
Bridge Between HPC and AI
By bridging the efficiency of High-Performance Computing with the simplicity of Kubernetes, the Run:AI Scheduler lets users easily consume integer (whole) GPUs, multiple nodes of GPUs, and even GPU MIG instances for distributed training on Kubernetes. In this way, AI workloads run based on need, not on static capacity.
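From the user’s point of view, a distributed training job is ordinary framework code; the scheduler’s job is placement. The skeleton below is plain PyTorch DistributedDataParallel (not a Run:AI API) and runs unchanged whether it lands on a single GPU, a full node, or several nodes:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # The launcher (e.g. torchrun) sets RANK/WORLD_SIZE/LOCAL_RANK for
    # each worker; the same script scales from one GPU to many nodes.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(1024, 10).cuda(local_rank),
                device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(100):
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(x).sum()
        opt.zero_grad()
        loss.backward()     # NCCL all-reduces gradients across all workers
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Because each worker reads its rank from the environment at launch, the same container image can be scheduled onto whatever GPU topology the cluster has free at that moment.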
More Information
- See a Demo of Run:AI
- Watch a video - The Run:AI Super-Scheduler in Action
- White paper - NVIDIA DGX Solution Stack
- Case Study - Autonomous Vehicle Company Ends GPU Scheduling ‘Horror’
- Case Study - From 28% to 73% Utilization with Run:AI
- eBook on Improving GPU Utilization
- Watch a video - NVIDIA, King’s College London and Run:AI on How to Build the Best AI Infrastructure Stack