Organizations are increasingly adopting hybrid and multi-cloud strategies to access the latest compute resources, consistently support worldwide customers, and optimize cost. However, a major challenge for engineering teams is operationalizing AI applications across different platforms, because the underlying stack changes from one platform to the next. MLOps teams must familiarize themselves with each environment, and developers must customize applications to run across the target platforms.
NVIDIA offers a consistent, full stack to develop on a GPU-powered on-premises or cloud instance. You can then deploy that AI application on any GPU-powered platform without code changes.
Introducing the latest NVIDIA Virtual Machine Image
The NVIDIA Cloud Native Stack Virtual Machine Image (VMI) is a GPU-accelerated VMI that comes preinstalled with Cloud Native Stack, a reference architecture that includes upstream Kubernetes and the NVIDIA GPU Operator. NVIDIA Cloud Native Stack VMI enables you to build, test, and run GPU-accelerated containerized applications orchestrated by Kubernetes.
The NVIDIA GPU Operator automates the lifecycle management of the software required to expose GPUs on Kubernetes. It enables advanced functionality, including better GPU performance, utilization, and telemetry. Certified and validated for compatibility with industry-leading Kubernetes solutions, GPU Operator enables organizations to focus on building applications, rather than managing Kubernetes infrastructure.
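As a concrete illustration, once the GPU Operator has exposed GPUs as the nvidia.com/gpu resource, a pod can request one as in the following minimal sketch (the pod name and CUDA image tag are assumptions, not part of the stack):
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # any CUDA-enabled image available to your cluster works
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1                        # scheduled only on a node with a free GPU
EOF
kubectl logs gpu-smoke-test   # prints the nvidia-smi output once the pod completes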
NVIDIA Cloud Native Stack VMI is available on AWS, Azure, and GCP.
Now Available: Enterprise support by NVIDIA
For enterprise support for NVIDIA Cloud Native Stack VMI and GPU Operator, purchase NVIDIA AI Enterprise through an NVIDIA partner.
Developing AI solutions from concept to deployment is not easy. Keep your AI projects on track with NVIDIA AI Enterprise Support Services. Included with the purchase of the NVIDIA AI Enterprise software suite, this comprehensive offering gives you direct access to NVIDIA AI experts, defined service-level agreements, and control of your upgrade and maintenance schedules with long-term support options. Additional services, including training and AI workload onboarding, are available.
Run:ai is now certified on NVIDIA AI Enterprise
Run:ai, an industry leader in compute orchestration for AI workloads, has certified NVIDIA AI Enterprise, an end-to-end, secure, cloud-native suite of AI software, on its Atlas platform. This additional certification enables enterprises to accelerate the data science pipeline. They can focus on streamlining the development and deployment of predictive AI models to automate essential processes and gain rapid insights from data.
Run:ai provides an AI Computing platform that simplifies the access, management, and utilization of GPUs in cloud and on-premises clusters. Smart scheduling and advanced fractional GPU capabilities ensure that you get the right amount of compute for the job.
Run:ai Atlas includes GPU orchestration capabilities that help researchers consume GPUs more efficiently by automating the orchestration of AI workloads and the management and virtualization of hardware resources across teams and clusters.
Run:ai can be installed on any Kubernetes cluster to provide efficient scheduling and monitoring capabilities for your AI infrastructure. With the NVIDIA Cloud Native Stack VMI, you can add cloud instances to a Kubernetes cluster so that they become GPU-powered worker nodes of the cluster.
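On an upstream Kubernetes cluster bootstrapped with kubeadm, adding such a worker node typically looks like the following (a sketch under that assumption; your Cloud Native Stack version may document a different join flow, and the host, token, and hash are placeholders):
# On an existing control-plane node, generate a join command with a fresh token
kubeadm token create --print-join-command
# On the newly launched Cloud Native Stack VMI instance, run the printed command, for example
sudo kubeadm join <control-plane-host>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>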
Here’s a testimonial from one of our team members: “As an engineer, without the NVIDIA Cloud Native Stack VMI, there is a lot of manual work involved. With the Cloud Native Stack VMI, it was two clicks and took care of provisioning Kubernetes and Docker and the GPU Operator. It was easier and faster to get started on my work.”
Set up a Cloud Native Stack VMI on AWS
In the AWS marketplace, launch an NVIDIA Cloud Native Stack VMI using the Launch an AWS Marketplace instance instructions.
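If you prefer the AWS CLI over the console, launching an instance from the VMI looks roughly like this (a sketch; the AMI ID, instance type, key pair, and networking parameters are placeholders you must replace with your own values):
aws ec2 run-instances \
  --image-id <cloud-native-stack-vmi-ami-id> \
  --instance-type g4dn.xlarge \
  --count 1 \
  --key-name <your-key-pair> \
  --security-group-ids <your-security-group-id> \
  --subnet-id <your-subnet-id>
After the instance is up, you can SSH in and confirm that the preinstalled stack is healthy before moving on:
kubectl get nodes
kubectl get pods -A | grep -i gpu-operator   # GPU Operator pods should be in the Running state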
Ensure that the necessary prerequisites have been met and install Run:ai using the Cluster Install instructions. After the installation, on the Overview dashboard, you should see that the metrics begin to populate. On the Clusters tab, you should also see the cluster as connected.
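As a quick sanity check, you can also confirm from the command line that the Run:ai components are running (assuming the cluster install used the default runai namespace):
kubectl get pods -n runai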
Next, add a few flags to the kube-apiserver command in the kube-apiserver.yaml manifest to enable user authentication on the Run:ai platform. For more information, see Administration User Interface Setup.
By default, you can find the kube-apiserver.yaml file in the following directory:
/etc/kubernetes/manifests/kube-apiserver.yaml
You can validate that the oidc flags were successfully picked up by the kube-apiserver by inspecting the running pod spec. Look for the oidc flags in the output.
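One way to check (a sketch; <node-name> is a placeholder for your control-plane node, because the kubeadm static pod is named kube-apiserver-<node-name>):
kubectl -n kube-system get pod kube-apiserver-<node-name> -o yaml | grep oidc
In the full pod spec, the flags appear under spec.containers[0].command, as in the following excerpt: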
spec:
  containers:
  - command:
    - kube-apiserver
    - --oidc-client-id=runai
    - --oidc-issuer-url=https://app.run.ai/auth/realms/nvaie
    - --oidc-username-prefix=-
Set up the Unified UI and create a new project. Projects define GPU quota guarantees for the data scientists and researchers using the Run:ai platform.
Name the new project and assign it at least one GPU. For this post, I created one project with a two-GPU quota and another project with no GPU quota, labeled nvaie-high-priority and nvaie-low-priority, respectively. After the projects are created, you can install the Run:ai CLI tool, which enables you to submit workloads to the cluster.
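A minimal sketch of getting the CLI ready, assuming you have already downloaded the runai binary from the Run:ai user interface (the download location and exact commands may vary by CLI version):
chmod +x ./runai
sudo mv ./runai /usr/local/bin/runai
runai login                                # authenticate against the Run:ai control plane
runai config project nvaie-high-priority   # optionally set a default project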
The following commands use the runai CLI to submit a job (job1 or job2) using a Docker image called quickstart, which contains TensorFlow, CUDA, a model, and the data that is fed in to train the model. Each job uses one GPU for training (-g 1) and is submitted on behalf of the high-priority or low-priority project, denoted by the -p parameter.
Deploy a few test jobs to show some of Run:ai’s orchestration functionality by running:
runai submit job1 -i gcr.io/run-ai-demo/quickstart -g 1 -p nvaie-high-priority
runai submit job2 -i gcr.io/run-ai-demo/quickstart -g 1 -p nvaie-low-priority
You can check the status of the jobs by running:
runai describe job job1 -p nvaie-high-priority
runai describe job job2 -p nvaie-low-priority
Both workloads are now training on the GPUs, as you can see on the Overview dashboard.
You can submit an additional workload to highlight Run:ai’s job preemption capabilities. Currently, the nvaie-high-priority project is guaranteed access to both GPUs because its assigned GPU quota is set to 2. Submit an additional workload for the nvaie-high-priority project and observe that it preempts the nvaie-low-priority job.
Job preemption relies on checkpointing: the training workload saves its current progress at a checkpoint and is then removed from the GPU. The training progress is preserved, and the GPU is freed for a higher-priority workload to run.
runai submit job3 -i gcr.io/run-ai-demo/quickstart -g 1 -p nvaie-high-priority
You can check the status of the job by running:
runai describe job job3 -p nvaie-high-priority
If you go back to the Overview dashboard, you’ll see the two jobs running for the nvaie-high-priority project, while the workload from nvaie-low-priority has been preempted and placed back into the pending queue. The pending workload is automatically rescheduled when a GPU becomes available.
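You can also confirm the preemption from the CLI (flags may vary by CLI version); job2 should now report a pending status:
runai list jobs -p nvaie-low-priority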
To clean up your jobs, run the following commands:
runai delete job job2 -p nvaie-low-priority
runai delete job job1 job3 -p nvaie-high-priority
Summary
NVIDIA offers a consistent, full stack to develop on a GPU-powered on-premises or cloud instance. Developers and MLOps teams can then deploy that AI application on any GPU-powered platform without code changes.
Run:ai, an industry leader in compute orchestration for AI workloads, has certified NVIDIA AI Enterprise, an end-to-end, secure, cloud-native suite of AI software, on its Atlas platform. You can purchase NVIDIA AI Enterprise through an NVIDIA partner to obtain enterprise support for the NVIDIA Cloud Native Stack VMI and GPU Operator. Included with the purchase of the NVIDIA AI Enterprise software suite, this comprehensive offering gives you direct access to NVIDIA AI experts, defined service-level agreements, and control of your upgrade and maintenance schedules with long-term support options.
For more information, see the following resources: