Developer Blog

Reliably provisioning servers with GPUs can quickly become complex, as multiple components must be installed and managed to use GPUs with Kubernetes. The GPU Operator, based on the Operator Framework, simplifies both the initial deployment and ongoing management of these components. NVIDIA, Red Hat, and others in the community have collaborated on creating the GPU Operator.

Architecture stack including NVIDIA Driver, NVIDIA Kubernetes Device Plugin, GPUDirect RDMA Driver, and EGX Stack on NVIDIA EGX Systems hardware.

To provision GPU worker nodes in a Kubernetes cluster, the following NVIDIA software components are required: the NVIDIA Driver, the NVIDIA Container Toolkit, the Kubernetes device plugin, and monitoring. These components must be manually provisioned before GPU resources are available to the cluster and must be managed during cluster operation.

The GPU Operator simplifies both the initial deployment and management of the components by containerizing all components and using standard Kubernetes APIs for automating and managing these components, including versioning and upgrades. The GPU Operator is fully open-source and is available on NGC and as part of the NVIDIA EGX Stack and Red Hat OpenShift.
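As a sketch of what a Helm-based install typically looks like (the chart name and flags follow the NGC Helm repository conventions; verify them against the release documentation for your GPU Operator version):

```shell
# Add the NVIDIA Helm repository (hosted on NGC) and refresh the index
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
  && helm repo update

# Deploy the GPU Operator; --wait blocks until the operands are ready
helm install --wait --generate-name nvidia/gpu-operator
```

After the install completes, the operator deploys and manages the driver, container toolkit, device plugin, and monitoring components on the GPU nodes it discovers.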

The latest GPU Operator releases, 1.2, 1.3, 1.4, and 1.5, include several new features:

  • Added support for Red Hat OpenShift
  • Integrated GPU discovery
  • Compatibility with NVIDIA A100
  • Air-gapped support
  • Added support for Ubuntu Server 20.04
  • Added support for NVIDIA vGPU

Added support for Red Hat OpenShift

Continuing our support for Red Hat OpenShift, GPU Operator 1.5 now supports the latest Red Hat OpenShift version, 4.6. GPU Operator 1.3, 1.4, and 1.5 also support Red Hat OpenShift 4.4 and 4.5.

The GPU Operator is a certified OpenShift operator. Through the OpenShift web console, you can install and start using it with only a few clicks. Certification makes it significantly easier for you to use NVIDIA GPUs with Red Hat OpenShift.

View of the NVIDIA GPU Operator available in Red Hat OpenShift
Figure 1. NVIDIA GPU Operator available on Red Hat OpenShift.

NGC Support Services

NGC Support Services provide enterprises with direct access to NVIDIA subject matter experts to help minimize downtime during software deployment and maximize user productivity during development. NGC Support Services are offered as part of DGX, NGC-Ready, and NVIDIA-Certified systems.

The support now includes GPU Operator and covers deep learning, inference, and data analytics software containers from the NGC catalog running on bare-metal and virtualized environments on both Ubuntu and RHEL operating systems. Additionally, you have access to the NGC private registry, a secure cloud-hosted environment that includes features such as container scanning, model versioning, and user-based access control to help centralize AI development and deployment efforts.

Integrated GPU Feature Discovery

NVIDIA GPU Feature Discovery for Kubernetes is a software component that leverages the Node Feature Discovery (NFD) software to automatically detect hardware features available on each node in a Kubernetes cluster, and to advertise those features using node labels.

While NFD can detect NVIDIA GPUs, it does not create labels that distinguish between GPU models (for example, NVIDIA A100 versus NVIDIA V100). GPU Feature Discovery (GFD) adds that granularity on top of NFD, allowing you to choose the optimal GPU for your workload.

Previously, you had to either manually configure Kubernetes to recognize the different types of GPUs in a cluster (for example, A100, T4, and V100) or use only nodes with identical GPUs. With GFD, Kubernetes can now differentiate between the GPU types in a cluster, giving you more granularity and precision when deploying applications.
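For example, once GFD has labeled the nodes (you can inspect the labels with `kubectl get nodes --show-labels`), a pod can target a specific GPU model through a nodeSelector. A minimal sketch; the pod name is hypothetical and the exact product-label value varies by system, so check the labels GFD actually publishes on your nodes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-a100-job          # hypothetical name
spec:
  nodeSelector:
    # Label published by GPU Feature Discovery; the value shown is illustrative
    nvidia.com/gpu.product: A100-SXM4-40GB
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1      # GPU resource exposed by the device plugin
```

The scheduler then places the pod only on nodes whose GFD labels match, so the workload lands on the intended GPU model.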

Compatibility with NVIDIA A100 

With the latest GPU Operator release, you can now download the GPU Operator, including the GPU driver, container runtime, Kubernetes device plugin, and other components, for use with the NVIDIA A100.

NVIDIA A100 Tensor Core GPU delivers acceleration at every scale to power high-performing elastic data centers for AI, data analytics, and HPC. Powered by the NVIDIA Ampere architecture, A100 is the engine of the NVIDIA data center platform. A100 provides up to 20X higher performance over the prior generation and can be partitioned into seven GPU instances to dynamically adjust to shifting demands. Available in 40-GB and 80-GB memory versions, A100 debuts the world’s fastest memory bandwidth at over 2 terabytes per second (TB/s) to run the largest models and datasets.

NVIDIA vGPU

GPU Operator 1.5 adds support for NVIDIA vGPU software, which enables organizations running business-critical applications on hypervisors to share physical GPUs across multiple virtual machines. The GPU Operator is specifically supported with the NVIDIA Virtual Compute Server and NVIDIA RTX Virtual Workstation software editions.

Air-gapped support

With growing cybersecurity threats and increasingly strict data-protection classification levels, many customers must run modern workloads in restricted or completely air-gapped environments. Many industry verticals, including the public sector, intelligence, healthcare, telco, and financial tech, have strict security and data-protection guidelines that dictate operating in these limited-connectivity environments.

Edge computing is another area with limited or no network connectivity. Often, these environments have no guarantee of connectivity, and even connected edge environments suffer from network instability or inconsistent throughput.

With the current release, NVIDIA has added support for running the GPU Operator on Kubernetes platforms in restricted or completely air-gapped environments. Air-gapped support enables you to easily configure and consume your NVIDIA accelerators for AI/ML workloads in even your most restrictive environments.
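In practice, an air-gapped install usually means mirroring the operator's container images to a private registry and pointing the Helm chart at that mirror. A sketch using Helm value overrides; the registry host below is hypothetical, and the exact value keys (such as `driver.repository`) should be checked against the `values.yaml` of your chart version:

```shell
# Install from a local copy of the chart, pulling all images
# from a private registry mirror instead of nvcr.io
helm install --wait --generate-name nvidia/gpu-operator \
  --set operator.repository=registry.local:5000/nvidia \
  --set driver.repository=registry.local:5000/nvidia \
  --set toolkit.repository=registry.local:5000/nvidia \
  --set devicePlugin.repository=registry.local:5000/nvidia
```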

Containerd, Red Hat, and CentOS

GPU Operator 1.4 adds support for containerd, which has become one of the most popular container runtimes for Kubernetes. The NVIDIA GPU Operator now supports the containerd, Docker, and CRI-O container runtimes, the last in combination with Red Hat Enterprise Linux CoreOS (RHCOS).
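When the cluster uses containerd, the runtime can be selected at install time through a chart value. A sketch under the assumption that your chart version exposes the `operator.defaultRuntime` value; confirm the key against your version's documentation:

```shell
# Tell the operator to configure the NVIDIA runtime for containerd
# instead of the default (Docker)
helm install --wait --generate-name nvidia/gpu-operator \
  --set operator.defaultRuntime=containerd
```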

GPU Operator 1.4 also adds support for RHCOS 4.6, as well as CentOS 7 and CentOS 8. Support for Ubuntu has expanded from Ubuntu 18.04 to Ubuntu 20.04, the latest release.

Summary

To start using NVIDIA GPU Operator today, see the following resources: