Streamlining NVIDIA Driver Deployment on RHEL 8 with Modularity Streams

NVIDIA GPUs have become mainstream for accelerating a variety of workloads, including machine learning, high-performance computing (HPC), content creation workflows, and data center applications. For these enterprise use cases, NVIDIA provides a software stack powered by the CUDA platform: drivers, CUDA-X acceleration libraries, CUDA-optimized applications, and frameworks.

Deploying the NVIDIA driver is one of the fundamental aspects of setting up a GPU-accelerated cluster for using CUDA. In the past, installing or upgrading the NVIDIA driver required a full software development environment, such as compiler toolchains and kernel headers, on each GPU node. Enterprise users also want tested combinations of NVIDIA drivers and Linux kernels for stability, along with the ability to stay on specific driver branches, which may have different lifetimes.

In this post, I cover the work done on packaging for the NVIDIA driver on Red Hat Enterprise Linux (RHEL) 8 to improve the experience of installing and upgrading drivers. This work provides several benefits, including improved reliability, security, and choice. It relies on the modularity streams available in RHEL 8 and on precompiled kernel module (kmod) packages.

DNF modularity

Using Modularity, the CUDA repository provides multiple update streams for driver packages. Only updates on the selected stream are considered. You have the option to keep up with the latest and greatest or lock down to a specific driver branch, for example, drivers with major versions equal to “450”.

This new mechanism allows you to switch to different streams based on your use case. You can choose from one of the multiple NVIDIA GPU driver branches available to follow from a single RPM repository. Some NVIDIA drivers are qualified for use on NVIDIA data center GPUs and may have extended lifetimes compared to other driver branches. Enterprise users may choose to stay on a specific driver branch for stability reasons, while others may want to track other branches for access to new features.

You can pick a specific driver branch, such as R418, for which to track updates and only get updates from that branch. The packages also provide two virtual branches, latest and latest-dkms, that track the most recent NVIDIA driver at each point in time. The latest-dkms branch is the default. The other branches are opt-in, and branches can be switched without requiring the reinstallation of the CUDA Toolkit.

Using precompiled drivers

For supported Red Hat Enterprise Linux 8.x kernel releases (see support matrix below), driver packages are provided that implement an alternative to DKMS. The EPEL repository does not need to be enabled. The source files for these driver kmod packages are compiled in advance and then linked at installation time, so these are called “precompiled drivers.”

The new approach does not require the gcc compiler to be installed, resulting in a reduced attack surface and faster boot up times on kernel and/or driver updates. Using these precompiled kmod packages offers greater stability, as the exact NVIDIA driver version and kernel version string combination has been pre-tested. Say goodbye to black screens (runlevel 3) and hello to a predictable user experience, with a driver installation that no longer depends on kernel-devel and kernel-headers packages.

When a new driver update is released, precompiled driver packages are provided only for the most recently released kernel at the time of the driver update. Likewise, if a new kernel update is released, precompiled driver packages are provided for this kernel. In other words, at any point in time, precompiled drivers are available for the most recent RHEL kernel combined with the most recent NVIDIA driver version on each supported branch.

When using precompiled drivers, a plugin for the dnf package manager is enabled that cleans up stale .ko files. To prevent system breakages, the NVIDIA dnf plugin also prevents upgrading to a kernel for which no precompiled driver yet exists. This can delay the application of security fixes but ensures that a tested kernel and driver combination is always used.

Installing using the package manager

Here’s how to get started with using the new driver packages on RHEL 8. First, ensure that the Red Hat repositories are enabled, including RHEL8 AppStream, RHEL8 BaseOS and RHEL8 CRB:

$ sudo subscription-manager repos --enable=rhel-8-for-x86_64-appstream-rpms
$ sudo subscription-manager repos --enable=rhel-8-for-x86_64-baseos-rpms
$ sudo subscription-manager repos --enable=codeready-builder-for-rhel-8-x86_64-rpms

Add the CUDA network repository:

$ sudo dnf config-manager --add-repo=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

Install the latest stream to opt into precompiled packages:

$ sudo dnf module install nvidia-driver:latest
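
As a pre-flight check for the steps above, the following sketch reports which of the three Red Hat repositories are missing from a listing of enabled repositories. The function name missing_rhel8_repos is hypothetical, and the sketch assumes that each enabled repository ID appears somewhere in the text piped in, as it does in the output of dnf repolist --enabled.

```shell
# Hypothetical helper: read a listing of enabled repositories on stdin and
# print the ID of each required repository that does not appear in it.
missing_rhel8_repos() {
    listing="$(cat)"
    for repo in \
        rhel-8-for-x86_64-appstream-rpms \
        rhel-8-for-x86_64-baseos-rpms \
        codeready-builder-for-rhel-8-x86_64-rpms
    do
        # -F matches the repository ID as a fixed string, not a regex.
        printf '%s\n' "$listing" | grep -qF -- "$repo" || echo "$repo"
    done
}

# Example (prints nothing when all three repositories are enabled):
#   dnf repolist --enabled | missing_rhel8_repos
```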

Choosing a modularity stream

For improved flexibility, several streams are available in both precompiled and DKMS varieties (Table 1).

NVIDIA driver	Precompiled stream	Legacy DKMS stream
Highest version	latest	latest-dkms
Locked @ 455.x	455	455-dkms
Locked @ 450.x	450	450-dkms
Locked @ 440.x	440	440-dkms
Locked @ 418.x	418	418-dkms

Table 1. List of nvidia-driver module streams available.

The latest option always updates to the highest versioned driver (precompiled):

$ sudo dnf module install nvidia-driver:latest

The <id> option locks the driver updates to the specified driver branch (precompiled). Replace <id> with the appropriate driver branch streams, for example, 455, 450, 440, or 418.

$ sudo dnf module install nvidia-driver:<id>

The latest-dkms option always updates to the highest versioned driver (non-precompiled). This is the default stream.

$ sudo dnf module install nvidia-driver:latest-dkms

The <id>-dkms option locks the driver updates to the specified driver branch (non-precompiled), for example, 455-dkms, 450-dkms, 440-dkms, or 418-dkms.

$ sudo dnf module install nvidia-driver:<id>-dkms
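
The naming scheme in Table 1 can be captured in a small helper for provisioning scripts. This is a minimal sketch: the function name nvidia_stream_name is hypothetical, and the list of branches is copied from Table 1, so it would need updating as branches are added or retired.

```shell
# Hypothetical helper: map a driver branch and a packaging flavor to the
# stream name listed in Table 1.
#   nvidia_stream_name <latest|455|450|440|418> [precompiled|dkms]
nvidia_stream_name() {
    case "$1" in
        latest|455|450|440|418) ;;
        *) echo "unknown branch: $1" >&2; return 1 ;;
    esac
    case "${2:-precompiled}" in
        precompiled) echo "$1" ;;        # for example, 450
        dkms)        echo "$1-dkms" ;;   # for example, 450-dkms
        *) echo "unknown flavor: $2" >&2; return 1 ;;
    esac
}

# Example:
#   sudo dnf module install "nvidia-driver:$(nvidia_stream_name 450 dkms)"
```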

Switching streams

To switch to another stream, first remove the driver packages:

$ sudo dnf remove nvidia-driver

Then, reset the module stream:

$ sudo dnf module reset nvidia-driver

Now the driver can be installed from an appropriate stream.
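
The full switch, including the final install step, can be sketched as a helper that prints the commands for review instead of running them. The function name nvidia_switch_stream_cmds is hypothetical; the commands it prints are the ones used in this post.

```shell
# Hypothetical helper: print the command sequence for moving the
# nvidia-driver module to another stream. Nothing is executed.
#   nvidia_switch_stream_cmds <target-stream>
nvidia_switch_stream_cmds() {
    if [ -z "$1" ]; then
        echo "usage: nvidia_switch_stream_cmds <target-stream>" >&2
        return 1
    fi
    echo "sudo dnf remove nvidia-driver"             # remove the driver packages
    echo "sudo dnf module reset nvidia-driver"       # forget the current stream
    echo "sudo dnf module install nvidia-driver:$1"  # install from the new stream
}

# Example:
#   nvidia_switch_stream_cmds 450-dkms
```

Printing the commands keeps the confirmation prompts from dnf interactive when each line is then run by hand.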

Using modularity profiles

Modularity profiles work with any supported modularity stream and allow for additional use cases (Table 2).

Profile	Specifier	Use case
Default	/default	Installs all the driver packages in a stream.
Kickstart	/ks	Performs unattended Linux OS installation using a config file.
NVSwitch Fabric	/fm	Installs all the driver packages plus components required for bootstrapping an NVSwitch system (including the Fabric Manager and NSCQ telemetry).

Table 2. List of nvidia-driver module profiles available.

Now, you can use the dnf command to specify the stream and profile:

$ sudo dnf module install nvidia-driver:<stream>/<profile>

The /default option installs all the driver packages in a stream (transitive closure):

$ sudo dnf module install nvidia-driver:latest/default

The /ks option is intended for unattended Linux OS installation using a Kickstart config file. This profile does not install the cuda-drivers metapackage, which attempts to remove old driver runfile installations.

%packages
@^Minimal Install
@nvidia-driver:latest-dkms/ks
%end

The /fm option installs additional packages for bootstrapping NVSwitch, including Fabric Manager and NSCQ (for switch telemetry):

$ sudo dnf module install nvidia-driver:450/fm
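
When the stream and profile come from configuration rather than being typed by hand, it can help to validate the profile before calling dnf. The following is a minimal sketch, assuming only the three profiles in Table 2 exist; the function name nvidia_module_spec is hypothetical.

```shell
# Hypothetical helper: build the nvidia-driver:<stream>/<profile> argument
# for dnf module install, rejecting profiles that are not in Table 2.
#   nvidia_module_spec <stream> [default|ks|fm]
nvidia_module_spec() {
    if [ -z "$1" ]; then
        echo "usage: nvidia_module_spec <stream> [default|ks|fm]" >&2
        return 1
    fi
    case "${2:-default}" in
        default|ks|fm) echo "nvidia-driver:$1/${2:-default}" ;;
        *) echo "unknown profile: $2" >&2; return 1 ;;
    esac
}

# Example:
#   sudo dnf module install "$(nvidia_module_spec 450 fm)"
```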

Support matrix for RHEL

Currently, these package improvements are supported for RHEL 8.2 (and later) on x86_64 architecture only. NVIDIA provides precompiled driver packages only for the latest official RHEL kernel, for example, 4.18.0-193.19.1 and later. If you use an earlier kernel, update to start receiving precompiled driver packages. Precompiled drivers are not provided for RHEL EUS kernels.

Table 3 shows the branches that are supported according to NVIDIA driver lifecycle policy.

Driver Branch	Branch Designation	End of Life
418	Long Term Service	March 2022
440	New Feature	November 2020
450	Long Term Service	July 2023
455	Developer	460 availability

Table 3. Support matrix for NVIDIA driver branches.

New kmod packages are typically available within 24 hours of a new RHEL kernel update.

To prevent system breakages, the dnf plugin blocks kernel updates between a kernel going live and kmod package availability. A warning is displayed by dnf during that upgrade situation:

NOTE: Skipping kernel installation since no NVIDIA driver kernel module package kmod-nvidia-${driver}-${kernel} ... could be found
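
Automation that applies updates may want to detect this situation so that the held-back kernel can be retried later. The following is a minimal sketch, assuming the note is emitted with the wording quoted above and that the dnf output has been captured to a file or pipe; the function name nvidia_kernel_update_held and the log file name are hypothetical.

```shell
# Hypothetical helper: succeed (exit status 0) if the dnf output on stdin
# contains the note about a skipped kernel installation.
nvidia_kernel_update_held() {
    grep -q 'Skipping kernel installation since no NVIDIA driver kernel module package'
}

# Example (captures both stdout and stderr, since the stream used for the
# note is not stated in this post):
#   sudo dnf -y upgrade 2>&1 | tee upgrade.log
#   nvidia_kernel_update_held < upgrade.log && echo "kernel update held back"
```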

Summary

Deploying the NVIDIA driver on RHEL 8 is a better experience using precompiled kernel module packages and modularity streams. The new driver packages are available in the CUDA repository, so you can get started today.

Packaging templates and instructions are provided on GitHub to allow you to maintain your own precompiled kernel module packages for custom kernels and derivative Linux distros.

For more information, see the following resources: