NVIDIA GPUs have become mainstream for accelerating a variety of workloads, including machine learning, high-performance computing (HPC), content creation workflows, and data center applications. For these enterprise use cases, NVIDIA provides a software stack powered by the CUDA platform: drivers, CUDA-X acceleration libraries, CUDA-optimized applications, and frameworks.
Deploying the NVIDIA driver is one of the fundamental aspects of setting up a GPU-accelerated cluster for CUDA. In the past, installing or upgrading the NVIDIA drivers required a full software development environment, such as compiler toolchains and kernel headers, on each GPU node. Enterprise users also want tested combinations of NVIDIA driver and Linux kernel for stability, and the ability to stay on specific driver branches, which may have different lifetimes.
In this post, I cover the work done on packaging for the NVIDIA driver on Red Hat Enterprise Linux (RHEL) 8 to improve the experience of installing and upgrading drivers. This work provides several benefits, including improved reliability, security, and choice. It uses the modularity streams available in RHEL 8 and precompiled kernel module (kmod) packages.
DNF modularity
Using Modularity, the CUDA repository provides multiple update streams for driver packages. Only updates on the selected stream are considered. You have the option to keep up with the latest and greatest or lock down to a specific driver branch, for example, drivers with major versions equal to “450”.
This new mechanism allows you to switch to different streams based on your use case. You can choose from one of the multiple NVIDIA GPU driver branches available to follow from a single RPM repository. Some NVIDIA drivers are qualified for use on NVIDIA data center GPUs and may have extended lifetimes compared to other driver branches. Enterprise users may choose to stay on a specific driver branch for stability reasons, while others may want to track other branches for access to new features.
You can pick a specific driver branch, such as R418, and receive updates only from that branch. The packages also provide two virtual branches, latest and latest-dkms, that track the most recent NVIDIA driver at each point in time. The latest-dkms branch is the default. The other branches are opt-in, and branches can be switched without reinstalling the CUDA Toolkit.
Using precompiled drivers
For supported Red Hat Enterprise Linux 8.x kernel releases (see support matrix below), driver packages are provided that implement an alternative to DKMS. The EPEL repository does not need to be enabled. The source files for these driver kmod packages are compiled in advance and then linked at installation time, so these are called “precompiled drivers.”
The new approach does not require the gcc compiler to be installed, resulting in a reduced attack surface and faster boot times after kernel or driver updates. Using these precompiled kmod packages offers greater stability, as the exact combination of NVIDIA driver version and kernel version string has been pre-tested. Say goodbye to black screens (runlevel 3) and hello to a predictable user experience, with a driver installation that no longer depends on the kernel-devel and kernel-headers packages.
When a new driver update is released, precompiled driver packages are provided only for the most recently released kernel at the time of the driver update. Likewise, if a new kernel update is released, precompiled driver packages are provided for this kernel. Put another way, at any point in time, precompiled drivers are provided for the most recent RHEL kernel and the most recent NVIDIA driver version (per supported branch).
When using precompiled drivers, a plugin for the dnf package manager is enabled that cleans up stale .ko files. To prevent system breakages, the NVIDIA dnf plugin also prevents upgrading to a kernel for which no precompiled driver yet exists. This can delay the application of security fixes but ensures that a tested kernel and driver combination is always used.
Installing using the package manager
Here’s how to get started with using the new driver packages on RHEL 8. First, ensure that the Red Hat repositories are enabled, including RHEL8 AppStream, RHEL8 BaseOS and RHEL8 CRB:
$ subscription-manager repos --enable=rhel-8-for-x86_64-appstream-rpms
$ subscription-manager repos --enable=rhel-8-for-x86_64-baseos-rpms
$ subscription-manager repos --enable=codeready-builder-for-rhel-8-x86_64-rpms
Add the CUDA network repository:
$ sudo dnf config-manager --add-repo=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
Install the latest stream to opt into precompiled packages:
$ sudo dnf module install nvidia-driver:latest
Choosing a modularity stream
For improved flexibility, several streams are available in both precompiled and DKMS varieties (Table 1).
NVIDIA driver | Precompiled stream | Legacy DKMS stream |
Highest version | latest | latest-dkms |
Locked @ 455.x | 455 | 455-dkms |
Locked @ 450.x | 450 | 450-dkms |
Locked @ 440.x | 440 | 440-dkms |
Locked @ 418.x | 418 | 418-dkms |
The latest option always updates to the highest versioned driver (precompiled):
$ sudo dnf module install nvidia-driver:latest
The <id> option locks the driver updates to the specified driver branch (precompiled). Replace <id> with the appropriate driver branch stream, for example, 455, 450, 440, or 418:
$ sudo dnf module install nvidia-driver:<id>
The latest-dkms option always updates to the highest versioned driver (non-precompiled). This is the default stream.
$ sudo dnf module install nvidia-driver:latest-dkms
The <id>-dkms option locks the driver updates to the specified driver branch (non-precompiled), for example, 455-dkms, 450-dkms, 440-dkms, or 418-dkms.
$ sudo dnf module install nvidia-driver:<id>-dkms
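The four cases above differ only in the stream name, so the module spec can be assembled mechanically. The following is a minimal sketch and not part of the NVIDIA packages: stream_spec is a hypothetical helper, and its list of branch IDs is copied from Table 1, so it goes stale as branches are added or retired. It prints the dnf command instead of running it.

```shell
# Hypothetical helper: build an nvidia-driver module spec from a branch
# and a build type. The accepted branch IDs mirror Table 1 and are an
# assumption about what the repository currently offers.
stream_spec() {
    branch=$1    # latest, 455, 450, 440, or 418
    build=$2     # "precompiled" or "dkms"

    case "$branch" in
        latest|455|450|440|418) ;;
        *) echo "unknown branch: $branch" >&2; return 1 ;;
    esac

    case "$build" in
        precompiled) echo "nvidia-driver:${branch}" ;;
        dkms)        echo "nvidia-driver:${branch}-dkms" ;;
        *) echo "unknown build type: $build" >&2; return 1 ;;
    esac
}

# Print (do not run) the install command for the precompiled 450 branch.
echo "sudo dnf module install $(stream_spec 450 precompiled)"
```

Keeping the branch list in one place makes it easier to reject a typo such as 45O before it reaches dnf.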
Switching streams
To switch to another stream, first remove the driver packages:
$ sudo dnf remove nvidia-driver
Then, reset the module stream:
$ sudo dnf module reset nvidia-driver
Now the driver can be installed from an appropriate stream.
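The order of these steps matters, so it can help to generate them together. The sketch below uses a hypothetical function, switch_stream_plan, that only prints the three commands for a given target stream. It does not execute anything, and it does not check that the stream exists in the repository.

```shell
# Hypothetical helper: print the commands for moving to another stream,
# in the order described above. Nothing is executed.
switch_stream_plan() {
    target=$1    # for example: 450, 450-dkms, latest
    if [ -z "$target" ]; then
        echo "usage: switch_stream_plan <stream>" >&2
        return 1
    fi
    echo "sudo dnf remove nvidia-driver"
    echo "sudo dnf module reset nvidia-driver"
    echo "sudo dnf module install nvidia-driver:${target}"
}

# Show the plan for switching to the precompiled 450 branch.
switch_stream_plan 450
```

Review the printed plan before running the commands by hand, because removing the driver affects any running GPU workloads.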
Using modularity profiles
Modularity profiles work with any supported modularity stream and allow for additional use cases (Table 2).
Name | Profile | Use case |
Default | /default | Installs all the driver packages in a stream. |
Kickstart | /ks | Performs unattended Linux OS installation using a config file. |
NVSwitch Fabric | /fm | Installs all the driver packages plus components required for bootstrapping an NVSwitch system (including the Fabric Manager and NSCQ telemetry). |
Now, you can use the dnf command to specify the stream and profile:
$ sudo dnf module install nvidia-driver:<stream>/<profile>
The /default option installs all the driver packages in a stream (transitive closure):
$ sudo dnf module install nvidia-driver:latest/default
The /ks option is intended for unattended Linux OS installation using a Kickstart config file. It does not install the cuda-drivers metapackage, which attempts to remove old driver runfile installations:
%packages
@^Minimal Install
@nvidia-driver:latest-dkms/ks
%end
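If Kickstart files are generated per node group, the %packages section can be templated on the stream. This is a rough sketch under that assumption: ks_packages_section is a hypothetical function name, the Minimal Install environment is taken from the example above, and the default stream matches that example.

```shell
# Hypothetical helper: emit a Kickstart %packages section that selects
# the /ks profile of the given nvidia-driver stream.
ks_packages_section() {
    stream=${1:-latest-dkms}    # default mirrors the example in the text
    cat <<EOF
%packages
@^Minimal Install
@nvidia-driver:${stream}/ks
%end
EOF
}

# Emit a section that locks unattended installs to the 450-dkms stream.
ks_packages_section 450-dkms
```

The output is meant to be pasted or appended into a larger Kickstart file, which still needs its own repository and partitioning directives.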
The /fm option installs additional packages for bootstrapping NVSwitch, including Fabric Manager and NSCQ (for switch telemetry):
$ sudo dnf module install nvidia-driver:450/fm
Support matrix for RHEL
Currently, these package improvements are supported for RHEL 8.2 (and later) on x86_64 architecture only. NVIDIA provides precompiled driver packages only for the latest official RHEL kernel, for example, 4.18.0-193.19.1 and later. If you use an earlier kernel, update to start receiving precompiled driver packages. Precompiled drivers are not provided for RHEL EUS kernels.
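Before opting into a precompiled stream, you might check whether a node's kernel is at least as new as the example kernel mentioned above. The following sketch compares only the leading numeric fields of a kernel release string. kernel_at_least is a hypothetical helper, and passing this check does not guarantee that a precompiled package exists for that exact kernel.

```shell
# Hypothetical helper: succeed if kernel release $1 is at least $2.
# Only the leading numeric part is compared; a suffix such as
# ".el8_2.x86_64" is dropped before the comparison.
kernel_at_least() {
    have=$(printf '%s' "$1" | sed -e 's/[^0-9.-].*$//' -e 's/-/./g')
    want=$(printf '%s' "$2" | sed -e 's/[^0-9.-].*$//' -e 's/-/./g')
    i=1
    while :; do
        a=$(printf '%s' "$have" | cut -d. -f"$i")
        b=$(printf '%s' "$want" | cut -d. -f"$i")
        if [ -z "$a" ] && [ -z "$b" ]; then
            return 0    # ran out of fields: the versions are equal
        fi
        a=${a:-0}
        b=${b:-0}
        if [ "$a" -gt "$b" ]; then return 0; fi
        if [ "$a" -lt "$b" ]; then return 1; fi
        i=$((i + 1))
    done
}

# Compare the running kernel against the example version from the text.
if kernel_at_least "$(uname -r)" 4.18.0-193.19.1; then
    echo "running kernel is at or above 4.18.0-193.19.1"
else
    echo "running kernel predates 4.18.0-193.19.1; update first"
fi
```

The field-by-field comparison avoids relying on version-aware sorting tools, at the cost of ignoring any non-numeric suffix.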
Table 3 shows the branches that are supported according to NVIDIA driver lifecycle policy.
Driver Branch | Branch Designation | End of Life |
418 | Long Term Service | March 2022 |
440 | New Feature | November 2020 |
450 | Long Term Service | July 2023 |
455 | Developer | 460 availability |
New kmod packages are typically available within 24 hours of a new RHEL kernel update.
To prevent system breakages, the dnf plugin blocks kernel updates between a kernel going live and kmod package availability. In that situation, dnf displays a warning during the upgrade:
NOTE: Skipping kernel installation since no NVIDIA driver kernel module package kmod-nvidia-${driver}-${kernel} ... could be found
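The decision behind that message can be illustrated with a small sketch. This is not the plugin's actual implementation. It only follows the package name pattern visible in the warning, and both function names and all version numbers are example values chosen for illustration. The exact form of the kernel part of a real package name is not spelled out here, so treat the generated names as assumptions.

```shell
# Hypothetical helper: build a kmod package name using the pattern
# shown in the warning message.
kmod_package_name() {
    driver=$1    # full driver version (example: 450.80.02)
    kernel=$2    # kernel version string (example: 4.18.0-193.19.1)
    if [ -z "$driver" ] || [ -z "$kernel" ]; then
        echo "usage: kmod_package_name <driver> <kernel>" >&2
        return 1
    fi
    echo "kmod-nvidia-${driver}-${kernel}"
}

# Hypothetical helper: succeed if the list of package names on stdin
# contains the expected kmod package as a whole line.
kmod_available() {
    grep -Fqx "$(kmod_package_name "$1" "$2")"
}

# Example: only the package for the older kernel has been published.
available='kmod-nvidia-450.80.02-4.18.0-193.19.1'
if printf '%s\n' "$available" | kmod_available 450.80.02 4.18.0-193.28.1; then
    echo "kernel update can proceed"
else
    echo "skip kernel update: no matching kmod package yet"
fi
```

With the example values, the second branch is taken, which corresponds to the case where the kernel update is held back until the matching package appears.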
Summary
Deploying the NVIDIA driver on RHEL 8 is a better experience using precompiled kernel module packages and modularity streams. The new driver packages are available in the CUDA repository, so you can get started today.
Packaging templates and instructions are provided on GitHub so that you can maintain your own precompiled kernel module packages for custom kernels and derivative Linux distros.
For more information, see the following resources:
- Precompiled Kernel Modules: Packaging and Deployment on RHEL8 with Modularity Streams (GTC Fall 2020 session)
- Simplifying NVIDIA GPU Driver Deployment on Red Hat Enterprise Linux (Red Hat Summit 2020)
- RHEL8 precompiled kmod status
To give feedback, send comments or report bugs. If you are not already a member, join the NVIDIA Developer Program.