Simplifying HPC Workflows with NVIDIA NGC Container Environment Modules

Many system administrators use environment modules to manage software deployments. Environment modules let you load and unload software configurations dynamically and cleanly, so end users can tailor a specific configuration to each application.

However, robustly supporting HPC and deep learning applications with complex dependencies can be challenging. Updating dependencies for one application runs the risk of breaking the dependencies of another application. With so many different users and their computing needs, HPC administrators are often overwhelmed with the amount of time they must spend to install, upgrade, and monitor software.

Containers can be a great way to simplify the overall deployment process. In this post, we introduce NVIDIA NGC Container Environment Modules, which bring together containers and environment modules in an easy-to-use, customizable, reference design.

Why containers?

There are significant advantages to using containers when deploying software. Containers allow you to package a software application, libraries, and other runtime dependencies into a single image. This way, the application environment is both portable and consistent, and agnostic to the underlying host system software configuration. Containers eliminate the need to install complex software environments and allow you to pull and run applications on the system without any assistance from system administrators. Deploying an application takes just a few minutes, saving time for both users and administrators.

Containers allow researchers to share their applications with other researchers for corroboration. Without containers, it becomes extremely tedious to replicate the exact environment needed to reproduce the results of a computational model. Equally important, an application running in a container performs on par with a bare-metal install.

Containers as modules

Over the past few years, we have seen containers become quite popular when it comes to deploying HPC applications, but we also realize that most HPC sites and users are accustomed to using environment modules.

With this in mind, we developed a flexible, open-source reference design called NGC Container Environment Modules. NGC Container Environment Modules are lightweight wrappers that deploy the latest containers from NGC using environment modules.

The reference design has several key benefits:

  • Use familiar environment module commands, ensuring a small learning curve and minimal changes to your workflows.
  • Run your applications with all the benefits of containers, such as portability, reproducibility, and security.
  • Extract maximum performance from your hardware and software by using the latest versions of HPC and deep learning application containers from NGC.
  • Get the benefit of flexible, configurable modules that fit seamlessly in your compute environment.

Why NGC?

NVIDIA NGC offers a comprehensive catalog of GPU-optimized software for deep learning, machine learning, and HPC applications that can be deployed on-premises, in the cloud, or at the edge.

Figure 1. The various types of HPC and DL containers available from NGC.

With over 100 containers, NGC provides easy-to-deploy software proven to deliver the fastest results. By taking care of the plumbing, NGC enables researchers to focus on their research and gather faster insights. For system administrators, the containers from NGC provide a way to empower researchers without having to manage the complex process of upgrading applications. The containers on NGC are tested and tuned to perform optimally on existing hardware through updated libraries and new versions of compilers.

Figure 2. Software performance improvements from one version to another for DL and HPC applications.

Use cases

Lmod, the Lua-based environment module system, and Singularity are prerequisites for using the NGC Container Environment Modules. The NGC Container Environment Modules are a set of Lmod configuration files that transparently map certain commands to run inside the container. More precisely, each module creates a shell function that maps a command “X” to singularity run --nv container.sif X. The modules can also automatically download the NGC container image if necessary or use a local library of container images.
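The command mapping can be sketched as a plain shell function. This is a minimal illustration under stated assumptions, not the code the modules actually generate: the wrapped command name myapp and the image path held in NGC_SIF are hypothetical placeholders.

```shell
# Hypothetical image path; the real modules determine this for you.
NGC_SIF="${NGC_SIF:-container.sif}"

# Map the host command "myapp" to the same command inside the container.
# --nv exposes the host's NVIDIA GPUs and driver libraries to the container.
# "$@" forwards all arguments unchanged.
myapp() {
    singularity run --nv "$NGC_SIF" myapp "$@"
}
```

With the function defined, typing myapp --help on the host runs singularity run --nv container.sif myapp --help, so existing scripts that call the command do not need to change.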

Because Singularity automatically mounts the user’s home directory, current working directory, and /tmp inside the container, commands that depend on files on the host work as expected. You can set the SINGULARITY_BINDPATH environment variable to mount additional host directories inside the container. For instance, if your site has a global scratch directory and shared datasets, set SINGULARITY_BINDPATH=/scratch,/shared/datasets.
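As a short sketch, the bind path can be exported in a shell startup file or a site-wide profile script. The two directories are the example paths from the paragraph above; where to place the export is an assumption about your site's setup.

```shell
# Comma-separated list of host directories to mount inside the container.
# /scratch and /shared/datasets are example paths; substitute your own.
export SINGULARITY_BINDPATH=/scratch,/shared/datasets

# Print one directory per line as a quick check of what will be mounted.
echo "$SINGULARITY_BINDPATH" | tr ',' '\n'
```

Because the variable is read by Singularity itself, every wrapped command picks up the additional mounts without any change to the modules.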

Figure 3. The overall workflow for the two supported use cases.

  1. Download NGC container images to a local shared location in advance.
  2. Configure Container Environment Modules to use the local NGC container images.
  3. Run standard environment module commands.
  4. If the image is not already present in the user’s private Singularity cache, first download the image from NGC, then run the program. If the image is already cached, run the program immediately.

The NGC Container Environment Modules support two use cases out-of-the-box:

  • A library of already downloaded container images is shared with all users.
  • A personal copy of the container image is downloaded the first time it is used, and then cached for subsequent uses.

Shared library

To use a container image library, set the NGC_IMAGE_DIR environment variable to the path to the container images. The NGC Container Replicator can be used to set up and maintain a local replica of the NGC container registry. The container image file names in the NGC Container Environment Modules default to the Singularity image files exported by the NGC Container Replicator.
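The choice between a shared image library and a registry download can be sketched as follows. This is an illustration only, and the actual modules may resolve images differently: the helper name resolve_image, the file name myapp_1.0.sif, and the registry URI docker://nvcr.io/hpc/myapp:1.0 are all hypothetical placeholders.

```shell
# Hypothetical helper: print the image reference to hand to Singularity.
#   $1 = image file name expected in the shared library (placeholder)
#   $2 = registry URI to fall back to (placeholder)
resolve_image() {
    if [ -n "${NGC_IMAGE_DIR:-}" ] && [ -f "$NGC_IMAGE_DIR/$1" ]; then
        # Shared library: use the image file that was downloaded in advance.
        printf '%s\n' "$NGC_IMAGE_DIR/$1"
    else
        # No shared copy: return the URI so Singularity pulls and caches it.
        printf '%s\n' "$2"
    fi
}
```

For example, after export NGC_IMAGE_DIR=/shared/containers (an assumed path), resolve_image myapp_1.0.sif docker://nvcr.io/hpc/myapp:1.0 prints the path of the shared file if it exists and the registry URI otherwise.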

Video 1. Downloading and running the containers directly from NGC.

Cached image

In the second use case, the container image is downloaded “on-the-fly” the first time one of the mapped commands is invoked. The container image is stored in your Singularity cache, where it is automatically reused on subsequent invocations. The default Singularity image cache is in your home directory, so your home directory quota could be exceeded if you work with multiple container images.
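One way to avoid filling the home directory quota is to point Singularity's cache at a location with more space by setting the SINGULARITY_CACHEDIR environment variable. The sketch below assumes such a location exists at your site; the helper name use_cache_dir and the singularity-cache subdirectory are hypothetical.

```shell
# Hypothetical helper: move Singularity's image cache out of $HOME.
#   $1 = base directory with enough quota (for example, a per-user scratch area)
use_cache_dir() {
    # Create the cache directory first; stop if that is not possible.
    mkdir -p "$1/singularity-cache" || return 1
    # Singularity reads this variable to locate its cache.
    export SINGULARITY_CACHEDIR="$1/singularity-cache"
}
```

A call such as use_cache_dir "/scratch/$USER" (an assumed path) redirects subsequent downloads. The singularity cache list and singularity cache clean commands can then be used to inspect and clear cached images.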

Video 2. Running the containers from a centrally located local directory.

You may wish to modify the reference NGC Container Environment Modules to customize their behavior for your specific requirements.

“The NGC Container Environment Modules are very similar to an approach we previously developed in-house, so we can adopt them with minimal changes, empower our researchers to focus on their research, and reduce our workload when deploying HPC and AI applications. With easy access to containers from NGC, our researchers can seamlessly deploy a curated set of applications that are validated and highly optimized for execution on NVIDIA GPUs.”

Erik Deumens, Research Computing Director, University of Florida

Summary

In heterogeneous HPC environments running thousands of applications, containers simplify the overall deployment process. NGC Container Environment Modules further simplify the deployment of HPC and deep learning applications with minimal changes to existing workflows, and they empower researchers to harness the power of containers while spending more of their time finding solutions. NGC Container Environment Modules are open source and available for immediate use on GitHub. For more information, including sample code, see the NVIDIA/ngc-container-environment-modules repo.