High-Performance Storage on NVIDIA DGX Cloud with Oracle Cloud Infrastructure

The incredible advances of accelerated computing are powered by data. The role of data in accelerating AI workloads is crucial for businesses looking to stay ahead of the curve in the current fast-paced digital environment. Speeding up access to that data is yet another way that NVIDIA accelerates entire AI workflows. 

NVIDIA DGX Cloud caters to a wide variety of market use cases. NVIDIA has invested heavily in the integration of software that leverages the unique capabilities offered by our infrastructure partners. Oracle Cloud Infrastructure (OCI) is a leading partner with NVIDIA in enabling the compute, networking, and storage infrastructure crucial for making DGX Cloud a reality. 

To enable high-performance storage for NVIDIA DGX Cloud on OCI, NVIDIA pairs Oracle’s bare-metal infrastructure with the NVIDIA NVMesh software. This enables file storage that is scalable on demand for use on DGX Cloud.

NVIDIA DGX Cloud, powered by NVIDIA partners

NVIDIA DGX Cloud is a multinode AI-training-as-a-service solution, providing enterprises their own AI supercomputer in the cloud. It delivers the tools, workflows, and leadership-class performance to enable developers to increase productivity by speeding time to insights. 

For enterprises that are already operating in the cloud, DGX Cloud eliminates the need to procure and install a supercomputer. Just open a browser to get started.

DGX Cloud is powered by NVIDIA Base Command Platform, a unified interface where developers and organization admins manage experiments and their lifecycle, along with the associated users and data. DGX Cloud includes NVIDIA AI Enterprise, which provides a variety of AI solution workflows, frameworks, and pretrained models to speed time to insights.

Video 1. Organizations can tap into the full potential of NVIDIA DGX infrastructure with NVIDIA Base Command

To help achieve the performance expected from NVIDIA infrastructure in the cloud, NVIDIA has partnered with leading cloud service providers on NVIDIA-certified compute infrastructure, often leveraging key DGX components like GPUs and networking. OCI's cloud design includes key elements that make accessing high-performance infrastructure easy, which makes OCI a compelling platform on which to operate DGX Cloud.

How OCI makes high-performance storage easy

Oracle’s cloud design uses key NVIDIA DGX components and prioritizes high-performance networking and storage. The OCI E4 DenseIO compute instances (or shapes) are an excellent fit to use as the building block for high-performance storage. For more details, see Announcing E4 DenseIO Instances with Twice the Performance for Database and Analytics Workloads and the Oracle compute shape documentation.

The bare metal E4 DenseIO shape provides the following hardware configuration:

  • 128 AMD EPYC Milan processor cores
  • 2 TB of system memory
  • 54.4 TB of NVMe storage across a total of 8 NVMe devices
  • 2 x 50 Gbps of high-performance Ethernet networking

In addition to low-latency, high-IOPS direct-attached NVMe storage, the two 50 Gbps physical NICs on E4 DenseIO shapes enable building a redundant, highly available parallel file system. The bare metal form factor, still uncommon among hyperscalers, provides dedicated resources without virtualization. The flexible networking provides security through isolation, streamlining multitenancy configuration.
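As a rough illustration of what these figures imply for a single shape, the following Python sketch derives the per-device NVMe capacity and the combined line rate of the two NICs from the hardware list above. The function names are hypothetical, and the bandwidth figure is a theoretical line rate that ignores protocol overhead, so it is an upper bound rather than a measured result.

```python
# Figures taken from the E4 DenseIO hardware list above.
NVME_TOTAL_TB = 54.4   # total NVMe capacity per shape
NVME_DEVICES = 8       # number of NVMe devices per shape
NIC_COUNT = 2          # physical NICs per shape
NIC_GBPS = 50          # line rate of each NIC, in gigabits per second


def per_device_capacity_tb(total_tb: float, devices: int) -> float:
    """Average capacity of one NVMe device, in TB."""
    return total_tb / devices


def aggregate_network_gbytes_per_s(nic_count: int, nic_gbps: float) -> float:
    """Combined NIC line rate in gigabytes per second (8 bits per byte).

    This is a theoretical ceiling; real throughput is lower because of
    protocol overhead, which this sketch does not model.
    """
    return nic_count * nic_gbps / 8


if __name__ == "__main__":
    print(round(per_device_capacity_tb(NVME_TOTAL_TB, NVME_DEVICES), 2))  # 6.8
    print(aggregate_network_gbytes_per_s(NIC_COUNT, NIC_GBPS))            # 12.5
```

Each shape therefore averages about 6.8 TB per NVMe device and can move at most about 12.5 GB/s over its two NICs combined.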

The storage design that best enables DGX Cloud on Oracle infrastructure combines the performance of E4 shapes with the flexibility of composing those shapes directly, instead of relying on a more general-purpose file service.

Composing shapes with NVMesh

NVMesh software is one of the key ways that DGX Cloud makes use of the OCI bare metal instances. NVMesh takes the raw NVMe storage provided by OCI E4 shapes and builds a high-performance data volume that maximizes the performance of the underlying hardware. It provides the data protection capabilities needed to avoid outages from hardware failures, and it encrypts data by default, further protecting user data from potential security threats.

In a DGX Cloud environment on OCI, a number of E4 DenseIO shapes are deployed for use in an availability domain. These shapes are organized in pairs, with high availability across each pair offered by the NVMesh software. These high-availability (HA) shape pairs are then used as the basis for the Lustre file system, which is presented to the user in a DGX Cloud environment as NVIDIA Base Command Platform dataset and workspace storage.
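The pairing described above can be pictured with a minimal Python sketch. This only illustrates the grouping logic; it is not how NVMesh or the DGX Cloud tooling assigns pairs, and the `make_ha_pairs` helper and the host names are hypothetical.

```python
from typing import List, Tuple


def make_ha_pairs(shapes: List[str]) -> List[Tuple[str, str]]:
    """Group an even-length list of shape names into consecutive pairs."""
    if len(shapes) % 2 != 0:
        raise ValueError("HA pairing needs an even number of shapes")
    return [(shapes[i], shapes[i + 1]) for i in range(0, len(shapes), 2)]


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    shapes = [f"denseio-{n:02d}" for n in range(4)]
    for first, partner in make_ha_pairs(shapes):
        print(first, "<->", partner)
```

Rejecting an odd number of shapes reflects the text's premise that every shape belongs to exactly one pair.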

A diagram showing OCI BM.DenseIO.E4 shapes organized in pairs as part of a Lustre file system, connected over a 50 Gbps Ethernet Fabric to OCI BM.GPU.A100-80 shapes for storage IO.
Figure 1. OCI shapes running NVMesh software connect to enable a high-performance Lustre file system for GPU clusters in NVIDIA DGX Cloud

If additional capacity is necessary, additional HA pairs can be provisioned to extend an active Lustre file system with no downtime. The design of the shape pairs also takes into account metadata scalability. Adding more HA pairs extends metadata linearly with capacity, ensuring that the resulting Lustre file system does not bottleneck performance on metadata capacity or operations. 
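To make the linear scaling concrete, here is a small Python sketch that estimates raw NVMe capacity as HA pairs are added, using the 54.4 TB per shape figure from the hardware list. The `usable_fraction` parameter is an assumption added for illustration: the source does not state how much raw capacity the data protection scheme consumes, so the 0.5 default (as if each pair held one full copy of the other's data) is a placeholder, not a documented value.

```python
RAW_TB_PER_SHAPE = 54.4  # raw NVMe capacity per E4 DenseIO shape (from the list above)
SHAPES_PER_PAIR = 2      # shapes are deployed in HA pairs


def raw_capacity_tb(ha_pairs: int) -> float:
    """Raw NVMe capacity, in TB, for a given number of HA pairs."""
    if ha_pairs < 0:
        raise ValueError("ha_pairs must be non-negative")
    return ha_pairs * SHAPES_PER_PAIR * RAW_TB_PER_SHAPE


def usable_capacity_tb(ha_pairs: int, usable_fraction: float = 0.5) -> float:
    """Estimated usable capacity, in TB.

    usable_fraction is a hypothetical placeholder for protection overhead;
    the real value depends on how the volumes are configured.
    """
    if not 0.0 < usable_fraction <= 1.0:
        raise ValueError("usable_fraction must be in (0, 1]")
    return raw_capacity_tb(ha_pairs) * usable_fraction


if __name__ == "__main__":
    for pairs in (1, 2, 4, 8):
        print(pairs, round(raw_capacity_tb(pairs), 1), round(usable_capacity_tb(pairs), 1))
```

Doubling the number of pairs doubles both figures, which mirrors the linear growth in capacity described above.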

According to tests conducted by NVIDIA using DGX Cloud on OCI across a wide range of real-world accelerated computing applications, the storage performance enabled results that matched what was observed with on-premises NVIDIA Base Command Platform environments.

Oracle made it easy for NVIDIA to leverage automation and existing technology as building blocks in the integration of DGX Cloud. OCI provides first-class support for Terraform. It is not uncommon for API providers to offer partial support in a tool like Terraform, directing a user outside of Terraform to a custom software component to enable newer functionality. 

However, this was not the case with OCI, leading to a stellar experience relying solely on Terraform for infrastructure automation. Additionally, the oracle-quickstart/oci-nfs repo on GitHub provided an early reference for NVIDIA engineering on OCI best practices for storage service bring-up. This further accelerated NVIDIA's adoption of OCI's Terraform-enabled capabilities.

Conclusion

NVIDIA DGX Cloud, powered by NVIDIA Base Command Platform, provides a consistent single-pane experience to manage your AI training jobs and view your infrastructure and model telemetry. It also enables collaboration and resource sharing to increase the productivity of your organization.

Through the NVIDIA partnership with Oracle and purpose-built software like NVMesh, NVIDIA DGX Cloud environments make optimal use of cloud service provider infrastructure to accelerate all aspects of AI workflows. DGX Cloud is an excellent choice for a wide variety of workloads, ranging from large language models (LLMs) to physics-informed machine learning. See Designing Digital Twins with Flexible Workflows on NVIDIA Base Command Platform for more details.
