GTC-DC 2019: Scaling Your Cluster from DGX-1 to DGX SuperPOD with DeepOps
Michael Balint, NVIDIA; Sumit Kumar, NVIDIA
We’ll show how to plan the deployment of AI infrastructure at scale with DGX software using DeepOps, from design and deployment to management and monitoring. DeepOps, an NVIDIA open source project, is used to deploy and manage DGX POD clusters. It is also used to deploy Kubernetes and Slurm in an on-premises, optionally air-gapped data center. The modularity of the Ansible scripts in DeepOps gives experienced DevOps administrators the flexibility to customize the deployment to their specific IT infrastructure requirements, whether that means implementing a high-performance benchmarking cluster or providing a data science team with Jupyter Notebooks that tap into GPUs. We’ll also describe how to leverage AI infrastructure at scale to support interactive training, machine learning pipelines, and inference use cases.