Base Command Manager (BCM) provides a comprehensive software solution for deploying and managing enterprise-grade Linux clusters. BCM provisions, monitors and manages GPU clusters, and continually incorporates the latest enhancements in NVIDIA GPU technology into its products, enabling BCM customers to use NVIDIA technology seamlessly.

Key Features

  • An intuitive web interface provides a comprehensive view of GPU and cluster metrics
  • The powerful Cluster Management Shell (CMSH) serves as an alternative administrative interface
  • Full support for NVIDIA libraries, CUDA, OpenCL, OpenACC, CUDA-aware libraries, NCCL, and CUB
  • Comprehensive GPU monitoring and health checking
  • Provisions GPU resources from public (AWS, Azure) and private (OpenStack) clouds within minutes
  • Auto-scaling of hybrid clouds based on workload and configured policies
  • Supports several popular Linux distributions: RHEL and derivatives, SUSE SLES and Ubuntu LTS
  • GPU-enabled Kubernetes and Singularity for running containers
  • Offers a complete machine learning stack
  • Deployment for popular HPC file systems and management of fast interconnects
  • Gives insight into how jobs use the resources allocated to them (e.g., GPU usage throughout a job's runtime)
  • Provides reports that help administrators identify users or groups of users who consistently underutilize the resources (e.g., GPUs) they allocate for their jobs
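Underutilization reporting of the kind described above could, for example, flag users whose GPU-weighted average utilization stays low across several jobs. The sketch below is hypothetical: it does not use any BCM API, and the `JobRecord` fields, the 30% threshold, and the three-job minimum are illustrative assumptions, not BCM's actual reporting criteria.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class JobRecord:
    # Hypothetical per-job summary; not a BCM data structure.
    user: str
    gpus_allocated: int
    mean_gpu_util: float  # percent (0-100), averaged over the job's runtime


def underutilizing_users(jobs, threshold=30.0, min_jobs=3):
    """Return {user: weighted_avg_util} for users whose GPU-weighted mean
    utilization is below `threshold` across at least `min_jobs` jobs."""
    weighted_util = defaultdict(float)  # sum of util * GPUs per user
    total_gpus = defaultdict(int)       # sum of allocated GPUs per user
    job_counts = defaultdict(int)       # number of jobs per user

    for job in jobs:
        weighted_util[job.user] += job.mean_gpu_util * job.gpus_allocated
        total_gpus[job.user] += job.gpus_allocated
        job_counts[job.user] += 1

    flagged = {}
    for user, util_sum in weighted_util.items():
        # "Consistently" is approximated here by requiring several jobs.
        if job_counts[user] < min_jobs or total_gpus[user] == 0:
            continue
        avg = util_sum / total_gpus[user]
        if avg < threshold:
            flagged[user] = round(avg, 1)

    # Lowest utilization first, so the worst cases lead the report.
    return dict(sorted(flagged.items(), key=lambda kv: kv[1]))


# Illustrative sample data (made up for this example).
sample_jobs = [
    JobRecord("alice", 8, 12.0),
    JobRecord("alice", 4, 20.0),
    JobRecord("alice", 8, 10.0),
    JobRecord("bob", 2, 85.0),
    JobRecord("bob", 2, 90.0),
    JobRecord("bob", 4, 70.0),
    JobRecord("carol", 1, 5.0),  # only one job, so not flagged
]

print(underutilizing_users(sample_jobs))  # {'alice': 12.8}
```

Weighting by allocated GPUs means that an idle 8-GPU job counts more heavily than an idle 1-GPU job, which matches the goal of finding allocated but unused capacity.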

BCM can sample and monitor metrics from supported GPUs and GPU computing systems, such as NVIDIA Hopper, Blackwell, Grace Hopper and Grace Blackwell, as well as legacy GPUs.

Examples of supported metrics include:

  • GPU temperatures
  • GPU exclusivity modes
  • GPU fan speeds
  • System fan speeds
  • PSU voltages and currents
  • System LED states
  • GPU ECC statistics
  • GPU utilization
  • GPU memory usage

BCM uses NVIDIA's Data Center GPU Manager (DCGM) for GPU health monitoring, diagnostics and validation.

Key benefits:

  • Unprecedented ease of use
  • Significant cost and time savings
  • Increased uptime and productivity
  • Scalability up to 100,000 compute nodes
  • Significant cost savings through dynamic scaling

BCM for Data Science

BCM empowers organizations to gain actionable insights from rich, complex data. To achieve this, BCM offers a comprehensive deep learning solution that includes:

  • A modern deep learning environment - BCM provides everything needed to spin up a deep learning environment and manage it effectively.
  • Choice of machine learning frameworks - BCM provides a choice of machine learning frameworks, including TensorFlow, TensorFlow 2, Horovod, Keras, PyTorch, Chainer, fast.ai, DyNet, MXNet and Theano, to simplify deep learning projects.
  • Choice of machine learning libraries - BCM includes a selection of the most popular machine learning libraries, including cuDNN, ONNX and TensorRT, along with tools to help access datasets.
  • Frameworks are provided for Python 3.7 and for multiple versions of CUDA.
  • Supporting infrastructure elements - BCM takes care of finding, configuring and deploying all of the dependencies needed to run deep learning libraries and frameworks. It includes over 400 MB of Python modules that support the machine learning packages, plus the NVIDIA hardware drivers, CUDA (parallel computing platform and API) drivers, CUB (CUDA building blocks) and NCCL2 (a library of standard collective communication routines).
  • Because the machine learning frameworks and libraries are constantly updated, see the BCM packages dashboard for the most up-to-date information.

For more information: