Slurm

Slurm is a highly configurable open source workload and resource manager. In its simplest configuration, Slurm can be installed and configured in a few minutes. Optional plugins provide the functionality that demanding HPC centers need to handle diverse job types, policies, and workflows. Advanced configurations use plugins to provide features such as accounting, resource limit management by user or bank account, and sophisticated scheduling algorithms.
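As a rough illustration of how plugins select features, the sketch below renders a few plugin settings in slurm.conf form from a Python dictionary. The parameter names and values are standard slurm.conf options, but this particular combination and the helper name render_slurm_conf are assumptions made for illustration, not a recommended or complete configuration.

```python
# Illustrative only: a real slurm.conf needs many more settings
# (cluster name, node and partition definitions, and so on).
PLUGIN_SETTINGS = {
    "SchedulerType": "sched/backfill",        # backfill scheduling
    "PriorityType": "priority/multifactor",   # priority factors, including fair-share
    "AccountingStorageType": "accounting_storage/slurmdbd",  # accounting via slurmdbd
    "AccountingStorageEnforce": "limits",     # enforce configured resource limits
    "PreemptType": "preempt/partition_prio",  # preemption by partition priority
    "TopologyPlugin": "topology/tree",        # topology-aware scheduling
}


def render_slurm_conf(settings):
    """Hypothetical helper: return the settings as 'Key=Value' lines."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


if __name__ == "__main__":
    print(render_slurm_conf(PLUGIN_SETTINGS))
```

Each line names one plugin family, so swapping a value (for example, a different scheduler or priority plugin) changes the behavior without touching the rest of the configuration.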

SchedMD is the core company behind the Slurm workload manager. At the time of writing, Slurm performs workload management on six of the ten most powerful computers in the world, including the number 1 system (Tianhe-2, with 3,120,000 computing cores) and the number 6 system, the GPGPU giant Piz Daint, which uses over 5,000 NVIDIA GPGPUs.

SchedMD performs the majority of Slurm development, reviews and integrates contributions from others, and distributes and maintains the canonical version of Slurm. It also provides support, installation, configuration, custom development, and training.

Key Features of Slurm

Scales to millions of cores and tens of thousands of GPGPUs
Military grade security
Heterogeneous platform support allowing users to take advantage of GPGPUs
Flexible plugin framework enables Slurm to meet complex customization requirements
Topology-aware job scheduling for maximum system utilization
Open Source
Extensive scheduling options including advanced reservations, suspend/resume, backfill, fair-share and preemptive scheduling for critical jobs
No single point of failure
Slurm enables new artificial intelligence (AI) capabilities to address some of the most challenging priorities on the largest AI systems in the world.
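
To make the GPGPU support listed above a little more concrete, here is a minimal sketch that generates a batch script requesting GPUs through the generic resource (GRES) option of sbatch. It assumes the cluster exposes its GPUs as a GRES named gpu; the helper name render_batch_script, the job name, and the ./my_app command are hypothetical.

```python
def render_batch_script(job_name, nodes, gpus_per_node, time_limit, command):
    """Hypothetical helper: return the text of a Slurm batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        # --gres is a per-node request; assumes a GRES named "gpu" is configured.
        f"#SBATCH --gres=gpu:{gpus_per_node}",
        f"#SBATCH --time={time_limit}",
        "",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two nodes, four GPUs on each, one hour wall-clock limit.
    print(render_batch_script("demo", 2, 4, "01:00:00", "./my_app"), end="")
```

The generated text could be saved to a file and submitted with sbatch; the scheduler then places the job only on nodes that can satisfy the GPU request.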

Availability

Download Slurm or contact SchedMD for Slurm Support at sales@schedmd.com or http://schedmd.com/#contact

More Info

General GPGPU/GRES Slurm documentation
GPGPU configuration options
Current Slurm documentation
Slurm Tutorials