HPC Slurm and MPI
This file covers the two most common interview topics for traditional HPC platforms: Slurm and MPI.
Slurm
Slurm is a resource manager and batch scheduler for HPC clusters.
It handles:
- job submission
- queueing
- resource allocation
- task launch
- accounting
Main Slurm Components
| Component | Role |
|---|---|
| slurmctld | controller and scheduler |
| slurmd | node daemon |
| slurmdbd | accounting database daemon |
| sbatch | submit batch jobs |
| srun | launch tasks |
| squeue | inspect jobs |
| sinfo | inspect partitions/nodes |
| sacct | inspect job history |
Slurm Concepts
Partition
A logical queue or node pool.
Common partitions:
- debug
- cpu
- gpu
- highmem
- long
QoS
Controls:
- priority
- runtime limits
- preemption behavior
- usage limits
Fairshare
Prevents monopolization by lowering priority for users/projects with heavy recent usage.
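The effect can be illustrated with the simplified classic fairshare factor, F = 2^(-U/S), where U is an account's normalized recent usage and S is its normalized share. This is only a sketch: `fairshare_factor` is a hypothetical helper, the input numbers are invented, and many sites run the Fair Tree algorithm, which ranks accounts differently.

```bash
# Simplified classic fairshare factor: F = 2^(-U/S).
# usage and share are fractions of the cluster (0..1); the inputs are made up.
fairshare_factor() {
    awk -v u="$1" -v s="$2" 'BEGIN { printf "%.3f\n", 2 ^ (-u / s) }'
}

fairshare_factor 0.10 0.20   # used half of the share
fairshare_factor 0.20 0.20   # used exactly the share
fairshare_factor 0.40 0.20   # used twice the share
```

The factor shrinks as usage grows relative to the share, which is what pushes heavy users down the queue.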
Backfilling
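Lets lower-priority jobs start early in idle gaps, provided they do not delay the expected start of higher-priority jobs that already hold a reservation. It depends on each job's requested time limit, so accurate limits help.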
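A minimal sketch of the backfill test, assuming a single gap of idle nodes that ends when a reserved job is due to start. `can_backfill` is a hypothetical helper, not a Slurm command, and the real scheduler also considers jobs that can run on nodes the reserved job does not need.

```bash
# A job may backfill if it fits in the idle nodes AND its time limit ends
# before the reserved start of the higher-priority job (times in minutes).
can_backfill() {
    free_nodes=$1; gap_minutes=$2; job_nodes=$3; job_minutes=$4
    if [ "$job_nodes" -le "$free_nodes" ] && [ "$job_minutes" -le "$gap_minutes" ]; then
        echo yes
    else
        echo no
    fi
}

can_backfill 8 60 4 30    # fits in both nodes and time
can_backfill 8 60 4 120   # would delay the reserved job
can_backfill 8 60 16 30   # needs more nodes than are idle
```

This is why a tight `--time` request can start sooner than a padded one.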
GRES
Generic resources such as:
- GPUs
- local SSDs
- licenses
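GRES requests are written as `name[:type]:count`, for example `--gres=gpu:2`. The sketch below splits such a string into its parts. `parse_gres` is a hypothetical helper, the type name `a100` is just an example that depends on site configuration, and the code assumes a count is always present.

```bash
# Split a GRES spec of the form name[:type]:count (e.g. gpu:a100:4).
parse_gres() {
    spec=$1
    name=${spec%%:*}           # text before the first colon
    count=${spec##*:}          # text after the last colon
    rest=${spec#*:}            # everything after the name
    case "$rest" in
        *:*) gtype=${rest%%:*} ;;   # a type sits between name and count
        *)   gtype=any ;;           # no type requested
    esac
    echo "name=$name type=$gtype count=$count"
}

parse_gres gpu:2
parse_gres gpu:a100:4
```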
Slurm Job Lifecycle
- user writes job script
- submit with sbatch
- job enters queue
- scheduler selects nodes when eligible
- tasks launch via Slurm daemons
- accounting records usage and outcome
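The lifecycle above can be followed from the shell. On success `sbatch` prints `Submitted batch job <id>`, and that id is what `squeue` and `sacct` take. The sketch below only extracts the id, so it runs without a cluster; `parse_job_id` and the script name `job.sh` are placeholders.

```bash
# sbatch prints "Submitted batch job <id>" on success; keep only the id.
parse_job_id() {
    echo "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# On a real cluster (job.sh is a placeholder script name):
#   out=$(sbatch job.sh)
#   jobid=$(parse_job_id "$out")
#   squeue -j "$jobid"                                        # while queued or running
#   sacct -j "$jobid" --format=JobID,State,Elapsed,ExitCode   # after it ends
parse_job_id "Submitted batch job 12345"
```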
Common Job States
| State | Meaning |
|---|---|
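| PENDING | waiting for resources or priority |
| RUNNING | executing on allocated nodes |
| COMPLETED | finished with exit code 0 |
| FAILED | ended with a non-zero exit code |
| TIMEOUT | wall time limit exceeded |
| CANCELLED | cancelled by the user or an administrator |
| NODE_FAIL | interrupted by node failure |
| PREEMPTED | interrupted by preemption policy |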
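Automation around job states usually reduces to a small decision table. The sketch below is one possible policy, not a Slurm feature: `next_step` and its return labels are hypothetical, and the `CANCELLED*` pattern is there because `sacct` can report the state as `CANCELLED by <uid>`.

```bash
# Map a final job state to a suggested next step (illustrative policy).
next_step() {
    case "$1" in
        COMPLETED)            echo "done" ;;
        TIMEOUT)              echo "extend-time" ;;   # raise --time or checkpoint
        NODE_FAIL|PREEMPTED)  echo "requeue" ;;       # not the job's fault
        FAILED)               echo "debug" ;;         # inspect logs and exit code
        CANCELLED*)           echo "ask-owner" ;;
        *)                    echo "wait" ;;          # PENDING, RUNNING, or unknown
    esac
}

next_step TIMEOUT
next_step NODE_FAIL
```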
Why Slurm Jobs Stay Pending
- insufficient nodes
- fairshare penalty
- QoS or partition limits
- bad or overly large resource request
- fragmentation
- incompatible constraints
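Slurm reports why a job is pending as a reason code, visible with `squeue -j <jobid> -o "%i %T %r"`. The sketch below maps a few common codes to the causes listed above. `explain_reason` is a hypothetical helper, the wording is a paraphrase, and the list of codes is far from complete.

```bash
# Translate a squeue reason code (%r) into a plain-language cause.
explain_reason() {
    case "$1" in
        Resources)        echo "not enough free nodes yet" ;;
        Priority)         echo "higher-priority jobs are ahead" ;;
        QOS*|Assoc*)      echo "QoS or association limit reached" ;;
        Partition*Limit)  echo "request exceeds a partition limit" ;;
        BadConstraints)   echo "constraints cannot be satisfied" ;;
        ReqNodeNotAvail)  echo "a requested node is unavailable" ;;
        *)                echo "unrecognized reason" ;;
    esac
}

explain_reason Resources
explain_reason QOSMaxJobsPerUserLimit
explain_reason PartitionTimeLimit
```

`squeue --start -j <jobid>` additionally shows the scheduler's estimated start time, when one is available.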
Example Slurm Script
#!/bin/bash
#SBATCH --job-name=mpi-run
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --time=01:00:00
#SBATCH --partition=compute
module load openmpi
srun --mpi=pmix ./app
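A quick sanity check of what the script above requests: 4 nodes with 32 tasks each for one hour. The helpers below are hypothetical and assume one core per task, which holds only if `--cpus-per-task` is left at its default of 1.

```bash
# Convert HH:MM:SS to seconds with awk (avoids octal surprises like "08").
walltime_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Total MPI ranks and core-hours requested, assuming one core per task.
# Core-hours are truncated to a whole number by integer division.
job_size() {
    nodes=$1; tasks_per_node=$2; walltime=$3
    ranks=$((nodes * tasks_per_node))
    secs=$(walltime_seconds "$walltime")
    echo "ranks=$ranks core_hours=$((ranks * secs / 3600))"
}

job_size 4 32 01:00:00
```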
MPI
MPI (Message Passing Interface) is the standard message-passing programming model for distributed-memory parallel jobs.
Use MPI when:
- tasks communicate frequently
- the workload spans many nodes
- performance matters enough to justify explicit communication control
Core MPI Concepts
- rank: one MPI process
- communicator: group of ranks
- point-to-point communication: send/receive
- collectives: broadcast, reduce, all-reduce, gather, scatter
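Much of MPI programming is arithmetic on ranks. The sketch below shows two common pieces without needing an MPI installation: finding ring neighbours for point-to-point exchange, and counting the rounds of a binomial-tree broadcast. Both helpers are hypothetical, and real MPI libraries pick among several collective algorithms depending on message size and rank count.

```bash
# Left and right neighbours of a rank in a ring of `size` ranks.
ring_neighbors() {
    rank=$1; size=$2
    echo "left=$(( (rank - 1 + size) % size )) right=$(( (rank + 1) % size ))"
}

# Rounds needed by a binomial-tree broadcast over p ranks: ceil(log2(p)).
bcast_rounds() {
    p=$1; rounds=0; reached=1
    while [ "$reached" -lt "$p" ]; do
        reached=$((reached * 2))   # every rank holding the data forwards it once
        rounds=$((rounds + 1))
    done
    echo "$rounds"
}

ring_neighbors 0 4
bcast_rounds 128
```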
Why MPI Matters
Multi-node systems do not share memory. MPI makes communication explicit.
This matters because:
- communication cost is real
- synchronization cost is real
- process placement matters
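Communication cost is often estimated with the latency-bandwidth ("alpha-beta") model: each message costs a fixed latency plus a per-byte transfer time. The sketch below uses illustrative defaults of 2 microseconds latency and 0.001 microseconds per byte (about 1 GB/s); these are assumptions, not measurements of any real network, and `transfer_us` is a hypothetical helper.

```bash
# Alpha-beta model: time = count * (alpha + bytes * beta), in microseconds.
# alpha = per-message latency, beta = time per byte (defaults are illustrative).
transfer_us() {
    awk -v n="$1" -v bytes="$2" -v alpha="${3:-2}" -v beta="${4:-0.001}" \
        'BEGIN { printf "%.1f\n", n * (alpha + bytes * beta) }'
}

transfer_us 1000 8      # 1000 messages of 8 bytes each
transfer_us 1 8000      # the same 8000 bytes in one message
```

Under these assumptions the many small messages are dominated by latency, which is why aggregating them usually helps.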
Common MPI Bottlenecks
- excessive communication
- too many collectives
- load imbalance
- small-message overhead
- poor rank placement
- NUMA locality issues
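Load imbalance is easy to quantify once per-rank timings are available: every rank waits for the slowest one at the next synchronization point, so efficiency is roughly mean time divided by maximum time. A minimal sketch with invented timings; `balance_efficiency` is a hypothetical helper.

```bash
# Parallel efficiency under load imbalance: mean rank time / max rank time.
# Arguments are per-rank compute times in any consistent unit.
balance_efficiency() {
    echo "$@" | awk '{
        sum = 0; max = 0
        for (i = 1; i <= NF; i++) { sum += $i; if ($i > max) max = $i }
        printf "%.2f\n", (sum / NF) / max
    }'
}

balance_efficiency 10 10 10 10   # perfectly balanced
balance_efficiency 10 10 10 30   # one slow rank halves the efficiency
```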
Process Placement
A good launch often depends on:
- rank-to-core mapping
- rank-to-socket mapping
- rank-to-GPU mapping
- thread binding
Bad placement causes:
- remote memory access
- slower, more variable collectives
- CPU contention
- GPU locality problems
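Rank-to-node mapping can be reasoned about with simple arithmetic. The sketch below compares block placement, where consecutive ranks fill one node first, with cyclic placement, where ranks go round-robin across nodes. `node_of_rank` is a hypothetical helper and models only the node level, not sockets or cores.

```bash
# Node index for a rank under two common distributions.
# block: consecutive ranks share a node; cyclic: ranks round-robin over nodes.
node_of_rank() {
    policy=$1; rank=$2; nodes=$3; tasks_per_node=$4
    case "$policy" in
        block)  echo $((rank / tasks_per_node)) ;;
        cyclic) echo $((rank % nodes)) ;;
        *)      echo "unknown policy: $policy" >&2; return 1 ;;
    esac
}

node_of_rank block 5 4 32    # ranks 0..31 all sit on node 0
node_of_rank cyclic 5 4 32   # rank 5 lands on node 1
```

Block placement keeps neighbouring ranks on the same node, which suits nearest-neighbour exchange. In Slurm the choice is made with `srun --distribution=block` or `--distribution=cyclic`, and `--cpu-bind=cores` pins each task to cores.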
MPI + Slurm
In Slurm-managed clusters, srun is often preferred over mpirun because Slurm itself starts the tasks inside the job's allocation, so task binding, accounting, and signal delivery stay under the scheduler's control.
Typical pattern:
srun --mpi=pmix ./my_mpi_app
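A wrapper script sometimes has to run both inside and outside a Slurm job. The sketch below picks the launcher from `SLURM_JOB_ID`, which Slurm sets in the job environment. `launch_command` is a hypothetical helper, and the `mpirun -np 4` fallback is an assumption for a workstation test run.

```bash
# Choose a launch command: srun inside a Slurm allocation, mpirun otherwise.
# The mpirun fallback and its fixed rank count are illustrative assumptions.
launch_command() {
    app=$1
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "srun --mpi=pmix $app"
    else
        echo "mpirun -np 4 $app"
    fi
}

launch_command ./my_mpi_app
```

Whether `pmix` is available depends on how the site built Slurm; `srun --mpi=list` prints the supported plugin types.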
Interview Summary
Good HPC answers about Slurm and MPI usually cover:
- queue policy
- fairness
- placement
- communication bottlenecks
- workload fit