High Performance Computing (HPC) - Ground Up Deep Dive

Master Sections

What is HPC?
Why HPC Exists
HPC vs General Distributed Systems
Core Concepts You Must Know
HPC Workload Categories
Anatomy of an HPC Cluster
HPC Software Stack
Parallel Programming Models
Slurm Deep Dive
MPI Deep Dive
OpenMP Deep Dive
GPUs in HPC
Storage in HPC
Networking in HPC
Packaging and Environment Management
Cluster Provisioning and Operations
Checkpointing
Cloud HPC
AWS ParallelCluster Deep Dive
Other Important HPC Tools and Technologies
System Design for HPC
HPC Design Tradeoffs
Performance Tuning Checklist
Reliability and Multi-Tenancy
Practical Commands Cheat Sheet
Interview Questions and Answers
30-Minute Revision Sheet
Top 50 HPC Interview Questions
Model Answers for the Top 50 Questions
Mock Interview Prompts
Slurm vs PBS vs Kubernetes vs AWS Batch
Company-Style Interview Angles
STAR-Format Scenario Answers
Likely Follow-Up Questions by Topic
ASCII Architecture Diagrams
Final Mental Model

What is HPC?

High Performance Computing (HPC) is the practice of solving compute-intensive problems by using many CPUs, GPUs, memory systems, storage systems, and networked machines together as one coordinated system.

HPC is used when a single machine is too slow, too small, or too limited for the workload.

Common examples:

Weather simulation
Computational fluid dynamics (CFD)
Molecular dynamics
Genome analysis
Finite element analysis
Seismic processing
Monte Carlo simulation
Risk modeling
AI training and large-scale inference
Rendering and image processing

At a high level, HPC is about:

Parallelism: split work across many cores/nodes
Scale: run across tens, hundreds, or thousands of machines
Efficiency: maximize useful work per dollar, watt, and second
Coordination: schedule jobs, share cluster resources, manage failures

Why HPC Exists

1. Some problems are too large for one machine

Examples:

A simulation needs 10 TB of RAM
A training job needs 1,024 GPUs
A weather model must finish in 30 minutes, not 3 days

2. Some problems are embarrassingly parallel

These can be split into many independent tasks:

Parameter sweeps
Batch rendering
Monte Carlo runs
Backtesting
Genomics pipelines

3. Some problems require tightly coupled communication

These jobs need many processes exchanging data at fine granularity:

MPI-based fluid simulation
Distributed linear algebra
PDE solvers
Spectral solvers

This is where low-latency interconnects matter.

HPC vs General Distributed Systems

Dimension	HPC	General Distributed Systems
Primary goal	Maximum compute throughput / time-to-solution	Availability, elasticity, business transactions
Latency sensitivity	Often microseconds to milliseconds between ranks	Often milliseconds to seconds across services
Workload pattern	Batch jobs, simulations, tightly coupled tasks	Request/response, event-driven, online serving
Failure handling	Often restart job or checkpoint/restart	Retry, replication, graceful degradation
Network	RDMA / InfiniBand / EFA / high bandwidth fabrics	Ethernet is often enough
Storage	Parallel file systems, burst buffers, scratch	Databases, object stores, block stores
Scheduling	Queue-based fair sharing	Autoscaling, service orchestration
Consistency model	Numerical correctness and deterministic execution matter	Business correctness and durability matter

Key point:

A microservices architect optimizes for availability and independent deployability
An HPC architect optimizes for parallel efficiency and deterministic resource usage

Core Concepts You Must Know

1. Node

A node is one machine in the cluster.

Types:

Login node
Head node / scheduler node
Compute node
GPU node
Storage node
Visualization node

2. Core, CPU, Socket, NUMA

Core: execution unit
CPU/socket: physical processor package
Thread: hardware thread, often via SMT/Hyper-Threading
NUMA: non-uniform memory access; memory is physically closer to some CPU sockets than others

NUMA matters because poor memory locality can destroy performance.

Quick NUMA example

If a dual-socket node has:

Socket 0 with local memory bank A
Socket 1 with local memory bank B

and a process is scheduled on cores from socket 0 but reads memory allocated near socket 1, it pays a remote memory penalty. On modern nodes this can be a major source of hidden slowdown.

3. Rank

In MPI, a rank is a process in a communicator.

Example:

128 MPI ranks spread over 8 nodes
16 ranks per node

4. Thread

Within a process, a thread enables shared-memory parallelism.

Examples:

OpenMP threads
pthreads
Intel TBB threads

5. Job

A job is a unit submitted to the scheduler, asking for resources:

number of nodes
CPU cores
GPUs
memory
wall-clock time
queue/partition

Jobs can be:

batch jobs
interactive jobs
array jobs
reservation-backed jobs
preemptible jobs

6. Queue / Partition

A logical pool of resources with policies:

short jobs
long jobs
GPU jobs
debug jobs
large-memory jobs

7. Strong Scaling vs Weak Scaling

Strong scaling

Same total problem size, more processors.

Goal:

finish faster

Challenge:

communication overhead eventually dominates

Weak scaling

Problem size grows with processor count.

Goal:

keep runtime roughly constant as cluster grows

This is common for PDE and simulation workloads.

8. Speedup and Efficiency

Speedup

Speedup = T1 / Tp

T1: runtime on 1 processor
Tp: runtime on p processors

Parallel efficiency

Efficiency = Speedup / p

Example:

1 core runtime = 1000s
100 cores runtime = 20s
speedup = 50x
efficiency = 50 / 100 = 50%
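
The two formulas above can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and not part of any library.

```python
def speedup(t1: float, tp: float) -> float:
    """Speedup = T1 / Tp (runtime on 1 processor over runtime on p processors)."""
    return t1 / tp


def efficiency(t1: float, tp: float, p: int) -> float:
    """Parallel efficiency = Speedup / p."""
    return speedup(t1, tp) / p


if __name__ == "__main__":
    # Numbers from the example above: 1000 s on 1 core, 20 s on 100 cores.
    print("speedup:", speedup(1000, 20))
    print("efficiency:", efficiency(1000, 20, 100))
```

An efficiency well below 1.0 means a large share of the allocated cores is waiting on communication, serial sections, or load imbalance.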

9. Amdahl's Law

If part of a program is serial, that limits parallel speedup.

Speedup <= 1 / (S + (1-S)/N)

S: serial fraction
N: number of processors

Example:

if 10% is serial (S = 0.1), speedup is capped at 1 / 0.1 = 10x, even with infinite processors

Meaning:

eliminate serial bottlenecks first
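
A short sketch of the bound above, showing how quickly the curve flattens for a 10% serial fraction. The function name is illustrative.

```python
def amdahl_speedup(serial_fraction: float, n: int) -> float:
    """Upper bound on speedup with n processors: 1 / (S + (1 - S) / N)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be >= 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


if __name__ == "__main__":
    # With S = 0.10 the bound approaches, but never reaches, 10x.
    for n in (1, 10, 100, 1000, 1_000_000):
        print(f"N={n:>9}: speedup <= {amdahl_speedup(0.10, n):.2f}")
```

With S = 0.10, 100 processors give a bound of roughly 9.17x, so the last 90 processors add very little over the first 10.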

10. Gustafson's Law

Instead of fixing problem size, grow the problem as processors increase.

Meaning:

parallel systems are useful because we solve bigger problems, not only faster ones
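
Gustafson's Law is usually written as scaled speedup = S + (1 - S) * N. The sketch below uses that form. Note the assumption: S here is the serial fraction of the runtime measured on the N-processor run, which is not the same quantity as the S in Amdahl's Law. The function name is illustrative.

```python
def gustafson_speedup(serial_fraction: float, n: int) -> float:
    """Scaled speedup when the parallel part of the problem grows with n."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be >= 1")
    return serial_fraction + (1.0 - serial_fraction) * n


if __name__ == "__main__":
    # Same 10% serial fraction, but the problem grows with the machine.
    for n in (10, 100, 1000):
        print(f"N={n:>5}: scaled speedup = {gustafson_speedup(0.10, n):.1f}")
```

For S = 0.10 and N = 100 this gives about 90x, compared with a bound below 10x under Amdahl's fixed-size assumption.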

11. Throughput vs Time-to-Solution

Two HPC teams can optimize for different outcomes:

Time-to-solution: finish one big job as fast as possible
Throughput: finish the most total work per hour/day

Examples:

weather forecast before a deadline is time-to-solution
nightly Monte Carlo batch is throughput

This distinction affects architecture:

time-to-solution pushes toward premium network/storage
throughput pushes toward lower-cost capacity and high utilization

HPC Workload Categories

1. Embarrassingly Parallel

Minimal communication between tasks.

Examples:

Monte Carlo trials
parameter sweeps
image rendering
independent ETL chunks

Best tools:

Slurm job arrays
AWS Batch
Kubernetes batch
Ray/Dask for some cases

2. Tightly Coupled

Tasks communicate frequently.

Examples:

CFD
climate models
molecular dynamics
linear algebra solvers

Best tools:

MPI
high-speed interconnect
parallel file system

3. Hybrid

MPI across nodes + threads/GPUs within nodes.

Examples:

MPI + OpenMP
MPI + CUDA
MPI + NCCL

This is the most common modern HPC pattern.

Anatomy of an HPC Cluster

Users
  |
  v
Login Nodes
  |
  v
Head / Control Plane
  |- Scheduler (Slurm)
  |- Accounting
  |- Monitoring
  |- Image/config management
  |
  v
High-Speed Network Fabric
  |
  +--> CPU Compute Nodes
  +--> GPU Compute Nodes
  +--> Large-memory Nodes
  +--> Storage Nodes
  |
  v
Shared Storage
  |- Home
  |- Scratch
  |- Project
  |- Archive/Object Store

Components

1. Login nodes

Users SSH here to:

edit code
compile binaries
submit jobs
inspect results

Do not run heavy compute on login nodes.

2. Head node / control plane

Runs cluster management services:

scheduler controller
accounting database
configuration services
monitoring
identity integration

This is critical infrastructure and must be protected carefully.

3. Compute nodes

Where jobs actually run.

Types:

standard CPU nodes
GPU nodes
memory-optimized nodes
high-frequency nodes

4. Network fabric

Critical for distributed jobs.

Options:

Ethernet (10/25/40/100/200/400 Gbps)
InfiniBand
AWS EFA (Elastic Fabric Adapter)

Important metrics:

latency
bandwidth
message rate
RDMA support

5. Storage

Usually split by usage:

Home: user directories, smaller, backed up
Scratch: high-performance temporary working area
Project: team-shared persistent data
Archive: low-cost cold storage

Options:

NFS
Lustre
BeeGFS
GPFS / IBM Spectrum Scale
FSx for Lustre
object storage like S3

6. Management and identity plane

Most real clusters also need:

LDAP/AD/SSO integration
centralized SSH key management
IAM or cloud-role integration
audit logging
quota enforcement
image/version management

Without this, the cluster may run jobs but it does not operate cleanly as a shared platform.

7. Scheduler database and accounting plane

A production HPC cluster usually tracks:

who ran what
how many core-hours/GPU-hours were consumed
job exit codes
node allocations
historical queue delays

This is needed for:

fairness
cost chargeback
capacity planning
debugging usage disputes

HPC Software Stack

Layered View

Layer	Typical Tools
Application	GROMACS, OpenFOAM, VASP, WRF, LAMMPS, TensorFlow, PyTorch
Libraries	BLAS, LAPACK, ScaLAPACK, FFTW, PETSc, HDF5, NCCL
Parallel runtime	MPI, OpenMP, CUDA, ROCm
Scheduler	Slurm, PBS Pro, LSF, Grid Engine
Packaging / env	Spack, EasyBuild, Environment Modules, Conda
Containers	Apptainer/Singularity, Docker in limited cases
OS / provisioning	Rocky Linux, Ubuntu, custom AMIs, image builders
Infra	bare metal, cloud VMs, high-speed fabric, file systems

Compilers and Math Libraries

In HPC, the compiler and library stack can materially change runtime.

Common compilers

GCC
Clang/LLVM
Intel oneAPI compilers
NVIDIA HPC SDK compilers
AMD AOCC in some environments

Why compilers matter

vectorization quality
OpenMP implementation quality
architecture-specific code generation
ABI compatibility with MPI and math libraries

Common math libraries

OpenBLAS
Intel MKL
BLIS
LAPACK / ScaLAPACK
FFTW
cuBLAS / cuFFT

Interview-level rule

If the application is math-heavy, do not assume the default compiler and default BLAS are acceptable. Tuned libraries often produce large performance differences with zero algorithmic changes.

Parallel Programming Models

1. Shared Memory Parallelism

Multiple threads access the same memory in one node.

Tools:

OpenMP
pthreads
TBB

Pros:

easier communication
low overhead inside one machine

Cons:

limited to one node's memory space
NUMA effects matter

2. Distributed Memory Parallelism

Each process has its own address space; communication happens via messages.

Tool:

MPI

Pros:

scales across many nodes
explicit and predictable

Cons:

harder programming model
communication overhead is visible and real

3. Accelerator Programming

Use GPUs or other accelerators.

Tools:

CUDA
ROCm/HIP
OpenACC
SYCL
NCCL for multi-GPU collective communication

4. Hybrid Programming

Most modern HPC codes mix models:

MPI between nodes
OpenMP threads within node
CUDA on GPUs

Example:

8 nodes
4 GPUs/node
1 MPI rank per GPU
NCCL for all-reduce
OpenMP for CPU-side preprocessing

Slurm Deep Dive

Slurm originally stood for Simple Linux Utility for Resource Management; today it is simply called the Slurm Workload Manager.

It is one of the most common job schedulers in HPC.

Slurm does two big jobs:

resource manager: knows which nodes/resources exist
job scheduler: decides when/where jobs run

Main Slurm Components

Component	Role
`slurmctld`	Central controller / scheduler
`slurmd`	Agent on each compute node
`slurmdbd`	Accounting database daemon
`sacct`	Job accounting query tool
`sinfo`	Cluster/partition status
`squeue`	View queued/running jobs
`sbatch`	Submit batch jobs
`srun`	Launch parallel tasks
`scancel`	Cancel jobs
`scontrol`	Inspect/control jobs and nodes

How Slurm Works

Job lifecycle

User writes a job script
User submits with sbatch
Slurm validates request
Job enters pending queue
Scheduler finds eligible nodes
Resources are allocated
slurmd on target nodes launches tasks
Job runs
Accounting and logs are recorded
Resources are released

Example Slurm job script

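#!/bin/bash
#SBATCH --job-name=mpi-test
# 4 nodes x 32 tasks per node = 128 MPI ranks
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
# Wall-clock limit (HH:MM:SS); the job is killed when it is exceeded
#SBATCH --time=01:00:00
#SBATCH --partition=compute
# %x = job name, %j = job ID; the logs/ directory must already exist
#SBATCH --output=logs/%x-%j.out

# Load the MPI stack the binary was built against
module load openmpi

# srun launches one task per slot in the allocation requested above
srun ./my_mpi_app input.dat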

Important Slurm concepts

Partition

A logical queue or node group.

Examples:

debug
cpu
gpu
long
highmem

QoS (Quality of Service)

Policy layer controlling:

priorities
maximum wall time
preemption
job size limits

Fairshare

Prevents one team/user from monopolizing cluster resources.

Common policy:

users with recent heavy usage get lower priority
users with low recent usage get boosted

Backfilling

The scheduler lets shorter jobs run in idle gaps ahead of a larger job's reserved start time, provided they will not delay that job.

This increases utilization significantly.
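
The core backfill test can be sketched as a toy model. This is a deliberate simplification and not Slurm's actual algorithm: it assumes a single pool of identical nodes and one upcoming reservation, and the `Job` class and `can_backfill` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int
    walltime_min: int  # requested wall time in minutes


def can_backfill(job: Job, free_nodes: int, minutes_until_reservation: int) -> bool:
    """Toy rule: the job must fit in the idle nodes and its requested
    wall time must end before the large reserved job is due to start."""
    fits_in_space = job.nodes <= free_nodes
    fits_in_time = job.walltime_min <= minutes_until_reservation
    return fits_in_space and fits_in_time


if __name__ == "__main__":
    # 8 idle nodes, large job reserved to start in 60 minutes.
    for job in (Job("short", 4, 30), Job("long", 4, 120), Job("wide", 16, 10)):
        print(job.name, can_backfill(job, free_nodes=8, minutes_until_reservation=60))
```

Because the decision uses the requested wall time rather than the actual runtime, jobs that overestimate their wall time are less likely to fit into a gap.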

Job arrays

For many similar tasks:

#!/bin/bash
#SBATCH --array=1-1000
#SBATCH --time=00:10:00

python simulate.py --seed ${SLURM_ARRAY_TASK_ID}

Best for:

parameter sweeps
Monte Carlo
rendering batches
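
A possible shape for the `simulate.py` script used in the array job above. The source does not show that script, so everything here is a hypothetical stand-in: the Monte Carlo estimate of pi is only a placeholder workload, and the `--samples` flag is an assumption. The default seed falls back to the `SLURM_ARRAY_TASK_ID` environment variable so each array task gets a distinct, reproducible random stream.

```python
import argparse
import os
import random


def estimate_pi(seed: int, samples: int = 100_000) -> float:
    """Monte Carlo estimate of pi using a private, seeded generator."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples


def main(argv=None) -> None:
    parser = argparse.ArgumentParser(description="Placeholder array-task workload")
    # Fall back to the array index when --seed is not given explicitly.
    default_seed = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    parser.add_argument("--seed", type=int, default=default_seed)
    parser.add_argument("--samples", type=int, default=100_000)
    args = parser.parse_args(argv)
    print(f"seed={args.seed} pi_estimate={estimate_pi(args.seed, args.samples):.5f}")


if __name__ == "__main__":
    main()
```

Seeding from the array index keeps every task independent and makes any single task easy to rerun on its own.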

Node features and constraints

Match jobs to hardware:

--constraint=gpu
--constraint=avx512
--gres=gpu:4

GRES

Generic RESources like:

GPUs
local SSDs
licenses

Reservations

Used for:

maintenance windows
reserved workshops/classes
priority project windows
guaranteed time for deadlines

Accounting associations

Slurm can map usage to:

user
account/project
cluster
partition

This is important for internal billing and fairshare policy.

Common Slurm Job States

State	Meaning
`PENDING`	waiting for resources or policy eligibility
`RUNNING`	currently executing
`COMPLETED`	finished successfully
`FAILED`	exited with failure
`CANCELLED`	manually/system cancelled
`TIMEOUT`	exceeded wall time
`NODE_FAIL`	node failure interrupted the job
`PREEMPTED`	higher-priority policy interrupted the job

When debugging user complaints, job state history matters as much as the live queue.

Why Jobs Stay Pending

Common reasons:

not enough free nodes
fairshare priority too low
partition/QoS limits
reservation blocking
requested features do not exist together
memory/GPU request too large for available nodes
job asks for more nodes than current fragmentation allows

Typical inspection commands:

squeue -j <jobid>
scontrol show job <jobid>
sprio -j <jobid>

Interactive vs Batch in Slurm

Interactive

Useful for:

debugging
exploratory testing
short validation runs

Example:

srun --pty -N 1 -n 4 --time=00:30:00 bash

Batch

Preferred for:

repeatability
long jobs
production runs
auditable workflows

Slurm Scheduling Design Tradeoffs

Design choice	Benefit	Cost
Aggressive backfill	Higher utilization	More scheduler complexity
Strict fairshare	Better fairness	Large jobs may wait longer
Many partitions	Better policy isolation	Admin complexity
Preemption	Urgent jobs start quickly	Disrupts running jobs; requires checkpoint/restart
Long wall times	Supports large simulations	Lowers scheduler flexibility

Slurm Failure Modes

Controller failure
Node drains due to health checks
Jobs request impossible resources
Users overestimate wall time
Shared filesystem bottlenecks slow startup
MPI jobs fail due to one bad node

Slurm Best Practices

Separate debug, prod, gpu, and long partitions
Enable accounting and fairshare
Use node health checks
Encourage realistic wall times
Use job arrays for independent workloads
Avoid huge scheduler bursts from millions of tiny jobs
Use prolog/epilog scripts carefully

MPI Deep Dive

MPI = Message Passing Interface.

MPI is the dominant programming model for tightly coupled distributed-memory HPC applications.

Important distinction:

MPI is a standard/API
implementations include Open MPI, MPICH, Intel MPI, MVAPICH

Why MPI Exists

Multiple nodes do not share memory. If process A on node 1 needs data from process B on node 2, it must send/receive messages.

MPI gives explicit control over this.

MPI Core Concepts

1. Rank

Each process has a unique ID inside a communicator.

2. Communicator

A communication group, commonly MPI_COMM_WORLD.

3. Point-to-point communication

MPI_Send
MPI_Recv
non-blocking: MPI_Isend, MPI_Irecv

4. Collective communication

Operations involving groups:

MPI_Bcast
MPI_Reduce
MPI_Allreduce
MPI_Scatter
MPI_Gather
MPI_Barrier

5. Synchronization

Important because communication can block and ordering matters.

6. Decomposition

How you split the problem:

domain decomposition
data decomposition
functional decomposition
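
As an illustration of data decomposition, the sketch below splits a 1D index range into contiguous blocks, one per rank, spreading any remainder over the lowest ranks. It is plain Python with no MPI calls; `block_range` is an illustrative name.

```python
def block_range(n_items: int, n_ranks: int, rank: int):
    """Return the half-open [start, stop) index range owned by `rank`."""
    if n_ranks < 1 or not 0 <= rank < n_ranks:
        raise ValueError("need n_ranks >= 1 and 0 <= rank < n_ranks")
    base, extra = divmod(n_items, n_ranks)
    # The first `extra` ranks each take one additional item.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


if __name__ == "__main__":
    # 10 grid cells over 4 ranks: block sizes 3, 3, 2, 2.
    for r in range(4):
        print(r, block_range(10, 4, r))
```

Keeping block sizes within one item of each other is the simplest way to avoid the load imbalance described later in this section.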

Minimal MPI example

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}

Run:

mpicc hello.c -o hello
mpirun -np 8 ./hello

Blocking vs Non-blocking

Blocking

Simpler but can stall.

Non-blocking

Allows overlap of communication and computation:

MPI_Isend(..., &req1);
MPI_Irecv(..., &req2);
do_local_compute();
MPI_Wait(&req1, MPI_STATUS_IGNORE);
MPI_Wait(&req2, MPI_STATUS_IGNORE);

This is a major optimization technique.

MPI Communication Patterns

Halo exchange

Common in grid/mesh simulations:

each rank exchanges boundary cells with neighbors
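
The data movement in a 1D halo exchange can be simulated in a single process. This sketch is not MPI code: each "rank" is a Python list of the form [left ghost, interior cells..., right ghost], and `exchange_halos` is a hypothetical helper. In a real code each copy below would be a send/receive pair between neighboring ranks.

```python
def exchange_halos(chunks):
    """Fill each chunk's ghost cells from its neighbors' boundary cells.

    Only interior cells are read and only ghost cells are written,
    so updating the chunks in place is safe.
    """
    n = len(chunks)
    for r in range(n):
        if r > 0:
            # Left ghost <- left neighbor's last interior cell.
            chunks[r][0] = chunks[r - 1][-2]
        if r < n - 1:
            # Right ghost <- right neighbor's first interior cell.
            chunks[r][-1] = chunks[r + 1][1]
    return chunks


if __name__ == "__main__":
    # Three ranks, two interior cells each, ghost cells initialised to 0.
    chunks = [[0, 1, 2, 0], [0, 3, 4, 0], [0, 5, 6, 0]]
    print(exchange_halos(chunks))
```

The ghost cells on the outer edges of the first and last rank stay untouched here; a real solver would fill them from boundary conditions.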

Reduction

Common for:

summing residuals
computing norms
loss aggregation

Broadcast

Used to distribute:

input parameters
model weights
configuration

All-to-all

Very expensive but sometimes needed:

FFT transposes
repartitioning

MPI Performance Factors

1. Latency

Time to send a small message.

Important for:

many tiny messages
synchronization-heavy algorithms

2. Bandwidth

Rate of large data transfer.

Important for:

large tensor or matrix transfers
checkpoint distribution

3. Message size

Many tiny messages are often worse than fewer larger messages.
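
A common first-order cost model makes this concrete: each message costs a fixed latency plus its size divided by bandwidth. The sketch below applies that model to 1 MiB sent as one message versus 1,024 small ones. The 2 microsecond latency and 10 GB/s bandwidth are assumed, illustrative figures, not measurements of any particular fabric.

```python
def transfer_time(n_messages: int, bytes_per_message: int,
                  latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Total time under the model: n * (latency + size / bandwidth)."""
    return n_messages * (latency_s + bytes_per_message / bandwidth_bytes_per_s)


LATENCY_S = 2e-6     # assumed per-message latency
BANDWIDTH = 10e9     # assumed bandwidth in bytes per second
TOTAL_BYTES = 1 << 20  # 1 MiB payload

many_small = transfer_time(1024, TOTAL_BYTES // 1024, LATENCY_S, BANDWIDTH)
one_large = transfer_time(1, TOTAL_BYTES, LATENCY_S, BANDWIDTH)

if __name__ == "__main__":
    print(f"1024 x 1 KiB: {many_small * 1e3:.3f} ms")
    print(f"   1 x 1 MiB: {one_large * 1e3:.3f} ms")
```

Under these assumptions the small-message version takes about 2.15 ms against about 0.11 ms, because the per-message latency is paid 1,024 times.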

4. Load balance

If one rank is slow, others wait.

5. Topology awareness

Mapping ranks to sockets/nodes matters.

6. Memory locality

NUMA misplacement can slow ranks drastically.

7. Process placement and binding

Performance often depends on:

rank-to-core mapping
rank-to-socket mapping
rank-to-GPU mapping
thread affinity

If placement is wrong:

ranks may fight for the same cores
GPU jobs may use the wrong PCIe path
remote memory access increases
collectives become imbalanced

MPI Placement Mental Model

For a node with:

2 CPU sockets
64 cores total
4 GPUs

A common design is:

4 MPI ranks per node
1 rank per GPU
each rank bound to CPU cores closest to that GPU

This reduces PCIe/NVLink cross-traffic and improves locality.
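
The mapping for this example node can be sketched as a small calculation. It rests on strong assumptions that must be checked on real hardware: cores are numbered contiguously, and GPU i is attached closest to the i-th block of cores. The function name and the return layout are illustrative.

```python
def rank_placement(local_rank: int, cores_per_node: int = 64, gpus_per_node: int = 4):
    """Return (gpu_id, first_core, last_core) for one rank per GPU.

    Assumes GPU i is nearest to the i-th contiguous block of cores;
    verify the real topology before relying on this.
    """
    if cores_per_node % gpus_per_node != 0:
        raise ValueError("cores_per_node must divide evenly across GPUs")
    if not 0 <= local_rank < gpus_per_node:
        raise ValueError("local_rank out of range")
    cores_per_rank = cores_per_node // gpus_per_node
    first_core = local_rank * cores_per_rank
    return local_rank, first_core, first_core + cores_per_rank - 1


if __name__ == "__main__":
    for local_rank in range(4):
        print(rank_placement(local_rank))
```

On the node above each rank gets 16 cores, so ranks 0 and 1 stay on the first socket and ranks 2 and 3 on the second, under the stated numbering assumption.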

MPI Collectives Matter More Than Many Engineers Expect

At scale, collectives like MPI_Allreduce can dominate runtime.

This matters for:

iterative solvers
distributed training
convergence checks
global statistics

System design implication:

selecting a good network fabric and MPI implementation is not an optimization detail; it can determine whether the workload scales at all.

MPI Common Problems

Deadlock

Example:

rank 0 blocks in a receive, waiting for rank 1 to send
rank 1 blocks in a receive, waiting for rank 0 to send, so neither rank ever reaches its send

Fix:

use matching send/recv ordering
use non-blocking calls
use MPI_Sendrecv

Load imbalance

One rank gets more work.

Fix:

better domain decomposition
dynamic work distribution when possible

Communication overhead

Too much time spent messaging.

Fix:

aggregate messages
reduce synchronization
overlap compute and communication

Poor process placement

Ranks placed badly across sockets/nodes.

Fix:

CPU binding
topology-aware placement
one rank per NUMA domain when appropriate

MPI and Slurm Together

Typical launch pattern:

srun --mpi=pmix ./my_mpi_app

or:

mpirun ./my_mpi_app

In managed clusters, srun integration is often preferred because Slurm already owns the allocation.

When to Use MPI

Use MPI when:

tasks need frequent communication
workload spans many nodes
performance matters more than development simplicity
deterministic control is required

Do not default to MPI when:

tasks are independent
a workflow engine or job array is enough
communication is loose and coarse-grained

OpenMP Deep Dive

OpenMP is a directive-based shared-memory parallel programming model.

Example:

#pragma omp parallel for
for (int i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
}

Use OpenMP when:

work fits on one node
you want simpler shared-memory parallelism
you want to complement MPI

Typical pattern:

MPI across nodes
OpenMP within node

Benefits:

fewer MPI ranks
better memory sharing within node
less inter-node communication

Risks:

oversubscription
false sharing
thread imbalance
NUMA issues
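
Oversubscription in a hybrid MPI + OpenMP job comes down to simple arithmetic: ranks per node times threads per rank must not exceed the cores on the node. A minimal sketch, with illustrative function names:

```python
def threads_per_rank(cores_per_node: int, ranks_per_node: int) -> int:
    """Largest thread count per rank that does not oversubscribe the node."""
    if not 1 <= ranks_per_node <= cores_per_node:
        raise ValueError("need 1 <= ranks_per_node <= cores_per_node")
    return cores_per_node // ranks_per_node


def is_oversubscribed(cores_per_node: int, ranks_per_node: int, omp_threads: int) -> bool:
    """True when the job would run more threads than there are cores."""
    return ranks_per_node * omp_threads > cores_per_node


if __name__ == "__main__":
    # 64-core node, 4 ranks per node: 16 threads per rank fills the node exactly.
    print(threads_per_rank(64, 4))
    print(is_oversubscribed(64, 8, 16))
```

The result of `threads_per_rank` is the value that would normally go into OMP_NUM_THREADS for that layout.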

OpenMP Environment Variables Worth Knowing

export OMP_NUM_THREADS=16
export OMP_PROC_BIND=close
export OMP_PLACES=cores

These influence:

thread count
binding behavior
locality

Poor defaults can produce noisy or misleading benchmark results.

GPUs in HPC

GPUs can greatly increase throughput for highly parallel, compute-dense workloads.

Best for:

matrix operations
stencil operations
deep learning
molecular dynamics
CFD kernels

Not always best for:

branchy code
tiny workloads
memory-latency dominated irregular tasks

GPU HPC Stack

Layer	Tools
Programming	CUDA, HIP, SYCL, OpenACC
Multi-GPU	NCCL
Distributed training	MPI + NCCL, Horovod, PyTorch DDP
Schedulers	Slurm with GRES
Monitoring	`nvidia-smi`, DCGM

GPU Scheduling Example with Slurm

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=8
#SBATCH --time=04:00:00

module load cuda openmpi

srun ./gpu_mpi_app

GPU Design Considerations

PCIe vs NVLink
GPU memory capacity
GPU-to-GPU topology
data transfer overhead
one process per GPU vs multi-threaded process
storage throughput for data feeding

GPU Cluster Anti-Patterns

putting data-intensive training on weak shared storage
requesting GPUs without enough CPUs per GPU
ignoring GPU locality and NUMA affinity
mixing debug jobs with expensive production GPU partitions
checkpointing all ranks simultaneously to the same storage target

Storage in HPC

Storage is often the hidden bottleneck.

Storage Types

1. Home storage

persistent
smaller
backed up
not designed for large scratch I/O

2. Scratch storage

fast
temporary
high-throughput
frequently purged

3. Parallel file system

Used when many nodes read/write together.

Examples:

Lustre
BeeGFS
GPFS

4. Object storage

Examples:

S3

Great for:

datasets
archives
checkpoints at coarse granularity
workflow staging

Not ideal for:

POSIX-heavy metadata-intensive random access

Storage Performance Metrics

throughput
IOPS
metadata ops/sec
file create/delete rate
small-file performance
read/write concurrency

Common HPC Storage Anti-Patterns

millions of tiny files in one directory
checkpointing every rank independently to shared metadata server
using home directory for large scratch data
staging huge jobs directly from object storage without caching

I/O Patterns You Should Recognize

Large sequential reads/writes

Common in:

checkpoint files
large simulation dumps
model shard writes

Needs:

high throughput

Metadata-heavy workloads

Common in:

millions of tiny file creates
workflow engines with many task artifacts
genomics pipelines with file-per-step patterns

Needs:

strong metadata performance

Mixed random access

Common in:

analytics and preprocessing
sparse scientific datasets

Needs:

careful file format and caching choices

Storage Best Practices

use scratch for temporary active data
aggregate small outputs into larger files
use HDF5/NetCDF/Parquet where appropriate
stagger checkpoints
separate metadata-heavy and throughput-heavy workloads

Networking in HPC

Network fabric is often the difference between "works" and "scales".

Important Metrics

latency
bandwidth
bisection bandwidth
packet rate
jitter
collectives performance
RDMA support

Common Fabrics

Ethernet

Good enough for:

loosely coupled workloads
storage-heavy pipelines
job arrays

InfiniBand

Best for:

low-latency tightly coupled MPI
high message rate
RDMA

AWS EFA

Cloud network interface designed for HPC and ML workloads.

Benefits:

lower latency than standard ENA
OS-bypass style capabilities
better MPI/NCCL performance in AWS

Network Design Principles

keep tightly coupled jobs within the same placement group / fabric domain
minimize cross-rack penalties where possible
match communication pattern to topology
use topology-aware scheduler placement when available

RDMA in Plain Language

RDMA allows one machine to access memory on another machine with much lower CPU overhead than traditional TCP-based networking.

Why it matters:

lower latency
reduced CPU overhead
better bandwidth utilization
improved MPI and collective performance

For interview answers, the important point is not protocol detail. It is that HPC networks try to minimize the software overhead of communication because communication is often on the critical path.

Packaging and Environment Management

HPC environments become unmanageable quickly without standards.

Common tools

Environment Modules

Users load compiler/library stacks:

module load gcc/13 openmpi/4.1 hdf5/1.14

Spack

Package manager for HPC software stacks.

Benefits:

compiler variants
dependency trees
reproducible builds
multiple toolchains

EasyBuild

Another common HPC software build and deployment framework.

Conda

Useful in data science environments, but can conflict with optimized MPI/compiler stacks if used carelessly.

Containers

Apptainer / Singularity

Most common HPC container solution.

Why not Docker directly on multi-user HPC?

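privilege model concerns: by default the Docker daemon runs as root, so a user who can talk to it effectively has root on the node
admin/security issues on shared, multi-tenant systems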

Benefits of Apptainer:

reproducible environments
easier user-space packaging
works better in multi-user shared clusters

Use cases:

package research code + dependencies
portable software stack across clusters
isolate Python/R environments

Reproducible Build Strategy

A mature HPC platform usually standardizes one of these:

Option 1: Modules + central builds

Best for:

shared institutional clusters
curated production software stacks

Option 2: Spack environments

Best for:

reproducible compiler/library combinations
advanced scientific software trees

Option 3: Apptainer containers

Best for:

user portability
dependency isolation
mixed language stacks

In practice, large platforms often use all three:

modules to expose tools
Spack to build them
Apptainer for application portability

Cluster Provisioning and Operations

Provisioning includes:

node image creation
OS configuration
scheduler install
network tuning
storage mounts
user identity integration
monitoring/alerting

Operational Concerns

1. Health checks

Drain nodes automatically for:

bad GPUs
failed NICs
filesystem issues
ECC errors

2. Observability

Track:

CPU utilization
memory usage
GPU utilization
job wait time
queue depth
filesystem throughput
node failure rate
scheduler latency

Typical tools:

Prometheus
Grafana
CloudWatch
DCGM
Slurm accounting

3. Capacity planning

Questions:

Are jobs waiting on CPUs or GPUs?
Is storage saturated?
Is queue delay due to policy or lack of nodes?
Are large jobs starved by fragmentation?

4. Security

Must cover:

SSH access
least privilege
user isolation
secrets management
software provenance
data governance

5. Change management

Clusters break easily when:

images drift
drivers change without validation
MPI/compiler ABI mismatches appear
bootstrap scripts are edited ad hoc

Good practice:

maintain staging and production clusters
certify software stacks before promotion
version control infrastructure and bootstrap code
test representative workloads after changes

6. User support and documentation

Operationally successful HPC platforms usually include:

example job scripts
queue selection guidance
software stack documentation
quota and storage documentation
onboarding for MPI/GPU best practices

Many perceived infrastructure issues are actually poor user enablement.

Checkpointing

Long-running HPC jobs fail eventually.

Checkpointing saves application state so a job can resume later.

Why checkpoint?

node failure
scheduler wall-time limits
spot/preemptible interruption
software updates
cost optimization

Checkpoint tradeoff

Checkpoint too often:

waste I/O bandwidth

Checkpoint too rarely:

lose too much work on failure
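
One widely used rule of thumb for balancing these two costs is Young's approximation, which is not part of the notes above: interval ≈ sqrt(2 × checkpoint cost × mean time between failures). It assumes the checkpoint cost is small compared with the MTBF, and both inputs below are assumed, illustrative values. The function name is hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order estimate of the checkpoint interval in seconds."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("both inputs must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


if __name__ == "__main__":
    # Assumed: a checkpoint takes 5 minutes, the job sees a failure once a day.
    interval = young_interval(checkpoint_cost_s=300, mtbf_s=24 * 3600)
    print(f"checkpoint roughly every {interval / 3600:.1f} hours")
```

With a 5-minute checkpoint and a 24-hour MTBF this suggests checkpointing about every 2 hours. Treat it as a starting point to validate against measured failure rates and I/O cost.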

Common strategies

application-level checkpoints
framework-level checkpoints
coordinated checkpoints
asynchronous checkpoints

For large GPU/ML jobs, checkpoint design is often a first-class architecture decision.

Cloud HPC

Cloud HPC lets you build clusters on demand rather than owning a static supercomputer.

Benefits:

elasticity
faster experimentation
access to GPU/CPU variants
global regions
no hardware procurement lead time

Tradeoffs:

network may be weaker than elite on-prem supercomputers
costs can explode without controls
filesystem and data movement need careful design
bare-metal tuning options may be limited

When Cloud HPC Works Well

bursty workloads
parameter sweeps
project-based simulation
training jobs with fluctuating demand
teams without dedicated HPC ops staff

When On-Prem May Win

stable high utilization 24/7
ultra-low-latency tightly coupled workloads
strict data locality/regulatory needs
already-optimized large capital infrastructure

AWS ParallelCluster Deep Dive

AWS ParallelCluster is an AWS-supported open source cluster orchestration tool for deploying and managing HPC clusters on AWS.

It automates:

cluster creation
Slurm integration
networking setup
shared storage integration
compute fleet scaling
custom AMIs/bootstrap hooks

Think of it as:

infrastructure automation for AWS HPC
opinionated cluster deployment tooling

Typical AWS ParallelCluster Architecture

Users
  |
  v
Login Node / Remote Desktop
  |
  v
Head Node
  |- Slurm controller
  |- Cluster config
  |- Shared mounts
  |
  v
Compute Fleet
  |- CPU queues
  |- GPU queues
  |- Spot queues
  |- On-demand queues
  |
  v
Storage
  |- FSx for Lustre
  |- EBS
  |- EFS
  |- S3
  |
  v
Network
  |- VPC
  |- Subnets
  |- Security Groups
  |- Placement groups
  |- EFA

Key ParallelCluster Components

1. Head node

Runs:

Slurm controller
shared config
cluster management hooks

2. Compute fleet

Can scale dynamically based on queued jobs.

Options:

on-demand
spot
multiple instance types
multiple queues

3. Shared storage integrations

Common patterns:

FSx for Lustre for high-performance POSIX workloads
EFS for lighter shared home directories
EBS for node-local or head-node storage
S3 for input/output staging and archive

4. Networking

Often uses:

placement groups for cluster locality
EFA for low-latency MPI/NCCL

Why Use ParallelCluster

faster HPC cluster deployment
standard AWS integration
supports Slurm
good for reproducible cloud HPC environments
easier than manually stitching together EC2, EFA, FSx, IAM, and Slurm

AWS ParallelCluster Design Decisions

Storage mapping

Need	AWS choice
Shared high-throughput scratch	FSx for Lustre
Cheap persistent datasets	S3
Shared home directories	EFS or small FSx/EBS-backed design
Node-local temporary work	NVMe instance store / local SSD

Capacity mapping

Need	AWS choice
Tightly coupled MPI	EFA-enabled instances + placement groups
Cheap burst capacity	Spot instances for fault-tolerant jobs
Stable production jobs	On-demand or reservations
GPU training	P/G family GPU instances depending on generation

Scheduling mapping

Common queue split:

cpu-ondemand
cpu-spot
gpu-ondemand
gpu-spot
debug

ParallelCluster Example Design by Workload

Pattern A: Tightly coupled MPI simulation

Choose:

EFA-enabled instance types
placement groups
FSx for Lustre scratch
on-demand capacity first
Slurm queue with larger node counts

Avoid:

fragmented heterogeneous instance types in the same queue
spot unless checkpoint/restart is solid

Pattern B: Embarrassingly parallel batch

Choose:

mixed instance types
spot-heavy fleet
S3 for durable storage
smaller per-job local scratch

Avoid:

paying for EFA or premium network without measured benefit

Pattern C: GPU training cluster

Choose:

homogeneous GPU generation per queue
topology-aware placement
dataset staging close to nodes
checkpoint export to S3

Avoid:

mixing incompatible GPU memory sizes in the same production queue
relying on weak shared storage for hot datasets

ParallelCluster Example Configuration Shape

High-level concepts in config:

region
image / AMI
head node instance type
scheduler = Slurm
one or more Slurm queues
networking
shared storage mounts
custom actions / bootstrap scripts

Example concepts in config

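# Illustrative fragment only: required fields such as subnets and SSH keys are omitted.
Region: us-east-1
Image:
  Os: ubuntu2204
HeadNode:
  InstanceType: c7i.2xlarge
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: cpu
      ComputeResources:
        - Name: cputier
          InstanceType: c7i.8xlarge
          MinCount: 0
          MaxCount: 100
    - Name: gpu
      Networking:
        PlacementGroup:
          Enabled: true
      ComputeResources:
        - Name: gputier
          InstanceType: p5.48xlarge
          MinCount: 0
          MaxCount: 16
          # In the ParallelCluster 3 schema, EFA is set per compute resource.
          Efa:
            Enabled: true
SharedStorage:
  - Name: scratch
    StorageType: FsxLustre
    MountDir: /scratch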

The exact schema evolves by version, but the architectural idea is stable:

define head node
define scheduler
define one or more Slurm queues
attach storage
enable EFA only where justified

ParallelCluster Operational Risks

head node becomes single point of control-plane failure
FSx throughput undersized for checkpoint bursts
spot interruptions break tightly coupled jobs
wrong subnet/placement setup hurts EFA performance
user bootstrap scripts create non-reproducible node state

Best Practices for ParallelCluster

isolate queues by workload and pricing model
use EFA only where workload benefits
keep large static datasets in S3 and cache them into FSx for Lustre as needed
use FSx for Lustre for scratch, not as infinite archive
checkpoint jobs if using spot
version-control ParallelCluster config
test AMIs/bootstrap logic separately before production rollout

Other Important HPC Tools and Technologies

1. PBS Pro / Torque / LSF / Grid Engine

Alternatives to Slurm.

Use cases:

legacy clusters
enterprise licensing preferences
existing admin expertise

2. Lustre

Parallel distributed file system, common in HPC.

Best for:

high-throughput parallel I/O
large shared scratch workloads

3. BeeGFS

Another common parallel file system.

Known for operational flexibility and good performance.

4. GPFS / IBM Spectrum Scale

Enterprise-grade parallel file system with strong data management features.

5. NCCL

NVIDIA collective communication library for multi-GPU and multi-node GPU communication.

Critical for:

distributed deep learning
all-reduce
tensor synchronization

6. HDF5 / NetCDF

Data formats/libraries for scientific structured data.

Useful for:

simulation output
portable scientific datasets
metadata-rich arrays

7. Spack

Almost mandatory in serious multi-user HPC software management.

8. Apptainer

Critical for reproducibility and packaging in multi-user clusters.

9. Ray / Dask / Spark

Not traditional MPI-style HPC tools, but useful for:

Python parallelism
distributed analytics
ML preprocessing
task graphs

Use them when the workload is coarse-grained and developer productivity matters more than ultra-low-level communication control.

10. Workflow Engines

Real platforms often need orchestration above the scheduler.

Examples:

Nextflow
Snakemake
Airflow in some batch pipelines
CWL/WDL tools in genomics

Why they matter:

chain multi-step pipelines
manage dependencies
capture provenance
restart from failed stages

This is important because many "HPC applications" are really end-to-end workflows, not one monolithic binary.

System Design for HPC

This is the section interviewers usually want: not only "what is MPI?" but "how would you design an HPC platform?"

System Design Goals

Before designing, define:

workload type
scale
SLA / time-to-solution
budget
data size
coupling pattern
reproducibility needs
security/compliance constraints

Design Pattern 1: Research University Shared Cluster

Requirements

2,000 researchers
mixed CPU and GPU jobs
fair sharing across labs
on-prem budget
persistent team storage
moderate ops team

Architecture

2 login nodes behind load balancer/DNS rotation
2 Slurm controllers in HA design if possible
compute partitions:

- debug
- cpu
- gpu
- highmem

shared storage:

- home on backed-up NAS/EFS-like equivalent
- scratch on Lustre/BeeGFS
- archive on object/tape

environment via modules + Spack
containers via Apptainer
accounting + fairshare by lab/project
monitoring with Prometheus/Grafana

Design rationale

login nodes separated from control plane
scratch separated from home to protect metadata performance
GPU isolation avoids CPU jobs clogging expensive nodes
fairshare prevents one lab from taking entire cluster

Design Pattern 2: Cloud Burst HPC for CFD

Requirements

normally 200 cores
occasionally 20,000 cores for urgent runs
solver uses tightly coupled MPI
outputs go to S3 archive

Architecture

base on-prem or small cloud cluster
AWS ParallelCluster for burst capacity
Slurm scheduler
EFA-enabled compute nodes
placement groups
FSx for Lustre as scratch
S3 for input datasets and final outputs
checkpointing enabled

Key tradeoffs

EFA cost is worth it because MPI is tightly coupled
spot may be unsafe unless solver checkpoint/restart is solid
FSx used for working set, S3 for durable storage

Design Pattern 3: Monte Carlo Risk Platform

Requirements

millions of independent simulations nightly
minimal cross-task communication
cost sensitive
finish by market open

Architecture

Slurm or cloud batch scheduler
job arrays
CPU spot instances acceptable
object storage for inputs/outputs
no need for premium network fabric
aggregate results in distributed storage/database

Rationale

This is HPC from a throughput perspective, but not tightly coupled HPC. Do not overengineer with InfiniBand or MPI if independence dominates.

Design Pattern 4: Multi-Node GPU Training Platform

Requirements

256 to 1024 GPUs
large model training
distributed all-reduce
expensive datasets
checkpoint-heavy

Architecture

GPU partition with topology-aware placement
Slurm scheduling
EFA or equivalent fast interconnect
NCCL + PyTorch DDP
high-throughput shared scratch
staged datasets close to compute
checkpoint pipeline to durable object storage
quota and priority controls

Important choices

one process per GPU
local NVMe for shard caching
separate checkpoint and training I/O paths if possible
preemption only if checkpoint cadence supports it
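
Checkpoint cadence can be sanity-checked with a rough model. The sketch below uses Young's first-order approximation, which is a swapped-in rule of thumb and not something this guide prescribes: interval ≈ sqrt(2 × checkpoint cost × mean time between interruptions). The function names and the input numbers are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of a good checkpoint interval."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_overhead(checkpoint_cost_s: float, interval_s: float) -> float:
    """Fraction of wall time spent writing checkpoints (ignores rework after a failure)."""
    return checkpoint_cost_s / (interval_s + checkpoint_cost_s)


if __name__ == "__main__":
    # Hypothetical inputs: 120 s per checkpoint, one interruption every 6 hours.
    cost, mtbf = 120.0, 6 * 3600.0
    interval = young_interval(cost, mtbf)
    print(f"checkpoint every ~{interval / 60:.0f} min, "
          f"overhead ~{checkpoint_overhead(cost, interval):.1%}")
```

If the resulting overhead is too high for the job, that is a sign preemptible capacity may not suit it until checkpoints get cheaper.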

HPC Design Tradeoffs

1. On-Prem vs Cloud

Choice	Pros	Cons
On-prem	predictable cost at high utilization, full control, best tuning	capital expense, slower procurement, fixed capacity
Cloud	elastic, fast provisioning, many instance choices	variable cost, data movement cost, cloud-specific tuning

2. Slurm vs Kubernetes

Choice	Pros	Cons
Slurm	built for batch/HPC, mature resource model, MPI-friendly	less cloud-native app ecosystem
Kubernetes	container-native, rich platform tooling	weaker fit for tightly coupled HPC unless heavily adapted

General guidance:

for traditional HPC, choose Slurm
for service-oriented ML platforms, Kubernetes may coexist beside HPC systems

3. MPI vs Job Arrays

Choice	Best for	Wrong for
MPI	tightly coupled multi-node jobs	independent tasks
Job arrays	embarrassingly parallel workloads	fine-grained tightly coupled communication

4. Parallel File System vs Object Storage

Choice	Best for	Weakness
Parallel file system	POSIX shared scratch, parallel I/O	cost, metadata scaling, ops complexity
Object storage	cheap durable large datasets	not POSIX, poor small random file semantics

5. Spot vs On-Demand

Choice	Pros	Cons
Spot	cheaper	interruptions
On-demand	stable	more expensive

Use spot for:

checkpointable
fault-tolerant
independent workloads

Avoid spot for:

long tightly coupled jobs without restart support

Performance Tuning Checklist

When an HPC job is slow, check these in order:

Is the algorithm scaling poorly?
Is the workload load-balanced?
Is communication dominating runtime?
Are ranks/threads placed well?
Is NUMA locality poor?
Is storage throttling startup/checkpoint/output?
Is the network topology/fabric insufficient?
Are compiler flags and math libraries optimized?
Are you oversubscribing cores or GPUs?
Are you measuring with profiling tools rather than guessing?

Benchmarking Basics

When comparing systems, measure with discipline.

Things to record

node type
CPU/GPU generation
compiler version
MPI implementation
library versions
problem size
rank/thread count
binding settings
filesystem used
network type

Common benchmark mistakes

comparing different problem sizes by accident
including one-time cache warmup effects
ignoring placement and affinity
benchmarking on noisy shared nodes
not separating compute time from I/O time

Good benchmark questions

Does runtime improve?
Does efficiency improve?
Is cost per solved problem lower?
Is time-to-solution acceptable?
Does scaling flatten at a predictable point?
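
The benchmark questions above can be turned into numbers with very little code. This is a minimal sketch: the `Run` class, its field names, and the prices are hypothetical, and it assumes both runs solve the same problem size.

```python
from dataclasses import dataclass


@dataclass
class Run:
    nodes: int
    runtime_h: float          # wall-clock hours for one solved problem
    price_per_node_h: float   # hypothetical hourly price per node

    @property
    def node_hours(self) -> float:
        return self.nodes * self.runtime_h

    @property
    def cost(self) -> float:
        return self.node_hours * self.price_per_node_h


def compare(baseline: Run, candidate: Run) -> dict:
    """Summarize runtime, efficiency, and cost of candidate relative to baseline."""
    speedup = baseline.runtime_h / candidate.runtime_h
    # Efficiency is speedup divided by the increase in node count.
    efficiency = speedup / (candidate.nodes / baseline.nodes)
    return {
        "speedup": speedup,
        "efficiency": efficiency,
        "cost_ratio": candidate.cost / baseline.cost,
    }


if __name__ == "__main__":
    # Made-up measurements of the same problem on the same node type.
    small = Run(nodes=8, runtime_h=10.0, price_per_node_h=3.0)
    large = Run(nodes=32, runtime_h=3.2, price_per_node_h=3.0)
    for key, value in compare(small, large).items():
        print(f"{key}: {value:.2f}")
```

A run can be faster and still cost more per solved problem; whether that is acceptable depends on the time-to-solution requirement.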

Common Profiling Tools

perf
Intel VTune
NVIDIA Nsight
mpiP
TAU
Arm MAP
application-specific profilers

Reliability and Multi-Tenancy

Shared HPC clusters must balance:

utilization
fairness
reproducibility
security
fault isolation

Multi-tenant controls

per-project quotas
fairshare
partition/QoS isolation
filesystem quotas
software module governance
node health-based draining

Cost Governance in Cloud HPC

For cloud environments, also add:

queue-level spending limits
tagging by project and owner
budget alarms
idle resource cleanup
image sprawl control
spot vs on-demand policy by workload class

Without governance, cloud HPC often fails for financial rather than technical reasons.

Reproducibility controls

versioned modules
pinned Spack environments
containerized runs
immutable cluster configs
archived job scripts and environment metadata
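
Archiving environment metadata can be as simple as writing a small JSON record next to each job's output. The sketch below is one possible shape, not a standard: the function names and the choice of fields are assumptions, and it only captures environment variables that start with `SLURM_`.

```python
import json
import os
import platform
import sys
from datetime import datetime, timezone


def collect_run_metadata(env_prefixes=("SLURM_",)) -> dict:
    """Gather basic facts about the environment a job ran in."""
    env = {
        key: value
        for key, value in sorted(os.environ.items())
        if key.startswith(tuple(env_prefixes))
    }
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "hostname": platform.node(),
        "platform": platform.platform(),
        "python": sys.version.split()[0],
        "argv": list(sys.argv),
        "scheduler_env": env,
    }


def write_run_metadata(path: str) -> dict:
    """Write the metadata as JSON so it can be archived with the job output."""
    metadata = collect_run_metadata()
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(metadata, handle, indent=2)
    return metadata


if __name__ == "__main__":
    print(json.dumps(collect_run_metadata(), indent=2))
```

A real platform would likely also record loaded modules, container image digests, and the Git commit of the application.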

Practical Commands Cheat Sheet

Slurm

sinfo
squeue
sbatch job.sh
srun --nodes=2 --ntasks-per-node=32 ./app
sacct -j 12345
scancel 12345
scontrol show job 12345

MPI

mpicc app.c -o app
mpirun -np 64 ./app

Modules

module avail
module load gcc openmpi
module list
module purge

Basic sanity checks

lscpu
numactl --hardware
nvidia-smi
df -h
free -h

Interview Questions and Answers

1. What is HPC and when do you need it?

Answer: HPC is the use of parallel compute resources to solve problems too large or too time-sensitive for a single machine. You need it when your workload requires massive CPU/GPU throughput, very large memory, or multi-node parallelism. Typical examples are weather models, CFD, molecular simulations, large-scale AI training, and high-volume Monte Carlo simulation.

The key distinction is that HPC optimizes for time-to-solution and parallel efficiency, not primarily for always-on request serving like web systems.

2. What is the difference between HPC and distributed systems?

Answer: Both use many machines, but they optimize for different things. Distributed systems usually prioritize availability, fault tolerance, and serving online requests. HPC prioritizes high throughput, low-latency inter-process communication, and deterministic execution of large batch jobs.

For example:

a payment service uses retries, replicas, and stateless scaling
an MPI simulation uses synchronized ranks, specialized fabrics, and checkpoint/restart

3. What is Slurm?

Answer: Slurm is a cluster resource manager and job scheduler. It tracks available nodes and resources, accepts job submissions, queues them according to policy, allocates resources, launches tasks, and records accounting data.

It is effectively the operating system for a shared HPC cluster from a scheduling perspective.

4. How does Slurm schedule jobs fairly?

Answer: Usually through a combination of:

partitions
priorities
fairshare
QoS policies
job size and age factors
backfilling

Fairshare reduces the priority of users or projects that recently consumed large amounts of cluster time, so that others can get access. Backfilling increases utilization by fitting short jobs into gaps without delaying larger reserved jobs.

5. What is MPI and why is it important?

Answer: MPI is the standard interface for distributed-memory parallel programming. It lets processes on different nodes exchange data using explicit messages. It is important because multi-node systems do not share memory, so tightly coupled jobs need structured communication primitives like send/receive and collectives.

MPI remains essential for simulation and scientific computing because it gives precise control over communication and maps well to high-performance interconnects.

6. When would you use MPI instead of a job array?

Answer: Use MPI when tasks must communicate frequently during execution, such as in domain-decomposed simulations. Use a job array when tasks are independent, such as running 10,000 Monte Carlo trials with different seeds.

If tasks do not need to exchange data during runtime, MPI usually adds complexity without benefit.
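
The Monte Carlo case can be sketched as a single task script that every array element runs independently. Slurm exports `SLURM_ARRAY_TASK_ID` to each array task; the base seed, function names, and the pi-estimation workload are illustrative assumptions.

```python
import os
import random


def estimate_pi(samples: int, seed: int) -> float:
    """Monte Carlo estimate of pi from one independent trial."""
    rng = random.Random(seed)  # per-task generator, no shared state
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples


def task_seed(base_seed: int = 12345) -> int:
    """Derive a distinct seed from the array index Slurm exports to each task."""
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    return base_seed + task_id


if __name__ == "__main__":
    seed = task_seed()
    # Each task prints its own result; aggregation happens in a later step.
    print(seed, estimate_pi(samples=100_000, seed=seed))
```

Submitted through `sbatch --array=...` with a wrapper script (hypothetical here), each task runs the same code with a different index and never talks to the others, which is why MPI would add nothing.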

7. Explain strong scaling vs weak scaling.

Answer: Strong scaling keeps total problem size fixed and measures whether adding processors reduces runtime. Weak scaling increases problem size proportionally with processor count and measures whether runtime stays flat.

Strong scaling is limited heavily by communication and serial fractions. Weak scaling is often more realistic for scientific workloads because users want to solve bigger problems as systems grow.
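
Both kinds of scaling reduce to a simple efficiency ratio. The sketch below shows the usual definitions; the timings are invented for illustration.

```python
def strong_scaling_efficiency(t_base: float, p_base: int, t_p: float, p: int) -> float:
    """Fixed problem size: ideal runtime shrinks in proportion to processor count."""
    return (t_base * p_base) / (t_p * p)


def weak_scaling_efficiency(t_base: float, t_p: float) -> float:
    """Problem size grows with processor count: ideal runtime stays flat."""
    return t_base / t_p


if __name__ == "__main__":
    # Hypothetical timings in seconds, keyed by processor count.
    strong = {64: 1000.0, 128: 520.0, 256: 290.0, 512: 210.0}
    weak = {64: 100.0, 128: 103.0, 256: 108.0, 512: 121.0}
    for p in strong:
        s = strong_scaling_efficiency(strong[64], 64, strong[p], p)
        w = weak_scaling_efficiency(weak[64], weak[p])
        print(f"p={p:4d}  strong={s:.2f}  weak={w:.2f}")
```

An efficiency of 1.0 is ideal in both cases; in this made-up data the strong-scaling curve degrades much faster than the weak-scaling one.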

8. Why does network matter so much in HPC?

Answer: In tightly coupled applications, ranks exchange data constantly. If network latency is high or bandwidth is low, processors sit idle waiting for messages. As job size grows, communication cost can dominate runtime.

That is why technologies like InfiniBand and EFA matter. They reduce communication overhead and improve collective operation performance.

9. What are the main components of an HPC cluster?

Answer:

login nodes for user access
head/control nodes for scheduling and management
compute nodes for execution
storage systems for home, scratch, and archive
network fabric for node-to-node communication
software stack including scheduler, compilers, MPI, libraries, and environment tools

10. What is AWS ParallelCluster and when would you use it?

Answer: AWS ParallelCluster is a deployment and management tool for running HPC clusters on AWS. It automates cluster creation around Slurm, compute fleets, storage integration, and networking.

Use it when you want AWS-based HPC without building all cluster infrastructure manually from raw EC2, FSx, IAM, and networking components.

It is especially good for burst workloads, project-based compute, and teams needing reproducible cloud HPC clusters.

11. How would you design an HPC platform for tightly coupled CFD jobs on AWS?

Answer: I would use:

AWS ParallelCluster with Slurm
EFA-enabled instance types
cluster placement groups
FSx for Lustre for shared scratch
S3 for durable input/output archive
separate queues for debug, on-demand production, and possibly spot if checkpointing is mature

The key decision is to optimize communication and I/O. For tightly coupled MPI, standard Ethernet networking without EFA and placement groups is usually not enough at scale.

12. How would you design a cost-efficient HPC platform for Monte Carlo workloads?

Answer: I would not default to MPI or premium interconnects. I would use independent jobs or job arrays, cheap CPU capacity, aggressive autoscaling, object storage for inputs/outputs, and maybe spot instances because tasks are independent and restartable.

This is a common interview trap: not every HPC workload needs tightly coupled cluster design.

13. What storage would you choose for HPC and why?

Answer: It depends on access pattern:

home data: persistent and backed up shared storage
working scratch: high-performance parallel filesystem
archival data: object storage
node-local temporary data: NVMe/local SSD

For HPC, storage is chosen by I/O pattern, not by one-size-fits-all simplicity.

14. What are common HPC bottlenecks?

Answer:

poor parallel decomposition
communication overhead
load imbalance
NUMA/locality issues
slow or metadata-heavy storage
bad rank placement
oversubscription
checkpoint storms
underestimating scheduler/policy effects

15. What is checkpointing and why is it important?

Answer: Checkpointing saves application state periodically so jobs can resume after interruption or failure. It is critical for long-running jobs, cloud spot usage, and clusters with wall-time limits.

Without checkpointing, one failure near the end of a multi-day run can waste enormous compute time.
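
A minimal application-level checkpoint/restart loop might look like the sketch below. It is a toy under stated assumptions: state is a small JSON-serializable dict, the "work" is a stand-in sum, and the file layout is hypothetical. The write-then-rename step matters because a job killed mid-write must not corrupt the last good checkpoint.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def run(path: str, total_steps: int, checkpoint_every: int) -> dict:
    """Toy job: resumes from the last checkpoint and saves periodically."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["total"] += state["step"]  # stand-in for real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(run(os.path.join(workdir, "ckpt.json"), total_steps=10, checkpoint_every=3))
```

Real solvers checkpoint far larger state, often through parallel I/O libraries, but the resume-from-last-good-state structure is the same.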

16. What is the difference between OpenMP and MPI?

Answer: OpenMP is shared-memory parallelism inside a node. MPI is distributed-memory parallelism across processes, often across nodes.

OpenMP is easier but limited to one shared-memory system. MPI is more complex but scales across many machines. Many applications use both.

17. Why is NUMA important?

Answer: In NUMA systems, memory is physically closer to some CPUs than others. If a thread frequently accesses remote memory, latency rises and bandwidth drops. This can materially hurt performance even when CPU utilization looks high.

Proper thread pinning, process placement, and memory locality are essential in HPC tuning.

18. How would you improve cluster utilization?

Answer:

enable backfilling
separate partitions by workload class
encourage realistic wall times
use job arrays for many small tasks
monitor queue fragmentation
use fairshare
right-size node shapes
reduce scheduler overload from tiny jobs

Utilization is a policy, scheduling, and workload-shaping problem, not only a hardware problem.

19. What is a parallel filesystem and why not just use NFS everywhere?

Answer: A parallel filesystem distributes metadata and data paths to support high-throughput concurrent access from many nodes. NFS can work for light shared storage, especially home directories, but usually becomes a bottleneck for large-scale parallel reads/writes and metadata-heavy HPC workloads.

20. How would you compare Slurm and Kubernetes for HPC?

Answer: Slurm is purpose-built for HPC batch scheduling, MPI integration, fairshare, and cluster resource allocation. Kubernetes is stronger for containerized services and cloud-native app ecosystems.

For classic simulation workloads, Slurm is usually the better scheduler. Kubernetes can complement HPC for surrounding services, portals, notebooks, and some ML pipelines.

21. What would you monitor in an HPC platform?

Answer:

queue wait time
job throughput
job failure rate
node health
CPU/GPU utilization
memory pressure
filesystem throughput and metadata rates
network errors and congestion
scheduler latency
fairness and quota consumption

22. A user says their MPI job scales from 8 to 64 ranks but gets slower from 64 to 512. What do you check?

Answer: I would check:

communication/computation ratio
load balance
collectives overhead
halo exchange frequency
rank placement
NUMA pinning
network fabric saturation
small message overhead
algorithmic scaling limits from Amdahl's Law

I would profile before changing architecture because this is often a communication pattern problem, not only an infrastructure problem.

23. When would you use spot instances in cloud HPC?

Answer: When workloads are:

restartable
checkpointed
embarrassingly parallel
cost-sensitive

I would avoid spot for long tightly coupled jobs unless interruption handling is proven and operationally safe.

24. How do you make HPC environments reproducible?

Answer:

version-controlled infrastructure config
modules or Spack environments with pinned versions
containerized applications with Apptainer
archived job scripts
recorded runtime metadata
stable input datasets and config management

In research and regulated environments, reproducibility is a platform feature, not a user afterthought.

25. Design an interview-ready answer for "build a shared HPC platform for AI + simulations."

Answer: I would split workloads into at least two resource classes:

tightly coupled CPU/GPU simulation jobs
ML training/inference jobs

I would use Slurm for scheduling, separate GPU and CPU partitions, EFA/fast interconnect for distributed jobs, high-performance scratch storage, object storage for durable datasets and checkpoints, modules/containers for reproducibility, and accounting/fairshare for multi-tenant governance.

I would also explicitly separate:

login/control plane
compute plane
home vs scratch vs archive storage
debug vs production queues

The main tradeoff is balancing utilization and fairness while protecting expensive GPU and network resources from noisy or mismatched workloads.

26. What is the difference between `srun`, `sbatch`, and `mpirun`?

Answer: sbatch submits a batch job to Slurm. srun launches tasks, often within an existing Slurm allocation, and can also be used for interactive jobs. mpirun is an MPI launcher provided by the MPI implementation.

In a Slurm-managed cluster, srun is often preferred for launching MPI tasks because it integrates directly with the scheduler's allocation and process management.

27. Why do HPC platforms separate home, scratch, and archive storage?

Answer: Because the access patterns and cost models differ. Home storage should be persistent and often backed up. Scratch should be fast and disposable. Archive should be cheap and durable.

If you merge them into one system, you usually end up paying too much, performing poorly, or both.

28. What is job backfilling and why is it useful?

Answer: Backfilling lets the scheduler run smaller jobs in currently free slots as long as doing so does not delay higher-priority reserved jobs. It improves utilization and reduces wasted idle windows.

This is one of the most important scheduler techniques for shared clusters with mixed job sizes.
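
The backfill rule can be sketched as a single eligibility check. This is a simplified EASY-style model with hypothetical names, not Slurm's actual backfill plugin, which considers many pending jobs and more constraints.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int
    walltime: int  # requested wall time, in minutes


def can_backfill(job: Job, free_nodes: int, now: int,
                 reservation_start: int, spare_nodes_at_start: int) -> bool:
    """Start the job now only if the reserved top-priority job is not delayed.

    reservation_start: earliest time the top-priority pending job can start.
    spare_nodes_at_start: nodes that job will leave unused at that time.
    """
    if job.nodes > free_nodes:
        return False  # does not fit right now
    finishes_in_time = now + job.walltime <= reservation_start
    fits_beside_reservation = job.nodes <= spare_nodes_at_start
    return finishes_in_time or fits_beside_reservation


if __name__ == "__main__":
    # Hypothetical state: 4 idle nodes, big job reserved to start in 60 minutes.
    for job in (Job("short", 4, 30), Job("long", 4, 120), Job("narrow", 1, 120)):
        ok = can_backfill(job, free_nodes=4, now=0,
                          reservation_start=60, spare_nodes_at_start=1)
        print(f"{job.name}: {'backfill' if ok else 'wait'}")
```

The check depends entirely on requested wall time, which is why inflated wall-time requests reduce backfill opportunities.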

29. How do you choose between EFA/InfiniBand and standard Ethernet?

Answer: I start from the communication pattern. If the application is tightly coupled, synchronization-heavy, and sensitive to collective performance, I choose EFA or InfiniBand. If tasks are mostly independent or coarse-grained, standard Ethernet is often enough.

The mistake is buying premium network for embarrassingly parallel jobs or, conversely, trying to scale MPI on commodity networking without measuring the consequences.
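
A back-of-envelope estimate helps before measuring. The sketch below uses the common latency-bandwidth model (time = latency + size / bandwidth) and assumes no overlap between communication and compute. The fabric and workload numbers are illustrative placeholders, not measurements of any real network.

```python
def message_time(size_bytes: float, latency_s: float,
                 bandwidth_bytes_per_s: float) -> float:
    """Latency-bandwidth model: fixed startup cost plus transfer time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def communication_fraction(compute_s: float, messages: int, size_bytes: float,
                           latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Share of each step spent communicating, assuming no overlap with compute."""
    comm = messages * message_time(size_bytes, latency_s, bandwidth_bytes_per_s)
    return comm / (comm + compute_s)


if __name__ == "__main__":
    # Placeholder fabrics: (latency in seconds, bandwidth in bytes per second).
    fabrics = {"commodity": (50e-6, 1.25e9), "low-latency": (2e-6, 12.5e9)}
    # Placeholder workloads: (compute seconds per step, messages, bytes per message).
    workloads = {"tightly coupled": (1e-3, 200, 8192), "coarse-grained": (10.0, 10, 1e6)}
    for wname, (compute, count, size) in workloads.items():
        for fname, (lat, bw) in fabrics.items():
            frac = communication_fraction(compute, count, size, lat, bw)
            print(f"{wname:16s} on {fname:12s}: {frac:.1%} communication")
```

If the estimated communication share is small on the cheaper fabric, premium networking is unlikely to pay off; if it dominates, measure on both before committing.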

30. What is the role of containers in HPC if modules already exist?

Answer: Modules solve environment selection at the cluster level. Containers solve application portability and dependency isolation. They are complementary.

In mature environments:

modules expose compilers, MPI, and site-standard tooling
containers package user applications and language ecosystems

31. How would you debug a long queue wait time complaint?

Answer: I would check:

requested resources
partition and QoS
priority/fairshare
current fragmentation
reservation conflicts
historical queue occupancy

I would not assume "the cluster is full" until I inspect scheduler state. Many long waits are policy or request-shape issues.

32. A team wants one platform for genomics pipelines and tightly coupled CFD. Would you use one cluster?

Answer: Possibly one administrative platform, but not one undifferentiated resource pool. I would separate workload classes through partitions, node types, storage policy, and likely queue-specific operational guidance.

Genomics often cares about workflows, metadata-heavy I/O, and throughput. CFD cares about MPI scaling, network fabric, and tightly coupled runtime behavior. Forcing both into the same tuning and policy envelope usually hurts one of them.

33. What makes an HPC design answer strong in an interview?

Answer: Three things:

classify the workload correctly
map the workload to the right compute, network, storage, and scheduler policy
explain tradeoffs in cost, utilization, and operational risk

Interviewers usually care less about memorizing tool names than about whether you can choose the right architecture for the workload.

What Interviewers Usually Want to Hear

If an interviewer asks about HPC system design, they usually want to hear that you understand:

not all parallel workloads are the same
tightly coupled and embarrassingly parallel systems should be designed differently
scheduler policy is part of architecture
storage and network are first-class design choices
cloud HPC is viable but requires explicit tradeoffs
reproducibility, observability, and checkpointing matter as much as raw CPU count

30-Minute Revision Sheet

Use this section the night before or 30 minutes before an interview.

1. One-line definitions

HPC: using many compute resources together to solve large or time-sensitive problems efficiently
Slurm: resource manager and batch scheduler for shared clusters
MPI: distributed-memory message passing model for tightly coupled parallel jobs
OpenMP: shared-memory threading model inside a node
NUMA: memory locality model where some memory is closer to some CPUs
Parallel filesystem: shared storage built for concurrent high-throughput access from many nodes
Checkpointing: saving job state so work can resume after failure/preemption
AWS ParallelCluster: AWS tooling to deploy/manage HPC clusters, commonly around Slurm

2. Fast classification framework

When someone gives you an HPC problem, classify it first:

A. Is it embarrassingly parallel?

If yes:

job arrays
cheaper networking
spot often acceptable
object storage often enough

B. Is it tightly coupled?

If yes:

MPI
premium network fabric
placement matters
shared scratch matters
spot is risky unless checkpointing is strong

C. Is it GPU-heavy?

If yes:

GPU queue separation
CPU/GPU ratio matters
topology and data pipeline matter
checkpoint and dataset throughput matter

3. The 5-layer answer structure

For almost any design question, answer in this order:

workload shape
execution model
scheduler and policy
infrastructure
operations and reliability

4. What to say about Slurm

Slurm decides who gets which resources and when
key concepts: partitions, QoS, fairshare, backfilling, GRES
sbatch submits
srun launches tasks
squeue shows live queue
sacct shows accounting/history

5. What to say about MPI

MPI is for tightly coupled distributed-memory jobs
ranks exchange data explicitly
collectives and communication cost matter
scaling usually fails because of communication, imbalance, or placement

6. What to say about storage

home = persistent
scratch = fast and temporary
archive = cheap and durable
do not use one storage tier for everything

7. What to say about networking

tightly coupled jobs need low latency and high bandwidth
independent tasks usually do not need premium fabric
EFA/InfiniBand decisions should come from communication pattern

8. Common interview traps

using MPI for independent jobs
recommending expensive network for Monte Carlo
ignoring storage in simulation/training design
not discussing fairshare and multi-tenancy
assuming cloud is always cheaper
forgetting checkpointing

9. The shortest strong answer to "design an HPC platform"

Classify the workload first, then choose the execution model, then design compute, scheduler policy, storage, and network around that workload. Separate login, control, and compute planes. Split storage into home, scratch, and archive. Add observability, quotas, and checkpointing. Use premium networking only when the communication pattern justifies it.

10. Red flags in your own answer

too much tool-name listing without architecture
no workload classification
no storage discussion
no failure/restart strategy
no cost or fairness discussion

Model Answers for the Top 50 Questions

These are compact interview-ready answers. Expand them with workload-specific details when answering live.

1. What is HPC?

HPC is the use of many compute resources together to solve problems that are too large or too time-sensitive for one machine. It focuses on parallelism, scalability, and time-to-solution.

2. How is HPC different from general distributed systems?

HPC usually optimizes for throughput, parallel efficiency, and low-latency communication within jobs such as simulations or training runs. General distributed systems usually optimize for availability, elasticity, and serving online traffic.

3. What kinds of workloads are embarrassingly parallel?

Workloads where tasks are independent and do not need runtime communication, such as Monte Carlo trials, parameter sweeps, batch rendering, and many genomics pipeline stages.

4. What kinds of workloads are tightly coupled?

Workloads where processes exchange data frequently during execution, such as CFD, climate modeling, molecular dynamics, and distributed linear algebra.

5. What is the difference between strong scaling and weak scaling?

Strong scaling keeps problem size fixed and asks whether runtime drops as resources increase. Weak scaling increases problem size with resource count and asks whether runtime stays roughly constant.

6. What is Amdahl's Law and why does it matter?

Amdahl's Law says the serial fraction of a program limits total speedup. It matters because adding more nodes cannot fix a fundamentally serial bottleneck.

7. What is Gustafson's Law?

Gustafson's Law says larger systems are valuable because they let us solve larger problems in similar time, not only because they speed up fixed-size problems.
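
Amdahl's and Gustafson's laws are easy to compare numerically. In the sketch below, `serial_fraction` is the assumed serial share of runtime: measured on one processor for Amdahl, and on the parallel run for Gustafson. The 5% figure is only an example.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Fixed problem size: S = 1 / (s + (1 - s) / p)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def gustafson_speedup(serial_fraction: float, processors: int) -> float:
    """Scaled problem size: S = s + (1 - s) * p."""
    return serial_fraction + (1.0 - serial_fraction) * processors


if __name__ == "__main__":
    s = 0.05  # example serial fraction
    for p in (8, 64, 512, 4096):
        print(f"p={p:5d}  amdahl={amdahl_speedup(s, p):7.2f}  "
              f"gustafson={gustafson_speedup(s, p):8.2f}")
```

With a 5% serial fraction, Amdahl speedup can never exceed 1 / 0.05 = 20 no matter how many processors are added, while the scaled speedup keeps growing because the problem grows with the machine.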

8. What is Slurm?

Slurm is an HPC resource manager and batch scheduler. It tracks resources, queues jobs, allocates nodes, launches tasks, and records accounting data.

9. What are the key Slurm daemons and client commands?

Key daemons are slurmctld, slurmd, and often slurmdbd. Key commands are sbatch, srun, squeue, sinfo, sacct, scancel, and scontrol.

10. What is the difference between `sbatch`, `srun`, and `scancel`?

sbatch submits a batch script, srun launches tasks inside an allocation or creates an interactive allocation, and scancel stops jobs.

11. What does fairshare mean in Slurm?

Fairshare is a policy mechanism that reduces priority for users or projects that recently consumed more cluster resources, helping prevent monopolization.
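
The "recently consumed" part of fairshare is typically implemented by decaying old usage. The sketch below is a simplified illustration loosely modeled on a half-life decay and a 2^(-usage/share) factor; it is an assumption for teaching purposes and not Slurm's exact algorithm, which depends on the configured priority plugin and settings.

```python
def decayed_usage(samples, half_life_days: float) -> float:
    """Sum past usage, halving the weight of each sample per half-life of age.

    samples: iterable of (age_days, cpu_hours) pairs.
    """
    return sum(hours * 0.5 ** (age / half_life_days) for age, hours in samples)


def fairshare_factor(usage_share: float, allocated_share: float) -> float:
    """Map usage relative to entitlement onto (0, 1]; 0.5 means usage equals share."""
    return 2.0 ** (-usage_share / allocated_share)


if __name__ == "__main__":
    # Hypothetical labs, each entitled to 25% of the cluster.
    history = {
        "lab_a": [(0, 4000.0), (3, 4000.0)],   # heavy recent usage
        "lab_b": [(20, 4000.0), (30, 4000.0)],  # same total, but long ago
    }
    usage = {lab: decayed_usage(s, half_life_days=7.0) for lab, s in history.items()}
    total = sum(usage.values())
    for lab, used in usage.items():
        print(f"{lab}: factor={fairshare_factor(used / total, 0.25):.2f}")
```

The lab with older usage ends up with the higher factor, so its pending jobs gain priority.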

12. What is backfilling?

Backfilling allows smaller jobs to run in currently free slots as long as they do not delay higher-priority reserved jobs. It improves utilization.

13. Why do jobs remain pending in Slurm?

Usually because of resource shortages, policy limits, fairshare, reservations, fragmentation, or impossible requests such as incompatible constraints.

14. What is a Slurm partition?

A partition is a logical grouping of nodes and policies, similar to a queue. Clusters often separate partitions for CPU, GPU, debug, long-running, or high-memory jobs.

15. What is QoS in Slurm?

QoS is a policy layer that controls priority, runtime limits, preemption behavior, and sometimes usage limits.

16. What is GRES in Slurm?

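GRES means generic resources: schedulable node-level resources such as GPUs or local SSDs that a job must request explicitly. Software licenses are usually tracked through Slurm's separate license mechanism rather than GRES.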

17. What is MPI?

MPI is the standard programming interface for distributed-memory message passing across processes, often across many nodes.

18. What is a rank in MPI?

A rank is a process identity inside an MPI communicator. Communication patterns are often described in terms of rank IDs.

19. What is a communicator?

A communicator is a communication group in MPI, such as MPI_COMM_WORLD, defining which ranks can talk together in a given context.

20. What is the difference between point-to-point and collective communication?

Point-to-point communication happens between specific ranks, such as send/receive. Collective communication involves a group, such as broadcast, reduce, or all-reduce.

21. When do you use non-blocking communication?

When you want to overlap communication with computation, reduce idle waiting, or avoid deadlock-prone blocking communication sequences.

22. What are common MPI scaling bottlenecks?

Communication overhead, too many collectives, load imbalance, poor rank placement, small-message overhead, and weak memory locality.

23. What is halo exchange?

Halo exchange is a communication pattern where neighboring subdomains exchange boundary data, common in stencil and mesh-based simulations.
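
The pattern can be shown without MPI by simulating the subdomains in one process. In this sketch each subdomain holds one ghost cell per side; the 3-point averaging stencil, zero boundaries, and all function names are illustrative assumptions. In a real code the copies in `halo_exchange` would be MPI messages between neighboring ranks.

```python
def serial_step(u):
    """3-point averaging stencil on the whole array with zero boundaries."""
    padded = [0.0] + list(u) + [0.0]
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
            for i in range(1, len(padded) - 1)]


def split(u, parts):
    """Split u into equal subdomains, each with one ghost cell on both sides."""
    if len(u) % parts != 0:
        raise ValueError("array length must be divisible by the number of parts")
    size = len(u) // parts
    return [[0.0] + list(u[r * size:(r + 1) * size]) + [0.0] for r in range(parts)]


def halo_exchange(domains):
    """Copy each neighbor's edge value into the local ghost cells."""
    for r, local in enumerate(domains):
        if r > 0:
            local[0] = domains[r - 1][-2]   # left neighbor's last interior cell
        if r < len(domains) - 1:
            local[-1] = domains[r + 1][1]   # right neighbor's first interior cell


def local_step(local):
    """Apply the stencil to interior cells only; ghost cells are reset to zero."""
    interior = [(local[i - 1] + local[i] + local[i + 1]) / 3.0
                for i in range(1, len(local) - 1)]
    return [0.0] + interior + [0.0]


def decomposed_step(u, parts):
    """One time step: exchange halos, update locally, then gather the interiors."""
    domains = split(u, parts)
    halo_exchange(domains)
    domains = [local_step(d) for d in domains]
    return [x for d in domains for x in d[1:-1]]


if __name__ == "__main__":
    u = [float(i * i) for i in range(12)]
    print(decomposed_step(u, 3) == serial_step(u))
```

The decomposed result matches the serial one only because the ghost cells were refreshed before the update; skipping the exchange gives wrong values at every subdomain boundary.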

24. Why do collectives become expensive at scale?

Because they involve coordinated communication across many ranks. As rank count grows, latency, synchronization, and topology effects become increasingly significant.
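
A rough way to see the latency effect is the textbook alpha-beta cost model for a binomial-tree broadcast. This is only a sketch: the alpha and beta defaults below are assumed illustrative values, real MPI libraries choose among several algorithms, and the model ignores contention and topology.

```python
def bcast_time(ranks, msg_bytes, alpha=2e-6, beta=1e-10):
    """Estimated binomial-tree broadcast time in seconds.

    alpha: per-message latency in seconds (assumed value)
    beta:  per-byte transfer time in seconds (assumed value)
    """
    if ranks <= 1:
        return 0.0
    steps = (ranks - 1).bit_length()   # ceil(log2(ranks)) communication rounds
    return steps * (alpha + msg_bytes * beta)


for p in (2, 64, 1024, 16384):
    print(p, bcast_time(p, msg_bytes=8))
```

For an 8-byte message the cost is almost entirely the latency term, so it grows with the number of rounds even though the payload is tiny.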

25. What is NUMA and why does it matter?

NUMA (non-uniform memory access) means the cost of a memory access depends on which CPU socket the memory is attached to. Accessing memory attached to a remote socket has higher latency and lower bandwidth, so poor locality hurts performance.

26. What is process/thread affinity?

Affinity controls where processes and threads run. Good affinity improves locality and predictability; bad affinity causes contention and remote memory access.

27. What is OpenMP and when should you use it?

OpenMP is a shared-memory threading model. Use it for intra-node parallelism or together with MPI in hybrid jobs.

28. When would you choose MPI + OpenMP together?

When you want MPI across nodes and threads within each node to reduce inter-node communication, improve memory sharing, or better match NUMA topology.
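
One common starting layout is one MPI rank per socket with one OpenMP thread per core. The sketch below only does the arithmetic; it assumes one NUMA domain per socket, and the function name and dictionary keys are hypothetical, chosen to mirror Slurm's --ntasks-per-node and --cpus-per-task options and the OMP_NUM_THREADS variable.

```python
def hybrid_layout(nodes, sockets_per_node, cores_per_socket):
    """One MPI rank per socket, one OpenMP thread per core in that socket."""
    ranks_per_node = sockets_per_node
    threads_per_rank = cores_per_socket
    return {
        "ntasks": nodes * ranks_per_node,        # total MPI ranks
        "ntasks_per_node": ranks_per_node,
        "cpus_per_task": threads_per_rank,       # cores reserved per rank
        "omp_num_threads": threads_per_rank,     # threads each rank spawns
    }


layout = hybrid_layout(nodes=4, sockets_per_node=2, cores_per_socket=32)
print(layout)
```

Compared with one rank per core, this layout uses 8 ranks instead of 256 on the same hardware, which reduces the number of communicating endpoints while keeping each rank's threads on memory local to its socket.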

29. What makes GPU clusters different from CPU-only clusters?

They require explicit GPU scheduling, CPU-to-GPU balance, topology awareness, fast data pipelines, and checkpoint/dataset strategies tuned for accelerator workloads.

30. What is NCCL and why is it important?

NCCL is NVIDIA's collective communication library for GPUs. It is critical for multi-GPU and multi-node training because it accelerates collective operations like all-reduce.

31. What is RDMA in practical terms?

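RDMA (remote direct memory access) lets a network adapter read or write memory on another machine directly, bypassing the kernel network stack and avoiding extra copies. This reduces CPU involvement in data transfer and improves latency and throughput for communication-heavy workloads.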

32. Why are InfiniBand or EFA useful?

They provide lower latency and higher throughput than conventional TCP/IP networking over standard Ethernet, which matters for tightly coupled MPI and distributed GPU jobs.

33. What is a parallel filesystem?

A shared storage system designed for many nodes reading and writing concurrently at high throughput, often with distributed metadata and data services.

34. Why is NFS often insufficient for large HPC scratch workloads?

Because metadata and throughput limits usually appear under large-scale concurrent access, especially with many small files or heavy checkpoint traffic.

35. Why should home, scratch, and archive be separated?

Because they serve different cost, performance, and durability needs. One storage tier usually cannot satisfy all three efficiently.

36. What are common HPC storage bottlenecks?

Metadata storms, too many small files, simultaneous checkpoints, poor file formats, networked storage saturation, and using the wrong tier for the workload.

37. What is checkpointing?

Checkpointing is periodically saving job state so computation can resume after failure, preemption, or wall-time expiration.

38. How do you decide checkpoint frequency?

Balance failure risk against I/O overhead. Checkpoint too often and you waste storage bandwidth; too rarely and you lose too much work on failure.
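
A common first-order estimate is Young's approximation, which sets the interval to sqrt(2 x checkpoint cost x mean time between failures). The sketch below assumes the checkpoint cost is much smaller than the MTBF and ignores restart time; the function names are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's approximation for the checkpoint interval, in seconds."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(interval_s, checkpoint_cost_s, mtbf_s):
    """Approximate fraction of time lost: checkpoint writes plus expected rework."""
    write_cost = checkpoint_cost_s / interval_s   # time spent writing checkpoints
    rework = interval_s / (2.0 * mtbf_s)          # on average half an interval is lost per failure
    return write_cost + rework


cost, mtbf = 300.0, 24 * 3600.0   # 5-minute checkpoint, one failure per day (assumed)
best = young_interval(cost, mtbf)
print(best / 3600.0, "hours")
for interval in (1800.0, best, 28800.0):
    print(interval, round(overhead_fraction(interval, cost, mtbf), 4))
```

With these assumed numbers the estimate is a checkpoint every 2 hours, and both shorter and longer intervals give a higher overhead fraction.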

39. When should you use spot instances in cloud HPC?

For independent or checkpointable jobs where interruption is acceptable and cost savings matter more than continuous execution.

40. When is cloud HPC a bad fit?

When workloads require very stable ultra-low-latency performance, data gravity is extremely high, utilization is steady and high enough that owned hardware costs less, or regulation strongly favors on-prem.

41. What is AWS ParallelCluster?

It is AWS-supported tooling for deploying and managing HPC clusters, typically with Slurm, compute fleets, storage integration, and cloud networking.

42. How would you design an AWS HPC cluster for MPI-based CFD?

Use ParallelCluster, Slurm, homogeneous EFA-enabled nodes, placement groups, FSx for Lustre scratch, and S3 for durable storage. Optimize for communication and checkpoint efficiency.

43. How would you design a cost-efficient Monte Carlo platform?

Use job arrays or a task scheduler, cheaper CPU capacity, object storage, autoscaling, and spot instances if jobs are restartable. Do not pay for premium fabrics unnecessarily.
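
A minimal sketch of one job-array task, assuming each task is independent and identified by Slurm's SLURM_ARRAY_TASK_ID environment variable. The pi estimate is a stand-in workload, the function names are hypothetical, and seeding with base_seed + task_id is a simplification; production codes usually use a dedicated parallel random-stream scheme.

```python
import os
import random


def count_hits(task_id, samples, base_seed=12345):
    """Count random points inside the unit quarter-circle for one array task."""
    rng = random.Random(base_seed + task_id)   # per-task seed makes reruns reproducible
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def combine(hit_counts, samples_per_task):
    """Merge per-task results into a single estimate of pi."""
    total = samples_per_task * len(hit_counts)
    return 4.0 * sum(hit_counts) / total


if __name__ == "__main__":
    # Falls back to task 0 when run outside a job array.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    print(task_id, count_hits(task_id, samples=100_000))
```

Because a task depends only on its own ID, an interrupted task (for example on a reclaimed spot instance) can simply be rerun and will produce the same result, and the merge step needs nothing faster than object storage.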

44. How would you design a multi-tenant university HPC cluster?

Separate login, control, and compute planes; provide CPU/GPU/high-memory partitions; use fairshare, quotas, accounting, shared scratch, and reproducible software stacks.

45. How do you make HPC environments reproducible?

Version infrastructure, pin software stacks, use modules or Spack, package apps in Apptainer where appropriate, and record job scripts and runtime metadata.

46. What should you monitor in an HPC platform?

Queue delay, utilization, job failures, node health, storage throughput and metadata rates, network errors, scheduler latency, and quota/fairshare usage.

47. How do you benchmark HPC systems correctly?

Control the environment, record node types and software versions, use representative problem sizes, fix affinity and placement, separate compute from I/O time, and repeat runs for consistency.
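
The "repeat runs for consistency" step can be sketched as a small timing harness. This is an illustrative sketch only: the benchmark function and its result keys are hypothetical, and a real HPC benchmark would also pin affinity and record node and software details outside this code.

```python
import statistics
import time


def benchmark(fn, repeats=5, warmup=1):
    """Time fn several times and report the spread, not a single number."""
    for _ in range(warmup):
        fn()                                   # warm caches before measuring
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "runs": repeats,
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
        "stdev_s": statistics.stdev(samples) if repeats > 1 else 0.0,
    }


def workload():
    """Stand-in compute kernel (assumed for illustration)."""
    return sum(i * i for i in range(200_000))


print(benchmark(workload))
```

Reporting the median together with the minimum, maximum, and standard deviation makes noisy nodes or interference visible instead of hiding them behind one measurement.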

48. What is the role of workflow engines in HPC?

They orchestrate multi-step pipelines, track dependencies, improve restartability, and capture provenance beyond what a raw scheduler provides.

49. How would you compare Slurm and Kubernetes for HPC workloads?

Slurm is better aligned with classic HPC scheduling, MPI, batch queues, and fairshare. Kubernetes is stronger for container-native services and some ML/data platforms. For traditional HPC, Slurm is usually the better fit.

50. What makes an HPC design answer strong in an interview?

Correct workload classification, correct mapping to compute/network/storage/scheduler choices, explicit tradeoffs, and clear reasoning about failure, cost, and operations.

Mock Interview Prompts

Use these for practice. Each one is phrased the way an interviewer might actually ask it.

Prompt 1: CFD Cluster

"Design an HPC platform for a team running multi-node CFD simulations that must complete within fixed deadlines."