HPC Storage, Networking, and Operations

This file focuses on the infrastructure and operational parts of HPC that often become the real bottlenecks.

Storage Tiers

Home

persistent
user-facing
often backed up

Scratch

fast
temporary
high-throughput
often purged
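
Because scratch is often purged, it helps to know which files a purge would remove before it happens. The following is a minimal sketch, assuming a purely age-based policy keyed on modification time; real purge policies are site-specific, and the function name `purge_candidates` and the 30-day threshold in the usage note are illustrative assumptions, not any site's actual rule.

```python
import os
import time
from pathlib import Path


def purge_candidates(root, max_age_days, now=None):
    """Return files under root last modified more than max_age_days ago.

    Assumption: the purge policy is based on modification time only.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                # lstat avoids following symlinks out of the scratch tree
                if path.lstat().st_mtime < cutoff:
                    stale.append(path)
            except OSError:
                # File vanished or is unreadable; skip it
                continue
    return sorted(stale)
```

Calling `purge_candidates("/scratch/myuser", 30)` (a hypothetical path) would list files older than 30 days, which can then be copied to a persistent tier.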

Why Separate Storage Tiers?

Because performance, cost, and durability goals conflict.

If one storage system is forced to be all of the following at once:

cheap
fast
efficient for metadata-heavy workloads
durable

it usually fails at one or more of those goals.

Common Storage Options

NFS
Lustre
BeeGFS
GPFS / Spectrum Scale
FSx for Lustre
S3 or other object storage

Common Storage Problems

metadata storms
too many small files
simultaneous checkpoints
using home for scratch data
weak dataset staging strategy
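
One common mitigation for the small-file problem is to measure it and then pack small files into a single archive, so the file system handles one large object instead of thousands of metadata operations. The sketch below assumes a 64 KiB cutoff for "small" and uses hypothetical helper names (`count_small_files`, `pack_directory`); both are illustrative choices, not a standard.

```python
import os
import tarfile
from pathlib import Path


def count_small_files(root, threshold_bytes=64 * 1024):
    """Return (small, total) file counts under root.

    Assumption: "small" means below threshold_bytes (64 KiB by default).
    """
    small = total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += 1
            if os.lstat(os.path.join(dirpath, name)).st_size < threshold_bytes:
                small += 1
    return small, total


def pack_directory(src_dir, archive_path):
    """Pack src_dir into one uncompressed tar file and return its path."""
    src = Path(src_dir)
    with tarfile.open(archive_path, "w") as tar:
        # arcname keeps paths inside the archive relative to the directory name
        tar.add(src, arcname=src.name)
    return archive_path
```

A job can then stage the single archive to scratch or node-local storage and unpack it there, rather than opening every small file on the shared file system.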

Networking

For loosely coupled workloads, standard Ethernet may be enough.

For tightly coupled workloads, network performance is often critical.

Important network metrics

latency
bandwidth
packet/message rate
jitter
collective performance
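
To see how latency and bandwidth interact, a simple first-order model treats the time to send a message as a fixed latency plus the size divided by bandwidth. The sketch below assumes that model; it ignores congestion, jitter, and collective algorithms, and the numbers in the usage note are illustrative assumptions rather than measurements of any fabric.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time for one message: latency + size / bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def effective_bandwidth(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Bandwidth actually achieved for a message of the given size."""
    return message_bytes / transfer_time(
        message_bytes, latency_s, bandwidth_bytes_per_s
    )


def half_performance_size(latency_s, bandwidth_bytes_per_s):
    """Message size at which latency and transfer cost are equal.

    Below this size a message is latency-dominated; above it,
    bandwidth-dominated.
    """
    return latency_s * bandwidth_bytes_per_s
```

With an assumed latency of 2 microseconds and 12.5 GB/s of bandwidth, `half_performance_size(2e-6, 12.5e9)` is about 25 KB, so an application that exchanges many 8-byte messages is limited almost entirely by latency and message rate, not bandwidth.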

Common fabrics

Ethernet
InfiniBand
AWS EFA

RDMA

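RDMA (remote direct memory access) lets the network adapter move data directly between application memory on different nodes, without per-message kernel involvement or intermediate copies on the data path. This lowers latency and CPU overhead, which helps communication-heavy HPC applications perform better.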

Why Placement Matters

Good performance depends on:

rank locality
NUMA locality
GPU locality
rack/fabric locality in some clusters

Poor placement can waste expensive nodes even when utilization appears high.
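
The effect of rank locality can be illustrated with a toy model. The sketch below assumes a 1D chain in which each rank communicates with its neighbor rank, and compares block placement (consecutive ranks on the same node) with cyclic placement (ranks dealt round-robin across nodes). The function names are hypothetical and do not correspond to any scheduler's API.

```python
def block_placement(num_ranks, ranks_per_node):
    """Node index per rank when consecutive ranks fill one node at a time."""
    return [rank // ranks_per_node for rank in range(num_ranks)]


def cyclic_placement(num_ranks, num_nodes):
    """Node index per rank when ranks are dealt round-robin across nodes."""
    return [rank % num_nodes for rank in range(num_ranks)]


def off_node_neighbor_pairs(placement):
    """Count neighbor pairs (r, r+1) whose ranks sit on different nodes.

    Assumption: communication is a 1D chain between adjacent ranks.
    """
    return sum(1 for a, b in zip(placement, placement[1:]) if a != b)


block = block_placement(8, 4)    # [0, 0, 0, 0, 1, 1, 1, 1]
cyclic = cyclic_placement(8, 2)  # [0, 1, 0, 1, 0, 1, 0, 1]
print(off_node_neighbor_pairs(block), off_node_neighbor_pairs(cyclic))
```

In this toy case block placement sends 1 of 7 neighbor exchanges over the network, while cyclic placement sends all 7, even though both use the same two nodes at full utilization.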

Operations

An HPC cluster is not only compute hardware. It is also an operational platform.

Key operational concerns

node health checks
scheduler reliability
accounting and quotas
user identity and access
environment reproducibility
software version management
change management
observability

What to Monitor

queue depth
queue wait time
CPU/GPU utilization
node failure rate
storage throughput
metadata operation rate
network congestion/errors
scheduler latency
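
Queue wait time is usually more informative as a distribution than as an average. The sketch below assumes job records are available as (submit time, start time) pairs in seconds, for example exported from scheduler accounting; the record format and function names are assumptions for illustration.

```python
import math


def wait_times(jobs):
    """Wait time in seconds for each (submit_time, start_time) pair."""
    return [start - submit for submit, start in jobs]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with 0 < pct <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical job records: (submit_time, start_time) in seconds
jobs = [(0, 60), (10, 40), (20, 320), (30, 90)]
waits = wait_times(jobs)
print("median wait:", percentile(waits, 50))
print("p95 wait:", percentile(waits, 95))
```

Tracking a high percentile alongside the median makes it easier to notice when a few job classes wait far longer than the rest.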

Reproducibility Controls

modules
Spack or other pinned environments
Apptainer containers
archived job scripts
recorded runtime metadata
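
Recording runtime metadata can be as simple as writing a small JSON file next to the job output. The sketch below is one possible approach, assuming a Slurm cluster (which sets `SLURM_JOB_ID` and `SLURM_JOB_NODELIST` inside a job) and an environment-modules setup that exports `LOADEDMODULES`. The function name and the chosen fields are illustrative; which fields matter depends on the site.

```python
import json
import os
import platform
import sys
from datetime import datetime, timezone


def collect_runtime_metadata(env=None):
    """Collect basic facts about the environment a job ran in.

    Assumption: Slurm and environment modules are in use; fields are
    None or empty when the corresponding variables are not set.
    """
    env = os.environ if env is None else env
    modules = env.get("LOADEDMODULES", "")
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "hostname": platform.node(),
        "platform": platform.platform(),
        "python_version": sys.version.split()[0],
        "slurm_job_id": env.get("SLURM_JOB_ID"),
        "slurm_nodelist": env.get("SLURM_JOB_NODELIST"),
        "loaded_modules": modules.split(":") if modules else [],
    }


def write_metadata(path, env=None):
    """Write the collected metadata as JSON to path."""
    with open(path, "w") as handle:
        json.dump(collect_runtime_metadata(env), handle, indent=2)
```

Calling `write_metadata("run_metadata.json")` at the start of a job stores the record alongside the archived job script.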

Benchmarking Basics

Always record:

hardware type
software versions
problem size
process/thread count
placement settings
storage tier used

Without these details, benchmark comparisons are often invalid, because a difference in results cannot be attributed to a single change.
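
The checklist above can be enforced by making every field mandatory in the benchmark record. The sketch below is one way to do that with a dataclass; the class name, field names, and example values are all hypothetical, and `differing_fields` simply reports which settings differ between two runs so that a comparison can be judged.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BenchmarkRecord:
    # Every field is required, so an incomplete record cannot be created
    hardware_type: str
    software_versions: str
    problem_size: str
    processes: int
    threads_per_process: int
    placement: str
    storage_tier: str
    runtime_seconds: float


def differing_fields(a, b):
    """Names of the settings (everything except the result) that differ."""
    return [
        f.name
        for f in fields(a)
        if f.name != "runtime_seconds" and getattr(a, f.name) != getattr(b, f.name)
    ]


# Hypothetical example values
run_a = BenchmarkRecord("cpu-node-a", "app 1.0, mpi 4.1", "1024^3",
                        64, 2, "block", "scratch", 120.0)
run_b = BenchmarkRecord("cpu-node-a", "app 1.0, mpi 4.1", "1024^3",
                        64, 2, "block", "home", 150.0)
print(differing_fields(run_a, run_b))
```

If more than one setting differs between two runs, the runtime difference cannot be attributed to either one alone.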

Interview Summary

Strong answers on storage and operations usually show:

storage tier separation
awareness of metadata bottlenecks
understanding that networking is workload-dependent
operational maturity beyond raw node count

Related Files

HPC-02-Slurm-MPI
HPC-04-Cloud-ParallelCluster
