Cloud HPC and AWS ParallelCluster

This file covers cloud HPC design, with a focus on AWS.

Why Cloud HPC?

Cloud HPC is useful when you need:

burst capacity
rapid provisioning
flexible instance choices
project-based clusters

Cloud HPC is less attractive when you need:

sustained, very high utilization (always-busy hardware is usually cheaper to own)
the lowest possible interconnect latency at all times
strict on-premises data locality

Cloud HPC Design Questions

Before choosing tooling, ask:

Is the workload tightly coupled (for example, MPI)?
Is it independent batch work?
Does it checkpoint?
Is spot (interruptible) capacity acceptable?
How large is the dataset?
How expensive is data movement?
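
The last two questions can be turned into a rough number before any tooling is chosen. The sketch below records the answers in a small dataclass and estimates how long one bulk transfer would take. The names `WorkloadProfile` and `transfer_hours`, the 10 Gbit/s link, and the 0.7 efficiency factor are illustrative assumptions, not measured values.

```python
from dataclasses import dataclass


@dataclass
class WorkloadProfile:
    """Answers to the design questions (field names are illustrative)."""
    tightly_coupled: bool
    checkpoints: bool
    spot_acceptable: bool
    dataset_gb: float  # decimal gigabytes


def transfer_hours(dataset_gb: float, link_gbit_per_s: float,
                   efficiency: float = 0.7) -> float:
    """Rough time to move a dataset once over a network link.

    efficiency is an assumed derating factor for protocol overhead and
    contention; it is a guess, not a measured figure.
    """
    if dataset_gb < 0:
        raise ValueError("dataset_gb must be non-negative")
    if link_gbit_per_s <= 0:
        raise ValueError("link_gbit_per_s must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    gigabits = dataset_gb * 8  # bytes -> bits
    seconds = gigabits / (link_gbit_per_s * efficiency)
    return seconds / 3600


profile = WorkloadProfile(tightly_coupled=False, checkpoints=True,
                          spot_acceptable=True, dataset_gb=10_000)
hours = transfer_hours(profile.dataset_gb, link_gbit_per_s=10)
print(f"Estimated transfer time: {hours:.1f} h")
```

If the estimate is comparable to the job runtime, data placement deserves as much design attention as compute.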

AWS ParallelCluster

AWS ParallelCluster is an open-source cluster deployment and management tool for HPC on AWS.

It commonly manages:

head node
Slurm scheduler
compute fleets
storage integration
EFA-enabled networking

Typical ParallelCluster Architecture

Users
  |
  v
[Login / Head Node]
  |- Slurm
  |- Cluster config
  |
  v
[Compute Queues]
  |- CPU
  |- GPU
  |- Spot
  |- On-demand
  |
  v
[Storage]
  |- FSx for Lustre
  |- EBS
  |- EFS
  |- S3
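
The architecture above maps fairly directly onto a ParallelCluster configuration file. The sketch below builds such a configuration as a Python dict and prints it as JSON, so it needs no YAML library. The key names follow the ParallelCluster 3 schema as best understood here and should be checked against the documentation for the installed version. The region, instance types, counts, subnet ID, and key name are placeholders, and `build_cluster_config` is a hypothetical helper.

```python
import json


def build_cluster_config(subnet_id: str, key_name: str) -> dict:
    """Return a minimal cluster description: head node, two queues, scratch."""
    queue_network = {"SubnetIds": [subnet_id]}
    return {
        "Region": "us-east-1",          # placeholder region
        "Image": {"Os": "alinux2"},
        "HeadNode": {
            "InstanceType": "c5.xlarge",
            "Networking": {"SubnetId": subnet_id},
            "Ssh": {"KeyName": key_name},
        },
        "Scheduling": {
            "Scheduler": "slurm",
            "SlurmQueues": [
                {   # tightly coupled MPI: on-demand, EFA, placement group
                    "Name": "mpi",
                    "CapacityType": "ONDEMAND",
                    "Networking": {**queue_network,
                                   "PlacementGroup": {"Enabled": True}},
                    "ComputeResources": [{
                        "Name": "c5n",
                        "InstanceType": "c5n.18xlarge",
                        "MinCount": 0,
                        "MaxCount": 16,
                        "Efa": {"Enabled": True},
                    }],
                },
                {   # independent batch: spot capacity, no EFA needed
                    "Name": "batch-spot",
                    "CapacityType": "SPOT",
                    "Networking": dict(queue_network),
                    "ComputeResources": [{
                        "Name": "c5",
                        "InstanceType": "c5.4xlarge",
                        "MinCount": 0,
                        "MaxCount": 64,
                    }],
                },
            ],
        },
        "SharedStorage": [{
            "Name": "scratch",
            "StorageType": "FsxLustre",
            "MountDir": "/fsx",
            "FsxLustreSettings": {"StorageCapacity": 1200,
                                  "DeploymentType": "SCRATCH_2"},
        }],
    }


config = build_cluster_config("subnet-0example", "my-key")  # placeholder IDs
print(json.dumps(config, indent=2))
```

Setting `MinCount` to 0 lets each queue scale down to no running compute nodes when it is idle.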

Storage Mapping on AWS

Need	AWS choice
shared scratch	FSx for Lustre
durable datasets	S3
lightweight shared home directories	EFS
node-local temp space	local NVMe / instance store

Compute and Network Mapping

Tightly coupled MPI

Prefer:

homogeneous instance types
EFA-enabled instances
placement groups
on-demand capacity first, unless the job checkpoints and restarts cleanly
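
These preferences can be checked mechanically before a queue is deployed. The sketch below assumes a queue is described by a dict shaped like one ParallelCluster `SlurmQueues` entry (`ComputeResources`, `Networking`, `CapacityType`). The function name `check_mpi_queue` and the wording of the findings are hypothetical.

```python
from typing import List


def check_mpi_queue(queue: dict) -> List[str]:
    """Return a list of findings; an empty list means the queue looks MPI-ready."""
    problems = []
    resources = queue.get("ComputeResources", [])
    if not resources:
        return ["queue has no compute resources"]

    # Homogeneous instance types: exactly one distinct type in the queue.
    instance_types = {r.get("InstanceType") for r in resources}
    if len(instance_types) != 1:
        problems.append("queue mixes instance types")

    # EFA must be enabled on every compute resource.
    if not all(r.get("Efa", {}).get("Enabled", False) for r in resources):
        problems.append("EFA is not enabled on every compute resource")

    # Placement group keeps nodes close together on the network.
    networking = queue.get("Networking", {})
    if not networking.get("PlacementGroup", {}).get("Enabled", False):
        problems.append("placement group is not enabled")

    # Spot is flagged rather than forbidden: it needs a restart story.
    if queue.get("CapacityType", "ONDEMAND") == "SPOT":
        problems.append("spot capacity: confirm checkpoint and restart plan")
    return problems


good_queue = {
    "CapacityType": "ONDEMAND",
    "Networking": {"PlacementGroup": {"Enabled": True}},
    "ComputeResources": [
        {"InstanceType": "c5n.18xlarge", "Efa": {"Enabled": True}},
    ],
}
print(check_mpi_queue(good_queue))
```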

Independent batch

Prefer:

flexible instance pools
spot-heavy strategies
a simpler storage path (a parallel file system is often unnecessary)

GPU training

Prefer:

homogeneous GPU generations per queue
locality-aware launches (nodes of one job placed close together)
strong checkpointing plan
fast dataset access
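
A checkpointing plan comes down to two properties: a checkpoint is never left half-written, and a restarted job resumes from the last saved step. The sketch below shows both using only the standard library. The JSON state, the function names, and the counter that stands in for a training step are illustrative assumptions. A real job would save model and optimizer state through its framework's own mechanism.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file in the same directory, then rename atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes reach storage
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str, default: dict) -> dict:
    """Return the saved state, or the default when no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def train(path: str, total_steps: int, checkpoint_every: int) -> dict:
    """Resume from the last checkpoint and run until total_steps."""
    state = load_checkpoint(path, {"step": 0})
    while state["step"] < total_steps:
        state["step"] += 1  # stand-in for one real training step
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.json")
    save_checkpoint(ckpt, {"step": 40})  # pretend an earlier run was interrupted
    final = train(ckpt, total_steps=100, checkpoint_every=10)
    print(final["step"])
```

The checkpoint interval bounds the work lost to an interruption: at most `checkpoint_every` steps are repeated after a restart.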

Spot vs On-Demand

Use spot when:

jobs are restartable
checkpointing exists
tasks are independent or interruption-tolerant

Avoid spot when:

jobs are tightly coupled
restart is expensive
deadlines are strict
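
The two lists can be read as a small decision rule. The sketch below encodes one conservative interpretation, in which any "avoid" condition overrides the "use" conditions. The function `choose_capacity` and its parameters are hypothetical, and the returned strings mirror the ParallelCluster `CapacityType` values.

```python
def choose_capacity(restartable: bool, checkpoints: bool,
                    tightly_coupled: bool, restart_expensive: bool,
                    strict_deadline: bool) -> str:
    """Return "SPOT" or "ONDEMAND" for a job with the given properties."""
    # Any "avoid spot" condition wins (an assumption of this sketch).
    if tightly_coupled or restart_expensive or strict_deadline:
        return "ONDEMAND"
    # Otherwise spot is acceptable if an interruption loses little work.
    if restartable or checkpoints:
        return "SPOT"
    # Unknown restart behavior: stay on the safe side.
    return "ONDEMAND"


print(choose_capacity(restartable=True, checkpoints=False,
                      tightly_coupled=False, restart_expensive=False,
                      strict_deadline=False))
```

A tightly coupled job with strong checkpointing is a judgment call that this rule deliberately resolves toward on-demand.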

Cost Governance

Cloud HPC projects more often fail because of cost drift than because of technical limits.

Add:

project tagging
budget alarms
queue-specific policies
idle resource cleanup
explicit spot/on-demand rules
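
Two of these controls, project tagging and idle resource cleanup, are easy to audit from an inventory. The sketch below works on plain in-memory records rather than a real AWS API. The `Resource` class, the `project` tag key, and the 30-minute idle threshold are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Resource:
    """One billable resource as seen by a hypothetical inventory export."""
    name: str
    tags: Dict[str, str] = field(default_factory=dict)
    idle_minutes: float = 0.0


def audit(resources: List[Resource], required_tag: str = "project",
          max_idle_minutes: float = 30.0) -> List[Tuple[str, str]]:
    """Return (resource name, finding) pairs for untagged or idle resources."""
    findings = []
    for res in resources:
        if required_tag not in res.tags:
            findings.append((res.name, f"missing tag: {required_tag}"))
        if res.idle_minutes > max_idle_minutes:
            findings.append((res.name, "idle beyond threshold"))
    return findings


inventory = [
    Resource("compute-1", {"project": "cfd"}, idle_minutes=5),
    Resource("compute-2", {}, idle_minutes=120),
]
for name, finding in audit(inventory):
    print(name, finding)
```

In a real deployment the inventory would come from the provider's APIs or billing exports, and the findings would feed alerts or automated cleanup.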

Interview Summary

Strong cloud HPC answers show:

correct workload classification
service choice justified by workload, not by brand familiarity
storage and networking mapped explicitly
cost and interruption strategy included

Related Files

HPC-03-Storage-Networking-Operations
HPC-05-Interviews