Cloud HPC and AWS ParallelCluster

This file focuses on cloud HPC design, especially on AWS.

Why Cloud HPC?

Cloud HPC is useful when you need:

  • burst capacity
  • rapid provisioning
  • flexible instance choices
  • project-based clusters

Cloud HPC is less attractive when you need:

  • sustained, very high utilization
  • the lowest possible latency at all times
  • strict on-prem data locality

Cloud HPC Design Questions

Before choosing tooling, ask:

  • Is the workload tightly coupled?
  • Is it independent batch?
  • Does it checkpoint?
  • Is spot acceptable?
  • How large is the dataset?
  • How expensive is data movement?
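
The questions above can be turned into a coarse first-pass classifier. This is a minimal sketch: the `Workload` fields, the `recommend` helper, and the 1000 GB threshold are illustrative assumptions, not part of any AWS tool.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Answers to the design questions (field names are illustrative)."""
    tightly_coupled: bool   # ranks communicate during the run (e.g. MPI)
    checkpoints: bool       # job can resume from saved state
    spot_acceptable: bool   # the project tolerates interruptions
    dataset_gb: float       # size of the input dataset


def recommend(w: Workload, large_dataset_gb: float = 1000.0) -> dict:
    """Map the answers to a starting point, not a final design.

    The 1000 GB default is an arbitrary placeholder threshold.
    """
    use_spot = w.spot_acceptable and w.checkpoints and not w.tightly_coupled
    return {
        "network": "EFA + placement group" if w.tightly_coupled else "standard",
        "capacity": "spot" if use_spot else "on-demand",
        "stage_data_near_compute": w.dataset_gb >= large_dataset_gb,
    }


mpi_job = Workload(tightly_coupled=True, checkpoints=True,
                   spot_acceptable=True, dataset_gb=200)
batch_job = Workload(tightly_coupled=False, checkpoints=True,
                     spot_acceptable=True, dataset_gb=5000)

print(recommend(mpi_job))
print(recommend(batch_job))
```

The tightly coupled job lands on on-demand capacity even though spot was marked acceptable, which matches the spot guidance later in these notes.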

AWS ParallelCluster

AWS ParallelCluster is a cluster deployment and management tool for HPC on AWS.

It commonly manages:

  • head node
  • Slurm scheduler
  • compute fleets
  • storage integration
  • EFA-enabled networking

Typical ParallelCluster Architecture

Users
  |
  v
[Login / Head Node]
  |- Slurm
  |- Cluster config
  |
  v
[Compute Queues]
  |- CPU
  |- GPU
  |- Spot
  |- On-demand
  |
  v
[Storage]
  |- FSx for Lustre
  |- EBS
  |- EFS
  |- S3
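
The architecture above might be expressed as a cluster configuration along these lines. This is a sketch only: the key names follow the ParallelCluster 3 configuration schema as best recalled and should be checked against the official reference, and every subnet ID, key name, queue name, instance type, and size is a hypothetical placeholder. The dict is printed as JSON, which is also valid YAML.

```python
import json

# All IDs, names, instance types, and sizes below are placeholders.
# Key names are assumed from the ParallelCluster 3 schema; verify before use.
cluster_config = {
    "Region": "us-east-1",
    "Image": {"Os": "alinux2"},
    "HeadNode": {
        "InstanceType": "c5.xlarge",
        "Networking": {"SubnetId": "subnet-PLACEHOLDER"},
        "Ssh": {"KeyName": "key-PLACEHOLDER"},
    },
    "Scheduling": {
        "Scheduler": "slurm",
        "SlurmQueues": [
            {   # tightly coupled MPI: on-demand, EFA, placement group
                "Name": "mpi-ondemand",
                "CapacityType": "ONDEMAND",
                "Networking": {
                    "SubnetIds": ["subnet-PLACEHOLDER"],
                    "PlacementGroup": {"Enabled": True},
                },
                "ComputeResources": [{
                    "Name": "hpc",
                    "InstanceType": "c5n.18xlarge",
                    "MinCount": 0,
                    "MaxCount": 16,
                    "Efa": {"Enabled": True},
                }],
            },
            {   # independent batch: spot, no EFA
                "Name": "batch-spot",
                "CapacityType": "SPOT",
                "Networking": {"SubnetIds": ["subnet-PLACEHOLDER"]},
                "ComputeResources": [{
                    "Name": "cpu",
                    "InstanceType": "c5.4xlarge",
                    "MinCount": 0,
                    "MaxCount": 64,
                }],
            },
        ],
    },
    "SharedStorage": [{
        "Name": "scratch",
        "StorageType": "FsxLustre",
        "MountDir": "/fsx",
        "FsxLustreSettings": {
            "StorageCapacity": 1200,
            "DeploymentType": "SCRATCH_2",
        },
    }],
}

print(json.dumps(cluster_config, indent=2))
```

Separating the queues by capacity type keeps the spot/on-demand decision visible at submission time instead of hiding it inside one mixed queue.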

Storage Mapping on AWS

Need                     AWS choice
shared scratch           FSx for Lustre
durable datasets         S3
lighter shared home      EFS
node-local temp space    local NVMe / instance store
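
The same mapping can be kept as a small lookup so that a design review flags any storage need that has no explicit choice. The `STORAGE_CHOICES` name and the `storage_plan` helper are illustrative, not part of any AWS SDK.

```python
# Need -> AWS choice, copied from the mapping in these notes.
STORAGE_CHOICES = {
    "shared scratch": "FSx for Lustre",
    "durable datasets": "S3",
    "lighter shared home": "EFS",
    "node-local temp space": "local NVMe / instance store",
}


def storage_plan(needs):
    """Return (need, choice) pairs; unknown needs are flagged, not guessed."""
    return [(n, STORAGE_CHOICES.get(n, "UNMAPPED: decide explicitly"))
            for n in needs]


for need, choice in storage_plan(["shared scratch", "durable datasets", "archive"]):
    print(f"{need:24s} {choice}")
```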

Compute and Network Mapping

Tightly coupled MPI

Prefer:

  • homogeneous instance types
  • EFA-enabled instances
  • placement groups
  • on-demand capacity first, unless the job checkpoints and restarts reliably

Independent batch

Prefer:

  • flexible instance pools
  • spot-heavy strategies
  • simpler storage path

GPU training

Prefer:

  • homogeneous GPU generations per queue
  • locality-aware launches
  • strong checkpointing plan
  • fast dataset access
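
At submission time, these preferences mostly show up as Slurm options. The sketch below builds an `sbatch` command line using standard Slurm flags (`--partition`, `--nodes`, `--ntasks-per-node`, `--exclusive`, `--gpus-per-node`). The `sbatch_command` helper, the queue names, and the script names are hypothetical.

```python
import shlex


def sbatch_command(script, partition, nodes, tasks_per_node,
                   exclusive=False, gpus_per_node=0):
    """Build an sbatch command string; it is not executed here."""
    args = [
        "sbatch",
        f"--partition={partition}",
        f"--nodes={nodes}",
        f"--ntasks-per-node={tasks_per_node}",
    ]
    if exclusive:
        # Whole-node allocation keeps tightly coupled ranks off shared nodes.
        args.append("--exclusive")
    if gpus_per_node:
        args.append(f"--gpus-per-node={gpus_per_node}")
    args.append(script)
    return shlex.join(args)


# Tightly coupled MPI on a homogeneous on-demand queue (names are placeholders).
print(sbatch_command("run_mpi.sh", "mpi-ondemand", 4, 36, exclusive=True))

# GPU training on a single-generation GPU queue.
print(sbatch_command("train.sh", "gpu", 2, 8, gpus_per_node=8))
```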

Spot vs On-Demand

Use spot when:

  • jobs are restartable
  • checkpointing exists
  • tasks are independent or interruption-tolerant

Avoid spot when:

  • jobs are tightly coupled
  • restart is expensive
  • deadlines are strict
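
A rough cost comparison makes the trade-off concrete. The model below is a deliberate simplification: each interruption is assumed to lose half a checkpoint interval of work on average plus a fixed restart overhead. All prices and rates are made-up inputs, not AWS figures.

```python
def expected_spot_hours(work_hours, interruptions_per_hour,
                        checkpoint_interval_h, restart_overhead_h):
    """Expected wall-clock hours on spot under a simple rework model."""
    lost_per_interruption = checkpoint_interval_h / 2 + restart_overhead_h
    return work_hours * (1 + interruptions_per_hour * lost_per_interruption)


def compare_costs(work_hours, on_demand_price, spot_price, **spot_params):
    """Return (on_demand_cost, expected_spot_cost) for one node."""
    spot_hours = expected_spot_hours(work_hours, **spot_params)
    return work_hours * on_demand_price, spot_hours * spot_price


# Hypothetical inputs: 100 h of work, hourly checkpoints, 15 min restarts,
# one interruption every 20 h on average.
on_demand, spot = compare_costs(
    work_hours=100,
    on_demand_price=3.00,
    spot_price=1.00,
    interruptions_per_hour=0.05,
    checkpoint_interval_h=1.0,
    restart_overhead_h=0.25,
)
print(f"on-demand: {on_demand:.2f}  spot (expected): {spot:.2f}")
```

The model ignores deadline risk and the case where one interruption stalls every rank of a tightly coupled job, which is why the "avoid spot" list still applies even when the expected cost looks favorable.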

Cost Governance

Cloud HPC projects fail more often from cost drift than from technical impossibility.

Add:

  • project tagging
  • budget alarms
  • queue-specific policies
  • idle resource cleanup
  • explicit spot/on-demand rules
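
Several of these controls reduce to small checks that can run on a schedule. The sketch below is pure Python with hypothetical inputs. In practice the tags, spend figures, and node activity times would come from billing and scheduler data, and the thresholds are placeholders.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_TAGS = ("project", "owner")  # illustrative tagging policy


def missing_tags(tags, required=REQUIRED_TAGS):
    """Return required tag keys that are absent or empty."""
    return sorted(k for k in required if not tags.get(k))


def budget_status(spend, budget, warn_fraction=0.8):
    """Classify spend against a budget: 'ok', 'warn', or 'over'."""
    if spend >= budget:
        return "over"
    if spend >= warn_fraction * budget:
        return "warn"
    return "ok"


def idle_nodes(last_busy, now, max_idle=timedelta(minutes=10)):
    """Return names of nodes whose last job ended more than max_idle ago."""
    return sorted(name for name, t in last_busy.items() if now - t > max_idle)


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
last_busy = {
    "cpu-1": now - timedelta(minutes=30),
    "cpu-2": now - timedelta(minutes=2),
}

print(missing_tags({"project": "cfd"}))
print(budget_status(spend=850, budget=1000))
print(idle_nodes(last_busy, now))
```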

Interview Summary

Strong cloud HPC answers show:

  • correct workload classification
  • service choice justified by workload, not by brand familiarity
  • storage and networking mapped explicitly
  • cost and interruption strategy included

System Design Ultimatum · Last updated 4/28/2026