Scalability and Reliability Patterns

Back-of-Envelope Calculations

Quick estimation techniques for system design interviews. These are rough estimates used to validate architectural decisions.

Time and Storage Units

| Unit | Bytes | Time |
|------|-------|------|
| KB | 10³ | 1 millisecond |
| MB | 10⁶ | 1 second |
| GB | 10⁹ | 1 minute |
| TB | 10¹² | 1 hour |
| PB | 10¹⁵ | 1 day |

Latency Numbers Every Programmer Should Know

L1 cache reference:                     0.5 ns
Branch mispredict:                      5 ns
L2 cache reference:                     7 ns
Mutex lock/unlock:                     25 ns
Main memory reference:                100 ns
Compress 1KB with Snappy:           3,000 ns (3 µs)
Send 1KB over 1 Gbps network:      10,000 ns (10 µs)
Read 1MB sequentially from memory: 250,000 ns (250 µs)
Round trip within same datacenter:  500,000 ns (500 µs)
Read 1MB sequentially from SSD:   1,000,000 ns (1 ms)
Disk seek:                        10,000,000 ns (10 ms)
Read 1MB sequentially from disk:  20,000,000 ns (20 ms)
Send packet CA → Netherlands:    150,000,000 ns (150 ms)

Rule of thumb: for 1 MB sequential reads, SSD is ~4x slower than memory and spinning disk is ~80x slower; a same-datacenter round trip (~500 µs) costs as much as reading 2 MB from memory

Common QPS (Queries Per Second) Calculations

Given: X million users, Y% daily active, Z requests per user per day

Formula: (X million × Y% × Z requests) / (86,400 seconds/day)

Example: 1B users, 50% DAU, 10 requests/day
= (1,000,000,000 × 0.5 × 10) / 86,400
= 5,000,000,000 / 86,400
= ~57,870 QPS

Peak traffic rule: Multiply by 3-5x (peak is 3-5x average)
= 57,870 × 5 = ~290,000 QPS peak
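
The arithmetic above can be wrapped in a small helper (a sketch; the function name and the 5x default peak factor are illustrative):

```python
SECONDS_PER_DAY = 86_400

def estimate_qps(users, dau_fraction, requests_per_user_per_day, peak_factor=5):
    """Average and peak QPS from back-of-envelope inputs."""
    avg = users * dau_fraction * requests_per_user_per_day / SECONDS_PER_DAY
    return avg, avg * peak_factor

avg, peak = estimate_qps(1_000_000_000, 0.5, 10)
# avg ≈ 57,870 QPS, peak ≈ 289,352 QPS (matches the worked example)
```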

Data Volume Calculations

Given: X million users, Y data per user, Z days retention

Formula: X million × Y × Z = total storage

Example: 1B users, 1KB per user, 30 days retention
= 1,000,000,000 × 1KB × 30
= 30TB total

But need redundancy:
- 3x replication: 30TB × 3 = 90TB
- 2x for backup: 30TB × 2 = 60TB
- Total: ~150TB
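
The same storage math as a helper (a sketch; the replication and backup factors default to the values in the example, and units are base-10):

```python
def estimate_storage_tb(users, kb_per_user, retention_days, replication=3, backup=2):
    """Raw terabytes plus extra copies for replication and backups."""
    raw_tb = users * kb_per_user * retention_days / 1e9   # KB → TB (base-10)
    return raw_tb, raw_tb * (replication + backup)

raw, total = estimate_storage_tb(1_000_000_000, 1, 30)
# raw = 30 TB, total = 150 TB (90 TB replicated + 60 TB backup)
```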

Database Sizing

Single MySQL server capacity:

Memory: 64GB
Max connections: 10,000
Throughput: 
  - 10,000 QPS read (cached)
  - 1,000 QPS write
  - Mixed: 5,000 QPS

Disk I/O typically becomes the bottleneck before CPU or memory limits are reached

When to shard:

Data size: > 1TB → Consider sharding
QPS: > 10,000 → Consider sharding
Connections: > 5,000 → Consider sharding

Sharding key: User ID, geographic region, or hash

Server Capacity Planning

Single web server (8 core, 32GB RAM):

Memory per connection: ~1MB
Max connections: 32,000

QPS capacity:
- Simple operations: ~10,000 QPS
- Complex operations: ~1,000 QPS
- CPU-bound: ~5,000 QPS

Number of servers needed:

Formula: Peak QPS / QPS per server

Example: 300,000 peak QPS, 10,000 QPS per server
= 300,000 / 10,000 = 30 servers

With redundancy (rolling updates, failures):
= 30 × 1.3 = 39 servers (30% buffer)
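
As a quick function (illustrative; the 30% redundancy buffer matches the example above):

```python
import math

def servers_needed(peak_qps, qps_per_server, redundancy=0.30):
    """Round up the base server count, then add a buffer for failures and rolling updates."""
    base = math.ceil(peak_qps / qps_per_server)
    return math.ceil(base * (1 + redundancy))

servers_needed(300_000, 10_000)   # 30 base servers → 39 with the 30% buffer
```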

Load Balancer Sizing

Modern LB (AWS ALB):
- Max 25,000 new connections/second
- Max 1,000,000 concurrent connections per LB

For 10M concurrent connections:
= 10,000,000 / 1,000,000 = 10 LBs needed

For multi-region HA:
= 10 LBs × 2 regions = 20 LBs total

Cache Sizing

Rule: Cache hit rate ~90% means 10x reduction in database load

Example without cache:
- 100,000 QPS to database
- Database capacity: 10,000 QPS
- Problem: Overloaded

With cache (90% hit rate):
- Cache serves: 90,000 QPS (in-memory, fast)
- DB serves: 10,000 QPS (miss traffic)
- Balanced!

Cache memory needed:

Formula: (QPS × avg_object_size × TTL) / hit_rate

Example: 100,000 QPS, 1KB avg object, 1 hour TTL, 90% hit rate
= (100,000 × 1KB × 3,600) / 0.9
= 360,000,000 KB / 0.9
= ~400GB cache

Practical: Use distributed cache (Redis) across 10 servers
= 40GB per Redis instance (reasonable)
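
The cache-sizing formula above, spelled out (a sketch; the 1e6 divisor is a base-10 KB→GB conversion):

```python
def cache_size_gb(qps, avg_object_kb, ttl_seconds, hit_rate=0.9):
    """Working set written during one TTL window, inflated by the hit rate
    (formula from the section above)."""
    total_kb = qps * avg_object_kb * ttl_seconds / hit_rate
    return total_kb / 1e6   # KB → GB (base-10)

cache_size_gb(100_000, 1, 3_600)   # ≈ 400 GB → ~40 GB per node across 10 Redis servers
```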

Bandwidth Calculation

Given: Peak QPS, average response size

Formula: Peak QPS × Avg response size

Example: 300,000 peak QPS, 10KB response
= 300,000 × 10KB = 3,000,000 KB/s = 3GB/s

Bandwidth needed:
- 3 GB/s = 24 Gbps, so a single 10 Gbps link (1.25 GB/s) cannot handle it
- Need at least 3 × 10 Gbps links, then double that for redundancy

Rule of thumb: Provision 2-3x peak bandwidth for headroom
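
The bits-vs-bytes conversion here is easy to get wrong, so here it is spelled out (a sketch; the 2.5x headroom default is an assumption taken from the 2-3x rule):

```python
def egress_gbps(peak_qps, response_kb, headroom=2.5):
    """Peak egress in Gbps, plus a provisioned figure with headroom."""
    gb_per_s = peak_qps * response_kb / 1e6   # KB/s → GB/s (base-10)
    raw_gbps = gb_per_s * 8                   # bytes → bits
    return raw_gbps, raw_gbps * headroom

egress_gbps(300_000, 10)   # (24.0, 60.0): 3 GB/s is 24 Gbps; provision ~60 Gbps
```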

Video Streaming Bandwidth

Netflix scenario: 100M users, 50% watching simultaneously

Bitrate per stream: 5 Mbps (HD)
Total bandwidth needed: 50M × 5 Mbps = 250 Tbps

Wow, that's massive! How does Netflix handle it?

Solution: CDN + regional caching
- Cache video in edge locations (ISP networks)
- Only ~10-20% requires backbone traffic
- Backbone: 250 Tbps × 15% = 37.5 Tbps (still huge, but manageable)

Example: Design Instagram Feed for 100M DAU

1. Calculate QPS:
   - 100M DAU × 20 requests/day / 86,400 = ~23,000 QPS avg
   - Peak: 23,000 × 5 = ~115,000 QPS

2. Database sizing:
   - Each feed = 100 posts × 1KB = 100KB
   - 100M users × 100KB = 10TB storage
   - With 3x replication: 30TB
   - Needs sharding (> 1TB)

3. Server capacity:
   - 115,000 peak QPS / 10,000 per server = 11.5 → 12 servers
   - With 30% redundancy: ~16 servers

4. Cache sizing:
   - 90% hit rate assumption
   - Feed objects: 100M users × 100KB × 0.1 (10% miss) = 1TB
   - Split across 25 Redis instances = 40GB each

5. Bandwidth:
   - 115,000 QPS × 100KB = 11.5GB/s
   - Need 100 Gbps backbone

Back-of-Envelope Checklist

When estimating, ask:

✓ How many users? (total, DAU, peak concurrent)
✓ Request rate? (QPS average, QPS peak)
✓ Data volume? (per user, total, growth rate)
✓ Latency requirements? (response time SLA)
✓ Availability requirements? (uptime SLA)
✓ Bandwidth constraints? (network capacity)
✓ Storage retention? (how long to keep data)
✓ Consistency requirements? (strong vs eventual)

Then size:
1. Servers needed
2. Database needed
3. Cache needed
4. Bandwidth needed
5. Storage needed

Vertical vs Horizontal Scaling

| Aspect | Vertical (Scale Up) | Horizontal (Scale Out) | When to Use |
|---|---|---|---|
| Approach | Bigger server | More servers | Vertical: quick fix; Horizontal: long-term |
| Limit | Hardware limit | Virtually unlimited | Horizontal when vertical hits limits |
| Cost | Non-linear | Linear | Horizontal more cost-effective at scale |
| Complexity | Simple | Complex (distributed) | Vertical for simplicity |
| Downtime | Required | Zero (rolling) | Horizontal for high availability |

Why Choose Vertical:

  • Quick wins, legacy apps, starting out

Why Choose Horizontal:

  • High availability, predictable costs, massive scale

Tradeoff Summary:

  • Vertical: Simple ↔ Limited, expensive
  • Horizontal: Unlimited scale ↔ Complexity

CAP Theorem Tradeoffs

| Choice | Guarantees | Sacrifice | Use Cases |
|---|---|---|---|
| CP | Consistency + Partition Tolerance | Availability | Banking, inventory, auctions |
| AP | Availability + Partition Tolerance | Consistency | Social feeds, analytics, caching |

Why Choose CP:

  • Financial transactions, inventory management
  • Data correctness > uptime

Why Choose AP:

  • Social media, real-time analytics
  • Uptime > immediate consistency

Tradeoff Summary:

  • CP: Data correctness ↔ May be unavailable
  • AP: Always available ↔ Stale data possible

Load Balancing Strategies

| Strategy | How It Works | Pros | Cons | When to Use |
|---|---|---|---|---|
| Round Robin | Distribute requests sequentially to each server | Simple, easy to implement | No awareness of server load | Uniform workloads, simple systems |
| Least Connections | Route to server with fewest active connections | Handles variable request durations | Overhead of tracking connections | Long-lived connections (WebSockets) |
| IP Hash | Route based on client IP (consistent) | Sticky sessions without affinity | Uneven distribution if IPs cluster | Session affinity (WebSockets, stateful apps) |
| Weighted Round Robin | Some servers get more traffic | Handles heterogeneous servers | Need to configure weights | Mix of powerful and weak servers |
| Random | Pick random server | Simple, low overhead | No optimization | Very simple loads |

Compare-and-Swap (CAS) for Atomic Operations

What it is: CPU/VM primitive that atomically updates a memory location only if its current value matches an expected value. Avoids locks for simple, contended operations.

Why use it: Fast and non-blocking for small critical sections like counters, versioned writes, or state flips. Reduces contention compared to coarse locks.

How CAS Works

Hardware level (all at once, no interruption):

  1. Read current value from memory address
  2. Compare it against the expected value
  3. If match: write new value and return true
  4. If no match: discard new value and return false

The atomic guarantee: No other thread can modify the memory between the compare and swap steps.

Conceptual flow:

  1. Read current value cur.
  2. Compute next from cur.
  3. CAS(address, expected=cur, new=next) → succeeds or fails.
  4. On failure, read again and retry or back off.

Pseudo-code (increment counter):

do {
  cur = load(addr)                    // Read current
  next = cur + 1                      // Calculate new
} while (!CAS(addr, cur, next))       // Retry if someone changed it
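
Python has no user-level CAS instruction, but its semantics can be modeled to experiment with the retry loop (the internal lock below only stands in for the hardware's atomicity guarantee; all names are illustrative):

```python
import threading

class AtomicInt:
    """Models CAS semantics; the lock emulates the atomic hardware instruction."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value != expected:
                return False      # someone changed it since we loaded
            self._value = new
            return True

def increment(counter):
    while True:                   # the retry loop from the pseudo-code above
        cur = counter.load()
        if counter.compare_and_swap(cur, cur + 1):
            return

counter = AtomicInt()

def worker():
    for _ in range(1000):
        increment(counter)

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter.load() == 8000: no lost updates despite 8 contending threads
```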

Real-World Examples

1. Atomic Counter (Page View Tracking)

// Without CAS (using lock):
lock(counter_lock)
counter++
unlock(counter_lock)
// Problem: Lock contention on hot counters

// With CAS (lock-free):
do {
  current = load(counter)
  new = current + 1
} while (!CAS(counter, current, new))
// Faster: No lock, just retry loop
// Contention is handled by retries, not waiting

2. Versioned Update (Optimistic Locking)

// Scenario: multiple threads updating a user record guarded by a version field
// Thread A: reads user at version=3 and prepares an update
// Thread B: commits first, bumping the version to 4
// Thread A's CAS must then fail, because the version changed

do {
  user = read(user_id)
  version = user.version
  user.name = "new name"
} while (!CAS(&user.version, version, version + 1))

// Ensures: "only commit if nobody else changed this record since I read it"
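
The versioned-update idea can be sketched against a toy in-memory row store (VersionedStore, its cas method, and rename_user are illustrative names, not a real database API):

```python
import threading

class VersionedStore:
    """Toy row store with an atomic compare-and-swap on the version field."""
    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def put(self, key, row):
        self._rows[key] = dict(row)

    def read(self, key):
        return dict(self._rows[key])   # snapshot, including its version

    def cas(self, key, expected_version, new_row):
        with self._lock:               # the commit itself is atomic
            if self._rows[key]["version"] != expected_version:
                return False           # someone else committed first
            self._rows[key] = dict(new_row)
            return True

def rename_user(store, user_id, new_name, retries=5):
    for _ in range(retries):
        user = store.read(user_id)
        updated = {**user, "name": new_name, "version": user["version"] + 1}
        if store.cas(user_id, user["version"], updated):
            return True                # committed: nobody changed it meanwhile
    return False

store = VersionedStore()
store.put("u123", {"name": "old", "version": 3})
rename_user(store, "u123", "new name")
# row is now {"name": "new name", "version": 4}
```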

3. Lock-Free Stack Pop

// Stack: head → [A] → [B] → [C]
// Thread wants to pop A

do {
  top = load(head)                    // top = [A]
  next = top.next                     // next = [B]
} while (!CAS(head, top, next))       // Replace head with [B]

// If another thread pushed new node between read and CAS:
// CAS fails, loop retries with new head

4. Thread-Safe State Machine

// States: IDLE (0), RUNNING (1), STOPPED (2)
enum State { IDLE = 0, RUNNING = 1, STOPPED = 2 }

// Only transition IDLE → RUNNING
do {
  state = load(state_var)
  if (state != IDLE) return false     // Can't start, not idle
} while (!CAS(state_var, IDLE, RUNNING))

// Guarantees: Only one thread successfully transitions to RUNNING

CAS vs Locks: Performance Comparison

Under Low Contention (few threads):

Operation       | Lock Time  | CAS Time   | Winner
----------------|------------|------------|--------
Increment       | 50 ns      | 5 ns       | CAS (10x faster)
Read + Update   | 80 ns      | 10 ns      | CAS (8x faster)

Under High Contention (many threads):

Threads | Lock QPS    | CAS QPS     | Winner
--------|-------------|-------------|--------
1       | 1,000,000   | 5,000,000   | CAS
4       | 200,000     | 3,000,000   | CAS
16      | 50,000      | 1,500,000   | CAS (still!)
64      | 10,000      | 500,000     | CAS (now degrading)
256     | 2,000       | 50,000      | CAS (with backoff)

Why CAS wins even under contention:

  • Locks: Threads block, OS scheduler overhead increases
  • CAS: Threads retry, but no context switching cost
  • Add exponential backoff to CAS under extreme contention

When to Use CAS

Good for:

  • Counters (hit counters, stats)
  • Sequence numbers (request IDs, event ordering)
  • Simple state flags (started, completed)
  • Optimistic concurrency control
  • Lock-free data structures
  • Reference counting

Bad for:

  • Complex multi-field updates
  • Conditional logic involving multiple values
  • Long critical sections
  • When ABA problem is hard to solve

Rule of thumb: If the operation fits in 1-2 CPU instructions, CAS is better. If it requires 10+ instructions, use locks.

The ABA Problem and Solutions

What is it:

Thread 1: reads value = A
         (context switch)
Thread 2: changes A → B → A
         (thread 1 resumes)
Thread 1: CAS(addr, A, new_value)  ✓ Succeeds!
         But the value went from A to B and back; this "A" may not be the original A!

Example - Stack pop with ABA:

// Stack: [A] → [B]
// Thread 1 tries to pop A
head = load(&stack)          // head = [A]
next = head.next             // next = [B]
                             // (Thread 2 pops [A], then pushes it back)
// Now A is a freed/reused node!
CAS(&stack, head, next)      // ✓ Succeeds, but now pointing to invalid [A]!

Solutions:

  1. Add a version counter (most common):

// Store: (value, version)
// CAS(addr, (expected_val, expected_ver), (new_val, expected_ver + 1))

do {
  (value, version) = load(addr)
  new_version = version + 1
} while (!CAS(addr, (value, version), (new_val, new_version)))
// The version changes on every write, so a recycled value cannot fool the CAS

  2. Use generation numbers:

struct VersionedRef {
  void* ptr;           // The actual pointer
  uint64_t generation; // Incremented on each reuse
}
// CAS on the entire struct ensures ABA safety

  3. Hazard pointers (advanced):

// A thread marks a pointer as "in-use"
// Other threads cannot recycle it while marked
// Slower, but fully ABA-safe

Practical Considerations

CPU Support:

x86/x64:      LOCK CMPXCHG / CMPXCHG16B (single and double-wide CAS)
ARMv7:        LDREX / STREX (LL/SC - Load-Link/Store-Conditional)
ARMv8.1+:     native CAS instructions, plus Load-Acquire / Store-Release ordering

Language Support:

Java:           java.util.concurrent.atomic.AtomicInteger.compareAndSet()
C++:            std::atomic::compare_exchange_strong()
Go:             sync/atomic.CompareAndSwapUint64()
Rust:           std::sync::atomic::AtomicUsize::compare_exchange()
Python:         multiprocessing.Value (limited); usually use locks

When Contention Gets Too High - Add Backoff:

// Exponential backoff
uint32_t attempts = 0;
for (;;) {
  cur = load(addr)
  next = cur + 1
  if (CAS(addr, cur, next))
    break                                 // Success
  sleep(min(1ms << attempts++, 100ms))    // Failed: back off exponentially
}
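
A runnable version of the backoff loop, with jitter added so contending threads don't retry in lockstep (CasCell emulates the CAS primitive with a lock and is purely illustrative):

```python
import random
import threading
import time

class CasCell:
    """Minimal cell with CAS semantics (the lock emulates hardware atomicity)."""
    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self.value != expected:
                return False
            self.value = new
            return True

def increment_with_backoff(cell, max_sleep=0.1):
    attempt = 0
    while True:
        cur = cell.value
        if cell.compare_and_swap(cur, cur + 1):
            return
        # Contended: sleep a random slice of an exponentially growing window
        time.sleep(random.uniform(0, min(0.001 * (1 << attempt), max_sleep)))
        attempt += 1

cell = CasCell()
increment_with_backoff(cell)   # cell.value is now 1
```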

Alternatives to CAS:

Pattern                          | Trade-off
---------------------------------|------------------------------------------
Coarse lock                       | Simple, but high contention cost
Sharded locks (per-core)          | Good balance, less contention
CAS with backoff                  | Fast but needs tuning
Partitioned atomic variables      | Best for counters, split work across threads
Lock-free queues                  | Complex, but scale well
Eventual consistency              | Highest throughput, but weaker guarantees

Replication Patterns

| Pattern | How It Works | Pros | Cons | When to Use |
|---|---|---|---|---|
| Master-Slave | Single master writes, slaves read-only | Simple to understand | Single point of failure for writes | Read-heavy workloads |
| Master-Master | Multiple masters, bidirectional sync | No write SPOF | Conflict resolution complexity | Multi-region setups |
| Quorum-based | Majority of replicas needed for consistency | Strong consistency + HA | Slower writes (wait for majority) | Critical data (banking) |

SLI, SLO, and SLA

Definitions:

| Term | Definition | Example | Owner |
|---|---|---|---|
| SLI (Service Level Indicator) | A specific metric that measures system behavior | API response time: 95th percentile = 200ms; error rate: 0.01% | Engineering |
| SLO (Service Level Objective) | Internal target for SLI performance | API response time ≤200ms for 99.5% of requests | Engineering & PM |
| SLA (Service Level Agreement) | Legal commitment with consequences for missing the target | If uptime <99.9%, customer gets 10% refund | Business/Legal |

Key Differences:

  • SLI: Measurement (what we track)
  • SLO: Target (what we aim for)
  • SLA: Contract (what we promise)

Common SLIs

| Category | Metric | How to Measure | Good Target |
|---|---|---|---|
| Availability | Uptime | (total time - downtime) / total time | 99.9% (8.76 hrs/year downtime) |
| Latency | Response time (P50, P95, P99) | Request → response time | P95 < 200ms, P99 < 1s |
| Error rate | Failed requests / total requests | HTTP 5xx + timeouts | < 0.1% |
| Throughput | Requests per second | Track QPS | Track trend, alert on drops |
| Completeness | Data accuracy | Queries returning correct results | > 99.9% accuracy |
| Freshness | Data staleness | Time since last update | < 5 minutes (depends on use case) |

SLO Target Hierarchy

Enterprise customers: 99.99% uptime (52 minutes/year downtime)
  ↓
Standard customers: 99.9% uptime (8.76 hours/year downtime)
  ↓
Free tier: 99% uptime (3.65 days/year downtime)

Why tiered?
- Premium customers need higher reliability
- Premium tier → higher cost → justifies more ops investment

Cost vs reliability:

99% uptime: 1 failure/100 requests (cheap infra)
99.9% uptime: 1 failure/1000 requests (replicas + monitoring)
99.99% uptime: 1 failure/10,000 requests (multi-region, chaos testing)
99.999% uptime: 1 failure/100,000 requests (enterprise-grade)

Each 9 costs ~3-5x more than the previous level

Error Budget Concept

Error budget = Allowed downtime in a period to still meet SLO

SLO: 99.9% uptime per month
Total month hours: 730 hours
Error budget: (1 - 0.999) × 730 = 0.73 hours = 43.8 minutes/month
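
The budget arithmetic as a helper (a sketch; 730 hours approximates one month):

```python
def error_budget_minutes(slo, period_hours=730):
    """Allowed downtime minutes per period for a given availability SLO."""
    return (1 - slo) * period_hours * 60

budget = error_budget_minutes(0.999)   # ≈ 43.8 minutes/month
remaining = budget - 30                # ≈ 13.8 minutes after a 30-minute spend
```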

If 30 minutes are already used, only 13.8 minutes remain
→ Can't do risky deployments
→ Must be extra cautious (canary-only rollouts, heavy monitoring)

Managing error budget:

# Example implementation
error_budget_remaining = calculate_budget_remaining()  # fraction of budget left

if error_budget_remaining > 0.50:
    # Aggressive: canary deployment, feature flags
    deploy_with_canary()

elif error_budget_remaining > 0.10:
    # Conservative: manual rollout, heavy monitoring
    deploy_manually_with_heavy_monitoring()

else:
    # Critical: freeze all deployments
    only_fix_critical_bugs()
    no_new_features()

Incident impact on budget:

Incident: User auth service down for 10 minutes
Impact: 10,000 failed requests / 1,000,000 total = 1% error rate

Error budget burn:
  An SLO of 99.9% allows a 0.1% error rate, so the incident burned budget
  at 10x the allowed rate while it lasted
  In downtime terms: 10 of the month's 43.8 allowed minutes ≈ 23% of budget

Result: Nearly a quarter of the monthly error budget gone in one incident

SLO Design Best Practices

1. Make SLOs realistic

Bad SLO: 99.99% on a single-region system (unrealistic)
Good SLO: 99% on standard tier, 99.9% on premium

Reality: systems with single points of failure rarely sustain more than ~99-99.5%
Multi-region systems can reach 99.99%

2. Balance multiple metrics

# Don't just optimize uptime
SLO for Netflix:
  - Uptime: 99.99%
  - Latency (P99): < 1 second
  - Error rate: < 0.1%
  
Optimizing only uptime can hurt latency
(always return cached/stale data = 100% uptime but bad UX)

3. Set SLOs lower than infrastructure capability

Infrastructure: 99.99% availability
SLO: 99.9% availability (leave 0.09% buffer)

Why buffer?
- Deployments, maintenance
- Network blips
- Monitoring false positives

4. Monitor SLI trends, not just thresholds

Alert when:
  - P95 latency exceeds 500ms (threshold breach)
  - P95 latency increases 50% week-over-week (trend)
  
Second alert catches degradation before total failure

SLA Examples

Google Cloud SLA:

Compute Engine:
  - 99.95% availability
  - Credits if breached:
    - 99%-99.95%: 10% credit
    - 95%-99%: 30% credit
    - <95%: 100% credit
    
Design: keep the internal SLO stricter than the SLA (e.g. SLO 99.99%, SLA 99.95%)
If the SLO is breached, you still might make the SLA
The gap is an extra buffer for one-off incidents

AWS SLA:

EC2:
  - 99.99% availability SLA
  - Applied per instance
  - Multiple instances in different AZs recommended
  
If one instance fails:
  - That instance gets credits
  - Other instances unaffected

Stripe SLA (payment processing):

- 99.99% uptime required (financial services)
- <50ms latency for payment processing
- Incident response: <15 minutes
- Requires multi-region active-active setup

Typical Service Tier SLOs

| Service Type | Uptime Target | Latency (P95) | Error Rate |
|---|---|---|---|
| API Gateway | 99.99% | < 50ms | < 0.01% |
| Web server | 99.9% | < 200ms | < 0.1% |
| Background job | 99% | 1 minute | < 1% |
| Batch analytics | 95% | 1 hour | < 5% |
| Cache layer | 99.5% | < 5ms | < 0.5% |

Monitoring & Alerting for SLOs

# Example: Monitor SLO burn rate
def check_slo_burn():
    # SLO: 99.9% uptime = 0.1% error rate allowed
    current_error_rate = get_error_rate_last_hour()  # fraction, e.g. 0.002

    if current_error_rate > 0.01:
        # 10x error budget burn rate
        # Would deplete the monthly budget in ~3 days
        alert("CRITICAL: SLO burn rate 10x. Incident response needed")

    elif current_error_rate > 0.005:
        # 5x error budget burn rate
        alert("WARNING: High SLO burn rate. Investigate")

    elif current_error_rate > 0.0015:
        # 1.5x error budget burn rate (acceptable)
        log("Normal SLO burn rate")

Circuit Breaker Pattern

Normal state:
Request → Service A → Success ✓

Failure state (too many failures):
Request → Circuit Breaker (OPEN) → Fail fast ✗ (don't call failing service)

Recovery state (after timeout):
Request → Circuit Breaker (HALF-OPEN) → Try service A
         ↓
         Success → Close circuit (back to normal)
         Failure → Open circuit again

Benefits:

  • Fail fast: Don't waste time on failing services
  • Cascading failure prevention: Stop propagating failures
  • Recovery: Automatically retry after timeout
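
The state machine above can be sketched as a small class (the thresholds, names, and single-threaded design are illustrative assumptions; a production breaker also needs locking and metrics):

```python
import time

class CircuitBreaker:
    """Minimal CLOSED → OPEN → HALF-OPEN breaker (single-threaded sketch)."""
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, fallback):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()              # fail fast: don't hit the service
            self.state = "HALF_OPEN"           # timeout elapsed: allow one probe
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"            # trip (or re-trip) the breaker
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0
        self.state = "CLOSED"                  # a success closes the circuit
        return result

breaker = CircuitBreaker(failure_threshold=2, reset_timeout=0.01)

def flaky():
    raise RuntimeError("service down")

for _ in range(3):
    breaker.call(flaky, fallback=lambda: "cached response")
# after two failures the breaker is OPEN and subsequent calls fail fast
```

A successful probe in HALF_OPEN resets the failure count and closes the circuit, matching the recovery path in the diagram.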

Interview Questions & Answers

Q1: Your single MySQL server is maxed out. Scale vertically or horizontally?

Answer: First check:

  1. Is it CPU-bound or I/O-bound?
     - top / iostat shows where the bottleneck is
  2. Can you optimize queries first?
     - Add indexes, denormalize, cache; this solves ~80% of issues without scaling

If optimization doesn't help:

Short-term (vertical scaling):

  • Upgrade instance: m5.large → m5.4xlarge
  • Gains 3-6 months
  • Quick, minimal downtime (if a standby replica is prepared)

Long-term (horizontal scaling):

  • Shard by user_id or geographic region
  • Each shard gets replicas
  • Unlimited scalability
  • Application-level complexity

Recommended path:

Month 1-3: Optimize queries + vertical scale (buy time)
Month 3-6: Implement sharding (parallel effort)
Month 6+: Run sharded system (replaces single DB)

Q2: Design Netflix architecture for 100M users. What makes it resilient?

Answer: Key principles (Netflix Chaos Engineering):

  1. Multi-AZ deployment
     - Users served from multiple regions/AZs
     - Any single AZ failure doesn't affect users
  2. Stateless services
     - Any server can be killed
     - No local state = easy horizontal scaling
  3. Bulkheads (isolation)
     - If the video recommendations service fails, the homepage still works
     - Isolated failure domains
  4. Timeouts + circuit breakers
     - If the recommendation service is slow (>1s), don't wait
     - Use cached/default recommendations; fail fast
  5. Retry with backoff
     - Transient failures auto-recover
     - Exponential backoff prevents thundering herd

Architecture:

User request → Load balancer (multi-AZ)
             ↓
      API Gateway (stateless)
             ↓
   Microservices (Video, Recommendation, Billing, etc.)
             ↓
   Each service has:
   - Multiple instances (horizontal scale)
   - Read replicas (HA)
   - Circuit breakers (cascading failure prevention)
   - Cache layer (Redis)
   - Timeout/retry logic

Failure scenario:

Video service crashes
  ↓
Circuit breaker opens (fails fast)
  ↓
User gets cached video list
  ↓
Meanwhile: Auto-scaler spins up new instances
  ↓
Service recovers in 30 seconds
  ↓
Circuit breaker closes (normal)

Q3: Design a cache invalidation strategy.

Answer: Strategies (in order of preference):

  1. TTL (Time-To-Live)
     - Simple: SET key value EX 3600 (1 hour TTL)
     - Pro: no coordination needed
     - Con: stale data until expiry
  2. Event-based invalidation
     - On data update: PUBLISH user:123:updated; subscribers delete the cache key
     - Pro: instant invalidation, no stale data
     - Con: more complex, requires pub/sub
  3. Hybrid (recommended)
     - TTL + event-based: events invalidate immediately
     - TTL catches missed events (safety net)

Implementation:

# Set with TTL
redis.setex("user:123", 3600, user_data)

# On update
db.update_user(123, new_data)
redis.publish("user:123:updated", json.dumps(new_data))

# Subscriber
def on_user_update(message):
    user_id = message['user_id']
    redis.delete(f"user:{user_id}")  # Immediate invalidation
    # Next request re-populates from DB

Cache stampede mitigation:

Multiple requests for the same key after expiry
→ All hit the DB simultaneously
→ Database overloaded

Solution: Probabilistic early expiry

def get_user(user_id):
    cached = redis.get(f"user:{user_id}")
    if cached:
        ttl = redis.ttl(f"user:{user_id}")
        # If < 10% of the TTL remains, refresh early with small probability:
        # one request refreshes in the background, others serve the cached copy
        if ttl < 360 and random.random() < 0.1:
            refresh_in_background(user_id)
        return cached

    # Cache miss: load from DB, set cache
    user = db.get_user(user_id)
    redis.setex(f"user:{user_id}", 3600, user)
    return user

Q4: Design a system to survive an entire AWS region failure.

Answer: Active-Active multi-region (required):

User requests → DNS routing (Route 53)
       ↓
Geolocation/latency routing
       ↓
Region 1 (US East)      Region 2 (US West)
- App instances         - App instances
- Database              - Database
- Cache                 - Cache

User in California → Route 53 sends to US West (lower latency)
User in Virginia → Route 53 sends to US East

If US East crashes:
- Route 53 detects the health check failure
- All users → US West (higher latency, but working)

Database replication:

Multi-master replication (MySQL, DynamoDB global tables)
Region 1 DB ←→ Region 2 DB

Writes in Region 1 automatically replicate to Region 2
Writes in Region 2 automatically replicate to Region 1

If Region 1 fails:
- Users automatically fail over to Region 2
- The data is already there (replicated)
- Data loss is bounded by replication lag (near zero in steady state)

Cache/state:

Don't store session state locally
Use a distributed cache (DynamoDB or Redis Cluster)
- Spans multiple regions
- User sessions survive a region failure

Cost trade-off:

Single region: $1M/month
Multi-region: $2M/month (2x cost)

Benefit: the system survives a region failure
If the business earns ~$300M/day, a region outage costs ~$12M/hour
2x infrastructure cost is cheaper than 1 hour of downtime

Q5: How many load balancers do you need for 10M concurrent connections?

Answer: Calculation:

Modern load balancer (AWS ALB):

  • Max 25,000 new connections per second
  • ~1M concurrent connections per instance

For 10M concurrent connections:

  • 10M / 1M = 10 load balancers needed

But add redundancy:

  • 2 regions × 10 LBs = 20 LBs
  • Each region can lose an LB without impact
  • High availability

LB architecture:

DNS (Route 53)
   ↓
US East: LB1-LB10 (region sized to carry the full load)
US West: LB11-LB20 (same, for failover)

If LB1 fails:
- Route 53 health checks detect it
- Traffic spreads across the remaining LBs
- Users unaffected

Connection distribution:

Each connection = one TCP flow through the LB
The LB tracks: source IP, dest IP, port

With 10M connections:

  • ~1M per LB
  • Memory per connection: ~1-2KB
  • 1M × 2KB = 2GB per LB (acceptable)

Why not one giant LB?
- Single point of failure
- Geographic latency (users far away)
- Doesn't scale beyond ~1M connections