06-Scalability-Reliability
Scalability and Reliability Patterns
Back-of-Envelope Calculations
Quick estimation techniques for system design interviews. These are rough estimates used to validate architectural decisions.
Time and Storage Units
| Unit | Bytes | Time |
|---|---|---|
| KB | 10³ | 1 millisecond |
| MB | 10⁶ | 1 second |
| GB | 10⁹ | 1 minute |
| TB | 10¹² | 1 hour |
| PB | 10¹⁵ | 1 day |
Latency Numbers Every Programmer Should Know
Click to view code
L1 cache reference: 0.5 ns
Branch mispredict: 5 ns
L2 cache reference: 7 ns
Mutex lock/unlock: 100 ns
Main memory reference: 100 ns
Compress 1KB with Snappy: 3,000 ns (3 µs)
Send 1KB over 1 Gbps network: 10,000 ns (10 µs)
Read 1MB sequentially from memory: 250,000 ns (250 µs)
Round trip within same datacenter: 500,000 ns (500 µs)
Read 1MB sequentially from SSD: 1,000,000 ns (1 ms)
Disk seek: 10,000,000 ns (10 ms)
Read 1MB sequentially from disk: 20,000,000 ns (20 ms)
Send packet CA → Netherlands: 150,000,000 ns (150 ms)
Rule of thumb: Disk is ~40x slower than memory, network is ~150-200x slower than memory
Common QPS (Queries Per Second) Calculations
Given: X million users, Y% daily active, Z requests per user per day
Click to view code
Formula: (X million × Y% × Z requests) / (86,400 seconds/day)
Example: 1B users, 50% DAU, 10 requests/day
= (1,000,000,000 × 0.5 × 10) / 86,400
= 5,000,000,000 / 86,400
= ~57,870 QPS
Peak traffic rule: Multiply by 3-5x (peak is 3-5x average)
= 57,870 × 5 = ~290,000 QPS peak
Data Volume Calculations
Given: X million users, Y data per user, Z days retention
Click to view code
Formula: X million × Y × Z = total storage
Example: 1B users, 1KB per user, 30 days retention
= 1,000,000,000 × 1KB × 30
= 30TB total
But need redundancy:
- 3x replication: 30TB × 3 = 90TB
- 2x for backup: 30TB × 2 = 60TB
- Total: ~150TB
Database Sizing
Single MySQL server capacity:
Click to view code
Memory: 64GB
Max connections: 10,000
Throughput:
- 10,000 QPS read (cached)
- 1,000 QPS write
- Mixed: 5,000 QPS
Before hitting CPU/memory limits, disk I/O becomes bottleneck
When to shard:
Click to view code
Data size: > 1TB → Consider sharding
QPS: > 10,000 → Consider sharding
Connections: > 5,000 → Consider sharding
Sharding key: User ID, geographic region, or hash
Server Capacity Planning
Single web server (8 core, 32GB RAM):
Click to view code
Memory per connection: ~1MB
Max connections: 32,000
QPS capacity:
- Simple operations: ~10,000 QPS
- Complex operations: ~1,000 QPS
- CPU-bound: ~5,000 QPS
Number of servers needed:
Click to view code
Formula: Peak QPS / QPS per server
Example: 300,000 peak QPS, 10,000 QPS per server
= 300,000 / 10,000 = 30 servers
With redundancy (rolling updates, failures):
= 30 × 1.3 = 39 servers (30% buffer)
Load Balancer Sizing
Click to view code
Modern LB (AWS ALB):
- Max 25,000 new connections/second
- Max 1,000,000 concurrent connections per LB
For 10M concurrent connections:
= 10,000,000 / 1,000,000 = 10 LBs needed
For multi-region HA:
= 10 LBs × 2 regions = 20 LBs total
Cache Sizing
Rule: Cache hit rate ~90% means 10x reduction in database load
Click to view code
Example without cache:
- 100,000 QPS to database
- Database capacity: 10,000 QPS
- Problem: Overloaded
With cache (90% hit rate):
- Cache serves: 90,000 QPS (in-memory, fast)
- DB serves: 10,000 QPS (miss traffic)
- Balanced!
Cache memory needed:
Click to view code
Formula: (QPS × avg_object_size × TTL) / hit_rate
Example: 100,000 QPS, 1KB avg object, 1 hour TTL, 90% hit rate
= (100,000 × 1KB × 3,600) / 0.9
= 360,000,000 KB / 0.9
= ~400GB cache
Practical: Use distributed cache (Redis) across 10 servers
= 40GB per Redis instance (reasonable)
Bandwidth Calculation
Given: Peak QPS, average response size
Click to view code
Formula: Peak QPS × Avg response size
Example: 300,000 peak QPS, 10KB response
= 300,000 × 10KB = 3,000,000 KB/s = 3GB/s
Bandwidth needed:
- 10 Gbps link: 10GB/s (can handle it)
- But need redundancy: 2 × 10Gbps = 20Gbps total
Rule of thumb: Provision 2-3x peak bandwidth for headroom
Video Streaming Bandwidth
Click to view code
Netflix scenario: 100M users, 50% watching simultaneously
Bitrate per stream: 5 Mbps (HD)
Total bandwidth needed: 50M × 5 Mbps = 250 Tbps
Wow, that's massive! How does Netflix handle it?
Solution: CDN + regional caching
- Cache video in edge locations (ISP networks)
- Only ~10-20% requires backbone traffic
- Backbone: 250 Tbps × 15% = 37.5 Tbps (still huge, but manageable)
Example: Design Instagram Feed for 100M DAU
Click to view code
1. Calculate QPS:
- 100M DAU × 20 requests/day / 86,400 = ~23,000 QPS avg
- Peak: 23,000 × 5 = ~115,000 QPS
2. Database sizing:
- Each feed = 100 posts × 1KB = 100KB
- 100M users × 100KB = 10TB storage
- With 3x replication: 30TB
- Needs sharding (> 1TB)
3. Server capacity:
- 115,000 peak QPS / 10,000 per server = 11.5 → 15 servers
- With 30% redundancy: ~20 servers
4. Cache sizing:
- 90% hit rate assumption
- Feed objects: 100M users × 100KB × 0.1 (10% miss) = 1TB
- Split across 25 Redis instances = 40GB each
5. Bandwidth:
- 115,000 QPS × 100KB = 11.5GB/s
- Need 100 Gbps backbone
Back-of-Envelope Checklist
When estimating, ask:
Click to view code
✓ How many users? (total, DAU, peak concurrent)
✓ Request rate? (QPS average, QPS peak)
✓ Data volume? (per user, total, growth rate)
✓ Latency requirements? (response time SLA)
✓ Availability requirements? (uptime SLA)
✓ Bandwidth constraints? (network capacity)
✓ Storage retention? (how long to keep data)
✓ Consistency requirements? (strong vs eventual)
Then size:
1. Servers needed
2. Database needed
3. Cache needed
4. Bandwidth needed
5. Storage needed
Vertical vs Horizontal Scaling
| Aspect | Vertical (Scale Up) | Horizontal (Scale Out) | When to Use |
|---|---|---|---|
| Approach | Bigger server | More servers | Vertical: quick fix; Horizontal: long-term |
| Limit | Hardware limit | Virtually unlimited | Horizontal when vertical hits limits |
| Cost | Non-linear | Linear | Horizontal more cost-effective at scale |
| Complexity | Simple | Complex (distributed) | Vertical for simplicity |
| Downtime | Required | Zero (rolling) | Horizontal for high availability |
Why Choose Vertical:
- Quick wins, legacy apps, starting out
Why Choose Horizontal:
- High availability, predictable costs, massive scale
Tradeoff Summary:
- Vertical: Simple ↔ Limited, expensive
- Horizontal: Unlimited scale ↔ Complexity
CAP Theorem Tradeoffs
| Choice | Guarantees | Sacrifice | Use Cases |
|---|---|---|---|
| CP | Consistency + Partition Tolerance | Availability | Banking, inventory, auctions |
| AP | Availability + Partition Tolerance | Consistency | Social feeds, analytics, caching |
Why Choose CP:
- Financial transactions, inventory management
- Data correctness > uptime
Why Choose AP:
- Social media, real-time analytics
- Uptime > immediate consistency
Tradeoff Summary:
- CP: Data correctness ↔ May be unavailable
- AP: Always available ↔ Stale data possible
Load Balancing Strategies
| Strategy | How It Works | Pros | Cons | When to Use |
|---|---|---|---|---|
| Round Robin | Distribute requests sequentially to each server | Simple, easy to implement | No awareness of server load | Uniform workloads, simple systems |
| Least Connections | Route to server with fewest active connections | Handles variable request durations | Overhead of tracking connections | Long-lived connections (WebSockets) |
| IP Hash | Route based on client IP (consistent) | Sticky sessions without affinity | Uneven distribution if IPs cluster | Session affinity (WebSockets, stateful apps) |
| Weighted Round Robin | Some servers get more traffic | Handles heterogeneous servers | Need to configure weights | Mix of powerful and weak servers |
| Random | Pick random server | Simple, low overhead | No optimization | Very simple loads |
Compare-and-Swap (CAS) for Atomic Operations
What it is: CPU/VM primitive that atomically updates a memory location only if its current value matches an expected value. Avoids locks for simple, contended operations.
Why use it: Fast and non-blocking for small critical sections like counters, versioned writes, or state flips. Reduces contention compared to coarse locks.
How CAS Works
Hardware level (all at once, no interruption):
- Read current value from memory address
- Compare it against the expected value
- If match: write new value and return
true - If no match: discard new value and return
false
The atomic guarantee: No other thread can modify the memory between the compare and swap steps.
Conceptual flow:
- Read current value
cur. - Compute
nextfromcur. - CAS(address, expected=cur, new=next) → succeeds or fails.
- On failure, read again and retry or back off.
Pseudo-code (increment counter):
Click to view code
do {
cur = load(addr) // Read current
next = cur + 1 // Calculate new
} while (!CAS(addr, cur, next)) // Retry if someone changed it
Real-World Examples
1. Atomic Counter (Page View Tracking)
Click to view code
// Without CAS (using lock):
lock(counter_lock)
counter++
unlock(counter_lock)
// Problem: Lock contention on hot counters
// With CAS (lock-free):
do {
current = load(counter)
new = current + 1
} while (!CAS(counter, current, new))
// Faster: No lock, just retry loop
// Contention is handled by retries, not waiting
2. Versioned Update (Optimistic Locking)
Click to view code
// Scenario: Multiple threads updating a user record version
// Thread A: Read user version=3, wants to update
// Thread B: Updates version to 4 (increments it)
// Thread A's update should fail since version changed
do {
user = read(user_id)
version = user.version
user.name = "new name"
user.version = version + 1
} while (!CAS(user_obj, version, version+1))
// This ensures: "Only commit if nobody else changed this"
3. Lock-Free Stack Pop
Click to view code
// Stack: head → [A] → [B] → [C]
// Thread wants to pop A
do {
top = load(head) // top = [A]
next = top.next // next = [B]
} while (!CAS(head, top, next)) // Replace head with [B]
// If another thread pushed new node between read and CAS:
// CAS fails, loop retries with new head
4. Thread-Safe State Machine
Click to view code
// States: IDLE (0), RUNNING (1), STOPPED (2)
enum State { IDLE = 0, RUNNING = 1, STOPPED = 2 }
// Only transition IDLE → RUNNING
do {
state = load(state_var)
if (state != IDLE) return false // Can't start, not idle
} while (!CAS(state_var, IDLE, RUNNING))
// Guarantees: Only one thread successfully transitions to RUNNING
CAS vs Locks: Performance Comparison
Under Low Contention (few threads):
Click to view code
Operation | Lock Time | CAS Time | Winner
----------------|------------|------------|--------
Increment | 50 ns | 5 ns | CAS (10x faster)
Read + Update | 80 ns | 10 ns | CAS (8x faster)
Under High Contention (many threads):
Click to view code
Threads | Lock QPS | CAS QPS | Winner
--------|-------------|-------------|--------
1 | 1,000,000 | 5,000,000 | CAS
4 | 200,000 | 3,000,000 | CAS
16 | 50,000 | 1,500,000 | CAS (still!)
64 | 10,000 | 500,000 | CAS (now degrading)
256 | 2,000 | 50,000 | CAS (with backoff)
Why CAS wins even under contention:
- Locks: Threads block, OS scheduler overhead increases
- CAS: Threads retry, but no context switching cost
- Add exponential backoff to CAS under extreme contention
When to Use CAS
✅ Good for:
- Counters (hit counters, stats)
- Sequence numbers (request IDs, event ordering)
- Simple state flags (started, completed)
- Optimistic concurrency control
- Lock-free data structures
- Reference counting
❌ Bad for:
- Complex multi-field updates
- Conditional logic involving multiple values
- Long critical sections
- When ABA problem is hard to solve
Rule of thumb: If the operation fits in 1-2 CPU instructions, CAS is better. If it requires 10+ instructions, use locks.
The ABA Problem and Solutions
What is it:
Click to view code
Thread 1: reads value = A
(context switch)
Thread 2: changes A → B → A
(thread 1 resumes)
Thread 1: CAS(addr, A, new_value) ✓ Succeeds!
But the A now is different from the original A!
Example - Stack pop with ABA:
Click to view code
// Stack: [A] → [B]
// Thread 1 tries to pop A
head = load(&stack) // head = [A]
next = head.next // next = [B]
// (Thread 2 pops [A], then pushes it back)
// Now A is a freed/reused node!
CAS(&stack, head, next) // ✓ Succeeds, but now pointing to invalid [A]!
Solutions:
- Add version counter (most common):
Click to view code
// Store: (value, version)
// CAS(addr, (expected_val, expected_ver), (new_val, new_ver+1))
do {
(value, version) = load(addr)
new_version = version + 1
} while (!CAS(addr, (value, version), (new_val, new_version)))
// Version changes guarantee uniqueness
- Use generation numbers:
Click to view code
struct VersionedRef {
void* ptr; // The actual pointer
uint64_t generation; // Incremented on each reuse
}
// CAS on entire struct ensures ABA safety
- Hazard pointers (advanced):
Click to view code
// Thread marks pointer as "in-use"
// Other threads cannot recycle it
// Slower but fully ABA-safe
Practical Considerations
CPU Support:
Click to view code
x86/x64: CAS, CMPXCHG (single & double-wide)
ARM: LDREX, STREX (LL/SC - Load-Link/Store-Conditional)
Modern CPUs: Compare-And-Swap, Load-Acquire, Store-Release (memory barriers)
Language Support:
Click to view code
Java: java.util.concurrent.atomic.AtomicInteger.compareAndSet()
C++: std::atomic::compare_exchange_strong()
Go: sync/atomic.CompareAndSwapUint64()
Rust: std::sync::atomic::AtomicUsize::compare_exchange()
Python: multiprocessing.Value (limited); usually use locks
When Contention Gets Too High - Add Backoff:
Click to view code
// Exponential backoff
uint32_t attempts = 0;
do {
cur = load(addr)
next = cur + 1
if (!CAS(addr, cur, next)) {
// Failed, back off exponentially
sleep(min(1ms << attempts++, 100ms))
}
} while (CAS failed)
Alternatives to CAS:
Click to view code
Pattern | Trade-off
---------------------------------|------------------------------------------
Coarse lock | Simple, but high contention cost
Sharded locks (per-core) | Good balance, less contention
CAS with backoff | Fast but needs tuning
Partitioned atomic variables | Best for counters, split work across threads
Lock-free queues | Complex, but scale well
Eventual consistency | Highest throughput, but weaker guarantees
Replication Patterns
| Pattern | How It Works | Pros | Cons | When to Use |
|---|---|---|---|---|
| Master-Slave | Single master writes, slaves read-only | Simple to understand | Single point of failure for writes | Read-heavy workloads |
| Master-Master | Multiple masters, bidirectional sync | No write SPOF | Conflict resolution complex | Multi-region setup |
| Quorum-based | Majority of replicas needed for consistency | Strong consistency + HA | Slower writes (wait for majority) | Critical data (banking) |
SLI, SLO, and SLA
Definitions:
| Term | Definition | Example | Owner |
|---|---|---|---|
| SLI (Service Level Indicator) | A specific metric that measures system behavior | API response time: 95th percentile = 200ms; Error rate: 0.01% | Engineering |
| SLO (Service Level Objective) | Internal target for SLI performance | API response time should be ≤200ms for 99.5% of requests | Engineering & PM |
| SLA (Service Level Agreement) | Legal commitment with consequences for missing SLO | If uptime <99.9%, customer gets 10% refund | Business/Legal |
Key Differences:
- SLI: Measurement (what we track)
- SLO: Target (what we aim for)
- SLA: Contract (what we promise)
Common SLIs
| Category | Metric | How to Measure | Good Target |
|---|---|---|---|
| Availability | Uptime | (totaltime - downtime) / totaltime | 99.9% (8.76 hrs/year downtime) |
| Latency | Response time (P50, P95, P99) | Request → Response time | P95 < 200ms, P99 < 1s |
| Error rate | Failed requests / total requests | HTTP 5xx + timeouts | < 0.1% |
| Throughput | Requests per second | Track QPS | Track trend, alert on drops |
| Completeness | Data accuracy | Queries returning correct results | > 99.9% accuracy |
| Freshness | Data staleness | Time since last update | < 5 minutes (depends on use case) |
SLO Target Hierarchy
Click to view code
Enterprise customers: 99.99% uptime (52 minutes/year downtime)
↓
Standard customers: 99.9% uptime (8.76 hours/year downtime)
↓
Free tier: 99% uptime (3.65 days/year downtime)
Why tiered?
- Premium customers need higher reliability
- Premium tier → higher cost → justifies more ops investment
Cost vs reliability:
Click to view code
99% uptime: 1 failure/100 requests (cheap infra)
99.9% uptime: 1 failure/1000 requests (replicas + monitoring)
99.99% uptime: 1 failure/10,000 requests (multi-region, chaos testing)
99.999% uptime: 1 failure/100,000 requests (enterprise-grade)
Each 9 costs ~3-5x more than the previous level
Error Budget Concept
Error budget = Allowed downtime in a period to still meet SLO
Click to view code
SLO: 99.9% uptime per month
Total month hours: 730 hours
Error budget: (1 - 0.999) × 730 = 0.73 hours = 43.8 minutes/month
If already used 30 minutes, only 13.8 minutes remaining
→ Can't do risky deployments
→ Must be extra cautious (no canary → full rollout)
Managing error budget:
Click to view code (python)
# Example implementation
error_budget_remaining = calculate_budget_remaining()
if error_budget_remaining > 50%:
# Aggressive: canary deployment, feature flags
deploy_with_canary()
elif error_budget_remaining > 10%:
# Conservative: manual rollout, heavy monitoring
deploy_manually_with_heavy_monitoring()
else:
# Critical: freeze all deployments
only_fix_critical_bugs()
no_new_features()
Incident impact on budget:
Click to view code
Incident: User auth service down for 10 minutes
Impact: 10,000 failed requests / 1,000,000 total = 1% error rate
Error budget burn:
If SLO is 99.9% (0.1% errors allowed)
This incident used: 1% / 0.1% = 10x the budget
Result: Error budget depleted for rest of month
SLO Design Best Practices
1. Make SLOs realistic
Click to view code
Bad SLO: 99.99% on single-region system (impossible)
Good SLO: 99% on standard tier, 99.9% on premium
Reality: Systems with single points of failure max out at ~99%
Multi-region systems can reach 99.99%
2. Balance multiple metrics
Click to view code
# Don't just optimize uptime
SLO for Netflix:
- Uptime: 99.99%
- Latency (P99): < 1 second
- Error rate: < 0.1%
Optimizing only uptime can hurt latency
(always return cached/stale data = 100% uptime but bad UX)
3. Set SLOs lower than infrastructure capability
Click to view code
Infrastructure: 99.99% availability
SLO: 99.9% availability (leave 0.09% buffer)
Why buffer?
- Deployments, maintenance
- Network blips
- Monitoring false positives
4. Monitor SLI trends, not just thresholds
Click to view code
Alert when:
- P95 latency exceeds 500ms (threshold breach)
- P95 latency increases 50% week-over-week (trend)
Second alert catches degradation before total failure
SLA Examples
Google Cloud SLA:
Click to view code
Compute Engine:
- 99.95% availability
- Credits if breached:
- 99%-99.95%: 10% credit
- 95%-99%: 30% credit
- <95%: 100% credit
Design: SLO (99.9%) < SLA (99.95%)
If SLO breached, still might make SLA
Extra buffer for one-time incidents
AWS SLA:
Click to view code
EC2:
- 99.99% availability SLA
- Applied per instance
- Multiple instances in different AZs recommended
If one instance fails:
- That instance gets credits
- Other instances unaffected
Stripe SLA (payment processing):
Click to view code
- 99.99% uptime required (financial services)
- <50ms latency for payment processing
- Incident response: <15 minutes
- Requires multi-region active-active setup
Typical Service Tier SLOs
| Service Type | Uptime Target | Latency (P95) | Error Rate |
|---|---|---|---|
| API Gateway | 99.99% | < 50ms | < 0.01% |
| Web server | 99.9% | < 200ms | < 0.1% |
| Background job | 99% | 1 minute | < 1% |
| Batch analytics | 95% | 1 hour | < 5% |
| Cache layer | 99.5% | < 5ms | < 0.5% |
Monitoring & Alerting for SLOs
Click to view code (python)
# Example: Monitor SLO burn rate
def check_slo_burn():
# SLO: 99.9% uptime = 0.1% error rate allowed
current_error_rate = get_error_rate_last_hour()
if current_error_rate > 1%:
# 10x error budget burn rate
# Will deplete monthly budget in ~3 hours
alert("CRITICAL: SLO burn rate 10x. Incident response needed")
elif current_error_rate > 0.5%:
# 5x error budget burn rate
alert("WARNING: High SLO burn rate. Investigate")
elif current_error_rate > 0.15%:
# 1.5x error budget burn rate (acceptable)
log("Normal SLO burn rate")
Circuit Breaker Pattern
Click to view code
Normal state:
Request → Service A → Success ✓
Failure state (too many failures):
Request → Circuit Breaker (OPEN) → Fail fast ✗ (don't call failing service)
Recovery state (after timeout):
Request → Circuit Breaker (HALF-OPEN) → Try service A
↓
Success → Close circuit (back to normal)
Failure → Open circuit again
Benefits:
- Fail fast: Don't waste time on failing services
- Cascading failure prevention: Stop propagating failures
- Recovery: Automatically retry after timeout
Interview Questions & Answers
Q1: Your single MySQL server is maxed out. Scale vertically or horizontally?
Answer: First check:
- Is it CPU-bound or I/O bound?
- Can you optimize queries first?
- top / iostat shows where bottleneck is
- Add indexes, denormalize, cache - Solves 80% of issues without scaling
If optimization doesn't help:
Short-term (vertical scaling):
- Upgrade instance: m5.large → m5.4xlarge
- Gains 3-6 months
- Quick, no downtime (if prepared replica)
Long-term (horizontal scaling):
- Shard by user_id or geographic region
- Each shard gets replicas
- Unlimited scalability
- Application-level complexity
Recommended path:
Click to view code
Month 1-3: Optimize queries + vertical scale (buy time)
Month 3-6: Implement sharding (parallel effort)
Month 6+: Run sharded system (replaces single DB)
Q2: Design Netflix architecture for 100M users. What makes it resilient?
Answer: Key principles (Netflix Chaos Engineering):
- Multi-AZ deployment
- Users in multiple regions/AZs - Any single AZ failure doesn't affect users
- Stateless services
- Any server can be killed - No local state = easy horizontal scaling
- Bulkheads (isolation)
- Video recommendations service fails - Homepage still works - Isolated failure domains
- Timeouts + Circuit breakers
- If recommendation service slow (>1s) - Don't wait, use cached/default recommendations - Fail fast
- Retry with backoff
- Transient failures auto-recover - Exponential backoff prevents thundering herd
Architecture:
Click to view code
User request → Load balancer (multi-AZ)
↓
API Gateway (stateless)
↓
Microservices (Video, Recommendation, Billing, etc.)
↓
Each service has:
- Multiple instances (horizontal scale)
- Read replicas (HA)
- Circuit breakers (cascading failure prevention)
- Cache layer (Redis)
- Timeout/retry logic
Failure scenario:
Click to view code
Video service crashes
↓
Circuit breaker opens (fails fast)
↓
User gets cached video list
↓
Meanwhile: Auto-scaler spins up new instances
↓
Service recovers in 30 seconds
↓
Circuit breaker closes (normal)
Q3: Design a cache invalidation strategy.
Answer: Strategies (in order of preference):
- TTL (Time-To-Live)
- Simple: SET key value EX 3600 (1 hour TTL) - Pro: No coordination needed - Con: Stale data until expiry
- Event-based invalidation
- On data update: PUBLISH user:123:updated - Subscribers delete cache key - Pro: Instant invalidation, no stale data - Con: Complex, requires pub/sub
- Hybrid (recommended)
- TTL + event-based - Event invalidates immediately - TTL catches missed events (safety net)
Implementation:
Click to view code (python)
# Set with TTL
redis.setex("user:123", 3600, user_data)
# On update
db.update_user(123, new_data)
redis.publish("user:123:updated", json.dumps(new_data))
# Subscriber
def on_user_update(message):
user_id = message['user_id']
redis.delete(f"user:{user_id}") # Immediate invalidation
# Next request re-populates from DB
Cache stampede mitigation:
Click to view code
Multiple requests for same key after expiry
→ All hit DB simultaneously
→ Database overloaded
Solution: Probabilistic early expiry
python def getuser(userid): cached = redis.get(f"user:{userid}") if cached: ttl = redis.ttl(f"user:{userid}")
# If TTL < 10% remaining, refresh with probability if ttl < 360 and random.random() < 0.1: # One thread refreshes while others use stale refreshinbackground(user_id)
return cached
# Load from DB, set cache user = db.get(userid) redis.setex(f"user:{userid}", 3600, user) return user
Click to view code
---
### Q4: Design a system to survive an entire AWS region failure.
**Answer:**
**Active-Active multi-region** (required):
User requests → DNS routing (Route 53) ↓ Geolocation/latency routing ↓ Region 1 (US East) Region 2 (US West) - App instances - App instances - Database - Database - Cache - Cache
User in California → Route 53 sends to US West (lower latency) User in Virginia → Route 53 sends to US East
If US East crashes: - Route 53 detects health check failure - All users → US West (higher latency but working)
Click to view code
**Database replication**:
Multi-master replication (MySQL, DynamoDB) Region 1 DB ←→ Region 2 DB
Writes in Region 1 automatically replicate to Region 2 Writes in Region 2 automatically replicate to Region 1
If Region 1 fails: Users automatically failover to Region 2 All data already there (already replicated) Zero data loss
Click to view code
**Cache/state**:
Don't store session state locally Use distributed cache: DynamoDB or Redis Cluster - Spans multiple regions - User session survives region failure
Click to view code
**Cost trade-off**:
Single region: $1M/month Multi-region: $2M/month (2x cost)
Benefit: Can survive region failure Netflix generates $300M/day revenue Region outage = $12M/hour loss
2x cost is cheaper than 1 hour downtime
Click to view code
---
### Q5: How many load balancers do you need for 10M concurrent connections?
**Answer:**
**Calculation**:
Modern load balancer (AWS ALB):
- Max 25,000 new connections per second
- 1M concurrent connections per instance
For 10M concurrent connections:
- 10M / 1M = 10 load balancers needed
But add redundancy:
- 2 regions × 10 LBs = 20 LBs
- Each region can lose 1 LB without impact
- High availability
Click to view code
**LB architecture**:
DNS (Route 53) ↓ US East: LB1, LB2, LB3, LB4, LB5 (can lose 1) US West: LB6, LB7, LB8, LB9, LB10 (can lose 1)
If LB1 fails: - Route 53 detects - Traffic → LB2-5 (4 LBs instead of 5) - Users unaffected
Click to view code
**Connection distribution**:
Each connection = one TCP flow through LB LB tracks: sourceip, destip, port
With 10M connections:
- ~1M per LB
- Memory per connection: ~1-2KB
- 1M × 2KB = 2GB per LB (acceptable)
**Why not one giant LB?**
- Single point of failure
- Geographic latency (users far away)
- Doesn't scale beyond 1M connections