Communication Patterns

Synchronous vs Asynchronous

| Aspect | Synchronous (HTTP/gRPC) | Asynchronous (Message Queue) | When to Use |
|---|---|---|---|
| Coupling | Tight (caller waits) | Loose (fire and forget) | Sync: real-time; Async: decouple services |
| Latency | Low (immediate response) | Higher (eventual processing) | Sync: user-facing; Async: background jobs |
| Reliability | Retry in caller | Built-in retry, DLQ | Async when tasks must not be lost |
| Scalability | Limited (caller blocks) | High (queue absorbs spikes) | Async for traffic spikes |

Why Choose Synchronous:

  • User needs immediate response (login, search)
  • Simple request-response workflows
  • Low latency critical

Why Choose Asynchronous:

  • Long-running tasks (video transcoding)
  • Decouple services for resilience
  • Handle traffic spikes

Tradeoff Summary:

  • Sync: Simple + Immediate ↔ Tight coupling
  • Async: Decoupling + Scalability ↔ Complexity
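The tradeoff is easy to see in code. Below is a minimal sketch using Python's stdlib `queue` and `threading` as a stand-in for a real message broker: the sync handler blocks until the work is done, while the async handler only enqueues and acknowledges.

```python
import queue
import threading

task_queue = queue.Queue()
results = []

# Synchronous: the caller does the work and waits for the result.
def handle_sync(request):
    return f"processed {request}"

# Asynchronous: the caller enqueues and returns immediately;
# a background worker drains the queue later.
def handle_async(request):
    task_queue.put(request)
    return "accepted"

def worker():
    while True:
        request = task_queue.get()
        if request is None:  # sentinel: shut down
            break
        results.append(f"processed {request}")
        task_queue.task_done()

t = threading.Thread(target=worker)
t.start()

sync_result = handle_sync("req-1")   # blocks until done
async_ack = handle_async("req-2")    # returns instantly with an ack
task_queue.put(None)
t.join()
```

The caller of `handle_async` gets only an acknowledgment; any result must be delivered later (callback, polling, notification), which is exactly the extra complexity the tradeoff summary refers to.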

Client Updates: Short Polling vs Long Polling vs WebSockets vs Server-Sent Events (SSE)

| Aspect | Short Polling | Long Polling | WebSockets | SSE |
|---|---|---|---|---|
| Connection Model | Client polls on interval (e.g., every 5s) | Client holds request open until data or timeout, then re-issues | Full-duplex persistent TCP (via HTTP upgrade) | One-way server → client over persistent HTTP |
| Latency | Interval-based; higher if interval large | Low; server responds immediately when data ready | Very low; bi-directional | Low; server pushes as events occur |
| Server Load | Many requests; wasted when nothing changes | Fewer requests; each may tie up a worker until data | Few connections; efficient after handshake | Few connections; efficient for push |
| Scalability Pain | High QPS, connection overhead | Thread/conn held per client; needs async IO | Many open sockets; needs load balancers/proxies that handle sticky/WS | Many open connections; similar infra to WS but simpler |
| Use When | Simple/low-traffic; no server push needed | Need near-real-time but infra limited to HTTP; moderate scale | Interactive, two-way updates (chat, games, collab) | Real-time, server-to-client only (tickers, notifications, logs) |
| Drawbacks | Wasted cycles, stale data between polls | Holding connections; timeouts; intermediate proxies can drop | More complex protocol, connection mgmt, backpressure | One-way only; older browsers need polyfills; retry/backoff handling |
| Examples | CRON-like dashboard refresh | Live score updates without WS/SSE support | Chat apps, multiplayer games, collaborative docs | Stock quotes, live comments, monitoring dashboards |

How to choose:

  • Start with long polling if you need push-ish behavior but are limited to plain HTTP and modest scale.
  • Use WebSockets for interactive, high-frequency bi-directional flows or when clients need to push frequently.
  • Use SSE for server-to-client streaming where simplicity and HTTP semantics matter (auto-reconnect, events).
  • Reserve short polling for low-QPS or legacy paths where real-time is not critical and change rate is low.

Tradeoff Summary:

  • Short Polling: Easiest to add ↔ Latency + wasted requests at scale
  • Long Polling: Near-real-time over HTTP ↔ Held connections, proxy timeouts
  • WebSockets: Full-duplex + lowest latency ↔ Infra complexity (sticky sessions, scaling, backpressure)
  • SSE: Simple server→client push ↔ One-way only, needs reconnect logic
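Long polling is the least obvious of the four mechanics. A minimal simulation of it, using a `threading.Event` in place of a real held-open HTTP request: the "request" blocks until data arrives or a timeout fires, and on timeout the client simply re-issues it.

```python
import threading
import time

new_data = threading.Event()
latest = {}

def publish(message):
    """Server side: store the update and wake any held request."""
    latest["message"] = message
    new_data.set()

def long_poll(timeout=30.0):
    """Hold the 'request' open until data arrives or the timeout expires."""
    if new_data.wait(timeout):
        new_data.clear()
        return latest["message"]
    return None  # timeout: the client re-issues the poll

# Simulate the server publishing shortly after the client connects.
threading.Timer(0.1, publish, args=["score: 2-1"]).start()

start = time.monotonic()
result = long_poll(timeout=5.0)
elapsed = time.monotonic() - start
```

Note the key property: the client gets the update almost as soon as it exists, not at the next fixed polling interval, at the cost of the server holding a connection open per waiting client.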

Chaos Engineering Levels (Netflix Playbook)

| Tool | What It Does | When to Use | Notes/Tradeoffs |
|---|---|---|---|
| Chaos Monkey | Terminates random servers/instances | Every service; baseline resiliency validation | Catches single-instance brittleness; assumes stateless or fast reattach to state |
| Chaos Gorilla | Simulates losing an entire AZ | Critical systems where downtime hits revenue/reputation | Validates multi-AZ design, autoscaling, and failover runbooks |
| Chaos Kong | Simulates losing an entire region | Rare; only for global, highest-availability systems | Expensive to practice; requires active-active or warm standby cross-region |

Context: Netflix can run these because services are mostly stateless, globally aware, and designed for availability from day one (multi-AZ/region, retries, circuit breakers, resilient data stores).

Practical guidance:

  • Monkey: default for all services; start here to harden base reliability.
  • Gorilla: enable for revenue/brand-critical paths once multi-AZ is proven.
  • Kong: usually overkill; reserve for globally distributed, tier-0 systems with cross-region architecture and clear blast-radius controls.
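The Monkey level fits in a few lines. This is a toy stand-in, not Netflix's actual tooling: the `instances` list and `chaos_monkey` function are hypothetical, and "terminate" here just removes an ID from a list so you can verify the survivors would still serve traffic.

```python
import random

# Hypothetical fleet: instance IDs behind one service.
instances = ["i-01", "i-02", "i-03", "i-04"]

def chaos_monkey(fleet, rng):
    """Terminate one randomly chosen instance (here: drop it from the list)."""
    victim = rng.choice(fleet)
    fleet.remove(victim)
    return victim

rng = random.Random(42)  # seeded for a repeatable drill
killed = chaos_monkey(instances, rng)
# The real exercise: after `killed` is gone, health checks, retries, and
# load balancing must keep the service responding from the survivors.
```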

Message Queue Patterns

Publish-Subscribe (Pub/Sub) vs Point-to-Point (Queue)

| Pattern | Use Case | Example |
|---|---|---|
| Pub/Sub | One event → Multiple subscribers | Order placed → Email service, Notification service, Analytics service all receive |
| Point-to-Point (Queue) | One producer → One consumer | Payment processing → Single payment processor handles each payment |

Pub/Sub architecture:

Order service publishes "order.created" event
                            ↓
        Multiple subscribers listen:
        - Email service: sends confirmation
        - Inventory service: decrements stock
        - Analytics service: logs metrics
        - Notification service: sends push notification
        
Each subscriber processes independently
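A toy in-memory broker makes the fan-out concrete. The `Broker` class below is illustrative only (a real system would use something like Kafka, SNS, or Redis pub/sub, with per-subscriber queues instead of direct calls): one `publish` reaches every subscriber.

```python
from collections import defaultdict

class Broker:
    """Minimal in-memory pub/sub: every subscriber receives every event."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        # In production this would enqueue per subscriber so each one
        # processes independently; here we call handlers directly.
        for handler in self.subscribers[topic]:
            handler(event)

broker = Broker()
received = {"email": [], "inventory": [], "analytics": []}

broker.subscribe("order.created", lambda e: received["email"].append(e))
broker.subscribe("order.created", lambda e: received["inventory"].append(e))
broker.subscribe("order.created", lambda e: received["analytics"].append(e))

broker.publish("order.created", {"order_id": 123})
```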

Queue architecture:

Pending tasks → Message Queue
              ↓
         Worker 1: process
         Worker 2: process
         Worker 3: process
         
Each task processed by exactly one worker
If worker fails, queue re-delivers to another worker
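The exactly-one-consumer behavior can be sketched with Python's thread-safe `queue.Queue` standing in for a real queue such as SQS or RabbitMQ: competing workers pop from the same queue, and each task is handled by exactly one of them.

```python
import queue
import threading

tasks = queue.Queue()
processed = []
lock = threading.Lock()

def worker(worker_id):
    while True:
        task = tasks.get()
        if task is None:  # sentinel: this worker shuts down
            break
        with lock:
            processed.append((worker_id, task))
        tasks.task_done()

# One producer enqueues 100 tasks.
for i in range(100):
    tasks.put(f"payment-{i}")

# Three competing consumers drain the same queue.
threads = [threading.Thread(target=worker, args=(w,)) for w in range(3)]
for t in threads:
    t.start()
for _ in threads:
    tasks.put(None)  # one sentinel per worker
for t in threads:
    t.join()
```

Every task appears in `processed` exactly once, regardless of which worker picked it up; re-delivery on worker failure is what the real broker adds on top of this.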

Interview Questions & Answers

Q1: Design a payment system. Sync or async architecture?

Answer: Hybrid approach (sync for the critical path, async for everything else):

User clicks "Pay" → 
  1. Sync: Validate card (must be instant)
         Process payment (Stripe API call)
         If success → return confirmation to user
         If fail → return error immediately
  
  2. Async: After sync success
         - Update order status
         - Send confirmation email
         - Log audit trail
         - Update analytics
         - Send receipt SMS

Why hybrid?

  • Sync (payment): User needs immediate feedback
  • Async (notifications): Email/SMS don't need to block user

Architecture:

Payment request (sync) → Stripe API → DB update
                         ↓
                    Success/Failure
                         ↓
                    If success: queue async tasks
                         ↓
            [Email, SMS, Analytics, Audit log]
            (process in background)

Why not pure async?

  • User can't see if payment succeeded
  • Risk of double payments (user clicks twice)
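A minimal sketch of the hybrid flow. `charge_card` is a hypothetical stand-in for the synchronous Stripe call, and a stdlib queue stands in for the async task queue; the point is only the split between what blocks the user and what doesn't.

```python
import queue

follow_ups = queue.Queue()

def charge_card(order_id, amount_cents):
    # Stand-in for the synchronous payment-provider call;
    # assume success for this sketch.
    return {"status": "success", "charge_id": f"ch_{order_id}"}

def pay(order_id, amount_cents):
    # 1. Sync: charge the card. The user waits for this result,
    #    so success/failure is known before we respond.
    result = charge_card(order_id, amount_cents)
    if result["status"] != "success":
        return {"paid": False}

    # 2. Async: everything that should not block the user
    #    goes onto the queue for background workers.
    for task in ("update_order", "send_email", "audit_log", "send_sms"):
        follow_ups.put((task, order_id))
    return {"paid": True, "charge_id": result["charge_id"]}

response = pay("42", 1999)
```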

Q2: Design Twitter's tweet notification system. Which pattern?

Answer: Pub/Sub pattern because:

  • One tweet → millions of followers
  • Multiple subscribers (each gets notified differently)

Architecture:

User tweets → Tweet service publishes "tweet.created"
                            ↓
Multiple subscribers:
  1. Notification service → Push notifications to followers
  2. Timeline service → Update follower timelines
  3. Search service → Index tweet for search
  4. Analytics service → Log tweet metrics
  5. Cache service → Update Redis caches

Each subscriber processes independently
If notification service crashes, tweet still indexed and cached

Why Pub/Sub, not Queue?

  • Queue = 1 consumer per task
  • Pub/Sub = N consumers per event
  • Saves duplicating "send notification, update timeline, index tweet" logic

Scale consideration:

Influencer tweets → 50M followers
1 event → 50M notifications needed

With Pub/Sub:
- Publish once
- Notification service scales horizontally (1000 workers)
- Each worker handles 50K notifications

With Queue:
- Would need 50M messages in queue
- Inefficient
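One way the notification service can split that fan-out across its workers is simple modulo sharding. A sketch, with the follower count scaled down and the sharding scheme assumed for illustration:

```python
def shard_followers(follower_ids, num_workers):
    """Partition a large follower list so each worker notifies one chunk."""
    shards = [[] for _ in range(num_workers)]
    for fid in follower_ids:
        shards[fid % num_workers].append(fid)
    return shards

followers = list(range(1000))  # stand-in for 50M follower IDs
shards = shard_followers(followers, num_workers=4)

# Each worker would now iterate its own shard and send notifications;
# flattening the shards shows every follower is covered exactly once.
notified = [fid for shard in shards for fid in shard]
```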

Q3: Your API has spiky traffic (100→10,000 req/sec). Sync or async?

Answer: Async with queue because:

  • Queue absorbs spikes
  • Workers process at steady rate

Architecture:

Normal load (100 req/sec):
  Request → Process (sync)
         ↓
      DB update
         ↓
      Return response (50ms)

Traffic spike (10,000 req/sec):
  Request → Queue (instant)
         ↓
      Return "accepted" (1ms)
         ↓
    Workers process from queue at 500 req/sec
         ↓
    Takes ~20 seconds to clear spike

Without queue:
  10,000 requests hit service
  Service crashes (can't handle)
  Users get 500 errors

Key benefits:

  1. Prevents crashes: Queue absorbs spikes
  2. Graceful degradation: Slower processing, but all requests handled
  3. Predictable latency: Workers at steady state

Implementation:

import json
import uuid

# `app`, `queue`, and `db` are placeholders for your web framework,
# message-queue client, and database layer.

@app.post("/process")
def process_job(data):
    # Instead of processing here:
    # db.process(data)  # would crash under load

    # Queue it and return immediately:
    queue.push("jobs", json.dumps(data))
    return {"status": "queued", "job_id": str(uuid.uuid4())}

# Separate worker pool, scaled independently of the API servers
def worker():
    while True:
        job = queue.pop("jobs")  # blocking pop
        db.process(json.loads(job))

Q4: WebSocket vs Long Polling for live notifications?

Answer: WebSocket for most cases, but long polling has advantages:

Use WebSocket when:

  • Need bi-directional communication
  • High-frequency updates (100+ per second)
  • Low latency critical (<100ms)
  • Team can handle stateful infrastructure
  • Example: Chat, multiplayer gaming

Use Long Polling when:

  • Server→client only (no client→server push)
  • Moderate update frequency (< 10/sec)
  • Simpler infrastructure (no sticky sessions)
  • Works through standard HTTP load balancers and proxies
  • Example: Notifications, live feeds

Scaling comparison:

WebSocket (1 million concurrent):
- Each connection = TCP socket + memory state
- Sticky session required (user always routes to same server)
- 10 servers × 100K connections = complex state mgmt
- Need WebSocket-aware load balancer
- Memory overhead: ~1KB per connection = 1GB for 1M

Long Polling (1 million concurrent):
- Each active poll = HTTP request
- Stateless (can go to any server)
- Load balancer distributes freely
- Memory overhead minimal
- More HTTP requests (higher CPU)
- Better for CDN/simple infrastructure

Hybrid approach (recommended):

- WebSocket for active users (actively using the app)
- Long Polling fallback for inactive users (reduces load)
- Or: SSE as a middle ground (stateless, server-push only)

Q5: Design a retry mechanism for failed async tasks.

Answer: Exponential backoff with Dead Letter Queue (DLQ):

Task fails
  ↓
Retry attempt 1 (after 1 second)
  ↓
If still fails:
Retry attempt 2 (after 2 seconds)
  ↓
If still fails:
Retry attempt 3 (after 4 seconds)
  ↓
If still fails (max retries):
Move to DLQ (manual investigation)

Implementation:

from tenacity import retry, stop_after_attempt, wait_exponential

# `stripe`, `dlq`, `admin_alert`, TemporaryError, and PermanentError are
# illustrative placeholders, not real library APIs.
@retry(
    stop=stop_after_attempt(5),  # max 5 attempts
    wait=wait_exponential(
        multiplier=1,  # waits of 1, 2, 4, 8 seconds between the 5 attempts
        min=1,
        max=60
    )
)
def process_payment(order_id):
    try:
        stripe.charge(order_id)
    except TemporaryError:
        # Transient error: re-raise so the decorator retries
        raise
    except PermanentError:
        # Permanent error: retrying can't help, park it in the DLQ
        dlq.push("failed_payments", order_id)
        return

def dlq_processor():
    # Manually inspect failed tasks
    for order_id in dlq.get_all("failed_payments"):
        admin_alert(f"Payment failed for order {order_id}")

Why exponential backoff?

  • 1st retry at 1s: service might be momentarily down
  • 2nd retry at 2s: gives it time to recover
  • 3rd retry at 4s: more recovery time
  • Spacing retries out (ideally with added jitter) avoids a thundering herd of simultaneous retries

When to move to DLQ?

  • Max retries exceeded (5 attempts ≈ 15 seconds of backoff)
  • Permanent error detected (invalid payment info)
  • Task takes too long (timeout)
  • Manual queue for ops team review
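The backoff arithmetic is worth writing down. This small helper (mirroring the 1-2-4-8 schedule with a cap, in the spirit of tenacity's `wait_exponential`) computes the waits between attempts:

```python
def backoff_delays(max_attempts, base=1, cap=60):
    """Delays between attempts: base * 2**n, capped at `cap`.
    max_attempts attempts means max_attempts - 1 waits."""
    return [min(base * 2 ** n, cap) for n in range(max_attempts - 1)]

delays = backoff_delays(5)   # waits before retries 2 through 5
total_wait = sum(delays)     # total backoff before giving up to the DLQ
```

With 5 attempts the waits are 1, 2, 4, and 8 seconds, so a task spends about 15 seconds in backoff (plus processing time) before landing in the DLQ.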