Apache Kafka - Complete Deep Dive
What is Kafka?
Apache Kafka is a distributed event streaming platform that enables:
- Publishing & Subscribing: Multiple producers and consumers
- Durability: Persistent storage with replication
- Real-time Processing: Millisecond latency for stream processing
- High Throughput: Millions of events/second capacity
Core Characteristics
| Aspect | Benefit |
|---|---|
| Distributed | Horizontal scalability across servers |
| Fault-tolerant | Automatic replication and failover |
| Durable | Persists to disk with configurable retention |
| Ordered | Per-partition event ordering guaranteed |
| Replicated | 3x copies prevent data loss |
Kafka Architecture Overview
Producers → [Brokers (1,2,3...)] ← Consumers
↓
Partitioned Topics
↓
Metadata: ZooKeeper/KRaft
Key layers:
- Producers: Publish events to topics
- Brokers: Store and replicate partitions (cluster)
- Topics: Named event streams with partitions
- Partitions: Parallel storage units (leader + replicas)
- Consumers: Subscribe to partitions via consumer groups
- Metadata: ZooKeeper or KRaft manages cluster state
Core Components
1. Topics and Partitions
Topic: A stream/channel of events (like a table in a database)
Topic: user-events
Partition 0: [event1] → [event2] → [event3] → [event4]
Partition 1: [event5] → [event6] → [event7]
Partition 2: [event8] → [event9]
Why partitions?
- Parallelism: Multiple consumers can read different partitions simultaneously
- Throughput: Spread writes across partitions
- Ordering: Events within a single partition are strictly ordered; there is no global order across partitions
Partition assignment strategy:
- Round-robin: Distribute across partitions evenly
- By key: Events with same key go to same partition (ordering guaranteed)
# Example: User ID as key (kafka-python; keys are bytes unless a key_serializer is set)
producer.send(
    'user-events',
    value={'action': 'login'},
    key=b'user123'   # All user123 events go to the same partition
)
2. Brokers
Broker: A single Kafka server in the cluster
Each broker:
- Stores partition replicas
- Handles producer/consumer requests
- Replicates data to other brokers
Broker ID: 0, 1, 2, ..., N (unique identifier)
Broker 0:
- Topic A Partition 0 (leader)
- Topic A Partition 1 (replica)
- Topic B Partition 0 (replica)
Broker 1:
- Topic A Partition 0 (replica)
- Topic A Partition 1 (leader)
- Topic B Partition 0 (leader)
Broker 2:
- Topic A Partition 0 (replica)
- Topic A Partition 1 (replica)
- Topic B Partition 0 (replica)
3. Replication
Replication Factor: Number of copies of partition data
Topic: user-events
Replication Factor: 3
Partition 0:
Leader: Broker 0 (receives writes)
Replica 1: Broker 1 (copy)
Replica 2: Broker 2 (copy)
If Broker 0 dies:
Broker 1 becomes leader (automatic failover)
Writes/reads continue on Broker 1
In-Sync Replicas (ISR):
- Replicas that are fully caught up with the leader
- With acks=all, the leader waits for every ISR member to acknowledge before confirming the write
- If the ISR shrinks below min.insync.replicas, writes are rejected (NotEnoughReplicas error)
# Producer settings for durability (kafka-python style)
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    acks='all',   # Wait for all in-sync replicas to acknowledge
    retries=3
)
# Note: min.insync.replicas is a broker/topic setting, not a producer one;
# set it to 2 so acks='all' requires at least 2 replicas to have the write
4. Offsets and Partitions
Offset: Position of message in partition (0, 1, 2, 3...)
Partition 0:
Offset 0: {'user_id': 123, 'action': 'login'}
Offset 1: {'user_id': 456, 'action': 'purchase', 'amount': 99.99}
Offset 2: {'user_id': 789, 'action': 'logout'}
↑
Consumer reads from here
Consumer offset tracking:
- Kafka stores each consumer group's committed offset (in the __consumer_offsets topic)
- A consumer resumes from its last committed offset after a restart
- This gives at-least-once delivery by default; exactly-once additionally needs idempotent processing or transactions
5. Consumer Groups
Consumer Group: Set of consumers reading same topic together
Topic: user-events (4 partitions)
Consumer Group: analytics-team
├─ Consumer 1 → Partition 0
├─ Consumer 2 → Partition 1
├─ Consumer 3 → Partition 2
└─ Consumer 4 → Partition 3
Each partition read by exactly ONE consumer in group
Multiple consumers = parallel processing
Rebalancing: When consumer joins/leaves group
Initial: 2 consumers for 4 partitions
Consumer 1: Partitions 0, 1
Consumer 2: Partitions 2, 3
New consumer joins:
Rebalance triggered
Consumer 1: Partitions 0, 1
Consumer 2: Partition 2
Consumer 3: Partition 3
(4 partitions over 3 consumers gives a 2/1/1 split)
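A toy assignor sketch makes the rebalanced assignment concrete (illustrative only; real clients ship range, round-robin, and sticky assignors inside the consumer library):

```python
def assign_round_robin(partitions, consumers):
    """Distribute partition ids across consumers one at a time."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 4 partitions over 3 consumers -> a 2/1/1 split
print(assign_round_robin([0, 1, 2, 3], ["c1", "c2", "c3"]))
# -> {'c1': [0, 3], 'c2': [1], 'c3': [2]}
```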
Producers (Deep Dive)
Producer Flow
Producer Code:
send(topic, message)
↓
ProducerRecord created:
{topic, partition, key, value, timestamp, headers}
↓
Partitioner determines partition:
partition = hash(key) % num_partitions
↓
Add to batch buffer (waits for batch.size or linger.ms)
↓
Serializer converts to bytes
↓
Compression (gzip, snappy, lz4, zstd)
↓
Send to broker (batch of messages)
↓
Broker acknowledges (acks setting)
↓
Callback invoked (success or error)
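The partitioner step above can be sketched in Python (simplified; real Kafka clients hash keys with murmur2, and CRC32 here is only a deterministic stand-in):

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Key-based partitioning: the same key always lands on the same partition,
    which is what preserves per-key ordering."""
    return zlib.crc32(key) % num_partitions

# All events for user123 go to one partition, so their order is preserved
assert choose_partition(b"user123", 6) == choose_partition(b"user123", 6)
```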
Producer Acknowledgments (acks)
| Setting | Behavior | Durability | Speed |
|---|---|---|---|
| acks=0 | No wait for ack | Fire-and-forget; data loss possible | Fastest |
| acks=1 | Wait for leader ack | Leader acknowledged, replicas copying | Medium |
| acks=all | Wait for ISR ack | All replicas acknowledged | Slowest |
# Fire-and-forget (low latency, potential loss)
producer = KafkaProducer(
    acks=0,            # Don't wait
    batch_size=16384,
    linger_ms=10
)

# Balanced (medium latency, good safety)
producer = KafkaProducer(
    acks=1,    # Wait for leader
    retries=3
)

# Safe (higher latency, no data loss)
producer = KafkaProducer(
    acks='all',   # Wait for all ISR
    retries=3
)
# Pair with min.insync.replicas=2 on the broker/topic side
Batching and Performance
Batching parameters:
producer = KafkaProducer(
    batch_size=16384,            # Send when 16KB accumulated (or linger_ms elapses)
    linger_ms=10,                # Wait at most 10ms before sending
    compression_type='snappy'    # Compress each batch
)
Tradeoff:
batch_size=100B, linger_ms=0:
- Send every message immediately
- Latency: 1ms
- Throughput: Low (1K messages/sec)
batch_size=1MB, linger_ms=100:
- Accumulate 100ms worth of messages
- Latency: 100ms
- Throughput: High (1M messages/sec)
- Compression: 10:1 ratio (90% smaller)
Idempotent Producer
Problem: Network timeout causes duplicate sends
Producer sends message (offset 100)
→ Broker receives and acks
→ Ack lost (network issue)
→ Producer retries
→ Same message sent again (offset 101)
→ Duplicate in topic!
Solution: Enable idempotence
# kafka-python style; the Java and confluent clients call this enable.idempotence
producer = KafkaProducer(
    enable_idempotence=True,   # Broker deduplicates retried sends
    acks='all',
    retries=3
)
# Kafka tracks <ProducerId, SequenceNumber>
# Retried message has same sequence number
# Broker deduplicates automatically
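A toy model of the broker-side bookkeeping (the real broker tracks sequence numbers per producer id and partition; a single partition is assumed here):

```python
class DedupLog:
    """Sketch of broker-side idempotence for one partition: remember the highest
    sequence number appended per producer id and drop retried duplicates."""
    def __init__(self):
        self.log = []
        self.last_seq = {}   # producer_id -> highest sequence appended

    def append(self, producer_id, seq, message):
        if self.last_seq.get(producer_id, -1) >= seq:
            return False   # retry of an already-appended send: dropped
        self.log.append(message)
        self.last_seq[producer_id] = seq
        return True

log = DedupLog()
log.append("p1", 0, "pay $100")
log.append("p1", 0, "pay $100")   # producer retried after a lost ack
print(log.log)  # -> ['pay $100']
```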
Consumers (Deep Dive)
Consumer Flow
Consumer Code:
poll(timeout=1000)
↓
Request messages from broker (fetch.min.bytes, fetch.max.wait.ms)
↓
Broker returns batch of messages
↓
Deserializer converts bytes to objects
↓
Application processes messages
↓
commitSync() or commitAsync() stores offset
↓
Coordinator tracks offset in __consumer_offsets topic
Offset Management
Automatic offset commit (default):
consumer = KafkaConsumer(
    'user-events',
    group_id='analytics-group',
    auto_offset_reset='earliest',   # Start from beginning if no offset
    enable_auto_commit=True,        # Auto-commit every 5 seconds
    auto_commit_interval_ms=5000
)

for message in consumer:
    process(message)
    # Offset auto-committed (up to 5s later)
Manual offset commit (safer):
consumer = KafkaConsumer(
    'user-events',
    group_id='analytics-group',
    enable_auto_commit=False   # Don't auto-commit
)

for message in consumer:
    try:
        process(message)
        consumer.commit()   # Only commit after success
    except Exception as e:
        log_error(e)
        # Don't commit; retry from same offset
Consumer Lag
Lag: How far behind the consumer is
Topic: user-events
Partition 0:
Latest offset (producer wrote): 100
Consumer offset: 85
Lag: 15
Lag = Latest offset - Consumer offset
High lag → Consumer slow or broken
Zero lag → Consumer caught up (real-time)
Monitoring lag:
# Lag per assigned partition (kafka-python style)
for tp in consumer.assignment():
    committed = consumer.committed(tp) or 0
    latest = consumer.end_offsets([tp])[tp]
    print(f"{tp}: lag={latest - committed}")
Rebalancing
When does rebalancing happen?
- New consumer joins group
- Consumer leaves (timeout, shutdown)
- New partitions added to topic
- Consumer calls leave_group()
Rebalancing process:
1. Stop consuming (all consumers pause)
2. Coordinator selects new partition assignment
3. Partitions revoked from old consumers
4. New assignment given to consumers
5. Resume consuming from new partitions
Impact:
- Pause: 1-30 seconds (depending on rebalance.timeout.ms)
- Messages can be processed twice around a rebalance (or skipped if offsets were committed early), so handle with care
Minimize rebalancing:
consumer = KafkaConsumer(
    group_id='analytics-group',
    session_timeout_ms=30000,       # Evict consumer after 30s of missed heartbeats
    heartbeat_interval_ms=10000,    # Heartbeat every 10s (~1/3 of session timeout)
    max_poll_interval_ms=300000,    # Must call poll() within 5 min
)
Zookeeper vs KRaft (Controller)
Zookeeper (Traditional)
Zookeeper Cluster:
├─ Zk1 (leader)
├─ Zk2
└─ Zk3
Kafka Brokers:
├─ Broker 0 ┐
├─ Broker 1 │→ Query Zookeeper for metadata
└─ Broker 2 ┘
Zookeeper stores:
- Broker list
- Topic configuration
- Partition leaders
- Consumer offsets (old versions)
Issues:
- Extra system to manage (operational overhead)
- Metadata changes take time (eventually consistent)
- Not scalable for millions of partitions
KRaft (Kafka Raft Consensus, Kafka 3.3+)
Kafka Brokers (with embedded controller):
├─ Broker 0 (controller) ← Elected leader
├─ Broker 1
└─ Broker 2
No external Zookeeper needed
Metadata stored in __cluster_metadata partition
Benefits:
- Simpler deployment (one system)
- Faster metadata updates
- Scales to millions of partitions
Configuration Tuning
Producer Performance
# Throughput optimization
batch.size=32768 # Larger batches
linger.ms=50 # Wait for more messages
compression.type=snappy # Reduce network traffic
acks=1 # Don't wait for replicas
buffer.memory=67108864 # Larger buffer (64MB)
# Durability optimization
acks=all # Wait for all replicas
retries=3 # Retry on failure
enable.idempotence=true # Prevent duplicates
Consumer Performance
# Throughput
fetch.min.bytes=1024 # Batch at least 1KB
fetch.max.wait.ms=500 # Wait up to 500ms
max.poll.records=500 # Process more records per poll
# Stability
session.timeout.ms=30000 # Timeout before rebalance
heartbeat.interval.ms=10000 # Send heartbeat every 10s
max.poll.interval.ms=300000 # Must poll within 5 min
Broker Configuration
# Replication
min.insync.replicas=2 # Minimum replicas for acks=all
default.replication.factor=3 # Default copies
unclean.leader.election.enable=false # Don't elect out-of-sync leader
# Performance
num.network.threads=8 # Network threads
num.io.threads=8 # Disk I/O threads
socket.send.buffer.bytes=102400 # 100KB send buffer
socket.receive.buffer.bytes=102400 # 100KB receive buffer
Use Cases
1. Activity/Event Logging
Web servers → Kafka → Data warehouse
Real-time:
- Page views
- User clicks
- Search queries
- Video watches
Benefits:
- Decouple logging from business logic
- Multiple consumers (analytics, real-time dashboard, ML)
- Replay events for debugging
2. Metrics Collection
Servers → Prometheus → Kafka → Time-series DB
Metrics:
- CPU, memory, disk
- Request latency
- Error rates
- Custom application metrics
Benefits:
- Buffering during spikes
- Multiple metric systems (Prometheus, InfluxDB)
- Historical data in data lake
3. Real-time Analytics
User events → Kafka → Stream processor (Flink) → Results DB
Examples:
- Real-time recommendations
- Live dashboard metrics
- Anomaly detection
- Fraud detection
Benefits:
- Millisecond latency
- Stateful processing (session windows)
- Exactly-once semantics
4. ETL Pipelines
Databases → Kafka Connect → Kafka → Data warehouse
Benefits:
- CDC (Change Data Capture)
- Decoupling source/destination
- Exactly-once delivery
- Connector ecosystem
5. Microservices Communication
Service A → Kafka → Service B
Service A → Kafka → Service C
Benefits:
- Asynchronous communication
- Event sourcing
- Audit trail
- Temporal decoupling
Performance Benchmarks
Throughput
3-broker cluster, replication factor 3:
Producer: 1-2 million messages/sec (1KB each ≈ 1-2 GB/sec)
Consumer: 2-3 million messages/sec
Larger cluster (10 brokers):
Up to ~10 million messages/sec depending on configuration
Latency
End-to-end latency (producer → consumer):
acks=0: 1-5ms (fastest)
acks=1: 5-10ms (medium)
acks=all: 10-20ms (safe)
With compression and batching:
Average: 10-50ms for high throughput
Storage
1 billion messages (1KB each):
Raw: ~1TB
With compression (snappy, ~10:1): ~100GB
Retention (keeping 7 days at 100K msg/sec):
8.64 billion msgs/day ≈ 8.6TB/day raw → ~60TB for 7 days, or ~6TB with 10:1 compression (before 3x replication)
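A rough sizing helper for this retention math (sizes are before replication overhead; 1KB is taken as 1024 bytes):

```python
def retention_bytes(msgs_per_sec, msg_bytes, days, compression_ratio=1.0):
    """Bytes needed to retain `days` of traffic, before replication overhead."""
    per_day = msgs_per_sec * 86_400 * msg_bytes
    return per_day * days / compression_ratio

raw = retention_bytes(100_000, 1024, 7)             # ~62 TB raw
compressed = retention_bytes(100_000, 1024, 7, 10)  # ~6.2 TB at 10:1
print(f"{raw / 1e12:.1f} TB raw, {compressed / 1e12:.1f} TB compressed")
```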
Monitoring Kafka
Key Metrics to Monitor
Producer:
- record-send-rate (msg/sec)
- record-size-avg (bytes/message)
- batch-size-avg (messages/batch)
- compression-rate-avg (reduction %)
Consumer:
- records-consumed-rate (msg/sec)
- consumer-lag (offset gap)
- fetch-latency-avg (ms)
Broker:
- BytesInPerSec (producer throughput)
- BytesOutPerSec (consumer throughput)
- UnderReplicatedPartitions (replication issues)
- OfflinePartitionsCount (broker failures)
Using JMX Monitoring
# Enable JMX on Kafka (environment variables read by the startup scripts)
export JMX_PORT=9999
export KAFKA_HEAP_OPTS="-Xmx1G -Xms1G"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:+DisableExplicitGC"
# Monitor with tools
# - Prometheus + Kafka exporter
# - JConsole
# - Grafana dashboards
Interview Questions & Answers
Q1: Design Kafka for a ride-sharing service (10M events/sec)
Requirements:
- Track rides in real-time (100K rides/sec)
- Process by city (NYC, SF, LA)
- Handle surge pricing updates
- Multiple independent consumers (pricing, matching, analytics)
Topic Design:
Topic: rides
Partitions: 100
Replication Factor: 3
Retention: 7 days
Partition Key: city_id
Topic: surge-pricing
Partitions: 10
Replication Factor: 2
Retention: 1 hour
Partition Key: city_id
Consumer Groups:
| Group | Consumers | Purpose |
|---|---|---|
| matching-service | 10 | Real-time ride matching (1 consumer per 10 partitions) |
| pricing-service | 1 | Monitor all cities for surge pricing |
| analytics-batch | 1 | Daily aggregations (full replay) |
Design rationale:
- Ordering by city: city_id partition key ensures all city rides ordered (critical for replay)
- Parallelism: 100K rides/sec ÷ 100 partitions = 1K rides/sec per partition
- Isolation: Different consumer groups process independently
- Resilience: 3x replication prevents data loss
Q2: Messages aren't reaching your consumer. How do you debug?
Debugging checklist:
| Layer | Check |
|---|---|
| Producer Config | Correct broker? Correct topic? acks setting? |
| Network | Can reach broker? (telnet localhost:9092) Firewall? DNS? |
| Broker | Running? Disk space? Under-replicated partitions? |
| Code | Exception in callback? Message too large? Timeout? |
| Consumer | Right topic? Right starting offset? Group configured? |
Common mistakes & fixes:
# ❌ WRONG: Fire-and-forget, error silently lost
producer.send('my-topic', b'message')

# ✅ RIGHT: Wait for acknowledgment and catch errors
try:
    future = producer.send('my-topic', b'message')
    result = future.get(timeout=10)
except Exception as e:
    logger.error(f"Send failed: {e}")
    # Implement retry logic

# ❌ WRONG: Message larger than the broker's message.max.bytes (default ~1MB)
producer.send('my-topic', very_large_message)   # e.g. 10MB

# ✅ RIGHT: Compress or split before sending
if len(message) > MAX_MESSAGE_BYTES:   # mirror the broker's message.max.bytes
    message = compress(message)   # or split into chunks
Quick diagnosis commands:
# Verify broker is running
kafka-broker-api-versions.sh --bootstrap-server localhost:9092
# Check topic partitions and leaders
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my-topic
# Check consumer lag
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--group my-group --describe
Q3: Consumer crashes. Avoid message loss or duplicates?
Challenge: Exactly-once semantics (not at-least-once or at-most-once)
Solution: Manual offset commits with error handling
consumer = KafkaConsumer(
    'input-topic',
    group_id='processing-group',
    enable_auto_commit=False,     # Manual commit ONLY
    auto_offset_reset='earliest'
)

for message in consumer:
    try:
        # Process message
        result = process(message)
        # Commit ONLY after successful processing
        consumer.commit()
    except Exception as e:
        logger.error(f"Processing failed: {e}")
        # Don't commit; resume from same offset on restart
        continue
Behavior on crash:
State 1: Process message at offset 100
↓
State 2: Processing completes
↓
State 3: About to commit offset 100
↓
State 4: ⚠️ CRASH
On Restart:
Last committed offset: 99
Consumer resumes from offset 100
Re-processes message 100 (duplicate possible but safe)
Result: No message loss, possibly duplicate output
Advanced: Kafka transactions (Kafka 0.11+)
For zero duplicates with exactly-once semantics:
# Sketch using confluent-kafka, which exposes Kafka's transactional API
# (kafka-python does not support transactions)
from confluent_kafka import Producer, Consumer

producer = Producer({'bootstrap.servers': 'localhost:9092',
                     'transactional.id': 'processor-1'})
producer.init_transactions()

consumer = Consumer({'bootstrap.servers': 'localhost:9092',
                     'group.id': 'processing-group',
                     'isolation.level': 'read_committed',
                     'enable.auto.commit': False})
consumer.subscribe(['input-topic'])

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    producer.begin_transaction()
    result = process(msg)
    producer.produce('output-topic', result)
    # Consumed offsets are committed inside the same transaction as the output
    producer.send_offsets_to_transaction(
        consumer.position(consumer.assignment()),
        consumer.consumer_group_metadata())
    producer.commit_transaction()
    # All-or-nothing: output message and offset commit succeed or fail together
Crash scenarios with transactions:
- Crash before commit → rolled back, re-process on restart
- Crash after commit → already processed, skip on restart
Q4: Design Kafka for a system handling 10M events/sec with 99.99% uptime requirement.
Answer:
Scale analysis:
10M events/sec × 86,400 sec/day = 864 billion events/day
× 7 day retention = 6 trillion events stored
× 1KB/event = 6 petabytes
With compression (10:1): 600TB/week (reasonable)
Architecture:
Producers (Web servers, mobile apps):
├─ 100 producer threads
├─ Batch size: 32KB, linger: 50ms
├─ Compression: snappy (90% reduction)
└─ acks='all' (safety first)
↓
Kafka Cluster (AWS):
├─ 50 brokers (i3.2xlarge: high memory, NVMe SSD)
├─ 2000 partitions (40 partitions/broker)
├─ Replication factor: 3 (99.99% uptime)
├─ min_insync_replicas: 2
└─ Multi-AZ (3 availability zones)
↓
Consumers:
├─ Real-time (Flink): 100 consumer threads
├─ Analytics (Spark): batch job 4x/day
└─ Storage (S3): 1 consumer group
High availability strategy:
1. Replication (3x):
- Producer writes to leader
- Replicated to ISR (in-sync replicas)
- If leader dies: ISR takes over
- Zero data loss with acks='all'
2. Multi-region failover:
- Primary: US-East (50 brokers)
- Replica: US-West (50 brokers, read-only)
- MirrorMaker 2 replicates topics
- Switch on primary failure
3. Monitoring:
- Alert if UnderReplicatedPartitions > 0
- Alert if consumer lag > 10000
- Circuit breaker if broker availability < 99%
4. Planned maintenance (zero downtime):
- Rolling restart of brokers
- One broker at a time
- Replicas ensure no downtime
Configuration:
# Broker
num.replica.fetchers=4                          # Faster replication
replica.socket.receive.buffer.bytes=1048576     # 1MB
log.flush.interval.messages=1000000             # Rely on OS page cache; avoid per-message fsync
log.cleanup.policy=delete,compact               # Delete old segments, compact by key

# Network
num.network.threads=16
num.io.threads=16
socket.send.buffer.bytes=1048576                # 1MB
socket.receive.buffer.bytes=1048576             # 1MB

# Topic defaults
default.replication.factor=3
min.insync.replicas=2
log.retention.hours=168                         # 7 days
compression.type=snappy
Expected performance:
Throughput:
10M msg/sec ÷ 2000 partitions = 5000 msg/sec per partition
At 1KB/msg: 5000 × 1KB = 5MB/sec per partition
2000 partitions × 5MB/sec = 10GB/sec cluster-wide (before compression)
Latency:
P99: < 50ms (end-to-end)
P99.99: < 100ms
Availability:
Uptime: 99.99% (52 minutes downtime/year)
With 3x replication: Achievable
Q5: How would you implement exactly-once delivery in a payment processing pipeline?
Answer:
Challenge: Payment must be processed exactly once (not 0 or 2 times)
Scenario: Customer pays $100
At-most-once (bad):
- Message lost in transit → Payment never charged
- Customer pays but not recorded
At-least-once (bad):
- Message retried → Charged twice
- Customer charged $200
Exactly-once (good):
- Payment processed once
- Idempotent processing
- No duplicates
Solution: Kafka + Idempotent Consumer + Database
from kafka import KafkaConsumer
import json
import psycopg2

consumer = KafkaConsumer(
    'payment-requests',
    group_id='payment-processor',
    enable_auto_commit=False,          # Manual commit
    isolation_level='read_committed'   # Only committed data
)
db = psycopg2.connect("dbname=payments user=postgres")

for message in consumer:
    payment_request = json.loads(message.value)
    idempotency_key = payment_request['idempotency_key']
    amount = payment_request['amount']
    user_id = payment_request['user_id']
    try:
        # Check if already processed (idempotency)
        cursor = db.cursor()
        cursor.execute(
            "SELECT id FROM payments WHERE idempotency_key = %s",
            [idempotency_key]
        )
        if cursor.fetchone():
            # Already processed, skip
            logger.info(f"Payment {idempotency_key} already processed")
        else:
            # New payment, process
            cursor.execute(
                """
                INSERT INTO payments (idempotency_key, user_id, amount, status)
                VALUES (%s, %s, %s, 'PENDING')
                """,
                [idempotency_key, user_id, amount]
            )
            db.commit()
            # Process with payment gateway
            try:
                charge_result = payment_gateway.charge(user_id, amount)
                cursor.execute(
                    "UPDATE payments SET status = %s WHERE idempotency_key = %s",
                    ['SUCCESS', idempotency_key]
                )
                db.commit()
                # Send confirmation (assumes a KafkaProducer named `producer`)
                producer.send(
                    'payment-confirmations',
                    value={
                        'idempotency_key': idempotency_key,
                        'status': 'SUCCESS',
                        'charge_id': charge_result.id
                    }
                )
            except PaymentGatewayError:
                cursor.execute(
                    "UPDATE payments SET status = %s WHERE idempotency_key = %s",
                    ['FAILED', idempotency_key]
                )
                db.commit()
                raise
        # Commit offset ONLY after database write
        consumer.commit()
    except Exception as e:
        # Don't commit; will retry
        logger.error(f"Payment processing failed: {e}")
        # Consumer resumes from last committed offset
        db.rollback()
        continue
Why this works:
Scenario 1: Consumer crashes after charge but before commit
Restart:
- Resume from last committed offset
- See same payment request again
- Idempotency check finds existing record
- Skip (no duplicate charge)
Scenario 2: Payment gateway timeout
Exception caught:
- Database rollback
- Offset not committed
- Retry with exponential backoff
- Eventually succeeds or explicit error
Scenario 3: Duplicate message from producer retry
Idempotency key identical:
- First attempt: INSERT succeeds
- Retry: INSERT fails (duplicate key)
- Query finds existing record
- Returns same result (idempotent)
Database schema:
CREATE TABLE payments (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    idempotency_key VARCHAR(255) UNIQUE NOT NULL,  -- UNIQUE also creates a fast-lookup index
    user_id BIGINT NOT NULL,
    amount DECIMAL(10, 2) NOT NULL,
    status VARCHAR(50) DEFAULT 'PENDING',
    created_at TIMESTAMP DEFAULT NOW(),
    processed_at TIMESTAMP
);

-- PostgreSQL creates indexes separately, not inline in CREATE TABLE
CREATE INDEX idx_payments_user_id ON payments (user_id);
CREATE INDEX idx_payments_status ON payments (status);

CREATE TABLE payment_logs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    payment_id UUID REFERENCES payments(id),
    event VARCHAR(50),        -- 'INITIATED', 'SUCCESS', 'FAILED'
    details JSONB,
    created_at TIMESTAMP DEFAULT NOW()
);
Kafka Patterns & Best Practices
1. Fan-out (One-to-many)
Single producer → Kafka topic → Multiple independent consumers
2. Event Sourcing
All state changes stored immutably; reconstruct state by replaying events
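A minimal fold over an event log shows the replay idea (the event shape here is hypothetical):

```python
def replay(events):
    """Rebuild current balances by replaying the immutable event log in order."""
    state = {}
    for e in events:
        delta = e["amount"] if e["type"] == "deposit" else -e["amount"]
        state[e["account"]] = state.get(e["account"], 0) + delta
    return state

events = [
    {"account": "a1", "type": "deposit", "amount": 100},
    {"account": "a1", "type": "withdraw", "amount": 30},
]
print(replay(events))  # -> {'a1': 70}
```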
3. CQRS
Command side (writes) and Query side (reads) separated with Kafka as backbone
Kafka vs Alternatives
| System | Throughput | Latency | Best For |
|---|---|---|---|
| Kafka | 1M+/sec | 10-100ms | Event streaming, replay, durability |
| RabbitMQ | 100K/sec | 1-5ms | Task queues, RPC |
| Redis Streams | 1M+/sec | 1-5ms | Real-time, low latency |
| AWS Kinesis | Managed | ~200ms-1s | AWS-native, serverless |
| Pulsar | 1M+/sec | 1-5ms | Multi-tenancy, geo-replication |
Disaster Recovery: Architecture Comparison
Single-Region Deployment
| Aspect | Detail |
|---|---|
| Failure | Hard outage (user traffic down, producers blocked) |
| Data Loss | Yes, unless producers buffer to durable storage |
| RTO | Hours (need region recovery or complete rebuild) |
| RPO | Could be high without producer-side buffering |
| Mitigation | Implement producer buffering (disk/S3/DB) |
Interview answer: "Without buffering, hard outage = data loss. With buffering, we delay processing but preserve events."
Active-Passive (Replica in Standby Region)
| Aspect | Detail |
|---|---|
| Failure | Producers/consumers switch to replica region |
| Data Loss | Bounded by replication lag (RPO: typically < 1 sec) |
| RTO | 5-30 minutes (failover + rebalancing) |
| Duplicates | Likely (offset mismatch); use idempotent consumers |
| Setup | MirrorMaker/Cluster Linking replication |
| Critical | Must auto-redirect producers (DNS/LB) |
Interview answer: "Failover gives availability with bounded loss (lag) and requires idempotent consumer design."
Active-Active (Both Regions Serve Traffic)
| Aspect | Detail |
|---|---|
| Failure | West-2 continues uninterrupted (minimal downtime) |
| Data Loss | Small (in-flight events in dead region may be lost) |
| RTO | Minutes (already running in standby region) |
| Complexity | High (split brain, conflicts, ordering limits) |
| Requirements | Idempotent producers/consumers, conflict resolution |
| Ordering | Per-key only (global ordering impossible) |
Interview answer: "Minimal downtime but high complexity: handle duplicates, conflicts, and ordering carefully."
Database Integration (Critical!)
Problem: Kafka failover is useless if your DB is still single-region
| Scenario | Outcome |
|---|---|
| DB only in west-1 | Writes fail in west-2 → effective outage despite Kafka failover |
| DB with geo-replication | West-2 can serve reads/writes (eventual consistency trade-off) |
Key takeaway: Kafka DR must be paired with database DR
Producer-Side Buffering Pattern
Without buffering (block on failure):
- Fewer moving parts
- Request failures on outage
- Data loss possible
With buffering (disk/S3/DB queue):
- App stays responsive
- Events replay after recovery
- Backlog spike on recovery (catch-up phase)
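A minimal sketch of the buffering pattern, with an in-memory list standing in for the durable disk/S3/DB queue a real system would use:

```python
class BufferingProducer:
    """On send failure, park the event locally and replay it after recovery."""
    def __init__(self, send_fn):
        self.send_fn = send_fn
        self.buffer = []

    def send(self, event):
        try:
            self.send_fn(event)
        except ConnectionError:
            self.buffer.append(event)  # durable queue in a real deployment

    def replay(self):
        pending, self.buffer = self.buffer, []
        for event in pending:
            self.send(event)  # re-buffers again if the cluster is still down

delivered = []

def flaky(event):
    raise ConnectionError("broker unreachable")

p = BufferingProducer(flaky)
p.send({"id": 1})                 # outage: event is buffered, app stays responsive
p.send_fn = delivered.append      # cluster recovered
p.replay()                        # backlog drains during the catch-up phase
print(delivered)  # -> [{'id': 1}]
```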
Quick Reference: 30-Second Answers
Single-region:
"Regional outage = Kafka down. Without buffering,
we lose events. With buffering, we delay processing."
Active-passive:
"Failover to replica. Accept bounded loss (replication lag)
and duplicates. Use idempotent consumers."
Active-active:
"West-2 continues serving. Minimal downtime but handle
duplicates, conflicts, and per-key ordering only."
General:
"Pair Kafka failover with database DR. Replication alone
isn't enough—clients must auto-failover."
Summary & Key Takeaways
Kafka excels at:
- ✓ High-throughput systems (1M+ events/sec)
- ✓ Event-driven architectures (event sourcing, CQRS)
- ✓ Real-time processing (stream processing, analytics)
- ✓ Decoupled systems (microservices, async workflows)
- ✓ Durable systems (replication, persistence, replay)
Key challenges:
- ✗ Operational complexity (cluster management, monitoring)
- ✗ No global ordering (partition-scoped ordering only)
- ✗ Exactly-once requires idempotency + transactions
- ✗ Cross-region failover is complex (duplicates, conflicts)
Critical design questions:
- What's my target throughput (events/sec)?
- How long must I retain data?
- How many independent consumer groups?
- What's my availability SLA (RTO/RPO)?
- Can my consumers be idempotent?
- Do I need global or per-partition ordering?
- Is cross-region failover required?
Kafka — Practical Guide & Motivating Examples
There is a good chance you've heard of Kafka. According to their website, it's used by 80% of the Fortune 100. Apache Kafka is an open-source distributed event streaming platform that can be used either as a message queue or as a stream processing system. It excels in delivering high performance, scalability, and durability — engineered to handle vast volumes of data in real-time.
A Motivating Example: The World Cup
Imagine running a website providing real-time stats for World Cup matches. Each goal, booking, or substitution generates an event.
Step 1 — Basic Queue:
- Producer puts events on a queue
- Consumer reads events and updates the website
Step 2 — Scaling (1000 teams, all games simultaneously):
- Single server can't keep up → distribute the queue across servers
- Problem: How to maintain event ordering?
- Solution: Partition by game ID → all events for one game stay on the same queue (partition)
Step 3 — Consumer Scaling:
- Single consumer is overwhelmed → add more consumers
- Consumer groups ensure each partition is assigned to exactly one consumer in the group
- Under normal operation, each event is delivered to a single consumer
- In failure scenarios, Kafka's default at-least-once semantics mean a message could be reprocessed
Step 4 — Multiple Sports:
- Soccer events shouldn't go to the basketball website
- Topics separate event streams — consumers subscribe to specific topics
Basic Terminology and Architecture
| Concept | Description |
|---|---|
| Cluster | Multiple brokers working together |
| Broker | Individual server storing data and serving clients |
| Partition | Ordered, immutable, append-only sequence of messages (like a log file) |
| Topic | Logical grouping of partitions — how you publish/subscribe to data |
| Producer | Writes data to topics |
| Consumer | Reads data from topics |
| Consumer Group | Set of consumers where each partition is assigned to exactly one member |
Topic vs Partition: A topic is a logical grouping; a partition is a physical grouping. A topic can have multiple partitions across different brokers.
Queue vs Stream mode:
- Message queue: Each message processed by one consumer in a group, then effectively "consumed"
- Stream: Log is retained and can be replayed; multiple consumer groups read independently; continuous processing
How Kafka Works
Message Structure
A Kafka message (record) has four fields (all technically optional):
| Field | Purpose |
|---|---|
| Value | The payload |
| Key | Determines which partition the message goes to |
| Timestamp | When the message was created/ingested (ordering is by offsets, not timestamps) |
| Headers | Key-value metadata pairs |
Producer Example (kafkajs)
const { Kafka } = require('kafkajs');

const kafka = new Kafka({
  clientId: 'my-app',
  brokers: ['localhost:9092']
});

const producer = kafka.producer();
await producer.connect();
await producer.send({
  topic: 'my_topic',
  messages: [
    { key: 'key1', value: 'Hello, Kafka with key!' },
    { key: 'key2', value: 'Another message with a different key' }
  ],
});
Partition Determination (Two-Step Process)
1. Partition selection: Hash the message key → partition = hash(key) % num_partitions. Without a key, modern clients use a "sticky" partitioner (fill a batch for one partition, then rotate)
2. Broker assignment: Cluster metadata maps partitions to brokers; the producer sends directly to the correct broker
Append-Only Log Benefits
| Benefit | Why |
|---|---|
| Immutability | Simplifies replication, speeds recovery, avoids consistency issues |
| Efficiency | Minimizes disk seek times (sequential writes only) |
| Scalability | Simple mechanism enables horizontal scaling |
Each message gets a unique offset — a sequential ID indicating position in the partition. Consumers commit offsets to track progress and resume after failures.
Default delivery: At-least-once. If a consumer crashes after processing but before committing, the message will be reprocessed. Exactly-once requires idempotent producers + transactional APIs.
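A small simulation of the commit-after-processing loop shows why a crash yields a duplicate rather than a loss:

```python
def consume(log, committed, crash_before_commit_at=None):
    """Simulate at-least-once delivery: offsets are committed only after
    processing, so a crash between the two re-delivers the in-flight message."""
    processed = []
    offset = committed
    while offset < len(log):
        processed.append(log[offset])        # process the message
        if offset == crash_before_commit_at:
            return processed, committed      # crash: commit never happened
        committed = offset + 1               # commit the offset
        offset += 1
    return processed, committed

log = ["m0", "m1", "m2"]
out1, committed = consume(log, 0, crash_before_commit_at=1)  # crash after processing m1
out2, _ = consume(log, committed)                            # restart from committed offset
print(out1 + out2)  # -> ['m0', 'm1', 'm1', 'm2']  (m1 seen twice, nothing lost)
```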
Replication Model
| Component | Role |
|---|---|
| Leader replica | Handles all writes and (by default) reads for a partition |
| Follower replicas | Passively replicate from leader; ready for promotion on failure |
| ISR (In-Sync Replicas) | Followers that are fully caught up with the leader |
| Controller | Monitors broker health; manages leadership and replication dynamics |
Kafka 2.4+ supports consumer reads from follower replicas for latency optimization.
Consumer Example (kafkajs)
```javascript
const consumer = kafka.consumer({ groupId: 'my-group' });
await consumer.connect();
await consumer.subscribe({ topic: 'my_topic' });
await consumer.run({
  eachMessage: async ({ topic, partition, message }) => {
    console.log({
      value: message.value?.toString(),
      key: message.key?.toString()
    });
  },
});
```
Kafka uses a pull-based model — consumers actively poll for new messages. This lets consumers control their rate, simplifies failure handling, prevents overwhelming slow consumers, and enables efficient batching.
When to Use Kafka in Your Interview
As a message queue when:
- Processing can be done asynchronously (e.g., YouTube video transcoding — upload SD immediately, transcode via Kafka)
- Messages must be processed in order (e.g., Ticketmaster virtual waiting queue)
- You need to decouple producer and consumer for independent scaling
As a stream when:
- You need continuous real-time processing (e.g., Ad Click Aggregator)
- Messages need multiple consumers simultaneously (e.g., FB Live Comments as pub/sub)
Scalability Deep Dive
Single Broker Limits
- No hard limit on message size (configurable via `message.max.bytes`), but keep messages under 1MB for optimal performance
- Good hardware: ~1TB storage, up to 1M messages/second per broker
Anti-pattern: Don't store large blobs (videos, files) in Kafka. Store them in S3 and put a pointer message in Kafka.
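The pointer pattern can be sketched as below; `uploadToS3` is a stand-in for a real S3 client call, and the topic/field names are illustrative:

```javascript
// The blob goes to object storage; Kafka carries only a small reference.
async function uploadToS3(videoId, bytes) {
  // a real implementation would use an S3 client; we fabricate a URL here
  return `s3://videos/${videoId}.mp4`;
}

async function buildPointerMessage(videoId, bytes) {
  const url = await uploadToS3(videoId, bytes);
  return {
    key: videoId,
    value: JSON.stringify({ videoId, url, sizeBytes: bytes.length }),
  };
}

// producer.send({ topic: 'video-uploads', messages: [msg] }) would follow
buildPointerMessage('vid42', Buffer.alloc(1024)).then(m => console.log(m.value));
```

Consumers then fetch the blob from S3 on demand, keeping Kafka messages tiny and broker disks free of large payloads.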
Scaling Strategies
- Horizontal scaling: Add more brokers. Ensure topics have sufficient partitions to utilize new brokers
- Partitioning strategy: the main decision is choosing keys that distribute load evenly; `partition = hash(key) % num_partitions` (murmur2 by default)
Handling Hot Partitions
Example: An Ad Click Aggregator partitioned by ad ID. Nike launches a Lebron James ad → that partition is overwhelmed.
| Strategy | How It Works | Tradeoff |
|---|---|---|
| No key (default) | Sticky partitioner distributes roughly evenly | Lose ordering guarantees |
| Random salting | Add a random suffix to the key (e.g., `ad123_7`) | Complicates consumer-side aggregation |
| Compound key | Combine ad ID with region/user segment | Better distribution if attributes vary independently |
| Back pressure | Slow down the producer when lag is too high | Reduces throughput |
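The random-salting row can be made concrete with a short sketch. The number of salts (4 here) and the `_` separator are assumptions; the key point is that the consumer must strip the salt before aggregating:

```javascript
// Producer side: spread one hot ad ID across several salted keys.
const NUM_SALTS = 4;

function saltedKey(adId) {
  return `${adId}_${Math.floor(Math.random() * NUM_SALTS)}`;
}

// Consumer side: recover the original key before counting clicks.
function baseKey(salted) {
  return salted.slice(0, salted.lastIndexOf('_'));
}

function aggregateClicks(messages) {
  const counts = {};
  for (const m of messages) {
    const ad = baseKey(m.key);
    counts[ad] = (counts[ad] || 0) + 1;
  }
  return counts;
}

const msgs = Array.from({ length: 8 }, () => ({ key: saltedKey('ad123') }));
console.log(aggregateClicks(msgs)); // { ad123: 8 }
```

The tradeoff from the table is visible in the code: clicks for `ad123` now arrive on up to 4 partitions, so per-ad ordering is lost and the final count must be merged across them.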
Fault Tolerance and Durability
Replication: Each partition replicated across multiple brokers. acks=all ensures message is acknowledged only when all ISRs have received it (strongest guarantee). Replication factor of 3 is common (1 leader + 2 followers).
"Kafka is always available, sometimes consistent." Asking "what if Kafka goes down?" isn't very realistic — focus on consumer failure instead.
When a consumer goes down:
- Offset management: Consumer restarts, reads last committed offset, resumes from there. Messages may be reprocessed if crash happened before commit (at-least-once)
- Rebalancing: Consumer group redistributes partitions among remaining consumers
Key tradeoff: When to commit offsets? In a Web Crawler, don't commit until raw HTML is stored in blob storage. Keep consumer work small to minimize redo on failure.
Producer Retries
```javascript
const producer = kafka.producer({
  retry: {
    retries: 5,
    initialRetryTime: 100,
  },
  idempotent: true, // avoid duplicates on retry
});
```
Consumer Retries
Kafka doesn't support consumer retries out of the box (AWS SQS does). Common pattern:
- Failed messages → retry topic
- Separate consumer processes retry topic
- After N retries → dead letter queue (DLQ) for investigation
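The routing decision in that pattern is simple enough to sketch; the topic names (`orders-retry`, `orders-dlq`) and the attempt counter stored in headers are assumptions:

```javascript
// Route a failed message to the retry topic, or to the DLQ after N attempts.
const MAX_RETRIES = 3;

function routeFailure(message) {
  const attempts = (message.headers?.attempts ?? 0) + 1;
  const topic = attempts >= MAX_RETRIES ? 'orders-dlq' : 'orders-retry';
  return { topic, message: { ...message, headers: { attempts } } };
}

let m = { key: 'order1', value: 'payload', headers: { attempts: 0 } };
let route;
for (let i = 0; i < MAX_RETRIES; i++) {
  route = routeFailure(m);   // each failed processing attempt re-routes
  m = route.message;
}
console.log(route.topic); // 'orders-dlq' after 3 failures
```

A separate consumer on the retry topic typically adds a delay before reprocessing, so transient failures (e.g., a downstream service blip) resolve themselves before the DLQ is reached.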
Performance Optimizations
| Technique | Description |
|---|---|
| Batching | Send multiple messages in a single send() call; use sendBatch() across topics |
| Compression | GZIP, Snappy, LZ4 — reduces message size for faster transmission |
| Partition key design | Maximize parallelism with even distribution across partitions |
```javascript
const { CompressionTypes } = require('kafkajs');

await producer.send({
  topic: 'my_topic',
  compression: CompressionTypes.GZIP,
  messages: [{ key: 'key1', value: 'Hello, Kafka!' }],
});
```
Retention Policies
Default: 7 days (configurable via retention.ms and retention.bytes). Longer retention = higher storage costs. Consider this tradeoff in your interview.
Additional Interview Questions & Answers
Q6: How does Kafka guarantee message ordering? What are the limitations?
Answer: Kafka guarantees ordering only within a single partition, not across partitions.
How it works:
- Messages with the same key always go to the same partition (`hash(key) % num_partitions`)
- Within a partition, messages are assigned sequential offsets
- Consumers read messages in offset order
Limitations:
- No global ordering across partitions — if you need total ordering, you need a single partition (sacrificing parallelism)
- Adding partitions breaks existing key-to-partition mappings (messages with the same key may go to a different partition)
- Rebalancing can cause brief ordering disruptions during consumer group membership changes
Interview approach: Choose partition keys that group related messages together. For a payment system, partition by `account_id` so all transactions for one account are ordered. For an event aggregator, partition by `entity_id`.
Q7: What's the difference between Kafka and a traditional message queue like RabbitMQ or SQS?
Answer:
| Feature | Kafka | RabbitMQ / SQS |
|---|---|---|
| Model | Distributed commit log | Message broker / queue |
| Retention | Retains messages after consumption (configurable) | Messages deleted after consumption |
| Replay | Consumers can replay from any offset | No replay (once consumed, gone) |
| Consumer groups | Multiple independent groups read same data | Typically one consumer per message |
| Ordering | Per-partition ordering guaranteed | RabbitMQ: per-queue; SQS: best-effort (FIFO available) |
| Throughput | Very high (1M+ msg/sec per broker) | Lower (tens of thousands/sec) |
| Consumer retries | Must implement yourself | Built-in (SQS: visibility timeout, DLQ) |
| Complexity | Higher (cluster management, partitioning) | Lower (managed services available) |
When to choose Kafka: High throughput, event sourcing, stream processing, multiple consumers, data replay.
When to choose SQS/RabbitMQ: Simple async processing, built-in retry/DLQ, lower operational overhead, no replay needed.
Q8: Explain consumer group rebalancing. What triggers it and what are the implications?
Answer:
Triggers:
- Consumer joins or leaves the group
- Consumer crashes (missed heartbeat)
- New partitions added to a subscribed topic
- Consumer subscription changes
Process:
- Group coordinator (a broker) detects membership change
- All consumers in the group pause consumption
- Partitions are reassigned using a partition assignment strategy (Range, RoundRobin, Sticky, or CooperativeSticky)
- Each consumer receives its new assignment and resumes
Implications:
- Stop-the-world pause — no messages processed during rebalancing (traditional protocol)
- Duplicate processing — uncommitted offsets may be reprocessed after rebalancing
- Latency spike — consumers need to rebuild local state after reassignment
Mitigations:
- Use CooperativeSticky assignor (incremental rebalancing, minimizes disruption)
- Tune `session.timeout.ms` and `heartbeat.interval.ms` to detect failures faster
- Use static group membership (`group.instance.id`) to avoid rebalances on transient restarts
- Keep consumer processing fast to commit offsets frequently
Q9: How would you design Kafka for exactly-once processing in an e-commerce order system?
Answer:
The problem: Orders must not be duplicated (double-charging) or lost (order never fulfilled).
Kafka's exactly-once components:
- Idempotent producer (`enable.idempotence=true`)
  - Assigns a sequence number to each message per partition
  - Broker deduplicates retries using (producer_id, partition, sequence_number)
- Transactional producer (`transactional.id` set)
  - Groups multiple writes (across partitions/topics) into an atomic transaction
  - Either all messages are committed or none
- Consumer with `read_committed` isolation
  - Only reads messages from committed transactions
  - Skips messages from aborted transactions
End-to-end design:
```
Order Service → Kafka (orders topic) → Order Processor → Kafka (fulfillment topic) → Fulfillment Service
                                             ↓
                                   PostgreSQL (order state)
```
- Order Processor uses the consume-transform-produce pattern within a Kafka transaction
- Offset commits are part of the same transaction as the output messages
- Downstream consumers use `isolation.level=read_committed`
Caveats:
- Exactly-once is within Kafka only — external side effects (DB writes, API calls) need application-level idempotency
- Use a deduplication key (order ID) in the database with a unique constraint
- Transactions add latency (~10-50ms overhead)
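The application-level idempotency caveat can be sketched as below; the Map stands in for a database table with a unique constraint on `order_id`, and the field names are illustrative:

```javascript
// Simulate a DB unique constraint on order_id to dedupe redeliveries.
const ordersTable = new Map(); // order_id → order row
let totalCharged = 0;

function processOrder(order) {
  if (ordersTable.has(order.orderId)) {
    return 'duplicate-skipped'; // the unique constraint would reject the insert
  }
  ordersTable.set(order.orderId, order);
  totalCharged += order.amount; // external side effect happens exactly once
  return 'processed';
}

processOrder({ orderId: 'o-1', amount: 99 });
processOrder({ orderId: 'o-1', amount: 99 }); // redelivered after a crash
console.log(totalCharged); // 99 — the customer is charged once
```

This is why "exactly-once" end to end is really "at-least-once delivery plus idempotent effects": Kafka's transactions protect the Kafka-to-Kafka hop, and the dedup key protects everything outside it.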
Q10: What is log compaction and when would you use it?
Answer:
Log compaction retains only the latest value for each key in a topic, rather than retaining all messages for a time period.
How it works:
- Kafka periodically scans the log in the background
- For each key, it keeps only the most recent message
- Older messages with the same key are deleted
- A message with a `null` value (a tombstone) signals deletion of that key
Before compaction:
```
offset 0: key=A, value=1
offset 1: key=B, value=2
offset 2: key=A, value=3   ← latest for A
offset 3: key=B, value=4   ← latest for B
```
After compaction:
```
offset 2: key=A, value=3
offset 3: key=B, value=4
```
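A single compaction pass over that example can be simulated in a few lines. This is a simplification: real compaction works segment by segment and keeps tombstones around for `delete.retention.ms` before dropping them, whereas this sketch drops them immediately:

```javascript
// Keep only the latest record per key; a null value (tombstone) deletes the key.
function compact(log) {
  const latest = new Map();
  for (const record of log) {
    if (record.value === null) latest.delete(record.key); // tombstone
    else latest.set(record.key, record);
  }
  // Survivors keep their original offsets and relative order.
  return [...latest.values()].sort((a, b) => a.offset - b.offset);
}

const log = [
  { offset: 0, key: 'A', value: '1' },
  { offset: 1, key: 'B', value: '2' },
  { offset: 2, key: 'A', value: '3' },
  { offset: 3, key: 'B', value: '4' },
];
console.log(compact(log)); // records at offsets 2 and 3 survive
```

Note that offsets are preserved, not renumbered — consumers can still seek by offset in a compacted topic; there are just gaps where old values used to be.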
Use cases:
- Changelog/CDC streams — maintain latest state of each database row
- Configuration distribution — latest config for each service
- User profile/session state — latest state per user
- KTable in Kafka Streams — materialized view of a compacted topic
Configuration:
```properties
cleanup.policy=compact         # enable compaction
min.compaction.lag.ms=0        # minimum time before eligible for compaction
delete.retention.ms=86400000   # how long tombstones are retained (24h)
```
Not suitable for: Event logs where you need the full history, audit trails, or time-series data.
Q11: How does Kafka handle backpressure? What happens when consumers can't keep up?
Answer:
Kafka's pull-based model provides natural backpressure — consumers only fetch what they can handle.
Indicators that consumers are falling behind:
- Consumer lag: the difference between the latest offset and the consumer's committed offset
- Monitor via `kafka-consumer-groups.sh --describe` or JMX metrics
Strategies when consumers can't keep up:
| Strategy | Description |
|---|---|
| Add consumers | Add more consumers to the group (up to # of partitions) |
| Add partitions | Increase parallelism (requires key remapping consideration) |
| Optimize processing | Reduce per-message processing time; batch DB writes |
| Increase `fetch.max.bytes` | Fetch more data per poll to reduce overhead |
| Separate fast/slow paths | Route time-sensitive messages to a fast topic; batch the rest |
| Drop/sample | For analytics, sample messages rather than processing all |
What Kafka does NOT do:
- Does not slow down producers (messages keep flowing)
- Does not drop messages (retained per retention policy)
- Does not push messages to consumers
Worst case: If lag grows beyond retention period, messages are deleted before consumption → data loss. Monitor lag alerts and set retention appropriately.
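The lag computation that monitoring tools report boils down to a subtraction per partition, which can be sketched directly (the offset numbers below are made up):

```javascript
// Lag per partition = log-end offset minus the group's committed offset.
function consumerLag(logEndOffsets, committedOffsets) {
  const lag = {};
  for (const [partition, end] of Object.entries(logEndOffsets)) {
    lag[partition] = end - (committedOffsets[partition] ?? 0);
  }
  return lag;
}

const lag = consumerLag(
  { 0: 1500, 1: 1480 },  // latest offsets written by producers
  { 0: 1500, 1: 900 }    // offsets the consumer group has committed
);
console.log(lag); // { '0': 0, '1': 580 } — partition 1 is falling behind
```

Alerting on lag that grows monotonically (rather than on any fixed threshold) is what catches the worst case described above before retention deletes unread data.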
Q12: Compare Kafka with Redis Streams for event processing.
Answer:
| Feature | Kafka | Redis Streams |
|---|---|---|
| Storage | Disk-based (log segments) | In-memory (with optional persistence) |
| Throughput | Very high (1M+ msg/sec) | High but memory-bound |
| Retention | Configurable (hours to forever) | Memory-limited, must XTRIM manually |
| Consumer groups | Yes (sophisticated rebalancing) | Yes (simpler model with XREADGROUP) |
| Ordering | Per-partition | Per-stream |
| Replay | Full replay from any offset | Replay from any ID |
| Partitioning | Built-in (topic partitions) | Manual (one stream per key) |
| Exactly-once | Yes (with transactions) | No native support |
| Ecosystem | Rich (Connect, Schema Registry, ksqlDB, Streams API) | Lightweight |
| Operational complexity | High (cluster management) | Low (single Redis cluster) |
Choose Kafka when: High throughput, long retention, complex stream processing, exactly-once requirements, or rich ecosystem needed.
Choose Redis Streams when: Low-latency requirements, lightweight event processing, already using Redis, volume fits in memory, simpler operational model.
Q13: What is the role of ZooKeeper in Kafka and why is it being replaced by KRaft?
Answer:
ZooKeeper's role (traditional):
- Broker registration — tracks which brokers are alive
- Controller election — selects which broker is the cluster controller
- Topic/partition metadata — stores topic configs, partition assignments, ISR lists
- Consumer group coordination (older versions): tracked consumer offsets (now stored in Kafka's `__consumer_offsets` topic)
Why KRaft (Kafka Raft) replaces ZooKeeper:
| Aspect | ZooKeeper | KRaft |
|---|---|---|
| Architecture | Separate system to deploy and manage | Built into Kafka brokers |
| Metadata propagation | Through ZooKeeper → async push to brokers | Direct Raft consensus among controllers |
| Scaling | ZooKeeper becomes bottleneck at ~200k partitions | Supports millions of partitions |
| Recovery time | Slow (full metadata reload on controller failover) | Fast (Raft log replay) |
| Operational complexity | Two systems to manage, monitor, secure | Single system |
KRaft architecture:
- Some brokers designated as controllers (typically 3)
- Controllers form a Raft quorum for metadata consensus
- One controller is the active controller; others are standby
- Metadata stored as an internal Kafka log (replicated via Raft)
KRaft has been production-ready since Kafka 3.3, and ZooKeeper support is removed entirely in Kafka 4.0.
Q14: How would you partition a Kafka topic for a social media notification system?
Answer:
Requirements: Deliver notifications (likes, comments, follows, mentions) to users in order per user, at high throughput.
Partition key choice: user_id
Why user_id:
- All notifications for one user land on the same partition → ordered delivery
- Users are many and varied → even distribution across partitions
- Consumer can maintain per-user state (unread count, batching) locally
Sizing:
- Estimate: 100k notifications/sec, each consumer handles 5k/sec
- Need ~20 consumers → need at least 20 partitions (can't have more consumers than partitions)
- Over-provision to ~50 partitions for headroom
Hot user problem (celebrity with millions of followers):
- A single user generating millions of notifications could skew load
- Solution 1: Fan out at the application layer before Kafka: break celebrity notifications into batches with compound keys (`user_id:batch_num`)
- Solution 2: Separate "high-volume" topic with more partitions for celebrity notifications
- Solution 3: Rate-limit notification generation for any single source event
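Solution 1's fan-out step can be sketched as follows; the batch size and message shape are assumptions, the point is that one hot source key becomes many compound keys that spread across partitions:

```javascript
// Break one celebrity event into batched messages keyed user_id:batch_num.
function fanOut(sourceUserId, followerIds, batchSize = 1000) {
  const messages = [];
  for (let i = 0; i < followerIds.length; i += batchSize) {
    const batchNum = i / batchSize;
    messages.push({
      key: `${sourceUserId}:${batchNum}`, // each batch hashes independently
      value: JSON.stringify({ followers: followerIds.slice(i, i + batchSize) }),
    });
  }
  return messages;
}

const followers = Array.from({ length: 2500 }, (_, i) => `u${i}`);
console.log(fanOut('celeb1', followers).length); // 3 batches
```

Each compound key hashes to its own partition, so the celebrity's load spreads across consumers instead of pinning one partition, at the cost of losing a single ordered stream per source user.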
Consumer design:
- Each consumer in the group handles a subset of users
- Batch notifications by user for efficient delivery (e.g., "You have 5 new likes")
- Commit offsets after successful delivery to notification service
Q15: What are Kafka Connect and Schema Registry? When would you use them?
Answer:
Kafka Connect
A framework for streaming data between Kafka and external systems without writing custom code.
| Component | Description |
|---|---|
| Source connectors | Pull data into Kafka (e.g., from PostgreSQL, MongoDB, S3) |
| Sink connectors | Push data from Kafka to external systems (e.g., to Elasticsearch, HDFS, Snowflake) |
| Workers | JVM processes that run connectors (distributed or standalone mode) |
When to use: CDC pipelines, database-to-warehouse ETL, syncing Kafka with search indexes (Elasticsearch), log shipping.
Example: Debezium (a Kafka Connect source connector) captures row-level changes from PostgreSQL's WAL and streams them to Kafka topics.
Schema Registry
A centralized service for managing and enforcing message schemas (Avro, Protobuf, JSON Schema).
How it works:
- Producer registers a schema with the registry before sending messages
- Messages include a schema ID in the header (not the full schema)
- Consumer fetches the schema from registry to deserialize
- Registry enforces compatibility rules (backward, forward, full) to prevent breaking changes
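The register-then-resolve flow above can be simulated with an in-memory registry (a Map standing in for the registry service; real deployments put the schema ID in a wire-format prefix, which we approximate with a header here):

```javascript
// Toy registry: producers register a schema once and ship only its ID.
const registry = new Map(); // schemaId → schema text
let nextId = 1;

function registerSchema(schema) {
  for (const [id, s] of registry) if (s === schema) return id; // already known
  const id = nextId++;
  registry.set(id, schema);
  return id;
}

const userSchema = '{"type":"record","name":"User","fields":[]}';
const schemaId = registerSchema(userSchema);

// The wire message carries the small ID, not the full schema text.
const message = { headers: { schemaId }, value: '...encoded payload...' };

// Consumer side: resolve the ID back to the schema before deserializing.
const resolved = registry.get(message.headers.schemaId);
console.log(resolved === userSchema); // true
```

Because every message carries only an ID, schemas can evolve centrally: the registry's compatibility checks run at registration time, before any producer can publish a breaking change.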
When to use: Multi-team environments, long-lived topics, data contracts between services, preventing schema evolution from breaking consumers.
Interview tip: Mentioning Schema Registry shows awareness of real-world Kafka operations. It's especially relevant for CDC and data pipeline designs where schema changes in the source database must not break downstream consumers.