Apache Cassandra - Complete Deep Dive

What is Cassandra?

Apache Cassandra is a distributed NoSQL database designed for high availability and horizontal scalability that enables:

  • Linear scalability: Add nodes to increase throughput
  • No single point of failure: Peer-to-peer architecture (no master)
  • High write throughput: Millions of writes/second
  • Eventual consistency: Trade strong consistency for availability
  • Multi-datacenter support: Built-in geographic distribution
  • Tunable consistency: Choose per query (consistency level)

Core Characteristics

Aspect | Benefit
------ | -------
Distributed | Peer-to-peer, all nodes equal
Fault-tolerant | Lose nodes, keep operating (RF=3)
Scalable | Linear throughput increase per node
Durable | Write-ahead logging + compaction
Multi-DC | Replicate across datacenters natively
Eventually consistent | High availability over strong consistency

Cassandra Architecture Overview

Client (via CQL driver)
       ↓
Cassandra Cluster
  ├─ Node 1 (Replicates: A-H)
  ├─ Node 2 (Replicates: I-P)
  ├─ Node 3 (Replicates: Q-Z)
  ├─ Node 4 (Replicates: A-H, I-P)
  └─ ...
       ↓
   [Token Ring] → [Hash Partitioning]
       ↓
   [Storage: commit log + memtable → SSTables]

Key layers:

  • Cluster: Collection of nodes sharing same cluster name
  • Node: Single Cassandra instance
  • Keyspace: Database-like container (schema definition)
  • Table: Similar to relational database table
  • Partition Key: Determines which node stores data
  • Token Ring: Maps keys to nodes via consistent hashing
  • Replica: Copy of partition on multiple nodes (replication factor)

Core Components

1. Consistent Hashing & Token Ring

Problem: How to distribute data across nodes?

Solution: Consistent hashing with token ring

Token Ring (Murmur3 tokens span -2^63 to 2^63-1; simplified to 0-400 here):

      Node 1 (Token: 100)
             /\
       Node 4   Node 2
     (Token:    (Token:
      400)      200)
         \        /
        Node 3 (Token: 300)

Key "alice" → Hash → Token 150
  Belongs to: Node 1 (owns 100-200)

Key "bob" → Hash → Token 250
  Belongs to: Node 2 (owns 200-300)

Key "charlie" → Hash → Token 350
  Belongs to: Node 3 (owns 300-400)

Benefits:

  • Adding/removing nodes = minimal rebalancing
  • Load balanced: Each node owns ~1/N of token space
  • Replication: RF=3 means data on 3 consecutive nodes

Replication placement:

Replication Factor = 3

Key "user:123" → Token 150
  Primary (Node 1): Token 100-200
  Replica 1 (Node 2): Token 200-300
  Replica 2 (Node 3): Token 300-400
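The ring lookup and replica placement above can be sketched in Python. This is a toy model of the idea, not Cassandra's implementation (which uses Murmur3 and vnodes); the token values match the simplified example:

```python
import bisect

# Toy ring matching the example above: each node owns the range that
# STARTS at its token and ends at the next node's token (wrapping around).
RING = [(100, "Node 1"), (200, "Node 2"), (300, "Node 3"), (400, "Node 4")]
TOKENS = [t for t, _ in RING]

def owner(token: int) -> str:
    """Node owning the range the key's token falls into."""
    i = bisect.bisect_right(TOKENS, token) - 1  # i == -1 wraps to the last node
    return RING[i][1]

def replicas(token: int, rf: int = 3) -> list:
    """RF=3: the owner plus the next two nodes clockwise on the ring."""
    i = bisect.bisect_right(TOKENS, token) - 1
    return [RING[(i + k) % len(RING)][1] for k in range(rf)]
```

Adding a node only splits one existing range, which is why rebalancing stays minimal.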

2. Keyspaces and Tables

Keyspace: Schema definition (similar to database)

CREATE KEYSPACE user_keyspace
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us-east-1': 3,
  'eu-west-1': 2
}
AND durable_writes = true;

Table: Data structure (similar to table)

CREATE TABLE users (
  user_id UUID PRIMARY KEY,
  email TEXT,
  name TEXT,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);

3. Data Model: Primary Key

Simple primary key (single column):

PRIMARY KEY (user_id)
  → Partition key: user_id
  → Clustering key: (none)
  
// All data with same user_id on same partition

Composite primary key (partition key + clustering key):

CREATE TABLE user_activity (
  user_id UUID,
  timestamp TIMESTAMP,
  action_type TEXT,
  action_data TEXT,
  PRIMARY KEY (user_id, timestamp)
);

-- Supports efficient range queries within a partition:
SELECT * FROM user_activity
WHERE user_id = ? AND timestamp >= ? AND timestamp < ?;

How it's stored:

Partition key (user_id): Determines node
Clustering key (timestamp): Sort order within partition

Data on node:
  Partition: user_id=alice
    ├─ timestamp=2024-01-05 10:00:00 → "login"
    ├─ timestamp=2024-01-05 10:15:30 → "view_product"
    └─ timestamp=2024-01-05 10:30:45 → "logout"
  
  Partition: user_id=bob
    ├─ timestamp=2024-01-05 10:02:00 → "login"
    └─ timestamp=2024-01-05 10:05:15 → "purchase"

Advantages:

  • Range queries: Get activity between timestamps efficiently
  • Sorted results: Clustering key order preserved
  • Efficient pagination: Skip to timestamp and fetch next N rows

4. Write Path

Write operation flow:

Client sends INSERT/UPDATE
       ↓
1. Write to commitlog (durability)
       ↓
2. Write to memtable (in-memory, sorted)
       ↓
3. Return success to client (write durability achieved)
       ↓
4. Periodically flush memtable → SSTable (on disk)
       ↓
5. Compaction merges SSTables

Write acknowledgment flow:

Client sends write with consistency_level = QUORUM
       ↓
Coordinator node:
  - Sends to all RF replicas
  - Waits for QUORUM responses (RF/2 + 1)
  - Returns success
       ↓
Consistency achieved:
  RF=3 → QUORUM needs 2 acks
  If 1 node is slow/down → write still succeeds
  Trade-off: the third replica catches up later (eventual consistency)

5. Read Path

Read operation flow:

Client sends SELECT with consistency_level = QUORUM
       ↓
Coordinator node:
  - Queries primary replica + (RF-1) other replicas
  - Compares timestamps (read repair)
  - Returns latest version
       ↓
Read consistency guarantees:
  ONE: Fastest (1 node)
  QUORUM: Balanced (RF/2 + 1 nodes)
  ALL: Slowest but consistent (all RF nodes)

Read repair:

Scenario: Write to replica A succeeds, B is slow

Write to A (timestamp 100): SUCCESS
Write to B (timestamp 100): SLOW/TIMEOUT

Later, read from A and B:
  A: timestamp 100
  B: timestamp 90 (stale)
  
Read repair:
  Coordinator sees A > B
  Sends write to B to fix stale data
  Returns A's value

Consistency & Availability Trade-offs

Consistency Levels

Level | Read from | Write to | Availability | Consistency
----- | --------- | -------- | ------------ | -----------
ONE | 1 node | 1 node | Highest | Lowest (possibly stale)
QUORUM | RF/2 + 1 | RF/2 + 1 | Medium | Medium
LOCAL_QUORUM | Local DC quorum | Local DC quorum | Medium | Medium
ALL | All replicas | All replicas | Lowest (1 failure = fail) | Highest

Write consistency example (RF=3):

Consistency ONE:
  Write to node A → Return success
  Risk: Nodes B,C haven't written yet (data loss if A fails)

Consistency QUORUM:
  Write to nodes A,B,C → Wait for 2 acks → Return success
  Tolerates 1 node failure; fails only if 2+ replicas are down

Consistency ALL:
  Write to nodes A,B,C → Wait for all 3 acks → Return success
  Risk: 1 node down = write fails (low availability)

Hybrid consistency (write QUORUM, read ONE):

Write QUORUM:
  Ensure at least 2 replicas have data

Read ONE:
  Read from fastest replica (very fast)
  Risk: Might be stale
  Mitigation: Read repair fixes stale data in background
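Why write-QUORUM/read-ONE can be stale while QUORUM/QUORUM is not follows from simple arithmetic: reads are guaranteed to see the latest write only when the write set and read set must overlap (W + R > RF). A small sketch:

```python
def quorum(rf: int) -> int:
    """QUORUM = floor(RF/2) + 1."""
    return rf // 2 + 1

def overlapping(write_nodes: int, read_nodes: int, rf: int) -> bool:
    """True when every read replica set must share at least one node
    with every write replica set, guaranteeing the latest value is seen."""
    return write_nodes + read_nodes > rf
```

With RF=3: QUORUM writes (2) + QUORUM reads (2) = 4 > 3, so reads see the latest write; QUORUM writes (2) + ONE read (1) = 3, which is not > 3, so a stale read is possible until read repair runs.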

Replication & Distribution

Replication Factor (RF)

RF=1: Single copy (data loss on node failure)
RF=2: Two copies (lose 1 node, still serve)
RF=3: Three copies (lose 2 nodes, still serve at CL=ONE)
RF=5: Five copies (lose 4 nodes, still serve at CL=ONE)

Production: Use RF=3 minimum
Multi-DC: Distribute replicas across DCs
  us-east: RF=2
  eu-west: RF=1

Multi-Datacenter Replication

CREATE KEYSPACE events
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us-east-1': 3,      // 3 replicas in US
  'eu-west-1': 2       // 2 replicas in EU
};

Data flow:

Write in US-East:
  1. Write to primary + replicas (US-East)
  2. Replicates asynchronously to EU-West
  3. EU replicas eventually consistent

Benefits:
  - Local reads (low latency)
  - Local writes (high throughput)
  - Automatic cross-DC replication

Trade-off:
  - Eventual consistency (EU lags behind US)
  - Cross-DC bandwidth usage

Compaction Strategy

Problem: Many writes create many SSTables (slow reads)

Solution: Compaction merges SSTables

Initial state (4 SSTables):
  SSTable1: Keys A, M, Z
  SSTable2: Keys A, G, M
  SSTable3: Keys B, K, X
  SSTable4: Keys C, D, L

Read request for key M:
  Check SSTable1 (found)
  Check SSTable2 (might have newer)
  Check SSTable3 (no)
  Check SSTable4 (no)
  → Query multiple SSTables (slow)

After compaction:
  SSTable-merged: Keys A, B, C, D, G, K, L, M, X, Z (sorted)
  → Single SSTable (fast)
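A minimal model of what compaction does, assuming each entry carries a write timestamp and deletes are tombstones (value `None`). This is a dict-based sketch, not Cassandra's on-disk merge:

```python
def compact(sstables):
    """Merge SSTables into one, keeping the newest version of each key.
    Each SSTable is a dict: key -> (timestamp, value); value None = tombstone."""
    merged = {}
    for table in sstables:  # assume list ordered oldest -> newest
        for key, (ts, value) in table.items():
            if key not in merged or ts >= merged[key][0]:
                merged[key] = (ts, value)   # last write wins
    # Drop tombstones and emit keys in sorted order, as in an SSTable
    return {k: v for k, (ts, v) in sorted(merged.items()) if v is not None}
```

After compaction a read touches one file instead of four, which is exactly the latency win shown above.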

Compaction strategies:

Strategy | Use Case | Trade-off
-------- | -------- | ---------
Size-tiered | Default, write-heavy | More read overhead
Leveled | Read-heavy | More write overhead
Time-window | Time-series data | Good balance for events

Performance Optimization

Write Optimization

-- Batch related writes (note: CQL batches provide atomicity;
-- large multi-partition batches can actually hurt throughput)
BEGIN BATCH
  INSERT INTO users (user_id, email) VALUES (?, ?);
  INSERT INTO user_index (email, user_id) VALUES (?, ?);
APPLY BATCH;

-- Driver-side settings for fire-and-forget style writes:
--   consistency_level = ONE
--   timeout = 100ms

-- cassandra.yaml: commit log sync tuning
--   commitlog_sync: batch
--   commitlog_sync_batch_window_in_ms: 10

Read Optimization

-- Secondary index for filtering (use sparingly, on low-cardinality columns)
CREATE INDEX idx_email ON users (email);

-- Projection: select only needed columns
SELECT user_id, email FROM users WHERE user_id = ?;
-- Not: SELECT * FROM users WHERE user_id = ?;

-- Pagination
SELECT * FROM users WHERE user_id = ? LIMIT 100;
-- Fetch the next page using the last key (or the driver's paging state)

-- Row caching is a table property (there is no USING CACHE in CQL)
ALTER TABLE users WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'};

Hardware Configuration

# Memory allocation
-Xms8g -Xmx8g

# Data directory (fast SSD)
data_file_directories: ["/mnt/data/cassandra"]
commitlog_directory: "/mnt/commitlog"

# Hinted handoff window (3 hours)
max_hint_window_in_ms: 10800000
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "192.168.1.1,192.168.1.2"

# Concurrency
concurrent_reads: 32
concurrent_writes: 32

Scalability & High Availability

Cluster Sizing

Small cluster (3-5 nodes):

Handles: 100K-1M ops/sec
Per node: SSD 500GB, 16GB RAM
Use case: Development, testing

Medium cluster (6-20 nodes):

Handles: 1M-10M ops/sec
Per node: SSD 2TB, 32GB RAM
Use case: Production, single DC

Large cluster (20+ nodes):

Handles: 10M+ ops/sec
Per node: SSD 4TB+, 64GB RAM
Use case: High-scale, multi-DC

High Availability Strategy

Single node failure:
  ✓ No impact (RF=3 has 2 other replicas)
  ✓ Automatic failover (read from other replicas)
  
Multiple node failure:
  ✓ Still operational (if RF > number of failures)
  ✗ Reduced throughput (fewer replicas)
  ✗ Possible data loss (if failures > RF-1)

Entire DC failure:
  ✓ Application fails over to other DC
  ✓ Data preserved (multi-DC replication)
  ✗ Possible write losses (async replication)

Repair Mechanism

Repair ensures replicas are consistent:

Command:
  nodetool repair -pr user_keyspace

Process:
  1. Merkle tree of local node
  2. Compare with replica nodes
  3. Stream missing/stale data
  4. Replicas become consistent

Schedule:
  At least once per gc_grace_seconds (default 10 days); weekly is common
  Run during low-traffic windows
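The Merkle-tree comparison in steps 1-2 can be sketched in Python. This is a toy over in-memory rows, not Cassandra's tree-of-token-ranges implementation:

```python
import hashlib

def merkle_root(rows):
    """Hash leaves pairwise up to a single root hash (toy Merkle tree)."""
    level = [hashlib.sha256(repr(r).encode()).hexdigest() for r in rows]
    if not level:
        return ""
    while len(level) > 1:
        if len(level) % 2:               # duplicate the last hash on odd levels
            level.append(level[-1])
        level = [hashlib.sha256((a + b).encode()).hexdigest()
                 for a, b in zip(level[::2], level[1::2])]
    return level[0]

def needs_repair(local_rows, replica_rows) -> bool:
    """Repair compares tree roots; a mismatch means data must be streamed."""
    return merkle_root(local_rows) != merkle_root(replica_rows)
```

Comparing one root hash (and then subtree hashes) lets repair find divergent ranges without shipping every row across the network.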

Use Cases

1. Time-Series Data (Metrics, Logs)

CREATE TABLE metrics (
  metric_name TEXT,
  timestamp TIMESTAMP,
  host TEXT,
  value FLOAT,
  PRIMARY KEY ((metric_name, host), timestamp)
);

-- High write rate
INSERT INTO metrics (metric_name, host, timestamp, value)
VALUES ('cpu', 'server1', toTimestamp(now()), 85.5);

-- Range query (efficient with clustering)
SELECT * FROM metrics 
WHERE metric_name = 'cpu' 
  AND host = 'server1'
  AND timestamp >= ? AND timestamp < ?;

2. User Profiles & Activity

CREATE TABLE users (
  user_id UUID PRIMARY KEY,
  email TEXT,
  name TEXT,
  bio TEXT
);

CREATE TABLE user_activity (
  user_id UUID,
  timestamp TIMESTAMP,
  action TEXT,
  PRIMARY KEY (user_id, timestamp)
);

-- Fast user lookup + activity range query
SELECT * FROM users WHERE user_id = ?;
SELECT * FROM user_activity 
WHERE user_id = ? AND timestamp >= ? AND timestamp < ?;

3. Message Queue (Kafka-like)

CREATE TABLE messages (
  topic TEXT,
  partition INT,
  offset BIGINT,
  timestamp TIMESTAMP,
  payload BLOB,
  PRIMARY KEY ((topic, partition), offset)
);

-- High write throughput
INSERT INTO messages (topic, partition, offset, timestamp, payload)
VALUES ('events', 0, 1000, toTimestamp(now()), ?);

-- Consumer reads offset range
SELECT * FROM messages 
WHERE topic = 'events' AND partition = 0 
  AND offset >= ? AND offset <= ?
LIMIT 1000;

4. Real-time Analytics

CREATE TABLE events (
  event_type TEXT,
  day TEXT,
  hour TEXT,
  timestamp TIMESTAMP,
  user_id UUID,
  data MAP<TEXT, TEXT>,
  PRIMARY KEY ((event_type, day, hour), timestamp)
);

-- Count within a partition (efficient; note CQL GROUP BY is limited to
-- partition/clustering key columns, and 'price' would need to be a column)
SELECT COUNT(*) FROM events
WHERE event_type = 'purchase' AND day = '2024-01-05' AND hour = '10';

Interview Questions & Answers

Q1: Design a real-time analytics system for 1M events/sec

Requirements:

  • Ingest 1M events/sec
  • Aggregate by event type, time windows (hourly, daily)
  • Query: "Count purchases in last hour"
  • 99.99% uptime

Solution:

CREATE KEYSPACE analytics 
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us-east': 3,
  'eu-west': 2
};

-- Time-series partitioning (efficient range queries)
CREATE TABLE events (
  event_type TEXT,
  bucket TEXT,           -- Partition by hour (2024-01-05-10)
  timestamp TIMESTAMP,
  event_id UUID,
  user_id UUID,
  data TEXT,
  PRIMARY KEY ((event_type, bucket), timestamp, event_id)
);

-- Aggregation table (pre-computed)
CREATE TABLE event_counts (
  event_type TEXT,
  bucket_hour TEXT,      -- Hour bucket
  count_value BIGINT,
  PRIMARY KEY (event_type, bucket_hour)
);

Write path:

# High-throughput insert
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from uuid import uuid4

cluster = Cluster(['node1', 'node2', 'node3'])
session = cluster.connect('analytics')

# Prepared statement: parsed once, executed many times
prepared = session.prepare('''
  INSERT INTO events (
    event_type, bucket, timestamp, event_id, user_id, data
  ) VALUES (?, ?, ?, ?, ?, ?)
''')
# Consistency is set on the statement, not passed to execute()
prepared.consistency_level = ConsistencyLevel.QUORUM

for event in events_stream:  # events_stream: your incoming event iterable
  bucket = event['timestamp'].strftime('%Y-%m-%d-%H')
  session.execute(prepared, [
    event['type'],
    bucket,
    event['timestamp'],
    uuid4(),
    event['user_id'],
    event['data']
  ])

Read path (queries):

-- Count events in last hour (single partition)
SELECT COUNT(*) FROM events
WHERE event_type = 'purchase'
  AND bucket = '2024-01-05-10'
  AND timestamp >= ? AND timestamp < ?;

-- Top products: CQL cannot GROUP BY a non-key column such as product_id.
-- Read the bucket and aggregate client-side, or maintain a pre-computed
-- counts table (see event_counts above).

Cluster setup:

3 nodes per DC × 2 DCs = 6 nodes
Per node: 
  - 4 CPU
  - 32GB RAM
  - 2TB SSD (write-optimized)
  
Throughput: 1M events/sec ÷ 6 nodes ≈ 167K/sec per node
(writes are also replicated RF times, so size with headroom;
 a tuned node can sustain 100K+ writes/sec)

Q2: Node fails. How to recover without data loss?

Answer:

Failure scenarios with RF=3:

Scenario 1: 1 node fails
  ✓ No impact (2 replicas still have data)
  ✓ Automatic failover (read from other replicas)
  ✓ No data loss
  
Scenario 2: 2 nodes fail simultaneously
  ⚠️ Depends on replica placement
  - Failures in different racks/DCs: replicas survive
  - Both replicas of a partition lost: QUORUM unavailable until repair
  
Scenario 3: Node recovers after being down
  1. Node rejoins cluster
  2. Detects lag (missing writes while down)
  3. Repair fills in missing data

Recovery process:

# 1. Node comes back online
# Cassandra detects missed writes (hints)

# 2. Run repair (forces consistency)
nodetool repair -pr user_keyspace

# 3. Monitor repair progress
nodetool netstats

# 4. Verify ring and ownership status
nodetool ring

Hint system (prevents write loss):

Write to A, B (C is down):
  A: success
  B: success
  C: (down, store hint locally)

When C recovers:
  Hints replay → C catches up with missed writes
  
Trade-off:
  - Hints stored locally on coordinator
  - If coordinator fails before replay → loss possible
  - Mitigation: Enable hint window (replay within N hours)

Key takeaway: "With RF=3, lose 1 node = no impact. Lose 2+ nodes = possible data loss. Always run nodetool repair weekly."


Q3: Strong consistency requirement (financial transactions). How to achieve?

Answer:

Challenge: Cassandra is eventually consistent by default

Solution: Hybrid consistency

-- Consistency level is set per request by the driver (or with the
-- CONSISTENCY command in cqlsh), not inside CQL statements:
-- cqlsh> CONSISTENCY QUORUM;

-- Write at QUORUM
INSERT INTO accounts (account_id, balance) VALUES (?, ?);

-- Read at QUORUM
SELECT balance FROM accounts 
WHERE account_id = ?;

-- With RF=3:
--   QUORUM = 2 nodes
--   W=2 and R=2 → W + R > RF → reads always see the latest write

Trade-offs:

QUORUM consistency:
  ✓ Strong consistency (within same node set)
  ✓ Withstand 1 node failure (RF=3)
  ✗ Slower than ONE (need to reach 2 nodes)
  ✗ Fails if 2+ nodes down (can't get quorum)

Availability impact:
  ONE: Always available (unless all replicas down)
  QUORUM: Available if > RF/2 replicas up
    RF=3, need 2 alive: Can lose 1 node
    RF=5, need 3 alive: Can lose 2 nodes

Read-before-write pattern:

-- For balance updates (prevent race conditions)
-- (consistency set to QUORUM via the driver or cqlsh)
SELECT balance FROM accounts 
WHERE account_id = ?;

-- Compute the new balance in the application:
--   new_balance = balance - withdrawal

-- Update with a conditional write (lightweight transaction)
UPDATE accounts SET balance = ?
WHERE account_id = ?
IF balance = ?;  -- only applies if the balance is unchanged

Lightweight transactions (LWT):

-- Atomic compare-and-set
UPDATE accounts SET balance = 950
WHERE account_id = 'user1'
IF balance = 1000;  -- conditional

-- Under the hood (Paxos consensus):
-- 1. Replicas agree on the current value
-- 2. If the condition holds, the write is committed
-- 3. Serialized within the cluster (higher latency than plain writes)

Key takeaway: "Use QUORUM consistency + lightweight transactions for strong consistency guarantees, but accept higher latency."


Q4: Query latency spike. How to diagnose?

Answer:

Diagnosis checklist:

# 1. Check cluster health
nodetool status

# 2. Check node load
nodetool info

# 3. Monitor GC pauses
nodetool gcstats

# 4. Check compaction queue
nodetool compactionstats

# 5. Monitor disk I/O
iostat -x 1

# 6. Enable query tracing (1.0 traces every query - use briefly, it's heavy)
nodetool settraceprobability 1.0

# 7. Analyze a trace (run in cqlsh)
SELECT * FROM system_traces.sessions 
WHERE session_id = ?;

Common causes and solutions:

Problem | Cause | Fix
------- | ----- | ---
High read latency | Too many SSTables per read (bloom filter false positives) | Compact to reduce SSTable count
High write latency | Compaction falling behind, commitlog contention | Tune compaction, check disk
GC pauses | Large heap allocations | Reduce batch size, tune heap
Disk I/O bottleneck | Random reads from disk | Keep hot data in cache, use SSDs
Unbalanced cluster | Uneven token distribution | Rebalance tokens / use vnodes
Hot partition | Single key receiving all traffic | Shard the partition (add a key dimension)

Hot partition mitigation:

Problem: User 'celebrity' has 1M followers
         Everyone writes to their followers partition
         
Before (hot):
  CREATE TABLE followers (
    user_id UUID,
    follower_id UUID,
    PRIMARY KEY (user_id, follower_id)
  );
  -- All writes go to celebrity's single partition (bottleneck)

After (sharded):
  CREATE TABLE followers (
    user_id UUID,
    shard_id INT,
    follower_id UUID,
    PRIMARY KEY ((user_id, shard_id), follower_id)
  );
  -- Distributes across 10 shards (~10x more write parallelism)
  
  INSERT INTO followers (user_id, shard_id, follower_id) VALUES (?, 0, ?);
  INSERT INTO followers (user_id, shard_id, follower_id) VALUES (?, 1, ?);
  -- Writes spread across 10 partitions; reads fan out across the shards
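A hypothetical helper for the sharded design above: hash the follower id to a stable shard so the same follower always lands in the same partition (the function name and shard count are illustrative):

```python
import zlib

NUM_SHARDS = 10  # matches the 10-way sharding above

def shard_for(follower_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Stable hash: a given follower always maps to the same shard.
    Reading all followers requires fanning out over all num_shards partitions."""
    return zlib.crc32(follower_id.encode()) % num_shards
```

Trade-off: writes parallelize across shards, but a full-follower read becomes num_shards partition reads.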

Q5: Design a distributed cache (like Redis) with Cassandra durability

Answer:

Architecture:

Cache layer + Cassandra durability:
  - Memcached/Redis layer (fast, in-memory)
  - Cassandra layer (durable, persistent)
  
Write flow:
  Client → Memcached (confirm) → Write to Cassandra (background)

Read flow:
  Client → Memcached (cache hit, fast) 
        → Cassandra (cache miss, slower)
        → Memcached (populate cache)

Implementation:

CREATE TABLE cache (
  cache_key TEXT PRIMARY KEY,
  value BLOB,
  created_at TIMESTAMP
);

-- Insert with TTL (Cassandra stores expiry per write; no ttl column needed)
INSERT INTO cache (cache_key, value, created_at) 
VALUES (?, ?, toTimestamp(now()))
USING TTL 3600;  -- auto-expires after 1 hour

-- Read (consistency ONE, set via the driver, keeps latency low)
SELECT value FROM cache 
WHERE cache_key = ?;

Cache invalidation strategy:

# Cache-aside (lazy loading) with explicit invalidation
# (memcached client, Cassandra session, and prepared statements
#  are assumed to be initialized elsewhere)

def get_user(user_id):
    # Check Memcached first
    cached = memcached.get(f"user:{user_id}")
    if cached:
        return cached
    
    # Cache miss: read from Cassandra
    user = session.execute(select_user_stmt, [user_id]).one()
    
    # Populate cache for subsequent reads (1 hour TTL)
    memcached.set(f"user:{user_id}", user, ttl=3600)
    return user

def update_user(user_id, data):
    # Update Cassandra first (source of truth)
    session.execute(update_user_stmt, [data, user_id])
    
    # Invalidate cache so the next read repopulates it
    memcached.delete(f"user:{user_id}")

Durability guarantees:

Write QUORUM + async Cassandra write:
  1. Write to Memcached (in-memory, fast)
  2. Write to Cassandra QUORUM (durable)
  3. Memcached loss ≠ data loss (Cassandra has it)
  
If Memcached fails:
  Cassandra still has data
  Repopulate Memcached on read (cache miss)

Cassandra vs Alternatives

System | Throughput | Latency | Best For | Trade-off
------ | ---------- | ------- | -------- | ---------
Cassandra | 1M+/sec | 5-20ms | High-write, distributed | Eventual consistency
DynamoDB | 100K+/sec | 5-10ms | Managed, serverless | AWS vendor lock-in
HBase | 100K+/sec | 10-50ms | Hadoop ecosystem | Operational complexity
MongoDB | 100K+/sec | 5-20ms | Document flexibility | Sharding/consistency trade-offs
Redis | 1M+/sec | 1-5ms | In-memory cache | Weaker durability (optional RDB/AOF)

Best Practices

Data Modeling

✓ Denormalize (opposite of relational databases)
✓ Design tables around queries (not entities)
✓ Avoid hot partitions (use sharding if needed)
✓ Keep partition size < 100MB (keeps reads and compaction healthy)
✓ Use composite keys (partition key + clustering key)
✓ Plan for time-series (bucket by hour/day)

Operational Best Practices

✓ Always use RF ≥ 3 (production minimum)
✓ Monitor compaction (queue shouldn't grow unbounded)
✓ Run repairs weekly (prevents data divergence)
✓ Monitor GC pauses (tune heap if > 100ms)
✓ Use LOCAL_QUORUM for multi-DC (reduce cross-DC latency)
✓ Plan capacity (monitor disk growth)
✓ Enable client-side caching (reduce read load)

Query Optimization

✓ Partition key must be in WHERE clause (required)
✓ Use clustering key for range queries (efficient)
✓ Avoid large result sets (use LIMIT / paging)
✓ Batch only related writes, ideally same-partition (multi-partition batches hurt throughput)
✓ Use prepared statements (avoid re-parsing)
✓ Fetch only needed columns (reduce network I/O)


Disaster Recovery Strategy

Single-Datacenter Failure

Failure Mode | RF=3 | Impact | Recovery
------------ | ---- | ------ | --------
1 node down | ✓ Safe | No impact | Auto (replicas serve)
2 nodes down | ⚠️ Risky | QUORUM unavailable for some partitions; 1 copy remains | Repair after recovery
3+ nodes down | ✗ Loss | Data loss possible (all replicas gone) | Restore from backup

Multi-Datacenter Failover

Primary DC (US-East): 3 nodes, RF=3
Secondary DC (EU-West): 2 nodes

Failure: Entire US-East goes down
  - EU-West still holds its RF=2 replicas
  - Data preserved (multi-DC replication)
  - Failover time: seconds (clients retry against EU)
  - Trade-off: recent writes not yet replicated cross-DC may be lost
  
Solution: Use RF=5 total, distributed across DCs
  US-East: 3 replicas
  EU-West: 2 replicas
  → Lose an entire DC, data survives

Backup and Recovery

# Snapshot (consistent backup)
nodetool snapshot user_keyspace

# Upload to S3
tar -czf snapshot.tar.gz /var/lib/cassandra/snapshots/
aws s3 cp snapshot.tar.gz s3://cassandra-backups/

# Restore from snapshot
# 1. Restore SSTables to node
# 2. Run repair to sync
nodetool repair -pr user_keyspace

Summary & Key Takeaways

Cassandra excels at:

  • ✓ High write throughput (1M+ writes/sec)
  • ✓ Linear scalability (add nodes, increase capacity)
  • ✓ High availability (distributed, no single point of failure)
  • ✓ Multi-datacenter support (native geo-replication)
  • ✓ Time-series data (efficient range queries)

Key challenges:

  • ✗ Eventual consistency (not strong by default)
  • ✗ Operational complexity (cluster management, repairs)
  • ✗ Query flexibility (must design tables per query)
  • ✗ Learning curve (CQL, data modeling, tuning)
  • ✗ No transactions (no ACID, eventual consistency)

Critical design questions:

  1. What's my write throughput requirement (ops/sec)?
  2. Do I need strong consistency or eventual is OK?
  3. How long must I retain data?
  4. What's my acceptable data loss (RPO)?
  5. Do I need multi-datacenter replication?
  6. What queries must I support (design tables)?
  7. Can I handle operational complexity?

Cassandra — Practical Guide & Deep Dive

Apache Cassandra is an open-source, distributed NoSQL database originally built by Facebook for inbox search. It implements a partitioned wide-column storage model with eventually consistent semantics. Cassandra combines elements of Dynamo (partitioning, replication) and Bigtable (column-oriented storage) to handle massive data footprints, query volume, and flexible storage requirements.

Used by Discord, Netflix, Apple, and Bloomberg — Cassandra is here to stay.


Cassandra Basics

Data Model

Concept | Description
------- | -----------
Keyspace | Top-level unit (like a "database"). Defines replication strategy and owns UDTs
Table | Lives within a keyspace. Organizes data into rows with a defined schema
Row | Single record identified by a primary key
Column | Data unit with name, type, value, and timestamp metadata. Columns can vary per row (wide-column)

At a basic level, Cassandra's data looks like nested JSON:

{
  "keyspace1": {
    "table1": {
      "row1": { "col1": 1, "col2": "2" },
      "row2": { "col1": 10, "col3": 3.0 },
      "row3": { "col4": { "company": "Hello Interview", "city": "Seattle" } }
    }
  }
}

Wide-column flexibility: Unlike relational databases that require an entry for every column per row (even NULL), Cassandra allows columns to vary per row.

Conflict resolution: Every column has a timestamp. Write conflicts between replicas are resolved via "last write wins".
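Last-write-wins resolution can be modeled in a few lines, assuming each replica holds per-column `(timestamp, value)` pairs as described above (a sketch, not Cassandra's internal format):

```python
def lww_merge(replica_a: dict, replica_b: dict) -> dict:
    """Merge one row's columns from two replicas: the column version
    with the higher write timestamp wins."""
    merged = dict(replica_a)
    for col, (ts, value) in replica_b.items():
        if col not in merged or ts > merged[col][0]:
            merged[col] = (ts, value)
    return merged
```

Note that wall-clock timestamps make clock skew between clients a real concern: a writer with a fast clock can silently win conflicts.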

Primary Key

Every row is uniquely identified by a primary key consisting of:

Component | Purpose
--------- | -------
Partition Key | One or more columns determining which partition stores the row
Clustering Key | Zero or more columns determining sorted order within a partition

-- Partition key: a, no clustering keys
CREATE TABLE t (a text, b text, c text, PRIMARY KEY (a));

-- Partition key: a, clustering key: b ascending
CREATE TABLE t (a text, b text, c text, PRIMARY KEY ((a), b))
WITH CLUSTERING ORDER BY (b ASC);

-- Composite partition key: a+b, clustering key: c
CREATE TABLE t (a text, b text, c text, d text, PRIMARY KEY ((a, b), c));

-- Partition key: a, clustering keys: b+c
CREATE TABLE t (a text, b text, c text, d text, PRIMARY KEY ((a), b, c));

Key Concepts

Partitioning — Consistent Hashing

Cassandra achieves horizontal scalability by partitioning data across nodes using consistent hashing.

Traditional hashing problem: hash(value) % num_nodes causes massive data remapping when nodes are added/removed, and can produce uneven distribution.

Consistent hashing solution:

  • Values are hashed to positions on a ring of integers
  • Walking clockwise from the hash position finds the first node → that node stores the data
  • Adding/removing a node only affects the adjacent node's data

Virtual nodes (vnodes): To address uneven load, Cassandra maps multiple virtual nodes on the ring to physical nodes. Benefits:

  • More even distribution across the cluster
  • Larger physical machines can own more vnodes
  • Adding a node redistributes load from many existing nodes (not just one)
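A toy demonstration of vnode placement: give each physical node several deterministic ring positions and measure how much of the token space each one owns (node names and vnode count are illustrative):

```python
import hashlib
from collections import Counter

def build_vnode_ring(nodes, vnodes_per_node=8, space=2**32):
    """Each physical node gets several hashed positions on the ring."""
    ring = []
    for node in nodes:
        for i in range(vnodes_per_node):
            h = int(hashlib.md5(f"{node}:{i}".encode()).hexdigest(), 16) % space
            ring.append((h, node))
    return sorted(ring)

def ownership(ring, space=2**32):
    """Fraction of the token space owned by each physical node."""
    shares = Counter()
    for i, (token, node) in enumerate(ring):
        prev = ring[i - 1][0] if i else ring[-1][0] - space  # wrap around
        shares[node] += (token - prev) / space
    return shares
```

With many vnodes per node, the ownership fractions converge toward 1/N, and giving a bigger machine more vnodes gives it a proportionally bigger share.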

Replication

Keyspaces specify replication configuration. Cassandra replicates by scanning clockwise from the assigned vnode to find additional replica nodes (skipping vnodes on the same physical node).

Strategy | Description
-------- | -----------
SimpleStrategy | Scans clockwise for replicas. Good for testing/simple deployments
NetworkTopologyStrategy | Data-center and rack aware. Recommended for production. Ensures replicas span physical failure domains

-- SimpleStrategy: 3 replicas
ALTER KEYSPACE hello_interview
WITH REPLICATION = { 'class': 'SimpleStrategy', 'replication_factor': 3 };

-- NetworkTopologyStrategy: 3 replicas in dc1, 2 in dc2
ALTER KEYSPACE hello_interview
WITH REPLICATION = { 'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2 };

Consistency

Cassandra gives users flexibility over consistency levels for reads and writes — a tunable consistency vs. availability trade-off.

No ACID guarantees. Cassandra only supports atomic and isolated writes at the row level within a partition.

Consistency Level | Behavior
----------------- | --------
ONE | A single replica must respond
QUORUM | A majority (floor(n/2) + 1) of replicas must respond
ALL | All replicas must respond

QUORUM on both reads and writes guarantees that writes are visible to subsequent reads, because at least one node must participate in both operations. With 3 replicas: floor(3/2) + 1 = 2 nodes must participate in each operation, so any read quorum overlaps any write quorum.

Cassandra aims for eventual consistency — all replicas converge to the latest data given enough time.

Query Routing

Any node can be a coordinator. Cassandra nodes know about other alive nodes via gossip protocol. When a client issues a query:

  1. Client selects a node → it becomes the coordinator
  2. Coordinator determines which nodes store the data (consistent hashing + replication strategy)
  3. Coordinator issues queries to relevant replica nodes
  4. Responses are returned to the client

Storage Model — LSM Tree

Cassandra uses a Log Structured Merge Tree (LSM tree) instead of a B-tree, favoring write speed over read speed.

Key insight: Every create/update/delete is a new entry. Cassandra uses ordering of entries to determine row state. Deletes write a tombstone marker.

Three core constructs:

Construct | Description
--------- | -----------
Commit Log | Write-ahead log for durability
Memtable | In-memory structure sorted by primary key
SSTable | Immutable on-disk file flushed from the Memtable

Write path:

  1. Write is issued to a node
  2. Written to commit log (durability guarantee)
  3. Written to Memtable (in-memory)
  4. When Memtable reaches threshold → flushed to disk as immutable SSTable
  5. Corresponding commit log entries are removed

Read path:

  1. Check Memtable first (latest data)
  2. If not found, use bloom filter to identify candidate SSTables
  3. Read SSTables newest-to-oldest to find latest data (sorted by primary key)

Compaction: Periodically merges SSTables to consolidate updates and remove tombstones, keeping read performance healthy.

SSTable Indexing: Files mapping keys to byte offsets in SSTables for fast on-disk retrieval.

Gossip Protocol

Cassandra nodes communicate cluster information via peer-to-peer gossip:

  • Nodes track generation (bootstrap timestamp) and version (logical clock) for each known node
  • These form a vector clock — nodes ignore stale state information
  • Seed nodes act as guaranteed gossip hotspots, preventing cluster fragmentation
  • Every node being aware of the cluster eliminates single points of failure

Fault Tolerance

Failure detection: Phi Accrual Failure Detector — each node independently decides if another node is available. Nodes aren't considered truly "down" unless an admin decommissions them.

Hinted handoffs: When a coordinator can't reach a replica node:

  1. Coordinator temporarily stores the write data as a hint
  2. Write proceeds successfully
  3. When the offline node comes back, hints are forwarded to it

Hinted handoffs are short-term. Long-offline nodes require read repairs or rebuilds.
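The handoff flow can be modeled as a small toy (hypothetical class, not the driver or server API):

```python
class Coordinator:
    """Toy hinted handoff: writes aimed at unreachable replicas are
    stored as hints and replayed when the replica returns."""
    def __init__(self, replicas):
        self.replicas = replicas          # name -> dict holding that replica's data
        self.up = set(replicas)           # replicas currently reachable
        self.hints = []                   # (target, key, value) awaiting replay

    def write(self, key, value):
        for name, data in self.replicas.items():
            if name in self.up:
                data[key] = value
            else:
                self.hints.append((name, key, value))  # store hint locally

    def recover(self, name):
        self.up.add(name)
        remaining = []
        for target, key, value in self.hints:
            if target == name:
                self.replicas[target][key] = value     # replay the hint
            else:
                remaining.append((target, key, value))
        self.hints = remaining
```

In the real system hints expire after the hint window, which is why a node offline longer than that needs repair rather than replay.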


Data Modeling for Cassandra

Cassandra data modeling is query-driven (not entity-relationship-driven like relational DBs).

Key principles:

  • No JOINs, no foreign keys, no referential integrity
  • Design tables around access patterns first
  • Denormalize data across tables as needed
  • Consider: partition key choice, partition size limits, clustering key ordering

Example: Discord Messages

Discord uses Cassandra for message storage. Access pattern: users query recent messages for a channel.

First iteration:

CREATE TABLE messages (
  channel_id bigint,
  message_id bigint,
  author_id bigint,
  content text,
  PRIMARY KEY (channel_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

  • Partition key: channel_id → single partition per channel, efficient reads
  • Clustering key: message_id DESC → reverse chronological order
  • Uses Snowflake IDs (chronologically sortable 64-bit integers) instead of timestamps to avoid collisions

Problem: Busy channels create huge partitions → performance degradation. Partitions grow monotonically.

Solution — Time bucketing:

CREATE TABLE messages (
  channel_id bigint,
  bucket int,
  message_id bigint,
  author_id bigint,
  content text,
  PRIMARY KEY ((channel_id, bucket), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

  • bucket represents 10-day windows aligned to Discord's epoch
  • Limits partition size even for busiest channels
  • Most queries hit a single partition (latest bucket)
  • New buckets created automatically as time progresses
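The bucket derivation can be sketched assuming Discord-style snowflake IDs, which pack a millisecond timestamp (relative to a custom epoch) into the top bits of a 64-bit integer. The constants follow Discord's published snowflake layout but should be treated as illustrative.

```python
DISCORD_EPOCH_MS = 1420070400000          # 2015-01-01T00:00:00Z
BUCKET_MS = 10 * 24 * 60 * 60 * 1000      # 10-day windows

def make_snowflake(unix_ms):
    """Pack a timestamp into the top 42 bits of a 64-bit snowflake ID."""
    return (unix_ms - DISCORD_EPOCH_MS) << 22

def bucket_for(snowflake_id):
    """10-day bucket aligned to the snowflake epoch."""
    ms_since_epoch = snowflake_id >> 22
    return ms_since_epoch // BUCKET_MS

# Messages sent within the same 10-day window share a partition:
m_day1 = make_snowflake(DISCORD_EPOCH_MS + 1 * 24 * 3600 * 1000)
m_day9 = make_snowflake(DISCORD_EPOCH_MS + 9 * 24 * 3600 * 1000)
m_day11 = make_snowflake(DISCORD_EPOCH_MS + 11 * 24 * 3600 * 1000)
```

Because the timestamp occupies the high bits, comparing two snowflakes compares their creation times, which is what makes `message_id DESC` a valid chronological ordering.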

Example: Ticketmaster Ticket Browsing

Access pattern: users view event venue sections, then drill into individual seats.

First iteration (all seats per event):

CREATE TABLE tickets (
  event_id bigint,
  seat_id bigint,
  price bigint,
  PRIMARY KEY (event_id, seat_id)
);

Problem: Events with 10k+ tickets → large partitions, expensive aggregations for section summaries.

Better design — partition by section:

CREATE TABLE tickets (
  event_id bigint,
  section_id bigint,
  seat_id bigint,
  price bigint,
  PRIMARY KEY ((event_id, section_id), seat_id)
);

Denormalized summary table for venue overview:

CREATE TABLE event_sections (
  event_id bigint,
  section_id bigint,
  num_tickets bigint,
  price_floor bigint,
  PRIMARY KEY (event_id, section_id)
);

  • Venue overview → query event_sections (single partition, < 100 sections)
  • Section detail → query tickets (single partition per section)
  • Section stats don't need to be exact (Ticketmaster shows "100+" not exact counts)

Advanced Features

| Feature | Description |
|---|---|
| Storage Attached Indexes (SAI) | Global secondary indexes on columns. Flexible querying with good (not optimal) performance. Avoids excess denormalization for infrequent queries |
| Materialized Views | Cassandra automatically materializes denormalized tables from a source table. Reduces application-level complexity for multi-table writes |
| Search Indexing | Wire up to Elasticsearch or Apache Solr via plugins (e.g., Stratio Lucene Index) |

Cassandra in an Interview

When to use:

  • Systems prioritizing availability over consistency
  • High write throughput needs (LSM tree optimized writes)
  • High scalability requirements (linear horizontal scaling)
  • Flexible/sparse schemas (wide-column model)
  • Clear, well-defined access patterns the schema can revolve around

When NOT to use:

  • Strict consistency requirements (financial transactions, inventory)
  • Complex query patterns (multi-table JOINs, ad-hoc aggregations)
  • Small datasets that don't need distributed scaling
  • Rapidly evolving query patterns (each new query may need a new table)

Additional Interview Questions & Answers

Q6: Explain consistent hashing and virtual nodes. Why does Cassandra use them?

Answer:

Consistent hashing maps both data and nodes onto a ring of integers. Data is stored on the first node encountered when walking clockwise from the data's hash position.

Why not traditional hashing (hash % num_nodes)?

  • Adding/removing a node remaps ~all data (catastrophic in a distributed DB)
  • Consistent hashing: adding/removing a node only affects adjacent nodes' data

Virtual nodes (vnodes):

  • Each physical node owns multiple positions on the ring (num_tokens defaulted to 256 historically; 16 since Cassandra 4.0)
  • Benefits:
    - Even distribution — more ring positions = more uniform data spread
    - Heterogeneous hardware — larger machines can own more vnodes
    - Faster rebalancing — adding a node takes small slices from many nodes instead of a large chunk from one
    - Faster rebuilds — data streams from many nodes in parallel
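A minimal consistent-hash ring with virtual nodes makes both claims concrete: every node owns keys, and growing the cluster only remaps a fraction of them. This is a sketch; `md5` stands in for Cassandra's Murmur3 partitioner and the names are illustrative.

```python
import bisect
import hashlib

def _hash(s: str) -> int:
    # md5 stands in for Cassandra's Murmur3 partitioner
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=32):
        # each physical node claims `vnodes` positions on the ring
        self.tokens = sorted(
            (_hash(f"{node}-vn{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )

    def owner(self, key):
        """First node clockwise from the key's hash position."""
        i = bisect.bisect(self.tokens, (_hash(key),))
        return self.tokens[i % len(self.tokens)][1]
```

With `hash % num_nodes` instead, adding a fourth node would remap roughly three quarters of all keys; here only the slices adjacent to the new node's vnodes move.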

Interview tip: Consistent hashing is useful beyond Cassandra — apply it to cache distribution, load balancing, or any sharding problem.


Q7: How does Cassandra's write path differ from a traditional relational database? Why is Cassandra faster for writes?

Answer:

| Aspect | Relational DB (B-tree) | Cassandra (LSM tree) |
|---|---|---|
| Write pattern | Random I/O (find and update in-place) | Sequential I/O (append to commit log + memtable) |
| On-disk structure | Mutable B-tree pages | Immutable SSTables |
| Write amplification | Low per write, but random seeks are slow | Higher (compaction), but writes themselves are fast |
| Read performance | Optimized (B-tree lookups) | Requires checking memtable + multiple SSTables |

Why Cassandra is faster for writes:

  1. Commit log is append-only → sequential disk I/O (fastest possible disk operation)
  2. Memtable is in-memory → no disk I/O for the actual data write
  3. No need to read-before-write (no in-place updates)
  4. SSTable flushes are sequential bulk writes
  5. No locking or transaction overhead

Tradeoff: Reads may need to check multiple SSTables → bloom filters and compaction mitigate this.


Q8: What are tombstones in Cassandra? Why can they be problematic?

Answer:

Tombstones are markers that Cassandra writes when data is deleted. Since SSTables are immutable, Cassandra can't remove data in-place — instead it writes a tombstone that says "this data is deleted."

How they work:

  1. Delete issued → tombstone written to memtable → flushed to SSTable
  2. During reads, tombstones override earlier versions of the data
  3. During compaction, tombstoned data is permanently removed (after gc_grace_seconds, default 10 days)

Why they're problematic:

  • Read performance degradation — reads must scan through tombstones before finding live data
  • Memory pressure — tombstones consume memory during reads
  • "Tombstone storm" — deleting many rows at once creates massive numbers of tombstones
  • If gc_grace_seconds is too short, deleted data can be resurrected via repair from a replica that missed the delete

Best practices:

  • Avoid frequent deletes — use TTLs for time-expiring data instead
  • Use appropriate compaction strategies (e.g., TimeWindowCompactionStrategy for time-series data with TTLs)
  • Monitor tombstone counts per read (TombstoneScannedHistogram)
  • Don't set gc_grace_seconds lower than your repair cycle
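The read-time and compaction-time behavior of tombstones can be sketched with timestamped cells: the newest cell wins, a tombstone masks older values, and compaction only purges the tombstone itself after the grace period. Illustrative only; real cells carry much more metadata.

```python
TOMBSTONE = object()   # sentinel marking a delete

def resolve(cells):
    """cells: list of (write_timestamp, value); newest timestamp wins."""
    ts, value = max(cells, key=lambda c: c[0])
    return None if value is TOMBSTONE else value

def compact(cells, now, gc_grace_seconds):
    """Keep only the winning cell; purge expired tombstones entirely."""
    ts, value = max(cells, key=lambda c: c[0])
    if value is TOMBSTONE and now - ts >= gc_grace_seconds:
        return []                 # tombstone and shadowed data gone for good
    return [(ts, value)]          # shadowed older cells are dropped either way
```

The resurrection risk is visible here: if `compact` purged the tombstone before every replica saw the delete, a repair from a lagging replica would re-introduce the old `(100, "v1")` cell as live data.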

Q9: How would you model a time-series IoT sensor system in Cassandra?

Answer:

Requirements: Millions of sensors, each reporting every few seconds. Query patterns: latest readings per sensor, readings in a time range.

Schema:

CREATE TABLE sensor_readings (
  sensor_id uuid,
  day date,
  reading_time timestamp,
  temperature double,
  humidity double,
  pressure double,
  PRIMARY KEY ((sensor_id, day), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC)
  AND default_time_to_live = 2592000;  -- 30-day TTL

Design decisions:

| Decision | Reasoning |
|---|---|
| Partition key: (sensor_id, day) | Bounds partition size to one day per sensor; prevents unbounded growth |
| Clustering key: reading_time DESC | Latest readings returned first (most common query pattern) |
| TTL: 30 days | Auto-expiration avoids tombstone accumulation from manual deletes |
| Compaction: TimeWindowCompactionStrategy | Optimized for time-series data where entire windows expire together |

Query examples:

-- Latest readings for sensor today
SELECT * FROM sensor_readings
WHERE sensor_id = ? AND day = '2026-03-30'
LIMIT 10;

-- Readings in a time range
SELECT * FROM sensor_readings
WHERE sensor_id = ? AND day = '2026-03-30'
AND reading_time >= '2026-03-30 10:00:00'
AND reading_time <= '2026-03-30 12:00:00';

Scaling: With 1M sensors reporting every 5 seconds = 200k writes/sec. At ~100 bytes per reading, daily storage ≈ 1.7 TB. Cassandra handles this comfortably with a well-sized cluster (10-15 nodes).
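The back-of-envelope numbers work out as follows (assumed figures: 1M sensors, one reading every 5 seconds, ~100 bytes per reading):

```python
sensors = 1_000_000
interval_s = 5
bytes_per_reading = 100

writes_per_sec = sensors // interval_s
daily_tb = writes_per_sec * bytes_per_reading * 86_400 / 1e12

print(writes_per_sec)      # 200000
print(round(daily_tb, 2))  # 1.73
```

That 1.73 TB/day is raw ingest; with RF=3 the cluster stores roughly three times as much.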


Q10: Compare Cassandra vs DynamoDB. When would you choose each?

Answer:

| Feature | Cassandra | DynamoDB |
|---|---|---|
| Deployment | Self-managed or managed (Astra DB) | Fully managed (AWS) |
| Data model | Wide-column (flexible schema) | Key-value + document |
| Primary key | Partition key + clustering key | Partition key + sort key |
| Consistency | Tunable (ONE to ALL) | Eventually consistent or strongly consistent |
| Scaling | Manual (add nodes, rebalance) | Automatic (on-demand or provisioned) |
| Multi-region | NetworkTopologyStrategy | Global Tables |
| Secondary indexes | SAI, materialized views | GSI, LSI |
| Cost model | Infrastructure + ops team | Pay-per-request or provisioned capacity |
| Vendor lock-in | None (open source) | AWS only |
| Throughput | Very high (you control hardware) | High (AWS manages limits) |

Choose Cassandra when:

  • You need vendor independence
  • You want full control over tuning and infrastructure
  • Your team has operational expertise
  • You need extreme write throughput with custom hardware

Choose DynamoDB when:

  • You want zero operational overhead
  • You're already in the AWS ecosystem
  • You need automatic scaling without capacity planning
  • Your team is small and can't dedicate resources to DB operations

Q11: Explain Cassandra's gossip protocol. How does it prevent split-brain?

Answer:

Gossip is Cassandra's peer-to-peer protocol for sharing cluster state:

  1. Every second, each node randomly selects 1-3 other nodes to gossip with
  2. Nodes exchange heartbeat state — generation (bootstrap timestamp) + version (incrementing counter)
  3. These form a vector clock — newer information supersedes older
  4. Nodes probabilistically bias gossip toward seed nodes

Why seed nodes matter:

  • Seed nodes are guaranteed gossip targets that prevent cluster fragmentation
  • Without them, sub-clusters could form where groups of nodes only know about each other
  • New nodes use seed nodes for initial cluster discovery

Split-brain prevention:

  • Cassandra doesn't use leader election (no single point of failure)
  • Every node can serve requests independently
  • Phi Accrual Failure Detector provides probabilistic failure detection (not binary up/down)
  • Nodes don't make irreversible decisions about cluster membership — a "convicted" node simply re-enters the cluster when it starts heartbeating again
  • Hinted handoffs ensure writes aren't lost during temporary network partitions

Key difference from consensus-based systems (Raft/Paxos): Cassandra favors availability — it doesn't require quorum agreement for cluster membership changes. This makes it more resilient to network partitions but means it relies on eventual consistency.
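The Phi Accrual idea can be sketched in a few lines: instead of a binary up/down verdict, compute a suspicion level phi from how overdue the next heartbeat is relative to the historical inter-arrival distribution. This simplification assumes an exponential model with the observed mean; Cassandra's actual detector tracks a sliding window of intervals, and a node is "convicted" when phi crosses the configurable phi_convict_threshold (default 8).

```python
import math

def phi(time_since_last, mean_interval):
    # Under the exponential assumption, P(interval > t) = exp(-t / mean);
    # phi is -log10 of that probability, so each +1 in phi means the
    # heartbeat is 10x less likely to still arrive.
    p_later = math.exp(-time_since_last / mean_interval)
    return -math.log10(p_later)
```

With 1-second mean heartbeats, a node 8 seconds overdue scores phi ≈ 3.5; suspicion grows smoothly rather than flipping at a fixed timeout, which is what lets each node tune its own conviction threshold.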


Q12: How would you handle a "hot partition" in Cassandra?

Answer:

Hot partition: A single partition receives disproportionate read/write traffic, overloading the node(s) hosting it.

Detection:

  • Monitor org.apache.cassandra.metrics:type=Table metrics per table
  • Look for uneven load across nodes (nodetool tablehistograms)
  • Client-side latency spikes for specific partition keys

Solutions:

| Strategy | How It Works | Tradeoff |
|---|---|---|
| Add bucketing to partition key | Split (user_id) into (user_id, bucket) where bucket = hash(timestamp) % N | Reads must query N buckets and merge |
| Application-level caching | Cache hot partition data in Redis/Memcached | Added infrastructure, cache invalidation complexity |
| Scatter-gather with salt | Add a random salt to the partition key, fan out reads | Higher read latency, more complex application logic |
| Redesign data model | Re-examine access patterns; maybe the partition key choice is wrong | May require migration |
| Rate limiting | Throttle writes/reads to hot keys at the application layer | May degrade user experience |

Prevention (at design time):

  • Avoid low-cardinality partition keys (e.g., country for a global app)
  • Estimate max partition size and access frequency during data modeling
  • Use compound partition keys to distribute load (like Discord's (channel_id, bucket))
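The bucketing strategy above can be sketched end to end: writes hash into one of N sub-partitions, and reads fan out to all N buckets and merge. The in-memory `cluster` dict and function names are illustrative; `md5` gives a deterministic stand-in for the partitioner hash.

```python
import hashlib

N_BUCKETS = 4

def _bucket(user_id, ts):
    digest = hashlib.md5(f"{user_id}:{ts}".encode()).hexdigest()
    return int(digest, 16) % N_BUCKETS          # bucket = hash(timestamp) % N

def write_row(cluster, user_id, item, ts):
    # the compound key (user_id, bucket) spreads one hot key over N partitions
    cluster.setdefault((user_id, _bucket(user_id, ts)), []).append((ts, item))

def read_rows(cluster, user_id):
    rows = []
    for b in range(N_BUCKETS):                  # scatter: query every bucket
        rows.extend(cluster.get((user_id, b), []))
    return [item for _, item in sorted(rows)]   # gather: merge by timestamp
```

The tradeoff from the table is visible in `read_rows`: one logical read becomes N partition reads plus a client-side merge.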

Q13: What compaction strategies does Cassandra offer and when should you use each?

Answer:

| Strategy | Best For | How It Works |
|---|---|---|
| SizeTieredCompactionStrategy (STCS) | Write-heavy workloads | Merges similarly-sized SSTables. Simple but can cause temporary space amplification (needs 2x disk) |
| LeveledCompactionStrategy (LCS) | Read-heavy workloads | Organizes SSTables into levels of increasing size. Guarantees at most 1 SSTable per partition per level. Better read performance, higher write amplification |
| TimeWindowCompactionStrategy (TWCS) | Time-series data with TTL | Groups SSTables by time window. Entire windows drop when all data expires. Very efficient for TTL-based expiration |
| UnifiedCompactionStrategy (UCS) | General purpose (Cassandra 5.0+) | Adaptive strategy that combines benefits of STCS and LCS based on workload characteristics |

Decision guide:

Write-heavy, rarely read → STCS
Read-heavy, frequent updates → LCS
Time-series with TTL → TWCS
Not sure / mixed workload → UCS (if available)
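The size-tiered idea reduces to a simple grouping rule: bucket SSTables by approximate size and merge any bucket with enough members. This is a simplified sketch (sizes in MB, log2 bucketing, a fixed threshold), not Cassandra's actual STCS heuristics.

```python
def size_tiered_candidates(sstable_sizes, min_threshold=4):
    """Return groups of similarly-sized SSTables worth merging."""
    tiers = {}
    for size in sorted(sstable_sizes):
        tier = size.bit_length()            # crude log2 size bucket
        tiers.setdefault(tier, []).append(size)
    # only tiers with enough similarly-sized tables trigger a compaction
    return [group for group in tiers.values() if len(group) >= min_threshold]
```

Merging four ~10 MB tables yields one ~40 MB table, which later joins a higher tier; this is how small flushes gradually consolidate into few large SSTables.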

Interview tip: Mentioning compaction strategy shows deep understanding. The key insight is that compaction is the price Cassandra pays for fast writes — it defers the "cleanup" work to background processes.


Q14: How does Cassandra compare to MongoDB for a system design interview?

Answer:

| Feature | Cassandra | MongoDB |
|---|---|---|
| Data model | Wide-column (rows with varying columns) | Document (JSON/BSON) |
| Query language | CQL (SQL-like, limited) | MQL (rich query language, aggregation pipeline) |
| Schema | Defined per table (flexible columns) | Schema-less (flexible documents) |
| Consistency | Tunable (eventual by default) | Strong by default (single-document ACID) |
| Scaling | Peer-to-peer (no single point of failure) | Primary-secondary (primary handles writes) |
| Write performance | Excellent (LSM tree, sequential I/O) | Good (WiredTiger B-tree) |
| JOINs | Not supported | Not native (use $lookup for basic joins) |
| Secondary indexes | SAI (limited) | Rich secondary indexes |
| Multi-region | Native (NetworkTopologyStrategy) | Atlas Global Clusters |

Choose Cassandra when:

  • Availability > consistency
  • Extreme write throughput
  • Well-defined, predictable access patterns
  • Multi-datacenter replication is critical

Choose MongoDB when:

  • Flexible/evolving query patterns
  • Rich querying and aggregation needed
  • Single-document transactions suffice
  • Developer productivity is a priority (richer query language)

Q15: Design a Cassandra schema for a social media feed (like Twitter/X timeline).

Answer:

Access patterns:

  1. View a user's own posts (profile page)
  2. View a user's home timeline (posts from people they follow)
  3. Post a new message

Schema:

-- User's own posts (profile view)
CREATE TABLE user_posts (
  user_id bigint,
  post_id bigint,        -- Snowflake ID (chronologically sortable)
  content text,
  media_urls list<text>,
  -- like counts live in a separate counter table: CQL forbids mixing
  -- counter and regular columns in one table
  PRIMARY KEY (user_id, post_id)
) WITH CLUSTERING ORDER BY (post_id DESC);

-- Home timeline (fan-out on write)
CREATE TABLE home_timeline (
  user_id bigint,
  post_id bigint,
  author_id bigint,
  content text,
  author_name text,       -- Denormalized for read efficiency
  PRIMARY KEY (user_id, post_id)
) WITH CLUSTERING ORDER BY (post_id DESC);

Fan-out on write strategy:

  1. User posts a message → write to user_posts
  2. Look up all followers of the user
  3. For each follower, write a copy to their home_timeline
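The three steps above, plus the celebrity cutoff discussed below, can be sketched as one publish function. The in-memory dicts stand in for the user_posts and home_timeline tables, and the threshold parameter is illustrative.

```python
def publish(post, followers, user_posts, home_timeline,
            celebrity_threshold=1_000_000):
    """Write a post to the author's posts, then fan out to follower timelines."""
    author = post["author_id"]
    user_posts.setdefault(author, []).append(post)        # step 1
    if len(followers[author]) >= celebrity_threshold:
        return "fan-out-on-read"   # celebrity posts merged at read time instead
    for follower in followers[author]:                    # steps 2-3
        home_timeline.setdefault(follower, []).append(post)
    return "fan-out-on-write"
```

The hybrid cutoff caps write amplification: a post by someone with 1M followers would otherwise trigger a million timeline writes.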

Design considerations:

| Concern | Solution |
|---|---|
| Celebrity fan-out | Hybrid approach — fan-out on write for normal users, fan-out on read for celebrities (> 1M followers) |
| Partition size | Use TTL or bucketing (user_id, bucket) to prevent unbounded timeline growth |
| Stale denormalized data | Author name changes are infrequent; eventual consistency is acceptable |
| Timeline truncation | Only keep latest 1000 posts per timeline; older posts fetched on-demand |

Query:

-- Home timeline (latest 20 posts)
SELECT * FROM home_timeline
WHERE user_id = ?
LIMIT 20;

-- Pagination: keyset-style, using the last seen post_id as the cursor
SELECT * FROM home_timeline
WHERE user_id = ? AND post_id < ?
LIMIT 20;
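The cursor mechanics behind those two queries can be sketched client-side: each page returns the last post_id seen, which becomes the `post_id < ?` bound for the next page. Names are illustrative; rows are assumed pre-sorted newest-first, as the clustering order guarantees.

```python
def fetch_page(timeline_rows, limit=20, before_post_id=None):
    """timeline_rows: list of dicts sorted by post_id DESC."""
    rows = timeline_rows
    if before_post_id is not None:
        rows = [r for r in rows if r["post_id"] < before_post_id]
    page = rows[:limit]
    next_cursor = page[-1]["post_id"] if page else None
    return page, next_cursor
```

Unlike offset pagination, this stays O(page size) no matter how deep the user scrolls, because the clustering key lets Cassandra seek directly to the cursor position.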