Apache ZooKeeper for System Design
Coordinating distributed systems is hard. The core problem has not changed: how do multiple machines agree on membership, ownership, configuration, and failure state when networks are unreliable and machines can crash at any time?
Apache ZooKeeper exists to solve exactly that class of problems. It is a distributed coordination service used for:
- Service discovery
- Configuration management
- Leader election
- Distributed locking
ZooKeeper is old by infrastructure standards, but it is still important to understand because it teaches the underlying coordination patterns that show up everywhere else: consensus, ephemeral membership, watches, and quorum-based writes.
Interview rule: ZooKeeper is rarely the first thing you should reach for in a product design interview. It becomes relevant when the problem is fundamentally about distributed coordination, not when you just need a normal database or cache.
Why Coordination Is Hard
Imagine a chat application.
At first, you have one server. Alice and Bob are both connected to it, so routing a message is trivial: the server already knows where each user lives.
Once you scale to multiple servers, things get more difficult:
- Alice may be connected to Server 1 while Bob is connected to Server 2
- Server 1 now needs a reliable way to discover where Bob is connected
- Servers need to detect when peer servers go down
- Configuration changes need to propagate consistently
- Some operations need a single leader to avoid duplicate work
Naive approaches break down quickly:
- Central database: becomes a bottleneck and a single point of failure unless you solve replication and failover
- Broadcast between all servers: creates O(n^2) communication overhead
- Local caches only: introduce stale routing information
- Ad hoc heartbeats: force you to handle network partitions, split brain, and inconsistent membership views
ZooKeeper gives the system a shared, consistent control plane. Instead of every service inventing its own membership and coordination logic, they coordinate through ZooKeeper.
Mental Model
ZooKeeper is best thought of as a highly available metadata filesystem for distributed systems.
It is not a general-purpose database:
- Data is small
- Writes are relatively expensive
- The working set should fit in memory
- The main value is coordination, not bulk storage
Applications store small pieces of metadata in ZooKeeper and subscribe to changes. They usually keep a local cache and use ZooKeeper to stay synchronized.
Basic Concepts
ZNodes
ZooKeeper stores data in a hierarchical namespace that looks like a filesystem:
/chat-app
  /servers
  /users
  /config
Each item in that tree is a ZNode.
ZNodes:
- Have a path like /chat-app/config/max_message_size
- Store a small blob of data
- Have metadata such as version numbers and timestamps
- Can have children
ZooKeeper is designed for many small metadata entries, not large values. Keep payloads small.
Types of ZNodes
| Type | Meaning | Common use |
|---|---|---|
| Persistent | Survives until explicitly deleted | Config, metadata |
| Ephemeral | Deleted automatically when the creator's session expires | Membership, liveness |
| Sequential | ZooKeeper appends a monotonically increasing suffix | Leader election, locks, queues |
Examples:
# Persistent node
create /chat-app/config/max_message_size "1024"
# Ephemeral node
create -e /chat-app/servers/server2 "192.168.1.102:8080"
# Sequential node
create -s /chat-app/locks/lock- ""
# Might become /chat-app/locks/lock-0000000007
You can combine ephemeral and sequential:
create -e -s /chat-app/leader/candidate- "server2"
That combination is the foundation for ZooKeeper leader election and lock recipes.
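The sequential suffix ZooKeeper appends is a zero-padded 10-digit counter maintained per parent node, which is why lexicographic ordering of the children matches numeric ordering. A toy sketch in plain Python (not a real client) of how such names are formed:

```python
def next_sequential_name(prefix: str, counter: int) -> str:
    """Format a name the way ZooKeeper formats sequential ZNodes:
    the client-supplied prefix plus a 10-digit, zero-padded counter."""
    return f"{prefix}{counter:010d}"

# Each create under the same parent receives the next counter value.
print(next_sequential_name("/chat-app/locks/lock-", 7))
# /chat-app/locks/lock-0000000007
```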
Example Hierarchy
/chat-app
  /servers
    /server1 -> "192.168.1.101:8080"
    /server2 -> "192.168.1.102:8080"
  /users
    /alice -> "server1"
    /bob -> "server2"
  /config
    /max_users -> "10000"
    /message_rate -> "100/sec"
In practice, you usually would not store millions of online users directly as individual ZNodes. ZooKeeper works much better when it tracks:
- Servers
- Leaders
- Shard ownership
- Configuration
- Small membership sets
For massive user populations, a more scalable pattern is:
- Store server membership in ZooKeeper
- Use consistent hashing or partition metadata to map users to servers
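To make the second point concrete, here is a minimal consistent-hash sketch (plain Python, one point per server for simplicity; real rings use virtual nodes). Only the server list would live in ZooKeeper; the user-to-server mapping is computed locally:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash so every client computes the same ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Map users to servers given a membership set (e.g. read from
    ZooKeeper's /chat-app/servers children)."""
    def __init__(self, servers):
        self._ring = sorted((_hash(s), s) for s in servers)
        self._keys = [h for h, _ in self._ring]

    def server_for(self, user: str) -> str:
        # Walk clockwise to the first server at or after the user's hash.
        idx = bisect.bisect(self._keys, _hash(user)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["server1", "server2", "server3"])
owner = ring.server_for("alice")  # deterministic across all clients
```

The payoff is that when a server's ephemeral node disappears, only the users that hashed to that server get remapped; everyone else keeps their assignment.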
Sessions
ZooKeeper tracks clients through sessions.
When a client connects:
- It establishes a session with a timeout
- It sends heartbeats to keep that session alive
- Any ephemeral nodes it creates are tied to that session
If the session expires:
- ZooKeeper deletes that client's ephemeral nodes
- ZooKeeper removes that client's watches
This is why ephemeral nodes are so useful for failure detection. If a server crashes, it cannot keep its session alive, so ZooKeeper eventually removes its membership node automatically.
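A toy model (plain Python, not the real server) of the tie between sessions and ephemeral nodes:

```python
class CoordinationState:
    """Minimal sketch: ephemeral nodes remember their owning session,
    and session expiry removes them automatically."""
    def __init__(self):
        self.nodes = {}  # path -> owning session id (None = persistent)

    def create(self, path, session_id=None):
        self.nodes[path] = session_id

    def expire_session(self, session_id):
        # On expiry, every ephemeral node owned by that session vanishes.
        self.nodes = {p: s for p, s in self.nodes.items() if s != session_id}

state = CoordinationState()
state.create("/chat-app/config/max_users")                  # persistent
state.create("/chat-app/servers/server1", session_id="s1")  # ephemeral
state.expire_session("s1")                                  # server1 crashed
print(sorted(state.nodes))  # ['/chat-app/config/max_users']
```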
Important: failure detection is based on session timeout, not instant crash detection. With a 20-second timeout, a crashed server's ephemeral nodes can linger for up to 20 seconds before ZooKeeper concludes the session is dead and removes them.
Watches
Watches are ZooKeeper's notification mechanism.
A client can ask ZooKeeper to notify it when:
- A node's data changes
- A node is created or deleted
- A node's children change
This lets applications avoid constant polling.
Example:
zk.getChildren("/chat-app/servers", true);
If the set of available servers changes, ZooKeeper notifies the client so it can refresh its local routing table.
Watch Semantics
This is where people often oversimplify ZooKeeper.
Watches are:
- One-time triggers: after firing, they must be re-registered
- Best-effort notifications of change, not a durable event stream
- Usually paired with a read-after-notification pattern
The standard pattern is:
- Read the node or child list
- Register a watch
- On notification, read again
- Re-register the watch
ZooKeeper is therefore not a pub/sub system. The watch mechanism is for cache invalidation and coordination, not high-volume event delivery.
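The one-shot semantics and the read-then-re-register loop can be simulated in a few lines of plain Python (a toy store, not a real client):

```python
class OneShotWatchStore:
    """Toy key-value store with ZooKeeper-style one-time watches:
    a watch fires once on the next change, then must be re-registered."""
    def __init__(self):
        self.data = {}
        self._watches = {}  # path -> callbacks, cleared when fired

    def get(self, path, watch=None):
        if watch:
            self._watches.setdefault(path, []).append(watch)
        return self.data.get(path)

    def set(self, path, value):
        self.data[path] = value
        for cb in self._watches.pop(path, []):  # one-time: fire, then drop
            cb(path)

events = []
store = OneShotWatchStore()

def on_change(path):
    # The standard pattern: on notification, read again AND re-register.
    events.append(store.get(path, watch=on_change))

store.get("/chat-app/config/mode", watch=on_change)
store.set("/chat-app/config/mode", "normal")       # fires, re-registers
store.set("/chat-app/config/mode", "maintenance")  # fires again
print(events)  # ['normal', 'maintenance']
```

Note that if the callback forgot to re-register, the second change would be missed silently, which is exactly why the loop above is the standard pattern.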
Ensemble and Server Roles
ZooKeeper runs as an ensemble of servers, usually 3, 5, or 7 nodes.
Odd numbers matter because ZooKeeper uses quorum decisions:
- 3 nodes tolerate 1 failure
- 5 nodes tolerate 2 failures
- 7 nodes tolerate 3 failures
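The tolerance numbers above fall out of simple majority arithmetic, which also explains the preference for odd sizes:

```python
def failures_tolerated(ensemble_size: int) -> int:
    """A quorum is a strict majority, so the ensemble survives losing
    any minority: floor((n - 1) / 2) servers."""
    return (ensemble_size - 1) // 2

for n in (3, 4, 5, 7):
    print(n, failures_tolerated(n))
# A 4-node ensemble tolerates only 1 failure, same as 3 nodes,
# while requiring a larger quorum -- hence odd ensemble sizes.
```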
Server roles:
- Leader: orders and coordinates writes
- Followers: replicate state and serve reads
- Observers: optional non-voting members used mainly to scale reads without affecting quorum
Clients do not maintain active application-level connections to all ensemble members at once. A client typically connects to one server, and if that connection fails, it reconnects to another server in the ensemble while attempting to preserve its session.
Read and Write Path
ZooKeeper is optimized for read-heavy coordination workloads, not write-heavy transactional workloads.
Reads
Reads can be served by any ZooKeeper server from memory, which makes them fast.
Tradeoff:
- Reads from a follower can be stale
If a client needs stronger freshness guarantees, it can call sync() before reading to force the server to catch up with the leader.
Writes
Writes go through the leader:
- Client sends a write request
- Leader assigns a global order
- Leader replicates the proposal to followers
- Once a quorum acknowledges persistence, the write commits
- The client receives success
This gives ZooKeeper:
- Strongly ordered writes
- Atomic updates
- Durable committed state
But it also means writes are materially more expensive than reads.
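The commit condition in the write path reduces to a majority check (a sketch; the real protocol also involves ordering and persistence, covered under ZAB below):

```python
def committed(ensemble_size: int, acks: int) -> bool:
    """A proposal commits once a strict majority has persisted it;
    the leader's own acknowledgment counts toward the quorum."""
    return acks > ensemble_size // 2

# 5-node ensemble: the leader plus 2 followers is already a quorum.
print(committed(5, 3))  # True
print(committed(5, 2))  # False
```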
Rule of thumb: ZooKeeper works well when the coordination dataset is small and the system is far more read-heavy than write-heavy.
Consistency Model
ZooKeeper provides strong guarantees, but it is worth being precise.
What you can rely on:
- Linearizable writes: all committed writes have a single global order
- FIFO client order: a client's operations are observed in the order that client issued them
- Atomicity: an update fully succeeds or fails
- Durability: committed changes survive server failure
- Single system image: clients eventually converge on the same namespace state
What you should not assume:
- Every read from every follower is instantly up to date
ZooKeeper is very strong for coordination, but it does not guarantee that every read from every replica is globally fresh; if you need that, you must add explicit synchronization such as sync().
Core Use Cases
1. Configuration Management
ZooKeeper is commonly used to store small pieces of runtime configuration:
/ecommerce/config/pricing_algorithm -> "dynamic_v2"
/ecommerce/config/maintenance_mode -> "false"
/ecommerce/config/discount_threshold -> "50.00"
Why it works well:
- Config is small
- Config changes are infrequent
- Services can watch the relevant nodes
- Updates propagate without restarts
Good fit:
- Feature flags
- Service endpoints
- Dynamic throttling values
- Runtime toggles
Bad fit:
- Large configs
- Secrets at very large scale
- Static config that only changes during deployment
In cloud environments, managed tools are often a better first choice:
- AWS Systems Manager Parameter Store
- Azure App Configuration
- Google Cloud-native config systems
2. Service Discovery
Services register themselves with ephemeral nodes:
create -e /streaming/services/video-transcoder/instance1 "10.0.0.1:8080"
create -e /streaming/services/video-transcoder/instance2 "10.0.0.2:8080"
Clients:
- Read the child nodes under the service path
- Build a local view of healthy instances
- Watch for membership changes
This enables:
- Dynamic registration
- Failure-driven deregistration
- Lightweight service discovery
Modern alternatives often preferred today:
- Consul
- etcd
- Kubernetes Services and Endpoints
- Cloud-native service discovery systems
3. Leader Election
ZooKeeper's classic recipe uses ephemeral sequential nodes.
Pattern:
- Each candidate creates an ephemeral sequential node
- The candidate with the smallest sequence number becomes leader
- Each non-leader watches the node immediately ahead of it
- If that node disappears, it re-checks leadership
Example:
create -e -s /chat-app/leader/candidate- "server1"
create -e -s /chat-app/leader/candidate- "server2"
create -e -s /chat-app/leader/candidate- "server3"
If ZooKeeper creates:
candidate-0000000001
candidate-0000000002
candidate-0000000003
then candidate-0000000001 is the leader.
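The decision logic each candidate runs over the children of the election path can be sketched in plain Python (a simulation of the recipe, not a real client):

```python
def elect(candidates):
    """Smallest sequence number wins; every other candidate watches the
    node immediately ahead of it, so a leader failure wakes exactly one
    successor instead of the whole herd."""
    ordered = sorted(candidates)  # zero-padded suffixes sort numerically
    leader = ordered[0]
    watch_targets = {
        node: ordered[i - 1] for i, node in enumerate(ordered) if i > 0
    }
    return leader, watch_targets

leader, watches = elect([
    "candidate-0000000003",
    "candidate-0000000001",
    "candidate-0000000002",
])
print(leader)                           # candidate-0000000001
print(watches["candidate-0000000003"])  # candidate-0000000002
```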
Why this pattern is good:
- Automatic failover
- No polling loop needed
- Failure cleanup is automatic because nodes are ephemeral
4. Distributed Locks
Distributed locks use essentially the same recipe:
- Each client creates an ephemeral sequential node under a lock path
- The lowest sequence number holds the lock
- Other clients watch the predecessor node
- When the lock holder releases the lock or dies, the next client proceeds
This gives:
- Fair acquisition order
- Automatic cleanup on failure
- Stronger correctness than many lightweight lock systems
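The per-client decision in the lock recipe can be sketched the same way (plain Python, simulating one step of the protocol):

```python
def lock_step(children, my_node):
    """Decide a client's next move: hold the lock if it owns the smallest
    node, otherwise watch only its immediate predecessor."""
    ordered = sorted(children)
    idx = ordered.index(my_node)
    if idx == 0:
        return ("acquired", None)
    # Watching just the predecessor avoids a thundering herd when the
    # current holder releases or dies.
    return ("wait", ordered[idx - 1])

print(lock_step(["lock-0000000005", "lock-0000000009"], "lock-0000000009"))
# ('wait', 'lock-0000000005')
```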
ZooKeeper locks are useful when:
- Correctness matters more than raw speed
- Locks may be long-lived
- You want crash-safe cleanup semantics
ZooKeeper locks are a poor fit when:
- Lock acquisition happens at very high frequency
- You need ultra-low latency
- The lock path becomes a hotspot
For simpler high-throughput scenarios, Redis or database-backed coordination may be operationally easier.
How ZooKeeper Works Internally
ZAB and Quorum Replication
ZooKeeper uses the ZooKeeper Atomic Broadcast (ZAB) protocol to keep the ensemble consistent.
At a high level:
- A leader is elected
- The leader totally orders writes
- Followers persist the leader's proposals
- A write commits after quorum acknowledgment
This ensures that the surviving quorum preserves a consistent, durable history.
Internal Leader Election
Do not confuse ZooKeeper's internal leader election with the application-level leader election recipe built on ZNodes.
Internally, ZooKeeper elects a leader based primarily on:
- The most up-to-date transaction history (zxid)
- Server ID (sid) as a tiebreaker
So:
- Internal ZooKeeper leader election prefers the most up-to-date server
- Application leader election recipe usually picks the smallest ephemeral sequential node
They solve different problems and use different rules.
Storage
ZooKeeper persists state using:
- Transaction logs for every committed update
- Snapshots for faster recovery
On restart, a server:
- Loads the latest snapshot
- Replays newer transaction log entries
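The recovery sequence reduces to "load snapshot, replay only newer entries," which can be sketched in plain Python (treating state as a flat path-to-value map for illustration):

```python
def recover(snapshot, snapshot_zxid, log):
    """Rebuild in-memory state: start from the snapshot, then apply only
    the transaction log entries newer than the snapshot's last zxid."""
    state = dict(snapshot)
    for zxid, path, value in log:
        if zxid > snapshot_zxid:  # older entries are already in the snapshot
            state[path] = value
    return state

state = recover(
    snapshot={"/config/mode": "normal"},
    snapshot_zxid=10,
    log=[(9, "/config/mode", "old"), (11, "/config/mode", "maintenance")],
)
print(state)  # {'/config/mode': 'maintenance'}
```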
The transaction log is the most performance-sensitive component. Slow disks can directly hurt ZooKeeper write latency.
Failure Handling
ZooKeeper is designed to keep coordination state correct under partial failure.
Follower Failure
If a follower fails:
- Reads and writes continue as long as quorum remains
Leader Failure
If the leader fails:
- The ensemble elects a new leader
- Writes pause briefly during election
- Service resumes once a new leader is established
Network Partition
If the ensemble splits:
- The partition with a quorum can continue
- The minority partition cannot accept writes
This prevents split brain.
Client Failure
If a client dies or is disconnected long enough for its session to expire:
- Its ephemeral nodes are deleted
- Other clients observing those nodes are notified
That is what makes ZooKeeper so effective for:
- Membership tracking
- Automatic failover
- Cleanup of abandoned coordination state
Limitations
ZooKeeper is powerful, but it has clear limits.
1. Write Throughput Is Limited
Because all writes are serialized through the leader and require quorum replication, ZooKeeper is not suitable for high-write data planes.
Do not use it for:
- User messages
- Orders
- Metrics firehoses
- General application data
2. Data Must Stay Small
ZooKeeper stores its working set in memory. Keep values small and the namespace modest enough to fit comfortably in RAM.
3. Watches Can Create Hotspots
If too many clients watch the same node:
- Notifications can become expensive
- Popular coordination paths become bottlenecks
This matters in:
- Global leader election
- Highly contended locks
- Massive membership directories
4. Operational Complexity
ZooKeeper is conceptually simple for application developers, but operating it well requires care:
- JVM tuning
- Correct disk layout
- Sensible session timeouts
- Monitoring of latency and connection counts
- Thoughtful ensemble sizing
Practical summary: simple API, non-trivial operations.
ZooKeeper in the Modern World
ZooKeeper still matters, especially in older and Apache-heavy infrastructure.
It has historically been central to systems such as:
- HBase
- Hadoop ecosystem components
- SolrCloud
- Storm
- NiFi
- Pulsar
- Replicated ClickHouse deployments
But the industry has moved toward alternatives:
- etcd for cloud-native control planes and Kubernetes-style metadata
- Consul for service discovery and network-aware coordination
- Managed cloud services for config and discovery
- Built-in consensus systems inside the application itself
The best example of this shift is Kafka's move away from ZooKeeper toward KRaft, where Kafka embeds its own Raft-based metadata quorum.
That reflects a broader pattern:
- Older systems often externalized coordination into ZooKeeper
- Newer systems often embed consensus or rely on platform-managed control planes
When To Use ZooKeeper in Interviews
ZooKeeper is a good answer when the problem is explicitly about coordination infrastructure.
Strong use cases:
- Designing a distributed message queue
- Designing a distributed task scheduler
- Designing shard or partition leadership
- Membership tracking for brokers or workers
- Coordinating failover for critical controllers
- Hierarchical or correctness-sensitive distributed locks
Weak use cases:
- Standard CRUD systems
- High-throughput primary data storage
- Feature flagging inside a fully managed cloud stack
- Service discovery inside Kubernetes unless the interviewer specifically wants the underlying control-plane mechanics
Good Interview Framing
If you mention ZooKeeper, say why:
"I would use ZooKeeper for the control plane only: broker membership, partition leadership, and failover coordination. The actual user data path stays in the message brokers and storage layer."
That framing shows you understand the boundary between:
- Coordination metadata
- Application data
Comparison With Alternatives
| Tool | Best for | Why you might choose it over ZooKeeper |
|---|---|---|
| etcd | Cloud-native metadata and config | Simpler modern API, strong Kubernetes alignment |
| Consul | Service discovery plus health checks | Broader service-network tooling |
| Redis | Fast lightweight locks and ephemeral coordination | Lower latency, easier ops for simpler needs |
| Cloud-native managed services | Config/discovery in one cloud | Lower operational burden |
| Embedded Raft in the app | Systems that own their own metadata plane | Fewer moving parts than external ZooKeeper |
ZooKeeper still wins when:
- You need a proven coordination system
- The surrounding ecosystem already expects it
- You want ephemeral membership and watch-based coordination with strong correctness
Practical Design Advice
If you use ZooKeeper in a design:
- Keep only metadata in ZooKeeper
- Keep the namespace simple and explicit
- Avoid large fan-out on hot nodes
- Cache data locally and use watches to invalidate
- Use ephemeral nodes for liveness
- Use sequential ephemeral nodes for elections and locks
- Choose session timeout carefully
What not to do:
- Do not route every user request through ZooKeeper
- Do not use ZooKeeper as your primary database
- Do not store large blobs
- Do not design around high write rates to a single coordination path
Summary
ZooKeeper is a distributed coordination service, not a general database.
It solves a narrow but critical set of distributed systems problems:
- Shared configuration
- Membership and service discovery
- Leader election
- Distributed locks
It does this through a small number of powerful primitives:
- Hierarchical ZNodes
- Sessions and ephemeral nodes
- Watches
- Quorum-replicated writes through a leader
Even if you never deploy ZooKeeper directly, understanding it is valuable because the same ideas appear everywhere else in distributed systems.
References
- Apache ZooKeeper documentation
- "ZooKeeper: Wait-free coordination for Internet-scale systems"