Elasticsearch - Complete Deep Dive

What is Elasticsearch?

Elasticsearch is a distributed search and analytics engine built on top of Apache Lucene that enables:

  • Full-text search: Index and search large volumes of data
  • Real-time analytics: Aggregate and visualize data instantly
  • Distributed storage: Horizontal scaling across multiple nodes
  • High availability: Automatic replication and failover
  • Flexible querying: Complex boolean queries, fuzzy search, range queries

Core Characteristics

Aspect | Benefit
------ | -------
Distributed | Horizontal scaling across nodes/clusters
Fault-tolerant | Automatic replica placement and recovery
Real-time | Sub-100ms search latency on large datasets
Schemaless | Dynamic field mapping (flexible structure)
Horizontally scalable | Add nodes to increase throughput/storage
Full-text search | Advanced tokenization, stemming, analyzers

Elasticsearch Architecture Overview

Clients (HTTP/TCP)
       ↓
   [Elasticsearch Cluster]
       ├─ Node 1 (Data + Master eligible)
       ├─ Node 2 (Data + Master eligible)
       ├─ Node 3 (Data + Master eligible)
       └─ Coordinator Node (routes requests)
       ↓
   [Indices] → [Shards] → [Documents]
       ↓
   [Lucene indexes]

Key layers:

  • Cluster: Collection of nodes working together
  • Node: Single Elasticsearch instance
  • Index: Like a database table, holds documents
  • Shard: Partition of an index (enables parallelism)
  • Replica: Copy of a shard for redundancy
  • Document: JSON object, basic unit of data

Core Components

1. Clusters and Nodes

Cluster: Logical grouping of one or more Elasticsearch nodes

Production Cluster Setup:

Master nodes (3):
  - Elect cluster leader
  - Manage cluster state
  - No data stored
  - Lightweight, high availability

Data nodes (N):
  - Store indices and shards
  - Execute search/index operations
  - Resource intensive (disk, CPU, RAM)

Coordinator nodes (optional):
  - Route requests to data nodes
  - Reduce load on master nodes
  - No data stored

Node roles:

# Master-eligible node
node.roles: [master, data]

# Data-only node
node.roles: [data]

# Ingest node
node.roles: [ingest]

# Coordinating-only node (empty roles list)
node.roles: []

2. Indices and Shards

Index: Similar to a database table; stores documents

Index: products
  ├─ Shard 0 (Primary) → Replica 0
  │   Docs: 100K
  │   Size: 500MB
  │
  ├─ Shard 1 (Primary) → Replica 1
  │   Docs: 100K
  │   Size: 500MB
  │
  └─ Shard 2 (Primary) → Replica 2
      Docs: 100K
      Size: 500MB

Shard: Partition of an index (enables parallel processing)

Why shards?
- Parallelism: Multiple shards can be searched simultaneously
- Throughput: Spread writes across shards
- Storage: Each shard is a complete Lucene index
- Distribution: Shards spread across nodes for load balancing

Shard placement strategy:
- Primary shards: Distributed round-robin across nodes
- Replicas: Placed on different nodes than primary
- Example: 5 shards × 2 replicas = 15 total shard copies

Index settings:
{
  "settings": {
    "number_of_shards": 5,     // Fixed at creation (change requires reindex/split)
    "number_of_replicas": 2    // Can be changed at any time
  }
}

Replica: Copy of a shard for redundancy

Benefits:
- HA: If primary shard lost, replica promoted to primary
- Search parallelism: Searches hit both primary + replicas
- Read scalability: More replicas = more parallel reads

Trade-offs:
- Disk space: 2 replicas = 3x storage usage
- Network I/O: Replication introduces latency
- Indexing latency: Must replicate to all replicas
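The shard/replica arithmetic above is easy to get wrong in capacity planning; a tiny sketch (illustrative helper names, not an Elasticsearch API):

```python
def total_shard_copies(primaries: int, replicas: int) -> int:
    # Each primary shard is duplicated `replicas` times, so the cluster
    # stores primaries * (1 + replicas) shard copies in total.
    return primaries * (1 + replicas)

def storage_multiplier(replicas: int) -> int:
    # 2 replicas -> every byte of index data exists 3 times on disk
    return 1 + replicas

# 5 primary shards with 2 replicas each -> 15 shard copies on disk
copies = total_shard_copies(5, 2)
```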

3. Documents and Mappings

Document: JSON object indexed in Elasticsearch

{
  "_index": "products",
  "_id": "12345",
  "_type": "_doc",
  "_version": 1,
  "_score": 2.5,
  "_source": {
    "title": "Laptop",
    "price": 999.99,
    "category": "Electronics",
    "tags": ["computer", "portable"],
    "created_at": "2024-01-05T10:00:00Z"
  }
}

Mapping: Schema definition for an index

{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard"
      },
      "price": {
        "type": "float"
      },
      "category": {
        "type": "keyword"
      },
      "tags": {
        "type": "keyword"
      },
      "created_at": {
        "type": "date",
        "format": "strict_date_time"
      }
    }
  }
}

Field types:

Type | Use Case | Searchable | Sortable
---- | -------- | ---------- | --------
text | Full-text search (title, description) | Yes (analyzed) | No
keyword | Exact match, filtering (category, status) | Yes (not analyzed) | Yes
integer/float | Numbers (price, quantity) | Yes | Yes
date | Timestamps | Yes | Yes
nested | Array of objects (comments, reviews) | Yes | Limited
geo_point | Latitude/longitude | Yes (geo queries) | Yes

4. Inverted Index (Core of Lucene)

Inverted Index: Maps terms → documents (enables fast search)

Documents:
Doc 1: "Elasticsearch is powerful"
Doc 2: "Elasticsearch is fast"
Doc 3: "Fast search engine"

Inverted Index:
elasticsearch → [Doc1, Doc2]
is → [Doc1, Doc2]
powerful → [Doc1]
fast → [Doc2, Doc3]
search → [Doc3]
engine → [Doc3]

Query: "fast"
  1. Look up "fast" in index → [Doc2, Doc3]
  2. Return matches instantly

Analysis process (converts text → searchable terms):

Input: "The Quick Brown Fox"
    ↓
1. Tokenizer: ["The", "Quick", "Brown", "Fox"]
    ↓
2. Token Filters:
   - Lowercase: ["the", "quick", "brown", "fox"]
   - Remove stopwords: ["quick", "brown", "fox"]
    ↓
3. Store in inverted index:
   quick → [Doc]
   brown → [Doc]
   fox → [Doc]
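The analysis pipeline and inverted index above can be sketched in a few lines of Python (a toy model: whitespace tokenizer, lowercase filter, small stopword list):

```python
from collections import defaultdict

STOPWORDS = {"the", "is", "a", "an", "of"}

def analyze(text):
    """Whitespace tokenize -> lowercase -> drop stopwords."""
    return [t for t in (tok.lower() for tok in text.split())
            if t not in STOPWORDS]

def build_inverted_index(docs):
    """Map each term to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

docs = {
    1: "Elasticsearch is powerful",
    2: "Elasticsearch is fast",
    3: "Fast search engine",
}
idx = build_inverted_index(docs)
```

A query for "fast" is then just a set lookup (`idx["fast"]`), which is why term lookups stay fast regardless of document count.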

Indexing (Writes)

Indexing Flow

Client:
  PUT /products/_doc/12345
  { "title": "Laptop", "price": 999 }
       ↓
Primary Shard:
  1. Parse JSON
  2. Analyze fields
  3. Build inverted index
  4. Store in memory buffer (refresh)
       ↓
Replica Shards:
  Receive and index same document
       ↓
Acknowledgment:
  Return success to client

Indexing Settings

# Bulk indexing for high throughput
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

# One action dict per document; helpers.bulk handles batching
actions = (
    {
        "_index": "products",
        "_id": i,
        "_source": {"title": f"Product {i}", "price": i * 10},
    }
    for i in range(10_000)
)

bulk(es, actions, chunk_size=500)

Refresh and flush:

Refresh (every 1 second by default):
  - Flushes buffer to Lucene segment
  - Makes documents searchable
  - No durability guarantee

Flush:
  - Commits Lucene segment to disk
  - Updates transaction log
  - Ensures durability

Optimization:
  - Bulk indexing: Increase refresh_interval during bulk operations
  - Disable replicas: Index to primary only, then enable replicas
  - Bulk size: 5-15MB chunks for optimal performance
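A sketch of the bulk-load toggle described above. The settings bodies are standard index settings (sent via the Update Index Settings API); `chunk_by_bytes` is a hypothetical helper for keeping batches in the 5-15MB range:

```python
# Bodies for PUT /<index>/_settings, before and after a bulk load
BULK_LOAD_SETTINGS = {"index": {"refresh_interval": "-1",
                                "number_of_replicas": 0}}
RESTORE_SETTINGS = {"index": {"refresh_interval": "1s",
                              "number_of_replicas": 2}}

def chunk_by_bytes(actions, max_bytes=10 * 2**20):
    """Group bulk actions into batches of at most ~max_bytes
    (middle of the 5-15MB guidance). Uses repr() length as a
    rough stand-in for serialized JSON size."""
    batch, size = [], 0
    for action in actions:
        n = len(repr(action).encode("utf-8"))
        if batch and size + n > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(action)
        size += n
    if batch:
        yield batch
```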

Searching (Reads)

Query Types

Match Query (full-text, analyzed):

{
  "query": {
    "match": {
      "title": "fast search"
    }
  }
}
// Matches docs containing "fast" or "search"; variants like
// "faster"/"searched" only match if the field uses a stemming analyzer

Term Query (exact match, not analyzed):

{
  "query": {
    "term": {
      "status": "active"
    }
  }
}
// Exact match only

Bool Query (combine conditions):

{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "laptop"}},
        {"range": {"price": {"gte": 500, "lte": 1500}}}
      ],
      "filter": [
        {"term": {"in_stock": true}}
      ],
      "should": [
        {"match": {"tags": "gaming"}}
      ]
    }
  }
}

Filter Context (cached, fast):

{
  "query": {
    "bool": {
      "filter": [
        {"term": {"status": "active"}},
        {"range": {"created_at": {"gte": "2024-01-01"}}}
      ]
    }
  }
}
// Filters are cached and don't affect scoring
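A hypothetical Python helper that assembles bool query bodies like the ones above (the helper itself is illustrative; the JSON it produces is standard Query DSL):

```python
def bool_query(must=None, filters=None, should=None):
    """Assemble a bool query body. Clauses under `filter` are cached
    by Elasticsearch and do not contribute to _score."""
    bool_clause = {}
    if must:
        bool_clause["must"] = must
    if filters:
        bool_clause["filter"] = filters
    if should:
        bool_clause["should"] = should
    return {"query": {"bool": bool_clause}}

q = bool_query(
    must=[{"match": {"title": "laptop"}}],
    filters=[{"term": {"in_stock": True}},
             {"range": {"price": {"gte": 500, "lte": 1500}}}],
)
```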

Search Execution Flow

Query:
  bool {
    must: [match: "laptop", range: price 500-1500],
    filter: [term: in_stock=true]
  }
       ↓
1. Query Phase (find matching shards):
   - All shards execute query
   - Return top 10 doc IDs + scores
   - Coordinator gathers results
       ↓
2. Fetch Phase (get documents):
   - Coordinator fetches full documents
   - Apply sorting/pagination
   - Return to client

Query DSL Examples:

# Python client
es = Elasticsearch("http://localhost:9200")

# Full-text search
results = es.search(index="products", body={
    "query": {
        "match": {
            "title": "laptop"
        }
    },
    "size": 10
})

# Filter + aggregation
results = es.search(index="products", body={
    "query": {
        "bool": {
            "filter": [
                {"term": {"in_stock": True}},
                {"range": {"price": {"gte": 500}}}
            ]
        }
    },
    "aggs": {
        "avg_price": {"avg": {"field": "price"}},
        "categories": {
            "terms": {"field": "category", "size": 10}
        }
    }
})

Aggregations (Analytics)

Aggregations: Group and analyze data without searching

{
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          {"to": 500},
          {"from": 500, "to": 1000},
          {"from": 1000}
        ]
      }
    },
    "by_category": {
      "terms": {
        "field": "category",
        "size": 10
      },
      "aggs": {
        "avg_price": {"avg": {"field": "price"}}
      }
    }
  }
}

Common aggregation types:

Aggregation | Use Case | Example
----------- | -------- | -------
terms | Count occurrences | Top 10 products by sales
avg/sum/max/min | Statistical | Average price per category
date_histogram | Time-series | Sales per day
percentiles | Distribution | P99 latency
cardinality | Unique count | Unique users

Performance Optimization

Indexing Optimization

# Index settings during bulk indexing (set via PUT /<index>/_settings)
"index.refresh_interval": "-1"     # Disable auto-refresh
"index.number_of_replicas": 0      # Skip replication during load

# Node settings (elasticsearch.yml)
bootstrap.memory_lock: true            # Lock heap in RAM (no swap)
indices.memory.index_buffer_size: 30%  # Larger indexing buffer

# After the load: restore replicas
"index.number_of_replicas": 2

# Force merge segments
POST /<index>/_forcemerge?max_num_segments=1

Search Optimization

# Fielddata cache for frequent sorting/aggregations (node-level)
indices.fielddata.cache.size: 20%

# Shard allocation awareness
cluster.routing.allocation.awareness.attributes: zone
node.attr.zone: us-west-1a

# Query cache (caches filter results, per-index)
index.queries.cache.enabled: true

# Request cache (shard-level, e.g. aggregations)
indices.requests.cache.size: 1%

Resource Configuration

# JVM heap (jvm.options): half of system RAM, capped around 31GB
# to keep compressed object pointers
-Xms8g
-Xmx8g

# File descriptors (OS limit)
ulimit -n 65535

# Virtual memory (sysctl)
vm.max_map_count=262144

# Network (elasticsearch.yml)
http.port: 9200
transport.port: 9300
network.host: 0.0.0.0

Scalability & High Availability

Cluster Architecture

Small cluster (1-10 nodes):

3 Master-eligible nodes
+ 5 Data nodes
= 8 total nodes
Handles: 1-10M docs/day

Medium cluster (10-50 nodes):

3-5 Master nodes
+ 20-40 Data nodes
+ Coordinator nodes (optional)
Handles: 100M-1B docs/day

Large cluster (50+ nodes):

5 Master nodes (separate from data)
+ 100+ Data nodes
+ 10+ Coordinator nodes
+ Dedicated ingest nodes (optional)
Handles: 1B+ docs/day

Replica Strategy

Trade-off: Availability vs Resource Cost

0 Replicas:
  - Failure = data loss
  - Lowest cost
  - Fast indexing
  - Use for: Testing, non-critical data

1 Replica:
  - Failure = one node down, data preserved
  - Can lose one node
  - Balanced cost/availability
  - Use for: Production with SLA

2+ Replicas:
  - Failure = multiple nodes can go down
  - Higher cost (3x storage with 2 replicas)
  - Excellent read parallelism
  - Use for: Critical, high-traffic systems

Cross-Cluster Replication

Primary Cluster (US-East):
  [Index A] → Replicates → Secondary Cluster (EU-West)
                              [Index A Mirror]

Use case: Geographic redundancy, disaster recovery

Configuration:
- CCR (Cross-Cluster Replication)
- Bidirectional replication for active-active
- One-way for active-passive DR

Configuration Tuning

Discovery and Cluster Formation

# elasticsearch.yml

# Cluster name
cluster.name: my-cluster

# Discovery and bootstrap
discovery.seed_hosts: ["node1", "node2", "node3"]
cluster.initial_master_nodes: ["node1", "node2", "node3"]

# Split-brain prevention: 7.x+ derives the quorum automatically;
# pre-7.x required discovery.zen.minimum_master_nodes = (N/2 + 1)

# Shard allocation
cluster.routing.allocation.enable: all

Index Configuration

{
  "settings": {
    "number_of_shards": 10,
    "number_of_replicas": 2,
    "index.codec": "best_compression",
    "index.refresh_interval": "30s",
    "index.store.type": "niofs"
  }
}

Shard Size Guidelines

Target shard size: 20-50 GB (per replica)

Calculation:
  Total index size: 1 TB
  Desired shard size: 30 GB
  Number of shards needed: 1TB / 30GB ≈ 34 shards

Too few shards:
  - Few parallel searches
  - Slow recovery
  - Unbalanced load

Too many shards:
  - Overhead from coordination
  - Slow recovery (many shards to rebuild)
  - Per-shard memory overhead
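The shard-count calculation above, as a small helper (illustrative; rounds up so no shard exceeds the target size):

```python
import math

def shards_for_index(total_gb: float, target_shard_gb: float = 30.0) -> int:
    """Number of primary shards needed to keep each shard at or
    below the target size (20-50 GB guideline)."""
    return math.ceil(total_gb / target_shard_gb)

# 1 TB index at ~30 GB per shard -> 34 shards, matching the example above
n = shards_for_index(1000)
```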

Use Cases

1. E-commerce Search
  - Product catalog search
  - Faceted search (filter by category, price range)
  - Autocomplete/typeahead
  - Search-as-you-type

2. Logging and Metrics (ELK Stack)

Elasticsearch + Logstash + Kibana:
  - Collect logs from servers
  - Parse and enrich with Logstash
  - Index in Elasticsearch
  - Visualize in Kibana

Benefits:
  - Real-time log analysis
  - Pattern detection
  - Performance monitoring

3. Analytics and BI

Time-series analytics:
  - Event data (clicks, page views, purchases)
  - Metrics (CPU, memory, request latency)
  - Aggregations for dashboards
  - Ad-hoc analysis
4. Geospatial Search
Map-based services:
  - "Find restaurants near me"
  - Ride-sharing (nearby drivers)
  - Delivery services (nearest warehouse)

Geo query types:
  - geo_distance: Radius search
  - geo_bounding_box: Area search
  - geo_polygon: Complex boundaries

Interview Questions & Answers

Q1: Design a search system for an e-commerce platform (10M products, 10K QPS)

Requirements:

  • Fast product search (< 100ms)
  • Faceted search (filter by category, price, ratings)
  • Autocomplete suggestions
  • 99.99% uptime

Solution Architecture:

Cluster Setup:
  - 3 Master nodes (t3.medium)
  - 20 Data nodes (r5.4xlarge: 128GB RAM, high memory)
  - Coordinator nodes (optional, for high load)

Index Design:
  "products" index
    - Shards: 20 (1 per data node for parallelism)
    - Replicas: 2 (3x total copies for HA)
    - Mapping:
      {
        "product_id": keyword,
        "title": text (analyzer: standard + stemmer),
        "description": text,
        "category": keyword,
        "price": float,
        "rating": float,
        "in_stock": boolean,
        "created_at": date
      }

Query optimization:

# Faceted search with aggregations
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"title": user_input}}
            ],
            "filter": [
                {"term": {"category": selected_category}},
                {"range": {"price": {"gte": min_price, "lte": max_price}}},
                {"term": {"in_stock": True}}
            ]
        }
    },
    "aggs": {
        "categories": {"terms": {"field": "category", "size": 50}},
        "price_ranges": {
            "range": {
                "field": "price",
                "ranges": [{"to": 100}, {"from": 100, "to": 500}, ...]
            }
        },
        "ratings": {"terms": {"field": "rating", "size": 5}}
    },
    "size": 20,
    "from": 0
}

Autocomplete:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product_title": {
        "type": "text",
        "analyzer": "autocomplete_analyzer",
        "fields": {
          "suggest": {
            "type": "completion"
          }
        }
      }
    }
  }
}

// Query autocomplete
{
  "query": {
    "match_phrase_prefix": {
      "product_title": "lapt"
    }
  },
  "size": 10
}

Performance expectations:

Operation | Latency | Notes
--------- | ------- | -----
Search | 50-100ms | P99 with 20 shards in parallel
Faceted search | 100-200ms | Aggregations take longer
Autocomplete | 10-50ms | Completion suggester cached

Q2: Cluster goes down. How to recover without data loss?

Answer:

Backup strategy:

# Snapshot repository (S3)
PUT /_snapshot/s3-backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-es-backups",
    "region": "us-east-1"
  }
}

# Take snapshot daily
POST /_snapshot/s3-backup/snapshot-2024-01-05?wait_for_completion=true

Recovery scenarios:

Failure | Impact | Recovery Time
------- | ------ | -------------
1 Data node fails | No impact (replicas take over) | Automatic (seconds)
Multiple nodes fail | Replicas insufficient | Minutes (depends on replica count)
Entire cluster down | Data recoverable from snapshot | 30min-2hrs (restore from S3)

Recovery procedure:

# 1. Restore cluster from backup
POST /_snapshot/s3-backup/snapshot-2024-01-05/_restore
{
  "indices": "products,users",
  "ignore_unavailable": true
}

# 2. Monitor recovery status
GET /_recovery

# 3. Verify data integrity
GET /products/_count

# 4. Resync replicas
PUT /products/_settings
{
  "index.number_of_replicas": 2
}

Key takeaway: "With 2+ replicas, node failure is automatic failover. For total cluster failure, restore from snapshots (RPO = 1 day if daily snapshots)."


Q3: Search latency spike. How to diagnose?

Answer:

Diagnosis checklist:

# 1. Check cluster health
GET /_cluster/health

# 2. Check shard allocation
GET /_cat/shards?v

# 3. Check node stats (CPU, memory, GC)
GET /_nodes/stats

# 4. Check slowlog
GET /products/_settings?include_defaults=true | grep slowlog

# 5. Enable slowlog for 1 second queries
PUT /products/_settings
{
  "index.search.slowlog.threshold.query.warn": "1s"
}

# 6. Check pending tasks
GET /_cluster/pending_tasks

# 7. Monitor JVM GC
GET /_nodes/stats/jvm

Common causes and fixes:

Problem | Cause | Fix
------- | ----- | ---
High query latency | Large result set (slow fetch phase) | Reduce page size, use scroll API
GC pauses | Heap too full, frequent GC | Increase heap, optimize query
Unbalanced shards | Hot shards (too much traffic) | Rebalance shards, adjust routing
Slow aggregations | Too many buckets | Reduce aggregation cardinality
Disk I/O bottleneck | SSD at capacity | Add nodes, optimize refresh interval

Q4: Index size growing 10x in 3 months. How to optimize storage?

Answer:

Root cause analysis:

# Check index size
GET /_cat/indices?v&h=index,store.size,docs.count

# Calculate docs per GB
docs_per_gb = docs.count / (store.size / 1GB)

Optimization strategies:

  1. Compression:
{
  "settings": {
    "index.codec": "best_compression"  // DEFLATE vs the default LZ4
  }
}
  2. Disable unnecessary fields:
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "index": false  // Don't index, just store
      }
    }
  }
}
  3. Field type optimization:
text field: roughly 2x the size of an equivalent keyword field
nested objects: high overhead
Use keyword wherever full-text search is not needed
  4. Index lifecycle management:
Hot: Current month (full replicas, high refresh rate)
Warm: Last 3 months (fewer replicas, lower refresh rate)
Cold: Older data (minimal replicas, snapshot only)
Delete: > 1 year

Expected improvements:

Before optimization:
  1M docs = 10GB storage (10KB per doc average)

After compression + field optimization:
  1M docs = 5GB storage (50% reduction)

With tiering:
  1M docs = 3GB total (hot: 1GB, warm: 1.5GB, cold: 0.5GB)

Q5: Designing Elasticsearch for real-time log aggregation (1M logs/sec)

Answer:

Architecture:

Log Sources (servers, apps)
       ↓
Logstash (parsing, enrichment)
       ↓
Elasticsearch Cluster (hot-warm-cold)
       ↓
Kibana (visualization)

Cluster sizing:

1M logs/sec × 1KB per log = 1GB/sec = 86TB/day

Storage requirement (7-day retention):
  86TB × 7 = 600TB raw
  With compression (50%): 300TB
  With 2 replicas: 900TB total

Hardware needed:
  - 30 nodes × 30TB each (r5.4xlarge with attached storage)
  - 3 dedicated master nodes
  - 5 coordinator nodes (distribute load)
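The back-of-envelope sizing above can be reproduced with two small functions (assumed inputs: 1KB/log, 50% compression, 2 replicas):

```python
def daily_ingest_tb(logs_per_sec: float, bytes_per_log: float) -> float:
    # 86,400 seconds per day; 1e12 bytes per TB
    return logs_per_sec * bytes_per_log * 86_400 / 1e12

def total_storage_tb(daily_tb: float, retention_days: int,
                     compression_ratio: float = 0.5,
                     replicas: int = 2) -> float:
    # Compress the retained data, then multiply by total copies
    compressed = daily_tb * retention_days * compression_ratio
    return compressed * (1 + replicas)

daily = daily_ingest_tb(1_000_000, 1_000)   # ~86.4 TB/day
total = total_storage_tb(daily, 7)          # ~900 TB with 2 replicas
```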

Index strategy:

Daily rolling indices:
  logs-2024-01-05 (hot)
  logs-2024-01-04 (warm)
  logs-2024-01-03 (warm)
  logs-2024-01-02 (cold)
  logs-2024-01-01 (cold)

Settings per day:
  - Shards: 20 (distribute across data nodes)
  - Replicas: 1 initially, scale to 2 for HA
  - Refresh interval: 30-60s (batch writes)
  - Rollover when: 50GB or 24 hours

Indexing optimization:

# Logstash pipeline output (illustrative; option names vary
# across Logstash versions)
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    # Batch size/latency are tuned via pipeline.batch.size and
    # pipeline.batch.delay in logstash.yml
    compression_level => 9
  }
}

Query optimization:

Use time-range filter (fast):
  range: { @timestamp: { gte: "2024-01-05" } }

Avoid unscoped full-text search across all logs:
  Cost: the query fans out to every shard of every index
  Better: Filter by time range, log level, service first, then search

Use aggregations for metrics:
  Errors per service (terms aggregation)
  Response time distribution (percentiles)

Elasticsearch vs Alternatives

System | Throughput | Latency | Best For | Trade-off
------ | ---------- | ------- | -------- | ---------
Elasticsearch | 1M+/sec | 10-100ms | Full-text, logs, analytics | Complex, resource-hungry
Solr | 100K/sec | 50-200ms | Enterprise search | Slower, Java-heavy
OpenSearch | 1M+/sec | 10-100ms | ES alternative (AWS native) | Fewer plugins
Algolia | 100K/sec | 1-10ms | Hosted search (SaaS) | High cost, limited customization
Meilisearch | 100K/sec | 1-50ms | Fast search UX | Less flexible

Best Practices

Operational Best Practices

✓ Always use replicas (min 2) for production
✓ Monitor JVM GC (long pauses = latency spike)
✓ Use dedicated master nodes (separate from data)
✓ Set ulimit -n 65535 (file descriptors)
✓ Enable security (X-Pack authentication)
✓ Monitor disk space (leave 20% free for merge operations)
✓ Plan for growth (add nodes before hitting limits)
✓ Use snapshot/restore (daily backups to S3/GCS)

Query Best Practices

✓ Filter before search (filter clauses are cached and faster)
✓ Reuse stable queries for aggregations (they hit the shard request cache)
✓ Paginate with search_after (not from/size for large offsets)
✓ Use async search for long-running queries (_async_search API)
✓ Batch requests (_msearch instead of individual queries)

Mapping Best Practices

✓ Use keyword for exact match (faster, smaller)
✓ Use text only for full-text search (analyzer overhead)
✓ Set explicit mappings (avoid field explosion)
✓ Use ignore_above (prevent oversized keywords)
✓ Disable _source for read-only indices (save space)


Disaster Recovery Strategy

Single-Region Setup

Scenario | Impact | Recovery
-------- | ------ | --------
Single node failure | Auto (replicas take over) | Seconds
Multiple node failure | Partial outage (if replicas insufficient) | Minutes (nodes rejoin)
Cluster failure | Complete outage | Restore from snapshot (30min-2hrs)

Mitigation: Always use 2+ replicas in production

Multi-Region Setup

Primary Cluster (US-East)
  [Logs Index]
       ↓ CCR
Secondary Cluster (EU-West)
  [Logs Index (read-only)]

Failover process:
  1. Stop CCR replication
  2. Promote secondary to read-write
  3. Update application to point to EU-West
  4. RPO: Depends on replication lag (usually < 1 sec)

Summary & Key Takeaways

Elasticsearch excels at:

  • ✓ Full-text search (sub-100ms on large datasets)
  • ✓ Real-time log analytics (ELK stack)
  • ✓ Time-series data (metrics, events)
  • ✓ Flexible schema (dynamic mappings)
  • ✓ High availability (replicas, automatic failover)

Key challenges:

  • ✗ Operational complexity (cluster management)
  • ✗ Resource intensive (memory, disk, CPU)
  • ✗ Consistency trade-offs (eventual consistency)
  • ✗ Learning curve (complex DSL, tuning)
  • ✗ Cost at scale (large clusters needed for high volume)

Critical design questions:

  1. What's my search latency target (ms)?
  2. How much data must I retain (days/months)?
  3. What's my query volume (QPS)?
  4. Do I need full-text or exact-match search?
  5. What's my availability SLA?
  6. Can I afford downtime for rebalancing?
  7. What's my budget for clusters (storage, compute)?

Elasticsearch — Practical Usage Guide

System designs can involve a dizzying array of different technologies, concepts and patterns, but one technology stands out for search and retrieval: Elasticsearch. While most database systems are pretty good at search (e.g. Postgres with a full-text index is sufficient for many problems!), at a certain scale or level of sophistication you'll want a purpose-built system.

From an interview perspective, there are two angles:

  1. How to use Elasticsearch — gives you a powerful tool for product architecture interviews
  2. How Elasticsearch works under the hood — important for infra-heavy roles, especially at cloud companies

Basic Concepts

The important concepts from a client perspective are documents, indices, mappings, and fields.

Documents

Documents are the individual units of data you're searching over. Think of them as JSON objects:

{
  "id": "XYZ123",
  "title": "The Great Gatsby",
  "author": "F. Scott Fitzgerald",
  "price": 10.99,
  "createdAt": "2024-01-01T00:00:00.000Z"
}

Indices

An index is a collection of documents. Each document is associated with a unique ID and a set of fields. Think of an index as a database table. Searches happen against indices and return matching document results.

Note: This terminology overloads the general term "index" used for auxiliary data structures that make searches faster.

Mappings and Fields

A mapping is the schema of the index. It defines the fields, their data types, and how they are processed and indexed.

{
  "properties": {
    "id": { "type": "keyword" },
    "title": { "type": "text" },
    "author": { "type": "text" },
    "price": { "type": "float" },
    "createdAt": { "type": "date" }
  }
}

Key distinctions:

  • keyword — treated as a whole value (think: hash table lookup)
  • text — tokenized for full-text search (think: inverted index)

Performance implication: If you include lots of fields in your mapping that aren't used in search, this increases memory overhead. If a User object has 10 fields but you only search by 2, mapping all 10 wastes memory.


Basic Operations

Elasticsearch has a clean REST API for all operations.

Create an Index

PUT /books
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}

Set a Mapping

PUT /books/_mapping
{
  "properties": {
    "title": { "type": "text" },
    "author": { "type": "keyword" },
    "description": { "type": "text" },
    "price": { "type": "float" },
    "publish_date": { "type": "date" },
    "categories": { "type": "keyword" },
    "reviews": {
      "type": "nested",
      "properties": {
        "user": { "type": "keyword" },
        "rating": { "type": "integer" },
        "comment": { "type": "text" }
      }
    }
  }
}

Nested vs separate index: If reviews are infrequently updated and frequently queried, nest them within book documents. Otherwise, create a separate index. This is analogous to the normalization/denormalization tradeoff in SQL databases.

Add Documents

POST /books/_doc
{
  "title": "The Great Gatsby",
  "author": "F. Scott Fitzgerald",
  "description": "A novel about the American Dream in the Jazz Age",
  "price": 9.99,
  "publish_date": "1925-04-10",
  "categories": ["Classic", "Fiction"],
  "reviews": [
    { "user": "reader1", "rating": 5, "comment": "A masterpiece!" },
    { "user": "reader2", "rating": 4, "comment": "Beautifully written, but a bit sad." }
  ]
}

Response includes a _version field used for atomic updates:

{
  "_index": "books",
  "_id": "kLEHMYkBq7V9x4qGJOnh",
  "_version": 1,
  "result": "created",
  "_shards": { "total": 2, "successful": 1, "failed": 0 }
}

Updating Documents

Full replacement:

PUT /books/_doc/kLEHMYkBq7V9x4qGJOnh
{ ... full document ... }

Optimistic concurrency control: only update if the version matches (newer releases prefer if_seq_no/if_primary_term; the version parameter is deprecated for internal versioning):

PUT /books/_doc/kLEHMYkBq7V9x4qGJOnh?version=1
{ ... }

Partial update:

POST /books/_update/kLEHMYkBq7V9x4qGJOnh
{
  "doc": { "price": 14.99 }
}

Basic Match Query

GET /books/_search
{
  "query": {
    "match": { "title": "Great" }
  }
}

Boolean Query (AND conditions)

GET /books/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "Great" } },
        { "range": { "price": { "lte": 15 } } }
      ]
    }
  }
}
Nested Query (nested objects)

GET /books/_search
{
  "query": {
    "nested": {
      "path": "reviews",
      "query": {
        "bool": {
          "must": [
            { "match": { "reviews.comment": "excellent" } },
            { "range": { "reviews.rating": { "gte": 4 } } }
          ]
        }
      }
    }
  }
}

Results include document IDs, relevance scores (_score), and source documents.


Geospatial Search

Elasticsearch supports two primary geospatial field types:

Type | Use Case
---- | --------
geo_point | Single lat/lon pair (restaurant locations, user check-ins)
geo_shape | Arbitrary geometries (delivery zones, city boundaries)

Mapping:

{
  "properties": {
    "name": { "type": "text" },
    "location": { "type": "geo_point" },
    "delivery_zone": { "type": "geo_shape" }
  }
}

Distance query:

GET /restaurants/_search
{
  "query": {
    "geo_distance": {
      "distance": "5km",
      "location": { "lat": 40.7128, "lon": -74.0060 }
    }
  }
}

Under the hood, Elasticsearch uses geohashes, BKD trees (variant of k-d trees optimized for block storage), and R-tree-like structures for fast geospatial queries across millions of documents.
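A geo_distance filter ultimately answers a distance predicate per document. A toy Python equivalent using the haversine formula shows the semantics (not how Lucene's BKD trees actually execute it, which avoids scanning every document):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # Earth radius ~6371 km

def within_radius(docs, lat, lon, km):
    """Toy equivalent of a geo_distance filter over geo_point docs."""
    return [d for d in docs if haversine_km(lat, lon, d["lat"], d["lon"]) <= km]
```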


Sorting

Basic Sort

GET /books/_search
{
  "sort": [
    { "price": "asc" },
    { "publish_date": "desc" }
  ],
  "query": { "match_all": {} }
}

Script-Based Sort

GET /books/_search
{
  "sort": [{
    "_script": {
      "type": "number",
      "script": { "source": "doc['price'].value * 0.9" },
      "order": "asc"
    }
  }],
  "query": { "match_all": {} }
}

Nested Field Sort

GET /books/_search
{
  "sort": [{
    "reviews.rating": {
      "order": "desc",
      "mode": "max",
      "nested": { "path": "reviews" }
    }
  }],
  "query": { "match_all": {} }
}

Relevance-Based Sorting

Without a specified sort, Elasticsearch sorts by relevance score (_score). The default scoring algorithm is BM25, a refinement of TF-IDF (Term Frequency-Inverse Document Frequency).


Pagination

From/Size Pagination

GET /my_index/_search
{
  "from": 0,
  "size": 10,
  "query": { "match": { "title": "elasticsearch" } }
}

Limitation: Becomes inefficient beyond ~10,000 results — the cluster must retrieve and sort all preceding documents on each request.

Search After (Keyset Pagination)

More efficient for deep pagination. Uses sort values of the last result as the starting point:

GET /my_index/_search
{
  "size": 10,
  "query": { "match": { "title": "elasticsearch" } },
  "sort": [
    { "date": "desc" },
    { "_id": "desc" }
  ],
  "search_after": [1463538857, "654323"]
}

Pros: Efficient for deep pagination, no duplicates
Cons: Forward-only, requires client-side state, may miss prior-page updates
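Keyset pagination can be simulated in plain Python to see why it stays cheap at depth: each page filters strictly past the last sort values instead of skipping N documents (toy model, descending (date, id) sort):

```python
def search_after_page(docs, size, after=None):
    """Simulate search_after: sort desc by (date, id) and resume
    strictly after the last-seen sort values."""
    ordered = sorted(docs, key=lambda d: (d["date"], d["id"]), reverse=True)
    if after is not None:
        # Keep only docs strictly past the cursor in sort order
        ordered = [d for d in ordered if (d["date"], d["id"]) < after]
    page = ordered[:size]
    next_after = (page[-1]["date"], page[-1]["id"]) if page else None
    return page, next_after
```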

Cursors (Point in Time / PIT)

Provides a consistent snapshot of data across paginated requests:

POST /my_index/_pit?keep_alive=1m          # Create PIT → returns PIT ID

GET /_search                                # Use PIT in search
{
  "size": 10,
  "pit": { "id": "46To...", "keep_alive": "1m" },
  "sort": [{ "_score": "desc" }, { "_id": "asc" }],
  "search_after": [1.0, "1234"]
}

DELETE /_pit                                # Close PIT when done
{ "id": "46To..." }

How Elasticsearch Works Under the Hood

Elasticsearch = high-level orchestration framework for Apache Lucene (the optimized low-level search library). Elasticsearch handles distributed systems (cluster coordination, APIs, aggregations, real-time capabilities) while Lucene handles the "heart" of search.

Node Types

| Node Type | Responsibility |
| --- | --- |
| Master | Cluster coordination — add/remove nodes, create/delete indices |
| Data | Store data and execute search operations |
| Coordinating | Receive client requests, distribute to appropriate nodes, merge results |
| Ingest | Data transformation and preparation for indexing |
| Machine Learning | ML tasks |

Each node can serve multiple roles. Sophisticated deployments dedicate hosts to a single role (e.g. CPU-heavy ingest nodes, memory-heavy data nodes).

Cluster startup: Seed nodes (master-eligible) perform leader election. Only one active master at a time; others on standby.

Data Nodes — The Search Engine

Data nodes separate raw _source data from Lucene indexes used in search. Requests proceed in two phases:

  1. Query phase — identify relevant documents using optimized index structures
  2. Fetch phase — retrieve the full documents (_source) for the top matching IDs from the shards that hold them

Hierarchy: Index → Shards → Lucene Indexes → Lucene Segments

  • Shards split data across hosts for parallel search
  • Replicas provide HA and extra read throughput (each replica copy can serve searches in parallel)
  • Elasticsearch shards are 1:1 with Lucene indexes

Lucene Segments — Immutable Architecture

Segments are immutable containers of indexed data.

| Operation | How It Works |
| --- | --- |
| Insert | Added to buffer, flushed as a new segment |
| Delete | Segment maintains a deleted-IDs set; cleaned up on merge |
| Update | Soft-delete old document + insert new one; cleaned up on merge |
| Merge | Multiple segments combined into one, deleted docs cleaned up |

Benefits of immutability:

  • Fast writes (no modification of existing segments)
  • Safe caching (no consistency worries)
  • Simplified concurrency (data doesn't change mid-query)
  • Easier crash recovery
  • Better compression
  • Faster searches

Tradeoff: Updates have worse performance than inserts due to soft-deletion bookkeeping. This is why Elasticsearch isn't ideal for rapidly updating data.
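The insert/update/merge lifecycle above can be sketched with a toy Python model (dicts stand in for on-disk segment files; this is an illustration of the bookkeeping, not Lucene's actual implementation):

```python
# Sketch: immutable segments with per-segment soft deletes and a merge pass.

class SegmentStore:
    def __init__(self):
        self.segments = []  # list of (docs: dict, deleted: set) pairs

    def flush(self, buffered_docs):
        # Insert: the in-memory buffer becomes a brand-new immutable segment.
        self.segments.append((dict(buffered_docs), set()))

    def update(self, doc_id, body):
        # Update = soft-delete the old copy in whichever segment holds it,
        # then write the new copy into a fresh segment.
        for docs, deleted in self.segments:
            if doc_id in docs:
                deleted.add(doc_id)
        self.flush({doc_id: body})

    def merge(self):
        # Merge: rewrite all live docs into one segment; tombstones vanish.
        live = {}
        for docs, deleted in self.segments:
            for doc_id, body in docs.items():
                if doc_id not in deleted:
                    live[doc_id] = body
        self.segments = [(live, set())]

    def get(self, doc_id):
        # Newest segment wins, skipping soft-deleted copies.
        for docs, deleted in reversed(self.segments):
            if doc_id in docs and doc_id not in deleted:
                return docs[doc_id]
        return None

store = SegmentStore()
store.flush({"1": "v1", "2": "v1"})
store.update("1", "v2")   # two segments now; doc 1's old copy is tombstoned
store.merge()             # back to one segment, tombstone gone
```

Notice that `update` never touches existing segment contents, which is exactly where the update-vs-insert cost asymmetry comes from.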

Inverted Index

The heart of Lucene. Maps content (words, numbers) to document locations.

Example: The word "lazy" maps to documents #12 and #53. Instead of scanning all 1 billion documents (O(n)), look up the inverted index for O(1) retrieval.
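A minimal sketch of such an index in Python, with an AND query resolved by intersecting posting lists (documents and tokens are made up):

```python
# Sketch: a toy inverted index — token → sorted posting list of doc IDs —
# and a two-term AND query answered by intersecting posting lists.

from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Crude analysis: lowercase + whitespace split.
        for token in text.lower().split():
            index[token].add(doc_id)
    return {tok: sorted(ids) for tok, ids in index.items()}

docs = {
    12: "the quick brown fox is lazy",
    53: "lazy sunday afternoon",
    99: "fox hunting season",
}
index = build_inverted_index(docs)

# "lazy" AND "fox": intersect two small lists instead of scanning every doc.
hits = sorted(set(index["lazy"]) & set(index["fox"]))
```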

Doc Values

Columnar, contiguous representation of a single field across all documents in a segment. Used for sorting and aggregations.

  • Inverted index → "which documents match?"
  • Doc values → "what are the values for sorting/aggregation?"
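A quick Python illustration of the row-wise vs column-wise distinction (data is invented for the example):

```python
# Sketch: the same documents laid out row-wise (like _source) and
# column-wise (like doc values). Sorting and aggregating read one column only.

rows = [
    {"id": 1, "title": "ES in Action", "price": 30},
    {"id": 2, "title": "Lucene Guide", "price": 45},
    {"id": 3, "title": "Search 101", "price": 25},
]

# Columnar projection of a single field across the segment:
price_column = [r["price"] for r in rows]

# Aggregation (avg) and sort order come straight from the column,
# without touching titles or any other field.
avg_price = sum(price_column) / len(price_column)
sort_order = sorted(range(len(rows)), key=lambda i: price_column[i])
```

The contiguous column is why aggregations over millions of documents stay fast: the engine streams one tightly packed array per field rather than parsing whole documents.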

Coordinating Nodes — Query Planning

The query planner determines the most efficient execution strategy by keeping statistics on field types, keyword popularity, and document lengths.

Example: Searching for "bill nye" — "bill" has millions of entries, "nye" has hundreds. The planner starts with the smaller set ("nye") to minimize work.


Using Elasticsearch in Your Interview

When to use it:

  • Complex search requirements (full-text, faceting, ranking, geospatial)
  • Read-heavy workloads at scale
  • Typically attached via Change Data Capture (CDC) to an authoritative store (Postgres, DynamoDB)

When NOT to use it:

  • As your primary database (consistency/durability issues)
  • Write-heavy systems (rapidly updating fields like like-counts)
  • Small datasets (< 100k documents) — a simple DB query may suffice
  • When you can't tolerate eventual consistency

Key principles:

  • Denormalize data for efficient search queries
  • Keep Elasticsearch in sync with underlying data (drift is a common bug source)
  • Aim for results from 1-2 queries

Lessons from Elasticsearch for System Design

  1. Immutability — enhances caching, compression, and eliminates synchronization issues
  2. Separation of concerns — query execution (coordinating nodes) vs storage (data nodes) can be optimized independently
  3. Indexing strategies — inverted indexes for search, doc values for sorting/aggregation; structure data for common query patterns
  4. Distributed trade-offs — scalability and fault tolerance come with consistency complexity (CAP theorem)
  5. Efficient data structures — skip lists, finite state transducers, BKD trees show how tailored structures dramatically improve specific use cases

Additional Interview Questions & Answers

Q6: What is an inverted index and why is it crucial for Elasticsearch?

Answer: An inverted index is a data structure that maps content (words, tokens) to the documents that contain them — the reverse of a normal index that maps documents to their content.

How it works:

  1. During indexing, each document's text fields are analyzed (tokenized, lowercased, stemmed, etc.)
  2. Each unique token is stored as a key, with a posting list of document IDs as the value
  3. At search time, the query is analyzed the same way, and the matching posting lists are retrieved and intersected

Example:

"lazy" → [doc #12, doc #53]
"fox"  → [doc #12, doc #99, doc #150]

Searching for "lazy fox" intersects both lists → doc #12.

Why it's crucial:

  • Turns O(n) full-text scan into O(1) lookup per term
  • Enables boolean queries (AND/OR/NOT) via set operations on posting lists
  • Supports phrase queries, proximity searches, and fuzzy matching
  • Each Lucene segment maintains its own inverted index

Q7: Explain the difference between keyword and text field types. When would you use each?

Answer:

| Aspect | text | keyword |
| --- | --- | --- |
| Analysis | Tokenized, lowercased, stemmed | Stored as-is, no analysis |
| Use case | Full-text search ("find books about cats") | Exact match, filtering, sorting, aggregations |
| Index structure | Inverted index with analyzed tokens | Inverted index with exact values |
| Examples | Book descriptions, comments, articles | IDs, email addresses, status codes, categories |

Common pattern: Use both via multi-fields:

{
  "title": {
    "type": "text",
    "fields": {
      "raw": { "type": "keyword" }
    }
  }
}

This lets you full-text search on title and sort/aggregate on title.raw.
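A rough Python analogy for the two mappings, using a crude lowercase-and-split analyzer as a stand-in for real analysis (real analyzers also stem, strip stop words, etc.):

```python
# Sketch: what ends up in the index for a text field vs a keyword sub-field.

def analyze_text(value):
    # `text` mapping: tokenized and lowercased before indexing.
    return value.lower().split()

def analyze_keyword(value):
    # `keyword` mapping: stored as a single exact term.
    return [value]

title = "Elasticsearch in Action"
text_terms = analyze_text(title)       # matches partial, case-insensitive queries
keyword_terms = analyze_keyword(title) # matches only the exact original string
```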


Q8: How does Elasticsearch handle consistency? What are the implications?

Answer: Elasticsearch is eventually consistent by design.

Write path:

  1. Document is written to the primary shard's translog (write-ahead log)
  2. Document is buffered in memory
  3. Every ~1 second (the configurable refresh_interval), a refresh writes the buffer to a new Lucene segment — only then does the document become searchable
  4. Periodically, segments are fsynced to disk

Implications:

  • There's a ~1 second delay between writing a document and being able to search for it
  • GET /index/_doc/{id} is real-time (reads from translog), but search is not
  • You can force a refresh with POST /index/_refresh but this is expensive at scale
  • Replica synchronization adds further delay

Interview tip: Always acknowledge this gap. If the user writes a review and immediately searches for it, they might not see it. Solutions include:

  • Read-your-own-writes at the application layer
  • Using the ?refresh=wait_for parameter (with caution)
  • Returning the written document directly without searching
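The visibility gap can be sketched as follows (a toy model; real Elasticsearch serves realtime GETs from the translog, and refresh fires on a timer rather than on demand):

```python
# Sketch: near-real-time search — writes land in an in-memory buffer and only
# become searchable after a refresh, while GET-by-ID sees them immediately.

class NearRealTimeIndex:
    def __init__(self):
        self.buffer = {}        # indexed but not yet searchable
        self.searchable = {}    # visible to search after refresh

    def index(self, doc_id, body):
        self.buffer[doc_id] = body

    def get(self, doc_id):
        # Realtime GET: checks the buffer first, like the translog path.
        return self.buffer.get(doc_id, self.searchable.get(doc_id))

    def search(self, term):
        return [i for i, body in self.searchable.items() if term in body]

    def refresh(self):
        # refresh_interval fires: the buffer becomes a searchable segment.
        self.searchable.update(self.buffer)
        self.buffer.clear()

idx = NearRealTimeIndex()
idx.index("r1", "great pizza review")
found_by_get = idx.get("r1")            # visible immediately
found_by_search = idx.search("pizza")   # empty: not yet refreshed
idx.refresh()
found_after_refresh = idx.search("pizza")
```

This is the exact write-a-review-then-search scenario from the tip above: the GET path sees the document, the search path does not until a refresh.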

Q9: What is the difference between from/size, search_after, and PIT-based pagination?

Answer:

| Method | Mechanism | Deep Pagination | Consistency | Random Access |
| --- | --- | --- | --- | --- |
| from/size | Skip + limit | Poor (O(from+size) per request) | None | Yes |
| search_after | Keyset after last result's sort values | Good (only fetches next page) | None (data can shift) | No (forward-only) |
| PIT + search_after | Snapshot + keyset | Good | Yes (frozen view) | No (forward-only) |

When to use each:

  • from/size — simple UIs with < 10k results, where users rarely go past page 5
  • search_after — infinite scroll, large result sets, acceptable if underlying data shifts
  • PIT + search_after — export/ETL jobs, audit trails, or any case requiring a consistent snapshot

Q10: How do shards work and how do you decide the number of shards?

Answer:

How shards work:

  • An index is split into N shards (set at index creation, hard to change later)
  • Each shard is a complete Lucene index
  • Documents are routed to shards via hash(_routing) % num_primary_shards (the routing value defaults to the document's _id)
  • Searches query all shards in parallel; coordinating node merges results
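A sketch of the routing formula in Python, with CRC32 standing in for the Murmur3 hash Elasticsearch actually uses:

```python
# Sketch: document-to-shard routing via a stable hash of the routing key.

import zlib

def shard_for(routing_key: str, num_shards: int) -> int:
    # hash(routing_key) % num_shards — same shape as the default routing.
    return zlib.crc32(routing_key.encode()) % num_shards

num_shards = 3
placement = {doc_id: shard_for(doc_id, num_shards)
             for doc_id in ["a", "b", "c", "d"]}
```

This formula is also why the shard count is fixed at index creation: changing `num_shards` would re-route nearly every existing document to a different shard.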

Sizing guidelines:

| Factor | Recommendation |
| --- | --- |
| Shard size | 10–50 GB per shard (sweet spot) |
| Shard count per node | At most ~20 shards per GB of JVM heap |
| Too few shards | Can't parallelize, single node bottleneck |
| Too many shards | Per-shard overhead (memory, file handles, thread pools), slow cluster state updates |

Practical approach:

  1. Estimate total data size
  2. Target 20–40 GB per shard
  3. Ensure shard count allows distribution across nodes
  4. For time-series data, use index-per-time-period (daily/weekly) with ILM policies
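The practical steps above as back-of-envelope arithmetic (all numbers illustrative):

```python
# Sketch: shard sizing for a hypothetical 2 TB index.
import math

total_data_gb = 2000          # step 1: estimated index size (2 TB)
target_shard_gb = 30          # step 2: aim for ~30 GB per shard
primaries = math.ceil(total_data_gb / target_shard_gb)

replicas = 1                  # one replica copy of each primary
total_shards = primaries * (1 + replicas)

data_nodes = 6                # step 3: check per-node distribution
shards_per_node = math.ceil(total_shards / data_nodes)
```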

Q11: How would you keep Elasticsearch in sync with a primary database?

Answer: The standard approach is Change Data Capture (CDC):

Primary DB → CDC (e.g., Debezium) → Message Queue (Kafka) → ES Consumer → Elasticsearch

Options ranked by reliability:

| Approach | Pros | Cons |
| --- | --- | --- |
| CDC + Kafka | Reliable, ordered, replayable | Operational complexity |
| Dual writes | Simple | Risk of inconsistency if one write fails |
| Application-level events | Flexible | Tight coupling, easy to miss updates |
| Periodic batch sync | Simple, self-healing | Stale data between syncs |

Handling drift:

  • Periodic reconciliation jobs compare DB and ES counts/checksums
  • Use version fields or timestamps to detect and fix inconsistencies
  • Implement dead letter queues for failed indexing attempts
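A minimal reconciliation pass using the version-field approach above might look like this (dicts stand in for paged reads from both stores):

```python
# Sketch: detect drift between the primary DB and the search index by
# comparing per-document version numbers.

def find_drift(db_rows, es_docs):
    """Return doc IDs that are missing, stale, or orphaned in the index."""
    drift = []
    for doc_id, row in db_rows.items():
        es = es_docs.get(doc_id)
        if es is None or es["version"] < row["version"]:
            drift.append(doc_id)          # missing or stale in ES
    # Docs deleted from the DB but still present in ES are drift too.
    drift += [doc_id for doc_id in es_docs if doc_id not in db_rows]
    return sorted(drift)

db = {"1": {"version": 3}, "2": {"version": 1}}
es = {"1": {"version": 2}, "3": {"version": 1}}  # "1" stale, "2" missing, "3" orphaned
drift = find_drift(db, es)
```

A real job would then re-index (or delete) exactly the drifted IDs rather than rebuilding the whole index.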

Interview tip: Always mention that Elasticsearch should NOT be the source of truth. The primary database is authoritative; ES is a read-optimized projection.


Q12: Explain TF-IDF and how Elasticsearch uses it for relevance scoring.

Answer:

TF-IDF = Term Frequency × Inverse Document Frequency

| Component | Formula | Meaning |
| --- | --- | --- |
| TF | count(term in doc) / total terms in doc | How often the term appears in this document |
| IDF | log(total docs / docs containing term) | How rare the term is across all documents |
| TF-IDF | TF × IDF | Terms that are frequent in a document but rare overall score highest |

Example: Searching for "elasticsearch" across 1M documents:

  • A doc mentioning "elasticsearch" 10 times scores high TF
  • If only 1000 docs mention "elasticsearch", IDF is high
  • A doc mentioning "the" 50 times has high TF but very low IDF (common word)
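The formulas from the table, computed directly over a toy corpus (real BM25 scoring adds saturation and length normalization on top of this):

```python
# Sketch: classic TF-IDF over tokenized documents.
import math

def tf(term, doc_tokens):
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [
    ["elasticsearch", "guide", "elasticsearch"],
    ["the", "cat", "sat"],
    ["the", "search", "book"],
]
# Rare term, frequent in its doc: high TF and high IDF.
score_rare = tf_idf("elasticsearch", corpus[0], corpus)
# Common term: low IDF drags the score down.
score_common = tf_idf("the", corpus[1], corpus)
```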

Elasticsearch specifics:

  • Modern versions use BM25 (an evolution of TF-IDF) as the default
  • BM25 adds saturation (diminishing returns for repeated terms) and document length normalization
  • Scores can be customized via function_score, boost, or custom similarity plugins

Q13: When would you use Elasticsearch versus a relational database?

Answer:

| Aspect | Elasticsearch | Relational DB (e.g., Postgres) |
| --- | --- | --- |
| Data model | Denormalized documents (JSON) | Normalized tables with relations |
| Search | Built-in full-text, fuzzy, geo, semantic | Full-text possible but bolt-on (tsvector) |
| Scoring/Ranking | Native relevance scoring (BM25) | Manual implementation |
| Joins | Expensive/limited (nested, parent-child) | Native and optimized |
| Consistency | Eventual | Strong (ACID) |
| Write performance | Moderate (immutable segments, merges) | Good (in-place updates) |
| Aggregations | Very fast (doc values, columnar) | Slower at scale (row-oriented) |
| Schema | Flexible (dynamic mapping) | Rigid (migrations) |

Rule of thumb: Use a relational DB as your source of truth. Add Elasticsearch when you need full-text search, complex filtering/faceting, or relevance ranking at scale that your DB can't handle efficiently.


Q14: How does Elasticsearch handle node failures and data recovery?

Answer:

Detection:

  • Nodes send heartbeats to the master node
  • If a node misses heartbeats for a configurable timeout, the master marks it as failed

Recovery — Data node failure:

  1. Master identifies which shards were on the failed node
  2. For shards that have replicas on other nodes, a replica is promoted to primary
  3. New replicas are allocated to other nodes to restore the replication factor
  4. The new replicas are built by copying data from the primary shard

Recovery — Master node failure:

  1. Master-eligible nodes detect the master is gone
  2. A new master is elected from the remaining master-eligible nodes
  3. The new master rebuilds the cluster state

Split-brain prevention:

  • Elasticsearch requires a quorum of master-eligible nodes to elect a master
  • The discovery.zen.minimum_master_nodes setting (removed in 7.x, where quorum is managed automatically) prevents two halves of a cluster from independently electing masters
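The quorum rule in miniature (a sketch of the majority arithmetic, not the actual discovery protocol):

```python
# Sketch: quorum-based master election. A partition can only elect a master
# if it holds a strict majority of the master-eligible nodes.

def quorum(master_eligible: int) -> int:
    return master_eligible // 2 + 1

def can_elect(partition_size: int, master_eligible: int) -> bool:
    return partition_size >= quorum(master_eligible)

# A 3-node cluster split 2/1: only the 2-node side can elect a master,
# so the two halves can never elect masters simultaneously.
```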

Data durability:

  • The translog (write-ahead log) ensures that acknowledged writes survive node restarts
  • On recovery, unfinished operations are replayed from the translog
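Translog replay can be sketched as a write-ahead log applied on top of the last committed segments (sequence numbers and operations here are illustrative):

```python
# Sketch: rebuilding the live view after a crash — committed segment contents
# plus any translog operations newer than the last durable commit.

def replay(committed_docs, translog, last_committed_seq):
    docs = dict(committed_docs)
    for seq, op, doc_id, body in translog:
        if seq <= last_committed_seq:
            continue               # already durable in a segment
        if op == "index":
            docs[doc_id] = body
        elif op == "delete":
            docs.pop(doc_id, None)
    return docs

committed = {"a": "v1"}            # what made it into fsynced segments
log = [
    (1, "index", "a", "v1"),       # already committed
    (2, "index", "b", "v1"),       # acknowledged but not yet in a segment
    (3, "delete", "a", None),      # acknowledged delete
]
recovered = replay(committed, log, last_committed_seq=1)
```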

Q15: Design a search system for a food delivery app (like Uber Eats / DoorDash). What role does Elasticsearch play?

Answer:

Requirements: Search restaurants by name, cuisine, location, rating, delivery time, price range.

Architecture:

User → API Gateway → Search Service → Elasticsearch
                                    ↑
Restaurant DB → CDC (Debezium) → Kafka → ES Indexer

Elasticsearch index mapping:

{
  "properties": {
    "name": { "type": "text" },
    "cuisine": { "type": "keyword" },
    "location": { "type": "geo_point" },
    "rating": { "type": "float" },
    "avg_delivery_time_min": { "type": "integer" },
    "price_range": { "type": "keyword" },
    "is_open": { "type": "boolean" },
    "menu_items": {
      "type": "nested",
      "properties": {
        "name": { "type": "text" },
        "price": { "type": "float" }
      }
    }
  }
}

Example query — "Pizza near me, rating > 4, within 3km":

{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "pizza" } },
        { "range": { "rating": { "gte": 4.0 } } },
        { "geo_distance": { "distance": "3km", "location": { "lat": 40.71, "lon": -74.00 } } },
        { "term": { "is_open": true } }
      ]
    }
  },
  "sort": [
    { "_geo_distance": { "location": { "lat": 40.71, "lon": -74.00 }, "order": "asc" } },
    { "rating": "desc" }
  ]
}
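The geo_distance filter above conceptually reduces to a great-circle distance check; a sketch using the haversine formula (coordinates are illustrative):

```python
# Sketch: filtering restaurants within 3 km of the user with haversine.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

user = (40.71, -74.00)
restaurants = {
    "near": (40.72, -74.01),    # roughly 1.4 km away
    "far": (40.80, -74.20),     # well outside 3 km
}
within_3km = [name for name, (lat, lon) in restaurants.items()
              if haversine_km(*user, lat, lon) <= 3.0]
```

In practice Elasticsearch answers this with BKD-tree indexed geo_point data rather than scanning candidates, but the distance predicate is the same.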

Key design decisions:

  • Denormalize restaurant + menu data into one index for single-query results
  • Use CDC to keep ES in sync with the restaurant database
  • is_open updated via a lightweight scheduled job (not full CDC)
  • Hot key mitigation — popular restaurants cached at the application layer
  • Pagination — search_after for infinite scroll

References