Database Tradeoffs

SQL vs NoSQL

| Aspect | SQL (Relational) | NoSQL | When to Use |
| --- | --- | --- | --- |
| Data Model | Structured, fixed schema with relations | Flexible schema: document, key-value, graph | SQL for complex relationships; NoSQL for flexible, rapidly changing data |
| ACID Compliance | Strong ACID guarantees | Often eventual consistency (some offer ACID) | SQL for financial transactions; NoSQL when high throughput matters and eventual consistency is OK |
| Scalability | Vertical scaling (scale up); horizontal scaling adds sharding complexity | Horizontal scaling built in | SQL for moderate scale; NoSQL for massive scale-out needs |
| Query Flexibility | Rich SQL queries, joins, aggregations | Limited queries; optimized for specific access patterns | SQL for ad-hoc analytics; NoSQL for known access patterns |
| Performance | Slower writes due to ACID overhead; optimized reads with indexes | Fast writes; read performance depends on data model | SQL for read-heavy workloads with complex queries; NoSQL for write-heavy workloads |
| Examples | MySQL, PostgreSQL, Oracle | MongoDB, Cassandra, DynamoDB, Redis | |

Why Choose SQL:

  • Need strong consistency and ACID transactions (banking, e-commerce orders)
  • Complex queries with multiple joins (reporting, analytics)
  • Well-defined schema that doesn't change frequently
  • Team expertise with SQL

Why Choose NoSQL:

  • Need to scale horizontally to handle massive traffic (social media feeds, IoT)
  • Schema flexibility for rapidly evolving features (startups, prototyping)
  • Simple key-value or document lookups (user profiles, session storage)
  • Geographic distribution with eventual consistency acceptable

Tradeoff Summary:

  • SQL: Consistency + Complex Queries ↔ Harder to scale horizontally
  • NoSQL: Scalability + Flexibility ↔ Limited query capabilities, eventual consistency

Common Databases: When to Use What

| DB | Model | Strengths | Best For | Watchouts |
| --- | --- | --- | --- | --- |
| DynamoDB | Managed key-value/document | Serverless, predictable latency, auto-scaling, TTLs | Massive-scale key-value lookups, bursty traffic, IoT/session/user settings | Hot partitions cost $$; limited query patterns; single-table design learning curve |
| Cassandra | Wide-column (AP) | Linear write scaling, multi-region friendly, tunable consistency | Write-heavy time-series/logs/metrics, large append-only datasets | Eventual consistency by default; partition-key-centric modeling; tombstone/GC tuning |
| PostgreSQL | Relational | Strong ACID, rich SQL/JSONB, extensions (PostGIS/FDW), mature tooling | OLTP with complex queries/joins, moderate-scale analytics, transactional workloads | Vertical limits; sharding/replica management needed at very high scale; tune vacuum/autovacuum |
| MySQL | Relational | Battle-tested, easy ops, good replication, wide ecosystem | Simpler OLTP, LAMP stacks, read-heavy with replicas | Manual sharding at scale; fewer advanced SQL features than Postgres; replica lag considerations |
| MongoDB | Document | Flexible schema, rich queries and indexes, good developer velocity | Evolving schemas (content/user profiles), nested documents, moderate real-time reads | Data integrity relies on schema discipline; multi-document transactions are newer; large/hot documents hurt |
| Neo4j | Graph | Native graph traversals, expressive Cypher queries | Highly connected data (social, fraud rings, recommendations, network topology) | Not optimized for wide OLTP; different scaling model; specialized skill set/licensing |
| Other quick picks | Various | Redis: ultra-fast cache/leaderboards; Snowflake/BigQuery: analytics; Elasticsearch/OpenSearch: search/logs | Use per specialized need | Each adds infra/ops cost; keep surface area small |

How to decide quickly: Start with PostgreSQL or MySQL for most OLTP apps; pick DynamoDB/Cassandra when horizontal write scaling and simple access patterns dominate; choose MongoDB for flexible schemas without heavy joins; reach for graph (Neo4j) only when relationships are the core query.


Geospatial/Location Query Tradeoffs

| Solution | Technology Stack | Strengths | Best For | Watchouts |
| --- | --- | --- | --- | --- |
| PostgreSQL + PostGIS | SQL + spatial extension | Powerful spatial functions (within, distance, nearest); supports complex geometries (polygons, multipolygons); ACID transactions; excellent for analytics | Ride-sharing (nearby drivers), location filtering with complex queries, geographic analytics, multi-criteria location searches | Scales vertically; sharding complexity increases; slower on massive datasets; slower than specialized geo DBs for simple queries |
| Redis Geo commands | In-memory key-value with geo support | Ultra-fast lookups (O(log N)); perfect for leaderboards/nearby lists; built-in sorting by distance; simple radius/bbox queries | Real-time "nearby" queries (food delivery partners, rideshare), leaderboards with location, sessions with geo data | Limited query complexity; durability requires RDB/AOF configuration; doesn't handle complex polygons; re-indexing required on updates; memory-intensive |
| MongoDB + geospatial indexes | Document DB with 2dsphere/2d indexes | Flexible schema with geo data; supports complex geometries (multipoint, polygon); good scaling via sharding; TTL indexes for sessions | Location-based social features (check-ins), user discovery within a region, delivery-zone filtering with a flexible data model | Slower than Redis for simple queries; index memory overhead; aggregation pipelines for complex geo + filters can be expensive; 2dsphere slower than 2d for simple radius |
| Elasticsearch/OpenSearch | Search engine with geo queries | Fast range/radius searches; excellent for combining location + full-text search (e.g., "restaurants near me"); aggregations by geo grid/distance | Location-based search (nearby restaurants), autocomplete with geo filtering, geo-bucketing analytics | Overkill for pure location queries; writes slower than reads; not ACID; large memory footprint; high indexing cost |
| DynamoDB + geo partitioning | Managed NoSQL with custom sharding | Serverless, predictable latency, auto-scaling; good for simple location lookups with careful partition-key design | Mobile apps with bursty location traffic, session management by region, geo-distributed user profiles | Complex geo queries nearly impossible; single-table design learning curve; hot partitions if users cluster in one area; limited range queries; needs pre-computed geohashes |
| S2 Geometry (Google's library) | Geometry library used with any DB | Hierarchical space-filling curves; consistent cell boundaries globally; handles poles/datelines correctly; efficient bit operations | Global tile-based systems, precise location-to-cell mapping, seamless cross-region queries, geographic hierarchies | Not a database; requires implementation effort; steep learning curve; locations must be pre-processed into cells |
| H3 hexagonal index | Hierarchical hexagonal grid (Uber's) | Hexagonal cells (neighbors are equidistant, unlike squares); great for aggregations by region; consistent parent-child hierarchy | Uber-style mapping (hexagonal heat maps), fleet optimization, region-based analytics, disaster-response zones | Pre-processing cost; not native to all databases; larger index than geohash; client-side computation overhead |
| Geohash-based sharding | Custom partitioning strategy | Simple implementation; reduces range-query complexity; can be pre-computed | Distributed systems needing geo-aware sharding, pre-computing user regions, hot-spot mitigation | Approximate (cell boundaries don't match exact distances); requires custom code; ordering by distance not native; cell-boundary edge cases |
| Neo4j with spatial data | Graph DB with spatial support | Graph-native geospatial traversals; relationship-based location logic (friends near you); strong for social/location networks | Social networks with location relationships (friends near me), recommendation systems with geo filters, network topology with locations | Overkill for pure location queries; scaling challenges for massive graphs; spatial queries not as optimized as PostGIS |

Detailed Decision Flowchart: Location Queries

Question 1: How many location queries per second (QPS)?

  • < 100 QPS: PostgreSQL + PostGIS (simplicity, ACID, complex queries)
  • 100-10K QPS: Redis Geo or MongoDB (fast reads, moderate scaling)
  • > 10K QPS: Redis Geo + PostgreSQL (Redis for hot reads, Postgres for cold/complex)

Question 2: Query complexity?

  • Simple radius/nearby: Redis Geo or DynamoDB Geo Partitioning (blazing fast)
  • Radius + filters (distance + category): MongoDB Geo Indexes or Elasticsearch
  • Complex polygons + analytics: PostgreSQL + PostGIS
  • Full-text search + location: Elasticsearch with geo plugin

Question 3: Scale and data distribution?

  • Single region, < 10GB data: PostgreSQL + PostGIS
  • Multi-region, high write volume: DynamoDB Geo Partition (serverless) or Redis Geo + Postgres replication
  • Globally distributed users: S2/H3 geohash + DynamoDB or Cassandra (pre-compute regions)

Question 4: Consistency requirements?

  • Strong consistency critical: PostgreSQL + PostGIS (ACID transactions)
  • Eventual consistency acceptable: Redis Geo (cache layer), MongoDB, Elasticsearch

Question 5: Need real-time updates?

  • Yes (food delivery, rideshare): Redis Geo (TTL for expiring locations) + pub/sub for updates
  • Not critical: PostgreSQL + PostGIS with caching, DynamoDB

Common Location Query Patterns

| Pattern | Best Solution | Why | Example |
| --- | --- | --- | --- |
| Find nearby (radius search) | Redis Geo (GEORADIUS/GEOSEARCH) | O(log N) with distance sorting; <10ms latency | "Nearby restaurants within 5 km" |
| K-nearest neighbors | MongoDB geo aggregation pipeline or PostGIS | Both efficient; MongoDB for flexible schemas, PostGIS for complex geometries | "5 closest Uber drivers" |
| Bounding box (rectangular region) | Redis GEOSEARCH (BYBOX) or PostgreSQL | Redis for speed; PostGIS for accurate geography | "All delivery zones in viewport" |
| Point-in-polygon (user in zone) | PostgreSQL PostGIS ST_Contains | Handles complex polygon boundaries; ACID safe | "Is the delivery address within the service area?" |
| Distance matrix (many-to-many) | PostgreSQL PostGIS or S2 geometry | PostGIS for accuracy; S2 for pre-computation | Route optimization, distances between all depot-customer pairs |
| Geofence (notify on entering a region) | Redis Geo + streams, or DynamoDB geo + Lambda | Redis for hot geofences; DynamoDB for serverless | Geo-triggered notifications, store-crossing alerts |
| Aggregation by region (heatmap) | Elasticsearch geo aggregations or H3 bucketing | Elasticsearch for combined geo + analytics; H3 for custom regions | "Orders by delivery zone", "traffic density map" |
| Hierarchical regions (country→state→city) | PostgreSQL PostGIS + hierarchical tables, or S2/H3 | PostGIS for ACID joins; S2/H3 for cell hierarchies | "All cities in California with average distance" |

Real-World Trade-offs: Examples

Use PostgreSQL + PostGIS when:

  • Complex multi-criteria queries (distance + polygon + filters)
  • ACID consistency critical (financial transactions with locations)
  • Moderate QPS (<1K); a caching layer can absorb hotter reads
  • Team familiar with SQL and spatial extensions
  • Example: Ride-sharing pricing zone verification, property boundary queries
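For intuition on the polygon checks above ("is this address inside the service area?"), here is a minimal planar ray-casting sketch. This is illustrative only: PostGIS's ST_Contains additionally handles holes, multipolygons, and geodesic edges.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test for a simple (non-self-intersecting) polygon.
    Cast a ray from the point toward +x and count edge crossings:
    an odd count means the point is inside."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Edge straddles the ray's height only if endpoints are on
        # opposite sides of the horizontal line through y
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside
```

For a 10x10 square, `point_in_polygon((5, 5), square)` is inside while `(15, 5)` is not; a real service-area check would run the equivalent query in the database so it stays transactional.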

Use Redis Geo when:

  • Simple radius searches dominate (nearby driver/restaurant queries)
  • Real-time, sub-millisecond latency required
  • Can tolerate eventual consistency (cache layer over DB)
  • User locations change frequently (need TTL auto-expire)
  • Example: Uber finding nearby drivers, food delivery "nearby restaurants"
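A naive sketch of what a radius query computes: haversine distance plus a filter, sorted nearest-first. Redis avoids this full scan by indexing geohash-encoded scores in a sorted set, but the result shape is the same. Member names here are made up for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def nearby(members, lat, lon, radius_km):
    """Naive GEORADIUS-style query: filter members within the radius,
    return (name, distance_km) pairs sorted nearest-first."""
    hits = []
    for name, (mlat, mlon) in members.items():
        d = haversine_km(lat, lon, mlat, mlon)
        if d <= radius_km:
            hits.append((name, round(d, 3)))
    return sorted(hits, key=lambda h: h[1])
```

With a driver near the query point, one across town, and one in another city, only the first two come back, nearest first.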

Use MongoDB Geo Indexes when:

  • Flexible schema with location data (user profiles with optional location)
  • Combining location filters with other document queries
  • Moderate QPS with geographic scaling via sharding
  • Example: Dating app "find matches within 20 miles who like hiking"

Use Elasticsearch Geo when:

  • Location search + full-text search together (restaurants near me + name search)
  • Complex aggregations (heatmaps, zone analytics)
  • Accept higher latency (100ms+) for powerful analytics
  • Example: Job search "Software engineer jobs in NYC area"

Use DynamoDB Geo Partitioning when:

  • Serverless, minimal ops requirement
  • Predictable location distribution (geohash bucket design)
  • Can pre-compute regions (avoid hot partitions)
  • Example: Mobile app tracking user sessions by region, IoT device location tracking

Use S2/H3 Geometry Libraries when:

  • Building custom tiling/hexagonal systems
  • Globally accurate cells (poles, datelines matter)
  • Pre-computing expensive work: heatmaps, regional aggregations
  • Example: Uber's hexagon-based fleet optimization, disaster response zone mapping

Read Replicas vs Sharding

| Aspect | Read Replicas | Sharding (Horizontal Partitioning) | When to Use |
| --- | --- | --- | --- |
| Purpose | Scale read throughput | Scale both reads and writes; handle large datasets | Replicas for read-heavy; sharding for write-heavy or large data |
| Write Scaling | No (all writes go to the primary) | Yes (writes distributed across shards) | Shard when write traffic exceeds a single DB's capacity |
| Read Scaling | Yes (distribute reads across replicas) | Yes (each shard handles a subset of reads) | Both scale reads; replicas are simpler for read-only scaling |
| Data Size | Full copy of all data on each replica | Data partitioned across shards | Shard when data doesn't fit on a single node |
| Complexity | Low (just replication setup) | High (partition-key selection, rebalancing, cross-shard queries) | Replicas for simplicity; shard only when scale demands it |
| Consistency | Replication lag (eventually consistent reads) | Depends on implementation; can be strong per shard | Both have consistency tradeoffs |
| Failure Impact | Read capacity reduced; primary still handles writes | An unavailable shard affects a subset of data | Replicas: graceful degradation; sharding: partial outage |

Why Choose Read Replicas:

  • Read:write ratio is 10:1 or higher (news sites, blogs, product catalogs)
  • Need to offload reporting/analytics queries from primary
  • Geographic distribution of read traffic
  • Simple to implement and maintain
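Read/write splitting can be sketched as a tiny router: writes go to the primary, reads round-robin across replicas. The hostnames and the SELECT-prefix heuristic are illustrative; real deployments also have to account for replication lag, e.g. pinning a session to the primary briefly after its own writes.

```python
import itertools

class ReplicaRouter:
    """Route statements to a primary or a rotating set of read replicas.
    A deliberately crude sketch: classification is by statement prefix,
    and replica selection is plain round-robin."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, sql: str) -> str:
        # Anything that isn't a SELECT is assumed to mutate state
        if sql.lstrip().lower().startswith("select"):
            return next(self._replicas)
        return self.primary
```

Usage: `ReplicaRouter("db-primary", ["db-replica-1", "db-replica-2"])` sends every INSERT/UPDATE to `db-primary` and alternates SELECTs across the two replicas.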

Why Choose Sharding:

  • Write throughput exceeds single database capacity (high-frequency trading, IoT ingestion)
  • Dataset size exceeds single server storage (multi-TB datasets)
  • Need to isolate tenant data (multi-tenant SaaS)
  • Combined with replicas for maximum scale

Tradeoff Summary:

  • Replicas: Easy to implement + Read scalability ↔ No write scaling
  • Sharding: Write + Read scalability + Large data ↔ High complexity, cross-shard queries difficult
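The core of sharding is a deterministic key-to-shard mapping that every app server computes identically. A minimal sketch using a stable hash with a modulo scheme (chosen for illustration; real systems prefer consistent hashing or pre-split ranges so that changing the shard count doesn't remap almost every key):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a partition key to a shard index.

    Uses a cryptographic hash rather than Python's built-in hash(),
    which is randomized per process and so would send the same key
    to different shards from different app servers."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Every caller that computes `shard_for("user:12345", 8)` gets the same shard, which is what makes app-side routing possible; the watchout is that moving to 16 shards reshuffles nearly all keys.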

Normalization vs Denormalization

| Aspect | Normalization | Denormalization | When to Use |
| --- | --- | --- | --- |
| Data Redundancy | Minimal (DRY principle) | High (data duplicated) | Normalize to save storage; denormalize for performance |
| Write Performance | Faster (update a single location) | Slower (update multiple locations) | Normalize for write-heavy; denormalize for read-heavy |
| Read Performance | Slower (requires joins) | Faster (pre-joined data) | Denormalize for low-latency reads |
| Data Integrity | Easier (single source of truth) | Harder (risk of inconsistency) | Normalize for critical data accuracy |
| Storage Cost | Lower | Higher | Normalize when storage is expensive; denormalize when reads are expensive |
| Use Case | OLTP (transactional systems) | OLAP (analytics, reporting), NoSQL databases | |

Why Choose Normalization:

  • Data integrity is critical (user accounts, financial records)
  • Write-heavy workload (frequent updates to same entities)
  • Storage costs are high
  • Data changes frequently

Why Choose Denormalization:

  • Read performance is critical (newsfeed, product search)
  • Read:write ratio is very high
  • Willing to sacrifice consistency for speed
  • Pre-computed aggregations needed (dashboards, leaderboards)

Tradeoff Summary:

  • Normalization: Data integrity + Storage efficiency ↔ Slow reads (joins)
  • Denormalization: Fast reads ↔ Data redundancy + Update complexity
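The denormalization step itself is just the join paid once at write time instead of on every read. A sketch with hypothetical field names:

```python
def denormalize_profile(user, profile, friend_ids):
    """Build a read-optimized profile document from normalized rows.

    The join cost is paid here, when the profile changes; every
    profile-page read is then a single document fetch. The price is
    that friends_count must be re-synced whenever friendships change."""
    return {
        "user_id": user["id"],
        "name": profile["name"],
        "bio": profile["bio"],
        "location": profile["location"],
        "friends_count": len(friend_ids),  # pre-computed aggregate
    }
```

A background job (or transactional outbox) typically rebuilds such documents when any source row changes, accepting brief staleness in exchange for join-free reads.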

Data Structures & Indexing Tradeoffs

| Structure | Strengths | Tradeoffs | When to Use |
| --- | --- | --- | --- |
| Bloom filter | Very space-efficient membership checks; O(k) inserts/queries; avoids expensive lookups | False positives; no value retrieval; no deletes unless using counting/quotient variants; must be sized for the target FP rate | Prefiltering before DB/cache lookups, cache-penetration protection, existence checks at massive scale |
| Redis hash (hash map) | Compact storage for many small fields; O(1) field access; atomic field updates | No per-field TTL (until Redis 7.4's HEXPIRE); no secondary indexes or range queries; very large hashes rehash and become memory-heavy | Grouping related attributes under one key (user profiles, counters, feature flags) |
| Trie (prefix tree) | Fast prefix search/autocomplete; lexicographic traversal; O(k) lookups | High memory overhead; pointer-heavy/cache-unfriendly; slower than a hash for random access | Prefix matching, autocomplete, routing tables, dictionary lookups |
| B+ tree | Ordered index; efficient range scans; disk/page-friendly with high fanout | More complex writes (splits/merges); slower point lookups than a hash; page size and fill factor need tuning | Database indexes, range queries, ordered pagination, time-series by key |

Why Choose Bloom Filter:

  • Minimize expensive DB hits for non-existent keys
  • Protect caches from "thundering misses" or abuse
  • Pre-check existence in distributed systems
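A minimal Bloom filter sketch. The sizing here is illustrative; production filters derive the bit count m and hash count k from the expected item count and target false-positive rate (m ≈ -n·ln(p)/ln(2)², k ≈ (m/n)·ln 2).

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k salted hashes over a fixed-size bit array.
    Answers "definitely not present" or "possibly present"."""

    def __init__(self, m_bits: int = 1024, k_hashes: int = 3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # a Python int doubles as an arbitrary-size bit array

    def _positions(self, item: str):
        # Derive k independent-ish positions by salting one hash function
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # False positives possible; false negatives are not
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```

The cache-penetration use above is exactly this check: if `might_contain(key)` is False, skip the database entirely; if True, do the lookup, accepting the occasional wasted read.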

Why Choose Redis Hash:

  • Reduce key overhead vs many small keys
  • Update fields atomically without rewriting whole object
  • Keep related data co-located for cache locality

Why Choose Trie:

  • Prefix search or autocomplete is core query
  • Need ordered, lexicographic traversal
  • Longest-prefix matching (routing, IPs)
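A minimal trie supporting insert and prefix listing, the autocomplete core; nodes are plain dicts, with `None` as the end-of-word marker since no input character can collide with it.

```python
class Trie:
    """Minimal prefix tree: each node maps characters to child nodes."""

    def __init__(self):
        self.root = {}

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True  # terminal marker: a complete word ends here

    def starts_with(self, prefix: str):
        # Walk down to the prefix's subtree: O(len(prefix))
        node = self.root
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        # DFS the subtree, re-attaching the prefix to each completion
        out, stack = [], [(node, prefix)]
        while stack:
            cur, acc = stack.pop()
            for key, child in cur.items():
                if key is None:
                    out.append(acc)
                else:
                    stack.append((child, acc + key))
        return sorted(out)
```

The memory tradeoff from the table is visible here: every character becomes a dict entry, which is why production autocomplete often uses compressed tries or precomputed top-k lists per prefix.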

Why Choose B+ Tree:

  • Range queries, ordered scans, and pagination
  • Disk-based storage where minimizing IO matters
  • Multi-column indexes that benefit from ordering
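A full B+ tree is too long to inline, so here is a sketch of the behavior a B+ tree index provides — ordered keys, O(log n) lookup, cheap in-order range scans — using Python's `bisect` over a sorted list in place of real pages and splits.

```python
import bisect

class OrderedIndex:
    """Stand-in for a B+ tree index: sorted keys plus a row store.
    A real B+ tree keeps keys in high-fanout disk pages and links the
    leaves so range scans are sequential IO; the API is the same idea."""

    def __init__(self):
        self._keys = []   # kept sorted (a real tree splits pages instead)
        self._rows = {}

    def insert(self, key, row):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = row

    def range_scan(self, lo, hi):
        """All rows with lo <= key <= hi, in key order -- the operation
        a hash index cannot do."""
        start = bisect.bisect_left(self._keys, lo)
        end = bisect.bisect_right(self._keys, hi)
        return [(k, self._rows[k]) for k in self._keys[start:end]]
```

`range_scan` is what powers `WHERE ts BETWEEN a AND b` and keyset pagination; the write-side cost shows up in `insert`, which must maintain order on every update.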

Tradeoff Summary:

  • Bloom Filter: Space-efficient existence checks ↔ False positives, no value retrieval
  • Redis Hash: Memory efficiency + field-level updates ↔ No per-field TTL, no range queries
  • Trie: Fast prefix queries ↔ High memory overhead
  • B+ Tree: Range scans + ordered access ↔ More complex writes

Interview Questions & Answers

Q1: Design a recommendation system like Netflix. Would you use SQL or NoSQL? Why?

Answer: Use MongoDB (NoSQL) for the following reasons:

  • Flexible schema: User preferences, watch history, and metadata change frequently
  • Horizontal scalability: Netflix scales to millions of users globally
  • Document model: Natural representation of complex nested data (shows, ratings, genres)
  • Eventual consistency acceptable: Real-time recommendations don't need strict ACID

However, use PostgreSQL for:

  • Payment and billing (ACID critical)
  • User authentication (strong consistency needed)

Hybrid approach: MongoDB for user viewing history/recommendations + PostgreSQL for core transactions = best of both worlds.


Q2: You're building an Uber-like system. How would you handle location queries for "find nearby drivers"?

Answer: Primary: Redis Geo because:

  • Drivers' locations update frequently (every few seconds)
  • Need <100ms latency for instant matching
  • Simple radius search (5-10km around user)
  • O(log N) complexity with distance sorting

Secondary: PostgreSQL + PostGIS because:

  • Store driver locations persistently (RDB backup)
  • Handle edge cases (no drivers in radius, unexpected zones)
  • Analytics queries (driver distribution by hour, peak zones)

Architecture:

  1. Driver sends location → Redis Geo (real-time, fast)
  2. Periodically sync to PostgreSQL (eventual consistency)
  3. Use S2/H3 for surge pricing zones (pre-computed hexagons)
  4. For analytics, query PostgreSQL historical data

Q3: Your database reads are slow. Should you add read replicas or shard?

Answer: Decision tree:

  1. Check the read:write ratio. If 10:1 or higher → read replicas.
     • Example: news site (mostly reads, few updates)
     • Simpler; no cross-shard coordination
  2. If write-heavy (write volume exceeds a single DB's capacity) → sharding.
     • Example: real-time metrics, IoT sensors, high-frequency trading
     • Write scaling is required; replicas alone won't help
  3. Combination approach (most real systems):
     • Shard by user_id or geographic region
     • Add replicas to each shard (primary + 2 secondaries)
     • Example: a Stripe-style payment system (sharded by merchant_id, replicated for HA)

Performance check before deciding:

  • Run EXPLAIN ANALYZE on slow queries
  • Add indexes first (often solves 80% of slowness)
  • Then consider replicas/sharding

Q4: When designing a social network (like Facebook), should you normalize or denormalize the user profile?

Answer: Denormalize user profiles because:

  • User profiles are read-heavy (viewed hundreds of times, updated rarely)
  • Denormalized: { user_id, name, profile_pic, bio, location, friends_count }
  • Avoids expensive joins on every page load

But normalize core data:

  • User credentials, email, phone → normalized (change rarely, critical consistency)
  • Payments and transactions → heavily normalized (ACID required)

Practical implementation:

```text
Normalized tables:
- users (id, email, password_hash, created_at)
- user_profiles (user_id, name, bio, location) → denormalized cache

On profile update:
1. Update user_profiles (fast read path)
2. Background job syncs the denormalized data into a Redis cache
3. Read-heavy calls hit the cache/denormalized data
```

Tradeoff: Trade storage for speed; profile denormalization costs 10% more storage but saves thousands of joins.


Q5: How would you design geospatial queries for a map-based service like Google Maps?

Answer: Layered approach:

  1. Index level: PostgreSQL + PostGIS
     • Store all geographic data (roads, POIs, boundaries)
     • Handle complex queries (within polygon, distance calculations)
  2. Cache level: Redis Geo
     • Cache hot queries (major cities, popular routes)
     • <10ms latency for cached results
  3. Pre-computation level: S2/H3 geometry
     • Pre-compute tiles and hexagons
     • Avoid expensive real-time calculations
     • Use for heatmaps, density-based features

Query flow for "restaurants near me":

```text
User location (40.7128, -74.0060), radius 5 km

1. Check Redis Geo: GEORADIUS restaurants -74.0060 40.7128 5 km ASC
   (Redis geo commands take longitude before latitude)
2. Cache hit → return instantly
3. Cache miss → query Elasticsearch (location + full-text search on "restaurant")
4. For complex boundaries, fall back to PostgreSQL + PostGIS
5. Store the result in Redis with a 1-hour TTL
```

This hybrid design balances speed (Redis), power (PostGIS), and scale (Elasticsearch).