Database Sharding Explained: Scale to Millions of Users
Learn how database sharding works, when to use it, and common strategies. Covers horizontal partitioning, shard keys, and challenges with real examples.
Your user table has 50 million rows. Queries are slow. You've added indexes, upgraded to a bigger server, added read replicas. But writes are still the bottleneck.
This is the problem sharding solves. Instead of one big database, you split the data across multiple databases — each one handles a subset of the data.
What Is Sharding?
Sharding is horizontal partitioning — splitting a table's rows across multiple database instances (called shards).
Each shard has the same schema but different data. Together, they hold the complete dataset.
Without sharding:
Database → Users table (50M rows)
With sharding:
Shard 1 → Users 1–12.5M
Shard 2 → Users 12.5M–25M
Shard 3 → Users 25M–37.5M
Shard 4 → Users 37.5M–50M
Each shard can be on a different server. Each server handles 25% of the write load.
Sharding Strategies
Range-Based Sharding
Split data based on ranges of the shard key. User IDs 1–1M go to Shard 1, 1M–2M to Shard 2, etc.
def get_shard(user_id: int, num_shards: int = 4) -> str:
    shard_size = 1_000_000
    # Clamp to the last shard instead of wrapping with modulo,
    # so each shard owns one contiguous ID range
    shard_num = min(user_id // shard_size, num_shards - 1)
    return f"db_shard_{shard_num}"

Advantage: Simple. Range queries (find all users created between date X and Y) are efficient — they usually hit one or two shards.

Problem: Hot spots. If most active users have high IDs (recently created), Shard 4 handles all the load while Shard 1 sits idle.
Hash-Based Sharding
Hash the shard key and use modulo to pick a shard.
import hashlib

def get_shard(user_id: int, num_shards: int = 4) -> str:
    hash_value = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    shard_num = hash_value % num_shards
    return f"db_shard_{shard_num}"

Advantage: Even distribution. No hot spots.

Problem: Range queries become expensive — you have to check all shards. Resharding (adding more shards) is painful — you have to move data around.
Solution to resharding: Use consistent hashing instead of simple modulo. Cassandra and DynamoDB do this.
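To see why consistent hashing helps, here is a toy ring sketch. The `HashRing` class and `VNODES` constant are made up for illustration — this is not Cassandra's or DynamoDB's actual implementation, just the core idea: adding a shard only remaps the keys that land on the new shard's virtual nodes, instead of reshuffling almost everything as naive modulo does.

```python
import hashlib
from bisect import bisect_right

VNODES = 100  # virtual nodes per shard smooth out the distribution

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, shards):
        # Each shard gets VNODES points on the ring, sorted by hash
        self._ring = sorted(
            (_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(VNODES)
        )
        self._points = [p for p, _ in self._ring]

    def get_shard(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash
        idx = bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["db_shard_0", "db_shard_1", "db_shard_2", "db_shard_3"])
before = {uid: ring.get_shard(f"user:{uid}") for uid in range(10_000)}

# Grow from 4 to 5 shards and count how many keys had to move
ring = HashRing(["db_shard_0", "db_shard_1", "db_shard_2",
                 "db_shard_3", "db_shard_4"])
after = {uid: ring.get_shard(f"user:{uid}") for uid in range(10_000)}
moved = sum(1 for uid in before if before[uid] != after[uid])
print(f"{moved / 10_000:.0%} of keys moved")  # roughly 1/5, vs ~4/5 with plain modulo
```

With `hash % num_shards`, going from 4 to 5 shards remaps about 80% of keys; on the ring, only about the 20% owned by the new shard move.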
Directory-Based Sharding
A lookup service (routing table) maps each key to its shard.
# Lookup table stored in Redis or a database
routing_table = {
    "user:1": "shard_1",
    "user:2": "shard_3",
    ...
}

def get_shard(user_id: int) -> str:
    return redis.get(f"routing:user:{user_id}") or "shard_default"

Advantage: Most flexible. You can move data between shards easily by updating the routing table.

Problem: The routing service becomes a bottleneck and a single point of failure. Extra network hop per query.
Choosing a Shard Key
The shard key determines how data is distributed. A bad shard key creates hot spots.
Good shard keys:
- High cardinality (many unique values)
- Evenly distributed access patterns
- Rarely changes (changing a shard key means moving data)
Common choices:
- user_id — if most queries are per-user
- tenant_id — for multi-tenant SaaS apps
- created_at (with hash) — for time-series data
- Geographic region — if users are clustered geographically
Avoid:
- Boolean fields — only 2 shards max
- Low-cardinality fields — uneven distribution
- Fields that change frequently
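A quick simulation makes the cardinality point concrete. This sketch (hash function and key names are illustrative, not from any specific system) counts rows per shard for a high-cardinality key versus a boolean flag:

```python
import hashlib
from collections import Counter

def get_shard(key: str, num_shards: int = 4) -> str:
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return f"db_shard_{h % num_shards}"

# High-cardinality key (user_id): load spreads across all shards
by_user = Counter(get_shard(f"user:{i}") for i in range(100_000))

# Low-cardinality key (a boolean flag): at most two shards ever see data,
# no matter how many shards you have
by_flag = Counter(get_shard(f"is_premium:{i % 2}") for i in range(100_000))

print(by_user)  # roughly 25,000 rows per shard
print(by_flag)  # 100,000 rows concentrated on at most two shards
```

The hash can't manufacture spread that isn't in the key: with only two distinct values, at least half the shards sit permanently empty.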
The Real Challenges of Sharding
Sharding solves the write-scaling problem, but it introduces substantial complexity.
Cross-shard queries: If you need data that spans multiple shards (e.g., "find all users who purchased product X"), you have to query all shards and merge results in your application. This is expensive.
Joins: You can't JOIN across shards in SQL. You have to fetch data from Shard 1 and Shard 2 separately and join them in application code.
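A minimal sketch of an application-side join, assuming hypothetical shard connections that expose a `query(sql, params)` method (the table schema and placeholder style are made up for illustration):

```python
# `shards` is a list of DB connections, each exposing
# query(sql, params) -> list[dict]. Orders and users are sharded by user_id.

def find_users_who_bought(product_id, shards):
    # A product-centric question has to fan out to every shard (scatter)...
    user_ids = set()
    for shard in shards:
        rows = shard.query(
            "SELECT user_id FROM orders WHERE product_id = %s", (product_id,)
        )
        user_ids.update(r["user_id"] for r in rows)

    # ...then fetch the matching users from each shard (gather),
    # performing the "join" in application code instead of SQL
    users = []
    for shard in shards:
        rows = shard.query(
            "SELECT * FROM users WHERE user_id = ANY(%s)", (list(user_ids),)
        )
        users.extend(rows)
    return users
```

Note the two full fan-outs: every shard is queried twice, even if most hold no matching rows. This is why cross-shard access patterns usually get a dedicated index or search system instead.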
Distributed transactions: Updating data across shards in a single atomic transaction requires distributed transaction protocols (2-phase commit), which are complex and slow.
Schema changes: Running a migration on 4 shards instead of 1 takes 4× longer and needs careful coordination.
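One common approach is a loop that applies the same migration shard by shard, tracking which shards have finished so a failed run can resume. A hedged sketch, with made-up shard handles exposing `execute(sql)`:

```python
# `shards` maps shard name -> connection exposing execute(sql).
# In a real system, `applied` would be persisted (e.g. a migrations table),
# not held in memory.

MIGRATION = "ALTER TABLE users ADD COLUMN last_seen_at TIMESTAMP"

def migrate_all(shards, applied=None):
    applied = applied if applied is not None else set()
    for name, shard in shards.items():
        if name in applied:
            continue  # already migrated on a previous (partial) run
        shard.execute(MIGRATION)
        applied.add(name)
    return applied
```

The hard part isn't the loop — it's that between the first and last shard, your application must tolerate both old and new schemas at once.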
Before You Shard
Sharding is complex. Try these first:
- Add indexes (often the real fix)
- Read replicas (scale reads, not writes)
- Caching (Redis for frequently-read data)
- Vertical scaling (bigger machines)
- Partition by table (move large tables like logs to separate databases)
Sharding should be your last resort for scaling writes. Most applications never need it.
When Managed Databases Handle Sharding for You
You don't always need to implement sharding yourself:
- Amazon DynamoDB: Automatically shards and rebalances under the hood
- Google Spanner: Horizontally scalable SQL with automatic sharding
- CockroachDB: Distributed PostgreSQL-compatible database
- Cassandra: Native sharding with consistent hashing
If you're at a scale that needs sharding, strongly consider these managed options before implementing it yourself.
Sharding in Practice: A Real Example
Instagram's user media was sharded by user_id. Each shard holds all photos for a range of users.
When you view a profile, all media for that user is on one shard — no cross-shard query. Efficient.
But searching for photos by hashtag across all users is expensive — it has to touch all shards. Instagram built a separate search index for this rather than cross-shard querying.
This is the pattern: shard for your hot path (user profile views), build separate systems for the expensive cross-shard queries.
Key Takeaways
- Sharding splits data across multiple databases to scale write throughput
- Hash-based sharding distributes evenly; range-based is better for range queries
- Choose your shard key carefully — it's hard to change later
- Cross-shard queries, joins, and distributed transactions are the main pain points
- Try indexing, caching, and read replicas before sharding
- Managed databases (DynamoDB, Spanner, CockroachDB) handle sharding for you
Sharding is powerful but complex. Use it when you have to, not because you can.
Related reading: Database Indexing Explained · CAP Theorem Explained · Consistent Hashing