Database Sharding Explained: Scale to Millions of Users
Learn how database sharding works, when to use it, and common strategies. Covers horizontal partitioning, shard keys, and challenges with real examples.
Your user table has 50 million rows. Queries are slow. You've added indexes, upgraded to a bigger server, added read replicas. But writes are still the bottleneck.
This is the problem sharding solves. Instead of one big database, you split the data across multiple databases — each one handles a subset of the data.
What Is Sharding?
Sharding is horizontal partitioning — splitting a table's rows across multiple database instances (called shards).
Each shard has the same schema but different data. Together, they hold the complete dataset.
Without sharding:
Database → Users table (50M rows)
With sharding:
Shard 1 → Users 1–12.5M
Shard 2 → Users 12.5M–25M
Shard 3 → Users 25M–37.5M
Shard 4 → Users 37.5M–50M
Each shard can be on a different server. Each server handles 25% of the write load.
Sharding Strategies
Range-Based Sharding
Split data based on ranges of the shard key. User IDs 1–1M go to Shard 1, 1M–2M to Shard 2, etc.
def get_shard(user_id: int, num_shards: int = 4) -> str:
    shard_size = 1_000_000
    # Clamp to the last shard instead of wrapping with modulo,
    # so each shard owns one contiguous ID range
    shard_num = min(user_id // shard_size, num_shards - 1)
    return f"db_shard_{shard_num}"

Advantage: Simple. Range queries (find all users created between date X and Y) are efficient — they usually hit one or two shards.

Problem: Hot spots. If most active users have high IDs (recently created), Shard 4 handles all the load while Shard 1 sits idle.
Hash-Based Sharding
Hash the shard key and use modulo to pick a shard.
import hashlib

def get_shard(user_id: int, num_shards: int = 4) -> str:
    hash_value = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    shard_num = hash_value % num_shards
    return f"db_shard_{shard_num}"

Advantage: Even distribution. No hot spots.

Problem: Range queries become expensive — you have to check all shards. Resharding (adding more shards) is painful — you have to move data around.
Solution to resharding: Use consistent hashing instead of simple modulo. Cassandra and DynamoDB do this.
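To see why consistent hashing helps, here is a toy ring sketch. The `HashRing` class and `VNODES` constant are made up for illustration — this is not Cassandra's or DynamoDB's actual implementation, just the core idea: adding a shard only remaps the keys that land on the new shard's virtual nodes, instead of reshuffling almost everything as naive modulo does.

```python
import hashlib
from bisect import bisect_right

VNODES = 100  # virtual nodes per shard smooth out the distribution

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, shards):
        # Each shard gets VNODES points on the ring, sorted by hash
        self._ring = sorted(
            (_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(VNODES)
        )
        self._points = [p for p, _ in self._ring]

    def get_shard(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash
        idx = bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["db_shard_0", "db_shard_1", "db_shard_2", "db_shard_3"])
before = {uid: ring.get_shard(f"user:{uid}") for uid in range(10_000)}

# Grow from 4 to 5 shards and count how many keys had to move
ring = HashRing(["db_shard_0", "db_shard_1", "db_shard_2",
                 "db_shard_3", "db_shard_4"])
after = {uid: ring.get_shard(f"user:{uid}") for uid in range(10_000)}
moved = sum(1 for uid in before if before[uid] != after[uid])
print(f"{moved / 10_000:.0%} of keys moved")  # roughly 1/5, vs ~4/5 with plain modulo
```

With `hash % num_shards`, going from 4 to 5 shards remaps about 80% of keys; on the ring, only about the 20% owned by the new shard move.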
Directory-Based Sharding
A lookup service (routing table) maps each key to its shard.
# Lookup table stored in Redis or a database
routing_table = {
    "user:1": "shard_1",
    "user:2": "shard_3",
    ...
}

def get_shard(user_id: int) -> str:
    return redis.get(f"routing:user:{user_id}") or "shard_default"

Advantage: Most flexible. You can move data between shards easily by updating the routing table.

Problem: The routing service becomes a bottleneck and a single point of failure. Extra network hop per query.
Choosing a Shard Key
The shard key determines how data is distributed. A bad shard key creates hot spots.
Good shard keys:
- High cardinality (many unique values)
- Evenly distributed access patterns
- Rarely changes (changing a shard key means moving data)
Common choices:
- user_id — if most queries are per-user
- tenant_id — for multi-tenant SaaS apps
- created_at (with hash) — for time-series data
- Geographic region — if users are clustered geographically
Avoid:
- Boolean fields — only 2 shards max
- Low-cardinality fields — uneven distribution
- Fields that change frequently
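A quick simulation makes the cardinality point concrete. This sketch (hash function and key names are illustrative, not from any specific system) counts rows per shard for a high-cardinality key versus a boolean flag:

```python
import hashlib
from collections import Counter

def get_shard(key: str, num_shards: int = 4) -> str:
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return f"db_shard_{h % num_shards}"

# High-cardinality key (user_id): load spreads across all shards
by_user = Counter(get_shard(f"user:{i}") for i in range(100_000))

# Low-cardinality key (a boolean flag): at most two shards ever see data,
# no matter how many shards you have
by_flag = Counter(get_shard(f"is_premium:{i % 2}") for i in range(100_000))

print(by_user)  # roughly 25,000 rows per shard
print(by_flag)  # 100,000 rows concentrated on at most two shards
```

The hash can't manufacture spread that isn't in the key: with only two distinct values, at least half the shards sit permanently empty.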
The Real Challenges of Sharding
Sharding solves the write-scaling problem, but it introduces substantial complexity.
Cross-shard queries: If you need data that spans multiple shards (e.g., "find all users who purchased product X"), you have to query all shards and merge results in your application. This is expensive.
Joins: You can't JOIN across shards in SQL. You have to fetch data from Shard 1 and Shard 2 separately and join them in application code.
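A minimal sketch of an application-side join, assuming hypothetical shard connections that expose a `query(sql, params)` method (the table schema and placeholder style are made up for illustration):

```python
# `shards` is a list of DB connections, each exposing
# query(sql, params) -> list[dict]. Orders and users are sharded by user_id.

def find_users_who_bought(product_id, shards):
    # A product-centric question has to fan out to every shard (scatter)...
    user_ids = set()
    for shard in shards:
        rows = shard.query(
            "SELECT user_id FROM orders WHERE product_id = %s", (product_id,)
        )
        user_ids.update(r["user_id"] for r in rows)

    # ...then fetch the matching users from each shard (gather),
    # performing the "join" in application code instead of SQL
    users = []
    for shard in shards:
        rows = shard.query(
            "SELECT * FROM users WHERE user_id = ANY(%s)", (list(user_ids),)
        )
        users.extend(rows)
    return users
```

Note the two full fan-outs: every shard is queried twice, even if most hold no matching rows. This is why cross-shard access patterns usually get a dedicated index or search system instead.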
Distributed transactions: Updating data across shards in a single atomic transaction requires distributed transaction protocols (2-phase commit), which are complex and slow.
Schema changes: Running a migration on 4 shards instead of 1 takes 4× longer and needs careful coordination.
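One common approach is a loop that applies the same migration shard by shard, tracking which shards have finished so a failed run can resume. A hedged sketch, with made-up shard handles exposing `execute(sql)`:

```python
# `shards` maps shard name -> connection exposing execute(sql).
# In a real system, `applied` would be persisted (e.g. a migrations table),
# not held in memory.

MIGRATION = "ALTER TABLE users ADD COLUMN last_seen_at TIMESTAMP"

def migrate_all(shards, applied=None):
    applied = applied if applied is not None else set()
    for name, shard in shards.items():
        if name in applied:
            continue  # already migrated on a previous (partial) run
        shard.execute(MIGRATION)
        applied.add(name)
    return applied
```

The hard part isn't the loop — it's that between the first and last shard, your application must tolerate both old and new schemas at once.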
Before You Shard
Sharding is complex. Try these first:
- Add indexes (often the real fix)
- Read replicas (scale reads, not writes)
- Caching (Redis for frequently-read data)
- Vertical scaling (bigger machines)
- Partition by table (move large tables like logs to separate databases)
Sharding should be your last resort for scaling writes. Most applications never need it.
When Managed Databases Handle Sharding for You
You don't always need to implement sharding yourself:
- Amazon DynamoDB: Automatically shards and rebalances under the hood
- Google Spanner: Horizontally scalable SQL with automatic sharding
- CockroachDB: Distributed PostgreSQL-compatible database
- Cassandra: Native sharding with consistent hashing
If you're at a scale that needs sharding, strongly consider these managed options before implementing it yourself.
Sharding in Practice: A Real Example
Instagram's user media was sharded by user_id. Each shard holds all photos for a range of users.
When you view a profile, all media for that user is on one shard — no cross-shard query. Efficient.
But searching for photos by hashtag across all users is expensive — it has to touch all shards. Instagram built a separate search index for this rather than cross-shard querying.
This is the pattern: shard for your hot path (user profile views), build separate systems for the expensive cross-shard queries.
Key Takeaways
- Sharding splits data across multiple databases to scale write throughput
- Hash-based sharding distributes evenly; range-based is better for range queries
- Choose your shard key carefully — it's hard to change later
- Cross-shard queries, joins, and distributed transactions are the main pain points
- Try indexing, caching, and read replicas before sharding
- Managed databases (DynamoDB, Spanner, CockroachDB) handle sharding for you
Sharding is powerful but complex. Use it when you have to, not because you can.
Related reading: Database Indexing Explained · CAP Theorem Explained · Consistent Hashing