Database Replication Explained: Primary, Replica, and Failover
Learn how database replication works. Covers primary-replica setup, synchronous vs asynchronous replication, read replicas, and automatic failover strategies.
Your database is a single server. If it goes down, your app goes down. If it gets overwhelmed with reads, everything slows. Replication solves both.
What Is Database Replication?
Replication copies data from one database server to one or more others. The original is the primary (or leader). The copies are replicas (or followers).
Without replication:
App → Database (single point of failure, single server handles all load)
With replication:
App writes → Primary ──→ Replica 1 (read from here)
└──→ Replica 2 (read from here)
└──→ Replica 3 (failover target)Why replicate:
- High availability: If primary fails, promote a replica
- Read scaling: Distribute read queries across replicas
- Backups: Take backups from a replica without impacting primary performance
- Geographic distribution: Replica in another region for lower latency reads
Primary-Replica (Leader-Follower)
The most common setup. One primary accepts all writes. Replicas receive changes and stay in sync. Reads can go to replicas.
# Application code with read/write splitting
from sqlalchemy import create_engine
primary = create_engine("postgresql://primary-host/mydb") # Writes
replica1 = create_engine("postgresql://replica1-host/mydb") # Reads
replica2 = create_engine("postgresql://replica2-host/mydb") # Reads
def create_order(data):
with primary.connect() as conn: # Always write to primary
conn.execute("INSERT INTO orders ...", data)
def get_user_orders(user_id):
with replica1.connect() as conn: # Read from replica
return conn.execute("SELECT * FROM orders WHERE user_id = %s", user_id)Most ORMs (SQLAlchemy, Django ORM) support read/write routing with minimal config.
Synchronous vs Asynchronous Replication
How quickly replicas receive writes.
Asynchronous (default in most databases)
Primary writes, confirms success to the app, then sends changes to replicas in the background.
Write → Primary commits → Returns "success" to app
↓ (async, may be milliseconds behind)
Replica receives changePros: Fast writes. Primary doesn't wait for replicas.
Cons: Replication lag — replicas may be slightly behind. If primary crashes between commit and replication, that data is lost. Replica reads may return stale data.
Replication lag is the delay between primary write and replica receiving it. Usually milliseconds, can be seconds under heavy load.
Synchronous
Primary doesn't confirm success until at least one replica has confirmed the write.
Write → Primary commits → Waits for replica ACK → Returns "success" to app
↓ (sync)
Replica commitsPros: Zero data loss on primary failure. Reads from replica are guaranteed fresh.
Cons: Write latency increases (waits for network round trip to replica). If replica is slow, primary slows too.
PostgreSQL: Configuring Replication Mode
-- postgresql.conf on primary
synchronous_standby_names = 'replica1' -- Sync with replica1
-- '' = async (default)
-- 'replica1' = sync with one replica
-- 'ANY 1 (replica1, replica2)' = sync with any one of two replicas-- Per-transaction override
BEGIN;
SET synchronous_commit = 'on'; -- This transaction waits for replica
INSERT INTO payments ...;
COMMIT;Most apps use async replication and accept the small risk of brief stale reads. Financial data benefits from sync replication.
Read Replicas: Scaling Reads
Read replicas are the most common use of replication. Direct all SELECT queries to replicas, all writes to primary.
Read traffic (80%) → Replica 1
→ Replica 2
→ Replica 3
Write traffic (20%) → PrimaryThis lets you scale read capacity independently. Add more replicas when read load grows, without touching the primary.
What to put on replicas:
- User profile fetches
- Product catalog reads
- Report generation
- Analytics queries
- Search queries
What must go to primary:
- Any write
- Reads that must be immediately consistent after a write (read-your-own-writes)
The Stale Read Problem
You update your username. Then you immediately fetch your profile. If you hit a replica, it may return the old name (replication lag).
Fix strategies:
# Strategy 1: Read from primary after a write
def update_username(user_id, name):
primary.execute("UPDATE users SET name = %s WHERE id = %s", name, user_id)
# Return the new value directly, don't fetch from replica
return {"name": name}
# Strategy 2: Route reads to primary for a short window after write
from datetime import datetime, timedelta
user_recently_wrote = {}
def update_username(user_id, name):
primary.execute("UPDATE users SET name = %s WHERE id = %s", name, user_id)
user_recently_wrote[user_id] = datetime.utcnow()
def get_user(user_id):
wrote_at = user_recently_wrote.get(user_id)
if wrote_at and datetime.utcnow() - wrote_at < timedelta(seconds=5):
return primary.execute("SELECT * FROM users WHERE id = %s", user_id)
return replica.execute("SELECT * FROM users WHERE id = %s", user_id)Failover: When Primary Goes Down
When primary fails, a replica must be promoted. This is failover.
Manual failover: DBA promotes replica manually. Downtime: minutes.
Automated failover: A tool watches the primary and promotes a replica automatically. Downtime: seconds.
Tools:
- Patroni (PostgreSQL): Stores leader election state in etcd/Consul. Automatically promotes the most up-to-date replica.
- AWS RDS Multi-AZ: Sync replica in another availability zone. Automatic failover in 60-120 seconds.
- PlanetScale / Neon: Managed Postgres with automatic failover built in.
# Patroni config (simplified)
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576 # 1MB — don't promote replica too far behind
postgresql:
listen: 0.0.0.0:5432
connect_address: primary-host:5432After failover, your app's connection string needs to point to the new primary. Use a virtual IP or DNS that switches automatically, or a proxy like pgBouncer/HAProxy.
Multi-Primary (Multi-Master)
Multiple primaries accept writes. Each replicates to the others. More complex — now you have potential write conflicts.
App → Primary 1 ←→ Primary 2 ← App
↓ ↓
Replica ReplicaUsed for: geographic distribution (write to the nearest region), maximum write availability.
Conflict resolution gets complex: two primaries update the same row at the same time. Who wins? Last-write-wins? Application-level merge?
Most applications don't need multi-primary. Stick with primary-replica unless you have very specific multi-region write requirements.
Key Takeaways
- Replication copies data from primary to replicas — for availability and read scaling
- Async replication: fast writes, possible brief lag — good for most apps
- Sync replication: zero data loss, slower writes — good for financial data
- Read replicas scale read throughput — direct SELECTs there, writes to primary
- Replication lag causes stale reads — route reads after writes to primary for consistency
- Failover promotes a replica when primary fails — automate it with Patroni or managed DBs
- Multi-primary adds write conflicts — only use it if primary-replica isn't enough
Start with a single primary and add read replicas when read load grows. Add automated failover before you need it.
Related reading: Database Sharding Explained · CAP Theorem Explained · Database Indexing Explained
Enjoyed this article?
Get weekly insights on backend architecture, system design, and Go programming.
Related Posts
Continue reading with these related posts
ACID Properties Explained: Database Transactions
Learn what ACID means in databases — Atomicity, Consistency, Isolation, and Durability. Why it matters, how it works, and when you need it.
Database Sharding Explained: Scale to Millions of Users
Learn how database sharding works, when to use it, and common strategies. Covers horizontal partitioning, shard keys, and challenges with real examples.
SQL vs NoSQL: How to Choose the Right Database
Learn the real differences between SQL and NoSQL databases. When to use PostgreSQL, MongoDB, Cassandra, or Redis. Practical guide with use cases and tradeoffs.