database

Database Replication Explained: Primary, Replica, and Failover

Learn how database replication works. Covers primary-replica setup, synchronous vs asynchronous replication, read replicas, and automatic failover strategies.

By Akash Sharma·6 min read
#database
#replication
#postgresql
#system design
#scalability
#reliability
#backend

Your database is a single server. If it goes down, your app goes down. If it gets overwhelmed with reads, everything slows. Replication solves both.

What Is Database Replication?

Replication copies data from one database server to one or more others. The original is the primary (or leader). The copies are replicas (or followers).

plaintext
Without replication:
App → Database (single point of failure, single server handles all load)
 
With replication:
App writes → Primary ──→ Replica 1 (read from here)
                    └──→ Replica 2 (read from here)
                    └──→ Replica 3 (failover target)

Why replicate:

  • High availability: If primary fails, promote a replica
  • Read scaling: Distribute read queries across replicas
  • Backups: Take backups from a replica without impacting primary performance
  • Geographic distribution: Replica in another region for lower latency reads

Primary-Replica (Leader-Follower)

The most common setup. One primary accepts all writes. Replicas receive changes and stay in sync. Reads can go to replicas.

python
# Application code with read/write splitting
from sqlalchemy import create_engine
 
primary = create_engine("postgresql://primary-host/mydb")    # Writes
replica1 = create_engine("postgresql://replica1-host/mydb")  # Reads
replica2 = create_engine("postgresql://replica2-host/mydb")  # Reads
 
def create_order(data):
    with primary.connect() as conn:  # Always write to primary
        conn.execute("INSERT INTO orders ...", data)
 
def get_user_orders(user_id):
    with replica1.connect() as conn:  # Read from replica
        return conn.execute("SELECT * FROM orders WHERE user_id = %s", user_id)

Most ORMs (SQLAlchemy, Django ORM) support read/write routing with minimal config.

Synchronous vs Asynchronous Replication

How quickly replicas receive writes.

Asynchronous (default in most databases)

Primary writes, confirms success to the app, then sends changes to replicas in the background.

plaintext
Write → Primary commits → Returns "success" to app
              ↓ (async, may be milliseconds behind)
         Replica receives change

Pros: Fast writes. Primary doesn't wait for replicas.

Cons: Replication lag — replicas may be slightly behind. If primary crashes between commit and replication, that data is lost. Replica reads may return stale data.

Replication lag is the delay between primary write and replica receiving it. Usually milliseconds, can be seconds under heavy load.

Synchronous

Primary doesn't confirm success until at least one replica has confirmed the write.

plaintext
Write → Primary commits → Waits for replica ACK → Returns "success" to app
              ↓ (sync)
         Replica commits

Pros: Zero data loss on primary failure. Reads from replica are guaranteed fresh.

Cons: Write latency increases (waits for network round trip to replica). If replica is slow, primary slows too.

PostgreSQL: Configuring Replication Mode

sql
-- postgresql.conf on primary
synchronous_standby_names = 'replica1'  -- Sync with replica1
-- '' = async (default)
-- 'replica1' = sync with one replica
-- 'ANY 1 (replica1, replica2)' = sync with any one of two replicas
sql
-- Per-transaction override
BEGIN;
SET synchronous_commit = 'on';  -- This transaction waits for replica
INSERT INTO payments ...;
COMMIT;

Most apps use async replication and accept the small risk of brief stale reads. Financial data benefits from sync replication.

Read Replicas: Scaling Reads

Read replicas are the most common use of replication. Direct all SELECT queries to replicas, all writes to primary.

plaintext
Read traffic (80%) → Replica 1
                   → Replica 2
                   → Replica 3
 
Write traffic (20%) → Primary

This lets you scale read capacity independently. Add more replicas when read load grows, without touching the primary.

What to put on replicas:

  • User profile fetches
  • Product catalog reads
  • Report generation
  • Analytics queries
  • Search queries

What must go to primary:

  • Any write
  • Reads that must be immediately consistent after a write (read-your-own-writes)

The Stale Read Problem

You update your username. Then you immediately fetch your profile. If you hit a replica, it may return the old name (replication lag).

Fix strategies:

python
# Strategy 1: Read from primary after a write
def update_username(user_id, name):
    primary.execute("UPDATE users SET name = %s WHERE id = %s", name, user_id)
    # Return the new value directly, don't fetch from replica
    return {"name": name}
 
# Strategy 2: Route reads to primary for a short window after write
from datetime import datetime, timedelta
 
user_recently_wrote = {}
 
def update_username(user_id, name):
    primary.execute("UPDATE users SET name = %s WHERE id = %s", name, user_id)
    user_recently_wrote[user_id] = datetime.utcnow()
 
def get_user(user_id):
    wrote_at = user_recently_wrote.get(user_id)
    if wrote_at and datetime.utcnow() - wrote_at < timedelta(seconds=5):
        return primary.execute("SELECT * FROM users WHERE id = %s", user_id)
    return replica.execute("SELECT * FROM users WHERE id = %s", user_id)

Failover: When Primary Goes Down

When primary fails, a replica must be promoted. This is failover.

Manual failover: DBA promotes replica manually. Downtime: minutes.

Automated failover: A tool watches the primary and promotes a replica automatically. Downtime: seconds.

Tools:

  • Patroni (PostgreSQL): Stores leader election state in etcd/Consul. Automatically promotes the most up-to-date replica.
  • AWS RDS Multi-AZ: Sync replica in another availability zone. Automatic failover in 60-120 seconds.
  • PlanetScale / Neon: Managed Postgres with automatic failover built in.
yaml
# Patroni config (simplified)
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB — don't promote replica too far behind
 
postgresql:
  listen: 0.0.0.0:5432
  connect_address: primary-host:5432

After failover, your app's connection string needs to point to the new primary. Use a virtual IP or DNS that switches automatically, or a proxy like pgBouncer/HAProxy.

Multi-Primary (Multi-Master)

Multiple primaries accept writes. Each replicates to the others. More complex — now you have potential write conflicts.

plaintext
App → Primary 1 ←→ Primary 2 ← App
         ↓               ↓
      Replica         Replica

Used for: geographic distribution (write to the nearest region), maximum write availability.

Conflict resolution gets complex: two primaries update the same row at the same time. Who wins? Last-write-wins? Application-level merge?

Most applications don't need multi-primary. Stick with primary-replica unless you have very specific multi-region write requirements.

Key Takeaways

  • Replication copies data from primary to replicas — for availability and read scaling
  • Async replication: fast writes, possible brief lag — good for most apps
  • Sync replication: zero data loss, slower writes — good for financial data
  • Read replicas scale read throughput — direct SELECTs there, writes to primary
  • Replication lag causes stale reads — route reads after writes to primary for consistency
  • Failover promotes a replica when primary fails — automate it with Patroni or managed DBs
  • Multi-primary adds write conflicts — only use it if primary-replica isn't enough

Start with a single primary and add read replicas when read load grows. Add automated failover before you need it.

Related reading: Database Sharding Explained · CAP Theorem Explained · Database Indexing Explained

Enjoyed this article?

Get weekly insights on backend architecture, system design, and Go programming.