Circuit Breaker Pattern: Stop Cascading Failures
Learn how the circuit breaker pattern prevents cascading failures in distributed systems. With Python examples and real-world use cases from Netflix.
Your payment service is down. Normally, a failed payment just returns an error. But every request to the payment service is waiting 30 seconds before timing out. Your entire checkout flow slows to a crawl. Users start abandoning carts. Other services sharing the same thread pool start failing too.
One broken service took down your whole system.
The circuit breaker pattern prevents this.
How Circuit Breakers Work
Think of a circuit breaker like the one in your electrical panel. When too much current flows (a fault), the breaker trips. Power to that circuit stops. The rest of your house keeps running. Once the fault is fixed, you reset the breaker.
In software, the circuit breaker wraps calls to a service. When that service starts failing, the breaker "trips" — it stops sending requests there and returns errors immediately. Your service stays responsive, other features keep working.
Three states:
Closed (normal): Requests pass through. Failures are counted.
Open (tripped): Requests fail immediately without even calling the downstream service. Error returned instantly.
Half-open (testing): After a timeout, a few test requests are allowed through. If they succeed, the breaker closes. If they fail, it opens again.
```
Closed → [failure threshold exceeded] → Open
Open → [timeout period] → Half-Open
Half-Open → [test requests fail] → Open
Half-Open → [test requests succeed] → Closed
```

Why This Matters: Cascading Failures
Without a circuit breaker, one slow service causes a cascade:
- Payment service responds slowly (5s instead of 200ms)
- Checkout service waits. Thread blocked.
- More requests come in. More threads blocked.
- Checkout service runs out of threads — it stops responding
- API gateway waiting for checkout. Threads blocked.
- Everything fails
With a circuit breaker:
- Payment service starts failing
- After 5 failures, circuit opens
- All subsequent calls fail instantly (not after 5-second timeout)
- Checkout service stays responsive — returns a useful error immediately
- Other features continue working
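To make the fail-fast difference concrete, here is a minimal sketch. The function names are illustrative, and the 0.3-second sleep stands in for a real 5- or 30-second timeout:

```python
import time

def slow_payment_call():
    # Stands in for a downstream call that hangs until its timeout expires.
    time.sleep(0.3)
    raise ConnectionError("payment service timed out")

def fail_fast():
    # Stands in for an open circuit breaker rejecting the call immediately.
    raise ConnectionError("circuit open: payment service unavailable")

# Without a breaker: every request eats the full timeout before failing.
start = time.time()
try:
    slow_payment_call()
except ConnectionError:
    pass
blocking_elapsed = time.time() - start

# With an open breaker: the error comes back immediately, freeing the thread.
start = time.time()
try:
    fail_fast()
except ConnectionError:
    pass
fast_elapsed = time.time() - start

print(f"blocking: {blocking_elapsed:.3f}s, fail-fast: {fast_elapsed:.3f}s")
```

Both paths still fail, but the fail-fast path returns in microseconds instead of holding a thread for the full timeout. That is the whole cascade-prevention trick.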
Simple Implementation
```python
import time
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60, success_threshold=2):
        self.failure_threshold = failure_threshold    # Failures before opening
        self.timeout = timeout                        # Seconds before trying again
        self.success_threshold = success_threshold    # Successes to close from half-open
        self.failure_count = 0
        self.success_count = 0
        self.state = State.CLOSED
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == State.OPEN:
            # Check if timeout has passed — try half-open
            if time.time() - self.last_failure_time > self.timeout:
                self.state = State.HALF_OPEN
                self.success_count = 0
            else:
                raise Exception("Circuit breaker is OPEN — service unavailable")
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        if self.state == State.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = State.CLOSED    # Recovery complete
                self.failure_count = 0
        elif self.state == State.CLOSED:
            self.failure_count = 0    # Reset on success

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = State.OPEN    # Trip the breaker

# Usage
payment_breaker = CircuitBreaker(failure_threshold=5, timeout=30)

def process_payment(order_id, amount):
    return payment_breaker.call(payment_service.charge, order_id, amount)
```

Fallback Behavior
When a circuit is open, you need a fallback. Good options:
Return a cached result: For read operations, return the last known value.
Return a graceful degradation: Instead of failing completely, offer reduced functionality.
```python
def get_product_recommendations(user_id):
    try:
        return recommendation_breaker.call(recommendation_service.get, user_id)
    except Exception:
        # Circuit is open — return generic popular items
        return cache.get("popular_products") or []
```

Queue for later: For write operations, queue the action and process it when the service recovers.
Netflix's Hystrix popularized this pattern. They found that graceful degradation (showing less personalized content instead of an error) was far better than complete failure.
Production Libraries
Don't implement circuit breakers from scratch in production. Use battle-tested libraries:
Python:
```python
# pybreaker
from pybreaker import CircuitBreaker

db_breaker = CircuitBreaker(fail_max=5, reset_timeout=60)

@db_breaker
def fetch_user_data(user_id):
    return db.query("SELECT * FROM users WHERE id = ?", user_id)
```

Go:
```go
// sony/gobreaker
cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name:        "payment-service",
    MaxRequests: 3,
    Interval:    60 * time.Second,
    Timeout:     30 * time.Second,
})

result, err := cb.Execute(func() (interface{}, error) {
    return paymentService.Charge(orderID, amount)
})
```

Infrastructure level: Istio, Envoy, and AWS App Mesh support circuit breaking at the service mesh level — no code changes needed.
Circuit Breaker vs Retry
These work together, not instead of each other.
Retry: Good for transient errors (network hiccup, momentary overload). Retry 3 times with exponential backoff.
Circuit breaker: Good for sustained failures. Don't retry when the service is clearly down — fail fast.
The combination: retry with backoff for brief failures, circuit breaker to stop retrying when things are seriously broken.
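One way to sketch that combination: retry with exponential backoff for transient errors, but bail out immediately when the breaker reports an open circuit. `CircuitOpenError` is an illustrative name here; the simple implementation above raises a plain `Exception`, while a library like pybreaker raises its own `CircuitBreakerError` you would catch instead:

```python
import random
import time

class CircuitOpenError(Exception):
    """Raised by the breaker when the circuit is open (illustrative name)."""

def call_with_retry(breaker, func, *args, max_attempts=3, base_delay=0.1, **kwargs):
    for attempt in range(max_attempts):
        try:
            return breaker.call(func, *args, **kwargs)
        except CircuitOpenError:
            # Sustained failure: the service is known to be down, so fail fast.
            raise
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Transient failure: exponential backoff with a little jitter.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

The jitter keeps a fleet of clients from retrying in lockstep and hammering a recovering service all at once.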
Key Takeaways
- One slow service can cascade and take down your whole system
- Circuit breakers fail fast when a service is degraded, protecting everything else
- Three states: Closed (normal) → Open (failing fast) → Half-open (testing recovery)
- Always define a fallback behavior for when the circuit is open
- Use a library (pybreaker, gobreaker, Hystrix) in production — don't reinvent this
- Combine with retry and exponential backoff for full resilience
The circuit breaker pattern is one of the most important resilience patterns in distributed systems.