
Rate Limiting: Protect Your API From Being Overwhelmed

Learn how to implement rate limiting with Redis. Covers fixed window, sliding window, and token bucket algorithms with Python examples.

By Akash Sharma · 5 min read
#rate limiting
#redis
#api
#backend
#security
#python
#system design

One user sends 10,000 requests per minute to your API. Your server slows to a crawl. Legitimate users get timeouts. Your costs spike.

Rate limiting stops this. It caps how many requests a client can make in a given time period.

Why Rate Limiting Matters

Beyond protection from abuse, rate limiting helps with:

  • Preventing DDoS: A flood of requests can't bring down your service
  • Fair usage: One heavy user can't degrade the experience for everyone
  • Cost control: Limits on expensive operations (like AI inference calls) prevent runaway bills
  • Business rules: APIs often have usage tiers — free users get 100 calls/day, paid users get 10,000

Rate Limiting Algorithms

Fixed Window Counter

Divide time into fixed windows (e.g., each minute). Count requests in the current window. If count exceeds limit, reject.

plaintext
Window: 12:00:00 - 12:00:59 → 95 requests → OK
Window: 12:01:00 - 12:01:59 → 120 requests → 20 requests rejected

Simple to implement, but has an edge case: a client can double the limit by sending 100 requests at 12:00:59 and 100 more at 12:01:00 — that's 200 requests in 2 seconds, but both windows show 100.
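The boundary problem is easy to reproduce with a minimal in-memory counter (an illustration only; a Redis version of the same windowing idea appears later in this article):

```python
class FixedWindowCounter:
    """Minimal in-memory fixed window counter, to illustrate the boundary problem."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # window id -> request count

    def is_allowed(self, now):
        window_id = int(now // self.window)
        self.counts[window_id] = self.counts.get(window_id, 0) + 1
        return self.counts[window_id] <= self.limit

limiter = FixedWindowCounter(limit=100, window_seconds=60)

# 100 requests at t=59s all land in window 0 and pass...
end_burst = [limiter.is_allowed(59.0) for _ in range(100)]
# ...and 100 more at t=60s land in window 1 and also pass:
# 200 requests accepted in about one second, double the intended limit.
start_burst = [limiter.is_allowed(60.0) for _ in range(100)]
```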

Sliding Window Log

Track the timestamp of every request. On each new request, count requests in the last N seconds.

More accurate, but stores a timestamp for every request — expensive for high-volume APIs.
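As a sketch, the log can be an in-memory deque of timestamps; a production version would typically keep the log in a Redis sorted set so it is shared across servers (names here are illustrative):

```python
import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()  # one entry per accepted request

    def is_allowed(self, now=None):
        now = time.time() if now is None else now
        # Evict timestamps that have fallen out of the sliding window
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```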

Sliding Window Counter

A practical middle ground. Weight the previous window count based on how far into the current window you are.

plaintext
Previous window: 80 requests
Current window: 40 requests (30 seconds in, out of 60)
 
Estimated rate = (80 × 0.5) + 40 = 80 requests in this window

This is what most production systems use.
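The weighted estimate from the example above reduces to a small pure function (a sketch; a production version would keep the two window counters in Redis):

```python
def sliding_window_estimate(prev_count, curr_count, window_seconds, elapsed):
    # Fraction of the previous window still covered by the sliding window
    weight = (window_seconds - elapsed) / window_seconds
    return prev_count * weight + curr_count

def is_allowed(prev_count, curr_count, window_seconds, elapsed, limit):
    return sliding_window_estimate(prev_count, curr_count, window_seconds, elapsed) < limit

# The example above: 80 in the previous window, 40 so far, 30s into a 60s window
# estimate = 80 * 0.5 + 40 = 80
```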

Token Bucket (Most Flexible)

Imagine a bucket that fills with tokens at a steady rate. Each request uses one token. If the bucket is empty, the request is rejected.

  • Bucket capacity: 100 tokens (max burst)
  • Refill rate: 10 tokens/second
  • A client can burst to 100 requests, then is limited to 10/second

python
import redis
import time
 
class TokenBucketRateLimiter:
    def __init__(self, redis_client, capacity: int, refill_rate: float):
        self.r = redis_client
        self.capacity = capacity        # Max tokens (burst limit)
        self.refill_rate = refill_rate  # Tokens added per second
    
    def is_allowed(self, client_id: str) -> bool:
        now = time.time()
        key = f"rate_limit:{client_id}"
        
        # Atomic Lua script to prevent race conditions
        script = """
        local tokens = tonumber(redis.call('GET', KEYS[1]) or ARGV[1])
        local last_refill = tonumber(redis.call('GET', KEYS[2]) or ARGV[2])
        local capacity = tonumber(ARGV[1])
        local refill_rate = tonumber(ARGV[3])
        local now = tonumber(ARGV[2])
        
        -- Add tokens based on elapsed time
        local elapsed = now - last_refill
        tokens = math.min(capacity, tokens + (elapsed * refill_rate))
        
        if tokens >= 1 then
            tokens = tokens - 1
            redis.call('SET', KEYS[1], tokens, 'EX', 3600)
            redis.call('SET', KEYS[2], now, 'EX', 3600)
            return 1  -- allowed
        else
            return 0  -- rejected
        end
        """
        
        result = self.r.eval(
            script, 2,
            f"{key}:tokens", f"{key}:last_refill",
            self.capacity, now, self.refill_rate
        )
        return bool(result)

Simple Redis Implementation (Fixed Window)

For most APIs, a simple fixed window counter in Redis is enough:

python
import redis
import time
 
r = redis.Redis(host='localhost', port=6379)
 
def is_rate_limited(client_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    key = f"rate_limit:{client_id}:{int(time.time() / window_seconds)}"
    
    current = r.incr(key)
    
    if current == 1:
        # First request in this window — set expiry
        r.expire(key, window_seconds)
    
    return current > limit
 
# Usage in FastAPI
from fastapi import FastAPI, Request, HTTPException

app = FastAPI()
 
@app.get("/api/data")
async def get_data(request: Request):
    client_ip = request.client.host
    if is_rate_limited(client_ip, limit=100, window_seconds=60):
        raise HTTPException(status_code=429, detail="Too many requests")
    return {"data": "..."}

Returning Useful Rate Limit Headers

Good APIs tell clients their rate limit status:

python
from fastapi.responses import JSONResponse

@app.get("/api/data")
async def get_data(request: Request):
    client_ip = request.client.host
    key = f"rate_limit:{client_ip}:{int(time.time() / 60)}"
    
    current = r.incr(key)
    if current == 1:
        r.expire(key, 60)
    
    remaining = max(0, 100 - current)
    
    if current > 100:
        raise HTTPException(
            status_code=429,
            detail="Rate limit exceeded",
            headers={
                "X-RateLimit-Limit": "100",
                "X-RateLimit-Remaining": "0",
                "X-RateLimit-Reset": str(int(time.time() / 60 + 1) * 60),
                "Retry-After": "60",
            }
        )
    
    return JSONResponse(
        {"data": "..."},
        headers={
            "X-RateLimit-Limit": "100",
            "X-RateLimit-Remaining": str(remaining),
        }
    )

Where to Apply Rate Limits

  • Per IP address: Good for anonymous APIs. Easy to implement.
  • Per API key / user: Better for authenticated APIs. Allows different limits per tier.
  • Per endpoint: Expensive operations (image generation, AI calls) need tighter limits.
  • Globally: Protect against total traffic overload regardless of individual client behavior.

In production, you often combine these:

  • Global limit: 10,000 req/sec total
  • Per IP: 100 req/min
  • Authenticated free tier: 1,000 req/day
  • Authenticated paid tier: 100,000 req/day
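Combining tiers is just running the checks in order, cheapest and broadest first. A sketch with an in-memory stand-in for the Redis counter shown earlier (function and key names are illustrative):

```python
import time
from collections import defaultdict

counters = defaultdict(int)

def is_rate_limited(key, limit, window_seconds, now=None):
    # In-memory stand-in for the Redis INCR counter shown earlier;
    # a real version would also expire old windows.
    now = time.time() if now is None else now
    window_key = (key, int(now // window_seconds))
    counters[window_key] += 1
    return counters[window_key] > limit

def allow_request(client_ip, tier=None, now=None):
    """Apply global, per-IP, and per-tier limits in order; any failure rejects."""
    if is_rate_limited("global", limit=10_000, window_seconds=1, now=now):
        return False
    if is_rate_limited(f"ip:{client_ip}", limit=100, window_seconds=60, now=now):
        return False
    if tier is not None:
        daily_limit = 100_000 if tier == "paid" else 1_000
        if is_rate_limited(f"tier:{tier}:{client_ip}", limit=daily_limit,
                           window_seconds=86_400, now=now):
            return False
    return True
```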

Rate Limiting at Infrastructure Level

For serious production systems, put rate limiting in the infrastructure layer, not just in application code:

  • Nginx: limit_req_zone module for per-IP limiting
  • API Gateway: AWS API Gateway, Kong, or Traefik all support rate limiting natively
  • CDN level: Cloudflare's rate limiting rules before requests even reach your server

Infrastructure-level rate limiting is more efficient (rejects requests before they consume server resources) and works across all your services.
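For example, the Nginx module mentioned above can be configured roughly like this (a sketch; the zone name, rate, and burst values are placeholders to tune for your traffic):

```nginx
# Define a zone: 10 MB of shared memory, keyed by client IP, 10 req/sec
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

server {
    location /api/ {
        # Allow short bursts of 20 extra requests, reject the rest with 429
        limit_req zone=api_limit burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://backend;
    }
}
```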

Key Takeaways

  • Rate limiting protects your API from abuse, accidental overload, and runaway costs
  • Fixed window is the simplest algorithm — good enough for most use cases
  • Token bucket allows burst traffic while enforcing an average rate
  • Redis is the standard tool for distributed rate limiting
  • Always return X-RateLimit-* headers so clients know their status
  • Apply rate limits per client AND globally for full protection

Rate limiting is one of those things that seems unnecessary right up until the moment you need it.

Related reading: REST API Design Best Practices · Redis Caching Explained
