
Rate Limiting: Protect Your API From Being Overwhelmed

Learn how to implement rate limiting with Redis. Covers fixed window, sliding window, and token bucket algorithms with Python examples.

By Akash Sharma · 5 min read
#rate limiting
#redis
#api
#backend
#security
#python
#system design

One user sends 10,000 requests per minute to your API. Your server slows to a crawl. Legitimate users get timeouts. Your costs spike.

Rate limiting stops this. It caps how many requests a client can make in a given time period.

Why Rate Limiting Matters

Beyond protection from abuse, rate limiting helps with:

  • Preventing DDoS: A flood of requests can't bring down your service
  • Fair usage: One heavy user can't degrade the experience for everyone
  • Cost control: Limits on expensive operations (like AI inference calls) prevent runaway bills
  • Business rules: APIs often have usage tiers — free users get 100 calls/day, paid users get 10,000

Rate Limiting Algorithms

Fixed Window Counter

Divide time into fixed windows (e.g., each minute). Count requests in the current window. If count exceeds limit, reject.

plaintext
Window: 12:00:00 - 12:00:59 → 95 requests → OK
Window: 12:01:00 - 12:01:59 → 120 requests → 20 requests rejected

Simple to implement, but has an edge case: a client can double the limit by sending 100 requests at 12:00:59 and 100 more at 12:01:00 — that's 200 requests in 2 seconds, but both windows show 100.
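The boundary problem is easy to reproduce with a minimal in-memory counter (an illustration only; a Redis version of the same windowing idea appears later in this article):

```python
class FixedWindowCounter:
    """Minimal in-memory fixed window counter, to illustrate the boundary problem."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # window id -> request count

    def is_allowed(self, now):
        window_id = int(now // self.window)
        self.counts[window_id] = self.counts.get(window_id, 0) + 1
        return self.counts[window_id] <= self.limit

limiter = FixedWindowCounter(limit=100, window_seconds=60)

# 100 requests at t=59s all land in window 0 and pass...
end_burst = [limiter.is_allowed(59.0) for _ in range(100)]
# ...and 100 more at t=60s land in window 1 and also pass:
# 200 requests accepted in about one second, double the intended limit.
start_burst = [limiter.is_allowed(60.0) for _ in range(100)]
```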

Sliding Window Log

Track the timestamp of every request. On each new request, count requests in the last N seconds.

More accurate, but stores a timestamp for every request — expensive for high-volume APIs.
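As a sketch, the log can be an in-memory deque of timestamps; a production version would typically keep the log in a Redis sorted set so it is shared across servers (names here are illustrative):

```python
import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()  # one entry per accepted request

    def is_allowed(self, now=None):
        now = time.time() if now is None else now
        # Evict timestamps that have fallen out of the sliding window
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```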

Sliding Window Counter

A practical middle ground. Weight the previous window count based on how far into the current window you are.

plaintext
Previous window: 80 requests
Current window: 40 requests (30 seconds in, out of 60)
 
Estimated rate = (80 × 0.5) + 40 = 80 requests in this window

This is what most production systems use.
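The weighted estimate from the example above reduces to a small pure function (a sketch; a production version would keep the two window counters in Redis):

```python
def sliding_window_estimate(prev_count, curr_count, window_seconds, elapsed):
    # Fraction of the previous window still covered by the sliding window
    weight = (window_seconds - elapsed) / window_seconds
    return prev_count * weight + curr_count

def is_allowed(prev_count, curr_count, window_seconds, elapsed, limit):
    return sliding_window_estimate(prev_count, curr_count, window_seconds, elapsed) < limit

# The example above: 80 in the previous window, 40 so far, 30s into a 60s window
# estimate = 80 * 0.5 + 40 = 80
```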

Token Bucket (Most Flexible)

Imagine a bucket that fills with tokens at a steady rate. Each request uses one token. If the bucket is empty, the request is rejected.

  • Bucket capacity: 100 tokens (max burst)
  • Refill rate: 10 tokens/second
  • A client can burst to 100 requests, then is limited to 10/second

python
import redis
import time
 
class TokenBucketRateLimiter:
    def __init__(self, redis_client, capacity: int, refill_rate: float):
        self.r = redis_client
        self.capacity = capacity        # Max tokens (burst limit)
        self.refill_rate = refill_rate  # Tokens added per second
    
    def is_allowed(self, client_id: str) -> bool:
        now = time.time()
        key = f"rate_limit:{client_id}"
        
        # Atomic Lua script to prevent race conditions
        script = """
        local tokens = tonumber(redis.call('GET', KEYS[1]) or ARGV[1])
        local last_refill = tonumber(redis.call('GET', KEYS[2]) or ARGV[2])
        local capacity = tonumber(ARGV[1])
        local refill_rate = tonumber(ARGV[3])
        local now = tonumber(ARGV[2])
        
        -- Add tokens based on elapsed time
        local elapsed = now - last_refill
        tokens = math.min(capacity, tokens + (elapsed * refill_rate))
        
        if tokens >= 1 then
            tokens = tokens - 1
            redis.call('SET', KEYS[1], tokens, 'EX', 3600)
            redis.call('SET', KEYS[2], now, 'EX', 3600)
            return 1  -- allowed
        else
            return 0  -- rejected
        end
        """
        
        result = self.r.eval(
            script, 2,
            f"{key}:tokens", f"{key}:last_refill",
            self.capacity, now, self.refill_rate
        )
        return bool(result)

Simple Redis Implementation (Fixed Window)

For most APIs, a simple fixed window counter in Redis is enough:

python
import redis
import time
 
r = redis.Redis(host='localhost', port=6379)
 
def is_rate_limited(client_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    key = f"rate_limit:{client_id}:{int(time.time() / window_seconds)}"
    
    current = r.incr(key)
    
    if current == 1:
        # First request in this window — set expiry
        r.expire(key, window_seconds)
    
    return current > limit
 
# Usage in FastAPI
from fastapi import FastAPI, Request, HTTPException

app = FastAPI()
 
@app.get("/api/data")
async def get_data(request: Request):
    client_ip = request.client.host
    if is_rate_limited(client_ip, limit=100, window_seconds=60):
        raise HTTPException(status_code=429, detail="Too many requests")
    return {"data": "..."}

Returning Useful Rate Limit Headers

Good APIs tell clients their rate limit status:

python
from fastapi.responses import JSONResponse

@app.get("/api/data")
async def get_data(request: Request):
    client_ip = request.client.host
    key = f"rate_limit:{client_ip}:{int(time.time() / 60)}"
    
    current = r.incr(key)
    if current == 1:
        r.expire(key, 60)
    
    remaining = max(0, 100 - current)
    
    if current > 100:
        raise HTTPException(
            status_code=429,
            detail="Rate limit exceeded",
            headers={
                "X-RateLimit-Limit": "100",
                "X-RateLimit-Remaining": "0",
                "X-RateLimit-Reset": str(int(time.time() / 60 + 1) * 60),
                "Retry-After": "60",
            }
        )
    
    return JSONResponse(
        {"data": "..."},
        headers={
            "X-RateLimit-Limit": "100",
            "X-RateLimit-Remaining": str(remaining),
        }
    )

Where to Apply Rate Limits

  • Per IP address: Good for anonymous APIs. Easy to implement.
  • Per API key / user: Better for authenticated APIs. Allows different limits per tier.
  • Per endpoint: Expensive operations (image generation, AI calls) need tighter limits.
  • Globally: Protect against total traffic overload regardless of individual client behavior.

In production, you often combine these:

  • Global limit: 10,000 req/sec total
  • Per IP: 100 req/min
  • Authenticated free tier: 1,000 req/day
  • Authenticated paid tier: 100,000 req/day
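Combining tiers is just running the checks in order, cheapest and broadest first. A sketch with an in-memory stand-in for the Redis counter shown earlier (function and key names are illustrative):

```python
import time
from collections import defaultdict

counters = defaultdict(int)

def is_rate_limited(key, limit, window_seconds, now=None):
    # In-memory stand-in for the Redis INCR counter shown earlier;
    # a real version would also expire old windows.
    now = time.time() if now is None else now
    window_key = (key, int(now // window_seconds))
    counters[window_key] += 1
    return counters[window_key] > limit

def allow_request(client_ip, tier=None, now=None):
    """Apply global, per-IP, and per-tier limits in order; any failure rejects."""
    if is_rate_limited("global", limit=10_000, window_seconds=1, now=now):
        return False
    if is_rate_limited(f"ip:{client_ip}", limit=100, window_seconds=60, now=now):
        return False
    if tier is not None:
        daily_limit = 100_000 if tier == "paid" else 1_000
        if is_rate_limited(f"tier:{tier}:{client_ip}", limit=daily_limit,
                           window_seconds=86_400, now=now):
            return False
    return True
```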

Rate Limiting at Infrastructure Level

For serious production systems, put rate limiting in the infrastructure layer, not just in application code:

  • Nginx: limit_req_zone module for per-IP limiting
  • API Gateway: AWS API Gateway, Kong, or Traefik all support rate limiting natively
  • CDN level: Cloudflare's rate limiting rules before requests even reach your server

Infrastructure-level rate limiting is more efficient (rejects requests before they consume server resources) and works across all your services.
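For example, the Nginx module mentioned above can be configured roughly like this (a sketch; the zone name, rate, and burst values are placeholders to tune for your traffic):

```nginx
# Define a zone: 10 MB of shared memory, keyed by client IP, 10 req/sec
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

server {
    location /api/ {
        # Allow short bursts of 20 extra requests, reject the rest with 429
        limit_req zone=api_limit burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://backend;
    }
}
```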

Key Takeaways

  • Rate limiting protects your API from abuse, accidental overload, and runaway costs
  • Fixed window is the simplest algorithm — good enough for most use cases
  • Token bucket allows burst traffic while enforcing an average rate
  • Redis is the standard tool for distributed rate limiting
  • Always return X-RateLimit-* headers so clients know their status
  • Apply rate limits per client AND globally for full protection

Rate limiting is one of those things that seems unnecessary right up until the moment you need it.

Related reading: REST API Design Best Practices · Redis Caching Explained
