
Distributed Tracing: Debug Requests Across Services

Learn how distributed tracing works and how to implement it. Covers trace IDs, spans, OpenTelemetry, Jaeger, and how to find performance bottlenecks in microservices.

By Akash Sharma · 6 min read
#distributed tracing
#observability
#opentelemetry
#jaeger
#microservices
#system design
#backend

A user reports checkout is slow. You check your services — order service looks fine, payment service looks fine, inventory service looks fine. But the checkout takes 4 seconds.

Where is the time going?

In a monolith, you'd attach a profiler and see the call stack. In microservices, a single request touches 5 services. Logs are scattered. Metrics show aggregate data, not individual requests.

Distributed tracing gives you the call stack back.

What Is Distributed Tracing?

Distributed tracing follows a single request as it flows through multiple services. Every service records how long it took to handle its part. You see the full journey in one view.

plaintext
User request: checkout
 
Order Service (200ms)
 ├── Inventory Service (50ms)           ← called by Order
 ├── Payment Service (3500ms) ← SLOW    ← called by Order
 │    └── Fraud Detection (3200ms) ← VERY SLOW ← called by Payment
 └── Notification Service (100ms)       ← called by Order
 
Total: 4 seconds. Problem: Fraud Detection.

Without tracing, you'd see "checkout is slow" and guess. With tracing, you see exactly where 80% of the time went.

Traces and Spans

Trace: The entire journey of one request. Has a unique trace ID.

Span: One unit of work within a trace (one service call, one DB query, one HTTP request). Has a span ID and records start time, end time, service name, operation name.

Parent span / child span: Spans are nested. The order service span is parent to the inventory service span.

plaintext
Trace ID: abc-123
 
Span: order-service/checkout        [0ms ————————————— 4000ms]
  Span: inventory-service/reserve   [10ms — 60ms]
  Span: payment-service/charge      [70ms ————————— 3570ms]
    Span: fraud-service/check       [80ms ———————— 3280ms]
  Span: notification-service/send   [3600ms — 3700ms]

This waterfall view immediately shows where time is spent.
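
To make the terms concrete, here's a minimal sketch (not the actual OTel data model) of the span records behind that waterfall. Each span carries the shared trace ID, its own span ID, and its parent's span ID, which is all a backend needs to rebuild the tree and attribute time:

python
from dataclasses import dataclass

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: str | None   # None marks the root span
    name: str
    start_ms: int
    end_ms: int

spans = [
    Span("abc-123", "s1", None, "order-service/checkout",      0, 4000),
    Span("abc-123", "s2", "s1", "inventory-service/reserve",  10,   60),
    Span("abc-123", "s3", "s1", "payment-service/charge",     70, 3570),
    Span("abc-123", "s4", "s3", "fraud-service/check",        80, 3280),
    Span("abc-123", "s5", "s1", "notification-service/send", 3600, 3700),
]

def self_time_ms(span: Span) -> int:
    """Span duration minus its children's durations: the work it did itself."""
    children = [c for c in spans if c.parent_id == span.span_id]
    return (span.end_ms - span.start_ms) - sum(c.end_ms - c.start_ms for c in children)

for s in spans:
    print(f"{s.name:30} self={self_time_ms(s):>5} ms")
# fraud-service/check dominates with 3200 ms of its own time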

Propagating Context

For tracing to work, each service must pass the trace ID to the next service. This is called context propagation.

When the order service calls the payment service, it adds trace headers:

plaintext
HTTP request:
  POST /charge
  traceparent: 00-abc123-def456-01
               │  │      │      │
               │  │      │      └─ flags
               │  │      └─ span-id
               │  └─ trace-id
               └─ version

The payment service reads traceparent, creates a child span under the same trace, and passes it along when calling fraud detection.

Without propagation, each service would start a new trace — you'd lose the connection between spans.
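
Auto-instrumentation (covered below) normally does this for you, but a minimal manual sketch with OTel's propagation API makes the mechanics visible. The service URL and function names here are illustrative:

python
import httpx
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

# Caller side: start a span, then copy its context into the outgoing headers
def call_payment_service(order_id: str) -> httpx.Response:
    with tracer.start_as_current_span("charge-payment"):
        headers: dict[str, str] = {}
        inject(headers)  # writes the traceparent header for the current span
        return httpx.post(
            "http://payment-service/charge",
            json={"order_id": order_id},
            headers=headers,
        )

# Callee side (payment service): rebuild the context from the incoming headers
def handle_charge(request_headers: dict[str, str]) -> None:
    ctx = extract(request_headers)
    with tracer.start_as_current_span("payment-service/charge", context=ctx):
        ...  # this span joins the same trace as a child of the caller's span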

OpenTelemetry: The Standard

OpenTelemetry (OTel) is the open standard for tracing (and metrics and logs). Most languages have OTel SDKs. You instrument once and send to any backend (Jaeger, Zipkin, Datadog, Honeycomb).

python
# Install: pip install opentelemetry-sdk opentelemetry-exporter-jaeger
# (Newer OTel releases deprecate this Jaeger exporter in favor of OTLP, which
#  Jaeger ingests natively; the setup below works on versions that still ship it.)
 
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
 
# Setup (do this once at startup)
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))
trace.set_tracer_provider(provider)
 
tracer = trace.get_tracer("order-service")
 
# Create spans manually
def checkout(order_id: str):
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("user.id", get_user_id())
        
        with tracer.start_as_current_span("reserve-inventory"):
            inventory_result = reserve_inventory(order_id)
        
        with tracer.start_as_current_span("charge-payment"):
            payment_result = charge_payment(order_id)
        
        span.set_attribute("payment.id", payment_result.id)
        return {"status": "success"}

If you use a framework like FastAPI, auto-instrumentation can create most of these spans for you:

python
# FastAPI auto-instrumentation — zero manual spans needed
# Install: pip install opentelemetry-instrumentation-fastapi \
#   opentelemetry-instrumentation-httpx opentelemetry-instrumentation-sqlalchemy
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
 
app = FastAPI()
 
# These auto-create spans for every request, HTTP call, and DB query
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()

Auto-instrumentation adds spans for HTTP requests, database queries, and cache calls without touching your business logic code.

Jaeger: Visualizing Traces

Jaeger is a popular open-source tracing backend. You send spans to it, and it gives you a UI to search and visualize traces.

yaml
# docker-compose.yml — run Jaeger locally
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # UI
      - "6831:6831/udp"  # Receive spans
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411

Open http://localhost:16686 to see traces. Search by service, operation name, trace ID, or duration.

The trace view shows:

  • Full waterfall of spans
  • Duration of each span
  • Tags (attributes) you set
  • Errors marked in red

What to Tag on Spans

Attributes (tags) make traces searchable. Tag the things you'll want to filter on:

python
with tracer.start_as_current_span("charge-payment") as span:
    span.set_attribute("payment.amount", amount)
    span.set_attribute("payment.currency", "USD")
    span.set_attribute("user.id", user_id)
    span.set_attribute("order.id", order_id)
    
    try:
        result = payment_service.charge(amount)
        span.set_attribute("payment.id", result.id)
        span.set_attribute("payment.status", "success")
    except PaymentError as e:
        span.set_attribute("payment.status", "failed")
        span.set_attribute("error.message", str(e))
        span.record_exception(e)
        span.set_status(trace.Status(trace.StatusCode.ERROR))
        raise

Now you can search Jaeger for payment.status=failed and see every failed payment trace with full context.
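
The UI is the usual way to run that search, but Jaeger's query service also answers plain HTTP, which is handy for scripts. The API is internal and unversioned, so treat the endpoint and parameter names in this sketch as assumptions to check against your Jaeger version:

python
# Hedged sketch: ask Jaeger for recent payment-service traces tagged
# payment.status=failed (query API is internal; names may vary by version).
import httpx

resp = httpx.get(
    "http://localhost:16686/api/traces",
    params={
        "service": "payment-service",
        "tags": '{"payment.status": "failed"}',  # JSON-encoded tag filter
        "lookback": "1h",
        "limit": 20,
    },
)
resp.raise_for_status()
for t in resp.json()["data"]:
    root = min(t["spans"], key=lambda s: s["startTime"])
    print(t["traceID"], root["operationName"], f'{root["duration"] / 1000:.0f} ms')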

Sampling: Not Every Request Needs Tracing

At high traffic, tracing every request is expensive. Sampling lets you trace only a percentage.

python
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased, ParentBased
 
# Trace 10% of requests
sampler = ParentBased(root=TraceIdRatioBased(0.1))
 
provider = TracerProvider(sampler=sampler)

Common strategies:

  • 1-10% random sampling: Catch a representative sample of traffic
  • Always sample errors: Even at 1% sampling, always trace error requests
  • Always sample slow requests: Trace requests that take > 1 second

python
# Caveat: with head-based sampling, the keep/drop decision is made when the
# root span starts, so the SDK can't retroactively export a slow request that
# was already dropped. "Always sample errors" and "always sample slow requests"
# are usually done with tail-based sampling in the OpenTelemetry Collector
# (its tail_sampling processor buffers whole traces and keeps the interesting
# ones). What you can do in the SDK is tag kept traces so they're easy to find:
with tracer.start_as_current_span("request") as span:
    duration_ms = handle_request()  # hypothetical handler returning latency in ms
    if duration_ms > 1000:
        span.set_attribute("sampled.reason", "slow-request")
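
If you know when the root span is created that a request must be traced (an internal test user, a debug header your middleware has already turned into a span attribute), a custom sampler can force the decision. A sketch, not the article's setup: the debug.force_trace attribute name is made up, and only attributes passed at span creation are visible to a sampler.

python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
    Decision,
    ParentBased,
    Sampler,
    SamplingResult,
    TraceIdRatioBased,
)

class ForceTraceSampler(Sampler):
    """Always sample spans created with debug.force_trace=True; otherwise
    fall back to ratio-based sampling."""

    def __init__(self, ratio: float):
        self._fallback = TraceIdRatioBased(ratio)

    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        if attributes and attributes.get("debug.force_trace"):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return self._fallback.should_sample(
            parent_context, trace_id, name, kind, attributes, links, trace_state
        )

    def get_description(self) -> str:
        return "ForceTraceSampler"

# Apply our rule at the trace root, respect the caller's decision for children
provider = TracerProvider(sampler=ParentBased(root=ForceTraceSampler(0.1)))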

Go Example

go
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
)
 
var tracer = otel.Tracer("payment-service")
 
func ChargePayment(ctx context.Context, orderID string, amount int) error {
    ctx, span := tracer.Start(ctx, "charge-payment")
    defer span.End()
 
    span.SetAttributes(
        attribute.String("order.id", orderID),
        attribute.Int("payment.amount", amount),
    )
 
    // Pass ctx to downstream calls — context carries the trace
    result, err := fraudService.Check(ctx, orderID)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return err
    }
 
    span.SetAttributes(attribute.String("fraud.status", result.Status))
    return nil
}

The ctx argument carries the trace context. When fraudService.Check starts its own span from that ctx, it automatically becomes a child of this span.

Tracing vs Logging vs Metrics

These three are complementary, not alternatives:

Metrics: Aggregate numbers over time. "Payment failures increased to 2% at 14:00." Fast to query, no per-request detail.

Logs: Text output from services. "Payment failed for order-123." Good detail but hard to correlate across services.

Traces: Follow one request across all services. "This specific checkout took 4s because fraud check took 3.2s." Essential for multi-service debugging.

The ideal setup: metrics alert you ("something is slow"), traces help you find it ("here's which service and which span"), logs give you the details ("here's the exact error").
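
One practical way to wire the three together is to stamp the current trace ID onto every log line, so an alert leads you to a trace and the trace leads you to the matching logs. A minimal sketch using the OTel API and the standard logging module; the format and field name are just one choice:

python
import logging

from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Attach the active trace ID to every record passing through the handler."""
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = f"{ctx.trace_id:032x}" if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s"
))
logging.basicConfig(level=logging.INFO, handlers=[handler])

logger = logging.getLogger("order-service")
# Inside a span, logger.error("payment failed") now prints the same trace ID
# you would paste into Jaeger's trace search.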

Key Takeaways

  • Distributed tracing follows a single request through multiple services — you see the full timeline
  • Traces consist of spans; spans are nested (parent/child) to show call hierarchy
  • Context propagation passes trace IDs between services via HTTP headers
  • OpenTelemetry is the standard SDK — instrument once, send to any backend
  • Jaeger (open source) or Datadog/Honeycomb (commercial) visualize traces
  • Sample 1-10% of traffic — always sample errors and slow requests
  • Tag spans with IDs and status so you can search for specific failed/slow requests

Distributed tracing turns "something is slow somewhere" into "this specific service, this specific operation, at this exact time." Set it up before you need it.

Related reading: Circuit Breaker Pattern · Message Queues Explained · API Gateway Pattern
