
Distributed Tracing: Debug Requests Across Services

Learn how distributed tracing works and how to implement it. Covers trace IDs, spans, OpenTelemetry, Jaeger, and how to find performance bottlenecks in microservices.

By Akash Sharma · 6 min read
#distributed tracing
#observability
#opentelemetry
#jaeger
#microservices
#system design
#backend

A user reports checkout is slow. You check your services — order service looks fine, payment service looks fine, inventory service looks fine. But the checkout takes 4 seconds.

Where is the time going?

In a monolith, you'd attach a profiler and see the call stack. In microservices, a single request touches 5 services. Logs are scattered. Metrics show aggregate data, not individual requests.

Distributed tracing gives you the call stack back.

What Is Distributed Tracing?

Distributed tracing follows a single request as it flows through multiple services. Every service records how long it took to handle its part. You see the full journey in one view.

plaintext
User request: checkout
 
Order Service (200ms)
 ├── Inventory Service (50ms)           ← called by Order
 ├── Payment Service (3500ms) ← SLOW    ← called by Order
 │    └── Fraud Detection (3200ms) ← VERY SLOW ← called by Payment
 └── Notification Service (100ms)       ← called by Order
 
Total: 4 seconds. Problem: Fraud Detection.

Without tracing, you'd see "checkout is slow" and guess. With tracing, you see exactly where 80% of the time went.

Traces and Spans

Trace: The entire journey of one request. Has a unique trace ID.

Span: One unit of work within a trace (one service call, one DB query, one HTTP request). Has a span ID and records start time, end time, service name, operation name.

Parent span / child span: Spans are nested. The order service span is parent to the inventory service span.

plaintext
Trace ID: abc-123
 
Span: order-service/checkout        [0ms ————————————— 4000ms]
  Span: inventory-service/reserve   [10ms — 60ms]
  Span: payment-service/charge      [70ms ————————— 3570ms]
    Span: fraud-service/check       [80ms ———————— 3280ms]
  Span: notification-service/send   [3600ms — 3700ms]

This waterfall view immediately shows where time is spent.
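
To make the terms concrete, here's a minimal sketch (not the actual OTel data model) of the span records behind that waterfall. Each span carries the shared trace ID, its own span ID, and its parent's span ID, which is all a backend needs to rebuild the tree and attribute time:

python
from dataclasses import dataclass

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: str | None   # None marks the root span
    name: str
    start_ms: int
    end_ms: int

spans = [
    Span("abc-123", "s1", None, "order-service/checkout",      0, 4000),
    Span("abc-123", "s2", "s1", "inventory-service/reserve",  10,   60),
    Span("abc-123", "s3", "s1", "payment-service/charge",     70, 3570),
    Span("abc-123", "s4", "s3", "fraud-service/check",        80, 3280),
    Span("abc-123", "s5", "s1", "notification-service/send", 3600, 3700),
]

def self_time_ms(span: Span) -> int:
    """Span duration minus its children's durations: the work it did itself."""
    children = [c for c in spans if c.parent_id == span.span_id]
    return (span.end_ms - span.start_ms) - sum(c.end_ms - c.start_ms for c in children)

for s in spans:
    print(f"{s.name:30} self={self_time_ms(s):>5} ms")
# fraud-service/check dominates with 3200 ms of its own time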

Propagating Context

For tracing to work, each service must pass the trace ID to the next service. This is called context propagation.

When the order service calls the payment service, it adds trace headers:

plaintext
HTTP request:
  POST /charge
  traceparent: 00-abc123-def456-01
               │  │      │      │
               │  │      │      └─ flags
               │  │      └─ span-id
               │  └─ trace-id
               └─ version

The payment service reads traceparent, creates a child span under the same trace, and passes it along when calling fraud detection.

Without propagation, each service would start a new trace — you'd lose the connection between spans.
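
Auto-instrumentation (covered below) normally does this for you, but a minimal manual sketch with OTel's propagation API makes the mechanics visible. The service URL and function names here are illustrative:

python
import httpx
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

# Caller side: start a span, then copy its context into the outgoing headers
def call_payment_service(order_id: str) -> httpx.Response:
    with tracer.start_as_current_span("charge-payment"):
        headers: dict[str, str] = {}
        inject(headers)  # writes the traceparent header for the current span
        return httpx.post(
            "http://payment-service/charge",
            json={"order_id": order_id},
            headers=headers,
        )

# Callee side (payment service): rebuild the context from the incoming headers
def handle_charge(request_headers: dict[str, str]) -> None:
    ctx = extract(request_headers)
    with tracer.start_as_current_span("payment-service/charge", context=ctx):
        ...  # this span joins the same trace as a child of the caller's span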

OpenTelemetry: The Standard

OpenTelemetry (OTel) is the open standard for tracing (and metrics and logs). Most languages have OTel SDKs. You instrument once and send to any backend (Jaeger, Zipkin, Datadog, Honeycomb).

python
# Install: pip install opentelemetry-sdk opentelemetry-exporter-jaeger
# (Newer OTel releases deprecate this Jaeger exporter in favor of OTLP, which
#  Jaeger ingests natively; the setup below works on versions that still ship it.)
 
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
 
# Setup (do this once at startup)
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))
trace.set_tracer_provider(provider)
 
tracer = trace.get_tracer("order-service")
 
# Create spans manually
def checkout(order_id: str):
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("user.id", get_user_id())
        
        with tracer.start_as_current_span("reserve-inventory"):
            inventory_result = reserve_inventory(order_id)
        
        with tracer.start_as_current_span("charge-payment"):
            payment_result = charge_payment(order_id)
        
        span.set_attribute("payment.id", payment_result.id)
        return {"status": "success"}

If you use a framework like FastAPI, auto-instrumentation can create most of these spans for you:

python
# FastAPI auto-instrumentation — zero manual spans needed
# Install: pip install opentelemetry-instrumentation-fastapi \
#   opentelemetry-instrumentation-httpx opentelemetry-instrumentation-sqlalchemy
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
 
app = FastAPI()
 
# These auto-create spans for every request, HTTP call, and DB query
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()

Auto-instrumentation adds spans for HTTP requests, database queries, and cache calls without touching your business logic code.

Jaeger: Visualizing Traces

Jaeger is a popular open-source tracing backend. You send spans to it, and it gives you a UI to search and visualize traces.

yaml
# docker-compose.yml — run Jaeger locally
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # UI
      - "6831:6831/udp"  # Receive spans
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411

Open http://localhost:16686 to see traces. Search by service, operation name, trace ID, or duration.

The trace view shows:

  • Full waterfall of spans
  • Duration of each span
  • Tags (attributes) you set
  • Errors marked in red

What to Tag on Spans

Attributes (tags) make traces searchable. Tag the things you'll want to filter on:

python
with tracer.start_as_current_span("charge-payment") as span:
    span.set_attribute("payment.amount", amount)
    span.set_attribute("payment.currency", "USD")
    span.set_attribute("user.id", user_id)
    span.set_attribute("order.id", order_id)
    
    try:
        result = payment_service.charge(amount)
        span.set_attribute("payment.id", result.id)
        span.set_attribute("payment.status", "success")
    except PaymentError as e:
        span.set_attribute("payment.status", "failed")
        span.set_attribute("error.message", str(e))
        span.record_exception(e)
        span.set_status(trace.Status(trace.StatusCode.ERROR))
        raise

Now you can search Jaeger for payment.status=failed and see every failed payment trace with full context.
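
The UI is the usual way to run that search, but Jaeger's query service also answers plain HTTP, which is handy for scripts. The API is internal and unversioned, so treat the endpoint and parameter names in this sketch as assumptions to check against your Jaeger version:

python
# Hedged sketch: ask Jaeger for recent payment-service traces tagged
# payment.status=failed (query API is internal; names may vary by version).
import httpx

resp = httpx.get(
    "http://localhost:16686/api/traces",
    params={
        "service": "payment-service",
        "tags": '{"payment.status": "failed"}',  # JSON-encoded tag filter
        "lookback": "1h",
        "limit": 20,
    },
)
resp.raise_for_status()
for t in resp.json()["data"]:
    root = min(t["spans"], key=lambda s: s["startTime"])
    print(t["traceID"], root["operationName"], f'{root["duration"] / 1000:.0f} ms')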

Sampling: Not Every Request Needs Tracing

At high traffic, tracing every request is expensive. Sampling lets you trace only a percentage.

python
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased, ParentBased
 
# Trace 10% of requests
sampler = ParentBased(root=TraceIdRatioBased(0.1))
 
provider = TracerProvider(sampler=sampler)

Common strategies:

  • 1-10% random sampling: Catch a representative sample of traffic
  • Always sample errors: Even at 1% sampling, always trace error requests
  • Always sample slow requests: Trace requests that take > 1 second

python
# Caveat: with head-based sampling, the keep/drop decision is made when the
# root span starts, so the SDK can't retroactively export a slow request that
# was already dropped. "Always sample errors" and "always sample slow requests"
# are usually done with tail-based sampling in the OpenTelemetry Collector
# (its tail_sampling processor buffers whole traces and keeps the interesting
# ones). What you can do in the SDK is tag kept traces so they're easy to find:
with tracer.start_as_current_span("request") as span:
    duration_ms = handle_request()  # hypothetical handler returning latency in ms
    if duration_ms > 1000:
        span.set_attribute("sampled.reason", "slow-request")
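
If you know when the root span is created that a request must be traced (an internal test user, a debug header your middleware has already turned into a span attribute), a custom sampler can force the decision. A sketch, not the article's setup: the debug.force_trace attribute name is made up, and only attributes passed at span creation are visible to a sampler.

python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
    Decision,
    ParentBased,
    Sampler,
    SamplingResult,
    TraceIdRatioBased,
)

class ForceTraceSampler(Sampler):
    """Always sample spans created with debug.force_trace=True; otherwise
    fall back to ratio-based sampling."""

    def __init__(self, ratio: float):
        self._fallback = TraceIdRatioBased(ratio)

    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        if attributes and attributes.get("debug.force_trace"):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return self._fallback.should_sample(
            parent_context, trace_id, name, kind, attributes, links, trace_state
        )

    def get_description(self) -> str:
        return "ForceTraceSampler"

# Apply our rule at the trace root, respect the caller's decision for children
provider = TracerProvider(sampler=ParentBased(root=ForceTraceSampler(0.1)))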

Go Example

go
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
)
 
var tracer = otel.Tracer("payment-service")
 
func ChargePayment(ctx context.Context, orderID string, amount int) error {
    ctx, span := tracer.Start(ctx, "charge-payment")
    defer span.End()
 
    span.SetAttributes(
        attribute.String("order.id", orderID),
        attribute.Int("payment.amount", amount),
    )
 
    // Pass ctx to downstream calls — context carries the trace
    result, err := fraudService.Check(ctx, orderID)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return err
    }
 
    span.SetAttributes(attribute.String("fraud.status", result.Status))
    return nil
}

The ctx argument carries the trace context. When fraudService.Check starts its own span from that ctx, it automatically becomes a child of this span.

Tracing vs Logging vs Metrics

These three are complementary, not alternatives:

Metrics: Aggregate numbers over time. "Payment failures increased to 2% at 14:00." Fast to query, no per-request detail.

Logs: Text output from services. "Payment failed for order-123." Good detail but hard to correlate across services.

Traces: Follow one request across all services. "This specific checkout took 4s because fraud check took 3.2s." Essential for multi-service debugging.

The ideal setup: metrics alert you ("something is slow"), traces help you find it ("here's which service and which span"), logs give you the details ("here's the exact error").
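
One practical way to wire the three together is to stamp the current trace ID onto every log line, so an alert leads you to a trace and the trace leads you to the matching logs. A minimal sketch using the OTel API and the standard logging module; the format and field name are just one choice:

python
import logging

from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Attach the active trace ID to every record passing through the handler."""
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = f"{ctx.trace_id:032x}" if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s"
))
logging.basicConfig(level=logging.INFO, handlers=[handler])

logger = logging.getLogger("order-service")
# Inside a span, logger.error("payment failed") now prints the same trace ID
# you would paste into Jaeger's trace search.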

Key Takeaways

  • Distributed tracing follows a single request through multiple services — you see the full timeline
  • Traces consist of spans; spans are nested (parent/child) to show call hierarchy
  • Context propagation passes trace IDs between services via HTTP headers
  • OpenTelemetry is the standard SDK — instrument once, send to any backend
  • Jaeger (open source) or Datadog/Honeycomb (commercial) visualize traces
  • Sample 1-10% of traffic — always sample errors and slow requests
  • Tag spans with IDs and status so you can search for specific failed/slow requests

Distributed tracing turns "something is slow somewhere" into "this specific service, this specific operation, at this exact time." Set it up before you need it.

Related reading: Circuit Breaker Pattern · Message Queues Explained · API Gateway Pattern
