Distributed Tracing: Debug Requests Across Services
Learn how distributed tracing works and how to implement it. Covers trace IDs, spans, OpenTelemetry, Jaeger, and how to find performance bottlenecks in microservices.
A user reports checkout is slow. You check your services — order service looks fine, payment service looks fine, inventory service looks fine. But the checkout takes 4 seconds.
Where is the time going?
In a monolith, you'd add a profiler and see the call stack. In microservices, a single request touches 5 services. Logs are scattered. Metrics show aggregate data, not individual requests.
Distributed tracing gives you the call stack back.
What Is Distributed Tracing?
Distributed tracing follows a single request as it flows through multiple services. Every service records how long it took to handle its part. You see the full journey in one view.
User request: checkout
Order Service (200ms)
└── Inventory Service (50ms) ← called by Order
└── Payment Service (3500ms) ← SLOW ← called by Order
└── Fraud Detection (3200ms) ← VERY SLOW ← called by Payment
└── Notification Service (100ms) ← called by Order
Total: 4 seconds. Problem: Fraud Detection.

Without tracing, you'd see "checkout is slow" and guess. With tracing, you see exactly where 80% of the time went.
Traces and Spans
Trace: The entire journey of one request. Has a unique trace ID.
Span: One unit of work within a trace (one service call, one DB query, one HTTP request). Has a span ID and records start time, end time, service name, operation name.
Parent span / child span: Spans are nested. The order service span is parent to the inventory service span.
Trace ID: abc-123
Span: order-service/checkout        [0ms ————————————— 4000ms]
  Span: inventory-service/reserve   [10ms — 60ms]
  Span: payment-service/charge      [70ms ————————— 3570ms]
    Span: fraud-service/check       [80ms ———————— 3280ms]
  Span: notification-service/send   [3600ms — 3700ms]

This waterfall view immediately shows where time is spent.
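The trace/span model above can be sketched as a minimal data structure. This is illustrative only (real SDKs track much more, and the class here is made up for the sketch):

```python
from dataclasses import dataclass, field
from typing import Optional
import secrets
import time

@dataclass
class Span:
    """One unit of work: ties an operation to a trace and a parent span."""
    trace_id: str                    # shared by every span in the trace
    name: str                        # e.g. "payment-service/charge"
    parent_id: Optional[str] = None  # None for the root span
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None

    def finish(self) -> float:
        """Close the span and return its duration in milliseconds."""
        self.end = time.monotonic()
        return (self.end - self.start) * 1000

# The root span mints a trace ID; children inherit it and point at their parent
root = Span(trace_id=secrets.token_hex(16), name="order-service/checkout")
child = Span(trace_id=root.trace_id, name="payment-service/charge",
             parent_id=root.span_id)
```

A backend reconstructs the waterfall purely from these fields: same trace_id groups the spans, parent_id gives the nesting, start/end give the bars.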
Propagating Context
For tracing to work, each service must pass the trace ID to the next service. This is called context propagation.
When the order service calls the payment service, it adds trace headers:
HTTP request:

POST /charge
traceparent: 00-abc123-def456-01
             │  │      │      └─ flags
             │  │      └─ parent span ID
             │  └─ trace ID
             └─ version

(IDs abbreviated here; a real trace ID is 32 hex characters and a span ID is 16.) The payment service reads traceparent, creates a child span under the same trace, and passes the header along when calling fraud detection.
Without propagation, each service would start a new trace — you'd lose the connection between spans.
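The W3C traceparent header is simple enough to handle by hand. A sketch without any SDK (the helper names are made up for illustration):

```python
import secrets

def parse_traceparent(header: str) -> dict:
    """Split 'version-traceid-spanid-flags' into its four fields."""
    version, trace_id, span_id, flags = header.split("-")
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "flags": flags}

def child_traceparent(header: str) -> str:
    """Keep the trace ID, swap in this service's new span ID."""
    parts = parse_traceparent(header)
    new_span_id = secrets.token_hex(8)  # 8 random bytes, hex-encoded
    return f"{parts['version']}-{parts['trace_id']}-{new_span_id}-{parts['flags']}"

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
# outgoing has the same trace ID but a new span ID, so the
# downstream service's spans join the same trace
```

In practice the SDK's propagator does this for you (and auto-instrumented HTTP clients inject the header automatically), but the mechanism is exactly this: copy the trace ID forward, mint a new span ID.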
OpenTelemetry: The Standard
OpenTelemetry (OTel) is the open standard for tracing (and metrics and logs). Most languages have OTel SDKs. You instrument once and send to any backend (Jaeger, Zipkin, Datadog, Honeycomb).
# Install: pip install opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup (do this once at startup).
# Modern Jaeger accepts OTLP directly; the old Jaeger thrift exporter is deprecated.
exporter = OTLPSpanExporter(endpoint="jaeger:4317", insecure=True)

provider = TracerProvider(
    resource=Resource.create({"service.name": "order-service"})
)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")
# Create spans manually
def checkout(order_id: str):
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("user.id", get_user_id())

        with tracer.start_as_current_span("reserve-inventory"):
            inventory_result = reserve_inventory(order_id)

        with tracer.start_as_current_span("charge-payment"):
            payment_result = charge_payment(order_id)
            span.set_attribute("payment.id", payment_result.id)

        return {"status": "success"}

# FastAPI auto-instrumentation — zero manual spans needed
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

app = FastAPI()

# These auto-create spans for every incoming request, outbound HTTP call, and DB query
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()

Auto-instrumentation adds spans for incoming requests, outbound HTTP calls, and database queries without touching your business logic.
Jaeger: Visualizing Traces
Jaeger is a popular open-source tracing backend. You send spans to it, and it gives you a UI to search and visualize traces.
# docker-compose.yml — run Jaeger locally
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"     # UI
      - "4317:4317"       # OTLP gRPC — receive spans
      - "6831:6831/udp"   # Legacy thrift agent port
    environment:
      - COLLECTOR_OTLP_ENABLED=true
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411

Open http://localhost:16686 to see traces. Search by service, operation name, trace ID, or duration.
The trace view shows:
- Full waterfall of spans
- Duration of each span
- Tags (attributes) you set
- Errors marked in red
What to Tag on Spans
Attributes (tags) make traces searchable. Tag the things you'll want to filter on:
with tracer.start_as_current_span("charge-payment") as span:
    span.set_attribute("payment.amount", amount)
    span.set_attribute("payment.currency", "USD")
    span.set_attribute("user.id", user_id)
    span.set_attribute("order.id", order_id)

    try:
        result = payment_service.charge(amount)
        span.set_attribute("payment.id", result.id)
        span.set_attribute("payment.status", "success")
    except PaymentError as e:
        span.set_attribute("payment.status", "failed")
        span.set_attribute("error.message", str(e))
        span.record_exception(e)
        span.set_status(trace.Status(trace.StatusCode.ERROR))
        raise

Now you can search Jaeger for payment.status=failed and see every failed payment trace with full context.
Sampling: Not Every Request Needs Tracing
At high traffic, tracing every request is expensive. Sampling lets you trace only a percentage.
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased, ParentBased

# Trace 10% of requests; child spans follow the parent's decision
sampler = ParentBased(root=TraceIdRatioBased(0.1))
provider = TracerProvider(sampler=sampler)

Common strategies:
- 1-10% random sampling: Catch a representative sample of traffic
- Always sample errors: Even at 1% sampling, always trace error requests
- Always sample slow requests: Trace requests that take > 1 second
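Ratio sampling is deterministic on the trace ID, which is why it needs no coordination: every service computes the same keep/drop answer for the same trace. Roughly how TraceIdRatioBased decides (a simplified sketch of the real algorithm):

```python
MAX_64 = (1 << 64) - 1  # the decision uses the low 64 bits of the trace ID

def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep the trace iff its ID falls in the bottom `ratio` of the ID space."""
    lower64 = int(trace_id_hex, 16) & MAX_64
    return lower64 < ratio * MAX_64

# Same trace ID in means the same answer in every service:
assert should_sample("0" * 32, 0.1)       # ID 0 is always in the bottom slice
# an ID near the top of the range is dropped at a 10% ratio
```

Because trace IDs are random, keeping the bottom 10% of the ID space keeps roughly 10% of traffic, and a trace is either fully sampled everywhere or not at all.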
Head-based sampling decides at the start of a request, before you know whether it will be slow or fail. Keeping errors and slow requests regardless of the random rate (tail-based sampling) has to happen after the trace completes, which is why it lives in the OpenTelemetry Collector rather than the SDK, via the tail_sampling processor:

# OTel Collector config — keep all errors and anything slower than 1s,
# plus a 10% baseline of everything else
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 1000}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

Go Example
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
)

var tracer = otel.Tracer("payment-service")

func ChargePayment(ctx context.Context, orderID string, amount int) error {
    ctx, span := tracer.Start(ctx, "charge-payment")
    defer span.End()

    span.SetAttributes(
        attribute.String("order.id", orderID),
        attribute.Int("payment.amount", amount),
    )

    // Pass ctx to downstream calls — context carries the trace
    result, err := fraudService.Check(ctx, orderID)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return err
    }

    span.SetAttributes(attribute.String("fraud.status", result.Status))
    return nil
}

The ctx carries the trace ID. When fraudService.Check creates its own span, it automatically becomes a child of this span.
Tracing vs Logging vs Metrics
These three are complementary, not alternatives:
Metrics: Aggregate numbers over time. "Payment failures increased to 2% at 14:00." Fast to query, no per-request detail.
Logs: Text output from services. "Payment failed for order-123." Good detail but hard to correlate across services.
Traces: Follow one request across all services. "This specific checkout took 4s because fraud check took 3.2s." Essential for multi-service debugging.
The ideal setup: metrics alert you ("something is slow"), traces help you find it ("here's which service and which span"), logs give you the details ("here's the exact error").
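The jump from trace to logs works best when every log line carries the current trace ID. A minimal sketch using a contextvar and Python's logging module (real setups usually read the ID from the active OTel span instead of setting it by hand):

```python
import contextvars
import logging

# Set once per request, e.g. from the incoming traceparent header
current_trace_id = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every log record."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s"))
logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.addFilter(TraceIdFilter())
logger.setLevel(logging.INFO)

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("payment failed for order-123")
# every line now carries trace=<id>, so you can jump from a log
# line straight to the full trace in Jaeger (and back)
```

With this in place the workflow is: alert fires, find a slow trace, copy its trace ID, and filter the logs of every service on that ID at once.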
Key Takeaways
- Distributed tracing follows a single request through multiple services — you see the full timeline
- Traces consist of spans; spans are nested (parent/child) to show call hierarchy
- Context propagation passes trace IDs between services via HTTP headers
- OpenTelemetry is the standard SDK — instrument once, send to any backend
- Jaeger (open source) or Datadog/Honeycomb (commercial) visualize traces
- Sample 1-10% of traffic — always sample errors and slow requests
- Tag spans with IDs and status so you can search for specific failed/slow requests
Distributed tracing turns "something is slow somewhere" into "this specific service, this specific operation, at this exact time." Set it up before you need it.
Related reading: Circuit Breaker Pattern · Message Queues Explained · API Gateway Pattern