Cloud-Native Observability Stack Part 2 - Distributed Tracing Across Microservices
Series Introduction
- Part 1: OpenTelemetry Instrumentation
- Part 2: Distributed Tracing Across Microservices (Current)
- Part 3: Structured Logging with Correlation IDs
- Part 4: Metrics and Alerting with Prometheus/Grafana
- Part 5: Debugging Production Issues with Observability Data
What is Distributed Tracing?
Distributed tracing records the entire path of a request as it passes through multiple services. Each unit of work along the way is a span, and all spans that belong to one request form a trace.
User Request
       │
       ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ API Gateway │────▶│Order Service│────▶│ Payment Svc │
│   Span A    │     │   Span B    │     │   Span C    │
└─────────────┘     └──────┬──────┘     └─────────────┘
                           │
                           ▼
                    ┌─────────────┐
                    │Inventory Svc│
                    │   Span D    │
                    └─────────────┘
Trace Context Structure
W3C Trace Context Standard
traceparent: 00-{trace-id}-{span-id}-{trace-flags}
tracestate: vendor1=value1,vendor2=value2
Example:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
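- version: 2-character hex; 00 is the current version
- trace-id: 32-character hex (16 bytes) identifying the entire trace
- span-id: 16-character hex (8 bytes) identifying the calling span; the W3C specification names this field parent-id
- trace-flags: 2-character hex bit field; 01 means the trace is sampled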
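To make the field layout concrete, here is a minimal sketch that splits a traceparent header into its four parts using only the Kotlin standard library. The `TraceParent` class and the `parseTraceParent` function are hypothetical names used for illustration; they are not part of OpenTelemetry, and the sketch only handles the version-00 shape shown above.

```kotlin
// Hypothetical holder for the four traceparent fields; not an OpenTelemetry type.
data class TraceParent(
    val version: String,
    val traceId: String,
    val parentId: String,
    val sampled: Boolean,
)

// version(2 hex)-trace-id(32 hex)-parent-id(16 hex)-trace-flags(2 hex), lowercase only.
private val TRACEPARENT_PATTERN =
    Regex("([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")

/** Returns the parsed header, or null when the header is malformed. */
fun parseTraceParent(header: String): TraceParent? {
    val match = TRACEPARENT_PATTERN.matchEntire(header.trim()) ?: return null
    val (version, traceId, parentId, flags) = match.destructured

    // Version ff is forbidden, and all-zero IDs are defined as invalid.
    if (version == "ff") return null
    if (traceId.all { it == '0' } || parentId.all { it == '0' }) return null

    // Bit 0 of trace-flags is the "sampled" flag.
    val sampled = (flags.toInt(16) and 0x01) == 0x01
    return TraceParent(version, traceId, parentId, sampled)
}

fun main() {
    val parsed = parseTraceParent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
    println(parsed)
}
```

In the services themselves this parsing is done by the propagator, so a helper like this is mainly useful for debugging headers captured from logs or a proxy.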
Practical Distributed Tracing Implementation
Multi-Service Architecture
# docker-compose.yml
version: '3.8'
services:
  api-gateway:
    build: ./api-gateway
    ports:
      - "8080:8080"
    environment:
      - OTEL_SERVICE_NAME=api-gateway
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
  order-service:
    build: ./order-service
    ports:
      - "8081:8081"
    environment:
      - OTEL_SERVICE_NAME=order-service
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
  payment-service:
    build: ./payment-service
    ports:
      - "8082:8082"
    environment:
      - OTEL_SERVICE_NAME=payment-service
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
  inventory-service:
    build: ./inventory-service
    ports:
      - "8083:8083"
    environment:
      - OTEL_SERVICE_NAME=inventory-service
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
  jaeger:
    image: jaegertracing/all-in-one:1.53
    ports:
      - "16686:16686"
      - "4317:4317"
    environment:
      - COLLECTOR_OTLP_ENABLED=true
API Gateway
@RestController
@RequestMapping("/api")
class GatewayController(
    private val orderServiceClient: OrderServiceClient,
    private val tracer: Tracer
) {
    @PostMapping("/orders")
    fun createOrder(@RequestBody request: CreateOrderRequest): ResponseEntity<OrderResponse> {
        val span = tracer.spanBuilder("gateway.createOrder")
            .setSpanKind(SpanKind.SERVER)
            .setAttribute("http.method", "POST")
            .setAttribute("http.route", "/api/orders")
            .startSpan()
        return try {
            span.makeCurrent().use {
                val order = orderServiceClient.createOrder(request)
                span.setAttribute("order.id", order.id)
                ResponseEntity.created(URI.create("/api/orders/${order.id}")).body(order)
            }
        } catch (e: Exception) {
            span.recordException(e)
            span.setStatus(StatusCode.ERROR)
            throw e
        } finally {
            span.end()
        }
    }
}
Order Service Client (Context Propagation)
@Component
class OrderServiceClient(
    private val webClient: WebClient,
    private val openTelemetry: OpenTelemetry
) {
    fun createOrder(request: CreateOrderRequest): OrderResponse {
        return webClient.post()
            .uri("/orders")
            .bodyValue(request)
            .headers { headers ->
                // Inject Trace Context
                openTelemetry.propagators.textMapPropagator.inject(
                    Context.current(),
                    headers
                ) { carrier, key, value ->
                    carrier?.set(key, value)
                }
            }
            .retrieve()
            .bodyToMono(OrderResponse::class.java)
            .block()!!
    }
}
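The injection above, and the matching extraction on the receiving side, can be tried in isolation. The following is a minimal sketch that assumes the `io.opentelemetry:opentelemetry-api` artifact is on the classpath and uses a plain map in place of HTTP headers. The names `mapSetter`, `mapGetter`, and `roundTrip` are hypothetical, and the IDs are the example values from the Trace Context section.

```kotlin
import io.opentelemetry.api.trace.Span
import io.opentelemetry.api.trace.SpanContext
import io.opentelemetry.api.trace.TraceFlags
import io.opentelemetry.api.trace.TraceState
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator
import io.opentelemetry.context.Context
import io.opentelemetry.context.propagation.TextMapGetter
import io.opentelemetry.context.propagation.TextMapSetter

// Writes propagation fields into a mutable map that stands in for HTTP headers.
val mapSetter = TextMapSetter<MutableMap<String, String>> { carrier, key, value ->
    carrier?.put(key, value)
}

// Reads propagation fields back out of the map on the receiving side.
val mapGetter = object : TextMapGetter<Map<String, String>> {
    override fun keys(carrier: Map<String, String>): Iterable<String> = carrier.keys
    override fun get(carrier: Map<String, String>?, key: String): String? = carrier?.get(key)
}

/** Injects a known span context into a map, then extracts it again. */
fun roundTrip(): Pair<Map<String, String>, SpanContext> {
    val propagator = W3CTraceContextPropagator.getInstance()

    // Caller side: a context holding the span whose IDs should travel downstream.
    val callerSpanContext = SpanContext.create(
        "0af7651916cd43dd8448eb211c80319c",
        "b7ad6b7169203331",
        TraceFlags.getSampled(),
        TraceState.getDefault(),
    )
    val callerContext = Context.root().with(Span.wrap(callerSpanContext))

    val headers = mutableMapOf<String, String>()
    propagator.inject(callerContext, headers, mapSetter)

    // Receiver side: rebuild a context from the headers alone.
    val extracted = propagator.extract(Context.root(), headers, mapGetter)
    return headers to Span.fromContext(extracted).spanContext
}

fun main() {
    val (headers, remoteParent) = roundTrip()
    println(headers["traceparent"])
    println("traceId=${remoteParent.traceId} remote=${remoteParent.isRemote}")
}
```

The extracted span context carries the same trace ID as the caller and is flagged as remote, which is what lets the receiving service start its own span as a child of the caller's span.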
Order Service
@RestController
@RequestMapping("/orders")
class OrderController(
    private val orderService: OrderService,
    private val tracer: Tracer,
    private val openTelemetry: OpenTelemetry
) {
    @PostMapping
    fun createOrder(
        @RequestBody request: CreateOrderRequest,
        @RequestHeader headers: HttpHeaders
    ): ResponseEntity<OrderResponse> {
        // Extract parent Context
        val parentContext = openTelemetry.propagators.textMapPropagator.extract(
            Context.current(),
            headers
        ) { carrier, key -> carrier?.getFirst(key) }
        val span = tracer.spanBuilder("order.create")
            .setParent(parentContext)
            .setSpanKind(SpanKind.SERVER)
            .startSpan()
        return try {
            span.makeCurrent().use {
                val order = orderService.createOrder(request)
                ResponseEntity.ok(OrderResponse(order))
            }
        } catch (e: Exception) {
            // Mark the span as failed, as the gateway does
            span.recordException(e)
            span.setStatus(StatusCode.ERROR)
            throw e
        } finally {
            span.end()
        }
    }
}
@Service
class OrderService(
    private val orderRepository: OrderRepository,
    private val paymentClient: PaymentClient,
    private val inventoryClient: InventoryClient,
    private val tracer: Tracer
) {
    @Transactional
    fun createOrder(request: CreateOrderRequest): Order {
        // Check inventory
        val inventorySpan = tracer.spanBuilder("order.checkInventory")
            .setSpanKind(SpanKind.CLIENT)
            .startSpan()
        try {
            inventorySpan.makeCurrent().use {
                inventoryClient.checkAndReserve(request.items)
            }
        } finally {
            inventorySpan.end()
        }
        // Save order
        val saveSpan = tracer.spanBuilder("order.save")
            .setAttribute("db.system", "postgresql")
            .startSpan()
        val order = try {
            saveSpan.makeCurrent().use {
                orderRepository.save(Order.create(request))
            }
        } finally {
            saveSpan.end()
        }
        // Process payment
        val paymentSpan = tracer.spanBuilder("order.processPayment")
            .setSpanKind(SpanKind.CLIENT)
            .startSpan()
        try {
            paymentSpan.makeCurrent().use {
                paymentClient.charge(order.customerId, order.totalAmount)
            }
        } finally {
            paymentSpan.end()
        }
        return order
    }
}
Span Hierarchy
Parent-Child Relationships
Trace: abc123
│
├── Span A: gateway.createOrder (Root Span)
│   │
│   └── Span B: order.create (Child of A)
│       │
│       ├── Span C: order.checkInventory (Child of B)
│       │   │
│       │   └── Span E: inventory.reserve (Child of C)
│       │
│       ├── Span D: order.save (Child of B)
│       │
│       └── Span F: order.processPayment (Child of B)
│           │
│           └── Span G: payment.charge (Child of F)
Span Links (Parallel Processing)
@Service
class BatchOrderProcessor(
    private val tracer: Tracer
) {
    fun processBatch(orders: List<Order>) {
        val batchSpan = tracer.spanBuilder("batch.process")
            .startSpan()
        try {
            batchSpan.makeCurrent().use {
                orders.parallelStream().forEach { order ->
                    // The current context is thread-local, so worker threads do not
                    // inherit batchSpan. setNoParent() makes every order span a root
                    // that is related to the batch only through the link.
                    val orderSpan = tracer.spanBuilder("batch.processOrder")
                        .setNoParent()
                        .addLink(batchSpan.spanContext) // Connect with link
                        .setAttribute("order.id", order.id)
                        .startSpan()
                    try {
                        orderSpan.makeCurrent().use {
                            processOrder(order)
                        }
                    } finally {
                        orderSpan.end()
                    }
                }
            }
        } finally {
            batchSpan.end()
        }
    }
}
Sampling Strategies
Head-based Sampling
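With head-based sampling, the decision is made when the request starts, before its outcome is known. Wrapping the ratio sampler in parentBased makes downstream services follow the caller's decision, so a trace is kept or dropped as a whole: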
@Configuration
class SamplingConfig {
    @Bean
    fun sdkTracerProvider(): SdkTracerProvider {
        return SdkTracerProvider.builder()
            .setSampler(
                Sampler.parentBased(
                    Sampler.traceIdRatioBased(0.1) // 10% sampling
                )
            )
            .build()
    }
}
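A ratio-based sampler can decide without coordination because the decision is a pure function of the trace ID. The following is a simplified model of that idea in plain Kotlin, not the SDK's actual code: it assumes the decision compares the lower 64 bits of the trace ID against a bound derived from the ratio. `shouldSample` is a hypothetical name.

```kotlin
import kotlin.math.abs

/**
 * Simplified model of a trace-ID ratio decision: the same trace ID always
 * gives the same answer, so every service reaches the same verdict.
 */
fun shouldSample(traceId: String, ratio: Double): Boolean {
    require(traceId.length == 32) { "trace-id must be 32 hex characters" }
    require(ratio in 0.0..1.0) { "ratio must be between 0.0 and 1.0" }
    if (ratio == 0.0) return false
    if (ratio == 1.0) return true

    // Scale the ratio onto the range of non-negative Long values.
    val upperBound = (ratio * Long.MAX_VALUE).toLong()

    // Interpret the last 16 hex characters (lower 64 bits) as a signed Long.
    val lowBits = traceId.substring(16).toULong(16).toLong()

    return abs(lowBits) < upperBound
}

fun main() {
    val traceId = "0af7651916cd43dd8448eb211c80319c"
    println("sampled at 10%: ${shouldSample(traceId, 0.1)}")
    println("sampled at 99%: ${shouldSample(traceId, 0.99)}")
}
```

Because trace IDs are random, roughly the configured fraction of traces falls under the bound, while any single trace is consistently either in or out.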
Tail-based Sampling (OTel Collector)
With tail-based sampling, the decision is made after the trace has completed, so it can take errors and latency into account:
# otel-collector-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100
    expected_new_traces_per_sec: 10
    policies:
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/jaeger]
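The effect of the three policies above can be sketched as a single function over a finished trace: keep it if it contains an error, if it ran longer than the threshold, or if it falls into the random 10%. This is an illustrative model in plain Kotlin, not the Collector's implementation; `FinishedSpan`, `FinishedTrace`, and `keepTrace` are hypothetical names, and the hash-based bucket is an assumption standing in for the Collector's own probabilistic logic.

```kotlin
// Hypothetical, simplified view of a completed trace as the collector might buffer it.
data class FinishedSpan(val startMs: Long, val endMs: Long, val isError: Boolean = false)
data class FinishedTrace(val traceId: String, val spans: List<FinishedSpan>)

/** Returns true when at least one of the three policies wants to keep the trace. */
fun keepTrace(
    trace: FinishedTrace,
    thresholdMs: Long = 1000,
    samplingPercentage: Int = 10,
): Boolean {
    if (trace.spans.isEmpty()) return false

    // errors-policy: any span with ERROR status keeps the whole trace.
    val hasError = trace.spans.any { it.isError }

    // slow-traces-policy: duration from the earliest start to the latest end.
    val durationMs = trace.spans.maxOf { it.endMs } - trace.spans.minOf { it.startMs }
    val isSlow = durationMs > thresholdMs

    // probabilistic-policy: a stable bucket in 0..99 derived from the trace ID.
    val bucket = trace.traceId.hashCode().mod(100)
    val pickedByChance = bucket < samplingPercentage

    return hasError || isSlow || pickedByChance
}

fun main() {
    val failed = FinishedTrace("trace-1", listOf(FinishedSpan(0, 50, isError = true)))
    val slow = FinishedTrace("trace-2", listOf(FinishedSpan(0, 1500)))
    println("failed trace kept: ${keepTrace(failed, samplingPercentage = 0)}")
    println("slow trace kept: ${keepTrace(slow, samplingPercentage = 0)}")
}
```

The sketch also shows why tail-based sampling needs buffering: none of the three checks can run until all spans of the trace have arrived, which is what the decision_wait setting accounts for.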
Using Jaeger UI
Trace Search
service=order-service operation=order.create minDuration=100ms
Service Dependency Graph
You can visualize service dependencies through the System Architecture tab in Jaeger UI.
Performance Analysis
- Critical Path analysis
- Time comparison between Spans
- Bottleneck identification
Span Attributes Best Practices
Semantic Conventions
// HTTP related
span.setAttribute(SemanticAttributes.HTTP_METHOD, "POST")
span.setAttribute(SemanticAttributes.HTTP_ROUTE, "/api/orders") // HTTP_URL is reserved for the full URL
span.setAttribute(SemanticAttributes.HTTP_STATUS_CODE, 200)
// Database related
span.setAttribute(SemanticAttributes.DB_SYSTEM, "postgresql")
span.setAttribute(SemanticAttributes.DB_OPERATION, "SELECT")
span.setAttribute(SemanticAttributes.DB_STATEMENT, "SELECT * FROM orders WHERE id = ?")
// Messaging related
span.setAttribute(SemanticAttributes.MESSAGING_SYSTEM, "kafka")
span.setAttribute(SemanticAttributes.MESSAGING_DESTINATION, "order-events")
span.setAttribute(SemanticAttributes.MESSAGING_OPERATION, "publish")
Custom Attributes
// Business context
span.setAttribute("order.id", orderId)
span.setAttribute("customer.tier", "premium")
span.setAttribute("order.item_count", items.size.toLong())
span.setAttribute("order.total_amount", totalAmount.toDouble())
Error Tracking
try {
    processOrder(order)
} catch (e: PaymentException) {
    span.setStatus(StatusCode.ERROR, "Payment processing failed")
    span.recordException(e, Attributes.builder()
        .put("exception.escaped", false)
        .put("payment.error_code", e.errorCode)
        .build()
    )
    throw e
}
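The start, make-current, record-error, and end steps repeat for every manual span in this post, so they can be folded into one helper. The sketch below assumes the `io.opentelemetry:opentelemetry-api` artifact is on the classpath; `withSpan` is a hypothetical extension function, not part of the OpenTelemetry API, and the demo uses the no-op tracer so it runs without an SDK.

```kotlin
import io.opentelemetry.api.OpenTelemetry
import io.opentelemetry.api.trace.SpanKind
import io.opentelemetry.api.trace.StatusCode
import io.opentelemetry.api.trace.Tracer

/**
 * Hypothetical helper: runs [block] inside a new span, marks the span as
 * failed if the block throws, and always ends the span.
 */
inline fun <T> Tracer.withSpan(
    name: String,
    kind: SpanKind = SpanKind.INTERNAL,
    block: () -> T,
): T {
    val span = spanBuilder(name).setSpanKind(kind).startSpan()
    return try {
        // makeCurrent() returns a Scope; use {} closes it when the block finishes.
        span.makeCurrent().use { block() }
    } catch (e: Throwable) {
        span.setStatus(StatusCode.ERROR, e.message ?: e.javaClass.simpleName)
        span.recordException(e)
        throw e
    } finally {
        span.end()
    }
}

fun main() {
    // The no-op tracer records nothing, which is enough to exercise the control flow.
    val tracer = OpenTelemetry.noop().getTracer("demo")

    val total = tracer.withSpan("order.calculateTotal") { 2 * 21 }
    println("total = $total")

    try {
        tracer.withSpan("order.processPayment", SpanKind.CLIENT) {
            error("Payment processing failed")
        }
    } catch (e: IllegalStateException) {
        println("rethrown: ${e.message}")
    }
}
```

With a helper like this, each of the try/finally blocks in the services above shrinks to a single call, and no span is left without an error status when its block throws.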
Summary
Key aspects of distributed tracing:
| Item | Description |
|---|---|
| Trace Context | W3C standard for context propagation between services |
| Span Hierarchy | Parent-child relationships express request flow |
| Sampling | Head/Tail based for cost optimization |
| Attributes | Follow Semantic Conventions |
In the next post, we’ll cover structured logging and Correlation IDs.