Series Introduction

  1. Part 1: OpenTelemetry Instrumentation
  2. Part 2: Distributed Tracing Across Microservices (current post)
  3. Part 3: Structured Logging with Correlation IDs
  4. Part 4: Metrics and Alerting with Prometheus/Grafana
  5. Part 5: Debugging Production Issues with Observability Data

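What Is Distributed Tracing?

Distributed tracing visualizes the entire path a request takes as it passes through multiple services. Each service records one or more spans, and spans that share a trace ID are assembled into a single trace.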

User Request
    │
    ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ API Gateway │────▶│Order Service│────▶│Payment Svc  │
│   Span A    │     │   Span B    │     │   Span C    │
└─────────────┘     └──────┬──────┘     └─────────────┘
                          │
                          ▼
                   ┌─────────────┐
                   │Inventory Svc│
                   │   Span D    │
                   └─────────────┘

Trace Context Structure

W3C Trace Context Standard

traceparent: 00-{trace-id}-{span-id}-{trace-flags}
tracestate: vendor1=value1,vendor2=value2

Example:

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
  • trace-id: 32 hex digits identifying the entire trace
  • span-id: 16 hex digits identifying the current span (the W3C specification calls this field parent-id)
  • trace-flags: 01 = sampled
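
The field layout above can be checked with a small parser. This is a minimal sketch that assumes only the version 00 layout; the `TraceParent` class and `parseTraceparent` function are hypothetical names introduced for illustration and are not part of any OpenTelemetry API.

```kotlin
// Hypothetical holder for the four traceparent fields.
data class TraceParent(
    val version: String,
    val traceId: String,
    val parentId: String,
    val sampled: Boolean
)

// version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
private val TRACEPARENT_PATTERN =
    Regex("^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

// Returns the parsed header, or null when the value is malformed.
fun parseTraceparent(header: String): TraceParent? {
    val match = TRACEPARENT_PATTERN.matchEntire(header.trim()) ?: return null
    val (version, traceId, parentId, flags) = match.destructured

    // The specification treats version ff and all-zero IDs as invalid.
    if (version == "ff") return null
    if (traceId.all { it == '0' } || parentId.all { it == '0' }) return null

    // Bit 0 of trace-flags is the sampled flag.
    val sampled = (flags.toInt(16) and 0x01) != 0
    return TraceParent(version, traceId, parentId, sampled)
}
```

Passing the example header above to `parseTraceparent` should give a non-null result with `sampled` set to true, while a malformed or all-zero value returns null.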

Practical Distributed Tracing Implementation

Multi-Service Architecture

# docker-compose.yml
version: '3.8'
services:
  api-gateway:
    build: ./api-gateway
    ports:
      - "8080:8080"
    environment:
      - OTEL_SERVICE_NAME=api-gateway
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317

  order-service:
    build: ./order-service
    ports:
      - "8081:8081"
    environment:
      - OTEL_SERVICE_NAME=order-service
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317

  payment-service:
    build: ./payment-service
    ports:
      - "8082:8082"
    environment:
      - OTEL_SERVICE_NAME=payment-service
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317

  inventory-service:
    build: ./inventory-service
    ports:
      - "8083:8083"
    environment:
      - OTEL_SERVICE_NAME=inventory-service
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317

  jaeger:
    image: jaegertracing/all-in-one:1.53
    ports:
      - "16686:16686"
      - "4317:4317"
    environment:
      - COLLECTOR_OTLP_ENABLED=true

API Gateway

@RestController
@RequestMapping("/api")
class GatewayController(
    private val orderServiceClient: OrderServiceClient,
    private val tracer: Tracer
) {
    @PostMapping("/orders")
    fun createOrder(@RequestBody request: CreateOrderRequest): ResponseEntity<OrderResponse> {
        val span = tracer.spanBuilder("gateway.createOrder")
            .setSpanKind(SpanKind.SERVER)
            .setAttribute("http.method", "POST")
            .setAttribute("http.route", "/api/orders")
            .startSpan()

        return try {
            span.makeCurrent().use {
                val order = orderServiceClient.createOrder(request)
                span.setAttribute("order.id", order.id)
                ResponseEntity.created(URI.create("/api/orders/${order.id}")).body(order)
            }
        } catch (e: Exception) {
            span.recordException(e)
            span.setStatus(StatusCode.ERROR)
            throw e
        } finally {
            span.end()
        }
    }
}

Order Service Client (Context Propagation)

@Component
class OrderServiceClient(
    private val webClient: WebClient,
    private val openTelemetry: OpenTelemetry
) {
    fun createOrder(request: CreateOrderRequest): OrderResponse {
        return webClient.post()
            .uri("/orders")
            .bodyValue(request)
            .headers { headers ->
                // Inject the trace context into the outgoing request headers
                openTelemetry.propagators.textMapPropagator.inject(
                    Context.current(),
                    headers
                ) { carrier, key, value ->
                    carrier?.set(key, value)
                }
            }
            .retrieve()
            .bodyToMono(OrderResponse::class.java)
            .block()!!
    }
}
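
To make the inject and extract steps concrete, here is a minimal round-trip sketch that uses a plain map in place of HTTP headers. It assumes the `opentelemetry-api` artifact is on the classpath; `setter`, `getter`, and `roundTrip` are illustrative names and not part of the services in this post.

```kotlin
import io.opentelemetry.api.trace.Span
import io.opentelemetry.api.trace.SpanContext
import io.opentelemetry.api.trace.TraceFlags
import io.opentelemetry.api.trace.TraceState
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator
import io.opentelemetry.context.Context
import io.opentelemetry.context.propagation.TextMapGetter
import io.opentelemetry.context.propagation.TextMapSetter

// Writes a header into the map that stands in for the HTTP request.
val setter = TextMapSetter<MutableMap<String, String>> { carrier, key, value ->
    carrier?.put(key, value)
}

// TextMapGetter declares two methods, so it is written as an object expression.
val getter = object : TextMapGetter<Map<String, String>> {
    override fun keys(carrier: Map<String, String>): Iterable<String> = carrier.keys
    override fun get(carrier: Map<String, String>?, key: String): String? = carrier?.get(key)
}

// Injects a fixed span context on the "client" side and extracts it again
// on the "server" side, returning both the headers and the extracted context.
fun roundTrip(): Pair<Map<String, String>, SpanContext> {
    val propagator = W3CTraceContextPropagator.getInstance()
    val parent = SpanContext.create(
        "0af7651916cd43dd8448eb211c80319c",
        "b7ad6b7169203331",
        TraceFlags.getSampled(),
        TraceState.getDefault()
    )

    // Client side: serialize the current span context into the carrier.
    val headers = mutableMapOf<String, String>()
    propagator.inject(Context.root().with(Span.wrap(parent)), headers, setter)

    // Server side: rebuild a context from the carrier.
    val extracted = propagator.extract(Context.root(), headers, getter)
    return headers to Span.fromContext(extracted).spanContext
}
```

After `roundTrip()`, the map should hold a `traceparent` entry in the format described earlier, and the extracted span context carries the same trace ID and is marked as remote because it was read from a carrier.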

Order Service

@RestController
@RequestMapping("/orders")
class OrderController(
    private val orderService: OrderService,
    private val tracer: Tracer,
    private val openTelemetry: OpenTelemetry
) {
    @PostMapping
    fun createOrder(
        @RequestBody request: CreateOrderRequest,
        @RequestHeader headers: HttpHeaders
    ): ResponseEntity<OrderResponse> {
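        // Extract the parent context. TextMapGetter
        // (io.opentelemetry.context.propagation) declares both keys() and get(),
        // so it needs an object expression rather than a lambda.
        val getter = object : TextMapGetter<HttpHeaders> {
            // Assumes HttpHeaders exposes its header names as map keys.
            override fun keys(carrier: HttpHeaders): Iterable<String> = carrier.keys
            override fun get(carrier: HttpHeaders?, key: String): String? =
                carrier?.getFirst(key)
        }
        val parentContext = openTelemetry.propagators.textMapPropagator.extract(
            Context.current(),
            headers,
            getter
        )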

        val span = tracer.spanBuilder("order.create")
            .setParent(parentContext)
            .setSpanKind(SpanKind.SERVER)
            .startSpan()

        return try {
            span.makeCurrent().use {
                val order = orderService.createOrder(request)
                ResponseEntity.ok(OrderResponse(order))
            }
        } finally {
            span.end()
        }
    }
}

@Service
class OrderService(
    private val orderRepository: OrderRepository,
    private val paymentClient: PaymentClient,
    private val inventoryClient: InventoryClient,
    private val tracer: Tracer
) {
    @Transactional
    fun createOrder(request: CreateOrderRequest): Order {
        // Check and reserve inventory
        val inventorySpan = tracer.spanBuilder("order.checkInventory")
            .setSpanKind(SpanKind.CLIENT)
            .startSpan()

        try {
            inventorySpan.makeCurrent().use {
                inventoryClient.checkAndReserve(request.items)
            }
        } finally {
            inventorySpan.end()
        }

        // Save the order
        val saveSpan = tracer.spanBuilder("order.save")
            .setAttribute("db.system", "postgresql")
            .startSpan()

        val order = try {
            saveSpan.makeCurrent().use {
                orderRepository.save(Order.create(request))
            }
        } finally {
            saveSpan.end()
        }

        // Process the payment
        val paymentSpan = tracer.spanBuilder("order.processPayment")
            .setSpanKind(SpanKind.CLIENT)
            .startSpan()

        try {
            paymentSpan.makeCurrent().use {
                paymentClient.charge(order.customerId, order.totalAmount)
            }
        } finally {
            paymentSpan.end()
        }

        return order
    }
}

Span Hierarchy

Parent-Child Relationships

Trace: abc123
│
├── Span A: gateway.createOrder (Root Span)
│   │
│   └── Span B: order.create (Child of A)
│       │
│       ├── Span C: order.checkInventory (Child of B)
│       │   │
│       │   └── Span E: inventory.reserve (Child of C)
│       │
│       ├── Span D: order.save (Child of B)
│       │
│       └── Span F: order.processPayment (Child of B)
│           │
│           └── Span G: payment.charge (Child of F)
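
A tracing backend can rebuild a tree like this one from flat span records, because each span stores the ID of its parent. The sketch below illustrates the idea under simplifying assumptions: `SpanRecord`, `renderTree`, and the single-letter IDs are hypothetical and much simpler than real span data.

```kotlin
// Hypothetical, simplified span record; real spans carry many more fields.
data class SpanRecord(val spanId: String, val parentId: String?, val name: String)

// Renders spans as an indented tree by following parentId references.
fun renderTree(spans: List<SpanRecord>): List<String> {
    // Group children under their parent's ID; root spans land under the null key.
    val childrenOf = spans.groupBy { it.parentId }
    val lines = mutableListOf<String>()

    fun visit(span: SpanRecord, depth: Int) {
        lines += "  ".repeat(depth) + span.name
        childrenOf[span.spanId].orEmpty().forEach { visit(it, depth + 1) }
    }

    childrenOf[null].orEmpty().forEach { visit(it, 0) }
    return lines
}

// The spans from the hierarchy above, in arbitrary flat order.
val exampleSpans = listOf(
    SpanRecord("a", null, "gateway.createOrder"),
    SpanRecord("b", "a", "order.create"),
    SpanRecord("c", "b", "order.checkInventory"),
    SpanRecord("e", "c", "inventory.reserve"),
    SpanRecord("d", "b", "order.save"),
    SpanRecord("f", "b", "order.processPayment"),
    SpanRecord("g", "f", "payment.charge")
)
```

`renderTree(exampleSpans)` returns one line per span, indented two spaces per nesting level, in the same order as the tree above.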

Span Links

@Service
class BatchOrderProcessor(
    private val tracer: Tracer
) {
    fun processBatch(orders: List<Order>) {
        val batchSpan = tracer.spanBuilder("batch.process")
            .startSpan()

        try {
            batchSpan.makeCurrent().use {
                orders.parallelStream().forEach { order ->
                    val orderSpan = tracer.spanBuilder("batch.processOrder")
                        .addLink(batchSpan.spanContext)  // connect to the batch span via a link
                        .setAttribute("order.id", order.id)
                        .startSpan()

                    try {
                        orderSpan.makeCurrent().use {
                            processOrder(order)
                        }
                    } finally {
                        orderSpan.end()
                    }
                }
            }
        } finally {
            batchSpan.end()
        }
    }
}

Sampling Strategies

Head-based Sampling

The sampling decision is made when the request starts, before any span data exists:

@Configuration
class SamplingConfig {

    @Bean
    fun sdkTracerProvider(): SdkTracerProvider {
        return SdkTracerProvider.builder()
            .setSampler(
                Sampler.parentBased(
                    Sampler.traceIdRatioBased(0.1)  // sample 10% of traces
                )
            )
            .build()
    }
}
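
A ratio-based sampler can make its decision from the trace ID alone, which is why every service that sees the same trace ID reaches the same verdict. The following is a rough sketch of that idea, not the SDK's actual code: `shouldSample` is a hypothetical function, and the exact algorithm in a given SDK version may differ.

```kotlin
import kotlin.math.abs

// Decides deterministically from the trace ID whether to sample.
// Assumption: the low 64 bits of the trace ID are uniformly random.
fun shouldSample(traceId: String, ratio: Double): Boolean {
    require(traceId.length == 32) { "trace-id must be 32 hex characters" }
    if (ratio >= 1.0) return true
    if (ratio <= 0.0) return false

    // Map the ratio onto the range of a signed 64-bit integer.
    val upperBound = (ratio * Long.MAX_VALUE).toLong()

    // Interpret the last 16 hex characters as a 64-bit value.
    val lowBits = traceId.substring(16).toULong(16).toLong()

    // Sample when the value falls below the bound.
    return abs(lowBits) < upperBound
}
```

Because the function has no random state, calling it twice with the same trace ID and ratio always gives the same answer. Wrapping the sampler in `Sampler.parentBased(...)`, as in the configuration above, additionally makes child spans follow the decision already recorded in the parent's trace flags.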

Tail-based Sampling (OTel Collector)

The sampling decision is made after the request completes, once the whole trace can be inspected:

# otel-collector-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100
    expected_new_traces_per_sec: 10
    policies:
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/jaeger]
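
The three policies above combine so that a trace is kept when any one of them matches. The sketch below models that decision in plain Kotlin to show the logic only. `FinishedTrace` and `keepTrace` are hypothetical names, and the random draw is a simplification that stands in for the Collector's own probabilistic policy.

```kotlin
import kotlin.random.Random

// Hypothetical summary of a completed trace.
data class FinishedTrace(val durationMs: Long, val hasError: Boolean)

// Keeps a trace when it has an error, is slow, or wins the probabilistic draw.
fun keepTrace(
    trace: FinishedTrace,
    thresholdMs: Long = 1000,
    samplingPercentage: Double = 10.0,
    random: Random = Random.Default
): Boolean {
    val matchesErrorPolicy = trace.hasError
    val matchesLatencyPolicy = trace.durationMs > thresholdMs
    // Simplification: a uniform draw in [0, 100) compared with the percentage.
    val matchesProbabilisticPolicy = random.nextDouble(100.0) < samplingPercentage
    return matchesErrorPolicy || matchesLatencyPolicy || matchesProbabilisticPolicy
}
```

With `samplingPercentage = 0.0` the function keeps only failed or slow traces, which makes the first two policies easy to check in isolation.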

Using the Jaeger UI

Searching Traces

service=order-service operation=order.create minDuration=100ms

Service Dependency Graph

In the Jaeger UI, the System Architecture tab visualizes the dependencies between services.

Performance Analysis

  • Critical path analysis
  • Comparing durations across spans
  • Identifying bottlenecks

Span Attributes Best Practices

Semantic Conventions

// HTTP attributes
span.setAttribute(SemanticAttributes.HTTP_METHOD, "POST")
// "/api/orders" is a route template, not a full URL, so it belongs in http.route
span.setAttribute(SemanticAttributes.HTTP_ROUTE, "/api/orders")
span.setAttribute(SemanticAttributes.HTTP_STATUS_CODE, 200)

// Database attributes
span.setAttribute(SemanticAttributes.DB_SYSTEM, "postgresql")
span.setAttribute(SemanticAttributes.DB_OPERATION, "SELECT")
span.setAttribute(SemanticAttributes.DB_STATEMENT, "SELECT * FROM orders WHERE id = ?")

// Messaging attributes
span.setAttribute(SemanticAttributes.MESSAGING_SYSTEM, "kafka")
span.setAttribute(SemanticAttributes.MESSAGING_DESTINATION, "order-events")
span.setAttribute(SemanticAttributes.MESSAGING_OPERATION, "publish")

Custom Attributes

// Business context
span.setAttribute("order.id", orderId)
span.setAttribute("customer.tier", "premium")
span.setAttribute("order.item_count", items.size.toLong())
span.setAttribute("order.total_amount", totalAmount.toDouble())

Error Tracking

try {
    processOrder(order)
} catch (e: PaymentException) {
    span.setStatus(StatusCode.ERROR, "Payment processing failed")
    span.recordException(e, Attributes.builder()
        .put("exception.escaped", false)
        .put("payment.error_code", e.errorCode)
        .build()
    )
    throw e
}

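Summary

Key aspects of distributed tracing:

Item            Description
Trace Context   Propagates context between services using the W3C standard
Span hierarchy  Expresses the request flow through parent-child relationships
Sampling        Controls cost with head-based or tail-based decisions
Attributes      Follow the Semantic Conventions

In the next post, we'll cover structured logging and Correlation IDs.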

Series Introduction

  1. Part 1: OpenTelemetry Instrumentation
  2. Part 2: Distributed Tracing Across Microservices (Current)
  3. Part 3: Structured Logging with Correlation IDs
  4. Part 4: Metrics and Alerting with Prometheus/Grafana
  5. Part 5: Debugging Production Issues with Observability Data

What is Distributed Tracing?

Distributed tracing visualizes the entire path of a request as it passes through multiple services.

Trace Context Structure

W3C Trace Context Standard

traceparent: 00-{trace-id}-{span-id}-{trace-flags}
tracestate: vendor1=value1,vendor2=value2

Practical Distributed Tracing Implementation

Multi-Service Architecture

The trace context is propagated between services through the traceparent and tracestate HTTP headers, which lets the Jaeger UI show the complete request flow across all services.

Sampling Strategies

  • Head-based Sampling: Decision made at request start
  • Tail-based Sampling: Decision made after request completion (useful for capturing all errors)

Summary

Key aspects of distributed tracing:

Item Description
Trace Context W3C standard for context propagation between services
Span Hierarchy Parent-child relationships express request flow
Sampling Head/Tail based for cost optimization
Attributes Follow Semantic Conventions

In the next post, we’ll cover structured logging and Correlation IDs.
