Cloud-Native Observability Stack Part 2 - Distributed Tracing Across Microservices
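Series Overview
- Part 1: OpenTelemetry Instrumentation
- Part 2: Distributed Tracing Across Microservices (this post)
- Part 3: Structured Logging with Correlation IDs
- Part 4: Metrics and Alerting with Prometheus/Grafana
- Part 5: Debugging Production Issues with Observability Data
What Is Distributed Tracing?
Distributed tracing visualizes the entire path a request takes as it passes through multiple services. Each service contributes one or more spans, and a shared trace ID ties those spans together into a single trace.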
 User Request
       │
       ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ API Gateway │────▶│Order Service│────▶│ Payment Svc │
│   Span A    │     │   Span B    │     │   Span C    │
└─────────────┘     └──────┬──────┘     └─────────────┘
                           │
                           ▼
                    ┌─────────────┐
                    │Inventory Svc│
                    │   Span D    │
                    └─────────────┘
Trace Context Structure
W3C Trace Context Standard
traceparent: 00-{trace-id}-{span-id}-{trace-flags}
tracestate: vendor1=value1,vendor2=value2
Example:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
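- version: the leading 00 is the format version
- trace-id: 32 hex characters (16 bytes) identifying the whole trace
- span-id: 16 hex characters (8 bytes) identifying the current span; the receiving service uses it as the parent of the span it creates
- trace-flags: 01 = sampled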
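To make the field layout concrete, here is a minimal sketch of a traceparent parser in plain Kotlin. The `TraceParent` type and `parseTraceParent` function are hypothetical names for illustration, not part of the OpenTelemetry API, and the sketch only handles the four-field form shown above.

```kotlin
// Hypothetical helper type for illustration; not part of the OpenTelemetry API.
data class TraceParent(
    val version: String,
    val traceId: String,
    val spanId: String,
    val sampled: Boolean,
)

// version (2 hex) - trace-id (32 hex) - span-id (16 hex) - trace-flags (2 hex), lowercase only.
private val TRACEPARENT_REGEX =
    Regex("^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

/** Parses a traceparent header value, returning null when it is malformed. */
fun parseTraceParent(header: String): TraceParent? {
    val match = TRACEPARENT_REGEX.matchEntire(header.trim()) ?: return null
    val (version, traceId, spanId, flags) = match.destructured
    // Version "ff" and all-zero IDs are invalid.
    if (version == "ff") return null
    if (traceId.all { it == '0' } || spanId.all { it == '0' }) return null
    // Bit 0 of trace-flags is the "sampled" flag.
    val sampled = (flags.toInt(16) and 0x01) == 0x01
    return TraceParent(version, traceId, spanId, sampled)
}
```

Calling `parseTraceParent` with the example header above yields the trace ID `0af7651916cd43dd8448eb211c80319c`, the span ID `b7ad6b7169203331`, and `sampled = true`. In a real service the SDK's propagator does this parsing for you.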
Implementing Distributed Tracing in Practice
Multi-Service Architecture
# docker-compose.yml
version: '3.8'
services:
  api-gateway:
    build: ./api-gateway
    ports:
      - "8080:8080"
    environment:
      - OTEL_SERVICE_NAME=api-gateway
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
  order-service:
    build: ./order-service
    ports:
      - "8081:8081"
    environment:
      - OTEL_SERVICE_NAME=order-service
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
  payment-service:
    build: ./payment-service
    ports:
      - "8082:8082"
    environment:
      - OTEL_SERVICE_NAME=payment-service
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
  inventory-service:
    build: ./inventory-service
    ports:
      - "8083:8083"
    environment:
      - OTEL_SERVICE_NAME=inventory-service
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
  jaeger:
    image: jaegertracing/all-in-one:1.53
    ports:
      - "16686:16686"  # Jaeger UI
      - "4317:4317"    # OTLP over gRPC
    environment:
      - COLLECTOR_OTLP_ENABLED=true
API Gateway
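import io.opentelemetry.api.trace.SpanKind
import io.opentelemetry.api.trace.StatusCode
import io.opentelemetry.api.trace.Tracer
import org.springframework.http.ResponseEntity
import org.springframework.web.bind.annotation.PostMapping
import org.springframework.web.bind.annotation.RequestBody
import org.springframework.web.bind.annotation.RequestMapping
import org.springframework.web.bind.annotation.RestController
import java.net.URI

@RestController
@RequestMapping("/api")
class GatewayController(
    private val orderServiceClient: OrderServiceClient,
    private val tracer: Tracer
) {
    @PostMapping("/orders")
    fun createOrder(@RequestBody request: CreateOrderRequest): ResponseEntity<OrderResponse> {
        // SERVER span for the inbound request; it becomes the root span of the trace.
        val span = tracer.spanBuilder("gateway.createOrder")
            .setSpanKind(SpanKind.SERVER)
            .setAttribute("http.method", "POST")
            .setAttribute("http.route", "/api/orders")
            .startSpan()
        return try {
            // makeCurrent() stores the span in the current Context until the scope closes.
            span.makeCurrent().use {
                val order = orderServiceClient.createOrder(request)
                span.setAttribute("order.id", order.id)
                ResponseEntity.created(URI.create("/api/orders/${order.id}")).body(order)
            }
        } catch (e: Exception) {
            span.recordException(e)
            span.setStatus(StatusCode.ERROR)
            throw e
        } finally {
            span.end()
        }
    }
}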
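Order Service Client (Context Propagation)
import io.opentelemetry.api.OpenTelemetry
import io.opentelemetry.context.Context
import org.springframework.stereotype.Component
import org.springframework.web.reactive.function.client.WebClient

@Component
class OrderServiceClient(
    private val webClient: WebClient,
    private val openTelemetry: OpenTelemetry
) {
    fun createOrder(request: CreateOrderRequest): OrderResponse {
        return webClient.post()
            .uri("/orders")
            .bodyValue(request)
            .headers { headers ->
                // Inject the trace context of the current span into the outgoing headers
                openTelemetry.propagators.textMapPropagator.inject(
                    Context.current(),
                    headers
                ) { carrier, key, value ->
                    carrier?.set(key, value)
                }
            }
            .retrieve()
            .bodyToMono(OrderResponse::class.java)
            .block()!!
    }
}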
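To show what one propagation hop amounts to, the following sketch writes and reads a traceparent header using a plain map as the carrier. `SpanContextSketch`, `injectTraceParent`, and `extractTraceParent` are hypothetical stand-ins for illustration; in the services above the SDK's `TextMapPropagator` performs both steps.

```kotlin
// Hypothetical stand-in for a span context; the real type is io.opentelemetry.api.trace.SpanContext.
data class SpanContextSketch(val traceId: String, val spanId: String, val sampled: Boolean)

/** Caller side: writes the traceparent header for the calling span into the outgoing headers. */
fun injectTraceParent(current: SpanContextSketch, headers: MutableMap<String, String>) {
    val flags = if (current.sampled) "01" else "00"
    headers["traceparent"] = "00-${current.traceId}-${current.spanId}-$flags"
}

/** Callee side: reads the caller's span context back out (header names are case-insensitive). */
fun extractTraceParent(headers: Map<String, String>): SpanContextSketch? {
    val value = headers.entries
        .firstOrNull { it.key.equals("traceparent", ignoreCase = true) }
        ?.value ?: return null
    val parts = value.split("-")
    if (parts.size != 4 || parts[1].length != 32 || parts[2].length != 16) return null
    // Bit 0 of the flags byte carries the sampled decision.
    val sampled = parts[3].toIntOrNull(16)?.and(1) == 1
    return SpanContextSketch(parts[1], parts[2], sampled)
}
```

The trace ID passes through unchanged, while the span ID in the header is the caller's span. The receiving service creates its own span with a new span ID and records the extracted one as its parent.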
Order Service
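import io.opentelemetry.api.OpenTelemetry
import io.opentelemetry.api.trace.SpanKind
import io.opentelemetry.api.trace.StatusCode
import io.opentelemetry.api.trace.Tracer
import io.opentelemetry.context.Context
import io.opentelemetry.context.propagation.TextMapGetter
import org.springframework.http.HttpHeaders
import org.springframework.http.ResponseEntity
import org.springframework.web.bind.annotation.*

@RestController
@RequestMapping("/orders")
class OrderController(
    private val orderService: OrderService,
    private val tracer: Tracer,
    private val openTelemetry: OpenTelemetry
) {
    // TextMapGetter declares two methods (keys and get), so it cannot be written as a lambda.
    private val headerGetter = object : TextMapGetter<HttpHeaders> {
        override fun keys(carrier: HttpHeaders): Iterable<String> = carrier.keys
        override fun get(carrier: HttpHeaders?, key: String): String? = carrier?.getFirst(key)
    }

    @PostMapping
    fun createOrder(
        @RequestBody request: CreateOrderRequest,
        @RequestHeader headers: HttpHeaders
    ): ResponseEntity<OrderResponse> {
        // Extract the parent context sent by the caller
        val parentContext = openTelemetry.propagators.textMapPropagator.extract(
            Context.current(),
            headers,
            headerGetter
        )
        // Start this service's span as a child of the extracted context
        val span = tracer.spanBuilder("order.create")
            .setParent(parentContext)
            .setSpanKind(SpanKind.SERVER)
            .startSpan()
        return try {
            span.makeCurrent().use {
                val order = orderService.createOrder(request)
                ResponseEntity.ok(OrderResponse(order))
            }
        } catch (e: Exception) {
            span.recordException(e)
            span.setStatus(StatusCode.ERROR)
            throw e
        } finally {
            span.end()
        }
    }
}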
@Service
class OrderService(
    private val orderRepository: OrderRepository,
    private val paymentClient: PaymentClient,
    private val inventoryClient: InventoryClient,
    private val tracer: Tracer
) {
    @Transactional
    fun createOrder(request: CreateOrderRequest): Order {
        // Check and reserve inventory (child span of order.create via the current Context)
        val inventorySpan = tracer.spanBuilder("order.checkInventory")
            .setSpanKind(SpanKind.CLIENT)
            .startSpan()
        try {
            inventorySpan.makeCurrent().use {
                inventoryClient.checkAndReserve(request.items)
            }
        } finally {
            inventorySpan.end()
        }

        // Save the order
        val saveSpan = tracer.spanBuilder("order.save")
            .setAttribute("db.system", "postgresql")
            .startSpan()
        val order = try {
            saveSpan.makeCurrent().use {
                orderRepository.save(Order.create(request))
            }
        } finally {
            saveSpan.end()
        }

        // Process the payment
        val paymentSpan = tracer.spanBuilder("order.processPayment")
            .setSpanKind(SpanKind.CLIENT)
            .startSpan()
        try {
            paymentSpan.makeCurrent().use {
                paymentClient.charge(order.customerId, order.totalAmount)
            }
        } finally {
            paymentSpan.end()
        }
        return order
    }
}
Span Hierarchy
Parent-Child Relationships
Trace: abc123
│
├── Span A: gateway.createOrder (Root Span)
│ │
│ └── Span B: order.create (Child of A)
│ │
│ ├── Span C: order.checkInventory (Child of B)
│ │ │
│ │ └── Span E: inventory.reserve (Child of C)
│ │
│ ├── Span D: order.save (Child of B)
│ │
│ └── Span F: order.processPayment (Child of B)
│ │
│ └── Span G: payment.charge (Child of F)
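A tracing backend rebuilds this hierarchy from flat span records, each carrying the ID of its parent. The sketch below shows that reconstruction in plain Kotlin. `SpanRecord` and `renderTree` are hypothetical names for illustration and do not reflect how Jaeger stores or renders spans internally.

```kotlin
// Hypothetical flattened span record, similar in spirit to what a tracing backend stores.
data class SpanRecord(val spanId: String, val parentSpanId: String?, val name: String)

/** Renders spans as an indented tree by following parentSpanId references from the roots down. */
fun renderTree(spans: List<SpanRecord>): List<String> {
    // Group children under their parent's span ID; root spans are grouped under null.
    val childrenByParent = spans.groupBy { it.parentSpanId }
    val lines = mutableListOf<String>()

    fun visit(span: SpanRecord, depth: Int) {
        lines += "  ".repeat(depth) + span.name
        childrenByParent[span.spanId].orEmpty().forEach { visit(it, depth + 1) }
    }

    childrenByParent[null].orEmpty().forEach { visit(it, 0) }
    return lines
}
```

Given records for `gateway.createOrder` (no parent), `order.create` (child of the gateway span), and `order.save` (child of `order.create`), `renderTree` returns the three names at indentation depths 0, 1, and 2.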
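Span Link (Parallel Processing)
import io.opentelemetry.api.trace.Tracer
import org.springframework.stereotype.Service

@Service
class BatchOrderProcessor(
    private val tracer: Tracer
) {
    fun processBatch(orders: List<Order>) {
        val batchSpan = tracer.spanBuilder("batch.process")
            .startSpan()
        try {
            batchSpan.makeCurrent().use {
                // Worker threads of a parallel stream do not inherit the current Context,
                // so the link is what ties each order span back to the batch span.
                orders.parallelStream().forEach { order ->
                    val orderSpan = tracer.spanBuilder("batch.processOrder")
                        .addLink(batchSpan.spanContext) // connect via a link instead of a parent
                        .setAttribute("order.id", order.id)
                        .startSpan()
                    try {
                        orderSpan.makeCurrent().use {
                            // processOrder is assumed to be defined elsewhere
                            processOrder(order)
                        }
                    } finally {
                        orderSpan.end()
                    }
                }
            }
        } finally {
            batchSpan.end()
        }
    }
}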
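Sampling Strategies
Head-based Sampling
The sampling decision is made when the request starts, at the root span, and downstream services follow it through trace-flags:
import io.opentelemetry.sdk.trace.SdkTracerProvider
import io.opentelemetry.sdk.trace.samplers.Sampler
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

@Configuration
class SamplingConfig {
    @Bean
    fun sdkTracerProvider(): SdkTracerProvider {
        return SdkTracerProvider.builder()
            .setSampler(
                // Follow the parent's decision when there is one; otherwise sample by ratio
                Sampler.parentBased(
                    Sampler.traceIdRatioBased(0.1) // sample 10% of traces
                )
            )
            .build()
    }
}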
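The idea behind a trace-ID ratio sampler can be sketched without the SDK: derive a number from the trace ID and compare it to a threshold, so every service that sees the same trace ID reaches the same decision. The functions below are a simplified illustration under that assumption, with hypothetical names, and the SDK's own sampler may differ in detail.

```kotlin
/**
 * Deterministic ratio-based decision derived from the trace ID.
 * A simplified sketch, not the SDK's exact algorithm.
 */
fun shouldSample(traceId: String, ratio: Double): Boolean {
    require(traceId.length == 32) { "trace ID must be 32 hex characters" }
    require(ratio in 0.0..1.0) { "ratio must be between 0.0 and 1.0" }
    if (ratio == 0.0) return false
    if (ratio == 1.0) return true
    // Treat the low 8 bytes of the trace ID as a pseudo-random 63-bit number.
    val randomPart = traceId.substring(16).toULong(16).toLong() and Long.MAX_VALUE
    val upperBound = (ratio * Long.MAX_VALUE.toDouble()).toLong()
    return randomPart < upperBound
}

/** Parent-based wrapper: follow the caller's decision when there is one, else decide by ratio. */
fun parentBasedDecision(parentSampled: Boolean?, traceId: String, ratio: Double): Boolean =
    parentSampled ?: shouldSample(traceId, ratio)
```

Because the decision depends only on the trace ID and the ratio, it needs no coordination between services. The trade-off is that it is made before anyone knows whether the request will fail or be slow.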
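Tail-based Sampling (OTel Collector)
The sampling decision is made after the request completes, once the Collector has buffered the spans of the trace, which makes it possible to keep every error and every slow trace:
# otel-collector-config.yaml
# Receivers, exporters, and the batch processor are assumed to be defined elsewhere in this file.
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100
    expected_new_traces_per_sec: 10
    policies:
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/jaeger]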
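The three policies in the configuration above combine as "keep the trace if any policy matches". The sketch below restates that logic in Kotlin to make the decision order explicit. `FinishedSpan` and `keepTrace` are hypothetical names, and the latency check is approximated by the longest span rather than by the Collector's own calculation.

```kotlin
// Hypothetical summary of a finished span, for illustration only.
data class FinishedSpan(val name: String, val durationMs: Long, val isError: Boolean)

/**
 * Decides whether to keep a completed trace: keep errors, keep slow traces,
 * and keep a random share of everything else.
 * randomPercent (0.0 until 100.0) is passed in so the decision stays testable.
 */
fun keepTrace(
    spans: List<FinishedSpan>,
    randomPercent: Double,
    latencyThresholdMs: Long = 1000,
    samplingPercentage: Double = 10.0,
): Boolean {
    // errors-policy: any span with an error status keeps the whole trace.
    if (spans.any { it.isError }) return true
    // slow-traces-policy: approximate trace latency by the longest span (normally the root).
    val traceDurationMs = spans.maxOfOrNull { it.durationMs } ?: 0L
    if (traceDurationMs > latencyThresholdMs) return true
    // probabilistic-policy: keep roughly samplingPercentage percent of the remaining traces.
    return randomPercent < samplingPercentage
}
```

In production the random value would come from something like `kotlin.random.Random.nextDouble(0.0, 100.0)`. Passing it in keeps the function deterministic for tests.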
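Using the Jaeger UI
Searching Traces
service=order-service operation=order.create minDuration=100ms
Service Dependency Graph
The System Architecture tab in the Jaeger UI visualizes the dependencies between services.
Performance Analysis
- Critical path analysis
- Comparing time spent across spans
- Identifying bottlenecks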
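One simple way to locate a bottleneck from span timings is to compute each span's self time, meaning its duration minus the time covered by its direct children. The sketch below assumes children of a span run one after another without overlapping, as in the sequential order flow in this post. `TimedSpan`, `selfTimes`, and `bottleneck` are hypothetical names and are not Jaeger features.

```kotlin
// Hypothetical timing record for one span; times are milliseconds from the start of the trace.
data class TimedSpan(
    val spanId: String,
    val parentSpanId: String?,
    val name: String,
    val startMs: Long,
    val endMs: Long,
)

/** Maps each span name to its self time: own duration minus the summed duration of direct children. */
fun selfTimes(spans: List<TimedSpan>): Map<String, Long> {
    val childTimeByParent = spans
        .filter { it.parentSpanId != null }
        .groupBy({ it.parentSpanId!! }, { it.endMs - it.startMs })
        .mapValues { (_, durations) -> durations.sum() }
    return spans.associate { span ->
        val total = span.endMs - span.startMs
        span.name to (total - (childTimeByParent[span.spanId] ?: 0L)).coerceAtLeast(0L)
    }
}

/** The span with the largest self time is the first place to look for a bottleneck. */
fun bottleneck(spans: List<TimedSpan>): String? =
    selfTimes(spans).maxByOrNull { it.value }?.key
```

For a trace in which `order.processPayment` takes 200 ms and has no child spans while every other span has only 20 to 40 ms of self time, `bottleneck` returns `order.processPayment`.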
Span Attribute Best Practices
Semantic Conventions
import io.opentelemetry.semconv.trace.attributes.SemanticAttributes

// HTTP
span.setAttribute(SemanticAttributes.HTTP_METHOD, "POST")
// http.url is meant for the full request URL; a path template such as this belongs in http.route
span.setAttribute(SemanticAttributes.HTTP_ROUTE, "/api/orders")
span.setAttribute(SemanticAttributes.HTTP_STATUS_CODE, 200)
// Database
span.setAttribute(SemanticAttributes.DB_SYSTEM, "postgresql")
span.setAttribute(SemanticAttributes.DB_OPERATION, "SELECT")
span.setAttribute(SemanticAttributes.DB_STATEMENT, "SELECT * FROM orders WHERE id = ?")
// Messaging
span.setAttribute(SemanticAttributes.MESSAGING_SYSTEM, "kafka")
span.setAttribute(SemanticAttributes.MESSAGING_DESTINATION, "order-events")
span.setAttribute(SemanticAttributes.MESSAGING_OPERATION, "publish")
Custom Attributes
// Business context
span.setAttribute("order.id", orderId)
span.setAttribute("customer.tier", "premium")
span.setAttribute("order.item_count", items.size.toLong())
span.setAttribute("order.total_amount", totalAmount.toDouble())
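Error Tracking
try {
    processOrder(order)
} catch (e: PaymentException) {
    // Mark the span as failed and attach the exception as a span event
    span.setStatus(StatusCode.ERROR, "Payment processing failed")
    span.recordException(e, Attributes.builder()
        // the exception is rethrown below, so it escapes the scope of the span
        .put("exception.escaped", true)
        .put("payment.error_code", e.errorCode)
        .build()
    )
    throw e
}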
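Summary
Key points of distributed tracing:
| Item | Description |
|---|---|
| Trace Context | W3C standard for propagating context between services |
| Span Hierarchy | Parent-child relationships express the request flow |
| Sampling | Head-based or tail-based, to control cost |
| Attributes | Follow the Semantic Conventions |
The next post covers structured logging and Correlation IDs.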