Cloud-Native Observability Stack Part 5 - Debugging Production Issues with Observability Data
Series Introduction
- Part 1: OpenTelemetry Instrumentation
- Part 2: Distributed Tracing Across Microservices
- Part 3: Structured Logging with Correlation IDs
- Part 4: Metrics and Alerting with Prometheus/Grafana
- Part 5: Debugging Production Issues with Observability Data (Current)
Debugging Workflow
MELT Approach
Metrics → Events → Logs → Traces
- Detect problems with Metrics
- Identify timing with Events/Alerts
- Get details with Logs
- Track request flow with Traces
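The glue between these signals is the trace ID, the correlation ID from Part 3. Below is a minimal sketch (assuming OpenTelemetry's API and SLF4J MDC on Spring Boot 3 with jakarta.servlet; the filter itself is hypothetical) of copying the current trace ID into every log line, so a metric spike can be followed to a trace and from there to its logs:
import io.opentelemetry.api.trace.Span
import jakarta.servlet.FilterChain
import jakarta.servlet.http.HttpServletRequest
import jakarta.servlet.http.HttpServletResponse
import org.slf4j.MDC
import org.springframework.stereotype.Component
import org.springframework.web.filter.OncePerRequestFilter

// Hypothetical filter: copies the active OpenTelemetry trace ID into the
// logging MDC so Loki queries like |= "traceId=..." line up with Jaeger traces
@Component
class TraceIdMdcFilter : OncePerRequestFilter() {
    override fun doFilterInternal(
        request: HttpServletRequest,
        response: HttpServletResponse,
        filterChain: FilterChain
    ) {
        MDC.put("traceId", Span.current().spanContext.traceId)
        try {
            filterChain.doFilter(request, response)
        } finally {
            MDC.remove("traceId")
        }
    }
}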
Real-World Failure Scenarios
Scenario 1: Intermittent Timeouts
Symptom: Some order creation requests time out after 30 seconds
Step 1: Check Metrics
# Check P99 latency spike
histogram_quantile(0.99,
sum(rate(http_server_requests_seconds_bucket{uri="/api/orders"}[5m])) by (le)
)
Grafana observation: P99 latency spikes to 30 seconds during certain time periods
Step 2: Trace Analysis
Search for slow requests in Jaeger:
service=order-service minDuration=10s
Discovery: inventory.checkStock span takes 29 seconds
Step 3: Log Investigation
{service="inventory-service"} | json | latency > 10000
Discovery: Database query is slow for a specific product ID
Step 4: Root Cause
-- Check execution plan
EXPLAIN ANALYZE SELECT * FROM inventory WHERE product_id = 'PROD-12345';
Cause: Missing index on product_id
Resolution:
CREATE INDEX idx_inventory_product_id ON inventory(product_id);
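Beyond adding the index, the lookup can be made easier to debug next time by wrapping it in an explicit span that carries the product ID. This is a sketch under assumptions: the service class and inventoryRepository.countByProductId are hypothetical; only the span name inventory.checkStock comes from the trace above.
import io.opentelemetry.api.GlobalOpenTelemetry
import org.springframework.stereotype.Service

@Service
class InventoryService(private val inventoryRepository: InventoryRepository) {
    private val tracer = GlobalOpenTelemetry.getTracer("inventory-service")

    fun checkStock(productId: String): Long {
        // Manual span: the product ID becomes a searchable tag in Jaeger
        val span = tracer.spanBuilder("inventory.checkStock")
            .setAttribute("product.id", productId)
            .startSpan()
        return try {
            span.makeCurrent().use {
                inventoryRepository.countByProductId(productId)  // hypothetical repository method
            }
        } finally {
            span.end()
        }
    }
}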
Scenario 2: Memory Leak
Symptom: Service periodically restarts due to OOM
Step 1: Check Metrics
# Heap memory usage trend
jvm_memory_used_bytes{area="heap",application="order-service"}
Pattern discovered: Memory gradually increases then drops suddenly (restart)
Step 2: GC Log Analysis
# GC frequency increase
rate(jvm_gc_pause_seconds_count{application="order-service"}[5m])
Discovery: Full GC frequency is steadily increasing
Step 3: Heap Dump Analysis
# Generate heap dump
jmap -dump:format=b,file=heapdump.hprof <pid>
# Analyze with MAT or VisualVM
Discovery: OrderCache objects consume 80% of memory
Step 4: Code Review
// Problematic code
@Component
class OrderCache {
private val cache = ConcurrentHashMap<String, Order>()
fun put(orderId: String, order: Order) {
cache[orderId] = order // No removal logic!
}
}
Resolution:
import com.github.benmanes.caffeine.cache.Caffeine
import org.springframework.stereotype.Component
import java.time.Duration

@Component
class OrderCache {
    // Bounded, time-limited cache: Caffeine evicts entries by size and age
    private val cache = Caffeine.newBuilder()
        .maximumSize(10_000)
        .expireAfterWrite(Duration.ofHours(1))
        .build<String, Order>()

    fun put(orderId: String, order: Order) = cache.put(orderId, order)
}
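A possible follow-up (not part of the original fix): with Micrometer on the classpath, the same cache can be registered via CaffeineCacheMetrics so its size and eviction counts appear next to the heap metrics from Step 1, rather than only in a heap dump. Note that recordStats() is required for the hit/miss meters.
import com.github.benmanes.caffeine.cache.Caffeine
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.binder.cache.CaffeineCacheMetrics
import java.time.Duration
import org.springframework.stereotype.Component

@Component
class OrderCache(registry: MeterRegistry) {
    private val cache = Caffeine.newBuilder()
        .maximumSize(10_000)
        .expireAfterWrite(Duration.ofHours(1))
        .recordStats()  // needed for cache hit/miss meters
        .build<String, Order>()
        // registers cache size, gets, and eviction meters tagged cache="orderCache"
        .also { CaffeineCacheMetrics.monitor(registry, it, "orderCache") }

    fun put(orderId: String, order: Order) = cache.put(orderId, order)
}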
Scenario 3: Cascading Failure Between Services
Symptom: A Payment Service failure cascades into an outage of the entire system
Step 1: Check Dependency Graph
Observed in Jaeger Service Map:
- Order Service → Payment Service (synchronous call)
- Payment Service failure causes Order Service thread blocking
Step 2: Check Metrics
# Connection pool exhaustion
hikaricp_connections_active{application="order-service"}
hikaricp_connections_pending{application="order-service"}
Discovery: All connections are in waiting state during Payment Service timeout
Step 3: Log Investigation
{service="order-service"} |= "Connection pool exhausted"
Resolution: Apply Circuit Breaker Pattern
@Service
class PaymentClient(
    private val paymentApi: PaymentApi,        // external payment API client (assumed collaborator)
    private val paymentQueue: PaymentQueue,    // retry queue for deferred payments (assumed collaborator)
    circuitBreakerFactory: Resilience4JCircuitBreakerFactory
) {
    private val circuitBreaker = circuitBreakerFactory.create("payment")

    fun processPayment(order: Order): PaymentResult {
        return circuitBreaker.run(
            { paymentApi.charge(order) },
            { _ -> handleFallback(order) }  // fallback is invoked with the failure cause
        )
    }

    private fun handleFallback(order: Order): PaymentResult {
        // Add to the payment queue for later processing
        paymentQueue.add(order)
        return PaymentResult.PENDING
    }
}
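The fallback alone does not stop caller threads from waiting out the full 30-second timeout, so the factory should also be configured with a short time limit. The following sketch assumes the Spring Cloud CircuitBreaker Resilience4j starter; the 2-second timeout and 50% failure threshold are illustrative values, not taken from the original incident.
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig
import io.github.resilience4j.timelimiter.TimeLimiterConfig
import java.time.Duration
import org.springframework.cloud.circuitbreaker.resilience4j.Resilience4JCircuitBreakerFactory
import org.springframework.cloud.circuitbreaker.resilience4j.Resilience4JConfigBuilder
import org.springframework.cloud.client.circuitbreaker.Customizer
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

@Configuration
class CircuitBreakerDefaults {
    @Bean
    fun defaultCustomizer(): Customizer<Resilience4JCircuitBreakerFactory> =
        Customizer { factory ->
            factory.configureDefault { id ->
                Resilience4JConfigBuilder(id)
                    // fail fast instead of holding a thread for the full 30-second timeout
                    .timeLimiterConfig(
                        TimeLimiterConfig.custom()
                            .timeoutDuration(Duration.ofSeconds(2))
                            .build()
                    )
                    // open the circuit once half of recent calls fail
                    .circuitBreakerConfig(
                        CircuitBreakerConfig.custom()
                            .failureRateThreshold(50f)
                            .waitDurationInOpenState(Duration.ofSeconds(30))
                            .build()
                    )
                    .build()
            }
        }
}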
Debugging Toolkit
1. Distributed Trace Search Queries
# Slow requests
service=order-service minDuration=1s
# Error requests
service=order-service error=true
# Specific user
service=order-service tag.customer.id=CUST-123
2. Useful PromQL Queries
# Find services with error rate spikes
topk(5,
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (application)
)
# Endpoints with latency spikes
topk(5,
histogram_quantile(0.99,
sum(rate(http_server_requests_seconds_bucket[5m])) by (uri, le)
)
)
# Top services by memory usage
topk(5,
jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"}
)
3. Useful LogQL Queries
# Aggregate error logs
sum by (errorType) (
count_over_time({service="order-service"} | json | level="ERROR" [1h])
)
# All logs for a specific traceId
{service=~".+"} |= "traceId=abc123"
# Slow query logs
{service=~".+"} | json | queryTime > 1000
On-Call Playbook
When Service Is Down
- Immediate Verification
  - Check the `up{job="spring-boot-apps"}` metric
  - Check Pod status: `kubectl get pods`
- Check Recent Changes
  - Recent deployment history
  - Configuration changes
- Log Investigation
  - Check startup logs for errors
  - Verify whether an OOM occurred
- Rollback Decision
  - If quick recovery is needed, roll back to the previous version
When Performance Degrades
- Determine Impact Scope
  - Entire service? Specific endpoint?
- Identify Bottleneck
  - Check slow spans via traces
  - External dependency issue?
- Resource Check
  - CPU, memory, disk I/O
  - Connection pool status
- Temporary Measures
  - Scale out
  - Apply rate limiting
Postmortem Template
# Incident Report: [Title]
## Overview
- Occurrence: YYYY-MM-DD HH:MM ~ HH:MM (UTC)
- Impact Scope: [Service name, number of users]
- Severity: [Critical/High/Medium/Low]
## Timeline
- HH:MM - First alert triggered
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Fix deployed
- HH:MM - Normal operation confirmed
## Root Cause
[Detailed explanation]
## Resolution
[Actions taken]
## Impact
- Error rate: X%
- Affected requests: N
## Lessons Learned
### What Went Well
-
### What Could Be Improved
-
## Action Items
- [ ] [Owner] Action description (Deadline)
Series Conclusion
Topics covered in this series:
| Part | Topic | Key Point |
|---|---|---|
| 1 | OpenTelemetry | Instrumentation basics |
| 2 | Distributed Tracing | Request flow visualization |
| 3 | Structured Logging | Searchable logs |
| 4 | Metrics/Alerting | Proactive monitoring |
| 5 | Debugging | Real-world problem solving |
Observability is not just monitoring. It’s the ability to understand your system, prevent problems, and resolve them quickly.