

Series Introduction

  1. Part 1: OpenTelemetry Instrumentation
  2. Part 2: Distributed Tracing Across Microservices
  3. Part 3: Structured Logging with Correlation IDs
  4. Part 4: Metrics and Alerting with Prometheus/Grafana
  5. Part 5: Debugging Production Issues with Observability Data (Current)

Debugging Workflow

MELT Approach

Metrics → Events → Logs → Traces

  1. Detect problems with Metrics
  2. Identify timing with Events/Alerts
  3. Get details with Logs
  4. Track request flow with Traces
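In practice, each step maps onto the tooling from the earlier parts of this series. A rough walk-through using the same query conventions (service names and thresholds are illustrative, not tied to a specific incident):

# 1. Metrics: confirm the symptom, e.g. a P99 latency spike (PromQL)
histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket[5m])) by (le))

# 2. Events: the alert's firing time narrows the window to investigate

# 3. Logs: filter that window for errors and note the traceId (LogQL)
{service="order-service"} | json | level="ERROR"

# 4. Traces: search Jaeger for the failing request and inspect its spans
service=order-service error=true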

Real-World Failure Scenarios

Scenario 1: Intermittent Timeouts

Symptom: Some order creation requests time out after 30 seconds

Step 1: Check Metrics

# Check P99 latency spike
histogram_quantile(0.99,
  sum(rate(http_server_requests_seconds_bucket{uri="/api/orders"}[5m])) by (le)
)

Grafana observation: P99 latency spikes to 30 seconds during certain time periods

Step 2: Trace Analysis

Search for slow requests in Jaeger:

service=order-service minDuration=10s

Discovery: inventory.checkStock span takes 29 seconds

Step 3: Log Investigation

{service="inventory-service"} | json | latency > 10000

Discovery: Database query is slow for a specific product ID

Step 4: Root Cause

-- Check execution plan
EXPLAIN ANALYZE SELECT * FROM inventory WHERE product_id = 'PROD-12345';

Cause: Missing index on product_id

Resolution:

CREATE INDEX idx_inventory_product_id ON inventory(product_id);

Scenario 2: Memory Leak

Symptom: Service periodically restarts due to OOM

Step 1: Check Metrics

# Heap memory usage trend
jvm_memory_used_bytes{area="heap",application="order-service"}

Pattern discovered: Memory gradually increases, then drops suddenly each time the service restarts

Step 2: GC Log Analysis

# GC frequency increase
rate(jvm_gc_pause_seconds_count{application="order-service"}[5m])

Discovery: Full GC frequency is steadily increasing

Step 3: Heap Dump Analysis

# Generate heap dump
jmap -dump:format=b,file=heapdump.hprof <pid>

# Analyze with MAT or VisualVM

Discovery: OrderCache objects consume 80% of memory

Step 4: Code Review

// Problematic code
@Component
class OrderCache {
    private val cache = ConcurrentHashMap<String, Order>()

    fun put(orderId: String, order: Order) {
        cache[orderId] = order  // No removal logic!
    }
}

Resolution:

@Component
class OrderCache {
    // Bounded cache: evicts by size and expires entries one hour after write
    private val cache = Caffeine.newBuilder()
        .maximumSize(10_000)
        .expireAfterWrite(Duration.ofHours(1))
        .build<String, Order>()

    fun put(orderId: String, order: Order) = cache.put(orderId, order)

    fun get(orderId: String): Order? = cache.getIfPresent(orderId)
}

Scenario 3: Cascading Failure Between Services

Symptom: A Payment Service failure cascades into an outage of the entire system

Step 1: Check Dependency Graph

Observed in the Jaeger service map:

  • Order Service → Payment Service (synchronous call)
  • A Payment Service failure blocks Order Service request threads (see the sketch below)
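The code behind that dependency edge looks roughly like this. A hypothetical sketch of the pre-fix call path (OrderService, OrderRepository, and PaymentApi are illustrative names, not taken from the actual services); it shows why a hung Payment Service pins both a request thread and a pooled DB connection per order:

// Hypothetical pre-fix code: a plain blocking call with no circuit breaker
@Service
class OrderService(
    private val orderRepository: OrderRepository,
    private val paymentApi: PaymentApi            // blocking REST client with a 30s read timeout
) {
    @Transactional
    fun createOrder(order: Order): Order {
        val saved = orderRepository.save(order)   // DB connection stays checked out here...
        paymentApi.charge(saved)                  // ...while this call blocks for up to 30 seconds
        return saved
    }
}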

Step 2: Check Metrics

# Connection pool exhaustion
hikaricp_connections_active{application="order-service"}
hikaricp_connections_pending{application="order-service"}

Discovery: While Payment Service calls time out, every connection in the pool is stuck in a waiting state

Step 3: Log Investigation

{service="order-service"} |= "Connection pool exhausted"

Resolution: Apply Circuit Breaker Pattern

@Service
class PaymentClient(
    private val paymentApi: PaymentApi,
    private val paymentQueue: PaymentQueue,
    circuitBreakerFactory: Resilience4JCircuitBreakerFactory
) {
    private val circuitBreaker = circuitBreakerFactory.create("payment")

    fun processPayment(order: Order): PaymentResult {
        return circuitBreaker.run(
            { paymentApi.charge(order) },            // protected call
            { throwable -> handleFallback(order) }   // fallback receives the failure cause
        )
    }

    private fun handleFallback(order: Order): PaymentResult {
        // Add to payment queue for later processing
        paymentQueue.add(order)
        return PaymentResult.PENDING
    }
}
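The fallback alone does not free blocked threads; the wrapped call also needs a short timeout so the breaker can trip quickly. A sketch of how that default could be set with Spring Cloud Circuit Breaker's Resilience4J factory (the 2-second value is an assumption to tune against the payment SLA):

@Configuration
class PaymentCircuitBreakerConfig {
    @Bean
    fun defaultCustomizer(): Customizer<Resilience4JCircuitBreakerFactory> =
        Customizer { factory ->
            factory.configureDefault { id ->
                Resilience4JConfigBuilder(id)
                    // Fail fast instead of waiting out the full 30s client timeout
                    .timeLimiterConfig(
                        TimeLimiterConfig.custom()
                            .timeoutDuration(Duration.ofSeconds(2))
                            .build()
                    )
                    .circuitBreakerConfig(CircuitBreakerConfig.ofDefaults())
                    .build()
            }
        }
}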

Debugging Toolkit

1. Distributed Trace Search Queries

# Slow requests
service=order-service minDuration=1s

# Error requests
service=order-service error=true

# Specific user
service=order-service tag.customer.id=CUST-123

2. Useful PromQL Queries

# Find services with error rate spikes
topk(5,
  sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (application)
)

# Endpoints with latency spikes
topk(5,
  histogram_quantile(0.99,
    sum(rate(http_server_requests_seconds_bucket[5m])) by (uri, le)
  )
)

# Top services by heap usage ratio
topk(5,
  sum by (application) (jvm_memory_used_bytes{area="heap"})
    / sum by (application) (jvm_memory_max_bytes{area="heap"})
)

3. Useful LogQL Queries

# Aggregate error logs
sum by (errorType) (
  count_over_time({service="order-service"} | json | level="ERROR" [1h])
)

# All logs for a specific traceId
{service=~".+"} |= "traceId=abc123"

# Slow query logs
{service=~".+"} | json | queryTime > 1000

On-Call Playbook

When Service Is Down

  1. Immediate Verification (see the commands after this checklist)
    • Check the up{job="spring-boot-apps"} metric
    • Check Pod status: kubectl get pods
  2. Check Recent Changes
    • Recent deployment history
    • Configuration changes
  3. Log Investigation
    • Check startup logs for errors
    • Verify OOM occurrence
  4. Rollback Decision
    • If quick recovery is needed, roll back to the previous version
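A minimal sketch of the first and third checks, assuming the Prometheus job and Kubernetes setup from Part 4 (the production namespace and pod name are placeholders):

# Which scrape targets are down? (PromQL)
up{job="spring-boot-apps"} == 0

# Pod status, restart counts, and the crashed container's last logs
kubectl get pods -n production
kubectl describe pod <pod-name> -n production        # OOMKilled? CrashLoopBackOff?
kubectl logs <pod-name> -n production --previous     # startup errors before the restart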

When Performance Degrades

  1. Determine Impact Scope
    • Entire service? Specific endpoint?
  2. Identify Bottleneck
    • Check slow spans via traces
    • External dependency issue?
  3. Resource Check (see the queries after this checklist)
    • CPU, memory, disk I/O
    • Connection pool status
  4. Temporary Measures
    • Scale out
    • Apply rate limiting
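For step 3, the default Spring Boot Actuator/Micrometer metrics already cover most of the resource checks. A few illustrative PromQL queries (metric names assume the standard Prometheus registry used in Part 4):

# CPU usage of the JVM process
process_cpu_usage{application="order-service"}

# Disk headroom
disk_free_bytes{application="order-service"} / disk_total_bytes{application="order-service"}

# Connection pool saturation
hikaricp_connections_active{application="order-service"}
hikaricp_connections_pending{application="order-service"}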

Postmortem Template

# Incident Report: [Title]

## Overview
- Occurrence: YYYY-MM-DD HH:MM ~ HH:MM (UTC)
- Impact Scope: [Service name, number of users]
- Severity: [Critical/High/Medium/Low]

## Timeline
- HH:MM - First alert triggered
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Fix deployed
- HH:MM - Normal operation confirmed

## Root Cause
[Detailed explanation]

## Resolution
[Actions taken]

## Impact
- Error rate: X%
- Affected requests: N

## Lessons Learned
### What Went Well
-

### What Could Be Improved
-

## Action Items
- [ ] [Owner] Action description (Deadline)

Series Conclusion

Topics covered in this series:

| Part | Topic | Key Point |
|------|-------|-----------|
| 1 | OpenTelemetry | Instrumentation basics |
| 2 | Distributed Tracing | Request flow visualization |
| 3 | Structured Logging | Searchable logs |
| 4 | Metrics/Alerting | Proactive monitoring |
| 5 | Debugging | Real-world problem solving |

Observability is not just monitoring. It’s the ability to understand your system, prevent problems, and resolve them quickly.
