Cloud-Native Observability Stack Part 5 - Debugging Production Issues with Observability Data
Series Introduction
- Part 1: OpenTelemetry Instrumentation
- Part 2: Distributed Tracing Across Microservices
- Part 3: Structured Logging with Correlation IDs
- Part 4: Metrics and Alerting with Prometheus/Grafana
- Part 5: Debugging Production Issues with Observability Data (Current)
Debugging Workflow
MELT Approach
Metrics → Events → Logs → Traces
- Detect problems with Metrics
- Identify timing with Events/Alerts
- Get details with Logs
- Track request flow with Traces
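The glue between these signals is the trace ID, the correlation ID from Part 3. Below is a minimal sketch (assuming OpenTelemetry's API and SLF4J MDC on Spring Boot 3 with jakarta.servlet; the filter itself is hypothetical) of copying the current trace ID into every log line, so a metric spike can be followed to a trace and from there to its logs:
import io.opentelemetry.api.trace.Span
import jakarta.servlet.FilterChain
import jakarta.servlet.http.HttpServletRequest
import jakarta.servlet.http.HttpServletResponse
import org.slf4j.MDC
import org.springframework.stereotype.Component
import org.springframework.web.filter.OncePerRequestFilter

// Hypothetical filter: copies the active OpenTelemetry trace ID into the
// logging MDC so Loki queries like |= "traceId=..." line up with Jaeger traces
@Component
class TraceIdMdcFilter : OncePerRequestFilter() {
    override fun doFilterInternal(
        request: HttpServletRequest,
        response: HttpServletResponse,
        filterChain: FilterChain
    ) {
        MDC.put("traceId", Span.current().spanContext.traceId)
        try {
            filterChain.doFilter(request, response)
        } finally {
            MDC.remove("traceId")
        }
    }
}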
Real-World Failure Scenarios
Scenario 1: Intermittent Timeouts
Symptom: Some order creation requests time out after 30 seconds
Step 1: Check Metrics
# Check P99 latency spike
histogram_quantile(0.99,
sum(rate(http_server_requests_seconds_bucket{uri="/api/orders"}[5m])) by (le)
)
Grafana observation: P99 latency spikes to 30 seconds during certain time periods
Step 2: Trace Analysis
Search for slow requests in Jaeger:
service=order-service minDuration=10s
Discovery: inventory.checkStock span takes 29 seconds
Step 3: Log Investigation
{service="inventory-service"} | json | latency > 10000
Discovery: Database query is slow for a specific product ID
Step 4: Root Cause
-- Check execution plan
EXPLAIN ANALYZE SELECT * FROM inventory WHERE product_id = 'PROD-12345';
Cause: Missing index on product_id
Resolution:
CREATE INDEX idx_inventory_product_id ON inventory(product_id);
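Beyond adding the index, the lookup can be made easier to debug next time by wrapping it in an explicit span that carries the product ID. This is a sketch under assumptions: the service class and inventoryRepository.countByProductId are hypothetical; only the span name inventory.checkStock comes from the trace above.
import io.opentelemetry.api.GlobalOpenTelemetry
import org.springframework.stereotype.Service

@Service
class InventoryService(private val inventoryRepository: InventoryRepository) {
    private val tracer = GlobalOpenTelemetry.getTracer("inventory-service")

    fun checkStock(productId: String): Long {
        // Manual span: the product ID becomes a searchable tag in Jaeger
        val span = tracer.spanBuilder("inventory.checkStock")
            .setAttribute("product.id", productId)
            .startSpan()
        return try {
            span.makeCurrent().use {
                inventoryRepository.countByProductId(productId)  // hypothetical repository method
            }
        } finally {
            span.end()
        }
    }
}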
Scenario 2: Memory Leak
Symptom: Service periodically restarts due to OOM
Step 1: Check Metrics
# Heap memory usage trend
jvm_memory_used_bytes{area="heap",application="order-service"}
Pattern discovered: Memory gradually increases then drops suddenly (restart)
Step 2: GC Log Analysis
# GC frequency increase
rate(jvm_gc_pause_seconds_count{application="order-service"}[5m])
Discovery: Full GC frequency is steadily increasing
Step 3: Heap Dump Analysis
# Generate heap dump
jmap -dump:format=b,file=heapdump.hprof <pid>
# Analyze with MAT or VisualVM
Discovery: OrderCache objects consume 80% of memory
Step 4: Code Review
// Problematic code
@Component
class OrderCache {
private val cache = ConcurrentHashMap<String, Order>()
fun put(orderId: String, order: Order) {
cache[orderId] = order // No removal logic!
}
}
Resolution:
import com.github.benmanes.caffeine.cache.Caffeine
import org.springframework.stereotype.Component
import java.time.Duration

@Component
class OrderCache {
    // Bounded, time-limited cache: Caffeine evicts entries by size and age
    private val cache = Caffeine.newBuilder()
        .maximumSize(10_000)
        .expireAfterWrite(Duration.ofHours(1))
        .build<String, Order>()

    fun put(orderId: String, order: Order) = cache.put(orderId, order)
}
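A possible follow-up (not part of the original fix): with Micrometer on the classpath, the same cache can be registered via CaffeineCacheMetrics so its size and eviction counts appear next to the heap metrics from Step 1, rather than only in a heap dump. Note that recordStats() is required for the hit/miss meters.
import com.github.benmanes.caffeine.cache.Caffeine
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.binder.cache.CaffeineCacheMetrics
import java.time.Duration
import org.springframework.stereotype.Component

@Component
class OrderCache(registry: MeterRegistry) {
    private val cache = Caffeine.newBuilder()
        .maximumSize(10_000)
        .expireAfterWrite(Duration.ofHours(1))
        .recordStats()  // needed for cache hit/miss meters
        .build<String, Order>()
        // registers cache size, gets, and eviction meters tagged cache="orderCache"
        .also { CaffeineCacheMetrics.monitor(registry, it, "orderCache") }

    fun put(orderId: String, order: Order) = cache.put(orderId, order)
}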
Scenario 3: Cascading Failure Between Services
Symptom: A Payment Service failure cascades into an outage of the entire system
Step 1: Check Dependency Graph
Observed in Jaeger Service Map:
- Order Service → Payment Service (synchronous call)
- Payment Service failure causes Order Service thread blocking
Step 2: Check Metrics
# Connection pool exhaustion
hikaricp_connections_active{application="order-service"}
hikaricp_connections_pending{application="order-service"}
Discovery: All connections are in waiting state during Payment Service timeout
Step 3: Log Investigation
{service="order-service"} |= "Connection pool exhausted"
Resolution: Apply Circuit Breaker Pattern
@Service
class PaymentClient(
    private val paymentApi: PaymentApi,        // external payment API client (assumed collaborator)
    private val paymentQueue: PaymentQueue,    // retry queue for deferred payments (assumed collaborator)
    circuitBreakerFactory: Resilience4JCircuitBreakerFactory
) {
    private val circuitBreaker = circuitBreakerFactory.create("payment")

    fun processPayment(order: Order): PaymentResult {
        return circuitBreaker.run(
            { paymentApi.charge(order) },
            { _ -> handleFallback(order) }  // fallback is invoked with the failure cause
        )
    }

    private fun handleFallback(order: Order): PaymentResult {
        // Add to the payment queue for later processing
        paymentQueue.add(order)
        return PaymentResult.PENDING
    }
}
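The fallback alone does not stop caller threads from waiting out the full 30-second timeout, so the factory should also be configured with a short time limit. The following sketch assumes the Spring Cloud CircuitBreaker Resilience4j starter; the 2-second timeout and 50% failure threshold are illustrative values, not taken from the original incident.
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig
import io.github.resilience4j.timelimiter.TimeLimiterConfig
import java.time.Duration
import org.springframework.cloud.circuitbreaker.resilience4j.Resilience4JCircuitBreakerFactory
import org.springframework.cloud.circuitbreaker.resilience4j.Resilience4JConfigBuilder
import org.springframework.cloud.client.circuitbreaker.Customizer
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

@Configuration
class CircuitBreakerDefaults {
    @Bean
    fun defaultCustomizer(): Customizer<Resilience4JCircuitBreakerFactory> =
        Customizer { factory ->
            factory.configureDefault { id ->
                Resilience4JConfigBuilder(id)
                    // fail fast instead of holding a thread for the full 30-second timeout
                    .timeLimiterConfig(
                        TimeLimiterConfig.custom()
                            .timeoutDuration(Duration.ofSeconds(2))
                            .build()
                    )
                    // open the circuit once half of recent calls fail
                    .circuitBreakerConfig(
                        CircuitBreakerConfig.custom()
                            .failureRateThreshold(50f)
                            .waitDurationInOpenState(Duration.ofSeconds(30))
                            .build()
                    )
                    .build()
            }
        }
}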
Debugging Toolkit
1. Distributed Trace Search Queries
# Slow requests
service=order-service minDuration=1s
# Error requests
service=order-service error=true
# Specific user
service=order-service tag.customer.id=CUST-123
2. Useful PromQL Queries
# Find services with error rate spikes
topk(5,
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (application)
)
# Endpoints with latency spikes
topk(5,
histogram_quantile(0.99,
sum(rate(http_server_requests_seconds_bucket[5m])) by (uri, le)
)
)
# Top services by memory usage
topk(5,
jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"}
)
3. Useful LogQL Queries
# Aggregate error logs
sum by (errorType) (
count_over_time({service="order-service"} | json | level="ERROR" [1h])
)
# All logs for a specific traceId
{service=~".+"} |= "traceId=abc123"
# Slow query logs
{service=~".+"} | json | queryTime > 1000
On-Call Playbook
When Service Is Down
- Immediate Verification
  - Check the `up{job="spring-boot-apps"}` metric
  - Check Pod status: `kubectl get pods`
- Check Recent Changes
  - Recent deployment history
  - Configuration changes
- Log Investigation
  - Check startup logs for errors
  - Verify whether an OOM occurred
- Rollback Decision
  - If quick recovery is needed, roll back to the previous version
When Performance Degrades
- Determine Impact Scope
  - Entire service? Specific endpoint?
- Identify Bottleneck
  - Check slow spans via traces
  - External dependency issue?
- Resource Check
  - CPU, memory, disk I/O
  - Connection pool status
- Temporary Measures
  - Scale out
  - Apply rate limiting
Postmortem Template
# Incident Report: [Title]
## Overview
- Occurrence: YYYY-MM-DD HH:MM ~ HH:MM (UTC)
- Impact Scope: [Service name, number of users]
- Severity: [Critical/High/Medium/Low]
## Timeline
- HH:MM - First alert triggered
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Fix deployed
- HH:MM - Normal operation confirmed
## Root Cause
[Detailed explanation]
## Resolution
[Actions taken]
## Impact
- Error rate: X%
- Affected requests: N
## Lessons Learned
### What Went Well
-
### What Could Be Improved
-
## Action Items
- [ ] [Owner] Action description (Deadline)
Series Conclusion
Topics covered in this series:
| Part | Topic | Key Point |
|---|---|---|
| 1 | OpenTelemetry | Instrumentation basics |
| 2 | Distributed Tracing | Request flow visualization |
| 3 | Structured Logging | Searchable logs |
| 4 | Metrics/Alerting | Proactive monitoring |
| 5 | Debugging | Real-world problem solving |
Observability is not just monitoring. It’s the ability to understand your system, prevent problems, and resolve them quickly.