Cloud-Native Observability Stack Part 4 - Metrics and Alerting with Prometheus/Grafana
Series Introduction
- Part 1: OpenTelemetry Instrumentation
- Part 2: Distributed Tracing Across Microservices
- Part 3: Structured Logging with Correlation IDs
- Part 4: Metrics and Alerting with Prometheus/Grafana (this post)
- Part 5: Debugging Production Issues with Observability Data
Importance of Metrics
Metrics express the health of a system as numbers:
- Throughput (requests handled per unit of time)
- Latency (response time)
- Error rate
- Resource usage (CPU, memory)
Spring Boot + Micrometer Setup
Adding Dependencies
// build.gradle.kts
dependencies {
    implementation("org.springframework.boot:spring-boot-starter-actuator")
    implementation("io.micrometer:micrometer-registry-prometheus")
}
Application Configuration
# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  endpoint:
    health:
      show-details: always
  metrics:
    tags:
      application: order-service
      environment: production
    distribution:
      percentiles-histogram:
        http.server.requests: true
      slo:
        http.server.requests: 100ms,500ms,1000ms
Built-in Metrics
HTTP Request Metrics
http_server_requests_seconds_count{method="POST",uri="/api/orders",status="200"}
http_server_requests_seconds_sum{method="POST",uri="/api/orders",status="200"}
http_server_requests_seconds_bucket{method="POST",uri="/api/orders",status="200",le="0.1"}
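The `_count`, `_sum`, and `_bucket` series above are cumulative counters that belong to one histogram, so useful figures come from combining them. The following is a minimal sketch with invented sample values; the `HistogramSnapshot` type and both helper functions are hypothetical names used only for illustration.

```kotlin
// Hypothetical snapshot of the three series for POST /api/orders (values are invented).
data class HistogramSnapshot(
    val count: Double,         // http_server_requests_seconds_count
    val sumSeconds: Double,    // http_server_requests_seconds_sum
    val bucketLe100ms: Double  // http_server_requests_seconds_bucket{le="0.1"}
)

// Mean latency in seconds: total observed time divided by the number of requests.
fun meanLatencySeconds(s: HistogramSnapshot): Double =
    if (s.count == 0.0) 0.0 else s.sumSeconds / s.count

// Fraction of requests that completed within 100 ms (bucket counts are cumulative).
fun fractionWithin100ms(s: HistogramSnapshot): Double =
    if (s.count == 0.0) 1.0 else s.bucketLe100ms / s.count

fun main() {
    val snapshot = HistogramSnapshot(count = 200.0, sumSeconds = 30.0, bucketLe100ms = 150.0)
    println(meanLatencySeconds(snapshot))   // 0.15
    println(fractionWithin100ms(snapshot))  // 0.75
}
```

In practice the same ratios are usually computed in PromQL over `rate()` of each series, so that they describe a recent time window rather than the whole process lifetime.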
JVM Metrics
jvm_memory_used_bytes{area="heap",id="G1 Eden Space"}
jvm_gc_pause_seconds_count{action="end of minor GC",cause="G1 Evacuation Pause"}
jvm_threads_live_threads
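Custom Metrics Implementation
Counter
import io.micrometer.core.instrument.Counter
import io.micrometer.core.instrument.MeterRegistry
import org.springframework.stereotype.Service

@Service
class OrderMetrics(private val meterRegistry: MeterRegistry) {
    private val ordersCreated = Counter.builder("orders.created")
        .description("Total number of orders created")
        .tag("service", "order-service")
        .register(meterRegistry)

    fun recordOrderCreated() {
        ordersCreated.increment()
    }

    // Every "orders.failed" series must carry the same tag keys (service, reason):
    // the Prometheus registry does not accept one meter name with differing tag keys.
    // register() returns the existing counter when the name and tags already match.
    // Keep `reason` to a small fixed set of values to limit label cardinality.
    fun recordOrderFailed(reason: String) {
        Counter.builder("orders.failed")
            .description("Total number of failed orders")
            .tag("service", "order-service")
            .tag("reason", reason)
            .register(meterRegistry)
            .increment()
    }
}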
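A counter only ever increases while the process is alive, so dashboards read it through `rate()`. The sketch below is a simplified model of that calculation, including the reset that happens when a service restarts. The real Prometheus `rate()` also extrapolates to the window boundaries, which is omitted here; `Sample` and `simpleRate` are hypothetical names.

```kotlin
// One scraped value of a counter such as orders_created_total.
data class Sample(val timestampSeconds: Double, val value: Double)

// Simplified per-second rate: sum the increases between consecutive samples,
// treating any drop as a counter reset (the process restarted from zero).
fun simpleRate(samples: List<Sample>): Double {
    if (samples.size < 2) return 0.0
    var increase = 0.0
    for ((prev, curr) in samples.zipWithNext()) {
        increase += if (curr.value >= prev.value) curr.value - prev.value else curr.value
    }
    val elapsed = samples.last().timestampSeconds - samples.first().timestampSeconds
    return if (elapsed <= 0.0) 0.0 else increase / elapsed
}

fun main() {
    // Scraped every 15 s; the service restarts before the fourth scrape.
    val samples = listOf(
        Sample(0.0, 100.0), Sample(15.0, 130.0), Sample(30.0, 160.0), Sample(45.0, 20.0)
    )
    // Increases are 30 + 30 + 20 over 45 seconds.
    println(simpleRate(samples))
}
```

This is why application code should only call `increment()` and never try to reset or decrement a counter: the query side already copes with restarts.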
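Gauge
import io.micrometer.core.instrument.Gauge
import io.micrometer.core.instrument.MeterRegistry
import org.springframework.stereotype.Component

@Component
class QueueMetrics(
    meterRegistry: MeterRegistry,
    private val orderQueue: OrderQueue
) {
    init {
        // The lambda is evaluated each time the gauge is read, so it always reports the current size
        Gauge.builder("order.queue.size", orderQueue) { queue ->
            queue.size().toDouble()
        }
            .description("Current size of order processing queue")
            .register(meterRegistry)
    }
}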
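Timer
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.Timer
import org.springframework.stereotype.Service

@Service
class PaymentService(
    meterRegistry: MeterRegistry,
    // Injected so that paymentGateway.charge(...) below resolves; the type name is assumed
    private val paymentGateway: PaymentGateway
) {
    private val paymentTimer = Timer.builder("payment.processing.time")
        .description("Time taken to process payments")
        .publishPercentiles(0.5, 0.95, 0.99)
        .register(meterRegistry)

    fun processPayment(order: Order): PaymentResult {
        // recordCallable times the block and returns its result
        return paymentTimer.recordCallable {
            // Payment processing logic
            paymentGateway.charge(order.customerId, order.totalAmount)
        }!!
    }
}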
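One caveat with `publishPercentiles`: each application instance computes its own percentiles, and those precomputed values cannot be combined meaningfully across instances. The sketch below uses invented traffic to show how far an averaged p99 can drift from the p99 over all requests. The `percentile` helper uses the nearest-rank definition, which is one common choice and not necessarily what Micrometer uses internally.

```kotlin
import kotlin.math.ceil

// Nearest-rank percentile over raw observations.
fun percentile(values: List<Double>, q: Double): Double {
    require(values.isNotEmpty() && q > 0.0 && q <= 1.0)
    val sorted = values.sorted()
    val rank = ceil(q * sorted.size).toInt().coerceIn(1, sorted.size)
    return sorted[rank - 1]
}

fun main() {
    // Hypothetical traffic: instance A serves 900 fast requests, instance B serves 100 slow ones.
    val instanceA = List(900) { 0.1 }
    val instanceB = List(100) { 1.0 }

    val averagedP99 = (percentile(instanceA, 0.99) + percentile(instanceB, 0.99)) / 2
    val trueP99 = percentile(instanceA + instanceB, 0.99)

    println("average of per-instance p99: $averagedP99")
    println("p99 over all requests:       $trueP99")
}
```

When a percentile has to be aggregated across instances, histogram buckets (as enabled for `http.server.requests` in the application configuration) combined with `histogram_quantile` are the safer choice.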
Distribution Summary
import io.micrometer.core.instrument.DistributionSummary
import io.micrometer.core.instrument.MeterRegistry
import org.springframework.stereotype.Service
import java.math.BigDecimal

@Service
class OrderAnalytics(meterRegistry: MeterRegistry) {
    private val orderAmountSummary = DistributionSummary.builder("order.amount")
        .description("Distribution of order amounts")
        .baseUnit("KRW")
        .publishPercentiles(0.5, 0.75, 0.95)
        .register(meterRegistry)

    fun recordOrderAmount(amount: BigDecimal) {
        orderAmountSummary.record(amount.toDouble())
    }
}
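Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
# Load the alerting rules file that the Docker Compose setup mounts into the container
rule_files:
  - /etc/prometheus/alert-rules.yml
# Forward firing alerts to the Alertmanager container
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
scrape_configs:
  - job_name: 'spring-boot-apps'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets:
          - 'order-service:8080'
          - 'payment-service:8081'
          - 'inventory-service:8082'
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']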
Grafana Dashboards
RED Method Dashboard
Rate, Errors, Duration - the service perspective:
# Request Rate
sum(rate(http_server_requests_seconds_count{application="order-service"}[5m]))
# Error Rate
sum(rate(http_server_requests_seconds_count{application="order-service",status=~"5.."}[5m]))
/
sum(rate(http_server_requests_seconds_count{application="order-service"}[5m]))
# Duration (P99)
histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket{application="order-service"}[5m])) by (le))
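The P99 query estimates a percentile from bucket counts; it does not read it from raw observations. The sketch below approximates the linear interpolation that `histogram_quantile` performs within the bucket containing the requested rank. It is a simplified model under stated assumptions (classic buckets, a `+Inf` bucket present, non-negative bounds), and `Bucket` and `histogramQuantile` are hypothetical names.

```kotlin
// One cumulative histogram bucket: number of observations with value <= le.
data class Bucket(val le: Double, val cumulativeCount: Double)

fun histogramQuantile(q: Double, buckets: List<Bucket>): Double {
    require(q > 0.0 && q <= 1.0) { "q must be in (0, 1]" }
    val sorted = buckets.sortedBy { it.le }
    require(sorted.size >= 2 && sorted.last().le.isInfinite()) { "need finite buckets plus +Inf" }

    val total = sorted.last().cumulativeCount
    if (total == 0.0) return Double.NaN

    val rank = q * total
    val index = sorted.indexOfFirst { it.cumulativeCount >= rank }
    val bucket = sorted[index]

    // Observations in the +Inf bucket have no upper bound, so report the highest finite bound.
    if (bucket.le.isInfinite()) return sorted[index - 1].le

    val lowerBound = if (index == 0) 0.0 else sorted[index - 1].le
    val countBelow = if (index == 0) 0.0 else sorted[index - 1].cumulativeCount
    // Assume observations are spread evenly inside the bucket.
    return lowerBound + (bucket.le - lowerBound) * (rank - countBelow) / (bucket.cumulativeCount - countBelow)
}

fun main() {
    // Invented counts for buckets at 100 ms, 500 ms, 1 s, and +Inf.
    val buckets = listOf(
        Bucket(0.1, 900.0), Bucket(0.5, 980.0), Bucket(1.0, 995.0),
        Bucket(Double.POSITIVE_INFINITY, 1000.0)
    )
    // Rank 990 falls in the (0.5, 1.0] bucket, so the estimate lies between 0.5 s and 1 s.
    println(histogramQuantile(0.99, buckets))
}
```

Because the result is interpolated, its accuracy depends on where the bucket boundaries sit; placing boundaries at the latency thresholds you care about keeps the estimate tight around those thresholds.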
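USE Method Dashboard
Utilization, Saturation, Errors - the resource perspective:
# CPU Utilization
system_cpu_usage{application="order-service"}
# Memory Utilization (summed over heap pools, because individual pools can report max = -1)
sum by (instance) (jvm_memory_used_bytes{application="order-service",area="heap"})
/
sum by (instance) (jvm_memory_max_bytes{application="order-service",area="heap"})
# Connection Pool Saturation (threads waiting for a HikariCP connection)
hikaricp_connections_pending{application="order-service"}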
SLI/SLO Definition
Service Level Indicators
# SLI definitions
slis:
  - name: availability
    query: |
      sum(rate(http_server_requests_seconds_count{status!~"5.."}[5m]))
      /
      sum(rate(http_server_requests_seconds_count[5m]))
  - name: latency_p99
    query: |
      histogram_quantile(0.99,
        sum(rate(http_server_requests_seconds_bucket[5m])) by (le)
      )
  - name: error_rate
    query: |
      sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
      /
      sum(rate(http_server_requests_seconds_count[5m]))
Service Level Objectives
slos:
  - name: availability
    target: 99.9%
    window: 30d
  - name: latency_p99
    target: 500ms
    window: 30d
  - name: error_rate
    target: 0.1%
    window: 30d
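An availability target implies an error budget: the amount of failure that is still acceptable within the window. The sketch below works through the arithmetic for the 99.9% / 30d objective above. The function names and the request counts in `main` are invented for illustration.

```kotlin
import java.time.Duration
import kotlin.math.roundToLong

// Time the service may be unavailable within the window while still meeting the target.
fun errorBudget(targetAvailability: Double, window: Duration): Duration {
    require(targetAvailability > 0.0 && targetAvailability <= 1.0)
    val budgetMillis = window.toMillis() * (1.0 - targetAvailability)
    return Duration.ofMillis(budgetMillis.roundToLong())
}

// Share of the request-based error budget already spent (1.0 means fully spent).
fun budgetConsumed(totalRequests: Long, failedRequests: Long, targetAvailability: Double): Double {
    val allowedFailures = totalRequests * (1.0 - targetAvailability)
    return if (allowedFailures == 0.0) Double.POSITIVE_INFINITY else failedRequests / allowedFailures
}

fun main() {
    val budget = errorBudget(0.999, Duration.ofDays(30))
    println(budget.toMinutes())  // 43 (the full budget is 43 min 12 s)

    // Hypothetical month: 1,000,000 requests, 250 of them failed, roughly a quarter of the budget.
    println(budgetConsumed(totalRequests = 1_000_000, failedRequests = 250, targetAvailability = 0.999))
}
```

Tracking the consumed share over the window gives an earlier warning than waiting for the SLO itself to be missed.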
Alerting Setup
Prometheus Alerting Rules
# alert-rules.yml
groups:
  - name: order-service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{application="order-service",status=~"5.."}[5m]))
          /
          sum(rate(http_server_requests_seconds_count{application="order-service"}[5m]))
          > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_server_requests_seconds_bucket{application="order-service"}[5m])) by (le)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P99 latency is {{ $value }}s"
      - alert: PodDown
        expr: up{job="spring-boot-apps"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
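The `for:` clause means an alert does not fire the first time its expression is true. It stays pending until the expression has held for the whole duration, and a single false evaluation resets it. The sketch below models that life cycle in a simplified form; `AlertRule` and `AlertState` are hypothetical names, not Prometheus APIs.

```kotlin
enum class AlertState { INACTIVE, PENDING, FIRING }

// Simplified model of one alerting rule with a `for` duration.
class AlertRule(private val forSeconds: Long) {
    private var activeSince: Long? = null

    fun evaluate(nowSeconds: Long, conditionTrue: Boolean): AlertState {
        if (!conditionTrue) {
            // Any false evaluation clears the pending timer.
            activeSince = null
            return AlertState.INACTIVE
        }
        // Remember when the condition first became true.
        val since = activeSince ?: nowSeconds.also { activeSince = it }
        return if (nowSeconds - since >= forSeconds) AlertState.FIRING else AlertState.PENDING
    }
}

fun main() {
    val rule = AlertRule(forSeconds = 300)  // for: 5m
    // Evaluate every 60 s while the condition stays true: PENDING until t=300, then FIRING.
    for (t in 0L..300L step 60L) {
        println("t=$t -> ${rule.evaluate(t, conditionTrue = true)}")
    }
}
```

A longer `for` reduces noise from short spikes at the cost of slower notification, which is why `PodDown` uses 1m while the rate-based alerts use 5m.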
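Slack Notification Setup
# alertmanager.yml
# A Slack webhook must also be configured (global slack_api_url or a per-receiver api_url).
route:
  receiver: 'slack-warnings'  # default receiver; it must match a name under receivers
  routes:
    - match:
        severity: critical
      receiver: 'slack-critical'
    - match:
        severity: warning
      receiver: 'slack-warnings'
receivers:
  - name: 'slack-critical'
    slack_configs:
      - channel: '#alerts-critical'
        send_resolved: true
        # Assumed templates (the original expressions were not preserved)
        title: '{{ .CommonLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        send_resolved: true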
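Alertmanager picks a receiver by walking the child routes in order and taking the first one whose matchers all equal the alert's labels; if none match, the route's own receiver is used. The sketch below models that lookup for a flat route list like the one above. It ignores nested routes and the `continue` option, and `Route`, `resolveReceiver`, and the `default-receiver` name are hypothetical.

```kotlin
// One child route: every key/value in `match` must equal the alert's label.
data class Route(val match: Map<String, String>, val receiver: String)

// First matching child route wins; otherwise fall back to the default receiver.
fun resolveReceiver(
    labels: Map<String, String>,
    routes: List<Route>,
    defaultReceiver: String
): String =
    routes.firstOrNull { route -> route.match.all { (key, value) -> labels[key] == value } }
        ?.receiver
        ?: defaultReceiver

fun main() {
    val routes = listOf(
        Route(mapOf("severity" to "critical"), "slack-critical"),
        Route(mapOf("severity" to "warning"), "slack-warnings")
    )
    println(resolveReceiver(mapOf("alertname" to "PodDown", "severity" to "critical"), routes, "default-receiver"))
    println(resolveReceiver(mapOf("alertname" to "DiskFull", "severity" to "info"), routes, "default-receiver"))
}
```

The second call falls through to the default, which is why the default receiver named in `route.receiver` has to exist in the `receivers` list.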
Complete Docker Compose Setup
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.48.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert-rules.yml:/etc/prometheus/alert-rules.yml
  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
  alertmanager:
    image: prom/alertmanager:v0.26.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
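Summary
Key aspects of metrics and alerting:
| Item | Description |
|---|---|
| Micrometer | Metrics abstraction for Spring Boot |
| RED Method | Rate, Errors, Duration - the service perspective |
| USE Method | Utilization, Saturation, Errors - the resource perspective |
| SLI/SLO | Defines service quality targets |
| Alerting | Automatic threshold-based alerts |
The next post covers debugging production issues with observability data.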
Series Introduction
- Part 1: OpenTelemetry Instrumentation
- Part 2: Distributed Tracing Across Microservices
- Part 3: Structured Logging with Correlation IDs
- Part 4: Metrics and Alerting with Prometheus/Grafana (Current)
- Part 5: Debugging Production Issues with Observability Data
Importance of Metrics
Metrics show the health of your system numerically:
- Throughput
- Latency
- Error Rate
- Resource Usage (CPU, Memory)
Spring Boot + Micrometer Setup
Micrometer provides a vendor-neutral metrics facade for Spring Boot applications, with built-in support for Prometheus.
Custom Metrics Implementation
- Counter: Track cumulative values (orders created, errors)
- Gauge: Track current values (queue size, active connections)
- Timer: Track duration and count (payment processing time)
- Distribution Summary: Track distribution of values (order amounts)
Grafana Dashboards
RED Method Dashboard
Rate, Errors, Duration - Service perspective
USE Method Dashboard
Utilization, Saturation, Errors - Resource perspective
SLI/SLO Definition
Define Service Level Indicators and Objectives to measure and track service quality.
Summary
Key aspects of metrics and alerting:
| Item | Description |
|---|---|
| Micrometer | Spring Boot metrics abstraction |
| RED Method | Rate, Errors, Duration - service perspective |
| USE Method | Utilization, Saturation, Errors - resource perspective |
| SLI/SLO | Service quality objectives definition |
| Alerting | Threshold-based automatic alerts |
In the next post, we’ll cover debugging production issues using observability data.