4 min read


Series Introduction

  1. Part 1: OpenTelemetry Instrumentation
  2. Part 2: Distributed Tracing Across Microservices
  3. Part 3: Structured Logging with Correlation IDs
  4. Part 4: Metrics and Alerting with Prometheus/Grafana (this post)
  5. Part 5: Debugging Production Issues with Observability Data

The Importance of Metrics

Metrics capture a system's health as numbers; the sketch after this list maps each category to a concrete Prometheus series:

  • Throughput (request rate)
  • Latency (response time)
  • Error rate
  • Resource usage (CPU, memory)
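
In a Spring Boot service instrumented with Micrometer, each category maps to concrete Prometheus series. A rough mapping (the names follow Micrometer's defaults and appear throughout this post):

Throughput      → http_server_requests_seconds_count
Latency         → http_server_requests_seconds_bucket (histogram)
Error rate      → http_server_requests_seconds_count{status=~"5.."}
Resource usage  → system_cpu_usage, jvm_memory_used_bytes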

Spring Boot + Micrometer Setup

Micrometer provides a vendor-neutral metrics facade for Spring Boot applications, with built-in support for Prometheus.

Adding Dependencies

dependencies {
    implementation("org.springframework.boot:spring-boot-starter-actuator")
    implementation("io.micrometer:micrometer-registry-prometheus")
}

Application Configuration

management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  endpoint:
    health:
      show-details: always
  metrics:
    tags:
      application: order-service
      environment: production
    distribution:
      percentiles-histogram:
        http.server.requests: true
      slo:
        http.server.requests: 100ms,500ms,1000ms
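
A quick sanity check after restarting with this configuration (assuming the service listens on its default port 8080):

curl -s http://localhost:8080/actuator/prometheus | grep http_server_requests | head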

Built-in Metrics

HTTP Request Metrics

http_server_requests_seconds_count{method="POST",uri="/api/orders",status="200"}
http_server_requests_seconds_sum{method="POST",uri="/api/orders",status="200"}
http_server_requests_seconds_bucket{method="POST",uri="/api/orders",status="200",le="0.1"}

JVM Metrics

jvm_memory_used_bytes{area="heap",id="G1 Eden Space"}
jvm_gc_pause_seconds_count{action="end of minor GC",cause="G1 Evacuation Pause"}
jvm_threads_live_threads

Implementing Custom Metrics

Micrometer provides four main meter types:

  • Counter: cumulative values (orders created, errors)
  • Gauge: current values (queue size, active connections)
  • Timer: durations and counts (payment processing time)
  • Distribution Summary: distributions of values (order amounts)

Counter

@Service
class OrderMetrics(private val meterRegistry: MeterRegistry) {

    private val ordersCreated = Counter.builder("orders.created")
        .description("Total number of orders created")
        .tag("service", "order-service")
        .register(meterRegistry)

    fun recordOrderCreated() {
        ordersCreated.increment()
    }

    // register() is idempotent: it creates one counter per distinct reason and
    // reuses it on subsequent calls. Keep the set of reason values small and
    // bounded - each distinct tag value becomes a separate time series.
    fun recordOrderFailed(reason: String) {
        Counter.builder("orders.failed")
            .description("Total number of failed orders")
            .tag("service", "order-service")
            .tag("reason", reason)
            .register(meterRegistry)
            .increment()
    }
}
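
For context, a sketch of a call site; OrderService, OrderRepository, OrderRequest, and PaymentDeclinedException are hypothetical names for illustration:

@Service
class OrderService(
    private val orderRepository: OrderRepository,    // hypothetical dependency
    private val orderMetrics: OrderMetrics
) {
    fun createOrder(request: OrderRequest): Order =
        try {
            val order = orderRepository.save(request.toOrder())
            orderMetrics.recordOrderCreated()        // count every successful order
            order
        } catch (e: PaymentDeclinedException) {      // hypothetical failure type
            orderMetrics.recordOrderFailed("payment_declined")
            throw e
        }
}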

Gauge

@Component
class QueueMetrics(
    meterRegistry: MeterRegistry,
    private val orderQueue: OrderQueue
) {
    init {
        Gauge.builder("order.queue.size", orderQueue) { queue ->
            queue.size().toDouble()
        }
            .description("Current size of order processing queue")
            .register(meterRegistry)
    }
}
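
One detail worth knowing: Gauge holds only a weak reference to the object it observes. Because orderQueue is a constructor property here, this component keeps it alive; if the gauge were registered on a locally created object instead, that object could be garbage-collected and the gauge would silently report NaN.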

Timer

@Service
class PaymentService(
    private val meterRegistry: MeterRegistry,
    private val paymentGateway: PaymentGateway
) {

    private val paymentTimer = Timer.builder("payment.processing.time")
        .description("Time taken to process payments")
        .publishPercentiles(0.5, 0.95, 0.99)
        .register(meterRegistry)

    fun processPayment(order: Order): PaymentResult {
        // recordCallable times the block and returns its result
        return paymentTimer.recordCallable {
            // Payment processing logic
            paymentGateway.charge(order.customerId, order.totalAmount)
        }!!
    }
}
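
If annotation-driven timing is preferred, Micrometer also ships a @Timed annotation. It only takes effect when a TimedAspect bean is registered and AOP is on the classpath (spring-boot-starter-aop), so treat this as an optional sketch:

@Configuration
class MetricsConfig {
    // Enables @Timed on Spring beans
    @Bean
    fun timedAspect(registry: MeterRegistry) = TimedAspect(registry)
}

@Service
class AnnotatedPaymentService(private val paymentGateway: PaymentGateway) {
    @Timed(value = "payment.processing.time", percentiles = [0.5, 0.95, 0.99])
    fun processPayment(order: Order): PaymentResult =
        paymentGateway.charge(order.customerId, order.totalAmount)
}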

Distribution Summary

@Service
class OrderAnalytics(private val meterRegistry: MeterRegistry) {

    private val orderAmountSummary = DistributionSummary.builder("order.amount")
        .description("Distribution of order amounts")
        .baseUnit("KRW")
        .publishPercentiles(0.5, 0.75, 0.95)
        .register(meterRegistry)

    fun recordOrderAmount(amount: BigDecimal) {
        orderAmountSummary.record(amount.toDouble())
    }
}
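
With publishPercentiles, the percentile values are precomputed in the application and exposed as extra gauges. The scraped output looks roughly like this (series names follow Micrometer's Prometheus naming with the KRW base unit; the numbers are made-up placeholders):

order_amount_KRW{quantile="0.5"}  45000.0
order_amount_KRW{quantile="0.75"} 78000.0
order_amount_KRW{quantile="0.95"} 152000.0
order_amount_KRW_count 1024.0
order_amount_KRW_sum 5.1E7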

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'spring-boot-apps'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets:
        - 'order-service:8080'
        - 'payment-service:8081'
        - 'inventory-service:8082'

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
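
For the alert rules defined later in this post to fire, the same prometheus.yml also needs to load the rule file and know where Alertmanager lives. The path and hostname below match the Docker Compose setup at the end:

rule_files:
  - /etc/prometheus/alert-rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']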

Grafana Dashboards

RED Method Dashboard

Rate, Errors, Duration - the service perspective:

# Request Rate
sum(rate(http_server_requests_seconds_count{application="order-service"}[5m]))

# Error Rate
sum(rate(http_server_requests_seconds_count{application="order-service",status=~"5.."}[5m]))
/
sum(rate(http_server_requests_seconds_count{application="order-service"}[5m]))

# Duration (P99)
histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket{application="order-service"}[5m])) by (le))
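
These histogram_quantile queries only work because percentiles-histogram: true in the application configuration exports the underlying _bucket series.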

USE Method Dashboard

Utilization, Saturation, Errors - the resource perspective:

# CPU Utilization
system_cpu_usage{application="order-service"}

# Memory Utilization
jvm_memory_used_bytes{application="order-service",area="heap"}
/
jvm_memory_max_bytes{application="order-service",area="heap"}

# Connection Pool Saturation (HikariCP)
hikaricp_connections_pending{application="order-service"}

SLI/SLO Definition

Service Level Indicators

# SLI definitions (illustrative schema, not a specific tool's format)
slis:
  - name: availability
    query: |
      sum(rate(http_server_requests_seconds_count{status!~"5.."}[5m]))
      /
      sum(rate(http_server_requests_seconds_count[5m]))

  - name: latency_p99
    query: |
      histogram_quantile(0.99,
        sum(rate(http_server_requests_seconds_bucket[5m])) by (le)
      )

  - name: error_rate
    query: |
      sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
      /
      sum(rate(http_server_requests_seconds_count[5m]))

Service Level Objectives

slos:
  - name: availability
    target: 99.9%
    window: 30d

  - name: latency_p99
    target: 500ms
    window: 30d

  - name: error_rate
    target: 0.1%
    window: 30d
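
These targets translate directly into error budgets. For example, the 99.9% availability objective over a 30-day window allows roughly:

0.001 × 30 days × 24 h × 60 min ≈ 43.2 minutes of downtime per month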

Alerting Setup

Prometheus Alert Rules

# alert-rules.yml
groups:
  - name: order-service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{application="order-service",status=~"5.."}[5m]))
          /
          sum(rate(http_server_requests_seconds_count{application="order-service"}[5m]))
          > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is "

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_server_requests_seconds_bucket{application="order-service"}[5m])) by (le)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P99 latency is s"

      - alert: PodDown
        expr: up{job="spring-boot-apps"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"

Slack Notification Setup

# alertmanager.yml
global:
  # Required for slack_configs; replace with your incoming-webhook URL
  slack_api_url: 'https://hooks.slack.com/services/REPLACE/ME'

route:
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'slack-critical'
    - match:
        severity: warning
      receiver: 'slack-warnings'

receivers:
  # Catch-all default for alerts that match no route above
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true

  - name: 'slack-critical'
    slack_configs:
      - channel: '#alerts-critical'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        send_resolved: true

Complete Docker Compose Setup

version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.48.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert-rules.yml:/etc/prometheus/alert-rules.yml

  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources

  alertmanager:
    image: prom/alertmanager:v0.26.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
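
Bring the stack up and check each UI (Grafana's initial login is admin, with the password set above):

docker compose up -d
# Prometheus:   http://localhost:9090
# Grafana:      http://localhost:3000
# Alertmanager: http://localhost:9093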

Summary

Key points for metrics and alerting:

Item         Description
Micrometer   Metrics abstraction for Spring Boot
RED Method   Rate, Errors, Duration - the service perspective
USE Method   Utilization, Saturation, Errors - the resource perspective
SLI/SLO      Definitions of service quality targets
Alerting     Threshold-based automatic notifications

In the next post, we'll cover debugging production issues using observability data.
