The Complete Guide to Database Replication Lag: The 30-Second Production Threshold and 2025 Monitoring Strategies

Why can't I read the data I just wrote?

A user updates their profile and hits refresh, but the change is not there. A few seconds later they refresh again and it finally appears. A support ticket arrives: "Is this a bug?" The logs show the write succeeded. So why was it not visible right away?

Another scenario: the primary database fails. The automatic failover system must promote a replica to primary, but the replica is 5 minutes behind. Failing over in this state loses the last 5 minutes of transactions: thousands of orders, payments, and reward-point credits vanish.

This is the database replication lag problem.

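Why replication lag is dangerous:

Stale reads: data written to the primary cannot yet be read from the replica
Inconsistent user experience: the same user sees different data on different screens
Data loss on failover: the most recent transactions, up to the amount of lag, are lost
Inaccurate analytics: decisions are made on delayed data
Hard to detect: everything looks healthy until the problem suddenly surfaces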

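Production thresholds (commonly used guidance as of 2025):

OLTP systems: 5 seconds or more = warning, 30 seconds or more = critical (failover unsafe)
Analytics replicas: 60 seconds or more = warning, 300 seconds or more = critical
Google Cloud SQL: the replica_lag metric is collected at 1-minute intervals
Azure SQL Database: a replication_lag metric in Public Preview in 2025
PostgreSQL HA: lag above 30 seconds is a red flag (reduced failover readiness)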

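Typical damage:

Data loss: transactions equal to the lag are lost on failover (2 to 5 minutes on average)
User confusion: a surge of "I just saved it, why can't I see it?" tickets
Business impact: points not credited after a completed payment, missing orders
Failed failover: automatic failover is aborted because lag exceeds 30 seconds
Recovery time: 1 to 3 hours on average to find and fix the cause of the lag

This article diagnoses the three main causes of replication lag (slow queries, network latency, and disk I/O) and covers 2025 monitoring strategies and Multi-Threaded Replication tuning.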

Database replication basics

Primary-replica architecture

Replication is the process of copying changes from the primary (master) database to one or more replicas (slaves).

[Primary Database]
 ↓ (Replication)
 ↓ (Write Ahead Log)
 ↓
[Replica Database]

Uses:
- Read scaling (primary: writes, replica: reads)
- High availability (fail over to a replica when the primary fails)
- Backup (take backups from a replica)
- Analytics (run heavy queries on a replica)

What is replication lag?

Definition: the time it takes for a change made on the primary to be applied on the replica

How it is measured:

Replication lag = current time - commit timestamp (on the primary) of the last transaction the replica has applied

Example:
- Primary: transaction T1 commits (14:00:00)
- Replica: T1 finishes applying (14:00:05)
- Replication lag: 5 seconds

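Normal vs. abnormal:

Normal:
- Lag: 0 to 2 seconds (near real time)
- Primary and replica are almost in sync
- Failover is safe

Warning:
- Lag: 5 to 30 seconds
- Stale reads may occur
- Failover may lose some data

Critical:
- Lag: 30 seconds or more
- Failover unsafe (risk of data loss)
- Degraded user experience
- The replica may need to be rebuilt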
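
The definition and the bands above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and since the article does not say how to treat lag between 2 and 5 seconds, the sketch assumes anything under 5 seconds counts as "ok".

```python
from datetime import datetime, timezone

WARNING_SECONDS = 5.0    # warning threshold used in this article
CRITICAL_SECONDS = 30.0  # critical threshold used in this article


def replication_lag_seconds(now, last_applied_commit):
    """Lag = current time minus the commit time of the last applied transaction."""
    return max(0.0, (now - last_applied_commit).total_seconds())


def classify_lag(lag_seconds):
    """Map a lag value onto the bands above (assumption: below 5s is 'ok')."""
    if lag_seconds >= CRITICAL_SECONDS:
        return "critical"
    if lag_seconds >= WARNING_SECONDS:
        return "warning"
    return "ok"


# The T1 example: committed at 14:00:00, applied on the replica at 14:00:05.
committed = datetime(2025, 11, 4, 14, 0, 0, tzinfo=timezone.utc)
applied = datetime(2025, 11, 4, 14, 0, 5, tzinfo=timezone.utc)
lag = replication_lag_seconds(applied, committed)
print(lag, classify_lag(lag))
```

Using timezone-aware timestamps matters in practice, because a clock or timezone mismatch between hosts shows up directly as phantom lag.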

The three main causes of replication lag

Cause #1: Slow queries on the primary

Scenario: a heavy query on the primary blocks the replication stream

Problem query:

-- Runs on the primary (takes 10 seconds)
UPDATE products
SET inventory = inventory - 1
WHERE category = 'electronics'; -- updates 1 million products (no index)

-- Problems:
-- 1. Takes 10 seconds on the primary
-- 2. One 10-second transaction is written to the replication log
-- 3. The replica also spends about 10 seconds applying it
-- 4. Other transactions queue up behind it in the meantime
-- → Lag spikes!

How lag grows:

T=0s: bulk UPDATE starts on the primary
T=5s: lag = 5s (replica starts applying)
T=10s: lag = 10s (replica still applying)
T=15s: lag = 15s (still applying, new transactions piling up)
T=20s: lag = 20s
T=25s: lag = 25s (other transactions waiting)
T=30s: lag = 30s (red flag!)

→ Failover is no longer safe!

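Optimized query:

-- Limit the batch size + use an index
-- Update at most 1,000 ids per transaction. :low and :high are placeholder
-- bind parameters that the caller advances by 1,000 after each batch, so
-- every batch touches a different id range and no row is updated twice.
UPDATE products
SET inventory = inventory - 1
WHERE category = 'electronics'
  AND id BETWEEN :low AND :high;

-- Create the index
CREATE INDEX idx_products_category ON products(category);

-- Result:
-- - Each batch takes about 0.1 seconds
-- - 1,000 batches still take about 100 seconds in total
-- - But each transaction lasts only 0.1 seconds → lag does not accumulate
-- - Lag stays at 1 to 2 seconds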
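
A driver loop for this kind of batching might look like the following sketch. It assumes integer ids and splits the id space into fixed-size ranges; `execute` is a hypothetical callable (for example a thin wrapper around a database cursor that commits after each statement), not part of any specific library.

```python
def id_batches(min_id, max_id, batch_size=1000):
    """Yield inclusive (low, high) id ranges that cover min_id..max_id."""
    low = min_id
    while low <= max_id:
        high = min(low + batch_size - 1, max_id)
        yield low, high
        low = high + 1


BATCH_SQL = (
    "UPDATE products SET inventory = inventory - 1 "
    "WHERE category = 'electronics' AND id BETWEEN %s AND %s"
)


def run_batched_update(execute, min_id, max_id, batch_size=1000):
    """Run one short statement per id range and return the number of batches.

    `execute(sql, params)` is a hypothetical callable supplied by the caller;
    it should commit after each call so every batch is its own transaction.
    """
    batches = 0
    for low, high in id_batches(min_id, max_id, batch_size):
        execute(BATCH_SQL, (low, high))
        batches += 1
    return batches


# Dry run that only records the parameters instead of touching a database.
calls = []
run_batched_update(lambda sql, params: calls.append(params), 1, 2500)
print(calls)
```

Ranges over the id space hold at most `batch_size` rows each, and sparse ids simply make some batches smaller. Adding a short sleep between batches is a common way to give the replica time to keep up.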

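Cause #2: Network latency

Scenario: insufficient bandwidth or high latency between the primary and the replica

Problem:

Primary (Seoul Region)
 ↓ (100 Mbps network)
 ↓ (50ms latency)
 ↓
Replica (Tokyo Region)

Heavy write load:
- 10,000 INSERTs per second
- 1KB per record
- About 10 MB/s must be transferred

However:
- Network: 100 Mbps = 12.5 MB/s
- Effective throughput: ~8 MB/s (protocol overhead)
- 2 MB/s short!

Result:
- The replication stream becomes the bottleneck
- Lag keeps growing
- The replica cannot catch up while the write rate stays this high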
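
The arithmetic above can be reproduced with a short Python sketch. The function names are hypothetical, 1KB is treated as 1,000 bytes, and the 0.64 efficiency factor is simply the value that turns 12.5 MB/s into the ~8 MB/s effective throughput quoted above.

```python
def bandwidth_gap_mb_per_s(writes_per_sec, bytes_per_record, link_mbps, efficiency):
    """Return (required, usable, deficit) in MB/s for a replication link."""
    required = writes_per_sec * bytes_per_record / 1_000_000
    usable = link_mbps / 8 * efficiency  # Mbps -> MB/s, minus overhead
    return required, usable, max(0.0, required - usable)


def backlog_mb(deficit_mb_per_s, seconds):
    """Unsent replication data that accumulates while the deficit persists."""
    return deficit_mb_per_s * seconds


required, usable, deficit = bandwidth_gap_mb_per_s(10_000, 1_000, 100, 0.64)
print(f"required={required:.1f} MB/s usable={usable:.1f} MB/s deficit={deficit:.1f} MB/s")
print(f"backlog after 60s: {backlog_mb(deficit, 60):.0f} MB")
```

At a 2 MB/s deficit the backlog grows by roughly 120 MB every minute, which is why the lag never recovers until either the write rate drops or the link is upgraded.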

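Diagnosis:

-- PostgreSQL: check replication delay (run on the primary)
SELECT
  client_addr,
  state,
  sent_lsn,
  write_lsn,
  flush_lsn,
  replay_lsn,
  pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS send_lag_bytes,
  pg_wal_lsn_diff(sent_lsn, replay_lsn) AS apply_lag_bytes
FROM pg_stat_replication;

-- Output:
-- client_addr | state     | send_lag_bytes
-- ------------|-----------|---------------
-- 10.0.2.100  | streaming | 524288000 ← 500MB behind!

-- send_lag_bytes growing steadily = WAL is generated faster than it can be sent (network bottleneck)
-- apply_lag_bytes growing steadily = the replica receives WAL but replays it too slowly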
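
For readers wondering what `pg_wal_lsn_diff` computes: an LSN such as `16/B374D848` is a 64-bit WAL position written as two hexadecimal halves, and the difference between two LSNs is a byte count. The following Python sketch reproduces that arithmetic client-side; the helper names are hypothetical.

```python
def lsn_to_int(lsn):
    """Convert a PostgreSQL LSN string like '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_lsn_diff(lsn_a, lsn_b):
    """Bytes of WAL between two LSNs (positive when lsn_a is ahead of lsn_b)."""
    return lsn_to_int(lsn_a) - lsn_to_int(lsn_b)


# 0x1F400000 bytes is exactly 500 MiB, the figure from the sample output.
print(wal_lsn_diff("0/1F400000", "0/0"))
```

This can be handy when LSNs are collected as text by a monitoring agent and the subtraction has to happen outside the database.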

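Solutions:

1. Upgrade the network:
 - 100 Mbps → 1 Gbps
 - Or a dedicated link such as Direct Connect or a private VPN line

2. Move the replica:
 - Place it in the same region
 - Latency: 50ms → 1ms

3. Reduce the volume of replication data:
 - PostgreSQL: wal_compression = on (compresses full-page images written to WAL)
 - MySQL: binlog_row_image = minimal (logs only the changed columns; this shrinks events rather than compressing them)

4. Retain more WAL for a lagging replica:
 - max_wal_size = 10GB (default 1GB) spaces out checkpoints under heavy writes
 - The WAL kept for replicas is governed by wal_keep_size and, for replication slots, max_slot_wal_keep_size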

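Cause #3: Disk I/O bottleneck on the replica

Scenario: the replica's disk is slow, so transactions are applied late

Problem:

Primary (SSD: 10,000 IOPS)
 ↓
Replica (HDD: 200 IOPS) ← bottleneck!

Primary:
- Handles 5,000 transactions per second (enough write IOPS)

Replica:
- Handles only 200 transactions per second (not enough disk IOPS)
- 4,800 transactions pile up every second
- Lag spikes!

After 1 second:
- Lag: 4,800 transactions
- And still growing...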

Diagnosis:

# Check disk I/O on the replica server
iostat -x 1

# Output:
# Device r/s w/s %util
# sda 50 180 99.9 ← disk is 100% busy!

# PostgreSQL: check how far replay is behind (run on the replica).
# pg_stat_replication is empty on a replica, so no FROM clause is used.
SELECT
  now() - pg_last_xact_replay_timestamp() AS replication_lag;

# Output:
# replication_lag
# ---------------
# 00:02:35 ← 2 minutes 35 seconds behind!

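Solutions:

1. Upgrade the disk:
 - HDD → SSD
 - 200 IOPS → 10,000 IOPS

2. Use RAID:
 - RAID 10 (performance + redundancy)
 - Striping across mirrored pairs can raise write throughput compared with a single disk

3. Relax log flushing:
 - PostgreSQL: synchronous_commit = off affects commits on the server where it is set (the primary or a logical-replication subscriber); it does not speed up WAL replay on a physical standby
 - MySQL: innodb_flush_log_at_trx_commit = 2 (on the replica)

4. Enable parallel replication:
 - MySQL: slave_parallel_workers = 8
 - PostgreSQL: physical (streaming) replay runs in a single process and max_parallel_workers only affects parallel queries; parallel apply is available for logical replication in PostgreSQL 16+ (streaming = parallel)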

Monitoring PostgreSQL replication lag

The pg_stat_replication view

Check lag in real time:

-- Run on the primary
-- replay_lag (PostgreSQL 10+) is used for the time-based value because
-- pg_last_xact_replay_timestamp() returns NULL on a primary.
SELECT
  application_name,
  client_addr,
  state,
  sync_state,
  pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
  (EXTRACT(EPOCH FROM replay_lag))::INT AS lag_seconds
FROM pg_stat_replication;

-- Output:
-- application_name | client_addr | state     | sync_state | lag_bytes | lag_seconds
-- -----------------|-------------|-----------|------------|-----------|------------
-- replica1         | 10.0.2.100  | streaming | async      | 524288    | 15

-- How to read it:
-- - lag_bytes: 512KB behind (replication log)
-- - lag_seconds: 15 seconds behind (time-based)
-- - sync_state: async (asynchronous replication)

Lag threshold alerts

Prometheus + Alertmanager:

# prometheus.yml
# Lag is measured on the replica side, so scrape the exporter attached to
# each replica in addition to the primary.
scrape_configs:
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-primary:9187']

# Alert rules (kept in a separate rules file loaded through rule_files)
groups:
  - name: database_replication
    interval: 30s
    rules:
      - alert: ReplicationLagHigh
        expr: pg_replication_lag_seconds > 30
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Replication lag is {{ $value }}s"
          description: "PostgreSQL replication lag > 30 seconds for 2 minutes"

      - alert: ReplicationLagWarning
        expr: pg_replication_lag_seconds > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag is {{ $value }}s"

pg_last_xact_replay_timestamp()

Run on the replica:

-- Check when the replica last applied a transaction.
-- No FROM clause: pg_stat_replication has no rows on a replica.
-- Note: when the primary is idle this value keeps growing even though
-- the replica is fully caught up.
SELECT
  now() AS current_time,
  pg_last_xact_replay_timestamp() AS last_replay_time,
  now() - pg_last_xact_replay_timestamp() AS replication_lag;

-- Output:
-- current_time            | last_replay_time        | replication_lag
-- ------------------------|-------------------------|----------------
-- 2025-11-04 15:30:45+00  | 2025-11-04 15:30:15+00  | 00:00:30

-- 30 seconds of lag!

Monitoring MySQL replication lag

SHOW SLAVE STATUS

Basic monitoring:

-- Run on the replica
-- (MySQL 8.0.22+ also provides SHOW REPLICA STATUS, with renamed fields)
SHOW SLAVE STATUS\G

-- Key fields:
-- Slave_IO_Running: Yes ← I/O thread healthy
-- Slave_SQL_Running: Yes ← SQL thread healthy
-- Seconds_Behind_Master: 35 ← 35 seconds of lag!
-- Master_Log_File: mysql-bin.000123
-- Read_Master_Log_Pos: 154789
-- Relay_Log_Space: 524288000 ← 500MB of relay log accumulated

-- Diagnosis:
-- - Seconds_Behind_Master > 30: critical!
-- - Relay_Log_Space keeps growing: the replica is applying too slowly

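How Seconds_Behind_Master is calculated

Seconds_Behind_Master = current time on the replica - the primary's timestamp on the event the SQL thread is currently applying

Example:
- Current time: 15:30:45
- Event being applied from the relay log: 15:30:10 (when it ran on the primary)
- Seconds_Behind_Master = 45 - 10 = 35 seconds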

Watch out for NULL:

-- Seconds_Behind_Master is NULL when:
-- 1. Replication has stopped (the I/O or SQL thread is not running)
-- 2. The network connection is down
-- 3. The replica has failed

-- Check:
SHOW SLAVE STATUS\G

-- Slave_IO_Running: No ← problem!
-- Slave_SQL_Running: Yes
-- Last_IO_Error: error connecting to master ← network problem
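
The checks above can be folded into one small function. The sketch below assumes the `SHOW SLAVE STATUS` row has already been fetched into a Python dict keyed by field name (for example through a dictionary cursor); `diagnose_replica` and its return strings are hypothetical.

```python
def diagnose_replica(status, critical_seconds=30):
    """Summarize a SHOW SLAVE STATUS row given as a dict of field -> value."""
    if status.get("Slave_IO_Running") != "Yes":
        detail = status.get("Last_IO_Error") or "no error text"
        return f"io_thread_stopped: {detail}"
    if status.get("Slave_SQL_Running") != "Yes":
        detail = status.get("Last_SQL_Error") or "no error text"
        return f"sql_thread_stopped: {detail}"

    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # NULL must never be read as "0 seconds behind".
        return "lag_unknown"
    if lag > critical_seconds:
        return f"lag_critical: {lag}s"
    return f"ok: {lag}s"


sample = {
    "Slave_IO_Running": "No",
    "Slave_SQL_Running": "Yes",
    "Seconds_Behind_Master": None,
    "Last_IO_Error": "error connecting to master",
}
print(diagnose_replica(sample))
```

Checking the thread states before the lag value is deliberate: a stopped replica reports NULL, and an alert written as `Seconds_Behind_Master > 30` silently stays quiet in exactly that case.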

Multi-Threaded Replication (parallel replication)

Limits of single-threaded replication

The traditional approach:

Primary (8 Threads)
 ↓ (parallel writes)
 ↓
Replication Log
 ↓
Replica (1 Thread) ← bottleneck!
 ↓ (sequential apply)
 ↓
Slow!

Problem:
- Primary: 10,000 transactions per second (8 threads)
- Replica: 2,000 transactions per second (1 thread)
- Gap: 8,000 transactions per second
- Lag keeps growing!

Enabling Multi-Threaded Replication

MySQL 5.7+:

-- Configure on the replica
-- (MySQL 8.0.26+ renames these variables to replica_parallel_type,
--  replica_parallel_workers, and replica_preserve_commit_order)
STOP SLAVE;

SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK'; -- logical-clock based
SET GLOBAL slave_parallel_workers = 8; -- 8 worker threads
SET GLOBAL slave_preserve_commit_order = 1; -- preserve commit order

START SLAVE;

-- Verify:
SHOW PROCESSLIST;

-- Output:
-- Id | User | Command | State
-- ----|--------------|--------------|------------------
-- 10 | system user | Connect | Waiting for master
-- 11 | system user | Connect | Slave has read all relay log ← SQL thread
-- 12 | system user | Connect | Waiting for an event from Coordinator ← Worker 1
-- 13 | system user | Connect | Waiting for an event from Coordinator ← Worker 2
-- ...
-- 19 | system user | Connect | Waiting for an event from Coordinator ← Worker 8

-- 8 worker threads are running!

Performance improvement (illustrative figures):

Before (single thread):
- Replica throughput: 2,000 TPS

After (8 threads):
- Replica throughput: 12,000 TPS (6x improvement)
- Lag: 35 seconds → 3 seconds (about 12x improvement)
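
Raising replica throughput only clears existing lag if the replica applies faster than the primary writes. A rough catch-up estimate, as a Python sketch: the function name is hypothetical, and the backlog is approximated as lag in seconds times the primary's TPS, which is an assumption rather than something MySQL reports.

```python
def catch_up_seconds(backlog_txns, primary_tps, replica_tps):
    """Seconds needed to drain a backlog, or None if the replica cannot catch up."""
    surplus = replica_tps - primary_tps
    if surplus <= 0:
        return None  # backlog grows (or never shrinks) at these rates
    return backlog_txns / surplus


# 35 seconds of lag at 10,000 TPS is roughly 350,000 queued transactions.
backlog = 35 * 10_000
print(catch_up_seconds(backlog, 10_000, 2_000))   # single-threaded replica
print(catch_up_seconds(backlog, 10_000, 12_000))  # 8 workers
```

With the figures above the single-threaded replica never catches up, while the 12,000 TPS replica needs about three minutes of sustained apply to drain the backlog, so lag falls gradually rather than instantly after the change.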

PostgreSQL Parallel Apply (PostgreSQL 16+)

# postgresql.conf (subscriber)
# Parallel apply applies to logical replication only; the worker counts
# below are example values.
max_worker_processes = 16
max_logical_replication_workers = 8
max_parallel_apply_workers_per_subscription = 4

-- Logical replication parallel apply (PG 16+)
-- The number of parallel workers is set by
-- max_parallel_apply_workers_per_subscription, not per subscription.
ALTER SUBSCRIPTION my_subscription
  SET (streaming = parallel);

-- Verify:
SELECT
  subname,
  substream
FROM pg_subscription;

-- Output:
-- subname         | substream
-- ----------------|----------
-- my_subscription | p          ← 'p' = parallel

2025 monitoring tools

Azure SQL Database Replication Lag Metric

Public Preview in 2025:

Azure Portal → SQL Database → Metrics

Metric: Replication lag (seconds)
- Collection interval: 1 minute
- Scope: lag between primary and secondary
- Unit: seconds

Alert settings:
- Threshold: 30 seconds
- Aggregation: Average (1 minute)
- Action: Email + PagerDuty

Check with the Azure CLI:

# Query replication links (--name is the database name).
# Lag comes back empty if your CLI version does not return a
# replicationLag field on the link.
az sql db replica list-links \
 --resource-group myResourceGroup \
 --server myPrimaryServer \
 --name myDatabase \
 --query "[].{Replica:partnerServer, Lag:replicationLag}"

# Output:
# Replica | Lag
# -----------------|----
# mySecondaryServer| 15 ← 15 seconds of lag

Google Cloud SQL Replication Metrics

The replica_lag metric:

# gcloud command (confirm that this field path appears in your
# instance's describe output before relying on it)
gcloud sql instances describe my-replica \
 --format="value(replicaConfiguration.replicaLag)"

# Output:
# 25 ← 25 seconds of lag

# Monitoring Dashboard
# Metric: cloudsql.googleapis.com/database/replication/replica_lag
# Type: GAUGE
# Unit: seconds

The network_lag metric:

network_lag = time the replica received the change - time the primary sent it

Example:
- Sent by primary: 15:30:00
- Received by replica: 15:30:03
- network_lag: 3 seconds

→ Tells you whether the lag comes from the network or from the replica's apply speed!
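
One way to use the two metrics together is to subtract the network portion from the total. The sketch below assumes that replica_lag includes the time counted by network_lag, which is a simplification for illustration; the function name and result keys are hypothetical.

```python
def split_lag(replica_lag_s, network_lag_s):
    """Split total lag into a network part and an apply part (in seconds)."""
    # Clamp so that noisy samples never yield a negative apply share.
    network = min(max(network_lag_s, 0.0), replica_lag_s)
    apply = replica_lag_s - network
    dominant = "network" if network > apply else "apply"
    return {"network_s": network, "apply_s": apply, "dominant": dominant}


# 25 seconds of total lag, of which 3 seconds are spent on the network.
print(split_lag(25.0, 3.0))
```

When the apply share dominates, the fixes for slow queries and disk I/O are the ones to look at; when the network share dominates, bandwidth and replica placement matter more.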

Percona Monitoring and Management (PMM)

Installation:

# Install PMM Server (Docker)
docker run -d \
 -p 443:443 \
 --name pmm-server \
 -e METRICS_RETENTION=720h \
 percona/pmm-server:latest

# Install PMM Client (on the primary and the replica)
wget https://www.percona.com/downloads/pmm2/pmm-client-latest.tar.gz
tar -xzf pmm-client-latest.tar.gz
sudo ./install

# Add PostgreSQL monitoring
pmm-admin add postgresql \
 --username=pmm \
 --password=password \
 --query-source=pgstatmonitor \
 postgres-primary

pmm-admin add postgresql \
 --username=pmm \
 --password=password \
 --replication-set=postgres-replica \
 postgres-replica

Dashboard:

PMM UI → Dashboards → PostgreSQL Replication

Metrics:
- Replication Lag (seconds)
- Replication Lag (bytes)
- Replay LSN vs Sent LSN
- Replication Slot Size
- Sync State (async/sync)

Alerts:
- Lag > 30s: Critical
- Lag > 5s: Warning
- Replication Stopped: Critical

Making failover safe

Synchronous replication

Asynchronous vs. synchronous:

Async Replication (default):
Primary → Write → Commit (returns immediately) ← fast
 ↓ (sent later)
Replica → Apply

Pro: fast
Con: the replica may be behind when the primary fails (data loss)

Sync Replication:
Primary → Write → Wait (for the replica's confirmation) ← slow
 ↓ (sent immediately)
Replica → Apply → ACK
 ↓
Primary → Commit (after the replica confirms)

Pro: no loss of committed transactions
Con: slow (affected by network latency)

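PostgreSQL Synchronous Replication:

# postgresql.conf (Primary)
synchronous_commit = on # enable synchronous commit
synchronous_standby_names = 'replica1' # name the synchronous replica

# Reload (both settings take effect without a restart)
pg_ctl reload

-- Verify:
SELECT
  application_name,
  sync_state,
  state
FROM pg_stat_replication;

-- Output:
-- application_name | sync_state | state
-- -----------------|------------|----------
-- replica1         | sync       | streaming ← synchronous replication!

-- A commit on the primary now waits until replica1 confirms that it has
-- flushed the WAL to disk
-- → Committed transactions survive a failover (no data loss)
-- → Replay on the replica can still trail slightly; use
--   synchronous_commit = remote_apply to wait for replay as well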

Performance trade-off (illustrative figures):

Async:
- Write Latency: 1ms
- Throughput: 10,000 TPS
- Lag: 1 to 5 seconds
- Failover: some data loss possible

Sync:
- Write Latency: 10ms (adds the network RTT)
- Throughput: 5,000 TPS (50% lower)
- Lag: about 0 seconds for committed data
- Failover: no loss of committed data

Lag-threshold failover policy

Conditions for automatic failover:

import psycopg2

def check_failover_readiness():
    """Check whether the replica is ready for failover."""

    # Connect to the primary
    primary_conn = psycopg2.connect(
        host="primary.example.com",
        database="mydb",
        user="postgres"
    )

    try:
        cursor = primary_conn.cursor()

        # Check replication lag. replay_lag (PostgreSQL 10+) is reported by the
        # primary; it is NULL when the replica is idle and fully caught up, so
        # NULL is treated as 0 seconds here.
        cursor.execute("""
            SELECT
                application_name,
                pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
                COALESCE(EXTRACT(EPOCH FROM replay_lag), 0)::INT AS lag_seconds
            FROM pg_stat_replication
            WHERE application_name = 'replica1'
        """)
        result = cursor.fetchone()
    finally:
        primary_conn.close()

    if not result:
        print("Replica not connected!")
        return False

    app_name, lag_bytes, lag_seconds = result

    if lag_bytes is None:
        print("Replica has not reported a replay position - Failover UNSAFE!")
        return False

    print(f"Replica: {app_name}")
    print(f"Lag: {lag_bytes} bytes, {lag_seconds} seconds")

    # Failover safety thresholds
    MAX_LAG_SECONDS = 30
    MAX_LAG_BYTES = 16 * 1024 * 1024  # 16MB

    if lag_seconds > MAX_LAG_SECONDS:
        print(f"Lag {lag_seconds}s > {MAX_LAG_SECONDS}s - Failover UNSAFE!")
        return False

    if lag_bytes > MAX_LAG_BYTES:
        print(f"Lag {lag_bytes} bytes > {MAX_LAG_BYTES} bytes - Failover UNSAFE!")
        return False

    print("Failover SAFE - Lag within threshold")
    return True

# Usage
if check_failover_readiness():
    print("Proceeding with failover...")
    # perform_failover()
else:
    print("Failover aborted - Lag too high")

Production tuning checklist

Replication settings

PostgreSQL:

# postgresql.conf (Primary)
wal_level = replica # enable replication
max_wal_senders = 10 # up to 10 WAL sender connections
wal_keep_size = 1024MB # keep 1GB of WAL (covers network outages)
max_replication_slots = 10 # replication slots
wal_compression = on # compress WAL (saves bandwidth)

# postgresql.conf (Replica)
hot_standby = on # allow read queries
max_standby_streaming_delay = 30s # wait up to 30s before canceling queries that conflict with replay
wal_receiver_status_interval = 10s # report status every 10 seconds
MySQL:

# my.cnf (Primary)
server-id = 1
log-bin = mysql-bin
binlog_format = ROW
sync_binlog = 1 # durability (at some cost in performance)
binlog_row_image = MINIMAL # saves bandwidth

# my.cnf (Replica)
server-id = 2
relay-log = relay-bin
read_only = 1 # read-only mode
slave_parallel_type = LOGICAL_CLOCK
slave_parallel_workers = 8 # apply with 8 parallel threads
slave_preserve_commit_order = 1

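Monitoring setup

Prometheus Exporter:

# docker-compose.yml
# Lag metrics are reported from the replica side, so run an exporter
# against each replica as well as the primary.
version: '3'
services:
  postgres_exporter:
    image: prometheuscommunity/postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://postgres:password@primary:5432/postgres?sslmode=disable"
    ports:
      - "9187:9187"

  # Note: recent mysqld-exporter releases may need a different connection
  # configuration than DATA_SOURCE_NAME; check the image's documentation.
  mysql_exporter:
    image: prom/mysqld-exporter
    environment:
      DATA_SOURCE_NAME: "exporter:password@(primary:3306)/"
    ports:
      - "9104:9104"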

Grafana Dashboard:

{
  "dashboard": {
    "title": "Database Replication Monitoring",
    "panels": [
      {
        "title": "Replication Lag (seconds)",
        "targets": [
          {
            "expr": "pg_replication_lag_seconds",
            "legendFormat": "{{instance}}"
          }
        ],
        "thresholds": [
          { "value": 5, "color": "yellow" },
          { "value": 30, "color": "red" }
        ]
      },
      {
        "title": "Replication Lag (bytes)",
        "targets": [
          {
            "expr": "pg_replication_lag_bytes",
            "legendFormat": "{{instance}}"
          }
        ]
      }
    ]
  }
}

Alert rules

# Prometheus alert rules file (loaded through rule_files in prometheus.yml;
# Alertmanager only routes the alerts these rules fire)
groups:
  - name: replication
    interval: 30s
    rules:
      # PostgreSQL lag warning
      - alert: PostgreSQLReplicationLagHigh
        expr: pg_replication_lag_seconds > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag is {{ $value }}s"

      # PostgreSQL lag critical
      - alert: PostgreSQLReplicationLagCritical
        expr: pg_replication_lag_seconds > 30
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Replication lag is {{ $value }}s - FAILOVER UNSAFE!"

      # MySQL lag warning
      - alert: MySQLReplicationLagHigh
        expr: mysql_slave_status_seconds_behind_master > 5
        for: 2m
        labels:
          severity: warning

      # Replication stopped: fires when an instance reports it is not a
      # replica, so restrict this rule to instances expected to be replicas
      - alert: ReplicationStopped
        expr: pg_replication_is_replica == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Replication has stopped!"

Debugging scenarios from the field

Scenario 1: Lag suddenly spikes

Symptoms:

Normal: lag of 1 to 2 seconds
Suddenly: lag of 120 seconds

Diagnosis:

-- 1. Look for slow queries on the primary
SELECT
  pid,
  now() - query_start AS duration,
  state,
  query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > INTERVAL '10 seconds'
ORDER BY duration DESC;

-- Output:
-- pid   | duration | state  | query
-- ------|----------|--------|---------------------------
-- 12345 | 00:02:15 | active | UPDATE products SET... ← running for over 2 minutes!

-- 2. Check how much WAL the replication slots are retaining
SELECT
  slot_name,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag_size
FROM pg_replication_slots;

-- Output:
-- slot_name | lag_size
-- ----------|---------
-- replica1  | 2048 MB ← 2GB of WAL accumulated!

Fix:

-- Terminate the slow query (its transaction is rolled back)
SELECT pg_terminate_backend(12345);

-- Check lag again after 30 seconds
-- Lag: 120 seconds → 5 seconds (recovered!)

Scenario 2: Replication stops

Symptoms:

The replica is missing from the pg_stat_replication view

Diagnosis:

# Check the replica's log
tail -f /var/log/postgresql/postgresql-15-main.log

# Output:
# FATAL: could not connect to the primary server: Connection refused
# connection to server failed: Connection refused

Fix:

# 1. Check the firewall on the primary
sudo ufw status

# Make sure port 5432 is allowed
# sudo ufw allow 5432/tcp

# 2. Check pg_hba.conf (Primary)
sudo nano /etc/postgresql/15/main/pg_hba.conf

# Add:
# host replication all 10.0.2.0/24 md5

# 3. Restart PostgreSQL (a reload is enough for pg_hba.conf changes)
sudo systemctl restart postgresql

# 4. Restart the replica
# On the replica server:
sudo systemctl restart postgresql

# 5. Verify
# On the primary:
sudo -u postgres psql -c "SELECT * FROM pg_stat_replication;"

# The replica is connected again!

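Wrapping up

Database replication lag is a dangerous problem because it causes "invisible data loss". The key points from this article:

Key takeaways:

Respect the thresholds: act immediately when lag exceeds 30 seconds on OLTP or 60 seconds on analytics replicas
Three main causes: slow queries, network latency, disk I/O bottlenecks
Multi-Threaded Replication: up to 6x throughput with 8 workers in the example above
Synchronous replication: prevents data loss (safe failover)
2025 tools: Azure SQL lag metric, Google Cloud SQL replica_lag
Monitoring is essential: Prometheus + Grafana + Alertmanager
Failover policy: abort failover when lag exceeds 30 seconds

Next steps:

Set up pg_stat_replication monitoring
Install the Prometheus exporters
Build a Grafana dashboard
Set alert thresholds (5 seconds warning, 30 seconds critical)
Enable Multi-Threaded Replication
Optimize slow queries (limit batch size)
Check network bandwidth (1 Gbps or more)
Upgrade the replica's disks (SSD)

When data you just saved does not show up, replication lag is often the cause. Lag above 30 seconds makes failover unsafe and can cost thousands of transactions when the primary fails. Yet enabling Multi-Threaded Replication with 8 workers can raise replica throughput several times over, and Prometheus monitoring lets you catch rising lag before it reaches 30 seconds. Check the replication lag on your production systems today.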
