Bloom Filter 완벽 가이드: 확률적 자료구조로 성능 극대화하기

Q: Bloom Filter란?

Bloom Filter는 1970년 Burton Howard Bloom이 제안한 **공간 효율적인 확률적 자료구조**입니다. "특정 원소가 집합에 포함되어 있는가?"라는 질문에 답하는데, 재미있게도 **"아마도 있다"** 또는 **"확실히 없다"**라는 두 가지 대답만 합니다.

Bloom Filter란?

Bloom Filter는 1970년 Burton Howard Bloom이 제안한 공간 효율적인 확률적 자료구조입니다. “특정 원소가 집합에 포함되어 있는가?”라는 질문에 답하는데, 재미있게도 “아마도 있다” 또는 **“확실히 없다”**라는 두 가지 대답만 합니다.

┌─────────────────────────────────────────────────────────┐
│                     Bloom Filter                        │
├─────────────────────────────────────────────────────────┤
│  "apple" 검색 → "아마도 있음" (실제로 있을 수도, 없을 수도) │
│  "grape" 검색 → "확실히 없음" (100% 보장)                │
└─────────────────────────────────────────────────────────┘

이 특성 때문에 False Positive(없는데 있다고 함)는 발생하지만, False Negative(있는데 없다고 함)는 절대 발생하지 않습니다.

왜 Bloom Filter를 사용하는가?

문제 상황

수십억 개의 URL을 저장한 데이터베이스가 있다고 가정해봅시다. 사용자가 입력한 URL이 이미 존재하는지 확인하려면:

# 순진한 접근법
def url_exists(url: str) -> bool:
    return db.query("SELECT 1 FROM urls WHERE url = ?", url) is not None

매번 디스크 I/O가 발생하며, 수십억 레코드에서의 조회는 인덱스가 있어도 비용이 큽니다.

Bloom Filter 해결책

# Bloom Filter 사용
def url_exists_optimized(url: str) -> bool:
    if not bloom_filter.might_contain(url):
        return False  # 확실히 없음 - DB 조회 불필요!

    # "아마도 있음" - DB에서 확인 필요
    return db.query("SELECT 1 FROM urls WHERE url = ?", url) is not None

Bloom Filter가 “없다”고 하면 DB 조회를 완전히 건너뛸 수 있습니다. 대부분의 조회가 “없음”인 경우(예: 스팸 필터링), 엄청난 성능 향상을 얻을 수 있습니다.

작동 원리

기본 구조

Bloom Filter는 두 가지 요소로 구성됩니다:

비트 배열: m비트 크기의 배열 (초기값 모두 0)
해시 함수들: k개의 독립적인 해시 함수

삽입 연산

원소를 삽입할 때, k개의 해시 함수로 k개의 위치를 계산하고 해당 비트를 1로 설정합니다.

"apple" 삽입 (k=3 해시 함수 사용)
├── hash1("apple") = 2
├── hash2("apple") = 5
└── hash3("apple") = 9

비트 배열 변화:
Before: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
After:  [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
              ↑        ↑           ↑
             [2]      [5]         [9]

조회 연산

원소를 조회할 때, 동일한 k개의 해시 함수로 위치를 계산하고 모든 비트가 1인지 확인합니다.

"apple" 조회
├── hash1("apple") = 2 → 비트[2] = 1 ✓
├── hash2("apple") = 5 → 비트[5] = 1 ✓
└── hash3("apple") = 9 → 비트[9] = 1 ✓
→ 모두 1이므로 "아마도 있음" 반환

"grape" 조회
├── hash1("grape") = 3 → 비트[3] = 0 ✗
→ 하나라도 0이면 "확실히 없음" 반환

TypeScript로 구현하기

실제로 동작하는 Bloom Filter를 구현해보겠습니다.

import { createHash } from 'crypto';

class BloomFilter {
  private bitArray: Uint8Array;
  private size: number;
  private hashCount: number;

  constructor(size: number, hashCount: number) {
    this.size = size;
    this.hashCount = hashCount;
    // 비트 배열을 바이트 단위로 저장
    this.bitArray = new Uint8Array(Math.ceil(size / 8));
  }

  // 여러 해시 함수를 시뮬레이션 (Double Hashing 기법)
  private getHashPositions(item: string): number[] {
    const hash1 = this.hash(item, 'sha256');
    const hash2 = this.hash(item, 'md5');

    const positions: number[] = [];
    for (let i = 0; i < this.hashCount; i++) {
      // h(i) = (hash1 + i * hash2) mod size
      const position = Math.abs((hash1 + i * hash2) % this.size);
      positions.push(position);
    }
    return positions;
  }

  private hash(item: string, algorithm: string): number {
    const hashBuffer = createHash(algorithm).update(item).digest();
    // 첫 4바이트를 정수로 변환
    return hashBuffer.readUInt32BE(0);
  }

  private setBit(position: number): void {
    const byteIndex = Math.floor(position / 8);
    const bitIndex = position % 8;
    this.bitArray[byteIndex] |= (1 << bitIndex);
  }

  private getBit(position: number): boolean {
    const byteIndex = Math.floor(position / 8);
    const bitIndex = position % 8;
    return (this.bitArray[byteIndex] & (1 << bitIndex)) !== 0;
  }

  add(item: string): void {
    const positions = this.getHashPositions(item);
    for (const pos of positions) {
      this.setBit(pos);
    }
  }

  mightContain(item: string): boolean {
    const positions = this.getHashPositions(item);
    return positions.every(pos => this.getBit(pos));
  }

  // False Positive 확률 계산
  getFalsePositiveRate(insertedCount: number): number {
    // (1 - e^(-k*n/m))^k
    const exponent = -this.hashCount * insertedCount / this.size;
    return Math.pow(1 - Math.exp(exponent), this.hashCount);
  }
}

사용 예시

// 10만 비트, 7개 해시 함수
const bloom = new BloomFilter(100000, 7);

// URL 추가
bloom.add('https://example.com/page1');
bloom.add('https://example.com/page2');
bloom.add('https://example.com/page3');

// 조회
console.log(bloom.mightContain('https://example.com/page1')); // true
console.log(bloom.mightContain('https://example.com/unknown')); // false (대부분)

// False Positive 확률 확인
console.log(bloom.getFalsePositiveRate(3)); // 약 0.000002%

최적의 파라미터 선택

Bloom Filter의 성능은 세 가지 파라미터에 의해 결정됩니다:

n: 예상 삽입 원소 수
m: 비트 배열 크기
k: 해시 함수 개수

False Positive 확률 공식

$$ P(fp) = \left(1 - e^{-\frac{kn}{m}}\right)^k $$

최적 해시 함수 개수

주어진 m과 n에 대해 최적의 k는:

$$ k_{optimal} = \frac{m}{n} \ln 2 \approx 0.693 \times \frac{m}{n} $$

필요한 비트 수 계산

원하는 False Positive 확률 p에 대해 필요한 비트 수:

$$ m = -\frac{n \ln p}{(\ln 2)^2} $$

function calculateOptimalParameters(
  expectedItems: number,
  falsePositiveRate: number
): { bits: number; hashFunctions: number } {
  // 필요한 비트 수
  const bits = Math.ceil(
    -expectedItems * Math.log(falsePositiveRate) / Math.pow(Math.log(2), 2)
  );

  // 최적 해시 함수 개수
  const hashFunctions = Math.round(
    (bits / expectedItems) * Math.log(2)
  );

  return { bits, hashFunctions };
}

// 100만 개 원소, 1% False Positive 확률
const params = calculateOptimalParameters(1_000_000, 0.01);
console.log(params);
// { bits: 9585059, hashFunctions: 7 }
// 약 1.2MB로 100만 개 원소 처리 가능!

:::tip 100만 개의 원소를 HashSet으로 저장하면 수십~수백 MB가 필요하지만, Bloom Filter는 단 1.2MB로 1%의 False Positive 확률을 달성합니다. :::

실전 활용 사례

1. 캐시 관통(Cache Penetration) 방지

존재하지 않는 키에 대한 요청이 캐시를 우회하여 DB에 직접 도달하는 문제를 방지합니다.

class CacheWithBloomFilter {
  private bloom: BloomFilter;
  private cache: Map<string, any>;

  constructor() {
    this.bloom = new BloomFilter(1000000, 7);
    this.cache = new Map();
  }

  async get(key: string): Promise<any | null> {
    // 1. Bloom Filter로 먼저 확인
    if (!this.bloom.mightContain(key)) {
      return null; // 확실히 없음 - DB 조회 불필요
    }

    // 2. 캐시 확인
    if (this.cache.has(key)) {
      return this.cache.get(key);
    }

    // 3. DB 조회 (Bloom Filter가 "있을 수도"라고 했으므로)
    const value = await this.fetchFromDB(key);
    if (value !== null) {
      this.cache.set(key, value);
    }
    return value;
  }

  async set(key: string, value: any): Promise<void> {
    this.bloom.add(key);
    this.cache.set(key, value);
    await this.saveToDb(key, value);
  }
}

2. 중복 요청 필터링

크롤러나 스트리밍 시스템에서 이미 처리한 항목을 필터링합니다.

class WebCrawler {
  private visitedUrls: BloomFilter;

  constructor() {
    // 1억 개 URL, 0.1% False Positive
    const { bits, hashFunctions } = calculateOptimalParameters(100_000_000, 0.001);
    this.visitedUrls = new BloomFilter(bits, hashFunctions);
  }

  async crawl(url: string): Promise<void> {
    if (this.visitedUrls.mightContain(url)) {
      console.log(`Skipping likely visited URL: ${url}`);
      return;
    }

    this.visitedUrls.add(url);

    // 실제 크롤링 수행
    const html = await fetch(url).then(r => r.text());
    const links = this.extractLinks(html);

    for (const link of links) {
      await this.crawl(link);
    }
  }
}

3. 스팸/악성 URL 필터링

브라우저의 세이프 브라우징 기능은 Bloom Filter를 사용합니다.

class SafeBrowsingFilter {
  private maliciousUrls: BloomFilter;

  constructor(knownMaliciousUrls: string[]) {
    const { bits, hashFunctions } = calculateOptimalParameters(
      knownMaliciousUrls.length,
      0.0001 // 0.01% False Positive
    );

    this.maliciousUrls = new BloomFilter(bits, hashFunctions);

    for (const url of knownMaliciousUrls) {
      this.maliciousUrls.add(url);
    }
  }

  checkUrl(url: string): 'safe' | 'potentially_malicious' {
    if (this.maliciousUrls.mightContain(url)) {
      return 'potentially_malicious'; // 서버에 추가 확인 필요
    }
    return 'safe'; // 100% 안전
  }
}

Redis의 Bloom Filter

Redis는 RedisBloom 모듈을 통해 Bloom Filter를 네이티브로 지원합니다.

# RedisBloom 모듈 로드
redis-server --loadmodule /path/to/redisbloom.so

import { createClient } from 'redis';

const client = createClient();
await client.connect();

// Bloom Filter 생성 (100만 원소, 1% 에러율)
await client.bf.reserve('myfilter', 0.01, 1000000);

// 원소 추가
await client.bf.add('myfilter', 'item1');
await client.bf.mAdd('myfilter', ['item2', 'item3', 'item4']);

// 존재 여부 확인
const exists = await client.bf.exists('myfilter', 'item1'); // true
const notExists = await client.bf.exists('myfilter', 'unknown'); // false (대부분)

// 여러 원소 한번에 확인
const results = await client.bf.mExists('myfilter', ['item1', 'unknown']);
// [true, false]

Redis Bloom Filter 활용 패턴

class DistributedDuplicateChecker {
  private redis: RedisClient;
  private filterKey: string;

  constructor(redis: RedisClient, filterKey: string) {
    this.redis = redis;
    this.filterKey = filterKey;
  }

  async initialize(expectedItems: number, errorRate: number): Promise<void> {
    try {
      await this.redis.bf.reserve(this.filterKey, errorRate, expectedItems);
    } catch (e) {
      // 이미 존재하는 경우 무시
      if (!e.message.includes('item exists')) throw e;
    }
  }

  async addIfNotExists(item: string): Promise<boolean> {
    // BF.ADD는 이미 존재하면 0, 새로 추가하면 1 반환
    const added = await this.redis.bf.add(this.filterKey, item);
    return added === 1;
  }

  async processEvent(eventId: string, handler: () => Promise<void>): Promise<void> {
    const isNew = await this.addIfNotExists(eventId);

    if (isNew) {
      await handler();
    } else {
      console.log(`Event ${eventId} likely already processed, skipping`);
    }
  }
}

Counting Bloom Filter

기본 Bloom Filter의 단점 중 하나는 삭제가 불가능하다는 것입니다. 비트를 0으로 설정하면 다른 원소까지 영향을 받을 수 있기 때문입니다.

Counting Bloom Filter는 각 비트 대신 카운터를 사용하여 이 문제를 해결합니다.

class CountingBloomFilter {
  private counters: Uint8Array;
  private size: number;
  private hashCount: number;

  constructor(size: number, hashCount: number) {
    this.size = size;
    this.hashCount = hashCount;
    this.counters = new Uint8Array(size); // 각 위치에 카운터
  }

  add(item: string): void {
    const positions = this.getHashPositions(item);
    for (const pos of positions) {
      if (this.counters[pos] < 255) { // 오버플로우 방지
        this.counters[pos]++;
      }
    }
  }

  remove(item: string): boolean {
    // 먼저 존재하는지 확인
    if (!this.mightContain(item)) {
      return false;
    }

    const positions = this.getHashPositions(item);
    for (const pos of positions) {
      if (this.counters[pos] > 0) {
        this.counters[pos]--;
      }
    }
    return true;
  }

  mightContain(item: string): boolean {
    const positions = this.getHashPositions(item);
    return positions.every(pos => this.counters[pos] > 0);
  }
}

:::caution Counting Bloom Filter는 기본 Bloom Filter보다 4~8배 더 많은 메모리를 사용합니다. 삭제가 꼭 필요한 경우에만 사용하세요. :::

Bloom Filter vs 다른 자료구조

특성	HashSet	Bloom Filter	Cuckoo Filter
메모리	높음	매우 낮음	낮음
삭제 지원	O	X	O
False Positive	없음	있음	있음
False Negative	없음	없음	없음
조회 시간	O(1)	O(k)	O(1)

주의사항

1. False Positive 처리

Bloom Filter가 “있음”이라고 해도 반드시 존재하는 것은 아닙니다. 중요한 작업에는 이중 확인이 필요합니다.

async function safeCheck(item: string): Promise<boolean> {
  if (!bloomFilter.mightContain(item)) {
    return false; // 확실히 없음
  }

  // Bloom Filter 긍정 응답은 반드시 실제 저장소에서 확인
  return await actualDataStore.contains(item);
}

2. 용량 초과 시 성능 저하

예상보다 많은 원소가 삽입되면 False Positive 확률이 급격히 증가합니다.

// 모니터링 추가
class MonitoredBloomFilter extends BloomFilter {
  private insertCount = 0;
  private expectedCapacity: number;

  constructor(size: number, hashCount: number, expectedCapacity: number) {
    super(size, hashCount);
    this.expectedCapacity = expectedCapacity;
  }

  add(item: string): void {
    super.add(item);
    this.insertCount++;

    if (this.insertCount > this.expectedCapacity * 0.9) {
      console.warn('Bloom Filter approaching capacity, consider resizing');
    }
  }
}

3. 크기 조정 불가

Bloom Filter는 생성 후 크기를 변경할 수 없습니다. Scalable Bloom Filter를 사용하면 이 문제를 해결할 수 있습니다.

class ScalableBloomFilter {
  private filters: BloomFilter[] = [];
  private targetFPR: number;
  private scaleFactor: number;
  private initialSize: number;

  constructor(initialSize: number, targetFPR: number, scaleFactor = 2) {
    this.initialSize = initialSize;
    this.targetFPR = targetFPR;
    this.scaleFactor = scaleFactor;
    this.addNewFilter();
  }

  private addNewFilter(): void {
    const filterIndex = this.filters.length;
    // 각 새 필터는 더 낮은 FPR을 가짐
    const fpr = this.targetFPR * Math.pow(0.5, filterIndex);
    const size = this.initialSize * Math.pow(this.scaleFactor, filterIndex);
    const { hashFunctions } = calculateOptimalParameters(size, fpr);

    this.filters.push(new BloomFilter(size, hashFunctions));
  }

  add(item: string): void {
    const lastFilter = this.filters[this.filters.length - 1];

    // 마지막 필터가 가득 찼으면 새 필터 추가
    if (lastFilter.isFull()) {
      this.addNewFilter();
    }

    this.filters[this.filters.length - 1].add(item);
  }

  mightContain(item: string): boolean {
    // 모든 필터를 확인
    return this.filters.some(filter => filter.mightContain(item));
  }
}

마치며

Bloom Filter는 “확실히 없음”을 빠르게 판단할 수 있는 강력한 도구입니다. 메모리가 제한적이거나 대용량 데이터의 존재 여부를 빠르게 확인해야 할 때 특히 유용합니다.

핵심 포인트를 정리하면:

False Negative 없음: “없다”는 응답은 100% 신뢰할 수 있음
공간 효율성: 원본 데이터 대비 극소량의 메모리로 동작
O(k) 시간 복잡도: 해시 함수 개수에만 의존, 데이터 크기와 무관
단방향: 기본 Bloom Filter는 삭제 불가

캐시 시스템, 분산 데이터베이스, 네트워크 라우터 등 다양한 곳에서 활용되는 이 우아한 자료구조를 여러분의 시스템에도 적용해보세요.