Implementing Token Bucket Rate Limiting

A SOC ingestion tier sized for steady-state events per second (EPS) will quietly exhaust memory the moment a misconfigured forwarder or a beaconing endpoint floods it faster than the parsers can drain, and a token bucket is the smallest correct fix. This page builds one as a concrete technique inside rate limiting strategies, part of the wider Log Ingestion & Parsing Workflows pipeline.

Root-Cause Context

The failure is not raw volume — it is the absence of a throughput ceiling between the network receiver and the parser. A syslog listener, HTTP webhook, or Kafka consumer that hands every arriving payload straight to schema validation behaves correctly right up to the point where arrival rate exceeds drain rate. From that instant the gap accumulates: pending payloads pile into whatever buffer sits in front of the validator, resident memory climbs linearly with the backlog, and the kernel OOM-killer eventually reaps the process. Every event generated during that window is lost.
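To make the linear accumulation concrete, here is a back-of-the-envelope sketch. The arrival rate, drain rate, payload size, and memory budget are illustrative assumptions, not measurements from any real tier, and `seconds_until_exhaustion` is a hypothetical helper name.

```python
def seconds_until_exhaustion(
    arrival_eps: float,
    drain_eps: float,
    avg_payload_bytes: int,
    memory_budget_bytes: int,
) -> float:
    """Time until an unbounded buffer fills the memory budget.

    Returns float('inf') when the parsers keep up (arrival <= drain).
    """
    surplus_eps = arrival_eps - drain_eps
    if surplus_eps <= 0:
        return float("inf")
    # The backlog grows by the surplus every second, so memory grows linearly.
    growth_bytes_per_s = surplus_eps * avg_payload_bytes
    return memory_budget_bytes / growth_bytes_per_s


# Assumed numbers: 5,000 EPS offered, 1,000 EPS drained,
# 2 KiB per buffered payload, 4 GiB available to the buffer.
eta = seconds_until_exhaustion(5_000, 1_000, 2_048, 4 * 1024**3)
print(f"buffer exhausts memory after ~{eta:.0f} s")
```

Under these assumed figures the buffer lasts roughly nine minutes, which is short enough that a sustained flood ends in an OOM kill before anyone is paged.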

Two naive throttles make this worse. A fixed-window counter resets abruptly at each interval boundary, so a source can spend its entire budget in the last millisecond of one window and the first of the next, delivering a 2x burst that saturates the very buffer you were protecting. A sliding-window implementation that retains a per-event timestamp deque trades that bug for unbounded memory growth under exactly the spike it is supposed to absorb. The token bucket avoids both: it stores two floats — current tokens and a last-update timestamp — and refills lazily from elapsed time, so it admits controlled bursts up to a hard ceiling while holding the long-run rate flat.
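The boundary problem can be reproduced with a small deterministic simulation. This is a minimal sketch that uses explicit timestamps instead of a real clock; both helper functions are illustrative and are not part of the implementation later on this page.

```python
from collections import defaultdict


def fixed_window_admitted(arrivals: list[float], limit: int, window_s: float) -> list[float]:
    """Admit up to `limit` events per aligned window; return admitted timestamps."""
    counts: dict[int, int] = defaultdict(int)
    admitted = []
    for t in arrivals:
        window = int(t // window_s)
        if counts[window] < limit:
            counts[window] += 1
            admitted.append(t)
    return admitted


def token_bucket_admitted(arrivals: list[float], capacity: float, refill_rate: float) -> list[float]:
    """Admit events against a lazily refilled bucket that starts full at t=0."""
    tokens, updated = capacity, 0.0
    admitted = []
    for t in arrivals:
        # Lazy refill: credit tokens for the time elapsed since the last event.
        tokens = min(capacity, tokens + (t - updated) * refill_rate)
        updated = t
        if tokens >= 1.0:
            tokens -= 1.0
            admitted.append(t)
    return admitted


# 100 events just before the 1 s boundary and 100 just after it.
arrivals = [0.999] * 100 + [1.001] * 100

fw = fixed_window_admitted(arrivals, limit=100, window_s=1.0)
tb = token_bucket_admitted(arrivals, capacity=100, refill_rate=100)

print(len(fw))  # 200: both windows spend their full budget within 2 ms
print(len(tb))  # 100: the burst is capped at capacity
```

Both limiters are configured for 100 events per second, yet the fixed window lets twice that through in two milliseconds while the bucket holds the burst at its capacity.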

In SOC terms this surfaces as correlation gaps, not an obvious crash. A credential-stuffing wave mapped to MITRE ATT&CK T1110 (Brute Force), or a high-volume log flood resembling T1498 (Network Denial of Service), can emit tens of thousands of events in seconds. If the ingestion tier only persists a fraction of them, the correlation window drifts, the rule that should have fired on the brute-force sequence never sees a complete stream, and the alert is either suppressed or escalated as noise. The limiter’s job is to convert that flood into a smooth, parser-paced stream and to label whatever it cannot admit so error categorization frameworks can act on it.

Prerequisites

Python 3.11+: the implementation uses @dataclass(slots=True), asyncio, and modern type hints; no third-party packages are required for the limiter itself.
time.monotonic, not time.time — the wall clock can step backward on NTP correction, which would corrupt the refill calculation. The monotonic clock never regresses.
A single asyncio event loop owning the limiter. The bucket is guarded by an asyncio.Lock, so it is safe across coroutines on one loop but is not intended to be shared across OS threads or processes — for multi-worker deployments, give each worker its own bucket sized to its share of the budget, or front them with a shared store such as Redis.
A measured drain rate: profile your schema validation pipelines under load to learn the sustainable events-per-second before setting refill_rate.
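For the last prerequisite, a rough way to get a first number is to time the validator over a synthetic sample. The sketch below is an assumption-heavy starting point: `toy_validate` is a hypothetical stand-in for a real schema validator, the sample events are made up, and a single-threaded loop ignores I/O and downstream contention, so treat the result as an upper bound to refine under realistic load.

```python
import time
from typing import Any, Callable


def measure_drain_eps(
    validate: Callable[[dict[str, Any]], Any],
    sample: list[dict[str, Any]],
    rounds: int = 5,
) -> float:
    """Return the slowest observed events-per-second over `rounds` passes."""
    worst = float("inf")
    for _ in range(rounds):
        start = time.perf_counter()
        for event in sample:
            validate(event)
        elapsed = time.perf_counter() - start
        # Keep the worst pass: the limiter must hold even on a slow round.
        worst = min(worst, len(sample) / max(elapsed, 1e-9))
    return worst


def toy_validate(event: dict[str, Any]) -> bool:
    # Hypothetical stand-in for a real schema validator.
    return isinstance(event.get("event.id"), int) and "host" in event


sample = [{"event.id": i, "host": "web-01"} for i in range(10_000)]
eps = measure_drain_eps(toy_validate, sample)
print(f"measured drain rate ~ {eps:,.0f} EPS (single-threaded, synthetic)")
```

Taking the slowest round rather than the mean is a deliberately conservative choice, since the ceiling has to survive the bad passes as well as the good ones.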

Production-Ready Implementation

The limiter sits at the ingestion boundary. A payload that acquires a token within acquire_deadline_s goes straight to validation; one that does not is parked in a bounded deferred queue (so a transient spike is absorbed, not dropped); and a full queue is the explicit signal to apply backpressure upstream rather than allocate without limit. admit() returns a typed Disposition for every payload. The two failure paths carry stable ERR_RATE_* codes for downstream routing: ERR_RATE_002 when the queue is full, and ERR_RATE_001 when the deferred worker cannot obtain a token for a parked payload before its deadline. The block is self-contained and runnable as-is.

from __future__ import annotations

import asyncio
import logging
import time
from collections.abc import AsyncIterator
from dataclasses import dataclass, field
from enum import Enum
from typing import Any

logger = logging.getLogger("soc.rate_limiter")


class Disposition(Enum):
    """Terminal outcome for a single payload at the ingestion boundary."""

    PROCESSED = "processed"        # token acquired -> forwarded to schema validation
    BATCHED = "batched_deferred"   # no token -> parked in the bounded deferred queue
    SHED = "shed_backpressure"     # queue full -> upstream told to back off (ERR_RATE_002)


@dataclass(slots=True)
class TokenBucket:
    """Monotonic-clock token bucket; concurrency-safe on a single event loop."""

    capacity: float          # maximum burst, in tokens
    refill_rate: float       # tokens replenished per second (steady-state EPS ceiling)
    _tokens: float = field(init=False)
    _updated: float = field(init=False)
    _lock: asyncio.Lock = field(default_factory=asyncio.Lock, init=False)

    def __post_init__(self) -> None:
        if self.capacity <= 0 or self.refill_rate <= 0:
            raise ValueError("ERR_RATE_003: capacity and refill_rate must be positive")
        self._tokens = self.capacity
        self._updated = time.monotonic()

    def _replenish(self) -> None:
        now = time.monotonic()
        elapsed = now - self._updated
        if elapsed > 0:
            self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_rate)
            self._updated = now

    async def try_acquire(self, cost: float = 1.0) -> bool:
        if cost <= 0 or cost > self.capacity:
            raise ValueError("ERR_RATE_003: cost must be > 0 and <= capacity")
        async with self._lock:
            self._replenish()
            if self._tokens >= cost:
                self._tokens -= cost
                return True
            return False

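    async def acquire_within(self, cost: float = 1.0, deadline_s: float = 2.0) -> bool:
        """Wait up to deadline_s for `cost` tokens; sleeps between attempts."""
        deadline = time.monotonic() + deadline_s
        backoff = min(0.05, 1.0 / self.refill_rate)
        while True:
            # Always try at least once, even when deadline_s is 0.
            if await self.try_acquire(cost):
                return True
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            # Sleep one token-interval (capped at 50 ms), never past the deadline.
            await asyncio.sleep(min(backoff, remaining))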

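    @property
    def available(self) -> float:
        """Tokens available right now, projected from elapsed time (read-only)."""
        elapsed = max(0.0, time.monotonic() - self._updated)
        return min(self.capacity, self._tokens + elapsed * self.refill_rate)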


@dataclass(slots=True)
class IngestionGate:
    """Token-gated boundary with a bounded deferred-batch queue."""

    bucket: TokenBucket
    batch_size: int = 500
    max_queue_depth: int = 10_000
    acquire_deadline_s: float = 2.0
    _queue: asyncio.Queue[dict[str, Any]] = field(init=False)
    _running: bool = field(default=False, init=False)
    metrics: dict[str, int] = field(
        default_factory=lambda: {"processed": 0, "batched": 0, "shed": 0}, init=False
    )

    def __post_init__(self) -> None:
        self._queue = asyncio.Queue(maxsize=self.max_queue_depth)

    async def admit(self, payload: dict[str, Any]) -> Disposition:
        if await self.bucket.acquire_within(deadline_s=self.acquire_deadline_s):
            self.metrics["processed"] += 1
            await self._forward(payload)
            return Disposition.PROCESSED
        try:
            self._queue.put_nowait(payload)
            self.metrics["batched"] += 1
            return Disposition.BATCHED
        except asyncio.QueueFull:
            self.metrics["shed"] += 1
            logger.warning("ERR_RATE_002: deferred queue saturated; signalling backpressure")
            return Disposition.SHED

    async def _drain(self) -> AsyncIterator[list[dict[str, Any]]]:
        batch: list[dict[str, Any]] = []
        while self._running or not self._queue.empty():
            try:
                payload = await asyncio.wait_for(self._queue.get(), timeout=1.0)
            except asyncio.TimeoutError:
                if batch:
                    yield batch
                    batch = []
                continue
            batch.append(payload)
            self._queue.task_done()
            if len(batch) >= self.batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

    async def run_deferred_worker(self) -> None:
        """Replay parked payloads through the same token gate as they replenish."""
        self._running = True
        async for batch in self._drain():
            for payload in batch:
                # Re-gate each deferred payload so replay still respects the ceiling.
                if await self.bucket.acquire_within(deadline_s=self.acquire_deadline_s):
                    self.metrics["processed"] += 1
                    await self._forward(payload)
                else:
                    self.metrics["shed"] += 1
                    logger.warning("ERR_RATE_001: deferred payload expired before a token freed")

    async def _forward(self, payload: dict[str, Any]) -> None:
        # Hook into schema validation / normalization / SIEM routing here.
        logger.debug("forwarding event id=%s to validation", payload.get("event.id"))

    def stop(self) -> None:
        self._running = False


async def main() -> None:
    logging.basicConfig(level=logging.INFO)
    gate = IngestionGate(TokenBucket(capacity=200.0, refill_rate=1000.0))
    worker = asyncio.create_task(gate.run_deferred_worker())
    # Simulate a 5x burst: 5000 events offered against a 1000 EPS ceiling.
    dispositions = await asyncio.gather(
        *(gate.admit({"event.id": i}) for i in range(5000))
    )
    gate.stop()
    await worker
    summary = {d.value: dispositions.count(d) for d in Disposition}
    logger.info("offered=5000 dispositions=%s metrics=%s", summary, gate.metrics)


if __name__ == "__main__":
    asyncio.run(main())

The acquire_within loop sleeps for one token-interval (capped at 50 ms) between attempts instead of spinning, so an exhausted bucket costs a timer wake-up per waiter rather than a busy loop. The bounded asyncio.Queue caps the deferred backlog at max_queue_depth payloads, independent of how fast the source bursts; payloads still waiting inside admit() are held by their callers for at most acquire_deadline_s, so the receiver should also bound how many admit() calls it keeps in flight. Deferred payloads are replayed through the same gate, so replay never violates the ceiling either. The metrics counters conserve: processed + shed (across the gate and the worker) plus the queue’s residual accounts for every offered event.
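How a receiver turns a SHED outcome into upstream backpressure depends on the transport. As one possible sketch for an HTTP webhook receiver, the mapping below answers shed payloads with 429 Too Many Requests and a Retry-After header. The function name, the 202 response for accepted payloads, the default retry interval, and the `X-Rate-Error` header are all assumptions made for illustration; the enum is repeated only so the sketch runs on its own.

```python
from enum import Enum


class Disposition(Enum):
    # Mirrors the enum in the listing above so this sketch is self-contained.
    PROCESSED = "processed"
    BATCHED = "batched_deferred"
    SHED = "shed_backpressure"


def http_response_for(
    disposition: Disposition, retry_after_s: int = 5
) -> tuple[int, dict[str, str]]:
    """Map a gate outcome to an (HTTP status, headers) pair for the sender."""
    if disposition is Disposition.SHED:
        # 429 plus Retry-After asks a well-behaved forwarder to slow down.
        # X-Rate-Error is a hypothetical header carrying the pipeline code.
        return 429, {"Retry-After": str(retry_after_s), "X-Rate-Error": "ERR_RATE_002"}
    # PROCESSED and BATCHED both mean the gate has taken ownership of the event.
    return 202, {}


status, headers = http_response_for(Disposition.SHED)
print(status, headers)
```

Syslog over UDP has no equivalent reply channel, so for that transport the backpressure signal would have to reach the forwarder out of band, for example through its own configuration or a management API.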

Error-Code Reference

Code	Meaning	Action
`ERR_RATE_001`	A deferred payload waited in the queue but no token freed before its acquire deadline.	Increase `refill_rate` toward sustainable parser EPS, or raise parser concurrency so the queue drains faster.
`ERR_RATE_002`	The deferred queue hit `max_queue_depth`; the payload was shed and backpressure signalled upstream.	Apply forwarder-side throttling; investigate the source for misconfiguration, T1498 (Network DoS), or T1499 (Endpoint DoS).
`ERR_RATE_003`	Invalid configuration — non-positive `capacity`/`refill_rate`, or a per-call `cost` larger than `capacity`.	Fix the limiter config at startup; a cost above capacity can never be satisfied and is a programming error.
`ERR_RATE_004`	Sustained `SHED` rate above the alert threshold (for example, >5% of EPS for 30 s).	Treat as an overload incident: correlate source IPs with network-flow data, isolate compromised endpoints, scale parsers.

These codes follow the ERR_CATEGORY_NNN convention used across the pipeline, so the limiter’s output slots directly into the shared error categorization frameworks without a translation layer. (The first three are emitted by the code above; ERR_RATE_004 is a derived alerting condition computed from the metrics counters.)
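Because ERR_RATE_004 is derived rather than emitted, something has to compute it. The following is a minimal sketch of one way to do so from cumulative counters, using the example threshold from the table (more than 5% shed, held for 30 s). `ShedRateMonitor` is a hypothetical class, and it assumes the caller tracks an `offered` count (for instance by counting admit() calls), since the gate's metrics do not record it directly.

```python
from typing import Optional


class ShedRateMonitor:
    """Flags ERR_RATE_004 when the shed ratio stays above a threshold."""

    def __init__(self, threshold: float = 0.05, sustain_s: float = 30.0) -> None:
        self.threshold = threshold
        self.sustain_s = sustain_s
        self._prev: Optional[tuple[int, int]] = None   # (offered, shed) at last sample
        self._breach_started: Optional[float] = None   # when the ratio first exceeded

    def sample(self, now: float, offered: int, shed: int) -> bool:
        """Feed cumulative counters; return True once the breach is sustained."""
        if self._prev is None:
            self._prev = (offered, shed)
            return False
        d_offered = offered - self._prev[0]
        d_shed = shed - self._prev[1]
        self._prev = (offered, shed)
        ratio = d_shed / d_offered if d_offered > 0 else 0.0
        if ratio > self.threshold:
            if self._breach_started is None:
                self._breach_started = now
            return now - self._breach_started >= self.sustain_s
        # Any healthy interval resets the sustain timer.
        self._breach_started = None
        return False


monitor = ShedRateMonitor()
# Simulated 10 s samples with 10% of each interval shed.
readings = [(0, 0, 0), (10, 10_000, 1_000), (20, 20_000, 2_000),
            (30, 30_000, 3_000), (40, 40_000, 4_000)]
alerts = [monitor.sample(float(t), offered, shed) for t, offered, shed in readings]
print(alerts)  # [False, False, False, False, True]
```

The first breaching sample only starts the timer, so the alert fires once the ratio has stayed above the threshold for the full sustain period rather than on the first bad interval.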

Operational Notes

CPU/memory profile. The bucket is two floats and a lock — effectively zero memory and constant-time per acquire. The dominant memory cost is the deferred queue: budget roughly max_queue_depth x average_payload_bytes and size it against the host’s available RAM, leaving headroom for the parser workers themselves.
Sizing refill_rate. Set it at 80–90% of measured sustainable parser EPS. Run it at 100% and a single GC pause or downstream hiccup tips the queue into backpressure; leaving headroom keeps SHED events rare and meaningful.
Sizing capacity. Match it to the legitimate burst you must absorb — typically 3–5 seconds of peak EPS. Too small and you shed real telemetry surges (a login storm after a deploy); too large and you defeat the point, admitting a flood that the parsers can’t keep up with.
Distributed deployments. The in-process bucket is per-worker. For a fleet, either divide the global budget across workers (simple, slightly conservative) or back the bucket with a Redis Lua script so the ceiling is enforced cluster-wide — the same try_acquire/acquire_within contract holds.
Replay vs. shed. Parking exhausted payloads (BATCHED) preserves forensic completeness during a transient spike, which matters for audit retention under NIST SP 800-92. A queue that stays full, though, means the ceiling or downstream throughput is genuinely too low — that is the ERR_RATE_004 signal, not a tuning nuisance to suppress.
Pairs with batching. Downstream of the gate, hand validated events to async log batching so the SIEM sees size-triggered batches under a single egress quota rather than per-event writes.
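The sizing rules in these notes reduce to a few lines of arithmetic. The sketch below assumes an even split of a tier-wide budget across workers and uses made-up throughput figures; `size_gate`, `GateSizing`, and the default payload size are hypothetical and should be replaced with measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateSizing:
    refill_rate: float   # tokens per second, per worker
    capacity: float      # burst tokens, per worker
    queue_bytes: int     # estimated worst-case deferred-queue memory, per worker


def size_gate(
    sustainable_eps: float,          # tier-wide parser EPS measured under load
    peak_eps: float,                 # tier-wide legitimate peak EPS
    workers: int = 1,
    headroom: float = 0.85,          # within the 80-90% guidance
    burst_seconds: float = 3.0,      # low end of the 3-5 s guidance
    max_queue_depth: int = 10_000,
    avg_payload_bytes: int = 2_048,  # assumption: measure your own payloads
) -> GateSizing:
    if workers < 1:
        raise ValueError("workers must be >= 1")
    return GateSizing(
        # Divide the global budget evenly: simple, slightly conservative.
        refill_rate=sustainable_eps * headroom / workers,
        capacity=peak_eps * burst_seconds / workers,
        queue_bytes=max_queue_depth * avg_payload_bytes,
    )


sizing = size_gate(sustainable_eps=10_000, peak_eps=8_000, workers=4)
print(sizing)
```

With these assumed inputs each of the four workers gets roughly 2,125 tokens per second, a 6,000-token burst allowance, and about 20 MB of worst-case queue memory, which can then be checked against the RAM left over after the parser workers.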

Verification Checklist

Under a synthetic 5x burst, resident memory plateaus at the deferred-queue ceiling and the process never OOM-kills.
Measured admission rate stays at or below refill_rate across a multi-minute spike (the bucket holds the long-run ceiling).
capacity worth of tokens — and no more — is admitted instantly from a full bucket before throttling engages.
The metrics counters conserve: processed + shed plus the queue’s residual equals total events offered.
Every SHED/deferred-expiry path emits a populated ERR_RATE_* code, and each code maps to a row in the reference table.
At steady state (no spike) the shed and deferred counts are zero; a non-zero value signals the rate or parser concurrency needs raising.
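Several of these checks can be exercised without real time passing by driving a bucket from an explicit clock value. The sketch below uses a simplified synchronous test double (`SimBucket`, a hypothetical name) rather than the asyncio implementation, so it verifies the arithmetic of the technique, not the production class.

```python
class SimBucket:
    """Synchronous token bucket driven by an explicit clock value (test double)."""

    def __init__(self, capacity: float, refill_rate: float) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.updated = 0.0

    def try_acquire(self, now: float) -> bool:
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def run_spike(capacity: float, refill_rate: float, offered_eps: int, seconds: int) -> dict[str, int]:
    """Offer an instant burst at t=0, then a steady flood for `seconds`."""
    bucket = SimBucket(capacity, refill_rate)
    # Instant burst: offered_eps events all stamped t=0 against a full bucket.
    instant = sum(bucket.try_acquire(0.0) for _ in range(offered_eps))
    admitted, offered = instant, offered_eps
    # Sustained flood: evenly spaced events at offered_eps for `seconds`.
    for i in range(1, offered_eps * seconds + 1):
        admitted += bucket.try_acquire(i / offered_eps)
        offered += 1
    return {"offered": offered, "instant": instant,
            "admitted": admitted, "rejected": offered - admitted}


result = run_spike(capacity=200, refill_rate=1000, offered_eps=5000, seconds=10)
print(result)
```

The properties worth asserting are that `instant` equals `capacity`, that `admitted` never exceeds `capacity + refill_rate * seconds`, and that `admitted + rejected` equals `offered`.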

FAQ

Why a token bucket instead of a fixed-window or sliding-window limiter?

A fixed-window counter resets abruptly at each boundary, so a source can spend its full budget at the end of one window and the start of the next, delivering a 2x burst that saturates the buffer you meant to protect. A sliding window that keeps per-event timestamps trades that for unbounded memory growth under the exact spike it should absorb. The token bucket stores only current tokens plus a last-update timestamp and refills lazily from elapsed monotonic time, so it admits controlled bursts up to capacity while holding the long-run rate flat at refill_rate — bounded memory and correct burst behaviour at once.

How do I set capacity and refill_rate for a real parser tier?

Profile the parsers first: drive your schema validation under load and record the events-per-second they sustain without GC pauses or thread starvation. Set refill_rate to 80–90% of that number so a momentary stall doesn’t immediately trip backpressure. Set capacity to the largest legitimate burst you must absorb, usually 3–5 seconds of peak EPS — enough to ride out a post-deploy login storm without shedding, but not so large that you wave through a flood the parsers can never drain.

What should fire an alert versus being handled silently?

PROCESSED and a brief BATCHED blip are normal — a transient spike parked and replayed needs no human. The alert condition is ERR_RATE_004: a sustained SHED rate (for example >5% of EPS held for 30 seconds), which means the deferred queue stayed full and real telemetry is being dropped. Treat that as an overload incident — correlate the source IPs against network-flow data to separate a misconfigured forwarder from T1498 (Network DoS) or a compromised, beaconing endpoint.
