Building Async Log Collectors with asyncio

A synchronous SOC log collector looks healthy at steady state but silently drops events the moment a traffic spike makes its blocking I/O and unbounded buffers collide with the host’s memory ceiling. This page builds a non-blocking, memory-bounded collector with asyncio, as one technique inside async log batching and the wider Log Ingestion & Parsing Workflows pipeline.

Root-Cause Context

The failure is not bandwidth — it is the collector’s inability to yield control during an I/O wait. A synchronous forwarder built on blocking requests or raw socket calls, feeding an unbounded Python list, behaves predictably right up to the point where a downstream SIEM throttles or a network partition occurs. Then sends stop completing, the producer keeps appending, and resident memory climbs linearly with the pending payload until the kernel OOM-killer reaps the process. Every event generated during that kill window is lost.

In a SOC this surfaces as correlation gaps rather than an obvious crash. A credential-stuffing wave mapped to MITRE ATT&CK T1110 (Brute Force), or a lateral-movement sequence using T1078 (Valid Accounts), can emit 50,000 authentication events in 90 seconds from a handful of sources. If the collector only persists 32,000 of them, the rule that should have fired on the brute-force sequence never sees a complete stream, so the alert either suppresses entirely or escalates as a noisy false positive.

asyncio removes the blocking-I/O root cause by multiplexing every I/O-bound task onto a single event loop: a coroutine awaiting a slow endpoint suspends and lets others run instead of pinning a thread. Pairing that cooperative model with an explicit maxsize on every asyncio.Queue converts the unbounded-memory failure into deterministic backpressure: when the queue fills, the producer either blocks (await put) or sheds (put_nowait raises QueueFull) instead of allocating without limit.
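The two backpressure behaviors can be seen in isolation with a small standard-library sketch. The function names (`shedding_producer`, `blocking_producer`, `slow_consumer`, `demo`) and the queue size of 4 are illustrative assumptions, not part of the collector.

```python
import asyncio


async def shedding_producer(queue: asyncio.Queue[int], items: int) -> int:
    """Enqueue without waiting; return how many items were dropped on a full queue."""
    shed = 0
    for i in range(items):
        try:
            queue.put_nowait(i)
        except asyncio.QueueFull:
            shed += 1
    return shed


async def blocking_producer(queue: asyncio.Queue[int], items: int) -> None:
    """Enqueue with await; the coroutine suspends whenever the queue is full."""
    for i in range(items):
        await queue.put(i)


async def slow_consumer(queue: asyncio.Queue[int], seen: list[int]) -> None:
    """Drain the queue one item at a time, yielding to the loop between items."""
    while True:
        seen.append(await queue.get())
        queue.task_done()
        await asyncio.sleep(0)


async def demo() -> tuple[int, int, int]:
    # Shedding: 10 items offered to a queue of 4 with no consumer running.
    shed_queue: asyncio.Queue[int] = asyncio.Queue(maxsize=4)
    shed = await shedding_producer(shed_queue, 10)

    # Blocking: same load, but the producer waits for the consumer to make room.
    block_queue: asyncio.Queue[int] = asyncio.Queue(maxsize=4)
    seen: list[int] = []
    consumer = asyncio.create_task(slow_consumer(block_queue, seen))
    await blocking_producer(block_queue, 10)
    await block_queue.join()
    consumer.cancel()
    await asyncio.gather(consumer, return_exceptions=True)
    return shed, shed_queue.qsize(), len(seen)
```

Shedding keeps the producer moving and loses data visibly; blocking keeps every item and slows the producer to the consumer's pace. Either way the queue never holds more than maxsize items.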

Prerequisites

Python 3.11+ for asyncio.TaskGroup and modern timeout semantics.
Two third-party libraries beyond the standard library:

pip install "aiohttp>=3.9"     # non-blocking HTTP fetch + connection pooling
pip install "pydantic>=2.7"    # strict, typed validation at the queue boundary

A pull-based log source that returns JSON over HTTP (cloud audit API, vendor webhook spool, or an internal collector endpoint), and a downstream sink that accepts batched POSTs (Splunk HEC, Elastic _bulk, or an OTLP gateway).
A reachable dead-letter sink — an on-disk spool, a Kafka topic, or an object-store prefix — for events that fail validation or exhaust retries.
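For the on-disk spool option, a drain task along the following lines would keep the in-memory dead-letter queue from filling. This is a sketch under assumptions: `drain_dlq_to_spool`, `_append_lines`, and the JSON Lines file layout are hypothetical and do not appear in the collector module.

```python
import asyncio
import json
from pathlib import Path
from typing import Any


def _append_lines(path: Path, lines: list[str]) -> None:
    """Blocking append, intended to run in a worker thread."""
    with path.open("a", encoding="utf-8") as fh:
        fh.writelines(lines)


async def drain_dlq_to_spool(dlq: asyncio.Queue[dict[str, Any]], spool: Path) -> None:
    """Move dead-lettered entries to a JSON Lines file as they arrive."""
    while True:
        entries = [await dlq.get()]
        # Take whatever else is already queued so each disk write is batched.
        while not dlq.empty():
            entries.append(dlq.get_nowait())
        lines = [json.dumps(entry, default=str) + "\n" for entry in entries]
        # File I/O blocks, so run it off the event loop thread.
        await asyncio.to_thread(_append_lines, spool, lines)
        for _ in entries:
            dlq.task_done()
```

Run it as a background task next to the validator and cancel it after `dlq.join()` returns, so every entry reaches disk before shutdown.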

Production-Ready Implementation

The collector is one self-contained module wiring four bounded stages onto a single event loop: a semaphore-limited fetcher, a Pydantic validator that routes failures to a dead-letter queue, a token-bucket egress limiter, and a size-triggered batch dispatcher. Every queue carries an explicit maxsize, so memory is bounded by construction rather than by hope.

from __future__ import annotations

import asyncio
import logging
import time
from datetime import datetime
from typing import Any, Optional

import aiohttp
from aiohttp import ClientTimeout
from pydantic import BaseModel, Field, ValidationError

logger = logging.getLogger("soc.async_collector")

# --- Typed event model ------------------------------------------------------

class SOCLogRecord(BaseModel):
    timestamp: datetime
    source_ip: str
    event_type: str
    severity: int = Field(ge=0, le=10)
    raw_payload: Optional[dict[str, Any]] = None


# --- Token-bucket egress limiter -------------------------------------------

class TokenBucketLimiter:
    """Smooths dispatch rate to the downstream SIEM quota."""

    def __init__(self, rate: float, max_tokens: int) -> None:
        self.rate = rate
        self.max_tokens = max_tokens
        self.tokens = float(max_tokens)
        self.last_refill = time.monotonic()

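    async def acquire(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.max_tokens, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens < 1:
            # Sleep just long enough to accrue the missing fraction of a token.
            await asyncio.sleep((1 - self.tokens) / self.rate)
            # The token earned during the sleep is spent immediately, so move the
            # refill clock forward; otherwise that interval is credited twice.
            self.last_refill = time.monotonic()
            self.tokens = 0.0
        else:
            self.tokens -= 1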


# --- Stage 1: non-blocking fetch with bounded concurrency ------------------

async def fetch_logs(
    session: aiohttp.ClientSession,
    endpoint: str,
    semaphore: asyncio.Semaphore,
    raw_queue: asyncio.Queue[dict[str, Any]],
    stats: dict[str, int],
    max_retries: int = 3,
) -> None:
    timeout = ClientTimeout(total=15, connect=5)
    for attempt in range(max_retries):
        try:
            # Hold a concurrency slot only while the request is in flight,
            # so a backoff sleep never starves the other fetchers.
            async with semaphore:
                async with session.get(endpoint, timeout=timeout) as resp:
                    resp.raise_for_status()
                    payload = await resp.json()
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            stats["fetch_errors"] += 1
            if attempt == max_retries - 1:
                logger.error("ERR_FETCH_001 fetch failed after %d attempts: %s", max_retries, exc)
                return
            await asyncio.sleep(2 ** attempt)          # exponential backoff
            continue
        for record in payload.get("logs", []):
            try:
                raw_queue.put_nowait(record)   # ERR_QUEUE_001 if full
            except asyncio.QueueFull:
                stats["shed"] += 1
                logger.warning("ERR_QUEUE_001 backpressure shed")
        return


# --- Stage 2: validate at the queue boundary, route failures to DLQ --------

async def validate_and_route(
    raw_queue: asyncio.Queue[dict[str, Any]],
    valid_queue: asyncio.Queue[dict[str, Any]],
    dlq: asyncio.Queue[dict[str, Any]],
    stats: dict[str, int],
) -> None:
    while True:
        raw = await raw_queue.get()
        try:
            validated = SOCLogRecord(**raw)
            await valid_queue.put(validated.model_dump(mode="json"))
        except ValidationError as ve:
            stats["schema_errors"] += 1
            await dlq.put({"record": raw, "code": "ERR_SCHEMA_001", "error": str(ve)})
        except Exception as exc:                            # noqa: BLE001 - DLQ catch-all
            stats["unknown_errors"] += 1
            await dlq.put({"record": raw, "code": "ERR_SCHEMA_002", "error": str(exc)})
        finally:
            raw_queue.task_done()


# --- Stage 3: token-bucketed, size-triggered batch dispatch ----------------

async def batch_dispatch(
    valid_queue: asyncio.Queue[dict[str, Any]],
    session: aiohttp.ClientSession,
    limiter: TokenBucketLimiter,
    siem_endpoint: str,
    stats: dict[str, int],
    batch_size: int = 500,
) -> None:
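    batch: list[dict[str, Any]] = []
    while True:
        try:
            # Bounded wait so a partial batch still flushes once the stream goes
            # quiet; the 1-second idle window is an illustrative value.
            batch.append(await asyncio.wait_for(valid_queue.get(), timeout=1.0))
            if len(batch) < batch_size:
                continue
        except asyncio.TimeoutError:
            if not batch:
                continue
        await limiter.acquire()
        try:
            async with session.post(siem_endpoint, json=batch) as resp:
                if resp.status == 429:
                    logger.warning("ERR_DISPATCH_001 throttled by SIEM (429)")
                resp.raise_for_status()
                stats["dispatched"] += len(batch)
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            # Connection and timeout failures are caught too: an uncaught error
            # would kill this task and leave valid_queue.join() waiting forever.
            # The rejected batch is still discarded here; retry or dead-letter it
            # in production, as the error-code reference recommends.
            stats["dispatch_errors"] += 1
            logger.error("ERR_DISPATCH_002 batch rejected: %s", exc)
        finally:
            # Acknowledge records only after the send attempt, so join() cannot
            # return while a batch is still pending.
            for _ in batch:
                valid_queue.task_done()
            batch.clear()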


# --- Orchestration ----------------------------------------------------------

async def run_collector(
    source_endpoints: list[str],
    siem_endpoint: str,
    concurrency: int = 8,
    queue_max: int = 10_000,
) -> dict[str, int]:
    raw_queue: asyncio.Queue[dict[str, Any]] = asyncio.Queue(maxsize=queue_max)
    valid_queue: asyncio.Queue[dict[str, Any]] = asyncio.Queue(maxsize=queue_max)
    dlq: asyncio.Queue[dict[str, Any]] = asyncio.Queue(maxsize=queue_max)
    semaphore = asyncio.Semaphore(concurrency)
    limiter = TokenBucketLimiter(rate=50.0, max_tokens=100)
    stats: dict[str, int] = {
        "shed": 0, "fetch_errors": 0, "schema_errors": 0,
        "unknown_errors": 0, "dispatched": 0, "dispatch_errors": 0,
    }

    async with aiohttp.ClientSession() as session:
        validator = asyncio.create_task(validate_and_route(raw_queue, valid_queue, dlq, stats))
        dispatcher = asyncio.create_task(
            batch_dispatch(valid_queue, session, limiter, siem_endpoint, stats)
        )
        async with asyncio.TaskGroup() as tg:
            for endpoint in source_endpoints:
                tg.create_task(fetch_logs(session, endpoint, semaphore, raw_queue, stats))

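        await raw_queue.join()     # drain validation
        await valid_queue.join()   # drain dispatch
        validator.cancel()
        dispatcher.cancel()
        # Wait for both cancellations to complete so no task outlives the session.
        await asyncio.gather(validator, dispatcher, return_exceptions=True)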

    stats["dlq_depth"] = dlq.qsize()
    return stats


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    result = asyncio.run(run_collector(
        source_endpoints=[f"https://audit.internal/api/logs?page={p}" for p in range(20)],
        siem_endpoint="https://siem.internal/api/v1/logs",
    ))
    logger.info("collector stats: %s", result)

The fetcher never awaits inside the queue’s put path: it uses put_nowait, so a full queue sheds the record and logs ERR_QUEUE_001 instead of letting one slow downstream stage block every fetch coroutine. Validation runs off the hot fetch path, exactly the boundary where the schema validation pipelines model expects type enforcement, and any record that fails validation is dead-lettered with an error code rather than dropped silently. Dispatch is gated by the token bucket so flush rate stays inside the SIEM quota described in the pipeline’s rate limiting strategies.
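To see how a token bucket of this kind paces flushes, the sketch below drives one with a virtual clock so the timing is deterministic. `FakeClock`, `Bucket`, and `virtual_seconds_for` are hypothetical names for illustration; the rate and capacity values match the ones passed to the limiter in run_collector.

```python
import asyncio
from typing import Awaitable, Callable


class FakeClock:
    """Deterministic stand-in for time.monotonic and asyncio.sleep."""

    def __init__(self) -> None:
        self.now = 0.0

    def monotonic(self) -> float:
        return self.now

    async def sleep(self, seconds: float) -> None:
        self.now += seconds  # advance virtual time instead of waiting


class Bucket:
    """Token bucket: `rate` tokens per second, bursts of up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: int,
        clock: Callable[[], float],
        sleep: Callable[[float], Awaitable[None]],
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    async def acquire(self) -> None:
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            # Wait for the missing fraction of a token, then spend it at once.
            await self.sleep((1 - self.tokens) / self.rate)
            self.last = self.clock()
            self.tokens = 0.0
        else:
            self.tokens -= 1


async def virtual_seconds_for(acquires: int, rate: float, capacity: int) -> float:
    """Return the virtual time consumed by `acquires` back-to-back acquisitions."""
    clock = FakeClock()
    bucket = Bucket(rate, capacity, clock.monotonic, clock.sleep)
    for _ in range(acquires):
        await bucket.acquire()
    return clock.now
```

With rate=50.0 and capacity 100, the first 100 acquisitions cost no time (the burst allowance) and each one after that costs 1/50 of a second. Because the dispatcher acquires one token per batch, the rate is a ceiling on batches per second, not on records per second.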

Error-Code Reference

The codes follow the ERR_CATEGORY_NNN convention so the dead-letter sink is triageable by category. Mapping failure modes to recoverable versus fatal dispositions is the job of the error categorization frameworks downstream of this collector.

Code	Meaning	Action
`ERR_QUEUE_001`	Bounded raw queue full; record shed under backpressure	Increase `queue_max` or downstream throughput, or rate-limit the source (lowering fetch `concurrency` reduces the burst into the queue); alert if `shed` is non-zero at steady state
`ERR_FETCH_001`	Source fetch exhausted retries (timeout/connection)	Verify source reachability; open a circuit breaker on the endpoint; replay the page once healthy
`ERR_SCHEMA_001`	Pydantic validation failure (type/range/missing field)	Diff source payload against `SOCLogRecord`; patch model on vendor schema drift; replay from DLQ
`ERR_SCHEMA_002`	Unexpected processing failure during validation	Inspect DLQ record; treat as a code defect, not data drift; fix and replay
`ERR_DISPATCH_001`	SIEM returned HTTP 429 (throttled)	Lower token-bucket `rate`; back off and retry the batch; do not clear before confirming delivery
`ERR_DISPATCH_002`	SIEM rejected batch (4xx/5xx after raise)	Route batch to DLQ; check payload size against sink limit; reduce `batch_size`
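One way to make the table machine-readable for triage is a small lookup keyed on the code. The `Disposition` enum, the `DISPOSITIONS` mapping, and the choice to dead-letter unknown codes are assumptions made for this sketch; they are one reading of the Action column, not a defined part of the collector.

```python
import re
from enum import Enum


class Disposition(Enum):
    RETRY = "retry"              # transient: try again with backoff
    DEAD_LETTER = "dead_letter"  # structural: park for inspection and replay
    SHED = "shed"                # dropped under backpressure: alert only


# Codes are taken from the table above; dispositions are an interpretation.
DISPOSITIONS: dict[str, Disposition] = {
    "ERR_QUEUE_001": Disposition.SHED,
    "ERR_FETCH_001": Disposition.RETRY,
    "ERR_SCHEMA_001": Disposition.DEAD_LETTER,
    "ERR_SCHEMA_002": Disposition.DEAD_LETTER,
    "ERR_DISPATCH_001": Disposition.RETRY,
    "ERR_DISPATCH_002": Disposition.DEAD_LETTER,
}

_CODE_PATTERN = re.compile(r"ERR_([A-Z]+)_(\d{3})")


def category(code: str) -> str:
    """Extract the CATEGORY part of an ERR_CATEGORY_NNN code."""
    match = _CODE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"not an ERR_CATEGORY_NNN code: {code!r}")
    return match.group(1)


def disposition_for(code: str) -> Disposition:
    """Look up the disposition; unknown codes are dead-lettered, not dropped."""
    return DISPOSITIONS.get(code, Disposition.DEAD_LETTER)
```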

Operational Notes

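Memory profile. Queued state is bounded by queue_max × 3 × avg_record_size plus one in-flight batch. With queue_max=10_000 and ~1 KB records that is roughly 30 MB of payload, and it stays flat however hard the source bursts because put_nowait sheds instead of growing the queue. Budget separately for the response bodies being parsed (up to concurrency pages at once) and for Python’s per-object overhead, which can make resident memory noticeably larger than the raw payload size.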
Batch and concurrency sizing. Start at batch_size=500 and concurrency=8. Size the batch to the smaller of the sink’s payload ceiling (commonly 1–10 MB) and its per-request event cap; if dispatch p99 latency climbs, lower the batch before raising it.
GC under load. Python’s cyclic collector can pause the loop during heavy allocation. For sustained spikes, call gc.disable() at startup and gc.collect() on a low-traffic timer rather than letting automatic collection stall a flush.
Vendor quirks. Some cloud audit APIs paginate with opaque cursors and rate-limit per token, not per IP. Keep concurrency at or below the documented per-token limit, or the fetcher’s own retries will draw 429s from the source that surface as ERR_FETCH_001.
Priority lane. During a spike, a second valid_queue for severity >= 8 records dispatched ahead of bulk telemetry preserves correlation fidelity for the events most likely to matter.
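The priority lane could be sketched with two bounded queues and a consumer that always checks the high-severity lane first. `route_by_severity`, `next_record`, and the polling interval are hypothetical names and values; only the severity >= 8 threshold comes from the note above.

```python
import asyncio
from typing import Any

HIGH_SEVERITY = 8  # threshold from the priority-lane note

Record = dict[str, Any]


def route_by_severity(
    record: Record,
    high: asyncio.Queue[Record],
    bulk: asyncio.Queue[Record],
) -> bool:
    """Place a validated record on its lane; return False if that lane is full."""
    lane = high if record.get("severity", 0) >= HIGH_SEVERITY else bulk
    try:
        lane.put_nowait(record)
        return True
    except asyncio.QueueFull:
        return False


async def next_record(
    high: asyncio.Queue[Record],
    bulk: asyncio.Queue[Record],
    idle_sleep: float = 0.01,
) -> Record:
    """Return the next record, draining the high lane before touching bulk."""
    while True:
        for lane in (high, bulk):
            if not lane.empty():
                return lane.get_nowait()
        # Both lanes are empty: yield briefly, then check again.
        await asyncio.sleep(idle_sleep)
```

Separate queues matter here: with a single shared queue, a flood of bulk telemetry could fill it and cause high-severity records to be shed along with everything else.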

Verification Checklist

Under a synthetic 10x burst, resident memory plateaus at the computed ceiling and the process never OOM-kills.
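stats["dispatched"] + stats["schema_errors"] + stats["unknown_errors"] + stats["shed"], plus the records in any rejected or not-yet-flushed batch, equals the total records fetched (conservation holds). stats["dispatch_errors"] counts batches, not records, so it cannot be added to this sum directly.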
raw_queue.qsize() rises to maxsize and holds there under load instead of growing without bound.
Measured dispatch rate stays at or below the token-bucket rate across a 5-minute spike.
Every DLQ entry carries a populated code field, and each code maps to an entry in the reference table.
Called inside the running loop just after run_collector returns, asyncio.all_tasks() lists only the calling task (no leaked tasks, no missing task_done()).
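Two of these checks lend themselves to small helpers. The sketch below assumes the caller tracks `total_fetched` itself, since the collector's stats dictionary does not record it; `conservation_gap` and `pending_task_names` are hypothetical names.

```python
import asyncio

# Stats keys that each account for exactly one record.
ACCOUNTED_KEYS = ("dispatched", "schema_errors", "unknown_errors", "shed")


def conservation_gap(stats: dict[str, int], total_fetched: int) -> int:
    """Records fetched but not accounted for by any per-record counter."""
    return total_fetched - sum(stats.get(key, 0) for key in ACCOUNTED_KEYS)


async def pending_task_names() -> list[str]:
    """Names of unfinished tasks other than the caller; must run inside the loop."""
    current = asyncio.current_task()
    return sorted(
        task.get_name()
        for task in asyncio.all_tasks()
        if task is not current and not task.done()
    )
```

A non-zero gap points at records lost in a rejected batch or left in an unflushed partial batch; a non-empty task list after shutdown points at a task that was never cancelled or awaited.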

FAQ

Why use asyncio instead of a thread pool for log collection?

Log collection is I/O-bound, not CPU-bound, so the cost you want to hide is network wait, not compute. A thread pool hides it with one OS thread per concurrent blocking request, paying context-switch overhead and lock contention, and concurrent.futures.ThreadPoolExecutor queues submitted work without any bound unless you add one yourself. asyncio multiplexes thousands of in-flight requests on a single thread, suspending each coroutine during its I/O wait, and an explicit asyncio.Queue(maxsize=...) makes backpressure part of the pipeline’s structure.
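The single-thread multiplexing claim is easy to check with simulated I/O. In this sketch `asyncio.sleep` stands in for a network wait, and `fake_fetch` and `run_many` are illustrative names.

```python
import asyncio
import threading
import time


async def fake_fetch(delay: float, thread_ids: set[int]) -> None:
    """Record which OS thread runs this coroutine, then simulate an I/O wait."""
    thread_ids.add(threading.get_ident())
    await asyncio.sleep(delay)


async def run_many(count: int, delay: float) -> tuple[float, int]:
    """Run `count` simulated fetches concurrently; return (elapsed, threads used)."""
    thread_ids: set[int] = set()
    start = time.perf_counter()
    await asyncio.gather(*(fake_fetch(delay, thread_ids) for _ in range(count)))
    return time.perf_counter() - start, len(thread_ids)
```

Calling `asyncio.run(run_many(1000, 0.05))` overlaps all 1,000 waits on one thread, so the elapsed time is close to a single 50 ms delay plus scheduling overhead instead of the 50 seconds a serial loop would take.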

How do I keep the collector from running out of memory during a spike?

Bound every queue with maxsize and enqueue with put_nowait on the producer side, so a full queue sheds a labeled ERR_QUEUE_001 record instead of allocating without limit. Peak memory is then a function of the queue ceilings and record size, not of source burst rate. Tune the ceiling against the host’s available RAM, and alert on a non-zero shed count at steady state since that means the ceiling or downstream throughput needs raising.
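The relationship between queue ceilings and memory can be turned into a quick sizing calculation. This sketch counts raw payload bytes only and assumes three equally sized queues plus one in-flight batch, as in the collector; Python object overhead and response bodies being parsed come on top. Both function names are hypothetical.

```python
def queued_payload_ceiling(
    queue_max: int,
    avg_record_bytes: int,
    queues: int = 3,
    batch_size: int = 500,
) -> int:
    """Upper bound on payload bytes held in the queues plus one in-flight batch."""
    return (queue_max * queues + batch_size) * avg_record_bytes


def queue_max_for_budget(
    budget_bytes: int,
    avg_record_bytes: int,
    queues: int = 3,
    batch_size: int = 500,
) -> int:
    """Largest per-queue maxsize whose payload ceiling fits within the budget."""
    records = budget_bytes // avg_record_bytes - batch_size
    return max(records // queues, 0)
```

For queue_max=10_000 and 1,000-byte records the ceiling is 30,500,000 bytes, in line with the roughly 30 MB figure in the operational notes.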

What belongs in the dead-letter queue versus an in-process retry?

Transient failures, such as a timeout or a 429, are worth a bounded in-process retry with backoff, because the next attempt is likely to succeed and ordering stays local. Structural failures, such as a schema violation or a 4xx that persists after retries, belong in the dead-letter sink with an error code, because the same input fails the same way every time and retrying only holds it in memory longer. The split between ERR_SCHEMA_* (dead-letter) and ERR_FETCH_*/ERR_DISPATCH_001 (retry) encodes exactly that decision.
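A minimal sketch of that split, assuming a caller-supplied `send` coroutine and a hypothetical `TransientSendError` that the caller raises for timeouts and 429s. Neither name exists in the collector module; the DLQ entry shape mirrors the one the validator uses.

```python
import asyncio
from typing import Any, Awaitable, Callable

Batch = list[dict[str, Any]]


class TransientSendError(Exception):
    """Hypothetical marker for failures worth retrying (timeout, HTTP 429)."""


async def send_or_dead_letter(
    send: Callable[[Batch], Awaitable[None]],
    batch: Batch,
    dlq: asyncio.Queue[dict[str, Any]],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> bool:
    """Retry transient failures with backoff; dead-letter everything else."""
    error = "unknown"
    for attempt in range(max_retries):
        try:
            await send(batch)
            return True
        except TransientSendError as exc:
            error = str(exc)
            if attempt < max_retries - 1:
                await sleep(base_delay * 2 ** attempt)  # exponential backoff
        except Exception as exc:
            # Structural failure: the same input would fail again, so stop.
            error = str(exc)
            break
    await dlq.put({"record": batch, "code": "ERR_DISPATCH_002", "error": error})
    return False
```

The `sleep` parameter exists only so tests can substitute a stub and run without real delays.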
