Error Categorization Frameworks for SOC Log Pipelines

An ingestion pipeline that cannot name its own failures cannot be operated. When a parser silently drops a malformed payload, a type mismatch is logged as a generic exception, or an authentication-failure burst is indistinguishable from a vendor outage in the metrics, the Security Operations Center loses the one signal it needs most: a deterministic, machine-readable label on every event and every error. An error categorization framework supplies that label. It is the classification layer that maps both raw security events and pipeline faults to an immutable taxonomy, so downstream correlation, alert routing, and dead-letter recovery all key off a stable code rather than fragile string matching. As part of the broader Log Ingestion & Parsing Workflows pipeline, this framework sits immediately after structural validation and before correlation: it is the gate that turns heterogeneous telemetry into tagged, routable records. This guide walks through a production-grade Python implementation: the taxonomy model, the deterministic categorizer, error-code handling, async spike resilience, and the observability that lets you trust the output in a live SOC.

Problem Framing

Consider a SOC ingesting roughly 40,000 events per second across SSH and RADIUS authentication logs, firewall flow records, EDR alerts, and cloud audit streams (AWS CloudTrail, Azure Activity). Each source emits a different shape: RFC 5424 syslog with structured-data blocks, vendor JSON with nested event.outcome fields, and CEF key-value pairs. Two distinct categorization problems collide in this stream.

The first is event categorization — assigning a security-meaningful class to a well-formed record, for example tagging Failed password for root from 203.0.113.7 as auth.credential_mismatch with a brute_force_indicator context so the correlation layer can open a 5-minute sliding window keyed on source IP. The second is error categorization — naming what went wrong when a record cannot be processed at all: invalid JSON, a missing required field, an unmapped severity enum, or a backpressure drop during a traffic spike. Most pipelines conflate these or handle the second with a bare except Exception, which means an upstream serialization bug and a transient transport corruption produce the same opaque log line. The concrete failure is unroutable ambiguity: analysts cannot tell whether a quarantine spike means a compromised agent, a vendor schema change, or a Kafka hiccup. A deterministic framework resolves this by applying a stable taxonomy to both events and faults, emitting an ERR_CATEGORY_NNN code for every rejection so that quarantine growth becomes a diagnosable metric rather than a mystery.

Prerequisites & Environment

The reference implementation targets Python 3.11+ for StrEnum, the typing improvements, and faster interpreter startup inside high-fan-out consumers. The categorization core is intentionally dependency-light so it can run inside a Kafka consumer, a syslog receiver, or a SOAR step function without a heavy framework.

python3 -m venv .venv
source .venv/bin/activate
pip install "pydantic>=2.6" "structlog>=24.1"
# Optional, for the async spike-resilient batcher shown later:
pip install "uvloop>=0.19"

Infrastructure assumptions:

A transport layer (syslog/Kafka/HTTP) has already terminated and decoded bytes to UTF-8 strings before they reach the categorizer.
Structural validation runs upstream. Categorization classifies; it does not re-validate. Build the gate first via the schema validation pipelines under the ingestion layer — category quality is bounded by field hygiene before this stage.
A dead-letter topic or table exists so events that fail categorization are preserved with their error code, never silently dropped.

Architecture Overview

The framework is a deterministic function wrapped in a thin async service. A decoded log string enters; a structured CategoryResult carrying a primary domain, a secondary failure mode, a tertiary context, and — on failure — an ERR_CATEGORY_NNN code comes out. Pattern compilation happens once at startup; per-event work is a bounded scan over precompiled rules with no network calls, which keeps the stage idempotent and horizontally scalable by partition.

Step-by-Step Implementation

Step 1 — Model the taxonomy and result contract

The taxonomy follows a three-tier model: primary domain (auth, network, endpoint, cloud), secondary failure mode (credential_mismatch, rate_exceeded, service_unavailable), and tertiary context (brute_force_indicator, misconfiguration, vendor_outage). Modeling the result as a Pydantic object makes the category tag a typed contract between this stage and the correlation layer rather than a loose dict.

from __future__ import annotations

from enum import StrEnum

from pydantic import BaseModel, Field


class Domain(StrEnum):
    AUTH = "auth"
    NETWORK = "network"
    ENDPOINT = "endpoint"
    CLOUD = "cloud"
    UNKNOWN = "unknown"


class CategoryResult(BaseModel):
    status: str                                  # "categorized" | "rejected"
    domain: Domain = Domain.UNKNOWN
    failure_mode: str = "uncategorized"
    context: str = "none"
    error_code: str | None = None                # ERR_CATEGORY_NNN on rejection
    event_id: str = "N/A"

    @property
    def category(self) -> str:
        return f"{self.domain}.{self.failure_mode}"

Step 2 — Declare the deterministic rule set

Keep every pattern in one compiled, ordered table so detection engineers can tune it through config-as-code without touching pipeline logic. Compiling the regex set once at import time — never per event — is the single most important latency decision in this stage. Each rule binds a domain and failure mode to a precompiled pattern plus the tertiary context it implies.

import re
from dataclasses import dataclass


@dataclass(frozen=True, slots=True)
class Rule:
    domain: Domain
    failure_mode: str
    context: str
    pattern: re.Pattern[str]


CATEGORY_RULES: tuple[Rule, ...] = (
    Rule(Domain.AUTH, "credential_mismatch", "brute_force_indicator",
         re.compile(r"(?:invalid|failed|incorrect)\s*(?:password|credential|auth)", re.I)),
    Rule(Domain.AUTH, "mfa_timeout", "challenge_expired",
         re.compile(r"mfa\s*(?:expired|timeout|challenge\s*failed)", re.I)),
    Rule(Domain.NETWORK, "rate_exceeded", "capacity_signal",
         re.compile(r"rate\s*limit\s*(?:exceeded|hit|throttled)", re.I)),
    Rule(Domain.NETWORK, "connection_refused", "transport_fault",
         re.compile(r"connection\s*(?:refused|reset|timed\s*out)", re.I)),
    Rule(Domain.ENDPOINT, "service_unavailable", "vendor_outage",
         re.compile(r"service\s*(?:unavailable|crashed|not\s*responding)", re.I)),
    Rule(Domain.CLOUD, "policy_violation", "misconfiguration",
         re.compile(r"(?:policy|iam)\s*(?:denied|violation|misconfigured)", re.I)),
)
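
One way to realize the config-as-code idea is to keep the rule table in a versioned JSON file and compile it once at startup. The sketch below is an illustration, not part of the reference implementation: the JSON layout (keys mirroring the Rule fields), the ConfigRule stand-in, and the load_rules helper are all assumed names.

```python
import json
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigRule:
    """Hypothetical stand-in for Rule, using plain strings so the sketch is self-contained."""
    domain: str
    failure_mode: str
    context: str
    pattern: re.Pattern


def load_rules(config_text: str) -> tuple:
    """Parse a JSON list of rule objects and compile each pattern once, preserving order."""
    rules = []
    for index, entry in enumerate(json.loads(config_text)):
        try:
            compiled = re.compile(entry["pattern"], re.IGNORECASE)
        except re.error as exc:
            # Fail fast at startup rather than on the first matching event.
            raise ValueError(f"rule {index} has an invalid pattern: {exc}") from exc
        rules.append(ConfigRule(entry["domain"], entry["failure_mode"],
                                entry["context"], compiled))
    return tuple(rules)


# Raw string: each "\\s" below is a JSON-escaped backslash, decoded to the regex token \s.
EXAMPLE_CONFIG = r"""
[
  {"domain": "auth", "failure_mode": "credential_mismatch",
   "context": "brute_force_indicator",
   "pattern": "(?:invalid|failed|incorrect)\\s*(?:password|credential|auth)"},
  {"domain": "network", "failure_mode": "rate_exceeded",
   "context": "capacity_signal",
   "pattern": "rate\\s*limit\\s*(?:exceeded|hit|throttled)"}
]
"""

RULES = load_rules(EXAMPLE_CONFIG)
```

Because list order in the file becomes match order, a reviewer can reason about precedence from the config diff alone.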

Step 3 — Implement the deterministic categorizer

The core decodes the payload, extracts the message field safely, and scans the ordered rule set. Order matters: the first match wins, so the most security-relevant rules are declared first. Critically, the categorizer never uses eval, exec, or any dynamic dispatch on log content — every code path is bounded and auditable.

import json

import structlog

log = structlog.get_logger("error_categorizer")


def categorize(raw_payload: str) -> CategoryResult:
    """Classify a decoded log string against the deterministic rule set."""
    try:
        payload = json.loads(raw_payload)
    except json.JSONDecodeError as exc:
        log.warning("categorize_reject", error_code="ERR_CATEGORY_001", detail=str(exc))
        return CategoryResult(status="rejected", error_code="ERR_CATEGORY_001")

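    if not isinstance(payload, dict):
        # Log this rejection too, so every error code leaves a structured trace.
        log.warning("categorize_reject", error_code="ERR_CATEGORY_002",
                    detail=f"expected JSON object, got {type(payload).__name__}")
        return CategoryResult(status="rejected", error_code="ERR_CATEGORY_002")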

    raw_message = payload.get("message", payload.get("msg", ""))
    message = raw_message if isinstance(raw_message, str) else str(raw_message)
    event_id = str(payload.get("event_id", "N/A"))

    for rule in CATEGORY_RULES:
        if rule.pattern.search(message):
            result = CategoryResult(
                status="categorized",
                domain=rule.domain,
                failure_mode=rule.failure_mode,
                context=rule.context,
                event_id=event_id,
            )
            log.info("event_categorized", category=result.category,
                     context=result.context, event_id=event_id)
            return result

    log.info("event_uncategorized", error_code="ERR_CATEGORY_004", event_id=event_id)
    return CategoryResult(status="rejected", error_code="ERR_CATEGORY_004",
                          event_id=event_id)
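
For audit and replay it can help to record which rule fired and what text it matched. The following is a minimal sketch rather than part of the reference categorizer: first_match and the two-entry DEMO_RULES table are hypothetical names, with patterns copied from the rule set above.

```python
import re
from typing import Optional

# Hypothetical two-rule table; the patterns are copied from CATEGORY_RULES.
DEMO_RULES = (
    ("auth.credential_mismatch",
     re.compile(r"(?:invalid|failed|incorrect)\s*(?:password|credential|auth)", re.I)),
    ("network.connection_refused",
     re.compile(r"connection\s*(?:refused|reset|timed\s*out)", re.I)),
)


def first_match(message: str) -> Optional[dict]:
    """Return the index, category, and matched text of the first rule that fires."""
    for index, (category, pattern) in enumerate(DEMO_RULES):
        found = pattern.search(message)
        if found:
            return {"rule_index": index, "category": category,
                    "matched_text": found.group(0)}
    return None


# Both patterns occur in this message; the rule declared first wins.
decision = first_match("connection reset after failed password for admin")
```

Storing rule_index and matched_text alongside the category makes a disputed classification reproducible without rerunning the pipeline.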

Step 4 — Run it end to end

if __name__ == "__main__":
    structlog.configure(processors=[structlog.processors.JSONRenderer()])

    samples = [
        '{"event_id": "evt_9921", "source_ip": "203.0.113.7", '
        '"message": "Failed password for root from 203.0.113.7 port 22"}',
        '{"event_id": "evt_9922", "message": "rate limit exceeded for tenant acme"}',
        '{"event_id": "evt_9923", "message": "informational heartbeat ok"}',
        '{"event_id": "evt_9924", "message": ',  # malformed JSON
    ]
    for s in samples:
        r = categorize(s)
        print(f"{r.event_id:>10}  {r.status:<12}  {r.category}  {r.error_code or ''}")

A brute-force probe resolves to auth.credential_mismatch, a throttling event to network.rate_exceeded, an unmatched heartbeat to ERR_CATEGORY_004 (uncategorized, routed for review rather than alerting), and a truncated payload to ERR_CATEGORY_001. Every outcome is a stable, replayable label.

Schema & Validation Integration

Categorization is only as reliable as the field naming it reads. Map incoming records to Elastic Common Schema before this stage so the categorizer keys off canonical fields rather than vendor-specific names: event.outcome and event.action for the message intent, source.ip for the entity, and event.category/event.type to carry the assigned domain forward (see the official ECS field reference). Emitting the three-tier result into event.category (domain), event.type (failure mode), and a custom labels.context keeps the framework portable across the JSON event normalization patterns defined in the architecture taxonomy. One caveat: ECS restricts event.category and event.type to a fixed set of allowed values, so where strict ECS validation applies, translate the taxonomy to those values (for example auth to authentication) and carry the original tiers under labels.* instead.

The division of labor matters: the schema validation pipelines upstream guarantee structure (required fields present, types correct, enums valid), and the categorizer assumes that guarantee. If a payload reaches categorization as anything other than a JSON object, that is a validation regression, not a categorization concern: the framework tags it ERR_CATEGORY_002 and routes it back so the upstream gate gets the signal it needs. Field-level gaps are treated more leniently in the reference categorize(): a non-string message is coerced with str() and a missing event_id falls back to "N/A", so enforcing those fields strictly remains the upstream gate's job.
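
As a sketch of the field mapping described above, the hypothetical to_ecs helper below writes the three tiers into a nested ECS-shaped document. It follows the text's mapping literally and does not translate values into the ECS allowed sets; a deployment that validates against the official field reference would need its own translation table, which is not shown here.

```python
def to_ecs(domain: str, failure_mode: str, context: str, source_ip: str = "") -> dict:
    """Place the three taxonomy tiers into an ECS-shaped nested dict (illustrative only)."""
    doc = {
        "event": {
            "category": [domain],      # ECS models event.category as an array
            "type": [failure_mode],    # likewise event.type
        },
        "labels": {"context": context},
    }
    if source_ip:
        # Only attach the entity when the upstream record carried one.
        doc["source"] = {"ip": source_ip}
    return doc


record = to_ecs("auth", "credential_mismatch", "brute_force_indicator", "203.0.113.7")
```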

Error Handling & DLQ Routing

Categorization failures fall into distinct classes, each with a stable code and a defined recovery path. Codes follow the ERR_CATEGORY_NNN convention used across the site, which lets dashboards and playbooks branch on the code without parsing free text.

Error code	Trigger	Action
`ERR_CATEGORY_001`	`json.JSONDecodeError`: truncated or non-JSON payload	Route the raw payload to the DLQ with the decode detail; alert if the rate exceeds baseline transport corruption
`ERR_CATEGORY_002`	Decoded payload is not a JSON object (array/scalar)	Quarantine; usually an upstream serialization bug — page the data-platform owner
`ERR_CATEGORY_004`	No rule matched a well-formed event	Route to a review queue, not an alert; rising volume means the rule table is stale and needs a new pattern
`ERR_CATEGORY_010`	Bounded-queue backpressure drop during a spike	Sample-and-defer to cold storage; emit a drop metric so capacity planning sees the pressure

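def categorize_or_dlq(raw_payload: str, dlq) -> CategoryResult | None:
    """Categorize one payload; publish fatal rejections to the DLQ and return None for them."""
    result = categorize(raw_payload)
    # 001/002 cannot be recovered at this stage; 004 stays in the pipeline for review.
    if result.status == "rejected" and result.error_code in {"ERR_CATEGORY_001", "ERR_CATEGORY_002"}:
        dlq.publish({"error_code": result.error_code, "payload": raw_payload})
        return None
    return result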

The non-fatal class (ERR_CATEGORY_004) never blocks the pipeline — an uncategorized but well-formed event is preserved and surfaced as a metric so detection engineers can backfill a rule, exactly as a missing technique mapping degrades gracefully in dynamic severity scoring. Persistent DLQ growth on ERR_CATEGORY_001/002 is the canary for upstream schema drift or a flapping transport.
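
The table above can also be expressed as a lookup so that routing branches on the code alone. This is a sketch under assumptions: the destination names (dlq, review_queue, cold_storage, correlation) and the route helper are hypothetical labels for whatever topics or tables a deployment actually uses.

```python
# Hypothetical destination names; the codes come from the error table.
ROUTES = {
    "ERR_CATEGORY_001": "dlq",           # malformed JSON
    "ERR_CATEGORY_002": "dlq",           # payload is not a JSON object
    "ERR_CATEGORY_004": "review_queue",  # well-formed but no rule matched
    "ERR_CATEGORY_010": "cold_storage",  # backpressure drop during a spike
}


def route(error_code):
    """Return the destination for a result; a missing code means the event was categorized."""
    if error_code is None:
        return "correlation"
    # An unrecognized code is itself a signal, so it is preserved rather than dropped.
    return ROUTES.get(error_code, "dlq")
```

Defaulting unknown codes to the DLQ keeps the "never silently dropped" guarantee intact when a new code is introduced before the routing table is updated.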

Performance Tuning

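The categorizer is pure CPU and allocates little beyond the parsed dict and the result object, so the practical ceiling is set by per-event overhead (decoding, model construction, logging) rather than by classification itself. Once patterns are precompiled, the rule scan over a short message typically takes on the order of microseconds on a modern core; JSON decoding and structured logging usually cost more, which you amortize by batching and by ordering rules so common categories match early. Treat these figures as expectations to verify with your own measurements, not guarantees.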

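Precompile once, never per event. Every re.compile lives at module load. Compiling inside the hot loop is the most common throughput regression: Python's re module caches recently compiled patterns, which softens the penalty, but each call still pays a cache lookup and any cache miss pays a full compile, which can cost 50x.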
Order rules by frequency and severity. First match wins, so place high-volume, high-priority patterns (auth failures) ahead of rare ones; the average scan terminates sooner.
Batch under backpressure. Group events into chunks of 500–2,000 per consumer poll through async log batching so per-event Python overhead amortizes without inflating tail latency.
Bound the queue and the memory. Use a bounded asyncio.Queue with explicit backpressure; a 2,000-event batch stays well under 10 MB. Set the container limit to 256 MB and alert at 70%.
Latency target. Decode + scan + emit should hold under 1 ms p99 per event at 40k EPS across the partition set. If you exceed it, the cause is almost always an unbounded queue thrashing the garbage collector, not the scan itself.
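
To check the precompilation advice on your own hardware, a rough micro-benchmark along these lines can help. It is a sketch: the sample message and iteration count are arbitrary, and absolute numbers vary by machine and Python version. Because the re module caches recently compiled patterns, the per-call variant here measures cache-lookup overhead, not a full recompile.

```python
import re
import timeit

PATTERN_TEXT = r"(?:invalid|failed|incorrect)\s*(?:password|credential|auth)"
MESSAGE = "Failed password for root from 203.0.113.7 port 22"
PRECOMPILED = re.compile(PATTERN_TEXT, re.I)


def scan_precompiled() -> bool:
    """Search with a pattern object compiled once at module load."""
    return PRECOMPILED.search(MESSAGE) is not None


def scan_compile_per_call() -> bool:
    """Call re.compile on every scan; after the first call this hits the re cache."""
    return re.compile(PATTERN_TEXT, re.I).search(MESSAGE) is not None


def measure(iterations: int = 20_000) -> dict:
    """Return average seconds per call for each variant."""
    return {
        "precompiled": timeit.timeit(scan_precompiled, number=iterations) / iterations,
        "compile_per_call": timeit.timeit(scan_compile_per_call, number=iterations) / iterations,
    }


if __name__ == "__main__":
    for name, seconds in measure().items():
        print(f"{name}: {seconds * 1e6:.2f} us per call")
```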

The async batcher below applies these rules: a bounded queue, timeout-guarded ingest that converts spike overflow into a clean ERR_CATEGORY_010 drop, and a worker that flushes by size. Pair it with the gateway-level rate limiting strategies so excess telemetry is shed before it ever reaches the categorizer.

import asyncio

import structlog

log = structlog.get_logger("async_categorizer")


class SpikeResilientBatcher:
    def __init__(self, max_queue: int = 10_000, batch_size: int = 500) -> None:
        self.queue: asyncio.Queue[str] = asyncio.Queue(maxsize=max_queue)
        self.batch_size = batch_size
        self.running = False

    async def ingest(self, raw_payload: str) -> bool:
        """Enqueue with backpressure; overflow becomes a clean ERR_CATEGORY_010 drop."""
        try:
            await asyncio.wait_for(self.queue.put(raw_payload), timeout=0.5)
            return True
        except asyncio.TimeoutError:
            log.warning("spike_drop", error_code="ERR_CATEGORY_010",
                        queue_depth=self.queue.qsize())
            return False

    async def _flush(self, batch: list[str]) -> None:
        results = [categorize(p) for p in batch]
        categorized = sum(r.status == "categorized" for r in results)
        log.info("batch_flushed", size=len(batch), categorized=categorized)

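    async def worker(self, flush_interval: float = 1.0) -> None:
        """Drain the queue; flush on batch size or after flush_interval seconds idle."""
        self.running = True
        batch: list[str] = []
        try:
            while self.running:  # the timeout below also lets this flag be re-checked
                try:
                    item = await asyncio.wait_for(self.queue.get(), timeout=flush_interval)
                    batch.append(item)
                    idle = False
                except asyncio.TimeoutError:
                    idle = True  # nothing arrived: flush any partial batch
                if batch and (idle or len(batch) >= self.batch_size):
                    await self._flush_and_ack(batch)
        except asyncio.CancelledError:
            pass  # shutdown requested: fall through to the final flush
        if batch:
            await self._flush_and_ack(batch)

    async def _flush_and_ack(self, batch: list[str]) -> None:
        """Process a batch, then acknowledge its items so queue.join() stays accurate."""
        await self._flush(batch)
        for _ in batch:
            self.queue.task_done()
        batch.clear()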
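
The backpressure behaviour can be exercised in isolation with the standard library alone. The sketch below uses a hypothetical bounded_put helper that mirrors ingest() without the logging, and a deliberately tiny queue with no consumer so that the third put times out.

```python
import asyncio


async def bounded_put(queue: asyncio.Queue, item: str, timeout: float) -> bool:
    """Try to enqueue within the timeout; False signals a drop (ERR_CATEGORY_010 in the text)."""
    try:
        await asyncio.wait_for(queue.put(item), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        return False


async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)
    # No worker is draining the queue, so the third put cannot complete.
    return [await bounded_put(queue, f"event-{i}", timeout=0.05) for i in range(3)]


outcomes = asyncio.run(demo())
```

The timeout is the tuning knob: a longer wait absorbs short bursts at the cost of stalling the producer, while a shorter one sheds load sooner.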

Verification & Observability

Trust in a categorization framework comes from being able to replay any decision and assert on it. Emit the structured result for every event and pin the taxonomy with tests so a rule change cannot silently reclassify production traffic.

def test_brute_force_categorization() -> None:
    payload = '{"event_id": "t1", "message": "Failed password for root"}'
    result = categorize(payload)
    assert result.status == "categorized"
    assert result.category == "auth.credential_mismatch"
    assert result.context == "brute_force_indicator"


def test_malformed_payload_is_coded() -> None:
    result = categorize('{"event_id": "t2", "message":')
    assert result.status == "rejected"
    assert result.error_code == "ERR_CATEGORY_001"

Operational signals to dashboard:

Category distribution — counts per domain.failure_mode per hour. A sudden shift toward auth.credential_mismatch is a brute-force signal; a collapse to unknown means a vendor changed its message format.
Uncategorized rate — ERR_CATEGORY_004 per minute. A rising value means the rule table is stale and a new pattern is overdue.
DLQ rate — ERR_CATEGORY_001/002 per minute; a step change signals upstream schema drift or transport corruption.
Spike-drop rate — ERR_CATEGORY_010 per minute, correlated with queue depth, tells capacity planning when to scale partitions.
Decision replay — index the structured result so any categorization can be reconstructed during incident review, aligned with NIST SP 800-92 guidance on retaining security-relevant log management records.
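
A minimal sketch of how these signals might be derived from one window of emitted results. The dict shape (category, error_code) and the summarize helper are assumptions for illustration; a production deployment would more likely export counters to its metrics system.

```python
from collections import Counter


def summarize(results: list) -> dict:
    """Aggregate one time window of result dicts into dashboard-ready numbers."""
    categories = Counter(r["category"] for r in results if r.get("error_code") is None)
    errors = Counter(r["error_code"] for r in results if r.get("error_code"))
    total = len(results)
    return {
        "category_distribution": dict(categories),
        "uncategorized_rate": errors["ERR_CATEGORY_004"] / total if total else 0.0,
        "dlq_count": errors["ERR_CATEGORY_001"] + errors["ERR_CATEGORY_002"],
        "spike_drops": errors["ERR_CATEGORY_010"],
    }


window = [
    {"category": "auth.credential_mismatch", "error_code": None},
    {"category": "auth.credential_mismatch", "error_code": None},
    {"category": "unknown.uncategorized", "error_code": "ERR_CATEGORY_004"},
    {"category": "unknown.uncategorized", "error_code": "ERR_CATEGORY_001"},
]
summary = summarize(window)
```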

Troubleshooting

Everything lands in unknown / ERR_CATEGORY_004. The categorizer is reading the wrong field. Confirm normalization populates message/msg before this stage and that ECS mapping ran; inspect a raw sample to see where the human-readable text actually lives.
Throughput collapses under load. A re.compile is running inside the hot loop, or the queue is unbounded and the GC is thrashing. Move all compilation to module scope and switch to a bounded asyncio.Queue with backpressure.
A known attack pattern is misclassified. Rule order is wrong — a broader pattern earlier in the tuple is matching first. Reorder so specific, high-severity rules precede general ones; first match wins by design.
DLQ fills with valid-looking events. They are arriving as JSON arrays or scalars (ERR_CATEGORY_002), not objects. The upstream serializer is wrapping records; fix the producer rather than loosening the categorizer.
Spike drops with healthy CPU. The queue ceiling is too low for the burst profile. Raise max_queue, add the gateway rate limiter ahead of ingest, and route deferred telemetry to cold storage instead of dropping it outright.
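
The rule-order failure in particular can be caught before deployment. One possible approach, sketched below with hypothetical names (find_shadowed, and a deliberately mis-ordered two-rule table that is not the reference rule set), is to pair each rule with a sample message it must win and report any rule that loses to an earlier one.

```python
import re

# Deliberately mis-ordered: the broad "failed" rule precedes the specific MFA rule.
ORDERED_RULES = (
    ("generic_failure", re.compile(r"failed", re.I)),
    ("mfa_timeout", re.compile(r"mfa\s*(?:expired|timeout|challenge\s*failed)", re.I)),
)

# One representative message per rule that the rule is expected to win.
EXPECTED = {
    "generic_failure": "job failed with exit code 1",
    "mfa_timeout": "MFA challenge failed for user alice",
}


def find_shadowed(rules, expected) -> list:
    """Return (rule, winner) pairs where a different rule captures the rule's sample."""
    shadowed = []
    for name, sample in expected.items():
        # First-match semantics: the earliest rule whose pattern hits is the winner.
        winner = next((n for n, p in rules if p.search(sample)), None)
        if winner != name:
            shadowed.append((name, winner))
    return shadowed


problems = find_shadowed(ORDERED_RULES, EXPECTED)
```

Running such a check in CI turns a rule reordering from a production surprise into a failed build.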

FAQ

What is the difference between event categorization and error categorization?

Event categorization assigns a security-meaningful class to a well-formed record, for example tagging a failed login as auth.credential_mismatch so correlation can window on it. Error categorization names what went wrong when a record cannot be processed at all, emitting an ERR_CATEGORY_NNN code for conditions such as malformed JSON or a backpressure drop. A robust framework does both with the same taxonomy so quarantine growth and attack signals are equally diagnosable.

Why use a fixed regex rule table instead of a machine-learning classifier?

SOC triage demands explainability and determinism: an analyst must be able to see exactly why an event was tagged, and the same input must always produce the same code for audit and replay. A precompiled, ordered rule table gives a full, reconstructable decision for every event. Analyst feedback can later inform new patterns, but a black-box model that cannot justify a classification is hard to operate and harder to defend in a compliance review.

Where does the categorization stage belong in the pipeline?

Immediately after structural validation and before correlation. Validation guarantees the record is well-formed; categorization assigns the routable label that the correlation and severity-scoring stages key off. Keeping it stateless and CPU-bound — no network calls, no per-event state — lets you scale it horizontally by stream partition.

How should the framework behave during a log storm?

Shed load deterministically rather than crashing. A bounded queue converts overflow into a clean ERR_CATEGORY_010 drop with a metric, and a gateway rate limiter sheds excess before ingest. Critical categories such as authentication failures keep priority while verbose, low-value telemetry is sampled or deferred to cold storage, preserving real-time alerting continuity under pressure.
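
A sketch of what deterministic shedding could look like, assuming the domain is already known at the point where load is shed. The should_keep helper, the PRIORITY_DOMAINS set, and the 10 percent keep rate are hypothetical choices, not values from the reference implementation.

```python
import zlib

PRIORITY_DOMAINS = {"auth"}  # hypothetical: domains that are never sampled
SAMPLE_KEEP_PERCENT = 10     # hypothetical: share of low-priority events kept under pressure


def should_keep(domain: str, event_id: str, under_pressure: bool) -> bool:
    """Keep everything in normal operation; under pressure, sample non-priority events."""
    if not under_pressure or domain in PRIORITY_DOMAINS:
        return True
    # crc32 is stable across processes and runs, so a replay sheds the same events.
    return zlib.crc32(event_id.encode("utf-8")) % 100 < SAMPLE_KEEP_PERCENT
```

Hashing the event ID, as opposed to drawing a random number, keeps the decision reproducible during incident review.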

Related guides

Log Ingestion & Parsing Workflows (parent architecture)
Schema Validation Pipelines
Async Log Batching
Rate Limiting Strategies
JSON Event Normalization