Validating JSON Logs Against JSON Schema in Python

A generic try/except around json.loads() confirms that a log line parses, but it never proves that the decoded object carries the typed fields a correlation rule depends on, so syntactically valid but structurally wrong JSON slips straight through to detection logic. This page covers a single technique: asserting a decoded record against a formal JSON Schema document with Python’s jsonschema library. That is the deep field-level check that runs inside a schema validation pipeline, itself one stage of the broader Log Ingestion & Parsing Workflows pipeline. The focus is narrow and practical: choosing the right draft, turning on format assertion, and extracting one precise, actionable error per failing record instead of a wall of nested validator output.

Root-Cause Context

JSON looks self-describing, which lulls teams into trusting it structurally. In a SOC the opposite is true: every vendor emits a different shape and changes it without notice. A decoded object is syntactically valid JSON yet still wrong for detection in three recurring ways.

First, type drift. An agent serializes event.severity as the string "8" after a library upgrade. json.loads() accepts it; a correlation rule keyed on severity >= 7 then compares a string against an integer and either raises or silently never matches.

Second, format violations that pass type checks. source.ip arrives as "not-an-ip", or as an IPv6 literal where the mapping assumes IPv4: a perfectly good string, a useless network indicator.

Third, the dotted-key trap specific to ECS-style logs. Elastic Common Schema field names like event.id are commonly emitted as flat top-level keys with literal dots ({"event.id": "..."}), not as a nested {"event": {"id": "..."}} object. A schema written against the wrong representation rejects every record.

A JSON Schema check catches all three deterministically, before the record reaches normalization or alert severity scoring, where a mis-typed field corrupts the score rather than failing loudly.
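The three failure modes can be reproduced with the standard library alone. The sketch below is illustrative only: the payload, `severity_rule_matches`, and `is_ipv4` are hypothetical stand-ins for a real record and a real correlation rule, not code from the pipeline.

```python
import ipaddress
import json

# Hypothetical payload carrying all three problems; it still parses cleanly.
raw = '{"event.id": "evt_9", "event.severity": "8", "source.ip": "not-an-ip"}'
record = json.loads(raw)  # no exception: the JSON is syntactically valid


def severity_rule_matches(rec: dict) -> str:
    """Stand-in for a correlation rule keyed on severity >= 7."""
    try:
        return "match" if rec["event.severity"] >= 7 else "no-match"
    except TypeError:
        # Python 3 refuses to order a str against an int.
        return "type-error"


def is_ipv4(value: str) -> bool:
    """True only for a dotted-quad IPv4 literal."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False


type_drift = severity_rule_matches(record)         # "type-error"
bad_format = not is_ipv4(record["source.ip"])      # True
# Dotted-key trap: the flat key exists, the nested object does not.
nested_lookup = record.get("event", {}).get("id")  # None
flat_lookup = record.get("event.id")               # "evt_9"
```

None of these defects raises at decode time, which is why the check has to happen against a schema rather than inside the JSON parser.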

Prerequisites

The reference uses the interpreted jsonschema library specifically because it exposes rich, walkable error objects (best_match, iter_errors, JSON paths) that a compiled validator hides — exactly what you need for per-field diagnosis. Target Python 3.11+ for StrEnum.

python3 -m venv .venv
source .venv/bin/activate
# rfc3339-validator backs "date-time" and fqdn backs "hostname"; "ipv4" uses the stdlib ipaddress module
pip install "jsonschema>=4.21" "rfc3339-validator>=0.1.4" "fqdn>=1.5"

Assumptions: transport has already decoded bytes to a Python dict (the ingestion stage handles the orjson decode guard and the async batching); this stage receives an object and proves its fields. Validation is pure CPU and stateless, so it parallelizes cleanly behind async log batching.

Production-Ready Implementation

The key decisions: pin Draft202012Validator so behavior is reproducible across jsonschema upgrades; attach a FormatChecker so "format": "ipv4" and "date-time" are asserted rather than treated as documentation (format is annotation-only by default); and collapse the validator’s error tree to a single best_match so each record yields one stable ERR_JSONSCHEMA_NNN code.

from __future__ import annotations

from enum import StrEnum
from typing import Any

from jsonschema import Draft202012Validator, FormatChecker
from jsonschema.exceptions import ValidationError, best_match

# ECS dotted keys are flat top-level fields, NOT a nested object — match the wire shape.
LOG_SCHEMA: dict[str, Any] = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["event.id", "@timestamp", "source.ip", "event.severity"],
    "properties": {
        "event.id": {"type": "string", "minLength": 1},
        "@timestamp": {"type": "string", "format": "date-time"},
        "source.ip": {"type": "string", "format": "ipv4"},
        "event.severity": {"type": "integer", "minimum": 0, "maximum": 10},
        "event.category": {
            "type": "string",
            "enum": ["authentication", "network", "endpoint", "cloud", "iam"],
        },
    },
    "additionalProperties": True,  # keep vendor extras for downstream normalization
}


class JsonSchemaError(StrEnum):
    NOT_OBJECT = "ERR_JSONSCHEMA_001"     # payload is not a JSON object
    MISSING_FIELD = "ERR_JSONSCHEMA_002"  # a required field is absent
    TYPE_MISMATCH = "ERR_JSONSCHEMA_003"  # field present, wrong JSON type
    FORMAT = "ERR_JSONSCHEMA_004"         # ipv4 / date-time format assertion failed
    CONSTRAINT = "ERR_JSONSCHEMA_005"     # enum / bounds / minLength violation


# Built once at import: format assertion turns "format" keywords into hard checks.
_VALIDATOR = Draft202012Validator(LOG_SCHEMA, format_checker=FormatChecker())


def _code_for(error: ValidationError) -> JsonSchemaError:
    """Map the single most-relevant ValidationError to a stable error code."""
    match error.validator:
        case "required":
            return JsonSchemaError.MISSING_FIELD
        case "type":
            return JsonSchemaError.TYPE_MISMATCH
        case "format":
            return JsonSchemaError.FORMAT
        case _:  # enum, minimum, maximum, minLength, ...
            return JsonSchemaError.CONSTRAINT


def validate_record(record: object) -> tuple[bool, JsonSchemaError | None, str]:
    """Validate one decoded JSON log. Returns (ok, error_code, human_detail)."""
    if not isinstance(record, dict):
        return False, JsonSchemaError.NOT_OBJECT, "top-level payload is not an object"

    # best_match collapses the whole error tree to the single most useful failure,
    # so a record with several problems still yields one deterministic code.
    error = best_match(_VALIDATOR.iter_errors(record))
    if error is None:
        return True, None, "ok"

    field = ".".join(str(p) for p in error.absolute_path) or "<root>"
    code = _code_for(error)
    return False, code, f"{field}: {error.message}"


if __name__ == "__main__":
    samples: list[object] = [
        {"event.id": "evt_1", "@timestamp": "2026-06-27T10:00:00Z",
         "source.ip": "203.0.113.7", "event.severity": 8, "event.category": "authentication"},
        {"event.id": "evt_2", "source.ip": "10.0.0.5", "event.severity": 5},        # missing @timestamp
        {"event.id": "evt_3", "@timestamp": "2026-06-27T10:00:00Z",
         "source.ip": "10.0.0.6", "event.severity": "high"},                         # severity is a string
        {"event.id": "evt_4", "@timestamp": "2026-06-27T10:00:00Z",
         "source.ip": "not-an-ip", "event.severity": 3},                             # bad ipv4 format
        {"event.id": "evt_5", "@timestamp": "2026-06-27T10:00:00Z",
         "source.ip": "10.0.0.8", "event.severity": 99},                            # out of bounds
        ["not", "an", "object"],                                                      # wrong top-level type
    ]
    for rec in samples:
        ok, err, detail = validate_record(rec)
        print(f"{'PASS' if ok else err:<20} {detail}")

Running it prints one line per record: the valid event passes; the others resolve to ERR_JSONSCHEMA_002 (missing @timestamp), 003 (string severity), 004 (source.ip is not a valid IPv4), 005 (severity out of bounds), and 001 (top-level array). Because best_match ranks errors by relevance, a record with multiple defects still produces the single most actionable message rather than a tangle of nested validator output.

Error-Code Reference

Error code	Meaning	Action
`ERR_JSONSCHEMA_001`	Decoded payload is not a JSON object (array, scalar, or `null`)	Quarantine raw bytes; almost always a wrapped envelope or a batch-framing bug upstream
`ERR_JSONSCHEMA_002`	A `required` field is absent	Quarantine with the field name; a rising rate signals agent config drift or a vendor field rename
`ERR_JSONSCHEMA_003`	Field present but wrong JSON type (e.g. string `"8"` for integer)	Route to a coercion worker, or quarantine if coercion is ambiguous; usually an upstream serializer bug
`ERR_JSONSCHEMA_004`	`format` assertion failed: `source.ip` not IPv4, `@timestamp` not RFC 3339	Quarantine; check for IPv6 sources or a timestamp format outside RFC 3339 (for example epoch seconds, or a date-time with no UTC offset) that the producer started emitting
`ERR_JSONSCHEMA_005`	Constraint failure — `enum`, `minimum`/`maximum`, `minLength`	Quarantine for review; a new enum value means the schema needs an additive update

Codes follow the ERR_CATEGORY_NNN convention used across the site, so dashboards and dead-letter routing branch on the code without parsing free text. Structural rejection here is distinct from the security-meaning classification done later by the error categorization framework, which assumes the fields already exist and are well-typed.
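Branching on the code rather than on free text might look like the following minimal sketch. The `Route` enum, the `ROUTES` table, and `route_for` are hypothetical names; the destinations simply mirror the Action column above and would need to match your own dead-letter topology.

```python
from __future__ import annotations

from enum import Enum


class Route(Enum):
    ACCEPT = "accept"          # forward to normalization
    COERCE = "coerce"          # hand to a coercion worker
    QUARANTINE = "quarantine"  # dead-letter for review


# Hypothetical routing table keyed on the stable error code.
ROUTES: dict[str, Route] = {
    "ERR_JSONSCHEMA_001": Route.QUARANTINE,
    "ERR_JSONSCHEMA_002": Route.QUARANTINE,
    "ERR_JSONSCHEMA_003": Route.COERCE,
    "ERR_JSONSCHEMA_004": Route.QUARANTINE,
    "ERR_JSONSCHEMA_005": Route.QUARANTINE,
}


def route_for(error_code: str | None) -> Route:
    """Pick a destination from the error code alone; None means the record passed."""
    if error_code is None:
        return Route.ACCEPT
    # Unknown codes fail closed so a new code never bypasses review.
    return ROUTES.get(error_code, Route.QUARANTINE)
```

Because the lookup never inspects the human-readable detail string, rewording a message cannot silently change where a record is routed.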

Operational Notes

Format is annotation-only by default. Without format_checker=FormatChecker(), a schema with "format": "ipv4" validates "not-an-ip" as passing. This is the single most common silent gap: the schema looks strict but never enforces the format keywords. Always attach the checker, and install the format backends (rfc3339-validator for date-time, fqdn for hostname), because a format whose backend is missing is skipped silently even with the checker attached. The ipv4 check needs no extra package; it relies on the standard-library ipaddress module.
Match the wire shape, not the ECS model. If records arrive with flat dotted keys ("event.id"), the schema’s properties keys must be the literal dotted strings, as above. If a source emits genuinely nested objects, write nested properties. Mixing the two is the cause of “everything is rejected as ERR_JSONSCHEMA_002.”
Build the validator once. Draft202012Validator(LOG_SCHEMA, ...) lives at module scope. Re-instantiating it per record re-parses the schema and is the dominant throughput regression for the interpreted library.
Performance profile. The interpreted jsonschema validator runs roughly 5–20x slower than a compiled fastjsonschema check. Absolute per-record cost depends on schema size, record shape, and hardware, so benchmark with your own payloads before sizing workers. Use it where rich error paths matter (quarantine triage, schema authoring, replay) and the compiled path in the highest-throughput hot loop; both can share the same schema document.
Memory and batch sizing. Validation holds only the record under inspection, so a 1,000–2,000 record batch validated in place stays well under 10 MB. Pair it with gateway rate limiting strategies so spike overflow is shed before it reaches the validator rather than thrashing the interpreter.
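Vendor quirk. Some cloud sources emit event.severity as a float (8.0). JSON Schema "type": "integer" accepts 8.0 (a number with zero fractional part is an integer per the spec) but rejects 8.5. Adding "multipleOf": 1 does not change that, because 8.0 is still a multiple of 1, and the validated value remains a Python float. Decide deliberately whether to coerce it to int after validation or to reject floats with a Python-side type check.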

Verification Checklist

A well-formed event returns (True, None, "ok") and every malformed sample returns the expected ERR_JSONSCHEMA_NNN code.
Removing format_checker=FormatChecker() makes the "not-an-ip" sample pass — confirming format assertion is actually active in your build.
pip show rfc3339-validator fqdn lists both backends, so date-time and hostname formats are enforced rather than skipped (ipv4 needs no backend beyond the standard library).
The schema’s properties keys match the literal field names on the wire (dotted vs nested) — verified against a real captured payload, not a doc example.
best_match returns one error for a record with multiple defects (e.g. missing field and bad format), giving a single deterministic code.
The validator is instantiated once at module load; no Draft202012Validator(...) call appears inside the per-record path.
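The backend item in the checklist can also be checked from Python, for example at service start-up. This is a sketch using only the standard library; `FORMAT_BACKENDS` and `missing_format_backends` are hypothetical names, and the format-to-package mapping is the one described on this page.

```python
from importlib.util import find_spec

# Format keyword -> import name of the optional package that backs it.
FORMAT_BACKENDS: dict[str, str] = {
    "date-time": "rfc3339_validator",  # pip package: rfc3339-validator
    "hostname": "fqdn",                # pip package: fqdn
}


def missing_format_backends() -> dict[str, str]:
    """Return the formats whose backing module cannot be imported."""
    return {
        fmt: module
        for fmt, module in FORMAT_BACKENDS.items()
        if find_spec(module) is None
    }


missing = missing_format_backends()
```

An empty result means both backends are importable; a non-empty one names the format keywords that would be skipped silently, which is a reasonable condition to fail a deployment on.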

FAQ

Why does my schema have a "format": "ipv4" rule but invalid IPs still pass?

Because format is an annotation by default in JSON Schema, not an assertion. The validator records the format keyword but does not check it unless you pass a FormatChecker. Construct the validator as Draft202012Validator(schema, format_checker=FormatChecker()) and install the optional backends (rfc3339-validator for date-time, fqdn for hostname). Without the checker, every format keyword is silently inert and malformed IPs and timestamps validate as correct. With the checker but without a backend, only the formats that depend on that backend are skipped; ipv4 is still enforced because it uses the standard-library ipaddress module.

How do I get one clear error per record instead of the full nested error tree?

Use jsonschema.exceptions.best_match(validator.iter_errors(record)). iter_errors yields every failure including deeply nested ones; best_match ranks them and returns the single most relevant ValidationError. Read error.validator to map to a stable code and error.absolute_path to name the offending field. This turns an unbounded error tree into one deterministic ERR_JSONSCHEMA_NNN label per record — far easier to dashboard and route than raw validator output.

My ECS logs have keys like "event.id" — should the schema be nested or flat?

Match whatever is on the wire. If the producer emits flat top-level keys with literal dots ({"event.id": "..."}), your schema’s properties must use those exact dotted strings as keys, as shown above. If the producer emits a nested object ({"event": {"id": "..."}}), write nested properties and a nested required. A schema written for the wrong representation rejects every record as missing its required fields, which presents as a flood of ERR_JSONSCHEMA_002.

