vllm_observability

`vllm_observability` ¶

vLLM observability primitives for production + benchmark.

Schema-frozen generation-observability events emitted by VllmBackend.generate() and consumed by downstream observability surfaces (structured logs, wandb, the benchmark harness's per-cell aggregator).

Four primitives, all degraded-mode by design:

:class:NvmlPeakSampler — daemon-thread peak device-VRAM tracker via pynvml. Reads at the driver layer so it sees vLLM worker-subprocess allocations regardless of which process holds the torch handle — sidesteps the VLLM_ENABLE_V1_MULTIPROCESSING=1 blind spot where torch.cuda.max_memory_allocated() in the harness reads 0.
:func:read_loadavg — /proc/loadavg snapshot as a (1m, 5m, 15m) triple. None on non-Linux or read failure.
:func:probe_engine_runtime_config — best-effort introspection of the engine's effective scheduler/cache/speculative settings via llm.llm_engine.vllm_config. Empty dict on any failure.
:func:read_vllm_runtime_metrics — one-shot snapshot of llm.get_metrics() for KV-cache usage, prefix-cache hit rate, and speculative-decoding acceptance rate. Returns a dict with stable keys regardless of which metrics the engine actually exposed.

Plus the :class:GenerationObservability pydantic model, the schema for the generation-complete structured event emitted at the end of each generation invocation. It is forward-compatible: every existing field has a default (None, False, or an empty dict), so new optional fields can be added without breaking existing consumers. The model also sets extra="forbid", which rejects unknown keys and so forces producers to update the schema before they emit new fields.

References (design):

- HuggingFace train_memory blog ("Visualize and understand GPU memory in PyTorch") covers in-process torch profiling, which is the gap NVML fills here (out-of-process VRAM visibility): https://huggingface.co/blog/train_memory
- spark-dashboard (Rust) demonstrates the NVML + /metrics pattern at ~1s polling cadence, the precedent for combining NVML's driver-level reading with vLLM's Prometheus surface: https://github.com/niklasfrick/spark-dashboard
- vLLM's metrics design doc enumerates the gauges/counters :func:read_vllm_runtime_metrics reads: https://github.com/vllm-project/vllm/blob/main/docs/design/metrics.md

Classes:

| Name | Description |
| --- | --- |
| `NvmlPeakSampler` | Daemon-thread sampler tracking peak device VRAM via NVML. |
| `GenerationObservability` | One generation-complete event payload. |
| `VllmRuntimeMetrics` | End-of-generation vLLM metric snapshot with a fixed key set. |

Functions:

| Name | Description |
| --- | --- |
| `read_loadavg` | Return `/proc/loadavg` as a (1m, 5m, 15m) triple; `None` when unavailable. |
| `probe_engine_runtime_config` | Best-effort introspection of the engine's effective runtime config. |
| `flag_engagement_mismatches` | Return human-readable mismatch descriptions; empty list means clean engagement. |
| `read_vllm_runtime_metrics` | Snapshot `llm.get_metrics()` for known metrics; degraded-mode on failure. |

`NvmlPeakSampler(device_index=None, interval_seconds=0.25)` ¶

Daemon-thread sampler tracking peak device VRAM via NVML.

Use as a context manager wrapping the work whose peak VRAM you want:

```python
with NvmlPeakSampler() as vram:
    ...  # build engine / run training / generate
peak_gb = vram.peak_gb  # float | None
```

:attr:peak_gb returns None when NVML isn't available (driver missing, pynvml import failed, device index invalid). The sampler reads at the driver layer, so it sees allocations made by worker subprocesses regardless of which process holds the torch handle. It reports device-wide VRAM: on a dedicated host that equals the workload's allocation; on a shared GPU it also includes other processes' allocations.

device_index defaults to the first CUDA_VISIBLE_DEVICES entry (see :func:_default_nvml_device_index) so the sampler follows the workload's GPU on multi-GPU hosts instead of always reading physical GPU 0. Pass an explicit index to override.
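The source of `_default_nvml_device_index` is not shown on this page, so the following is only a minimal sketch of how such a default could be derived. The function name `default_device_index`, the explicit `env` parameter, and the fall-back to index 0 for an unset, empty, or non-integer entry (for example a GPU UUID) are assumptions made for illustration.

```python
from collections.abc import Mapping


def default_device_index(env: Mapping[str, str]) -> int:
    """Pick a device index from the first CUDA_VISIBLE_DEVICES entry (illustrative)."""
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    first = raw.split(",")[0].strip()
    # Assumption: fall back to GPU 0 when the variable is unset, empty, or
    # holds a non-integer entry such as a GPU UUID.
    return int(first) if first.isdigit() else 0


print(default_device_index({"CUDA_VISIBLE_DEVICES": "2,3"}))  # 2
print(default_device_index({}))  # 0
```

In real use the mapping would be `os.environ`. Note that NVML enumerates devices in PCI bus order, which can differ from CUDA's default ordering unless `CUDA_DEVICE_ORDER=PCI_BUS_ID` is set, so passing an explicit `device_index` remains the safest option on mixed-GPU hosts.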

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `peak_gb` | `float \| None` | Peak device-wide VRAM (GiB) observed during sampling; `None` if NVML unavailable. |

Source code in src/nemo_safe_synthesizer/observability.py

```python
def __init__(self, device_index: int | None = None, interval_seconds: float = 0.25) -> None:
    self._device_index = _default_nvml_device_index() if device_index is None else device_index
    self._interval = interval_seconds
    self._stop = threading.Event()
    self._peak_bytes = 0
    self._lock = threading.Lock()
    self._thread: threading.Thread | None = None
    self._handle: Any = None
    self._pynvml: Any = None
```

`peak_gb` `property` ¶

Peak device-wide VRAM (GiB) observed during sampling; None if NVML unavailable.
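Only the constructor is listed here, so the sketch below illustrates the general daemon-thread peak-tracking pattern rather than the actual sampler. The class name `PeakSampler` and the injected `read_used_bytes` callable are hypothetical; in the real class the reading would come from NVML (the field documentation below names `pynvml.nvmlDeviceGetMemoryInfo`), and `peak_gb` would be `None` when NVML is unavailable, which this sketch omits.

```python
import threading
from typing import Callable, Optional


class PeakSampler:
    """Track the maximum of a polled reading on a daemon thread (illustrative)."""

    def __init__(self, read_used_bytes: Callable[[], int], interval_seconds: float = 0.01) -> None:
        self._read = read_used_bytes
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._lock = threading.Lock()
        self._peak_bytes = 0
        self._thread: Optional[threading.Thread] = None

    def _sample_once(self) -> None:
        used = self._read()
        with self._lock:
            self._peak_bytes = max(self._peak_bytes, used)

    def _run(self) -> None:
        # Event.wait doubles as an interruptible sleep between samples.
        while not self._stop.wait(self._interval):
            self._sample_once()

    def __enter__(self) -> "PeakSampler":
        self._sample_once()  # baseline reading before the work starts
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, *exc_info: object) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
        self._sample_once()  # final reading so very short workloads are not missed

    @property
    def peak_gb(self) -> float:
        with self._lock:
            return self._peak_bytes / 2**30


# Fake driver readings: 1 GiB, then 3 GiB, then 2 GiB from there on.
readings = iter([1 * 2**30, 3 * 2**30, 2 * 2**30])
with PeakSampler(lambda: next(readings, 2 * 2**30)) as vram:
    pass  # the monitored work would run here
print(vram.peak_gb)  # 3.0
```

Taking one reading on entry and one on exit means the peak is populated even when the workload finishes before the first polling interval elapses.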

`GenerationObservability` `pydantic-model` ¶

Bases: BaseModel

One generation-complete event payload.

Emitted by VllmBackend.generate() at end of each generation invocation. Consumed by:

Structured log routing (default — flows through logger.runtime.info(...) like the rest of PR-1's trace telemetry).
Wandb (when a run is active) — logged to the current wandb run.
The benchmark harness's per-cell aggregator (composes this into its richer CandidateMetrics schema).

Every measurement field is optional; producers should populate what they can capture and leave the rest at the default. Wandb silently drops None values, which is the right behavior for "this metric wasn't reachable on this generation".

Config:

extra: forbid

Fields:

peak_vram_gb (float | None)
kv_cache_usage_perc (float | None)
prefix_cache_hit_rate (float | None)
spec_accept_rate (float | None)
loadavg_pre (tuple[float, float, float] | None)
loadavg_post (tuple[float, float, float] | None)
engine_runtime_config (dict[str, Any])
flag_did_not_engage (bool)

`peak_vram_gb = None` `pydantic-field` ¶

Peak device-wide VRAM usage in GiB, sampled by NVML (pynvml.nvmlDeviceGetMemoryInfo) across the whole generation. None when NVML is unavailable. Device-wide reading; on a shared GPU it includes other processes.

`kv_cache_usage_perc = None` `pydantic-field` ¶

vLLM's vllm:kv_cache_usage_perc gauge (fraction 0..1 of KV cache blocks in use) at end of generation. None when the engine doesn't expose the gauge or the call failed. This only approximates peak usage: vLLM publishes the instantaneous value, not a max-over-time, so the end-of-generation reading can understate the true peak.

`prefix_cache_hit_rate = None` `pydantic-field` ¶

Derived from vllm:prefix_cache_hits / vllm:prefix_cache_queries at end of generation. None when either counter is absent or queries==0. Surfaces whether shared schema prefixes actually amortized across the batch.

`spec_accept_rate = None` `pydantic-field` ¶

Derived from vllm:spec_decode_num_accepted_tokens / vllm:spec_decode_num_draft_tokens at end of generation. None when speculative decoding wasn't enabled on this generation (counters absent) or no drafts were proposed (denominator==0).

`loadavg_pre = None` `pydantic-field` ¶

Host /proc/loadavg snapshot captured at the start of this generation (1-min, 5-min, 15-min averages). None when /proc/loadavg is unavailable (non-Linux).

`loadavg_post = None` `pydantic-field` ¶

Host /proc/loadavg snapshot captured at the end of this generation. Drift from loadavg_pre signals load change during the generation.

`engine_runtime_config` `pydantic-field` ¶

Best-effort probe of the engine's effective runtime config (enable_prefix_caching, enable_chunked_prefill, max_num_seqs, max_num_batched_tokens, kv_cache_dtype, speculative_method when populated). Empty dict on probe failure.

`flag_did_not_engage = False` `pydantic-field` ¶

True when engine_runtime_config disagrees with the candidate/caller's intended setting on any checked field — an unsupported knob silently ignored, a default-on flag overriding an explicit-off intent, etc.

`to_wandb_payload(prefix='vllm_gen')` ¶

Flatten this event into a wandb-friendly wandb.log(...) dict.

Wandb plots scalars cleanly but renders tuples/dicts as opaque blobs, so this method:

Drops None values (wandb would drop them anyway; explicit here for documentation).
Unpacks loadavg_pre / loadavg_post 3-tuples to per-duration scalars (loadavg_pre_1m / _5m / _15m).
Flattens engine_runtime_config to engine_runtime/<key> scalars (mirrors the existing flattening pattern in the benchmark harness).

All keys are namespaced under prefix so production generation events don't collide with other wandb metrics in the same run.
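To make the resulting key layout concrete, here is a small standalone sketch of the same flattening rules applied to a plain dict instead of the pydantic model. The function name `flatten_event` and the sample values are illustrative only; the label tuple is assumed to match the `_1m` / `_5m` / `_15m` suffixes described above.

```python
from typing import Any

LOADAVG_LABELS = ("1m", "5m", "15m")  # assumed to mirror _LOADAVG_HORIZON_LABELS


def flatten_event(event: dict[str, Any], prefix: str = "vllm_gen") -> dict[str, Any]:
    """Flatten an event dict into scalar keys namespaced under ``prefix``."""
    payload: dict[str, Any] = {}
    for key, value in event.items():
        if value is None:
            continue  # unreachable metrics are simply left out
        if key in ("loadavg_pre", "loadavg_post"):
            for label, item in zip(LOADAVG_LABELS, value):
                payload[f"{prefix}/{key}_{label}"] = item
        elif key == "engine_runtime_config":
            for cfg_key, cfg_value in value.items():
                if cfg_value is not None:
                    payload[f"{prefix}/engine_runtime/{cfg_key}"] = cfg_value
        else:
            payload[f"{prefix}/{key}"] = value
    return payload


payload = flatten_event(
    {
        "peak_vram_gb": 41.5,
        "spec_accept_rate": None,
        "loadavg_pre": (0.5, 0.4, 0.3),
        "loadavg_post": None,
        "engine_runtime_config": {"max_num_seqs": 256, "kv_cache_dtype": None},
        "flag_did_not_engage": False,
    }
)
for key, value in payload.items():
    print(key, value)
```

This prints six keys: `vllm_gen/peak_vram_gb`, the three `vllm_gen/loadavg_pre_*` scalars, `vllm_gen/engine_runtime/max_num_seqs`, and `vllm_gen/flag_did_not_engage`. A `False` flag is kept because only `None` is dropped.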

Source code in src/nemo_safe_synthesizer/generation/vllm_observability.py

```python
def to_wandb_payload(self, prefix: str = "vllm_gen") -> dict[str, Any]:
    """Flatten this event into a wandb-friendly ``wandb.log(...)`` dict.

    Wandb plots scalars cleanly but renders tuples/dicts as opaque
    blobs, so this method:

    - Drops ``None`` values (wandb would drop them anyway; explicit
      here for documentation).
    - Unpacks ``loadavg_pre`` / ``loadavg_post`` 3-tuples to per-
      duration scalars (``loadavg_pre_1m`` / ``_5m`` / ``_15m``).
    - Flattens ``engine_runtime_config`` to ``engine_runtime/<key>``
      scalars (mirrors the existing flattening pattern in the
      benchmark harness).

    All keys are namespaced under ``prefix`` so production generation
    events don't collide with other wandb metrics in the same run.
    """
    payload: dict[str, Any] = {}
    for scalar_field in (
        "peak_vram_gb",
        "kv_cache_usage_perc",
        "prefix_cache_hit_rate",
        "spec_accept_rate",
        "flag_did_not_engage",
    ):
        value = getattr(self, scalar_field)
        if value is not None:
            payload[f"{prefix}/{scalar_field}"] = value
    for side in ("pre", "post"):
        tup = getattr(self, f"loadavg_{side}")
        if tup is None:
            continue
        for label, value in zip(_LOADAVG_HORIZON_LABELS, tup, strict=False):
            payload[f"{prefix}/loadavg_{side}_{label}"] = value
    for key, value in self.engine_runtime_config.items():
        if value is not None:
            payload[f"{prefix}/engine_runtime/{key}"] = value
    return payload
```

`VllmRuntimeMetrics` ¶

Bases: TypedDict

End-of-generation vLLM metric snapshot with a fixed key set.

Every value is float | None; None means the engine did not surface that counter on this generation (distinct from a measured zero).

A TypedDict rather than a dataclass on purpose: the value stays a plain dict at runtime, so dict-style consumers (e.g. the benchmark harness) are unaffected, while callers gain static key checking and float | None value typing instead of dict[str, float | None].
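The class body is not listed on this page. A minimal sketch of what such a TypedDict could look like, using the three key names documented for `read_vllm_runtime_metrics`, is shown below; the class name `RuntimeMetricsSketch` is hypothetical, and the sample values are made up.

```python
from typing import Optional, TypedDict


class RuntimeMetricsSketch(TypedDict):
    """Illustrative stand-in for VllmRuntimeMetrics; every key is always present."""

    kv_cache_usage_perc: Optional[float]
    prefix_cache_hit_rate: Optional[float]
    spec_accept_rate: Optional[float]


snapshot = RuntimeMetricsSketch(
    kv_cache_usage_perc=0.37,
    prefix_cache_hit_rate=None,  # counter not surfaced, distinct from 0.0
    spec_accept_rate=None,
)

print(type(snapshot) is dict)  # True: a plain dict at runtime
print(snapshot["kv_cache_usage_perc"])  # 0.37
```

Because the value is an ordinary dict, existing consumers that iterate over `.items()` or index by string keep working, while a type checker can reject misspelled keys.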

`read_loadavg()` ¶

Return /proc/loadavg as a (1m, 5m, 15m) triple; None when unavailable.

Linux-only. Cheap (a single small file read). Safe to call from any process: the read is host-scoped, not process-scoped. Designed to bracket a workload: the caller reads once before and once after, and the pair shows whether host load drifted during the run.

Source code in src/nemo_safe_synthesizer/observability.py

```python
def read_loadavg() -> tuple[float, float, float] | None:
    """Return ``/proc/loadavg`` as a (1m, 5m, 15m) triple; ``None`` when unavailable.

    Linux-only. Cheap (one syscall). Safe to call from any process -- the read
    is host-scoped, not process-scoped. Designed to bracket a workload: caller
    reads pre + post, the pair is informative about whether host load drifted
    during the run.
    """
    try:
        with open("/proc/loadavg", encoding="utf-8") as f:
            parts = f.read().split()
        return (float(parts[0]), float(parts[1]), float(parts[2]))
    except (OSError, ValueError, IndexError):
        return None
```
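The bracketing idea can be sketched on sample text so that it runs on any platform. The helpers `parse_loadavg` and `loadavg_drift` are hypothetical names introduced for this example, and the two sample strings are invented; they follow the `/proc/loadavg` layout of three averages, a running/total task count, and the last PID.

```python
from typing import Optional, Tuple

LoadAvg = Tuple[float, float, float]


def parse_loadavg(text: str) -> Optional[LoadAvg]:
    """Parse the first three fields of /proc/loadavg-style text."""
    parts = text.split()
    try:
        return (float(parts[0]), float(parts[1]), float(parts[2]))
    except (ValueError, IndexError):
        return None


def loadavg_drift(pre: Optional[LoadAvg], post: Optional[LoadAvg]) -> Optional[float]:
    """Change in the 1-minute average, or None if either snapshot is missing."""
    if pre is None or post is None:
        return None
    return post[0] - pre[0]


pre = parse_loadavg("0.50 0.40 0.30 1/512 4242")   # snapshot before the workload
post = parse_loadavg("2.50 0.90 0.45 3/520 4310")  # snapshot after the workload
print(loadavg_drift(pre, post))  # 2.0
print(loadavg_drift(pre, parse_loadavg("")))  # None
```

A large positive drift on the 1-minute average suggests the host became busier during the run, which is useful context when comparing timings across benchmark cells.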

`probe_engine_runtime_config(llm)` ¶

Best-effort introspection of the engine's effective runtime config.

Returns a flat dict of the load-bearing scheduler/cache/speculative settings drawn from :data:_PROBE_FIELDS. Empty dict when the engine config can't be reached — this is observability, not a correctness gate.

Degrades at field granularity: a malformed individual attribute skips that one field rather than emptying the whole result.

The parameter is typed object (not LLM) on purpose: the probe is pure defensive getattr introspection and degrades on any shape, so it does not require, and must not claim to require, the concrete engine type.

Source code in src/nemo_safe_synthesizer/generation/vllm_observability.py

```python
def probe_engine_runtime_config(llm: object) -> dict[str, Any]:
    """Best-effort introspection of the engine's effective runtime config.

    Returns a flat dict of the load-bearing scheduler/cache/speculative
    settings drawn from :data:`_PROBE_FIELDS`. Empty dict when the engine
    config can't be reached -- this is observability, not a correctness gate.

    Degrades at field granularity: a malformed individual attribute skips
    that one field rather than emptying the whole result.

    Typed ``object`` (not ``LLM``) on purpose: the probe is pure defensive
    ``getattr`` introspection and degrades on any shape, so it does not
    require -- and must not claim to require -- the concrete engine type.
    """
    try:
        vcfg = _engine_vllm_config(llm)
    except Exception:  # noqa: BLE001 -- degraded mode by design
        logger.debug("engine-probe: vllm_config unreachable; returning empty probe", exc_info=True)
        return {}
    if vcfg is None:
        return {}
    out: dict[str, Any] = {}
    for spec in _PROBE_FIELDS:
        try:
            section = getattr(vcfg, spec.section, None)
            if section is None:
                continue
            value = _first_present(section, spec.sources)
            if value is not None:
                out[spec.out_key] = spec.transform(value)
        except Exception:  # noqa: BLE001 -- degrade one field, not the whole probe
            logger.debug("engine-probe: field %r failed; skipping", spec.out_key, exc_info=True)
            continue
    return out
```
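The listing relies on `_PROBE_FIELDS` and `_first_present`, neither of which is shown. The sketch below demonstrates the field-granular degradation with hypothetical stand-ins: the `ProbeField` dataclass, the `FIELDS` table, and the section and attribute names are assumptions chosen for illustration, and the fake config is a `SimpleNamespace`, not a real engine object.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Callable


@dataclass(frozen=True)
class ProbeField:  # hypothetical stand-in for one _PROBE_FIELDS entry
    section: str
    sources: tuple[str, ...]
    out_key: str
    transform: Callable[[Any], Any]


def first_present(section: Any, names: tuple[str, ...]) -> Any:
    """Return the first non-None attribute among ``names``, else None."""
    for name in names:
        value = getattr(section, name, None)
        if value is not None:
            return value
    return None


FIELDS = (
    ProbeField("cache_config", ("enable_prefix_caching",), "enable_prefix_caching", bool),
    ProbeField("scheduler_config", ("max_num_seqs",), "max_num_seqs", int),
    ProbeField("cache_config", ("cache_dtype",), "kv_cache_dtype", str),
)


def probe(vcfg: Any) -> dict[str, Any]:
    out: dict[str, Any] = {}
    for spec in FIELDS:
        try:
            section = getattr(vcfg, spec.section, None)
            if section is None:
                continue
            value = first_present(section, spec.sources)
            if value is not None:
                out[spec.out_key] = spec.transform(value)
        except Exception:
            continue  # degrade this one field, keep the rest
    return out


fake_config = SimpleNamespace(
    cache_config=SimpleNamespace(enable_prefix_caching=True, cache_dtype="auto"),
    scheduler_config=SimpleNamespace(max_num_seqs="not-a-number"),  # malformed on purpose
)
result = probe(fake_config)
print(result)  # {'enable_prefix_caching': True, 'kv_cache_dtype': 'auto'}
```

The malformed `max_num_seqs` makes `int(...)` raise, so only that key is missing from the result while the two well-formed fields survive.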

`flag_engagement_mismatches(intended, actual, checked_fields=ENGINE_CONFIG_CHECKED_FIELDS)` ¶

Return human-readable mismatch descriptions; empty list means clean engagement.

Only checks fields the caller explicitly set in intended (i.e., fields whose value is not None); a None on the intended side means "use engine default" so there's no reference value to compare against. Fields missing from actual are skipped — the probe is best-effort and may not expose every flag.

The dict-vs-dict shape (rather than a typed pydantic model) is deliberate so this helper works regardless of whether the caller has a VllmEngineParameters instance or just raw vLLM kwargs.

Source code in src/nemo_safe_synthesizer/generation/vllm_observability.py

```python
def flag_engagement_mismatches(
    intended: dict[str, Any],
    actual: dict[str, Any],
    checked_fields: tuple[str, ...] = ENGINE_CONFIG_CHECKED_FIELDS,
) -> list[str]:
    """Return human-readable mismatch descriptions; empty list means clean engagement.

    Only checks fields the caller explicitly set in ``intended`` (i.e.,
    fields whose value is not ``None``); a ``None`` on the intended side
    means "use engine default" so there's no reference value to compare
    against. Fields missing from ``actual`` are skipped -- the probe is
    best-effort and may not expose every flag.

    The dict-vs-dict shape (rather than a typed pydantic model) is
    deliberate so this helper works regardless of whether the caller
    has a ``VllmEngineParameters`` instance or just raw vLLM kwargs.
    """
    mismatches: list[str] = []
    for field in checked_fields:
        intended_val = intended.get(field)
        if intended_val is None:
            continue
        actual_val = actual.get(field)
        if actual_val is None:
            continue
        if intended_val != actual_val:
            mismatches.append(f"{field}: intended={intended_val!r} actual={actual_val!r}")
    return mismatches
```
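A short usage sketch follows, with the function body copied from the listing above so the example runs on its own. The `CHECKED_FIELDS` tuple is a hypothetical substitute for `ENGINE_CONFIG_CHECKED_FIELDS`, whose contents are not shown here, and deriving `flag_did_not_engage` as `bool(mismatches)` is an assumption about how a caller might use the result.

```python
from typing import Any

# Hypothetical stand-in for ENGINE_CONFIG_CHECKED_FIELDS.
CHECKED_FIELDS = ("enable_prefix_caching", "max_num_seqs", "kv_cache_dtype")


def flag_engagement_mismatches(
    intended: dict[str, Any],
    actual: dict[str, Any],
    checked_fields: tuple[str, ...] = CHECKED_FIELDS,
) -> list[str]:
    mismatches: list[str] = []
    for field in checked_fields:
        intended_val = intended.get(field)
        if intended_val is None:
            continue  # caller asked for the engine default
        actual_val = actual.get(field)
        if actual_val is None:
            continue  # probe could not see this field
        if intended_val != actual_val:
            mismatches.append(f"{field}: intended={intended_val!r} actual={actual_val!r}")
    return mismatches


intended = {"enable_prefix_caching": False, "max_num_seqs": 64, "kv_cache_dtype": None}
actual = {"enable_prefix_caching": True, "max_num_seqs": 64, "kv_cache_dtype": "fp8"}

mismatches = flag_engagement_mismatches(intended, actual)
flag_did_not_engage = bool(mismatches)  # assumed derivation of the event flag
print(mismatches)  # ['enable_prefix_caching: intended=False actual=True']
print(flag_did_not_engage)  # True
```

Only `enable_prefix_caching` is reported: `max_num_seqs` matches, and `kv_cache_dtype` is skipped because the intended value is `None`.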

`read_vllm_runtime_metrics(llm)` ¶

Snapshot llm.get_metrics() for known metrics; degraded-mode on failure.

Returns a :class:VllmRuntimeMetrics with stable keys regardless of which metrics the engine actually exposed — missing metrics map to None. Callers should treat None as "engine didn't surface this counter" and not crash on missing data.

Currently captures:

kv_cache_usage_perc: vLLM gauge, fraction (0..1) of used KV cache blocks at the moment of read.
prefix_cache_hit_rate: derived from vllm:prefix_cache_hits / vllm:prefix_cache_queries.
spec_accept_rate: derived from vllm:spec_decode_num_accepted_tokens / vllm:spec_decode_num_draft_tokens. None when speculative decoding wasn't enabled (the counters are registered at runtime by the spec-decode subsystem and are absent otherwise), which distinguishes "not measured" from "measured zero".
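The two derived rates both reduce to a guarded division. The helper `_safe_ratio` used in the source is not listed on this page, so the following is a sketch of the behavior the descriptions imply, under the assumption that a missing counter or a zero denominator yields `None`; the name `safe_ratio` and the sample numbers are illustrative.

```python
from typing import Optional


def safe_ratio(numerator: Optional[float], denominator: Optional[float]) -> Optional[float]:
    """Return numerator / denominator, or None when the ratio is undefined."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


print(safe_ratio(30.0, 40.0))  # 0.75 -> measured rate
print(safe_ratio(0.0, 40.0))   # 0.0  -> measured zero
print(safe_ratio(None, 40.0))  # None -> counter absent
print(safe_ratio(5.0, 0.0))    # None -> nothing to divide by
```

Keeping `0.0` and `None` distinct is what lets downstream consumers tell "speculative decoding ran and accepted nothing" apart from "speculative decoding was not enabled".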

Source code in src/nemo_safe_synthesizer/generation/vllm_observability.py

```python
def read_vllm_runtime_metrics(llm: LLM | None) -> VllmRuntimeMetrics:
    """Snapshot ``llm.get_metrics()`` for known metrics; degraded-mode on failure.

    Returns a :class:`VllmRuntimeMetrics` with stable keys regardless of
    which metrics the engine actually exposed -- missing metrics map to
    ``None``. Callers should treat ``None`` as "engine didn't surface this
    counter" and not crash on missing data.

    Currently captures:

    - ``kv_cache_usage_perc`` -- vLLM gauge, fraction (0..1) of used KV
      cache blocks at the moment of read.
    - ``prefix_cache_hit_rate`` -- derived from
      ``vllm:prefix_cache_hits / vllm:prefix_cache_queries``.
    - ``spec_accept_rate`` -- derived from
      ``vllm:spec_decode_num_accepted_tokens / num_draft_tokens``.
      ``None`` when speculative decoding wasn't enabled (counters
      registered at runtime by the spec-decode subsystem; absent
      otherwise) -- distinguishes "not measured" from "measured zero".
    """
    if llm is None:
        return _empty_runtime_metrics()

    # Best-effort probe: the engine call or a non-numeric metric value is the
    # only thing that can fail, so the guard wraps exactly that. The ratio
    # derivation below is pure dict arithmetic and cannot raise.
    try:
        raw = _collect_raw_metrics(llm)
    except Exception:  # noqa: BLE001 -- degraded mode by design
        logger.debug("read_vllm_runtime_metrics failed", exc_info=True)
        return _empty_runtime_metrics()

    return VllmRuntimeMetrics(
        kv_cache_usage_perc=raw.get(METRIC_KV_CACHE_USAGE_PERC),
        prefix_cache_hit_rate=_safe_ratio(raw.get(METRIC_PREFIX_CACHE_HITS), raw.get(METRIC_PREFIX_CACHE_QUERIES)),
        spec_accept_rate=_safe_ratio(raw.get(METRIC_SPEC_NUM_ACCEPTED_TOKENS), raw.get(METRIC_SPEC_NUM_DRAFT_TOKENS)),
    )
```
