vllm_observability
vLLM observability primitives for production + benchmark.
Schema-frozen generation-observability events emitted by VllmBackend.generate()
and consumed by downstream observability surfaces (structured logs, wandb,
the benchmark harness's per-cell aggregator).
Four primitives, all degraded-mode by design:
- :class:NvmlPeakSampler: daemon-thread peak device-VRAM tracker via pynvml. Reads at the driver layer, so it sees vLLM worker-subprocess allocations regardless of which process holds the torch handle. This sidesteps the VLLM_ENABLE_V1_MULTIPROCESSING=1 blind spot where torch.cuda.max_memory_allocated() in the harness reads 0.
- :func:read_loadavg: /proc/loadavg snapshot as a (1m, 5m, 15m) triple. None on non-Linux or read failure.
- :func:probe_engine_runtime_config: best-effort introspection of the engine's effective scheduler/cache/speculative settings via llm.llm_engine.vllm_config. Empty dict on any failure.
- :func:read_vllm_runtime_metrics: one-shot snapshot of llm.get_metrics() for KV-cache usage, prefix-cache hit rate, and speculative-decoding acceptance rate. Returns a dict with stable keys regardless of which metrics the engine actually exposed.
Plus the :class:GenerationObservability pydantic model, the schema for the
generation-complete structured event emitted at the end of each
generation invocation. Forward-compatible: every existing field defaults
to None or empty, so new optional fields can be added without breaking
existing consumers. The model uses extra="forbid", so a producer that
starts emitting a new field is forced to add it to the schema first.
References (design):
- HuggingFace train_memory blog ("Visualize and understand GPU memory
in PyTorch") covers in-process torch profiling, which is the gap NVML
fills here (out-of-process VRAM visibility):
https://huggingface.co/blog/train_memory
- spark-dashboard (Rust) demonstrates the NVML + /metrics pattern at
~1s polling cadence — the precedent for combining NVML's driver-level
reading with vLLM's Prometheus surface:
https://github.com/niklasfrick/spark-dashboard
- vLLM's metrics design doc enumerates the gauges/counters
:func:read_vllm_runtime_metrics reads:
https://github.com/vllm-project/vllm/blob/main/docs/design/metrics.md
Classes:
| Name | Description |
|---|---|
| NvmlPeakSampler | Daemon-thread sampler tracking peak device VRAM via NVML. |
| GenerationObservability | One generation-complete event payload. |
| VllmRuntimeMetrics | End-of-generation vLLM metric snapshot with a fixed key set. |
Functions:
| Name | Description |
|---|---|
| read_loadavg | Return /proc/loadavg as a (1m, 5m, 15m) triple; None when unavailable. |
| probe_engine_runtime_config | Best-effort introspection of the engine's effective runtime config. |
| flag_engagement_mismatches | Return human-readable mismatch descriptions; empty list means clean engagement. |
| read_vllm_runtime_metrics | Snapshot llm.get_metrics() for known metrics; degraded-mode on failure. |
NvmlPeakSampler(device_index=None, interval_seconds=0.25)
Daemon-thread sampler tracking peak device VRAM via NVML.
Use as a context manager wrapping the work whose peak VRAM you want::

    with NvmlPeakSampler() as vram:
        ...  # build engine / run training / generate
    peak_gb = vram.peak_gb  # float | None

Returns None from :attr:peak_gb when NVML isn't available (driver
missing, pynvml import failed, device index invalid). Reads at the driver
layer, so it sees allocations made by worker subprocesses regardless of
which process holds the torch handle. Reports device-wide VRAM -- on a
dedicated host that equals the workload's allocation; on a shared GPU it
includes other process allocations.
device_index defaults to the first CUDA_VISIBLE_DEVICES entry (see
:func:_default_nvml_device_index) so the sampler follows the workload's
GPU on multi-GPU hosts instead of always reading physical GPU 0. Pass an
explicit index to override.
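
The sampling pattern described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the class name PeakVramSampler and the helper default_device_index are hypothetical, and the rule "use the first CUDA_VISIBLE_DEVICES entry only when it is a plain integer, else 0" is an assumption about how the default index is resolved. Only pynvml calls that exist (nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo, nvmlShutdown) are used, and every failure path leaves peak_gb at None.

```python
import os
import threading
from typing import Optional

try:
    import pynvml  # optional dependency; absence triggers degraded mode
except Exception:
    pynvml = None


def default_device_index() -> int:
    """First CUDA_VISIBLE_DEVICES entry if it is an integer, else 0 (assumption)."""
    first = os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")[0].strip()
    return int(first) if first.isdigit() else 0


class PeakVramSampler:
    """Hypothetical stand-in for NvmlPeakSampler, for illustration only."""

    def __init__(self, device_index: Optional[int] = None,
                 interval_seconds: float = 0.25) -> None:
        self._index = default_device_index() if device_index is None else device_index
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None
        self._handle = None
        self._initialized = False
        self._peak_bytes: Optional[int] = None

    def _sample_once(self) -> None:
        # Device-wide reading: includes every process using this GPU.
        used = pynvml.nvmlDeviceGetMemoryInfo(self._handle).used
        if self._peak_bytes is None or used > self._peak_bytes:
            self._peak_bytes = used

    def _run(self) -> None:
        while not self._stop.is_set():
            try:
                self._sample_once()
            except Exception:
                return  # stop quietly; keep whatever peak was already seen
            self._stop.wait(self._interval)

    def __enter__(self) -> "PeakVramSampler":
        if pynvml is None:
            return self  # degraded mode: peak_gb stays None
        try:
            pynvml.nvmlInit()
            self._initialized = True
            self._handle = pynvml.nvmlDeviceGetHandleByIndex(self._index)
            self._sample_once()
        except Exception:
            return self  # driver missing or invalid index
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, *exc_info) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join(timeout=1.0)
        if self._initialized:
            try:
                pynvml.nvmlShutdown()
            except Exception:
                pass

    @property
    def peak_gb(self) -> Optional[float]:
        """Peak device-wide VRAM in GiB, or None if nothing was sampled."""
        return None if self._peak_bytes is None else self._peak_bytes / 2**30


with PeakVramSampler(interval_seconds=0.01) as vram:
    pass  # the workload would run here
print(vram.peak_gb)
```

Using a threading.Event for both the stop signal and the sleep lets the context manager exit promptly instead of waiting out a full polling interval.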
Attributes:
| Name | Type | Description |
|---|---|---|
| peak_gb | float \| None | Peak device-wide VRAM (GiB) observed during sampling; None if NVML unavailable. |
Source code in src/nemo_safe_synthesizer/observability.py
peak_gb
property
Peak device-wide VRAM (GiB) observed during sampling; None if NVML unavailable.
GenerationObservability
pydantic-model
Bases: BaseModel
One generation-complete event payload.
Emitted by VllmBackend.generate() at end of each generation
invocation. Consumed by:
- Structured log routing (default; flows through logger.runtime.info(...) like the rest of PR-1's trace telemetry).
- Wandb (when a run is active): logged to the current wandb run.
- The benchmark harness's per-cell aggregator (composes this into its richer CandidateMetrics schema).
Every measurement field is optional; producers should populate what
they can capture and leave the rest at the default. Wandb drops
None values silently, which is the right behavior for "this
metric wasn't reachable on this generation".
Config:
- extra: forbid
Fields:
- peak_vram_gb (float | None)
- kv_cache_usage_perc (float | None)
- prefix_cache_hit_rate (float | None)
- spec_accept_rate (float | None)
- loadavg_pre (tuple[float, float, float] | None)
- loadavg_post (tuple[float, float, float] | None)
- engine_runtime_config (dict[str, Any])
- flag_did_not_engage (bool)
peak_vram_gb = None
pydantic-field
Peak device-wide VRAM usage in GiB, sampled by NVML (pynvml.nvmlDeviceGetMemoryInfo) across the whole generation. None when NVML is unavailable. Device-wide reading; on a shared GPU it includes other processes.
kv_cache_usage_perc = None
pydantic-field
vLLM's vllm:kv_cache_usage_perc gauge (fraction 0..1 of KV cache blocks in use) at end of generation. None when the engine doesn't expose the gauge or the call failed. Approximates peak; vLLM only publishes the instantaneous value, not a max-over-time.
prefix_cache_hit_rate = None
pydantic-field
Derived from vllm:prefix_cache_hits / vllm:prefix_cache_queries at end of generation. None when either counter is absent or queries==0. Surfaces whether shared schema prefixes actually amortized across the batch.
spec_accept_rate = None
pydantic-field
Derived from vllm:spec_decode_num_accepted_tokens / num_draft_tokens at end of generation. None when speculative decoding wasn't enabled on this generation (counters absent) or no drafts were proposed (denominator==0).
loadavg_pre = None
pydantic-field
Host /proc/loadavg snapshot captured at the start of this generation (1-min, 5-min, 15-min averages). None when /proc/loadavg is unavailable (non-Linux).
loadavg_post = None
pydantic-field
Host /proc/loadavg snapshot captured at the end of this generation. Drift from loadavg_pre signals load change during the generation.
engine_runtime_config
pydantic-field
Best-effort probe of the engine's effective runtime config (enable_prefix_caching, enable_chunked_prefill, max_num_seqs, max_num_batched_tokens, kv_cache_dtype, speculative_method when populated). Empty dict on probe failure.
flag_did_not_engage = False
pydantic-field
True when engine_runtime_config disagrees with the candidate/caller's intended setting on any checked field — an unsupported knob silently ignored, a default-on flag overriding an explicit-off intent, etc.
to_wandb_payload(prefix='vllm_gen')
Flatten this event into a wandb-friendly wandb.log(...) dict.
Wandb plots scalars cleanly but renders tuples/dicts as opaque blobs, so this method:
- Drops None values (wandb would drop them anyway; explicit here for documentation).
- Unpacks loadavg_pre / loadavg_post 3-tuples to per-duration scalars (loadavg_pre_1m / _5m / _15m).
- Flattens engine_runtime_config to engine_runtime/<key> scalars (mirrors the existing flattening pattern in the benchmark harness).
All keys are namespaced under prefix so production generation
events don't collide with other wandb metrics in the same run.
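
The three flattening rules above can be sketched on a plain dict. This is an illustrative sketch rather than the method's actual body: the function name flatten_for_wandb is hypothetical, it takes a dict instead of the pydantic model, and joining the prefix to each key with "/" is an assumption (the source only states that keys are namespaced under prefix).

```python
from typing import Any, Dict


def flatten_for_wandb(event: Dict[str, Any], prefix: str = "vllm_gen") -> Dict[str, Any]:
    """Flatten an event dict into scalar entries suitable for wandb.log(...)."""
    out: Dict[str, Any] = {}
    for key, value in event.items():
        if value is None:
            continue  # rule 1: drop None values explicitly
        if key in ("loadavg_pre", "loadavg_post"):
            # rule 2: unpack the 3-tuple into per-duration scalars
            for suffix, sample in zip(("1m", "5m", "15m"), value):
                out[f"{prefix}/{key}_{suffix}"] = sample
        elif key == "engine_runtime_config":
            # rule 3: one scalar per probed config key
            for cfg_key, cfg_value in value.items():
                out[f"{prefix}/engine_runtime/{cfg_key}"] = cfg_value
        else:
            out[f"{prefix}/{key}"] = value
    return out


payload = flatten_for_wandb({
    "peak_vram_gb": 12.5,
    "spec_accept_rate": None,  # speculative decoding not enabled
    "loadavg_pre": (1.0, 2.0, 3.0),
    "loadavg_post": None,
    "engine_runtime_config": {"max_num_seqs": 256},
    "flag_did_not_engage": False,
})
print(sorted(payload))
```

Note that False survives the None filter, so flag_did_not_engage is always logged even when engagement was clean.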
Source code in src/nemo_safe_synthesizer/generation/vllm_observability.py
VllmRuntimeMetrics
Bases: TypedDict
End-of-generation vLLM metric snapshot with a fixed key set.
Every value is float | None; None means the engine did not
surface that counter on this generation (distinct from a measured zero).
A TypedDict rather than a dataclass on purpose: the value stays a
plain dict at runtime, so dict-style consumers (e.g. the benchmark
harness) are unaffected, while callers gain static key checking and
float | None value typing instead of dict[str, float | None].
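
A small sketch of the TypedDict trade-off described above. The class name RuntimeMetricsSketch is hypothetical, and the three keys are assumed to match the metrics that read_vllm_runtime_metrics is documented to capture.

```python
from typing import Optional, TypedDict


class RuntimeMetricsSketch(TypedDict):
    """Illustrative fixed-key snapshot; every value is float or None."""
    kv_cache_usage_perc: Optional[float]
    prefix_cache_hit_rate: Optional[float]
    spec_accept_rate: Optional[float]


snapshot: RuntimeMetricsSketch = {
    "kv_cache_usage_perc": 0.42,
    "prefix_cache_hit_rate": None,  # counter not surfaced by the engine
    "spec_accept_rate": 0.0,        # measured zero, distinct from None
}

# At runtime the value is an ordinary dict, so dict-style consumers
# can iterate it or merge it without any adapter.
print(type(snapshot) is dict)
merged = {"cell": "a1", **snapshot}
```

A static type checker would reject a misspelled key such as snapshot["spec_acept_rate"], which a bare dict[str, float | None] annotation cannot catch.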
read_loadavg()
Return /proc/loadavg as a (1m, 5m, 15m) triple; None when unavailable.
Linux-only. Cheap (a single small file read). Safe to call from any process -- the read is host-scoped, not process-scoped. Designed to bracket a workload: the caller reads before and after, and the pair shows whether host load drifted during the run.
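
A minimal sketch of such a reader, assuming the standard /proc/loadavg layout in which the first three whitespace-separated fields are the 1, 5, and 15 minute averages. The function name read_loadavg_sketch and its path parameter are hypothetical, added here so the failure path can be exercised.

```python
from typing import Optional, Tuple


def read_loadavg_sketch(
    path: str = "/proc/loadavg",
) -> Optional[Tuple[float, float, float]]:
    """Return (1m, 5m, 15m) load averages, or None when unavailable."""
    try:
        with open(path, "r", encoding="ascii") as fh:
            fields = fh.read().split()
        return float(fields[0]), float(fields[1]), float(fields[2])
    except (OSError, ValueError, IndexError):
        # Non-Linux host, unreadable file, or unexpected content.
        return None


pre = read_loadavg_sketch()
# ... workload would run here ...
post = read_loadavg_sketch()
print(pre, post)
```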
Source code in src/nemo_safe_synthesizer/observability.py
probe_engine_runtime_config(llm)
Best-effort introspection of the engine's effective runtime config.
Returns a flat dict of the load-bearing scheduler/cache/speculative
settings drawn from :data:_PROBE_FIELDS. Empty dict when the engine
config can't be reached — this is observability, not a correctness gate.
Degrades at field granularity: a malformed individual attribute skips that one field rather than emptying the whole result.
The llm parameter is typed object (not LLM) on purpose: the probe is pure
defensive getattr introspection and degrades on any shape, so it does not
require (and must not claim to require) the concrete engine type.
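
The field-granular degradation described above might look like the sketch below. Everything here is illustrative: PROBE_FIELDS_SKETCH stands in for the real _PROBE_FIELDS, whose contents are not shown in this page, and the mapping of output keys to sub-config attributes (cache_config, scheduler_config, speculative_config) is an assumption about the engine config layout.

```python
from types import SimpleNamespace
from typing import Any, Dict, Tuple

# Hypothetical (output key, sub-config name, attribute name) triples.
PROBE_FIELDS_SKETCH: Tuple[Tuple[str, str, str], ...] = (
    ("enable_prefix_caching", "cache_config", "enable_prefix_caching"),
    ("kv_cache_dtype", "cache_config", "cache_dtype"),
    ("max_num_seqs", "scheduler_config", "max_num_seqs"),
    ("max_num_batched_tokens", "scheduler_config", "max_num_batched_tokens"),
    ("speculative_method", "speculative_config", "method"),
)

_MISSING = object()


def probe_config_sketch(llm: object) -> Dict[str, Any]:
    """Collect whichever fields are reachable; never raise."""
    try:
        vllm_config = llm.llm_engine.vllm_config
    except Exception:
        return {}  # engine config unreachable: empty dict, not an error
    result: Dict[str, Any] = {}
    for key, section_name, attr in PROBE_FIELDS_SKETCH:
        try:
            section = getattr(vllm_config, section_name, None)
            value = getattr(section, attr, _MISSING)
        except Exception:
            continue  # a misbehaving attribute skips only this field
        if value is not _MISSING:
            result[key] = value
    return result


# A stand-in object shaped like an engine, with speculative decoding off.
fake_llm = SimpleNamespace(llm_engine=SimpleNamespace(vllm_config=SimpleNamespace(
    cache_config=SimpleNamespace(enable_prefix_caching=True),
    scheduler_config=SimpleNamespace(max_num_seqs=64),
    speculative_config=None,
)))
print(probe_config_sketch(fake_llm))
```

The sentinel distinguishes "attribute absent" from "attribute present with value None", so a legitimately unset engine option is still reported.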
Source code in src/nemo_safe_synthesizer/generation/vllm_observability.py
flag_engagement_mismatches(intended, actual, checked_fields=ENGINE_CONFIG_CHECKED_FIELDS)
Return human-readable mismatch descriptions; empty list means clean engagement.
Only checks fields the caller explicitly set in intended (i.e.,
fields whose value is not None); a None on the intended side
means "use engine default" so there's no reference value to compare
against. Fields missing from actual are skipped — the probe is
best-effort and may not expose every flag.
The dict-vs-dict shape (rather than a typed pydantic model) is
deliberate so this helper works regardless of whether the caller
has a VllmEngineParameters instance or just raw vLLM kwargs.
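
The comparison rules above can be sketched as a small pure function. The names mismatches_sketch and CHECKED_FIELDS_SKETCH are hypothetical (the real ENGINE_CONFIG_CHECKED_FIELDS is not listed in this page), and the message format is an assumption.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical subset of fields to compare.
CHECKED_FIELDS_SKETCH = (
    "enable_prefix_caching",
    "enable_chunked_prefill",
    "max_num_seqs",
)


def mismatches_sketch(
    intended: Dict[str, Any],
    actual: Dict[str, Any],
    checked_fields: Iterable[str] = CHECKED_FIELDS_SKETCH,
) -> List[str]:
    """Describe each checked field where the engine disagrees with intent."""
    problems: List[str] = []
    for field in checked_fields:
        want = intended.get(field)
        if want is None:
            continue  # caller left it at the engine default: nothing to compare
        if field not in actual:
            continue  # probe did not expose this field
        got = actual[field]
        if got != want:
            problems.append(f"{field}: intended {want!r}, engine reports {got!r}")
    return problems


found = mismatches_sketch(
    intended={"enable_prefix_caching": False, "max_num_seqs": None,
              "enable_chunked_prefill": True},
    actual={"enable_prefix_caching": True, "max_num_seqs": 256},
)
flag_did_not_engage = bool(found)
print(found)
```

Here only enable_prefix_caching is reported: max_num_seqs was not set by the caller, and enable_chunked_prefill was not exposed by the probe.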
Source code in src/nemo_safe_synthesizer/generation/vllm_observability.py
read_vllm_runtime_metrics(llm)
Snapshot llm.get_metrics() for known metrics; degraded-mode on failure.
Returns a :class:VllmRuntimeMetrics with stable keys regardless of
which metrics the engine actually exposed — missing metrics map to
None. Callers should treat None as "engine didn't surface this
counter" and not crash on missing data.
Currently captures:
- kv_cache_usage_perc: vLLM gauge, fraction (0..1) of used KV cache blocks at the moment of read.
- prefix_cache_hit_rate: derived from vllm:prefix_cache_hits / vllm:prefix_cache_queries.
- spec_accept_rate: derived from vllm:spec_decode_num_accepted_tokens / num_draft_tokens. None when speculative decoding wasn't enabled (counters registered at runtime by the spec-decode subsystem; absent otherwise), which distinguishes "not measured" from "measured zero".
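
A rough sketch of how the three values could be derived from a metrics snapshot. It assumes llm.get_metrics() yields objects with name and value attributes, and that the draft-token counter is named vllm:spec_decode_num_draft_tokens (the text above abbreviates it). The function names are hypothetical, and when a metric name appears more than once this sketch simply keeps the last value.

```python
from types import SimpleNamespace
from typing import Dict, Optional


def safe_ratio(numerator: Optional[float],
               denominator: Optional[float]) -> Optional[float]:
    """None when either counter is absent or the denominator is zero."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def snapshot_metrics_sketch(llm: object) -> Dict[str, Optional[float]]:
    """Always return the same three keys; missing metrics map to None."""
    values: Dict[str, float] = {}
    try:
        for metric in llm.get_metrics():
            name = getattr(metric, "name", None)
            value = getattr(metric, "value", None)
            if isinstance(name, str) and isinstance(value, (int, float)):
                values[name] = float(value)
    except Exception:
        values = {}  # degraded mode: every key below becomes None
    return {
        "kv_cache_usage_perc": values.get("vllm:kv_cache_usage_perc"),
        "prefix_cache_hit_rate": safe_ratio(
            values.get("vllm:prefix_cache_hits"),
            values.get("vllm:prefix_cache_queries"),
        ),
        "spec_accept_rate": safe_ratio(
            values.get("vllm:spec_decode_num_accepted_tokens"),
            values.get("vllm:spec_decode_num_draft_tokens"),  # assumed name
        ),
    }


# Stand-in engine exposing only KV-cache and prefix-cache metrics.
fake_llm = SimpleNamespace(get_metrics=lambda: [
    SimpleNamespace(name="vllm:kv_cache_usage_perc", value=0.25),
    SimpleNamespace(name="vllm:prefix_cache_hits", value=30),
    SimpleNamespace(name="vllm:prefix_cache_queries", value=40),
])
print(snapshot_metrics_sketch(fake_llm))
```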