environment
Environment-stage checks: GPU, VRAM, tokens, log settings.
Classes:
| Name | Description |
|---|---|
| CUDAAvailabilityCheck | Validate CUDA GPU availability. |
| VRAMComponentEstimate | Per-device training VRAM components for gpu.vram preflight. |
| VRAMHeadroomCheck | Estimate whether GPU VRAM is sufficient for training. |
| InferenceModelCheck | Validate the inference configuration used for PII column classification. |
| HFModelAvailabilityCheck | Validate local model, HF cache, and online HF access readiness. |
Functions:
| Name | Description |
|---|---|
| param_count_from_empty_model | Count parameters by instantiating the model on the meta device. |
| estimate_params_from_shape | Shape-only fallback param count used when meta-tensor construction fails. |
| estimate_base_model_params | Return (n_params, method) for the base model, or None if unknown. |
| bytes_per_base_weight | Return expected bytes/param for the base model load mode. |
| estimate_training_vram_components | Compose base weights, overhead, and optional activation estimate (GiB). |
CUDAAvailabilityCheck
VRAMComponentEstimate(base_weights_gib, overhead_gib, activation_gib, total_gib)
dataclass
Per-device training VRAM components for gpu.vram preflight.
VRAMHeadroomCheck
Bases: MetadataCheck
Estimate whether GPU VRAM is sufficient for training.
The estimate is an intentionally conservative heuristic, not a worst-case-accurate bound.
Parameter counts come from estimate_base_model_params via meta tensors
when possible.
Activation memory uses estimate_training_vram_components when
metadata.max_seq_length and transformer shape fields resolve to
positive integers; missing inputs leave activations unspecified and revert
to a legacy lumped overhead. Per-device VRAM compares against
get_max_vram(max_vram_fraction=training.max_vram_fraction) headroom.
LoRA adapters, the full optimizer footprint, \(O(B S^2)\) attention buffers, and quantization workspace are only partially covered, through the residual overhead term. Passing does not guarantee a fit; failing is a strong signal of OOM risk.
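The headroom comparison described above can be sketched as follows. This is a minimal illustration, not the NSS implementation: the helper name `vram_headroom_ok` is hypothetical, and it assumes the budget is simply the device's total VRAM scaled by `max_vram_fraction`, which may differ from what get_max_vram actually returns.

```python
def vram_headroom_ok(
    estimated_total_gib: float,
    device_vram_gib: float,
    max_vram_fraction: float,
) -> bool:
    """Return True when the estimate fits inside the usable VRAM budget.

    Hypothetical helper: assumes usable VRAM = device VRAM * max_vram_fraction.
    """
    if not 0.0 < max_vram_fraction <= 1.0:
        raise ValueError("max_vram_fraction must be in (0, 1]")
    budget_gib = device_vram_gib * max_vram_fraction
    return estimated_total_gib <= budget_gib


if __name__ == "__main__":
    # A 24 GiB card with an 80% cap leaves a budget of about 19.2 GiB.
    print(vram_headroom_ok(18.0, 24.0, 0.8))
    print(vram_headroom_ok(20.0, 24.0, 0.8))
```

A `True` result here only means the heuristic found headroom; as noted above, it does not guarantee the run fits.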
References
- EleutherAI, "Transformer Math 101". https://blog.eleuther.ai/transformer-math/
InferenceModelCheck
Bases: ConfigCheck
Validate the inference configuration used for PII column classification.
When classification is enabled, the runtime calls an OpenAI-compatible
inference endpoint configured by NSS_INFERENCE_KEY,
NSS_INFERENCE_MODEL, and NSS_INFERENCE_ENDPOINT (set directly or via
the matching CLI flags, which are propagated to the environment before
preflight runs). This check reads those env vars -- not config -- because
the inference settings live in CLISettings/the environment rather than in
SafeSynthesizerParameters.
The body uses a single-dispatch match over (model, key, endpoint),
so at most one finding is emitted per run -- the highest-priority problem.
Priority order: invalid endpoint, then missing key, then blank model id. The
invalid endpoint is an error (a non-http(s) endpoint cannot succeed, so the
run must not pass --validate); the key and model findings are warnings
(classification degrades or falls back rather than failing the run). The
error is checked first so a lower-severity warning never masks it.
HFModelAvailabilityCheck
param_count_from_empty_model(autoconfig)
Count parameters by instantiating the model on the meta device.
accelerate.init_empty_weights constructs the full nn.Module graph
with every parameter on torch.device("meta") -- no storage is
allocated and no weights are downloaded. AutoModelForCausalLM.from_config
consults the transformers model-class registry to pick the right
architecture (handling Nemotron's non-gated MLP, MoE experts, biases,
tied embeddings, and any future variant automatically).
Returns None if accelerate/transformers are missing, the config
doesn't map to a registered architecture (e.g. trust_remote_code
custom archs), or instantiation fails for any other reason. The caller
should fall back to
estimate_params_from_shape.
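A minimal sketch of the meta-device technique, assuming accelerate and transformers are installed. The function name `count_params_on_meta` is hypothetical and the error handling is deliberately broad to mirror the "return None on any failure" behaviour described above; the real function may differ in detail.

```python
from typing import Any, Optional


def count_params_on_meta(config: Any) -> Optional[int]:
    """Return the parameter count for `config`, or None if it cannot be built.

    Hypothetical helper illustrating the technique, not the NSS function.
    """
    try:
        # Imported lazily so a missing dependency yields None, not a crash.
        from accelerate import init_empty_weights
        from transformers import AutoModelForCausalLM

        # Parameters are created on the meta device: shapes only, no storage.
        with init_empty_weights():
            model = AutoModelForCausalLM.from_config(config)
    except Exception:
        # Missing packages, unregistered architectures, or any build failure.
        return None
    return sum(p.numel() for p in model.parameters())
```

The `config` argument would typically be a transformers config object; anything the model-class registry does not recognise yields `None`, at which point a caller can try the shape-based fallback.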
References
- HuggingFace accelerate, "Big Model Inference" -- https://huggingface.co/docs/accelerate/concept_guides/big_model_inference
- HuggingFace accelerate, "Model memory estimator" -- same meta-device technique exposed as accelerate estimate-memory; reported accurate to within a few percent of real CUDA load. https://huggingface.co/docs/accelerate/usage_guides/model_size_estimator
- PyTorch meta device -- https://docs.pytorch.org/docs/stable/meta.html
Source code in src/nemo_safe_synthesizer/preflight/checks/environment.py
estimate_params_from_shape(autoconfig)
Shape-only fallback param count used when meta-tensor construction fails.
Models a decoder-only transformer with grouped-query attention (which
degrades to multi-head when num_key_value_heads == num_attention_heads)
and a gated SwiGLU/GeGLU MLP -- the shape NSS sees on its supported
model families (Llama, Qwen, Mistral, SmolLM, Granite, TinyLlama). For
non-gated variants (e.g. Nemotron's squared-ReLU MLP) this over-counts
MLP params by 50%, which is why the meta-tensor path is preferred.
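With hidden size \(H\), intermediate size \(I\), \(L\) layers, vocabulary \(V\), \(n_\text{kv}\) KV heads and per-head dim \(d\), the per-layer cost is
\[ P_\text{layer} = 2H^2 + 2 H n_\text{kv} d + 3 H I \]
(full Q/O projections; K/V shrunk by GQA; gate/up/down for SwiGLU). Total parameters: \(L\) times the per-layer cost, plus the vocabulary embedding terms, which scale as \(V H\).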
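The shape formula can be made concrete with a short sketch. The per-layer terms follow the description above (two \(H \times H\) projections for Q/O, two \(H \times n_\text{kv} d\) projections for K/V, three \(H \times I\) projections for the gated MLP). Everything else is an assumption for illustration: the function names are hypothetical, norms and biases are ignored, Q/O are assumed to be square (heads times head dim equals \(H\)), and the number of \(V \times H\) embedding matrices is exposed as a parameter because the text does not say whether the output head is counted separately.

```python
def per_layer_params(hidden: int, intermediate: int, n_kv_heads: int, head_dim: int) -> int:
    """Per-layer count for a GQA + gated-MLP decoder block (norms/biases ignored)."""
    attn_qo = 2 * hidden * hidden                 # full Q and O projections
    attn_kv = 2 * hidden * n_kv_heads * head_dim  # K and V, shrunk by GQA
    mlp = 3 * hidden * intermediate               # gate, up, down
    return attn_qo + attn_kv + mlp


def shape_param_estimate(
    *,
    hidden: int,
    intermediate: int,
    layers: int,
    vocab: int,
    n_kv_heads: int,
    head_dim: int,
    embedding_matrices: int = 1,  # assumption: 1 if tied, 2 if input/output are separate
) -> int:
    """Hypothetical total: layers * per-layer cost + embedding matrices."""
    block = per_layer_params(hidden, intermediate, n_kv_heads, head_dim)
    return layers * block + embedding_matrices * vocab * hidden


if __name__ == "__main__":
    # Illustrative TinyLlama-like shape; values chosen for the example only.
    total = shape_param_estimate(
        hidden=2048, intermediate=5632, layers=22, vocab=32000,
        n_kv_heads=4, head_dim=64, embedding_matrices=2,
    )
    print(f"{total / 1e9:.2f}B parameters")
```

The over-count caveat follows directly from the MLP term: a non-gated MLP has two projections (\(2HI\)) where this formula charges three (\(3HI\)), a factor of 1.5 on the MLP share.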
References
- Ainslie, J. et al. "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (2023) -- reduced K/V projection shape. https://arxiv.org/abs/2305.13245
- So, D. R. et al. "Primer: Searching for Efficient Transformers for Language Modeling" (2021) -- squared-ReLU MLP (Nemotron family), 2 projections; motivates the 50% over-count caveat above. https://arxiv.org/abs/2109.08668
estimate_base_model_params(autoconfig)
Return (n_params, method) for the base model, or None if unknown.
method == "exact" means the meta-tensor path succeeded and the count
is architecture-accurate. method == "approximate" means the shape
formula was used as a fallback (see
estimate_params_from_shape
for its known error modes) and the caller should flag the downstream VRAM
estimate as heuristic. Benchmarked fallback error on supported
architectures: \(-22\%\) to \(+33\%\); hybrid Mamba-Transformer models
(e.g. Nemotron-H) can drift further.
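The exact-then-approximate dispatch can be sketched as below. The function `estimate_params` and its callable parameters are hypothetical stand-ins so the example stays self-contained; it also assumes the shape fallback can itself return None, which the text implies ("or None if unknown") but does not spell out.

```python
from typing import Any, Callable, Optional, Tuple

ParamCounter = Callable[[Any], Optional[int]]


def estimate_params(
    config: Any,
    exact_fn: ParamCounter,
    shape_fn: ParamCounter,
) -> Optional[Tuple[int, str]]:
    """Return (n_params, method), preferring the exact meta-tensor count."""
    exact = exact_fn(config)
    if exact is not None:
        return exact, "exact"
    approx = shape_fn(config)
    if approx is not None:
        return approx, "approximate"
    return None


if __name__ == "__main__":
    # Stand-in counters: the exact path fails, the shape path succeeds.
    result = estimate_params({}, lambda c: None, lambda c: 1_100_000_000)
    if result is not None:
        n_params, method = result
        heuristic = method == "approximate"  # caller flags the VRAM estimate
        print(n_params, method, heuristic)
```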
bytes_per_base_weight(training_cfg)
Return expected bytes/param for the base model load mode.
NSS always trains via LoRA-style adapters, so the base model's storage precision dominates VRAM (LoRA adapter params, gradients, and optimizer state are comparatively negligible).
Runtime quantization is controlled by training.quantize_model. The
PEFT type string alone is not enough: with quantize_model=False the
base weights are loaded as bf16 even when peft_implementation is
configured as "QLORA".
- Quantized load: \(\text{bits}/8 + 0.1\) to cover quant state (absmax / block scales) and dequant workspace. Yields \(\approx 0.6\) for 4-bit, \(\approx 1.1\) for 8-bit.
- Unquantized load: \(2\) bytes (bf16/fp16 base weights).
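The two load modes above reduce to a few lines of arithmetic. In this sketch the function names and the `quant_bits` parameter are hypothetical (the text names only training.quantize_model); the \(+0.1\) term and the 2-byte unquantized case come from the list above.

```python
GIB = 1024 ** 3


def bytes_per_param(quantize_model: bool, quant_bits: int = 4) -> float:
    """Expected bytes per base-model parameter for the given load mode."""
    if quantize_model:
        # bits/8 for the packed weights, +0.1 for quant state and dequant workspace.
        return quant_bits / 8 + 0.1
    # bf16/fp16 base weights, regardless of the configured PEFT type string.
    return 2.0


def base_weights_gib(n_params: int, bytes_each: float) -> float:
    """Base-weight footprint in GiB (N * b, converted from bytes)."""
    return n_params * bytes_each / GIB


if __name__ == "__main__":
    for quantized, bits in [(True, 4), (True, 8), (False, 4)]:
        b = bytes_per_param(quantized, bits)
        print(quantized, bits, round(b, 2), round(base_weights_gib(8_000_000_000, b), 2))
```

Note that the last row ignores `quant_bits` entirely: with quantization off the weights are charged at 2 bytes each.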
References
- Hu, E. J. et al. "LoRA: Low-Rank Adaptation of Large Language Models" (2021) -- base weights frozen; adapter + gradients + optimizer state are small relative to \(N b\). https://arxiv.org/abs/2106.09685
- Dettmers, T. et al. "QLoRA: Efficient Finetuning of Quantized LLMs" (2023) -- 4-bit NF4 quantization with block-wise absmax scales; the \(+0.1\) term accounts for these scales and the dequant workspace. https://arxiv.org/abs/2305.14314
activation_memory_gib(*, batch_size, seq_len, hidden_size, num_hidden_layers, bytes_per_activation_element=2.0)
Rough activation VRAM on one device given micro-batch geometry.
Uses training.batch_size (HF per_device_train_batch_size), not
gradient_accumulation_steps. Matches bf16-ish training tensors at
2 bytes/element. The estimate omits attention \(O(B S^2)\) blocks and recomputation specifics; the goal is order-of-magnitude headroom against absurd batch_size values.
References
- Korthikanti, V. et al. "Reducing Activation Recomputation in Large Transformer Models" (2022) -- recomputation vs stored activations. https://arxiv.org/abs/2205.05198
estimate_training_vram_components(*, n_params, training_cfg, batch_size, seq_len, hidden_size, num_hidden_layers, bytes_per_activation_element=2.0)
Compose base weights, overhead, and optional activation estimate (GiB).
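How the components might combine can be sketched as follows. This is a guess at the shape of the composition, not the actual implementation: `VRAMComponents` and `compose_vram` are hypothetical names, the overhead is taken as an input because its formula is not given here, and an unspecified activation estimate is modelled as None and contributes nothing to the total.

```python
from dataclasses import dataclass
from typing import Optional

GIB = 1024 ** 3


@dataclass(frozen=True)
class VRAMComponents:
    """Hypothetical mirror of the four VRAMComponentEstimate fields."""
    base_weights_gib: float
    overhead_gib: float
    activation_gib: Optional[float]  # None when shape inputs are missing
    total_gib: float


def compose_vram(
    n_params: int,
    bytes_per_param: float,
    overhead_gib: float,
    activation_gib: Optional[float] = None,
) -> VRAMComponents:
    """Sum base weights, overhead, and the optional activation estimate."""
    base = n_params * bytes_per_param / GIB
    total = base + overhead_gib + (activation_gib or 0.0)
    return VRAMComponents(base, overhead_gib, activation_gib, total)


if __name__ == "__main__":
    print(compose_vram(8_000_000_000, 0.6, overhead_gib=2.0, activation_gib=1.5))
```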