environment
Environment-stage checks: GPU, VRAM, tokens, log settings.
Classes:
| Name | Description |
|---|---|
| CUDAAvailabilityCheck | Validate CUDA GPU availability. |
| VRAMHeadroomCheck | Estimate whether GPU VRAM is sufficient for training. |
| InferenceKeyCheck | Check NSS_INFERENCE_KEY environment variable. |
| HFModelAvailabilityCheck | Validate local model, HF cache, and online HF access readiness. |
Functions:
| Name | Description |
|---|---|
| param_count_from_empty_model | Count parameters by instantiating the model on the meta device. |
| estimate_params_from_shape | Shape-only fallback param count used when meta-tensor construction fails. |
| estimate_base_model_params | Return (n_params, method) for the base model, or None if unknown. |
| bytes_per_base_weight | Return expected bytes/param for the base model given PEFT mode. |
CUDAAvailabilityCheck
VRAMHeadroomCheck
Bases: MetadataCheck
Estimate whether GPU VRAM is sufficient for training.
The estimate is intentionally a lower bound:

\[ \text{VRAM}_\text{est} = N \cdot b + C \]

where \(N\) is the base-model parameter count (see estimate_base_model_params; exact via the meta-tensor path, or approximate via the shape-heuristic fallback), \(b\) is the bytes-per-param for the selected PEFT mode (see bytes_per_base_weight), and \(C\) is a fixed overhead for CUDA kernels and checkpointed activations.

The expression excludes the fine-grained activation term \(\mathcal{O}(B \cdot S \cdot H \cdot L)\), LoRA adapter parameters, gradients, and optimizer state. For parameter-efficient fine-tuning these are typically small compared to the base weights, but not zero. Passing this check therefore does not guarantee that training will fit in VRAM; failing it is a strong signal that training will run out of memory (OOM).
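The lower bound can be sketched in a few lines. This is a minimal illustration, not the check's actual implementation: the function names and the 2 GiB default for \(C\) are assumptions made for the example, since the overhead value is not stated here.

```python
GIB = 1024 ** 3


def min_vram_bytes(n_params: int, bytes_per_param: float, overhead_bytes: int) -> float:
    """Lower-bound VRAM estimate N * b + C, in bytes."""
    return n_params * bytes_per_param + overhead_bytes


def has_headroom(
    n_params: int,
    bytes_per_param: float,
    total_vram_bytes: int,
    overhead_bytes: int = 2 * GIB,  # hypothetical value for C, chosen for illustration
) -> bool:
    """True when the lower bound fits; this does not guarantee training fits."""
    return min_vram_bytes(n_params, bytes_per_param, overhead_bytes) <= total_vram_bytes


# An 8B-parameter base model under 4-bit QLoRA (about 0.6 bytes/param) on a 24 GiB GPU.
need = min_vram_bytes(8_000_000_000, 0.6, 2 * GIB)
print(f"lower bound: {need / GIB:.1f} GiB, fits: {has_headroom(8_000_000_000, 0.6, 24 * GIB)}")
```

Because the bound ignores activations, adapters, gradients, and optimizer state, a result of `True` only means the check did not find an obvious shortfall.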
References
- EleutherAI, "Transformer Math 101" -- grounds the rule of thumb that inference adds ~20% over raw weights; training adds considerably more. https://blog.eleuther.ai/transformer-math/
InferenceKeyCheck
HFModelAvailabilityCheck
param_count_from_empty_model(autoconfig)
Count parameters by instantiating the model on the meta device.
accelerate.init_empty_weights constructs the full nn.Module graph
with every parameter on torch.device("meta") -- no storage is
allocated and no weights are downloaded. AutoModelForCausalLM.from_config
consults the transformers model-class registry to pick the right
architecture (handling Nemotron's non-gated MLP, MoE experts, biases,
tied embeddings, and any future variant automatically).
Returns None if accelerate/transformers are missing, the config
doesn't map to a registered architecture (e.g. trust_remote_code
custom archs), or instantiation fails for any other reason. The caller
should fall back to estimate_params_from_shape.
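The meta-device technique described above might look roughly like the following. It is a sketch under stated assumptions rather than the function's real body: the name `count_params_on_meta` is hypothetical, and the broad exception handling simply mirrors the documented "return None on any failure" behaviour.

```python
from typing import Any, Optional


def count_params_on_meta(autoconfig: Any) -> Optional[int]:
    """Count parameters without allocating storage; None if that is not possible."""
    try:
        # Optional dependencies: treat any import problem as "cannot count".
        from accelerate import init_empty_weights
        from transformers import AutoModelForCausalLM
    except Exception:
        return None

    try:
        # Inside this context every parameter is created on the meta device,
        # so no weight storage is allocated and nothing is downloaded.
        with init_empty_weights():
            model = AutoModelForCausalLM.from_config(autoconfig)
        return sum(p.numel() for p in model.parameters())
    except Exception:
        # Unregistered architecture or any other instantiation failure.
        return None
```

A caller would pass a config object such as the result of `AutoConfig.from_pretrained(...)` and switch to the shape formula whenever the result is `None`.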
References
- HuggingFace accelerate, "Big Model Inference" -- https://huggingface.co/docs/accelerate/concept_guides/big_model_inference
- HuggingFace accelerate, "Model memory estimator" -- same meta-device technique exposed as accelerate estimate-memory; reported accurate to within a few percent of real CUDA load. https://huggingface.co/docs/accelerate/usage_guides/model_size_estimator
- PyTorch meta device -- https://docs.pytorch.org/docs/stable/meta.html
Source code in src/nemo_safe_synthesizer/preflight/checks/environment.py
estimate_params_from_shape(autoconfig)
Shape-only fallback param count used when meta-tensor construction fails.
Models a decoder-only transformer with grouped-query attention (which
degrades to multi-head when num_key_value_heads == num_attention_heads)
and a gated SwiGLU/GeGLU MLP -- the shape NSS sees on its supported
model families (Llama, Qwen, Mistral, SmolLM, Granite, TinyLlama). For
non-gated variants (e.g. Nemotron's squared-ReLU MLP) this over-counts
MLP params by 50%, which is why the meta-tensor path is preferred.
With hidden size \(H\), intermediate size \(I\), \(L\) layers, vocabulary \(V\), \(n_\text{kv}\) KV heads and per-head dim \(d\), the per-layer cost is

\[ P_\text{layer} = 2H^2 + 2 H \, n_\text{kv} \, d + 3 H I \]
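(full Q/O projections; K/V shrunk by GQA; gate/up/down for SwiGLU). Total parameters are \(L\) times the per-layer cost plus an embedding term proportional to \(V \cdot H\):

\[ N \approx L \cdot \left(2H^2 + 2 H \, n_\text{kv} \, d + 3 H I\right) + N_\text{emb}, \qquad N_\text{emb} \propto V \cdot H \]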
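The shape formula can be written out as a small function. This is an illustrative sketch, not the library's code: the function name and the `tied_embeddings` switch are assumptions, because it is not stated here whether the embedding matrix is counted once or twice, and small terms such as normalization weights are ignored.

```python
def shape_param_estimate(
    hidden: int,
    intermediate: int,
    layers: int,
    vocab: int,
    kv_heads: int,
    head_dim: int,
    tied_embeddings: bool = True,  # assumption: selects one or two V x H matrices
) -> int:
    """Approximate parameter count for a decoder-only GQA transformer with a gated MLP."""
    # Full Q and O projections (H x H each) plus GQA-reduced K and V projections.
    attention = 2 * hidden * hidden + 2 * hidden * kv_heads * head_dim
    # Gate, up, and down projections of a SwiGLU/GeGLU MLP.
    mlp = 3 * hidden * intermediate
    # Token embedding, plus a separate output head when embeddings are not tied.
    embeddings = vocab * hidden * (1 if tied_embeddings else 2)
    return layers * (attention + mlp) + embeddings


# Illustrative shapes resembling an 8B Llama-style config with untied embeddings.
n = shape_param_estimate(4096, 14336, 32, 128256, kv_heads=8, head_dim=128, tied_embeddings=False)
print(f"{n / 1e9:.2f}B parameters")
```

For a non-gated MLP only two of the three MLP projections exist, so the `3 * hidden * intermediate` term is 1.5 times too large, which is the 50% over-count mentioned above.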
References
- Ainslie, J. et al. "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (2023) -- reduced K/V projection shape. https://arxiv.org/abs/2305.13245
- So, D. R. et al. "Primer: Searching for Efficient Transformers for Language Modeling" (2021) -- squared-ReLU MLP (Nemotron family), 2 projections; motivates the 50% over-count caveat above. https://arxiv.org/abs/2109.08668
Source code in src/nemo_safe_synthesizer/preflight/checks/environment.py
estimate_base_model_params(autoconfig)
Return (n_params, method) for the base model, or None if unknown.
method == "exact" means the meta-tensor path succeeded and the count
is architecture-accurate. method == "approximate" means the shape
formula was used as a fallback (see estimate_params_from_shape for its known error modes) and the caller should flag the downstream VRAM
estimate as heuristic. Benchmarked fallback error on supported
architectures: \(-22\%\) to \(+33\%\); hybrid Mamba-Transformer models
(e.g. Nemotron-H) can drift further.
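The exact-then-approximate dispatch can be sketched as follows. The two counters are passed in as callables so the example stays self-contained; the name `estimate_with_fallback` and this injection style are assumptions for illustration, not the function's actual signature.

```python
from typing import Any, Callable, Optional, Tuple

Counter = Callable[[Any], Optional[int]]


def estimate_with_fallback(
    autoconfig: Any,
    exact_counter: Counter,
    shape_counter: Counter,
) -> Optional[Tuple[int, str]]:
    """Return (n_params, method), preferring the exact count, or None if unknown."""
    n = exact_counter(autoconfig)
    if n is not None:
        return n, "exact"
    n = shape_counter(autoconfig)
    if n is not None:
        return n, "approximate"
    return None


# Simulate a failed meta-tensor path so that the shape formula is used instead.
result = estimate_with_fallback({}, lambda cfg: None, lambda cfg: 7_000_000_000)
if result is not None and result[1] == "approximate":
    print("VRAM estimate is heuristic")
```

Keeping the method label next to the count lets the caller decide how strongly to word a warning when the count is only approximate.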
Source code in src/nemo_safe_synthesizer/preflight/checks/environment.py
bytes_per_base_weight(training_cfg)
Return expected bytes/param for the base model given PEFT mode.
NSS always trains via LoRA or QLoRA, so the base model's storage precision dominates VRAM (LoRA adapter params, gradients, and optimizer state are comparatively negligible).
- QLoRA: \(\text{bits}/8 + 0.1\) to cover quant state (absmax / block scales) and dequant workspace. Yields \(\approx 0.6\) for 4-bit, \(\approx 1.1\) for 8-bit.
- LoRA (unquantized): \(2\) bytes (bf16/fp16 base weights).
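The two cases above reduce to a one-line rule. In this sketch the function name and the `quant_bits` parameter are hypothetical stand-ins for whatever the training config exposes; `None` is taken to mean unquantized LoRA.

```python
from typing import Optional


def bytes_per_param(quant_bits: Optional[int]) -> float:
    """Expected bytes per base-model parameter for LoRA (None) or QLoRA (bit width)."""
    if quant_bits is None:
        # Unquantized LoRA: bf16/fp16 base weights.
        return 2.0
    # QLoRA: packed weights plus about 0.1 byte for quant state and dequant workspace.
    return quant_bits / 8 + 0.1


for bits in (4, 8, None):
    print(bits, bytes_per_param(bits))
```

Multiplying the result by the parameter count gives the \(N b\) term used by VRAMHeadroomCheck.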
References
- Hu, E. J. et al. "LoRA: Low-Rank Adaptation of Large Language Models" (2021) -- base weights frozen; adapter + gradients + optimizer state are small relative to \(N b\). https://arxiv.org/abs/2106.09685
- Dettmers, T. et al. "QLoRA: Efficient Finetuning of Quantized LLMs" (2023) -- 4-bit NF4 quantization with block-wise absmax scales; the \(+0.1\) term accounts for these scales and the dequant workspace. https://arxiv.org/abs/2305.14314