utils

GPU memory management, quantization, device mapping, and tokenizer helpers for LLM loading.

Optional LLM dependencies are imported inside the helpers that need them so lightweight utilities such as trust_remote_code_for_model remain usable without installing the full training or inference stack.

Classes:

Name Description
ModelRef

Resolved model reference for local cache and trust policy decisions.

Functions:

Name Description
trust_remote_code_for_model

Determine whether to trust remote code when loading a model.

cleanup_memory

Run garbage collection and empty the CUDA cache.

gpu_stats

Log peak reserved GPU memory and total capacity.

get_max_vram

Return vLLM-style GPU utilization fractions for each available GPU.

get_max_memory_map

Return Hugging Face max_memory byte limits for each available GPU.

add_bos_eos_tokens_to_tokenizer

Enable BOS/EOS token injection and set a pad token if missing.

get_param_from_config

Read a single attribute from a HuggingFace AutoConfig.

load_fast_tokenizer

Load a tokenizer, preferring the Rust tokenizers backend.

get_device_name

Get the name of the current device (first index). Returns 'undefined' if the device is not available.

get_device_map

Infer the device map for a model and optionally pin all layers to one device.

count_trainable_params

Count trainable and total parameters in a PEFT model.

get_quantization_config

Compatibility wrapper for building a transformers v5 quantization config.

ModelRef(original, repo_id=None, revision='main', local_path=None, cache_root=None) dataclass

Resolved model reference for local cache and trust policy decisions.

Intended public API:

- parse() normalizes a user-supplied model string or path without contacting Hugging Face.
- target() returns the value that should be passed to from_pretrained-style loaders: a local snapshot path when available, otherwise the original model reference.
- trust_remote_code reports whether the reference belongs to a trusted organization after accounting for resolved local HF cache paths.
- partial_cached_snapshot() returns HF's local snapshot path for the repo/revision, even when the snapshot is incomplete.
- missing_required_components() reports whether a local model directory has the components this project expects before an offline load.
- missing_remote_code_components() reports trusted remote-code files referenced by Transformers auto_map metadata but absent locally.

Deliberate Hugging Face coupling: repo-id validation, cache-root resolution, cache scanning, snapshot layout, artifact names, tokenizer filenames, and sharded weight index parsing mirror current Hugging Face Hub and Transformers behavior. This is intentional so that NSS (NeMo Safe Synthesizer) decisions match the libraries that load the model. If model loading or cache preflight behavior changes after an upstream HF release, inspect this class first.

Internal helpers are not a generic model-layout abstraction. They should stay close to HF's implementation rather than grow compatibility shims for unrelated storage formats.

Methods:

Name Description
parse

Parse a model identifier or path without contacting Hugging Face.

missing_required_components

Return local model components missing from model_dir.

missing_remote_code_components

Return trusted remote-code components referenced by config but absent locally.

partial_cached_snapshot

Return the local HF snapshot for this repo/revision, even if it is partial.

is_trusted_org

Return whether an organization is allowed to load remote code.

target

Return the local snapshot path when available, otherwise the original input.

Attributes:

Name Type Description
trust_remote_code bool

Whether loaders should pass trust_remote_code=True for this model.

trust_remote_code property

Whether loaders should pass trust_remote_code=True for this model.

parse(model_name, *, revision='main', cache_root=None) classmethod

Parse a model identifier or path without contacting Hugging Face.

This is safe to call in preflight and loader setup because it uses Hugging Face's local cache APIs only. Cached-model hits may still cost a few milliseconds because HF cache scanning walks cache metadata to confirm model artifacts exist.

Source code in src/nemo_safe_synthesizer/llm/utils.py
@classmethod
def parse(
    cls,
    model_name: str | Path,
    *,
    revision: str = "main",
    cache_root: str | Path | None = None,
) -> Self:
    """Parse a model identifier or path without contacting Hugging Face.

    This is safe to call in preflight and loader setup because it uses
    Hugging Face's local cache APIs only. Cached-model hits may still cost a
    few milliseconds because HF cache scanning walks cache metadata to
    confirm model artifacts exist.
    """
    cache_root_path = Path(cache_root) if cache_root is not None else cls._default_hf_cache_root()
    model_ref = str(model_name)
    if not model_ref:
        return cls(original=model_name, revision=revision, cache_root=cache_root_path)

    model_path = Path(model_name)
    if model_path.exists():
        repo_id = cls._repo_id_from_hf_cache_path(model_path, cache_root_path)
        return cls(
            original=model_name,
            repo_id=repo_id,
            revision=revision,
            local_path=model_path,
            cache_root=cache_root_path,
        )

    repo_id = cls._repo_id_from_hub_identifier(model_ref)
    local_path = cls._cached_snapshot_for_repo(repo_id, revision, cache_root_path) if repo_id else None
    return cls(
        original=model_name,
        repo_id=repo_id,
        revision=revision,
        local_path=local_path,
        cache_root=cache_root_path,
    )
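The branch order in parse can be illustrated with a small stand-in that uses only the standard library. This is a sketch, not the real method: classify_model_ref is a hypothetical helper, the model name is made up, and the Hugging Face cache lookups that parse performs are reduced to comments.

```python
import tempfile
from pathlib import Path
from typing import Union


def classify_model_ref(model_name: Union[str, Path]) -> str:
    """Hypothetical helper that mirrors the branch order of ModelRef.parse."""
    if not str(model_name):
        return "empty"  # parse returns a reference with no repo_id or local_path
    if Path(model_name).exists():
        return "local-path"  # parse records local_path and checks for an HF cache layout
    return "hub-identifier"  # parse validates the repo id and looks for a cached snapshot


with tempfile.TemporaryDirectory() as tmp:
    existing_dir_kind = classify_model_ref(tmp)

empty_kind = classify_model_ref("")
hub_kind = classify_model_ref("some-org/model-that-is-not-on-disk")
empty_path_exists = Path("").exists()  # Path("") is the current directory
```

One effect of testing the empty string first: Path("") normalizes to the current directory, which exists, so without that guard an empty model name would fall into the local-path branch.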

missing_required_components(model_dir) classmethod

Return local model components missing from model_dir.

Source code in src/nemo_safe_synthesizer/llm/utils.py
@classmethod
def missing_required_components(cls, model_dir: Path) -> list[str]:
    """Return local model components missing from ``model_dir``."""
    return [name for name, present in cls._required_component_status(model_dir).items() if not present]

missing_remote_code_components(model_dir) classmethod

Return trusted remote-code components referenced by config but absent locally.

Source code in src/nemo_safe_synthesizer/llm/utils.py
@classmethod
def missing_remote_code_components(cls, model_dir: Path) -> list[str]:
    """Return trusted remote-code components referenced by config but absent locally."""
    required = cls._remote_code_components(model_dir)
    missing: list[str] = []
    for component, local_path in required:
        if local_path is None or not (model_dir / local_path).is_file():
            missing.append(component)
    return sorted(missing)
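To see what the loop reports, here is a minimal standalone sketch of the same check run against a temporary directory. The (component, relative path) pairs are invented for illustration; in the real method they come from the private _remote_code_components helper, whose output format is assumed here from how the loop unpacks it.

```python
import tempfile
from pathlib import Path
from typing import Optional


def missing_components(model_dir: Path, required: list[tuple[str, Optional[str]]]) -> list[str]:
    """Return names of components whose file is unknown (None) or absent on disk."""
    missing: list[str] = []
    for component, local_path in required:
        if local_path is None or not (model_dir / local_path).is_file():
            missing.append(component)
    return sorted(missing)


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    (model_dir / "modeling_custom.py").write_text("# placeholder\n")
    # Hypothetical entries: one present, one absent, one with no resolvable path.
    required = [
        ("AutoModelForCausalLM", "modeling_custom.py"),
        ("AutoConfig", "configuration_custom.py"),
        ("AutoTokenizer", None),
    ]
    result = missing_components(model_dir, required)
```

A component with no resolvable path counts as missing, the same as a path that does not point at a regular file.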

partial_cached_snapshot()

Return the local HF snapshot for this repo/revision, even if it is partial.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def partial_cached_snapshot(self) -> Path | None:
    """Return the local HF snapshot for this repo/revision, even if it is partial."""
    if self.repo_id is None or self.cache_root is None:
        return None
    return self._local_snapshot_for_repo(self.repo_id, self.revision, self.cache_root)

is_trusted_org(org) classmethod

Return whether an organization is allowed to load remote code.

Source code in src/nemo_safe_synthesizer/llm/utils.py
@classmethod
def is_trusted_org(cls, org: str) -> bool:
    """Return whether an organization is allowed to load remote code."""
    return org.casefold() in cls.trusted_orgs
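A small sketch of the case-insensitive membership test follows. The organization name is a placeholder, since the actual contents of trusted_orgs are not shown on this page.

```python
# Hypothetical allow-list; the real one lives on ModelRef.trusted_orgs.
TRUSTED_ORGS = frozenset({"example-org"})


def is_trusted_org(org: str) -> bool:
    """Case-insensitive membership test against the allow-list."""
    return org.casefold() in TRUSTED_ORGS


mixed_case_match = is_trusted_org("Example-Org")
unknown_org = is_trusted_org("someone-else")
```

Only the input is casefolded, so the comparison is case-insensitive only if the stored entries are already in casefolded form.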

target()

Return the local snapshot path when available, otherwise the original input.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def target(self) -> str:
    """Return the local snapshot path when available, otherwise the original input."""
    return str(self.local_path or self.original)

trust_remote_code_for_model(model_name, *, cache_root=None)

Determine whether to trust remote code when loading a model.

Returns True for model identifiers owned by trusted organizations, including configured Hugging Face cache snapshots for those organizations.

Parameters:

Name Type Description Default
model_name str | Path

HuggingFace model identifier or local path.

required
cache_root str | Path | None

Hugging Face Hub cache root. Defaults to the configured hub cache.

None

Returns:

Type Description
bool

Whether to set trust_remote_code=True when loading the model.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def trust_remote_code_for_model(model_name: str | Path, *, cache_root: str | Path | None = None) -> bool:
    """Determine whether to trust remote code when loading a model.

    Returns ``True`` for model identifiers owned by trusted organizations,
    including configured Hugging Face cache snapshots for those organizations.

    Args:
        model_name: HuggingFace model identifier or local path.
        cache_root: Hugging Face Hub cache root. Defaults to the configured hub cache.

    Returns:
        Whether to set ``trust_remote_code=True`` when loading the model.
    """
    return ModelRef.parse(model_name, cache_root=cache_root).trust_remote_code

cleanup_memory()

Run garbage collection and empty the CUDA cache.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def cleanup_memory() -> None:
    """Run garbage collection and empty the CUDA cache."""
    import torch

    gc.collect()
    with torch.no_grad():
        torch.cuda.empty_cache()

gpu_stats()

Log peak reserved GPU memory and total capacity.

Queries CUDA device 0 and logs the peak reserved memory and the device's total memory in GiB.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def gpu_stats() -> None:
    """Log peak reserved GPU memory and total capacity.

    Queries CUDA device 0 and logs the peak reserved memory and the
    device's total memory in GiB.
    """
    import torch

    def round_gb(value: float) -> float:
        return round(value / 1024 / 1024 / 1024, 3)

    gpu_stats = torch.cuda.get_device_properties(0)
    start_gpu_memory = round_gb(torch.cuda.max_memory_reserved())
    max_memory = round_gb(gpu_stats.total_memory)
    logger.info(f"{start_gpu_memory} GB of memory reserved.")
    logger.info(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
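The round_gb helper divides by 1024 three times, so the figures it produces are binary gibibytes rather than decimal gigabytes. A short sketch of the difference, using a hypothetical device with 24 GiB of memory:

```python
def round_gb(value: float) -> float:
    """Convert bytes to GiB (2**30 bytes), rounded to three decimals."""
    return round(value / 1024 / 1024 / 1024, 3)


total_bytes = 24 * 1024**3  # hypothetical 24 GiB device
as_gib = round_gb(total_bytes)
as_decimal_gb = round(total_bytes / 1e9, 3)  # the same byte count in decimal GB
```

Here as_gib is 24.0 while as_decimal_gb is 25.77, which is worth keeping in mind when comparing the logged values with vendor-quoted capacities.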

get_max_vram(max_vram_fraction=None)

Return vLLM-style GPU utilization fractions for each available GPU.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def get_max_vram(max_vram_fraction: float | None = None) -> dict[int, float]:
    """Return vLLM-style GPU utilization fractions for each available GPU."""
    return {device: allocation.utilization for device, allocation in _get_vram_allocations(max_vram_fraction).items()}

get_max_memory_map(max_vram_fraction=None)

Return Hugging Face max_memory byte limits for each available GPU.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def get_max_memory_map(max_vram_fraction: float | None = None) -> dict[int, int]:
    """Return Hugging Face ``max_memory`` byte limits for each available GPU."""
    return {device: allocation.memory_bytes for device, allocation in _get_vram_allocations(max_vram_fraction).items()}
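get_max_vram and get_max_memory_map are two projections of the same per-device records. The sketch below uses a hypothetical Allocation dataclass in place of whatever the private _get_vram_allocations helper returns; only the utilization and memory_bytes attribute names come from the source, and the numbers are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    """Hypothetical stand-in for one per-device allocation record."""

    utilization: float  # fraction of the device, as a vLLM-style setting
    memory_bytes: int  # absolute byte cap, as a Hugging Face max_memory entry


# Invented values for two GPUs.
allocations = {
    0: Allocation(utilization=0.8, memory_bytes=19 * 1024**3),
    1: Allocation(utilization=0.8, memory_bytes=12 * 1024**3),
}

max_vram = {device: a.utilization for device, a in allocations.items()}
max_memory_map = {device: a.memory_bytes for device, a in allocations.items()}
```

Both dictionaries are keyed by the same device indices, so an inference engine that takes a fraction and a loader that takes a byte limit can be configured from one source.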

add_bos_eos_tokens_to_tokenizer(tokenizer)

Enable BOS/EOS token injection and set a pad token if missing.

Mutates tokenizer in-place to set add_bos_token and add_eos_token to True. If no pad token is configured, pad_token_id is set to eos_token_id.

Parameters:

Name Type Description Default
tokenizer PreTrainedTokenizer

The tokenizer to configure.

required

Returns:

Type Description
PreTrainedTokenizer

The same tokenizer instance, modified in-place.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def add_bos_eos_tokens_to_tokenizer(tokenizer: PreTrainedTokenizer) -> PreTrainedTokenizer:
    """Enable BOS/EOS token injection and set a pad token if missing.

    Mutates ``tokenizer`` in-place to set ``add_bos_token`` and
    ``add_eos_token`` to ``True``.  If no pad token is configured,
    ``pad_token_id`` is set to ``eos_token_id``.

    Args:
        tokenizer: The tokenizer to configure.

    Returns:
        The same tokenizer instance, modified in-place.
    """
    tokenizer.add_bos_token = True
    tokenizer.add_eos_token = True
    # Compare against None: a pad token id of 0 is valid and must not be overwritten.
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
    return tokenizer
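Whether a pad token counts as "missing" depends on how the check is written, because token id 0 is a legitimate id but is falsy in Python. The sketch below contrasts a truthiness test with an explicit None test. It uses SimpleNamespace objects as stand-ins for tokenizers, and the ids are invented.

```python
from types import SimpleNamespace


def missing_by_truthiness(tokenizer) -> bool:
    """Treats both None and 0 as 'no pad token'."""
    return not tokenizer.pad_token_id


def missing_by_none_check(tokenizer) -> bool:
    """Treats only None as 'no pad token'."""
    return tokenizer.pad_token_id is None


# Stand-in tokenizers: one pads with id 0, one has no pad token at all.
pads_with_zero = SimpleNamespace(pad_token_id=0, eos_token_id=2)
no_pad_token = SimpleNamespace(pad_token_id=None, eos_token_id=2)

zero_truthy = missing_by_truthiness(pads_with_zero)
zero_none = missing_by_none_check(pads_with_zero)
unset_truthy = missing_by_truthiness(no_pad_token)
unset_none = missing_by_none_check(no_pad_token)
```

The two checks agree when the pad token is unset and disagree only for id 0, where the truthiness form would replace a valid pad token with the EOS token.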

get_param_from_config(param, default_value=None, model_name=None, trust_remote_code=None, config=None)

Read a single attribute from a HuggingFace AutoConfig.

Either an existing config object or a model_name (used to load one on the fly) must be provided.

Parameters:

Name Type Description Default
param str

Name of the config attribute to retrieve.

required
default_value Any | None

Fallback value when the attribute is absent.

None
model_name str | None

HuggingFace model identifier. Required when config is not supplied.

None
trust_remote_code bool | None

Passed through to AutoConfig.from_pretrained when loading a config.

None
config AutoConfig | None

Pre-loaded AutoConfig. Takes precedence over model_name.

None

Returns:

Type Description
str | None

The attribute value, or default_value if the attribute does not exist on the config.

Raises:

Type Description
ValueError

If neither model_name nor config is provided.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def get_param_from_config(
    param: str,
    default_value: Any | None = None,
    model_name: str | None = None,
    trust_remote_code: bool | None = None,
    config: AutoConfig | None = None,
) -> str | None:
    """Read a single attribute from a HuggingFace ``AutoConfig``.

    Either an existing ``config`` object or a ``model_name`` (used to
    load one on the fly) must be provided.

    Args:
        param: Name of the config attribute to retrieve.
        default_value: Fallback value when the attribute is absent.
        model_name: HuggingFace model identifier.  Required when
            ``config`` is not supplied.
        trust_remote_code: Passed through to
            ``AutoConfig.from_pretrained`` when loading a config.
        config: Pre-loaded ``AutoConfig``.  Takes precedence over
            ``model_name``.

    Returns:
        The attribute value, or ``default_value`` if the attribute does
        not exist on the config.

    Raises:
        ValueError: If neither ``model_name`` nor ``config`` is provided.
    """
    from transformers import AutoConfig

    if config is None:
        if model_name is None:
            raise ValueError("model_name is required if config is not provided")
        config = AutoConfig.from_pretrained(model_name, trust_remote_code=trust_remote_code)

    return getattr(config, param, default_value)
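The lookup itself is a plain getattr with a default, which has one consequence worth seeing: the fallback applies only when the attribute is absent, not when it is present and set to None. A minimal sketch with a SimpleNamespace standing in for a loaded config (attribute names and values are illustrative):

```python
from types import SimpleNamespace
from typing import Any


def read_param(config: Any, param: str, default_value: Any = None) -> Any:
    """Return the attribute if it exists on the config, else the default."""
    return getattr(config, param, default_value)


# Stand-in for a loaded config object.
config = SimpleNamespace(max_position_embeddings=4096, rope_scaling=None)

present = read_param(config, "max_position_embeddings")
absent = read_param(config, "sliding_window", default_value=0)
present_but_none = read_param(config, "rope_scaling", default_value="fallback")
```

Callers that need a non-None value should therefore handle None explicitly after the call instead of relying on default_value.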

load_fast_tokenizer(model_name_or_path, **kwargs)

Load a tokenizer, preferring the Rust tokenizers backend.

Centralizes our tokenizer loads so we consistently request the fast (Rust) backend that transformers v5 auto-selects, and log when the selected backend falls back to the slow Python implementation.

Why this matters under v5: transformers v5 consolidated the previously split tokenization_*.py / tokenization_*_fast.py modules into a single file per model with automatic backend selection. use_fast defaults to True, but a small set of models with no Rust port (older SentencePiece-only checkpoints) still resolve to the slow backend. Surfacing that fallback gives operators a clear signal when tokenization is on the slow path — meaningful in our data-prep pipeline where assembling training examples is tokenizer-bound.

Parameters:

Name Type Description Default
model_name_or_path Path | str

HuggingFace model id or local path.

required
**kwargs Any

Forwarded to AutoTokenizer.from_pretrained (e.g. model_max_length, trust_remote_code). use_fast is forced to True.

{}

Returns:

Type Description
PreTrainedTokenizer

Loaded PreTrainedTokenizer (Rust-backed when available).

Source code in src/nemo_safe_synthesizer/llm/utils.py
def load_fast_tokenizer(model_name_or_path: Path | str, **kwargs: Any) -> PreTrainedTokenizer:
    """Load a tokenizer, preferring the Rust ``tokenizers`` backend.

    Centralizes our tokenizer loads so we consistently request the fast
    (Rust) backend that transformers v5 auto-selects, and log when the
    selected backend falls back to the slow Python implementation.

    Why this matters under v5: transformers v5 consolidated the previously
    split ``tokenization_*.py`` / ``tokenization_*_fast.py`` modules into
    a single file per model with automatic backend selection. ``use_fast``
    defaults to ``True``, but a small set of models with no Rust port
    (older SentencePiece-only checkpoints) still resolve to the slow
    backend. Surfacing that fallback gives operators a clear signal when
    tokenization is on the slow path — meaningful in our data-prep
    pipeline where assembling training examples is tokenizer-bound.

    Args:
        model_name_or_path: HuggingFace model id or local path.
        **kwargs: Forwarded to ``AutoTokenizer.from_pretrained`` (e.g.
            ``model_max_length``, ``trust_remote_code``). ``use_fast`` is
            forced to ``True``.

    Returns:
        Loaded ``PreTrainedTokenizer`` (Rust-backed when available).
    """
    from transformers import AutoTokenizer, PreTrainedTokenizer

    kwargs["use_fast"] = True
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, **kwargs)
    if not getattr(tokenizer, "is_fast", False):
        logger.warning(
            "Loaded slow (Python) tokenizer for %r — no Rust backend available. "
            "Data-prep tokenization will be ~5-10x slower than the fast path.",
            str(model_name_or_path),
        )
    return cast(PreTrainedTokenizer, tokenizer)
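Because use_fast is written into kwargs before the call, a caller cannot opt out of the fast backend by passing use_fast=False. The sketch below shows that override with a stand-in loader that echoes its arguments; fake_from_pretrained and the model name are hypothetical and no tokenizer is actually loaded.

```python
from typing import Any


def fake_from_pretrained(name: str, **kwargs: Any) -> dict[str, Any]:
    """Stand-in loader that echoes the arguments it received."""
    return {"name": name, **kwargs}


def load_with_fast_forced(name: str, **kwargs: Any) -> dict[str, Any]:
    """Overwrite any caller-supplied use_fast before delegating."""
    kwargs["use_fast"] = True
    return fake_from_pretrained(name, **kwargs)


received = load_with_fast_forced(
    "some-org/some-model", use_fast=False, model_max_length=2048
)
```

Other keyword arguments, such as model_max_length here, pass through unchanged.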

get_device_name()

Get the name of the current device (first index). Returns 'undefined' if the device is not available.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def get_device_name() -> str:
    """Get the name of the current device (first index). Returns 'undefined' if the device is not available."""
    # torch may be absent (CPU-only install); CUDA/driver problems surface as
    # RuntimeError/AssertionError from get_device_properties. Anything else is
    # unexpected and should propagate rather than masquerade as 'undefined'.
    try:
        import torch

        return torch.cuda.get_device_properties(0).name
    except (ImportError, RuntimeError, AssertionError):
        logger.debug("Could not resolve CUDA device name; reporting 'undefined'.", exc_info=True)
        return "undefined"

get_device_map(model_target, autoconfig=None, revision=None, trust_remote_code=False, local_files_only=False, force_single_device=None)

Infer the device map for a model and optionally pin all layers to one device.

Uses accelerate.infer_auto_device_map on an empty-weight model skeleton to determine layer-to-device assignments.

Parameters:

Name Type Description Default
model_target str

HuggingFace model identifier or local path.

required
autoconfig AutoConfig | None

Pre-loaded AutoConfig. If None, one is loaded from model_target.

None
revision str | None

Model revision (branch, tag, or commit hash).

None
trust_remote_code bool

Whether to trust remote code when loading.

False
local_files_only bool

Restrict loading to local files only.

False
force_single_device int | None

When set, every layer is assigned to this CUDA device index.

None

Returns:

Type Description
str | dict[str, int | str]

Ordered dictionary mapping layer names to device identifiers.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def get_device_map(
    model_target: str,
    autoconfig: AutoConfig | None = None,
    revision: str | None = None,
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    force_single_device: int | None = None,
) -> str | dict[str, int | str]:
    """Infer the device map for a model and optionally pin all layers to one device.

    Uses ``accelerate.infer_auto_device_map`` on an empty-weight model
    skeleton to determine layer-to-device assignments.

    Args:
        model_target: HuggingFace model identifier or local path.
        autoconfig: Pre-loaded ``AutoConfig``.  If ``None``, one is
            loaded from ``model_target``.
        revision: Model revision (branch, tag, or commit hash).
        trust_remote_code: Whether to trust remote code when loading.
        local_files_only: Restrict loading to local files only.
        force_single_device: When set, every layer is assigned to this
            CUDA device index.

    Returns:
        Ordered dictionary mapping layer names to device identifiers.
    """
    from accelerate import infer_auto_device_map, init_empty_weights
    from transformers import AutoConfig, AutoModelForCausalLM

    config = autoconfig or AutoConfig.from_pretrained(
        model_target,
        revision=revision,
        trust_remote_code=trust_remote_code,
        local_files_only=local_files_only,
    )
    # Create an empty model with the configuration
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(config, trust_remote_code=trust_remote_code)
    device_map = infer_auto_device_map(model)
    if force_single_device is not None:
        for key in device_map:
            device_map[key] = force_single_device
    return device_map
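The force_single_device step can be shown on a plain dictionary without loading a model. The layer names and initial placements below are invented; a real map comes from accelerate.infer_auto_device_map.

```python
from typing import Union

DeviceMap = dict[str, Union[int, str]]


def pin_to_device(device_map: DeviceMap, device: int) -> DeviceMap:
    """Reassign every entry to one device index, in place, as the loop above does."""
    for key in device_map:
        device_map[key] = device  # updating existing keys while iterating is safe
    return device_map


# Hypothetical inferred placement spread over two GPUs and the CPU.
inferred: DeviceMap = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "lm_head": "cpu",
}
pinned = pin_to_device(inferred, device=0)
```

Pinning overrides the inferred placement wholesale, including entries that were offloaded to the CPU, so the chosen device has to be able to hold the whole model.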

count_trainable_params(model)

Count trainable and total parameters in a PEFT model.

Parameters:

Name Type Description Default
model PeftModel

A PeftModel (or any torch.nn.Module) to inspect.

required

Returns:

Type Description
tuple[int, int]

A tuple of (trainable_params, all_params).

Source code in src/nemo_safe_synthesizer/llm/utils.py
def count_trainable_params(model: PeftModel) -> tuple[int, int]:
    """Count trainable and total parameters in a PEFT model.

    Args:
        model: A ``PeftModel`` (or any ``torch.nn.Module``) to inspect.

    Returns:
        A tuple of ``(trainable_params, all_params)``.
    """
    trainable_params = 0
    all_params = 0
    for _, param in model.named_parameters():
        all_params += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    return trainable_params, all_params
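The function only needs named_parameters(), numel(), and requires_grad, so its arithmetic can be checked without torch. The sketch below uses hypothetical FakeParam and FakeModel stand-ins with invented sizes that loosely resemble a frozen base layer plus two small adapter matrices.

```python
from dataclasses import dataclass


@dataclass
class FakeParam:
    """Stand-in for a torch parameter: an element count and a grad flag."""

    size: int
    requires_grad: bool

    def numel(self) -> int:
        return self.size


class FakeModel:
    """Stand-in exposing named_parameters() like torch.nn.Module."""

    def __init__(self, params: dict[str, FakeParam]) -> None:
        self._params = params

    def named_parameters(self):
        return iter(self._params.items())


def count_trainable_params(model) -> tuple[int, int]:
    trainable_params = 0
    all_params = 0
    for _, param in model.named_parameters():
        all_params += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    return trainable_params, all_params


model = FakeModel(
    {
        "base.weight": FakeParam(1_000_000, requires_grad=False),
        "lora_A.weight": FakeParam(8_000, requires_grad=True),
        "lora_B.weight": FakeParam(8_000, requires_grad=True),
    }
)
trainable, total = count_trainable_params(model)
trainable_pct = 100 * trainable / total
```

A typical use of the returned pair is to log the trainable share, which in this made-up example is about 1.57 percent.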

get_quantization_config(scheme)

Compatibility wrapper for building a transformers v5 quantization config.

Accepts a QuantizationScheme (or its string value) for new callers, or an integer 4 / 8 for backward compatibility with the legacy quantization_bits field (4 → bnb-4bit, 8 → bnb-8bit). New code should prefer nemo_safe_synthesizer.config.training.QuantizationScheme.to_transformers_config.

Parameters:

Name Type Description Default
scheme QuantizationScheme | str | Literal[4, 8]

A QuantizationScheme value, its string equivalent (e.g. "nvfp4"), or the legacy bit-count alias.

required

Returns:

Type Description
QuantizationConfigMixin

A transformers QuantizationConfigMixin subclass instance (BitsAndBytesConfig, FineGrainedFP8Config, TorchAoConfig, or Mxfp4Config) ready to pass to from_pretrained() via the quantization_config= kwarg.

Raises:

Type Description
ValueError

If scheme is not a recognized scheme name or bit count.

ImportError

If the underlying quantization backend is not installed (e.g. torchao for NVFP4 / MXFP4).

Source code in src/nemo_safe_synthesizer/llm/utils.py
def get_quantization_config(scheme: QuantizationScheme | str | Literal[4, 8]) -> QuantizationConfigMixin:
    """Compatibility wrapper for building a transformers v5 quantization config.

    Accepts a :class:`QuantizationScheme` (or its string value) for new
    callers, or an integer ``4`` / ``8`` for backward compatibility with the
    legacy ``quantization_bits`` field (4 → ``bnb-4bit``, 8 → ``bnb-8bit``).
    New code should prefer
    :meth:`nemo_safe_synthesizer.config.training.QuantizationScheme.to_transformers_config`.

    Args:
        scheme: A ``QuantizationScheme`` value, its string equivalent
            (e.g. ``"nvfp4"``), or the legacy bit-count alias.

    Returns:
        A transformers ``QuantizationConfigMixin`` subclass instance
        (``BitsAndBytesConfig``, ``FineGrainedFP8Config``, ``TorchAoConfig``,
        or ``Mxfp4Config``) ready to pass to ``from_pretrained()`` via the
        ``quantization_config=`` kwarg.

    Raises:
        ValueError: If ``scheme`` is not a recognized scheme name or bit count.
        ImportError: If the underlying quantization backend is not installed
            (e.g. torchao for NVFP4 / MXFP4).
    """
    from ..config.training import QuantizationScheme

    return QuantizationScheme.from_alias(scheme).to_transformers_config()
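The alias handling described above can be sketched with a small enum. This is a guess at the shape of from_alias, whose implementation is not shown on this page: the Scheme class and its member names are hypothetical, and only the values bnb-4bit, bnb-8bit, and nvfp4 and the 4 / 8 mapping are taken from the docstring.

```python
from enum import Enum
from typing import Union


class Scheme(str, Enum):
    """Hypothetical stand-in for QuantizationScheme with three of its values."""

    BNB_4BIT = "bnb-4bit"
    BNB_8BIT = "bnb-8bit"
    NVFP4 = "nvfp4"

    @classmethod
    def from_alias(cls, value: Union["Scheme", str, int]) -> "Scheme":
        if isinstance(value, cls):
            return value
        # Legacy quantization_bits aliases.
        if value == 4:
            return cls.BNB_4BIT
        if value == 8:
            return cls.BNB_8BIT
        return cls(value)  # Enum lookup raises ValueError for unknown values


from_legacy_bits = Scheme.from_alias(4)
from_string = Scheme.from_alias("nvfp4")
from_member = Scheme.from_alias(Scheme.BNB_8BIT)

try:
    Scheme.from_alias(16)
    rejected = False
except ValueError:
    rejected = True
```

Normalizing every accepted input to one enum member first keeps the config construction in a single place, which matches the wrapper's one-line body.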