utils

GPU memory management, quantization, device mapping, and tokenizer helpers for LLM loading.

Optional LLM dependencies are imported inside the helpers that need them so lightweight utilities such as trust_remote_code_for_model remain usable without installing the full training or inference stack.

Classes:

Name Description
ModelRef

Resolved model reference for local cache and trust policy decisions.

Functions:

Name Description
trust_remote_code_for_model

Determine whether to trust remote code when loading a model.

cleanup_memory

Run garbage collection and empty the CUDA cache.

gpu_stats

Log peak reserved GPU memory and total capacity.

get_max_vram

Calculate maximum memory allocation for each available GPU.

add_bos_eos_tokens_to_tokenizer

Enable BOS/EOS token injection and set a pad token if missing.

get_param_from_config

Read a single attribute from a HuggingFace AutoConfig.

get_device_map

Infer the device map for a model and optionally pin all layers to one device.

count_trainable_params

Count trainable and total parameters in a PEFT model.

get_quantization_config

Build a BitsAndBytesConfig for 4-bit or 8-bit quantization.

get_device_name

Get the name of the first CUDA device (index 0). Returns 'undefined' if no device is available.

ModelRef(original, repo_id=None, revision='main', local_path=None, cache_root=None) dataclass

Resolved model reference for local cache and trust policy decisions.

Intended public API:

- parse() normalizes a user-supplied model string or path without contacting Hugging Face.
- target() returns the value that should be passed to from_pretrained-style loaders: a local snapshot path when available, otherwise the original model reference.
- trust_remote_code reports whether the reference belongs to a trusted organization after accounting for resolved local HF cache paths.
- partial_cached_snapshot() returns HF's local snapshot path for the repo/revision, even when the snapshot is incomplete.
- missing_required_components() reports which components this project expects before an offline load are missing from a local model directory.
- missing_remote_code_components() reports trusted remote-code files referenced by Transformers auto_map metadata but absent locally.

Deliberate Hugging Face coupling: repo-id validation, cache-root resolution, cache scanning, snapshot layout, artifact names, tokenizer filenames, and sharded weight index parsing mirror current Hugging Face Hub and Transformers behavior. This is intentional so that NSS (NeMo Safe Synthesizer) decisions match the libraries that load the model. If model loading or cache preflight behavior changes after an upstream HF release, inspect this class first.

Internal helpers are not a generic model-layout abstraction. They should stay close to HF's implementation rather than grow compatibility shims for unrelated storage formats.

Methods:

Name Description
parse

Parse a model identifier or path without contacting Hugging Face.

missing_required_components

Return local model components missing from model_dir.

missing_remote_code_components

Return trusted remote-code components referenced by config but absent locally.

partial_cached_snapshot

Return the local HF snapshot for this repo/revision, even if it is partial.

is_trusted_org

Return whether an organization is allowed to load remote code.

target

Return the local snapshot path when available, otherwise the original input.

Attributes:

Name Type Description
trust_remote_code bool

Whether loaders should pass trust_remote_code=True for this model.

trust_remote_code property

Whether loaders should pass trust_remote_code=True for this model.

parse(model_name, *, revision='main', cache_root=None) classmethod

Parse a model identifier or path without contacting Hugging Face.

This is safe to call in preflight and loader setup because it uses Hugging Face's local cache APIs only. Cached-model hits may still cost a few milliseconds because HF cache scanning walks cache metadata to confirm model artifacts exist.

Source code in src/nemo_safe_synthesizer/llm/utils.py
@classmethod
def parse(
    cls,
    model_name: str | Path,
    *,
    revision: str = "main",
    cache_root: str | Path | None = None,
) -> Self:
    """Parse a model identifier or path without contacting Hugging Face.

    This is safe to call in preflight and loader setup because it uses
    Hugging Face's local cache APIs only. Cached-model hits may still cost a
    few milliseconds because HF cache scanning walks cache metadata to
    confirm model artifacts exist.
    """
    cache_root_path = Path(cache_root) if cache_root is not None else cls._default_hf_cache_root()
    model_ref = str(model_name)
    if not model_ref:
        return cls(original=model_name, revision=revision, cache_root=cache_root_path)

    model_path = Path(model_name)
    if model_path.exists():
        repo_id = cls._repo_id_from_hf_cache_path(model_path, cache_root_path)
        return cls(
            original=model_name,
            repo_id=repo_id,
            revision=revision,
            local_path=model_path,
            cache_root=cache_root_path,
        )

    repo_id = cls._repo_id_from_hub_identifier(model_ref)
    local_path = cls._cached_snapshot_for_repo(repo_id, revision, cache_root_path) if repo_id else None
    return cls(
        original=model_name,
        repo_id=repo_id,
        revision=revision,
        local_path=local_path,
        cache_root=cache_root_path,
    )

missing_required_components(model_dir) classmethod

Return local model components missing from model_dir.

Source code in src/nemo_safe_synthesizer/llm/utils.py
@classmethod
def missing_required_components(cls, model_dir: Path) -> list[str]:
    """Return local model components missing from ``model_dir``."""
    return [name for name, present in cls._required_component_status(model_dir).items() if not present]

missing_remote_code_components(model_dir) classmethod

Return trusted remote-code components referenced by config but absent locally.

Source code in src/nemo_safe_synthesizer/llm/utils.py
@classmethod
def missing_remote_code_components(cls, model_dir: Path) -> list[str]:
    """Return trusted remote-code components referenced by config but absent locally."""
    required = cls._remote_code_components(model_dir)
    missing: list[str] = []
    for component, local_path in required:
        if local_path is None or not (model_dir / local_path).is_file():
            missing.append(component)
    return sorted(missing)
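
The check above can be exercised without any model files. The following is a minimal sketch, assuming the private helper yields (component, relative path) pairs; the function name `missing_components` and the file and component names are hypothetical stand-ins, not the project's own.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def missing_components(model_dir: Path, required: list[tuple[str, str | None]]) -> list[str]:
    """Return sorted component names whose file is unresolved or absent under model_dir."""
    missing: list[str] = []
    for component, local_path in required:
        # A component counts as missing when no path was resolved for it
        # or when the resolved path is not a regular file in model_dir.
        if local_path is None or not (model_dir / local_path).is_file():
            missing.append(component)
    return sorted(missing)


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    (model_dir / "modeling_custom.py").write_text("# placeholder\n")
    # Hypothetical pairs; in the real class they are derived from auto_map metadata.
    required = [
        ("AutoModelForCausalLM", "modeling_custom.py"),
        ("AutoConfig", "configuration_custom.py"),
        ("AutoTokenizer", None),
    ]
    result = missing_components(model_dir, required)

print(result)
```

Only the component whose file exists on disk is dropped from the result; an unresolved path (None) is treated the same as an absent file.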

partial_cached_snapshot()

Return the local HF snapshot for this repo/revision, even if it is partial.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def partial_cached_snapshot(self) -> Path | None:
    """Return the local HF snapshot for this repo/revision, even if it is partial."""
    if self.repo_id is None or self.cache_root is None:
        return None
    return self._local_snapshot_for_repo(self.repo_id, self.revision, self.cache_root)

is_trusted_org(org) classmethod

Return whether an organization is allowed to load remote code.

Source code in src/nemo_safe_synthesizer/llm/utils.py
@classmethod
def is_trusted_org(cls, org: str) -> bool:
    """Return whether an organization is allowed to load remote code."""
    return org.casefold() in cls.trusted_orgs
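
The membership test above is case-insensitive on the input side only, which implies the allow-list itself should hold case-folded names. A small sketch under that assumption; the organization names and the `org_of` helper are hypothetical and not part of the project.

```python
from __future__ import annotations

# Hypothetical allow-list; the project's real trusted_orgs values are not shown here.
TRUSTED_ORGS = frozenset({"example-org", "another-org"})


def is_trusted_org(org: str, trusted_orgs: frozenset[str] = TRUSTED_ORGS) -> bool:
    """Case-insensitive membership test, mirroring ``org.casefold() in trusted_orgs``."""
    return org.casefold() in trusted_orgs


def org_of(repo_id: str) -> str | None:
    """Return the ``org`` part of an ``org/name`` repo id, or None when there is no org."""
    org, sep, _name = repo_id.partition("/")
    return org if sep else None


print(is_trusted_org("Example-Org"))      # input is case-folded before lookup
print(is_trusted_org("unknown-org"))
print(org_of("example-org/some-model"))
```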

target()

Return the local snapshot path when available, otherwise the original input.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def target(self) -> str:
    """Return the local snapshot path when available, otherwise the original input."""
    return str(self.local_path or self.original)

trust_remote_code_for_model(model_name, *, cache_root=None)

Determine whether to trust remote code when loading a model.

Returns True for model identifiers owned by trusted organizations, including configured Hugging Face cache snapshots for those organizations.

Parameters:

Name Type Description Default
model_name str | Path

HuggingFace model identifier or local path.

required
cache_root str | Path | None

Hugging Face Hub cache root. Defaults to the configured hub cache.

None

Returns:

Type Description
bool

Whether to set trust_remote_code=True when loading the model.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def trust_remote_code_for_model(model_name: str | Path, *, cache_root: str | Path | None = None) -> bool:
    """Determine whether to trust remote code when loading a model.

    Returns ``True`` for model identifiers owned by trusted organizations,
    including configured Hugging Face cache snapshots for those organizations.

    Args:
        model_name: HuggingFace model identifier or local path.
        cache_root: Hugging Face Hub cache root. Defaults to the configured hub cache.

    Returns:
        Whether to set ``trust_remote_code=True`` when loading the model.
    """
    return ModelRef.parse(model_name, cache_root=cache_root).trust_remote_code

cleanup_memory()

Run garbage collection and empty the CUDA cache.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def cleanup_memory() -> None:
    """Run garbage collection and empty the CUDA cache."""
    import torch

    gc.collect()
    with torch.no_grad():
        torch.cuda.empty_cache()

gpu_stats()

Log peak reserved GPU memory and total capacity.

Reads the properties of CUDA device 0 and logs the peak memory reserved by the caching allocator on the current device alongside the device's total memory, both in GiB.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def gpu_stats() -> None:
    """Log peak reserved GPU memory and total capacity.

    Reads the properties of CUDA device 0 and logs the peak memory reserved
    by the caching allocator alongside the device's total memory, in GiB.
    """
    import torch

    def round_gb(value: float) -> float:
        return round(value / 1024 / 1024 / 1024, 3)

    gpu_stats = torch.cuda.get_device_properties(0)
    start_gpu_memory = round_gb(torch.cuda.max_memory_reserved())
    max_memory = round_gb(gpu_stats.total_memory)
    # Values are divided by 1024**3, so the unit is GiB rather than GB.
    logger.info(f"{start_gpu_memory} GiB of memory reserved.")
    logger.info(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GiB.")
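
The conversion helper divides by 1024 three times, so the logged figures are binary gibibytes. A short sketch of the same arithmetic, with `round_gib` as an illustrative name, shows why a card marketed in decimal gigabytes reports a smaller number:

```python
def round_gib(value: float) -> float:
    """Convert a byte count to GiB (1024**3 bytes), rounded to three decimals."""
    return round(value / 1024**3, 3)


exact = round_gib(8 * 1024**3)        # 8 GiB expressed in bytes
decimal_80 = round_gib(80 * 10**9)    # 80 decimal GB expressed in bytes

print(exact, decimal_80)
```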

get_max_vram(max_vram_fraction=None)

Calculate maximum memory allocation for each available GPU.

For each device, the result is the smaller of max_vram_fraction and the share of total memory that is still free after setting aside a 2 GiB safety buffer.

Parameters:

Name Type Description Default
max_vram_fraction float | None

Fraction of total GPU memory to allocate. Defaults to 0.8 (80 %).

None

Returns:

Type Description
dict[int, float]

Mapping of CUDA device index to the usable memory fraction. Empty when CUDA is unavailable.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def get_max_vram(max_vram_fraction: float | None = None) -> dict[int, float]:
    """Calculate maximum memory allocation for each available GPU.

    For each device, the result is the smaller of ``max_vram_fraction`` and
    the share of total memory still free after a 2 GiB safety buffer.

    Args:
        max_vram_fraction: Fraction of total GPU memory to allocate.
            Defaults to ``0.8`` (80 %).

    Returns:
        Mapping of CUDA device index to the usable memory fraction.
    """
    import torch

    if max_vram_fraction is None:
        max_vram_fraction = 0.8
    max_memory = {}

    if torch.cuda.is_available():
        num_gpus = torch.cuda.device_count()
        for i in range(num_gpus):
            free, total = torch.cuda.mem_get_info(device=i)
            safe_free = free - (2 * 1024**3)
            gpu_memory_utilization = min(max_vram_fraction, safe_free / total)
            memory_gib = gpu_memory_utilization * total / (1024**3)
            max_memory[i] = gpu_memory_utilization
            # Report the fraction actually applied, which may be below max_vram_fraction.
            logger.info(
                f"GPU {i}: Will allocate {memory_gib:.2f}GiB ({gpu_memory_utilization * 100:.1f}% of {total / (1024**3):.2f}GiB)"
            )

    return max_memory
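
The per-device arithmetic can be checked without a GPU. This sketch lifts the same formula into a pure function; `usable_fraction` and the sample memory figures are illustrative assumptions, not values from the project.

```python
GIB = 1024**3


def usable_fraction(free: int, total: int, max_vram_fraction: float = 0.8, buffer: int = 2 * GIB) -> float:
    """Smaller of the requested fraction and the post-buffer free share of total memory."""
    safe_free = free - buffer
    return min(max_vram_fraction, safe_free / total)


# Mostly idle 80 GiB device: (78 - 2) / 80 = 0.95, so the 0.8 cap applies.
mostly_free = usable_fraction(free=78 * GIB, total=80 * GIB)
# Half-occupied device: (40 - 2) / 80 = 0.475, below the cap.
half_used = usable_fraction(free=40 * GIB, total=80 * GIB)
# Less than the buffer is free: (1 - 2) / 80 is negative.
nearly_full = usable_fraction(free=1 * GIB, total=80 * GIB)

print(mostly_free, half_used, nearly_full)
```

The last case shows that the formula can go negative when less than 2 GiB is free, so callers may want to guard against non-positive fractions before passing them on.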

add_bos_eos_tokens_to_tokenizer(tokenizer)

Enable BOS/EOS token injection and set a pad token if missing.

Mutates tokenizer in-place to set add_bos_token and add_eos_token to True. If no pad token is configured, pad_token_id is set to eos_token_id.

Parameters:

Name Type Description Default
tokenizer PreTrainedTokenizer

The tokenizer to configure.

required

Returns:

Type Description
PreTrainedTokenizer

The same tokenizer instance, modified in-place.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def add_bos_eos_tokens_to_tokenizer(tokenizer: PreTrainedTokenizer) -> PreTrainedTokenizer:
    """Enable BOS/EOS token injection and set a pad token if missing.

    Mutates ``tokenizer`` in-place to set ``add_bos_token`` and
    ``add_eos_token`` to ``True``.  If no pad token is configured,
    ``pad_token_id`` is set to ``eos_token_id``.

    Args:
        tokenizer: The tokenizer to configure.

    Returns:
        The same tokenizer instance, modified in-place.
    """
    tokenizer.add_bos_token = True
    tokenizer.add_eos_token = True
    # Compare against None: a pad token id of 0 is valid but falsy.
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
    return tokenizer
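
A token id of 0 is valid but falsy in Python, so a pad-token check is safest when written against None. The sketch below uses a plain namespace as a stand-in for a tokenizer (an assumption for illustration; it is not a real PreTrainedTokenizer) to show the two cases.

```python
from types import SimpleNamespace


def ensure_pad_token(tokenizer):
    """Fall back to the EOS id only when no pad id is configured at all."""
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
    return tokenizer


# Stand-in objects carrying only the two attributes the check touches.
no_pad = ensure_pad_token(SimpleNamespace(pad_token_id=None, eos_token_id=2))
zero_pad = ensure_pad_token(SimpleNamespace(pad_token_id=0, eos_token_id=2))

print(no_pad.pad_token_id, zero_pad.pad_token_id)
```

With a truthiness test (`if not tokenizer.pad_token_id`), the second tokenizer would have its existing pad id of 0 replaced by the EOS id.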

get_param_from_config(param, default_value=None, model_name=None, trust_remote_code=None, config=None)

Read a single attribute from a HuggingFace AutoConfig.

Either an existing config object or a model_name (used to load one on the fly) must be provided.

Parameters:

Name Type Description Default
param str

Name of the config attribute to retrieve.

required
default_value Any | None

Fallback value when the attribute is absent.

None
model_name str | None

HuggingFace model identifier. Required when config is not supplied.

None
trust_remote_code bool | None

Passed through to AutoConfig.from_pretrained when loading a config.

None
config AutoConfig | None

Pre-loaded AutoConfig. Takes precedence over model_name.

None

Returns:

Type Description
str | None

The attribute value, or default_value if the attribute does not exist on the config.

Raises:

Type Description
ValueError

If neither model_name nor config is provided.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def get_param_from_config(
    param: str,
    default_value: Any | None = None,
    model_name: str | None = None,
    trust_remote_code: bool | None = None,
    config: AutoConfig | None = None,
) -> str | None:
    """Read a single attribute from a HuggingFace ``AutoConfig``.

    Either an existing ``config`` object or a ``model_name`` (used to
    load one on the fly) must be provided.

    Args:
        param: Name of the config attribute to retrieve.
        default_value: Fallback value when the attribute is absent.
        model_name: HuggingFace model identifier.  Required when
            ``config`` is not supplied.
        trust_remote_code: Passed through to
            ``AutoConfig.from_pretrained`` when loading a config.
        config: Pre-loaded ``AutoConfig``.  Takes precedence over
            ``model_name``.

    Returns:
        The attribute value, or ``default_value`` if the attribute does
        not exist on the config.

    Raises:
        ValueError: If neither ``model_name`` nor ``config`` is provided.
    """
    from transformers import AutoConfig

    if config is None:
        if model_name is None:
            raise ValueError("model_name is required if config is not provided")
        config = AutoConfig.from_pretrained(model_name, trust_remote_code=trust_remote_code)

    return getattr(config, param, default_value)
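
Because the lookup is a plain `getattr` with a default, the fallback applies only when the attribute is absent, not when it is present and set to None. A minimal sketch with a namespace standing in for a loaded config; the attribute names are common config fields used here purely as examples.

```python
from types import SimpleNamespace


def read_param(config, param, default_value=None):
    """Same lookup as the helper's final line: attribute value, else the default."""
    return getattr(config, param, default_value)


# Stand-in for a loaded config object (assumption: not a real AutoConfig).
config = SimpleNamespace(max_position_embeddings=4096, rope_scaling=None)

present = read_param(config, "max_position_embeddings", 2048)
absent = read_param(config, "sliding_window", 1024)
explicit_none = read_param(config, "rope_scaling", "linear")

print(present, absent, explicit_none)
```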

get_device_map(model_target, autoconfig=None, revision=None, trust_remote_code=False, local_files_only=False, force_single_device=None)

Infer the device map for a model and optionally pin all layers to one device.

Uses accelerate.infer_auto_device_map on an empty-weight model skeleton to determine layer-to-device assignments.

Parameters:

Name Type Description Default
model_target str

HuggingFace model identifier or local path.

required
autoconfig AutoConfig | None

Pre-loaded AutoConfig. If None, one is loaded from model_target.

None
revision str | None

Model revision (branch, tag, or commit hash).

None
trust_remote_code bool

Whether to trust remote code when loading.

False
local_files_only bool

Restrict loading to local files only.

False
force_single_device int | None

When set, every layer is assigned to this CUDA device index.

None

Returns:

Type Description
str | dict[str, int | str]

Ordered dictionary mapping layer names to device identifiers.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def get_device_map(
    model_target: str,
    autoconfig: AutoConfig | None = None,
    revision: str | None = None,
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    force_single_device: int | None = None,
) -> str | dict[str, int | str]:
    """Infer the device map for a model and optionally pin all layers to one device.

    Uses ``accelerate.infer_auto_device_map`` on an empty-weight model
    skeleton to determine layer-to-device assignments.

    Args:
        model_target: HuggingFace model identifier or local path.
        autoconfig: Pre-loaded ``AutoConfig``.  If ``None``, one is
            loaded from ``model_target``.
        revision: Model revision (branch, tag, or commit hash).
        trust_remote_code: Whether to trust remote code when loading.
        local_files_only: Restrict loading to local files only.
        force_single_device: When set, every layer is assigned to this
            CUDA device index.

    Returns:
        Ordered dictionary mapping layer names to device identifiers.
    """
    from accelerate import infer_auto_device_map, init_empty_weights
    from transformers import AutoConfig, AutoModelForCausalLM

    config = autoconfig or AutoConfig.from_pretrained(
        model_target,
        revision=revision,
        trust_remote_code=trust_remote_code,
        local_files_only=local_files_only,
    )
    # Create an empty model with the configuration
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(config, trust_remote_code=trust_remote_code)
    device_map = infer_auto_device_map(model)
    if force_single_device is not None:
        for key in device_map:
            device_map[key] = force_single_device
    return device_map
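
The effect of force_single_device is easiest to see on a hand-written map. In this sketch the layer names and the inferred placements are made up for illustration, and `pin_to_device` is a hypothetical helper that returns a new mapping rather than mutating in place.

```python
from collections import OrderedDict


def pin_to_device(device_map, device_index):
    """Return a copy of device_map with every layer assigned to device_index."""
    return OrderedDict((layer, device_index) for layer in device_map)


# Hypothetical inferred map: layers spread over two GPUs with a CPU spill.
inferred = OrderedDict(
    [
        ("model.embed_tokens", 0),
        ("model.layers.0", 0),
        ("model.layers.1", 1),
        ("lm_head", "cpu"),
    ]
)
pinned = pin_to_device(inferred, 0)

print(dict(pinned))
```

Pinning overrides any CPU or disk placement that inference chose, so it is only appropriate when the whole model is known to fit on the target device.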

count_trainable_params(model)

Count trainable and total parameters in a PEFT model.

Parameters:

Name Type Description Default
model PeftModel

A PeftModel (or any torch.nn.Module) to inspect.

required

Returns:

Type Description
tuple[int, int]

A tuple of (trainable_params, all_params).

Source code in src/nemo_safe_synthesizer/llm/utils.py
def count_trainable_params(model: PeftModel) -> tuple[int, int]:
    """Count trainable and total parameters in a PEFT model.

    Args:
        model: A ``PeftModel`` (or any ``torch.nn.Module``) to inspect.

    Returns:
        A tuple of ``(trainable_params, all_params)``.
    """
    trainable_params = 0
    all_params = 0
    for _, param in model.named_parameters():
        all_params += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    return trainable_params, all_params
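
The counting loop relies only on `named_parameters()`, `numel()`, and `requires_grad`, so it can be illustrated with lightweight stand-ins. The classes and parameter names below are hypothetical substitutes for a real torch module with adapter weights.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FakeParam:
    """Stand-in for a torch parameter: only numel() and requires_grad are modeled."""

    size: int
    requires_grad: bool

    def numel(self) -> int:
        return self.size


class FakeModel:
    """Stand-in for a module exposing named_parameters()."""

    def __init__(self, params: dict[str, FakeParam]) -> None:
        self._params = params

    def named_parameters(self):
        return iter(self._params.items())


def count_params(model) -> tuple[int, int]:
    """Return (trainable, total) element counts, as in the helper above."""
    trainable = total = 0
    for _, param in model.named_parameters():
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    return trainable, total


model = FakeModel(
    {
        "base.weight": FakeParam(1000, requires_grad=False),   # frozen base weight
        "lora_A.weight": FakeParam(16, requires_grad=True),    # trainable adapter
        "lora_B.weight": FakeParam(16, requires_grad=True),    # trainable adapter
    }
)
trainable, total = count_params(model)
print(f"{trainable}/{total} trainable ({100 * trainable / total:.2f}%)")
```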

get_quantization_config(quantization_bits)

Build a BitsAndBytesConfig for 4-bit or 8-bit quantization.

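The 4-bit configuration uses NF4 (NormalFloat) quantization with double quantization enabled and bfloat16 as the compute dtype. The 8-bit branch passes analogous bnb_8bit_* arguments; BitsAndBytesConfig does not define parameters with those names, so they most likely have no effect and the result is standard 8-bit (LLM.int8) loading.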

Parameters:

Name Type Description Default
quantization_bits Literal[4, 8]

Number of bits — must be 4 or 8.

required

Returns:

Type Description
BitsAndBytesConfig

A BitsAndBytesConfig ready to pass to model loading.

Raises:

Type Description
ValueError

If quantization_bits is not 4 or 8.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def get_quantization_config(quantization_bits: Literal[4, 8]) -> BitsAndBytesConfig:
    """Build a ``BitsAndBytesConfig`` for 4-bit or 8-bit quantization.

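    The 4-bit configuration uses NF4 (NormalFloat) quantization with double
    quantization enabled and ``bfloat16`` as the compute dtype. The 8-bit
    branch passes analogous ``bnb_8bit_*`` arguments, which
    ``BitsAndBytesConfig`` does not define and most likely ignores.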

    Args:
        quantization_bits: Number of bits — must be ``4`` or ``8``.

    Returns:
        A ``BitsAndBytesConfig`` ready to pass to model loading.

    Raises:
        ValueError: If ``quantization_bits`` is not 4 or 8.
    """
    import torch
    from transformers import BitsAndBytesConfig

    if quantization_bits == 4:
        return BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
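    elif quantization_bits == 8:
        # NOTE: BitsAndBytesConfig defines bnb_4bit_* options only. The bnb_8bit_*
        # keyword arguments below fall through to **kwargs and are likely ignored,
        # leaving standard LLM.int8() loading.
        return BitsAndBytesConfig(
            load_in_8bit=True,
            bnb_8bit_quant_type="nf8",
            bnb_8bit_use_double_quant=True,
            bnb_8bit_compute_dtype=torch.bfloat16,
        )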
    else:
        raise ValueError(f"Invalid quantization bits: {quantization_bits}")

get_device_name()

Get the name of the first CUDA device (index 0). Returns 'undefined' if no device is available.

Source code in src/nemo_safe_synthesizer/llm/utils.py
def get_device_name() -> str:
    """Get the name of the first CUDA device (index 0). Returns 'undefined' if no device is available."""
    try:
        import torch

        return torch.cuda.get_device_properties(0).name
    except Exception:
        return "undefined"