# utils

GPU memory management, quantization, device mapping, and tokenizer helpers for LLM loading.
Functions:

| Name | Description |
|---|---|
| `trust_remote_code_for_model` | Determine whether to trust remote code when loading a model. |
| `cleanup_memory` | Run garbage collection and empty the CUDA cache. |
| `gpu_stats` | Log current GPU memory reservation and total capacity. |
| `get_max_vram` | Calculate maximum memory allocation for each available GPU. |
| `add_bos_eos_tokens_to_tokenizer` | Enable BOS/EOS token injection and set a pad token if missing. |
| `get_param_from_config` | Read a single attribute from a HuggingFace `AutoConfig`. |
| `get_device_map` | Infer the device map for a model and optionally pin all layers to one device. |
| `count_trainable_params` | Count trainable and total parameters in a PEFT model. |
| `optimize_for_inference` | Context manager that applies Unsloth inference-time optimizations. |
| `get_quantization_config` | Build a `BitsAndBytesConfig` for 4-bit or 8-bit quantization. |
## trust_remote_code_for_model(model_name)

Determine whether to trust remote code when loading a model.

Returns True only for models whose name starts with "nvidia/".

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `str \| Path` | HuggingFace model identifier or local path. | required |

Returns:

| Type | Description |
|---|---|
| `bool` | Whether to set `trust_remote_code` when loading the model. |

Source code in src/nemo_safe_synthesizer/llm/utils.py
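The documented rule is simple enough to sketch. The body below is an assumption reconstructed from the description above, not the actual source:

```python
from pathlib import Path


def trust_remote_code_for_model(model_name: "str | Path") -> bool:
    # Per the docs, only models under the "nvidia/" namespace
    # require trusting remote code in this codebase.
    return str(model_name).startswith("nvidia/")
```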
## cleanup_memory()

Run garbage collection and empty the CUDA cache.

Source code in src/nemo_safe_synthesizer/llm/utils.py
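A minimal sketch of the two documented steps, assuming the standard `gc` and `torch.cuda` calls (the `torch` import is guarded here so the sketch stays self-contained):

```python
import gc

try:
    import torch  # optional: the CUDA step only applies when torch is present
except ImportError:
    torch = None


def cleanup_memory() -> None:
    # Reclaim Python-level garbage first, then release cached CUDA blocks
    # back to the driver.
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
```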
## gpu_stats()

Log current GPU memory reservation and total capacity.

Queries CUDA device 0 and logs the peak reserved memory and total available memory in GiB.

Source code in src/nemo_safe_synthesizer/llm/utils.py
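The GiB conversion behind the log line can be illustrated with a small formatting helper (a hypothetical name; the real function reads the byte counts from `torch.cuda` directly):

```python
GIB = 1024 ** 3


def format_gpu_stats(reserved_bytes: int, total_bytes: int) -> str:
    # Mirrors the documented log output: peak reserved vs. total, in GiB.
    return (
        f"{reserved_bytes / GIB:.2f} GiB reserved / "
        f"{total_bytes / GIB:.2f} GiB total"
    )
```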
## get_max_vram(max_vram_fraction=None)

Calculate maximum memory allocation for each available GPU.

Reserves a 2 GiB safety buffer on each device, then applies max_vram_fraction to the remaining free memory.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `max_vram_fraction` | `float \| None` | Fraction of total GPU memory to allocate. | `None` |

Returns:

| Type | Description |
|---|---|
| `dict[int, float]` | Mapping of CUDA device index to the usable memory fraction. |

Source code in src/nemo_safe_synthesizer/llm/utils.py
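The buffer-then-fraction arithmetic can be sketched against caller-supplied totals. The real helper reads device totals from `torch.cuda`; the function name and signature here are illustrative:

```python
GIB = 1024 ** 3
SAFETY_BUFFER = 2 * GIB  # per-device reserve described above


def max_vram_fractions(
    totals: "dict[int, int]", max_vram_fraction: float = 1.0
) -> "dict[int, float]":
    # For each device: subtract the 2 GiB buffer, scale what remains by
    # max_vram_fraction, and express the result as a fraction of the total.
    out = {}
    for idx, total in totals.items():
        usable = max(total - SAFETY_BUFFER, 0) * max_vram_fraction
        out[idx] = usable / total
    return out
```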
## add_bos_eos_tokens_to_tokenizer(tokenizer)

Enable BOS/EOS token injection and set a pad token if missing.

Mutates tokenizer in-place to set add_bos_token and add_eos_token to True. If no pad token is configured, pad_token_id is set to eos_token_id.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tokenizer` | `PreTrainedTokenizer` | The tokenizer to configure. | required |

Returns:

| Type | Description |
|---|---|
| `PreTrainedTokenizer` | The same tokenizer instance, modified in-place. |

Source code in src/nemo_safe_synthesizer/llm/utils.py
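The in-place mutation is straightforward to sketch from the description; `tokenizer` can be any object exposing the listed attributes:

```python
def add_bos_eos_tokens_to_tokenizer(tokenizer):
    # Enable BOS/EOS injection on encode.
    tokenizer.add_bos_token = True
    tokenizer.add_eos_token = True
    # Fall back to EOS as the padding token when none is configured.
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
    return tokenizer
```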
## get_param_from_config(param, default_value=None, model_name=None, trust_remote_code=None, config=None)

Read a single attribute from a HuggingFace AutoConfig.

Either an existing config object or a model_name (used to load one on the fly) must be provided.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `param` | `str` | Name of the config attribute to retrieve. | required |
| `default_value` | `Any \| None` | Fallback value when the attribute is absent. | `None` |
| `model_name` | `str \| None` | HuggingFace model identifier. Required when `config` is not provided. | `None` |
| `trust_remote_code` | `bool \| None` | Passed through to `AutoConfig.from_pretrained`. | `None` |
| `config` | `AutoConfig \| None` | Pre-loaded `AutoConfig` instance. | `None` |

Returns:

| Type | Description |
|---|---|
| `str \| None` | The attribute value, or `default_value` if the attribute does not exist on the config. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If neither `config` nor `model_name` is provided. |

Source code in src/nemo_safe_synthesizer/llm/utils.py
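A sketch consistent with the documented behavior. The `AutoConfig.from_pretrained` call is an assumption based on "load one on the fly"; the test path below uses only a pre-built config object:

```python
def get_param_from_config(param, default_value=None, model_name=None,
                          trust_remote_code=None, config=None):
    if config is None:
        if model_name is None:
            raise ValueError("Provide either `config` or `model_name`.")
        # Assumed loading path: fetch the config on the fly from the Hub.
        from transformers import AutoConfig
        config = AutoConfig.from_pretrained(
            model_name, trust_remote_code=trust_remote_code
        )
    # Missing attributes fall back to the supplied default.
    return getattr(config, param, default_value)
```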
## get_device_map(model_target, autoconfig=None, revision=None, trust_remote_code=False, local_files_only=False, force_single_device=None)

Infer the device map for a model and optionally pin all layers to one device.

Uses accelerate.infer_auto_device_map on an empty-weight model skeleton to determine layer-to-device assignments.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_target` | `str` | HuggingFace model identifier or local path. | required |
| `autoconfig` | `AutoConfig \| None` | Pre-loaded `AutoConfig` instance. | `None` |
| `revision` | `str \| None` | Model revision (branch, tag, or commit hash). | `None` |
| `trust_remote_code` | `bool` | Whether to trust remote code when loading. | `False` |
| `local_files_only` | `bool` | Restrict loading to local files only. | `False` |
| `force_single_device` | `int \| None` | When set, every layer is assigned to this CUDA device index. | `None` |

Returns:

| Type | Description |
|---|---|
| | Ordered dictionary mapping layer names to device identifiers. |

Source code in src/nemo_safe_synthesizer/llm/utils.py
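The `force_single_device` branch amounts to overwriting every assignment in the map that `accelerate.infer_auto_device_map` returned. A pinning helper (hypothetical name, split out for illustration) might look like:

```python
def pin_device_map(device_map: dict, device_index: int) -> dict:
    # Overwrite every layer assignment with a single CUDA device index,
    # as the force_single_device parameter is documented to do.
    return {layer: device_index for layer in device_map}
```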
## count_trainable_params(model)

Count trainable and total parameters in a PEFT model.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `PeftModel` | A `PeftModel` instance to inspect. | required |

Returns:

| Type | Description |
|---|---|
| `tuple[int, int]` | A tuple of `(trainable_params, total_params)`. |

Source code in src/nemo_safe_synthesizer/llm/utils.py
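A standard counting loop consistent with the signature; the real helper may use PEFT's own accounting, but the result is the same for any torch-style model exposing `.parameters()`:

```python
def count_trainable_params(model) -> "tuple[int, int]":
    # PEFT marks frozen base weights with requires_grad=False, so the
    # trainable count reflects only the adapter parameters.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total
```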
## optimize_for_inference(model)

Context manager that applies Unsloth inference-time optimizations.

On enter, switches the model to inference mode via FastLanguageModel.for_inference. On exit, reverts to training mode. If CUDA is unavailable or the model is not a FastLanguageModel, yields immediately without modification.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `Union[FastLanguageModel, AutoModelForCausalLM]` | The language model to optimize. Must be an Unsloth `FastLanguageModel` for the optimizations to apply. | required |

Yields:

| Type | Description |
|---|---|
| `None` | None |

Source code in src/nemo_safe_synthesizer/llm/utils.py
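The enter/exit behavior can be sketched with `contextlib.contextmanager`. The eligibility check here is duck-typed on `for_inference`/`for_training` hooks (the real function tests for CUDA and the Unsloth class, and calls `FastLanguageModel.for_inference`):

```python
from contextlib import contextmanager


@contextmanager
def optimize_for_inference(model, cuda_available: bool = True):
    # Ineligible models pass through untouched, as documented.
    can_toggle = cuda_available and hasattr(model, "for_inference")
    if can_toggle:
        model.for_inference()
    try:
        yield
    finally:
        if can_toggle:
            model.for_training()  # always revert, even if the body raised
```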
## get_quantization_config(quantization_bits)

Build a BitsAndBytesConfig for 4-bit or 8-bit quantization.

Both configurations use NormalFloat quantization with double quantization enabled and bfloat16 as the compute dtype.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `quantization_bits` | `Literal[4, 8]` | Number of bits; must be `4` or `8`. | required |

Returns:

| Type | Description |
|---|---|
| `BitsAndBytesConfig` | A configured `BitsAndBytesConfig` instance. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `quantization_bits` is not `4` or `8`. |
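The keyword arguments such a config might carry can be sketched without importing transformers. This is a hypothetical helper, not the actual source: the 4-bit kwargs follow the NormalFloat/double-quantization description above, the dtype is a string stand-in for `torch.bfloat16`, and the 8-bit branch is a plain assumption:

```python
def quantization_kwargs(quantization_bits: int) -> dict:
    # Kwargs one might pass to transformers.BitsAndBytesConfig.
    if quantization_bits not in (4, 8):
        raise ValueError(
            f"quantization_bits must be 4 or 8, got {quantization_bits}"
        )
    if quantization_bits == 4:
        return {
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",         # NormalFloat quantization
            "bnb_4bit_use_double_quant": True,
            "bnb_4bit_compute_dtype": "bfloat16", # stand-in for torch.bfloat16
        }
    return {"load_in_8bit": True}
```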