metadata

Model-family metadata for prompt formatting, RoPE scaling, and runtime bookkeeping.

Provides ModelMetadata and its per-family subclasses (Llama32, Mistral, Qwen, etc.) that capture prompt templates, special-token settings, and context-window configuration. The RopeScaling model handles context-window extension via Rotary Position Embeddings.

A global maximum sequence length (GLOBAL_MAX_SEQ_LENGTH = 2048 * 6 = 12288) is applied as a safety cap to guard against out-of-memory (OOM) errors and underfitting.

Classes:

Name Description
LLMPromptConfig

Prompt template and special-token settings.

RopeScaling

RoPE scaling parameters for context-window extension.

ModelMetadata

Base container for model-family-specific metadata.

Granite

IBM Granite family metadata.

Llama32

Meta Llama 3.2 family metadata.

Mistral

Mistral AI family metadata.

Nemotron

NVIDIA Nemotron family metadata.

Qwen

Alibaba Qwen family metadata.

SmolLM2

HuggingFace SmolLM2 family metadata.

SmolLM3

HuggingFace SmolLM3 family metadata.

TinyLlama

TinyLlama family metadata.

Functions:

Name Description
resolve_rope_scaling_factor

Normalize a rope-scaling specification into a RopeScaling or None.

get_base_max_seq_length

Derive the base max sequence length from a model config.

LLMPromptConfig pydantic-model

Bases: BaseModel

Prompt template and special-token settings for an LLM.

Holds the Jinja-style prompt template together with flags and token values that control how BOS/EOS markers are injected during training and inference.

Fields:

template pydantic-field

Prompt template with {instruction}, {schema}, and {prefill} placeholders.

  • {instruction} -- task directive telling the model what to generate (e.g. "Generate a JSONL dataset with the following columns: ").
  • {schema} -- column schema fragment listing expected output fields, typically formatted as "col":<unk>,"col2":<unk>.
  • {prefill} -- optional text injected at the start of the model's response to steer generation, currently used for time series data.

add_bos_token_to_prompt pydantic-field

Whether to prepend the BOS token to the prompt.

add_eos_token_to_prompt pydantic-field

Whether to append the EOS token to the prompt.

bos_token pydantic-field

Beginning-of-sequence token string.

bos_token_id pydantic-field

Integer id for the BOS token.

eos_token pydantic-field

End-of-sequence token string.

eos_token_id pydantic-field

Integer id for the EOS token.

from_tokenizer(name, tokenizer=None, **kwargs) classmethod

Create a prompt config by reading settings from a tokenizer.

If no tokenizer is supplied one is loaded from name via AutoTokenizer.from_pretrained. Individual fields can be overridden through **kwargs (e.g. bos_token, template).

Parameters:

Name Type Description Default
name str

HuggingFace model identifier used to load the tokenizer when tokenizer is None.

required
tokenizer AutoTokenizer | None

Optional pre-loaded tokenizer instance.

None
**kwargs

Overrides for any LLMPromptConfig field.

{}

Returns:

Type Description
LLMPromptConfig

A new LLMPromptConfig populated from the tokenizer.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
@classmethod
def from_tokenizer(cls, name: str, tokenizer: AutoTokenizer | None = None, **kwargs) -> LLMPromptConfig:
    """Create a prompt config by reading from settings of a tokenizer.

    If no ``tokenizer`` is supplied one is loaded from ``name``
    via ``AutoTokenizer.from_pretrained``.  Individual fields can
    be overridden through ``**kwargs`` (e.g. ``bos_token``,
    ``template``).

    Args:
        name: HuggingFace model identifier used to load the
            tokenizer when ``tokenizer`` is ``None``.
        tokenizer: Optional pre-loaded tokenizer instance.
        **kwargs: Overrides for any ``LLMPromptConfig`` field.

    Returns:
        A new ``LLMPromptConfig`` populated from the tokenizer.
    """
    tokenizer = tokenizer or AutoTokenizer.from_pretrained(name)
    bos_token = kwargs.get("bos_token", getattr(tokenizer, "bos_token", None))
    bos_token_id = kwargs.get("bos_token_id", getattr(tokenizer, "bos_token_id", None))
    eos_token = kwargs.get("eos_token", getattr(tokenizer, "eos_token", None))
    eos_token_id = kwargs.get("eos_token_id", getattr(tokenizer, "eos_token_id", None))
    template = kwargs.get("template", PROMPT_TEMPLATE)
    add_bos_token_to_prompt = kwargs.get("add_bos_token_to_prompt", True)
    add_eos_token_to_prompt = kwargs.get("add_eos_token_to_prompt", True)

    pc = {
        "template": template,
        "add_bos_token_to_prompt": add_bos_token_to_prompt,
        "add_eos_token_to_prompt": add_eos_token_to_prompt,
        "bos_token": bos_token,
        "bos_token_id": bos_token_id,
        "eos_token": eos_token,
        "eos_token_id": eos_token_id,
    }

    return cls(**pc)
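The precedence applied above (explicit keyword overrides win, otherwise fall back to the tokenizer's attributes) can be sketched without loading a real tokenizer. The stub object and `read_token_settings` helper below are hypothetical stand-ins for illustration only:

```python
from types import SimpleNamespace

# Hypothetical stand-in for a loaded tokenizer; only the attributes
# that from_tokenizer reads are stubbed here.
tokenizer = SimpleNamespace(
    bos_token="<s>", bos_token_id=1,
    eos_token="</s>", eos_token_id=2,
)

def read_token_settings(tokenizer, **overrides):
    """Mirror from_tokenizer's precedence: an explicit kwarg wins,
    otherwise fall back to the tokenizer attribute (or None)."""
    fields = ("bos_token", "bos_token_id", "eos_token", "eos_token_id")
    return {f: overrides.get(f, getattr(tokenizer, f, None)) for f in fields}

# Tokenizer values are used unless overridden per-field:
settings = read_token_settings(tokenizer, eos_token="<|im_end|>")
```

Here `settings["bos_token"]` comes from the stub tokenizer while `eos_token` reflects the override, matching how `**kwargs` interacts with tokenizer attributes in the real method.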

RopeScaling pydantic-model

Bases: BaseModel

Rotary Position Embedding (RoPE) scaling configuration.

Encapsulates the parameters needed to extend a model's context window via RoPE scaling. Will be superseded by RotaryEmbeddingConfigMixin when available in transformers v5.

Fields:

  • rope_type (Literal['linear', 'dynamic', 'default', 'yarn', 'llama3'])
  • factor (float)
  • theta (float)

Validators:

rope_type = 'default' pydantic-field

Scaling algorithm: linear, dynamic, default, yarn, or llama3.

factor = 1.0 pydantic-field

Multiplier for RoPE scaling to extend the context window; values above MAX_ROPE_SCALING_FACTOR are clamped.

theta = 10000.0 pydantic-field

Base frequency (theta) used by the rotary embeddings.

validate_factor(v) pydantic-validator

Clamp factor to MAX_ROPE_SCALING_FACTOR and warn if exceeded.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
@field_validator("factor", mode="after")
@classmethod
def validate_factor(cls, v: float | int | None) -> float | int | None:
    """Clamp ``factor`` to ``MAX_ROPE_SCALING_FACTOR`` and warn if exceeded."""
    if v is None or v <= MAX_ROPE_SCALING_FACTOR:
        return v
    logger.warning(
        f"Rope scaling factor {v} is greater than MAX_ROPE_SCALING_FACTOR: {MAX_ROPE_SCALING_FACTOR}, setting to {MAX_ROPE_SCALING_FACTOR}"
    )
    return MAX_ROPE_SCALING_FACTOR
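The clamping behavior is pass-through for `None` and in-range values, and caps anything larger. A minimal sketch, assuming an illustrative cap value (the real `MAX_ROPE_SCALING_FACTOR` is defined elsewhere in the module):

```python
MAX_ROPE_SCALING_FACTOR = 8.0  # assumed value, for illustration only

def clamp_factor(v):
    """Mirror validate_factor: pass through None and in-range values,
    clamp anything above the cap (the real validator also warns)."""
    if v is None or v <= MAX_ROPE_SCALING_FACTOR:
        return v
    return MAX_ROPE_SCALING_FACTOR
```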

from_autoconfig(config, factor=None) classmethod

Create a RopeScaling from a HuggingFace PretrainedConfig.

Reads the model's native rope_theta and rope_type and optionally overrides the scaling factor.

Parameters:

Name Type Description Default
config PretrainedConfig

A loaded HuggingFace model config.

required
factor float | int | None

Scaling factor override. Defaults to 1.0.

None

Returns:

Type Description
'RopeScaling'

A RopeScaling populated from the config.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
@classmethod
def from_autoconfig(cls, config: PretrainedConfig, factor: float | int | None = None) -> "RopeScaling":
    """Create a ``RopeScaling`` from a HuggingFace ``PretrainedConfig``.

    Reads the model's native ``rope_theta`` and ``rope_type`` and
    optionally overrides the scaling ``factor``.

    Args:
        config: A loaded HuggingFace model config.
        factor: Scaling factor override.  Defaults to ``1.0``.

    Returns:
        A ``RopeScaling`` populated from the config.
    """
    # Try to get theta from config (different models use different attribute names)
    theta = getattr(config, "rope_theta", None) or 10000.0

    # Try to get rope_type from config
    rope_type = getattr(config, "rope_scaling", {})
    if isinstance(rope_type, dict):
        rope_type = rope_type.get("rope_type", "default")
    else:
        rope_type = "default"

    return cls(
        rope_type=rope_type,
        factor=factor or 1.0,
        theta=theta,
    )

to_hf_dict()

Convert to the HuggingFace rope_scaling dict format.

Returns None when factor is 1.0 (no scaling).

Returns:

Type Description
dict | None

A dict with keys rope_type, factor, and theta, or None when no scaling is configured.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def to_hf_dict(self) -> dict | None:
    """Convert to the HuggingFace ``rope_scaling`` dict format.

    Returns ``None`` when ``factor`` is ``1.0`` (no scaling).

    Returns:
        A dict with keys ``rope_type``, ``factor``, and ``theta``,
        or ``None``.
    """
    if self.factor == 1.0:
        return None
    return {
        "rope_type": self.rope_type,
        "factor": self.factor,
        "theta": self.theta,
    }
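The round-trip to HuggingFace's `rope_scaling` format can be illustrated with a minimal stand-in for the Pydantic model (the dataclass below is a sketch, not the real `RopeScaling` class):

```python
from dataclasses import dataclass

@dataclass
class RopeScalingSketch:
    """Minimal stand-in for RopeScaling, enough to show to_hf_dict."""
    rope_type: str = "default"
    factor: float = 1.0
    theta: float = 10000.0

    def to_hf_dict(self):
        # factor == 1.0 means no scaling, so rope_scaling is omitted
        # entirely (HuggingFace treats a missing dict as "no scaling").
        if self.factor == 1.0:
            return None
        return {"rope_type": self.rope_type, "factor": self.factor, "theta": self.theta}
```

Returning `None` rather than a dict with `factor=1.0` keeps the exported config identical to an unscaled model's config.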

ModelMetadata pydantic-model

Bases: BaseModel

Base container for model-family-specific metadata.

Stores prompt formats, special tokens, RoPE scaling parameters, and runtime bookkeeping needed to load, fine-tune, and generate with a given LLM family. Each supported model family has a concrete subclass (e.g. Llama32, Mistral) that sets the correct defaults.

Use the factory methods from_str_or_path, from_config, or from_metadata_json to construct instances rather than calling the constructor directly.

Config:

  • arbitrary_types_allowed: True

Fields:

Validators:

model_name_or_path pydantic-field

HuggingFace model identifier or local path.

prompt_config pydantic-field

Prompt template and token settings.

autoconfig pydantic-field

HuggingFace PretrainedConfig (excluded from serialization).

base_max_seq_length = None pydantic-field

The base model's supported context window, before adjustment by the RoPE scaling factor.

rope_scaling = None pydantic-field

RoPE scaling configuration for context window extension. Accepts a RopeScaling instance, a dict of RopeScaling fields, a numeric scale factor (requires autoconfig), or None.

max_sequences_per_example = None pydantic-field

Cap on sequences packed into one training example.

Resolved by AutoConfigResolver to 1 when DP is enabled, 10 when DP is disabled and set to "auto", or a user-supplied integer.
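The resolution rule described above can be sketched as a small function. This is an assumed reading of `AutoConfigResolver`'s behavior, not its actual implementation:

```python
def resolve_max_sequences_per_example(value, dp_enabled):
    """Assumed AutoConfigResolver rule: DP training forces 1 sequence
    per example, "auto" resolves to 10 without DP, and an explicit
    integer passes through."""
    if dp_enabled:
        return 1
    if value == "auto":
        return 10
    return int(value)
```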

workdir = None pydantic-field

Artifact directory layout.

is_adapter = False pydantic-field

Whether an adapter checkpoint is loaded.

instruction = DEFAULT_INSTRUCTION pydantic-field

Default system instruction text.

rope_parameters_location = 'automodel' pydantic-field

Where to read RoPE parameters from: autoconfig or automodel.

initial_prefill = None pydantic-field

Currently used for time series data. May be a single string or a per-column dict.

adapter_path property

The path where adapter model files are stored.

Raises:

Type Description
ValueError

If workdir is not set.

metadata_path property

The path to the metadata JSON file.

Uses workdir.metadata_file which automatically resolves to the parent workdir's path when resuming for generation.

Raises:

Type Description
ValueError

If workdir is not set.

rope_scaling_factor property

The RoPE scaling factor, exposed for backwards compatibility.

max_seq_length property

Actual context window for training.

Includes any adjustment for rope_scaling.factor.
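Assuming the adjustment multiplies the base window by the scaling factor (the property's exact implementation is not shown here), the derivation looks like:

```python
def effective_max_seq_length(base_max_seq_length, rope_scaling=None):
    """Assumed derivation of max_seq_length: the base context window
    scaled by the RoPE factor when scaling is configured."""
    if rope_scaling is None:
        return base_max_seq_length
    return int(base_max_seq_length * rope_scaling["factor"])
```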

populate_derived_fields(data) pydantic-validator

Auto-populate autoconfig, rope_scaling, and base_max_seq_length.

Called by Pydantic before field validation. Loads an AutoConfig from model_name_or_path when one is not already present, derives base_max_seq_length from that config, and resolves the rope_scaling specification into a RopeScaling instance (or None).

Parameters:

Name Type Description Default
data dict

Raw field values dict supplied to the constructor.

required

Returns:

Type Description
dict

The mutated data dict with derived fields populated.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
@model_validator(mode="before")
@classmethod
def populate_derived_fields(cls, data: dict) -> dict:
    """Auto-populate ``autoconfig``, ``rope_scaling``, and ``base_max_seq_length``.

    Called by Pydantic before field validation.  Loads an
    ``AutoConfig`` from ``model_name_or_path`` when one is not
    already present, derives ``base_max_seq_length`` from that
    config, and resolves the ``rope_scaling`` specification into a
    ``RopeScaling`` instance (or ``None``).

    Args:
        data: Raw field values dict supplied to the constructor.

    Returns:
        The mutated ``data`` dict with derived fields populated.
    """
    if data.get("autoconfig") is None:
        data["autoconfig"] = AutoConfig.from_pretrained(data["model_name_or_path"])

    if data.get("base_max_seq_length") is None:
        data["base_max_seq_length"] = get_base_max_seq_length(data["autoconfig"])

    rsf = data.get("rope_scaling")
    data["rope_scaling"] = resolve_rope_scaling_factor(rsf, data["autoconfig"])

    return data

serialize_autoconfig(config)

Serialize PretrainedConfig to a plain dict for JSON export.

Parameters:

Name Type Description Default
config PretrainedConfig

The HuggingFace config to serialize.

required

Returns:

Type Description
dict

Dict representation of the config.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
@field_serializer("autoconfig")
def serialize_autoconfig(self, config: PretrainedConfig) -> dict:
    """Serialize ``PretrainedConfig`` to a plain dict for JSON export.

    Args:
        config: The HuggingFace config to serialize.

    Returns:
        Dict representation of the config.
    """
    return config.to_dict()

save_metadata()

Save model metadata to JSON file.

Raises:

Type Description
ValueError

If workdir is not set.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def save_metadata(self) -> None:
    """Save model metadata to JSON file.

    Raises:
        ValueError: If workdir is not set.
    """
    if self.workdir is None:
        raise ValueError("Cannot save metadata: workdir is not set")
    write_json(
        self.model_dump(mode="json"),
        path=self.workdir.train.adapter.metadata,
        indent=4,
    )

from_str_or_path(model_name_or_path, **kwargs) classmethod

Instantiate the correct ModelMetadata subclass from a model name or path.

Performs case-insensitive substring matching of each registered subclass name against model_name_or_path.

Parameters:

Name Type Description Default
model_name_or_path Path | str

HuggingFace model identifier or local filesystem path.

required
**kwargs

Forwarded to the matched subclass constructor.

{}

Returns:

Type Description
ModelMetadata

An instance of the matched ModelMetadata subclass.

Raises:

Type Description
ValueError

If no registered subclass matches.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
@classmethod
def from_str_or_path(cls: type["ModelMetadata"], model_name_or_path: Path | str, **kwargs) -> ModelMetadata:
    """Instantiate the correct ``ModelMetadata`` subclass from a model name or path.

    Performs case-insensitive substring matching of each registered
    subclass name against ``model_name_or_path``.

    Args:
        model_name_or_path: HuggingFace model identifier or local
            filesystem path.
        **kwargs: Forwarded to the matched subclass constructor.

    Returns:
        An instance of the matched ``ModelMetadata`` subclass.

    Raises:
        ValueError: If no registered subclass matches.
    """
    return cls._resolve_model_class(model_name_or_path)(model_name_or_path=str(model_name_or_path), **kwargs)
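The matching strategy can be sketched with a hand-rolled registry. The `FAMILIES` list and `resolve_family` function below are hypothetical; the real `_resolve_model_class` walks the registered `ModelMetadata` subclasses and may normalize names differently (e.g. to match "Llama32" against "Llama-3.2" identifiers):

```python
# Hypothetical registry; the real code discovers ModelMetadata subclasses.
FAMILIES = ["Granite", "Llama32", "Mistral", "Nemotron", "Qwen",
            "SmolLM2", "SmolLM3", "TinyLlama"]

def resolve_family(model_name_or_path):
    """Case-insensitive substring match of each family name against
    the model identifier, as from_str_or_path does."""
    name = str(model_name_or_path).lower()
    for family in FAMILIES:
        if family.lower() in name:
            return family
    raise ValueError(f"No registered model family matches {model_name_or_path!r}")
```

For example, an identifier like "Qwen/Qwen2.5-0.5B" resolves to the Qwen family because "qwen" is a substring of the lowercased identifier.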

from_config(config, workdir=None) classmethod

Create ModelMetadata from SafeSynthesizerParameters.

The config should have been resolved with AutoConfigResolver before calling this method.

If rope_scaling_factor is set, a RopeScaling object is created with the model's native theta. max_sequences_per_example is always forwarded from config.data -- AutoConfigResolver resolves it to 1 when DP is enabled, 10 when set to "auto" with DP disabled, or the user-supplied integer.

Parameters:

Name Type Description Default
config SafeSynthesizerParameters

Resolved parameters with model and training configuration.

required
workdir Workdir | None

Artifact directory layout. Required for saving model artifacts.

None

Returns:

Type Description
ModelMetadata

A ModelMetadata subclass instance matching the configured pretrained model.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
@classmethod
def from_config(
    cls: type["ModelMetadata"],
    config: SafeSynthesizerParameters,
    workdir: Workdir | None = None,
) -> ModelMetadata:
    """Create ``ModelMetadata`` from ``SafeSynthesizerParameters``.

    The *config* should have been resolved with
    ``AutoConfigResolver`` before calling this method.

    If ``rope_scaling_factor`` is set, a ``RopeScaling`` object is
    created with the model's native theta.
    ``max_sequences_per_example`` is always forwarded from
    ``config.data`` -- ``AutoConfigResolver`` resolves it to ``1``
    when DP is enabled, ``10`` when set to ``"auto"`` with DP
    disabled, or the user-supplied integer.

    Args:
        config: Resolved parameters with model and training
            configuration.
        workdir: Artifact directory layout.  Required for saving
            model artifacts.

    Returns:
        A ``ModelMetadata`` subclass instance matching the
        configured pretrained model.
    """
    kwargs: dict = {"workdir": workdir}

    if config.training.rope_scaling_factor is not None and config.training.rope_scaling_factor != "auto":
        # Pass the factor; the subclass will create the RopeScaling with proper theta
        kwargs["rope_scaling_factor"] = config.training.rope_scaling_factor

    # Pass max_sequences_per_example from data config - critical for DP training
    kwargs["max_sequences_per_example"] = config.data.max_sequences_per_example

    return ModelMetadata.from_str_or_path(config.training.pretrained_model, **kwargs)

from_metadata_json(path, workdir=None) classmethod

Load ModelMetadata from a saved JSON file.

Parameters:

Name Type Description Default
path Path | str

Path to the metadata JSON file.

required
workdir Workdir | None

Workdir instance for artifact paths. When omitted, the loaded metadata's workdir remains None.

None

Returns:

Type Description
ModelMetadata

ModelMetadata instance with the loaded configuration.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
@classmethod
def from_metadata_json(
    cls: type["ModelMetadata"],
    path: Path | str,
    workdir: Workdir | None = None,
) -> ModelMetadata:
    """Load ModelMetadata from a saved JSON file.

    Args:
        path: Path to the metadata JSON file.
        workdir: Workdir instance for artifact paths. When omitted, the
            loaded metadata's workdir remains None.

    Returns:
        ModelMetadata instance with the loaded configuration.
    """
    path = Path(path).resolve()
    kwargs = load_json(path)
    if workdir is not None:
        kwargs["workdir"] = workdir
    return cls(**kwargs)

Granite(model_name_or_path, tokenizer=None, rope_scaling_factor=None, **kwargs) pydantic-model

Bases: ModelMetadata

Metadata for IBM Granite model family.

Parameters:

Name Type Description Default
model_name_or_path str

HuggingFace model identifier or local path.

required
tokenizer

Optional pre-loaded tokenizer.

None
rope_scaling_factor float | None

Optional RoPE scaling factor.

None
**kwargs

Forwarded to ModelMetadata.

{}

Fields:

Validators:

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def __init__(
    self,
    model_name_or_path: str,
    tokenizer=None,
    rope_scaling_factor: float | None = None,
    **kwargs,
) -> None:
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) if tokenizer is None else tokenizer
    config: PretrainedConfig = AutoConfig.from_pretrained(model_name_or_path)

    super().__init__(
        autoconfig=config,
        instruction=DEFAULT_INSTRUCTION,
        prompt_config=LLMPromptConfig.from_tokenizer(
            name=model_name_or_path,
            template="user\n {instruction} {schema} \n assistant\n{prefill}",
            add_bos_token_to_prompt=False,
            add_eos_token_to_prompt=True,
        ),
        model_name_or_path=model_name_or_path,
        rope_scaling=rope_scaling_factor,
        rope_parameters_location="autoconfig",
        **kwargs,
    )

Llama32(model_name_or_path, tokenizer=None, rope_scaling_factor=None, **kwargs) pydantic-model

Bases: ModelMetadata

Metadata for Meta Llama 3.2 model family.

Uses <|im_start|> (id 151644) as the BOS token and disables automatic BOS/EOS injection in prompts.

Parameters:

Name Type Description Default
model_name_or_path str

HuggingFace model identifier or local path.

required
tokenizer

Optional pre-loaded tokenizer.

None
rope_scaling_factor float | None

Optional RoPE scaling factor.

None
**kwargs

Forwarded to ModelMetadata.

{}

Fields:

Validators:

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def __init__(
    self,
    model_name_or_path: str,
    tokenizer=None,
    rope_scaling_factor: float | None = None,
    **kwargs,
) -> None:
    config: PretrainedConfig = AutoConfig.from_pretrained(model_name_or_path)
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) if tokenizer is None else tokenizer

    super().__init__(
        autoconfig=config,
        instruction=DEFAULT_INSTRUCTION,
        prompt_config=LLMPromptConfig.from_tokenizer(
            name=model_name_or_path,
            template="user\n {instruction} {schema} \n assistant\n{prefill}",
            bos_token="<|im_start|>",
            bos_token_id=151644,
            add_bos_token_to_prompt=False,
            add_eos_token_to_prompt=False,
        ),
        model_name_or_path=model_name_or_path,
        rope_scaling=rope_scaling_factor,
        rope_parameters_location="autoconfig",
        **kwargs,
    )

Mistral(model_name_or_path, tokenizer=None, rope_scaling_factor=None, **kwargs) pydantic-model

Bases: ModelMetadata

Metadata for Mistral AI model family.

RoPE scaling is not supported for Mistral models. Any supplied rope_scaling_factor will be ignored with a warning.

Parameters:

Name Type Description Default
model_name_or_path str

HuggingFace model identifier or local path.

required
tokenizer AutoTokenizer | None

Optional pre-loaded tokenizer.

None
rope_scaling_factor float | None

Ignored with a warning if provided.

None
**kwargs

Forwarded to ModelMetadata.

{}

Fields:

Validators:

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def __init__(
    self,
    model_name_or_path: str,
    tokenizer: AutoTokenizer | None = None,
    rope_scaling_factor: float | None = None,
    **kwargs,
) -> None:
    tokenizer: AutoTokenizer = AutoTokenizer.from_pretrained(model_name_or_path) if tokenizer is None else tokenizer
    config: PretrainedConfig = AutoConfig.from_pretrained(model_name_or_path)
    if rope_scaling_factor:
        logger.warning(
            f"Rope scaling factor {rope_scaling_factor} is not supported for Mistral due to longer default context lengths. Ignoring."
        )

    template = "[INST] {instruction} \n\n {schema} [/INST]{prefill}"
    super().__init__(
        autoconfig=config,
        instruction=DEFAULT_INSTRUCTION,
        prompt_config=LLMPromptConfig.from_tokenizer(
            name=model_name_or_path,
            template=template,
            add_bos_token_to_prompt=True,
            add_eos_token_to_prompt=True,
        ),
        model_name_or_path=model_name_or_path,
        rope_scaling=None,
        rope_parameters_location="autoconfig",
        **kwargs,
    )

Nemotron(model_name_or_path, tokenizer=None, rope_scaling_factor=None, **kwargs) pydantic-model

Bases: ModelMetadata

Metadata for NVIDIA Nemotron model family.

Parameters:

Name Type Description Default
model_name_or_path str

HuggingFace model identifier or local path.

required
tokenizer

Optional pre-loaded tokenizer.

None
rope_scaling_factor float | None

Optional RoPE scaling factor.

None
**kwargs

Forwarded to ModelMetadata.

{}

Fields:

Validators:

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def __init__(
    self,
    model_name_or_path: str,
    tokenizer=None,
    rope_scaling_factor: float | None = None,
    **kwargs,
) -> None:
    tokenizer: AutoTokenizer = AutoTokenizer.from_pretrained(model_name_or_path) if tokenizer is None else tokenizer
    config: PretrainedConfig = AutoConfig.from_pretrained(model_name_or_path)

    super().__init__(
        autoconfig=config,
        instruction=DEFAULT_INSTRUCTION,
        prompt_config=LLMPromptConfig.from_tokenizer(
            template="[INST] {instruction} \n\n {schema} [/INST]{prefill}",
            add_bos_token_to_prompt=True,
            add_eos_token_to_prompt=True,
            tokenizer=tokenizer,
            name=model_name_or_path,
        ),
        model_name_or_path=model_name_or_path,
        rope_scaling=rope_scaling_factor,
        rope_parameters_location="autoconfig",
        **kwargs,
    )

Qwen(model_name_or_path, tokenizer=None, rope_scaling_factor=None, **kwargs) pydantic-model

Bases: ModelMetadata

Metadata for Alibaba Qwen model family.

Parameters:

Name Type Description Default
model_name_or_path str

HuggingFace model identifier or local path.

required
tokenizer

Optional pre-loaded tokenizer.

None
rope_scaling_factor float | None

Optional RoPE scaling factor.

None
**kwargs

Forwarded to ModelMetadata.

{}

Fields:

Validators:

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def __init__(
    self,
    model_name_or_path: str,
    tokenizer=None,
    rope_scaling_factor: float | None = None,
    **kwargs,
) -> None:
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) if tokenizer is None else tokenizer
    config = AutoConfig.from_pretrained(model_name_or_path)

    super().__init__(
        autoconfig=config,
        instruction=DEFAULT_INSTRUCTION,
        # Matched with vllm prompt 2024-12-18
        prompt_config=LLMPromptConfig.from_tokenizer(
            template="user\n {instruction} {schema} \n assistant\n{prefill}",
            add_bos_token_to_prompt=True,
            add_eos_token_to_prompt=False,
            tokenizer=tokenizer,
            name=model_name_or_path,
        ),
        model_name_or_path=model_name_or_path,
        rope_scaling=rope_scaling_factor,
        rope_parameters_location="autoconfig",
        **kwargs,
    )

SmolLM2(model_name_or_path, tokenizer=None, rope_scaling_factor=None, **kwargs) pydantic-model

Bases: ModelMetadata

Metadata for HuggingFace SmolLM2 model family (e.g. SmolLM2-135M).

RoPE scaling is not supported and any supplied rope_scaling_factor will be ignored with a warning.

Parameters:

Name Type Description Default
model_name_or_path str

HuggingFace model identifier or local path.

required
tokenizer

Optional pre-loaded tokenizer.

None
rope_scaling_factor float | None

Ignored with a warning if provided.

None
**kwargs

Forwarded to ModelMetadata.

{}

Fields:

Validators:

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def __init__(
    self,
    model_name_or_path: str,
    tokenizer=None,
    rope_scaling_factor: float | None = None,
    **kwargs,
) -> None:
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) if tokenizer is None else tokenizer
    config = AutoConfig.from_pretrained(model_name_or_path)
    if rope_scaling_factor:
        logger.warning(
            f"Rope scaling factor {rope_scaling_factor} is not supported for SmolLM2 due to longer default context lengths. Ignoring."
        )

    im_start_id = tokenizer.convert_tokens_to_ids("<|im_start|>")
    super().__init__(
        autoconfig=config,
        instruction=DEFAULT_INSTRUCTION,
        prompt_config=LLMPromptConfig.from_tokenizer(
            template="user\n {instruction} {schema} \n assistant\n{prefill}",
            add_bos_token_to_prompt=False,
            add_eos_token_to_prompt=False,
            tokenizer=tokenizer,
            bos_token="<|im_start|>",
            bos_token_id=im_start_id,
            name=model_name_or_path,
        ),
        model_name_or_path=model_name_or_path,
        rope_scaling=None,
        rope_parameters_location="autoconfig",
        **kwargs,
    )

SmolLM3(model_name_or_path, tokenizer=None, rope_scaling_factor=None, **kwargs) pydantic-model

Bases: ModelMetadata

Metadata for HuggingFace SmolLM3 model family.

Uses <|im_start|> (id 128011) as the BOS token. RoPE scaling is not supported. Any supplied rope_scaling_factor will be ignored with a warning.

Parameters:

Name Type Description Default
model_name_or_path str

HuggingFace model identifier or local path.

required
tokenizer

Optional pre-loaded tokenizer.

None
rope_scaling_factor float | None

Ignored with a warning if provided.

None
**kwargs

Forwarded to ModelMetadata.

{}

Fields:

Validators:

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def __init__(
    self,
    model_name_or_path: str,
    tokenizer=None,
    rope_scaling_factor: float | None = None,
    **kwargs,
) -> None:
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) if tokenizer is None else tokenizer
    config = AutoConfig.from_pretrained(model_name_or_path)

    # we use the bos token here explicitly for support during group-by SFT.
    # the groupby assumes there is a bos token at the start of the prompt.
    bos_token = "<|im_start|>"
    bos_token_id = 128011

    # SmolLM3 uses high theta values (1.5M-5M) so it's important to read from config
    if rope_scaling_factor:
        logger.warning(
            f"Rope scaling factor {rope_scaling_factor} is not supported for SmolLM3 due to longer default context lengths. Ignoring."
        )

    super().__init__(
        autoconfig=config,
        instruction=DEFAULT_INSTRUCTION,
        prompt_config=LLMPromptConfig.from_tokenizer(
            template="user\n {instruction} {schema} <|im_end|> \n assistant\n{prefill}",
            add_bos_token_to_prompt=True,
            add_eos_token_to_prompt=False,
            tokenizer=tokenizer,
            name=model_name_or_path,
            bos_token=bos_token,
            bos_token_id=bos_token_id,
        ),
        model_name_or_path=model_name_or_path,
        rope_scaling=None,
        rope_parameters_location="autoconfig",
        **kwargs,
    )

TinyLlama(model_name_or_path, tokenizer=None, rope_scaling_factor=None, **kwargs) pydantic-model

Bases: ModelMetadata

Metadata for the TinyLlama model family.

Parameters:

Name Type Description Default
model_name_or_path str

HuggingFace model identifier or local path.

required
tokenizer

Optional pre-loaded tokenizer.

None
rope_scaling_factor float | None

Optional RoPE scaling factor.

None
**kwargs

Forwarded to ModelMetadata.

{}

Fields:

Validators:

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def __init__(
    self,
    model_name_or_path: str,
    tokenizer=None,
    rope_scaling_factor: float | None = None,
    **kwargs,
) -> None:
    tokenizer = tokenizer or AutoTokenizer.from_pretrained(model_name_or_path)
    config = AutoConfig.from_pretrained(model_name_or_path)

    super().__init__(
        autoconfig=config,
        instruction=DEFAULT_INSTRUCTION,
        prompt_config=LLMPromptConfig.from_tokenizer(
            template=PROMPT_TEMPLATE,
            add_bos_token_to_prompt=True,
            add_eos_token_to_prompt=True,
            tokenizer=tokenizer,
            name=model_name_or_path,
        ),
        model_name_or_path=model_name_or_path,
        rope_scaling=rope_scaling_factor,
        rope_parameters_location="autoconfig",
        **kwargs,
    )

resolve_rope_scaling_factor(factor=None, autoconfig=None)

Normalize a rope-scaling specification into a RopeScaling or None.

Accepts several convenience representations and converts them into a canonical RopeScaling instance.

Parameters:

Name Type Description Default
factor float | int | RopeScaling | dict | None

The scaling specification. Accepted forms:

  • None, 1, or 1.0 — no scaling (returns None).
  • RopeScaling — returned as-is.
  • dict — unpacked as RopeScaling(**factor).
  • int / float — used as the scaling factor; requires autoconfig to read rope_theta and rope_type.
None
autoconfig PretrainedConfig | None

A HuggingFace PretrainedConfig. Required when factor is a bare numeric value.

None

Returns:

Type Description
RopeScaling | None

A RopeScaling instance, or None when no scaling is needed.

Raises:

Type Description
ValueError

If a numeric factor is given without autoconfig, or if the input type is unsupported.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def resolve_rope_scaling_factor(
    factor: float | int | RopeScaling | dict | None = None,
    autoconfig: PretrainedConfig | None = None,
) -> RopeScaling | None:
    """Normalize a rope-scaling specification into a ``RopeScaling`` or ``None``.

    Accepts several convenience representations and converts them into a
    canonical ``RopeScaling`` instance.

    Args:
        factor: The scaling specification.  Accepted forms:

            * ``None``, ``1``, or ``1.0`` — no scaling (returns ``None``).
            * ``RopeScaling`` — returned as-is.
            * ``dict`` — unpacked as ``RopeScaling(**factor)``.
            * ``int`` / ``float`` — used as the scaling factor; requires
              ``autoconfig`` to read ``rope_theta`` and ``rope_type``.
        autoconfig: A HuggingFace ``PretrainedConfig``.  Required when
            ``factor`` is a bare numeric value.

    Returns:
        A ``RopeScaling`` instance, or ``None`` when no scaling is needed.

    Raises:
        ValueError: If a numeric ``factor`` is given without
            ``autoconfig``, or if the input type is unsupported.
    """
    match factor, autoconfig:
        case None | 1 | 1.0, _:
            return None
        case RopeScaling() as r, _:
            return r
        case dict() as d, _:
            return RopeScaling(**d)
        case int(x) | float(x), PretrainedConfig() as c:
            return RopeScaling.from_autoconfig(config=c, factor=x)
        case int(x) | float(x), None:
            raise ValueError("autoconfig is required when factor is an int or float")
        case _, None:
            raise ValueError("autoconfig is required when factor is not a RopeScaling, dict, or int/float")
        case _, _:
            raise ValueError("Invalid input type for rope scaling factor")

get_base_max_seq_length(config)

Derive the base max sequence length from a model config.

Reads max_position_embeddings from the config and clamps it to GLOBAL_MAX_SEQ_LENGTH to prevent OOM and underfitting errors. Falls back to DEFAULT_MAX_SEQ_LENGTH when the attribute is absent.

Parameters:

Name Type Description Default
config AutoConfig

A HuggingFace AutoConfig for the model.

required

Returns:

Type Description
int

The effective base sequence length (before RoPE scaling).

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def get_base_max_seq_length(config: AutoConfig) -> int:
    """Derive the base max sequence length from a model config.

    Reads ``max_position_embeddings`` from the config and clamps it to
    ``GLOBAL_MAX_SEQ_LENGTH`` to prevent OOM and underfitting errors.
    Falls back to ``DEFAULT_MAX_SEQ_LENGTH`` when the attribute is
    absent.

    Args:
        config: A HuggingFace ``AutoConfig`` for the model.

    Returns:
        The effective base sequence length (before RoPE scaling).
    """
    if mpe := getattr(config, "max_position_embeddings", None):
        logger.info(f"Using max_position_embeddings from config: {mpe}")
        if mpe > GLOBAL_MAX_SEQ_LENGTH:
            msg = f"max_position_embeddings is greater than GLOBAL_MAX_SEQ_LENGTH: {mpe} > {GLOBAL_MAX_SEQ_LENGTH}"
            msg += "\n This is a temporary workaround to prevent OOM and underfitting errors"
            msg += "\n In the future, we will use a more dyanmic approach based on available VRAM and the tokens in your dataset."
            logger.warning(msg)
        return min(mpe, GLOBAL_MAX_SEQ_LENGTH)
    logger.info(f"Using default max_position_embeddings: {DEFAULT_MAX_SEQ_LENGTH}")
    return DEFAULT_MAX_SEQ_LENGTH
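The clamping logic reduces to a few lines. The sketch below mirrors the function above; `DEFAULT_MAX_SEQ_LENGTH = 2048` is an assumed fallback value, while the global cap matches the module-level constant:

```python
GLOBAL_MAX_SEQ_LENGTH = 2048 * 6   # 12288, the module's safety cap
DEFAULT_MAX_SEQ_LENGTH = 2048      # assumed fallback, for illustration

def base_max_seq_length(max_position_embeddings=None):
    """Mirror get_base_max_seq_length: clamp the config value to the
    global cap, or fall back to the default when the attribute is
    absent (the real function also logs a warning when clamping)."""
    if max_position_embeddings:
        return min(max_position_embeddings, GLOBAL_MAX_SEQ_LENGTH)
    return DEFAULT_MAX_SEQ_LENGTH
```

For example, a model config advertising 131072 position embeddings is clamped to 12288 before any RoPE scaling is applied.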