metadata

Model-family metadata for prompt formatting, RoPE scaling, and runtime bookkeeping.

Provides ModelMetadata and its per-family subclasses (Llama32, Mistral, Qwen, etc.) that capture prompt templates, special-token settings, and context-window configuration. The RopeScaling model handles context-window extension via Rotary Position Embeddings.

A global maximum sequence length (GLOBAL_MAX_SEQ_LENGTH = 2048 * 6 = 12288) is applied as a safety cap to guard against out-of-memory (OOM) errors and underfitting.

Classes:

Name Description
LLMPromptConfig

Prompt template and special-token settings.

RopeScaling

RoPE scaling parameters for context-window extension.

ModelMetadata

Base container for model-family-specific metadata.

Granite

IBM Granite family metadata.

Llama32

Meta Llama 3.2 family metadata.

Mistral

Mistral AI family metadata.

Nemotron

NVIDIA Nemotron family metadata.

Qwen

Alibaba Qwen family metadata.

SmolLM2

HuggingFace SmolLM2 family metadata.

SmolLM3

HuggingFace SmolLM3 family metadata.

TinyLlama

TinyLlama family metadata.

Functions:

Name Description
resolve_rope_scaling_factor

Normalize a rope-scaling specification into a RopeScaling or None.

get_base_max_seq_length

Derive the base max sequence length from a model config.

LLMPromptConfig pydantic-model

Bases: BaseModel

Prompt template and special-token settings for an LLM.

Holds the Jinja-style prompt template together with flags and token values that control how BOS/EOS markers are injected during training and inference.

Fields:

template pydantic-field

Prompt template with {instruction}, {schema}, and {prefill} placeholders.

  • {instruction} -- task directive telling the model what to generate (e.g. "Generate a JSONL dataset with the following columns: ").
  • {schema} -- column schema fragment listing expected output fields, typically formatted as "col":<unk>,"col2":<unk>.
  • {prefill} -- optional text injected at the start of the model's response to steer generation, currently used for time series data.

add_bos_token_to_prompt pydantic-field

Whether to prepend the BOS token to the prompt.

add_eos_token_to_prompt pydantic-field

Whether to append the EOS token to the prompt.

bos_token pydantic-field

Beginning-of-sequence token string.

bos_token_id pydantic-field

Integer id for the BOS token.

eos_token pydantic-field

End-of-sequence token string.

eos_token_id pydantic-field

Integer id for the EOS token.

from_tokenizer(name, tokenizer=None, **kwargs) classmethod

Create a prompt config by reading settings from a tokenizer.

If no tokenizer is supplied one is loaded from name via AutoTokenizer.from_pretrained. Individual fields can be overridden through **kwargs (e.g. bos_token, template).

Parameters:

Name Type Description Default
name str

HuggingFace model identifier used to load the tokenizer when tokenizer is None.

required
tokenizer AutoTokenizer | None

Optional pre-loaded tokenizer instance.

None
**kwargs

Overrides for any LLMPromptConfig field.

{}

Returns:

Type Description
LLMPromptConfig

A new LLMPromptConfig populated from the tokenizer.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
@classmethod
def from_tokenizer(cls, name: str, tokenizer: AutoTokenizer | None = None, **kwargs) -> LLMPromptConfig:
    """Create a prompt config by reading from settings of a tokenizer.

    If no ``tokenizer`` is supplied one is loaded from ``name``
    via ``AutoTokenizer.from_pretrained``.  Individual fields can
    be overridden through ``**kwargs`` (e.g. ``bos_token``,
    ``template``).

    Args:
        name: HuggingFace model identifier used to load the
            tokenizer when ``tokenizer`` is ``None``.
        tokenizer: Optional pre-loaded tokenizer instance.
        **kwargs: Overrides for any ``LLMPromptConfig`` field.

    Returns:
        A new ``LLMPromptConfig`` populated from the tokenizer.
    """
    tokenizer = tokenizer or AutoTokenizer.from_pretrained(name)
    bos_token = kwargs.get("bos_token", getattr(tokenizer, "bos_token", None))
    bos_token_id = kwargs.get("bos_token_id", getattr(tokenizer, "bos_token_id", None))
    eos_token = kwargs.get("eos_token", getattr(tokenizer, "eos_token", None))
    eos_token_id = kwargs.get("eos_token_id", getattr(tokenizer, "eos_token_id", None))
    template = kwargs.get("template", PROMPT_TEMPLATE)
    add_bos_token_to_prompt = kwargs.get("add_bos_token_to_prompt", True)
    add_eos_token_to_prompt = kwargs.get("add_eos_token_to_prompt", True)

    pc = {
        "template": template,
        "add_bos_token_to_prompt": add_bos_token_to_prompt,
        "add_eos_token_to_prompt": add_eos_token_to_prompt,
        "bos_token": bos_token,
        "bos_token_id": bos_token_id,
        "eos_token": eos_token,
        "eos_token_id": eos_token_id,
    }

    return cls(**pc)
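The precedence applied above (explicit keyword overrides win, otherwise fall back to the tokenizer's attributes) can be sketched without loading a real tokenizer. The stub object and `read_token_settings` helper below are hypothetical stand-ins for illustration only:

```python
from types import SimpleNamespace

# Hypothetical stand-in for a loaded tokenizer; only the attributes
# that from_tokenizer reads are stubbed here.
tokenizer = SimpleNamespace(
    bos_token="<s>", bos_token_id=1,
    eos_token="</s>", eos_token_id=2,
)

def read_token_settings(tokenizer, **overrides):
    """Mirror from_tokenizer's precedence: an explicit kwarg wins,
    otherwise fall back to the tokenizer attribute (or None)."""
    fields = ("bos_token", "bos_token_id", "eos_token", "eos_token_id")
    return {f: overrides.get(f, getattr(tokenizer, f, None)) for f in fields}

# Tokenizer values are used unless overridden per-field:
settings = read_token_settings(tokenizer, eos_token="<|im_end|>")
```

Here `settings["bos_token"]` comes from the stub tokenizer while `eos_token` reflects the override, matching how `**kwargs` interacts with tokenizer attributes in the real method.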

RopeScaling pydantic-model

Bases: BaseModel

Rotary Position Embedding (RoPE) scaling configuration.

Encapsulates the parameters needed to extend a model's context window via RoPE scaling. Will be superseded by RotaryEmbeddingConfigMixin when available in transformers v5.

Fields:

  • rope_type (Literal['linear', 'dynamic', 'default', 'yarn', 'llama3'])
  • factor (float)
  • theta (float)

Validators:

rope_type = 'default' pydantic-field

Scaling algorithm: linear, dynamic, default, yarn, or llama3.

factor = 1.0 pydantic-field

Multiplier for RoPE scaling to extend the context window; values above MAX_ROPE_SCALING_FACTOR are clamped.

theta = 10000.0 pydantic-field

Base frequency (theta) used by the rotary embeddings.

validate_factor(v) pydantic-validator

Clamp factor to MAX_ROPE_SCALING_FACTOR and warn if exceeded.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
@field_validator("factor", mode="after")
@classmethod
def validate_factor(cls, v: float | int | None) -> float | int | None:
    """Clamp ``factor`` to ``MAX_ROPE_SCALING_FACTOR`` and warn if exceeded."""
    if v is None or v <= MAX_ROPE_SCALING_FACTOR:
        return v
    logger.warning(
        f"Rope scaling factor {v} is greater than MAX_ROPE_SCALING_FACTOR: {MAX_ROPE_SCALING_FACTOR}, setting to {MAX_ROPE_SCALING_FACTOR}"
    )
    return MAX_ROPE_SCALING_FACTOR
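The clamping behavior is pass-through for `None` and in-range values, and caps anything larger. A minimal sketch, assuming an illustrative cap value (the real `MAX_ROPE_SCALING_FACTOR` is defined elsewhere in the module):

```python
MAX_ROPE_SCALING_FACTOR = 8.0  # assumed value, for illustration only

def clamp_factor(v):
    """Mirror validate_factor: pass through None and in-range values,
    clamp anything above the cap (the real validator also warns)."""
    if v is None or v <= MAX_ROPE_SCALING_FACTOR:
        return v
    return MAX_ROPE_SCALING_FACTOR
```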

from_autoconfig(config, factor=None) classmethod

Create a RopeScaling from a HuggingFace PretrainedConfig.

Reads the model's native rope_theta and rope_type and optionally overrides the scaling factor.

Parameters:

Name Type Description Default
config PretrainedConfig

A loaded HuggingFace model config.

required
factor float | int | None

Scaling factor override. Defaults to 1.0.

None

Returns:

Type Description
'RopeScaling'

A RopeScaling populated from the config.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
@classmethod
def from_autoconfig(cls, config: PretrainedConfig, factor: float | int | None = None) -> "RopeScaling":
    """Create a ``RopeScaling`` from a HuggingFace ``PretrainedConfig``.

    Reads the model's native ``rope_theta`` and ``rope_type`` and
    optionally overrides the scaling ``factor``.

    Args:
        config: A loaded HuggingFace model config.
        factor: Scaling factor override.  Defaults to ``1.0``.

    Returns:
        A ``RopeScaling`` populated from the config.
    """
    # Try to get theta from config (different models use different attribute names)
    theta = getattr(config, "rope_theta", None) or 10000.0

    # Try to get rope_type from config
    rope_type = getattr(config, "rope_scaling", {})
    if isinstance(rope_type, dict):
        rope_type = rope_type.get("rope_type", "default")
    else:
        rope_type = "default"

    return cls(
        rope_type=rope_type,
        factor=factor or 1.0,
        theta=theta,
    )

to_hf_dict()

Convert to the HuggingFace rope_scaling dict format.

Returns None when factor is 1.0 (no scaling).

Returns:

Type Description
dict | None

A dict with keys rope_type, factor, and theta, or None when no scaling is configured.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def to_hf_dict(self) -> dict | None:
    """Convert to the HuggingFace ``rope_scaling`` dict format.

    Returns ``None`` when ``factor`` is ``1.0`` (no scaling).

    Returns:
        A dict with keys ``rope_type``, ``factor``, and ``theta``,
        or ``None``.
    """
    if self.factor == 1.0:
        return None
    return {
        "rope_type": self.rope_type,
        "factor": self.factor,
        "theta": self.theta,
    }
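The round-trip to HuggingFace's `rope_scaling` format can be illustrated with a minimal stand-in for the Pydantic model (the dataclass below is a sketch, not the real `RopeScaling` class):

```python
from dataclasses import dataclass

@dataclass
class RopeScalingSketch:
    """Minimal stand-in for RopeScaling, enough to show to_hf_dict."""
    rope_type: str = "default"
    factor: float = 1.0
    theta: float = 10000.0

    def to_hf_dict(self):
        # factor == 1.0 means no scaling, so rope_scaling is omitted
        # entirely (HuggingFace treats a missing dict as "no scaling").
        if self.factor == 1.0:
            return None
        return {"rope_type": self.rope_type, "factor": self.factor, "theta": self.theta}
```

Returning `None` rather than a dict with `factor=1.0` keeps the exported config identical to an unscaled model's config.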

ModelMetadata pydantic-model

Bases: BaseModel

Base container for model-family-specific metadata.

Stores prompt formats, special tokens, RoPE scaling parameters, and runtime bookkeeping needed to load, fine-tune, and generate with a given LLM family. Each supported model family has a concrete subclass (e.g. Llama32, Mistral) that sets the correct defaults.

Use the factory methods from_str_or_path, from_config, or from_metadata_json to construct instances rather than calling the constructor directly.

Config:

  • arbitrary_types_allowed: True

Fields:

Validators:

model_name_or_path pydantic-field

HuggingFace model identifier or local path.

prompt_config pydantic-field

Prompt template and token settings.

autoconfig pydantic-field

HuggingFace PretrainedConfig (excluded from serialization).

base_max_seq_length = None pydantic-field

The base model's supported context window, before adjustment by the RoPE scaling factor.

rope_scaling = None pydantic-field

RoPE scaling configuration for context window extension. Accepts a RopeScaling instance, a dict of RopeScaling fields, a numeric scale factor (requires autoconfig), or None.

max_sequences_per_example = None pydantic-field

Cap on sequences packed into one training example.

Resolved by AutoConfigResolver to 1 when DP is enabled, 10 when DP is disabled and set to "auto", or a user-supplied integer.
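The resolution rule described above can be sketched as a small function. This is an assumed reading of `AutoConfigResolver`'s behavior, not its actual implementation:

```python
def resolve_max_sequences_per_example(value, dp_enabled):
    """Assumed AutoConfigResolver rule: DP training forces 1 sequence
    per example, "auto" resolves to 10 without DP, and an explicit
    integer passes through."""
    if dp_enabled:
        return 1
    if value == "auto":
        return 10
    return int(value)
```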

workdir = None pydantic-field

Artifact directory layout.

is_adapter = False pydantic-field

Whether an adapter checkpoint is loaded.

instruction = DEFAULT_INSTRUCTION pydantic-field

Default system instruction text.

rope_parameters_location = 'automodel' pydantic-field

Where to read RoPE parameters from: autoconfig or automodel.

initial_prefill = None pydantic-field

Currently used for time series data. May be a single string or a per-column dict.

adapter_path property

The path where adapter model files are stored.

Raises:

Type Description
ValueError

If workdir is not set.

metadata_path property

The path to the metadata JSON file.

Uses workdir.metadata_file which automatically resolves to the parent workdir's path when resuming for generation.

Raises:

Type Description
ValueError

If workdir is not set.

rope_scaling_factor property

The RoPE scaling factor, exposed for backwards compatibility.

max_seq_length property

Actual context window for training.

Includes any adjustment for rope_scaling.factor.
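Assuming the adjustment multiplies the base window by the scaling factor (the property's exact implementation is not shown here), the derivation looks like:

```python
def effective_max_seq_length(base_max_seq_length, rope_scaling=None):
    """Assumed derivation of max_seq_length: the base context window
    scaled by the RoPE factor when scaling is configured."""
    if rope_scaling is None:
        return base_max_seq_length
    return int(base_max_seq_length * rope_scaling["factor"])
```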

populate_derived_fields(data) pydantic-validator

Auto-populate autoconfig, rope_scaling, and base_max_seq_length.

Called by Pydantic before field validation. Loads an AutoConfig from model_name_or_path when one is not already present, derives base_max_seq_length from that config, and resolves the rope_scaling specification into a RopeScaling instance (or None).

Parameters:

Name Type Description Default
data dict

Raw field values dict supplied to the constructor.

required

Returns:

Type Description
dict

The mutated data dict with derived fields populated.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
@model_validator(mode="before")
@classmethod
def populate_derived_fields(cls, data: dict) -> dict:
    """Auto-populate ``autoconfig``, ``rope_scaling``, and ``base_max_seq_length``.

    Called by Pydantic before field validation.  Loads an
    ``AutoConfig`` from ``model_name_or_path`` when one is not
    already present, derives ``base_max_seq_length`` from that
    config, and resolves the ``rope_scaling`` specification into a
    ``RopeScaling`` instance (or ``None``).

    Args:
        data: Raw field values dict supplied to the constructor.

    Returns:
        The mutated ``data`` dict with derived fields populated.
    """
    if data.get("autoconfig") is None:
        data["autoconfig"] = AutoConfig.from_pretrained(data["model_name_or_path"])

    if data.get("base_max_seq_length") is None:
        data["base_max_seq_length"] = get_base_max_seq_length(data["autoconfig"])

    rsf = data.get("rope_scaling")
    data["rope_scaling"] = resolve_rope_scaling_factor(rsf, data["autoconfig"])

    return data

serialize_autoconfig(config)

Serialize PretrainedConfig to a plain dict for JSON export.

Parameters:

Name Type Description Default
config PretrainedConfig

The HuggingFace config to serialize.

required

Returns:

Type Description
dict

Dict representation of the config.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
@field_serializer("autoconfig")
def serialize_autoconfig(self, config: PretrainedConfig) -> dict:
    """Serialize ``PretrainedConfig`` to a plain dict for JSON export.

    Args:
        config: The HuggingFace config to serialize.

    Returns:
        Dict representation of the config.
    """
    return config.to_dict()

save_metadata()

Save model metadata to JSON file.

Raises:

Type Description
ValueError

If workdir is not set.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def save_metadata(self) -> None:
    """Save model metadata to JSON file.

    Raises:
        ValueError: If workdir is not set.
    """
    if self.workdir is None:
        raise ValueError("Cannot save metadata: workdir is not set")
    write_json(
        self.model_dump(mode="json"),
        path=self.workdir.train.adapter.metadata,
        indent=4,
    )

from_str_or_path(model_name_or_path, **kwargs) classmethod

Instantiate the correct ModelMetadata subclass from a model name or path.

Performs case-insensitive substring matching of each registered subclass name against model_name_or_path.

Parameters:

Name Type Description Default
model_name_or_path Path | str

HuggingFace model identifier or local filesystem path.

required
**kwargs

Forwarded to the matched subclass constructor.

{}

Returns:

Type Description
ModelMetadata

An instance of the matched ModelMetadata subclass.

Raises:

Type Description
ValueError

If no registered subclass matches.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
@classmethod
def from_str_or_path(cls: type["ModelMetadata"], model_name_or_path: Path | str, **kwargs) -> ModelMetadata:
    """Instantiate the correct ``ModelMetadata`` subclass from a model name or path.

    Performs case-insensitive substring matching of each registered
    subclass name against ``model_name_or_path``.

    Args:
        model_name_or_path: HuggingFace model identifier or local
            filesystem path.
        **kwargs: Forwarded to the matched subclass constructor.

    Returns:
        An instance of the matched ``ModelMetadata`` subclass.

    Raises:
        ValueError: If no registered subclass matches.
    """
    return cls._resolve_model_class(model_name_or_path)(model_name_or_path=str(model_name_or_path), **kwargs)
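The matching strategy can be sketched with a hand-rolled registry. The `FAMILIES` list and `resolve_family` function below are hypothetical; the real `_resolve_model_class` walks the registered `ModelMetadata` subclasses and may normalize names differently (e.g. to match "Llama32" against "Llama-3.2" identifiers):

```python
# Hypothetical registry; the real code discovers ModelMetadata subclasses.
FAMILIES = ["Granite", "Llama32", "Mistral", "Nemotron", "Qwen",
            "SmolLM2", "SmolLM3", "TinyLlama"]

def resolve_family(model_name_or_path):
    """Case-insensitive substring match of each family name against
    the model identifier, as from_str_or_path does."""
    name = str(model_name_or_path).lower()
    for family in FAMILIES:
        if family.lower() in name:
            return family
    raise ValueError(f"No registered model family matches {model_name_or_path!r}")
```

For example, an identifier like "Qwen/Qwen2.5-0.5B" resolves to the Qwen family because "qwen" is a substring of the lowercased identifier.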

from_config(config, workdir=None) classmethod

Create ModelMetadata from SafeSynthesizerParameters.

The config should have been resolved with AutoConfigResolver before calling this method.

If rope_scaling_factor is set, a RopeScaling object is created with the model's native theta. max_sequences_per_example is always forwarded from config.data -- AutoConfigResolver resolves it to 1 when DP is enabled, 10 when set to "auto" with DP disabled, or the user-supplied integer.

Parameters:

Name Type Description Default
config SafeSynthesizerParameters

Resolved parameters with model and training configuration.

required
workdir Workdir | None

Artifact directory layout. Required for saving model artifacts.

None

Returns:

Type Description
ModelMetadata

A ModelMetadata subclass instance matching the configured pretrained model.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
@classmethod
def from_config(
    cls: type["ModelMetadata"],
    config: SafeSynthesizerParameters,
    workdir: Workdir | None = None,
) -> ModelMetadata:
    """Create ``ModelMetadata`` from ``SafeSynthesizerParameters``.

    The *config* should have been resolved with
    ``AutoConfigResolver`` before calling this method.

    If ``rope_scaling_factor`` is set, a ``RopeScaling`` object is
    created with the model's native theta.
    ``max_sequences_per_example`` is always forwarded from
    ``config.data`` -- ``AutoConfigResolver`` resolves it to ``1``
    when DP is enabled, ``10`` when set to ``"auto"`` with DP
    disabled, or the user-supplied integer.

    Args:
        config: Resolved parameters with model and training
            configuration.
        workdir: Artifact directory layout.  Required for saving
            model artifacts.

    Returns:
        A ``ModelMetadata`` subclass instance matching the
        configured pretrained model.
    """
    kwargs: dict = {"workdir": workdir}

    if config.training.rope_scaling_factor is not None and config.training.rope_scaling_factor != "auto":
        # Pass the factor; the subclass will create the RopeScaling with proper theta
        kwargs["rope_scaling_factor"] = config.training.rope_scaling_factor

    # Pass max_sequences_per_example from data config - critical for DP training
    kwargs["max_sequences_per_example"] = config.data.max_sequences_per_example

    return ModelMetadata.from_str_or_path(config.training.pretrained_model, **kwargs)

from_metadata_json(path, workdir=None) classmethod

Load ModelMetadata from a saved JSON file.

Parameters:

Name Type Description Default
path Path | str

Path to the metadata JSON file.

required
workdir Workdir | None

Workdir instance for artifact paths. When omitted, the loaded metadata's workdir remains None.

None

Returns:

Type Description
ModelMetadata

ModelMetadata instance with the loaded configuration.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
@classmethod
def from_metadata_json(
    cls: type["ModelMetadata"],
    path: Path | str,
    workdir: Workdir | None = None,
) -> ModelMetadata:
    """Load ModelMetadata from a saved JSON file.

    Args:
        path: Path to the metadata JSON file.
        workdir: Workdir instance for artifact paths. When omitted, the
            loaded metadata's workdir remains None.

    Returns:
        ModelMetadata instance with the loaded configuration.
    """
    path = Path(path).resolve()
    kwargs = load_json(path)
    if workdir is not None:
        kwargs["workdir"] = workdir
    return cls(**kwargs)

Granite(model_name_or_path, tokenizer=None, rope_scaling_factor=None, **kwargs) pydantic-model

Bases: ModelMetadata

Metadata for IBM Granite model family.

Parameters:

Name Type Description Default
model_name_or_path str

HuggingFace model identifier or local path.

required
tokenizer

Optional pre-loaded tokenizer.

None
rope_scaling_factor float | None

Optional RoPE scaling factor.

None
**kwargs

Forwarded to ModelMetadata.

{}

Fields:

Validators:

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def __init__(
    self,
    model_name_or_path: str,
    tokenizer=None,
    rope_scaling_factor: float | None = None,
    **kwargs,
) -> None:
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) if tokenizer is None else tokenizer
    config: PretrainedConfig = AutoConfig.from_pretrained(model_name_or_path)

    super().__init__(
        autoconfig=config,
        instruction=DEFAULT_INSTRUCTION,
        prompt_config=LLMPromptConfig.from_tokenizer(
            name=model_name_or_path,
            template="user\n {instruction} {schema} \n assistant\n{prefill}",
            add_bos_token_to_prompt=False,
            add_eos_token_to_prompt=True,
        ),
        model_name_or_path=model_name_or_path,
        rope_scaling=rope_scaling_factor,
        rope_parameters_location="autoconfig",
        **kwargs,
    )

Llama32(model_name_or_path, tokenizer=None, rope_scaling_factor=None, **kwargs) pydantic-model

Bases: ModelMetadata

Metadata for Meta Llama 3.2 model family.

Uses <|im_start|> (id 151644) as the BOS token and disables automatic BOS/EOS injection in prompts.

Parameters:

Name Type Description Default
model_name_or_path str

HuggingFace model identifier or local path.

required
tokenizer

Optional pre-loaded tokenizer.

None
rope_scaling_factor float | None

Optional RoPE scaling factor.

None
**kwargs

Forwarded to ModelMetadata.

{}

Fields:

Validators:

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def __init__(
    self,
    model_name_or_path: str,
    tokenizer=None,
    rope_scaling_factor: float | None = None,
    **kwargs,
) -> None:
    config: PretrainedConfig = AutoConfig.from_pretrained(model_name_or_path)
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) if tokenizer is None else tokenizer

    super().__init__(
        autoconfig=config,
        instruction=DEFAULT_INSTRUCTION,
        prompt_config=LLMPromptConfig.from_tokenizer(
            name=model_name_or_path,
            template="user\n {instruction} {schema} \n assistant\n{prefill}",
            bos_token="<|im_start|>",
            bos_token_id=151644,
            add_bos_token_to_prompt=False,
            add_eos_token_to_prompt=False,
        ),
        model_name_or_path=model_name_or_path,
        rope_scaling=rope_scaling_factor,
        rope_parameters_location="autoconfig",
        **kwargs,
    )

Mistral(model_name_or_path, tokenizer=None, rope_scaling_factor=None, **kwargs) pydantic-model

Bases: ModelMetadata

Metadata for Mistral AI model family.

RoPE scaling is not supported for Mistral models. Any supplied rope_scaling_factor will be ignored with a warning.

Parameters:

Name Type Description Default
model_name_or_path str

HuggingFace model identifier or local path.

required
tokenizer AutoTokenizer | None

Optional pre-loaded tokenizer.

None
rope_scaling_factor float | None

Ignored with a warning if provided.

None
**kwargs

Forwarded to ModelMetadata.

{}

Fields:

Validators:

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def __init__(
    self,
    model_name_or_path: str,
    tokenizer: AutoTokenizer | None = None,
    rope_scaling_factor: float | None = None,
    **kwargs,
) -> None:
    tokenizer: AutoTokenizer = AutoTokenizer.from_pretrained(model_name_or_path) if tokenizer is None else tokenizer
    config: PretrainedConfig = AutoConfig.from_pretrained(model_name_or_path)
    if rope_scaling_factor:
        logger.warning(
            f"Rope scaling factor {rope_scaling_factor} is not supported for Mistral due to longer default context lengths. Ignoring."
        )

    template = "[INST] {instruction} \n\n {schema} [/INST]{prefill}"
    super().__init__(
        autoconfig=config,
        instruction=DEFAULT_INSTRUCTION,
        prompt_config=LLMPromptConfig.from_tokenizer(
            name=model_name_or_path,
            template=template,
            add_bos_token_to_prompt=True,
            add_eos_token_to_prompt=True,
        ),
        model_name_or_path=model_name_or_path,
        rope_scaling=None,
        rope_parameters_location="autoconfig",
        **kwargs,
    )

Nemotron(model_name_or_path, tokenizer=None, rope_scaling_factor=None, **kwargs) pydantic-model

Bases: ModelMetadata

Metadata for NVIDIA Nemotron model family.

Parameters:

Name Type Description Default
model_name_or_path str

HuggingFace model identifier or local path.

required
tokenizer

Optional pre-loaded tokenizer.

None
rope_scaling_factor float | None

Optional RoPE scaling factor.

None
**kwargs

Forwarded to ModelMetadata.

{}

Fields:

Validators:

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def __init__(
    self,
    model_name_or_path: str,
    tokenizer=None,
    rope_scaling_factor: float | None = None,
    **kwargs,
) -> None:
    tokenizer: AutoTokenizer = AutoTokenizer.from_pretrained(model_name_or_path) if tokenizer is None else tokenizer
    config: PretrainedConfig = AutoConfig.from_pretrained(model_name_or_path)

    super().__init__(
        autoconfig=config,
        instruction=DEFAULT_INSTRUCTION,
        prompt_config=LLMPromptConfig.from_tokenizer(
            template="[INST] {instruction} \n\n {schema} [/INST]{prefill}",
            add_bos_token_to_prompt=True,
            add_eos_token_to_prompt=True,
            tokenizer=tokenizer,
            name=model_name_or_path,
        ),
        model_name_or_path=model_name_or_path,
        rope_scaling=rope_scaling_factor,
        rope_parameters_location="autoconfig",
        **kwargs,
    )

Qwen(model_name_or_path, tokenizer=None, rope_scaling_factor=None, **kwargs) pydantic-model

Bases: ModelMetadata

Metadata for Alibaba Qwen model family.

Parameters:

Name Type Description Default
model_name_or_path str

HuggingFace model identifier or local path.

required
tokenizer

Optional pre-loaded tokenizer.

None
rope_scaling_factor float | None

Optional RoPE scaling factor.

None
**kwargs

Forwarded to ModelMetadata.

{}

Fields:

Validators:

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def __init__(
    self,
    model_name_or_path: str,
    tokenizer=None,
    rope_scaling_factor: float | None = None,
    **kwargs,
) -> None:
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) if tokenizer is None else tokenizer
    config = AutoConfig.from_pretrained(model_name_or_path)

    super().__init__(
        autoconfig=config,
        instruction=DEFAULT_INSTRUCTION,
        # Matched with vllm prompt 2024-12-18
        prompt_config=LLMPromptConfig.from_tokenizer(
            template="user\n {instruction} {schema} \n assistant\n{prefill}",
            add_bos_token_to_prompt=True,
            add_eos_token_to_prompt=False,
            tokenizer=tokenizer,
            name=model_name_or_path,
        ),
        model_name_or_path=model_name_or_path,
        rope_scaling=rope_scaling_factor,
        rope_parameters_location="autoconfig",
        **kwargs,
    )

SmolLM2(model_name_or_path, tokenizer=None, rope_scaling_factor=None, **kwargs) pydantic-model

Bases: ModelMetadata

Metadata for HuggingFace SmolLM2 model family (e.g. SmolLM2-135M).

RoPE scaling is not supported and any supplied rope_scaling_factor will be ignored with a warning.

Parameters:

Name Type Description Default
model_name_or_path str

HuggingFace model identifier or local path.

required
tokenizer

Optional pre-loaded tokenizer.

None
rope_scaling_factor float | None

Ignored with a warning if provided.

None
**kwargs

Forwarded to ModelMetadata.

{}

Fields:

Validators:

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def __init__(
    self,
    model_name_or_path: str,
    tokenizer=None,
    rope_scaling_factor: float | None = None,
    **kwargs,
) -> None:
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) if tokenizer is None else tokenizer
    config = AutoConfig.from_pretrained(model_name_or_path)
    if rope_scaling_factor:
        logger.warning(
            f"Rope scaling factor {rope_scaling_factor} is not supported for SmolLM2 due to longer default context lengths. Ignoring."
        )

    im_start_id = tokenizer.convert_tokens_to_ids("<|im_start|>")
    super().__init__(
        autoconfig=config,
        instruction=DEFAULT_INSTRUCTION,
        prompt_config=LLMPromptConfig.from_tokenizer(
            template="user\n {instruction} {schema} \n assistant\n{prefill}",
            add_bos_token_to_prompt=False,
            add_eos_token_to_prompt=False,
            tokenizer=tokenizer,
            bos_token="<|im_start|>",
            bos_token_id=im_start_id,
            name=model_name_or_path,
        ),
        model_name_or_path=model_name_or_path,
        rope_scaling=None,
        rope_parameters_location="autoconfig",
        **kwargs,
    )

SmolLM3(model_name_or_path, tokenizer=None, rope_scaling_factor=None, **kwargs) pydantic-model

Bases: ModelMetadata

Metadata for HuggingFace SmolLM3 model family.

Uses <|im_start|> (id 128011) as the BOS token. RoPE scaling is not supported. Any supplied rope_scaling_factor will be ignored with a warning.

Parameters:

Name Type Description Default
model_name_or_path str

HuggingFace model identifier or local path.

required
tokenizer

Optional pre-loaded tokenizer.

None
rope_scaling_factor float | None

Ignored with a warning if provided.

None
**kwargs

Forwarded to ModelMetadata.

{}

Fields:

Validators:

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def __init__(
    self,
    model_name_or_path: str,
    tokenizer=None,
    rope_scaling_factor: float | None = None,
    **kwargs,
) -> None:
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) if tokenizer is None else tokenizer
    config = AutoConfig.from_pretrained(model_name_or_path)

    # we use the bos token here explicitly for support during group-by SFT.
    # the groupby assumes there is a bos token at the start of the prompt.
    bos_token = "<|im_start|>"
    bos_token_id = 128011

    # SmolLM3 uses high theta values (1.5M-5M) so it's important to read from config
    if rope_scaling_factor:
        logger.warning(
            f"Rope scaling factor {rope_scaling_factor} is not supported for SmolLM3 due to longer default context lengths. Ignoring."
        )

    super().__init__(
        autoconfig=config,
        instruction=DEFAULT_INSTRUCTION,
        prompt_config=LLMPromptConfig.from_tokenizer(
            template="user\n {instruction} {schema} <|im_end|> \n assistant\n{prefill}",
            add_bos_token_to_prompt=True,
            add_eos_token_to_prompt=False,
            tokenizer=tokenizer,
            name=model_name_or_path,
            bos_token=bos_token,
            bos_token_id=bos_token_id,
        ),
        model_name_or_path=model_name_or_path,
        rope_scaling=None,
        rope_parameters_location="autoconfig",
        **kwargs,
    )

TinyLlama(model_name_or_path, tokenizer=None, rope_scaling_factor=None, **kwargs) pydantic-model

Bases: ModelMetadata

Metadata for the TinyLlama model family.

Parameters:

Name Type Description Default
model_name_or_path str

HuggingFace model identifier or local path.

required
tokenizer

Optional pre-loaded tokenizer.

None
rope_scaling_factor float | None

Optional RoPE scaling factor.

None
**kwargs

Forwarded to ModelMetadata.

{}

Fields:

Validators:

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def __init__(
    self,
    model_name_or_path: str,
    tokenizer=None,
    rope_scaling_factor: float | None = None,
    **kwargs,
) -> None:
    tokenizer = tokenizer or AutoTokenizer.from_pretrained(model_name_or_path)
    config = AutoConfig.from_pretrained(model_name_or_path)

    super().__init__(
        autoconfig=config,
        instruction=DEFAULT_INSTRUCTION,
        prompt_config=LLMPromptConfig.from_tokenizer(
            template=PROMPT_TEMPLATE,
            add_bos_token_to_prompt=True,
            add_eos_token_to_prompt=True,
            tokenizer=tokenizer,
            name=model_name_or_path,
        ),
        model_name_or_path=model_name_or_path,
        rope_scaling=rope_scaling_factor,
        rope_parameters_location="autoconfig",
        **kwargs,
    )

resolve_rope_scaling_factor(factor=None, autoconfig=None)

Normalize a rope-scaling specification into a RopeScaling or None.

Accepts several convenience representations and converts them into a canonical RopeScaling instance.

Parameters:

Name Type Description Default
factor float | int | RopeScaling | dict | None

The scaling specification. Accepted forms:

  • None, 1, or 1.0 — no scaling (returns None).
  • RopeScaling — returned as-is.
  • dict — unpacked as RopeScaling(**factor).
  • int / float — used as the scaling factor; requires autoconfig to read rope_theta and rope_type.
None
autoconfig PretrainedConfig | None

A HuggingFace PretrainedConfig. Required when factor is a bare numeric value.

None

Returns:

Type Description
RopeScaling | None

A RopeScaling instance, or None when no scaling is needed.

Raises:

Type Description
ValueError

If a numeric factor is given without autoconfig, or if the input type is unsupported.

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def resolve_rope_scaling_factor(
    factor: float | int | RopeScaling | dict | None = None,
    autoconfig: PretrainedConfig | None = None,
) -> RopeScaling | None:
    """Normalize a rope-scaling specification into a ``RopeScaling`` or ``None``.

    Accepts several convenience representations and converts them into a
    canonical ``RopeScaling`` instance.

    Args:
        factor: The scaling specification.  Accepted forms:

            * ``None``, ``1``, or ``1.0`` — no scaling (returns ``None``).
            * ``RopeScaling`` — returned as-is.
            * ``dict`` — unpacked as ``RopeScaling(**factor)``.
            * ``int`` / ``float`` — used as the scaling factor; requires
              ``autoconfig`` to read ``rope_theta`` and ``rope_type``.
        autoconfig: A HuggingFace ``PretrainedConfig``.  Required when
            ``factor`` is a bare numeric value.

    Returns:
        A ``RopeScaling`` instance, or ``None`` when no scaling is needed.

    Raises:
        ValueError: If a numeric ``factor`` is given without
            ``autoconfig``, or if the input type is unsupported.
    """
    match factor, autoconfig:
        case None | 1 | 1.0, _:
            return None
        case RopeScaling() as r, _:
            return r
        case dict() as d, _:
            return RopeScaling(**d)
        case int(x) | float(x), PretrainedConfig() as c:
            return RopeScaling.from_autoconfig(config=c, factor=x)
        case int(x) | float(x), None:
            raise ValueError("autoconfig is required when factor is an int or float")
        case _, None:
            raise ValueError("autoconfig is required when factor is not a RopeScaling, dict, or int/float")
        case _, _:
            raise ValueError("Invalid input type for rope scaling factor")

get_base_max_seq_length(config)

Derive the base max sequence length from a model config.

Reads max_position_embeddings from the config and clamps it to GLOBAL_MAX_SEQ_LENGTH to prevent OOM and underfitting errors. Falls back to DEFAULT_MAX_SEQ_LENGTH when the attribute is absent.

Parameters:

Name Type Description Default
config AutoConfig

A HuggingFace AutoConfig for the model.

required

Returns:

Type Description
int

The effective base sequence length (before RoPE scaling).

Source code in src/nemo_safe_synthesizer/llm/metadata.py
def get_base_max_seq_length(config: AutoConfig) -> int:
    """Derive the base max sequence length from a model config.

    Reads ``max_position_embeddings`` from the config and clamps it to
    ``GLOBAL_MAX_SEQ_LENGTH`` to prevent OOM and underfitting errors.
    Falls back to ``DEFAULT_MAX_SEQ_LENGTH`` when the attribute is
    absent.

    Args:
        config: A HuggingFace ``AutoConfig`` for the model.

    Returns:
        The effective base sequence length (before RoPE scaling).
    """
    if mpe := getattr(config, "max_position_embeddings", None):
        logger.info(f"Using max_position_embeddings from config: {mpe}")
        if mpe > GLOBAL_MAX_SEQ_LENGTH:
            msg = f"max_position_embeddings is greater than GLOBAL_MAX_SEQ_LENGTH: {mpe} > {GLOBAL_MAX_SEQ_LENGTH}"
            msg += "\n This is a temporary workaround to prevent OOM and underfitting errors"
            msg += "\n In the future, we will use a more dyanmic approach based on available VRAM and the tokens in your dataset."
            logger.warning(msg)
        return min(mpe, GLOBAL_MAX_SEQ_LENGTH)
    logger.info(f"Using default max_position_embeddings: {DEFAULT_MAX_SEQ_LENGTH}")
    return DEFAULT_MAX_SEQ_LENGTH
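The clamping logic reduces to a few lines. The sketch below mirrors the function above; `DEFAULT_MAX_SEQ_LENGTH = 2048` is an assumed fallback value, while the global cap matches the module-level constant:

```python
GLOBAL_MAX_SEQ_LENGTH = 2048 * 6   # 12288, the module's safety cap
DEFAULT_MAX_SEQ_LENGTH = 2048      # assumed fallback, for illustration

def base_max_seq_length(max_position_embeddings=None):
    """Mirror get_base_max_seq_length: clamp the config value to the
    global cap, or fall back to the default when the attribute is
    absent (the real function also logs a warning when clamping)."""
    if max_position_embeddings:
        return min(max_position_embeddings, GLOBAL_MAX_SEQ_LENGTH)
    return DEFAULT_MAX_SEQ_LENGTH
```

For example, a model config advertising 131072 position embeddings is clamped to 12288 before any RoPE scaling is applied.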