assembler

Classes:

Name Description
Example

A single training example containing a prompt and records.

TrainingExamples

Container for managing a dataset of training examples.

TrainingExampleAssembler

Base class for assembling LLM training examples.

TabularDataExampleAssembler

Assembler for standard tabular (non-grouped, non-sequential) data.

SequentialExampleAssembler

Assembler for sequential/time series data that preserves record ordering.

GroupedDataExampleAssembler

Grouped data example assembler.

Example(prompt, tokenizer, metadata)

A single training example containing a prompt and records.

A training example consists of a prompt followed by one or more sequences of records, where each sequence is (optionally) enclosed by the BOS and EOS special tokens. Tokens from the prompt are masked with label -100 so they are ignored during loss computation.
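The masking convention can be sketched with plain lists (the token IDs below are invented for illustration; only the -100 convention is from the library):

```python
# Sketch of the label-masking convention described above.
# Token IDs are made up; they do not come from a real tokenizer.
IGNORE_INDEX = -100  # PyTorch's CrossEntropyLoss ignores this label by default

prompt_ids = [101, 2054, 2003]     # pretend these encode the schema prompt
record_ids = [200, 201, 202, 203]  # pretend these encode one record sequence

input_ids = prompt_ids + record_ids
# Prompt tokens get label -100 so they contribute nothing to the loss;
# record tokens are labeled with themselves (standard causal-LM targets).
labels = [IGNORE_INDEX] * len(prompt_ids) + record_ids
attention_mask = [1] * len(input_ids)

assert labels == [-100, -100, -100, 200, 201, 202, 203]
```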

Parameters:

Name Type Description Default
prompt str

Schema prompt text prepended to every example.

required
tokenizer PreTrainedTokenizer

Tokenizer used to encode the prompt.

required
metadata ModelMetadata

Model metadata controlling special-token placement.

required

Methods:

Name Description
add_sequence

Add a sequence of records to the example.

to_dict

Convert the example to a dictionary format suitable for training.

Attributes:

Name Type Description
num_tokens int

Total number of tokens in this example (prompt + all sequences).

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
def __init__(
    self,
    prompt: str,
    tokenizer: PreTrainedTokenizer,
    metadata: ModelMetadata,
):
    self.prompt = prompt
    self.tokenizer = tokenizer
    self.metadata = metadata

    self.input_ids = self.tokenizer.encode(prompt, add_special_tokens=False)

    if self.metadata.prompt_config.add_bos_token_to_prompt:
        self.input_ids = [self.metadata.prompt_config.bos_token_id] + self.input_ids
    if self.metadata.prompt_config.add_eos_token_to_prompt:
        self.input_ids = self.input_ids + [self.metadata.prompt_config.eos_token_id]

    # We use -100 to ignore the prompt tokens when calculating the loss.
    self.labels = [-100] * len(self.input_ids)
    self.attention_mask = [1] * len(self.input_ids)

    self.num_sequences = 0

num_tokens property

Total number of tokens in this example (prompt + all sequences).

add_sequence(seq, add_special_tokens=True)

Add a sequence of records to the example.

Parameters:

Name Type Description Default
seq dict[str, list[int]]

Dictionary containing 'input_ids' and 'attention_mask' for the sequence.

required
add_special_tokens bool

Whether to add special tokens to the sequence.

True

Raises:

Type Description
GenerationError

If the number of tokens in the example exceeds the context length.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
def add_sequence(self, seq: dict[str, list[int]], add_special_tokens: bool = True) -> None:
    """Add a sequence of records to the example.

    Args:
        seq: Dictionary containing 'input_ids' and 'attention_mask' for the sequence.
        add_special_tokens: Whether to add special tokens to the sequence.

    Raises:
        GenerationError: If the number of tokens in the example exceeds the context length.
    """
    input_ids = (
        [self.metadata.prompt_config.bos_token_id] + seq["input_ids"] + [self.metadata.prompt_config.eos_token_id]
        if add_special_tokens
        else seq["input_ids"]
    )
    attention_mask = [1] + seq["attention_mask"] + [1] if add_special_tokens else seq["attention_mask"]
    self.input_ids.extend(input_ids)
    self.attention_mask.extend(attention_mask)
    self.labels.extend(input_ids)
    self.num_sequences += 1

    if self.num_tokens > self.metadata.max_seq_length:
        max_tokens_action = _get_max_tokens_action(self.metadata.rope_scaling_factor)
        msg = f"The number of tokens in an example exceeds the available context length. {max_tokens_action}"
        raise GenerationError(msg)

to_dict()

Convert the example to a dictionary format suitable for training.

Returns:

Type Description
dict[str, list]

A dictionary containing input_ids, attention_mask, and labels.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
def to_dict(self) -> dict[str, list]:
    """Convert the example to a dictionary format suitable for training.

    Returns:
        A dictionary containing ``input_ids``, ``attention_mask``, and ``labels``.
    """
    return {
        "input_ids": self.input_ids,
        "attention_mask": self.attention_mask,
        "labels": self.labels,
    }

TrainingExamples(train, stats, test=None) dataclass

Container for managing a dataset of training examples.

Attributes:

Name Type Description
train Dataset

🤗 Dataset of the training examples.

stats dict[str, Statistics]

Running statistics calculated during example construction.

test Dataset | None

🤗 Dataset of the test examples, if available.

TrainingExampleAssembler(dataset, tokenizer, metadata, keep_columns=None, test_size=None, cache_file_path=None, seed=None, *args, **kwargs)

Bases: ABC

Base class for assembling LLM training examples.

Subclasses of this class are responsible for converting a dataset into a format suitable for training / fine-tuning LLMs.

Parameters:

Name Type Description Default
dataset Dataset

Dataset to be processed.

required
tokenizer PreTrainedTokenizer

Tokenizer used for tokenizing the dataset records.

required
metadata ModelMetadata

Internal training configuration, e.g., prompt template, bos/eos tokens, and where to use them.

required
keep_columns list[str] | None

List of columns to keep in the tokenized dataset. This is useful if you need certain fields for subsequent processing (e.g., grouping).

None
test_size int | None

Absolute number of records you want in the test set. If None or 0, there will be no test set and hence no evaluation during training.

None
cache_file_path str | Path | None

Path to store the cached dataset for efficient data access.

None
seed int | None

Seed for the random number generator and train-test split.

None

Methods:

Name Description
assemble_training_examples

Build examples from the tokenized dataset.

from_data

Select and construct the appropriate assembler subclass from config.

Attributes:

Name Type Description
num_records_train int

Number of records in the training split.

num_records_validation int

Number of records in the validation split.

num_records_total int

Total number of records across training and validation splits.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
def __init__(
    self,
    dataset: Dataset,
    tokenizer: PreTrainedTokenizer,
    metadata: ModelMetadata,
    keep_columns: list[str] | None = None,
    test_size: int | None = None,
    cache_file_path: str | Path | None = None,
    seed: int | None = None,  # TODO: probably should include with metadata!
    *args,
    **kwargs,
):
    if test_size is not None and test_size > 0 and test_size > len(dataset) - TRAIN_SET_SIZE_BUFFER:
        msg = (
            "The test set size is too large compared to the input dataset. You must "
            f"have `test_set_size < len(dataset) - {TRAIN_SET_SIZE_BUFFER} records`. "
            f"You gave `test_set_size = {test_size}` and `len(dataset) = {len(dataset)}`. "
            "Please reduce the test set size or provide a larger dataset."
        )
        raise ParameterError(msg)

    self.metadata = metadata
    self.tokenizer = tokenizer
    self.stats = defaultdict(RunningStatistics)
    self.stats_val = defaultdict(RunningStatistics)
    # Fall back to the current working directory when no cache path is given:
    # 🤗 Datasets raises FileNotFoundError if the cache path parent is an empty string.
    fp = Path(cache_file_path) if cache_file_path else Path.cwd()
    self.cache_file_path = fp / f"{DEFAULT_CACHE_PREFIX}_{uuid.uuid4().hex[:5]}"
    self.test_size = test_size
    self.keep_columns = keep_columns or []
    self.seed = seed
    self._window_rng = None

    self.schema_prompt = utils.create_schema_prompt(
        dataset.column_names,
        instruction=metadata.instruction,
        prompt_template=metadata.prompt_config.template,
    )

    # The prompt IDs attribute does *not* include special tokens.
    self.schema_prompt_ids: list[int] = tokenizer(self.schema_prompt, add_special_tokens=False)["input_ids"]

    self.tokenized_records = self._tokenize_dataset(dataset, keep_columns)
    processed_dataset = self._preprocess_before_splitting(self.tokenized_records)
    self._apply_train_test_split(processed_dataset)

num_records_train abstractmethod property

Number of records in the training split.

num_records_validation abstractmethod property

Number of records in the validation split.

num_records_total property

Total number of records across training and validation splits.

assemble_training_examples(data_fraction=1.0) abstractmethod

Build examples from the tokenized dataset.

Parameters:

Name Type Description Default
data_fraction float

Fraction of the dataset to use for example generation.

1.0

Returns:

Type Description
TrainingExamples

A TrainingExamples object containing 🤗 Dataset objects for the train
and test sets of examples, along with associated statistics.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
@abstractmethod
def assemble_training_examples(self, data_fraction: float = 1.0) -> TrainingExamples:
    """Build examples from the tokenized dataset.

    Args:
        data_fraction: Fraction of the dataset to use for example generation.

    Returns:
        A TrainingExamples object containing 🤗 Dataset objects for the train
        and test sets of examples, along with associated statistics.
    """

from_data(dataset, tokenizer, metadata, config, test_size=None, seed=None, cache_file_path=None, keep_columns=None, **kwargs) classmethod

Select and construct the appropriate assembler subclass from config.

Returns a SequentialExampleAssembler for time-series data, a GroupedDataExampleAssembler when group_training_examples_by is set, or a TabularDataExampleAssembler otherwise.

Parameters:

Name Type Description Default
dataset Dataset

A HuggingFace datasets.Dataset of tabular records to assemble training examples from.

required
tokenizer PreTrainedTokenizer

Tokenizer used for encoding records.

required
metadata ModelMetadata

Model metadata (prompt config, sequence lengths, etc.).

required
config SafeSynthesizerParameters

Full pipeline configuration used to determine the assembler type.

required
test_size int | None

Fraction of the dataset to reserve for validation (0 <= test_size < 1).

None
seed int | None

Random seed for reproducibility.

None
cache_file_path str | Path | None

Path for caching intermediate datasets.

None
keep_columns list[str] | None

Columns to preserve through tokenization.

None
**kwargs

Forwarded to the chosen assembler constructor.

{}

Returns:

Type Description
GroupedDataExampleAssembler | TabularDataExampleAssembler | SequentialExampleAssembler

An assembler instance appropriate for the data type described by config.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
@classmethod
def from_data(
    cls,
    dataset: Dataset,
    tokenizer: PreTrainedTokenizer,
    metadata: ModelMetadata,
    config: SafeSynthesizerParameters,
    test_size: int | None = None,
    seed: int | None = None,
    cache_file_path: str | Path | None = None,
    keep_columns: list[str] | None = None,
    **kwargs,
) -> GroupedDataExampleAssembler | TabularDataExampleAssembler | SequentialExampleAssembler:
    """Select and construct the appropriate assembler subclass from config.

    Returns a ``SequentialExampleAssembler`` for time-series data, a
    ``GroupedDataExampleAssembler`` when ``group_training_examples_by``
    is set, or a ``TabularDataExampleAssembler`` otherwise.

    Args:
        dataset: A HuggingFace ``datasets.Dataset`` of tabular records to
            assemble training examples from.
        tokenizer: Tokenizer used for encoding records.
        metadata: Model metadata (prompt config, sequence lengths, etc.).
        config: Full pipeline configuration used to determine the assembler type.
        test_size: Fraction of the dataset to reserve for validation (0 <= test_size < 1).
        seed: Random seed for reproducibility.
        cache_file_path: Path for caching intermediate datasets.
        keep_columns: Columns to preserve through tokenization.
        **kwargs: Forwarded to the chosen assembler constructor.

    Returns:
        An assembler instance appropriate for the data type described by ``config``.
    """
    if config.time_series.is_timeseries:
        # group_by and order_by should be set by timeseries preprocessing
        # (adds pseudo-group if needed, sets order_by to timestamp column)
        group_by = config.data.group_training_examples_by
        order_by = config.data.order_training_examples_by
        if group_by is None or order_by is None:  # for type checking
            raise RuntimeError("Internal error: group_by and order_by should be set by timeseries preprocessing")

        return SequentialExampleAssembler(
            group_training_examples_by=group_by,
            order_training_examples_by=order_by,
            dataset=dataset,
            tokenizer=tokenizer,
            metadata=metadata,
            config=config,
            test_size=config.training.validation_ratio,
            seed=seed,
            cache_file_path=cache_file_path,
            keep_columns=keep_columns,
            **kwargs,
        )

    if config.data.group_training_examples_by is not None:
        return GroupedDataExampleAssembler(
            group_training_examples_by=config.data.group_training_examples_by,
            order_training_examples_by=config.data.order_training_examples_by,
            dataset=dataset,
            tokenizer=tokenizer,
            metadata=metadata,
            test_size=config.training.validation_ratio,
            seed=seed,
            cache_file_path=cache_file_path,
            keep_columns=keep_columns,
            **kwargs,
        )
    else:
        return TabularDataExampleAssembler(
            dataset=dataset,
            tokenizer=tokenizer,
            metadata=metadata,
            test_size=config.training.validation_ratio,
            seed=seed,
            cache_file_path=cache_file_path,
            keep_columns=keep_columns,
            **kwargs,
        )

TabularDataExampleAssembler(*args, **kwargs)

Bases: TrainingExampleAssembler

Assembler for standard tabular (non-grouped, non-sequential) data.

Records are shuffled and packed into examples that fill the model's context window. Each example contains a single sequence of concatenated records enclosed by BOS/EOS tokens.

Methods:

Name Description
assemble_training_examples

Build examples with randomly shuffled records.

Attributes:

Name Type Description
num_records_train int

Number of records in the training split.

num_records_validation int

Number of records in the validation split.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)

num_records_train property

Number of records in the training split.

num_records_validation property

Number of records in the validation split.

assemble_training_examples(data_fraction=1.0)

Build examples with randomly shuffled records.

Parameters:

Name Type Description Default
data_fraction float

Fraction of the dataset to use for example generation.

1.0

Returns:

Type Description
TrainingExamples

A TrainingExamples object containing 🤗 Dataset objects for the train
and test sets of examples, along with associated statistics.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
def assemble_training_examples(self, data_fraction: float = 1.0) -> TrainingExamples:
    """Build examples with randomly shuffled records.

    Args:
        data_fraction: Fraction of the dataset to use for example generation.

    Returns:
        A TrainingExamples object containing 🤗 Dataset objects for the train
        and test sets of examples, along with associated statistics.
    """
    logger.info(
        f"Assembling examples from {data_fraction:.1%} of the input records",
    )

    rng = utils.get_random_number_generator(self.seed)
    # Process both training and test datasets
    training_dataset = self._prepare_dataset_for_training(self.train_dataset, data_fraction, rng)
    validation_dataset = self._prepare_dataset_for_training(self.validation_dataset, 1.0, rng)

    examples = TrainingExamples(
        train=training_dataset,
        test=validation_dataset,
        stats={
            "tokens_per_record": self.stats["tokens_per_record"],
            "tokens_per_example": self.stats["tokens_per_example"],
            "records_per_example": self.stats["records_per_example"],
        },
    )

    utils.log_training_example_stats(examples.stats)

    return examples

SequentialExampleAssembler(dataset, tokenizer, metadata, *, group_training_examples_by, order_training_examples_by, keep_columns=None, **kwargs)

Bases: TabularDataExampleAssembler

Assembler for sequential/time series data that preserves record ordering.

This assembler extends TabularDataExampleAssembler to handle time series and other sequential data where record order is semantically meaningful. Unlike the base class, which shuffles records, this assembler maintains chronological order within groups and ensures each training example contains records from only one group.

Key Concepts
  • Order Preservation: Records are never shuffled. Within each group, records maintain their original order (typically chronological by timestamp).
  • Single-Group Examples: Each training example contains records from exactly one group. This ensures the model learns patterns within a group's sequence without cross-group contamination.
  • Sequence Continuation: When a group's records span multiple examples, the sequence continues naturally across example boundaries. The model sees (example1: records 0-99) then (example2: records 100-199) for the same group.
  • Pseudo-Group Handling: When no group column is specified, preprocessing adds a PSEUDO_GROUP_COLUMN so an ungrouped time series is treated as a single group. This unifies the grouped and ungrouped code paths.
  • Initial Prefill: For each group, the first 3 records are stored in model_metadata.initial_prefill as a dict mapping group_id -> prefill string. This is used by TimeseriesBackend during generation to seed each group's context.
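The initial-prefill concept can be sketched as follows (build_initial_prefill and PREFILL_RECORDS are hypothetical names; the real assembler stores the result on model_metadata.initial_prefill):

```python
# Sketch of the initial-prefill idea: map each group to its first few records,
# later used to seed that group's context during generation.
# PREFILL_RECORDS and build_initial_prefill are hypothetical names.
PREFILL_RECORDS = 3

def build_initial_prefill(records):
    """records: iterable of (group_id, record_str) in chronological order."""
    prefill: dict[str, list[str]] = {}
    for group_id, record in records:
        bucket = prefill.setdefault(group_id, [])
        if len(bucket) < PREFILL_RECORDS:
            bucket.append(record)
    return {g: "\n".join(rs) for g, rs in prefill.items()}

recs = [("A", "a1"), ("A", "a2"), ("A", "a3"), ("A", "a4"), ("B", "b1")]
prefill = build_initial_prefill(recs)
assert prefill["A"] == "a1\na2\na3"
assert prefill["B"] == "b1"
```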
Processing Flow
  1. Initialization:
     a. Validate that the group and order columns exist in the dataset
     b. Reorder columns: group_by first, order_by second, then the rest
     c. Build the keep_columns list to preserve group/order through tokenization
     d. Override schema_prompt to exclude PSEUDO_GROUP_COLUMN from the visible schema

  2. Train/Test Split (_apply_grouped_train_test_split):
     a. Split along group boundaries using GroupShuffleSplit
     b. Entire groups go to train OR validation, never split across
     c. Re-sort after the split (GroupShuffleSplit shuffles indices)
     d. Add a row-indices column for detecting dataset restart boundaries

  3. Dataset Preparation (_prepare_dataset_for_training):
     a. For data_fraction > 1, concatenate multiple passes of the dataset (no shuffling, just sequential duplication)
     b. Run example generation via _fill_context_with_records_generator

  4. Example Generation (_fill_context_with_records_generator):
     a. Iterate through records sequentially
     b. Track the token budget per example (randomized between MIN/MAX_FILL_RATIO)
     c. Flush the example when any boundary condition is met:
        • Group changes (record_group != current_group_value)
        • Dataset restarts (row_idx < prev_row_idx, from duplication wrap)
        • Token budget exceeded
        • Max sequences per example reached
     d. Each flushed example becomes one training sample
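Step 4 can be sketched as a small generator that flushes on a group change or an exhausted token budget (simplified: the real generator also handles dataset restarts and a maximum sequence count):

```python
# Sketch of the flush logic in step 4: pack sequentially ordered records into
# examples, flushing when the group changes or the token budget would be
# exceeded. Simplified relative to _fill_context_with_records_generator.
def pack_records(records, token_budget):
    """records: iterable of (group, n_tokens); yields lists of records."""
    current, group, used = [], None, 0
    for rec_group, n_tokens in records:
        if current and (rec_group != group or used + n_tokens > token_budget):
            yield current
            current, used = [], 0
        current.append((rec_group, n_tokens))
        group, used = rec_group, used + n_tokens
    if current:
        yield current

recs = [("A", 3), ("A", 3), ("A", 3), ("B", 3), ("B", 3)]
examples = list(pack_records(recs, token_budget=7))
# Group A splits across two examples; group B never mixes with A.
assert examples == [[("A", 3), ("A", 3)], [("A", 3)], [("B", 3), ("B", 3)]]
```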

Parameters:

Name Type Description Default
dataset Dataset

A HuggingFace datasets.Dataset of tabular records. Must contain the columns specified by group_training_examples_by and order_training_examples_by.

required
tokenizer PreTrainedTokenizer

Tokenizer used for encoding records.

required
metadata ModelMetadata

Model metadata containing prompt config, sequence lengths, etc.

required
group_training_examples_by str

Column to group training examples by. For time series without explicit grouping, this is set to PSEUDO_GROUP_COLUMN by the preprocessing step.

required
order_training_examples_by str

Column to order records within groups.

required
keep_columns list[str] | None

Columns to preserve through tokenization.

None
**kwargs

Additional arguments forwarded to TabularDataExampleAssembler.

{}

Attributes:

Name Type Description
group_by_column

Column name used to group records. For time series, this might be device_id, customer_id, etc. For ungrouped data, this is PSEUDO_GROUP_COLUMN added during preprocessing.

order_by_column

Column name used to order records within groups. Typically a timestamp column for time series data.

Example

For a dataset with 2 groups (A, B) and records ordered by timestamp:

  • Group A: records a1, a2, a3, a4, a5
  • Group B: records b1, b2, b3, b4

With token budget fitting ~3 records per example, output might be:

  • Example 1: [a1, a2, a3] (group A)
  • Example 2: [a4, a5] (group A, continues sequence)
  • Example 3: [b1, b2, b3] (group B)
  • Example 4: [b4] (group B, continues sequence)

Note: Examples never mix groups (no [a1, a2, b1]).

Methods:

Name Description
assemble_training_examples

Build examples preserving sequential order within groups.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
def __init__(
    self,
    dataset: Dataset,
    tokenizer: PreTrainedTokenizer,
    metadata: ModelMetadata,
    *,
    group_training_examples_by: str,
    order_training_examples_by: str,
    keep_columns: list[str] | None = None,
    **kwargs,
):
    self.group_by_column = group_training_examples_by
    self.order_by_column = order_training_examples_by

    self._validate_columns(dataset)
    dataset = self._reorder_columns(dataset)
    keep_columns = self._build_keep_columns(keep_columns)

    super().__init__(
        dataset=dataset,
        tokenizer=tokenizer,
        metadata=metadata,
        keep_columns=keep_columns,
        **kwargs,
    )

    self._build_schema_prompt_excluding_pseudo_group(dataset, metadata, tokenizer)

num_groups_train property

Number of unique groups in the training split.

num_groups_validation property

Number of unique groups in the validation split.

assemble_training_examples(data_fraction=1.0)

Build examples preserving sequential order within groups.

Parameters:

Name Type Description Default
data_fraction float

Fraction of the dataset to use for example generation.

1.0

Returns:

Type Description
TrainingExamples

A TrainingExamples object containing train/test datasets and statistics,
including the examples_per_group distribution.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
def assemble_training_examples(self, data_fraction: float = 1.0) -> TrainingExamples:
    """Build examples preserving sequential order within groups.

    Args:
        data_fraction: Fraction of the dataset to use for example generation.

    Returns:
        A TrainingExamples object containing train/test datasets and statistics,
        including the examples_per_group distribution.
    """
    logger.info(
        f"Assembling sequential examples from {data_fraction:.1%} of the input records",
    )

    rng = utils.get_random_number_generator(self.seed)
    training_dataset = self._prepare_dataset_for_training(self.train_dataset, data_fraction, rng)
    validation_dataset = self._prepare_dataset_for_training(self.validation_dataset, 1.0, rng)

    examples = TrainingExamples(
        train=training_dataset,
        test=validation_dataset,
        stats={
            "tokens_per_record": self.stats["tokens_per_record"],
            "tokens_per_example": self.stats["tokens_per_example"],
            "records_per_example": self.stats["records_per_example"],
            "examples_per_group": self.stats["examples_per_group"],
        },
    )

    utils.log_training_example_stats(examples.stats)

    return examples

GroupedDataExampleAssembler(group_training_examples_by, order_training_examples_by, dataset, tokenizer, metadata, test_size=None, cache_file_path=None, seed=None, keep_columns=None, *args, **kwargs)

Bases: TrainingExampleAssembler

Grouped data example assembler.

Parameters:

Name Type Description Default
group_training_examples_by str

Column to group training examples by.

required
order_training_examples_by str | None

Column to order training examples by.

required
dataset Dataset

Dataset to be processed.

required
tokenizer PreTrainedTokenizer

Tokenizer used for tokenizing the dataset records.

required
metadata ModelMetadata

Training configuration, e.g., group by, order by, prompt template, BOS/EOS tokens, and where to use them.

required
test_size int | float | None

Fraction of the dataset to use for testing. If None, there will be no test set and hence no evaluation during training.

None
cache_file_path str | Path | None

Path to store the cached dataset for efficient data access.

None
seed int | None

Seed for the random number generator and train-test split.

None

Methods:

Name Description
assemble_training_examples

Build examples with grouped (and optionally ordered) records.

Attributes:

Name Type Description
num_records_train int

Total number of individual records across all training groups.

num_records_validation int

Total number of individual records across all validation groups.

num_groups_train int

Number of groups in the training split.

num_groups_validation int

Number of groups in the validation split.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
def __init__(
    self,
    group_training_examples_by: str,
    order_training_examples_by: str | None,
    dataset: Dataset,
    tokenizer: PreTrainedTokenizer,
    metadata: ModelMetadata,
    test_size: int | float | None = None,
    cache_file_path: str | Path | None = None,
    seed: int | None = None,
    keep_columns: list[str] | None = None,
    *args,
    **kwargs,
):
    if group_training_examples_by is None:
        raise ValueError("GroupedDataExampleAssembler created with no groupby columns set")

    self.group_by: list[str] = [group_training_examples_by]
    self.order_by: str | None = order_training_examples_by

    # GroupedDataExampleAssembler needs group_by and order_by columns for its processing.
    # Merge any caller-provided keep_columns with the required columns.
    required_columns = self.group_by.copy()
    if self.order_by is not None:
        required_columns.append(self.order_by)
    if keep_columns:
        required_columns = list(set(required_columns + keep_columns))

    # We need to split the dataset first so that the grouping column(s) are still present when we invoke
    # `utils.grouped_train_test_split`. After the split we tokenize and perform the (potentially expensive) grouping step independently for
    # train and test.
    if test_size is not None and test_size > 0:
        df_dataset = dataset.to_pandas()
        train_raw, test_raw = grouped_train_test_split(
            df_dataset,
            group_by=self.group_by[0],
            test_size=test_size,
            random_state=seed,
        )
        train_raw = Dataset.from_pandas(train_raw)
        if isinstance(test_raw, pd.DataFrame):
            test_raw = Dataset.from_pandas(test_raw)
            test_raw.info.description += "is_val"
    else:
        train_raw = dataset
        test_raw = None

    super().__init__(
        dataset=train_raw,
        tokenizer=tokenizer,
        metadata=metadata,
        keep_columns=required_columns,
        test_size=None,  # we already did the split
        cache_file_path=cache_file_path,
        seed=seed,
        *args,
        **kwargs,
    )

    # tokenize and preprocess the test set if it exists
    if test_raw is not None:
        tokenized_test = self._tokenize_dataset(test_raw, required_columns)
        processed_test = self._preprocess_before_splitting(tokenized_test)
        self.validation_dataset = processed_test
        self.validation_dataset.info.description += "is_val"
    else:
        self.validation_dataset = None
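The group-boundary split used above can be illustrated with a small hypothetical helper (the library delegates to grouped_train_test_split; this sketch only shows the invariant that whole groups land on one side of the split):

```python
# Sketch of a split along group boundaries: entire groups go to train OR
# test, never split across. split_by_group is a hypothetical helper
# illustrating the behavior of grouped_train_test_split.
import random

def split_by_group(rows, group_key, test_size, seed=None):
    groups = sorted({group_key(r) for r in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(test_size * len(groups)))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if group_key(r) not in test_groups]
    test = [r for r in rows if group_key(r) in test_groups]
    return train, test

rows = [{"g": g, "i": i} for g in "ABCD" for i in range(3)]
train, test = split_by_group(rows, lambda r: r["g"], test_size=0.25, seed=0)
# Exactly one of the four groups lands in the test split, intact.
assert len({r["g"] for r in test}) == 1 and len(test) == 3
assert {r["g"] for r in train}.isdisjoint({r["g"] for r in test})
```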

num_records_train property

Total number of individual records across all training groups.

num_records_validation property

Total number of individual records across all validation groups.

num_groups_train property

Number of groups in the training split.

num_groups_validation property

Number of groups in the validation split.

assemble_training_examples(data_fraction=1.0)

Build examples with grouped (and optionally ordered) records.

Parameters:

Name Type Description Default
data_fraction float

Fraction of the dataset to use for example generation.

1.0

Returns:

Type Description
TrainingExamples

A TrainingExamples object containing 🤗 Dataset objects for the train
and test sets of examples, along with associated statistics.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
def assemble_training_examples(self, data_fraction: float = 1.0) -> TrainingExamples:
    """Build examples with grouped (and optionally ordered) records.

    Args:
        data_fraction: Fraction of the dataset to use for example generation.

    Returns:
        A TrainingExamples object containing 🤗 Dataset objects for the train
        and test sets of examples, along with associated statistics.
    """
    logger.info(
        f"Assembling grouped examples from {data_fraction:.1%} of the input training records",
    )

    rng = utils.get_random_number_generator(self.seed)
    # Process both training and validation datasets
    training_dataset = self._prepare_dataset_for_training(self.train_dataset, data_fraction, rng)
    validation_dataset = self._prepare_dataset_for_training(self.validation_dataset, 1.0, rng)

    examples = TrainingExamples(
        train=training_dataset,
        test=validation_dataset,
        stats={
            "tokens_per_record": self.stats["tokens_per_record"],
            "tokens_per_group": self.stats["tokens_per_group"],
            "tokens_per_example": self.stats["tokens_per_example"],
            "records_per_example": self.stats["records_per_example"],
            "groups_per_example": self.stats["groups_per_example"],
        },
    )

    utils.log_training_example_stats(examples.stats)

    return examples