assembler

Classes:

Name Description
Example

A single training example containing a prompt and records.

TrainingExamples

Container for managing a dataset of training examples.

TrainingExampleAssembler

Base class for assembling LLM training examples.

TabularDataExampleAssembler

Assembler for standard tabular (non-grouped, non-sequential) data.

SequentialExampleAssembler

Assembler for sequential/time series data that preserves record ordering.

GroupedDataExampleAssembler

Grouped data example assembler.

Example(prompt, tokenizer, metadata)

A single training example containing a prompt and records.

A training example consists of a prompt followed by one or more sequences of records, where each sequence is (optionally) enclosed by the BOS and EOS special tokens. Tokens from the prompt are masked with label -100 so they are ignored during loss computation.
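The masking convention can be sketched with plain lists (the token IDs below are invented for illustration; only the -100 convention is from the library):

```python
# Sketch of the label-masking convention described above.
# Token IDs are made up; they do not come from a real tokenizer.
IGNORE_INDEX = -100  # PyTorch's CrossEntropyLoss ignores this label by default

prompt_ids = [101, 2054, 2003]     # pretend these encode the schema prompt
record_ids = [200, 201, 202, 203]  # pretend these encode one record sequence

input_ids = prompt_ids + record_ids
# Prompt tokens get label -100 so they contribute nothing to the loss;
# record tokens are labeled with themselves (standard causal-LM targets).
labels = [IGNORE_INDEX] * len(prompt_ids) + record_ids
attention_mask = [1] * len(input_ids)

assert labels == [-100, -100, -100, 200, 201, 202, 203]
```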

Parameters:

Name Type Description Default
prompt str

Schema prompt text prepended to every example.

required
tokenizer PreTrainedTokenizer

Tokenizer used to encode the prompt.

required
metadata ModelMetadata

Model metadata controlling special-token placement.

required

Methods:

Name Description
add_sequence

Add a sequence of records to the example.

to_dict

Convert the example to a dictionary format suitable for training.

Attributes:

Name Type Description
num_tokens int

Total number of tokens in this example (prompt + all sequences).

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
def __init__(
    self,
    prompt: str,
    tokenizer: PreTrainedTokenizer,
    metadata: ModelMetadata,
):
    self.prompt = prompt
    self.tokenizer = tokenizer
    self.metadata = metadata

    self.input_ids = self.tokenizer.encode(prompt, add_special_tokens=False)

    if self.metadata.prompt_config.add_bos_token_to_prompt:
        self.input_ids = [self.metadata.prompt_config.bos_token_id] + self.input_ids
    if self.metadata.prompt_config.add_eos_token_to_prompt:
        self.input_ids = self.input_ids + [self.metadata.prompt_config.eos_token_id]

    # We use -100 to ignore the prompt tokens when calculating the loss.
    self.labels = [-100] * len(self.input_ids)
    self.attention_mask = [1] * len(self.input_ids)

    self.num_sequences = 0

num_tokens property

Total number of tokens in this example (prompt + all sequences).

add_sequence(seq, add_special_tokens=True)

Add a sequence of records to the example.

Parameters:

Name Type Description Default
seq dict[str, list[int]]

Dictionary containing 'input_ids' and 'attention_mask' for the sequence.

required
add_special_tokens bool

Whether to add special tokens to the sequence.

True

Raises:

Type Description
GenerationError

If the number of tokens in the example exceeds the context length.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
def add_sequence(self, seq: dict[str, list[int]], add_special_tokens: bool = True) -> None:
    """Add a sequence of records to the example.

    Args:
        seq: Dictionary containing 'input_ids' and 'attention_mask' for the sequence.
        add_special_tokens: Whether to add special tokens to the sequence.

    Raises:
        GenerationError: If the number of tokens in the example exceeds the context length.
    """
    input_ids = (
        [self.metadata.prompt_config.bos_token_id] + seq["input_ids"] + [self.metadata.prompt_config.eos_token_id]
        if add_special_tokens
        else seq["input_ids"]
    )
    attention_mask = [1] + seq["attention_mask"] + [1] if add_special_tokens else seq["attention_mask"]
    self.input_ids.extend(input_ids)
    self.attention_mask.extend(attention_mask)
    self.labels.extend(input_ids)
    self.num_sequences += 1

    if self.num_tokens > self.metadata.max_seq_length:
        max_tokens_action = _get_max_tokens_action(self.metadata.rope_scaling_factor)
        msg = f"The number of tokens in an example exceeds the available context length. {max_tokens_action}"
        raise GenerationError(msg)

to_dict()

Convert the example to a dictionary format suitable for training.

Returns:

Type Description
dict[str, list]

A dictionary containing input_ids, attention_mask, and labels.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
def to_dict(self) -> dict[str, list]:
    """Convert the example to a dictionary format suitable for training.

    Returns:
        A dictionary containing ``input_ids``, ``attention_mask``, and ``labels``.
    """
    return {
        "input_ids": self.input_ids,
        "attention_mask": self.attention_mask,
        "labels": self.labels,
    }

TrainingExamples(train, stats, test=None) dataclass

Container for managing a dataset of training examples.

Attributes:

Name Type Description
train Dataset

🤗 Dataset of the training examples.

stats dict[str, Statistics]

Running statistics calculated during example construction.

test Dataset | None

🤗 Dataset of the test examples, if available.

TrainingExampleAssembler(dataset, tokenizer, metadata, keep_columns=None, test_size=None, cache_file_path=None, seed=None, *args, **kwargs)

Bases: ABC

Base class for assembling LLM training examples.

Subclasses of this class are responsible for converting a dataset into a format suitable for training / fine-tuning LLMs.

Parameters:

Name Type Description Default
dataset Dataset

Dataset to be processed.

required
tokenizer PreTrainedTokenizer

Tokenizer used for tokenizing the dataset records.

required
metadata ModelMetadata

Internal training configuration, e.g., prompt template, bos/eos tokens, and where to use them.

required
keep_columns list[str] | None

List of columns to keep in the tokenized dataset. This is useful if you need certain fields for subsequent processing (e.g., grouping).

None
test_size int | None

Absolute number of records you want in the test set. If None or 0, there will be no test set and hence no evaluation during training.

None
cache_file_path str | Path | None

Path to store the cached dataset for efficient data access.

None
seed int | None

Seed for the random number generator and train-test split.

None

Methods:

Name Description
assemble_training_examples

Build examples from the tokenized dataset.

from_data

Select and construct the appropriate assembler subclass from config.

Attributes:

Name Type Description
num_records_train int

Number of records in the training split.

num_records_validation int

Number of records in the validation split.

num_records_total int

Total number of records across training and validation splits.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
def __init__(
    self,
    dataset: Dataset,
    tokenizer: PreTrainedTokenizer,
    metadata: ModelMetadata,
    keep_columns: list[str] | None = None,
    test_size: int | None = None,
    cache_file_path: str | Path | None = None,
    seed: int | None = None,  # TODO: probably should include with metadata!
    *args,
    **kwargs,
):
    if test_size is not None and test_size > 0 and test_size > len(dataset) - TRAIN_SET_SIZE_BUFFER:
        msg = (
            "The test set size is too large compared to the input dataset. You must "
            f"have `test_set_size < len(dataset) - {TRAIN_SET_SIZE_BUFFER} records`. "
            f"You gave `test_set_size = {test_size}` and `len(dataset) = {len(dataset)}`. "
            "Please reduce the test set size or provide a larger dataset."
        )
        raise ParameterError(msg)

    self.metadata = metadata
    self.tokenizer = tokenizer
    self.stats = defaultdict(RunningStatistics)
    self.stats_val = defaultdict(RunningStatistics)
    # Fall back to the current working directory when no cache path is given:
    # 🤗 Datasets raises FileNotFoundError if the cache path parent is an empty string.
    fp = Path(cache_file_path) if cache_file_path else Path.cwd()
    self.cache_file_path = fp / f"{DEFAULT_CACHE_PREFIX}_{uuid.uuid4().hex[:5]}"
    self.test_size = test_size
    self.keep_columns = keep_columns or []
    self.seed = seed
    self._window_rng = None

    self.schema_prompt = utils.create_schema_prompt(
        dataset.column_names,
        instruction=metadata.instruction,
        prompt_template=metadata.prompt_config.template,
    )

    # The prompt IDs attribute does *not* include special tokens.
    self.schema_prompt_ids: list[int] = tokenizer(self.schema_prompt, add_special_tokens=False)["input_ids"]

    self.tokenized_records = self._tokenize_dataset(dataset, keep_columns)
    processed_dataset = self._preprocess_before_splitting(self.tokenized_records)
    self._apply_train_test_split(processed_dataset)

num_records_train abstractmethod property

Number of records in the training split.

num_records_validation abstractmethod property

Number of records in the validation split.

num_records_total property

Total number of records across training and validation splits.

assemble_training_examples(data_fraction=1.0) abstractmethod

Build examples from the tokenized dataset.

Parameters:

Name Type Description Default
data_fraction float

Fraction of the dataset to use for example generation.

1.0

Returns:

Type Description
TrainingExamples

A TrainingExamples object containing 🤗 Dataset objects for the train
and test sets of examples, along with associated statistics.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
@abstractmethod
def assemble_training_examples(self, data_fraction: float = 1.0) -> TrainingExamples:
    """Build examples from the tokenized dataset.

    Args:
        data_fraction: Fraction of the dataset to use for example generation.

    Returns:
        A TrainingExamples object containing 🤗 Dataset objects for the train
        and test sets of examples, along with associated statistics.
    """

from_data(dataset, tokenizer, metadata, config, test_size=None, seed=None, cache_file_path=None, keep_columns=None, **kwargs) classmethod

Select and construct the appropriate assembler subclass from config.

Returns a SequentialExampleAssembler for time-series data, a GroupedDataExampleAssembler when group_training_examples_by is set, or a TabularDataExampleAssembler otherwise.

Parameters:

Name Type Description Default
dataset Dataset

A HuggingFace datasets.Dataset of tabular records to assemble training examples from.

required
tokenizer PreTrainedTokenizer

Tokenizer used for encoding records.

required
metadata ModelMetadata

Model metadata (prompt config, sequence lengths, etc.).

required
config SafeSynthesizerParameters

Full pipeline configuration used to determine the assembler type.

required
test_size int | None

Fraction of the dataset to reserve for validation (0 <= test_size < 1).

None
seed int | None

Random seed for reproducibility.

None
cache_file_path str | Path | None

Path for caching intermediate datasets.

None
keep_columns list[str] | None

Columns to preserve through tokenization.

None
**kwargs

Forwarded to the chosen assembler constructor.

{}

Returns:

Type Description
GroupedDataExampleAssembler | TabularDataExampleAssembler | SequentialExampleAssembler

An assembler instance appropriate for the data type described by config.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
@classmethod
def from_data(
    cls,
    dataset: Dataset,
    tokenizer: PreTrainedTokenizer,
    metadata: ModelMetadata,
    config: SafeSynthesizerParameters,
    test_size: int | None = None,
    seed: int | None = None,
    cache_file_path: str | Path | None = None,
    keep_columns: list[str] | None = None,
    **kwargs,
) -> GroupedDataExampleAssembler | TabularDataExampleAssembler | SequentialExampleAssembler:
    """Select and construct the appropriate assembler subclass from config.

    Returns a ``SequentialExampleAssembler`` for time-series data, a
    ``GroupedDataExampleAssembler`` when ``group_training_examples_by``
    is set, or a ``TabularDataExampleAssembler`` otherwise.

    Args:
        dataset: A HuggingFace ``datasets.Dataset`` of tabular records to
            assemble training examples from.
        tokenizer: Tokenizer used for encoding records.
        metadata: Model metadata (prompt config, sequence lengths, etc.).
        config: Full pipeline configuration used to determine the assembler type.
        test_size: Fraction of the dataset to reserve for validation (0 <= test_size < 1).
        seed: Random seed for reproducibility.
        cache_file_path: Path for caching intermediate datasets.
        keep_columns: Columns to preserve through tokenization.
        **kwargs: Forwarded to the chosen assembler constructor.

    Returns:
        An assembler instance appropriate for the data type described by ``config``.
    """
    if config.time_series.is_timeseries:
        # group_by and order_by should be set by timeseries preprocessing
        # (adds pseudo-group if needed, sets order_by to timestamp column)
        group_by = config.data.group_training_examples_by
        order_by = config.data.order_training_examples_by
        if group_by is None or order_by is None:  # for type checking
            raise RuntimeError("Internal error: group_by and order_by should be set by timeseries preprocessing")

        return SequentialExampleAssembler(
            group_training_examples_by=group_by,
            order_training_examples_by=order_by,
            dataset=dataset,
            tokenizer=tokenizer,
            metadata=metadata,
            config=config,
            test_size=config.training.validation_ratio,
            seed=seed,
            cache_file_path=cache_file_path,
            keep_columns=keep_columns,
            **kwargs,
        )

    if config.data.group_training_examples_by is not None:
        return GroupedDataExampleAssembler(
            group_training_examples_by=config.data.group_training_examples_by,
            order_training_examples_by=config.data.order_training_examples_by,
            dataset=dataset,
            tokenizer=tokenizer,
            metadata=metadata,
            test_size=config.training.validation_ratio,
            seed=seed,
            cache_file_path=cache_file_path,
            keep_columns=keep_columns,
            **kwargs,
        )
    else:
        return TabularDataExampleAssembler(
            dataset=dataset,
            tokenizer=tokenizer,
            metadata=metadata,
            test_size=config.training.validation_ratio,
            seed=seed,
            cache_file_path=cache_file_path,
            keep_columns=keep_columns,
            **kwargs,
        )

TabularDataExampleAssembler(*args, **kwargs)

Bases: TrainingExampleAssembler

Assembler for standard tabular (non-grouped, non-sequential) data.

Records are shuffled and packed into examples that fill the model's context window. Each example contains a single sequence of concatenated records enclosed by BOS/EOS tokens.

Methods:

Name Description
assemble_training_examples

Build examples with randomly shuffled records.

Attributes:

Name Type Description
num_records_train int

Number of records in the training split.

num_records_validation int

Number of records in the validation split.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)

num_records_train property

Number of records in the training split.

num_records_validation property

Number of records in the validation split.

assemble_training_examples(data_fraction=1.0)

Build examples with randomly shuffled records.

Parameters:

Name Type Description Default
data_fraction float

Fraction of the dataset to use for example generation.

1.0

Returns:

Type Description
TrainingExamples

A TrainingExamples object containing 🤗 Dataset objects for the train
and test sets of examples, along with associated statistics.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
def assemble_training_examples(self, data_fraction: float = 1.0) -> TrainingExamples:
    """Build examples with randomly shuffled records.

    Args:
        data_fraction: Fraction of the dataset to use for example generation.

    Returns:
        A TrainingExamples object containing 🤗 Dataset objects for the train
        and test sets of examples, along with associated statistics.
    """
    logger.info(
        f"Assembling examples from {data_fraction:.1%} of the input records",
    )

    rng = utils.get_random_number_generator(self.seed)
    # Process both training and test datasets
    training_dataset = self._prepare_dataset_for_training(self.train_dataset, data_fraction, rng)
    validation_dataset = self._prepare_dataset_for_training(self.validation_dataset, 1.0, rng)

    examples = TrainingExamples(
        train=training_dataset,
        test=validation_dataset,
        stats={
            "tokens_per_record": self.stats["tokens_per_record"],
            "tokens_per_example": self.stats["tokens_per_example"],
            "records_per_example": self.stats["records_per_example"],
        },
    )

    utils.log_training_example_stats(examples.stats)

    return examples

SequentialExampleAssembler(dataset, tokenizer, metadata, *, group_training_examples_by, order_training_examples_by, keep_columns=None, **kwargs)

Bases: TabularDataExampleAssembler

Assembler for sequential/time series data that preserves record ordering.

This assembler extends TabularDataExampleAssembler to handle time series and other sequential data where record order is semantically meaningful. Unlike the base class, which shuffles records, this assembler maintains chronological order within groups and ensures each training example contains records from only one group.

Key Concepts
  • Order Preservation: Records are never shuffled. Within each group, records maintain their original order (typically chronological by timestamp).
  • Single-Group Examples: Each training example contains records from exactly one group. This ensures the model learns patterns within a group's sequence without cross-group contamination.
  • Sequence Continuation: When a group's records span multiple examples, the sequence continues naturally across example boundaries. The model sees (example1: records 0-99) then (example2: records 100-199) for the same group.
  • Pseudo-Group Handling: When no group column is specified, preprocessing adds a PSEUDO_GROUP_COLUMN so an ungrouped time series is treated as a single group. This unifies the grouped and ungrouped code paths.
  • Initial Prefill: For each group, the first 3 records are stored in model_metadata.initial_prefill as a dict mapping group_id -> prefill string. This is used by TimeseriesBackend during generation to seed each group's context.
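The initial-prefill concept can be sketched as follows (build_initial_prefill and PREFILL_RECORDS are hypothetical names; the real assembler stores the result on model_metadata.initial_prefill):

```python
# Sketch of the initial-prefill idea: map each group to its first few records,
# later used to seed that group's context during generation.
# PREFILL_RECORDS and build_initial_prefill are hypothetical names.
PREFILL_RECORDS = 3

def build_initial_prefill(records):
    """records: iterable of (group_id, record_str) in chronological order."""
    prefill: dict[str, list[str]] = {}
    for group_id, record in records:
        bucket = prefill.setdefault(group_id, [])
        if len(bucket) < PREFILL_RECORDS:
            bucket.append(record)
    return {g: "\n".join(rs) for g, rs in prefill.items()}

recs = [("A", "a1"), ("A", "a2"), ("A", "a3"), ("A", "a4"), ("B", "b1")]
prefill = build_initial_prefill(recs)
assert prefill["A"] == "a1\na2\na3"
assert prefill["B"] == "b1"
```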
Processing Flow
  1. Initialization:
     a. Validate that the group and order columns exist in the dataset
     b. Reorder columns: group_by first, order_by second, then the rest
     c. Build the keep_columns list to preserve group/order through tokenization
     d. Override schema_prompt to exclude PSEUDO_GROUP_COLUMN from the visible schema

  2. Train/Test Split (_apply_grouped_train_test_split):
     a. Split along group boundaries using GroupShuffleSplit
     b. Entire groups go to train OR validation, never split across
     c. Re-sort after the split (GroupShuffleSplit shuffles indices)
     d. Add a row-indices column for detecting dataset restart boundaries

  3. Dataset Preparation (_prepare_dataset_for_training):
     a. For data_fraction > 1, concatenate multiple passes of the dataset (no shuffling, just sequential duplication)
     b. Run example generation via _fill_context_with_records_generator

  4. Example Generation (_fill_context_with_records_generator):
     a. Iterate through records sequentially
     b. Track the token budget per example (randomized between MIN/MAX_FILL_RATIO)
     c. Flush the example when any boundary condition is met:
        • Group changes (record_group != current_group_value)
        • Dataset restarts (row_idx < prev_row_idx, from duplication wrap)
        • Token budget exceeded
        • Max sequences per example reached
     d. Each flushed example becomes one training sample
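Step 4 can be sketched as a small generator that flushes on a group change or an exhausted token budget (simplified: the real generator also handles dataset restarts and a maximum sequence count):

```python
# Sketch of the flush logic in step 4: pack sequentially ordered records into
# examples, flushing when the group changes or the token budget would be
# exceeded. Simplified relative to _fill_context_with_records_generator.
def pack_records(records, token_budget):
    """records: iterable of (group, n_tokens); yields lists of records."""
    current, group, used = [], None, 0
    for rec_group, n_tokens in records:
        if current and (rec_group != group or used + n_tokens > token_budget):
            yield current
            current, used = [], 0
        current.append((rec_group, n_tokens))
        group, used = rec_group, used + n_tokens
    if current:
        yield current

recs = [("A", 3), ("A", 3), ("A", 3), ("B", 3), ("B", 3)]
examples = list(pack_records(recs, token_budget=7))
# Group A splits across two examples; group B never mixes with A.
assert examples == [[("A", 3), ("A", 3)], [("A", 3)], [("B", 3), ("B", 3)]]
```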

Parameters:

Name Type Description Default
dataset Dataset

A HuggingFace datasets.Dataset of tabular records. Must contain the columns specified by group_training_examples_by and order_training_examples_by.

required
tokenizer PreTrainedTokenizer

Tokenizer used for encoding records.

required
metadata ModelMetadata

Model metadata containing prompt config, sequence lengths, etc.

required
group_training_examples_by str

Column to group training examples by. For time series without explicit grouping, this is set to PSEUDO_GROUP_COLUMN by the preprocessing step.

required
order_training_examples_by str

Column to order records within groups.

required
keep_columns list[str] | None

Columns to preserve through tokenization.

None
**kwargs

Additional arguments forwarded to TabularDataExampleAssembler.

{}

Attributes:

Name Type Description
group_by_column

Column name used to group records. For time series, this might be device_id, customer_id, etc. For ungrouped data, this is PSEUDO_GROUP_COLUMN added during preprocessing.

order_by_column

Column name used to order records within groups. Typically a timestamp column for time series data.

Example

For a dataset with 2 groups (A, B) and records ordered by timestamp:

  • Group A: records a1, a2, a3, a4, a5
  • Group B: records b1, b2, b3, b4

With token budget fitting ~3 records per example, output might be:

  • Example 1: [a1, a2, a3] (group A)
  • Example 2: [a4, a5] (group A, continues sequence)
  • Example 3: [b1, b2, b3] (group B)
  • Example 4: [b4] (group B, continues sequence)

Note: Examples never mix groups (no [a1, a2, b1]).

Methods:

Name Description
assemble_training_examples

Build examples preserving sequential order within groups.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
def __init__(
    self,
    dataset: Dataset,
    tokenizer: PreTrainedTokenizer,
    metadata: ModelMetadata,
    *,
    group_training_examples_by: str,
    order_training_examples_by: str,
    keep_columns: list[str] | None = None,
    **kwargs,
):
    self.group_by_column = group_training_examples_by
    self.order_by_column = order_training_examples_by

    self._validate_columns(dataset)
    dataset = self._reorder_columns(dataset)
    keep_columns = self._build_keep_columns(keep_columns)

    super().__init__(
        dataset=dataset,
        tokenizer=tokenizer,
        metadata=metadata,
        keep_columns=keep_columns,
        **kwargs,
    )

    self._build_schema_prompt_excluding_pseudo_group(dataset, metadata, tokenizer)

num_groups_train property

Number of unique groups in the training split.

num_groups_validation property

Number of unique groups in the validation split.

assemble_training_examples(data_fraction=1.0)

Build examples preserving sequential order within groups.

Parameters:

Name Type Description Default
data_fraction float

Fraction of the dataset to use for example generation.

1.0

Returns:

Type Description
TrainingExamples

A TrainingExamples object containing train/test datasets and statistics,
including the examples_per_group distribution.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
def assemble_training_examples(self, data_fraction: float = 1.0) -> TrainingExamples:
    """Build examples preserving sequential order within groups.

    Args:
        data_fraction: Fraction of the dataset to use for example generation.

    Returns:
        A TrainingExamples object containing train/test datasets and statistics,
        including the examples_per_group distribution.
    """
    logger.info(
        f"Assembling sequential examples from {data_fraction:.1%} of the input records",
    )

    rng = utils.get_random_number_generator(self.seed)
    training_dataset = self._prepare_dataset_for_training(self.train_dataset, data_fraction, rng)
    validation_dataset = self._prepare_dataset_for_training(self.validation_dataset, 1.0, rng)

    examples = TrainingExamples(
        train=training_dataset,
        test=validation_dataset,
        stats={
            "tokens_per_record": self.stats["tokens_per_record"],
            "tokens_per_example": self.stats["tokens_per_example"],
            "records_per_example": self.stats["records_per_example"],
            "examples_per_group": self.stats["examples_per_group"],
        },
    )

    utils.log_training_example_stats(examples.stats)

    return examples

GroupedDataExampleAssembler(group_training_examples_by, order_training_examples_by, dataset, tokenizer, metadata, test_size=None, cache_file_path=None, seed=None, keep_columns=None, *args, **kwargs)

Bases: TrainingExampleAssembler

Grouped data example assembler.

Parameters:

Name Type Description Default
group_training_examples_by str

Column to group training examples by.

required
order_training_examples_by str | None

Column to order training examples by.

required
dataset Dataset

Dataset to be processed.

required
tokenizer PreTrainedTokenizer

Tokenizer used for tokenizing the dataset records.

required
metadata ModelMetadata

Training configuration, e.g., group by, order by, prompt template, BOS/EOS tokens, and where to use them.

required
test_size int | float | None

Fraction of the dataset to use for testing. If None, there will be no test set and hence no evaluation during training.

None
cache_file_path str | Path | None

Path to store the cached dataset for efficient data access.

None
seed int | None

Seed for the random number generator and train-test split.

None

Methods:

Name Description
assemble_training_examples

Build examples with grouped (and optionally ordered) records.

Attributes:

Name Type Description
num_records_train int

Total number of individual records across all training groups.

num_records_validation int

Total number of individual records across all validation groups.

num_groups_train int

Number of groups in the training split.

num_groups_validation int

Number of groups in the validation split.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
def __init__(
    self,
    group_training_examples_by: str,
    order_training_examples_by: str | None,
    dataset: Dataset,
    tokenizer: PreTrainedTokenizer,
    metadata: ModelMetadata,
    test_size: int | float | None = None,
    cache_file_path: str | Path | None = None,
    seed: int | None = None,
    keep_columns: list[str] | None = None,
    *args,
    **kwargs,
):
    if group_training_examples_by is None:
        raise ValueError("GroupedDataExampleAssembler created with no groupby columns set")

    self.group_by: list[str] = [group_training_examples_by]
    self.order_by: str | None = order_training_examples_by

    # GroupedDataExampleAssembler needs group_by and order_by columns for its processing.
    # Merge any caller-provided keep_columns with the required columns.
    required_columns = self.group_by.copy()
    if self.order_by is not None:
        required_columns.append(self.order_by)
    if keep_columns:
        required_columns = list(set(required_columns + keep_columns))

    # We need to split the dataset first so that the grouping column(s) are still present when we invoke
    # `utils.grouped_train_test_split`. After the split we tokenize and perform the (potentially expensive) grouping step independently for
    # train and test.
    if test_size is not None and test_size > 0:
        df_dataset = dataset.to_pandas()
        train_raw, test_raw = grouped_train_test_split(
            df_dataset,
            group_by=self.group_by[0],
            test_size=test_size,
            random_state=seed,
        )
        train_raw = Dataset.from_pandas(train_raw)
        if isinstance(test_raw, pd.DataFrame):
            test_raw = Dataset.from_pandas(test_raw)
            test_raw.info.description += "is_val"
    else:
        train_raw = dataset
        test_raw = None

    super().__init__(
        dataset=train_raw,
        tokenizer=tokenizer,
        metadata=metadata,
        keep_columns=required_columns,
        test_size=None,  # we already did the split
        cache_file_path=cache_file_path,
        seed=seed,
        *args,
        **kwargs,
    )

    # tokenize and preprocess the test set if it exists
    if test_raw is not None:
        tokenized_test = self._tokenize_dataset(test_raw, required_columns)
        processed_test = self._preprocess_before_splitting(tokenized_test)
        self.validation_dataset = processed_test
        self.validation_dataset.info.description += "is_val"
    else:
        self.validation_dataset = None
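The group-boundary split used above can be illustrated with a small hypothetical helper (the library delegates to grouped_train_test_split; this sketch only shows the invariant that whole groups land on one side of the split):

```python
# Sketch of a split along group boundaries: entire groups go to train OR
# test, never split across. split_by_group is a hypothetical helper
# illustrating the behavior of grouped_train_test_split.
import random

def split_by_group(rows, group_key, test_size, seed=None):
    groups = sorted({group_key(r) for r in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(test_size * len(groups)))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if group_key(r) not in test_groups]
    test = [r for r in rows if group_key(r) in test_groups]
    return train, test

rows = [{"g": g, "i": i} for g in "ABCD" for i in range(3)]
train, test = split_by_group(rows, lambda r: r["g"], test_size=0.25, seed=0)
# Exactly one of the four groups lands in the test split, intact.
assert len({r["g"] for r in test}) == 1 and len(test) == 3
assert {r["g"] for r in train}.isdisjoint({r["g"] for r in test})
```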

num_records_train property

Total number of individual records across all training groups.

num_records_validation property

Total number of individual records across all validation groups.

num_groups_train property

Number of groups in the training split.

num_groups_validation property

Number of groups in the validation split.

assemble_training_examples(data_fraction=1.0)

Build examples with grouped (and optionally ordered) records.

Parameters:

Name Type Description Default
data_fraction float

Fraction of the dataset to use for example generation.

1.0

Returns:

Type Description
TrainingExamples

A TrainingExamples object containing 🤗 Dataset objects for the train
and test sets of examples, along with associated statistics.

Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
def assemble_training_examples(self, data_fraction: float = 1.0) -> TrainingExamples:
    """Build examples with grouped (and optionally ordered) records.

    Args:
        data_fraction: Fraction of the dataset to use for example generation.

    Returns:
        A TrainingExamples object containing 🤗 Dataset objects for the train
        and test sets of examples, along with associated statistics.
    """
    logger.info(
        f"Assembling grouped examples from {data_fraction:.1%} of the input training records",
    )

    rng = utils.get_random_number_generator(self.seed)
    # Process both training and validation datasets
    training_dataset = self._prepare_dataset_for_training(self.train_dataset, data_fraction, rng)
    validation_dataset = self._prepare_dataset_for_training(self.validation_dataset, 1.0, rng)

    examples = TrainingExamples(
        train=training_dataset,
        test=validation_dataset,
        stats={
            "tokens_per_record": self.stats["tokens_per_record"],
            "tokens_per_group": self.stats["tokens_per_group"],
            "tokens_per_example": self.stats["tokens_per_example"],
            "records_per_example": self.stats["records_per_example"],
            "groups_per_example": self.stats["groups_per_example"],
        },
    )

    utils.log_training_example_stats(examples.stats)

    return examples