# assembler
Classes:

| Name | Description |
|---|---|
| `Example` | A single training example containing a prompt and records. |
| `TrainingExamples` | Container for managing a dataset of training examples. |
| `TrainingExampleAssembler` | Base class for assembling LLM training examples. |
| `TabularDataExampleAssembler` | Assembler for standard tabular (non-grouped, non-sequential) data. |
| `SequentialExampleAssembler` | Assembler for sequential/time series data that preserves record ordering. |
| `GroupedDataExampleAssembler` | Grouped data example assembler. |
## Example(prompt, tokenizer, metadata)
A single training example containing a prompt and records.
A training example consists of a prompt followed by one or more
sequences of records, where each sequence is (optionally) enclosed
by the BOS and EOS special tokens. Tokens from the prompt are masked
with label -100 so they are ignored during loss computation.
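The masking described above can be sketched in plain Python. This is an illustrative stand-in, not the library's actual implementation: prompt positions get label -100 (the value ignored by cross-entropy loss in 🤗 Transformers), while record positions keep their token ids as labels.

```python
# Illustrative sketch (not the library's code): prompt tokens are labeled
# -100 so the loss ignores them; record tokens keep their ids as labels.
def build_labels(prompt_ids: list[int], record_ids: list[int]) -> list[int]:
    """Mask the prompt with -100; keep record token ids as labels."""
    return [-100] * len(prompt_ids) + list(record_ids)

labels = build_labels([101, 2023, 102], [7592, 2088])
# → [-100, -100, -100, 7592, 2088]
```

Only the record positions contribute to the training loss; the schema prompt is context the model conditions on but is never trained to reproduce.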
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `prompt` | `str` | Schema prompt text prepended to every example. | required |
| `tokenizer` | `PreTrainedTokenizer` | Tokenizer used to encode the prompt. | required |
| `metadata` | `ModelMetadata` | Model metadata controlling special-token placement. | required |
Methods:

| Name | Description |
|---|---|
| `add_sequence` | Add a sequence of records to the example. |
| `to_dict` | Convert the example to a dictionary format suitable for training. |

Attributes:

| Name | Type | Description |
|---|---|---|
| `num_tokens` | `int` | Total number of tokens in this example (prompt + all sequences). |
Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
### num_tokens (property)

Total number of tokens in this example (prompt + all sequences).
### add_sequence(seq, add_special_tokens=True)
Add a sequence of records to the example.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `seq` | `dict[str, list[int]]` | Dictionary containing 'input_ids' and 'attention_mask' for the sequence. | required |
| `add_special_tokens` | `bool` | Whether to add special tokens to the sequence. | `True` |

Raises:

| Type | Description |
|---|---|
| `GenerationError` | If the number of tokens in the example exceeds the context length. |
Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
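A minimal sketch of the context-length guard described above. The class and error below are simplified stand-ins (the real `Example` and `GenerationError` live in the library and carry more state); only the overflow check is illustrated.

```python
# Hypothetical sketch of the add_sequence context-length guard; the real
# implementation also handles attention masks, labels, and special tokens.
class GenerationError(Exception):
    pass

class ExampleSketch:
    def __init__(self, prompt_ids: list[int], context_length: int):
        self.input_ids = list(prompt_ids)
        self.context_length = context_length

    @property
    def num_tokens(self) -> int:
        return len(self.input_ids)

    def add_sequence(self, seq: dict[str, list[int]]) -> None:
        # Reject a sequence that would push the example past the context window.
        if self.num_tokens + len(seq["input_ids"]) > self.context_length:
            raise GenerationError("example exceeds context length")
        self.input_ids.extend(seq["input_ids"])
```

Callers are expected to catch the error (or size sequences up front) so that a single oversized record never silently truncates.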
### to_dict()
Convert the example to a dictionary format suitable for training.
Returns:

| Type | Description |
|---|---|
| `dict[str, list]` | A dictionary containing |
Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
## TrainingExamples(train, stats, test=None) (dataclass)
Container for managing a dataset of training examples.
Attributes:

| Name | Type | Description |
|---|---|---|
| `train` | `Dataset` | 🤗 Dataset of the training examples. |
| `stats` | `dict[str, Statistics]` | Running statistics calculated during example construction. |
| `test` | `Dataset \| None` | 🤗 Dataset of the test examples, if available. |
## TrainingExampleAssembler(dataset, tokenizer, metadata, keep_columns=None, test_size=None, cache_file_path=None, seed=None, *args, **kwargs)
Bases: ABC
Base class for assembling LLM training examples.
Subclasses of this class are responsible for converting a dataset into a format suitable for training / fine-tuning LLMs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `Dataset` | Dataset to be processed. | required |
| `tokenizer` | `PreTrainedTokenizer` | Tokenizer used for tokenizing the dataset records. | required |
| `metadata` | `ModelMetadata` | Internal training configuration, e.g., prompt template, bos/eos tokens, and where to use them. | required |
| `keep_columns` | `list[str] \| None` | List of columns to keep in the tokenized dataset. This is useful if you need certain fields for subsequent processing (e.g., grouping). | `None` |
| `test_size` | `int \| None` | Absolute number of records you want in the test set. If None or 0, there will be no test set and hence no evaluation during training. | `None` |
| `cache_file_path` | `str \| Path \| None` | Path to store the cached dataset for efficient data access. | `None` |
| `seed` | `int \| None` | Seed for the random number generator and train-test split. | `None` |
Methods:

| Name | Description |
|---|---|
| `assemble_training_examples` | Build examples from the tokenized dataset. |
| `from_data` | Select and construct the appropriate assembler subclass from config. |

Attributes:

| Name | Type | Description |
|---|---|---|
| `num_records_train` | `int` | Number of records in the training split. |
| `num_records_validation` | `int` | Number of records in the validation split. |
| `num_records_total` | `int` | Total number of records across training and validation splits. |
Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
### num_records_train (abstract property)

Number of records in the training split.

### num_records_validation (abstract property)

Number of records in the validation split.

### num_records_total (property)

Total number of records across training and validation splits.
### assemble_training_examples(data_fraction=1.0) (abstract method)
Build examples from the tokenized dataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data_fraction` | `float` | Fraction of the dataset to use for example generation. | `1.0` |
Returns:

| Type | Description |
|---|---|
| `TrainingExamples` | TrainingExamples object containing 🤗 Dataset objects for the train and test sets of examples, as well as an object with associated statistics. |
Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
### from_data(dataset, tokenizer, metadata, config, test_size=None, seed=None, cache_file_path=None, keep_columns=None, **kwargs) (classmethod)
Select and construct the appropriate assembler subclass from config.
Returns a SequentialExampleAssembler for time-series data, a
GroupedDataExampleAssembler when group_training_examples_by
is set, or a TabularDataExampleAssembler otherwise.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `Dataset` | A HuggingFace `Dataset`. | required |
| `tokenizer` | `PreTrainedTokenizer` | Tokenizer used for encoding records. | required |
| `metadata` | `ModelMetadata` | Model metadata (prompt config, sequence lengths, etc.). | required |
| `config` | `SafeSynthesizerParameters` | Full pipeline configuration used to determine the assembler type. | required |
| `test_size` | `int \| None` | Fraction of the dataset to reserve for validation (0 <= test_size < 1). | `None` |
| `seed` | `int \| None` | Random seed for reproducibility. | `None` |
| `cache_file_path` | `str \| Path \| None` | Path for caching intermediate datasets. | `None` |
| `keep_columns` | `list[str] \| None` | Columns to preserve through tokenization. | `None` |
| `**kwargs` | | Forwarded to the chosen assembler constructor. | `{}` |
Returns:

| Type | Description |
|---|---|
| `GroupedDataExampleAssembler \| TabularDataExampleAssembler \| SequentialExampleAssembler` | An assembler instance appropriate for the data type described by `config`. |
Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
## TabularDataExampleAssembler(*args, **kwargs)
Bases: TrainingExampleAssembler
Assembler for standard tabular (non-grouped, non-sequential) data.
Records are shuffled and packed into examples that fill the model's context window. Each example contains a single sequence of concatenated records enclosed by BOS/EOS tokens.
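The shuffle-and-pack step can be illustrated with a small greedy packer. This is a sketch, not the library's code: a per-record token count stands in for tokenized records, and a fixed `budget` stands in for the model's context window.

```python
import random

# Illustrative sketch of shuffle-and-pack (not the library's implementation):
# shuffle record indices, then greedily fill each example up to a token budget.
def pack_records(record_token_counts: list[int],
                 budget: int, seed: int = 0) -> list[list[int]]:
    rng = random.Random(seed)
    order = list(range(len(record_token_counts)))
    rng.shuffle(order)  # tabular records carry no order, so shuffling is safe
    examples: list[list[int]] = [[]]
    used = 0
    for idx in order:
        n = record_token_counts[idx]
        if used + n > budget and examples[-1]:
            examples.append([])  # start a new example when the budget is hit
            used = 0
        examples[-1].append(idx)
        used += n
    return examples
```

Every record lands in exactly one example, and each example stays within the budget (except a single record that is itself oversized, which the real assembler rejects instead).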
Methods:

| Name | Description |
|---|---|
| `assemble_training_examples` | Build examples with randomly shuffled records. |

Attributes:

| Name | Type | Description |
|---|---|---|
| `num_records_train` | `int` | Number of records in the training split. |
| `num_records_validation` | `int` | Number of records in the validation split. |
Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
### num_records_train (property)

Number of records in the training split.

### num_records_validation (property)

Number of records in the validation split.
### assemble_training_examples(data_fraction=1.0)
Build examples with randomly shuffled records.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data_fraction` | `float` | Fraction of the dataset to use for example generation. | `1.0` |
Returns:

| Type | Description |
|---|---|
| `TrainingExamples` | TrainingExamples object containing 🤗 Dataset objects for the train and test sets of examples, as well as an object with associated statistics. |
Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
## SequentialExampleAssembler(dataset, tokenizer, metadata, *, group_training_examples_by, order_training_examples_by, keep_columns=None, **kwargs)
Bases: TabularDataExampleAssembler
Assembler for sequential/time series data that preserves record ordering.
This assembler extends TabularDataExampleAssembler to handle time series and other sequential data where record order is semantically meaningful. Unlike the base class which shuffles records, this assembler maintains chronological order within groups and ensures each training example contains records from only one group.
Key Concepts
- Order Preservation: Records are never shuffled. Within each group, records maintain their original order (typically chronological by timestamp).
- Single-Group Examples: Each training example contains records from exactly one group. This ensures the model learns patterns within a group's sequence without cross-group contamination.
- Sequence Continuation: When a group's records span multiple examples, the sequence continues naturally across example boundaries. The model sees (example1: records 0-99) then (example2: records 100-199) for the same group.
- Pseudo-Group Handling: When no group column is specified, preprocessing adds a PSEUDO_GROUP_COLUMN so ungrouped time series is treated as a single group. This unifies the grouped and ungrouped code paths.
- Initial Prefill: For each group, the first 3 records are stored in `model_metadata.initial_prefill` as a dict mapping group_id -> prefill string. This is used by TimeseriesBackend during generation to seed each group's context.
Processing Flow

1. Initialization:
   a. Validate that the group and order columns exist in the dataset.
   b. Reorder columns: group_by first, order_by second, then the rest.
   c. Build the keep_columns list to preserve group/order through tokenization.
   d. Override schema_prompt to exclude PSEUDO_GROUP_COLUMN from the visible schema.
2. Train/Test Split (_apply_grouped_train_test_split):
   a. Split along group boundaries using GroupShuffleSplit.
   b. Entire groups go to train OR validation, never split across.
   c. Re-sort after the split (GroupShuffleSplit shuffles indices).
   d. Add a row-indices column for detecting dataset restart boundaries.
3. Dataset Preparation (_prepare_dataset_for_training):
   a. For data_fraction > 1, concatenate multiple passes of the dataset (no shuffling, just sequential duplication).
   b. Run example generation via _fill_context_with_records_generator.
4. Example Generation (_fill_context_with_records_generator):
   a. Iterate through records sequentially.
   b. Track the token budget per example (randomized between MIN/MAX_FILL_RATIO).
   c. Flush the example when any boundary condition is met:
      - Group changes (record_group != current_group_value)
      - Dataset restarts (row_idx < prev_row_idx, from duplication wrap)
      - Token budget exceeded
      - Max sequences per example reached
   d. Each flushed example becomes one training sample.
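The flush conditions in step 4 can be sketched as a generator. This is an illustration under assumptions: records are modeled as `(group, row_idx, n_tokens)` tuples, the budget is fixed rather than randomized, and the max-sequences cap is omitted.

```python
# Sketch of _fill_context_with_records_generator's boundary logic (assumed
# record shape: (group, row_idx, n_tokens)); the real generator also
# randomizes the budget and caps sequences per example.
def fill_context(records, budget: int):
    example, used = [], 0
    cur_group, prev_row = None, -1
    for group, row_idx, n_tokens in records:
        restart = row_idx < prev_row                      # dataset wrapped around
        group_changed = cur_group is not None and group != cur_group
        over_budget = used + n_tokens > budget
        if example and (group_changed or restart or over_budget):
            yield example                                 # flush current example
            example, used = [], 0
        example.append((group, row_idx))
        used += n_tokens
        cur_group, prev_row = group, row_idx
    if example:
        yield example                                     # flush the tail
```

Because records are never reordered, consecutive flushed examples for the same group continue its sequence across example boundaries.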
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `Dataset` | A HuggingFace `Dataset`. | required |
| `tokenizer` | `PreTrainedTokenizer` | Tokenizer used for encoding records. | required |
| `metadata` | `ModelMetadata` | Model metadata containing prompt config, sequence lengths, etc. | required |
| `group_training_examples_by` | `str` | Column to group training examples by. For time series without explicit grouping, this is set to PSEUDO_GROUP_COLUMN. | required |
| `order_training_examples_by` | `str` | Column to order records within groups. | required |
| `keep_columns` | `list[str] \| None` | Columns to preserve through tokenization. | `None` |
| `**kwargs` | | Additional arguments forwarded to `TabularDataExampleAssembler`. | `{}` |
Attributes:

| Name | Type | Description |
|---|---|---|
| `group_by_column` | | Column name used to group records. For time series, this might be PSEUDO_GROUP_COLUMN. |
| `order_by_column` | | Column name used to order records within groups. Typically a timestamp column for time series data. |
Example
For a dataset with 2 groups (A, B) and records ordered by timestamp:
- Group A: records a1, a2, a3, a4, a5
- Group B: records b1, b2, b3, b4
With token budget fitting ~3 records per example, output might be:
- Example 1: [a1, a2, a3] (group A)
- Example 2: [a4, a5] (group A, continues sequence)
- Example 3: [b1, b2, b3] (group B)
- Example 4: [b4] (group B, continues sequence)
Note: Examples never mix groups (no [a1, a2, b1]).
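The worked example above can be reproduced with a tiny packer. Note the simplification: a fixed cap of 3 records per example stands in for the token budget, and the function is an illustration rather than the library's code.

```python
# Tiny demo of the worked example: flush on group change or when the
# per-example record cap (a stand-in for the token budget) is reached.
def pack_by_group(records: list[tuple[str, str]],
                  per_example: int = 3) -> list[list[str]]:
    examples: list[list[str]] = []
    current: list[str] = []
    current_group: str | None = None
    for group, rec in records:
        if current and (group != current_group or len(current) >= per_example):
            examples.append(current)  # never mix groups in one example
            current = []
        current.append(rec)
        current_group = group
    if current:
        examples.append(current)
    return examples

data = [("A", r) for r in ["a1", "a2", "a3", "a4", "a5"]] + \
       [("B", r) for r in ["b1", "b2", "b3", "b4"]]
# → [["a1", "a2", "a3"], ["a4", "a5"], ["b1", "b2", "b3"], ["b4"]]
```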
Methods:

| Name | Description |
|---|---|
| `assemble_training_examples` | Build examples preserving sequential order within groups. |
Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
### num_groups_train (property)

Number of unique groups in the training split.

### num_groups_validation (property)

Number of unique groups in the validation split.
### assemble_training_examples(data_fraction=1.0)
Build examples preserving sequential order within groups.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data_fraction` | `float` | Fraction of the dataset to use for example generation. | `1.0` |
Returns:

| Type | Description |
|---|---|
| `TrainingExamples` | TrainingExamples object containing train/test datasets and statistics, including the examples_per_group distribution. |
Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
## GroupedDataExampleAssembler(group_training_examples_by, order_training_examples_by, dataset, tokenizer, metadata, test_size=None, cache_file_path=None, seed=None, keep_columns=None, *args, **kwargs)
Bases: TrainingExampleAssembler
Grouped data example assembler.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `group_training_examples_by` | `str` | Column to group training examples by. | required |
| `order_training_examples_by` | `str \| None` | Column to order training examples by. | required |
| `dataset` | `Dataset` | Dataset to be processed. | required |
| `tokenizer` | `PreTrainedTokenizer` | Tokenizer used for tokenizing the dataset records. | required |
| `metadata` | `ModelMetadata` | Training configuration, e.g., group by, order by, prompt template, bos/eos tokens, and where to use them. | required |
| `test_size` | `int \| float \| None` | Fraction of the dataset to use for testing. If None, there will be no test set and hence no evaluation during training. | `None` |
| `cache_file_path` | `str \| Path \| None` | Path to store the cached dataset for efficient data access. | `None` |
| `seed` | `int \| None` | Seed for the random number generator and train-test split. | `None` |
Methods:

| Name | Description |
|---|---|
| `assemble_training_examples` | Build examples with grouped (and optionally ordered) records. |

Attributes:

| Name | Type | Description |
|---|---|---|
| `num_records_train` | `int` | Total number of individual records across all training groups. |
| `num_records_validation` | `int` | Total number of individual records across all validation groups. |
| `num_groups_train` | `int` | Number of groups in the training split. |
| `num_groups_validation` | `int` | Number of groups in the validation split. |
Source code in src/nemo_safe_synthesizer/data_processing/assembler.py
### num_records_train (property)

Total number of individual records across all training groups.

### num_records_validation (property)

Total number of individual records across all validation groups.

### num_groups_train (property)

Number of groups in the training split.

### num_groups_validation (property)

Number of groups in the validation split.
### assemble_training_examples(data_fraction=1.0)
Build examples with grouped (and optionally ordered) records.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data_fraction` | `float` | Fraction of the dataset to use for example generation. | `1.0` |
Returns:

| Type | Description |
|---|---|
| `TrainingExamples` | TrainingExamples object containing 🤗 Dataset objects for the train and test sets of examples, as well as an object with associated statistics. |