budget
Shared token budget computation used by both the assembler and preflight.
Functions:

| Name | Description |
|---|---|
| `compute_schema_prompt_ids` | Tokenize the full schema prompt using the same path as the assembler. |
| `compute_max_new_tokens` | Max tokens available for record content after schema and special tokens. |
| `tokenize_record` | Tokenize a single record using the same JSONL serialization as the assembler. |
| `tokenize_records` | Tokenize multiple records using shared JSONL serialization. |

compute_schema_prompt_ids(columns, metadata, *, exclude_columns=())
Tokenize the full schema prompt using the same path as the assembler.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `columns` | `list[str]` | Column names. | required |
| `metadata` | `ModelMetadata` | Model metadata with tokenizer, instruction, and prompt config. | required |
| `exclude_columns` | `Sequence[str]` | Column names to omit from the schema prompt. | `()` |

Returns:

| Type | Description |
|---|---|
| `list[int]` | Token IDs for the schema prompt (no special tokens). |

Source code in src/nemo_safe_synthesizer/data_processing/budget.py
compute_max_new_tokens(schema_prompt_ids, max_seq_length)
Max tokens available for record content after schema and special tokens.
Uses the same formula as `assembler._tokenize_records`:
`max_seq_length - len(schema_prompt_ids) - 2 * NUM_SPECIAL_TOKENS`.
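
The formula above can be worked through with a small standalone sketch. It does not import the library: `max_new_tokens_sketch` is a hypothetical stand-in, and the value `1` for `NUM_SPECIAL_TOKENS` is an assumption made only for illustration (the real constant is defined in the package and may differ). The sketch does not clamp the result, so a schema prompt longer than the sequence length would yield a negative budget here; how the library handles that case is not stated in this page.

```python
# Assumption: placeholder value for illustration; the real constant lives in the library.
NUM_SPECIAL_TOKENS = 1


def max_new_tokens_sketch(schema_prompt_ids: list[int], max_seq_length: int) -> int:
    """Apply the documented formula to a schema prompt and a sequence length."""
    return max_seq_length - len(schema_prompt_ids) - 2 * NUM_SPECIAL_TOKENS


# A made-up schema prompt of 120 token IDs and a 2048-token context window.
schema_ids = list(range(120))
budget = max_new_tokens_sketch(schema_ids, 2048)

# A record fits if its token count does not exceed the remaining budget.
record_token_ids = [[0] * 1000, [0] * 2000]
fits = [len(ids) <= budget for ids in record_token_ids]
```

With these made-up numbers, the budget is 2048 - 120 - 2 = 1926 tokens, so the 1000-token record fits and the 2000-token record does not.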
tokenize_record(row, tokenizer)
Tokenize a single record using the same JSONL serialization as the assembler.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `row` | `Series` | A single DataFrame row. | required |
| `tokenizer` | `Any` | HuggingFace tokenizer instance. | required |

Returns:

| Type | Description |
|---|---|
| `list[int]` | Token IDs for the record (no special tokens). |

tokenize_records(df, tokenizer, *, exclude_columns=())
Tokenize multiple records using shared JSONL serialization.
Uses batch tokenization when available, and falls back to per-record
encode() for tokenizers that only expose single-record APIs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame whose rows represent records to tokenize. | required |
| `tokenizer` | `PreTrainedTokenizerBase` | HuggingFace tokenizer instance. | required |
| `exclude_columns` | `Sequence[str]` | Column names to omit from serialized records. | `()` |

Returns:

| Type | Description |
|---|---|
| `list[list[int]]` | List of token-id lists, one per input row. |

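
The batch-then-fallback behavior described for `tokenize_records` might look roughly like the sketch below. It is not the library's implementation. Records are plain dicts rather than DataFrame rows, `json.dumps` stands in for the assembler's JSONL serialization (whose exact format is not shown on this page), and the callable check is only one plausible way to detect batch support. `serialize_record`, `tokenize_texts`, and both toy tokenizer classes are hypothetical names introduced for illustration.

```python
import json
from typing import Any, Sequence


def serialize_record(record: dict[str, Any], exclude_columns: Sequence[str] = ()) -> str:
    """Serialize one record as a JSON line, omitting excluded columns (assumed format)."""
    kept = {key: value for key, value in record.items() if key not in exclude_columns}
    return json.dumps(kept)


def tokenize_texts(texts: list[str], tokenizer: Any) -> list[list[int]]:
    """Use the batch call when the tokenizer is callable, else encode one text at a time."""
    if callable(tokenizer):
        # HuggingFace-style batch call: returns a mapping with an "input_ids" entry.
        encoded = tokenizer(texts, add_special_tokens=False)
        return [list(ids) for ids in encoded["input_ids"]]
    # Fallback for tokenizers that only expose a single-record encode().
    return [list(tokenizer.encode(text, add_special_tokens=False)) for text in texts]


class EncodeOnlyTokenizer:
    """Toy stand-in: one token ID per character, single-record API only."""

    def encode(self, text: str, add_special_tokens: bool = False) -> list[int]:
        return [ord(char) for char in text]


class BatchTokenizer(EncodeOnlyTokenizer):
    """Toy stand-in that also supports batch calls."""

    def __call__(self, texts: list[str], add_special_tokens: bool = False) -> dict:
        return {"input_ids": [self.encode(text) for text in texts]}


records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
texts = [serialize_record(record, exclude_columns=("id",)) for record in records]

batch_ids = tokenize_texts(texts, BatchTokenizer())
single_ids = tokenize_texts(texts, EncodeOnlyTokenizer())
```

Both paths yield one token-id list per input record and omit special tokens. That keeps the counts comparable with the budget from `compute_max_new_tokens`, which already reserves space for special tokens.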