generate
Classes:
| Name | Description |
|---|---|
| ValidationParameters | Configuration for record and sequence validation. |
| GenerateParameters | Configuration parameters for synthetic data generation. |
ValidationParameters
pydantic-model
Bases: Parameters, BaseModel
Configuration for record and sequence validation.
These parameters control validation and the automatic fixes applied when converting LLM output to tabular data.
Fields:
- group_by_accept_no_delineator (bool)
- group_by_ignore_invalid_records (bool)
- group_by_fix_non_unique_value (bool)
- group_by_fix_unordered_records (bool)
group_by_accept_no_delineator
pydantic-field
Whether to accept a completion without both the beginning-of-sequence and end-of-sequence delineators as a single sequence.
group_by_ignore_invalid_records
pydantic-field
Whether to ignore invalid records in a sequence and proceed with the valid records.
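As a rough illustration of what this flag might govern, the sketch below filters a sequence with a caller-supplied validity check. The function name `filter_sequence`, the `is_valid` callback, and the choice to reject the whole sequence when the flag is off are all assumptions for illustration, not the library's implementation.

```python
from typing import Any, Callable, Optional

Record = dict[str, Any]


def filter_sequence(
    records: list[Record],
    is_valid: Callable[[Record], bool],
    ignore_invalid_records: bool,
) -> Optional[list[Record]]:
    """Return the usable records, or None when the sequence is rejected."""
    valid = [record for record in records if is_valid(record)]
    if len(valid) == len(records) or ignore_invalid_records:
        return valid
    # Assumed behavior when the flag is off: reject the whole sequence.
    return None


def has_positive_amount(record: Record) -> bool:
    """Hypothetical validity rule used only for this example."""
    amount = record.get("amount")
    return isinstance(amount, (int, float)) and amount > 0


sequence = [{"amount": 10}, {"amount": -5}, {"amount": 3}]
kept = filter_sequence(sequence, has_positive_amount, ignore_invalid_records=True)
rejected = filter_sequence(sequence, has_positive_amount, ignore_invalid_records=False)
print(kept)      # [{'amount': 10}, {'amount': 3}]
print(rejected)  # None
```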
group_by_fix_non_unique_value
pydantic-field
Whether to automatically fix non-unique group-by values in a sequence by using the first unique value for all records.
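One plausible reading of this fix, sketched with plain dictionaries: the group-by value of the first record in the sequence is copied onto every record. The helper name `fix_non_unique_group_value` and the column name `patient_id` are hypothetical, and interpreting "first unique value" as "value of the first record" is an assumption.

```python
from typing import Any


def fix_non_unique_group_value(
    records: list[dict[str, Any]], group_by: str
) -> list[dict[str, Any]]:
    """Overwrite the group-by column with the first value seen in the sequence."""
    if not records:
        return []
    first_value = records[0][group_by]
    # Build copies so the caller's records are left untouched.
    return [{**record, group_by: first_value} for record in records]


sequence = [
    {"patient_id": "A", "visit": 1},
    {"patient_id": "A", "visit": 2},
    {"patient_id": "B", "visit": 3},  # stray value inside one sequence
]
fixed = fix_non_unique_group_value(sequence, group_by="patient_id")
print([record["patient_id"] for record in fixed])  # ['A', 'A', 'A']
```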
group_by_fix_unordered_records
pydantic-field
Whether to automatically fix unordered records in a sequence by sorting the records.
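A minimal sketch of such a sorting fix, assuming the sequence has a single order-by column (here a hypothetical `day` column). The helper `sort_records` is illustrative only.

```python
from typing import Any


def sort_records(
    records: list[dict[str, Any]], order_by: str
) -> list[dict[str, Any]]:
    """Return the records sorted by the order-by column (stable for ties)."""
    return sorted(records, key=lambda record: record[order_by])


sequence = [
    {"day": 3, "reading": 0.7},
    {"day": 1, "reading": 0.2},
    {"day": 2, "reading": 0.5},
]
ordered = sort_records(sequence, order_by="day")
print([record["day"] for record in ordered])  # [1, 2, 3]
```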
GenerateParameters
pydantic-model
Bases: Parameters, BaseModel
Configuration parameters for synthetic data generation.
These parameters control how synthetic data is generated after the model is trained. They affect the quality, diversity, and validity of the generated synthetic records.
Fields:
- num_records (int)
- temperature (float)
- repetition_penalty (float)
- top_p (float)
- patience (int)
- invalid_fraction_threshold (float)
- use_structured_generation (bool)
- structured_generation_backend (Literal['auto', 'xgrammar', 'guidance', 'outlines', 'lm-format-enforcer'])
- structured_generation_schema_method (Literal['regex', 'json_schema'])
- structured_generation_use_single_sequence (bool)
- enforce_timeseries_fidelity (bool)
- validation (ValidationParameters)
- attention_backend (str | None)
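Four of these fields carry numeric bounds in their descriptions: repetition_penalty > 0, top_p in (0, 1], patience >= 1, and invalid_fraction_threshold in [0, 1]. The sketch below restates those bounds with a standard-library dataclass. `SamplingBounds` is a hypothetical stand-in written for illustration; it is not the GenerateParameters class and makes no claim about the library's defaults or error types.

```python
from dataclasses import dataclass


@dataclass
class SamplingBounds:
    """Hypothetical stand-in that only mirrors the documented bounds."""

    repetition_penalty: float
    top_p: float
    patience: int
    invalid_fraction_threshold: float

    def __post_init__(self) -> None:
        if not self.repetition_penalty > 0:
            raise ValueError("repetition_penalty must be > 0")
        if not 0 < self.top_p <= 1:
            raise ValueError("top_p must be in (0, 1]")
        if self.patience < 1:
            raise ValueError("patience must be >= 1")
        if not 0 <= self.invalid_fraction_threshold <= 1:
            raise ValueError("invalid_fraction_threshold must be in [0, 1]")


# Example values chosen for illustration; they are not documented defaults.
bounds = SamplingBounds(
    repetition_penalty=1.1,
    top_p=0.9,
    patience=3,
    invalid_fraction_threshold=0.8,
)
```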
num_records
pydantic-field
Number of records to generate.
temperature
pydantic-field
Sampling temperature for controlling randomness (higher = more random).
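To make "higher = more random" concrete, here is a small sketch of temperature-scaled softmax over a toy set of logits. The function name is hypothetical, and a temperature of 0 (treated as greedy decoding by many engines) is deliberately not handled.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("this sketch only handles temperature > 0")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.0]
cool = softmax_with_temperature(logits, temperature=0.5)
warm = softmax_with_temperature(logits, temperature=2.0)
# The top token dominates more at the lower temperature.
print(cool[0] > warm[0])  # True
```

At the higher temperature the distribution flattens, so low-scoring tokens are sampled more often.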
repetition_penalty
pydantic-field
The value used to control the likelihood of the model repeating the same token. Must be > 0.
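A widely used formulation of repetition penalty divides the positive logits of already-seen tokens by the penalty and multiplies the negative ones by it, so values above 1 discourage repeats. The sketch below shows that rule on a toy vocabulary. Whether the engine behind this parameter applies exactly this rule is an assumption.

```python
def apply_repetition_penalty(
    logits: dict[str, float], seen_tokens: set[str], penalty: float
) -> dict[str, float]:
    """Penalize tokens that already appeared (sketch of a common rule)."""
    if penalty <= 0:
        raise ValueError("penalty must be > 0")
    adjusted = dict(logits)
    for token in seen_tokens:
        if token in adjusted:
            score = adjusted[token]
            # Shrink positive scores, push negative scores further down.
            adjusted[token] = score / penalty if score > 0 else score * penalty
    return adjusted


logits = {"a": 2.0, "b": -1.0, "c": 0.5}
penalized = apply_repetition_penalty(logits, seen_tokens={"a", "b"}, penalty=2.0)
print(penalized)  # {'a': 1.0, 'b': -2.0, 'c': 0.5}
```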
top_p
pydantic-field
Nucleus sampling probability for token selection. Must be in (0, 1].
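Nucleus sampling keeps the smallest set of highest-probability tokens whose cumulative probability reaches top_p, then renormalizes before sampling. A minimal sketch on a toy distribution, with a hypothetical helper name:

```python
def nucleus_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept: list[tuple[str, float]] = []
    cumulative = 0.0
    for token, probability in ranked:
        kept.append((token, probability))
        cumulative += probability
        if cumulative >= top_p:
            break
    total = sum(probability for _, probability in kept)
    return {token: probability / total for token, probability in kept}


probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
nucleus = nucleus_filter(probs, top_p=0.75)
print(sorted(nucleus))  # ['a', 'b']
```

With top_p set to 1, every token stays eligible, which is why the upper bound is inclusive.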
patience
pydantic-field
Number of consecutive generations where the invalid_fraction_threshold is reached before stopping generation. Must be >= 1.
invalid_fraction_threshold
pydantic-field
The fraction of invalid records that will stop generation after the patience limit is reached. Must be in [0, 1].
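The interaction between patience and invalid_fraction_threshold can be sketched as a streak counter over per-batch invalid fractions. This is an illustrative reading, not the library's code: it assumes "reached" means greater than or equal to the threshold and that a batch below the threshold resets the streak.

```python
from typing import Iterable, Optional


def stopping_batch(
    invalid_fractions: Iterable[float],
    patience: int,
    invalid_fraction_threshold: float,
) -> Optional[int]:
    """Return the index of the batch that triggers the stop, or None."""
    if patience < 1:
        raise ValueError("patience must be >= 1")
    if not 0 <= invalid_fraction_threshold <= 1:
        raise ValueError("invalid_fraction_threshold must be in [0, 1]")
    streak = 0
    for index, fraction in enumerate(invalid_fractions):
        # Assumed: a batch counts when it meets or exceeds the threshold.
        streak = streak + 1 if fraction >= invalid_fraction_threshold else 0
        if streak >= patience:
            return index
    return None


history = [0.9, 0.1, 0.9, 0.9, 0.9]
print(stopping_batch(history, patience=3, invalid_fraction_threshold=0.8))  # 4
```

The single good batch at index 1 resets the counter, so generation in this toy run continues until three bad batches occur in a row.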
use_structured_generation
pydantic-field
Whether to use structured generation for better format control.
structured_generation_backend
pydantic-field
The backend used by vLLM when use_structured_generation is True. Supported backends: 'outlines', 'guidance', 'xgrammar', 'lm-format-enforcer'. 'auto' will allow vLLM to choose the backend.
structured_generation_schema_method
pydantic-field
The method used to generate the schema from your dataset and pass it to the generation backend. 'regex' uses a custom regex construction method that tends to be more comprehensive than 'json_schema' at the cost of speed.
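The library's regex construction is not described here, so the following is only a toy illustration of the general idea: derive a pattern from column types so that a constrained decoder can emit only text that matches one record. The type names, the `TYPE_PATTERNS` table, and the record layout are all hypothetical.

```python
import re

# Hypothetical per-type fragments; a real builder would cover many more cases.
TYPE_PATTERNS = {
    "integer": r"-?\d+",
    "string": r'"[^"\\]*"',
}


def record_regex(columns: dict[str, str]) -> str:
    """Build a regex matching one JSON-style record with fixed keys and order."""
    fields = [
        '"' + re.escape(name) + '": ' + TYPE_PATTERNS[kind]
        for name, kind in columns.items()
    ]
    return r"\{" + ", ".join(fields) + r"\}"


pattern = record_regex({"age": "integer", "name": "string"})
print(bool(re.fullmatch(pattern, '{"age": 42, "name": "Ada"}')))    # True
print(bool(re.fullmatch(pattern, '{"age": "42", "name": "Ada"}')))  # False
```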
structured_generation_use_single_sequence
pydantic-field
Whether to use a regex that matches exactly one sequence or record if max_sequences_per_example is 1.
enforce_timeseries_fidelity
pydantic-field
Whether to enforce time-series fidelity by enforcing the order, intervals, and start and end times of the records.
validation
pydantic-field
Validation parameters controlling validation logic and automatic fixes when parsing LLM output and converting to tabular data.
attention_backend
pydantic-field
The attention backend for the vLLM engine. Common values: 'FLASHINFER', 'FLASH_ATTN', 'TRITON_ATTN', 'FLEX_ATTENTION'. If None or 'auto', vLLM will auto-select the best available backend.