
Column Configurations

The column_configs module defines configuration objects for all Data Designer column types. Each configuration inherits from SingleColumnConfig, which provides shared arguments like the column name, whether to drop the column after generation, and the column_type.

column_type is a discriminator field

The column_type argument is used to identify column types when deserializing the Data Designer Config from JSON/YAML. It acts as the discriminator in a discriminated union, allowing Pydantic to automatically determine which column configuration class to instantiate.
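To make the discriminator idea concrete, here is a minimal standard-library sketch of dispatching on a `column_type` value. It does not use Pydantic or Data Designer: the `TextColumn` and `SamplerColumn` dataclasses, the `REGISTRY` mapping, and the `load_column` helper are hypothetical stand-ins. Only the discriminator values "llm-text" and "sampler" are taken from this page.

```python
from dataclasses import dataclass


@dataclass
class TextColumn:  # hypothetical stand-in for LLMTextColumnConfig
    name: str
    prompt: str
    column_type: str = "llm-text"


@dataclass
class SamplerColumn:  # hypothetical stand-in for SamplerColumnConfig
    name: str
    sampler_type: str
    column_type: str = "sampler"


# Map each discriminator value to the class it identifies.
REGISTRY = {"llm-text": TextColumn, "sampler": SamplerColumn}


def load_column(data: dict):
    """Pick the class from data["column_type"], then build it from the dict."""
    try:
        cls = REGISTRY[data["column_type"]]
    except KeyError as exc:
        raise ValueError(f"Unknown or missing column_type in {data!r}") from exc
    return cls(**data)


column = load_column({"name": "id", "column_type": "sampler", "sampler_type": "uuid"})
print(type(column).__name__)
```

Pydantic performs the equivalent lookup automatically for a discriminated union, so no hand-written registry is needed in practice.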

Classes:

Name Description
ExpressionColumnConfig

Configuration for derived columns using Jinja2 expressions.

LLMCodeColumnConfig

Configuration for code generation columns using Large Language Models.

LLMJudgeColumnConfig

Configuration for LLM-as-a-judge quality assessment and scoring columns.

LLMStructuredColumnConfig

Configuration for structured JSON generation columns using Large Language Models.

LLMTextColumnConfig

Configuration for text generation columns using Large Language Models.

SamplerColumnConfig

Configuration for columns generated using numerical samplers.

Score

Configuration for a "score" in an LLM judge evaluation.

SeedDatasetColumnConfig

Configuration for columns sourced from seed datasets.

SingleColumnConfig

Abstract base class for all single-column configuration types.

ValidationColumnConfig

Configuration for validation columns that validate existing columns.

ExpressionColumnConfig

Bases: SingleColumnConfig

Configuration for derived columns using Jinja2 expressions.

Expression columns compute values by evaluating Jinja2 templates that reference other columns. Useful for transformations, concatenations, conditional logic, and derived features without requiring LLM generation. The expression is evaluated row-by-row.

Attributes:

Name Type Description
expr str

Jinja2 expression to evaluate. Can reference other column values using {{ column_name }} syntax. Supports filters, conditionals, and arithmetic. Must be a valid, non-empty Jinja2 template.

dtype Literal['int', 'float', 'str', 'bool']

Data type to cast the result to. Must be one of "int", "float", "str", or "bool". Defaults to "str". Type conversion is applied after expression evaluation.

column_type Literal['expression']

Discriminator field, always "expression" for this configuration type.

Methods:

Name Description
assert_expression_valid_jinja

Validate that the expression is a valid, non-empty Jinja2 template.

required_columns property

Returns the columns referenced in the expression template.
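As a rough illustration of what "columns referenced in the expression template" means, the sketch below pulls names out of bare `{{ column_name }}` references with a regular expression. This is a deliberate simplification and not how the library is documented to work: it only sees the first identifier in each `{{ ... }}` block, so arithmetic or conditionals that mention several columns would need a real Jinja2 parser. The `referenced_columns` helper is hypothetical.

```python
import re

# Matches only the first identifier after "{{", e.g. "{{ column_name }}"
# or "{{ column_name | upper }}".
_REFERENCE = re.compile(r"\{\{\s*([A-Za-z_]\w*)")


def referenced_columns(expr: str) -> list[str]:
    """Return unique column names in order of first appearance."""
    seen: dict[str, None] = {}
    for name in _REFERENCE.findall(expr):
        seen.setdefault(name, None)
    return list(seen)


expr = "{{ first_name }} {{ last_name | upper }} ({{ first_name }})"
print(referenced_columns(expr))  # ['first_name', 'last_name']
```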

assert_expression_valid_jinja()

Validate that the expression is a valid, non-empty Jinja2 template.

Returns:

Type Description
Self

The validated instance.

Raises:

Type Description
InvalidConfigError

If expression is empty or contains invalid Jinja2 syntax.

Source code in src/data_designer/config/column_configs.py
@model_validator(mode="after")
def assert_expression_valid_jinja(self) -> Self:
    """Validate that the expression is a valid, non-empty Jinja2 template.

    Returns:
        The validated instance.

    Raises:
        InvalidConfigError: If expression is empty or contains invalid Jinja2 syntax.
    """
    if not self.expr.strip():
        raise InvalidConfigError(
            f"🛑 Expression column '{self.name}' has an empty or whitespace-only expression. "
        f"Please provide a valid Jinja2 expression (e.g., '{{{{ column_name }}}}' or '{{{{ col1 }}}} + {{{{ col2 }}}}') "
            "or remove this column if not needed."
        )
    assert_valid_jinja2_template(self.expr)
    return self

LLMCodeColumnConfig

Bases: LLMTextColumnConfig

Configuration for code generation columns using Large Language Models.

Extends LLMTextColumnConfig to generate code snippets in specific programming languages or SQL dialects. The generated code is automatically extracted from markdown code blocks for the specified language. Inherits all prompt templating capabilities.

Attributes:

Name Type Description
code_lang CodeLang

Programming language or SQL dialect for code generation. Supported values include: "python", "javascript", "typescript", "java", "kotlin", "go", "rust", "ruby", "scala", "swift", "sql:sqlite", "sql:postgres", "sql:mysql", "sql:tsql", "sql:bigquery", "sql:ansi". See CodeLang enum for complete list.

column_type Literal['llm-code']

Discriminator field, always "llm-code" for this configuration type.
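The extraction step described above can be pictured with a small standard-library sketch. It assumes the model wraps its answer in a fenced block tagged with the language name. The `extract_code` helper, the fallback to the raw response, and the fence tag used are assumptions of this example, not documented library behavior.

```python
import re

FENCE = "`" * 3  # built programmatically so this example can sit inside a fenced block


def extract_code(response: str, lang: str) -> str:
    """Return the body of the first fenced block tagged with lang.

    Falls back to the stripped response when no such block is found
    (an assumption of this sketch).
    """
    pattern = re.compile(
        re.escape(FENCE) + re.escape(lang) + r"[ \t]*\n(.*?)" + re.escape(FENCE),
        re.DOTALL,
    )
    match = pattern.search(response)
    return match.group(1).strip() if match else response.strip()


response = f"Here is the function:\n{FENCE}python\ndef add(a, b):\n    return a + b\n{FENCE}\nDone."
print(extract_code(response, "python"))
```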

LLMJudgeColumnConfig

Bases: LLMTextColumnConfig

Configuration for LLM-as-a-judge quality assessment and scoring columns.

Extends LLMTextColumnConfig to create judge columns that evaluate and score other generated content based on the defined criteria. Useful for quality assessment, preference ranking, and multi-dimensional evaluation of generated data.

Attributes:

Name Type Description
scores list[Score]

List of Score objects defining the evaluation dimensions. Each score represents a different aspect to evaluate (e.g., accuracy, relevance, fluency). Must contain at least one score.

column_type Literal['llm-judge']

Discriminator field, always "llm-judge" for this configuration type.

LLMStructuredColumnConfig

Bases: LLMTextColumnConfig

Configuration for structured JSON generation columns using Large Language Models.

Extends LLMTextColumnConfig to generate structured data conforming to a specified schema. Uses JSON schema or Pydantic models to define the expected output structure, enabling type-safe and validated structured output generation. Inherits prompt templating capabilities.

Attributes:

Name Type Description
output_format Union[dict, Type[BaseModel]]

The schema defining the expected output structure. Can be either a Pydantic BaseModel class (recommended) or a JSON schema dictionary.

column_type Literal['llm-structured']

Discriminator field, always "llm-structured" for this configuration type.

Methods:

Name Description
validate_output_format

Convert Pydantic model to JSON schema if needed.

validate_output_format()

Convert Pydantic model to JSON schema if needed.

Returns:

Type Description
Self

The validated instance with output_format as a JSON schema dict.

Source code in src/data_designer/config/column_configs.py
@model_validator(mode="after")
def validate_output_format(self) -> Self:
    """Convert Pydantic model to JSON schema if needed.

    Returns:
        The validated instance with output_format as a JSON schema dict.
    """
    if not isinstance(self.output_format, dict) and issubclass(self.output_format, BaseModel):
        self.output_format = self.output_format.model_json_schema()
    return self
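For reference, a structured column whose output_format is already a JSON schema dictionary might look like the sketch below when written as plain data. The field names follow the attributes documented on this page. The column name, prompt, schema contents, and the "my-model" alias are illustrative assumptions, and the exact serialized layout of a real config may differ.

```python
import json

# A JSON schema dict, one of the two accepted forms of output_format.
recipe_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "servings": {"type": "integer"},
    },
    "required": ["title", "servings"],
}

# Field names mirror the documented attributes; values are illustrative.
column = {
    "name": "recipe",
    "column_type": "llm-structured",
    "prompt": "Invent a recipe that uses {{ ingredient }}.",
    "model_alias": "my-model",  # hypothetical alias
    "output_format": recipe_schema,
}

print(json.dumps(column, indent=2))
```

When a Pydantic model class is passed instead, the validator shown above replaces it with the result of `model_json_schema()`, so downstream code always sees a dictionary of this kind.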

LLMTextColumnConfig

Bases: SingleColumnConfig

Configuration for text generation columns using Large Language Models.

LLM text columns generate free-form text content using language models via LiteLLM. Prompts support Jinja2 templating to reference values from other columns, enabling context-aware generation. The generated text can optionally include reasoning traces when models support extended thinking.

Attributes:

Name Type Description
prompt str

Prompt template for text generation. Supports Jinja2 syntax to reference other columns (e.g., "Write a story about {{ character_name }}"). Must be a valid Jinja2 template.

model_alias str

Alias of the model configuration to use for generation. Must match a model alias defined when initializing the DataDesignerConfigBuilder.

system_prompt Optional[str]

Optional system prompt to set model behavior and constraints. Also supports Jinja2 templating. If provided, must be a valid Jinja2 template. Do not put any output parsing instructions in the system prompt. Instead, use the appropriate column type for the output you want to generate - e.g., LLMStructuredColumnConfig for structured output, LLMCodeColumnConfig for code.

multi_modal_context Optional[list[ImageContext]]

Optional list of image contexts for multi-modal generation. Enables vision-capable models to generate text based on image inputs.

column_type Literal['llm-text']

Discriminator field, always "llm-text" for this configuration type.
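To show how a prompt template picks up values from other columns row by row, here is a small sketch that swaps in a regular-expression substitution in place of Jinja2. It handles only bare `{{ name }}` placeholders. The `render_prompt` helper and the sample rows are hypothetical, and the template string is the example given for the prompt attribute.

```python
import re


def render_prompt(template: str, row: dict) -> str:
    """Replace each bare "{{ name }}" placeholder with the value from row."""

    def substitute(match: re.Match) -> str:
        return str(row[match.group(1)])

    return re.sub(r"\{\{\s*([A-Za-z_]\w*)\s*\}\}", substitute, template)


rows = [{"character_name": "Ada"}, {"character_name": "Grace"}]
template = "Write a story about {{ character_name }}"
for row in rows:
    print(render_prompt(template, row))
```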

Methods:

Name Description
assert_prompt_valid_jinja

Validate that prompt and system_prompt are valid Jinja2 templates.

required_columns property

Get columns referenced in the prompt and system_prompt templates.

Returns:

Type Description
list[str]

List of unique column names referenced in Jinja2 templates.

side_effect_columns property

Returns the reasoning trace column, which may be generated alongside the main column.

Reasoning traces are only returned if the served model parses and returns reasoning content.

Returns:

Type Description
list[str]

List containing the reasoning trace column name.

assert_prompt_valid_jinja()

Validate that prompt and system_prompt are valid Jinja2 templates.

Returns:

Type Description
Self

The validated instance.

Raises:

Type Description
InvalidConfigError

If prompt or system_prompt contains invalid Jinja2 syntax.

Source code in src/data_designer/config/column_configs.py
@model_validator(mode="after")
def assert_prompt_valid_jinja(self) -> Self:
    """Validate that prompt and system_prompt are valid Jinja2 templates.

    Returns:
        The validated instance.

    Raises:
        InvalidConfigError: If prompt or system_prompt contains invalid Jinja2 syntax.
    """
    assert_valid_jinja2_template(self.prompt)
    if self.system_prompt:
        assert_valid_jinja2_template(self.system_prompt)
    return self

SamplerColumnConfig

Bases: SingleColumnConfig

Configuration for columns generated using numerical samplers.

Sampler columns provide efficient data generation using numerical samplers for common data types and distributions. Supported samplers include UUID generation, datetime/timedelta sampling, person generation, category / subcategory sampling, and various statistical distributions (uniform, gaussian, binomial, poisson, scipy).

Attributes:

Name Type Description
sampler_type SamplerType

Type of sampler to use. Available types include: "uuid", "category", "subcategory", "uniform", "gaussian", "bernoulli", "bernoulli_mixture", "binomial", "poisson", "scipy", "person", "datetime", "timedelta".

params Annotated[SamplerParamsT, Discriminator(sampler_type)]

Parameters specific to the chosen sampler type. Type varies based on the sampler_type (e.g., CategorySamplerParams, UniformSamplerParams, PersonSamplerParams).

conditional_params dict[str, Annotated[SamplerParamsT, Discriminator(sampler_type)]]

Optional dictionary of conditional parameters. Each key is a condition that must be met (e.g., "age > 21") for the corresponding parameters to be used. Each value holds the parameters to use when that condition is met.

convert_to Optional[str]

Optional type conversion to apply after sampling. Must be one of "float", "int", or "str". Useful for converting numerical samples to strings or other types.

column_type Literal['sampler']

Discriminator field, always "sampler" for this configuration type.

Displaying available samplers and their parameters

The config builder has an info attribute that can be used to display the available samplers and their parameters:

config_builder.info.display("samplers")

Methods:

Name Description
inject_sampler_type_into_params

Inject sampler_type into params dict to enable discriminated union resolution.

inject_sampler_type_into_params(data) classmethod

Inject sampler_type into params dict to enable discriminated union resolution.

This allows users to pass params as a simple dict without the sampler_type field, which will be automatically added based on the outer sampler_type field.

Source code in src/data_designer/config/column_configs.py
@model_validator(mode="before")
@classmethod
def inject_sampler_type_into_params(cls, data: dict) -> dict:
    """Inject sampler_type into params dict to enable discriminated union resolution.

    This allows users to pass params as a simple dict without the sampler_type field,
    which will be automatically added based on the outer sampler_type field.
    """
    if isinstance(data, dict):
        sampler_type = data.get("sampler_type")
        params = data.get("params")

        # If params is a dict and doesn't have sampler_type, inject it
        if sampler_type and isinstance(params, dict) and "sampler_type" not in params:
            data["params"] = {"sampler_type": sampler_type, **params}

        # Handle conditional_params similarly
        conditional_params = data.get("conditional_params")
        if conditional_params and isinstance(conditional_params, dict):
            for condition, cond_params in conditional_params.items():
                if isinstance(cond_params, dict) and "sampler_type" not in cond_params:
                    data["conditional_params"][condition] = {"sampler_type": sampler_type, **cond_params}

    return data
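The effect of this validator is easiest to see on plain data. The sketch below reimplements the same logic as a standalone function (named `inject_sampler_type` here to mark it as a copy, not the library method) and applies it to a raw config dict. The parameter names "low" and "high" are assumed for illustration.

```python
def inject_sampler_type(data: dict) -> dict:
    """Copy the outer sampler_type into params and conditional_params dicts."""
    sampler_type = data.get("sampler_type")
    params = data.get("params")
    if sampler_type and isinstance(params, dict) and "sampler_type" not in params:
        data["params"] = {"sampler_type": sampler_type, **params}

    conditional = data.get("conditional_params")
    if isinstance(conditional, dict):
        for condition, cond_params in conditional.items():
            if isinstance(cond_params, dict) and "sampler_type" not in cond_params:
                # Reassigning an existing key does not resize the dict, so this is safe.
                conditional[condition] = {"sampler_type": sampler_type, **cond_params}
    return data


raw = {
    "name": "discount",
    "column_type": "sampler",
    "sampler_type": "uniform",
    "params": {"low": 0.0, "high": 0.1},  # assumed parameter names
    "conditional_params": {"age > 21": {"low": 0.1, "high": 0.3}},
}
resolved = inject_sampler_type(raw)
print(resolved["params"])
print(resolved["conditional_params"]["age > 21"])
```

After the call, both the top-level params and each conditional entry carry the "sampler_type" key that the discriminated union needs.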

Score

Bases: ConfigBase

Configuration for a "score" in an LLM judge evaluation.

Defines a single scoring criterion with its possible values and descriptions. Multiple Score objects can be combined in an LLMJudgeColumnConfig to create multi-dimensional quality assessments.

Attributes:

Name Type Description
name str

A clear, concise name for this scoring dimension (e.g., "Relevance", "Fluency").

description str

An informative and detailed assessment guide explaining how to evaluate this dimension. Should provide clear criteria for scoring.

options dict[Union[int, str], str]

Dictionary mapping score values to their descriptions. Keys can be integers (e.g., 1-5 scale) or strings (e.g., "Poor", "Good", "Excellent"). Values are descriptions explaining what each score level means.
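As an illustration of the two key styles that options allows, the sketch below writes two scores and a judge column as plain dictionaries. The field names follow the attributes documented on this page. The score texts, column name, prompt, and the "my-judge-model" alias are hypothetical.

```python
relevance = {
    "name": "Relevance",
    "description": "How directly the answer addresses the question that was asked.",
    "options": {  # integer keys, e.g. a 1-3 scale
        1: "Off topic or ignores the question.",
        2: "Partially addresses the question.",
        3: "Fully and directly addresses the question.",
    },
}

fluency = {
    "name": "Fluency",
    "description": "How natural and grammatical the answer reads.",
    "options": {  # string keys
        "Poor": "Frequent errors that impede understanding.",
        "Good": "Minor errors that do not impede understanding.",
        "Excellent": "Reads like polished prose.",
    },
}

judge_column = {
    "name": "answer_quality",
    "column_type": "llm-judge",
    "prompt": "Evaluate this answer to '{{ question }}': {{ answer }}",
    "model_alias": "my-judge-model",  # hypothetical alias
    "scores": [relevance, fluency],  # at least one score is required
}

print([score["name"] for score in judge_column["scores"]])
```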

SeedDatasetColumnConfig

Bases: SingleColumnConfig

Configuration for columns sourced from seed datasets.

This config marks columns that come from seed data. It is typically created automatically when calling with_seed_dataset() on the builder, rather than being instantiated directly by users.

Attributes:

Name Type Description
column_type Literal['seed-dataset']

Discriminator field, always "seed-dataset" for this configuration type.

SingleColumnConfig

Bases: ConfigBase, ABC

Abstract base class for all single-column configuration types.

This class serves as the foundation for all column configurations in DataDesigner, defining shared fields and properties across all column types.

Attributes:

Name Type Description
name str

Unique name of the column to be generated.

drop bool

If True, the column will be generated but removed from the final dataset. Useful for intermediate columns that are dependencies for other columns.

column_type str

Discriminator field that identifies the specific column type. Subclasses must override this field to specify the column type with a Literal value.

required_columns property

Returns a list of column names that must exist before this column can be generated.

Returns:

Type Description
list[str]

List of column names that this column depends on. Empty list indicates no dependencies. Override in subclasses to specify dependencies.
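One way to picture what "must exist before this column can be generated" implies is a topological ordering over the dependency lists. The sketch below uses the standard library's `graphlib` on hypothetical column names. Whether Data Designer schedules generation this way is not stated on this page, so treat it purely as an illustration of the dependency relationship.

```python
from graphlib import TopologicalSorter

# Hypothetical columns mapped to the names their required_columns would return.
required_columns = {
    "first_name": [],
    "last_name": [],
    "full_name": ["first_name", "last_name"],
    "bio": ["full_name"],
}

# TopologicalSorter takes {node: predecessors}, so dependencies come out first.
order = list(TopologicalSorter(required_columns).static_order())
print(order)
```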

side_effect_columns property

Returns a list of additional columns that this column will create as a side effect.

Some column types generate additional metadata or auxiliary columns alongside the primary column (e.g., reasoning traces for LLM columns).

Returns:

Type Description
list[str]

List of column names that this column will create as a side effect. Empty list indicates no side effect columns. Override in subclasses to specify side effects.

ValidationColumnConfig

Bases: SingleColumnConfig

Configuration for validation columns that validate existing columns.

Validation columns execute validation logic against specified target columns and return structured results indicating pass/fail status with validation details. Supports multiple validation strategies: code execution (Python/SQL), local callable functions (library only), and remote HTTP endpoints.

Attributes:

Name Type Description
target_columns list[str]

List of column names to validate. These columns are passed to the validator for validation. All target columns must exist in the dataset before validation runs.

validator_type ValidatorType

The type of validator to use. Options:
- "code": Execute code (Python or SQL) for validation. The code receives a DataFrame with target columns and must return a DataFrame with validation results.
- "local_callable": Call a local Python function with the data. Only supported when running DataDesigner locally.
- "remote": Send data to a remote HTTP endpoint for validation. Useful for

validator_params ValidatorParamsT

Parameters specific to the validator type. Type varies by validator:
- CodeValidatorParams: Specifies code language (python or SQL dialect like "sql:postgres", "sql:mysql").
- LocalCallableValidatorParams: Provides validation function (Callable[[pd.DataFrame], pd.DataFrame]) and optional output schema for validation results.
- RemoteValidatorParams: Configures endpoint URL, HTTP timeout, retry behavior (max_retries, retry_backoff), and parallel request limits (max_parallel_requests).

batch_size int

Number of records to process in each validation batch. Defaults to 10. Larger batches are more efficient but use more memory. Adjust based on validator complexity and available resources.

column_type Literal['validation']

Discriminator field, always "validation" for this configuration type.

required_columns property

Returns the columns that need to be validated.
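The batch_size attribute controls how many records reach the validator at once. A minimal sketch of that kind of batching, using a hypothetical `batches` helper and made-up records, looks like this. It illustrates the slicing only and makes no claim about how the library implements it.

```python
from typing import Iterator


def batches(records: list[dict], batch_size: int = 10) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        yield records[start : start + batch_size]


records = [{"code": f"print({i})"} for i in range(25)]
sizes = [len(batch) for batch in batches(records)]
print(sizes)  # [10, 10, 5]
```

With the default of 10, 25 records are validated in three calls. A larger batch_size means fewer calls at the cost of more memory per call.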