Processors

The processors module defines configuration objects for post-generation data transformations. Processors run after column generation and can modify the dataset schema or content before output.

Classes:

Name	Description
`DropColumnsProcessorConfig`	Configuration for dropping columns from the output dataset.
`ProcessorConfig`	Abstract base class for all processor configuration types.
`ProcessorType`	Enumeration of available processor types.
`SchemaTransformProcessorConfig`	Configuration for transforming the dataset schema using Jinja2 templates.

Functions:

Name	Description
`get_processor_config_from_kwargs`	Create a processor configuration from a processor type and keyword arguments.

`DropColumnsProcessorConfig`

Bases: ProcessorConfig

Configuration for dropping columns from the output dataset.

This processor removes specified columns from the generated dataset. The dropped columns are saved separately in a dropped-columns directory for reference. When this processor is added via the config builder, the corresponding column configs are automatically marked with drop = True.

Alternatively, you can set drop = True when configuring a column.

Attributes:

Name	Type	Description
`column_names`	`list[str]`	List of column names to remove from the output dataset.
`processor_type`	`Literal[DROP_COLUMNS]`	Discriminator field, always `ProcessorType.DROP_COLUMNS` for this configuration type.

`ProcessorConfig`

Bases: ConfigBase, ABC

Abstract base class for all processor configuration types.

Processors are transformations that run before or after columns are generated. They can modify, reshape, or augment the dataset before it's saved.

Attributes:

Name	Type	Description
`name`	`str`	Unique name of the processor, used to identify the processor in results and to name output artifacts on disk.
`build_stage`	`BuildStage`	The stage at which the processor runs. Currently only `POST_BATCH` is supported, meaning processors run after each batch of columns is generated.

`ProcessorType`

Bases: str, Enum

Enumeration of available processor types.

Attributes:

Name	Type	Description
`DROP_COLUMNS`		Processor that removes specified columns from the output dataset.
`SCHEMA_TRANSFORM`		Processor that creates a new dataset with a transformed schema using Jinja2 templates.

`SchemaTransformProcessorConfig`

Bases: ProcessorConfig

Configuration for transforming the dataset schema using Jinja2 templates.

This processor creates a new dataset with a transformed schema. Each key in the template becomes a column in the output, and values are Jinja2 templates that can reference any column in the batch. The transformed dataset is written to a processors-outputs/{processor_name}/ directory alongside the main dataset.

Attributes:

Name	Type	Description
`template`	`dict[str, Any]`	Dictionary defining the output schema. Keys are new column names, values are Jinja2 templates (strings, lists, or nested structures). Must be JSON-serializable.
`processor_type`	`Literal[SCHEMA_TRANSFORM]`	Discriminator field, always `ProcessorType.SCHEMA_TRANSFORM` for this configuration type.

`get_processor_config_from_kwargs(processor_type, **kwargs)`

Create a processor configuration from a processor type and keyword arguments.

Parameters:

Name	Type	Description	Default
`processor_type`	`ProcessorType`	The type of processor to create.	required
`**kwargs`	`Any`	Additional keyword arguments passed to the processor constructor.	`{}`

Returns:

Type	Description
`ProcessorConfig`	A processor configuration object of the specified type.

Source code in packages/data-designer-config/src/data_designer/config/processors.py

def get_processor_config_from_kwargs(processor_type: ProcessorType, **kwargs: Any) -> ProcessorConfig:
    """Create a processor configuration from a processor type and keyword arguments.

    Args:
        processor_type: The type of processor to create.
        **kwargs: Additional keyword arguments passed to the processor constructor.

    Returns:
        A processor configuration object of the specified type.
    """
    if processor_type == ProcessorType.DROP_COLUMNS:
        return DropColumnsProcessorConfig(**kwargs)
    elif processor_type == ProcessorType.SCHEMA_TRANSFORM:
        return SchemaTransformProcessorConfig(**kwargs)