Skip to content

Processors

The processors module defines configuration objects for post-generation data transformations. Processors run after column generation and can modify the dataset schema or content before output.

Classes:

Name Description
DropColumnsProcessorConfig

Configuration for dropping columns from the output dataset.

ProcessorConfig

Abstract base class for all processor configuration types.

ProcessorType

Enumeration of available processor types.

SchemaTransformProcessorConfig

Configuration for transforming the dataset schema using Jinja2 templates.

Functions:

Name Description
get_processor_config_from_kwargs

Create a processor configuration from a processor type and keyword arguments.

DropColumnsProcessorConfig

Bases: ProcessorConfig

Configuration for dropping columns from the output dataset.

This processor removes specified columns from the generated dataset. The dropped columns are saved separately in a dropped-columns directory for reference. When this processor is added via the config builder, the corresponding column configs are automatically marked with drop = True.

Alternatively, you can set drop = True when configuring a column.

Attributes:

Name Type Description
column_names list[str]

List of column names to remove from the output dataset.

processor_type Literal[DROP_COLUMNS]

Discriminator field, always ProcessorType.DROP_COLUMNS for this configuration type.

ProcessorConfig

Bases: ConfigBase, ABC

Abstract base class for all processor configuration types.

Processors are transformations that run before or after columns are generated. They can modify, reshape, or augment the dataset before it's saved.

Attributes:

Name Type Description
name str

Unique name of the processor, used to identify the processor in results and to name output artifacts on disk.

build_stage BuildStage

The stage at which the processor runs. Currently only POST_BATCH is supported, meaning processors run after each batch of columns is generated.

ProcessorType

Bases: str, Enum

Enumeration of available processor types.

Attributes:

Name Type Description
DROP_COLUMNS

Processor that removes specified columns from the output dataset.

SCHEMA_TRANSFORM

Processor that creates a new dataset with a transformed schema using Jinja2 templates.

SchemaTransformProcessorConfig

Bases: ProcessorConfig

Configuration for transforming the dataset schema using Jinja2 templates.

This processor creates a new dataset with a transformed schema. Each key in the template becomes a column in the output, and values are Jinja2 templates that can reference any column in the batch. The transformed dataset is written to a processors-outputs/{processor_name}/ directory alongside the main dataset.

Attributes:

Name Type Description
template dict[str, Any]

Dictionary defining the output schema. Keys are new column names, values are Jinja2 templates (strings, lists, or nested structures). Must be JSON-serializable.

processor_type Literal[SCHEMA_TRANSFORM]

Discriminator field, always ProcessorType.SCHEMA_TRANSFORM for this configuration type.

get_processor_config_from_kwargs(processor_type, **kwargs)

Create a processor configuration from a processor type and keyword arguments.

Parameters:

Name Type Description Default
processor_type ProcessorType

The type of processor to create.

required
**kwargs Any

Additional keyword arguments passed to the processor constructor.

{}

Returns:

Type Description
ProcessorConfig

A processor configuration object of the specified type.

Source code in src/data_designer/config/processors.py
62
63
64
65
66
67
68
69
70
71
72
73
74
75
def get_processor_config_from_kwargs(processor_type: ProcessorType, **kwargs: Any) -> ProcessorConfig:
    """Create a processor configuration from a processor type and keyword arguments.

    Args:
        processor_type: The type of processor to create.
        **kwargs: Additional keyword arguments passed to the processor constructor.

    Returns:
        A processor configuration object of the specified type.
    """
    if processor_type == ProcessorType.DROP_COLUMNS:
        return DropColumnsProcessorConfig(**kwargs)
    elif processor_type == ProcessorType.SCHEMA_TRANSFORM:
        return SchemaTransformProcessorConfig(**kwargs)