data_actions
¶
Extensible data action framework for pre/post-processing, validation, and generation.
Defines BaseAction and its subclasses (GenerateAction, ColAction,
ValidationAction) which encapsulate data transformations applied at
different pipeline phases. ActionExecutor orchestrates running the
registered actions in order.
Classes:
| Name | Description |
|---|---|
| ProcessFn | Callable that transforms a DataFrame in-place during a processing phase. |
| ValidateBatchFn | Callable that splits a batch into valid and rejected DataFrames. |
| ProcessPhase | Pipeline phases that apply DataFrame-to-DataFrame transformations. |
| ValidateBatchPhase | Pipeline phase for batch validation. |
| BaseAction | Abstract base class for all data actions in the pipeline. |
| GenerateAction | Action that generates net-new data for the DataFrame. |
| GenExpression | |
| GenRawExpression | Low-level action that passes raw transforms_v2 update payloads. |
| ReplaceDataSource | |
| GenDatetimeDistribution | Generate a datetime from a provided datetime distribution. |
| DateConstraint | |
| ColAction | Action that operates on a single named column. |
| DatetimeCol | |
| ActionExecutor | Orchestrate a sequence of BaseAction instances across pipeline phases. |
Functions:
| Name | Description |
|---|---|
| data_actions_fn | Applies an action executor to a dataframe. |
ProcessFn
¶
Bases: Protocol
Callable that transforms a DataFrame in-place during a processing phase.
ValidateBatchFn
¶
Bases: Protocol
Callable that splits a batch into valid and rejected DataFrames.
ProcessPhase
¶
Bases: str, Enum
Pipeline phases that apply DataFrame-to-DataFrame transformations.
ValidateBatchPhase
¶
Bases: str, Enum
Pipeline phase for batch validation.
BaseAction
pydantic-model
¶
Bases: BaseModel, ABC
Abstract base class for all data actions in the pipeline.
Subclasses implement one or more phase methods (preprocess,
postprocess, validate_batch, generate) to transform data at
the corresponding pipeline stage. The functions method introspects
which methods were actually overridden, so only non-default actions run.
State can be shared across phases via set_state / get_state,
which persist to the ActionCtx.state dictionary keyed by the
action's hash.
Config:
- alias_generator: type_alias_fn
Validators:
preprocess(df)
¶
Transform the input dataset before training.
Override to modify the shape or contents of the data (e.g., encoding datetimes, dropping columns). The default implementation is a no-op.
Source code in src/nemo_safe_synthesizer/data_processing/actions/data_actions.py
postprocess(df)
¶
Transform generated data after generation, often reverting preprocessing.
The default implementation is a no-op.
validate_batch(batch, df)
¶
Split a generated batch into valid and rejected rows.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| batch | DataFrame | Newly generated data to validate. | required |
| df | DataFrame | Reference dataset providing context for validation. | required |
Returns:
| Type | Description |
|---|---|
| tuple[DataFrame, DataFrame] | A tuple of (valid_rows, rejected_rows) DataFrames. |
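A minimal sketch of the valid/rejected split, assuming a hypothetical rule that rejects rows whose `state` value never appears in the reference data. Plain lists of dicts stand in for DataFrames so the example runs without extra dependencies; `validate_batch_sketch` and the column name are illustrative and not part of this module.

```python
Row = dict  # stand-in for one DataFrame row


def validate_batch_sketch(batch: list[Row], df: list[Row]) -> tuple[list[Row], list[Row]]:
    """Split ``batch`` into (valid_rows, rejected_rows) using ``df`` as context."""
    # Hypothetical rule: a generated row is valid only if its "state"
    # value was seen somewhere in the reference data.
    known_states = {row["state"] for row in df}
    valid_rows = [row for row in batch if row["state"] in known_states]
    rejected_rows = [row for row in batch if row["state"] not in known_states]
    return valid_rows, rejected_rows


reference = [{"state": "CA"}, {"state": "NY"}]
generated = [{"state": "CA"}, {"state": "ZZ"}, {"state": "NY"}]

valid_rows, rejected_rows = validate_batch_sketch(generated, reference)
```

Every input row lands in exactly one of the two outputs, so the lengths of `valid_rows` and `rejected_rows` always sum to the length of the batch.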
generate(df)
¶
Generate new data and merge it into the DataFrame.
Override to create net-new columns or rows. The default implementation is a no-op.
functions()
¶
Return a Functions bundle containing only the overridden phase methods.
Methods that were not overridden from BaseAction are excluded so
that only actions with real work appear during debugging.
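One way to detect overridden phase methods is to compare each attribute on the subclass with the same attribute on the base class. The sketch below assumes that approach; the source does not show how functions is implemented, and `SketchBaseAction`, `DropId`, and the plain dict return value are stand-ins (the real method returns a Functions bundle).

```python
class SketchBaseAction:
    """Stand-in base class whose phase methods are all no-ops."""

    PHASES = ("preprocess", "postprocess", "validate_batch", "generate")

    def preprocess(self, df):
        return df

    def postprocess(self, df):
        return df

    def validate_batch(self, batch, df):
        return batch, []

    def generate(self, df):
        return df

    def functions(self) -> dict:
        """Return only the phase methods that a subclass actually overrode."""
        overridden = {}
        for name in self.PHASES:
            # A subclass that did not override the method still resolves to
            # the very same function object defined on the base class.
            if getattr(type(self), name) is not getattr(SketchBaseAction, name):
                overridden[name] = getattr(self, name)
        return overridden


class DropId(SketchBaseAction):
    """Hypothetical action that only does work during preprocessing."""

    def preprocess(self, df):
        return [{k: v for k, v in row.items() if k != "id"} for row in df]


phases_with_work = list(DropId().functions())
```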
add_ctx(info)
pydantic-validator
¶
Inject ActionCtx from pydantic's validation context, if provided.
with_ctx(ctx)
¶
get_type()
¶
Return the discriminator type_ value, or "unknown" if unset.
Works around the fact that type_ cannot be an abstract property
on BaseAction due to pydantic discriminator constraints.
hash()
¶
Deterministic key for storing per-action state in ActionCtx.state.
set_state(state_obj)
¶
Persist a Pydantic model as JSON in ActionCtx.state.
get_state(state_obj_type)
¶
Retrieve and deserialize a previously persisted state object.
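The hash / set_state / get_state trio can be pictured as a JSON round trip through a shared dictionary. The sketch below is an assumption-laden stand-in: it uses dataclasses instead of Pydantic models, derives the key from a SHA-256 of the action's configuration, and returns None for missing state. None of these details are confirmed by the source, and `SketchCtx`, `SketchAction`, and `ColumnState` are hypothetical names.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class SketchCtx:
    """Stand-in for ActionCtx: holds serialized state keyed by action hash."""
    state: dict = field(default_factory=dict)


@dataclass
class ColumnState:
    """Stand-in for a per-action state model."""
    column_index: Optional[int] = None


@dataclass
class SketchAction:
    col: str
    ctx: SketchCtx

    def hash(self) -> str:
        # Deterministic key: the same configuration always yields the same hash.
        payload = json.dumps({"type": type(self).__name__, "col": self.col}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def set_state(self, state_obj) -> None:
        self.ctx.state[self.hash()] = json.dumps(asdict(state_obj))

    def get_state(self, state_obj_type):
        raw = self.ctx.state.get(self.hash())
        return None if raw is None else state_obj_type(**json.loads(raw))


ctx = SketchCtx()
SketchAction("ssn", ctx).set_state(ColumnState(column_index=2))

# A second instance with the same configuration (for example, one created
# for a later phase) finds the state because its hash is identical.
restored = SketchAction("ssn", ctx).get_state(ColumnState)
missing = SketchAction("email", ctx).get_state(ColumnState)
```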
GenerateAction
pydantic-model
¶
Bases: BaseAction, ABC
Action that generates net-new data for the DataFrame.
GenerateAction subclasses must implement generate. The phase
field controls when generate runs:
- GENERATE (default): after training, during synthetic data creation.
- PREPROCESS: before training.
- POSTPROCESS: after generation, for cleanup.
Create a new GenerateAction when you need to synthesize a column
based on other columns, fill in faker data, etc.
Fields:
- phase (ProcessPhase)
Validators:
functions()
¶
Route generate to the correct phase slot based on self.phase.
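A sketch of phase routing, assuming the phase enum's values double as slot names. The member names come from the GenerateAction description above, but the string values, `SketchPhase`, `SketchGenerateAction`, and the dict return type are assumptions for illustration only.

```python
from enum import Enum


class SketchPhase(str, Enum):
    """Stand-in for ProcessPhase; the string values are assumed."""
    PREPROCESS = "preprocess"
    POSTPROCESS = "postprocess"
    GENERATE = "generate"


class SketchGenerateAction:
    def __init__(self, phase: SketchPhase = SketchPhase.GENERATE):
        self.phase = phase

    def generate(self, rows):
        # Hypothetical generator: tag every row with a constant column.
        return [{**row, "source": "synthetic"} for row in rows]

    def functions(self) -> dict:
        # The same generate method is registered under whichever
        # phase slot the action was configured with.
        return {self.phase.value: self.generate}


default_slots = list(SketchGenerateAction().functions())
early_slots = list(SketchGenerateAction(SketchPhase.PREPROCESS).functions())
```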
generate(df)
abstractmethod
¶
Generate new data based on the existing data in the DataFrame.
generate_records(num_records)
¶
Generate records without an existing DataFrame.
Creates an empty DataFrame with num_records rows, runs generate,
and returns the result as a list of dicts.
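The three steps described above can be sketched as follows, using a list of empty dicts as a stand-in for an empty DataFrame with num_records rows. `generate_records_sketch` and `add_ids` are hypothetical helpers, not the module's own code.

```python
def add_ids(rows):
    """Hypothetical generate step: add a sequential id to every row."""
    return [{**row, "id": index} for index, row in enumerate(rows)]


def generate_records_sketch(generate, num_records):
    # 1. Build an "empty frame": num_records rows with no columns.
    empty_rows = [{} for _ in range(num_records)]
    # 2. Run the generate step, which fills in the new columns.
    generated = generate(empty_rows)
    # 3. Return plain dicts, one per record.
    return [dict(row) for row in generated]


records = generate_records_sketch(add_ids, 3)
```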
GenExpression
pydantic-model
¶
Bases: GenerateAction
Fields:
- phase (ProcessPhase)
- type_ (Literal['gen_expression'])
- col (str)
- expression (Optional[str])
- expressions (Optional[list[str]])
- dtype (Optional[str])
Validators:
- add_ctx
- validate_model
expression = None
pydantic-field
¶
A Jinja transforms_v2 expression that specifies the value of the column.
expressions = None
pydantic-field
¶
Similar to expression, but accepts multiple statements that are passed to
transforms_v2 and processed in sequence. Useful for more complex sets of
expressions.
dtype = None
pydantic-field
¶
If specified, the column will be cast as this dtype after generation.
GenRawExpression
pydantic-model
¶
Bases: GenerateAction
Low-level action that passes raw transforms_v2 update payloads.
Unlike GenExpression, which targets a single column, this action
accepts a full list of TransformsUpdate steps. Prefer
GenExpression for simpler use cases.
Fields:
- phase (ProcessPhase)
- type_ (Literal['gen_raw_expression'])
- expressions (list[TransformsUpdate])
Validators:
ReplaceDataSource(**data)
pydantic-model
¶
Bases: BaseAction
Fields:
- type_ (Literal['replace_datasource'])
- col (str)
- data_source (DataSourceT)
Validators:
State
pydantic-model
¶
Bases: BaseModel
Fields:
- column_index (Optional[int])
column_index
pydantic-field
¶
The index at which col was located before preprocessing dropped it. If None,
col was not in the original df.
GenDatetimeDistribution
pydantic-model
¶
Bases: GenerateAction
Generate a datetime from a provided datetime distribution.
Fields:
- phase (ProcessPhase)
- type_ (Literal['gen_datetime_distribution'])
- col (str)
- distribution (DatetimeDistributionT)
Validators:
DateConstraint
pydantic-model
¶
Bases: BaseAction
Fields:
- type_ (Literal['date_constraint'])
- colA (str)
- colB (str)
- operator (Literal['gt', 'ge', 'lt', 'le'])
Validators:
ColAction
pydantic-model
¶
Bases: BaseAction, ABC
Action that operates on a single named column.
Useful for defining serialization/deserialization rules (e.g., datetime formatting, categorical validation) applied before training or after generation.
Fields:
- name (str)
Validators:
DatetimeCol
pydantic-model
¶
ActionExecutor(**data)
pydantic-model
¶
Bases: BaseModel
Orchestrate a sequence of BaseAction instances across pipeline phases.
Groups each action's overridden methods by phase (preprocess, postprocess, validate_batch, generate) and runs them in order. Postprocess functions run in reverse order to properly unwind preprocessing transformations.
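The ordering rule can be illustrated with a small stand-in executor. This sketch skips the grouping of overridden methods by phase and simply calls every action; `Rename`, `Shift`, and `SketchExecutor` are hypothetical, and lists of dicts stand in for DataFrames.

```python
class Rename:
    """Renames a column before training and restores the name afterwards."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def preprocess(self, rows):
        return [{(self.new if k == self.old else k): v for k, v in r.items()} for r in rows]

    def postprocess(self, rows):
        return [{(self.old if k == self.new else k): v for k, v in r.items()} for r in rows]


class Shift:
    """Adds an offset to a column before training and removes it afterwards."""

    def __init__(self, name, offset):
        self.name, self.offset = name, offset

    def preprocess(self, rows):
        return [{**r, self.name: r[self.name] + self.offset} for r in rows]

    def postprocess(self, rows):
        return [{**r, self.name: r[self.name] - self.offset} for r in rows]


class SketchExecutor:
    def __init__(self, actions):
        self.actions = actions

    def preprocess(self, rows):
        for action in self.actions:  # registration order
            rows = action.preprocess(rows)
        return rows

    def postprocess(self, rows):
        for action in reversed(self.actions):  # unwind in reverse order
            rows = action.postprocess(rows)
        return rows


executor = SketchExecutor([Rename("price", "amount"), Shift("amount", 100)])
encoded = executor.preprocess([{"price": 5}])
restored = executor.postprocess(encoded)
```

Here `Shift` depends on the column name that `Rename` introduced. Running postprocess in forward order would rename `amount` back to `price` first, and `Shift.postprocess` would then fail with a KeyError, which is why the reverse order matters.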
Fields:
- actions (list[ActionT])
- ctx (Optional[ActionCtx])
- _phase_to_functions (dict[FunctionPhase, list[Callable]])
data_actions_fn(action_executor)
¶
Applies an action executor to a dataframe.