Analysis

The analysis modules provide tools for profiling and analyzing generated datasets. They include statistics tracking, column profiling, and reporting capabilities.

Column Statistics

Column statistics are automatically computed for every column after generation. They provide basic metrics specific to the column type. For example, LLM columns track token usage statistics, sampler columns track distribution information, and validation columns track validation success rates.

The classes below are result objects that store the computed statistics for each column type and provide methods for formatting these results for display in reports.

Classes:

Name	Description
`BaseColumnStatistics`	Abstract base class for all column statistics types.
`CategoricalDistribution`	Container for computed categorical distribution statistics.
`CategoricalHistogramData`	Container for categorical distribution histogram data.
`ExpressionColumnStatistics`	Container for statistics on expression-based derived columns.
`GeneralColumnStatistics`	Container for general statistics applicable to all column types.
`LLMCodeColumnStatistics`	Container for statistics on LLM-generated code columns.
`LLMJudgedColumnStatistics`	Container for statistics on LLM-as-a-judge quality assessment columns.
`LLMStructuredColumnStatistics`	Container for statistics on LLM-generated structured JSON columns.
`LLMTextColumnStatistics`	Container for statistics on LLM-generated text columns.
`NumericalDistribution`	Container for computed numerical distribution statistics.
`SamplerColumnStatistics`	Container for statistics on sampler-generated columns.
`SeedDatasetColumnStatistics`	Container for statistics on columns sourced from seed datasets.
`ValidationColumnStatistics`	Container for statistics on validation result columns.

`BaseColumnStatistics`

Bases: BaseModel, ABC

Abstract base class for all column statistics types.

Serves as a container for computed statistics across different column types in Data-Designer-generated datasets. Subclasses hold column-specific statistical results and provide methods for formatting these results for display in reports.

Methods:

Name	Description
`create_report_row_data`	Creates a formatted dictionary of statistics for display in reports.

`create_report_row_data()` `abstractmethod`

Creates a formatted dictionary of statistics for display in reports.

Returns:

Type	Description
`dict[str, str]`	Dictionary mapping display labels to formatted statistic values.

Source code in packages/data-designer-config/src/data_designer/config/analysis/column_statistics.py

@abstractmethod
def create_report_row_data(self) -> dict[str, str]:
    """Creates a formatted dictionary of statistics for display in reports.

    Returns:
        Dictionary mapping display labels to formatted statistic values.
    """
    ...
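As a rough illustration of the contract above, the sketch below implements `create_report_row_data` on a stand-in base class. `ReportRowSource` and `WordCountStatistics` are hypothetical names used only here; the real `BaseColumnStatistics` is also a Pydantic `BaseModel`, which this standard-library sketch does not reproduce.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class ReportRowSource(ABC):
    """Stand-in for BaseColumnStatistics (assumption: stdlib only, no Pydantic)."""

    @abstractmethod
    def create_report_row_data(self) -> dict[str, str]:
        ...


@dataclass
class WordCountStatistics(ReportRowSource):
    """Hypothetical statistics type, used only for illustration."""

    column_name: str
    mean_words: float

    def create_report_row_data(self) -> dict[str, str]:
        # Every value is formatted as a string so a report can render it directly.
        return {
            "column name": self.column_name,
            "mean words": f"{self.mean_words:.1f}",
        }


row = WordCountStatistics(column_name="review", mean_words=42.3).create_report_row_data()
```

Because the method is abstract, the base class itself cannot be instantiated; only subclasses that implement it can.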

`CategoricalDistribution`

Bases: BaseModel

Container for computed categorical distribution statistics.

Attributes:

Name	Type	Description
`most_common_value`	`str \| int`	The category value that appears most frequently in the data.
`least_common_value`	`str \| int`	The category value that appears least frequently in the data.
`histogram`	`CategoricalHistogramData`	Complete frequency distribution showing all categories and their counts.
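The three attributes above can be derived from a simple frequency count. The sketch below uses `collections.Counter` on made-up values. How the library orders categories or breaks ties is not documented here, so both are assumptions of this example.

```python
from collections import Counter

values = ["red", "blue", "red", "green", "red", "blue"]

counts = Counter(values)
# most_common() returns (value, count) pairs sorted by descending count.
ranked = counts.most_common()

most_common_value = ranked[0][0]
least_common_value = ranked[-1][0]

# Parallel lists, mirroring the categories/counts layout of a histogram.
histogram = {
    "categories": [value for value, _ in ranked],
    "counts": [count for _, count in ranked],
}
```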

`CategoricalHistogramData`

Bases: BaseModel

Container for categorical distribution histogram data.

Stores the computed frequency distribution of categorical values.

Attributes:

Name	Type	Description
`categories`	`list[float \| int \| str]`	List of unique category values that appear in the data.
`counts`	`list[int]`	List of occurrence counts for each category.

Methods:

Name	Description
`ensure_python_types`	Ensure numerical values are Python objects rather than Numpy types.

`ensure_python_types()`

Ensure numerical values are Python objects rather than Numpy types.

Source code in packages/data-designer-config/src/data_designer/config/analysis/column_statistics.py

@model_validator(mode="after")
def ensure_python_types(self) -> Self:
    """Ensure numerical values are Python objects rather than Numpy types."""
    self.categories = [(float(x) if is_float(x) else (int(x) if is_int(x) else str(x))) for x in self.categories]
    self.counts = [int(i) for i in self.counts]
    return self
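The validator relies on `is_float` and `is_int` helpers that are not shown on this page. As a rough standard-library analogue, the hypothetical `to_python_scalar` below coerces any numeric scalar to a built-in `int` or `float` and everything else to `str`. It checks integers first because every `Integral` is also a `Real`, so its check order differs from the source above. `Fraction` stands in for a non-built-in numeric type.

```python
from fractions import Fraction
from numbers import Integral, Real
from typing import Union


def to_python_scalar(value: object) -> Union[float, int, str]:
    """Coerce a scalar to a built-in int, float, or str (illustrative helper)."""
    # Integral must be tested before Real, since Integral is a subtype of Real.
    if isinstance(value, Integral):
        return int(value)
    if isinstance(value, Real):
        return float(value)
    return str(value)


cleaned = [to_python_scalar(x) for x in [3, 2.5, "a", Fraction(1, 2)]]
```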

`ExpressionColumnStatistics`

Bases: GeneralColumnStatistics

Container for statistics on expression-based derived columns.

Inherits general statistics and stores statistics computed from columns that are derived from Jinja2 expressions referencing other column values.

Attributes:

Name	Type	Description
`column_type`	`Literal[value]`	Discriminator field, always "expression" for this statistics type.

`GeneralColumnStatistics`

Bases: BaseColumnStatistics

Container for general statistics applicable to all column types.

Holds core statistical measures that apply universally across all column types, including null counts, unique values, and data type information. Serves as the base for more specialized column statistics classes that store additional column-specific metrics.

Attributes:

Name	Type	Description
`column_name`	`str`	Name of the column being analyzed.
`num_records`	`int \| MissingValue`	Total number of records in the column.
`num_null`	`int \| MissingValue`	Number of null/missing values in the column.
`num_unique`	`int \| MissingValue`	Number of distinct values in the column. If a value is not hashable, it is converted to a string.
`pyarrow_dtype`	`str`	PyArrow data type of the column as a string.
`simple_dtype`	`str`	Simplified human-readable data type label.
`column_type`	`Literal['general']`	Discriminator field, always "general" for this statistics type.
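To make the counting attributes concrete, the sketch below computes record, null, and unique counts for a small list of values, falling back to the string form when a value is not hashable. Excluding nulls from the unique count is an assumption of this example; the library's exact rule is not stated here.

```python
def hashable_key(value):
    """Return the value itself if it can be hashed, otherwise its string form."""
    try:
        hash(value)
        return value
    except TypeError:
        return str(value)


records = [{"a": 1}, {"a": 1}, None, "x", "x", ["y"]]

non_null = [r for r in records if r is not None]
num_records = len(records)
num_null = num_records - len(non_null)
# Dicts and lists are unhashable, so they are deduplicated via str().
num_unique = len({hashable_key(r) for r in non_null})
```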

`LLMCodeColumnStatistics`

Bases: LLMTextColumnStatistics

Container for statistics on LLM-generated code columns.

Inherits all token usage metrics from LLMTextColumnStatistics. Stores statistics from columns that generate code snippets in specific programming languages.

Attributes:

Name	Type	Description
`column_type`	`Literal[value]`	Discriminator field, always "llm-code" for this statistics type.

`LLMJudgedColumnStatistics`

Bases: LLMTextColumnStatistics

Container for statistics on LLM-as-a-judge quality assessment columns.

Inherits all token usage metrics from LLMTextColumnStatistics. Stores statistics from columns that evaluate and score other generated content based on defined criteria.

Attributes:

Name	Type	Description
`column_type`	`Literal[value]`	Discriminator field, always "llm-judge" for this statistics type.

`LLMStructuredColumnStatistics`

Bases: LLMTextColumnStatistics

Container for statistics on LLM-generated structured JSON columns.

Inherits all token usage metrics from LLMTextColumnStatistics. Stores statistics from columns that generate structured data conforming to JSON schemas or Pydantic models.

Attributes:

Name	Type	Description
`column_type`	`Literal[value]`	Discriminator field, always "llm-structured" for this statistics type.

`LLMTextColumnStatistics`

Bases: GeneralColumnStatistics

Container for statistics on LLM-generated text columns.

Inherits general statistics plus token usage metrics specific to LLM text generation. Stores both prompt and completion token consumption data.

Attributes:

Name	Type	Description
`output_tokens_mean`	`float \| MissingValue`	Mean number of output tokens generated per record.
`output_tokens_median`	`float \| MissingValue`	Median number of output tokens generated per record.
`output_tokens_stddev`	`float \| MissingValue`	Standard deviation of output tokens per record.
`input_tokens_mean`	`float \| MissingValue`	Mean number of input tokens used per record.
`input_tokens_median`	`float \| MissingValue`	Median number of input tokens used per record.
`input_tokens_stddev`	`float \| MissingValue`	Standard deviation of input tokens per record.
`column_type`	`Literal[value]`	Discriminator field, always "llm-text" for this statistics type.
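The token metrics are ordinary summary statistics over per-record token counts. The sketch below computes them with Python's `statistics` module for made-up counts. `summarize_token_counts` is a hypothetical helper, and the use of the sample standard deviation (rather than the population one) is an assumption, since this page does not say which the library uses.

```python
import statistics


def summarize_token_counts(counts: list[int], prefix: str) -> dict[str, float]:
    """Return mean, median, and standard deviation keyed like the attributes above."""
    return {
        f"{prefix}_tokens_mean": float(statistics.mean(counts)),
        f"{prefix}_tokens_median": float(statistics.median(counts)),
        # Sample standard deviation; requires at least two records.
        f"{prefix}_tokens_stddev": statistics.stdev(counts),
    }


output_stats = summarize_token_counts([120, 95, 143, 110, 132], prefix="output")
```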

`NumericalDistribution`

Bases: BaseModel

Container for computed numerical distribution statistics.

Attributes:

Name	Type	Description
`min`	`float \| int`	Minimum value in the distribution.
`max`	`float \| int`	Maximum value in the distribution.
`mean`	`float`	Arithmetic mean (average) of all values.
`stddev`	`float`	Standard deviation measuring the spread of values around the mean.
`median`	`float`	Median value of the distribution.

`SamplerColumnStatistics`

Bases: GeneralColumnStatistics

Container for statistics on sampler-generated columns.

Inherits general statistics plus sampler-specific information including the sampler type used and the empirical distribution of generated values. Stores both categorical and numerical distribution results.

Attributes:

Name	Type	Description
`sampler_type`	`SamplerType`	Type of sampler used to generate this column (e.g., "uniform", "category", "gaussian", "person").
`distribution_type`	`ColumnDistributionType`	Classification of the column's distribution (categorical, numerical, text, other, or unknown).
`distribution`	`CategoricalDistribution \| NumericalDistribution \| MissingValue \| None`	Empirical distribution statistics for the generated values. Can be CategoricalDistribution (for discrete values), NumericalDistribution (for continuous values), or MissingValue if distribution could not be computed.
`column_type`	`Literal[value]`	Discriminator field, always "sampler" for this statistics type.

`SeedDatasetColumnStatistics`

Bases: GeneralColumnStatistics

Container for statistics on columns sourced from seed datasets.

Inherits general statistics and stores statistics computed from columns that originate from existing data provided via the seed dataset functionality.

Attributes:

Name	Type	Description
`column_type`	`Literal[value]`	Discriminator field, always "seed-dataset" for this statistics type.

`ValidationColumnStatistics`

Bases: GeneralColumnStatistics

Container for statistics on validation result columns.

Inherits general statistics plus validation-specific metrics including the count and percentage of records that passed validation. Stores results from validation logic (Python, SQL, or remote) executed against target columns.

Attributes:

Name	Type	Description
`num_valid_records`	`int \| MissingValue`	Number of records that passed validation.
`column_type`	`Literal[value]`	Discriminator field, always "validation" for this statistics type.

Column Profilers

Column profilers are optional analysis tools that provide deeper insights into specific column types. Currently, the only column profiler available is the Judge Score Profiler.

The classes below are result objects that store the computed profiler results and provide methods for formatting these results for display in reports.

Classes:

Name	Description
`ColumnProfilerResults`	Abstract base class for column profiler results.
`JudgeScoreDistributions`	Container for computed distributions across all judge score dimensions.
`JudgeScoreProfilerConfig`	Configuration for the LLM-as-a-judge score profiler.
`JudgeScoreProfilerResults`	Container for complete judge score profiler analysis results.
`JudgeScoreSample`	Container for a single judge score and its associated reasoning.
`JudgeScoreSummary`	Container for an LLM-generated summary of a judge score dimension.

`ColumnProfilerResults`

Bases: BaseModel, ABC

Abstract base class for column profiler results.

Stores results from column profiling operations. Subclasses hold profiler-specific analysis results and provide methods for generating formatted report sections for display.

Methods:

Name	Description
`create_report_section`	Creates a Rich Panel containing the formatted profiler results for display.

`create_report_section()`

Creates a Rich Panel containing the formatted profiler results for display.

Returns:

Type	Description
`Panel`	A Rich Panel containing the formatted profiler results. The default implementation returns a "Not Implemented" message; subclasses should override it to provide specific formatting.

Source code in packages/data-designer-config/src/data_designer/config/analysis/column_profilers.py

def create_report_section(self) -> Panel:
    """Creates a Rich Panel containing the formatted profiler results for display.

    Returns:
        A Rich Panel containing the formatted profiler results. Default implementation
        returns a "Not Implemented" message; subclasses should override to provide
        specific formatting.
    """
    return Panel(
        f"Report section generation not implemented for '{self.__class__.__name__}'.",
        title="Not Implemented",
        border_style=f"bold {ColorPalette.YELLOW.value}",
        padding=(1, 2),
    )

`JudgeScoreDistributions`

Bases: BaseModel

Container for computed distributions across all judge score dimensions.

Stores the complete distribution analysis for all score dimensions in an LLM-as-a-judge column. Each score dimension (e.g., "relevance", "fluency") has its own distribution computed from the generated data.

Attributes:

Name	Type	Description
`scores`	`dict[str, list[int \| str]]`	Mapping of each score dimension name to its list of score values.
`reasoning`	`dict[str, list[str]]`	Mapping of each score dimension name to its list of reasoning texts.
`distribution_types`	`dict[str, ColumnDistributionType]`	Mapping of each score dimension name to its classification.
`distributions`	`dict[str, CategoricalDistribution \| NumericalDistribution \| MissingValue]`	Mapping of each score dimension name to its computed distribution statistics.
`histograms`	`dict[str, CategoricalHistogramData \| MissingValue]`	Mapping of each score dimension name to its histogram data.

`JudgeScoreProfilerConfig`

Bases: ConfigBase

Configuration for the LLM-as-a-judge score profiler.

Attributes:

Name	Type	Description
`model_alias`	`str`	Alias of the LLM model to use for generating score distribution summaries. Must match a model alias defined in the Data Designer configuration.
`summary_score_sample_size`	`int \| None`	Number of score samples to include when prompting the LLM to generate summaries. Larger sample sizes provide more context but increase token usage. Must be at least 1. Defaults to 20.

`JudgeScoreProfilerResults`

Bases: ColumnProfilerResults

Container for complete judge score profiler analysis results.

Attributes:

Name	Type	Description
`column_name`	`str`	Name of the judge column that was profiled.
`summaries`	`dict[str, JudgeScoreSummary]`	Mapping of each score dimension name to its LLM-generated summary.
`score_distributions`	`JudgeScoreDistributions \| MissingValue`	Complete distribution analysis across all score dimensions.

`JudgeScoreSample`

Bases: BaseModel

Container for a single judge score and its associated reasoning.

Stores a paired score-reasoning sample extracted from an LLM-as-a-judge column. Used when generating summaries to provide the LLM with examples of scoring patterns.

Attributes:

Name	Type	Description
`score`	`int \| str`	The score value assigned by the judge. Can be numeric (int) or categorical (str).
`reasoning`	`str`	The reasoning or explanation provided by the judge for this score.
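One way to picture how score-reasoning samples relate to the per-dimension `scores` and `reasoning` mappings is to pair them by position and keep a limited number. In the sketch below, `ScoreSample` is a stand-in for `JudgeScoreSample`, and taking the first `sample_size` pairs is an assumption; how the profiler actually selects samples is not described on this page.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class ScoreSample:
    """Stand-in for JudgeScoreSample (the real class is a Pydantic model)."""

    score: Union[int, str]
    reasoning: str


def pair_samples(scores, reasoning, sample_size):
    # Scores and reasoning texts are assumed to be aligned by position.
    pairs = [ScoreSample(s, r) for s, r in zip(scores, reasoning)]
    return pairs[:sample_size]


scores = {"relevance": [4, 2, 5]}
reasoning = {
    "relevance": ["On topic.", "Drifts off topic.", "Directly answers the question."]
}

samples = {
    dim: pair_samples(scores[dim], reasoning[dim], sample_size=2) for dim in scores
}
```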

`JudgeScoreSummary`

Bases: BaseModel

Container for an LLM-generated summary of a judge score dimension.

Stores the natural language summary and sample data for a single score dimension generated by the judge score profiler. The summary is created by an LLM analyzing the distribution and patterns in the score-reasoning pairs.

Attributes:

Name	Type	Description
`score_name`	`str`	Name of the score dimension being summarized (e.g., "relevance", "fluency").
`summary`	`str`	LLM-generated natural language summary describing the scoring patterns, distribution characteristics, and notable trends for this score dimension.
`score_samples`	`list[JudgeScoreSample]`	List of score-reasoning pairs that were used to generate the summary, serving as examples of the scoring behavior.

Dataset Profiler

The DatasetProfilerResults class contains complete profiling results for a generated dataset. It aggregates column-level statistics, metadata, and profiler results, and provides methods to:

- Compute dataset-level metrics (completion percentage, column type summary)
- Filter statistics by column type
- Generate formatted analysis reports via the to_report() method

Reports can be displayed in the console or exported to HTML/SVG formats.

Classes:

Name	Description
`DatasetProfilerResults`	Container for complete dataset profiling and analysis results.

`DatasetProfilerResults`

Bases: BaseModel

Container for complete dataset profiling and analysis results.

Stores profiling results for a generated dataset, including statistics for all columns, dataset-level metadata, and optional advanced profiler results. Provides methods for computing derived metrics and generating formatted reports.

Attributes:

Name	Type	Description
`num_records`	`int`	Actual number of records successfully generated in the dataset.
`target_num_records`	`int`	Target number of records that were requested to be generated.
`column_statistics`	`list[Annotated[ColumnStatisticsT, Field(discriminator='column_type')]]`	List of statistics objects for all columns in the dataset. Each column has statistics appropriate to its type. Must contain at least one column.
`side_effect_column_names`	`list[str] \| None`	Column names that were generated as side effects of other columns.
`column_profiles`	`list[ColumnProfilerResultsT] \| None`	Column profiler results for specific columns when configured.

Methods:

Name	Description
`get_column_statistics_by_type`	Filters column statistics to return only those of the specified type.
`to_report`	Generate and print an analysis report based on the dataset profiling results.

`column_types` `cached` `property`

Returns a sorted list of unique column types present in the dataset.

`percent_complete` `property`

Returns the completion percentage of the dataset.
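The two properties can be pictured with plain values. The sketch below assumes the completion percentage is the ratio of generated to requested records times 100; the library's exact formula, including how it handles a zero target, is not shown on this page.

```python
# Made-up inputs standing in for a profiled dataset.
num_records = 950
target_num_records = 1000
column_type_per_column = ["llm-text", "sampler", "sampler", "validation", "llm-text"]

# Sorted list of the distinct column types present.
column_types = sorted(set(column_type_per_column))

# Assumed formula: share of the requested records that were actually generated.
percent_complete = 100 * num_records / target_num_records
```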

`get_column_statistics_by_type(column_type)`

Filters column statistics to return only those of the specified type.

Source code in packages/data-designer-config/src/data_designer/config/analysis/dataset_profiler.py

def get_column_statistics_by_type(self, column_type: DataDesignerColumnType) -> list[ColumnStatisticsT]:
    """Filters column statistics to return only those of the specified type."""
    return [c for c in self.column_statistics if c.column_type == column_type]
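The filter is a plain equality match on `column_type`. The sketch below mirrors it with a stand-in `ColumnStats` dataclass and string column types. The real method takes a `DataDesignerColumnType`, and whether plain strings compare equal to it depends on how that enum is defined, which is not shown here.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    """Stand-in for a column statistics object (illustration only)."""

    column_name: str
    column_type: str


column_statistics = [
    ColumnStats("topic", "sampler"),
    ColumnStats("question", "llm-text"),
    ColumnStats("answer", "llm-text"),
]


def get_by_type(stats, column_type):
    # Same list comprehension as the method above, applied to stand-in objects.
    return [c for c in stats if c.column_type == column_type]


llm_text_names = [c.column_name for c in get_by_type(column_statistics, "llm-text")]
```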

`to_report(save_path=None, include_sections=None)`

Generate and print an analysis report based on the dataset profiling results.

Parameters:

Name	Type	Description	Default
`save_path`	`str \| Path \| None`	Optional path to save the report. If provided, the report will be saved as either HTML (.html) or SVG (.svg) format. If None, the report will only be displayed in the console.	`None`
`include_sections`	`list[ReportSection \| DataDesignerColumnType] \| None`	Optional list of sections to include in the report. Choices are any DataDesignerColumnType, "overview" (the dataset overview section), and "column_profilers" (all column profilers in one section). If None, all sections will be included.	`None`

Source code in packages/data-designer-config/src/data_designer/config/analysis/dataset_profiler.py

def to_report(
    self,
    save_path: str | Path | None = None,
    include_sections: list[ReportSection | DataDesignerColumnType] | None = None,
) -> None:
    """Generate and print an analysis report based on the dataset profiling results.

    Args:
        save_path: Optional path to save the report. If provided, the report will be saved
              as either HTML (.html) or SVG (.svg) format. If None, the report will
              only be displayed in the console.
        include_sections: Optional list of sections to include in the report. Choices are
              any DataDesignerColumnType, "overview" (the dataset overview section),
              and "column_profilers" (all column profilers in one section). If None,
              all sections will be included.
    """
    generate_analysis_report(self, save_path, include_sections=include_sections)
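Since `save_path` is documented as accepting only `.html` or `.svg`, a caller may want to check the extension before generating a report. `check_report_path` below is a hypothetical pre-check, not part of the library; what `to_report` itself does with other extensions is not documented on this page.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".html", ".svg"}


def check_report_path(save_path):
    """Return save_path as a Path, or raise if the extension is not supported."""
    path = Path(save_path)
    # Compare case-insensitively so "REPORT.HTML" is accepted too (an assumption).
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"Expected a .html or .svg path, got '{path.suffix}'")
    return path


report_path = check_report_path("reports/analysis.html")
```

The returned path could then be passed as the `save_path` argument of `to_report`.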
