Columns
Columns are the fundamental building blocks in Data Designer. Each column represents a field in your dataset and defines how to generate it—whether that's sampling from a distribution, calling an LLM, or applying a transformation.
The Declarative Approach
Columns are declarative specifications. You describe what you want, and the framework handles how to generate it—managing execution order, batching, parallelization, and resources automatically.
Column Types
Data Designer provides eight built-in column types, each optimized for different generation scenarios.
🎲 Sampler Columns
Sampler columns generate data using statistical sampling—fast, cheap, and ideal for numerical and categorical dataset fields. They're significantly faster than LLMs and can produce data following specific distributions (Poisson for event counts, Gaussian for measurements, etc.).
Available sampler types:
- UUID: Unique identifiers
- Category: Categorical values with optional probability weights
- Subcategory: Hierarchical categorical data (states within countries, models within brands)
- Uniform: Evenly distributed numbers (integers or floats)
- Gaussian: Normally distributed values with configurable mean and standard deviation
- Bernoulli: Binary outcomes with specified success probability
- Bernoulli Mixture: Binary outcomes from multiple probability components
- Binomial: Count of successes in repeated trials
- Poisson: Count data and event frequencies
- Scipy: Access to the full scipy.stats distribution library
- Person: Realistic synthetic individuals with names, demographics, and attributes
- Datetime: Timestamps within specified ranges
- Timedelta: Time duration values
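Conceptually, several of these samplers map onto familiar primitives from Python's standard library. The following is a stdlib-only sketch of what Category, Uniform, Gaussian, and UUID sampling produce—not the Data Designer API itself, and the category names and parameters are invented for illustration:

```python
import random
import uuid

random.seed(7)  # seeded so the example is reproducible

# Category: categorical values with optional probability weights
tier = random.choices(["free", "pro", "enterprise"], weights=[0.7, 0.25, 0.05], k=5)

# Uniform: evenly distributed floats within a range
price = [round(random.uniform(5.0, 50.0), 2) for _ in range(5)]

# Gaussian: configurable mean and standard deviation
height = [random.gauss(170, 10) for _ in range(5)]

# UUID: unique identifiers
ids = [str(uuid.uuid4()) for _ in range(5)]
```

In Data Designer you declare the distribution and its parameters; the framework performs the sampling for you.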
Conditional Sampling
Samplers support conditional parameters that change behavior based on other columns. Want age distributions that vary by country? Income ranges that depend on occupation? Just define conditions on existing column values.
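The mechanics can be sketched in plain Python. The real configuration is declarative; the country-to-parameters table below is invented purely to illustrate the idea of parameters that depend on another column's value:

```python
import random

random.seed(0)

# Hypothetical per-country age parameters (mean, stddev) -- illustrative only
AGE_PARAMS = {"JP": (48, 15), "NG": (22, 8), "US": (38, 12)}

def sample_age(country: str) -> int:
    """Sample an age whose distribution depends on the country column."""
    mean, std = AGE_PARAMS[country]
    # Clamp to a plausible range after sampling
    return max(0, min(100, round(random.gauss(mean, std))))

rows = [{"country": c, "age": sample_age(c)} for c in ["JP", "NG", "US", "NG"]]
```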
📝 LLM-Text Columns
LLM-Text columns generate natural language text: product descriptions, customer reviews, narrative summaries, email threads, or anything requiring semantic understanding and creativity.
Use Jinja2 templating in prompts to reference other columns. Data Designer automatically manages dependencies and injects the referenced column values into the prompt.
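For example, a prompt template might reference previously generated columns like this (rendered here with the jinja2 package directly; inside Data Designer the rendering and dependency tracking are handled for you, and the column names shown are invented):

```python
from jinja2 import Template

prompt_template = (
    "Write a one-paragraph review of {{ product_name }} "
    "in the {{ category }} category, from a {{ sentiment }} customer."
)

# Values that would come from earlier columns in the same record
record = {"product_name": "TrailBlazer 5000", "category": "outdoor gear", "sentiment": "positive"}

prompt = Template(prompt_template).render(**record)
```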
Reasoning Traces
Models that support extended thinking (chain-of-thought reasoning) can capture their reasoning process in a separate {column_name}__reasoning_trace column, which is useful for understanding why the model generated specific content. This column is added to the dataset automatically whenever the model and service provider parse and return reasoning content.
💻 LLM-Code Columns
LLM-Code columns generate code in specific programming languages. They handle the prompting and parsing necessary to extract clean code from the LLM's response—automatically detecting and extracting code from markdown blocks. You provide the prompt and choose the model; the column handles the extraction.
Supported languages: Python, JavaScript, TypeScript, Java, Kotlin, Go, Rust, Ruby, Scala, Swift, plus SQL dialects (SQLite, PostgreSQL, MySQL, T-SQL, BigQuery, ANSI SQL).
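The extraction step can be approximated with a regular expression over fenced markdown blocks. This is a simplified stand-in for whatever parsing the column actually performs, not its real implementation:

```python
import re

def extract_code(response: str, language: str = "python") -> str:
    """Return the first fenced code block, preferring a language-tagged fence."""
    match = re.search(rf"```{language}\n(.*?)```", response, flags=re.DOTALL)
    if match is None:  # fall back to any fenced block
        match = re.search(r"```\w*\n(.*?)```", response, flags=re.DOTALL)
    return match.group(1).strip() if match else response.strip()

llm_response = (
    "Here is the function:\n"
    "```python\n"
    "def add(a, b):\n"
    "    return a + b\n"
    "```\n"
    "Hope that helps!"
)
code = extract_code(llm_response)
```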
🗂️ LLM-Structured Columns
LLM-Structured columns generate JSON with a guaranteed schema. Define your structure using a Pydantic model or JSON schema, and Data Designer ensures the LLM output conforms—no parsing errors, no schema drift.
Use for complex nested structures: API responses, configuration files, database records with multiple related fields, or any structured data where type safety matters. Schemas can be arbitrarily complex with nested objects, arrays, enums, and validation constraints, but success depends on the model's capabilities.
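As an illustration, a schema with nested objects, arrays, and an enum might look like the following plain JSON Schema dict; the field names are invented, and a Pydantic model expressing the same structure works equally well:

```python
# Hypothetical schema for a support-ticket record -- illustrative only
ticket_schema = {
    "type": "object",
    "properties": {
        "ticket_id": {"type": "string"},
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        "customer": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "email": {"type": "string"},
            },
            "required": ["name", "email"],
        },
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["ticket_id", "priority", "customer"],
}
```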
Schema Complexity and Model Choice
Flat schemas with simple fields are easier and more robustly produced across models. Deeply nested schemas with complex validation constraints are more sensitive to model choice—stronger models handle complexity better. If you're experiencing schema conformance issues, try simplifying the schema or switching to a more capable model.
⚖️ LLM-Judge Columns
LLM-Judge columns score generated content across multiple quality dimensions using LLMs as evaluators.
Define scoring rubrics (relevance, accuracy, fluency, helpfulness) and the judge model evaluates each record. Score rubrics specify criteria and scoring options (1-5 scales, categorical grades, etc.), producing quantified quality metrics for every data point.
Use judge columns for data quality filtering (e.g., keep only 4+ rated responses), A/B testing generation strategies, and quality monitoring over time.
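Once judge scores are in the dataset, quality filtering is ordinary record filtering. A sketch with plain dicts, where the column name and 1-5 scale are invented for illustration:

```python
# Records with a hypothetical judge column scoring relevance on a 1-5 scale
records = [
    {"response": "Detailed, on-topic answer.", "relevance_score": 5},
    {"response": "Vague and off-topic.", "relevance_score": 2},
    {"response": "Mostly helpful answer.", "relevance_score": 4},
]

# Keep only records the judge rated 4 or higher
high_quality = [r for r in records if r["relevance_score"] >= 4]
```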
🧩 Expression Columns
Expression columns handle simple transformations using Jinja2 templates—concatenate first and last names, calculate numerical totals, format date strings. No LLM overhead needed.
Template capabilities:
- Variable substitution: Pull values from any existing column
- String filters: Uppercase, lowercase, strip whitespace, replace patterns
- Conditional logic: if/elif/else support
- Arithmetic: Add, subtract, multiply, divide
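All four capabilities can be exercised with jinja2 directly; the column names below (first_name, unit_price, quantity) are illustrative, and inside Data Designer you supply only the template string:

```python
from jinja2 import Template

row = {"first_name": "ada", "last_name": "lovelace", "unit_price": 19.5, "quantity": 3}

# Variable substitution plus string filters
full_name = Template("{{ first_name | title }} {{ last_name | title }}").render(**row)

# Arithmetic, formatted to two decimal places
total = Template("{{ '%.2f' | format(unit_price * quantity) }}").render(**row)

# Conditional logic
size = Template("{% if quantity >= 3 %}bulk{% else %}single{% endif %}").render(**row)
```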
🔍 Validation Columns
Validation columns check generated content against rules and return structured pass/fail results.
Built-in validation types:
- Code validation runs Python or SQL through a linter and reports any issues found.
- Local callable validation accepts a Python function directly when using Data Designer as a library.
- Remote validation sends data to HTTP endpoints for validation-as-a-service. Useful for linters, security scanners, or proprietary systems.
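A local callable validator is just a Python function that returns a structured result. The result shape shown here (a dict with is_valid and detail keys) is an assumption for illustration, not the library's actual contract:

```python
def validate_email_format(value: str) -> dict:
    """Toy validator: pass/fail on a very loose email shape."""
    ok = "@" in value and "." in value.split("@")[-1]
    # Hypothetical result shape -- check the library docs for the real contract
    return {"is_valid": ok, "detail": None if ok else "missing @ or domain"}

results = [validate_email_format(v) for v in ["a@example.com", "not-an-email"]]
```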
🌱 Seed Dataset Columns
Seed dataset columns bootstrap generation from existing data. Provide a real dataset, and those columns become available as context for generating new synthetic data.
Typical pattern: use seed data for one part of your schema (real product names and categories), then generate synthetic fields around it (customer reviews, purchase histories, ratings). The seed data provides realism and constraints; generated columns add volume and variation.
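The pattern can be sketched with plain records: the seed fields are real inputs, and synthetic fields are generated around each seed row. The field names are illustrative, and the review text is stubbed where an LLM column would generate it in practice:

```python
import random

random.seed(1)

# Seed rows: real product names and categories you already have
seed_rows = [
    {"product_name": "TrailBlazer 5000", "category": "outdoor gear"},
    {"product_name": "QuietType Keyboard", "category": "electronics"},
]

# Synthetic fields generated around each seed row
dataset = [
    {**row, "rating": random.randint(1, 5), "review": f"A review of {row['product_name']}."}
    for row in seed_rows
]
```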
Shared Column Properties
Every column configuration inherits from SingleColumnConfig with these standard properties:
name
The column's identifier—unique within your configuration, used in Jinja2 references, and becomes the column name in the output DataFrame. Choose descriptive names: user_review > col_17.
drop
Boolean flag (default: False) controlling whether the column appears in the final output. Setting drop=True generates the column (so it remains available as a dependency) but excludes it from the output dataset.
When to drop columns:
- Intermediate calculations that feed expressions but aren't meaningful standalone
- Context columns used only for LLM prompt templates
- Validation results needed during development but unwanted in production
Dropped columns participate fully in generation and the dependency graph; they are simply filtered out at the end.
column_type
Literal string identifying the column type: "sampler", "llm-text", "expression", etc. Set automatically by each configuration class and serves as Pydantic's discriminator for deserialization.
You rarely set this manually—instantiating LLMTextColumnConfig automatically sets column_type="llm-text". Serialization is reversible: save to YAML, load later, and Pydantic reconstructs the exact objects.
required_columns
Computed property listing columns that must be generated before this one. The framework derives this automatically:
- For LLM/Expression columns: extracted from Jinja2 template {{ variables }}
- For Validation columns: explicitly listed target columns
- For Sampler columns with conditional parameters: columns referenced in conditions
You can read this property for introspection, but you never set it; it is always computed from the configuration details.
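For templated columns, the derivation amounts to reading variable names out of the Jinja2 source. A simplified stdlib sketch follows; the framework likely uses jinja2's own AST (via jinja2.meta.find_undeclared_variables) rather than a regex, and this version only handles plain {{ name }} references:

```python
import re

def referenced_columns(template: str) -> set[str]:
    """Collect bare {{ variable }} names (optionally with filters) from a template string."""
    return set(re.findall(r"{{\s*([a-zA-Z_]\w*)\s*(?:\|[^}]*)?}}", template))

template = "Summarize {{ product_name }} for {{ audience }}: {{ notes | trim }}"
deps = referenced_columns(template)
```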
side_effect_columns
Computed property listing columns created implicitly alongside the primary column. Currently, only LLM columns produce side effects (reasoning trace columns like {name}__reasoning_trace when models use extended thinking).
For detailed information on each column type, refer to the column configuration code reference.