Architecture & Performance
Data Designer is an orchestration framework that coordinates synthetic data generation workflows. It is a client of LLM inference servers; it does not host models itself.
This guide explains the architecture, execution model, and how to tune performance for your specific use case.
Separation of Concerns
┌─────────────────────────────────────┐          ┌─────────────────────────────────────┐
│  Data Designer                      │          │  Inference Server(s)                │
│  (Orchestration)                    │   HTTP   │  (LLM Hosting)                      │
│                                     │ ───────► │                                     │
│  • Dataset workflow management      │          │  • Model weights and execution      │
│  • Column dependency resolution     │          │  • GPU allocation and scheduling    │
│  • Batching and parallelism         │          │  • Request queuing                  │
│  • Retry and error handling         │          │  • Token generation                 │
│  • Data validation and quality      │          │  • Rate limiting (optional)         │
└─────────────────────────────────────┘          └─────────────────────────────────────┘
                   ▲                                                ▲
                   │                                                │
             Your workflow                                 Your infrastructure
             configuration                                   (or cloud API)
What Data Designer Does
- Orchestrates the generation workflow across multiple columns
- Resolves dependencies between columns (DAG-based execution; see the sketch below)
- Batches work into manageable chunks (buffer_size)
- Parallelizes LLM calls within batches (max_parallel_requests)
- Handles errors with retries and early shutdown logic
- Validates generated data against schemas and constraints
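To illustrate the DAG-based execution, here is a conceptual sketch using Python's standard-library graphlib. It is not Data Designer's internal code, and the column names are invented for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each column maps to the columns it reads from.
column_dependencies = {
    "category": [],                   # sampler column, no inputs
    "prompt": ["category"],           # LLM prompt templated on the sampled category
    "response": ["prompt"],           # LLM response conditioned on the prompt
    "score": ["prompt", "response"],  # expression computed from the two LLM columns
}

# Columns are generated in an order where every dependency comes first.
order = list(TopologicalSorter(column_dependencies).static_order())
print(order)  # e.g. ['category', 'prompt', 'response', 'score']
```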
What Data Designer Does NOT Do
- Host models: You must provide LLM endpoints
- Manage GPUs: Your inference server handles GPU allocation
- Scale inference: You must provision sufficient capacity
- Rate limit: Your server or API gateway handles this
Execution Model
Column-Wise Generator
This describes Data Designer's current column-wise dataset generator. Other dataset generation strategies are in development.
Data Designer processes datasets in batches, with parallel operations within each batch.
How It Works
Step 1: Split into batches
Your dataset is divided into batches of buffer_size records. Each batch is processed completely before moving to the next.
Step 2: Process columns sequentially
Within a batch, columns are generated one at a time following the dependency graph. The order depends on column dependencies; expression columns may come before LLM columns if the LLM columns depend on them.
Example workflow:
Batch 1 (100 records)
   │
   ├──► Column 1: category (Sampler)    ◄── All 100 values generated
   ├──► Column 2: prompt (LLM Text)     ◄── All 100 values generated
   ├──► Column 3: response (LLM Text)   ◄── All 100 values generated
   ├──► Column 4: score (Expression)    ◄── All 100 values computed
   │
   └──► Write batch to disk
   │
   ▼
Batch 2 (100 records)
   ...repeat...
Step 3: Generate cells in parallel
Within each column, cells are processed in parallel up to the configured limit:
| Column Type | Parallelism Control |
|---|---|
| Sampler | non_inference_max_parallel_workers |
| LLM (Text, Code, Structured, Judge) | max_parallel_requests |
| Expression | Sequential (fast, CPU-bound) |
Key Concepts
| Concept | Description |
|---|---|
| Batching | Records are split into batches of buffer_size. Each batch completes entirely before the next begins. |
| Sequential columns | Within a batch, columns are generated one at a time, respecting the dependency graph. |
| Parallel cells | Within a column, individual cells (records) are generated in parallel up to the configured limit. |
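The batch → column → cell structure can be summarized in a short sketch. This is illustrative pseudocode for the execution model described above, not Data Designer's actual implementation; batches, ordered_columns, generate_cell, and write_batch are placeholders you would supply.

```python
from concurrent.futures import ThreadPoolExecutor

def run_column_wise(batches, ordered_columns, generate_cell, write_batch,
                    max_parallel_requests=4):
    """Illustrative sketch of the column-wise execution model (not library code)."""
    for batch in batches:                  # each batch holds buffer_size records
        for column in ordered_columns:     # columns run one at a time, in dependency order
            # Cells within a column are generated in parallel, up to the configured limit.
            with ThreadPoolExecutor(max_workers=max_parallel_requests) as pool:
                values = list(pool.map(lambda record: generate_cell(column, record), batch))
            for record, value in zip(batch, values):
                record[column] = value
        write_batch(batch)                 # persist the batch before starting the next one
```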
Concurrency Formula
At any moment, the number of concurrent LLM requests is:
concurrent_requests = min(
buffer_size, # Records in current batch
max_parallel_requests, # Per-model limit
remaining_cells_in_column # Cells left to generate
)
Example: With buffer_size=100 and max_parallel_requests=8, Data Designer sends up to 8 LLM requests at a time until all 100 cells in the column are complete.
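The formula can also be checked directly in Python; note that concurrency only drops below max_parallel_requests once a column is nearly finished.

```python
def concurrent_requests(buffer_size, max_parallel_requests, remaining_cells_in_column):
    # Effective LLM concurrency at any instant while a column is being generated.
    return min(buffer_size, max_parallel_requests, remaining_cells_in_column)

print(concurrent_requests(100, 8, 100))  # 8 -> steady state for the example above
print(concurrent_requests(100, 8, 3))    # 3 -> tail end of the column
```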
Configuration Parameters
buffer_size (RunConfig)
Controls how many records are processed per batch.
import data_designer.config as dd
from data_designer.interface import DataDesigner
run_config = dd.RunConfig(buffer_size=2000)
designer = DataDesigner()
designer.set_run_config(run_config)
| Value | Memory Usage | Throughput | Error Feedback |
|---|---|---|---|
| Low (100-500) | Lower | May not saturate inference | Fast |
| Default (1000) | Moderate | Good for most cases | Moderate |
| High (2000-5000) | Higher | Better for deep pipelines | Slower |
When to increase: High-capacity inference server, single-model workflows, memory not constrained
When to decrease: Memory-constrained environments, development/debugging, complex multi-model pipelines
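For example, a development run and a large production run might use different buffer sizes. This is a minimal sketch using only the RunConfig field shown above; the specific values are illustrative.

```python
import data_designer.config as dd

# Development/debugging: lower memory use and faster error feedback.
dev_run_config = dd.RunConfig(buffer_size=200)

# Production on a high-capacity inference server: larger batches for throughput.
prod_run_config = dd.RunConfig(buffer_size=5000)
```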
max_parallel_requests (InferenceParams)
Controls concurrent LLM API calls per model alias.
import data_designer.config as dd
model = dd.ModelConfig(
alias="my-model",
model="nvidia/nemotron-3-nano-30b-a3b",
inference_parameters=dd.ChatCompletionInferenceParams(
max_parallel_requests=8,
),
)
Default: 4
When to increase: Your inference backend has high throughput capacity, you're using a cloud API with generous rate limits, or you're running vLLM/TensorRT-LLM with multiple GPUs
When to decrease: You're hitting rate limits or 429 errors, the inference server is overloaded, or you want more predictable/debuggable execution
Finding the optimal value
The right value depends on your inference stack and model. Self-hosted vLLM servers can often handle values as high as 256, 512, or even 1024 depending on your hardware.
Benchmark approach: Run a small dataset (e.g., 100 records) with increasing max_parallel_requests values (4 → 8 → 16 → 32 → ...) and measure generation time. Stop increasing when the runtime stops decreasing; that's when your inference server is saturated.
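A minimal benchmarking loop might look like the sketch below. run_small_generation is a hypothetical placeholder for however you launch a ~100-record run in your own workflow; only ChatCompletionInferenceParams and max_parallel_requests come from the examples above.

```python
import time

import data_designer.config as dd

def find_saturation_point(run_small_generation, values=(4, 8, 16, 32, 64, 128)):
    """Time a small run at increasing parallelism and stop once speedups vanish.

    `run_small_generation` is a placeholder callable: it should rebuild your model
    config with the given inference parameters and generate ~100 records.
    """
    previous = float("inf")
    for n in values:
        params = dd.ChatCompletionInferenceParams(max_parallel_requests=n)
        start = time.perf_counter()
        run_small_generation(params)
        elapsed = time.perf_counter() - start
        print(f"max_parallel_requests={n}: {elapsed:.1f}s")
        if elapsed > 0.95 * previous:  # <5% improvement: the server is saturated
            return n
        previous = elapsed
    return values[-1]
```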
non_inference_max_parallel_workers (RunConfig)
Controls thread pool size for non-LLM operations (samplers, expressions, validators).
run_config = dd.RunConfig(non_inference_max_parallel_workers=8)
designer.set_run_config(run_config)
Default: 4
When to increase: Many CPU-bound columns (complex expressions, heavy sampling)
Error Handling (RunConfig)
Control retry behavior and early shutdown for failed generations.
run_config = dd.RunConfig(
max_conversation_restarts=5, # Full conversation restarts (default: 5)
max_conversation_correction_steps=0, # In-conversation corrections (default: 0)
disable_early_shutdown=False, # Enable early shutdown (default)
shutdown_error_rate=0.5, # Shut down if >50% errors
shutdown_error_window=10, # Min tasks before error monitoring
)
designer.set_run_config(run_config)
When to adjust:
- Strict schemas: Increase max_conversation_restarts to 7, add max_conversation_correction_steps=2
- Debugging: Set disable_early_shutdown=True to see all errors
- Simple text: Reduce max_conversation_restarts to 3
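As a concrete illustration of these adjustments, here is a sketch using only the RunConfig fields shown above:

```python
import data_designer.config as dd

# Strict structured outputs: allow more restarts plus in-conversation corrections.
strict_schema_config = dd.RunConfig(
    max_conversation_restarts=7,
    max_conversation_correction_steps=2,
)

# Debugging: keep generating through failures so every error is surfaced.
debug_config = dd.RunConfig(disable_early_shutdown=True)

# Simple free-text generation: fewer restarts are usually enough.
simple_text_config = dd.RunConfig(max_conversation_restarts=3)
```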
Common Problems
| Problem | Symptom | Solution |
|---|---|---|
| Low throughput | Low GPU utilization | Increase max_parallel_requests and/or buffer_size |
| Long tail of slow generations | Most records fast, few very slow | Reduce max_conversation_restarts, simplify schemas, improve prompts |
| Multi-model idle periods | One model busy, others idle | Reduce buffer_size for faster cycling, or consolidate models |
| Memory errors | OOM crashes | Reduce buffer_size and max_parallel_requests |
| Too many errors | Generation fails frequently | Check prompts/schemas; adjust shutdown_error_rate or disable early shutdown for debugging |
Tuning Workflow
- Start with defaults for initial development
- Profile your workload: How many LLM columns? How many records? What models?
- Identify bottleneck: Low GPU utilization → increase max_parallel_requests. Memory issues → decrease buffer_size. Long tails → tune retry settings.
- Iterate: Make one change at a time, and measure the impact before the next change
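Putting it together, a tuned setup for a high-capacity self-hosted server might look like the sketch below; the alias, model name, and values are illustrative and should come out of your own benchmarking.

```python
import data_designer.config as dd
from data_designer.interface import DataDesigner

model = dd.ModelConfig(
    alias="my-model",
    model="nvidia/nemotron-3-nano-30b-a3b",
    inference_parameters=dd.ChatCompletionInferenceParams(
        max_parallel_requests=32,  # raised after benchmarking against the server
    ),
)

run_config = dd.RunConfig(
    buffer_size=2000,                      # larger batches for a high-capacity backend
    non_inference_max_parallel_workers=8,  # more workers for sampler/expression columns
)

designer = DataDesigner()
designer.set_run_config(run_config)
# Register `model` and your columns in the workflow configuration as usual.
```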
Related Documentation
- Deployment Options: Choosing between library and microservice
- Model Configuration: Complete model settings reference
- Inference Parameters: Detailed parameter reference