
2026

Data Designer Got Skills

Lessons from building an agent-first CLI and skill for Data Designer

We just published the data-designer skill, which uses agent-focused CLI commands in Data Designer to generate datasets efficiently. Describe the dataset you want, and your agent will craft the Data Designer configuration for you, covering schema design, validation, preview, and generation, either interactively or on full autopilot (just tell the agent to "be opinionated" or "surprise me").

Search Agent SFT Data: Teaching LLMs to Browse the Web

Training search agents requires trajectory data: the full multi-turn interaction showing how a model searches, reads, reasons, and answers. We built a four-stage pipeline that generates synthetic search trajectories from Wikidata knowledge graph paths, converts them into BrowseComp-style riddles using NeMo Data Designer, generates multi-step search rollouts with live web search via Tavily, and post-processes the results into SFT-ready training data.

Structured Outputs for Nemotron: Teaching Models to Produce Valid JSON, YAML, and XML

Using NeMo Data Designer, an orchestration framework for generating high-quality synthetic data at scale, we built an iterative pipeline that generates diverse, schema-constrained structured outputs across JSON, YAML, and XML. Through multiple rounds of prompt refinement, rejection sampling, and programmatic validation, we produced a 9,949-sample dataset of verified structured output training data.

Designing Data Designer: Why SDG Is a Systems Problem

Synthetic data generation (SDG) is more than a single prompt to a large language model. In this post, we walk through the design principles behind NeMo Data Designer and explain why we built it as a composable orchestration framework, treating SDG as a system of specialized stages rather than a monolithic generation task.