Search Agent SFT Data: Teaching LLMs to Browse the Web

Training search agents requires trajectory data --- the full multi-turn interaction showing how a model searches, reads, reasons, and answers. We built a four-stage pipeline that generates synthetic search trajectories from Wikidata knowledge graph paths, converts them into BrowseComp-style riddles using NeMo Data Designer, generates multi-step search rollouts with live web search via Tavily, and post-processes the results into SFT-ready training data.

Why This Matters: The Agentic Shift

The industry is moving from models that answer questions to agents that take actions. Real-world AI applications orchestrate multiple steps --- searching the web, querying databases, reading documents, calling APIs --- with the LLM as the reasoning engine deciding what to do next.

Consider this question from OpenAI's BrowseComp benchmark:

Between 1990 and 1994 inclusive, what teams played in a soccer match with a Brazilian referee that had four yellow cards, two for each team where three of the total four were not issued during the first half, and four substitutions, one of which was for an injury in the first 25 minutes of the match.

Answer: Ireland v Romania

You can't answer this from memory. You need to search, read results, refine your query, search again, and piece it together --- exactly what we want AI agents to do. Training a model for this requires trajectory data: the full record of every search query, every result evaluation, and every reasoning step, not just the final answer.

Creating this data by hand takes 15-30 minutes per example. At the thousands of trajectories needed for SFT, that's months of annotation work. We needed a way to generate it synthetically.

End-to-End Pipeline Architecture

                                    SEARCH AGENT SFT PIPELINE
                                    =========================

             ┌─────────────────────────────────────────────────────────────────────────────────────┐
             │                       STAGE 1: SEED DATA (Wikidata KG Walks)                        │
             │                                                                                     │
             │   Random walks on the Wikidata knowledge graph                                      │
             │   ├─ Anti-meta filters (no category/template/list-y nodes)                          │
             │   ├─ Hop range: 4 minimum, 8 maximum                                                │
             │   └─ SPARQL queries to fetch neighbors                                              │
             │                                                                                     │
             │   Output: seed JSONL with hops[], seed_entity, final_answer_entity, path_length     │
             │   50,000 seeds generated                                                            │
             └─────────────────────────────────────────┬───────────────────────────────────────────┘
                                                       │
                                                       ▼
             ┌─────────────────────────────────────────────────────────────────────────────────────┐
             │                   STAGE 2: SEARCH RIDDLE GENERATION (Data Designer)                 │
             │                                                                                     │
             │   user_query_draft ──────────► user_query_obfuscated                                │
             │   (chain clues from path,       (BrowseComp-style rewrite:                          │
             │    hide intermediate nodes,      concise, natural, no breadcrumbs,                  │
             │    don't name the answer)        1-2 sentences max)                                 │
             │                                                                                     │
             │   + Heuristic filters: answer leakage, intermediate node leakage, INVALID_PATH      │
             │   50,000 → 37,000 valid seeds → 24,000 valid questions                              │
             └─────────────────────────────────────────┬───────────────────────────────────────────┘
                                                       │
                                                       ▼
             ┌─────────────────────────────────────────────────────────────────────────────────────┐
             │                      STAGE 3: SEARCH TRAJECTORY ROLLOUTS                            │
             │                                                                                     │
             │   Thought-Action-Observation loop with live web search (Tavily API)                 │
             │   ├─ Rollout model: MiniMax-M2 (strong BrowseComp + tool-calling scores)            │
             │   ├─ Average ~12 tool calls per sample                                              │
             │   ├─ Multiple rollouts per question for rejection sampling                          │
             │   └─ 6,974 completed (stop) / 177 truncated (length)                                │
             │                                                                                     │
             │   24,000 questions → 7,000 valid trajectories                                       │
             └─────────────────────────────────────────┬───────────────────────────────────────────┘
                                                       │
                                                       ▼
             ┌─────────────────────────────────────────────────────────────────────────────────────┐
             │                      STAGE 4: POST-PROCESSING → SFT DATASET                         │
             │                                                                                     │
             │   ├─ Normalize tool outputs to consistent JSON "tool response" shape                │
             │   ├─ Drop broken/truncated interactions                                             │
             │   ├─ Select best rollout per question (min tool calls among correct)                │
             │   ├─ Write OpenAI-messages style: messages[], tools[], metadata{}                   │
             │   └─ Manual review + LLM spot-checking (Gemini)                                     │
             │                                                                                     │
             │   ~7,000 SFT records for Nemotron Super                                             │
             └─────────────────────────────────────────────────────────────────────────────────────┘

Step 1: Seed Data from Wikidata Knowledge Graph Walks

The core idea: start at a random entity in the Wikidata knowledge graph and perform a random walk through its relations, producing a chain of hops that becomes a multi-hop search problem. Each chain provides a seed_entity (start), a final_answer_entity (destination), and a readable_path describing the edges traversed.

We used Wikidata SPARQL queries to fetch neighbors at each hop. The number of hops is directly proportional to the number of tool calls the model would need to solve questions derived from that path --- more hops means harder riddles.

START ENTITY: NVIDIA (Q182477)
  ⬇ [chief executive officer (P169)]
  NODE: Jensen Huang (Q332838)
  ⬇ [educated at (P69)]
  NODE: Oregon State University (Q861888)
  ⬇ [located in the administrative territorial entity (P131)]
  NODE: Benton County (Q115372)
  ⬇ [named after (P138)]
  NODE: Thomas Hart Benton (Q178712)

START ENTITY: toothache (Q143925)
  ⬇ [risk factor (P564)]
  NODE: smoking (Q662860)
  ⬇ [has effect (P1542)]
  NODE: Crohn's disease (Q1472)
  ⬇ [drug or therapy used for treatment (P2176)]
  NODE: TNF inhibitor (Q1536078)
  ⬇ [(is possible treatment of) (P2175)] reverse relation
  NODE: Behçet's disease (Q911427)
  ⬇ [symptoms and signs (P780)]
  NODE: inflammation (Q101991)
  ⬇ [drug or therapy used for treatment (P2176)]
  NODE: (±)-flurbiprofen (Q419890)
  ⬇ [significant drug interaction (P769)]
  NODE: parecoxib (Q347941)
  ⬇ [significant drug interaction (P769)]
  NODE: ibuprofen (Q186969)

Heuristics to Keep Walks Coherent

Unrestricted random walks go off the rails quickly --- you'd get paths like CEO → Human Being → Civilization → Indus Valley. We applied several filters:

Anti-meta filters: Avoid category nodes, template pages, list-y entities, and other degenerate hops that exist for Wikidata bookkeeping rather than representing real-world relationships.
Hop range: 4 minimum, 8 maximum. Below 4 hops, the questions aren't difficult enough to require multi-step search. Above 8, the path wanders off-topic and produces unsolvable riddles.
Generic entity filtering: Remove seeds where the final_answer_entity is too abstract ("technology", "people", "field", "concept"). These produce questions where any answer could be correct.

The resulting seed dataset: 50,000 JSONL records, each containing hops[], seed_entity, final_answer_entity, readable_path, and path_length.

A Note on Ground Truth Staleness

An important caveat when using Wikidata as a seed source: the knowledge graph reflects a snapshot in time. Models with current parametric knowledge or live search results may find answers that are factually correct today but disagree with the KG-derived ground truth. For example, a question about "which country contains the headquarters of the owner of U.S. Steel?" has ground truth "United States" from Wikidata --- but U.S. Steel was acquired by Nippon Steel (Japan) in Dec 2023, making "Japan" the correct answer now. This staleness affects both question quality (paths through outdated facts) and evaluation (correct model answers flagged as wrong). We revisit this challenge in the Correctness Challenge section below.

Step 2: Creating Search Riddles with Data Designer

Each seed path needs to be converted into two things: a search question (obfuscated so it doesn't leak the answer) and a ground truth target entity (the final node in the path). We use two chained LLM columns in Data Designer for this.

Stage 1 --- Draft question: Chain clues from the knowledge path into a multi-hop riddle. Critical rules: don't name intermediate nodes, don't name the final answer, skip weak or illogical hops, and output INVALID_PATH if the path is unsalvageable.

Stage 2 --- Obfuscated question: Rewrite the draft in BrowseComp style --- concise, natural, 1-2 sentences max. The solver must figure out what to search rather than following explicit breadcrumbs. No relational breadcrumbing like "X is member of Y; Y is based in Z...".

import data_designer.config as dd
from data_designer.interface import DataDesigner

config = dd.DataDesignerConfigBuilder(model_configs=[
    dd.ModelConfig(
        alias="riddle-gen",
        model="qwen/qwen3-235b-a22b",
        provider="nvidia",
    ),
])

config.with_seed_dataset(
    dd.LocalFileSeedSource(path="search_agent_seeds.parquet"),
    sampling_strategy=dd.SamplingStrategy.SHUFFLE,
)

# Stage 1: Draft question from knowledge path
config.add_column(dd.LLMTextColumnConfig(
    name="user_query_draft",
    model_alias="riddle-gen",
    prompt=(
        "You are an expert Search Evaluator designing Grandmaster-Level search tests.\n"
        "Create a complex, multi-step search riddle based on this knowledge path:\n\n"
        "{{ readable_path }}\n\n"
        "Start Entity: {{ seed_entity }}\n"
        "Final Answer Entity: {{ final_answer_entity }}\n\n"
        "RULES:\n"
        "1. DO NOT name the intermediate nodes. Hide them behind descriptions.\n"
        "2. DO NOT name the Final Answer.\n"
        "3. Chain the clues logically.\n"
        "4. If a step is weak or nonsensical, IGNORE IT.\n"
        "5. Output INVALID_PATH if the path is unsalvageable.\n\n"
        "Return ONLY the question string (or INVALID_PATH)."
    ),
))

# Stage 2: BrowseComp-style obfuscation
config.add_column(dd.LLMTextColumnConfig(
    name="user_query_obfuscated",
    model_alias="riddle-gen",
    prompt=(
        "Rewrite this search riddle to be MORE obfuscated and natural.\n\n"
        "Original: {{ user_query_draft }}\n"
        "Secret path: {{ readable_path }}\n\n"
        "REQUIREMENTS:\n"
        "1. DO NOT reveal the step-by-step plan. No breadcrumb chains.\n"
        "2. DO NOT name intermediate or final entities.\n"
        "3. 1-2 sentences max. Sound like a real user question.\n"
        "4. If original == INVALID_PATH, output INVALID_PATH.\n\n"
        "Return ONLY the rewritten question."
    ),
))

Example transformation (NVIDIA path):

Draft:      "Starting from NVIDIA, identify the current CEO, then find
             where they received their bachelor's degree, determine which
             county houses that university's main campus, and finally
             identify the nickname of the 19th-century U.S. Senator
             the county is named after."

Obfuscated: "Identify the nickname ('Old ____') of the 19th-century U.S.
             Senator who is the namesake of the specific county that houses
             the main campus of the university where the current CEO of
             NVIDIA received his bachelor's degree."

Answer:     "Old Bullion"

The obfuscated version requires the solver to:

Identify Jensen Huang as NVIDIA's CEO
Find where he got his bachelor's degree (Oregon State, not Stanford)
Identify the county (Benton County, OR)
Find who the county is named after (Thomas Hart Benton)
Find his nickname --- forcing one final hop to verify it's the Senator, not the painter

Step 3: Search Trajectory Rollouts

For each riddle, we generate multi-step tool-using conversation rollouts where the model makes several web-search tool calls, reads tool outputs, reasons about results, and ends with a final answer. The agent operates through a cyclical thought-action-observation loop:

Thought: Analyze the current state, identify the next missing link, formulate a search query
Action (Tool Call): Execute web_search(query="..."), retrieving real data from the web
Observation & Synthesis: Read the results, extract the relevant fact, loop back to step 1

The rollout model was MiniMax-M2, chosen for its strong performance on BrowseComp and tool-calling benchmarks.

Rollout Statistics

$ jq -r '.finish_reason[-1] // "MISSING"' shard_*/output-rs0.jsonl-async \
  | sort | uniq -c | sort -nr
   6974 stop
    177 length

6,974 completed (stop) --- the model reached a final answer
177 truncated (length) --- hit the context limit before answering
Average ~12 tool calls per sample

What a Trajectory Looks Like

Each trajectory captures the full search process. Here's a condensed example (simplified for readability --- production records include OpenAI-spec fields like tool_call_id, id, and type on tool call messages):

{
  "messages": [
    {
      "role": "system",
      "content": "You are an expert search agent. Your goal is to answer the user's question efficiently using verifiable online sources.\n\nExecution Rules:\n1. Search Results...\n2. Planning...\n3. Tool Use...\n4. MAXIMUM 25 Calls...\n5. Verification...\n6. Output Format - The last line must be EXACTLY: \"Final Answer: <The Entity>\"...\n"
    },
    {
      "role": "user",
      "content": "Problem: Which equatorial coordinate system, using an angular measurement instead of right ascension, can be reached by starting with lawrencium, then the physicist it honors, the university they attended...?"
    },
    {
      "role": "assistant",
      "content": "<think>\nWe need to parse the clues step by step:\n1. Start with lawrencium... named after Ernest Lawrence...\n2. The university they attended... need to verify...\n</think>",
      "tool_calls": [{"function": {"name": "web-search", "arguments": "{\"query\": \"Ernest Lawrence attended university\"}"}}]
    },
    {
      "role": "tool",
      "name": "web-search",
      "content": "{\"results\": [{\"title\": \"Physics - Ernest Lawrence\", \"content\": \"He attended Manchester High School prior to entering Manchester University...\"}, {\"title\": \"2025 Hall of Fame...\", \"content\": \"...Ernest Lawrence graduated from USD in 1922...\"}]}"
    },
    {
      "role": "assistant",
      "content": "<think>\nErnest Lawrence attended University of South Dakota (BS) and Yale (PhD)... The clue mentions 'six-month academic schedule' — likely 'semester'...\n</think>",
      "tool_calls": [{"function": {"name": "web-search", "arguments": "{\"query\": \"semester six month academic schedule\"}"}}]
    },
    {
      "role": "tool",
      "name": "web-search",
      "content": "{\"results\": [{\"title\": \"Academic Calendar\", \"content\": \"Many colleges follow a semester plan...\"}]}"
    },
    {
      "role": "assistant",
      "content": "Final Answer: first equatorial coordinate system"
    }
  ],
  "metadata": {
    "ground_truth": "first equatorial coordinate system",
    "num_hops_in_graph": 8,
    "seed_entity": "lawrencium",
    "final_answer_entity": "first equatorial coordinate system",
    "num_tool_calls": 2,
    "finish_reason": ["tool_calls", "tool_calls", "stop"]
  },
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "web-search",
        "description": "Search the web for a query.",
        "parameters": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}
      }
    }
  ]
}

Step 4: Post-Processing to SFT Dataset

Raw rollouts need cleanup before they become trainable SFT records:

Normalize tool outputs into a consistent JSON "tool response" shape
Drop broken/truncated interactions (the 177 length records)
Select the best rollout per question (minimum tool calls among correct ones)
Write OpenAI-messages style JSONL with messages[], tools[], and metadata{}
Manual review + LLM spot-checking --- we reviewed as much SFT data as we could manually and used Gemini to spot-check chunks

Production Yield Analysis

                                        PIPELINE YIELD
                                        ==============

             ┌─────────────────────────────────────────────────────────────────────────────────────┐
             │   50,000 ───74%──► 37,000 ───65%──► 24,000 ───29%──► 7,000                          │
             │   Seeds           Valid Seeds       Valid Questions    Valid Trajectories           │
             │                                                                                     │
             │   Total Yield: 14%                                                                  │
             └─────────────────────────────────────────────────────────────────────────────────────┘

Stage	Input	Output	Yield
Seed Creation (Wikidata walks)	50,000	37,000	74%
Riddle / Question Generation (DD)	37,000	24,000	65%
Search Trajectory Rollouts (Tavily)	24,000	7,000	29%
End-to-End	50,000	~7,000	~14%

The 14% yield might seem low, but each surviving record is a verified, multi-turn search trajectory showing a model successfully navigating web search. The alternative --- human annotation at 15-30 minutes per trajectory --- would take months for the same volume.

The Correctness Challenge

Measuring correctness in post-processing was one of the hardest parts of this project, for reasons that go beyond typical evaluation:

1. Questions can have multiple valid answers. A question about "which country contains X" might have a valid answer at multiple levels of granularity, or the entity might have multiple correct associations.

2. Wikidata has stale ground truth. The knowledge graph reflects a snapshot in time. The model's parametric knowledge or live search results may be more current. For example:

Question: "...city that contains the headquarters of the owner of U.S. Steel?"

Ground truth (from Wikidata): United States

Model answer: Japan

Reality: U.S. Steel was acquired by Nippon Steel (Japan) in Dec 2023. The model's answer is factually correct today but wrong according to the outdated KG path.

Accuracy Results

We evaluated 657 trajectories against ground truth using fuzzy matching:

------------------------------------------------------------------------------------------------------------
#    | GROUND TRUTH              | #TC | MODEL ANSWER                       | STATUS
------------------------------------------------------------------------------------------------------------
1    | Ramsar Convention         | 3   | Ramsar Convention (the Convention. | ✅ MATCH
2    | United States             | 4   | United States                      | ✅ MATCH
3    | South Korea               | 3   | Uzbekistan                         | ❌ MISS
4    | France                    | 3   | Germany                            | ❌ MISS
5    | Joseph Poelaert           | 4   | Joseph Poelaert                    | ✅ MATCH
...
653  | Bangladesh                | 11  | Bangladesh                         | ✅ MATCH
654  | Portal:Arithmetic         | 10  | Portal:Arithmetic                  | ✅ MATCH
655  | Monumento 6 Gran Vía..    | 11  | Monumento V (the Monumento a los.. | ❌ MISS
656  | Tehran                    | 11  | Constantinople                     | ❌ MISS
657  | United Kingdom            | 11  | Germany                            | ❌ MISS
------------------------------------------------------------------------------------------------------------

📊 RESULTS: 181/657 (27.5%) Correct

The 27.5% accuracy on this sample is for raw, unfiltered trajectories. After the full pipeline (rejection sampling, best-rollout selection, manual review), the final SFT dataset has much higher quality. The low raw accuracy underscores why multi-stage filtering is essential.

Implementing with Data Designer's MCP Integration

The same pipeline is reproducible with Data Designer's MCP integration. Three components make this work:

LocalStdioMCPProvider launches a Tavily MCP server as a subprocess:

from data_designer.config.mcp import LocalStdioMCPProvider, ToolConfig

tavily_provider = LocalStdioMCPProvider(
    name="tavily",
    command=sys.executable,
    args=[str(tavily_server_path)],
    env={"TAVILY_API_KEY": os.environ["TAVILY_API_KEY"]},
)

ToolConfig controls safety and limits:

tool_config = ToolConfig(
    tool_alias="tavily",
    providers=["tavily"],
    allow_tools=["tavily_search"],
    max_tool_call_turns=15,
    timeout_sec=300.0,
)

tool_alias + with_trace on the LLM column enables tool calling and captures the full conversation:

config.add_column(dd.LLMTextColumnConfig(
    name="agent_solution_raw",
    system_prompt="You are an expert search agent...",
    prompt="Problem: {{ user_query_obfuscated }}",
    model_alias="search-agent",
    tool_alias="tavily",
    with_trace=dd.TraceType.ALL_MESSAGES,
))

The resulting agent_solution_raw__trace column contains the complete ChatML conversation --- every user message, every tool call with arguments, every tool response with search results, and the final assistant response. This trace is the SFT training data.

Safety controls matter here. allow_tools prevents the model from calling unexpected tools. max_tool_call_turns=15 prevents infinite search loops. timeout_sec=300 prevents hung connections. Without these, a fraction of records would consume unbounded resources.

BrowseComp Benchmark Results

This dataset was shipped as part of Nemotron Super v3 post-training (SFT + RL). On the BrowseComp benchmark (1,266 web browsing problems), Nemotron Super went from 0% to 31.28% accuracy --- approaching GPT-OSS-120B at 33.89%.

Model	BrowseComp Accuracy (%)
Nemotron Super (before synthetic search agent data)	0.00
Nemotron Super (after synthetic search agent data, SFT + RL)	31.28
GPT-OSS-120B	33.89

Before this work, Nemotron Super had zero web browsing capability --- it had never been trained on tool-use trajectories with search. Including our synthetic search agent dataset in the SFT blend, combined with other RL datasets in later training stages, enabled the model to go from no capability to near-competitive with GPT-OSS-120B on one of the hardest agentic benchmarks. This dev note focuses on the SFT data generation pipeline.

Key Takeaways

Wikidata provides infinite seed diversity. Random walks on a knowledge graph with 100M+ entities produce an inexhaustible supply of multi-hop problems. The hop count directly controls difficulty --- 4 hops for warm-up, 8 for expert-level riddles.
Two-stage obfuscation prevents leakage. Draft questions tend to follow the path structure too closely (breadcrumbing). The obfuscation rewrite produces concise, natural questions that force the solver to figure out what to search.
Low yield is expected and acceptable. 14% end-to-end yield from 50k seeds still produces ~7,000 high-quality trajectories --- enough for meaningful SFT impact. Multi-hop search is genuinely hard, and most generated paths or questions are legitimately unsolvable.
Stale knowledge graphs are a real problem. Wikidata doesn't update in real-time. Models with current parametric knowledge or live search results will disagree with ground truth on entities that have changed (mergers, leadership changes, geopolitical shifts). Correctness evaluation needs to account for this.
Iterate on seeds, not just prompts. Seed filtering (removing generic answers, constraining hop counts, anti-meta filters) has as much impact on quality as prompt engineering. Filter early, save compute.
Traces are the training data. The full thought-action-observation loop --- every search query formulation, every result evaluation, every reasoning step --- is what teaches tool-use capability. Final answers alone are worthless without the process.

Next Steps

Scale question generation. Generate closer to ~25,000 filtered questions using Data Designer, up from the current 7k trajectories.
Push difficulty higher. Target questions where num_tool_calls consistently exceeds 15+, requiring deeper reasoning chains.
Explore fresher knowledge bases. Wikidata staleness is a real limitation. Investigate more recently updated, freely available knowledge bases for seed generation.
Search RL environment. Use the filtered questions as an RL environment where the model gets reward for correct final answers, complementing the SFT data.

Try For Yourself

The snippet below shows the core pattern: seed data, two-stage riddle generation, and an MCP-enabled agent trajectory with full trace capture.

Minimal example: search agent trajectory pipeline

import data_designer.config as dd
from data_designer.interface import DataDesigner

MODEL_ALIAS = "nvidia-text"

# Tavily MCP provider (hosted endpoint, no local server needed)
mcp_provider = dd.MCPProvider(
    name="tavily",
    endpoint="https://mcp.tavily.com/mcp/?tavilyApiKey=YOUR_KEY",
    provider_type="streamable_http",
)

tool_config = dd.ToolConfig(
    tool_alias="tavily-search",
    providers=["tavily"],
    allow_tools=["tavily_search"],
    max_tool_call_turns=25,
    timeout_sec=300.0,
)

config = dd.DataDesignerConfigBuilder(tool_configs=[tool_config])
config.with_seed_dataset(
    dd.LocalFileSeedSource(path="seeds.jsonl"),
    sampling_strategy=dd.SamplingStrategy.SHUFFLE,
)

# Stage 2a: Draft question from knowledge path
config.add_column(dd.LLMTextColumnConfig(
    name="user_query_draft", model_alias=MODEL_ALIAS,
    prompt=(
        "Create a multi-step search riddle from this knowledge path:\n"
        "{{ readable_path }}\n"
        "Start: {{ seed_entity }}. Answer: {{ final_answer_entity }}\n"
        "Do NOT name intermediate nodes or the answer. Return ONLY the question."
    ),
))

# Stage 2b: BrowseComp-style obfuscation
config.add_column(dd.LLMTextColumnConfig(
    name="user_query_obfuscated", model_alias=MODEL_ALIAS,
    prompt=(
        "Rewrite this riddle to be concise and natural (1-2 sentences).\n"
        "Original: {{ user_query_draft }}\n"
        "No breadcrumb chains. No entity names. If INVALID_PATH, output INVALID_PATH."
    ),
))

# Stage 3: Agent trajectory with MCP tool calling
config.add_column(dd.LLMTextColumnConfig(
    name="agent_solution_raw", model_alias=MODEL_ALIAS,
    system_prompt="You are an expert search agent. Use tavily_search to find the answer.",
    prompt="Problem: {{ user_query_obfuscated }}",
    tool_alias="tavily-search",
    with_trace=dd.TraceType.ALL_MESSAGES,
))

# Run
data_designer = DataDesigner(mcp_providers=[mcp_provider])
preview = data_designer.preview(config, num_records=5)
preview.display_sample_record()

Full recipe: search_agent.py (self-contained, runnable)

Download Code

# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "data-designer",
# ]
# ///
"""Nemotron Super Search Agent Recipe: Trajectories with Tavily Web Search

Generate multi-turn search agent trajectories where an LLM iteratively
searches the web, reads results, reasons about evidence, and synthesizes
answers -- the kind of data needed to train BrowseComp-style search agents.

This recipe implements the pipeline used to produce ~7,000 high-quality
tool-use trajectories for Nemotron Super post-training, starting from
50,000 Wikidata knowledge graph seeds.

Pipeline architecture:

    ┌─────────────────────────────────────────────────────────────────────────┐
    │                   STAGE 1: SEED DATA (Wikidata KG Walks)                │
    │                                                                         │
    │  Random walks on the Wikidata knowledge graph produce multi-hop paths.  │
    │  Each seed has: seed_entity, final_answer_entity, readable_path,        │
    │  num_hops_in_graph, ground_truth.                                       │
    │  Built-in demo seeds included; bring your own for production.           │
    ├─────────────────────────────────────────────────────────────────────────┤
    │                   STAGE 2: SEARCH RIDDLE GENERATION (LLM)               │
    │                                                                         │
    │  user_query_draft ────────► user_query_obfuscated                       │
    │  (chain clues from path,     (BrowseComp-style rewrite:                 │
    │   hide intermediate nodes,    concise, natural, no breadcrumbs,         │
    │   don't name the answer)      1-2 sentences max)                        │
    ├─────────────────────────────────────────────────────────────────────────┤
    │                   STAGE 3: SEARCH TRAJECTORY ROLLOUTS (LLM + MCP)       │
    │                                                                         │
    │  Thought-Action-Observation loop with live Tavily web search.           │
    │  ├─ tavily_search tool via hosted MCP endpoint                          │
    │  ├─ Maximum 25 tool call turns; 300s timeout                            │
    │  ├─ Full trace captured via with_trace=ALL_MESSAGES                     │
    │  └─ Structured JSON output: final_answer, supporting_urls,              │
    │     short_justification                                                 │
    ├─────────────────────────────────────────────────────────────────────────┤
    │                   STAGE 4: STRUCTURED FORMATTING (LLM)                  │
    │                                                                         │
    │  Normalize raw agent output into clean JSON via LLMStructuredColumn.    │
    │  Handles markdown fences, trailing text, single-quoted dicts.           │
    │                                                                         │
    │  The agent_solution_raw__trace column IS the SFT training data:         │
    │  complete ChatML conversation with every tool call and response.        │
    └─────────────────────────────────────────────────────────────────────────┘

Prerequisites:
    - TAVILY_API_KEY environment variable (get a free key at https://tavily.com)
    - OPENAI_API_KEY environment variable for OpenAI provider model aliases.
    - NVIDIA_API_KEY environment variable for NVIDIA provider model aliases (default model alias is "nvidia-text").

Run:
    # Basic usage with built-in demo seeds (generates 2 trajectories)
    uv run search_agent.py

    # Use a custom seed parquet
    uv run search_agent.py --seed-path /path/to/seeds.parquet --num-records 10

    # For help message and available options
    uv run search_agent.py --help
"""

from __future__ import annotations

import json
import os
import tempfile
from pathlib import Path

from pydantic import BaseModel, Field

import data_designer.config as dd
from data_designer.interface import DataDesigner

# =============================================================================
# Structured Output Schema
# =============================================================================


class AgentSolution(BaseModel):
    """Structured output for the search agent's final answer."""

    final_answer: str = Field(..., min_length=1, description="The final answer entity.")
    supporting_urls: list[str] = Field(
        default_factory=list, description="Authoritative URLs used to verify the answer."
    )
    short_justification: str = Field(..., min_length=1, description="Brief explanation of reasoning (1-2 sentences).")


# =============================================================================
# Prompt Templates
# =============================================================================

QUERY_DRAFT_PROMPT = """\
You are an expert Search Evaluator designing Grandmaster-Level search tests.
Create a complex, multi-step search riddle based on this knowledge path:

{{ readable_path }}

Start Entity: {{ seed_entity }}
Final Answer Entity: {{ final_answer_entity }}

CRITICAL RULES:
1. DO NOT name the intermediate nodes. Hide them behind descriptions.
2. DO NOT name the Final Answer.
3. Chain the clues logically -- describe each step relative to the previous one.
4. Audit the logic: if a step is weak or nonsensical, IGNORE IT.
5. Salvage and simplify: use only the strongest, most logical hops.
6. No hallucinations: do not invent relationships not in the path.
7. Aim for 4-8 meaningful hops.

VALIDATION - Output "INVALID_PATH" if:
- Final answer is generic/abstract (e.g. "technology", "people", "field")
- Path has weak/illogical relationships
- No coherent question can be formed

Return ONLY the question string (or "INVALID_PATH").\
"""

OBFUSCATE_PROMPT = """\
Rewrite this search riddle to better match BrowseComp-style tasks.

Original Riddle: {{ user_query_draft }}

Secret Path (do not leak entities): {{ readable_path }}
Start Entity: {{ seed_entity }}
Final Answer (do not leak): {{ final_answer_entity }}

HARD REQUIREMENTS:
1. NEVER reveal the step-by-step plan. No breadcrumb chains.
   Avoid: "X is member of Y; Y is based in Z; Z is the capital of..."
   Avoid meta language: "then search...", "next find...", "follow the chain..."
2. NEVER mention the final answer or any intermediate entity by name.
3. Keep it concise and natural: 1-2 sentences max (3 for very complex paths).
4. Use descriptive clues that require reasoning.
5. Include at least one disambiguating filter (date, count, or specific attribute).
6. If original == "INVALID_PATH", output exactly "INVALID_PATH".

Return ONLY the rewritten question string (or "INVALID_PATH").\
"""

AGENT_SYSTEM_PROMPT = """\
You are an expert search agent that uses web search to answer questions accurately.

You MUST output ONLY valid JSON matching this exact schema:

{
  "final_answer": "string - the specific answer entity",
  "supporting_urls": ["url1", "url2"],
  "short_justification": "string - brief 1-2 sentence explanation"
}

AVAILABLE TOOLS:
You have access to ONE tool called "tavily_search" with parameter: query (string, required).

TOOL USAGE RULES:
1. Exact Tool Name: Always use "tavily_search" (no suffixes or prefixes).
2. Exact Args: Only send {"query": "..."} for the tool call.
3. Maximum 25 tool calls. Budget your searches wisely.
4. Search Strategy:
   - Start with broad queries to understand the domain
   - Refine to specific entities/relationships
   - Cross-verify facts across multiple sources
   - Use different query formulations for the same information
5. No Reasoning Tags: Do NOT use <think> tags or XML formatting.
6. No Intermediate Text: Do NOT output explanatory text between tool calls.
7. Final Output: After completing your searches, output ONLY the JSON object.

EXECUTION FLOW:
1. Read the user's question
2. Make tool calls using "tavily_search" to gather information
3. Verify information across multiple sources
4. Once confident, output the JSON result (no additional text)\
"""

FORMATTER_PROMPT = """\
You are a JSON normalizer.

You will be given a messy model output that should contain a JSON object with:
- final_answer (string)
- supporting_urls (list of strings)
- short_justification (string)

Rules:
- Return ONLY a JSON object. No markdown. No extra text.
- If the input contains code fences, tool chatter, or extra prose, ignore it.
- If the input contains invalid JSON, repair it.
- supporting_urls must be a list of valid http(s) URLs (dedupe, keep best 1-5).

Input:
{{ agent_solution_raw }}\
"""


# =============================================================================
# Data Designer Configuration
# =============================================================================


def build_config(model_alias: str) -> tuple[dd.DataDesignerConfigBuilder, dd.MCPProvider]:
    """Build the Data Designer configuration for search agent trajectory generation.

    Returns:
        A tuple of (config_builder, mcp_provider).
    """
    tavily_api_key = os.environ.get("TAVILY_API_KEY", "")
    mcp_provider = dd.MCPProvider(
        name="tavily",
        endpoint=f"https://mcp.tavily.com/mcp/?tavilyApiKey={tavily_api_key}",
        provider_type="streamable_http",
    )

    tool_config = dd.ToolConfig(
        tool_alias="tavily-search",
        providers=["tavily"],
        allow_tools=["tavily_search"],
        max_tool_call_turns=25,
        timeout_sec=300.0,
    )

    config_builder = dd.DataDesignerConfigBuilder(tool_configs=[tool_config])

    # Stage 2: Draft question from knowledge path
    config_builder.add_column(
        dd.LLMTextColumnConfig(
            name="user_query_draft",
            model_alias=model_alias,
            prompt=QUERY_DRAFT_PROMPT,
        )
    )

    # Stage 2: BrowseComp-style obfuscation
    config_builder.add_column(
        dd.LLMTextColumnConfig(
            name="user_query_obfuscated",
            model_alias=model_alias,
            prompt=OBFUSCATE_PROMPT,
        )
    )

    # Stage 3: Agent trajectory with MCP tool calling
    config_builder.add_column(
        dd.LLMTextColumnConfig(
            name="agent_solution_raw",
            model_alias=model_alias,
            system_prompt=AGENT_SYSTEM_PROMPT,
            prompt="Problem: {{ user_query_obfuscated }}",
            tool_alias="tavily-search",
            with_trace=dd.TraceType.ALL_MESSAGES,
        )
    )

    # Stage 4: Structured JSON formatting
    config_builder.add_column(
        dd.LLMStructuredColumnConfig(
            name="agent_solution",
            model_alias=model_alias,
            prompt=FORMATTER_PROMPT,
            output_format=AgentSolution,
        )
    )

    return config_builder, mcp_provider


# =============================================================================
# Demo Seed Data
# =============================================================================

DEMO_SEEDS = [
    {
        "seed_entity": "NVIDIA",
        "final_answer_entity": "Thomas Hart Benton",
        "readable_path": (
            "START ENTITY: NVIDIA (Q182477)\n"
            "  \u2b07 [chief executive officer (P169)]\n"
            "  NODE: Jensen Huang (Q332838)\n"
            "  \u2b07 [educated at (P69)]\n"
            "  NODE: Oregon State University (Q861888)\n"
            "  \u2b07 [located in the administrative territorial entity (P131)]\n"
            "  NODE: Benton County (Q115372)\n"
            "  \u2b07 [named after (P138)]\n"
            "  NODE: Thomas Hart Benton (Q178712)"
        ),
        "num_hops_in_graph": 4,
        "ground_truth": "Thomas Hart Benton",
    },
    {
        "seed_entity": "Python",
        "final_answer_entity": "Centrum Wiskunde & Informatica",
        "readable_path": (
            "START ENTITY: Python (Q28865)\n"
            "  \u2b07 [developer (P178)]\n"
            "  NODE: Guido van Rossum (Q19845)\n"
            "  \u2b07 [employer (P108)]\n"
            "  NODE: Centrum Wiskunde & Informatica (Q1060645)"
        ),
        "num_hops_in_graph": 2,
        "ground_truth": "Centrum Wiskunde & Informatica",
    },
    {
        "seed_entity": "toothache",
        "final_answer_entity": "ibuprofen",
        "readable_path": (
            "START ENTITY: toothache (Q143925)\n"
            "  \u2b07 [risk factor (P564)]\n"
            "  NODE: smoking (Q662860)\n"
            "  \u2b07 [has effect (P1542)]\n"
            "  NODE: Crohn's disease (Q1472)\n"
            "  \u2b07 [drug or therapy used for treatment (P2176)]\n"
            "  NODE: TNF inhibitor (Q1536078)\n"
            "  \u2b07 [is possible treatment of (P2175)]\n"
            "  NODE: Beh\u00e7et's disease (Q911427)\n"
            "  \u2b07 [symptoms and signs (P780)]\n"
            "  NODE: inflammation (Q101991)\n"
            "  \u2b07 [drug or therapy used for treatment (P2176)]\n"
            "  NODE: flurbiprofen (Q419890)\n"
            "  \u2b07 [significant drug interaction (P769)]\n"
            "  NODE: parecoxib (Q347941)\n"
            "  \u2b07 [significant drug interaction (P769)]\n"
            "  NODE: ibuprofen (Q186969)"
        ),
        "num_hops_in_graph": 8,
        "ground_truth": "ibuprofen",
    },
]


def write_demo_seeds(output_dir: Path) -> Path:
    """Write demo seed data to a JSONL file."""
    output_dir.mkdir(parents=True, exist_ok=True)
    seed_path = output_dir / "demo_seeds.jsonl"
    with open(seed_path, "w", encoding="utf-8") as f:
        for seed in DEMO_SEEDS:
            f.write(json.dumps(seed, ensure_ascii=False) + "\n")
    return seed_path


# =============================================================================
# Main Entry Point
# =============================================================================


def parse_args():
    """Parse command line arguments."""
    from argparse import ArgumentParser

    parser = ArgumentParser(description="Generate search agent trajectories using Tavily web search via MCP.")
    parser.add_argument("--model-alias", type=str, default="nvidia-text", help="Model alias to use for generation")
    parser.add_argument("--num-records", type=int, default=2, help="Number of trajectories to generate")
    parser.add_argument("--seed-path", type=str, default=None, help="Path to seed parquet or JSONL file")
    parser.add_argument("--artifact-path", type=str, default=None, help="Path to save artifacts")
    return parser.parse_args()


def main() -> None:
    """Main entry point for the demo."""
    args = parse_args()

    if os.environ.get("TAVILY_API_KEY") is None:
        raise RuntimeError("TAVILY_API_KEY must be set. Get a free key at https://tavily.com")

    if os.environ.get("NVIDIA_API_KEY") is None and args.model_alias.startswith("nvidia"):
        raise RuntimeError("NVIDIA_API_KEY must be set when using NVIDIA model aliases.")

    if args.seed_path:
        seed_path = args.seed_path
    else:
        demo_dir = Path(tempfile.mkdtemp(prefix="search_agent_demo_"))
        seed_path = str(write_demo_seeds(demo_dir))
        print(f"Using demo seeds in: {demo_dir}")

    config_builder, mcp_provider = build_config(model_alias=args.model_alias)
    config_builder.with_seed_dataset(
        dd.LocalFileSeedSource(path=seed_path),
        sampling_strategy=dd.SamplingStrategy.SHUFFLE,
    )

    data_designer = DataDesigner(artifact_path=args.artifact_path, mcp_providers=[mcp_provider])
    preview_results = data_designer.preview(config_builder, num_records=args.num_records)

    print("\n" + "=" * 60)
    print("GENERATED SEARCH AGENT TRAJECTORIES")
    print("=" * 60)
    preview_results.display_sample_record()


if __name__ == "__main__":
    main()

Key Resources: