Nemotron Super Search Agent
Dev Note
For a deep dive into the pipeline design, production yield analysis, correctness challenges, and key takeaways, see Search Agent SFT Data: Teaching LLMs to Browse the Web.
Seed Dataset
This recipe includes built-in demo seeds (3 Wikidata knowledge graph paths) for quick testing. For production use, generate your own seed dataset from Wikidata random walks -- the dev note above describes the seed generation process (SPARQL queries, anti-meta filters, hop range 4-8). Each seed row needs: seed_entity, final_answer_entity, readable_path, num_hops_in_graph, and ground_truth. Pass your seed file via --seed-path.
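A seed row can be sketched as a single JSONL line. The field names below come from the required schema above; the entity values are taken from the recipe's built-in demo seeds:

```python
import json

# Illustrative seed row -- field names match the required schema; the
# entities are borrowed from the built-in "Python" demo seed.
seed_row = {
    "seed_entity": "Python",
    "final_answer_entity": "Centrum Wiskunde & Informatica",
    "readable_path": (
        "START ENTITY: Python (Q28865)\n"
        "  \u2b07 [developer (P178)]\n"
        "  NODE: Guido van Rossum (Q19845)\n"
        "  \u2b07 [employer (P108)]\n"
        "  NODE: Centrum Wiskunde & Informatica (Q1060645)"
    ),
    "num_hops_in_graph": 2,
    "ground_truth": "Centrum Wiskunde & Informatica",
}

# One JSON object per line -- the same JSONL format the recipe's demo
# seed writer produces.
line = json.dumps(seed_row, ensure_ascii=False)
print(line)
```

A parquet file with the same five columns works identically; `--seed-path` accepts either format.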
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# /// script
# requires-python = ">=3.10"
# dependencies = [
# "data-designer",
# ]
# ///
"""Nemotron Super Search Agent Recipe: Trajectories with Tavily Web Search
Generate multi-turn search agent trajectories where an LLM iteratively
searches the web, reads results, reasons about evidence, and synthesizes
answers -- the kind of data needed to train BrowseComp-style search agents.
This recipe implements the pipeline used to produce ~7,000 high-quality
tool-use trajectories for Nemotron Super post-training, starting from
50,000 Wikidata knowledge graph seeds.
Pipeline architecture:
┌─────────────────────────────────────────────────────────────────────────┐
│ STAGE 1: SEED DATA (Wikidata KG Walks) │
│ │
│ Random walks on the Wikidata knowledge graph produce multi-hop paths. │
│ Each seed has: seed_entity, final_answer_entity, readable_path, │
│ num_hops_in_graph, ground_truth. │
│ Built-in demo seeds included; bring your own for production. │
├─────────────────────────────────────────────────────────────────────────┤
│ STAGE 2: SEARCH RIDDLE GENERATION (LLM) │
│ │
│ user_query_draft ────────► user_query_obfuscated │
│ (chain clues from path, (BrowseComp-style rewrite: │
│ hide intermediate nodes, concise, natural, no breadcrumbs, │
│ don't name the answer) 1-2 sentences max) │
├─────────────────────────────────────────────────────────────────────────┤
│ STAGE 3: SEARCH TRAJECTORY ROLLOUTS (LLM + MCP) │
│ │
│ Thought-Action-Observation loop with live Tavily web search. │
│ ├─ tavily_search tool via hosted MCP endpoint │
│ ├─ Maximum 25 tool call turns; 300s timeout │
│ ├─ Full trace captured via with_trace=ALL_MESSAGES │
│ └─ Structured JSON output: final_answer, supporting_urls, │
│ short_justification │
├─────────────────────────────────────────────────────────────────────────┤
│ STAGE 4: STRUCTURED FORMATTING (LLM) │
│ │
│ Normalize raw agent output into clean JSON via LLMStructuredColumn. │
│ Handles markdown fences, trailing text, single-quoted dicts. │
│ │
│ The agent_solution_raw__trace column IS the SFT training data: │
│ complete ChatML conversation with every tool call and response. │
└─────────────────────────────────────────────────────────────────────────┘
Prerequisites:
- TAVILY_API_KEY environment variable (get a free key at https://tavily.com)
- OPENAI_API_KEY environment variable for OpenAI provider model aliases.
- NVIDIA_API_KEY environment variable for NVIDIA provider model aliases (default model alias is "nvidia-text").
Run:
# Basic usage with built-in demo seeds (generates 2 trajectories)
uv run search_agent.py
# Use a custom seed parquet
uv run search_agent.py --seed-path /path/to/seeds.parquet --num-records 10
# For help message and available options
uv run search_agent.py --help
"""
from __future__ import annotations
import json
import os
import tempfile
from pathlib import Path
from pydantic import BaseModel, Field
import data_designer.config as dd
from data_designer.interface import DataDesigner
# =============================================================================
# Structured Output Schema
# =============================================================================
class AgentSolution(BaseModel):
"""Structured output for the search agent's final answer."""
final_answer: str = Field(..., min_length=1, description="The final answer entity.")
supporting_urls: list[str] = Field(
default_factory=list, description="Authoritative URLs used to verify the answer."
)
short_justification: str = Field(..., min_length=1, description="Brief explanation of reasoning (1-2 sentences).")
# =============================================================================
# Prompt Templates
# =============================================================================
QUERY_DRAFT_PROMPT = """\
You are an expert Search Evaluator designing Grandmaster-Level search tests.
Create a complex, multi-step search riddle based on this knowledge path:
{{ readable_path }}
Start Entity: {{ seed_entity }}
Final Answer Entity: {{ final_answer_entity }}
CRITICAL RULES:
1. DO NOT name the intermediate nodes. Hide them behind descriptions.
2. DO NOT name the Final Answer.
3. Chain the clues logically -- describe each step relative to the previous one.
4. Audit the logic: if a step is weak or nonsensical, IGNORE IT.
5. Salvage and simplify: use only the strongest, most logical hops.
6. No hallucinations: do not invent relationships not in the path.
7. Aim for 4-8 meaningful hops.
VALIDATION - Output "INVALID_PATH" if:
- Final answer is generic/abstract (e.g. "technology", "people", "field")
- Path has weak/illogical relationships
- No coherent question can be formed
Return ONLY the question string (or "INVALID_PATH").\
"""
OBFUSCATE_PROMPT = """\
Rewrite this search riddle to better match BrowseComp-style tasks.
Original Riddle: {{ user_query_draft }}
Secret Path (do not leak entities): {{ readable_path }}
Start Entity: {{ seed_entity }}
Final Answer (do not leak): {{ final_answer_entity }}
HARD REQUIREMENTS:
1. NEVER reveal the step-by-step plan. No breadcrumb chains.
Avoid: "X is member of Y; Y is based in Z; Z is the capital of..."
Avoid meta language: "then search...", "next find...", "follow the chain..."
2. NEVER mention the final answer or any intermediate entity by name.
3. Keep it concise and natural: 1-2 sentences max (3 for very complex paths).
4. Use descriptive clues that require reasoning.
5. Include at least one disambiguating filter (date, count, or specific attribute).
6. If original == "INVALID_PATH", output exactly "INVALID_PATH".
Return ONLY the rewritten question string (or "INVALID_PATH").\
"""
AGENT_SYSTEM_PROMPT = """\
You are an expert search agent that uses web search to answer questions accurately.
You MUST output ONLY valid JSON matching this exact schema:
{
"final_answer": "string - the specific answer entity",
"supporting_urls": ["url1", "url2"],
"short_justification": "string - brief 1-2 sentence explanation"
}
AVAILABLE TOOLS:
You have access to ONE tool called "tavily_search" with parameter: query (string, required).
TOOL USAGE RULES:
1. Exact Tool Name: Always use "tavily_search" (no suffixes or prefixes).
2. Exact Args: Only send {"query": "..."} for the tool call.
3. Maximum 25 tool calls. Budget your searches wisely.
4. Search Strategy:
- Start with broad queries to understand the domain
- Refine to specific entities/relationships
- Cross-verify facts across multiple sources
- Use different query formulations for the same information
5. No Reasoning Tags: Do NOT use <think> tags or XML formatting.
6. No Intermediate Text: Do NOT output explanatory text between tool calls.
7. Final Output: After completing your searches, output ONLY the JSON object.
EXECUTION FLOW:
1. Read the user's question
2. Make tool calls using "tavily_search" to gather information
3. Verify information across multiple sources
4. Once confident, output the JSON result (no additional text)\
"""
FORMATTER_PROMPT = """\
You are a JSON normalizer.
You will be given a messy model output that should contain a JSON object with:
- final_answer (string)
- supporting_urls (list of strings)
- short_justification (string)
Rules:
- Return ONLY a JSON object. No markdown. No extra text.
- If the input contains code fences, tool chatter, or extra prose, ignore it.
- If the input contains invalid JSON, repair it.
- supporting_urls must be a list of valid http(s) URLs (dedupe, keep best 1-5).
Input:
{{ agent_solution_raw }}\
"""
# =============================================================================
# Data Designer Configuration
# =============================================================================
def build_config(model_alias: str) -> tuple[dd.DataDesignerConfigBuilder, dd.MCPProvider]:
"""Build the Data Designer configuration for search agent trajectory generation.
Returns:
A tuple of (config_builder, mcp_provider).
"""
tavily_api_key = os.environ.get("TAVILY_API_KEY", "")
mcp_provider = dd.MCPProvider(
name="tavily",
endpoint=f"https://mcp.tavily.com/mcp/?tavilyApiKey={tavily_api_key}",
provider_type="streamable_http",
)
tool_config = dd.ToolConfig(
tool_alias="tavily-search",
providers=["tavily"],
allow_tools=["tavily_search"],
max_tool_call_turns=25,
timeout_sec=300.0,
)
config_builder = dd.DataDesignerConfigBuilder(tool_configs=[tool_config])
# Stage 2: Draft question from knowledge path
config_builder.add_column(
dd.LLMTextColumnConfig(
name="user_query_draft",
model_alias=model_alias,
prompt=QUERY_DRAFT_PROMPT,
)
)
# Stage 2: BrowseComp-style obfuscation
config_builder.add_column(
dd.LLMTextColumnConfig(
name="user_query_obfuscated",
model_alias=model_alias,
prompt=OBFUSCATE_PROMPT,
)
)
# Stage 3: Agent trajectory with MCP tool calling
config_builder.add_column(
dd.LLMTextColumnConfig(
name="agent_solution_raw",
model_alias=model_alias,
system_prompt=AGENT_SYSTEM_PROMPT,
prompt="Problem: {{ user_query_obfuscated }}",
tool_alias="tavily-search",
with_trace=dd.TraceType.ALL_MESSAGES,
)
)
# Stage 4: Structured JSON formatting
config_builder.add_column(
dd.LLMStructuredColumnConfig(
name="agent_solution",
model_alias=model_alias,
prompt=FORMATTER_PROMPT,
output_format=AgentSolution,
)
)
return config_builder, mcp_provider
# =============================================================================
# Demo Seed Data
# =============================================================================
DEMO_SEEDS = [
{
"seed_entity": "NVIDIA",
"final_answer_entity": "Thomas Hart Benton",
"readable_path": (
"START ENTITY: NVIDIA (Q182477)\n"
" \u2b07 [chief executive officer (P169)]\n"
" NODE: Jensen Huang (Q332838)\n"
" \u2b07 [educated at (P69)]\n"
" NODE: Oregon State University (Q861888)\n"
" \u2b07 [located in the administrative territorial entity (P131)]\n"
" NODE: Benton County (Q115372)\n"
" \u2b07 [named after (P138)]\n"
" NODE: Thomas Hart Benton (Q178712)"
),
"num_hops_in_graph": 4,
"ground_truth": "Thomas Hart Benton",
},
{
"seed_entity": "Python",
"final_answer_entity": "Centrum Wiskunde & Informatica",
"readable_path": (
"START ENTITY: Python (Q28865)\n"
" \u2b07 [developer (P178)]\n"
" NODE: Guido van Rossum (Q19845)\n"
" \u2b07 [employer (P108)]\n"
" NODE: Centrum Wiskunde & Informatica (Q1060645)"
),
"num_hops_in_graph": 2,
"ground_truth": "Centrum Wiskunde & Informatica",
},
{
"seed_entity": "toothache",
"final_answer_entity": "ibuprofen",
"readable_path": (
"START ENTITY: toothache (Q143925)\n"
" \u2b07 [risk factor (P564)]\n"
" NODE: smoking (Q662860)\n"
" \u2b07 [has effect (P1542)]\n"
" NODE: Crohn's disease (Q1472)\n"
" \u2b07 [drug or therapy used for treatment (P2176)]\n"
" NODE: TNF inhibitor (Q1536078)\n"
" \u2b07 [is possible treatment of (P2175)]\n"
" NODE: Beh\u00e7et's disease (Q911427)\n"
" \u2b07 [symptoms and signs (P780)]\n"
" NODE: inflammation (Q101991)\n"
" \u2b07 [drug or therapy used for treatment (P2176)]\n"
" NODE: flurbiprofen (Q419890)\n"
" \u2b07 [significant drug interaction (P769)]\n"
" NODE: parecoxib (Q347941)\n"
" \u2b07 [significant drug interaction (P769)]\n"
" NODE: ibuprofen (Q186969)"
),
"num_hops_in_graph": 8,
"ground_truth": "ibuprofen",
},
]
def write_demo_seeds(output_dir: Path) -> Path:
"""Write demo seed data to a JSONL file."""
output_dir.mkdir(parents=True, exist_ok=True)
seed_path = output_dir / "demo_seeds.jsonl"
with open(seed_path, "w", encoding="utf-8") as f:
for seed in DEMO_SEEDS:
f.write(json.dumps(seed, ensure_ascii=False) + "\n")
return seed_path
# =============================================================================
# Main Entry Point
# =============================================================================
def parse_args():
"""Parse command line arguments."""
from argparse import ArgumentParser
parser = ArgumentParser(description="Generate search agent trajectories using Tavily web search via MCP.")
parser.add_argument("--model-alias", type=str, default="nvidia-text", help="Model alias to use for generation")
parser.add_argument("--num-records", type=int, default=2, help="Number of trajectories to generate")
parser.add_argument("--seed-path", type=str, default=None, help="Path to seed parquet or JSONL file")
parser.add_argument("--artifact-path", type=str, default=None, help="Path to save artifacts")
return parser.parse_args()
def main() -> None:
"""Main entry point for the demo."""
args = parse_args()
if os.environ.get("TAVILY_API_KEY") is None:
raise RuntimeError("TAVILY_API_KEY must be set. Get a free key at https://tavily.com")
    if os.environ.get("NVIDIA_API_KEY") is None and args.model_alias.startswith("nvidia"):
        raise RuntimeError("NVIDIA_API_KEY must be set when using NVIDIA model aliases.")
    # Mirrors the NVIDIA check above; assumes OpenAI provider aliases are prefixed "openai".
    if os.environ.get("OPENAI_API_KEY") is None and args.model_alias.startswith("openai"):
        raise RuntimeError("OPENAI_API_KEY must be set when using OpenAI model aliases.")
if args.seed_path:
seed_path = args.seed_path
else:
demo_dir = Path(tempfile.mkdtemp(prefix="search_agent_demo_"))
seed_path = str(write_demo_seeds(demo_dir))
print(f"Using demo seeds in: {demo_dir}")
config_builder, mcp_provider = build_config(model_alias=args.model_alias)
config_builder.with_seed_dataset(
dd.LocalFileSeedSource(path=seed_path),
sampling_strategy=dd.SamplingStrategy.SHUFFLE,
)
data_designer = DataDesigner(artifact_path=args.artifact_path, mcp_providers=[mcp_provider])
preview_results = data_designer.preview(config_builder, num_records=args.num_records)
print("\n" + "=" * 60)
print("GENERATED SEARCH AGENT TRAJECTORIES")
print("=" * 60)
preview_results.display_sample_record()
if __name__ == "__main__":
main()
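The docstring notes that the `agent_solution_raw__trace` column is the SFT training data. A common downstream step is filtering trajectories by tool-call count. The sketch below assumes the trace serializes the ChatML conversation as a JSON list of messages with an OpenAI-style `tool_calls` field -- an assumption about the column layout; inspect your artifacts before relying on it:

```python
import json

# Hypothetical trace value: a JSON-encoded list of ChatML messages.
# The real column serialization may differ -- this is a synthetic example.
trace = json.dumps([
    {"role": "system", "content": "You are an expert search agent."},
    {"role": "user", "content": "Problem: Which research institute employed him?"},
    {"role": "assistant", "tool_calls": [{"function": {
        "name": "tavily_search",
        "arguments": '{"query": "Guido van Rossum first employer"}'}}]},
    {"role": "tool", "content": "Tavily search results (snippets and URLs)"},
    {"role": "assistant", "content": '{"final_answer": "Centrum Wiskunde & Informatica"}'},
])

def count_tool_calls(trace_json: str) -> int:
    """Count assistant turns that invoked a tool in a ChatML-style trace."""
    messages = json.loads(trace_json)
    return sum(
        1 for m in messages
        if m.get("role") == "assistant" and m.get("tool_calls")
    )

print(count_tool_calls(trace))  # -> 1 for this synthetic trace
```

A filter like `0 < count_tool_calls(t) <= 25` is one way to drop degenerate trajectories (no searches, or runs that exhausted the turn budget) before training.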