Agent Configuration

Online evaluations can target an agent instead of a model. An agent is an HTTP endpoint that accepts a request and returns a response, optionally with a trajectory of intermediate steps. Use agents when you want to evaluate an agentic system end to end rather than a standalone LLM endpoint.

Pass an Agent as the target argument (target=...). The metric, prompt template, target, dataset rows, and runtime parameters are all passed through the Evaluator plugin SDK call.

Agent Formats

Two agent formats are supported:

| Format | Value | Description |
| --- | --- | --- |
| Generic | generic | Configurable HTTP POST with a Jinja-templated request body and JSONPath extraction for response and trajectory. |
| NeMo Agent Toolkit | nemo_agent_toolkit | Fixed protocol for NeMo Agent Toolkit endpoints. |

Initialize the SDK

import os

from nemo_evaluator.sdk import Evaluator
from nemo_platform import NeMoPlatform


client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: Evaluator = client.evaluator  # this object is an Evaluator resource

Managing Secrets for Agent Endpoints

If your agent endpoint requires authentication, configure api_key_secret on the Agent.

For local evaluator.run(...) calls, api_key_secret must name an environment variable available to the local Python process. For remote evaluator.submit(...) jobs, it must name a NeMo platform secret in the target workspace. See Model API Authentication for the local-versus-remote behavior.
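
For local runs, a quick preflight check can catch a missing variable before any request is sent. The following is a minimal sketch, assuming the variable is looked up by the exact name given in api_key_secret; require_env_secret is a hypothetical helper for illustration and not part of the SDK.

```python
import os


def require_env_secret(name: str) -> str:
    """Return the value of environment variable `name`; raise if unset or empty.

    Hypothetical helper for illustration only, not part of the NeMo SDK.
    """
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "export it before calling evaluator.run(...)."
        )
    return value
```

Calling require_env_secret("MY_AGENT_API_KEY") before a local run fails fast with a readable message instead of surfacing later as an authentication error from the agent endpoint.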

For remote evaluator.submit(...) jobs, create the secret in the platform workspace before submitting the job:

client.secrets.create(
    name="my-agent-api-key",
    value=os.environ["MY_AGENT_API_KEY"],
)

The secret name may be a workspace-local name such as "my-agent-api-key" or a full reference such as "my-workspace/my-agent-api-key" for remote jobs.
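
The two accepted name forms can be told apart by the presence of a slash. The sketch below is illustrative only: SecretRef and parse_secret_ref are hypothetical names, and the assumption that a reference contains exactly one workspace prefix is mine, not something the platform documents here.

```python
from typing import NamedTuple, Optional


class SecretRef(NamedTuple):
    workspace: Optional[str]  # None means "resolve in the job's target workspace"
    name: str


def parse_secret_ref(ref: str) -> SecretRef:
    """Split 'workspace/name' or a plain 'name' into its parts (illustrative only)."""
    workspace, sep, name = ref.rpartition("/")
    # Reject references such as "/name" or "workspace/".
    if not name or (sep and not workspace):
        raise ValueError(f"Malformed secret reference: {ref!r}")
    return SecretRef(workspace or None, name)
```

For example, parse_secret_ref("my-workspace/my-agent-api-key") yields a workspace of "my-workspace", while parse_secret_ref("my-agent-api-key") leaves the workspace as None.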

Generic Agent

A generic agent is any HTTP endpoint that:

  1. Accepts a POST request with Content-Type: application/json.
  2. Returns a JSON response containing the answer and, optionally, a trajectory.

You control the request shape with body and extract values from the response with JSONPath expressions.

Generic Agent Fields

| Field | Required | Type | Description |
| --- | --- | --- | --- |
| url | Yes | string | Base URL of the agent endpoint. |
| name | Yes | string | Agent name or identifier. |
| format | No | string | generic (default) or nemo_agent_toolkit. |
| api_key_secret | No | string | API key reference. See Model API Authentication. |
| body | Yes | dict | Jinja template for the request payload. Use {{ prompt }}, {{ messages }}, or fields from the rendered prompt context. |
| response_path | Yes | string | JSONPath expression to extract the response text. |
| trajectory_path | No | string | JSONPath expression to extract the trajectory. |
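
To make the roles of body, response_path, and trajectory_path concrete, the sketch below walks one dataset row through the assumed flow: render the prompt template, render the body, then extract values from a response. It is not the evaluator's implementation. The render function is a tiny stand-in for Jinja that handles only {{ dotted.name }} placeholders, and extract supports only dotted-key JSONPath such as $.answer; both names are hypothetical.

```python
import json
import re
from typing import Any

_PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template: Any, context: dict) -> Any:
    """Recursively substitute {{ dotted.name }} placeholders in strings, dicts, and lists.

    A minimal stand-in for Jinja: no filters, loops, or conditionals.
    """
    if isinstance(template, str):
        def lookup(match: re.Match) -> str:
            value: Any = context
            for key in match.group(1).split("."):
                value = value[key]
            return str(value)
        return _PLACEHOLDER.sub(lookup, template)
    if isinstance(template, dict):
        return {k: render(v, context) for k, v in template.items()}
    if isinstance(template, list):
        return [render(v, context) for v in template]
    return template


def extract(document: Any, path: str) -> Any:
    """Resolve a simple JSONPath of the form $.a.b (dotted keys only)."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    value = document
    for key in filter(None, path[1:].split(".")):
        value = value[key]
    return value


# One dataset row, rendered first into a prompt and then into the request body.
item = {"question": "What is the capital of France?", "expected_answer": "Paris"}
prompt = render("Question: {{item.question}}\nAnswer:", {"item": item})
request_body = render({"question": "{{ prompt }}"}, {"prompt": prompt, "item": item})

# A response shaped like the example endpoint's output.
response = json.loads('{"answer": "Paris", "reasoning_steps": [{"step": "search"}]}')
answer = extract(response, "$.answer")
trajectory = extract(response, "$.reasoning_steps")
```

A real deployment would rely on full Jinja and JSONPath semantics; the point here is only the order of operations and which field feeds which step.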

Run a Generic Agent Evaluation

from nemo_evaluator_sdk import Agent, ExactMatchMetric, RunConfigOnline

# Score each response by exact match against the row's expected answer.
metric = ExactMatchMetric(reference="{{item.expected_answer}}")

# Local run: api_key_secret names an environment variable in this process.
agent = Agent(
    url="https://my-agent.example.com/invoke",
    name="qa-agent",
    format="generic",
    api_key_secret="MY_AGENT_API_KEY",
    body={"question": "{{ prompt }}"},  # request payload sent to the agent
    response_path="$.answer",  # JSONPath to the answer text
    trajectory_path="$.reasoning_steps",  # JSONPath to the intermediate steps
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {"question": "What is the capital of France?", "expected_answer": "Paris"},
    ],
    config=RunConfigOnline(parallelism=4, request_timeout=60, max_retries=2),
    target=agent,
    prompt_template="Question: {{item.question}}\nAnswer:",
)

# Print the mean of each aggregate score.
for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Use evaluator.submit(...) with the same argument shape when you want a durable remote job, but set api_key_secret to a platform secret name for the target workspace.

Example Generic Agent Endpoint

Your agent endpoint might look like this:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class AgentRequest(BaseModel):
    question: str


class AgentResponse(BaseModel):
    answer: str
    reasoning_steps: list[dict]


@app.post("/invoke")
async def invoke(request: AgentRequest) -> AgentResponse:
    return AgentResponse(
        answer="Paris",
        reasoning_steps=[
            {"step": "search", "result": "Found relevant documents"},
            {"step": "synthesize", "result": "Generated answer from context"},
        ],
    )

NeMo Agent Toolkit Agent

Use the nemo_agent_toolkit format when evaluating agents built with the NeMo Agent Toolkit (NAT). This format uses the NAT streaming protocol. For each request, the evaluator:

  1. Sends a POST to {url}/generate/full?filter_steps=none with {"input_message": "<text>"}.
  2. Reads the SSE (Server-Sent Events) stream.
  3. Extracts the final value from the last SSE data: chunk.
  4. Returns it as the agent response.
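
The steps above can be sketched as follows. This is a simplified illustration, not the evaluator's client code: nat_generate_url and last_sse_value are hypothetical names, each data: line is treated as one complete chunk, and the assumption that the final chunk is a JSON object with a "value" key is mine and should be checked against your NAT endpoint.

```python
import json
from typing import Any, Iterable


def nat_generate_url(base_url: str) -> str:
    """Build the streaming endpoint URL from the agent's base URL."""
    return f"{base_url.rstrip('/')}/generate/full?filter_steps=none"


def last_sse_value(lines: Iterable[str]) -> Any:
    """Return the value carried by the last `data:` line of an SSE stream.

    Returns None if the stream contains no data lines.
    """
    last_payload = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("data:"):
            last_payload = line[len("data:"):].strip()
    if last_payload is None:
        return None
    try:
        chunk = json.loads(last_payload)
    except json.JSONDecodeError:
        return last_payload  # not JSON: fall back to the raw text
    # Assumption: the final chunk looks like {"value": ...}.
    if isinstance(chunk, dict) and "value" in chunk:
        return chunk["value"]
    return chunk
```

Given the lines 'data: {"value": "thinking"}' followed by 'data: {"value": "Paris"}', last_sse_value returns "Paris", which is what would be scored as the agent response.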

NeMo Agent Toolkit Fields

| Field | Required | Type | Description |
| --- | --- | --- | --- |
| url | Yes | string | Base URL of the agent endpoint. |
| name | Yes | string | Agent name or identifier. |
| format | Yes | string | Set to nemo_agent_toolkit. |
| api_key_secret | No | string | API key reference. See Model API Authentication. |

Run a NAT Agent Evaluation

from nemo_evaluator_sdk import Agent, ExactMatchMetric, RunConfigOnline

# Score each response by exact match against the row's expected answer.
metric = ExactMatchMetric(reference="{{item.expected_answer}}")

# Remote job: api_key_secret names a platform secret in the target workspace.
agent = Agent(
    url="https://my-nat-agent.example.com",
    name="nat-research-agent",
    format="nemo_agent_toolkit",
    api_key_secret="my-nat-api-key",
)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {"question": "What is the capital of France?", "expected_answer": "Paris"},
    ],
    config=RunConfigOnline(parallelism=4),
    target=agent,
    prompt_template={
        "messages": [
            {"role": "user", "content": "{{item.question}}"},
        ],
    },
)

# Block until the remote job finishes, then fetch its result.
job.wait_until_done()
result = job.get_result()

Model vs Agent: When to Use Which

| Use Case | Use Model | Use Agent |
| --- | --- | --- |
| Evaluate a standalone LLM endpoint | x | |
| Evaluate an agentic system with tool use and multi-step reasoning | | x |
| Evaluate a NeMo Agent Toolkit workflow | | x |
| Evaluate a custom HTTP endpoint with non-standard response format | | x |
| Use a standard chat completions API | x | |

Info: Online evaluations accept either a model or an agent as the request target, never both.