Agent Configuration¶

Online evaluations can target an agent instead of a model. An agent is an HTTP endpoint that accepts a request and returns a response, optionally with a trajectory of intermediate steps. Use agents when you want to evaluate an agentic system end to end rather than a standalone LLM endpoint.

Pass an Agent as the target argument (target=...). The metric, prompt template, target, dataset rows, and runtime parameters are all supplied in the same Evaluator plugin SDK call.

Agent Formats¶

Two agent formats are supported:

Format	Value	Description
Generic	`generic`	Configurable HTTP POST with a Jinja-templated request body and JSONPath extraction for response and trajectory.
NeMo Agent Toolkit	`nemo_agent_toolkit`	Fixed protocol for NeMo Agent Toolkit endpoints.

Initialize the SDK¶

import os

from nemo_evaluator.sdk import Evaluator
from nemo_platform import NeMoPlatform


client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: Evaluator = client.evaluator  # this object is an Evaluator resource

Managing Secrets for Agent Endpoints¶

If your agent endpoint requires authentication, configure api_key_secret on the Agent.

For local evaluator.run(...) calls, api_key_secret must name an environment variable available to the local Python process. For remote evaluator.submit(...) jobs, it must name a NeMo platform secret in the target workspace. See Model API Authentication for the local-versus-remote behavior.
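For local runs, a quick pre-flight check can catch a missing variable before any requests are sent. The sketch below is illustrative only: `require_env_secret` is a hypothetical helper, not part of the SDK, and it relies only on the rule stated above, namely that api_key_secret names an environment variable in the local Python process.

```python
import os


def require_env_secret(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset or empty.

    Hypothetical helper for local runs; not part of the NeMo SDK.
    """
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "local runs read api_key_secret from the process environment."
        )
    return value
```

Calling `require_env_secret("MY_AGENT_API_KEY")` before evaluator.run(...) fails fast with a clear message instead of surfacing as an authentication error from the agent endpoint.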

For remote evaluator.submit(...) jobs, create the secret in the platform workspace before submitting the job:

client.secrets.create(
    name="my-agent-api-key",
    value=os.environ["MY_AGENT_API_KEY"],
)

For remote jobs, api_key_secret may be a workspace-local name such as "my-agent-api-key" or a full reference such as "my-workspace/my-agent-api-key".

Generic Agent¶

A generic agent is any HTTP endpoint that:

Accepts a POST request with Content-Type: application/json.
Returns a JSON response containing the answer and, optionally, a trajectory.

You control the request shape with body and extract values from the response with JSONPath expressions.

Generic Agent Fields¶

Field	Required	Type	Description
`url`	Yes	string	Base URL of the agent endpoint.
`name`	Yes	string	Agent name or identifier.
`format`	No	string	`generic` (default) or `nemo_agent_toolkit`.
`api_key_secret`	No	string	API key reference. See Model API Authentication.
`body`	Yes	dict	Jinja template for the request payload. Use `{{ prompt }}`, `{{ messages }}`, or fields from the rendered prompt context.
`response_path`	Yes	string	JSONPath expression to extract the response text.
`trajectory_path`	No	string	JSONPath expression to extract the trajectory.
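To make the roles of body, response_path, and trajectory_path concrete, here is a minimal sketch of the idea using only the standard library. It is not the evaluator's implementation: `render_body` handles only bare `{{ name }}` placeholders rather than full Jinja, and `extract_path` handles only simple dotted paths such as `$.answer` rather than full JSONPath. Both helper names are hypothetical.

```python
import re
from typing import Any

# Matches simple placeholders such as "{{ prompt }}" (no filters, no dotted names).
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_body(body: Any, context: dict[str, Any]) -> Any:
    """Recursively substitute `{{ name }}` placeholders in string values."""
    if isinstance(body, str):
        return _PLACEHOLDER.sub(lambda m: str(context[m.group(1)]), body)
    if isinstance(body, dict):
        return {key: render_body(value, context) for key, value in body.items()}
    if isinstance(body, list):
        return [render_body(value, context) for value in body]
    return body


def extract_path(payload: Any, path: str) -> Any:
    """Follow a simple `$.a.b` path through nested dicts; return None if a key is missing."""
    if not path.startswith("$"):
        raise ValueError(f"path must start with '$': {path!r}")
    current = payload
    for key in filter(None, path[1:].split(".")):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Request side: fill the body template from the rendered prompt.
request_body = render_body(
    {"question": "{{ prompt }}"},
    {"prompt": "Question: What is the capital of France?\nAnswer:"},
)

# Response side: pull the answer and trajectory out of the agent's JSON reply.
reply = {"answer": "Paris", "reasoning_steps": [{"step": "search"}]}
answer = extract_path(reply, "$.answer")
trajectory = extract_path(reply, "$.reasoning_steps")
```

Here `request_body` is the JSON payload that would be posted to the agent, while `answer` and `trajectory` are the values that would be handed to the metric.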

Run a Generic Agent Evaluation¶

from nemo_evaluator_sdk import Agent, ExactMatchMetric, RunConfigOnline

metric = ExactMatchMetric(reference="{{item.expected_answer}}")
agent = Agent(
    url="https://my-agent.example.com/invoke",
    name="qa-agent",
    format="generic",
    api_key_secret="MY_AGENT_API_KEY",
    body={"question": "{{ prompt }}"},
    response_path="$.answer",
    trajectory_path="$.reasoning_steps",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {"question": "What is the capital of France?", "expected_answer": "Paris"},
    ],
    config=RunConfigOnline(parallelism=4, request_timeout=60, max_retries=2),
    target=agent,
    prompt_template="Question: {{item.question}}\nAnswer:",
)
for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

To run the same evaluation as a durable remote job, call evaluator.submit(...) with the same arguments, and set api_key_secret to the name of a platform secret in the target workspace.

Example Generic Agent Endpoint¶

Your agent endpoint might look like this:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class AgentRequest(BaseModel):
    question: str


class AgentResponse(BaseModel):
    answer: str
    reasoning_steps: list[dict]


@app.post("/invoke")
async def invoke(request: AgentRequest) -> AgentResponse:
    return AgentResponse(
        answer="Paris",
        reasoning_steps=[
            {"step": "search", "result": "Found relevant documents"},
            {"step": "synthesize", "result": "Generated answer from context"},
        ],
    )

NeMo Agent Toolkit Agent¶

Use the nemo_agent_toolkit format when evaluating agents built with the NeMo Agent Toolkit (NAT). This format uses the NAT streaming protocol, in which the evaluator:

Sends a POST to {url}/generate/full?filter_steps=none with {"input_message": "<text>"}.
Reads the SSE (Server-Sent Events) stream.
Extracts the final value from the last SSE data: chunk.
Returns it as the agent response.
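The steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the evaluator's client: `nat_generate_url`, `nat_request_body`, and `last_sse_data` are hypothetical helper names, and the sample stream is invented. The structure of the payload inside each data: chunk is not specified here, so the sketch returns the last chunk's payload as-is. It also treats every data: line as its own chunk, which ignores multi-line SSE events.

```python
import json
from typing import Any, Iterable, Optional


def nat_generate_url(base_url: str) -> str:
    """Build the streaming endpoint URL from the agent's base URL."""
    return base_url.rstrip("/") + "/generate/full?filter_steps=none"


def nat_request_body(text: str) -> str:
    """Serialize the request payload: {"input_message": "<text>"}."""
    return json.dumps({"input_message": text})


def last_sse_data(lines: Iterable[str]) -> Optional[Any]:
    """Return the payload of the last `data:` line in an SSE stream.

    The payload is JSON-decoded when possible and returned as raw text otherwise.
    Returns None if the stream contains no data lines.
    """
    last: Optional[str] = None
    for line in lines:
        if line.startswith("data:"):
            last = line[len("data:"):].strip()
    if last is None:
        return None
    try:
        return json.loads(last)
    except json.JSONDecodeError:
        return last


# Invented sample stream; real chunk contents depend on the NAT server.
sample_stream = [
    'data: {"value": "thinking"}',
    "",
    'data: {"value": "Paris"}',
    "",
]
final_chunk = last_sse_data(sample_stream)
```

Only the last chunk is kept because, per the protocol description, earlier chunks carry intermediate output and the final one carries the agent's response.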

NeMo Agent Toolkit Fields¶

Field	Required	Type	Description
`url`	Yes	string	Base URL of the agent endpoint.
`name`	Yes	string	Agent name or identifier.
`format`	Yes	string	Set to `nemo_agent_toolkit`.
`api_key_secret`	No	string	API key reference. See Model API Authentication.

Run a NAT Agent Evaluation¶

from nemo_evaluator_sdk import Agent, ExactMatchMetric, RunConfigOnline

metric = ExactMatchMetric(reference="{{item.expected_answer}}")
agent = Agent(
    url="https://my-nat-agent.example.com",
    name="nat-research-agent",
    format="nemo_agent_toolkit",
    api_key_secret="my-nat-api-key",
)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {"question": "What is the capital of France?", "expected_answer": "Paris"},
    ],
    config=RunConfigOnline(parallelism=4),
    target=agent,
    prompt_template={
        "messages": [
            {"role": "user", "content": "{{item.question}}"},
        ],
    },
)
job.wait_until_done()
result = job.get_result()

Model vs Agent: When to Use Which¶

Use Case	Use `Model`	Use `Agent`
Evaluate a standalone LLM endpoint	x
Evaluate an agentic system with tool use and multi-step reasoning		x
Evaluate a NeMo Agent Toolkit workflow		x
Evaluate a custom HTTP endpoint with non-standard response format		x
Use a standard chat completions API	x

Info

Online evaluations accept either a model or an agent as the request target, never both.

Model Configuration - Inline model targets for LLM endpoints.
Agentic Evaluation Metrics - Metrics for evaluating agent tool calling, goal accuracy, and trajectory.
LLM-as-a-Judge - Custom judge-based evaluation with flexible scoring criteria.
Bring Your Own Metric - Integrate custom evaluation endpoints.