Agent Configuration
Online evaluations can target an agent instead of a model. An agent is an HTTP endpoint that accepts a request and returns a response, optionally with a trajectory of intermediate steps. Use agents when you want to evaluate an agentic system end to end rather than a standalone LLM endpoint.
Pass an Agent as the target=... argument. The metric, prompt template, target, dataset rows, and runtime parameters all go through the same Evaluator plugin SDK call.
Agent Formats
Two agent formats are supported:
| Format | Value | Description |
|---|---|---|
| Generic | `generic` | Configurable HTTP POST with a Jinja-templated request body and JSONPath extraction for response and trajectory. |
| NeMo Agent Toolkit | `nemo_agent_toolkit` | Fixed protocol for NeMo Agent Toolkit endpoints. |
Initialize the SDK
```python
import os

from nemo_evaluator.sdk import Evaluator
from nemo_platform import NeMoPlatform

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: Evaluator = client.evaluator  # this object is an Evaluator resource
```
Managing Secrets for Agent Endpoints
If your agent endpoint requires authentication, configure api_key_secret on the Agent.
For local evaluator.run(...) calls, api_key_secret must name an environment variable available to the local Python process. For remote evaluator.submit(...) jobs, it must name a NeMo platform secret in the target workspace. See Model API Authentication for the local-versus-remote behavior.
For remote evaluator.submit(...) jobs, create the secret in the platform workspace before submitting the job. The secret name may be a workspace-local name such as "my-agent-api-key" or a full reference such as "my-workspace/my-agent-api-key".
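As a rough illustration of the two naming rules above, the sketch below splits a secret reference into its workspace and name, and checks that a local environment variable is set before a local run. Both helpers (`split_secret_ref`, `require_local_secret`) are hypothetical and not part of the SDK; falling back to a caller-supplied default workspace for a bare name is an assumption.

```python
import os


def split_secret_ref(ref: str, default_workspace: str = "default") -> tuple[str, str]:
    """Split "workspace/name" into its parts.

    A bare name is paired with default_workspace (an assumption made for
    this sketch, not documented SDK behavior).
    """
    workspace, sep, name = ref.partition("/")
    if not sep:
        return default_workspace, ref
    return workspace, name


def require_local_secret(env_var: str) -> str:
    """Return the value of env_var, or fail early with a readable error."""
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"Environment variable {env_var!r} is not set")
    return value


# Remote jobs: both reference styles resolve to a (workspace, name) pair.
print(split_secret_ref("my-agent-api-key"))
print(split_secret_ref("my-workspace/my-agent-api-key"))
```

Calling `require_local_secret("MY_AGENT_API_KEY")` before a local run surfaces a missing variable immediately rather than as an authentication failure partway through the evaluation.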
Generic Agent
A generic agent is any HTTP endpoint that:
- Accepts a `POST` request with `Content-Type: application/json`.
- Returns a JSON response containing the answer and, optionally, a trajectory.
You control the request shape with body and extract values from the response with JSONPath expressions.
Generic Agent Fields
| Field | Required | Type | Description |
|---|---|---|---|
| `url` | Yes | string | Base URL of the agent endpoint. |
| `name` | Yes | string | Agent name or identifier. |
| `format` | No | string | `generic` (default) or `nemo_agent_toolkit`. |
| `api_key_secret` | No | string | API key reference. See Model API Authentication. |
| `body` | Yes | dict | Jinja template for the request payload. Use `{{ prompt }}`, `{{ messages }}`, or fields from the rendered prompt context. |
| `response_path` | Yes | string | JSONPath expression to extract the response text. |
| `trajectory_path` | No | string | JSONPath expression to extract the trajectory. |
Run a Generic Agent Evaluation
```python
from nemo_evaluator_sdk import Agent, RunConfigOnline
from nemo_evaluator_sdk import ExactMatchMetric

metric = ExactMatchMetric(reference="{{item.expected_answer}}")

agent = Agent(
    url="https://my-agent.example.com/invoke",
    name="qa-agent",
    format="generic",
    # Local run: this names an environment variable in the local process.
    api_key_secret="MY_AGENT_API_KEY",
    body={"question": "{{ prompt }}"},
    response_path="$.answer",
    trajectory_path="$.reasoning_steps",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {"question": "What is the capital of France?", "expected_answer": "Paris"},
    ],
    config=RunConfigOnline(parallelism=4, request_timeout=60, max_retries=2),
    target=agent,
    prompt_template="Question: {{item.question}}\nAnswer:",
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
```
Use evaluator.submit(...) with the same argument shape when you want a durable remote job, but set api_key_secret to a platform secret name for the target workspace.
Example Generic Agent Endpoint
Your agent endpoint might look like this:
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class AgentRequest(BaseModel):
    question: str


class AgentResponse(BaseModel):
    answer: str
    reasoning_steps: list[dict]


@app.post("/invoke")
async def invoke(request: AgentRequest) -> AgentResponse:
    # Static response for illustration; a real agent would act on request.question.
    return AgentResponse(
        answer="Paris",
        reasoning_steps=[
            {"step": "search", "result": "Found relevant documents"},
            {"step": "synthesize", "result": "Generated answer from context"},
        ],
    )
```
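To make the mapping between the Agent fields and an endpoint like this one concrete, the following sketch renders a `body` template and applies `response_path` and `trajectory_path` to a sample response. It is a simplified stand-in built on the standard library, not the SDK's implementation: `render_body` and `extract` are hypothetical helpers, the placeholder regex covers only plain `{{ name }}` substitutions rather than full Jinja, and the path resolver handles only simple `$.a.b[0]` expressions rather than full JSONPath.

```python
import re
from typing import Any

# Matches simple {{ name }} placeholders only (not full Jinja syntax).
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_body(template: Any, context: dict[str, Any]) -> Any:
    """Recursively substitute {{ name }} placeholders in a nested template."""
    if isinstance(template, str):
        return _PLACEHOLDER.sub(lambda m: str(context[m.group(1)]), template)
    if isinstance(template, dict):
        return {key: render_body(value, context) for key, value in template.items()}
    if isinstance(template, list):
        return [render_body(value, context) for value in template]
    return template


def extract(path: str, payload: Any) -> Any:
    """Resolve a simple path such as $.a.b[0].c; return None if it is absent."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = payload
    for key, index in re.findall(r"\.(\w+)|\[(\d+)\]", path[1:]):
        try:
            current = current[key] if key else current[int(index)]
        except (KeyError, IndexError, TypeError):
            return None
    return current


# The rendered prompt becomes the request payload sent to the endpoint.
prompt = "Question: What is the capital of France?\nAnswer:"
body = render_body({"question": "{{ prompt }}"}, {"prompt": prompt})

# Sample response in the shape returned by the example endpoint.
response = {
    "answer": "Paris",
    "reasoning_steps": [{"step": "search", "result": "Found relevant documents"}],
}
answer = extract("$.answer", response)
trajectory = extract("$.reasoning_steps", response)
print(body, answer, trajectory)
```

Returning `None` for a missing path is a choice made for this sketch; it keeps an absent optional trajectory from raising, which matches `trajectory_path` being optional.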
NeMo Agent Toolkit Agent
Use the nemo_agent_toolkit format when evaluating agents built with the NeMo Agent Toolkit. This format uses the NAT streaming protocol:
- Sends a POST to `{url}/generate/full?filter_steps=none` with `{"input_message": "<text>"}`.
- Reads the SSE (Server-Sent Events) stream.
- Extracts the final `value` from the last SSE `data:` chunk.
- Returns it as the agent response.
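The steps above can be sketched with the standard library as follows. This is an illustration under stated assumptions, not the SDK's client: `nat_generate_url` and `last_sse_value` are hypothetical helpers, each `data:` chunk is assumed to be a single-line JSON object with a `value` key, and the sample stream is made up for the example.

```python
import json
from typing import Any, Iterable


def nat_generate_url(base_url: str) -> str:
    """Build the streaming endpoint URL from the agent's base URL."""
    return f"{base_url.rstrip('/')}/generate/full?filter_steps=none"


def last_sse_value(lines: Iterable[str]) -> Any:
    """Return the "value" field of the last non-empty data: chunk, or None.

    Assumes one JSON object per data: line; multi-line SSE events are not
    handled in this sketch.
    """
    last_data = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload:
                last_data = payload
    if last_data is None:
        return None
    return json.loads(last_data).get("value")


# Request payload as described in the first step.
request_body = {"input_message": "What is the capital of France?"}

# Hypothetical stream: only the last data: chunk supplies the response.
sample_stream = [
    'data: {"value": "Searching..."}',
    "",
    'data: {"value": "Paris"}',
    "",
]
print(nat_generate_url("https://my-nat-agent.example.com"))
print(last_sse_value(sample_stream))
```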
NeMo Agent Toolkit Fields
| Field | Required | Type | Description |
|---|---|---|---|
| `url` | Yes | string | Base URL of the agent endpoint. |
| `name` | Yes | string | Agent name or identifier. |
| `format` | Yes | string | Set to `nemo_agent_toolkit`. |
| `api_key_secret` | No | string | API key reference. See Model API Authentication. |
Run a NAT Agent Evaluation
```python
from nemo_evaluator_sdk import Agent, RunConfigOnline
from nemo_evaluator_sdk import ExactMatchMetric

metric = ExactMatchMetric(reference="{{item.expected_answer}}")

agent = Agent(
    url="https://my-nat-agent.example.com",
    name="nat-research-agent",
    format="nemo_agent_toolkit",
    # Remote job: this names a platform secret in the target workspace.
    api_key_secret="my-nat-api-key",
)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {"question": "What is the capital of France?", "expected_answer": "Paris"},
    ],
    config=RunConfigOnline(parallelism=4),
    target=agent,
    prompt_template={
        "messages": [
            {"role": "user", "content": "{{item.question}}"},
        ],
    },
)

job.wait_until_done()
result = job.get_result()
```
Model vs Agent: When to Use Which
| Use Case | Use Model | Use Agent |
|---|---|---|
| Evaluate a standalone LLM endpoint | x | |
| Evaluate an agentic system with tool use and multi-step reasoning | | x |
| Evaluate a NeMo Agent Toolkit workflow | | x |
| Evaluate a custom HTTP endpoint with non-standard response format | | x |
| Use a standard chat completions API | x | |
Info
Online evaluations accept either a model or an agent as the request target, never both.
Related
- Model Configuration - Inline model targets for LLM endpoints.
- Agentic Evaluation Metrics - Metrics for evaluating agent tool calling, goal accuracy, and trajectory.
- LLM-as-a-Judge - Custom judge-based evaluation with flexible scoring criteria.
- Bring Your Own Metric - Integrate custom evaluation endpoints.