Agentic Evaluation Metrics¶
Evaluate agent-based and multi-step reasoning models using metrics powered by RAGAS. These metrics assess tool calling accuracy, goal completion, topic adherence, and answer correctness in agentic workflows.
Agent Workflow Evaluation Stages¶
Key stages of agent workflow evaluation:

1. Intermediate Steps Evaluation: Assesses the correctness of intermediate steps during agent execution:
   - Tool Use: Validates that the agent invoked the right tools with correct arguments at each step. Refer to Tool Call Accuracy for implementation details.
2. Final-Step Evaluation: Evaluates the quality of the agent's final output using:
   - Agent Goal Accuracy: Measures whether the agent successfully completed the requested task. Refer to Agent Goal Accuracy.
   - Topic Adherence: Assesses how well the agent maintained focus on the assigned topic throughout the conversation. Refer to Topic Adherence.
   - Answer Accuracy: Evaluates the factual correctness of agent answers. Refer to Answer Accuracy.
   - Custom Metrics: For domain-specific or custom evaluation criteria, use LLM-as-a-Judge with the data task type.
3. Trajectory Evaluation: Evaluates the agent's decision-making process by analyzing the entire sequence of actions taken to accomplish a goal. This includes assessing whether the agent chose appropriate tools in the correct order. Refer to Trajectory Evaluation for the expected data format and current plugin SDK support.
Online vs Offline Evaluation¶
Agentic metrics support two execution patterns through the Evaluator plugin SDK:
Offline Evaluation¶
Offline evaluation scores pre-generated responses or tool calls already present in the dataset:
- Dataset rows are passed inline with dataset=[...], as a file Path, or as a FilesetRef.
- No model or agent generation is performed.
- Use this mode to evaluate existing agent outputs or compare different response strategies.
Online Target Generation¶
Online target generation first calls a model or agent target, then evaluates the generated output:
- Configure the target with Model or Agent from nemo_evaluator_sdk.
- Pass the target through the target=... argument.
- Use RunConfigOnlineModel for model targets and RunConfigOnline for agent targets.
- Include a prompt_template when the dataset row must be transformed into a model or agent request.
Note
Response Usage: In online target generation, the generated response is used as the metric response. A dataset response column is optional and is superseded by the generated response for that run.
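The prompt_template step can be pictured as placeholder substitution over one dataset row. The sketch below is a simplified stand-in, not the SDK's renderer: `render_prompt_template` is a hypothetical helper, and it only handles `{{item.<field>}}` placeholders rather than full Jinja syntax.

```python
import re
from typing import Any

# Matches simple placeholders such as {{item.user_input}} or {{ item.user_input }}.
_PLACEHOLDER = re.compile(r"\{\{\s*item\.(\w+)\s*\}\}")


def render_prompt_template(template: dict[str, Any], item: dict[str, Any]) -> dict[str, Any]:
    """Fill {{item.<field>}} placeholders in each message with values from one dataset row."""

    def fill(text: str) -> str:
        return _PLACEHOLDER.sub(lambda match: str(item[match.group(1)]), text)

    return {
        "messages": [
            {"role": message["role"], "content": fill(message["content"])}
            for message in template["messages"]
        ]
    }


# One dataset row and a template shaped like the online examples in this page.
row = {"user_input": "Tell me about healthy eating", "reference_topics": ["health"]}
template = {"messages": [{"role": "user", "content": "{{item.user_input}}"}]}

request = render_prompt_template(template, row)
print(request)
```

The rendered request carries only the fields named in the template, so columns such as reference_topics stay available for scoring without being sent to the target.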
Overview¶
Agentic metrics evaluate different aspects of agent behavior:
| Metric | Use Case | Requires Judge | Plugin SDK Execution |
|---|---|---|---|
| Tool Call Accuracy | Evaluates tool/function call correctness | No | run + submit |
| Tool Calling (template) | Evaluates tool/function call correctness using Jinja templates | No | run + submit |
| Topic Adherence | Measures topic focus in multi-turn conversations | Yes | run + submit |
| Agent Goal Accuracy | Assesses goal completion, with or without reference | Yes | run + submit |
| Answer Accuracy | Checks factual correctness | Yes | run + submit |
| Trajectory Evaluation | Evaluates decision-making across action sequence | Yes | Not exposed as a typed plugin SDK metric |
Note
Use evaluator.run(...) for local in-process evaluation and evaluator.submit(...) for durable remote platform jobs. The examples below use inline dataset rows through dataset=[...], but you can use a file Path or a FilesetRef instead.
Prerequisites¶
Before running agentic evaluations:
- Workspace: Have a workspace created. All remote resources, including secrets and jobs, are scoped to a workspace.
- Judge LLM endpoint (for most metrics): Have access to an LLM that will serve as your judge.
- API key secret (if judge requires auth): If your judge endpoint requires authentication, create a secret to store the API key. For local run versus remote submit behavior, see Model API Authentication.
- Initialize the SDK:
import os
from nemo_evaluator.sdk import Evaluator
from nemo_platform import NeMoPlatform
client = NeMoPlatform(
base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
workspace="default",
)
evaluator: Evaluator = client.evaluator # this object is an Evaluator resource
Creating a Secret for API Keys¶
If using external endpoints, such as Build NVIDIA API endpoints (https://integrate.api.nvidia.com/v1), create a secret first:
client.secrets.create(
name="nvidia-api-key",
value="nvapi-YOUR_API_KEY_HERE",
description="NVIDIA Build API key for RAGAS metrics",
)
SDK Types Reference¶
The plugin SDK examples use context-agnostic metric classes and runtime value classes:
from nemo_evaluator_sdk import (
ToolCallingMetric,
)
from nemo_evaluator_sdk.metrics.ragas import (
AgentGoalAccuracyMetric,
AnswerAccuracyMetric,
ToolCallAccuracyMetric,
TopicAdherenceMetric,
)
from nemo_evaluator_sdk import (
Agent,
RunConfigOnlineModel,
RunConfigOnline,
RunConfig,
InferenceParams,
Model,
ReasoningParams,
)
Use dataset=[...] for inline rows. For offline scoring options, use config=RunConfig(parallelism=...). When outputs must be generated before scoring, pass target=Model(...) or target=Agent(...) together with the corresponding online run configuration. Use the same dataset, config, and target arguments for both evaluator.run(...) and evaluator.submit(...); durable jobs follow the same pattern as local runs and differ only in that you wait for the job and then fetch its results.
Agentic Metrics¶
Tool Call Accuracy¶
Evaluates whether the agent invoked the correct tools with the correct arguments. This metric does not require a judge LLM.
Note
Online/offline support: Tool Call Accuracy supports scoring existing tool calls directly. It can also score tool calls produced during online target generation when the target response includes the required tool-call fields.
Data Format¶
{
"user_input": [
{
"content": "What's the weather like in New York?",
"type": "human"
},
{
"content": "Let me check that for you.",
"type": "ai",
"tool_calls": [
{
"name": "weather_check",
"args": {
"location": "New York"
}
}
]
},
{
"content": "It's 75°F and partly cloudy.",
"type": "tool"
},
{
"content": "The weather in New York is 75°F and partly cloudy.",
"type": "ai"
}
],
"reference_tool_calls": [
{
"name": "weather_check",
"args": {
"location": "New York"
}
}
]
}
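As a rough illustration of what this row format carries, the sketch below collects the tool calls the agent made from the `ai` messages and checks them against `reference_tool_calls`. `extract_tool_calls` and `exact_match` are hypothetical helpers, and the strict in-order equality check is a simplification; it is not the scoring that the RAGAS-backed metric applies.

```python
from typing import Any


def extract_tool_calls(row: dict[str, Any]) -> list[dict[str, Any]]:
    """Collect tool calls from every 'ai' message, in conversation order."""
    calls: list[dict[str, Any]] = []
    for message in row["user_input"]:
        if message.get("type") == "ai":
            calls.extend(message.get("tool_calls", []))
    return calls


def exact_match(row: dict[str, Any]) -> bool:
    """True when the agent's calls equal the reference calls (same names, args, and order)."""
    return extract_tool_calls(row) == row["reference_tool_calls"]


row = {
    "user_input": [
        {"content": "What's the weather like in New York?", "type": "human"},
        {
            "content": "Let me check that for you.",
            "type": "ai",
            "tool_calls": [{"name": "weather_check", "args": {"location": "New York"}}],
        },
        {"content": "It's 75°F and partly cloudy.", "type": "tool"},
        {"content": "The weather in New York is 75°F and partly cloudy.", "type": "ai"},
    ],
    "reference_tool_calls": [{"name": "weather_check", "args": {"location": "New York"}}],
}

print(extract_tool_calls(row))
print(exact_match(row))
```

A check like this can be useful as a quick sanity pass on a dataset before running the metric itself.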
from nemo_evaluator_sdk.metrics.ragas import ToolCallAccuracyMetric
metric = ToolCallAccuracyMetric()
result = evaluator.run(
metric=metric,
dataset=[
{
"user_input": [
{"content": "What's the weather in Paris?", "type": "human"},
{
"content": "Let me check.",
"type": "ai",
"tool_calls": [{"name": "weather_api", "args": {"city": "Paris"}}],
},
{"content": "Sunny, 22°C", "type": "tool"},
{"content": "It's sunny and 22°C in Paris.", "type": "ai"},
],
"reference_tool_calls": [
{"name": "weather_api", "args": {"city": "Paris"}}
],
}
],
)
print(result.aggregate_scores)
from nemo_evaluator_sdk import RunConfig
from nemo_evaluator_sdk.metrics.ragas import ToolCallAccuracyMetric
metric = ToolCallAccuracyMetric()
job = evaluator.submit(
metric=metric,
dataset=[
{
"user_input": [
{"content": "What's the weather in Paris?", "type": "human"},
{
"content": "Let me check.",
"type": "ai",
"tool_calls": [{"name": "weather_api", "args": {"city": "Paris"}}],
},
{"content": "Sunny, 22°C", "type": "tool"},
{"content": "It's sunny and 22°C in Paris.", "type": "ai"},
],
"reference_tool_calls": [
{"name": "weather_api", "args": {"city": "Paris"}}
],
}
],
config=RunConfig(parallelism=4),
)
job.wait_until_done()
result = job.get_result()
print(result.aggregate_scores)
Tool Calling (Template)¶
A template-based metric for evaluating tool/function call accuracy. Unlike the RAGAS Tool Call Accuracy metric, this metric uses configurable templates and produces multiple scores.
Note
Online/offline support: Tool Calling supports scoring existing tool-call outputs directly. Use online target generation when you want the model or agent target to produce the response first.
Scores Produced¶
- function_name_accuracy - Accuracy of function names only
- function_name_and_args_accuracy - Accuracy of both function names and arguments
Data Format¶
Data must use OpenAI-compliant tool calling format:
{
"messages": [
{
"role": "user",
"content": "Book a table for 2 at 7pm."
},
{
"role": "assistant",
"content": "Booking a table...",
"tool_calls": [
{
"function": {
"name": "book_table",
"arguments": {
"people": 2,
"time": "7pm"
}
}
}
]
}
],
"tool_calls": [
{
"function": {
"name": "book_table",
"arguments": {
"people": 2,
"time": "7pm"
}
}
}
],
"response": {
"choices": [
{
"message": {
"tool_calls": [
{
"function": {
"name": "book_table",
"arguments": "{\"people\": 2, \"time\": \"7pm\"}"
}
}
]
}
}
]
}
}
Note
- Function names with dots (.) must be replaced with underscores (_).
- Comparison is case-sensitive.
- Order of tool calls is ignored, which supports parallel tool calling.
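The comparison rules above (case-sensitive names, order ignored) can be sketched in plain Python. This is an illustrative approximation, not the metric's implementation: `normalize_call` and `tool_calling_scores` are hypothetical helpers, and treating each row as an all-or-nothing match is an assumption.

```python
import json
from collections import Counter
from typing import Any


def normalize_call(call: dict[str, Any]) -> tuple[str, str]:
    """Return (name, canonical JSON of arguments) for one OpenAI-style tool call."""
    function = call["function"]
    arguments = function.get("arguments", {})
    if isinstance(arguments, str):  # responses may carry arguments as a JSON string
        arguments = json.loads(arguments)
    return function["name"], json.dumps(arguments, sort_keys=True)


def tool_calling_scores(
    predicted: list[dict[str, Any]], reference: list[dict[str, Any]]
) -> dict[str, float]:
    """Score one row; Counter equality ignores call order but respects duplicates."""
    pred = [normalize_call(call) for call in predicted]
    ref = [normalize_call(call) for call in reference]
    names_match = Counter(name for name, _ in pred) == Counter(name for name, _ in ref)
    full_match = Counter(pred) == Counter(ref)
    return {
        "function_name_accuracy": float(names_match),
        "function_name_and_args_accuracy": float(full_match),
    }


reference = [{"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}]
predicted = [{"function": {"name": "book_table", "arguments": '{"time": "7pm", "people": 2}'}}]
wrong_args = [{"function": {"name": "book_table", "arguments": '{"people": 4, "time": "7pm"}'}}]

scores = tool_calling_scores(predicted, reference)
scores_wrong_args = tool_calling_scores(wrong_args, reference)
print(scores)
print(scores_wrong_args)
```

Serializing arguments with sorted keys means that key order and the dict-versus-string representation do not affect the comparison, while a changed value or a differently cased name still does.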
from nemo_evaluator_sdk import ToolCallingMetric
metric = ToolCallingMetric(reference="{{item.tool_calls}}")
result = evaluator.run(
metric=metric,
dataset=[
{
"messages": [
{"role": "user", "content": "Book a table for 2 at 7pm."},
{
"role": "assistant",
"content": "Booking...",
"tool_calls": [
{
"function": {
"name": "book_table",
"arguments": {"people": 2, "time": "7pm"},
}
}
],
},
],
"tool_calls": [
{
"function": {
"name": "book_table",
"arguments": {"people": 2, "time": "7pm"},
}
}
],
"response": {
"choices": [
{
"message": {
"tool_calls": [
{
"function": {
"name": "book_table",
"arguments": '{"people": 2, "time": "7pm"}',
}
}
]
}
}
]
},
}
],
)
print(result.aggregate_scores)
from nemo_evaluator_sdk import RunConfig, ToolCallingMetric
metric = ToolCallingMetric(reference="{{item.tool_calls}}")
job = evaluator.submit(
metric=metric,
dataset=[
{
"tool_calls": [
{
"function": {
"name": "book_table",
"arguments": {"people": 2, "time": "7pm"},
}
}
],
"response": {
"choices": [
{
"message": {
"tool_calls": [
{
"function": {
"name": "book_table",
"arguments": '{"people": 2, "time": "7pm"}',
}
}
]
}
}
]
},
}
],
config=RunConfig(parallelism=4),
)
job.wait_until_done()
result = job.get_result()
print(result.aggregate_scores)
Topic Adherence¶
Measures how well the agent maintained focus on assigned topics throughout a conversation. Supports F1, precision, or recall scoring modes.
Data Format¶
{
"user_input": [
{
"content": "How do I stay healthy?",
"type": "human"
},
{
"content": "Eat more fruits and vegetables, and exercise regularly.",
"type": "ai"
}
],
"reference_topics": [
"health",
"nutrition",
"fitness"
]
}
Configuration Options¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| metric_mode | string | "f1" | Scoring mode: "f1", "precision", or "recall" |
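The three metric_mode values correspond to the standard precision, recall, and F1 definitions. The sketch below works through the arithmetic for hypothetical counts; `topic_adherence_score` is an illustrative helper, and how the judge arrives at the true-positive, false-positive, and false-negative counts is outside this sketch.

```python
def topic_adherence_score(
    true_positives: int,
    false_positives: int,
    false_negatives: int,
    metric_mode: str = "f1",
) -> float:
    """Combine hypothetical topic-classification counts into one score."""
    answered = true_positives + false_positives
    expected = true_positives + false_negatives
    precision = true_positives / answered if answered else 0.0
    recall = true_positives / expected if expected else 0.0

    if metric_mode == "precision":
        return precision
    if metric_mode == "recall":
        return recall
    if metric_mode == "f1":
        total = precision + recall
        return 2 * precision * recall / total if total else 0.0
    raise ValueError(f"Unsupported metric_mode: {metric_mode!r}")


# Hypothetical counts: 3 true positives, 1 false positive, 2 false negatives.
precision = topic_adherence_score(3, 1, 2, "precision")  # 3 / 4
recall = topic_adherence_score(3, 1, 2, "recall")        # 3 / 5
f1 = topic_adherence_score(3, 1, 2, "f1")                # harmonic mean of the two
print(precision, recall, round(f1, 4))
```

Precision penalizes answering off-topic content, recall penalizes missing expected topics, and F1 balances the two, which is one reason it is a sensible default.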
from nemo_evaluator_sdk import Model
from nemo_evaluator_sdk.metrics.ragas import TopicAdherenceMetric
judge_model = Model(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
metric = TopicAdherenceMetric(metric_mode="f1", judge_model=judge_model)
result = evaluator.run(
metric=metric,
dataset=[
{
"user_input": [
{"content": "Tell me about healthy eating", "type": "human"},
{
"content": "Eating fruits and vegetables is essential for good health.",
"type": "ai",
},
],
"reference_topics": ["health", "nutrition", "diet"],
}
],
)
print(result.aggregate_scores)
from nemo_evaluator_sdk import RunConfig, Model
from nemo_evaluator_sdk.metrics.ragas import TopicAdherenceMetric
judge_model = Model(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
metric = TopicAdherenceMetric(metric_mode="f1", judge_model=judge_model)
job = evaluator.submit(
metric=metric,
dataset=[
{
"user_input": [
{"content": "Tell me about healthy eating", "type": "human"},
{
"content": "Eating fruits and vegetables is essential for good health.",
"type": "ai",
},
],
"reference_topics": ["health", "nutrition", "diet"],
}
],
config=RunConfig(parallelism=4),
)
job.wait_until_done()
result = job.get_result()
print(result.aggregate_scores)
from nemo_evaluator_sdk import RunConfigOnlineModel, InferenceParams, Model
from nemo_evaluator_sdk.metrics.ragas import TopicAdherenceMetric
judge_model = Model(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
target_model = Model(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="nvidia/llama-3.3-nemotron-super-49b-v1",
api_key_secret="nvidia-api-key",
)
metric = TopicAdherenceMetric(metric_mode="f1", judge_model=judge_model)
result = evaluator.run(
metric=metric,
dataset=[
{
"user_input": "Tell me about healthy eating",
"reference_topics": ["health", "nutrition", "diet"],
}
],
config=RunConfigOnlineModel(
parallelism=4,
inference=InferenceParams(temperature=0.7, max_tokens=1024),
),
target=target_model,
prompt_template={
"messages": [
{
"role": "user",
"content": "{{item.user_input}}",
}
]
},
)
print(result.aggregate_scores)
Agent Goal Accuracy¶
Evaluates whether the agent successfully completed the requested task. Returns a binary score (0 or 1). Supports evaluation with or without a reference outcome.
With Reference¶
Compare the agent's outcome against a known reference:
Data Format
{
"user_input": [
{
"content": "Book a table at a Chinese restaurant for 8pm",
"type": "human"
},
{
"content": "I'll search for restaurants.",
"type": "ai",
"tool_calls": [
{
"name": "restaurant_search",
"args": {}
}
]
},
{
"content": "Found: Italian Place",
"type": "tool"
},
{
"content": "Your table at Italian Place is booked for 8pm.",
"type": "ai"
}
],
"reference": "Successfully booked a table at a restaurant for 8pm"
}
Configuration Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| use_reference | boolean | True | Whether to compare against a reference outcome |
from nemo_evaluator_sdk import Model
from nemo_evaluator_sdk.metrics.ragas import AgentGoalAccuracyMetric
judge_model = Model(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
metric = AgentGoalAccuracyMetric(use_reference=True, judge_model=judge_model)
result = evaluator.run(
metric=metric,
dataset=[
{
"user_input": [
{"content": "Book a table at a restaurant for 8pm", "type": "human"},
{
"content": "I'll search for restaurants.",
"type": "ai",
"tool_calls": [{"name": "restaurant_search", "args": {}}],
},
{"content": "Found: Italian Place", "type": "tool"},
{
"content": "Your table at Italian Place is booked for 8pm.",
"type": "ai",
},
],
"reference": "Successfully booked a table at a restaurant for 8pm",
}
],
)
print(result.aggregate_scores)
from nemo_evaluator_sdk import RunConfig, Model
from nemo_evaluator_sdk.metrics.ragas import AgentGoalAccuracyMetric
judge_model = Model(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
metric = AgentGoalAccuracyMetric(use_reference=True, judge_model=judge_model)
job = evaluator.submit(
metric=metric,
dataset=[
{
"user_input": [
{"content": "Book a table at a restaurant for 8pm", "type": "human"},
{
"content": "I'll search for restaurants.",
"type": "ai",
"tool_calls": [{"name": "restaurant_search", "args": {}}],
},
{"content": "Found: Italian Place", "type": "tool"},
{
"content": "Your table at Italian Place is booked for 8pm.",
"type": "ai",
},
],
"reference": "Successfully booked a table at a restaurant for 8pm",
}
],
config=RunConfig(parallelism=4),
)
job.wait_until_done()
result = job.get_result()
print(result.aggregate_scores)
Without Reference¶
The judge LLM infers the goal from the conversation context:
Data Format
{
"user_input": [
{
"content": "Set a reminder for my dentist appointment tomorrow at 2pm",
"type": "human"
},
{
"content": "I'll set that reminder for you.",
"type": "ai",
"tool_calls": [
{
"name": "set_reminder",
"args": {
"title": "Dentist appointment",
"date": "tomorrow",
"time": "2pm"
}
}
]
},
{
"content": "Reminder set successfully.",
"type": "tool"
},
{
"content": "Your reminder for the dentist appointment tomorrow at 2pm has been set.",
"type": "ai"
}
]
}
from nemo_evaluator_sdk import Model
from nemo_evaluator_sdk.metrics.ragas import AgentGoalAccuracyMetric
judge_model = Model(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
metric = AgentGoalAccuracyMetric(use_reference=False, judge_model=judge_model)
result = evaluator.run(
metric=metric,
dataset=[
{
"user_input": [
{
"content": "Set a reminder for my dentist appointment tomorrow at 2pm",
"type": "human",
},
{
"content": "I'll set that reminder for you.",
"type": "ai",
"tool_calls": [
{
"name": "set_reminder",
"args": {
"title": "Dentist appointment",
"date": "tomorrow",
"time": "2pm",
},
}
],
},
{"content": "Reminder set successfully.", "type": "tool"},
{"content": "Your reminder has been set.", "type": "ai"},
],
}
],
)
print(result.aggregate_scores)
from nemo_evaluator_sdk import RunConfig, Model
from nemo_evaluator_sdk.metrics.ragas import AgentGoalAccuracyMetric
judge_model = Model(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
metric = AgentGoalAccuracyMetric(use_reference=False, judge_model=judge_model)
job = evaluator.submit(
metric=metric,
dataset=[
{
"user_input": [
{
"content": "Set a reminder for my dentist appointment tomorrow at 2pm",
"type": "human",
},
{
"content": "I'll set that reminder for you.",
"type": "ai",
"tool_calls": [
{
"name": "set_reminder",
"args": {
"title": "Dentist appointment",
"date": "tomorrow",
"time": "2pm",
},
}
],
},
{"content": "Reminder set successfully.", "type": "tool"},
{"content": "Your reminder has been set.", "type": "ai"},
],
}
],
config=RunConfig(parallelism=4),
)
job.wait_until_done()
result = job.get_result()
print(result.aggregate_scores)
Answer Accuracy¶
Evaluates the factual correctness of an agent's answer by comparing it against a reference answer. Two LLM judges independently rate the agreement, and their ratings are averaged. Each rating is 0 (incorrect), 0.5 (partial match), or 1 (exact match), so the averaged score lies between 0 and 1.
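The averaging step can be sketched as follows, assuming each judge rating is one of 0, 0.5, or 1. `answer_accuracy` is a hypothetical helper for illustration only; the prompts and parsing that produce the ratings are not shown.

```python
from statistics import mean

# Ratings a single judge may return in this sketch.
ALLOWED_RATINGS = (0.0, 0.5, 1.0)


def answer_accuracy(rating_a: float, rating_b: float) -> float:
    """Average two independent judge ratings into one score."""
    for rating in (rating_a, rating_b):
        if rating not in ALLOWED_RATINGS:
            raise ValueError(f"Unexpected rating: {rating}")
    return mean([rating_a, rating_b])


agreeing = answer_accuracy(1.0, 1.0)      # both judges see an exact match
disagreeing = answer_accuracy(1.0, 0.5)   # one exact match, one partial match
print(agreeing, disagreeing)
```

Under this reading, a score such as 0.75 signals that the two judges disagreed by one level rather than that the answer is "75% correct".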
Data Format¶
{
"user_input": "What is the capital of France?",
"response": "The capital of France is Paris.",
"reference": "Paris"
}
from nemo_evaluator_sdk import Model
from nemo_evaluator_sdk.metrics.ragas import AnswerAccuracyMetric
judge_model = Model(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
metric = AnswerAccuracyMetric(judge_model=judge_model)
result = evaluator.run(
metric=metric,
dataset=[
{
"user_input": "What is the capital of France?",
"response": "The capital of France is Paris.",
"reference": "Paris",
}
],
)
print(result.aggregate_scores)
from nemo_evaluator_sdk import RunConfig, Model
from nemo_evaluator_sdk.metrics.ragas import AnswerAccuracyMetric
judge_model = Model(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
metric = AnswerAccuracyMetric(judge_model=judge_model)
job = evaluator.submit(
metric=metric,
dataset=[
{
"user_input": "What is the capital of France?",
"response": "The capital of France is Paris.",
"reference": "Paris",
}
],
config=RunConfig(parallelism=4),
)
job.wait_until_done()
result = job.get_result()
print(result.aggregate_scores)
from nemo_evaluator_sdk import RunConfigOnlineModel, InferenceParams, Model
from nemo_evaluator_sdk.metrics.ragas import AnswerAccuracyMetric
judge_model = Model(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
target_model = Model(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="nvidia/llama-3.3-nemotron-super-49b-v1",
api_key_secret="nvidia-api-key",
)
metric = AnswerAccuracyMetric(judge_model=judge_model)
job = evaluator.submit(
metric=metric,
dataset=[
{
"user_input": "What is the capital of France?",
"reference": "Paris",
}
],
config=RunConfigOnlineModel(
parallelism=4,
inference=InferenceParams(temperature=0.7, max_tokens=1024),
),
target=target_model,
prompt_template={
"messages": [
{
"role": "user",
"content": "{{item.user_input}}",
}
]
},
)
job.wait_until_done()
result = job.get_result()
print(result.aggregate_scores)
Trajectory Evaluation¶
Evaluates the agent's decision-making process by analyzing the entire sequence of actions (trajectory) taken to accomplish a goal. This system metric assesses whether the agent chose appropriate tools in the correct order.
Info
NAT Format Requirement: This metric supports the NeMo Agent Toolkit (NAT) format, with intermediate_steps containing detailed event traces.
Current plugin SDK support: The current plugin SDK does not expose a typed trajectory-evaluation metric class. Use the data-format details below when preparing datasets for environments where the system metric is enabled, but do not use the old generated evaluator job APIs for plugin SDK execution.
Data Format¶
Each data entry must follow the NeMo Agent Toolkit format:
{
"question": "What are LLMs",
"generated_answer": "LLMs, or Large Language Models, are a type of artificial intelligence designed to process and generate human-like language. They are trained on vast amounts of text data and can be fine-tuned for specific tasks or guided by prompt engineering.",
"answer": "LLMs stand for Large Language Models, which are a type of machine learning model designed for natural language processing tasks such as language generation.",
"intermediate_steps": [
{
"payload": {
"event_type": "LLM_END",
"name": "nvidia/llama-3.3-nemotron-super-49b-v1",
"data": {
"input": "\nPrevious conversation history:\n\n\nQuestion: What are LLMs\n",
"output": "Thought: I need to find information about LLMs to answer this question.\n\nAction: wikipedia_search\nAction Input: {'question': 'LLMs'}\n\n"
}
}
},
{
"payload": {
"event_type": "TOOL_END",
"name": "wikipedia_search",
"data": {
"input": "{'question': 'LLMs'}",
"output": "<Document source=\"https://en.wikipedia.org/wiki/Large_language_model\" page=\"\"/>\nA large language model (LLM) is a language model trained with self-supervised machine learning..."
}
}
},
{
"payload": {
"event_type": "LLM_END",
"name": "meta/llama-3.1-70b-instruct",
"data": {
"input": "...",
"output": "Thought: I now know the final answer\n\nFinal Answer: LLMs, or Large Language Models, are a type of artificial intelligence..."
}
}
}
]
}
Parameters¶
| Parameter | Required | Type | Description |
|---|---|---|---|
| judge | required | object | Judge LLM configuration |
| trajectory_used_tools | required | string | Comma-separated list of tools available to the agent. Example: "wikipedia_search,current_datetime,code_generation" |
| trajectory_custom_tools | optional | object | JSON mapping custom tool names to descriptions for non-default tools |
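When preparing trajectory datasets, it can help to confirm that every tool appearing in intermediate_steps is also listed in trajectory_used_tools. The sketch below is a dataset sanity check under that assumption; `tools_in_trajectory` and `undeclared_tools` are hypothetical helpers and are not part of the metric.

```python
from typing import Any


def tools_in_trajectory(entry: dict[str, Any]) -> list[str]:
    """List tool names from TOOL_END events, in the order they occurred."""
    return [
        step["payload"]["name"]
        for step in entry["intermediate_steps"]
        if step["payload"]["event_type"] == "TOOL_END"
    ]


def undeclared_tools(entry: dict[str, Any], trajectory_used_tools: str) -> list[str]:
    """Return tools used in the trace that are missing from the comma-separated list."""
    declared = {name.strip() for name in trajectory_used_tools.split(",") if name.strip()}
    return [tool for tool in tools_in_trajectory(entry) if tool not in declared]


# Trimmed-down entry shaped like the data format above.
entry = {
    "question": "What are LLMs",
    "intermediate_steps": [
        {"payload": {"event_type": "LLM_END", "name": "some-llm", "data": {}}},
        {"payload": {"event_type": "TOOL_END", "name": "wikipedia_search", "data": {}}},
        {"payload": {"event_type": "LLM_END", "name": "some-llm", "data": {}}},
    ],
}

used = tools_in_trajectory(entry)
missing = undeclared_tools(entry, "current_datetime,code_generation")
print(used, missing)
```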
Judge Configuration¶
Most agentic metrics require a judge LLM. Configure the judge model in the metric definition:
from nemo_evaluator_sdk import Model
judge_model = Model(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
Info
Recommended model size: Use a 70B+ parameter model as the judge for reliable results. Smaller models may fail to follow the required output schema, causing parsing errors.
Using Reasoning Models¶
For models that support extended reasoning, such as nvidia/llama-3.3-nemotron-super-49b-v1, add a system prompt and reasoning parameters to the online model generation config:
from nemo_evaluator_sdk import RunConfigOnlineModel, InferenceParams, Model, ReasoningParams
judge_model = Model(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="nvidia/llama-3.3-nemotron-super-49b-v1",
api_key_secret="nvidia-api-key",
)
config = RunConfigOnlineModel(
system_prompt="detailed thinking on",
reasoning=ReasoningParams(end_token="</think>"),
inference=InferenceParams(max_tokens=1024),
)
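One plausible reading of end_token is that it marks where the model's reasoning trace ends, so that only the text after it is treated as the answer. The sketch below illustrates that idea; it is an assumption about the intent of the parameter, not confirmed SDK behavior, and `strip_reasoning` is a hypothetical helper.

```python
def strip_reasoning(output: str, end_token: str = "</think>") -> str:
    """Return the text after the last end_token, or the whole output if the token is absent."""
    _, found, answer = output.rpartition(end_token)
    return answer.strip() if found else output.strip()


with_trace = strip_reasoning("<think>France's capital is well known.</think>\nParis")
without_trace = strip_reasoning("Paris")
print(with_trace, without_trace)
```

If the reasoning trace is long, max_tokens needs enough headroom for both the trace and the final answer; otherwise the output may be cut off before the end token appears.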
Managing Secrets for Authenticated Endpoints¶
If your judge endpoint requires an API key, store it as a secret:
from nemo_evaluator_sdk import Model
client.secrets.create(name="nvidia-api-key", value="nvapi-YOUR_API_KEY_HERE")
judge_model = Model(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
For more details on secret management, refer to Managing Secrets.
For local run versus remote submit behavior of api_key_secret, see Model API Authentication.
Job Management¶
After submitting a durable remote job with evaluator.submit(metric=metric, dataset=dataset), use the returned job resource to monitor execution with job.wait_until_done() and retrieve results with job.get_result().
Refer to Metrics Job Management for more job lifecycle details.
Dataset Notes¶
The current plugin SDK examples use inline dataset rows through dataset=[...]. Keep each row shaped for the selected metric, including fields such as user_input, response, reference, reference_topics, reference_tool_calls, or OpenAI-style tool_calls as required.
Use RunConfig(limit_samples=...) when you want to test a small slice of a larger inline dataset before submitting the full request.
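Before running an evaluation, a quick check that each offline row has the fields its metric expects can save a failed run. The sketch below derives the required fields from the data formats shown on this page; the metric labels in `REQUIRED_FIELDS` and the `missing_fields` helper are hypothetical, and the SDK may apply its own validation.

```python
from typing import Any

# Field sets taken from the offline data-format examples on this page; the keys are illustrative labels.
REQUIRED_FIELDS: dict[str, set[str]] = {
    "tool_call_accuracy": {"user_input", "reference_tool_calls"},
    "topic_adherence": {"user_input", "reference_topics"},
    "agent_goal_accuracy_with_reference": {"user_input", "reference"},
    "agent_goal_accuracy_without_reference": {"user_input"},
    "answer_accuracy": {"user_input", "response", "reference"},
}


def missing_fields(rows: list[dict[str, Any]], metric: str) -> dict[int, list[str]]:
    """Map row index to the sorted list of required fields that the row lacks."""
    required = REQUIRED_FIELDS[metric]
    problems: dict[int, list[str]] = {}
    for index, row in enumerate(rows):
        missing = sorted(required - set(row))
        if missing:
            problems[index] = missing
    return problems


rows = [
    {"user_input": "What is the capital of France?", "response": "Paris.", "reference": "Paris"},
    {"user_input": "What is the capital of Spain?"},
]
problems = missing_fields(rows, "answer_accuracy")
print(problems)
```

For online target generation the response column is optional, so a check like this applies mainly to offline rows.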
Important Notes¶
- Execution choice: Use run for local in-process evaluation and submit for durable remote jobs with wait_until_done() and get_result().
- Column Names: RAGAS metrics use specific column names:
   - user_input (not question)
   - response (not answer)
   - reference (not ground_truth)
- Judge Model Quality: For metrics requiring a judge, evaluation quality depends on the judge model's ability to follow instructions. Larger models (70B+) produce more consistent results.
- RAGAS Dependency: These metrics are powered by RAGAS and may have version-specific behavior.
- NaN troubleshooting for judge-based metrics: If you see nan_count > 0 with mean = null, check judge model authentication first (API key secret, endpoint access, and model permissions). See Model API Authentication for api_key_secret behavior. Some RAGAS metrics are known to convert auth failures into NaN scores instead of raising a hard error.
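If an existing dataset uses the older column names mentioned in the Column Names note, the rows can be renamed before evaluation. The sketch below is a minimal example of that renaming; `RAGAS_COLUMN_NAMES` and `to_ragas_columns` are hypothetical names, and the mapping covers only the three columns listed above.

```python
from typing import Any

# Old column name -> name expected by the RAGAS-backed metrics.
RAGAS_COLUMN_NAMES = {
    "question": "user_input",
    "answer": "response",
    "ground_truth": "reference",
}


def to_ragas_columns(row: dict[str, Any]) -> dict[str, Any]:
    """Rename known legacy columns and leave every other key unchanged."""
    return {RAGAS_COLUMN_NAMES.get(key, key): value for key, value in row.items()}


legacy_row = {
    "question": "What is the capital of France?",
    "answer": "The capital of France is Paris.",
    "ground_truth": "Paris",
}
converted = to_ragas_columns(legacy_row)
print(converted)
```

This mapping is meant for RAGAS metric rows only; the trajectory data format uses question and answer with its own meaning and should not be renamed this way.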
Info
- Agent Configuration - Use agents (generic or NAT) as targets in online evaluation jobs
- Agentic Benchmarks - BFCL benchmark for tool-calling evaluation
- LLM-as-a-Judge - Custom judge-based evaluation
- Evaluation Results - Understanding results
- RAG Metrics - RAGAS metrics for RAG pipelines