Evaluator NeMo Platform SDK Resources

The nemo_evaluator_sdk package provides context-agnostic objects for defining metrics, datasets, evaluation configuration, and result handling. When you want to execute those evaluations through the NeMo Platform Evaluator plugin, use the Evaluator SDK resource mounted on the nemo_platform SDK. This page explains the NeMo Platform-specific objects used to run local plugin jobs, submit durable platform jobs, and retrieve evaluator job results.

Evaluator

The Evaluator resource is the sync SDK object for working with the Evaluator plugin on NeMo Platform. It is accessed directly from a NeMoPlatform instance:

import os
from nemo_evaluator.sdk import Evaluator
from nemo_platform import NeMoPlatform


client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: Evaluator = client.evaluator  # this object is an Evaluator resource

The primary execution methods are run and submit. Use run when you want a local in-process plugin execution that returns a completed EvaluationResult. Use submit when you want to create a durable remote platform job and manage the job lifecycle separately.

| Method | Description | Returns |
| --- | --- | --- |
| run() | Runs one metric locally through the Evaluator plugin job runtime. | EvaluationResult |
| submit() | Submits one metric evaluation as a durable platform job. | EvaluatorJobResource |
| plugin_status() | Returns Evaluator plugin health information from the service. | dict[str, object] |
| get_job_resource(job_name: str, workspace: str \| None = None) | Returns a resource for an existing Evaluator plugin job. | EvaluatorJobResource |

The dataset argument accepts inline rows, local dataset paths, and fileset references. Use config for evaluator runtime settings, and use target together with prompt_template when the evaluator should generate model or agent responses before scoring. The aggregate_fields argument shapes the returned aggregate scores and applies only to calls that return a result: run() and the job resource's get_result().
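As a minimal sketch of preparing a local dataset path, the snippet below writes the inline rows used on this page to a JSON Lines file with the standard library. It assumes the Evaluator plugin can read JSONL files, which this page does not confirm. The write_rows_as_jsonl helper and the rows.jsonl file name are illustrative and not part of the SDK.

```python
import json
import tempfile
from pathlib import Path


def write_rows_as_jsonl(rows: list[dict], path: Path) -> Path:
    """Write one JSON object per line (hypothetical helper, not part of the SDK)."""
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path


rows = [
    {"expected": "Paris", "output": "Paris"},
    {"expected": "Berlin", "output": "Munich"},
]

# Write into a fresh temporary directory so the sketch leaves no files in the working tree.
dataset_path = write_rows_as_jsonl(rows, Path(tempfile.mkdtemp()) / "rows.jsonl")

# Read the file back to confirm it round-trips to the same rows.
loaded = [
    json.loads(line)
    for line in dataset_path.read_text(encoding="utf-8").splitlines()
]
```

If the format is supported, the resulting path could be passed as the dataset argument in place of the inline rows.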

run() arguments

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| metric | Metric | Yes | Metric configuration used to score each row. |
| dataset | PluginDatasetInput | Yes | Inline rows, a local dataset path, or a fileset reference. |
| config | RunConfig \| RunConfigOnline \| RunConfigOnlineModel \| None | No | Runtime settings such as sample limits, parallelism, timeouts, and retry behavior. |
| aggregate_fields | tuple[AggregateFieldName, ...] \| None | No | Aggregate score fields to include in the returned result. |
| target | Model \| Agent \| None | No | Model or agent target used when the evaluator should generate outputs before scoring. |
| dataset_glob_pattern | str \| None | No | Pattern used to select files from a dataset path or fileset reference. |
| prompt_template | str \| dict[str, Any] \| None | No | Prompt template used with target for online model or agent evaluation. |

submit() arguments

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| metric | Metric | Yes | Metric configuration serialized into the durable platform job. |
| dataset | PluginDatasetInput | Yes | Inline rows, a local dataset path, or a fileset reference. |
| config | RunConfig \| RunConfigOnline \| RunConfigOnlineModel \| None | No | Runtime settings applied when the submitted job executes. |
| target | Model \| Agent \| None | No | Model or agent target used when the submitted job should generate outputs before scoring. |
| dataset_glob_pattern | str \| None | No | Pattern used to select files from a dataset path or fileset reference. |
| prompt_template | str \| dict[str, Any] \| None | No | Prompt template used with target for online model or agent evaluation. |

Run locally

from nemo_evaluator_sdk import ExactMatchMetric


metric = ExactMatchMetric(reference="{{item.expected}}", candidate="{{item.output}}")
dataset = [
    {"expected": "Paris", "output": "Paris"},
    {"expected": "Berlin", "output": "Munich"},
]

result = evaluator.run(metric=metric, dataset=dataset)
print(result.aggregate_scores)
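
To make the expected outcome concrete, the following standalone sketch reproduces what an exact-match comparison would yield for these two rows. It is not the SDK's implementation: the exact_match_mean function is hypothetical, and it assumes that exact match means plain string equality and that the aggregate is the mean of the per-row scores. The actual shape of result.aggregate_scores is not shown on this page.

```python
def exact_match_mean(
    rows: list[dict[str, str]], reference_key: str, candidate_key: str
) -> float:
    """Score each row 1.0 on string equality, else 0.0, and return the mean.

    Hypothetical illustration only; the SDK's ExactMatchMetric may normalize
    or aggregate differently.
    """
    scores = [
        1.0 if row[reference_key] == row[candidate_key] else 0.0 for row in rows
    ]
    return sum(scores) / len(scores) if scores else 0.0


dataset = [
    {"expected": "Paris", "output": "Paris"},
    {"expected": "Berlin", "output": "Munich"},
]

mean_score = exact_match_mean(dataset, "expected", "output")
print(mean_score)  # 0.5: one of the two rows matches
```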

Submit a platform job

from nemo_evaluator_sdk import ExactMatchMetric


metric = ExactMatchMetric(reference="{{item.expected}}", candidate="{{item.output}}")
dataset = [
    {"expected": "Paris", "output": "Paris"},
    {"expected": "Berlin", "output": "Munich"},
]

job = evaluator.submit(metric=metric, dataset=dataset)
job.wait_until_done()
result = job.get_result()
print(result.aggregate_scores)

AsyncEvaluator

The AsyncEvaluator resource provides the same Evaluator plugin surface for AsyncNeMoPlatform. Async methods must be awaited:

import os
from nemo_evaluator.sdk import AsyncEvaluator
from nemo_platform import AsyncNeMoPlatform


client = AsyncNeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: AsyncEvaluator = client.evaluator

| Method | Description | Returns |
| --- | --- | --- |
| run() | Runs one metric locally through the Evaluator plugin job runtime. | EvaluationResult |
| submit() | Submits one metric evaluation as a durable platform job. | AsyncEvaluatorJobResource |
| plugin_status() | Returns Evaluator plugin health information from the service. | dict[str, object] |
| get_job_resource(job_name: str, workspace: str \| None = None) | Returns a resource for an existing Evaluator plugin job. | AsyncEvaluatorJobResource |

AsyncEvaluator.run() and AsyncEvaluator.submit() accept the same arguments as the sync methods above.

import asyncio

from nemo_evaluator_sdk import ExactMatchMetric


metric = ExactMatchMetric(reference="{{item.expected}}", candidate="{{item.output}}")
dataset = [
    {"expected": "Paris", "output": "Paris"},
    {"expected": "Berlin", "output": "Munich"},
]


async def main() -> None:
    job = await evaluator.submit(metric=metric, dataset=dataset)
    await job.wait_until_done()
    result = await job.get_result()
    print(result.aggregate_scores)


asyncio.run(main())
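
One reason to prefer the async resource is that several jobs can be awaited concurrently. The sketch below wraps the submit, wait_until_done, and get_result sequence shown above in a coroutine and fans it out with asyncio.gather. The evaluate_all function is a hypothetical helper. FakeAsyncEvaluator and FakeJob are stand-ins that exist only so the example runs without a platform; they imitate the method names documented on this page and nothing else.

```python
import asyncio
from typing import Any


async def evaluate_all(
    evaluator: Any, metric: Any, datasets: list[list[dict]]
) -> list[Any]:
    """Submit one job per dataset and collect the results concurrently."""

    async def submit_and_collect(dataset: list[dict]) -> Any:
        job = await evaluator.submit(metric=metric, dataset=dataset)
        await job.wait_until_done()
        return await job.get_result()

    # gather returns results in the same order as the input datasets.
    return await asyncio.gather(*(submit_and_collect(d) for d in datasets))


# --- Hypothetical stand-ins so the sketch runs without a platform ---
class FakeJob:
    def __init__(self, row_count: int) -> None:
        self.row_count = row_count

    async def wait_until_done(self) -> None:
        await asyncio.sleep(0)  # yield control, as a real poll would

    async def get_result(self) -> dict:
        return {"rows": self.row_count}


class FakeAsyncEvaluator:
    async def submit(self, metric: Any, dataset: list[dict]) -> FakeJob:
        return FakeJob(len(dataset))


results = asyncio.run(
    evaluate_all(
        FakeAsyncEvaluator(),
        metric=None,
        datasets=[[{"a": 1}], [{"a": 1}, {"a": 2}]],
    )
)
print(results)
```

With a real AsyncEvaluator, the same helper would be called with client.evaluator and an actual metric. Whether the platform limits concurrent jobs is not covered on this page.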

EvaluatorJobResource

The EvaluatorJobResource is the sync job handle returned by Evaluator.submit. You can also reconnect to an existing job with Evaluator.get_job_resource.

Some of the most useful methods and properties are described below.

| Method or property | Description |
| --- | --- |
| name | Returns the evaluator job name. |
| job | Returns the raw evaluator job payload captured at resource creation. |
| get_job_status() | Fetches the current evaluator job status from the Evaluator plugin API. |
| check_if_complete(raise_if_not_complete: bool = False) | Returns whether the job is complete. When raise_if_not_complete is true, raises for any status other than completed. |
| wait_until_done() | Polls the job until it reaches a terminal platform status. Raises if the job fails or times out. |
| get_result(aggregate_fields=None) | Downloads aggregate-score and row-score artifacts and returns an EvaluationResult. Optional aggregate_fields shapes the returned aggregate scores only. |
| download_artifacts(path=None) | Downloads and extracts the full job artifacts archive under a job-specific directory. |
| as_async() | Returns an AsyncEvaluatorJobResource view over the same job. |
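
The methods above can be combined to reconnect to a job by name and fetch its result without blocking. The following is a sketch under stated assumptions: result_if_complete is a hypothetical helper, and FakeEvaluator and FakeJob are stand-ins that only imitate the documented method names so the example runs without a platform. How check_if_complete reports failed jobs is not specified on this page, so the sketch treats any incomplete status as "no result yet".

```python
from typing import Any, Optional


def result_if_complete(evaluator: Any, job_name: str) -> Optional[Any]:
    """Return the job's result if it is complete, otherwise None (non-blocking)."""
    job = evaluator.get_job_resource(job_name)
    if job.check_if_complete():
        return job.get_result()
    return None


# --- Hypothetical stand-ins so the sketch runs without a platform ---
class FakeJob:
    def __init__(self, complete: bool) -> None:
        self._complete = complete

    def check_if_complete(self, raise_if_not_complete: bool = False) -> bool:
        return self._complete

    def get_result(self, aggregate_fields: Any = None) -> dict:
        return {"status": "completed"}


class FakeEvaluator:
    def __init__(self, jobs: dict[str, FakeJob]) -> None:
        self._jobs = jobs

    def get_job_resource(
        self, job_name: str, workspace: Optional[str] = None
    ) -> FakeJob:
        return self._jobs[job_name]


fake = FakeEvaluator({"done-job": FakeJob(True), "running-job": FakeJob(False)})
finished = result_if_complete(fake, "done-job")
pending = result_if_complete(fake, "running-job")
```

To block until the job finishes instead, call wait_until_done() before get_result(), as in the submit example earlier on this page.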

AsyncEvaluatorJobResource

The AsyncEvaluatorJobResource is the async job handle returned by AsyncEvaluator.submit. It mirrors EvaluatorJobResource, but its status and result methods must be awaited.