Evaluator NeMo Platform SDK Resources

The nemo_evaluator_sdk package provides context-agnostic objects for defining metrics, datasets, evaluation configuration, and result handling. When you want to execute those evaluations through the NeMo Platform Evaluator plugin, use the Evaluator SDK resource mounted on the nemo_platform SDK. This page explains the NeMo Platform-specific objects used to run local plugin jobs, submit durable platform jobs, and retrieve evaluator job results.

Evaluator

The Evaluator resource is the synchronous SDK object for working with the Evaluator plugin on NeMo Platform. Access it directly from a NeMoPlatform instance:

import os
from nemo_evaluator.sdk import Evaluator
from nemo_platform import NeMoPlatform


client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: Evaluator = client.evaluator  # this object is an Evaluator resource

The primary execution methods are run and submit. Use run when you want a local in-process plugin execution that returns a completed EvaluationResult. Use submit when you want to create a durable remote platform job and manage the job lifecycle separately.

Method	Description	Returns
`run()`	Runs one metric locally through the Evaluator plugin job runtime.	`EvaluationResult`
`submit()`	Submits one metric evaluation as a durable platform job.	`EvaluatorJobResource`
`plugin_status()`	Returns Evaluator plugin health information from the service.	`dict[str, object]`
`get_job_resource(job_name: str, workspace: str | None = None)`	Returns a resource for an existing Evaluator plugin job.	`EvaluatorJobResource`
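The choice between run and submit can be wrapped in one small helper. The sketch below is illustrative only: `evaluate` and its `durable` flag are hypothetical names, not part of the SDK, and the function relies only on the `run()`, `submit()`, `wait_until_done()`, and `get_result()` calls described on this page.

```python
from typing import Any


def evaluate(evaluator: Any, metric: Any, dataset: Any, durable: bool = False) -> Any:
    """Return a result through run() or through a submitted platform job.

    `evaluate` and `durable` are hypothetical names used for illustration.
    """
    if not durable:
        # Local in-process execution: run() returns the completed result directly.
        return evaluator.run(metric=metric, dataset=dataset)
    # Durable execution: submit() returns a job handle that is managed separately.
    job = evaluator.submit(metric=metric, dataset=dataset)
    job.wait_until_done()  # documented to raise if the job fails or times out
    return job.get_result()
```

With the `evaluator`, `metric`, and `dataset` objects from the examples on this page, `evaluate(evaluator, metric, dataset)` would run locally, and `evaluate(evaluator, metric, dataset, durable=True)` would go through a platform job.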

The dataset argument accepts inline rows, local dataset paths, and fileset references. Use config for evaluator runtime settings. Use aggregate_fields, available on calls that return a result such as run() and get_result(), to shape the aggregate scores. Use target together with prompt_template when the evaluator should generate model or agent responses before scoring.

`run()` arguments

Argument	Type	Required	Description
`metric`	`Metric`	Yes	Metric configuration used to score each row.
`dataset`	`PluginDatasetInput`	Yes	Inline rows, a local dataset path, or a fileset reference.
`config`	`RunConfig | RunConfigOnline | RunConfigOnlineModel | None`	No	Runtime settings such as sample limits, parallelism, timeouts, and retry behavior.
`aggregate_fields`	`tuple[AggregateFieldName, ...] | None`	No	Aggregate score fields to include in the returned result.
`target`	`Model | Agent | None`	No	Model or agent target used when the evaluator should generate outputs before scoring.
`dataset_glob_pattern`	`str | None`	No	Pattern used to select files from a dataset path or fileset reference.
`prompt_template`	`str | dict[str, Any] | None`	No	Prompt template used with `target` for online model or agent evaluation.

`submit()` arguments

Argument	Type	Required	Description
`metric`	`Metric`	Yes	Metric configuration serialized into the durable platform job.
`dataset`	`PluginDatasetInput`	Yes	Inline rows, a local dataset path, or a fileset reference.
`config`	`RunConfig | RunConfigOnline | RunConfigOnlineModel | None`	No	Runtime settings applied when the submitted job executes.
`target`	`Model | Agent | None`	No	Model or agent target used when the submitted job should generate outputs before scoring.
`dataset_glob_pattern`	`str | None`	No	Pattern used to select files from a dataset path or fileset reference.
`prompt_template`	`str | dict[str, Any] | None`	No	Prompt template used with `target` for online model or agent evaluation.
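Because run() and submit() share most of their arguments, the common keyword arguments can be assembled once and reused for either call. The following is a minimal sketch under stated assumptions: `build_eval_kwargs` is a hypothetical helper, and the check that prompt_template needs a target is an assumption drawn from the argument descriptions above, not documented SDK behavior.

```python
from typing import Any, Optional


def build_eval_kwargs(
    metric: Any,
    dataset: Any,
    config: Optional[Any] = None,
    target: Optional[Any] = None,
    prompt_template: Optional[Any] = None,
    dataset_glob_pattern: Optional[str] = None,
) -> dict[str, Any]:
    """Collect the keyword arguments shared by run() and submit().

    Hypothetical helper: optional arguments left as None are omitted so the
    SDK defaults apply.
    """
    if prompt_template is not None and target is None:
        # Assumption: a prompt template is only meaningful alongside a target.
        raise ValueError("prompt_template requires a target")
    optional = {
        "config": config,
        "target": target,
        "prompt_template": prompt_template,
        "dataset_glob_pattern": dataset_glob_pattern,
    }
    kwargs: dict[str, Any] = {"metric": metric, "dataset": dataset}
    kwargs.update({key: value for key, value in optional.items() if value is not None})
    return kwargs
```

The resulting dictionary could then be passed as `evaluator.run(**kwargs)` or `evaluator.submit(**kwargs)`. The aggregate_fields argument is left out on purpose, since the tables above list it only for run().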

Run locally

from nemo_evaluator_sdk import ExactMatchMetric


metric = ExactMatchMetric(reference="{{item.expected}}", candidate="{{item.output}}")
dataset = [
    {"expected": "Paris", "output": "Paris"},
    {"expected": "Berlin", "output": "Munich"},
]

# run() executes locally and returns a completed EvaluationResult
result = evaluator.run(metric=metric, dataset=dataset)
print(result.aggregate_scores)

Submit a platform job

from nemo_evaluator_sdk import ExactMatchMetric


metric = ExactMatchMetric(reference="{{item.expected}}", candidate="{{item.output}}")
dataset = [
    {"expected": "Paris", "output": "Paris"},
    {"expected": "Berlin", "output": "Munich"},
]

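job = evaluator.submit(metric=metric, dataset=dataset)  # returns an EvaluatorJobResource
job.wait_until_done()  # polls until a terminal status; raises on failure or timeout
result = job.get_result()  # downloads the score artifacts into an EvaluationResult
print(result.aggregate_scores)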

AsyncEvaluator

The AsyncEvaluator resource provides the same Evaluator plugin surface for AsyncNeMoPlatform. Async methods must be awaited:

import os
from nemo_evaluator.sdk import AsyncEvaluator
from nemo_platform import AsyncNeMoPlatform


client = AsyncNeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: AsyncEvaluator = client.evaluator  # this object is an AsyncEvaluator resource

Method	Description	Returns
`run()`	Runs one metric locally through the Evaluator plugin job runtime.	`EvaluationResult`
`submit()`	Submits one metric evaluation as a durable platform job.	`AsyncEvaluatorJobResource`
`plugin_status()`	Returns Evaluator plugin health information from the service.	`dict[str, object]`
`get_job_resource(job_name: str, workspace: str | None = None)`	Returns a resource for an existing Evaluator plugin job.	`AsyncEvaluatorJobResource`

AsyncEvaluator.run() and AsyncEvaluator.submit() accept the same arguments as the sync methods above.

import asyncio

from nemo_evaluator_sdk import ExactMatchMetric


metric = ExactMatchMetric(reference="{{item.expected}}", candidate="{{item.output}}")
dataset = [
    {"expected": "Paris", "output": "Paris"},
    {"expected": "Berlin", "output": "Munich"},
]


async def main() -> None:
    job = await evaluator.submit(metric=metric, dataset=dataset)
    await job.wait_until_done()
    result = await job.get_result()
    print(result.aggregate_scores)


asyncio.run(main())

EvaluatorJobResource

The EvaluatorJobResource is the sync job handle returned by Evaluator.submit. You can also reconnect to an existing job with Evaluator.get_job_resource.

The most useful methods and properties are listed below.

Method or property	Description
`name`	Returns the evaluator job name.
`job`	Returns the raw evaluator job payload captured at resource creation.
`get_job_status()`	Fetches the current evaluator job status from the Evaluator plugin API.
`check_if_complete(raise_if_not_complete: bool = False)`	Returns whether the job is complete. When `raise_if_not_complete` is true, raises for any status other than `completed`.
`wait_until_done()`	Polls the job until it reaches a terminal platform status. Raises if the job fails or times out.
`get_result(aggregate_fields=None)`	Downloads aggregate-score and row-score artifacts and returns an `EvaluationResult`. Optional `aggregate_fields` shapes the returned aggregate scores only.
`download_artifacts(path=None)`	Downloads and extracts the full job artifacts archive under a job-specific directory.
`as_async()`	Returns an `AsyncEvaluatorJobResource` view over the same job.
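The methods above can be combined to reconnect to a job that was submitted earlier and fetch its result only if it has completed. This is a sketch, not SDK code: `fetch_result_if_ready` is a hypothetical helper, and the shape of the value returned by `get_job_status()` is not specified on this page, so it is only printed.

```python
from typing import Any, Optional


def fetch_result_if_ready(
    evaluator: Any, job_name: str, workspace: Optional[str] = None
) -> Optional[Any]:
    """Reconnect to an existing job and return its result, or None if not complete.

    Hypothetical helper built only on the methods listed in the table above.
    """
    job = evaluator.get_job_resource(job_name, workspace=workspace)
    if not job.check_if_complete():
        # Not completed: report the current status so the caller can retry later.
        print(f"{job.name}: {job.get_job_status()}")
        return None
    return job.get_result()
```

A check like this suits a script that runs in a separate session from the one that called submit. To block until the job finishes, `wait_until_done()` is the documented alternative.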

AsyncEvaluatorJobResource

The AsyncEvaluatorJobResource is the async job handle returned by AsyncEvaluator.submit. It mirrors EvaluatorJobResource, but its status and result methods must be awaited.
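One reason to use the async handle is to wait on several jobs concurrently. The sketch below assumes an AsyncEvaluator-like object whose `submit()`, `wait_until_done()`, and `get_result()` are awaitable, as described above. `evaluate_many` is a hypothetical helper, and the choice to submit one job per dataset is illustrative.

```python
import asyncio
from typing import Any


async def evaluate_many(evaluator: Any, metric: Any, datasets: list[Any]) -> list[Any]:
    """Submit one job per dataset and return the results in input order.

    Hypothetical helper; it uses only the awaitable methods named on this page.
    """

    async def evaluate_one(dataset: Any) -> Any:
        job = await evaluator.submit(metric=metric, dataset=dataset)
        await job.wait_until_done()
        return await job.get_result()

    # gather() runs the coroutines concurrently and preserves argument order.
    return await asyncio.gather(*(evaluate_one(dataset) for dataset in datasets))
```

By default `asyncio.gather` propagates the first exception raised by any job to the caller. Passing `return_exceptions=True` would collect failures alongside results instead.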