Manage Metrics

Instantiate the metric class you want to run, then pass it to evaluator.run(...) or evaluator.submit(...) together with a dataset and an optional configuration.

Initialize the SDK

import os

from nemo_evaluator_sdk import Evaluator
from nemo_platform import NeMoPlatform


client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: Evaluator = client.evaluator  # this object is an Evaluator resource

Create Metric Objects Inline

Metric objects are normal Python objects from nemo_evaluator_sdk.metrics.*. Keep them close to the evaluation code so the definition, dataset fields, and execution request stay in sync.

from nemo_evaluator_sdk import ExactMatchMetric

metric = ExactMatchMetric(
    reference="{{item.expected}}",
    candidate="{{item.output}}",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {"expected": "Paris", "output": "Paris"},
        {"expected": "Berlin", "output": "Munich"},
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Use run for fast local execution while developing a metric. Use submit for durable remote execution through the platform job service.
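
To make the link between the template strings and the dataset fields concrete, here is a minimal standalone sketch of how a placeholder such as {{item.expected}} could resolve against each row and how the two-row example arrives at its mean. It does not use the SDK: render and exact_match_scores are hypothetical helpers written for illustration, and the strict, case-sensitive comparison is an assumption rather than documented ExactMatchMetric behavior.

```python
import re
from statistics import mean

# Matches placeholders such as {{item.expected}} or {{ item.output }}.
_PLACEHOLDER = re.compile(r"\{\{\s*item\.(\w+)\s*\}\}")


def render(template: str, row: dict) -> str:
    """Substitute each {{item.<field>}} with the matching value from one dataset row."""

    def _lookup(match: re.Match) -> str:
        field = match.group(1)
        if field not in row:
            # A template naming a missing field is the usual way a metric
            # definition and a dataset drift out of sync.
            raise KeyError(f"dataset row has no field {field!r}")
        return str(row[field])

    return _PLACEHOLDER.sub(_lookup, template)


def exact_match_scores(reference: str, candidate: str, dataset: list[dict]) -> list[float]:
    """Score 1.0 when the rendered reference equals the rendered candidate, else 0.0."""
    return [
        1.0 if render(reference, row) == render(candidate, row) else 0.0
        for row in dataset
    ]


dataset = [
    {"expected": "Paris", "output": "Paris"},
    {"expected": "Berlin", "output": "Munich"},
]
scores = exact_match_scores("{{item.expected}}", "{{item.output}}", dataset)
print(scores, mean(scores))
```

Under these assumptions the first row matches and the second does not, so the mean is 0.5. Renaming a dataset field without updating the template raises a KeyError in this sketch, which is the kind of drift that keeping the metric next to the evaluation code helps you catch early.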

Reuse a Metric Definition

Because metrics are inline objects, reuse is usually just a Python helper function or module-level factory.

from nemo_evaluator_sdk import F1Metric

def answer_f1_metric() -> F1Metric:
    return F1Metric(
        reference="{{item.expected_answer}}",
        candidate="{{item.generated_answer}}",
        description="Token-level F1 between expected and generated answers.",
    )


metric = answer_f1_metric()
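
For readers unfamiliar with the score named in the description above, the following is a rough standalone sketch of one common definition of token-level F1. It is not the SDK's implementation: token_f1 is a hypothetical helper, and the lowercasing, whitespace tokenization, and empty-string convention are assumptions that F1Metric may not share.

```python
from collections import Counter


def token_f1(reference: str, candidate: str) -> float:
    """Harmonic mean of token precision and recall over lowercased, whitespace-split tokens."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        # Convention chosen for this sketch: two empty strings match, otherwise score 0.
        return 1.0 if ref_tokens == cand_tokens else 0.0

    # Multiset intersection counts a shared token only as often as it appears in both.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(cand_tokens)  # share of candidate tokens that are correct
    recall = overlap / len(ref_tokens)      # share of reference tokens that were recovered
    return 2 * precision * recall / (precision + recall)


print(token_f1("the Eiffel Tower", "Eiffel Tower"))
```

In this example the candidate recovers two of the three reference tokens with no extra tokens, so precision is 1.0, recall is 2/3, and the score comes out at roughly 0.8. Partial credit of this kind is the main reason to reach for F1 instead of exact match on free-form answers.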

Choose Metric Classes

Use the metric-specific pages for configuration details and examples:

Metric family    | Common classes
Similarity       | ExactMatchMetric, F1Metric, BLEUMetric, ROUGEMetric, StringCheckMetric, NumberCheckMetric
LLM-as-a-Judge   | LLMJudgeMetric
RAG and agentic  | FaithfulnessMetric, ResponseRelevancyMetric, TopicAdherenceMetric, ToolCallingMetric, and related RAGAS-backed classes
Custom endpoints | Remote metric classes from nemo_evaluator_sdk.metrics.remote

Configure Runtime Parameters

Pass execution settings, such as parallelism and sample limits, through the config argument of evaluator.run(...) or evaluator.submit(...).

from nemo_evaluator_sdk import RunConfig

config = RunConfig(parallelism=4, limit_samples=100)
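
The sketch below illustrates what settings like these plausibly control, using only the Python standard library. It rests on two assumptions that this page does not confirm: that limit_samples caps the evaluation at the first N rows, and that parallelism bounds the number of rows scored concurrently. score_rows is a hypothetical helper, and the SDK may sample or schedule work differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def score_rows(
    rows: list[dict],
    score_fn: Callable[[dict], float],
    parallelism: int = 1,
    limit_samples: Optional[int] = None,
) -> list[float]:
    """Score at most limit_samples rows using up to parallelism worker threads."""
    # Assumption: the limit keeps the first N rows rather than a random sample.
    selected = rows if limit_samples is None else rows[:limit_samples]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # Executor.map yields results in input order, even if rows finish out of order.
        return list(pool.map(score_fn, selected))


# 250 synthetic rows; only rows 0, 1, and 2 have matching fields.
rows = [{"expected": str(i), "output": str(i % 3)} for i in range(250)]
scores = score_rows(
    rows,
    lambda row: float(row["expected"] == row["output"]),
    parallelism=4,
    limit_samples=100,
)
print(len(scores), sum(scores))
```

With these inputs only 100 of the 250 rows are scored. A small sample limit like this is useful for a quick smoke test of a metric definition before running the full dataset.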

For online evaluations, provide a model or agent target and use the online parameter classes described in Model Configuration and Agent Configuration.

Submit a Durable Job

from nemo_evaluator_sdk import RunConfig, ExactMatchMetric

metric = ExactMatchMetric(reference="{{item.expected}}", candidate="{{item.output}}")

job = evaluator.submit(
    metric=metric,
    dataset=[
        {"expected": "Paris", "output": "Paris"},
        {"expected": "Berlin", "output": "Munich"},
    ],
    config=RunConfig(parallelism=4),
)

job.wait_until_done()
result = job.get_result()

Related Pages

  • Metric Results - Work with EvaluationResult, aggregate scores, and row scores
  • Manage Metric Jobs - Submit, monitor, reconnect to, and download job results
  • Similarity Metrics - Configure exact match, F1, BLEU, ROUGE, and string/number checks