Manage Metrics

Instantiate the metric class you want to run, then pass it, together with a dataset and an optional configuration, to evaluator.run(...) or evaluator.submit(...).

Initialize the SDK

import os

from nemo_evaluator_sdk import Evaluator
from nemo_platform import NeMoPlatform


client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: Evaluator = client.evaluator  # this object is an Evaluator resource

Create Metric Objects Inline

Metric objects are normal Python objects from nemo_evaluator_sdk.metrics.*. Keep them close to the evaluation code so the definition, dataset fields, and execution request stay in sync.

from nemo_evaluator_sdk import ExactMatchMetric

metric = ExactMatchMetric(
    reference="{{item.expected}}",
    candidate="{{item.output}}",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {"expected": "Paris", "output": "Paris"},
        {"expected": "Berlin", "output": "Munich"},
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Use run for fast local execution while developing a metric. Use submit for durable remote execution through the platform job service.
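
To make the placeholders in the example above concrete, here is a minimal plain-Python sketch of how a template such as {{item.expected}} could resolve against a dataset row, and how per-row exact-match scores average into the reported mean. It does not call the SDK: render_template and exact_match are hypothetical helpers written for illustration, and the real ExactMatchMetric may render templates or normalize text differently.

```python
import re
from statistics import mean

# Matches placeholders of the form {{item.<field>}} (assumed template syntax).
_PLACEHOLDER = re.compile(r"\{\{\s*item\.(\w+)\s*\}\}")


def render_template(template: str, item: dict) -> str:
    """Replace each {{item.<field>}} placeholder with the row's value (hypothetical helper)."""
    return _PLACEHOLDER.sub(lambda m: str(item[m.group(1)]), template)


def exact_match(reference: str, candidate: str) -> float:
    """Score 1.0 for identical strings, otherwise 0.0 (assumed scoring rule)."""
    return 1.0 if reference == candidate else 0.0


dataset = [
    {"expected": "Paris", "output": "Paris"},
    {"expected": "Berlin", "output": "Munich"},
]

# Render both templates per row, then score the rendered strings.
row_scores = [
    exact_match(
        render_template("{{item.expected}}", row),
        render_template("{{item.output}}", row),
    )
    for row in dataset
]
mean_score = mean(row_scores)
print(row_scores, mean_score)
```

Under these assumptions the first row matches and the second does not, so the mean over the two rows is 0.5.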

Reuse a Metric Definition

Because metrics are inline objects, reuse is usually just a Python helper function or module-level factory.

from nemo_evaluator_sdk import F1Metric

def answer_f1_metric() -> F1Metric:
    return F1Metric(
        reference="{{item.expected_answer}}",
        candidate="{{item.generated_answer}}",
        description="Token-level F1 between expected and generated answers.",
    )


metric = answer_f1_metric()
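
For readers unfamiliar with the score named in the description above, the following is a rough sketch of one common way to compute token-level F1. It is not the SDK's implementation: token_f1 is a hypothetical helper, and the lowercasing and whitespace tokenization are assumptions that F1Metric may not share.

```python
from collections import Counter


def token_f1(reference: str, candidate: str) -> float:
    """Token-level F1 between two strings (illustrative, not the SDK's code)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not ref_tokens or not cand_tokens:
        return float(ref_tokens == cand_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("the cat sat", "the cat")
print(round(score, 3))
```

In this example the candidate shares two of the reference's three tokens, so precision is 1.0, recall is 2/3, and the F1 score is 0.8.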

Choose Metric Classes

Use the metric-specific pages for configuration details and examples:

| Metric family | Common classes |
| --- | --- |
| Similarity | `ExactMatchMetric`, `F1Metric`, `BLEUMetric`, `ROUGEMetric`, `StringCheckMetric`, `NumberCheckMetric` |
| LLM-as-a-Judge | `LLMJudgeMetric` |
| RAG and agentic | `FaithfulnessMetric`, `ResponseRelevancyMetric`, `TopicAdherenceMetric`, `ToolCallingMetric`, and related RAGAS-backed classes |
| Custom endpoints | Remote metric classes from `nemo_evaluator_sdk.metrics.remote` |

Configure Runtime Parameters

Pass execution settings through the config argument of evaluator.run(...) or evaluator.submit(...).

from nemo_evaluator_sdk import RunConfig

config = RunConfig(parallelism=4, limit_samples=100)
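
The page does not spell out what these two settings do, so the sketch below shows one plausible reading in plain Python: limit_samples caps how many rows are evaluated, and parallelism bounds how many rows are scored concurrently. run_rows is a hypothetical helper, not an SDK function, and the SDK's actual behavior may differ.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def run_rows(
    rows: list[dict],
    score_row: Callable[[dict], float],
    parallelism: int = 1,
    limit_samples: Optional[int] = None,
) -> list[float]:
    """Score rows with bounded concurrency (assumed semantics, not the SDK's code)."""
    # Assumption: limit_samples keeps only the first N rows.
    selected = rows if limit_samples is None else rows[:limit_samples]
    # Assumption: parallelism is the maximum number of concurrent workers.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # Executor.map returns results in input order.
        return list(pool.map(score_row, selected))


rows = [{"expected": str(i), "output": str(i % 2)} for i in range(10)]
scores = run_rows(
    rows,
    lambda row: float(row["expected"] == row["output"]),
    parallelism=4,
    limit_samples=3,
)
print(scores)
```

With limit_samples=3 only the first three of the ten rows are scored, regardless of how many workers are available.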

For online evaluations, provide a model or agent target and use the online parameter classes described in Model Configuration and Agent Configuration.

Submit a Durable Job

from nemo_evaluator_sdk import RunConfig, ExactMatchMetric

metric = ExactMatchMetric(reference="{{item.expected}}", candidate="{{item.output}}")

job = evaluator.submit(
    metric=metric,
    dataset=[
        {"expected": "Paris", "output": "Paris"},
        {"expected": "Berlin", "output": "Munich"},
    ],
    config=RunConfig(parallelism=4),
)

job.wait_until_done()
result = job.get_result()

Metric Results - Work with EvaluationResult, aggregate scores, and row scores
Manage Metric Jobs - Submit, monitor, reconnect to, and download job results
Similarity Metrics - Configure exact match, F1, BLEU, ROUGE, and string/number checks