Similarity Metrics¶

NeMo Platform offers built-in metrics that you can configure to evaluate your own data. Similarity metrics compare generated or precomputed text against references, labels, or numeric/string expectations. They support Jinja templates so you can map your dataset columns to the values each metric evaluates.

Templates let you evaluate your models on proprietary, domain-specific, or novel tasks. You can bring your own datasets, define your own prompts and templates using Jinja, and select the metrics that matter most for your use case. This approach is ideal when:

You want to evaluate on tasks, data, or formats not covered by industry benchmarks or built-in metrics.
You need to measure model performance using custom or business-specific criteria.
You want to experiment with new evaluation methodologies, metrics, or workflows.
You need to create custom prompts and templates for specific use cases.

Setup¶

import os

from nemo_evaluator.sdk import Evaluator
from nemo_platform import NeMoPlatform

sdk = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: Evaluator = sdk.evaluator  # this object is an Evaluator resource

Use evaluator.run(metric=metric, dataset=dataset) for a local synchronous evaluation. Use evaluator.submit(metric=metric, dataset=dataset) when you need a durable remote job:

job = evaluator.submit(metric=metric, dataset=dataset)
job.wait_until_done()
result = job.get_result()

Template Variables¶

All similarity metrics support Jinja templating with these variables:

{{item}} - Access dataset columns (e.g., {{item.question}}, {{item.answer}})
{{sample.output_text}} - The model's generated output for online runs
Jinja filters: lower, upper, trim, replace, etc.

Use Jinja filters to normalize text before comparison:

from nemo_evaluator_sdk import ExactMatchMetric
metric = ExactMatchMetric(
    reference="{{item.expected | lower | trim}}",
    candidate="{{item.output | lower | trim}}",
)
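
To make the effect of these filters concrete, the sketch below mirrors the `lower | trim` chain in plain Python. It does not call the SDK or Jinja; `normalize` and `exact_match` are hypothetical helper names used only for illustration.

```python
def normalize(text: str) -> str:
    """Mirror the Jinja filter chain `lower | trim` with str methods."""
    return text.lower().strip()


def exact_match(reference: str, candidate: str) -> int:
    """Return 1 when the normalized strings are identical, else 0."""
    return int(normalize(reference) == normalize(candidate))


# Without normalization, case and trailing whitespace cause a mismatch.
print("London" == "london ")             # False
print(exact_match("London", "london "))  # 1
```

Normalizing both sides in the template keeps the comparison symmetric, so formatting differences in either column do not count as errors.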

BLEU Metric¶

BLEU (Bilingual Evaluation Understudy) measures the similarity between machine-generated text and reference translations by comparing n-gram overlap. It's commonly used for evaluating machine translation and text generation tasks.

Use BLEU when:

Evaluating machine translation quality
Measuring text generation similarity to references
Comparing multiple reference texts

Metric Output: A score between 0 and 100, where 100 indicates a perfect match with the references.
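
The n-gram overlap at the core of BLEU can be sketched as a clipped (modified) n-gram precision against several references. This is a minimal illustration that assumes lowercased whitespace tokenization; it is not the SDK's implementation, which may tokenize and smooth differently. Full BLEU additionally combines the precisions for several n-gram orders and applies a brevity penalty. `ngrams` and `modified_precision` are hypothetical names.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: str, references: list[str], n: int) -> float:
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as often as it occurs in the single reference where it is most frequent."""
    cand_counts = ngrams(candidate.lower().split(), n)
    if not cand_counts:
        return 0.0
    max_ref_counts: Counter = Counter()
    for ref in references:
        for gram, count in ngrams(ref.lower().split(), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


refs = ["the cat sits on the mat", "a cat is sitting on the mat"]
print(modified_precision("the cat is on the mat", refs, 1))  # 1.0
print(modified_precision("the cat is on the mat", refs, 2))  # 0.8
```

Every candidate unigram appears in at least one reference, but the bigram "is on" appears in neither, which is why the bigram precision drops to 4/5.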

Local Evaluation

from nemo_evaluator_sdk import BLEUMetric

metric = BLEUMetric(
    references=["{{item.reference_1}}", "{{item.reference_2}}"],
    candidate="{{item.model_output}}",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "reference_1": "The cat sits on the mat.",
            "reference_2": "A cat is sitting on the mat.",
            "model_output": "The cat is on the mat.",
        },
        {
            "reference_1": "Hello world!",
            "reference_2": "Hi world!",
            "model_output": "Hello world!",
        },
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Remote Job

from nemo_evaluator_sdk import BLEUMetric

metric = BLEUMetric(
    references=["{{item.reference_1}}", "{{item.reference_2}}"],
    candidate="{{item.model_output}}",
    description="BLEU score for translation quality",
)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {
            "reference_1": "The cat sits on the mat.",
            "reference_2": "A cat is sitting on the mat.",
            "model_output": "The cat is on the mat.",
        },
        {
            "reference_1": "Hello world!",
            "reference_2": "Hi world!",
            "model_output": "Hello world!",
        },
    ],
)
job.wait_until_done()
result = job.get_result()

Example Result

{
  "scores": [
    {
      "name": "sentence",
      "count": 2,
      "mean": 76.86,
      "min": 53.73,
      "max": 100.0
    },
    {
      "name": "corpus",
      "count": 1,
      "mean": 53.895
    }
  ]
}

Exact Match Metric¶

Exact Match compares the candidate text with the reference text for perfect equality. This metric returns 1 if the strings match exactly and 0 otherwise.

Use Exact Match when:

Evaluating classification tasks with discrete labels
Checking for exact answer correctness
Validating structured output formats

Metric Output: Binary score (0 or 1).

Local Evaluation

from nemo_evaluator_sdk import ExactMatchMetric

metric = ExactMatchMetric(
    reference="{{item.correct_answer | lower | trim}}",
    candidate="{{item.model_answer | lower | trim}}",
    description="Exact match for question answering",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {"correct_answer": "Paris", "model_answer": "Paris"},
        {"correct_answer": "London", "model_answer": "london "},
        {"correct_answer": "Berlin", "model_answer": "Munich"},
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Remote Job

from nemo_evaluator_sdk import ExactMatchMetric

metric = ExactMatchMetric(
    reference="{{item.correct_answer | lower | trim}}",
    candidate="{{item.model_answer | lower | trim}}",
)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {"correct_answer": "Paris", "model_answer": "Paris"},
        {"correct_answer": "London", "model_answer": "london "},
        {"correct_answer": "Berlin", "model_answer": "Munich"},
    ],
)
job.wait_until_done()
result = job.get_result()

Example Result

{
  "scores": [
    {
      "name": "exact-match",
      "count": 3,
      "mean": 0.667,
      "min": 0.0,
      "max": 1.0
    }
  ]
}
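
The aggregate values in this result can be reproduced by hand. The sketch below assumes the aggregate is simple arithmetic over the per-row binary scores, with the mean rounded to three decimals for display; it recomputes the row scores in plain Python rather than calling the SDK.

```python
rows = [
    {"correct_answer": "Paris", "model_answer": "Paris"},
    {"correct_answer": "London", "model_answer": "london "},
    {"correct_answer": "Berlin", "model_answer": "Munich"},
]

# Per-row binary score after the same `lower | trim` normalization.
row_scores = [
    int(row["correct_answer"].lower().strip() == row["model_answer"].lower().strip())
    for row in rows
]

aggregate = {
    "count": len(row_scores),
    "mean": round(sum(row_scores) / len(row_scores), 3),
    "min": float(min(row_scores)),
    "max": float(max(row_scores)),
}
print(row_scores)  # [1, 1, 0]
print(aggregate)   # {'count': 3, 'mean': 0.667, 'min': 0.0, 'max': 1.0}
```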

F1 Metric¶

F1 measures token-level overlap between candidate and reference text. It balances precision and recall, making it useful when there are multiple acceptable ways to phrase a response.

Use F1 when:

Evaluating extractive question answering
Comparing short free-form answers
Measuring partial matches where exact match is too strict

Metric Output: A score between 0 and 1.
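
As a rough illustration of token-level F1, the sketch below counts shared tokens and takes the harmonic mean of precision and recall. It assumes lowercased whitespace tokenization with no further normalization; the SDK may tokenize or normalize differently, so its scores can differ from this sketch. `token_f1` is a hypothetical helper name.

```python
from collections import Counter


def token_f1(reference: str, candidate: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Three of four tokens are shared on each side: precision = recall = 0.75.
print(token_f1("the quick brown fox", "the brown fox jumps"))  # 0.75
```

Because the overlap is computed on token counts, word order does not affect the score.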

Local Evaluation

from nemo_evaluator_sdk import F1Metric

metric = F1Metric(
    reference="{{item.reference}}",
    candidate="{{item.answer}}",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "reference": "the capital of France is Paris",
            "answer": "Paris is the capital of France",
        },
        {"reference": "a red apple", "answer": "red apple"},
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Remote Job

from nemo_evaluator_sdk import F1Metric

metric = F1Metric(
    reference="{{item.reference}}",
    candidate="{{item.answer}}",
)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {
            "reference": "the capital of France is Paris",
            "answer": "Paris is the capital of France",
        },
        {"reference": "a red apple", "answer": "red apple"},
    ],
)
job.wait_until_done()
result = job.get_result()

Example Result

{
  "scores": [
    {
      "name": "f1",
      "count": 2,
      "mean": 0.75,
      "min": 0.5,
      "max": 1.0
    }
  ]
}

Number Check Metric¶

Number Check performs numerical comparisons on values extracted from your templates. It supports equality, inequality, ordering comparisons, and absolute-difference checks.

Use Number Check when:

Validating numerical outputs (calculations, counts, scores)
Checking value ranges or thresholds
Comparing predicted vs expected numbers

Metric Output: 1 if the condition is true, 0 otherwise. If either value cannot be parsed as a number, the row score is NaN.

Supported Operations¶

Equality: "equals", "=="
Inequality: "!=", "<>", "not equals"
Comparisons: ">", "gt", ">=", "gte", "<", "lt", "<=", "lte"
Absolute difference: "absolute difference" (requires epsilon parameter)
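
One way to picture how these operations behave is a small dispatch table. The sketch below is an illustration in plain Python, not the SDK's implementation. It assumes that "absolute difference" passes when |left - right| <= epsilon (the inclusive bound is an assumption) and returns NaN when a value cannot be parsed, as described above. `COMPARATORS` and `number_check` are hypothetical names.

```python
import math
import operator
from typing import Optional

# Hypothetical lookup table built from the operation names listed above.
COMPARATORS = {
    "equals": operator.eq, "==": operator.eq,
    "!=": operator.ne, "<>": operator.ne, "not equals": operator.ne,
    ">": operator.gt, "gt": operator.gt,
    ">=": operator.ge, "gte": operator.ge,
    "<": operator.lt, "lt": operator.lt,
    "<=": operator.le, "lte": operator.le,
}


def number_check(left: str, right: str, operation: str,
                 epsilon: Optional[float] = None) -> float:
    """Return 1.0 or 0.0 for the comparison, or NaN if parsing fails."""
    try:
        a, b = float(left), float(right)
    except ValueError:
        return math.nan
    if operation == "absolute difference":
        if epsilon is None:
            raise ValueError("'absolute difference' requires epsilon")
        # Assumption: the check passes when |a - b| <= epsilon.
        return float(abs(a - b) <= epsilon)
    return float(COMPARATORS[operation](a, b))


print(number_check("42.5", "42.3", "absolute difference", epsilon=0.5))  # 1.0
print(number_check("99", "101", "absolute difference", epsilon=0.5))     # 0.0
print(number_check("0.5", "0.5", ">"))                                   # 0.0
print(number_check("n/a", "1", "=="))                                    # nan
```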

Local Evaluation

from nemo_evaluator_sdk import NumberCheckMetric

metric = NumberCheckMetric(
    operation="absolute difference",
    epsilon=0.5,
    left_template="{{item.expected}}",
    right_template="{{item.predicted}}",
    description="Check if values match within tolerance",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {"expected": "100", "predicted": "100"},
        {"expected": "42.5", "predicted": "42.3"},
        {"expected": "99", "predicted": "101"},
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Remote Job

from nemo_evaluator_sdk import NumberCheckMetric

metric = NumberCheckMetric(
    operation=">",
    left_template="{{item.predicted}}",
    right_template="0.5",
    description="Score must be greater than 0.5",
)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {"predicted": "1"},
        {"predicted": "0.75"},
        {"predicted": "0.5"},
        {"predicted": "0.1"},
    ],
)
job.wait_until_done()
result = job.get_result()

Example Result (for the three-row evaluator.run example)

{
  "scores": [
    {
      "name": "number-check",
      "count": 3,
      "mean": 0.6667,
      "min": 0.0,
      "max": 1.0
    }
  ]
}

ROUGE Metric¶

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures overlap between generated text and reference text. It is commonly used for summarization and long-form generation quality checks.

Use ROUGE when:

Evaluating summarization quality
Measuring overlap with reference passages
Comparing generated text against longer expected answers

Metric Output: ROUGE-1, ROUGE-2, ROUGE-3, and ROUGE-L F1 scores between 0 and 1.
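
To show what these scores measure, the sketch below computes ROUGE-N as an n-gram overlap F1 and ROUGE-L as an F1 based on the longest common subsequence (LCS). It assumes lowercased whitespace tokenization and no stemming, so values from the SDK may differ. `rouge_n`, `rouge_l`, and `_f1` are hypothetical helper names.

```python
from collections import Counter


def _f1(overlap: int, ref_total: int, cand_total: int) -> float:
    """F1 from an overlap count and the two totals."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / cand_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference: str, candidate: str, n: int) -> float:
    """F1 over shared n-grams of length n."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    ref_grams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    cand_grams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    overlap = sum((ref_grams & cand_grams).values())
    return _f1(overlap, sum(ref_grams.values()), sum(cand_grams.values()))


def rouge_l(reference: str, candidate: str) -> float:
    """F1 over the longest common subsequence of tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    # Dynamic-programming table: lcs[i][j] is the LCS length of ref[:i], cand[:j].
    lcs = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            lcs[i][j] = lcs[i - 1][j - 1] + 1 if r == c else max(lcs[i - 1][j], lcs[i][j - 1])
    return _f1(lcs[-1][-1], len(ref), len(cand))


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(round(rouge_n(reference, candidate, 1), 3))  # 0.833
print(round(rouge_n(reference, candidate, 2), 3))  # 0.6
print(round(rouge_l(reference, candidate), 3))     # 0.833
```

Replacing a single word costs one unigram but two bigrams, which is why ROUGE-2 drops faster than ROUGE-1 and ROUGE-L here.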

Local Evaluation

from nemo_evaluator_sdk import ROUGEMetric

metric = ROUGEMetric(
    reference="{{item.reference_summary}}",
    candidate="{{item.model_summary}}",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "reference_summary": "The cat sat on the mat and looked out the window.",
            "model_summary": "A cat sat on a mat near the window.",
        },
        {
            "reference_summary": "The launch was postponed because of high winds.",
            "model_summary": "High winds delayed the launch.",
        },
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Remote Job

from nemo_evaluator_sdk import ROUGEMetric

metric = ROUGEMetric(
    reference="{{item.reference_summary}}",
    candidate="{{item.model_summary}}",
)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {
            "reference_summary": "The cat sat on the mat and looked out the window.",
            "model_summary": "A cat sat on a mat near the window.",
        },
        {
            "reference_summary": "The launch was postponed because of high winds.",
            "model_summary": "High winds delayed the launch.",
        },
    ],
)
job.wait_until_done()
result = job.get_result()

Example Result

{
  "scores": [
    {
      "name": "rouge_1_score",
      "count": 2,
      "mean": 0.72
    },
    {
      "name": "rouge_2_score",
      "count": 2,
      "mean": 0.43
    },
    {
      "name": "rouge_3_score",
      "count": 2,
      "mean": 0.31
    },
    {
      "name": "rouge_L_score",
      "count": 2,
      "mean": 0.67
    }
  ]
}

String Check Metric¶

String Check compares two rendered strings using a configurable operation. It supports equality, inequality, containment, and prefix/suffix checks.

Use String Check when:

Validating text format or structure
Checking for keyword presence
Pattern matching in generated text
String-based classification

Metric Output: Binary score (1 if condition is true, 0 otherwise).

Supported Operations¶

Equality: "equals", "=="
Inequality: "!=", "<>", "not equals"
Containment: "contains", "not contains"
Pattern: "startswith", "endswith"
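
The operations above can be pictured as a lookup from operation name to a two-argument predicate. This is a sketch, not the SDK's implementation. It assumes comparisons are case-sensitive and that "contains" tests whether the right value is a substring of the left value. `STRING_OPERATIONS` and `string_check` are hypothetical names.

```python
# Hypothetical lookup table built from the operation names listed above.
STRING_OPERATIONS = {
    "equals": lambda left, right: left == right,
    "==": lambda left, right: left == right,
    "!=": lambda left, right: left != right,
    "<>": lambda left, right: left != right,
    "not equals": lambda left, right: left != right,
    # Assumption: the right value is searched for inside the left value.
    "contains": lambda left, right: right in left,
    "not contains": lambda left, right: right not in left,
    "startswith": lambda left, right: left.startswith(right),
    "endswith": lambda left, right: left.endswith(right),
}


def string_check(left: str, right: str, operation: str) -> int:
    """Return 1 if the operation holds for the two strings, else 0."""
    return int(STRING_OPERATIONS[operation](left, right))


print(string_check("The answer is: 42", "answer", "contains"))  # 1
print(string_check("Error occurred", "Success", "contains"))    # 0
print(string_check("Answer: 42", "Answer:", "startswith"))      # 1
```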

Local Evaluation

from nemo_evaluator_sdk import StringCheckMetric

metric = StringCheckMetric(
    operation="contains",
    left_template="{{item.output | trim}}",
    right_template="{{item.must_contain}}",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {"output": "The answer is: 42", "must_contain": "answer"},
        {"output": "Result: Success", "must_contain": "Success"},
        {"output": "Error occurred", "must_contain": "Success"},
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Remote Job

from nemo_evaluator_sdk import StringCheckMetric

metric = StringCheckMetric(
    operation="startswith",
    left_template="{{item.output}}",
    right_template="Answer:",
    description="Check if output starts with 'Answer:'",
)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {"output": "Answer: 42"},
        {"output": "Answer: Success"},
        {"output": "Error occurred"},
    ],
)
job.wait_until_done()
result = job.get_result()

Example Result

{
  "scores": [
    {
      "name": "string-check",
      "count": 3,
      "mean": 0.667,
      "min": 0.0,
      "max": 1.0
    }
  ]
}

Dataset Format¶

The examples on this page use inline dataset rows with dataset=[...]. Template fields determine the columns required by each metric:

reference, references, left_template, and right_template read from item fields in the dataset.
candidate reads from an item field for offline rows when configured.
If candidate is omitted for BLEU, Exact Match, F1, or ROUGE, the metric uses sample.output_text, which is populated during online evaluations.

Keep field names consistent between the dataset rows and the templates you configure. For example, {{item.expected}} requires each row to include an expected field.
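
A quick pre-flight check can catch mismatched field names before you run an evaluation. The sketch below is not part of the SDK. It uses a regular expression to collect `item.<field>` references from template strings and reports which rows lack them. `required_fields` and `missing_fields` are hypothetical helpers, and the pattern only handles simple dotted references.

```python
import re

# Matches `item.<field>` inside Jinja expressions such as
# "{{item.expected | lower | trim}}".
ITEM_FIELD = re.compile(r"\bitem\.([A-Za-z_][A-Za-z0-9_]*)")


def required_fields(*templates: str) -> set[str]:
    """Collect every dataset field referenced by the given templates."""
    fields: set[str] = set()
    for template in templates:
        fields.update(ITEM_FIELD.findall(template))
    return fields


def missing_fields(rows: list[dict], *templates: str) -> dict[int, set[str]]:
    """Map row index to the referenced fields that the row does not provide."""
    needed = required_fields(*templates)
    report = {}
    for index, row in enumerate(rows):
        missing = needed - set(row)
        if missing:
            report[index] = missing
    return report


rows = [
    {"expected": "100", "predicted": "100"},
    {"expected": "42.5"},  # no "predicted" column
]
print(missing_fields(rows, "{{item.expected}}", "{{item.predicted | trim}}"))
# {1: {'predicted'}}
```

Templates that reference only `sample.output_text` yield no required item fields, which matches the online-evaluation case described above.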
