RAG Evaluation Metrics

RAG (Retrieval Augmented Generation) metrics evaluate the quality of RAG pipelines by measuring both retrieval and answer generation performance. These metrics use RAGAS to assess how well retrieved contexts support generated answers.

Overview

RAGAS metrics support both offline and online evaluation modes:

  • Offline evaluation: Uses pre-generated responses from your dataset.
  • Online evaluation: Responses are generated automatically using a model and prompt template before evaluation.
      • The job's model and prompt_template are used to generate responses.
      • The generated response (in sample["output_text"]) is automatically used as response in the RAGAS evaluation.
      • RAG context variables can be included in the job's prompt_template:
          • {{user_input}}: User question/input from the dataset
          • {{retrieved_contexts}}: Retrieved context passages from the dataset

RAGAS metrics require:

  • Judge LLM: An LLM to evaluate answer quality (required for most metrics)
  • Judge Embeddings (optional): An embeddings model, required only by response_relevancy
  • Data: Questions, contexts, and either pre-generated responses (offline) or a model to generate them (online)

Prerequisites

Before running RAG evaluations:

  1. Workspace: Have a workspace created. All resources (metrics, secrets, jobs) are scoped to a workspace.
  2. Model Endpoints: Access to judge LLM endpoints (and embeddings model for some metrics)
  3. API Keys (if required): Create secrets for any endpoints requiring authentication
  4. Initialize the SDK:
import os

from nemo_evaluator_sdk import RunConfigOnlineModel, RunConfig, InferenceParams, Model
from nemo_evaluator_sdk.metrics.ragas import (
    ContextEntityRecallMetric,
    ContextPrecisionMetric,
    ContextRecallMetric,
    ContextRelevanceMetric,
    FaithfulnessMetric,
    NoiseSensitivityMetric,
    ResponseGroundednessMetric,
    ResponseRelevancyMetric,
)
from nemo_evaluator.sdk import Evaluator
from nemo_platform import NeMoPlatform

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: Evaluator = client.evaluator  # this object is an Evaluator resource

Use evaluator.run(metric=metric, dataset=dataset) for a local synchronous evaluation. Use evaluator.submit(metric=metric, dataset=dataset) when you need a durable remote job:

job = evaluator.submit(metric=metric, dataset=dataset)
job.wait_until_done()
result = job.get_result()

Creating a Secret for API Keys

If using external endpoints that require authentication, such as NVIDIA Build endpoints, create a secret first:

client.secrets.create(
    name="nvidia-api-key",
    value="nvapi-YOUR_API_KEY_HERE",
    description="NVIDIA Build API key for RAG metrics",
)

Reference secrets by name in your model configuration. For local run versus remote submit behavior, see Model API Authentication.

judge_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="meta/llama-3.1-70b-instruct",
    api_key_secret="nvidia-api-key",
)

RAGAS metrics accept inline model definitions for judge_model and, where required, embeddings_model.

See Model Configuration for details.


Supported RAGAS Metrics

| Use Case | Metric Type | Description | Required Columns* |
| --- | --- | --- | --- |
| Measure retrieval quality | context_recall | Coverage of reference information in retrieved context | user_input, retrieved_contexts, reference |
| Measure retrieval quality | context_precision | Whether all retrieved chunks are relevant to the question | user_input, retrieved_contexts, reference |
| Measure retrieval quality | context_relevance | Relevance of retrieved context to the question | user_input, retrieved_contexts |
| Measure retrieval quality | context_entity_recall | Recall of important entities from reference in context | retrieved_contexts, reference |
| Detect hallucinations | faithfulness | Measures factual consistency of response with retrieved context | user_input, response, retrieved_contexts |
| Detect hallucinations | response_groundedness | Evaluates whether response is grounded in context without hallucinations | response, retrieved_contexts |
| Detect hallucinations | noise_sensitivity | Robustness to noisy or irrelevant context | user_input, response, reference, retrieved_contexts |
| Check if answers address the question | response_relevancy** | Response relevancy to question using embeddings similarity | user_input, response, retrieved_contexts |

* Required Columns: Dataset columns that must be present for the metric to be evaluated.

** Requires embeddings_model in addition to judge_model.
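
Before spending judge calls, it can be useful to check that each row carries the columns a metric needs. The sketch below transcribes the Required Columns from the table above into a dictionary; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical helpers for illustration, not part of the SDK, and the `online` flag assumes that response is generated rather than read from the row in online mode.

```python
# Required columns per metric type, transcribed from the table above.
REQUIRED_COLUMNS = {
    "context_recall": {"user_input", "retrieved_contexts", "reference"},
    "context_precision": {"user_input", "retrieved_contexts", "reference"},
    "context_relevance": {"user_input", "retrieved_contexts"},
    "context_entity_recall": {"retrieved_contexts", "reference"},
    "faithfulness": {"user_input", "response", "retrieved_contexts"},
    "response_groundedness": {"response", "retrieved_contexts"},
    "noise_sensitivity": {"user_input", "response", "reference", "retrieved_contexts"},
    "response_relevancy": {"user_input", "response", "retrieved_contexts"},
}


def missing_columns(metric_type, row, online=False):
    """Return the sorted required columns that are absent from one dataset row.

    With online=True, "response" is not expected in the row because it is
    assumed to be generated before scoring.
    """
    required = set(REQUIRED_COLUMNS[metric_type])
    if online:
        required.discard("response")
    return sorted(required - set(row))


row = {
    "user_input": "What is the capital of France?",
    "retrieved_contexts": ["Paris is the capital and largest city of France."],
}
print(missing_columns("faithfulness", row))               # ['response']
print(missing_columns("faithfulness", row, online=True))  # []
```

Running a check like this over a few rows catches column-name mistakes before a remote job is submitted.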


Shared Example Setup

For local run versus remote submit behavior of api_key_secret, see Model API Authentication.

The metric examples below use these inline values:

judge_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="meta/llama-3.1-70b-instruct",
    api_key_secret="nvidia-api-key",
)

embeddings_model = Model(
    url="https://integrate.api.nvidia.com/v1/embeddings",
    name="nvidia/nv-embedqa-e5-v5",
    api_key_secret="nvidia-api-key",
)

generation_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="nvidia/llama-3.3-nemotron-super-49b-v1",
    api_key_secret="nvidia-api-key",
)

Use offline rows when your RAG pipeline has already produced responses:

offline_rows = [
    {
        "user_input": "What is the capital of France?",
        "retrieved_contexts": ["Paris is the capital and largest city of France."],
        "response": "The capital of France is Paris.",
        "reference": "Paris is the capital of France.",
    }
]

Use online arguments when the evaluator should generate the response first:

online_dataset = [
    {
        "user_input": "What is the capital of France?",
        "retrieved_contexts": ["Paris is the capital and largest city of France."],
        "reference": "Paris is the capital of France.",
    }
]
online_prompt_template = {
    "messages": [
        {
            "role": "user",
            "content": "Context:\n{{item.retrieved_contexts | join('\n\n')}}\n\nQuestion: {{item.user_input}}\n\nAnswer:",
        }
    ]
}
online_config = RunConfigOnlineModel(
    parallelism=8,
    inference=InferenceParams(temperature=0.2, max_tokens=1024),
)
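
To sanity-check the template before running an online evaluation, it may help to preview the prompt for a single row. The sketch below does not call the SDK or a template engine. `preview_prompt` is a hypothetical helper that rebuilds by hand what the example template appears to produce, assuming each row is exposed to the template as `item` and that `join('\n\n')` separates passages with a blank line.

```python
def preview_prompt(row):
    """Hand-built approximation of the rendered example template for one row."""
    # Assumption: passages are joined with a blank line, mirroring join('\n\n').
    contexts = "\n\n".join(row["retrieved_contexts"])
    return f"Context:\n{contexts}\n\nQuestion: {row['user_input']}\n\nAnswer:"


example_row = {
    "user_input": "What is the capital of France?",
    "retrieved_contexts": ["Paris is the capital and largest city of France."],
}
print(preview_prompt(example_row))
```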

Context Recall

Measures the fraction of relevant content retrieved compared to the total relevant content in the reference.

  • Score name: context_recall
  • Score range: 0 to 1, with higher scores indicating better recall.
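
As a rough illustration of the definition above, recall can be read as the share of reference statements that the retrieved contexts support. In the sketch below the per-statement verdicts are labeled by hand; in a real run the judge LLM produces them. `context_recall` is a hypothetical helper, and returning NaN for an empty reference is an assumption made for this example.

```python
import math


def context_recall(attributed):
    """Fraction of reference statements supported by the retrieved contexts.

    `attributed` holds one boolean per reference statement.
    """
    if not attributed:
        return math.nan  # assumption: an empty reference has no defined score
    return sum(attributed) / len(attributed)


# Three reference statements, two of which are found in the retrieved contexts.
print(context_recall([True, True, False]))
```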

Data Format

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "reference": "Paris is the capital of France."
}

Run locally with evaluator.run:

metric = ContextRecallMetric(judge_model=judge_model)

result = evaluator.run(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Submit as a remote job with evaluator.submit:

metric = ContextRecallMetric(judge_model=judge_model)

job = evaluator.submit(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))
job.wait_until_done()
result = job.get_result()

Example aggregate scores:

{
  "scores": [
    {
      "count": 1,
      "histogram": {},
      "name": "context_recall",
      "nan_count": 0,
      "max": 1.0,
      "mean": 1.0,
      "min": 1.0,
      "percentiles": {},
      "score_type": "range",
      "std_dev": 0.0,
      "sum": 1.0,
      "variance": 0.0
    }
  ]
}

Context Precision

Measures the proportion of relevant chunks in the retrieved contexts (precision@k).

  • Score name: context_precision
  • Score range: 0 to 1, with higher scores indicating better precision.
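
One common rank-aware formulation of precision@k averages the precision at each rank that holds a relevant chunk, so relevant chunks near the top count for more. The sketch below follows that formulation with hand-labeled relevance flags; `context_precision` is a hypothetical helper, and whether the SDK computes the score in exactly this way is an assumption.

```python
def context_precision(relevance):
    """Rank-aware precision over retrieved chunks.

    `relevance[i]` is True when the chunk at rank i + 1 is judged relevant.
    Precision@k is accumulated only at ranks holding a relevant chunk, then
    averaged over the number of relevant chunks.
    """
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / rank  # precision@rank
    return weighted / hits if hits else 0.0


print(context_precision([True, False, True]))  # relevant chunks at ranks 1 and 3
print(context_precision([False, True]))        # 0.5: the only relevant chunk is at rank 2
```

The same two relevant chunks score higher when they appear at ranks 1 and 2 than at ranks 2 and 3, which is the behavior a reranker is meant to improve.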

Data Format

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "reference": "Paris"
}

Run locally with evaluator.run:

metric = ContextPrecisionMetric(judge_model=judge_model)

result = evaluator.run(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Submit as a remote job with evaluator.submit:

metric = ContextPrecisionMetric(judge_model=judge_model)

job = evaluator.submit(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))
job.wait_until_done()
result = job.get_result()

Example aggregate scores:

{
  "scores": [
    {
      "count": 1,
      "histogram": {},
      "name": "context_precision",
      "nan_count": 0,
      "max": 1.0,
      "mean": 1.0,
      "min": 1.0
    }
  ]
}

Context Relevance

Measures how relevant the retrieved contexts are to the user input.

  • Score name: context_relevance
  • Score range: 0 to 1, with higher scores indicating better relevance.

Data Format

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ]
}

Run locally with evaluator.run:

metric = ContextRelevanceMetric(judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "user_input": "What is the capital of France?",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
        }
    ],
    config=RunConfig(parallelism=8),
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Submit as a remote job with evaluator.submit:

metric = ContextRelevanceMetric(judge_model=judge_model)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {
            "user_input": "What is the capital of France?",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
        }
    ],
    config=RunConfig(parallelism=8),
)
job.wait_until_done()
result = job.get_result()

Context Entity Recall

Measures how many important entities from the reference are present in the retrieved contexts.

  • Score name: context_entity_recall
  • Score range: 0 to 1, with higher scores indicating better entity recall.
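
Read as a set operation, entity recall is the share of reference entities that also appear in the retrieved contexts. The sketch below uses hand-written entity sets; in a real run the judge LLM extracts them. `context_entity_recall` is a hypothetical helper, and the NaN result for an empty reference set is an assumption.

```python
import math


def context_entity_recall(reference_entities, context_entities):
    """Share of reference entities that also occur in the retrieved contexts."""
    reference = set(reference_entities)
    if not reference:
        return math.nan  # assumption: no reference entities means no defined score
    return len(reference & set(context_entities)) / len(reference)


print(context_entity_recall({"Paris", "France"}, {"Paris", "France"}))  # 1.0
print(context_entity_recall({"Paris", "France"}, {"Paris"}))            # 0.5
```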

Data Format

{
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "reference": "Paris is the capital of France."
}

Run locally with evaluator.run:

metric = ContextEntityRecallMetric(judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
            "reference": "Paris is the capital of France.",
        }
    ],
    config=RunConfig(parallelism=8),
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Submit as a remote job with evaluator.submit:

metric = ContextEntityRecallMetric(judge_model=judge_model)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
            "reference": "Paris is the capital of France.",
        }
    ],
    config=RunConfig(parallelism=8),
)
job.wait_until_done()
result = job.get_result()

Faithfulness

Measures factual consistency of the response with the retrieved context.

  • Score name: faithfulness
  • Score range: 0 to 1, with higher scores indicating the response is more faithful to the context.
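
Faithfulness is about support from the retrieved context, not about whether a claim is true in general. The sketch below makes that distinction concrete with hand-labeled claims; the judge LLM would normally extract and verify them. `faithfulness` is a hypothetical helper, and the claim about the Louvre is an illustrative example of a statement that the sample context does not cover.

```python
import math


def faithfulness(claims):
    """Score response claims against the retrieved context.

    `claims` is a list of (claim_text, supported_by_context) pairs.
    Returns the supported fraction and the list of unsupported claims.
    """
    if not claims:
        return math.nan, []  # assumption: no claims means no defined score
    unsupported = [text for text, supported in claims if not supported]
    score = (len(claims) - len(unsupported)) / len(claims)
    return score, unsupported


score, unsupported = faithfulness([
    ("The capital of France is Paris.", True),
    ("Paris is home to the Louvre.", False),  # not stated in the sample context
])
print(score, unsupported)
```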

Data Format

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "response": "The capital of France is Paris."
}

Run locally on pre-generated responses (offline):

metric = FaithfulnessMetric(judge_model=judge_model)

result = evaluator.run(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Run locally with online response generation:

metric = FaithfulnessMetric(judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=online_dataset,
    config=online_config,
    target=generation_model,
    prompt_template=online_prompt_template,
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Submit as a remote job with online response generation:

metric = FaithfulnessMetric(judge_model=judge_model)

job = evaluator.submit(
    metric=metric,
    dataset=online_dataset,
    config=online_config,
    target=generation_model,
    prompt_template=online_prompt_template,
)
job.wait_until_done()
result = job.get_result()

Response Groundedness

Evaluates whether the response is grounded in the retrieved context without hallucinations.

  • Score name: response_groundedness
  • Score range: 0 to 1, with higher scores indicating stronger grounding.

Data Format

{
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "response": "The capital of France is Paris."
}

Run locally on pre-generated responses (offline):

metric = ResponseGroundednessMetric(judge_model=judge_model)

result = evaluator.run(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Run locally with online response generation:

metric = ResponseGroundednessMetric(judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=online_dataset,
    config=online_config,
    target=generation_model,
    prompt_template=online_prompt_template,
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Submit as a remote job with online response generation:

metric = ResponseGroundednessMetric(judge_model=judge_model)

job = evaluator.submit(
    metric=metric,
    dataset=online_dataset,
    config=online_config,
    target=generation_model,
    prompt_template=online_prompt_template,
)
job.wait_until_done()
result = job.get_result()

Noise Sensitivity

Measures robustness when retrieved contexts contain noisy or irrelevant information.

  • Score name: noise_sensitivity
  • Score range: 0 to 1. Lower scores usually indicate the response is less sensitive to noise.

Data Format

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France.",
    "Berlin is the capital of Germany."
  ],
  "response": "The capital of France is Paris.",
  "reference": "Paris is the capital of France."
}

Run locally with evaluator.run:

metric = NoiseSensitivityMetric(judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "user_input": "What is the capital of France?",
            "retrieved_contexts": [
                "Paris is the capital and largest city of France.",
                "Berlin is the capital of Germany.",
            ],
            "response": "The capital of France is Paris.",
            "reference": "Paris is the capital of France.",
        }
    ],
    config=RunConfig(parallelism=8),
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Submit as a remote job with evaluator.submit:

metric = NoiseSensitivityMetric(judge_model=judge_model)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {
            "user_input": "What is the capital of France?",
            "retrieved_contexts": [
                "Paris is the capital and largest city of France.",
                "Berlin is the capital of Germany.",
            ],
            "response": "The capital of France is Paris.",
            "reference": "Paris is the capital of France.",
        }
    ],
    config=RunConfig(parallelism=8),
)
job.wait_until_done()
result = job.get_result()

Response Relevancy

Measures how relevant a response is to the user input using generated questions and embedding similarity.

  • Score name: response_relevancy
  • Score range: 0 to 1, with higher scores indicating better relevancy.
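
The scoring idea can be sketched as follows: the judge generates one or more questions from the response, both the original question and the generated ones are embedded, and the score is the mean cosine similarity between them. The vectors below are toy values rather than real embeddings, and `cosine_similarity` and `response_relevancy` are hypothetical helpers that assume non-zero vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def response_relevancy(question_vec, generated_question_vecs):
    """Mean similarity between the user question and the generated questions."""
    sims = [cosine_similarity(question_vec, vec) for vec in generated_question_vecs]
    return sum(sims) / len(sims)


# One generated question, as with strictness=1, pointing the same way as the input.
print(response_relevancy([1.0, 0.0], [[2.0, 0.0]]))  # 1.0
```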

Data Format

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "response": "The capital of France is Paris."
}

Configuration Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| strictness | int | 1 | Number of parallel questions generated. NIM supports 1. |

Run locally on pre-generated responses (offline):

metric = ResponseRelevancyMetric(
    judge_model=judge_model,
    embeddings_model=embeddings_model,
    strictness=1,
)

result = evaluator.run(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Run locally with online response generation:

metric = ResponseRelevancyMetric(
    judge_model=judge_model,
    embeddings_model=embeddings_model,
    strictness=1,
)

result = evaluator.run(
    metric=metric,
    dataset=online_dataset,
    config=online_config,
    target=generation_model,
    prompt_template=online_prompt_template,
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Submit as a remote job with online response generation:

metric = ResponseRelevancyMetric(
    judge_model=judge_model,
    embeddings_model=embeddings_model,
    strictness=1,
)

job = evaluator.submit(
    metric=metric,
    dataset=online_dataset,
    config=online_config,
    target=generation_model,
    prompt_template=online_prompt_template,
)
job.wait_until_done()
result = job.get_result()

Dataset Format

RAGAS metrics use specific column names:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| user_input | string | Most metrics | User question or input. Not required by context_entity_recall or response_groundedness. |
| retrieved_contexts | list[string] | Some metrics | List of context passages |
| response | string | Some metrics | Generated answer. Required for offline response-quality metrics; generated as sample.output_text for online model requests. |
| reference | string | Some metrics | Reference answer or ground truth |

Note

Different metrics require different columns. See the Required Columns for each metric in Supported RAGAS Metrics.

Example Dataset

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "response": "The capital of France is Paris.",
  "reference": "Paris"
}

Response Format

All evaluation responses follow this structure:

{
  "metric": {
    "type": "faithfulness",
    "judge_model": {
      "url": "...",
      "name": "..."
    }
  },
  "aggregate_scores": {
    "scores": [
      {
        "name": "faithfulness",
        "count": 1,
        "mean": 0.95,
        "min": 0.95,
        "max": 0.95,
        "sum": 0.95
      }
    ]
  },
  "row_scores": [
    {
      "row_index": 0,
      "metrics": {
        "faithfulness": [
          {"name": "faithfulness", "value": 0.95}
        ]
      },
      "error": null
    }
  ]
}

Working with Results

# Access aggregate scores
for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}, count={score.count}")

# Access per-row scores
for row in result.row_scores:
    if row.metrics:
        print(f"Row {row.row_index}: {row.metrics}")
    elif row.error:
        print(f"Row {row.row_index} failed: {row.error}")
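
When results are saved or logged as JSON, the same traversal can be done on plain dictionaries. The sketch below assumes the result has been serialized into the structure shown under Response Format; `summarize_rows` is a hypothetical helper, not an SDK function.

```python
def summarize_rows(result):
    """Split row_scores into per-metric values and per-row failures."""
    values = {}
    failures = {}
    for row in result["row_scores"]:
        if row.get("error"):
            failures[row["row_index"]] = row["error"]
            continue
        # "metrics" maps a metric name to a list of {"name", "value"} entries.
        for scores in (row.get("metrics") or {}).values():
            for score in scores:
                values.setdefault(score["name"], []).append(score["value"])
    return values, failures


sample_result = {
    "row_scores": [
        {
            "row_index": 0,
            "metrics": {"faithfulness": [{"name": "faithfulness", "value": 0.95}]},
            "error": None,
        },
        {"row_index": 1, "metrics": None, "error": "judge request timed out"},
    ]
}
values, failures = summarize_rows(sample_result)
print(values)
print(failures)
```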

Managing Secrets for Authenticated Endpoints

Store API keys as secrets for secure authentication:

client.secrets.create(name="judge-api-key", value="<your-judge-key>")
client.secrets.create(name="embedding-api-key", value="<your-embedding-key>")

Reference secrets by name in your metric configuration. For local run versus remote submit behavior, see Model API Authentication.

judge_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="meta/llama-3.1-70b-instruct",
    api_key_secret="judge-api-key",
)

Job Management

For durable remote execution, submit the same metric and dataset that you tested locally:

job = evaluator.submit(metric=metric, dataset=dataset)
job.wait_until_done()
artifacts_dir = job.download_artifacts(path="evaluation_artifacts")
print(f"Saved artifacts under {artifacts_dir}")

Troubleshooting

Common Errors

| Error | Cause | Solution |
| --- | --- | --- |
| judge_model is required | Missing judge LLM config for metric | Add judge_model to metric configuration |
| embeddings_model is required | Using response_relevancy without embeddings | Add embeddings_model to metric configuration |
| Job stuck in "pending" | Model endpoint not accessible | Verify endpoint URLs and API key secrets. See Model API Authentication |
| Authentication failed | Invalid or missing API key | Check api_key_secret for the execution mode. See Model API Authentication |
| nan_count > 0 and mean = null | Judge/model call failures, such as auth, endpoint, quota, or timeout. Some RAGAS metrics are known to return NaN instead of raising on these failures. | Inspect row-level error; verify API key, endpoint, and model access |
| Low faithfulness scores | Context doesn't support the response | Improve retrieval or response generation |

If you see nan_count > 0 with mean = null, first validate judge model authentication.

For some RAGAS metrics, auth failures can be converted to NaN scores instead of surfacing as a hard error.
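
Because such failures can be silent, a quick check on the aggregate scores helps before trusting a run. The sketch below assumes the aggregate scores are available as a dictionary shaped like the example output on this page; `scores_needing_attention` is a hypothetical helper.

```python
def scores_needing_attention(aggregate_scores):
    """Return names of scores with NaN rows or a missing mean."""
    flagged = []
    for score in aggregate_scores["scores"]:
        if score.get("nan_count", 0) > 0 or score.get("mean") is None:
            flagged.append(score["name"])
    return flagged


aggregate = {
    "scores": [
        {"name": "faithfulness", "count": 1, "nan_count": 0, "mean": 0.95},
        {"name": "context_recall", "count": 1, "nan_count": 1, "mean": None},
    ]
}
print(scores_needing_attention(aggregate))  # ['context_recall']
```

Any name returned here is a prompt to inspect the row-level errors and recheck judge model authentication.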

Tips for Better Results

  • Use larger judge models (70B+) for more consistent scoring.
  • Start with inline datasets to test your configuration before large evaluations.
  • Set appropriate timeouts - judge LLM calls can take time with large contexts.
  • Use parallelism wisely - increase parallelism for faster evaluation, but respect rate limits.
  • Column names matter - RAGAS metrics use user_input, retrieved_contexts, response, and reference.

Important Notes

  1. Secret Management: Reference API keys through api_key_secret. Its behavior differs between local run and remote submit; see Model API Authentication. Never pass API keys directly in the request.
  2. Column Names: RAGAS metrics use specific column names:
       • user_input (not question)
       • response (not answer)
       • retrieved_contexts (not contexts)
       • reference (not ground_truth)
  3. Embeddings Model: Only response_relevancy requires an embeddings model. All other metrics use only the judge LLM.
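
If an existing dataset uses the older names listed above, the rows can be renamed before evaluation. The sketch below is one way to do that; `COLUMN_ALIASES` and `to_ragas_columns` are hypothetical helpers, and the mapping covers only the four names mentioned in this section.

```python
# Older column names mapped to the names RAGAS metrics expect.
COLUMN_ALIASES = {
    "question": "user_input",
    "answer": "response",
    "contexts": "retrieved_contexts",
    "ground_truth": "reference",
}


def to_ragas_columns(row):
    """Return a copy of the row with known aliases renamed; other keys are kept."""
    return {COLUMN_ALIASES.get(key, key): value for key, value in row.items()}


legacy_row = {
    "question": "What is the capital of France?",
    "contexts": ["Paris is the capital and largest city of France."],
    "answer": "The capital of France is Paris.",
    "ground_truth": "Paris is the capital of France.",
}
print(sorted(to_ragas_columns(legacy_row)))
```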

Limitations

  1. Judge Model Quality: Evaluation quality depends on the judge model's ability to follow instructions. Larger models (70B+) typically produce more consistent results.

  2. Dataset Format: RAGAS metrics use specific column names (user_input, retrieved_contexts, response, reference). Ensure your data matches this structure.
