RAG Evaluation Metrics¶

RAG (Retrieval Augmented Generation) metrics evaluate the quality of RAG pipelines by measuring both retrieval and answer generation performance. These metrics use RAGAS to assess how well retrieved contexts support generated answers.

Overview¶

RAGAS metrics support both offline and online evaluation modes:

Offline evaluation: Uses pre-generated responses from your dataset.
Online evaluation: Generates responses automatically with a model and prompt template before evaluation.
  The job's model and prompt_template are used to generate responses.
  The generated response (in sample["output_text"]) is automatically used as response in the RAGAS evaluation.
  RAG context variables can be included in the job's prompt_template:
    {{user_input}} - User question/input from dataset
    {{retrieved_contexts}} - Retrieved context passages from dataset

RAGAS metrics require:

Judge LLM: An LLM to evaluate answer quality (required for most metrics)
Judge Embeddings (optional): Required only for response_relevancy
Data: Questions, contexts, and either pre-generated responses (offline) or a model to generate them (online)

Prerequisites¶

Before running RAG evaluations:

Workspace: Have a workspace created. All resources (metrics, secrets, jobs) are scoped to a workspace.
Model Endpoints: Access to judge LLM endpoints (and an embeddings model endpoint for response_relevancy)
API Keys (if required): Create secrets for any endpoints requiring authentication
Initialize the SDK:

import os

from nemo_evaluator_sdk import RunConfigOnlineModel, RunConfig, InferenceParams, Model
from nemo_evaluator_sdk.metrics.ragas import (
    ContextEntityRecallMetric,
    ContextPrecisionMetric,
    ContextRecallMetric,
    ContextRelevanceMetric,
    FaithfulnessMetric,
    NoiseSensitivityMetric,
    ResponseGroundednessMetric,
    ResponseRelevancyMetric,
)
from nemo_evaluator.sdk import Evaluator
from nemo_platform import NeMoPlatform

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: Evaluator = client.evaluator  # this object is an Evaluator resource

Use evaluator.run(metric=metric, dataset=dataset) for a local synchronous evaluation. Use evaluator.submit(metric=metric, dataset=dataset) when you need a durable remote job:

job = evaluator.submit(metric=metric, dataset=dataset)
job.wait_until_done()
result = job.get_result()

Creating a Secret for API Keys¶

If using external endpoints that require authentication, such as NVIDIA Build endpoints, create a secret first:

client.secrets.create(
    name="nvidia-api-key",
    value="nvapi-YOUR_API_KEY_HERE",
    description="NVIDIA Build API key for RAG metrics",
)

Reference secrets by name in your model configuration. For local run versus remote submit behavior, see Model API Authentication.

judge_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="meta/llama-3.1-70b-instruct",
    api_key_secret="nvidia-api-key",
)

RAGAS metrics accept inline model definitions for judge_model and, where required, embeddings_model.

See Model Configuration for details.

Supported RAGAS Metrics¶

Use Case	Metric Type	Description	Required Columns*
Measure retrieval quality	`context_recall`	Coverage of reference information in retrieved context	user_input, retrieved_contexts, reference
	`context_precision`	Whether all retrieved chunks are relevant to the question	user_input, retrieved_contexts, reference
	`context_relevance`	Relevance of retrieved context to the question	user_input, retrieved_contexts
	`context_entity_recall`	Recall of important entities from reference in context	retrieved_contexts, reference
Detect hallucinations	`faithfulness`	Measures factual consistency of response with retrieved context	user_input, response, retrieved_contexts
	`response_groundedness`	Evaluates whether response is grounded in context without hallucinations	response, retrieved_contexts
	`noise_sensitivity`	Robustness to noisy or irrelevant context	user_input, response, reference, retrieved_contexts
Check if answers address the question	`response_relevancy`**	Response relevancy to question using embeddings similarity	user_input, response, retrieved_contexts

* Required Columns: Dataset columns that must be present for the metric to be evaluated.

** Requires embeddings_model in addition to judge_model.
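
Before starting a long run, it can help to check that every row carries the columns the chosen metric needs. The sketch below copies the required columns from the table above into a dictionary. The helper `missing_columns` and its `response_is_generated` flag are hypothetical names introduced here for illustration and are not part of the SDK.

```python
# Required columns per metric type, copied from the table above.
REQUIRED_COLUMNS = {
    "context_recall": {"user_input", "retrieved_contexts", "reference"},
    "context_precision": {"user_input", "retrieved_contexts", "reference"},
    "context_relevance": {"user_input", "retrieved_contexts"},
    "context_entity_recall": {"retrieved_contexts", "reference"},
    "faithfulness": {"user_input", "response", "retrieved_contexts"},
    "response_groundedness": {"response", "retrieved_contexts"},
    "noise_sensitivity": {"user_input", "response", "reference", "retrieved_contexts"},
    "response_relevancy": {"user_input", "response", "retrieved_contexts"},
}


def missing_columns(metric_type, rows, response_is_generated=False):
    """Hypothetical helper: map row index to the sorted list of missing columns."""
    required = set(REQUIRED_COLUMNS[metric_type])
    if response_is_generated:
        # In online evaluation the response is generated, so the dataset need not supply it.
        required.discard("response")
    problems = {}
    for index, row in enumerate(rows):
        missing = sorted(required - set(row))
        if missing:
            problems[index] = missing
    return problems


rows = [{"user_input": "What is the capital of France?",
         "retrieved_contexts": ["Paris is the capital and largest city of France."]}]
print(missing_columns("faithfulness", rows))
print(missing_columns("faithfulness", rows, response_is_generated=True))
```

An empty dictionary means every row has the columns that metric needs.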

Shared Example Setup¶

For local run versus remote submit behavior of api_key_secret, see Model API Authentication.

The metric examples below use these inline values:

judge_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="meta/llama-3.1-70b-instruct",
    api_key_secret="nvidia-api-key",
)

embeddings_model = Model(
    url="https://integrate.api.nvidia.com/v1/embeddings",
    name="nvidia/nv-embedqa-e5-v5",
    api_key_secret="nvidia-api-key",
)

generation_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="nvidia/llama-3.3-nemotron-super-49b-v1",
    api_key_secret="nvidia-api-key",
)

Use offline rows when your RAG pipeline has already produced responses:

offline_rows = [
    {
        "user_input": "What is the capital of France?",
        "retrieved_contexts": ["Paris is the capital and largest city of France."],
        "response": "The capital of France is Paris.",
        "reference": "Paris is the capital of France.",
    }
]

Use online arguments when the evaluator should generate the response first:

online_dataset = [
    {
        "user_input": "What is the capital of France?",
        "retrieved_contexts": ["Paris is the capital and largest city of France."],
        "reference": "Paris is the capital of France.",
    }
]
online_prompt_template = {
    "messages": [
        {
            "role": "user",
            "content": "Context:\n{{item.retrieved_contexts | join('\n\n')}}\n\nQuestion: {{item.user_input}}\n\nAnswer:",
        }
    ]
}
online_config = RunConfigOnlineModel(
    parallelism=8,
    inference=InferenceParams(temperature=0.2, max_tokens=1024),
)

Context Recall¶

Measures the fraction of relevant content retrieved compared to the total relevant content in the reference.

Score name: context_recall
Score range: 0 to 1, with higher scores indicating better recall.
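
To make the score concrete, here is a minimal sketch that assumes the usual RAGAS formulation: the judge LLM splits the reference into claims and marks each one as supported or not supported by the retrieved contexts, and the score is the supported fraction. The function name and the hand-written judgments are hypothetical. In a real run the judge model produces the judgments, and the SDK's internals may differ.

```python
def context_recall_from_judgments(claim_supported):
    """claim_supported: one bool per reference claim (True = found in the retrieved contexts)."""
    if not claim_supported:
        # No claims to check, so the score is undefined.
        return float("nan")
    return sum(claim_supported) / len(claim_supported)


# Hypothetical judgments: 3 of 4 reference claims are covered by the retrieved contexts.
print(context_recall_from_judgments([True, True, False, True]))
```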

Data Format¶

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "reference": "Paris is the capital of France."
}

Local Evaluation

metric = ContextRecallMetric(judge_model=judge_model)

result = evaluator.run(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Remote Job

metric = ContextRecallMetric(judge_model=judge_model)

job = evaluator.submit(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))
job.wait_until_done()
result = job.get_result()

Result

{
  "scores": [
    {
      "count": 1,
      "histogram": {},
      "name": "context_recall",
      "nan_count": 0,
      "max": 1.0,
      "mean": 1.0,
      "min": 1.0,
      "percentiles": {},
      "score_type": "range",
      "std_dev": 0.0,
      "sum": 1.0,
      "variance": 0.0
    }
  ]
}

Context Precision¶

Measures the proportion of relevant chunks in the retrieved contexts (precision@k).

Score name: context_precision
Score range: 0 to 1, with higher scores indicating better precision.
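
The "precision@k" wording can be illustrated with a short sketch. It assumes the rank-weighted RAGAS definition, in which precision@k is averaged over the ranks that hold a relevant chunk, so relevant chunks placed earlier score higher. The function name and the relevance flags are hypothetical. In practice the judge LLM decides which chunks are relevant.

```python
def context_precision_from_judgments(relevance):
    """relevance: one bool per retrieved chunk, in rank order (True = relevant)."""
    relevant_seen = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            # precision@rank = relevant chunks among the top `rank` results
            weighted_sum += relevant_seen / rank
    if relevant_seen == 0:
        return 0.0
    return weighted_sum / relevant_seen


# The same two relevant chunks score higher when they are ranked first.
print(context_precision_from_judgments([True, True, False]))
print(context_precision_from_judgments([False, True, True]))
```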

Data Format¶

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "reference": "Paris"
}

Local Evaluation

metric = ContextPrecisionMetric(judge_model=judge_model)

result = evaluator.run(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Remote Job

metric = ContextPrecisionMetric(judge_model=judge_model)

job = evaluator.submit(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))
job.wait_until_done()
result = job.get_result()

Result

{
  "scores": [
    {
      "count": 1,
      "histogram": {},
      "name": "context_precision",
      "nan_count": 0,
      "max": 1.0,
      "mean": 1.0,
      "min": 1.0
    }
  ]
}

Context Relevance¶

Measures how relevant the retrieved contexts are to the user input.

Score name: context_relevance
Score range: 0 to 1, with higher scores indicating better relevance.

Data Format¶

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ]
}

Local Evaluation

metric = ContextRelevanceMetric(judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "user_input": "What is the capital of France?",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
        }
    ],
    config=RunConfig(parallelism=8),
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Remote Job

metric = ContextRelevanceMetric(judge_model=judge_model)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {
            "user_input": "What is the capital of France?",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
        }
    ],
    config=RunConfig(parallelism=8),
)
job.wait_until_done()
result = job.get_result()

Context Entity Recall¶

Measures how many important entities from the reference are present in the retrieved contexts.

Score name: context_entity_recall
Score range: 0 to 1, with higher scores indicating better entity recall.
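
As a sketch of the underlying arithmetic, assume the entities have already been extracted from the reference and from the retrieved contexts (the judge LLM does this in a real run). The score is then the share of reference entities that also appear in the contexts. The function name and the entity sets below are hypothetical, and exact string matching is a simplifying assumption.

```python
def context_entity_recall_from_entities(reference_entities, context_entities):
    """Fraction of reference entities that also appear among the context entities."""
    reference = set(reference_entities)
    if not reference:
        # No reference entities, so the score is undefined.
        return float("nan")
    return len(reference & set(context_entities)) / len(reference)


# Hypothetical extracted entities for the example row.
print(context_entity_recall_from_entities({"Paris", "France"}, {"Paris", "France"}))
print(context_entity_recall_from_entities({"Paris", "France"}, {"Paris"}))
```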

Data Format¶

{
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "reference": "Paris is the capital of France."
}

Local Evaluation

metric = ContextEntityRecallMetric(judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
            "reference": "Paris is the capital of France.",
        }
    ],
    config=RunConfig(parallelism=8),
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Remote Job

metric = ContextEntityRecallMetric(judge_model=judge_model)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
            "reference": "Paris is the capital of France.",
        }
    ],
    config=RunConfig(parallelism=8),
)
job.wait_until_done()
result = job.get_result()

Faithfulness¶

Measures factual consistency of the response with the retrieved context.

Score name: faithfulness
Score range: 0 to 1, with higher scores indicating the response is more faithful to the context.

Data Format¶

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "response": "The capital of France is Paris."
}

Local Evaluation

metric = FaithfulnessMetric(judge_model=judge_model)

result = evaluator.run(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Online Evaluation

metric = FaithfulnessMetric(judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=online_dataset,
    config=online_config,
    target=generation_model,
    prompt_template=online_prompt_template,
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Remote Job

metric = FaithfulnessMetric(judge_model=judge_model)

job = evaluator.submit(
    metric=metric,
    dataset=online_dataset,
    config=online_config,
    target=generation_model,
    prompt_template=online_prompt_template,
)
job.wait_until_done()
result = job.get_result()

Response Groundedness¶

Evaluates whether the response is grounded in the retrieved context without hallucinations.

Score name: response_groundedness
Score range: 0 to 1, with higher scores indicating stronger grounding.

Data Format¶

{
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "response": "The capital of France is Paris."
}

Local Evaluation

metric = ResponseGroundednessMetric(judge_model=judge_model)

result = evaluator.run(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Online Evaluation

metric = ResponseGroundednessMetric(judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=online_dataset,
    config=online_config,
    target=generation_model,
    prompt_template=online_prompt_template,
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Remote Job

metric = ResponseGroundednessMetric(judge_model=judge_model)

job = evaluator.submit(
    metric=metric,
    dataset=online_dataset,
    config=online_config,
    target=generation_model,
    prompt_template=online_prompt_template,
)
job.wait_until_done()
result = job.get_result()

Noise Sensitivity¶

Measures robustness when retrieved contexts contain noisy or irrelevant information.

Score name: noise_sensitivity
Score range: 0 to 1. Lower scores usually indicate the response is less sensitive to noise.

Data Format¶

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France.",
    "Berlin is the capital of Germany."
  ],
  "response": "The capital of France is Paris.",
  "reference": "Paris is the capital of France."
}

Local Evaluation

metric = NoiseSensitivityMetric(judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "user_input": "What is the capital of France?",
            "retrieved_contexts": [
                "Paris is the capital and largest city of France.",
                "Berlin is the capital of Germany.",
            ],
            "response": "The capital of France is Paris.",
            "reference": "Paris is the capital of France.",
        }
    ],
    config=RunConfig(parallelism=8),
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Remote Job

metric = NoiseSensitivityMetric(judge_model=judge_model)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {
            "user_input": "What is the capital of France?",
            "retrieved_contexts": [
                "Paris is the capital and largest city of France.",
                "Berlin is the capital of Germany.",
            ],
            "response": "The capital of France is Paris.",
            "reference": "Paris is the capital of France.",
        }
    ],
    config=RunConfig(parallelism=8),
)
job.wait_until_done()
result = job.get_result()

Response Relevancy¶

Measures how relevant a response is to the user input using generated questions and embedding similarity.

Score name: response_relevancy
Score range: 0 to 1, with higher scores indicating better relevancy.
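
The "generated questions and embedding similarity" step can be sketched as follows, assuming the usual RAGAS formulation: the judge LLM writes one or more questions that the response would answer, each question is embedded, and the score is the mean cosine similarity to the embedding of the original user input. The function names and the toy two-dimensional vectors are hypothetical. Real embeddings come from the embeddings model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def response_relevancy_from_embeddings(user_input_embedding, generated_question_embeddings):
    """Mean similarity between the user input and each question generated from the response."""
    similarities = [
        cosine_similarity(user_input_embedding, question_embedding)
        for question_embedding in generated_question_embeddings
    ]
    return sum(similarities) / len(similarities)


# Toy vectors: one generated question that points the same way as the user input.
print(response_relevancy_from_embeddings([1.0, 0.0], [[1.0, 0.0]]))
```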

Data Format¶

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "response": "The capital of France is Paris."
}

Configuration Options¶

Parameter	Type	Default	Description
`strictness`	int	`1`	Number of questions generated in parallel for each response. NIM endpoints support only `1`.

Local Evaluation

metric = ResponseRelevancyMetric(
    judge_model=judge_model,
    embeddings_model=embeddings_model,
    strictness=1,
)

result = evaluator.run(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Online Evaluation

metric = ResponseRelevancyMetric(
    judge_model=judge_model,
    embeddings_model=embeddings_model,
    strictness=1,
)

result = evaluator.run(
    metric=metric,
    dataset=online_dataset,
    config=online_config,
    target=generation_model,
    prompt_template=online_prompt_template,
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Remote Job

metric = ResponseRelevancyMetric(
    judge_model=judge_model,
    embeddings_model=embeddings_model,
    strictness=1,
)

job = evaluator.submit(
    metric=metric,
    dataset=online_dataset,
    config=online_config,
    target=generation_model,
    prompt_template=online_prompt_template,
)
job.wait_until_done()
result = job.get_result()

Dataset Format¶

RAGAS metrics use specific column names:

Field	Type	Required	Description
`user_input`	string	Most metrics	User question or input. Not required by `context_entity_recall` or `response_groundedness`.
`retrieved_contexts`	list[string]	Yes	List of context passages
`response`	string	Some metrics	Generated answer. Required for offline response-quality metrics; generated as `sample.output_text` for online model requests.
`reference`	string	Some metrics	Reference answer or ground truth

Note

Different metrics require different columns. See the Required Columns in the Supported RAGAS Metrics table for each metric's requirements.

Example Dataset¶

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "response": "The capital of France is Paris.",
  "reference": "Paris"
}

Response Format¶

All evaluation responses follow this structure:

{
  "metric": {
    "type": "faithfulness",
    "judge_model": {
      "url": "...",
      "name": "..."
    }
  },
  "aggregate_scores": {
    "scores": [
      {
        "name": "faithfulness",
        "count": 1,
        "mean": 0.95,
        "min": 0.95,
        "max": 0.95,
        "sum": 0.95
      }
    ]
  },
  "row_scores": [
    {
      "row_index": 0,
      "metrics": {
        "faithfulness": [
          {"name": "faithfulness", "value": 0.95}
        ]
      },
      "error": null
    }
  ]
}
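
If you have a result in the dictionary form shown above (for example, loaded from a saved JSON file), per-row values can be collected with plain dictionary access. This is a minimal sketch: `row_values` is a hypothetical helper, and the sample payload only mirrors the structure above.

```python
def row_values(result, score_name):
    """Map row_index to the value of `score_name`, skipping rows that reported an error."""
    values = {}
    for row in result["row_scores"]:
        if row.get("error"):
            continue
        for scores in row.get("metrics", {}).values():
            for score in scores:
                if score["name"] == score_name:
                    values[row["row_index"]] = score["value"]
    return values


# Sample payload mirroring the structure above, plus one hypothetical failed row.
sample_result = {
    "row_scores": [
        {
            "row_index": 0,
            "metrics": {"faithfulness": [{"name": "faithfulness", "value": 0.95}]},
            "error": None,
        },
        {"row_index": 1, "metrics": {}, "error": "judge request failed"},
    ]
}
print(row_values(sample_result, "faithfulness"))
```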

Working with Results¶

# Access aggregate scores
for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}, count={score.count}")

# Access per-row scores
for row in result.row_scores:
    if row.metrics:
        print(f"Row {row.row_index}: {row.metrics}")
    elif row.error:
        print(f"Row {row.row_index} failed: {row.error}")

Managing Secrets for Authenticated Endpoints¶

Store API keys as secrets for secure authentication:

client.secrets.create(name="judge-api-key", value="<your-judge-key>")
client.secrets.create(name="embedding-api-key", value="<your-embedding-key>")

Reference secrets by name in your metric configuration. For local run versus remote submit behavior, see Model API Authentication.

judge_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="meta/llama-3.1-70b-instruct",
    api_key_secret="judge-api-key",
)

Job Management¶

For durable remote execution, submit the same metric and dataset that you tested locally:

job = evaluator.submit(metric=metric, dataset=dataset)
job.wait_until_done()
artifacts_dir = job.download_artifacts(path="evaluation_artifacts")
print(f"Saved artifacts under {artifacts_dir}")

Troubleshooting¶

Common Errors¶

Error	Cause	Solution
`judge_model` is required	Missing judge LLM config for metric	Add `judge_model` to metric configuration
`embeddings_model` is required	Using `response_relevancy` without embeddings	Add `embeddings_model` to metric configuration
Job stuck in "pending"	Model endpoint not accessible	Verify endpoint URLs and API key secrets. See Model API Authentication
Authentication failed	Invalid or missing API key	Check `api_key_secret` for the execution mode. See Model API Authentication
`nan_count > 0` and `mean = null`	Judge/model call failures, such as auth, endpoint, quota, or timeout. Some RAGAS metrics are known to return `NaN` instead of raising on these failures.	Inspect row-level `error`; verify API key, endpoint, and model access
Low faithfulness scores	Context doesn't support the response	Improve retrieval or response generation

If you see nan_count > 0 with mean = null, first validate judge model authentication.

For some RAGAS metrics, auth failures can be converted to NaN scores instead of surfacing as a hard error.
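
A quick way to catch this condition is to scan the aggregate scores for NaN rows or a missing mean before trusting the numbers. The sketch below assumes the aggregate scores are available as a dictionary shaped like the Result examples on this page. `scores_needing_attention` is a hypothetical helper, not an SDK function.

```python
def scores_needing_attention(aggregate_scores):
    """Return the names of aggregate scores that have NaN rows or no mean."""
    flagged = []
    for score in aggregate_scores["scores"]:
        if score.get("nan_count", 0) > 0 or score.get("mean") is None:
            flagged.append(score["name"])
    return flagged


# Hypothetical aggregates: one healthy score and one where the judge call failed.
aggregates = {
    "scores": [
        {"name": "context_recall", "count": 1, "nan_count": 0, "mean": 1.0},
        {"name": "faithfulness", "count": 1, "nan_count": 1, "mean": None},
    ]
}
print(scores_needing_attention(aggregates))
```

Any name in the returned list is a cue to inspect row-level errors and recheck judge model authentication.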

Tips for Better Results¶

Use larger judge models (70B+) for more consistent scoring.
Start with inline datasets to test your configuration before large evaluations.
Set appropriate timeouts - judge LLM calls can take time with large contexts.
Use parallelism wisely - increase parallelism for faster evaluation, but respect rate limits.
Column names matter - RAGAS metrics use user_input, retrieved_contexts, response, and reference.

Important Notes¶

Secret Management: API keys should be referenced through api_key_secret, with different local run and remote submit behavior. See Model API Authentication. Never pass API keys directly in the request.
Column Names: RAGAS metrics use specific column names:
  user_input (not question)
  response (not answer)
  retrieved_contexts (not contexts)
  reference (not ground_truth)
Embeddings Model: Only response_relevancy requires an embeddings model. All other metrics use only the judge LLM.
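
If an existing dataset uses the older names listed above, the rows can be renamed before evaluation. This is a minimal sketch: the mapping comes from the column-name list above, and `to_ragas_columns` is a hypothetical helper name.

```python
# Older column name -> name expected by the RAGAS metrics (from the list above).
LEGACY_TO_RAGAS = {
    "question": "user_input",
    "answer": "response",
    "contexts": "retrieved_contexts",
    "ground_truth": "reference",
}


def to_ragas_columns(row):
    """Return a copy of the row with legacy keys renamed; other keys are kept as-is."""
    return {LEGACY_TO_RAGAS.get(key, key): value for key, value in row.items()}


legacy_row = {
    "question": "What is the capital of France?",
    "contexts": ["Paris is the capital and largest city of France."],
    "answer": "The capital of France is Paris.",
    "ground_truth": "Paris is the capital of France.",
}
print(sorted(to_ragas_columns(legacy_row)))
```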

Limitations¶

Judge Model Quality: Evaluation quality depends on the judge model's ability to follow instructions. Larger models (70B+) typically produce more consistent results.
Dataset Format: RAGAS metrics use specific column names (user_input, retrieved_contexts, response, reference). Ensure your data matches this structure.

Info

LLM-as-a-Judge - Custom judge-based evaluation
Agentic Metrics - Evaluate agent workflows