RAG Evaluation Metrics¶
RAG (Retrieval Augmented Generation) metrics evaluate the quality of RAG pipelines by measuring both retrieval and answer generation performance. These metrics use RAGAS to assess how well retrieved contexts support generated answers.
Overview¶
RAGAS metrics support both offline and online evaluation modes:
- Offline evaluation: Uses pre-generated responses from your dataset
- Online evaluation: Responses are generated automatically using a model and prompt template before evaluation
  - The job's model and prompt_template are used to generate responses
  - The generated response (in sample["output_text"]) is automatically used as response in RAGAS evaluation
  - RAG context variables can be included in the job's prompt_template:
    - {{user_input}} - User question/input from dataset
    - {{retrieved_contexts}} - Retrieved context passages from dataset
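To make the variable substitution concrete, here is a minimal sketch of how a template containing the `{{user_input}}` and `{{retrieved_contexts}}` placeholders could be filled from one dataset row. The `render_prompt` helper and its plain string replacement are illustrative assumptions, not part of the SDK; the platform's own template engine may behave differently, for example in how it joins list values.

```python
def render_prompt(template: str, row: dict) -> str:
    """Fill {{user_input}} and {{retrieved_contexts}} from one dataset row.

    Hypothetical helper for illustration only; not part of the SDK.
    """
    # Join the context passages with blank lines (an assumed convention).
    contexts = "\n\n".join(row["retrieved_contexts"])
    return (
        template
        .replace("{{user_input}}", row["user_input"])
        .replace("{{retrieved_contexts}}", contexts)
    )


template = "Context:\n{{retrieved_contexts}}\n\nQuestion: {{user_input}}\n\nAnswer:"
row = {
    "user_input": "What is the capital of France?",
    "retrieved_contexts": ["Paris is the capital and largest city of France."],
}
prompt = render_prompt(template, row)
print(prompt)
```

Rendering a few rows this way before launching a job is a cheap check that every placeholder in the template has a matching dataset column.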
RAGAS metrics require:
- Judge LLM: An LLM to evaluate answer quality (required for most metrics)
- Judge Embeddings (optional): Required for some metrics like response_relevancy
- Data: Questions, contexts, and either pre-generated responses (offline) or a model to generate them (online)
Prerequisites¶
Before running RAG evaluations:
- Workspace: Have a workspace created. All resources (metrics, secrets, jobs) are scoped to a workspace.
- Model Endpoints: Access to judge LLM endpoints (and embeddings model for some metrics)
- API Keys (if required): Create secrets for any endpoints requiring authentication
- Initialize the SDK:
import os
from nemo_evaluator_sdk import RunConfigOnlineModel, RunConfig, InferenceParams, Model
from nemo_evaluator_sdk.metrics.ragas import (
ContextEntityRecallMetric,
ContextPrecisionMetric,
ContextRecallMetric,
ContextRelevanceMetric,
FaithfulnessMetric,
NoiseSensitivityMetric,
ResponseGroundednessMetric,
ResponseRelevancyMetric,
)
from nemo_evaluator.sdk import Evaluator
from nemo_platform import NeMoPlatform
client = NeMoPlatform(
base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
workspace="default",
)
evaluator: Evaluator = client.evaluator # this object is an Evaluator resource
Use evaluator.run(metric=metric, dataset=dataset) for a local synchronous evaluation. Use evaluator.submit(metric=metric, dataset=dataset) when you need a durable remote job:
job = evaluator.submit(metric=metric, dataset=dataset)
job.wait_until_done()
result = job.get_result()
Creating a Secret for API Keys¶
If using external endpoints that require authentication, such as NVIDIA Build endpoints, create a secret first:
client.secrets.create(
name="nvidia-api-key",
value="nvapi-YOUR_API_KEY_HERE",
description="NVIDIA Build API key for RAG metrics",
)
Reference secrets by name in your model configuration. For local run versus remote submit behavior, see Model API Authentication.
judge_model = Model(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
RAGAS metrics accept inline model definitions for judge_model and, where required, embeddings_model.
See Model Configuration for details.
Supported RAGAS Metrics¶
| Use Case | Metric Type | Description | Required Columns* |
|---|---|---|---|
| Measure retrieval quality | context_recall | Coverage of reference information in retrieved context | user_input, retrieved_contexts, reference |
| Measure retrieval quality | context_precision | Whether all retrieved chunks are relevant to the question | user_input, retrieved_contexts, reference |
| Measure retrieval quality | context_relevance | Relevance of retrieved context to the question | user_input, retrieved_contexts |
| Measure retrieval quality | context_entity_recall | Recall of important entities from reference in context | retrieved_contexts, reference |
| Detect hallucinations | faithfulness | Measures factual consistency of response with retrieved context | user_input, response, retrieved_contexts |
| Detect hallucinations | response_groundedness | Evaluates whether response is grounded in context without hallucinations | response, retrieved_contexts |
| Detect hallucinations | noise_sensitivity | Robustness to noisy or irrelevant context | user_input, response, reference, retrieved_contexts |
| Check if answers address the question | response_relevancy** | Response relevancy to question using embeddings similarity | user_input, response, retrieved_contexts |
* Required Columns: Dataset columns that must be present for the metric to be evaluated.
** Requires embeddings_model in addition to judge_model.
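The required columns in the table can be checked locally before any judge calls are made. The sketch below copies the table into a dictionary and reports which columns a row is missing. The `REQUIRED_COLUMNS` mapping and the `missing_columns` helper are hypothetical names for illustration; the `online` flag assumes that `response` is generated during the run and therefore does not need to be in the dataset.

```python
# Required columns per metric type, copied from the table above.
REQUIRED_COLUMNS = {
    "context_recall": {"user_input", "retrieved_contexts", "reference"},
    "context_precision": {"user_input", "retrieved_contexts", "reference"},
    "context_relevance": {"user_input", "retrieved_contexts"},
    "context_entity_recall": {"retrieved_contexts", "reference"},
    "faithfulness": {"user_input", "response", "retrieved_contexts"},
    "response_groundedness": {"response", "retrieved_contexts"},
    "noise_sensitivity": {"user_input", "response", "reference", "retrieved_contexts"},
    "response_relevancy": {"user_input", "response", "retrieved_contexts"},
}


def missing_columns(metric_type: str, row: dict, online: bool = False) -> list[str]:
    """Return the sorted list of required columns absent from a dataset row.

    Hypothetical helper; not part of the SDK.
    """
    required = set(REQUIRED_COLUMNS[metric_type])
    if online:
        # Assumption: in online mode the response is generated, not supplied.
        required.discard("response")
    return sorted(required - set(row))


row = {
    "user_input": "What is the capital of France?",
    "retrieved_contexts": ["Paris is the capital and largest city of France."],
}
print(missing_columns("faithfulness", row))
print(missing_columns("faithfulness", row, online=True))
```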
Shared Example Setup¶
For local run versus remote submit behavior of api_key_secret, see Model API Authentication.
The metric examples below use these inline values:
judge_model = Model(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="nvidia-api-key",
)
embeddings_model = Model(
url="https://integrate.api.nvidia.com/v1/embeddings",
name="nvidia/nv-embedqa-e5-v5",
api_key_secret="nvidia-api-key",
)
generation_model = Model(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="nvidia/llama-3.3-nemotron-super-49b-v1",
api_key_secret="nvidia-api-key",
)
Use offline rows when your RAG pipeline has already produced responses:
offline_rows = [
    {
        "user_input": "What is the capital of France?",
        "retrieved_contexts": ["Paris is the capital and largest city of France."],
        "response": "The capital of France is Paris.",
        "reference": "Paris is the capital of France.",
    }
]
Use online arguments when the evaluator should generate the response first:
online_dataset = [
    {
        "user_input": "What is the capital of France?",
        "retrieved_contexts": ["Paris is the capital and largest city of France."],
        "reference": "Paris is the capital of France.",
    }
]
online_prompt_template = {
    "messages": [
        {
            "role": "user",
            "content": "Context:\n{{item.retrieved_contexts | join('\n\n')}}\n\nQuestion: {{item.user_input}}\n\nAnswer:",
        }
    ]
}
online_config = RunConfigOnlineModel(
    parallelism=8,
    inference=InferenceParams(temperature=0.2, max_tokens=1024),
)
Context Recall¶
Measures the fraction of relevant content retrieved compared to the total relevant content in the reference.
- Score name: context_recall
- Score range: 0 to 1, with higher scores indicating better recall.
Data Format¶
{
"user_input": "What is the capital of France?",
"retrieved_contexts": [
"Paris is the capital and largest city of France."
],
"reference": "Paris is the capital of France."
}
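As a rough illustration of the ratio described above, assume the judge has already split the reference into statements and decided, for each one, whether the retrieved context supports it. The verdict list and the `recall_from_verdicts` helper are hypothetical; in a real run the statement extraction and the verdicts come from RAGAS and the judge LLM.

```python
def recall_from_verdicts(supported: list[bool]) -> float:
    """Fraction of reference statements that the retrieved context supports.

    Hypothetical helper: the verdicts are supplied by hand here, whereas in a
    real evaluation they are produced by the judge LLM.
    """
    if not supported:
        raise ValueError("at least one reference statement is required")
    return sum(supported) / len(supported)


# Four reference statements, three of which are found in the retrieved context.
verdicts = [True, True, False, True]
score = recall_from_verdicts(verdicts)
print(score)  # 0.75
```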
Context Precision¶
Measures the proportion of relevant chunks in the retrieved contexts (precision@k).
- Score name: context_precision
- Score range: 0 to 1, with higher scores indicating better precision.
Data Format¶
{
"user_input": "What is the capital of France?",
"retrieved_contexts": [
"Paris is the capital and largest city of France."
],
"reference": "Paris"
}
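The precision@k idea can be sketched with hand-supplied relevance verdicts. This example assumes the rank-weighted form in which precision@k is accumulated only at ranks holding a relevant chunk and then divided by the number of relevant chunks; treat that formula, and the `context_precision` helper name, as assumptions for illustration. In a real evaluation the per-chunk verdicts come from the judge LLM.

```python
def context_precision(relevant: list[bool]) -> float:
    """Rank-weighted precision over retrieved chunks, in retrieval order.

    Hypothetical helper: relevant[k-1] says whether the chunk at rank k
    was judged relevant.
    """
    hits = 0          # relevant chunks seen so far
    weighted = 0.0    # sum of precision@k at each relevant rank
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / k
    # No relevant chunk at all yields a score of 0.
    return weighted / hits if hits else 0.0


# The same two relevant chunks score higher when they are ranked first.
print(context_precision([True, True, False]))
print(context_precision([False, True, True]))
```

Because the score depends on order, it rewards retrievers that place relevant chunks near the top, not just retrievers that return them somewhere in the list.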
Context Relevance¶
Measures how relevant the retrieved contexts are to the user input.
- Score name: context_relevance
- Score range: 0 to 1, with higher scores indicating better relevance.
Data Format¶
{
"user_input": "What is the capital of France?",
"retrieved_contexts": [
"Paris is the capital and largest city of France."
]
}
# Local synchronous run
metric = ContextRelevanceMetric(judge_model=judge_model)
result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "user_input": "What is the capital of France?",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
        }
    ],
    config=RunConfig(parallelism=8),
)
for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

# Durable remote job
metric = ContextRelevanceMetric(judge_model=judge_model)
job = evaluator.submit(
    metric=metric,
    dataset=[
        {
            "user_input": "What is the capital of France?",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
        }
    ],
    config=RunConfig(parallelism=8),
)
job.wait_until_done()
result = job.get_result()
Context Entity Recall¶
Measures how many important entities from the reference are present in the retrieved contexts.
- Score name: context_entity_recall
- Score range: 0 to 1, with higher scores indicating better entity recall.
Data Format¶
{
"retrieved_contexts": [
"Paris is the capital and largest city of France."
],
"reference": "Paris is the capital of France."
}
# Local synchronous run
metric = ContextEntityRecallMetric(judge_model=judge_model)
result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
            "reference": "Paris is the capital of France.",
        }
    ],
    config=RunConfig(parallelism=8),
)
for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

# Durable remote job
metric = ContextEntityRecallMetric(judge_model=judge_model)
job = evaluator.submit(
    metric=metric,
    dataset=[
        {
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
            "reference": "Paris is the capital of France.",
        }
    ],
    config=RunConfig(parallelism=8),
)
job.wait_until_done()
result = job.get_result()
Faithfulness¶
Measures factual consistency of the response with the retrieved context.
- Score name: faithfulness
- Score range: 0 to 1, with higher scores indicating the response is more faithful to the context.
Data Format¶
{
"user_input": "What is the capital of France?",
"retrieved_contexts": [
"Paris is the capital and largest city of France."
],
"response": "The capital of France is Paris."
}
Response Groundedness¶
Evaluates whether the response is grounded in the retrieved context without hallucinations.
- Score name: response_groundedness
- Score range: 0 to 1, with higher scores indicating stronger grounding.
Data Format¶
{
"retrieved_contexts": [
"Paris is the capital and largest city of France."
],
"response": "The capital of France is Paris."
}
Noise Sensitivity¶
Measures robustness when retrieved contexts contain noisy or irrelevant information.
- Score name: noise_sensitivity
- Score range: 0 to 1. Lower scores usually indicate the response is less sensitive to noise.
Data Format¶
{
"user_input": "What is the capital of France?",
"retrieved_contexts": [
"Paris is the capital and largest city of France.",
"Berlin is the capital of Germany."
],
"response": "The capital of France is Paris.",
"reference": "Paris is the capital of France."
}
# Local synchronous run
metric = NoiseSensitivityMetric(judge_model=judge_model)
result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "user_input": "What is the capital of France?",
            "retrieved_contexts": [
                "Paris is the capital and largest city of France.",
                "Berlin is the capital of Germany.",
            ],
            "response": "The capital of France is Paris.",
            "reference": "Paris is the capital of France.",
        }
    ],
    config=RunConfig(parallelism=8),
)
for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

# Durable remote job
metric = NoiseSensitivityMetric(judge_model=judge_model)
job = evaluator.submit(
    metric=metric,
    dataset=[
        {
            "user_input": "What is the capital of France?",
            "retrieved_contexts": [
                "Paris is the capital and largest city of France.",
                "Berlin is the capital of Germany.",
            ],
            "response": "The capital of France is Paris.",
            "reference": "Paris is the capital of France.",
        }
    ],
    config=RunConfig(parallelism=8),
)
job.wait_until_done()
result = job.get_result()
Response Relevancy¶
Measures how relevant a response is to the user input using generated questions and embedding similarity.
- Score name: response_relevancy
- Score range: 0 to 1, with higher scores indicating better relevancy.
Data Format¶
{
"user_input": "What is the capital of France?",
"retrieved_contexts": [
"Paris is the capital and largest city of France."
],
"response": "The capital of France is Paris."
}
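The embedding-similarity step can be sketched with toy vectors. This example assumes the score is the mean cosine similarity between the embedding of the user input and the embeddings of questions generated from the response; the two-dimensional vectors and the `relevancy` helper are stand-ins for illustration, since real embeddings come from the embeddings model and the questions from the judge LLM.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def relevancy(question_vec: list[float], generated_vecs: list[list[float]]) -> float:
    """Mean similarity between the user input and each generated question.

    Hypothetical helper; the vectors are supplied by hand here.
    """
    return sum(cosine(question_vec, g) for g in generated_vecs) / len(generated_vecs)


question_vec = [1.0, 0.0]
# One generated question matches the input exactly, the other is unrelated.
generated_vecs = [[1.0, 0.0], [0.0, 1.0]]
score = relevancy(question_vec, generated_vecs)
print(score)
```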
Configuration Options¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| strictness | int | 1 | Number of parallel questions generated. NIM supports 1. |
# Local synchronous run (online mode: the response is generated first)
metric = ResponseRelevancyMetric(
    judge_model=judge_model,
    embeddings_model=embeddings_model,
    strictness=1,
)
result = evaluator.run(
    metric=metric,
    dataset=online_dataset,
    config=online_config,
    target=generation_model,
    prompt_template=online_prompt_template,
)
for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

# Durable remote job
metric = ResponseRelevancyMetric(
    judge_model=judge_model,
    embeddings_model=embeddings_model,
    strictness=1,
)
job = evaluator.submit(
    metric=metric,
    dataset=online_dataset,
    config=online_config,
    target=generation_model,
    prompt_template=online_prompt_template,
)
job.wait_until_done()
result = job.get_result()
Dataset Format¶
RAGAS metrics use specific column names:
| Field | Type | Required | Description |
|---|---|---|---|
| user_input | string | Most metrics | User question or input |
| retrieved_contexts | list[string] | Yes | List of context passages |
| response | string | Some metrics | Generated answer. Required for offline response-quality metrics; generated as sample.output_text for online model requests. |
| reference | string | Some metrics | Reference answer or ground truth |
Note
Different metrics require different columns. Check the metric documentation for specific requirements.
Example Dataset¶
{
"user_input": "What is the capital of France?",
"retrieved_contexts": [
"Paris is the capital and largest city of France."
],
"response": "The capital of France is Paris.",
"reference": "Paris"
}
Response Format¶
All evaluation responses follow this structure:
{
  "metric": {
    "type": "faithfulness",
    "judge_model": {
      "url": "...",
      "name": "..."
    }
  },
  "aggregate_scores": {
    "scores": [
      {
        "name": "faithfulness",
        "count": 1,
        "mean": 0.95,
        "min": 0.95,
        "max": 0.95,
        "sum": 0.95
      }
    ]
  },
  "row_scores": [
    {
      "row_index": 0,
      "metrics": {
        "faithfulness": [
          {"name": "faithfulness", "value": 0.95}
        ]
      },
      "error": null
    }
  ]
}
Working with Results¶
# Access aggregate scores
for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}, count={score.count}")

# Access per-row scores
for row in result.row_scores:
    if row.metrics:
        print(f"Row {row.row_index}: {row.metrics}")
    elif row.error:
        print(f"Row {row.row_index} failed: {row.error}")
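If the result is available as a plain dictionary with the structure shown under Response Format (for example after loading a saved JSON file), the nested row scores can be flattened into simple tuples for further analysis. This is a sketch under that assumption; the `flatten_row_scores` helper is hypothetical, and how the dictionary is obtained from the SDK result object is not covered here.

```python
def flatten_row_scores(payload: dict) -> list[tuple[int, str, float]]:
    """Turn nested row_scores into (row_index, score_name, value) tuples.

    Hypothetical helper operating on a dict shaped like the Response Format
    example; rows without metrics (such as failed rows) contribute nothing.
    """
    flat = []
    for row in payload["row_scores"]:
        for scores in (row.get("metrics") or {}).values():
            for score in scores:
                flat.append((row["row_index"], score["name"], score["value"]))
    return flat


payload = {
    "row_scores": [
        {
            "row_index": 0,
            "metrics": {"faithfulness": [{"name": "faithfulness", "value": 0.95}]},
            "error": None,
        },
        {"row_index": 1, "metrics": None, "error": "judge call failed"},
    ]
}
print(flatten_row_scores(payload))
```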
Managing Secrets for Authenticated Endpoints¶
Store API keys as secrets for secure authentication:
client.secrets.create(name="judge-api-key", value="<your-judge-key>")
client.secrets.create(name="embedding-api-key", value="<your-embedding-key>")
Reference secrets by name in your metric configuration. For local run versus remote submit behavior, see Model API Authentication.
judge_model = Model(
url="https://integrate.api.nvidia.com/v1/chat/completions",
name="meta/llama-3.1-70b-instruct",
api_key_secret="judge-api-key",
)
Job Management¶
For durable remote execution, submit the same metric and dataset that you tested locally:
job = evaluator.submit(metric=metric, dataset=dataset)
job.wait_until_done()
artifacts_dir = job.download_artifacts(path="evaluation_artifacts")
print(f"Saved artifacts under {artifacts_dir}")
Troubleshooting¶
Common Errors¶
| Error | Cause | Solution |
|---|---|---|
| judge_model is required | Missing judge LLM config for metric | Add judge_model to metric configuration |
| embeddings_model is required | Using response_relevancy without embeddings | Add embeddings_model to metric configuration |
| Job stuck in "pending" | Model endpoint not accessible | Verify endpoint URLs and API key secrets. See Model API Authentication |
| Authentication failed | Invalid or missing API key | Check api_key_secret for the execution mode. See Model API Authentication |
| nan_count > 0 and mean = null | Judge/model call failures, such as auth, endpoint, quota, or timeout. Some RAGAS metrics are known to return NaN instead of raising on these failures. | Inspect row-level error; verify API key, endpoint, and model access |
| Low faithfulness scores | Context doesn't support the response | Improve retrieval or response generation |
If you see nan_count > 0 with mean = null, first validate judge model authentication.
For some RAGAS metrics, auth failures can be converted to NaN scores instead of surfacing as a hard error.
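When chasing NaN scores, it helps to list the affected rows first. The sketch below scans row scores, again assumed to be plain dictionaries shaped like the Response Format example, and returns the indices of rows that carry an error or a NaN or null value. The `rows_to_inspect` helper is hypothetical and is a local diagnostic only, not the platform's own aggregation logic.

```python
import math


def rows_to_inspect(row_scores: list[dict]) -> list[int]:
    """Return indices of rows with a row-level error or a NaN/null score.

    Hypothetical diagnostic helper; not part of the SDK.
    """
    flagged = []
    for row in row_scores:
        values = [
            score["value"]
            for scores in (row.get("metrics") or {}).values()
            for score in scores
        ]
        has_nan = any(
            v is None or (isinstance(v, float) and math.isnan(v)) for v in values
        )
        if row.get("error") or has_nan:
            flagged.append(row["row_index"])
    return flagged


rows = [
    {"row_index": 0, "metrics": {"faithfulness": [{"name": "faithfulness", "value": 0.95}]}, "error": None},
    {"row_index": 1, "metrics": {"faithfulness": [{"name": "faithfulness", "value": float("nan")}]}, "error": None},
    {"row_index": 2, "metrics": {}, "error": "timeout"},
]
print(rows_to_inspect(rows))
```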
Tips for Better Results¶
- Use larger judge models (70B+) for more consistent scoring.
- Start with inline datasets to test your configuration before large evaluations.
- Set appropriate timeouts - judge LLM calls can take time with large contexts.
- Use parallelism wisely - increase parallelism for faster evaluation, but respect rate limits.
- Column names matter - RAGAS metrics use user_input, retrieved_contexts, response, and reference.
Important Notes¶
- Secret Management: API keys should be referenced through api_key_secret, with different local run and remote submit behavior. See Model API Authentication. Never pass API keys directly in the request.
- Column Names: RAGAS metrics use specific column names:
  - user_input (not question)
  - response (not answer)
  - retrieved_contexts (not contexts)
  - reference (not ground_truth)
- Embeddings Model: Only response_relevancy requires an embeddings model. All other metrics use only the judge LLM.
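Datasets that still use the older names listed above can be converted with a small renaming step before evaluation. The following is a minimal sketch; `COLUMN_ALIASES` and `to_ragas_columns` are hypothetical names introduced for illustration and are not provided by the SDK.

```python
# Mapping from the commonly used names to the column names RAGAS metrics expect.
COLUMN_ALIASES = {
    "question": "user_input",
    "answer": "response",
    "contexts": "retrieved_contexts",
    "ground_truth": "reference",
}


def to_ragas_columns(row: dict) -> dict:
    """Rename known aliases and leave every other column untouched.

    Hypothetical helper; not part of the SDK.
    """
    return {COLUMN_ALIASES.get(key, key): value for key, value in row.items()}


legacy_row = {
    "question": "What is the capital of France?",
    "contexts": ["Paris is the capital and largest city of France."],
    "answer": "The capital of France is Paris.",
    "ground_truth": "Paris is the capital of France.",
}
print(to_ragas_columns(legacy_row))
```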
Limitations¶
- Judge Model Quality: Evaluation quality depends on the judge model's ability to follow instructions. Larger models (70B+) typically produce more consistent results.
- Dataset Format: RAGAS metrics use specific column names (user_input, retrieved_contexts, response, reference). Ensure your data matches this structure.
Info
- LLM-as-a-Judge - Custom judge-based evaluation
- Agentic Metrics - Evaluate agent workflows