About Evaluating¶
Evaluation is powered by NeMo Platform, a cloud-native platform for evaluating large language models (LLMs), RAG pipelines, and AI agents at enterprise scale. The evaluation API provides automated workflows for over 100 industry benchmarks, LLM-as-a-judge scoring, and specialized metrics for RAG and agent systems.
NeMo Platform enables real-time evaluations of your LLM application through APIs, guiding you in refining and optimizing LLMs for enhanced performance and real-world applicability. The NeMo Evaluator APIs can be seamlessly automated within development pipelines, enabling faster iterations without the need for live data. It is cost-effective and suitable for pre-deployment checks and regression testing.
How It Works: Library + Platform¶
Evaluator separates evaluation definition from execution.
Note
The code snippets below are for conceptual demonstration purposes only. For runnable examples see the tutorials and SDK resources.
1. Build RunConfig with the Library¶
Use the nemo_evaluator_sdk package to define your metric, dataset rows, runtime configuration, and optional model or agent target:
from nemo_evaluator_sdk import RunConfig, ExactMatchMetric
# Define metric logic
metric = ExactMatchMetric(
reference="{{item.expected}}",
candidate="{{item.output}}",
)
# Build evaluation input
dataset = [
{"expected": "Paris", "output": "Paris"},
{"expected": "Berlin", "output": "Munich"},
]
config = RunConfig(limit_samples=100, parallelism=8)
The library handles: Metric definitions, dataset row schemas, prompt templates, model and agent targets, runtime parameters, retries, aggregation, and typed result objects.
2. Execute on the Platform¶
Submit your evaluation to the Evaluator service using the NeMo Platform SDK:
from nemo_evaluator.sdk import Evaluator
from nemo_platform import NeMoPlatform
sdk = NeMoPlatform(base_url="...", workspace="default")
evaluator: Evaluator = sdk.evaluator
# Fast local iteration through the plugin runtime
local_result = evaluator.run(metric=metric, dataset=dataset, config=config)
# Production evaluation as a durable platform job
job = evaluator.submit(metric=metric, dataset=dataset, config=config)
job.wait_until_done()
result = job.get_result()
The platform handles: Job orchestration, inference routing through NeMo Platform's Inference Gateway, Fileset-based datasets, distributed execution, artifact storage, status monitoring, and result download.
Key Differences from Standalone Library¶
When using Evaluator as a NeMo Platform plugin :
| Feature | Standalone Library | NeMo Platform Plugin |
|---|---|---|
| Execution | Local Python process | Local plugin runs for local experimentation and durable platform jobs for production |
| Inference | Direct model or agent endpoint calls | The same as standalone and can also route through NeMo Platform Inference Gateway and platform-managed endpoints |
| Datasets | Inline rows and local files | Inline rows, local paths resolved at submission time, and NeMo Platform Filesets |
| Results Artifacts | Results stored in memory | NeMo Platform artifact storage with typed result download |
| Authentication | Local environment variables | Local environment variables for local runs and NeMo Platform Secrets service for remote jobs |
Evaluation Concepts¶
NeMo Platform supports two core evaluation primitives:
- Metrics: Scoring logic that evaluates model outputs. Use metrics when you need flexible, reusable scoring for your own datasets and task-specific criteria.
There are two execution modes and two evaluation patterns:
- Live evaluation (synchronous): Submit a request and get results immediately. Best for fast iteration, metric development, and small payloads.
-
Jobs (asynchronous): Submit work, monitor status, and fetch results when complete. Best for production workloads, larger datasets, and recurring regression checks.
-
Offline evaluation: Score existing dataset rows (for example, model outputs already generated).
- Online evaluation: Generate outputs from a model as part of evaluation, then score them.
For deeper details, see Evaluation Metrics.
Tutorials¶
After setting up a local instance of the platform, use the following tutorials to learn how to accomplish common evaluation tasks. These step-by-step guides help you evaluate models using different benchmarks and metrics.
-
Learn how to evaluate a fine-tuned model using the LLM Judge metric with a custom dataset.
custom-dataset
Recommended Evaluation Journey¶
Most teams get the best results by starting metric-first, then moving to benchmarks:
- Develop and validate your metrics first
- Start with Metrics to define how quality should be scored for your use case.
-
Use live evaluation (
POST /v2/workspaces/{workspace}/evaluation/metric-evaluate) with smallDatasetRowspayloads to iterate quickly. -
Scale metric evaluation to jobs
- When metrics are validated, run async metric jobs (
/evaluation/metric-jobs) on larger datasets. -
Use filesets for production-scale inputs. See Manage Files.
-
Monitor and analyze results
- Track job status and progress with job management APIs.
- Retrieve results and artifacts for analysis, reporting, and regression tracking.
Where to Go Next¶
- For metric workflows, see Metric Jobs and Metric Results.
- For full endpoint details, see the Evaluator API Reference.
Available Evaluations¶
Review configurations, data formats, and result examples for each evaluation.
- Retrieval — Evaluate document retrieval pipelines on standard or custom datasets.
- RAG — Evaluate Retrieval Augmented Generation pipelines (retrieval plus generation).
- Agentic — Assess agent-based and multi-step reasoning models, including topic adherence and tool use.
- LLM-as-a-Judge — Use another LLM to evaluate outputs with flexible scoring criteria. Define custom rubrics or numerical ranges.
- Similarity Metrics — Create metrics for text similarity, exact matching, and standard NLP evaluations using Jinja2 templating.