# Vision-Language Models (VLM)
This section details how to evaluate Vision-Language Model (VLM) benchmarks that require both text and image understanding.
## VLM-specific features
VLM evaluation uses the standard `vllm` server type with multimodal support:
- Automatically converts local image paths to base64 data URLs
- Supports HTTP/HTTPS image URLs and pre-encoded base64 data URLs
- Works seamlessly with any vLLM-supported VLM model
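To make the first point concrete, converting a local image file into a base64 data URL amounts to roughly the following (a minimal illustrative sketch, not the actual harness code; the helper name is made up):

```python
import base64
import mimetypes


def to_data_url(image_path: str) -> str:
    """Encode a local image file as a base64 data URL (illustrative only)."""
    mime_type, _ = mimetypes.guess_type(image_path)
    mime_type = mime_type or "image/png"  # fall back if the extension is unknown
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```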
## Prompt configuration
VLM prompts support two additional fields in the prompt config YAML:
```yaml
image_field: image_path    # Field name in the input data containing the image path
image_position: before     # "before" or "after" - where to place the image relative to the text
```
For example, the MMMU-Pro prompt config:
```yaml
image_field: image_path
image_position: before

user: |-
  Answer the following multiple choice question. The last line of your response should be in the following format: 'Answer: A/B/C/D/E/F/G/H/I/J' (e.g. 'Answer: A').

  {problem}
```
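With `image_position: before`, the image is attached ahead of the prompt text in the user turn. In the OpenAI-compatible chat format that a vLLM server accepts, the resulting user message looks roughly like this (an illustrative sketch; the values are placeholders, not the exact payload the pipeline builds):

```python
# Rough shape of the user message when image_position is "before".
user_message = {
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "Answer the following multiple choice question. ..."},
    ],
}
```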
## Image path resolution
The `image_path` field in the input data supports multiple formats:

| Format | Example | Behavior |
|---|---|---|
| Relative path | `images/test.png` | Resolved relative to the input JSONL directory |
| Absolute path | `/data/images/test.png` | Used directly |
| HTTP URL | `https://example.com/img.png` | Passed through to vLLM |
| Data URL | `data:image/png;base64,...` | Passed through to vLLM |
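The resolution rules in the table can be summarized with a short sketch (a simplified illustration of the behavior, not the actual implementation):

```python
from pathlib import Path


def resolve_image_path(image_path: str, input_jsonl_dir: Path) -> str:
    """Apply the resolution rules from the table above (illustrative only)."""
    # URLs and data URLs are passed through to vLLM unchanged.
    if image_path.startswith(("http://", "https://", "data:")):
        return image_path
    path = Path(image_path)
    # Absolute paths are used directly; relative paths are resolved
    # against the directory that contains the input JSONL file.
    return str(path if path.is_absolute() else input_jsonl_dir / path)
```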
## Supported benchmarks
### mmmu-pro
MMMU-Pro is a robust multi-discipline multimodal understanding benchmark from the MMMU team. It evaluates VLMs on expert-level tasks across various academic disciplines using the "vision" configuration where images are critical for problem-solving.
- Benchmark is defined in `nemo_skills/dataset/mmmu-pro/__init__.py`
- Original benchmark source is here.
- Evaluation follows the AAI methodology for 10-choice MCQ.
#### Preparing data
VLM benchmarks require image files, which need to be downloaded separately:
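A typical preparation command is sketched below, assuming the standard `ns prepare_data` entrypoint; the `--data_dir` flag is an assumption here and should match the `data_dir` you later pass to `eval` (check `ns prepare_data --help` for the exact options and how images are fetched):

```bash
# Assumed invocation: prepares MMMU-Pro data (including images)
# into the directory that eval() will read from via data_dir.
ns prepare_data mmmu-pro --data_dir=/workspace/ns-data
```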
#### Running evaluation
For standard instruction-following VLMs (e.g., Qwen3-VL-4B-Instruct):
```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

eval(
    ctx=wrap_arguments("++inference.temperature=0 ++inference.tokens_to_generate=16384"),
    cluster="slurm",
    output_dir="/workspace/mmmu-pro-eval",
    server_type="vllm",
    server_gpus=1,
    model="Qwen/Qwen3-VL-4B-Instruct",
    benchmarks="mmmu-pro",
    data_dir="/workspace/ns-data",
)
```
For reasoning-enhanced VLMs (e.g., Qwen3-VL-30B-A3B-Thinking):
```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

eval(
    ctx=wrap_arguments("++inference.temperature=0.7 ++inference.tokens_to_generate=131072"),
    cluster="slurm",
    output_dir="/workspace/mmmu-pro-eval",
    server_type="vllm",
    server_gpus=8,
    model="/hf_models/Qwen3-VL-30B-A3B-Thinking",
    benchmarks="mmmu-pro",
    data_dir="/workspace/ns-data",
)
```
**Alternative: Command-line usage**
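The same run can be expressed with the `ns eval` command; the sketch below mirrors the first Python example above (flag spellings are assumed to match the Python keyword arguments and may differ slightly between versions):

```bash
ns eval \
    --cluster=slurm \
    --output_dir=/workspace/mmmu-pro-eval \
    --server_type=vllm \
    --server_gpus=1 \
    --model=Qwen/Qwen3-VL-4B-Instruct \
    --benchmarks=mmmu-pro \
    --data_dir=/workspace/ns-data \
    ++inference.temperature=0 \
    ++inference.tokens_to_generate=16384
```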
## vLLM configuration tips
Based on the vLLM VLM documentation:

- For image-only inference, add `--limit-mm-per-prompt.video 0` to save memory
- Set `--max-model-len 128000` for most use cases (the default 262K consumes more memory)
- Use `--async-scheduling` for better performance
These can be passed via `server_args`:
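For example, extending the first evaluation call from above (the vLLM flags are taken directly from the tips in this section; everything else matches the earlier example):

```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

eval(
    ctx=wrap_arguments("++inference.temperature=0 ++inference.tokens_to_generate=16384"),
    cluster="slurm",
    output_dir="/workspace/mmmu-pro-eval",
    server_type="vllm",
    server_gpus=1,
    model="Qwen/Qwen3-VL-4B-Instruct",
    benchmarks="mmmu-pro",
    data_dir="/workspace/ns-data",
    # vLLM flags from the tips above, forwarded to the server process
    server_args="--limit-mm-per-prompt.video 0 --max-model-len 128000 --async-scheduling",
)
```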