VL-RewardBench

A Challenging Benchmark for Vision-Language Generative Reward Models

HKU, SCUT, PKU, SJTU, UW & Allen Institute for AI
*Core Contributors

We introduce VL-RewardBench, a novel benchmark designed to expose the limitations of vision-language reward models across visual perception, hallucination detection, and reasoning tasks.
Our evaluation reveals that models primarily fail at basic visual perception rather than reasoning, and that performance on our benchmark strongly correlates (r > 0.9) with downstream vision-language tasks.

Leaderboard (Update Date: 2024-11-26):

VL-RewardBench Pipeline

Abstract

Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline combining sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe model limitations. Comprehensive evaluation across 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B struggle to surpass random guessing. Importantly, performance on VL-RewardBench strongly correlates (Pearson's r > 0.9) with MMMU-Pro accuracy when VL-GenRMs are used for Best-of-N sampling. Analysis experiments uncover three critical insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception tasks rather than reasoning tasks; (ii) inference-time scaling benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to learn to judge substantially boosts judgment capability (+14.7% accuracy for a 7B VL-GenRM). We believe VL-RewardBench, along with these experimental insights, will become a valuable resource for advancing VL-GenRMs.

VL-RewardBench Dataset

Ideally, an effective benchmark for VL-GenRMs should satisfy three key requirements:

a) diverse coverage of real-world applications; b) sufficient difficulty to expose current models' limitations; c) objective ground truth labels.

To satisfy criterion (a), our benchmark evaluates VL-GenRMs across three key application domains:

1. General multimodal queries from real users (VLFeedback and WildVision)
2. Visual hallucination detection tasks (POVID, RLAIF-V, RLHF-V)
3. Multimodal knowledge and mathematical reasoning (MMMU-Pro and MathVerse)

To ensure criterion (b), we employ targeted curation strategies:

  • For source datasets with preference pairs, we use a committee of small LVLMs to select challenging samples; our evaluation shows that these pairs remain difficult even for much larger models (a minimal filtering sketch is given after this list).
  • For reasoning tasks without annotated labels, we leverage strong commercial models to generate responses with explicit reasoning paths, followed by GPT-4's quality assessment.

To fulfill criterion (c), all preference labels undergo human verification to eliminate ambiguous or incorrect pairs.
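
As a rough illustration of the difficulty-filtering step, the sketch below keeps only the preference pairs that a committee of small LVLMs fails to judge correctly. The `Judge` callables and the `max_correct` threshold are our own assumptions standing in for the actual model calls, not the released pipeline code.

```python
# Minimal sketch of the difficulty-filtering idea (assumed interface, not the
# released pipeline): each judge returns "A" or "B" for a preference pair whose
# human-preferred response is always presented as option "A".
from typing import Callable, Dict, List

Judge = Callable[[Dict], str]

def is_challenging(pair: Dict, judges: List[Judge], max_correct: int = 0) -> bool:
    """Keep a pair only if at most `max_correct` small LVLMs pick the preferred answer."""
    correct = sum(1 for judge in judges if judge(pair) == "A")
    return correct <= max_correct

def filter_challenging(pairs: List[Dict], judges: List[Judge]) -> List[Dict]:
    # Retain the pairs that stay difficult for the whole small-model committee.
    return [p for p in pairs if is_challenging(p, judges)]
```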
VL-RewardBench Pipeline


VL-RewardBench Statistics



Experiments

We present a comprehensive evaluation of 16 state-of-the-art vision-language reward models (VL-GenRMs) on VL-RewardBench, covering both open-source models (ranging from 7B to 90B parameters) and leading commercial systems including Gemini-1.5-Pro, Claude-3.5-Sonnet, and GPT-4o. Our benchmark uncovers significant limitations in current VL-GenRMs: even top commercial models achieve only moderate performance (GPT-4o: 62.4%, Gemini-1.5-Pro: 62.5%), while state-of-the-art open-source models such as Qwen2-VL-72B and LLaMA-3.2-90B perform near chance level (43.0% and 53.9%, respectively).
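
For readers who want to run a similar evaluation, the sketch below shows one plausible way to score a generative judge on preference pairs. The prompt wording and the `ask_judge` hook are assumptions for illustration, not the benchmark's exact evaluation code.

```python
# Minimal sketch of pairwise judge evaluation (assumed prompt and model hook).
import random
from typing import Callable, Dict, List

JUDGE_PROMPT = (
    "Question: {query}\n\n"
    "Response A: {resp_a}\n"
    "Response B: {resp_b}\n\n"
    "Which response answers the question better? Reply with 'A' or 'B' only."
)

def judge_accuracy(pairs: List[Dict], ask_judge: Callable[[object, str], str]) -> float:
    """pairs: dicts with keys 'image', 'query', 'chosen', 'rejected'.
    ask_judge(image, prompt) returns the judge model's raw text reply."""
    hits = 0
    for p in pairs:
        chosen_first = random.random() < 0.5  # randomize order to reduce position bias
        a, b = (p["chosen"], p["rejected"]) if chosen_first else (p["rejected"], p["chosen"])
        prompt = JUDGE_PROMPT.format(query=p["query"], resp_a=a, resp_b=b)
        verdict = ask_judge(p["image"], prompt).strip().upper()[:1]
        hits += int(verdict == ("A" if chosen_first else "B"))
    return hits / len(pairs)
```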


Remarkably, performance on VL-RewardBench shows strong correlation (Pearson's r > 0.9) with downstream MMMU-Pro results when using these models for Best-of-N sampling guidance.
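
As a concrete picture of this setup, the sketch below implements plain Best-of-N sampling with a reward model; `generate` and `score` are placeholder hooks for the policy LVLM and the VL-GenRM rather than actual APIs.

```python
# Minimal Best-of-N sketch: sample N candidate answers, score each with the
# VL-GenRM, and keep the highest-scored one.
from typing import Callable, List

def best_of_n(image, question: str,
              generate: Callable[[object, str, int], List[str]],
              score: Callable[[object, str, str], float],
              n: int = 8) -> str:
    candidates = generate(image, question, n)                  # N sampled answers
    scores = [score(image, question, c) for c in candidates]   # reward per answer
    return candidates[scores.index(max(scores))]               # keep the best-scored answer
```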

Next Step: How to Advance VL-GenRMs?

Given the poor accuracy of current VL-GenRMs on our benchmark, our analysis uncovers three critical insights for advancing them:

1. The primary performance bottleneck lies in visual perception rather than reasoning: models show significantly higher error rates on existence/recognition tasks (>67%) than on reasoning tasks (41.8%). → Improving the visual perception capability of VL-GenRMs first!

Analysis Results

2. The effectiveness of test-time scaling varies with model capacity, providing benefits to larger models while potentially degrading smaller models' performance. → Advancing Test-time Scaling Strategies for VL-GenRMs.

Analysis Results
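
One simple inference-time scaling recipe, sketched below under our own assumptions rather than as the paper's exact configuration, is to draw several independent judgments at non-zero temperature and take a majority vote.

```python
# Sketch of a simple test-time scaling strategy for a generative judge:
# sample K stochastic judgments and return the majority verdict.
from collections import Counter
from typing import Callable

def majority_verdict(sample_judgment: Callable[[], str], k: int = 5) -> str:
    """sample_judgment() returns 'A' or 'B' from one judge call at temperature > 0."""
    votes = Counter(sample_judgment() for _ in range(k))
    return votes.most_common(1)[0][0]
```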

3. Training LVLMs to judge (as in LLaVA-Critic) substantially improves judgment capabilities, as demonstrated by a 14.7% accuracy gain for LLaVA-OneVision-7B-ov, with pointwise evaluation slightly outperforming pairwise scoring on average. → Developing a co-evolution framework for LVLMs and VL-GenRMs to improve via self-play.

Critic Results
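
The pointwise versus pairwise distinction can be summarized with the sketch below; `rate` and `compare` are hypothetical wrappers around the critic model, not its actual interface.

```python
# Sketch contrasting pointwise and pairwise judging. `rate` scores one response
# (e.g., on a 1-10 scale) and `compare` judges two responses in a single prompt;
# both are hypothetical wrappers around the critic model.
from typing import Callable

def pointwise_prefer(image, query: str, resp_a: str, resp_b: str,
                     rate: Callable[[object, str, str], float]) -> str:
    """Score each response independently and prefer the higher-rated one."""
    return "A" if rate(image, query, resp_a) >= rate(image, query, resp_b) else "B"

def pairwise_prefer(image, query: str, resp_a: str, resp_b: str,
                    compare: Callable[[object, str, str, str], str]) -> str:
    """Ask the critic to compare both responses directly; it returns 'A' or 'B'."""
    return compare(image, query, resp_a, resp_b)
```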

BibTeX



Coming soon. 
  

Acknowledgement

This website is adapted from LLaVA-VL and Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of GPT-4o and Claude.

Related Links: VLFeedback, WildVision, RLHF-V, RLAIF-V, POVID, MMMU-Pro, MathVerse, RewardBench