CV · Apr 20

S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

arXiv:2604.18512 · 61.0 · h-index: 4
AI Analysis

For researchers working on vision-language model alignment, this work addresses the overlooked capability of global visual search and cross-image comparison in multi-image reasoning.

The paper identifies a gap in multi-image reasoning for VLMs and proposes S2H-DPO, a hardness-aware preference optimization method that constructs hierarchical multi-image preference data to improve multi-image reasoning while maintaining single-image performance. Experiments on LLaVA and Qwen-VL models show significant improvements over baselines across benchmarks.

Vision-Language Models (VLMs) have demonstrated remarkable progress in single-image understanding, yet effective reasoning across multiple images remains challenging. We identify a critical capability gap in existing multi-image alignment approaches: current methods focus primarily on localized reasoning with pre-specified image indices ("Look at Image 3 and..."), bypassing the essential skills of global visual search and autonomous cross-image comparison. To address this limitation, we introduce a Simple-to-Hard (S2H) learning framework that systematically constructs multi-image preference data across three hierarchical reasoning levels of increasing difficulty: (1) single-image localized reasoning, (2) multi-image localized comparison, and (3) global visual search. Unlike prior work that relies on model-specific attributes, such as hallucinations or attention heuristics, to generate preference pairs, our approach leverages prompt-driven complexity to create chosen/rejected pairs that are applicable across different models. Through extensive evaluations on LLaVA and Qwen-VL models, we show that our diverse multi-image reasoning data yields significant improvements in multi-image reasoning over baseline methods across benchmarks. Importantly, our approach maintains strong single-image reasoning performance while strengthening multi-image understanding, thus advancing the state of the art for holistic visual preference alignment.
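To make the three reasoning levels and the chosen/rejected pairing concrete, here is a minimal Python sketch. It is not the paper's implementation: the `Level` and `PreferencePair` names, the example prompts and responses, and the idea of ordering training data from level 1 to level 3 are all illustrative assumptions. The `dpo_loss` function is the standard per-pair DPO objective computed from sequence log-probabilities; the abstract does not specify how (or whether) S2H-DPO modifies it with hardness information.

```python
import math
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """The three reasoning levels named in the abstract (enum names are hypothetical)."""
    SINGLE_IMAGE_LOCALIZED = 1
    MULTI_IMAGE_LOCALIZED = 2
    GLOBAL_VISUAL_SEARCH = 3


@dataclass(frozen=True)
class PreferencePair:
    """One chosen/rejected pair; field names are assumptions for illustration."""
    level: Level
    prompt: str
    chosen: str
    rejected: str


def simple_to_hard(pairs):
    """Order pairs from level 1 to level 3 (an assumed curriculum, stable within a level)."""
    return sorted(pairs, key=lambda p: p.level)


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probs under policy and reference."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), written to avoid overflow
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical examples; prompts and responses are invented for illustration only.
pairs = [
    PreferencePair(Level.GLOBAL_VISUAL_SEARCH,
                   "Which image contains a red bicycle?",
                   "The second image.", "The fourth image."),
    PreferencePair(Level.SINGLE_IMAGE_LOCALIZED,
                   "Look at Image 3. What color is the car?",
                   "Blue.", "Green."),
    PreferencePair(Level.MULTI_IMAGE_LOCALIZED,
                   "Compare Image 1 and Image 2. Which shows more people?",
                   "Image 2.", "Image 1."),
]

ordered = simple_to_hard(pairs)
loss_untrained = dpo_loss(0.0, 0.0, 0.0, 0.0)       # policy == reference
loss_improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)    # policy prefers chosen
```

When the policy equals the reference, the margin is zero and the loss is log 2; it drops below that once the policy assigns relatively more probability to the chosen response than the reference does.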

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
