CV · Jan 4

Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization

arXiv:2601.01483v1 · 2 citations
Originality: Incremental advance
AI Analysis

This work addresses efficiency and performance issues in vision-language models for applications that require reliable reasoning. It is an incremental advance, since it builds on existing preference optimization methods.

The paper tackles the high training and inference costs of using separate generation and verification models in vision-language tasks. It proposes ADPO, a unified reinforcement learning framework that learns both functions jointly in a single policy, achieving up to 34.1% higher verification AUC and 53.5% lower inference time, with accuracy gains on benchmarks such as MathVista and MMMU.

Parallel test-time scaling typically trains separate generation and verification models, incurring high training and inference costs. We propose Advantage Decoupled Preference Optimization (ADPO), a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy. ADPO introduces two innovations: a preference verification reward that improves verification capability, and a decoupled optimization mechanism that enables synergistic optimization of generation and verification. Specifically, the preference verification reward computes the mean verification scores of positive and negative samples and uses them as decision thresholds, providing positive feedback when the predicted correctness aligns with the actual answer correctness. Meanwhile, the advantage decoupled optimization computes separate advantages for generation and verification, applies token masks to isolate gradients, and combines the masked GRPO objectives, preserving generation quality while calibrating verification scores. ADPO achieves up to 34.1% higher verification AUC and 53.5% lower inference time, with significant gains of +2.8%/+1.4% accuracy on MathVista/MMMU, +1.9 cIoU on ReasonSeg, and +1.7%/+1.0% step success rate on AndroidControl/GUI Odyssey.
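
The two mechanisms described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the exact formulas, so the sketch rests on assumptions. It assumes each rollout in a group carries a self-verification score in [0, 1] and a correctness label; that a correct answer is rewarded when its score exceeds the mean score of the incorrect samples, and an incorrect answer when its score falls below the mean score of the correct samples; and that a fixed fallback threshold of 0.5 is used when a group has only one class. The function names, the fallback value, and the standard GRPO-style group normalization are all hypothetical choices made for illustration.

```python
import numpy as np

def preference_verification_reward(scores, correct, fallback=0.5):
    """Assumed reading of the reward: 1.0 when a sample's verification
    score lands on the right side of the opposite class's mean score."""
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    # Mean scores of positive (correct) and negative (incorrect) samples
    # act as decision thresholds; fallback is an assumption for empty classes.
    pos_mean = scores[correct].mean() if correct.any() else fallback
    neg_mean = scores[~correct].mean() if (~correct).any() else fallback
    aligned = np.where(correct, scores > neg_mean, scores < pos_mean)
    return aligned.astype(float)

def group_advantage(rewards, eps=1e-6):
    """GRPO-style advantage: rewards standardized within one rollout group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def decoupled_token_advantages(gen_rewards, ver_rewards, gen_mask, ver_mask):
    """Compute separate advantages for generation and verification, then
    route each to its own tokens with 0/1 masks of shape (group, tokens)."""
    a_gen = group_advantage(gen_rewards)[:, None]
    a_ver = group_advantage(ver_rewards)[:, None]
    gen_mask = np.asarray(gen_mask, dtype=float)
    ver_mask = np.asarray(ver_mask, dtype=float)
    # Answer tokens only see the generation advantage; verification tokens
    # only see the verification advantage, so the two signals stay isolated.
    return a_gen * gen_mask + a_ver * ver_mask

# Toy group of 4 rollouts, each with 2 answer tokens and 2 verification tokens.
scores = [0.9, 0.2, 0.6, 0.7]
correct = [True, True, False, False]
ver_rewards = preference_verification_reward(scores, correct)
gen_rewards = np.asarray(correct, dtype=float)
gen_mask = np.tile([1, 1, 0, 0], (4, 1))
ver_mask = np.tile([0, 0, 1, 1], (4, 1))
token_adv = decoupled_token_advantages(gen_rewards, ver_rewards, gen_mask, ver_mask)
print(ver_rewards)
print(token_adv.round(3))
```

In a full training loop, the per-token advantages would weight the clipped policy-ratio terms of the GRPO objective. The sketch stops at the advantages because that is where the decoupling described above takes place.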
