CV | Sep 17, 2025

Towards Rationale-Answer Alignment of LVLMs via Self-Rationale Calibration

arXiv:2509.13919v1 | 1 citation | h-index: 9 | ICML
Originality: Incremental advance
AI Analysis

The paper addresses a key limitation of LVLMs and so improves their reliability on visual question answering tasks, though the advance is incremental and focused on alignment calibration.

The paper tackles misalignment between rationales and answers in Large Vision-Language Models (LVLMs), which causes inconsistent reasoning and incorrect responses. It introduces the Self-Rationale Calibration (SRC) framework, which iteratively calibrates this alignment, and reports significant improvements in perception, reasoning, and generalization across multiple benchmarks.

Large Vision-Language Models (LVLMs) have manifested strong visual question answering capability. However, they still struggle with aligning the rationale and the generated answer, leading to inconsistent reasoning and incorrect responses. To this end, this paper introduces the Self-Rationale Calibration (SRC) framework to iteratively calibrate the alignment between rationales and answers. SRC begins by employing a lightweight "rationale fine-tuning" approach, which modifies the model's response format to require a rationale before deriving an answer without explicit prompts. Next, SRC searches for a diverse set of candidate responses from the fine-tuned LVLMs for each sample, followed by a proposed pairwise scoring strategy using a tailored scoring model, R-Scorer, to evaluate both rationale quality and factual consistency of candidates. Based on a confidence-weighted preference curation process, SRC decouples the alignment calibration into a preference fine-tuning manner, leading to significant improvements of LVLMs in perception, reasoning, and generalization across multiple benchmarks. Our results emphasize the rationale-oriented alignment in exploring the potential of LVLMs.
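
The abstract does not spell out how pairwise scores become preference pairs, so the following Python sketch is only one plausible reading, not the authors' implementation. It assumes a pairwise scorer that returns the probability that one candidate beats another, averages those probabilities into a per-candidate score, and keeps a (chosen, rejected) pair only when the score gap is large enough; the gap doubles as a confidence weight. All names here (`Candidate`, `mean_win_scores`, `curate_pair`, `toy_scorer`, `min_margin`) are hypothetical, and `toy_scorer` is a crude stand-in for R-Scorer that merely checks whether the answer string appears in the rationale.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """One sampled response: a rationale followed by an answer."""
    rationale: str
    answer: str


# Hypothetical scorer interface: P(first candidate is better than second), in [0, 1].
PairScorer = Callable[[Candidate, Candidate], float]


def mean_win_scores(cands: List[Candidate], scorer: PairScorer) -> List[float]:
    """Average pairwise win probability of each candidate against all others."""
    if len(cands) < 2:
        raise ValueError("need at least two candidates to compare")
    totals = [0.0] * len(cands)
    for i, j in combinations(range(len(cands)), 2):
        p = scorer(cands[i], cands[j])
        totals[i] += p
        totals[j] += 1.0 - p
    return [t / (len(cands) - 1) for t in totals]


def curate_pair(
    cands: List[Candidate],
    scorer: PairScorer,
    min_margin: float = 0.2,  # assumed threshold, not taken from the paper
) -> Optional[Tuple[Candidate, Candidate, float]]:
    """Return (chosen, rejected, weight), or None if the gap is too small."""
    if len(cands) < 2:
        return None
    scores = mean_win_scores(cands, scorer)
    best = max(range(len(cands)), key=scores.__getitem__)
    worst = min(range(len(cands)), key=scores.__getitem__)
    margin = scores[best] - scores[worst]
    if margin < min_margin:
        return None  # low-confidence sample: skip it
    return cands[best], cands[worst], margin


def toy_scorer(a: Candidate, b: Candidate) -> float:
    """Stand-in for R-Scorer: favors rationales that mention their own answer."""
    sa = 1.0 if a.answer.lower() in a.rationale.lower() else 0.0
    sb = 1.0 if b.answer.lower() in b.rationale.lower() else 0.0
    return 0.5 + 0.4 * (sa - sb)
```

The resulting triples could then feed a preference fine-tuning step, with the margin used as a per-pair weight; which preference objective the paper uses is not stated in this summary.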

