CV · AI · May 4

Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning

arXiv:2605.02378 · 99.1 · Has Code
Predicted impact: top 1% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For practitioners using vision-language models, this work addresses the fragility of in-context learning by introducing structured reasoning, though the improvements are incremental over existing methods.

The paper identifies an inductive gap in multimodal in-context learning and proposes a framework that combines visual token compression, dynamic attention rebalancing, and chain-of-thought reasoning to make in-context learning in VLMs more reliable. The method achieves consistent improvements over standard ICL baselines across eight benchmarks.

In-context learning (ICL) allows large models to adapt to tasks using a few examples, yet its extension to vision-language models (VLMs) remains fragile. Our analysis reveals that the fundamental limitation lies in an inductive gap: models often produce correct answers from flawed reasoning while struggling to extract consistent rules across demonstrations. This gap is further exacerbated by two visual-level obstacles: an overwhelming proportion of redundant visual tokens that obscure textual cues, and a skewed attention distribution that favors the initial image at the expense of subsequent context. To address these issues, we introduce a framework that restructures multimodal ICL as a principled inductive-deductive process. The framework incorporates a similarity-based visual token compression module to filter out redundant patches, a dynamic attention rebalancing mechanism to distribute focus equitably across all images, and a chain-of-thought paradigm that explicitly guides the model to analyze individual examples, derive a generalizable rule, and then apply it to the query. An auxiliary learning pipeline combines supervised fine-tuning with reinforcement learning using verifiable rewards to reinforce faithful citation and noise filtering. Evaluations across eight benchmarks covering visual perception, logical reasoning, STEM problems, and sarcasm detection demonstrate consistent and significant improvements over standard ICL baselines for multiple open-source VLMs, highlighting the potential of equipping models with genuine inductive capabilities in multimodal settings.
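The abstract does not give the details of the two visual-level modules, so the following is only an illustrative sketch of the general ideas, not the paper's implementation. It assumes visual tokens are rows of a NumPy array, that redundancy is judged by cosine similarity against a hypothetical `threshold`, and that "rebalancing" means giving every image the same total attention mass. The function names `compress_visual_tokens` and `rebalance_attention` are made up for this example.

```python
import numpy as np


def compress_visual_tokens(tokens: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Return indices of tokens kept after greedy similarity-based pruning.

    A token is kept only if its cosine similarity to every previously kept
    token is below `threshold` (an assumed hyperparameter).
    """
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)  # avoid division by zero
    kept: list[int] = []
    for i in range(len(unit)):
        if not kept or np.max(unit[kept] @ unit[i]) < threshold:
            kept.append(i)
    return np.array(kept, dtype=int)


def rebalance_attention(weights, image_ids) -> np.ndarray:
    """Rescale attention so each image receives an equal share of the total.

    The relative distribution inside each image is preserved, and the total
    attention mass over all visual tokens is unchanged.
    """
    weights = np.asarray(weights, dtype=float)
    image_ids = np.asarray(image_ids)
    images = np.unique(image_ids)
    target = weights.sum() / len(images)  # equal mass per image
    out = np.empty_like(weights)
    for img in images:
        mask = image_ids == img
        mass = weights[mask].sum()
        if mass > 0:
            out[mask] = weights[mask] * (target / mass)
        else:
            out[mask] = target / mask.sum()  # spread uniformly if image had no mass
    return out


# Two near-duplicate patches and one distinct patch.
patches = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])
kept_indices = compress_visual_tokens(patches, threshold=0.9)

# Attention skewed toward the first image (ids 0) over the second (ids 1).
balanced = rebalance_attention([0.6, 0.2, 0.1, 0.1], [0, 0, 1, 1])
```

In this toy input the second patch is nearly parallel to the first and is dropped, and the first image's attention share falls from 0.8 to 0.5 while the second rises from 0.2 to 0.5. A real system would presumably apply such steps inside the model's attention layers rather than as post-hoc array operations.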
