CV · Apr 14

From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception

arXiv:2604.12508 · 75.0
Predicted impact: top 40% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For researchers and practitioners using MLLMs, this work addresses a known bottleneck in fine-grained visual perception with a novel method that can be integrated as a plug-and-play module.

The paper identifies Visual Attenuation as a cause of poor fine-grained perception in Multimodal Large Language Models (MLLMs) and proposes the Variational Information Flow (VIF) framework using a CVAE to model visual saliency. VIF achieves competitive improvements over previous methods across multiple benchmarks.

While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding, they frequently falter in fine-grained perception tasks that require identifying tiny objects or discerning subtle visual relationships. We attribute this limitation to Visual Attenuation: a phenomenon where sparse fine-grained visual signals are prematurely suppressed or diluted by dominant textual tokens during network propagation, resulting in a "loss of focus" during the deep-level decision-making process. Existing input-centric solutions fail to fundamentally reverse this intrinsic mechanism of information loss. To address this challenge, we propose the Variational Information Flow (VIF) framework. Adopting a probabilistic perspective, VIF leverages a Conditional Variational Autoencoder (CVAE) to model the visual saliency relevant to the question-answer pair as a latent distribution. As a plug-and-play module, VIF can be integrated into existing architectures. Extensive evaluations across diverse benchmarks, covering General VQA, fine-grained perception, and visual grounding, demonstrate that VIF yields competitive improvements over previous methods, validating its effectiveness in enhancing the fine-grained perception of MLLMs.
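
For readers unfamiliar with CVAEs, the textbook conditional evidence lower bound is sketched below. This is the generic CVAE objective, not necessarily the exact loss used in VIF. The symbol mapping is an assumption made for illustration: $x$ stands for the image and question, $y$ for the answer, and $z$ for the latent variable that would encode question-relevant visual saliency.

```latex
% Generic CVAE objective (standard form; the mapping of x, y, z to VIF is an assumption)
\log p_\theta(y \mid x)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[ \log p_\theta(y \mid x, z) \right]
  \;-\;
  \mathrm{KL}\!\left( q_\phi(z \mid x, y) \,\middle\|\, p_\theta(z \mid x) \right)
```

Under this reading, the posterior $q_\phi(z \mid x, y)$ sees the answer during training, while the prior $p_\theta(z \mid x)$ must infer the latent from the image and question alone at inference time. The KL term pulls the two distributions together.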

