CV · Jan 4

BARE: Towards Bias-Aware and Reasoning-Enhanced One-Tower Visual Grounding

arXiv:2601.01526v1 · Has Code
Originality: Incremental advance
AI Analysis

This work improves visual grounding for multimodal understanding by mitigating modality biases and enhancing referential reasoning, though it appears incremental as it builds on existing one-tower architectures.

The paper tackles the problem of visual grounding by addressing over-entangled multimodal representations and insufficient semantic reasoning in one-tower architectures, resulting in state-of-the-art performance on five benchmarks with superior computational efficiency.

Visual Grounding (VG), which aims to locate the specific region referred to by an expression, is a fundamental yet challenging task in multimodal understanding. While recent grounding transfer works have advanced the field through one-tower architectures, they still suffer from two primary limitations: (1) over-entangled multimodal representations that exacerbate deceptive modality biases, and (2) insufficient semantic reasoning that hinders the comprehension of referential cues. In this paper, we propose BARE, a bias-aware and reasoning-enhanced framework for one-tower visual grounding. BARE introduces a mechanism that preserves modality-specific features and constructs referential semantics through three novel modules: (i) language salience modulator, (ii) visual bias correction, and (iii) referential relationship enhancement, which jointly mitigate multimodal distractions and enhance referential comprehension. Extensive experimental results on five benchmarks demonstrate that BARE not only achieves state-of-the-art performance but also delivers superior computational efficiency compared to existing approaches. The code is publicly accessible at https://github.com/Marloweeee/BARE.

