AI, CV · Sep 27, 2025

AttAnchor: Guiding Cross-Modal Token Alignment in VLMs with Attention Anchors

Junyang Zhang, Tianyi Zhu, Thierry Tambe

arXiv:2509.23109v1 · 5.81 citations · h-index: 2

Originality: Incremental advance

AI Analysis

The work addresses hallucination and underperformance in VLMs, a concern for both researchers and practitioners. Its approach to cross-modal alignment is an incremental advance, but the reported empirical gains are strong.

The paper tackles the problem of cross-modal token misalignment in vision-language models (VLMs) by proposing AttAnchor, a parameter-free framework that groups semantically similar tokens across modalities to improve cross-modal locality. The method achieves improvements across 13 out of 15 metrics and benchmarks, including up to 32% gains on reasoning tasks and up to 15% improvements on hallucination benchmarks, while enabling TinyLLaVA 1B to outperform larger models like LLaVA 7B with only 0.1% inference time overhead.

A fundamental reason for the dominance of attention over RNNs and LSTMs in LLMs is its ability to capture long-range dependencies by modeling direct interactions between all tokens, overcoming the sequential limitations of recurrent architectures. Similarly, a key reason why today's vision language models (VLMs) hallucinate and underperform pure language models is that they rely on direct concatenation of image and text tokens with a modality-blinded positional encoding, which conveniently adopts the pretrained LLM backbone but forces unnecessary long-distance attention between semantically related tokens across modalities. This underscores the urgent need for mechanisms that efficiently enhance token locality and cross-modal alignment. In response, we propose Attention Anchor, a parameter-free framework that efficiently groups semantically similar tokens across modalities, improving cross-modal locality. By inserting text tokens near relevant visual patches, we create semantic signposts that reveal true content-based cross-modal attention scores, guiding the model to focus on the correct image regions for tasks such as VQA, MMBench and POPE. This improves answer accuracy and reduces hallucinations without disrupting the prompt's semantic flow. AttAnchor achieves improvements across 13 out of 15 different metrics and benchmarks, including up to 32% gains on reasoning tasks and up to 15% improvements on hallucination benchmarks. AttAnchor enables TinyLLaVA 1B to outperform much larger models like LLaVA 7B and QwenVL 3B on POPE with only 0.1% inference time overhead. To the best of our knowledge, this work is among the first to investigate mixed-modal token grouping, where text and image tokens are clustered jointly into shared groups rather than being grouped within a single modality or merely aligned post-hoc with additional alignment losses.
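The anchoring idea in the abstract (placing text tokens next to the visual patches they relate to) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how similarity is measured or how matches are chosen, so the cosine-similarity rule, the `threshold` parameter, and the function names `cosine` and `insert_anchors` below are all assumptions made for illustration. The sketch also assumes the original prompt is kept intact after the image tokens, as one way to avoid disrupting its semantic flow.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def insert_anchors(image_tokens, text_tokens, threshold=0.8):
    """Build a token order in which matched text tokens sit next to image patches.

    Hypothetical rule (an assumption, not taken from the paper): each text token
    is matched to its most similar patch and copied in right after that patch
    when the similarity reaches `threshold`. The full text prompt is then
    appended unchanged.

    Returns a list of (kind, index) pairs with kind in {"image", "anchor", "text"}.
    """
    anchors = {}  # patch index -> indices of text tokens anchored after it
    if image_tokens:
        for t_idx, t_vec in enumerate(text_tokens):
            sims = [cosine(t_vec, p_vec) for p_vec in image_tokens]
            best = max(range(len(sims)), key=sims.__getitem__)
            if sims[best] >= threshold:
                anchors.setdefault(best, []).append(t_idx)

    sequence = []
    for p_idx in range(len(image_tokens)):
        sequence.append(("image", p_idx))
        for t_idx in anchors.get(p_idx, []):
            sequence.append(("anchor", t_idx))
    # Keep the original prompt intact after the image tokens.
    sequence.extend(("text", t_idx) for t_idx in range(len(text_tokens)))
    return sequence


# Toy 2-D embeddings: the first text token points toward patch 0,
# the second is dissimilar to every patch and gets no anchor.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text_tokens = [[0.9, 0.1], [-1.0, -1.0]]
order = insert_anchors(image_tokens, text_tokens)
```

In this toy case only the first text token is copied in as an anchor, directly after patch 0, so it sits adjacent to the patch it resembles instead of a full image-length away. A real system would operate on the model's actual token embeddings and would also need to decide how positional encodings treat the inserted tokens, which this sketch leaves out.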

