CVJul 25, 2025

LISA: A Layer-wise Integration and Suppression Approach for Hallucination Mitigation in Multimodal Large Language Models

arXiv:2507.19110v22 citationsh-index: 5Has Code
Originality Incremental advance
AI Analysis

This addresses a critical reliability issue in MLLMs for vision-language tasks, offering a plug-and-play solution with incremental improvements.

The paper tackles object hallucinations in Multimodal Large Language Models (MLLMs) by proposing LISA, a layer-wise integration and suppression approach, which reduces hallucinations by up to 53.6% on CHAIR_I and improves POPE F1 by up to 5.1%.

Multimodal Large Language Models (MLLMs) excel in vision-language tasks such as image captioning but remain prone to object hallucinations, where they describe objects that do not appear in the image. To mitigate this, we propose LISA, a Layer-wise Integration and Suppression Approach. LISA leverages the layer-wise functional roles in MLLMs: shallow layers provide visual grounding, middle layers encode semantics, and deep layers tend to amplify spurious signals. First, layer-wise spectral modulation stabilizes attention by suppressing over-amplified activations in deeper layers while preserving alignment cues in earlier layers. Second, token-level logits from selected layers are fused via anchor-based routing, with token-wise anchor selection and soft logit fusion enabling adaptive integration during decoding. LISA is fully plug-and-play and can be seamlessly integrated into existing MLLMs, including Qwen2.5-VL. Experiments on multiple benchmarks show that LISA reduces hallucinations by up to 53.6% in $\text{CHAIR}_\text{I}$ and improves POPE F1 by up to 5.1%, demonstrating strong generalization across models and tasks. Our code is available at https://github.com/zhlisa1010-eng/LISA.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes