CV · Feb 28

Self-Correction Inside the Model: Leveraging Layer Attention to Mitigate Hallucinations in Large Vision Language Models

April Fu

arXiv:2603.00437v1

Originality: Incremental advance

AI Analysis

The paper addresses hallucination in advanced LVLMs, an incremental improvement for vision-language tasks.

The paper tackles hallucination in Large Vision-Language Models (LVLMs) by introducing an Internal self-Correction mechanism using Layer Attention (ICLA). The method improves visual grounding across multiple benchmarks while adding only 0.2M and 0.1M parameters on LLaVA1.5-7B and Qwen2.5-VL-7B, respectively.

Although Large Vision-Language Models (LVLMs) have made substantial progress, hallucination, where generated text is not grounded in the visual input, remains a challenge. As LVLMs become stronger, previously reported hallucination patterns, such as linguistic bias and the overthinking phenomenon, become far less consistent, making the corresponding mitigation techniques substantially less effective. In this paper, we introduce an Internal self-Correction mechanism utilizing Layer Attention (ICLA) that operates directly on hidden states during generation. Each layer selectively retrieves information from all preceding layers through a diagonal cross-layer attention mechanism, enabling self-refinement without any external correction signals. By introducing and training only 0.2M and 0.1M additional parameters on LLaVA1.5-7B and Qwen2.5-VL-7B, respectively, ICLA consistently improves visual grounding across multiple hallucination benchmarks, demonstrating its effectiveness for more advanced LVLMs.
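To make the idea of a layer retrieving information from its preceding layers more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not define "diagonal", so the sketch assumes two things: a token attends only to its own hidden states at earlier layers, and the query and key projections are per-channel scale vectors (diagonal matrices). The names `layer_attention`, `q_scale`, and `k_scale`, along with the residual update, are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_attention(prev_states, current, q_scale, k_scale):
    """Refine one token's hidden state using its states from earlier layers.

    prev_states: list of hidden vectors, one per preceding layer (same token)
    current:     hidden vector produced by the current layer
    q_scale, k_scale: per-channel weights (assumed diagonal projections)
    Returns (refined_state, attention_weights_over_preceding_layers).
    """
    if not prev_states:
        # The first layer has nothing to retrieve from.
        return list(current), []

    d = len(current)
    # Diagonal projection: an element-wise scale instead of a full matrix.
    query = [q * h for q, h in zip(q_scale, current)]

    scores = []
    for state in prev_states:
        key = [k * h for k, h in zip(k_scale, state)]
        dot = sum(a * b for a, b in zip(query, key))
        scores.append(dot / math.sqrt(d))

    weights = softmax(scores)

    # Weighted sum of earlier-layer states, added residually (an assumption).
    retrieved = [
        sum(w * state[i] for w, state in zip(weights, prev_states))
        for i in range(d)
    ]
    refined = [c + r for c, r in zip(current, retrieved)]
    return refined, weights
```

Under these assumptions the only trainable parameters are the two scale vectors per layer, which is one way a cross-layer mechanism could stay small relative to a 7B-parameter backbone. The abstract does not say how the reported 0.2M and 0.1M parameters are actually allocated.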
