Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models
This work addresses a specific bottleneck in long-context reasoning for vision-language models and represents an incremental improvement.
The paper tackles the performance decline of Large Vision-Language Models in long-context reasoning and identifies overreliance on textual information, with a corresponding drop in visual dependency, as the key issue. It proposes a training-free context pruning method that selectively removes less critical text and reports improved performance across various models.
Large Vision-Language Models (LVLMs) excel in cross-modal tasks but experience performance declines in long-context reasoning due to overreliance on textual information and reduced visual dependency. In this study, we empirically analyze LVLMs in long-context reasoning, revealing that increased context length leads to a higher dependence on language at the expense of visual dependency. To address this issue, we propose a novel training-free context pruning method that selectively removes less critical textual information. Our approach enhances visual dependency and reduces textual noise, thereby improving LVLM performance in long-context reasoning. We validate our method by constructing a long-context dataset, demonstrating its effectiveness across various LVLMs. Moreover, further analysis confirms the robustness of different token pruning strategies and preliminarily explores scaling laws between pruning rates and context length.
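The pruning idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how token importance is measured, so the code below simply assumes that a per-token importance score is already available (for example, derived from attention weights, which is an assumption here, not the paper's stated criterion). The function name `prune_text_tokens` and its parameters are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def prune_text_tokens(
    tokens: Sequence[str],
    scores: Sequence[float],
    pruning_rate: float,
) -> List[str]:
    """Drop the lowest-scoring fraction of text tokens, keeping original order.

    `scores` is an assumed, externally supplied importance score per token;
    how it is computed is outside the scope of this sketch.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 <= pruning_rate < 1.0:
        raise ValueError("pruning_rate must be in [0, 1)")

    # Number of tokens that survive pruning (at least one when input is non-empty).
    n_keep = max(1, len(tokens) - int(len(tokens) * pruning_rate))

    # Rank token indices from most to least important.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)

    # Keep the top-ranked indices, then restore the original reading order.
    keep = sorted(ranked[:n_keep])
    return [tokens[i] for i in keep]


if __name__ == "__main__":
    context = ["the", "red", "car", "is"]
    importance = [0.1, 0.9, 0.8, 0.2]
    print(prune_text_tokens(context, importance, pruning_rate=0.5))
```

No model is retrained in this sketch: only the text context is shortened before inference, which matches the training-free framing. The pruning rate is left as a free parameter, since the abstract reports only a preliminary look at how it should scale with context length.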