CV | Aug 21, 2024

EAGLE: Elevating Geometric Reasoning through LLM-empowered Visual Instruction Tuning

arXiv:2408.11397v2 | 14 citations | h-index: 10
Originality: Incremental advance
AI Analysis

The paper addresses a specific bottleneck in MLLMs for geometric reasoning, namely weak visual perception of diagrams, and offers a domain-specific improvement.

The paper tackles the problem of geometric reasoning in multi-modal large language models (MLLMs) by proposing EAGLE, a visual enhancement framework that improves visual perception through geometric knowledge injection and refinement, achieving strong performance on three benchmarks.

Multi-modal Large Language Models (MLLMs) have advanced greatly in general tasks. However, they still face challenges in geometric reasoning, a task that requires synergistic integration of visual recognition proficiency and complex reasoning strength. Existing MLLMs prioritize optimizing the LLM backbone to enhance problem-solving capabilities, while rarely emphasizing improvements in discerning visual elements. However, we reveal that MLLMs suffer from severe visual perception deficiencies, including inaccurate geometric comprehension and severe visual hallucinations, which constrain their reasoning performance. To address this issue, we revisit geometric reasoning through a visual-centric lens that highlights the role of visual perception. To achieve this, we propose EAGLE, a novel coarse-to-fine visual enhancement framework that progressively leverages LLMs' guidance to improve perception proficiency. Specifically, given the substantial disparity between geometric diagrams and natural images, we first introduce Geometric Knowledge Injection. This process explores fundamental knowledge from diagram-caption data to enhance recognition capabilities and improve geometry-language alignments. Then, recognizing that different elements contribute unequally in the reasoning process, we introduce Geometric Knowledge Refinement. This stage leverages LLM-driven chain-of-thought solutions to guide the vision encoder in adaptively prioritizing key elements, fostering a synergistic interplay between visual comprehension and mathematical reasoning. Finally, we develop EAGLE, a geometry expert with strong perception and reasoning capabilities. Extensive experiments demonstrate its effectiveness on three popular benchmarks.

