GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models
GePBench addresses a foundational gap in MLLM evaluation that is relevant to researchers, though the contribution is incremental because it focuses on benchmarking rather than new model development.
The paper addresses the lack of geometric perception evaluation for multimodal large language models (MLLMs) by introducing GePBench, a benchmark that reveals significant deficiencies in current state-of-the-art models. It also shows that training with GePBench data yields substantial improvements on a wide range of benchmark tasks.
Multimodal large language models (MLLMs) have made significant progress in integrating visual and linguistic understanding. Existing benchmarks typically focus on high-level semantic capabilities, such as scene understanding and visual reasoning, but often overlook a crucial, foundational ability: geometric perception. Geometric perception involves understanding geometric shapes, structures, and spatial relationships, which are essential for supporting higher-level semantic tasks. Despite its importance, this capability remains underexplored in current MLLM research. To address this gap, we introduce GePBench, a novel benchmark designed to assess the geometric perception abilities of MLLMs. Our extensive evaluations reveal that current state-of-the-art MLLMs exhibit significant deficiencies in geometric perception tasks. Furthermore, we show that models trained with GePBench data demonstrate substantial improvements on a wide range of benchmark tasks, highlighting the critical role of geometric perception in enabling advanced multimodal applications. Our code and datasets will be publicly available.