PositionOCR: Augmenting Positional Awareness in Multi-Modal Models via Hybrid Specialist Integration
This work addresses the need for positionally accurate multi-modal models in OCR-centric visual question answering. It is an incremental improvement that combines existing specialist and LLM components.
The paper tackles the lack of positional reasoning in multi-modal large language models on visual tasks such as text spotting. It introduces PositionOCR, a hybrid architecture that integrates a text spotting specialist with an LLM and achieves superior performance in text grounding and text spotting with 131M trainable parameters.
In recent years, Multi-modal Large Language Models (MLLMs) have achieved strong performance in OCR-centric Visual Question Answering (VQA) tasks, illustrating their capability to process heterogeneous data and adapt across varied contexts. However, these MLLMs rely on a Large Language Model (LLM) as the decoder, which is primarily designed for linguistic processing and thus inherently lacks the positional reasoning required for precise visual tasks such as text spotting and text grounding. Additionally, the large parameter counts of MLLMs necessitate substantial computational resources and large-scale data for effective training. Conversely, text spotting specialists achieve state-of-the-art coordinate predictions but lack semantic reasoning capabilities. This dichotomy motivates our key research question: can we combine the efficiency of specialists with the contextual power of LLMs to create a positionally accurate MLLM? To overcome these challenges, we introduce PositionOCR, a parameter-efficient hybrid architecture that integrates a text spotting model's positional strengths with an LLM's contextual reasoning. With 131M trainable parameters, this framework demonstrates outstanding multi-modal processing capabilities, particularly in text grounding and text spotting, consistently surpassing traditional MLLMs.