IR · CV · Mar 17

AutoThinkRAG: Complexity-Aware Control of Retrieval-Augmented Reasoning for Image-Text Interaction

arXiv:2603.05551 · 76.9 · h-index: 8
Predicted impact: top 24% in IR · last 90 days
Originality: Highly original
AI Analysis

This work improves multimodal document QA for applications requiring efficient and robust reasoning over long documents, though it is incremental in enhancing existing RAG systems.

The paper addresses two bottlenecks in multimodal document question answering: static retrieval that ignores query complexity, and end-to-end VLMs that couple visual perception with logical reasoning. AutoThinkRAG achieves 82.13% and 51.29% overall accuracy on DocBench and MMLongBench, respectively, while reducing per-query token consumption by 18.9% and monetary cost by 18.2%.

Multimodal document question answering requires retrieving dispersed evidence from visually rich long documents and performing reliable reasoning over heterogeneous information. Existing multimodal RAG systems remain limited by two bottlenecks: static retrieval that ignores query complexity, and end-to-end Vision-Language Models (VLMs) that couple visual perception with logical reasoning, leading to inefficient computation and unstable answer generation. We propose AutoThinkRAG, a complexity-aware inference architecture for multimodal document QA. It has two components: (1) a Query Complexity Router that analyzes query difficulty and structure to adaptively select retrieval and reasoning paths; and (2) a Perception-Reasoning Decoupling architecture that uses a lightweight VLM as a high-fidelity visual interpreter to convert query-relevant visual cues into textual representations, which are then passed to an LLM for logical reasoning and answer synthesis. This design improves both efficiency and robustness, especially on long-document and unanswerable queries. Experiments on DocBench and MMLongBench show that AutoThinkRAG achieves 82.13% and 51.29% overall accuracy, respectively, while reducing per-query token consumption by 18.9% and monetary cost by 18.2%. Further analyses show that the gains are most pronounced on complex queries requiring adaptive retrieval and multi-step reasoning.
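
The two components described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how the router scores queries, so the keyword heuristic, the three route names, the `top_k` values, and the callables `retrieve`, `describe_visuals`, and `reason` are all hypothetical stand-ins chosen for illustration. The sketch only shows the control flow: a router picks a retrieval and reasoning path from the query, a visual interpreter turns images into text, and a text-only reasoner produces the answer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Route:
    name: str
    top_k: int      # number of chunks to retrieve (assumed values)
    use_vlm: bool   # whether to run the visual interpreter


# Hypothetical routes; the paper's actual paths are not given in the abstract.
ROUTES: Dict[str, Route] = {
    "simple": Route("simple", top_k=2, use_vlm=False),
    "moderate": Route("moderate", top_k=5, use_vlm=True),
    "complex": Route("complex", top_k=10, use_vlm=True),
}

# Assumed cue words; a real router would likely be a learned classifier or an LLM.
MULTI_STEP_CUES = ("compare", "difference", "both", "total", "trend", "across")
VISUAL_CUES = ("figure", "chart", "table", "image", "diagram")


def score_complexity(query: str) -> int:
    """Crude complexity score: long queries, multi-step cues, visual cues."""
    q = query.lower()
    score = 0
    if len(q.split()) > 15:
        score += 1
    score += sum(cue in q for cue in MULTI_STEP_CUES)  # substring match
    score += int(any(cue in q for cue in VISUAL_CUES))
    return score


def route_query(query: str) -> Route:
    """Map the complexity score to one of the hypothetical routes."""
    score = score_complexity(query)
    if score == 0:
        return ROUTES["simple"]
    if score <= 2:
        return ROUTES["moderate"]
    return ROUTES["complex"]


Chunk = Dict[str, Optional[str]]  # {"text": ..., "image": ... or None}


def answer(
    query: str,
    retrieve: Callable[[str, int], List[Chunk]],
    describe_visuals: Callable[[str, List[str]], List[str]],
    reason: Callable[[str, List[str]], str],
) -> str:
    """Route the query, then keep perception and reasoning as separate steps."""
    route = route_query(query)
    chunks = retrieve(query, route.top_k)
    context = [c["text"] for c in chunks if c.get("text")]
    if route.use_vlm:
        # Perception step: the (stubbed) VLM converts images to text only.
        images = [c["image"] for c in chunks if c.get("image")]
        context += describe_visuals(query, images)
    # Reasoning step: the (stubbed) LLM sees text only.
    return reason(query, context)
```

In this sketch, a query such as "Who is the author?" takes the cheapest route and never invokes the visual interpreter, which is one plausible way a router could reduce per-query token consumption; the thresholds would need tuning against real data.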
