LGAI Jul 31, 2025

Zero-Shot Document Understanding using Pseudo Table of Contents-Guided Retrieval-Augmented Generation

arXiv:2507.23217v1
1 citation
h-index: 1
Originality: Incremental advance
AI Analysis

The paper addresses the problem of efficiently processing diverse document elements (text, images, charts, and tables) in document analysis. The contribution is incremental, since it builds on the existing capabilities of multimodal LLMs.

The paper tackles the challenge of understanding complex multimodal documents by introducing DocsRay, a training-free system that combines pseudo Table of Contents (TOC) generation with hierarchical Retrieval-Augmented Generation (RAG). It reduces query latency by 45%, from 3.89 to 2.12 seconds, and DocsRay-Pro reaches 64.7% accuracy on the MMLongBench-Doc benchmark.

Understanding complex multimodal documents remains challenging due to their structural inconsistencies and limited training data availability. We introduce DocsRay, a training-free document understanding system that integrates pseudo Table of Contents (TOC) generation with hierarchical Retrieval-Augmented Generation (RAG). Our approach leverages the native capabilities of multimodal Large Language Models (LLMs) to process documents containing diverse elements such as text, images, charts, and tables without requiring specialized models or additional training. DocsRay's framework combines three key techniques: (1) a semantic structuring module using prompt-based LLM interactions to generate a hierarchical pseudo-TOC, (2) zero-shot multimodal analysis that converts diverse document elements into unified, text-centric representations using the inherent capabilities of multimodal LLMs, and (3) an efficient two-stage hierarchical retrieval system that reduces retrieval complexity from $O(N)$ to $O(S + k_1 \cdot N_s)$. Evaluated on documents averaging 49.4 pages and 20,971 textual tokens, DocsRay reduced query latency from 3.89 to 2.12 seconds, achieving a 45% efficiency improvement. On the MMLongBench-Doc benchmark, DocsRay-Pro attains an accuracy of 64.7%, substantially surpassing previous state-of-the-art results.
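
The two-stage retrieval idea in the abstract can be illustrated with a minimal sketch. This is not DocsRay's implementation: the `Section` class, the `hierarchical_retrieve` and `toy_sections` functions, the use of cosine similarity, and the toy vectors are all hypothetical. The abstract does not define its symbols, so the sketch assumes N is the total number of chunks, S the number of pseudo-TOC sections, k_1 the number of sections kept after the first stage, and N_s the number of chunks per section.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Section:
    """One pseudo-TOC section (hypothetical structure, not from the paper)."""
    title: str
    vector: list[float]  # section-level embedding, assumed precomputed
    chunks: list[tuple[str, list[float]]] = field(default_factory=list)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hierarchical_retrieve(query_vec, sections, k1=2, k2=3):
    """Return (top-k2 chunk texts, number of similarity comparisons)."""
    comparisons = 0

    # Stage 1: score every section once, which costs S comparisons.
    scored_sections = []
    for section in sections:
        scored_sections.append((cosine(query_vec, section.vector), section))
        comparisons += 1
    top_sections = sorted(scored_sections, key=lambda p: p[0], reverse=True)[:k1]

    # Stage 2: score only the chunks inside the k1 selected sections,
    # which costs roughly k1 * N_s comparisons instead of N.
    scored_chunks = []
    for _, section in top_sections:
        for text, vec in section.chunks:
            scored_chunks.append((cosine(query_vec, vec), text))
            comparisons += 1
    scored_chunks.sort(key=lambda p: p[0], reverse=True)
    return [text for _, text in scored_chunks[:k2]], comparisons


def toy_sections():
    """Build 3 sections with 4 chunks each (N = 12) from made-up vectors."""
    sections = []
    for i in range(3):
        axis = [0.0, 0.0, 0.0]
        axis[i] = 1.0
        section = Section(title=f"s{i}", vector=axis)
        for j in range(4):
            vec = list(axis)
            vec[(i + 1) % 3] = 0.1 * j  # drift away from the section axis
            section.chunks.append((f"s{i}-c{j}", vec))
        sections.append(section)
    return sections
```

On this toy data, calling `hierarchical_retrieve([1.0, 0.0, 0.0], toy_sections(), k1=1, k2=2)` performs 3 section comparisons plus 4 chunk comparisons, 7 in total, where a flat scan over all chunks would need 12. The saving grows with document length as long as S and k_1 stay small relative to N.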

