CL AI IR · Oct 29, 2025

Model-Document Protocol for AI Search

arXiv:2510.25160v2 · 2 citations · h-index: 3

Originality: Incremental advance

AI Analysis

This paper addresses inefficient retrieval in AI search by proposing a new paradigm for how models interact with documents. The contribution appears incremental, since it builds on existing retrieval and agentic methods.

The paper tackles the problem of linking large language models (LLMs) with unstructured external documents by introducing the Model-Document Protocol (MDP), a framework that transforms raw text into task-specific, LLM-ready knowledge representations, and shows that its agentic instantiation, MDP-Agent, outperforms baselines on information-seeking benchmarks.

AI search depends on linking large language models (LLMs) with vast external knowledge sources. Yet web pages, PDF files, and other raw documents are not inherently LLM-ready: they are long, noisy, and unstructured. Conventional retrieval methods treat these documents as verbatim text and return raw passages, leaving the burden of fragment assembly and contextual reasoning to the LLM. This gap underscores the need for a new retrieval paradigm that redefines how models interact with documents. We introduce the Model-Document Protocol (MDP), a general framework that formalizes how raw text is bridged to LLMs through consumable knowledge representations. Rather than treating retrieval as passage fetching, MDP defines multiple pathways that transform unstructured documents into task-specific, LLM-ready inputs. These include agentic reasoning, which curates raw evidence into coherent context; memory grounding, which accumulates reusable notes to enrich reasoning; and structured leveraging, which encodes documents into formal representations such as graphs or key-value caches. All three pathways share the same goal: ensuring that what reaches the LLM is not raw fragments but compact, structured knowledge directly consumable for reasoning. As an instantiation, we present MDP-Agent, which realizes the protocol through an agentic process: constructing document-level gist memories for global coverage, performing diffusion-based exploration with vertical exploitation to uncover layered dependencies, and applying map-reduce style synthesis to integrate large-scale evidence into compact yet sufficient context. Experiments on information-seeking benchmarks demonstrate that MDP-Agent outperforms baselines, validating both the soundness of the MDP framework and the effectiveness of its agentic instantiation.
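To make the gist-memory and map-reduce steps in the abstract more concrete, here is a minimal Python sketch. It is not the authors' implementation: every name in it (`build_gists`, `select_docs`, `map_evidence`, `reduce_evidence`, `build_context`) is hypothetical, keyword overlap stands in for the LLM calls a real agent would make, and the diffusion-based exploration step is left out entirely. It only illustrates the general shape of turning raw documents into a compact context instead of returning verbatim passages.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


def tokenize(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for LLM relevance judgments."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter on terminal punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


@dataclass
class Gist:
    doc_id: str
    summary: str  # assumption: the first sentence serves as the document gist


def build_gists(docs: dict[str, str]) -> list[Gist]:
    """Gist memory: one short note per document, for global coverage."""
    gists = []
    for doc_id, text in docs.items():
        sentences = split_sentences(text)
        gists.append(Gist(doc_id, sentences[0] if sentences else ""))
    return gists


def select_docs(query: str, gists: list[Gist], top_k: int = 2) -> list[str]:
    """Rank documents by word overlap between the query and each gist."""
    q = tokenize(query)
    scored = [(len(q & tokenize(g.summary)), g.doc_id) for g in gists]
    ranked = sorted(scored, key=lambda pair: -pair[0])  # stable sort keeps input order on ties
    return [doc_id for score, doc_id in ranked[:top_k] if score > 0]


def map_evidence(query: str, text: str) -> list[str]:
    """Map step: keep only the sentences of one document that share a word with the query."""
    q = tokenize(query)
    return [s for s in split_sentences(text) if q & tokenize(s)]


def reduce_evidence(per_doc: dict[str, list[str]], max_sentences: int = 5) -> str:
    """Reduce step: merge per-document evidence into one deduplicated, bounded context."""
    seen: set[str] = set()
    lines: list[str] = []
    for doc_id, sentences in per_doc.items():
        for sentence in sentences:
            if sentence in seen or len(lines) >= max_sentences:
                continue
            seen.add(sentence)
            lines.append(f"[{doc_id}] {sentence}")
    return "\n".join(lines)


def build_context(query: str, docs: dict[str, str], top_k: int = 2) -> str:
    """Gists -> document selection -> map -> reduce."""
    selected = select_docs(query, build_gists(docs), top_k=top_k)
    per_doc = {doc_id: map_evidence(query, docs[doc_id]) for doc_id in selected}
    return reduce_evidence(per_doc)
```

Calling `build_context(query, docs)` with a dictionary mapping document IDs to raw text returns a short, source-tagged context string. In an actual agentic system, each of these heuristics would presumably be replaced by model calls, and the selection step would iterate rather than run once.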
