IR · Apr 1

Evidence Units: Ontology-Grounded Document Organization for Parser-Independent Retrieval

arXiv:2604.00500 · 13.2

Predicted impact: top 72% in IR · last 90 days · Originality: Highly original

AI Analysis

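This work addresses the scattering of semantically cohesive units (tables with captions, figures with explanations) that occurs when structured documents are indexed element by element. It offers a parser-independent chunking pipeline whose retrieval gains hold across the MinerU and Docling parsers.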

The paper tackles fragmented document indexing by introducing Evidence Units (EUs), chunks that group visual assets with their contextual text. On OmniDocBench v1.0, EU-based chunking raises retrieval LCS from 0.50 to 0.81 and Recall@1 from 0.15 to 0.51.
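For readers unfamiliar with the metric, Recall@1 can be illustrated with a minimal sketch. It assumes the common hit-rate reading (a query counts as a hit when at least one relevant chunk appears in the top k results); the paper's exact definition is not given on this page, and the function name and toy data below are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=1):
    """Fraction of queries with at least one relevant chunk in the top k.

    Assumption: hit-rate style definition, not confirmed by the source.
    """
    hits = 0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        # A query is a hit if any of its top-k chunk ids is relevant.
        if any(chunk_id in relevant for chunk_id in ranked[:k]):
            hits += 1
    return hits / len(ranked_ids)


# Toy data (hypothetical): ranked chunk ids for three queries.
ranked = [["c1", "c2"], ["c4", "c3"], ["c5", "c6"]]
relevant = [{"c1"}, {"c3"}, {"c6"}]

r1 = recall_at_k(ranked, relevant, k=1)  # only the first query hits at rank 1
r2 = recall_at_k(ranked, relevant, k=2)  # every query hits within the top 2
```

Under this reading, a Recall@1 of 0.51 would mean that roughly half of the queries retrieve a relevant chunk as the very first result.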

Structured documents--tables paired with captions, figures with explanations, equations with the paragraphs that interpret them--are routinely fragmented when indexed for retrieval. Element-level indexing treats every parsed element as an independent chunk, scattering semantically cohesive units across separate retrieval candidates. This paper presents a parser-independent pipeline that constructs Evidence Units (EUs): semantically complete document chunks that group visual assets with their contextual text. We introduce four contributions: (1) ontology-grounded role normalization extending DoCO that maps heterogeneous parser outputs to a unified semantic schema; (2) a semantic global assignment algorithm that optimally assigns paragraphs to EUs via a full similarity matrix; (3) a graph-based decision layer in Neo4j that formalizes EU construction rules and validates completeness through two invariants; and (4) cross-parser validation showing EU spatial footprints converge across MinerU and Docling, with gains preserved under parser-induced bbox variance. Experiments on OmniDocBench v1.0 (1,340 pages; 1,551 QA pairs) show EU-based chunking improves retrieval LCS by +0.31 (0.50 to 0.81). Recall@1 increases from 0.15 to 0.51 (3.4x) and MinK decreases from 2.58 to 1.72. Cross-parser results confirm the gain (LCS +0.23 to +0.31) is preserved across parsers. Text queries show the most dramatic gain: Recall@1 rises from 0.08 to 0.47.
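The second contribution, assigning paragraphs to EUs via a full similarity matrix, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' algorithm: it assumes each paragraph and each EU anchor (for example a table or figure with its caption) already has an embedding vector, uses cosine similarity, and picks the best EU per paragraph with a hypothetical `min_sim` cutoff. Per-paragraph argmax maximizes total similarity only when EUs have no capacity constraints; the paper may impose further rules.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_paragraphs(paragraph_vecs, eu_vecs, min_sim=0.3):
    """Assign each paragraph to its most similar EU anchor.

    Returns the full paragraph-by-EU similarity matrix and a list holding,
    for each paragraph, the index of its EU or None when no EU reaches
    min_sim. The threshold value is a hypothetical choice. Requires at
    least one EU vector.
    """
    # Full similarity matrix: one row per paragraph, one column per EU.
    sim = [[cosine(p, e) for e in eu_vecs] for p in paragraph_vecs]
    assignment = []
    for row in sim:
        best = max(range(len(row)), key=row.__getitem__)
        assignment.append(best if row[best] >= min_sim else None)
    return sim, assignment


# Toy 2-D embeddings (hypothetical): two EU anchors, four paragraphs.
eu_vecs = [[1.0, 0.0], [0.0, 1.0]]
paragraph_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

sim, assignment = assign_paragraphs(paragraph_vecs, eu_vecs)
```

In this toy run the first two paragraphs attach to the first EU, the third to the second EU, and the last paragraph stays unassigned because no EU reaches the threshold. Computing the whole matrix before assigning is what distinguishes a global approach from attaching each paragraph to its nearest neighbor in reading order.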

