TopoChunker: Topology-Aware Agentic Document Chunking Framework
The work addresses the degradation of retrieval quality in RAG systems on documents such as unstructured narratives and complex reports, though it appears incremental because it builds on existing chunking methods.
The paper tackles semantic fragmentation in document chunking for Retrieval-Augmented Generation (RAG) by proposing TopoChunker, a framework that preserves topological hierarchies. It reports state-of-the-art performance, with an 8.0% absolute improvement in generation accuracy over the strongest LLM-based baseline and a 23.5% reduction in token overhead.
Current document chunking methods for Retrieval-Augmented Generation (RAG) typically linearize text. This forced linearization strips away intrinsic topological hierarchies, creating "semantic fragmentation" that degrades downstream retrieval quality. In this paper, we propose TopoChunker, an agentic framework that maps heterogeneous documents onto a Structured Intermediate Representation (SIR) to explicitly preserve cross-segment dependencies. To balance structural fidelity with computational cost, TopoChunker employs a dual-agent architecture. An Inspector Agent dynamically routes documents through cost-optimized extraction paths, while a Refiner Agent performs capacity auditing and topological context disambiguation to reconstruct hierarchical lineage. Evaluated on unstructured narratives (GutenQA) and complex reports (GovReport), TopoChunker demonstrates state-of-the-art performance. It outperforms the strongest LLM-based baseline by 8.0% in absolute generation accuracy and achieves an 83.26% Recall@3, while simultaneously reducing token overhead by 23.5%, offering a scalable approach for structure-aware RAG.
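To make the contrast between linearized chunks and lineage-preserving chunks concrete, here is a minimal Python sketch. It is not the paper's SIR schema or either agent: the `Node` class, the `chunk_with_lineage` and `linear_chunks` helpers, and the sample report are all hypothetical names and data chosen for illustration, under the assumption that a document has already been parsed into a tree of titled sections.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a hypothetical document tree (not the paper's SIR schema)."""
    title: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def chunk_with_lineage(node, lineage=()):
    """Yield one chunk per non-empty node, tagged with its ancestor titles."""
    path = lineage + (node.title,)
    if node.text:
        yield {"lineage": " > ".join(path), "text": node.text}
    for child in node.children:
        yield from chunk_with_lineage(child, path)


def linear_chunks(node):
    """Baseline: flatten the same tree into bare text, dropping all titles."""
    if node.text:
        yield node.text
    for child in node.children:
        yield from linear_chunks(child)


# Illustrative sample data, invented for this sketch.
report = Node("Annual Report", children=[
    Node("Findings", "Overview of audit results.", children=[
        Node("Budget", "Spending rose in this category."),
    ]),
    Node("Recommendations", "Adopt the revised process."),
])

chunks = list(chunk_with_lineage(report))
for chunk in chunks:
    print(f"[{chunk['lineage']}] {chunk['text']}")
```

In the flattened baseline, the segment "Spending rose in this category." no longer says which category it refers to. The lineage-tagged version keeps "Annual Report > Findings > Budget" attached to the same text, which is the kind of cross-segment dependency the abstract describes as lost under forced linearization.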