Topo-RAG: Topology-aware retrieval for hybrid text-table documents
This work addresses the challenge of retrieving information from complex, multidimensional documents in enterprise data, and represents an incremental improvement over existing RAG systems.
The paper tackles retrieval from hybrid text-table documents in enterprise datasets by proposing Topo-RAG, a framework that processes text and tables separately so that the tables' spatial relationships are preserved. On hybrid queries, it reports an 18.4% improvement in nDCG@10 over standard linearization methods.
In enterprise datasets, documents are rarely pure. They are not just text, nor just numbers; they are a complex amalgam of narrative and structure. Current Retrieval-Augmented Generation (RAG) systems have attempted to address this complexity with a blunt tool: linearization. We convert rich, multidimensional tables into simple Markdown-style text strings, hoping that an embedding model will capture the geometry of a spreadsheet in a single vector. But it has already been shown that this is mathematically insufficient. This work presents Topo-RAG, a framework that challenges the assumption that "everything is text". We propose a dual architecture that respects the topology of the data: we route fluid narrative through traditional dense retrievers, while tabular structures are processed by a Cell-Aware Late Interaction mechanism, preserving their spatial relationships. Evaluated on SEC-25, a synthetic enterprise corpus that mimics real-world complexity, Topo-RAG demonstrates an 18.4% improvement in nDCG@10 on hybrid queries compared to standard linearization approaches. It's not just about searching better; it's about understanding the shape of information.
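The dual routing described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the chunk dictionary layout, the function names, the choice of cosine similarity, and the ColBERT-style MaxSim score averaged over query tokens are all assumptions, because the abstract does not specify how Cell-Aware Late Interaction is computed or how scores from the two routes are combined. The embeddings are toy 2-D vectors standing in for real model outputs.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dense_score(query_vec: Vector, passage_vec: Vector) -> float:
    """Narrative route: one vector per query, one vector per passage."""
    return cosine(query_vec, passage_vec)


def late_interaction_score(query_token_vecs: List[Vector],
                           cell_vecs: List[Vector]) -> float:
    """Table route (assumed MaxSim variant): each query token is matched
    to its most similar cell, and the per-token maxima are averaged.
    Cells keep their own vectors instead of being pooled into one."""
    if not query_token_vecs or not cell_vecs:
        return 0.0
    best_per_token = [
        max(cosine(q, c) for c in cell_vecs) for q in query_token_vecs
    ]
    return sum(best_per_token) / len(best_per_token)


def score_chunk(chunk: Dict, query_vec: Vector,
                query_token_vecs: List[Vector]) -> float:
    """Route a chunk by its kind. The 'kind', 'cells', and 'vector' keys
    are hypothetical names chosen for this sketch."""
    if chunk["kind"] == "table":
        return late_interaction_score(query_token_vecs, chunk["cells"])
    return dense_score(query_vec, chunk["vector"])


if __name__ == "__main__":
    # Toy query with two token vectors and one pooled vector.
    q_tokens = [[1.0, 0.0], [0.0, 1.0]]
    q_pooled = [0.5, 0.5]

    chunks = [
        {"kind": "table", "cells": [[1.0, 0.0], [0.0, 1.0], [0.2, 0.1]]},
        {"kind": "text", "vector": [0.9, 0.1]},
    ]
    for chunk in chunks:
        print(chunk["kind"], round(score_chunk(chunk, q_pooled, q_tokens), 3))
```

Averaging the per-token maxima keeps the table score in the same range as the cosine score of the narrative route, which makes the two loosely comparable in this toy setting. A real system would still need an explicit fusion or calibration step, which the abstract does not describe.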