CrossFormer: Cross-Segment Semantic Fusion for Document Segmentation
CrossFormer addresses document segmentation for text processing and RAG systems, proposing a new method for a known bottleneck: semantic information is lost when long documents are pre-split into segments.
The paper tackles text semantic segmentation, where traditional methods lose critical semantic information across document segments because documents are pre-split to fit input length limits. It presents CrossFormer, a transformer-based model with a cross-segment fusion module, and reports state-of-the-art performance on public datasets along with considerable gains on RAG benchmarks.
Text semantic segmentation involves partitioning a document into multiple paragraphs with continuous semantics based on the subject matter, contextual information, and document structure. Traditional approaches have typically relied on preprocessing documents into segments to address input length constraints, resulting in the loss of critical semantic information across segments. To address this, we present CrossFormer, a transformer-based model featuring a novel cross-segment fusion module that dynamically models latent semantic dependencies across document segments, substantially improving segmentation accuracy. Additionally, CrossFormer can replace rule-based chunking methods within a Retrieval-Augmented Generation (RAG) system, producing more semantically coherent chunks that improve the system's effectiveness. Comprehensive evaluations confirm CrossFormer's state-of-the-art performance on public text semantic segmentation datasets, alongside considerable gains on RAG benchmarks.
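To make the contrast between rule-based chunking and segmentation-driven chunking concrete, the following is a minimal Python sketch. It is not the CrossFormer model. It assumes a segmenter has already produced one boundary flag per sentence, and the function names (`fixed_size_chunks`, `boundary_chunks`), the sample sentences, and the hand-written `predicted` flags are all hypothetical illustrations rather than anything taken from the paper.

```python
from typing import List


def fixed_size_chunks(sentences: List[str], size: int) -> List[str]:
    """Rule-based baseline: group every `size` sentences, ignoring topic."""
    return [
        " ".join(sentences[i:i + size])
        for i in range(0, len(sentences), size)
    ]


def boundary_chunks(sentences: List[str], is_boundary: List[bool]) -> List[str]:
    """Close a chunk after each sentence flagged as a segment boundary."""
    if len(sentences) != len(is_boundary):
        raise ValueError("need exactly one boundary flag per sentence")
    chunks: List[str] = []
    current: List[str] = []
    for sentence, boundary in zip(sentences, is_boundary):
        current.append(sentence)
        if boundary:
            chunks.append(" ".join(current))
            current = []
    if current:  # trailing sentences with no closing boundary flag
        chunks.append(" ".join(current))
    return chunks


sentences = [
    "Transformers use self-attention.",
    "Attention cost grows with input length.",
    "Long documents are therefore split into segments.",
    "Retrieval systems index text chunks.",
    "Chunk quality affects what the generator sees.",
]
# Hypothetical segmenter output (hand-written here, not model predictions):
# True marks the last sentence of a semantic paragraph.
predicted = [False, False, True, False, True]

semantic = boundary_chunks(sentences, predicted)
baseline = fixed_size_chunks(sentences, 2)

for chunk in semantic:
    print("SEMANTIC:", chunk)
for chunk in baseline:
    print("BASELINE:", chunk)
```

In this toy input the fixed-size baseline places the third and fourth sentences in one chunk even though they belong to different topics, while the boundary-driven version keeps each topic together. A real pipeline would obtain the boundary flags from a trained segmentation model instead of a hand-written list.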