IR | LG | Aug 28, 2025

SEAL: Structure and Element Aware Learning to Improve Long Structured Document Retrieval

arXiv:2508.20778v2 | h-index: 2 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses retrieval inefficiencies for users handling long structured documents, but the advance is incremental because it builds on existing contrastive learning methods.

The paper tackles long structured document retrieval by proposing SEAL, a contrastive learning framework that combines structure-aware learning with masked element alignment, raising NDCG@10 on BGE-M3 from 73.96% to 77.84%.
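For readers unfamiliar with the reported metric, the following is a minimal sketch of how NDCG@10 can be computed for a single query. It is not taken from the paper's code: the function names are hypothetical, the linear gain (rather than the exponential `2^rel - 1` variant) is an assumption, and the input list is assumed to contain the relevance labels of all judged documents in ranked order.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels (linear gain, assumed)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical ranking: the only relevant document appears in second position.
score = ndcg_at_k([0, 1, 0])
```

A reported figure such as 77.84% would typically be this per-query score averaged over all evaluation queries.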

In long structured document retrieval, existing methods typically fine-tune pre-trained language models (PLMs) using contrastive learning on datasets lacking explicit structural information. This practice suffers from two critical issues: 1) current methods fail to leverage structural features and element-level semantics effectively, and 2) datasets containing structural metadata are lacking. To bridge these gaps, we propose SEAL, a novel contrastive learning framework. It leverages structure-aware learning to preserve semantic hierarchies and masked element alignment for fine-grained semantic discrimination. Furthermore, we release a long structured document retrieval dataset with rich structural annotations. Extensive experiments on both released and industrial datasets across various modern PLMs, along with online A/B testing, demonstrate consistent performance improvements, boosting NDCG@10 from 73.96% to 77.84% on BGE-M3. The resources are available at https://github.com/xinhaoH/SEAL.
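The abstract does not spell out the training objectives, so the following is only a generic sketch of the kind of contrastive (InfoNCE-style) loss commonly used to fine-tune retrieval encoders, not the authors' implementation. The function names, the use of cosine similarity, and the temperature value of 0.05 are all assumptions made for illustration; the structure-aware and masked element alignment components are not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def info_nce(query, positive, negatives, temperature=0.05):
    """Negative log-probability of the positive among positive + negatives.

    temperature=0.05 is an illustrative assumption, not a value from the paper.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[0]

# Toy embeddings: the loss is small when the query matches the positive
# and large when it matches a negative instead.
easy = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
hard = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In practice the embeddings would come from a PLM encoder and the negatives from other documents in the batch; the toy vectors above only illustrate the behavior of the loss.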

