CL · Feb 21, 2024

Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction

arXiv:2402.13906v2 · 26 citations · h-index: 8 · ACL
Originality: Incremental advance
AI Analysis

The paper addresses the need for automated structure extraction in domains such as legal, medical, and financial documents, where a shared structure can aid both human users and structure-aware models. It is an incremental advance, since it builds on existing graph-based and similarity techniques.

The paper tackles the problem of extracting the typical document structure of a collection: capturing topics that recur across documents while handling header paraphrases and sections unique to a single document. Evaluations on three diverse domains, in English and Hebrew, indicate that the method extracts meaningful collection-wide structure.

Document collections in various domains, e.g., legal, medical, or financial, often share an underlying collection-wide structure, which captures information that can aid both human users and structure-aware models. We propose to identify the typical structure of a document within a collection, which requires capturing recurring topics across the collection while abstracting over arbitrary header paraphrases, and grounding each topic to its respective document locations. These requirements pose several challenges: headers that mark recurring topics frequently differ in phrasing, certain section headers are unique to individual documents and do not reflect the typical structure, and the order of topics can vary between documents. To address these challenges, we develop an unsupervised graph-based method that leverages both inter- and intra-document similarities to extract the underlying collection-wide structure. Our evaluations on three diverse domains in both English and Hebrew indicate that our method extracts meaningful collection-wide structure, and we hope that future work will leverage our method for multi-document applications and structure-aware models.
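The abstract does not spell out the algorithm, so the following is only a toy illustration of the general idea of a header-similarity graph, not the authors' method. It assumes that each document is given as a list of header strings, that word-level Jaccard overlap is an acceptable stand-in for a real similarity model, and that a topic is "typical" when it appears in at least half of the documents. The names `typical_structure`, `sim_threshold`, and `min_support` are hypothetical. Only inter-document edges are used here; the intra-document similarities mentioned in the abstract are left out.

```python
from itertools import combinations


def tokens(header):
    """Lowercase a header and split it into a set of words."""
    return set(header.lower().split())


def jaccard(a, b):
    """Word-overlap similarity between two headers (assumed stand-in metric)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def typical_structure(docs, sim_threshold=0.5, min_support=0.5):
    """Group similar headers across documents and keep the recurring groups.

    docs: list of documents, each a list of header strings.
    Returns a sorted list of clusters; each cluster is a sorted list of
    (doc_index, header_index) pairs, i.e. the locations a topic is grounded to.
    """
    # One graph node per header occurrence.
    nodes = [(d, h) for d, headers in enumerate(docs) for h in range(len(headers))]
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Connect headers from different documents that look alike.
    for a, b in combinations(nodes, 2):
        if a[0] == b[0]:
            continue  # toy version: skip intra-document pairs
        if jaccard(docs[a[0]][a[1]], docs[b[0]][b[1]]) >= sim_threshold:
            parent[find(a)] = find(b)

    # Collect connected components.
    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), []).append(n)

    # Keep components that cover enough distinct documents.
    kept = []
    for members in clusters.values():
        support = len({d for d, _ in members}) / len(docs)
        if support >= min_support:
            kept.append(sorted(members))
    return sorted(kept)


if __name__ == "__main__":
    example = [
        ["Risk Factors", "Legal Proceedings", "A Note From Our Founder"],
        ["Key Risk Factors", "Legal Proceedings"],
        ["Risk Factors", "Pending Legal Proceedings"],
    ]
    for cluster in typical_structure(example):
        print([example[d][h] for d, h in cluster])
```

In this sketch, paraphrased headers such as "Key Risk Factors" and "Risk Factors" end up in the same component, while a header that occurs in only one document falls below the support threshold and is dropped. Word overlap cannot handle paraphrases that share no vocabulary, and connected components can chain loosely related headers together; a real system would need a stronger similarity model and a more careful clustering step.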

