CL · Apr 28, 2023

CED: Catalog Extraction from Documents

arXiv:2304.14662v1 · 1 citation · h-index: 28 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the tedious task of sentence-by-sentence information extraction faced by researchers and practitioners who work with long documents. The advance is incremental, since it builds on existing parsing methods.

The paper tackles the problem of extracting catalogs from long documents to aid information extraction, proposing a transition-based framework that outperforms baseline systems and demonstrates good transfer ability.

Sentence-by-sentence information extraction from long documents is an exhausting and error-prone task. As indicators of a document's skeleton, catalogs naturally chunk documents into segments and provide informative cascade semantics, which can help to reduce the search space. Despite their usefulness, catalogs are hard to extract without assistance from external knowledge. For documents that adhere to a specific template, regular expressions are a practical way to extract catalogs. However, handcrafted heuristics are not applicable when processing documents from different sources with diverse formats. To address this problem, we build a large manually annotated corpus, which is the first dataset for the Catalog Extraction from Documents (CED) task. Based on this corpus, we propose a transition-based framework for parsing documents into catalog trees. The experimental results demonstrate that our proposed method outperforms baseline systems and transfers well. We believe the CED task could fill the gap between raw text segments and information extraction tasks on extremely long documents. Data and code are available at https://github.com/Spico197/CatalogExtraction
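
To make the notion of a catalog tree concrete, here is a minimal sketch of the regular-expression approach that the abstract describes as workable only for template-conforming documents. It is not the paper's transition-based method. The numbering template ("1", "1.2", "1.2.3" followed by a title) and the names `HEADING`, `Node`, `build_catalog`, and `outline` are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field

# Assumed template: a heading is a dotted number followed by a title,
# e.g. "1 Overview" or "2.1.3 Details". Real documents vary widely.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")


@dataclass
class Node:
    title: str
    depth: int
    children: list = field(default_factory=list)
    text: list = field(default_factory=list)


def build_catalog(lines):
    """Parse lines into a catalog tree using the assumed heading template."""
    root = Node("ROOT", 0)
    stack = [root]  # path from the root to the current heading
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match is None:
            # Body text attaches to the most recent heading.
            stack[-1].text.append(line)
            continue
        depth = match.group(1).count(".") + 1
        # Close headings at the same or a deeper level.
        while stack[-1].depth >= depth:
            stack.pop()
        node = Node(match.group(2), depth)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def outline(node, indent=0):
    """Return the heading titles as an indented list of strings."""
    result = []
    for child in node.children:
        result.append("  " * indent + child.title)
        result.extend(outline(child, indent + 1))
    return result


doc = ["1 Overview", "Some text.", "1.1 Scope", "2 Results"]
print("\n".join(outline(build_catalog(doc))))
```

The pattern is brittle by design: a body line that happens to start with a number, such as "2023 was a strong year", would be treated as a heading, and a document with unnumbered headings would yield no catalog at all. This is the kind of failure on diverse formats that the abstract cites as motivation for a learned parser.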

Code Implementations: 1 repo