CL, AI | Sep 16, 2024

AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing

arXiv:2409.10016v2 | 1 citation | h-index: 31 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the problem of parsing academic literature for researchers and developers in data-centric AI, providing a new dataset and model, but it is incremental as it builds on existing multimodal approaches.

The paper tackles the challenge of parsing diverse structured texts in academic literature by introducing AceParse, a comprehensive dataset covering formulas, tables, lists, algorithms, and sentences with embedded mathematical expressions, and fine-tuning a multimodal model, AceParser, which outperforms the previous state-of-the-art by 4.1% in F1 score and 5% in Jaccard Similarity.

With the development of data-centric AI, the focus has shifted from model-driven approaches to improving data quality. Academic literature, one of the crucial data types, is predominantly stored in PDF format and needs to be parsed into text before further processing. However, parsing diverse structured texts in academic literature remains challenging due to the lack of datasets that cover various text structures. In this paper, we introduce AceParse, the first comprehensive dataset designed to support the parsing of a wide range of structured texts, including formulas, tables, lists, algorithms, and sentences with embedded mathematical expressions. Based on AceParse, we fine-tuned a multimodal model, named AceParser, which accurately parses various structured texts within academic literature. This model outperforms the previous state-of-the-art by 4.1% in terms of F1 score and by 5% in Jaccard Similarity, demonstrating the potential of multimodal models in academic literature parsing. Our dataset is available at https://github.com/JHW5981/AceParse.
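
The abstract reports gains in F1 score and Jaccard Similarity without defining how either is computed. As a rough illustration only, the sketch below computes both between a predicted parse and a reference string. It assumes whitespace tokenization, a multiset-overlap F1, and a set-based Jaccard; the function names are hypothetical and the paper's actual evaluation protocol may differ.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: plain whitespace tokenization; the paper's tokenizer may differ.
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall, counting repeated tokens."""
    pred, ref = Counter(tokenize(prediction)), Counter(tokenize(reference))
    overlap = sum((pred & ref).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def jaccard_similarity(prediction: str, reference: str) -> float:
    """Size of the token-set intersection divided by the size of the union."""
    pred, ref = set(tokenize(prediction)), set(tokenize(reference))
    if not pred and not ref:
        return 1.0  # two empty outputs are treated as identical
    return len(pred & ref) / len(pred | ref)
```

For example, comparing the prediction "a b c d" with the reference "a b c e" gives three shared tokens out of four on each side, so the F1 is 0.75, while the Jaccard similarity is 3 shared tokens over 5 distinct tokens, or 0.6.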

Code Implementations: 1 repo
