AI CV DL | Jan 28, 2023

ACL-Fig: A Dataset for Scientific Figure Classification

Zeba Karishma, Shaurya Rohatgi, Kavya Shrinivas Puranik, Jian Wu, C. Lee Giles

arXiv:2301.12293v1 | 14.616 citations | h-index: 99 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for annotated datasets that enable classification, question answering, and auto-captioning of scientific figures. It primarily benefits researchers and developers working on academic search and AI, though the contribution is incremental because it builds on existing extraction and classification methods.

The authors tackled the lack of large-scale retrieval services for scientific figures by creating ACL-Fig, the first large-scale automatically annotated corpus of 112,052 scientific figures extracted from ~56K ACL Anthology papers, with a manually labeled subset of 1,671 figures across 19 categories.

Most existing large-scale academic search engines are built to retrieve text-based information. However, there are no large-scale retrieval services for scientific figures and tables. One challenge for such services is understanding scientific figures' semantics, such as their types and purposes. A key obstacle is the need for datasets containing annotated scientific figures and tables, which can then be used for classification, question-answering, and auto-captioning. Here, we develop a pipeline that extracts figures and tables from the scientific literature and a deep-learning-based framework that classifies scientific figures using visual features. Using this pipeline, we built the first large-scale automatically annotated corpus, ACL-Fig, consisting of 112,052 scientific figures extracted from ~56K research papers in the ACL Anthology. The ACL-Fig-Pilot dataset contains 1,671 manually labeled scientific figures belonging to 19 categories. The dataset is accessible at https://huggingface.co/datasets/citeseerx/ACL-fig under a CC BY-NC license.
