IR
Oct 23, 2020

Extracting Body Text from Academic PDF Documents for Text Mining

arXiv:2010.12647v1

3.07 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for clean text extraction in academic research and text-mining applications. It is an incremental advance because it builds on existing layout detection and text processing methods.

The paper tackles accurate extraction of body text from academic PDF documents for text mining. It reports average F1 scores of 0.99 for extracting sentences, 0.96 for extracting paragraphs, and 0.98 for removing nonbody text in tables, figures, and charts.
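For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it over sets of extracted and reference sentences. The function name `f1_score` and the exact-match comparison are assumptions made for illustration; the summary does not state how the paper matches extracted units against its ground truth.

```python
from typing import Iterable


def f1_score(extracted: Iterable[str], reference: Iterable[str]) -> float:
    """F1 over exact-match units (hypothetical helper, not the paper's evaluator)."""
    extracted_set, reference_set = set(extracted), set(reference)
    true_positives = len(extracted_set & reference_set)
    if true_positives == 0:
        # Covers empty inputs and the no-overlap case without dividing by zero.
        return 0.0
    precision = true_positives / len(extracted_set)
    recall = true_positives / len(reference_set)
    return 2 * precision * recall / (precision + recall)
```

A score near 1.0 requires both few spurious units (high precision) and few missed units (high recall), which is why it is a stricter summary than either number alone.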

Accurate extraction of body text from PDF-formatted academic documents is essential in text-mining applications for deeper semantic understanding. The objective is to extract complete sentences in the body text into a txt file with the original sentence flow and paragraph boundaries. Existing tools for extracting text from PDF documents often mix body and nonbody text. We devise and implement a system called PDFBoT to detect multiple-column layouts using a line-sweeping technique, remove nonbody text using computed text features and syntactic tagging in backward traversal, and align the remaining text back to sentences and paragraphs. We show that PDFBoT is highly accurate, with average F1 scores of 0.99 on extracting sentences, 0.96 on extracting paragraphs, and 0.98 on removing text in tables, figures, and charts, over a corpus of PDF documents randomly selected from arXiv.org across multiple academic disciplines.
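The abstract names a line-sweeping technique for detecting multiple-column layouts but gives no details here. As a rough illustration of the general idea only, the sketch below sweeps a vertical line across the bounding boxes of text lines and reports horizontal intervals that no box covers. The names `find_column_gaps` and `split_into_columns`, the `min_gap` threshold, and the box format `(x0, y0, x1, y1)` with y increasing downward are all assumptions; this is not PDFBoT's actual implementation.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def find_column_gaps(boxes: List[Box], min_gap: float = 10.0) -> List[Tuple[float, float]]:
    """Sweep left to right and return interior x-intervals covered by no box."""
    events = []
    for x0, _, x1, _ in boxes:
        events.append((x0, 1))    # the sweep line enters a box
        events.append((x1, -1))   # the sweep line leaves a box
    # At equal x, process entries before exits so touching boxes leave no gap.
    events.sort(key=lambda e: (e[0], -e[1]))

    gaps = []
    active = 0        # number of boxes the sweep line currently crosses
    gap_start = None  # x where coverage last dropped to zero
    for x, delta in events:
        if delta == 1 and active == 0 and gap_start is not None:
            if x - gap_start >= min_gap:
                gaps.append((gap_start, x))
            gap_start = None
        active += delta
        if active == 0:
            gap_start = x
    return gaps


def split_into_columns(boxes: List[Box], gaps: List[Tuple[float, float]]) -> List[List[Box]]:
    """Assign each box to a column by its horizontal center, then sort top to bottom."""
    cuts = [(a + b) / 2 for a, b in gaps]
    columns: List[List[Box]] = [[] for _ in range(len(cuts) + 1)]
    for box in boxes:
        center = (box[0] + box[2]) / 2
        columns[sum(center > c for c in cuts)].append(box)
    for column in columns:
        column.sort(key=lambda b: b[1])
    return columns


lines = [(50, 100, 290, 110), (310, 100, 550, 110),
         (50, 120, 290, 130), (310, 120, 550, 130)]
gaps = find_column_gaps(lines)
columns = split_into_columns(lines, gaps)
```

One limitation of this simplified version: a full-width element such as a title covers the gap and hides the columns beneath it, so a practical system would likely run the sweep within horizontal bands of the page rather than over the whole page at once.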
