CL | Jan 13, 2021

On consistency scores in text data with an implementation in R

arXiv:2101.05225v1

Originality: Synthesis-oriented

AI Analysis

This paper addresses the problem of cleaning text data for researchers and practitioners who work with PDF-extracted text. Its contribution is incremental, since it builds on existing n-gram models.

The paper tackles the problem of cleaning text extracted from PDFs by introducing a reproducible process that uses n-gram models to compare the extracted text with the text the models expect. The comparison yields a consistency score, which is used to monitor changes during cleaning and to compare corpora. The process is illustrated on 'Jane Eyre' and is provided as both a Shiny app and an R package.

In this paper, we introduce a reproducible cleaning process for the text extracted from PDFs using n-gram models. Our approach compares the originally extracted text with the text generated from, or expected by, these models using earlier text as stimulus. To guide this process, we introduce the notion of a consistency score, which refers to the proportion of text that is expected by the model. This is used to monitor changes during the cleaning process, and across different corpuses. We illustrate our process on text from the book Jane Eyre and introduce both a Shiny application and an R package to make our process easier for others to adopt.
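To make the idea of a consistency score concrete, the following is a minimal sketch in Python, not the authors' R package. It assumes a simple bigram model trained on a reference text, and it assumes that a word counts as "expected" when it is the model's single most frequent follower of the preceding word. The paper's own model order and matching rule may differ, and the function names `tokenize`, `train_bigram_model`, and `consistency_score` are hypothetical.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip('.,;:!?"\'()') for word in text.lower().split())
    return [t for t in tokens if t]


def train_bigram_model(reference_text):
    """Map each word to a Counter of the words that follow it."""
    tokens = tokenize(reference_text)
    followers = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        followers[prev][nxt] += 1
    return followers


def consistency_score(extracted_text, model):
    """Proportion of words (after the first) that the model expects.

    Assumption for this sketch: a word is "expected" when it equals the
    most frequent follower of the preceding word in the reference text.
    """
    tokens = tokenize(extracted_text)
    if len(tokens) < 2:
        return 0.0
    expected = 0
    for prev, actual in zip(tokens, tokens[1:]):
        candidates = model.get(prev)  # .get avoids adding keys to the defaultdict
        if candidates and candidates.most_common(1)[0][0] == actual:
            expected += 1
    return expected / (len(tokens) - 1)
```

Under these assumptions, an extraction error such as `c@t` in place of `cat` lowers the score twice: the damaged word is not expected after its predecessor, and the model has no followers recorded for it. Recomputing the score after each cleaning step is one way to see whether that step moved the text closer to what the model expects.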
