CL, DL | Jul 4, 2024

Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction

Laura Manrique-Gómez, Tony Montes, Arturo Rodríguez-Herrera, Rubén Manrique

arXiv:2407.12838v213.223 citations | h-index: 2 | Has Code

Originality: Synthesis-oriented

AI Analysis

The work addresses a gap in resources for historical and linguistic analysis of Latin America, though it is incremental in that it applies existing LLM methods to a new domain.

The paper tackles the lack of specialized historical corpora for Latin American Spanish in two ways: it introduces a new dataset of 19th-century newspaper texts, and it develops a flexible framework that uses a Large Language Model for OCR error correction and linguistic surface form detection, which it applies to this dataset.

This paper presents two significant contributions. First, it introduces a novel dataset of 19th-century Latin American newspaper texts, addressing a critical gap in specialized corpora for historical and linguistic analysis in this region. Second, it develops a flexible framework that uses a Large Language Model for OCR error correction and linguistic surface form detection in digitized corpora. This semi-automated framework is adaptable to various contexts and datasets, and the authors apply it to the newly created dataset.
