CL, DL · Jul 4, 2024

Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction

arXiv:2407.12838v2 · 23 citations · h-index: 2
Originality: Synthesis-oriented
AI Analysis

The work addresses a gap in resources for historical and linguistic analysis of Latin America, though it is incremental in that it applies existing LLM methods to a new domain.

The paper tackles the lack of specialized historical corpora for Latin American Spanish in two ways: it introduces a new dataset of 19th-century newspaper texts, and it develops a flexible framework that uses a Large Language Model for OCR error correction and linguistic surface form detection, which it applies to this dataset.

This paper presents two significant contributions: First, it introduces a novel dataset of 19th-century Latin American newspaper texts, addressing a critical gap in specialized corpora for historical and linguistic analysis in this region. Second, it develops a flexible framework that utilizes a Large Language Model for OCR error correction and linguistic surface form detection in digitized corpora. This semi-automated framework is adaptable to various contexts and datasets and is applied to the newly created dataset.
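To make the idea of a semi-automated correction framework concrete, here is a minimal Python sketch. It is not the paper's implementation. The names `correct_passages`, `CorrectionResult`, `toy_corrector`, and the `min_similarity` threshold are all hypothetical, and the rule of routing heavily rewritten passages to a human reviewer is an assumption made for illustration. The LLM call is represented by a pluggable function so that no particular model API is implied.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List


@dataclass
class CorrectionResult:
    original: str
    corrected: str
    similarity: float   # character-level similarity between original and proposal
    needs_review: bool  # True when the proposal diverges too far from the OCR text


def correct_passages(
    passages: List[str],
    corrector: Callable[[str], str],
    min_similarity: float = 0.8,  # hypothetical threshold, not taken from the paper
) -> List[CorrectionResult]:
    """Apply a corrector to each passage and flag large rewrites for manual review."""
    results = []
    for text in passages:
        proposal = corrector(text)
        similarity = SequenceMatcher(None, text, proposal).ratio()
        results.append(
            CorrectionResult(
                original=text,
                corrected=proposal,
                similarity=similarity,
                needs_review=similarity < min_similarity,
            )
        )
    return results


def toy_corrector(text: str) -> str:
    """Stand-in for an LLM call: fixes two made-up OCR confusions."""
    replacements = {"rnuy": "muy", "perióclico": "periódico"}
    for wrong, right in replacements.items():
        text = text.replace(wrong, right)
    return text


if __name__ == "__main__":
    for result in correct_passages(["El perióclico es rnuy bueno."], toy_corrector):
        status = "REVIEW" if result.needs_review else "ok"
        print(f"[{status}] {result.original!r} -> {result.corrected!r}")
```

Passing the corrector as a function keeps the pipeline independent of any one model, which matches the framework's stated goal of adapting to different contexts and datasets. In practice `toy_corrector` would be replaced by a call to whichever LLM is in use.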

Code Implementations: 1 repo