Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction
The work addresses a gap in resources for historical and linguistic analysis in Latin America, though the contribution is incremental because it applies existing LLM methods to a new domain.
The paper tackles the lack of specialized historical corpora for Latin American Spanish by introducing a new dataset of 19th-century newspaper texts. It also develops a flexible framework that uses a Large Language Model for OCR error correction and linguistic surface form detection, and applies that framework to the dataset.
The paper presents two significant contributions. First, it introduces a novel dataset of 19th-century Latin American newspaper texts, addressing a critical gap in specialized corpora for historical and linguistic analysis in this region. Second, it develops a flexible framework that uses a Large Language Model for OCR error correction and linguistic surface form detection in digitized corpora. The framework is semi-automated, is described as adaptable to other contexts and datasets, and is applied to the newly created dataset.
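
To make the idea of a semi-automated correction loop concrete, here is a minimal Python sketch. It is not the paper's implementation: the names `Correction`, `correct_passages`, `toy_corrector`, and the `min_similarity` threshold are all hypothetical, and the LLM is replaced by a plain callable so the example runs without any model. The assumed design is that a proposed correction which drifts too far from the OCR text is flagged for human review rather than accepted automatically.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List


@dataclass
class Correction:
    original: str
    corrected: str
    similarity: float   # character-level similarity in [0, 1]
    needs_review: bool  # True when the proposal drifted too far from the OCR text


def correct_passages(
    passages: List[str],
    corrector: Callable[[str], str],
    min_similarity: float = 0.8,  # hypothetical threshold, not taken from the paper
) -> List[Correction]:
    """Apply a corrector to each passage and flag suspicious rewrites."""
    results = []
    for text in passages:
        proposed = corrector(text)
        # Compare the proposal with the OCR text; a low ratio suggests the
        # corrector rewrote the passage instead of repairing OCR errors.
        ratio = SequenceMatcher(None, text, proposed, autojunk=False).ratio()
        results.append(Correction(text, proposed, ratio, ratio < min_similarity))
    return results


def toy_corrector(text: str) -> str:
    """Stand-in for an LLM call (assumption): fixes two made-up OCR errors."""
    fixes = {"gobiemo": "gobierno", "Repvblica": "República"}
    for wrong, right in fixes.items():
        text = text.replace(wrong, right)
    return text


if __name__ == "__main__":
    for item in correct_passages(["El gobiemo de la Repvblica"], toy_corrector):
        print(item.corrected, round(item.similarity, 2), item.needs_review)
```

In practice `corrector` would wrap a call to whichever LLM is used, and the flagged passages would go to a human annotator, which is one plausible reading of "semi-automated" here.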