CL · Oct 12, 2019

From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction

arXiv:1910.05535v1
Originality: Incremental advance
AI Analysis

The method replaces time-consuming manual correction of historical texts with an unsupervised alternative to rule-based and supervised approaches.

The paper tackles the problem of OCR errors in historical corpora by proposing a fully automatic unsupervised method to extract parallel data for training a character-based sequence-to-sequence NMT model, achieving error correction without manual intervention.

A great deal of historical corpora suffer from errors introduced by the OCR (optical character recognition) methods used in the digitization process. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning. We present a fully automatic unsupervised way of extracting parallel data for training a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction.
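The abstract does not spell out how the parallel data is extracted, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the nearest neighbours of each word in an embedding space are already available (here a hand-written toy dictionary, `neighbours`), that a lexicon of known-correct words exists, and that an out-of-lexicon neighbour within a small edit distance of a lexicon word is a likely OCR error for it. The names `extract_pairs` and `to_char_sequence`, the threshold of 3, and the toy data are all hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def extract_pairs(neighbours, lexicon, max_distance=3):
    """Return (noisy, correct) pairs from embedding neighbours.

    Assumption: a neighbour that is not in the lexicon but is within
    max_distance edits of a lexicon word is treated as an OCR error.
    """
    pairs = []
    for correct, candidates in neighbours.items():
        if correct not in lexicon:
            continue
        for cand in candidates:
            if cand in lexicon:
                continue  # a real word (e.g. a synonym), not an error
            if levenshtein(cand, correct) <= max_distance:
                pairs.append((cand, correct))
    return pairs


def to_char_sequence(word: str) -> str:
    """Space-separate characters so a seq2seq model sees one token per character."""
    return " ".join(word)


# Toy stand-ins for a real lexicon and real embedding neighbours (hypothetical data).
lexicon = {"future", "past", "history"}
neighbours = {
    "future": ["fiiture", "futnre", "past", "tomorrow"],
    "past": ["paft", "history", "pnst"],
}

pairs = extract_pairs(neighbours, lexicon)
for noisy, clean in pairs:
    print(to_char_sequence(noisy), "->", to_char_sequence(clean))
```

In this toy run, "tomorrow" is dropped because it is too far from "future" in edit distance, and "past" and "history" are dropped because they are valid lexicon words. The surviving pairs, split into characters, have the source/target shape a character-based sequence-to-sequence model would be trained on.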

Code Implementations: 1 repo