CL · IR · LG · May 3, 2020

Correcting the Autocorrect: Context-Aware Typographical Error Correction via Training Data Augmentation

arXiv:2005.01158v1997 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for better context-aware error correction in text input systems. The contribution is incremental, since it builds on existing data augmentation methods.

The paper addresses typographical error correction by generating challenging errors from real-world statistics and using them to build datasets. It then examines how effectively machine learning models detect and correct these errors. The datasets are publicly available.

In this paper, we explore the artificial generation of typographical errors based on real-world statistics. We first draw on a small set of annotated data to compute spelling error statistics. These are then invoked to introduce errors into substantially larger corpora. The generation methodology allows us to generate particularly challenging errors that require context-aware error detection. We use it to create a set of English language error detection and correction datasets. Finally, we examine the effectiveness of machine learning models for detecting and correcting errors based on this data. The datasets are available at http://typo.nlproc.org
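The two-step idea in the abstract (estimate error statistics from a small annotated set, then use them to corrupt a larger corpus) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `learn_substitutions` and `inject_errors`, the restriction to one-for-one character substitutions, and the `rate` parameter are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def learn_substitutions(pairs):
    """Count character substitutions seen in (correct, misspelled) word pairs.

    Hypothetical helper: only equal-length pairs are used, so this sketch
    models substitutions and ignores insertions, deletions, and swaps.
    """
    counts = defaultdict(Counter)
    for correct, typo in pairs:
        if len(correct) != len(typo):
            continue  # skip pairs this simplified model cannot align
        for c, t in zip(correct, typo):
            if c != t:
                counts[c][t] += 1
    return counts


def inject_errors(text, counts, rate, rng):
    """Corrupt text by sampling substitutions in proportion to observed counts.

    `rate` (an assumed parameter) is the per-character probability of
    applying an error to a character that has known substitutions.
    """
    out = []
    for ch in text:
        options = counts.get(ch)
        if options and rng.random() < rate:
            chars = list(options.keys())
            weights = list(options.values())
            out.append(rng.choices(chars, weights=weights, k=1)[0])
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Toy annotated data standing in for a small set of real corrections.
    annotated = [("the", "thw"), ("and", "snd"), ("for", "fir")]
    stats = learn_substitutions(annotated)
    rng = random.Random(0)
    print(inject_errors("the cat and the dog sat for a while", stats, 0.3, rng))
```

A word-level variant of the same idea, which swaps a word for another valid word, would produce the kind of error that needs context to detect. The sketch above stays at the character level for brevity.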

