CL, LG, ML | Sep 26, 2018

Wronging a Right: Generating Better Errors to Improve Grammatical Error Detection

arXiv:1810.00668v11096 citations
Originality: Incremental advance
AI Analysis

High-quality training data for grammatical error correction is expensive to produce. This work addresses that need with a cheap method for generating synthetic data.

The paper tackles the problem of generating realistic grammatical errors to improve grammatical error detection. It uses an attentive sequence-to-sequence model plus a post-processing step to create synthetic training data. Trained with this data, a vanilla bi-directional LSTM outperforms the previous state of the art, and a previously introduced model gains over 5% in F0.5 score.

Grammatical error correction, like other machine learning tasks, greatly benefits from large quantities of high quality training data, which is typically expensive to produce. While writing a program to automatically generate realistic grammatical errors would be difficult, one could learn the distribution of naturally occurring errors and attempt to introduce them into other datasets. Initial work on inducing errors in this way using statistical machine translation has shown promise; we investigate cheaply constructing synthetic samples, given a small corpus of human-annotated data, using an off-the-rack attentive sequence-to-sequence model and a straight-forward post-processing procedure. Our approach yields error-filled artificial data that helps a vanilla bi-directional LSTM to outperform the previous state of the art at grammatical error detection, and a previously introduced model to gain further improvements of over 5% $F_{0.5}$ score. When attempting to determine if a given sentence is synthetic, a human annotator at best achieves 39.39 $F_1$ score, indicating that our model generates mostly human-like instances.
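To make the pipeline in the abstract concrete, here is a minimal sketch of two pieces it relies on: turning a (correct, corrupted) sentence pair into token-level error labels for a detection model, and scoring predictions with $F_{0.5}$, which weights precision more heavily than recall. The abstract does not describe the authors' post-processing procedure, so the diff-based labelling below is an assumption, as are the function names `label_errors` and `f_beta` and the choice to flag the token that follows a deleted word.

```python
from difflib import SequenceMatcher


def label_errors(correct_tokens, corrupted_tokens):
    """Label each corrupted token 1 (error) or 0 (fine) by diffing against the correct text.

    Assumption: this alignment-based labelling is illustrative only and is not
    taken from the paper.
    """
    labels = [0] * len(corrupted_tokens)
    matcher = SequenceMatcher(a=correct_tokens, b=corrupted_tokens, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            # Tokens that were substituted or added in the corrupted sentence.
            for j in range(j1, j2):
                labels[j] = 1
        elif tag == "delete" and j1 < len(labels):
            # A dropped word leaves no token to mark, so flag the token after the gap.
            labels[j1] = 1
    return labels


def f_beta(gold, predicted, beta=0.5):
    """F-beta over binary token labels; beta=0.5 favours precision over recall."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


correct = "I went to the shop yesterday".split()
corrupted = "I goed to shop yesterday".split()
gold = label_errors(correct, corrupted)

# A hypothetical detector that finds the wrong verb form but misses the dropped article.
predicted = [0, 1, 0, 0, 0]
print(gold, round(f_beta(gold, predicted), 4))
```

In this toy case precision is 1.0 and recall is 0.5, so $F_{0.5}$ lands closer to precision than $F_1$ would, which is why the metric suits a setting where false alarms are costlier than missed errors.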

Code Implementations: 1 repo