Grammatical Error Correction in Low Error Density Domains: A New Benchmark and Analyses
This work addresses the need for GEC systems that generalize beyond learner essays to broader applications such as website text. Its contribution is incremental: it offers a benchmark and analyses rather than a new method.
The authors address grammatical error correction (GEC) in low error density domains by introducing CWEB, a new benchmark based on website text. They find that state-of-the-art GEC systems struggle in this setting, in part because the systems cannot rely on a strong internal language model when errors are sparse.
Evaluation of grammatical error correction (GEC) systems has primarily focused on essays written by non-native learners of English, which covers only part of the full spectrum of GEC applications. We aim to broaden the target domain of GEC and release CWEB, a new benchmark for GEC consisting of website text generated by English speakers of varying levels of proficiency. Website text is a common and important domain that contains far fewer grammatical errors than learner essays, which we show presents a challenge to state-of-the-art GEC systems. We demonstrate that one factor behind this is the inability of systems to rely on a strong internal language model in low error density domains. We hope this work will facilitate the development of open-domain GEC models that generalize to different topics and genres.
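
To make "error density" concrete, the following is a minimal sketch of one way it could be measured: the number of edit spans per source token over pairs of original and corrected sentences. The function name `error_density`, the whitespace tokenization, and the use of `difflib` alignment are all assumptions made for illustration; the abstract does not specify how the authors measure error density.

```python
from difflib import SequenceMatcher


def error_density(pairs):
    """Return edit spans per source token over (source, corrected) pairs.

    Hypothetical helper for illustration only: tokens are split on
    whitespace and edits are found by a generic sequence alignment,
    not by a dedicated GEC annotation tool.
    """
    edit_spans = 0
    total_tokens = 0
    for source, corrected in pairs:
        src_tokens = source.split()
        tgt_tokens = corrected.split()
        total_tokens += len(src_tokens)
        matcher = SequenceMatcher(a=src_tokens, b=tgt_tokens, autojunk=False)
        # Every non-"equal" opcode is one contiguous edit
        # (a replacement, deletion, or insertion).
        edit_spans += sum(
            1 for tag, *_ in matcher.get_opcodes() if tag != "equal"
        )
    return edit_spans / total_tokens if total_tokens else 0.0


# Toy data, invented for this example.
learner_like = [("She go to school yesterday .", "She went to school yesterday .")]
web_like = learner_like + [("The site is online .", "The site is online .")]

print(error_density(learner_like))
print(error_density(web_like))
```

On this toy data the second collection has a lower density because it adds a sentence that needs no correction, which is the situation the abstract describes for website text relative to learner essays.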