CL · Apr 5, 2019

Cross-Corpora Evaluation and Analysis of Grammatical Error Correction Models --- Is Single-Corpus Evaluation Enough?

arXiv:1904.02927v1 · 1093 citations
Originality: Synthesis-oriented
AI Analysis

This work argues for more robust evaluation practices in grammatical error correction by highlighting the limitations of single-corpus benchmarks, a concern for both researchers and developers. Its contribution is incremental, since it builds on existing evaluation methods.

The study shows that evaluating grammatical error correction models on a single benchmark corpus is insufficient: model rankings vary significantly across corpora, with performance differences of up to 10% in F0.5 score depending on factors such as writer proficiency and essay topic.

This study explores the necessity of performing cross-corpora evaluation for grammatical error correction (GEC) models. GEC models have previously been evaluated on a single commonly used corpus: the CoNLL-2014 benchmark. However, such evaluation remains incomplete because task difficulty varies with the test corpus and with conditions such as the proficiency levels of the writers and the essay topics. To overcome this limitation, we evaluate the performance of several GEC models, including three NMT-based models (LSTM, CNN, and transformer) and an SMT-based model, against various learner corpora (CoNLL-2013, CoNLL-2014, FCE, JFLEG, ICNALE, and KJ). Evaluation results reveal that the models' rankings vary considerably depending on the corpus, indicating that single-corpus evaluation is insufficient for GEC models.
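To make the idea of corpus-dependent rankings concrete, here is a minimal sketch of how F0.5 can be computed from edit-level counts and how a ranking can flip between corpora. It assumes that true-positive, false-positive, and false-negative edit counts are already available from some scorer. The corpus names, model names, and counts are hypothetical placeholders, not results from the paper, and the helper functions `f_beta` and `rank_models` are illustrative names rather than part of any existing tool.

```python
from typing import Dict, List, Tuple


def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from edit-level counts; beta=0.5 emphasises precision over recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def rank_models(scores: Dict[str, float]) -> List[str]:
    """Return model names ordered from best to worst score."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)


# Hypothetical (tp, fp, fn) counts per corpus and model; placeholders only.
counts: Dict[str, Dict[str, Tuple[int, int, int]]] = {
    "corpus_a": {"model_x": (50, 25, 50), "model_y": (40, 10, 60)},
    "corpus_b": {"model_x": (60, 20, 40), "model_y": (30, 30, 70)},
}

# Rank the models separately on each corpus.
rankings: Dict[str, List[str]] = {}
for corpus, per_model in counts.items():
    scores = {model: f_beta(*c) for model, c in per_model.items()}
    rankings[corpus] = rank_models(scores)

for corpus, order in rankings.items():
    print(corpus, order)
```

With these placeholder counts, the more conservative `model_y` ranks first on `corpus_a` while `model_x` ranks first on `corpus_b`, which is the kind of disagreement a single-corpus evaluation would hide.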

