Inherent Biases in Reference-based Evaluation for Grammatical Error Correction and Text Simplification
This work reveals a fundamental flaw in reference-based evaluation metrics for text-to-text generation tasks such as Grammatical Error Correction (GEC) and Text Simplification, which affects researchers and practitioners who rely on these metrics.
The paper demonstrates that low coverage bias in reference-based evaluation for Grammatical Error Correction cannot be fixed by re-scaling or by adding references in any feasible range, because the valid corrections for a sentence follow a long-tailed distribution. The bias incentivizes systems to avoid making corrections, so systems that make few changes achieve performance comparable or superior to that of humans. Similar effects are shown for Text Simplification.
The prevalent use of too few references for evaluating text-to-text generation is known to bias estimates of output quality (low coverage bias, or LCB). This paper shows that, contrary to previous suggestions, LCB in Grammatical Error Correction (GEC) evaluation cannot be overcome by re-scaling or by increasing the number of references in any feasible range. This is due to the long-tailed distribution of valid corrections for a sentence. Concretely, we show that LCB incentivizes GEC systems to avoid correcting even when they can generate a valid correction. Consequently, existing systems obtain performance comparable or superior to that of humans by making few but targeted changes to the input. Similar effects on Text Simplification further support our claims.
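To illustrate why a long-tailed distribution of valid corrections makes LCB hard to remove by adding references, the following is a minimal toy sketch, not the paper's estimation procedure. It assumes, purely for illustration, that a sentence has 1000 valid corrections whose probabilities follow a Zipf-like law, and that both the references and the system's valid output are drawn independently from that distribution. The function names `zipf_probs` and `expected_coverage` are hypothetical.

```python
def zipf_probs(n_corrections, exponent=1.0):
    """Zipf-like probabilities over valid corrections (assumed shape, not from the paper)."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_corrections + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_coverage(probs, n_references):
    """Probability that a valid correction drawn from `probs` equals at least
    one of `n_references` references drawn i.i.d. from the same distribution."""
    return sum(p * (1.0 - (1.0 - p) ** n_references) for p in probs)


if __name__ == "__main__":
    # 1000 valid corrections per sentence is an illustrative assumption.
    probs = zipf_probs(1000)
    for m in (1, 2, 5, 10, 20, 100):
        print(f"references={m:4d}  expected coverage={expected_coverage(probs, m):.3f}")
```

Under these assumptions, coverage grows with the number of references but stays below 1, so some valid corrections are still penalized as if they were errors. The exact numbers depend entirely on the assumed distribution and should not be read as results from the paper.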