CLEME2.0: Towards Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction
This work addresses the interpretability of evaluation for researchers and practitioners in Grammatical Error Correction (GEC) by offering a more transparent evaluation method. The contribution is incremental in that it builds on existing metric frameworks.
The paper tackles the lack of interpretability in GEC evaluation metrics by introducing CLEME2.0, a reference-based metric that disentangles edits into four aspects. It reports higher consistency with human judgments than other metrics and state-of-the-art results across two human judgment datasets and six reference datasets.
This paper focuses on the interpretability of Grammatical Error Correction (GEC) evaluation metrics, which has received little attention in previous studies. To bridge this gap, we introduce **CLEME2.0**, a reference-based metric that describes four fundamental aspects of GEC systems: hit-correction, wrong-correction, under-correction, and over-correction. Together, these aspects expose critical qualities of GEC systems and help locate their drawbacks. Evaluating systems by combining these aspects also yields higher consistency with human judgments than other reference-based and reference-less metrics. Extensive experiments on two human judgment datasets and six reference datasets demonstrate the effectiveness and robustness of our method, which achieves a new state-of-the-art result. Our code is released at https://github.com/THUKElab/CLEME.
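
The four aspects named in the abstract can be illustrated with a toy sketch. This is not the CLEME2.0 implementation: the abstract does not define the aspects formally, so the rules below are one plausible reading of their names, and the function names `classify_chunk` and `count_aspects`, the pre-aligned `(source, hypothesis, reference)` chunk triples, and the single-reference setting are all assumptions made for illustration.

```python
from collections import Counter

def classify_chunk(source, hypothesis, reference):
    """Label one aligned chunk with an aspect name (hypothetical rules).

    Assumes the source, system hypothesis, and a single reference have
    already been segmented into aligned chunks; that alignment step is
    not shown here.
    """
    hyp_edited = hypothesis != source
    ref_edited = reference != source
    if hyp_edited and ref_edited:
        # Both changed the source: correct only if they agree.
        return "hit-correction" if hypothesis == reference else "wrong-correction"
    if ref_edited:
        # The reference fixes something the system left untouched.
        return "under-correction"
    if hyp_edited:
        # The system changed text the reference left untouched.
        return "over-correction"
    return "unchanged"

def count_aspects(chunks):
    """Count aspect labels over an iterable of (source, hypothesis, reference)."""
    return Counter(classify_chunk(s, h, r) for s, h, r in chunks)

# Made-up example chunks, one per label.
chunks = [
    ("I", "I", "I"),                 # unchanged
    ("go", "went", "went"),          # hit-correction
    ("a", "the", "an"),              # wrong-correction
    ("apple", "apple", "apples"),    # under-correction
    ("very", "really", "very"),      # over-correction
]
counts = count_aspects(chunks)
```

Reporting the four counts separately, rather than folding them into one score, is what makes the weaknesses of a system visible; how CLEME2.0 weights and combines them is specified in the paper and the released code, not here.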