CL, AI | Jun 25, 2024

Improving Grammatical Error Correction via Contextual Data Augmentation

arXiv:2406.17456v1 | 28 citations
Originality: Incremental advance
AI Analysis

This work addresses data scarcity for Grammatical Error Correction systems, offering an incremental improvement by enhancing synthetic data quality for fine-tuning.

The paper tackles data scarcity in Grammatical Error Correction (GEC) by proposing a contextual data augmentation method that combines rule-based substitution with model-based generation, together with a relabeling-based cleaning technique. The authors report state-of-the-art performance on the CoNLL14 and BEA19-Test benchmarks using only a small amount of synthetic data.

Data augmentation through synthetic data is widely used in Grammatical Error Correction (GEC) to alleviate data scarcity. However, such synthetic data is mainly used in the pre-training phase rather than the data-limited fine-tuning phase, because of its inconsistent error distribution and noisy labels. In this paper, we propose a synthetic data construction method based on contextual augmentation, which efficiently augments the original data while keeping the error distribution more consistent. Specifically, we combine rule-based substitution with model-based generation, using a generative model to produce richer contexts for the extracted error patterns. We also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data. Experiments on CoNLL14 and BEA19-Test show that the proposed augmentation method consistently and substantially outperforms strong baselines and reaches state-of-the-art performance with only a small amount of synthetic data.
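The pipeline the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`extract_error_patterns`, `inject_error`, `augment`) are hypothetical, the generative model and the GEC relabeling model are replaced by toy stand-in callables, and reading "relabeling" as "re-predict the target with a GEC model" is an assumption made for illustration.

```python
import difflib
from typing import Callable, List, Optional, Tuple

Pattern = Tuple[str, str]  # (correct phrase, erroneous phrase)
Pair = Tuple[str, str]     # (erroneous source, corrected target)


def extract_error_patterns(pairs: List[Pair]) -> List[Pattern]:
    """Collect token-level substitutions from annotated (source, target) pairs."""
    patterns = []
    for src, tgt in pairs:
        s, t = src.split(), tgt.split()
        # Opcodes describe how to turn the correct tokens (a) into the erroneous ones (b).
        matcher = difflib.SequenceMatcher(a=t, b=s, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace":
                patterns.append((" ".join(t[i1:i2]), " ".join(s[j1:j2])))
    return patterns


def inject_error(context: str, pattern: Pattern) -> Optional[str]:
    """Rule-based substitution: swap the first occurrence of the correct phrase
    in a clean context for its erroneous form. Returns None if it does not occur."""
    correct, error = pattern
    tokens, c = context.split(), correct.split()
    for i in range(len(tokens) - len(c) + 1):
        if tokens[i:i + len(c)] == c:
            return " ".join(tokens[:i] + error.split() + tokens[i + len(c):])
    return None


def augment(
    pairs: List[Pair],
    generate_context: Callable[[Pattern], List[str]],  # stand-in for a generative model
    relabel: Callable[[str], str],                      # stand-in for a GEC model
) -> List[Pair]:
    """Build synthetic pairs: new contexts per error pattern, then relabeled targets."""
    synthetic = []
    for pattern in extract_error_patterns(pairs):
        for context in generate_context(pattern):
            source = inject_error(context, pattern)
            if source is None:
                continue  # the generated context did not contain the phrase
            # Assumed cleaning step: the target is re-predicted rather than trusted.
            synthetic.append((source, relabel(source)))
    return synthetic


# Toy stand-ins, for illustration only.
def toy_generator(pattern: Pattern) -> List[str]:
    correct, _ = pattern
    return [f"He {correct} home early .", "No match here ."]


def toy_relabel(source: str) -> str:
    return source.replace(" go ", " goes ")


seed = [("She go to school .", "She goes to school .")]
synthetic = augment(seed, toy_generator, toy_relabel)
print(synthetic)
```

In this sketch the error pattern comes from real annotated data, which is one way to keep the synthetic error distribution close to the original, while the generator only supplies new surrounding context.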

Code Implementations: 1 repo
