CL, Oct 22, 2022

FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction

arXiv:2210.12364v1, 292 citations, h-index: 36
Originality: Synthesis-oriented
AI Analysis

The paper addresses the shortage of data for Chinese GEC. The contribution is incremental: it provides a new corpus and a baseline model for a specific domain.

The authors tackle the lack of high-quality data for Chinese Grammatical Error Correction by introducing FCGEC, a fine-grained, human-annotated corpus of 41,340 sentences, and propose a baseline model (STG) that outperforms other benchmark models on this dataset.

Grammatical Error Correction (GEC) has recently been widely applied in automatic correction and proofreading systems. However, Chinese GEC remains immature because high-quality data from native speakers is limited in both category coverage and scale. In this paper, we present FCGEC, a fine-grained corpus for detecting, identifying, and correcting grammatical errors. FCGEC is a human-annotated corpus with multiple references, consisting of 41,340 sentences collected mainly from multiple-choice questions in public school Chinese examinations. Furthermore, we propose a Switch-Tagger-Generator (STG) baseline model to correct grammatical errors in low-resource settings. Experimental results show that STG outperforms other GEC benchmark models on FCGEC. However, a significant gap remains between benchmark models and humans, which future models are encouraged to bridge.

Code Implementations: 2 repos
