CL May 25, 2023

NaSGEC: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts

arXiv:2305.16023v1224 citations
Originality: Synthesis-oriented
AI Analysis

This work tackles cross-domain grammatical error correction for native speaker texts, an incremental advance that broadens the dataset scope of CGEC research.

The authors introduce NaSGEC, a multi-domain dataset for Chinese grammatical error correction (CGEC) built from native speaker texts, which addresses the single-domain focus of previous work, and provide benchmark results using state-of-the-art models.

We introduce NaSGEC, a new dataset to facilitate research on Chinese grammatical error correction (CGEC) for native speaker texts from multiple domains. Previous CGEC research primarily focuses on correcting texts from a single domain, especially learner essays. To broaden the target domain, we annotate multiple references for 12,500 sentences from three native domains, i.e., social media, scientific writing, and examination. We provide solid benchmark results for NaSGEC by employing cutting-edge CGEC models and different training data. We further perform detailed analyses of the connections and gaps between our domains from both empirical and statistical views. We hope this work can inspire future studies on an important but under-explored direction: cross-domain GEC.

Code Implementations: 1 repo