CL, AI | Oct 25, 2022

Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation

Soyoung Yoon, Sungjoon Park, Gyuwan Kim, Junhee Cho, Kihyo Park, Gyutae Kim, Minjoon Seo, Alice Oh

arXiv:2210.14389v3 | 21.7225 citations | h-index: 34 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the shortage of research and evaluation resources for Korean GEC, which benefits researchers and developers in natural language processing. The contribution is incremental: it applies established GEC concepts to a specific language.

The authors address the lack of a standardized evaluation benchmark for Korean grammatical error correction (GEC). They collect three datasets covering diverse errors, define 14 error types, and build an automatic annotation system (KAGAS) that labels those types. Baseline models trained on the datasets significantly outperform the existing statistical system (Hanspell) on a wider range of error types.

Research on Korean grammatical error correction (GEC) is limited compared to other major languages such as English. We attribute this to the lack of a carefully designed evaluation benchmark for Korean GEC. In this work, we collect three datasets from different sources (Kor-Lang8, Kor-Native, and Kor-Learner) that cover a wide range of Korean grammatical errors. Considering the nature of Korean grammar, we then define 14 error types for Korean and provide KAGAS (Korean Automatic Grammatical error Annotation System), which can automatically annotate error types from parallel corpora. We use KAGAS on our datasets to make an evaluation benchmark for Korean, and present baseline models trained on our datasets. We show that the model trained with our datasets significantly outperforms the currently used statistical Korean GEC system (Hanspell) on a wider range of error types, demonstrating the diversity and usefulness of the datasets. The implementations and datasets are open-sourced.
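To give a rough sense of what annotating error types from a parallel corpus involves, the sketch below aligns an erroneous sentence with its correction and labels each differing span. It is not KAGAS: the function name `extract_edits` and the four labels (`INS`, `DEL`, `SPACING`, `REPLACE`) are hypothetical and chosen for illustration only. It also works on whitespace-separated tokens using Python's standard `difflib`, and does not attempt the 14 error types defined in the paper.

```python
from difflib import SequenceMatcher


def extract_edits(source, target):
    """Align a source sentence with its correction and label each edit.

    Illustrative only: the labels are hypothetical, not the paper's types.
    Returns a list of (label, source_span, target_span) tuples.
    """
    src = source.split()
    tgt = target.split()
    edits = []
    # autojunk=False keeps the alignment deterministic for longer inputs.
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged tokens carry no edit
        before = src[i1:i2]
        after = tgt[j1:j2]
        if tag == "insert":
            label = "INS"
        elif tag == "delete":
            label = "DEL"
        elif "".join(before) == "".join(after):
            # Same characters, different token boundaries: a spacing edit.
            label = "SPACING"
        else:
            label = "REPLACE"
        edits.append((label, " ".join(before), " ".join(after)))
    return edits


if __name__ == "__main__":
    # A spacing error: the particle is wrongly separated from its noun.
    print(extract_edits("나는 학교 에 갔다", "나는 학교에 갔다"))
```

For the example pair, the only differing span is `학교 에` versus `학교에`, which the sketch labels `SPACING` because the characters match once spaces are removed. A real annotator for Korean would need morpheme-level analysis to distinguish finer categories, which this token-level sketch deliberately leaves out.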
