cs.CL · Mar 31, 2021

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

arXiv:2103.16997v2 · 67 citations · Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the scarcity of NLP resources for Ukrainian, particularly for grammatical error correction (GEC), by providing a foundational dataset for researchers and developers. The contribution is incremental, since it extends existing GEC methods to a new language.

The authors address the lack of GEC resources for Ukrainian by creating what they describe as the first professionally annotated corpus for GEC and fluency edits in the language. It consists of 20,715 sentences collected from native and non-native contributors across a range of writing domains. The corpus can be used to develop and evaluate Ukrainian GEC systems, and it supports research in multilingual and low-resource NLP.

We present a corpus professionally annotated for grammatical error correction (GEC) and fluency edits in the Ukrainian language. To the best of our knowledge, this is the first GEC corpus for the Ukrainian language. We collected texts with errors (20,715 sentences) from a diverse pool of contributors, including both native and non-native speakers. The data cover a wide variety of writing domains, from text chats and essays to formal writing. Professional proofreaders corrected and annotated the corpus for errors relating to fluency, grammar, punctuation, and spelling. This corpus can be used for developing and evaluating GEC systems in Ukrainian. More generally, it can be used for researching multilingual and low-resource NLP, morphologically rich languages, document-level GEC, and fluency correction. The corpus is publicly available at https://github.com/grammarly/ua-gec
