CL LG Apr 26

Neural Grammatical Error Correction for Romanian

Teodor-Mihai Cotet, Stefan Ruseti, Mihai Dascalu

arXiv:2604.2362743.719 citations

AI Analysis

This work provides initial resources and baselines for grammatical error correction in Romanian, a low-resource language, but the gains are incremental and the method is not novel.

The authors created the first Romanian GEC corpus with 10k sentence pairs and adapted the ERRANT scorer for Romanian. Their best model, a pretrained Transformer finetuned on the corpus, achieved an F0.5 of 53.76, outperforming a baseline of 44.38.

Resources for Grammatical Error Correction (GEC) in non-English languages are scarce, while available spellcheckers in these languages are mostly limited to simple corrections and rules. In this paper we introduce a first GEC corpus for Romanian consisting of 10k pairs of sentences. In addition, the German version of the ERRANT (ERRor ANnotation Toolkit) scorer was adapted for Romanian to analyze this corpus and extract the edits needed for evaluation. We experimented with multiple neural models, together with pretraining strategies, which proved effective for GEC in low-resource settings. Our baseline consists of a small Transformer model trained only on the GEC dataset (F0.5 of 44.38), whereas the best performing model is produced by pretraining a larger Transformer model on artificially generated data, followed by finetuning on the actual corpus (F0.5 of 53.76). The proposed method for generating additional training examples is easily extensible and can be applied to any language, as it requires only a POS tagger.
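For readers unfamiliar with the F0.5 figures quoted above, the following is a minimal sketch of how an F-beta score can be computed from edit-level counts. It is not the authors' scorer: the function name f_beta and the example counts are hypothetical, and treating an empty denominator as a score of 1.0 is an assumption about the scoring convention. With beta = 0.5, precision is weighted more heavily than recall, which is the usual choice in GEC because a wrong correction is considered worse than a missed one.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from edit-level counts (hypothetical helper, not the paper's code).

    tp: proposed edits that match a reference edit
    fp: proposed edits with no matching reference edit
    fn: reference edits the system did not propose
    """
    # Assumption: an empty denominator counts as a perfect score for that term.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative counts only; these are not results from the paper.
score = f_beta(tp=6, fp=2, fn=6)  # precision 0.75, recall 0.50
print(f"F0.5 = {100 * score:.2f}")
```

In this illustrative case the score lands closer to the precision of 0.75 than to the recall of 0.50, which shows the effect of beta = 0.5.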
