An Investigation of Noise in Morphological Inflection
This work addresses data quality issues in morphological inflection for low-resource languages, but its contribution is incremental, since it builds on existing noise analyses and model comparisons.
The study investigated the impact of training data noise on morphological inflection systems, particularly for low-resource languages, and found that encoder-decoders are generally more robust to noise than copy-bias models, with CMLM pretraining improving transformer performance.
With a growing focus on morphological inflection systems for languages where high-quality data is scarce, training data noise is a serious but so far largely ignored concern. We aim to close this gap by investigating the types of noise encountered within a pipeline for truly unsupervised morphological paradigm completion and their impact on morphological inflection systems: First, we propose an error taxonomy and annotation pipeline for inflection training data. Then, we compare the effect of different types of noise on multiple state-of-the-art inflection models. Finally, we propose a novel character-level masked language modeling (CMLM) pretraining objective and explore its impact on the models' resistance to noise. Our experiments show that different architectures are affected differently by the individual types of noise, but encoder-decoders tend to be more robust to noise than models trained with a copy bias. CMLM pretraining helps transformers, but has less impact on LSTMs.
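To make the idea of character-level masking concrete, the following is a minimal sketch of how a single word could be turned into a masked-input and target pair for a CMLM-style objective. It is an illustration under stated assumptions, not the authors' implementation: the function name `mask_characters`, the `<mask>` symbol, the 20% default masking rate, and the rule of always masking at least one character are all hypothetical choices that the abstract does not specify.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol; the abstract does not name one


def mask_characters(word, mask_rate=0.2, seed=None):
    """Replace a random subset of characters in `word` with MASK.

    Returns (source, targets): `source` is the character list with some
    positions masked, `targets` maps each masked position to the original
    character a model would be trained to predict.
    The masking rate is an assumption made for illustration only.
    """
    chars = list(word)
    if not chars:
        return [], {}
    rng = random.Random(seed)
    # Assumption: mask at least one character so every word yields a target.
    n_mask = min(len(chars), max(1, round(len(chars) * mask_rate)))
    positions = set(rng.sample(range(len(chars)), n_mask))
    source = [MASK if i in positions else c for i, c in enumerate(chars)]
    targets = {i: chars[i] for i in sorted(positions)}
    return source, targets


if __name__ == "__main__":
    source, targets = mask_characters("walking", mask_rate=0.3, seed=0)
    print(source)   # character list with two positions replaced by <mask>
    print(targets)  # masked position -> original character
```

In a pretraining setup of this kind, a model would read `source` and be trained to recover the characters in `targets`. Whether the original work masks single characters, spans, or uses a different rate is not stated in the abstract.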