The Challenge of Diacritics in Yoruba Embeddings
This work addresses a specific challenge for Yoruba NLP, but the contribution is incremental: it builds on existing embedding methods and adds a data normalization step.
The paper tackles the effect of diacritics on embedding performance in Yoruba, a tonal language, by showing that embeddings trained on an undiacritized dataset yield better results, achieving the best performance on WordSim and the corresponding Spearman correlation.
The major contributions of this work are the empirical finding that Yoruba embeddings trained on an undiacritized (normalized) dataset perform better, and the provision of new analogy sets for evaluation. Yoruba, being a tonal language, uses diacritics (tonal marks) in its written form. We show that this affects embedding performance by creating two sets of embeddings from exactly the same Wikipedia dataset, with the second set trained on a version normalized to be undiacritized. We further compare average intrinsic performance with two other works (using an analogy test set and WordSim), and we obtain the best performance on WordSim and the corresponding Spearman correlation.
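To make the normalization step concrete, the following is a minimal sketch of how text could be undiacritized before training. It is not the authors' pipeline: the function name `strip_diacritics` and the `keep_underdot` option are hypothetical, and the summary does not say whether the under-dots on ẹ, ọ, and ṣ were removed along with the tonal marks. The sketch relies only on Python's standard `unicodedata` module.

```python
import unicodedata

# Combining dot below, used in Yoruba for the letters ẹ, ọ and ṣ.
COMBINING_DOT_BELOW = "\u0323"


def strip_diacritics(text: str, keep_underdot: bool = False) -> str:
    """Remove combining marks from text (hypothetical helper, not from the paper).

    keep_underdot is an assumption of this sketch: when True, only the tonal
    marks (and any other combining marks) are dropped and under-dots are kept.
    """
    # NFD splits each precomposed letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = []
    for ch in decomposed:
        is_mark = unicodedata.category(ch) == "Mn"  # nonspacing combining mark
        if is_mark and not (keep_underdot and ch == COMBINING_DOT_BELOW):
            continue  # drop this combining mark
        kept.append(ch)
    # NFC recombines any marks that were kept with their base letters.
    return unicodedata.normalize("NFC", "".join(kept))
```

Under these assumptions, `strip_diacritics("Yorùbá")` returns `"Yoruba"`. Applying such a function to every line of the corpus would produce the second, normalized copy of the dataset, so that both sets of embeddings come from otherwise identical text.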