CL · Dec 28, 2024

YAD: Leveraging T5 for Improved Automatic Diacritization of Yorùbá Text

Akindele Michael Olawole, Jesujoba O. Alabi, Aderonke Busayo Sakpere, David I. Adelani

arXiv:2412.20218v13.43 citations · h-index: 31 · Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for better text processing tools for the Yorùbá language, but it is incremental as it applies an existing method to a new dataset.

The authors address automatic diacritization of Yorùbá text by creating a benchmark dataset and pre-training a T5 model for Yorùbá. This model outperformed multilingual T5 models, and the results indicate that more data and larger models improve performance.

In this work, we present the Yorùbá automatic diacritization (YAD) benchmark dataset for evaluating Yorùbá diacritization systems. In addition, we pre-train a text-to-text transformer (T5) model for Yorùbá and show that this model outperforms several multilingually trained T5 models. Lastly, we show that more data and larger models are better at diacritization for Yorùbá.
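To make the task concrete, diacritization can be framed as a text-to-text problem: the input is Yorùbá text without tone marks or underdots, and the target is the same text fully diacritized. The sketch below shows one common way to build such input/target pairs by stripping combining marks from diacritized text. It is an illustration only and is not taken from the paper; the function names `strip_diacritics` and `make_pair` are hypothetical, and the authors' actual data preparation may differ.

```python
import unicodedata
from typing import Tuple


def strip_diacritics(text: str) -> str:
    """Remove tone marks and underdots by dropping Unicode combining marks."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the grave, acute, and dot-below marks.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def make_pair(diacritized: str) -> Tuple[str, str]:
    """Return an (undiacritized input, diacritized target) pair.

    Assumption: this pairing scheme is illustrative, not the paper's pipeline.
    """
    target = unicodedata.normalize("NFC", diacritized)
    return strip_diacritics(target), target


if __name__ == "__main__":
    source, target = make_pair("Yorùbá")
    print(source, "->", target)
```

A sequence-to-sequence model such as T5 would then be trained to map the first element of each pair to the second. Normalizing both sides to NFC keeps visually identical strings from being treated as different during evaluation.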

