CL, Dec 28, 2024

YAD: Leveraging T5 for Improved Automatic Diacritization of Yorùbá Text

arXiv:2412.20218v1, 3 citations, h-index: 31
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for better text processing tools for the Yorùbá language, but it is incremental as it applies an existing method to a new dataset.

The authors tackled the problem of automatic diacritization for Yorùbá text by creating a benchmark dataset and pre-training a T5 model, which outperformed multilingual T5 models, showing that more data and larger models improve performance.

In this work, we present the Yorùbá automatic diacritization (YAD) benchmark dataset for evaluating Yorùbá diacritization systems. In addition, we pre-train a text-to-text transformer (T5) model for Yorùbá and show that it outperforms several multilingually trained T5 models. Lastly, we show that more data and larger models yield better diacritization for Yorùbá.
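As an illustration of how diacritization can be framed as a text-to-text task, the sketch below builds (undiacritized input, diacritized target) pairs by stripping combining marks from already diacritized Yorùbá text. This is a minimal sketch under stated assumptions, not the authors' data pipeline: the helper names `strip_diacritics` and `make_pair` are hypothetical, and the abstract does not say how the YAD pairs were actually constructed.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove tone marks and under-dots by dropping Unicode combining marks.

    Assumption: treating every nonspacing mark (category "Mn") as a diacritic
    is adequate for Yorùbá; this is an illustrative choice, not the paper's.
    """
    # NFD splits each precomposed letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except the nonspacing combining marks.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def make_pair(diacritized: str) -> tuple[str, str]:
    """Return a hypothetical (model input, model target) training pair."""
    target = unicodedata.normalize("NFC", diacritized)
    return strip_diacritics(target), target


if __name__ == "__main__":
    source, target = make_pair("Yorùbá")
    print(source, "->", target)
```

A text-to-text model such as T5 would then be trained to map the first element of each pair back to the second; the stripping step is what makes the task hard, since several distinct diacritized words can collapse to the same undiacritized string.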

Code Implementations: 1 repo
