Are LLMs Good Text Diacritizers? An Arabic and Yorùbá Case Study
This work addresses accurate text diacritization in Arabic and Yoruba, offering an incremental improvement by showing that many off-the-shelf LLMs can outperform specialized diacritization models.
The study investigated the effectiveness of large language models (LLMs) for text diacritization in Arabic and Yoruba. Many off-the-shelf LLMs outperformed specialized models, but smaller models were prone to hallucinations, which fine-tuning on a small dataset helped mitigate.
We investigate the effectiveness of large language models (LLMs) for text diacritization in two typologically distinct languages: Arabic and Yoruba. To enable a rigorous evaluation, we introduce a novel multilingual dataset MultiDiac, with diverse samples that capture a range of diacritic ambiguities. We evaluate 14 LLMs varying in size, accessibility, and language coverage, and benchmark them against 6 specialized diacritization models. Additionally, we fine-tune four small open-source models using LoRA for Yoruba. Our results show that many off-the-shelf LLMs outperform specialized diacritization models for both Arabic and Yoruba, but smaller models suffer from hallucinations. Fine-tuning on a small dataset can help improve diacritization performance and reduce hallucination rates.
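To make the notion of hallucination concrete, the following is a minimal sketch of one possible check, not the evaluation procedure used in this work. It assumes a hallucination means that a model's output differs from the input in something other than diacritics (added, dropped, or altered base characters). The function names `strip_diacritics` and `changes_base_text` are hypothetical and introduced here only for illustration.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove nonspacing combining marks (Unicode category Mn).

    NFD decomposition splits precomposed letters such as 'ù' into a base
    letter plus a combining mark, so the mark can be dropped. Arabic
    harakat are already separate Mn code points.
    Caveat (assumption of this sketch): NFD also splits Arabic hamza and
    madda forms, so a stricter check would use an explicit list of marks.
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = (ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", "".join(kept))


def changes_base_text(source: str, prediction: str) -> bool:
    """Return True if the prediction differs from the source beyond diacritics."""
    return strip_diacritics(source) != strip_diacritics(prediction)


if __name__ == "__main__":
    # Only diacritics were added, so the base text is unchanged.
    print(changes_base_text("Yoruba", "Yorùbá"))
    # An extra word was generated, so the base text changed.
    print(changes_base_text("Yoruba", "Yorùbá ni"))
```

A check of this kind only flags outputs that alter the underlying text; it says nothing about whether the predicted diacritics are correct, which requires comparison against a gold diacritized reference.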