A Comprehensive Benchmark of Language Models on Unicode and Romanized Sinhala
The benchmark gives practitioners a guide for selecting language models for Sinhala-specific applications, covering both Unicode and Romanized text.
The paper addresses the under-explored performance of language models on Sinhala, a lower-resource language, by benchmarking them on Unicode and Romanized text. Mistral-Nemo-Base-2407 achieved the strongest predictive performance on Unicode text and Mistral-7B-v0.3 on Romanized text, while Llama-3.1-8B performed well on both scripts. Closed-source models differed markedly: Gemini-1.5-pro and DeepSeek excelled at Unicode generation, whereas Claude-3.5-Sonnet was superior on Romanized text.
The performance of Language Models (LMs) on lower-resource, morphologically rich languages like Sinhala remains under-explored, particularly for Romanized Sinhala, which is prevalent in digital communication. This paper presents a comprehensive benchmark of modern LMs on a diverse corpus of Unicode and Romanized Sinhala. We evaluate open-source models using perplexity, a measure of how well a model predicts a text, and leading closed-source models via a qualitative analysis of sentence completion. Our findings reveal that the Mistral-Nemo-Base-2407 model achieves the strongest predictive performance on Unicode text, and the Mistral-7B-v0.3 model on Romanized text. The results also highlight the strong all-around performance of the Llama-3.1-8B model for both scripts. Furthermore, a significant performance disparity exists among closed-source models: Gemini-1.5-pro and DeepSeek excel at Unicode generation, whereas Claude-3.5-Sonnet is superior at handling Romanized text. These results provide an essential guide for practitioners selecting models for Sinhala-specific applications and highlight the critical role of training data in handling script variations.
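Since the open-source comparison rests on perplexity, a minimal sketch of the standard definition may help: perplexity is the exponential of the average negative log-probability a model assigns to each token, so lower values mean the text was predicted better. The function name `perplexity` and the example probabilities below are hypothetical illustrations, not values or code from the paper, and the abstract does not say how the authors handle differences between model tokenizers.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Return exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical per-token probabilities (illustrative only, not from the paper).
# A model that assigns probability 0.25 to every token has perplexity of about 4:
# it is as uncertain as a uniform choice among four options.
uniform = [math.log(0.25)] * 4

# A model that assigns high probability to each observed token scores lower.
confident = [math.log(p) for p in (0.9, 0.8, 0.95, 0.85)]

print(perplexity(uniform))
print(perplexity(confident))
```

In practice the log-probabilities would come from a model's output on held-out Sinhala text. Because perplexity is computed per token, scores from models with different tokenizers are not directly comparable without some normalization.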