The Qiyas Benchmark: Measuring ChatGPT Mathematical and Language Understanding in Arabic
This work addresses the limited availability of evaluation tools for Arabic language models. The contribution is incremental, since it applies existing evaluation methods to a new dataset.
The authors address the lack of benchmarks for Arabic language models by introducing two new benchmarks, derived from the Qiyas exam, that evaluate mathematical reasoning and language understanding. On these benchmarks, ChatGPT-4 achieved 64% accuracy and ChatGPT-3.5-turbo achieved 49%.
Despite the growing importance of Arabic as a global language, there is a notable lack of language models pre-trained exclusively on Arabic data. This shortage has left few benchmarks available for assessing language model performance in Arabic. To address this gap, we introduce two novel benchmarks designed to evaluate models' mathematical reasoning and language understanding abilities in Arabic. These benchmarks are derived from the Qiyas exam, a General Aptitude Test (GAT) that is widely used as a standardized test for university admissions in Saudi Arabia. For validation purposes, we assess the performance of ChatGPT-3.5-turbo and ChatGPT-4 on our benchmarks. Our findings reveal that these benchmarks pose a significant challenge: ChatGPT-4 achieved an overall average accuracy of 64%, while ChatGPT-3.5-turbo achieved an overall accuracy of 49% across the various question types in the Qiyas benchmark. We believe the release of these benchmarks will pave the way for enhancing the mathematical reasoning and language understanding capabilities of future models tailored for the low-resource Arabic language.
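As a rough illustration of how an overall accuracy across question types might be computed, the sketch below scores multiple-choice answers per question type and then averages them. The abstract does not say whether the reported figures are averaged per question type (macro) or per question (micro), so both are shown. The record layout, the function names, and the toy data are assumptions made for this example and are not taken from the benchmark.

```python
from collections import defaultdict

def accuracy_by_type(records):
    """Return {question_type: accuracy} for (question_type, predicted, gold) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for qtype, predicted, gold in records:
        total[qtype] += 1
        if predicted == gold:
            correct[qtype] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}

def macro_average(per_type):
    """Unweighted mean of the per-type accuracies."""
    return sum(per_type.values()) / len(per_type)

def micro_average(records):
    """Fraction of all questions answered correctly, regardless of type."""
    records = list(records)
    hits = sum(1 for _, predicted, gold in records if predicted == gold)
    return hits / len(records)

# Toy data for illustration only; these are NOT results from the Qiyas benchmark.
toy_records = [
    ("quantitative", "A", "A"),
    ("quantitative", "B", "C"),
    ("verbal", "D", "D"),
    ("verbal", "A", "A"),
    ("verbal", "B", "C"),
    ("verbal", "C", "C"),
]

per_type = accuracy_by_type(toy_records)
print(per_type)
print("macro:", macro_average(per_type))
print("micro:", micro_average(toy_records))
```

The two averages differ whenever question types have unequal sizes, so a reported "overall" figure is easier to interpret when the averaging scheme is stated alongside it.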