Taqyim: Evaluating Arabic NLP Tasks Using ChatGPT Models
This work assesses ChatGPT models for Arabic NLP, providing benchmarks and tools for researchers, but it is incremental as it applies existing methods to a new language context.
The study evaluated GPT-3.5 and GPT-4 on seven Arabic NLP tasks and found that GPT-4 outperformed GPT-3.5 on five of them. It also includes a detailed analysis of sentiment analysis on a dialectal dataset.
Large language models (LLMs) have demonstrated impressive performance on various downstream tasks without requiring fine-tuning. One example is ChatGPT, a chat-based model built on top of LLMs such as GPT-3.5 and GPT-4. Although languages other than English make up a smaller proportion of their training data, these models also exhibit remarkable capabilities in those languages. In this study, we assess the performance of GPT-3.5 and GPT-4 on seven distinct Arabic NLP tasks: sentiment analysis, translation, transliteration, paraphrasing, part-of-speech tagging, summarization, and diacritization. Our findings reveal that GPT-4 outperforms GPT-3.5 on five of the seven tasks. Furthermore, we conduct an extensive analysis of the sentiment analysis task, providing insights into how LLMs achieve exceptional results on a challenging dialectal dataset. Additionally, we introduce a new Python interface (https://github.com/ARBML/Taqyim) that makes these tasks easy to evaluate.
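As a rough illustration of what evaluating an LLM on a task such as sentiment analysis without fine-tuning can involve, the sketch below scores a model by prompting it and comparing its replies to gold labels. It is not the Taqyim interface and does not reproduce the paper's prompts or setup: the label set, the prompt wording, and the names `build_prompt`, `normalize`, and `accuracy` are all hypothetical, and the model is assumed to be any callable that maps a prompt string to a reply string.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical label set; the labels used in the paper's dataset are not stated here.
LABELS = ("positive", "negative", "neutral")


def build_prompt(text: str) -> str:
    """Format a zero-shot sentiment prompt (illustrative wording only)."""
    return (
        "Classify the sentiment of the following Arabic text as one of: "
        + ", ".join(LABELS)
        + ".\n"
        + f"Text: {text}\nSentiment:"
    )


def normalize(reply: str) -> str:
    """Map a free-form reply to the first known label it contains, else 'unknown'."""
    cleaned = reply.strip().lower()
    for label in LABELS:
        if label in cleaned:
            return label
    return "unknown"


def accuracy(
    model: Callable[[str], str],
    examples: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of (text, gold_label) pairs where the model's reply matches gold."""
    pairs = list(examples)
    if not pairs:
        raise ValueError("at least one example is required")
    correct = sum(
        normalize(model(build_prompt(text))) == gold for text, gold in pairs
    )
    return correct / len(pairs)


if __name__ == "__main__":
    # Toy placeholder examples, not drawn from any dataset in the paper.
    toy_examples = [
        ("الخدمة ممتازة", "positive"),
        ("التجربة سيئة", "negative"),
    ]
    # Stub standing in for a real model call; it always answers "Positive."
    stub_model = lambda prompt: "Positive."
    print(accuracy(stub_model, toy_examples))
```

In practice the stub would be replaced by a function that sends the prompt to GPT-3.5 or GPT-4 and returns the reply text. Running the same `accuracy` call with each model gives a per-task comparison of the kind summarized above.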