Mar 27, 2025

JEEM: Vision-Language Understanding in Four Arabic Dialects

arXiv:2503.21910v1 | 11 citations | h-index: 7 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for more inclusive AI models by highlighting performance gaps on culturally diverse vision-language tasks. Its contribution is incremental, since it focuses on benchmarking rather than model development.

The authors evaluate Vision-Language Models (VLMs) on visual understanding across four Arabic dialects. They find that the Arabic VLMs consistently underperform, and that GPT-4V, although it ranks best, shows linguistic competence that varies across dialects and visual understanding that lags behind.

We introduce JEEM, a benchmark designed to evaluate Vision-Language Models (VLMs) on visual understanding across four Arabic-speaking countries: Jordan, The Emirates, Egypt, and Morocco. JEEM includes the tasks of image captioning and visual question answering, and features culturally rich and regionally diverse content. This dataset aims to assess the ability of VLMs to generalize across dialects and accurately interpret cultural elements in visual contexts. In an evaluation of five prominent open-source Arabic VLMs and GPT-4V, we find that the Arabic VLMs consistently underperform, struggling with both visual understanding and dialect-specific generation. While GPT-4V ranks best in this comparison, the model's linguistic competence varies across dialects, and its visual understanding capabilities lag behind. This underscores the need for more inclusive models and the value of culturally-diverse evaluation paradigms.
