EgMM-Corpus: A Multimodal Vision-Language Dataset for Egyptian Culture
EgMM-Corpus provides a benchmark for developing culturally aware AI models and addresses a gap in resources for Egyptian culture. The contribution is incremental, however: it focuses on dataset creation and introduces no new methods.
The authors address the lack of culturally diverse multimodal datasets for Egyptian culture by introducing EgMM-Corpus, a dataset of over 3,000 images spanning 313 concepts. In zero-shot classification on this dataset, CLIP achieves 21.2% Top-1 and 36.4% Top-5 accuracy, highlighting cultural bias in vision-language models.
Despite recent advances in AI, multimodal culturally diverse datasets are still limited, particularly for regions in the Middle East and Africa. In this paper, we introduce EgMM-Corpus, a multimodal dataset dedicated to Egyptian culture. By designing and running a new data collection pipeline, we collected over 3,000 images covering 313 concepts across landmarks, food, and folklore. Each entry in the dataset is manually validated for cultural authenticity and multimodal coherence. EgMM-Corpus aims to provide a reliable resource for evaluating and training vision-language models in an Egyptian cultural context. We further evaluate the zero-shot performance of Contrastive Language-Image Pre-training (CLIP) on EgMM-Corpus, on which it achieves 21.2% Top-1 accuracy and 36.4% Top-5 accuracy in classification. These results underscore the existing cultural bias in large-scale vision-language models and demonstrate the importance of EgMM-Corpus as a benchmark for developing culturally aware models.
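To make the reported Top-1 and Top-5 figures concrete, the following is a minimal sketch of how top-k accuracy can be computed for zero-shot classification. It assumes that each image has already been scored against every concept, for example by the similarity between a CLIP image embedding and the embedding of a text prompt per concept. The abstract does not describe the evaluation code, so the function name `top_k_accuracy`, the toy scores, and the three-concept setup are illustrative assumptions, not the authors' implementation.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int,
) -> float:
    """Fraction of samples whose true concept is among the k highest-scoring concepts.

    scores[i][j] is the (assumed) image-text similarity between image i and concept j.
    labels[i] is the index of the true concept for image i.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    if k < 1:
        raise ValueError("k must be at least 1")

    hits = 0
    for row, label in zip(scores, labels):
        # Rank concept indices from highest to lowest similarity.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical toy data: 4 images scored against 3 concepts.
toy_scores = [
    [0.9, 0.1, 0.3],  # true concept 0: ranked first
    [0.2, 0.5, 0.7],  # true concept 1: ranked second
    [0.6, 0.4, 0.1],  # true concept 2: ranked last
    [0.1, 0.8, 0.3],  # true concept 1: ranked first
]
toy_labels = [0, 1, 2, 1]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)  # 0.5
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)  # 0.75
print(f"Top-1: {top1:.3f}, Top-2: {top2:.3f}")
```

On EgMM-Corpus the same computation would run over 313 concepts with k set to 1 and 5. With that many classes, a Top-5 score of 36.4% means the correct concept falls outside the five best-scoring candidates for roughly two out of three images.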