EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA
This work addresses the need for culturally diverse, multilingual benchmarks in multimodal AI. Its contribution is incremental, however: it builds on existing VQA frameworks by adding cultural and linguistic dimensions.
The authors tackle the problem that multimodal models fail on culturally grounded, everyday-knowledge queries in low-resource languages. They introduce EverydayMMQA, a framework for creating datasets, and OASIS, a dataset with roughly 0.92M images and 14.8M QA pairs, including 3.7M spoken questions, to benchmark models on tasks requiring pragmatic and culturally aware reasoning.
Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they often fail when queries require culturally grounded, everyday knowledge, particularly in low-resource and underrepresented languages. To bridge this gap, we introduce Everyday Multimodal and Multilingual QA (EverydayMMQA), a framework for creating large-scale, culturally grounded datasets for spoken and visual question answering (SVQA). Using this framework, we developed OASIS, a multimodal dataset integrating speech, images, and text. With roughly 0.92M images and 14.8M QA pairs, OASIS contains 3.7M spoken questions, enabling four unique input combinations: speech-only, text-only, speech+image, and text+image. The dataset focuses on English and Arabic varieties across 18 countries, and its content is curated to reflect diverse, real-world situations. OASIS tests models on tasks beyond object recognition that involve pragmatic, commonsense, and culturally aware reasoning. We benchmarked four closed-source models, three open-source models, and one fine-tuned model. EverydayMMQA and OASIS together provide a benchmark and training dataset for building multimodal LLMs for a comprehensive set of everyday tasks within cultural contexts. The framework and dataset will be made publicly available to the community.
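The four input combinations named in the abstract can be illustrated with a minimal Python sketch. This is not the OASIS schema or any released API: the class `QAInput`, its field names, and the helper `input_modes` are hypothetical, introduced only to show how a question modality (speech or text) pairs with an optional image.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional


@dataclass(frozen=True)
class QAInput:
    """Hypothetical container for one query; field names are assumptions."""
    question_text: Optional[str] = None   # written question, if any
    question_audio: Optional[str] = None  # path to a spoken question, if any
    image: Optional[str] = None           # path to an image, if any

    def mode(self) -> str:
        """Return which of the four input combinations this query uses."""
        has_text = self.question_text is not None
        has_audio = self.question_audio is not None
        # Assumption: a query carries its question in exactly one modality.
        if has_text == has_audio:
            raise ValueError("set exactly one of question_text or question_audio")
        question = "speech" if has_audio else "text"
        return f"{question}+image" if self.image is not None else f"{question}-only"


def input_modes() -> List[str]:
    """Enumerate the combinations: {speech, text} x {without image, with image}."""
    return [
        f"{question}+image" if with_image else f"{question}-only"
        for question, with_image in product(("speech", "text"), (False, True))
    ]


if __name__ == "__main__":
    print(input_modes())
    print(QAInput(question_audio="question.wav", image="scene.jpg").mode())
```

Treating the question modality and the presence of an image as two independent choices is what yields exactly four combinations; an evaluation harness could group results by `mode()` to compare a model across them.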