CL, AI · Oct 7, 2025

EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA

U of Toronto
arXiv:2510.06371v1 · 2 citations · h-index: 37 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the need for culturally diverse, multilingual benchmarks in multimodal AI. The advance is incremental: it builds on existing VQA frameworks by adding cultural and linguistic dimensions.

The authors address the failure of multimodal models on culturally grounded, everyday-knowledge queries in low-resource languages. They introduce EverydayMMQA, a framework for creating datasets, and OASIS, a dataset with roughly 0.92M images and 14.8M QA pairs, including 3.7M spoken questions, for benchmarking models on tasks that require pragmatic and culturally aware reasoning.

Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they often fail when queries require culturally grounded, everyday knowledge, particularly in low-resource and underrepresented languages. To bridge this gap, we introduce Everyday Multimodal and Multilingual QA (EverydayMMQA), a framework for creating large-scale, culturally grounded datasets for spoken and visual question answering (SVQA). Using this framework, we developed OASIS, a multimodal dataset integrating speech, images, and text. With approximately 0.92M images and 14.8M QA pairs, OASIS contains 3.7M spoken questions, enabling four unique input combinations: speech-only, text-only, speech+image, and text+image. Focused on English and Arabic varieties across 18 countries, the dataset content is curated to reflect diverse, real-world situations. OASIS tests models on tasks beyond object recognition that involve pragmatic, commonsense, and culturally aware reasoning. We benchmarked four closed-source models, three open-source models, and one fine-tuned model. EverydayMMQA and OASIS together provide a benchmark and training dataset for building multimodal LLMs for a comprehensive set of everyday tasks within cultural contexts. The framework and dataset will be made publicly available to the community.

