CV, AI | Dec 1, 2025

Rice-VL: Evaluating Vision-Language Models for Cultural Understanding Across ASEAN Countries

arXiv:2512.01419v1 | h-index: 13
Originality: Incremental advance
AI Analysis

This work addresses the need for inclusive VLM development that better serves culturally diverse global populations. Its contribution is incremental, since it focuses on evaluation rather than on creating a new model.

The paper tackles the problem of Western-centric biases in Vision-Language Models (VLMs) by introducing RICE-VL, a benchmark evaluating cultural understanding across 11 ASEAN countries, revealing significant performance gaps in low-resource countries and abstract cultural domains.

Vision-Language Models (VLMs) excel in multimodal tasks but often exhibit Western-centric biases, limiting their effectiveness in culturally diverse regions like Southeast Asia (SEA). To address this, we introduce RICE-VL, a novel benchmark evaluating VLM cultural understanding across 11 ASEAN countries. RICE-VL includes over 28,000 human-curated Visual Question Answering (VQA) samples -- covering True or False, Fill-in-the-Blank, and open-ended formats -- and 1,000 image-bounding box pairs for Visual Grounding, annotated by culturally informed experts across 14 sub-ground categories. We propose SEA-LAVE, an extension of the LAVE metric, assessing textual accuracy, cultural alignment, and country identification. Evaluations of six open- and closed-source VLMs reveal significant performance gaps in low-resource countries and abstract cultural domains. The Visual Grounding task tests models' ability to localize culturally significant elements in complex scenes, probing spatial and contextual accuracy. RICE-VL exposes limitations in VLMs' cultural comprehension and highlights the need for inclusive model development to better serve diverse global populations.
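The abstract does not say how Visual Grounding predictions are scored against the annotated bounding boxes. A common choice for this kind of task is intersection-over-union (IoU) with a hit threshold, so the sketch below shows that approach purely as an illustration. The `(x_min, y_min, x_max, y_max)` box format, the function names, and the 0.5 threshold are assumptions and are not taken from the paper.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(pred: Box, gold: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min = max(pred[0], gold[0])
    iy_min = max(pred[1], gold[1])
    ix_max = min(pred[2], gold[2])
    iy_max = min(pred[3], gold[3])

    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gold = (gold[2] - gold[0]) * (gold[3] - gold[1])
    union = area_pred + area_gold - inter

    return inter / union if union > 0 else 0.0


def is_hit(pred: Box, gold: Box, threshold: float = 0.5) -> bool:
    """Count a prediction as correct when IoU meets the threshold.

    The 0.5 default is a conventional value, assumed here for illustration.
    """
    return iou(pred, gold) >= threshold


if __name__ == "__main__":
    predicted = (0.0, 0.0, 2.0, 2.0)
    annotated = (1.0, 1.0, 3.0, 3.0)
    print(iou(predicted, annotated))
    print(is_hit(predicted, annotated))
```

Under this scheme, grounding accuracy for a country or category would be the fraction of image-box pairs for which `is_hit` returns true, which is one way per-country performance gaps could be compared.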

