CVAISep 27, 2025

Culture In a Frame: C$^3$B as a Comic-Based Benchmark for Multimodal Culturally Awareness

arXiv:2510.00041v11 citationsh-index: 15Has Code
Originality Synthesis-oriented
AI Analysis

This addresses the need for better evaluation of cultural awareness in AI models, though it is incremental as it introduces a new benchmark rather than a novel method.

The authors tackled the lack of challenging and cross-lingual benchmarks for cultural awareness in multimodal large language models by proposing C³B, a comic-based benchmark with over 2000 images and 18000 QA pairs, which revealed a significant performance gap between 11 open-source MLLMs and human performance.

Cultural awareness capabilities has emerged as a critical capability for Multimodal Large Language Models (MLLMs). However, current benchmarks lack progressed difficulty in their task design and are deficient in cross-lingual tasks. Moreover, current benchmarks often use real-world images. Each real-world image typically contains one culture, making these benchmarks relatively easy for MLLMs. Based on this, we propose C$^3$B ($\textbf{C}$omics $\textbf{C}$ross-$\textbf{C}$ultural $\textbf{B}$enchmark), a novel multicultural, multitask and multilingual cultural awareness capabilities benchmark. C$^3$B comprises over 2000 images and over 18000 QA pairs, constructed on three tasks with progressed difficulties, from basic visual recognition to higher-level cultural conflict understanding, and finally to cultural content generation. We conducted evaluations on 11 open-source MLLMs, revealing a significant performance gap between MLLMs and human performance. The gap demonstrates that C$^3$B poses substantial challenges for current MLLMs, encouraging future research to advance the cultural awareness capabilities of MLLMs.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes