CL, CV | Feb 8, 2024

Exploring Visual Culture Awareness in GPT-4V: A Comprehensive Probing

arXiv:2402.06015v2 | 11 citations | h-index: 14
AI Analysis

This work addresses a gap in assessing how well vision-language models understand culture, which matters for AI fairness and for applicability across diverse global contexts. The contribution is incremental, since it builds on existing benchmarks.

The study probed GPT-4V's visual cultural awareness using the MaRVL benchmark. It found that the model excels at identifying cultural concepts but performs worse in low-resource languages such as Tamil and Swahili. In human evaluation, GPT-4V's image captions were judged more culturally relevant than the original MaRVL annotations.

Pretrained large vision-language models have drawn considerable interest in recent years due to their remarkable performance. Despite considerable efforts to assess these models from diverse perspectives, the extent of visual cultural awareness in the state-of-the-art GPT-4V model remains unexplored. To address this gap, we extensively probed GPT-4V using the MaRVL benchmark dataset, aiming to investigate its capabilities and limitations in visual understanding with a focus on cultural aspects. Specifically, we introduced three vision-related tasks, i.e., caption classification, pairwise captioning, and culture tag selection, to systematically examine visual cultural awareness at a fine-grained level. Experimental results indicate that GPT-4V excels at identifying cultural concepts but still exhibits weaker performance in low-resource languages, such as Tamil and Swahili. Notably, in human evaluation, GPT-4V's image captions prove more culturally relevant than the original MaRVL human annotations, suggesting a promising approach for constructing future visual cultural benchmarks.

