CV | AI | May 22, 2024

No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models

Angéline Pouget, Lucas Beyer, Emanuele Bugliarello, Xiao Wang, Andreas Peter Steiner, Xiaohua Zhai, Ibrahim Alabdulmohsin

arXiv:2405.13777v317.815 citations | h-index: 39 | Has Code | NIPS

Originality: Incremental advance

AI Analysis

This work addresses cultural and socioeconomic bias in contrastive vision-language models, with the aim of building more inclusive multimodal systems. It represents a domain-specific advance.

The study finds that filtering training data to English image-text pairs disadvantages communities of lower socioeconomic status and harms cultural understanding in vision-language models. Pretraining on global, unfiltered data before fine-tuning on English content can improve cultural understanding without sacrificing performance on Western-centric benchmarks such as ImageNet and COCO. The authors also introduce geo-localization as a new evaluation task for assessing cultural diversity.

We study cultural and socioeconomic diversity in contrastive vision-language models (VLMs). Using a broad range of benchmark datasets and evaluation metrics, we bring to attention several important findings. First, the common filtering of training data to English image-text pairs disadvantages communities of lower socioeconomic status and negatively impacts cultural understanding. Notably, this performance gap is not captured by, and is even at odds with, the currently popular evaluation metrics derived from the Western-centric ImageNet and COCO datasets. Second, pretraining with global, unfiltered data before fine-tuning on English content can improve cultural understanding without sacrificing performance on said popular benchmarks. Third, we introduce the task of geo-localization as a novel evaluation metric to assess cultural diversity in VLMs. Our work underscores the value of using diverse data to create more inclusive multimodal systems and lays the groundwork for developing VLMs that better represent global perspectives.
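The abstract does not spell out how geo-localization is scored, so the following is only a minimal sketch of one plausible setup: each image embedding is assigned to the country whose text embedding has the highest cosine similarity, and accuracy is the fraction of images assigned to their true country. The function names, the three-dimensional toy embeddings, and the country list are all hypothetical and are not taken from the paper, whose actual protocol may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def predict_country(image_emb, country_embs):
    """Return the country whose text embedding is most similar to the image."""
    return max(country_embs, key=lambda c: cosine(image_emb, country_embs[c]))


def geo_localization_accuracy(image_embs, true_countries, country_embs):
    """Fraction of images whose predicted country matches the true label."""
    if not image_embs or len(image_embs) != len(true_countries):
        raise ValueError("need equally sized, non-empty embeddings and labels")
    correct = sum(
        predict_country(emb, country_embs) == truth
        for emb, truth in zip(image_embs, true_countries)
    )
    return correct / len(image_embs)


# Hypothetical toy data standing in for real VLM embeddings.
COUNTRY_EMBS = {
    "Japan": [1.0, 0.0, 0.0],
    "Kenya": [0.0, 1.0, 0.0],
    "Peru": [0.0, 0.0, 1.0],
}
IMAGE_EMBS = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.6, 0.3, 0.1]]
TRUE_COUNTRIES = ["Japan", "Kenya", "Peru"]

accuracy = geo_localization_accuracy(IMAGE_EMBS, TRUE_COUNTRIES, COUNTRY_EMBS)
print(f"geo-localization accuracy: {accuracy:.3f}")
```

In this toy run the third image is labeled "Peru" but lies closest to the "Japan" embedding, so it counts as an error. Under this assumed setup, a model with weaker cultural understanding would show up as lower accuracy on such a task.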
