Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models
This work addresses cultural bias and cultural adaptation in AI systems and is aimed at researchers and developers. Its contribution is incremental, since it builds on existing benchmarks and methods.
The study examines cultural understanding in Large Multimodal Models (LMMs) by introducing DalleStreet, a dataset of 9,935 images covering 67 countries. Using both LLaVA and GPT-4V, it finds disparities in cultural competence at the geographic sub-region level, and it identifies over 18,000 artifacts associated with different countries, highlighting the need for culture-aware systems.
We present a comprehensive three-phase study that examines (1) the cultural understanding of Large Multimodal Models (LMMs), by introducing DalleStreet, a large-scale dataset generated by DALL-E 3 and validated by humans, containing 9,935 images of 67 countries and 10 concept classes; (2) the underlying implicit and potentially stereotypical cultural associations, through a cultural artifact extraction task; and (3) an approach to adapting the cultural representation in an image based on the extracted associations, using a modular pipeline, CultureAdapt. We find disparities in cultural understanding at the geographic sub-region level with both open-source (LLaVA) and closed-source (GPT-4V) models on DalleStreet and other existing benchmarks, and we try to explain these disparities using over 18,000 artifacts that we identify as associated with different countries. Our findings reveal a nuanced picture of the cultural competence of LMMs, highlighting the need to develop culture-aware systems. Dataset and code are available at https://github.com/iamshnoo/crossroads
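To make the sub-region comparison concrete, the following is a minimal sketch of how per-image results could be rolled up into sub-region accuracy. It assumes the evaluation is framed as predicting a country label for each image, which the abstract does not spell out. The `SUBREGION` mapping, the `subregion_accuracy` function, and the toy records are hypothetical illustrations, not taken from the DalleStreet data or the released code.

```python
from collections import defaultdict

# Hypothetical country -> sub-region mapping, for illustration only.
SUBREGION = {
    "India": "Southern Asia",
    "Nepal": "Southern Asia",
    "France": "Western Europe",
    "Germany": "Western Europe",
}


def subregion_accuracy(records, subregion=SUBREGION):
    """Aggregate (true_country, predicted_country) pairs into accuracy per sub-region."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for true_country, predicted_country in records:
        region = subregion[true_country]
        total[region] += 1
        if predicted_country == true_country:
            correct[region] += 1
    # Every region in `total` has at least one record, so no division by zero.
    return {region: correct[region] / total[region] for region in total}


# Toy predictions (made up): one of the two Southern Asia images is misclassified.
toy_records = [
    ("India", "India"),
    ("Nepal", "India"),
    ("France", "France"),
    ("Germany", "Germany"),
]

if __name__ == "__main__":
    print(subregion_accuracy(toy_records))
```

Grouping by sub-region rather than by country keeps the per-group sample size larger, which is one plausible reason to report disparities at that level; the actual grouping scheme used in the study may differ.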