SoS: Analysis of Surface over Semantics in Multilingual Text-To-Image Generation
This work addresses a fairness and bias issue affecting global users of text-to-image generation by identifying a systematic flaw in multilingual models.
The paper examines how text-to-image models prioritize the surface form of a prompt's language over its semantics, which leads to culturally stereotypical outputs. It analyzes seven models across 171 cultural identities and 14 languages and shows that all but one exhibit strong surface-level tendencies in at least two languages, with the effect intensifying across text encoder layers.
Text-to-image (T2I) models are increasingly employed by users worldwide. However, prior research has pointed to the high sensitivity of T2I models to the input language: when faced with languages other than English (i.e., different surface forms of the same prompt), T2I models often produce culturally stereotypical depictions, prioritizing the surface over the prompt's semantics. Yet a comprehensive analysis of this behavior, which we dub Surface-over-Semantics (SoS), is missing. We present the first analysis of T2I models' SoS tendencies. To this end, we create a set of prompts covering 171 cultural identities, translated into 14 languages, and use it to prompt seven T2I models. To quantify SoS tendencies across models, languages, and cultures, we introduce a novel measure and analyze how the tendencies we identify manifest visually. We show that all but one model exhibit strong surface-level tendencies in at least two languages, with this effect intensifying across the layers of T2I text encoders. Moreover, these surface tendencies frequently correlate with stereotypical visual depictions.
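
The abstract does not define the SoS measure, so the following is only an illustrative sketch of what a surface-versus-semantics comparison could look like, not the paper's method. It assumes, hypothetically, that each generated image has an embedding vector and that two reference embeddings are available: one for the culture named in the prompt (`semantic_ref`) and one for the culture associated with the prompt's language (`surface_ref`). The function names and the difference-of-cosines form are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sos_score(image_emb, semantic_ref, surface_ref):
    """Hypothetical surface-over-semantics score (not the paper's measure).

    Positive: the image is closer to the culture tied to the prompt language.
    Negative: the image is closer to the culture the prompt actually names.
    """
    return cosine(image_emb, surface_ref) - cosine(image_emb, semantic_ref)


# Toy 2-D embeddings: the image coincides with the surface reference.
print(sos_score([1.0, 0.0], semantic_ref=[0.0, 1.0], surface_ref=[1.0, 0.0]))
```

Under these assumptions, averaging such a score over all prompts for a given model and language would give one number per model-language pair, which is the granularity at which the abstract reports its findings.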