Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models
This work addresses biases in multimodal AI models. It is an incremental step toward understanding how biases compound across modalities, which affects fairness in applications such as image captioning and visual question answering.
The paper tackles biases in pre-trained vision-and-language models by extending text-based bias analysis methods to multimodal settings. It demonstrates that VL-BERT exhibits gender biases, often reinforcing a stereotype rather than accurately describing the visual scene, both in a controlled case study and on a larger set of stereotypically gendered entities.
Numerous works have analyzed biases in vision and pre-trained language models individually; however, less attention has been paid to how these biases interact in multimodal settings. This work extends text-based bias analysis methods to investigate multimodal language models, and analyzes intra- and inter-modality associations and biases learned by these models. Specifically, we demonstrate that VL-BERT (Su et al., 2020) exhibits gender biases, often preferring to reinforce a stereotype over faithfully describing the visual scene. We demonstrate these findings on a controlled case study and extend them to a larger set of stereotypically gendered entities.
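
Text-based bias analyses of masked language models often compare the probabilities a model assigns to gendered words at a masked position. The Python sketch below illustrates that general idea only and is not necessarily the metric used in this work. The function names and the probability values are hypothetical, and a real analysis would obtain the probabilities from the model itself (for example, with and without an entity present in the text or image).

```python
import math

def gender_log_odds(p_female: float, p_male: float) -> float:
    """Log-odds of filling a masked slot with a female vs. male term."""
    return math.log(p_female) - math.log(p_male)

def association_shift(p_with: dict, p_without: dict) -> float:
    """Change in female-vs-male log-odds when an entity is present vs. absent.

    Positive values mean the entity shifts the prediction toward female terms,
    negative values toward male terms.
    """
    return (gender_log_odds(p_with["female"], p_with["male"])
            - gender_log_odds(p_without["female"], p_without["male"]))

# Hypothetical masked-token probabilities, invented for illustration only.
with_entity = {"female": 0.6, "male": 0.2}     # entity present in the input
without_entity = {"female": 0.3, "male": 0.3}  # control input without the entity

shift = association_shift(with_entity, without_entity)
print(f"association shift: {shift:.3f}")
```

Measuring a difference against a control input, rather than a raw probability, separates the entity's contribution from the model's prior preference for one gendered term over the other.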