CLJan 11, 2023
Multimodal Inverse Cloze Task for Knowledge-based Visual Question AnsweringPaul Lerner, Olivier Ferret, Camille Guinaudeau
We present a new pre-training method, Multimodal Inverse Cloze Task, for Knowledge-based Visual Question Answering about named Entities (KVQAE). KVQAE is a recently introduced task that consists in answering questions about named entities grounded in a visual context using a Knowledge Base. Therefore, the interaction between the modalities is paramount to retrieve information and must be captured with complex fusion models. As these models require a lot of training data, we design this pre-training task from existing work in textual Question Answering. It consists in considering a sentence as a pseudo-question and its context as a pseudo-relevant passage and is extended by considering images near texts in multimodal documents. Our method is applicable to different neural network architectures and leads to a 9% relative-MRR and 15% relative-F1 gain for retrieval and reading comprehension, respectively, over a no-pre-training baseline.
CLJan 11, 2024
Cross-modal Retrieval for Knowledge-based Visual Question AnsweringPaul Lerner, Olivier Ferret, Camille Guinaudeau
Knowledge-based Visual Question Answering about Named Entities is a challenging task that requires retrieving information from a multimodal Knowledge Base. Named entities have diverse visual representations and are therefore difficult to recognize. We argue that cross-modal retrieval may help bridge the semantic gap between an entity and its depictions, and is foremost complementary with mono-modal retrieval. We provide empirical evidence through experiments with a multimodal dual encoder, namely CLIP, on the recent ViQuAE, InfoSeek, and Encyclopedic-VQA datasets. Additionally, we study three different strategies to fine-tune such a model: mono-modal, cross-modal, or joint training. Our method, which combines mono-and cross-modal retrieval, is competitive with billion-parameter models on the three datasets, while being conceptually simpler and computationally cheaper.
CVOct 11, 2025
Fairness Without Labels: Pseudo-Balancing for Bias Mitigation in Face Gender ClassificationHaohua Dong, Ana Manzano Rodríguez, Camille Guinaudeau et al.
Face gender classification models often reflect and amplify demographic biases present in their training data, leading to uneven performance across gender and racial subgroups. We introduce pseudo-balancing, a simple and effective strategy for mitigating such biases in semi-supervised learning. Our method enforces demographic balance during pseudo-label selection, using only unlabeled images from a race-balanced dataset without requiring access to ground-truth annotations. We evaluate pseudo-balancing under two conditions: (1) fine-tuning a biased gender classifier using unlabeled images from the FairFace dataset, and (2) stress-testing the method with intentionally imbalanced training data to simulate controlled bias scenarios. In both cases, models are evaluated on the All-Age-Faces (AAF) benchmark, which contains a predominantly East Asian population. Our results show that pseudo-balancing consistently improves fairness while preserving or enhancing accuracy. The method achieves 79.81% overall accuracy - a 6.53% improvement over the baseline - and reduces the gender accuracy gap by 44.17%. In the East Asian subgroup, where baseline disparities exceeded 49%, the gap is narrowed to just 5.01%. These findings suggest that even in the absence of label supervision, access to a demographically balanced or moderately skewed unlabeled dataset can serve as a powerful resource for debiasing existing computer vision models.
CVApr 14, 2025
Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and StructureThéo Gigant, Camille Guinaudeau, Frédéric Dufaux
Vision-Language Models (VLMs) can process visual and textual information in multiple formats: texts, images, interleaved texts and images, or even hour-long videos. In this work, we conduct fine-grained quantitative and qualitative analyses of automatic summarization of multimodal presentations using VLMs with various representations as input. From these experiments, we suggest cost-effective strategies for generating summaries from text-heavy multimodal documents under different input-length budgets using VLMs. We show that slides extracted from the video stream can be beneficially used as input against the raw video, and that a structured representation from interleaved slides and transcript provides the best performance. Finally, we reflect and comment on the nature of cross-modal interactions in multimodal presentations and share suggestions to improve the capabilities of VLMs to understand documents of this nature.