CL | CV | Dec 5, 2023

Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment

arXiv:2312.03766v2 | 18 citations | h-index: 37 | ECCV
Originality: Incremental advance
AI Analysis

This work addresses the need for more interpretable image-text alignment models, offering a domain-specific improvement for computer vision and natural language processing applications.

The paper tackles the problem of pinpointing the exact source of misalignment in image-text pairs, presenting a method that provides detailed textual and visual explanations, outperforming baselines on binary alignment classification and explanation generation tasks.

While existing image-text alignment models reach high-quality binary assessments, they fall short of pinpointing the exact source of misalignment. In this paper, we present a method to provide detailed textual and visual explanations of detected misalignments between text-image pairs. We leverage large language models and visual grounding models to automatically construct a training set that holds plausible misaligned captions for a given image and corresponding textual explanations and visual indicators. We also publish a new human-curated test set comprising ground-truth textual and visual misalignment annotations. Empirical results show that fine-tuning vision language models on our training set enables them to articulate misalignments and visually indicate them within images, outperforming strong baselines both on the binary alignment classification and the explanation generation tasks. Our method code and human-curated test set are available at: https://mismatch-quest.github.io/

Code Implementations: 1 repo
