CL, CV | May 17, 2023

What You See is What You Read? Improving Text-Image Alignment Evaluation

arXiv:2305.10400v4 | 136 citations
Originality: Incremental advance
AI Analysis

This work addresses a key challenge for vision-language models, with applications in generative tasks, though it is incremental as it builds on existing multimodal methods.

The paper tackles the problem of automatically evaluating semantic alignment between text and images, introducing the SeeTRUE dataset and two methods that surpass prior approaches with significant improvements in challenging cases like complex composition or unnatural images.

Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models. Both methods surpass prior approaches in various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.
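The question-generation and visual-question-answering pipeline described in the abstract, together with the re-ranking and localization uses it mentions, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `QAPair` type, the `generate_qa` and `vqa_score` callables, and the choice to average per-question scores are all assumptions made for the example, and the toy stand-ins at the bottom replace real models purely so the sketch runs on its own.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class QAPair:
    """A question derived from the text, with the answer the text implies."""
    question: str
    answer: str


# Hypothetical interfaces: a real system would wrap a question-generation
# model and a VQA model behind these two callables.
QuestionGenerator = Callable[[str], Sequence[QAPair]]
VQAScorer = Callable[[object, QAPair], float]  # support score in [0, 1]


def alignment_score(
    text: str,
    image: object,
    generate_qa: QuestionGenerator,
    vqa_score: VQAScorer,
) -> Tuple[float, List[Tuple[QAPair, float]]]:
    """Score text-image alignment as the mean per-question VQA score.

    Averaging is an assumption of this sketch; other aggregations are possible.
    """
    pairs = generate_qa(text)
    if not pairs:
        raise ValueError("question generator produced no QA pairs")
    per_pair = [(pair, vqa_score(image, pair)) for pair in pairs]
    return mean(score for _, score in per_pair), per_pair


def localize_misalignment(per_pair: List[Tuple[QAPair, float]]) -> QAPair:
    """Return the QA pair the image supports least (a candidate misalignment)."""
    return min(per_pair, key=lambda item: item[1])[0]


def rerank(
    text: str,
    images: Sequence[object],
    generate_qa: QuestionGenerator,
    vqa_score: VQAScorer,
) -> List[int]:
    """Return candidate image indices, best-aligned first (ties keep input order)."""
    scored = [
        (alignment_score(text, image, generate_qa, vqa_score)[0], index)
        for index, image in enumerate(images)
    ]
    return [index for _, index in sorted(scored, key=lambda t: (-t[0], t[1]))]


# Toy stand-ins so the sketch is runnable: an "image" is a set of phrases it
# depicts, and each " and "-separated phrase of the text becomes one question.
def toy_generate_qa(text: str) -> List[QAPair]:
    return [QAPair(f"Does the image show {p}?", p) for p in text.split(" and ")]


def toy_vqa_score(image: object, pair: QAPair) -> float:
    return 1.0 if pair.answer in image else 0.0


if __name__ == "__main__":
    text = "a red cube and a blue sphere"
    candidates = [{"a red cube"}, {"a red cube", "a blue sphere"}, set()]
    print(rerank(text, candidates, toy_generate_qa, toy_vqa_score))
```

Keeping the per-question scores alongside the aggregate is what makes localization possible in this sketch: the lowest-scoring question points at the part of the text the image fails to support.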

Code Implementations: 1 repo
