CL, CV | Jun 16, 2021

Probing Image-Language Transformers for Verb Understanding

arXiv:2106.09141v1741 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses a specific limitation of multimodal AI models that matters to researchers, but it is incremental: it probes existing models rather than introducing new methods.

The study investigates whether pretrained image-language transformers understand verbs. Evaluated on a dataset covering 421 verbs, the models fail more often on verbs than on other parts of speech such as nouns.

Multimodal image-language transformers have achieved impressive results on a variety of tasks that rely on fine-tuning (e.g., visual question answering and image retrieval). We are interested in shedding light on the quality of their pretrained representations, in particular whether these models can distinguish different types of verbs or whether they rely solely on the nouns in a given sentence. To do so, we collect a dataset of image-sentence pairs (in English) covering 421 verbs that are either visual or commonly found in the pretraining data (i.e., the Conceptual Captions dataset). We use this dataset to evaluate pretrained image-language transformers and find that they fail more often in situations that require verb understanding than in those that depend on other parts of speech. We also investigate which categories of verbs are particularly challenging.
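The kind of comparison the abstract describes (how often a model fails on verbs versus other parts of speech) can be illustrated with a small sketch. This is not the authors' evaluation code. The `ProbeExample` record, the `accuracy_by_pos` helper, the 0.5 threshold, and the toy `noun_only_scorer` are all hypothetical names and choices introduced here. The sketch assumes each probe example records whether the sentence matches the image and which part of speech it tests.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class ProbeExample(NamedTuple):
    """One image-sentence pair (hypothetical record layout, not the paper's)."""
    image_id: str    # stand-in for the image itself
    sentence: str    # English sentence paired with the image
    is_match: bool   # True if the sentence describes the image
    probed_pos: str  # part of speech this example tests, e.g. "verb"


def accuracy_by_pos(
    examples: Iterable[ProbeExample],
    match_score: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed decision threshold
) -> Dict[str, float]:
    """Fraction of correct match/mismatch decisions per probed part of speech."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        predicted_match = match_score(ex.image_id, ex.sentence) >= threshold
        correct[ex.probed_pos] += int(predicted_match == ex.is_match)
        total[ex.probed_pos] += 1
    return {pos: correct[pos] / total[pos] for pos in total}


# Toy stand-in for a model that relies solely on nouns: it declares a match
# whenever every noun in the sentence appears among the image's noun tags.
IMAGE_NOUNS = {
    "img_sit": {"girl", "grass"},  # a girl sitting on grass
    "img_run": {"girl", "grass"},  # a girl running on grass
    "img_dog": {"dog", "grass"},   # a dog sitting on grass
}
KNOWN_NOUNS = {"girl", "dog", "grass"}


def noun_only_scorer(image_id: str, sentence: str) -> float:
    sentence_nouns = set(sentence.split()) & KNOWN_NOUNS
    return 1.0 if sentence_nouns <= IMAGE_NOUNS[image_id] else 0.0


SENTENCE = "a girl sits on the grass"
EXAMPLES = [
    ProbeExample("img_sit", SENTENCE, True, "verb"),
    ProbeExample("img_run", SENTENCE, False, "verb"),     # only the verb differs
    ProbeExample("img_sit", SENTENCE, True, "subject"),
    ProbeExample("img_dog", SENTENCE, False, "subject"),  # only the subject differs
]

RESULTS = accuracy_by_pos(EXAMPLES, noun_only_scorer)
print(RESULTS)  # {'verb': 0.5, 'subject': 1.0}
```

In this toy setup the noun-only scorer is perfect on the subject examples but at chance on the verb examples, because it cannot tell the sitting image from the running one. A real probe would replace `noun_only_scorer` with the image-sentence matching score of a pretrained model.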
