CV · CL · Dec 22, 2020

Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks

arXiv:2012.12352v4671 citations
AI Analysis

For researchers and developers, this study highlights a critical limitation in the reasoning and grounding capabilities of current pretrained V&L models: they cannot adequately count or individuate entities in an image.

This paper investigates the cross-modal reasoning abilities of three pretrained vision and language (V&L) models (ViLBERT, ViLBERT 12-in-1, and LXMERT) on two tasks: image-sentence discrimination and entity counting. The models performed well on image-sentence discrimination (Task 1), which they were pretrained on, but none adequately solved the counting task (Task 2) or generalized to out-of-distribution quantities.

We investigate the reasoning ability of pretrained vision and language (V&L) models in two tasks that require multimodal integration: (1) discriminating a correct image-sentence pair from an incorrect one, and (2) counting entities in an image. We evaluate three pretrained V&L models on these tasks: ViLBERT, ViLBERT 12-in-1 and LXMERT, in zero-shot and finetuned settings. Our results show that models solve task (1) very well, as expected, since all models are pretrained on task (1). However, none of the pretrained V&L models is able to adequately solve task (2), our counting probe, and they cannot generalise to out-of-distribution quantities. We propose a number of explanations for these findings: LXMERT (and to some extent ViLBERT 12-in-1) show some evidence of catastrophic forgetting on task (1). Concerning our results on the counting probe, we find evidence that all models are impacted by dataset bias, and also fail to individuate entities in the visual input. While a selling point of pretrained V&L models is their ability to solve complex tasks, our findings suggest that understanding their reasoning and grounding capabilities requires more targeted investigations on specific phenomena.
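As a rough illustration of how the two probes described in the abstract could be scored, the sketch below computes a pairwise discrimination accuracy for task (1) and a counting accuracy split by in-distribution and out-of-distribution quantities for task (2). It is a minimal sketch under assumptions, not the authors' evaluation code: the function names, the input formats, and the rule that a quantity is "in distribution" when it appears in a given set of seen counts are all hypothetical.

```python
from collections import defaultdict


def pairwise_accuracy(score_pairs):
    """Fraction of (correct_score, foil_score) pairs in which the model
    scores the correct image-sentence pair strictly higher than the foil.
    The score format is an assumption made for this sketch."""
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    wins = sum(1 for correct, foil in score_pairs if correct > foil)
    return wins / len(score_pairs)


def counting_accuracy_by_split(examples, seen_counts):
    """Exact-match counting accuracy, reported separately for gold counts
    that occur in `seen_counts` (assumed in-distribution) and for all
    other gold counts (assumed out-of-distribution).

    `examples` is an iterable of (gold_count, predicted_count) tuples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for gold, predicted in examples:
        split = "in_distribution" if gold in seen_counts else "out_of_distribution"
        totals[split] += 1
        hits[split] += int(gold == predicted)
    # Only splits that actually contain examples appear in the result.
    return {split: hits[split] / totals[split] for split in totals}


if __name__ == "__main__":
    # Hypothetical model scores and predictions, for illustration only.
    scores = [(0.9, 0.1), (0.4, 0.6), (0.7, 0.2), (0.8, 0.3)]
    print(pairwise_accuracy(scores))

    predictions = [(1, 1), (2, 2), (3, 2), (5, 3), (6, 6)]
    print(counting_accuracy_by_split(predictions, seen_counts={1, 2, 3}))
```

Reporting the two splits separately makes a generalisation gap visible: a model that only reproduces quantities it saw during training would score well in the first split and poorly in the second.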
