CL, CV | May 12, 2023

Measuring Progress in Fine-grained Vision-and-Language Understanding

arXiv:2305.07558v1 | 239 citations
Originality: Synthesis-oriented
AI Analysis

This work assesses fine-grained vision-and-language understanding and how to improve it, giving researchers and practitioners insight into model performance and training dynamics. It is incremental in that it analyzes existing models and benchmarks without introducing new ones.

The paper investigates four vision-and-language models on four fine-grained benchmarks to measure progress in understanding relationships, verbs, and numbers in images. It finds that X-VLM consistently outperforms the other models, and that modeling innovations can affect performance more than scaling web data, which sometimes degrades performance.

While pretraining on large-scale image-text data from the Web has facilitated rapid progress on many vision-and-language (V&L) tasks, recent work has demonstrated that pretrained models lack "fine-grained" understanding, such as the ability to recognise relationships, verbs, and numbers in images. This has resulted in an increased interest in the community to either develop new benchmarks or models for such capabilities. To better understand and quantify progress in this direction, we investigate four competitive V&L models on four fine-grained benchmarks. Through our analysis, we find that X-VLM (Zeng et al., 2022) consistently outperforms other baselines, and that modelling innovations can impact performance more than scaling Web data, which even degrades performance sometimes. Through a deeper investigation of X-VLM, we highlight the importance of both novel losses and rich data sources for learning fine-grained skills. Finally, we inspect training dynamics, and discover that for some tasks, performance peaks early in training or significantly fluctuates, never converging.
