What's left can't be right -- The remaining positional incompetence of contrastive vision-language models
This work addresses a key limitation of vision-language models in applications that require spatial reasoning. Its contribution is incremental, however, because it targets one specific aspect of a known problem.
The paper investigates why contrastive vision-language models such as CLIP struggle with spatial understanding, specifically left-right positional relations. It demonstrates that the problem can be addressed by training on synthetic data, which improves performance on the left-right relations of the Visual Genome Relations benchmark.
Contrastive vision-language models like CLIP have been found to lack spatial understanding capabilities. In this paper, we discuss possible causes of this phenomenon by analysing both datasets and the embedding space. Focusing on simple left-right positional relations, we show that this behaviour is entirely predictable, even with large-scale datasets; demonstrate that these relations can be taught using synthetic data; and show that this approach generalises well to natural images, improving performance on left-right relations in Visual Genome Relations.
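To make the left-right setting concrete, the following is a minimal sketch of how a synthetic caption pair and a swapped-order evaluation might look. It is not the paper's pipeline: the object vocabulary, the caption template, and both scoring functions are hypothetical stand-ins, and the order-insensitive scorer is only an illustration of why a model that ignores word order cannot separate a caption from its swapped counterpart.

```python
import random

# Hypothetical object vocabulary; the paper's actual synthetic data is not specified here.
OBJECTS = ["cat", "dog", "cup", "chair"]


def make_example(rng):
    """Return (layout, true caption, swapped caption) for one synthetic scene."""
    left_obj, right_obj = rng.sample(OBJECTS, 2)  # two distinct objects
    layout = {"left": left_obj, "right": right_obj}
    caption = f"a {left_obj} to the left of a {right_obj}"
    swapped = f"a {right_obj} to the left of a {left_obj}"
    return layout, caption, swapped


def pairwise_accuracy(examples, score):
    """Fraction of scenes where the true caption scores strictly above the swapped one.

    Ties count as failures, so an order-blind scorer gets 0.0 rather than 0.5.
    """
    hits = sum(score(layout, cap) > score(layout, swp) for layout, cap, swp in examples)
    return hits / len(examples)


def oracle_score(layout, caption):
    """Stand-in for a model that reads the layout correctly (assumption)."""
    expected = f"a {layout['left']} to the left of a {layout['right']}"
    return 1.0 if caption == expected else 0.0


def bag_of_words_score(layout, caption):
    """Stand-in for an order-insensitive model: counts mentioned objects only."""
    words = set(caption.split())
    return float(sum(obj in words for obj in layout.values()))


rng = random.Random(0)
examples = [make_example(rng) for _ in range(100)]
print(pairwise_accuracy(examples, oracle_score))        # 1.0
print(pairwise_accuracy(examples, bag_of_words_score))  # 0.0
```

In a real evaluation, the scoring function would be the image-text similarity of the model under test, and the layout would be replaced by a rendered or natural image.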