CVAICLMMDec 12, 2024

Causal Graphical Models for Vision-Language Compositional Understanding

arXiv:2412.09353v26 citationsh-index: 34ICLR
AI Analysis

This addresses the issue of poor compositional reasoning in VLMs for tasks requiring deeper language understanding, representing an incremental advancement with a novel method for a known bottleneck.

The paper tackled the problem of Vision-Language Models (VLMs) struggling with compositional understanding by modeling dependency relations using a Causal Graphical Model (CGM) and a partially-ordered decoder, resulting in significant performance improvements over state-of-the-art methods on five compositional benchmarks.

Recent work has empirically shown that Vision-Language Models (VLMs) struggle to fully understand the compositional properties of the human language, usually modeling an image caption as a "bag of words". As a result, they perform poorly on compositional tasks, which require a deeper understanding of the different entities of a sentence (subject, verb, etc.) jointly with their mutual relationships in order to be solved. In this paper, we model the dependency relations among textual and visual tokens using a Causal Graphical Model (CGM), built using a dependency parser, and we train a decoder conditioned by the VLM visual encoder. Differently from standard autoregressive or parallel predictions, our decoder's generative process is partially-ordered following the CGM structure. This structure encourages the decoder to learn only the main causal dependencies in a sentence discarding spurious correlations. Using extensive experiments on five compositional benchmarks, we show that our method significantly outperforms all the state-of-the-art compositional approaches by a large margin, and it also improves over methods trained using much larger datasets.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes