CVAICLLGFeb 20, 2024

CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples

arXiv:2402.13254v443 citationsh-index: 19Has CodeACL
Originality Incremental advance
AI Analysis

This work addresses a critical gap in multimodal AI for applications requiring precise physical and semantic reasoning, though it is incremental as it builds on existing models and methods.

The paper tackled the problem of poor performance in physically grounded visio-linguistic compositional reasoning by multimodal models, achieving significant improvements of +33% for CLIP and +37% for LLaVA on a new benchmark through data augmentation with counterfactual examples.

We propose CounterCurate, a framework to comprehensively improve the visio-linguistic compositional reasoning capability for both contrastive and generative multimodal models. In particular, we identify two critical under-explored problems: the neglect of the physically grounded reasoning (counting and position understanding) and the potential of using highly capable text and image generation models for semantic counterfactual fine-tuning. Our work pioneers an approach that addresses these gaps. We first spotlight the near-chance performance of multimodal models like CLIP and LLaVA in physically grounded compositional reasoning. We then apply simple data augmentation using grounded image generation model GLIGEN to generate fine-tuning data, resulting in significant performance improvements: +33% and +37% for CLIP and LLaVA, respectively, on our newly curated Flickr30k-Positions benchmark. Moreover, we exploit the capabilities of high-performing text generation and image generation models, specifically GPT-4V and DALLE-3, to curate challenging semantic counterfactuals, thereby further enhancing compositional reasoning capabilities on benchmarks such as SugarCrepe, where CounterCurate outperforms GPT-4V. To facilitate future research, we release our code, dataset, benchmark, and checkpoints at https://countercurate.github.io.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes