CV, Dec 20, 2024

A New Method to Capturing Compositional Knowledge in Linguistic Space

arXiv:2412.15632v1
Originality: Highly original
AI Analysis

This work addresses the challenge of compositional understanding in visual language models, offering an approach that is methodologically incremental but delivers strong, specific gains.

The paper tackles the problem of compositional understanding in visual language models by introducing a zero-shot method that avoids the need for hard negative training data, achieving over 8% improvement on the SugarCREPE benchmark and significant gains in image retrieval tasks.

Compositional understanding allows visual language models to interpret complex relationships between objects, attributes, and relations in images and text. However, most existing methods rely on hard negative examples and fine-tuning, which can overestimate improvements and are limited by the difficulty of obtaining hard negatives. In this work, we introduce Zero-Shot Compositional Understanding (ZS-CU), a novel task that enhances compositional understanding without requiring hard negative training data. We propose YUKINO (Yielded Compositional Understanding Knowledge via Textual Inversion with NO), which uses textual inversion to map unlabeled images to pseudo-tokens in a pre-trained CLIP model. To address token interaction during inversion, we introduce a "no" logical regularization. Additionally, we use knowledge distillation to reduce the time complexity of textual inversion. Experimental results show that YUKINO outperforms existing multi-modal SOTA models by over 8% on the SugarCREPE benchmark and also achieves significant improvements in image retrieval tasks.
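The general idea of textual inversion mentioned in the abstract can be illustrated with a toy sketch: a pseudo-token embedding is optimized so that the text-side embedding of a prompt containing it aligns with a given image embedding. This is not YUKINO's implementation. The fixed linear map `W` is a hypothetical stand-in for a frozen text encoder such as CLIP's, `CONTEXT` stands in for the fixed prompt tokens, and all names and values are assumptions made for illustration. The "no" logical regularization and the distillation step are not modeled, since the abstract gives no details about them.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matTvec(W, g):
    """Multiply the transpose of W by vector g."""
    return [sum(W[i][j] * g[i] for i in range(len(W))) for j in range(len(W[0]))]

def norm(a):
    return math.sqrt(sum(x * x for x in a))

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def toy_text_encoder(W, context, pseudo_token):
    """Hypothetical frozen encoder: a linear map of (prompt context + pseudo-token)."""
    return matvec(W, [c + p for c, p in zip(context, pseudo_token)])

def invert_image(image_emb, W, context, steps=500, lr=0.1):
    """Gradient ascent on cosine similarity with respect to the pseudo-token only."""
    pseudo = [0.0] * len(context)
    ny = norm(image_emb)
    for _ in range(steps):
        t = toy_text_encoder(W, context, pseudo)
        nt = norm(t)
        cos = cosine(t, image_emb)
        # Analytic gradient of cos(t, y) with respect to t.
        grad_t = [y / (nt * ny) - cos * ti / (nt * nt) for ti, y in zip(t, image_emb)]
        # Chain rule through the linear encoder; W itself stays frozen.
        grad_p = matTvec(W, grad_t)
        pseudo = [p + lr * g for p, g in zip(pseudo, grad_p)]
    return pseudo

# Illustrative values only (assumed, not from the paper).
W = [[1.0, 0.5, 0.0],
     [0.0, 1.0, 0.5],
     [0.5, 0.0, 1.0]]
CONTEXT = [1.0, 0.0, 0.0]
IMAGE_EMB = [0.0, 1.0, 0.0]

if __name__ == "__main__":
    pseudo = invert_image(IMAGE_EMB, W, CONTEXT)
    print("similarity after inversion:",
          cosine(toy_text_encoder(W, CONTEXT, pseudo), IMAGE_EMB))
```

Only the pseudo-token is updated while the encoder stays fixed, which is the property that lets inversion work on unlabeled images. In a real setting, the per-image optimization loop is the expensive part, which is presumably the cost the abstract's knowledge distillation step is meant to reduce.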

