CV | CL | Feb 22, 2024

CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models

arXiv:2402.15021v2 | 10 citations | h-index: 70 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a key limitation of vision-language models in applications that require nuanced language understanding, though the advance is incremental because it builds on existing models.

The paper tackles the problem of vision-language models failing to encode compositional language, and introduces a framework that achieves over 10% absolute improvement on compositionality benchmarks while maintaining or improving performance on standard tasks.

Recent years have witnessed a significant increase in the performance of Vision and Language tasks. Foundational Vision-Language Models (VLMs), such as CLIP, have been leveraged in multiple settings and demonstrated remarkable performance across several tasks. Such models excel at object-centric recognition yet learn text representations that seem invariant to word order, failing to compose known concepts in novel ways. However, no evidence exists that any VLM, including large-scale single-stream models such as GPT-4V, identifies compositions successfully. In this paper, we introduce a framework to significantly improve the ability of existing models to encode compositional language, with over 10% absolute improvement on compositionality benchmarks, while maintaining or improving the performance on standard object-recognition and retrieval benchmarks. Our code and pre-trained models are publicly available at https://github.com/netflix/clove.
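The word-order invariance described in the abstract can be illustrated with a toy sketch. The two encoders below are hypothetical stand-ins written only for this illustration: they are not CLIP, not the CLoVe method, and not part of the released code. A bag-of-words encoder gives a caption and its word-swapped counterpart the same representation, while an encoder that weights words by position separates them.

```python
import hashlib
import math

DIM = 16  # toy embedding size (assumption, chosen only for illustration)


def word_vector(word, dim=DIM):
    """Deterministic pseudo-random vector for a word (stand-in for a learned embedding)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return [(b - 127.5) / 127.5 for b in digest[:dim]]


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def bag_of_words_encode(text):
    """Order-invariant encoder: sums word vectors and ignores positions."""
    vecs = [word_vector(w) for w in text.lower().split()]
    return normalize([sum(col) for col in zip(*vecs)])


def position_aware_encode(text):
    """Order-sensitive encoder: scales each word vector by its 1-based position."""
    words = text.lower().split()
    vecs = [[(i + 1) * x for x in word_vector(w)] for i, w in enumerate(words)]
    return normalize([sum(col) for col in zip(*vecs)])


def cosine(a, b):
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


caption = "the horse is eating the grass"
swapped = "the grass is eating the horse"  # same words, different meaning

bow_sim = cosine(bag_of_words_encode(caption), bag_of_words_encode(swapped))
pos_sim = cosine(position_aware_encode(caption), position_aware_encode(swapped))

print(f"bag-of-words similarity:   {bow_sim:.4f}")
print(f"position-aware similarity: {pos_sim:.4f}")
```

The bag-of-words similarity equals 1 up to floating-point rounding, so a contrastive objective built on such a text representation could not tell the two captions apart. The position-aware similarity is lower, which is the kind of separation a compositionality benchmark rewards.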

Code Implementations: 1 repo
