CL · AI · LG · Mar 21, 2024

Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling

arXiv:2403.14551v1 · 30 citations · h-index: 64 · ACL
Originality: Highly original
AI Analysis

This paper addresses how to make language models more accurate and more human-like by adding visual supervision, and it proposes a new approach to grounded language learning.

The paper tackles language models' lack of multimodal supervision by introducing LexiContrastive Grounding (LCG), a method that uses visual grounding to improve textual representations. On word-learning and sentence-understanding benchmarks, LCG outperforms standard language-only models in learning efficiency and improves upon other vision-and-language learning procedures. It also reduces perplexity by around 5% on multiple language modeling tasks.

Today's most accurate language models are trained on orders of magnitude more language data than human language learners receive - but with no supervision from other sensory modalities that play a crucial role in human learning. Can we make LMs' representations and predictions more accurate (and more human-like) with more ecologically plausible supervision? This paper describes LexiContrastive Grounding (LCG), a grounded language learning procedure that leverages visual supervision to improve textual representations. LexiContrastive Grounding combines a next token prediction strategy with a contrastive visual grounding objective, focusing on early-layer representations that encode lexical information. Across multiple word-learning and sentence-understanding benchmarks, LexiContrastive Grounding not only outperforms standard language-only models in learning efficiency, but also improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization. Moreover, LexiContrastive Grounding improves perplexity by around 5% on multiple language modeling tasks. This work underscores the potential of incorporating visual grounding into language models, aligning more closely with the multimodal nature of human language acquisition.
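The combination of a next-token prediction loss with a contrastive visual-grounding loss, as described above, can be illustrated with a small sketch. This is not the authors' implementation: the one-directional InfoNCE form, the `temperature` value, the `alpha` weighting, and all function names are assumptions chosen for illustration, and plain Python lists stand in for model outputs. The `early_text_embs` argument stands for lexical representations taken from an early layer, paired one-to-one with `image_embs`.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def next_token_loss(logits, target_ids):
    """Mean cross-entropy of the correct next token at each position."""
    losses = [-log_softmax(row)[t] for row, t in zip(logits, target_ids)]
    return sum(losses) / len(losses)


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_loss(text_embs, image_embs, temperature=0.1):
    """InfoNCE, text-to-image direction: text i should match image i.

    Assumption: cosine similarity and a fixed temperature; the paper's
    exact contrastive formulation may differ.
    """
    text = [normalize(v) for v in text_embs]
    image = [normalize(v) for v in image_embs]
    losses = []
    for i, t in enumerate(text):
        sims = [sum(a * b for a, b in zip(t, im)) / temperature for im in image]
        losses.append(-log_softmax(sims)[i])
    return sum(losses) / len(losses)


def combined_loss(logits, target_ids, early_text_embs, image_embs, alpha=1.0):
    """Language-modeling loss plus a weighted grounding loss.

    Assumption: a simple weighted sum with hypothetical weight `alpha`.
    """
    lm = next_token_loss(logits, target_ids)
    grounding = contrastive_loss(early_text_embs, image_embs)
    return lm + alpha * grounding
```

In this sketch, setting `alpha=0.0` recovers a language-only objective, while a positive `alpha` additionally pushes each early-layer text embedding toward its paired image embedding and away from the other images in the batch.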

Code Implementations: 2 repos
