Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling
This work aims to make language models more accurate and more human-like by incorporating visual supervision, offering a novel approach to grounded language learning.
The paper addresses language models' lack of multimodal supervision by introducing LexiContrastive Grounding (LCG), a method that uses visual grounding to improve textual representations. On word-learning and sentence-understanding benchmarks, LCG outperforms standard language-only models in learning efficiency and improves upon vision-and-language learning procedures, and it improves perplexity by around 5% on multiple language modeling tasks.
Today's most accurate language models are trained on orders of magnitude more language data than human language learners receive - but with no supervision from other sensory modalities that play a crucial role in human learning. Can we make LMs' representations and predictions more accurate (and more human-like) with more ecologically plausible supervision? This paper describes LexiContrastive Grounding (LCG), a grounded language learning procedure that leverages visual supervision to improve textual representations. LexiContrastive Grounding combines a next token prediction strategy with a contrastive visual grounding objective, focusing on early-layer representations that encode lexical information. Across multiple word-learning and sentence-understanding benchmarks, LexiContrastive Grounding not only outperforms standard language-only models in learning efficiency, but also improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization. Moreover, LexiContrastive Grounding improves perplexity by around 5% on multiple language modeling tasks. This work underscores the potential of incorporating visual grounding into language models, aligning more closely with the multimodal nature of human language acquisition.
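To make the combination of objectives described above concrete, the following is a minimal sketch of how a next-token prediction loss and a contrastive visual-grounding loss might be summed. It is not the paper's implementation. The symmetric InfoNCE form of the contrastive term, the `temperature` and `grounding_weight` parameters, and all function names are assumptions introduced for illustration. The `word_embs` argument stands in for early-layer (lexical) representations paired one-to-one with image embeddings.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def next_token_loss(logits, targets):
    """Mean cross-entropy of target token ids under per-position logits."""
    total = sum(log_softmax(row)[t] for row, t in zip(logits, targets))
    return -total / len(targets)


def normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_loss(word_embs, image_embs, temperature=0.07):
    """Symmetric InfoNCE (assumed form): word i should match image i."""
    words = [normalize(v) for v in word_embs]
    images = [normalize(v) for v in image_embs]
    # Cosine similarity between every word and every image, scaled.
    sims = [[sum(a * b for a, b in zip(w, im)) / temperature for im in images]
            for w in words]
    n = len(sims)
    word_to_image = -sum(log_softmax(sims[i])[i] for i in range(n)) / n
    columns = [[sims[i][j] for i in range(n)] for j in range(n)]
    image_to_word = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (word_to_image + image_to_word)


def combined_loss(logits, targets, word_embs, image_embs, grounding_weight=1.0):
    """Language modeling loss plus a weighted grounding loss (assumed weighting)."""
    lm = next_token_loss(logits, targets)
    grounding = contrastive_loss(word_embs, image_embs)
    return lm + grounding_weight * grounding


def perplexity(mean_nll):
    """Perplexity is the exponential of the mean negative log-likelihood (nats)."""
    return math.exp(mean_nll)
```

For scale, if the reported improvement is read as a relative reduction in perplexity, a 5% reduction corresponds to lowering the mean per-token negative log-likelihood by about 0.051 nats, since -ln(0.95) ≈ 0.051.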