CL, AI, CV · Jun 14, 2023

World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models

Berkeley
arXiv:2306.08685v2 · 228 citations · h-index: 35 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of open-world language learning for AI systems and represents an incremental advance in grounding methods.

The paper tackles the problem of enabling vision-language models to learn new words quickly by connecting language to visual referents. It introduces OctoBERT, which acquires unseen words faster and more robustly through grounding.

The ability to connect language units to their referents in the physical world, referred to as grounding, is crucial to learning and understanding grounded meanings of words. While humans demonstrate fast mapping in new word learning, it remains unclear whether modern vision-language models can truly represent language with their grounded meanings and how grounding may further bootstrap new word learning. To this end, we introduce Grounded Open Vocabulary Acquisition (GOVA) to examine grounding and bootstrapping in open-world language learning. As an initial attempt, we propose object-oriented BERT (OctoBERT), a novel visually-grounded language model by pre-training on image-text pairs highlighting grounding as an objective. Through extensive experiments and analysis, we demonstrate that OctoBERT is a more coherent and fast grounded word learner, and that the grounding ability acquired during pre-training helps the model to learn unseen words more rapidly and robustly. Our code is available at https://github.com/sled-group/world-to-words

Code Implementations: 1 repo
