CL CV LG · Nov 25, 2019

Learning to Learn Words from Visual Scenes

Dídac Surís, Dave Epstein, Heng Ji, Shih-Fu Chang, Carl Vondrick

arXiv:1911.11237v3 · 1.98 citations

Originality Highly original

AI Analysis

This work addresses language acquisition from visual data. It offers a data-efficient method that learns representations from scratch, without language pre-training, and incrementally improves word-learning efficiency.

The paper tackles the problem of learning word representations from visual scenes. It introduces a meta-learning framework that uses the compositional structure of language to create training episodes. The resulting model acquires novel words more rapidly and generalizes more robustly to unseen compositions, significantly outperforming established baselines.

Language acquisition is the process of learning words from the surrounding scene. We introduce a meta-learning framework that learns how to learn word representations from unconstrained scenes. We leverage the natural compositional structure of language to create training episodes that cause a meta-learner to learn strong policies for language acquisition. Experiments on two datasets show that our approach is able to more rapidly acquire novel words as well as more robustly generalize to unseen compositions, significantly outperforming established baselines. A key advantage of our approach is that it is data efficient, allowing representations to be learned from scratch without language pre-training. Visualizations and analysis suggest visual information helps our approach learn a rich cross-modal representation from minimal examples. Project webpage is available at https://expert.cs.columbia.edu/
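
The abstract does not spell out how training episodes are built, so the following is only a minimal sketch of one plausible episodic setup, not the authors' implementation. It assumes an episode consists of a query caption with one target word masked, plus a few reference captions that use that word in other contexts. The function name `build_episode`, the `[MASK]` token, and the toy captions are all hypothetical, and the visual input that the paper pairs with text is omitted for brevity.

```python
import random
from collections import defaultdict


def build_episode(captions, rng, num_references=2):
    """Build one toy episode (hypothetical format, text only).

    The episode holds a query caption with the target word masked and
    `num_references` other captions that contain the same word.
    """
    # Index caption ids by the distinct words each caption contains.
    index = defaultdict(list)
    for i, caption in enumerate(captions):
        for word in set(caption.split()):
            index[word].append(i)

    # A word can be a target only if it appears in enough captions
    # to supply one query plus the requested number of references.
    candidates = sorted(w for w, ids in index.items() if len(ids) > num_references)
    if not candidates:
        raise ValueError("no word appears in enough captions to form an episode")

    target = rng.choice(candidates)
    query_id, *reference_ids = rng.sample(index[target], num_references + 1)

    # Hide the target word in the query; the learner must infer it
    # from how the word is used in the reference captions.
    query = ["[MASK]" if w == target else w for w in captions[query_id].split()]
    return {
        "target": target,
        "query": query,
        "references": [captions[i] for i in reference_ids],
    }


# Toy data: "cuts" is the only word shared by more than two captions.
captions = [
    "a person cuts an onion",
    "the chef cuts bread",
    "she cuts paper slowly",
    "a dog runs",
]
episode = build_episode(captions, random.Random(0))
```

Sampling many such episodes, each with a different target word and context, is what would let a meta-learner practice acquiring a word from a handful of examples rather than memorizing a fixed vocabulary.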
