CL · May 2, 2020

Visually Grounded Continual Learning of Compositional Phrases

arXiv:2005.00785v51002 citations
AI Analysis

This paper addresses the challenge of human-like language acquisition in AI systems, but its contribution is incremental: it builds on existing research in continual learning and compositional generalization.

The paper tackles continual learning of compositional phrases from streaming visual scenes. It introduces the VisCOLL task and two datasets, and finds that state-of-the-art continual learning methods yield little to no improvement, which highlights how hard it is to generalize to novel compositions without storing examples of every composition.

Humans acquire language continually with much more limited access to data samples at a time, as compared to contemporary NLP systems. To study this human-like language acquisition ability, we present VisCOLL, a visually grounded language learning task, which simulates the continual acquisition of compositional phrases from streaming visual scenes. In the task, models are trained on a paired image-caption stream with a shifting object distribution, while being constantly evaluated by a visually-grounded masked language prediction task on held-out test sets. VisCOLL compounds the challenges of continual learning (i.e., learning from a continuously shifting data distribution) and compositional generalization (i.e., generalizing to novel compositions). To facilitate research on VisCOLL, we construct two datasets, COCO-shift and Flickr-shift, and benchmark them using different continual learning methods. Results reveal that SoTA continual learning approaches provide little to no improvements on VisCOLL, since storing examples of all possible compositions is infeasible. We conduct further ablations and analysis to guide future work.
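The task setup described in the abstract can be illustrated with a toy sketch. This is not the authors' code or data pipeline: the captions, the helper names (`shifting_stream`, `mask_phrase`, `is_novel_composition`), and the choice to order the stream in hard blocks by head noun are all assumptions made for illustration. The sketch uses text only, omits the images, and shows three ideas: a stream whose object distribution shifts over time, masking a phrase for prediction, and checking whether a test phrase is a novel composition of seen words.

```python
import random

MASK = "[MASK]"

# Hypothetical (caption, phrase) pairs; the phrase is the span to be masked.
SAMPLES = [
    ("a red apple on a table", "red apple"),
    ("a green apple in a bowl", "green apple"),
    ("a red car on the street", "red car"),
    ("a black car in a garage", "black car"),
    ("a black dog on the grass", "black dog"),
    ("a small dog on a sofa", "small dog"),
]


def head_noun(phrase):
    """Treat the last word of the phrase as its object (a simplification)."""
    return phrase.split()[-1]


def shifting_stream(samples, seed=0):
    """Yield samples in blocks grouped by head noun, so the object
    distribution changes as the stream progresses (assumed hard shifts)."""
    rng = random.Random(seed)
    groups = {}
    for caption, phrase in samples:
        groups.setdefault(head_noun(phrase), []).append((caption, phrase))
    for noun in sorted(groups):
        block = groups[noun][:]
        rng.shuffle(block)  # order within a block is arbitrary
        yield from block


def mask_phrase(caption, phrase):
    """Replace each word of the phrase with a mask token."""
    n_words = len(phrase.split())
    return caption.replace(phrase, " ".join([MASK] * n_words), 1)


def is_novel_composition(phrase, seen_phrases):
    """True if every word was seen in training but the full phrase was not."""
    seen_words = {w for p in seen_phrases for w in p.split()}
    return phrase not in seen_phrases and all(w in seen_words for w in phrase.split())


stream = list(shifting_stream(SAMPLES))
nouns_in_order = [head_noun(phrase) for _, phrase in stream]
seen = {phrase for _, phrase in stream}
masked = mask_phrase("a red apple on a table", "red apple")
```

In this toy setting, `"red dog"` counts as a novel composition because `red` and `dog` each appear in training but never together, which is the kind of case a fixed-size replay memory cannot cover exhaustively.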

Code Implementations: 2 repos
