Learning language through pictures
This addresses the challenge of grounding language in vision for AI systems, but it is incremental as it builds on existing multimodal learning approaches.
The paper tackles the problem of learning visually grounded language representations by proposing Imaginet, a model that uses coupled textual and visual input to predict visual representations and next words, resulting in acquisition of word meanings and effective use of sequential structure for semantic interpretation.
We propose Imaginet, a model of learning visually grounded representations of language from coupled textual and visual input. The model consists of two Gated Recurrent Unit networks with shared word embeddings, and uses a multi-task objective by receiving a textual description of a scene and trying to concurrently predict its visual representation and the next word in the sentence. Mimicking an important aspect of human language learning, it acquires meaning representations for individual words from descriptions of visual scenes. Moreover, it learns to effectively use sequential structure in semantic interpretation of multi-word phrases.