CL LG Oct 11, 2016

From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning

arXiv:1610.03342v1

14.331 citations

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of learning language structure and meaning from noisy multimodal data, which is relevant for AI systems that mimic human language acquisition. The advance is incremental, since it builds on existing neural network approaches.

The paper tackles visually-grounded language learning with a stacked gated recurrent neural network that predicts visual features from phoneme sequences. It demonstrates that the model represents linguistic information hierarchically: lower layers are more sensitive to form, and higher layers are more sensitive to meaning.

We present a model of visually-grounded language learning based on stacked gated recurrent neural networks which learns to predict visual features given an image description in the form of a sequence of phonemes. The learning task resembles that faced by human language learners who need to discover both structure and meaning from noisy and ambiguous data across modalities. We show that our model indeed learns to predict features of the visual context given phonetically transcribed image descriptions, and show that it represents linguistic information in a hierarchy of levels: lower layers in the stack are comparatively more sensitive to form, whereas higher layers are more sensitive to meaning.
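The architecture described in the abstract can be illustrated with a minimal forward-pass sketch in NumPy. It is not the authors' implementation: the layer sizes, the random initialization, the use of the final hidden state of the top layer, the linear output projection `w_out`, and the cosine-distance objective are all assumptions chosen for illustration, and training is omitted. The sketch shows a stack of GRU layers reading a sequence of phoneme ids and mapping it to a vector in an image-feature space.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(rng, n_in, n_hid):
    # One weight matrix per gate, applied to [input, previous hidden state].
    shape = (n_in + n_hid, n_hid)
    return {name: rng.normal(0.0, 0.1, shape) for name in ("z", "r", "h")}

def gru_step(params, x, h):
    xh = np.concatenate([x, h])
    z = sigmoid(xh @ params["z"])  # update gate
    r = sigmoid(xh @ params["r"])  # reset gate
    cand = np.tanh(np.concatenate([x, r * h]) @ params["h"])  # candidate state
    return (1.0 - z) * h + z * cand

def encode(phoneme_ids, embed, layers):
    """Run the stack over the sequence; return the final state of every layer."""
    states = [np.zeros(p["z"].shape[1]) for p in layers]
    for pid in phoneme_ids:
        x = embed[pid]
        for i, p in enumerate(layers):
            states[i] = gru_step(p, x, states[i])
            x = states[i]  # each layer feeds the one above it
    return states

def predict_visual(phoneme_ids, embed, layers, w_out):
    # Assumption: the top layer's final state is projected linearly.
    return encode(phoneme_ids, embed, layers)[-1] @ w_out

def cosine_distance(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

# Hypothetical sizes, chosen only to keep the example small.
N_PHONEMES, EMB, HID, IMG, N_LAYERS = 40, 16, 32, 64, 3
rng = np.random.default_rng(0)
embed = rng.normal(0.0, 0.1, (N_PHONEMES, EMB))
layers = [init_gru(rng, EMB if i == 0 else HID, HID) for i in range(N_LAYERS)]
w_out = rng.normal(0.0, 0.1, (HID, IMG))

phonemes = [3, 17, 5, 22, 9]            # stand-in for a transcribed caption
image_features = rng.normal(size=IMG)   # stand-in for real image features
pred = predict_visual(phonemes, embed, layers, w_out)
loss = cosine_distance(pred, image_features)
```

Because `encode` returns one state per layer, the same function could feed a layer-by-layer analysis of the kind the abstract mentions, for example by comparing how well each layer's state predicts form-related versus meaning-related properties of the input.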
