CL, AI, LG · Feb 7, 2017

Representations of language in a model of visually grounded speech signal

arXiv:1702.01991v3 · 134 citations
Originality: Incremental advance
AI Analysis

This work models speech perception by grounding spoken utterances in visual data. It appears incremental because it builds on existing methods for learning joint semantic spaces.

The paper uses a multi-layer recurrent highway network to project spoken utterances and images into a joint semantic space. It shows that the model learns to extract both form-based and meaning-based linguistic knowledge from the speech signal. Semantic encoding becomes richer in higher layers, whereas form-related encoding initially increases and then plateaus or decreases.

We present a visually grounded model of speech perception which projects spoken utterances and images to a joint semantic space. We use a multi-layer recurrent highway network to model the temporal nature of spoken speech, and show that it learns to extract both form and meaning-based linguistic knowledge from the input signal. We carry out an in-depth analysis of the representations used by different components of the trained model and show that encoding of semantic aspects tends to become richer as we go up the hierarchy of layers, whereas encoding of form-related aspects of the language input tends to initially increase and then plateau or decrease.
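
As a rough illustration of what a joint semantic space makes possible, the sketch below ranks candidate images against one spoken utterance once both have been projected into the same vector space. It is not the authors' implementation: the toy vectors stand in for encoder outputs, the function names (`l2_normalize`, `cosine_similarity`, `rank_images`) are hypothetical, and the use of cosine similarity as the matching score is an assumption that the abstract does not state.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))

def rank_images(utterance_vec, image_vecs):
    """Return image indices ordered from most to least similar to the utterance.

    Assumption: both inputs already live in the joint space, i.e. they are
    the outputs of a speech encoder and an image encoder respectively.
    """
    scores = [(cosine_similarity(utterance_vec, v), i)
              for i, v in enumerate(image_vecs)]
    return [i for _, i in sorted(scores, key=lambda s: -s[0])]

# Toy 3-dimensional embeddings standing in for real encoder outputs.
utterance = [0.9, 0.1, 0.0]
images = [
    [0.0, 1.0, 0.0],
    [1.0, 0.2, 0.0],
    [0.0, 0.0, 1.0],
]
ranking = rank_images(utterance, images)
print(ranking)
```

In this toy setup the second image points in nearly the same direction as the utterance vector, so it is ranked first. A trained model would aim to produce embeddings for which matching utterance and image pairs end up close in this sense.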

Code Implementations: 2 repos
