CL LG SD AS

Feb 21, 2019

Towards Visually Grounded Sub-Word Speech Unit Discovery

arXiv:1902.08213v1

36 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of grounding speech in vision. The contribution is incremental: it builds on existing multimodal learning approaches.

The paper tackles the discovery of interpretable sub-word speech units by training a convolutional neural network to associate raw speech waveforms with semantically related images. It finds that diphone boundaries can be extracted from the activation patterns of the model's intermediate layers, which suggests that the model may use these boundary events for word recognition.

In this paper, we investigate the manner in which interpretable sub-word speech units emerge within a convolutional neural network model trained to associate raw speech waveforms with semantically related natural image scenes. We show how diphone boundaries can be superficially extracted from the activation patterns of intermediate layers of the model, suggesting that the model may be leveraging these events for the purpose of word recognition. We present a series of experiments investigating the information encoded by these events.
