SD, CL, AS | Feb 20, 2019

Audio-Linguistic Embeddings for Spoken Sentences

arXiv:1902.07817v1 | 53 citations
Originality: Incremental advance
AI Analysis

This work addresses spoken language understanding by providing multi-modal sentence embeddings, though it appears incremental as it builds on existing embedding methods.

The paper tackles the problem of creating spoken sentence embeddings that capture both acoustic and linguistic content, showing that these embeddings outperform phoneme and word-level baselines on speech and emotion recognition tasks.

We propose spoken sentence embeddings which capture both acoustic and linguistic content. While existing works operate at the character, phoneme, or word level, our method learns long-term dependencies by modeling speech at the sentence level. Formulated as an audio-linguistic multitask learning problem, our encoder-decoder model simultaneously reconstructs acoustic and natural language features from audio. Our results show that spoken sentence embeddings outperform phoneme and word-level baselines on speech recognition and emotion recognition tasks. Ablation studies show that our embeddings can better model high-level acoustic concepts while retaining linguistic content. Overall, our work illustrates the viability of generic, multi-modal sentence embeddings for spoken language understanding.
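The audio-linguistic multitask setup described in the abstract can be illustrated with a toy sketch: one encoder produces a single sentence-level embedding, and two decoder heads reconstruct acoustic and linguistic targets from it, with the two reconstruction errors combined into one loss. Everything concrete here is an assumption made for illustration and is not taken from the paper: the mean-pooling encoder, the linear heads, the squared-error losses, the weighting parameter `alpha`, and all names and dimensions.

```python
import random


def mean_pool(frames):
    """Collapse a variable-length list of frame vectors into one vector."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def linear(weights, x):
    """Apply a weight matrix (given as a list of rows) to the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def multitask_loss(frames, acoustic_target, linguistic_target, params, alpha=0.5):
    """Return (weighted loss, sentence embedding) for one utterance.

    alpha is a hypothetical weight trading off the two reconstruction tasks.
    """
    # Encoder (assumed): pool over time, then project to the shared embedding.
    embedding = linear(params["encoder"], mean_pool(frames))
    # Both decoder heads read the same sentence-level embedding.
    acoustic_pred = linear(params["acoustic_head"], embedding)
    linguistic_pred = linear(params["linguistic_head"], embedding)
    loss_acoustic = mse(acoustic_pred, acoustic_target)
    loss_linguistic = mse(linguistic_pred, linguistic_target)
    return alpha * loss_acoustic + (1 - alpha) * loss_linguistic, embedding


def init_matrix(rows, cols, rng):
    """Small random weights; rows = output size, cols = input size."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
params = {
    "encoder": init_matrix(4, 3, rng),          # 3-dim frames -> 4-dim embedding
    "acoustic_head": init_matrix(3, 4, rng),    # embedding -> 3-dim acoustic target
    "linguistic_head": init_matrix(5, 4, rng),  # embedding -> 5-dim linguistic target
}
frames = [[rng.random() for _ in range(3)] for _ in range(7)]  # 7 toy audio frames
acoustic_target = mean_pool(frames)  # stand-in acoustic features
linguistic_target = [0.0] * 5        # stand-in text features
loss, embedding = multitask_loss(frames, acoustic_target, linguistic_target, params)
print(len(embedding), loss >= 0.0)
```

The point of the sketch is only the shape of the objective: a fixed-size embedding per utterance, regardless of how many frames the audio has, trained against both modalities at once. A real model would replace the pooling and linear maps with learned sequence encoders and decoders.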

Code Implementations: 1 repo
