SD, AI, AS
Aug 28, 2022

Towards Disentangled Speech Representations

arXiv:2208.13191v1
9 citations
h-index: 96
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of building more effective audio representations for speech tasks. The contribution appears incremental, since it builds on existing disentanglement concepts.

The paper tackles the problem of learning disentangled speech representations by jointly modeling ASR and TTS, showing that enforcing unique statistical properties during training improves WER by 24.5% relative on average.

The careful construction of audio representations has become a dominant feature in the design of approaches to many speech tasks. Increasingly, such approaches have emphasized "disentanglement", where a representation contains only parts of the speech signal relevant to transcription while discarding irrelevant information. In this paper, we construct a representation learning task based on joint modeling of ASR and TTS, and seek to learn a representation of audio that disentangles that part of the speech signal that is relevant to transcription from that part which is not. We present empirical evidence that successfully finding such a representation is tied to the randomness inherent in training. We then make the observation that these desired, disentangled solutions to the optimization problem possess unique statistical properties. Finally, we show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task. These observations motivate a novel approach to learning effective audio representations.
