SD, LG, AS | Dec 9, 2020

DeepTalk: Vocal Style Encoding for Speaker Recognition and Speech Synthesis

arXiv:2012.05084v2 | 5 citations
AI Analysis

This work addresses how to capture behavioral speech features in order to improve speaker recognition and make speech synthesis more natural.

This paper introduces DeepTalk, a prosody encoding network that extracts vocal style features from raw audio. It outperforms several state-of-the-art speaker recognition systems and, when combined with physiological features, further improves performance. DeepTalk also enables speech synthesis that is nearly indistinguishable from real speech in speaker recognition contexts.

Automatic speaker recognition algorithms typically characterize speech audio using short-term spectral features that encode the physiological and anatomical aspects of speech production. Such algorithms do not fully capitalize on speaker-dependent characteristics present in behavioral speech features. In this work, we propose a prosody encoding network called DeepTalk for extracting vocal style features directly from raw audio data. The DeepTalk method outperforms several state-of-the-art speaker recognition systems across multiple challenging datasets. The speaker recognition performance is further improved by combining DeepTalk with a state-of-the-art physiological speech feature-based speaker recognition system. We also integrate DeepTalk into a current state-of-the-art speech synthesizer to generate synthetic speech. A detailed analysis of the synthetic speech shows that DeepTalk captures F0 contours essential for vocal style modeling. Furthermore, DeepTalk-based synthetic speech is shown to be almost indistinguishable from real speech in the context of speaker recognition.
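
As a rough illustration of how a vocal style system and a physiological feature-based system might be combined, the sketch below fuses their verification scores with a weighted sum. This is an assumption made for illustration only: the abstract does not say which combination method is used, and the function names, toy embeddings, and fusion weight are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fuse_scores(style_score, physio_score, weight=0.5):
    """Weighted-sum score fusion.

    `weight` is a hypothetical tuning parameter in [0, 1]; it is not a
    value taken from the paper.
    """
    return weight * style_score + (1.0 - weight) * physio_score


# Toy 2-D embeddings (made up) for an enrolled speaker and a probe utterance.
enroll_style, probe_style = [1.0, 0.0], [1.0, 0.0]      # identical direction
enroll_physio, probe_physio = [1.0, 0.0], [0.0, 1.0]    # orthogonal

style_score = cosine_similarity(enroll_style, probe_style)
physio_score = cosine_similarity(enroll_physio, probe_physio)
fused = fuse_scores(style_score, physio_score)
print(style_score, physio_score, fused)
```

In practice the weight would be tuned on held-out data, and the scores from each system are usually normalized before fusion so that neither dominates.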

Code Implementations: 1 repo