AS · CL · SD · Nov 21, 2022

Embedding a Differentiable Mel-cepstral Synthesis Filter to a Neural Speech Synthesis System

arXiv:2211.11222v1 · 8 citations · h-index: 54
Originality: Incremental advance
AI Analysis

This work addresses the problem of enhancing controllability and quality in neural speech synthesis for applications like text-to-speech systems, though it appears incremental by combining existing components.

The paper tackles the challenge of integrating a classic mel-cepstral synthesis filter into a neural speech synthesis system to achieve end-to-end controllable speech synthesis, resulting in improved speech quality while maintaining controllability over voice characteristics and pitch.

This paper integrates a classic mel-cepstral synthesis filter into a modern neural speech synthesis system towards end-to-end controllable speech synthesis. Since the mel-cepstral synthesis filter is explicitly embedded in neural waveform models in the proposed system, both the voice characteristics and the pitch of synthesized speech can be tightly controlled via a frequency warping parameter and the fundamental frequency, respectively. We implement the mel-cepstral synthesis filter as a differentiable and GPU-friendly module so that the acoustic and waveform models in the proposed system can be optimized simultaneously in an end-to-end manner. Experiments show that the proposed system improves speech quality over a baseline system while maintaining controllability. The core PyTorch modules used in the experiments will be publicly available on GitHub.
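To illustrate how a frequency warping parameter can steer the spectral envelope, here is a minimal sketch in plain Python. It assumes the textbook mel-cepstral model, in which the log-magnitude envelope is a cosine series in a warped frequency given by the phase of a first-order all-pass filter with parameter `alpha`. The abstract does not spell out this formula, and the function names `warp_frequency` and `log_magnitude` are hypothetical; this is not the authors' PyTorch module.

```python
import math


def warp_frequency(omega, alpha):
    """Warped frequency for omega in [0, pi] and |alpha| < 1.

    Assumption: warping is the phase response of the first-order all-pass
    filter (z^-1 - alpha) / (1 - alpha * z^-1), as in classic mel-cepstral
    analysis. alpha = 0 leaves the frequency axis unchanged.
    """
    return omega + 2.0 * math.atan2(
        alpha * math.sin(omega), 1.0 - alpha * math.cos(omega)
    )


def log_magnitude(mcep, omega, alpha):
    """Log-magnitude envelope at omega for mel-cepstral coefficients mcep.

    Evaluates sum_m mcep[m] * cos(m * warped_omega), i.e. the real part of
    the cepstral series on the warped frequency axis.
    """
    warped = warp_frequency(omega, alpha)
    return sum(c * math.cos(m * warped) for m, c in enumerate(mcep))


if __name__ == "__main__":
    mcep = [0.0, 1.0, 0.5]  # illustrative coefficients, not from the paper
    for alpha in (0.0, 0.42):
        envelope = [
            log_magnitude(mcep, k * math.pi / 8, alpha) for k in range(9)
        ]
        print(alpha, [round(v, 3) for v in envelope])
```

Under these assumptions, holding the coefficients fixed and changing `alpha` moves the envelope's peaks and valleys along the frequency axis, which is one way a single scalar could alter perceived voice characteristics. Each operation is a smooth function of `alpha` and the coefficients, so the same computation written with tensor operations would admit gradients.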

Code Implementations: 1 repo