AS · AI · SD · Sep 4, 2024

Fast, High-Quality and Parameter-Efficient Articulatory Synthesis using Differentiable DSP

arXiv:2409.02451v1 · 6 citations · h-index: 25
Originality: Incremental advance
AI Analysis

This work addresses the need for fast, high-quality, and parameter-efficient speech synthesis. It is an incremental advance, building on existing DDSP and articulatory synthesis methods.

The paper tackles speech synthesis from articulatory trajectories such as EMA by pairing them with differentiable digital signal processing (DDSP). The resulting model achieves a transcription word error rate of 6.67% and a mean opinion score of 3.74, improvements of 1.63% and 0.16 over the state-of-the-art baseline, while running 4.9x faster on CPU and using 0.4M parameters instead of 9M.

Articulatory trajectories like electromagnetic articulography (EMA) provide a low-dimensional representation of the vocal tract filter and have been used as natural, grounded features for speech synthesis. Differentiable digital signal processing (DDSP) is a parameter-efficient framework for audio synthesis. Therefore, integrating low-dimensional EMA features with DDSP can significantly enhance the computational efficiency of speech synthesis. In this paper, we propose a fast, high-quality, and parameter-efficient DDSP articulatory vocoder that can synthesize speech from EMA, F0, and loudness. We incorporate several techniques to solve the harmonics / noise imbalance problem, and add a multi-resolution adversarial loss for better synthesis quality. Our model achieves a transcription word error rate (WER) of 6.67% and a mean opinion score (MOS) of 3.74, with an improvement of 1.63% and 0.16 compared to the state-of-the-art (SOTA) baseline. Our DDSP vocoder is 4.9x faster than the baseline on CPU during inference, and can generate speech of comparable quality with only 0.4M parameters, in contrast to the 9M parameters required by the SOTA.
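To make the DDSP idea in the abstract concrete, here is a minimal NumPy sketch of a harmonic-plus-noise synthesizer of the kind DDSP vocoders are commonly built around. It is not the paper's implementation. In the paper a neural network predicts the synthesizer controls from EMA, F0, and loudness; here that network is omitted and the controls are passed in directly. The function name `harmonic_plus_noise`, its parameters, the 16 kHz sample rate, the hop size, and the use of unfiltered white noise for the noise branch are all assumptions made for illustration.

```python
import numpy as np


def harmonic_plus_noise(f0, harm_amps, noise_gain, sr=16000, hop=80, seed=0):
    """Illustrative harmonic-plus-noise synthesis from frame-level controls.

    f0:         (T,) fundamental frequency in Hz, 0 for unvoiced frames
    harm_amps:  (T, K) amplitude of each of K harmonics per frame
    noise_gain: (T,) gain of the noise branch per frame
    All names and defaults are hypothetical, not taken from the paper.
    """
    f0 = np.asarray(f0, dtype=float)
    harm_amps = np.asarray(harm_amps, dtype=float)
    noise_gain = np.asarray(noise_gain, dtype=float)
    n_frames, n_harm = harm_amps.shape
    n_samples = n_frames * hop

    # Upsample frame-rate controls to sample rate by linear interpolation.
    t_frames = np.arange(n_frames) * hop
    t_samples = np.arange(n_samples)
    f0_up = np.interp(t_samples, t_frames, f0)
    gain_up = np.interp(t_samples, t_frames, noise_gain)
    amps_up = np.stack(
        [np.interp(t_samples, t_frames, harm_amps[:, k]) for k in range(n_harm)],
        axis=1,
    )

    # Integrate instantaneous frequency to get the fundamental's phase.
    phase = 2.0 * np.pi * np.cumsum(f0_up / sr)
    k = np.arange(1, n_harm + 1)

    # Silence unvoiced samples and any harmonic at or above Nyquist.
    freqs = f0_up[:, None] * k[None, :]
    amps_up = np.where((freqs > 0) & (freqs < sr / 2), amps_up, 0.0)

    harmonic = np.sum(amps_up * np.sin(phase[:, None] * k[None, :]), axis=1)

    # Simplification: plain white noise scaled by a gain, with no learned filter.
    noise = gain_up * np.random.default_rng(seed).uniform(-1.0, 1.0, n_samples)
    return harmonic + noise
```

Because every step is an ordinary array operation, the same structure written in an autodiff framework is differentiable with respect to the controls. That is what lets a small network be trained end to end through the synthesizer, and it is one reason DDSP models can stay compact.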

