SD · CL · AS · Jan 2, 2020

Eigenresiduals for improved Parametric Speech Synthesis

arXiv:2001.00581v1 · 10 citations
AI Analysis

The paper addresses a specific audio-quality issue in speech synthesis systems, but the contribution is incremental because it builds on existing HMM-based synthesizers.

The paper tackles buzziness in statistical parametric speech synthesis by proposing a new excitation model based on a PCA decomposition of pitch-synchronous residuals. It reports improved quality over the traditional excitation while keeping the synthesis engine footprint under about 1Mb.

Statistical parametric speech synthesizers have recently shown their ability to produce natural-sounding and flexible voices. Unfortunately, the delivered quality suffers from a typical buzziness due to the fact that speech is vocoded. This paper proposes a new excitation model in order to reduce this undesirable effect. The model is based on the decomposition of pitch-synchronous residual frames on an orthonormal basis obtained by Principal Component Analysis. This basis contains a limited number of eigenresiduals and is computed on a relatively small speech database. A stream of PCA-based coefficients is added to our HMM-based synthesizer and is used to generate the voiced excitation during synthesis. An improvement compared to the traditional excitation is reported, while the synthesis engine footprint remains under about 1Mb.
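
The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the pitch-synchronous residual frames have already been extracted and brought to a common length, and it uses random low-rank toy data in place of real residuals. The function names (`fit_eigenresiduals`, `encode`, `decode`), the frame length, the number of components, and the mean-centering step are all illustrative assumptions.

```python
import numpy as np


def fit_eigenresiduals(frames, n_components):
    """Fit a PCA basis to residual frames (one frame per row).

    Returns the mean frame and an (n_components, frame_len) matrix whose
    rows are orthonormal principal directions ("eigenresiduals").
    """
    frames = np.asarray(frames, dtype=float)
    mean = frames.mean(axis=0)
    centered = frames - mean
    # Rows of vt are the principal directions, ordered by decreasing
    # singular value, and are orthonormal by construction.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def encode(frame, mean, basis):
    """Project one frame onto the basis, giving its PCA coefficients."""
    return basis @ (np.asarray(frame, dtype=float) - mean)


def decode(coeffs, mean, basis):
    """Rebuild an approximate frame from its PCA coefficients."""
    return mean + basis.T @ coeffs


# Toy stand-in for residual frames (assumption: not real speech data).
# Frames lie close to a 5-dimensional subspace, plus a little noise.
rng = np.random.default_rng(0)
n_frames, frame_len, rank = 200, 64, 5
latent = rng.standard_normal((n_frames, rank))
mixing = rng.standard_normal((rank, frame_len))
frames = latent @ mixing + 0.01 * rng.standard_normal((n_frames, frame_len))

mean, basis = fit_eigenresiduals(frames, n_components=rank)
coeffs = encode(frames[0], mean, basis)
reconstruction = decode(coeffs, mean, basis)
relative_error = (
    np.linalg.norm(frames[0] - reconstruction) / np.linalg.norm(frames[0])
)
```

In this sketch, each frame is summarized by `rank` coefficients rather than `frame_len` samples, which is the kind of compact stream a synthesizer could model. The stored basis is a small `(rank, frame_len)` matrix, which suggests how such a model could keep the footprint low.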

