Eigenresiduals for improved Parametric Speech Synthesis
This work addresses a specific audio quality issue in speech synthesis systems, but the contribution is incremental, since it builds on existing HMM-based synthesizers.
The paper tackles the problem of buzziness in statistical parametric speech synthesis by proposing a new excitation model based on a PCA decomposition of pitch-synchronous residual frames, which improves quality while keeping the synthesis engine footprint under about 1Mb.
Statistical parametric speech synthesizers have recently shown their ability to produce natural-sounding and flexible voices. Unfortunately, the delivered quality suffers from a typical buzziness because the speech is vocoded. This paper proposes a new excitation model to reduce this undesirable effect. The model is based on the decomposition of pitch-synchronous residual frames onto an orthonormal basis obtained by Principal Component Analysis. This basis contains a limited number of eigenresiduals and is computed on a relatively small speech database. A stream of PCA-based coefficients is added to our HMM-based synthesizer and makes it possible to generate the voiced excitation during synthesis. An improvement over the traditional excitation is reported, while the synthesis engine footprint remains under about 1Mb.
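The decomposition described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the residual frames have already been extracted pitch-synchronously and normalized to a common length, and it uses synthetic data in place of real residuals. The function names (`compute_eigenresiduals`, `encode`, `decode`), the frame length, and the number of components are all hypothetical choices made for illustration.

```python
import numpy as np

def compute_eigenresiduals(frames, n_components):
    """Return the mean frame, an orthonormal PCA basis, and the explained variance ratio.

    frames: array of shape (n_frames, frame_len), one length-normalized
    residual frame per row (assumed to be prepared beforehand).
    """
    mean = frames.mean(axis=0)
    centered = frames - mean
    # SVD of the centered data: the rows of vt are orthonormal principal directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    explained = (s[:n_components] ** 2).sum() / (s ** 2).sum()
    return mean, basis, explained

def encode(frame, mean, basis):
    """Project one frame onto the basis to get its PCA coefficients."""
    return basis @ (frame - mean)

def decode(coeffs, mean, basis):
    """Rebuild an approximate frame from its PCA coefficients."""
    return mean + coeffs @ basis

# Synthetic stand-in for residual frames (an assumption, not real speech data):
# random mixtures of 5 latent shapes plus a small amount of noise.
rng = np.random.default_rng(0)
frame_len = 64
latent = rng.standard_normal((5, frame_len))
weights = rng.standard_normal((200, 5))
frames = weights @ latent + 0.01 * rng.standard_normal((200, frame_len))

mean, basis, explained = compute_eigenresiduals(frames, n_components=5)
coeffs = encode(frames[0], mean, basis)   # compact description of one frame
approx = decode(coeffs, mean, basis)      # frame rebuilt from its coefficients
```

In a synthesizer of the kind the abstract describes, a coefficient vector like `coeffs` would be what the added stream carries for each voiced frame, and something like `decode` would turn generated coefficients back into an excitation frame. Only the basis and the mean frame need to be stored, which is consistent with the small footprint the abstract reports.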