FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis
The work addresses a specific issue in speech synthesis for applications that require prosodic control, but it appears incremental because it builds on existing source-filter and Transformer methods.
The paper tackles audio quality degradation and speaker deformation in neural text-to-speech synthesis under large pitch shifts. It proposes FastPitchFormant, a feed-forward Transformer model based on source-filter theory that handles text and acoustic features in parallel, which reduces the model's tendency to learn a dependency between the two features.
Methods for modeling and controlling prosody with acoustic features have been proposed for neural text-to-speech (TTS) models. Prosodic speech can be generated by conditioning on acoustic features. However, synthesized speech with a large pitch-shift scale suffers from audio quality degradation and deformation of speaker characteristics. To address this problem, we propose a feed-forward Transformer-based TTS model that is designed based on the source-filter theory. This model, called FastPitchFormant, has a unique structure that handles text and acoustic features in parallel. By modeling each feature separately, the tendency of the model to learn the relationship between the two features can be mitigated.
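The source-filter idea behind this design can be illustrated with a small classical signal-processing sketch. This is not the FastPitchFormant network, and every name in it (pulse_train, resonator_coeffs, apply_resonator, synthesize) is hypothetical. It assumes a crude impulse-train excitation as the source and a single two-pole resonator as the filter, and shows that changing the pitch touches only the source while the formant filter stays the same.

```python
import math


def pulse_train(f0_hz, sample_rate, n_samples):
    """Source: one unit impulse per pitch period (a crude glottal excitation)."""
    period = round(sample_rate / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator_coeffs(formant_hz, bandwidth_hz, sample_rate):
    """Filter: feedback coefficients of a two-pole resonator for one formant.

    These depend only on the formant settings, never on the pitch.
    """
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius (< 1, stable)
    theta = 2.0 * math.pi * formant_hz / sample_rate     # pole angle
    return 2.0 * r * math.cos(theta), -r * r


def apply_resonator(excitation, a1, a2):
    """Run y[n] = x[n] + a1 * y[n-1] + a2 * y[n-2] over the excitation."""
    y1 = y2 = 0.0
    out = []
    for x in excitation:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize(f0_hz, formant_hz, sample_rate=16000, n_samples=1600,
               bandwidth_hz=100.0):
    """Combine the two parts: the pitch drives the source, the formant drives the filter."""
    a1, a2 = resonator_coeffs(formant_hz, bandwidth_hz, sample_rate)
    excitation = pulse_train(f0_hz, sample_rate, n_samples)
    return apply_resonator(excitation, a1, a2)


# Doubling the pitch doubles the number of pulses in the source...
low_source = pulse_train(100.0, 16000, 1600)
high_source = pulse_train(200.0, 16000, 1600)
print(sum(low_source), sum(high_source))

# ...while the filter response to a single pulse is unchanged: both signals
# agree until the second pulse of the higher-pitched source arrives.
low = synthesize(100.0, 500.0)
high = synthesize(200.0, 500.0)
print(low[:80] == high[:80])
```

In this toy setting the pitch and the formant are controlled by disjoint parameters, which is the kind of separation the abstract aims for when it handles text and acoustic features in parallel rather than entangling them.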