Prosody-Guided Harmonic Attention for Phase-Coherent Neural Vocoding in the Complex Spectrum

Mohammed Salah Al-Radhi, Riad Larbi, Mátyás Bartalis, Géza Németh

arXiv:2601.14472v12.2

Originality: Highly original

AI Analysis

This work addresses the challenge of producing natural, pitch-accurate synthetic speech. It represents a strong, targeted improvement to neural vocoding rather than a foundational advance.

The paper addresses limited prosody modeling and inaccurate phase reconstruction in neural vocoders by proposing a vocoder that combines prosody-guided harmonic attention with direct complex spectrum modeling. Compared with HiFi-GAN and AutoVocoder, it reports a 22% reduction in F0 RMSE, an 18% lower voiced/unvoiced error, and a 0.15 improvement in MOS.
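For readers unfamiliar with the two pitch metrics quoted above, the following is a minimal sketch of how F0 RMSE and voiced/unvoiced (V/UV) error are commonly computed from frame-level pitch tracks. The summary does not give the paper's exact definitions (for example Hz versus cents, or which frames are included), so the conventions here are assumptions: an F0 of 0 marks an unvoiced frame, RMSE is measured in Hz over frames voiced in both tracks, and the function names `f0_rmse_hz` and `vuv_error` are hypothetical.

```python
import numpy as np

def f0_rmse_hz(f0_ref, f0_pred):
    """RMSE in Hz over frames that are voiced (F0 > 0) in both tracks.

    Assumed convention, not taken from the paper.
    """
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_pred = np.asarray(f0_pred, dtype=float)
    both_voiced = (f0_ref > 0) & (f0_pred > 0)
    if not both_voiced.any():
        return float("nan")
    diff = f0_ref[both_voiced] - f0_pred[both_voiced]
    return float(np.sqrt(np.mean(diff ** 2)))

def vuv_error(f0_ref, f0_pred):
    """Fraction of frames whose voiced/unvoiced decision disagrees."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_pred = np.asarray(f0_pred, dtype=float)
    return float(np.mean((f0_ref > 0) != (f0_pred > 0)))

# Toy pitch tracks (illustrative values only, not data from the paper).
ref = [0.0, 100.0, 110.0, 120.0, 0.0]
pred = [0.0, 103.0, 106.0, 0.0, 90.0]

rmse = f0_rmse_hz(ref, pred)  # uses frames 1 and 2 only
vuv = vuv_error(ref, pred)    # frames 3 and 4 disagree
print(rmse, vuv)
```

A relative figure such as "22% reduction" would then be `(baseline - proposed) / baseline` applied to metric values of this kind.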

Neural vocoders are central to speech synthesis; despite their success, most still suffer from limited prosody modeling and inaccurate phase reconstruction. We propose a vocoder that introduces prosody-guided harmonic attention to enhance voiced segment encoding and directly predicts complex spectral components for waveform synthesis via inverse STFT. Unlike mel-spectrogram-based approaches, our design jointly models magnitude and phase, ensuring phase coherence and improved pitch fidelity. To further align with perceptual quality, we adopt a multi-objective training strategy that integrates adversarial, spectral, and phase-aware losses. Experiments on benchmark datasets demonstrate consistent gains over HiFi-GAN and AutoVocoder: F0 RMSE reduced by 22 percent, voiced/unvoiced error lowered by 18 percent, and MOS scores improved by 0.15. These results show that prosody-guided attention combined with direct complex spectrum modeling yields more natural, pitch-accurate, and robust synthetic speech, setting a strong foundation for expressive neural vocoding.
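To illustrate why predicting complex spectral components matters, the sketch below resynthesizes a waveform by inverse STFT from its real and imaginary parts and compares that with resynthesis from magnitude alone. It is not the paper's model: the real and imaginary arrays are taken from an analysis STFT as stand-ins for network outputs, and the `stft` and `istft` helpers, the test tone, and the frame settings (`n_fft=256`, `hop=64`) are all assumptions chosen for the example.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Complex STFT with a periodic Hann window, shape (n_frames, n_fft//2 + 1)."""
    window = np.hanning(n_fft + 1)[:-1]
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft=256, hop=64):
    """Inverse STFT by windowed overlap-add with squared-window normalization."""
    window = np.hanning(n_fft + 1)[:-1]
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        start = i * hop
        out[start:start + n_fft] += frames[i] * window
        norm[start:start + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)

# Toy "voiced" signal: a 220 Hz fundamental plus one harmonic (assumed values).
sr = 16000
t = np.arange(4096) / sr
x = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)

spec = stft(x)
real_part, imag_part = spec.real, spec.imag  # stand-ins for predicted outputs

# Resynthesis from the full complex spectrum versus magnitude with zero phase.
y_complex = istft(real_part + 1j * imag_part)
y_mag_only = istft(np.abs(spec).astype(complex))

# Compare away from the edges, where the window sum is too small to normalize.
interior = slice(256, -256)
err_complex = float(np.max(np.abs(y_complex[interior] - x[interior])))
err_mag = float(np.max(np.abs(y_mag_only[interior] - x[interior])))
print(err_complex, err_mag)
```

With consistent real and imaginary parts, the overlap-add recovers the interior of the signal to numerical precision, whereas discarding phase does not. That gap is what mel-spectrogram or magnitude-only pipelines must close with a separate phase estimate.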
