SD, AI, AS
Sep 8, 2025

Controllable Singing Voice Synthesis using Phoneme-Level Energy Sequence

arXiv:2509.07038v1
h-index: 3
Originality: Highly original
AI Analysis

The method enables user-driven control of dynamics in singing voice synthesis, addressing a specific bottleneck in musical expressiveness.

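The paper tackles limited dynamic control in singing voice synthesis by explicitly conditioning the model on energy sequences extracted from ground-truth spectrograms. For phoneme-level inputs, it reports a reduction of over 50% in the mean absolute error of energy sequences compared to baseline and energy-predictor models, without compromising synthesis quality.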

Controllable Singing Voice Synthesis (SVS) aims to generate expressive singing voices reflecting user intent. While recent SVS systems achieve high audio quality, most rely on probabilistic modeling, limiting precise control over attributes such as dynamics. We address this by focusing on dynamic control--temporal loudness variation essential for musical expressiveness--and explicitly condition the SVS model on energy sequences extracted from ground-truth spectrograms, reducing annotation costs and improving controllability. We also propose a phoneme-level energy sequence for user-friendly control. To the best of our knowledge, this is the first attempt enabling user-driven dynamics control in SVS. Experiments show our method achieves over 50% reduction in mean absolute error of energy sequences for phoneme-level inputs compared to baseline and energy-predictor models, without compromising synthesis quality.
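
The abstract does not spell out how the energy sequence is computed or how it is reduced to the phoneme level, so the following is only a minimal sketch under stated assumptions: frame energy is taken as the L2 norm of each spectrogram frame, the phoneme-level value is the mean frame energy within each phoneme, and the error metric is a plain mean absolute error between two frame-aligned energy sequences. The function names (`frame_energy`, `phoneme_level_energy`, `energy_mae`) and the toy spectrogram are hypothetical and are not taken from the paper.

```python
import numpy as np


def frame_energy(spectrogram: np.ndarray) -> np.ndarray:
    """Per-frame energy of a magnitude spectrogram of shape (n_bins, n_frames).

    Assumption: energy is the L2 norm over frequency bins; the paper may
    define it differently.
    """
    return np.linalg.norm(spectrogram, axis=0)


def phoneme_level_energy(energy: np.ndarray, durations) -> tuple:
    """Average frame energy inside each phoneme (assumed aggregation).

    durations: number of frames per phoneme, all positive, summing to
    len(energy). Returns (per-phoneme means, means repeated back to frames).
    """
    durations = np.asarray(durations, dtype=int)
    if np.any(durations <= 0):
        raise ValueError("every phoneme must span at least one frame")
    if durations.sum() != len(energy):
        raise ValueError("durations must sum to the number of frames")
    # Phoneme boundaries in frame indices: [0, d0, d0 + d1, ...]
    bounds = np.concatenate(([0], np.cumsum(durations)))
    means = np.array(
        [energy[start:end].mean() for start, end in zip(bounds[:-1], bounds[1:])]
    )
    # Expand each phoneme's value over its frames for use as frame-level conditioning
    return means, np.repeat(means, durations)


def energy_mae(target: np.ndarray, synthesized: np.ndarray) -> float:
    """Mean absolute error between two frame-aligned energy sequences."""
    return float(np.mean(np.abs(target - synthesized)))


# Toy example: 2 frequency bins, 4 frames, two phonemes of 2 frames each
spec = np.array([[3.0, 0.0, 1.0, 1.0],
                 [4.0, 0.0, 1.0, 1.0]])
energy = frame_energy(spec)
means, expanded = phoneme_level_energy(energy, [2, 2])
mae = energy_mae(energy, expanded)
```

In this sketch a user would edit the per-phoneme values in `means` to shape the dynamics, and the expanded sequence would serve as the conditioning input; how the actual model consumes the sequence is not described in the abstract.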
