AS, LG, SD | Jan 5, 2023

Singing voice synthesis based on frame-level sequence-to-sequence models considering vocal timing deviation

arXiv:2301.02262v2 | 1 citation | h-index: 54
Originality: Incremental advance
AI Analysis

This work addresses vocal timing synchronization in singing voice synthesis (SVS) for music production. It is an incremental advance: it builds on existing methods and extends them to handle alignment errors.

The paper proposes a frame-level sequence-to-sequence model for singing voice synthesis (SVS) that accounts for vocal timing deviation. An attention mechanism absorbs alignment errors in phoneme boundaries, which reduces the dependence of sound quality on aligner accuracy. The authors report experimental results that show the system's effectiveness.

This paper proposes singing voice synthesis (SVS) based on frame-level sequence-to-sequence models considering vocal timing deviation. In SVS, it is essential to synchronize the timing of singing with temporal structures represented by scores, taking into account that there are differences between actual vocal timing and note start timing. In many SVS systems including our previous work, phoneme-level score features are converted into frame-level ones on the basis of phoneme boundaries obtained by external aligners to take into account vocal timing deviations. Therefore, the sound quality is affected by the aligner accuracy in this system. To alleviate this problem, we introduce an attention mechanism with frame-level features. In the proposed system, the attention mechanism absorbs alignment errors in phoneme boundaries. Additionally, we evaluate the system with pseudo-phoneme-boundaries defined by heuristic rules based on musical scores when there is no aligner. The experimental results show the effectiveness of the proposed system.
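
The conversion from phoneme-level to frame-level score features described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 5 ms frame shift, the function names, and the even-split rule for pseudo-phoneme-boundaries are all assumptions made for illustration, since the abstract does not specify the heuristic rules or the frame rate.

```python
from typing import List, Sequence, Tuple

# Assumed frame shift of 5 ms; the abstract does not state the actual value.
FRAME_SHIFT_SEC = 0.005


def expand_to_frames(
    phoneme_feats: Sequence[str],
    boundaries_sec: Sequence[Tuple[float, float]],
    frame_shift: float = FRAME_SHIFT_SEC,
) -> List[str]:
    """Repeat each phoneme-level feature once per frame inside its boundaries."""
    frames: List[str] = []
    for feat, (start, end) in zip(phoneme_feats, boundaries_sec):
        first = round(start / frame_shift)
        last = round(end / frame_shift)
        frames.extend([feat] * max(last - first, 0))
    return frames


def pseudo_boundaries(
    note_start: float, note_end: float, n_phonemes: int
) -> List[Tuple[float, float]]:
    """Hypothetical heuristic: split one note evenly among its phonemes.

    This stands in for the score-based rules used when no aligner is available.
    """
    step = (note_end - note_start) / n_phonemes
    return [
        (note_start + i * step, note_start + (i + 1) * step)
        for i in range(n_phonemes)
    ]


# Usage: a 20 ms note carrying two phonemes, with no external aligner.
bounds = pseudo_boundaries(0.0, 0.02, 2)
frame_feats = expand_to_frames(["s", "a"], bounds)
```

With aligner-derived boundaries, any boundary error shifts which frames receive which phoneme feature. That sensitivity is the problem the attention mechanism in the proposed system is meant to absorb.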

