AS LG SD Jan 5, 2023

Singing voice synthesis based on frame-level sequence-to-sequence models considering vocal timing deviation

Miku Nishihara, Yukiya Hono, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda

arXiv:2301.02262v21.21 citations h-index: 54

Originality: Incremental advance

AI Analysis

This work addresses timing synchronization in singing voice synthesis (SVS) for music production. It is an incremental advance: it builds on existing methods and adds a way to handle alignment errors.

The paper tackles singing voice synthesis (SVS) with a frame-level sequence-to-sequence model that accounts for vocal timing deviation. An attention mechanism absorbs alignment errors in the phoneme boundaries, which is intended to improve sound quality, and the authors report experimental results supporting its effectiveness.

This paper proposes singing voice synthesis (SVS) based on frame-level sequence-to-sequence models considering vocal timing deviation. In SVS, it is essential to synchronize the timing of singing with temporal structures represented by scores, taking into account that there are differences between actual vocal timing and note start timing. In many SVS systems including our previous work, phoneme-level score features are converted into frame-level ones on the basis of phoneme boundaries obtained by external aligners to take into account vocal timing deviations. Therefore, the sound quality is affected by the aligner accuracy in this system. To alleviate this problem, we introduce an attention mechanism with frame-level features. In the proposed system, the attention mechanism absorbs alignment errors in phoneme boundaries. Additionally, we evaluate the system with pseudo-phoneme-boundaries defined by heuristic rules based on musical scores when there is no aligner. The experimental results show the effectiveness of the proposed system.
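The conversion of phoneme-level score features into frame-level ones, as described in the abstract, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `expand_to_frames`, the 5 ms frame shift, and the toy features are all assumptions made for illustration. It only shows why boundary errors from an aligner propagate directly into the frame-level input, which is the problem the attention mechanism is meant to absorb.

```python
def expand_to_frames(phoneme_features, boundaries_sec, frame_shift_sec=0.005):
    """Repeat each phoneme-level feature for every frame the phoneme spans.

    phoneme_features: list with one feature vector per phoneme.
    boundaries_sec: list of (start, end) times in seconds, one per phoneme.
    frame_shift_sec: frame hop; 5 ms is an assumed value, not from the paper.
    """
    if len(phoneme_features) != len(boundaries_sec):
        raise ValueError("need exactly one (start, end) pair per phoneme")
    frames = []
    for feature, (start, end) in zip(phoneme_features, boundaries_sec):
        # Convert boundary times to frame indices.
        first = round(start / frame_shift_sec)
        last = round(end / frame_shift_sec)
        # A misplaced boundary changes how many frames carry this feature.
        frames.extend([feature] * max(last - first, 0))
    return frames


# Hypothetical two-phoneme note: "s" spans 20 ms, "a" spans 30 ms.
features = [["s", 60], ["a", 60]]  # [phoneme label, MIDI note number]
boundaries = [(0.00, 0.02), (0.02, 0.05)]
frame_level = expand_to_frames(features, boundaries)
```

In this sketch the boundaries could come either from an external aligner or from score-based heuristic rules (the pseudo-phoneme-boundaries mentioned in the abstract); the expansion step is the same in both cases.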
