SD · Oct 14, 2015

Reducing one-to-many problem in Voice Conversion by equalizing the formant locations using dynamic frequency warping

arXiv:1510.04205v1 · 3 citations
Originality: Incremental advance
AI Analysis

The paper addresses a specific bottleneck in voice conversion for speech synthesis applications, but it appears incremental because it builds on existing dynamic frequency warping techniques.

The study tackles the one-to-many problem in voice conversion, where similar source speech segments map to dissimilar target segments. It equalizes the formant locations of source and target frames with dynamic frequency warping to simplify the mapping, then reverses the warping after conversion. Subjective listening tests reported a significant improvement in speech quality.
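The effect of the one-to-many problem on a learned mapper can be illustrated with a toy example. The sketch below is not taken from the paper: the feature vectors are hypothetical, and an ordinary least-squares linear map stands in for whatever mapping function a real voice conversion system uses. Two identical source frames are paired with two different target frames, each with a single peak in a different bin.

```python
import numpy as np

# Hypothetical toy data: two identical source frames ...
X = np.array([[1.0, 0.5],
              [1.0, 0.5]])
# ... paired with two dissimilar target frames (peak in bin 0 vs. bin 2).
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])

# Least-squares linear mapper (a stand-in for the real mapping function).
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
pred = X @ W

# Both predictions collapse to the average of the two targets:
# two half-height peaks instead of one clear peak.
print(np.round(pred, 3))
```

Because the inputs are indistinguishable, the squared-error optimum is the mean of the conflicting targets, so the predicted frame is a flattened blend of both peaks rather than either target.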

In this study, we investigate a solution to reduce the effect of the one-to-many problem in voice conversion (VC). The one-to-many problem in VC arises when two very similar speech segments from the source speaker have corresponding target-speaker segments that are not similar to each other. As a result, the mapping function usually over-smooths the generated features so that they are similar to both target speech segments. In this study, we propose to equalize the formant locations of source-target frame pairs using dynamic frequency warping in order to reduce the complexity of the mapping. After the conversion, another dynamic frequency warping is applied to reverse the effect of the formant location equalization performed during training. Subjective experiments showed that the proposed approach improves speech quality significantly.
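A minimal sketch of formant-location equalization with dynamic frequency warping follows. It rests on several assumptions that the abstract does not confirm: the warping path is found with plain dynamic time warping applied along the frequency axis, the source envelope is warped toward the target, and the envelopes are synthetic Gaussian bumps standing in for formants. All function names (`dtw_frequency_path`, `warp_functions`, `apply_warp`, `toy_envelope`) are hypothetical.

```python
import numpy as np

def dtw_frequency_path(src_env, tgt_env):
    """Align two equal-length spectral envelopes bin by bin with plain DTW."""
    n = len(src_env)
    cost = (src_env[:, None] - tgt_env[None, :]) ** 2  # cost[i, j]: src bin i vs tgt bin j
    acc = np.full((n, n), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(n):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            prev = min(
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            acc[i, j] = cost[i, j] + prev
    # Backtrack from the last bin pair to the first.
    i, j = n - 1, n - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        steps = []
        if i > 0 and j > 0:
            steps.append((acc[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            steps.append((acc[i - 1, j], i - 1, j))
        if j > 0:
            steps.append((acc[i, j - 1], i, j - 1))
        _, i, j = min(steps)
        path.append((i, j))
    return path[::-1]

def warp_functions(path, n_bins):
    """Turn a DTW path into forward (tgt bin -> src bin) and inverse warps."""
    src_idx = np.array([p[0] for p in path], dtype=float)
    tgt_idx = np.array([p[1] for p in path], dtype=float)
    fwd = np.array([src_idx[tgt_idx == k].mean() for k in range(n_bins)])
    inv = np.array([tgt_idx[src_idx == k].mean() for k in range(n_bins)])
    return fwd, inv

def apply_warp(env, warp):
    """Resample an envelope at fractional bin positions given by warp."""
    return np.interp(warp, np.arange(len(env)), env)

def toy_envelope(peaks, n_bins=100, width=4.0):
    """Synthetic envelope: a sum of Gaussian bumps at (center, height) pairs."""
    bins = np.arange(n_bins)
    return sum(h * np.exp(-((bins - c) ** 2) / (2 * width ** 2)) for c, h in peaks)

src = toy_envelope([(20, 1.0), (60, 0.7)])  # source "formants" at bins 20 and 60
tgt = toy_envelope([(28, 1.0), (70, 0.7)])  # target "formants" at bins 28 and 70

path = dtw_frequency_path(src, tgt)
fwd, inv = warp_functions(path, len(src))
src_equalized = apply_warp(src, fwd)       # peaks moved toward the target locations
restored = apply_warp(src_equalized, inv)  # second warp undoes the equalization

print(np.argmax(src_equalized[:50]), 50 + np.argmax(src_equalized[50:]))
```

In this toy setting the forward warp moves the source peaks onto the target peak locations, and the inverse warp moves them back. A real system would estimate the warp per frame pair from actual spectral envelopes, and it would need a way to choose the reverse warp at conversion time, when no target frame is available; the abstract does not say how this is done.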

