AS AI LG SD SP
May 30, 2025

When Humans Growl and Birds Speak: High-Fidelity Voice Conversion from Human to Animal and Designed Sounds

arXiv:2505.24336v1, 3 citations, h-index: 1, INTERSPEECH
Originality: Incremental advance
AI Analysis

This research addresses the problem of high-fidelity voice conversion to non-human sounds for applications in entertainment, education, or accessibility, but it appears incremental as it builds on prior work focused on specific sounds and lower audio quality.

This work tackles the problem of converting human speech into a broader range of non-human sounds, including natural animal vocalizations and designed synthetic sounds, by introducing a preprocessing pipeline and an improved CVAE-based model optimized for high-quality 44.1 kHz audio. The proposed method outperformed baselines in quality, naturalness, and similarity MOS, achieving effective voice conversion across diverse non-human timbres.

Human to non-human voice conversion (H2NH-VC) transforms human speech into animal or designed vocalizations. Unlike prior studies focused on dog sounds and 16 or 22.05 kHz audio transformation, this work addresses a broader range of non-speech sounds, including natural sounds (lion roars, birdsongs) and designed voice (synthetic growls). To accommodate generation of diverse non-speech sounds and 44.1 kHz high-quality audio transformation, we introduce a preprocessing pipeline and an improved CVAE-based H2NH-VC model, both optimized for human and non-human voices. Experimental results showed that the proposed method outperformed baselines in quality, naturalness, and similarity MOS, achieving effective voice conversion across diverse non-human timbres. Demo samples are available at https://nc-ai.github.io/speech/publications/nonhuman-vc/
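The abstract reports results as mean opinion scores (MOS). As a minimal sketch of how such a score is typically computed, assuming the conventional 1 to 5 rating scale and a normal-approximation 95% confidence interval (the paper's exact listening-test protocol is not given here), the function name `mean_opinion_score` and the ratings below are hypothetical and are not data from the paper:

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean, 95% CI half-width) for a list of 1-5 listener ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mean = statistics.fmean(ratings)
    # Normal-approximation interval: 1.96 times the standard error of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical ratings for illustration only; not results from the paper.
example_ratings = [4, 5, 3, 4, 4, 5, 3, 4]
mos, ci = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

In a comparison like the one described, the same computation would be run separately per system and per criterion (quality, naturalness, similarity), and non-overlapping intervals would suggest a meaningful difference.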

