SD AI AS
Apr 10, 2024

VoiceShop: A Unified Speech-to-Speech Framework for Identity-Preserving Zero-Shot Voice Editing

ByteDance
arXiv:2404.06674v2
12 citations
h-index: 12
Originality: Incremental advance
AI Analysis

This work addresses the need for flexible, high-quality voice editing tools in applications such as entertainment and accessibility, though its contribution is an incremental improvement on existing methods.

The paper tackles the problem of editing multiple speech attributes like age, gender, accent, and style in a single forward pass while preserving speaker timbre, achieving zero-shot capability for out-of-distribution speakers and avoiding timbre leakage.

We present VoiceShop, a novel speech-to-speech framework that can modify multiple attributes of speech, such as age, gender, accent, and speech style, in a single forward pass while preserving the input speaker's timbre. Previous works have been constrained to specialized models that can only edit these attributes individually and suffer from the following pitfalls: the magnitude of the conversion effect is weak, there is no zero-shot capability for out-of-distribution speakers, or the synthesized outputs exhibit undesirable timbre leakage. Our work proposes solutions for each of these issues in a simple modular framework based on a conditional diffusion backbone model with optional normalizing flow-based and sequence-to-sequence speaker attribute-editing modules, whose components can be combined or removed during inference to meet a wide array of tasks without additional model finetuning. Audio samples are available at https://voiceshopai.github.io.
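
The modular design described in the abstract, where optional attribute-editing modules are combined or removed at inference time around a fixed backbone, can be illustrated with a toy sketch. This is not the authors' implementation and uses no real model: `Utterance`, `edit_attribute`, `run_pipeline`, and `toy_backbone` are hypothetical names, and strings stand in for the content features, speaker timbre, and synthesized audio. It only shows how a variable list of editors can be composed without touching the timbre field or retraining anything.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical stand-in for the conditioning passed to a backbone model.
# Real systems would hold tensors; strings keep the sketch runnable.
@dataclass(frozen=True)
class Utterance:
    content: str                 # placeholder for linguistic content features
    timbre: str                  # placeholder for the speaker's timbre embedding
    attributes: Dict[str, str]   # e.g. {"age": "adult", "accent": "US"}


Editor = Callable[[Utterance], Utterance]


def edit_attribute(name: str, value: str) -> Editor:
    """Build an optional editing module that changes one attribute only."""
    def apply(u: Utterance) -> Utterance:
        # Copy the attributes so the input is never mutated; timbre is untouched.
        return Utterance(u.content, u.timbre, {**u.attributes, name: value})
    return apply


def toy_backbone(u: Utterance) -> str:
    """Placeholder for the synthesis backbone: renders its conditioning as text."""
    attrs = ", ".join(f"{k}={v}" for k, v in sorted(u.attributes.items()))
    return f"speech(content={u.content}, timbre={u.timbre}, {attrs})"


def run_pipeline(u: Utterance, editors: List[Editor],
                 backbone: Callable[[Utterance], str]) -> str:
    """Apply whichever editors were selected, then call the backbone once."""
    for editor in editors:
        u = editor(u)
    return backbone(u)


if __name__ == "__main__":
    source = Utterance("hello", "speaker_A", {"age": "adult", "accent": "US"})
    # Two edits combined in one call to the backbone.
    print(run_pipeline(source,
                       [edit_attribute("age", "elderly"),
                        edit_attribute("accent", "British")],
                       toy_backbone))
    # No editors: the backbone runs on the unmodified input.
    print(run_pipeline(source, [], toy_backbone))
```

Passing a different list of editors changes the task without changing the backbone, which is the property the abstract attributes to the framework; how the actual modules exchange representations is not specified here.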
