SD | AI | AS | Apr 13, 2024

Voice Attribute Editing with Text Prompt

arXiv:2404.08857v2 | 14 citations | h-index: 13 | Has Code | IEEE Transactions on Audio, Speech, and Language Processing
Originality: Incremental advance
AI Analysis

This work addresses precise control of voice attributes in speech synthesis, with applications in audio editing and generation. It defines a new task and makes incremental improvements to alignment techniques.

The paper tackles the challenge of controlling voice attributes in synthesized speech by introducing voice attribute editing with text prompts. Its proposed VoxEditor model achieves effective relative modifications, and the work also establishes the VCTK-RVA dataset.

Despite recent advances in speech generation, where text prompts provide control over speech style, voice attributes in synthesized speech remain elusive and challenging to control. This paper introduces a novel task: voice attribute editing with text prompts, with the goal of making relative modifications to voice attributes according to the actions described in the text prompt. To solve this task, VoxEditor, an end-to-end generative model, is proposed. In VoxEditor, to address the insufficiency of the text prompt, a Residual Memory (ResMem) block is designed that efficiently maps voice attributes and their descriptors into a shared feature space. Additionally, the ResMem block is enhanced with a voice attribute degree prediction (VADP) block to align voice attributes with their corresponding descriptors, addressing the imprecision of text prompts caused by non-quantitative descriptions of voice attributes. We also establish the open-source VCTK-RVA dataset, which leads the way in manual annotations detailing voice characteristic differences among different speakers. Extensive experiments demonstrate the effectiveness and generalizability of the proposed method in terms of both objective and subjective metrics. The dataset and audio samples are available on the website.

