SD AI AS · Jan 11, 2025

UniSpeaker: A Unified Approach for Multimodality-driven Speaker Generation

Zhengyan Sheng, Zhihao Du, Heng Lu, Shiliang Zhang, Zhen-Hua Ling

arXiv:2501.06394v1 · 3 citations · h-index: 7 · EMNLP

Originality: Incremental advance

AI Analysis

This work addresses multimodal speaker generation for personalized speech synthesis. The contribution appears incremental, since it combines existing methods within a unified framework.

The paper tackles the problem of generating synthetic speech that aligns with multimodal voice descriptions by introducing UniSpeaker, a unified approach that outperforms previous modality-specific models on a new benchmark, achieving improvements in voice suitability, diversity, and quality.

Recent advancements in personalized speech generation have brought synthetic speech increasingly close to the realism of target speakers' recordings, yet multimodal speaker generation is still an emerging area. This paper introduces UniSpeaker, a unified approach for multimodality-driven speaker generation. Specifically, we propose a unified voice aggregator based on KV-Former and apply a soft contrastive loss to map diverse voice description modalities into a shared voice space, so that the generated voice aligns more closely with the input descriptions. To evaluate multimodality-driven voice control, we build the first multimodality-based voice control (MVC) benchmark, focusing on voice suitability, voice diversity, and speech quality. UniSpeaker is evaluated across five tasks using the MVC benchmark, and the experimental results demonstrate that UniSpeaker outperforms previous modality-specific models. Speech samples are available at https://UniSpeaker.github.io.
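
The abstract does not give the form of the soft contrastive loss, so the following is only a minimal sketch of one common formulation, not the authors' implementation. It assumes that each batch holds paired description and voice embeddings, and that the one-hot targets of a standard contrastive loss are replaced by a soft distribution computed from voice-to-voice similarity. The function name `soft_contrastive_loss`, both temperature parameters, and the choice of soft targets are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(z):
    """Numerically stable row-wise log-softmax."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def soft_contrastive_loss(desc_emb, voice_emb, temperature=0.1,
                          target_temperature=0.1):
    """Hypothetical soft contrastive loss between two embedding sets.

    desc_emb:  (batch, dim) embeddings of voice descriptions (any modality).
    voice_emb: (batch, dim) embeddings of the paired reference voices.

    Assumption: the soft targets are a softmax over voice-to-voice cosine
    similarity, so similar speakers share probability mass instead of
    being treated as hard negatives.
    """
    d = l2_normalize(desc_emb)
    v = l2_normalize(voice_emb)

    # Cross-modal similarity: row i is description i against every voice.
    logits = d @ v.T / temperature

    # Soft targets: row i is how similar voice i is to every voice in the batch.
    targets = np.exp(log_softmax(v @ v.T / target_temperature))

    # Cross-entropy against the soft targets, in both retrieval directions.
    loss_d2v = -(targets * log_softmax(logits)).sum(axis=-1).mean()
    loss_v2d = -(targets * log_softmax(logits.T)).sum(axis=-1).mean()
    return 0.5 * (loss_d2v + loss_v2d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    voices = rng.normal(size=(8, 16))
    aligned = soft_contrastive_loss(voices.copy(), voices)
    mismatched = soft_contrastive_loss(np.roll(voices, 1, axis=0), voices)
    print(f"aligned: {aligned:.4f}  mismatched: {mismatched:.4f}")
```

With matched temperatures, the loss reaches its minimum when the cross-modal similarity distribution equals the voice-to-voice one, which is the sense in which descriptions and voices end up in a shared space under this sketch.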
