SD | AI | Feb 23

StyleStream: Real-Time Zero-Shot Voice Style Conversion

arXiv:2602.20113v1
h-index: 25
Originality: Highly original
AI Analysis

This work addresses the challenge of high-quality, real-time voice conversion for applications like speech synthesis and editing, and it represents a significant advance over prior work.

The paper tackles the problem of real-time zero-shot voice style conversion, achieving state-of-the-art performance with an end-to-end latency of 1 second.

Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited, and real-time voice style conversion has not been addressed. We propose StyleStream, the first streamable zero-shot voice style conversion system that achieves state-of-the-art performance. StyleStream consists of two components: a Destylizer, which removes style attributes while preserving linguistic content, and a Stylizer, a diffusion transformer (DiT) that reintroduces target style conditioned on reference speech. Robust content-style disentanglement is enforced through text supervision and a highly constrained information bottleneck. This design enables a fully non-autoregressive architecture, achieving real-time voice style conversion with an end-to-end latency of 1 second. Samples and real-time demo: https://berkeley-speech-group.github.io/StyleStream/.
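
The two-stage, non-autoregressive flow described in the abstract can be pictured with a toy sketch. This is only an illustration under loud assumptions, not the authors' implementation: `destylize`, `stylize`, and `stream_convert` are hypothetical names, "style" is reduced to a simple mean offset, and chunks are plain lists of floats rather than real audio features. The point it shows is structural: each incoming chunk is stripped of style, restyled from the reference, and emitted without waiting on previously generated output.

```python
from typing import Iterable, Iterator, List

Chunk = List[float]  # toy stand-in for a block of audio samples or features


def destylize(chunk: Chunk) -> Chunk:
    """Hypothetical Destylizer stand-in: strip a 'style' offset (here, the chunk mean)."""
    mean = sum(chunk) / len(chunk) if chunk else 0.0
    return [x - mean for x in chunk]


def stylize(content: Chunk, reference: Chunk) -> Chunk:
    """Hypothetical Stylizer stand-in: reintroduce 'style' from the reference (here, its mean)."""
    ref_mean = sum(reference) / len(reference) if reference else 0.0
    return [x + ref_mean for x in content]


def stream_convert(chunks: Iterable[Chunk], reference: Chunk) -> Iterator[Chunk]:
    """Convert chunk by chunk; each output depends only on its input chunk and the reference."""
    for chunk in chunks:
        yield stylize(destylize(chunk), reference)


if __name__ == "__main__":
    source = [[1.0, 3.0], [10.0, 14.0]]  # two incoming chunks (toy values)
    reference = [5.0, 5.0]               # toy "target style" reference
    for out in stream_convert(source, reference):
        print(out)
```

Because `stream_convert` is a generator, output for a chunk is available as soon as that chunk arrives. In a real system, the chunk size and the model's lookahead would determine the end-to-end latency; neither is specified in the abstract, so they are left out here.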

