LG · SD · AS · Jun 14, 2024

Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation

arXiv:2406.10223v1
Originality: Highly original
AI Analysis

This paper addresses efficient, voice-preserving speech-to-speech translation from multiple source languages into English, with incremental improvements in synthesizer design.

The paper tackles multilingual speech-to-speech translation with voice preservation by comparing a Tacotron-based synthesizer to a novel diffusion-based synthesizer. It finds that the diffusion approach improves the MOS and PESQ audio quality metrics by 23% each and speaker similarity by 5%, while keeping BLEU scores comparable and letting the entire model run more than 5× faster than real-time.

We introduce DiffuseST, a low-latency, direct speech-to-speech translation system capable of preserving the input speaker's voice zero-shot while translating from multiple source languages into English. We experiment with the synthesizer component of the architecture, comparing a Tacotron-based synthesizer to a novel diffusion-based synthesizer. We find the diffusion-based synthesizer to improve MOS and PESQ audio quality metrics by 23% each and speaker similarity by 5% while maintaining comparable BLEU scores. Despite having more than double the parameter count, the diffusion synthesizer has lower latency, allowing the entire model to run more than 5× faster than real-time.
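
The two kinds of figures quoted in the abstract (a speed relative to real time and percentage gains over a baseline) can be read as simple ratios. The sketch below is only an illustration of that arithmetic, not the authors' evaluation code. It assumes the percentages are relative improvements over the Tacotron baseline, which the abstract does not state explicitly. The function names and all numbers are hypothetical.

```python
def real_time_speedup(audio_seconds: float, processing_seconds: float) -> float:
    """Return how many times faster than real time the processing ran.

    A value of 5.0 means 5 seconds of audio are handled per second of compute.
    """
    if audio_seconds <= 0 or processing_seconds <= 0:
        raise ValueError("durations must be positive")
    return audio_seconds / processing_seconds


def relative_improvement(baseline: float, new: float) -> float:
    """Return the change of `new` over `baseline` as a fraction of `baseline`.

    Assumption: the abstract's percentages are relative gains of this form.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


# Hypothetical numbers, chosen only to illustrate the arithmetic;
# they are not measurements from the paper.
speedup = real_time_speedup(audio_seconds=10.0, processing_seconds=2.0)
gain = relative_improvement(baseline=4.0, new=5.0)
```

Under this reading, a system counts as "more than 5× faster than real-time" when `real_time_speedup` exceeds 5, and a 23% gain corresponds to `relative_improvement` returning 0.23.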

