CL · CV · MM · SD · AS · Feb 1, 2025

A Unit-based System and Dataset for Expressive Direct Speech-to-Speech Translation

arXiv:2502.00374v1 · 1 citation · h-index: 9 · INTERSPEECH
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of conveying emotions and attitudes through speech-to-speech translation systems. It is an incremental improvement that focuses on an aspect earlier work has overlooked.

The research tackles the problem of preserving paralinguistic information, such as emotion, in speech-to-speech translation. It introduces a multilingual dataset built from movie audio tracks and integrates several prosody transfer techniques. The resulting model retains more paralinguistic detail from the source speech while maintaining high translation accuracy and naturalness.

Current research in speech-to-speech translation (S2ST) primarily concentrates on translation accuracy and speech naturalness, often overlooking key elements like paralinguistic information, which is essential for conveying emotions and attitudes in communication. To address this, our research introduces a novel, carefully curated multilingual dataset from various movie audio tracks. Each dataset pair is precisely matched for paralinguistic information and duration. We enhance this by integrating multiple prosody transfer techniques, aiming for translations that are accurate, natural-sounding, and rich in paralinguistic details. Our experimental results confirm that our model retains more paralinguistic information from the source speech while maintaining high standards of translation accuracy and naturalness.

