SD | CL | CV | MM | AS | Dec 21, 2024

Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation

Amazon
arXiv:2412.16530v1 | 2 citations | h-index: 18 | ICASSP
Originality: Incremental advance
AI Analysis

This work addresses the problem of unrealistic dubbed videos for content creators and viewers. The advance is incremental because it builds on existing audio-visual speech-to-speech translation (AVS2S) models.

The study tackles poor lip-synchrony in audio-visual speech-to-speech translation by integrating a lip-synchrony loss into model training. It reports a 9.2% reduction in LSE-D (average score 10.67) across four language pairs while maintaining translation quality and naturalness.

Audio-Visual Speech-to-Speech Translation (AVS2S) typically prioritizes improving translation quality and naturalness. However, an equally critical aspect of audio-visual content is lip-synchrony: ensuring that the movements of the lips match the spoken content, which is essential for maintaining realism in dubbed videos. Despite its importance, the inclusion of lip-synchrony constraints in AVS2S models has been largely overlooked. This study addresses this gap by integrating a lip-synchrony loss into the training process of AVS2S models. Our proposed method significantly enhances lip-synchrony in direct audio-visual speech-to-speech translation, achieving an average LSE-D score of 10.67, representing a 9.2% reduction in LSE-D over a strong baseline across four language pairs. Additionally, it maintains the naturalness and high quality of the translated speech when overlaid onto the original video, without any degradation in translation quality.
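
The abstract gives the final average LSE-D and the relative reduction, but not the baseline average. As a rough sanity check, the sketch below backs out the baseline those two numbers imply. It assumes the 9.2% is measured relative to the baseline's average LSE-D and that both reported figures are rounded, so the result is an estimate, not a value taken from the paper. The function names are illustrative only.

```python
def implied_baseline(score: float, reduction: float) -> float:
    """Baseline value that, lowered by the fraction `reduction`, yields `score`."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be a fraction in [0, 1)")
    return score / (1.0 - reduction)


def relative_reduction(baseline: float, score: float) -> float:
    """Fractional drop of `score` relative to `baseline`."""
    if baseline <= 0.0:
        raise ValueError("baseline must be positive")
    return (baseline - score) / baseline


# Figures quoted in the abstract (rounded as reported).
reported_lse_d = 10.67
reported_reduction = 0.092

# Assumption: the 9.2% is relative to the baseline's average LSE-D.
baseline = implied_baseline(reported_lse_d, reported_reduction)
print(f"implied baseline LSE-D: {baseline:.2f}")
print(f"round-trip reduction:   {relative_reduction(baseline, reported_lse_d):.3f}")
```

Under these assumptions the implied baseline average is roughly 11.75. Lower LSE-D means better lip-synchrony, so the reported change is an improvement of about 1.08 points on that scale.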

