AS · CL · SD · Jun 29, 2023

High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units

arXiv:2306.17005v1 · 7 citations · h-index: 48
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of accurate audio-visual alignment in automatic voice-over systems. The advance is incremental, as it builds on existing TTS frameworks.

The paper tackles the problem of generating speech synchronized with a silent video from a text script. It reports improved lip-speech synchronization and high speech quality, outperforming baselines in both objective and subjective evaluations.

The goal of Automatic Voice Over (AVO) is to generate speech in sync with a silent video given its text script. Recent AVO frameworks built upon text-to-speech synthesis (TTS) have shown impressive results. However, the current AVO learning objective of acoustic feature reconstruction brings in indirect supervision for inter-modal alignment learning, thus limiting the synchronization performance and synthetic speech quality. To this end, we propose a novel AVO method leveraging the learning objective of self-supervised discrete speech unit prediction, which not only provides more direct supervision for the alignment learning, but also alleviates the mismatch between the text-video context and acoustic features. Experimental results show that our proposed method achieves remarkable lip-speech synchronization and high speech quality by outperforming baselines in both objective and subjective evaluations. Code and speech samples are publicly available.
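To make the contrast between the two learning objectives concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the loss formulas, so the sketch assumes the discrete speech units are integer IDs from a fixed vocabulary (one per frame) scored with frame-level cross-entropy, and that acoustic feature reconstruction is a mean absolute error over feature frames. The function names `unit_prediction_loss` and `reconstruction_loss` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one frame's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def unit_prediction_loss(frame_logits, unit_ids):
    """Mean cross-entropy between per-frame logits and target unit IDs.

    Assumption: each target is an integer index into a discrete unit
    vocabulary, one target per output frame.
    """
    if not unit_ids or len(frame_logits) != len(unit_ids):
        raise ValueError("need one non-empty target unit ID per frame")
    total = 0.0
    for logits, unit in zip(frame_logits, unit_ids):
        total -= log_softmax(logits)[unit]
    return total / len(unit_ids)


def reconstruction_loss(pred_frames, target_frames):
    """Mean absolute error between predicted and target feature frames.

    Assumption: stands in for the acoustic feature reconstruction
    objective; the actual baseline loss may differ.
    """
    if not target_frames or len(pred_frames) != len(target_frames):
        raise ValueError("need matching, non-empty frame sequences")
    total, count = 0.0, 0
    for pred, target in zip(pred_frames, target_frames):
        for p, t in zip(pred, target):
            total += abs(p - t)
            count += 1
    return total / count
```

With the unit objective, every output frame receives a categorical target, so a frame aligned to the wrong unit is penalized directly. A continuous reconstruction loss can instead be lowered by predicting averaged, blurry features.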

