CV CL Mar 9, 2023

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition

Xize Cheng, Linjun Li, Tao Jin, Rongjie Huang, Wang Lin, Zehan Wang, Huangdai Liu, Ye Wang, Aoxiong Yin, Zhou Zhao

arXiv:2303.05309v112.630 citations h-index: 44 Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the lack of cross-lingual visual speech research. The advance is incremental: it builds on existing audio speech translation methods to handle noisy environments and improve lip reading.

The paper tackles visual speech translation and recognition by introducing a new dataset (AVMuST-TED) and a cross-modality self-learning framework (MixSpeech) that uses audio speech to regularize training. It reports BLEU score improvements of +1.4 to +4.2 for translation across four languages and state-of-the-art lip reading results on CMLR (11.1%), LRS2 (25.5%), and LRS3 (28.0%).

Multi-media communications facilitate global interaction among people. However, despite researchers exploring cross-lingual translation techniques such as machine translation and audio speech translation to overcome language barriers, there is still a shortage of cross-lingual studies on visual speech. This lack of research is mainly due to the absence of datasets containing visual speech and translated text pairs. In this paper, we present AVMuST-TED, the first dataset for Audio-Visual Multilingual Speech Translation, derived from TED talks. Nonetheless, visual speech is not as distinguishable as audio speech, making it difficult to develop a mapping from source speech phonemes to the target language text. To address this issue, we propose MixSpeech, a cross-modality self-learning framework that utilizes audio speech to regularize the training of visual speech tasks. To further minimize the cross-modality gap and its impact on knowledge transfer, we suggest adopting mixed speech, which is created by interpolating audio and visual streams, along with a curriculum learning strategy to adjust the mixing ratio as needed. MixSpeech enhances speech translation in noisy environments, improving BLEU scores for four languages on AVMuST-TED by +1.4 to +4.2. Moreover, it achieves state-of-the-art performance in lip reading on CMLR (11.1%), LRS2 (25.5%), and LRS3 (28.0%).
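
The abstract's idea of mixed speech (interpolating audio and visual streams, with a curriculum that adjusts the mixing ratio) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the frame-wise linear interpolation, the linear schedule, and the choice to start audio-heavy and end visual-heavy are all assumptions made for illustration; the sketch also assumes both streams are already aligned to the same number of frames and the same feature dimension.

```python
def mixing_ratio(step, total_steps, start=1.0, end=0.0):
    """Audio weight at a given training step.

    Hypothetical curriculum: anneal linearly from `start` to `end`.
    The paper's actual schedule is not given in the abstract.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def mix_streams(audio_feats, visual_feats, lam):
    """Interpolate two aligned feature sequences frame by frame.

    Each sequence is a list of frames; each frame is a list of floats.
    Returns lam * audio + (1 - lam) * visual for every feature.
    """
    if len(audio_feats) != len(visual_feats):
        raise ValueError("streams must have the same number of frames")
    mixed = []
    for a_frame, v_frame in zip(audio_feats, visual_feats):
        if len(a_frame) != len(v_frame):
            raise ValueError("frames must have the same feature dimension")
        mixed.append([lam * a + (1.0 - lam) * v
                      for a, v in zip(a_frame, v_frame)])
    return mixed


if __name__ == "__main__":
    # Toy features: 2 frames, 2 dimensions each (made-up values).
    audio = [[1.0, 2.0], [3.0, 4.0]]
    visual = [[3.0, 4.0], [5.0, 6.0]]
    for step in (0, 5, 10):
        lam = mixing_ratio(step, total_steps=10)
        print(step, lam, mix_streams(audio, visual, lam))
```

Under these assumptions, early training steps feed the model inputs dominated by the more distinguishable audio stream, and later steps shift toward purely visual input.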
