AICLCVJun 18, 2024

SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization

arXiv:2406.12233v115 citations
Originality Incremental advance
AI Analysis

This addresses the problem of accurate lip-reading for applications like assistive technology, though it appears incremental by building on prior crossmodal alignment methods.

The paper tackled the challenge of homophenes in Visual Speech Recognition (VSR) by introducing SyncVSR, an end-to-end framework that synchronizes visual and audio data, achieving state-of-the-art results and reducing data usage by up to ninefold.

Visual Speech Recognition (VSR) stands at the intersection of computer vision and speech recognition, aiming to interpret spoken content from visual cues. A prominent challenge in VSR is the presence of homophenes-visually similar lip gestures that represent different phonemes. Prior approaches have sought to distinguish fine-grained visemes by aligning visual and auditory semantics, but often fell short of full synchronization. To address this, we present SyncVSR, an end-to-end learning framework that leverages quantized audio for frame-level crossmodal supervision. By integrating a projection layer that synchronizes visual representation with acoustic data, our encoder learns to generate discrete audio tokens from a video sequence in a non-autoregressive manner. SyncVSR shows versatility across tasks, languages, and modalities at the cost of a forward pass. Our empirical evaluations show that it not only achieves state-of-the-art results but also reduces data usage by up to ninefold.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes