Enhancing Visual Forced Alignment with Local Context-Aware Feature Extraction and Multi-Task Learning
This work addresses automated subtitle generation for video content creators and platforms, and represents a strong incremental improvement.
This paper tackles the problem of synchronizing utterances with lip movements without audio cues (Visual Forced Alignment), achieving a 6% word-level and 27% phoneme-level accuracy improvement on the LRS2 dataset.
This paper introduces a novel approach to Visual Forced Alignment (VFA), which aims to accurately synchronize utterances with the corresponding lip movements without relying on audio cues. The proposed approach integrates a local context-aware feature extractor and employs multi-task learning to refine both global and local context features, enhancing sensitivity to subtle lip movements for precise word-level and phoneme-level alignment. With an improved Viterbi algorithm for post-processing, our method significantly reduces misalignments. Experimental results show that our approach outperforms existing methods, achieving a 6% accuracy improvement at the word level and a 27% improvement at the phoneme level on the LRS2 dataset. These improvements offer new potential for applications such as automatic subtitling of TV shows and of user-generated content on platforms like TikTok and YouTube Shorts.
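To make the alignment step concrete, the following is a minimal sketch of standard monotonic Viterbi forced alignment, not the improved variant the abstract refers to, whose details are not given here. It assumes a hypothetical input `log_probs` (one row of per-class log-probabilities per video frame, as a visual front-end might produce) and a known target sequence `tokens` of word or phoneme class indices. The function names `viterbi_align` and `to_segments` are illustrative only.

```python
import math


def viterbi_align(log_probs, tokens):
    """Assign every frame to one position in `tokens`, in order.

    log_probs: list of T rows, each a list of per-class log-probabilities
               (assumed output of some frame-level classifier).
    tokens:    list of N class indices that must appear in this order,
               each covering at least one frame.
    Returns a list of length T with the token position chosen per frame.
    """
    T, N = len(log_probs), len(tokens)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per token")

    neg_inf = -math.inf
    score = [[neg_inf] * N for _ in range(T)]  # best path score ending at (t, j)
    back = [[0] * N for _ in range(T)]         # predecessor position for (t, j)
    score[0][0] = log_probs[0][tokens[0]]      # alignment must start on token 0

    for t in range(1, T):
        for j in range(min(t + 1, N)):
            stay = score[t - 1][j]                           # remain on token j
            move = score[t - 1][j - 1] if j > 0 else neg_inf  # advance from j-1
            if move > stay:
                best, back[t][j] = move, j - 1
            else:
                best, back[t][j] = stay, j
            score[t][j] = best + log_probs[t][tokens[j]]

    # Backtrack from the last frame, which must end on the last token.
    path, j = [0] * T, N - 1
    for t in range(T - 1, -1, -1):
        path[t] = j
        j = back[t][j]
    return path


def to_segments(path):
    """Collapse a per-frame path into (token_position, start, end) spans."""
    segments, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            segments.append((path[start], start, t - 1))
            start = t
    return segments


# Toy example: 5 frames, 3 classes, target sequence of classes 0 -> 1 -> 2.
probs = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.1, 0.8],
]
log_probs = [[math.log(p) for p in row] for row in probs]
path = viterbi_align(log_probs, [0, 1, 2])
print(path)               # [0, 0, 1, 2, 2]
print(to_segments(path))  # [(0, 0, 1), (1, 2, 2), (2, 3, 4)]
```

Each span gives the inclusive start and end frame of one word or phoneme. Dividing frame indices by the video frame rate would convert them to timestamps suitable for subtitles.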