CV | Mar 5, 2025

Enhancing Visual Forced Alignment with Local Context-Aware Feature Extraction and Multi-Task Learning

arXiv:2503.03286v1 | h-index: 4 | ICASSP

Originality: Incremental advance

AI Analysis

This work addresses automated subtitle generation for video content creators and platforms, and represents a strong incremental improvement.

This paper tackles the problem of synchronizing utterances with lip movements without audio cues (Visual Forced Alignment), achieving a 6% word-level and 27% phoneme-level accuracy improvement on the LRS2 dataset.

This paper introduces a novel approach to Visual Forced Alignment (VFA), which aims to accurately synchronize utterances with the corresponding lip movements without relying on audio cues. The approach integrates a local context-aware feature extractor and employs multi-task learning to refine both global and local context features, enhancing sensitivity to subtle lip movements for precise word-level and phoneme-level alignment. Incorporating an improved Viterbi algorithm for post-processing, our method significantly reduces misalignments. Experimental results show that our approach outperforms existing methods, achieving a 6% accuracy improvement at the word level and a 27% improvement at the phoneme level on the LRS2 dataset. These improvements offer new potential for automatically subtitling TV shows or content on user-generated platforms such as TikTok and YouTube Shorts.
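
The abstract does not say how its improved Viterbi post-processing differs from the standard algorithm, so the sketch below shows only the textbook Viterbi forced alignment that such methods typically build on. It assumes a model has already produced per-frame log-probabilities over a token vocabulary (words or phonemes) and that the transcript is known. The names `viterbi_align` and `to_segments`, and the toy probabilities, are hypothetical and are not taken from the paper.

```python
import math


def viterbi_align(log_probs, tokens):
    """Standard monotonic forced alignment (not the paper's improved variant).

    log_probs[t][v] is the log-probability of vocabulary item v at frame t.
    tokens is the known transcript as a list of vocabulary indices.
    Returns, for each frame, the position in `tokens` it is aligned to.
    """
    num_frames, num_tokens = len(log_probs), len(tokens)
    if num_tokens == 0 or num_frames < num_tokens:
        raise ValueError("need at least one frame per token")

    neg_inf = -math.inf
    score = [[neg_inf] * num_tokens for _ in range(num_frames)]
    # stay[t][n] is True if the best path reaching (t, n) came from (t-1, n).
    stay = [[True] * num_tokens for _ in range(num_frames)]
    score[0][0] = log_probs[0][tokens[0]]

    for t in range(1, num_frames):
        # Token n cannot be reached before frame n.
        for n in range(min(t + 1, num_tokens)):
            keep = score[t - 1][n]
            move = score[t - 1][n - 1] if n > 0 else neg_inf
            stay[t][n] = keep >= move
            score[t][n] = max(keep, move) + log_probs[t][tokens[n]]

    # Backtrack from the last frame, which must carry the last token.
    path = [0] * num_frames
    n = num_tokens - 1
    for t in range(num_frames - 1, -1, -1):
        path[t] = n
        if t > 0 and not stay[t][n]:
            n -= 1
    return path


def to_segments(path):
    """Collapse a frame-level path into (token_position, first_frame, last_frame)."""
    segments, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            segments.append((path[start], start, t - 1))
            start = t
    return segments


# Toy example: 5 video frames, a 3-item vocabulary, transcript = items 0, 2, 1.
probs = [
    [0.8, 0.1, 0.1],
    [0.7, 0.1, 0.2],
    [0.1, 0.1, 0.8],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
]
log_probs = [[math.log(p) for p in row] for row in probs]
path = viterbi_align(log_probs, [0, 2, 1])
print(path)               # [0, 0, 1, 2, 2]
print(to_segments(path))  # [(0, 0, 1), (1, 2, 2), (2, 3, 4)]
```

Each segment gives the first and last frame assigned to a transcript token; dividing the frame indices by the video frame rate would turn them into timestamps suitable for subtitles.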
