CV · Mar 23

CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal

Qingdong He, Chaoyi Wang, Peng Tang, Yifan Yang, Xiaobin Hu

arXiv:2603.21901 · 62.41 citations · h-index: 3

AI Analysis

This work addresses a practical limitation of video subtitle removal: by dropping the mask dependency, it makes deployment more efficient. The contribution is incremental, however, since it builds on existing diffusion-based methods.

The paper introduces a mask-free framework for video subtitle removal that eliminates the need for explicit mask sequences during inference. On Chinese subtitle benchmarks it reports a 6.77 dB PSNR improvement and a 74.7% reduction in VFID over mask-dependent baselines, and it generalizes zero-shot to six other languages.
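To put the 6.77 dB figure in perspective, the sketch below uses the standard PSNR definition to convert a dB gain into the corresponding reduction in mean squared error. The function names and the 8-bit peak value of 255 are assumptions made for illustration; the summary does not state how the paper computes PSNR.

```python
import math


def psnr(mse, max_value=255.0):
    """Standard PSNR in dB for a given mean squared error.

    max_value=255.0 assumes 8-bit frames (an assumption, not from the paper).
    """
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a PSNR gain of gain_db decibels."""
    return 10.0 ** (gain_db / 10.0)


# Hypothetical baseline error, chosen only to illustrate the relationship.
baseline_mse = 100.0
ratio = mse_ratio_for_gain(6.77)
improved_mse = baseline_mse / ratio

print(f"MSE shrinks by a factor of about {ratio:.2f}")
print(f"PSNR gain: {psnr(improved_mse) - psnr(baseline_mse):.2f} dB")
```

Because PSNR is logarithmic, the gain does not depend on the baseline error: a 6.77 dB improvement corresponds to roughly a 4.75-fold reduction in MSE whatever the starting point.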

Video subtitle removal aims to distinguish text overlays from background content while preserving temporal coherence. Existing diffusion-based methods necessitate explicit mask sequences during both training and inference phases, which restricts their practical deployment. In this paper, we present CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free framework that achieves truly end-to-end inference through context-aware adaptive learning. Our two-stage design decouples prior extraction from generative refinement: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, while Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Notably, our method only requires 0.77% of the parameters of the base diffusion model for training. On Chinese subtitle benchmarks, CLEAR outperforms mask-dependent baselines by +6.77 dB PSNR and -74.7% VFID, while demonstrating superior zero-shot generalization across six languages (English, Korean, French, Japanese, Russian, German). This performance is enabled by our generation-driven feedback mechanism, which ensures robust subtitle removal without ground-truth masks during inference.
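The abstract names two ingredients without giving formulas: an orthogonality constraint between the features of two encoders, and LoRA adaptation that trains only a small fraction of the parameters. The sketch below shows one common form of each. It is not the paper's implementation: the squared-cosine penalty, the function names, and the layer dimensions are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def orthogonality_penalty(feats_a, feats_b):
    """Mean squared cosine similarity between paired feature vectors.

    The value is 0 when every pair is orthogonal and 1 when every pair is
    parallel. This is one generic way to encourage two encoders to carry
    disentangled information (assumed form, not taken from the paper).
    """
    sims = [cosine(u, v) ** 2 for u, v in zip(feats_a, feats_b)]
    return sum(sims) / len(sims)


def lora_trainable_fraction(d_in, d_out, rank):
    """Trainable LoRA parameters relative to one frozen d_out x d_in weight.

    LoRA adds two factors of shapes (d_out, rank) and (rank, d_in), so it
    trains rank * (d_in + d_out) parameters for that layer.
    """
    return rank * (d_in + d_out) / (d_in * d_out)


# Hypothetical features from the two encoders for two samples.
subtitle_feats = [[1.0, 0.0], [0.0, 2.0]]
content_feats = [[0.0, 3.0], [1.0, 0.0]]
print(orthogonality_penalty(subtitle_feats, content_feats))

# Hypothetical layer size and rank, not values reported in the paper.
print(f"{lora_trainable_fraction(1024, 1024, 8):.4%}")
```

With low ranks the per-layer trainable fraction is on the order of a percent. The 0.77% reported in the abstract is measured against the whole base diffusion model, so it also depends on which layers receive adapters, and the abstract does not specify that.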
