CVAIApr 24, 2024

Raformer: Redundancy-Aware Transformer for Video Wire Inpainting

arXiv:2404.15802v16 citationsh-index: 7IEEE Transactions on Image Processing
Originality Incremental advance
AI Analysis

This addresses the challenge of efficiently removing wires in films or TV series, saving time and labor compared to manual methods, though it is incremental with a novel dataset and method.

The paper tackles the problem of video wire inpainting, which removes wires from videos, by introducing a new dataset (WRV2 with over 4,000 videos) and the Raformer method, achieving state-of-the-art performance on benchmarks.

Video Wire Inpainting (VWI) is a prominent application in video inpainting, aimed at flawlessly removing wires in films or TV series, offering significant time and labor savings compared to manual frame-by-frame removal. However, wire removal poses greater challenges due to the wires being longer and slimmer than objects typically targeted in general video inpainting tasks, and often intersecting with people and background objects irregularly, which adds complexity to the inpainting process. Recognizing the limitations posed by existing video wire datasets, which are characterized by their small size, poor quality, and limited variety of scenes, we introduce a new VWI dataset with a novel mask generation strategy, namely Wire Removal Video Dataset 2 (WRV2) and Pseudo Wire-Shaped (PWS) Masks. WRV2 dataset comprises over 4,000 videos with an average length of 80 frames, designed to facilitate the development and efficacy of inpainting models. Building upon this, our research proposes the Redundancy-Aware Transformer (Raformer) method that addresses the unique challenges of wire removal in video inpainting. Unlike conventional approaches that indiscriminately process all frame patches, Raformer employs a novel strategy to selectively bypass redundant parts, such as static background segments devoid of valuable information for inpainting. At the core of Raformer is the Redundancy-Aware Attention (RAA) module, which isolates and accentuates essential content through a coarse-grained, window-based attention mechanism. This is complemented by a Soft Feature Alignment (SFA) module, which refines these features and achieves end-to-end feature alignment. Extensive experiments on both the traditional video inpainting datasets and our proposed WRV2 dataset demonstrate that Raformer outperforms other state-of-the-art methods.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes