CV · AI · MM · Oct 17, 2022

Temporal and Contextual Transformer for Multi-Camera Editing of TV Shows

arXiv:2210.08737v1 · 9 citations · h-index: 87
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of multi-camera editing for TV production. The advance is incremental because it builds on existing transformer-based approaches.

The paper tackles the problem of automatically selecting camera views for TV show editing by collecting a new benchmark with 88 hours of raw video across four scenarios and proposing a temporal and contextual transformer method, which outperforms existing methods on this benchmark.

The ability to choose an appropriate camera view among multiple cameras plays a vital role in the delivery of TV shows. However, the lack of high-quality training data makes it hard to identify statistical patterns and apply intelligent processing. To address this, we first collect a novel benchmark for this setting covering four diverse scenarios (concerts, sports games, gala shows, and contests), where each scenario contains 6 synchronized tracks recorded by different cameras. The benchmark contains 88 hours of raw video, from which 14 hours of edited video were produced. Based on this benchmark, we further propose a new approach, the temporal and contextual transformer, which uses cues from historical shots and from other views to make shot transition decisions and to predict which view to use. Extensive experiments show that our method outperforms existing methods on the proposed multi-camera editing benchmark.
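To make the idea of combining historical shots with other views concrete, here is a minimal toy sketch in plain Python. It is not the paper's model: the abstract does not describe the architecture, so the function names (`attend`, `choose_view`), the additive fusion, the `margin` threshold, and the absence of any learned weights are all assumptions made purely for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys):
    """Scaled dot-product attention where the keys also serve as values."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    return [sum(w * k[i] for w, k in zip(weights, keys))
            for i in range(len(query))]

def choose_view(history, candidates, current_view, margin=0.1):
    """Hypothetical decision rule; returns (switch, view_index).

    history:    feature vectors of previously selected shots, oldest first
    candidates: one feature vector per synchronized camera at this time step
    margin:     assumed threshold a rival view must beat before we cut
    """
    # Temporal cue: the most recent shot attends over the shot history.
    temporal = attend(history[-1], history)
    scores = []
    for i, cand in enumerate(candidates):
        # Contextual cue: each candidate attends over the other views.
        others = [c for j, c in enumerate(candidates) if j != i]
        context = attend(cand, others)
        fused = [c + o for c, o in zip(cand, context)]  # assumed fusion
        scores.append(dot(temporal, fused))
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    switch = best != current_view and probs[best] - probs[current_view] > margin
    return switch, (best if switch else current_view)

# Toy usage with 2-dimensional features and three cameras.
history = [[1.0, 0.0], [1.0, 0.0]]
candidates = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
decision = choose_view(history, candidates, current_view=1)
```

A real system would replace the raw dot products with learned projections and train both the transition decision and the view prediction on the edited videos. The sketch only shows how the two sources of evidence could enter a single score.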

