CV · Jul 25, 2021

Transcript to Video: Efficient Clip Sequencing from Texts

arXiv:2107.11851v2 · 16 citations
Originality: Incremental advance
AI Analysis

The paper addresses the difficulty inexperienced users face in video editing by automating clip sequencing from text. The advance appears incremental, since it builds on existing visual-language and sequencing methods.

The paper tackles the problem of automatically creating well-edited videos from text inputs for non-experts, achieving content-relevant shot retrieval and plausible video sequences with real-time performance.

Among numerous videos shared on the web, well-edited ones always attract more attention. However, it is difficult for inexperienced users to make well-edited videos because it requires professional expertise and immense manual labor. To meet the demands for non-experts, we present Transcript-to-Video -- a weakly-supervised framework that uses texts as input to automatically create video sequences from an extensive collection of shots. Specifically, we propose a Content Retrieval Module and a Temporal Coherent Module to learn visual-language representations and model shot sequencing styles, respectively. For fast inference, we introduce an efficient search strategy for real-time video clip sequencing. Quantitative results and user studies demonstrate empirically that the proposed learning framework can retrieve content-relevant shots while creating plausible video sequences in terms of style. Besides, the run-time performance analysis shows that our framework can support real-world applications.
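To make the retrieval-plus-sequencing idea concrete, here is a minimal sketch, not the paper's actual method. It assumes each sentence and each shot already has an embedding vector, uses cosine similarity as a stand-in for both the learned content-relevance score and the learned shot-to-shot coherence score, and uses a plain beam search as a stand-in for the efficient search strategy. The names `sequence_shots`, `beam_width`, and `coherence_weight` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def sequence_shots(text_embs, shot_embs, beam_width=3, coherence_weight=0.5):
    """Pick one shot index per sentence using beam search.

    Step score = text-to-shot relevance
                 + coherence_weight * similarity to the previously chosen shot.
    Both terms are illustrative assumptions standing in for learned modules.
    """
    beams = [(0.0, [])]  # (cumulative score, chosen shot indices)
    for text in text_embs:
        candidates = []
        for score, seq in beams:
            for idx, shot in enumerate(shot_embs):
                if idx in seq:  # do not reuse a shot within one sequence
                    continue
                step = cosine(text, shot)
                if seq:
                    step += coherence_weight * cosine(shot_embs[seq[-1]], shot)
                candidates.append((score + step, seq + [idx]))
        # Keep only the highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    # An empty result means there were more sentences than available shots.
    return beams[0][1] if beams else []


if __name__ == "__main__":
    sentences = [[1.0, 0.0], [0.0, 1.0]]          # toy sentence embeddings
    shots = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # toy shot embeddings
    print(sequence_shots(sentences, shots, coherence_weight=0.0))
```

Setting `coherence_weight` to 0 reduces the search to pure content retrieval. Raising it trades relevance for smoother transitions between consecutive shots.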

