CV | AI | Sep 26, 2022

Multi-modal Video Chapter Generation

arXiv:2209.12694v1 | 3 citations | h-index: 16 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of efficient video navigation for users by providing a scalable dataset and method, though it is incremental as it builds on existing multi-modal techniques.

The authors tackled the lack of public methods and datasets for video chapter generation by introducing Chapter-Gen, a dataset of 10k user-generated videos with annotated chapters, and designed a baseline method that achieves superior results over existing approaches.

Chapter generation has become a practical technique for online videos. Chapter breakpoints let users quickly find the parts they want and provide summative annotations. However, there is no public method or dataset for this task. To facilitate research in this direction, we introduce a new dataset called Chapter-Gen, which consists of approximately 10k user-generated videos with annotated chapter information. Our data collection procedure is fast, scalable, and does not require any additional manual annotation. On top of this dataset, we design an effective baseline specifically for the video chapter generation task, which captures two aspects of a video: visual dynamics and narration text. It disentangles local and global video features for localization and title generation, respectively. To parse long videos efficiently, a skip sliding window mechanism is designed to localize potential chapters, and a cross-attention multi-modal fusion module is developed to aggregate local features for title generation. Our experiments demonstrate that the proposed framework achieves superior results over existing methods, which illustrates that methods designed for similar tasks cannot be transferred directly, even after fine-tuning. Code and dataset are available at https://github.com/czt117/MVCG.
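The abstract names a skip sliding window mechanism but does not spell out how it works, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that a fixed-length window slides over the video, that a scorer decides whether the window contains a chapter boundary, and that "skip" means jumping ahead once a boundary is found. The names `skip_sliding_window`, `make_toy_scorer`, `score_fn`, `stride`, `skip`, and `threshold` are all hypothetical, and the toy scorer stands in for a learned multi-modal model.

```python
from typing import Callable, List, Sequence


def make_toy_scorer(true_boundaries: Sequence[float]) -> Callable[[float, float], float]:
    """Hypothetical stand-in for a learned model: scores 1.0 when the
    half-open window [start, end) contains a known boundary, else 0.0."""
    def score(start: float, end: float) -> float:
        return 1.0 if any(start <= b < end for b in true_boundaries) else 0.0
    return score


def skip_sliding_window(
    duration: float,
    window: float,
    stride: float,
    skip: float,
    score_fn: Callable[[float, float], float],
    threshold: float = 0.5,
) -> List[float]:
    """Return candidate chapter boundary times (window centers, in seconds).

    Assumption: after a window is accepted, the search jumps `skip` seconds
    past that window instead of advancing by `stride`, so a long video
    needs fewer scorer calls. This is an illustrative reading only.
    """
    if window <= 0 or stride <= 0 or skip < 0:
        raise ValueError("window and stride must be positive, skip non-negative")

    boundaries: List[float] = []
    start = 0.0
    while start + window <= duration:
        end = start + window
        if score_fn(start, end) >= threshold:
            boundaries.append(start + window / 2)  # report the window center
            start = end + skip                     # skip past the detection
        else:
            start += stride                        # ordinary sliding step
    return boundaries
```

For example, calling `skip_sliding_window(120.0, 10.0, 5.0, 10.0, make_toy_scorer([30.0, 95.0]))` scans a two-minute video in 10-second windows. In a real system the scorer would presumably consume visual and narration features for each window, which this sketch leaves out.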

Code Implementations: 1 repo
