Videogenic: Identifying Highlight Moments in Videos with Professional Photographs as a Prior
This work addresses scalable highlight extraction for video editors and offers a practical tool that reduces workload and improves efficiency, though its contribution is incremental because it builds on existing methods such as CLIP.
The paper extracts highlight moments from videos by using professional photographs as a prior. A human evaluation study (N=50) shows that CLIP-based retrieval over a high-quality photograph collection identifies highlights effectively, and a within-subjects expert study (N=12) shows that the resulting tool helps video editors create highlight videos.
This paper investigates the challenge of extracting highlight moments from videos. To perform this task, we need to understand what constitutes a highlight in an arbitrary video domain while also scaling across different domains. Our key insight is that photographs taken by photographers tend to capture the most remarkable or photogenic moments of an activity. Drawing on this insight, we present Videogenic, a technique capable of creating domain-specific highlight videos for a diverse range of domains. In a human evaluation study (N=50), we show that a high-quality photograph collection combined with CLIP-based retrieval (which uses a neural network with semantic knowledge of images) can serve as an excellent prior for finding video highlights. In a within-subjects expert study (N=12), we demonstrate the usefulness of Videogenic in helping video editors create highlight videos with a lighter workload, shorter task completion time, and better usability.
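To make the idea of using a photograph collection as a prior concrete, the following is a minimal sketch and not the Videogenic implementation. It assumes that frame and photograph embeddings have already been computed by an image encoder such as CLIP, and it scores each frame by its mean cosine similarity to the photograph embeddings. The scoring rule, the function names `highlight_scores` and `top_k_frames`, and the toy 2-D vectors are all illustrative assumptions; the abstract does not specify how scores are aggregated.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    ua, ub = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(ua, ub))


def highlight_scores(
    frame_embs: Sequence[Sequence[float]],
    photo_embs: Sequence[Sequence[float]],
) -> List[float]:
    """Score each frame by its mean cosine similarity to the photo prior.

    Assumption: mean similarity is one plausible aggregation; it is not
    taken from the paper.
    """
    if not photo_embs:
        raise ValueError("photo collection is empty")
    return [
        sum(cosine(f, p) for p in photo_embs) / len(photo_embs)
        for f in frame_embs
    ]


def top_k_frames(scores: Sequence[float], k: int) -> List[int]:
    """Return indices of the k highest-scoring frames, in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy 2-D vectors standing in for real image embeddings (hypothetical data).
photo_embs = [[1.0, 0.0], [0.9, 0.1]]
frame_embs = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5], [1.0, 0.0]]

scores = highlight_scores(frame_embs, photo_embs)
print(top_k_frames(scores, k=2))  # frames closest to the photo prior: [1, 3]
```

In practice the embeddings would come from a pretrained image encoder, and neighboring high-scoring frames would be merged into clips; both steps are omitted here to keep the sketch self-contained.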