Sentence Guided Temporal Modulation for Dynamic Video Thumbnail Generation
This work addresses the problem of generating relevant video previews for users based on textual queries, though it appears incremental, as it builds on existing thumbnail generation methods with a new modulation approach.
The paper tackles dynamic video thumbnail generation guided by user sentences, proposing a sentence-guided temporal modulation mechanism that achieves semantic correspondence between thumbnails and queries. Experiments on a large-scale dataset demonstrate the framework's effectiveness, with the non-recurrent design enabling greater parallelization compared to existing recurrent methods.
We consider the problem of sentence-specified dynamic video thumbnail generation. Given an input video and a user query sentence, the goal is to generate a video thumbnail that not only provides a preview of the video content, but also semantically corresponds to the sentence. In this paper, we propose a sentence-guided temporal modulation (SGTM) mechanism that utilizes the sentence embedding to modulate the normalized temporal activations of the video thumbnail generation network. Unlike the existing state-of-the-art method that uses recurrent architectures, we propose a non-recurrent framework that is simple and allows much more parallelization. Extensive experiments and analysis on a large-scale dataset demonstrate the effectiveness of our framework.
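
To make the modulation idea concrete, the following is a minimal sketch of one way a sentence embedding could modulate normalized temporal activations. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact formulation, so the per-channel normalization over time, the affine scale-and-shift form, and all names (`temporal_normalize`, `sentence_guided_modulation`, `w_gamma`, `b_gamma`, `w_beta`, `b_beta`) are hypothetical. Clip features are a list of T time steps with C channels each, and the scale and shift are assumed to be linear projections of the sentence embedding.

```python
import math
import random


def temporal_normalize(features, eps=1e-5):
    """Normalize each channel to zero mean and unit variance across time."""
    num_steps = len(features)
    num_channels = len(features[0])
    normalized = [[0.0] * num_channels for _ in range(num_steps)]
    for c in range(num_channels):
        column = [features[t][c] for t in range(num_steps)]
        mean = sum(column) / num_steps
        var = sum((x - mean) ** 2 for x in column) / num_steps
        inv_std = 1.0 / math.sqrt(var + eps)
        for t in range(num_steps):
            normalized[t][c] = (column[t] - mean) * inv_std
    return normalized


def linear(vector, weight, bias):
    """Compute weight @ vector + bias, with weight stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weight, bias)]


def sentence_guided_modulation(features, sentence_embedding, params):
    """Scale and shift normalized activations using the sentence embedding.

    Assumption: gamma and beta are per-channel values predicted from the
    sentence embedding and shared by every time step.
    """
    gamma = linear(sentence_embedding, params["w_gamma"], params["b_gamma"])
    beta = linear(sentence_embedding, params["w_beta"], params["b_beta"])
    normalized = temporal_normalize(features)
    return [[g * x + b for x, g, b in zip(row, gamma, beta)]
            for row in normalized]


# Toy usage with random (untrained) values: 6 time steps, 4 channels,
# and a 3-dimensional sentence embedding.
rng = random.Random(0)
T, C, D = 6, 4, 3
video_features = [[rng.gauss(0, 1) for _ in range(C)] for _ in range(T)]
sentence_embedding = [rng.gauss(0, 1) for _ in range(D)]
params = {
    "w_gamma": [[rng.gauss(0, 0.1) for _ in range(D)] for _ in range(C)],
    "b_gamma": [1.0] * C,  # start near an identity scale
    "w_beta": [[rng.gauss(0, 0.1) for _ in range(D)] for _ in range(C)],
    "b_beta": [0.0] * C,
}
modulated = sentence_guided_modulation(video_features, sentence_embedding, params)
```

In this sketch no time step depends on the output of another, so all steps can be processed at once. That is consistent with the parallelization argument made for the non-recurrent design, although the real network would apply such a layer inside a larger learned architecture.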