SELF-VS: Self-supervised Encoding Learning For Video Summarization
This work addresses the cost of frame-level annotation in video summarization, a concern for researchers and practitioners alike, though the contribution is incremental because it builds on existing self-supervised learning and knowledge distillation techniques.
The paper tackles dataset scarcity and the resulting overfitting in video summarization with a self-supervised method that uses knowledge distillation to pre-train a transformer encoder. It reports better results than state-of-the-art methods on correlation metrics such as Kendall's τ and Spearman's ρ.
Despite its wide range of applications, video summarization is still held back by the scarcity of extensive datasets, largely due to the labor-intensive and costly nature of frame-level annotations. As a result, existing video summarization methods are prone to overfitting. To mitigate this challenge, we propose a novel self-supervised video representation learning method that uses knowledge distillation to pre-train a transformer encoder. Our method matches the encoder's semantic video representation, which is constructed with respect to frame importance scores, to a representation derived from a CNN trained on video classification. Empirical evaluations on correlation-based metrics, such as Kendall's $τ$ and Spearman's $ρ$, demonstrate that our approach outperforms existing state-of-the-art methods in assigning relative scores to the input frames.
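
As a rough illustration of the correlation-based metrics named above, the following sketch computes Kendall's $τ$ (the tie-corrected tau-b variant) and Spearman's $ρ$ between predicted frame importance scores and reference scores. It is not the paper's evaluation protocol: the function names, the choice of tau-b, and the toy score lists are assumptions made here for illustration only.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b: (concordant - discordant) pairs, corrected for ties."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both sequences: excluded from both tie counts
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    denom = sqrt((pairs + ties_x) * (pairs + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    denom = sqrt(var_x * var_y)
    return cov / denom if denom else float("nan")


# Toy data (hypothetical): predicted vs. reference importance for 5 frames.
predicted = [0.10, 0.40, 0.35, 0.80, 0.60]
reference = [0.20, 0.50, 0.30, 0.90, 0.40]
tau = kendall_tau_b(predicted, reference)
rho = spearman_rho(predicted, reference)
```

Both metrics depend only on the ordering of the scores, not on their magnitudes, which is why they suit a claim about assigning relative scores to frames: a model that ranks frames correctly scores well even if its raw values are on a different scale from the reference.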