VideoSET: Video Summary Evaluation through Text
This work addresses the computer vision community's need for better evaluation metrics in video summarization, though the contribution is incremental because it adapts existing NLP techniques to a new domain.
The paper tackles the problem of evaluating video summaries by proposing VideoSET, a text-based method that measures how much semantic information a summary retains by comparing it against human-written ground-truth text summaries, and it reports higher agreement with human judgment than pixel-based metrics.
In this paper we present VideoSET, a method for Video Summary Evaluation through Text that can evaluate how well a video summary is able to retain the semantic information contained in its original video. We observe that semantics is most easily expressed in words, and develop a text-based approach for the evaluation. Given a video summary, a text representation of the video summary is first generated, and an NLP-based metric is then used to measure its semantic distance to ground-truth text summaries written by humans. We show that our technique has higher agreement with human judgment than pixel-based distance metrics. We also release text annotations and ground-truth text summaries for a number of publicly available video datasets, for use by the computer vision community.
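The two-step evaluation described above (build a text representation of the summary, then compare it to human-written references) can be illustrated with a minimal sketch. The abstract does not specify which NLP metric is used, so this example substitutes a simple unigram-recall overlap as a stand-in. The function names, the per-segment annotation format, and the choice to take the best score over references are all assumptions made for illustration, not the authors' implementation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_recall(candidate, reference):
    """Fraction of reference words that also appear in the candidate.

    Assumption: a stand-in for the unspecified NLP-based metric.
    Counts are clipped so a repeated candidate word cannot be credited
    more often than it occurs in the reference.
    """
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(n, cand_counts[w]) for w, n in ref_counts.items())
    return overlap / total


def score_summary(segment_annotations, selected, references):
    """Score a video summary given as indices of selected segments.

    segment_annotations: hypothetical list of one text annotation per segment.
    selected: indices of the segments kept in the summary.
    references: human-written ground-truth text summaries.
    """
    # Step 1: text representation of the summary (concatenated annotations).
    candidate_text = " ".join(segment_annotations[i] for i in selected)
    # Step 2: compare against each reference; keep the best match (assumption).
    return max(
        (unigram_recall(candidate_text, ref) for ref in references),
        default=0.0,
    )


annotations = [
    "a man opens the fridge",
    "he pours milk into a glass",
    "he walks to the sofa",
]
references = ["a man pours milk"]
good = score_summary(annotations, [0, 1], references)
poor = score_summary(annotations, [2], references)
print(good, poor)
```

In this toy case the summary that keeps the first two segments covers every word of the reference, while the summary that keeps only the last segment covers none, so the first scores higher. A real evaluation would use richer annotations and a stronger text-similarity metric.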