CL · Jun 10, 2021

VT-SSum: A Benchmark Dataset for Video Transcript Segmentation and Summarization

arXiv:2106.05606v2 · 13 citations · Has Code
Originality: Synthesis-oriented
AI Analysis

The paper provides a domain-specific dataset that addresses the discrepancy between written and spoken language in summarization tasks. The contribution is incremental, since it applies existing methods to new data.

The authors tackle video transcript summarization by introducing VT-SSum, a benchmark dataset with 125K transcript-summary pairs from 9,616 videos. A model trained on VT-SSum showed a significant improvement on the AMI spoken text summarization benchmark.

Video transcript summarization is a fundamental task for video understanding. Conventional approaches to transcript summarization are usually built on summarization data for written language, such as news articles, and the resulting domain discrepancy may degrade model performance on spoken text. In this paper, we present VT-SSum, a benchmark dataset of spoken language for video transcript segmentation and summarization, which includes 125K transcript-summary pairs from 9,616 videos. VT-SSum takes advantage of the videos from VideoLectures.NET by using the slide content as weak supervision to generate extractive summaries for the video transcripts. Experiments with a state-of-the-art deep learning approach show that a model trained with VT-SSum brings a significant improvement on the AMI spoken text summarization benchmark. VT-SSum is publicly available at https://github.com/Dod-o/VT-SSum to support future research on video transcript segmentation and summarization.
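
As a rough illustration of the weak-supervision idea mentioned in the abstract, the sketch below marks a transcript sentence as part of the extractive summary when enough of its words also appear on the accompanying slide. This is not the authors' labeling procedure: the function names, the word-overlap score, and the 0.5 threshold are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def label_sentences(sentences, slide_text, threshold=0.5):
    """Assign a weak extractive label (1 = keep, 0 = drop) to each sentence.

    Hypothetical heuristic: a sentence is kept when the fraction of its
    distinct words that also occur in the slide text reaches `threshold`.
    The threshold value is an assumption, not a number from the paper.
    """
    slide_tokens = tokenize(slide_text)
    labels = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        # Empty sentences get a score of 0 to avoid dividing by zero.
        overlap = len(tokens & slide_tokens) / len(tokens) if tokens else 0.0
        labels.append(1 if overlap >= threshold else 0)
    return labels


# Toy transcript segment and slide (made-up text, not from the dataset).
transcript = [
    "Gradient descent minimizes the loss function",
    "Okay so let me just get started here",
]
slide = "Gradient descent: minimizes loss function"
print(label_sentences(transcript, slide))
```

In this toy case the first sentence shares most of its words with the slide and is kept, while the filler sentence shares none and is dropped. A real pipeline would likely need a more robust similarity measure, since spoken transcripts paraphrase slides rather than repeat them.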

Code Implementations: 1 repo