CL, AI, SD, AS | Oct 12, 2021

Speech Summarization using Restricted Self-Attention

arXiv:2110.06263v2 | 39 citations
AI Analysis

This work addresses memory and compute constraints in speech summarization for instructional videos, representing an incremental improvement over existing methods.

The authors address end-to-end speech summarization by applying restricted self-attention to handle long audio sequences. On the How-2 corpus of instructional videos, the resulting model outperforms the previously proposed cascaded model by 3 points absolute on ROUGE for summarization and by 4 points absolute on F-1 for concept prediction.

Speech summarization is typically performed by using a cascade of speech recognition and text summarization models. End-to-end modeling of speech summarization is challenging due to memory and compute constraints arising from long input audio sequences. Recent work in document summarization has inspired methods to reduce the complexity of self-attention, which enables transformer models to handle long sequences. In this work, we introduce a single model optimized end-to-end for speech summarization. We apply the restricted self-attention technique from text-based models to speech models to address the memory and compute constraints. We demonstrate that the proposed model learns to directly summarize speech for the How-2 corpus of instructional videos. The proposed end-to-end model outperforms the previously proposed cascaded model by 3 points absolute on ROUGE. Further, we consider the spoken language understanding task of predicting concepts from speech inputs and show that the proposed end-to-end model outperforms the cascade model by 4 points absolute F-1.
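
To make the idea of restricted self-attention concrete, the following is a minimal sketch, not the paper's implementation. It assumes a symmetric local window (each frame attends only to frames at most `window` positions away), a single head, and no learned projections, so queries, keys, and values are all the raw input `x`. The function names and the `window` parameter are hypothetical and chosen for illustration only. Under these assumptions each frame scores at most 2 * window + 1 neighbours instead of all n frames, which is where the memory and compute savings on long audio sequences come from.

```python
import numpy as np


def restricted_self_attention(x, window):
    """Single-head self-attention where frame i attends only to frames j
    with |i - j| <= window. Illustrative assumption: queries, keys, and
    values are all x (no learned projections)."""
    n, d = x.shape
    out = np.zeros_like(x, dtype=float)
    for i in range(n):
        # Clip the local window at the sequence boundaries.
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # Scaled dot-product scores against the neighbours only.
        scores = x[lo:hi] @ x[i] / np.sqrt(d)
        # Numerically stable softmax over the window.
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ x[lo:hi]
    return out


def full_self_attention(x):
    """Unrestricted baseline: every frame attends to every frame (O(n^2) scores)."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x
```

With `window` at least n - 1 the restricted version reduces to full self-attention, and with `window = 0` each frame attends only to itself and the output equals the input. The number of scored pairs grows as roughly n * (2 * window + 1) rather than n squared.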
