AS | CL | SD | Sep 14, 2022

ESSumm: Extractive Speech Summarization from Untranscribed Meeting

arXiv:2209.06913v1 | 9 citations | h-index: 7
Originality: Incremental advance
AI Analysis

The work addresses the need for efficient meeting summarization without costly transcription, but the advance is incremental because it builds on existing unsupervised and self-supervised techniques.

The paper tackles the problem of generating speech summaries directly from untranscribed meeting audio, proposing ESSumm, an unsupervised extractive model that bypasses transcription. Results on the AMI and ICSI corpora show that it improves summarization quality and performs competitively with transcript-based methods.

In this paper, we propose ESSumm, a novel architecture for direct extractive speech-to-speech summarization. It is an unsupervised model that does not depend on intermediate transcribed text. Unlike previous methods that rely on a text representation, we aim to generate a summary directly from speech without transcription. First, a set of smaller speech segments is extracted based on the acoustic features of the speech signal. For each candidate speech segment, we design a distance-based summarization confidence score that is computed on the latent speech representation. Specifically, we leverage an off-the-shelf self-supervised convolutional neural network to extract deep speech features from raw audio. Our approach automatically predicts the optimal sequence of speech segments that captures the key information within a target summary length. Extensive results on two well-known meeting datasets (the AMI and ICSI corpora) show that our direct speech-based method improves summarization quality on untranscribed data. We also observe that our unsupervised speech-based method performs on par with recent transcript-based summarization approaches, which require an extra speech recognition step.
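
The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the confidence score or the search procedure, so everything below is an assumption made for illustration: the score is taken to be the negative Euclidean distance of a segment embedding to the centroid of all embeddings, and segments are picked greedily until the target summary length is reached. The names `select_segments`, `embeddings`, `durations`, and `target_length` are hypothetical and do not come from the paper.

```python
import math
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_segments(
    embeddings: Sequence[Sequence[float]],
    durations: Sequence[float],
    target_length: float,
) -> List[int]:
    """Return indices of the chosen segments in chronological order.

    ASSUMPTION: the score is the negative distance to the centroid,
    so segments closest to the "average" content rank highest.
    This is a stand-in, not the scoring defined in the paper.
    """
    center = centroid(embeddings)
    scores = [-euclidean(e, center) for e in embeddings]

    # Rank segments from highest to lowest score.
    ranked = sorted(range(len(embeddings)), key=lambda i: scores[i], reverse=True)

    chosen: List[int] = []
    total = 0.0
    for i in ranked:
        # Greedy fill: keep a segment only if it still fits the length budget.
        if total + durations[i] <= target_length:
            chosen.append(i)
            total += durations[i]

    # Restore the original order so the summary plays back chronologically.
    return sorted(chosen)
```

In a real pipeline the embeddings would come from a pretrained speech encoder applied to each segment, and the durations would be in seconds. A greedy fill is only one way to meet a length budget; the paper's phrase "optimal sequence" may refer to a different procedure.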
