CL | SD | AS | Feb 3, 2023

PSST! Prosodic Speech Segmentation with Transformers

arXiv:2302.01984v1135 citations | h-index: 11
Originality: Incremental advance
AI Analysis

This work addresses the challenge of prosodic segmentation for speech processing applications, representing an incremental improvement by adapting an existing model to a specific task.

The paper tackles the problem of automatic prosodic segmentation in speech by finetuning the Whisper model to annotate intonation unit boundaries, achieving an accuracy of 95.8% and outperforming previous methods without requiring large-scale labeled data or high compute resources.

Self-attention mechanisms have enabled transformers to achieve superhuman-level performance on many speech-to-text (STT) tasks, yet the challenge of automatic prosodic segmentation has remained unsolved. In this paper we finetune Whisper, a pretrained STT model, to annotate intonation unit (IU) boundaries by repurposing low-frequency tokens. Our approach achieves an accuracy of 95.8%, outperforming previous methods without the need for large-scale labeled data or enterprise-grade compute resources. We also diminish input signals by applying a series of filters, finding that low-pass filters at a 3.2 kHz level improve segmentation performance in out-of-sample and out-of-distribution contexts. We release our model as both a transcription tool and a baseline for further improvements in prosodic segmentation.
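To make the filtering step concrete, here is a minimal sketch of a 3.2 kHz low-pass filter in plain Python. It is not the authors' code: the abstract does not say how the filter was designed, so the Hamming-windowed sinc design, the 101-tap length, and the function names `lowpass_fir`, `apply_fir`, and `rms` are all assumptions chosen for illustration. The 16 kHz sample rate in the usage note is the rate Whisper expects for its input audio.

```python
import math


def lowpass_fir(cutoff_hz, sample_rate_hz, num_taps=101):
    """Return Hamming-windowed sinc low-pass FIR taps with unity DC gain.

    Illustrative design only; the paper's actual filter is not specified here.
    """
    fc = cutoff_hz / sample_rate_hz  # normalized cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        # Ideal low-pass impulse response (sinc), with the x == 0 limit handled.
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        # Hamming window tapers the truncated sinc to reduce stopband ripple.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalize so a constant signal passes unchanged


def apply_fir(signal, taps):
    """Convolve signal with taps, returning an output of the same length.

    Samples beyond the signal edges are treated as zero.
    """
    half = len(taps) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(signal):
                acc += t * signal[j]
        out.append(acc)
    return out


def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))
```

With `taps = lowpass_fir(3200, 16000)`, passing a 1 kHz tone through `apply_fir` should leave its level essentially unchanged, while a 6 kHz tone should come out strongly attenuated. Comparing `rms` on the steady-state middle of each output is a quick way to check that behavior before applying the filter to real speech.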

Code Implementations: 1 repo