AS AI SD | Sep 20, 2024

Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection

Xuanru Zhou, Jiachen Lian, Cheol Jun Cho, Jingwen Liu, Zongli Ye, Jinming Zhang, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Luisa Gorno Tempini

arXiv:2409.13582v17.312 citations | h-index: 98 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses speech dysfluency detection. It reframes the task as token-based recognition and releases open-source resources, though the method itself is an incremental advance.

The paper tackles speech dysfluency detection by reframing it as a token-based ASR problem instead of time-based object detection, proposing simulators and a Whisper-like architecture to create a new benchmark with decent performance.

Speech dysfluency modeling is a task to detect dysfluencies in speech, such as repetition, block, insertion, replacement, and deletion. Most recent advancements treat this problem as a time-based object detection problem. In this work, we revisit this problem from a new perspective: tokenizing dysfluencies and modeling the detection problem as a token-based automatic speech recognition (ASR) problem. We propose rule-based speech and text dysfluency simulators and develop VCTK-token, and then develop a Whisper-like seq2seq architecture to build a new benchmark with decent performance. We also systematically compare our proposed token-based methods with time-based methods, and propose a unified benchmark to facilitate future research endeavors. We open-source these resources for the broader scientific community. The project page is available at https://rorizzz.github.io/
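To make the token-based framing concrete, here is a minimal sketch of what a rule-based text dysfluency simulator could look like. It is an illustration only, not the authors' implementation: the token names (`[REP]`, `[BLOCK]`, `[INS]`, `[SUB]`, `[DEL]`), the placement of each token relative to the affected word, and the helper names `inject_dysfluency` and `simulate` are all assumptions. Only the five dysfluency types come from the abstract.

```python
import random

# Hypothetical token vocabulary; the paper's actual tokens may differ.
TOKENS = {
    "repetition": "[REP]",
    "block": "[BLOCK]",
    "insertion": "[INS]",
    "replacement": "[SUB]",
    "deletion": "[DEL]",
}


def inject_dysfluency(words, kind, index, filler="uh", substitute="thing"):
    """Return a copy of `words` with one dysfluency of `kind` at `index`.

    The token placement rules below are illustrative assumptions.
    """
    if kind not in TOKENS:
        raise ValueError(f"unknown dysfluency type: {kind}")
    if not 0 <= index < len(words):
        raise IndexError("index out of range")

    token = TOKENS[kind]
    word = words[index]
    if kind == "repetition":
        patch = [word, token, word]      # the word is spoken twice
    elif kind == "block":
        patch = [token, word]            # a pause before the word
    elif kind == "insertion":
        patch = [filler, token, word]    # a filler precedes the word
    elif kind == "replacement":
        patch = [substitute, token]      # a different word is spoken
    else:  # deletion
        patch = [token]                  # the word is dropped

    out = list(words)
    out[index:index + 1] = patch
    return out


def simulate(sentence, seed=None):
    """Inject one randomly chosen dysfluency into a non-empty sentence."""
    words = sentence.split()
    if not words:
        raise ValueError("sentence must contain at least one word")
    rng = random.Random(seed)
    kind = rng.choice(sorted(TOKENS))
    index = rng.randrange(len(words))
    return " ".join(inject_dysfluency(words, kind, index)), kind
```

Under this framing, a seq2seq recognizer would be trained to emit the annotated word sequence, so the dysfluency type and position are read directly from the output tokens rather than from predicted time boundaries.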
