AS | AI | SD | Sep 20, 2024

Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection

arXiv:2409.13582v1 | 12 citations | h-index: 98 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses speech dysfluency detection for researchers and practitioners, offering a novel perspective and open-source resources, though it is incremental in method.

The paper tackles speech dysfluency detection by reframing it as a token-based ASR problem instead of time-based object detection, proposing simulators and a Whisper-like architecture to create a new benchmark with decent performance.

Speech dysfluency modeling is a task to detect dysfluencies in speech, such as repetition, block, insertion, replacement, and deletion. Most recent advancements treat this problem as a time-based object detection problem. In this work, we revisit this problem from a new perspective: tokenizing dysfluencies and modeling the detection problem as a token-based automatic speech recognition (ASR) problem. We propose rule-based speech and text dysfluency simulators and develop VCTK-token, and then develop a Whisper-like seq2seq architecture to build a new benchmark with decent performance. We also systematically compare our proposed token-based methods with time-based methods, and propose a unified benchmark to facilitate future research endeavors. We open-source these resources for the broader scientific community. The project page is available at https://rorizzz.github.io/
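To make the token-based framing in the abstract concrete, here is a minimal sketch of what a rule-based text dysfluency simulator could look like: it takes a clean word sequence and produces the kind of token-annotated target a seq2seq ASR model might be trained to emit. The token names (`[REP]`, `[BLOCK]`, and so on), the function `simulate_dysfluency`, and the placement rules are all assumptions made for illustration; the paper's actual token inventory and simulation rules may differ.

```python
# Hypothetical token inventory, one token per dysfluency type named in the
# abstract. These strings are illustrative assumptions, not the paper's tokens.
DYSFLUENCY_TOKENS = {
    "repetition": "[REP]",
    "block": "[BLOCK]",
    "insertion": "[INS]",
    "replacement": "[REPL]",
    "deletion": "[DEL]",
}


def simulate_dysfluency(words, kind, index, extra="uh"):
    """Return a copy of `words` with one simulated dysfluency at `index`.

    `extra` is the inserted filler (for "insertion") or the substituted
    word (for "replacement"); it is ignored for the other types.
    """
    if kind not in DYSFLUENCY_TOKENS:
        raise ValueError(f"unknown dysfluency type: {kind}")
    if not 0 <= index < len(words):
        raise IndexError("index out of range")

    token = DYSFLUENCY_TOKENS[kind]
    word = words[index]

    if kind == "repetition":
        middle = [word, token, word]   # word is spoken twice
    elif kind == "block":
        middle = [token, word]         # pause before the word
    elif kind == "insertion":
        middle = [extra, token, word]  # filler spoken before the word
    elif kind == "replacement":
        middle = [extra, token]        # a different word is spoken instead
    else:  # deletion
        middle = [token]               # the word is dropped

    return words[:index] + middle + words[index + 1:]


if __name__ == "__main__":
    clean = "please call stella".split()
    for kind in DYSFLUENCY_TOKENS:
        print(kind, "->", " ".join(simulate_dysfluency(clean, kind, 1)))
```

Under this sketch, detection reduces to predicting the annotated sequence from audio, so the dysfluency type and its position in the word sequence come out of the decoder directly rather than as time-stamped regions.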

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
