AS LG SD · Jun 2, 2025

On-device Streaming Discrete Speech Units

Kwanghee Choi, Masao Someki, Emma Strubell, Shinji Watanabe

arXiv:2506.01845v15.14 citations · h-index: 27 · Has Code · INTERSPEECH

Originality: Incremental advance

AI Analysis

This work addresses the problem of enabling real-time speech processing in resource-constrained environments, representing an incremental improvement over existing DSU methods.

The paper tackles the impracticality of conventional discrete speech unit (DSU) approaches for on-device streaming by reducing both the attention window and the model size, achieving a 50% reduction in FLOPs with only a 6.5% relative increase in character error rate (CER) on the ML-SUPERB 1h dataset.

Discrete speech units (DSUs) are derived from clustering the features of self-supervised speech models (S3Ms). DSUs offer significant advantages for on-device streaming speech applications due to their rich phonetic information, high transmission efficiency, and seamless integration with large language models. However, conventional DSU-based approaches are impractical as they require full-length speech input and computationally expensive S3Ms. In this work, we reduce both the attention window and the model size while preserving the effectiveness of DSUs. Our results demonstrate that we can reduce floating-point operations (FLOPs) by 50% with only a relative increase of 6.5% in character error rate (CER) on the ML-SUPERB 1h dataset. These findings highlight the potential of DSUs for real-time speech processing in resource-constrained environments.
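As a rough illustration of the first sentence of the abstract, the sketch below maps frame-level features to DSUs by assigning each frame to its nearest cluster centroid. It is a minimal sketch under stated assumptions, not the authors' pipeline: the centroids are assumed to come from a clustering step (for example k-means) that has already been run, the 2-D feature values are toy numbers standing in for S3M outputs, and the names `squared_distance` and `frames_to_units` are hypothetical.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def frames_to_units(
    frames: Sequence[Sequence[float]],
    centroids: Sequence[Sequence[float]],
) -> List[int]:
    """Assign each frame the index of its nearest centroid (its discrete unit)."""
    units = []
    for frame in frames:
        distances = [squared_distance(frame, c) for c in centroids]
        units.append(distances.index(min(distances)))
    return units


# Assumption: centroids were learned beforehand by clustering S3M features.
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

# Toy 2-D "features", one row per speech frame (hypothetical values).
frames = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [0.2, 1.9]]

units = frames_to_units(frames, centroids)
print(units)
```

Because each frame is quantized independently, this assignment step can run as frames arrive; the full-length input requirement mentioned in the abstract concerns the S3M that produces the features, not the centroid lookup.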
