CL AI SD AS

May 20, 2025

Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English

Haoyang Zhang, Hexin Liu, Xiangyu Zhang, Qiquan Zhang, Yuchen Hu, Junqi Zhao, Fei Tian, Xuerui Yang, Leibny Paola Garcia, Eng Siong Chng

arXiv:2505.17076v3

h-index: 13

Originality: Synthesis-oriented

AI Analysis

This work addresses the underexplored impact of frame rates on speech tokenizers, with implications for optimizing automatic speech recognition and text-to-speech systems. The contribution is incremental, since it builds on existing tokenizer frameworks.

The study investigated how varying frame rates affect speech tokenization in Mandarin and English, finding that frame rate variations influence speech tokens differently for each language due to phonetic density and acoustic features.

The speech tokenizer plays a crucial role in recent speech tasks, generally serving as a bridge between speech signals and language models. While low-frame-rate codecs are widely employed as speech tokenizers, the impact of frame rates on speech tokens remains underexplored. In this study, we investigate how varying frame rates affect speech tokenization by examining Mandarin and English, two typologically distinct languages. We encode speech at different frame rates and evaluate the resulting semantic tokens on a speech recognition task. Our findings reveal that frame rate variations influence speech tokenization differently for each language, highlighting the interplay between frame rates, phonetic density, and language-specific acoustic features. The results provide insights into optimizing frame rate selection for speech tokenizers, with implications for automatic speech recognition, text-to-speech, and other speech-related applications.
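To make the relationship between frame rate and token count concrete, the arithmetic can be sketched as below. This is only an illustration, not the paper's method: the function names, the frame rates, the utterance duration, and the speaking rate are all hypothetical values chosen for the example, and the rounding rule (ceiling) is an assumption, since real tokenizers differ in how they pad the final frame.

```python
import math


def num_tokens(duration_s: float, frame_rate_hz: float) -> int:
    """Token frames emitted for one utterance by a fixed-rate tokenizer.

    Assumes the last partial frame is padded and kept (ceiling).
    """
    return math.ceil(duration_s * frame_rate_hz)


def tokens_per_unit(frame_rate_hz: float, units_per_second: float) -> float:
    """Average number of tokens covering one linguistic unit.

    A "unit" could be a syllable, phoneme, or character; the rate at which
    units are spoken is an input here, not something measured.
    """
    return frame_rate_hz / units_per_second


# Hypothetical values for illustration only (not taken from the paper).
FRAME_RATES_HZ = (12.5, 25.0, 50.0)
DURATION_S = 4.0
UNITS_PER_SECOND = 5.0

for rate in FRAME_RATES_HZ:
    total = num_tokens(DURATION_S, rate)
    per_unit = tokens_per_unit(rate, UNITS_PER_SECOND)
    print(f"{rate:>5} Hz: {total} tokens, {per_unit:.1f} tokens per unit")
```

Halving the frame rate halves the number of tokens available per linguistic unit, so a language that packs more units into each second leaves fewer tokens to represent each one at a given frame rate.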
