LG, AI, SD | Jan 30

Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization

arXiv:2601.23174v2 | 1 citation | h-index: 31 | Has Code
AI Analysis

This work addresses inefficiencies in speech tokenization for conversational AI, offering a more compact representation that could reduce computational costs in speech processing systems.

The paper addresses the unnecessarily long token sequences that fixed-frame-rate speech tokenization produces. It introduces DyCAST, a dynamic character-aligned speech tokenizer that achieves competitive speech resynthesis quality and downstream performance while using significantly fewer tokens.

Neural audio codecs are at the core of modern conversational speech technologies, converting continuous speech into sequences of discrete tokens that can be processed by LLMs. However, existing codecs typically operate at fixed frame rates, allocating tokens uniformly in time and producing unnecessarily long sequences. In this work, we introduce DyCAST, a Dynamic Character-Aligned Speech Tokenizer that enables variable-frame-rate tokenization through soft character-level alignment and explicit duration modeling. DyCAST learns to associate tokens with character-level linguistic units during training and supports alignment-free inference with direct control over token durations at decoding time. To improve speech resynthesis quality at low frame rates, we further introduce a retrieval-augmented decoding mechanism that enhances reconstruction fidelity without increasing bitrate. Experiments show that DyCAST achieves competitive speech resynthesis quality and downstream performance while using significantly fewer tokens than fixed-frame-rate codecs. Code and checkpoints will be released publicly at https://github.com/lucadellalib/dycast.
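To make the contrast between fixed and variable frame rates concrete, here is a minimal sketch. It is not DyCAST's implementation: the abstract describes soft, learned character-level alignment, whereas this toy uses hard integer durations and plain mean pooling. The function names, the scalar "features", the 50 Hz frame rate, and the example durations are all assumptions made for illustration.

```python
from typing import List, Sequence


def fixed_rate_token_count(duration_s: float, frame_rate_hz: float) -> int:
    """Tokens emitted by a fixed-frame-rate codec: one per frame."""
    return round(duration_s * frame_rate_hz)


def pool_by_durations(frames: Sequence[float], durations: Sequence[int]) -> List[float]:
    """Mean-pool consecutive frames into one value per segment.

    Hypothetical stand-in for character-aligned tokenization: each segment
    plays the role of one character, and `durations` gives its length in frames.
    """
    if any(d <= 0 for d in durations):
        raise ValueError("every duration must be a positive number of frames")
    if sum(durations) != len(frames):
        raise ValueError("durations must sum to the number of frames")
    pooled, start = [], 0
    for d in durations:
        segment = frames[start:start + d]
        pooled.append(sum(segment) / d)
        start += d
    return pooled


def expand_by_durations(tokens: Sequence[float], durations: Sequence[int]) -> List[float]:
    """Repeat each token for its duration to recover a frame-rate sequence."""
    expanded: List[float] = []
    for tok, d in zip(tokens, durations):
        expanded.extend([tok] * d)
    return expanded


# Toy input: 8 frames of scalar "features" grouped into 4 variable-length units.
frames = [1.0, 1.0, 2.0, 4.0, 3.0, 3.0, 3.0, 5.0]
durations = [2, 2, 3, 1]

tokens = pool_by_durations(frames, durations)       # 4 tokens instead of 8
restored = expand_by_durations(tokens, durations)   # back to 8 frames

print(len(frames), "frames ->", len(tokens), "tokens ->", len(restored), "frames")
print("fixed-rate tokens for 4 s at an assumed 50 Hz:", fixed_rate_token_count(4.0, 50.0))
```

The sequence length here follows the number of units rather than the audio duration, which is the property the abstract attributes to variable-frame-rate tokenization. Changing `durations` before expansion loosely mirrors the idea of controlling token durations at decoding time, though the paper's actual mechanism may differ.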
