AS, AI, CL, SD | Oct 7, 2025

TokenChain: A Discrete Speech Chain via Semantic Token Modeling

arXiv:2510.06201v1
Originality: Incremental advance
AI Analysis

This is an incremental improvement for speech processing systems, enhancing joint ASR and TTS performance through a token-based interface.

The paper aims to improve automatic speech recognition (ASR) and text-to-speech (TTS) jointly by proposing TokenChain, a fully discrete speech chain that couples semantic-token ASR with a two-stage TTS. On LibriSpeech, TokenChain surpasses baseline accuracy 2-6 epochs earlier and yields 5-13% lower error at equal epochs. On TED-LIUM, it reduces ASR WER by 56% and text-to-semantic (T2S) WER by 31%, both in relative terms.

Machine Speech Chain, simulating the human perception-production loop, proves effective in jointly improving ASR and TTS. We propose TokenChain, a fully discrete speech chain coupling semantic-token ASR with a two-stage TTS: an autoregressive text-to-semantic model co-trained with ASR and a masked-generative semantic-to-acoustic model for synthesis only. End-to-end feedback across the text interface is enabled with straight-through argmax/Gumbel-Softmax and balanced with supervised ASR via dynamic weight averaging. Ablations examine optimal temperature schedules for in- and cross-domain transfer. Evaluation reveals TokenChain surpasses baseline accuracy 2-6 epochs earlier and yields 5-13% lower equal-epoch error with stable T2S on LibriSpeech, and reduces relative ASR WER by 56% and T2S WER by 31% on TED-LIUM with minimal forgetting, showing that chain learning remains effective with token interfaces and models.
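The abstract names dynamic weight averaging as the mechanism that balances the chain feedback loss against the supervised ASR loss, but gives no formula. The sketch below shows the formulation commonly used in multi-task learning, where each task's weight grows when its loss has been falling more slowly than the others. The function name `dwa_weights`, the temperature value of 2.0, and the two-task loss ordering in the usage lines are illustrative assumptions, not details taken from the paper.

```python
import math


def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Dynamic weight averaging for K tasks (generic form, assumed here).

    prev_losses:      per-task losses from epoch t-1
    prev_prev_losses: per-task losses from epoch t-2
    Returns K weights that sum to K.
    """
    if len(prev_losses) != len(prev_prev_losses):
        raise ValueError("both loss lists must have one entry per task")

    # Relative descent rate: a ratio near 1 means the task barely improved,
    # a small ratio means its loss dropped quickly.
    ratios = [l1 / l2 for l1, l2 in zip(prev_losses, prev_prev_losses)]

    # Softmax over ratios / temperature, shifted by the max for stability.
    scaled = [r / temperature for r in ratios]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)

    num_tasks = len(ratios)
    return [num_tasks * e / total for e in exps]


# Hypothetical losses: index 0 = supervised ASR, index 1 = chain feedback.
w_asr, w_chain = dwa_weights([0.9, 0.5], [1.0, 1.0])
total_loss_weighting = (w_asr, w_chain)
```

In this usage example the first loss fell only from 1.0 to 0.9 while the second fell from 1.0 to 0.5, so the first task receives the larger weight. A higher temperature flattens the weights toward 1.0 each.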
