SD · AS · Jun 5

Leveraging Soft Distributions of SSL-Derived Discrete Speech Tokens for Downstream Inference

arXiv:2606.06806
Originality: Incremental advance
AI Analysis

For practitioners using discrete speech tokens, this method improves performance without changing training, offering a practical upgrade for ASR and synthesis.

The paper proposes using soft token assignment from SSL models during downstream inference to reduce information loss from discretization, outperforming hard assignment on ASR and speech synthesis, and surpassing continuous SSL features on non-native ASR.

Discrete speech tokens obtained from self-supervised learning (SSL) models provide efficient data compression while maintaining strong performance, and have been widely used as intermediate representations in various tasks. However, discretization inevitably causes information loss, leading to degraded performance compared with continuous SSL features. In this work, we propose to apply soft token assignment only during downstream inference. This approach preserves the efficiency of hard discretization during training while enhancing the expressiveness of the tokens at inference. The proposed method outperforms conventional hard assignment on both ASR and speech synthesis tasks, and exhibits particularly strong generalizability to out-of-domain data. For ASR of non-native speech, it even surpasses models using continuous SSL features. Moreover, analysis of the resulting representations shows they align more accurately with phonemes compared with conventional hard assignment.
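The contrast between hard and soft assignment described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the soft distribution is computed or how it is fed to the downstream model, so everything here is an assumption: the k-means-style codebook `centroids`, a softmax over negative squared distances with a hypothetical `temperature` parameter, and a downstream input formed as the probability-weighted average of token embeddings. All function names are illustrative, not the authors' implementation.

```python
import numpy as np

def squared_distances(features, centroids):
    """Squared Euclidean distance from each frame to each centroid, shape (T, K)."""
    diff = features[:, None, :] - centroids[None, :, :]
    return (diff ** 2).sum(axis=-1)

def hard_assign(features, centroids):
    """Hard discretization: index of the nearest centroid for each frame."""
    return squared_distances(features, centroids).argmin(axis=1)

def soft_assign(features, centroids, temperature=1.0):
    """Assumed soft assignment: softmax over negative squared distances."""
    logits = -squared_distances(features, centroids) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def embed_hard(token_ids, embedding_table):
    """Training-time input: plain embedding lookup of discrete tokens."""
    return embedding_table[token_ids]

def embed_soft(probs, embedding_table):
    """Inference-time input (assumed): expected embedding under the soft distribution."""
    return probs @ embedding_table

# Toy data standing in for SSL features, a codebook, and a trained embedding table.
rng = np.random.default_rng(0)
T, D, K, E = 5, 8, 4, 3  # frames, feature dim, codebook size, embedding dim
features = rng.normal(size=(T, D))
centroids = rng.normal(size=(K, D))
embedding_table = rng.normal(size=(K, E))

token_ids = hard_assign(features, centroids)            # used during training
probs = soft_assign(features, centroids)                # used only at inference
train_inputs = embed_hard(token_ids, embedding_table)
infer_inputs = embed_soft(probs, embedding_table)
```

In this sketch the downstream model's parameters (here, `embedding_table`) are unchanged between training and inference; only the input changes from a one-hot lookup to a weighted mixture. As `temperature` approaches zero, the soft distribution collapses to the one-hot hard assignment, so the two paths coincide.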
