Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR
This work offers an incremental improvement for researchers and practitioners in speech processing: better ASR performance in unified speech-text models.
The paper aims to improve automatic speech recognition (ASR) in decoder-only Transformers by modeling the dependency among speech tokens, which the Loss Masking strategy used in prior models ignores. It proposes Smoothed Label Distillation (SLD), an approach that outperforms Loss Masking across different speech discretization methods; the abstract does not report specific performance numbers.
Recently, unified speech-text models, such as SpeechGPT, VioLA, and AudioPaLM, have achieved remarkable performance on various speech tasks. These models discretize speech signals into tokens (speech discretization) and use a shared vocabulary for both text and speech tokens. Then they train a single decoder-only Transformer on a mixture of speech tasks. However, these models rely on the Loss Masking strategy for the ASR task, which ignores the dependency among speech tokens. In this paper, we propose to model speech tokens in an autoregressive way, similar to text. We find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance over the Loss Masking approach. To address this issue, we propose a novel approach denoted Smoothed Label Distillation (SLD), which applies a KL divergence loss with smoothed labels on speech tokens. Our experiments show that SLD effectively models speech tokens and outperforms Loss Masking for decoder-only Transformers in ASR tasks with different speech discretization methods. The source code can be found here: https://github.com/alibaba-damo-academy/SpokenNLP/tree/main/sld
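To make the three training objectives in the abstract concrete, here is a minimal sketch in plain Python. It is not taken from the linked repository. The function names (`smoothed_kl`, `sequence_loss`), the smoothing scheme (mass `1 - eps` on the target token and `eps / (V - 1)` on every other token), the value `eps = 0.1`, and the unweighted averaging over positions are all assumptions made for illustration; the paper's exact formulation may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cross_entropy(logits, target):
    """Standard cross-entropy against a one-hot target."""
    return -log_softmax(logits)[target]


def smoothed_kl(logits, target, eps=0.1):
    """KL(q || p), where p is the model distribution and q is a smoothed label.

    Assumed smoothing: q puts 1 - eps on the target, eps / (V - 1) elsewhere.
    """
    v = len(logits)
    logp = log_softmax(logits)
    kl = 0.0
    for k in range(v):
        q = 1.0 - eps if k == target else eps / (v - 1)
        if q > 0.0:  # terms with q == 0 contribute nothing to the KL sum
            kl += q * (math.log(q) - logp[k])
    return kl


def sequence_loss(logits_seq, targets, is_speech, mode, eps=0.1):
    """Average loss over one sequence of speech tokens followed by text tokens.

    mode = "loss_masking":  speech positions are skipped entirely.
    mode = "cross_entropy": speech positions use ordinary cross-entropy.
    mode = "sld":           speech positions use KL against smoothed labels.
    Text positions always use cross-entropy.
    """
    if mode not in ("loss_masking", "cross_entropy", "sld"):
        raise ValueError(f"unknown mode: {mode}")
    total, count = 0.0, 0
    for logits, tgt, speech in zip(logits_seq, targets, is_speech):
        if speech:
            if mode == "loss_masking":
                continue  # no gradient signal from speech tokens
            if mode == "cross_entropy":
                total += cross_entropy(logits, tgt)
            else:
                total += smoothed_kl(logits, tgt, eps)
        else:
            total += cross_entropy(logits, tgt)
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # Toy sequence: two speech positions, then one text position, vocabulary of 4.
    logits_seq = [[2.0, 0.1, -1.0, 0.3], [0.0, 1.5, 0.2, -0.5], [0.4, 0.4, 3.0, 0.1]]
    targets = [0, 1, 2]
    is_speech = [True, True, False]
    for mode in ("loss_masking", "cross_entropy", "sld"):
        print(mode, sequence_loss(logits_seq, targets, is_speech, mode))
```

In this sketch, setting `eps = 0` makes the smoothed KL term identical to cross-entropy, and the KL term reaches zero only when the model distribution matches the smoothed label exactly. A smoothed target therefore never pushes the model toward full confidence on a single speech token, which is one plausible reading of why smoothing helps on noisy discretized speech.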