CL · Sep 1, 2024

Comparing Discrete and Continuous Space LLMs for Speech Recognition

Yaoxun Xu, Shi-Xiong Zhang, Jianwei Yu, Zhiyong Wu, Dong Yu

arXiv:2409.00800v18.718 citations · h-index: 29 · Has Code

Originality: Incremental advance

AI Analysis

The paper provides the first extensive comparison of speech representations for LLM-based ASR, offering insights for advancing ASR and NLP research.

This paper compares discrete and continuous speech representations in LLM-based automatic speech recognition and reports a state-of-the-art word error rate of 1.69% on LibriSpeech using a HuBERT encoder.

This paper investigates discrete and continuous speech representations in Large Language Model (LLM)-based Automatic Speech Recognition (ASR), organizing them by feature continuity and training approach into four categories: supervised and unsupervised variants of both discrete and continuous representations. We further classify LLMs, based on their input and autoregressive feedback, into continuous-space and discrete-space models. Using specialized encoders and a comparative analysis with a Joint-Training-From-Scratch Language Model (JTFS LM) and pre-trained LLaMA2-7b, we provide a detailed examination of their effectiveness. Our work marks the first extensive comparison of speech representations in LLM-based ASR and explores various modeling techniques. Our open-source implementation achieves a state-of-the-art Word Error Rate (WER) of 1.69% on LibriSpeech using a HuBERT encoder, offering valuable insights for advancing ASR and natural language processing (NLP) research.
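
For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, assuming plain whitespace tokenization and no text normalization; the function name `wer` is hypothetical, and this is not the authors' evaluation script.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length.

    Assumption: words are split on whitespace with no case or punctuation
    normalization, which real ASR evaluation pipelines usually apply first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of 1.69% corresponds to roughly 17 word errors per 1,000 reference words. Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.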

