CL · Sep 1, 2024

Comparing Discrete and Continuous Space LLMs for Speech Recognition

arXiv:2409.00800v1 · 17 citations · h-index: 29 · Has Code
Originality: Incremental advance
AI Analysis

The paper provides the first extensive comparison of speech representations for LLM-based ASR, offering insights for advancing ASR and NLP research.

The paper compares discrete and continuous speech representations in LLM-based automatic speech recognition and reports a state-of-the-art word error rate of 1.69% on LibriSpeech using a HuBERT encoder.

This paper investigates discrete and continuous speech representations in Large Language Model (LLM)-based Automatic Speech Recognition (ASR), organizing them by feature continuity and training approach into four categories: supervised and unsupervised for both discrete and continuous types. We further classify LLMs based on their input and autoregressive feedback into continuous-space and discrete-space models. Using specialized encoders and comparative analysis with a Joint-Training-From-Scratch Language Model (JTFS LM) and pre-trained LLaMA2-7b, we provide a detailed examination of their effectiveness. Our work marks the first extensive comparison of speech representations in LLM-based ASR and explores various modeling techniques. We open-source a system that achieves a state-of-the-art Word Error Rate (WER) of 1.69% on LibriSpeech using a HuBERT encoder, offering valuable insights for advancing ASR and natural language processing (NLP) research.
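
For readers unfamiliar with the headline metric, the following is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the paper's scoring script, and published LibriSpeech numbers typically involve text normalization that is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

On this reading, a WER of 1.69% corresponds to roughly 17 word errors per 1,000 reference words.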

