CL | SD | AS | Jan 31, 2024

Exploring the limits of decoder-only models trained on public speech recognition corpora

IBM
arXiv:2402.00235v1 | 8 citations | h-index: 40 | Has Code | INTERSPEECH
Originality: Incremental advance
AI Analysis

This work addresses the need for competitive open-source speech recognition models that do not rely on proprietary data. The advance is incremental, since it builds on existing decoder-only architectures.

The study asks whether decoder-only models trained solely on public data can reach competitive speech recognition performance. Its DOTA model outperformed OWSM, the open-source replication of Whisper, on nearly all English benchmarks and surpassed Whisper large-v3 on 7 of 15 test sets.

The emergence of industrial-scale automatic speech recognition (ASR) models such as Whisper and USM, trained on 1M hours of weakly labelled and 12M hours of audio-only proprietary data respectively, has led to a stronger need for large-scale public ASR corpora and competitive open-source pipelines. Unlike these models, large language models are typically based on Transformer decoders, and it remains unclear whether decoder-only models trained on public data alone can deliver competitive performance. In this work, we investigate factors such as the choice of training datasets and the modeling components necessary for obtaining the best performance using public English ASR corpora alone. Our Decoder-Only Transformer for ASR (DOTA) model comprehensively outperforms the encoder-decoder open-source replication of Whisper (OWSM) on nearly all English ASR benchmarks and outperforms Whisper large-v3 on 7 out of 15 test sets. We release our codebase and model checkpoints under a permissive license.
