CL, SD, AS · Oct 24, 2023

How Much Context Does My Attention-Based ASR System Need?

arXiv:2310.15672v2 · 5 citations · h-index: 1
Originality: Incremental advance
AI Analysis

This paper addresses the problem of choosing the context length that gives the best speech recognition performance, particularly for long-format audio. It is an incremental advance because it builds on existing attention-based methods.

The study investigated how much acoustic context an attention-based automatic speech recognition system needs, finding that training with up to 21.8 minutes of context yields up to a 14.5% relative improvement over a 10-second baseline on long-format datasets.

For the task of speech recognition, the use of more than 30 seconds of acoustic context during training is uncommon and under-investigated in the literature. In this work, we conduct an empirical study on the effect of scaling the sequence length used to train/evaluate (dense-attention-based) acoustic models on speech recognition performance. For these experiments, a dataset of roughly 100,000 pseudo-labelled Spotify podcasts is used, with context lengths of 5 seconds to 1 hour being explored. Zero-shot evaluations are presented on the long-format datasets: Earnings-22, Tedlium and Rev16. Results demonstrate a benefit from training with up to 21.8 minutes of acoustic context, showing up to a 14.5% relative improvement from a baseline trained with 10 seconds of context. We find that the model's width/depth, positional encoding scheme and number of attention heads impact its ability to use longer contexts.
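
The abstract reports its gain as a relative improvement over the 10-second baseline. As a minimal sketch of that arithmetic, assuming the metric is word error rate (WER) and that "relative improvement" means the reduction divided by the baseline value, the calculation might look like the following. The function name and the two WER values are hypothetical and chosen only to illustrate the formula; they are not results from the paper.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values for illustration only (NOT taken from the paper).
baseline_wer = 20.0      # assumed WER (%) of a short-context baseline
long_context_wer = 17.1  # assumed WER (%) of a long-context model

gain = relative_improvement(baseline_wer, long_context_wer)
print(f"relative improvement: {gain:.1%}")
```

Under this definition, a relative improvement is always smaller in absolute WER points than the percentage suggests: here a 2.9-point absolute reduction corresponds to a relative gain of about 14.5%.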

Code Implementations: 1 repo