AS · CL · SD · Sep 13, 2023

Can Whisper perform speech-based in-context learning?

NVIDIA
arXiv:2309.07081v2 · 62 citations · h-index: 22
Originality: Incremental advance
AI Analysis

This work addresses the challenge of adapting ASR models to low-resource dialects, which matters to speakers in those communities. It is an incremental advance because it builds on the existing Whisper models rather than introducing a new architecture.

The paper tackles the problem of adapting Whisper speech recognition models to new languages or dialects at test time, without retraining. It proposes a speech-based in-context learning (SICL) method that achieves an average relative word error rate (WER) reduction of 32.3% on isolated-word recognition for two Chinese dialects. Adding k-nearest-neighbours in-context example selection raises the average relative reduction to 36.4%.
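The figures above are relative reductions, not absolute drops in WER. A minimal sketch of how such a number could be computed is given below. It assumes the standard word-level edit-distance definition of WER; the function names `word_error_rate` and `relative_wer_reduction` are illustrative and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction of the baseline WER removed by adaptation."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer
```

As an arithmetic illustration with invented values (not results from the paper), a baseline WER of 50.0% that falls to 33.85% after adaptation corresponds to a relative reduction of 32.3%.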

This paper investigates the in-context learning abilities of the Whisper automatic speech recognition (ASR) models released by OpenAI. A novel speech-based in-context learning (SICL) approach is proposed for test-time adaptation, which can reduce the word error rates (WERs) with only a small number of labelled speech samples without gradient descent. Language-level adaptation experiments using Chinese dialects showed that when applying SICL to isolated word ASR, consistent and considerable relative WER reductions can be achieved using Whisper models of any size on two dialects, which is on average 32.3%. A k-nearest-neighbours-based in-context example selection technique can be applied to further improve the efficiency of SICL, which can increase the average relative WER reduction to 36.4%. The findings are verified using speaker adaptation or continuous speech recognition tasks, and both achieved considerable relative WER reductions. Detailed quantitative analyses are also provided to shed light on SICL's adaptability to phonological variances and dialect-specific lexical nuances.
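The k-nearest-neighbours example selection mentioned in the abstract can be sketched roughly as follows. This is an assumption-laden illustration, not the authors' implementation: the abstract does not state which features or distance metric are used, so the sketch assumes each utterance has already been mapped to a fixed-length embedding and ranks candidates by cosine similarity. The names `cosine_similarity` and `select_in_context_examples` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def select_in_context_examples(query_embedding, pool, k):
    """Return the k labelled examples most similar to the test utterance.

    pool is a list of (embedding, transcript) pairs. How the embeddings
    are produced is an assumption left outside this sketch.
    """
    ranked = sorted(
        pool,
        key=lambda item: cosine_similarity(query_embedding, item[0]),
        reverse=True,
    )
    return ranked[:k]
```

The selected speech and transcript pairs would then be supplied to the model as context ahead of the test utterance, so that adaptation happens without any gradient update.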

