kNN For Whisper And Its Effect On Bias And Speaker Adaptation
This work addresses speaker adaptation and bias reduction in speech recognition, but it is incremental: it transfers an existing method from text generation to speech recognition.
The paper tackles variation in speech recognition performance across languages, domains, and speaker characteristics by applying token-level k-nearest neighbor search to Whisper, a transformer-based speech model, and shows that this non-parametric method improves adaptation without fine-tuning.
Speech recognition performance varies by language, domain, and speaker characteristics such as accent, but fine-tuning a model on any one of these categories may lead to catastrophic forgetting. Token-level $k$-nearest neighbor search ($k$NN), first proposed for neural sequence decoders in natural language generation (NLG) and machine translation (MT), is a non-parametric method that instead adapts through inference-time search in an external datastore, without training the underlying model. We show that Whisper, a transformer end-to-end speech model, benefits from $k$NN. We investigate the differences between the speech and text setups. We discuss implications for speaker adaptation, and analyze improvements by gender, accent, and age.
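To make the idea of inference-time search in an external datastore concrete, the following is a minimal sketch of token-level $k$NN decoding as it is commonly formulated in the NLG and MT literature, not the paper's actual implementation. It assumes a datastore of (decoder hidden state, next token) pairs, squared L2 distance, a softmax over negative distances, and linear interpolation with the model's own distribution. The names `knn_token_probs`, `interpolate`, `temperature`, and `lam` are hypothetical, and the toy vectors stand in for hidden states that would come from the speech model.

```python
import numpy as np


def knn_token_probs(query, keys, values, vocab_size, k=4, temperature=1.0):
    """Next-token distribution built from the k nearest datastore entries.

    query:  (d,) hidden state at the current decoding step (assumed input)
    keys:   (n, d) stored hidden states
    values: (n,) token id that followed each stored hidden state
    """
    # Squared L2 distance from the query to every stored key.
    dists = np.sum((keys - query) ** 2, axis=1)
    k = min(k, len(keys))
    nearest = np.argsort(dists)[:k]
    # Softmax over negative distances, shifted for numerical stability.
    logits = -dists[nearest] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Sum the weights of neighbors that share the same target token.
    probs = np.zeros(vocab_size)
    np.add.at(probs, values[nearest], weights)
    return probs


def interpolate(p_model, p_knn, lam=0.5):
    """Mix the model distribution with the retrieved one (lam is assumed)."""
    return lam * p_knn + (1.0 - lam) * p_model


# Toy datastore: four stored states in 2-D, vocabulary of three tokens.
keys = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
values = np.array([2, 2, 1, 0])
query = np.array([0.1, 0.0])

p_knn = knn_token_probs(query, keys, values, vocab_size=3, k=3)
p_model = np.array([0.6, 0.3, 0.1])  # stand-in for the model's softmax output
p_final = interpolate(p_model, p_knn, lam=0.5)
print(p_final.argmax())
```

In this toy case the stand-in model distribution favors token 0, while the retrieved neighbors shift the mixed distribution toward token 2. The underlying model is never updated; adapting to a new speaker or domain in this setup only means adding entries to `keys` and `values`.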