Pre-Finetuning for Few-Shot Emotional Speech Recognition
This paper addresses speaker adaptation for emotional speech recognition, but the contribution appears incremental because it builds on existing pre-trained models and few-shot learning methods.
The paper tackles the poor generalization of speech models to out-of-domain speakers by proposing pre-finetuning on emotional speech tasks, and it evaluates the approach through 33,600 few-shot fine-tuning trials on the Emotional Speech Dataset.
Speech models have long been known to overfit to individual speakers in many classification tasks. This leads to poor generalization in settings where the speakers are out-of-domain or out-of-distribution, as is common in production environments. We view speaker adaptation as a few-shot learning problem and investigate transfer learning approaches inspired by the recent success of pre-trained models in natural language tasks. We propose pre-finetuning speech models on difficult tasks to distill knowledge into few-shot downstream classification objectives. We pre-finetune Wav2Vec2.0 on every permutation of four multiclass emotional speech recognition corpora and evaluate our pre-finetuned models through 33,600 few-shot fine-tuning trials on the Emotional Speech Dataset.
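To make "every permutation of four corpora" concrete, the sketch below enumerates candidate pre-finetuning schedules, where each schedule is an ordered sequence of corpora to train on one after another. The corpus names are hypothetical placeholders, since the abstract does not name them. Whether the authors count only full-length orderings or also ordered subsets is not stated here, so the `include_subsets` flag is an assumption that exposes both readings.

```python
from itertools import permutations

# Hypothetical placeholder names: the abstract does not name the four corpora.
CORPORA = ["corpus_a", "corpus_b", "corpus_c", "corpus_d"]


def pre_finetuning_schedules(corpora, include_subsets=True):
    """Return ordered sequences of corpora to pre-finetune on, in order.

    include_subsets=False: only orderings that use all corpora.
    include_subsets=True:  orderings of every non-empty subset as well.
    Which reading matches the paper is an assumption, not a stated fact.
    """
    if include_subsets:
        sizes = range(1, len(corpora) + 1)
    else:
        sizes = [len(corpora)]
    return [order for k in sizes for order in permutations(corpora, k)]


full_orderings = pre_finetuning_schedules(CORPORA, include_subsets=False)
all_orderings = pre_finetuning_schedules(CORPORA, include_subsets=True)

print(len(full_orderings))  # 24 = 4!
print(len(all_orderings))   # 64 = 4 + 12 + 24 + 24
```

Under either reading, each schedule would yield one pre-finetuned checkpoint, which is then fine-tuned and scored in the few-shot trials on the downstream dataset.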