Robust Unsupervised Adaptation of a Speech Recogniser Using Entropy Minimisation and Speaker Codes
This work addresses robust speaker adaptation in speech recognition when very little unlabelled adaptation data is available, and reports relative word error rate improvements of 20% to 29%.
The paper tackles the problem of adapting speech recognisers to new speakers with limited unlabelled data. It proposes a novel loss function, the conditional entropy over multiple hypotheses, and combines it with speaker codes, achieving a 20% relative improvement in word error rate with one minute of adaptation data and 29% with ten minutes.
Speech recognisers usually perform optimally only in a specific environment and need to be adapted to work well in another. For adaptation to a new speaker, there is often too little data for fine-tuning to be robust, and that data is usually unlabelled. This paper proposes a combination of approaches to make adaptation to a single minute of data robust. First, instead of estimating the adaptation parameters with cross-entropy on a single error-prone hypothesis or "pseudo-label", this paper proposes a novel loss function, the conditional entropy over complete hypotheses. Using multiple hypotheses makes adaptation more robust to errors in the initial recognition. Second, a "speaker code" characterises a speaker in a vector short enough that it requires little data to estimate. On a far-field noise-augmented version of Common Voice, the proposed scheme yields a 20% relative improvement in word error rate with one minute of adaptation data, rising to 29% with 10 minutes.
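To make the idea of an entropy over complete hypotheses concrete, the following is a minimal sketch, not the paper's implementation. It assumes the hypothesis space is approximated by an N-best list and that each hypothesis comes with an unnormalised log-score from the recogniser. The function name `nbest_conditional_entropy` and the example scores are hypothetical.

```python
import math


def nbest_conditional_entropy(log_scores):
    """Entropy of the posterior over an N-best list of complete hypotheses.

    log_scores: unnormalised log-scores, one per hypothesis (assumed input;
    the abstract does not specify how hypotheses are scored or enumerated).
    """
    if not log_scores:
        raise ValueError("need at least one hypothesis")
    # Normalise with log-sum-exp so the posteriors sum to one over the list.
    m = max(log_scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in log_scores))
    log_post = [s - log_z for s in log_scores]
    # H = -sum_i p_i * log p_i, in nats.
    return -sum(math.exp(lp) * lp for lp in log_post)


# One dominant hypothesis gives low entropy; near-ties give high entropy.
confident = nbest_conditional_entropy([0.0, -10.0, -12.0])
uncertain = nbest_conditional_entropy([0.0, -0.1, -0.2])
```

Minimising a quantity of this kind with respect to the adaptation parameters pushes probability mass towards hypotheses the model already favours without committing to a single pseudo-label. In an actual system the scores would be differentiable functions of those parameters, which this plain-Python sketch does not model.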