Teach an all-rounder with experts in different domains
This work addresses the need for robust ASR models that perform well across multiple domains, such as different speaking styles and acoustic conditions, and reports gains over a multi-condition baseline.
The paper tackles the problem of building a multi-domain automatic speech recognition (ASR) model by using a teacher-student framework with domain-specific experts, achieving up to a 10.4% relative improvement in character error rate over a baseline multi-condition model.
In many automatic speech recognition (ASR) tasks, an ideal model has to be applicable across multiple domains. In this paper, we propose to teach an all-rounder with experts in different domains. Concretely, we build a multi-domain acoustic model by applying the teacher-student training framework. First, for each domain, a teacher model (a domain-dependent model) is trained by fine-tuning a multi-condition model on the domain-specific subset of the data. Then all of these teacher models are used to teach a single student model simultaneously. We perform experiments on two predefined domain setups: one consists of domains with different speaking styles; the other consists of near-field, far-field, and far-field with noise. Moreover, two types of models are examined: the deep feedforward sequential memory network (DFSMN) and long short-term memory (LSTM). Experimental results show that the model trained with this framework outperforms not only the multi-condition model but also the domain-dependent models. Specifically, our training method provides up to a 10.4% relative character error rate improvement over the baseline (multi-condition) model.
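The two-step procedure described above can be illustrated with a minimal sketch. The abstract does not spell out the loss function or how examples are assigned to teachers, so the following assumes a common teacher-student setup: the student is trained to match the teacher's per-frame posterior via cross-entropy, and each training example is taught by the teacher of its own domain. All function and parameter names here (`ts_frame_loss`, `multi_teacher_loss`, `relative_improvement`, `teachers`, `student`) are hypothetical and chosen for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ts_frame_loss(student_logits, teacher_posterior):
    """Cross-entropy between a teacher posterior (soft label) and the
    student posterior for one frame. Assumed loss, not confirmed by the source."""
    student_posterior = softmax(student_logits)
    return -sum(
        p * math.log(q)
        for p, q in zip(teacher_posterior, student_posterior)
        if p > 0.0
    )


def multi_teacher_loss(batch, teachers, student):
    """Average teacher-student loss over a mixed-domain batch.

    batch:    list of (features, domain) pairs
    teachers: dict mapping domain name -> callable returning a posterior
    student:  callable returning logits
    Assumption: each example is taught by the teacher of its own domain.
    """
    total = 0.0
    for features, domain in batch:
        teacher_posterior = teachers[domain](features)
        total += ts_frame_loss(student(features), teacher_posterior)
    return total / len(batch)


def relative_improvement(baseline_cer, new_cer):
    """Relative character error rate improvement, in percent."""
    return 100.0 * (baseline_cer - new_cer) / baseline_cer
```

In this sketch the teachers are frozen and only supply soft targets; in practice the student's logits would come from a DFSMN or LSTM acoustic model and the loss would be minimized with gradient descent. The `relative_improvement` helper shows how a figure such as the reported 10.4% is computed from a baseline and a new character error rate.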