Multi-task Recurrent Model for Speech and Speaker Recognition
This work addresses the challenge of integrating speech and speaker recognition for applications like voice assistants, but it appears incremental as it combines existing neural network approaches without a major paradigm shift.
The paper tackles the problem of jointly performing speech and speaker recognition, which are typically treated as separate tasks, by proposing a unified multi-task recurrent neural network model. The authors report that the joint model outperforms task-specific models on both tasks, though the abstract gives no concrete numbers.
Although highly correlated, speech and speaker recognition have been regarded as two independent tasks and studied by two communities. This is certainly not the way that people behave: we decipher both speech content and speaker traits at the same time. This paper presents a unified model that performs speech and speaker recognition simultaneously. The model is based on a unified neural network in which the output of one task is fed to the input of the other, leading to a multi-task recurrent network. Experiments show that the joint model outperforms the task-specific models on both tasks.
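
The cross-task feedback described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not taken from the paper: the plain tanh recurrent cell, the use of the other task's previous-frame posterior as the fed-back signal, the layer sizes, and the names `TaskBranch` and `run_joint`. It shows only the wiring (each branch receives the other branch's output as extra input), not training.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TaskBranch:
    """Hypothetical single-task branch: a plain tanh RNN with a cross-task input."""

    def __init__(self, n_in, n_hidden, n_out, n_other_out, rng):
        self.n_hidden = n_hidden
        self.n_out = n_out
        self.w_in = make_matrix(n_hidden, n_in, rng)            # acoustic frame -> hidden
        self.w_rec = make_matrix(n_hidden, n_hidden, rng)       # own previous hidden -> hidden
        self.w_cross = make_matrix(n_hidden, n_other_out, rng)  # other task's output -> hidden
        self.w_out = make_matrix(n_out, n_hidden, rng)          # hidden -> class scores

    def step(self, frame, h_prev, other_prev):
        """Advance one frame, conditioning on the other task's previous output."""
        pre = [a + b + c for a, b, c in zip(matvec(self.w_in, frame),
                                            matvec(self.w_rec, h_prev),
                                            matvec(self.w_cross, other_prev))]
        hidden = [math.tanh(p) for p in pre]
        return hidden, softmax(matvec(self.w_out, hidden))


def run_joint(frames, speech, speaker):
    """Run both branches over the frames, exchanging outputs at every step."""
    h_speech = [0.0] * speech.n_hidden
    h_speaker = [0.0] * speaker.n_hidden
    # Assumption: start from uniform posteriors before the first frame.
    y_speech = [1.0 / speech.n_out] * speech.n_out
    y_speaker = [1.0 / speaker.n_out] * speaker.n_out
    outputs = []
    for frame in frames:
        # Both updates read the *previous* outputs, so neither branch sees
        # the other's result for the current frame.
        new_h_speech, new_y_speech = speech.step(frame, h_speech, y_speaker)
        new_h_speaker, new_y_speaker = speaker.step(frame, h_speaker, y_speech)
        h_speech, y_speech = new_h_speech, new_y_speech
        h_speaker, y_speaker = new_h_speaker, new_y_speaker
        outputs.append((y_speech, y_speaker))
    return outputs


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative sizes only: 4-dim frames, 3 phone classes, 2 speakers.
    speech_branch = TaskBranch(4, 8, 3, 2, rng)
    speaker_branch = TaskBranch(4, 8, 2, 3, rng)
    frames = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(5)]
    for phone_post, speaker_post in run_joint(frames, speech_branch, speaker_branch):
        print(phone_post, speaker_post)
```

Feeding back previous-frame outputs is one simple way to keep the two branches coupled without a circular dependency inside a single time step; other choices of what to feed back, and where to inject it, are equally compatible with the description above.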