AS LG SD · Oct 31, 2021

Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model

Martin Kocour, Kateřina Žmolíková, Lucas Ondel, Ján Švec, Marc Delcroix, Tsubasa Ochiai, Lukáš Burget, Jan Černocký

arXiv:2111.00009v2 · 1.2 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses a specific bottleneck in multi-talker speech recognition systems, offering an incremental improvement by revisiting and updating older factorial generative models with DNNs.

The authors tackled the sub-optimality of separate decoding in multi-talker speech recognition by proposing a joint decoding approach that predicts joint state posteriors for all speakers, enabling uncertainty attribution and leveraging higher-level language information. They demonstrated advantages in proof-of-concept experiments on a mixed-TIDIGITS dataset, though no concrete numbers were provided.

In typical multi-talker speech recognition systems, a neural network-based acoustic model predicts senone state posteriors for each speaker. These are later used by a single-talker decoder which is applied to each speaker-specific output stream separately. In this work, we argue that such a scheme is sub-optimal and propose a principled solution that decodes all speakers jointly. We modify the acoustic model to predict joint state posteriors for all speakers, enabling the network to express uncertainty about the attribution of parts of the speech signal to the speakers. We employ a joint decoder that can make use of this uncertainty together with higher-level language information. For this, we revisit decoding algorithms used in factorial generative models in early multi-talker speech recognition systems. In contrast with these early works, we replace the GMM acoustic model with a DNN, which provides greater modeling power and simplifies part of the inference. We demonstrate the advantage of joint decoding in proof-of-concept experiments on a mixed-TIDIGITS dataset.
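To make the idea of decoding over joint state posteriors concrete, here is a minimal sketch of a two-speaker Viterbi search over the product state space. It is an illustration under stated assumptions, not the authors' implementation: the function name `joint_viterbi`, the toy posteriors, and the transition matrices are hypothetical; joint posteriors are used directly as emission scores (no division by state priors); each speaker's state chain is assumed to evolve independently; and no language model or decoding graph is included.

```python
import numpy as np

def joint_viterbi(log_post, log_A1, log_A2):
    """Viterbi over the product state space of two speakers (illustrative only).

    log_post: (T, S1, S2) joint log-posteriors log P(s1, s2 | x_t)
    log_A1:   (S1, S1) log transition matrix for speaker 1 (assumed)
    log_A2:   (S2, S2) log transition matrix for speaker 2 (assumed)
    Returns two integer arrays of length T: the best state path per speaker.
    """
    T, S1, S2 = log_post.shape
    # Assumption: the speakers' chains are independent, so the joint
    # transition score from (i, j) to (k, l) is log_A1[i, k] + log_A2[j, l].
    log_A = (log_A1[:, None, :, None] + log_A2[None, :, None, :]).reshape(S1 * S2, S1 * S2)
    emit = log_post.reshape(T, S1 * S2)

    delta = emit[0].copy()                      # uniform initial distribution
    back = np.zeros((T, S1 * S2), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A         # rows: previous state, cols: next state
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + emit[t]

    # Backtrack the best joint path, then split it into per-speaker paths.
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return np.unravel_index(path, (S1, S2))

# Toy input: frame 0 clearly assigns state 0 to speaker 1 and state 1 to speaker 2.
# Frames 1 and 2 only say the speakers are in different states, not who is in which.
clear = np.array([[0.02, 0.94], [0.02, 0.02]])
ambiguous = np.array([[0.02, 0.48], [0.48, 0.02]])
log_post = np.log(np.stack([clear, ambiguous, ambiguous]))
log_A = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))  # "sticky" transitions

s1, s2 = joint_viterbi(log_post, log_A, log_A)
print(s1.tolist(), s2.tolist())
```

In this toy case the ambiguous frames carry no information about which speaker is in which state, and the search resolves the attribution from the first frame and the transition scores. The product state space grows as S1 × S2, so a practical decoder would need pruning or a more compact search structure.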
