CL LG SD AS | Nov 1, 2021

Sequence Transduction with Graph-based Supervision

Niko Moritz, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux

arXiv:2111.01272v2 | 0.77 citations

Originality: Incremental advance

AI Analysis

This work addresses a specific limitation of production ASR systems, the fixed alignment rules built into the RNN-T loss, by making training lattices flexible and efficient to manipulate. It is incremental in that it builds on existing transducer methods.

The authors address the open question of whether the alignment rules in the recurrent neural network transducer (RNN-T) objective are optimal for automatic speech recognition. They introduce a new transducer objective that generalizes RNN-T to accept graph-based supervision, and report that a CTC-like transducer improves on an equivalent RNN-T system by 4.8% on the LibriSpeech test-other condition.

The recurrent neural network transducer (RNN-T) objective plays a major role in building today's best automatic speech recognition (ASR) systems for production. Similarly to the connectionist temporal classification (CTC) objective, the RNN-T loss uses specific rules that define how a set of alignments is generated to form a lattice for the full-sum training. However, it is still largely unknown whether these rules are optimal and lead to the best possible ASR results. In this work, we present a new transducer objective function that generalizes the RNN-T loss to accept a graph representation of the labels, thus providing a flexible and efficient framework to manipulate training lattices, e.g., for studying different transition rules, implementing different transducer losses, or restricting alignments. We demonstrate that transducer-based ASR with a CTC-like lattice achieves better results compared to standard RNN-T, while also ensuring a strictly monotonic alignment, which will allow better optimization of the decoding procedure. For example, the proposed CTC-like transducer achieves an improvement of 4.8% on the test-other condition of LibriSpeech relative to an equivalent RNN-T based system.
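To make the idea of "transition rules that define a lattice" concrete, here is a minimal NumPy sketch of full-sum training scores under two different rule sets. It is an illustration under stated assumptions, not the authors' graph-based implementation: the function names, the `(T, U+1, V)` layout of the joint-network log-probabilities, and `blank=0` are all hypothetical choices. The first function uses the standard RNN-T rules (a label keeps the frame, a blank advances it). The second uses a simplified strictly monotonic rule in which every frame emits exactly one symbol; the paper's CTC-like lattice may differ in detail (for instance in how label repetitions are handled).

```python
import numpy as np


def rnnt_log_likelihood(log_probs, labels, blank=0):
    """Full-sum log-likelihood over the standard RNN-T lattice.

    log_probs: array of shape (T, U+1, V) with log-softmax outputs
               (assumed layout: frame t, label position u, symbol v).
    labels:    sequence of U label ids.
    """
    T, U1, _ = log_probs.shape
    U = len(labels)
    assert U1 == U + 1, "expected one lattice row per label prefix"
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Blank: advance one frame, keep the label position.
            stay = alpha[t - 1, u] + log_probs[t - 1, u, blank] if t > 0 else -np.inf
            # Label: advance the label position, stay on the same frame.
            emit = (alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]]
                    if u > 0 else -np.inf)
            alpha[t, u] = np.logaddexp(stay, emit)
    # A final blank leaves the last lattice node.
    return alpha[T - 1, U] + log_probs[T - 1, U, blank]


def monotonic_log_likelihood(log_probs, labels, blank=0):
    """Full-sum log-likelihood when every frame emits exactly one symbol.

    Simplified strictly monotonic rule (an assumption of this sketch):
    both blank and label transitions consume one frame, so T >= U is required.
    """
    T, U1, _ = log_probs.shape
    U = len(labels)
    assert U1 == U + 1, "expected one lattice row per label prefix"
    # alpha[t, u]: log-probability of having consumed t frames and emitted u labels.
    alpha = np.full((T + 1, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(1, T + 1):
        for u in range(U + 1):
            stay = alpha[t - 1, u] + log_probs[t - 1, u, blank]
            emit = (alpha[t - 1, u - 1] + log_probs[t - 1, u - 1, labels[u - 1]]
                    if u > 0 else -np.inf)
            alpha[t, u] = np.logaddexp(stay, emit)
    return alpha[T, U]
```

The two functions differ only in which predecessor nodes feed each lattice node. Per the abstract, the proposed objective takes the lattice as a graph representation of the labels, so rule changes like this one do not require a separate hand-written recursion. In training, the negative of either score would serve as the loss.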
