RASR2: The RWTH ASR Toolkit for Generic Sequence-to-sequence Speech Recognition
RASR2 addresses a tooling gap for speech recognition researchers by enabling more flexible experimentation with sequence-to-sequence (S2S) models, though the contribution is incremental because it builds on existing ASR frameworks.
The authors address the lack of public ASR tools that support lexical-constrained decoding for a range of S2S models in closed-vocabulary scenarios. They present RASR2, a generic S2S decoder implemented in C++ that offers flexible and efficient decoding in both open- and closed-vocabulary settings, and they evaluate it on the Switchboard and Librispeech corpora.
Modern public ASR tools usually provide rich support for training various sequence-to-sequence (S2S) models, but only rather simple support for decoding, and only in open-vocabulary scenarios. For closed-vocabulary scenarios, public tools that support lexical-constrained decoding are usually limited to classical ASR or do not support all S2S models. To remove this restriction on research possibilities, such as the choice of modeling unit, we present RASR2, a research-oriented generic S2S decoder implemented in C++. It offers strong flexibility and compatibility across various S2S models, language models, label units/topologies and neural network architectures. It provides efficient decoding for both open- and closed-vocabulary scenarios based on a generalized search framework with rich support for different search modes and settings. We evaluate RASR2 with a wide range of experiments on both the Switchboard and Librispeech corpora. Our source code is publicly available online.
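To make the idea of lexical-constrained decoding in a closed-vocabulary scenario concrete, the following is a minimal sketch in Python. It is not RASR2's implementation or API. It assumes a toy setting in which the model's per-step label log-probabilities do not depend on the history, and all names (`build_prefix_tree`, `constrained_beam_search`, `WORD_END`) are hypothetical. The lexicon is compiled into a prefix tree, and a beam search extends each hypothesis only with labels that continue a valid word.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

WORD_END = "#word"  # hypothetical marker key: the node completes a lexicon word


def build_prefix_tree(lexicon: Dict[str, List[str]]) -> dict:
    """Build a prefix tree (trie) over the label sequences of all words."""
    root: dict = {}
    for word, labels in lexicon.items():
        node = root
        for label in labels:
            node = node.setdefault(label, {})
        node[WORD_END] = word
    return root


def constrained_beam_search(
    step_log_probs: Sequence[Dict[str, float]],
    lexicon: Dict[str, List[str]],
    beam_size: int = 4,
) -> Optional[Tuple[float, Tuple[str, ...]]]:
    """Return the best (score, words) whose labels spell only lexicon words.

    Assumption: step_log_probs[i] maps each label to its log-probability at
    step i, independent of the history (a stand-in for a real S2S model).
    """
    root = build_prefix_tree(lexicon)
    beam = [(0.0, (), root)]  # (score, emitted words, current tree node)
    for log_probs in step_log_probs:
        candidates = []
        for score, words, node in beam:
            for label, child in node.items():
                # Only labels that continue a lexicon word are expanded.
                if label == WORD_END or label not in log_probs:
                    continue
                new_score = score + log_probs[label]
                # Stay inside the word if longer words share this prefix.
                if any(key != WORD_END for key in child):
                    candidates.append((new_score, words, child))
                # Emit the word and return to the root if a word ends here.
                if WORD_END in child:
                    candidates.append((new_score, words + (child[WORD_END],), root))
        candidates.sort(key=lambda hyp: hyp[0], reverse=True)
        beam = candidates[:beam_size]  # beam pruning
    # Accept only hypotheses that end on a word boundary.
    complete = [(score, words) for score, words, node in beam if node is root]
    return max(complete, key=lambda hyp: hyp[0]) if complete else None


lexicon = {"cat": ["c", "a", "t"], "cab": ["c", "a", "b"]}
steps = [
    {"c": math.log(0.9), "d": math.log(0.1)},
    {"a": math.log(0.8), "o": math.log(0.2)},
    {"x": math.log(0.5), "t": math.log(0.3), "b": math.log(0.2)},
]
best = constrained_beam_search(steps, lexicon)
```

In this toy example the highest-scoring label at the last step is `x`, but no lexicon word ends in it, so the constrained search returns `cat` instead. An open-vocabulary decoder would simply drop the prefix-tree check and output the label sequence directly.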