ESPnet: End-to-End Speech Processing Toolkit
ESPnet provides a new software platform for researchers and practitioners in speech processing, but the contribution is incremental because it builds on existing toolkits such as Kaldi.
The paper introduces ESPnet, an open-source toolkit for end-to-end speech processing, focusing on automatic speech recognition (ASR) and integrating Chainer and PyTorch with Kaldi-style data handling, and reports experimental results on major ASR benchmarks.
This paper introduces a new open-source platform for end-to-end speech processing named ESPnet. ESPnet mainly focuses on end-to-end automatic speech recognition (ASR) and adopts two widely used dynamic neural network toolkits, Chainer and PyTorch, as its main deep learning engines. ESPnet also follows the style of the Kaldi ASR toolkit for data processing, feature extraction/format, and recipes, providing a complete setup for speech recognition and other speech processing experiments. This paper describes the main architecture of the software platform, several important functionalities that differentiate ESPnet from other open-source ASR toolkits, and experimental results on major ASR benchmarks.