AS SD, Nov 7, 2020

ESPnet-SE: End-to-End Speech Enhancement and Separation Toolkit Designed for ASR Integration

Chenda Li, Jing Shi, Wangyou Zhang, Aswin Shanmugam Subramanian, Xuankai Chang, Naoyuki Kamo, Moto Hira, Tomoki Hayashi, Christoph Boeddeker, Zhuo Chen, Shinji Watanabe

arXiv:2011.03706v1 16.791 citations

Originality Synthesis-oriented

AI Analysis

This toolkit addresses the need for unified speech enhancement/separation development with ASR integration, though it appears incremental as an extension of existing ESPnet infrastructure.

The authors developed ESPnet-SE, an end-to-end toolkit for speech enhancement and separation designed to integrate with automatic speech recognition systems, providing all-in-one recipes for processing single- and multi-channel data across benchmark datasets.

We present ESPnet-SE, which is designed for the quick development of speech enhancement and speech separation systems in a single framework, along with an optional downstream speech recognition module. ESPnet-SE is a new project that integrates a rich set of models, resources, and systems related to automatic speech recognition to support and validate the proposed front-end implementation (i.e., speech enhancement and separation). It is capable of processing both single-channel and multi-channel data, with various functionalities including dereverberation, denoising, and source separation. We provide all-in-one recipes including data pre-processing, feature extraction, training, and evaluation pipelines for a wide range of benchmark datasets. This paper describes the design of the toolkit and several important functionalities, especially the speech recognition integration, which differentiates ESPnet-SE from other open-source toolkits, and reports experimental results on major benchmark datasets.
