SD, CL, AS | May 18, 2023

FunASR: A Fundamental End-to-End Speech Recognition Toolkit

arXiv:2305.11013v1 | 140 citations | Has Code
Originality: Synthesis-oriented
AI Analysis

This toolkit addresses the problem of deploying high-precision speech recognition in industrial settings. Its contribution is incremental, since it builds on existing non-autoregressive methods.

The paper introduces FunASR, an open-source speech recognition toolkit that narrows the gap between academic research and industrial applications by providing models trained on large-scale industrial corpora, including a 60,000-hour Mandarin dataset. Its flagship model, Paraformer, outperforms models trained on open datasets.

This paper introduces FunASR, an open-source speech recognition toolkit designed to bridge the gap between academic research and industrial applications. FunASR offers models trained on large-scale industrial corpora and the ability to deploy them in applications. The toolkit's flagship model, Paraformer, is a non-autoregressive end-to-end speech recognition model that has been trained on a manually annotated Mandarin speech recognition dataset that contains 60,000 hours of speech. To improve the performance of Paraformer, we have added timestamp prediction and hotword customization capabilities to the standard Paraformer backbone. In addition, to facilitate model deployment, we have open-sourced a voice activity detection model based on the Feedforward Sequential Memory Network (FSMN-VAD) and a text post-processing punctuation model based on the controllable time-delay Transformer (CT-Transformer), both of which were trained on industrial corpora. These functional modules provide a solid foundation for building high-precision long audio speech recognition services. Compared to other models trained on open datasets, Paraformer demonstrates superior performance.
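The long-audio pipeline the abstract describes (voice activity detection, then recognition of each speech segment, then punctuation of the joined text) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not FunASR's API: `Segment`, `energy_vad`, and `transcribe_long_audio` are hypothetical names, the energy threshold is a toy stand-in for FSMN-VAD, and the `asr` and `punctuate` callables stand in for Paraformer and CT-Transformer respectively.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Segment:
    start: int  # inclusive sample index
    end: int    # exclusive sample index


def energy_vad(samples: Sequence[float], frame: int = 4,
               threshold: float = 0.1) -> List[Segment]:
    """Toy stand-in for a VAD model (NOT FSMN-VAD).

    A frame counts as voiced when its mean absolute amplitude exceeds
    `threshold`; consecutive voiced frames are merged into one segment.
    """
    segments: List[Segment] = []
    current_start = None
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        voiced = sum(abs(x) for x in chunk) / len(chunk) > threshold
        if voiced and current_start is None:
            current_start = i  # a speech segment begins
        elif not voiced and current_start is not None:
            segments.append(Segment(current_start, i))  # segment ends
            current_start = None
    if current_start is not None:
        # Audio ended while still inside a speech segment.
        segments.append(Segment(current_start, len(samples)))
    return segments


def transcribe_long_audio(
    samples: Sequence[float],
    vad: Callable[[Sequence[float]], List[Segment]],
    asr: Callable[[Sequence[float]], str],
    punctuate: Callable[[str], str],
) -> Tuple[str, List[Tuple[Segment, str]]]:
    """Run VAD, recognize each speech segment, then punctuate the result.

    `asr` and `punctuate` are hypothetical callables standing in for an
    acoustic model and a punctuation model.
    """
    per_segment = [(seg, asr(samples[seg.start:seg.end]))
                   for seg in vad(samples)]
    raw_text = " ".join(text for _, text in per_segment)
    return punctuate(raw_text), per_segment
```

Splitting on silence first keeps each recognizer input short, and punctuation is applied afterwards because the recognizer output here is assumed to be unpunctuated. Any real models can be passed in for the three callables, provided they match the signatures above.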

Code Implementations: 1 repo
