AS · SD · Jan 14, 2021

Fast offline Transformer-based end-to-end automatic speech recognition for real-world applications

arXiv:2101.05600v3
Originality: Incremental advance
AI Analysis

This work addresses the need for fast and accurate offline speech recognition in practical settings like meetings, though it is incremental as it builds on existing Transformer-based methods.

The paper tackles the problem of efficiently converting large amounts of speech to text in real-world applications. Its system transcribes 8 hours of meeting speech in under 3 minutes at a 10.73% character error rate, a 27.1% relative reduction compared with conventional systems.
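To put these headline numbers in more familiar terms, the short sketch below derives the implied real-time factor and the implied baseline error rate. The function names are illustrative, and the baseline figure is not stated in the source: it is inferred under the assumption that "relative reduction" means (baseline - new) / baseline.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration (lower is faster)."""
    return processing_seconds / audio_seconds


def implied_baseline_cer(cer: float, relative_reduction: float) -> float:
    """Baseline CER implied by a reported CER and its relative reduction.

    Assumes relative_reduction = (baseline - cer) / baseline.
    """
    return cer / (1.0 - relative_reduction)


audio_s = 8 * 3600  # 8 hours of speech
proc_s = 3 * 60     # "under 3 minutes", so this is an upper bound

rtf_bound = real_time_factor(proc_s, audio_s)  # upper bound on the real-time factor
speedup_bound = audio_s / proc_s               # lower bound on the speed-up
baseline_cer = implied_baseline_cer(10.73, 0.271)

print(f"real-time factor below {rtf_bound:.5f}")
print(f"more than {speedup_bound:.0f}x faster than real time")
print(f"implied baseline CER of about {baseline_cer:.2f}%")
```

Under these assumptions the system runs more than 160 times faster than real time, and the conventional systems it is compared against would sit at roughly 14.7% character error rate.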

With recent advances in technology, automatic speech recognition (ASR) is now widely used in real-world applications, and converting large amounts of speech into text accurately with limited resources has become more important than ever. This paper proposes a method to rapidly recognize a large speech database with a Transformer-based end-to-end model. Transformers have improved the state of the art in many fields, but they are not easy to apply to long sequences. The paper proposes and tests several techniques to speed up the recognition of real-world speech: decoding with multiple-utterance batched beam search, detecting end-of-speech based on connectionist temporal classification (CTC), restricting the CTC prefix score, and splitting long speech into short segments. Experiments on the LibriSpeech English and real-world Korean ASR tasks verify the proposed methods. In these experiments, the proposed system converts 8 hours of speech recorded at real-world meetings into text in less than 3 minutes with a 10.73% character error rate, a 27.1% relative reduction compared with conventional systems.
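As a rough illustration of the last technique listed, splitting long speech into short segments, here is a minimal sketch. It is not the paper's algorithm, which the abstract does not specify. It assumes that a per-frame CTC blank probability is available and that a high blank probability marks a safe place to cut. The function name `split_on_blanks`, the threshold, and the length limit are all hypothetical.

```python
from typing import List, Tuple


def split_on_blanks(
    blank_probs: List[float], max_len: int, threshold: float = 0.9
) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges, each at most max_len frames long.

    Each cut is placed right after the latest blank-dominated frame inside
    the length limit; if there is none, the cut is forced at the limit.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    segments: List[Tuple[int, int]] = []
    start, n = 0, len(blank_probs)
    while n - start > max_len:
        hard_limit = start + max_len
        cut = hard_limit  # fallback: forced cut at the length limit
        # Scan backwards for the latest frame that looks like a pause.
        for i in range(hard_limit, start, -1):
            if blank_probs[i - 1] >= threshold:
                cut = i
                break
        segments.append((start, cut))
        start = cut
    if start < n:
        segments.append((start, n))
    return segments


# Toy input: ten frames with likely pauses at frames 2 and 6.
probs = [0.1, 0.1, 0.95, 0.1, 0.1, 0.1, 0.97, 0.1, 0.1, 0.1]
print(split_on_blanks(probs, max_len=4))
```

Bounding the segment length keeps the cost of self-attention, which grows quadratically with sequence length, under control. Segments of similar length also pad less when batched together for decoding.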

