AS, CL, LG, SD | Mar 23, 2023

A Deliberation-based Joint Acoustic and Text Decoder

arXiv:2303.15293v1 | 7 citations | h-index: 69
Originality: Incremental advance
AI Analysis

This work addresses ASR accuracy for rare words, which is important for on-device applications, but it is incremental as it builds on existing deliberation and JATD methods.

The paper aims to improve automatic speech recognition (ASR) accuracy, especially on rare words. It proposes a two-pass end-to-end model that combines the deliberation architecture with a joint acoustic and text decoder (JATD), and it reports relative word error rate (WER) reductions of 12% to 22.5% on test sets focused on rare words.

We propose a new two-pass E2E speech recognition model that improves ASR performance by training on a combination of paired data and unpaired text data. Previously, the joint acoustic and text decoder (JATD) has shown promising results through the use of text data during model training, and the recently introduced deliberation architecture has reduced recognition errors by leveraging first-pass decoding results. Our method, dubbed Deliberation-JATD, combines the spelling-correction abilities of deliberation with JATD's use of unpaired text data to further improve performance. The proposed model produces substantial gains across multiple test sets, especially those focused on rare words, where it reduces word error rate (WER) by between 12% and 22.5% relative. This is done without increasing model size or requiring multi-stage training, making Deliberation-JATD an efficient candidate for on-device applications.
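The abstract reports its gains as relative WER reductions, which are easy to confuse with absolute ones. The following is a minimal sketch of how WER and a relative reduction are commonly computed; it is not taken from the paper. The function names `word_error_rate` and `relative_wer_reduction` are hypothetical, and the baseline and new WER values in the usage lines are made-up numbers chosen only to illustrate the arithmetic.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Standard Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that the new system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative numbers only (not from the paper): a baseline WER of 10.0%
# and a new WER of 8.8% give a relative reduction of about 12%.
example_reduction = relative_wer_reduction(10.0, 8.8)
print(f"relative WER reduction: {example_reduction:.1%}")
```

Under this definition, a 12% relative reduction from a 10.0% baseline corresponds to an absolute drop of only 1.2 percentage points, which is why the two figures should not be compared directly.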
