CL | SD | AS | May 23, 2023

Rethinking Speech Recognition with A Multimodal Perspective via Acoustic and Semantic Cooperative Decoding

arXiv:2305.14049v1 | 1 citation
Originality: Incremental advance
AI Analysis

This work aims to generate more accurate semantic states in ASR, which matters for applications that require high transcription quality. It appears incremental, since it builds on existing attention-based encoder-decoder models.

The paper proposes an Acoustic and Semantic Cooperative Decoder (ASCD) for automatic speech recognition (ASR). Unlike existing decoders, which process acoustic and semantic features in separate stages, ASCD integrates the two simultaneously. The authors report significant performance improvements on the AISHELL-1 and aidatatang_200zh datasets.

Attention-based encoder-decoder (AED) models have shown impressive performance in ASR. However, most existing AED methods neglect to simultaneously leverage both acoustic and semantic features in the decoder, which is crucial for generating more accurate and informative semantic states. In this paper, we propose an Acoustic and Semantic Cooperative Decoder (ASCD) for ASR. In particular, unlike vanilla decoders that process acoustic and semantic features in two separate stages, ASCD integrates them cooperatively. To prevent information leakage during training, we design a Causal Multimodal Mask. Moreover, a variant, Semi-ASCD, is proposed to balance accuracy and computational cost. Our proposal is evaluated on the publicly available AISHELL-1 and aidatatang_200zh datasets using Transformer, Conformer, and Branchformer as encoders, respectively. The experimental results show that ASCD significantly improves performance by leveraging both acoustic and semantic information cooperatively.
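
The abstract does not spell out how the Causal Multimodal Mask is laid out, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the decoder attends over a single joint sequence of acoustic frames followed by text tokens, that every position may see all acoustic frames, that a token may see only itself and earlier tokens, and that acoustic frames never see tokens. The function name `causal_multimodal_mask` and its parameters are hypothetical.

```python
def causal_multimodal_mask(num_frames, num_tokens):
    """Build a (T+U) x (T+U) boolean attention mask; True means "may attend".

    Assumed layout (not confirmed by the abstract): the joint sequence is
    T acoustic frames followed by U text tokens.
    """
    size = num_frames + num_tokens
    mask = [[False] * size for _ in range(size)]
    for q in range(size):          # query position
        for k in range(size):      # key position
            if k < num_frames:
                # Acoustic keys are fully observed, so every query may see them.
                mask[q][k] = True
            elif q >= num_frames:
                # Token query attending to a token key: causal, no future tokens.
                mask[q][k] = k <= q
            # Acoustic query attending to a token key stays False, so future
            # text cannot leak back through the acoustic positions.
    return mask


if __name__ == "__main__":
    # Print the mask for 3 frames and 2 tokens; 1 = allowed, 0 = blocked.
    for row in causal_multimodal_mask(3, 2):
        print("".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions, each token row opens up one more token column than the row before it, which is the usual causal pattern restricted to the text half of the joint sequence.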

