SD CL AS Oct 27, 2022

SAN: a robust end-to-end ASR model architecture

arXiv:2210.15285v14.12 citations h-index: 3

Originality: Incremental advance

AI Analysis

This paper addresses the difficulty of recognizing unclear audio in ASR systems. It represents a strong, specific gain rather than a broad breakthrough.

The paper tackles fuzzy audio recognition in automatic speech recognition (ASR) by proposing a Siamese Adversarial Network (SAN) architecture. It reports a new state-of-the-art character error rate (CER) of 4.37 on the AISHELL-1 dataset without a language model, a relative CER reduction of around 5%.
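To make the two metrics in this summary concrete, here is a minimal sketch of how a character error rate and a relative reduction are typically computed. It assumes the conventional definition of CER (character-level edit distance divided by reference length, expressed in percent). The function names are illustrative and are not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent (assumed conventional definition)."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Relative reduction of `new` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline
```

A relative reduction compares the improvement to the baseline error rather than to 100%, which is why a small absolute change in CER can still amount to a relative reduction of several percent.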

In this paper, we propose a novel Siamese Adversarial Network (SAN) architecture for automatic speech recognition, which aims to address the difficulty of recognizing fuzzy audio. Specifically, SAN constructs two sub-networks to differentiate the audio feature input and then introduces a loss to unify the output distributions of these sub-networks. Adversarial learning enables the network to capture more essential acoustic features and helps the model achieve better performance when encountering fuzzy audio input. We conduct numerical experiments with the SAN model on several datasets for the automatic speech recognition task. All experimental results show that the Siamese adversarial nets significantly reduce the character error rate (CER). Specifically, we achieve a new state-of-the-art 4.37 CER without a language model on the AISHELL-1 dataset, which corresponds to around 5% relative CER reduction. To demonstrate the generality of the Siamese adversarial net, we also conduct experiments on the phoneme recognition task, which likewise shows the superiority of the Siamese adversarial network.
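The abstract does not say which loss is used to unify the output distributions of the two sub-networks. As a hedged illustration only, the sketch below assumes a symmetric KL divergence between the two softmax outputs, a common choice for this kind of consistency term. The function names and the choice of divergence are assumptions and do not come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(logits_a, logits_b):
    """Symmetric KL between the two sub-network outputs (assumed loss form)."""
    p = softmax(logits_a)
    q = softmax(logits_b)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Hypothetical per-frame logits from the two sub-networks for one input.
logits_a = [2.0, 0.5, -1.0]
logits_b = [1.5, 0.7, -0.8]
loss = consistency_loss(logits_a, logits_b)
```

In a training setup of this kind, such a term would usually be added with some weight to the ordinary recognition loss of each sub-network, so that the two branches are pushed toward agreeing on the same input. The exact weighting and combination used by SAN are not given here.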
