AS · LG · SD · Oct 8, 2021

SCaLa: Supervised Contrastive Learning for End-to-End Speech Recognition

arXiv:2110.04187v3 · 3 citations
Originality: Incremental advance
AI Analysis

This paper addresses recognition errors caused by similar-phoneme confusion in end-to-end speech recognition systems and represents an incremental improvement over existing approaches.

The paper tackles the problem of phoneme confusion in end-to-end speech recognition by proposing SCaLa, a supervised contrastive learning framework, which achieves absolute reductions of 2.8 and 1.4 points in Character Error Rate on reading and spontaneous speech datasets, respectively.
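To make the reported metric concrete, here is a minimal sketch of how Character Error Rate is conventionally computed: the character-level edit distance between reference and hypothesis, divided by the reference length. An absolute reduction is then a difference between two such percentages, not a ratio. The function names `edit_distance` and `cer` are illustrative and are not taken from the paper, whose evaluation tooling is not described here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate in percent; the reference must be non-empty."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

Under this definition, a system that moves from `cer(ref, hyp_a)` to `cer(ref, hyp_b)` achieves an absolute reduction of `cer(ref, hyp_a) - cer(ref, hyp_b)` points.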

End-to-end Automatic Speech Recognition (ASR) models are usually trained to optimize the loss of the whole token sequence, while neglecting explicit phonemic-granularity supervision. This could result in recognition errors due to similar-phoneme confusion or phoneme reduction. To alleviate this problem, we propose a novel framework based on Supervised Contrastive Learning (SCaLa) to enhance phonemic representation learning for end-to-end ASR systems. Specifically, we extend the self-supervised Masked Contrastive Predictive Coding (MCPC) to a fully-supervised setting, where the supervision is applied in the following way. First, SCaLa masks variable-length encoder features according to phoneme boundaries given phoneme forced-alignment extracted from a pre-trained acoustic model; it then predicts the masked features via contrastive learning. The forced-alignment can provide phoneme labels to mitigate the noise introduced by positive-negative pairs in self-supervised MCPC. Experiments on reading and spontaneous speech datasets show that our proposed approach achieves 2.8 and 1.4 points Character Error Rate (CER) absolute reductions compared to the baseline, respectively.
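The two steps the abstract describes (masking encoder features along phoneme boundaries, then predicting them contrastively with phoneme labels as supervision) can be sketched roughly as follows. This is a simplified NumPy illustration under stated assumptions, not the authors' implementation: the helper names, the zero mask value, the per-segment masking probability, and the exact form of the loss (a generic supervised contrastive objective over cosine similarities) are all assumptions made for the example.

```python
import numpy as np


def phoneme_segments(labels):
    """Group frame-level phoneme labels (e.g. from a forced alignment)
    into (start, end, phoneme) spans, with end exclusive."""
    segments = []
    start = 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((start, t, labels[start]))
            start = t
    return segments


def mask_segments(features, segments, mask_prob, rng, mask_value=0.0):
    """Mask whole phoneme segments, so mask lengths vary with phoneme
    duration. Returns a masked copy and a boolean per-frame mask."""
    masked = features.copy()
    frame_mask = np.zeros(len(features), dtype=bool)
    for start, end, _ in segments:
        if rng.random() < mask_prob:  # assumed: independent choice per segment
            masked[start:end] = mask_value
            frame_mask[start:end] = True
    return masked, frame_mask


def supervised_contrastive_loss(predictions, targets, labels, temperature=0.1):
    """For each predicted frame, targets with the same phoneme label are
    positives and all other targets are negatives (assumed loss form)."""
    eps = 1e-8
    p = predictions / np.maximum(
        np.linalg.norm(predictions, axis=1, keepdims=True), eps)
    z = targets / np.maximum(
        np.linalg.norm(targets, axis=1, keepdims=True), eps)
    logits = p @ z.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    labels = np.asarray(labels)
    positives = labels[:, None] == labels[None, :]  # diagonal is always True
    per_anchor = -(log_prob * positives).sum(axis=1) / positives.sum(axis=1)
    return float(per_anchor.mean())
```

In a training loop one would typically evaluate the loss only on frames where `frame_mask` is true, using the model's predictions at those positions and the unmasked features as targets. Because positives are defined by phoneme label rather than by position alone, two frames of the same phoneme are never treated as a negative pair, which is the noise source the abstract attributes to self-supervised MCPC.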

