AS · CL · Jan 29, 2021

BCN2BRNO: ASR System Fusion for Albayzin 2020 Speech to Text Challenge

arXiv:2101.12729v1 · 7 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses speech recognition on noisy TV show audio for the Albayzin 2020 challenge. It is an incremental improvement that combines established techniques (SpecAugment, neural source separation, and system fusion) rather than proposing a new method.

The paper tackles the Albayzin 2020 Speech to Text Challenge with hybrid and end-to-end ASR systems. It explores SpecAugment in the hybrid models and evaluates the Demucs music separator for extracting speech from noisy audio. A fusion of the best systems achieves a 23.33% word error rate (WER).

This paper describes the joint effort of BUT and Telefónica Research on the development of Automatic Speech Recognition systems for the Albayzin 2020 Challenge. We compare approaches based on either hybrid or end-to-end models. In hybrid modelling, we explore the impact of a SpecAugment layer on performance. For end-to-end modelling, we used a convolutional neural network with gated linear units (GLUs). The performance of this model is also evaluated with an additional n-gram language model to improve word error rates. We further inspect source separation methods to extract speech from noisy environments (i.e. TV shows). More precisely, we assess the effect of using a neural-based music separator named Demucs. A fusion of our best systems achieved 23.33% WER in the official Albayzin 2020 evaluations. Aside from the techniques used in our final submitted systems, we also describe our efforts in retrieving high-quality transcripts for training.
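
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is not the challenge's official scoring tool. The function name `word_error_rate` and the example sentences are illustrative assumptions, and text normalization (casing, punctuation), which real evaluations apply, is left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration only; whitespace tokenization,
    no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("gato" -> "pato") and one deletion ("pescado")
    # against a four-word reference.
    print(word_error_rate("el gato come pescado", "el pato come"))
```

Under this definition, a WER of 23.33% means roughly 23 word-level errors per 100 reference words. Because insertions count as errors, the value can exceed 100% for very poor hypotheses.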
