SD | CL | AS | Jan 30, 2021

Speech Recognition by Simply Fine-tuning BERT

arXiv:2102.00291v1 | 32 citations
Originality: Synthesis-oriented
AI Analysis

This is an incremental contribution aimed at speech recognition researchers: it proposes a simpler alternative to training acoustic models from scratch.

The paper tackles automatic speech recognition by fine-tuning BERT, a language model, and shows that stacking a simple acoustic model on top yields reasonable performance on the AISHELL dataset.

We propose a simple method for automatic speech recognition (ASR) by fine-tuning BERT, which is a language model (LM) trained on large-scale unlabeled text data and can generate rich contextual representations. Our assumption is that, given a history context sequence, a powerful LM can narrow the range of possible choices, and the speech signal can be used as a simple clue. Hence, compared to conventional ASR systems that train a powerful acoustic model (AM) from scratch, we believe that speech recognition is possible by simply fine-tuning a BERT model. As an initial study, we demonstrate the effectiveness of the proposed idea on the AISHELL dataset and show that stacking a very simple AM on top of BERT can yield reasonable performance.
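The assumption stated in the abstract (the LM narrows the candidates, the speech signal serves as a clue) can be illustrated with a toy sketch. This is not the paper's architecture, which fine-tunes BERT itself; it only shows the intuition with a hand-written shortlist-then-rescore step. The function name `pick_next_token`, the `top_k` parameter, and all probabilities and scores below are hypothetical values chosen for illustration.

```python
import math


def pick_next_token(lm_probs, acoustic_scores, top_k=3):
    """Toy illustration: the LM shortlists candidates, the acoustic clue picks one.

    lm_probs: dict mapping token -> P(token | history). Hypothetical values.
    acoustic_scores: dict mapping token -> log-score for how well the current
        speech segment matches that token. Hypothetical values.
    """
    # Step 1: the language model narrows the range of possible choices.
    shortlist = sorted(lm_probs, key=lm_probs.get, reverse=True)[:top_k]

    # Step 2: the speech signal acts as a clue among the shortlisted tokens.
    def combined(token):
        return math.log(lm_probs[token]) + acoustic_scores.get(token, float("-inf"))

    return max(shortlist, key=combined)


# Made-up numbers: "zebra" matches the audio well but is implausible given
# the history, so the LM shortlist (top_k=3) never considers it.
lm_probs = {"weather": 0.40, "whether": 0.35, "feather": 0.20, "zebra": 0.05}
acoustic_scores = {"weather": -0.2, "whether": -0.3, "feather": -2.5, "zebra": -0.1}

best = pick_next_token(lm_probs, acoustic_scores)
print(best)
```

With a strong LM, the acoustic side only has to separate a handful of plausible candidates, which is why a very simple acoustic model may be enough under this assumption.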

