ASSD
Jul 29, 2020

Transformer based unsupervised pre-training for acoustic representation learning

arXiv:2007.14602v3
31 citations
AI Analysis

This work addresses data scarcity in acoustic tasks like speech emotion recognition and sound event detection, though it is incremental as it applies existing Transformer pre-training to acoustic domains.

The paper tackles the problem of limited labeled data for acoustic tasks by proposing an unsupervised pre-training method using a Transformer-based encoder to learn general acoustic representations, resulting in performance improvements such as a 4.3% absolute increase in UAR for speech emotion recognition and up to 12.2% relative improvement in BLEU scores for speech translation.

Recently, a variety of acoustic tasks and related applications have arisen. For many acoustic tasks, the amount of labeled data may be limited. To handle this problem, we propose an unsupervised pre-training method that uses a Transformer-based encoder to learn a general and robust high-level representation for all acoustic tasks. Experiments have been conducted on three kinds of acoustic tasks: speech emotion recognition, sound event detection and speech translation. All the experiments show that pre-training on each task's own training data can significantly improve performance. With a larger pre-training set combining the MuST-C, Librispeech and ESC-US datasets, the UAR for speech emotion recognition improves by a further 4.3% absolute on the IEMOCAP dataset. For sound event detection, the F1 score improves by a further 1.5% absolute on the DCASE2018 task5 development set and 2.1% absolute on the evaluation set. For speech translation, the BLEU score improves by a further 12.2% relative on the En-De dataset and 8.4% relative on the En-Fr dataset.
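The abstract reports some gains as absolute (UAR, F1) and others as relative (BLEU), so the numbers are not directly comparable. The following is a minimal sketch of how UAR (unweighted average recall) is commonly computed and how the two kinds of gain differ. The function names, toy labels and baseline scores are illustrative assumptions and are not taken from the paper.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally regardless of size."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def absolute_gain(baseline, new):
    """Difference in score, e.g. 0.600 -> 0.643 is a gain of 0.043 (4.3 points)."""
    return new - baseline


def relative_gain(baseline, new):
    """Difference as a fraction of the baseline, e.g. 20.0 -> 22.44 is 0.122 (12.2%)."""
    return (new - baseline) / baseline


# Toy emotion labels, invented for illustration only.
y_true = ["ang", "ang", "ang", "ang", "hap", "hap", "neu", "neu", "neu", "neu"]
y_pred = ["ang", "ang", "ang", "neu", "hap", "neu", "neu", "neu", "neu", "ang"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"UAR = {uar:.3f}, accuracy = {accuracy:.3f}")
# Hypothetical baselines, chosen only to reproduce the sizes of the reported gains.
print(f"absolute gain = {absolute_gain(0.600, 0.643):.3f}")
print(f"relative gain = {relative_gain(20.0, 22.44):.3f}")
```

In this toy case the minority class ("hap") pulls UAR below plain accuracy, which is why UAR is often preferred for imbalanced emotion datasets.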

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
