CL · SD · AS · May 28, 2023

Investigating Pre-trained Audio Encoders in the Low-Resource Condition

arXiv:2305.17733v1 · 13 citations
Originality: Synthesis-oriented
AI Analysis

This work evaluates pre-trained encoders for speech tasks in low-resource conditions. The contribution is incremental: it builds on existing models without introducing new methods.

The paper investigates the performance of pre-trained audio encoders (Wav2vec2, WavLM, Whisper) in low-resource settings across 7 speech tasks, finding that Whisper shows the strongest low-resource capabilities on content-driven tasks, in both task performance and convergence speed.

Pre-trained speech encoders have been central to pushing state-of-the-art results across various speech understanding and generation tasks. Nonetheless, the capabilities of these encoders in low-resource settings are yet to be thoroughly explored. To address this, we conduct a comprehensive set of experiments using a representative set of 3 state-of-the-art encoders (Wav2vec2, WavLM, Whisper) in the low-resource setting across 7 speech understanding and generation tasks. We provide various quantitative and qualitative analyses on task performance, convergence speed, and representational properties of the encoders. We observe a connection between the pre-training protocols of these encoders and the way in which they capture information in their internal layers. In particular, we observe the Whisper encoder exhibits the greatest low-resource capabilities on content-driven tasks in terms of performance and convergence speed.
