CL · SD · AS · Mar 13, 2024

Improving Acoustic Word Embeddings through Correspondence Training of Self-supervised Speech Representations

arXiv:2403.08738v1 · 105 citations · h-index: 4 · EACL
Originality: Incremental advance
AI Analysis

The paper makes an incremental improvement to acoustic word embeddings (AWEs) for speech processing tasks, particularly in cross-lingual scenarios, by leveraging self-supervised learning (SSL) representations.

This paper tackles the problem of improving acoustic word embeddings (AWEs) by applying the Correspondence Auto-Encoder (CAE) method to self-supervised learning (SSL)-based speech representations like HuBERT, achieving the best word discrimination results across five languages and outperforming MFCC-based models in cross-lingual settings.

Acoustic word embeddings (AWEs) are vector representations of spoken words. An effective method for obtaining AWEs is the Correspondence Auto-Encoder (CAE). In the past, the CAE method has been associated with traditional MFCC features. Representations obtained from self-supervised learning (SSL)-based speech models such as HuBERT and Wav2vec2 outperform MFCC in many downstream tasks. However, they have not been well studied in the context of learning AWEs. This work explores the effectiveness of CAE with SSL-based speech representations to obtain improved AWEs. Additionally, the capabilities of SSL-based speech models are explored in cross-lingual scenarios for obtaining AWEs. Experiments are conducted on five languages: Polish, Portuguese, Spanish, French, and English. The HuBERT-based CAE model achieves the best results for word discrimination in all languages, despite HuBERT being pre-trained on English only. The HuBERT-based CAE model also works well in cross-lingual settings: when trained on one source language and tested on target languages, it outperforms MFCC-based CAE models trained on the target languages.
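
The correspondence idea behind CAE can be written compactly. The block below is a sketch of the objective as it is commonly formulated for encoder-decoder AWE models, not a formula taken from this page. The symbols X, X', z, f_t, and θ are notation assumed here for illustration.

```latex
% Assumed notation (not from the source):
%   X  = (x_1, ..., x_T)        frame-level features of one spoken instance of a word
%   X' = (x'_1, ..., x'_{T'})   features of a different instance of the same word
%   z                           fixed-dimensional embedding of X (the AWE)
%   f_t(z)                      decoder output at frame t, conditioned on z
\[
  z = \mathrm{Enc}_{\theta}(X), \qquad
  \mathcal{L}(X, X') = \sum_{t=1}^{T'} \left\lVert x'_t - f_t(z) \right\rVert^{2}
\]
```

Because the reconstruction target is a different utterance of the same word rather than the input itself, the embedding z is encouraged to retain word identity and discard speaker and channel variation. In the setting described above, the frames x_t would be MFCC vectors in earlier work and SSL model outputs (for example from HuBERT) in this paper.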

Code Implementations: 1 repo