SD, AI, AS | Apr 1, 2024

Removing Speaker Information from Speech Representation using Variable-Length Soft Pooling

arXiv:2404.00856v1 | 1 citation | h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses the need for speaker-independent speech representations in synthesis, offering an incremental improvement over existing methods.

The paper tackles the problem of speaker information entanglement in self-supervised speech representations by proposing a variable-length soft pooling method based on predicted boundaries, achieving competitive performance in phonetic tasks while reducing speaker identification accuracy.

Recently, there have been efforts to encode the linguistic information of speech using a self-supervised framework for speech synthesis. However, predicting representations from surrounding representations can inadvertently entangle speaker information in the speech representation. This paper aims to remove speaker information by exploiting the structured nature of speech, which is composed of discrete units such as phonemes with clear boundaries. A neural network predicts these boundaries, enabling variable-length pooling for event-based representation extraction instead of fixed-rate methods. The boundary predictor outputs a boundary probability between 0 and 1, which makes the pooling soft. The model is trained to minimize the difference between its pooled representation and the pooled representation of the same data augmented by time-stretch and pitch-shift. To confirm that the learned representation contains content information but is independent of speaker information, the model was evaluated on Libri-light's phonetic ABX task and SUPERB's speaker identification task.
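
To make the idea of soft, variable-length pooling concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's formulation: the function name `soft_pool`, the triangular weighting kernel, and the convention that `boundary_probs[t]` is the probability of a boundary right after frame `t` are all hypothetical choices made for this example. Each frame gets a soft segment position (the running sum of earlier boundary probabilities), and each segment is a weighted average of the frames near its position.

```python
import numpy as np

def soft_pool(frames, boundary_probs, eps=1e-8):
    """Pool frame features into segments using soft boundary probabilities.

    Assumptions (illustrative, not taken from the paper):
      - frames: array of shape (T, D), one feature vector per frame.
      - boundary_probs[t]: probability in [0, 1] of a boundary right after frame t.
      - a triangular kernel turns soft positions into segment weights.
    """
    frames = np.asarray(frames, dtype=float)
    p = np.asarray(boundary_probs, dtype=float)

    # Soft segment position of each frame: sum of boundary probabilities
    # of all earlier frames (exclusive cumulative sum), so frame 0 sits at 0.
    seg_pos = np.cumsum(p) - p

    # Number of output segments needed to cover the last soft position.
    n_segments = int(np.ceil(seg_pos[-1])) + 1
    centers = np.arange(n_segments)[:, None]               # shape (K, 1)

    # Triangular weights: a frame at position 1.3 contributes 0.7 to
    # segment 1 and 0.3 to segment 2.
    weights = np.clip(1.0 - np.abs(centers - seg_pos[None, :]), 0.0, None)

    # Weighted average of the frames assigned to each segment.
    pooled = weights @ frames / (weights.sum(axis=1, keepdims=True) + eps)
    return pooled, weights

# Two "phonemes" of two frames each, with a hard boundary after frame 1.
original, _ = soft_pool([[1.0], [1.0], [5.0], [5.0]], [0.0, 1.0, 0.0, 0.0])

# A time-stretched copy (every frame repeated) with the boundary moved to match.
stretched, _ = soft_pool(
    [[1.0], [1.0], [1.0], [1.0], [5.0], [5.0], [5.0], [5.0]],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
)

# A simple consistency measure between the two pooled sequences.
consistency_loss = float(np.mean((original - stretched) ** 2))
```

With hard 0/1 probabilities the sketch reduces to plain averaging within each segment, so the original and time-stretched inputs pool to the same two vectors and the consistency measure is close to zero. With fractional probabilities, frames are shared between neighbouring segments, which is what keeps the pooling differentiable with respect to the boundary predictor's output.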
