CL · Sep 15, 2024

A Simple HMM with Self-Supervised Representations for Phone Segmentation

arXiv:2409.09646v2 · 1 citation · h-index: 6
AI Analysis

This work addresses phone segmentation for speech processing, but it is incremental as it builds on existing methods with a new formulation.

The paper tackles unsupervised phonetic segmentation. It shows that peak detection on Mel spectrograms outperforms many self-supervised methods, and it proposes a simple hidden Markov model that uses self-supervised representations and boundary features, achieving consistent improvements over previous approaches.

Despite recent advances in self-supervised representations, unsupervised phonetic segmentation remains challenging. Most approaches focus on improving phonetic representations with self-supervised learning, in the hope that the improvement transfers to phonetic segmentation. In this paper, contrary to recent approaches, we show that peak detection on Mel spectrograms is a strong baseline, better than many self-supervised approaches. Based on this finding, we propose a simple hidden Markov model that uses self-supervised representations and features at the boundaries for phone segmentation. Our results demonstrate consistent improvements over previous approaches, with a generalized formulation that allows versatile design adaptations.
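
To make the peak-detection baseline concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not specify the distance measure, smoothing, or threshold, so the Euclidean frame-to-frame distance, the strict local-maximum rule, and the names `frame_distances`, `detect_peaks`, and `segment_boundaries` are all assumptions made for illustration. The input is assumed to be a list of per-frame feature vectors, such as log-Mel frames computed elsewhere.

```python
import math


def frame_distances(frames):
    """Euclidean distance between each pair of consecutive frames.

    Assumption: Euclidean distance is only one possible change score.
    """
    return [math.dist(a, b) for a, b in zip(frames, frames[1:])]


def detect_peaks(scores, threshold=0.0):
    """Return indices where scores has a strict local maximum above threshold."""
    peaks = []
    for i in range(1, len(scores) - 1):
        is_local_max = scores[i] > scores[i - 1] and scores[i] > scores[i + 1]
        if is_local_max and scores[i] > threshold:
            peaks.append(i)
    return peaks


def segment_boundaries(frames, threshold=0.0):
    """Hypothesized boundary positions, as frame indices.

    A peak at distance index i sits between frame i and frame i + 1,
    so the boundary is reported at frame i + 1.
    """
    return [i + 1 for i in detect_peaks(frame_distances(frames), threshold)]


# Toy input (made up): three similar frames, then three frames of a different sound.
toy_frames = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.0],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.0],
]
boundaries = segment_boundaries(toy_frames)
print(boundaries)
```

On the toy input, the only large frame-to-frame change is between the third and fourth frames, so the sketch reports a single boundary at frame index 3. Real spectrograms would typically need smoothing and a tuned threshold, neither of which is covered here.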
