SD AI IR LG AS
Jan 21, 2017

Lyrics-to-Audio Alignment by Unsupervised Discovery of Repetitive Patterns in Vowel Acoustics

arXiv:1701.06078v24.310 citations

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of adapting speech models to individual singers for lyrics alignment. The contribution is incremental, since it builds on existing unsupervised methods.

The paper tackles lyrics-to-audio alignment by learning repetitive vowel patterns in the singing voice, and reports more promising results than state-of-the-art unsupervised approaches and an existing ASR-based system on Korean and English datasets.

Most previous approaches to lyrics-to-audio alignment used a pre-developed automatic speech recognition (ASR) system, which inherently had difficulty adapting the speech model to individual singers. A significant aspect missing from previous work is the self-learnability of repetitive vowel patterns in the singing voice, where the vowel part is more consistent than the consonant part. Based on this observation, our system first learns a discriminative subspace of vowel sequences using weighted symmetric non-negative matrix factorization (WS-NMF), taking the self-similarity of a standard acoustic feature as input. Then, we use canonical time warping (CTW), derived from a recent computer vision technique, to find an optimal spatiotemporal transformation between the text and the acoustic sequences. Experiments with Korean and English data sets showed that deploying this method after a pre-developed, unsupervised singing source separation achieved more promising results than other state-of-the-art unsupervised approaches and an existing ASR-based system.
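
The first stage described above (factorizing a self-similarity matrix of acoustic features into a non-negative subspace) can be illustrated with a minimal sketch. This is not the paper's WS-NMF: it uses plain, unweighted symmetric NMF with a standard multiplicative update, and the CTW alignment step is omitted. The function names, the cosine similarity measure, the rank, and the synthetic "frames" are all assumptions made for illustration.

```python
import numpy as np

def self_similarity(features):
    """Cosine self-similarity of frames; features has shape (n_frames, n_dims)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    # Clip so the matrix is non-negative, as NMF requires.
    return np.clip(unit @ unit.T, 0.0, None)

def symmetric_nmf(S, rank, n_iter=200, beta=0.5, seed=0):
    """Approximate S by H @ H.T with H >= 0 (unweighted symmetric NMF).

    Uses the damped multiplicative update
    H <- H * ((1 - beta) + beta * (S H) / (H H^T H)).
    """
    rng = np.random.default_rng(seed)
    H = rng.random((S.shape[0], rank))
    for _ in range(n_iter):
        numer = S @ H
        denom = H @ (H.T @ H) + 1e-12  # small constant avoids division by zero
        H *= (1.0 - beta) + beta * numer / denom
    return H

# Synthetic stand-in for acoustic features: three hypothetical "vowel"
# prototypes repeated in the same order, plus a little noise.
rng = np.random.default_rng(1)
prototypes = rng.random((3, 13))
labels = np.array([0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2])
frames = prototypes[labels] + 0.01 * rng.standard_normal((len(labels), 13))

S = self_similarity(frames)
H = symmetric_nmf(S, rank=3)
print("reconstruction error:", np.linalg.norm(S - H @ H.T))
```

Each row of `H` is a non-negative activation vector for one frame. In a pipeline like the one described, such activations would then be aligned to the vowel sequence of the lyrics; here they only show how repeated segments come to share similar rows.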
