SD LG | Jun 5, 2017

Deep Factorization for Speech Signal

Dong Wang, Lantian Li, Ying Shi, Yixiang Chen, Zhiyuan Tang

arXiv:1706.01777v29.17 citations

Originality: Incremental advance

AI Analysis

The factorization and reconstruction approach provides a novel tool for speech processing tasks, though it appears incremental because it builds on existing factorization ideas.

The paper addresses the problem of factorizing speech signals into independent factors. It shows that speaker traits can be identified at the frame level using a DNN, which motivates a cascade deep factorization (CDF) framework. In an experiment on an automatic emotion recognition (AER) task, the inferred factors allowed the original speech spectrum to be recovered with high accuracy.

Speech signals are a complex intermingling of various informative factors, and this blending of information makes decoding any individual factor extremely difficult. A natural idea is to factorize each speech frame into independent factors, though this turns out to be even more difficult than decoding each individual factor. A major obstacle is that the speaker trait, a major factor in speech signals, has been suspected to be a long-term distributional pattern and therefore not identifiable at the frame level. In this paper, we demonstrated that the speaker factor is also a short-time spectral pattern and can be largely identified from just a few frames using a simple deep neural network (DNN). This discovery motivated a cascade deep factorization (CDF) framework that infers speech factors sequentially, with previously inferred factors used as conditional variables when inferring other factors. Our experiment on an automatic emotion recognition (AER) task demonstrated that this approach can effectively factorize speech signals and that, using these factors, the original speech spectrum can be recovered with high accuracy. This factorization and reconstruction approach provides a novel tool for many speech processing tasks.
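The cascade idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the layer sizes, the two-factor ordering (speaker, then emotion), the helper names `mlp`, `forward`, and `cascade_factorize`, and the random untrained weights are all assumptions made only to show the data flow. Each later network receives the earlier inferred factor as a conditional input, and a final network reconstructs the spectrum from the factors.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(in_dim, hidden_dim, out_dim):
    """Randomly initialised (untrained) one-hidden-layer network; hypothetical helper."""
    return {
        "W1": rng.standard_normal((in_dim, hidden_dim)) * 0.1,
        "b1": np.zeros(hidden_dim),
        "W2": rng.standard_normal((hidden_dim, out_dim)) * 0.1,
        "b2": np.zeros(out_dim),
    }

def forward(params, x):
    """Apply the network to a batch of row vectors."""
    h = np.tanh(x @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

# Assumed sizes, chosen only for illustration.
N_FRAMES, SPEC_DIM = 20, 40   # 20 frames of 40-dimensional spectra
SPK_DIM, EMO_DIM = 8, 4       # sizes of the speaker and emotion factors

speaker_net = mlp(SPEC_DIM, 32, SPK_DIM)             # frame -> speaker factor
emotion_net = mlp(SPEC_DIM + SPK_DIM, 32, EMO_DIM)   # (frame, speaker) -> emotion factor
recon_net = mlp(SPK_DIM + EMO_DIM, 32, SPEC_DIM)     # factors -> spectrum

def cascade_factorize(frames):
    """Infer factors one after another, conditioning on those already inferred."""
    spk = forward(speaker_net, frames)
    # The emotion network sees the frame together with the inferred speaker factor.
    emo = forward(emotion_net, np.concatenate([frames, spk], axis=1))
    # Reconstruction uses only the inferred factors.
    recon = forward(recon_net, np.concatenate([spk, emo], axis=1))
    return spk, emo, recon

frames = rng.standard_normal((N_FRAMES, SPEC_DIM))  # stand-in for real spectral frames
spk, emo, recon = cascade_factorize(frames)
```

In a trained system each network would be fitted to its own objective (for example speaker labels, emotion labels, and a spectral reconstruction loss). The sketch only fixes the order of inference and the conditioning between stages.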
