Prediction of head motion from speech waveforms with a canonical-correlation-constrained autoencoder
This work addresses the problem of generating realistic head motion for applications such as virtual avatars, although the contribution is incremental in that it improves on existing feature-based methods.
The study predicts head motion directly from speech waveforms for speech-driven head-motion synthesis. It shows that using waveforms directly is more effective than combining spectral features such as MFCC with additional features such as energy and F0, and that the proposed canonical-correlation-constrained autoencoder achieves comparable objective performance and better subjective evaluation results than an MFCC-based system.
This study investigates the direct use of speech waveforms to predict head motion for speech-driven head-motion synthesis, whereas the literature commonly uses spectral features such as MFCC as basic input features, together with additional features such as energy and F0. We show that, rather than combining different features that originate from waveforms, it is more effective to use waveforms directly to predict the corresponding head motion. The challenge with the waveform-based approach is that waveforms contain a large amount of information irrelevant to predicting head motion, which hinders the training of neural networks. To overcome this problem, we propose a canonical-correlation-constrained autoencoder (CCCAE), where hidden layers are trained not only to minimise the error but also to maximise the canonical correlation with head motion. Compared with an MFCC-based system, the proposed system shows comparable performance in objective evaluation and better performance in subjective evaluation.
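
The CCCAE training criterion described above can be illustrated with a minimal NumPy sketch. It is one plausible form of such an objective and not necessarily the exact formulation used in the study: the function names `canonical_correlations` and `cccae_loss`, the weight `alpha`, the regulariser `eps`, and the choice to sum all canonical correlations are assumptions made here for illustration. The sketch only evaluates the loss on given arrays; it does not define the network or its training procedure.

```python
import numpy as np


def canonical_correlations(h, y, eps=1e-6):
    """Canonical correlations between hidden activations h (n x p)
    and head-motion targets y (n x q), largest first."""
    n = h.shape[0]
    hc = h - h.mean(axis=0)
    yc = y - y.mean(axis=0)
    # Sample covariances, lightly regularised so they stay positive definite.
    s_hh = hc.T @ hc / (n - 1) + eps * np.eye(h.shape[1])
    s_yy = yc.T @ yc / (n - 1) + eps * np.eye(y.shape[1])
    s_hy = hc.T @ yc / (n - 1)
    # Whiten both sides with Cholesky factors: T = L_h^{-1} S_hy L_y^{-T}.
    l_h = np.linalg.cholesky(s_hh)
    l_y = np.linalg.cholesky(s_yy)
    t = np.linalg.solve(l_h, s_hy)
    t = np.linalg.solve(l_y, t.T).T
    # The singular values of T are the canonical correlations.
    return np.linalg.svd(t, compute_uv=False)


def cccae_loss(x, x_hat, h, y, alpha=1.0):
    """Reconstruction error minus a weighted canonical-correlation term.

    alpha is a hypothetical trade-off weight, not a value from the study.
    """
    recon = np.mean((x - x_hat) ** 2)
    corr = canonical_correlations(h, y).sum()
    return recon - alpha * corr
```

Minimising this quantity lowers the reconstruction error while pushing the hidden representation towards directions that correlate with head motion, which is the intuition behind discarding waveform information that is irrelevant to the prediction task. In an actual training setup the same computation would need to be written in a framework with automatic differentiation.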