CV, Sep 7, 2016

A three-dimensional approach to Visual Speech Recognition using Discrete Cosine Transforms

arXiv:1609.01932v12.14 citations

Originality: Incremental advance

AI Analysis

The paper addresses the problem of improving accuracy in visual speech recognition for applications such as assistive technology, though the advance is incremental and the gains are modest.

This paper tackles visual speech recognition by proposing a 3D Discrete Cosine Transform for spatio-temporal feature extraction, combined with SVMs and a custom HMM, achieving 20% accuracy for phoneme sequences and 39% for viseme sequences, improving prior results by about 2%.

Visual speech recognition aims to identify the sequence of phonemes from continuous speech. Unlike the traditional approach of using 2D image feature extraction methods to derive features of each video frame separately, this paper proposes a new approach that uses a 3D (spatio-temporal) Discrete Cosine Transform to extract features from each feasible sub-sequence of an input video. These features are classified individually using Support Vector Machines, and the classifications are combined by a tailor-made Hidden Markov Model to find the most likely phoneme sequence. The algorithm is trained and tested on the VidTimit database to recognise sequences of phonemes as well as visemes (visual speech units). Furthermore, the system is extended by training on phoneme or viseme pairs (biphones) to counteract the ambiguity that co-articulation introduces into human speech. The test set accuracy for the recognition of phoneme sequences is 20%, and the accuracy for viseme sequences is 39%. Both results improve the best values reported in other papers by approximately 2%. The contribution is threefold. First, this paper is the first to show that 3D feature extraction methods can be applied to continuous sequence recognition tasks despite the unknown start positions and durations of each phoneme. Second, the result confirms that 3D feature extraction methods improve accuracy compared to 2D feature extraction methods. Third, the paper is the first to specifically compare an otherwise identical method with and without biphones, verifying that the use of biphones has a positive impact on the result.
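
The spatio-temporal feature extraction described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names `dct3_features` and `candidate_subsequences`, the choice to keep a low-frequency cube of `keep` coefficients per axis, the window lengths, and the random stand-in for a mouth-region video are all assumptions made for illustration. The only library call used is `scipy.fft.dctn`, which computes a multidimensional DCT.

```python
import numpy as np
from scipy.fft import dctn


def dct3_features(clip: np.ndarray, keep: int = 4) -> np.ndarray:
    """Return low-frequency 3D DCT coefficients of a clip as a flat vector.

    clip has shape (frames, height, width). Keeping the first `keep`
    coefficients along each axis is an assumed selection rule, not
    necessarily the one used in the paper.
    """
    # Type-II DCT over all three axes (time, rows, columns) at once.
    coeffs = dctn(clip.astype(np.float64), type=2, norm="ortho")
    return coeffs[:keep, :keep, :keep].ravel()


def candidate_subsequences(n_frames: int, min_len: int, max_len: int):
    """Yield (start, end) frame ranges for every window that fits the video.

    Phoneme boundaries are unknown, so every start position and every
    duration between min_len and max_len is treated as a candidate.
    """
    for start in range(n_frames):
        for length in range(min_len, max_len + 1):
            if start + length <= n_frames:
                yield start, start + length


# Hypothetical input: 12 frames of a 16x16 grayscale mouth region.
rng = np.random.default_rng(0)
video = rng.random((12, 16, 16))

# One fixed-length feature vector per candidate sub-sequence.
features = {
    (start, end): dct3_features(video[start:end])
    for start, end in candidate_subsequences(video.shape[0], 4, 6)
}
```

Each vector in `features` would then be scored by a classifier such as an SVM, and the per-window scores combined by a sequence model to pick the most likely phoneme or viseme sequence. Those stages are omitted here because the abstract does not specify their details.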
