Decoding visemes: improving machine lipreading
This work addresses the challenge of speaker dependence in machine lipreading, offering an incremental improvement for applications that recognize speech from visual cues.
The paper tackles the problem of speaker-dependent variability in machine lipreading by proposing a new method for deriving speaker-dependent phoneme-to-viseme maps. It shows that the optimal viseme set size varies by speaker from 11 to 35 and that HD video is not necessary, and it reports a significant increase in classification accuracy.
Machine lipreading (MLR) is speech recognition from visual cues and a niche research problem in speech processing and computer vision. Current challenges fall into two groups: the content of the video, such as the rate of speech; or the parameters of the video recording, e.g., video resolution. We show that HD video is not needed to successfully lipread with a computer. The term "viseme" is used in machine lipreading to represent a visual cue or gesture which corresponds to a subgroup of phonemes where the phonemes are visually indistinguishable. A phoneme is the smallest sound one can utter. Because several phonemes correspond to each viseme, maps between the two units show a many-to-one relationship. Many maps have been presented; we compare these and our results show Lee's is best. We propose a new method of deriving speaker-dependent phoneme-to-viseme maps and compare these to Lee's. Our results show the sensitivity of phoneme clustering and we use our new knowledge to augment a conventional MLR system. It has been observed in MLR that classifiers need to be trained on the test subjects to achieve accuracy. Thus machine lipreading is highly speaker-dependent. Conversely, speaker independence is the robust classification of speakers not seen in training. We investigate the dependence of phoneme-to-viseme maps between speakers and show that there is not a high variability of visemes, but there is high variability in the trajectory between visemes of individual speakers with the same ground truth. This implies a dependency upon the number of visemes within each set for each individual. We show that prior phoneme-to-viseme maps rarely have enough visemes and that the optimal size, which varies by speaker, ranges from 11 to 35. Finally, we decode from visemes back to phonemes and into words. Our novel approach uses the optimum-range visemes within hierarchical training of phoneme classifiers and demonstrates a significant increase in classification accuracy.
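To make the many-to-one relationship concrete, the following is a minimal Python sketch of a phoneme-to-viseme map and its inverse. The viseme labels, the groupings, and the helper names `to_visemes` and `invert` are assumptions chosen for illustration only; this is neither Lee's map nor one of the speaker-dependent maps derived in this work.

```python
from collections import defaultdict

# Hypothetical, illustrative grouping of a few consonants by place of articulation.
# It is NOT Lee's map and NOT a speaker-dependent map from this work.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar",
}


def to_visemes(phonemes):
    """Translate a phoneme sequence into its viseme sequence (many-to-one)."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]


def invert(mapping):
    """Build the one-to-many viseme-to-phonemes table used when decoding back."""
    inverse = defaultdict(list)
    for phoneme, viseme in mapping.items():
        inverse[viseme].append(phoneme)
    return {viseme: sorted(group) for viseme, group in inverse.items()}


VISEME_TO_PHONEMES = invert(PHONEME_TO_VISEME)

if __name__ == "__main__":
    # Two different phoneme sequences collapse to the same viseme sequence.
    print(to_visemes(["p", "t"]))
    print(to_visemes(["b", "d"]))
    # Each viseme therefore decodes to a set of candidate phonemes.
    print(VISEME_TO_PHONEMES["V_bilabial"])
```

In this toy map, seven phonemes share three visemes, so distinct phoneme sequences can produce identical viseme sequences. Decoding from visemes back to phonemes must then choose among several candidates per viseme, which is why the number of visemes in a set affects how much ambiguity the later phoneme classifiers have to resolve.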