Comparing heterogeneous visual gestures for measuring the diversity of visual speech signals
This work addresses the problem of defining and comparing visual speech signals for researchers in lipreading and speech recognition, but it is incremental as it builds on existing clustering methods without introducing a new paradigm.
The paper tackles the lack of a formal definition for visual lip gestures (visemes) by using a phoneme-clustering method to create new phoneme-to-viseme maps for individual and multiple speakers. It finds that speakers share the same repertoire of mouth gestures but differ in how they use them, with signed rank tests measuring the distance between speakers.
Visual lip gestures observed whilst lipreading have a few working definitions; the two most common are `the visual equivalent of a phoneme' and `phonemes which are indistinguishable on the lips'. To date there is no formal definition, in part because we have not yet established a two-way relationship or mapping between visemes and phonemes. Some evidence suggests that visual speech is highly dependent upon the speaker. So here, we use a phoneme-clustering method to form new phoneme-to-viseme maps for both individual and multiple speakers. We test these phoneme-to-viseme maps to examine how similarly speakers talk visually, and we use signed rank tests to measure the distance between individuals. We conclude that, broadly speaking, speakers have the same repertoire of mouth gestures; where they differ is in the use of the gestures.
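
The signed rank comparison between individuals can be illustrated with a short sketch. It assumes the Wilcoxon signed-rank test is the one meant, and it is not the authors' implementation: the function name `signed_rank_statistic`, the choice of paired per-fold scores as input, and the example numbers are all hypothetical. The code computes the two rank sums W+ and W- from paired scores, discarding zero differences and giving tied absolute differences their average rank.

```python
def signed_rank_statistic(scores_a, scores_b):
    """Return (w_plus, w_minus), the Wilcoxon signed-rank sums for paired samples.

    scores_a and scores_b are hypothetical paired measurements, for example
    the number of correctly recognised words per test fold when speaker A's
    map and speaker B's map are each used on the same data.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")

    # Paired differences; pairs with zero difference are discarded.
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]

    # Rank the differences by absolute value, averaging ranks over ties.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # Sum the ranks separately for positive and negative differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical per-fold counts of correctly recognised words (illustrative only).
speaker_a = [10, 12, 15, 9, 20]
speaker_b = [8, 12, 18, 10, 15]
print(signed_rank_statistic(speaker_a, speaker_b))  # (6.0, 4.0)
```

The smaller of the two sums is the usual test statistic; a small value relative to n(n+1)/2, where n is the number of non-zero differences, suggests the two sets of scores differ systematically. Turning the statistic into a p-value requires the signed-rank null distribution, which this sketch leaves out.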