CV · SD · AS · IV · May 8, 2018

Phoneme-to-viseme mappings: the good, the bad, and the ugly

arXiv:1805.02934v1 · 66 citations
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of improving speech recognition accuracy for applications such as assistive technologies. It is an incremental advance, building on existing mapping methods.

The paper tackles the problem of ambiguous phoneme-to-viseme mappings in audio-visual speech recognition, showing that different mappings affect classifier performance and introducing a new algorithm that produces 'Bear' visemes, which outperform previous units.

Visemes are the visual equivalent of phonemes. Although not precisely defined, a working definition of a viseme is "a set of phonemes which have identical appearance on the lips". Therefore, a phoneme falls into one viseme class but a viseme may represent many phonemes: a many-to-one mapping. This mapping introduces ambiguity between phonemes when using viseme classifiers. Not only is this ambiguity damaging to the performance of audio-visual classifiers operating on real expressive speech, there is also considerable choice between possible mappings. In this paper we explore the issue of this choice of viseme-to-phoneme map. We show that there is a definite difference in performance between viseme-to-phoneme mappings and explore why some maps appear to work better than others. We also devise a new algorithm for constructing phoneme-to-viseme mappings from labeled speech data. These new visemes, 'Bear' visemes, are shown to perform better than previously known units.
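
The many-to-one mapping described in the abstract, and the ambiguity it introduces, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the phoneme symbols, the viseme class names, the groupings in `P2V`, and the toy `LEXICON` are hypothetical. They are not taken from the paper and are not the 'Bear' visemes or the algorithm that constructs them.

```python
from collections import defaultdict

# Hypothetical phoneme-to-viseme map (illustrative groupings only,
# not one of the mappings evaluated in the paper).
P2V = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar", "n": "alveolar",
    "ae": "open_vowel", "ah": "open_vowel",
}

def to_visemes(phonemes):
    """Map a phoneme sequence to its viseme sequence (many-to-one)."""
    return tuple(P2V[p] for p in phonemes)

def invert(p2v):
    """Build the one-to-many viseme-to-phoneme view of the same map."""
    v2p = defaultdict(set)
    for phoneme, viseme in p2v.items():
        v2p[viseme].add(phoneme)
    return dict(v2p)

def confusable_words(lexicon):
    """Group words whose viseme sequences are identical under P2V."""
    groups = defaultdict(list)
    for word, phonemes in lexicon.items():
        groups[to_visemes(phonemes)].append(word)
    return [sorted(words) for words in groups.values() if len(words) > 1]

# Toy lexicon with hypothetical pronunciations.
LEXICON = {
    "pat": ["p", "ae", "t"],
    "bat": ["b", "ae", "t"],
    "mad": ["m", "ae", "d"],
    "fan": ["f", "ae", "n"],
}

if __name__ == "__main__":
    print(invert(P2V)["bilabial"])    # phonemes sharing one viseme class
    print(confusable_words(LEXICON))  # words that look identical on the lips
```

Under this toy map, "pat", "bat", and "mad" collapse to the same viseme sequence, while "fan" stays distinct. A different choice of `P2V` would merge or separate different words, which is one way to see why the choice of mapping could affect classifier performance.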

