CV · CL · SD · AS · IV · May 8, 2018

Comparing phonemes and visemes with DNN-based lipreading

arXiv:1805.02924v1 · 30 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the debate over optimal units for lipreading systems, providing incremental insights for speech recognition researchers.

The study compared phoneme and viseme units for lipreading using a DNN-HMM system on the TCD-TIMIT corpus, finding that phoneme-based systems achieved higher word accuracy but lower unit-level accuracy than viseme-based systems.

There is debate over whether phoneme or viseme units are more effective for a lipreading system. Some studies use phoneme units even though phonemes describe unique short sounds; other studies have tried to improve lipreading accuracy by focusing on visemes, with varying results. We compare the performance of a lipreading system that models visual speech using either 13 viseme or 38 phoneme units. We report the accuracy of our system at both word and unit levels. The evaluation task is large-vocabulary continuous speech using the TCD-TIMIT corpus. We model visual speech with hybrid DNN-HMMs, and our visual speech decoder is a Weighted Finite-State Transducer (WFST). We use DCT and Eigenlips features to represent the mouth ROI image. The phoneme lipreading system outperforms the viseme-based system in word accuracy. However, the phoneme system achieves lower accuracy at the unit level, which shows the importance of the dictionary for decoding classification outputs into words.
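The point about the dictionary can be illustrated with a small sketch. It assumes the common view that visemes are a many-to-one grouping of phonemes, so distinct pronunciations can collapse into the same viseme sequence. The mapping, the viseme labels, and the toy lexicon below are hypothetical and chosen only for illustration; they are not the 13-viseme map or the dictionary used in the paper.

```python
from collections import defaultdict

# Hypothetical many-to-one phoneme-to-viseme map (illustrative only;
# NOT the 13-class viseme set used in the paper).
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar",
    "ae": "V_open", "eh": "V_open",
}

# Toy pronunciation dictionary (hypothetical entries).
LEXICON = {
    "pat": ["p", "ae", "t"],
    "bat": ["b", "ae", "t"],
    "mat": ["m", "ae", "t"],
    "bad": ["b", "ae", "d"],
    "pet": ["p", "eh", "t"],
    "fan": ["f", "ae", "n"],
    "van": ["v", "ae", "n"],
}


def to_visemes(phonemes):
    """Map a phoneme sequence to its viseme sequence."""
    return tuple(PHONEME_TO_VISEME[p] for p in phonemes)


def homophene_groups(lexicon):
    """Group words whose pronunciations share one viseme sequence."""
    groups = defaultdict(list)
    for word, phones in lexicon.items():
        groups[to_visemes(phones)].append(word)
    return dict(groups)


groups = homophene_groups(LEXICON)
distinct_phoneme_seqs = {tuple(p) for p in LEXICON.values()}

print(len(distinct_phoneme_seqs), "distinct phoneme sequences")
print(len(groups), "distinct viseme sequences")
for visemes, words in groups.items():
    print(visemes, "->", sorted(words))
```

In this toy setting every word has its own phoneme sequence, but the words fall into only two viseme sequences. A decoder that recognises the viseme units perfectly would still have to choose among several candidate words per sequence, which is one plausible reading of why higher unit-level accuracy need not translate into higher word accuracy.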

