CV | Feb 1, 2025

Evaluation of End-to-End Continuous Spanish Lipreading in Different Data Conditions

arXiv:2502.00464v2 | 2 citations | h-index: 12 | Has Code | Language Resources and Evaluation
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of visual speech recognition for Spanish speakers, representing an incremental advancement in a domain-specific area.

The paper tackles continuous Spanish lipreading by presenting an end-to-end system based on a hybrid CTC/Attention architecture, achieving state-of-the-art results on two disparate corpora and establishing a new benchmark.

Visual speech recognition remains an open research problem: without the auditory signal, a system must cope with visual ambiguities, inter-personal variability among speakers, and the complex modeling of silence. Nonetheless, remarkable results have recently been achieved in the field thanks to the availability of large-scale databases and the use of powerful attention mechanisms. In addition, languages other than English are now a focus of interest. This paper presents notable advances in automatic continuous lipreading for Spanish. First, an end-to-end system based on the hybrid CTC/Attention architecture is presented. Experiments are conducted on two corpora of disparate nature, reaching state-of-the-art results that significantly improve on the best performance obtained to date for both databases. In addition, a thorough ablation study examines how the different components of the architecture influence the quality of speech recognition. Then, a rigorous error analysis investigates the factors that could affect the learning of the automatic system. Finally, a new Spanish lipreading benchmark is consolidated. Code and trained models are available at https://github.com/david-gimeno/evaluating-end2end-spanish-lipreading.
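As a rough illustration of what "hybrid CTC/Attention" usually means, the sketch below shows the standard interpolation of a CTC term and an attention-decoder term, used both as a training loss and as a score for ranking hypotheses at decoding time. It is a minimal sketch, not the authors' implementation: the function names, the `ctc_weight` values, and the toy numbers are all assumptions made for illustration, and the abstract does not state the weight used in the paper.

```python
from typing import Dict, Tuple


def hybrid_loss(ctc_loss: float, att_loss: float, ctc_weight: float = 0.3) -> float:
    """Interpolate a CTC loss and an attention (cross-entropy) loss.

    ctc_weight is a hypothetical value chosen for illustration only.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def best_hypothesis(
    scores: Dict[str, Tuple[float, float]], ctc_weight: float = 0.3
) -> str:
    """Pick the hypothesis with the highest interpolated log-probability.

    scores maps each candidate transcription to a pair
    (CTC log-probability, attention log-probability).
    """
    def joint(pair: Tuple[float, float]) -> float:
        ctc_logp, att_logp = pair
        return ctc_weight * ctc_logp + (1.0 - ctc_weight) * att_logp

    return max(scores, key=lambda hyp: joint(scores[hyp]))


# Toy values, invented for this example.
loss = hybrid_loss(ctc_loss=2.0, att_loss=1.0, ctc_weight=0.25)
candidates = {"hola": (-1.0, -2.0), "ola": (-3.0, -1.5)}
winner = best_hypothesis(candidates, ctc_weight=0.5)
```

Setting `ctc_weight` to 0 reduces the model to a pure attention decoder and setting it to 1 reduces it to pure CTC, which is the kind of component-level trade-off an ablation study can probe.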

Code Implementations: 1 repo
