SD · AS · IV · Dec 26, 2021

Bilingual Speech Recognition by Estimating Speaker Geometry from Video Data

arXiv:2112.13463v1 · 5 citations
Originality: Incremental advance
AI Analysis

This paper addresses speech recognition in noisy, bilingual classrooms, but the advance is incremental, with modest gains over existing methods.

The paper tackles speech recognition in noisy, bilingual student environments by using video analysis to estimate speaker geometry for audio simulation. It reports an average error rate of 10.83% in speaker-to-microphone distance estimation, compared with 33.12% for a baseline, and a recognition accuracy of 27.92%, which is 1.5% better than Google Speech-to-text.

Speech recognition is very challenging in student learning environments that are characterized by significant cross-talk and background noise. To address this problem, we present a bilingual speech recognition system that uses an interactive video analysis system to estimate the 3D speaker geometry for realistic audio simulations. We demonstrate the use of our system in generating a complex audio dataset with significant cross-talk and background noise that approximates real-life classroom recordings. We then test our proposed system with real-life recordings. In terms of the distance of the speakers from the microphone, our interactive video analysis system achieved a lower average error rate of 10.83%, compared to 33.12% for a baseline approach. Our proposed system gave an accuracy of 27.92%, which is 1.5% better than Google Speech-to-text on the same dataset. For 9 important keywords, our approach gave an average sensitivity of 38% compared to 24% for Google Speech-to-text, while both methods maintained high average specificity of 90% and 92%.
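The keyword results above are stated as average sensitivity and specificity. As a minimal sketch of how such figures are conventionally computed, the Python below uses the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)) and an unweighted mean across keywords. The `KeywordCounts` type, the keyword names, the counts, and the choice of an unweighted mean are all assumptions made for illustration; the abstract does not say how the authors segmented the audio or averaged the per-keyword scores.

```python
from dataclasses import dataclass


@dataclass
class KeywordCounts:
    """Confusion counts for one keyword over a set of audio segments (hypothetical type)."""
    tp: int  # keyword spoken and detected
    fn: int  # keyword spoken but missed
    tn: int  # keyword absent and not detected
    fp: int  # keyword absent but detected


def sensitivity(c: KeywordCounts) -> float:
    """TP / (TP + FN): share of true keyword occurrences that were found."""
    return c.tp / (c.tp + c.fn)


def specificity(c: KeywordCounts) -> float:
    """TN / (TN + FP): share of keyword-free segments correctly left unflagged."""
    return c.tn / (c.tn + c.fp)


def macro_average(values: list[float]) -> float:
    """Unweighted mean across keywords (an assumed averaging scheme)."""
    return sum(values) / len(values)


# Made-up counts for two placeholder keywords; these are NOT the paper's data.
counts = {
    "keyword_a": KeywordCounts(tp=3, fn=5, tn=90, fp=10),
    "keyword_b": KeywordCounts(tp=4, fn=4, tn=92, fp=8),
}
avg_sens = macro_average([sensitivity(c) for c in counts.values()])
avg_spec = macro_average([specificity(c) for c in counts.values()])
print(f"average sensitivity = {avg_sens:.2f}, average specificity = {avg_spec:.2f}")
```

Under these definitions, a system can keep specificity high simply by rarely flagging a keyword, so the two numbers are best read together: a higher sensitivity at roughly the same specificity means more true keyword occurrences are found without many more false alarms.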

