AS | AI | Oct 31, 2023

RIR-SF: Room Impulse Response Based Spatial Feature for Target Speech Recognition in Multi-Channel Multi-Speaker Scenarios

arXiv:2311.00146v2 | 2 citations | h-index: 29
Originality: Highly original
AI Analysis

This paper addresses target speech recognition in reverberant environments, proposing a novel method for a known bottleneck in ASR systems.

The paper tackles automatic speech recognition in multi-channel multi-speaker scenarios by introducing RIR-SF, a spatial feature based on room impulse response that leverages speaker position and reflection dynamics, resulting in a 21.3% relative reduction in character error rate (CER).

Automatic speech recognition (ASR) on multi-talker recordings is challenging. Current methods using 3D spatial data from multi-channel audio and visual cues focus mainly on direct waves from the target speaker, overlooking reflection wave impacts, which hinders performance in reverberant environments. Our research introduces RIR-SF, a novel spatial feature based on room impulse response (RIR) that leverages the speaker's position, room acoustics, and reflection dynamics. RIR-SF significantly outperforms traditional 3D spatial features, showing superior theoretical and empirical performance. We also propose an optimized all-neural multi-channel ASR framework for RIR-SF, achieving a relative 21.3% reduction in CER for target speaker ASR in multi-channel settings. RIR-SF enhances recognition accuracy and demonstrates robustness in high-reverberation scenarios, overcoming the limitations of previous methods.
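
To make the "relative 21.3% reduction in CER" concrete, here is a minimal sketch of how a relative error reduction is usually computed: the drop in error rate divided by the baseline error rate. The function name `relative_reduction` and the baseline value are illustrative assumptions; the abstract does not state the absolute CER figures.

```python
def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Return the relative error reduction as a fraction of the baseline."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return (baseline_cer - new_cer) / baseline_cer


# Hypothetical numbers chosen for illustration only (NOT from the paper):
# a baseline CER of 10.0% that is lowered by a 21.3% relative reduction.
baseline = 10.0
improved = baseline * (1 - 0.213)

print(f"absolute CER: {baseline:.2f}% -> {improved:.2f}%")
print(f"relative reduction: {relative_reduction(baseline, improved):.1%}")
```

Under these assumed numbers, the absolute improvement is only about 2.1 percentage points, so a relative figure should be read together with the baseline it refers to.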
