SD · AS · Mar 31, 2021

TS-RIR: Translated synthetic room impulse responses for speech augmentation

arXiv:2103.16804v5 · 24 citations
AI Analysis

This work addresses the gap between synthetic and real RIRs for improving far-field speech recognition systems, representing an incremental advancement in speech augmentation.

The paper tackles the problem of low fidelity in synthetic room impulse responses for far-field speech recognition by introducing TS-RIRGAN to translate synthetic RIRs into real-like ones, resulting in up to a 19.9% reduction in word error rate on a benchmark dataset.

We present a method for improving the quality of synthetic room impulse responses for far-field speech recognition. We bridge the gap between the fidelity of synthetic room impulse responses (RIRs) and real room impulse responses using our novel TS-RIRGAN architecture. Given a synthetic RIR in the form of raw audio, we use TS-RIRGAN to translate it into a real RIR. We also perform real-world sub-band room equalization on the translated synthetic RIR. Our overall approach improves the quality of synthetic RIRs by compensating for low-frequency wave effects, similar to those in real RIRs. We evaluate the performance of the improved synthetic RIRs on a far-field speech dataset augmented by convolving the LibriSpeech clean speech dataset [1] with RIRs and adding background noise. We show that far-field speech augmented using our improved synthetic RIRs reduces the word error rate by up to 19.9% in the Kaldi far-field automatic speech recognition benchmark [2].
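
The augmentation step described in the abstract (convolving clean speech with an RIR, then adding background noise) can be illustrated with a minimal sketch. This is not the authors' pipeline, which runs inside the Kaldi recipe. The function name simulate_far_field, the 20 dB signal-to-noise ratio, the toy exponentially decaying RIR, and the random signals standing in for LibriSpeech audio and recorded noise are all assumptions made for illustration.

```python
import numpy as np


def simulate_far_field(clean, rir, noise, snr_db):
    """Hypothetical helper: reverberate clean speech and add noise at snr_db."""
    # Reverberate: linear convolution, truncated to the clean signal's length.
    reverberant = np.convolve(clean, rir)[: len(clean)]

    # Tile or trim the noise so that it covers the whole utterance.
    reps = int(np.ceil(len(reverberant) / len(noise)))
    noise = np.tile(noise, reps)[: len(reverberant)]

    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    scaled_noise = gain * noise
    return reverberant + scaled_noise, reverberant, scaled_noise


# Toy stand-ins (assumptions): random signals instead of real recordings.
rng = np.random.default_rng(0)
sr = 16000                                   # sample rate in Hz
clean = rng.standard_normal(sr)              # placeholder for a clean utterance
t = np.arange(int(0.3 * sr)) / sr
rir = rng.standard_normal(len(t)) * np.exp(-t / 0.05)  # toy decaying RIR
rir /= np.max(np.abs(rir))
noise = rng.standard_normal(sr // 2)         # placeholder background noise

far_field, reverberant, scaled_noise = simulate_far_field(
    clean, rir, noise, snr_db=20.0
)
measured_snr = 10 * np.log10(
    np.mean(reverberant ** 2) / np.mean(scaled_noise ** 2)
)
```

In the paper's setting, the rir argument would be a synthetic RIR after translation by TS-RIRGAN and sub-band equalization. The sketch only shows where such an RIR enters the augmentation, not how it is produced.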

Code Implementations: 1 repo