SD LG AS · Feb 17, 2025

TAPS: Throat and Acoustic Paired Speech Dataset for Deep Learning-Based Speech Enhancement

Yunsik Kim, Yonghun Song, Yoonyoung Chung

arXiv:2502.11478v2 · 5 citations · h-index: 4

Originality: Synthesis-oriented

AI Analysis

The paper provides a standard dataset for researchers working on throat microphone speech enhancement in high-noise environments. The contribution is incremental in that the models evaluated build on existing deep learning approaches.

The authors tackled the problem of enhancing throat microphone recordings by introducing the TAPS dataset, a collection of paired throat and acoustic microphone utterances from 60 native Korean speakers. Of the three baseline deep learning models they tested, mapping-based approaches were the most effective at improving speech quality and restoring content.

In high-noise environments such as factories, subways, and busy streets, capturing clear speech is challenging. Throat microphones can offer a solution because of their inherent noise-suppression capabilities; however, the passage of sound waves through skin and tissue attenuates high-frequency information, reducing speech clarity. Recent deep learning approaches have shown promise in enhancing throat microphone recordings, but further progress is constrained by the lack of a standard dataset. Here, we introduce the Throat and Acoustic Paired Speech (TAPS) dataset, a collection of paired utterances recorded from 60 native Korean speakers using throat and acoustic microphones. Furthermore, an optimal alignment approach was developed and applied to address the inherent signal mismatch between the two microphones. We tested three baseline deep learning models on the TAPS dataset and found mapping-based approaches to be superior for improving speech quality and restoring content. These findings demonstrate the TAPS dataset's utility for speech enhancement tasks and support its potential as a standard resource for advancing research in throat microphone-based applications.
