SD AI AS | Aug 16, 2024

MAT-SED: A Masked Audio Transformer with Masked-Reconstruction Based Pre-training for Sound Event Detection

Pengfei Cai, Yan Song, Kang Li, Haoyu Song, Ian McLoughlin

arXiv:2408.08673v210.910 citations | h-index: 8 | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the scarcity of labeled data in sound event detection and represents an incremental advance over existing methods.

The paper tackles sound event detection by proposing a pure Transformer-based model with masked-reconstruction pre-training, achieving state-of-the-art performance with PSDS1/PSDS2 scores of 0.587/0.896 on DCASE2023 task4.

Sound event detection (SED) methods that leverage a large pre-trained Transformer encoder network have shown promising performance in recent DCASE challenges. However, they still rely on an RNN-based context network to model temporal dependencies, largely due to the scarcity of labeled data. In this work, we propose a pure Transformer-based SED model with masked-reconstruction based pre-training, termed MAT-SED. Specifically, a Transformer with relative positional encoding is first designed as the context network, pre-trained by the masked-reconstruction task on all available target data in a self-supervised way. Both the encoder and the context network are jointly fine-tuned in a semi-supervised manner. Furthermore, a global-local feature fusion strategy is proposed to enhance the localization capability. Evaluation of MAT-SED on DCASE2023 task4 surpasses state-of-the-art performance, achieving 0.587/0.896 PSDS1/PSDS2 respectively.
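The masked-reconstruction idea in the abstract can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the function names, the 30% mask ratio, the all-zero mask token, and the mean-squared-error objective are all assumptions chosen for illustration, and the identity mapping stands in for the Transformer context network. The sketch only shows the general pattern of hiding some frames of a feature sequence and scoring the reconstruction on the hidden frames alone.

```python
import random


def mask_frames(features, mask_ratio, mask_token, rng):
    """Replace a random subset of frames with mask_token (hypothetical helper).

    Returns the masked sequence and the sorted indices of the masked frames.
    """
    n_frames = len(features)
    n_masked = max(1, round(n_frames * mask_ratio))
    masked_idx = sorted(rng.sample(range(n_frames), n_masked))
    masked_set = set(masked_idx)
    masked = [
        list(mask_token) if t in masked_set else list(frame)
        for t, frame in enumerate(features)
    ]
    return masked, masked_idx


def masked_reconstruction_loss(predicted, target, masked_idx):
    """Mean squared error computed over the masked frames only."""
    total, count = 0.0, 0
    for t in masked_idx:
        for p, y in zip(predicted[t], target[t]):
            total += (p - y) ** 2
            count += 1
    return total / count


# Toy data: 10 frames of 2-dimensional features (stand-in for encoder output).
rng = random.Random(0)
features = [[float(t), 0.5 * float(t)] for t in range(10)]
mask_token = [0.0, 0.0]  # assumed; a learned token is also common

masked, masked_idx = mask_frames(features, 0.3, mask_token, rng)

# Identity "model": returns its masked input unchanged, so it cannot
# recover the hidden frames and incurs a positive loss.
loss_identity = masked_reconstruction_loss(masked, features, masked_idx)

# A perfect reconstruction gives zero loss.
loss_perfect = masked_reconstruction_loss(features, features, masked_idx)
```

In an actual pre-training loop, the masked sequence would be passed through the context network, and the loss on the masked frames would drive the gradient update. No labels are needed, which is why this kind of objective can use all available target data.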

