SD AS Nov 30, 2021

SP-SEDT: Self-supervised Pre-training for Sound Event Detection Transformer

Zhirong Ye, Xiangdong Wang, Hong Liu, Yueliang Qian, Rui Tao, Long Yan, Kazushige Ouchi

arXiv:2111.15222v24.32 citations Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need to reduce reliance on temporally annotated training data in sound event detection. It is an incremental advance, since it adapts an existing pre-training technique from object detection (UP-DETR) to audio.

The paper proposes a self-supervised pre-training method for an event-based transformer model for sound event detection, with the aim of reducing reliance on annotated data. It reports that the pre-trained model outperforms a fine-tuned frame-based model on the DCASE2019 task4 dataset.

Recently, an event-based end-to-end model (SEDT) has been proposed for sound event detection (SED) and achieves competitive performance. However, compared with frame-based models, it requires more training data with temporal annotations to improve its localization ability. Synthetic data is an alternative, but it suffers from a large domain gap with real recordings. Inspired by the success of UP-DETR in object detection, we propose to pre-train SEDT in a self-supervised manner (SP-SEDT) by detecting random patches, which are cropped only along the time axis. Experiments on the DCASE2019 task4 dataset show that the proposed SP-SEDT can outperform a fine-tuned frame-based model. An ablation study is also conducted to investigate the impact of different loss functions and patch sizes.
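The pretext task described in the abstract, cropping random patches along the time axis only, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `crop_random_time_patches`, its parameters, the list-of-frames spectrogram layout, and the normalized (center, duration) target format are all assumptions made for illustration.

```python
import random


def crop_random_time_patches(spec, num_patches, min_frames, max_frames, seed=0):
    """Crop random patches along the time axis, keeping every frequency bin.

    spec is a list of frames, each frame a list of frequency-bin values
    (an assumed layout). Returns a list of (patch, (center, duration))
    pairs, where center and duration are fractions of the clip length.
    """
    n_frames = len(spec)
    if not 0 < min_frames <= max_frames <= n_frames:
        raise ValueError("need 0 < min_frames <= max_frames <= len(spec)")

    rng = random.Random(seed)
    patches = []
    for _ in range(num_patches):
        # randint is inclusive on both ends.
        width = rng.randint(min_frames, max_frames)
        start = rng.randint(0, n_frames - width)
        # Slice in time only; each frame keeps its full frequency range.
        patch = spec[start:start + width]
        center = (start + width / 2) / n_frames
        duration = width / n_frames
        patches.append((patch, (center, duration)))
    return patches


# Hypothetical input: 500 frames with 64 frequency bins each.
data_rng = random.Random(1)
spec = [[data_rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(500)]
patches = crop_random_time_patches(spec, num_patches=3, min_frames=20, max_frames=100)
```

In a UP-DETR-style setup, the cropped patches would serve as queries and the model would be trained to predict where each patch occurs in the original clip, so no human annotation is needed. How SP-SEDT encodes the patches and which losses it uses are not specified in the abstract and are left out here.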
