SD · Oct 18, 2022
A Hybrid System of Sound Event Detection Transformer and Frame-wise Model for DCASE 2022 Task 4
Yiming Li, Zhifang Guo, Zhirong Ye et al. · tsinghua
In this paper, we describe in detail our system for DCASE 2022 Task 4. The system combines two considerably different models: an end-to-end Sound Event Detection Transformer (SEDT) and a frame-wise model, Metric Learning and Focal Loss CNN (MLFL-CNN). The former is an event-wise model which learns event-level representations and predicts sound event categories and boundaries directly, while the latter is based on the widely adopted frame-classification scheme, under which each frame is classified into event categories and event boundaries are obtained by post-processing such as thresholding and smoothing. For SEDT, self-supervised pre-training using unlabeled data is applied, and semi-supervised learning is adopted by using an online teacher, which is updated from the student model using the Exponential Moving Average (EMA) strategy and generates reliable pseudo labels for weakly-labeled and unlabeled data. For the frame-wise model, the ICT-TOSHIBA system of DCASE 2021 Task 4 is used. Experimental results show that the hybrid system considerably outperforms either individual model and achieves a PSDS1 of 0.420 and a PSDS2 of 0.783 on the validation set without external data. The code is available at https://github.com/965694547/Hybrid-system-of-frame-wise-model-and-SEDT.
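The EMA teacher update mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: parameters are plain floats in a dictionary, and the function name `ema_update` and the decay value are assumptions chosen for the example.

```python
def ema_update(teacher, student, decay=0.999):
    """Move each teacher parameter toward the student by an exponential moving average.

    `teacher` and `student` are dicts mapping parameter names to floats
    (a stand-in for real model weights; hypothetical for this sketch).
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Toy example with an exaggerated decay so the effect is visible.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, decay=0.9)
```

With a decay close to 1, the teacher changes slowly and smooths out noise in the student's updates, which is the usual reason an EMA teacher is used to generate pseudo labels.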
SD · Nov 30, 2021
SP-SEDT: Self-supervised Pre-training for Sound Event Detection Transformer
Zhirong Ye, Xiangdong Wang, Hong Liu et al.
Recently, an event-based end-to-end model (SEDT) has been proposed for sound event detection (SED) and achieves competitive performance. However, compared with the frame-based model, it requires more training data with temporal annotations to improve its localization ability. Synthetic data is an alternative, but it suffers from a large domain gap with real recordings. Inspired by the great success of UP-DETR in object detection, we propose to pre-train SEDT in a self-supervised manner (SP-SEDT) by detecting random patches (cropped only along the time axis). Experiments on the DCASE2019 Task 4 dataset show that the proposed SP-SEDT can outperform the fine-tuned frame-based model. An ablation study is also conducted to investigate the impact of different loss functions and patch sizes.
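As a rough illustration of the pretext task, the sketch below samples random patches along the time axis only and converts each to a normalized (center, width) target, the kind of box a DETR-style detector regresses. The function names, patch-length range, and target format are assumptions made for this example and are not taken from the paper.

```python
import random


def crop_random_patches(num_frames, num_patches, min_len, max_len, rng):
    """Sample patches as (start, end) frame intervals along the time axis only.

    The full frequency range is kept, so a patch is defined by its time span.
    """
    patches = []
    for _ in range(num_patches):
        length = rng.randint(min_len, max_len)        # inclusive bounds
        start = rng.randint(0, num_frames - length)   # patch must fit in the clip
        patches.append((start, start + length))
    return patches


def to_normalized_box(start, end, num_frames):
    """Convert a frame interval to a (center, width) pair in [0, 1]."""
    center = (start + end) / 2 / num_frames
    width = (end - start) / num_frames
    return center, width


rng = random.Random(0)  # fixed seed so the sketch is reproducible
patches = crop_random_patches(num_frames=500, num_patches=10,
                              min_len=20, max_len=100, rng=rng)
targets = [to_normalized_box(s, e, 500) for s, e in patches]
```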
SD · Oct 5, 2021
Sound Event Detection Transformer: An Event-based End-to-End Model for Sound Event Detection
Zhirong Ye, Xiangdong Wang, Hong Liu et al.
Sound event detection (SED) has gained increasing attention with its wide application in surveillance, video indexing, etc. Existing SED models mainly generate frame-level predictions, converting the task into a sequence multi-label classification problem. A critical issue with the frame-based model is that it pursues the best frame-level prediction rather than the best event-level prediction. Besides, it needs post-processing and cannot be trained in an end-to-end way. This paper first presents the one-dimensional Detection Transformer (1D-DETR), inspired by Detection Transformer for image object detection. Furthermore, given the characteristics of SED, an audio query branch and a one-to-many matching strategy for fine-tuning the model are added to 1D-DETR to form the Sound Event Detection Transformer (SEDT). To our knowledge, SEDT is the first event-based and end-to-end SED model. Experiments are conducted on the URBAN-SED dataset and the DCASE2019 Task 4 dataset, and both show that SEDT can achieve competitive performance.
SD · Nov 2, 2020
Learning generic feature representation with synthetic data for weakly-supervised sound event detection by inter-frame distance loss
Yuxin Huang, Liwei Lin, Xiangdong Wang et al.
Because strongly-labeled sound event detection datasets are limited, using synthetic data to improve sound event detection system performance has become a new research focus. In this paper, we exploit synthetic data to improve the feature representation. Based on metric learning, we propose an inter-frame distance loss function for domain adaptation and demonstrate its effectiveness on sound event detection. We also apply multi-task learning with synthetic data. We find that the best performance is achieved when the two methods are used together. Experiments on the DCASE 2018 Task 4 test set and the DCASE 2019 Task 4 synthetic set both show competitive results.
SD · Jul 21, 2020
Guided multi-branch learning systems for sound event detection with sound separation
Yuxin Huang, Liwei Lin, Shuo Ma et al.
In this paper, we describe in detail our systems for DCASE 2020 Task 4. The systems are based on the 1st-place system of DCASE 2019 Task 4, which adopts a weakly-supervised framework with an attention-based embedding-level pooling module and a semi-supervised learning approach named guided learning. This year, we incorporate multi-branch learning (MBL) into the original system to further improve its performance. MBL uses different branches with different pooling strategies (including instance-level and embedding-level strategies) and different pooling modules (including attention pooling, global max pooling or global average pooling modules), which share the same feature encoder of the model. Therefore, multiple branches pursuing different purposes and focusing on different characteristics of the data can help the feature encoder model the feature space better and avoid over-fitting. To better exploit the strongly-labeled synthetic data, inspired by multi-task learning, we also employ a sound event detection branch. To combine sound separation (SS) with sound event detection (SED), we fuse the results of SED systems with SS-SED systems which are trained using separated sound output by an SS system. The experimental results show that MBL can improve the model performance and that using SS has great potential to improve the performance of the SED ensemble system.
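The three pooling modules named in this abstract can be contrasted with a small instance-level sketch, in which per-frame probabilities for one class are aggregated into a single clip-level probability. This is an illustration only: the function names and the softmax form of the attention weights are assumptions, and the embedding-level variants described in the abstract would pool feature vectors before classification instead.

```python
import math


def global_max_pool(frame_probs):
    """Clip-level probability is the most confident frame."""
    return max(frame_probs)


def global_avg_pool(frame_probs):
    """Clip-level probability is the mean over all frames."""
    return sum(frame_probs) / len(frame_probs)


def attention_pool(frame_probs, scores):
    """Weighted mean of frame probabilities with softmax-normalized weights.

    `scores` are unnormalized per-frame attention scores (hypothetical inputs
    that a learned attention layer would produce).
    """
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, frame_probs)) / total


frame_probs = [0.1, 0.9, 0.2, 0.8]
clip_max = global_max_pool(frame_probs)
clip_avg = global_avg_pool(frame_probs)
clip_att = attention_pool(frame_probs, scores=[0.0, 2.0, 0.0, 2.0])
```

With uniform scores, attention pooling reduces to average pooling; as the scores concentrate on a few frames, it approaches max pooling.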
AS · Nov 6, 2019
An End-to-end Approach for Lexical Stress Detection based on Transformer
Yong Ruan, Xiangdong Wang, Hong Liu et al.
The dominant automatic lexical stress detection method splits the utterance into syllable segments using the phoneme sequence and its time-aligned boundaries, then extracts features from each syllable and applies a classifier to determine the lexical stress. However, the time boundaries of each phoneme cannot be obtained very accurately, and features must be hand-designed on the syllable segments to classify the lexical stress. Therefore, we propose an end-to-end approach that uses a Transformer sequence-to-sequence model to estimate lexical stress. For this, we train the Transformer model using audio feature sequences and their phoneme sequences with lexical stress marks. During recognition, the recognized phoneme sequence is constrained to the original standard phoneme sequence without lexical stress marks, but the lexical stress mark of each phoneme is not constrained. We train the model on different subsets of LibriSpeech and perform lexical stress recognition on the TIMIT and L2-ARCTIC datasets. For all subsets, the end-to-end model performs better than the syllable-segment classification method. Our method achieves a 6.36% phoneme error rate on the TIMIT dataset, which improves on the 7.2% error rate reported in other studies.
AS · Sep 11, 2019
Guided Learning Convolution System for DCASE 2019 Task 4
Liwei Lin, Xiangdong Wang, Hong Liu et al.
In this paper, we describe in detail the system we submitted to DCASE2019 Task 4: sound event detection (SED) in domestic environments. We employ a convolutional neural network (CNN) with an embedding-level attention pooling module to solve it. Considering the interference caused by the co-occurrence of multiple events in the unbalanced dataset, we utilize the disentangled feature to raise the performance of the model. To take advantage of the unlabeled data, we adopt Guided Learning for semi-supervised learning. A group of median filters with adaptive window sizes is utilized in the post-processing of the output probabilities of the model. We also analyze the effect of the synthetic data on the performance of the model and finally achieve an event-based F-measure of 45.43% on the validation set and an event-based F-measure of 42.7% on the test set. The system we submitted to the challenge achieves the best performance among all participants.
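The post-processing step can be sketched as median filtering of per-frame probabilities followed by thresholding and event extraction. This is a minimal sketch under stated assumptions: the rule in `adaptive_window` (one third of a class's mean event duration, forced odd) is a hypothetical choice for illustration, as are the function names and the 0.5 threshold.

```python
from statistics import median


def adaptive_window(mean_event_frames, divisor=3):
    """Pick an odd window size from a class's mean event duration (assumed rule)."""
    return max(1, mean_event_frames // divisor) | 1  # `| 1` forces an odd size


def median_filter(probs, window):
    """Median-filter a sequence, replicating edge values as padding."""
    half = window // 2
    padded = [probs[0]] * half + list(probs) + [probs[-1]] * half
    return [median(padded[i:i + window]) for i in range(len(probs))]


def extract_events(probs, threshold=0.5):
    """Return (onset, offset) frame pairs where probs exceed the threshold."""
    events, onset = [], None
    for i, p in enumerate(probs):
        if p > threshold and onset is None:
            onset = i
        elif p <= threshold and onset is not None:
            events.append((onset, i))
            onset = None
    if onset is not None:
        events.append((onset, len(probs)))
    return events


probs = [0.1, 0.9, 0.2, 0.8, 0.9, 0.9, 0.1, 0.1]
window = adaptive_window(mean_event_frames=9)
smoothed = median_filter(probs, window)
events = extract_events(smoothed)
```

In this toy input, the isolated spike at frame 1 and the dip at frame 2 are smoothed into one contiguous event, which is the effect the median filter is meant to have on frame-level outputs.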
LG · Jun 6, 2019
Guided learning for weakly-labeled semi-supervised sound event detection
Liwei Lin, Xiangdong Wang, Hong Liu et al.
We propose a simple but efficient method termed Guided Learning for weakly-labeled semi-supervised sound event detection (SED). There are two sub-targets implied in weakly-labeled SED: audio tagging and boundary detection. Instead of designing a single model by considering a trade-off between the two sub-targets, we design a teacher model aiming at audio tagging to guide a student model aiming at boundary detection to learn using the unlabeled data. The guidance is guaranteed by the audio tagging performance gap between the two models. Meanwhile, the student model, freed from the trade-off, is able to provide better boundary detection results. We propose a principle for designing the two models based on the relation between the temporal compression scale and the two sub-targets. We also propose an end-to-end semi-supervised learning process for these two models that enables their abilities to improve alternately. Experiments on the DCASE2018 Task 4 dataset show that our approach achieves competitive performance.
SD · May 24, 2019
Specialized Decision Surface and Disentangled Feature for Weakly-Supervised Polyphonic Sound Event Detection
Liwei Lin, Xiangdong Wang, Hong Liu et al.
In this paper, a specialized decision surface for weakly-supervised sound event detection (SED) and a disentangled feature (DF) for the multi-label problem in polyphonic SED are proposed. We approach SED as a multiple instance learning (MIL) problem and utilize a neural network framework with a pooling module to solve it. General MIL approaches are of two kinds: instance-level approaches and embedding-level approaches. We present a method of generating instance-level probabilities for the embedding-level approaches, which tend to perform better than the instance-level approaches in terms of bag-level classification but cannot provide instance-level probabilities in current approaches. Moreover, we propose a specialized decision surface (SDS) for the embedding-level attention pooling. We analyze and explain why an embedding-level attention module with SDS is better than other typical pooling modules from the perspective of the high-level feature space. As for the problem of the unbalanced dataset and the co-occurrence of multiple categories in the polyphonic event detection task, we propose a DF to reduce interference among categories, which optimizes the high-level feature space by disentangling it based on class-wise identifiable information and obtaining multiple different subspaces. Experiments on the dataset of DCASE 2018 Task 4 show that the proposed SDS and DF significantly improve the detection performance of the embedding-level MIL approach with an attention pooling module and outperform the first-place system in the challenge by 6.6 percentage points.
CV · Nov 27, 2018
DSBI: Double-Sided Braille Image Dataset and Algorithm Evaluation for Braille Dots Detection
Renqiang Li, Hong Liu, Xiangdong Wan et al.
Braille is an effective way for the visually impaired to learn and obtain information. Braille image recognition aims to automatically detect Braille dots in the whole Braille image. There are no publicly available datasets for Braille image recognition to advance relevant research and evaluate algorithms. This paper constructs a large-scale Double-Sided Braille Image dataset, DSBI, with detailed annotations of Braille recto dots, verso dots and Braille cells. To quickly annotate Braille images, an auxiliary annotation strategy is proposed, which performs an initial automatic detection of Braille dots and then corrects the annotation results through a convenient human-computer interaction method. This labeling strategy increases labeling efficiency by six times on average for recto dot annotation in one Braille image. Braille dots detection is the core and basic step of Braille image recognition. This paper also evaluates some Braille dots detection methods on our DSBI dataset and gives the benchmark performance of recto dots detection. We have released our Braille image dataset on GitHub.