SDApr 7, 2023
Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning PretrainingJian Guan, Feiyang Xiao, Youde Liu et al.
Existing contrastive learning methods for anomalous sound detection refine the audio representation of each audio sample by using the contrast between the samples' augmentations (e.g., with time or frequency masking). However, they might be biased by the augmented data, due to the lack of physical properties of machine sound, thereby limiting the detection performance. This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample. The proposed two-stage method uses contrastive learning to pretrain the audio representation model by incorporating machine ID and a self-supervised ID classifier to fine-tune the learnt model, while enhancing the relation between audio features from the same ID. Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification in overall anomaly detection performance and stability on DCASE 2020 Challenge Task2 dataset.
SDJan 14, 2022Code
Anomalous Sound Detection using Spectral-Temporal Information FusionYoude Liu, Jian Guan, Qiaoxi Zhu et al.
Unsupervised anomalous sound detection aims to detect unknown abnormal sounds of machines from normal sounds. However, the state-of-the-art approaches are not always stable and perform dramatically differently even for machines of the same type, making it impractical for general applications. This paper proposes a spectral-temporal fusion based self-supervised method to model the feature of the normal sound, which improves the stability and performance consistency in detection of anomalous sounds from individual machines, even of the same type. Experiments on the DCASE 2020 Challenge Task 2 dataset show that the proposed method achieved 81.39\%, 83.48\%, 98.22\% and 98.83\% in terms of the minimum AUC (worst-case detection performance amongst individuals) in four types of real machines (fan, pump, slider and valve), respectively, giving 31.79\%, 17.78\%, 10.42\% and 21.13\% improvement compared to the state-of-the-art method, i.e., Glow\_Aff. Moreover, the proposed method has improved AUC (average performance of individuals) for all the types of machines in the dataset. The source codes are available at https://github.com/liuyoude/STgram_MFN
SPDec 27, 2024
Spectral-Temporal Fusion Representation for Person-in-Bed DetectionXuefeng Yang, Shiheng Zhang, Jian Guan et al.
This study is based on the ICASSP 2025 Signal Processing Grand Challenge's Accelerometer-Based Person-in-Bed Detection Challenge, which aims to determine bed occupancy using accelerometer signals. The task is divided into two tracks: "in bed" and "not in bed" segmented detection, and streaming detection, facing challenges such as individual differences, posture variations, and external disturbances. We propose a spectral-temporal fusion-based feature representation method with mixup data augmentation, and adopt Intersection over Union (IoU) loss to optimize detection accuracy. In the two tracks, our method achieved outstanding results of 100.00% and 95.55% in detection scores, securing first place and third place, respectively.
SDJan 10, 2022
Local Information Assisted Attention-free Decoder for Audio CaptioningFeiyang Xiao, Jian Guan, Haiyan Lan et al.
Automated audio captioning aims to describe audio data with captions using natural language. Existing methods often employ an encoder-decoder structure, where the attention-based decoder (e.g., Transformer decoder) is widely used and achieves state-of-the-art performance. Although this method effectively captures global information within audio data via the self-attention mechanism, it may ignore the event with short time duration, due to its limitation in capturing local information in an audio signal, leading to inaccurate prediction of captions. To address this issue, we propose a method using the pretrained audio neural networks (PANNs) as the encoder and local information assisted attention-free Transformer (LocalAFT) as the decoder. The novelty of our method is in the proposal of the LocalAFT decoder, which allows local information within an audio signal to be captured while retaining the global information. This enables the events of different duration, including short duration, to be captured for more precise caption generation. Experiments show that our method outperforms the state-of-the-art methods in Task 6 of the DCASE 2021 Challenge with the standard attention-based decoder for caption generation.