SD | Sep 8, 2024
Deep Generic Representations for Domain-Generalized Anomalous Sound Detection
Phurich Saengthong, Takahiro Shinozaki
Developing a reliable anomalous sound detection (ASD) system requires robustness to noise, adaptation to domain shifts, and effective performance with limited training data. Current leading methods rely on extensive labeled data for each target machine type to train feature extractors using Outlier-Exposure (OE) techniques, yet their performance on the target domain remains sub-optimal. In this paper, we present GenRep, which combines generic feature representations from a robust, large-scale pre-trained feature extractor with kNN for domain-generalized ASD, without the need for fine-tuning. GenRep incorporates MemMixup, a simple approach for augmenting the target memory bank using nearest source samples, paired with a domain normalization technique to address the imbalance between source and target domains. GenRep outperforms the best OE-based approach without requiring labeled data, achieving an Official Score of 73.79% on the DCASE2023T2 Eval set, and demonstrates robustness under limited-data scenarios. The code is available as open source.
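The pipeline described in this abstract (a memory bank of embeddings, MemMixup augmentation of the target bank, and kNN scoring with domain normalization) can be illustrated with a minimal NumPy sketch. Everything beyond what the abstract states is an assumption made for illustration: the function names, the cosine distance, the mixing weight `lam`, the number of neighbors, and in particular the per-domain standardization, which is one plausible reading of "domain normalization" and not necessarily the authors' formulation. Random vectors stand in for embeddings from a pre-trained extractor.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cosine_distances(a, b):
    """Pairwise cosine distance between rows of a and rows of b."""
    return 1.0 - l2_normalize(a) @ l2_normalize(b).T

def mem_mixup(target_bank, source_bank, n_neighbors=2, lam=0.9):
    """Augment the target bank by mixing each target embedding with its
    nearest source embeddings. lam and n_neighbors are assumed values."""
    dist = cosine_distances(target_bank, source_bank)
    idx = np.argsort(dist, axis=1)[:, :n_neighbors]        # (n_target, n_neighbors)
    mixed = lam * target_bank[:, None, :] + (1.0 - lam) * source_bank[idx]
    return np.concatenate([target_bank, mixed.reshape(-1, target_bank.shape[1])])

def knn_distance(queries, bank, k=1):
    """Mean cosine distance from each query to its k nearest bank entries."""
    dist = np.sort(cosine_distances(queries, bank), axis=1)
    return dist[:, :k].mean(axis=1)

def domain_normalized_score(queries, source_bank, target_bank, k=1, eps=1e-12):
    """Assumed normalization: standardize each domain's kNN distance by the
    leave-one-out kNN statistics of that domain's bank, then keep the smaller."""
    per_domain = []
    for bank in (source_bank, target_bank):
        within = cosine_distances(bank, bank)
        np.fill_diagonal(within, np.inf)                    # exclude self-matches
        ref = np.sort(within, axis=1)[:, :k].mean(axis=1)
        d = knn_distance(queries, bank, k)
        per_domain.append((d - ref.mean()) / (ref.std() + eps))
    return np.minimum(*per_domain)

# Toy data: many source samples, only three target samples (domain imbalance).
rng = np.random.default_rng(0)
dim = 8
source = np.eye(dim)[0] + 0.05 * rng.standard_normal((50, dim))
target = np.eye(dim)[1] + 0.05 * rng.standard_normal((3, dim))
target_aug = mem_mixup(target, source)

normal_query = np.eye(dim)[1] + 0.05 * rng.standard_normal(dim)
anomalous_query = np.eye(dim)[2]                            # far from both domains
scores = domain_normalized_score(
    np.stack([normal_query, anomalous_query]), source, target_aug
)
print(scores)  # higher means more anomalous
```

No training step appears anywhere in the sketch, which mirrors the abstract's claim that the approach needs no fine-tuning: the only "model state" is the pair of memory banks.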
SD | Mar 14
Sub-Band Spectral Matching with Localized Score Aggregation for Robust Anomalous Sound Detection
Phurich Saengthong, Takahiro Shinozaki
Detecting subtle deviations in noisy acoustic environments is central to anomalous sound detection (ASD). A common training-free ASD pipeline temporally pools frame-level representations into a band-preserving feature vector and scores anomalies using a single nearest-neighbor match. However, this global matching can inflate normal-score variance through two effects. First, when normal sounds exhibit band-wise variability, a single global neighbor forces all bands to share the same reference, increasing band-level mismatch. Second, cosine-based matching is energy-coupled, allowing a few high-energy bands to dominate score computation under normal energy fluctuations and further increase variance. We propose BEAM, which stores temporally pooled sub-band vectors in a memory bank, retrieves neighbors per sub-band, and uniformly aggregates scores to reduce normal-score variability and improve discriminability. We further introduce a parameter-free adaptive fusion to better handle diverse temporal dynamics in sub-band responses. Experiments on multiple DCASE Task 2 benchmarks show strong performance without task-specific training, robustness to noise and domain shifts, and complementary gains when combined with encoder fine-tuning.
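The contrast the abstract draws between a single global nearest-neighbor match and per-sub-band retrieval can be sketched as follows. This is an illustrative toy under stated assumptions, not the BEAM implementation: the array layout `(clips, bands, dim)`, the function names, and the cosine distance are choices made here, and the adaptive fusion step is omitted because the abstract does not describe it.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def global_score(queries, bank):
    """Baseline: flatten all bands into one vector and use the single
    nearest neighbor, so every band shares the same reference clip."""
    q = l2_normalize(queries.reshape(len(queries), -1))
    m = l2_normalize(bank.reshape(len(bank), -1))
    return (1.0 - q @ m.T).min(axis=1)

def subband_score(queries, bank):
    """Per-band retrieval: each sub-band finds its own nearest neighbor in
    the bank, and the band scores are averaged with equal weight."""
    q = l2_normalize(queries)                      # (n_query, n_bands, dim)
    m = l2_normalize(bank)                         # (n_bank, n_bands, dim)
    sim = np.einsum("qbd,mbd->qbm", q, m)          # band-wise cosine similarity
    per_band = (1.0 - sim).min(axis=2)             # nearest neighbor per band
    return per_band.mean(axis=1)                   # uniform aggregation

# Two normal clips with two bands each; the query combines band 0 of the
# first clip with band 1 of the second (band-wise variability in normal data).
bank = np.array([
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
])
query = np.array([[[1.0, 0.0], [0.0, 1.0]]])

print(global_score(query, bank))   # no single clip matches both bands
print(subband_score(query, bank))  # each band finds an exact match
```

Because each band is unit-normalized before matching and the band scores are averaged uniformly, a high-energy band cannot dominate the final score in this sketch, which corresponds to the energy-coupling issue the abstract raises.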
CL | Jun 26, 2025
A Unified Speech LLM for Diarization and Speech Recognition in Multilingual Conversations
Phurich Saengthong, Boonnithi Jiaramaneepinit, Sheng Li et al.
Speech Large Language Models (Speech LLMs) have emerged as a crucial paradigm in recent years, extending the capabilities of traditional LLMs to speech tasks such as automatic speech recognition (ASR) and spoken dialogue modeling. However, their effectiveness in real-world multilingual conversations remains limited by the scarcity of data that captures natural conversational phenomena. To address this, the MLC-SLM Challenge provides a multilingual conversational dataset and evaluates models on two tasks: ASR with oracle segmentation (Task I) and joint diarization and recognition without oracle information (Task II). In this paper, we focus on Task II and propose a unified speech LLM that jointly performs diarization and ASR in an end-to-end manner. By reformulating the training data format and modifying the inference procedure, our model addresses the ambiguity inherent in pre-segmented audio and achieves a 54.87% relative improvement in tcpWER/tcpCER over the baseline, ranking 8th overall, despite using a smaller LLM backbone. We also report results from Task I using a fine-tuned speech LLM.
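For readers unfamiliar with the metric convention, a relative improvement in an error rate such as tcpWER/tcpCER is conventionally computed as the reduction divided by the baseline value. The helper below is a small sketch of that arithmetic; the function name and the example numbers are hypothetical and are not results from the paper.

```python
def relative_improvement(baseline_error, system_error):
    """Relative reduction of an error rate (lower is better), in percent."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - system_error) / baseline_error * 100.0

# Hypothetical values for illustration only: halving an error rate of 20.0
# corresponds to a 50% relative improvement.
print(relative_improvement(20.0, 10.0))
```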