Phurich Saengthong

SD
h-index: 17
3 papers
13 citations
Novelty: 52%
AI Score: 43

3 Papers

SD | Sep 8, 2024 | Code
Deep Generic Representations for Domain-Generalized Anomalous Sound Detection

Phurich Saengthong, Takahiro Shinozaki

Developing a reliable anomalous sound detection (ASD) system requires robustness to noise, adaptation to domain shifts, and effective performance with limited training data. Current leading methods rely on extensive labeled data for each target machine type to train feature extractors using Outlier-Exposure (OE) techniques, yet their performance on the target domain remains sub-optimal. In this paper, we present GenRep, which utilizes generic feature representations from a robust, large-scale pre-trained feature extractor combined with kNN for domain-generalized ASD, without the need for fine-tuning. GenRep incorporates MemMixup, a simple approach for augmenting the target memory bank using nearest source samples, paired with a domain normalization technique to address the imbalance between source and target domains. GenRep outperforms the best OE-based approach without requiring labeled data, achieving an Official Score of 73.79% on the DCASE2023T2 Eval set, and demonstrates robustness in limited-data scenarios. The code is available as open source.
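As a rough illustration of the memory-bank idea described in this abstract, the sketch below scores a query embedding by its cosine distance to the nearest entries of a memory bank and augments a small target bank by mixing each target embedding with its nearest source embeddings. It is not the authors' implementation: the function names, the mixing weight `lam`, the neighbor counts, and the exact mixing rule are all assumptions made for illustration, and the domain normalization step is not shown.

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn_score(query, memory_bank, k=1):
    """Anomaly score: mean cosine distance to the k nearest memory-bank vectors."""
    dists = sorted(cosine_distance(query, m) for m in memory_bank)
    k = min(k, len(dists))
    return sum(dists[:k]) / k

def mixup_augment(target_bank, source_bank, lam=0.9, n_neighbors=1):
    """Hypothetical MemMixup-style augmentation (mixing rule is an assumption).

    Each target vector is interpolated with its nearest source vectors and the
    mixed vectors are appended to a copy of the target bank.
    """
    augmented = list(target_bank)
    for t in target_bank:
        nearest = sorted(source_bank, key=lambda s: cosine_distance(t, s))[:n_neighbors]
        for s in nearest:
            augmented.append([lam * x + (1.0 - lam) * y for x, y in zip(t, s)])
    return augmented

# Toy embeddings standing in for features from a pre-trained extractor.
source_bank = [[1.0, 0.0], [0.9, 0.1]]
target_bank = [[0.0, 1.0]]
augmented_bank = mixup_augment(target_bank, source_bank)
normal_like = knn_score([0.0, 1.0], augmented_bank)
anomalous_like = knn_score([1.0, 0.0], augmented_bank)
```

In this toy setup, a query that coincides with a stored target embedding receives a score of zero, while a query far from every stored vector receives a larger score. No training or fine-tuning is involved; only the memory bank changes.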

SD | Mar 14
Sub-Band Spectral Matching with Localized Score Aggregation for Robust Anomalous Sound Detection

Phurich Saengthong, Takahiro Shinozaki

Detecting subtle deviations in noisy acoustic environments is central to anomalous sound detection (ASD). A common training-free ASD pipeline temporally pools frame-level representations into a band-preserving feature vector and scores anomalies using a single nearest-neighbor match. However, this global matching can inflate normal-score variance through two effects. First, when normal sounds exhibit band-wise variability, a single global neighbor forces all bands to share the same reference, increasing band-level mismatch. Second, cosine-based matching is energy-coupled, allowing a few high-energy bands to dominate score computation under normal energy fluctuations and further increase variance. We propose BEAM, which stores temporally pooled sub-band vectors in a memory bank, retrieves neighbors per sub-band, and uniformly aggregates scores to reduce normal-score variability and improve discriminability. We further introduce a parameter-free adaptive fusion to better handle diverse temporal dynamics in sub-band responses. Experiments on multiple DCASE Task 2 benchmarks show strong performance without task-specific training, robustness to noise and domain shifts, and complementary gains when combined with encoder fine-tuning.
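The difference between global matching and per-sub-band matching can be sketched as follows. This is a minimal illustration under stated assumptions, not the BEAM implementation: each feature is assumed to be a list of temporally pooled sub-band vectors, the function names are hypothetical, and the parameter-free adaptive fusion mentioned in the abstract is not shown.

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def global_score(query, memory_bank):
    """Single nearest-neighbor match on the concatenation of all sub-bands."""
    def flatten(feature):
        return [x for band in feature for x in band]
    q = flatten(query)
    return min(cosine_distance(q, flatten(ref)) for ref in memory_bank)

def subband_score(query, memory_bank):
    """Nearest neighbor retrieved per sub-band, then a uniform average.

    Each band is normalized inside its own cosine distance, so a high-energy
    band cannot outweigh the others in the aggregate.
    """
    n_bands = len(query)
    total = 0.0
    for b in range(n_bands):
        total += min(cosine_distance(query[b], ref[b]) for ref in memory_bank)
    return total / n_bands

# Two normal references that differ band by band (toy values).
ref_a = [[1.0, 0.0], [1.0, 0.0]]
ref_b = [[0.0, 1.0], [0.0, 1.0]]
memory_bank = [ref_a, ref_b]

# A normal query whose first band resembles ref_a and second band ref_b.
query = [[1.0, 0.0], [0.0, 1.0]]
g = global_score(query, memory_bank)
s = subband_score(query, memory_bank)
```

Here the query is built entirely from band patterns seen in normal data, yet no single reference matches it in every band. Global matching therefore assigns it a noticeably non-zero score, while per-band retrieval finds an exact reference for each band and scores it at zero, which is the reduction in normal-score variability the abstract describes.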

CL | Jun 26, 2025
A Unified Speech LLM for Diarization and Speech Recognition in Multilingual Conversations

Phurich Saengthong, Boonnithi Jiaramaneepinit, Sheng Li et al.

Speech Large Language Models (Speech LLMs) have emerged as a crucial paradigm in recent years, extending the capabilities of traditional LLMs to speech tasks such as automatic speech recognition (ASR) and spoken dialogue modeling. However, their effectiveness in real-world multilingual conversations remains limited by the scarcity of data that captures natural conversational phenomena. To address this, the MLC-SLM Challenge provides a multilingual conversational dataset and evaluates models on two tasks: ASR with oracle segmentation (Task I) and joint diarization and recognition without oracle information (Task II). In this paper, we focus on Task II and propose a unified speech LLM that jointly performs diarization and ASR in an end-to-end manner. By reformulating the training data format and modifying the inference procedure, our model addresses the ambiguity inherent in pre-segmented audio and achieves a 54.87% relative improvement in tcpWER/tcpCER over the baseline, ranking 8th overall, despite using a smaller LLM backbone. We also report results from Task I using a fine-tuned speech LLM.