66.4 | SD | May 24
Focus Then Listen: Exploring Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models
Han Yin, Yang Xiao, Younghoo Kwon et al.
Large audio language models (LALMs) are a class of foundation models for audio understanding. Existing LALMs tend to degrade significantly in real-world noisy acoustic conditions where speech and non-speech sounds interfere. While noise-aware fine-tuning can improve robustness, it requires task-specific noisy data and expensive retraining, limiting scalability. To address this issue, we propose Focus-Then-Listen (FTL), a plug-and-play audio enhancer that improves LALMs' noise robustness. Specifically, FTL first separates the input waveform into speech and non-speech components. A modality router then predicts the target audio modality (e.g., speech) based on the user's instruction. Finally, a modality-aware fusion block generates a task-adaptive enhanced signal for improved downstream perception and reasoning. Experiments across multiple LALMs and tasks show that FTL improves performance across different noise levels without fine-tuning the LALMs.
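The route-then-fuse idea in this abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the keyword-based `route`, the fixed 0.9/0.1 probabilities, the `floor` weight, and the linear mix in `fuse` are hypothetical stand-ins, and the separated stems are taken as given. The abstract does not say how FTL implements any of these stages.

```python
from typing import List

# Hypothetical keywords suggesting the instruction targets speech content.
SPEECH_KEYWORDS = ("speech", "speaker", "transcribe", "say")


def route(instruction: str) -> float:
    """Toy modality router: probability that speech is the target modality.

    A real router would be a learned model; this keyword rule is an assumption.
    """
    text = instruction.lower()
    return 0.9 if any(k in text for k in SPEECH_KEYWORDS) else 0.1


def fuse(speech: List[float], non_speech: List[float],
         p_speech: float, floor: float = 0.1) -> List[float]:
    """Toy fusion: emphasize the target stem, keep a small share of the other."""
    w_speech = max(p_speech, floor)
    w_other = max(1.0 - p_speech, floor)
    return [w_speech * s + w_other * n for s, n in zip(speech, non_speech)]


def focus_then_listen(speech: List[float], non_speech: List[float],
                      instruction: str) -> List[float]:
    """Route on the instruction, then fuse the two pre-separated stems."""
    return fuse(speech, non_speech, route(instruction))
```

The enhanced waveform would then be passed to an unmodified LALM in place of the noisy input, which is what makes the approach plug-and-play.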
SD | Oct 10, 2023
Noisy-ArcMix: Additive Noisy Angular Margin Loss Combined With Mixup Anomalous Sound Detection
Soonhyeon Choi, Jung-Woo Choi
Unsupervised anomalous sound detection (ASD) aims to identify anomalous sounds by learning the features of normal operational sounds and sensing their deviations. Recent approaches have focused on self-supervised tasks that classify normal data, and advanced models have shown that it is important to secure representation space for anomalous data through representation learning that yields compact intra-class and well-separated inter-class distributions. However, we show that conventional approaches often fail to ensure sufficient intra-class compactness and exhibit angular disparity between samples and their corresponding centers. In this paper, we propose a training technique aimed at ensuring intra-class compactness and increasing the angle gap between normal and abnormal samples. Furthermore, we present an architecture that extracts features for important temporal regions, enabling the model to learn which time frames should be emphasized or suppressed. Experimental results demonstrate that the proposed method achieves the best performance, giving 0.90%, 0.83%, and 2.16% improvements in AUC, pAUC, and mAUC, respectively, compared to the state-of-the-art method on the DCASE 2020 Challenge Task 2 dataset.
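For readers unfamiliar with the two ingredients named in the title, the following is a minimal sketch of a standard additive angular margin (ArcFace-style) loss combined with standard mixup. It is not the paper's Noisy-ArcMix formulation, which the abstract does not spell out; the function names, the default `margin` and `scale`, and the omission of the usual safeguard for angles exceeding π are all assumptions.

```python
import math
from typing import List


def arcface_logits(cosines: List[float], target: int,
                   margin: float = 0.5, scale: float = 30.0) -> List[float]:
    """Scale cosine similarities, adding an angular margin to the target class only."""
    logits = []
    for k, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # keep acos in its valid domain
        if k == target:
            c = math.cos(math.acos(c) + margin)
        logits.append(scale * c)
    return logits


def cross_entropy(logits: List[float], target: int) -> float:
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def mixup(x_i: List[float], x_j: List[float], lam: float) -> List[float]:
    """Convex combination of two inputs."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]


def arcface_mixup_loss(cosines: List[float], y_i: int, y_j: int, lam: float,
                       margin: float = 0.5, scale: float = 30.0) -> float:
    """Mixup-weighted sum of margin losses for the two source labels.

    `cosines` are the class-center similarities of the embedding of the mixed input.
    """
    loss_i = cross_entropy(arcface_logits(cosines, y_i, margin, scale), y_i)
    loss_j = cross_entropy(arcface_logits(cosines, y_j, margin, scale), y_j)
    return lam * loss_i + (1.0 - lam) * loss_j
```

The margin lowers the target logit, so a sample must sit closer to its class center to reach the same loss, which is the usual mechanism for encouraging intra-class compactness.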
SD | Apr 23, 2023
Sound-based drone fault classification using multitask learning
Wonjun Yi, Jung-Woo Choi, Jae-Woo Lee
Drones have been used for various purposes, including military applications, aerial photography, and pesticide spraying. However, drones are vulnerable to external disturbances, and malfunctions in propellers and motors can easily occur. To improve the safety of drone operations, one should detect the mechanical faults of drones in real time. This paper proposes a sound-based deep neural network (DNN) fault classifier and a drone sound dataset. The dataset was constructed by collecting the operating sounds of drones from microphones mounted on three different drones in an anechoic chamber. The dataset includes various operating conditions of drones, such as flight directions (front, back, right, left, clockwise, counterclockwise) and faults on propellers and motors. The drone sounds were then mixed with noises recorded in five different spots on the university campus, with a signal-to-noise ratio (SNR) varying from 10 dB to 15 dB. Using the acquired dataset, we train a DNN classifier, 1DCNN-ResNet, that classifies the types of mechanical faults and their locations from short-time input waveforms. We employ multitask learning (MTL) and incorporate the direction classification task as an auxiliary task to make the classifier learn more general audio features. Tests on unseen data reveal that the proposed multitask model can successfully classify faults in drones and outperforms single-task models even with less training data.
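Mixing a clean recording with noise at a prescribed SNR, as described for the dataset above, amounts to scaling the noise so that the power ratio matches the target. The sketch below shows that arithmetic; the function names are hypothetical and the authors' actual mixing procedure is not given in the abstract.

```python
import math
from typing import List, Tuple


def power(x: List[float]) -> float:
    """Mean squared amplitude of a waveform."""
    return sum(v * v for v in x) / len(x)


def measure_snr_db(signal: List[float], noise: List[float]) -> float:
    """SNR in decibels: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(power(signal) / power(noise))


def mix_at_snr(signal: List[float], noise: List[float],
               target_snr_db: float) -> Tuple[List[float], float]:
    """Scale the noise to hit the target SNR, then add it to the signal.

    Returns the mixture and the gain applied to the noise.
    """
    gain = math.sqrt(power(signal) / (power(noise) * 10.0 ** (target_snr_db / 10.0)))
    mixture = [s + gain * n for s, n in zip(signal, noise)]
    return mixture, gain
```

For the 10 dB to 15 dB range mentioned in the abstract, one could draw `target_snr_db` uniformly from that interval for each clip, although the sampling scheme used by the authors is not stated.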
AS | Sep 19, 2024
Multichannel-to-Multichannel Target Sound Extraction Using Direction and Timestamp Clues
Dayun Choi, Jung-Woo Choi
We propose a multichannel-to-multichannel target sound extraction (M2M-TSE) framework for separating multichannel target signals from a multichannel mixture of sound sources. Target sound extraction (TSE) isolates a specific target signal using user-provided clues, typically focusing on single-channel extraction with class labels or temporal activation maps. However, to preserve and utilize spatial information in multichannel audio signals, it is essential to extract multichannel signals of a target sound source. Moreover, the clue for extraction can also include spatial or temporal cues like direction-of-arrival (DoA) or timestamps of source activation. To address these challenges, we present an M2M framework that extracts a multichannel sound signal based on spatio-temporal clues. We demonstrate that our transformer-based architecture can successfully accomplish the M2M-TSE task for multichannel signals synthesized from audio signals of diverse classes in different room environments. Furthermore, we show that the multichannel extraction task introduces sufficient inductive bias in the DNN, allowing it to directly handle DoA clues without utilizing hand-crafted spatial features.
AS | Dec 23, 2025
DeepASA: An Object-Oriented Multi-Purpose Network for Auditory Scene Analysis
Dongheon Lee, Younghoo Kwon, Jung-Woo Choi
We propose DeepASA, a multi-purpose model for auditory scene analysis that performs multi-input multi-output (MIMO) source separation, dereverberation, sound event detection (SED), audio classification, and direction-of-arrival estimation (DoAE) within a unified framework. DeepASA is designed for complex auditory scenes where multiple, often similar, sound sources overlap in time and move dynamically in space. To achieve robust and consistent inference across tasks, we introduce an object-oriented processing (OOP) strategy. This approach encapsulates diverse auditory features into object-centric representations and refines them through a chain-of-inference (CoI) mechanism. The pipeline comprises a dynamic temporal kernel-based feature extractor, a transformer-based aggregator, and an object separator that yields per-object features. These features feed into multiple task-specific decoders. Our object-centric representations naturally resolve the parameter association ambiguity inherent in traditional track-wise processing. However, early-stage object separation can lead to failure in downstream ASA tasks. To address this, we implement temporal coherence matching (TCM) within the chain-of-inference, enabling multi-task fusion and iterative refinement of object features using estimated auditory parameters. We evaluate DeepASA on representative spatial audio benchmark datasets, including ASA2, MC-FUSS, and STARSS23. Experimental results show that our model achieves state-of-the-art performance across all evaluated tasks, demonstrating its effectiveness in both source separation and auditory parameter estimation under diverse spatial auditory scenes.
9.7 | AS | Mar 12
SEMamba++: A General Speech Restoration Framework Leveraging Global, Local, and Periodic Spectral Patterns
Yongjoon Lee, Jung-Woo Choi
General speech restoration demands techniques that can interpret complex speech structures under various distortions. While State-Space Models like SEMamba have advanced the state-of-the-art in speech denoising, they are not inherently optimized for critical speech characteristics, such as spectral periodicity or multi-resolution frequency analysis. In this work, we introduce an architecture tailored to incorporate speech-specific features as inductive biases. In particular, we propose Frequency GLP, a frequency feature extraction block that effectively and efficiently leverages the properties of frequency bins. Then, we design a multi-resolution parallel time-frequency dual-processing block to capture diverse spectral patterns, and a learnable mapping to further enhance model performance. With all our ideas combined, the proposed SEMamba++ achieves the best performance among multiple baseline models while remaining computationally efficient.
27.0 | AS | Mar 12
RAF: Relativistic Adversarial Feedback For Universal Speech Synthesis
Yongjoon Lee, Jung-Woo Choi
We propose Relativistic Adversarial Feedback (RAF), a novel training objective for GAN vocoders that improves in-domain fidelity and generalization to unseen scenarios. Although modern GAN vocoders employ advanced architectures, their training objectives often fail to promote generalizable representations. RAF addresses this problem by leveraging speech self-supervised learning models to assist discriminators in evaluating sample quality, encouraging the generator to learn richer representations. Furthermore, we utilize relativistic pairing for real and fake waveforms to improve the modeling of the training data distribution. Experiments across multiple datasets show consistent gains in both objective and subjective metrics on GAN-based vocoders. Importantly, the RAF-trained BigVGAN-base outperforms the LSGAN-trained BigVGAN in perceptual quality using only 12% of the parameters. Comparative studies further confirm the effectiveness of RAF as a training framework for GAN vocoders.
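To make "relativistic pairing" concrete, here is a minimal sketch of the generic relativistic GAN objective, in which the discriminator scores a real sample relative to a paired fake one instead of scoring each in isolation. This is the textbook relativistic standard GAN form and is only an assumption about the flavor of loss involved; the abstract does not give RAF's exact objective or how the self-supervised features enter it.

```python
import math
from typing import List


def softplus(z: float) -> float:
    """Stable log(1 + exp(z)); note that -log(sigmoid(d)) == softplus(-d)."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def discriminator_loss(real_scores: List[float], fake_scores: List[float]) -> float:
    """Push each real score above its paired fake score."""
    pairs = list(zip(real_scores, fake_scores))
    return sum(softplus(-(r - f)) for r, f in pairs) / len(pairs)


def generator_loss(real_scores: List[float], fake_scores: List[float]) -> float:
    """Push each fake score above its paired real score."""
    pairs = list(zip(real_scores, fake_scores))
    return sum(softplus(-(f - r)) for r, f in pairs) / len(pairs)
```

Because only score differences matter, the generator receives gradient signal from real samples as well as fake ones, which is the usual motivation for relativistic objectives.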
AS | Sep 23, 2025
SoundCompass: Navigating Target Sound Extraction With Effective Directional Clue Integration In Complex Acoustic Scenes
Dayun Choi, Jung-Woo Choi
Recent advances in target sound extraction (TSE) utilize directional clues derived from direction of arrival (DoA), which represent an inherent spatial property of sound available in any acoustic scene. However, previous DoA-based methods rely on hand-crafted features or discrete encodings, which lose fine-grained spatial information and limit adaptability. We propose SoundCompass, an effective directional clue integration framework centered on a Spectral Pairwise INteraction (SPIN) module that captures cross-channel spatial correlations in the complex spectrogram domain to preserve full spatial information in multichannel signals. The input feature expressed in terms of spatial correlations is fused with a DoA clue represented as spherical harmonics (SH) encoding. The fusion is carried out across overlapping frequency subbands, inheriting the benefits reported in the previous band-split architectures. We also incorporate the iterative refinement strategy, chain-of-inference (CoI), in the TSE framework, which recursively fuses DoA with sound event activation estimated from the previous inference stage. Experiments demonstrate that SoundCompass, combining SPIN, SH embedding, and CoI, robustly extracts target sources across diverse signal classes and spatial configurations.
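As an illustration of what a spherical harmonics encoding of a DoA clue looks like, the sketch below evaluates the real spherical harmonics up to first order for a given direction. The truncation order, the orthonormal normalization, the (W, Y, Z, X) channel ordering, and the azimuth/elevation convention are assumptions; the abstract does not state which order or convention SoundCompass uses.

```python
import math
from typing import List


def sh_encode_first_order(azimuth: float, elevation: float) -> List[float]:
    """Real, orthonormal spherical harmonics of orders 0 and 1 for one direction.

    Angles are in radians; azimuth is measured in the x-y plane from +x,
    elevation is measured upward from that plane.
    """
    # Unit vector pointing toward the source.
    x = math.cos(elevation) * math.cos(azimuth)
    y = math.cos(elevation) * math.sin(azimuth)
    z = math.sin(elevation)

    c0 = 1.0 / (2.0 * math.sqrt(math.pi))    # order-0 constant
    c1 = math.sqrt(3.0 / (4.0 * math.pi))    # order-1 constant
    return [c0, c1 * y, c1 * z, c1 * x]
```

Unlike a one-hot encoding over discretized angles, this vector varies smoothly with direction, which is consistent with the abstract's concern about discrete encodings losing fine-grained spatial information.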
AS | Sep 4, 2023
RGI-Net: 3D Room Geometry Inference from Room Impulse Responses With Hidden First-Order Reflections
Inmo Yeon, Jung-Woo Choi
Room geometry is important prior information for implementing realistic 3D audio rendering. For this reason, various room geometry inference (RGI) methods have been developed by utilizing the time-of-arrival (TOA) or time-difference-of-arrival (TDOA) information in room impulse responses (RIRs). However, conventional RGI techniques impose several assumptions, such as convex room shapes, the number of walls being known a priori, and the visibility of first-order reflections. In this work, we introduce RGI-Net, which can estimate room geometries without the aforementioned assumptions. RGI-Net learns and exploits complex relationships between low-order and high-order reflections in RIRs and, thus, can estimate room shapes even when the shape is non-convex or first-order reflections are missing in the RIRs. RGI-Net includes an evaluation network that separately evaluates the presence probability of walls, so geometry inference is possible without prior knowledge of the number of walls.
AS | Nov 8, 2021
Inter-channel Conv-TasNet for multichannel speech enhancement
Dongheon Lee, Seongrae Kim, Jung-Woo Choi
Speech enhancement in multichannel settings has been realized by utilizing the spatial information embedded in multiple microphone signals. Moreover, deep neural networks (DNNs) have recently advanced in this field; however, studies on efficient multichannel network structures that fully exploit spatial information and inter-channel relationships are still in their early stages. In this study, we propose an end-to-end time-domain speech enhancement network that can facilitate the use of inter-channel relationships at individual layers of a DNN. The proposed technique is based on a fully convolutional time-domain audio separation network (Conv-TasNet), originally developed for speech separation tasks. We extend Conv-TasNet into several forms that can handle multichannel input signals and learn inter-channel relationships. To this end, we modify the encoder-mask-decoder structures of the network to be compatible with 3-D tensors defined over spatial channels, features, and time dimensions. In particular, we conduct extensive parameter analyses on the convolution structure and propose independent assignment of the depthwise and 1×1 convolution layers to the feature and spatial dimensions, respectively. We demonstrate that the enriched inter-channel information from the proposed network plays a significant role in suppressing noisy signals impinging from various directions. The proposed inter-channel Conv-TasNet outperforms the state-of-the-art multichannel variants of neural networks, even with one-tenth of their parameter size. The proposed model is evaluated on the CHiME-3 dataset, where it exhibits a remarkable improvement in SDR, PESQ, and STOI.
SD | Feb 25, 2019
Robust Sound Source Localization considering Similarity of Back-Propagation Signals
Inkyu An, Doheon Lee, Byeongho Jo et al.
We present a novel, robust sound source localization algorithm considering back-propagation signals. Sound propagation paths are estimated by generating direct and reflection acoustic rays based on ray tracing in a backward manner. We then compute the back-propagation signals by designing and using the impulse response of the backward sound propagation based on the acoustic ray paths. For identifying the 3D source position, we suggest a localization method based on the Monte Carlo localization algorithm. Candidates for a source position are determined by identifying the convergence regions of acoustic ray paths. Each candidate is validated by measuring similarities between back-propagation signals, under the assumption that the back-propagation signals of different acoustic ray paths should be similar near the sound source position. Thanks to considering similarities of back-propagation signals, our approach can localize a source position with an average error of 0.51 m in a room of 7 m by 7 m area with 3 m height in tested environments. We also observe 65% to 220% improvement in accuracy over the state-of-the-art method. This improvement is achieved in environments containing a moving source, an obstacle, and noises.
RO | Sep 20, 2018
Diffraction-Aware Sound Localization for a Non-Line-of-Sight Source
Inkyu An, Doheon Lee, Jung-woo Choi et al.
We present a novel sound localization algorithm for a non-line-of-sight (NLOS) sound source in indoor environments. Our approach exploits the diffraction properties of sound waves as they bend around a barrier or an obstacle in the scene. We combine a ray tracing based sound propagation algorithm with a Uniform Theory of Diffraction (UTD) model, which simulates bending effects by placing a virtual sound source on a wedge in the environment. We precompute the wedges of a reconstructed mesh of an indoor scene and use them to generate diffraction acoustic rays to localize the 3D position of the source. Our method identifies the convergence region of those generated acoustic rays as the estimated source position based on a particle filter. We have evaluated our algorithm in multiple scenarios with static and dynamic NLOS sound sources. In our tested cases, our approach can localize a source position with an average error of 0.7 m, measured by the L2 distance between estimated and actual source locations in a 7 m × 7 m × 3 m room. Furthermore, we observe 37% to 130% improvement in accuracy over a state-of-the-art localization method that does not model diffraction effects, especially when a sound source is not visible to the robot.