Hyeonuk Nam

AS
h-index4
5papers
73citations
Novelty54%
AI Score37

5 Papers

ASJun 8, 2023
VIFS: An End-to-End Variational Inference for Foley Sound Synthesis

Junhyeok Lee, Hyeonuk Nam, Yong-Hwa Park

The goal of DCASE 2023 Challenge Task 7 is to generate various sound clips for Foley sound synthesis (FSS) by "category-to-sound" approach. "Category" is expressed by a single index while corresponding "sound" covers diverse and different sound examples. To generate diverse sounds for a given category, we adopt VITS, a text-to-speech (TTS) model with variational inference. In addition, we apply various techniques from speech synthesis including PhaseAug and Avocodo. Different from TTS models which generate short pronunciation from phonemes and speaker identity, the category-to-sound problem requires generating diverse sounds just from a category index. To compensate for the difference while maintaining consistency within each audio clip, we heavily modified the prior encoder to enhance consistency with posterior latent variables. This introduced additional Gaussian on the prior encoder which promotes variance within the category. With these modifications, we propose VIFS, variational inference for end-to-end Foley sound synthesis, which generates diverse high-quality sounds.

ASJul 8, 2021Code
Heavily Augmented Sound Event Detection utilizing Weak Predictions

Hyeonuk Nam, Byeong-Yun Ko, Gyeong-Tae Lee et al.

The performances of Sound Event Detection (SED) systems are greatly limited by the difficulty in generating large strongly labeled dataset. In this work, we used two main approaches to overcome the lack of strongly labeled data. First, we applied heavy data augmentation on input features. Data augmentation methods used include not only conventional methods used in speech/audio domains but also our proposed method named FilterAugment. Second, we propose two methods to utilize weak predictions to enhance weakly supervised SED performance. As a result, we obtained the best PSDS1 of 0.4336 and best PSDS2 of 0.8161 on the DESED real validation dataset. This work is submitted to DCASE 2021 Task4 and is ranked on the 3rd place. Code availa-ble: https://github.com/frednam93/FilterAugSED.

ASAug 11, 2025
Auditory Intelligence: Understanding the World Through Sound

Hyeonuk Nam

Recent progress in auditory intelligence has yielded high-performing systems for sound event detection (SED), acoustic scene classification (ASC), automated audio captioning (AAC), and audio question answering (AQA). Yet these tasks remain largely constrained to surface-level recognition-capturing what happened but not why, what it implies, or how it unfolds in context. I propose a conceptual reframing of auditory intelligence as a layered, situated process that encompasses perception, reasoning, and interaction. To instantiate this view, I introduce four cognitively inspired task paradigms-ASPIRE, SODA, AUX, and AUGMENT-those structure auditory understanding across time-frequency pattern captioning, hierarchical event/scene description, causal explanation, and goal-driven interpretation, respectively. Together, these paradigms provide a roadmap toward more generalizable, explainable, and human-aligned auditory intelligence, and are intended to catalyze a broader discussion of what it means for machines to understand sound.

SPMay 29, 2025
Multi-output Classification using a Cross-talk Architecture for Compound Fault Diagnosis of Motors in Partially Labeled Condition

Wonjun Yi, Wonho Jung, Hyeonuk Nam et al.

The increasing complexity of rotating machinery and the diversity of operating conditions, such as rotating speed and varying torques, have amplified the challenges in fault diagnosis in scenarios requiring domain adaptation, particularly involving compound faults. This study addresses these challenges by introducing a novel multi-output classification (MOC) framework tailored for domain adaptation in partially labeled target datasets. Unlike conventional multi-class classification (MCC) approaches, the MOC framework classifies the severity levels of compound faults simultaneously. Furthermore, we explore various single-task and multi-task architectures applicable to the MOC formulation-including shared trunk and cross-talk-based designs-for compound fault diagnosis under partially labeled conditions. Based on this investigation, we propose a novel cross-talk architecture, residual neural dimension reductor (RNDR), that enables selective information sharing across diagnostic tasks, effectively enhancing classification performance in compound fault scenarios. In addition, frequency-layer normalization was incorporated to improve domain adaptation performance on motor vibration data. Compound fault conditions were implemented using a motor-based test setup and evaluated across six domain adaptation scenarios. The experimental results demonstrate its superior macro F1 performance compared to baseline models. We further showed that the structural advantage of RNDR is more pronounced in compound fault settings through a single-fault comparison. We also found that frequency-layer normalization fits the fault diagnosis task better than conventional methods. Lastly, we analyzed the RNDR with various conditions, other models with increased number of parameters, and compared with the ablated RNDR structure.

ASJul 28, 2021
Deep learning based cough detection camera using enhanced features

Gyeong-Tae Lee, Hyeonuk Nam, Seong-Hu Kim et al.

Coughing is a typical symptom of COVID-19. To detect and localize coughing sounds remotely, a convolutional neural network (CNN) based deep learning model was developed in this work and integrated with a sound camera for the visualization of the cough sounds. The cough detection model is a binary classifier of which the input is a two second acoustic feature and the output is one of two inferences (Cough or Others). Data augmentation was performed on the collected audio files to alleviate class imbalance and reflect various background noises in practical environments. For effective featuring of the cough sound, conventional features such as spectrograms, mel-scaled spectrograms, and mel-frequency cepstral coefficients (MFCC) were reinforced by utilizing their velocity (V) and acceleration (A) maps in this work. VGGNet, GoogLeNet, and ResNet were simplified to binary classifiers, and were named V-net, G-net, and R-net, respectively. To find the best combination of features and networks, training was performed for a total of 39 cases and the performance was confirmed using the test F1 score. Finally, a test F1 score of 91.9% (test accuracy of 97.2%) was achieved from G-net with the MFCC-V-A feature (named Spectroflow), an acoustic feature effective for use in cough detection. The trained cough detection model was integrated with a sound camera (i.e., one that visualizes sound sources using a beamforming microphone array). In a pilot test, the cough detection camera detected coughing sounds with an F1 score of 90.0% (accuracy of 96.0%), and the cough location in the camera image was tracked in real time.