SDMar 7, 2023Code
An Inception-Residual-Based Architecture with Multi-Objective Loss for Detecting Respiratory AnomaliesDat Ngo, Lam Pham, Huy Phan et al.
This paper presents a deep learning system applied for detecting anomalies from respiratory sound recordings. Initially, our system begins with audio feature extraction using Gammatone and Continuous Wavelet transformation. This step aims to transform the respiratory sound input into a two-dimensional spectrogram where both spectral and temporal features are presented. Then, our proposed system integrates Inception-residual-based backbone models combined with multi-head attention and multi-objective loss to classify respiratory anomalies. Instead of applying a simple concatenation approach by combining results from various spectrograms, we propose a Linear combination, which has the ability to regulate equally the contribution of each individual spectrogram throughout the training process. To evaluate the performance, we conducted experiments over the benchmark dataset of SPRSound (The Open-Source SJTU Paediatric Respiratory Sound) proposed by the IEEE BioCAS 2022 challenge. As regards the Score computed by an average between the average score and harmonic score, our proposed system gained significant improvements of 9.7%, 15.8%, 17.8%, and 16.1% in Task 1-1, Task 1-2, Task 2-1, and Task 2-2, respectively, compared to the challenge baseline system. Notably, we achieved the Top-1 performance in Task 2-1 and Task 2-2 with the highest Score of 74.5% and 53.9%, respectively.
SDMar 23, 2022
Wider or Deeper Neural Network Architecture for Acoustic Scene Classification with Mismatched Recording DevicesLam Pham, Khoa Dinh, Dat Ngo et al.
In this paper, we present a robust and low complexity system for Acoustic Scene Classification (ASC), the task of identifying the scene of an audio recording. We first construct an ASC baseline system in which a novel inception-residual-based network architecture is proposed to deal with the mismatched recording device issue. To further improve the performance but still satisfy the low complexity model, we apply two techniques: ensemble of multiple spectrograms and channel reduction on the ASC baseline system. By conducting extensive experiments on the benchmark DCASE 2020 Task 1A Development dataset, we achieve the best model performing an accuracy of 69.9% and a low complexity of 2.4M trainable parameters, which is competitive to the state-of-the-art ASC systems and potential for real-life applications on edge devices.
CVJun 20, 2022
Remote Sensing Image Classification using Transfer Learning and Attention Based Deep Neural NetworkLam Pham, Khoa Tran, Dat Ngo et al.
The task of remote sensing image scene classification (RSISC), which aims at classifying remote sensing images into groups of semantic categories based on their contents, has taken the important role in a wide range of applications such as urban planning, natural hazards detection, environment monitoring,vegetation mapping, or geospatial object detection. During the past years, research community focusing on RSISC task has shown significant effort to publish diverse datasets as well as propose different approaches to deal with the RSISC challenges. Recently, almost proposed RSISC systems base on deep learning models which prove powerful and outperform traditional approaches using image processing and machine learning. In this paper, we also leverage the power of deep learning technology, evaluate a variety of deep neural network architectures, indicate main factors affecting the performance of a RSISC system. Given the comprehensive analysis, we propose a deep learning based framework for RSISC, which makes use of the transfer learning technique and multihead attention scheme. The proposed deep learning framework is evaluated on the benchmark NWPU-RESISC45 dataset and achieves the best classification accuracy of 94.7% which shows competitive to the state-of-the-art systems and potential for real-life applications.
SDOct 16, 2022
Robust, General, and Low Complexity Acoustic Scene Classification Systems and An Effective Visualization for Presenting a Sound Scene ContextLam Pham, Dusan Salovic, Anahid Jalali et al.
In this paper, we present a comprehensive analysis of Acoustic Scene Classification (ASC), the task of identifying the scene of an audio recording from its acoustic signature. In particular, we firstly propose an inception-based and low footprint ASC model, referred to as the ASC baseline. The proposed ASC baseline is then compared with benchmark and high-complexity network architectures of MobileNetV1, MobileNetV2, VGG16, VGG19, ResNet50V2, ResNet152V2, DenseNet121, DenseNet201, and Xception. Next, we improve the ASC baseline by proposing a novel deep neural network architecture which leverages residual-inception architectures and multiple kernels. Given the novel residual-inception (NRI) model, we further evaluate the trade off between the model complexity and the model accuracy performance. Finally, we evaluate whether sound events occurring in a sound scene recording can help to improve ASC accuracy, then indicate how a sound scene context is well presented by combining both sound scene and sound event information. We conduct extensive experiments on various ASC datasets, including Crowded Scenes, IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 Task 1A and 1B, 2019 Task 1A and 1B, 2020 Task 1A, 2021 Task 1A, 2022 Task 1. The experimental results on several different ASC challenges highlight two main achievements; the first is to propose robust, general, and low complexity ASC systems which are suitable for real-life applications on a wide range of edge devices and mobiles; the second is to propose an effective visualization method for comprehensively presenting a sound scene context.
CVFeb 25, 2023
A Light-weight Deep Learning Model for Remote Sensing Image ClassificationLam Pham, Cam Le, Dat Ngo et al.
In this paper, we present a high-performance and light-weight deep learning model for Remote Sensing Image Classification (RSIC), the task of identifying the aerial scene of a remote sensing image. To this end, we first valuate various benchmark convolutional neural network (CNN) architectures: MobileNet V1/V2, ResNet 50/151V2, InceptionV3/InceptionResNetV2, EfficientNet B0/B7, DenseNet 121/201, ConNeXt Tiny/Large. Then, the best performing models are selected to train a compact model in a teacher-student arrangement. The knowledge distillation from the teacher aims to achieve high performance with significantly reduced complexity. By conducting extensive experiments on the NWPU-RESISC45 benchmark, our proposed teacher-student models outperforms the state-of-the-art systems, and has potential to be applied on a wide rage of edge devices.
CVNov 5, 2022
A Robust and Low Complexity Deep Learning Model for Remote Sensing Image ClassificationCam Le, Lam Pham, Nghia NVN et al.
In this paper, we present a robust and low complexity deep learning model for Remote Sensing Image Classification (RSIC), the task of identifying the scene of a remote sensing image. In particular, we firstly evaluate different low complexity and benchmark deep neural networks: MobileNetV1, MobileNetV2, NASNetMobile, and EfficientNetB0, which present the number of trainable parameters lower than 5 Million (M). After indicating best network architecture, we further improve the network performance by applying attention schemes to multiple feature maps extracted from middle layers of the network. To deal with the issue of increasing the model footprint as using attention schemes, we apply the quantization technique to satisfy the maximum of 20 MB memory occupation. By conducting extensive experiments on the benchmark datasets NWPU-RESISC45, we achieve a robust and low-complexity model, which is very competitive to the state-of-the-art systems and potential for real-life applications on edge devices.
SDJun 13, 2022
Low-complexity deep learning frameworks for acoustic scene classificationLam Pham, Dat Ngo, Anahid Jalali et al.
In this report, we presents low-complexity deep learning frameworks for acoustic scene classification (ASC). The proposed frameworks can be separated into four main steps: Front-end spectrogram extraction, online data augmentation, back-end classification, and late fusion of predicted probabilities. In particular, we initially transform audio recordings into Mel, Gammatone, and CQT spectrograms. Next, data augmentation methods of Random Cropping, Specaugment, and Mixup are then applied to generate augmented spectrograms before being fed into deep learning based classifiers. Finally, to achieve the best performance, we fuse probabilities which obtained from three individual classifiers, which are independently-trained with three type of spectrograms. Our experiments conducted on DCASE 2022 Task 1 Development dataset have fullfiled the requirement of low-complexity and achieved the best classification accuracy of 60.1%, improving DCASE baseline by 17.2%.
LGSep 12, 2023
Robust-MBDL: A Robust Multi-branch Deep Learning Based Model for Remaining Useful Life Prediction and Operational Condition Identification of Rotating MachinesKhoa Tran, Hai-Canh Vu, Lam Pham et al.
In this paper, a Robust Multi-branch Deep learning-based system for remaining useful life (RUL) prediction and condition operations (CO) identification of rotating machines is proposed. In particular, the proposed system comprises main components: (1) an LSTM-Autoencoder to denoise the vibration data; (2) a feature extraction to generate time-domain, frequency-domain, and time-frequency based features from the denoised data; (3) a novel and robust multi-branch deep learning network architecture to exploit the multiple features. The performance of our proposed system was evaluated and compared to the state-of-the-art systems on two benchmark datasets of XJTU-SY and PRONOSTIA. The experimental results prove that our proposed system outperforms the state-of-the-art systems and presents potential for real-life applications on bearing machines.
SDJul 1, 2024
Deepfake Audio Detection Using Spectrogram-based Feature and Ensemble of Deep Learning ModelsLam Pham, Phat Lam, Truong Nguyen et al.
In this paper, we propose a deep learning based system for the task of deepfake audio detection. In particular, the draw input audio is first transformed into various spectrograms using three transformation methods of Short-time Fourier Transform (STFT), Constant-Q Transform (CQT), Wavelet Transform (WT) combined with different auditory-based filters of Mel, Gammatone, linear filters (LF), and discrete cosine transform (DCT). Given the spectrograms, we evaluate a wide range of classification models based on three deep learning approaches. The first approach is to train directly the spectrograms using our proposed baseline models of CNN-based model (CNN-baseline), RNN-based model (RNN-baseline), C-RNN model (C-RNN baseline). Meanwhile, the second approach is transfer learning from computer vision models such as ResNet-18, MobileNet-V3, EfficientNet-B0, DenseNet-121, SuffleNet-V2, Swint, Convnext-Tiny, GoogLeNet, MNASsnet, RegNet. In the third approach, we leverage the state-of-the-art audio pre-trained models of Whisper, Seamless, Speechbrain, and Pyannote to extract audio embeddings from the input spectrograms. Then, the audio embeddings are explored by a Multilayer perceptron (MLP) model to detect the fake or real audio samples. Finally, high-performance deep learning models from these approaches are fused to achieve the best performance. We evaluated our proposed models on ASVspoof 2019 benchmark dataset. Our best ensemble model achieved an Equal Error Rate (EER) of 0.03, which is highly competitive to top-performing systems in the ASVspoofing 2019 challenge. Experimental results also highlight the potential of selective spectrograms and deep learning approaches to enhance the task of audio deepfake detection.
LGOct 17, 2023
A Robust Deep Learning System for Motor Bearing Fault Detection: Leveraging Multiple Learning Strategies and a Novel Double Loss FunctionKhoa Tran, Lam Pham, Vy-Rin Nguyen et al.
Motor bearing fault detection (MBFD) is critical for maintaining the reliability and operational efficiency of industrial machinery. Early detection of bearing faults can prevent system failures, reduce operational downtime, and lower maintenance costs. In this paper, we propose a robust deep learning-based system for MBFD that incorporates multiple training strategies, including supervised, semi-supervised, and unsupervised learning. To enhance the detection performance, we introduce a novel double loss function. Our approach is evaluated using benchmark datasets from the American Society for Mechanical Failure Prevention Technology (MFPT), Case Western Reserve University Bearing Center (CWRU), and Paderborn University's Condition Monitoring of Bearing Damage in Electromechanical Drive Systems (PU). Results demonstrate that deep learning models outperform traditional machine learning techniques, with our novel system achieving superior accuracy across all datasets. These findings highlight the potential of our approach for practical MBFD applications.
31.9ASMar 16
Spectrogram features for audio and speech analysisIan McLoughlin, Lam Pham, Yan Song et al.
Spectrogram-based representations have grown to dominate the feature space for deep learning audio analysis systems, and are often adopted for speech analysis also. Initially, the primary motivator for spectrogram-based representations was their ability to present sound as a two dimensional signal in the time-frequency plane, which not only provides an interpretable physical basis for analysing sound, but also unlocks the use of a wide range of machine learning techniques such as convolutional neural networks, that had been developed for image processing. A spectrogram is a matrix characterised by the resolution and span of its two dimensions, as well as by the representation and scaling of each element. Many possibilities for these three characteristics have been explored by researchers across numerous application areas, with different settings showing affinity for various tasks. This paper reviews the use of spectrogram-based representations and surveys the state-of-the-art to question how front-end feature representation choice allies with back-end classifier architecture for different tasks.
40.8SDMar 29
A General Model for Deepfake Speech Detection: Diverse Bonafide Resources or Diverse AI-Based GeneratorsLam Pham, Khoi Vu, Dat Tran et al.
In this paper, we analyze two main factors of Bonafide Resource (BR) or AI-based Generator (AG) which affect the performance and the generality of a Deepfake Speech Detection (DSD) model. To this end, we first propose a deep-learning based model, referred to as the baseline. Then, we conducted experiments on the baseline by which we indicate how Bonafide Resource (BR) and AI-based Generator (AG) factors affect the threshold score used to detect fake or bonafide input audio in the inference process. Given the experimental results, a dataset, which re-uses public Deepfake Speech Detection (DSD) datasets and shows a balance between Bonafide Resource (BR) or AI-based Generator (AG), is proposed. We then train various deep-learning based models on the proposed dataset and conduct cross-dataset evaluation on different benchmark datasets. The cross-dataset evaluation results prove that the balance of Bonafide Resources (BR) and AI-based Generators (AG) is the key factor to train and achieve a general Deepfake Speech Detection (DSD) model.
LGJan 15, 2020Code
Improving GANs for Speech EnhancementHuy Phan, Ian V. McLoughlin, Lam Pham et al.
Generative adversarial networks (GAN) have recently been shown to be efficient for speech enhancement. However, most, if not all, existing speech enhancement GANs (SEGAN) make use of a single generator to perform one-stage enhancement mapping. In this work, we propose to use multiple generators that are chained to perform multi-stage enhancement mapping, which gradually refines the noisy input signals in a stage-wise fashion. Furthermore, we study two scenarios: (1) the generators share their parameters and (2) the generators' parameters are independent. The former constrains the generators to learn a common mapping that is iteratively applied at all enhancement stages and results in a small model footprint. On the contrary, the latter allows the generators to flexibly learn different enhancement mappings at different stages of the network at the cost of an increased model size. We demonstrate that the proposed multi-stage enhancement approach outperforms the one-stage SEGAN baseline, where the independent generators lead to more favorable results than the tied generators. The source code is available at http://github.com/pquochuy/idsegan.
CVDec 27, 2023
Landslide Detection and Segmentation Using Remote Sensing Images and Deep Neural NetworkCam Le, Lam Pham, Jasmin Lampert et al.
Knowledge about historic landslide event occurrence is important for supporting disaster risk reduction strategies. Building upon findings from 2022 Landslide4Sense Competition, we propose a deep neural network based system for landslide detection and segmentation from multisource remote sensing image input. We use a U-Net trained with Cross Entropy loss as baseline model. We then improve the U-Net baseline model by leveraging a wide range of deep learning techniques. In particular, we conduct feature engineering by generating new band data from the original bands, which helps to enhance the quality of remote sensing image input. Regarding the network architecture, we replace traditional convolutional layers in the U-Net baseline by a residual-convolutional layer. We also propose an attention layer which leverages the multi-head attention scheme. Additionally, we generate multiple output masks with three different resolutions, which creates an ensemble of three outputs in the inference process to enhance the performance. Finally, we propose a combined loss function which leverages Focal loss and IoU loss to train the network. Our experiments on the development set of the Landslide4Sense challenge achieve an F1 score and an mIoU score of 84.07 and 76.07, respectively. Our best model setup outperforms the challenge baseline and the proposed U-Net baseline, improving the F1 score/mIoU score by 6.8/7.4 and 10.5/8.8, respectively.
16.7SDApr 21
Environmental Sound Deepfake Detection Using Deep-Learning FrameworkLam Pham, Khoi Vu, Dat Tran et al.
In this paper, we propose a deep-learning framework for environmental sound deepfake detection (ESDD) -- the task of identifying whether the sound scene and sound event in an input audio recording is fake or not. To this end, we conducted extensive experiments to explore how individual spectrograms, a wide range of network architectures and pre-trained models, ensemble of spectrograms or network architectures affect the ESDD task performance. The experimental results on the benchmark datasets of EnvSDD and ESDD-Challenge-TestSet indicate that detecting deepfake audio of sound scene and detecting deepfake audio of sound event should be considered as individual tasks. We also indicate that the approach of finetuning a pre-trained model is more effective compared with training a model from scratch for the ESDD task. Eventually, our best model, which was finetuned from the pre-trained WavLM model with the proposed three-stage training strategy, achieve the Accuracy of 0.98, F1 Score of 0.95, AuC of 0.99 on EnvSDD Test subset and the Accuracy of 0.88, F1 Score of 0.77, and AuC of 0.92 on ESDD-Challenge-TestSet dataset.
LGMar 27, 2025
HybridoNet-Adapt: A Domain-Adapted Framework for Accurate Lithium-Ion Battery RUL PredictionKhoa Tran, Bao Huynh, Tri Le et al.
Accurate prediction of the Remaining Useful Life (RUL) in Lithium ion battery (LIB) health management systems is essential for ensuring operational reliability and safety. However, many existing methods assume that training and testing data follow the same distribution, limiting their ability to generalize to unseen target domains. To address this, we propose a novel RUL prediction framework that incorporates a domain adaptation (DA) technique. Our framework integrates a signal preprocessing pipeline including noise reduction, feature extraction, and normalization with a robust deep learning model called HybridoNet Adapt. The model features a combination of LSTM, Multihead Attention, and Neural ODE layers for feature extraction, followed by two predictor modules with trainable trade-off parameters. To improve generalization, we adopt a DA strategy inspired by Domain Adversarial Neural Networks (DANN), replacing adversarial loss with Maximum Mean Discrepancy (MMD) to learn domain-invariant features. Experimental results show that HybridoNet Adapt significantly outperforms traditional models such as XGBoost and Elastic Net, as well as deep learning baselines like Dual input DNN, demonstrating its potential for scalable and reliable battery health management (BHM).
CLJan 29, 2024
LSTM-based Deep Neural Network With A Focus on Sentence Representation for Sequential Sentence Classification in Medical Scientific AbstractsPhat Lam, Lam Pham, Tin Nguyen et al.
The Sequential Sentence Classification task within the domain of medical abstracts, termed as SSC, involves the categorization of sentences into pre-defined headings based on their roles in conveying critical information in the abstract. In the SSC task, sentences are sequentially related to each other. For this reason, the role of sentence embeddings is crucial for capturing both the semantic information between words in the sentence and the contextual relationship of sentences within the abstract, which then enhances the SSC system performance. In this paper, we propose a LSTM-based deep learning network with a focus on creating comprehensive sentence representation at the sentence level. To demonstrate the efficacy of the created sentence representation, a system utilizing these sentence embeddings is also developed, which consists of a Convolutional-Recurrent neural network (C-RNN) at the abstract level and a multi-layer perception network (MLP) at the segment level. Our proposed system yields highly competitive results compared to state-of-the-art systems and further enhances the F1 scores of the baseline by 1.0%, 2.8%, and 2.6% on the benchmark datasets PudMed 200K RCT, PudMed 20K RCT and NICTA-PIBOSO, respectively. This indicates the significant impact of improving sentence representation on boosting model performance.
CVJul 15, 2025
RMAU-NET: A Residual-Multihead-Attention U-Net Architecture for Landslide Segmentation and Detection from Remote Sensing ImagesLam Pham, Cam Le, Hieu Tang et al.
In recent years, landslide disasters have reported frequently due to the extreme weather events of droughts, floods , storms, or the consequence of human activities such as deforestation, excessive exploitation of natural resources. However, automatically observing landslide is challenging due to the extremely large observing area and the rugged topography such as mountain or highland. This motivates us to propose an end-to-end deep-learning-based model which explores the remote sensing images for automatically observing landslide events. By considering remote sensing images as the input data, we can obtain free resource, observe large and rough terrains by time. To explore the remote sensing images, we proposed a novel neural network architecture which is for two tasks of landslide detection and landslide segmentation. We evaluated our proposed model on three different benchmark datasets of LandSlide4Sense, Bijie, and Nepal. By conducting extensive experiments, we achieve F1 scores of 98.23, 93.83 for the landslide detection task on LandSlide4Sense, Bijie datasets; mIoU scores of 63.74, 76.88 on the segmentation tasks regarding LandSlide4Sense, Nepal datasets. These experimental results prove potential to integrate our proposed model into real-life landslide observation systems.
SDMay 2, 2024
A Toolchain for Comprehensive Audio/Video Analysis Using Deep Learning Based Multimodal Approach (A use case of riot or violent context detection)Lam Pham, Phat Lam, Tin Nguyen et al.
In this paper, we present a toolchain for a comprehensive audio/video analysis by leveraging deep learning based multimodal approach. To this end, different specific tasks of Speech to Text (S2T), Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), Visual Object Detection (VOD), Image Captioning (IC), and Video Captioning (VC) are conducted and integrated into the toolchain. By combining individual tasks and analyzing both audio \& visual data extracted from input video, the toolchain offers various audio/video-based applications: Two general applications of audio/video clustering, comprehensive audio/video summary and a specific application of riot or violent context detection. Furthermore, the toolchain presents a flexible and adaptable architecture that is effective to integrate new models for further audio/video-based applications.
CVJul 17, 2025
A Deep-Learning Framework for Land-Sliding Classification from Remote Sensing ImageHieu Tang, Truong Vo, Dong Pham et al.
The use of satellite imagery combined with deep learning to support automatic landslide detection is becoming increasingly widespread. However, selecting an appropriate deep learning architecture to optimize performance while avoiding overfitting remains a critical challenge. To address these issues, we propose a deep-learning based framework for landslide detection from remote sensing image in this paper. The proposed framework presents an effective combination of the online an offline data augmentation to tackle the imbalanced data, a backbone EfficientNet\_Large deep learning model for extracting robust embedding features, and a post-processing SVM classifier to balance and enhance the classification performance. The proposed model achieved an F1-score of 0.8938 on the public test set of the Zindi challenge.
CVJul 17, 2025
SEMT: Static-Expansion-Mesh Transformer Network Architecture for Remote Sensing Image CaptioningKhang Truong, Lam Pham, Hieu Tang et al.
Image captioning has emerged as a crucial task in the intersection of computer vision and natural language processing, enabling automated generation of descriptive text from visual content. In the context of remote sensing, image captioning plays a significant role in interpreting vast and complex satellite imagery, aiding applications such as environmental monitoring, disaster assessment, and urban planning. This motivates us, in this paper, to present a transformer based network architecture for remote sensing image captioning (RSIC) in which multiple techniques of Static Expansion, Memory-Augmented Self-Attention, Mesh Transformer are evaluated and integrated. We evaluate our proposed models using two benchmark remote sensing image datasets of UCM-Caption and NWPU-Caption. Our best model outperforms the state-of-the-art systems on most of evaluation metrics, which demonstrates potential to apply for real-life remote sensing image systems.
SDMay 16, 2023
Low-complexity deep learning frameworks for acoustic scene classification using teacher-student scheme and multiple spectrogramsLam Pham, Dat Ngo, Cam Le et al.
In this technical report, a low-complexity deep learning system for acoustic scene classification (ASC) is presented. The proposed system comprises two main phases: (Phase I) Training a teacher network; and (Phase II) training a student network using distilled knowledge from the teacher. In the first phase, the teacher, which presents a large footprint model, is trained. After training the teacher, the embeddings, which are the feature map of the second last layer of the teacher, are extracted. In the second phase, the student network, which presents a low complexity model, is trained with the embeddings extracted from the teacher. Our experiments conducted on DCASE 2023 Task 1 Development dataset have fulfilled the requirement of low-complexity and achieved the best classification accuracy of 57.4%, improving DCASE baseline by 14.5%.
SDFeb 10, 2022
Audio-Based Deep Learning Frameworks for Detecting COVID-19Dat Ngo, Lam Pham, Truong Hoang et al.
This paper evaluates a wide range of audio-based deep learning frameworks applied to the breathing, cough, and speech sounds for detecting COVID-19. In general, the audio recording inputs are transformed into low-level spectrogram features, then they are fed into pre-trained deep learning models to extract high-level embedding features. Next, the dimension of these high-level embedding features are reduced before finetuning using Light Gradient Boosting Machine (LightGBM) as a back-end classification. Our experiments on the Second DiCOVA Challenge achieved the highest Area Under the Curve (AUC), F1 score, sensitivity score, and specificity score of 89.03%, 64.41%, 63.33%, and 95.13%, respectively. Based on these scores, our method outperforms the state-of-the-art systems, and improves the challenge baseline by 4.33%, 6.00% and 8.33% in terms of AUC, F1 score and sensitivity score, respectively.
SDJan 9, 2022
An Ensemble of Deep Learning Frameworks Applied For Predicting Respiratory AnomaliesLam Pham, Dat Ngo, Truong Hoang et al.
In this paper, we evaluate various deep learning frameworks for detecting respiratory anomalies from input audio recordings. To this end, we firstly transform audio respiratory cycles collected from patients into spectrograms where both temporal and spectral features are presented, referred to as the front-end feature extraction. We then feed the spectrograms into back-end deep learning networks for classifying these respiratory cycles into certain categories. Finally, results from high-performed deep learning frameworks are fused to obtain the best score. Our experiments on ICBHI benchmark dataset achieve the highest ICBHI score of 57.3 from a late fusion of inception based and transfer learning based deep learning frameworks, which outperforms the state-of-the-art systems.
CVDec 16, 2021
An Audio-Visual Dataset and Deep Learning Frameworks for Crowded Scene ClassificationLam Pham, Dat Ngo, Phu X. Nguyen et al.
This paper presents a task of audio-visual scene classification (SC) where input videos are classified into one of five real-life crowded scenes: 'Riot', 'Noise-Street', 'Firework-Event', 'Music-Event', and 'Sport-Atmosphere'. To this end, we firstly collect an audio-visual dataset (videos) of these five crowded contexts from Youtube (in-the-wild scenes). Then, a wide range of deep learning frameworks are proposed to deploy either audio or visual input data independently. Finally, results obtained from high-performed deep learning frameworks are fused to achieve the best accuracy score. Our experimental results indicate that audio and visual input factors independently contribute to the SC task's performance. Significantly, an ensemble of deep learning frameworks exploring either audio or visual input data can achieve the best accuracy of 95.7%.
SDOct 12, 2021
An Annihilating Filter-Based DOA Estimation for Uniform Linear ArraySon Phan, Lam Pham
In this paper, we propose a new method to design an annihilating filter (AF) for direction-of-arrival (DOA) estimation of multiple snapshots within an uniform linear array. To evaluate the proposed method, we firstly design a DOA estimation using multiple signal classification (MUSIC) algorithm, referred to as the MUSIC baseline. We then compare the proposed method with the MUSIC baseline in two environmental noise conditions: Only white noise, or both white noise and diffusion. The experimental results highlight two main contributions; the first is to modify conventional MUSIC algorithm for adapting different noise conditions, and the second is to propose an AF-based method that shows competitive accuracy of arrival angles detected and low complexity compared with the MUSIC baseline.
SDOct 7, 2021
A Cough-based deep learning framework for detecting COVID-19Truong Hoang, Lam Pham, Dat Ngo et al.
This paper presents a deep learning framework for detecting COVID-19 positive subjects from their cough sounds. In particular, the proposed approach comprises two main steps. In the first step, we generate a feature representing the cough sound by combining an embedding extracted from a pre-trained model and handcrafted features extracted from draw audio recording, referred to as the front-end feature extraction. Then, the combined features are fed into different back-end classification models for detecting COVID-19 positive subjects in the second step. Our experiments on the Track-2 dataset of the Second 2021 DiCOVA Challenge achieved the second top ranking with an AUC score of 81.21 and the top F1 score of 53.21 on a Blind Test set, improving the challenge baseline by 8.43% and 23.4% respectively and showing deployability, robustness and competitiveness with the state-of-the-art systems.
SDJul 20, 2021
Robust Deep Learning Frameworks for Acoustic Scene and Respiratory Sound ClassificationLam Pham
This thesis focuses on dealing with the task of acoustic scene classification (ASC), and then applied the techniques developed for ASC to a real-life application of detecting respiratory disease. To deal with ASC challenges, this thesis addresses three main factors that directly affect the performance of an ASC system. Firstly, this thesis explores input features by making use of multiple spectrograms (log-mel, Gamma, and CQT) for low-level feature extraction to tackle the issue of insufficiently discriminative or descriptive input features. Next, a novel Encoder network architecture is introduced. The Encoder firstly transforms each low-level spectrogram into high-level intermediate features, or embeddings, and thus combines these high-level features to form a very distinct composite feature. The composite or combined feature is then explored in terms of classification performance, with different Decoders such as Random Forest (RF), Multilayer Perception (MLP), and Mixture of Experts (MoE). By using this Encoder-Decoder framework, it helps to reduce the computation cost of the reference process in ASC systems which make use of multiple spectrogram inputs. Since the proposed techniques applied for general ASC tasks were shown to be highly effective, this inspired an application to a specific real-life application. This was namely the 2017 Internal Conference on Biomedical Health Informatics (ICBHI) respiratory sound dataset. Building upon the proposed ASC framework, the ICBHI tasks were tackled with a deep learning framework, and the resulting system shown to be capable at detecting respiratory anomaly cycles and diseases.
SDJun 12, 2021
Deep Learning Frameworks Applied For Audio-Visual Scene ClassificationLam Pham, Alexander Schindler, Mina Schütz et al.
In this paper, we present deep learning frameworks for audio-visual scene classification (SC) and indicate how individual visual and audio features as well as their combination affect SC performance. Our extensive experiments, which are conducted on DCASE (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events) Task 1B development dataset, achieve the best classification accuracy of 82.2%, 91.1%, and 93.9% with audio input only, visual input only, and both audio-visual input, respectively. The highest classification accuracy of 93.9%, obtained from an ensemble of audio-based and visual-based frameworks, shows an improvement of 16.5% compared with DCASE baseline.
SDJun 12, 2021
A Low-Compexity Deep Learning Framework For Acoustic Scene ClassificationLam Pham, Hieu Tang, Anahid Jalali et al.
In this paper, we presents a low-complexity deep learning frameworks for acoustic scene classification (ASC). The proposed framework can be separated into three main steps: Front-end spectrogram extraction, back-end classification, and late fusion of predicted probabilities. First, we use Mel filter, Gammatone filter and Constant Q Transfrom (CQT) to transform raw audio signal into spectrograms, where both frequency and temporal features are presented. Three spectrograms are then fed into three individual back-end convolutional neural networks (CNNs), classifying into ten urban scenes. Finally, a late fusion of three predicted probabilities obtained from three CNNs is conducted to achieve the final classification result. To reduce the complexity of our proposed CNN network, we apply two model compression techniques: model restriction and decomposed convolution. Our extensive experiments, which are conducted on DCASE 2021 (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events) Task 1A development dataset, achieve a low-complexity CNN based framework with 128 KB trainable parameters and the best classification accuracy of 66.7%, improving DCASE baseline by 19.0%
LGApr 5, 2021
An Analysis of State-of-the-art Activation Functions For Supervised Deep Neural NetworkAnh Nguyen, Khoa Pham, Dat Ngo et al.
This paper provides an analysis of state-of-the-art activation functions with respect to supervised classification of deep neural network. These activation functions comprise of Rectified Linear Units (ReLU), Exponential Linear Unit (ELU), Scaled Exponential Linear Unit (SELU), Gaussian Error Linear Unit (GELU), and the Inverse Square Root Linear Unit (ISRLU). To evaluate, experiments over two deep learning network architectures integrating these activation functions are conducted. The first model, basing on Multilayer Perceptron (MLP), is evaluated with MNIST dataset to perform these activation functions. Meanwhile, the second model, likely VGGish-based architecture, is applied for Acoustic Scene Classification (ASC) Task 1A in DCASE 2018 challenge, thus evaluate whether these activation functions work well in different datasets as well as different network architectures.
SDApr 2, 2021
An Audio-Based Deep Learning Framework For BBC Television Programme ClassificationLam Pham, Chris Baume, Qiuqiang Kong et al.
This paper proposes a deep learning framework for classification of BBC television programmes using audio. The audio is firstly transformed into spectrograms, which are fed into a pre-trained convolutional Neural Network (CNN), obtaining predicted probabilities of sound events occurring in the audio recording. Statistics for the predicted probabilities and detected sound events are then calculated to extract discriminative features representing the television programmes. Finally, the embedded features extracted are fed into a classifier for classifying the programmes into different genres. Our experiments are conducted over a dataset of 6,160 programmes belonging to nine genres labelled by the BBC. We achieve an average classification accuracy of 93.7% over 14-fold cross validation. This demonstrates the efficacy of the proposed framework for the task of audio-based classification of television programmes.
SDMar 3, 2021
Multi-view Audio and Music ClassificationHuy Phan, Huy Le Nguyen, Oliver Y. Chén et al.
We propose in this work a multi-view learning approach for audio and music classification. Considering four typical low-level representations (i.e. different views) commonly used for audio and music recognition tasks, the proposed multi-view network consists of four subnetworks, each handling one input types. The learned embedding in the subnetworks are then concatenated to form the multi-view embedding for classification similar to a simple concatenation network. However, apart from the joint classification branch, the network also maintains four classification branches on the single-view embedding of the subnetworks. A novel method is then proposed to keep track of the learning behavior on the classification branches and adapt their weights to proportionally blend their gradients for network training. The weights are adapted in such a way that learning on a branch that is generalizing well will be encouraged whereas learning on a branch that is overfitting will be slowed down. Experiments on three different audio and music classification tasks show that the proposed multi-view network not only outperforms the single-view baselines but also is superior to the multi-view baselines based on concatenation and late fusion.
SDDec 26, 2020
Inception-Based Network and Multi-Spectrogram Ensemble Applied For Predicting Respiratory Anomalies and Lung DiseasesLam Pham, Huy Phan, Ross King et al.
This paper presents an inception-based deep neural network for detecting lung diseases using respiratory sound input. Recordings of respiratory sound collected from patients are firstly transformed into spectrograms where both spectral and temporal information are well presented, referred to as front-end feature extraction. These spectrograms are then fed into the proposed network, referred to as back-end classification, for detecting whether patients suffer from lung-relevant diseases. Our experiments, conducted over the ICBHI benchmark meta-dataset of respiratory sound, achieve competitive ICBHI scores of 0.53/0.45 and 0.87/0.85 regarding respiratory anomaly and disease detection, respectively.
LGDec 26, 2020
Deep Learning Framework Applied for Predicting Anomaly of Respiratory SoundsDat Ngo, Lam Pham, Anh Nguyen et al.
This paper proposes a robust deep learning framework used for classifying anomaly of respiratory cycles. Initially, our framework starts with front-end feature extraction step. This step aims to transform the respiratory input sound into a two-dimensional spectrogram where both spectral and temporal features are well presented. Next, an ensemble of C- DNN and Autoencoder networks is then applied to classify into four categories of respiratory anomaly cycles. In this work, we conducted experiments over 2017 Internal Conference on Biomedical Health Informatics (ICBHI) benchmark dataset. As a result, we achieve competitive performances with ICBHI average score of 0.49, ICBHI harmonic score of 0.42.
ASSep 11, 2020
On Multitask Loss Function for Audio Event Detection and LocalizationHuy Phan, Lam Pham, Philipp Koch et al.
Audio event localization and detection (SELD) have been commonly tackled using multitask models. Such a model usually consists of a multi-label event classification branch with sigmoid cross-entropy loss for event activity detection and a regression branch with mean squared error loss for direction-of-arrival estimation. In this work, we propose a multitask regression model, in which both (multi-label) event detection and localization are formulated as regression problems and use the mean squared error loss homogeneously for model training. We show that the common combination of heterogeneous loss functions causes the network to underfit the data whereas the homogeneous mean squared error loss leads to better convergence and performance. Experiments on the development and validation sets of the DCASE 2020 SELD task demonstrate that the proposed system also outperforms the DCASE 2020 SELD baseline across all the detection and localization metrics, reducing the overall SELD error (the combined metric) by approximately 10% absolute.
SDMay 26, 2020
Sound Context Classification Basing on Join Learning Model and Multi-Spectrogram FeaturesDat Ngo, Hao Hoang, Anh Nguyen et al.
In this paper, we present a deep learning framework applied for Acoustic Scene Classification (ASC), the task of classifying scene contexts from environmental input sounds. An ASC system generally comprises of two main steps, referred to as front-end feature extraction and back-end classification. In the first step, an extractor is used to extract low-level features from raw audio signals. Next, the discriminative features extracted are fed into and classified by a classifier, reporting accuracy results. Aim to develop a robust framework applied for ASC, we address exited issues of both the front-end and back-end components in an ASC system, thus present three main contributions: Firstly, we carry out a comprehensive analysis of spectrogram representation extracted from sound scene input, thus propose the best multi-spectrogram combinations. In terms of back-end classification, we propose a novel join learning architecture using parallel convolutional recurrent networks, which is effective to learn spatial features and temporal sequences of spectrogram input. Finally, good experimental results obtained over benchmark datasets of IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 Task 1, 2017 Task 1, 2018 Task 1A & 1B, LITIS Rouen prove our proposed framework general and robust for ASC task.
ASApr 4, 2020
CNN-MoE based framework for classification of respiratory anomalies and lung disease detectionLam Pham, Huy Phan, Ramaswamy Palaniappan et al.
This paper presents and explores a robust deep learning framework for auscultation analysis. This aims to classify anomalies in respiratory cycles and detect disease, from respiratory sound recordings. The framework begins with front-end feature extraction that transforms input sound into a spectrogram representation. Then, a back-end deep learning network is used to classify the spectrogram features into categories of respiratory anomaly cycles or diseases. Experiments, conducted over the ICBHI benchmark dataset of respiratory sounds, confirm three main contributions towards respiratory-sound analysis. Firstly, we carry out an extensive exploration of the effect of spectrogram type, spectral-time resolution, overlapped/non-overlapped windows, and data augmentation on final prediction accuracy. This leads us to propose a novel deep learning system, built on the proposed framework, which outperforms current state-of-the-art methods. Finally, we apply a Teacher-Student scheme to achieve a trade-off between model performance and model complexity which additionally helps to increase the potential of the proposed framework for building real-time applications.
ASFeb 12, 2020
Deep Feature Embedding and Hierarchical Classification for Audio Scene ClassificationLam Pham, Ian McLoughlin, Huy Phan et al.
In this work, we propose an approach that features deep feature embedding learning and hierarchical classification with triplet loss function for Acoustic Scene Classification (ASC). In the one hand, a deep convolutional neural network is firstly trained to learn a feature embedding from scene audio signals. Via the trained convolutional neural network, the learned embedding embeds an input into the embedding feature space and transforms it into a high-level feature vector for representation. In the other hand, in order to exploit the structure of the scene categories, the original scene classification problem is structured into a hierarchy where similar categories are grouped into meta-categories. Then, hierarchical classification is accomplished using deep neural network classifiers associated with triplet loss function. Our experiments show that the proposed system achieves good performance on both the DCASE 2018 Task 1A and 1B datasets, resulting in accuracy gains of 15.6% and 16.6% absolute over the DCASE 2018 baseline on Task 1A and 1B, respectively.
SDFeb 11, 2020
Robust Acoustic Scene Classification using a Multi-Spectrogram Encoder-Decoder FrameworkLam Pham, Huy Phan, Truc Nguyen et al.
This article proposes an encoder-decoder network model for Acoustic Scene Classification (ASC), the task of identifying the scene of an audio recording from its acoustic signature. We make use of multiple low-level spectrogram features at the front-end, transformed into higher level features through a well-trained CNN-DNN front-end encoder. The high level features and their combination (via a trained feature combiner) are then fed into different decoder models comprising random forest regression, DNNs and a mixture of experts, for back-end classification. We report extensive experiments to evaluate the accuracy of this framework for various ASC datasets, including LITIS Rouen and IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 Task 1, 2017 Task 1, 2018 Tasks 1A & 1B and 2019 Tasks 1A & 1B. The experimental results highlight two main contributions; the first is an effective method for high-level feature extraction from multi-spectrogram input via the novel C-DNN architecture encoder network, and the second is the proposed decoder which enables the framework to achieve competitive results on various datasets. The fact that a single framework is highly competitive for several different challenges is an indicator of its robustness for performing general ASC tasks.
SDJan 21, 2020
Robust Deep Learning Framework For Predicting Respiratory Anomalies and DiseasesLam Pham, Ian McLoughlin, Huy Phan et al.
This paper presents a robust deep learning framework developed to detect respiratory diseases from recordings of respiratory sounds. The complete detection process firstly involves front end feature extraction where recordings are transformed into spectrograms that convey both spectral and temporal information. Then a back-end deep learning model classifies the features into classes of respiratory disease or anomaly. Experiments, conducted over the ICBHI benchmark dataset of respiratory sounds, evaluate the ability of the framework to classify sounds. Two main contributions are made in this paper. Firstly, we provide an extensive analysis of how factors such as respiratory cycle length, time resolution, and network architecture, affect final prediction accuracy. Secondly, a novel deep learning based framework is proposed for detection of respiratory diseases and shown to perform extremely well compared to state of the art methods.
SDApr 6, 2019
Spatio-Temporal Attention Pooling for Audio Scene ClassificationHuy Phan, Oliver Y. Chén, Lam Pham et al.
Acoustic scenes are rich and redundant in their content. In this work, we present a spatio-temporal attention pooling layer coupled with a convolutional recurrent neural network to learn from patterns that are discriminative while suppressing those that are irrelevant for acoustic scene classification. The convolutional layers in this network learn invariant features from time-frequency input. The bidirectional recurrent layers are then able to encode the temporal dynamics of the resulting convolutional features. Afterwards, a two-dimensional attention mask is formed via the outer product of the spatial and temporal attention vectors learned from two designated attention layers to weigh and pool the recurrent output into a final feature vector for classification. The network is trained with between-class examples generated from between-class data augmentation. Experiments demonstrate that the proposed method not only outperforms a strong convolutional neural network baseline but also sets new state-of-the-art performance on the LITIS Rouen dataset.
SDNov 2, 2018
Beyond Equal-Length Snippets: How Long is Sufficient to Recognize an Audio Scene?Huy Phan, Oliver Y. Chén, Philipp Koch et al.
Due to the variability in characteristics of audio scenes, some scenes can naturally be recognized earlier than others. In this work, rather than using equal-length snippets for all scene categories, as is common in the literature, we study to which temporal extent an audio scene can be reliably recognized given state-of-the-art models. Moreover, as model fusion with deep network ensemble is prevalent in audio scene classification, we further study whether, and if so, when model fusion is necessary for this task. To achieve these goals, we employ two single-network systems relying on a convolutional neural network and a recurrent neural network for classification as well as early fusion and late fusion of these networks. Experimental results on the LITIS-Rouen dataset show that some scenes can be reliably recognized with a few seconds while other scenes require significantly longer durations. In addition, model fusion is shown to be the most beneficial when the signal length is short.
LGNov 2, 2018
Unifying Isolated and Overlapping Audio Event Detection with Multi-Label Multi-Task Convolutional Recurrent Neural NetworksHuy Phan, Oliver Y. Chén, Philipp Koch et al.
We propose a multi-label multi-task framework based on a convolutional recurrent neural network to unify detection of isolated and overlapping audio events. The framework leverages the power of convolutional recurrent neural network architectures; convolutional layers learn effective features over which higher recurrent layers perform sequential modelling. Furthermore, the output layer is designed to handle arbitrary degrees of event overlap. At each time step in the recurrent output sequence, an output triple is dedicated to each event category of interest to jointly model event occurrence and temporal boundaries. That is, the network jointly determines whether an event of this category occurs, and when it occurs, by estimating onset and offset positions at each recurrent time step. We then introduce three sequential losses for network training: multi-label classification loss, distance estimation loss, and confidence loss. We demonstrate good generalization on two datasets: ITC-Irst for isolated audio event detection, and TUT-SED-Synthetic-2016 for overlapping audio event detection.