Huy-Hieu Pham

CV
h-index: 10
Papers: 17
Citations: 326
Novelty: 45%
AI Score: 56

17 Papers

CV | May 28
EVL-ECG: Efficient ECG Interpretation With Multi-Aspect Heterogeneous Knowledge Distillation

Dang Hong Nguyen, Nhi Ngoc-Yen Nguyen, Huy-Hieu Pham

High-fidelity ECG interpretation is increasingly reliant on massive foundation models, yet their deployment in clinical edge-care remains hindered by extreme computational demands. While knowledge distillation (KD) is a promising solution, traditional methods fail to capture the complex spatio-temporal dependencies of ECG signals when transferring knowledge across heterogeneous architectures. In this paper, we propose EVL-ECG, a framework specifically designed for cross-architecture distillation of cardiac diagnostic logic. EVL-ECG introduces three ECG-aware innovations: (1) Multi-Head Cross-Attention Alignment, which harmonizes architectural discrepancies to preserve fine-grained morphological features; (2) Optimal Transport-based Visual Feature Matching, utilizing optimal transport to maintain global structural relationships across ECG leads despite mismatched token representations; and (3) Geometric Intra-Architecture Relation Matching, which distills the latent diagnostic reasoning of the teacher model. Evaluations across ECG benchmarks demonstrate that EVL-ECG yields improvements of up to 2.4% AUC and 1.1% clinical accuracy over existing baselines. Notably, EVL-ECG establishes an efficient 2B-parameter ECG foundation model, suitable for resource-constrained clinical environments.
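As a rough illustration of how optimal transport can match two token sets of different sizes, the sketch below runs entropic-regularized Sinkhorn iterations on a small cost matrix. It is not the EVL-ECG implementation: the function name `sinkhorn_plan`, the uniform marginals, the `epsilon` value, and the toy cost matrix are all assumptions chosen for illustration.

```python
import math

def sinkhorn_plan(cost, epsilon=0.5, iterations=200):
    """Approximate an optimal transport plan between uniform marginals.

    cost[i][j] is a hypothetical distance between student token i and
    teacher token j; the two token counts may differ.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass on student tokens (assumption)
    b = [1.0 / m] * m  # uniform mass on teacher tokens (assumption)
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Two student tokens matched against three teacher tokens.
plan = sinkhorn_plan([[0.0, 1.0, 1.0], [1.0, 0.0, 0.5]])
```

Each entry `plan[i][j]` is the share of mass moved from student token `i` to teacher token `j`, so a matching loss could weight per-pair feature distances by the plan even when the token counts are mismatched.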

LG | May 28
MIC: Maximizing Informational Capacity in Adaptive Representations via Isotropic Subspace Alignment

Dang Hong Nguyen, Nhi Ngoc-Yen Nguyen, Huy-Hieu Pham

Although multi-scale representation learning enables elastic-dimension embeddings, nested subspaces often suffer from dimensional redundancy and spectral collapse. To address this, we introduce MIC, a framework that optimizes the geometric landscape of multi-granular embeddings through isotropic subspace alignment. MIC employs Soft Collapse Regularization (SCR) to mitigate redundancy between prefix and residual subspaces via cross-correlation penalties, alongside Spectral Isotropy Regularization (SIR) to ensure hyper-spherical uniformity in low-dimensional prefixes. By unifying these strategies through a self-distillation objective, MIC generates semantically dense representations that maintain high discriminative power. Our experiments demonstrate that MIC significantly outperforms standard baselines, particularly in high-compression scenarios where maintaining informational capacity is most critical.
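One way to read the cross-correlation penalty between prefix and residual subspaces is as the sum of squared Pearson correlations between every prefix dimension and every residual dimension. The sketch below follows that reading; it is an assumption, not the SCR loss from the paper, and the names `standardize`, `cross_correlation_penalty`, and `prefix_dim` are hypothetical.

```python
import math

def standardize(column):
    """Center a column and scale it to unit norm (zeros if it is constant)."""
    mean = sum(column) / len(column)
    centered = [x - mean for x in column]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0.0:
        return [0.0] * len(column)
    return [c / norm for c in centered]

def cross_correlation_penalty(embeddings, prefix_dim):
    """Sum of squared correlations between prefix and residual dimensions.

    embeddings: list of rows, one embedding vector per sample.
    prefix_dim: number of leading dimensions treated as the prefix.
    """
    columns = [standardize(list(col)) for col in zip(*embeddings)]
    prefix, residual = columns[:prefix_dim], columns[prefix_dim:]
    return sum(
        sum(p * r for p, r in zip(p_col, r_col)) ** 2
        for p_col in prefix
        for r_col in residual
    )

# Dimension 1 duplicates dimension 0 (redundant); dimension 2 is uncorrelated.
batch = [[1.0, 1.0, 1.0], [2.0, 2.0, -1.0], [3.0, 3.0, -1.0], [4.0, 4.0, 1.0]]
penalty = cross_correlation_penalty(batch, prefix_dim=1)
```

With this toy batch the penalty is driven entirely by the duplicated dimension, which is the kind of redundancy such a regularizer is meant to discourage.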

CL | May 14 | Code
Mitigating Data Scarcity in Psychological Defense Classification with Context-Aware Synthetic Augmentation

Hoang-Thuy-Duong Vu, Quoc-Cuong Pham, Huy-Hieu Pham

Psychological defense mechanisms (PDMs) are unconscious cognitive processes that modulate how individuals perceive and respond to emotional distress. Automatically classifying PDMs from text is clinically valuable but severely hindered by data scarcity and class imbalance, challenges which generative augmentation alone cannot resolve without psychological grounding. In this work, we address these challenges in the PsyDefDetect shared task (BioNLP@ACL 2026) by proposing a context-aware synthetic augmentation framework combined with a hybrid classification model. Our hybrid model integrates contextual language representations with basic clinical features, along with 150 annotated defense items. Experiments demonstrate that definition quality in prompting directly governs generation fidelity and downstream performance. Our method surpasses DMRS Co-Pilot, reaching an accuracy of 58.26% (+40.25%) and a macro-F1 of 24.62% (+15.99%), thereby establishing a strong baseline for psychologically grounded defense mechanism classification in low-resource settings. Source code is available at: https://github.com/htdgv/CASA-PDC.

CV | Aug 5, 2022
A novel deep learning-based approach for sleep apnea detection using single-lead ECG signals

Anh-Tu Nguyen, Thao Nguyen, Huy-Khiem Le et al.

Sleep apnea (SA) is a type of sleep disorder characterized by snoring and chronic sleeplessness, which can lead to serious conditions such as high blood pressure, heart failure, and cardiomyopathy (enlargement of the muscle tissue of the heart). The electrocardiogram (ECG) plays a critical role in identifying SA since it might reveal abnormal cardiac activity. Recent research on ECG-based SA detection has focused on feature engineering techniques that extract specific characteristics from multiple-lead ECG signals and use them as classification model inputs. In this study, a novel method of feature extraction based on the detection of S peaks is proposed to enhance the detection of adjacent SA segments using a single-lead ECG. In particular, ECG features collected from a single lead (V2) are used to identify SA episodes. A CNN model is then trained on the extracted features to detect SA. Experimental results demonstrate that the proposed method for detecting SA from single-lead ECG data is more accurate than existing state-of-the-art methods, with 91.13% classification accuracy, 92.58% sensitivity, and 88.75% specificity. Moreover, adding the features associated with the S peaks improves the classification accuracy by 0.85%. Our findings indicate that the proposed machine learning system has the potential to be an effective method for detecting SA episodes.

CV | Feb 13 | Code
Handling Supervision Scarcity in Chest X-ray Classification: Long-Tailed and Zero-Shot Learning

Ha-Hieu Pham, Hai-Dang Nguyen, Thanh-Huy Nguyen et al.

Chest X-Ray (CXR) classification in clinical practice is often limited by imperfect supervision, arising from (i) extreme long-tailed multi-label disease distributions and (ii) missing annotations for rare or previously unseen findings. The CXR-LT 2026 challenge addresses these issues on a PadChest-based benchmark with a 36-class label space split into 30 in-distribution classes for training and 6 out-of-distribution (OOD) classes for zero-shot evaluation. We present task-specific solutions tailored to the distinct supervision regimes. For Task 1 (long-tailed multi-label classification), we adopt an imbalance-aware multi-label learning strategy to improve recognition of tail classes while maintaining stable performance on frequent findings. For Task 2 (zero-shot OOD recognition), we propose a prediction approach that produces scores for unseen disease categories without using any supervised labels or examples from the OOD classes during training. Evaluated with macro-averaged mean Average Precision (mAP), our method achieves strong performance on both tasks, ranking first on the public leaderboard of the development phase. Code and pre-trained models are available at https://github.com/hieuphamha19/CXR_LT.

CV | Mar 19 | Code
Synergizing Deep Learning and Biological Heuristics for Extreme Long-Tail White Blood Cell Classification

Duc T. Nguyen, Hoang-Long Nguyen, Huy-Hieu Pham

Automated white blood cell (WBC) classification is essential for leukemia screening yet remains challenging under extreme class imbalance and domain shift. These limitations often cause deep models to overfit dominant classes while failing to generalize to rare pathological subtypes. To address this issue, we propose a three-stage hybrid framework. First, a self-supervised Pix2Pix restoration module mitigates synthetic noise and restores high-frequency cytoplasmic details. Second, we integrate a Swin Transformer ensemble with MedSigLIP contrastive embeddings to enhance rare-class semantic representation. Finally, we introduce a biologically inspired refinement strategy combining geometric spikiness analysis and Mahalanobis-based morphological constraints to explicitly rescue suppressed minority predictions. Our hybrid framework achieves a Macro-F1 score of 0.77139 on the private leaderboard, demonstrating strong robustness under extreme long-tail distributions. The code is available at https://github.com/trongduc-nguyen/WBCBench2026.

CV | Apr 16
CXR-LT 2026 Challenge: Multi-Center Long-Tailed and Zero Shot Chest X-ray Classification

Hexin Dong, Yi Lin, Pengyu Zhou et al.

Chest X-ray (CXR) interpretation is hindered by the long-tailed distribution of pathologies and the open-world nature of clinical environments. Existing benchmarks often rely on closed-set classes from a single institution, failing to capture the prevalence of rare diseases or the appearance of novel findings. To address this, we present the CXR-LT challenge. The first event, CXR-LT 2023, established a large-scale benchmark for long-tailed multi-label CXR classification and identified key challenges in rare disease recognition. CXR-LT 2024 further expanded the label space and introduced a zero-shot task to study generalization to unseen findings. Building on the success of CXR-LT 2023 and 2024, this third iteration of the benchmark introduces a multi-center dataset comprising over 145,000 images from the PadChest and NIH Chest X-ray datasets. Additionally, all development and test sets in CXR-LT 2026 are annotated by radiologists, providing a more reliable and clinically grounded evaluation than report-derived labels. The challenge defines two core tasks this year: (1) Robust Multi-Label Classification on 30 known classes and (2) Open-World Generalization to 6 unseen (out-of-distribution) rare disease classes. This paper presents an overview of the CXR-LT 2026 challenge. We describe the data collection and annotation procedures, analyze solution strategies adopted by participating teams, and evaluate head-versus-tail performance, calibration, and cross-center generalization gaps. Our results show that vision-language foundation models improve both in-distribution and zero-shot performance, but detecting rare findings under multi-center shift remains challenging. Our study provides a foundation for developing and evaluating AI systems in realistic long-tailed and open-world clinical conditions.

LG | Feb 22, 2023
Personalized Privacy-Preserving Framework for Cross-Silo Federated Learning

Van-Tuan Tran, Huy-Hieu Pham, Kok-Seng Wong

Federated learning (FL) has recently emerged as a promising decentralized deep learning (DL) framework that enables DL-based approaches to be trained collaboratively across clients without sharing private data. However, when the central party is active and dishonest, the data of individual clients might be perfectly reconstructed, creating a high risk of sensitive information being leaked. Moreover, FL also suffers from non-independent and identically distributed (non-IID) data among clients, which degrades inference performance on local clients' data. In this paper, we propose a novel framework, namely Personalized Privacy-Preserving Federated Learning (PPPFL), with a focus on cross-silo FL to overcome these challenges. Specifically, we introduce a stabilized variant of the Model-Agnostic Meta-Learning (MAML) algorithm to collaboratively train a global initialization from clients' synthetic data generated by Differentially Private Generative Adversarial Networks (DP-GANs). After reaching convergence, the global initialization is locally adapted by the clients to their private data. Through extensive experiments, we empirically show that our proposed framework outperforms multiple FL baselines on different datasets, including MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100.

CV | Jul 5, 2025 | Code
Learning Disentangled Stain and Structural Representations for Semi-Supervised Histopathology Segmentation

Ha-Hieu Pham, Nguyen Lan Vi Vu, Thanh-Huy Nguyen et al.

Accurate gland segmentation in histopathology images is essential for cancer diagnosis and prognosis. However, significant variability in Hematoxylin and Eosin (H&E) staining and tissue morphology, combined with limited annotated data, poses major challenges for automated segmentation. To address this, we propose Color-Structure Dual-Student (CSDS), a novel semi-supervised segmentation framework designed to learn disentangled representations of stain appearance and tissue structure. CSDS comprises two specialized student networks: one trained on stain-augmented inputs to model chromatic variation, and the other on structure-augmented inputs to capture morphological cues. A shared teacher network, updated via Exponential Moving Average (EMA), supervises both students through pseudo-labels. To further improve label reliability, we introduce stain-aware and structure-aware uncertainty estimation modules that adaptively modulate the contribution of each student during training. Experiments on the GlaS and CRAG datasets show that CSDS achieves state-of-the-art performance in low-label settings, with Dice score improvements of up to 1.2% on GlaS and 0.7% on CRAG at 5% labeled data, and 0.7% and 1.4% at 10%. Our code and pre-trained models are available at https://github.com/hieuphamha19/CSDS.
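The Exponential Moving Average update mentioned for the teacher can be sketched in a few lines. This is the generic EMA rule only: the abstract does not say how the two students' weights are combined, so the example tracks a single student, and the `decay` value and dictionary-of-floats weight format are assumptions for illustration.

```python
def ema_update(teacher, student, decay=0.99):
    """Return teacher weights moved slightly toward the student weights.

    teacher, student: dicts mapping a parameter name to a float
    (a stand-in for real tensors). decay close to 1 means slow updates.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }

teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
for _ in range(3):  # three hypothetical training steps
    teacher = ema_update(teacher, student, decay=0.9)
```

Because the teacher changes slowly, its pseudo-labels tend to be more stable than those of a student that is updated by gradient steps on every batch.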

CV | Apr 2
HOT: Harmonic-Constrained Optimal Transport for Remote Photoplethysmography Domain Adaptation

Ba-Thinh Nguyen, Thi-Duyen Ngo, Thanh-Trung Huynh et al.

Remote photoplethysmography (rPPG) enables non-contact physiological measurement from facial videos; however, its practical deployment is often hindered by substantial performance degradation under domain shift. While recent deep learning-based rPPG methods have achieved strong performance on individual datasets, they frequently overfit to appearance-related factors, such as illumination, camera characteristics, and color response, that vary significantly across domains. To address this limitation, we introduce frequency domain adaptation (FDA) as a principled strategy for modeling appearance variation in rPPG. By transferring low-frequency spectral components that encode domain-dependent appearance characteristics, FDA encourages rPPG models to learn invariance to appearance variations while retaining cardiac-induced signals. To further support physiologically consistent alignment under such appearance variation, we propose Harmonic-Constrained Optimal Transport (HOT), which leverages the harmonic property of cardiac signals to guide alignment between original and FDA-transferred representations. Extensive cross-dataset experiments demonstrate that the proposed FDA and HOT framework effectively enhances the robustness and generalization of rPPG models across diverse datasets.

CV | Apr 2
BTS-rPPG: Orthogonal Butterfly Temporal Shifting for Remote Photoplethysmography

Ba-Thinh Nguyen, Thi-Duyen Ngo, Thanh-Trung Huynh et al.

Remote photoplethysmography (rPPG) enables contactless physiological sensing from facial videos by analyzing subtle appearance variations induced by blood circulation. However, modeling the temporal dynamics of these signals remains challenging, as many deep learning methods rely on temporal shifting or convolutional operators that aggregate information primarily from neighboring frames, resulting in predominantly local temporal modeling and limited temporal receptive fields. To address this limitation, we propose BTS-rPPG, a temporal modeling framework based on Orthogonal Butterfly Temporal Shifting (BTS). Inspired by the butterfly communication pattern in the Fast Fourier Transform (FFT), BTS establishes structured frame interactions via an XOR-based butterfly pairing schedule, progressively expanding the temporal receptive field and enabling efficient propagation of information across distant frames. Furthermore, we introduce an orthogonal feature transfer mechanism (OFT) that filters the source feature with respect to the target context before temporal shifting, retaining only the orthogonal component for cross-frame transmission. This reduces redundant feature propagation and encourages complementary temporal interaction. Extensive experiments on multiple benchmark datasets demonstrate that BTS-rPPG improves long-range temporal modeling of physiological dynamics and consistently outperforms existing temporal modeling strategies for rPPG estimation.
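A small sketch may help make the two ideas concrete: an XOR-based butterfly schedule, where frame i at stage s is paired with frame i XOR 2^s as in FFT butterflies, and an orthogonal filter that keeps only the part of a source feature not already explained by the target. Both functions are assumptions about the general technique, not the BTS-rPPG code; the names and the power-of-two frame count are hypothetical.

```python
def butterfly_pairs(num_frames):
    """Per stage, list (frame, partner) with partner = frame XOR 2**stage."""
    if num_frames < 2 or num_frames & (num_frames - 1):
        raise ValueError("num_frames must be a power of two, at least 2")
    stages = num_frames.bit_length() - 1  # log2(num_frames)
    return [
        [(i, i ^ (1 << s)) for i in range(num_frames)]
        for s in range(stages)
    ]

def reachable_from_frame0(num_frames, stages_done):
    """Frames whose information can have reached frame 0 so far."""
    reach = [{i} for i in range(num_frames)]
    for stage in butterfly_pairs(num_frames)[:stages_done]:
        reach = [reach[i] | reach[j] for i, j in stage]
    return reach[0]

def orthogonal_component(source, target):
    """Remove from source its projection onto target."""
    dot_tt = sum(t * t for t in target)
    if dot_tt == 0.0:
        return list(source)
    scale = sum(s * t for s, t in zip(source, target)) / dot_tt
    return [s - scale * t for s, t in zip(source, target)]

schedule = butterfly_pairs(8)
residual = orthogonal_component([1.0, 1.0], [1.0, 0.0])
```

With 8 frames, the set of frames that can influence frame 0 doubles at each stage (2, then 4, then 8), so full temporal coverage takes log2(8) = 3 stages rather than a number of steps proportional to the sequence length.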

CV | Mar 18
Digital FAST: An AI-Driven Multimodal Framework for Rapid and Early Stroke Screening

Ngoc-Khai Hoang, Thi-Nhu-Mai Nguyen, Huy-Hieu Pham

Early identification of stroke symptoms is essential for enabling timely intervention and improving patient outcomes, particularly in prehospital settings. This study presents a fast, non-invasive multimodal deep learning framework for automatic binary stroke screening based on data collected during the F.A.S.T. assessment. The proposed approach integrates complementary information from facial expressions, speech signals, and upper-body movements to enhance diagnostic robustness. Facial dynamics are represented using landmark-based features and modeled with a Transformer architecture to capture temporal dependencies. Speech signals are converted into mel spectrograms and processed using an Audio Spectrogram Transformer, while upper-body pose sequences are analyzed with an MLP-Mixer network to model spatiotemporal motion patterns. The extracted modality-specific representations are combined through an attention-based fusion mechanism to effectively learn cross-modal interactions. Experiments conducted on a self-collected dataset of 222 videos from 37 subjects demonstrate that the proposed multimodal model consistently outperforms unimodal baselines, achieving 95.83% accuracy and a 96.00% F1-score. The model attains a strong balance between sensitivity and specificity and successfully detects all stroke cases in the test set. These results highlight the potential of multimodal learning and transfer learning for early stroke screening, while emphasizing the need for larger, clinically representative datasets to support reliable real-world deployment.

CV | Oct 16, 2025
CLEAR: Causal Learning Framework For Robust Histopathology Tumor Detection Under Out-Of-Distribution Shifts

Kieu-Anh Truong Thi, Huy-Hieu Pham, Duc-Trong Le

Domain shift in histopathology, often caused by differences in acquisition processes or data sources, poses a major challenge to the generalization ability of deep learning models. Existing methods primarily rely on modeling statistical correlations by aligning feature distributions or introducing statistical variation, yet they often overlook causal relationships. In this work, we propose a novel causal-inference-based framework that leverages semantic features while mitigating the impact of confounders. Our method implements the front-door principle by designing transformation strategies that explicitly incorporate mediators and observed tissue slides. We validate our method on the CAMELYON17 dataset and a private histopathology dataset, demonstrating consistent performance gains across unseen domains. Our approach achieves up to a 7% improvement over existing baselines on both the CAMELYON17 dataset and the private histopathology dataset. These results highlight the potential of causal inference as a powerful tool for addressing domain shift in histopathology image analysis.

IV | Sep 19, 2025
Graph-Theoretic Consistency for Robust and Topology-Aware Semi-Supervised Histopathology Segmentation

Ha-Hieu Pham, Minh Le, Han Huynh et al.

Semi-supervised semantic segmentation (SSSS) is vital in computational pathology, where dense annotations are costly and limited. Existing methods often rely on pixel-level consistency, which propagates noisy pseudo-labels and produces fragmented or topologically invalid masks. We propose Topology Graph Consistency (TGC), a framework that integrates graph-theoretic constraints by aligning Laplacian spectra, component counts, and adjacency statistics between prediction graphs and references. This enforces global topology and improves segmentation accuracy. Experiments on GlaS and CRAG demonstrate that TGC achieves state-of-the-art performance under 5-10% supervision and significantly narrows the gap to full supervision.
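Of the three graph statistics named above, the component count is the simplest to illustrate. The sketch below counts 4-connected foreground regions in a binary mask and reports the gap between a prediction and a reference. It is a plain, non-differentiable illustration under assumed conventions (4-connectivity, nested-list masks, hypothetical function names), not the TGC loss itself.

```python
def count_components(mask):
    """Count 4-connected regions of truthy cells in a 2D list."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            count += 1  # new region found; flood-fill it
            seen[r][c] = True
            stack = [(r, c)]
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
    return count

def component_count_gap(prediction, reference):
    """Absolute difference in the number of connected regions."""
    return abs(count_components(prediction) - count_components(reference))

reference = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1]]
fragmented = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1]]
gap = component_count_gap(fragmented, reference)
```

Here the fragmented prediction splits one gland-like region into two diagonal pixels, so its component count differs from the reference even though the pixel-level overlap is partly preserved.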

CV | Dec 26, 2018
Learning to Recognize 3D Human Action from A New Skeleton-based Representation Using Deep Convolutional Neural Networks

Huy-Hieu Pham, Louahdi Khoudour, Alain Crouzil et al.

Recognizing human actions in untrimmed videos is an important and challenging task. An effective 3D motion representation and a powerful learning model are two key factors influencing recognition performance. In this paper we introduce a new skeleton-based representation for 3D action recognition in videos. The key idea of the proposed representation is to transform 3D joint coordinates of the human body carried in skeleton sequences into RGB images via a color encoding process. By normalizing the 3D joint coordinates and dividing each skeleton frame into five parts, where the joints are concatenated according to the order of their physical connections, the color-coded representation is able to represent spatio-temporal evolutions of complex 3D motions, independently of the length of each sequence. We then design and train different Deep Convolutional Neural Networks (D-CNNs) based on the Residual Network architecture (ResNet) on the obtained image-based representations to learn 3D motion features and classify them into classes. Our method is evaluated on two widely used action recognition benchmarks: MSR Action3D and NTU-RGB+D, a very large-scale dataset for 3D human action recognition. The experimental results demonstrate that the proposed method outperforms previous state-of-the-art approaches whilst requiring less computation for training and prediction.
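The color encoding step can be pictured as mapping each joint's (x, y, z) coordinates to an (R, G, B) pixel, with joints along one image axis and frames along the other. The sketch below assumes a global min-max normalization to the 0 to 255 range and a joints-by-frames layout; the exact normalization, the five-part joint ordering, and any resizing used in the paper are not reproduced, and the function name is hypothetical.

```python
def encode_skeleton_sequence(frames):
    """Map a skeleton sequence to an image of (R, G, B) tuples.

    frames: list of frames, each a list of (x, y, z) joint coordinates,
    with joints already in the desired order.
    Returns image[joint][frame], one pixel per joint per frame.
    """
    values = [c for frame in frames for joint in frame for c in joint]
    lowest, highest = min(values), max(values)
    span = (highest - lowest) or 1.0  # avoid division by zero

    def to_byte(value):
        return int(255 * (value - lowest) / span)

    num_joints = len(frames[0])
    return [
        [tuple(to_byte(c) for c in frames[t][j]) for t in range(len(frames))]
        for j in range(num_joints)
    ]

# Three frames of a toy two-joint skeleton.
sequence = [
    [(0.0, 1.0, 0.0), (1.0, 0.0, 0.0)],
    [(1.0, 1.0, 1.0), (0.0, 0.0, 1.0)],
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
]
image = encode_skeleton_sequence(sequence)
```

The resulting grid has one row per joint and one column per frame, so a longer sequence simply yields a wider image that can be resized to a fixed input size before being passed to a CNN.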

CV | Mar 21, 2018
Exploiting deep residual networks for human action recognition from skeletal data

Huy-Hieu Pham, Louahdi Khoudour, Alain Crouzil et al.

The computer vision community is currently focusing on solving action recognition problems in real videos, which contain thousands of samples with many challenges. In this process, Deep Convolutional Neural Networks (D-CNNs) have played a significant role in advancing the state of the art in various vision-based action recognition systems. Recently, the introduction of residual connections in conjunction with a more traditional CNN model in a single architecture called Residual Network (ResNet) has shown impressive performance and great potential for image recognition tasks. In this paper, we investigate and apply deep ResNets for human action recognition using skeletal data provided by depth sensors. Firstly, the 3D coordinates of the human body joints carried in skeleton sequences are transformed into image-based representations and stored as RGB images. These color images are able to capture the spatial-temporal evolutions of 3D motions from skeleton sequences and can be efficiently learned by D-CNNs. We then propose a novel deep learning architecture based on ResNets to learn features from the obtained color-based representations and classify them into action classes. The proposed method is evaluated on three challenging benchmark datasets: MSR Action 3D, KARD, and NTU-RGB+D. Experimental results demonstrate that our method achieves state-of-the-art performance on all these benchmarks whilst requiring fewer computational resources. In particular, the proposed method surpasses previous approaches by a significant margin of 3.4% on the MSR Action 3D dataset, 0.67% on the KARD dataset, and 2.5% on the NTU-RGB+D dataset.

CV | Mar 21, 2018
Learning and Recognizing Human Action from Skeleton Movement with Deep Residual Neural Networks

Huy-Hieu Pham, Louahdi Khoudour, Alain Crouzil et al.

Automatic human action recognition is indispensable for almost all artificial intelligence systems, such as video surveillance, human-computer interfaces, and video retrieval. Despite a lot of progress, recognizing actions in an unknown video is still a challenging task in computer vision. Recently, deep learning algorithms have proved their great potential in many vision-related recognition tasks. In this paper, we propose the use of Deep Residual Neural Networks (ResNets) to learn and recognize human action from skeleton data provided by a Kinect sensor. Firstly, the body joint coordinates are transformed into 3D arrays and saved in RGB image space. Five different deep learning models based on ResNet have been designed to extract image features and classify them into classes. Experiments are conducted on two public video datasets for human action recognition containing various challenges. The results show that our method achieves state-of-the-art performance compared with existing approaches.