Dung T. Tran

CV
h-index10
7papers
8citations
Novelty40%
AI Score43

7 Papers

CVJan 22
TinySense: Effective CSI Compression for Scalable and Accurate Wi-Fi Sensing

Toan Gian, Dung T. Tran, Viet Quoc Pham et al.

With the growing demand for device-free and privacy-preserving sensing solutions, Wi-Fi sensing has emerged as a promising approach for human pose estimation (HPE). However, existing methods often process vast amounts of channel state information (CSI) data directly, ultimately straining networking resources. This paper introduces TinySense, an efficient compression framework that enhances the scalability of Wi-Fi-based human sensing. Our approach is based on a new vector quantization-based generative adversarial network (VQGAN). Specifically, by leveraging a VQGAN-learned codebook, TinySense significantly reduces CSI data while maintaining the accuracy required for reliable HPE. To optimize compression, we employ the K-means algorithm to dynamically adjust compression bitrates to cluster a large-scale pre-trained codebook into smaller subsets. Furthermore, a Transformer model is incorporated to mitigate bitrate loss, enhancing robustness in unreliable networking conditions. We prototype TinySense on an experimental testbed using Jetson Nano and Raspberry Pi to measure latency and network resource use. Extensive results demonstrate that TinySense significantly outperforms state-of-the-art compression schemes, achieving up to 1.5x higher HPE accuracy score (PCK20) under the same compression rate. It also reduces latency and networking overhead, respectively, by up to 5x and 2.5x. The code repository is available online at here.

SDMay 27, 2025
Foundation Model Hidden Representations for Heart Rate Estimation from Auscultation

Jingping Nie, Dung T. Tran, Karan Thakkar et al.

Auscultation, particularly heart sound, is a non-invasive technique that provides essential vital sign information. Recently, self-supervised acoustic representation foundation models (FMs) have been proposed to offer insights into acoustics-based vital signs. However, there has been little exploration of the extent to which auscultation is encoded in these pre-trained FM representations. In this work, using a publicly available phonocardiogram (PCG) dataset and a heart rate (HR) estimation model, we conduct a layer-wise investigation of six acoustic representation FMs: HuBERT, wav2vec2, wavLM, Whisper, Contrastive Language-Audio Pretraining (CLAP), and an in-house CLAP model. Additionally, we implement the baseline method from Nie et al., 2024 (which relies on acoustic features) and show that overall, representation vectors from pre-trained foundation models (FMs) offer comparable performance to the baseline. Notably, HR estimation using the representations from the audio encoder of the in-house CLAP model outperforms the results obtained from the baseline, achieving a lower mean absolute error (MAE) across various train/validation/test splits despite the domain mismatch.

IVJan 7, 2025
Semise: Semi-supervised learning for severity representation in medical image

Dung T. Tran, Hung Vu, Anh Tran et al.

This paper introduces SEMISE, a novel method for representation learning in medical imaging that combines self-supervised and supervised learning. By leveraging both labeled and augmented data, SEMISE addresses the challenge of data scarcity and enhances the encoder's ability to extract meaningful features. This integrated approach leads to more informative representations, improving performance on downstream tasks. As result, our approach achieved a 12% improvement in classification and a 3% improvement in segmentation, outperforming existing methods. These results demonstrate the potential of SIMESE to advance medical image analysis and offer more accurate solutions for healthcare applications, particularly in contexts where labeled data is limited.

SDMar 24, 2025
Modeling speech emotion with label variance and analyzing performance across speakers and unseen acoustic conditions

Vikramjit Mitra, Amrit Romana, Dung T. Tran et al.

Spontaneous speech emotion data usually contain perceptual grades where graders assign emotion score after listening to the speech files. Such perceptual grades introduce uncertainty in labels due to grader opinion variation. Grader variation is addressed by using consensus grades as groundtruth, where the emotion with the highest vote is selected. Consensus grades fail to consider ambiguous instances where a speech sample may contain multiple emotions, as captured through grader opinion uncertainty. We demonstrate that using the probability density function of the emotion grades as targets instead of the commonly used consensus grades, provide better performance on benchmark evaluation sets compared to results reported in the literature. We show that a saliency driven foundation model (FM) representation selection helps to train a state-of-the-art speech emotion model for both dimensional and categorical emotion recognition. Comparing representations obtained from different FMs, we observed that focusing on overall test-set performance can be deceiving, as it fails to reveal the models generalization capacity across speakers and gender. We demonstrate that performance evaluation across multiple test-sets and performance analysis across gender and speakers are useful in assessing usefulness of emotion models. Finally, we demonstrate that label uncertainty and data-skew pose a challenge to model evaluation, where instead of using the best hypothesis, it is useful to consider the 2- or 3-best hypotheses.

CVSep 10, 2025
VRAE: Vertical Residual Autoencoder for License Plate Denoising and Deblurring

Cuong Nguyen, Dung T. Tran, Hong Nguyen et al.

In real-world traffic surveillance, vehicle images captured under adverse weather, poor lighting, or high-speed motion often suffer from severe noise and blur. Such degradations significantly reduce the accuracy of license plate recognition systems, especially when the plate occupies only a small region within the full vehicle image. Restoring these degraded images a fast realtime manner is thus a crucial pre-processing step to enhance recognition performance. In this work, we propose a Vertical Residual Autoencoder (VRAE) architecture designed for the image enhancement task in traffic surveillance. The method incorporates an enhancement strategy that employs an auxiliary block, which injects input-aware features at each encoding stage to guide the representation learning process, enabling better general information preservation throughout the network compared to conventional autoencoders. Experiments on a vehicle image dataset with visible license plates demonstrate that our method consistently outperforms Autoencoder (AE), Generative Adversarial Network (GAN), and Flow-Based (FB) approaches. Compared with AE at the same depth, it improves PSNR by about 20%, reduces NMSE by around 50%, and enhances SSIM by 1%, while requiring only a marginal increase of roughly 1% in parameters.

LGSep 10, 2025
Adaptive Rainfall Forecasting from Multiple Geographical Models Using Matrix Profile and Ensemble Learning

Dung T. Tran, Huyen Ngoc Huyen, Hong Nguyen et al.

Rainfall forecasting in Vietnam is highly challenging due to its diverse climatic conditions and strong geographical variability across river basins, yet accurate and reliable forecasts are vital for flood management, hydropower operation, and disaster preparedness. In this work, we propose a Matrix Profile-based Weighted Ensemble (MPWE), a regime-switching framework that dynamically captures covariant dependencies among multiple geographical model forecasts while incorporating redundancy-aware weighting to balance contributions across models. We evaluate MPWE using rainfall forecasts from eight major basins in Vietnam, spanning five forecast horizons (1 hour and accumulated rainfall over 12, 24, 48, 72, and 84 hours). Experimental results show that MPWE consistently achieves lower mean and standard deviation of prediction errors compared to geographical models and ensemble baselines, demonstrating both improved accuracy and stability across basins and horizons.

LGAug 11, 2025
SHeRL-FL: When Representation Learning Meets Split Learning in Hierarchical Federated Learning

Dung T. Tran, Nguyen B. Ha, Van-Dinh Nguyen et al.

Federated learning (FL) is a promising approach for addressing scalability and latency issues in large-scale networks by enabling collaborative model training without requiring the sharing of raw data. However, existing FL frameworks often overlook the computational heterogeneity of edge clients and the growing training burden on resource-limited devices. However, FL suffers from high communication costs and complex model aggregation, especially with large models. Previous works combine split learning (SL) and hierarchical FL (HierFL) to reduce device-side computation and improve scalability, but this introduces training complexity due to coordination across tiers. To address these issues, we propose SHeRL-FL, which integrates SL and hierarchical model aggregation and incorporates representation learning at intermediate layers. By allowing clients and edge servers to compute training objectives independently of the cloud, SHeRL-FL significantly reduces both coordination complexity and communication overhead. To evaluate the effectiveness and efficiency of SHeRL-FL, we performed experiments on image classification tasks using CIFAR-10, CIFAR-100, and HAM10000 with AlexNet, ResNet-18, and ResNet-50 in both IID and non-IID settings. In addition, we evaluate performance on image segmentation tasks using the ISIC-2018 dataset with a ResNet-50-based U-Net. Experimental results demonstrate that SHeRL-FL reduces data transmission by over 90\% compared to centralized FL and HierFL, and by 50\% compared to SplitFed, which is a hybrid of FL and SL, and further improves hierarchical split learning methods.