Tan-Hanh Pham

h-index5

12papers

106citations

Novelty48%

AI Score41

Ranked #67,354 of 194,257 authors (top 35%)#22,980 in CV (top 39%)

12 Papers

6.1CLSep 21, 2024Code

MultiMed: Multilingual Medical Speech Recognition via Attention Encoder Decoder

Khai Le-Duc, Phuc Phan, Tan-Hanh Pham et al.

Multilingual automatic speech recognition (ASR) in the medical domain serves as a foundational task for various downstream applications such as speech translation, spoken language understanding, and voice-activated assistants. This technology improves patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we introduce MultiMed, the first multilingual medical ASR dataset, along with the first collection of small-to-large end-to-end medical ASR models, spanning five languages: Vietnamese, English, German, French, and Mandarin Chinese. To our best knowledge, MultiMed stands as the world's largest medical ASR dataset across all major benchmarks: total duration, number of recording conditions, number of accents, and number of speaking roles. Furthermore, we present the first multilinguality study for medical ASR, which includes reproducible empirical baselines, a monolinguality-multilinguality analysis, Attention Encoder Decoder (AED) vs Hybrid comparative study and a linguistic analysis. We present practical ASR end-to-end training schemes optimized for a fixed number of trainable parameters that are common in industry settings. All code, data, and models are available online: https://github.com/leduckhai/MultiMed/tree/master/MultiMed.

11.9IVJul 16, 2024Code

LiteGPT: Large Vision-Language Model for Joint Chest X-ray Localization and Classification Task

Khai Le-Duc, Ryan Zhang, Ngoc Son Nguyen et al.

Vision-language models have been extensively explored across a wide range of tasks, achieving satisfactory performance; however, their application in medical imaging remains underexplored. In this work, we propose a unified framework - LiteGPT - for the medical imaging. We leverage multiple pre-trained visual encoders to enrich information and enhance the performance of vision-language models. To the best of our knowledge, this is the first study to utilize vision-language models for the novel task of joint localization and classification in medical images. Besides, we are pioneers in providing baselines for disease localization in chest X-rays. Finally, we set new state-of-the-art performance in the image classification task on the well-benchmarked VinDr-CXR dataset. All code and models are publicly available online: https://github.com/leduckhai/LiteGPT

11.7IVOct 16, 2023

SeUNet-Trans: A Simple yet Effective UNet-Transformer Model for Medical Image Segmentation

Tan-Hanh Pham, Xianqi Li, Kim-Doang Nguyen

Automated medical image segmentation is becoming increasingly crucial to modern clinical practice, driven by the growing demand for precise diagnosis, the push towards personalized treatment plans, and the advancements in machine learning algorithms, especially the incorporation of deep learning methods. While convolutional neural networks (CNN) have been prevalent among these methods, the remarkable potential of Transformer-based models for computer vision tasks is gaining more acknowledgment. To harness the advantages of both CNN-based and Transformer-based models, we propose a simple yet effective UNet-Transformer (seUNet-Trans) model for medical image segmentation. In our approach, the UNet model is designed as a feature extractor to generate multiple feature maps from the input images, then the maps are propagated into a bridge layer, which is introduced to sequentially connect the UNet and the Transformer. In this stage, we approach the pixel-level embedding technique without position embedding vectors, aiming to make the model more efficient. Moreover, we apply spatial-reduction attention in the Transformer to reduce the computational/memory overhead. By leveraging the UNet architecture and the self-attention mechanism, our model not only retains the preservation of both local and global context information but also is capable of capturing long-range dependencies between input elements. The proposed model is extensively experimented on seven medical image segmentation datasets including polyp segmentation to demonstrate its efficacy. Comparison with several state-of-the-art segmentation models on these datasets shows the superior performance of our proposed seUNet-Trans network.

10.2CVApr 14, 2025Code

SilVar-Med: A Speech-Driven Visual Language Model for Explainable Abnormality Detection in Medical Imaging

Tan-Hanh Pham, Chris Ngo, Trong-Duong Bui et al.

Medical Visual Language Models have shown great potential in various healthcare applications, including medical image captioning and diagnostic assistance. However, most existing models rely on text-based instructions, limiting their usability in real-world clinical environments especially in scenarios such as surgery, text-based interaction is often impractical for physicians. In addition, current medical image analysis models typically lack comprehensive reasoning behind their predictions, which reduces their reliability for clinical decision-making. Given that medical diagnosis errors can have life-changing consequences, there is a critical need for interpretable and rational medical assistance. To address these challenges, we introduce an end-to-end speech-driven medical VLM, SilVar-Med, a multimodal medical image assistant that integrates speech interaction with VLMs, pioneering the task of voice-based communication for medical image analysis. In addition, we focus on the interpretation of the reasoning behind each prediction of medical abnormalities with a proposed reasoning dataset. Through extensive experiments, we demonstrate a proof-of-concept study for reasoning-driven medical image interpretation with end-to-end speech interaction. We believe this work will advance the field of medical AI by fostering more transparent, interactive, and clinically viable diagnostic support systems. Our code and dataset are publicly available at SiVar-Med.

24.3CVAug 18, 2025Code

Multimodal Chain of Continuous Thought for Latent-Space Reasoning in Vision-Language Models

Tan-Hanh Pham, Chris Ngo

Many reasoning techniques for large multimodal models adapt language model approaches, such as Chain-of-Thought (CoT) prompting, which express reasoning as word sequences. While effective for text, these methods are suboptimal for multimodal contexts, struggling to align audio, visual, and textual information dynamically. To explore an alternative paradigm, we propose the Multimodal Chain of Continuous Thought (MCOUT), which enables reasoning directly in a joint latent space rather than in natural language. In MCOUT, the reasoning state is represented as a continuous hidden vector, iteratively refined and aligned with visual and textual embeddings, inspired by human reflective cognition. We develop two variants: MCOUT-Base, which reuses the language model`s last hidden state as the continuous thought for iterative reasoning, and MCOUT-Multi, which integrates multimodal latent attention to strengthen cross-modal alignment between visual and textual features. Experiments on benchmarks including MMMU, ScienceQA, and MMStar show that MCOUT consistently improves multimodal reasoning, yielding up to 8.23% accuracy gains over strong baselines and improving BLEU scores up to 8.27% across multiple-choice and open-ended tasks. These findings highlight latent continuous reasoning as a promising direction for advancing LMMs beyond language-bound CoT, offering a scalable framework for human-like reflective multimodal inference. Code is available at https://github.com/Hanhpt23/OmniMod.

4.1LGFeb 3, 2025Code

A Wearable Device Dataset for Mental Health Assessment Using Laser Doppler Flowmetry and Fluorescence Spectroscopy Sensors

Minh Ngoc Nguyen, Khai Le-Duc, Tan-Hanh Pham et al.

In this study, we introduce a novel method to predict mental health by building machine learning models for a non-invasive wearable device equipped with Laser Doppler Flowmetry (LDF) and Fluorescence Spectroscopy (FS) sensors. Besides, we present the corresponding dataset to predict mental health, e.g. depression, anxiety, and stress levels via the DAS-21 questionnaire. To our best knowledge, this is the world's largest and the most generalized dataset ever collected for both LDF and FS studies. The device captures cutaneous blood microcirculation parameters, and wavelet analysis of the LDF signal extracts key rhythmic oscillations. The dataset, collected from 132 volunteers aged 18-94 from 19 countries, explores relationships between physiological features, demographics, lifestyle habits, and health conditions. We employed a variety of machine learning methods to classify stress detection, in which LightGBM is identified as the most effective model for stress detection, achieving a ROC AUC of 0.7168 and a PR AUC of 0.8852. In addition, we also incorporated Explainable Artificial Intelligence (XAI) techniques into our analysis to investigate deeper insights into the model's predictions. Our results suggest that females, younger individuals and those with a higher Body Mass Index (BMI) or heart rate have a greater likelihood of experiencing mental health conditions like stress and anxiety. All related code and data are published online: https://github.com/leduckhai/Wearable_LDF-FS.

3.7CVDec 21, 2024Code

SilVar: Speech Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization

Tan-Hanh Pham, Hoang-Nam Le, Phu-Vinh Nguyen et al.

Visual Language Models have demonstrated remarkable capabilities across tasks, including visual question answering and image captioning. However, most models rely on text-based instructions, limiting their effectiveness in human-machine interactions. Moreover, the quality of language models depends on reasoning and prompting techniques, such as COT, which remain underexplored when using speech instructions. To address these challenges, we propose SilVar, a novel end-to-end multimodal model that uses speech instructions for reasoning in visual question answering. In addition, we investigate reasoning techniques with levels including conversational, simple, and complex speech instruction. SilVar is built upon CLIP, Whisper, and LLaMA 3.1-8B, enabling intuitive interactions by allowing users to provide verbal or text instructions. To this end, we introduce a dataset designed to challenge models with speech-based reasoning tasks for object localization. This dataset enhances the model ability to process and explain visual scenes from spoken input, moving beyond object recognition to reasoning-based interactions. The experiments show that SilVar achieves SOTA performance on the MMMU and ScienceQA benchmarks despite the challenge of speech-based instructions. We believe SilVar will inspire next-generation multimodal reasoning models, toward expert artificial general intelligence. Our code and dataset are available here.

1.9CLAug 8, 2024Code

wav2graph: A Framework for Supervised Learning Knowledge Graph from Speech

Khai Le-Duc, Quy-Anh Dang, Tan-Hanh Pham et al.

Knowledge graphs (KGs) enhance the performance of large language models (LLMs) and search engines by providing structured, interconnected data that improves reasoning and context-awareness. However, KGs only focus on text data, thereby neglecting other modalities such as speech. In this work, we introduce wav2graph, the first framework for supervised learning knowledge graph from speech data. Our pipeline are straightforward: (1) constructing a KG based on transcribed spoken utterances and a named entity database, (2) converting KG into embedding vectors, and (3) training graph neural networks (GNNs) for node classification and link prediction tasks. Through extensive experiments conducted in inductive and transductive learning contexts using state-of-the-art GNN models, we provide baseline results and error analysis for node classification and link prediction tasks on human transcripts and automatic speech recognition (ASR) transcripts, including evaluations using both encoder-based and decoder-based node embeddings, as well as monolingual and multilingual acoustic pre-trained models. All related code, data, and models are published online.

3.7CVFeb 24, 2024

Enhanced Droplet Analysis Using Generative Adversarial Networks

Tan-Hanh Pham, Kim-Doang Nguyen

Precision devices play an important role in enhancing production quality and productivity in agricultural systems. Therefore, the optimization of these devices is essential in precision agriculture. Recently, with the advancements of deep learning, there have been several studies aiming to harness its capabilities for improving spray system performance. However, the effectiveness of these methods heavily depends on the size of the training dataset, which is expensive and time-consuming to collect. To address the challenge of insufficient training samples, we developed an image generator named DropletGAN to generate images of droplets. The DropletGAN model is trained by using a small dataset captured by a high-speed camera and capable of generating images with progressively increasing resolution. The results demonstrate that the model can generate high-quality images with the size of 1024x1024. The generated images from the DropletGAN are evaluated using the Fréchet inception distance (FID) with an FID score of 11.29. Furthermore, this research leverages recent advancements in computer vision and deep learning to develop a light droplet detector using the synthetic dataset. As a result, the detection model achieves a 16.06% increase in mean average precision (mAP) when utilizing the synthetic dataset. To the best of our knowledge, this work stands as the first to employ a generative model for augmenting droplet detection. Its significance lies not only in optimizing nozzle design for constructing efficient spray systems but also in addressing the common challenge of insufficient data in various precision agriculture tasks. This work offers a critical contribution to conserving resources while striving for optimal and sustainable agricultural practices.

10.2CVMay 17, 2025

IQBench: How "Smart'' Are Vision-Language Models? A Study with Human IQ Tests

Tan-Hanh Pham, Phu-Vinh Nguyen, Dang The Hung et al.

Although large Vision-Language Models (VLMs) have demonstrated remarkable performance in a wide range of multimodal tasks, their true reasoning capabilities on human IQ tests remain underexplored. To advance research on the fluid intelligence of VLMs, we introduce **IQBench**, a new benchmark designed to evaluate VLMs on standardized visual IQ tests. We focus on evaluating the reasoning capabilities of VLMs, which we argue are more important than the accuracy of the final prediction. **Our benchmark is visually centric, minimizing the dependence on unnecessary textual content**, thus encouraging models to derive answers primarily from image-based information rather than learned textual knowledge. To this end, we manually collected and annotated 500 visual IQ questions to **prevent unintentional data leakage during training**. Unlike prior work that focuses primarily on the accuracy of the final answer, we evaluate the reasoning ability of the models by assessing their explanations and the patterns used to solve each problem, along with the accuracy of the final prediction and human evaluation. Our experiments show that there are substantial performance disparities between tasks, with models such as `o4-mini`, `gemini-2.5-flash`, and `claude-3.7-sonnet` achieving the highest average accuracies of 0.615, 0.578, and 0.548, respectively. However, all models struggle with 3D spatial and anagram reasoning tasks, highlighting significant limitations in current VLMs' general reasoning abilities. In terms of reasoning scores, `o4-mini`, `gemini-2.5-flash`, and `claude-3.7-sonnet` achieved top averages of 0.696, 0.586, and 0.516, respectively. These results highlight inconsistencies between the reasoning processes of the models and their final answers, emphasizing the importance of evaluating the accuracy of the reasoning in addition to the final predictions.

5.1IVMay 11, 2025

Missing Data Estimation for MR Spectroscopic Imaging via Mask-Free Deep Learning Methods

Tan-Hanh Pham, Ovidiu C. Andronesi, Xianqi Li et al.

Magnetic Resonance Spectroscopic Imaging (MRSI) is a powerful tool for non-invasive mapping of brain metabolites, providing critical insights into neurological conditions. However, its utility is often limited by missing or corrupted data due to motion artifacts, magnetic field inhomogeneities, or failed spectral fitting-especially in high resolution 3D acquisitions. To address this, we propose the first deep learning-based, mask-free framework for estimating missing data in MRSI metabolic maps. Unlike conventional restoration methods that rely on explicit masks to identify missing regions, our approach implicitly detects and estimates these areas using contextual spatial features through 2D and 3D U-Net architectures. We also introduce a progressive training strategy to enhance robustness under varying levels of data degradation. Our method is evaluated on both simulated and real patient datasets and consistently outperforms traditional interpolation techniques such as cubic and linear interpolation. The 2D model achieves an MSE of 0.002 and an SSIM of 0.97 with 20% missing voxels, while the 3D model reaches an MSE of 0.001 and an SSIM of 0.98 with 15% missing voxels. Qualitative results show improved fidelity in estimating missing data, particularly in metabolically heterogeneous regions and ventricular regions. Importantly, our model generalizes well to real-world datasets without requiring retraining or mask input. These findings demonstrate the effectiveness and broad applicability of mask-free deep learning for MRSI restoration, with strong potential for clinical and research integration.

5.0CVSep 2, 2023

Deep-Learning Framework for Optimal Selection of Soil Sampling Sites

Tan-Hanh Pham, Praneel Acharya, Sravanthi Bachina et al.

This work leverages the recent advancements of deep learning in image processing to find optimal locations that present the important characteristics of a field. The data for training are collected at different fields in local farms with five features: aspect, flow accumulation, slope, NDVI (normalized difference vegetation index), and yield. The soil sampling dataset is challenging because the ground truth is highly imbalanced binary images. Therefore, we approached the problem with two methods, the first approach involves utilizing a state-of-the-art model with the convolutional neural network (CNN) backbone, while the second is to innovate a deep-learning design grounded in the concepts of transformer and self-attention. Our framework is constructed with an encoder-decoder architecture with the self-attention mechanism as the backbone. In the encoder, the self-attention mechanism is the key feature extractor, which produces feature maps. In the decoder, we introduce atrous convolution networks to concatenate, fuse the extracted features, and then export the optimal locations for soil sampling. Currently, the model has achieved impressive results on the testing dataset, with a mean accuracy of 99.52%, a mean Intersection over Union (IoU) of 57.35%, and a mean Dice Coefficient of 71.47%, while the performance metrics of the state-of-the-art CNN-based model are 66.08%, 3.85%, and 1.98%, respectively. This indicates that our proposed model outperforms the CNN-based method on the soil-sampling dataset. To the best of our knowledge, our work is the first to provide a soil-sampling dataset with multiple attributes and leverage deep learning techniques to enable the automatic selection of soil-sampling sites. This work lays a foundation for novel applications of data science and machine-learning technologies to solve other emerging agricultural problems.