Thu Nguyen

LG
h-index: 24
24 papers
1,417 citations
Novelty: 47%
AI Score: 32

24 Papers

CV | Dec 6, 2022
VISEM-Tracking, a human spermatozoa tracking dataset

Vajira Thambawita, Steven A. Hicks, Andrea M. Storås et al.

A manual assessment of sperm motility requires microscopy observation, which is challenging due to the fast-moving spermatozoa in the field of view. To obtain correct results, manual evaluation requires extensive training. Therefore, computer-assisted sperm analysis (CASA) has become increasingly used in clinics. Despite this, more data is needed to train supervised machine learning approaches in order to improve accuracy and reliability in the assessment of sperm motility and kinematics. In this regard, we provide a dataset called VISEM-Tracking with 20 video recordings of 30 seconds (comprising 29,196 frames) of wet sperm preparations with manually annotated bounding-box coordinates and a set of sperm characteristics analyzed by experts in the domain. In addition to the annotated data, we provide unlabeled video clips for easy-to-use access and analysis of the data via methods such as self- or unsupervised learning. As part of this paper, we present baseline sperm detection performances using the YOLOv5 deep learning (DL) model trained on the VISEM-Tracking dataset. As a result, we show that the dataset can be used to train complex DL models to analyze spermatozoa.

ML | Oct 11, 2022
Combining datasets to increase the number of samples and improve model fitting

Thu Nguyen, Rabindra Khadka, Nhan Phan et al.

For many use cases, combining information from different datasets can be of interest to improve a machine learning model's performance, especially when the number of samples from at least one of the datasets is small. However, a potential challenge in such cases is that the features from these datasets are not identical, even though some features are shared among the datasets. To tackle this challenge, we propose a novel framework called Combine datasets based on Imputation (ComImp). In addition, we propose PCA-ComImp, a variant of ComImp that uses Principal Component Analysis (PCA) to reduce dimension before combining datasets. This is useful when the datasets have a large number of features that are not shared between them. Furthermore, our framework can also be utilized for data preprocessing by imputing missing data, i.e., filling in the missing entries while combining different datasets. To illustrate the power of the proposed methods and their potential usages, we conduct experiments on various tasks (regression and classification) and data types (tabular and time series data), and when the datasets to be combined have missing data. We also investigate how the devised methods can be used with transfer learning to provide even further model training improvement. Our results indicate that the proposed methods are somewhat similar to transfer learning in that the merge can significantly improve the accuracy of a prediction model on smaller datasets. In addition, the methods can boost performance by a significant margin when combining small datasets and can provide extra improvement when used with transfer learning.
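The stacking idea in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `combine_by_imputation` is hypothetical, and column-mean imputation is only a placeholder for whatever imputer would be chosen in practice.

```python
import numpy as np

def combine_by_imputation(X1, cols1, X2, cols2):
    """Stack two datasets on the union of their features, then fill the gaps.

    Hypothetical helper for illustration; the abstract does not specify an API.
    """
    # Union of feature names, keeping first-seen order.
    all_cols = list(dict.fromkeys(list(cols1) + list(cols2)))
    index = {c: j for j, c in enumerate(all_cols)}

    n1, n2 = X1.shape[0], X2.shape[0]
    stacked = np.full((n1 + n2, len(all_cols)), np.nan)
    stacked[:n1, [index[c] for c in cols1]] = X1
    stacked[n1:, [index[c] for c in cols2]] = X2

    # Placeholder imputer (assumption): mean of the observed entries per column.
    col_means = np.nanmean(stacked, axis=0)
    rows, cols = np.where(np.isnan(stacked))
    stacked[rows, cols] = col_means[cols]
    return stacked, all_cols

# Two small datasets that share only feature "b".
X1 = np.array([[1.0, 2.0], [3.0, 4.0]])
X2 = np.array([[10.0, 20.0]])
combined, columns = combine_by_imputation(X1, ["a", "b"], X2, ["b", "c"])
```

Any stronger imputer could replace the mean step; the point of the sketch is only that features missing from one dataset become missing entries in the stacked matrix.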

LG | May 30, 2022
Principal Component Analysis based frameworks for efficient missing data imputation algorithms

Thu Nguyen, Hoang Thien Ly, Michael Alexander Riegler et al.

Missing data is a commonly occurring problem in practice. Many imputation methods have been developed to fill in the missing entries. However, not all of them can scale to high-dimensional data, especially the multiple imputation techniques. Meanwhile, data nowadays tends to be high-dimensional. Therefore, in this work, we propose Principal Component Analysis Imputation (PCAI), a simple but versatile framework based on Principal Component Analysis (PCA) to speed up the imputation process and alleviate memory issues of many available imputation techniques, without sacrificing the imputation quality in terms of MSE. In addition, the framework can be used even when some or all of the missing features are categorical, or when the number of missing features is large. Next, we introduce PCA Imputation - Classification (PIC), an application of PCAI for classification problems with some adjustments. We validate our approach by experiments on various scenarios, which show that PCAI and PIC can work with various imputation algorithms, including the state-of-the-art ones, and improve the imputation speed significantly, while achieving competitive mean square error/classification accuracy compared to direct imputation (i.e., imputing directly on the missing data).
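One way to picture the reduce-then-impute idea is the sketch below. It is an illustration under stated assumptions, not the PCAI algorithm itself: the function names are hypothetical, and the imputer is a simple per-column least-squares regression on the principal components, standing in for whichever imputation method would actually be plugged in.

```python
import numpy as np

def pca_reduce(F, k):
    """Project the fully observed block F onto its first k principal components."""
    Fc = F - F.mean(axis=0)
    _, _, Vt = np.linalg.svd(Fc, full_matrices=False)
    return Fc @ Vt[:k].T

def regression_impute(R, M):
    """Placeholder imputer (assumption): regress each incomplete column on R."""
    M = M.copy()
    A = np.column_stack([np.ones(len(R)), R])  # intercept + components
    for j in range(M.shape[1]):
        miss = np.isnan(M[:, j])
        if miss.any() and (~miss).any():
            coef, *_ = np.linalg.lstsq(A[~miss], M[~miss, j], rcond=None)
            M[miss, j] = A[miss] @ coef
    return M

def pcai_sketch(X, k):
    """Reduce the fully observed columns with PCA, then impute the rest."""
    has_nan = np.isnan(X).any(axis=0)
    R = pca_reduce(X[:, ~has_nan], k)
    out = X.copy()
    out[:, has_nan] = regression_impute(R, X[:, has_nan])
    return out
```

The imputer only ever sees `k` components instead of all fully observed features, which is where the speed and memory savings described in the abstract would come from.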

ML | Feb 2, 2023
Conditional expectation with regularization for missing data imputation

Mai Anh Vu, Thu Nguyen, Tu T. Do et al.

Missing data frequently occurs in datasets across various domains, such as medicine, sports, and finance. In many cases, to enable proper and reliable analyses of such data, the missing values are imputed, and it is necessary that the method used has a low root mean square error (RMSE) between the imputed and the true values. In addition, for some critical applications, it is also often a requirement that the imputation method is scalable and the logic behind the imputation is explainable, which is especially difficult for complex methods that are, for example, based on deep learning. Based on these considerations, we propose a new algorithm named "conditional Distribution-based Imputation of Missing Values with Regularization" (DIMV). DIMV operates by determining the conditional distribution of a feature that has missing entries, using the information from the fully observed features as a basis. As will be illustrated via experiments in the paper, DIMV (i) gives a low RMSE for the imputed values compared to state-of-the-art methods; (ii) is fast and scalable; (iii) is explainable as coefficients in a regression model, allowing reliable and trustable analysis, which makes it a suitable choice for critical domains where understanding is important, such as medicine and finance; (iv) can provide an approximated confidence region for the missing values in a given sample; (v) is suitable for both small- and large-scale data; (vi) in many scenarios, does not require as many parameters as deep learning approaches; (vii) handles multicollinearity in imputation effectively; and (viii) is robust to the normality assumption that its theoretical grounds rely on.
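A conditional-expectation imputation with regularization can be sketched as below. This is a minimal illustration assuming a Gaussian model with known mean `mu` and covariance `sigma`; the ridge-style term `alpha * I` is an assumption about the form of the regularization, which the abstract does not spell out, and the function name is hypothetical.

```python
import numpy as np

def conditional_mean_impute(x, mu, sigma, alpha=0.0):
    """Fill NaNs in x with E[x_missing | x_observed] under a Gaussian model.

    alpha adds a ridge term to the observed-block covariance (assumed form).
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    miss = np.isnan(x)
    obs = ~miss
    out = x.copy()
    if not miss.any():
        return out
    if not obs.any():
        out[miss] = mu[miss]  # nothing to condition on
        return out
    s_oo = sigma[np.ix_(obs, obs)] + alpha * np.eye(int(obs.sum()))
    s_mo = sigma[np.ix_(miss, obs)]
    # mu_m + Sigma_mo (Sigma_oo + alpha I)^{-1} (x_o - mu_o)
    out[miss] = mu[miss] + s_mo @ np.linalg.solve(s_oo, x[obs] - mu[obs])
    return out
```

The weights `s_mo @ inv(s_oo)` play the role of regression coefficients, which is the sense in which such an imputation can be read as explainable; a larger `alpha` shrinks them and stabilizes the solve when observed features are nearly collinear.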

LG | Mar 3, 2022
Parallel feature selection based on the trace ratio criterion

Thu Nguyen, Thanh Nhan Phan, Van Nhuong Nguyen et al.

The growth of data today poses a challenge in management and inference. While feature extraction methods are capable of reducing the size of the data for inference, they do not help in minimizing the cost of data storage. On the other hand, feature selection helps to remove the redundant features and therefore is helpful not only in inference but also in reducing management costs. This work presents a novel parallel feature selection approach for classification, namely Parallel Feature Selection using Trace criterion (PFST), which scales up to very large datasets. Our method uses the trace criterion, a measure of class separability used in Fisher's Discriminant Analysis, to evaluate feature usefulness. We analyze the criterion's desirable properties theoretically. Based on the criterion, PFST rapidly finds important features out of a set of features for big datasets by first making a forward selection, in parallel, with early removal of seemingly redundant features. After the most important features are included in the model, we re-check their contributions for possible interactions that may improve the fit. Lastly, we make a backward selection to remove possibly redundant features added by the forward steps. We evaluate our methods via various experiments using Linear Discriminant Analysis as the classifier on selected features. The experiments show that our method can produce a small set of features in a fraction of the time required by the other methods under comparison. In addition, the classifier trained on the features selected by PFST not only achieves better accuracy than the ones chosen by other approaches but can also achieve better accuracy than the classification on all available features.
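To make the trace criterion concrete, the sketch below scores a feature subset by trace(S_w^{-1} S_b), using unnormalized within-class and between-class scatter matrices (one common convention, assumed here). The greedy loop is plain sequential forward selection only; it does not reproduce PFST's parallelism, early removal, interaction re-check, or backward step, and both function names are hypothetical.

```python
import numpy as np

def trace_criterion(X, y, features):
    """Class separability of a feature subset: trace(pinv(S_w) @ S_b)."""
    Xs = X[:, features]
    overall_mean = Xs.mean(axis=0)
    d = Xs.shape[1]
    S_w = np.zeros((d, d))  # within-class scatter
    S_b = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = Xs[y == c]
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        S_w += centered.T @ centered
        diff = (mean_c - overall_mean)[:, None]
        S_b += len(Xc) * (diff @ diff.T)
    # pinv keeps the sketch usable when S_w is singular.
    return float(np.trace(np.linalg.pinv(S_w) @ S_b))

def forward_select(X, y, n_features):
    """Greedy forward selection driven by the trace criterion."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < n_features:
        best = max(remaining, key=lambda j: trace_criterion(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

In each round the candidate evaluations are independent of one another, so they could be distributed across workers; that is one natural place for parallelism, though the paper's actual scheme may differ.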

LG | Nov 28, 2023
Imputation using training labels and classification via label imputation

Thu Nguyen, Tuan L. Vo, Pål Halvorsen et al.

Missing data is a common problem in practical data science settings. Various imputation methods have been developed to deal with missing data. However, even though the labels are available in the training data in many situations, the common practice of imputation usually only relies on the input and ignores the label. We propose Classification Based on MissForest Imputation (CBMI), a classification strategy that initializes the predicted test label with missing values and stacks the label with the input for imputation, allowing the label and the input to be imputed simultaneously. In addition, we propose the imputation using labels (IUL) algorithm, an imputation strategy that stacks the label into the input, and illustrate how it can significantly improve the imputation quality. Experiments show that CBMI has classification accuracy when the test set contains missing data, especially for imbalanced data and categorical data. Moreover, for both regression and classification, IUL consistently shows significantly better results than imputation based on only the input data.

LG | May 26, 2022
Unequal Covariance Awareness for Fisher Discriminant Analysis and Its Variants in Classification

Thu Nguyen, Quang M. Le, Son N. T. Tu et al.

Fisher Discriminant Analysis (FDA) is one of the essential tools for feature extraction and classification. In addition, it has motivated the development of many improved techniques based on FDA to adapt to different problems or data types. However, none of these approaches make use of the fact that the assumption of equal covariance matrices in FDA is usually not satisfied in practical situations. Therefore, we propose a novel classification rule for FDA that accounts for this fact, mitigating the effect of unequal covariance matrices in FDA. Furthermore, since we only modify the classification rule, the same rule can be applied to many FDA variants, improving these algorithms further. Theoretical analysis reveals that the new classification rule allows the implicit use of the class covariance matrices while increasing the number of parameters to be estimated by a small amount compared to going from FDA to Quadratic Discriminant Analysis. We illustrate our idea via experiments, which show the superior performance of the modified algorithms based on our new classification rule compared to the original ones.

CV | Apr 16, 2021 | Code
Learning To Count Everything

Viresh Ranjan, Udbhav Sharma, Thu Nguyen et al.

Existing works on visual counting primarily focus on one specific category at a time, such as people, animals, and cells. In this paper, we are interested in counting everything, that is, counting objects from any category given only a few annotated instances from that category. To this end, we pose counting as a few-shot regression task. To tackle this task, we present a novel method that takes a query image together with a few exemplar objects from the query image and predicts a density map for the presence of all objects of interest in the query image. We also present a novel adaptation strategy to adapt our network to any novel visual category at test time, using only a few exemplar objects from the novel category. We also introduce a dataset of 147 object categories containing over 6000 images that are suitable for the few-shot counting task. The images are annotated with two types of annotation, dots and bounding boxes, and they can be used for developing few-shot counting models. Experiments on this dataset show that our method outperforms several state-of-the-art object detectors and few-shot counting approaches. Our code and dataset can be found at https://github.com/cvlab-stonybrook/LearningToCountEverything.

HC | Mar 21, 2024
How Human-Centered Explainable AI Interface Are Designed and Evaluated: A Systematic Survey

Thu Nguyen, Alessandro Canossa, Jichen Zhu

Despite its technological breakthroughs, eXplainable Artificial Intelligence (XAI) research has had limited success in producing the effective explanations needed by users. In order to improve XAI systems' usability, practical interpretability, and efficacy for real users, the emerging area of Explainable Interfaces (EIs) focuses on the user interface and user experience design aspects of XAI. This paper presents a systematic survey of 53 publications to identify current trends in human-XAI interaction and promising directions for EI design and development. This is among the first systematic surveys of EI research.

LG | Dec 15, 2024
Missing data imputation for noisy time-series data and applications in healthcare

Lien P. Le, Xuan-Hien Nguyen Thi, Thu Nguyen et al.

Healthcare time series data is vital for monitoring patient activity but often contains noise and missing values due to various reasons such as sensor errors or data interruptions. Imputation, i.e., filling in the missing values, is a common way to deal with this issue. In this study, we compare imputation methods, including Multiple Imputation with Random Forest (MICE-RF) and advanced deep learning approaches (SAITS, BRITS, Transformer), for noisy time series data with missing values in terms of MAE, F1-score, AUC, and MCC, across missing data rates (10%-80%). Our results show that MICE-RF can effectively impute missing data compared to deep learning methods, and the improvement in classification on the imputed data indicates that imputation can have denoising effects. Therefore, using an imputation algorithm on time series with missing data can, at the same time, offer denoising effects.

ML | Jan 17, 2025
DPERC: Direct Parameter Estimation for Mixed Data

Tuan L. Vo, Quan Huu Do, Uyen Dang et al.

The covariance matrix is a foundation in numerous statistical and machine-learning applications such as Principal Component Analysis and correlation heatmaps. However, missing values within datasets present a formidable obstacle to accurately estimating this matrix. While imputation methods offer one avenue for addressing this challenge, they often entail a trade-off between computational efficiency and estimation accuracy. Consequently, attention has shifted towards direct parameter estimation, given its precision and reduced computational burden. In this paper, we propose Direct Parameter Estimation for Randomly Missing Data with Categorical Features (DPERC), an efficient approach for direct parameter estimation tailored to mixed data that contains missing values within continuous features. Our method is motivated by leveraging information from categorical features, which can significantly enhance covariance matrix estimation for continuous features. Our approach effectively harnesses the information embedded within mixed data structures. Through comprehensive evaluations of diverse datasets, we demonstrate the competitive performance of DPERC compared to various contemporary techniques. In addition, we also show by experiments that DPERC is a valuable tool for visualizing the correlation heatmap.

LG | Jan 31, 2025
Principal Components for Neural Network Initialization

Nhan Phan, Thu Nguyen, Uyen Dang et al.

Principal Component Analysis (PCA) is a commonly used tool for dimension reduction and denoising. Therefore, it is also widely used on the data prior to training a neural network. However, this approach can complicate the explanation of eXplainable Artificial Intelligence (XAI) methods for the decision of the model. In this work, we analyze the potential issues with this approach and propose Principal Components-based Initialization (PCsInit), a strategy to incorporate PCA into the first layer of a neural network via initialization of the first layer in the network with the principal components, and its two variants PCsInit-Act and PCsInit-Sub. We will show that explanations using these strategies are simpler, more direct, and more straightforward than using PCA prior to training a neural network on the principal components. We also show that the proposed techniques possess desirable theoretical properties. Moreover, as will be illustrated in the experiments, such training strategies can also allow further improvement of training via backpropagation compared to training neural networks on principal components.
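The core idea of initializing a first layer with principal components can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: the function names are hypothetical, and folding the centering step into the bias is a choice made here so that the layer reproduces PCA scores on raw inputs.

```python
import numpy as np

def pcs_init(X, k):
    """Weights and bias so that a linear first layer outputs the top-k PCA scores."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:k]       # rows are principal directions, shape (k, d)
    b = -W @ mean    # absorbs centering into the bias (assumption)
    return W, b

def first_layer(X, W, b):
    """Linear first layer applied to raw (uncentered) inputs."""
    return X @ W.T + b

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
W, b = pcs_init(X, k=3)
hidden = first_layer(X, W, b)  # equals the PCA scores at initialization
```

Because the network takes the original features as input, attribution methods would explain predictions in terms of those features directly, and `W` and `b` remain trainable by backpropagation after this initialization.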

LG | Jun 30, 2024
Directly Handling Missing Data in Linear Discriminant Analysis for Enhancing Classification Accuracy and Interpretability

Tuan L. Vo, Uyen Dang, Thu Nguyen

As the adoption of Artificial Intelligence (AI) models expands into critical real-world applications, ensuring the explainability of these models becomes paramount, particularly in sensitive fields such as medicine and finance. Linear Discriminant Analysis (LDA) remains a popular choice for classification due to its interpretable nature, derived from its capacity to model class distributions and enhance class separation through linear combinations of features. However, real-world datasets often suffer from incomplete data, posing substantial challenges for both classification accuracy and model interpretability. In this paper, we introduce a novel and robust classification method, termed Weighted missing Linear Discriminant Analysis (WLDA), which extends LDA to handle datasets with missing values without the need for imputation. Our approach innovatively incorporates a weight matrix that penalizes missing entries, thereby refining parameter estimation directly on incomplete data. This methodology not only preserves the interpretability of LDA but also significantly enhances classification performance in scenarios plagued by missing data. We conduct an in-depth theoretical analysis to establish the properties of WLDA and thoroughly evaluate its explainability. Experimental results across various datasets demonstrate that WLDA consistently outperforms traditional methods, especially in challenging environments where missing values are prevalent in both training and test datasets. This advancement provides a critical tool for improving classification accuracy and maintaining model transparency in the face of incomplete data.

LG | Jun 29, 2024
Explainability of Machine Learning Models under Missing Data

Tuan L. Vo, Thu Nguyen, Luis M. Lopez-Ramos et al.

Missing data is a prevalent issue that can significantly impair model performance and explainability. This paper briefly summarizes the development of the field of missing data with respect to Explainable Artificial Intelligence and experimentally investigates the effects of various imputation methods on SHAP (SHapley Additive exPlanations), a popular technique for explaining the output of complex machine learning models. Next, we compare different imputation strategies and assess their impact on feature importance and interaction as determined by Shapley values. Moreover, we also theoretically analyze the effects of missing values on Shapley values. Importantly, our findings reveal that the choice of imputation method can introduce biases that could lead to changes in the Shapley values, thereby affecting the explainability of the model. Moreover, we also show that a lower test prediction MSE (Mean Square Error) does not necessarily imply a lower MSE in Shapley values and vice versa. Also, while XGBoost (eXtreme Gradient Boosting) is a method that could handle missing data directly, using XGBoost directly on missing data can seriously affect explainability compared to imputing the data before training XGBoost. This study provides a comprehensive evaluation of imputation methods in the context of model explanations, offering practical guidance for selecting appropriate techniques based on dataset characteristics and analysis objectives. The results underscore the importance of considering imputation effects to ensure robust and reliable insights from machine learning models.

LG | May 10, 2023
Correlation visualization under missing values: a comparison between imputation and direct parameter estimation methods

Nhat-Hao Pham, Khanh-Linh Vo, Mai Anh Vu et al.

Correlation matrix visualization is essential for understanding the relationships between variables in a dataset, but missing data can pose a significant challenge in estimating correlation coefficients. In this paper, we compare the effects of various missing data methods on the correlation plot, focusing on two common missing patterns: random and monotone. We aim to provide practical strategies and recommendations for researchers and practitioners in creating and analyzing the correlation plot. Our experimental results suggest that while imputation is commonly used for missing data, using imputed data for plotting the correlation matrix may lead to a significantly misleading inference of the relation between the features. We recommend using DPER, a direct parameter estimation approach, for plotting the correlation matrix based on its performance in the experiments.

LG | May 10, 2023
Blockwise Principal Component Analysis for monotone missing data imputation and dimensionality reduction

Tu T. Do, Mai Anh Vu, Tuan L. Vo et al.

Monotone missing data is a common problem in data analysis. However, imputation combined with dimensionality reduction can be computationally expensive, especially with the increasing size of datasets. To address this issue, we propose a Blockwise principal component analysis Imputation (BPI) framework for dimensionality reduction and imputation of monotone missing data. The framework conducts Principal Component Analysis (PCA) on the observed part of each monotone block of the data, merges the obtained principal components, and then imputes the merged result using a chosen imputation technique. BPI can work with various imputation techniques and can significantly reduce imputation time compared to conducting dimensionality reduction after imputation. This makes it a practical and efficient approach for large datasets with monotone missing data. Our experiments validate the improvement in speed. In addition, our experiments also show that while applying MICE imputation directly on missing data may not yield convergence, applying BPI with MICE may lead to convergence.

LG | Feb 18, 2022
FinNet: Solving Time-Independent Differential Equations with Finite Difference Neural Network

Son N. T. Tu, Thu Nguyen

Deep learning approaches for partial differential equations (PDEs) have received much attention in recent years due to their mesh-freeness and computational efficiency. However, most of the works so far have concentrated on time-dependent nonlinear differential equations. In this work, we analyze potential issues with the well-known Physics-Informed Neural Network (PINN) for differential equations with few constraints on the boundary (i.e., the constraints are only on a few points). This analysis motivates us to introduce a novel technique called FinNet, for solving differential equations by incorporating finite difference into deep learning. Even though we use a mesh during training, the prediction phase is mesh-free. We illustrate the effectiveness of our method through experiments on solving various equations, which show that FinNet can solve PDEs with low error rates and may work even when PINNs cannot.
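To illustrate what incorporating finite differences into a training loss could look like, the sketch below computes a central-difference residual on a uniform mesh. The model problem u'' = f in one dimension and the function names are assumptions chosen for brevity; the abstract does not state FinNet's actual loss. In a training loop, `u` would hold the network's predictions at the mesh points.

```python
import numpy as np

def fd_residual(u, f, h):
    """Central-difference residual of u'' = f at the interior mesh points."""
    u_xx = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / h**2
    return u_xx - f[1:-1]

def fd_loss(u, f, h):
    """Mean squared finite-difference residual, usable as a training loss."""
    return float(np.mean(fd_residual(u, f, h) ** 2))

# Sanity check on a known solution: u(x) = x^2 satisfies u'' = 2.
x = np.linspace(0.0, 1.0, 11)
h = x[1] - x[0]
loss_exact = fd_loss(x**2, np.full_like(x, 2.0), h)
```

The loss needs only function values on the mesh rather than automatic differentiation of the network, while the trained network can still be evaluated at arbitrary points afterwards, consistent with the mesh-free prediction phase described above.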

LG | Aug 26, 2021
StressNAS: Affect State and Stress Detection Using Neural Architecture Search

Lam Huynh, Tri Nguyen, Thu Nguyen et al.

Smartwatches have rapidly evolved towards capabilities to accurately capture physiological signals. As an appealing application, stress detection attracts many studies due to its potential benefits to human health. It is propitious to investigate the applicability of deep neural networks (DNNs) to enhance human decision-making through physiological signals. However, manually engineering a DNN proves a tedious task, especially in stress detection, due to the complex nature of this phenomenon. To this end, we propose an optimized deep neural network training scheme using neural architecture search merely using wrist-worn data from WESAD. Experiments show that our approach outperforms traditional ML methods by 8.22% and 6.02% in the three-state and two-state classifiers, respectively, using the combination of WESAD wrist signals. Moreover, the proposed method can minimize the need for human-designed DNNs while improving performance by 4.39% (three-state) and 8.99% (binary).

ML | Jun 6, 2021
DPER: Efficient Parameter Estimation for Randomly Missing Data

Thu Nguyen, Khoi Minh Nguyen-Duy, Duy Ho Minh Nguyen et al.

The missing data problem has been broadly studied in the last few decades and has various applications in different areas such as statistics or bioinformatics. Even though many methods have been developed to tackle this challenge, most of those are imputation techniques that require multiple iterations through the data before yielding convergence. In addition, such approaches may introduce extra bias and noise to the estimated parameters. In this work, we propose novel algorithms to find the maximum likelihood estimates (MLEs) for a one-class/multiple-class randomly missing data set under some mild assumptions. As the computation is direct without any imputation, our algorithms do not require multiple iterations through the data, thus promising to be less time-consuming than other methods while maintaining superior estimation performance. We validate these claims by empirical results on various data sets of different sizes and release all code in a GitHub repository to contribute to the research community related to this problem.

SI | Apr 22, 2021
COVID-19 and Big Data: Multi-faceted Analysis for Spatio-temporal Understanding of the Pandemic with Social Media Conversations

Shayan Fazeli, Davina Zamanzadeh, Anaelia Ovalle et al.

COVID-19 has been devastating the world since the end of 2019 and has continued to play a significant role in major national and worldwide events, and consequently, the news. In its wake, it has left no life unaffected. Having earned the world's attention, social media platforms have served as a vehicle for the global conversation about COVID-19. In particular, many people have used these sites in order to express their feelings, experiences, and observations about the pandemic. We provide a multi-faceted analysis of critical properties exhibited by these conversations on social media regarding the novel coronavirus pandemic. We present a framework for analysis, mining, and tracking the critical content and characteristics of social media conversations around the pandemic. Focusing on Twitter and Reddit, we have gathered a large-scale dataset on COVID-19 social media conversations. Our analyses cover tracking potential reports on virus acquisition, symptoms, conversation topics, and language complexity measures through time and by region across the United States. We also present a BERT-based model for recognizing instances of hateful tweets in COVID-19 conversations, which achieves a lower error-rate than the state-of-the-art performance. Our results provide empirical validation for the effectiveness of our proposed framework and further demonstrate that social media data can be efficiently leveraged to provide public health experts with inexpensive but thorough insight over the course of an outbreak.

LG | Nov 17, 2020
Structural and Functional Decomposition for Personality Image Captioning in a Communication Game

Thu Nguyen, Duy Phung, Minh Hoai et al.

Personality image captioning (PIC) aims to describe an image with a natural language caption given a personality trait. In this work, we introduce a novel formulation for PIC based on a communication game between a speaker and a listener. The speaker attempts to generate natural language captions while the listener encourages the generated captions to contain discriminative information about the input images and personality traits. In this way, we expect that the generated captions can be improved to naturally represent the images and express the traits. In addition, we propose to adapt the language model GPT2 to perform caption generation for PIC. This enables the speaker and listener to benefit from the language encoding capacity of GPT2. Our experiments show that the proposed model achieves the state-of-the-art performance for PIC.

LG | Sep 23, 2020
EPEM: Efficient Parameter Estimation for Multiple Class Monotone Missing Data

Thu Nguyen, Duy H. M. Nguyen, Huy Nguyen et al.

The problem of monotone missing data has been broadly studied during the last two decades and has many applications in different fields such as bioinformatics or statistics. Commonly used imputation techniques require multiple iterations through the data before yielding convergence. Moreover, those approaches may introduce extra noise and bias to the subsequent modeling. In this work, we derive exact formulas and propose a novel algorithm, namely EPEM, to compute the maximum likelihood estimators (MLEs) of a multiple class, monotone missing dataset when the covariance matrices of all categories are assumed to be equal. We then illustrate an application of our proposed methods in Linear Discriminant Analysis (LDA). As the computation is exact, our EPEM algorithm does not require multiple iterations through the data as other imputation approaches do, thus promising to be much less time-consuming than other methods. This effectiveness was validated by empirical results in which EPEM reduced the error rates significantly and required a short computation time compared to several imputation-based approaches. We also release all code and data of our experiments in one GitHub repository to contribute to the research community related to this problem.

ML | Oct 17, 2019
Faster feature selection with a Dropping Forward-Backward algorithm

Thu Nguyen

In this era of big data, feature selection techniques, which have long been proven to simplify the model, make the model more comprehensible, and speed up the process of learning, have become more and more important. Among many developed methods, forward and stepwise feature selection regression have remained widely used due to their simplicity and efficiency. However, they all involve rescanning all the un-selected features again and again. Moreover, many times, the backward steps in stepwise selection are unnecessary, as we will illustrate in our example. These remarks motivate us to introduce a novel algorithm that may boost speed by up to 65.77% compared to the stepwise procedure while maintaining good performance in terms of the number of selected features and error rates. Also, our experiments illustrate that feature selection procedures may be a better choice for high-dimensional problems where the number of features greatly exceeds the number of samples.

CV | Feb 5, 2018
ASMCNN: An Efficient Brain Extraction Using Active Shape Model and Convolutional Neural Networks

Duy H. M. Nguyen, Duy M. Nguyen, Mai T. N. Truong et al.

Brain extraction (skull stripping) is a challenging problem in neuroimaging. This is due to the variability in conditions from data acquisition or abnormalities in images, which makes brain morphology and intensity characteristics changeable and complicated. In this paper, we propose an algorithm for skull stripping in Magnetic Resonance Imaging (MRI) scans, namely ASMCNN, which combines the Active Shape Model (ASM) and Convolutional Neural Network (CNN) to take full advantage of both and achieve remarkable results. Instead of working with 3D structures, we process 2D image sequences in the sagittal plane. First, we divide images into different groups such that, in each group, shapes and structures of brain boundaries have similar appearances. Second, a modified version of ASM is used to detect brain boundaries by utilizing prior knowledge of each group. Finally, CNN and post-processing methods, including Conditional Random Field (CRF), Gaussian processes, and several special rules, are applied to refine the segmentation contours. Experimental results show that our proposed method outperforms current state-of-the-art algorithms by a significant margin in all experiments.