AIJul 18, 2022
CausNet : Generational orderings based search for optimal Bayesian networks via dynamic programming with parent set constraintsNand Sharma, Joshua Millstein
Finding a globally optimal Bayesian Network using exhaustive search is a problem with super-exponential complexity, which severely restricts the number of variables that it can work for. We implement a dynamic programming based algorithm with built-in dimensionality reduction and parent set identification. This reduces the search space drastically and can be applied to large-dimensional data. We use what we call generational orderings based search for optimal networks, which is a novel way to efficiently search the space of possible networks given the possible parent sets. The algorithm supports both continuous and categorical data, and categorical as well as survival outcomes. We demonstrate the efficacy of our algorithm on both synthetic and real data. In simulations, our algorithm performs better than three state-of-art algorithms that are currently used extensively. We then apply it to an Ovarian Cancer gene expression dataset with 513 genes and a survival outcome. Our algorithm is able to find an optimal network describing the disease pathway consisting of 6 genes leading to the outcome node in a few minutes on a basic computer. Our generational orderings based search for optimal networks, is both efficient and highly scalable approach to finding optimal Bayesian Networks, that can be applied to 1000s of variables. Using specifiable parameters - correlation, FDR cutoffs, and in-degree - one can increase or decrease the number of nodes and density of the networks. Availability of two scoring option-BIC and Bge-and implementation of survival outcomes and mixed data types makes our algorithm very suitable for many types of high dimensional biomedical data to find disease pathways.
NEDec 6, 2017
Single-trial P300 Classification using PCA with LDA, QDA and Neural NetworksNand Sharma
The P300 event-related potential (ERP), evoked in scalp-recorded electroencephalography (EEG) by external stimuli, has proven to be a reliable response for controlling a BCI. The P300 component of an event related potential is thus widely used in brain-computer interfaces to translate the subjects' intent by mere thoughts into commands to control artificial devices. The main challenge in the classification of P300 trials in electroencephalographic (EEG) data is the low signal-to-noise ratio (SNR) of the P300 response. To overcome the low SNR of individual trials, it is common practice to average together many consecutive trials, which effectively diminishes the random noise. Unfortunately, when more repeated trials are required for applications such as the P300 speller, the communication rate is greatly reduced. This has resulted in a need for better methods to improve single-trial classification accuracy of P300 response. In this work, we use Principal Component Analysis (PCA) as a preprocessing method and use Linear Discriminant Analysis (LDA)and neural networks for classification. The results show that a combination of PCA with these methods provided as high as 13\% accuracy gain for single-trial classification while using only 3 to 4 principal components.
LGDec 6, 2017
Regularization and feature selection for large dimensional dataNand Sharma, Prathamesh Verlekar, Rehab Ashary et al.
Feature selection has evolved to be an important step in several machine learning paradigms. In domains like bio-informatics and text classification which involve data of high dimensions, feature selection can help in drastically reducing the feature space. In cases where it is difficult or infeasible to obtain sufficient number of training examples, feature selection helps overcome the curse of dimensionality which in turn helps improve performance of the classification algorithm. The focus of our research here are five embedded feature selection methods which use either the ridge regression, or Lasso regression, or a combination of the two in the regularization part of the optimization function. We evaluate five chosen methods on five large dimensional datasets and compare them on the parameters of sparsity and correlation in the datasets and their execution times.