LG Nov 6, 2022
Physics Informed Machine Learning for Chemistry Tabulation
Amol Salunkhe, Dwyer Deighan, Paul DesJardin et al.
Modeling a turbulent combustion system requires modeling both the underlying chemistry and the turbulent flow. Solving both systems simultaneously is computationally prohibitive. Instead, given the difference in scales at which the two sub-systems evolve, they are typically (re)solved separately. Popular approaches such as Flamelet Generated Manifolds (FGM) use a two-step strategy: the governing reaction kinetics are pre-computed and mapped to a low-dimensional manifold characterized by a few reaction progress variables (model reduction), and the flow system then looks up this manifold at run-time to estimate the high-dimensional system state. While existing works have focused on these two steps independently, in this work we show that joint learning of the progress variables and the look-up model can yield more accurate results. We build on the base formulation and implementation of ChemTab to include the dynamically generated Thermochemical State Variables (Lower Dimensional Dynamic Source Terms). We discuss the challenges in implementing this deep neural network architecture and experimentally demonstrate its superior performance.
LG Nov 25, 2022
An Ensemble-Based Deep Framework for Estimating Thermo-Chemical State Variables from Flamelet Generated Manifolds
Amol Salunkhe, Georgios Georgalis, Abani Patra et al.
Complete computation of turbulent combustion flow involves two separate steps: mapping reaction kinetics to low-dimensional manifolds and looking up this approximate manifold during CFD run-time to estimate the thermo-chemical state variables. In our previous work, we showed that using a deep architecture to learn the two steps jointly, instead of separately, is 73% more accurate at estimating the source energy, a key state variable, compared to benchmarks and can be integrated within a DNS turbulent combustion framework. In their natural form, such deep architectures do not allow for uncertainty quantification of the quantities of interest: the source energy and key species source terms. In this paper, we expand on such architectures, specifically ChemTab, by introducing deep ensembles to approximate the posterior distribution of the quantities of interest. We investigate two strategies for creating these ensemble models: one that keeps the flamelet origin information (Flamelets strategy) and one that ignores the origin and considers all the data independently (Points strategy). To train these models we used flamelet data generated by the GRI-Mech 3.0 methane mechanism, which consists of 53 chemical species and 325 reactions. Our results demonstrate that the Flamelets strategy is superior in terms of the absolute prediction error for the quantities of interest, but is reliant on the types of flamelets used to train the ensemble. The Points strategy is best at capturing the variability of the quantities of interest, independent of the flamelet types. We conclude that, overall, ChemTab Deep Ensembles allows for a more accurate representation of the source energy and key species source terms, compared to the model without these modifications.
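To illustrate how a deep ensemble can approximate a predictive distribution, the following is a minimal sketch, not the ChemTab implementation. It assumes each ensemble member has already produced a vector of predictions (for example, source energy and one species source term) and summarizes them by a per-output mean and standard deviation. The function name `ensemble_summary` and the numbers are hypothetical.

```python
import math


def ensemble_summary(predictions):
    """Return per-output (means, stds) across ensemble members.

    predictions: list with one entry per ensemble member; each entry is a
    list of predicted values, all of the same length.
    """
    n_members = len(predictions)
    n_outputs = len(predictions[0])
    means, stds = [], []
    for j in range(n_outputs):
        values = [member[j] for member in predictions]
        mean = sum(values) / n_members
        # Population variance of the members' predictions: a simple
        # proxy for the spread of the approximate posterior.
        variance = sum((v - mean) ** 2 for v in values) / n_members
        means.append(mean)
        stds.append(math.sqrt(variance))
    return means, stds


# Hypothetical predictions from three ensemble members for two outputs.
member_predictions = [
    [1.0, 10.0],
    [2.0, 10.0],
    [3.0, 10.0],
]
means, stds = ensemble_summary(member_predictions)
print(means, stds)
```

A large standard deviation for an output signals that the members disagree there, which is the kind of variability the abstract refers to.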
LG Mar 2, 2023
Large Deviations for Accelerating Neural Networks Training
Sreelekha Guggilam, Varun Chandola, Abani Patra
Artificial neural networks (ANNs) require a tremendous amount of data to train on. However, in classification models, most data features are often similar, which can lead to an increase in training time without significant improvement in performance. Thus, we hypothesize that there could be a more efficient way to train an ANN using a better representative sample. For this, we propose the LAD Improved Iterative Training (LIIT), a novel training approach for ANNs that uses the large deviations principle to generate and iteratively update training samples in a fast and efficient setting. This is exploratory work with extensive opportunities for future work. The thesis presents this ongoing research with the following contributions: (1) We propose a novel ANN training method, LIIT, based on large deviations theory, where additional dimensionality reduction is not needed to study high-dimensional data. (2) The LIIT approach uses a Modified Training Sample (MTS) that is generated and iteratively updated using a LAD anomaly score based sampling strategy. (3) The MTS is designed to be well representative of the training data by including the most anomalous observations in each class. This ensures that distinct patterns and features are learnt with smaller samples. (4) We compare the classification performance of LIIT-trained ANNs with that of traditional batch-trained counterparts.
LG Nov 27, 2022
Geo-Adaptive Deep Spatio-Temporal predictive modeling for human mobility
Syed Mohammed Arshad Zaidi, Varun Chandola, EunHye Yoo
Deep learning approaches for spatio-temporal prediction problems such as crowd-flow prediction assume the data to be a fixed, regularly shaped tensor and struggle to handle irregular, sparse data tensors. This limits their use in scenarios such as predicting an individual's visit counts for a given spatial area at a particular temporal resolution using a raster/image representation of the geographical region, since the movement patterns of an individual can be largely restricted and localized to a certain part of the raster. Additionally, current deep-learning approaches to such problems do not account for the geographical awareness of a region while modeling the spatio-temporal movement patterns of an individual. To address these limitations, there is a need for a novel strategy and modeling approach that can handle sparse, irregular data while incorporating geo-awareness in the model. In this paper, we use a quadtree as the data structure for representing the image and introduce a novel geo-aware deep learning layer, GA-ConvLSTM, which performs the convolution operation through a novel geo-aware module built on the quadtree data structure to incorporate spatial dependencies, while maintaining the recurrent mechanism to account for temporal dependencies. We present this approach in the context of predicting the spatial behaviors of an individual (e.g., frequent visits to specific locations) through a deep-learning based predictive model, GADST-Predict. Experimental results on two GPS-based trace data sets show that the proposed method is effective in handling visit frequencies across different use cases with considerably high accuracy.
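As a rough illustration of why a quadtree suits a sparse visit raster, the sketch below recursively splits a square grid until each cell is uniform. It is not the GA-ConvLSTM layer or the GADST-Predict code; it assumes a square grid whose side is a power of two, and the names `build_quadtree` and `leaves` and the sample grid are hypothetical.

```python
def build_quadtree(grid, row=0, col=0, size=None, min_size=1):
    """Recursively split a square raster until every cell is uniform.

    Assumes len(grid) is a power of two and the grid is square.
    """
    if size is None:
        size = len(grid)
    cells = [grid[r][c]
             for r in range(row, row + size)
             for c in range(col, col + size)]
    node = {"row": row, "col": col, "size": size,
            "total": sum(cells), "children": []}
    # Stop when the region is uniform (e.g. all zeros) or cannot be split.
    if size <= min_size or all(v == cells[0] for v in cells):
        return node
    half = size // 2
    node["children"] = [
        build_quadtree(grid, row + dr, col + dc, half, min_size)
        for dr in (0, half) for dc in (0, half)
    ]
    return node


def leaves(node):
    """Collect the leaf cells of the quadtree."""
    if not node["children"]:
        return [node]
    result = []
    for child in node["children"]:
        result.extend(leaves(child))
    return result


# Hypothetical 4x4 visit-count raster: activity only in the top-left corner.
visit_counts = [
    [3, 1, 0, 0],
    [0, 2, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
tree = build_quadtree(visit_counts)
leaf_cells = leaves(tree)
print(len(leaf_cells), tree["total"])
```

Empty quadrants stay as single coarse leaves, while the region the individual actually visits is refined down to single cells.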
LG Apr 7
Modeling Patient Care Trajectories with Transformer Hawkes Processes
Saumya Pandey, Varun Chandola
Patient healthcare utilization consists of irregularly time-stamped events, such as outpatient visits, inpatient admissions, and emergency encounters, forming individualized care trajectories. Modeling these trajectories is crucial for understanding utilization patterns and predicting future care needs, but is challenging due to temporal irregularity and severe class imbalance. In this work, we build on the Transformer Hawkes Process framework to model patient trajectories in continuous time. By combining Transformer-based history encoding with Hawkes process dynamics, the model captures event dependencies and jointly predicts event type and time-to-event. To address extreme imbalance, we introduce an imbalance-aware training strategy using inverse square-root class weighting. This improves sensitivity to rare but clinically important events without altering the data distribution. Experiments on real-world data demonstrate improved performance and provide clinically meaningful insights for identifying high-risk patient populations.
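Inverse square-root class weighting can be sketched as follows. This is a minimal illustration rather than the authors' training code: each class with n training events receives a weight proportional to 1/sqrt(n). The normalization (weights averaging to 1 across classes), the function name, and the event counts are assumptions made for the example.

```python
import math
from collections import Counter


def inverse_sqrt_class_weights(labels):
    """Weight each class by 1/sqrt(count), rescaled to average 1."""
    counts = Counter(labels)
    raw = {cls: 1.0 / math.sqrt(n) for cls, n in counts.items()}
    # Assumed normalization: the mean weight over classes equals 1.
    scale = len(raw) / sum(raw.values())
    return {cls: w * scale for cls, w in raw.items()}


# Hypothetical, heavily imbalanced event-type labels.
labels = ["outpatient"] * 100 + ["inpatient"] * 25 + ["emergency"] * 4
weights = inverse_sqrt_class_weights(labels)
print(weights)
```

With these counts the rarest class is weighted 5 times more than the most common one, whereas plain inverse-frequency weighting would give a factor of 25. The square root therefore boosts rare events more gently, and the data themselves are not resampled.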
LG Feb 20, 2022
ChemTab: A Physics Guided Chemistry Modeling Framework
Amol Salunkhe, Dwyer Deighan, Paul DesJardin et al.
Modeling a turbulent combustion system requires modeling both the underlying chemistry and the turbulent flow. Solving both systems simultaneously is computationally prohibitive. Instead, given the difference in scales at which the two sub-systems evolve, they are typically (re)solved separately. Popular approaches such as Flamelet Generated Manifolds (FGM) use a two-step strategy: the governing reaction kinetics are pre-computed and mapped to a low-dimensional manifold characterized by a few reaction progress variables (model reduction), and the flow system then looks up this manifold at run-time to estimate the high-dimensional system state. While existing works have focused on these two steps independently, we show that joint learning of the progress variables and the look-up model can yield more accurate results. We propose a deep neural network architecture, called ChemTab, customized for the joint learning task and experimentally demonstrate its superiority over existing state-of-the-art methods.
LG Sep 28, 2021
Anomaly Detection for High-Dimensional Data Using Large Deviations Principle
Sreelekha Guggilam, Varun Chandola, Abani Patra
Most current anomaly detection methods suffer from the curse of dimensionality when dealing with high-dimensional data. We propose an anomaly detection algorithm that can scale to high-dimensional data using concepts from the theory of large deviations. The proposed Large Deviations Anomaly Detection (LAD) algorithm is shown to outperform state-of-the-art anomaly detection methods on a variety of large and high-dimensional benchmark data sets. Exploiting the ability of the algorithm to scale to high-dimensional data, we propose an online anomaly detection method to identify anomalies in a collection of multivariate time series. We demonstrate the applicability of the online algorithm in identifying counties in the United States with anomalous trends in COVID-19 related cases and deaths. Several of the identified anomalous counties correlate with counties with a documented poor response to the COVID pandemic.
CV Sep 24, 2021
From images in the wild to video-informed image classification
Marc Böhlen, Varun Chandola, Wawan Sujarwo et al.
Image classifiers work effectively when applied to structured images, yet they often fail when applied to images with very high visual complexity. This paper describes experiments applying state-of-the-art object classifiers to a unique set of images in the wild with high visual complexity collected on the island of Bali. The text describes differences between actual images in the wild and images from ImageNet, and then discusses a novel approach combining informational cues particular to video with an ensemble of imperfect classifiers in order to improve classification results on video-sourced images of plants in the wild.
ML Nov 1, 2019
Integrated Clustering and Anomaly Detection (INCAD) for Streaming Data (Revised)
Sreelekha Guggilam, Syed M. A. Zaidi, Varun Chandola et al.
Most current clustering-based anomaly detection methods use scoring schema and thresholds to classify anomalies. These methods are often tailored to target specific data sets with a "known" number of clusters. The paper provides a streaming clustering and anomaly detection algorithm that does not require strict arbitrary thresholds on the anomaly scores or knowledge of the number of clusters while performing probabilistic anomaly detection and clustering simultaneously. This ensures that the cluster formation is not impacted by the presence of anomalous data, thereby leading to a more reliable definition of "normal vs abnormal" behavior. The motivations behind developing the INCAD model and the path that leads to the streaming model are discussed.
ML May 29, 2019
Bayesian Anomaly Detection Using Extreme Value Theory
Sreelekha Guggilam, S. M. Arshad Zaidi, Varun Chandola et al.
Data-driven anomaly detection methods typically build a model for the normal behavior of the target system, and score each data instance with respect to this model. A threshold is invariably needed to identify data instances with high (or low) scores as anomalies. This presents a practical limitation on the applicability of such methods, since most methods are sensitive to the choice of the threshold, and it is challenging to set optimal thresholds. We present a probabilistic framework to explicitly model the normal and anomalous behaviors and probabilistically reason about the data. An extreme value theory based formulation is proposed to model the anomalous behavior as the extremes of the normal behavior. As a specific instantiation, a joint non-parametric clustering and anomaly detection algorithm is proposed that models the normal behavior as a Dirichlet Process Mixture Model.
LG Oct 1, 2018
Learning Deep Representations from Clinical Data for Chronic Kidney Disease
Duc Thanh Anh Luong, Varun Chandola
We study the behavior of a Time-Aware Long Short-Term Memory Autoencoder, a state-of-the-art method, in the context of learning latent representations from irregularly sampled patient data. We identify a key issue in the way such recurrent neural network models are being currently used and show that the solution of the issue leads to significant improvements in the learnt representations on both synthetic and real datasets. A detailed analysis of the improved methodology for representing patients suffering from Chronic Kidney Disease (CKD) using clinical data is provided. Experimental results show that the proposed T-LSTM model is able to capture the long-term trends in the data, while effectively handling the noise in the signal. Finally, we show that by using the latent representations of the CKD patients obtained from the T-LSTM autoencoder, one can identify unusual patient profiles from the target population.
CR May 30, 2018
Detecting Data Leakage from Databases on Android Apps with Concept Drift
Gokhan Kul, Shambhu Upadhyaya, Varun Chandola
Mobile databases are the statutory backbones of many applications on smartphones, and they store a lot of sensitive information. However, vulnerabilities in the operating system or the app logic can lead to sensitive data leakage by giving the adversaries unauthorized access to the app's database. In this paper, we study such vulnerabilities to define a threat model, and we propose an OS-version independent protection mechanism that app developers can utilize to detect such attacks. To do so, we model the user behavior with the database query workload created by the original apps. Here, we model the drift in behavior by comparing probability distributions of the query workload features over time. We then use this model to determine if the app behavior drift is anomalous. We evaluate our framework on real-world workloads of three different popular Android apps, and we show that our system was able to detect more than 90% of such attacks.
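Comparing the distribution of a query-workload feature across two time windows can be sketched as below. The abstract does not state which distance the authors use; the Jensen-Shannon divergence here is a swapped-in assumption, and the feature (statement type), window contents, and function names are hypothetical.

```python
import math
from collections import Counter


def distribution(values):
    """Empirical probability distribution of a categorical feature."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a[k] * math.log2(a[k] / mix[k])
                   for k in keys if a.get(k, 0.0) > 0.0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical statement types seen in a baseline and two later windows.
baseline = distribution(["SELECT", "SELECT", "INSERT", "UPDATE"])
same_behavior = distribution(["SELECT", "INSERT", "SELECT", "UPDATE"])
new_behavior = distribution(["DELETE", "DELETE", "DELETE", "DELETE"])

drift_same = js_divergence(baseline, same_behavior)
drift_new = js_divergence(baseline, new_behavior)
print(drift_same, drift_new)
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (disjoint support), so a drift score near 1 would be a candidate for flagging as anomalous under a threshold chosen by the app developer.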
ML Apr 24, 2018
Learning Manifolds from Non-stationary Streaming Data
Suchismit Mahapatra, Varun Chandola
Streaming adaptations of manifold learning based dimensionality reduction methods, such as Isomap, are based on the assumption that a small initial batch of observations is enough for exact learning of the manifold, while remaining streaming data instances can be cheaply mapped to this manifold. However, there are no theoretical results to show that this core assumption is valid. Moreover, such methods typically assume that the underlying data distribution is stationary and are not equipped to detect, or handle, sudden changes or gradual drifts in the distribution that may occur when the data is streaming. We present theoretical results to show that the quality of a manifold asymptotically converges as the size of data increases. We then show that a Gaussian Process Regression (GPR) model that uses a manifold-specific kernel function and is trained on an initial batch of sufficient size can closely approximate the state-of-the-art streaming Isomap algorithms. The predictive variance obtained from the GPR prediction is then shown to be an effective detector of changes in the underlying data distribution. Results on several synthetic and real data sets show that the resulting algorithm can effectively learn a lower-dimensional representation of high-dimensional data in a streaming setting, while identifying shifts in the generative distribution.
LG Apr 3, 2018
Hospital Readmission Prediction - Applying Hierarchical Sparsity Norms for Interpretable Models
Jialiang Jiang, Sharon Hewner, Varun Chandola
Hospital readmissions have become one of the key measures of healthcare quality. Preventable readmissions have been identified as one of the primary targets for reducing costs and improving healthcare delivery. However, most data driven studies for understanding readmissions have produced black box classification and predictive models with moderate performance, which precludes them from being used effectively within the decision support systems in the hospitals. In this paper we present an application of structured sparsity-inducing norms for predicting readmission risk for patients based on their disease history and demographics. Most existing studies have focused on hospital utilization, test results, etc., to assign a readmission label to each episode of hospitalization. However, we focus on assigning a readmission risk label to a patient based on their disease history. Our emphasis is on interpreting the models to improve the understanding of the readmission problem. To achieve this, we exploit the domain induced hierarchical structure available for the disease codes which are the features for the classification algorithm. We use a tree based sparsity-inducing regularization strategy that explicitly uses the domain hierarchy. The resulting model not only outperforms standard regularization procedures but is also highly sparse and interpretable. We analyze the model and identify several significant factors that have an effect on readmission risk. Some of these factors conform to existing beliefs, e.g., impact of surgical complications and infections during hospital stay. Other factors, such as the impact of mental disorder and substance abuse on readmission, provide empirical evidence for several pre-existing but unverified hypotheses. The analysis also reveals previously undiscovered connections such as the influence of socioeconomic factors like lack of housing and malnutrition.
ML Feb 19, 2018
Entropy-Isomap: Manifold Learning for High-dimensional Dynamic Processes
Frank Schoeneman, Varun Chandola, Nils Napp et al.
Scientific and engineering processes deliver massive high-dimensional data sets that are generated as non-linear transformations of an initial state and a few process parameters. Mapping such data to a low-dimensional manifold facilitates better understanding of the underlying processes and enables their optimization. In this paper, we first show that off-the-shelf non-linear spectral dimensionality reduction methods, e.g., Isomap, fail for such data, primarily due to the presence of strong temporal correlations. Then, we propose a novel method, Entropy-Isomap, to address the issue. The proposed method is successfully applied to large data describing a fabrication process of organic materials. The resulting low-dimensional representation correctly captures process control variables, allows for low-dimensional visualization of the material morphology evolution, and provides key insights to improve the process.
ML Oct 17, 2017
S-Isomap++: Multi Manifold Learning from Streaming Data
Suchismit Mahapatra, Varun Chandola
Manifold learning based methods have been widely used for non-linear dimensionality reduction (NLDR). However, in many practical settings, the need to process streaming data is a challenge for such methods, owing to the high computational complexity involved. Moreover, most methods operate under the assumption that the input data is sampled from a single manifold embedded in a high-dimensional space. We propose a method for streaming NLDR when the observed data is either sampled from multiple manifolds or irregularly sampled from a single manifold. We show that existing NLDR methods, such as Isomap, fail in such situations, primarily because they rely on smoothness and continuity of the underlying manifold, which is violated in the scenarios explored in this paper. However, the proposed algorithm is able to learn effectively in the presence of multiple, and potentially intersecting, manifolds, while allowing for the input data to arrive as a massive stream.
ML Nov 13, 2016
Error Metrics for Learning Reliable Manifolds from Streaming Data
Frank Schoeneman, Suchismit Mahapatra, Varun Chandola et al.
Spectral dimensionality reduction is frequently used to identify low-dimensional structure in high-dimensional data. However, learning manifolds, especially from streaming data, is expensive in both computation and memory. In this paper, we argue that a stable manifold can be learned using only a fraction of the stream, and the remaining stream can be mapped to the manifold in a significantly less costly manner. Identifying the transition point at which the manifold is stable is the key step. We present error metrics that allow us to identify the transition point for a given stream by quantitatively assessing the quality of a manifold learned using Isomap. We further propose an efficient mapping algorithm, called S-Isomap, that can be used to map new samples onto the stable manifold. We describe experiments on a variety of data sets that show that the proposed approach is computationally efficient without sacrificing accuracy.