DBJun 22, 2022
Deep Learning to Jointly Schema Match, Impute, and Transform DatabasesSandhya Tripathi, Bradley A. Fritz, Mohamed Abdelhack et al.
An applied problem facing all areas of data science is harmonizing data sources. Joining data from multiple origins with unmapped and only partially overlapping features is a prerequisite to developing and testing robust, generalizable algorithms, especially in health care. We approach this issue in the common but difficult case of numeric features such as nearly Gaussian and binary features, where unit changes and variable shift make simple matching of univariate summaries unsuccessful. We develop two novel procedures to address this problem. First, we demonstrate multiple methods of "fingerprinting" a feature based on its associations to other features. In the setting of even modest prior information, this allows most shared features to be accurately identified. Second, we demonstrate a deep learning algorithm for translation between databases. Unlike prior approaches, our algorithm takes advantage of discovered mappings while identifying surrogates for unshared features and learning transformations. In synthetic and real-world experiments using two electronic health record databases, our algorithms outperform existing baselines for matching variable sets, while jointly learning to impute unshared or transformed variables.
CVMar 30, 2020Code
An Open-source Tool for Hyperspectral Image Augmentation in TensorflowMohamed Abdelhack
Satellite imagery allows a plethora of applications ranging from weather forecasting to land surveying. The rapid development of computer vision systems could open new horizons to the utilization of satellite data due to the abundance of large volumes of data. However, current state-of-the-art computer vision systems mainly cater to applications that mainly involve natural images. While useful, those images exhibit a different distribution from satellite images in addition to having more spectral channels. This allows the use of pretrained deep learning models only in a subset of spectral channels that are equivalent to natural images thus discarding valuable information from other spectral channels. This calls for research effort to optimize deep learning models for satellite imagery to enable the assessment of their utility in the domain of remote sensing. Tensorflow tool allows for rapid prototyping and testing of deep learning models, however, its built-in image generator is designed to handle a maximum of four spectral channels. This manuscript introduces an open-source tool that allows the implementation of image augmentation for hyperspectral images in Tensorflow. Given how accessible and easy-to-use Tensorflow is, this tool would provide many researchers with the means to implement, test, and deploy deep learning models for remote sensing applications.
LGJul 19, 2021
A Modulation Layer to Increase Neural Network Robustness Against Data Quality IssuesMohamed Abdelhack, Jiaming Zhang, Sandhya Tripathi et al.
Data missingness and quality are common problems in machine learning, especially for high-stakes applications such as healthcare. Developers often train machine learning models on carefully curated datasets using only high quality data; however, this reduces the utility of such models in production environments. We propose a novel neural network modification to mitigate the impacts of low quality and missing data which involves replacing the fixed weights of a fully-connected layer with a function of an additional input. This is inspired from neuromodulation in biological neural networks where the cortex can up- and down-regulate inputs based on their reliability and the presence of other data. In testing, with reliability scores as a modulating signal, models with modulating layers were found to be more robust against degradation of data quality, including additional missingness. These models are superior to imputation as they save on training time by completely skipping the imputation process and further allow the introduction of other data quality measures that imputation cannot handle. Our results suggest that explicitly accounting for reduced information quality with a modulating fully connected layer can enable the deployment of artificial intelligence systems in real-time applications.
LGNov 3, 2020
(Un)fairness in Post-operative Complication Prediction ModelsSandhya Tripathi, Bradley A. Fritz, Mohamed Abdelhack et al.
With the current ongoing debate about fairness, explainability and transparency of machine learning models, their application in high-impact clinical decision-making systems must be scrutinized. We consider a real-life example of risk estimation before surgery and investigate the potential for bias or unfairness of a variety of algorithms. Our approach creates transparent documentation of potential bias so that the users can apply the model carefully. We augment a model-card like analysis using propensity scores with a decision-tree based guide for clinicians that would identify predictable shortcomings of the model. In addition to functioning as a guide for users, we propose that it can guide the algorithm development and informatics team to focus on data sources and structures that can address these shortcomings.