DBLG Jun 22, 2022

Deep Learning to Jointly Schema Match, Impute, and Transform Databases

arXiv:2207.03536v1
3 citations
h-index: 33
Originality: Highly original
AI Analysis

This paper addresses the applied problem of data harmonization for data scientists, especially in healthcare, enabling more robust algorithm development. The contribution is incremental: it offers new methods for known bottlenecks.

The paper tackles the problem of harmonizing data sources with unmapped and only partially overlapping numeric features, as found in healthcare databases. It develops novel procedures for feature fingerprinting and deep learning translation between databases, and these outperform existing baselines in synthetic and real-world experiments.

An applied problem facing all areas of data science is harmonizing data sources. Joining data from multiple origins with unmapped and only partially overlapping features is a prerequisite to developing and testing robust, generalizable algorithms, especially in health care. We approach this issue in the common but difficult case of numeric features such as nearly Gaussian and binary features, where unit changes and variable shift make simple matching of univariate summaries unsuccessful. We develop two novel procedures to address this problem. First, we demonstrate multiple methods of "fingerprinting" a feature based on its associations to other features. In the setting of even modest prior information, this allows most shared features to be accurately identified. Second, we demonstrate a deep learning algorithm for translation between databases. Unlike prior approaches, our algorithm takes advantage of discovered mappings while identifying surrogates for unshared features and learning transformations. In synthetic and real-world experiments using two electronic health record databases, our algorithms outperform existing baselines for matching variable sets, while jointly learning to impute unshared or transformed variables.
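The "fingerprinting" idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the simplest possible fingerprint (a feature's Pearson correlations with a few anchor features whose mapping is already known, standing in for the "modest prior information") and a greedy nearest-fingerprint match. The helper names (`pearson`, `fingerprint`, `match_features`, `make_table`), the column names, and the synthetic data are all hypothetical. Correlation is used because it is unchanged by a positive linear unit change, which is what defeats matching on univariate summaries such as the mean.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def fingerprint(table, feature, anchors):
    """Fingerprint of a feature: its correlation with each anchor feature."""
    return [pearson(table[feature], table[a]) for a in anchors]


def match_features(table_a, table_b, anchor_pairs):
    """Map each non-anchor column of table_a to the table_b column with the
    nearest fingerprint. anchor_pairs lists (name_in_a, name_in_b) columns
    that are assumed to be already mapped."""
    anchors_a = [a for a, _ in anchor_pairs]
    anchors_b = [b for _, b in anchor_pairs]
    rest_a = [c for c in table_a if c not in anchors_a]
    rest_b = [c for c in table_b if c not in anchors_b]
    fp_b = {c: fingerprint(table_b, c, anchors_b) for c in rest_b}
    matches = {}
    for c in rest_a:
        fp = fingerprint(table_a, c, anchors_a)
        matches[c] = min(rest_b, key=lambda d: math.dist(fp, fp_b[d]))
    return matches


def make_table(seed, names, unit_scale=1.0, unit_shift=0.0, n=2000):
    """Hypothetical synthetic data. `names` maps canonical keys to the
    column names used locally; "temp" may be stored in different units."""
    rng = random.Random(seed)
    table = {col: [] for col in names.values()}
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        values = {
            "age": z1,
            "weight": z2,
            "temp": unit_scale * (0.8 * z1 + 0.2 * z2 + rng.gauss(0, 0.3)) + unit_shift,
            "hr": -0.6 * z1 + 0.6 * z2 + rng.gauss(0, 0.3),
            "bp": 0.1 * z1 + 0.9 * z2 + rng.gauss(0, 0.3),
        }
        for key, col in names.items():
            table[col].append(values[key])
    return table


# Database A uses canonical names; database B has different patients,
# opaque column names, and temperature in different units.
names_a = {k: k for k in ("age", "weight", "temp", "hr", "bp")}
names_b = {"age": "AGE_YRS", "weight": "WT", "bp": "col_7", "temp": "col_3", "hr": "col_9"}
table_a = make_table(1, names_a)
table_b = make_table(2, names_b, unit_scale=1.8, unit_shift=32.0)

matches = match_features(table_a, table_b, [("age", "AGE_YRS"), ("weight", "WT")])
print(matches)
```

Greedy matching can assign two features to the same target and cannot flag a feature as unshared. A one-to-one assignment with a rejection threshold would be the natural next step, and the paper's own procedures should be consulted for how it handles both cases.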

Code Implementations: 1 repo
