ML LG Aug 16, 2025

Robust Data Fusion via Subsampling

arXiv:2508.12048v14.51 citations h-index: 3

Originality: Incremental advance

AI Analysis

The paper addresses data fusion in domains such as aviation, where limited target data and contaminated external sources hinder model performance. The contribution appears incremental, since it combines existing ideas from subsampling and transfer learning.

The paper tackles robust transfer learning when target data is limited and external data is contaminated with outliers. It proposes subsampling strategies that reduce bias and variance and derives non-asymptotic error bounds for the resulting estimators. Simulations show superior performance of the proposed methods, and an application to airplane risk analysis demonstrates improved estimation efficiency.

Data fusion and transfer learning are rapidly growing fields that enhance model performance for a target population by leveraging other related data sources or tasks. The challenges lie in the various potential heterogeneities between the target and external data, as well as various practical concerns that prevent a naïve data integration. We consider a realistic scenario where the target data is limited in size while the external data is large but contaminated with outliers; such data contamination, along with other computational and operational constraints, necessitates proper selection or subsampling of the external data for transfer learning. To our knowledge, transfer learning and subsampling under data contamination have not been thoroughly investigated. We address this gap by studying various transfer learning methods with subsamples of the external data, accounting for outliers deviating from the underlying true model due to arbitrary mean shifts. Two subsampling strategies are investigated: one aimed at reducing biases and the other at minimizing variances. Approaches to combine these strategies are also introduced to enhance the performance of the estimators. We provide non-asymptotic error bounds for the transfer learning estimators, clarifying the roles of sample sizes, signal strength, sampling rates, magnitude of outliers, and tail behaviors of model error distributions, among other factors. Extensive simulations show the superior performance of the proposed methods. Additionally, we apply our methods to analyze the risk of hard landings in A380 airplanes by utilizing data from other airplane types, demonstrating that robust transfer learning can improve estimation efficiency for relatively rare airplane types with the help of data from other types of airplanes.
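To make the setting in the abstract concrete, here is a minimal sketch in Python with NumPy. It is not the paper's estimator. It assumes a linear model with an intercept, a small clean target sample, and a large external sample in which 20% of responses carry a mean shift of 10. The selection rule (keep external rows whose residual under a target-only pilot fit is at most 4 estimated standard deviations) is a simple stand-in for a bias-reducing subsampling step. All names, sample sizes, and the cutoff are hypothetical choices made for illustration.

```python
import numpy as np


def ols(X, y):
    """Ordinary least squares coefficients via numpy.linalg.lstsq."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def select_by_residual(X_ext, y_ext, beta_pilot, tau):
    """Hypothetical rule: indices of external rows with |residual| <= tau under the pilot fit."""
    resid = np.abs(y_ext - X_ext @ beta_pilot)
    return np.flatnonzero(resid <= tau)


def add_intercept(Z):
    """Prepend a column of ones to the feature matrix."""
    return np.column_stack([np.ones(len(Z)), Z])


rng = np.random.default_rng(0)
n_target, n_ext, d = 40, 2000, 4              # illustrative sizes, not from the paper
beta_true = np.array([0.5, 1.0, -1.0, 2.0, 0.0])  # intercept followed by d slopes

# Small, clean target sample.
X_t = add_intercept(rng.normal(size=(n_target, d)))
y_t = X_t @ beta_true + rng.normal(size=n_target)

# Large external sample; 20% of rows get a mean shift of 10 (assumed contamination).
X_e = add_intercept(rng.normal(size=(n_ext, d)))
shift = np.zeros(n_ext)
outlier_idx = rng.choice(n_ext, size=n_ext // 5, replace=False)
shift[outlier_idx] = 10.0
y_e = X_e @ beta_true + shift + rng.normal(size=n_ext)

# Pilot fit on the target data alone, plus a noise scale estimate from its residuals.
beta_target = ols(X_t, y_t)
rss = np.sum((y_t - X_t @ beta_target) ** 2)
sigma_hat = np.sqrt(rss / (n_target - X_t.shape[1]))

# Subsample the external data, then refit on target plus retained external rows.
keep = select_by_residual(X_e, y_e, beta_target, tau=4.0 * sigma_hat)
beta_sub = ols(np.vstack([X_t, X_e[keep]]), np.concatenate([y_t, y_e[keep]]))

# Baseline that pools everything without any selection.
beta_naive = ols(np.vstack([X_t, X_e]), np.concatenate([y_t, y_e]))


def l2_error(beta_hat):
    """Euclidean distance between an estimate and the true coefficients."""
    return float(np.linalg.norm(beta_hat - beta_true))


err_target, err_naive, err_sub = map(l2_error, (beta_target, beta_naive, beta_sub))
print(f"target only: {err_target:.3f}")
print(f"naive pooling: {err_naive:.3f}")
print(f"residual-based subsample: {err_sub:.3f}")
```

In this toy setup, naive pooling lets the mean shifts bias the intercept, the target-only fit is unbiased but noisy, and the residual-based subsample borrows strength from the external rows that agree with the pilot fit. A fixed cutoff that is too tight would also discard well-behaved rows and pull the refit toward the pilot estimate, which is one reason the cutoff here is deliberately loose.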

