ML, LG, ST, ME · Oct 13, 2025

Transfer Learning with Distance Covariance for Random Forest: Error Bounds and an EHR Application

arXiv:2510.10870v1 · 1 citation
Originality: Incremental advance
AI Analysis

This work uses source-domain knowledge to improve prediction accuracy in target domains with limited data. Its contribution to random forest methods is incremental.

The authors address transfer learning for random forests in nonparametric regression. They propose centered random forests with distance covariance-based feature weights and report significant gains in predicting ICU patient mortality in smaller hospitals, using a large electronic health records dataset.

Random forest is an important method for ML applications due to its broad outperformance over competing methods for structured tabular data. We propose a method for transfer learning in nonparametric regression using a centered random forest (CRF) with distance covariance-based feature weights, assuming the unknown source and target regression functions are different for a few features (sparsely different). Our method first obtains residuals from predicting the response in the target domain using a source domain-trained CRF. Then, we fit another CRF to the residuals, but with feature splitting probabilities proportional to the sample distance covariance between the features and the residuals in an independent sample. We derive an upper bound on the mean square error rate of the procedure as a function of sample sizes and difference dimension, theoretically demonstrating transfer learning benefits in random forests. In simulations, we show that the results obtained for the CRFs also hold numerically for the standard random forest (SRF) method with data-driven feature split selection. Beyond transfer learning, our results also show the benefit of distance-covariance-based weights on the performance of RF in some situations. Our method shows significant gains in predicting the mortality of ICU patients in smaller-bed target hospitals using a large multi-hospital dataset of electronic health records for 200,000 ICU patients.
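The two-step procedure described in the abstract can be sketched roughly as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation. Every function name here is hypothetical. The sketch uses the V-statistic form of the sample distance covariance and assumes features scaled to [0, 1]. An empty cell predicts 0. The depths, tree counts, synthetic regression functions, and the choice to draw a separate target sample for the weights are all arbitrary choices for the demo.

```python
import math
import random

def _centered_distances(v):
    """Double-centered pairwise distance matrix of a 1-D sample."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row = [sum(r) / n for r in d]  # row means (equal to column means by symmetry)
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]

def distance_covariance(x, y):
    """Sample distance covariance (V-statistic form) of two 1-D samples."""
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    v2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)
    return math.sqrt(max(v2, 0.0))  # guard against tiny negative rounding error

def split_probabilities(X, residuals):
    """Splitting probabilities proportional to dCov(feature_j, residuals)."""
    p = len(X[0])
    scores = [distance_covariance([r[j] for r in X], residuals) for j in range(p)]
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else [1.0 / p] * p

def fit_centered_tree(X, y, probs, depth, rng, lo=None, hi=None):
    """Centered tree on [0,1]^p: split each cell at its midpoint along a
    feature drawn with probability probs[j]; leaves store the mean response."""
    p = len(probs)
    lo = [0.0] * p if lo is None else lo
    hi = [1.0] * p if hi is None else hi
    if depth == 0:
        return sum(y) / len(y) if y else 0.0  # assumption: empty cell predicts 0
    j = rng.choices(range(p), weights=probs)[0]
    mid = (lo[j] + hi[j]) / 2
    left = [i for i in range(len(y)) if X[i][j] < mid]
    right = [i for i in range(len(y)) if X[i][j] >= mid]
    hi_left = hi[:j] + [mid] + hi[j + 1:]
    lo_right = lo[:j] + [mid] + lo[j + 1:]
    left_node = fit_centered_tree([X[i] for i in left], [y[i] for i in left],
                                  probs, depth - 1, rng, lo, hi_left)
    right_node = fit_centered_tree([X[i] for i in right], [y[i] for i in right],
                                   probs, depth - 1, rng, lo_right, hi)
    return (j, mid, left_node, right_node)

def predict_tree(node, x):
    """Walk from the root to a leaf; internal nodes are tuples, leaves are floats."""
    while isinstance(node, tuple):
        j, mid, left, right = node
        node = left if x[j] < mid else right
    return node

def fit_forest(X, y, probs, depth=4, n_trees=50, seed=0):
    rng = random.Random(seed)
    return [fit_centered_tree(X, y, probs, depth, rng) for _ in range(n_trees)]

def predict_forest(forest, x):
    return sum(predict_tree(t, x) for t in forest) / len(forest)

# Synthetic demo (assumed functions): source and target differ in feature 2 only.
p = 5
data_rng = random.Random(1)
def f_source(x): return x[0] + 0.5 * x[1]
def f_target(x): return f_source(x) + 4.0 * x[2]

def sample(n, f):
    X = [[data_rng.random() for _ in range(p)] for _ in range(n)]
    return X, [f(x) + data_rng.gauss(0, 0.1) for x in X]

Xs, ys = sample(400, f_source)   # large source sample
Xw, yw = sample(80, f_target)    # independent target sample, used only for weights
Xt, yt = sample(80, f_target)    # target sample used to fit the residual forest

# Step 1: source-trained forest with uniform splitting probabilities.
source_forest = fit_forest(Xs, ys, [1.0 / p] * p)
# Step 2: dCov-based weights from residuals on the independent sample,
# then a second forest fitted to the target residuals with those weights.
resid_w = [y - predict_forest(source_forest, x) for x, y in zip(Xw, yw)]
probs = split_probabilities(Xw, resid_w)
resid_t = [y - predict_forest(source_forest, x) for x, y in zip(Xt, yt)]
residual_forest = fit_forest(Xt, resid_t, probs, depth=3)

def predict(x):
    return predict_forest(source_forest, x) + predict_forest(residual_forest, x)

X_new, y_new = sample(200, f_target)
mse_source_only = sum((y - predict_forest(source_forest, x)) ** 2
                      for x, y in zip(X_new, y_new)) / len(y_new)
mse_transfer = sum((y - predict(x)) ** 2 for x, y in zip(X_new, y_new)) / len(y_new)
print("split probabilities:", [round(q, 3) for q in probs])
print("MSE source-only:", round(mse_source_only, 3), "| MSE transfer:", round(mse_transfer, 3))
```

In this toy setup the splitting probabilities should concentrate on the one feature where the source and target functions differ, so the residual forest spends its limited depth where the correction is needed. A practical version would more likely use a vectorized distance covariance and a standard random forest library. The abstract notes that the results also hold numerically for the standard random forest.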
