A Nonparametric Test of Dependence Based on Ensemble of Decision Trees
The proposed coefficient gives statisticians and data scientists a robust tool for assessing dependence, although the contribution is incremental in that it builds on existing permutation and ensemble methods.
The paper tackles the problem of measuring statistical dependence between random variables by proposing a non-parametric coefficient based on an ensemble of decision trees, which shows high power in detecting complex relationships from noisy data.
In this paper, a robust non-parametric measure of statistical dependence, or correlation, between two random variables is presented. The proposed coefficient is a permutation-like statistic that quantifies how well the observed sample S_n = {(X_i, Y_i), i = 1, ..., n} can be discriminated from the permuted sample Ŝ_nn = {(X_i, Y_j), i, j = 1, ..., n}, in which the two variables are independent. The extent of discriminability is determined from the predictions for the interchangeable left-out sample, obtained by training an aggregate of decision trees to discriminate between the two samples without materializing the permuted sample. The proposed coefficient is computationally efficient, interpretable, invariant to monotonic transformations, and has a well-approximated distribution under independence. Empirical results show that the proposed method has high power for detecting complex relationships from noisy data.
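The idea of scoring dependence by how well a tree ensemble separates observed pairs from permuted pairs can be illustrated with a minimal sketch. It is not the paper's algorithm, and it rests on several assumptions made here for illustration: it materializes one random permutation of Y rather than avoiding the permuted sample, it uses bagged trees with random splits on rank-transformed data, and it reports the gap between mean out-of-bag predictions for the two samples. The names ranks, grow, predict, and dependence_score, and all parameter defaults, are hypothetical.

```python
import random

def ranks(values):
    """Map values to ranks in [0, 1); assumed here to give monotone invariance."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank / len(values)
    return r

def grow(points, labels, idx, depth, rng):
    """Grow one randomized tree. A leaf is the share of observed-sample points."""
    share = sum(labels[i] for i in idx) / len(idx)
    if depth == 0 or len(idx) < 10 or share in (0.0, 1.0):
        return share
    feat = rng.randrange(2)  # split on X (0) or Y (1), chosen at random
    vals = [points[i][feat] for i in idx]
    cut = rng.uniform(min(vals), max(vals))  # random threshold inside the node
    left = [i for i in idx if points[i][feat] < cut]
    right = [i for i in idx if points[i][feat] >= cut]
    if not left or not right:
        return share
    return (feat, cut,
            grow(points, labels, left, depth - 1, rng),
            grow(points, labels, right, depth - 1, rng))

def predict(tree, p):
    """Walk down the tree and return the leaf share for point p."""
    while isinstance(tree, tuple):
        feat, cut, lo, hi = tree
        tree = lo if p[feat] < cut else hi
    return tree

def dependence_score(x, y, n_trees=200, depth=6, seed=0):
    """Gap between mean out-of-bag scores of observed and permuted pairs.

    Near 0 when the two samples are indistinguishable; larger values
    suggest dependence. Illustrative only, not the paper's coefficient.
    """
    rng = random.Random(seed)
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    perm = list(range(n))
    rng.shuffle(perm)
    # Assumption: one materialized permutation stands in for the full
    # n*n permuted sample, which the paper avoids constructing.
    points = list(zip(rx, ry)) + [(rx[i], ry[perm[i]]) for i in range(n)]
    labels = [1] * n + [0] * n
    total = [0.0] * (2 * n)
    count = [0] * (2 * n)
    for _ in range(n_trees):
        boot = [rng.randrange(2 * n) for _ in range(2 * n)]  # bootstrap sample
        in_bag = set(boot)
        tree = grow(points, labels, boot, depth, rng)
        for i in range(2 * n):
            if i not in in_bag:  # score only points this tree never saw
                total[i] += predict(tree, points[i])
                count[i] += 1
    probs = [total[i] / count[i] if count[i] else 0.5 for i in range(2 * n)]
    return sum(probs[:n]) / n - sum(probs[n:]) / n
```

Calling dependence_score(x, y) on two equal-length lists of numbers returns a value in [-1, 1]. Because only ranks enter the computation, applying a strictly increasing transformation to either variable leaves the result unchanged for a fixed seed.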