ML · LG · Jul 13, 2021

Oversampling Divide-and-conquer for Response-skewed Kernel Ridge Regression

arXiv:2107.05834v2

Originality: Incremental advance

AI Analysis

This work addresses response skewness in large-scale kernel ridge regression, a practical concern for practitioners. The advance is incremental because it adapts existing oversampling techniques to a specific setting.

The paper tackles the poor performance of divide-and-conquer kernel ridge regression (dacKRR) when the response variable is highly skewed. It combines a response-adaptive partition strategy with oversampling so that informative observations are allocated to multiple nodes. Under mild conditions, the resulting estimate has a smaller risk than the classical dacKRR estimate.

The divide-and-conquer method has been widely used for computing large-scale kernel ridge regression estimates. Unfortunately, when the response variable is highly skewed, the divide-and-conquer kernel ridge regression (dacKRR) may overlook the underrepresented region and produce unacceptable results. We combine a novel response-adaptive partition strategy with the oversampling technique synergistically to overcome this limitation. Through the proposed algorithm, we allocate some carefully identified informative observations to multiple nodes (local processors). Although the oversampling technique has been widely used for addressing discrete label skewness, extending it to the dacKRR setting is nontrivial. We provide both theoretical and practical guidance on how to effectively oversample the observations under the dacKRR setting. Furthermore, we show that the proposed estimate has a smaller risk than the classical dacKRR estimate under mild conditions. Our theoretical findings are supported by both simulated and real-data analyses.
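The idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' algorithm: the abstract does not specify how informative observations are identified, so the hypothetical helper `pick_informative` simply flags responses far from the median, which is an assumption made here for illustration. The function names, the RBF kernel, and the parameters `lam` and `gamma` are likewise illustrative choices. Each node fits a kernel ridge regression on its own random partition plus the shared (oversampled) points, and the node predictions are averaged.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise squared distances, clipped at 0 to guard against round-off.
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_krr(X, y, lam=1e-2, gamma=1.0):
    # Standard KRR solution: alpha = (K + n * lam * I)^{-1} y.
    n = len(y)
    K = rbf_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return X, alpha

def predict_krr(model, X_new, gamma=1.0):
    X_train, alpha = model
    return rbf_kernel(X_new, X_train, gamma) @ alpha

def pick_informative(y, q=0.95):
    # ASSUMPTION (not from the paper): treat responses far from the
    # median as the "informative" observations to be shared across nodes.
    dev = np.abs(y - np.median(y))
    return np.flatnonzero(dev >= np.quantile(dev, q))

def dac_krr(X, y, n_nodes, lam, gamma, shared_idx=None, rng=None):
    # shared_idx=None gives the classical dacKRR (disjoint random partition).
    rng = np.random.default_rng(rng)
    if shared_idx is None:
        shared_idx = np.array([], dtype=int)
    shared_idx = np.asarray(shared_idx, dtype=int)
    # Remaining points are split at random; shared points go to every node.
    rest = rng.permutation(np.setdiff1d(np.arange(len(y)), shared_idx))
    models = []
    for part in np.array_split(rest, n_nodes):
        idx = np.concatenate([part, shared_idx])
        models.append(fit_krr(X[idx], y[idx], lam, gamma))
    return models

def predict_dac(models, X_new, gamma=1.0):
    # Average the local estimates, as in divide-and-conquer KRR.
    return np.mean([predict_krr(m, X_new, gamma) for m in models], axis=0)
```

Calling `dac_krr` with `shared_idx=pick_informative(y)` gives the oversampled variant, while leaving `shared_idx` unset gives the classical baseline, so the two can be compared on the same data. Whether the oversampled variant actually has lower risk depends on the conditions analysed in the paper, which this sketch does not reproduce.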
