LG, Sep 21, 2015

The Utility of Clustering in Prediction Tasks

Shubhendu Trivedi, Zachary A. Pardos, Neil T. Heffernan

arXiv:1509.06163v14.753 citations

Originality: Incremental advance

AI Analysis

This work addresses improving prediction accuracy for data analysis tasks, but it appears incremental as it builds on prior hints about clustering's utility.

The paper investigates using clustering to reduce error in prediction tasks by applying k-means at different scales and combining predictions via a naive ensemble, finding improved accuracy in most datasets and outperforming Random Forests in some cases.

We explore the utility of clustering in reducing error in various prediction tasks. Previous work has hinted that clustering algorithms can improve prediction accuracy when used to pre-process the data. In this work we investigate more deeply the direct utility of using clustering to improve prediction accuracy and provide explanations for why this may be so. We look at a number of datasets, run k-means at different scales, and for each scale we train predictors. This produces k sets of predictions. These predictions are then combined by a naïve ensemble. We observed that this use of a predictor in conjunction with clustering improved the prediction accuracy in most datasets. We believe this indicates the predictive utility of exploiting structure in the data and the data compression handed over by clustering. We also found that this method improves upon the prediction of even a Random Forests predictor, which suggests that it provides a novel and useful source of variance in the prediction process.
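The procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: features are one-dimensional, the per-cluster predictor is an ordinary least-squares line, the "naïve ensemble" is taken to be a plain average over scales, and all function names (`kmeans_1d`, `fit_scale`, `fit_ensemble`, and so on) are hypothetical.

```python
def nearest(cents, x):
    """Index of the centroid closest to x."""
    return min(range(len(cents)), key=lambda j: abs(x - cents[j]))


def kmeans_1d(xs, k, iters=50):
    """Plain Lloyd's k-means on scalars (assumed 1-D features)."""
    srt = sorted(xs)
    # Deterministic init: place the k centroids at evenly spaced quantiles.
    cents = [srt[(2 * i + 1) * len(srt) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[nearest(cents, x)].append(x)
        # An empty cluster keeps its previous centroid.
        new = [sum(g) / len(g) if g else c for g, c in zip(groups, cents)]
        if new == cents:
            break
        cents = new
    return cents


def fit_line(xs, ys):
    """Closed-form 1-D least squares; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # degenerate cluster: predict its mean
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def fit_scale(xs, ys, k):
    """Cluster into k groups and train one predictor per cluster."""
    cents = kmeans_1d(xs, k)
    members = [([], []) for _ in range(k)]
    for x, y in zip(xs, ys):
        cx, cy = members[nearest(cents, x)]
        cx.append(x)
        cy.append(y)
    global_fit = fit_line(xs, ys)  # fallback for an empty cluster
    models = [fit_line(cx, cy) if cx else global_fit for cx, cy in members]
    return cents, models


def predict_scale(scale, x):
    """Route x to its nearest cluster and use that cluster's predictor."""
    cents, models = scale
    slope, intercept = models[nearest(cents, x)]
    return slope * x + intercept


def fit_ensemble(xs, ys, max_k):
    """One clustered model per scale k = 1..max_k (k = 1 is a global fit)."""
    return [fit_scale(xs, ys, k) for k in range(1, max_k + 1)]


def predict_ensemble(scales, x):
    """Naive ensemble (assumed here to be an unweighted mean over scales)."""
    preds = [predict_scale(s, x) for s in scales]
    return sum(preds) / len(preds)


if __name__ == "__main__":
    # Toy data with structure a single line cannot capture: y = |x|.
    xs = [i / 100 - 1 for i in range(201)]
    ys = [abs(x) for x in xs]
    scales = fit_ensemble(xs, ys, max_k=4)
    print(predict_scale(scales[0], 0.8), predict_ensemble(scales, 0.8))
```

On this toy target, the per-cluster lines at larger k can follow the two arms of |x| that a single global line misses, which is one way to read the abstract's remark about exploiting structure in the data. The example says nothing about the datasets or predictors used in the paper.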
