Constructing Decision Trees from Data Streams
This work addresses the challenge of scalable decision tree learning for streaming data, which is incremental as it builds on prior seminal work in the field.
The paper tackles the problem of constructing decision trees from data streams without the i.i.d. assumption by proposing efficient streaming algorithms that compute optimal splits for regression and classification, requiring sublinear space and a small number of passes.
In this work, we present data stream algorithms to compute optimal splits for decision tree learning. In particular, given a data stream of observations \(x_i\) and their corresponding labels \(y_i\), without the i.i.d. assumption, the objective is to identify the optimal split \(j\) that partitions the data into two sets, minimizing the mean squared error (for regression) or the misclassification rate and Gini impurity (for classification). We propose several efficient streaming algorithms that require sublinear space and use a small number of passes to solve these problems. These algorithms can also be extended to the MapReduce model. Our results, while not directly comparable, complements the seminal work of Domingos-Hulten (KDD 2000) and Hulten-Spencer-Domingos (KDD 2001).