VHT: Vertical Hoeffding Tree
This work addresses the problem of handling IoT big data efficiently in data stream mining applications. It is an incremental advance that combines existing distributed and streaming approaches.
The paper tackles the challenge of scaling decision tree learning to large, high-speed IoT data streams. It introduces VHT, which the authors present as the first distributed streaming algorithm for learning decision trees, and evaluates its accuracy and throughput against non-distributed decision trees.
IoT Big Data requires new machine learning methods that can scale to large volumes of data arriving at high speed. Decision trees are popular machine learning models because they are very effective, yet easy to interpret and visualize. The literature offers distributed algorithms for learning decision trees, and also streaming algorithms, but no algorithm that combines both features. In this paper we present the Vertical Hoeffding Tree (VHT), the first distributed streaming algorithm for learning decision trees. It features a novel way of distributing decision trees via vertical parallelism. The algorithm is implemented on top of Apache SAMOA, a platform for mining distributed data streams, and is thus able to run on real-world clusters. We run several experiments to study the accuracy and throughput of our new VHT algorithm, as well as its ability to scale while keeping its superior performance with respect to non-distributed decision trees.
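The abstract names vertical parallelism without describing it, so the following is only a minimal single-process sketch of the general idea, not the authors' implementation and not the Apache SAMOA API. It assumes that attributes are partitioned across workers, that each worker keeps counts only for its own attributes, and that a coordinator applies the standard Hoeffding bound to the two best information gains before splitting. The names `LocalStats`, `vertical_partition`, and `try_split` are hypothetical, as is the choice of categorical attributes and information gain as the split criterion.

```python
import math
from collections import defaultdict


def hoeffding_bound(value_range, delta, n):
    """Standard Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def entropy(counts):
    """Shannon entropy (bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def vertical_partition(attributes, n_workers):
    """Assign attributes to workers round-robin (hypothetical scheme)."""
    return [attributes[i::n_workers] for i in range(n_workers)]


class LocalStats:
    """Hypothetical worker that holds statistics for a subset of attributes."""

    def __init__(self, attributes):
        self.attributes = attributes
        # counts[attribute][value][label] -> number of occurrences
        self.counts = {a: defaultdict(lambda: defaultdict(int)) for a in attributes}

    def update(self, instance, label):
        """Record only the attribute values this worker is responsible for."""
        for a in self.attributes:
            self.counts[a][instance[a]][label] += 1

    def info_gains(self, class_counts):
        """Information gain of each local attribute, given global class counts."""
        n = sum(class_counts.values())
        base = entropy(list(class_counts.values()))
        gains = {}
        for a in self.attributes:
            remainder = sum(
                sum(by_label.values()) / n * entropy(list(by_label.values()))
                for by_label in self.counts[a].values()
            )
            gains[a] = base - remainder
        return gains


def try_split(workers, class_counts, delta=1e-7):
    """Coordinator step: gather local gains, split only if the bound allows it.

    Assumes at least two attributes in total across all workers.
    """
    gains = {}
    for w in workers:
        gains.update(w.info_gains(class_counts))
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    (best, g1), (_, g2) = ranked[0], ranked[1]
    n = sum(class_counts.values())
    value_range = math.log2(len(class_counts))  # range of information gain
    eps = hoeffding_bound(value_range, delta, n)
    return best if g1 - g2 > eps else None
```

In this sketch a coordinator would call `update` on every worker for each arriving instance, keep the global `class_counts` itself, and call `try_split` periodically. In a real deployment the workers would run on separate machines and exchange messages, which this sketch does not model.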