Exact Distributed Training: Random Forest with Billions of Examples
This enables scalable exact training for decision forests on billion-scale datasets, addressing a bottleneck in big data machine learning.
The paper tackles the problem of training Random Forest models on extremely large datasets by introducing an exact distributed algorithm that avoids approximating best split searches, achieving training on up to 18 billion examples with a specific case of 17.3B examples trained in 22 hours.
We introduce an exact distributed algorithm to train Random Forest models as well as other decision forest models without relying on approximating best split search. We explain the proposed algorithm and compare it to related approaches for various complexity measures (time, ram, disk, and network complexity analysis). We report its running performances on artificial and real-world datasets of up to 18 billions examples. This figure is several orders of magnitude larger than datasets tackled in the existing literature. Finally, we empirically show that Random Forest benefits from being trained on more data, even in the case of already gigantic datasets. Given a dataset with 17.3B examples with 82 features (3 numerical, other categorical with high arity), our implementation trains a tree in 22h.