Random Forests (RF) of tree classifiers are a state-of-the-art method for classification. RF show limited hyperparameter sensitivity and high numerical robustness, natively handle both numerical and categorical features, and are effective on many real-world problems compared with other state-of-the-art techniques. In this work we show how to crack RF so that they can be trained on arbitrarily large datasets. In particular, we extend ReForeSt, an Apache Spark-based RF implementation. The new version of ReForeSt supports two methodologies for distributing the data and the computation across the available machines and automatically chooses the one that delivers the result in less time. The new ReForeSt also supports Random Rotations, a fairly recent randomization technique that can boost the accuracy of the original RF. We perform an extensive experimental comparison between ReForeSt and MLlib on the Google Cloud Platform. We test the performance and the scalability of ReForeSt and MLlib on several real-world datasets. Results confirm that ReForeSt outperforms MLlib in memory efficiency, computational efficiency, and classification performance. ReForeSt is publicly available via GitHub.
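As a rough illustration of the Random Rotations idea mentioned above, the following minimal sketch draws a random rotation matrix and applies it to a feature vector. It is not ReForeSt's code: the function names (`random_rotation`, `determinant`, `rotate`) are hypothetical, and building the matrix by Gram-Schmidt orthonormalization of Gaussian vectors is an assumption chosen to keep the example dependency-free.

```python
import math
import random


def determinant(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    det = 1.0
    for i in range(n):
        # Pick the row with the largest pivot to limit round-off error.
        p = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[p][i]) < 1e-12:
            return 0.0
        if p != i:
            a[i], a[p] = a[p], a[i]
            det = -det  # a row swap flips the sign
        det *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return det


def random_rotation(dim, seed=None):
    """Return a dim x dim rotation matrix (orthonormal rows, det = +1)."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Gram-Schmidt step: remove the components along accepted rows.
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # discard numerically degenerate draws
            rows.append([a / norm for a in v])
    # An orthogonal matrix with det = -1 is a reflection; flip one row.
    if determinant(rows) < 0:
        rows[0] = [-a for a in rows[0]]
    return rows


def rotate(sample, rotation):
    """Apply the rotation to one feature vector (matrix-vector product)."""
    return [sum(r * x for r, x in zip(row, sample)) for row in rotation]
```

In a rotation-based ensemble, each tree would typically get its own matrix (for example, a different `seed` per tree), be trained on the rotated samples, and have the same matrix applied to inputs at prediction time. Features are assumed to be numerical and on comparable scales before rotation.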
|Title:||Crack random forest for arbitrary large datasets|
|Publication date:||2018|
|Appears in types:||04.01 - Conference proceedings contribution|