Crack random forest for arbitrary large datasets

Lulli, Alessandro; Oneto, Luca; Anguita, Davide
2018

Abstract

Random Forests (RF) of tree classifiers are a state-of-the-art method for classification. RF show limited hyperparameter sensitivity, have high numerical robustness, natively handle both numerical and categorical features, and are quite effective in many real-world problems compared with other state-of-the-art techniques. In this work we show how to crack RF so that they can be trained on arbitrarily large datasets. In particular, we extend ReForeSt, an Apache Spark-based RF implementation. The new version of ReForeSt supports two methodologies for distributing the data and the computation across the available machines, and automatically chooses the one that provides the result in less time. The new ReForeSt also supports Random Rotations, a fairly recent randomization technique that can boost the accuracy of the original RF. We perform an extensive experimental comparison between ReForeSt and MLlib on the Google Cloud Platform, testing the performance and scalability of both on several real-world datasets. Results confirm that ReForeSt outperforms MLlib in memory and computational efficiency as well as in classification performance. ReForeSt is publicly available on GitHub.
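
The abstract names Random Rotations without describing them. As a rough illustration of the general idea only (not ReForeSt's Spark implementation), the sketch below draws a random rotation matrix and applies it to numerical feature vectors, as would be done before training one tree of the ensemble. The function names `determinant`, `random_rotation`, and `rotate` are hypothetical, and the construction via Gram-Schmidt on Gaussian vectors is an assumption chosen for brevity.

```python
import random


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def random_rotation(dim, seed=None):
    """Return a random dim x dim rotation matrix as a list of rows."""
    rng = random.Random(seed)
    basis = []
    while len(basis) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Gram-Schmidt: remove the components along the vectors chosen so far.
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = sum(x * x for x in v) ** 0.5
        if norm > 1e-12:  # skip the (very unlikely) degenerate draw
            basis.append([x / norm for x in v])
    # An orthonormal basis may describe a reflection (determinant -1);
    # negating one row turns it into a proper rotation (determinant +1).
    if determinant(basis) < 0:
        basis[0] = [-x for x in basis[0]]
    return basis


def rotate(points, rotation):
    """Apply the rotation to every numerical feature vector in points."""
    return [[sum(r * x for r, x in zip(row, p)) for row in rotation]
            for p in points]


if __name__ == "__main__":
    # Hypothetical usage: one rotation per tree, applied to the training rows.
    rows = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]]
    print(rotate(rows, random_rotation(3, seed=42)))
```

In this sketch each tree would receive its own rotation, so axis-aligned splits in the rotated space correspond to oblique splits in the original one. The same matrix would have to be applied to any sample passed to that tree at prediction time.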

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: http://hdl.handle.net/11567/914833