Crack random forest for arbitrary large datasets


Lulli, Alessandro; Oneto, Luca; Anguita, Davide

2018-01-01

Abstract

Random Forests (RF) of tree classifiers are a state-of-the-art method for classification. RF show limited hyperparameter sensitivity, have high numerical robustness, natively handle both numerical and categorical features, and compare favorably with other state-of-the-art techniques on many real-world problems. In this work we show how to crack RF so that they can be trained on arbitrarily large datasets. In particular, we extend ReForeSt, an Apache Spark-based RF implementation. The new version of ReForeSt supports two methodologies for distributing the data and the computation across the available machines, and it automatically chooses the one that delivers the result in less time. The new ReForeSt also supports Random Rotations, a fairly recent randomization technique which can boost the accuracy of the original RF. We carry out an extensive experimental comparison of ReForeSt and MLlib on the Google Cloud Platform, testing the performance and the scalability of both on several real-world datasets. Results confirm that ReForeSt outperforms MLlib in memory efficiency, computational efficiency, and classification performance. ReForeSt is publicly available via GitHub.
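The Random Rotations idea mentioned in the abstract can be illustrated with a small sketch. It is not taken from ReForeSt: the function names (`random_rotation`, `determinant`, `rotate`) are hypothetical, and the construction (Gram-Schmidt on Gaussian vectors) is only one common way to draw a random rotation matrix. As the technique is generally described, each tree of the ensemble would be trained on the features after applying its own rotation, and the same rotation would be applied to a sample before that tree predicts.

```python
import math
import random


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for c in range(n):
        # Pick the row with the largest pivot to keep the elimination stable.
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return det


def random_rotation(dim, seed=None):
    """Return a dim x dim orthogonal matrix with determinant +1 (list of rows)."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Gram-Schmidt step: remove the components along the rows accepted so far.
        for r in rows:
            dot = sum(x * y for x, y in zip(v, r))
            v = [x - dot * y for x, y in zip(v, r)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-10:  # discard (numerically) dependent draws
            rows.append([x / norm for x in v])
    # A negative determinant means a reflection; flipping one row fixes that.
    if determinant(rows) < 0:
        rows[0] = [-x for x in rows[0]]
    return rows


def rotate(points, rotation):
    """Apply the rotation to every feature vector in points."""
    return [[sum(r * x for r, x in zip(row, p)) for row in rotation]
            for p in points]


if __name__ == "__main__":
    rotation = random_rotation(3, seed=42)
    samples = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]]
    # In a rotation ensemble, each tree would see its own rotated copy of the data.
    print(rotate(samples, rotation))
```

Because a rotation preserves distances and angles, the rotated data carries the same information as the original; only the orientation of the axis-aligned splits available to each tree changes, which is where the extra diversity among trees comes from.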


Year: 2018
ISBN: 9781538627143
Appears in types: 04.01 - Conference proceedings contribution

Files in this item:

File	Access	Type	Size	Format
C052.pdf	Closed access	Publisher's version	362.35 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11567/914833
