Big data analytics in the cloud: Spark on Hadoop vs MPI/OpenMP on Beowulf


REYES ORTIZ, JORGE LUIS; ONETO, LUCA; ANGUITA, DAVIDE

2015-01-01

Abstract

One of the biggest challenges in the current big data landscape is the inability to process vast amounts of information in a reasonable time. In this work, we explore and compare two distributed computing frameworks implemented on commodity cluster architectures: MPI/OpenMP on Beowulf, which is high-performance oriented and exploits multi-machine/multicore infrastructures, and Apache Spark on Hadoop, which targets iterative algorithms through in-memory computing. We use the Google Cloud Platform service to create virtual machine clusters, run the frameworks, and evaluate two supervised machine learning algorithms: k-nearest neighbors (KNN) and Pegasos SVM. Results from experiments with a particle physics data set show that MPI/OpenMP outperforms Spark by more than one order of magnitude in processing speed and provides more consistent performance. However, Spark offers a better data management infrastructure and the ability to deal with other aspects such as node failure and data replication.
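
For readers unfamiliar with Pegasos SVM, the following is a minimal single-machine sketch of its stochastic sub-gradient update for a linear SVM. It is not the authors' MPI/OpenMP or Spark implementation. The function names (`pegasos_train`, `predict`), the regularization value `lam`, the iteration count, the omission of a bias term, and the toy data are all assumptions made for illustration; the toy points have nothing to do with the particle physics data set used in the paper.

```python
import random


def pegasos_train(samples, labels, lam=0.01, n_iters=2000, seed=0):
    """Train a linear SVM (no bias term) with Pegasos-style updates.

    samples: list of feature vectors (lists of floats)
    labels:  list of +1 / -1 class labels
    lam:     regularization strength (illustrative value, an assumption)
    """
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    for t in range(1, n_iters + 1):
        # Pick one training example uniformly at random.
        i = rng.randrange(len(samples))
        x, y = samples[i], labels[i]
        # Step size decays as 1 / (lambda * t).
        eta = 1.0 / (lam * t)
        # Margin is evaluated with the weights from before this step.
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        # Regularization shrinks the weights at every step.
        shrink = 1.0 - eta * lam
        w = [shrink * wj for wj in w]
        # The hinge loss contributes only when the margin is violated.
        if margin < 1.0:
            w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w


def predict(w, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0.0 else -1


# Hypothetical, linearly separable toy data.
X = [[2.0, 1.0], [1.5, 2.0], [3.0, 2.5],
     [-2.0, -1.0], [-1.5, -2.5], [-3.0, -2.0]]
y = [1, 1, 1, -1, -1, -1]

w = pegasos_train(X, y)
predictions = [predict(w, x) for x in X]
```

Each iteration touches a single example, so the loop is inherently sequential. How the work is split across machines or cores in the two frameworks compared here is not described in this record and is not shown in the sketch.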


Year

2015

Appears in categories:

04.01 - Conference proceedings contribution

Files in this record:

File	Size	Format
C035.pdf (closed access; type: publisher's version)	217.27 kB	Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/11567/845897
