Performance Comparison of Apache Spark and Hadoop for Machine Learning based iterative GBTR on HIGGS and Covid-19 Datasets

Piyush Sewal; Hari Singh

doi:10.12694/scpe.v25i3.2687

Authors

Piyush Sewal CSE & IT Department, Jaypee University of Information Technology, Solan, HP, India
Hari Singh CSE & IT Department, Jaypee University of Information Technology, Solan, HP, India

Keywords:

Hadoop, Apache Spark

Abstract

The efficacy of distributed computing frameworks such as Apache Spark and Hadoop MapReduce varies across applications and algorithms, depending on the evaluation metrics and critical parameters considered. This paper reviews the existing research that compares the two frameworks with respect to these evaluation metrics and parameters. It then conducts empirical investigations to validate the performance of the frameworks on an iterative Gradient Boosting Tree Regression (GBTR) algorithm. The comparative analyses in previous studies cover a range of iterative machine learning regression and classification techniques, batch processing, SQL, and graph processing algorithms. Numerous investigations have also explored algorithms and benchmarks including logistic regression, PageRank, K-Means, KNN, and the HiBench suite. This paper compares the two distributed computing platforms on iterative GBTR for a classification task on the HIGGS dataset from the physics domain and for a regression task on the Covid-19 dataset from the healthcare domain. The empirical findings confirm that Apache Spark executes iterative tasks faster when the available physical memory significantly exceeds the dataset size. Conversely, Hadoop outperforms Spark when the dataset is large or physical memory is constrained.
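
To illustrate why GBTR counts as an iterative workload, the following is a minimal single-machine sketch in plain Python. It is not the authors' implementation and uses neither Spark nor Hadoop. The function names (`fit_stump`, `fit_gbt`, `predict`, `mse`), the one-feature data, the squared-error loss, and the use of depth-1 trees (stumps) are all assumptions made for illustration. The point it shows is that every boosting round makes a full pass over the training data to recompute residuals, which is the access pattern where keeping the dataset in memory can matter.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree: return (threshold, left_value, right_value)."""
    best = None
    # Every distinct x except the largest is a candidate split point.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = mean(left), mean(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]


def fit_gbt(xs, ys, rounds=50, rate=0.1):
    """Gradient boosting for squared-error loss (hypothetical helper)."""
    base = mean(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        # Full pass over the data in every round: this is the iterative part.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + rate * (lv if x <= t else rv) for x, p in zip(xs, preds)]
    return base, rate, stumps


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, rate, stumps = model
    return base + rate * sum(lv if x <= t else rv for t, lv, rv in stumps)


def mse(model, xs, ys):
    """Mean squared error of the model on (xs, ys)."""
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


if __name__ == "__main__":
    xs = list(range(10))
    ys = [float(x * x) for x in xs]
    for rounds in (1, 10, 50):
        print(rounds, mse(fit_gbt(xs, ys, rounds=rounds), xs, ys))
```

In a distributed setting, each of these rounds becomes one or more jobs over the full dataset. Spark can cache the data in memory between rounds, whereas Hadoop MapReduce writes intermediate results to disk between jobs, which is consistent with the memory-dependent behaviour the abstract reports.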

Author Biography

Hari Singh, CSE & IT Department, Jaypee University of Information Technology, Solan, HP, India

Faculty, Jaypee University of Information Technology, Solan, Himachal Pradesh, India
