Cloudflow - enabling faster biomedical pipelines with MapReduce and Spark

Lukas Forer; Enis Afgan; Hansi Weissensteiner; Davor Davidovic; Guenther Specht; Florian Kronenberg; Sebastian Schoenherr

doi:10.12694/scpe.v17i2.1159

Abstract

For many years, Apache Hadoop has been used as a synonym for processing data in the MapReduce fashion. However, due to the complexity of developing MapReduce applications, adoption of this paradigm in genetics has been limited. To alleviate some of these issues, we have previously developed Cloudflow - a high-level pipeline framework that allows users to create sophisticated biomedical pipelines from predefined code blocks, while the framework automatically translates them into the MapReduce execution model. With the introduction of the YARN resource management layer, new computational processing models such as Apache Spark are now pluggable into the Hadoop ecosystem. In this paper we describe the extension of Cloudflow to support Apache Spark without any adaptations to already implemented pipelines. The described performance evaluation demonstrates that Spark can bring an additional boost for analysing next-generation sequencing (NGS) data to the field of genetics. The Cloudflow framework is open source and freely available at https://github.com/genepi/cloudflow.
