Sequence Similarity Parallelization over Heterogeneous Computer Clusters Using Data Parallel Programming Model

Majid Hajibaba; Saed Gorgin; Mohsen Sharifi

doi:10.12694/scpe.v18i1.1233

Published: Mar 26, 2017

Abstract

Sequence similarity, a special case of data-intensive applications, is among the applications that most need parallelization. Clustered commodity computers, as a cost-effective platform for distributed and parallel processing, can be leveraged to parallelize sequence similarity. However, manually designing and developing parallel programs on commodity computers is a time-consuming, complex, and error-prone process. In this paper, we present a sequence similarity parallelization technique that uses Apache Storm, a stream processing framework with a data parallel programming model. Storm automatically parallelizes computations via a user-defined topology that is represented as a directed acyclic graph. The proposed technique collects streams of data from a disk and sends them, sequence by sequence, to the clustered computers for parallel processing. We also present a dispatching policy that balances the cluster workload and manages cluster heterogeneity to achieve more than 99 percent parallelism. An alignment-free method, known as n-gram modeling, is used to calculate similarities between the sequences. To show the cost-performance superiority of our method on clustered commodity computers over serial processing on powerful computers, we evaluate the performance of sequence similarity, an interesting large-scale bioinformatics application, on the UniProtKB/Swiss-Prot dataset.
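As a rough illustration of the alignment-free n-gram idea mentioned in the abstract, the following Python sketch counts overlapping n-grams in two sequences and compares the count vectors. The abstract does not state which similarity measure, n-gram length, or normalization the paper uses, so the cosine measure, the default n = 2, and the function names `ngram_counts` and `ngram_similarity` are all assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def ngram_counts(seq, n=2):
    """Count overlapping n-grams in a sequence (n=2 is an assumed default)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def ngram_similarity(a, b, n=2):
    """Cosine similarity between n-gram count vectors.

    Assumption: cosine is used here for illustration; the paper's
    actual similarity measure is not given in the abstract.
    """
    ca, cb = ngram_counts(a, n), ngram_counts(b, n)
    # Only n-grams present in both sequences contribute to the dot product.
    dot = sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    # Sequences shorter than n have no n-grams; treat them as dissimilar.
    return dot / norm if norm else 0.0
```

Because each pairwise comparison depends only on the two sequences involved, a function of this kind can be evaluated independently on different machines, which is the property a data parallel model relies on.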

Issue

Vol. 18 No. 1 (2017)

Section

Proposal for Special Issue Papers
