# Design of Efficient and Scalable Parallel Algorithms

## Main Article Content

## Abstract

In this editorial we discuss the important current trend of using realistic models to design efficient and scalable parallel algorithms. Previous issues of *Parallel and Distributed Computing Practices* (*PDCP*) have emphasized the ubiquitous presence of parallel computing. In the first issue of*PDCP*, it is mentioned that parallel computing is invading the world of everyday computing through multiprocessor desktop systems. A recent issue of *PDCP* addresses the new-coming and increasingly more popular cluster computing, stating that over the decade, clusters will span the entire range of high-performance computing platforms. Indeed parallel computing has become the mainstream of high-performance computing. If we examine the list of the TOP500 Supercomputer Sites, which contains the five hundred most powerful systems installed, we can verify that all the 500 of the TOP500 list are parallel computers of some kind, ranging from 6 to 9,632 processors. Of these 500, nearly 90% have 64 processors or more.

*Parallel and Distributed Computing Practices*, as indicated by its name, is concerned with practical issues of parallel computing, addressing the consequences of the trends in such areas as performance and applications. On the one hand, it is indisputable the rapid advances in new architectural designs of parallel computers. On the other, it is still far from clear where we stand as far as the design of really efficient and scalable parallel algorithms is concerned.

In his recent editorial in a special issue on coarse-grained parallel algorithms of *Algorithmica*, Dehne has addressed this problem to some depth. It is our intent to contribute in this discussion.

Since the eighties, the PRAM (Parallel Random Access Machine) model has been receiving considerable attention. It is a formal model that allows one to establish optimal results. Its importance also relies on the possibility to relate parallel complexity to sequential complexity defined on traditional sequential computing models. By removing algorithmic details as communication and synchronization, the PRAM model allows one to focus on the structural characteristics in the problem domain. Furthermore, many of the techniques and methods designed for the PRAM model can be extended to other computing models.

One should notice, however, that PRAM algorithms, when implemented in practice, leave much to be desired in terms of actual performance. Frequently speedup results for theoretical PRAM algorithms do not match the actual speedups obtained in experiments performed on real parallel computers. So in spite of the usefulness, as far theory is concerned, of the PRAM model, we are desperately in need of more realistic parallel computing models.

Among the realistic computing models, the most important is probably Valiant's BSP (Bulk Synchronous Parallel) computing model, proposed in 1990. A BSP computer consists of a set of processor/memory modules connected by a router that can deliver messages in a point to point fashion among the processors. In the BSP model, computation is divided into a sequence of supersteps separated by barrier synchronizations. A superstep in turn consists of local computation and data exchange among processors through the router. Though BSP is possible to simulate PRAM algorithm optimally on distributed memory machines, Valiant observes the importance of design of parallel algorithms that take advantage of local computations and minimize global operations. Valiant also points out situations in which PRAM simulations are not efficient and these situations, unfortunately, occur in the majority of current parallel computers.

Dehne et al. proposed a simpler and more practical version of the BSP model in 1993, referred to as the Coarse-Grained Multicomputer (CGM) model. Considering n the problem size, the CGM (n,p) model consists of p processors P1, …, Pp, with O(n/p) local memory per processor and connected through an arbitrary interconnection network. The term *coarse-grained* means the local memory size is large, usually we require n/p > p. An algorithm in the CGM model consists of alternating local computation and global communication rounds. In a computation round the pprocessors compute independently on their respective local data and the best possible sequential algorithm can be used in each processor for this local computation. In a communication round each processor may send O(n/p) data and receive O(n/p) data. It is required that all information sent from a given processor to another processor in one communication round be packed into one long message, thereby minimizing the message overhead. A CGM computation/communication round corresponds to a BSP superstep. The CGM model is particularly suitable in cases where the overall computation speed is considerably larger than the overall communication speed, and the problem size is considerably larger than the number of processors, which is usually the case in practice. The main advantage of the CGM model is its simplicity. It models the communication cost of a parallel algorithm by using only one single parameter, namely the number of communication rounds. Nevertheless, it gives a realistic performance prediction for commercially available multiprocessors.

The goal of the CGM model is to minimize the number of communication rounds as well as the total local computation time. Both in BSP and CGM algorithms, it has been shown that minimizing the number of communication rounds leads to improved portability across different parallel architectures. The CGM model allows the exchange of O(n/p) data in each communication round. This is of course an upper bound limit. In a practical point of view, it is desirable to have an amount of data transmitted that is independent of n in a communication round. It is desirable that this amount of data transmitted in a communication round be constant or independent of n, say O(p).

The appearance of these more realistic models such as BSP, CGM and others has somehow pushed the advances of design of efficient and scalable parallel algorithms in the nineties to a new level. By examining the proceedings and journals in this area, one can verify the current state in the design of parallel algorithms in such models, many of which actually implemented and shown to give significant performance results. One perceives, however, that such advances are still modest as compared to the advances in hardware and computer architecture design. The challenge in the next decade, at the dawn of the new millenium, is for researchers in algorithms design to close the gap between hardware and software in parallel computing.

S. W. Song

Instituto de Matemática e Estatística

Universidade de São Paulo