Parallel Numeric Algorithms On Faster Computers

Roman Trobec


High performance parallel computers provide the computational rates necessary for computer simulations based on parallel numerical methods. Large-scale natural phenomena, experiments that would cost vast amounts of money, and those that are ecologically problematic or dangerous, can first be simulated in order to predict the expected results.

There are two ways to increase computational performance: either by increasing the performance of a single processor or by increasing the number of processors. Nowadays it seems that the advance in performance of a single processor cannot keep up with Moore's prediction that computer performance will double every two years. The slower rate is being temporarily compensated for by the increasing number of processors that can cooperate on the same problem. Today, it is not unusual to utilise parallel machines with several thousand processors. But are such huge parallel machines able to execute numerical algorithms efficiently?

It is known that only scalable algorithms can be implemented efficiently on a massively parallel machine. Typically, such algorithms need a very small amount of cooperation, i.e., communication between processors. However, a huge number of processors cannot be exploited adequately if the problem size remains independent of the available processors.

New parallel algorithms are being published that can be implemented only on a limited number of processors because an even distribution of the computational load cannot be achieved. In such cases the speed-up is limited. These algorithms are of marginal value for practical use, because their parallelization gains can be matched by faster computers. If the theoretical speed-up of some improved parallel algorithm is limited, which is often the case, then it may be better simply to use a faster computer. Coding, testing and documenting a new algorithm that offers only a small speed-up is inefficient.
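The limit on speed-up described above is captured by Amdahl's law: if a fraction s of the work is inherently serial, no number of processors can deliver a speed-up beyond 1/s. A minimal sketch (the 5% serial fraction is an illustrative assumption, not taken from the text):

```python
# Amdahl's law: speed-up of a program with serial fraction s on p
# processors. As p grows, the speed-up approaches the ceiling 1/s.

def amdahl_speedup(s: float, p: int) -> float:
    """Speed-up on p processors when a fraction s of the work is serial."""
    return 1.0 / (s + (1.0 - s) / p)

# With a 5% serial fraction, even a thousand processors cannot
# exceed a 20x speed-up -- a faster computer may be the better buy.
for p in (4, 64, 1024):
    print(p, round(amdahl_speedup(0.05, p), 1))
```

The diminishing returns printed here are exactly why a marginally parallelizable algorithm loses to a faster serial machine.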

Although a smarter algorithm may have lower asymptotic complexity, its multiplicative constant is usually higher because:

  • it typically performs more operations per iteration (e.g. iterative methods for linear equation systems),
  • it uses more memory, resulting in more cache misses,
  • it sometimes has a more complicated control flow, making it harder for the compiler to produce optimal assembly code and for the CPU to avoid pipeline stalls.
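The trade-off between asymptotic complexity and the multiplicative constant can be made concrete with a toy cost model; the constants below (1 for the simple algorithm, 40 for the smarter one) are assumptions chosen purely for illustration:

```python
import math

# Toy cost model: a simple O(n^2) algorithm with a small constant
# versus a smarter O(n log n) algorithm with a larger constant.
# The smarter algorithm only pays off beyond some crossover size.

def cost_simple(n: int, c: float = 1.0) -> float:
    return c * n * n

def cost_smart(n: int, c: float = 40.0) -> float:
    return c * n * math.log2(n)

def crossover() -> int:
    """Smallest n at which the smarter algorithm becomes cheaper."""
    n = 2
    while cost_smart(n) >= cost_simple(n):
        n += 1
    return n

print(crossover())  # for small problem sizes the simple algorithm wins
```

Below the crossover size the "worse" algorithm is the right choice, which is exactly the situation the list above describes.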

Therefore algorithm development must take into account the current situation and the predicted further development of CPUs and compilers. Algorithmic complications also make future changes to programs harder.

On the other hand, as new algorithms become widely used, they find their way into numerical libraries such as BLAS, POOMA and others. After that, using state-of-the-art algorithms requires little more effort than using standard ones. Widely used applications can typically rely on these libraries because the libraries were designed with such calculations in mind; special-purpose programs, however, often have demands that the above-mentioned libraries cannot meet. Numerical algorithms that are good for manual calculation are not always suitable for computers (complicated coding, memory requirements, etc.). Algorithms with balanced computational and memory requirements can therefore be advocated in numerical analysis, because they have the potential to be parallelized effectively.

Both ideas, scalability and balanced computational and memory requirements, have to be considered in the development of new numerical algorithms and in any reconsideration of the standard ones. In addition, the effort needed to code a new, improved algorithm has to be taken into account. Finally, the number of expected computer experiments has to be estimated. If, for example, only a small number of experiments is expected, then we can use an old and potentially slow algorithm. If its execution time is unacceptably long, then something new, usually based on compromises, has to be developed.

The time needed to finish a computer experiment is one factor that has to be considered in deciding whether to invest in the development of a new parallel algorithm. The number of expected runs of such an experiment is even more important. For example, if we expect only a single run or a very small number of runs, then we can usually use an existing old algorithm, particularly if the execution time is rather short, say no more than 24 hours or sometimes up to 240 hours. There is no problem at all with shorter experiments. But if we expect several thousand runs to be necessary, then the strategy should change. Many experiments can be run in parallel, and parallelization of such a set of independent experiments is usually simple, so the benefits can be enormous. In CHARMM (a widely used simulation programme in structural chemistry), for example, even a 5% speed-up is probably worth a few thousand programming hours, while custom programs used only within one organization are another matter.
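Farming out such a set of independent runs can be sketched with Python's standard multiprocessing module; `run_experiment` below is a hypothetical stand-in for a real simulation, and the parameter sets are illustrative:

```python
from multiprocessing import Pool

def run_experiment(params: int) -> int:
    """Placeholder for one independent simulation run; a real campaign
    would invoke the actual simulation code with these parameters."""
    return params * params

if __name__ == "__main__":
    param_sets = range(8)          # thousands of runs in a real campaign
    # No communication between runs: each worker process executes
    # complete experiments independently, so speed-up is near ideal.
    with Pool(processes=4) as pool:
        results = pool.map(run_experiment, param_sets)
    print(results)                 # one result per parameter set
```

Because the runs share nothing, this scheme scales to as many processors as there are pending experiments.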

General complex algorithms that are needed by many users, or that form part of standard parallel libraries, remain the main problem. They have to be coded with great care so that they are implemented in the best possible way and offer near-ideal speed-up for any number of processors. Research work should concentrate on this issue. Efficient system programs and tools are needed for further development of parallel programs. Ordinary users do not need to know anything about parallelization; they just want to perform complex experiments in as short a time as possible.

Not only computational performance but also memory requirements play an important role in the parallelization strategy. Sometimes, having enough memory or cache has a greater impact on final performance than the precise coding of a parallel program. The essential fact here is that parallel memory access improves the performance of a parallel program. Data should be distributed in a way that permits an even distribution of computation, enabling the use of virtually the same program on all machines regardless of memory type. Such an approach usually leads to an almost ideal speed-up and therefore to applications appropriate for massively parallel machines.
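A common way to achieve such an even distribution of data and computation is block partitioning, sketched below as a minimal illustration (not tied to any particular machine or memory model):

```python
def block_range(n: int, p: int, rank: int) -> range:
    """Indices owned by processor `rank` when n items are split into
    p nearly equal contiguous blocks (sizes differ by at most one)."""
    base, extra = divmod(n, p)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)

# 10 items over 4 processors: block sizes 3, 3, 2, 2 -- every
# processor runs the same program on its own slice of the data.
for r in range(4):
    print(r, list(block_range(10, 4, r)))
```

Because every processor executes the same code on an equal share of the data, the load stays balanced regardless of whether memory is shared or distributed.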

The final issue that has to be considered is the set of ideas covered by autonomic computing, i.e., self-organization, fault tolerance, etc., which implicitly assumes distributed and parallel systems. While distributed systems may consist of fast computers with comparatively slow interconnections, parallel computers are composed of fast computers connected by equally fast communication media. Comparing these two types of machines we recognise that they need different treatment. Distributed systems or grids can perform only ideally scalable algorithms with virtually no communication, while parallel computers have the potential to perform complicated algorithms with high communication requirements.

It is usually supposed that processor performance has to be improved in order to improve the overall performance of parallel algorithms. But suppose that a technological breakthrough were made in the communication sector, so that communication speed became virtually unlimited. In that case the computer performance achieved today would become sufficient, and the effort that needed to be invested in parallelization tools would become much smaller than in the case of fast processors. It seems that research into fast communication methods would be at least as productive as research focused on increasing processor speed.

If we speculate on future developments in the area of parallel and distributed computing, we may conclude that there will be no single computer design covering all user requirements. Most of us will use home and mobile computers with simple access to different types of specialized high performance computers with fast response, for a price comparable with that of a phone call. Some of us will work on the system software for these specialized computers, probably within several non-standard frameworks. Then, something will happen that will initiate the next cycle in the development of the next generation of thinking machines…

Roman Trobec
Jozef Stefan Institute, Slovenia
