Will SIMD Make a Comeback?


Stewart Reddaway

Abstract

When MasPar died in 1996, SIMD looked nearly dead, but, as indicated below, it is now showing signs of revival. The SIMD marketplace had seen many other machines, including Staran, Illiac IV, the DAP series, Aspro, CLIP 4 and the early Connection Machines; of these, only the DAP is still in production (the Gamma II range). Both SIMD and traditional MIMD machines were losing to powerful single processors, loosely coupled farms and SMPs.

SIMD architectures range from Cray vector machines to arrays of 1-bit processors, sometimes called associative processors. All exploit data parallelism. Most non-trivial applications involve loops over arrays of data, doing similar things on each iteration. As well as performing data-parallel operations well, a good SIMD machine must provide

  • efficient serial control
  • efficient selective processing ("activity control")
  • efficient array manipulation to achieve data reordering, replication, etc.
  • fast global testing.

My own familiarity is with bit- and byte-organized PEs, which have great flexibility, and for which the speed of more complex functions varies with their underlying fundamental complexity. This is in sharp contrast with conventional processors, which have certain functions built into hardware, such as 64-bit floating-point add and multiply, but perform much worse on functions that are not.

Also of critical importance are good program development tools, including high-level languages that hide the hardware details. For example, by allowing operations on arrays of arbitrary size, such languages largely eliminate program loops, resulting in both easier programming and good performance, because the looping is done by efficient low-level code. Such whole-array operations are expressed naturally in object-oriented languages such as C++.

On many applications SIMD can outperform other architectures, often by a wide margin, provided similar levels of technology are used. However, SIMD was in a downward spiral due to several interconnected factors:

  • a dominant sequential programming paradigm
  • volume production of standard processors
  • the high investment required for full-custom silicon
  • lower cycle speeds for SIMD hardware
  • the limited market for non-standard software
  • interworking of SIMD and standard software that is neither easy nor efficient

Undoubtedly the SIMD community also made mistakes.

SIMD lost two propaganda battles. MIMD appropriated the label MPP (Massively Parallel Processing) when its systems typically contained tens of processors and SIMD systems contained thousands. More importantly, MIMD systems were labeled General Purpose and SIMD Special Purpose. This claim really rested on MIMD programming being nearer to the dominant programming paradigm than SIMD programming, which is a very different issue from which applications are fundamentally best served by which architectures.

In the late 80s, MIMD was more successful in capturing research funding. SIMD suffered from being:

  • too simple, leaving no difficult system issues, such as message passing, for academics to research, and
  • sold as complete systems, which made it harder for academic researchers to create their own architectures (for example, Transputer arrays with various forms of interconnect).

The fascinating subject of mapping applications to SIMD machines, with algorithms that efficiently exploit their many flexibilities, held less interest for computer scientists. More academic MIMD work (compared with SIMD) led to a larger MIMD community.

Moving from the bad news to the good news, SIMD is now seeing the light of day in volume processors that include small-scale SIMD in the form of MMX and similar extensions. Old hands may laugh at parallelism as low as 4 or 8, but this runs at high clock rates and is mainstream. It is a SIMD opportunity waiting to be more fully exploited.

At regular intervals there are proposals, going under acronyms such as IRAM and PIM, to build SIMD units with memory and processing integrated on a chip. A development publicized recently is a programmable chip with 1536 PEs and 3 MB of memory, being produced by PixelFusion. This very high performance SIMD chip is for graphics, and is partly based on work over many years at the University of North Carolina. PixelFusion plan to cater for other applications in due course.

Both the above developments exploit advanced silicon design, and point to a resurgence in SIMD hardware. Currently, the less advanced silicon technology used in traditional SIMD, and the low volumes, mean that careful high-performance techniques are sometimes needed to be competitive. An example of using these flexibilities to achieve high performance was published last year (see S. F. Reddaway, "Image and signal processing on the SIMD DAP Gamma II", in High Performance Architectures for Real-time Image Processing, IEE Colloquium, Ref. 1998/197, London, Feb 1998). The task was to compute multiple filters for a Doppler radar. By programming a complete multiply-accumulate loop (on array data) as a whole, and building the floating-point weights into the code using a code generator (as well as other algorithmic ideas), up to 23 GFLOPS was achieved on what was characterized as a 1.5 GFLOPS SIMD machine.

A somewhat surprising success for SIMD has been large-scale indexed text retrieval. Currently web search engines, powered by processor farms, are failing to keep up with the explosive growth in both the size of the web and the search demand. Don't be surprised if a new generation of web search engines is powered by SIMD, using a new generation of application code.

With a more level technology playing field, we are likely to see the downward spiral reversed and SIMD conquer many performance-critical applications.

Stewart Reddaway
Cambridge Parallel Processing Ltd.

Section: Editorial