Adapting to Hostile Architectural Environments

D. E. Keyes

Published: Mar 1, 2001

D. E. Keyes

Abstract

We shape our buildings; thereafter they shape us.—Sir Winston Churchill

According to SGI-Cray's webpage on the T3E-900 (http://www.sgi.com/t3e/performance.html, last accessed in 1999), the average number of floating point operations required to cover an access to main memory is 252—about 25 times worse than on the CDC 7600, a machine introduced three decades ago. Proponents of the Hybrid Technology Multithreaded (HTMT) petaflops computer (http://www.hq.nasa.gov/hpcc/insights/vol5/petaflop.htm) estimate that main memory latency will grow to 10,000 processor cycles under the earliest versions of their design, and increase thereafter. Ironically, one encounters the same phenomenon at the low end of high performance computing—beowulf clusters—for remote memory latencies. Given this dwarfing of the cost of computation by the cost of data transfer, application programs of the future will be dominated by expression of data motion rather than operations. This daunting task is made more approachable by the identification of data motion templates, such as the eight listed below.

For computations with data access patterns that are regular and periodic in time (as in, for example, an iterative PDE computation on a static grid, even if the grid is spatially irregular), these latencies do not by themselves pose a serious threat to the feasibility of aggressive exploitation of hierarchical distributed memory machines. Latency can be covered, given sufficient bandwidth and proper data structure organization, by block transfers, prefetching, multithreading, or some combination. These solutions presume concurrency well beyond the process level (easily available in PDEs) and high bandwidth. However, high bandwidth-based solutions of the problems of high latency are not always possible. For problems with data-dependent data access patterns, lacking a predictability that can be exploited in an efficient inspector-executor paradigm, high latency is far more serious, and perhaps fatal.

Even if a latency problem is amenable to treatment, the required bandwidth cannot be taken for granted. The IBM F50 node (four 604e processors sharing a local memory) used in an early version of an ASCI machine has barely enough memory bandwidth to keep one of the four processors fed during a streaming computation such as a vector scaling—even with the scaling operand fixed in a register. Since most PDE computations (and hence many Grand Challenge applications) touch O(n) data in O(n) operations, it is clear that bandwidth is at least as critical an issue as latency in present and future architectures.

The supercomputing community increasingly faces a disconnect between architects and application developers, whose seriousness is sometimes unappreciated since the new architectures continue to handle dense linear algebra kernels, such as DGEMM, well. Even the dense matrix-matrix multiply kernel DGEMM is full of interest on a hierarchical memory machine, as the ATLAS, MTL, PhiPAC, and PLAPACK projects have demonstrated. Maximum performance and how to block and order to achieve it are sufficiently complex issues that vendor libraries can be outperformed by freely available architecture-sensitive self-tuning replacements. However, there are sweet spots not far from theoretical peak for such dense kernels within the manufactured balance of hardware parameters; this is not true for many sparse kernels.

Petaflops-scale computers will be useful to the majority of Grand Challenge applications only if efforts are made to bridge the gap between architects and applications writers well before such machines are sent to fabrication. Parallel and Distributed Computing Practices exists, in part, to support this mission, and researchers from one domain should increasingly crystalize their requirements or desiderata for those of the other. Meanwhile, both hardware architects and application writers need to present requirements to system software designers.

For instance, language extensions allowing users to limit the allocatability or replicability of data items are critical to coherence protocols that are both reliable and scalable. Architects and system software designers must make such directives visible to the user and demonstrate their effectiveness in typically arising kernels.

Meanwhile, application writers must converge on a minimal set of generic actions whose requirements make good targets for architects seeking to balance their machines better for particular application domains. A nice example of this is the layman's article by laser fusion physicist Aleksei Shestakov et al. in the April 1999 SIAM News. In this article, the authors discuss four of the eight operations below, and take the reader on a tour of the code data structures and partitioning strategies that support the operations.

A general purpose unstructured scientific simulation, implicit or explicit, may involve the following:

vertex loop: concurrent at point level, no communication (typically state equation updates or vector BLAS)
edge loop: concurrent at colored edge level, near neighbor communication (typically stencil updates or flux evaluations)
recurrence: concurrent at subdomain or wavefront level, near neighbor communication (typically relaxation sweeps or backsolves)
global reduction: concurrent at point level, but synchronizing (typically used in orthogonalization and convergence or robustness tests)
multilevel transfer: concurrent at point level, possibly involving search heuristics (typically interpolation or agglomeration)
tree traversal: concurrent at point level (typically detecting topological change (e.g., the contact problem) or doing multilevel force summation)
ray tracing: concurrent at ray level (for radiative action at a distance, as well as some types of visualization)
parallel I/O: concurrent at the R/W head level (granting a single file image for parallel I/O so that checkpoints/restarts may be undertaken with varying processor granularity)

Architects supporting these operations at a high level in a manner that is efficient for their platforms, with drop-in floating point functions in the inner loops, will go a long way in reducing the hostility of the new platforms for performance-oriented users. Application writers should learn to express their codes in such terms or be prepared to choreograph their data motion in space-time themselves.

Application writers should also study and publish discussions of the hierarchical workingset structure of their codes (e.g., points, stencils, grid patches, sets of related grid patches, full subdomains of data, etc.) and characterize the relative scaling of workingset size in terms of overall problem size and process granularity. The realization that primary BLAS3 workingsets scale as the 2/3rds power of the total operations should lead PDE application writers into higher-order discretizations and interlaced data structures. The realization that multivector forms of matrix-vector multiplication can be executed much faster than the corresponding number of the usual single-vector form should likewise turn the algorithmic crank. Everything is connected to everything in computational ecology, as in natural ecology, but many cross-disciplinary connections, such as the potential effect of discretization order on optimal cache size, may be invisible to experts in individual disciplines.

As the part of the legacy of the PITAC report (accessible at http://www.nitrd.gov/pitac/report/), many will go forth and write software. Intense study and discussion of the appropriate interfaces and functionality of each layer of this software by each application community is urgently needed. May the authors and editors of SCPE play their part!

David Keyes
Old Dominion University

Issue

Vol. 2 No. 1 (1999)

Section

Editorial

Article Sidebar

Main Article Content

Abstract

Article Details