PDF, 4.9 MB
Zipped PostScript, 5.1 MB
PDF, 2.3 MB
Zipped PostScript, 2.5 MB
Programming models based on algorithmic skeletons promise to raise the level of abstraction perceived by programmers when implementing parallel applications, while guaranteeing good performance figures. At the same time, however, they restrict the freedom of programmers to implement arbitrary parallelism exploitation patterns. In fact, efficiency is achieved by restricting the parallelism exploitation patterns provided to the programmer to the useful ones for which efficient implementations, as well as useful and efficient compositions, are known.
In this work we introduce muskel, a full Java library that targets workstation clusters, networks and grids and provides programmers with a skeleton-based parallel programming environment.
muskel is implemented using (macro) data flow technology rather than the more usual skeleton technology based on implementation templates. Using data flow, muskel easily and efficiently implements both classical, predefined skeletons and user-defined parallelism exploitation patterns. This provides a means to overcome some of the problems that Cole identified in his skeleton manifesto as the issues impairing skeleton success in the parallel programming arena. We discuss in full how user-defined skeletons are supported by the data flow implementation, present experimental results, and discuss extensions that further characterize skeletons with non-functional properties, such as security, through the use of Aspect Oriented Programming and annotations.
PDF, 244 KB
Zipped PostScript, 683 KB
MPJ Express is our implementation of MPI-like bindings for Java. In this paper we discuss our intermediate buffering layer, which makes use of the direct byte buffers introduced in the Java New I/O package. The purpose of this layer is to support the implementation of derived datatypes. MPJ Express is the first Java messaging library that implements this feature in pure Java. In addition, the buffering layer allows efficient implementation of communication devices based on proprietary networks such as Myrinet. We evaluate the performance of our buffering layer and demonstrate the usefulness of direct byte buffers. We also evaluate the performance of MPJ Express against other messaging systems using Myrinet and show that our buffering layer makes it possible to avoid the overheads suffered by other Java systems, such as mpiJava, that rely on the Java Native Interface.
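To make the idea of a buffering layer built on direct byte buffers concrete, the following is a minimal sketch using only the standard `java.nio` API. It packs a strided selection of a `double` array (roughly what an MPI vector datatype describes) into a direct `ByteBuffer` and unpacks it again. The class and method names (`DirectBufferSketch`, `packStrided`, `unpackStrided`) are hypothetical and chosen for illustration; this is not the MPJ Express API or its actual buffer layout.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: hypothetical names, not part of MPJ Express.
public class DirectBufferSketch {

    // Copy `count` elements of `src`, starting at `offset` and stepping by
    // `stride`, into a newly allocated direct buffer that is ready for reading.
    static ByteBuffer packStrided(double[] src, int offset, int stride, int count) {
        ByteBuffer buf = ByteBuffer.allocateDirect(count * Double.BYTES)
                                   .order(ByteOrder.nativeOrder());
        for (int i = 0; i < count; i++) {
            buf.putDouble(src[offset + i * stride]);
        }
        buf.flip(); // switch from writing to reading
        return buf;
    }

    // Inverse of packStrided: scatter `count` doubles from `buf` into `dst`.
    static void unpackStrided(ByteBuffer buf, double[] dst, int offset, int stride, int count) {
        for (int i = 0; i < count; i++) {
            dst[offset + i * stride] = buf.getDouble();
        }
    }

    public static void main(String[] args) {
        double[] matrix = {1, 2, 3, 4, 5, 6, 7, 8, 9}; // 3x3, row-major
        ByteBuffer column = packStrided(matrix, 1, 3, 3); // middle column
        double[] received = new double[9];
        unpackStrided(column, received, 1, 3, 3);
        System.out.println(column.isDirect() + " " + received[1] + " "
                + received[4] + " " + received[7]);
    }
}
```

A direct buffer lives outside the garbage-collected heap, so a native communication device can in principle read it without an extra copy; that property is what the abstract refers to when it mentions proprietary networks.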
PDF, 771 KB
Zipped PostScript, 1014 KB
High-level area-time estimation is an essential step in rapid design exploration for FPGA implementations. Existing work on high-level area-time estimation usually ignores the physical effects of the design after place and route, which have a notable impact on the maximum achievable speed of the design. In this paper, we propose a framework to rapidly estimate the area-time measures of mapping C applications onto FPGAs. The framework relies on the Trimaran compiler to generate an optimized high-level IR (Intermediate Representation) of the C applications. Area-time estimation of the IR is then performed using a proposed estimation model based on an architecture template with application-specific heterogeneous functional units. To accurately predict the delay of the design after place and route, we introduce a new estimation metric that models the criticality of the design's interconnectivity. Experimental results on a set of embedded functions show that the proposed area estimation achieves results comparable to the synthesis results of a commercial FPGA tool, in the order of milliseconds. For the C functions used in our experiments, the proposed delay estimation leads to an average error of about 3% when compared to the post place and route results. In addition, we demonstrate the robustness of the proposed framework, which provides consistent results for different FPGA families. The contribution of this paper is a scalable methodology for rapid estimation of cost-benefit metrics of C-based algorithms to be accelerated on FPGA-based high-performance computing platforms.
PDF, 281 KB
Zipped PostScript, 500 KB
Lower/Upper triangular (LU) factorization plays an important role in scientific and high-performance computing. This paper presents an implementation of the LU decomposition algorithm for double-precision complex numbers on a star-topology-based multi-FPGA platform. The out-of-core implementation moves data through multiple levels of a hierarchical memory system (hard disk, DDR SDRAMs and FPGA block RAMs) using completely pipelined data paths in all steps of the algorithm. Detailed performance numbers for all phases of the algorithm are presented and compared to a highly optimized implementation for a low-power microprocessor-based system. We also compare the performance per Watt of the FPGA and the microprocessor system. Finally, we give recommendations on how improvements to the FPGA design would increase the performance of the double-precision complex LU factorization on the FPGA-based system.
PDF, 451 KB
Zipped PostScript, 718 KB
The projection of 3D scenarios onto 2D surfaces produces distortion in the resulting images that affects the accuracy of low-level motion primitives. Independently of the motion detection algorithm used, post-processing stages that use motion data are dominated by this distortion artefact. We therefore need a way of reducing the distortion effect in order to improve the post-processing capabilities of a vision system based on motion cues. In this paper we adopt a space-variant mapping strategy and describe a computing architecture that finely pipelines all the processing operations to achieve reliable, high-performance processing. We validate the computing architecture in the framework of a real-world application: a vision-based system for assisting overtaking manoeuvres that uses motion information to segment approaching vehicles. The overtaking scene seen in the rear-view mirror is distorted by perspective, so a space-variant mapping strategy that corrects perspective distortion artefacts is of high interest for arriving at reliable motion cues.
PDF, 679 KB
Zipped PostScript, 715 KB
A number of grand-challenge scientific applications are unable to harness the Teraflops-scale computing capabilities of massively parallel processing (MPP) systems due to their inherent scaling limits. For these applications, multi-paradigm computing systems that provide additional computing capability per processing node using accelerators are a viable solution. Among the various generic and custom-designed accelerators that represent a data-parallel programming paradigm, FPGA devices provide a number of performance-enhancing features, including concurrency, deep pipelining and streaming, in a flexible manner. We demonstrate acceleration of a production-level biomolecular simulation, in which typical speedups are less than 20 on even the most powerful supercomputing systems, on an FPGA-enabled system with a high-level programming interface. Using accurate models of our FPGA implementation and parallel efficiency results obtained on the Cray XT3 system, we project that the time-to-solution is reduced significantly compared to the microprocessor-only execution times. A further advantage of computing with FPGA-enabled systems over microprocessor-only implementations is performance sustainability for large-scale problems. The computational complexity of a biomolecular simulation is proportional to its problem size; hence the runtime on a microprocessor increases at a much faster rate than on FPGA-enabled systems, which are capable of providing very high throughput for compute-intensive operations and thereby sustain performance for large-scale problems.
PDF, 224 KB
Zipped PostScript, 395 KB
Some algorithms are more efficient than others. The complexity of an algorithm is a function describing its efficiency, and it has two measures: space complexity and time complexity. In this paper, we present a complexity analysis for FPGA-based designs that is built on the 4-input, 1-output LUT structure used by the majority of FPGA manufacturers. The same procedure is then applied to the Karatsuba-Ofman Multiplier (KOM), for two reasons. Firstly, given the increased use of FPGAs, especially for security applications, it seems logical to compare various architectures for their efficiency in FPGAs. Secondly, for diverse security applications, it provides an a priori estimate of hardware resources and achievable timing. We consider the 4-input, 1-output structure as the basic building block available in the majority of FPGAs from different manufacturers. We then compare our theoretical and experimental results for KOM in FPGAs and find the agreement fairly convincing.
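For readers unfamiliar with the Karatsuba-Ofman scheme, the following is a minimal software sketch of the recursion on integers, using only `java.math.BigInteger`. It replaces the four half-size products of schoolbook multiplication with three. The class name `KaratsubaSketch` and the 32-bit cutoff are assumptions made for illustration; the abstract does not state which operand type (integer or binary-field polynomial) or which hardware decomposition the authors use, so this should not be read as their design.

```java
import java.math.BigInteger;

// Illustrative only: an integer Karatsuba-Ofman recursion, not the paper's FPGA design.
public class KaratsubaSketch {

    // Below this operand width we fall back to the built-in multiply (assumed cutoff).
    static final int THRESHOLD_BITS = 32;

    static BigInteger multiply(BigInteger x, BigInteger y) {
        // Reduce to non-negative operands and restore the sign afterwards.
        if (x.signum() < 0 || y.signum() < 0) {
            BigInteger p = multiply(x.abs(), y.abs());
            return (x.signum() * y.signum() < 0) ? p.negate() : p;
        }
        int n = Math.max(x.bitLength(), y.bitLength());
        if (n <= THRESHOLD_BITS) {
            return x.multiply(y);
        }
        int half = n / 2;
        BigInteger mask = BigInteger.ONE.shiftLeft(half).subtract(BigInteger.ONE);

        // Split x = x1 * 2^half + x0 and y = y1 * 2^half + y0.
        BigInteger x1 = x.shiftRight(half), x0 = x.and(mask);
        BigInteger y1 = y.shiftRight(half), y0 = y.and(mask);

        // Three recursive products instead of four.
        BigInteger z2 = multiply(x1, y1);
        BigInteger z0 = multiply(x0, y0);
        BigInteger z1 = multiply(x1.add(x0), y1.add(y0)).subtract(z2).subtract(z0);

        // Recombine: z2 * 2^(2*half) + z1 * 2^half + z0.
        return z2.shiftLeft(2 * half).add(z1.shiftLeft(half)).add(z0);
    }

    public static void main(String[] args) {
        BigInteger a = new BigInteger("123456789012345678901234567890");
        BigInteger b = new BigInteger("987654321098765432109876543210");
        System.out.println(multiply(a, b).equals(a.multiply(b)));
    }
}
```

Trading one multiplication for a few additions is what makes the scheme attractive when multipliers are the expensive resource, which is why its LUT cost and timing are worth estimating before synthesis.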
PDF, 290 KB
Zipped PostScript, 446 KB
This paper presents a new Global Virtual Time (GVT) algorithm, called TQ-GVT, that is at the heart of a new high-performance Time Warp simulator designed for large-scale clusters. Starting with a survey of numerous existing GVT algorithms, the paper discusses how other GVT solutions, especially Mattern's GVT algorithm, influenced the design of TQ-GVT, as well as how it avoids several types of overhead that arise in clusters executing parallel discrete simulations. The algorithm is presented in detail, with a proof of its correctness. Its effectiveness is then verified by experimental results obtained on more than 1,000 processors for two applications, one a synthetic workload and the other a spiking neuron network simulation.