ow concepts. The first approach, referred to as the receiver-initiated algorithm, realizes the FFT iterations as a parent-child relationship while fully exploiting the underlying parallelism. By establishing small sized threads, minimal cost overhead in thread switching and embedding the synchronization operations within a parent-child relationship, the receiver-initiated algorithm is well suited for fine-grained architectures. The second approach, referred to as the sender-initiated algorithm, follows a data ow model based on the producer-consumer style of programming and can be adopted to different architectural parameters for achieving high performance. We show in the paper the advantages and disadvantages of these approaches on fine-grained architectures. The implementations of the proposed algorithms have been carried out on the EARTH (Effcient Architecture for Running THreads) platform which is a non-preemptive, fine-grained data ow model.
For both the algorithms, we analyze the number of remote and local threads in each processor and study its impact on the experimental results. Our implementation results show that for large number of processors, both the algorithms perform well, yielding execution times of only 10 ms for an input of 16 K data points on a 64 processor machine, assuming each processor running at 140 MHz clock speed. For certain block sizes on fixed problem size and machine size, the receiver-initiated approach performs better than the sender-initiated approach. We also present the algorithms implication on the design of fine-grained architectures.