There are many scientific and engineering problems, as well as business and everyday problems, that can only be solved in a timely manner through the employment of supercomputers. Unfortunately, supercomputers are very expensive and are difficult for non-expert programmers to use because of the complexities of the multiprocessor architecture. There is a strong push to utilize advanced dedicated clusters, and some computer manufacturers have already built cluster-based systems. There is also a strong trend in parallel computing to move away from specialized supercomputers and expensive, dedicated clusters to cheaper, general-purpose distributed systems that consist of commodity off-the-shelf components such as PCs and workstations connected by fast networks. Many organizations already have such a ready-made vehicle for parallel computing in the form of a non-dedicated homogeneous cluster (of PCs), which is often idle for at least 12 hours each day and all weekend, and is idle or only lightly loaded during working hours. The performance-to-price ratio of these clusters is very high, which makes them a very attractive alternative. The advances constantly being made in processor speed and network bandwidth make using clusters for parallel computing more and more attractive. Furthermore, clusters scale very well, which makes them even more attractive to users.
However, exploiting the enormous power of parallel computing on non-dedicated clusters has until now been limited by the scarcity of software to assist non-expert programmers. Current work in parallel processing on clusters concentrates primarily on execution performance. The development of parallel applications still requires specialized knowledge of operating systems, run-time environments and middleware. In particular, programmers must deal not only with the communication and coordination of parallel processes but also with managing parallel processes and computational resources, including the instantiation and coordination of execution on the cluster, and process placement. Only recently have some research groups begun to address efficient and easy cluster utilisation.
It is observed that the lack of a Single System Image (SSI) is a major obstacle to parallel processing on non-dedicated clusters entering mainstream computing. None of the research performed thus far has looked at how to develop a technology that allows SSI operating systems to be built supporting parallel processing on clusters. SSI operating systems should not only provide high execution performance for parallel applications, use cluster resources efficiently and support both the message passing and shared memory communication paradigms, but should also support high availability, parallelism management, transparency and fault tolerance, i.e., offer an SSI cluster. If adequate support for high availability, parallelism management, transparency and fault tolerance is not provided, developing robust parallel applications requires the programmer to be aware of the details of the whole cluster.
Existing systems do not support availability, i.e., they do not handle the automatic addition, removal and reorganisation of cluster resources, in particular computers. Developers must identify the computers of a cluster that are suitable for their application (and are available), detect the arrival of new computers and presence of faulty computers, and set up virtual machines to execute their applications.
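At its core, the availability support described here amounts to maintaining a dynamic membership view of the cluster: noticing arriving computers and dropping faulty ones without programmer involvement. As an illustration only (none of the systems discussed in this article is implied to work this way), a toy heartbeat-based membership table might look like the following sketch; the class name, node names and timeout value are all invented for the example.

```python
class ClusterMembership:
    """Toy heartbeat-based membership table: a node joins by sending its
    first heartbeat; a node silent for longer than `timeout` seconds is
    treated as failed and excluded from the available set."""

    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of last heartbeat

    def heartbeat(self, node, now):
        # Arrival of a new computer and liveness of a known one are
        # handled identically: record the latest heartbeat time.
        self.last_seen[node] = now

    def available(self, now):
        return sorted(n for n, t in self.last_seen.items()
                      if now - t <= self.timeout)

    def failed(self, now):
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)


# Two nodes join at t=0; only node-a is still heartbeating at t=10.
m = ClusterMembership(timeout=5.0)
m.heartbeat("node-a", now=0.0)
m.heartbeat("node-b", now=0.0)
m.heartbeat("node-a", now=10.0)
print(m.available(now=10.0))  # ['node-a']
print(m.failed(now=10.0))     # ['node-b']
```

A real implementation would of course run this detection inside the operating system or middleware, so that applications never see the membership table at all; the point of the sketch is only how little the programmer should have to do.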
Transparency is still being neglected. The problem of how to make the cluster appear to programmers as a single very powerful computer, i.e., to make the cluster transparent, is still open. Programmers must still place parallel processes on selected computers and must know where those processes execute. Current attempts by network operating systems or middleware at managing parallel application processes and computational resources are limited to basic process and communication management, and even that is not transparent. Parallelism management, i.e., the management of parallel processes and cluster resources to enable transparency and ease of use, is not provided.
Clusters, because they are constructed from commodity PCs, workstations and fast networks, exhibit an increased probability of (partial) failure during the execution of parallel applications. Servers that run on the cluster's computers could fail unexpectedly, eliminating relevant services or even whole computers (e.g., Web users could be deprived of services). Programmers must restart their applications from scratch or provide code to detect such faults and perform recovery. To make clusters suitable for long-running parallel execution, support should be provided for automatic fault tolerance.
Checkpointing and rollback recovery are very effective techniques for tolerating faults and avoiding total computational loss. Two approaches to checkpointing should be considered: (i) provide the support by a run-time library; or (ii) provide the support by an operating system. Many checkpointing algorithms have been developed for parallel systems. Several checkpointing libraries and systems have been implemented. However, checkpointing libraries often need to be ported prior to their use, and applications are subject to restrictions and must also be recompiled or relinked.
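The library approach (i) can be made concrete with a minimal, language-level sketch: application state is serialised periodically, and a restarted run resumes from the last checkpoint rather than from scratch. This deliberately ignores the hard parts that real checkpointing systems must handle (multiple processes, in-flight messages, open files); the file name, state layout and checkpoint interval are invented for the example.

```python
import os
import pickle
import tempfile


def checkpoint(state, path):
    """Write the state atomically: a crash mid-write must not corrupt
    the previous checkpoint, so write to a temp file and rename."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def restore(path, default):
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return default


# A long-running loop that survives a restart: after a crash, rerunning
# this script resumes from the last checkpointed iteration.
ckpt = os.path.join(tempfile.gettempdir(), "demo.ckpt")
state = restore(ckpt, {"i": 0, "total": 0})
while state["i"] < 100:
    state["total"] += state["i"]
    state["i"] += 1
    if state["i"] % 10 == 0:  # checkpoint every 10 iterations
        checkpoint(state, ckpt)
print(state["total"])  # sum(range(100)) == 4950
os.remove(ckpt)
```

The restrictions mentioned above follow directly from this structure: the library can only save what the application explicitly hands it (or what it can capture by relinking against the process image), which is why recompilation or relinking is usually required.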
Much of the research in Distributed Shared Memory (DSM) has concentrated on improving the performance of the systems rather than pursuing a comprehensive approach. In general, DSM systems ignore the fault-tolerance issue or maintain that it should be provided and handled by the underlying operating system. However, operating systems provide practically no support for fault tolerance. Some publications show that checkpointing is used in a number of middleware DSM systems for fault tolerance; however, transparency is not provided.
To substantiate these observations that availability, parallelism management, transparency and fault tolerance are in their infancy, it is useful to examine some representative parallel programming tools, DSM systems, extended network operating systems and cluster operating systems.
PVM only supports adding/deleting computers to/from a virtual computer, spawning and killing of tasks dynamically and the detection of faults. More complex process management (e.g., load balancing) requires programmers to implement their own resource manager or use third party software.
Fail-Safe PVM provides checkpointing but requires the application developer to have access to the PVM infrastructure. Checkpointing mechanisms in Condor were originally developed for supporting process migration, but can be used to periodically checkpoint a single (non-communicating) process for reliability. Furthermore, the application must first be linked with a checkpointing library, requiring the original source of the application to be available for recompilation. Restoration of a failed process is also not completely transparent.
The most significant and advanced DSM systems, Munin and TreadMarks, have neglected ease of use, availability, transparency and fault tolerance. Programmers start an application by defining which computers are to be used, creating processes on each computer, initialising shared data on each computer and creating synchronisation barriers. This manual approach places a significant load upon programmers and leads to load imbalance and resulting performance degradation.
The Stardust environment provides process migration for balancing load and checkpointing for fault tolerance. The parallel application programmer must follow an execution model in which all processes of the application synchronise regularly using barriers. Only when all processes are synchronised can processes be migrated or checkpoints be taken. Hence, process migration and checkpointing are non-preemptive and user-driven. Furthermore, only the shared memory regions are checkpointed, requiring all of the application's state to be stored in those regions.
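This barrier-driven execution model, in which the shared regions are saved only when every process has reached a barrier and the application is therefore quiescent, can be sketched with threads standing in for processes. The counter workload and the checkpoint list are invented for the example, and a real system would write the checkpoints to stable storage rather than keep them in memory.

```python
import copy
import threading

shared = {"counter": 0}  # stands in for the shared memory region
lock = threading.Lock()
checkpoints = []


def take_checkpoint():
    # Barrier action: executed once per barrier crossing, while all
    # workers are still blocked, so the shared region cannot change
    # under the checkpointer.
    checkpoints.append(copy.deepcopy(shared))


NWORKERS, STEPS = 4, 3
barrier = threading.Barrier(NWORKERS, action=take_checkpoint)


def worker():
    for _ in range(STEPS):
        with lock:
            shared["counter"] += 1
        barrier.wait()  # checkpoints happen only at barriers


threads = [threading.Thread(target=worker) for _ in range(NWORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print([c["counter"] for c in checkpoints])  # [4, 8, 12]
```

The limitation criticised above is visible here: a crash between barriers loses all work since the previous barrier, and any state a worker keeps outside `shared` (e.g., in local variables) is simply not recoverable.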
Beowulf has achieved some results in parallelism management by exploiting distributed process space (BPROC) to manage parallel processes. Processes can be started on remote computers if the manual logon operation into that remote computer was completed successfully. Starting processes in Beowulf is only done sequentially, although this is hidden from the user by PVM and MPI. Another weakness of Beowulf is that BPROC does not address transparent process migration, resource allocation or load balancing.
The only parallelism management services of the Berkeley NOW are those that allow a process to be created on any computer of a cluster, semi-transparent start of parallel processes on multiple nodes (how those nodes are selected is not disclosed), barriers, co-scheduling, and the MPI standard. Load balancing and process migration was identified as desirable but were not implemented. The enhancements to the operating system have been in the form of a global operating system layer (GLUnix) to provide network wide process, file and virtual memory management. The services of GLUnix are built as middleware.
MOSIX demonstrates good results in parallelism management. This system is an extension and modification of Linux that produces a cluster operating system. It provides enhanced and transparent communication and scheduling services within the kernel, and employs PVM to provide high-level parallelism support. MOSIX is based on the Unique Home-Node (UHN) model, in which all user processes appear to run on the local computer regardless of their physical location. MOSIX provides dynamic load balancing using transparent process migration and load collection, but relies on PVM to perform the initial placement of processes. The most recent work in MOSIX concentrates on job assignment that minimises execution costs, taking into account two major cluster resources: processors and memory.
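A memory-aware placement rule of the kind this line of work argues for can be caricatured in a few lines: nodes without enough free memory are ruled out first, since paging would dominate execution time, and the least-loaded remaining node wins. The node table and the cost rule below are invented for the example and are far simpler than MOSIX's actual assignment algorithms.

```python
def place(job_mem, nodes):
    """Toy memory-aware placement: discard nodes that would have to
    page, then pick the least-loaded survivor; None if no node fits."""
    candidates = [n for n in nodes if n["free_mem"] >= job_mem]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n["load"])["name"]


nodes = [
    {"name": "n1", "load": 0.2, "free_mem": 256},   # lightly loaded, little memory
    {"name": "n2", "load": 0.9, "free_mem": 4096},  # busy, plenty of memory
    {"name": "n3", "load": 0.5, "free_mem": 2048},
]
print(place(1024, nodes))  # 'n3': n1 lacks memory, n2 is the busiest
print(place(100, nodes))   # 'n1': everything fits, so least load wins
```

A pure CPU-load rule would have chosen n1 for the 1024 MB job and forced it to page, which is precisely the effect that considering both resources avoids.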
Solaris MC is an enhancement of the Solaris system that has been built to provide a SSI. Solaris MC includes global file system, global process management and global network support. However, availability is not provided and transparency and ease of use have been neglected.
The main aim of the Gobelins project is to manage all resources globally and provide an SSI. Gobelins has been developed at the kernel level of Linux, with only small modifications made to the core kernel. Gobelins exploits the concept of a container: a set of pages that can be distributed, accessed and shared regardless of location. The container concept is proposed to unify global resource management; containers appear to be a form of DSM. The current version of Gobelins does not support process migration, availability or fault tolerance.
The GENESIS cluster operating system was built from scratch to support parallel processing. It addresses ease of use and parallelism management. This system provides to application developers both message passing and DSM. Process migration, group communication, and transparent local and remote process creation and duplication are offered. GENESIS provides efficient checkpointing, performed concurrently with the continued execution of the application. This system shows that it is possible to improve the performance of parallel applications by providing group process instantiation and migration; and by making clusters easy to use by providing automatic instantiation of parallel processes and automatic initialisation of shared data.
In summary, although the need for availability, parallelism management, transparency and fault tolerance in clusters has been strongly advocated, availability and transparency are still not provided and fault tolerance is poor; no SSI is offered. While there are a variety of techniques for detecting and correcting faults, their implementation is difficult. For an application developer, using checkpointing systems is hard: a library must be ported, and applications, subject to a number of restrictions imposed by the library, must be recompiled or relinked. Hence, only a few developers employ checkpointing in their applications. A number of serious and common problems exist with PVM, Paralex, Beowulf, NOW, MOSIX, Solaris MC, Gobelins and GENESIS. Firstly, these systems lack services to handle the addition, removal and reconfiguration of system resources. Secondly, they do not provide complete transparency, and a virtual machine is not maintained automatically and dynamically. Thirdly, these systems (with the exception of MOSIX and GENESIS) do not provide load balancing. Finally, these systems do not provide fault tolerance to support parallel processing. Hence, there is neither comprehensive middleware nor an operating system that forms an SSI cluster.
The development of SSI clusters faces several challenges. Research is required to answer the following major questions: (i) how can high availability, i.e., the addition, removal and reorganisation of cluster resources, be provided? (ii) how can parallelism management be supported? (iii) how can transparency be provided to all programmers of a cluster, relieving them of operating-system-type activities, which are very time consuming, error prone and largely irrelevant to their applications? (iv) what methods and systems can automatically and transparently provide fault tolerance to parallel applications using message passing and shared memory, while taking into account execution performance and the costs of fault tolerance?
The question is whether it is worth carrying out research and development to answer these questions. My answer is yes, for the following reasons. Firstly, wide use of clusters will move parallel processing into mainstream computing, owing to their high performance-to-cost ratio and support for availability, parallelism management, transparency and fault tolerance. Secondly, a body of knowledge will be built on how to design SSI operating systems that provide efficient parallel execution of applications exploiting both message passing and DSM, and that support an SSI on clusters. Thirdly, an SSI cluster will be a serious competitor to supercomputers, a convincing reason for cluster manufacturers to invest in the development of the proposed class of operating systems. Lastly, SSI operating systems create a very powerful, transparent, user-friendly and reliable parallel computing system that appears to programmers as a single very powerful computer. Programmers will not require knowledge and skills beyond shared memory or message passing, which will allow even small companies to employ parallel processing. SSI clusters have the potential to move parallel processing into mainstream computing, and it is the responsibility of the cluster computing community to make it happen.
Andrzej M. Goscinski