How to Approach the Virtual Shared Memory Paradigm

eva Kühn

1. Introduction

This special issue aims to direct attention to the great potential offered by virtual shared memory (VSM) on distributed hardware architectures. Industry has not yet exploited the advantages offered by VSM-based tools, including middleware, programming languages, and systems. This is particularly surprising for two reasons: firstly, much research in this area has been published ([2] is a comprehensive collection of references on VSM). Secondly, existing client/server-based technologies obviously do not fully satisfy all needs of robust distributed applications. Asymmetric application architectures predominate in the market. They do not exhibit the advantages offered by distribution: better performance, reliability, and availability through replication of resources and services. In order to overcome these deficiencies, multi-tier architectures have been introduced to avoid the server bottleneck.

Is it only a matter of time and experience before VSM replaces the message-passing-oriented client/server style which, like the relational model, might succeed as the conceptually superior model, although initially considered too inefficient? Another perspective would be a sensible co-existence of both paradigms, where each enhances the other. VSM could help to overcome the replication deficiencies, whereas distributed object standards like CORBA could offer the necessary vendor interoperability.

Performance is clearly a necessary prerequisite for the acceptance of VSM. However, it seems that this is a technical issue that can be or has already been met by existing implementations. What is really necessary for a rapid dissemination of VSM is an awareness of its benefits. Education in the field of parallel and distributed systems is mainly devoted to the message-passing model. The design of concurrent software is based on a send-and-receive-oriented style of thinking. All data required by two concurrently running processes must be sent explicitly each time they are needed or changed. Distributed-object systems, which offer the best possible abstraction of the remote procedure call wrapped in method invocation of remote objects, follow this kind of thinking.

Although much research has been done in the area of VSM, not so many implementations are freely or commercially available. This could be the main reason why articles that communicate non-academic application experiences with VSM are rare. To convince industry, we need a sound argumentation of the following principles: (1) The VSM model is conceptually simpler than the message-passing model and thus reduces development time. This argumentation includes code length and elegance of solutions. (2) Solutions with VSM can be technically superior with respect to caching behaviour, scalability, flexibility, and fault-tolerance. Clearly, this depends on the particular application requirements, but the typical situations for VSM must be pointed out.

The discussion of the potential of VSM is part of this introduction. It is based on experiences with the use of a young commercial VSM system in industrial application scenarios. These situations reflect typical, recurring design patterns. The objective is an argumentation based on properties inherent in any VSM system, without going into the details of any particular implementation. Two basic design patterns are identified and their application in a typical industrial scenario, the integration of multi-database systems, is shown.

This issue contains selections from three articles published at the Minitrack on Virtual Shared Memory for Distributed Architectures of the 1998 Hawaii International Conference on System Sciences (HICSS-31). They cover a general overview of the field, an application of VSM in the operating system area, and the issues of graphical debugging tools for VSM.

The first article in this issue, by R. Hyde and B. Fleisch, gives an overview of the current state of the art in the area of VSM. The authors argue that a basic problem with existing distributed-object systems, like DCOM and CORBA, is that access to data fields of a remote object is not entirely transparent. Moreover, these systems do not cache data fields locally. Data-field access is costly, because it requires an expensive remote procedure call, even if the same data fields have been fetched before. As efficient caching is one of the strengths of VSM, the authors propose a combination of both approaches. So-called virtual distributed objects couple data-field caching with remote procedure call capabilities. Existing distributed-object systems could be extended in this way without the need to design new systems.

The second article, by P. Kostkova and T. Wilkinson, shows the application of VSM for flexible resource management in operating systems. A resource manager for BITS (Component-Based Operating System) is presented, which uses the tuple space paradigm to represent server resources. Resource negotiation is separated from the operating system kernel. The available resources can be reflected dynamically and efficiently via the tuple space coordination data structures. Application requests are mapped at run-time, which allows for dynamic component selection, reconfiguration, and exchange. The virtual shared tuple space resource manager, termed MAGNET, is a highly innovative contribution to flexible operating system designs and impressively argues the advantages of using VSM in this area.

The third article, by J. Lumpp, K. Sivakumar, C. Diaz, and J. Griffioen, presents a graphical debugging tool for the Unify VSM system. Unify aims to reduce communication effort by means of a relaxed consistency model, which is implemented efficiently via multicast communication. The graphical debugger, Xunify, makes the VSM model understandable for the programmer and gives information about the real costs of an application. VSM provides a high level of abstraction from the underlying hardware and distribution, which makes programming very easy. However, this makes the evaluation of application performance difficult. Xunify provides three levels of monitoring, which extend from programming-model-level monitoring, through logical-message-level monitoring to transport-protocol-level monitoring. Xunify demonstrates the importance of graphical representation and monitoring tools for the practical acceptance of VSM.

2. Potential of VSM Systems

VSM offers a conceptually higher level of abstraction than message-passing systems and makes the development of robust distributed applications easy. The main benefit is that VSM systems relieve the application programmer of caching and replication issues. They allow the design of symmetrical application architectures, thus avoiding the client/server bottleneck.

2.1 Caching and Replication

It is the task of a VSM system to provide optimal replication and caching of the shared data segments. These segments can range from entire memory pages, a granularity influenced by hardware concepts, down to smaller entities such as single data objects, which offer a finer granularity of sharing and can be tuned better to application needs, because changes do not affect an entire page. The number of communications required by the implementation to simulate the VSM on distributed hardware can be reduced by pre-fetching techniques. The VSM system knows which data segments have been changed and, according to the chosen consistency model, will trigger the necessary communications. Segments that have not been changed cause no communication. The technology a VSM system provides to minimise bandwidth determines its strength. VSM thus relieves the application programmer of the tasks of caching and replication.
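The bookkeeping idea behind this, tracking which segments have changed so that only changed data cause synchronisation traffic, can be sketched in a few lines of Python. The `SharedSegment` class and `synchronise` function below are hypothetical illustrations of the concept, not the API of any real VSM system.

```python
class SharedSegment:
    """A shared data segment with a dirty flag: the (imaginary) VSM
    runtime ships only dirty segments to the replicas."""
    def __init__(self, value=None):
        self.value = value
        self.dirty = False

    def write(self, value):
        self.value = value
        self.dirty = True      # mark the segment as changed


def synchronise(segments, send):
    """Flush phase: propagate only the dirty segments, clear the flags,
    and return the number of messages that were actually needed."""
    messages = 0
    for seg in segments:
        if seg.dirty:
            send(seg.value)    # unchanged segments cause no communication
            seg.dirty = False
            messages += 1
    return messages
```

A second `synchronise` call after no further writes sends nothing, which is exactly the saving that message-passing systems cannot provide automatically.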

In contrast, distributed-object systems decompose an application into components which communicate via remote method calls. Input data are sent at call time and output data are sent when the task has been completed by the server. Data are always copied by value, which makes real sharing impossible. Even if a client requests the same data twice, communication will take place both times. Caching and pre-fetching in a multi-tier application must be done explicitly by the system architect. For example, a distributed object could be implemented to represent a database cache at the server site. Data that have been seen by previous retrievals are held in the cache so that subsequent requests can be satisfied without issuing an expensive database query. Similar possibilities could be created at the client's site. But the burden of keeping caches up-to-date lies with the application programmer. Caching is a dynamic concept that aims to reduce network traffic, whereas replication also provides fault-tolerance: a predefined policy must keep all replicas consistent while providing access to the shared data for all concurrent processes. Automatic support of data replication is almost impossible with distributed-object systems, because here the reasoning takes place at call level and not at data level, which fits well with the concept of object orientation.

Probably the worst solution has been taken by the Web's CGI interface mechanism. As the servers here are stateless, in the dialogue between client and server, so-called hidden fields serve to simulate context. Data that have been entered into a form by the user are sent back from server to client and then again to the server, so that the user is given the impression that the server remembers these data [13]. This communication-enabling mechanism has the reverse effect of caching.

2.2 Symmetric Application Architectures

A discussion of why the message-passing paradigm leads to asymmetric client/server architectures, whereas VSM supports symmetric ones, can be found in [9]. The following is a brief summary of this argumentation. As stated above, replication cannot be supported by distributed-object systems because the implementation of an object is a black box from the point of view of the object broker. The broker cannot know which internal states have been changed by the remote method call and thus must either replicate all object data or forward all actions on the object, including all input and output data, to the remote site. The coordination of concurrent accesses to the distributed object replicas must be implemented with extra synchronisation mechanisms, such as semaphores.

Therefore, in the message-passing approach, a single server holds the relevant data and serves all clients' requests on these data. This maps naturally to remote method invocation. The main advantage is the administration of one single server only; the drawback is a bottleneck with respect to performance, availability, and reliability as the application grows. Multi-tier architectures can alleviate this problem. The application logic is moved from the client to so-called middle tiers, which also take over tasks that had originally been done by the single server; but the data remain on one centralised server. Such architectures are nowadays considered the best solutions with client/server technologies [12]. However, they introduce many layers, which may increase the number of communication steps between client and server and which must be coordinated by the programmer. The server bottleneck with regard to fault tolerance cannot be avoided this way: replicating the server would require extra programming effort.

As replication is a basic property of VSM, application architectures tend to be symmetric. No differentiation between client and server roles is necessary, albeit possible. All participating processes have equal rights and communicate via shared data segments. In its extreme form, every process has access to all data, which form a kind of blackboard. The VSM is fully responsible for the performance and reliability of the application. This means that the application programmer need not be concerned with such issues, and that processes, which clearly can also play the role of servers, can be slim. The abstraction offered by VSM is the perspective that the memory of the local computer is extended to include the memory of the other sites. The underlying hardware is fully hidden, and the application logic is not hampered by the addition of a new dimension of message-sending and -receiving. The designer starts with the definition of a so-called coordination data structure, and then defines how the concurrently running processes use it. Symmetric application architectures make scale-up easy and, in summary, enable lean [17] software in the area of distributed and parallel processing.

2.3 VSM Layer

Middleware is a piece of software that hides the heterogeneity of the distributed hardware [11]. Location, migration, access, and representation transparency are minimum requirements on middleware. As argued above, replication transparency and the synchronisation of concurrent data accesses (termed transaction transparency) are inherent properties of every VSM model and cannot easily be supported by the message-passing paradigm. (Here we refer to transactions that coordinate access at data level, whereas CORBA 2.0 suggests transactions that coordinate objects offering two-phase commit properties at application level.) The support of failure transparency depends on the implementation. Orbix is an implementation of CORBA that provides fault-tolerance through the use of a reliable broadcasting mechanism. Many VSM systems provide highly reliable replication strategies. Finally, load balancing transparency is an issue that is probably equally expensive to support on both common distributed-object systems and VSM.

The general requirement on VSM middleware is to synchronise the shared data segments through adequate communication protocols based on either point-to-point messages or broadcasting, supporting an active or a passive replication technique. Systems like LINDA [5] and TreadMarks [1] try to get away from the memory-page-oriented model (which is, for example, employed in the MIRAGE system [6]) towards finer-grained segments like tuples or objects. LINDA also supports linguistic constructs to model concurrency by supporting the notion of processes that operate on the shared data. CORSO (Coordinated Shared Objects [9]) is middleware with the above-mentioned properties, and will be used as a typical VSM representative in the subsequent discussion, because it is the source of our current experience with VSM in commercial applications.

No open standard has yet been adopted in the VSM area, which is good from the point of view of research. On the other hand, this implies that every implementation has its peculiarities and is individual in its interpretation of the VSM paradigm as a programming model. This may extend from the design of completely new languages to the design of bindings to existing languages and their interaction with the middleware. Briefly summarised, the basic characteristics of CORSO are that it uses constant and variable communication data objects as the unit for sharing, where the former can be written only once and the latter can receive an unlimited number of values; it provides a variety of tuneable distribution strategies implementing different replication behaviours; objects have a unique identifier but no global names, which makes automatic garbage collection possible; inter- and intra-process concurrency is modelled by recoverable processes which can be passed object references in their argument lists after which these objects become shared between the process and its caller; communication objects can be composed of others, building up complex data structures; objects can be read and written only within transactions, which guarantees their persistency. Although the following attempt to identify typical VSM application patterns is based on one existing system, it should be possible to apply it to VSM middleware in general.
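The distinction between constant (write-once) and variable (multi-write) communication objects can be illustrated with a minimal Python sketch. The class names, the thread-based synchronisation, and the conflict-as-exception behaviour are assumptions made purely for illustration; they do not reflect the actual CORSO interface.

```python
import threading


class ConstantObject:
    """Write-once communication object: a second write is a conflict.
    Illustrative sketch only, not the real CORSO API."""
    def __init__(self):
        self._cond = threading.Condition()
        self._defined = False
        self._value = None

    def write(self, value):
        with self._cond:
            if self._defined:
                raise RuntimeError("conflict: object already defined")
            self._value, self._defined = value, True
            self._cond.notify_all()

    def read(self):
        # synchronous read: blocks until the object has received a value
        with self._cond:
            while not self._defined:
                self._cond.wait()
            return self._value


class VariableObject(ConstantObject):
    """May receive an unlimited number of values: writes never conflict."""
    def write(self, value):
        with self._cond:
            self._value, self._defined = value, True
            self._cond.notify_all()
```

The blocking `read` models the data-driven synchronous read used by the coordination patterns in Section 3.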

3. Coordination Design Patterns

In computer science, the notion of design patterns has been used in object-oriented development [14, 15]. There is no commonly agreed definition of a pattern; the term basically refers to a recurring situation in which a problem has been solved. The experience of solving this problem in the given or a similar context should be communicated with the design pattern. A prerequisite is the willingness of developers to reflect upon their work and to find a common terminology in which to communicate their experiences. The aim is the re-use of good designs. Frameworks are compositions of design patterns that collaborate to solve a larger problem class.

Design patterns for VSM describe the coordination of at least two concurrently running processes that communicate and synchronise via shared data segments. In [16], we have termed them coordination design patterns, and some basic ones are described there in detail. They basically reflect experiences with our VSM system at the TU-Vienna since 1995. Although we cannot yet claim to have had enough experience, we have learnt that some of the patterns detected in rather academic problems, like prime sieving, also appear as basic design blocks in commercial situations. And, surprisingly, very soon other significant patterns began to recur. The following is an attempt to describe, still in an informal way, first experiences with the most important patterns we have identified so far. A great deal of work is still needed to identify more patterns and to find a suitable representation form. Applications do not yet exploit all the possibilities offered by parallel and distributed processing. Most distributed applications focus on the down/right-sizing of legacy applications to distributed hardware. However, new areas, such as distributed CSCW, groupware, etc., have not yet been fully exploited, and more new patterns are expected to derive from these.

The starting point for designing an application based on the VSM paradigm is to define a suitable coordination data structure. In the message-passing approach, it is necessary to define when and what two processes have to communicate, either synchronously or asynchronously. With VSM, the entire interaction of processes is carried out via the shared data segments. Although a highly efficient VSM implementation will not require it, it is important to make clear which data are needed by which process, not least to gain a good understanding of the problem: who must have access to them, who will read/write them, and how frequently?

The agreed-upon data segments can be compared to dedicated communication channels between the corresponding parties. These data segments guarantee a reliable and serialised delivery of the data. In contrast to the stream connections used for this purpose in the message-passing paradigm, any kind of structure is possible, and there can be any number of them, not limited by operating system restrictions such as file handles. Such a coordination structure is established by creating or allocating the required data segments and by giving the relevant processes adequate access rights to all of them. A coordination structure manages all changes efficiently, and no extra effort is needed, e.g., for remote object location, establishment of connections, or sending of input and result data.

The request-answer pattern below illustrates how results can differ if the designer cannot free himself from a message-passing-oriented style of thinking. This pattern contrasts the typical VSM solution with a solution that can be realised with a VSM system, but misuses it in an RPC-oriented style.

Whether, and to what extent, the programmer must know about the implementation of the VSM system remains an open question. The authors of [10] argue that it is important to exhibit the message behaviour that the VSM conceptually hides from the programmer, because this is the only way to obtain realistic information about performance problems in an application.

3.1 Producer/Consumer

This was the first pattern we found, and it turned out to be one of the basic building blocks when programming with VSM.

Problem Description. The producer/consumer pattern shows cooperation as well as competition of concurrent processes on common (bounded) resources. Some processes produce information which others consume. The data flow is unidirectional: from one or more producers to one or more consumers.

All data produced are relevant and must not be missed by the consumers. They represent a history of events/values. After all consumers have observed all values, the history can be cleared.

In the simple variant, each consumer observes all information produced, while in the full consumer variant, each information entry is consumed exclusively by exactly one consumer. In the bounded-resources variant, the buffer space is limited to N information entries.

Solution. The unbounded variant uses a coordination structure termed stream, which is composed of list cells, where the first element represents the information entry, and the second element is another list cell. Assuming that data segments are uniquely identifiable, the second part of a list cell contains such an identifier. The root of the stream is also an identifier, which is passed to all processes so that the segment representing the first list cell becomes shared by all of them. Processes that are created at a later time are also passed this reference.

Producer processes compete to gain access to the current stream tail cell. Concurrent access is carried out via the VSM, in our case by means of transactions. Only one process can succeed in writing to the segment referenced by the tail cell; all others will observe a conflict and need to refresh their view of the current tail, re-attempting the write on the segment referenced by the next tail of the list. Using constant communication objects, the conflict occurs automatically if a write to a previously defined object is issued. With variable communication objects, a test must be performed to determine whether the object is still undefined before it is written.
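The competition for the tail cell can be sketched with a thread-based stand-in for the VSM. The `Cell` class below, with its conflict-on-redefinition behaviour, is an illustrative invention, not part of any real system: `try_define` plays the role of a transactional write to a constant communication object.

```python
import threading


class Cell:
    """Write-once list cell: (entry, next cell). A producer that loses
    the race to define a cell observes a conflict and moves on."""
    def __init__(self):
        self._cond = threading.Condition()
        self.entry = None
        self.next = None
        self.defined = False

    def try_define(self, entry):
        with self._cond:
            if self.defined:
                return False          # conflict: another producer was faster
            self.entry = entry
            self.next = Cell()        # fresh, still-undefined tail cell
            self.defined = True
            self._cond.notify_all()
            return True

    def wait_defined(self):
        # synchronous read: blocks until the cell has been written
        with self._cond:
            while not self.defined:
                self._cond.wait()
            return self.entry, self.next


def produce(root, entry):
    """Append one entry: walk towards the tail, retrying on conflicts."""
    cell = root
    while not cell.try_define(entry):
        _, cell = cell.wait_defined()   # refresh view of the current tail


def observe(root, n):
    """Read the first n entries; every observer sees the same order."""
    out, cell = [], root
    for _ in range(n):
        entry, cell = cell.wait_defined()
        out.append(entry)
    return out
```

Because every cell is written exactly once, all observers that traverse the stream from the root see the same history in the same order.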

Observer processes synchronously read the root segment, then all subsequent segments referenced by the tail cells. The writing of new information by one of the concurrently running producer processes will awaken all observers in a data-driven way. Observers need not be equally fast and can be added at any time. As the stream represents the full history of all data, all will observe the same information in the same order.

The full consumer variant can be based on the same coordination structure. Only the first list cell element must be modified to contain, in addition to the produced information, one flag indicating whether the entry has been consumed or not. In addition to the synchronous read, a consumer must compete to write the flag associated with the entry. Analogously to the production of information, concurrent access is carried out via the shared memory. One consumer will obtain the entry, all others will observe a conflict and need to continue reading the stream.
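The flag-based competition of the full consumer variant can be sketched as follows; `Entry.try_consume` is a hypothetical stand-in for a transactional write of the consumed flag, not an API of any real VSM system.

```python
import threading


class Entry:
    """A stream entry extended with a write-once 'consumed' flag.
    Consumers compete to set the flag; exactly one wins each entry."""
    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()
        self.consumed = False

    def try_consume(self):
        with self._lock:
            if self.consumed:
                return None            # conflict: another consumer took it
            self.consumed = True
            return self.value


def consume_all(entries, taken, taken_lock):
    """Worker: scan the stream, taking whatever is still unconsumed."""
    for e in entries:
        v = e.try_consume()
        if v is not None:
            with taken_lock:
                taken.append(v)
```

However the entries are divided among competing workers, each entry is delivered exactly once.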

The described solution has one drawback: the stream grows infinitely. To avoid space problems, CORSO offers techniques that enable early garbage collection, as described in [8]: every process restarts after a certain number of items have been produced/consumed, with the current list tail as its root data segment. This way the process gives up its references to older data segments. Another possibility is to restrict the buffer to N places. This requires a modification of the stream coordination structure in which, for example, N list cells form a ring buffer; this can only be realised with variable communication objects and assuming full consumer processes. Concurrent access is managed via the VSM as before; however, extra synchronisation becomes necessary if the buffer is full. In this case, all producers will observe a conflict. The list cell structure of the full consumer variant can be re-used: if a producer process meets a cell containing an already consumed information entry, it may overwrite it with a new entry plus a new flag.
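A bounded ring buffer with N reusable cells might look like the following sketch, in which a `None` cell plays the role of an already consumed entry (so the entries themselves must not be `None`), and the blocking behaviour stands in for the extra synchronisation described above. All names are illustrative.

```python
import threading


class RingBuffer:
    """Bounded variant: N reusable cells, each acting as a variable
    object plus a consumed flag. Producers block while the buffer is
    full; consumers block while it is empty."""
    def __init__(self, n):
        self._cond = threading.Condition()
        self._cells = [None] * n        # None marks a consumed/empty cell
        self._put = 0
        self._get = 0
        self._n = n

    def produce(self, item):
        with self._cond:
            while self._cells[self._put % self._n] is not None:
                self._cond.wait()       # full: wait until a cell is consumed
            self._cells[self._put % self._n] = item
            self._put += 1
            self._cond.notify_all()

    def consume(self):
        with self._cond:
            while self._cells[self._get % self._n] is None:
                self._cond.wait()       # empty: wait for production
            item = self._cells[self._get % self._n]
            self._cells[self._get % self._n] = None   # mark as consumed
            self._get += 1
            self._cond.notify_all()
            return item
```

Overwriting a consumed cell with a fresh entry is exactly the re-use of list cells described in the text, here expressed as index arithmetic modulo N.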

Consequences. The pattern supports unlimited scale-up. The dynamic addition of new consumer and producer processes that can be distributed to remote sites is possible without the need to reprogram producer or observer/consumer processes. Fault-tolerance is given if the selected VSM provides persistent objects and recoverable worker processes.

Occurrences. The pattern was first used as a stand-alone solution to allow an unlimited number of processes to communicate and synchronise. Later, it occurred in academic problems like prime sieving and the sorting of integer numbers. Typical commercial applications are those where the ordering of the collected data is crucial: for example, distributed event logging, recording of measurements, retrieval of exchange rates, and distributed software installation.

Variations. The following variations of this pattern were observed:

  1. As described above, the buffer space can be assumed to be infinite or can be bound to N places.
  2. Items can be destined for a certain consumer only, or for a group of consumers. No others are allowed to see them. This can be achieved by adding tags to the produced items.
  3. The relationship between the producer and consumer processes can be 1:1, k:1, 1:m, and k:m; it depends on the population of the application with worker processes. Both types of concurrent worker processes can be added dynamically at the local or a remote site.
  4. The system is reactive and non-terminating, but the pattern can also be applied to limited production/consumption processes. In this case, a producer may write a meta item to the stream that indicates the end of the stream.
  5. Fast access to the most recent value in the common production stream (which, in the chosen coordination structure, is contained in the currently last list cell) can be provided. To avoid the reading of all cells between the root and the last tail cell, a second root segment can be administered, which is shared by all processes instead of the original root segment, and which contains a reference to the last list cell with the most current information. Processes that join later start with the current root and skip all previously provided information.
  6. Every producer can be equipped with a separate stream. All root segments must be shared as usual by all worker processes requiring them. Conflicts during production can be reduced in this way. Groups of consumers can be modelled which subscribe only to a subset of the available production streams.
3.2 Request/Answer

The second basic pattern described here implements a typical client/server scenario.

Problem Description. The request/answer pattern synchronises concurrent requests of one or more clients at a server. Data flow is bi-directional: the request is communicated from the client to the server, triggers some computation there, and finally the answer is communicated back.

Clients are either located at the same site as the server, or remotely.

Solution. Let us first briefly show a possible solution that applies the message-oriented style to VSM. Afterwards, another, more efficient, solution is shown that fully meets the VSM paradigm.

An ad-hoc mapping of the problem is to equip each client process with one data segment. Every request causes a process invocation at the server site, passing it a reference to this data segment. The request is contained in the segment and, after its fulfilment, the answer is assigned to this segment. This solution re-implements the RPC mechanism by means of VSM.

A better solution is the following: imagine a coordination structure composed of data segment pairs. For each client there exists a segment with the meaning request and another one with the meaning result. The server and all clients are concurrent processes, with the client having access only to its two data segments and the server having access to all segment pairs of all clients. This separates the write accesses, which is usually advantageous for VSM performance. As segments are written many times, we realise them with variable communication objects.

A request is initiated by the client by writing to its request data segment. Then the client issues a synchronous read on its result data segment. The server process performs a synchronous wait on all request data segments it shares, assuming that some linguistic construct that avoids busy waiting exists for this purpose. In our case, an alt-wait (which waits for one alternative to fire) was chosen. As soon as one segment receives its next value, the server process is awakened and computes a response, which is written into the result data segment of the corresponding client. This in turn awakens the client process.
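The segment-pair structure can be sketched with one request slot and one result slot per client. The polling loop below is a crude stand-in for the alt-wait construct, which in a real system would avoid this busy waiting; the `Pair`, `client`, and `server` names are illustrative assumptions, not part of any real VSM API.

```python
import queue
import threading


class Pair:
    """Per-client (request, result) segment pair, modelled as two
    single-slot queues."""
    def __init__(self):
        self.request = queue.Queue(maxsize=1)
        self.result = queue.Queue(maxsize=1)


def client(pair, payload):
    pair.request.put(payload)           # write the request segment ...
    return pair.result.get()            # ... then synchronously read the result


def server(pairs, compute, n_requests):
    """Poor man's alt-wait: poll all shared request segments until
    n_requests have been answered. A real alt-wait would block instead."""
    served = 0
    while served < n_requests:
        for p in pairs:
            try:
                req = p.request.get_nowait()
            except queue.Empty:
                continue
            p.result.put(compute(req))  # awaken the waiting client
            served += 1
```

Note how the write accesses are separated: each client writes only its own request slot, and only the server writes the result slots.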

Consequences. The second solution turned out to be more elegant and efficient. To judge the efficiency of a solution, run-time performance can be measured, or the cost can be derived from a theoretical point of view. However, this requires knowledge about the implementation of the distribution strategy of the VSM system [7]. (In CORSO, the first solution would require 11 communication steps for a single request and the second only four. The reason is that a process call alone requires two messages for invocation and four more for cleaning up. In between, the server must gain exclusive access to the data segment, which is done by primary copy migration (three messages). The writing of the data requires two more messages. In contrast, using the communication structure of the second solution, only two value transfers are needed, each requiring two messages. The server possesses the primary copies of all result communication objects, and each client the primary copy of its request communication object. Thus, no extra effort is required to obtain exclusive access to data segments.) The comparison of the two solutions was very encouraging, as it confirmed that a design conforming to the virtual shared memory paradigm leads to better performance. In the following, only the second solution is considered.

The solution can scale up to an unlimited number of clients without modification. Only the server must be initialised to obtain access to all segment pairs. The static initialisation can easily be replaced by the dynamic addition of new client segment pairs: if the server possesses access to one more special request segment with the meaning registration, each new client will pass the references to its data segments there and thus become registered.

The distribution of servers and clients is fully transparent.

CORBA and DCOM solutions for this problem have been published which showed that performance decreased drastically as more clients were added, independently of whether they were local or remote, whereas the VSM solution performed even better with 5-10 clients than with one single client, and then degraded only slowly. We can therefore conclude that the VSM approach can outperform the distributed-object approach under higher concurrency.

Occurrences. The pattern occurs in all client/server situations, including replicated and distributed servers: for example, clients that retrieve information such as prices, values, or states, or clients that issue an order (a flight reservation, the booking of a hotel room).

Variations

  1. Result data segments can be named artificially, and many clients may use the same result segments. In the reference to the result segment passed to the server, the client can indicate into which of the pools of common result segments the answer shall be put. Thus, if the client repeatedly uses the same result segment and if the data contained in this segment are well structured (composed of other segments), subsequent calls of the client can profit from data that have been retrieved before. Analogously, different clients located at the same site can take advantage of result data that were cached by another client's request.
  2. If the server becomes overloaded because of increasing client numbers, it can easily be distributed. The coordination data structure can remain unchanged: a second server process is started, after which each server needs access to only half of the client data segment pairs. In other words, the servers divide the clients into two parts, and each server services the requests of one part. Accordingly, the number of servers can easily be increased to three or more. Another possibility is that all servers work on the same pool of requests, with the first available server fulfilling the request. In neither case need the client be aware of the number of server processes. With the second variant, fault tolerance through function replication can be achieved: if one server fails, its task can be taken over by another.
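The second variant — all servers draining one shared request pool — can be sketched with a thread-safe queue standing in for the shared stream of request segments. This is a minimal sketch under stated assumptions: the dictionary plays the role of the result segments, and the upper-casing "work" is a placeholder for real request processing.

```python
import queue
import threading

requests = queue.Queue()   # stands in for the shared pool of request segments
results = {}               # client id -> answer (stands in for result segments)
results_lock = threading.Lock()

def server(pool: queue.Queue) -> None:
    """First available server takes the next request; clients never
    know how many server processes are running."""
    while True:
        client_id, payload = pool.get()
        if client_id is None:          # shutdown token
            pool.task_done()
            return
        with results_lock:
            results[client_id] = payload.upper()   # placeholder for real work
        pool.task_done()

# Two replicated servers draining the same pool; a third could be added freely.
workers = [threading.Thread(target=server, args=(requests,)) for _ in range(2)]
for w in workers:
    w.start()
for cid in range(5):
    requests.put((cid, f"request-{cid}"))
for _ in workers:                      # one shutdown token per server
    requests.put((None, None))
requests.join()
for w in workers:
    w.join()
print(results)
```

If one of the two server threads died, the other would keep draining the same pool — the function-replication fault tolerance described above.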
4. Application of VSM for Heterogeneous Databases

Heterogeneous database integration and replication [4] is one of the major application requirements in industry. Multi-database research has two dimensions: semantic data integration, and global transaction control. The first issue is concerned with semantic mapping of data from different databases, which can have different names, representations, scalings, etc. The second issue is concerned with the semantics and execution of global and distributed transactions and has to deal with networking and fault-tolerance.

In the following, global multidatabase transaction control will be shown, based upon a combination of our two basic coordination design patterns. We assume that the semantic data integration is either an integrated part of the application logic or that it is the responsibility of the user to specify the right query on the right database.

High throughput is achieved through concurrency. For simplicity, we assume that clients issue a request and wait for the answer before starting the next request.

VSM data segments are used to hold database data that are shared between different partners (proxy data segments) and to coordinate the transactions to be performed on the databases. Note that the analogy with multi-tier architectures is kept, but the architecture introduced by the use of VSM is fully symmetric (peer-to-peer).

4.1 Two-Tier Single Database Access

Clients want to perform queries on a remote database. They issue requests and receive result data. All clients are to be served concurrently.

The coordination structure is a combination of those used for the producer/consumer and request/answer patterns. It consists of an infinite stream, where each token (first part of the list cell) represents a request/answer segment pair.

Client processes act like producers: they append new cells with request segments to the stream, and pass a reference to the result segment where they expect to receive the answers. Requests can be data retrieval queries, but can also represent write requests (insert, delete, update) on the database.

One server process, termed multi-database interface (MDI), is needed that behaves like a consumer. It performs a synchronous wait for tokens on the stream and processes them by sending the query contained in the request token to the database. Communication with the database is executed via the standard API of the database (ODBC, JDBC). Result data are written to the result segment, which in turn will awaken the client process. The MDI process will typically be located at the database site. To gain higher concurrency, the MDI will only pick up the requests and start a new worker process for the real execution of the query at the database. Many worker processes can be active at the same time with concurrent access to the database.
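A minimal sketch of this structure, with a Python queue standing in for the infinite stream and a one-slot queue per answer segment (the blocking read models the client being awakened). The database call is faked; `run_query` is an illustrative placeholder, not a CORSO, ODBC, or JDBC API:

```python
import queue
import threading

stream = queue.Queue()   # the infinite stream of request/answer segment pairs

def run_query(sql: str) -> str:
    """Placeholder for the real database call via ODBC/JDBC."""
    return f"rows for: {sql}"

def worker(request: str, answer: queue.Queue) -> None:
    """One worker per request, so many queries can run concurrently."""
    answer.put(run_query(request))

def mdi(stream: queue.Queue) -> None:
    """The multi-database interface: consume tokens, hand each to a worker."""
    while True:
        token = stream.get()
        if token is None:
            return
        request, answer = token        # the request/answer segment pair
        threading.Thread(target=worker, args=(request, answer)).start()

threading.Thread(target=mdi, args=(stream,), daemon=True).start()

# A client acts like a producer: append a cell, then wait on the answer segment.
answer_segment = queue.Queue(maxsize=1)
stream.put(("SELECT price FROM flights", answer_segment))
result = answer_segment.get()          # blocking read wakes the client
print(result)
stream.put(None)
```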

4.2 Two-Tier Multiple Database Access

The step from single to multiple database access can be achieved in one of two ways:

  1. Each client also specifies the database concerned in each token it appends to the stream. There remains one single MDI process that dispatches the queries to the proper databases and collects the result data. It can be located at any site.
  2. For each database, a dedicated MDI process is provided. Each MDI possesses a separate stream known to all clients. The client selects the correct stream for issuing queries on a database.
4.3 Two-Tier Multi-Database Access

The difference from two-tier multiple database access is that a single client query can involve more than one database. The queries are not issued in isolation on each database, but can form global transactions on more than one database. Client requests can therefore also contain transaction control commands of the database, such as transaction begin, commit and abort.

The following modification of the coordination data structure is necessary: instead of one single request segment, a list of request/answer segment pairs is used. The requests in the list form the database commands of one transaction, such as begin, select, insert, and then commit or abort. Each request is accompanied by an answer segment with the role of a flag indicating whether this particular request was successful or not. The MDI worker will write to the answer flag segments after each database API call.

A client wanting to run a query on several databases must do the following:

  1. Generate a new slot (list cell of the specified form) for each database concerned. If multiple MDIs are used, identify the corresponding stream of the MDI of each database addressed by the query, and append a new slot to each. Otherwise, identify the stream of the single dispatcher MDI and add to it a slot for each database, indicating its name.
  2. Subsequently, write the database commands into the request list of each slot.
  3. Check all answer flags of all requests in all slots and perform a global decision about the commitment or abortion of the multi-database transaction.
  4. Communicate the global decision to all databases by writing either commit or abort as the last request to the request list of all slots. This will cause all single database transactions either to commit or to abort.
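The four steps above amount to an all-or-nothing decision over the answer flags. A compact sketch, with plain Python data structures standing in for the slot cells, request lists, and flag segments (the database names are hypothetical):

```python
# Each slot: the request list sent to one database, plus per-request answer
# flags written by the MDI worker after each database API call.
slots = {
    "flights_db": {"requests": ["begin", "insert booking"], "flags": [True, True]},
    "hotels_db":  {"requests": ["begin", "insert booking"], "flags": [True, False]},
}

# Step 3: commit globally only if every request on every database succeeded.
decision = "commit" if all(all(s["flags"]) for s in slots.values()) else "abort"

# Step 4: append the decision to every slot's request list, driving each
# single-database transaction to the same outcome.
for slot in slots.values():
    slot["requests"].append(decision)

print(decision)  # "abort" here, because one flag on hotels_db is False
```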

It is crucial that the last step be performed for all slots. As CORSO supports transactions on the shared segments as well as reliable distribution strategies, this can easily be achieved by communicating the global decision in an atomic step to all MDI workers.

4.4 Multi-Tier Multi-Database Access

An improvement of the suggested software architecture would be to relieve the client from knowing about the semantic data integration and about the organisation of the coordination data structures, i.e. whether many MDIs and many streams are employed or only one MDI in its role as a dispatcher. Special processes can be introduced that are responsible for processing the request tokens of the clients, translating them into adequate database query statements and writing them to the right streams of the MDIs. The processes can specialise in certain tasks. In this way, many layers can be introduced that move the application logic from the client to several so-called middle tiers. The processes representing the middle tiers can be distributed to many sites. Clients can become thin, as the knowledge about semantic data integration and the chosen coordination structure can be hidden in the corresponding tiers. However, the introduction of a new tier implies more communication effort, because new data segments (request as well as answer segments) will need to be written.

The number of tiers used by a request on its way from client to the MDI worker can differ from the number of tiers used by the answer on its way back.

4.5 Variations

Database Heterogeneity. Databases are autonomous and may differ in the scheduling mechanisms they provide [3]. In the best case, we can count upon databases that support a reliable two-phase-commit protocol. Such databases guarantee a ready-to-commit (or prepared) state even if failures occur. The situation becomes more complicated if a database performs a rollback upon a failure occurring after the database has reported its ready state. In this case, a global decision cannot trust the ready state of a single database. The problem can be avoided if a kind of log is simulated for this database.

Using the suggested coordination structure for multi-database integration, the request list of each slot can serve this purpose. If this request list is written into persistent VSM data segments that survive site failures, the corresponding MDI worker can be restarted and resume submission of requests to the database. Care must be taken, however, that no other database queries interleave and destroy the semantics of the global transaction. This can be achieved by sequentialising the processing of the slot stream per database, i.e., by starting only one MDI worker at a time.

In general, the weaker the supported scheduling protocol, the more we have to give up parallelism and forbid access to the database by means other than that provided by the middleware. But surprisingly, the presented multi-database coordination structure is flexible enough to serve all different kinds of database schedulers; even non-two-phase commit transaction managers can be supported. Here, instead of writing abort to the list of requests, the necessary sequence of undo commands must be written to it. Moreover, the mixed usage of databases with different scheduling mechanisms can be supported.

Database Replication. Database replication requires that changes in one database also be reliably reflected in one or more foreign databases. Technically, database replication is a subset of MDBS integration. If a client issues an insert/delete/update to a database, it will also write the same request to the request lists of the slots of all other databases. So this is a special case of multi-database integration where the queries issued at the databases are equal. The knowledge of the replication can be hidden in a middle-tier process. The client need not be aware of a change of the replication logic at all.

Integration of Results. The presented coordination structure for multi-database architectures can easily be enhanced by yet another tier. The task of this tier is to integrate the retrieved result data (multiple tables in different formats) into one result table. This can be done by coupling this tier with another database system that is fed with the single results and is therefore able to integrate and join them.

Mapping of Result Data to Shared Data Segments. As in the request/answer pattern, the result data segments can be shared among clients. They can be named, and a client either indicates the name of the desired result segment or the MDI worker recognises by the request where to put the result data.

The mapping of database result data into the proxy segments of the VSM should be done in a structured way. For relational databases, for example, a relation can be represented by a data segment whose name is composed of the database name and the relation name. This segment itself can be structured in a way such that each tuple of the relation is mapped into a separate segment that is also uniquely identified, e.g. by its key. The granularity of the data segments determines the caching behaviour. The communication effort can be minimised by a good representation of the proxy VSM segments.
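The naming scheme sketched above — segment names composed of database, relation, and key — might look like the following. This is a sketch under stated assumptions: a flat dictionary stands in for the VSM segment namespace, and all names are hypothetical.

```python
# VSM proxy segments keyed by hierarchical names: "<database>.<relation>.<key>".
# Tuple-level granularity lets the VSM refresh only the tuples that changed.
segments: dict = {}

def store_tuple(db: str, relation: str, key: str, tup: dict) -> str:
    """Map one database tuple into its own uniquely named proxy segment."""
    name = f"{db}.{relation}.{key}"
    segments[name] = tup
    return name

store_tuple("flights_db", "flights", "LH123", {"price": 199, "state": "open"})
store_tuple("flights_db", "flights", "OS456", {"price": 240, "state": "full"})

# An update touches exactly one tuple segment; cached sibling tuples at a
# client's site remain valid, so only the changed part needs communication.
segments["flights_db.flights.LH123"]["price"] = 179
print(segments["flights_db.flights.LH123"])
```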

Refresh and Caching of Retrieved Data. If we consider read accesses, the retrieval of data can either be done every time a new request is issued, or it may be sufficient to specify a time window for the validity of retrieved data. A further possibility is that the database uses a trigger: every time data are updated in the database, the database automatically feeds the VSM proxy with data segments containing the new data. If the proxies are well structured, only the parts of the result that have changed will require communication. The VSM is aware of which sub-segments of the result segment have changed. This cannot be achieved automatically with distributed objects, because a remote method invocation cannot profit from data that have been transferred in a previous call. In contrast, with VSM all clients at the same site can profit from data cached by each other's previous queries.

Scale-Up. The scale-up of clients is unlimited. In contrast to DCOM and CORBA solutions, we have observed that in VSM solutions the addition of clients can improve global throughput. This is the case if clients issue similar queries and thus can take advantage of previously cached data.

The addition of databases is also possible without changing the application code of the clients. The number and kind of available databases can be reflected by one or more middle-tier processes. The reconfiguration can be done dynamically.

5. Conclusion

Virtual Shared Memory (VSM) eases the development of distributed applications. For general acceptance of VSM systems, two issues are of great importance: performance of the implementation, and advanced programming techniques. The first issue has already been met by several prototype systems, some of which are already commercially available. Experience with programming VSM can be communicated in the form of coordination design patterns. Two basic patterns were described in this introductory article, and their application in the multi-database domain has been shown.

Acknowledgement

I would like to thank Alexander Forst, Friedrich Kofler and Herbert Pohlai for their helpful comments on this text, and Heidemarie Wernhart for the many discussions on coordination design patterns.

References

[1] C. Amza et al., TreadMarks: Shared Memory Computing on Networks of Workstations, IEEE Computer, Feb. 1996, http://www.cs.rice.edu/~willy/TreadMarks/papers.html

[2] M. Rasit Eskicioglu, A Comprehensive Bibliography of Distributed Shared Memory, Department of Computing Science, University of Alberta, 1998, http://www.cs.umd.edu/~keleher/bib/dsmbiblio/dsmbiblio.html

[3] Y. Breitbart, D. Georgakopoulos, M. Rusinkiewicz, and A. Silberschatz, On Rigorous Transaction Scheduling, Technical Report No. 175-90, University of Kentucky, Department of Computer Science, Lexington, Kentucky 40506.

[4] M. W. Bright, A. R. Hurson, and S. H. Pakzad, A Taxonomy and Current Issues in Multidatabase Systems, IEEE Computer, March 1992.

[5] N. Carriero, and D. Gelernter, How to Write Parallel Programs: A Guide to the Perplexed, ACM Computing Surveys, Vol. 21, No. 3, September 1989.

[6] B. D. Fleisch et al., MIRAGE+: A Kernel Implementation of Distributed Shared Memory on a Network of personal Computers, Software—Practice and Experience, March 1994.

[7] e. Kuehn, Fault-Tolerance for Communicating Multidatabase Transactions, in: Proceedings of the 27th Hawaii International Conference on System Sciences (HICSS), IEEE, January 4–7, Wailea, Maui, Hawaii, 1994.

[8] e. Kuehn, A Distributed and Recoverable Linda Implementation with Prolog & Co, in: Proceedings of the Austrian-Hungarian Workshop on Distributed and Parallel Systems (DAPSYS'96), Miskolc, Hungary, October 2–4, 1996.

[9] e. Kuehn, and G. Nozicka, Post Client/Server Coordination Tools, in: Proceedings of Coordination Technology for Collaborative Applications, Wolfram Cohen, Gustaf Neumann (eds.), Springer Series Lecture Notes in Computer Science, 1997.

[10] J. E. Lumpp, et al., Performance Visualization for Distributed Shared Memory Systems, in this issue.

[11] H. Oesterle, R. Riehm, and P. Vogler, Middleware: Grundlagen, Produkte und Anwendungsbeispiele fuer die Integration heterogener Welten, Vieweg, 1996 (in German).

[12] R. Orfali, D. Harkey, and J. Edwards, The Essential Distributed Objects Survival Guide, John Wiley and Sons, 1996.

[13] R. Orfali, and D. Harkey, Client/Server Programming with Java and Corba, John Wiley and Sons, 1997.

[14] D. C. Schmidt, Design Patterns and Pattern Languages: Design Patterns for Concurrent, Parallel, and Distributed Systems, http://www.cs.wust.edu/~schmidt/patterns-ace.html

[15] A. R. Silva, Method and Pattern Languages, http://albertina.inesc.pt/~ars/dasco/patterns.html

[16] H. Wernhart, and e. Kuehn, Basic Coordination Design Patterns, Technical Report, University of Technology, E185/1, Vienna, 1998 (submitted for publication).

[17] N. Wirth, A Plea for Lean Software, IEEE Computer, Feb. 1995.
