New Challenges in Managing Distributed Systems

Michael W. Masters

Published: Mar 1, 2001

Abstract

Has anyone noticed that a quiet revolution is taking place in the way computers are employed in large IT centers?

Well, OK, the whole computer world has been in a state of ferment and revolution from the beginning. Very few technological domains are moving at the rate observed in the field we call home. The pace of change shows little if any sign of slowing up. More to the point of our story, this rapid progress is having a visible impact on how large collections of computer resources are deployed. The trend involves managing those rooms full of supposedly cheap computers as if they were scarce resources—managing them around the clock for optimum utilization across a variety of customers, tasks and domains.

Although computing costs less than it did in the past, it's still an expensive operation for large enterprises. As a result, managers are still looking for ways to cut costs and increase effectiveness despite the commodity level pricing of many products. It's as if the green eyeshade efficiency experts who once roamed factory floors looking for corners to cut have finally been allowed into the computer center.

Why is this happening? A key enabling factor is that distributed processing is now a reality, and it is invading computer centers everywhere. The present fad is the thin client. (Before you ask, no, the thin client hasn't replaced the PC on my desk at work, and it certainly hasn't replaced the one on my computer table at home.) But the thin client (which may or may not win in the marketplace) is just the tip of the iceberg; there's a lot more to the show than that. The real revolution is in the way computer centers are evolving toward a service-for-hire paradigm rather than a capital investment model.

Of course, rarely is something all one way or the other, so we're likely to see both models operating side by side in the market for a long time to come. But, increasingly, computing is being provided as a service rather than as a wholly owned function of the organization doing the computing. For those of us who have been around since the time Gordon Moore enunciated his famous law about the number of transistors on a chip doubling every 18 to 24 months, this is a case of déjà vu all over again. Some of us got our start in the era of massive mainframe computers and batch jobs, a model not unlike the service model evolving today.

But, there are no punch cards in these new computer centers—nor gigantic IBM, CDC and Cray mainframes either. Instead, the computer centers of today, whether organic or service-based, are increasingly filled with dozens to hundreds of processors, all supporting the organization's IT workflow. In fact, a whole new style of machine is evolving to meet customers' insatiable demand for more cycles. Sun calls them blades, Apple calls them slices (ahem, pun alert!), and other companies have other names. Whatever they are called, the low-cost, high density components of today's server farms are made and marketed (or soon will be) by just about every big manufacturer in America.

It is difficult to pinpoint exactly when this movement began. A number of large enterprises never quite abandoned the big, costly mainframe computer center. So, it is difficult to say that the current trend is new. But several new uses have emerged that have encouraged the growth of the service model, among them web server farms, grid computing, enterprise management, and dynamic resource management.

All of this, of course, would be moot without the widespread availability of low-cost processors and memory and fast networks (both local- and wide-area) and protocols—as well as the software needed to manage them. Especially, the software that manages them. For those of us who make our livings building distributed systems, the management software starts to look like a major new growth area all by itself.

There is even momentum in the U.S. Navy to apply this computing center model to mission critical systems. Known as total ship computing, the approach asks everyone to build their mission critical software to the same set of operating system and middleware standards so as to achieve portability and location transparency. If this is done, then a pool of computers (suitably dispersed for survivability) can meet many (though certainly not all) of the mission critical computing needs of ships, submarines, aircraft, etc.

One of the key benefits of this approach is enhanced survivability. Today, most fault tolerant mission critical systems use a dedicated primary-backup scheme. There is little, if any, resource sharing. But with the pooled resource concept, every computer in the pool is potentially a backup for any software that can run on the pool's assets. Furthermore, the pooling approach enables maintenance-free deployments: if it breaks, don't fix it, just switch it out of the pool until the next in-port maintenance cycle.
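To make the contrast with a dedicated primary-backup scheme concrete, here is a minimal sketch of the pooled idea in Python. Everything in it is an illustrative assumption rather than a description of any fielded system: the names PoolNode and ResourcePool, the notion of a fixed number of "slots" per computer, and the least-loaded placement rule are all hypothetical. The point is only that a failed node is switched out of the pool and its applications are re-placed on whatever healthy capacity remains.

```python
from dataclasses import dataclass, field


@dataclass
class PoolNode:
    """One computer in the pool (hypothetical model)."""
    name: str
    capacity: int                 # assumed unit: number of application slots
    healthy: bool = True
    apps: list = field(default_factory=list)

    def free_slots(self) -> int:
        # A node that has been switched out of the pool offers no capacity.
        return self.capacity - len(self.apps) if self.healthy else 0


class ResourcePool:
    """Every healthy node is a potential backup for any application."""

    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}

    def place(self, app: str):
        """Place an app on the node with the most free slots; None if full."""
        candidates = [n for n in self.nodes.values() if n.free_slots() > 0]
        if not candidates:
            return None
        target = max(candidates, key=lambda n: n.free_slots())
        target.apps.append(app)
        return target.name

    def fail(self, node_name: str) -> dict:
        """Switch a node out of the pool and re-place its applications."""
        node = self.nodes[node_name]
        node.healthy = False
        displaced, node.apps = node.apps, []
        return {app: self.place(app) for app in displaced}


pool = ResourcePool([PoolNode("a", 2), PoolNode("b", 2), PoolNode("c", 2)])
for app in ("radar", "nav", "comms"):
    pool.place(app)
moved = pool.fail("a")   # the broken node stays out until maintenance
```

A real resource manager would also have to weigh priorities, state transfer, and timing constraints when it re-places software; the sketch ignores all of these.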

Whether the use is military or commercial, all this adds up to greater effectiveness at reduced cost. For this reason, the technology of enterprise and resource management is likely to gain a self-sustaining life of its own. Managing computing resources on the fly is here to stay. But, for those of us concerned with mission and safety critical systems, particularly real-time ones, acceptance requires more than enthusiasm. It requires that certain engineering -ilities be satisfied, regardless of the cost/benefit tradeoff.

Perhaps the most important of these -ilities is certifiability. It is not enough that a mission or safety critical system perform its intended mission—or that it do so at a cost that the enterprise (whether corporation or nation) can sustain. It must also do so in a verifiable manner.

For those who insist that the technologies used in mission critical systems must operate in a predictable and bounded manner, the very thought of certifying systems that change their behavior on the fly based on dynamically evolving circumstances, whether internal or external, can be a daunting prospect. The first response is usually an incredulous "You want to do what?"

Granted, certifying a dynamically allocated system is a difficult problem. But is it intractable? No, I don't think so. Like everything, it is subject to science and engineering. We believe that, when properly formulated, the problem is solvable. Some of our thoughts were presented as a challenge problem statement at this year's Workshop on Parallel and Distributed Real-time Systems, held in Nice, France, in April.

So, what are the key factors? First, not only must the functionality of the distributed system itself be certified (as was always the case), but the resource management software must also be certified. This much is obvious. But there is more. We also need to show that any application can run effectively on any computer in the pool. There are a variety of ways to accomplish this, the simplest of which is to use identical computers throughout. But given the rapidity with which computer technology is progressing, this seems overly confining. So, we need to identify criteria by which this condition may be relaxed. Elsewhere, we have used the term virtual homogeneity to describe this property.
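One way to picture a relaxed condition is as a check that every computer in the pool satisfies every application's declared requirements, even when the computers are not identical. The Python sketch below is only an illustration of that idea. The specific criteria (a matching operating system interface, matching middleware, and minimum speed and memory), along with the names Requirements, NodeProfile, can_host, and virtually_homogeneous, are assumptions made for this example and are not the criteria the editorial has in mind.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    """What an application needs from its host (assumed criteria)."""
    os_api: str
    middleware: str
    min_mips: int
    min_memory_mb: int


@dataclass(frozen=True)
class NodeProfile:
    """What a computer in the pool offers (assumed attributes)."""
    name: str
    os_api: str
    middleware: str
    mips: int
    memory_mb: int


def can_host(node: NodeProfile, req: Requirements) -> bool:
    """True if this node meets every requirement of the application."""
    return (
        node.os_api == req.os_api
        and node.middleware == req.middleware
        and node.mips >= req.min_mips
        and node.memory_mb >= req.min_memory_mb
    )


def virtually_homogeneous(nodes, apps) -> bool:
    """True if any application can run on any node in the pool."""
    return all(can_host(n, r) for n in nodes for r in apps)


old = NodeProfile("old", "POSIX", "CORBA", mips=400, memory_mb=256)
new = NodeProfile("new", "POSIX", "CORBA", mips=1200, memory_mb=1024)
track = Requirements("POSIX", "CORBA", min_mips=300, min_memory_mb=128)
fusion = Requirements("POSIX", "CORBA", min_mips=800, min_memory_mb=512)
```

With these made-up figures, the two dissimilar machines are interchangeable for the track application alone, but not once the more demanding fusion application joins the workload.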

Finally, for real-time systems, the notion of guaranteed schedulability for high priority tasks is paramount. For cyclic real-time applications, rate monotonic theory—applied as a part of the run-time allocation algorithm—might be an interesting approach. Unfortunately, most complex real-time distributed systems contain both cyclic and acyclic functionality. It appears that there is a very real opportunity for real-time scheduling theorists to contribute solutions to the more general problem that are computable at run-time.
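For the cyclic case, the idea of folding rate monotonic theory into the allocation step can be sketched as follows. The sketch assumes independent periodic tasks with deadlines equal to their periods and uses the classic Liu and Layland utilization bound, n(2^(1/n) - 1), as a sufficient (not necessary) admission test on each computer. The first-fit placement order and the names PeriodicTask, Node, rm_bound, and allocate are hypothetical choices for this example. Acyclic work, which the paragraph above identifies as the harder open problem, is not handled.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PeriodicTask:
    name: str
    wcet: float     # worst-case execution time per period
    period: float   # deadline is assumed equal to the period

    @property
    def utilization(self) -> float:
        return self.wcet / self.period


def rm_bound(n: int) -> float:
    """Liu and Layland utilization bound for n tasks under rate monotonic."""
    return n * (2 ** (1.0 / n) - 1)


@dataclass
class Node:
    name: str
    tasks: list = field(default_factory=list)

    def can_admit(self, task: PeriodicTask) -> bool:
        # Sufficient test: total utilization must stay within the bound.
        n = len(self.tasks) + 1
        total = sum(t.utilization for t in self.tasks) + task.utilization
        return total <= rm_bound(n)


def allocate(tasks, nodes) -> dict:
    """First-fit placement, heaviest tasks first; None means rejected."""
    placement = {}
    for task in sorted(tasks, key=lambda t: t.utilization, reverse=True):
        for node in nodes:
            if node.can_admit(task):
                node.tasks.append(task)
                placement[task.name] = node.name
                break
        else:
            placement[task.name] = None
    return placement


tasks = [
    PeriodicTask("A", wcet=1, period=2),   # utilization 0.50
    PeriodicTask("B", wcet=1, period=4),   # utilization 0.25
    PeriodicTask("C", wcet=2, period=5),   # utilization 0.40
]
placement = allocate(tasks, [Node("n1"), Node("n2")])
```

Because the bound is only sufficient, this test may reject task sets that are in fact schedulable; an exact response-time analysis would admit more, at greater run-time cost.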

More than a decade ago, I had the rare privilege of working on a DARPA-sponsored government/industry/academia project designed to examine the application of high performance distributed processing to mission critical real-time systems. At the beginning of that effort, the DARPA project officer predicted that one day computing would be infinite and free, and he suggested that we should all learn how to effectively use such abundant distributed resources.

Much of the technology base anticipated way back then is now a reality: fast, low cost microprocessors; plenty of memory and mass storage; low latency switch-based networks with incredibly high bisection bandwidth; high functionality middleware based on the ubiquitous IP standard; etc. But the infinite and free part of his prediction has not transpired despite Moore's Law, at least not yet. So, there is still plenty of reason to pursue the efficiencies that accrue from managing resources in response to dynamically changing circumstances.

For the inevitable skeptics and unbelievers, consider a recent conversation with the vice-president for manufacturing of a major computer firm. He asked our group to name the single question most often asked of him by visiting corporate CIOs. After waiting for our blank stares to subside, he answered his own question: "How are you going to make my computer center operations cheaper and more flexible?" And this in the last days, when computing is infinite and free!

The answer, he added, is to manage service-based computing on an on-demand basis.

For believers that Moore's Law will solve all problems: hold on to a good thought, but don't bet the farm on it just yet! We have not arrived at the nirvana predicted by one technology investment manager I met recently: "In the future we will build ships with one very fast computer on the bow and another on the stern. That's all you'll ever need." (No kiddin', this is almost an exact quote!) Contrary to the prognostication of this head-in-the-sand seer, the number of computers in use is growing like yeast, and as a result, the technology of run-time resource management is here to stay.

Now, as is usually the case with technological innovation, it's up to us to figure out how to make it work.

Michael W. Masters
Naval Surface Warfare Center Dahlgren Division

Issue

Vol. 5 No. 3 (2002)

Section

Editorial
