Introduction

This book is the third volume in an informal series of books about parallel processing for Artificial Intelligence. Like its predecessors, it is based on the assumption that the computational demands of many AI tasks can be better served by parallel architectures than by the currently popular workstations. However, no assumption is made about the kind of parallelism to be used. Transputers, Connection Machines, farms of workstations, Cellular Neural Networks, Crays, and other hardware paradigms of parallelism are used by the authors of this collection.

Papers in this collection are from the areas of parallel knowledge representation, neural modeling, parallel non-monotonic reasoning, search and partitioning, constraint satisfaction, theorem proving, parallel decision trees, parallel programming languages, and low-level computer vision. The final paper is an experience report about applications of massive parallelism which can be said to capture the spirit of a whole period of computing history. The articles of this book have been grouped into four chapters: Knowledge Representation, Search and Partitioning, Theorem Proving, and Miscellaneous. Most of the papers are loosely based on a workshop that was held at IJCAI 1995 in Montreal, Canada. All papers have been extensively reviewed and revised, resulting in a collection that gives a snapshot of the state of the art in Parallel Processing for Artificial Intelligence.

In the chapter on Knowledge Representation, Shastri and Mani show how a good understanding of human neural processing can inspire and constrain the building of efficiently implemented parallel reasoning mechanisms. Boutsinas, Stamatiou and Pavlides apply parallel processing techniques to Touretzky-style nonmonotonic inheritance networks. Lee and Geller present an efficient representation of class hierarchies on a massively parallel supercomputer. Stoffel, Hendler and Saltz describe a powerful parallel implementation of their PARKA frame system.

In the second chapter, all three papers, by Cook, by Suttner, and by Berlandier and Neveu, deal with partitioning of a search space. Cook describes HyPS, a parallel hybrid search algorithm. Suttner presents static partitioning with slackness as an approach for parallelizing search algorithms. Berlandier and Neveu describe a partitioning technique applied to constraint satisfaction problems.

The third chapter contains papers about parallel reasoning. Bergmann and Quantz describe a system based on the classical KL-ONE approach to knowledge representation. This system is called FLEX, as its main goal is to permit flexible reasoning. Fisher describes a technique for theorem proving that relies on the broadcast of partial results. The SiCoTHEO theorem prover of Schumann is based on competing search strategies.

The last chapter contains a number of papers that do not fit well into any other category of this book. Destri and Marenzoni analyze different parallel architectures for their ability
to execute parallel computer vision algorithms. Kufrin's paper is about machine learning, namely the induction of a parallel decision tree from a given set of data. Lallement, Cornu, and Vialle combine methods from both connectionist and symbolic AI to build an agent-based programming language for Parallel Processing for Artificial Intelligence. Finally, Waltz presents an inside look at the development of the Connection Machine, and the problems that arose in trying to make it financially viable.

The Appendix of this book contains a list of references to papers about Parallel Processing for Artificial Intelligence that appeared at a number of workshops, giving the reader information about sources that are otherwise not easily accessible.

The editors would like to thank the authors for their timely and diligent manuscript submissions, and all the reviewers for their efforts in selecting a set of high quality papers. We thank Laveen Kanal and Azriel Rosenfeld for their support in making this project happen. Additional thanks go to the staff at Elsevier, especially Y. Campfens, who patiently suffered through a sequence of delays in our preparation of the final document. We thank Y. Lee and J. Stanski of NJIT who helped with some of the "leg work" in editing the final version of this book. Finally, J. Geller would like to thank his family for giving him time off on Sundays to finish this document.
About the Editors

James Geller

James Geller received an Electrical Engineering Diploma from the Technical University Vienna, Austria, in 1979. His M.S. degree (1984) and his Ph.D. degree (1988) in Computer Science were received from the State University of New York at Buffalo. He spent the year before his doctoral defense at the Information Sciences Institute (ISI) of USC in Los Angeles, working with their Intelligent Interfaces group. James Geller received tenure in 1993 and is currently associate professor in the Computer and Information Science Department of the New Jersey Institute of Technology, where he is also Director of the AI & OODB Laboratory. Dr. Geller has published numerous journal and conference papers in a number of areas, including knowledge representation, parallel artificial intelligence, and object-oriented databases. His current research interests concentrate on object-oriented modeling of medical vocabularies, and on massively parallel knowledge representation and reasoning. James Geller was elected SIGART Treasurer in 1995. His Data Structures and Algorithms class is broadcast on New Jersey cable TV.

Hiroaki Kitano
Dr. Hiroaki Kitano is a Senior Researcher at Sony Computer Science Laboratory. Dr. Kitano received a B.A. in Physics from International Christian University, and a Ph.D. in Computer Science from Kyoto University. He joined NEC's Software Engineering Laboratory in 1984, and developed a number of very large software systems. From 1988 to 1993, he was a visiting researcher at the Center for Machine Translation, Carnegie Mellon University. In 1993, he received the Computers and Thought Award from the International Joint Conference on Artificial Intelligence. His current academic service includes chairing the international committee for RoboCup (World Cup Robot Soccer), serving as associate editor for Evolutionary Computing, the Applied AI journal, and other journals, and serving as an executive member of various international committees.

Christian Suttner
Christian Suttner studied Computer Science and Electrical Engineering at the Technische Universität München and the Virginia Polytechnic Institute and State University. He received a Diploma with excellence from the TU München in 1990, and since then he has been working as a full-time researcher on parallel inference systems in the Automated Reasoning Research Group at the TU München. He received a Doctoral degree in Computer Science from the TUM in 1995. His current research interests include automated
theorem proving, parallelization of search-based systems, network computing, and system evaluation. Together with Geoff Sutcliffe, he created and maintains the TPTP problem library for automated theorem proving systems and designs and organizes theorem proving competitions.
Massively Parallel Knowledge Representation and Reasoning: Taking a Cue from the Brain

Lokendra Shastri a* and D.R. Mani b

a International Computer Science Institute, 1947 Center Street, Ste. 600, Berkeley, CA 94707

b Thinking Machines Corporation, 14 Crosby Drive, Bedford, MA 01730

Any intelligent system capable of common sense reasoning and language understanding must be capable of performing rapid inferences with reference to a large body of knowledge. The ability to perform rapid inferences with large knowledge bases is also essential for supporting flexible and effective access to the enormous body of electronically available data. Since complexity theory tells us that not all inferences can be computed effectively, it is important to identify interesting classes of inference that can be performed effectively. Over the past several years we have tried to do so by working within a neurally motivated, massively parallel computational model. Our approach is motivated by the belief that situating the knowledge representation and reasoning problem within a neurally motivated computational architecture will not only enhance our understanding of the mind/brain, but it will also lead to the development of effective knowledge representation and reasoning systems implemented on existing hardware. In this chapter we substantiate this claim and review some results of pursuing this approach. These include a characterization of reflexive reasoning, i.e., reasoning that can be performed effectively by neurally plausible networks; the design of CSN, a connectionist semantic network that can perform inheritance and recognition in time proportional to the depth of the conceptual hierarchy; SHRUTI, a connectionist knowledge representation and inference system that can encode a large number of facts, rules, and a type hierarchy, and perform a class of first-order inferences with extreme efficiency; and SHRUTI-CM5, an implementation of SHRUTI on the CM-5 that can encode over half a million rules, facts, and types and respond to reflexive queries within a few hundred milliseconds.

1. INTRODUCTION

The ability to represent and use a large body of knowledge effectively is an essential characteristic of intelligent systems. For example, understanding language requires the

*This work was partially funded by NSF grant IRI 88-05465, ARO grant DAA29-84-9-0027, ONR grants N00014-93-1-1149 and N00014-95-C-0182, and NSF resource grant CCR930001N.
hearer to draw inferences based on a large body of common sense knowledge in order to establish referential and causal coherence, generate expectations, and make predictions. Plausible estimates of the size of such a knowledge base range from several hundred thousand to more than a million items [8]. Nevertheless, we can understand language at the rate of several hundred words per minute. This clearly suggests that we are capable of performing a wide range of inferences with reference to a large knowledge base within a few hundred milliseconds. Any real-time language understanding system should be capable of replicating this remarkable human ability.

There has been an explosive growth in electronically available information and the number of consumers of such information. The storage, transmission, access, and ultimately, the effective use of this large and heterogeneous body of data poses a number of technological challenges. A core challenge--and one that is relevant to our work--is the problem of providing intelligent content-based access to the available data. The ability to provide such access, however, will depend in large part on a system's ability to bridge the "semantic gap" between a user's query and the relevant data items. This in turn would critically depend on a system's ability to perform rapid inferences based on a variety of knowledge such as ontological knowledge, terminological knowledge, domain knowledge, common sense knowledge, and user models. Several of these knowledge sources--in addition to the common sense knowledge base--will be very large and may contain several hundred thousand items. For example, the Unified Medical Language System's terminological component contains 190,863 entries consisting of medical, clinical and chemical concepts [26].

While database and information retrieval technology has evolved considerably over the past decade, the development of large-scale yet efficient knowledge based systems capable of supporting inference has lagged behind. There exist a number of robust and sophisticated database management systems that provide efficient access to very large databases, but there do not exist high performance systems that can carry out efficient inference with respect to large knowledge bases. The integration of the inferential capabilities of an effective large-scale inference system and the full functionality of existing database and information systems should contribute to the development of a flexible, expressive, and efficient system for accessing large and heterogeneous databases.

Thus from the point of view of building artificially intelligent systems capable of understanding natural language as well as from the perspective of supporting emerging technology for accessing electronically available information, it is important to develop high performance knowledge representation and reasoning systems. Complexity theory, however, rules out the existence of systems capable of performing all inferences effectively. Thus the key scientific challenge in building an efficient inference system consists of identifying interesting and useful classes of inference that can be performed effectively. AI researchers, including ourselves, have pursued this goal using a number of strategies. Our approach focuses on identifying interesting and useful classes of inference that can be performed rapidly by neurally motivated and massively parallel computational models.
The thesis underlying our approach is that crucial insights into the nature of knowledge representation and reasoning can be obtained by working within the computational constraints suggested by the human brain--the only extant system that exhibits the requisite attributes of response time and scalability. We believe that situating the
knowledge representation and reasoning problem within a neurally motivated computational architecture will not only enhance our understanding of the mind/brain, but it will also lead to the development of effective knowledge representation and reasoning systems realized on existing high performance hardware platforms. In the rest of this chapter we describe some results of pursuing this approach. Section 2 discusses our approach and its motivation in more detail. Sections 3 and 4 describe two connectionist models of knowledge representation and reasoning. Section 5 describes the mapping of one of these models onto existing hardware platforms and Section 6 offers some conclusions.

2. COMPUTATIONAL EFFECTIVENESS
As the science of artificial intelligence has matured over four decades, it has become apparent that we had underestimated the complexity and intricacy of intelligent behavior. Today we realize that the task of building a system that performs intelligently in a limited domain is dramatically different from that of designing a system that displays the sort of natural intelligence we take for granted among humans and higher animals. This sharp difference is highlighted by the limitations of artificial intelligence systems developed to understand natural language, process visual information, and perform common sense reasoning. There are programs that "understand" English if the exchange is limited to talk about airplane tickets or restaurants; there are reliable vision systems that can identify a predefined set of objects presented under controlled conditions; but we have yet to design systems that can recognize objects with the skill of a monkey, or converse with the facility of a five year old.

Given that existing AI systems perform credibly within restricted domains, one may be led to believe that in order to accommodate more complex domains all that is necessary is to encode more facts and rules into our programs. But the situation is not so straightforward; it is not as though the existing programs are just miniature versions of larger programs that would perform intelligently in richer domains. The problem is that the solutions do not scale up: the techniques that work in restricted domains are inadequate for dealing with richer and more complex domains. As the domains grow bigger and more complex, we run into the stone wall of computational effectiveness; the performance of the system degrades and it can no longer solve interesting problems in acceptable time-scales. This is not surprising if we recognize that intelligent activity involves very dense interactions between many pieces of information, and in any system that encodes knowledge about a complex domain, these interactions can become too numerous for the system to perform effectively.

A concern for computational effectiveness should be central to AI. From the viewpoint of AI, it does not suffice to offer a computational account of how an agent may solve an interesting set of problems. AI needs to solve a far more difficult problem: it must provide a computational account of how an agent may solve interesting problems in the time frame permitted by the environment.2 The ability to satisfy the computational effectiveness constraint appears to be one of the basic properties of intelligent agents. Success, and

2The significance of computational effectiveness in the context of AI was first discussed in these terms in [17].
at times even the survival of an agent, may depend on his ability to make decisions and choose appropriate actions within a given time frame. In fact, in certain situations we would hesitate to label an activity as being "intelligent" if it takes arbitrarily long. To give an extreme example--if time were not a factor, even a dumb computer could beat the world's greatest chess player by simply enumerating the full search tree and following a path that guaranteed a win. No doubt this would take an aeon, but if time is not a factor this should not be of consequence.3

It is tempting to ignore the computational effectiveness constraint by characterizing it as being merely a matter of efficiency or an implementation level detail. But doing so would be a mistake. Since computational effectiveness places strong constraints on how knowledge may be organized and accessed by cognitive processes, we believe that it may be essential to tackle the question of computational effectiveness at the very outset in order to understand the principles underlying the organization and use of information in intelligent systems.

2.1. Computational effectiveness necessitates a strong notion of tractability

As pointed out earlier, human agents perform a range of inferences while understanding language at the rate of several hundred words per minute. These inferences are performed rapidly, spontaneously and without conscious effort--as though they were a reflex response of our cognitive apparatus. In view of this we have described such reasoning as reflexive [20]. Reflexive reasoning may be contrasted with reflective reasoning, which requires reflection, conscious deliberation, and often an overt consideration of alternatives and weighing of possibilities. Reflective reasoning takes longer and often requires the use of external props such as a paper and pencil. Some examples of such reasoning are solving logic puzzles, doing cryptarithmetic, or planning a vacation.

What should be the appropriate criteria of tractability in the context of knowledge representation and reflexive reasoning? Since polynomial time complexity is the usual "threshold" for distinguishing the tractable from the intractable in computer science, it may seem reasonable to adopt this notion of tractability in this context. But as argued in [21], reflexive reasoning requires a more stringent criterion of tractability. Let us amplify:
• Reflexive reasoning occurs with respect to a large body of background knowledge. A serious attempt at compiling common sense knowledge suggests that our background knowledge base may contain as many as 10^6 items [8]. This should not be very surprising given that this knowledge includes, besides other things, our knowledge of naive physics and naive psychology; facts about ourselves, our family, friends, colleagues, history and geography; our knowledge of artifacts, sports, art, music; some basic principles of science and mathematics; and our models of social, civic, and political interactions.

• Items in the background knowledge base are fairly stable and persist for a long time once they are acquired. Hence this knowledge is best described as long-term knowledge and we will refer to this body of knowledge as the long-term knowledge base (LTKB).

3Two caveats are in order. First, we are assuming that a path leading to a forced win exists, but such a path may not exist. Second, in addition to time, space or memory is also a critical resource!
• Episodes of reflexive reasoning are triggered by "small" inputs. In the context of language understanding, an input (typically) corresponds to a sentence that would map into a small number of assertions. For example, the input "John bought a Rolls Royce" maps into just one assertion (or a few, depending on the underlying representation). The critical observation is that the size of the input, |In|, is insignificant compared to the size of the long-term knowledge base, |LTKB|.4

• The vast difference in the magnitude of |LTKB| and |In| becomes crucial when discussing the tractability of common sense reasoning, and we have to be careful in how we measure the time and space complexity of the reasoning process. In particular, we need to analyze the complexity of reasoning in terms of |LTKB| as well as |In|. In view of the magnitude of |LTKB|, even a cursory analysis suggests that any inference procedure whose time complexity is quadratic or worse in |LTKB| cannot provide a plausible computational account of reflexive reasoning. A process that is polynomial in |In|, however, does remain viable.
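A back-of-envelope calculation makes the scale argument concrete. The figures below are illustrative assumptions (a knowledge base at the upper end of the estimate in [8], and a generously fast serial processing rate), not measurements from this chapter:

```python
# Illustrative only: why quadratic dependence on |LTKB| is implausible for
# reflexive reasoning, under assumed (not measured) figures.
LTKB = 10**6         # assumed number of items in the long-term knowledge base
IN = 3               # assumed number of assertions in a typical input sentence
OPS_PER_SEC = 10**9  # assumed, generously fast, serial processing rate

quadratic_steps = LTKB**2                  # ~10^12 elementary steps
print(quadratic_steps / OPS_PER_SEC)       # ~1000 seconds: minutes, not milliseconds

# A procedure polynomial in |In| alone, by contrast, stays trivially fast:
print((IN**3) / OPS_PER_SEC)               # ~2.7e-8 seconds
```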
2.2. Time complexity of reflexive reasoning

Observe that although the size of a person's LTKB increases considerably from, say, age five to thirty, the time taken by a person to understand natural language and draw the requisite inferences does not. This suggests that the time taken by an episode of reflexive reasoning does not depend on |LTKB|. In view of this, it is proposed that a realistic criterion of tractability for reflexive reasoning is one where the time taken by an episode of reflexive reasoning is independent of |LTKB| and only depends on the depth of the derivation tree associated with the inference.5
2.3. Space complexity of reflexive reasoning

The expected size of the LTKB also rules out any computational scheme whose space requirement is quadratic (or higher) in the size of the KB. For example, the brain has only about 10^11 cells, most of which are involved in processing of sensorimotor information. Hence even a linear space requirement is fairly generous and leaves room only for a modest constant of proportionality. In view of this, it is proposed that the admissible space requirement of a model of reflexive reasoning be no more than linear in |LTKB|.

To summarize, it is proposed that as far as (reflexive) reasoning underlying language understanding is concerned, the appropriate notion of tractability is one where

• the reasoning time is independent of |LTKB| and is only dependent on the depth of the derivation tree associated with the inference and |In|, and
4A small input may, however, lead to a potentially large number of elaborate inferences. For example, the input "John bought a Rolls-Royce" may generate a number of reflexive inferences such as "John bought a car", "John owns a car", "John has a driver's license", "John is perhaps a wealthy man", etc.

5The restriction that the reasoning time be independent of |LTKB| may seem overly strong, and one might argue that perhaps logarithmic time may be acceptable. Our belief that the stronger notion of effectiveness is relevant, however, is borne out by results which demonstrate that there does exist a class of reasoning that can be performed in time independent of |LTKB|.
• the associated space requirement, i.e., the space required to encode the LTKB plus the space required to hold the working memory during reasoning, should be no worse than linear in |LTKB|.

2.4. Parallelism

The extremely tight constraint on the time available to perform reflexive inferences suggests that we must resort to massive parallelism. Many cognitive tasks, and certainly all the perceptual ones, that humans can perform in a few hundred milliseconds would require millions of instructions on a serial (von Neumann) computer, and it is apparent that a serial computer will be unable to perform these tasks within an acceptable time frame [6]. The crux of the problem becomes apparent if one examines the architecture of a traditional von Neumann computer. In such a computer, the computational and the inferential power is concentrated in a single processing unit (the CPU) while the information on which the computations have to be performed is stored in an inert memory which simply acts as a repository of the system's knowledge. As a result of the single processor design, only one processing step can be executed at any point in time, and during each processing step the CPU can only access a minuscule fraction of the memory. Therefore, at any given instant, only an insignificant portion of the system's knowledge participates in the processing.

On the other hand, intelligent behavior requires dense interactions between many pieces of information, and any computational architecture for intelligent information processing must be capable of supporting such dense interactions. It would therefore seem appropriate to treat each memory cell--not as a mere repository of information, but rather as an active processing element capable of interacting with other such elements. This would result in a massively parallel computer made up of an extremely large number of simple processing elements--as many as there are memory cells in a traditional computer. The processing capability of such a computer would be distributed across its memory, and consequently, such a computer would permit numerous interactions between various pieces of information to occur simultaneously. The above metaphor of computation matches the massively parallel and distributed nature of processing that occurs in the animal brain.6

2.5. Neural constraints

With nearly 10^11 computing elements and 10^15 interconnections, the brain's capacity for encoding, communicating, and processing information is awesome and can easily support massively parallel processing. But if the brain is extremely powerful, it is also extremely limited and imposes a number of rather strong computational constraints. First, neurons are slow computing devices. Second, they communicate relatively simple messages that can encode only a few bits of information. Hence a neuron's output cannot encode names, pointers, or complex structures.7 The relative simplicity of a neuron's processing ability with reference to
6The importance of massive parallelism was discussed in the above terms in [17,18]. Several other researchers have also pointed out the significance of massive parallelism in AI. For example, see [47,11,12,32].

7If we assume that information is encoded in the firing rate of a neuron, then the amount of information that can be conveyed in a "message" would depend on ΔF, the range over which the firing frequency of a presynaptic neuron can vary, and ΔT, the window of time over which a postsynaptic neuron can "sample"
the needs of symbolic computation, and the restriction on the complexity of messages exchanged by neurons, impose strong constraints on the nature of neural representations and processes [6]. A specific limitation of neurally plausible systems is that they have difficulty representing composite structures in a dynamic fashion. Consider the representation of the fact give(John, Mary, a-Book). This fact cannot be represented dynamically by simply activating the nodes representing the roles giver, recipient, and give-object, and the constituents "John", "Mary", and "a-Book". Such a representation would suffer from cross-talk because it would be indistinguishable from the representation of give(Mary, John, a-Book). The problem is that this fact is a composite structure: it does not merely express an association between the constituents "John", "Mary", and "a-Book", rather it expresses a specific relation wherein each constituent fills a distinct role. Hence representing such a fact requires representing the appropriate bindings between roles and their fillers.

It is easy to represent static (long-term) bindings using dedicated nodes and links (see Figure 1). For example, one could posit a separate "binder" node for each role-filler pair to represent role-filler bindings. Such a scheme is adequate for representing long-term knowledge because the required binder nodes may be created. This scheme, however, is implausible for representing dynamic bindings arising during language understanding since these bindings have to be generated very rapidly--within a hundred milliseconds--and it is unlikely that there exist mechanisms for growth of new links within such time scales. An alternative would be to assume that interconnections between all possible pairs of roles and fillers already exist. These links normally remain "inactive" but the appropriate ones become "active" temporarily to represent dynamic bindings. This approach is also problematic because the number of all possible role-filler bindings is extremely large and will require a prohibitively large number of nodes and links. Techniques for representing role-filler bindings based on the von Neumann architecture cannot be used since they require communicating names or pointers of fillers to appropriate roles and vice versa. As mentioned above, the storage and processing capacity of nodes as well as the resolution of their outputs is not sufficient to store, process, and communicate names or pointers. As we shall see in Section 4, attempts to solve representational problems such as the dynamic binding problem within a neurally constrained computational model lead to the identification of important constraints on the nature of reflexive reasoning.
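The cross-talk problem can be made concrete with a toy sketch (the representation and names below are illustrative, not the chapter's encoding): if a fact is represented only by which role and filler nodes are active, the two give facts become indistinguishable.

```python
# Toy illustration of the dynamic binding problem: representing a fact merely
# as the unordered set of active role and filler nodes loses the role-filler
# correspondence (who did what to whom).

def flat_activation(bindings):
    """Represent a fact as the set of active nodes, with no bindings."""
    active = set()
    for role, filler in bindings.items():
        active.add(role)
        active.add(filler)
    return active

fact1 = {"giver": "John", "recipient": "Mary", "give-object": "a-Book"}
fact2 = {"giver": "Mary", "recipient": "John", "give-object": "a-Book"}

# Both facts activate exactly the same nodes, so this representation cannot
# distinguish them -- the cross-talk described in the text.
assert flat_activation(fact1) == flat_activation(fact2)
```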
the incident spike train. ΔT is essentially how long a neuron can "remember" a spike and depends on the time course of the postsynaptic potential and the ensuing changes in the membrane potential of the postsynaptic neuron. A plausible value of ΔF may be about 200. This means that in order to decode a message containing 2 bits of information, ΔT has to be about 15 msec, and to decode a 3-bit message it must be about 35 msec. One could argue that neurons may be capable of communicating more complex messages by using variations in interspike delays to encode information (e.g., see Strehler & Lestienne 1986). However, Thorpe and Imbert (1989) have argued that in the context of rapid processing, the firing rate of neurons relative to the time available to neurons to respond to their inputs implies that a presynaptic neuron can only communicate one or two spikes to a postsynaptic neuron before the latter must produce an output. Thus the information communicated in a message remains limited even if interspike delays are used as temporal codes. This does not imply that networks of neurons cannot represent and process complex structures. Clearly they can. The interesting question is how?
Figure 1. Static coding of bindings using binder nodes.
2.6. Structured connectionism

Structured Connectionist Models [6,22] are intended to emulate the information processing characteristics of the brain--albeit at an abstract computational level--and reflect its strengths and weaknesses. Arguably, the structured connectionist approach provides an appropriate framework for developing computational models that are constrained by the computational properties of the brain. Typically, a node in a connectionist network corresponds to an idealized neuron or a small ensemble of neurons, and a link corresponds to an idealized synaptic connection. The main computational features of structured connectionist models are as follows (a minimal sketch of such a node is given after the list):

• A structured connectionist model is a network of nodes and weighted links.

• Nodes compute some simple functions of their inputs.

• Nodes can only hold limited state information--while a node may maintain a scalar "potential", it cannot store and selectively manipulate bit strings.

• Node outputs do not have sufficient resolution to encode symbolic names or pointers.

• There is no central controller that instructs individual nodes to perform specific operations at each step of processing.

• While links and link weights may change as a result of learning, they remain fixed during an episode of reflexive reasoning.
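The following is a minimal sketch of a node obeying these restrictions. The class name, update rule, and threshold function are illustrative assumptions, not the chapter's specification:

```python
# Minimal structured-connectionist node: weighted inputs, a simple activation
# function, a single scalar of state, and no ability to store or forward
# symbolic names or pointers. Illustrative sketch only.

class Node:
    def __init__(self, threshold=1.0):
        self.threshold = threshold
        self.potential = 0.0   # limited state: just one scalar "potential"
        self.links = []        # (source_node, weight) pairs; fixed during reasoning

    def connect_from(self, source, weight):
        self.links.append((source, weight))

    def update(self):
        # A simple function of the inputs: weighted sum of source outputs.
        self.potential = sum(w * src.output() for src, w in self.links)

    def output(self):
        # Low-resolution output: a single bit, not a name or pointer.
        return 1.0 if self.potential >= self.threshold else 0.0
```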
2.7. Mapping connectionist models to real machines

The massively parallel structured connectionist models assume a very large number of nodes and links, high fan-in and fan-out, and arbitrary interconnection patterns. These traits do not carry over to real machines. This shortcoming of real machines, however, is offset by the fact that the processing speed and communication times of high performance platforms are several orders of magnitude faster than those assumed in connectionist models. Another important factor that facilitates the mapping of our models to real machines is the simplicity of messages exchanged by nodes. As we shall see, this allows us to leverage the active message facility provided by machines such as the Connection Machine CM-5 for low-latency interprocessor communication of short messages. Given the partial asymmetry in the strengths of connectionist models and existing hardware platforms, one needs to address several issues when mapping structured connectionist models to real machines. Some of these issues are the granularity of mapping, the coding of messages, processor allocation, and the tradeoff between load balancing and communication overhead. These issues have to be resolved based on a number of factors including the relative costs of communication, message handling, and computation, and the structural properties of the connectionist model. These issues are discussed in Section 5.

3. CSN -- A CONNECTIONIST SEMANTIC NETWORK

Several years ago we developed CSN, a connectionist semantic network [18] that solves a class of inheritance and recognition problems extremely fast--in time proportional to the depth of the conceptual hierarchy. In addition to offering computational effectiveness, CSN computes solutions to inheritance and recognition problems in accordance with a theory of evidential reasoning that derives from the principle of maximum entropy. The mapping between the knowledge level and the network level is precisely specified and, given a high-level specification of conceptual knowledge, a network compiler can generate the appropriate connectionist network. The solution scales because i) the time to answer questions only depends on the depth of the conceptual hierarchy, not on the size of the semantic memory, and ii) the number of nodes in the connectionist encoding is only linear in the number of concepts, properties, and property-value attachments in the underlying semantic network.

Inheritance refers to the form of reasoning that leads an agent to infer property values of a concept based on the property values of its ancestors. For example, if the agent knows that "birds fly", then given that "Tweety is a bird", he may infer that "Tweety flies". Inheritance may be generalized to refer to the process of determining property values of a concept C by looking up information directly available at C, and if such local information is not available, by looking up property values of concepts that lie above C in the conceptual hierarchy. Recognition is the dual of the inheritance problem. The recognition problem may be described as follows: "Given a description consisting of a set of properties, find a concept that best matches this description". Note that during matching all the property values of a concept may not be available locally. For this reason, recognition may be viewed as a very general form of pattern matching: one in which the target patterns are organized
in a hierarchy, and where matching an input pattern A with a target pattern Ti involves matching properties of A with local properties of Ti as well as with properties that Ti inherits from its ancestors.

A principled treatment of inheritance and recognition is confounded by the presence of exceptions and conflicting information. Such information is bound to arise in any representation that admits default properties. Consider the following situation. An agent believes that most Quakers are pacifist and most Republicans are non-pacifist. She also knows that John is a Republican, Jack is a Quaker, and Dick is both a Quaker and a Republican. Based on her beliefs, it will be reasonable for the agent to conclude that John is, perhaps, a non-pacifist, and Jack is, perhaps, a pacifist. But what should the agent believe about Dick? Is Dick a pacifist or a non-pacifist?

In [18,19] we proposed an evidential formalization of semantic networks to deal with such problematic situations. This formalization leads to a principled treatment of exceptions, multiple inheritance and conflicting information during inheritance, and the best match or partial match computation during recognition. The evidential formulation assumes that partial information about property values of concepts is available in the form of relative frequency distributions associated with some concepts. This information can be treated as evidence during the processing of inheritance and recognition queries. Answering a query involves identifying relevant concepts and combining information (i.e., evidence) available at these concepts to compute the most likely answer. The method of estimating unknown relative frequencies using known relative frequencies is based on the principle of maximum entropy, and can be summarized as follows: If an agent does not know a relative frequency, he may estimate it by ascertaining the most likely state of the world consistent with his knowledge and use the relative frequency that holds in that world.

Let us look at an informal example that illustrates the evidential approach. Consider the conceptual hierarchy shown in Figure 2, which is a generalization of the "Quaker example" mentioned above. The agent knows how the instances of some of the concepts are distributed with respect to their beliefs (pacifism or non-pacifism) and with respect to their ethnic origin (African or European). Answering an inheritance query such as "Is Dick a pacifist or a non-pacifist?" involves the following steps:

1. Determine the projection of the conceptual hierarchy with respect to the query. The projection consists of concepts that lie above the concept mentioned in the query and for which the distribution with respect to the property values mentioned in the query is known. Figure 3 shows the projected conceptual hierarchy for the example query "Is Dick a pacifist or a non-pacifist?"

2. If the projection has only one leaf, the question can be answered directly on the basis of the information available at the leaf. (In the case of our example query, however, the projection has two leaves: QUAKER and REPUB.)

3. If the projection contains two or more leaves, combine the information available in the projection as follows: Combine information available at the leaves of the projection by moving up the projection. A common ancestor provides the reference frame for combining evidence
Figure 2. An example domain. #PERSON refers to the number of persons in the domain. #PERSON[has-belief, PACIFIST] refers to the number of persons who have the value PACIFIST for the property has-belief.

Figure 3. The structure above the dotted line is the projection of the conceptual hierarchy (see previous Figure) that is relevant for determining whether Dick is a pacifist or a non-pacifist.
available at its descendants. The combination process is repeated until information from all the leaves in the projection is combined (at the root). In the example under discussion, the information available at QUAKER and REPUB would be combined at PERSON to produce the net evidence for Dick being a pacifist and Dick being a non-pacifist.
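The sketch below conveys the flavor of this combination step using one simple rule in the spirit of the maximum-entropy estimate described above (likelihood ratios of each leaf relative to the common ancestor are multiplied). Both the rule and the relative frequencies are illustrative assumptions; the chapter's precise evidential formulation, and the actual counts, are those of [18,19] and Figure 2.

```python
# Illustrative evidence combination for "Is Dick a pacifist or a non-pacifist?".
# Frequencies are placeholders, and the combination rule is a simple
# maximum-entropy-style stand-in for the exact formulation in [18,19].

freq = {  # concept -> {property value: relative frequency of that value}
    "PERSON": {"PACIFIST": 0.30, "NON-PAC": 0.70},
    "QUAKER": {"PACIFIST": 0.70, "NON-PAC": 0.30},
    "REPUB":  {"PACIFIST": 0.20, "NON-PAC": 0.80},
}

def combine_at_ancestor(leaves, ancestor, values):
    """Combine leaf evidence at the common ancestor by multiplying each leaf's
    likelihood ratio relative to the ancestor."""
    scores = {}
    for v in values:
        score = freq[ancestor][v]
        for leaf in leaves:
            score *= freq[leaf][v] / freq[ancestor][v]
        scores[v] = score
    return scores

# Dick is both a Quaker and a Republican; PERSON is the common ancestor.
evidence = combine_at_ancestor(["QUAKER", "REPUB"], "PERSON",
                               ["PACIFIST", "NON-PAC"])
print(evidence, "->", max(evidence, key=evidence.get))
```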
3.1. An evidential representation language

Knowledge in the semantic memory is expressed in terms of a partially ordered set of concepts (i.e., an IS-A hierarchy of concepts) together with a partial specification of the property values of these concepts. The set of concepts is referred to as CSET, the partial ordering as

≤ π_max. In other words, a τ-and node behaves like a temporal and. Upon becoming active, such a node produces an output pulse similar to the input pulse. A threshold, n, associated with a τ-and node indicates that the node will fire only if it receives n or more synchronous pulses.

τ-or node: A τ-or node with threshold n becomes active on receiving n or more inputs during an interval π_max. Upon becoming active, a τ-or node produces an output pulse of width π_max. Thus a τ-or node behaves like a temporal or. The model also makes use of inhibitory modifiers that can block the flow of activation along a link--a pulse propagating along an inhibitory modifier will block a synchronous pulse propagating along the link it impinges upon.
SThe "btu" in p-btu stands for binary threshold unit.
Figure 5. An example encoding of rules and facts.
4.4. Encoding of rules and facts

We discuss a simple example to illustrate how SHRUTI encodes rules and facts and performs inference. We suppress details pertaining to the mapping of multiple antecedent rules, the type hierarchy, and the dynamic representation of multiple instantiations of a predicate. A detailed description can be found in [23,15]. The network shown in Figure 5 encodes the following rules and facts:
1. ∀x,y,z give(x,y,z) ⇒ own(y,z)
2. ∀x,y buy(x,y) ⇒ own(x,y)

3. ∀x,y own(x,y) ⇒ can-sell(x,y)
4. give(John, Mary, a-Book)

5. ∃x buy(John, x)

6. own(Mary, a-Ball)

The encoding makes use of two types of nodes mentioned above: ρ-btu nodes (depicted as circles) and τ-and nodes (depicted as pentagons). The encoding of more complex rules
Figure 6. Activation trace for the query can-sell(Mary, a-Book)?.
also makes use of the τ-or nodes mentioned above. In Figure 5, inhibitory modifiers are shown as links ending in dark blobs. Each entity in the domain is encoded by a ρ-btu node. This acts as a "focal" node for this entity. The features of an entity and the roles it fills in various relations are encoded by linking its focal node to appropriate nodes in the network. Each n-ary predicate P is encoded by a "focal" cluster consisting of a pair of τ-and nodes and n ρ-btu nodes, one for each of its n arguments. One of the τ-and nodes is referred to as the enabler, e:P, and the other as the collector, c:P. In Figure 5 enablers point upward while collectors point downward. The enabler e:P becomes active whenever the system is being queried about P. On the other hand, the system activates the collector c:P of a predicate P whenever the system wants to assert that the current dynamic bindings of the arguments of P follow from the knowledge encoded in the system. All rules and facts pertaining to a predicate are represented by linking its focal cluster with other focal clusters and nodes as explained below.

A rule is encoded by connecting the collector of the antecedent predicate to the collector of the consequent predicate, the enabler of the consequent predicate to the enabler of the antecedent predicate, and by connecting the arguments of the consequent predicate to the arguments of the antecedent predicate in accordance with the correspondence between
these arguments specified in the rule. A fact is encoded using a τ-and node that receives an input from the enabler of the associated predicate. This input is modified by inhibitory modifiers from the argument nodes of the associated predicate. If an argument is bound to an entity in the fact, then the modifier from such an argument node is in turn modified by an inhibitory modifier from the appropriate entity node. The output of the τ-and node is connected to the collector of the associated predicate.

4.5. The Inference Process

Posing a query to the system involves specifying the query predicate and the argument bindings specified in the query. This is done as follows: Choose an arbitrary point in time--say, t0--as the point of reference for initiating the query (it is assumed that the system is in a quiescent state). The query predicate is specified by activating the enabler of the query predicate with a pulse train of width and periodicity π starting at time t0. The argument bindings specified in the query are communicated to the network as follows: Let the argument bindings in the query involve n distinct entities: c1, ..., cn. With each ci, associate a delay δi such that no two delays are within ω of one another and the longest delay is less than π − ω. Here ω is the allowable jitter (or lead/lag) between synchronously firing nodes, and π is the period of oscillation. Each of these delays may be viewed as a distinct phase within the period t0 and t0+π. Now the argument bindings of an entity ci are indicated to the system by providing an oscillatory spike train of periodicity π, starting at t0 + δi, to ci and all arguments to which ci is bound. This is done for each entity ci (1 ≤ i ≤ n) and amounts to representing argument bindings by the in-phase or synchronous activation of the appropriate entity and argument nodes.

We illustrate the reasoning process with the help of an example. Consider the query can-sell(Mary, a-Book) (i.e., "Can Mary sell a-Book?"). This query is posed by providing inputs to the entities Mary and a-Book, the arguments p-seller and cs-obj, and the enabler e:can-sell, as shown in Figure 6. Observe that Mary and p-seller receive in-phase activation and so do a-Book and cs-obj. Let us refer to the phase of activation of Mary and a-Book as p1 and p2 respectively. As a result of these inputs, Mary and p-seller fire synchronously in phase p1 of every period of oscillation, while a-Book and cs-obj fire synchronously in phase p2 of every period. The node e:can-sell also fires and generates a pulse train of width π. The activations from the arguments p-seller and cs-obj reach the arguments owner and o-obj of the own predicate, and consequently, starting with the second period, owner and o-obj become active in p1 and p2, respectively. At the same time, the activation from e:can-sell activates e:own. At this time, the system has essentially created dynamic bindings for the arguments of predicate own: Mary has been bound to owner, and a-Book has been bound to o-obj. These newly created bindings in conjunction with the activation of e:own can be thought of as encoding the query own(Mary, a-Book) (i.e., "Does Mary own a-Book?")! The τ-and node associated with the fact own(Mary, a-Ball) does not match the query and remains inactive. The activations from owner and o-obj reach the arguments recip and g-obj of give, and buyer and b-obj of buy, respectively.
Thus beginning with the third period, arguments recip and buyer become active in p1, while arguments g-obj and b-obj become active in p2. In essence, the system has created new bindings for the predicates give and buy that together with the activation of the enabler nodes e:give and e:buy can
be thought of as encoding two new queries: give(x, Mary, a-Book) (i.e., "Did someone give Mary a-Book?"), and buy(Mary, a-Book). Observe that now the τ-and node associated with the fact give(John, Mary, a-Book)--this is the τ-and node labeled F1 in Figure 5--becomes active as a result of the uninterrupted activation from e:give. Observe that the inhibitory inputs from recip and g-obj are blocked by the in-phase inputs from Mary and a-Book, respectively. The activation from the τ-and node F1 causes c:give, the collector of give, to become active. The output from c:give in turn causes c:own to become active and transmit an output to c:can-sell. Consequently, c:can-sell, the collector of the query predicate can-sell, becomes active (refer to Figure 6), resulting in an affirmative answer to the query can-sell(Mary, a-Book)?.

Conceptually, the proposed encoding of rules creates a directed inferential dependency graph: Each predicate argument is represented by a node in this graph and each rule is represented by links between nodes denoting the arguments of the antecedent and consequent predicates. In terms of this conceptualization, it should be easy to see that the evolution of the system's state of activity corresponds to a parallel breadth-first traversal of the directed inferential dependency graph. This means that i) a large number of rules can fire in parallel and ii) the time taken to generate a chain of inference is independent of the total number of rules and just equals lπ, where l is the length of the chain of inference and π is the period of activity. The example discussed above assumed that each predicate was instantiated at most once during the inference process. In the general case, where a predicate may be instantiated several times during an episode of reasoning, the time required for propagating bindings from a consequent predicate to antecedent predicate(s) is proportional to kπ, where k is the number of dynamic instantiations of the antecedent predicate to which the bindings are being propagated.
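The traversal view can be captured in a compact sketch of the running example. The dictionary-based representation below is an illustrative abstraction of backward binding propagation over the inferential dependency graph; it deliberately ignores the phase-coded activation and the connectionist encoding itself.

```python
# Backward-chaining sketch: breadth-first traversal of the inferential
# dependency graph, propagating argument bindings from consequent to
# antecedent predicates, one level per period of activity.

RULES = {  # consequent -> [(antecedent, consequent-arg -> antecedent-arg)]
    "can-sell": [("own",  {"p-seller": "owner", "cs-obj": "o-obj"})],
    "own":      [("give", {"owner": "recip",  "o-obj": "g-obj"}),
                 ("buy",  {"owner": "buyer",  "o-obj": "b-obj"})],
}
FACTS = {  # stored long-term facts; None marks an existentially bound argument
    "give": [{"giver": "John", "recip": "Mary", "g-obj": "a-Book"}],
    "buy":  [{"buyer": "John", "b-obj": None}],
    "own":  [{"owner": "Mary", "o-obj": "a-Ball"}],
}

def matches(fact, bindings):
    return all(fact.get(arg) in (val, None) for arg, val in bindings.items())

def query(pred, bindings, max_depth=8):
    frontier = [(pred, bindings)]
    for depth in range(max_depth):                 # bounded depth of reasoning
        next_frontier = []
        for p, b in frontier:
            if any(matches(f, b) for f in FACTS.get(p, [])):
                return True, depth                 # the query collector fires
            for antecedent, corr in RULES.get(p, []):
                next_frontier.append(
                    (antecedent, {corr[a]: v for a, v in b.items() if a in corr}))
        frontier = next_frontier                   # one period per level
    return False, max_depth

print(query("can-sell", {"p-seller": "Mary", "cs-obj": "a-Book"}))  # (True, 2)
```

The answer arrives after a number of levels equal to the length of the inference chain (here two rule applications), independently of how many other rules and facts are stored, mirroring the lπ bound above.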
4.6. Constraints and predictions

SHRUTI identifies a number of representational and processing constraints on reflexive processing in addition to the constraints on the form of rules discussed above. These relate to the capacity of the "working memory" underlying reflexive processing and the bound on the depth of reasoning.
Working memory underlying reflexive processing: Dynamic bindings, and hence dynamic (active) facts, are represented in SHRUTI as a rhythmic pattern of activity over nodes in the LTKB network. In functional terms, this transient state of activation holds information temporarily during an episode of reflexive reasoning and corresponds to the working memory underlying reflexive reasoning (WMRR). Note that WMRR is just the state of activity of the LTKB network and not a separate buffer. Also note that the dynamic facts represented in the WMRR during an episode of reflexive reasoning should not be confused with the small number of short-term facts an agent may overtly keep track of during reflective processing and problem solving. WMRR should not be confused with the short-term memory implicated in various memory span tasks [2]. In our view, in addition to the overt working memory, there exist as many "working memories" as there are major processes in the brain since a "working memory" is nothing but the state of activity of a network.

SHRUTI predicts that the capacity of WMRR is very large but at the same time it is constrained in critical ways. Most proposals characterizing the capacity of the working
memory underlying cognitive processing have not paid adequate attention to the structure of items in the working memory and their role in processing. Even recent proposals such as [10] characterize working memory capacity in terms of "total activation". In contrast, the constraints on working memory capacity predicted by SHRUTI depend not on total activation but rather on the maximum number of distinct entities that can participate in dynamic bindings simultaneously, and the maximum number of (multiple) instantiations of a predicate that can be active simultaneously.

Bound on the number of distinct entities referenced in WMRR: During an episode of reflexive reasoning, each entity involved in dynamic bindings occupies a distinct phase in the rhythmic pattern of activity. Hence the number of distinct entities that can occur as role-fillers in the dynamic facts represented in the working memory cannot exceed π_max/ω, where π_max is the maximum delay between two consecutive firings of cell-clusters involved in synchronous firing and ω equals the width of the window of synchrony, i.e., the maximum allowable lead/lag between the firing of synchronous cell-clusters. If we assume that a neurally plausible value of π_max is about 30 milliseconds and a conservative estimate of ω is around 6 milliseconds, we are led to the following prediction: As long as the number of distinct entities referenced by the dynamic facts in the working memory is five or less, there will essentially be no cross-talk among the dynamic facts. If more entities occur as role-fillers in dynamic facts, the window of synchrony ω would have to shrink appropriately in order to accommodate all the entities. As ω shrinks, the possibility of cross-talk between dynamic bindings would increase until eventually the cross-talk would become excessive and disrupt the system's ability to perform systematic reasoning. The exact bound on the number of distinct entities that may fill roles in dynamic facts would depend on the largest and smallest feasible values of π_max and ω, respectively. However, we can safely predict that the upper bound on the maximum number of entities participating in dynamic bindings can be no more than 10 (perhaps less).

Bound on the multiple instantiation of relations: The capacity of WMRR is also limited by the constraint that at most k dynamic facts pertaining to each relation may be active at any given time (recall that the total number of active dynamic facts can be very high). In general, the value of k need not be the same for all relations; some critical relations may have a higher value of k while some other relations may have a smaller value. The cost of maintaining multiple instantiations turns out to be significant in terms of space and time. For example, the number of nodes required to encode a rule for backward reasoning is proportional to k^2. Thus a system that can represent three dynamic instantiations of each relation may have up to nine times as many nodes as a system that can only represent one instantiation per relation. Furthermore, the worst case time required for propagating multiple instantiations of a relation also increases by a factor of k. In view of the additional space and time costs associated with multiple instantiation, and given the necessity of keeping these resources within bounds in the context of reflexive processing, we predict that the value of k during reflexive reasoning is quite small, perhaps no more than 3.
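The two capacity bounds follow directly from the parameter values assumed in the text:

```python
PI_MAX = 30   # ms: maximum delay between consecutive synchronous firings (assumed in text)
OMEGA = 6     # ms: width of the window of synchrony (conservative estimate in text)
print(PI_MAX // OMEGA)   # -> 5 distinct entities with essentially no cross-talk

K = 3         # instantiations per relation deemed plausible for reflexive reasoning
print(K**2)   # -> 9: relative node cost of encoding a rule for backward reasoning
```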
Bound on the depth of the chain of reasoning: Consider the propagation of synchronous activity along a chain of role ensembles during an episode of reflexive reasoning. Two things might happen as activity propagates along the chain of role ensembles. First,
the lag in the firing times of successive ensembles may gradually build up due to the propagation delay introduced at each level in the chain. Second, the dispersion within each ensemble may gradually increase due to the variations in the propagation delay of links and the noise inherent in synaptic and neuronal processes. While the increased lag along successive ensembles will lead to a "phase shift", and hence binding confusions, the increased dispersion of activity within successive ensembles will lead to a gradual loss of binding information. Increased dispersion would mean less phase specificity, and hence more uncertainty about the role's filler. Due to the increase in dispersion along the chain of reasoning, the propagation of activity will correspond less and less to a propagation of role bindings and more and more to an associative spread of activation. For example, the propagation of activity along a chain of rules such as P1(x, y, z) ⇒ P2(x, y, z) ⇒ ... ⇒ Pn(x, y, z) due to a dynamic fact P1(a, b, c) may lead to a state of activation where all one can say about Pn is this: there is an instance of Pn which involves the entities a, b, and c, but it is not clear which entity fills which role of Pn.

In view of the above, it follows that the depth to which an agent may reason during reflexive reasoning is bounded. Thus an agent would be unable to make a prediction (or answer a query)--even when the prediction (or answer) logically follows from the knowledge encoded in the LTKB--if the length of the derivation leading to the prediction (or the answer) exceeds this bound. The actual value of this bound depends on values of appropriate physiological parameters. At this time we do not have the relevant data to arrive at a precise value, but we expect this bound to be rather low.

Henderson [9] has developed an on-line parser for English using a SHRUTI-like architecture whose speed is independent of the size of the grammar and which can recover the structure of arbitrarily long sentences as long as the dynamic state required to parse the sentence does not exceed the capacity of the parser's working memory. The parser models a range of linguistic phenomena and shows that the constraints on the parser's working memory help explain several properties of human parsing involving long distance dependencies, garden path effects, and our limited ability to deal with center-embedding. This suggests that the working memory constraints resulting from SHRUTI have implications for other rapid processing phenomena besides reasoning.

5. MAPPING SHRUTI ONTO REAL MACHINES
Several aspects of SHRUTI suggest that a knowledge representation and reasoning system obtained by mapping SHRUTI onto real machines would be extremely efficient. These include some basic features of structured connectionism as well as constraints on rules and derivations resulting from SHRUTI. As discussed in Section 4, SHRUTI is a structured connectionist model, so it only requires nodes that perform simple computations. Second, unlike neural network models such as multilayer back-propagation networks and Hopfield nets, a SHRUTI network is sparsely connected. So even though a node may be connected to a large number of nodes, it is connected only to a relatively small percentage of nodes in the network. Consequently, only a fraction of the total number of nodes and links in the network participate in any update step. The most important source of SHRUTI's efficiency, however, is that it imposes several
constraints on the form of rules and derivations. These constraints were discussed in Sections 4.6 and 4.2.

5.1. Mapping SHRUTI onto parallel machines
In order to derive maximum benefit from the parallelism inherent in structured connectionist models, the mapping granularity must be tailored to the computational capabilities of the processors. For most real machines with relatively powerful processors, knowledge-level mapping provides a conceptually simple and flexible partitioning scheme. In this scheme, the knowledge base is partitioned at the relatively coarse level of knowledge elements like predicates, concepts, facts, rules and type hierarchy relations. The simplicity of the messages exchanged between nodes supports the use of interprocessor communication schemes which can handle short message packets very efficiently; complex messages and communication protocols are unnecessary. In particular, the information exchanged by nodes in SHRUTI lies in the synchronization--or lack thereof--of converging spike trains. Given that nodes in our model are required to discriminate only among a small number of distinct phases, the necessary information can be encoded within a few bits. Consequently, information pertaining to a complete knowledge-level element can be encoded within a small message.
5.1.1. Exploiting constraints imposed by SHRUTI
The constraints which SHRUTI imposes on the form of rules and the type of inferences translate into bounds on the system resources and time needed for a reasoning episode--thereby leading to a fast and efficient parallel implementation. These aspects are discussed below:

• The form of rules and facts that can be encoded is constrained. SHRUTI attains its tractability from this fundamental constraint [21,3], which simplifies the network encoding of the knowledge base and makes it possible to perform efficient inference using spreading activation.

• The number of distinct entities that can participate in an episode of reasoning is bounded. This restricts the number of active entities, and hence, the amount of information contained in an instantiation. In turn, this limits the amount of information that must be communicated between predicates.

• Entities and predicates can only represent a limited number of dynamic instantiations. This constrains both space and time requirements.

• The depth of inference is bounded. This constrains the spread of activation in the network and therefore directly affects response time and resource usage.

In mapping SHRUTI onto parallel machines, we exploit these constraints to the fullest extent in order to achieve efficient resource usage and rapid responses with large knowledge bases. Of course, if any of these constraints can be relaxed without paying a severe performance penalty, we would like to obtain a more powerful system by relaxing these constraints. An example of this is the constraint on the number of instantiations of any predicate that can be active simultaneously. Based upon biological considerations, SHRUTI places a limit of around 3 on this number. In our experimentation with the mapping of SHRUTI on the CM-5 we found that the limit could be raised without a major slowdown in inference times.
5.1.2. Other considerations
The following assumptions also influenced the choices made in mapping SHRUTI to real machines:
• Since the knowledge representation system should support any well-formed query, the source of initial activation and the depth of derivation are unknown. In view of this, it is critical to focus on good average performance.

• Since episodes of reasoning are expected to be rather brief, dynamic load balancing on a parallel machine is infeasible. Therefore, the static distribution of the knowledge base should be such that it leads to good dynamic load balancing on average.

• Since the system has to reason with very large knowledge bases, the network size will be large.

5.2. SHRUTI-CM5
In this section, we briefly describe the design and implementation of SHRUTI-CM5, an SPMD asynchronous message-passing parallel reflexive reasoning system developed on the Connection Machine CM-5. A more detailed description of SHRUTI-CM5 can be found in [14].

5.2.1. The Connection Machine CM-5
The Connection Machine model CM-5 [27] is an MIMD machine consisting of anywhere from 32 to 1024 powerful processors (in principle, the CM-5 architecture can support up to 16K processors). Each processing node is a general-purpose computer which can execute instructions autonomously and perform interprocessor communication. Each processor can have up to 32 megabytes of local memory (this amount is based on 4-Mbit DRAM technology and may increase as DRAM densities increase) and optional vector processing hardware. The processors constitute the leaves of a fat tree interconnection network, where the bandwidth increases as one approaches the root of the tree. A low-latency control network provides tightly coupled communications including synchronization, broadcasting, global reduction and scan operations. A high-bandwidth data network provides loosely coupled interprocessor communication. The virtual machine emerging from a combination of the hardware and operating system consists of a control processor acting as a partition manager, a set of processing nodes, facilities for interprocessor communication and a UNIX-like programming interface. A typical user task consists of a process running on the partition manager and a process running on each of the processing nodes. Though the basic architecture of the CM-5 supports MIMD style programming, it is most often used to run SPMD (Single Program Multiple Data) style programs [29]. Both data parallel (SIMD) and message-passing programming on the CM-5 use the SPMD
model. If the user program takes a primarily global view of the system--with a global address space and a single thread of control--and processors run in synchrony, the operation is data parallel; if the program enforces a local, node-level view of the system and processors function asynchronously, the machine is used in a more MIMD fashion. We shall consistently use "SPMD" to be synonymous with the latter mode of operation. In this mode, all communication, synchronization and data layout are under the program's explicit control.

5.2.2. The design of SHRUTI-CM5
We outline the design and functionality of SHRUTI-CM5. A detailed discussion and justification of the design choices may be found in [14].

The Knowledge Base. SHRUTI-CM5 supports all of SHRUTI's legal rules, facts and type hierarchy relations. The type hierarchy can encode both is-a relations (which explicate the subconcept-superconcept relations between entities) and labeled relations which specify that two entities are related by a relation R.

Queries. SHRUTI-CM5 supports all the legal queries described in Section 4.1, and this includes queries posed to the rule base and/or the type hierarchy. With an appropriate front-end, the system can handle multiple queries, logical combinations of queries, and other variations.

Granularity of Mapping. The individual processing elements on the CM-5 are full-fledged SPARC processors. A subnetwork in the connectionist model can therefore be implemented on a processor using appropriate data structures and associated procedures without necessarily mimicking the detailed behavior of individual nodes and links in the subnetwork. We therefore use knowledge-level partitioning in mapping SHRUTI onto the CM-5. This decision is also motivated by the fact that all the information pertaining to a predicate cluster can be encoded within a single CM-5 active message (see below).

Active Messages and Communication. SHRUTI-CM5 uses CMMD library functions [28] for broadcasting and synchronization, while almost all interprocessor communication is achieved using CMAML (CM Active Message Library) routines. CMAML provides efficient, low-latency interprocessor communication for short messages [28,31]. Active messages are asynchronous (non-blocking) and have very low communication overhead. A processor can send off an active message and continue processing without having to wait for the message to be delivered to its destination. When the message arrives at the destination, a handler procedure is automatically invoked to process the message. The use of active messages improves communication performance by about an order of magnitude compared with the usual send/receive protocol. The main restriction on such messages is their size--they can only carry 16 bytes of information. However, given the constraints on the number of entities involved in dynamic bindings (at most about 10), there is an excellent match between the size of an active message and the amount of variable binding information that needs to be communicated between predicate instances during reasoning as specified by SHRUTI. SHRUTI-CM5 exploits this match to the fullest extent.

Processor Allocation. SHRUTI-CM5 supports two major processor allocation schemes: random processor allocation and q-based processor allocation.
while (termination condition not met)
    /* propagate activation in the type hierarchy */
    spread bottom-up activation;
    spread top-down activation;
    /* propagate activation in the rule base */
    reverse-propagate collector activation;
    check fact matches;
    propagate enabler activation by rule-firing;
Figure 7. The activation propagation loop.
Random processor allocation involves allocating knowledge elements to random processors; q-based processor allocation allows the user to control the fraction of related elements that are assigned to the same processor. Random processor allocation is actually a special case of q-based allocation with q = 1/N, where N is the number of processors in the parallel machine.

5.2.3. Encoding the knowledge base
The knowledge base is encoded by presenting rules and facts expressed in a human-readable syntax like that of first-order logic. Knowledge encoding in SHRUTI-CM5 is a two-part process:
1. A serial preprocessor on a workstation reads the input knowledge base and partitions it into as many chunks as there are processors in the CM-5 partition.

2. During the parallel knowledge base encoding phase, each processor on the CM-5 independently and asynchronously reads and encodes the knowledge base fragment assigned to it by the preprocessor.

This input mode parallelizes knowledge encoding and is well suited for large-scale knowledge bases. In addition, SHRUTI-CM5 also provides a direct input mode which by-passes the serial preprocessor, and is useful when small knowledge base fragments need to be added to an existing (large) knowledge base. Input processing results in allocating each knowledge base element (predicates, concepts, facts, rules and is-a relations together constitute knowledge base elements) to a single processor, and encoding the knowledge using suitable internal data structures. The SHRUTI network is internally encoded by a series of pointers which serve to link predicate and concept representations. A specially designated server processor keeps track of processor assignments. The system is designed in such a manner that the server does not become a bottleneck--it is accessed only when posing a query (and when encoding a knowledge base in direct input mode), and does not come into play during the reasoning process.
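The text does not spell out the allocation rule itself, so the following is only a plausible reading of the q parameter (all names are ours, not SHRUTI-CM5's): with probability q a knowledge element is co-located with a related element that has already been placed, and otherwise it goes to another processor chosen uniformly at random. Note that q = 1/N reduces to purely random allocation, as stated above.

import random

def allocate(elements, related, n_procs, q):
    """Sketch of q-based processor allocation for knowledge base elements.

    elements: iterable of element identifiers, in the order they are read.
    related:  dict mapping an element to the elements it is related to
              (e.g., the predicates mentioned by a rule).
    """
    placement = {}
    for elem in elements:
        # first related element that already has a home, if any
        anchor = next((placement[r] for r in related.get(elem, []) if r in placement), None)
        if anchor is not None and random.random() < q:
            placement[elem] = anchor                      # co-locate with a related element
        elif anchor is not None:
            others = [p for p in range(n_procs) if p != anchor]
            placement[elem] = random.choice(others)       # spread to another processor
        else:
            placement[elem] = random.randrange(n_procs)   # no placed relatives: pure random
    return placement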
4.1. Form of rules, facts, and queries
SHRUTI encodes knowledge in terms of entities, types (classes of entities), the member-of relation between entities and types, sub- and super-type relations between types, n-ary predicates, facts and rules. The sub- and super-type relations define a partial ordering of types and thus types are organized in the form of a directed acyclic graph (we will refer to it as the T-DAG). It is assumed that there exists a unique type that is a super-type of all types. For convenience we view entities as the leaves of the T-DAG. Rules have the following form:

∃x1:X1, ..., xp:Xp ∀y1:Y1, ..., yr:Yr [P1(...) ∧ ... ∧ Pn(...) ⇒ ∃u1, ..., ut Q(...)]

wherein an argument of Pi can be an entity or one of the variables xi and yi. An argument of Q can be an entity or one of the variables xi, yi, and ui. The Xis and Yis are types and specify restrictions on the bindings of variables. Facts have the form:

∃x1:X1, ..., xr:Xr ∀y1:Y1, ..., ys:Ys P(...)

where arguments of P are either entities or variables xi and yi. Universally quantified variables are assumed to be distinct (i.e., repeated universally quantified variables are not supported). Observe that facts include ground atomic formulas as well as quantified assertions of the type "All permanent employees receive a bonus" and "there exists an employee who is the boss of all other employees". The form of queries is similar to that of facts except that whereas repeated universally quantified variables can occur in queries, existentially quantified variables are assumed to be distinct. Recently, we have extended SHRUTI to allow negative literals to occur in facts, queries, and rules [24]. The representation language shares the function-free property of Datalog [30] but differs from it in a number of ways. For example, unlike Datalog, our language does not force a dichotomy between extensional and intensional relations and allows both intensional as well as extensional relations to occur in the head of a rule. Thus the language allows rules to define new views (as in Datalog) as well as specify integrity constraints. Furthermore, the occurrence of extensional relations in the head of a rule also allows our system to reformulate a query into a number of alternate, but semantically related, queries. While our language is more general than Datalog in its treatment of relations, it does impose a restriction on the form of rules (see below).

4.2. The class of reflexive inference
A characterization of the class of reflexive inferences is provided in [21]. This characterization is facilitated by the following definitions:

• Any variable that occurs in multiple argument positions in the antecedent of a rule is a pivotal variable.

• A rule is balanced if all pivotal variables occurring in the rule also appear in its consequent (see the sketch following these definitions). Observe that rules that do not contain pivotal variables are also balanced.
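The pivotal-variable and balanced-rule conditions are easy to state operationally. Below is a small illustrative sketch (ours, not SHRUTI's code; the convention that variables are lower-case single-letter symbols is purely for this example):

from collections import Counter

def is_variable(symbol):
    # illustrative convention only: variables look like x, y, z; everything else is an entity
    return len(symbol) == 1 and symbol.islower()

def pivotal_variables(antecedent_args):
    """Variables occurring in more than one argument position of the antecedent P1 ... Pn."""
    counts = Counter(a for a in antecedent_args if is_variable(a))
    return {v for v, c in counts.items() if c > 1}

def is_balanced(antecedent_args, consequent_args):
    """A rule is balanced if every pivotal variable also appears in the consequent."""
    return pivotal_variables(antecedent_args) <= set(consequent_args)

# P1(x, y) ∧ P2(y, z) ⇒ Q(y): the only pivotal variable is y, and it occurs in Q,
# so the rule is balanced.
print(is_balanced(["x", "y", "y", "z"], ["y"]))   # True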
[Plot: query depth (0-10) vs. response time (seconds) for knowledge bases of five sizes (110036 to 550041 elements); SPMD version 07.5, 32-PE CM-5, kb1, stats enabled.]
Figure 8. SHRUTI-CM5 query response time for artificially generated knowledge bases of different sizes.
Table 1
Average response time for retrieval (depth 0) and depth 1 queries posed to artificially generated knowledge bases of different sizes.

KB Size    Depth=0 (Retrieval)    Depth=1
110036     0.8 msec               1.4 msec
219879     1.3 msec               2.5 msec
329871     1.9 msec               4.5 msec
440178     2.8 msec               9.2 msec
550041     3.4 msec               15.7 msec
[Plot: query depth (0-8) vs. time per rule fired (seconds) for knowledge bases of sizes 440178 and 550041; SPMD version 07.5, 32-PE CM-5, kb1, stats enabled.]
Figure 9. Query depth vs. time needed to fire a rule, for kb1 on a 32-PE CM-5.
and facts, and (ii) WordNet, a real-world lexical database [16]. In this section we present these experimental results, which demonstrate the effectiveness of SHRUTI-CM5 as a real-time reasoning system. Most of the experimentation has been carried out on a 32-node CM-5.
5.3.1. Experiments with artificially generated knowledge bases
Part of the experimentation with SHRUTI-CM5 has been carried out using artificially generated knowledge bases. These knowledge bases are constructed automatically from a specification of their gross structure in terms of parameters such as the number of rules, facts, types and entities; the subdivision of the KB into domains; the ratio of inter- and intra-domain rules; and the depth of the type hierarchy. The specified structure is fleshed out with randomly generated predicates, facts, rules, and types. We have used two types of domains: target domains, which correspond to "expert" knowledge about various real-world domains; and special domains, which represent basic cognitive and perceptual knowledge about the world. A typical structured knowledge base consists of several target domains and a small number of special domains. The predicates within each domain are richly interconnected by rules. Predicates in each target domain are also richly connected by rules to predicates in the special domains. Predicates across different target domains, however, are sparsely connected. Predicates in different special domains are left unconnected. The structure imposed on the knowledge base is a gross attempt to mimic a plausible structuring of real-world knowledge bases. This is motivated by the notion that knowledge about complex domains is learned and grounded in metaphorical mappings from (to) some basic perceptually and bodily grounded domains [13]. Queries for experimenting with each artificial knowledge base were generated using the facts associated with predicates, and the inference dependency graph representing
rules interconnecting predicates. For the knowledge bases considered below, more than 500 random queries were generated, of which some 300 were answered affirmatively and had depths ranging from 0 (fact retrieval) to 8. Each successful query was run 5 times and the resulting data was used to evaluate the performance of SHRUTI-CM5. In the experimental results plotted below, points represent average values, point ranges shown are 95% confidence intervals, while the curves shown are piece-wise best fits. Figure 8 plots response time for varying query depths and knowledge base sizes. Table 1 highlights the average response times for retrieval queries (these are queries with a derivation depth of 0) and queries with derivation depths of 1. The knowledge bases used for experimentation had 3 special domains, 150 target domains, about 50000 predicates and 50000 concepts. When the knowledge base size is about 200,000 or smaller, the response time for different query depths is essentially linear since activation is, for the most part, confined to the query domain (the domain in which the query was posed) and special domains. As the size of the knowledge base increases, the curve for each knowledge base size can be partitioned into two parts: for depths up to about 3 the response time increases steeply since all the predicates in the query domain and special domains become completely active. Beyond that, the rate at which the response time increases is lower and depends on the number of active predicates in other target domains. The latter in turn depends on the number of rules that link predicates in different target domains. As the knowledge base size increases, the number of inter-domain rules increases, and hence, the response time increases at a higher rate. This is illustrated by the top three curves in the figure. Figure 9 shows the time needed to fire a rule as a function of query depth for two knowledge bases. In general, it was found that for large knowledge bases and query depths greater than 3, the number of rule firings per second on a 32-node CM-5 converged to a relatively constant value of about 125,000. This suggested that the number of rule firings per second per processor was 125,000/32, i.e., about 3,900. In other words, τ, the time per rule firing per processor on a 32-node CM-5, was found to be about 1/3,900 sec, i.e., 256 μsec. Experimental results showed that this value of τ remained constant over various knowledge base structures and numbers of CM-5 processors. This means that if |LTKB| is the size of the knowledge base, and a fraction r of this knowledge base becomes active during a reasoning episode, then the expected response time of SHRUTI-CM5 on an N-processor CM-5 would be

T = |LTKB| · r · τ / N

The above observation suggests a way of designing a real-time reasoning system. Let Tmax be the maximum response time that the application can tolerate. The observation suggests that one requires:
Tmax ≥ |LTKB| · r · τ / N, i.e., N ≥ |LTKB| · r · τ / Tmax.
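As a concrete reading of this sizing rule, the following small sketch (ours; the active fraction and the response-time budget in the usage comment are made-up illustrative numbers, and only τ ≈ 256 μsec comes from the measurements above) computes the smallest N that meets a given Tmax:

import math

def min_processors(kb_size, active_fraction, tau, t_max):
    # N >= |LTKB| * r * tau / Tmax, rounded up to a whole number of processors
    return math.ceil(kb_size * active_fraction * tau / t_max)

# e.g. a 550041-element knowledge base, an assumed 10% active fraction,
# tau = 256e-6 sec, and a 0.5 sec budget: about 29 processors.
print(min_processors(550041, 0.10, 256e-6, 0.5))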
Now we will show how to verify that B IS-A A when A is a graph predecessor of B. Remember that a pair of processors (U, V) in the graph pairs strand is used to represent a graph pair. The tree pair in the odd processor (U) is used to represent a node S and the graph pair in the even processor (V) is used to represent a node which propagates its tree pair to S. Therefore, we are looking for a pair of processors (U, V) such that the tree pair of A is contained in processor U and the graph pair of B or its tree predecessor is contained in processor V. In the following functions the expression mark!![x] := y means that the pvar mark!! on the processor with the ID x is assigned the value y.

IS-A-VERIFY-2 (B, A: Node): BOOLEAN
; Activate every occurrence of the tree pair of A in the graph pairs strand.
; Set the parallel flag mark!! on the right neighbor processors of the active
; processors.
ACTIVATE-PROCESSORS-WITH pre!! ...

... (for x ≥ 0 and x < ...). T computes the odd position, and we generate the pair (T(x), T(x) + 1) for (Tree-Pair(Ni), V). With these three steps, mapping each predecessor to its corresponding processor ID in the graph pairs strand can be completed. For instance, in Figure 9(a, b), when inserting the arc from H to E, we first activate every tree predecessor of E (C and E itself), but not A. Similarly we activate every graph predecessor of E (just B). Then, we call enumerate!! and assign the numbers 0, 1, and 2, respectively. The tree pairs of C and H are assigned to 1019 and 1020, which are T(0) = Φ - 2·1 and T(0) + 1 = Φ - 2·1 + 1. Similarly the tree pairs of E and H are stored at 1017 and 1018, and the tree pairs of B and H at 1015 and 1016.

We will now present our parallel propagation algorithm. During the propagation, we may have to consider two problem cases caused by redundant pairs. Let a pair (πi μi) be the newly propagated pair and let another pair (πj μj) be a pair at a target node of propagation. In the first problem case, the pair (πj μj) at the target is enclosed by the propagated pair (πi μi), i.e., πi < πj and μj < μi; then the pair (πj μj) must be replaced by (πi μi). In the second problem case, the pair (πj μj) encloses the newly propagated pair (πi μi), i.e., πj < πi and μi ≤ μj, and we do not need to propagate the pair (πi μi) to this target. In the propagation, we replace the redundant pairs just described. The results of this algorithm correspond to the results of Agrawal et al.'s algorithm. The boolean function evenp!! returns TRUE on a processor if the processor's ID is an even number.
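The underlying containment test is the familiar prenum/maxnum interval trick, which the parallel code evaluates on all processors simultaneously. A serial restatement (ours, using a plain dictionary layout rather than the strand encoding) looks like this:

def is_a_tree(pre, maxn, b, a):
    """B IS-A A along tree pairs iff A's [prenum, maxnum] interval encloses B's.

    pre and maxn map each node to the prenum/maxnum of its tree pair."""
    return pre[a] <= pre[b] and maxn[b] <= maxn[a]

def is_a_graph(pre, maxn, graph_pairs, b, a):
    """Graph-predecessor case: some (U, V) entry of the graph pairs strand must hold
    A's tree pair on the odd side and a pair enclosing B's tree pair on the even side.
    In this sketch graph_pairs is a list of ((preU, maxU), (preV, maxV)) tuples."""
    return any(u == (pre[a], maxn[a]) and v[0] <= pre[b] and maxn[b] <= v[1]
               for u, v in graph_pairs)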
Figure 9. Propagation in Double Strand Representation. (a) Original graph; (b) graph after (H, E) is inserted; (c) tree pairs strand and graph pairs strand before propagation; (d) after propagation.
In the algorithm, the expression redundant!! stands for a boolean parallel variable that represents any redundant pairs in the predecessors. As before, in the following functions the expression mark!![x] := y means that the pvar mark!! on the processor with ID x is assigned the value y. Finding tree predecessors will be different from finding graph predecessors because the tree pairs and the graph pairs are stored in a different form in the tree pairs strand and in the graph pairs strand, respectively. The function target-address!! returns addresses of the target processors of the propagated pairs for tree predecessors and graph predecessors uniformly.

Mark-Predecessor(N-Pair, M-Pair: Pair)
; Activate every graph predecessor of a node N which is not a predecessor
; of the node M, where N is a new parent node of C and M is the tree
; parent of the child node C. The nodes N and M have the tree pairs N-Pair
; and M-Pair, respectively. Then set the flag mark!! on the graph predecessors.
ACTIVATE-PROCESSORS-WITH pre!! <=!! prenum(N-Pair) AND!! max!! >=!! maxnum(N-Pair)
        AND!! NOT!!(pre!! <=!! prenum(M-Pair) AND!! max!! >=!! maxnum(M-Pair))
DO BEGIN
    mark!![target-address!!()] := 1        ; set predecessors
END

Note that, due to propagation, redundant pairs could appear in the marked predecessors. As mentioned before, there are two problem cases caused by redundant pairs. In the first case, the problem could occur only in graph pairs because in this step we are dealing with replacing enclosed pairs with enclosing pairs, while in the second case it could occur either in tree pairs or in graph pairs. In the following algorithm, we will present the solution for these problems. For the first case, in the IF!! clause, we examine whether any graph pair in the predecessors is subsumed by the newly propagated pairs, but only check the even processors in the graph pairs strand using evenp!!, because every graph pair is stored at the even processors in the graph pairs strand. In contrast, for the second case, we examine whether any graph pair and any tree pair in the predecessors is subsuming the newly propagated pair, because if that is true, we do not have to propagate the new pair any further. In both cases, the boolean pvar redundant!! is set and additionally, in the first case, the enclosed pair is replaced with the number pair to be propagated.

Redundant-Pair-Elimination(PM-Pair-V: Pair)
; Replace the pair at the target processor with the newly propagated
; pair PM-Pair-V in the first case; set the flag redundant!! on
; the target processor in both cases.
ACTIVATE-PROCESSORS-WITH mark!![target-address!!()] =!! 1
DO BEGIN
    ; check whether it is the first case of redundant pairs
    IF!! (pre!! >=!! prenum(PM-Pair-V) AND!! max!! <=!! maxnum(PM-Pair-V)
          AND!! evenp!!(self-address!!()) AND!! self-address!!() >=!! g-lb)
    THEN
        pre!![self-address!!()] := prenum(PM-Pair-V)     ; replace the prenum
        max!![self-address!!()] := maxnum(PM-Pair-V)     ; replace the maxnum
        redundant!![target-address!!()] := 1             ; set the flag
    ; check whether it is the second case of redundant pairs
    ELSE IF!! (pre!! ...
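Read serially, the two redundancy rules amount to the following sketch (ours, operating on plain (prenum, maxnum) tuples rather than on pvars; the parallel version above applies the same tests on all marked processors at once):

def propagate_pair(new_pair, stored_pairs):
    """Apply the two redundancy rules to the pairs already stored at a target node.

    Returns the updated pair list and whether new_pair still has to be added."""
    pre_n, max_n = new_pair
    updated, add_new = [], True
    for pre_s, max_s in stored_pairs:
        if pre_n <= pre_s and max_s <= max_n:
            # case 1: stored pair is enclosed by the propagated pair -> replace it
            updated.append(new_pair)
            add_new = False
        else:
            updated.append((pre_s, max_s))
            if pre_s <= pre_n and max_n <= max_s:
                # case 2: a stored pair already encloses the new pair -> do not add it
                add_new = False
    return updated, add_new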
T = Tm(N) + Tn(N) + P(C) · (Tr(N, C) + Tp(N)).    (5)
As before, Tm, Tn, Tr, and Tp can be regarded as constants because, within a constant processor set size, these do not grow with increasing knowledge base size. Then, we can simplify the run-time complexity to O(P). Similarly, the run-time of the propagation algorithm in the Grid Representation is
T' = Td(N) + P(C) · (Tr(N, C) + Tp(N)).    (6)

By the same reasoning, it can be simplified to O(P).
[Plot: utilization of processors (%) vs. number of nodes (0-2000), for GR and DSR.]

Figure 10. Processor Utilization
6.2. Experimental results
In this section we present experimental results of the parallel subclass verification and number pair propagation algorithms for the GR and DSR. The experiments were done on a Connection Machine CM-5, which makes use of groups of virtual processors executing serially on real processors. There are 32 real processors on the NPAC CM-5. Every processor emulates the activities of at least 32 virtual processors. The CM-5 [47] is programmed in *LISP, a dialect of Common LISP, by mapping parallel variables (pvars) onto distinct processors.
6.2.1. Experiments with random data
For our experiments, we are using a random generator for DAGs. The following parameters are supplied as input to this generator: the number of nodes (N), the branching factor of each node (B), and the depth (D). Preliminary experiments with several values of B and D showed that the computation time seems to be unaffected by B and D. This should be expected, as we have eliminated the explicit representation of the IS-A links from the outset. Therefore, we limited D = 9, ..., 12 and set B = 5. The effect of graph size on run-time was determined for both representations. The number of nodes was varied from 25 to 2000. Graphs have approximately 20% graph arcs, e.g., a graph with 2000 nodes has about 400 graph arcs. For the GR, assume that k is 8. Then 1K processors are required for 1 to 128 nodes, 2K for 1 to 256, ..., 16K for up to 2K nodes. Processor utilization is very low, only up to 18%.
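The generator itself is not listed in the text; a toy stand-in that respects the same parameters (number of nodes N, branching factor B, depth D) might look as follows. All structural choices here, such as spreading nodes evenly over levels, are ours:

import random

def random_dag(n_nodes, branching, depth):
    """Generate edges of a random DAG with roughly the given depth and branching."""
    # spread the nodes over `depth` levels; edges only point to the next level,
    # so the result is acyclic by construction
    levels = [[] for _ in range(depth)]
    for node in range(n_nodes):
        levels[node % depth].append(node)
    edges = []
    for level, next_level in zip(levels, levels[1:]):
        for parent in level:
            for child in random.sample(next_level, min(branching, len(next_level))):
                edges.append((parent, child))
    return edges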
[Plot: subclass verification run time (seconds) vs. number of nodes (0-2000), for GR and DSR.]

Figure 11. Run Time for Subclass Verification
We also determined that the maximum number of actually used rows in the GR was 5. Our experiments with random graphs showed that the number of graph pairs increased at approximately the same rate as the number of nodes. For instance, 48 graph pairs are generated in a 100-node graph, ..., 900 graph pairs in a 2000-node graph. In our experiments, typically, the number of graph pairs is limited to less than half the number of tree pairs. Accordingly, for the DSR, approximately 1K processors are required for graphs with up to 0.5K nodes, 2K processors for graphs with up to 1K nodes, and 4K for up to 2K nodes, with very high processor utilization (up to 99%). In Figures 11-13, the run-times jump at two critical points, namely at the node numbers 500 and 1000. These jumps are due to the doubling of the number of allocated virtual processors, i.e., from 1K to 2K and 2K to 4K. As the number of real processors stays the same, every real processor has to double the number of operations it performs. The DSR shows better performance than the GR in terms of both the amount and utilization of processors with increasing knowledge base size (Figure 10). For the comparative run-time evaluation of DSR and GR with various sizes of the knowledge base, we implemented the graph insertion, link insertion, and subclass verification algorithms. The test data for link insertion makes a number of simplifying assumptions which are based on problems described in [23]. Figures 11-13 show the results of experiments with various sizes of the knowledge base. The figures show the run times in seconds.
[Plot: graph insertion run time (seconds) vs. number of nodes, for GR and DSR.]
The term n - 1 in the denominator represents the largest number of terms which can be different from zero, because there is at least one processor i with Ti = Tpe(n) (Tpe(n) = max_{i=1,...,n} Ti). LI(n) is an absolute measure, not taking into account the best or worst possible balance that can be obtained for a particular set of tasks. It ranges from perfect balance (LI(n) = 0), which means that all processors finish working at the same time, to maximal imbalance (LI(n) = 1), where exactly one processor is busy during the execution. In order to assess the effects of slackness, a general modeling of search is used. For the simulations, OR-partitioning is assumed. This means that a given problem is solved as soon as one of the tasks is solved. Also, it is assumed that m = n × spp, and that each processor obtains spp tasks. Each computation (consisting of the treatment of a set of tasks) is assumed to be constrained by a user-determined runtime limit Tlimit. Therefore, a computation terminates as soon as a task terminates successfully, or after all tasks failed, or when Tlimit is reached, whichever occurs first. The probability of a task terminating successfully is specified by p (i.e., a task terminates unsuccessfully with probability 1 - p). The runtimes for all tasks are independently drawn from a uniform distribution which is equal to 0.5 from 0 to 2 and equal to 0 otherwise, resulting in a mean value of 1. Previous experiments [18] have shown little qualitative dependence of the load balancing effect on the particular choice of runtime distribution. The runtime limit is important for the simulation for two reasons. First, externally triggered termination by a runtime limit influences load balance. Early system termination (compared to the average runtime of a task) renders load imbalance unlikely. In the simulation, all tasks have a mean runtime of one unit of time, and runtime limits are issued as multiples of the mean. Second, the actual runtime of a task which is terminated by Tlimit becomes irrelevant. Such a task might as well have an infinite runtime. Thus, in those cases where tasks are terminated due to Tlimit, the runtime limit allows the extrapolation of the results to distributions which have larger runtimes (for those tasks) and therefore have a larger variance. The system model does not take into account communication or time-sharing overhead. This omission, however, is unlikely to affect the qualitative results: neglecting the communication overhead is tolerable since communication occurs only at the beginning (distribution of tasks) and at the end (collection of results and global termination) of the parallel computation, and thus mainly depends on the number and physical size of tasks, but not the duration of their computation. The time-sharing overhead is usually low for contemporary operating systems, as long as no swapping occurs. Thus, all processes running on the same processor should fit into the main memory, thereby limiting the degree of slackness. Another limitation for the degree of slackness arises from the time-sharing delay for a task. For these reasons the investigated slackness is limited to 16.
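The modeling just described is straightforward to reproduce. Below is a minimal Monte Carlo sketch of it (ours, not the authors' simulator): task runtimes are uniform on [0, 2], each task succeeds with probability p, the computation stops at the first success, when all tasks have failed, or at Tlimit, and LI(n) is computed as sum_i (Tpe(n) - Ti) / ((n - 1) * Tpe(n)), which matches the description of the denominator above.

import random

def simulate_li(n=32, spp=4, p=0.01, t_limit=10.0, trials=2000):
    """Estimate the average load imbalance LI(n) under the SPS-model assumptions."""
    total = 0.0
    for _ in range(trials):
        # spp tasks per processor: (runtime, succeeded?) pairs
        procs = [[(random.uniform(0.0, 2.0), random.random() < p) for _ in range(spp)]
                 for _ in range(n)]
        # the earliest success anywhere bounds the whole computation (OR-partitioning)
        stop = t_limit
        for tasks in procs:
            t = 0.0
            for runtime, ok in tasks:
                t += runtime
                if ok:
                    stop = min(stop, t)
                    break
        # busy time of each processor up to the stopping time
        busy = []
        for tasks in procs:
            t = 0.0
            for runtime, ok in tasks:
                t = min(t + runtime, stop)
                if ok or t >= stop:
                    break
            busy.append(t)
        t_pe = max(busy)
        total += sum(t_pe - b for b in busy) / ((n - 1) * t_pe) if t_pe > 0 else 0.0
    return total / trials

print(simulate_li())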
Three different modeling variants are presented. There are two options for obtaining different slackness values for a parallel search. The first is to generate different numbers of tasks for a fixed number of processors, and the second is to choose different numbers of processors for a fixed number of tasks. In the first case, another important distinction arises: how does the increasing degree of partitioning influence the overall probability to find a solution (within the given time limit)? In practice, increasing partitioning is likely to increase the overall probability for success. This is modeled by assuming that the probability of success per task p remains constant for different slackness values (i.e., for different numbers of tasks). For the assumption that increasing partitioning does not improve the overall success probability, the value of p is decreased for increasing slackness such that the overall success probability remains constant (and equal to the value obtained for spp = 1). The load imbalance values for these two options are shown in Figure 2. Figure 3 shows the results for the case where the number of processors is varied instead of the number of tasks. The load imbalance for spp = 1 is the same in all plots, because in this case in all plots 32 processors and 32 tasks are used. All figures give results for a low success probability p = 0.01 (higher success probability values result in lower load imbalance, since the computation terminates earlier).
[Two plots: load imbalance LI vs. Tlimit (0-30), p(success) = 0.01, for various slackness values spp; left plot: constant overall success probability; right plot: increasing overall success probability.]
Figure 2. Load imbalance LI for uniform runtime distribution for constant (left plot) and increasing (right plot) overall success probability.
The left plot in Figure 2 shows that even in the (worst) case that no advantage is gained by increased partitioning, the load imbalance can be cut down by more than one half with a slackness of 16 tasks per processor. A much larger reduction occurs in the right plot, where LI becomes negligible for spp ≥ 8. This is due to the increasing overall success probability, which increases the chances for an early OR-parallel termination. The load imbalance reduction found in Figure 3 is about in between the two previous cases. The experiments show that for all modeling variants, slackness leads to a noteworthy reduction of the load imbalance.
[Plot: load imbalance LI vs. Tlimit (0-30), p(success) = 0.01, for various slackness values.]
Figure 3. Load imbalance LI for uniform runtime distribution with 32 tasks. Different slackness values are obtained by varying the number of processors from 2 to 32.
In [18] a set of experiments regarding slackness has been reported, focusing on the case where the overall success probability increases with increasing slackness. They show that similar results are obtained for quite different runtime distributions (exponential, uniform, and triangle). This suggests that the load imbalance reduction due to slackness is largely independent of the shape of the distribution. Those experiments also show that for success probabilities p > 0.01, small slackness values soon reduce load imbalance to negligible values.
4. Worst Case Analysis
Regarding worst-case behavior, it may be suspected that an extremely unbalanced search space will render the SPS-model inappropriate as compared to a dynamic scheme, which can adapt to such a situation by dynamically producing new tasks as necessary. Although this may be the case in many specific situations, the following considerations surprisingly reveal that the SPS-model performs quite well compared to dynamic partitioning schemes in some straightforward worst case scenarios. In the following, two general worst case situations are described and analyzed. In the first situation it is assumed that, regardless of the parallelization method, the generation of tasks is always possible as desired, but a maximal work imbalance among the tasks occurs. In the second situation no parallelism is inherent in the problem. All discussions are based on the assumption that no solutions exist. This is necessary to make a comparison between the approaches possible. If solutions are allowed, the performance of a parallelization approach depends critically on the number of solutions and their location in the search space (in relation to the parallel search strategy).

Situation 1: Maximal Work Imbalance among Tasks. Regardless of the parallelization model employed, assume that at any time of the parallel computation, all but one of the
currently available tasks terminate after one search step. The particular task that does not terminate spans the remaining search space, and may be used to generate new tasks depending on the parallelization model. Regarding runtime and accumulated processing time, this scenario describes the worst possible situation that can occur for the SPS-model. It leads to the generation of m = n × spp tasks (where n is the number of processors), which are distributed once among the processors. Thus, a runtime delay of O(n) (for processor-to-processor communication, assuming that spp tasks fit into one message package) and an accumulated processing time overhead of O(m) (for communication and task handling) occurs. As a benefit, n × spp search steps are performed in parallel. Assuming that a single search step takes much less time than distributing a task, the overhead will usually outweigh the benefit from parallel execution. Furthermore, the main work remains to be done by a single task: no further partitioning occurs after the first task generation (in the SPS-model, parallel execution is avoided altogether in situations where the problem does not provide enough inherent parallelism to generate the desired number of tasks; this advantage is ignored in the analysis). Assuming the search terminates after k search steps (with k >> m), the constant amount of work performed in parallel will be insignificant, and therefore the runtime will be approximately the same as without parallelization. However, it is important to note that no adverse effects (besides the overhead for task generation and distribution of O(m)) occur either. In particular, the single remaining task can be executed without slowdown arising due to the parallelization. Thus, while no speedup is achieved, no relevant slowdown occurs either. The increase in the accumulated processing time depends on the ratio between the time for initializing the processing of a task and the time for performing a single search step. Assuming a low initialization overhead, the accumulated processing time will remain close to the sequential runtime. Let us now turn to the behavior of dynamic search space partitioning approaches. A dynamic partitioning scheme has provisions to generate tasks (or obtain tasks from other processors) whenever some number of processors becomes idle. Thus, in the given scenario, all processors will continue to participate in the computation until termination. Therefore, in contrast to the SPS-model, the number of search steps executed in parallel is not constant, but increases as the search continues. This, in fact, is usually considered as the primary advantage of dynamic partitioning compared to static partitioning. While of course there are situations where this ability will pay off, in the given scenario this is in fact disadvantageous:

• There is a permanent need for communication. Depending on the parallel hardware platform, the frequent necessity for communication can significantly decrease the performance of the system. In a multiuser system, this can seriously affect other users as well.

• Assuming that a communication operation together with the task processing initialization takes significantly longer than the execution of a single search step (a realistic assumption for most search-based systems), a large fraction of the accumulated processing time is spent on overhead rather than useful search.
• There is no single task which runs undisturbed; unless a specific task leads to the generation of all other tasks, fast processing of such a "main task" is not possible.

In the described scenario, typical dynamic partitioning schemes will actually run longer than an SPS-based system, at a cost of at least n times that of an SPS-based system. Both regarding speedup (S(n) = T1/Tpe(n)) and productivity (P(n) = T1/Tap(n), where Tap(n) = Σ_{i=1..n} Ti is the accumulated processing time of all involved processors), the described scenario is significantly less harmful for the SPS-model than for dynamic partitioning schemes. In particular, the potential for negative effects of dynamic partitioning schemes in multiuser time-sharing systems requires precautions which, in effect, can only be achieved by reducing the dynamic partitioning ability of the system, thereby moving towards a static model. Of course, scenarios where dynamic partitioning performs better than the SPS-model exist. For example, if most of the tasks that are generated by the SPS-model terminate immediately unsuccessfully, and the remaining tasks could be partitioned further into equal slices of work, a dynamic partitioning scheme would be advantageous. In fact, this particular situation maximizes the advantage of dynamic partitioning over static partitioning. Altogether, the performance of a parallelization scheme depends not only on the structure of the search space. The size of tasks, the relationship of communication time to task runtime, and the given runtime limit all influence the adequacy of a partitioning approach, and make an absolute comparison between static and dynamic schemes difficult. The advantages of static partitioning over dynamic partitioning are mainly due to the initial exploration phase lying at the base of the SPS-model. Of course, one may argue that such a phase can be used at the beginning for any dynamic partitioning scheme as well, combining the best of both approaches. This, indeed, can lead to interesting systems. A simulation study which investigates the use of an initial task distribution for a dynamic partitioning scheme is found in [15]. In this study, the use of an initial task distribution increased the efficiency Erel by approximately 15% when more than about 50 processors were used. It thereby improved the scalability of the employed dynamic partitioning scheme. In general, it is difficult to determine in which cases the additional implementation effort and computational overhead for dynamic partitioning pay off. In practice the unnecessary employment of dynamic partitioning may lead to increased computational costs for little or no benefit (if not slowdown).
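Stated as code, the two metrics (and the gain used later in Section 5) are simply the following; the function names are ours, and whether T1 is the one-processor runtime or a sequential reference decides between relative and absolute metrics, as noted in Section 5:

def speedup(t1, proc_times):
    # S(n) = T1 / Tpe(n), with Tpe(n) the finishing time of the slowest processor
    return t1 / max(proc_times)

def productivity(t1, proc_times):
    # P(n) = T1 / Tap(n), with Tap(n) the accumulated processing time of all processors
    return t1 / sum(proc_times)

def computational_gain(t1, proc_times, r=1.0):
    # G = P * S^r for a tuning parameter r > 0 (see Section 5.1)
    return productivity(t1, proc_times) * speedup(t1, proc_times) ** r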
Situation 2: No Inherent Parallelism. The worst case with respect to distributing work among processors is that the search tree consists of a single path (i.e., no alternatives for search steps exist). Thus, neither static nor dynamic partitioning is possible. Static partitioning simply performs sequential search in that case, since no parallel tasks are generated. Assuming an appropriate design of the task generation phase, the overhead for potential task generation is negligible. The performance of dynamic partitioning approaches depends on the particular scheme employed. If parallelism is utilized only after an initial set of tasks has been generated, no parallelization will occur and the performance is comparable to the SPS-model. Otherwise, if parallel processes are started independently of the current availability of tasks, a significant overhead may occur.
Performance Stability. Another important issue regarding performance is its stability. In multiuser systems the time for a communication operation is not fixed, but depends on the interconnection traffic caused by other users. Similarly, the load of individual processors may change due to processes unrelated to the intended computation. Both factors considerably influence the communication timing (i.e., the order in which communication events occur). In the SPS-model, a communication delay in the worst case causes a prolongation of the computation and an increase of the computational costs, both of which are bounded by the order of the delay. The reason that the change is bounded is that the search space partitioning is independent of the order in which the communication events occur. In dynamic partitioning schemes, however, the generation of tasks, and therefore the exploration of the search space, is usually dependent on the order in which tasks are processed. As a consequence, large variations in the runtime may occur (see for example [12]). In general, changes in the communication overhead will lead to an undesirable system operation mode for dynamic partitioning approaches, because such systems are usually tuned to optimize their performance based on knowledge about typical communication performance parameters (for work on such optimization for a parallel theorem prover, see [10]).
Summary of Worst Case Considerations. The fact that in many particular situations dynamic partitioning schemes provide better flexibility for adapting to irregularly structured search problems is obvious. However, the previous discussions show that the overhead incurred with this flexibility leads to a nontrivial trade-off between static and dynamic schemes which is frequently overlooked. In general, the following statements can be made. Disadvantageous scenarios for the SPS-model lead to a strictly limited overhead with acceptable upper bounds. If the worst case occurs, there is no benefit from parallel computation. However, as the possible speedup decreases for more unfortunate situations, the accumulated processing time decreases as well. The computational gain (product of speedup and productivity) for the SPS-model achieves acceptable values even in worst case scenarios, for problems of sufficient size. This is not necessarily the case for dynamic partitioning schemes, for which the worst case overhead cannot be bounded easily. A benefit from parallel computation may not only be lost, it may actually lead to a significant performance decrease for this and, in multiprogramming environments, other computations. This happens because, unlike for the SPS-model, the accumulated processing time increases. As a result the computational gain can drop to very low values in the worst case for dynamic partitioning approaches, regardless of the problem size.
5. Appropriateness of the SPS-Model
This section consists of three parts: a discussion of the SPS-model with respect to important design issues arising for the construction of a parallel system; a list of system properties which make the application of the SPS-model particularly interesting; and a summary of the advantages and disadvantages that can arise from using the SPS-model.
5.1. Discussion of Suitability
In general, the adequacy of a particular parallelization approach depends on many issues. A decision among different approaches requires a detailed specification of the intended goal of parallelization. In order to specify the intended use of parallelism sufficiently, at least the issues described below need to be clarified. For each item, first a discussion regarding parallelism in general is given, followed by remarks regarding the SPS-model.

• Which type of computing problems are to be solved?
In general: Performance can be optimized for the average case or for specific problems. Especially for parallel systems, this distinction makes a significant difference. A large enough problem size commonly leads to a good scalability of most techniques, and therefore tends to pose little difficulty. The treatment of comparatively small problems, however, often leads to unacceptable overheads due to the unprofitable initialization of a large parallel system. Since the size of a search problem usually is not known prior to its solution, this can result in the inadequacy of a parallelization technique if average case performance improvement is desired.
SPS-model: The SPS-model avoids parallelization overhead for problems which are small enough to be solved during the task generation phase. This feature automatically adapts to the machine size: more processors require more tasks, which leads to more search during task generation; thereby more problems become solvable before parallel processing is initiated (in effect, problems need to be more difficult in order to be parallelized on larger machines). Furthermore, it is possible to determine heuristically the number of processors to be utilized, based on information about the search space growth gathered during the task generation phase. It can thereby support the processing of medium-size problems, keeping the initialization costs at a level suitable for the given problem.

• Which type of parallel machine will be used?
In general: The particular strengths and weaknesses of the intended hardware platform (e.g., memory model, topology, communication bandwidth) significantly influence the suitability of a parallelization technique. Techniques with few requirements on hardware properties are less sensitive to this issue, while some parallelization approaches can be realized efficiently only on specific hardware platforms.
SPS-model: The SPS-model has particularly little communication and no memory model requirements, and is therefore suited to all types of MIMD machines, including workstation networks.

• What is the intended degree of portability to different parallel machines?
In general: If no portability is required, an implementation can be optimized for combining the parallel model and the hardware. However, such tight bounds limit the lifetime of the system severely. Due to the unavailability of a common machine model for parallel computers (such as the von Neumann model for sequential computers), successor models often feature major design changes. In that case, a specifically tuned implementation is bound to its first platform, and may soon be outperformed by an improved sequential system on the latest sequential hardware.
SPS-model: Due to its modest requirements on communication performance, the SPS-model can be realized efficiently using portable implementation platforms such as PVM [2], p4 [4], or MPI [7]. This makes an implementation available on a large number of parallel machines (e.g., PVM is available on the Intel iPSC/2, iPSC/860, and Paragon; the Kendall Square Research KSR-1; the Thinking Machines CM-2 and CM-5; the BBN Butterfly; the Convex C-series; the Alliant FX/8; etc.) as well as on workstation networks.

• What is the minimal performance increase expected?
In general: A given desired increase (e.g., S > 100) constrains the minimal number of processors, and thereby defines a minimal degree of required scalability. Scalability of search-based systems is application-specific, and can be hard to predict for a particular parallelization method.
SPS-model: The SPS-model simplifies scalability prediction and analysis due to the possibility of simple and meaningful sequential simulation before a parallel system is built.

• What is the desired trade-off between speedup and productivity?
In general: The adequacy of a parallelization technique depends on the relative importance of speedup S = T1/Tpe(n) and productivity P = T1/Tap(n) (the formulas represent relative or absolute speedup and productivity, depending on the definition of T1: if T1 equals the runtime of the parallel system on one processor, relative metrics are obtained; if T1 equals the runtime of a sequential reference system, absolute metrics result). This can be expressed by choosing an appropriate tuning parameter r in the definition of a computational gain G = P × S^r, r ∈ ℝ+.
SPS-model: Taking into account the accumulated processing time has been one of the driving motivations for the development of the SPS-model. The static task generation avoids much of the communication and management overhead required for dynamic partitioning schemes.

• Are there requirements regarding worst case performance?
In general: The worst case runtime and accumulated processing time vary significantly for different parallelization approaches, and depend heavily on a number of system aspects.
SPS-model: As shown in Section 4, the SPS-model has better worst case performance than many other approaches.

• Does the search-based system depend on iterative-deepening search?
In general: Iterative-deepening is a wide-spread search technique. For parallel systems employing this method, the maintenance of deepening balance is desirable. Dynamic partitioning schemes in principle allow control of the balance, however at impractical costs (the associated communication problem is NP-complete).
SPS-model: In [17] it is shown that slackness together with an iterative-deepening diagonalization and a delayed successor start strategy are effective techniques for reducing deepening imbalance (i.e., the differences in the iterative-deepening levels worked on at different processors) and load imbalance, without requiring explicit communication.
• Are there system-external constraints?
In general: In multiuser systems, constraints due to the computing environment arise. The number of processors available for computation may not be constant for all computations. This arises either when an independent system resource management splits a parallel machine into fixed partitions for individual users, or when the load on individual nodes discourages additional processing load.
SPS-model: It is possible to take such constraints into account within the SPS-model, and to adjust the search partitioning to the available number of processors.
5.2. Beneficial Properties for Application
There are several properties of search-based systems which render them suitable for the application of the SPS-model:
• low probability of load imbalance
Tasks which only span a small search space (and do not contain a solution) cause load imbalance and therefore should be rare. This may be known empirically for an application, or may be ensured for individual tasks by look-ahead during task generation.
• fast task generation
A fast task generation reduces the serial fraction of the computation caused by the task generation phase. Useful for this are
- a high task generation rate: only a small amount of search is required to produce the next task;
- an efficient task representation: tasks can be stored and transferred with little overhead. This is generally desirable for all parallelization schemes, because it reduces the parallelization overhead.
5.3. SPS-Model: Consequences of its Application
A summary of the consequences of the application of the SPS-model is given below. Appropriate usage of the model promises the following properties:
• little communication is required
As a consequence,
- the communication overhead is bounded and small, which is important for achieving good productivity;
- there are few requirements on hardware communication performance, so the SPS-model is well suited to the use of general purpose parallel programming libraries and networks of workstations;
- the complexity of communication is low, which simplifies the implementation and maintenance effort required.
• informed dynamic decisions about parallelization and search
Based on information gathered during the task generation phase, heuristic decisions can be made regarding parallelization (e.g., appropriate number of processors, appropriate slackness) and search control (e.g., appropriate iterative-deepening increments). See Section 2.1.
• global search optimization
The use of AND-partitioning can lead to a reduction of the amount of search to be done (see Section 2.1).
• little modification of the target system
The search part of a sequential system does not need to be modified. The necessary extensions consist of a means to produce tasks, and the ability to start the search given by a task.
• efficient integration of AND-parallelization
The use of static task generation before parallel execution allows control over the overhead induced by AND-parallelism.
• meaningful simulation for any number of processors
This is possible because the parallel exploration of the search space does not depend on the communication timing. For a simulation, all generated tasks can be processed in sequence. The results can be combined to obtain results for any slackness between spp = 1 (number of processors n = m) and spp = m (n = 1).
• combination of different search-based systems
Different systems can be combined by using one system for task generation, and several copies of one or more different systems for task processing. In particular, this allows a combination of forward- and backward-chaining search.
In cases where the SPS-model is inappropriate, the following consequences of its application may occur:
• no speedup compared to the original sequential system (worst case)
However, no significant slowdown occurs either (see also Section 4).
• the task generation phase becomes a bottleneck
For the generation of a large number of tasks, or in systems with a low task generation rate, the overall computation is slowed down due to the serial fraction caused by task generation. The potential for a bottleneck can be reduced by distributing tasks as soon as they become available and by hierarchical task generation.
• the task distribution phase becomes a bottleneck
This can be avoided by distributing tasks during the generation phase (see previous item) or by hierarchical distribution.
• a performance discontinuity occurs
Whenever the initial exploration phase finishes immediately before a solution could
be found by continued sequential search, the communication time to distribute the tasks and collect the results is wasted. In this particular case, a runtime increase compared to the sequential runtime occurs. The runtime decreases as the amount of further search required after the generation phase increases.
6. Related Work
Work related to the SPS-model can be grouped into three topics, namely research on static scheduling, usage of a task pool, and bulk synchronous programming.
Static Scheduling. A traditional static scheduling problem is, given a set of independent tasks and the runtime for each, to find a distribution of the tasks such that the overall runtime is minimized. The tasks considered typically form an independent-AND set, i.e., all given tasks need to be solved, but independence is assumed. This scheduling problem is well known to be NP-complete [8], and research in this area focuses on efficient approximations to optimal solutions [3,16]. Unfortunately, for search-based systems the runtime is usually neither known nor can it be estimated accurately. Work on static scheduling without given knowledge about task runtimes is rare. However, interesting research related to the SPS-model is found in [11]. The authors investigate static task distribution for minimizing the runtime of a set of independent-AND tasks. In their model, each processor repeatedly obtains a batch of k subtasks from a central queue and executes them, until all subtasks have been processed. For the case k = m/n (all m tasks dealt out at once to the n processors), k becomes equivalent to spp. The authors conclude that for many distributions a static allocation provides reasonable performance relative to an optimal scheme.
Task Pool. An alternative to the processor sharing of several tasks in the SPS-model is to choose a pool model: only one task is initially distributed to each processor, and the remaining tasks are stored in a central pool. Whenever a processor finishes its task unsuccessfully, it obtains another task from the pool. Obviously, such a scheme obtains a better load distribution due to its dynamic reactivity. For this, it requires additional communication and control. The expected performance of such a scheme for OR-parallel search has been theoretically analyzed in [14], for three different runtime distributions of tasks (constant, exponential, and uniform) and the probability of success as a variable. The case of constant runtime for all tasks (not realistic for search problems) is, in fact, identical for the pool model and the SPS-model, if serial execution of the tasks at a single processor is chosen in the SPS-model. The pool model (as well as serial execution of tasks at one processor for the SPS-model), however, is inappropriate for many applications of search. The reason is that any practical search has to be terminated after some finite time, i.e., a runtime limit is imposed on each search process to ensure termination. For difficult search problems, many tasks take longer to terminate than can be allotted by any reasonable means. In fact, for undecidable problems termination itself cannot be guaranteed. Thus, in a model of computation where some tasks are delayed until some other tasks terminate, the very tasks which may allow a solution to be found quickly (which is the spirit of OR-partitioning) may be executed prohibitively late (or even never). Therefore a pool model is inappropriate.
Bulk Synchronous Programming. The SPS-model also bears some relation to the bulk synchronous programming (BSP) model developed by L.
Valiant [19,20]. In this model,
the programmer writes a program for a virtual number of processors v, which is then executed on a machine with n processors. According to the model, n should be much smaller than v (e.g. v = n log n). This slackness can then be exploited by compilers in order to optimize scheduling and communication. As for the SPS-model, a surplus of tasks is used to achieve a load-balancing effect. However, the BSP model is intended as a basic computational model for parallel processing, and assumes compiler and operating system support. It allows communication and dependencies between the tasks, and assumes that all tasks need to be finished for completing a job (AND-parallelism). While the BSP model is a model for general computations, the SPS-model is focused specifically on search-based systems, where more specific assumptions apply.
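To make the contrast between static partitioning with slackness and a central task pool concrete, the short simulation below runs both schemes on randomly drawn task runtimes. It is an illustration only and rests on simplifying assumptions not taken from the paper: tasks are independent, their runtimes are known in advance, communication costs are ignored, and OR-parallel termination on the first successful task is not modeled.

    import random

    def static_slackness_makespan(runtimes, n):
        # SPS-style static allocation: deal the m tasks round-robin onto n
        # processors before execution; each processor works through its batch.
        loads = [0.0] * n
        for i, t in enumerate(runtimes):
            loads[i % n] += t
        return max(loads)

    def central_pool_makespan(runtimes, n):
        # Dynamic pool: an idle processor fetches the next task from a queue.
        finish = [0.0] * n
        for t in runtimes:
            p = finish.index(min(finish))   # earliest idle processor
            finish[p] += t
        return max(finish)

    if __name__ == "__main__":
        random.seed(0)
        n, spp = 16, 8                      # processors and slackness
        m = n * spp                         # number of generated tasks
        runtimes = [random.expovariate(1.0) for _ in range(m)]
        print("static:", round(static_slackness_makespan(runtimes, n), 2),
              "pool:", round(central_pool_makespan(runtimes, n), 2))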
7. Summary
In this paper, the parallelization scheme static partitioning with slackness has been presented, independent of a particular application. The advantages and disadvantages of a sequential initial search phase for task generation have been discussed. The potential drawback of the model, namely the occurrence of load imbalance due to tasks with finite (and small) search spaces, can be effectively reduced by slackness. A worst case analysis revealed that, unlike for other parallelization approaches, the worst case for the SPS-model is bounded and moderate. Typical design issues occurring in the construction of a parallel search-based system have been considered; then advantageous system properties for applying SPS parallelization and resulting properties of a parallel system were presented. Finally, research related to the SPS-model has been discussed.
REFERENCES
1. K.A.M. Ali and R. Karlsson. The MUSE Or-parallel Prolog Model and its Performance. In Proceedings of the 1990 North American Conference on Logic Programming. MIT Press, 1990.
2. A. Beguelin, J. Dongarra, A. Geist, R. Manchek, and V.S. Sunderam. A User's Guide to PVM: Parallel Virtual Machine. Technical Report ORNL/TM-11826, Oak Ridge National Laboratory, 1991.
3. K.P. Belkhale and P. Banerjee. Approximate Algorithms for the Partitionable Independent Task Scheduling Problem. In Proceedings of the 1990 International Conference on Parallel Processing, 1990.
4. R. Butler and E. Lusk. Users Guide to the p4 Programming System. Technical Report ANL-92/17, Argonne National Laboratory, 1992.
5. W.F. Clocksin and H. Alshawi. A Method for Efficiently Executing Horn Clause Programs using Multiple Processors. New Generation Computing, (5):361-376, 1988.
6. W. Ertel. Parallele Suche mit randomisiertem Wettbewerb in Inferenzsystemen, volume 25 of series DISKI. Infix-Verlag, 1993.
7. Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. 1994.
8. M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, 1979.
9. M. Huber. Parallele Simulation des Theorembeweisers SETHEO unter Verwendung des Static Partitioning Konzepts. Diplomarbeit, Institut für Informatik, Technische Universität München, 1993.
10. M. Jobmann and J. Schumann. Modelling and Performance Analysis of a Parallel Theorem Prover. In ACM SIGMETRICS and PERFORMANCE '92, International
Conference on Measurement and Modeling of Computer Systems, Newport, Rhode Island, USA, volume 20, pages 259-260, SIGMETRICS and IFIP W.G. 7.3, 1992. ACM.
11. C.P. Kruskal and A. Weiss. Allocating Independent Subtasks on Parallel Processors. IEEE Transactions on Software Engineering, SE-11(10):1001-1016, 1985.
12. E. Lusk and W. McCune. Experiments with ROO, a Parallel Automated Deduction System. In Parallelization in Inference Systems, pages 139-162. Springer LNAI 590, 1992.
13. E.L. Lusk, W.W. McCune, and J. Slaney. ROO: A Parallel Theorem Prover. In Proceedings of CADE-11, pages 731-734. Springer LNAI 607, 1992.
14. K.S. Natarajan. Expected Performance of Parallel Search. In International Conference on Parallel Processing, pages 121-125, 1989.
15. J. Schumann and M. Jobmann. Analysing the Load Balancing Scheme of a Parallel System on Multiprocessors. In Proceedings of PARLE 94, LNCS 817, pages 819-822. Springer, 1994.
16. B. Shirazi, M. Wang, and G. Pathak. Analysis and Evaluation of Heuristic Methods for Static Task Scheduling. Journal of Parallel and Distributed Computing, (10):222-232, 1990.
17. C.B. Suttner. Parallelization of Search-based Systems by Static Partitioning with Slackness. Dissertation, Institut für Informatik, Technische Universität München, 1995. Published as volume 101 of series DISKI, Infix-Verlag, Germany.
18. C.B. Suttner and M.R. Jobmann. Simulation Analysis of Static Partitioning with Slackness. In Parallel Processing for Artificial Intelligence 2, Machine Intelligence and Pattern Recognition 15, pages 93-105. Elsevier, 1994.
19. L.G. Valiant. A Bridging Model for Parallel Computation. Communications of the ACM, 33(8), August 1990.
20. L.G. Valiant. General Purpose Parallel Architectures. In J. Van Leeuwen, editor, Handbook of Theoretical Computer Science, chapter 18. Elsevier Science Publishers, 1990.
Christian Suttner
Christian Suttner studied Computer Science and Electrical Engineering at the Technische Universität München and the Virginia Polytechnic Institute and State University. He received a Diploma with excellence from the TU München in 1990, and since then he has been working as a full-time researcher on parallel inference systems in the Automated Reasoning Research Group at the TU München. He received a Doctoral degree in Computer Science from the TUM in 1995. His current research interests include automated theorem proving, parallelization of search-based systems, network computing, and system evaluation. Together with Geoff Sutcliffe, he created and maintains the TPTP problem library for automated theorem proving systems and designs and organizes theorem proving competitions. Home Page: http://wwwjessen.informatik.tu-muenchen.de/personen/suttner.html
Problem Partition and Solvers Coordination in Distributed Constraint Satisfaction
P. Berlandier a and B. Neveu b
a ILOG Inc., 1901 Landings Drive, Mountain View, CA 94043, USA
b INRIA - CERMICS, 2004, Route des Lucioles, B.P. 93, 06902 Sophia-Antipolis Cedex, FRANCE
This paper presents a decomposition-based distributed algorithm for solving constraint satisfaction problems. The main alternatives for distributed constraint satisfaction are reviewed. An algorithm using a partition of the constraint graph is then detailed, together with its parallel version. Experiments on problems made of loosely connected random constraint satisfaction problems show its benefits for under-constrained problems and for problems with a complexity in the phase transition zone.
1. Introduction
Many artificial intelligence problems (e.g. in vision, design or scheduling) may take the shape of constraint satisfaction problems (CSPs) [1]. Being NP-complete, these problems are in need of any computational means that could speed up their resolution. Parallel algorithms and parallel hardware are good candidates to help in this matter. A second motivation for distribution is that there exist CSPs where the structure of the constraint graph is naturally close to a union of independent components. We are indeed especially interested in problems that result from the connection of subproblems by a global constraint. In such problems, the partition into subproblems is given and the challenge is to use that partition in a distributed search algorithm in order to speed up the resolution. Such problem structures happen to be quite common in configuration or design problems, where the whole problem consists of an assembly of subparts, each having its own constraints and being connected by a few global constraints on decision variables such as cost, weight or volume. In this paper, we will present a distributed algorithm for solving CSPs over several processors using a partition of their constraint graph.
2. Sources of Parallelism in CSP Resolution
The usual way of solving a CSP alternates problem reduction and variable instantiation [2]. There are several opportunities for introducing some amount of parallelism in
these two processes. We give a brief review of these opportunities below.
Parallelism in Problem Reduction
Problem reduction is usually achieved by enforcing some level of partial consistency such as arc or path consistency [3] or by using a more limited filtering process such as forward checking [2]. Some operations that are required to enforce partial consistency can be performed independently. First, checking consistency usually means controlling which possible value combinations are allowed by a constraint. These tests can obviously be conducted in parallel. This source of parallelism is easily exploited in most constraint systems by the use of operations on bit vectors [4]. A coarser grain parallelism is the parallel propagation of constraints: several constraints (or groups of connected constraints) are activated independently. Some synchronization mechanism is needed as different constraints might share the same variable. However, the fact that constraint propagation results in a monotonic reduction of the problem may simplify the synchronization. Parallel propagation [5-8] has received a great deal of attention. In particular, several procedures to achieve parallel arc-consistency have been devised, sometimes exhibiting a supra-linear speedup. However, [9] exhibits some theoretical restrictions on the gain that can be expected from this kind of parallelism.
Parallelism in Variable Instantiation
Variable instantiation is a tree search process and the independent exploration of the different branches leads to or-parallelism. The introduction of or-parallelism in search procedures has been studied thoroughly, especially inside [10] but also outside [11] the logic programming community. An experiment in exploiting or-parallelism in the CHIP constraint logic programming language is described in [12].
Parallelism based on the Constraint Graph
Another way to parallelize the resolution is to partition the variable set and to allocate the instantiation of a subset of the variables to a process. The difficulty of this approach is to synchronize the different processes, which are not independent: there exist constraints between variables and conflicts may occur between the processes. A solution is to order the processes [13,14]. Another way to resolve this difficulty is to have a central process that is responsible for inter-process conflict resolution. We will detail our approach in the next section. It takes place in that graph-based distributed framework, with centralized control.
3. Distributed Constraint Satisfaction
Binary constraint problems can be represented as graphs where variables are mapped to the vertices and constraints are mapped to the edges. For instance, the constraint graph associated with the well-known N-queens constraint problem is a clique: each variable is connected to all the others. But this is not the case for most real-world problems where it is more common to have loosely connected clusters of highly connected variables. Such
Figure 1. Constraint-connected subproblems versus variable-connected subproblems
Figure 2. Solving subproblems independently
almost independent subproblems could thus be solved almost independently with parallel processes and their results combined to yield a global solution. This is the approach proposed in the paper. The most important questions are: (1) How can the problem be partitioned "well"? (2) How can the efforts of the separate solvers be coordinated? As shown in figure 1, a problem can be partitioned along the constraints or along the variables. In the first case, the subproblems can be solved independently right from the start. But, when partial solutions are found, they must be tested against interface constraints. If some of these constraints are not satisfied, the resolution of the subproblems connected by these constraints has to be resumed. If the partition is made along the variables, we must start by looking for a consistent instantiation of the interface variables. After such an instantiation is found, each subproblem can be solved with complete independence as illustrated by figure 2. If they all succeed in finding a solution, we have a global solution. If no solution can be found for one of the subproblems, the instantiation of the interface variables should be changed and the resolution of the subproblems concerned by this change has to be resumed.
Let us suppose that the problem that we want to solve has n variables and is divided into k subproblems, each with p variables (so that n = kp). Each variable has a domain of d elements. Using a partition along the constraints, the worst case time complexity for finding the first solution is bounded by (d^p)^k, which gives us a complexity of O(d^(kp)). Now, using a partition along the variables and supposing that there are m interface variables, the worst case time complexity for finding the first solution is bounded by k·d^m·d^p, which yields a complexity of O(d^(m+p)). Of course, if the problem is not partitioned (i.e. k = 1, p = n and m = 0), we find the same usual complexity in both cases, i.e. O(d^n). Conversely, when there are several subproblems, the theoretical complexity of the resolution with constraint-connected subproblems is much higher than with variable-connected subproblems. This is why we have chosen to investigate the latter mode of partition, keeping in mind that this choice is valid if and only if one solution is sufficient for our needs.
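For a feel of the difference between the two bounds, the short computation below plugs in problem sizes similar to those used later in the experiments (three subproblems of 25 variables, domain size 7, three interface variables); the numbers are only illustrative.

    # Worst-case bounds for the two partitioning modes (illustrative only).
    k, p, d, m = 3, 25, 7, 3   # subproblems, vars per subproblem, domain size, interface vars

    constraint_connected = d ** (k * p)       # O(d^(kp))
    variable_connected = k * d ** (m + p)     # O(k * d^(m+p)), i.e. O(d^(m+p))

    print(f"constraint-connected bound: {constraint_connected:.3e}")
    print(f"variable-connected bound:   {variable_connected:.3e}")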
4. Definitions
Definition 1 (constraint problem) A binary constraint satisfaction problem P is a pair of sets (V, C). The set V is the set of variables {v1, ..., vn}. Each variable vi is associated with a finite domain di where its values are sought. The set C is the set of constraints. Each constraint is a pair of distinct variables {vi, vj}, noted cij, which is associated with a relation rij that defines the set of allowed value pairs.
Definition 2 (solution) A value assignment is a variable-value pair noted (vi, xi) where xi ∈ di. A substitution σE is a set of value assignments, one for each element in the set of variables E. A solution to the problem P = (V, C) is a substitution σV that satisfies all the constraints in C.
Definition 3 (constraint subproblem) A subproblem Pi of P is a triple (Ii, Vi, Ci). The set Ii ⊂ V is the set of interface variables, Vi ⊂ V is the set of own variables, and Ci ⊂ C is the set of own constraints. Given an instantiation of its interface variables Ii, the solution to Pi is a substitution σVi that satisfies all the constraints in Ci. A subproblem has the following properties:
Property 1 The sets of interface and own variables are disjoint, i.e.: Ii ∩ Vi = ∅.
Property 2 The set of own constraints is the maximal subset of the problem constraints that connect one own variable with either an own or an interface variable:
Ci = {cab ∈ C | (va, vb) ∈ (Vi × Ii) ∪ (Ii × Vi) ∪ (Vi × Vi)}
Definition 4 (partition) A k-partition Πk of a problem P is a set of subproblems {P1, ..., Pk} that have the following properties:
Property 3 The sets of own variables of the subproblems are disjoint, i.e.: V1 ∩ ... ∩ Vk = ∅.
Property 4 Each variable of the problem is either an interface or an own variable of a subproblem:
(V1 ∪ ... ∪ Vk) ∪ (I1 ∪ ... ∪ Ik) = V   and   (V1 ∪ ... ∪ Vk) ∩ (I1 ∪ ... ∪ Ik) = ∅
Definition 5 (interface problem) The interface problem PΠ of a partition Πk is a subproblem of P for which the own variable set is the union of the interface variable sets of all the subproblems of the partition:
VΠ = I1 ∪ ... ∪ Ik   and   IΠ = ∅
Property 5 The set of constraints of the interface problem is the maximal subset of the problem constraints that connect any two of its own variables (that is, any two interface variables from the other subproblems):
CΠ = {cab ∈ C | (va, vb) ∈ VΠ × VΠ}
Theorem 1 Given a partition of P and a solution σVΠ to its interface problem, the union of σVΠ with any solution for all the subproblems of the partition constitutes a solution to the problem P.
Proof: From the properties of a partition and the definition of the interface problem, it is easy to show that the union σVΠ ∪ σV1 ∪ ... ∪ σVk instantiates each variable in V once and only once (from properties 1, 3 and 4) and that this union satisfies all the constraints in C = CΠ ∪ C1 ∪ ... ∪ Ck (from properties 2 and 5). The union is therefore a solution to the whole problem P = (V, C). □
This theorem allows us to implement the resolution of a constraint problem as the resolution of an interface problem followed by an independent resolution of k subproblems. The following two sections briefly describe how to compute a problem partition and what coordination to implement between the parallel solvers.
5. Problem Partition
Given k available processors and a problem P, the goal is to find a k-partition that best combines the following (possibly conflicting) desiderata:
1. The complexity of solving the different subproblems should be about the same.
2. The number of variables of the interface problem should be kept to a minimum.
Of course, a complete exploration of the partition space is out of the question. We thus turn to a heuristics-based algorithm and use the classic K-way graph partitioning algorithm presented in [15]. For our purposes, the cost of an edge is inversely proportional to the degree of satisfiability of the constraint represented by this edge. Therefore, a constraint that is easy to satisfy (i.e. with a high degree of satisfiability) has a low cost and will be preferred as a separating edge between two subproblems. The weight of a vertex is proportional to the domain size of the variable represented by this vertex. The set of interface variables is chosen as the minimal set of vertices that is connected to all the crossing edges determined by the partitioning algorithm.
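The following sketch shows one way the edge costs and vertex weights described above could be derived from a problem description before handing the graph to a K-way partitioner. It is illustrative only: the tuple-counting estimate of satisfiability and the data layout are assumptions of this sketch, not the authors' implementation, and the call to an actual Kernighan-Lin style partitioner is left out.

    # Build a weighted constraint graph for heuristic K-way partitioning.
    # Edge cost ~ 1 / degree of satisfiability; vertex weight ~ domain size.

    def satisfiability(relation, dom_i, dom_j):
        # Fraction of value pairs allowed by the constraint (0 < s <= 1).
        allowed = sum(1 for x in dom_i for y in dom_j if (x, y) in relation)
        return allowed / float(len(dom_i) * len(dom_j))

    def weighted_graph(domains, constraints):
        # domains: {var: list of values}
        # constraints: {(vi, vj): set of allowed (x, y) pairs}
        vertex_weight = {v: len(dom) for v, dom in domains.items()}
        edge_cost = {}
        for (vi, vj), rel in constraints.items():
            s = satisfiability(rel, domains[vi], domains[vj])
            # high satisfiability -> low cost -> preferred as a separating edge
            edge_cost[(vi, vj)] = 1.0 / max(s, 1e-6)
        return vertex_weight, edge_cost

A graph partitioner such as the Kernighan-Lin heuristic of [15] would then be run on this weighted graph to produce the k vertex sets and the crossing edges.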
1  while a new instantiation for the interface variables can be found
2    instantiate all the interface variables
3    for each subproblem Pi
4      solve Pi as an independent CSP        {variants 1, 2 and 3}
5      in case of success:
6        store the partial solution          {variants 2 and 3}
7      in case of failure:
8        store the nogood                    {variants 2 and 3}
9        return to step 1
10 a solution is found; end.
11 no solution can be found; end.
Figure 3. Decomposition-based Search Algorithm Schema
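A compact executable rendering of the schema in Figure 3 is sketched below (with the variant 2/3 behaviour of caching nogoods and partial solutions). The helpers solve_subproblem and the enumeration of interface instantiations are placeholders assumed by this sketch, not part of the paper; interface instantiations are given as sequences of (variable, value) pairs.

    # Decomposition-based search (DS), schematic version of Figure 3.
    def decomposition_search(interface_instantiations, subproblems, solve_subproblem):
        nogoods = set()    # interface instantiations known to fail for a subproblem
        partials = {}      # cache: (subproblem index, interface values) -> partial solution
        for sigma in interface_instantiations:            # steps 1-2
            if any((i, key) in nogoods for i, key in project(sigma, subproblems)):
                continue                                  # known to fail, skip
            solutions, failed = [], False
            for i, key in project(sigma, subproblems):    # step 3
                if (i, key) in partials:
                    solutions.append(partials[(i, key)])
                    continue
                sol = solve_subproblem(subproblems[i], dict(key))   # step 4
                if sol is None:                                     # failure
                    nogoods.add((i, key))                           # step 8
                    failed = True
                    break
                partials[(i, key)] = sol                            # step 6
                solutions.append(sol)
            if not failed:
                return dict(sigma), solutions             # step 10: global solution
        return None                                       # step 11: no solution

    def project(sigma, subproblems):
        # Restrict the interface instantiation to each subproblem's interface variables.
        for i, sp in enumerate(subproblems):
            key = tuple(sorted((v, x) for v, x in sigma if v in sp["interface"]))
            yield i, key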
6. A Decomposition-Based Search Algorithm
The previous section was about how to get a good partition of the constraint problem. We have designed an algorithm for finding one solution that uses that problem partition to try to solve each subproblem independently. The main idea of this algorithm is the following: first find a consistent instantiation for the interface variables in VΠ, and solve each subproblem Pi with the so instantiated interface variables. As soon as a subproblem fails, we store the inconsistent instantiation (also known as a nogood) in order not to reproduce it, and we look for a new consistent instantiation for the interface variables. The outline of the algorithm is presented in figure 3. From this outline, we can design three variants, depending on the backtracking schema for the interface variables and on the intermediate results we store. We use the following notations in the description of these three variants:
• d: domain size
• n: total number of variables
• m: number of interface variables
• p: maximum number of variables of a subproblem
• s: maximum number of interface variables of a subproblem
• k: number of subproblems
• Variant 1: Standard backtracking on interface variables.
The first idea is to use a standard chronological backtracking algorithm for the instantiation of the interface variables. No backtracking can occur between two subproblems and the reduction of worst case time complexity is then given by using
the constraint graph partition, as mentioned in section 3. For each instantiation of the interface variables, we have in the worst case one complete tree search for each subproblem, so the total complexity is in O(k·d^p·d^m) = O(k·d^(p+m)), instead of a complexity in O(d^(kp+m)) for a global tree search algorithm. For that variant, the space complexity is the same as for the global tree search algorithm: no additional storage is required.
• Variant 2: Standard backtracking with storage of partial solutions and nogoods.
In this second variant, we store the partial solutions and nogoods to ensure that each subproblem Pi will be solved only once for a given instantiation of its interface variables Ii. The time complexity can then be reduced to O(d^m + k·d^p·d^s) = O(d^m + k·d^(p+s)), the space complexity becoming O(k·p·d^s), s being the maximum number of interface variables of one subproblem.
• Variant 3: Dynamic backtracking on interface variables.
We still store nogoods and partial solutions, as in the previous variant. We can notice that these nogoods can be used to implement a dynamic backtracking schema [16] for the instantiation of the interface variables. When a failure occurs in the solving of Pi, we try to change the instantiation of the interface variables Ii and to keep the instantiation of the other interface variables, which do not take part in the nogood. For this third variant, we have the same time and space complexity as for the preceding one. This variant is the decomposition-based search algorithm (DS) we have implemented.
7. Parallelization
7.1. A Parallel Algorithm with Centralized Control
The DS algorithm we have presented can be naturally transformed into a parallel version PDS with k + 1 processes: one master for the instantiation of the interface variables and k slaves for solving the subproblems. As for the sequential DS algorithm, given a partition Πk, we first have to find a consistent instantiation σ of the interface variables. This instantiation becomes the master's current instantiation. Once this instantiation is found, we can initiate the parallel processing of each subproblem. The master process, which is in charge of the interface instantiation, keeps waiting for the results of the slave processes. The result of the resolution of each subproblem can be either a failure or a success. In case of success, the partial solution of Pi is stored by the master in a dictionary (line 15). The slave in charge of Pi will then wait until the master restarts it later with another instantiation of its interface variables. In case of failure on a subproblem Pi, the subset σi of the current substitution σ corresponding to the interface variables of Pi is stored as a nogood (lines 6 and 15). The current substitution σ of all interface variables is then invalid and it is necessary to find a new combination of values for these variables. In order not to interrupt all the processes and to let some of them keep running, the master does not follow a chronological backtracking schema, but a dynamic backtracking model, as seen in Section 6. Once this new instantiation σ' is found, we have to first interrupt, if they are running,
0  let Q be an empty message queue;
1  until exit from enumeration-end
2    instantiate consistently all the variables in VΠ;
3    if no possible instantiation
4      then exit from enumeration-end with no solution;
5    else if failure in Q
6      then store the nogood;
7    else start in parallel the slave processes
8      whose interface variables have changed;
9    tag reception-end
10     while one of the slaves has not responded
11       if Q is empty, then wait for a message;
12       process the first message;
13       in case of success:
14         store the partial solution;
15       in case of failure: store the nogood;
16       exit from reception-end;
17 exit from enumeration-end with solution;

Figure 4. PDS algorithm for the master process
and then restart, with the new instantiation of the interface variables, the resolution of every subproblem Pj such that (lines 7 and 8): ∃v ∈ Ij, σ(v) ≠ σ'(v). Before restarting the resolution of a subproblem, the existence of a precomputed solution is looked up in the dictionary of partial solutions. The algorithm performed by the master process is presented in figure 4.
7.2. Communications
In this algorithm, the inter-process communications take place only between the master and the slaves:
• communication from master to slaves
The master controls the execution of the slaves. A message is an interruption that is taken into account as soon as the slave receives it. There is no message queue and the only possible message is: "start the resolution of the subproblem with the current instantiation of the interface variables."
• communication from slave to master
Here, a message corresponds to a report about the resolution of a subproblem: either a partial solution was found, or a failure occurred. These messages are put in a queue; the master is not interrupted and will process them in sequence. All the failures will be processed before restarting the slaves with another instantiation of their interface variables (lines 5, 6).
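The message discipline above maps naturally onto a queue-based master/slave skeleton such as the one sketched below. It uses Python's multiprocessing primitives purely for illustration; the original system was implemented on networked workstations, not with this API, and the sketch restarts every slave on each new instantiation instead of only those whose interface variables changed. Interface instantiations are assumed to be lists with one entry per subproblem, and solve is assumed to be a module-level CSP solver.

    # Minimal master/slave skeleton for PDS-style coordination (illustrative).
    import multiprocessing as mp

    def slave(idx, subproblem, cmd_queue, result_queue, solve):
        # Wait for an interface instantiation, solve, report success or failure.
        while True:
            interface_values = cmd_queue.get()
            if interface_values is None:          # shutdown signal
                return
            solution = solve(subproblem, interface_values)
            result_queue.put((idx, interface_values, solution))

    def master(subproblems, instantiations, solve):
        result_queue = mp.Queue()
        cmd_queues = [mp.Queue() for _ in subproblems]
        workers = [mp.Process(target=slave,
                              args=(i, sp, cmd_queues[i], result_queue, solve))
                   for i, sp in enumerate(subproblems)]
        for w in workers:
            w.start()
        try:
            for sigma in instantiations:          # enumeration of interface instantiations
                for i, q in enumerate(cmd_queues):
                    q.put(sigma[i])
                partials, failed = {}, False
                for _ in subproblems:             # one report per slave
                    idx, values, solution = result_queue.get()
                    if solution is None:
                        failed = True             # a nogood; a real master would store it
                    else:
                        partials[idx] = solution
                if not failed:
                    return partials               # global solution assembled from partials
            return None
        finally:
            for q in cmd_queues:
                q.put(None)
            for w in workers:
                w.join()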
7.3. Memory
The storage of the nogoods and of the partial solutions is done by the master. The slaves do not need to store any results. The memory requirement is the same as for the sequential DS algorithm (variant 3).
7.4. Process Load Balancing
Due to synchronization, some processors can become idle. The master waits until one slave gives a result (line 11). A slave that gave an answer (a partial solution or a failure) will wait until the master relaunches it with another instantiation of its interface variables. If the new instantiation of the interface variables of a slave corresponds to an already solved subproblem, the processor remains idle. In future work, we will study a more efficient version of the coordination, where idle processors can perform some search in advance, i.e. solve subproblems with values of their interface variables different from the current values of these variables in the master.
8. Experimental Evaluation
8.1. Tests on random problems
In order to study the performance of these decomposition-based algorithms, we have experimented with randomly generated constraint problems. The generation of random problems is based on the four usual parameters [17]: the number n of variables, the size d of the variables' domains, the constraint density cd in the graph and the constraint tightness ct. The constraint density corresponds to the fraction of the difference in the number of edges between an n-vertex clique and an n-vertex tree. A problem with density 0 will have n - 1 constraints; a problem with density 1 will have n(n - 1)/2 constraints. The constraint tightness ct corresponds to the fraction of the number of tuples in the cross-product of the domains of two variables that are not allowed by the constraint between these two variables. Tightness 0 stands for the universal constraint and tightness 1 for the unsatisfiable constraint. In our experiments, each problem we solved was made up of three constraint problems, generated with the same four parameters, and coupled by three interface variables, v1, v2, v3, one for each subproblem. The coupling constraints were two difference constraints, v1 ≠ v2 and v2 ≠ v3. We compared three algorithms:
1. A global algorithm, which is the classical forward checking with first fail (FC-FF), using a dynamic variable ordering, the smallest domain first.
2. The decomposition-based search algorithm, DS, which corresponds to variant 3 presented in section 6.
3. A simulation of the parallel version of the algorithm, PDS, presented in section 7.
In order to have a fair comparison between these three algorithms, in the DS and PDS algorithms the subproblems were solved with the same FC-FF algorithm used to solve the entire problem in the global algorithm. The parallelism of the third algorithm was simulated in one process, the communication time then being reduced to zero. One simulated processor was assigned to each subproblem.
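A generator for this four-parameter random model can be written compactly; the sketch below is one plausible reading of the description above (density interpolating between a tree and a clique, tightness as the fraction of forbidden value pairs) and is not the authors' generator.

    # Random binary CSP generator for the (n, d, cd, ct) model described above.
    import itertools
    import random

    def random_csp(n, d, cd, ct, rng=random):
        variables = list(range(n))
        domains = {v: list(range(d)) for v in variables}

        # Number of constraints: tree (n-1 edges) + cd * (clique - tree) extra edges.
        n_min, n_max = n - 1, n * (n - 1) // 2
        n_edges = round(n_min + cd * (n_max - n_min))

        # Start from a random spanning tree (keeps the graph connected), then add edges.
        edges = set()
        order = variables[:]
        rng.shuffle(order)
        for i in range(1, n):
            edges.add(tuple(sorted((order[i], rng.choice(order[:i])))))
        remaining = [e for e in itertools.combinations(variables, 2) if e not in edges]
        edges.update(rng.sample(remaining, n_edges - len(edges)))

        # Each constraint forbids a fraction ct of the d*d value pairs.
        constraints = {}
        for e in edges:
            pairs = list(itertools.product(range(d), repeat=2))
            forbidden = set(rng.sample(pairs, round(ct * len(pairs))))
            constraints[e] = set(pairs) - forbidden
        return domains, constraints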
8.2. Experimental Results
We measure the cpu-time for each algorithm. In the case of the simulation of parallelism, the cpu-time we report is the execution time the parallel algorithm would have if we suppose that the communications take no time. The results in figures 5, 6 and 7 report the time (in sec.) for solving sets of 40 problems, each set composed of subproblems generated with the same parameters. All tests reported here were run on problems made up of 3 subproblems composed of 25 variables with a domain size d equal to 7. In figure 5, the constraint density cd is 0.2, and the constraint tightness ct varies between 0.2 and 0.55. In figure 6, the constraint density cd is 0.4, and the constraint tightness ct varies between 0.1 and 0.4. In figure 7, the constraint density cd is 0.6, and the constraint tightness ct varies between 0.05 and 0.4. All these tests were run on a SUN Sparc 5 workstation, in a Le-Lisp implementation, the subproblems being solved with the PROSE [18] constraint toolbox. This toolbox offers some simple primitives to define constraints, build constraint problems by connecting constraints and variables, and solve those problems. The main tool provided for resolution is parametrized tree search. The search can be adapted by selecting a variable ordering method, a value ordering method and a consistency method, which is usually the forward-checking method. We have observed three behaviors, depending on the difficulty of the problems. These behaviors correspond to under-constrained problems, to over-constrained problems and to problems in the phase transition zone [19,20].
Under-Constrained Problems. For the under-constrained problems (ct < 0.4 in figure 5, ct < 0.25 in figure 6, ct < 0.2 in figure 7), the global solution process is quite efficient, and the DS algorithm does not improve much on the global algorithm. The parallelization is quite interesting in these cases: between DS and PDS we have obtained a speedup from 1.8 to 2.8.
Phase Transition. In the phase transition zone (ct = 0.4 in figure 5, ct = 0.28 in figure 6, ct = 0.22 in figure 7), where some problems have few solutions and some are over-constrained, there exist problems that the global algorithm could not solve in 10 minutes. These problems, which were very difficult for the global algorithm, were solved by the decomposition-based algorithm in a few seconds, and the parallelization did not improve the results very much. We can explain that fact by the reduction of complexity due to the DS algorithm, which exploits the special structure of the constraint graph, while the standard forward checking algorithm with the smallest-domain heuristic cannot exploit it.
Over-Constrained Problems. For the over-constrained problems (ct > 0.4 in figure 5, ct > 0.28 in figure 6, ct > 0.22 in figure 7), the uncoupling and the parallelization are not efficient. We can remark that the global forward checking with first fail algorithm focuses automatically on an unfeasible subproblem, and detecting that a subproblem is unfeasible is enough to deduce that the complete problem has no solution: in that case, the decomposition and the parallelization are useless. Furthermore, the decomposition-based algorithm changes the variable
Figure 5. Runtime comparison for 3 connected subproblems with n = 25, d = 7, cd = 0.2
ordering, beginning with the interface variables and this new ordering is often less efficient than the ordering given by the standard minimum domain heuristics.
8.3. Future Experiments
These first results show that our decomposition-based algorithm outperforms the standard forward checking algorithm in the phase transition zone, and that the parallel version is interesting in the zone of under-constrained problems. Further experiments should be done, varying the number of connections between the subproblems and the number of subproblems, in order to confirm these results. We are now close to completing the implementation of our distributed solution process on a multi-computer architecture (i.e. a network of SUN Sparc 5 computers connected by Ethernet and using the CHOOE protocol [21]). We will then be ready to apply our approach to some benchmark problems and to evaluate correctly the cost of the communications through the network [22] and the possible workload imbalance.
REFERENCES
1. E. Tsang. Foundations of Constraint Satisfaction. Academic Press, 1993.
2. B. Nudel. Consistent-labeling problems and their algorithms. Artificial Intelligence, 21:135-178, 1983.
3. A. Mackworth. Consistency in networks of relations. Artificial Intelligence, 8:99-118, 1977.
4. R. Haralick and G. Elliott. Increasing tree search efficiency for constraint satisfaction problems. Artificial Intelligence, 14:263-313, 1980.
5. D. Baldwin. CONSUL: A parallel constraint language. IEEE Software, 6(4):62-69, 1989.
Figure 6. Runtime comparison for 3 connected subproblems with n = 25, d = 7, cd = 0.4
Figure 7. Runtime comparison for 3 connected subproblems with n = 25, d = 7, cd = 0.6
6. P. Cooper and M. Swain. Domain dependence in parallel constraint satisfaction. In Proc. IJCAI, pages 54-59, Detroit, Michigan, 1989.
7. W. Hower. Constraint Satisfaction via Partially Parallel Propagation Steps, volume 590 of Lecture Notes in Artificial Intelligence, pages 234-242. Springer-Verlag, 1990.
8. J. Conrad. Parallel Arc Consistency Algorithms for Pre-Processing Constraint Satisfaction Problems. PhD thesis, University of North Carolina, 1992.
9. S. Kasif. On the parallel complexity of discrete relaxation in constraint satisfaction networks. Artificial Intelligence, 45:275-286, 1990.
10. D. Warren. The SRI model for or-parallel execution of Prolog. In International Symposium on Logic Programming, pages 92-101, 1987.
11. R. Finkel and U. Manber. DIB: A distributed implementation of backtracking. ACM Transactions on Programming Languages and Systems, 2(9):235-256, 1987.
12. P. Van Hentenryck. Parallel constraint satisfaction in logic programming. In Proc. ICLP, pages 165-180, 1989.
13. Q.Y. Luo, P.G. Hendry, and J.T. Buchanan. A hybrid algorithm for distributed constraint satisfaction problems. In Proc. EWPC'92, Barcelona, Spain, 1992.
14. M. Yokoo, E. Durfee, T. Ishida, and K. Kuwabara. Distributed constraint satisfaction for formalizing distributed problem solving. In Proc. of the 12th IEEE International Conference on Distributed Computing Systems, pages 614-621, 1992.
15. W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, 49(1):291-307, 1970.
16. M. Ginsberg. Dynamic backtracking. Journal of Artificial Intelligence Research, 1:25-46, 1993.
17. D. Sabin and E. Freuder. Contradicting conventional wisdom in constraint satisfaction. In Proc. ECAI, pages 125-129, Amsterdam, Netherlands, 1994.
18. P. Berlandier. PROSE : une boîte à outils pour l'interprétation de contraintes : guide d'utilisation. Rapport Technique 145, INRIA Sophia Antipolis, 1992.
19. P. Prosser. Binary constraint satisfaction problems: Some are harder than others. In Proc. ECAI, pages 95-99, Amsterdam, the Netherlands, 1994.
20. B. Smith. Phase transition and the mushy region in constraint satisfaction problems. In Proc. ECAI, pages 100-104, Amsterdam, Netherlands, 1994.
21. F. Lebastard. CHOOE: a distributed environment manager. Technical Report 93-22, CERMICS, Sophia-Antipolis (France), December 1993.
22. P. Crandall and M. Quinn. Data partitioning for networked parallel processing. In Proc. 5th Symposium on Parallel and Distributed Processing, pages 376-380, Dallas, TX, 1993.
Pierre Berlandier
Pierre Berlandier received his Ph.D. in computer science from INRIA and the University of Nice (France) in 1992. His research interests are focused on various aspects of constraint programming such as constraint satisfaction algorithms, consistency maintenance techniques and constraint-based language design. He is now a senior consultant for ILOG Inc. in Mountain View, CA, where he works on constraint-based design and scheduling applications.
Bertrand Neveu
Bertrand Neveu graduated from Ecole Polytechnique and Ecole Nationale des Ponts et Chaussées. He has worked as a research scientist at INRIA Sophia Antipolis since 1984, first on the Smeci expert system shell project. He then led the Secoia research project, which focused on the design of AI tools using object-oriented and constraint-based knowledge representations. He is currently in charge of a constraint programming research team in the CERMICS laboratory in Sophia-Antipolis.
Parallel Propagation in the Description-Logic System FLEX*
Frank W. Bergmann a and J. Joachim Quantz b
a Technische Universität Berlin, Projekt KIT-VM11, FR 5-12, Franklinstr. 28/29, D-10587 Berlin, Germany, e-mail: [email protected]
b Technische Universität Berlin, Projekt KIT-VM11, FR 5-12, Franklinstr. 28/29, D-10587 Berlin, Germany, e-mail: [email protected]
In this paper we describe a parallel implementation of object-level propagation in the Description-Logic (DL) system FLEX. We begin by analyzing the parallel potential of the main DL inference algorithms: normalization, subsumption checking, classification, and object-level propagation. Instead of relying on a parallelism inherent in logic programming languages, we propose to exploit the application-specific potential of DLs and to use a more data-oriented parallelization strategy that is also applicable to imperative programming languages. Propagation turns out to be the most promising inference component for such a parallelization. We present two alternative PROLOG implementations of parallelized propagation on a loosely coupled MIMD (Multiple Instruction, Multiple Data) system, one based on a farm strategy, the other based on distributed objects. Evaluation based on benchmarks containing artificial examples shows that the farm strategy yields only poor results. The implementation based on distributed objects, on the other hand, achieves a considerable speed-up, in particular for large-size applications. We finally discuss the impact of these results for real applications.
1. INTRODUCTION
In the last 15 years Description Logics (DL) have become one of the major paradigms in Knowledge Representation. Combining ideas from Semantic Networks and Frames with the formal rigor of First Order Logic, research in DL has focussed on theoretical foundations [1] as well as on system development [2] and application in real-world scenarios
[3-5]. Whereas in the beginning it was hoped that DL provide representation formalisms which allow efficient computation, at least three trends in recent years have caused efficiency problems for DL systems and applications:
• a trend towards expressive dialects;
*This work was funded by the German Federal Ministry of Education, Science, Research and Technology (BMBF) in the framework of the Verbmobil Project under Grant 01 IV 101 Q 8. The responsibility for the contents of this study lies with the authors.
• a trend towards complete inference algorithms;
• a trend towards large-scale applications.
With the current state of technology it does not seem possible to build a DL system for large-scale applications which offers an expressive dialect with complete inference algorithms. The standard strategy to cope with this dilemma is to restrict either expressivity, or completeness, or application size. In this paper we investigate an alternative approach, namely a parallelization of Description Logics.
Due to physical limitations on performance gains in conventional processor architectures, parallelization has become more and more important in recent years. This comprises parallel structures inside processors as well as outside, by scaling several processors to parallel systems. Several fields of high-performance computing have already adapted to this new world of paradigms, such as image processing [6], finite element simulation [7], and fluid dynamics [8]. We expect that in the future parallelism will become a standard technique in the construction of complex AI applications.
A standard approach to parallelization in the context of logic programming concentrates on the development of parallel languages that exploit the parallelism inherent in the underlying logic formalism ([9,10] and many more). In this paper we will follow a rather different approach which analyzes a particular application, namely Description Logics. The parallelization we propose uses explicit parallelism based on the notion of processes and messages that is programming-language independent.
In the next section we give a brief introduction to Description Logics. Section 3 then presents the main inference components of the DL system FLEX and investigates their parallel potential. In Section 4 we describe two different strategies of parallelizing object-level propagation in DL systems. The corresponding implementations are evaluated in detail in Section 5 based on benchmarks containing artificial examples. Section 6 finally discusses the impact of these results on real applications.
2. DESCRIPTION LOGICS
In this section we give a brief introduction to Description Logics. Our main goal is to provide a rough overview of DL-based knowledge representation and DL systems. In the next section we will then take a closer look at inferences in the DL system FLEX and their respective parallel potential.
2.1. The Representation Language
In DL one typically distinguishes between terms and objects as basic language entities from which three kinds of formulae can be formed: definitions, descriptions, and rules (see the sample model below). A definition has the form 'tn := t' and expresses the fact that the name tn is used as an abbreviation for the term t. A list of such definitions is often called terminology (hence also the name Terminological Logics). If only necessary but no sufficient conditions of terms are specified, a definition has the form 'tn :< t', meaning that 'tn' is more specific than 't'. Terms introduced via ':=' are called defined terms, those introduced via ':<' are called primitive terms. A description 'o :: c' expresses the fact that the object o is an instance of the concept c. Rules have the form 'c1 => c2' and stipulate that each instance of the concept c1 is also an instance of the concept c2.
In general, the representation language is defined by giving a formal syntax and semantics. Note that DL are subsets of First-Order Logic (with Equality), which can be shown easily by specifying translation functions from DL formulae into FOL formulae [11,12]. Just as in FOL there is thus an entailment relation between (sets of) DL formulae, i.e. a DL model can be regarded as a set of formulae Γ which entails other DL formulae (Γ |= γ). Depending on the term-forming operators used in a DL dialect, this entailment relation can be decidable or undecidable, and the inference algorithms implemented in a DL system can be complete or incomplete with respect to the entailment relation.
2.2. A Sample Model
In order to get a better understanding of DL-based knowledge representation, let us take a look at the modeling scenario assumed for applications. An application in DL is basically a domain model, i.e. a list of definitions, rules, and descriptions. Note that from a system perspective a model or knowledge base is thus a list of tells; from a theoretical perspective it is a set of DL formulae Γ. Consider the highly simplified domain model below, whose net representation is shown in Figure 1. One role and five concepts are defined, out of which four are primitive (only necessary, but no sufficient conditions are given). Furthermore, the model contains one rule and four object descriptions.
product :< anything
chemical product :< product
biological product :< product & not(chemical product)
company :< some(produces,product)
produces :< domain(company)
chemical company := company & all(produces,chemical product)
some(produces,chemical product) => high risk company
toxipharm :: chemical product
biograin :: biological product
chemoplant :: chemical company
toxiplant :: atmost(1,produces) & produces:toxipharm
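To make the tell/ask scenario tangible, the sketch below encodes the object descriptions of the sample model in plain Python dictionaries and answers the retrieval query "which objects are instances of 'high risk company'" by naive application of the rule. This is only an illustration of the modeling idea; it is not FLEX's representation or its inference algorithm.

    # Naive encoding of the sample model's objects (illustration only).
    objects = {
        "toxipharm": {"types": {"chemical product"}, "produces": set()},
        "biograin":  {"types": {"biological product"}, "produces": set()},
        "chemoplant": {"types": {"chemical company"}, "produces": set()},
        "toxiplant": {"types": set(), "produces": {"toxipharm"}},
    }

    def instances_of_high_risk_company():
        # An object satisfies 'some(produces, chemical product)' if it produces at
        # least one chemical product; 'chemical company' instances qualify by
        # definition (all their products are chemical and every company produces
        # some product), so the rule makes them 'high risk company' as well.
        result = set()
        for name, obj in objects.items():
            if "chemical company" in obj["types"]:
                result.add(name)
            elif any("chemical product" in objects[f]["types"] for f in obj["produces"]):
                result.add(name)
        return result

    print(sorted(instances_of_high_risk_company()))   # -> ['chemoplant', 'toxiplant']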
As mentioned above, such a model can be regarded as a set of formulae, and the service provided by DL systems is basically to answer queries concerning the formulae entailed
Figure 1. The net representation of the sample domain on page 3. Concepts are drawn as ellipses and are arranged in a subsumption hierarchy. Objects are listed below the most specific concepts they instantiate. Roles are drawn as labeled horizontal arrows (annotated with number and value restrictions) relating concepts or instances. The dashed arrow relates the left-hand side of a rule with its right-hand side ('conc_l' is the concept 'some(produces,chemical product)'). The flashed arrow between 'chemical product' and 'biological product' indicates disjointness.
tl ?< t2 Is a term tl more specific than a term t2, i.e., is tl subsumed by t27 In the sample model, the concept 'chemical company' is subsumed by 'high risk company', i.e., eve .ry chemical company is a high risk company. 2
9 t l and t2 ?< nothing Are two terms tl and t2 incompatible or disjoint? In the sample model, the concepts 'chemical product' and 'biological product' are disjoint, i.e., no object can be both a chemical and a biological product. .o?:c Is an object o an instance of concept c (object classification)? In the sample model, 'toxiplant' is recognized as a 'chemical company'. 9 o17:r:o2
Are two objects ol,o2 related by a role r, i.e., is 02 a role-filler for r at o1? In the sample model, 'toxipharm' is a role-filler for the role 'produces' at 'toxiplant'. 9 Which objects are instances of a concept c (retrieval)? In the sample model, 'chemoplant' and 'toxiplant' are retrieved as instances of the concept 'high risk company'. 9 o::cfails
Is a description inconsistent with the model (consistency check)? The description 'chemoplant :: produces:biograin' is inconsistent, with respect to the sample model, i.e., 'biograin' cannot be produced by 'chemoplant'. 3 This very general scenario can be refined by considering generic application tasks such as information retrieval, diagnosis, or configuration. 2.3. S y s t e m I m p l e m e n t a t i o n s From the beginning on, research in DL was praxis-oriented in the sense that the development of DL systems and their use in applications was one of the primary interests. In the first half of the 1980's several systems were developed that might be called in retrospection first-generation DL systems. These systems include KL-ONE, NIKL, KANDOR, KL-TWO, KRYPTON, MESON, and SB-ONE. In the second half of the 1980's three systems were developed which are still in use, namely BACK, CLASSIC, and LOOM. The LOOM system [13] is being developed at USC/ISI and focuses on the integration of a variety of programming paradigms aiming at a general purpose knowledge representation system. CLASSIC [2] is an ongoing 2'chemical company' is defined as a 'company' all whose products are chemical products; each 'company' produces some 'product'; thus 'chemical company' is subsumed by 'some(produces,chemical product)' and due to the rule by 'high risk company'. 3Object tells leading to inconsistencies are rejected by DL systems.
AT&T development. Favoring limited expressiveness for the central component, it is attempted to keep the system compact and simple so that it potentially fits into a larger, more expressive system. The final goal is the development of a deductive, object-oriented database manager. BACK [14] is intended to serve as the kernel representation system of AIMS (Advanced Information Management System), in which tools for semantic modeling, defining schemata, manipulating data, and querying will be replaced by a single high-level description interface. To avoid a "tool-box-like" approach, all interaction with the information repository occurs through a uniform knowledge representation system, namely BACK, which thus acts as a mediating layer between the domain-oriented description level and the persistency level. The cited systems share the notion of DL knowledge representation as being the appropriate basis for expressive and efficient information systems [15]. In contrast to the systems of the first generation, these second-generation DL systems are full-fledged systems developed in long-term projects and used in various applications. The systems of the second generation take an explicit stance on the problem that determination of subsumption is at least NP-hard or even undecidable for sufficiently expressive languages: CLASSIC offers a very restricted DL and almost complete inference algorithms, whereas LOOM provides a very expressive language but is incomplete in many respects. Recently, the KRIS system has been developed, which uses tableaux-based algorithms and provides complete algorithms for a very expressive DL [16]. KRIS might thus be the first representative of a third generation of DL systems, though there is not yet enough experience with realistic applications to judge the adequacy of this new approach. In the following section we describe the FLEX system, which can be seen as an extension of the BACK system. FLEX is developed at the Technische Universität Berlin within the project KIT-VM11, which is part of the German machine-translation project VERBMOBIL.

3. THE FLEX SYSTEM

Having sketched some of the general characteristics of DL, we will now turn our attention towards a specific DL system, namely the FLEX system [18]. Compared to its predecessor, the BACK system, FLEX offers some additional expressivity such as disjunction and negation, term-valued features, situated descriptions, and weighted defaults. In the context of this paper another characteristic of FLEX is more important, however, namely the one giving rise to its name, i.e., flexible inference strategies.

3.1. DL Inferences

Given the specification of DL in the previous section, the inference algorithms have to answer queries of the form
t1 ?< t2 and o ?: c.
s_g ≥ s_h). The measure s_t relates the time needed to solve all problems (one after the other) with one processor to the time needed with p processors. This measure is especially useful for applications of the theorem prover where one proof obligation after the other is to be solved. Then s_t represents the ratio of the "waiting time" for the user before the prover has finished all examples. The run-times given in this paper are those of the SETHEO Abstract Machine, including the time to load the compiled formula. The times needed to start and stop the system are not considered here; for a discussion of these times, see Section 5. All proof attempts (sequential and parallel) have been aborted after a maximal run-time of Tmax = 300s (for SiCoTHEO-PID, Tmax = 1000s is used). All times are CPU times and are measured on HP-750 workstations with a granularity of 1/60 seconds.

4. Parameter Competition for SETHEO

The Model Elimination Calculus and SETHEO's proof procedure can be parameterized in several ways. Table 1 shows a number of typical ways of modifying the basic algorithm. For each parameter, common values are shown; values which are the default for SETHEO are given in bold-face. The selection function determines which clause and literal is to be taken next, and the search mode determines in which way the OR-tree is explored. Additional inference rules ("fold-up" and "unit-lemmata") allow SETHEO to use intermediate results (lemmas). Finally, the completeness mode and the completeness bound determine SETHEO's search strategy.
    parameter                values
    selection function       as in formula / random / heuristically ordered
    search mode              top-down / bottom-up / combination
    add'l inference rules    none / fold-up / unit-lemmata
    completeness modes       iterative deepening / other fair strategies
    completeness bounds      depth / #inferences / #copies / combinations

Table 1
Basic parameters for SETHEO's calculus and proof procedure. Values shown in bold-face are default values for SETHEO.
Given the parameters and their possible values from Table 1, a number of different parallel theorem provers based on competition could be designed. In the following, we will focus on three parallel competitive systems based on SETHEO. Since they compete on rather simple settings of parameters, the system is called SiCoTHEO (Simple Competitive provers based on SETHEO). The three systems compete via different completeness modes ("parallel iterative deepening", SiCoTHEO-PID), via a combination of completeness bounds (SiCoTHEO-CBC), and via a combination of top-down and bottom-up processing (SiCoTHEO-DELTA). Before we go into the details of each prover, we sketch the common prototypical implementation of all SiCoTHEO provers.

5. Prototypical Implementation

SiCoTHEO runs on a (possibly heterogeneous) network of UNIX workstations. The control of the proving processes, the setting of the parameters, and the final assembly of the results is accomplished by the tool pmake [1]. This implementation of SiCoTHEO has been inspired by a prototypical implementation of RCTHEO [2]. Pmake is a parallel version of make, a software engineering tool commonly used to generate and compile pieces of software from their source files. Pmake exploits parallelism by exporting as many independent jobs as possible to other processors. Hereby it assumes that all files are present on all processors (e.g., via NFS). Pmake stops as soon as all jobs are finished or an error occurs. In our case, however, we need a "winner takes all" strategy which stops the system as soon as one job is finished. This can be accomplished easily by adapting SETHEO so that it returns "error" (i.e., a value ≠ 0) as soon as it has found a proof. Then pmake aborts all actions per default. In contrast, the implementation of RCTHEO had to transfer the output generated by all provers to the master processor. There, a separate process searched for success messages. This resulted in heavy network traffic and long delays. A critical issue in using pmake is its behavior w.r.t. the load of workstations: as soon as there is activity (e.g., keyboard entries) on workstations used by pmake, the current job will be aborted (and restarted later). Therefore, the number of active processors (and even the start-up times) can vary strongly during a run of SiCoTHEO.
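Independently of pmake, the control flow of this "winner takes all" competition can be sketched in a few lines of Python. This is only an illustration of the idea: the setheo command line and its options are placeholders, not the actual SETHEO interface, and the real SiCoTHEO relies on pmake's job handling instead.

    import subprocess, sys, time

    def compete(settings, formula):
        """Start one prover per parameter setting; stop the others as soon as one succeeds."""
        # Hypothetical command line; the real SETHEO invocation and options differ.
        procs = [subprocess.Popen(["setheo", formula] + opts) for opts in settings]
        try:
            while procs:
                time.sleep(0.1)
                for p in list(procs):
                    ret = p.poll()
                    if ret is None:
                        continue                  # this prover is still searching
                    procs.remove(p)
                    if ret != 0:                  # SiCoTHEO convention: non-zero exit = proof found
                        return True
            return False                          # all provers stopped without finding a proof
        finally:
            for q in procs:                       # winner takes all: abort the remaining provers
                q.terminate()

    if __name__ == "__main__":
        # Example: three hypothetical depth bounds competing on one formula file.
        found = compete([["-d", "5"], ["-d", "6"], ["-d", "7"]], sys.argv[1])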
6. Evaluation and Results

In this section we look in detail at the results obtained with the three different versions of SiCoTHEO. The experiments have been carried out on a network of HP-750 workstations, connected via Ethernet. All formulae for the experiments have been taken from the TPTP problem library [11].
6.1. SiCoTHEO-PID

Parallel iterative deepening is one of the simplest forms of competition: each processor explores the search space to a specific bound. Assume we have to perform iterative deepening over the A-literal depth as the completeness bound. Then processor i (1 ≤ i ≤ P) explores the search space to depth i. If that processor cannot find a proof with the given bound i, it starts the search again with bound i + P, then i + 2P, and so on. This parallel scheme for iterative deepening (written in a C-like notation below) assures
completeness with a limited number of processors: all values for the depth bound are eventually used by some processor, while no two processors work with the same bound.

    for i = 1, 2, ..., P in parallel do
        on processor i do
            for k = 0, 1, 2, ... do
                depth_bound = i + k * P;
                setheo(depth_bound)

Due to time and resource restrictions, results for SiCoTHEO-PID have been obtained by evaluating existing run-time data of SETHEO. Figure 2A shows the resulting mean values of the speed-up for different numbers of processors. As can be seen immediately, the variance of the speed-up values is extremely high. This fact results in a high arithmetic mean, whereas s_g, s_h and s_t are very close to 1. This behavior can also be seen in Figure 2B, which shows the ratio of T_∥ over Tseq for each example, using 5 processors. The speed-up s is always ≥ 1 since the entire search space (which has to be searched in the sequential case) is partitioned.
Figure 2. SiCoTHEO-PID. A: mean speed-up values for different numbers of processors P (separate markers for s_a, s_g, s_h and s_t). The dotted line marks linear speed-up s = P. B: parallel run-time T_∥ over sequential run-time Tseq for P = 5. The dotted line corresponds to s = 1, the solid line to s = P.
Furthermore, it is evident from Figure 2A that SiCoTHEO-PID is not scalable. The speed-up values reach a saturation level already with 4 processors; adding more processors does not improve the speed-up further. (The run-time data have been obtained by running SETHEO V3.1 on all examples of the TPTP [11] with Tmax = 1000s [13]. For our experiments, we selected all examples for which a sequential run-time Tseq could be obtained.)
Figure 3. Number of examples with a proof found with A-literal depth d over the depth d. Numbers are % of the total number of 858 samples.
Although in many cases high speed-up values can be obtained, SiCoTHEO-PID should be used only in applications where deep proofs are expected.

6.2. SiCoTHEO-CBC
The completeness bound which is used for iterative deepening determines the shape of the search space and therefore has an extreme influence on the run-time the prover needs to find a proof. There exist many examples for which a proof cannot be found within reasonable time using iterative deepening over the depth of the proof tree, whereas iterative deepening over the number of inferences almost immediately reveals a proof, and vice versa. (This dramatic effect can be seen clearly in, e.g., [7], Table 3.) In general, when iterating over the depth of the proof tree, balanced trees are preferred. The growth of the search space per iteration level, however, is extremely high. On the other hand, the inference bound first tries rather unbalanced and deep trees. Here, the search often reaches areas with unmanageable search spaces. In order
to level both extremes, R. Letz proposed (personal communication) to combine the A-literal-depth bound d with the inference bound imax. When iterating over depth d, the inference bound imax is set according to imax = d^η, where η is the mean length of the clauses. For our experiments, however, we take a slightly different approach by using a quadratic polynomial:
imax = α·d² + β·d, where α, β ∈ R+. (For α = 0, β = 1 we obtain inference-bounded search; α = ∞, β = ∞ corresponds to depth-bounded search.) This polynomial approximates the structure of a tableau: a tableau (a tree) with a given depth d0 has d0 ≤ i ≤ μ^d0 inferences (leaf nodes), where μ is the maximal number of literals per clause. Hence, we estimate the number of inferences in the tableau to be i = x^d0 for some x ≤ μ and allow the prover to search for tableaux with at most i inferences by setting imax := i. A Taylor development of this formula leads to i = 1 + Σ_k d0^k (log x)^k / k!. Since, in most cases, x is very close to 1, we only use the linear and quadratic terms, finally obtaining our quadratic polynomial. SiCoTHEO-CBC (ComBine Completeness bounds) explores a set of parameters (α, β) in parallel by assigning different values to each processor. For the experiments we selected 0.1 ≤ α ≤ 1 and 0 ≤ β < 1. In our first experiment (Experiment 1) we used 50 processors with the following parameter settings:

    p1:   (0.1, 0.0)   (0.1, 0.2)   ...   (0.1, 0.8)
          (0.2, 0.0)   (0.2, 0.2)   ...   (0.2, 0.8)
          ...
    p50:  (1.0, 0.0)   (1.0, 0.2)   ...   (1.0, 0.8)
Note that this grid does not reflect the architecture of the system; it rather represents a two-dimensional arrangement of the parameter values. For Experiment 2, Experiment 3 and Experiment 4, the number of processors was reduced to 25, 9, and 4, respectively, by equally thinning out the grid. For all experiments, a total of 92 different formulae from the TPTP have been used. 48 examples show a sequential run-time Tseq of less than one second. 36 of the remaining examples have a sequential run-time which is higher than 100 seconds. Although measurements have been made with all examples, we do not present the results for those with run-times of less than one second. In that case, the resulting speed-up (s_a = 1.57 for P = 50) is by far outweighed by the time SiCoTHEO needs to export proof tasks to other processors. In a real application, this problem could be solved by the following strategy: first, start one sequential prover with a time-limit of 1 second. If a proof cannot be found within that time, SiCoTHEO-CBC would start exporting proof tasks to other processors. Table 2 (first group of rows) shows the mean values for these experiments. These figures can be interpreted more easily when looking at the graphical representation of the ratio between Tseq and T_∥, as shown in Figure 4. Each dot represents a measurement with one formula. The dotted line corresponds to s = 1, the solid line to s = P, where P is the number of processors. The area above the dotted line contains examples where the parallel system is slower than the sequential prover, i.e., s < 1. Dots below the solid
line (with a gradient of 1/P) represent experiments which yield a super-linear speed-up
s > P. Figure 4 shows that even for few processors a large number of examples with super-linear speed-up exist. This encouraging fact is also reflected in Table 2, which exhibits good average speed-up values for 4 and 9 processors. For our long-running examples and P = 4 or P = 9, s_g is even larger than the number of processors. This means that in most cases a super-linear speed-up can be accomplished. Table 2 furthermore shows that with an increasing number of processors, the speed-up values are also increasing. However, for larger numbers of processors (25 or 50), the efficiency η = s/P is decreasing. This means that SiCoTHEO-CBC obtains its peak efficiency with about 15 processors and thus is only moderately scalable.
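As an illustration of the parameter competition, the (α, β) grid of Experiment 1 and the resulting inference bound imax = α·d² + β·d can be generated as follows (a minimal Python sketch; the function names are ours, not part of SiCoTHEO).

    def parameter_grid():
        """Enumerate the 50 (alpha, beta) pairs of Experiment 1."""
        alphas = [round(0.1 * i, 1) for i in range(1, 11)]   # 0.1, 0.2, ..., 1.0
        betas  = [round(0.2 * j, 1) for j in range(0, 5)]    # 0.0, 0.2, ..., 0.8
        return [(a, b) for a in alphas for b in betas]       # 10 x 5 = 50 settings

    def inference_bound(alpha, beta, depth):
        """i_max = alpha * d^2 + beta * d for a given A-literal depth d."""
        return int(alpha * depth ** 2 + beta * depth)

    for proc, (alpha, beta) in enumerate(parameter_grid(), start=1):
        # Processor 'proc' would run SETHEO with this i_max schedule while
        # iterating over the A-literal depth bound d = 1, 2, 3, ...
        schedule = [inference_bound(alpha, beta, d) for d in range(1, 6)]
        print(f"p{proc}: alpha={alpha}, beta={beta}, i_max for d=1..5 -> {schedule}")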
    System              Experiment    P      s_a      s_g     s_h     s_t
    SiCoTHEO-CBC             4         4     61.21     5.92    2.12    2.06
    SiCoTHEO-CBC             3         9     77.30    12.38    3.37    2.93
    SiCoTHEO-CBC             2        25     98.85    18.18    4.34    3.84
    SiCoTHEO-CBC             1        50    101.99    19.25    4.41    3.88
    SiCoTHEO-DELTA           7         4     18.31     4.49    1.34    1.42
    SiCoTHEO-DELTA           6         9     63.78    10.89    2.00    2.45
    SiCoTHEO-DELTA           5        25     76.46    15.96    2.76    3.16
    SiCoTHEO-DELTA+         10         5     18.39     5.15    1.43    1.83
    SiCoTHEO-DELTA+          9        10     63.84    12.07    2.92    3.47
    SiCoTHEO-DELTA+          8        26     76.50    16.97    3.71    4.13

Table 2
SiCoTHEO: Mean values of the speed-up for different numbers of processors P. The number of examples is 44.
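Assuming the usual definitions of the mean values, the four measures reported in Table 2 can be recomputed from per-example run-times as sketched below (our own helper code, not part of SiCoTHEO): s_a, s_g and s_h are the arithmetic, geometric and harmonic means of the per-example speed-ups, and s_t is the ratio of the summed sequential run-time to the summed parallel run-time.

    import math

    def speedup_means(t_seq, t_par):
        """t_seq, t_par: lists of per-example run-times in the same order."""
        s = [ts / tp for ts, tp in zip(t_seq, t_par)]        # per-example speed-ups
        n = len(s)
        s_a = sum(s) / n                                      # arithmetic mean
        s_g = math.exp(sum(math.log(x) for x in s) / n)      # geometric mean
        s_h = n / sum(1.0 / x for x in s)                     # harmonic mean
        s_t = sum(t_seq) / sum(t_par)                         # "waiting time" ratio
        return s_a, s_g, s_h, s_t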
6.3. SiCoTHEO-DELTA
The third competitive system which will be considered in this paper affects the search mode of the prover. SETHEO normally performs a top-down search. Starting from a goal, Model Elimination Extension and Reduction steps are performed, until all branches of the tableau are closed. The DELTA iterator [9], on the other hand, generates small tableaux, represented as unit clauses in a bottom-up way during a preprocessing phase.
Figure 4. Parallel run-time T_∥ over sequential run-time Tseq for SiCoTHEO-CBC and different numbers of processors.
For example, the left subtree of Fig. 1 corresponds to the clause p(a, b). These unit clauses are added to the original formula. Then, in the main proving phase, SETHEO works in its usual top-down search mode. The generated unit clauses can now be used to close open branches of the tableau much earlier, thus combining top-down with bottom-up processing. This decrease of the proof size can reduce the amount of necessary search dramatically. On the other hand, adding new clauses to the formula increases the search space. In cases where these clauses cannot be used for the proof, the run-time to find a proof increases (or a proof cannot be found within the given time limit). Thus, adding too many (or useless) clauses has a strong negative effect. The DELTA preprocessor has various parameters to control its operation. Here, we focus on two parameters: the number of iteration levels l and the maximally allowed term depth td. l determines how many iterations the preprocessor executes; the number of generated unit clauses increases monotonically with l. The term depth td of a term is the maximal nesting level of function symbols in that term, e.g., td(a) = 1, td(f(a, f(b, c))) = 3. In order to avoid an excessive generation of unit clauses, the maximal term depth td of any term in a generated unit clause can be restricted. Furthermore, DELTA is configured in such a way that a maximum of 100 unit clauses is generated, to avoid excessively large formulas. For our experiments, we use competition on the parameters l and td of DELTA. The resulting formula is then processed by SETHEO, using standard parameters (iterative deepening over the A-literal depth). Hence, execution time in the parallel case consists of the time needed for the bottom-up iteration, T_DELTA, plus that needed for the subsequent top-down search, T_td. As before, the overall execution time of the abstract machine, including the time to load the formula, is used. With l ∈ {1, 2, ..., 5} and td ∈ {1, 2, ..., 5}, a total of 25 processors are used. Figure 5 shows the ratio between Tseq and T_∥ for all examples. Again, we vary the number of processors (4, 9, 25). Our experiment has been carried out with the same examples as in the previous section (Tseq > 1s). Table 2 (middle section, Experiments 5-7) shows the
numeric values for the obtained speed-up.
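For reference, the term depth td used to restrict DELTA's unit clauses can be computed recursively; the small helper below (our own illustration, operating on a nested-tuple term representation) reproduces the two examples given above.

    def term_depth(term):
        """td(constant) = 1; td(f(t1,...,tn)) = 1 + max td(ti).
        Terms are nested tuples, e.g. ('f', ('a',), ('f', ('b',), ('c',)))."""
        functor, *args = term
        if not args:
            return 1
        return 1 + max(term_depth(a) for a in args)

    assert term_depth(('a',)) == 1
    assert term_depth(('f', ('a',), ('f', ('b',), ('c',)))) == 3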
Figure 5. Parallel run-time T_∥ over sequential run-time Tseq for SiCoTHEO-DELTA and different numbers of processors.
In general, the speed-up figures obtained with SiCoTHEO-DELTA show a behavior similar to those for SiCoTHEO-CBC. This can be seen in Figure 5 and Table 2 (Experiments 5-7). Here, however, there are several cases in which the parallel system runs slower than the sequential one. The reason for this behavior is that the additional unit clauses are not useful for the proof and increase the search space too much. This negative effect can easily be overcome by using an additional processor which runs the sequential SETHEO with default parameters. The resulting speed-up figures are shown in Table 2 (Experiments 8-10, rows marked SiCoTHEO-DELTA+). It is obvious that in this case the speed-up will always be greater than or equal to 1. Although the arithmetic mean is not influenced dramatically, we can observe a considerable increase in the geometric mean. This fact indicates a "smoothing effect" when the additional processor is used. The scalability of SiCoTHEO-DELTA is also relatively limited. This is due to the coarse controlling parameters of DELTA. The speed-up and scalability could be increased if one succeeded in producing a greater variety of preprocessed formulas.

7. Conclusions

In this paper, we have presented a parallel theorem prover based on the theorem prover SETHEO. Parallelism is exploited by homogeneous competition. Each processor in the network runs SETHEO and tries to prove the entire formula. However, on each processor a different set of parameters influences the search of SETHEO. If this influence results in large variations of the run-time, good speed-up values can be obtained. In this work, we compared three different systems based on this model: SiCoTHEO-PID performs parallel iterative deepening, SiCoTHEO-CBC combines two different completeness bounds using a parameterized function, and SiCoTHEO-DELTA combines the traditional top-
down search of SETHEO with the bottom-up preprocessor DELTA, where the parameters of DELTA are the basis for competition. Since the search space is partitioned by the prover, the speed-up values of SiCoTHEO-PID are always larger than one. However, only little efficiency could be obtained, and SiCoTHEO-PID's peak performance is reached with only 4 processors. In general, good efficiency and reasonable scalability can be obtained only if there are enough different values for a parameter, the parameter setting strongly influences the behavior of the prover, and there is no good default estimation for that parameter. Both the combination of search bounds and the combination of search modes have been shown to be appropriate for competition. The scalability of both systems is still relatively limited. The implementation of SiCoTHEO using pmake combines simplicity and high flexibility (w.r.t. the network and modifications) with good performance. In many cases, super-linear speed-up could be obtained. Future enhancements of SiCoTHEO will certainly incorporate ways to control DELTA's behavior more subtly. Furthermore, a combination of SiCoTHEO-DELTA with SiCoTHEO-CBC and heuristic control using Neural Networks will increase the overall efficiency and scalability of SiCoTHEO substantially. Finally, experiments with SiCoTHEO can reveal how to set the parameters of sequential SETHEO and DELTA in an optimal way.

REFERENCES
1. A. de Boor. PMake - A Tutorial. Berkeley Softworks, Berkeley, CA, January 1989.
2. W. Ertel. OR-Parallel Theorem Proving with Random Competition. In Proceedings of LPAR '92, pages 226-237, St. Petersburg, Russia. Springer LNAI 624, 1992.
3. W. Ertel. Parallele Suche mit randomisiertem Wettbewerb in Inferenzsystemen. Series DISKI 25. Infix, St. Augustin, 1993.
4. W. Ertel. On the Definition of Speedup. In PARLE, Parallel Architectures and Languages Europe, pages 289-300. Springer, 1994.
5. C. Goller, R. Letz, K. Mayr, and J. Schumann. SETHEO V3.2: Recent Developments (System Abstract). In Proc. of Conference on Automated Deduction (CADE) 12, pages 778-782. Springer LNAI 814, 1994.
6. R. Letz, K. Mayr, and C. Goller. Controlled Integration of the Cut Rule into Connection Tableau Calculi. Journal of Automated Reasoning, 13:297-337, 1994.
7. R. Letz, J. Schumann, S. Bayerl, and W. Bibel. SETHEO: A High-Performance Theorem Prover. Journal of Automated Reasoning, 8:183-212, 1992.
8. D.W. Loveland. Automated Theorem Proving: a Logical Basis. North-Holland, 1978.
9. J. Schumann. DELTA - A Bottom-up Preprocessor for Top-Down Theorem Provers. System Abstract. In Proc. of Conference on Automated Deduction (CADE) 12, pages 774-777. Springer LNAI 814, 1994.
10. J. Schumann and R. Letz. PARTHEO: a High Performance Parallel Theorem Prover. In Proc. of Conference on Automated Deduction (CADE) 10, pages 40-56. Springer LNAI 449, 1990.
11. G. Sutcliffe, C.B. Suttner, and T. Yemenis. The TPTP Problem Library. In Proc. of Conference on Automated Deduction (CADE) 12, pages 252-266. Springer LNAI 814, 1994.
12. C.B. Suttner and J. Schumann. Parallel Automated Theorem Proving. In Parallel Processing for Artificial Intelligence, pages 209-257. Elsevier, 1994.
13. C.B. Suttner. Static Partitioning with Slackness. Series DISKI. Infix, St. Augustin, 1995.
Johann Schumann
Johann Schumann studied computer science at the Technische Universität München (TUM) from 1980 to 1986. He then joined the research group "Automated Reasoning" at the TUM and worked on the development of sequential and parallel theorem proving. In 1991 he obtained his doctoral degree with a thesis on efficient theorem provers based on an abstract machine. From 1991 to 1992 he worked in industry as a project manager in the area of network management and control systems. His current research interests include sequential and parallel theorem proving and the application of automated theorem provers in the area of Software Engineering.
Low-Level Computer Vision Algorithms: Performance Evaluation on Parallel and Distributed Architectures

G. Destri and P. Marenzoni
Dipartimento di Ingegneria dell'Informazione, Università di Parma, Viale delle Scienze, I-43100 Parma, Italy
Tel. +39-521-905708, Fax +39-521-905723, e-mail: {destri,marenz}@CE.UniPR.IT

1. INTRODUCTION

Computer Vision (CV) is a valuable tool in many fields: from robotics to geophysics, from medicine to industrial quality control. The recognition of objects and their configuration in a scene is allowed by the identification and extraction of significant information from raw image data. Several paradigms can be used to classify CV methods. However, regardless of the chosen approach, a "low-level processing" is always necessary, namely processing which does not alter the image data structure (pixel array), but only changes individual pixel values. Early filtering, image smoothing, noise reduction, edge detection and region partitioning are some examples of low-level processing. In many cases low-level CV methods are not completely satisfactory, and it is necessary to introduce also at this first level the knowledge of scene contents into the automated system [1,2]. Other forms of computation can have output data structures different from the input ones. For example, a high-level image descriptor can output the number of regions with a particular label, obtained through an image segmentation algorithm [1]. This paper is about the performance evaluation of low-level CV algorithms operating in parallel and distributed environments. Typically, the number of pixels to be processed in the images can range from several thousands to a few millions. This fact is more relevant in applications such as remote sensing, where realistic sizes range from 1024 × 1024 up to 4096 × 4096 pixels, or more. The necessity to limit processing times to reasonable values (e.g., in real-time systems), or the large memory requirements, force the user to go beyond the single machine hardware limits, exploring parallel or distributed approaches. In low-level CV algorithms most of the operations are local, that is, the new value associated with each pixel depends only on the values coming from a well defined and limited neighborhood of that pixel. Therefore, low-level CV problems are well suited to be ported to a parallel or distributed environment, since they also show the most balanced behavior from the point of view of the computation versus communication ratio. The Cellular Neural Network (CNN) paradigm [3,4] is very appropriate to describe this kind of computation, because it embodies, as special cases, all CV problems solved with algorithms involving local
operations. Hence, the use of CNNs to evaluate performance of CV applications in parallel environments is the best choice, since CNNs are both a superset of all local low-level CV algorithms and well suited for parallelization. A measurement taken with respect to the most general CNN formulation can give an effective value of the lower performance bound offered by a given platform for CV applications. Indeed, from a computational point of view, the chosen algorithm is comparable to the most intensive CV operators. In this work a complete performance evaluation of a low-level CV distributed algorithm, based on the CNN paradigm, is presented. This analysis can be considered as a performance measure, on the distributed platforms tested, of the whole class of CV algorithms which may be expressed by means of the CNN formalism and parallelized following the proposed scheme. A coarse-grained parallelization scheme is used in task and data partitioning. Each processor operates on a horizontal slice of the image, the communications being limited to slice borders updating [5]. The resulting communication pattern involves only the exchange of large packets, therefore performance is not penalized by hardware and software communication latencies. To obtain satisfactory performance speed-up with this CV algorithm it is not essential to run on a dedicated parallel architecture, in particular when large images must be processed and the computational weight is more important than the communication weight. Clusters of high performance workstations can provide a significant memory extension, maintaining high efficiencies. Versions of the program have been implemented both on parallel architectures, a Connection Machine CM-5, a Cray T3D, and an IBM SP2, and on workstation clusters, adopting the available Fortran languages, in order to achieve the best performance in the most intensive computations on any parallel platform. Fortran environments are, in fact, better supported than C ones in many parallel architectures, and guarantee higher performance of the programs. While a portable version is implemented with the public domain PVM library, the CM-5, T3D and SP2 ones are optimized to achieve the best performance on each parallel architecture, exploiting the most effective message passing environment with respect to the underlying hardware. A wide set of tests have been performed, in order to complete a detailed performance evaluation of this CNN-based CV algorithm. Processing times have been taken, at increasing image sizes, on 32 nodes of a Connection Machine CM-5, 32 nodes of a Cray T3D, and 32 thin nodes of an IBM SP2 [6], comparing them with sequential CNN runs carried out on a SPARC-20 workstation. Homogeneous comparisons have been obtained also, running the PVM version on a cluster of SPARC-IPX workstations and the sequential version on a single identical machine.
2. CNN PARADIGM

2.1. CNN Basics

Cellular Neural Network (CNN) [3] is a computational paradigm defined in discrete regular N-dimensional spaces. The building block element of such a paradigm is a unit (or cell), corresponding to a point in the N-dimensional space, that performs both arithmetic and logic operations. Typically, in CV applications each unit corresponds to a pixel. CNN's main characteristic is the locality of the connections between the units. The most important difference between CNN and other Neural Network paradigms is the fact that
information is directly exchanged only between neighboring units. In Hopfield networks, for example, the units can be distributed on the nodes of a regular lattice, but each node is connected to all the other nodes of the network. Furthermore, in the widely used multilayer perceptrons each unit is connected to all the units of the previous layer. CNN locality, however, does not prevent global processing. Communications between units that are not directly connected (remote units) are obtained by means of consecutive moves through other units, over several algorithm iterations. Generally, to give a measure of the neighborhood size, the chessboard distance convention [1] is used, expressed by the equation:
d_ij = max(|x_i - x_j|, |y_i - y_j|).      (1)
CNN cells are multiple input-single output "processors," each one described by its own parametric functional. A cell is characterized by an internal state variable, usually not directly observable from outside the cell itself. The CNN cell grid can be a planar array with rectangular, triangular or hexagonal geometry, a 2-D or 3-D torus, a 3-D finite array, or a 3-D sequence of 2-D arrays (layers) [3,4]. A CNN dynamic system can operate both in continuous or discrete time. It is possible to consider the CNN paradigm as an evolution of the Cellular Automata paradigm [7], and it is also possible to exploit the existing Cellular Automata rules to design a CNN system [3]. Moreover, it has been demonstrated in [8] that the CNN paradigm is universal, being equivalent to the Turing Machine.

2.2. Cellular Automata

A Cellular Automaton (CA) is a discrete dynamical system. Space, time, and states of the system are discrete. Each point in a regular spatial lattice, called a cell, can have one of a finite number of states. The states of the lattice cells are updated according to a local rule. That is, the state of a cell at a given time depends only on both its own state at the previous time step and on the states of its nearby neighbors at the previous time step. All cells are updated synchronously. Thus, the state of the entire lattice advances in discrete time steps. A CA A is formally described by a quadruple:
A = <S, d, V, f>. S is a finite set of labels or numbers, d is the dimension of the CA, V is the neighborhood and f is the evolution rule. The dimension d defines a d-dimensional lattice L = Z^d, where Z is the set of all integers. A point x ∈ L is called a cell. The underlying space of A is S^L; an element c ∈ S^L is a map from L to S, associating a label s ∈ S with every point x ∈ L:
c : L → S;   c(x) = s;   c ∈ S^L.
A label s associated with a cell is called a state, and c is called a configuration of A. The set of the finite configurations is the domain of the "local" function f:
f : S^V → S;
the function f associates a label from the set S with every finite configuration. The "global" evolution function G is the application of the local function f to the finite configuration of the neighborhood of every point in L: the new value of G(c) at a point x is the value of f applied to the neighborhood of x. The function G is spatially invariant on S^L. Repeated applications of G give the dynamics of a CA, i.e., a sequence of configurations, each obtained by applying G to the previous one. The time evolution of a CA is fully determined by the local function f. From the above definitions and considerations it is more evident why CNNs can be considered as an evolution of CAs. The states are continuous values (i.e., real numbers), the functions acting on these values can be as complex as required, and different in each lattice point. Moreover, external inputs can also participate in the time evolution of the system.

2.3. CNN Equations

In this work only discrete time CNNs are considered, since we are interested in them as a formal model for the software algorithms. A formal mathematical description of the discrete time case is:
x_j(t_{n+1}) = g[x_j(t_n)] + I_j + Σ_{k ∈ N_r(j)} A_j[y_k(t_n), p_j^A] + Σ_{k ∈ N_s(j)} B_j[u_k(t_n), p_j^B]      (2)

y_j(t_n) = f[x_j(t_n)],      (3)
where x_j is the internal state of a cell, y_j is its output, u_j is its external input and I_j is a local value called bias. A_j and B_j are two generic parametric functionals, also called templates, and p_j^A and p_j^B are the two arrays of parameters. The two functionals can be, for example, linear combinations, nonlinear functions (e.g., exponentiations), or polynomial functions, while the parameter arrays play the role of the involved coefficients (e.g., in the case of linear connections p_j^A and p_j^B are the sets of connection weights). At each iteration, the y and u neighbor values are collected from the cells present in the neighborhoods N_r (for the feedback functional A) and N_s (for the control functional B). The two neighborhoods may be different, and they can include the cell itself, that is, the cell input value u_j (in the B functional) or its output value y_j(t_{n-1}) (in the A functional) can be arguments to the functional itself. Then the two templates are computed, generating the internal state x with the addition of the bias I. Finally, the activation function f generates the output from the internal state; f is typically a Gaussian, a linear function with saturation (amplifier model), a quantizer, a single step or a sigmoid (also called logistic function). The instantaneous local feedback function g, often not used, expresses the possibility of an immediate feedback effect. In many cases the system is non-Markovian, that is, the future system evolution can depend not only on the present but also on the past history. The system represented in equations 2 and 3 is, of course, strictly Markovian. Only the linear-connection subset of two-dimensional CNNs will be considered, since most CV algorithms can be expressed on the basis of this assumption. The A and B functionals become linear combinations of the neighborhood values, where the parameters are the connection weights [9]. The operation expressed in functionals A and B becomes
a convolution with a weight mask, usually represented as an h × h 2-D matrix, also called a linear template (see the block scheme in Figure 1). In this way, equation 2 becomes, for the lattice point j:
x_j(t_{n+1}) = g[x_j(t_n)] + I_j + Σ_{k ∈ N_r(j)} a_k(j) · y_k(t_n) + Σ_{k ∈ N_s(j)} b_k(j) · u_k(t_n)      (4)
where a_k(j) and b_k(j) are the parameters, acting as weights of the connections, that is, coefficients of the linear combination over the neighborhood N_r(j) of the point j. In the following, r will denote the template radius, i.e. the chessboard distance defining the neighborhood limit (for example, an r = 1 template is equivalent to a 3 × 3 template), and it will be considered constant over the whole lattice for both control and feedback templates. The f function will be considered space-invariant, but with different coefficients in each lattice cell (e.g., the gain and saturation limits in the linear case). Since our goal is to analyze the actual performance of several parallel or distributed architectures with respect to low-level CV applications based on a CNN implementation, we have chosen the most penalizing case from the point of view of computational cost and memory requirements. Therefore, even if always performing the linear combination of the neighborhood values, the A and B templates will not be constant along the image; rather, they will be space-variant, namely with different coefficients in each lattice point. This allows us to perform different operations on different image pixels, while executing everywhere the same machine sequence of instructions. Data and CNN parameters in this framework are floating-point numbers.

3. WHY CNNs FOR CV?

Given the formalism described in the previous section, in what follows we will analyze several concrete applications of CNNs to the CV domain, in this manner justifying the use of CNNs to "parameterize" the general behavior of low-level CV algorithms.
3.1. Elementary operators and CNNs

Low-level CV algorithms generate as output 2-D arrays, i.e., a data type conformal to the input. The pixel transformations can be classified into three main categories.
• Point functions, where the output value of each pixel depends only on the local input value of the pixel itself. Value rescaling and thresholding operations represent typical examples.
• Local functions, where in each pixel the output value depends on the input values coming from a limited neighborhood of that pixel. Derivative operators, local averaging and convolution-based algorithms are some examples of these functions [10,5]. Some matching techniques require, however, enhanced values of the template radius (e.g. 33 × 33 neighborhoods), with a much higher computational weight.
• Global functions, where the output value in each pixel depends on the whole image. Complex image transformations (e.g. spatial Fourier transform) or global histogram-based techniques [1] belong to this category.
Figure 1. Scheme of a complete CNN iteration.
Both the first and the second transformation kinds can be immediately expressed in terms of equations 2 and 3. In particular, the CNN becomes a point function when the feedback template and the bias are set to zero, and the control template has all coefficients equal to zero, except for the local cell, the input image being the external input. Moreover, without feedback template and performing just one iteration, the CNN becomes a simple convolution-based algorithm (a typical local function). In a similar way, the simple averaging algorithm can be expressed by a linear feedback template, by setting:

x_j(t+1) = (1/9) Σ_{k ∈ N_r(j)} y_k(t),      (5)
that is, a linear combination of the 3 × 3 pixel neighborhood with equal weights, while the control template and the bias are set to zero, and the f function is simply a multiplication by a rescaling coefficient. The well-known Sobel or Kirsch operators [1] can also be obtained by properly setting the control template weights and the f internal function, the control input variable u_j being the input image itself. Generally, a single iteration of a CNN has a computational cost greater than or at least equal to that of a local operator, and in this case the roles of the control and the feedback template are completely equivalent from a computational point of view. It must be observed that, for example, one CNN iteration has the same computational cost as a gradient extraction obtained through the combination of two spatial derivative operators. The internal function plays the role of this combinator (e.g., the maximum choice).
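For concreteness, one iteration of a linear, discrete-time CNN (equation 4) can be sketched in a few lines of Python/NumPy. This is our own illustration, not the authors' Fortran implementation: it uses a single space-invariant template per functional (the experiments in this paper use space-variant templates, i.e. one set of coefficients per pixel) and SciPy's correlate for the neighborhood sums. The averaging operator of equation 5 appears as the special case in the usage example.

    import numpy as np
    from scipy.ndimage import correlate

    def cnn_iteration(x, u, A, B, bias, f=np.tanh, g=lambda s: 0.0 * s):
        """One discrete-time CNN step with linear, space-invariant templates.
        x: internal state, u: external input, A: feedback template, B: control template."""
        y = f(x)                                     # output of the previous step
        x_new = (g(x) + bias
                 + correlate(y, A, mode="nearest")   # feedback contribution
                 + correlate(u, B, mode="nearest"))  # control contribution
        return x_new, f(x_new)

    # Equation 5: plain 3x3 averaging as a CNN with feedback template only.
    image = np.random.rand(64, 64)
    A_avg = np.full((3, 3), 1.0 / 9.0)
    B_zero = np.zeros((3, 3))
    state, output = cnn_iteration(image, image, A_avg, B_zero, bias=0.0, f=lambda s: s)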
    Feedback template:          Control template:
    0.05   0.1    0.05          0      0      0
    0.1    0.44   0.1           0      0.5    0
    0.05   0.1    0.05          0      0      0
Figure 2. A noise cleaning CNN operator, suitable for clustering. (left) Feedback template. (right) Control template. The bias parameter is a function of the local average luminance. This operator is typically iterated from two to eight times.
A CNN-based noise-reduction operator, derived from [9], is shown in Fig. 2. The same operator, with a quantizer as internal function f, was successfully used in [11] for the clustering of a noisy image [12]. In particular, this algorithm is based on the combination of the clustering with a luminance correction obtained through an appropriate function, based on a 7 × 7 CNN collecting the average value of luminance in the neighborhood of the point. Figure 3 shows the application of the operator to a real world noisy image (a road). In a similar way, we can express the operations dedicated to edge detection [9]. Other significant results of CNN applications to CV are texture image segmentation (e.g., [13,14]), feature detection (e.g., [15,16]), and object tracking and recognition (e.g., [17]). Modular convolution-based algorithms are other interesting applications (e.g. the well-known Canny Edge Detector [18], which can be expressed as a multilayer single-step iteration CNN, in which the operations are consecutively performed by the layers). Many global functions (e.g., diffusion algorithms [5]) can also be obtained through a CNN with appropriate coefficients, iterating a sufficient number of times to ensure the information transmission along the whole image. Generally, the CNN parameters defining the system behavior can be chosen by the programmer, in such a way as to define "mathematically" the function to be performed by the network [4]. The possibility to impose a machine learning of these coefficients may become a necessity, especially in case of a desired complex behavior. In some cases the applicability of "traditional" techniques for Neural Network learning, such as back-propagation, has been successfully demonstrated [14]. Sometimes, the characteristics of some nonlinear activation functions, widely used in CNNs, create many problems for the application of these learning techniques. To overcome these limits, a new kind of training, based on Genetic Algorithm techniques [19], has been developed in [20] and [15]. An application of these techniques to the design of filters for image clustering, presented in [21], is shown in Figure 4. All the previous considerations and examples support the use of the CNNs as a performance evaluation "paradigm" for CV, since measurements taken with CNN can give a realistic image of the actual performance of a given architecture with respect to the whole class of local CV operators.
Figure 3. An example of CNN clustering algorithm: (a) original image, (b) processed image.
Figure 4. An advanced example of CNN clustering algorithm: the filter has been obtained by means of a genetic approach. (Top-left) original image, (top-right) filtered image after 4 CNN iterations, (bottom-left) filtered image after the application of a step threshold and (bottom-right) the edges of the regions, shown for clarity.
3.2. CNNs for expectation-driven algorithms

One of the characteristics of low-level processing of images is that it is a data-driven process; this means that generally no global knowledge about the image is required to process it. Nevertheless, an iterative CV processing can "extract" part of its evolution rules from a priori knowledge. For example, an expectation-driven algorithm has been applied in [22] and [23], with a "synthetic" image playing a guide role. In CNN-based algorithms the "synthetic" input becomes the control input, which can also be variable in time, while the initial state is the image to be processed, and the algorithm acts in several iterations.

3.3. CNNs for middle-level functions

The algorithm we describe here has the main goal of detecting the presence and position of some a-priori known shapes in a real world image [24]. The algorithm must be very robust with respect to noise and imperfections of the image. The method is based on the matching approach [5], enhanced by means of some a-priori knowledge. Given the a-priori knowledge of the shape and of its approximate size, with an appropriate tolerance, the algorithm is obtained by means of a single-iteration CNN, where the control input is the "ideal" representation of the shape to be searched for, and the internal initial state is a subwindow of the image. We want to know whether the central pixel of this subwindow (i.e. the pixel where the CNN is applied) is a good candidate to be the center of the shape searched for. The response is a sort of quantitative measurement of the goodness of this matching. First the image is processed by means of a gradient operator to extract the border candidate pixels of the image. This result is not binarized, since we want to maintain a quantitative measurement of the edge intensity, and to exploit it to enhance the correctness of the matching process. Since a tolerance with respect to the size is also desired, the algorithm is iterated in successive steps, driven by appropriate thresholds, in a hysteresis approach [18]:

• The first matching takes place between the gradient image subwindow and the ideal representation of the shape (see Figures 5 and 6); the matching takes place only in the pixels in whose neighborhood the gradient intensity exceeds an appropriate threshold;
• if the value of the error obtained between the two images is smaller than the first threshold, we have found the shape searched for;
• if the value of the error is greater than the second value, this point cannot be the center of the shape;
• if the value is between the two thresholds, then the matching is performed with the "enlarged" ideal shape (see Figures 5 and 6), where new pixels have a smaller value than the original ones;
• the iteration is repeated two or three times.

The tests, performed on sample real world images obtained in gray tones by means of a normal video-camera and a digitizer, demonstrated that this simple method allows us to obtain about 96% of correct results, and often to find the shape searched for even where it is difficult for a human observer to find.
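The two-threshold decision described in the list above can be summarized in a small Python sketch; the error measure, the thresholds and the shape-enlargement step are schematic placeholders for quantities the paper leaves to the concrete application.

    import numpy as np

    def match_shape_at(grad_window, ideal_shape, enlarge, t_low, t_high, max_iter=3):
        """Hysteresis-style matching of a gradient subwindow against an ideal shape.
        Returns True (shape found), False (rejected) or None (undecided)."""
        shape = ideal_shape
        for _ in range(max_iter):
            # Placeholder mismatch measure between gradient image and current template.
            error = np.mean(np.abs(grad_window - shape))
            if error < t_low:
                return True            # good match: subwindow centre is a shape centre
            if error > t_high:
                return False           # cannot be the centre of the shape
            shape = enlarge(shape)     # in-between: retry against the "enlarged" shape
        return None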
Figure 5. The CNN-based matching algorithm for shape detection: the "ideal" form to be compared with the real image subwindow. In the successive steps the circular shape is "enhanced" by adding internal pixels. The darkness of the pixels is proportional to the intensity of the match: the internal pixels contribute with a lower weight to the matching value.
4. WHY "PARALLEL" CNNs?

Parallel architectures can be classified into two main categories. The systems where a single code execution takes place and the Processing Elements (PEs) only operate on different data sets are called SIMD (Single Instruction stream over Multiple Data streams) parallel computers. The systems where each processor is able to run its own code asynchronously on its data set are the MIMD (Multiple Instruction streams over Multiple Data streams) parallel computers. Regular grid-oriented algorithms are best suited to be easily ported to any parallel or distributed platform, both SIMD and MIMD. Grid applications, in fact, allow all processors to execute the same work, the load balancing being always perfectly even, therefore minimizing the delay phases in MIMD environments. Moreover, many grid-oriented applications show the locality property, that is, the evolution in time of each point (i.e., pixel) only depends on its nearest neighbors. This feature makes it possible, on a parallel platform, to avoid general interprocessor communications, and to achieve the maximum efficiency of the algorithm. As a matter of fact, many parallel architectures dedicated to low-level image processing have been developed, and a number of parallel algorithms have been designed for that purpose. Moreover, most of the algorithms discussed in the previous section have been successfully parallelized [5]. CNNs combine both characteristics, regularity and locality, that are typically shown by CV applications. Therefore, the use of CNNs to "parameterize" the behavior and to analyze the performance of low-level CV applications on distributed platforms can be considered very adequate to obtain results of general validity.

5. MACHINE CHARACTERISTICS

In this section we will briefly review the main characteristics of the parallel architectures we have used to test and analyze our general CNN algorithm: the Connection Machine CM-5, the Cray T3D, and the IBM SP2.
Figure 6. The CNN-based matching algorithm for shape detection: four steps of the matching algorithm. (Top-left) Original image, (top-right) gradient image, (bottom-left) thresholded gradient image with marker on shape position, (bottom-right) original image with marker on shape position.
5.1. Connection Machine CM-5

The Connection Machine CM-5 is a multiuser, MIMD, timeshared massively parallel system [25] comprising many processing nodes, each with its own memory, supervised by a control processor, the partition manager (PM). A CM-5 PE consists of a RISC processor, a Network Interface chip, 4 memory units of 8 MBytes RAM, and 4 Vector Unit (VU) arithmetic accelerators connected through a 64-bit M-bus. The RISC processor (a SPARC-2 chip with a clock rate of 32 MHz) acts as a control and scalar processing resource for the VU. The SPARC-2 performs address calculations, loop controls and instruction fetches, and executes the "scalar" portion of the node application. The microprocessor sends all the "parallel" operations (i.e. the vectorizable code) to the VU for execution. The VU accelerators use deep pipelines and large register files (128 32-bit registers per VU) to improve peak computational performance. The peak floating-point performance is 128 MFlop/s per node with VU. PM and PEs are connected to two communication networks, organized in a fat tree architecture: the data network, used for bulk data transfers in which each item has a single source and destination, and the control network, used for operations that involve all the nodes at once, such as synchronization, broadcasting and combining. In the 4-ary fat tree implementation of the data network each PE is a leaf, and data routers are all the internal nodes, each connection providing a bandwidth of 20 MBytes/s in each direction. In the first two levels of the tree, however, each router uses only two parent connections to the next higher level, and only starting from the third level do all routers use four parent connections. The CM-5 system supports several high level languages for the message passing programming model: CM-Fortran, C* and the standard C, C++ and Fortran77. Only CM-Fortran and C*, however, can support the VU hardware. C, C++ and Fortran77, instead, can program only the SPARC-2 microprocessors. All the languages can be exploited for message passing programming, integrating them by means of the CMMD library. All our programs running on a CM-5 are written in CM-Fortran version 2.2 [26], with CMMD library version 3.2 [27].

5.2. Cray T3D

The Cray T3D [28] is a MIMD massively parallel system [25], connected to a host computer that provides support for applications running on the T3D. All applications are compiled on the host system but run on the Cray T3D system. Each node contains two PEs, a local memory, a Network Interface and a block transfer engine. A T3D PE is a RISC DEC Alpha microprocessor, performing arithmetic and logical operations on 32 integer and 32 floating-point 64-bit registers, with a clock rate of 150 MHz. The microprocessor contains an internal instruction cache memory and a data cache memory, each storing 256 lines of data or instructions. Each line is four 64-bit words wide. The size of local memory is 64 MBytes per PE. The block transfer engine is an asynchronous direct memory access controller that redistributes system data. The peak floating-point performance per PE is 150 MFlop/s. The interconnection network forms a three-dimensional matrix of paths and is composed of communication links and network routers, allowing a bidirectional maximum transfer rate of 300 MBytes/s.
Two compilers are available on the T3D: CRAFT Cf77 Fortran and C. The message passing programming model is supported on both compilers by the public domain PVM library for the message passing primitives. Moreover, Cray provides some extensions to the PVM library, in order to speed up interprocessor data transfers. On the T3D the virtual shared memory mechanism is also available. In a T3D system, in fact, memory is physically distributed among processors, but is globally addressable. Any PE can address any memory location in the system, providing communications much faster than PVM. All the high level compilers available on the T3D can directly take advantage of this shared memory mechanism. All the applications tested on the T3D are written in Cf77 Fortran version 6.2 [29], with shared memory primitives for message exchanges.

5.3. IBM SP2

The IBM SP2 is a general-purpose scalable parallel system [6], based on a distributed memory MIMD architecture. The POWER2 RISC System/6000 processors constitute the SP2 nodes, each with its own private memory and its own copy of the AIX operating system. The SP2 provides two node types: wide nodes and thin nodes. Both node types have two fixed-point units and two floating-point units (each capable of a multiply-add every cycle), running at a clock rate of 66.7 MHz, for a peak floating-point performance of 267 MFlop/s per node. An SP2 wide node can have up to 2 GBytes of memory, with a bandwidth of 2.1 GBytes per second, and a 256 KBytes four-way set associative cache. The SP2 thin nodes are similar to the wide nodes, but have a less robust memory hierarchy and I/O configurability. SP2 nodes are interconnected by a High-Performance Switch. The topology of the switch is an any-to-any packet-switched multistage or indirect network similar to an Omega network. This allows the bisection bandwidth to scale linearly with the size of the system, a critical aspect for system scalability. A consequence of the High-Performance Switch topology is that the available bandwidth between any pair of communicating nodes remains constant irrespective of where in the topology the two nodes lie. The parallel Message Passing Library (MPL) is the native communication library supporting explicit message passing for the XLF Fortran77 or C languages, tuned and optimized for the underlying communication hardware [30]. PVMe, the IBM optimized version of the public domain PVM library, is also available on the SP2. All our programs are written in XLF version 3.2 [31], with MPL support for message passing.
6. MESSAGE PASSING CNN ALGORITHM
There are two programming models used to parallelize programs in a distributed environment: the data parallel model and the message passing model. Data parallelism refers to a situation where the same operation is synchronously executed on a large array of data (operands), and the elementary unit viewed by the programmer is the array element rather than the machine processor. In this environment only one code is running, and the compiler is responsible for both data distribution and communications. The message passing model, instead, takes the single processing unit as the user's point of view, with a copy of the user program running asynchronously on each processor. Therefore, given the problem to be parallelized, the programmer must explicitly code data distribution and data exchanges among processors.
Since our aim is to implement a general purpose CNN-based program, message passing is the only programming model that is universally supported. The algorithm has been conceived to run on any workstation cluster, therefore paying particular attention to minimizing communications.

• Data partitioning. Two different strategies can be implemented when distributing grid-oriented problems over a multiprocessor platform. In particular, for a two dimensional image (Figure 7(a)), one can cut the corresponding 2-D data structure along both dimensions, producing a number of small squares [32], or one can subdivide the image into only horizontal or vertical stripes [5]. For the proposed message passing CNN implementation the second solution (i.e. the coarse-grained parallelization scheme) has been chosen, partitioning the two dimensional image along a single dimension. The resulting partial "windows" assigned to each task are then simply horizontal slices (Figure 7(b)).

• Data movement scheme. Great care must be taken, when designing a general purpose parallel algorithm, over the optimization of data exchanges between processors. This is even more important on a generic workstation cluster, where dedicated interconnection networks are not present. In our CNN algorithm, remote data items must be accessed only during the template computation, when operating over border pixels. A significant performance optimization can be obtained with a complete separation, in the source program, of computations and communications. The whole computation phase can thus be performed only on private (local) data. Moreover, to minimize the impact of latency overheads on the communication primitives, the processors should exchange the largest allowed packet sizes, the minimum number of times. Therefore, at each message passing phase, r complete rows of each image border are exchanged between logically adjacent PEs.

• Data rearrangement. Each partial window of the image, assigned to a PE, is extended by means of two dummy stripes, placed at the cut borders. The image pixels necessary during the border computation, and belonging to the logically adjacent partial windows, are duplicated in the local PE memory and stored in the dummy stripes (Figure 7(c)). In this way, the whole computation phase can be completed without communications. A specific procedure called at each iteration is dedicated to rearranging (through a sort of shift operation) the border pixels between logically adjacent windows (i.e. PEs), filling the dummy stripes through specific communication primitives (Figure 7(d)). When processing an L x L image over Np PEs, each PE stores an L x L/Np partial window. Adding the two further border stripes, to collect the top and bottom r neighborhoods, the total window stored in each PE assumes an L x (L/Np + 2 x r) shape. Similar image partitioning schemes have already been proposed in [5], applied to a Meiko multiprocessor architecture.
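As a concrete illustration of this layout, the following C fragment is a minimal sketch of the per-task window: it is our own rendering (the authors' programs are in Fortran), and the array names, the row-major layout and the assumption that L is divisible by Np are ours.

#include <stdlib.h>
#include <string.h>

/* Sketch of the per-PE window: each of the Np tasks owns rows_local = L/Np
 * image rows plus two dummy stripes of r rows each at the cut borders. */
typedef struct {
    int L;          /* full image is L x L            */
    int rows_local; /* rows owned by this PE (L/Np)   */
    int r;          /* template (neighborhood) radius */
    float *win;     /* (rows_local + 2*r) x L cells   */
} LocalWindow;

static LocalWindow make_window(int L, int Np, int r)
{
    LocalWindow w;
    w.L = L;
    w.rows_local = L / Np;          /* assume L divisible by Np */
    w.r = r;
    w.win = calloc((size_t)(w.rows_local + 2 * r) * L, sizeof *w.win);
    return w;
}

/* Row i of the local window; rows 0..r-1 and rows_local+r..rows_local+2r-1
 * are the dummy stripes that mirror the neighbours' border rows. */
static float *row(LocalWindow *w, int i) { return w->win + (size_t)i * w->L; }

/* Copy the rows owned by this PE (taken, e.g., from a full image held by a
 * master task) into the interior of the window; the dummy stripes are
 * filled later by communication. */
static void fill_interior(LocalWindow *w, const float *image, int my_rank)
{
    int first = my_rank * w->rows_local;
    for (int i = 0; i < w->rows_local; i++)
        memcpy(row(w, i + w->r), image + (size_t)(first + i) * w->L,
               (size_t)w->L * sizeof *image);
}

With this layout the interior rows r .. rows_local + r - 1 hold the locally owned slice, and the template computation can read the r rows above and below it without any test for window boundaries.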
Figure 7. Image partitioning across four PEs. (a) Original image. (b) Logical image subdivision in four windows. (c) Dummy borders enhanced. (d) Border exchanges between two logically adjacent windows.
The parallelization scheme described above, a coarse-grained technique from the point of view of the communication design, is specifically conceived to run on a few nodes of loosely coupled machines, with large amounts of data stored on each processor and relatively infrequent exchanges of large packets. When running on generic workstation clusters, the major impact on performance is determined by the very high communication latencies to be paid, due to the absence of high speed networks and of optimized dedicated communication protocols. The whole communication load at each iteration consists of only two send primitives per running task, each one involving r x L data pixels for an L x L image. Therefore, latency overheads to be paid for each data transfer would be negligible compared to transfer times. On the other hand, the increased amount of memory required in each PE by the two dummy image stripes can be neglected if the size of each local subimage is large enough. The complete flow chart of the general purpose distributed algorithm is sketched in Figure 8 (a message passing sketch of the same loop follows the figure caption below). After the spawning of the slave processes by the host task, at each iteration the function dedicated to border data movement is executed, then the two template procedures are performed, operating only on local data, and finally the internal function computes the cell outputs. No explicit synchronization is introduced among the processors, an implicit synchronization among the running tasks being guaranteed by the communication step.
(Flow chart boxes: BEGIN, SPAWN PROCESSES, COMPUTE CONTROL TEMPLATE, COMPUTE FEEDBACK TEMPLATE, INTERNAL FUNCTION, COMMUNICATION SHIFT, END.)
Figure 8. Complete flow chart diagram of the implemented CNN algorithm.
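The loop of Figure 8 can be sketched in C as follows. This is not the authors' code: it uses MPI in place of the PVM, CMMD and MPL primitives actually employed, and the three computation procedures and all names are our own placeholders. It only illustrates the structure: refill the two dummy stripes with a shift-style exchange, then compute on local data only.

#include <mpi.h>
#include <stddef.h>

/* Stand-ins for the authors' three per-iteration computation procedures. */
void control_template(float *win, int rows_local, int L, int r);
void feedback_template(float *win, int rows_local, int L, int r);
void internal_function(float *win, int rows_local, int L, int r);

/* One pass of the loop per CNN iteration; 'win' is the
 * (rows_local + 2*r) x L window with dummy stripes at top and bottom. */
void cnn_loop(float *win, int rows_local, int L, int r,
              int niter, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;
    int nb = r * L;                      /* r border rows per exchange */

    for (int it = 0; it < niter; it++) {
        /* send my top interior rows up, receive my bottom dummy stripe */
        MPI_Sendrecv(win + (size_t)r * L, nb, MPI_FLOAT, up, 0,
                     win + (size_t)(rows_local + r) * L, nb, MPI_FLOAT, down, 0,
                     comm, MPI_STATUS_IGNORE);
        /* send my bottom interior rows down, receive my top dummy stripe */
        MPI_Sendrecv(win + (size_t)rows_local * L, nb, MPI_FLOAT, down, 1,
                     win, nb, MPI_FLOAT, up, 1,
                     comm, MPI_STATUS_IGNORE);

        control_template(win, rows_local, L, r);   /* local data only */
        feedback_template(win, rows_local, L, r);
        internal_function(win, rows_local, L, r);
    }
}

As in the text, each task issues exactly two sends of r x L pixels per iteration, and the blocking exchange itself provides the implicit synchronization between logically adjacent tasks.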
6.1. Optimized versions
Four versions of the message passing CNN algorithm have been developed. Three versions are specifically optimized for the Connection Machine CM-5, the Cray T3D, and the IBM SP2 parallel machines, exploiting the most efficient message exchange capabilities offered by these architectures. Furthermore, a general purpose portable version also exists, which runs efficiently over any workstation cluster. All the programs are written in Fortran. More precisely, on the CM-5 we have used CM-Fortran version 2.2 [26], on the T3D the CRAFT Cf77 version 6.2 [29], and on the SP2 the XLF version 3.2 [31]. The general purpose version has been coded using the standard Fortran77 language. The portable code is written in a master-slave model and exploits only public domain PVM message passing primitives. In the three dedicated parallel versions the master program need not be supplied by the user, since the slave executable copies are automatically spawned at run time by the operating systems; these programs are written in the so-called host-less programming model. The CM-5 CNN version exploits the CMMD_send_and_receive message passing function. In the T3D version, the virtual shared memory mechanism supported by Cray is directly exploited. More precisely, the shmem_put functions are adopted in the communication procedure to write image borders into the memory of another PE, providing communication transfers much faster than the standard PVM primitives [33]. Finally, the SP2 code uses the optimized MPL_SHIFT primitive for message passing.

6.2. Image I/O
The image I/O problem has two possible solutions, depending on the environment. The first one, always available in a massively parallel environment, is simultaneous access to the same file by many tasks, managed by the parallel operating system. Each task "knows" that it must read only a well-defined portion of the file, that is, the assigned subwindow with the dummy borders. In a workstation cluster environment the same approach is possible only if a shared file system is available. Another solution is a master task reading the image file, performing the image partitioning among the other tasks with the creation of the dummy borders, and collecting the final results.
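Under the first I/O approach, each task only needs to seek to its assigned rows in the shared image file. The fragment below is a hedged sketch of that offset computation for a raw row-by-row file of float cells; the file layout, element type and all names are our assumptions, not taken from the authors' code.

#include <stdio.h>

/* Read this task's subwindow (without dummy stripes) from a shared raw
 * image file of L x L float cells stored row by row. Returns 0 on success. */
int read_my_window(const char *path, float *interior,
                   int L, int Np, int rank)
{
    int rows_local = L / Np;
    long offset = (long)rank * rows_local * L * (long)sizeof(float);
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    if (fseek(f, offset, SEEK_SET) != 0) { fclose(f); return -1; }
    size_t want = (size_t)rows_local * L;
    size_t got = fread(interior, sizeof(float), want, f);
    fclose(f);
    return got == want ? 0 : -1;
}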
7. PERFORMANCE ANALYSIS
A number of experiments have been carried out in order to analyze the performance of this distributed CNN-based implementation of low-level CV applications on all the platforms tested. The performance evaluation issue has been addressed from two main perspectives.

• The first aspect concerns the study of the improvements in computing times allowed by the distributed implementations with respect to sequential ones. Therefore, runs have been performed over the three parallel architectures (CM-5, T3D, and SP2), comparing the results with a high performance workstation running a sequential CNN program. Furthermore, some measurements have been taken on homogeneous workstation clusters, comparing the computational times with the sequential implementation running on a single identical workstation.
• The second main performance parameter characterizing a distributed algorithm concerns the speed-up and the efficiency achieved as the number of PEs increases. This analysis has been carried out both on the parallel architectures and on different configurations of the workstation cluster, running the portable PVM CNN version.

In what follows several performance measurements will be reported, with r = 1, as a function of both the image size and the number of processors (or workstations) used. Only timings referring to the representative linear internal function with saturation will be discussed: in this manner, the total number of elementary floating-point operations (Flops) executed is easier to parameterize. Processing times measured with other functions are not substantially different from the reported ones and do not significantly modify the performance considerations and speed-up behaviors.
Figure 9. Processing times (top) and corresponding MFlop/s (bottom), as a function of the image size L, measured running a sequential CNN with r = 1 on a SPARC-20 workstation and the distributed CNN on a 32 PE CM-5, a 32 PE T3D, and a 32 PE SP2.
7.1. Performance comparison
Figure 9 (top) reports the processing times of the three parallel CNN implementations, on a 32 PE CM-5, a 32 PE T3D, and a 32 thin node SP2, measured with L x L images (times refer to a single CNN iteration and are expressed in seconds). The figure also plots the execution times of a serial CNN implementation running on a high performance workstation (a SPARC-20 with 64 MBytes of memory). The comparison shows that an improvement of at least one order of magnitude is achieved by the message passing algorithm on all the parallel systems. The improvement is more important at large image sizes, because of the lower impact of communications on the performance of the parallel architectures. The SP2, and particularly the CM-5, in fact show a performance degradation at low L values. For the T3D, instead, the time scaling with L^2 is almost linear, owing to the very efficient communication capabilities allowed by the shared memory machine configuration. As a matter of fact, with L = 64 the CM-5 runs 6 times faster than the SPARC-20, the SP2 30 times faster, and the T3D 45 times faster. Moreover, to give an exact evaluation of the different computational capabilities of the four systems under consideration, Table 1 reports the sec/cell required by the algorithm on the platforms tested, as resulting from all the measurements, that is, even when more realistic problem sizes are processed (such as L = 1024). Figure 9 (bottom) reports the MFlop/s corresponding to the previous computational times. The number of elementary floating-point operations to be performed per cell at each CNN iteration, as a function of the template radius r, considering the linear function with saturation, is:

N_flop = 2 * (2 * (2r + 1)^2 - 1) + 5     (6)
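As a worked example of Equation (6), with the 3 x 3 template (r = 1) used in the measurements it gives 2 * (2 * 9 - 1) + 5 = 39 floating-point operations per cell per iteration, so a measured time per cell converts directly to a rate. The tiny helper below (names and the sample Table 1 value are only illustrative) performs the conversion and, for the SP2 at L = 1024, yields roughly 1100 MFlop/s, consistent with the figure quoted in the text.

#include <stdio.h>

/* Flop count per cell per iteration from Equation (6), and the
 * corresponding MFlop/s given a measured time per cell (s/cell). */
static double nflop_per_cell(int r)
{
    return 2.0 * (2.0 * (2 * r + 1) * (2 * r + 1) - 1.0) + 5.0;
}

int main(void)
{
    double nf = nflop_per_cell(1);   /* 39 flops for a 3 x 3 template  */
    double t_cell = 3.52e-8;         /* SP2, L = 1024, from Table 1    */
    printf("%.0f flop/cell -> %.0f MFlop/s\n", nf, nf / t_cell / 1e6);
    return 0;
}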
While the sequential implementation always sustains about 10 MFlop/s, the three parallel implementations achieve maximum performances of 865, 425, and 1110 MFlop/s for the CM-5, the T3D and the SP2, respectively. It should be noted that, while CM-5 and SP2 performance increases with L, the T3D is not able to maintain the performance sustained at low L values, owing to the direct mapped architecture of the cache memory of the T3D Alpha processors, which causes an increasing number of cache misses as the total memory required by the problem increases. Another important aspect to be taken into account is the increase in the image sizes that can be processed, due to the much larger memory available on the parallel systems. The maximum image size fitting the memory of the sequential workstation is L = 1024. On the contrary, with 32 nodes of the Massively Parallel Processor (MPP) machines, image sizes up to L = 4096 can be processed, given the r = 1 template. Since the availability of MPP architectures is not yet widespread, interesting results can arise also from a performance evaluation of the distributed CNN algorithm on a workstation cluster. In fact, clusters of workstations are the most common type of "parallel machine" available, and it is important to focus on how CV programs perform on such a platform. Figure 10 reports the computational times (top) of the distributed CNN algorithm on a cluster of eight SPARC-IPX workstations with 32 MBytes of memory, as a function of L. The results of the sequential runs carried out on an identical single workstation of the cluster are also plotted. The improvement allowed by the distributed
Figure 10. Processing times (top) and corresponding MFlop/s (bottom), as a function of the image size L, measured running a sequential CNN with r = 1 on a SPARC-IPX workstation and the distributed PVM version on 8 identical workstations.
implementation is significant only at large image sizes; asymptotically it reaches almost a factor of eight of speed-up, when the communication weight is small compared to the computational weight of the problem. On the contrary, at low L values communications become a significant bottleneck, due to the low bandwidth and high latency of the Ethernet network and the TCP/IP protocol. This consideration is confirmed by the analysis of the corresponding performance (in MFlop/s) obtained as a function of L, Figure 10 (bottom). While the single workstation is able to sustain almost 2 MFlop/s at low L and slightly decreases its performance with L (due to the increasing number of page faults), the distributed PVM version scales with L from 3 up to almost 14 MFlop/s. However, the sequential implementation is limited to L = 512 by memory bounds, while images up to L = 1024 can be processed using the whole cluster. Hence, owing to the coarse-grained parallelization scheme followed for the distributed
CNN implementation, the program is able to perform very well even on loosely coupled workstation clusters, given that the problem complexity is high enough to keep the weight of the communications low compared to the computational cost.
Figure 11. Speed-up of the distributed CNN algorithm, as a function of the available processors P with a 3 x 3 template and 240 x 240 (top) and 480 x 480 (bottom) images, on the three MPP machines.
7.2. Speed-up and efficiency
In order to study the scaling behavior of this prototype CNN application as a function of the number of available processors, several experiments have been carried out with a variable number P of PEs on the three parallel machines. Figure 11 plots the performance speed-up Sp achieved by increasing P from 2 to 32 (assuming Sp = 1 for P = 2), measured with two representative image sizes: a larger one of 480 x 480 (bottom) and a smaller one of 240 x 240 (top).
The key parameter in this test is again the communication weight. It increases either when the number of available processors increases, or when the image size decreases; in both cases the partial image size stored on each PE decreases, giving communications a larger relative weight. Very impressive is the behavior observed on the T3D. This architecture seems to be only slightly sensitive to the increasing communication weight, owing to the very fast interprocessor bandwidth available (more than 100 MBytes/s) [34]. Its speed-up scaling is almost linear with P, up to a large number of processors, particularly with L = 480, and decreasing the grid size has only a negligible impact on the measured speed-up. An important index, directly related to the degree of parallelism achieved as a function of P, is the efficiency Ep, defined as Sp/P (in our case Ep = Sp * 2/P, as the starting point is P = 2). Ep is plotted in Figure 12, for both L = 240 and L = 480: at P = 32, Ep on the T3D is 98% with L = 480, and 93% with L = 240.
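Since the baseline here is the two-processor run, the quoted efficiencies follow directly from the measured wall-clock times. The small sketch below (the timings are placeholders, not measured values) shows the bookkeeping implied by the definitions Sp = T(2)/T(P) and Ep = Sp * 2/P:

#include <stdio.h>

/* Speed-up and efficiency as defined in the text, with Sp = 1 at P = 2. */
int main(void)
{
    double t2 = 1.0;                              /* time with P = 2 (arbitrary units) */
    int    P[]  = {2, 4, 8, 16, 32};
    double tP[] = {1.0, 0.5, 0.25, 0.13, 0.064};  /* hypothetical timings */
    for (int i = 0; i < 5; i++) {
        double Sp = t2 / tP[i];
        double Ep = Sp * 2.0 / P[i];
        printf("P=%2d  Sp=%5.2f  Ep=%4.0f%%\n", P[i], Sp, 100.0 * Ep);
    }
    return 0;
}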
Figure 12. Efficiency of the distributed CNN algorithm, as a function of the available processors P with a 3 x 3 template and 240 x 240 (top) and 480 x 480 (bottom) images, on the three MPP machines.
The CM-5 efficiency is much more penalized by communications. Whereas for the larger image the speed-up is not too far from linear (the corresponding efficiency is Ep = 79% at P = 32), on the smaller one the machine efficiency drops significantly (Ep = 58% at P = 32). The SP2 shows better behavior than the CM-5, even if worse than the T3D: its speed-up curve is satisfactory even with a large number of processors and with small problem sizes. The corresponding efficiencies are 83% (L = 480) and 71% (L = 240), with P = 32. These high efficiency values are explained both by the extremely high performance interconnection networks available on the most recent parallel architectures and by the favorable ratio between computations and communications of the presented CNN algorithm.
Figure 13. Speed-up of the distributed CNN algorithm, as a function of the available processors P with a 3 x 3 template and 240 x 240 (top) and 480 x 480 (bottom) images, on the SPARC-IPX workstation cluster.
Important measurements are also provided by the evaluation of the performance speed-up achieved when running on an increasing number of workstations in a cluster, in order to study the range of applicability of CV codes on a generic distributed platform, without dedicated interconnection networks. Figure 13 reports, as a function of the number of workstations P, the performance speed-up Sp (up to P = 8) achieved running the portable PVM code on a cluster of SPARC-IPX workstations, with L = 240 (top) and L = 480 (bottom). The Sp values previously obtained on the T3D and the SP2 are also reported in the figure, in order to compare the cluster results with the two best MPP results analyzed above. The corresponding efficiencies Ep are reported in Figure 14. Key points to be carefully considered when running on workstation clusters are the load of the machines, which can heavily affect computation performance and subtract memory resources from the running processes, and the load of the interconnection network, which can affect communication times. Our measurements are taken in single user mode and with low network traffic, considering, as the overall execution time, the best wall clock time among a series of several runs. The measurements show that, at least with up to four or six workstations (when communications do not introduce relevant overheads), the algorithm assures a nearly linear speed-up in the number of processors. The efficiencies are about 90% with both the L = 240 and L = 480 image sizes. On the contrary, in the final regions of the graphs the cluster speed-up degrades, and the efficiency drops to less than 80% with L = 240, while for the two parallel machines Ep is still more than 90% with P = 8. These considerations emphasize the fact that, when dedicated fast interconnection networks are not present, the number of processors should not become too high, in order to maintain satisfactory degrees of efficiency, even though the algorithm design is low in communication cost.

Table 1
Computational times in s/cell/iteration over an L x L image, with a 3 x 3 template.

Machine     L = 64       L = 128      L = 256      L = 512      L = 1024
CM-5        8.25·10^-7   2.61·10^-7   1.10·10^-7   6.71·10^-8   5.11·10^-8
T3D         9.23·10^-8   9.85·10^-8   1.05·10^-7   1.63·10^-7   2.06·10^-7
SP2         1.30·10^-7   6.57·10^-8   4.73·10^-8   3.89·10^-8   3.52·10^-8
SPARC-20    3.90·10^-6   3.91·10^-6   3.97·10^-6   4.04·10^-6   7.45·10^-6
8. CONCLUSIONS
The CNN paradigm plays a crucial role in the CV framework. Most of the low-level CV algorithms can be expressed in terms of the general CNN equation. Moreover, middle-level CV functions too (e.g., object detection and recognition) can be efficiently implemented by means of this powerful formalism. Many CV applications involve the processing of large images, requiring the designer to overcome the limits imposed by single workstation bounds. Parallel or distributed platforms, providing large computational power and large memory availability, can significantly speed up CV computations and extend the range of images that can be processed. In this work a complete performance analysis of a
Figure 14. Efficiency of the distributed CNN algorithm, as a function of the available processors P with a 3 x 3 template and 240 x 240 (top) and 480 x 480 (bottom) images, on the SPARC-IPX workstation cluster.
coarse-grained distributed implementation of a general CNN algorithm for CV has been presented. Four program versions were presented: three optimized for the Connection Machine CM-5, the Cray T3D and the IBM SP2 parallel machines, and a general purpose version for any workstation cluster supporting PVM. The processing times of the message passing algorithm on the parallel machines, measured with increasing image sizes, prove to be at least 20-30 times faster than on a SPARC-20 workstation. Moreover, significant improvements are obtained in the size of the images which can be processed. The favorable coarse-grained parallelization scheme adopted in the design of the distributed algorithm, however, allows effective performance results even when running on loosely coupled workstation clusters, given that large enough image sizes are processed. The communication bottleneck becomes important on clusters of workstations only when running with very high numbers of processors or when too small problem sizes are stored on each processor.
ACKNOWLEDGMENTS
The authors warmly thank Prof. Leon O. Chua of Berkeley University, Prof. Gianni Conte and Prof. Giovanni Adorni of Parma University, and Dr. Pietro Sguazzero and Dr. Carla Conci of IBM for their helpful suggestions and encouragement. This work has been made possible by the cooperation of IPG in Paris, allowing us to use the Connection Machine CM-5, and CINECA in Bologna (Italy), allowing us to use the Cray T3D and the IBM SP2.
REFERENCES
1. D. H. Ballard and C. M. Brown. Computer Vision. Prentice-Hall, Englewood Cliffs, 1982.
2. D. Marr. Vision. Prentice-Hall, Englewood Cliffs, 1982.
3. L.O. Chua and L. Yang. Cellular Neural Network: Theory. IEEE Transactions on Circuits and Systems, 35:1257-1272, 1988.
4. L.O. Chua and T. Roska. The CNN Paradigm. IEEE Transactions on Circuits and Systems - I, 40:147-155, 1993.
5. Ioannis Pitas, editor. Parallel Algorithms for Digital Image Processing, Computer Vision and Neural Networks. John Wiley & Sons, New York, 1993.
6. T. Agerwala, J.L. Martin, J.H. Mirza, D.C. Sadler, D.M. Dias, and M. Snir. SP2 System Architecture. IBM Systems Journal, 34(2):152-184, 1995.
7. Tommaso Toffoli and Norman Margolus. Cellular Automata Machines. MIT Press, Cambridge, MA, 1987.
8. T. Roska and L.O. Chua. The CNN is Universal as the Turing Machine. IEEE Transactions on Circuits and Systems - I, 40:289-291, 1993.
9. L.O. Chua and L. Yang. Cellular Neural Network: Applications. IEEE Transactions on Circuits and Systems, 35:1273-1290, 1988.
10. W. K. Pratt. Digital Image Processing. John Wiley & Sons, New York, 1978.
11. G. Adorni, V. D'Andrea, and G. Destri. A Massively Parallel Approach to Cellular Neural Networks Image Processing. In Proceedings of The Third IEEE International Workshop on Cellular Neural Networks and their Applications CNNA-94, pages 423-428, Rome, Italy, December 1994.
12. P. Saint-Marc and G. Medioni. Adaptive smoothing for feature extraction. In Proceedings of Image Understanding Workshop, pages 1100-1113, Cambridge, MA, 1988. MIT Press.
13. A. Kellner, H. Magnussen, and J.A. Nossek. Texture Classification, Texture Segmentation and Text Segmentation with Discrete-Time Cellular Neural Networks. In Proceedings of The Third IEEE International Workshop on Cellular Neural Networks and their Applications CNNA-94, pages 243-248, Rome, Italy, December 1994. IEEE Press.
14. H. Magnussen, G. Papoutsis, and J.A. Nossek. Continuation-Based Learning Algorithm for Discrete-Time Cellular Neural Networks. In Proceedings of The Third IEEE International Workshop on Cellular Neural Networks and their Applications CNNA-94, pages 171-176, Rome, Italy, December 1994. IEEE Press.
15. F. Dellaert and J. Vandewalle. Automatic Design of Cellular Neural Networks by means of Genetic Algorithms: Finding a Feature Detector. In Proceedings of The Third IEEE International Workshop on Cellular Neural Networks and their Applications CNNA-94, pages 189-194, Rome, Italy, December 1994. IEEE Press.
16. N.N. Aizemberg, I.N. Aizemberg, and T.P. Belikova. Extraction and Localization of Important Features on Gray-Scale Images: Implementation on the CNN. In Proceedings of The Third IEEE International Workshop on Cellular Neural Networks and their Applications CNNA-94, pages 207-212, Rome, Italy, December 1994. IEEE Press.
17. M. Balsi and N. Racina. Automatic Recognition of Train Tail Signs Using CNNs. In Proceedings of The Third IEEE International Workshop on Cellular Neural Networks and their Applications CNNA-94, pages 225-230, Rome, Italy, December 1994. IEEE Press.
18. J. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8:679-698, November 1986.
19. D.E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA, 1989.
20. T. Kozek, T. Roska, and L.O. Chua. Genetic Algorithms for CNN Template Learning. IEEE Transactions on Circuits and Systems - I, 40:392-402, 1993.
21. G. Destri. Discrete Time Cellular Neural Networks Construction Through Evolution Programs. In Proceedings of The Fourth IEEE International Workshop on Cellular Neural Networks and their Applications CNNA-96, pages 473-478, Seville, Spain, June 1996.
22. A. Broggi and G. Destri. Expectation-driven segmentation: A Pyramidal Approach. In Proceedings of The International Conference on Image Processing: Theory and Applications 1993, pages 147-150, SanRemo, Italy, June 1993.
23. Alberto Broggi and Simona Berté. Vision-Based Road Detection in Automotive Systems: a Real-Time Expectation-Driven Approach. Journal of Artificial Intelligence Research, 3:325-348, December 1995.
24. G. Adorni, V. D'Andrea, G. Destri, and M. Mordonini. Shape Searching in Real World Images: a CNN-Based Approach. In Proceedings of The Fourth IEEE International Workshop on Cellular Neural Networks and their Applications CNNA-96, pages 213-218, Seville, Spain, June 1996.
25. K. Hwang. Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, Inc., New York, 1993.
26. Thinking Machines Corporation. CM-5 CM Fortran Language Reference Manual, Version 2.1, 1992.
27. Thinking Machines Corporation. CMMD Reference Manual, Version 3.2, 1992.
28. Cray T3D System Architecture Overview. Technical report, Cray Research Inc., 1994.
29. Cray Research Inc. MPP Fortran Programming Model, 1994.
30. IBM Corporation. Parallel Programming Subroutine Reference, 1995.
31. IBM Corporation. AIX XL Fortran Compiler/6000 Language Reference, Version 3.1, 1994.
32. G. Destri, V. D'Andrea, and M. Pontremoli. Using a 3-D mesh massively parallel computer for Cellular Automata Image Processing. In Proceedings of First Italian Workshop on Cellular Automata for Research and Industry ACRI-94, pages 191-200, Rende (CS), Italy, September 1994.
33. G. Destri and P. Marenzoni. Cellular Neural Networks: A Benchmark for Lattice Problems on MPP. In Proceedings of ParCo '95, Gent, Belgium, September 1995.
34. P. Marenzoni. Performance Analysis of Cray T3D and Connection Machine CM-5: a Comparison. In Proceedings of the International Conference "High-Performance Computing and Networking HPCN'95", pages 110-117. Springer-Verlag, May 1995.
Giulio Destri
Giulio Destri received his Laurea degree in Electronic Engineering from the University of Parma in 1992, with a Master's Thesis on the implementation of Image Processing techniques on Massively Parallel Architectures. He is currently a Ph.D. candidate in Computer Engineering at the Dipartimento di Ingegneria dell'Informazione of the University of Parma. His research interests include the study of parallel paradigms such as Cellular Neural Networks, parallel and distributed processing, and their application to computer vision. He has been involved in the Eureka project PROMETHEUS, an EEC project for improving traffic safety. From April to June 1996 he was a TRACS visitor at the Edinburgh Parallel Computing Centre and at the AI department of the University of Edinburgh, Scotland. Giulio Destri is a member of AI*IA, IEEE, and the IEEE Computer Society. Home Page: http://www.ce.unipr.it/people/destri
Paolo Marenzoni
Paolo Marenzoni received his Laurea degree in Physics from the University of Parma in 1992, discussing a Master's Thesis about the parallel implementation of Monte Carlo simulations on the Connection Machine CM-2. He is currently a Ph.D. candidate in Computer Engineering at the Dipartimento di Ingegneria dell'Informazione of the University of Parma. His major interests cover the field of parallel and distributed algorithms and programming languages. In particular, he has been interested in the study of the parallel solutions of Petri Nets, the application of computational paradigms to distributed image processing and the implementation and optimization of communication protocols in distributed environments. Paolo Marenzoni is a member of IEEE and the IEEE Computer Society. Home Page: http://www.ce.unipr.it/people/marenz
Decision Trees on Parallel Processors

Richard Kufrin
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
405 N. Mathews Ave., Urbana, Illinois 61801, U.S.A.
rkufrin@uiuc.edu

A framework for induction of decision trees suitable for implementation on shared- and distributed-memory multiprocessors or networks of workstations is described. The approach, called Parallel Decision Trees (PDT), overcomes limitations of equivalent serial algorithms that have been reported by several researchers, and enables the use of the very large scale training sets that are increasingly of interest in real-world applications of machine learning and data mining.

1. Introduction
One of the most active areas of machine learning research over the past several years has been the development of algorithms for supervised learning (or learning from examples) [13]. Numerous techniques have been proposed, implemented, and extended by researchers from several disciplines; all have the primary goal of deriving a concept description from pre-classified examples - or equivalently, of inducing generalized descriptions of one or more classes given a set of examples representing instances of each class. The ability to classify unseen examples is applicable to a wide variety of real-world tasks. Sample applications include medical diagnosis, financial analysis and forecasting, engineering fault diagnosis, and information retrieval [11]. There has been a great deal of interest in the emerging field of knowledge discovery in databases (KDD). Classification of examples extracted from real-world databases can be expected to involve huge amounts of training data; hence the ability to cope with extremely large training sets efficiently is an active research topic within the KDD community [4,6]. Most studies of machine learning algorithms to date have involved training sets of small to moderate size (for example, the mean size of training sets in the UCI Machine Learning Repository [15] is less than 2500 examples per database - a figure that drops by approximately one-third if one excludes the two largest databases in the repository). To deal effectively with increasingly large real-world databases, machine learning algorithms that are both space- and time-efficient are needed. Concurrent with advances in machine learning algorithms, the development of hardware and software technologies that enable the application of multiple processors to the solution of problems has brought massive parallelism from the prototype phase to the production environment. Massively parallel architectures are now in routine use in commercial settings for scientific, engineering, and business applications. Even more pervasive has
been the development of extremely high-powered workstations that now bring compute- and memory-capacities once available only through supercomputer-class systems to the desktop. Further, the introduction of enabling software such as PVM, Express, and Linda (among others) that allow the creation of "virtual" parallel machines composed of networks of workstations has enabled cost-effective parallelism to be exploited by individual research groups for whom the purchase of tightly-coupled multiprocessors would be impossible. In the coming decade, we can expect the incorporation of new network technologies (notably ATM) to provide a distributed network computing environment for parallel applications in which inter-processor transfer rates of hundreds of megabytes per second will be possible [23]. The following sections describe Parallel Decision Trees (PDT) - a strategy for implementing a class of symbolic inductive inference algorithms within a parallel computing framework available today in shared- and distributed-memory multiprocessors or networked workstations. Section 2 gives an overview of this class of algorithms and approaches to parallelization, Section 3 presents the details of the PDT algorithm, Section 4 presents empirical results, and Section 5 describes additional modifications to the algorithm for improving performance. Section 6 discusses the incorporation of this parallelization approach into other inductive learning programs, and Section 7 provides a summary and offers suggestions for future work.
2. Supervised Learning
Several different paradigms for supervised learning are in common use today, including neural networks, instance-based, genetic algorithm, rule induction, and analytic methods [10]. Numerous investigators have conducted empirical comparisons of the performance of representative systems from each of these classes (for a recent example, see [14]). Although no consensus has been reached regarding the relative accuracy of classification among these methods across problem domains, it is clear that methods such as decision trees require far less CPU time to induce a classifier, due in part to the greedy algorithms employed by these techniques. However, recent studies of the scalability of symbolic methods have indicated that, for extremely large training sets, even decision tree algorithms can require an inordinate amount of CPU time to complete [3,17,12]. This work is concerned with decision tree algorithms and their application to very large training sets. We begin with a brief review of terminology; for a thorough description of methods for inducing decision trees, see [20,1,24].

Notation and Terminology
The fundamental task of supervised learning algorithms is to find some representation of one or more classes (or concepts), denoted {C1, C2, ..., CN}, from a training set T of preclassified examples (or cases), denoted {X1, X2, ..., Xm}. Each example Xi in T consists of a set of k attribute values described by a vector Xi = {x1, x2, ..., xk} and an associated class label c. Attributes in X may be categorical, where the domain of xi is finite and has no inherent ordering, continuous (i.e., real-valued), or ordinal. Ordinal attributes, like continuous attributes, have a well-defined order among elements but are restricted to a countably infinite domain. These attributes are often the result of a discretization step
Figure 1. A decision tree.
applied to a continuous attribute. A split, denoted S(T, xi), defines a partitioning of the examples X in T according to the value of attribute xi for each example. The result of split S is two or more subsets of T, denoted {T1, T2, ..., TD(xi)}, where D(xi) denotes the cardinality of attribute xi (in the case of continuous attributes, where S enforces a binary split of T, D(xi) = 2).
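A plain C rendering of this notation can serve as a reference point for the sketches that follow; the type and field names below are our own, not the author's, and categorical values are assumed to be coded as small integers.

/* One training example: k attribute values plus a class label in 0..N-1. */
typedef enum { ATTR_CATEGORICAL, ATTR_ORDINAL, ATTR_CONTINUOUS } AttrKind;

typedef struct {
    double *x;      /* k attribute values (categorical values coded as ints) */
    int     cls;    /* class label c                                          */
} Example;

typedef struct {
    Example  *ex;    /* the m examples X1..Xm                                 */
    int       m;     /* number of examples |T|                                */
    int       k;     /* number of attributes                                  */
    int       n_cls; /* number of classes N                                   */
    AttrKind *kind;  /* per-attribute type                                    */
    int      *card;  /* D(xi): cardinality of each categorical attribute      */
} TrainingSet;

/* A split S(T, xi): one branch per value for a categorical attribute,
 * or a binary split xi <= K for a continuous attribute. */
typedef struct {
    int    attr;       /* index i of the splitting attribute    */
    double threshold;  /* K, used when kind[attr] is continuous */
} Split;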
2.1. Decision Trees
Decision tree methods are a family of learning algorithms that use a divide-and-conquer approach to inducing a tree-based representation of class descriptions. Among the most well-known decision tree algorithms are Quinlan's ID3 and its successor C4.5 and the CART method of Breiman et al. For consistency, we focus on ID3 methods hereafter; see Section 6 for comments regarding general applicability of the parallelization strategy described. Figure 1 shows a decision tree classifier induced from a four-class training set with three attributes: v1, v2, and v3. v1 is ordinal with the possible values high, med, or low, v2 is continuous, and v3 is categorical with values red, green, or blue. To classify an unseen case using this tree, a path is traced from the root of the tree to a leaf according to the value of the attribute x encountered at each internal node. When a leaf is encountered, the class label associated with that leaf becomes the predicted class for the new case. For example, to classify a case with attribute vector X = {med, 25.1, blue}, the path v3 = blue, v2 < 50, v1 = med leads to the leaf labeled class 3, so this class is chosen as the predicted value of the new case. Note that, in the decision tree of Figure 1, attribute
v1 has been treated as if categorical in type, with separate branches created for each possible value. In practice, ordinal attributes are often treated as continuous, so that internal nodes associated with v1 are labeled with relational tests such as "v1 < med" rather than tests of equality as shown here (we will return to this issue in Section 3.2). Having described the procedure for classifying unseen cases with an existing decision tree, we turn our attention to the issue of training, that is, determining the structure of a decision tree, given a particular pre-classified training set. Top-down decision tree algorithms begin with the entire set of examples and repeatedly subdivide the training set according to some heuristic until the examples remaining within a subset represent only a single class (or, if the available attributes do not sufficiently discriminate between classes, when no further discrimination is possible). A great many variations on this approach have been investigated, but in general, these algorithms follow a recursive partitioning scheme with the following outline:

1. Examine the examples in T. If all examples belong to the same class Cj, create a leaf node with label Cj and terminate.
2. Evaluate potential splits of T according to some "measure of goodness" H and select the "best" split, S(T, xi). If all attribute values are identical within examples in T or if no potential split appears beneficial, determine the majority class Cj represented in T, create a leaf node with label Cj and terminate.
3. Divide the set of examples into subsets according to the split S selected in step 2, creating a new child node for each subset.
4. Recursively apply the algorithm for each child node created in step 3.

Decision tree algorithms can themselves be classified by how they address the following issues [1,16]:

• restrictions on the values of xi (i.e., categorical, ordinal, continuous),
• methods for constructing (and restrictions on) candidate partitions (S),
• measures for evaluating candidate partitions (H),
• approaches for coping with missing values,
• approaches for pruning the resulting tree to avoid overfitting, and
• strategies for dealing with noisy attributes or classifications.

With respect to these issues, ID3 (as originally proposed):

• accepts categorical and continuous data,
• partitions the data based on the value of a single attribute, creating branches for each possible value,
• uses the information-theoretic criterion gain as the heuristic H for a means of evaluating a candidate partition S(T, xi),
info(T) = - ).~
freq(Cj, T)
(freq!Cr * log2 k
IT I
where freq(Cj, T) represents the number of examples of class Cj among the total examples in T, and ITI is the total number of examples in T. By using the gain criterion as the heuristic for evaluating potential splits of the training set, ID3 attempts to judiciously select those attributes that discriminate among the examples in the training set so that, on average, the impurity of class membership at each node is reduced as quickly as possible. C4.5 is a descendant of ID3 that incorporates several additions to improve the capabilities and performance of the parent system. These improvements include: 9 refined information gain criterion to adjust for apparent gain attributable to tests with many attribute values, 9 modified approach to handle missing values during training and classification where examples with unknown values for a partitioning criterion are "fragmented" among child nodes {T1, T2,..., Tr)(x~)}, 9 methods for pruning to compensate for noise and to avoid overfitting the training set, and 9 providing for
value groups,
which merge subsets of attribute values.
In the remainder of this paper, references to ID3 should be taken to include the decision tree induction component of the C4.5 system, except where noted. Unlike categorical attributes, an infinite number of candidate splits are applicable to continuous attributes. ID3 (like CART) attempts to create binary splits for continuous attributes of the form x~ < = K, where K is some constant. Although there are an infinite number of possibilities for the choice of K, ID3 examines only m - 1 candidates, which are exactly those represented in the training set. The information gain is computed for each of the m - 1 candidates and is used (as in the categorical case) to evaluate possible splits.
284
Computational Requirements To evaluate the information gain associated with a split of T based on attribute xi, we must determine the class frequency totals for: 1. all examples in T, and 2. each subset T~ based on partitioning T according to possible values of xi. Quinlan [19] notes that the computational complexity of ID3 (for categorical attributes) at each node of the decision tree is O ( ] N [ , [A[), where N is the number of examples and A is the number of attributes examined at the node. A separate analysis that focused on the effect of continuous attributes on ID3's time requirements concludes that the total cost of the algorithm is over-quadratic in the size of the training set [17]. Clearly, the use of continuous data greatly expands the domains for which ID3 is useful, however it also significantly increases the computational time required to build a decision tree. To speed the examination (and associated frequency sum calculation) of the candidates, ID3 first sorts the training examples using the continuous attribute as the sort key. The sorting operation, which increases the computational requirements to O(m log2 m), contributes to potentially exorbitant CPU time for large training sets. In empirical studies involving very large training sets, Catlett [3] writes: ... as training sets become enormous, error rates continue to fall slowly, while learning time grows with a disturbingly large exponent . . . . Profiling on large training sets shows that most of the learning time is spent sorting the values of continuous attributes. The obvious cure for this would be not to sort so many continuous values, provided a way could be found of doing this that does not affect the accuracy of the trees, which may hinge on very precise selection of critical thresholds. Catlett's solution to the above problem is called peepholing; the basic idea is to discard a sufficient number of candidates for threshold values so that the computational expense of sorting is lessened. It is (approximately) an intelligent sampling of the candidates that aims to create a small "window" of threshold values; this window is then sorted as usual. Empirical results showed that peepholing produced significant improvements over the traditional ID3 algorithm for several large training sets, however there is no guarantee that this approach will perform with consistent accuracy over all possible domains.
Pruning Although not the focus of the present work, simplification of decision trees through pruning techniques is an important component of any decision tree algorithm. It is sufficient to note that several methods have been developed, some of which estimate error rates using unseen examples or cross-validation techniques while other approaches base simplification decisions on the examples used to induce the tree. In either case, we need only to obtain misclassification totals for the (training or test) set in order to predict error rates for the purposes of pruning. No aspect of the algorithm presented here precludes following an appropriate pruning algorithm as the entire training set is available throughout execution.
285
Figure 2. A model-driven parallelization of decision tree induction.
2.2. Approaches to Parallel Induction of Decision Trees For training sets of small to moderate size, ID3 is computationally inexpensive - it is unnecessary to apply parallelism to gain a benefit in execution time. However, when applied to massive quantities of data, eventually the sheer size of the training set can be expected to require non-trivial amounts of computation. Additionally, one can employ the aggregate available memory of distributed-memory multiprocessors or workstation clusters to accommodate ever-increasing sizes of training sets that may not be feasible on individual machines. M o d e l - d r i v e n Figure 2 shows a model-driven parallelization strategy for decision tree induction, which may seem to be the most natural strategy of assigning processing elements to nodes of the decision tree and reflects the "divide and conquer" nature of the algorithm. Although appropriate to many search strategies such as branchand-bound, the limitations of this approach when applied to decision tree induction become apparent. It is difficult to partition the workload among available processors (as the actual workload is not known in advance) - if the partitioning of Figure 2 is chosen, clearly the processor assigned the middle branch of the root of the tree will complete first and will idle. Alternatively, a "master-worker" scheme for task assignment, where available processors are assigned the task of determining the best attribute for splitting of a single node and then is returned to a "waiting pool" may exhibit excessively fine-grained parallelism the overall computation time may be dominated by the overhead of task assignment and bookkeeping activities. In both approaches, potential speedup is limited by the fact that, on a per-node basis, the root of the decision tree requires the largest computational effort as all m examples and k attributes must be examined to determine the initial partitioning of the full training set. Finally, this approach assumes global access to the training set, preventing efficient implementation on distributed-memory parallel platforms.
286
x1 x2 x3 x4
Training Set
~.
Figure 3. An attribute-based parallelization of decision tree induction.
A t t r i b u t e - b a s e d Shown in Figure 3, attribute-based decomposition is another strategy that associates each of p processing elements with kip independent subsets of the available attributes in X so that the evaluation of gain for all k attributes can proceed concurrently. This approach has the benefit of simplicity as well as achieving excellent load-balancing properties. Although this strategy does not require global access to the full training set, at least two limitations of attribute-based parallelism should be noted. The first involves potential load imbalance at lower nodes of the decision tree when data sets include a significant number of categorical attributes that are selected at higher nodes of the tree for splitting. Secondly, the potential for concurrent execution p is bounded by k, the total number of available attributes. D a t a - p a r a l l e l A data-parallel decomposition strategy, as shown in Figure 4, assigns "blocks" of training examples to processors, each of which executes a SIMD (singleinstruction/multiple-data) program on the examples assigned locally. A straightforward adaptation of a serial decision tree algorithm for data-parallel execution must still enable global access to the complete training set, discouraging development of implementations with this strategy. However, the PDT algorithm, described in Section 3, is a modified data-parallel approach that offers a solution to this limitation. Pearson [18] evaluated the performance of a combination of the "master-worker" approach and attribute-based decomposition. His experiments, implemented using the coordination language Linda on a Fujitsu cellular array processor, involved relatively complex strategies for processor assignment to tasks in order to compensate for rapidly-decreasing workloads in lower levels of the decision tree and the accompanying increase in the ratio of parallel overhead to "useful" work. Pearson's conclusion that "none of the results show a decrease in speed [ commensurate ] with the possible parallel computation" underscores the drawbacks of this strategy.
287
I I!
I
I I
I I
I
!
.... T r a i n i n g S e t
Figure 4. A data-parallel approach to decision tree induction.
3. T h e P D T A l g o r i t h m Returning to the data-parallel approach shown in Figure 4, we see that the motivation behind this decomposition strategy arises from the observation that most decision tree induction algorithms rely on frequency statistics derived from the data itself. In particular, the fundamental operation in ID3-1ike algorithms is the counting of the attribute value/class membership frequencies of the training examples. Parallel Decision Trees (PDT) is a strategy for data-parallel decision tree induction. The machine model employed assumes the availability of p processing elements (PE), each with associated local memory. The interprocessor communication primitives required are minimal: each PE must be able to send a contiguous block of data to its nearest neighbor; additionally, each PE must be able to communicate with a distinguished "host" processor. This machine model is general enough so that the strategy may be employed on currently-available massively-parallel systems as well as networks of workstations. Because the communication patterns involved are regular, with the bulk of transfers involving only nearest-neighbor PEs, the additional overhead incurred due to inter-processor communication is kept to a minimum (certain optimizations may be employed if the underlying machine supports them; these are described in Section 5). 3.1. D a t a D i s t r i b u t i o n PDT partitions the entire training set among the available PEs so that each processor contains within its local memory at most [m/p 1 examples from T. This partitioning is
288 static throughout induction and subsequent pruning. No examples are allocated to the host processor, which is instead responsible for: 1. Receiving frequency statistics or gain calculations from the "worker" PEs and determining the best split. 2. Notifying the PEs of the selected split at each internal node. 3. Maintaining the data structures for the decision tree itself. As attributes are chosen for splitting criteria associated with internal nodes of the decision tree, the host broadcasts the selected criterion to worker processors that use this information to partition training events prior to the recursive call to the algorithm at lower levels of the tree. 3.2. P a r a l l e l E v a l u a t i o n of C a n d i d a t e Splits In PDT, the evaluation of potential splits of the active training subset T proceeds ve.ry differently according to the type of attribute under consideration. We turn our attention first to the simpler case. Categorical Attributes The calculation of class frequency statistics for categorical variables is straightforward: each processor independently calculates partial frequency sums from its local data and forwards the result to the host processor. For an n-class training set, each of the a categorical attributes xi that remain under consideration will contribute n , D(xi) values to the host processor (where D(xi) again denotes the cardinality of attribute xi). These intermediate results are combined at the host processor, which can evaluate the required "measure of goodness" H. Each PE requires O ( m / p , a) time to complete the instance-count additions for its data partition; the information gain calculations (still computed on a single processor) remain the same. Communication between the host and PEs is now required, restricting the potential speedup to less than optimal (we do not consider effects of hierarchical memories that may lessen the penalties of the host/PE communication). Because the host processor is responsible for combining the partial frequency sums obtained from each PE, no communication between individual PEs is required for these tasks.
Continuous Attributes Continuous attributes pose at least two challenging problems for a data-parallel implementation. First, as in the serial implementation, we have no a priori knowledge of the candidate splits present in the training set. Since the data is distributed among the PEs, a parallel sort is required to allow a scan and update of the thresholds if we adhere strictly to the serial version of the algorithm. Although PRAM formulations of distributed sorts have been described that exhibit a parallel run time of O(log2N), implementations on more realistic parallel models are far less scalable and can vary depending on run-time conditions. Even if a distributed sort is available, the subsequent frequency count update step across PEs is not fully parallelizable due to dependencies on frequency counts from
preceding PEs in the sorted sequence. Second, it is likely that the calculation of information gain associated with all possible thresholds for continuous attributes will consume much more time than for categorical attributes if we concentrate all of the information gain calculations at the host processor. By following a different approach in the case of continuous attributes, we can significantly reduce the time complexity associated with these attributes while still evaluating all possible thresholds. The key observation is that it is not necessary to produce a single sorted list of all training examples. As mentioned earlier, we are only interested in frequency statistics gathered from the data; sorting merely enables gathering of these frequencies in a single pass. A second observation is that, while the calculation of information gain for categorical attributes is most conveniently done at the host processor, we would do better to evaluate all m potential gain calculations for continuous attributes in parallel at the level of the worker PEs. The solution is to incorporate a three-phase parallel algorithm as shown in Figures 5 and 6. The strategy for evaluating candidate splits associated with continuous attributes in PDT can be summarized as follows (a sketch of the systolic exchange appears after this list):

1. Local phase. Perform p concurrent sorts of the partitions of data local to each PE. As in the serial ID3 implementation, determine the frequency statistics for each (local) candidate threshold as usual. Note that ID3 chooses data values present in the training set for candidate thresholds, while other algorithms choose midpoints between data values; either approach is suitable in this framework.

2. Systolic phase. p - 1 exchanges of each PE's local thresholds and associated frequency statistics ensue. Each PE receives the thresholds and frequencies calculated in step 1 from its neighbor; as these are already sorted, they can be merged with the current partial sums in a single pass. After all p - 1 exchanges have occurred, all PEs contain the frequency statistics required for the information gain calculations, which are then performed locally within each processor.

3. Reduction phase. Each PE determines the "best" candidate within its assigned subset. The candidate threshold and associated information gain value are sent to the host processor from all p workers; the host selects the best threshold and updates the decision tree once all requisite gain calculations are obtained for all candidate attributes.
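To make the systolic phase concrete, the following C fragment sketches the merge performed when a block of thresholds arrives from the neighboring PE. The names are invented and the nearest-neighbor send/receive primitives are deliberately left out; at each of the p - 1 steps a PE would merge the block it has just received and then forward that same block to its other neighbor. The merge assumes, as in ID3, that the candidate thresholds of a partition are exactly the attribute values occurring in it:

    #define NC 3                      /* number of classes */

    typedef struct {
        double thr;                   /* candidate threshold value                    */
        long   le[NC];                /* per-class count of that PE's examples <= thr */
    } Thresh;                         /* counts are cumulative within the sending PE  */

    /* Merge a neighbor's sorted threshold block into our running global counts:
       for each of our thresholds, add the neighbor's cumulative counts taken at
       the largest neighbor threshold that does not exceed ours.  Both lists are
       sorted, so a single pass with two indices suffices.                        */
    static void merge_block(const Thresh *mine, int n_mine,
                            const Thresh *theirs, int n_theirs,
                            long glob_le[][NC])
    {
        int j = -1;                   /* last neighbor threshold <= mine[i].thr */
        for (int i = 0; i < n_mine; i++) {
            while (j + 1 < n_theirs && theirs[j + 1].thr <= mine[i].thr)
                j++;
            if (j >= 0)
                for (int c = 0; c < NC; c++)
                    glob_le[i][c] += theirs[j].le[c];
        }
    }

Before the first exchange, glob_le holds the PE's own local counts; after all p - 1 merges it holds, for every local threshold, the class totals over the entire training set, from which the information gain can then be computed locally.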
Ordinal Attributes and Example Compression

An important distinction between thresholds and examples should be noted. As depicted in Figures 5 and 6, it may appear that processors executing the PDT algorithm must exchange the entire active set of examples during the systolic phase of the algorithm. In fact, what must be shared are not examples, but thresholds. In the case of continuous (real- or integer-valued) attributes, these may be identical. However, in the case of ordinal attributes, it is possible that the domain of the attribute is far smaller than the number of representative examples of that attribute in the data set. More precisely, for an ordinal attribute that may assume one of d values, we can reduce the amount of data that must
[Figure 5 (worked example panels): SORT, PARTIAL FREQUENCY CALCULATION, SYSTOLIC SHIFT.]
Figure 5. Parallel formulation of information gain calculation for continuous attributes. Values of continuous attributes for each example are labeled Xi; associated classes are labeled C(Xi). S(Xi) indicates a sorted list of attribute Xi. L(S(Xi)) denotes the frequency count of examples less than or equal to the candidate threshold; a similar count is maintained for examples greater than the threshold. For clarity, the algorithm is shown as applied to a two-class problem; extension to multi-class training sets is straightforward.
[Figure 6 (worked example panels): LOCAL GAIN CALCULATION, GLOBAL REDUCTION OF GAIN.]
Figure 6. Parallel formulation of information gain calculation for continuous attributes (cont'd). Lg(S(Xi)) contains the accumulated global frequency counts for the thresholds S(Xi). After the (p - 1)st shift, the information gain H(S(Xi)) can be calculated locally. In the final stage, each PE submits its local "best" threshold (indicated by shaded boxes in the lower figure) to the host processor, which selects the "winner" from the p candidates.
be communicated between processors during each step of the systolic phase of PDT by a factor of 1 - dp/m. This factor represents the amount of example compression, which can contribute greatly to improved performance of the algorithm in practice. For example, for the sleep data described in Section 4 (ordinal attributes of cardinality d = 11 and roughly 12,800 examples per PE on eight processors), only 11 thresholds per attribute need be exchanged at each step rather than thousands of examples. Note that an alternative approach for the treatment of ordinal attributes would likely produce superior performance improvements, both for sequential and parallel implementations of ID3 [8]. In this approach, class membership totals for ordinal attributes are gathered as if categorical (requiring no sorting of individual examples), after which the information gain associated with each binary split is calculated. The current version of PDT does not implement this strategy, instead treating ordinal and continuous attributes identically.
3.3. Training Set Fragmentation

The recursive partitioning strategy of divide-and-conquer decision tree induction inevitably transforms a problem well suited to data-parallel execution into a subproblem in which the overhead of parallelism far outweighs the potential benefits of parallel execution. At each internal node of the tree, the number of available examples that remain "active" decreases according to the chosen split; this is referred to as training set fragmentation. At the same time, the overhead associated with parallelism does not decrease; it remains proportional to the number of processors p (recall that the systolic phase of PDT requires p - 1 steps). This overhead can be expected to quickly dominate the processing time, particularly in situations where the training set is evenly split at each internal node. Early experiments with PDT showed that parallel execution required an order of magnitude more processing time than serial execution on identical training sets; virtually all of the additional time was consumed by communication of small subsets of examples at lower nodes of the decision tree. The simplest remedy for coping with the effect of training set fragmentation on parallel execution is to monitor the size of the subset of training examples that remain active at each step of the induction process. When this subset reaches a user-selected minimum size threshold, the remaining examples are collected at the host processor, which assumes responsibility for further induction associated with the example subset. Parallel execution is suspended until the host processor has induced the complete subtree, after which all processors resume execution with the complement of the fragmented set of examples. This approach is used in the current implementation of the algorithm and is discussed further in Section 4.
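A minimal sketch of this cut-over logic follows. The names are invented for the example (min_frag is the user-selected threshold) and the helper routines merely stand in for the corresponding pieces of a PDT-like implementation; this is not the published code:

    /* Hypothetical illustration of the fragmentation cut-over. */
    typedef struct node Node;

    void gather_at_host(Node *subtree_root, int n_active);    /* collect survivors    */
    void induce_serial(Node *subtree_root);                   /* plain serial ID3     */
    void evaluate_splits_in_parallel(Node *subtree_root);     /* Sections 3.1 and 3.2 */
    int  i_am_host(void);

    void induce_node(Node *node, int n_active, int min_frag)
    {
        if (n_active <= min_frag) {
            /* Too few active examples to justify the (p-1)-step systolic phase: */
            /* ship them to the host and finish this subtree there, while the    */
            /* workers continue with the complement of the fragmented subset.    */
            gather_at_host(node, n_active);
            if (i_am_host())
                induce_serial(node);
            return;
        }
        evaluate_splits_in_parallel(node);
        /* ...partition the active examples by the selected split and recurse... */
    }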
3.4. Combining Data-Parallel and Attribute-Based Parallelism

While the basic PDT algorithm provides an approach for data-parallel decision tree induction, the overhead associated with communication can clearly be substantial. Specifically, the systolic phase of the PDT algorithm requires (p - 1) · k communication steps to collect all the information required before determining the most promising split, so that an increase in the number of processors and/or the number of attributes causes a corresponding increase in the time required at each node of the decision tree. An extension to the basic PDT algorithm involves a combined data-parallel/attribute-based approach in which the pool of available processors is divided into j subsets, called processor groups, each responsible for evaluating the potential gain for k/j attributes, concurrently executing j independent instances of the PDT algorithm (note that, when j = p, this strategy is effectively a pure attribute-based decomposition). For induction problems where both a
significant amount of compute power (in terms of available processors) and a moderate-to-large problem dimensionality (in terms of attributes) is present, such an approach offers a solution that may ameliorate the problem of increased communication costs.

4. Experiments

To evaluate the performance of PDT on representative data sets, the algorithm was implemented in ANSI C using message-passing for inter-processor communication. In order to conduct experiments that would permit evaluation under differing architectures, two compute platforms were chosen. The first is a workstation cluster consisting of eight Hewlett-Packard (HP) 9000-735/125 workstations, each configured with 128 MB of memory. The workstations are interconnected with a 100 Mb/sec fiber optic network based on an ATM switch. The second platform is a Silicon Graphics (SGI) Challenge multiprocessor with 8 MIPS R4400 processors and 1 GB of memory. The message-passing library used was the Parallel Virtual Machine (PVM) software, a freely available package for programming heterogeneous message-passing applications from the University of Tennessee, Oak Ridge National Laboratory, and Emory University [7]. Although PVM is most frequently used as a message-passing layer utilizing UDP/TCP or vendor-supplied native communication primitives, recent enhancements support message passing on shared-memory multiprocessors using IPC mechanisms (shared memory segments and semaphores) to increase efficiency. The shared-memory version of PVM was used for the experiments conducted on the SGI. The application was programmed in a single-program, multiple-data (SPMD) style in which a distinguished processor acts both as "host" and "worker", while the remaining processors perform the "worker" functions exclusively. No particular optimizations were applied to reduce the communication overhead of this software except for specifying direct TCP connections between communicating tasks through the PVM PvmRouteDirect request. The application can run either in single-processor mode or in parallel. Care was taken to avoid executing "useless" portions of the application in the single-processor case so as not to penalize serial performance with parallel bookkeeping and overhead.
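The SPMD arrangement described above (one task doubling as host and worker, the rest as pure workers, with direct routing requested) is not shown in the paper; the following PVM sketch illustrates what such a setup typically looks like. The executable name "pdt" and the worker count are placeholders invented for the example, not values taken from the PDT implementation:

    #include <stdio.h>
    #include "pvm3.h"

    #define NWORKERS 7                 /* p-1 additional worker tasks: 8 PEs in total */

    int main(void)
    {
        int tids[NWORKERS];
        int mytid  = pvm_mytid();              /* enrol this task in PVM               */
        int parent = pvm_parent();             /* PvmNoParent if we were started first */

        printf("PVM task id: %x\n", mytid);
        pvm_setopt(PvmRoute, PvmRouteDirect);  /* request direct task-to-task routing  */

        if (parent == PvmNoParent) {
            /* First task: plays both the "host" and one "worker" role (SPMD style) */
            /* and spawns the remaining workers from the same executable.           */
            pvm_spawn("pdt", (char **)0, PvmTaskDefault, "", NWORKERS, tids);
            /* ...distribute the training-set partitions, then run the induction... */
        } else {
            /* Spawned copy: pure worker role.                                      */
            /* ...receive a partition from the host, then run the induction loop... */
        }

        pvm_exit();                            /* leave PVM before terminating        */
        return 0;
    }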
Data Sets

Two data sets were used in the experiments. The first is from a biomedical domain, specifically the study of sleep disorders. In this data set, each example represents a "snapshot" of a polysomnographic recording of physical parameters exhibited by a patient during sleep. The continuous measurements are divided into thirty-second epochs, so that a full sleep study of several hours in duration produces several hundred examples per study. Over 120 individual sleep studies were available, ranging in size from 511 to 1051 examples. These studies were combined into a single data set of 105,908 examples, 102,400 of which were used for parallel benchmarks. Each example consists of 13 ordinal attributes of cardinality 11 and a class label of cardinality 5. The task is to identify the sleep stage (i.e., awake, light/intermediate/deep sleep, or rapid eye movements). For a more complete description of this domain and the classification task, see [21,2]. The second data set, constructed specifically for these experiments, is a synthetic data set (SDS) consisting of 6 continuous attributes and a class label of cardinality 3. Three of the
attributes are relevant to the classification task; the others are irrelevant. Classification noise was introduced in 15% of the examples (i.e., in 15% of the training set, the class label was chosen randomly). A total of 1 million synthetic examples were generated for this training set.
Baseline Timings

PDT was compared with three implementations of the ID3 algorithm (one public domain, two commercial) to benchmark the sequential run time of PDT. Although the standard release of C4.5 was not the fastest, a minor modification to the C4.5 source resulted in run times consistently faster than the other implementations and approximately equal to PDT (for splits associated with continuous attributes, C4.5 performs a scan to determine the largest value present in the training set that lies below each split, so that reported threshold values are actually represented in the training set; the modification removes this scan, resulting in splits that correspond to midpoints as in CART; the resulting tree structure is unchanged). Table 1 summarizes the results on the experimental data sets. It appears that neither PDT nor modified C4.5 holds a clear advantage in execution time; for the sleep data set PDT required approximately 15% less time to complete, while for the synthetic data set, C4.5 showed an improvement of nearly 6% over PDT.
Table 1
CPU time comparison (in seconds) of C4.5 and PDT.

Data Set     Training Set Size    C4.5     C4.5 (modified)    PDT
sleep        105,908              2876     155                133
synthetic    1,000,000            13652    2342               2480
Speedup and efficiency are metrics commonly used to evaluate the performance of parallel algorithms and/or implementations [9]:

    S(p) = T1 / Tp        E(p) = S(p) / p
Speedup (S) is defined as the ratio of the serial run time (T1) of an application to the time required for a parallel version of the application on p processors (Tp). We distinguish between apparent speedup, which measures the speedup of a given parallel application with respect to the same application run on a single processor, and absolute speedup, which measures the speedup of a parallel application with respect to the best-known sequential implementation. Efficiency (E) measures the effective use of multiple processors by an application as a fraction of run time, so that an efficiency of one indicates ideal parallel execution. For example, the eight-processor HP run reported in Table 2 below gives S(8) = 134/88 ≈ 1.52 and E(8) = 1.52/8 ≈ 0.19. Based on the similar execution times shown in Table 1, speedup and efficiency measures are reported with respect to the serial run time of PDT.
[Figure 7 plot: CPU time (seconds) versus training set size (12,800 to 102,400); curves for Total, Local, Sort, and Other.]
Figure 7. Single-processor (HP) benchmark for the sleep data set.
[Figure 8 plot: CPU time (seconds) versus training set size (250,000 to 1,000,000); curves for Total, Local, Sort, and Other.]
Figure 8. Single-processor (SGI) benchmark for the synthetic data set.
Figure 7 shows the single-processor run time of PDT on the sleep data set (HP workstation), varying the training set size from 12,800 to 102,400 examples. Similar results are shown in Figure 8 using the synthetic data set with the training set size varying from 250 thousand to 1 million examples (using the SGI machine, as the full synthetic data set could not be run on the HP workstation due to insufficient memory). For the purpose of both the sequential and parallel benchmarks (to follow), timing results are broken down as follows:

total  The total execution time measured on the host processor, which consistently requires the most CPU time due to the additional work assigned to the host after examples are gathered when the active examples in the training set fall below the minimum fragmentation threshold. This time includes the overhead of spawning p - 1 worker tasks on non-host processors.

local  The total time on the host processor for local operations such as collecting and updating frequency counts, calculating information gain for candidate thresholds, performing example compression, and updating the decision tree.

sort  Total time on the host processor executing the library routine qsort prior to determining frequency statistics for candidate thresholds.

other  Total time on the host processor doing all other tasks such as I/O, initialization of internal data structures, memory management, etc.
Results from parallel runs of PDT also include:

communication  Total time on the host processor spent sending and receiving threshold values and frequency statistics during the PDT systolic phase, broadcasting selected splits during induction, and receiving gathered attributes from worker processors after reaching the training set fragmentation limits.

As is evident from Figures 7 and 8, the majority of time is spent in local operations unrelated to sorting. A further breakdown shows that, for the sleep data set, 75% of the time spent in local operations is due to the counting of frequency statistics (47%) and the copying of data (28%) as a prelude to sorting. A similar breakdown for the synthetic data set reveals that the most time-consuming local operation is entropy calculation (46%), followed by data copying (24%), with frequency statistic counting requiring a considerably smaller percentage of time (12%). These differing rankings for components of local operations are primarily due to the nature of the data sets; recall that all attributes in the sleep data set have a limited domain and therefore require relatively few calculations to evaluate the partitioning heuristic. Surprisingly, sorting accounted for only 10% and 12% of total execution time for the sleep and synthetic data sets, respectively.

Parallel Benchmarks
Figures 9 and 10 display the best obtained times for the sleep benchmark for 1 to 8 processors on the HP and SGI. CPU times, speedups, and efficiencies are presented in Table 2. As noted on the figures, in both cases processor group attribute subsets
[Figure 9 plot: CPU time (seconds) versus number of processors (1, 2, 4, 8); curves for Total, Local, Communication, Sort, and Other.]

Figure 9. Benchmark results (HP) for the sleep data set (m = 102,400). Four- and eight-processor runs specified two and four processor groups, respectively.

[Figure 10 plot: CPU time (seconds) versus number of processors (1, 2, 4, 8); curves for Total, Local, Communication, Sort, and Other.]
Figure 10. Benchmark results (SGI) for the sleep data set (m = 102,400). Four- and eight-processor runs specified two and four processor groups, respectively.
Table 2
Speedup and efficiency of PDT (sleep data set, m = 102,400).

Processors    Machine    CPU Time    Speedup    Efficiency
1             HP         134
2             HP         120         1.12       0.558
4             HP         96          1.40       0.349
8             HP         88          1.52       0.190
1             SGI        133
2             SGI        112         1.19       0.594
4             SGI        101         1.32       0.329
8             SGI        95          1.40       0.175
provided the best timings. Interestingly, the efficiency of the workstation cluster (HP) on this benchmark was slightly superior to the multiprocessor (SGI) for 4 and 8 processors, with the workstation cluster requiring less time for all components of the computation except for communication. However, it is difficult to draw any clear conclusions from these tests; in practice, the time required to induce a decision tree from this data set is minimal, therefore the potential for gains through parallelism is quite small.
Table 3
Speedup and efficiency of PDT (synthetic data set, m = 500,000 for HP; m = 1,000,000 for SGI).

Processors    Machine    CPU Time    Speedup    Efficiency
1             HP         1629
2             HP         1227        1.33       0.664
4             HP         1072        1.52       0.380
8             HP         915         1.78       0.223
1             SGI        2480
2             SGI        1645        1.51       0.754
4             SGI        1288        1.93       0.481
8             SGI        1130        2.20       0.274
Figures 11 and 12 present the best obtained times for the synthetic data set benchmark for 1 to 8 processors on the workstation cluster and the multiprocessor. CPU times, speedups and efficiencies are presented in Table 3. As noted previously, the training set size is limited to 500,000 examples for the cluster due to memory constraints (PDT was run successfully with the full 1 million example training set on 4 and 8 processor configurations of the cluster; however, the timings are not reported because accurate speedups and efficiencies could not be presented). The results show improved efficiency (versus the smaller sleep data set) for all processor totals. Although the SGI timings in Figure 12 were obtained with processor groupings for attributes (as in the sleep benchmarks), the HP numbers shown use only a single processor group (strict data-parallel execution) to assist in un-
[Figure 11 plot: CPU time (seconds, up to about 1600) versus number of processors (1, 2, 4, 8); curves for Total, Local, Communication, Sort, and Other.]
Figure 11. Benchmark results (HP) for the synthetic data set (m = 500,000). All runs specified one processor group (default PDT algorithm).
[Figure 12 plot: CPU time (seconds, up to about 2500) versus number of processors (1, 2, 4, 8); curves for Total, Local, Communication, Sort, and Other.]
Figure 12. Benchmark results (SGI) for the synthetic data set (m = 1,000,000). Fourand eight-processor runs specified one and four processor groups, respectively.
[Figure 13 plot: CPU time (seconds, log scale) versus training set fragmentation threshold (log scale); curves for P = 1, 2, 4, 8.]
Figure 13. Effect of various training set fragmentation thresholds on CPU time (SGI) for 2, 4, and 8-processor runs using the synthetic data set (m = 1,000,000). For comparison, single-processor run time shown as horizontal line.
derstanding the extent of performance degradation if the attribute-based dimension of parallelism is not exploited. As can be seen in Figure 11, the pure data-parallel approach leads to rapidly increasing communication overhead, although these costs do not appear to dominate the total execution time until reaching 8 processors, at which point communication costs exceed those for local operations; improved interprocessor networks would allow data-parallel execution on greater numbers of processors. Figure 13 provides a closer look at the effect of various thresholds of training set fragmentation for 2, 4, and 8 processors. It appears that the optimal level for fragmentation (at least for this combination of hardware and problem definition) lies near the 1000-example threshold; choosing smaller values causes communication overhead to adversely affect CPU time, while larger values concentrate an excessive amount of CPU time at the host processor, which is responsible for inducing the subtree corresponding to the gathered examples. Another view of the algorithm's behavior is shown in Figure 14, in which the total communication requirements for various "configurations" of PDT are shown. The leftmost points shown correspond to execution with a single processor group (data-parallel execution); for the synthetic data set, the total communication volume is equivalent to sending the entire training set between processors over 30 times! Not unexpectedly, the volume is considerably less for the sleep data set, due primarily to the effects of example compression, as discussed in Section 3.2. The benefits of attribute-based execution are
[Figure 14 plot: total communication (bytes, log scale) versus number of processor groups; curves for the synthetic and sleep data sets with P = 2, 4, and 8.]
-- The condition part can be either a boolean expression or one of the special keywords initially or finally (see Figure 3). These keywords make it possible to define initial and final rules, which will be executed respectively on creation and on destruction of the cell. The action part is a sequence of instructions with a C-like syntax. The ParCeL-1 language is strongly typed. The available types are: int, double, boolean, symbol and cell registration: registration (formerly immat). For example, registration filter f in Figure 3 declares the f variable with the type: registration
Computation phase: cell outputs updating
Management phase: network topology modifications
Communication phase: broadcasted routing of the channel contents
Figure 1. Main stages of a ParCeL-1 cycle.
of cell of type filter. As shown in the example program below, cell registrations are used to refer to given cells (not unlike pointers), in order, for example, to connect or kill them.

3.2. Example of a ParCeL-1 program

In this paragraph, we explain the execution of a ParCeL-1 program with a simple example: the prime number generation program listed in Figure 3. When a program starts, a cell of type main is automatically created, which will be in charge of creating the other ones. In the cell computational model, new cells are created and connected by other cells. At each cycle, every existing cell selects and executes at most one of its rules. The selection is done as follows: the initial rule is executed on creation of the cell; the final rule is executed on death of the cell; otherwise the highest-priority rule is selected among those having their condition verified. Once a rule is selected, an additional primitive, called sub-rule, makes it possible to fix the evolution of the cell for several cycles (see [21]). Sub-rules are not used in the example program. The propagation from the output channels to the input channels is postponed until the communication phase at the end of the cycle. Thus, input channel values do not change during the computation phase (see Figure 1). A set of specific instructions, called requests, is used for cell network management. Three of these requests are sufficient to create any cell network topology: creation of a new cell, destruction of a cell, and connection between two cells, the cells being referred to by their registrations (see Table 1). The execution of the requests is postponed until the end of the computation phase of the cycle, so the cell network topology will not change during the computation phase (see Figure 1). The basic principle of the prime number program (see Figure 3) is to use a series of filters, each filter being initialized by a prime number, and filtering each number divisible by that prime. The filters are arranged in a pipeline (see Figure 2) and run concurrently. The main cell is a generator of integers, output on its new_int channel. A first filter cell is created in the initially rule of the main cell (line: f = create filter(2)), which will filter any integer divisible by 2. This filter will let odd numbers pass. The first non
[Figure 2 diagram: the main cell feeds a pipeline of filter(2), filter(3), filter(5), ... cells.]
Figure 2. Communication graph of the prime number program.
    #include
    #define MAX_INT 1000

    typecell filter (int init) {          /* cell type definition with one param   */
      in  int input;                      /* one integer input channel             */
      out int output;                     /* one integer output channel            */
      registration filter next;           /* one registration on a filter cell     */

      initially ==> {                     /* rule executed on creation of the cell */
        printi(init);                     /* printing of the init parameter        */
        printnl();                        /* new line                              */
        next = NULL;                      /* no next cell for now                  */
      }

      TRUE ==> {                          /* new rule, always executed             */
        if (input % init != 0)            /* if the input is not divisible by init */
          if (next != NULL)               /* and there is a next filter cell...    */
            output = input;               /* then input is transmitted to this cell*/
          else {                          /* if there is no next filter cell...    */
            next = create filter(input);  /* we create a new one which             */
            connect input of next         /* we connect to ourselves               */
              to output of self;
          }
        else
          output = 0;
      }
    }

    typecell main () {                    /* main cell type definition             */
      out int new_int;                    /* one integer output channel, new_int   */
      registration filter f;              /* a registration on a filter cell       */

      initially ==> {                     /* rule executed on creation of the cell */
        f = create filter(2);             /* creation of a filter cell...          */
        connect input of f                /* and connection to this main cell      */
          to new_int of self;
        new_int = 2;
      }

      new_int < MAX_INT ==> new_int += 1; /* increase new_int at each cycle        */

      new_int = MAX_INT ==> halt;         /* stop when MAX_INT is reached          */
    }
Figure 3. ParCeL-1 prime number program.
Table 1
Main available ParCeL-1 requests.

action              syntax
cell creation       <x> = create filter(2);
cell removal        kill <x>;
cells connection    connect <channel> of <x> to <channel> of <y>;
filtered number is three, thus three is a prime number, and a new cell is created that will filter all the numbers divisible by three. The process is iterative: each time a new filter cell is created, it prints the number it will filter, which is prime, and then filters all the multiples of that prime number. The communication pattern between the cells is shown in Figure 2. The output channel new_int of the main cell is connected to the input channel of the first filter cell. Then, the output channel of each filter cell is connected to the input channel of the next filter cell.
3.3. Parallel implementation of ParCeL-1

ParCeL-1 is implemented on a dedicated virtual machine: the Virtual Cellular Machine (VCM-1). This virtual machine implements three main operations. First, it manages the cyclic and synchronous functioning of all cells. Second, it executes all the cell requests and resolves possible conflicts. Third, it manages all the communications needed by ParCeL-1 on an MIMD (Multiple Instruction, Multiple Data) architecture, i.e. the routing of the requests and channel contents. Thus, in order to port ParCeL-1 to a new architecture, only the virtual machine needs to be re-implemented, using the communication instructions available on this architecture.

4. RELATED LANGUAGES

This section situates ParCeL-1 among several kinds of programming languages it is closely related to: concurrent object-oriented languages, actor languages, and connectionist languages.
Concurrent object-oriented languages

Object-oriented programming was initially developed without any relationship to parallel programming models. The main distinctive feature of object-oriented languages is inheritance, not concurrency. However, object programming introduces a natural partitioning of data and tasks, so that the concept of an object provides a good metaphor for parallel computing models. Therefore parallel extensions have been proposed for most object-oriented languages (see for instance [17,9,3,6]). In concurrent object-oriented languages, the concept of inheritance as such is completely independent from the semantics of parallelism. Nearly all models based on concurrent objects use asynchronous execution models, i.e., models where computation and message passing are not synchronized. Consequently,
communication has to be synchronized explicitly and each transaction between two objects must be implemented as a separate communication. Mailboxes or message queues have to be managed by the underlying system. As a result, it is not easy to implement efficient communication, and concurrent object systems are often restricted to coarse-grain parallelism, for performance reasons. As an additional result, concurrent object programming, while finding increased acceptance for implementing distributed systems over wide area networks [5], is still seldom used in massively parallel high performance computers. ParCeL-1 was designed with high performance computing in mind. It does not provide any inheritance mechanism. However, since objects in ParCeL-1 are statically typed, it would be feasible to extend it with inheritance functionalities similar to those of compiled object languages such as C++.
Actor languages

Actor-based programming traces back to the computational model proposed by Hewitt [10] and later improved by Clinger and Agha [1]. Actors are conceived as active objects, i.e., objects with their own resources and activities. Actor languages may or may not provide inheritance. When available, inheritance does not directly influence the way concurrency is handled. Most actor languages provide not only inter-object parallelism, i.e., concurrency between different objects, but often also intra-object parallelism, i.e., objects are able to process several requests simultaneously [1]. In the ACT language [20], every single computation (for instance the addition of two numbers) implies transactions between several actors. The communication and synchronization protocols of actor languages may prove tricky to implement efficiently on multiprocessor machines; as a matter of fact, actual multiprocessor implementations of actor languages appeared only after 1990. The HAL language, based on the CHARM virtual machine [12], is a system in the lineage of the work of Hewitt, Clinger and Agha. The successive variants of the ABCL language [22] were developed by the team of A. Yonezawa. In ABCL, several conservative design choices were made in an attempt to facilitate an efficient parallel implementation. ABCL/1 is a parallel extension of Common Lisp. Simple data structures and operations are implemented in Lisp without involving any actor-based mechanism. Similarly, an actor can process only one request at a time; however, this processing may be interrupted to switch to another task. Several versions of ABCL actually exist as multiprocessor implementations. ParCeL-1 is similar to actor-based languages, but with a synchronous computational model. Several multiprocessor implementations of ParCeL-1 are fully operational. The underlying VCM-1 virtual machine has been available since 1992, and the first implementation of ParCeL-1 itself since 1994.
Connectionist languages

Three kinds of tools can be used to program neural networks: simulators, dedicated languages, and general purpose languages. The easier a tool is to use, the more restricted it is. In this paragraph we compare ParCeL-1, viewed as a connectionist language, to other such languages. A language can be more or less versatile, according to the variety of networks that can be built and the variety of agents that can compose the network. Some languages, such as
Aspirin [15] or NeuDL [19], implement only one kind of model (usually back-propagation networks). Other languages, such as NIL [4], propose only one type of elementary agent (the neuron). ParCeL-1 is completely unconstrained, that is, any kind of neural network can be implemented using any kind of basic agents. Synchronization mechanisms also differ from one language to another. In CuPit [18] or Condela-3 [2], a central agent applies a given function to a given set of agents (e.g. neurons) at each cycle. NIL's agents [4] follow a CSP-like [11] rendez-vous mechanism, and are activated when they receive data. ParCeL-1's agents are activated at each cycle; thus, when necessary, synchronization must be implemented explicitly. For example, it is possible to create a manager cell connected to and controlling the other cells. Finally, parallelism is another key issue: connectionist applications are very demanding in computing power, but parallel languages are still rare. CuPit [18] relies on SIMD (Single Instruction, Multiple Data) parallelism on MasPar architectures, by triggering several identical agents at the same time. ParCeL-1 is implemented on MIMD architectures, and different agents can perform different computations at the same time.

5. APPLICATION PROGRAMMING IN ParCeL-1
In this section, we describe several examples of ParCeL-1 programs: first some connectionist applications, then some general numeric computation applications, and finally a tree search application. We conclude with a set of possible methods to write programs in ParCeL-1.

5.1. Kohonen self-organizing feature map

As an extended programming example, we will explain here the implementation of a Kohonen self-organizing map [13]. Among the neural networks using unsupervised training, the Kohonen self-organizing map is the best known. Its applications are in the field of vector quantization and mapping of multi-dimensional spaces. A Kohonen map is a network of N neurons, usually arranged on a two-dimensional grid. With each neuron is associated a weight vector of a length equal to the number of inputs of the neural network. Training is performed iteratively. At each iteration t, an input vector is chosen randomly from the training set. For each input, its distance from each neuron's weight vector is measured. Then the neuron with the smallest distance (the winner neuron) is determined and all weights are updated according to a given learning law. The ParCeL-1 implementation of a Kohonen map uses two types of cells (Figure 4): N cells of type neuron and one supervising cell. The supervising cell first creates the neuron cells, using the create instruction. It is then responsible for inputting training vectors to the neuron mesh, getting back the corresponding neuron activations, finding the minimal output value (and thus the winner), and broadcasting to every neuron the identity of the winner neuron. Figure 5 shows the declaration of the neuron cell type, of which the neurons will be instances. Two parameters are passed to each neuron cell when it is created: the (i, j) coordinates of the neuron on the feature map (we choose a classical square-grid topology). The body of the declaration starts with variable declarations, followed by input and output channel
[Figure 4 diagram: a supervising cell connected to a grid of neuron cells; labels: activations, neural network inputs.]
Figure 4. Kohonen feature maps in ParCeL-1: implementation principle.
declarations. First come the input vector and the output value of the neuron. Then, additional input channels contain information coming from the supervisor cell, which is useful for training: the index t of the current iteration and the (i, j) coordinates of the current winner neuron. The actions performed by the neurons are specified in a list of action rules. The first rule, with condition initially, fires during the first cycle of the neuron's life. Its function is to initialize the weights to random values. Then a new rule, with condition TRUE, fires iteratively starting at the second cycle until the death of the cell. The execution of this rule spreads over three cycles, thanks to the subrule operator -->:

• first cycle: the distance between the input vector and the weight vector is computed; the result is sent to the supervisor cell.
• second cycle: an empty cycle to wait for the identity of the winner neuron, computed by the supervisor.
• last cycle: the Kohonen training formula is applied to the weight vector; the functions alpha and nhood have been defined beforehand as global functions.

To make the Kohonen program complete, a supervisor cell must now be built. For the sake of simplicity, the supervisor will be the main cell that is created at startup. Thus the whole program will comprise only two different cell types. Figure 6 shows the declaration of the main cell type. The input and output channels are the ones also appearing in the neuron cell type: the value of the input vector, the neuron output values (i.e. the distance between input and weight vectors), and information related to the training process (iteration index and identity of the winner neuron). At first (rule initially) the main cell creates all neuron cells and connects them to itself. The next rule fires iteratively as long as t is smaller than the desired number of iterations. A new input vector is first sent to the neurons. For this test program, a uniform random distribution is used as the training set. This is the usual type of input to test a Kohonen program. In a real application, we would read data from a file or from an array in memory instead. Once a new input has been sent, one empty cycle is
    typecell neuron(int my_i, int my_j) {
      in  double input[INPUTS];          /* the input vector                      */
      out double output;                 /* the output value of the neuron        */
      in  int t, winner_i, winner_j;     /* time, coordinates of the winner cell  */

      double weight[INPUTS];             /* the internal weights of the cell      */
      double pot, dist;                  /* temporary variables                   */
      int i;

      initially ==> {                    /* randomly initialize the weights       */
        for(i = 0; i < INPUTS; i += 1)
          weight[i] = frand();
      }

      TRUE ==> {
        for(i = 0, pot = 0.0; i < INPUTS; i += 1)
          pot += sqr(weight[i] - input[i]);   /* compute distances                */
        output = pot;                    /* output the result to the supervisor   */
      }
      --> { }                            /* wait until the winner's identity      */
                                         /* is computed by the supervisor cell    */
      --> {                              /* update the weights                    */
        for(i = 0; i < INPUTS; i += 1) {
          dist = abs(my_i - winner_i) + abs(my_j - winner_j);
          weight[i] += alpha(t) * nhood(t, dist) * (input[i] - weight[i]);
        }
      }
    }
Figure 5. The neuron cell source code of the Kohonen program
performed while the neurons compute the distances. Finally, during the third cycle, the neuron with minimal distance to the input vector is identified. When t is greater than the number of iterations, the last rule is selected, and the program is stopped. This program uses a particular programming method that can be called supervised: the processing of a set of cells (the neuron cells) is sequenced by a single master cell (the main cell). In the following, we will more briefly describe another application using this kind of programming method, as well as other applications using two other programming methods.

5.2. Temporal Organization Map

Temporal Organization Map (TOM) [8] is a temporal connectionist system that is used for speech recognition. The goal of this architecture is to detect sequences of patterns in a temporal phenomenon in a connectionist fashion. A set of super-units is used to encode the acoustic features of the speech, and is trained using a Kohonen-like algorithm. At the end of this training phase, each super-unit reacts to some particular acoustic event. To take into account the flow of acoustic events, each super-unit contains a set of units. During the learning phase, units are created and connected with each other into chains representing the succession of acoustic events. The training algorithm is robust enough to deal with the fuzziness and the variability of speech.
    typecell main() {
      out int t, winner_i, winner_j;              /* time, coordinates of the winner cell */
      out double proto[INPUTS];                   /* the input vector for the neurons     */
      in  double activation[WIDTH][HEIGHT];       /* activations of the neurons           */
      registration neuron neuron[WIDTH][HEIGHT];  /* addresses of the neuron cells        */
      int i, j, k;
      double smallest;                            /* current smallest distance            */

      initially ==> {                             /* creation of the neural network       */
        for(i = 0; i < WIDTH; i += 1)
          for(j = 0; j < HEIGHT; j += 1) {
            neuron[i][j] = create neuron(i, j);   /* create a neuron                      */
            /* perform every connection to the neuron                                     */
            connect winner_i of neuron[i][j] to winner_i of self;
            connect winner_j of neuron[i][j] to winner_j of self;
            for(k = 0; k < INPUTS; k += 1)
              connect input[k] of neuron[i][j] to proto[k] of self;
            connect activation[i][j] of self to output of neuron[i][j];
            connect t of neuron[i][j] to t of self;
          }
        t = 0;                                    /* initialize iteration counter         */
      }

      t < ITERATIONS ==> {                        /* compute new prototype (random distribution) */
        for(i = 0; i < INPUTS; i += 1)
          proto[i] = frand();
      }
      --> { }                                     /* wait for the neurons to compute distance */
      --> {                                       /* find the winner neuron               */
        smallest = MAX_FLOAT;
        winner_i = 0; winner_j = 0;
        for(i = 0; i < WIDTH; i += 1)
          for(j = 0; j < HEIGHT; j += 1)
            if(activation[i][j] < smallest) {
              smallest = activation[i][j];
              winner_i = i; winner_j = j;
            }
        t += 1;                                   /* this was one more iteration          */
      }

      TRUE ==> halt;                              /* otherwise, stop the program          */
    }
Figure 6. The main cell source code of the Kohonen program.
Figure 7. Cellular network for the TOM program
TOM has an intrinsically parallel functionality: each unit and each super-unit can be updated simultaneously, thus TOM's parallelism can be easily expressed in ParCeL-1. The first implementation possibility consists of creating one cell type for the units, and one cell type for the super-units. However, the units are very small processing elements and are strongly interconnected with each other; thus, it is faster and more efficient to represent them as data structures in the super-units, rather than as independent cells. Therefore, the implementation of TOM in ParCeL-1 involves only two types of cells: a supervisor cell that is in charge of sending the inputs and collecting the results, and the super-unit cells, in charge of the actual processing (see Figure 7). TOM uses the same kind of programming method as the Kohonen program: the processing of a set of computing cells is managed by a single supervisor cell.

5.3. General numeric computation

Many scientific computing applications may be expressed as the iterative computation of a set of variables: each new value of a variable is a function of the variables (or a subset of them) at the previous iteration. Two examples have been implemented: the Jacobi relaxation and the N-body simulation. The Jacobi relaxation program [16] iteratively solves the Laplace differential equation:
∂²v/∂x² + ∂²v/∂y² = 0

using a finite difference technique. An application is, for instance, the computation of the voltage v(x, y) at any point (x, y) of a two-dimensional conducting metal sheet. The N-body program [16] is a simulation of the trajectories of N weighted point objects in three-dimensional space. Each object has an instantaneous position, speed and acceleration, and its trajectory is influenced by the positions of the other bodies, due to a long-range interaction (typically gravitational or electro-magnetic interaction). The computational model of ParCeL-1 makes it well suited for the implementation of such algorithms. Typically, each cell is responsible for periodically calculating one or several variables, using the output values of its partner cells that are in charge of one or several other variables. The program essentially uses one type of cell, and as many cells as
Figure 9. Cellular network for the N-queens program
the number of subsets of variables have to be created (see Figure 8). These applications use another kind of programming method than the Kohonen and TOM programs: the cells compute concurrently without supervision, and return their results after a predetermined number of cycles. This kind of programming method can be called iterative programming.

5.4. Tree search

As an example of tree search, which is a fundamental algorithm used in AI, we have implemented the N-queens problem. Solving this problem consists of exploring a tree in which the nodes are positions on the chess board. On the first level of the tree, only one queen is on the board; on the second level, two queens are on the board, etc. The basic principle for implementing this kind of algorithm in ParCeL-1 is to divide the tree into several branches, and process the branches concurrently. In the case of the 8-queens problem, the tree can easily be divided into 8 branches, each branch fixing a particular position for the first queen. In ParCeL-1 this is implemented using 8 cells of a single cellular type that will each process one of the 8 branches of the tree. This division can go further, by fixing more levels of the tree, for example the first two, that is, the first two queens. Then, 8 × 7 branches can be developed concurrently (see Figure 9). We have here yet another kind of programming method: each cell processes its own branch of the tree without regard to what the other cells do. The termination of each cell is independent of all the others: we can call this kind of programming independent processing.
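The branch decomposition itself is easy to express in ordinary C; the following sketch (names invented, not taken from the ParCeL-1 sources) fixes the first two queens and counts the solutions of one branch, which is essentially the work a single cell would perform:

    #define N 8

    /* Count the solutions of one branch of the search tree: the queens in rows
       0..row-1 are already placed; the remaining rows are filled by ordinary
       backtracking.                                                            */
    static int count_branch(int col[N], int row)
    {
        if (row == N)
            return 1;
        int count = 0;
        for (int c = 0; c < N; c++) {
            int ok = 1;
            for (int r = 0; r < row; r++)
                if (col[r] == c || col[r] - r == c - row || col[r] + r == c + row)
                    ok = 0;
            if (ok) {
                col[row] = c;
                count += count_branch(col, row + 1);
            }
        }
        return count;
    }

    /* One cell's work: a branch is identified by the columns of the first two queens. */
    int solve_branch(int c0, int c1)
    {
        int col[N];
        col[0] = c0;
        col[1] = c1;
        if (c0 == c1 || c0 - c1 == 1 || c1 - c0 == 1)
            return 0;                    /* the two fixed queens already attack each other */
        return count_branch(col, 2);
    }

Summing solve_branch over the 8 × 7 placements of the first two queens in distinct columns gives the usual 92 solutions of the 8-queens problem; in the ParCeL-1 version each such branch is handled by its own cell, and the cells terminate independently.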
Table 2
Performance of several ParCeL-1 programs on T-node with 9 processors vs ParCeL-1 and C versions on 1 processor.

Test program                                   basic speed-up:              speed-up vs C:         efficiency:
                                               T_ParCeL-1(1)/T_ParCeL-1(9)  T_C(1)/T_ParCeL-1(9)   (speed-up vs C)/9
Jacobi relaxation (30 x 30, 5000 iter.)        4.2                          1.5                    17%
N-body (5000 iter.)                            5.0                          4.1                    45%
Kohonen, 1 neuron/cell
  (900 neurons, 5 inputs, 1000 iter.)          6.0                          2.3                    25%
Kohonen, 25 neurons/cell
  (900 neurons, 5 inputs, 1000 iter.)          7.2                          4.4                    49%
N-queens (N = 13, 132 cells)                   8.6                          7.1                    79%
5.5. Conclusion: ParCeL-1 programming overview

We have given some examples of possible programming methods that can be used in ParCeL-1 programs: supervised (the case of the Kohonen and TOM programs), iterative programming (the case of the numeric computation programs), and independent processing (the case of the N-queens program). These methods can be combined; we have developed an application called Resyn [14], which implements a hybrid symbolic-connectionist database and interacts with the user by means of a command interpreter. This command interpreter acts as a global supervisor for the program, and can sometimes order an iterative relaxation phase. Resyn also implements a delegation mechanism [1]: when the command interpreter receives a read-file instruction, the parsing of the file is delegated to a specialized cell. Resyn emphasizes the versatility of ParCeL-1 and its computational model, since several kinds of programming methods can coexist in the same program.

6. APPLICATION PERFORMANCES
In this section, we present some measurements we have collected for several of the applications presented above, and we suggest a few guidelines for writing efficient ParCeL-1 programs.
6.1. Methodology

The performance measurements we obtained are shown in Table 2. In each case, we compared the execution time of the ParCeL-1 program on a T-node machine with 9 processors to the execution time on one processor of either the ParCeL-1 program or the corresponding C program. This last comparison is more significant for the user, because the language giving the best execution times for these applications on sequential computers is C. The speed-up is the sequential execution time divided by the parallel execution time, and the efficiency is the speed-up divided by the number of processors, that is, the fraction of the processors' computational power actually used.
6.2. Experimental results and programming guidelines

In order to obtain efficient parallel programs, two conditions should be observed. First, the load balance of the processors must be as good as possible: because of the cyclic
[Figure 10 plot: execution time versus number of cells (1 to 10,000, logarithmic scale) for the 12-queens program.]
Figure 10. Optimum number of cells for the 12-queens program on T-node with 9 processors: close to 1000
computational model, the least loaded processors will have to wait for the most loaded processor to complete its computation phase before a new cycle can start. Thus, a good load balance is essential. Second, the overhead due to cell management and the communication time must be minimized. Two parameters are important to meet these conditions: the number of cells, and the granularity of the cells. In order to meet the first condition, the number of cells must be at least equal to the number of processors, to install one cell on each processor. If all the cells are identical, then the load balance is good. More generally, that is, if several kinds of cells exist in the program, a greater number is advisable to ensure a good statistical distribution of the cells over the processors; ten times the number of processors seems to be a minimum. For example, Figure 10 shows that, on a T-node with 9 processors, the optimal number of cells in the case of the 12-queens algorithm is close to 1000 (100 times the number of processors). However, the performance obtained with only 100 cells (10 times the number of processors) is already very close to the optimum. The performance degradation beyond the optimum number of cells is explained below. In order to meet the second condition, it is necessary to create cells with a granularity (essentially their computation time) that is large enough compared both to their communication time and to the cell management overhead due to ParCeL-1 itself. If this condition is not met, the cell management time or communication time will be large compared to the computation time, and the overall performance will decrease. In the case of the 12-queens program (Figure 10), the performance decreases when too many smaller and smaller cells are active. In the case of the Kohonen program, it is necessary to associate more than one neuron with each cell to get cells of a sufficiently coarse grain of parallelism, and thus to improve the performance. Table 2 shows different results for the Kohonen program according to the granularity of its cells, that is, the number of neurons
Table 3
Performance of the 15-queens program on Cray T3D and Intel Paragon.

Test program                     basic speed-up:              speed-up vs C:         efficiency:
                                 T_ParCeL-1(1)/T_ParCeL-1(n)  T_C(1)/T_ParCeL-1(n)   (speed-up vs C)/n
15-queens on T3D, 8 proc.        7.8                          6.8                    84%
15-queens on T3D, 32 proc.       29.7                         25.7                   80%
15-queens on T3D, 128 proc.      109.3                        89.2                   70%
15-queens on T3D, 256 proc.      205.7                        167.8                  66%
15-queens on Paragon, 8 proc.    7.9                          7.1                    88%
15-queens on Paragon, 32 proc.   29.8                         26.8                   82%
a given cell is in charge of. The Kohonen program with one neuron per cell is the one presented above. Grouping several neurons per cell basically consists of adding a loop in one of the rules of the neuron cell, and does not result in a much more complex program. The source code for this Kohonen program is very close to the one we showed. Besides, the Jacobi relaxation program also has too fine a granularity, hence poor efficiency, but the granularity could be augmented as in the Kohonen program. Programs conforming to these rules (optimal number of cells and optimal granularity of the cells) show rather good performance: the Kohonen program with 25 neurons per cell and the N-body program show a speed-up close to 4.5 on 9 processors, that is, an efficiency close to 50%. The N-queens program also supports these rules, and its cells communicate very little, hence the excellent speed-up of 7.1 on 9 processors. These results have been obtained on a computer based on T-800 processors. These processors were released before 1990, and are now aging. Therefore, even if the speed-ups are good, the execution times remain higher than on modern workstations (e.g. Sparcstation 20 or DEC Alpha). We have implemented ParCeL-1 on two state-of-the-art MIMD architectures: the Cray T3D and the Intel Paragon. This last implementation is very recent: optimizations and benchmarks are under way. Of course, on these architectures, the execution times are dramatically lower than on the T-node and on modern workstations: Table 3 gives some samples of execution times and speed-ups on the T3D and Paragon for the 15-queens program (generating 2184 cells).

7. CONCLUSION

We have presented a new language dedicated to AI programming. ParCeL-1 has proven its efficiency on several types of applications, on both the connectionist and the symbolic side of AI. ParCeL-1 is closely related to and benefits from many features of actor-based and connectionist-oriented languages. The performance tests we carried out so far on parallel implementations resulted in good speed-ups. Some small programs (e.g. N-queens) were written efficiently in a short time by students, indicating that one can easily master ParCeL-1 and its computational model. Finally, ParCeL-1 can be used for many types of programming, even if its preferred domain remains networks composed of small
computing elements, such as neural or semantic networks. Its versatility and its parallel implementation make it especially attractive as a connectionist language. From a parallel programming viewpoint, ParCeL-1 seems to be a good compromise between an automatic parallelization of the source code, still out of reach, and an architecture-dependent parallel style of programming. Compared to other concurrent object and actor-based systems, ParCeL-1 is more adapted to applications with very dense communication patterns, like neural network programs and other similar applications. A low-level language may give better results in terms of pure performance, but the masking of the parallel architecture and of the communication layers makes ParCeL-1 suitable for the quicker development of portable programs: a single ParCeL-1 program can be executed without modification on several multi-processor architectures. The ongoing work on the ParCeL-1 project follows several directions: extensive programming experiments including real-size applications, assessment and performance measurements of the parallel implementation, further development of the language itself to include higher-level functionalities, and porting to new multi-processor architectures.

REFERENCES
1. G. Agha. ACTORS, a model of concurrent computation in distributed systems. MIT Press, 1986.
2. N. Almassy, M. Köhle, and F. Schönbauer. Condela-3: A language for neural networks. In International Joint Congress on Neural Networks, pages 285-290, San Diego, 1990.
3. I. Attali, D. Caromel, and S. Ehmety. Une sémantique opérationnelle pour le langage eiffel//. In Journées du GDR Programmation, Grenoble, November 22-24, 1995.
4. A. S. Bavam. Nps: A neural network programming system. In International Joint Congress on Neural Networks, pages 143-148, San Diego, 1990.
5. J. P. Briot and R. Guerraoui. A classification of various approaches for object based parallel and distributed programming. Technical report, University of Tokyo and Swiss Federal Institute of Technology, 1996.
6. A. Chien, U. Reddy, J. Plevyak, and J. Dolby. ICC++: A C++ dialect for high performance parallel computing. Lecture Notes in Computer Science, 1049:76-??, 1996.
7. Thierry Cornu and Stéphane Vialle. A framework for implementing highly parallel applications on MIMD architectures. In J. R. Davy and P. M. Dew, editors, Abstract Machine Models for Highly Parallel Computers, Oxford Science Publications, pages 314-337. Oxford University Press, 1995.
8. S. Durand and F. Alexandre. Learning speech as acoustic sequences with the unsupervised model TOM. In NEURAP, 8th International Conference on Neural Networks and their Applications, Marseilles, France, 1995.
9. A. Grimshaw. Easy-to-use object-oriented parallel processing with Mentat. Computer, 26(5):39-51, May 1993.
10. C. Hewitt, P. Bishop, and R. Steiger. A universal modular actor formalism for artificial intelligence. In IJCAI-73, pages 235-245, 1973.
11. C. A. R. Hoare. Communicating Sequential Processes. Prentice Hall, 1985.
12. W. Kim and G. Agha. Compilation of a highly parallel actor-based language, pages 1-15. Lecture Notes in Computer Science 757. Springer-Verlag, 1993.
13. T. Kohonen. Self-Organization and Associative Memory, volume 8 of Springer Series in Information Sciences. Springer-Verlag, 1989.
14. Y. Lallement, T. Cornu, and S. Vialle. An abstract machine for implementing connectionist and hybrid systems on multi-processor architectures. In V. Kumar, H. Kitano, and C. B. Suttner, editors, Parallel Processing for Artificial Intelligence, 2, Machine Intelligence and Pattern Recognition Series, pages 11-27. Elsevier Science Publishers, 1994.
15. R. R. Leighton. The aspirin/migraines neural network software, user's manual. Technical Report MP-91W00050, MITRE Corporation, 1992.
16. Bruce P. Lester. The Art of Parallel Programming. Prentice Hall, 1993.
17. J. Pallas and D. Ungar. Multiprocessor Smalltalk: A case study of a multiprocessor-based programming environment. In Conference on Programming Language Design and Implementation, pages 268-277, Atlanta, June 1988.
18. L. Prechelt. CuPit, a parallel language for neural algorithms: Language reference and tutorial. Technical report, University of Karlsruhe, Germany, 1994.
19. S. J. Rogers. NeuDL: Neural-network description language. Available by ftp at cs.ua.edu, file /pub/neudl/neuDLver0.2.tar.gz, August 1993.
20. D. G. Theriault. Issues in the design and implementation of Act 2. Technical Report AI-TR-728, Massachusetts Institute of Technology, A.I. Lab., Cambridge, Massachusetts, 1983.
21. S. Vialle, T. Cornu, and Y. Lallement. ParCeL-1, user's guide and reference manual. Technical Report R-10, Supélec Metz campus, SUPÉLEC, 2 rue Edouard Belin, F-57078 Metz Cedex 3, November 1994.
22. A. Yonezawa, S. Matsuoka, M. Yasugi, and K. Taura. Implementing concurrent object-oriented languages on multicomputers. IEEE Parallel and Distributed Technology, 1(2):49-61, May 1993.
Yannick Lallement

Yannick Lallement was born in France in 1968. He obtained a master's degree in computer science in 1990 from the University of Metz, and a Ph.D. in Computer Science from the University of Nancy I in 1996. His research interests include parallel computation, hybrid connectionist-symbolic models, and cognitive modeling. He is currently a research scientist in the Soar group at Carnegie Mellon University.
Thierry Cornu

Thierry Cornu was born in France in 1966. He obtained an engineering degree from Supélec in 1988 and a Ph.D. in Computer Science in 1992 from the University of Nancy I. Since 1993, he has been a lecturer and a research scientist at the Computer Science Department of the Swiss Federal Institute of Technology (EPFL), first with the MANTRA Research Centre for Neuro-Mimetic Systems, and, since 1996, with the Parallel Computing Research Group of the EPFL. His research interests include parallel computation, performance prediction, neural network algorithms, their engineering applications and their parallel implementation.
Stéphane Vialle

Stéphane Vialle was born in France in 1966. He graduated from the institute of technology of Grenoble (IUT 1) in electrical engineering in 1987. He obtained an engineering degree from Supélec in 1990 and a Ph.D. in Computer Science in 1996 from the University of Paris XI. He has been a lecturer and research scientist at Supélec since 1990. His research interests include parallel languages and parallel algorithmics, and their application to multi-agent systems.
Parallel Processing for Artificial Intelligence 3
J. Geller, H. Kitano and C.B. Suttner (Editors)
© 1997 Elsevier Science B.V. All rights reserved.
AI Applications of Massive Parallelism: An Experience Report

David L. Waltz
NEC Research Institute, Princeton, NJ, and Brandeis University, Waltham, MA

For nearly ten years my group and I at Thinking Machines Corporation worked at selling massively parallel computers for a variety of applications that fall broadly in the area now called "database mining." We had an amazing team of scientists and engineers, saw trends far ahead of the rest of the world, and developed several great systems. However, we began as novices in the business arena. Sometimes we made sales, sometimes we did not; but we learned a great deal in either case. This chapter recounts the sales process and a brief history, mostly in the form of "war stories" mixed with technical details, and attempts to summarize some messages to take away, based on what we learned.

1. INTRODUCTION: WHAT IS DATA MINING AND WHY IS IT IMPORTANT?
Database mining is the automatic (or semi-automatic) extraction of information - e.g. facts, patterns, trends, rules, etc. - from large databases. Relevant basic methods for database mining include: statistics, especially various types of regression and projection pursuit; decision trees, CART (Classification And Regression Trees), and rule induction methods; neural nets; genetic algorithms and genetic programming; and memory-based reasoning (MBR) and other nearest neighbor methods. An ideal database mining system can identify interesting and important patterns quickly in current data, with little or no direct guidance from humans. For example, over a weekend a database mining system might collect sets of "factoids" and rules that characterize the behavior of customers or retailers, identify trends and likely causal factors, or find latent semantic categories in text databases.

Data mining and learning are important for two main reasons: 1) the explosive growth in on-line data, and 2) the costs of developing software products. Let me explain: the amounts of data that we must cope with are growing explosively - even faster than computing power per dollar. Moreover, few if any people have much intuition about what patterns are in (or responsible for) this data. Automatic processing - sifting, refining, searching for patterns and regularities, etc. - is necessary and inevitable: hand analysis and human hypothesis and model generation are too uncertain, too time consuming, and too expensive.

Since 1950, the cost of a transistor has fallen by eight orders of magnitude; the cost per line of (assembly code) software has fallen by at best one order of magnitude, perhaps not at all (see figure 1). Why such a discrepancy? The answer is that hardware is generated by a highly automated process that borrows from photolithography and other technologies, while software is still a labor-intensive process.
[Figure 1. Trends in Software and Hardware Costs from Software Developer's Perspective: cost per transistor, cost per line of code, and cost per small black & white TV, 1950-2010.]
Most of the gain in software (if there is one) is due to the availability of higher-level languages and better editing and debugging tools. The software we developed demonstrated that it is possible to break out of this cycle, and realize software productivity gains as dramatic as the gains in hardware cost-performance. Figure 2 shows actual results from two projects (two application systems built for the US Bureau of the Census, described in more detail below). First a confession - the exact cost per line of code could not be easily calculated, so the placement of the dotted line is somewhat arbitrary, though we believe that by using expert system technology - in effect a very high-level programming language - it was possible to develop the application for less cost than the usual cost per line of code. The critical point is that, by using learning and data discovery methods, we were able to build a "memory-based reasoning" system that learned the application at a much lower cost (4 person-months vs. 200 person-months) - and performed with significantly higher accuracy to boot. Note that the slope of the line connecting these two points parallels the slope of the transistor cost curve. It is an open question whether learning technologies will allow us to extend this trend; the answer to this question has huge economic consequences.

2. THE SALES PROCESS

I went to Thinking Machines Corporation at the beginning of 1984. (At the time I agreed to go in October 1983, there were only about 15 employees.)
[Figure 2. Strategy: Use Hardware Power to Reduce Software Engineering Costs: cost per line of code for the expert system and the memory-based reasoning system plotted against the cost per transistor, 1950-2010, with a projected cost trend for learned software.]
TMC's first hardware product - the 64,000-processor CM-1 - went on sale in 1985. While smaller machines were introduced later, all hardware offered by Thinking Machines was in the $1/2M - $20M range. As a consequence, sales took a minimum of 9 months from first contact, often two years or more. Typically 3-5 FTE people were involved for the entire sales cycle. Profitable prospects were thus limited to large organizations, and even then the Corporate Board would generally need to approve a purchase of this magnitude. (Thinking Machines also made a number of "sales" to universities and government laboratories. The university sales were always at steep discounts, and were not a significant source of revenue. Government sales were reasonably lucrative, but often had strings attached. While these sales would also make interesting stories, from here on in this chapter I will describe only commercial sales attempts.)

Virtually all commercial (and many other) sales fell into one of two modes: 1) benchmarking-based sales, or 2) novel functionality-based sales. In benchmarking-based sales, "technical marketing" staff ported customers' existing applications (or key kernels), and demonstrated the level of performance possible with TMC hardware. In novel functionality-based sales, our R&D staff wrote novel, speculative software prototypes, exploiting the power and features of TMC's hardware, and we then attempted to convince customers to adopt proprietary software embodying the new functionality, which required buying our hardware.

Virtually all the stories in this chapter are of the novel functionality type. The goal in this type of sale is daunting. In essence it is necessary to convince customers that they
can't live without something that they've been living without forever! In such sales the customer has to balance the commercial advantage in being first or fastest vs. the risk of finding no customers or being unable to recover the cost of the system in savings or sales.
3. PROJECT 1: AUTOMATIC BACK-OF-THE-BOOK STYLE INDEXER
My group was responsible for the first product of Thinking Machines: the TMC Indexer, a software product, built in 1985, while the CM-1 was still in development. The key ideas for the Indexer were due to Dr. Howard Resnikoff, one of the founders of Thinking Machines. The Indexer ran on Symbolics Lisp Machines, then a hot product in what was a thriving AI market. The TMC Indexer generated a back-of-the-book style index from formatted text input. It used a novel natural language processing technology, with simple heuristics for finding indexable materials. The TMC Indexer concentrated on locating noun phrases. It had pattern recognizers for proper noun phrases (names of people, places, companies, and agencies), as well as for content noun phrases. The TMC Indexer automatically generated an alphabetized list of entries plus page numbers. The technology consisted of lists of proper nouns; noun phrase parsers; and simple rules for inverting phrases: e.g., "processing, parallel" as well as "parallel processing"; "Reagan, Ronald"; etc. The Indexer could index an entire book in less than one hour; it required only about 1/2 day total to complete the index (including hand editing) for Danny Hillis' MIT Press book "The Connection Machine" vs. 2-3 tedious days for hand indexing a book of similar length.

With a prototype in hand, we approached publishers as a natural target market. While everyone seemed impressed and interested, we made no progress in selling any copies. This was really puzzling, since indexing was known to be time-consuming and expensive, and the quality of indexes often left much to be desired. Eventually we learned why we were having no success. The manager at a publishing house responsible for producing indexes told us that he (and other people in parallel positions) had "black books" of people - mostly highly educated mothers at home - that they used as a workforce for producing indexes. For these managers, having a list of (human) indexers was their primary job security. He argued that, if this process were automated, he would be out of a job, and therefore he - and all our other potential customers - would never buy the product, regardless of the savings or quality advantage. While I cannot say for certain that he was correct in his analysis, we never did sell a copy to a publisher, despite a considerable effort to do so.

We did sell one copy, the first sale of any sort by Thinking Machines. The customer was a "Beltway bandit" (i.e. a Washington, DC, area consulting firm working primarily with the US government), employing about 5000 people. Each of their technical employees was an expert in one or more areas, such as nuclear waste disposal, spread spectrum radar, or computational fluid dynamics. When generating a bid for a project listed in the Commerce Business Daily or responding to a request from an agency, this company typically had to find a suitable multi-person team, e.g. to do a nuclear waste disposal study. In the past, the company had formed teams by using a phone search to find those who they thought might be able to respond. This process was very time-consuming and spotty in its results. After seeing a demo of the TMC Indexer during a visit (for a different topic), someone realized that
the Indexer might help improve the team selection process: each of the 5000 technical employees had a text biography on file, including education, previous work experience, areas of expertise, previous projects, etc. By using the Indexer, this company was able to easily identify groups by topic, expertise, geographic area, and previous company or agency contacts. They used this system successfully for several years, well into the period of obsolescence of the Lisp Machines running the application. This was our only sale of the TMC Indexer.

The Indexer (and the prospect for all other software-only projects) was killed when the CM-1 was announced. In a nutshell, Thinking Machines wasn't big enough for both the VP who championed software and the VP who championed the CM-1. The CM-1 won; the Software VP left; the TMC Indexer died.

4. PROJECT 2: CMDRS (CONNECTION MACHINE DOCUMENT RETRIEVAL SYSTEM)
In 1986, Thinking Machines was approached by an intelligence agency that was interested in whether the CM-1 offered any potential advantage for information retrieval tasks. As a result of this contact, an experiment was set up, benchmarking a task that at the time was being done on mainframes. This task involved searching for text documents against a very large number of query terms, corresponding to the batched questions of a number of analysts. Questions were batched for efficiency. Answers were returned in a batch, and then sorted out into bins corresponding to each analyst's question. Craig Stanfill and Brewster Kahle wrote a prototype system [8] that searched the database in serial chunks. A chunk of the database that exactly filled the memory of a CM-1 was loaded; because of the massively parallel architecture of the CM, the memory was distributed evenly among the 64K processors. The query terms were then serially "broadcast" to all of the processors, which then searched their local memory for hits, a process that the CM performed very quickly. All hits could then be appended to the answer list, and the next section of the database loaded, etc. The CM-1 had very slow I/O (only through its controlling front end), but the CM-2 had a fast I/O system designed into it.

Following this experiment, Stanfill and Kahle asked the question: how might this fast search apply to commercial information retrieval? In commercial IR, users typically submit very small queries, so there is no advantage to the scheme above, which requires time serial in the number of memory loads needed to go through the entire database. Even though the memory of the CM-2 was very large by the standards of the day (32 MB), most commercial databases were much larger. Stanfill and Kahle devised a way of generating 1000-bit signatures for 100-word sections of a database. This allowed about 8:1 compression, so that potentially about 1/4 GB of (compressed) documents could be stored and searched (probabilistically) in the memory of one 1988-vintage CM-2 (1988 hardware cost: about $8M). Stanfill and Kahle also noted that, since many query terms could be searched for in a short time (less than 1/2 second, even for 100 search terms), a user could generate query terms by pointing to a "good" document, once one was found, and using ALL the terms of the document as a query. They found experimentally that this led to high-quality search (high precision-recall product, where "precision" is the fraction of high-ranking documents actually relevant to one's query, and "recall" is the fraction of
all the relevant documents retrieved within the high-ranked set of documents). Thus they inadvertently reinvented the idea of "relevance feedback," initially described by Salton
[7]. This work led to the first commercial system based on "relevance feedback." Here's how it came about. Based on the earlier prototype, my group generated a demo using a Symbolics Lisp Machine with a fancy (for the time) mouse-driven interface. After typing in an initial query - which could be an English question or a "word salad," with no boolean operators required - a user would get back a ranked list of documents. Thereafter, the user could point and click at relevant documents or drag a mouse over relevant sections of a document, and search without further typing. The visionary President of Dow Jones Information Services saw a demo and was wildly enthusiastic. Through his efforts, and despite resistance from his staff, he pushed through a sale of the system, and a joint project to build a commercial product, eventually known as CMDRS at Thinking Machines, and marketed as DowQuest. Thinking Machines wrote the back end - a memory-free transaction system (i.e. one that kept no record of individual sessions or users, but simply took a set of query terms and returned pointers to a set of documents). Next-generation memories allowed two 32K-processor CM-2's to each search signatures corresponding to 1/2 GByte of raw text in roughly 200 msec. (The final compression rate of text was only about 4:1 in the deployed system, since we found that adding common word pairs, e.g. White House, New Mexico, Artificial Intelligence, etc., dramatically improved search performance, offsetting the cost of handling only a smaller database.) Dow Jones produced the user interface, concentrating first on dumb line-oriented terminals, using a (very clunky) menu-driven interaction scheme, since that's what most users had in 1988. The plan was to eventually build a PC version as well, but Dow Jones never followed through on it.

Unfortunately the visionary President - and through him Dow Jones marketing people - misunderstood CMDRS. They considered it a natural language system, and in the advertising and user manuals emphasized the ability to use sentences rather than Boolean queries. In fact, CMDRS simply extracted the content words and ignored other words. (It had a list of all words in the database.) Meanwhile the visionary President lost interest in the product once the decision was made to buy, and moved on to investigating chaos and neural nets. In addition to technical and marketing problems, TMC - used to working with government agencies - was "taken to the cleaners" in negotiations with Dow Jones. During negotiations it became clear that Thinking Machines would not make much if any money on the Dow Jones sale and development project. However, the sale was viewed by TMC as very important publicity - too important to let a low profit margin interfere with it - and we decided to go ahead, based on the idea that the development costs would be amortized over a number of other sales, made more likely by the visibility of the Dow Jones system, and the use of Dow Jones as a reference account.

By many measures, this was a very successful project, constructed within the (very aggressive) schedule agreed to in negotiations by a team led by Bob Millstein. CMDRS was honored as Online Magazine product of the year (1989), and remained in service from 1989 through 1996, well past the obsolescence of the two 32K-processor CM-2's with VAX 11/780 front ends. (Two systems were built to provide a "hot backup" in case of a failure
of the primary machine, but eventually, with the high reliability of the system, the two machines were loaded with different data, to provide a larger searchable database.) Throughout its lifetime, Dow Jones claimed that the DowQuest service was not profitable, and regularly threatened to turn the system off - thereby gaining concessions on hardware and software upgrades. The shortage of profits was also cited as an excuse for not upgrading the user interface for PCs/workstations, insuring awkward operation and limited success of the service as PCs became widespread. We frequently asked ourselves: Would Dow Jones ever have bought the system if the demo had been of the system they ended up deploying? We think not.

Starting in about 1989, Brewster Kahle and a small team built a PC interface for WAIS, an acronym for "Wide Area Information Server." This was the PC interface that Dow Jones had wanted to build, but never did. In addition, WAIS embodied the idea that the text servers on the Internet would be distributed, presaging the later World Wide Web explosion. In 1992, Brewster Kahle left Thinking Machines, along with a few other employees who had worked on this project, and founded WAIS, Inc. to commercialize a PC/workstation version of CMDRS, for the most part developed while Kahle et al. were at Thinking Machines. In 1995, WAIS, Inc., with about a dozen employees, was purchased by America Online for about $15M.
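For readers who want a concrete picture of the signature-based search described above, the following sketch (in present-day Python, purely for illustration) shows one plausible way to build and probe fixed-width superimposed-coding signatures. The hash function, the number of probes per word, and the scoring rule are assumptions made for this example; they are not the parameters of the actual CMDRS implementation.

    import hashlib

    SIG_BITS = 1000        # signature width in bits, as in the text above
    PROBES_PER_WORD = 10   # hash probes per word -- an assumed tuning parameter

    def word_bits(word):
        # Map a word to PROBES_PER_WORD pseudo-random bit positions.
        positions = []
        for i in range(PROBES_PER_WORD):
            digest = hashlib.md5((word.lower() + str(i)).encode()).hexdigest()
            positions.append(int(digest, 16) % SIG_BITS)
        return positions

    def make_signature(section_words):
        # Superimpose (OR together) the bit patterns of a ~100-word section.
        sig = 0
        for w in section_words:
            for pos in word_bits(w):
                sig |= 1 << pos
        return sig

    def may_contain(sig, word):
        # Probabilistic test: never misses a word that is present,
        # but occasionally reports a word that is not (a false positive).
        return all(sig & (1 << pos) for pos in word_bits(word))

    def score(sig, query_words):
        # Rank a section by how many query terms its signature appears to hold.
        # "Relevance feedback" then amounts to using all the content words of
        # a good document as query_words for the next search.
        return sum(1 for w in query_words if may_contain(sig, w))

In such a scheme the per-section work is a few bit tests per query term, which is why broadcasting even a 100-term query across all processors remains fast.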
5. PROJECT 3: LEGAL DATABASE VERSION OF CMDRS
We made several attempts to sell CMDRS to other database vendors. In one experiment, we quickly built a legal search demo for XYZ, Inc. (a well-known vendor - not its real name). For this demo, we first built a list of legal technical terms by comparing word counts of news and legal databases, and keeping words that occurred much more often in the legal text as our legal lexicon. (We eliminated some common words that didn't seem to belong in a legal dictionary, only to discover later that most of them should have been kept.) Based on brief experiments, we also built special recognizers for case names and statute designators ("Massachusetts vs. Smith," "New Jersey HR-3224.05," etc.).

Our test data was taken from real on-line interactions by legal users. We were given a large database, and the lists of documents retrieved for each query. Queries were made to our system by stripping out all the Boolean connectives, and searching on just the lists of terms from the queries. We brought in a legal expert to judge the results of our experiments, and to tune the system for maximum performance. In the judgement of our expert, we found a significant number of relevant cases (perhaps 50% more - my memory is hazy at this point) that had not been found by the existing Boolean search system; we also missed perhaps 5% of the articles found by the existing system. We - and our legal expert - were very excited by the quality of this result, which took on the order of two person-months total to achieve. We sent the results to the potential customer with high expectations.

Our first warning should have come when the task of evaluation was given to the writers of the customer's existing legal search system. This group would naturally be expected to view its own credibility as being at stake in the comparison. The verdict from the customer was indeed grim, but the reason given was astounding: in their opinion, we had simply gotten the wrong answers. The "right answers" would have been an exact match
with what the Boolean system produced - it was viewed as the Gold Standard. No matter that the Gold Standard missed about 1/3 of the relevant articles. Our system didn't get the right answers. We didn't make the sale. As a postscript, all legal services, including the customer above, now proudly offer search services at least superficially identical to what we demoed to them in 1988.

6. PROJECT 4: PACE: AUTOMATIC CLASSIFICATION OF U.S. CENSUS LONG FORMS

In 1989, we received a contact from Rob Creecy, a scientist at the U.S. Bureau of the Census. Creecy had seen the paper that Craig Stanfill and I had written on memory-based reasoning [9], and felt that our method might be applicable to the task of classifying Census Long Forms into the appropriate Occupation and Industry categories of the respondents. Long Forms are given to 10% of the population, and have free text fields for respondents to describe their occupations and the industries in which they work. Through 1980, these returns were assigned to about 500 occupation and 250 industry categories by human operators, trained in the task, and working with a reference book (moved in 1980 to a computer terminal used by each operator). Starting in about 1983, M.V. Appel and others at the Census Bureau built a rule-based expert system to automate the classification task. They kept careful records on their effort, which required nearly 200 person-months to build, test, and tune the system. By the end of the project, the expert system, called AIOCS, assigned classification categories and confidence levels for each return. For each occupation and industry category, a threshold was selected, and all classifications with confidence levels below that threshold were given to humans for hand classification. The outcome was a system that performed with the same accuracy that had been obtained by humans: AIOCS could classify about 47% of the input data at human levels of accuracy [1,2].

At some point after AIOCS was completed, Rob Creecy had tried to write a memory-based reasoning (MBR) system to do the same task. The basic idea of MBR is to use the classification of the nearest neighbor (or k-nearest neighbors) to decide how a novel item should be classified. Rob's system worked, but not as well as AIOCS. Craig Stanfill and I had proposed a metric for judging nearness that applies to both symbolic and numeric data, and, with lots of computing power, had the possibility of trying lots of experiments in a short time to build and tune a system. The Census Bureau prides itself on being forward-looking - it was the first customer for a commercial computer - but it had fallen far behind the times. All processing was still mainframe-based. Creecy argued within the Bureau for an experiment and a small budget to carry it out. He/we received approval. We produced a very successful benchmark in a very short time: about 61% of long forms were handled at human levels of accuracy vs. 47% for the expert system. Moreover, the entire project required only 4 person-months to develop vs. 200 for AIOCS! A rough calculation showed that deploying the system would have saved more money (in salaries for human classifiers) than the purchase price of the Connection Machine hardware.
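To make the memory-based reasoning idea concrete, here is a minimal k-nearest-neighbor sketch in Python. The simple overlap distance, the value of k, and the single confidence threshold are simplifying assumptions for illustration only; the actual system used a more elaborate metric suited to both symbolic and numeric fields, and per-category thresholds, as discussed above and in [3,9].

    from collections import Counter

    def distance(a, b):
        # Simple overlap metric: the number of fields on which two coded
        # returns disagree. (The deployed system used a weighted metric.)
        return sum(1 for fa, fb in zip(a, b) if fa != fb)

    def classify(query, training, k=10, threshold=0.6):
        # k-nearest-neighbor classification with a confidence estimate.
        # training is a list of (feature_tuple, category) pairs taken from
        # previously hand-classified returns; low-confidence cases are
        # referred to a human, as in the systems described above.
        neighbors = sorted(training, key=lambda item: distance(query, item[0]))[:k]
        votes = Counter(category for _, category in neighbors)
        category, count = votes.most_common(1)[0]
        confidence = count / k
        if confidence < threshold:
            return "REFER_TO_HUMAN", confidence
        return category, confidence

The appeal of this approach is that "building" the classifier is essentially just loading the previously classified returns into memory; the expensive part, the distance computation against every stored example, is exactly what a massively parallel machine does well.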
So this should have been a success and should have led to a sale. We had an in-house champion, an impressive demonstration, a very favorable cost-benefit analysis, and there was no Not-Invented-Here Syndrome at work. So was this an obvious win? Alas, no. We did get a nice paper out of this [3], but there was no sale. Why? Contractual agreements had already been made with human classifiers before a purchase decision could be made. No budget savings would be possible. We could try again in ten years...

7. OTHER PROJECTS

Over the years we completed a number of projects with customers, some of which led to sales, some to papers; all were great learning experiences (no sarcasm intended). One set of experiments with a credit-card issuing bank helped us to pioneer many tasks that have now been deployed as data mining applications. These included experiments in learning to recognize good credit card customers (for the sake of retaining them), and in rooting out bad customers (so that their credit lines would not be increased, or that cards would not be issued to them). We tried and compared many different learning methods. For example, in one experiment, a neural net predicted a set of cardholders ten times more likely than the general cardholder population to "attrite" (i.e. not renew their cards), and in another experiment CART and K-nearest neighbors outperformed ten other methods tested to find people about to miss a payment.

In target marketing experiments for catalog sales customers, we used simulated annealing and genetic engineering-like mating methods to generate optimal catalogs and mailing lists. By "optimal" I mean that the solution found the right number of catalogs, catalog sizes, and contents of the catalogs to maximize net return, based on prior buying behavior, after mailing costs (variable to reflect the different sizes of catalogs) were subtracted. In yet another experiment, we showed that we could perform automatic classification of news articles with near-human accuracy. Using a nearest neighbor method that called CMDRS as a subroutine, our system was able to assign keywords to 92% of articles with a correctness performance equal to human editors. (As in the Census application, our system referred articles that fell below a confidence threshold to human experts.) This work is described in [6].

Based on these experiments, and on parallel relational database prototypes, Thinking Machines sold two large systems to American Express and Epsilon, Amex's subsidiary for mailing list generation and software development. These systems replaced several mainframes and cut turnaround time for answering marketing questions from several weeks to less than a day.

8. THE FINAL CHAPTER
A number of factors conspired to doom the old Thinking Machines:

• the end of the cold war

• big cuts in federal research funds
• competition from "killer micros," shared-memory multiprocessors, and other MPP manufacturers. ("Killer micros" is Eugene Brooks's phrase. It refers to the overwhelming effect of ever-cheaper and ever-more-powerful commodity hardware. MPP hardware required special software, and over the three years or more required to bring an entire MPP system to market, the PCs and workstations would have increased their cost-performance by a factor of four or more, making any MPP look bad in cost-performance terms.) The MPP competitors were very aggressive; some had deep pockets (Intel, IBM) and others lacked scruples (Kendall Square Research has been embroiled in court cases over misrepresenting income. KSR is charged with claiming income for donated computers, artificially boosting their bottom line, and thus receiving artificially inflated prices for their stock. It was hard to compete against a company that offered computers - quite excellent ones - for much less than cost! [4]).

• bad (or unfortunate) technical decisions: in 1989, TMC chose to go with the SPARC chip for the CM-5 rather than the MIPS chip. In retrospect, MIPS delivered several generations of faster, compatible chips before even one faster SPARC generation arrived, and therefore MIPS would have been the better choice by far. DMA memory should have been designed into the CM-5 but was not, with the net effect that communication and computation could not be overlapped.

• To top this off, there were management power struggles and cases of less-than-optimal decision-making in response to all the factors above.

In the end, Thinking Machines survived, but in a much changed and smaller form. It is now a software vendor with about 200 employees vs. about 650 at its peak. It is possible that Thinking Machines would have avoided calamity if it had wholeheartedly embraced the data mining/commercial data processing goals at the time I and people around me began pushing for this (about 1988). However, there was vast inertia and momentum in the direction of scientific computing - floating-point oriented scientific computing would have let the company enter the mainstream, whereas the original dream of making AI possible using NETL marker-passing methods on the non-floating-point equipped CM-1 had yielded mostly university interest. By the time we began lobbying for increasing TMC's commercial thrust, the people associated with commercial applications were outnumbered by at least 10 to 1 within the company, and the net management, sales, and marketing attention given to these non-scientific applications was in about the same ratio. To be fair, "the attack of the killer micros" would almost certainly have doomed TMC's hardware business in any case, but the end would have been less sudden, giving the company a better opportunity to shift its focus without mass layoffs.

A Thinking Machines team under Steve Smith completed "Darwin," a package of data mining tools, in 1995. Darwin has been converted to run on a wide variety of platforms, and is being sold to commercial customers.

In the end, TMC may have won some battles but it lost the war. Data mining has become mainstream and "hot." But the data mining pieces have been picked up not so much by Thinking Machines, but by IBM, Sun, Dun & Bradstreet, Amex, and perhaps 100 other companies, many of them small. IBM SP-2's are hot sellers as database mining
engines and mainframe replacements; ironically, IBM had argued repeatedly throughout the 80's and early 90's that MPPs would never replace mainframes.

9. OVERALL MESSAGES

So what can one take from all this? Here is an attempt to sum up some of the lessons we've learned, which apply to sales of any large system or to sales of systems that introduce big changes in customer operations.

Good applications must show cost savings, but only 1) very large installations, or 2) highly replicated applications (i.e. a mass market) can support the high costs of development. (Most of the broad applications so far are generic - e.g. Oracle, SAS, SQL - and do not have novel functionality - e.g. Darwin.) Libraries of standard applications would be very useful, but they are chicken-or-egg problems - very costly to build, and no one may be willing to pay to develop them until there is some guarantee of cost-performance. But cost savings do not guarantee sales, as in the Census Bureau case above.

Successful organizations inherently resist change; unsuccessful organizations can't afford new projects. Perceived risks must be addressed: the probability of technical success, job loss, retraining, user acceptance, scalability, maintenance and updating, migration to future platforms, etc.

Customer confidence is important. Reference accounts can help, once a business gets rolling. But especially for a small start-up, it is difficult to overcome customer fears about whether the company will be around next year or the year after that. This gives established companies a huge advantage.

All sales involve solving people problems, never just technical problems. To succeed, it is important for the vendor to understand the customer's organization, operations, decision making, power structure, individual motivations, etc. Sales of large systems are unlikely unless there is an internal champion. But big-ticket items also need an internal consensus. Overall, it is critical to offer clear benefits with manageable risks.

The not-invented-here syndrome is often a problem. Involving the customer, e.g., with joint projects, can help get past this problem.
Acknowledgments

I would like to thank the wonderful people at Thinking Machines who worked on and supported the projects listed above: Craig Stanfill, John Mucci, Bob Millstein, Marvin Denicoff, Sheryl Handler, Steve Smith, Gordon Linoff, Brij Masand, Franklin Davis, Kath Durant, Michael Berry, Lily Li, Kurt Thearling, Mario Bourgoin, Gary Drescher, Ellie Baker, Tracy Shen, Chris Madsen, Danny Hillis, Paul Mott, Shaun Keller, Howard Resnikoff, Brewster Kahle, and Paul Rosenbloom. It is impossible to list everyone, and I apologize to those I've left out. I would also like to thank especially Bill Dunn, formerly of Dow Jones, and Rob Creecy of the US Bureau of the Census.
REFERENCES

1. M.V. Appel, Automated industry and occupation coding, Development of Statistical Tools Seminar on Development of Statistical Expert Systems (DOSES), Luxembourg, December 1987.
2. M.V. Appel and E. Hellerman, Census Bureau experiments with automated industry and occupation coding, Proceedings of the American Statistical Association, 1983, 32-40.
3. Robert Creecy, Brij Masand, Stephen Smith and David Waltz, Trading MIPS and Memory for Knowledge Engineering, Communications of the ACM, 35, 8, August 1992, 48-64.
4. Josh Hyatt, Kendall Square plans to restate '92 fiscal results, Boston Globe, first page, business section, December 3, 1993.
5. Danny Hillis, The Connection Machine, Cambridge, MA: MIT Press, 1985.
6. Brij Masand, Gordon Linoff, and David Waltz, Classifying News Stories Using Memory Based Reasoning, Proceedings of the 15th Annual ACM/SIGIR Conference, Copenhagen, Denmark, 1992, 59-65.
7. Gerald Salton, The SMART Retrieval System - Experiment in Automatic Document Classification, Cambridge, MA: MIT Press, 1971.
8. Craig Stanfill and Brewster Kahle, Parallel free-text search on the Connection Machine, Communications of the ACM, 29, 12, December 1986, 1229-1239.
9. Craig Stanfill and David L. Waltz, Toward Memory-Based Reasoning, Communications of the ACM, 29, 12, December 1986, 1213-1228.
10. David L. Waltz, Massively Parallel AI, International Journal of High Speed Computing, 5, 3, 1993, 491-501.
David Waltz

David Waltz is Vice President of the Computer Science Research Division of NEC Research Institute in Princeton, NJ, and Adjunct Professor of Computer Science at Brandeis University in Waltham, MA. From 1984 to 1993, he was Senior Scientist and Director of Advanced Information Systems at Thinking Machines Corporation in Cambridge, MA, and Professor of Computer Science at Brandeis. From 1974 to 1983 he was a Professor of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. Dr. Waltz received SB, SM, and Ph.D. degrees from MIT, in 1965, 1968, and 1972 respectively. His research interests have included constraint propagation, especially for computer vision, massively parallel systems for both relational and text databases, memory-based and case-based reasoning systems, protein structure prediction using hybrid neural net and memory-based methods, connectionist models for natural language processing, and natural language front ends for relational databases. He is President-Elect of the American Association for Artificial Intelligence and was elected a fellow of AAAI in 1990. He was President of ACM SIGART from 1977-9, served as Executive Editor of Cognitive Science from 1983-6, AI Editor for Communications of the ACM 1981-4, and is a senior member of IEEE, and a member of ACM, ACL (Association for Computational Linguistics), AAAI, and the Cognitive Science Society. Home Page: http://www.neci.nj.nec.com/homepages/waltz.html
APPENDIX
This appendix contains references to all the papers that originally appeared in four workshops:

1. PPAI91 - Workshop for Parallel Processing in Artificial Intelligence, IJCAI 1991, Sydney, Australia.
2. SSS93 - Stanford Spring Symposium on Massively Parallel Artificial Intelligence, 1993, Stanford, CA.
3. PPAI93 - Workshop for Parallel Processing in Artificial Intelligence - 2, IJCAI 1993, Chambery, France.
4. PPAI95 - Workshop for Parallel Processing in Artificial Intelligence - 3, IJCAI 1995, Montreal, Canada.

REFERENCES
1. D. Abramson, J. Abela. A Parallel Genetic Algorithm for Solving the School Timetabling Problem. PPAI91.
2. Emmanuel D. Adamides. Cellular Objects for Cellular Computers. SSS93.
3. Ali M. AlHaj, Eiichiro Sumita, and Hitoshi Iida. A Parallel Text Retrieval System. PPAI95.
4. Ed P. Andert Jr. and Thomas Bartolac. Parallel Neural Network Training. SSS93.
5. Jean-Marc Andreoli, Paolo Ciancarini, and Remo Pareschi. Parallel Searching with Multisets-as-Agents. SSS93.
6. Ulrich Assman. A Model for Parallel Deduction. PPAI93.
7. Tito Autrey and Herbert Gelernter. Parallel Heuristic Search. SSS93.
8. Frank W. Bergmann and J. Joachim Quantz. Parallelizing Description Logics. PPAI95.
9. Pierre Berlandier. Problem Partition and Solvers Coordination in Distributed Constraint Satisfaction. PPAI95.
10. Mark S. Berlin. Toward An Architecture Independent High Level Parallel Programming Model For Artificial Intelligence. PPAI93.
11. B. Boutsinas, Y. C. Stamatiou, and G. Pavlides. Parallel Reasoning using Weighted Inheritance Networks. PPAI95.
12. Jon Bright, Simon Kasif, and Lewis Stiller. Exploiting Algebraic Structure in Parallel State-Space Search (Extended Abstract). SSS93.
13. Daniel J. Challou, Maria Gini, and Vipin Kumar. Toward Real-Time Motion Planning. PPAI93.
14. Daniel J. Challou, Maria Gini, and Vipin Kumar. Parallel Search Algorithms for Robot Motion Planning. SSS93.
15. C.-C. Chu, J.C. Aggarwal. An Experimental Parallel Implementation of a Rule-Based Image Interpretation System. PPAI91.
16. Diane J. Cook. Fast Information Distribution for Massively Parallel IDA* Search. SSS93.
17. Diane J. Cook and Shubha Nerur. Maximizing the Speedup of Parallel Search Using HyPS. PPAI95.
18. J.-P. Corriveau. Constraint Satisfaction in Time-Constrained Memory. PPAI91.
19. Van-Dat Cung and Lucien Gotte. A First Step Towards the Massively Parallel Game-Tree Search: a SIMD Approach. PPAI93.
20. R.F. DeMara, H. Kitano. The MP-1 Benchmark Set for Parallel AI Architectures. PPAI91.
21. J. Denzinger. Parallel Equational Deduction by Team Work Completion. PPAI91.
22. G. Destri and P. Marenzoni. Performance Evaluation of Distributed Low-Level Computer Vision Algorithms. PPAI95.
23. Rumi M. Dubash and Farokh B. Bastani. Decentralized, Massively Parallel Path Planning and its Application to Process-Control and Multi Robot Systems. SSS93.
24. Wolfgang Ertel. Random Competition: A Simple, but Efficient Method for Parallelizing Inference Systems. PPAI91.
25. Wolfgang Ertel. Massively Parallel Search with Random Competition. SSS93.
26. Matthew P. Evett, William A. Anderson, and James A. Hendler. Massively Parallel Support for Computationally Effective Recognition Queries. SSS93.
27. Scott E. Fahlman. Some Thoughts on NETL, 15 Years Later. SSS93.
28. M. Factor, S. Fertig, D.H. Gelernter. Using Linda to Build Parallel AI Applications. PPAI91.
29. Michael Fisher. An Open Approach to Concurrent Theorem-Proving. PPAI95.
30. U. Furbach. Splitting as a source of Parallelism in Disjunctive Logic Programs. PPAI91.
31. Edmund Furse and Kevin H. Sewell. Automatic Parallelisation of LISP program. PPAI93.
32. J.-L. Gaudiot, C.A. Montgomery, R.E. Strumberger. Data-Driven Execution of Natural Language Parsing. PPAI91.
33. James Geller. Upward-Inductive Inheritance and Constant Time Downward Inheritance in Massively Parallel Knowledge Representation. PPAI91.
34. James Geller. Massively Parallel Knowledge Representation. SSS93.
35. G. Große. Actor Coordination in Parallel Planning. PPAI91.
36. L.O. Hall, D.J. Cook, W. Thomas. Parallel Window Search using Transformation-Ordering Iterative-Deepening A*. PPAI91.
37. Sanda M. Harabagiu and Dan I. Moldovan. A Marker-Propagation Algorithm for Text Coherence. PPAI95.
38. R. Hasegawa, H. Fujita, M. Fujita. A Parallel Model-Generation Theorem Prover with Ramified Term-Indexing. PPAI91.
39. James A. Hendler. Massively-Parallel Marker-Passing in Semantic Networks. PPAI91.
40. James A. Hendler. The Promise of Massive Parallelism for AI. SSS93.
41. Dominik Henrich. Initialization of Parallel Branch-and-bound Algorithms. PPAI93.
42. Tetsuya Higuchi, Tatsuya Niwa, Toshio Tanaka, Hitoshi Iba, Tatsumi Furuya. A Parallel Architecture for Genetic Based Evolvable Hardware. PPAI93.
43. Lothar Hotz. An Object-oriented Approach for Programming the Connection Machine. PPAI93.
44. Walter Hower. Parallel Global Constraint Satisfaction. PPAI91.
45. Walter Hower and Stephan Jacobi. Parallel Distributed Constraint Satisfaction, PPAI93, SSS93.
46. Ken Jung, Evangelos Simoudis, and Ramesh Subramonian. Parallel Induction Systems Based on Branch and Bound. SSS93.
47. George Karypis, Vipin Kumar. Unstructured Tree Search on SIMD Parallel Computers: Experimental Results. SSS93.
48. P. Kefalas, T.J. Reynolds. Hill-Climbing and Genetic Algorithms coded using OR-parallel Logic Plus Meta-Control. PPAI91.
49. S. Keretho, R. Loganantharaj, V. N. Gudivada. Parallel Path-Consistency Algorithms for Constraint Satisfaction. PPAI91.
50. Hiroaki Kitano. Massively Parallel AI and its Application to Natural Language Processing. PPAI91.
51. Hiroaki Kitano. Massively Parallel Artificial Intelligence and Grand Challenge AI Applications. SSS93.
52. Richard Kufrin. Decision Trees on Parallel Processors. PPAI95.
53. Deepak Kumar. An AI Architecture Based on Message Passing. SSS93.
54. F. Kurfeß. Massive Parallelism in Inference Systems. PPAI91.
55. Franz Kurfeß. Massive Parallelism in Logic. SSS93.
56. Yannick Lallement, Thierry Cornu, and Stéphane Vialle. An Abstract Machine for Implementing Connectionist and Hybrid Systems on Multi-processor Architectures. PPAI93.
57. Yannick Lallement, Thierry Cornu, and Stéphane Vialle. Application development under ParCeL-1. PPAI95.
58. Trent E. Lange. Massively-Parallel Inferencing for Natural Language Understanding and Memory Retrieval in Structured Spreading-Activation Networks. SSS93.
59. Eunice (Yugyung) Lee and James Geller. Parallel Operations on Class Hierarchies with Double Strand Representations. PPAI95.
60. Q.Y. Luo, P.G. Hendry, and J.T. Buchanan. Comparison of Different Approaches for Solving Distributed Constraint Satisfaction Problems. SSS93.
61. E.L. Lusk, W.W. McCune, J.K. Slane. ROO - a Parallel Theorem Prover. PPAI91.
62. A. Mahanti, C.J. Daniels. A SIMD Approach to Parallel Heuristic Search. PPAI91.
63. Takao Mohri, Masaaki Nakamura, and Hidehiko Tanaka. Weather Forecasting Using Memory-Based Reasoning. PPAI93.
64. D. Moldovan, W. Lee, C. Lin. A Marker Passing Parallel Processor for AI. PPAI91.
65. Petri Myllymaki and Henry Tirri. Bayesian Case-Based Reasoning with Neural Networks. SSS93.
66. P.C. Nelson, A.A. Toptsis. Superlinear Speedup Using Bidirectionalism and Islands. PPAI91.
67. J. Thomas Ngo and Joe Marks. Massively Parallel Genetic Algorithms for Physically Correct Articulated Figure Locomotion. SSS93.
68. T. Nishiyama, O. Katai, T. Sawaragi, T. Katayama. Multiagent Planning by Distributed Constraint Satisfaction. PPAI91.
69. Katsumi Nitta. Experimental Legal Reasoning System on Parallel Inference Machine. PPAI91.
70. Katsumi Nitta, Stephen Wong. The Role of Parallelism in Parallel Inference Applications. SSS93.
71. Kozo Oi, Eiichiro Sumita, Osamu Furuse, Hitoshi Iida, Hiroaki Kitano. Toward Massively Parallel Spoken Language Translation. PPAI93.
72. R. Oka. Parallelism for Heterarchical Aggregation of Knowledge in Image Understanding. PPAI91.
73. Gérald Ouvradou, Aymeric Poulain Maubant, and André Thépaut. Hybrid systems on a Multi-Grain Parallel Architecture. PPAI93.
74. Robert A. Pearson. A Coarse Grained Parallel Induction Heuristic. PPAI93.
75. G. Pinkas. Constructing Proofs using Connectionist Networks. PPAI91.
76. Curt Powley, R.E. Korf. IDA* on the Connection Machine. PPAI91.
77. Curt Powley, Richard E. Korf, and Chris Ferguson. Parallelization of Tree-Recursive Algorithms on a SIMD Machine. SSS93.
78. S. Rangoonwala, G. Neugebauer. Distributed Failure Production: Sequential Theorem Proving on a Parallel Machine. PPAI91.
79. Thilo Reski and Willy B. Strothmann. A Dense, Massively Parallel Architecture. PPAI93.
80. J. Riche, R. Whaley, J. Barlow. Massively Parallel Processing and Automated Theorem Proving. PPAI91.
81. James D. Roberts. Associative Processing: A Paradigm for Massively Parallel AI. SSS93.
82. Ian N. Robinson. PAM: Massive Parallelism in Support of Run-Time Intelligence. SSS93.
83. Satoshi Sato. MIMD Implementation of MBT3. PPAI93.
84. James G. Schmolze, Wayne Snyder. Using Confluence to Control Parallel Production Systems. PPAI93.
85. J. Schumann and M. Jobmann. Scalability of an OR-parallel Theorem Prover on a Network of Transputers - A Modelling Approach. PPAI93.
86. Johann Schumann. SiCoTHEO - Simple Competitive parallel Theorem Provers based on SETHEO. PPAI95.
87. S. Sei, N. Ichiyoshi. Experimental Version of Parallel Computer Go-Playing System "GOG". PPAI91.
88. R.V. Shankar, S. Ranka. Parallel Processing of Sparse Data Structures for Computer Vision. PPAI91.
89. Lokendra Shastri. Leveraging Massive Parallelism for Tractable Reasoning - taking inspiration from cognition. SSS93.
90. Kilian Stoffel, Ian Law, and Béat Hirsbrunner. Fuzzy Logic controlled dynamic allocation system. PPAI93.
91. Kilian Stoffel, James Hendler, and Joel Saltz. PARKA on MIMD-supercomputers. PPAI95.
92. Salvatore J. Stolfo, Hasanat Dewan, David Ohsie, Mauricio Hernandez and Leland Woodbury. A Parallel and Distributed Environment for Database Rule Processing: Open Problems and Future Directions. SSS93.
93. S.Y. Susswein, T.C. Henderson, J.L. Zachary, et al. Parallel Path Consistency. PPAI91.
94. G. Sutcliffe. A Parallel Linear UR-Derivation System. PPAI91.
95. Christian B. Suttner. Competition versus Cooperation. PPAI91.
96. Christian B. Suttner and Manfred R. Jobmann. Simulation Analysis of Static Partitioning with Slackness. PPAI93.
97. Christian B. Suttner. Static Partitioning with Slackness. PPAI95.
98. H. Tomabechi, H. Iida, T. Morimoto, A. Kurematsu. Graph-based CP in Massively-Parallel Memory: Toward Massively-Parallel NLP. PPAI91.
99. Dave Waltz. Innovative Massively Parallel AI Applications. SSS93.
100. A. Martin Wildberger. Position Statement: Innovative Application of Massive Parallelism. SSS93.
101. Stefan Winz and James Geller. Methods of Large Grammar Representation in Massively Parallel Parsing Systems. SSS93.
102. M.J. Wise. Introduction to PMS-Prolog: A Distributed Coarse-Grain-Parallel Prolog with Processes, Modules and Streams. PPAI91.
103. Andreas Zell, Niels Mache, Markus Huttel, and Michael Vogt. Massive Parallelism in Neural Network Simulation. SSS93.
104. Y. Zhang, A.K. Mackworth. Parallel and Distributed Constraint Satisfaction. PPAI91.