FOUNDATIONS OF KNOWLEDGE ACQUISITION: Machine Learning
THE KLUWER INTERNATIONAL SERIES IN ENGINEERING AND COMPUTER SCIENCE
OFFICE OF NAVAL RESEARCH Advanced Book Series
Consulting Editor: Andre M. van Tilborg

Other titles in the series:

Foundations of Knowledge Acquisition: Cognitive Models of Complex Learning, edited by Susan Chipman and Alan L. Meyrowitz. ISBN: 0-7923-9277-9

Foundations of Real-Time Computing: Formal Specifications and Methods, edited by Andre M. van Tilborg and Gary M. Koob. ISBN: 0-7923-9167-5

Foundations of Real-Time Computing: Scheduling and Resource Management, edited by Andre M. van Tilborg and Gary M. Koob. ISBN: 0-7923-9166-7
FOUNDATIONS OF KNOWLEDGE ACQUISITION: Machine Learning
edited by
Alan L. Meyrowitz
Naval Research Laboratory

Susan Chipman
Office of Naval Research
KLUWER ACADEMIC PUBLISHERS Boston/Dordrecht/London
Distributors for North America: Kluwer Academic Publishers 101 Philip Drive Assinippi Park Norwell, Massachusetts 02061 USA
Distributors for all other countries: Kluwer Academic Publishers Group Distribution Centre Post Office Box 322 3300 AH Dordrecht, THE NETHERLANDS
Library of Congress Cataloging-in-Publication Data (Revised for vol. 2)

Foundations of knowledge acquisition.
(The Kluwer international series in engineering and computer science; SECS 194)
Editors' names reversed in v. 2.
Includes bibliographical references and index.
Contents: v. [1] Cognitive models of complex learning -- v. [2] Machine learning.
1. Knowledge acquisition (Expert systems) I. Chipman, Susan. II. Meyrowitz, Alan Lester. III. Series.
QA76.E95F68 1993   006.3'1   92-36720
ISBN 0-7923-9277-9 (v. 1)
ISBN 0-7923-9278-7 (v. 2)
Chapter 8 is reprinted with permission from Computation & Cognition: Proceedings of the First NEC Research Symposium, edited by C. W. Gear, pp. 32-51. Copyright 1991 by the Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania. All rights reserved.

Copyright © 1993 by Kluwer Academic Publishers. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photo-copying, recording, or otherwise, without the prior written permission of the publisher, Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell, Massachusetts 02061.

Printed on acid-free paper. Printed in the United States of America.
TABLE OF CONTENTS

Foreword vii

Preface ix

1. Learning = Inferencing + Memorizing
   Ryszard S. Michalski 1

2. Adaptive Inference
   Alberto Segre, Charles Elkan, Daniel Scharstein, Geoffrey Gordon, and Alexander Russell 43

3. On Integrating Machine Learning with Planning
   Gerald F. DeJong, Melinda T. Gervasio, and Scott W. Bennett 83

4. The Role of Self-Models in Learning to Plan
   Gregg Collins, Lawrence Birnbaum, Bruce Krulwich, and Michael Freed 117

5. Learning Flexible Concepts Using a Two-Tiered Representation
   R. S. Michalski, F. Bergadano, S. Matwin, and J. Zhang 145

6. Competition-Based Learning
   John J. Grefenstette, Kenneth A. De Jong, and William M. Spears 203

7. Problem Solving via Analogical Retrieval and Analogical Search Control
   Randolph Jones 227

8. A View of Computational Learning Theory
   Leslie G. Valiant 263

9. The Probably Approximately Correct (PAC) and Other Learning Models
   David Haussler and Manfred Warmuth 291

10. On the Automated Discovery of Scientific Theories
    Daniel Osherson and Scott Weinstein 313

Index 331
Foreword

One of the most intriguing questions about the new computer technology that has appeared over the past few decades is whether we humans will ever be able to make computers learn. As is painfully obvious to even the most casual computer user, most current computers do not. Yet if we could devise learning techniques that enable computers to routinely improve their performance through experience, the impact would be enormous. The result would be an explosion of new computer applications that would suddenly become economically feasible (e.g., personalized computer assistants that automatically tune themselves to the needs of individual users), and a dramatic improvement in the quality of current computer applications (e.g., imagine an airline scheduling program that improves its scheduling method based on analyzing past delays). And while the potential economic impact of successful learning methods is sufficient reason to invest in research into machine learning, there is a second significant reason: studying machine learning helps us understand our own human learning abilities and disabilities, leading to the possibility of improved methods in education.

While many open questions remain about the methods by which machines and humans might learn, significant progress has been made. For example, learning systems have been demonstrated for tasks such as learning how to drive a vehicle along a roadway (one has successfully driven at 55 mph for 20 miles on a public highway), for learning to evaluate financial loan applications (such systems are now in commercial use), and for learning to recognize human speech (today's top speech recognition systems all employ learning methods). At the same time, a theoretical understanding of learning has begun to appear. For example, we can now place theoretical bounds on the amount of training data a learner must observe in order to reduce its risk of choosing an incorrect hypothesis below some desired threshold. And an improved understanding of human learning is beginning to emerge alongside our improved understanding of machine learning. For example, we now have models of how human novices learn to become experts at various tasks: models that have been implemented as precise computer programs, and that generate traces very much like those observed in human protocols.
The book you are holding describes a variety of these new results. This work has been pursued under research funding from the Office of Naval Research (ONR) during the time that the editors of this book managed an Accelerated Research Initiative in this area. While several government and private organizations have been important in supporting machine learning research, this ONR effort stands out in particular for its farsighted vision in selecting research topics. During a period when much funding for basic research was being rechanneled to shorter-term development and demonstration projects, ONR had the vision to continue its tradition of supporting research of fundamental long-range significance. The results represent real progress on central problems of machine learning. I encourage you to explore them for yourself in the following chapters.

Tom Mitchell
Carnegie Mellon University
Preface

The two volumes of Foundations of Knowledge Acquisition document the recent progress of basic research in knowledge acquisition sponsored by the Office of Naval Research. The volume you are holding is subtitled Machine Learning; there is a companion volume subtitled Cognitive Models of Complex Learning. Funding was provided by a five-year Accelerated Research Initiative (ARI) from 1988 through 1992, and made possible significant advances in the scientific understanding of how machines and humans can acquire new knowledge so as to exhibit improved problem-solving behavior.

Previous research in artificial intelligence had been directed at understanding the automation of reasoning required for problem solving in complex domains; consequent advances in expert system technology attest to the progress made in the area of deductive inference. However, that research also suggested that automated reasoning can serve to do more than solve a given problem. It can be utilized to infer new facts likely to be useful in tackling future problems, and it can aid in creating new problem-solving strategies. Research sponsored by the Knowledge Acquisition ARI was thus motivated by a desire to understand those reasoning processes which account for the ability of intelligent systems to learn and so improve their performance over time. Such processes can take a variety of forms, including generalization of current knowledge by induction, reasoning by analogy, and discovery (heuristically guided deduction which proceeds from first principles, or axioms). Associated with each are issues regarding the appropriate representation of knowledge to facilitate learning, and the nature of strategies appropriate for learning different kinds of knowledge in diverse domains. There are also issues of computational complexity related to theoretical bounds on what these forms of reasoning can accomplish.

Significant progress in machine learning is reported along a variety of fronts. Chapters in Machine Learning include work in analogical reasoning; induction and discovery; learning and planning; learning by competition, using genetic algorithms; and theoretical limitations.
Knowledge acquisition, as pursued under the ARI, was a coordinated research thrust into both machine learning and human learning. Chapters in the companion volume, Cognitive Models of Complex Learning, also published by Kluwer Academic Publishers, include summaries of work by cognitive scientists who do computational modeling of human learning. In fact, an accomplishment of research previously sponsored by ONR's Cognitive Science Program was insight into the knowledge and skills that distinguish human novices from human experts in various domains; the cognitive science interest in the ARI was then to characterize how the transition from novice to expert actually takes place. Chapters particularly relevant to that concern are those written by Anderson, Kieras, Marshall, Ohlsson, and VanLehn.

The editors believe these to be valuable volumes from a number of perspectives. They bring together descriptions of recent and on-going research by scientists at the forefront of progress in one of the most challenging arenas of artificial intelligence and cognitive science. Moreover, those scientists were asked to comment on exciting future directions for research in their specialties, and were encouraged to reflect on progress in science which might go beyond the confines of their particular projects.
Dr. Alan L. Meyrowitz
Navy Center for Applied Research in Artificial Intelligence

Dr. Susan Chipman
ONR Cognitive Science Program
Chapter 1
LEARNING = INFERENCING + MEMORIZING Basic Concepts of Inferential Theory of Learning and Their Use for Classifying Learning Processes
Ryszard S. Michalski
Center for Artificial Intelligence
George Mason University
Fairfax, VA 22030

ABSTRACT

This chapter presents a general conceptual framework for describing and classifying learning processes. The framework is based on the Inferential Theory of Learning, which views learning as a search through a knowledge space aimed at deriving knowledge that satisfies a learning goal. Such a process involves performing various forms of inference and memorizing the results for future use. The inference may be of any type: deductive, inductive, or analogical. It can be performed explicitly, as in many symbolic systems, or implicitly, as in artificial neural nets. Two fundamental types of learning are distinguished: analytical learning, which reformulates given knowledge into a desirable form (e.g., skill acquisition), and synthetic learning, which creates new knowledge (e.g., concept learning). Both types can be characterized in terms of the knowledge transmutations involved in transforming the given knowledge (input plus background knowledge) into the desired knowledge. Several transmutations are discussed in a novel way, such as deductive and inductive generalization, abductive derivation, deductive and inductive specialization, abstraction, and concretion. The presented concepts are used to develop a general classification of learning processes.

Key words: learning theory, machine learning, inferential theory of learning, deduction, induction, abduction, generalization, abstraction, knowledge transmutation, classification of learning.
INTRODUCTION

In the last several years we have been witnessing a great proliferation of methods and approaches to machine learning. Research in this field now spans such subareas or topics as empirical concept learning from examples, explanation-based learning, neural net learning, computational learning theory, genetic algorithm based learning, cognitive models of learning, discovery systems, reinforcement learning, constructive induction, conceptual clustering, multistrategy learning, and machine learning applications. In view of such a diversification of machine learning research, there is a strong need for a unifying conceptual framework for characterizing existing learning methods and approaches. Initial results toward such a framework have been presented in the form of the Inferential Theory of Learning (ITL) by Michalski (1990a, 1993). The purpose of this chapter is to discuss and elaborate selected concepts of ITL, and to use them to describe a general classification of learning processes.

The ITL postulates that learning processes can be characterized in terms of operators (called "knowledge transmutations"; see the next section) that transform the input information and the learner's initial knowledge into the knowledge specified by the goal of learning. The main goals of the theory are to analyze and explain diverse learning methods and paradigms in terms of knowledge transmutations, regardless of the implementation-dependent operations performed by different learning systems. The theory aims at understanding the competence of learning processes, i.e., their logical capabilities. Specifically, it tries to explain what type of knowledge a system is able to derive from what type of input and prior knowledge, what types of inference and knowledge transformations underlie different learning strategies and paradigms, what the properties of and interrelationships among knowledge transmutations are, how different knowledge transmutations are implemented in different learning systems, and so on. The latter issue is particularly important for developing systems that combine diverse learning strategies and methods,
because different knowledge representations and computational mechanisms facilitate different knowledge transmutations. Knowledge transmutations can be applied in a great variety of ways to a given input and background knowledge. Therefore, the theory emphasizes the importance of learning goals, which are necessary for guiding learning processes. Learning goals reflect the knowledge needs of the learner, and often represent a composite structure of many subgoals, some of which are consistent with one another and some of which may be contradictory. As to the research methodology employed, the theory attempts to explain learning processes at a level of abstraction that makes it relevant both to cognitive models of learning and to the methods studied in machine learning. The above research issues make the Inferential Theory of Learning different from, and complementary to, Computational Learning Theory (e.g., Warmuth and Valiant, 1991), which is primarily concerned with the computational complexity or convergence of learning algorithms. The presented work draws upon ideas described in (Michalski, 1983, 1990a, 1993; Michalski and Kodratoff, 1990b).
LEARNING THROUGH INFERENCE

Any act of learning aims at improving the learner's knowledge or skill by interacting with some information source, such as an environment or a teacher. The underlying tenet of the Inferential Theory of Learning is that any learning can be usefully viewed as a process of creating or modifying knowledge structures to satisfy a learning goal. Such a process may involve performing any type of inference: deductive, inductive, or analogical. Figure 1 illustrates the information flow in a general learning process according to the theory. In each learning cycle, the learner generates new knowledge and/or a new form of knowledge by performing inferences from the input information and the learner's prior knowledge. When the obtained knowledge satisfies the learning goal, it is assimilated into the learner's knowledge base. The input information to a learning process can be observations, stated facts, concept instances,
previously formed generalizations or abstractions, conceptual hierarchies, information about the validity of various pieces of knowledge, etc.
[Figure 1 (diagram): external input and internal input (background knowledge) feed a multitype inference module (deduction, induction, analogy), which produces the output knowledge.]
Figure 1. A schematic characterization of learning processes.

Any learning process needs to be guided by some underlying goal; otherwise the proliferation of choices of what to learn would quickly overwhelm any realistic system. A learning goal can be general (domain-independent) or domain-dependent. A general learning goal defines the type of knowledge that is desired by a learner. There can be many such goals, for example: to determine a concept description from examples, to classify observed facts, to concisely describe a sequence of events, to discover a quantitative law characterizing physical objects, to reformulate given knowledge into a more efficient representation, to learn a control algorithm to accomplish a task, to confirm a given piece of knowledge,
etc. A domain-specific goal defines specific knowledge needed by the learner. At the beginning of a learning process, the learner determines what prior knowledge is relevant to the input and the learning goal. Such a goal-relevant part of the learner's prior knowledge is called background knowledge (BK). The BK can be in different forms, such as declarative (e.g., a collection of statements representing conceptual knowledge), procedural (e.g., a sequence of instructions for performing some skill), or a combination of both. Input and output knowledge in a learning process can also be in such forms. One way of classifying learning processes is based on the form of input and output knowledge involved in them (Michalski, 1990a). The Inferential Theory of Learning (ITL) states that learning involves performing inference ("inferencing") from the information supplied and the learner's background knowledge, and memorizing those results that are found to be useful. Thus, one can write an "equation":

Learning = Inferencing + Memorizing    (1)
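Equation (1) admits a direct procedural reading. The following sketch is a hypothetical Python rendering, not an algorithm given in this chapter: it treats learning as a cycle that performs inference over the input and background knowledge, and memorizes those results that satisfy the learning goal. All function and parameter names are illustrative.

```python
# A minimal, hypothetical sketch of equation (1): each cycle performs
# inference over the input and the background knowledge (BK), then
# memorizes only the results that satisfy the learning goal.

def learn(inputs, background_knowledge, infer, goal_satisfied):
    """Run the inferencing + memorizing cycle.

    infer          -- any inference procedure (deductive, inductive, analogical)
    goal_satisfied -- a predicate encoding the learning goal
    """
    memory = set(background_knowledge)
    for item in inputs:
        # Inferencing: derive candidate knowledge from input + current memory.
        for candidate in infer(item, memory):
            # Memorizing: assimilate only goal-satisfying results.
            if goal_satisfied(candidate):
                memory.add(candidate)
    return memory
```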
It should be noted that the term "inferencing" is used in (1) in a very general sense, meaning any type of knowledge transformation or manipulation, including syntactic transformations and random searching for a specified knowledge entity. Thus, to be able to learn, a system has to be able to perform inference, and it has to have a memory that supplies the background knowledge and stores the results of inferencing. As mentioned earlier, ITL postulates that any learning process can be described in terms of generic units of knowledge change, called knowledge transmutations (or transforms). The transmutations derive one type of knowledge from another, hypothesize new knowledge, confirm or disconfirm knowledge, organize knowledge into structures, determine properties of given knowledge, insert or delete knowledge, transmit knowledge from one physical medium to another, etc. Transmutations may be performed by a learner explicitly, by well-defined rules of inference (as in many symbolic learning systems), or implicitly, by specific
mechanisms involved in information processing (as in neural-net learning or genetic algorithm based learning). The capabilities of a learning system depend on the types and the complexity of the transmutations the system is capable of performing. Transmutations are divided into two classes: knowledge generation transmutations and knowledge manipulation transmutations. Knowledge generation transmutations change the content of knowledge by performing various kinds of inference. They include, for example, generalization, specialization, abstraction, concretion, similization, dissimilization, and any kind of logical or mathematical derivation (Michalski, 1993). Knowledge manipulation transmutations perform operations on knowledge that do not change its content, but rather its organization, physical distribution, etc. For example, inserting a learned component into a given structure, replicating a given knowledge segment in another knowledge base, or sorting given rules into a certain order are knowledge manipulation transmutations. This chapter discusses two important classes of knowledge generation transmutations: {generalization, specialization} and {abstraction, concretion}. These classes are particularly relevant to the classification of learning processes discussed in the last section. Because the Inferential Theory views learning as an inference process, it may appear that it applies only to symbolic methods of learning and not to "subsymbolic" methods, such as neural net learning, reinforcement learning, or genetic algorithm-based learning. It is argued that it applies to them as well, because, from the viewpoint of input-output transformations, subsymbolic methods can also be characterized as performing knowledge transmutations and inference. Clearly, they can generalize inputs, determine similarity between inputs, abstract from details, etc. From the ITL viewpoint, symbolic and subsymbolic systems differ in the type of computational and representational mechanisms they use for performing transmutations. Whether a learning system works in parallel or sequentially, weighs inputs or performs logic-based transformations
affects the system's speed, but not its ultimate competence (within limits), because a parallel algorithm can be transformed into a logically equivalent sequential one, and a discrete neural net unit function can be transformed into an equivalent logic-type transformation. These systems differ, however, in the efficiency and speed of performing different transmutations, which makes them more or less suitable for different learning tasks. In many symbolic learning systems, knowledge transmutations are performed in an explicit way, in conceptually comprehensible steps. In some inductive learning systems, for example INDUCE, generalization transmutations are performed according to well-defined rules of inductive generalization (Michalski, 1983). In subsymbolic systems (e.g., neural networks), transmutations are performed implicitly, in steps dictated by the underlying computational mechanism (see, e.g., Rumelhart et al., 1986). A neural network may generalize an input example by performing a sequence of small modifications of the weights of inter-node connections. Although these weight modifications do not directly correspond to any explicit inference rules, the end result can nevertheless be characterized as a certain knowledge transmutation. The latter point is illustrated by Wnek et al. (1990), who described a simple method for visualizing the generalization operations performed by various symbolic and subsymbolic learning systems. The method, called DIAV, can visualize the target and learned concepts, as well as the results of various intermediate steps, no matter what computational mechanism is used to perform them. To illustrate this point, Figure 2 presents a diagrammatic visualization of concepts learned by four learning systems: a classifier system using a genetic algorithm (CFS), a rule learning program (AQ15), a neural net (BpNet), and a decision tree learning system (C4.5). Each diagram presents an "image" of the concept learned by the given system from the same set of examples: 6% of the positive examples (5 out of the total 84 positive examples constituting the concept) and 3% of the negative examples (11 out of a possible 348).
[Figure 2 (diagrams): four panels show the concept learned by the classifier system (CFS), the decision rule learner (AQ15), the neural net (BpNet), and the decision tree learner (C4.5); the legend distinguishes the target concept, the learned concept, and the positive and negative training examples.]

The cell A corresponds to the description: HEAD-SHAPE = R & BODY-SHAPE = R & SMILING = Yes & HOLDING = F & JACKET-COLOR = B & TIE = N.

Figure 2. A visualization of the target concept and concepts learned by four learning methods.
In the diagrams, the shaded area marked "Target concept" represents all possible instances of the concept to be learned. The shaded area marked "Learned concept" represents the generalization of the training examples hypothesized by a given learning system. The set-theoretic difference between the "Target concept" and the "Learned concept" represents errors in learning (an "error image"). Each instance belonging to the "Learned concept" but not to the "Target concept," or to the "Target concept" but not to the "Learned concept," will be incorrectly classified by the system. To understand the diagrams, note that each cell of a diagram represents a single combination of attribute values, i.e., an instance of the description space. A whole diagram represents the complete description space (432 instances). The attributes spanning the description space characterize a collection of imaginary robot-like figures. Figure 3 lists the attributes and their value sets.

ATTRIBUTE       LEGAL VALUES
Head Shape      R (round), S (square), O (octagon)
Body Shape      R (round), S (square), O (octagon)
Smiling         Y (yes), N (no)
Holding         S (sword), B (balloon), F (flag)
Jacket Color    R (red), Y (yellow), G (green), B (blue)
Tie             Y (yes), N (no)

Figure 3. Attributes and their value sets.
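To make the size of the description space concrete, the following hypothetical Python sketch (not part of the DIAV system; names are illustrative) enumerates the value sets of Figure 3, confirms that they span 432 instances, and prints one cell description in the manner explained next.

```python
from itertools import product

# Attribute value sets from Figure 3 (robot-like figures).
ATTRIBUTES = {
    "head_shape":   ["round", "square", "octagon"],
    "body_shape":   ["round", "square", "octagon"],
    "smiling":      ["yes", "no"],
    "holding":      ["sword", "balloon", "flag"],
    "jacket_color": ["red", "yellow", "green", "blue"],
    "tie":          ["yes", "no"],
}

# Each cell of a diagram is one combination of attribute values.
cells = [dict(zip(ATTRIBUTES, combo)) for combo in product(*ATTRIBUTES.values())]
assert len(cells) == 3 * 3 * 2 * 3 * 4 * 2 == 432  # the full description space

# "Reading out" the description of one cell, in the spirit of cell A in Figure 2.
cell_a = {"head_shape": "round", "body_shape": "round", "smiling": "yes",
          "holding": "flag", "jacket_color": "blue", "tie": "no"}
print(" & ".join(f"{attr.upper()} = {val}" for attr, val in cell_a.items()))
```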
To determine the logical description that corresponds to a given cell (or set of cells), one projects the cell (or set of cells) onto the ranges of attribute values on the scales alongside the diagram, and "reads out" the description. To illustrate this, the bottom part of Figure 2 presents the description of the cell marked A in the diagram. By analyzing the images of the concepts learned by the different paradigms, one can determine the degree to which they generalized the original examples, "see" the differences between the generalizations, determine how new or hypothetical examples will be classified according to the learned concepts, and so on. For more details on the
properties of the diagrams, on the method of "reading out" descriptions from the diagrams, and on the implemented diagrammatic visualization system, DIAV, see (Michalski, 1978; Wnek et al., 1990; Wnek and Michalski, 1992). The diagrams allow one to view concepts as images, and thus to abstract from the specific knowledge representation used by a learning method. This demonstrates that, from the epistemological viewpoint taken by the ITL, it is irrelevant whether knowledge is implemented in the form of a set of rules, a decision tree, a neural net, or some other way. For example, in a neural net, the prior knowledge is represented implicitly, by the structure of the net and by the initial settings of the weights of the connections. The learned knowledge is manifested in the new weights of the connections among the net's units (Touretzky and Hinton, 1988). The prior and learned knowledge incorporated in the net could be re-represented, at least theoretically, in the form of images, or as explicit symbolic rules or numerical expressions, and then dealt with as any other knowledge. For example, using the diagrams in Figure 2, one can easily "read out" from them a set of rules equivalent to the concepts learned by the neural network and the genetic algorithm. The central aspect of any knowledge transmutation is the type of the underlying inference, which characterizes a transmutation along the truth-falsity dimension. The type of inference thus determines the truth status of the derived knowledge. Therefore, before we discuss transmutations and their role in learning, we will first analyze the basic types of inference.

BASIC TYPES OF INFERENCE

As stated earlier, ITL postulates that learning involves conducting inference on the input and the current BK, and storing the results whenever they are evaluated as useful. Such a process may involve any type of inference, because any possible type of inference may produce knowledge worth remembering. Therefore, from such a viewpoint, a complete learning theory has to include a complete theory of inference.
Such a theory of inference should account for all possible types of inference. Figure 4 presents an attempt to schematically illustrate all basic types of inference. The first major classification divides inferences into two fundamental types: deductive and inductive. The difference between them can be explained by considering an entailment:

P ∪ BK ⊨ C    (2)
where P denotes a set of statements, called premise, BK represents the reasoner's background knowledge, and C denotes a set of statements, called consequent. Deductive inference is deriving consequent C, given premise P and BK. Inductive inference is hypothesizing premise P, given consequent C and BK. Thus, deductive inference can be viewed as "tracing forward" the relationship (2), and inductive inference as "tracing backward" such a relationship. Because of its importance for characterizing inference processes, relationship (2) is called the fundamental equation for inference.
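The two directions of reading relationship (2) can be imitated on a toy rule base. The sketch below is purely illustrative (the rules and function names are invented): deduce traces (2) forward from P and BK, while induce traces it backward, collecting candidate premises that, together with BK, entail the observed consequent.

```python
# Toy propositional illustration of P ∪ BK ⊨ C, for intuition only.
# BK is a set of implications (antecedents -> consequent); facts are atoms.

BK = {("fire",): "smoke", ("smoke",): "alarm"}

def deduce(premises):
    """Trace (2) forward: compute the deductive closure of premises ∪ BK."""
    derived = set(premises)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in BK.items():
            if set(antecedents) <= derived and consequent not in derived:
                derived.add(consequent)
                changed = True
    return derived

def induce(consequent, candidate_premises):
    """Trace (2) backward: hypothesize premises P with P ∪ BK ⊨ consequent."""
    return [p for p in candidate_premises if consequent in deduce({p})]

print(deduce({"fire"}))                             # {'fire', 'smoke', 'alarm'}
print(induce("alarm", ["fire", "smoke", "rain"]))   # ['fire', 'smoke'] -- hypotheses
```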
[Figure 4 (diagram): the basic inference types arranged along two dimensions: deductive (truth-preserving) versus inductive (falsity-preserving), and conclusive versus contingent; analogy occupies the central area.]
Figure 4. A classification of basic types of inference. Inductive inference underlies two major knowledge generation transmutations: inductive generalization and abductive derivation. They differ in the type of BK they employ, and the type of premise P they
hypothesize. Inductive generalization is based on tracing backward a tautological implication, specifically the rule of universal specialization, i.e., ∀x: P(x) ⇒ P(a), and produces a premise P that is a generalization of C, i.e., a description of a larger set of entities than the set described by C (Michalski, 1990a, 1993). In contrast, abductive derivation is based on tracing backward an implication that represents domain knowledge, and produces a description that characterizes reasons for C. Other, less known, types of inductive inference are inductive specialization and inductive concretion (see the section on inductive transmutations). In a more general view of deduction and induction that also captures their approximate or commonsense forms, the entailment relationship "⊨" may also include "plausible" entailment, i.e., probabilistic or partial entailment. The difference between "conclusive" (valid) and "plausible" entailment leads to another major classification of inference types. Specifically, inferences can be divided into those based on conclusive, domain-independent dependencies and those based on contingent, domain-dependent dependencies. A conclusive dependency between statements or sets of statements represents a necessarily true logical relationship, i.e., a relationship that must be true in all possible worlds. Valid rules of inference or universally accepted physical laws represent conclusive dependencies. To illustrate a conclusive dependency, consider the statement "All elements of the set X have the property q." If this statement is true, then the statement "x, an element of X, has the property q" must also be true. This relationship between the statements holds independently of the domain of discourse, i.e., of the nature of the elements in the set X, and thus is conclusive. If reasoning involves only statements that are assumed to be true, such as observations, "true" implications, etc., and conclusive dependencies (valid rules of inference), then deriving C, given P, is conclusive (or crisp) deduction, and hypothesizing P, given C, is conclusive (or crisp) induction. For example, suppose that BK is "All elements of the set X have the property q," and the input (premise P) is "x is an element of X."
Deriving the statement "x has the property q" is conclusive deduction. If BK is "x is an element of X" and the input (the observed consequent C) is "x has the property q," then hypothesizing the premise P, "All elements of X have the property q," is conclusive induction. Contingent dependencies are domain-dependent relationships that represent world knowledge that is not totally certain, but only probable. The contingency of these relationships is usually due to the fact that they represent incomplete or imprecise information about the totality of factors in the world that constitute a dependency. These relationships hold with different "degrees of strength." To express both conclusive and contingent dependencies within one formalism, the concept of mutual dependency is introduced. Suppose S1 and S2 are sentences in PLC (Predicate Logic Calculus) that are either statements (closed PLC sentences, with no free variables) or term expressions (open PLC sentences, in which some of the arguments are free variables). If there are free variables, such sentences can be interpreted as representing functions; otherwise they are statements with a truth-status. To state that there is a mutual dependency (for short, an m-dependency) between sentences S1 and S2, we write

S1 ↔ S2 : α, β    (3)
where α and β, called merit parameters, represent an overall forward strength and backward strength of the dependency, respectively. If S1 and S2 are statements, then an m-dependency becomes an m-implication. Such an implication reduces to a standard logical implication if α is 1 and β is undetermined, or α is undetermined and β is 1; otherwise it is a bidirectional plausible implication. In such an implication, if S1 (S2) is true, then α (β) represents a measure of certainty that S2 (S1) is true, assuming that no other information relevant to S2 (S1) is known. If S1 and S2 are term expressions, then α and β represent the average certainty with which the value of S1 determines the value of S2, and conversely. An obvious question arises as to the method for representing and computing merit parameters. We do not assume that they need to have a single representation. They could be numerical values representing a degree of belief, an estimate of the probability, ranges of probability, or a qualitative characterization of the strength of conclusions from using the implication in either direction. Here, we assume that they represent numerical degrees of dependency based on the contingency table (e.g., Goodman & Kruskal, 1979; Piatetsky-Shapiro, 1992), or estimated by an expert.
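Under the contingency-table reading just mentioned, the forward strength α and the backward strength β can be estimated as conditional frequencies. The sketch below is one plausible rendering, with invented counts; the chapter does not prescribe this particular formula.

```python
# Hypothetical estimate of merit parameters from co-occurrence counts:
# alpha ~ P(S2 | S1) (forward strength), beta ~ P(S1 | S2) (backward strength).

def merit_parameters(n_s1_and_s2, n_s1, n_s2):
    """Estimate (alpha, beta) for the m-dependency S1 <-> S2 : alpha, beta."""
    alpha = n_s1_and_s2 / n_s1 if n_s1 else 0.0
    beta = n_s1_and_s2 / n_s2 if n_s2 else 0.0
    return alpha, beta

# Fire/smoke illustration: out of 100 observed situations, 20 had fire,
# 25 had smoke, and 19 had both (all counts invented for illustration).
alpha, beta = merit_parameters(n_s1_and_s2=19, n_s1=20, n_s2=25)
print(alpha, beta)  # 0.95 0.76 -- strong forward, weaker backward dependency
```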
Another important problem is how to combine or propagate merit parameters when reasoning through a network of m-dependencies. Pearl (1988) discusses a number of ideas relevant to this problem. Since the certainty of a statement cannot be determined solely on the basis of the certainties of its constituents, regardless of its meaning, the ultimate solution of this open problem will require methods that take into consideration both the merit parameters and the meaning of the sentences. A special case of m-dependency is determination, introduced by Russell (1989) and used for characterizing a class of analogical inferences. Determination is an m-dependency between term expressions in which α is 1 and β is unspecified, that is, a unidirectional functional dependency. If either of the parameters α or β takes the value 1, the dependency is called conclusive; otherwise it is called contingent. The idea of an m-dependency stems from research on human plausible reasoning (Collins and Michalski, 1989). Conclusions derived from inferences involving contingent dependencies (applied in either direction) and/or uncertain facts are thus uncertain. They are characterized by "degrees of belief" (probabilities, degrees of truth, likelihoods, etc.). For example, "If there is fire, there is smoke" is a bi-directional contingent dependency, because there could be a situation or a world in which it is false. It holds in both directions, but not conclusively in either direction. If one sees fire, then one may derive a plausible (deductive) conclusion that there is smoke. This conclusion, however, is not certain. Using reverse reasoning ("tracing backward" the above dependency), observing smoke, one may hypothesize that there is fire. This is also an uncertain inference, called contingent abduction. It may thus appear that there is no principal difference between contingent deduction and contingent abduction. These two types of inference are different if one assumes that there is a causal dependency between fire and smoke, or, generally, between P and C in the context of BK (i.e., P can be viewed as a cause, and C as its consequence). Contingent deduction derives a plausible consequent, C, of the causes represented by P. Abduction derives plausible causes, P, of the consequent C. A problem arises when there is no causal dependency between P and C in the context of BK. In such a situation, the distinction between plausible deduction and abduction can be based on the relative strength of the dependency between P and C in the two directions (Michalski, 1992). Reasoning in the direction of the stronger dependency is plausible deduction, and reasoning in the weaker direction is abduction. If a dependency is completely symmetrical, e.g., P ↔ C, then the difference between deduction and abduction ceases to exist. In sum, both contingent deduction and contingent induction are based on contingent, domain-dependent dependencies. Contingent deduction produces likely consequences of given causes, and contingent abduction produces likely causes of given consequences. Contingent deduction is truth-preserving, and contingent induction (or contingent abduction) is falsity-preserving, only to the extent to which the contingent dependencies involved in the reasoning are true. In contrast, conclusive deductive inference is strictly truth-preserving, and conclusive induction is strictly falsity-preserving (if C is not true, then the hypothesis P cannot be true either). A conclusive deduction thus produces a provably correct (valid) consequent from a given premise. A conclusive induction produces a hypothesis that logically entails the given consequent (though the hypothesis itself may be false). The intersection of deduction and induction, i.e., an inference that is both truth-preserving and falsity-preserving, represents an equivalence-based inference (or reformulation). Analogy can be viewed as an extension of such equivalence-based inference, namely, as a similarity-based inference. Every analogical inference can be characterized as a
combination of deduction and induction. Induction is involved in hypothesizing an analogical match, i.e., the properties and/or relations that are assumed to be similar between the analogs, whereas deduction uses the analogical match to derive unknown properties of the target analog. Therefore, in Figure 4, analogy occupies the central area. The above inference types underlie a variety of knowledge transmutations. We now turn to the discussion of various knowledge transmutations in learning processes.

TRANSMUTATIONS AS LEARNING OPERATORS

The Inferential Theory of Learning views any learning process as a search through a knowledge space, defined as the space of admissible knowledge representations. Such a space represents all possible inputs, all of the learner's background knowledge, and all knowledge that the learner can potentially generate. In inductive learning, the knowledge space is usually called a description space. The theory assumes that the search is conducted through the application of knowledge transmutations acting as operators. Such operators take some component of the current knowledge and some input, and generate a new knowledge component. A learning process is defined as follows:
Given
• Input knowledge (I)
• Goal (G)
• Background knowledge (BK)
• Transmutations (T)

Determine
• Output knowledge O, satisfying goal G, by applying transmutations T to input I and background knowledge BK.
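This Given/Determine specification is, in effect, a search problem. The hypothetical sketch below makes the operator view concrete: starting from I ∪ BK, it applies transmutation operators until some knowledge state satisfies G. It is an illustration of the definition, not an algorithm proposed in this chapter; all names are invented.

```python
from collections import deque

def learning_as_search(input_knowledge, bk, transmutations, goal_satisfied,
                       max_steps=10_000):
    """Search the knowledge space by applying transmutation operators.

    transmutations -- functions mapping a knowledge state (a frozenset)
                      to an iterable of successor states
    goal_satisfied -- the predicate G over knowledge states
    """
    start = frozenset(input_knowledge) | frozenset(bk)
    frontier, seen = deque([start]), {start}
    for _ in range(max_steps):
        if not frontier:
            break
        state = frontier.popleft()
        if goal_satisfied(state):
            return state  # the output knowledge O
        for transmute in transmutations:
            for successor in transmute(state):
                if successor not in seen:
                    seen.add(successor)
                    frontier.append(successor)
    return None  # goal not reached within the step budget
```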
The input knowledge, I, is the information (facts or general knowledge) that the learner receives from the environment. The learner may receive the input all at once or incrementally. The goal, G, specifies criteria that need to be satisfied by the output, O, in order for learning to be accomplished. Background knowledge is the part of the learner's prior knowledge that is "relevant" to a given learning process. Transmutations are generic types of knowledge transformation for which one can form a simple mental model. They can be implemented using many different computational paradigms. They are classified into two general categories: knowledge generation transmutations, which change the content or meaning of the knowledge, and knowledge manipulation transmutations, which change its physical location or organization but do not change its content. Knowledge generation transmutations represent patterns of inference, and can be divided into synthetic and analytic. Synthetic transmutations are able to hypothesize intrinsically new knowledge, and thus are fundamental for knowledge creation (by "intrinsically new knowledge" we mean knowledge that cannot be conclusively deduced from the knowledge already possessed). Synthetic transmutations include inductive transmutations (those that employ some form of inductive inference) and analogical transmutations (those that employ some form of analogy). Analytic (or deductive) transmutations are those employing some form of deduction. This chapter concentrates on a few knowledge generation transmutations that are particularly important for the classification of learning processes described in the last section. A discussion of several other knowledge transmutations can be found in (Michalski, 1993). In order to describe these transmutations, we need to introduce the concepts of a well-formed description, the reference set of a description, and a descriptor. A set of statements is a well-formed description if and only if one can identify a specific set of entities that this set of statements describes. This set of entities (often a singleton) is called the reference set of the description. Well-formed descriptions have a truth-status; that is, they can be characterized as true or false, or, generally, by some intermediate truth-value.
For the purpose of this presentation, we will make the simplifying assumption that descriptions can have one of only three truth-values: "true," "false," or "unknown." The "unknown" value is attached to hypotheses generated by contingent deduction, analogy, or inductive inference. The "unknown" value can be turned into true or false by subjecting the hypothesis to a validation procedure. A descriptor is an attribute, a function, or a relation whose value or status is used to characterize the reference set. Consider, for example, the statement: "Elizabeth is very strong, has a Ph.D. in Astrophysics from the University of Warsaw, and likes soccer." This statement is a well-formed description because one can identify a reference set, {Elizabeth}, that this statement describes. The description uses three descriptors: a one-place attribute, "degree-of-strength(person)"; a binary relation, "likes(person, activity)"; and a four-place relation, "degree-received(person, degree, topic, university)." The truth-status of this description is true if Elizabeth has the properties stated, false if she does not, and unknown if it is not known to be true but there is no evidence that it is false. Consider now the sentence: "Robert is a writer, and Barbara is a lawyer." This sentence is not a well-formed description. It could be split, however, into two sentences, each of which would be a well-formed description (one describing Robert, and the other describing Barbara). Finally, consider the sentence "George, Jane and Susan like mango, political discussions, and social work." This is a well-formed description of the reference set {George, Jane, Susan}.
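The Elizabeth example can be captured in a small data structure. The sketch below is a hypothetical encoding, with invented field names, of a well-formed description: a reference set, descriptor assertions about it, and a three-valued truth-status.

```python
from dataclasses import dataclass, field

# Hypothetical encoding of a well-formed description: a reference set,
# descriptor assertions about it, and a truth-status in {true, false, unknown}.

@dataclass
class Description:
    reference_set: frozenset                          # the entities described
    descriptors: dict = field(default_factory=dict)   # descriptor -> assertion
    truth_status: str = "unknown"                     # "true" | "false" | "unknown"

elizabeth = Description(
    reference_set=frozenset({"Elizabeth"}),
    descriptors={
        "degree-of-strength(person)": "very strong",
        "likes(person, activity)": ("Elizabeth", "soccer"),
        "degree-received(person, degree, topic, university)":
            ("Elizabeth", "Ph.D.", "Astrophysics", "University of Warsaw"),
    },
    truth_status="true",
)
```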
Knowledge generation transmutations apply only to well-formed descriptions. Knowledge manipulation transmutations apply to descriptions, as well as to entities that are not descriptions (e.g., terms or sets of terms). Below is a brief description of four major classes of knowledge generation transmutations. The first three classes each consist of a pair of opposite transmutations; the fourth contains a range of transmutations.

1. Generalization vs. specialization

A generalization transmutation extends the reference set of the input description. Typically, a generalization transmutation is inductive, because the extended set is inductively hypothesized. A generalization transmutation can also be deductive, when the more general assertion is a logical consequence of the more specific one, or is deduced from the background knowledge and/or the input. The opposite transmutation is specialization, which narrows the reference set. A specialization transmutation usually employs deductive inference, but, as shown in the next section, there are also inductive specialization transmutations.

2. Abstraction vs. concretion

Abstraction reduces the amount of detail in a description of a reference set, without changing the reference set. This can be done in a variety of ways. A simple way is to replace one or more descriptor values by their parents in the generalization hierarchy of values. For example, suppose we are given the statement "Susan found an apple." Replacing "apple" by "fruit" would be an abstraction transmutation (assuming that the background knowledge contains a generalization hierarchy in which "fruit" is a parent node of "apple"). The underlying inference here is deduction. The opposite transmutation is concretion, which generates additional details about a reference set.

3. Similization vs. dissimilization

Similization derives new knowledge about some reference set on the basis of a detected partial similarity between this set and some other reference set of which the reasoner has more knowledge. Similization thus transfers knowledge from one reference set to another reference set that is similar to the original one in some sense. The opposite transmutation is dissimilization, which derives new knowledge from the lack of similarity between the compared reference sets. Similization and dissimilization are based on analogical inference. They can be viewed as a combination of deductive and inductive
inference (Michalski, 1992). They represent patterns of inference described in the theory of plausible reasoning by Collins and Michalski (1989). For example, knowing that England grows roses, and that England and Holland have similar climates, a similization transmutation might hypothesize that Holland may also grow roses. The underlying background knowledge here is that there exists a dependency between the climate of a place and the type of plants growing in that location. A dissimilization transmutation would be to hypothesize that bougainvillea, which is widespread on the Caribbean islands, probably does not grow in Scotland, because Scotland and the Caribbean islands have very different climates.

4. Reformulation vs. randomization

A reformulation transmutation transforms a description into another description according to equivalence-based rules of transformation (i.e., truth- and falsity-preserving rules). For example, transforming the statement "This set contains the numbers 1, 2, 3, 4 and 5" into "This set contains the integers from 1 to 5" is a reformulation. The opposite transmutation is randomization, which transforms a description into another description by making random changes. For example, mutation in a genetic algorithm represents a randomization transmutation. Reformulation and randomization are two extremes of a spectrum of intermediate transmutations, called derivations. Derivations employ different degrees or types of logical dependence between descriptions to derive one piece of knowledge from another. An intermediate transmutation between the two extremes is crossover, which is also used in genetic algorithms; such a transmutation derives new knowledge by exchanging parts of two related descriptions, as the sketch below illustrates.
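Mutation and crossover make the randomization end of this spectrum concrete. The following sketch operates on bit-string "descriptions"; it is a generic illustration, not the classifier system (CFS) discussed earlier.

```python
import random

def mutate(description, rate=0.05):
    """Randomization: flip each bit of a bit-string description with given rate."""
    return [1 - bit if random.random() < rate else bit for bit in description]

def crossover(parent_a, parent_b):
    """Intermediate derivation: exchange parts of two related descriptions."""
    point = random.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]

a, b = [0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]
child_a, child_b = crossover(a, b)
print(mutate(child_a), mutate(child_b))
```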
INDUCTIVE TRANSMUTATIONS

Inductive transmutations, i.e., knowledge transformations employing inductive inference, are of fundamental importance to learning. This is due to their ability to generate intrinsically new knowledge. As discussed earlier, induction is an inference type opposite to deduction. The results of induction can be in the form of generalizations (theories, rules, laws, etc.), causal explanations, specializations, concretions, and others. The usual aim of induction is not to produce just any premise ("explanation") that entails a given consequent ("observable"), but the one which is the most "justifiable." Finding such a "most justifiable" hypothesis is important, because induction is an under-constrained inference, and just "reversing" deduction would normally lead to an unlimited number of alternative hypotheses. Taking into consideration the importance of determining the most justifiable hypothesis, the previously given characterization of inductive inference based on (2) can be further elaborated. Namely, an admissible induction is an inference which, given a consequent C and BK, produces a hypothetical premise P, consistent with BK, such that

P ∪ BK ⊨ C    (4)
and which satisfies the hypothesis selection criterion. In different contexts, the selection criterion (which may be a combination of several elementary criteria) is called a preference criterion (Popper, 1972; Michalski, 1983), a bias (e.g., Utgoff, 1986), or a comparator (Poole, 1989). Such criteria are necessary for any act of induction because, for any given consequent and a non-trivial hypothesis description language, there can be a very large number of distinct hypotheses that are expressible in that language and satisfy relation (4). The selection criteria specify how to choose among them. Ideally, these criteria should reflect the properties of a hypothesis that are desirable from the viewpoint of the reasoner's (or learner's) goals. Often, these criteria (or bias) are partially hidden in the description language used. For example, the description language may be limited to conjunctive statements involving a given set of attributes, or determined by the mechanism performing induction (e.g., a method that generates decision trees is automatically limited to using only operations of conjunction and disjunction in the hypothesis representation). Generally,
these criteria reflect three basic desirable characteristics of a hypothesis: accuracy, utility, and generality. The accuracy criterion expresses the desire to find a "true" hypothesis. Because the problem is logically under-constrained, the "truth" of a hypothesis can never be guaranteed. One can only satisfy (4), which is equivalent to making the hypothesis complete and consistent with regard to the input facts (Michalski, 1983). If the input is noisy, however, an inconsistent and/or incomplete hypothesis may give better overall predictive performance than a complete and consistent one (e.g., Quinlan, 1989; Bergadano et al., 1992). The utility criterion requires a hypothesis to be computationally and/or cognitively simple, and to be applicable to an expected range of problems. The generality criterion expresses the desire to have a hypothesis that is useful for predicting new, unknown cases. The more general the hypothesis, the wider the scope of new cases it will be able to predict. One way such a criterion might be made operational is sketched below.
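The following sketch shows one hypothetical way to express a preference criterion as a weighted score over candidate hypotheses; the weights and the individual measures are invented for illustration and are not prescribed by the chapter.

```python
def preference_score(hypothesis, positives, negatives,
                     w_acc=1.0, w_util=0.3, w_gen=0.2):
    """Score a candidate hypothesis (a predicate over examples)."""
    covered_pos = sum(hypothesis(e) for e in positives)
    covered_neg = sum(hypothesis(e) for e in negatives)
    # Accuracy: reward completeness on positives and consistency on negatives.
    accuracy = (covered_pos + (len(negatives) - covered_neg)) / (
        len(positives) + len(negatives))
    # Utility: prefer syntactically simpler hypotheses (invented measure).
    utility = 1.0 / (1 + getattr(hypothesis, "size", 1))
    # Generality: prefer hypotheses covering a wider scope of cases.
    generality = covered_pos / len(positives)
    return w_acc * accuracy + w_util * utility + w_gen * generality

def select(hypotheses, positives, negatives):
    """Admissible induction keeps the most 'justifiable' candidate."""
    return max(hypotheses, key=lambda h: preference_score(h, positives, negatives))
```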
From now on, when we talk about inductive transmutations, we mean transmutations that involve admissible inductive inference. While the view of induction described above is by no means universally accepted, it is consistent with many long-standing discussions of this subject going back to Aristotle (e.g., Adler and Gorman, 1987; see also the reference under Aristotle). Aristotle, and many subsequent thinkers, e.g., Bacon (1620), Whewell (1857) and Cohen (1970), viewed induction as a fundamental inference type that underlies all processes of creating new knowledge. They did not assume that knowledge is created only from low-level observations and without the use of prior knowledge. Based on the role and amount of background knowledge involved, induction can be divided into empirical induction and constructive induction. Empirical induction uses little background knowledge. Typically, an empirical hypothesis employs descriptors (attributes, terms, relations, descriptive concepts, etc.) that are selected from among those used in describing the input instances or examples; therefore such induction is sometimes called selective (Michalski, 1983).
In contrast, constructive induction uses background knowledge and/or experiments to generate additional, more problem-oriented descriptors, and employs them in the formulation of the hypothesis. Thus, it changes the description space in which hypotheses are generated. Constructive induction can be divided into constructive generalization, which produces knowledge-based hypothetical generalizations; abduction, which produces hypothetical domain-knowledge-based explanations; and theory formation, which produces general theories explaining a given set of facts. The latter is usually developed by employing inductive generalization together with abduction and deduction. There are a number of knowledge transmutations that employ induction, such as empirical inductive generalization, constructive inductive generalization, inductive specialization, inductive concretion, abductive derivation, and others (Michalski, 1993). Among them, empirical inductive generalization is the best-known form. Perhaps for this reason, it is sometimes mistakenly viewed as the only form of inductive inference. Constructive inductive generalization creates general statements that use terms other than those used for characterizing the individual observations, and is also quite common in human reasoning. Inductive specialization is a relatively less known form of inductive inference. In contrast to inductive generalization, it decreases the reference set described in the input. Concretion is related to inductive specialization. The difference is that it generates more specific information about a given reference set, rather than reducing the reference set. Concretion is a transmutation opposite to abstraction. Abductive explanation employs abductive inference to derive properties of a reference set that can serve as its explanation. Figure 5 gives examples of the above inductive transmutations.
A. Empirical generalization (BK limited: "pure" generalization)
Input: "A girl's face" and "Lvow cathedral" are beautiful paintings.
BK: "A girl's face" and "Lvow cathedral" are paintings by Dawski.
Hypothesis: All paintings by Dawski are beautiful.

B. Constructive inductive generalization (generalization + deduction)
Input: "A girl's face" and "Lvow cathedral" are beautiful paintings.
BK: "A girl's face" and "Lvow cathedral" are paintings by Dawski. Dawski is a known painter. Beautiful paintings by a known painter are expensive.
Hypothesis: All paintings by Dawski are expensive.

C. Inductive specialization
Input: There is high-tech industry in Northern Virginia.
BK: Fairfax is a town in Northern Virginia.
Hypothesis: There is high-tech industry in Fairfax.

D. Inductive concretion
Input: John is an expert in some formal science.
BK: John is Polish. Many Poles like logic. Logic is a formal science.
Hypothesis: John is an expert in logic.

E. Abductive derivation
Input: There is smoke in the house.
BK: Fire usually causes smoke.
Hypothesis: There is a fire in the house.

F. General constructive induction (generalization plus abductive derivation)
Input: Smoke is coming from John's apartment.
BK: Fire usually causes smoke. John's apt. is in the Hemingway building.
Hypothesis: The Hemingway building is on fire.

Figure 5. Examples of inductive transmutations.
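Two of the patterns in Figure 5 are simple enough to mechanize over flat facts. The hypothetical sketch below reproduces example E (abductive derivation) by tracing a causal rule backward, and example C (inductive specialization) by pushing a property from a region down to its parts; the fact base and function names are invented.

```python
# Hypothetical mechanization of two Figure 5 transmutations over flat facts.

CAUSAL_BK = {"fire": "smoke"}                    # cause -> usual effect
LOCATION_BK = {"Fairfax": "Northern Virginia"}   # part -> whole

def abductive_derivation(observation):
    """E: hypothesize causes whose usual effect matches the observation."""
    return [cause for cause, effect in CAUSAL_BK.items() if effect == observation]

def inductive_specialization(region, prop):
    """C: hypothesize that a property of a region also holds in its parts."""
    return [(part, prop) for part, whole in LOCATION_BK.items() if whole == region]

print(abductive_derivation("smoke"))             # ['fire']
print(inductive_specialization("Northern Virginia", "high-tech industry"))
```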
In Figure 5, examples A, C and D illustrate conclusive inductive transmutations (in which the generated hypothesis conclusively implies the consequent), and examples B, E and F illustrate contingent inductive transmutations (in which the hypothesis only plausibly implies the consequent). In example B, the input is only a plausible consequence of the hypothesis and BK, because the background knowledge states that "Beautiful paintings by a known painter are expensive"; this does not imply that all paintings that are expensive are necessarily beautiful. The difference between inductive specialization (example C) and concretion (example D) is that the former reduces the set being described (that is, the reference set), while the latter increases the information about the reference set. In example C, the reference set is reduced from Northern Virginia to Fairfax. In example D, the reference set is John; the concretion increases the amount of information about him.

HOW ABSTRACTION DIFFERS FROM GENERALIZATION

Generalization is sometimes confused with abstraction, which is often employed as part of the process of creating generalizations. These two transmutations are quite different, however, and both are fundamental operations on knowledge. This section provides additional explanation of abstraction, and illustrates the differences between it and generalization. As mentioned earlier, abstraction creates a less detailed description of a given reference set from a more detailed one, without changing the reference set. The last condition is important, because reducing information about the reference set by describing only a part of it would not be abstraction. For example, reducing a description of a table to a description of one of its legs would not be an abstraction operation. To illustrate an abstraction transmutation, consider a transformation of the statement "My workstation has a Motorola 25-MHz 68030 processor" into "My workstation is quite fast." To make such a transformation, the system needs the domain-dependent background knowledge that "a processor with a 25-MHz clock speed can be viewed as quite fast," and a rule "If a processor is fast, then the computer with that
processor can be viewed as fast." Note that the more abstract description is a logical consequence of the original description in the context of the given background knowledge, and carries less information. The abstraction operation often involves a change in the representation language, from one that uses more specific terms to one that uses more general terms, with the proviso that the statements in the second language are logically implied by the statements in the first language. A very simple form of abstraction is to replace a specific attribute value in the description of an entity (e.g., the length in centimeters) by a less specific value (e.g., the length stated in linguistic terms, such as short, medium and long). A more complex abstraction would involve a significant change of the description language, e.g., taking a description of a computer in terms of electronic circuits and connections, and changing it into a description in terms of the functions of the individual modules. In contrast to abstraction, which reduces information about a reference set but does not change it, generalization extends the reference set. To illustrate simply the difference between generalization and abstraction, consider a statement d(S,v), which says that attribute (descriptor) d takes value v for the set of entities S. Let us write such a statement in the form:

d(S) = v
(5)
Changing (5) to the statement d(S) = v', in which v' represents a more general concept, e.g., a parent node in a generalization hierarchy of values of the attribute d, is an abstraction operation. By changing v to v', less information is being conveyed about the reference set S. Changing (5) to a statement d(S') = v, in which S' is a superset of S, is a generalization operation. The generated statement conveys more information than the original one, because the property d is now assigned to a larger set. For example, transforming the statement "color(my-pencil) = light-blue" into "color(my-pencil) = blue" is an abstraction operation. Such an operation is deductive, if one knows that light-blue is a kind of blue.
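The two operations on statements of the form d(S) = v can be made concrete with a small Python sketch; the toy value hierarchy and the toy superset relation below are invented for illustration only.

```python
# Sketch: abstraction changes the value v to a more general value v';
# generalization changes the reference set S to a superset S'.

PARENT_VALUE = {"light-blue": "blue", "blue": "colored"}   # toy hierarchy
SUPERSET = {"my-pencil": "all-my-pencils"}                 # toy set relation

def abstract(statement):
    d, s, v = statement
    return (d, s, PARENT_VALUE[v])       # same set, less information

def generalize(statement):
    d, s, v = statement
    return (d, SUPERSET[s], v)           # same information, larger set

stmt = ("color", "my-pencil", "light-blue")
print(abstract(stmt))    # ('color', 'my-pencil', 'blue')            -- deductive
print(generalize(stmt))  # ('color', 'all-my-pencils', 'light-blue') -- inductive
```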
Transforming the original statement into "color(all-my-pencils) = light-blue" is a generalization operation. Assuming that one does not have prior knowledge that all the pencils one possesses are light-blue, this is an inductive operation. Finally, transforming the original statement into "color(all-my-pencils) = blue" is both a generalization and an abstraction. Thus, associating the same information with a larger set is a generalization operation; associating a smaller amount of information with the same set is an abstraction operation. In summary, generalization transforms descriptions along the set-superset dimension, and abstraction transforms descriptions along the level-of-detail dimension. Generalization often uses the same description space (or language); abstraction often involves a change in the representation space (or language). The opposite transmutation to generalization is specialization; the opposite transmutation to abstraction is concretion. Generalization is typically an inductive operation, and abstraction a deductive operation.

As a parallel concept to constructive induction, which was discussed before, one may introduce the concept of constructive deduction. Similarly to constructive induction, constructive deduction is a process of deductively transforming a source description into a target description, which uses new, more goal-relevant terms and concepts than the source description. As in constructive induction, the process uses background knowledge for that purpose. Depending on the available background knowledge, constructive deduction may be conclusive or contingent. Abstraction can be viewed as a form of constructive deduction that reduces the amount of information about a given reference set, without changing it. Such a reduction may involve using terms at a "higher level of abstraction" that are derived from the "lower level" terms. Constructive deduction is a more general concept than abstraction, as it includes any type of deductive knowledge derivation, including transformations of given knowledge to equivalent but different forms, plausible deductive derivations, such as those based on probabilistic inferences (e.g., Schum, 1986; Pearl, 1988), or plausible reasoning (e.g., Collins and Michalski,
1989). In such cases, the distinction between constructive induction and constructive deduction becomes a matter of the degree to which different forms of reasoning play the primary role.

A CLASSIFICATION OF LEARNING PROCESSES

Learning processes can be classified according to many criteria, such as the type of inferential learning strategy used (in our terminology, the type of primary transmutation employed), the type of knowledge representation (logical expressions, decision rules, frames, etc.), the way information is supplied to a learning system (batch vs. incremental), the application area, etc. Classifications based on such criteria have been discussed in Carbonell, Michalski and Mitchell (1983) and Michalski (1986). The Inferential Theory of Learning outlined above offers a new way of looking at learning processes, and suggests some other classification criteria. The theory considers learning as a knowledge transformation process whose primary purpose may be either to increase the amount of the learner's knowledge, or to increase the effectiveness of the knowledge already possessed. Therefore, the primary learning purpose can be used as a major criterion for classifying learning processes. Based on this criterion, learning processes are divided into two categories: synthetic and analytic. The main goal of synthetic learning is to acquire new knowledge that goes beyond the knowledge already possessed, i.e., beyond its deductive closure. Thus, such learning relies on synthetic knowledge transmutations. The primary inference types involved in such processes are induction and/or analogy. (The term "primary" is important, because every inductive or analogical inference also involves deductive inference. The latter form is used, for example, to test whether a generated hypothesis entails the observations, to perform an analogical knowledge transfer based on the hypothesized analogical match, to generate new terms using background knowledge, etc.) The main goal of analytic learning processes is to transform knowledge that the learner already possesses into the form that is most
desirable and/or effective for achieving the given learning goal. Thus, such learning relies on analytic knowledge transmutations. The primary inference type used is therefore deduction. For example, one may have complete knowledge of how an automobile works, and can therefore in principle diagnose its problems. By analytic learning, one can derive simple tests and procedures for more efficient diagnosis. Other important criteria for the classification of learning processes include:
• The type of input information: whether it is in the form of (classified) examples, or in the form of (unclassified) facts or observations.
• The type of primary inference employed in a learning process: induction, deduction or analogy.
• The role of the learner's background knowledge in the learning process: whether learning relies primarily on the input data, primarily on the background knowledge, or on some balanced combination of the two.
Figure 6 presents a classification of learning processes according to the above criteria. A combination of specific outcomes along each criterion determines a class of learning methodologies. Individual methodologies differ in terms of the knowledge representation employed, the underlying computational mechanism, or the specific learning goal (e.g., learning rules for recognizing unknown instances, learning classification structures, or learning equations). Methodologies such as empirical generalization, neural-net learning and genetic-algorithm-based learning all share a general goal (knowledge synthesis), have input in the form of examples or observed facts (rather than rules or other forms of general knowledge), perform induction as the primary form of inference, and involve a relatively small amount of background knowledge. The differences among them lie in the knowledge representation employed and the underlying computational mechanism. If the input to a synthetic learning method consists of examples classified by some source of knowledge, e.g., a teacher, then we have learning from examples. Such learning can be divided in turn into "instance-to-class" and "part-to-whole" categories (not shown in the Figure).
[Figure 6 diagram: a classification tree of learning processes. Dimensions, top to bottom: Purpose (synthetic vs. analytic); Type of Input (from observation vs. from examples; example-guided vs. specification-guided); Type of Primary Inference (inductive, analogy, deductive); Role of Prior Knowledge (empirical induction vs. constructive induction; constructive deduction vs. axiomatic); Learning Goal and/or Representational Paradigm. Methodologies shown include: empirical symbolic generalization; neural net learning; genetic algorithms; reinforcement learning; qualitative discovery; conceptual clustering; abductive learning; constructive inductive generalization; simple case-based learning; learning by analogy; abstraction; advanced case-based learning; problem reformulation; explanation-based learning ("pure"); integrated empirical and explanation-based learning; learning by plausible deduction; automatic program synthesis; operationalization; and multistrategy (task-adaptive) learning systems.]
Figure 6. A general classification of learning processes.
In the "instance-to-class" category, examples are independent entities that represent a given class or concept. For example, learning a general diagnostic rule for a given disease from characteristics of the patients with this disease is an "instance-to-class" generalization. Here each patient is an independent example of the disease. In the "part-to-whole" category, examples are interdependent components that have to be investigated together in order to generate a concept description. For example, a "part-to-whole" inductive learning task is to hypothesize the complete shape and look of a prehistoric animal from a collection of its bones. When the input to a synthetic learning method includes facts that need to be described or organized into a knowledge structure, without the benefit of a teacher's advice, then we have learning from observation. The latter is exemplified by the categories of learning by discovery, conceptual clustering and theory formation. The primary type of inference used in synthetic learning is induction. As described earlier, inductive inference can be empirical (background knowledge-limited) or constructive (background knowledge-intensive). Most work in empirical induction has been concerned with empirical generalization of concept examples using attributes selected from among those present in the descriptions of the examples. Another form of empirical learning is quantitative discovery, in which the learner constructs a set of equations characterizing given data. Empirical inductive learning (both from examples, also called supervised learning, and from observation, also called unsupervised learning) can be done using several different methodologies, such as symbolic empirical generalization, neural net learning, genetic algorithm learning, reinforcement learning ("learning from feedback"), simple forms of conceptual clustering and case-based learning. The above methods typically rely on (or need) a relatively small amount of background knowledge, and all perform some form of induction. They differ from each other in the type of knowledge
representation, the computational paradigm, and/or the type of knowledge they aim to learn. Symbolic methods frequently use such representations as decision trees, decision rules, logic-style representations (e.g., Horn clauses or limited forms of predicate calculus), semantic networks or frames. Neural nets use networks of neuron-like units; genetic algorithms often use classifier systems. Conceptual clustering typically uses decision rules or structural logic-style descriptions, and aims at creating classifications of given entities together with descriptions of the created classes. Reinforcement learning acquires a mapping from situations to actions that optimizes some reward function, and may use a variety of representations, such as neural nets, sets of mathematical equations, or domain-oriented languages (Sutton, 1992). In contrast to empirical inductive learning, constructive inductive learning is knowledge-intensive. It uses background knowledge and/or search techniques to create new attributes, terms or predicates that are more relevant to the learning task, and uses them to derive characterizations of the input. These characterizations can be generalizations, explanations or both. As described before, abduction can be viewed as a form of knowledge-intensive (constructive) induction, which "traces backward" domain-dependent rules to create explanations of the given input. Many methods for constructive induction use decision rules for representing both background knowledge and acquired knowledge. For completeness, we will also mention some other classifications of synthetic methods, not shown in this classification. One classification is based on the way facts or examples are presented to the learner. If examples (in supervised learning) or facts (in unsupervised learning) are presented all at once, then we have one-step or non-incremental inductive learning. If they are presented one by one, or in portions, so that the system has to modify the currently held hypothesis after each input, we have incremental inductive learning. Incremental learning may proceed with no memory, with partial memory, or with complete memory of the past facts or examples. Most incremental
machine learning methods fall into the "no memory" category, in which all knowledge of past examples is incorporated in the currently held hypothesis. Human learning falls typically into the "partial memory" category, in which the learner remembers not only the currently held hypothesis, but also representative past examples supporting the hypothesis. The second classification is based on whether the input facts or examples can be assumed to be totally correct, or can contain errors and/or noise. Thus, we can have learning from a perfect source or from an imperfect (noisy) source of information. The third classification characterizes learning methods (or processes) based on the type of matching between instances and concept descriptions. Such matching can be done in a direct way, which can be complete or partial, or in an indirect way. The latter employs inference and a substantial amount of background knowledge. For example, rule-based learning may employ a direct match, in which an example has to exactly satisfy the condition part of some rule, or a partial match, in which a degree of match is computed and the rule that gives the best match is fired. Advanced case-based learning methods employ matching procedures that may conduct an extensive amount of inference to match a new example with past examples (e.g., Bareiss, Porter and Wier, 1990). Learning methods based on the two-tiered concept representation (Bergadano et al., 1992) also use inference procedures for matching an input with the stored knowledge. In both cases, the matching procedures perform a "virtual" generalization transmutation.
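The direct (complete) and partial matching regimes just described can be illustrated with a small sketch; the symptom rules and the condition-counting match score below are invented for this purpose and do not come from any system cited in the text.

```python
# Sketch: firing rules by direct (complete) match vs. partial match.
# A rule fires directly only if every condition holds; under partial
# matching, the rule with the highest fraction of satisfied conditions fires.

RULES = {
    "flu":  {"fever", "cough", "aches"},
    "cold": {"cough", "sneezing"},
}

def direct_match(observed):
    return [name for name, cond in RULES.items() if cond <= observed]

def partial_match(observed):
    def score(cond):
        return len(cond & observed) / len(cond)
    return max(RULES, key=lambda name: score(RULES[name]))

patient = {"fever", "cough"}
print(direct_match(patient))   # []  -- no rule is satisfied completely
print(partial_match(patient))  # 'flu': 2/3 of its conditions beat 'cold' at 1/2
```

The partial match, by accepting instances outside a rule's strict extension, is what performs the "virtual" generalization transmutation mentioned above.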
Analytic methods can be divided into those that are guided by an example in the process of knowledge reformulation (example-guided), and those that start with a knowledge specification (specification-guided). The former category includes explanation-based learning (e.g., DeJong et al., 1986), explanation-based generalization (Mitchell et al., 1986), and explanation-based specialization (Minton et al., 1987; Minton, 1988). If the deduction employed in a method is based on axioms, then the method is called axiomatic. A "pure" explanation-based generalization is an example of an axiomatic method because it is based on a deductive process that utilizes a complete and consistent domain knowledge. This domain knowledge plays a role analogous to axioms in formal theories. Synthesizing a computer program from its formal specification is a specification-guided form of analytic learning. Analytic methods that involve truth-preserving transformations of description spaces and/or plausible deduction are classified as methods of "constructive deduction." One important subclass of these methods is those utilizing abstraction as a knowledge transformation operation. Other subclasses include methods employing contingent deduction, e.g., plausible deduction or probabilistic reasoning. The type of knowledge representation employed in a learning system can be used as another dimension for classifying learning systems (also not shown in Figure 6). Learning systems can be classified according to this criterion into those that use logic-style representations, decision trees, production rules, frames, semantic networks, grammars, neural networks, classifier systems, PROLOG programs, etc., or a combination of different representations. The knowledge representation used in a learning system is often dictated by the application domain. It also depends on the type of learning strategy employed, as not every knowledge representation is suitable for every type of learning strategy. Multistrategy learning systems integrate two or more inferential strategies and/or computational paradigms. Currently, most multistrategy systems integrate some form of empirical inductive learning with explanation-based learning, e.g., Unimem (Lebowitz, 1986), Odysseus (Wilkins, Clancey, and Buchanan, 1986), Prodigy (Minton et al., 1987), GEMINI (Danyluk, 1987 and 1989), OCCAM (Pazzani, 1988), IOE (Dietterich and Flann, 1988) and ENIGMA (Bergadano et al., 1990). Some systems also include a form of analogy, e.g., DISCIPLE-1 (Kodratoff and Tecuci, 1987) or CLINT (Raedt and Bruynooghe, 1993). Systems applying analogy are sometimes viewed as multistrategy, because analogy is an inference combining induction and deduction. An advanced
case-based reasoning system that uses different inference types to match an input with past cases can also be classified as multistrategy. The Inferential Theory of Learning is a basis for the development of multistrategy task-adaptive learning (MTL), first proposed by Michalski (1990a). The aim of MTL is to synergistically integrate such strategies as empirical learning, analytic learning, constructive induction, analogy, abduction, abstraction, and ultimately also reinforcement strategies. An MTL system determines by itself which strategy, or combination of strategies, is most suitable for a given learning task. In an MTL system, strategies may be integrated loosely, in which case they are represented as different modules, or tightly, in which case one underlying representational mechanism supports all strategies. Various aspects of research on MTL have been reported by Michalski (1990c) and by Tecuci and Michalski (1991a,b). Related work was also reported by Tecuci (1991a,b; 1992). Summarizing, the theory postulates that learning processes can be described in terms of generic patterns of inference, called transmutations. A few basic knowledge transmutations have been discussed, and characterized in terms of three dimensions:
A. The type of logical relationship between the input and the output: induction vs. deduction.
B. The direction of the change of the reference set: generalization vs. specialization.
C. The direction of the change in the level of detail of description: abstraction vs. concretion.
Each of the above dimensions corresponds to a different mechanism of knowledge transmutation that may occur in a learning process. The operations involved in the first two mechanisms, induction vs. deduction and generalization vs. specialization, have been relatively well explored in machine learning. The operations involved in the third mechanism, abstraction vs. concretion, have been studied relatively less. Because these three mechanisms are interdependent, not all combinations of operations can occur in a learning process. The problems of how to quantitatively
and effectively measure the amount of change in the reference set and in the level of detail of descriptions are important topics for future research. The presented classification of learning processes characterizes and relates to each other the major subareas of machine learning. Like any classification, it is useful only to the degree to which it illustrates important distinctions and relations among various categories. The ultimate goal of this classification effort is to show that diverse learning mechanisms and paradigms can be viewed as parts of one general structure, rather than as a collection of unrelated components.

SUMMARY

The goals of this research are to develop a theoretical framework and an effective methodology for characterizing and unifying diverse learning strategies and approaches. The proposed Inferential Theory looks at learning as a process of making goal-oriented knowledge transformations. Consequently, it proposes to analyze learning methods in terms of generic types of knowledge transformation, called transmutations, that occur in learning processes. Several transmutations have been discussed and characterized along three dimensions: the type of the logical relationship between an input and output (induction vs. deduction), the change in the reference set (generalization vs. specialization), and the change in the level of detail of a description (abstraction vs. concretion). Deduction and induction have been presented as two basic forms of inference. In addition to the widely studied inductive generalization, other forms of induction have been discussed, such as inductive specialization, concretion, and abduction. It has also been shown that abduction can be viewed as knowledge-based induction, and abstraction as a form of deduction. The Inferential Theory can serve as a conceptual framework for the development of multistrategy learning systems that combine different inferential learning strategies. Research in this direction has led to the formulation of multistrategy task-adaptive learning (MTL), which dynamically and synergistically adapts the learning strategy, or a combination of strategies, to the learning task.
Many of the ideas discussed are at a very early stage of development, and many issues have not been resolved. Future research should develop a more formal characterization of the presented transmutations, as well as effective methods for identifying different knowledge transmutations and measuring their "degrees." Another important research area is to determine how various learning algorithms and paradigms map into the described knowledge transmutations. In conclusion, the ITL provides a new viewpoint for analyzing and characterizing learning processes. By addressing their logical capabilities and limitations, it strives to analyze and understand the competence aspects of learning processes. Among its major goals are to develop effective methods for determining what kind of knowledge a learner can acquire from what kind of inputs, to determine the areas of most effective applicability of different learning methods, and to gain new insights into how to develop more advanced learning systems.

ACKNOWLEDGMENTS

The author expresses his gratitude to George Tecuci and Tom Arciszewski for useful and stimulating discussions of the material presented here. Thanks also go to many other people for their insightful comments and criticism of various aspects of this work, in particular, Susan Chipman, Hugo De Garis, Mike Hieb, Ken Kaufman, Yves Kodratoff, Elizabeth Marchut-Michalski, Alan Meyrowitz, David A. Schum, Brad Utz, Janusz Wnek, Jianping Zhang, and the students who took the author's Machine Learning class. This research was done in the Artificial Intelligence Center of George Mason University. The research activities of the Center have been supported in part by the Office of Naval Research under grants No. N00014-88-K-0397, No. N00014-88-K-0226, No. N00014-90-J-4059, and No. N00014-91-J-1351, in part by the National Science Foundation under grant No. IRI-9020266, and in part by the Defense Advanced Research Projects Agency under grants administered by the Office of Naval Research, No. N00014-87-K-0874 and No. N00014-91-J-1854.
REFERENCES

Adler, M. J. and Gorman, W. (Eds.), The Great Ideas: A Syntopicon of Great Books of the Western World, Vol. 1, Ch. 39 (Induction), pp. 565-571, Encyclopedia Britannica, Inc., 1987.
Aristotle, Posterior Analytics, in The Works of Aristotle, Volume 1, R. M. Hutchins (Ed.), Encyclopedia Britannica, Inc., 1987.
Bacon, F., Novum Organum, 1620.
Bareiss, E. R., Porter, B. and Wier, C. C., "PROTOS: An Exemplar-based Learning Apprentice," in Machine Learning: An Artificial Intelligence Approach, Vol. III, Kodratoff, Y. and Michalski, R. S. (Eds.), Morgan Kaufmann, 1990.
Bergadano, F., Matwin, S., Michalski, R. S. and Zhang, J., "Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System," Machine Learning Journal, Vol. 8, No. 1, January 1992.
Carbonell, J. G., Michalski, R. S. and Mitchell, T. M., "An Overview of Machine Learning," in Machine Learning: An Artificial Intelligence Approach, Michalski, R. S., Carbonell, J. G. and Mitchell, T. M. (Eds.), Morgan Kaufmann Publishers, 1983.
Cohen, L. J., The Implications of Induction, London, 1970.
Collins, A. and Michalski, R. S., "The Logic of Plausible Reasoning: A Core Theory," Cognitive Science, Vol. 13, pp. 1-49, 1989.
Danyluk, A. P., "Recent Results in the Use of Context for Learning New Rules," Technical Report No. TR-98-066, Philips Laboratories, 1989.
DeJong, G. and Mooney, R., "Explanation-Based Learning: An Alternative View," Machine Learning Journal, Vol. 1, No. 2, 1986.
Dietterich, T. G. and Flann, N. S., "An Inductive Approach to Solving the Imperfect Theory Problem," Proceedings of the 1988 Symposium on Explanation-Based Learning, pp. 42-46, Stanford University, 1988.
Goodman, L. A. and Kruskal, W. H., Measures of Association for Cross Classifications, Springer-Verlag, New York, 1979.
Kodratoff, Y. and Tecuci, G., "DISCIPLE-1: Interactive Apprentice System in Weak Theory Fields," Proceedings of IJCAI-87, pp. 271-273, Milan, Italy, 1987.
Lebowitz, M., "Integrated Learning: Controlling Explanation," Cognitive Science, Vol. 10, No. 2, pp. 219-240, 1986.
Michalski, R. S., "A Planar Geometrical Model for Representing Multi-Dimensional Discrete Spaces and Multiple-Valued Logic Functions," Report No. 897, Department of Computer Science, University of Illinois, Urbana, January 1978.
Michalski, R. S., "Theory and Methodology of Inductive Learning," in Machine Learning: An Artificial Intelligence Approach, Michalski, R. S., Carbonell, J. G. and Mitchell, T. M. (Eds.), Tioga Publishing Co., 1983.
Michalski, R. S., "Understanding the Nature of Learning: Issues and Research Directions," in Machine Learning: An Artificial Intelligence Approach, Vol. II, Michalski, R. S., Carbonell, J. G. and Mitchell, T. M. (Eds.), Morgan Kaufmann Publishers, 1986.
Michalski, R. S., "Toward a Unified Theory of Learning: Multistrategy Task-adaptive Learning," Reports of the Machine Learning and Inference Laboratory, MLI-90-1, January 1990a.
Michalski, R. S. and Kodratoff, Y., "Research in Machine Learning: Recent Progress, Classification of Methods and Future Directions," in Machine Learning: An Artificial Intelligence Approach, Vol. III, Kodratoff, Y. and Michalski, R. S. (Eds.), Morgan Kaufmann Publishers, Inc., 1990b.
Michalski, R. S., "Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation," in Machine Learning: An Artificial Intelligence Approach, Vol. III, Kodratoff, Y. and Michalski, R. S. (Eds.), Morgan Kaufmann Publishers, Inc., 1990c.
Michalski, R. S., "Inferential Theory of Learning: Developing Foundations for Multistrategy Learning," in Machine Learning: A Multistrategy Approach, Vol. IV, Michalski, R. S. and Tecuci, G. (Eds.), Morgan Kaufmann, 1993.
Minton, S., "Quantitative Results Concerning the Utility of Explanation-Based Learning," Proceedings of AAAI-88, pp. 564-569, Saint Paul, MN, 1988.
Minton, S., Carbonell, J. G., Etzioni, O., et al., "Acquiring Effective Search Control Rules: Explanation-Based Learning in the PRODIGY System," Proceedings of the 4th International Machine Learning Workshop, pp. 122-133, University of California, Irvine, 1987.
Mitchell, T. M., Keller, R. and Kedar-Cabelli, S., "Explanation-Based Generalization: A Unifying View," Machine Learning Journal, Vol. 1, January 1986.
Pazzani, M. J., "Integrating Explanation-Based and Empirical Learning Methods in OCCAM," Proceedings of EWSL-88, pp. 147-166, Glasgow, Scotland, 1988.
Pearl, J., Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.
Piatetsky-Shapiro, G., "Probabilistic Data Dependencies," Proceedings of the ML92 Workshop on Machine Discovery, J. M. Zytkow (Ed.), Aberdeen, Scotland, July 4, 1992.
Popper, K. R., Objective Knowledge: An Evolutionary Approach, Oxford at the Clarendon Press, 1972.
Poole, D., "Explanation and Prediction: An Architecture for Default and Abductive Reasoning," Computational Intelligence, No. 5, pp. 97-110, 1989.
Porter, B. W. and Mooney, R. J. (Eds.), Proceedings of the 7th International Machine Learning Conference, Austin, TX, 1990.
De Raedt, L. and Bruynooghe, M., "CLINT: A Multistrategy Interactive Concept Learner," in Machine Learning: A Multistrategy Approach, Vol. IV, Michalski, R. S. and Tecuci, G. (Eds.), Morgan Kaufmann, 1993 (to appear).
Rumelhart, D. E., McClelland, J. L. and the PDP Research Group, Parallel Distributed Processing, Vols. 1 & 2, A Bradford Book, The MIT Press, Cambridge, Massachusetts, 1986.
Russell, S., The Use of Knowledge in Analogy and Induction, Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1989.
Schafer, D. (Ed.), Proceedings of the 3rd International Conference on Genetic Algorithms, George Mason University, June 4-7, 1989.
Schum, D. A., "Probability and the Processes of Discovery, Proof, and Choice," Boston University Law Review, Vol. 66, Nos. 3 and 4, May/July 1986.
Segre, A. M. (Ed.), Proceedings of the Sixth International Workshop on Machine Learning, Cornell University, Ithaca, New York, June 26-27, 1989.
Sutton, R. S. (Ed.), Special Issue on Reinforcement Learning, Machine Learning Journal, Vol. 8, No. 3/4, May 1992.
Tecuci, G., "A Multistrategy Learning Approach to Domain Modeling and Knowledge Acquisition," in Kodratoff, Y. (Ed.), Proceedings of the European Conference on Machine Learning, Porto, Springer-Verlag, 1991a.
Tecuci, G., "Steps Toward Automating Knowledge Acquisition for Expert Systems," in Rappaport, A., Gaines, B. and Boose, J. (Eds.), Proceedings of the AAAI-91 Workshop on Knowledge Acquisition "From Science to Technology to Tools," Anaheim, CA, July 1991b.
Tecuci, G. and Michalski, R. S., "A Method for Multistrategy Task-adaptive Learning Based on Plausible Justifications," in Birnbaum, L. and Collins, G. (Eds.), Machine Learning: Proceedings of the Eighth International Workshop, San Mateo, CA, Morgan Kaufmann, 1991a.
Tecuci, G. and Michalski, R. S., "Input "Understanding" as a Basis for Multistrategy Task-adaptive Learning," in Ras, Z. and Zemankova, M. (Eds.), Proceedings of the 6th International Symposium on Methodologies for Intelligent Systems, Lecture Notes in Artificial Intelligence, Springer-Verlag, 1991b.
Touretzky, D., Hinton, G. and Sejnowski, T. (Eds.), Proceedings of the 1988 Connectionist Models Summer School, Carnegie Mellon University, June 17-26, 1988.
Utgoff, P., "Shift of Bias for Inductive Concept Learning," in Machine Learning: An Artificial Intelligence Approach, Vol. II, Michalski, R. S., Carbonell, J. G. and Mitchell, T. M. (Eds.), Morgan Kaufmann Publishers, 1986.
Warmuth, M. and Valiant, L. (Eds.), Proceedings of the 4th Annual Workshop on Computational Learning Theory, Santa Cruz, CA, Morgan Kaufmann, 1991.
Whewell, W., History of the Inductive Sciences, 3 vols., Third Edition, London, 1857.
Wilkins, D. C., Clancey, W. J. and Buchanan, B. G., An Overview of the Odysseus Learning Apprentice, Kluwer Academic Press, New York, NY, 1986.
Wnek, J., Sarma, J., Wahab, A. A. and Michalski, R. S., "Comparing Learning Paradigms via Diagrammatic Visualization: A Case Study in Concept Learning Using Symbolic, Neural Net and Genetic Algorithm Methods," Proceedings of the 5th International Symposium on Methodologies for Intelligent Systems, University of Tennessee, Knoxville, TN, North-Holland, October 24-27, 1990.
Wnek, J. and Michalski, R. S., "Comparing Symbolic and Subsymbolic Learning: A Case Study," in Machine Learning: A Multistrategy Approach, Vol. IV, Michalski, R. S. and Tecuci, G. (Eds.), Morgan Kaufmann, 1993.
Chapter 2

Adaptive Inference

Alberto Segre, Charles Elkan2, Daniel Scharstein, Geoffrey Gordon3, and Alexander Russell4
Department of Computer Science
Cornell University
Ithaca, NY 14853
Abstract

Automatically improving the performance of inference engines is a central issue in automated deduction research. This paper describes and evaluates mechanisms for speeding up search in an inference engine used in research on reactive planning. The inference engine is adaptive in the sense that its performance improves with experience. This improvement is obtained via a combination of several different learning mechanisms, including a novel explanation-based learning algorithm, bounded-overhead success and failure caches, and dynamic reordering and reformulation mechanisms. Experimental results show that the beneficial effect of multiple speedup techniques is greater than the beneficial effect of any individual technique. Thus a wide variety of learning methods can reinforce each other in improving the performance of an automated deduction system.
1 Support for this research was provided by the Office of Naval Research grants N0014-88-K-0123 and N00014-90-J-1542, and through gifts from the Xerox Corporation and the Hewlett-Packard Company.
2 Current address: Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA 92093.
3 Current address: Corporate Research and Development, General Electric, Schenectady, NY 12301.
4 Current address: Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02134.
INTRODUCTION

This paper presents an overview of our work in adaptive inference. It represents part of a larger effort studying the application of machine learning techniques to planning in uncertain, dynamic domains.5 In particular, it describes the implementation and empirical evaluation of a definite-clause, adaptive, automated deduction system. Our inference engine is adaptive in the sense that its performance characteristics change with experience. While others have previously suggested augmenting PROLOG interpreters with explanation-based learning components (Prieditis & Mostow, 1987), our system is the first to integrate a wide variety of speedup techniques such as explanation-based learning, bounded-overhead success and failure caching, heuristic antecedent reordering strategies, learned-rule management facilities, and a dynamic abstraction mechanism. Adaptive inference is an effort to bias the order of search exploration so that more problems of interest are solvable within a given resource limit. Adaptive methods include techniques normally considered speedup learning methods as well as other techniques not normally associated with machine learning. All the methods that we consider, however, rely on an underlying assumption about how the inference engine is to be used. The goal of most work within the automated deduction community is to construct inference engines which are fast and powerful enough to solve very large problems once. Large problems which were previously not mechanically solvable in a reasonable amount of time are of special interest. Once a problem is solved, another, unrelated, problem may be attempted. In contrast, we are interested in using our inference engine to solve a collection of related problems drawn from a fixed (but possibly unknown) problem distribution. These problems are all solved using the same domain theory. A complicating factor is that the inference engine is operating under rigid externally-imposed resource constraints. For example, in our own planning work, it is necessary to keep the resource constraint low enough so that the SEPIA agent is able to plan in real time. A stream of queries, corresponding to goals initiated by sensory input to the agent, is passed to the inference engine; the inference engine uses a logic of approximate plans (Elkan, 1990) to derive sequences of actions
5 Our SEPIA intelligent agent architecture (Segre & Turney, 1992a, 1992b) builds on our previous work in learning and planning (Elkan, 1990; Segre, 1987, 1988, 1991; Segre & Elkan, 1990; Turney & Segre, 1989a, 1989b). The goal of the SEPIA project is to build a scalable, real-time, learning agent.
which are likely to achieve the goal. Since much of the agent's world doesn't change from one query to the next, information obtained while answering one query can dramatically affect the size of the search space which must be explored for subsequent ones. The information retained may take many different forms: facts about the world state, generalized schemata of inferential reasoning, advice regarding fruitless search paths, etc. Regardless of form, however, the information is used to alter the search behavior of the inference engine. All of the adaptive inference techniques we employ share this same underlying theme. The message of this paper is that multiple speedup techniques can be applied in combination to significantly improve the performance of an automated deduction system (Segre, 1992). We begin by describing the design and implementation of our definite-clause automated deduction system and the context in which we intend to use it. Next, we present a methodology for reliably measuring the changes in performance of our system (Segre, Elkan & Russell, 1990, 1991; Segre, Elkan, Gordon & Russell, 1991). Of course, in order to discuss the combination of speedup techniques, it is necessary to understand each technique individually; thus we introduce each speedup technique, starting with our bounded-overhead caching system (Segre & Scharstein, 1991). We then discuss speedup learning and EBL* (our heuristic formulation of explanation-based learning) (Elkan & Segre, 1989; Segre & Elkan, 1990) and show how EBL* can acquire more useful new information than traditional EBL systems. We also briefly touch on other speedup techniques, describe our current efforts in combining and evaluating these techniques, and sketch some future directions for adaptive inference.

DEFINITE-CLAUSE INFERENCE ENGINES

A definite-clause inference engine is one whose domain theory (i.e., knowledge base) is a set of definite clauses, where a definite clause is a rule with a head consisting of a single literal and a body consisting of some number of non-negated antecedent literals. A set of definite clauses is a pure PROLOG program, but a definite-clause inference engine may be much more sophisticated than a standard pure PROLOG interpreter. All definite-clause inference engines, however, search an implicit AND/OR tree defined by the domain theory and the query, or goal, under consideration. Each OR node in this implicit AND/OR tree corresponds to a subgoal that must be unified with the head of some matching clause in the domain theory, while each AND node corresponds to the body of a clause in the domain theory. The children of an OR node represent alternative paths to search for a proof of the subgoal, while the children of an AND node represent sibling subgoals which
require mutually-consistent solutions. We are particularly interested in resource-limited inference engines. Resource limits specify an upper bound on the resources which may be allocated to solving a given problem or query before terminating the search and assuming no solution exists. Such limits are generally imposed in terms of maximum depth of search attempted, maximum nodes explored, or maximum CPU time expended before failing. While some inference engines may not appear to possess explicit resource limits, in practice, all inference engines must be resource limited, since in most interesting domains, some problems require an arbitrarily large amount of resources. Any resource limit creates a horizon effect: only queries with proofs that are sufficiently small according to the resource measure are solvable; others are beyond the horizon. More precisely, a domain theory and resource-limited inference engine architecture together determine a resource-limited deductive closure, DR, which is the set of all queries whose solutions can be found within the given resource bound R. DR is, by construction, a subset of the deductive closure D of the domain theory. The exact size and composition of DR depend on several factors: the domain theory, the resource limit, and the search strategy used. The search strategy determines the order in which the nodes of the implicit AND/OR tree are explored. Different exploration orders not only correspond to different resource-limited deductive closures DR, but to different proofs of the queries in DR as well as different node expansion costs. For example, breadth-first inference engines guarantee finding the shallowest proof, but require excessive space for problems of any significant size. Depth-first inference engines require less space, but risk not terminating when the domain theory is recursive. Choosing an appropriate search strategy is a critical design decision when constructing an inference engine.

The Testbed Inference Engine

We have implemented a backward-chaining definite-clause inference engine in Common Lisp. The inference engine's inference scheme is essentially equivalent to PROLOG's SLD-resolution inference scheme. Axioms are stored in a discrimination net database along with rules indexed by the rule head. The database performs a pattern-matching retrieval guaranteed to return a superset of those database entries which unify with the retrieval pattern. The cost of a single database retrieval in this model grows linearly with the number of matches found and logarithmically with the number of entries in the database.
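To make the AND/OR search concrete, the following toy Python sketch (emphatically not the authors' Common Lisp implementation) backward-chains over ground definite clauses under a simple depth limit; the tiny path/edge domain theory is invented for the example.

```python
# Toy sketch: backward chaining over ground definite clauses, exploring
# the implicit AND/OR tree under a depth-based resource limit.
# Each rule is (head, [body literals]); facts are rules with empty bodies.

THEORY = [
    ("path(a,c)", ["edge(a,b)", "path(b,c)"]),   # AND node: two sibling subgoals
    ("path(b,c)", ["edge(b,c)"]),
    ("edge(a,b)", []),
    ("edge(b,c)", []),
]

def prove(goal, depth):
    if depth == 0:                       # resource limit: maximum depth
        return False
    for head, body in THEORY:            # OR choices: matching clauses
        if head == goal:
            if all(prove(sub, depth - 1) for sub in body):  # AND siblings
                return True
    return False

print(prove("path(a,c)", depth=4))       # True
print(prove("path(a,c)", depth=2))       # False: the proof lies beyond the horizon
```

The second call illustrates the horizon effect: the same query falls outside the resource-limited deductive closure when the bound is too tight.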
The system relies on a well-understood technique called iterative deepening (Korf, 1985) for forcing completeness in recursive domains while still taking advantage of depth-first search's favorable storage characteristics. As generally practiced, iterative deepening involves limiting depth-first search exploration to a fixed depth. If no solution is found by the time the depth-limited search space is exhausted, the depth limit is incremented and the search is restarted. In return for completeness in recursive domains, depth-first iterative deepening generally entails a constant-factor overhead when compared to regular depth-first search: the size of this constant depends on the branching factor of the search space and the value of the depth increment. Changing the increment changes the order of exploration of the implicit search space and, therefore, the performance of the inference engine. Our inference engine performs iterative deepening on a generalized, user-defined, notion of depth while respecting the overall search resource limit specified at query time. Fixing a depth-update function (and thus a precise definition of depth) and an iterative-deepening increment establishes the exploration order of the inference engine. For example, one might define the iterative-deepening update function to compute the depth of the search; with this strategy, the system is performing traditional iterative deepening. Alternatively, one might specify update functions for conspiratorial iterative deepening (Elkan, 1989), iterative broadening (M. Ginsberg & Harvey, 1990), or numerous other search strategies.6 Our implementation supports the normal PROLOG cut and fail operations, and therefore constitutes a full PROLOG interpreter. Unlike PROLOG, however, our inference engine also supports procedural attachment (i.e., escape to Lisp), which, among other things, allows for dynamic restriction and relaxation of resource limits. In addition, for a successful query our system produces a structure representing the derivation tree for a solution rather than a PROLOG-like answer substitution. When a failure is returned, the system indicates whether the failure was due to exceeding a resource limit or whether we can in fact be guaranteed the absence of any solution.
6 The conspiracy size of a subgoal corresponds to the number of other, as yet unsolved, subgoals in the current proof structure. Thus conspiratorial best-first search prefers narrow proofs to bushy proofs, regardless of the actual depth of the resulting derivation. Iterative broadening is an analogous idea that performs iterative deepening on the breadth of the candidate proofs.
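The iterative-deepening control loop itself is simple. The sketch below, which assumes a depth-bounded search procedure such as the prove function sketched earlier, shows how a user-chosen increment generalizes the scheme; the function names and the resource cap are placeholders invented for this illustration.

```python
# Sketch of depth-first iterative deepening with a user-defined increment.
# `bounded_search(goal, limit)` is assumed to return a solution or None,
# never exploring beyond `limit`.

def iterative_deepening(goal, bounded_search, increment=1, resource_max=64):
    limit = increment
    while limit <= resource_max:        # overall resource bound
        result = bounded_search(goal, limit)
        if result is not None:
            return result, limit        # success at the shallowest limit tried
        limit += increment              # restart with a deeper horizon
    return None, resource_max

# Toy demonstration: a search that only succeeds once the limit reaches 3.
found = iterative_deepening(
    "goal", lambda g, d: "proof" if d >= 3 else None, increment=1)
print(found)    # ('proof', 3)
```

A larger increment trades fewer restarts for coarser horizons; substituting a different notion of "depth" in the bounded search yields conspiratorial deepening, iterative broadening, and similar strategies.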
The proof object is a tree whose structure reflects the successful portion of the search. The nodes in the tree are of two different types. A consequent node is used to represent an instance of a domain theory element (a rule consequent or a fact) that matches (i.e., unifies with) the query or current subgoal. A subgoal node represents an instantiation of a domain theory rule antecedent. The edges of the tree make explicit the relations between the nodes, and are also of two distinct types. A rule edge links the consequent node representing the rule's consequent to the subgoal nodes representing the rule's antecedents, while a match edge links a subgoal node to the consequent node below it (i.e., each match edge corresponds to a successful unification). The root of any proof tree is the consequent node linked by a match edge to the subgoal node representing the original query. The leaves of trees representing completed proofs are also consequent nodes, where each leaf represents a fact in the domain theory. A proof tree is valid relative to a given domain theory if and only if: (1) all subgoal-consequent node pairs linked by a match edge in the tree represent identical expressions, and (2) every rule instance in the tree is a legal instance of a rule in the domain theory. If a proof tree is valid, then the truth value of the goal appearing at its root is logically entailed by the truth value of the set of leaves of the tree that are subgoals. An example should help make this clear.

An Example

Consider the following simple domain theory, where universally-quantified variables are indicated by a leading question mark:

Facts: H(A,B), H(C,A), I(?x), K(B), K(C)
Rules: M(?y) ← ... [the rule bodies and the accompanying proof-tree figure are illegible in the source]
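Although the example's figure is lost, the proof-tree representation described above (consequent and subgoal nodes joined by rule and match edges) can be rendered as a small data structure. The sketch below is illustrative only; the field names and the M/N literals are invented, not taken from the original example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Sketch of the two node types and two edge types described above.
# A match edge is the subgoal -> consequent link; a rule edge is the
# consequent -> subgoals link for one rule instance.

@dataclass
class Consequent:            # instance of a fact or of a rule consequent
    literal: str
    subgoals: List["Subgoal"] = field(default_factory=list)   # rule edges

@dataclass
class Subgoal:               # instantiated rule antecedent (or the query)
    literal: str
    match: Optional[Consequent] = None                        # match edge

# Root: the original query's subgoal node matched against a consequent;
# a leaf consequent with no subgoals corresponds to a domain-theory fact.
query = Subgoal("M(B)")
query.match = Consequent("M(B)", subgoals=[Subgoal("N(B)")])
```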
Figure 6b: Performance of an EBL* algorithm after learning from 2 problems on the remaining set of 24 problems drawn from the AI blocks-stacking world. The inference engine is performing unit-increment depth-first iterative deepening.

Can we say something about the utility of EBL or EBL* when compared to a non-learning system? Not from this experiment. The assumption that the node expansion cost c is uniform across all three systems does not hold. While the two learning systems can be expected to have roughly comparable c parameters (each learning system acquires exactly two macro-operators), the non-learning system will not. The best we can do is observe that the learning systems search smaller spaces than the non-learning system for this particular distribution of queries: determining whether this reduction in the end corresponds to faster (or slower) performance is necessarily implementation-dependent. On the other hand, our conclusions relating the two learning systems are much stronger: EBL* clearly outperforms EBL for this particular training set and query distribution.

COMBINING TECHNIQUES

Up to this point, we have examined individual speedup learning techniques. It is our belief that the combined effect of multiple speedup techniques will exceed the beneficial effects due to the individual techniques.
Two distinct types of synergy can arise between different speedup techniques. The first is a natural synergy, where simply composing the two techniques is sufficient. We've already observed one example of natural synergy: the combined success-and-failure caching system of Figure 3 significantly outperformed success-only and failure-only caching systems of identical cache size and cache overhead. Another example of natural synergy occurs between EBL and failure caching. It is well understood that the macro-operators added by EBL constitute redundant paths in the search space. While these redundant paths may serve to shortcut solutions to certain problems, they may increase the cost of solution for other problems, sometimes even pushing solutions outside the resource-limited deductive closure DR (Minton, 1990b). As for unsolvable problems (i.e., those problems whose solution lies outside D altogether), the cost of completely exhausting the search space can only increase with EBL. While this is not a cause for concern (since EBL only really makes sense for resource-limited problem solvers), the use of failure caching can nonetheless reduce this effect. To see how this is so, consider two alternate paths in the search space representing the same solution to a given subgoal. One path represents a learned macro-operator, while the other path represents the original domain theory. To determine that a similar but unsolvable subgoal is a failure, an EBL-only system would have to search both subtrees. However, an EBL system with failure caching need not necessarily search the second subtree. A second type of synergy arises by design. For example, a bounded-overhead caching system requires certain information about cache behavior in order to apply a cache management policy. This information could also be exploited by other speedup techniques (e.g., by an EBL* pruning heuristic); since the information is already being maintained, there should be no additional overhead associated with using this information. It is precisely this type of synergy by design which we hope will provide the greatest advantage of adaptive inference systems. In this section, we present some empirical findings, again based on the 26 randomly-ordered blocks world problems of the Appendix, that illustrate the natural synergy between EBL* and caching. We wish to compare a non-learning system with the caching system and the EBL* system tested earlier, as well as with a system that performs both EBL* and caching. Unfortunately, these four systems do not exhibit uniform node expansion cost c. However, in the interests of simplifying the analysis, we again assume that the node expansion cost c is uniform across all four systems and limit our comparisons to the changes in search space size entailed by the different speedup techniques.
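A bounded-overhead success-and-failure cache of the kind discussed above might look as follows; this Python fragment is an illustrative reconstruction with an LRU eviction policy, not the authors' implementation.

```python
from collections import OrderedDict

# Sketch: a fixed-size LRU cache storing both successes and failures.
# Failure entries let the engine refuse to re-explore doomed subtrees.

class BoundedCache:
    def __init__(self, size):
        self.size = size
        self.entries = OrderedDict()   # goal -> ("success", proof) | ("failure", None)

    def lookup(self, goal):
        if goal in self.entries:
            self.entries.move_to_end(goal)    # refresh LRU position
            return self.entries[goal]
        return None

    def store(self, goal, outcome):
        self.entries[goal] = outcome
        self.entries.move_to_end(goal)
        if len(self.entries) > self.size:
            self.entries.popitem(last=False)  # evict least recently used

cache = BoundedCache(size=45)                 # the cache size used in the text
cache.store("edge(a,b)", ("success", "fact"))
cache.store("path(c,a)", ("failure", None))
print(cache.lookup("path(c,a)"))              # ('failure', None)
```

A real failure cache would also have to record the resource bound under which each failure was observed, since a "failure" found under a tight bound need not hold under a looser one; the sketch omits that bookkeeping.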
While reductions in the size of the search space generally entail a performance improvement, the magnitude of the improvement depends heavily on the details of the implementation (here represented by the exact relation between the various c parameters). Figure 7 presents the results obtained when applying both EBL* and caching to the same 24 situation-calculus blocks world problems. Our experimental procedure is to use the same 2 training problems used in Section 5. The system augments its domain theory by learning from the training problems and then tests the augmented theory on the remaining 24 problems. Unlike Section 5, however, performance is measured on the test problems with caching enabled. A cache size of 45 elements was used. The regression parameters obtained (Equation 8a) can be compared directly to the regression parameters obtained for the non-caching EBL* system (Equation 8b):

log (e) = (0.865 ± 0.019) log (ebfs)
(8a)
log (e) = (0.982 ± 0.020) log (ebfs).
(8b)
We can also compare these regression parameters to the regression parameters for a non-learning system (Equation 8c) and a 45-element LRU caching system (Equation 8d):17

log (e) = (1.026 ± 0.004) log (ebfs)
(8c)
log (e) = (0.902 ±0.007) log (ebfs)
(8d)
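Equations 8a through 8d are zero-intercept regressions of log e against log ebfs. The sketch below reconstructs a standard least-squares fit through the origin, including the slope's standard error; it is not necessarily the authors' exact estimation procedure, and the sample datapoints are invented.

```python
import math

def origin_regression(e_values, ebfs_values):
    """Fit log(e) = b * log(ebfs) by least squares through the origin."""
    xs = [math.log(x) for x in ebfs_values]
    ys = [math.log(y) for y in e_values]
    b = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    # residual-based standard error of the slope (n - 1 degrees of freedom)
    ss_res = sum((y - b * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(ss_res / (len(xs) - 1) / sum(x * x for x in xs))
    return b, se

# e: nodes actually searched per problem; ebfs: brute-force search size.
b, se = origin_regression([120, 4000, 90000], [150, 6000, 200000])
print(f"log(e) = ({b:.3f} +/- {se:.3f}) log(ebfs)")
```

A slope below 1 indicates that the system searches a smaller space than the baseline as problems grow; this is the sense in which Equations 8a, 8b and 8d improve on 8c.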
We can draw several preliminary conclusions from these results. First, both the EBL*-only and caching-only systems search significantly fewer nodes than the non-learning system. Second, the EBL*-plus-caching system searches significantly fewer nodes than any of the other systems. As discussed previously, we cannot conclude outright that the EBL*-plus-caching system is necessarily faster than the other systems; however, inasmuch as this domain theory and problem set are representative of search problems as a whole, these results do lend credence to the view that several types of speedup learning can advantageously be composed. Among the factors governing whether or not an improvement in final performance
17 Note that the parameters obtained for Equations 8c and 8d were computed using only the 24 datapoints corresponding to the learning system's test set. Nevertheless, they are essentially the same as the values shown in Figures 2 and 3, respectively, which were computed on the entire 26-problem situation-calculus problem set.
Figure 7: Performance of an EBL* algorithm after learning from 2 problems on the remaining set of 24 problems drawn from the AI blocks-stacking world. The inference engine is performing unit-increment depth-first iterative deepening, and LRU caching is enabled with a cache size of 45.
70
repeat the 20 passes twice, once for the EBL*-only system and once for the EBL*-plus-caching system. We then compare the results obtained with the datapoints obtained for the non-learning system and the cache-only system. The EBL*-only system solved all 24 problems within a resource limit of 600,000 nodes searched on only 11 of the 20 passes. On the 9 remaining passes, some of the problems were not solved within the resource bound. For the 9 incomplete passes, we make optimistic estimates of search space explored by treating unsolved problems as if they were solved after exploring the entire resource limit. When analyzed individually, the regression slopes for the 11 complete passes ranged from a low of log (b)=0.745±0.061 to a high offog(6)=1.039±0.051(for the 9 incomplete passes, these ranged from fog(6)=0.774±0.071 tofog(&)=1.334±0.096).Ten of 11 complete passes searched significantly fewer nodes than the non-learning system of Figure 2, while only 2 of 9 incomplete passes seemed to do so, even considering that these are optimistic estimates of performance (note that the use of optimistic performance estimates does not affect qualitative conclusions). A somewhat more useful analysis is shown in Figure 8; all 480 datapoints obtained (20 passes over 24 problems with unsolved problems charged the entire resource limit) are plotted together. The computed regression slope and standard error for the collected trials, which represents the average expected search performance over the entire problem distribution, is fog(6)=1.062±0.019. This represents significantly slower performance than that of the non-learning system.18 Our optimistic estimate of overall search performance for the EBL* only system factors out which problems are selected for training, and supports the conclusion that using this particular EBL* algorithm is not a good idea unless one has some additional information to help select training problems. A similar procedure is now used to measure the performance of the EBL*-plus-caching system. Each pass in this trial used the same randomly selected training problems as in the last trial. For the combined system, all 24 problems in the test set were solved within the resource bound on each and every pass.19 Here, the individually analyzed regression slopes ranged Note that the regression slope computed for the non-learning system does not change even if the data is repeated 20 times; only the standard error decreases. Thus we can compare the slope of this learning system directly to the slope of the nonlearning system (log (b)= 1.026) from Figure 1. 19 In fact, the resource bound used for this experiment was selected to meet this condition.
[Figure 8 plot: log(e) versus log(ebfs) for Depth-First Iterative Deepening with EBL* (20 passes), with fitted regression line log(e) = 1.062±0.019 log(ebfs).]
Figure 8: Search performance of an iterative-deepening inference engine using EBL* on two randomly selected problems on the remaining 24 situation-calculus problems of the Appendix. Repeated 20 times for a total of 480 datapoints, many of which may be represented by a single point on the plot. Unsolved problems are charged 600,000 nodes, the entire resource limit.

Here, the individually analyzed regression slopes ranged from a low of log(b)=0.667±0.051 to a high of log(b)=1.245±0.054. Sixteen of the twenty passes performed less search than the base system of Figure 1. The combined 480 datapoints are shown in Figure 9; the computed regression slope and standard error are log(b)=0.897±0.014.

There are several conclusions one can draw from these results:

(1) The EBL*-plus-caching system demonstrates better performance than the EBL*-only system, independent of training set selection. Note that the optimistic estimate of performance used for the EBL*-only system does not affect this (qualitative) conclusion, but rather only the magnitude of the performance advantage observed.

(2) The EBL*-plus-caching system demonstrates better performance than both the non-learning and cache-only systems. This performance advantage is roughly independent of training set selection. Naturally, better training sets imply better performance; but on the average, the advantages of learning outweigh the disadvantages regardless of the precise training set composition.
[Figure 9 plot: log(e) versus log(ebfs) for Depth-First Iterative Deepening with LRU Caching and EBL* (20 passes), with fitted regression line log(e) = 0.897±0.014 log(ebfs).]
Figure 9: Search performance of an iterative-deepening inference engine with a 45-element LRU cache and using EBL* on two randomly selected problems on the remaining 24 situation-calculus problems of the Appendix. Repeated 20 times for a total of 480 datapoints.
(3) The relative performance of the EBL*-only system with respect to the non-learning or cache-only system is critically dependent on the composition of the training set. In those situations where better training sets are selected, performance is potentially better than that of either a non-learning or cache-only system.
In summary, independent of which problems are selected for learning, the use of EBL* together with a fixed-size LRU caching system searches significantly fewer nodes than any of the other systems tested previously.20

20 Note that these conclusions are independent of training set composition but not of training set size. The size of the training set was fixed a priori on the basis of the number of problems available overall. Additional experiments with differing training set sizes would have to be performed to determine the best training set size for this particular query distribution.
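The slope statistics used throughout this analysis can be reproduced with a zero-intercept least-squares fit. The sketch below (in Python) is our own illustration, not code from the systems described here; it assumes each datapoint pairs x = log(ebfs) with y = log(e), as plotted in Figures 8 and 9, and fits y = log(b) x.

    import math

    def fit_slope_through_origin(xs, ys):
        """Least-squares fit of y = m*x with no intercept; returns (m, stderr)."""
        sxx = sum(x * x for x in xs)
        sxy = sum(x * y for x, y in zip(xs, ys))
        m = sxy / sxx
        # Residual variance with n-1 degrees of freedom (one fitted parameter).
        n = len(ys)
        rss = sum((y - m * x) ** 2 for x, y in zip(xs, ys))
        stderr = math.sqrt(rss / (n - 1) / sxx)
        return m, stderr

    # Hypothetical usage on the 480 datapoints, with unsolved problems charged
    # the full 600,000-node resource limit before taking logs:
    # slope, se = fit_slope_through_origin(xs, ys)   # e.g., 1.062, 0.019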
DISCUSSION AND CONCLUSION

The main point of this paper is that multiple speedup techniques, when brought to bear in concert against problems drawn from a fixed (possibly unknown) problem distribution, can provide better performance than any single speedup technique. While the results just presented are certainly encouraging, there is still much room for improvement. We are pursuing our research on adaptive inference along several different lines.

First, we are investigating additional speedup learning techniques with the intent to incorporate them in our adaptive inference framework. In particular, we are studying fast antecedent reordering strategies and the automatic construction of approximate abstraction hierarchies (Russell, Segre & Camesano, 1992). Given a predetermined search strategy (e.g., depth-first, breadth-first, etc.), the computation time required to find a proof for a given query depends on the order of exploration of the implicit search space. This is a much-studied problem in the automated reasoning and logic programming communities (Smith & Genesereth, 1985; Barnett, 1984). Most previously proposed heuristics are necessarily ad hoc; our heuristics are derived from successive approximations of an analytic model of search. By adding successively more sweeping assumptions about the behavior of the search process, we have built successively more efficient heuristics for reordering the body of a domain theory clause.

Second, we are looking at how system performance may be improved by sharing information among the various speedup learning components. One example of this kind of sharing is using information maintained by the cache management strategy to support a dynamic abstraction hierarchy mechanism. Hierarchical planners generally achieve their computational advantage either by relying on a priori knowledge to construct appropriate hierarchies (Sacerdoti, 1974) or by automatically constructing hierarchies from syntactic cues in the domain theory (Knoblock, 1990). Unfortunately, neither of these approaches is very useful in practice. Our approach to this problem within the SEPIA planner framework is to use information maintained by the cache management strategy to decide which subgoals possess sufficiently high (or sufficiently low) probability of success to warrant being treated as explicit assumptions. Assumption subgoals are simply treated as true (or false) in order to produce an approximate plan very quickly. The assumptions are then verified using a secondary iterative-deepening strategy that relies on the inference engine's dynamic resource-reallocation scheme. If the appropriate assumptions are made, the cost of deriving a plan with assumptions plus the cost of verifying the assumptions is notably less than the cost of deriving the plan without using assumptions at all.
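Returning to the first of these directions, the flavor of a cost-based antecedent reordering strategy can be conveyed with a small sketch. The cost model and names below are our own deliberately crude stand-ins, not the analytic heuristics of Russell, Segre & Camesano (1992): the sketch simply orders the antecedents of a clause body cheapest-first, using a static estimate of each subgoal's expected search cost.

    def reorder_antecedents(body, estimated_cost):
        """Order clause antecedents cheapest-first.

        body           -- list of subgoal literals, e.g. [('on', '?x', '?y'), ...]
        estimated_cost -- callable mapping a literal to an estimated search
                          cost, e.g. a static count of matching clauses/facts
        """
        return sorted(body, key=estimated_cost)

    # Hypothetical usage: cheap, highly-constraining subgoals go first, so
    # failures prune the implicit search space as early as possible.
    # ordered = reorder_antecedents(clause_body, lambda lit: fact_count[lit[0]])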
Finally, we are beginning to look at the problem of revising incorrect or partially-specified domain theories. Speedup learning techniques are meant to use a complete and correct domain theory more efficiently. Clearly, in more realistic domains, we cannot assume that the original domain theory is complete and correct. Generally stated, the theory revision problem is the problem of revising inaccurate or incomplete domain theories on the basis of examples which expose these inaccuracies. There has been much recent research devoted to the theory revision problem for propositional domain theories (Cain, 1991; A. Ginsberg, 1988a, 1988b; A. Ginsberg, Weiss & Politakis, 1988; Ourston & Mooney, 1990; Towell & Shavlik, 1990); the first-order problem is substantially harder (Richards & Mooney, 1991). Nevertheless, the shared central idea in each of these projects is to find a revised domain theory which is at once consistent with the obtained examples and as faithful as possible to the original domain theory. Here, faithfulness is generally measured in syntactic terms, e.g., the smallest number of changes. We are working on a first-order theory revision algorithm which is both efficient and incremental (Feldman, Segre & Koppel, 1991a, 1991b; Feldman, Koppel & Segre, 1992). Our probabilistic theory revision algorithm is based on an underlying mathematical model and therefore exhibits a set of desirable characteristics not shared by other theory revision algorithms.

In this paper, we have presented our framework for adaptive inference, and we have briefly outlined some of the speedup techniques used in our system. We have also described a new experimental methodology for use in measuring the effects of speedup learning, and we have presented several exploratory evaluations of speedup techniques intended to guide the design of adaptive inference systems. We expect to integrate other learning techniques such as heuristic antecedent reordering, dynamic abstraction hierarchies, and our probabilistic first-order domain-theory revision system into the adaptive inference framework in order to produce a comprehensive adaptive inference engine.

Acknowledgements

Thanks to Lisa Camesano, Ronen Feldman, Mark Ollis, Sujay Parekh, Doron Tal, Jennifer Turney, and Rodger Zanny for assisting in various portions of the research reported here. Thanks also to Debbie Smith for help in typesetting this manuscript.
References

Barnett, J. (1984). How Much is Control Knowledge Worth? A Primitive Example. Artificial Intelligence, 22, pp. 77-89.

Cain, T. (1991). The DUCTOR: A Theory Revision System for Propositional Domains. Proceedings of the Eighth International Machine Learning Workshop (pp. 485-489). Evanston, IL: Morgan Kaufmann Publishers.

Elkan, C., Segre, A. (1989). Not the Last Word on EBL Algorithms (Report No. 89-1010). Department of Computer Science, Cornell University, Ithaca, NY.

Elkan, C. (1989). Conspiracy Numbers and Caching for Searching And/Or Trees and Theorem-Proving. Proceedings of the Eleventh International Joint Conference on Artificial Intelligence (pp. 341-346). Detroit, MI: Morgan Kaufmann Publishers.

Elkan, C. (1990). Incremental, Approximate Planning. Proceedings of the National Conference on Artificial Intelligence (pp. 145-150). Boston, MA: MIT Press.

Feldman, R., Segre, A., Koppel, M. (1991a). Refinement of Approximate Rule Bases. Proceedings of the World Congress on Expert Systems. Orlando, FL: Pergamon Press.

Feldman, R., Segre, A., Koppel, M. (1991b). Incremental Refinement of Approximate Domain Theories. Proceedings of the Eighth International Machine Learning Workshop (pp. 500-504). Evanston, IL: Morgan Kaufmann Publishers.

Feldman, R., Koppel, M., Segre, A. (1992, March). A Bayesian Approach to Theory Revision. Workshop on Knowledge Assimilation. Symposium conducted at the AAAI Workshop, Palo Alto, CA.

Fikes, R., Hart, P., Nilsson, N. (1972). Learning and Executing Generalized Robot Plans. Artificial Intelligence, 3, pp. 251-288.
Ginsberg, A. (1988a). Knowledge-Base Reduction: A New Approach to Checking Knowledge Bases for Inconsistency and Redundancy. Proceedings of the National Conference on Artificial Intelligence (pp. 585-589). St. Paul, MN: Morgan Kaufmann Publishers.

Ginsberg, A. (1988b). Theory Revision via Prior Operationalization. Proceedings of the National Conference on Artificial Intelligence (pp. 590-595). St. Paul, MN: Morgan Kaufmann Publishers.

Ginsberg, A., Weiss, S., Politakis, P. (1988). Automatic Knowledge Base Refinement for Classification Systems. Artificial Intelligence, 35, 2, pp. 197-226.

Ginsberg, M., Harvey, W. (1990). Iterative Broadening. Proceedings of the National Conference on Artificial Intelligence (pp. 216-220). Boston, MA: MIT Press.

Hirsh, H. (1987). Explanation-based Generalization in a Logic-Programming Environment. Proceedings of the Tenth International Joint Conference on Artificial Intelligence (pp. 221-227). Milan, Italy: Morgan Kaufmann Publishers.

Kedar-Cabelli, S., McCarty, L. (1987). Explanation-Based Generalization as Resolution Theorem Proving. Proceedings of the Fourth International Machine Learning Workshop (pp. 383-389). Irvine, CA: Morgan Kaufmann Publishers.

Knoblock, C. (1990). A Theory of Abstraction for Hierarchical Planning. In D.P. Benjamin (Ed.), Change of Representation and Inductive Bias (pp. 81-104). Hingham, MA: Kluwer Academic Publishers.

Korf, R. (1985). Depth-First Iterative Deepening: An Optimal Admissible Tree Search. Artificial Intelligence, 27, 1, pp. 97-109.

Minton, S. (1990a). Learning Search Control Knowledge. Hingham, MA: Kluwer Academic Publishers.

Minton, S. (1990b). Quantitative Results Concerning the Utility of Explanation-Based Learning. In J. Shavlik & T. Dietterich (Eds.), Readings in Machine Learning (pp. 573-587). San Mateo, CA: Morgan Kaufmann Publishers.
Mitchell, T., Utgoff, P., Banerji, R. (1983). Learning by Experimentation: Acquiring and Refining Problem-Solving Heuristics. In R. Michalski, J. Carbonell & T. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. 1 (pp. 163-190). San Mateo, CA: Morgan Kaufmann Publishers.

Mitchell, T., Keller, R., Kedar-Cabelli, S. (1986). Explanation-Based Generalization: A Unifying View. Machine Learning, 1, 1, pp. 47-80.

Mooney, R., Bennett, S. (1986). A Domain Independent Explanation-Based Generalizer. Proceedings of the National Conference on Artificial Intelligence (pp. 551-555). Philadelphia, PA: Morgan Kaufmann Publishers.

Mooney, R. (1990). A General Explanation-Based Learning Mechanism. San Mateo, CA: Morgan Kaufmann Publishers.

Ourston, D., Mooney, R. (1990). Changing the Rules: A Comprehensive Approach to Theory Refinement. Proceedings of the National Conference on Artificial Intelligence (pp. 815-820). Boston, MA: MIT Press.

Plaisted, D. (1988). Non-Horn Clause Logic Programming Without Contrapositives. Journal of Automated Reasoning, 4, pp. 287-325.

Prieditis, A., Mostow, J. (1987). PROLEARN: Towards a Prolog Interpreter that Learns. Proceedings of the National Conference on Artificial Intelligence (pp. 494-498). Seattle, WA: Morgan Kaufmann Publishers.

Richards, B., Mooney, R. (1991). First-Order Theory Revision. Proceedings of the Eighth International Machine Learning Workshop (pp. 447-451). Evanston, IL: Morgan Kaufmann Publishers.

Russell, A., Segre, A., Camesano, L. (1992). Effective Conjunct Reordering for Definite-Clause Theorem Proving. Manuscript in preparation.

Sacerdoti, E. (1974). Planning in a Hierarchy of Abstraction Spaces. Artificial Intelligence, 5, pp. 115-135.

Segre, A. (1987). Explanation-Based Learning of Generalized Robot Assembly Plans. Dissertation Abstracts International, AAD87-21756. (University Microfilms No. AAD87-21756.)
Segre, A. (1988). Machine Learning of Robot Assembly Plans. Hingham, MA: Kluwer Academic Publishers.

Segre, A., Elkan, C., Russell, A. (1990). On Valid and Invalid Methodologies for Experimental Evaluations of EBL (Report No. 90-1126). Ithaca, NY: Cornell University.

Segre, A., Elkan, C. (1990). A Provably Complete Family of EBL Algorithms. Manuscript submitted for publication.

Segre, A., Elkan, C., Gordon, G., Russell, A. (1991). A Robust Methodology for Experimental Evaluations of Speedup Learning. Manuscript submitted for publication.

Segre, A., Elkan, C., Russell, A. (1991). Technical Note: A Critical Look at Experimental Evaluations of EBL. Machine Learning, 6, 2, pp. 183-196.

Segre, A. (1991). Learning How to Plan. Robotics and Autonomous Systems, 8, 1-2, pp. 93-111.

Segre, A., Scharstein, D. (1991). Practical Caching for Definite-Clause Theorem Proving. Manuscript submitted for publication.

Segre, A., Turney, J. (1992a). Planning, Acting, and Learning in a Dynamic Domain. In S. Minton (Ed.), Machine Learning Methods for Planning and Scheduling. San Mateo, CA: Morgan Kaufmann Publishers.

Segre, A., Turney, J. (1992b). SEPIA: A Resource-Bounded Adaptive Agent. Artificial Intelligence Planning Systems: Proceedings of the First International Conference. College Park, MD: Morgan Kaufmann Publishers.

Shavlik, J. (1990). Extending Explanation-Based Learning. San Mateo, CA: Morgan Kaufmann Publishers.

Smith, D., Genesereth, M. (1985). Ordering Conjunctive Queries. Artificial Intelligence, 26, pp. 171-215.

Sussman, G. (1973). A Computational Model of Skill Acquisition (Report No. 297). Cambridge, MA: MIT Artificial Intelligence Laboratory.
Towell, G., Shavlik, J., Noordewier, M. (1990). Refinement of Approximate Domain Theories by Knowledge-Based Neural Networks. Proceedings of the National Conference on Artificial Intelligence (pp. 861-866). Boston, MA: MIT Press.

Turney, J., Segre, A. (1989a). A Framework for Learning in Planning Domains with Uncertainty (Report No. 89-1009). Ithaca, NY: Cornell University.

Turney, J., Segre, A. (1989b, March). SEPIA: An Experiment in Integrated Planning and Improvisation. Workshop on Planning and Search. Symposium conducted at the AAAI Workshop, Palo Alto, CA.

Van Harmelen, F., Bundy, A. (1988). Explanation-Based Generalisation = Partial Evaluation (Research Note). Artificial Intelligence, 36, 3, pp. 401-412.
Appendix

Blocks world domain theory and randomly-ordered problem set used for the experiments reported herein. The domain theory describes a world containing 4 blocks, A, B, C, and D, stacked in various configurations on a Table. It consists of 11 rules and 9 facts; there are 26 sample problems whose first solutions range in size from 4 to 77 nodes and vary in depth from 1 to 7 inferences deep.
Facts:
holds (on (A,Table),S0)
holds (on (B,Table),S0)
holds (on (C,D),S0)
holds (on (D,Table),S0)
holds (clear (A),S0)
holds (clear (B),S0)
holds (clear (C),S0)
holds (empty ( ),S0)
holds (clear (Table),S0)

Rules:
holds (and (?x,?y),?s) ...

... {COND c1 -> q1; c2 -> q2; ...; cn -> qn}, where each ci -> qi is an action-decision rule which represents the decision to execute the plan qi when the conditions ci are true. Like the situation-action type rules used in reactive systems such as [Drummond & Bresina, 1990; Kaelbling, 1986; Mitchell, 1990; Schoppers, 1987], action-decision rules map different situations into different actions, allowing a system to make decisions based on its current environment. However, in a completable plan a conditional pi = {COND c1 -> q1; c2 -> q2; ...; cn -> qn} must also satisfy the following constraints for achievability:

1. Exhaustiveness: states(c1 v ... v cn) must be a probably exhaustive subset of states(EFF(pi-1)).

2. Observability: each ci must consist of observable conditions, where an observable condition is one for which there exists a sensor which can verify the truth or falsity of the condition.

3. Achievement: for each qi, states(EFF(qi)) must be a subset of states(PREC(pi+1)).

This is shown graphically in Figure 4.
Figure 4. A completable conditional pi with three action-decision rules.

For the exhaustiveness constraint, coverage can be represented using qualitative or quantitative probabilities. The greater the coverage, the greater the conditional's chance of achieving PREC(pi+1). The observability constraint requires knowledge of sensory capability, and here we use the term sensor in the broader sense of some set of sensory actions, which we will assume the system knows how to execute to verify the associated condition. It is needed to ensure that the conditional can be successfully evaluated during execution. Finally, the achievement constraint ensures that the actions taken in the conditional achieve the preconditions of the succeeding plan component. Provided these three constraints are satisfied, the conditional is considered probably completable, and the goal PREC(pi+1) of the conditional is probably achievable.

Probably Completable Repeat-Loops. A repeat-loop is of the form {REPEAT q UNTIL c}, which represents the decision to execute the plan q until the test c yields true. Repeat loops are similar in idea to servo-mechanisms, but in addition to the simple yet powerful failure-recovery strategy such mechanisms provide, repeat loops also permit the construction of repeated action sequences achieving incremental progress towards the goal, which may be viewed as a reactive, runtime method of achieving generalization-to-N [W. W. Cohen, 1988; Shavlik & DeJong, 1987].
Repeat loops are thus useful in completable plans for two main reasons: simple failure recovery and iteration for incremental progress.

Repeat-loops for simple failure-recovery are useful with actions having nondeterministic effects, which arise from knowledge limitations preventing a planner from knowing which of several possible effects a particular action will have. For example, in attempting to unlock the door to your apartment, pushing the key towards the keyhole will probably result in the key lodging into the hole. However, once in a while, the key may end up jamming beside the hole instead; but repeating the procedure often achieves the missed goal. In completable planning, if an action has several possible outcomes, and if the successful outcome is highly probable, and if the unsuccessful ones do not prevent the eventual achievement of the goal, then a repeat-loop can be used to ensure the achievement of the desired effects. A repeat-loop p = {REPEAT q UNTIL c} for failure-recovery must satisfy the following constraints for achievability:

1. Observability: c must be an observable condition.

2. Achievement: c must be a probable effect of q.

3. Repeatability: the execution of q must not irrecoverably deny the preconditions of q until c is achieved.

This is shown graphically in Figure 5a. The observability constraint is needed, once again, to be able to guarantee successful evaluation, while the achievement and repeatability constraints together ensure a high probability of eventually exiting the repeat loop with success. As with the exhaustiveness constraint for conditionals, the repeatability constraint may be relaxed so that the execution of q need only probably preserve or probably allow the re-achievement of the preconditions of q.

Repeat-loops for incremental progress deal with over-general effect state descriptions. Once again, knowledge limitations may result in a planner not having precise enough information to make action decisions a priori. In actions which result in changing the value of a quantity, for example, your knowledge may be limited to the direction of change or to a range of possible new values, which may not be specific enough to permit making decisions regarding precise actions—for example, determining the precise number of action repetitions or the precise length of time over which to run a process in order to achieve the goal. The implicit determination of such values during execution is achieved in completable planning through the use of repeat-loops which achieve incremental progress towards the goal and use runtime information to determine when the goal has been reached. A repeat-loop p = {REPEAT q UNTIL c} for incremental progress must satisfy the following constraints for achievability:
1. Continuous observability: c must be an observable condition which checks a particular parameter for equality to a member of an ordered set of values—for example, a value within the range of acceptable values for a quantity.

2. Incremental achievement: each execution of q must result in incremental progress towards and eventually achieving c—i.e., it must reduce the difference between the previous parameter value and the desired parameter value by at least some finite non-infinitesimal e.

3. Repeatability: the execution of q must not irrecoverably deny the preconditions of q until c is achieved.

This is shown graphically in Figure 5b.
Figure 5. Completable repeat-loops: (a) failure recovery; (b) incremental progress.

The continuous observability constraint ensures that the progress guaranteed by the incremental achievement and repeatability constraints can be detected and the goal eventually verified. For both failure recovery and iteration for incremental progress, if the repeat-loop satisfies the constraints, the repeat-loop is considered probably completable and the goal c is achievable.
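To make the two constructs concrete, here is a minimal execution-time interpreter for them, sketched in Python. It is our own illustration, not code from the system described in this chapter; condition tests are assumed to be callables wrapping the sensors that the observability constraints require, and the bounded retry count is a pragmatic stand-in for the "probable" termination guarantee.

    def execute_conditional(rules, execute):
        """rules: list of (condition, subplan) action-decision pairs.
        Evaluate observable conditions at runtime and run the first subplan
        whose condition holds; returns False if coverage fails."""
        for condition, subplan in rules:
            if condition():
                execute(subplan)
                return True
        return False  # no ci held: the conditional's coverage was incomplete

    def execute_repeat_loop(subplan, test, execute, max_iterations=100):
        """{REPEAT q UNTIL c}: re-execute q until the observable test c holds."""
        for _ in range(max_iterations):
            execute(subplan)
            if test():
                return True
        return False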
Contingent Explanation-Based Learning

Explanation-based learning (EBL) is a knowledge-intensive procedure by which general concepts may be learned from an example of the concept [DeJong & Mooney, 1986; Mitchell, Keller, and Kedar-Cabelli, 1986]. EBL involves constructing an explanation for why a particular training example is an example of the goal concept, and then generalizing the explanation into a general functional definition of that concept or more general subconcepts. In planning, explanation and generalization may be carried out over situations and actions to yield macro-operators or general control rules. Here, we are interested in learning macro-operators or general plans.

Reactive plans present a problem for standard explanation-based learning [Mooney & Bennett, 1986]. Imagine the problem of learning how to cross the street. After the presentation of an example, an explanation for how the crosser got to the other side of the street may be that the crossing took place through some suitably-sized gap between two cars. Unfortunately, the generalization of this explanation would then include the precondition that there be such a suitably-sized gap between some two cars—a precondition which for some future street-crossing can only be satisfied by reasoning about the path of potentially every car in the world over the time interval of the expected crossing! The basic problem is that standard explanation-based learning does not distinguish between planning decisions made prior to execution and those made during execution. After execution, an explanation may thus be constructed using information which became available only during execution, yielding a generalization unlikely to be useful in future instances.

Contingent explanation-based learning uses conjectured variables to represent deferred goals and completors for the execution-time completion of the partial plans derived from the general plan. A conjectured variable is a planner-posed existential used in place of a precise parameter value prior to execution, thus acting as a placeholder for the eventual value of a plan parameter. In the integrated approach, a planner is restricted to introducing conjectured variables only if achievability proofs can be constructed for the associated deferred goals. This is achieved by allowing conjectured variables in the domain knowledge of a system only in the context of its supporting achievability proof. In this manner, the provably-correct nature of classical plans may be retained in spite of the presence of conjectured variables.

A completor is an operator which determines a completion to a completable plan by finding an appropriate value for a particular conjectured variable during execution. The supporting achievability proof accompanying a conjectured variable in a completable plan provides the conditions guaranteeing the achievement of the deferred goal represented by the variable. These conditions are used in constructing an appropriate completor. There are currently three types of completors, one for each of the three types of achievability proofs discussed earlier. Iterators perform a particular action repeatedly until some goal is achieved. Monitors observe a continuously-changing quantity to determine when a particular goal value for that quantity has been reached. Filters look for an object of a particular type. The contingent explanation-based learning algorithm is summarized in Figure 6.

Example: Learning a Completable Plan for Spaceship Acceleration

A system written in Common LISP and running on an IBM RT Model 125 implements the integrated approach to planning and learning reactive operators. The system uses a simple interval-based representation and borrows simple qualitative reasoning concepts from Qualitative Process Theory [Forbus, 1984]. The system is thus able to reason about quantity values at time points as well as quantity behaviors over time intervals. For example, (value (velocity spaceship) 65 10) represents the fact that the spaceship is traveling at 65 m/s at time 10, and (behavior (velocity spaceship) increasing (10 17)) represents the fact that the spaceship's velocity was increasing from time 10 to 17. The system also uses a modified EGGS algorithm [Mooney & Bennett, 1986] in constructing and generalizing contingent explanations.
Input training example and goal concept.
Construct an explanation for why the example is an example of the goal concept.
If an explanation is successfully constructed Then
    Generalize and construct a general plan using the goal (root), the preconditions (leaves) determining applicability, and the sequence of operators achieving the goal.
    Identify the conjectured variables in the generalized explanation.
    If there are conjectured variables Then
        For every conjectured variable
            Identify the supporting achievability conditions.
            Construct an appropriate completor using these conditions.
            Add the completor to the operators of the general plan.
        Output general completable reactive plan.
    Else Output general non-reactive plan.
Else Signal FAILURE.
Figure 6. Contingent EBL Algorithm.

The system is given the task of learning how to achieve a particular goal velocity higher than some initial velocity—i.e., acceleration. The example presented to the system involves the acceleration of a spaceship from an initial velocity of 65 m/s at time 10 to the goal velocity of 100 m/s at time 17.1576, with a fire-rockets action executed at time 10 and a stop-fire-rockets action executed at time 17.1576. In explaining the example, the system uses the intermediate value rule for an increasing quantity to prove the achievability of the goal velocity. It determines that the following conditions hold: 1) velocity increases continuously while the rockets are on, 2) if the rockets are on long enough, the maximum velocity of 500 m/s will be reached, and 3) the goal velocity of 100 m/s is between the initial velocity of 65 m/s and 500 m/s. There is thus some time interval over which the spaceship can be accelerated so as to achieve the goal. In this particular example, that time interval was (10 17.1576).

The general explanation yields a two-operator (fire-rockets and stop-fire-rockets) completable plan. This plan contains a conjectured variable for the time the goal velocity is reached and the stop-fire-rockets action is performed. Using the conditions provided by the achievability proof, a monitor operator for observing the increasing velocity during the acceleration process and indicating when the goal velocity is reached to trigger the stop-fire-rockets operator is created and incorporated into the general plan.
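The monitor completor's runtime role can be sketched as follows. This is an illustrative reconstruction in Python rather than the chapter's Common LISP implementation; the sensor and actuator interfaces (read_velocity, fire_rockets, stop_fire_rockets) are hypothetical names of our own.

    import time

    def accelerate_to(goal_velocity, read_velocity, fire_rockets,
                      stop_fire_rockets, poll_interval=0.01, timeout=60.0):
        """Completable acceleration plan: fire the rockets, then let a monitor
        bind the conjectured stop time to the moment the goal value is observed."""
        fire_rockets()
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:         # monitor the increasing quantity
            if read_velocity() >= goal_velocity:   # goal value reached
                stop_fire_rockets()                # conjectured variable now bound
                return True
            time.sleep(poll_interval)
        stop_fire_rockets()                        # safety: abandon after timeout
        return False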
Alternatively, the system can learn a classical plan from the same example by using equations derived from the principle of the conservation of linear momentum in order to explain the achievement of the goal velocity. This involves reasoning about various quantities, including the combustion rate of fuel and the velocity of the exhaust from the spaceship, in order to determine the acceleration rate. The learned general plan also involves two operators, but the time to stop the rocket firing is precomputed using some set of equations rather than determined during execution. Given the problem of achieving a goal velocity of vf from the initial velocity of vi at time ti, the system may construct either a completable plan from the general completable plan or a classical plan from the general classical plan (Figure 7).

Completable Plan
[ fire-rockets at time ti
  monitor increasing velocity for the goal value of vf, binding ?t to the time this value is reached
  stop-fire-rockets at time ?t ]

Classical Plan
[ fire-rockets at time ti
  given vi = velocity at time ti, vf = goal velocity, ve = relative exhaust velocity, mc = burn rate, M = initial mass of spaceship
  stop-fire-rockets at time tf = ti + t ]

Figure 7. Completable vs. classical acceleration plans.

In computing the time at which to stop the rocket firing, the classical plan assumes a constant exhaust velocity and burn rate. Provided the expected values are accurate, it will achieve the goal velocity. However, if the actual values differ, the spaceship may not reach or may surpass the goal velocity. Even small deviations from the expected values could have devastating effects if a plan involved many such a priori computations, through which errors could get propagated and amplified. In contrast, the completable plan makes no assumptions regarding the exhaust velocity and burn rate, and instead uses execution-time information to determine when to stop firing the rockets. It is thus more likely to achieve the goal velocity regardless of such variations.

For a classical planner to correctly compute when to stop the rockets, it would have to completely model the rocket-firing process—including the fuel-to-oxygen ratio, combustion chamber dimensions, nozzle geometry, material characteristics, and so on. This intractability is avoided in the integrated approach through the deferment of planning decisions and the utilization of execution-time information in addressing deferred decisions.
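The precomputed burn time t in the classical plan follows from the momentum-conservation analysis the text describes; the closed form below is our own reconstruction from the quantities listed in Figure 7, not a formula quoted from the chapter.

    import math

    def classical_burn_time(vi, vf, ve, mc, M):
        """Burn time from the rocket equation vf - vi = ve * ln(M / (M - mc*t)),
        assuming constant exhaust velocity ve and constant burn rate mc.
        Solving for t gives t = (M / mc) * (1 - exp(-(vf - vi) / ve))."""
        return (M / mc) * (1.0 - math.exp(-(vf - vi) / ve))

    # Hypothetical usage: stop_time = ti + classical_burn_time(65, 100, ve, mc, M).
    # Any error in the assumed ve or mc shifts the precomputed stop time, which
    # is exactly the brittleness the monitor-based completable plan avoids.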
Extensions to Contingent EBL

To extend completable planning to probable achievability we extended contingent EBL to learn probably completable plans. The idea of probably completable plans lends itself naturally to incremental learning strategies. Conditionals, for example, represent a partitioning of a set of states into subsets requiring different actions to achieve the same goal. With probable achievability, a plan may include only some of these subsets. As problems involving the excluded subsets are encountered, however, the plan can be modified to include the new conditions and actions. Similarly, incremental learning can be used to learn failure-recovery strategies within repeat-loops.

The motivation behind the incremental learning of reactive components is similar to the motivation behind much work on approximations and learning from failure, including [Bennett, 1990; Chien, 1989; Hammond, 1986; Mostow & Bhatnagar, 1987; Tadepalli, 1989]. The primary difference between these approaches and completable planning is that in these approaches, a system has the ability to correct the assumptions behind its incorrect approximations and thus tends to converge upon a single correct solution for a problem. In completable planning, uncertainty is inherent in the knowledge representation itself and the system instead addresses the problem of ambiguity through reactivity. As a system learns improved reactive components, it thus tends to increase a plan's coverage of the possible states which may be reached during execution.

The preconditions of an action may be satisfied either prior to execution or during execution. The procedure in Figure 8 is applied to learned general plans to distinguish between these two types of preconditions.

For each precondition pr
    If pr is not satisfied by I then
        If pr is observable then
            Find all operators supported by pr
            For each such operator
                Make the execution of that operator conditional on pr
            Remove pr from the general plan's preconditions.

Figure 8. Procedure to distinguish between preconditions.

A conditional manifests itself in an explanation as multiple, disjunctive paths between two nodes (Figure 9a), with a path representing one action-decision rule, the leaves which cannot be satisfied in the initial state forming the condition, and the operators along the path forming the action. Since coverage may be incomplete, a system may fail to satisfy any of the conditions within a conditional, in which case the system has the option of learning a new alternative (Figure 9b) to solve the
current problem and to increase coverage in future problems (Figure 9c).
a. old conditional    b. new alternative    c. new conditional

Figure 9. Explanation structures in learning new conditionals.

The procedure in Figure 10 adds a new rule into a conditional.

new-to-add := plan components in new plan not matching any in old plan
old-to-change := plan component in old plan not matching any in new plan
Make a new action-decision rule using new-to-add
Append the new rule to the action-decision rules of old-to-change
For each precondition pr in the new plan
    If pr is not already in the old plan then
        add pr to the preconditions of the old plan.

Figure 10. Procedure to add new rule to conditional.

Recall that for conditionals to be completable, they must satisfy the constraints of exhaustiveness, observability, and achievement. Since the plans here are derived from explanations, the constraint of achievement is already satisfied. The procedure above checks for observability. For the exhaustiveness constraint, let X be the desired minimum coverage, where X can be a user-supplied value or one computed from other parameters such as available resources and importance of success. Coverage can be represented by qualitative probabilities—for example, the term "usually" can be used to denote high probability. The exhaustiveness constraint is satisfied in a conditional {COND c1 -> q1; ...; cn -> qn} iff the probability of (c1 v c2 v ... v cn) is at least X.

Repeat-loops for simple failure-recovery address the problem of actions with nondeterministic effects or multiple possible outcomes, and thus repeat-loops are first constructed by identifying such actions in the general plan using the procedure in Figure 11.

For each action a in the plan
    If the outcome of a used in the plan is a probable outcome among others then
        If the desired outcome c is observable then
            Construct a repeat loop for a.

Figure 11. Procedure for constructing a repeat loop.

Recall that for a repeat-loop for failure to be completable, it must satisfy the constraint of repeatability aside from the constraints of observability and achievement. If the unsuccessful outcomes of a do not prevent the repetition of a, then the repeatability constraint is satisfied, and the probable eventual achievement of the desired effects is guaranteed.
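Figure 10's merge step can be expressed compactly in code. The sketch below is our own paraphrase under assumed data structures (a plan as a dictionary of components and preconditions; a conditional component carrying a list of (condition, action) rules), not the system's actual representation.

    def add_rule_to_conditional(old_plan, new_plan):
        """Incrementally extend a conditional with an alternative learned from
        a problem the existing rules failed to cover (cf. Figure 10)."""
        # Components of the new plan absent from the old one form the new rule.
        new_to_add = [c for c in new_plan['components']
                      if c not in old_plan['components']]
        # Assumes exactly one old component fails to match: the conditional
        # that needs the new alternative.
        old_to_change = next(c for c in old_plan['components']
                             if c not in new_plan['components'])
        condition = new_plan['trigger']   # observable conditions of the alternative
        old_to_change['rules'].append((condition, new_to_add))
        # Fold in any preconditions the new alternative introduces.
        for pr in new_plan['preconditions']:
            if pr not in old_plan['preconditions']:
                old_plan['preconditions'].append(pr)
        return old_plan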
However, for unsuccessful outcomes which deny the preconditions of a, actions to recover the preconditions must be learned. Precondition-recovery strategies within a repeat-loop can be characterized as a conditional, where the different states are the different outcomes, the actions are the different recovery strategies, and the common effect state is the precondition state of the action a. If we let ui be an unsuccessful outcome, and ri be the recovery strategy for ui, then a repeat-loop eventually takes the form {REPEAT {q; [COND u1 -> r1; ...; un -> rn]} UNTIL c}. Learning the embedded conditional for failure recovery can be done as in the previous section.

Example: Learning a Probably Completable Train Route Plan

The system was given the task of learning a plan to get from one small city to another going through two larger cities using a train. The primary source of incompleteness preventing complete a priori planning is the system's knowledge with regard to the state of the railroads. For a system to get from one city to another, the cities have to be connected by a railroad, and the railroad has to be clear. For a railroad to be considered clear, it must not be flooded, not be congested with traffic, be free of accidents, and not be under construction. These conditions cannot be verified a priori for all railroads, hence the need for conditionals.

The training example involves getting from the city of Wyne to the city of Ruraly, where the rail connectivity between the two cities is shown in Figure 12. Here, the railroad AB is a major railroad and sometimes gets congested.
T" h"—-
TlT" B"I
Figure 12. Rail connectivity between Wyne and Ruraly.
Ruraly
T
Also, the northern railroads to and from X, C, and Z are susceptible to flooding. And accidents and construction may occur from time to time. The initial training example given to the system is the route Wyne-A-B-Ruraly, which is generally the quickest way to get from Wyne to Ruraly. The learned general plan gets a train from one city to another with two intermediate stops, where only the railroad between the two intermediate cities is susceptible to heavy traffic and needs to be checked for it (Figure 13). When the system encounters a situation in which none of the conditions in a conditional is satisfied—in this example, the no-traffic condition is false just as the system is to execute (go Amatrak A B A-B) to achieve (at Amatrak B)—the system is given the alternative route A-C-B, which gets the system to B and allows it to continue with the next step in its original plan and achieve its goal of getting to Ruraly.
PLAN1
[COMPS
  [COND ((NOT (ACC ?12)) (NOT (CON ?12))) -> ((GO ?AGT ?C1 ?C2 ?12))]
  [COND ((NOT (ACC ?23)) (NOT (CON ?23)) (NOT (TRF ?23))) -> ((GO ?AGT ?C2 ?C3 ?23))]
  [COND ((NOT (ACC ?34)) (NOT (CON ?34))) -> ((GO ?AGT ?C3 ?C4 ?34))]]
[PRECS (AT ?AGT ?C1) (CONN ?C1 ?C2 ?12) (NOT (TRF ?12)) (NOT (FLD ?12)) (CONN ?C2 ?C3 ?23) (NOT (FLD ?23)) (CONN ?C3 ?C4 ?34) (NOT (TRF ?34)) (NOT (FLD ?34))]
[EFFS (AT ?AGT ?C4)]
[EXPL: [EXPLANATION for (AT AMATRAK RURALY)]]
Figure 13. Initial Learned Plan.

From this experience, the system modifies its old plan to include the new alternative of going through another city between the two intermediate cities. The system thus now has two alternatives when it gets to city A. When it encounters a situation in which AB is congested and AC is flooded, it is given yet another alternative, A-D-E-B, from which it learns another plan to get from A to B and modifies the old plan as before. Now, in planning to get from Wyne to Ruraly, the system constructs the conditional in Figure 14, which corresponds to the second conditional in Figure 13.

PLAN1
[COMPS
  [COND ((NOT (ACC A-B)) (NOT (CON A-B)) (NOT (TRF A-B))) -> ((GO AMATRAK A B A-B))
        ((NOT (ACC A-C)) (NOT (CON A-C)) (NOT (FLD A-C))) -> (((GO AMATRAK A C A-C))
          (COND (((NOT (ACC C-B)) (NOT (CON C-B)) (NOT (FLD C-B))) -> ((GO AMATRAK C B C-B)))))
        ((NOT (ACC A-D)) (NOT (CON A-D))) -> (((GO AMATRAK A D A-D))
          (COND (((NOT (ACC D-E)) (NOT (CON D-E))) -> ((GO AMATRAK D E D-E))))
          (COND (((NOT (ACC E-B)) (NOT (CON E-B))) -> ((GO AMATRAK E B E-B)))))]]
Figure 14. Final conditional in specific plan for getting from Wyne to Ruraly.

Note that the incremental learning algorithm permits the system to learn conditionals only on demand. In this example, alternative routes for getting from Wyne to A and from B to Ruraly are not learned. Assuming either a training phase or an evaluation step for determining whether particular situations are likely to occur again, a system can use this algorithm to learn minimally contingent plans.

Limitations

Completable planning represents a trade-off. A planner in this approach incurs the additional cost of proving achievability as well as completing plans during execution. Our intuitions, however, are that there is a whole class of interesting problems for which proving achievability is much easier than determining plans and where additional runtime information facilitates planning. Future work will investigate such problems more thoroughly to develop a crisper definition of the class of problems addressed by completable planning. As these problems are better defined, contingent EBL may also need to be extended to enable
the learning of completable plans with different kinds of deferred decisions. This includes learning to construct different types of achievability proofs and completors.

Another direction for future work is a more thorough analysis of the trade-off between the advantages brought and costs incurred by completable planning. Aside from the a priori planning cost completable plans have over reactive plans, and the runtime evaluation cost completable plans have over classical plans, in proving achievability completable plans also sometimes require knowledge about the general behavior of actions not always available in traditional action definitions. On the other hand, completable planning minimizes a priori information requirements. Related to this is the development of completable planning within a hierarchical planning framework [Sacerdoti, 1974; Stefik, 1981]. Casting completable planning in such a framework gives rise to several interesting research issues, including the development of an abstraction hierarchy which incorporates runtime decision-making (as in [Firby, 1987]) and using achievability as a criterion for defining a hierarchy.

PERMISSIVE PLANNING

Permissive planning is, in some ways, the dual of the reactive approach. Like the reactive approach, it gives up the notion of a provably correct plan. However, the concept of projection remains. Indeed, it is, if anything, more central than before. In most real-world domains it is impossible to describe the world correctly and completely. It follows that internal system representations of the world must, at best, be approximate. Such approximations may arise from imperfect sensors, incomplete inferencing, unknowable features of the world, or limitations of a system's representation ability. We introduce the concept of permissiveness of a plan as a measure of how faithfully the plan's preconditions must reflect the real world in order for the plan to accomplish its goals. One plan is more permissive than another if its representations can be more approximate while continuing to adequately achieve its goals. We do not propose to quantify this notion of permissiveness. Instead, we employ a machine learning approach which enhances the permissiveness of acquired planning concepts. The approach involves acquiring and refining generalized plan schemata or macro-operators which achieve often-occurring general goals and sub-goals. Acquisition is through rather standard explanation-based learning [DeJong & Mooney, 1986; Mitchell, Mahadevan, and Steinberg, 1985; Mitchell et al., 1986; Segre, 1988]. However, the refinement process is unique.

Improving Permissiveness

To drive refinement, the system constantly monitors its sensors during plan execution. When sensor readings fall outside of anticipated bounds, execution ceases and the plan is judged to have failed.
The failure can only be due to a data approximation; if there were no mismatch between internal representations and the real world, the plan would have the classical planning property of provable correctness. The plan's failure is diagnosed. Ideally, only a small subset of the system's data approximations could underlie the monitored observations. The system conjectures which of its data representations, if incorrect, might account for the observations. Next, the system uses qualitative knowledge of the plan's constituent operators. The small conjectured error is symbolically propagated through the plan to its parameters. The plan parameters are adjusted so as to make the planning schema less sensitive to the diagnosed discrepancy with the world. If the process is successful, the refined schema is uniformly more permissive than the original, which it replaces. Thus, through interactions with the world, the system's library of planning schemata becomes increasingly permissive, reflecting a tolerance of the particular discrepancies that the training problems illustrate. This, in turn, results in a more reliable projection process.

Notice that there is no improvement of the projection process at the level of individual operators. Performance improvement comes at the level of plan schemata whose parameters are adjusted to make them more tolerant of real-world uncertainties in conceptually similar future problems. Adjustment is neither purely analytical nor purely empirical. Improvement is achieved through an interaction between qualitative background knowledge and empirical evidence derived from the particular real-world problems encountered.

Domain Requirements

The notion of permissive planning is not tied to any particular domain. Though domain-independent it is, nonetheless, not universally applicable. There are characteristics of domains, and problem distributions within domains, that indicate or counter-indicate the use of permissive planning. An application that does not respect these characteristics is unlikely to benefit from the technique.

For permissive planning to help, internal representations must be approximations to the world. By this we mean that there must be some metric for representational faithfulness, and that along this metric, large deviations of the world from the system's internal representations are less likely than small deviations. Second, some planning choices must be subject to continuous real-valued constraints or preferences. These choices are called parameters of the plan schema. They are usually real-valued arguments to domain operators that must be resolved before the plan can be executed. Permissiveness is achieved through tuning preferences on these parameters. Finally, the planner must be supplied with information on how each operator's preconditions and arguments qualitatively change its effects.
This information is used to regress symbolic representations of the diagnosed out-of-bounds approximations through the planning structure. Such propagation determines how parameters should be adjusted so as to decrease the likelihood of similar future failures. Once determined, the information so gained embodies a new preference for how to resolve parameter values.

Permissive Planning in Robotics

Clearly, many domains do not respect these constraints. However, robotic manipulation domains form an important class in which the above characteristics are naturally enforced. Consider the data approximation constraint. A typical expression in a robotics domain may refer to real-world measurements. Object positions and dimensions require the representation of metric quantities. An example might be something like (HEIGHT-IN-INCHES BLOCK3 2.2). Such an expression is naturally interpreted as an approximation to the world. Indeed, expressions such as this one are useless in the real world under a standard semantics. The conditions of truth require that the height of the world object denoted by BLOCK3 be exactly 2.2 inches. Technically, no deviation whatsoever is permitted. If the height of BLOCK3 is off by only 10^-40 inches, the expression is false - just as false as if it were off by 5 inches or 50 inches. Clearly, such an interpretation cannot be tolerated; the required accuracy is beyond the numerical representational capabilities of most computers. Another nail is driven into the coffin for standard semantics by real-world constraints. Actual surfaces are not perfectly smooth. Since the top and bottom of BLOCK3 most likely vary by more than 10^-40 inches, the "height" of a real-world object is not a well-defined concept. In fact, no working system could interpret expressions such as the one above as describing the real world.

The most common of several alternatives is to relegate the system to a micro-world. Here, the system implementor takes on the responsibility for insuring that no problems will result from necessarily imprecise descriptions of the domain. In general, this requires the implementor to characterize in some detail all of the future processing that will be expected of the system. Often he must anticipate all of the planning examples that the system will be asked to solve. Other alternatives have been pursued involving explicit representations of and reasoning about error [Brooks, 1982; Davis, 1986; Erdmann, 1986; Hutchinson & Kak, 1990; Lozano-Perez, Mason, and Taylor, 1984; Zadeh, 1965] and guaranteed conservative representations [Malkin & Addanki, 1990; Wong & Fu, 1985; Zhu & Latombe, 1990]. These either sacrifice completeness, correctness, or efficiency and offer no way of tuning or optimizing their performance through interactions with the world.

Expressions such as (HEIGHT-IN-INCHES BLOCK3 2.2) are extremely common in robotic domains and can be easily interpreted as satisfying our informal definition of an approximation: the metric for faithfulness is the real-valued height measure, and, presumably, if a reasonable system describes the world using the expression (HEIGHT-IN-INCHES BLOCK3 2.2), it is more likely the case that any point on the top surface of BLOCK3 is 2.2001 inches high than 7.2 inches high.
It is essential that the expression not saddle the system with the claim that BLOCK3 is precisely 2.2 inches high.

The second condition for permissive planning requires that continuous real-valued parameters exist in the system's general plans. Geometric considerations in robotic manipulation domains insure that this condition is met. Consider some constraints on a robot manipulator motion past a block (in fact BLOCK3), which rests on the table. Some height must be adopted for the move. From the geometrical constraints there is a minimum height threshold for the path over the block. Since the arm must not collide with anything (in particular with BLOCK3), it must be raised more than 2.2 inches above the table. This height threshold is one of the plan parameters. Any value greater than 2.2 inches would seem to be an adequate bound on the parameter for the specific plan; if 2.2 inches is adequate, so is 2.3 inches, or 5.0 inches, etc. Thus, the plan supports the parameter as a continuous real-valued quantity. Notice that once the specific plan of reaching over BLOCK3 is generalized by EBL, the resulting plan schema parameterizes the world object BLOCK3 to some variable, say, ?x and the value 2.2 to ?y where (HEIGHT-IN-INCHES ?x ?y) is believed, and the threshold parameter to ?z where ?z is equivalent to (+ ?y e) for the tight bound, or equivalent to (+ ?y e 0.1) for the bound of 2.3, or equivalent to (+ ?y e 2.8) for the bound of 5.0, etc. The value of e insures that the bound is not equaled and can be made arbitrarily small in a perfect world. As will become clear, in permissive planning, e may be set identically to zero or left out entirely.

The final condition for permissive planning requires qualitative information specifying how the effects of domain operators relate to their preconditions and arguments. This constraint, too, can be naturally supported in robotic manipulation domains. Consider again the plan of moving the robot arm past BLOCK3. The plan involves moving the arm vertically to the height ?z and then moving horizontally past the obstacle. The required qualitative information is that the height of the robot arm (the effect of MOVE-VERTICALLY) increases as its argument increases and decreases as its argument decreases. With this rather simple information the generalized plan schema for moving over an obstacle can be successfully tuned to prefer higher bounds, resulting in a more permissive plan schema.
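The height-threshold example can be rendered as a tiny tuning rule. The sketch below is our own illustration with invented names, not GRASPER code: it keeps a context-indexed preference for the ?z parameter and, after a collision-type failure, shifts the preference toward larger clearances, which is the direction indicated by regressing the discrepancy through MOVE-VERTICALLY's qualitative behavior.

    def choose_threshold(believed_height, preferences, context='move-over'):
        """Resolve the continuous parameter ?z = believed height + margin.
        preferences maps a context to a learned margin policy."""
        margin = preferences.get(context, 0.0)  # e = 0 permitted in permissive planning
        return believed_height + margin

    def tune_after_failure(preferences, context='move-over', step=0.5):
        """A collision during the horizontal move implies the true height
        exceeded the believed height; qualitative regression through
        MOVE-VERTICALLY says: prefer larger ?z in this context."""
        preferences[context] = preferences.get(context, 0.0) + step
        return preferences

    # Hypothetical usage:
    # prefs = {}
    # z = choose_threshold(2.2, prefs)    # 2.2 (tight bound, e = 0)
    # prefs = tune_after_failure(prefs)   # collision observed -> prefer higher
    # z = choose_threshold(2.2, prefs)    # 2.7: more permissive of height error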
One might imagine that the system would behave similarly if we simply chose to represent BLOCK3 as taller than it really is. But permissive planning is more than adopting static conservative estimates for world values. Only in the context of moving past objects from above does it help to treat them as taller than their believed heights. In other contexts (e.g., compliantly stacking blocks) it may be useful to pretend the blocks are shorter than believed. Permissive planning adjusts the planning concepts, not the representations of the world. It therefore preserves the context over which each parameter adjustment results in improved, rather than degraded, performance.

From a different point of view, permissive planning amounts to blaming the plan for execution failures, even when in reality the accuracy of representations of the world, not the plan, is at fault. This is a novel approach to planning which results in a different, rather strange semantics for the system's representations. Current research includes working out a more formal account of the semantics for representations in permissive plans. Straightforward interpretations of the expressions as probabilistic seem not to be sufficient. Nor are interpretations that view the expressions as fuzzy or as having uncertainty or error bounds. The difficulty lies in an inability to interpret an expression in isolation. An expression "correctly" describes a world if it adequately supports the permissive plans that make use of it. Thus, an expression cannot be interpreted as true or not true of a world without knowing the expression's context, including the system's planning schemata, their permissiveness, and the other representations that are believed.

The GRASPER System

The GRASPER system embodies our ideas of permissive planning. GRASPER is written in Common Lisp running on an IBM RT125. The system includes an RTX scara-type robotic manipulator and a television camera mounted over the arm's workspace. The camera sub-system produces bitmaps from which object contours are extracted by the system. The RTX robot arm has encoders on all of its joint motors and the capability to control many parameters of the motor controllers, including motor current, allowing a somewhat coarse control of joint forces.

The GRASPER system learns to improve its ability to stably grasp isolated novel real-world objects. Stably grasping complex and novel objects is an open problem in the field of robotics. Uncertainty is one primary difficulty in this domain. Real-world visual sensors cannot, even in principle, yield precise information. Uncertainty can be reduced and performance improved by engineering the environment (e.g., careful light source placement). However, artificially constraining the world is a poor substitute for conceptual progress in planner design. The position, velocity, and force being exerted by the arm, whether sensed directly or derived from sensory data, are also subject to errors, so that the manipulator's movements cannot be precisely controlled. Nor can quantities like the position at which the manipulator first contacts an object be known precisely. Intractability also plays a significant role in this domain. To construct plans in a reasonable amount of time, object representations must be simplified. This amounts to introducing some error in return for planning efficiency. Altogether, the robotic grasping domain provides a challenging testbed for learning techniques. Figure 15 shows the laboratory setup.
Figure 15. GRASPER Experimental Setup.

Our current goal for the GRASPER system in the robotic grasping domain is to successfully grasp isolated plastic pieces from several puzzles designed for young children. The system does not possess any model of the pieces prior to viewing them with its television camera. Since the pieces are relatively flat and of fairly uniform thickness, an overhead camera is used to sense piece contours. These pieces have interesting shapes and are challenging to grasp. The goal is to demonstrate improved performance at the grasping task over time in response to failures.

Concept Refinement in GRASPER

No explicit reasoning about the fact that data approximations are employed takes place during plan construction or application. Thus, planning efficiency is not compromised by the presence of approximations. Indeed, efficiency can be enhanced, as internal representations for approximated objects may be simpler. The price of permissive planning with approximations is the increased potential for plan execution failures due to discrepancies with the real world.

GRASPER's permissive planning concepts contain three parts. First, there is a set of domain operators to be applied, along with their constraints. This part is similar to other EBL-acquired macro-operators [Mooney, 1990; Segre, 1988] and is not refined. Second, there is a specification of the parameters within the macro-operator and, for each, a set of contexts and preferences for their settings. Third, there is a set of sensor expectations. These include termination conditions for the macro and bounds on the expected readings during executions of the macro. If the termination conditions are met and none of the expectations are violated, then the execution is successful. Otherwise it is a failure. A failed execution indicates a real-world contradiction; a conclusion, supported by the system's internal world model, is inconsistent with the measured world.
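The three parts and the success criterion just described might be rendered in Common Lisp as follows; the slot names and the expectation representation are assumptions for illustration, not GRASPER's actual data structures.

;; Sketch of the three-part structure of a permissive planning concept.

(defstruct planning-concept
  operators            ; EBL-acquired macro-operator steps and constraints
  parameters           ; for each parameter: contexts and setting preferences
  sensor-expectations) ; termination conditions and bounds on readings

(defstruct expectation
  sensor        ; e.g. 'gripper-separation
  lower-bound
  upper-bound)

(defun execution-succeeded-p (termination-met-p expectations read-sensor)
  "Success iff the termination conditions are met and no expectation
is violated; otherwise the execution is a failure."
  (and termination-met-p
       (every (lambda (e)
                (let ((v (funcall read-sensor (expectation-sensor e))))
                  (<= (expectation-lower-bound e) v (expectation-upper-bound e))))
              expectations)))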
It is only during failure handling that the system accesses information about approximations. In the spirit of permissive planning, the planning concept that supports the contradiction is blamed for the failure. A symbolic specification of the difference between the observed and expected sensor readings is qualitatively regressed through the concept's explanation. This regression identifies which parameters can influence the discrepancy and also discovers in which direction they should be tuned in order to reduce the discrepancy. A parameter is selected from among the candidates, and a new preference is asserted for the context corresponding to the failure conditions. The preferences themselves are qualitative—"Under these conditions, select the smallest consistent value from the possibilities available." The resulting planning concept contains more context-specific domain knowledge and is uniformly more likely to succeed than its predecessor.

As an aside, it is important that the proposed new parameter preference be consistent with previous context preferences for that parameter. If the new preference cannot be reconciled with existing experiential preferences, the original macro-operator structure is flawed or an incorrect selection was made (possibly during some previous failure) from the candidate parameters. Ongoing research is investigating how to handle such inconsistencies in a theoretically more interesting way than simple chronological backtracking across previous decisions. The current system does no more than detect such "over-tuning" of parameters.

We will now consider a brief example showing the GRASPER system refining its grasping concept. The system already possesses an EBL-acquired planning concept for grasping small objects. Basically, the concept says to raise the arm with the gripper pointing down, to select grasping points on the object, to position itself horizontally over the object's center of mass, to open the gripper, to rotate the wrist, to lower the arm, and to close the gripper. Also specified are parameters (like how high initially to move the gripper, how close the center of mass must be to the line between the grasp points, how wide to open the gripper before descending, etc.), and sensor expectations for positions, velocities, and forces for the robot's joints. Through experience prior to the example, the grasping concept was tuned to open the gripper as wide as possible before descending to close on the object. This yields a plan more permissive of variations in the size, shape, and orientation of the target object.

A workspace is presented to the GRASPER system. Figure 16 shows the output of the vision system (on the left) and the internal object representations on the right, with gensymed identifiers for the objects. The upper center object (OBJECT5593) is specified as the target for grasping. Figure 17 highlights the selected target object. The dark line indicates the polygonal object approximation.
Figure 16. System Status Display During Grasp of OBJECT5593. (Panels: Vision Data; Nodes Explored; Approximated Objects, with gensymed object identifiers.)
Figure 17. Grasp Target and Planned Finger Positions. (Arrows illustrate the planned finger positions.)

This is the object's internal representation used in planning. The light colored pixels show the vision system output, which more accurately follows the object's true contours. The arrows illustrate the planned positions for the fingers in the grasping plan. Notice that the fingers are well clear of the object due to previous experience with the opening-width parameter. The chosen grasp points are problematic. A human can correctly anticipate that the object may "squirt" away to the lower left as the gripper is closed. GRASPER, however, has a "proof" that closing on the selected grasp points will achieve a stable grasp. The proof is simply a particular instantiation of GRASPER's planning schema showing that it is satisfiable for OBJECT5593. The proof is, of course, contingent on the relevant internal representations of the world being "accurate enough," although this contingency is not explicitly represented or acknowledged
by the schema. In particular, the coefficient of friction between any two surfaces is believed to be precisely 1.0. This is incorrect. If it were correct, the gripper could stably grasp pieces whose grasped faces made an angle of up to 45 degrees. The system believes the angle between the target faces of OBJECT5593 is 41.83 degrees, well within the 45 degree limit. This is also incorrect.

The action sequence is executed by the arm while monitoring for the corresponding anticipated sensor profiles. During a component action (the execution of the close-gripper operator) the expected sensor readings are violated, as shown in Figure 18. The shaded areas represent the expected values only roughly.

Position (mm) vs. Elapsed Time (seconds)
Force vs. Elapsed Time (seconds)
Figure 18. Expected vs. Observed Features.

Some expectations are qualitative and so cannot be easily captured on such a graph. Position is in millimeters; force is in motor duty cycle, where 64 is 100%. Only the observed data for the close-gripper action are given. This action starts approximately 10 seconds into the plan and concludes when the two fingers touch, approximately 18 seconds into the plan. The termination condition for close-gripper (the force ramping up quickly with little finger motion) is met, but close-gripper expects this to occur while the fingers are separated by the width of the piece. This expectation is violated, so the close-gripper action and the grasp plan both fail. It is assumed that the fingers touched because the target piece was not between them as they closed. A television picture after the failure verifies that the gripper was able to close completely because the target object is not where it used to be. The piece is found to have moved downward and to the left. The movement is attributed to the plan step in which expectations began to go awry, namely, the close-gripper action.
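The monitoring logic of this example can be summarized as below; the expected separation and tolerance are illustrative stand-ins for the actual sensor profiles of Figure 18, not values taken from GRASPER.

;; Sketch of the close-gripper expectation check.

(defun close-gripper-outcome (final-separation expected-width
                              &key (tolerance 2.0)) ; mm, assumed
  "The termination condition (force ramps up with little finger motion)
can be met while the separation expectation is still violated."
  (if (< final-separation (- expected-width tolerance))
      ;; Fingers met well inside the expected piece width: the piece
      ;; was not between them, so the action and the plan both fail.
      (list :failure :observed final-separation :expected expected-width)
      :success))

;; In the example the fingers touch although a piece was expected
;; between them: (close-gripper-outcome 0.0 30.0) => (:FAILURE ...)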
The failure is explained in terms of the original "proof" that the close-gripper action would result in a stable grasping of OBJECT5593. While reasoning about execution failures, the system has access to information about which world beliefs are approximate. The failure is "explained" when the system discovers which approximate representations may be at fault. The system must identify approximations which, if their values were different, would transform the proof of a stable grasp into a proof for the observed motion of the piece. In this example, the offending approximations are the angle between the target faces of OBJECT5593, which may be an under-estimate, and the coefficient of friction between the gripper fingers and the faces, which may be an over-estimate. Errors of these features in the other direction (e.g., a coefficient of friction greater than 1.0) could not account for the observation.

We might, at this point, entertain the possibility of refining the approximations. This would be the standard AI debugging methodology, which contributes to the conceptual underpinnings of much diverse AI research, from default reasoning to debugging almost-correct plans to diagnosis to refining domain knowledge. However, debugging the system's representations of the world is not in the spirit of permissive planning. We do not believe a fully debugged domain theory is possible even in principle. The approximate beliefs (face angle and coefficient of friction representations) are left as before. Instead, the system attempts to adjust the plan to be less sensitive to the offending approximations. This is done by adjusting preferences for the parameters of the planning concept. Adjustment is in the direction that increases the probability that the original conclusion of a stable grasp will be reached and reduces the probability of the observed object motion. This is a straightforward procedure given the qualitative knowledge of the plan. All parameters that, through the structure of the plan, can qualitatively oppose the effects of the out-of-bound approximations are candidates. In the example, the only relevant plan parameter supports the choice of object faces. The previous preferences on the parameters to choose between face pairs are that they each have a minimum length of 5 cm, that they straddle the center of geometry, and that the angle they form must be greater than 0 and less than 45 degrees. The first and second preferences are unchanged; the third is qualitatively relevant to the offending approximations and is refined. The initial and refined preferences are shown in Figure 19. Note that the refinement is itself qualitative, not quantitative. Previously, the particular angle chosen within the qualitatively homogeneous interval from 0 to 45 degrees was believed to be unimportant (a flat preference). The system now believes that angles within that interval can influence the success of grasping and that small angles (more nearly parallel faces) are to be preferred. Angles greater than 45 degrees are not entertained.
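In outline, the refinement step looks like the following sketch; the qualitative influence table stands in for what is actually obtained by regressing the discrepancy through the plan's explanation, and all names are illustrative assumptions.

;; Sketch of candidate selection and qualitative preference refinement.

(defparameter *influences*
  ;; parameter -> direction of its influence on the risk of slipping
  '((face-angle . :increases)      ; larger angle -> more likely to slip
    (opening-width . nil)))        ; no qualitative influence on slipping

(defun candidate-parameters (influences)
  "Parameters that can qualitatively oppose the observed discrepancy,
paired with the tuning direction that reduces it."
  (loop for (param . influence) in influences
        when influence
          collect (cons param (if (eq influence :increases)
                                  :prefer-smaller
                                  :prefer-larger))))

(defun refine-preference (preferences context param direction)
  "Return PREFERENCES extended with a new qualitative preference,
e.g. prefer smaller face angles within the 0-45 degree interval."
  (cons (list context param direction) preferences))

;; (candidate-parameters *influences*) => ((FACE-ANGLE . :PREFER-SMALLER))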
Preference function before example; preference function after example. (Axes: preference vs. angle between faces; ticks at 45, 90, and 180 degrees.)
Figure 19. Refinement of Angle Preference.

Notice that this preference improves robustness regardless of which approximation (the coefficient of friction or the angle between faces) was actually at fault for the failed grasping attempt. When given the task again, GRASPER plans the grasping points given in Figure 20, and the grasp is successful.
Figure 20. Successful Grasp Positions. (Arrows illustrate the planned finger positions.)

The change results in improved grasping effectiveness for other sizes and shapes of pieces as well. In basic terms, the refinement says that one way to effect a more conservative grasp of objects is to select grasping faces that make a more shallow angle to each other.

Empirical Results

The GRASPER system was given the task of achieving stable grasps on the 12 smooth plastic pieces of a children's puzzle. Figure 21 shows the gripper and several of the pieces employed in these experiments. A random ordering and set of orientations was selected for presentation of the pieces. Target pieces were also placed in isolation from other objects. That is, the workspace never had pieces near enough to the grasp target to impinge on the decision made for grasping the target. The first run was performed with preference tuning turned off. The results are illustrated in Figure 22. Failures observed during this run included finger stubbing failures (FS), where a gripper finger struck the top of the object while moving down to surround it, and lateral slipping failures (LS), where, as the gripper was closed, the object slipped out of grasp, sliding along the table surface.
Figure 21. Gripper and Pieces.

(Bar charts: failures over trials 1-12 without tuning and trials 1-12 with tuning; FS = finger stubbing failure; no knowledge about vertical slipping failures has been included.)
Figure 22. Comparison of Tuning to Non-tuning in Grasping the Pieces of a Puzzle.

The given coefficient of friction (1.0) and the choice of opening width as the object chord resulted in a high error rate. There were 9 finger stubbing failures and 1 lateral slipping failure in 12 trials. In our second run, preference tuning was turned on. An initial stubbing failure on trial 1 led to a tuning of the chosen-opening-width parameter, which determines how far to open for the selected grasping faces. Since the generated qualitative tuning explanation illustrates that opening wider would decrease the chance of this type of failure, the system tuned the parameter to choose the largest opening width possible (constrained only by the maximum gripper opening). In trials 2 and 3, finger stubbing failures did not occur because the opening width was greater than the object width for that orientation. Vertical slipping failures (VS), about which the current implementation does not have knowledge, did occur.
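The effect of that first tuning can be stated in a few lines; the maximum opening and the chord-based default below are assumed values, not GRASPER's actual constants.

;; Sketch of the chosen-opening-width parameter before and after tuning.

(defconstant +max-gripper-opening+ 90.0) ; mm; assumed hardware limit

(defun chosen-opening-width (object-chord preference)
  "Before tuning, open to the object chord; after the trial-1 stubbing
failure, open as wide as the gripper allows."
  (ecase preference
    (:object-chord object-chord)
    (:prefer-largest +max-gripper-opening+)))

;; After tuning: (chosen-opening-width 42.0 :prefer-largest) => 90.0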
Preventing vertical slipping failures involves knowing shape information along the height dimension of the object, which we are considering providing in the future using a model-based vision approach. In trial 5, a lateral slipping failure is seen, and the qualitative tuning explanation suggests decreasing the contact angle between the selected grasping surfaces, as in the example above. Single examples of the finger stubbing and lateral slipping failures were sufficient to eliminate those failure modes from the later test examples.

Limitations

Permissive planning is not a panacea. To be applicable, the domain must satisfy the strong constraints outlined above. Furthermore, there are other obstacles besides projection that must be surmounted in salvaging some vestige of traditional planning. In particular, the search space of an unconstrained planner seems intractably large. Here we might buy into an IOU for Minton-style [Minton, 1988] utility analysis for EBL concepts. That is not the focus of the current research. However, the endeavor of permissive planning would be called into question should that research go sour or fail to extend to schema-type planners.

Our current conceptualization of permissive planning is more general than is supported by the implementation. For example, there is no reason that increasing permissiveness need be relegated to adjusting parameter preferences. Structural changes in the planning concept may also be entertained as a means of increasing permissiveness. The current implementation may be pushed to do so through simple chronological backtracking through permissiveness and planning choices when inconsistent parameter preferences arise. We are searching for a more elegant method.

Our current theory of permissive planning also leaves room for improvement. A more formal and general specification of permissive planning is needed. There are questions about the scope of applicability, correctness, and source of power that can be resolved only with a more precise statement of the technique. For example, we currently rely heavily on qualitative propagation through the planning "proof." Is qualitative reasoning necessary, or is its use merely a consequence of the fact that permissiveness is achieved through tuning preferences on continuous parameters? The current theory and implementation also rely on the notion of a centralized macro-operator. These provide the context, a kind of appropriate memory hook, for permissiveness enhancements. But is commitment to a schema-like planner necessary to support permissive planning, or only sufficient? These are the questions that drive our current research.

CONCLUSIONS

A primary motivation for our work is that internal representations of the external physical world are necessarily flawed. It is neither possible nor desirable for a planner to manipulate internal representations that are perfectly faithful to
the real world. Even apparently simple real-world objects, when examined closely, reveal a subtlety and complexity that is impossible to model perfectly. Increasing the complexity of the representations of world objects can dramatically degrade planning time. Furthermore, in most domains, there can be no guarantee that a feature of the world, no matter how inconspicuous it seems, can be safely ignored. Very likely, some plan will eventually be entertained that exploits the over-simplified representation. As a result, the standard planning process of projection, anticipating how the world will look after some actions are performed, is problematic.

The problems arising from imperfect a priori knowledge in classical planning were recognized as early as the STRIPS system, whose PLANEX component employed an execution algorithm which adapted predetermined plans to the execution environment [Fikes, Hart, and Nilsson, 1972]. Augmenting traditional planning with explicit reasoning about errors and uncertainties complicates the problem [Brooks, 1982; Davis, 1986; Erdmann, 1986; Hutchinson & Kak, 1990; Lozano-Perez et al., 1984; Zadeh, 1965]. Such systems, which model error explicitly, are subject to a similar problem: the error model employed is seldom faithful to the distributions and interactions of the actual errors and uncertainties. The same issues of mismatches between domain theories and the real world arise when the domain theory is a theory about uncertainties. Other work, such as [Wilkins, 1988], addresses these problems via execution monitoring and failure recovery. More recently, Martin and Allen (1990) presented a method for combining strategic (a priori) and dynamic (reactive) planning, but one that uses an empirically based approach rather than a knowledge-based approach for proving achievability. The idea of incrementally improving a plan's coverage is also presented in [Drummond & Bresina, 1990], where a plan's chance of achieving the goal is increased through robustification. That work deals primarily with actions having different possible outcomes, while the conditionals in this work deal with the problem of over-general knowledge. The idea of conditionals is also related to the work on disjunctive plans, such as [Fox, 1985; Homem de Mello & Sanderson, 1986], although these have focused on the construction of complete, flexible plans for closed-world manufacturing applications. There has also been work in learning disjunctions using similarity-based learning techniques [Shell & Carbonell, 1989; Whitehall, 1987]. Other work on integrating a priori planning and reactivity [Cohen, Greenberg, Hart, and Howe, 1989; Turney & Segre, 1989] focuses on the integration of the planning and execution of multiple plans. There has also been some work in learning stimulus-response rules for becoming increasingly reactive [Mitchell, 1990].

In this paper we have described two other approaches. Permissive planning endorses a kind of uncertainty-tolerant interaction with the world. Rather than
debugging or characterizing the flawed internal representations, the planning process itself is biased, through experience, to prefer the construction of plans that are less sensitive to the representational flaws. In this way the projection process becomes more reliable with experience. Completable planning and contingent EBL take advantage of the benefits provided by classical planning and reactivity while ameliorating some of their limitations through learning. Perfect characterizations of the real world are difficult to construct, and thus classical planners are limited to toy domains. However, the real world often follows predictable patterns of behavior which reactive planners are unable to utilize due to their nearsightedness. Contingent EBL enables the learning of plans for use in the completable planning approach, which provides for the goal-directed behavior of classical planning while allowing for the flexibility provided by reactivity. This makes it particularly well-suited to many interesting real-world domains.

It is our belief that machine learning will play an increasingly central role in systems that reason about planning and action. Through techniques such as explanation-based learning, a system can begin to actively adapt to its problem-solving environment. In so doing, effective average-case performance may be possible by exploiting information inherent in the distribution of problems, while simultaneously avoiding the known pitfall of attempting guaranteed or bounded worst-case domain-independent planning.
REFERENCES

Agre, P. & Chapman, D. (1987). Pengi: An Implementation of a Theory of Activity. Proceedings of the National Conference on Artificial Intelligence (pp. 268-272). Seattle, Washington: Morgan Kaufmann.
Bennett, S. W. (1990). Reducing Real-world Failures of Approximate Explanation-based Rules. Proceedings of the Seventh International Conference on Machine Learning (pp. 226-234). Austin, Texas: Morgan Kaufmann.
Brooks, R. A. (1982). Symbolic Error Analysis and Robot Planning (Memo 685). Cambridge: Massachusetts Institute of Technology, Artificial Intelligence Laboratory.
Brooks, R. A. (1987). Planning is Just a Way of Avoiding Figuring Out What to Do Next (Working Paper 303). Cambridge: Massachusetts Institute of Technology, Artificial Intelligence Laboratory.
Chapman, D. (1987). Planning for Conjunctive Goals. Artificial Intelligence 32, 3, 333-377.
Chien, S. A. (1989). Using and Refining Simplifications: Explanation-based Learning of Plans in Intractable Domains. Proceedings of the Eleventh International Joint Conference on Artificial Intelligence (pp. 590-595). Detroit, Michigan: Morgan Kaufmann.
Cohen, W. W. (1988). Generalizing Number and Learning from Multiple Examples in Explanation Based Learning. Proceedings of the Fifth International Conference on Machine Learning (pp. 256-269). Ann Arbor, Michigan: Morgan Kaufmann.
Cohen, P. R., Greenberg, M. L., Hart, D. M., & Howe, A. E. (1989). Trial by Fire: Understanding the Design Requirements for Agents in Complex Environments. Artificial Intelligence Magazine 10, 3, 32-48.
Davis, E. (1986). Representing and Acquiring Geographic Knowledge. Morgan Kaufmann.
DeJong, G. F. & Mooney, R. J. (1986). Explanation-Based Learning: An Alternative View. Machine Learning 1, 2, 145-176.
Drummond, M. & Bresina, J. (1990). Anytime Synthetic Projection: Maximizing the Probability of Goal Satisfaction. Proceedings of the Eighth National Conference on Artificial Intelligence (pp. 138-144). Boston, Massachusetts: Morgan Kaufmann.
Erdmann, M. (1986). Using Backprojections for Fine Motion Planning with Uncertainty. International Journal of Robotics Research 5, 1, 19-45.
Fikes, R. E., Hart, P. E., & Nilsson, N. J. (1972). Learning and Executing Generalized Robot Plans. Artificial Intelligence 3, 4, 251-288.
Firby, R. J. (1987). An Investigation into Reactive Planning in Complex Domains. Proceedings of the National Conference on Artificial Intelligence (pp. 202-206). Seattle, Washington: Morgan Kaufmann.
Forbus, K. D. (1984). Qualitative Process Theory. Artificial Intelligence 24, 85-168.
Fox, B. R. & Kempf, K. G. (1985). Opportunistic Scheduling for Robotic Assembly. Proceedings of the Institute of Electrical and Electronics Engineers International Conference on Robotics and Automation (pp. 880-889).
Gervasio, M. T. (1990a). Learning Computable Reactive Plans Through Achievability Proofs (Technical Report UIUCDCS-R-90-1605). Urbana: University of Illinois, Department of Computer Science.
Gervasio, M. T. (1990b). Learning General Completable Reactive Plans. Proceedings of the Eighth National Conference on Artificial Intelligence (pp. 1016-1021). Boston, Massachusetts: Morgan Kaufmann.
Gervasio, M. T. & DeJong, G. F. (1991). Learning Probably Completable Plans (Technical Report UIUCDCS-91-1686). Urbana: University of Illinois, Department of Computer Science.
Hammond, K. (1986). Learning to Anticipate and Avoid Planning Failures through the Explanation of Failures. Proceedings of the National Conference on Artificial Intelligence (pp. 556-560). Philadelphia, Pennsylvania: Morgan Kaufmann.
Hammond, K., Converse, T., & Marks, M. (1990). Towards a Theory of Agency. Proceedings of the Workshop on Innovative Approaches to Planning, Scheduling and Control (pp. 354-365). San Diego, California: Morgan Kaufmann.
Homem de Mello, L. S. & Sanderson, A. C. (1986). And/Or Graph Representation of Assembly Plans. Proceedings of the National Conference on Artificial Intelligence (pp. 1113-1119). Philadelphia, Pennsylvania: Morgan Kaufmann.
Hutchinson, S. A. & Kak, A. C. (1990). SPAR: A Planner That Satisfies Operational and Geometric Goals in Uncertain Environments. Artificial Intelligence Magazine 11, 1, 30-61.
Kaelbling, L. P. (1986). An Architecture for Intelligent Reactive Systems. Proceedings of the 1986 Workshop on Reasoning About Actions & Plans (pp. 395-410). Timberline, Oregon: Morgan Kaufmann.
Lozano-Perez, T., Mason, M. T., & Taylor, R. H. (1984). Automatic Synthesis of Fine-Motion Strategies for Robots. International Journal of Robotics Research 3, 1, 3-24.
Malkin, P. K. & Addanki, S. (1990). LOGnets: A Hybrid Graph Spatial Representation for Robot Navigation. Proceedings of the Eighth National Conference on Artificial Intelligence (pp. 1045-1050). Boston, Massachusetts: Morgan Kaufmann.
Martin, N. G. & Allen, J. F. (1990). Combining Reactive and Strategic Planning through Decomposition Abstraction. Proceedings of the Workshop on Innovative Approaches to Planning, Scheduling and Control (pp. 137-143). San Diego, California: Morgan Kaufmann.
Minton, S. (1988). Learning Search Control Knowledge: An Explanation-Based Approach. Norwell: Kluwer Academic Publishers.
Mitchell, T. M., Mahadevan, S., & Steinberg, L. I. (1986). LEAP: A Learning Apprentice for VLSI Design. Proceedings of the Ninth International Joint Conference on Artificial Intelligence (pp. 573-580). Los Angeles, California: Morgan Kaufmann.
Mitchell, T. M., Keller, R., & Kedar-Cabelli, S. (1986). Explanation-Based Generalization: A Unifying View. Machine Learning 1, 1, 47-80.
Mitchell, T. M. (1990). Becoming Increasingly Reactive. Proceedings of the Eighth National Conference on Artificial Intelligence (pp. 1051-1058). Boston, Massachusetts: Morgan Kaufmann.
Mooney, R. J. & Bennett, S. W. (1986). A Domain Independent Explanation-Based Generalizer. Proceedings of the National Conference on Artificial Intelligence (pp. 551-555). Philadelphia, Pennsylvania: Morgan Kaufmann.
Mooney, R. J. (1990). A General Explanation-Based Learning Mechanism and its Application to Narrative Understanding. London: Pitman.
Mostow, J. & Bhatnagar, N. (1987). FAILSAFE—A Floor Planner that uses EBG to Learn from its Failures. Proceedings of the Tenth International Joint Conference on Artificial Intelligence. Milan, Italy: Morgan Kaufmann.
Rosenschein, S. J. & Kaelbling, L. P. (1987). The Synthesis of Digital Machines with Provable Epistemic Properties (CSLI-87-83). Stanford: CSLI.
Sacerdoti, E. (1974). Planning in a Hierarchy of Abstraction Spaces. Artificial Intelligence 5, 115-135.
Schoppers, M. J. (1987). Universal Plans for Reactive Robots in Unpredictable Environments. Proceedings of the Tenth International Joint Conference on Artificial Intelligence (pp. 1039-1046). Milan, Italy: Morgan Kaufmann.
Segre, A. M. (1988). Machine Learning of Robot Assembly Plans. Norwell: Kluwer Academic Publishers.
Shavlik, J. W. & DeJong, G. F. (1987). An Explanation-Based Approach to Generalizing Number. Proceedings of the Tenth International Joint Conference on Artificial Intelligence (pp. 236-238). Milan, Italy: Morgan Kaufmann.
Shell, P. & Carbonell, J. (1989). Towards a General Framework for Composing Disjunctive and Iterative Macro-operators. Proceedings of the Eleventh International Joint Conference on Artificial Intelligence (pp. 596-602). Detroit, Michigan: Morgan Kaufmann.
Stefik, M. (1981). Planning and Metaplanning (MOLGEN: Part 2). Artificial Intelligence 16, 2, 141-170.
Suchman, L. A. (1987). Plans and Situated Actions. Cambridge: Cambridge University Press.
Tadepalli, P. (1989). Lazy Explanation-Based Learning: A Solution to the Intractable Theory Problem. Proceedings of the Eleventh International Joint Conference on Artificial Intelligence. Detroit, Michigan: Morgan Kaufmann.
Turney, J. & Segre, A. (1989). SEPIA: An Experiment in Integrated Planning and Improvisation. Proceedings of The American Association for Artificial Intelligence Spring Symposium on Planning and Search (pp. 59-63).
Whitehall, B. L. (1987). Substructure Discovery in Executed Action Sequences (Technical Report UILU-ENG-87-2256). Urbana: University of Illinois, Department of Computer Science.
Wilkins, D. E. (1988). Practical Planning: Extending the Classical Artificial Intelligence Planning Paradigm. San Mateo: Morgan Kaufmann.
Wong, E. K. & Fu, K. S. (1985). A Hierarchical Orthogonal Space Approach to Collision-Free Path Planning. Proceedings of the 1985 Institute of Electrical and Electronics Engineers International Conference on Robotics and Automation (pp. 506-511).
Zadeh, L. A. (1965). Fuzzy Sets. Information and Control 8, 3, 338-353.
Zhu, D. J. & Latombe, J. C. (1990). Constraint Reformulation in a Hierarchical Path Planner. Proceedings of the 1990 Institute of Electrical and Electronics Engineers International Conference on Robotics and Automation (pp. 1918-1923). Cincinnati, Ohio: Morgan Kaufmann.
Chapter 4

THE ROLE OF SELF-MODELS IN LEARNING TO PLAN

Gregg Collins, Lawrence Birnbaum, Bruce Krulwich, and Michael Freed
Northwestern University
The Institute for the Learning Sciences
Evanston, Illinois
ABSTRACT We argue that in order to learn to plan effectively, an agent needs an explicit model of its own planning and plan execution processes. Given such a model, the agent can pinpoint the elements of these processes that are responsible for an observed failure to perform as expected, which in turn enables the formulation of a repair designed to ensure that similar failures do not occur in the future. We have constructed simple models of a number of important components of an intentional agent, including threat detection, execution scheduling, and projection, and applied them to learning within the context of competitive games such as chess and checkers.
INTRODUCTION The search for a domain-independent theory of planning has been a dominant theme in AI since its inception. This concern was explicit, for example, in Newell and Simon's (1963) pioneering model of human problem solving and planning, the General Problem Solver (GPS). The line of classical planning work that followed GPS, including STRIPS (Fikes and Nilsson, 1971), ABSTRIPS (Sacerdoti, 1974), and NOAH (Sacerdoti, 1977), has maintained this concern to the present day, as is perhaps best exemplified by Wilkins's (1984) SIPE. The reason for this concern seems clear enough, if we consider the alternative: Without a domain-independent theory, a planner cannot be viewed as anything more than a collection of domain- and task-dependent routines having
no particular relationship to each other. Such a view of planning offers no approach to the problem of adapting a planner to new domains.

The pursuit of a domain-independent theory of planning, however, has led to an unexpected and unfortunate outcome, in that the resulting models are essentially general-purpose search procedures, embodying hardly any knowledge about planning, and no knowledge about the world. This knowledge resides, instead, in the operators that specify the search space. In effect, such models of planning are analogous to programming languages: Generality has been achieved, but only by sacrificing almost completely any useful constraints on how the planner should be programmed. The responsibility for both the efficiency of the planning process, and the efficacy of the resulting plans, lies almost entirely with the human being who writes the operator definitions. In other words, we are left with a domain-independent theory of planning that offers very little guidance in attempting to adapt planners to new domains. Unfortunately, this was the very problem that motivated the search for such a theory in the first place.

The alternative, then, is to assume that a domain-independent theory of planning must be knowledge-intensive rather than knowledge-poor, if it is to provide effective guidance in adapting to new domains. Human planners know a great deal about planning in the abstract, and it is this knowledge that enables them to adapt quickly to new domains and tasks. Our approach thus takes much of its inspiration from Sussman's (1975) and Schank's (1982) work showing how abstract planning knowledge, in the form of critics or thematic organization points (TOPs), can be used to improve planning performance in specific domains. More generally, we wish to construct a theory in which the detailed, domain-specific knowledge necessary for effective planning is generated by the planner itself, as a product of the interaction between the planner's knowledge about planning and its experience in particular domains. Our ultimate goal is a model that is capable of transferring lessons learned in one domain to other domains, through the acquisition of such abstract planning knowledge itself (Birnbaum and Collins, 1988).

DEBUGGING THE PLANNER

Any theory of learning must first of all address the question of when to learn. Sussman (1975) pioneered an approach to this problem, which has come to be known as failure-driven learning, in which
learning is motivated by the recognition of performance failures. A failure-driven learning system contains a debugging component that is called into play whenever the system's plans go awry; the repairs suggested by this component are then incorporated into the system's planning knowledge, thereby improving future performance in similar situations. Because this approach directly relates learning to task performance, it has become the dominant paradigm for learning how to plan within AI (see, e.g., Schank, 1982; Hayes-Roth, 1983; Kolodner, 1987; Simmons, 1988; Hammond, 1989a).

Of course, such an approach immediately raises the question, what is being debugged? The obvious answer, and the one that has generally been taken for granted, is simply "plans." This view is based on what has come to be known as the "classical" tradition in planning, in which planners are assumed to produce as output completely self-contained, program-like plans, which plan executors are assumed to then faithfully carry out. The completely self-contained nature of plans within this framework leads rather naturally to the assumption that whenever an agent fails to achieve its goals, the fault must lie within the individual plans themselves. However, the classical conception of planning has become increasingly untenable as the role of reactivity in goal-directed behavior has become more clearly understood (see, e.g., Hayes-Roth and Hayes-Roth, 1979; Brooks, 1986; Agre and Chapman, 1987; Firby, 1989; Hammond, 1989b). The shift towards reactive models of planning has, in particular, called into question the idea that plans are completely self-contained structures. In so doing, it raises serious problems for any theory that is based on the idea of debugging monolithic plans of this sort.

Reactive models of planning are in large part motivated by the recognition that, since the conditions under which an agent's plans will be carried out cannot be completely anticipated, much of the responsibility for determining the particular actions that the agent will perform at a given time must lie in the plan execution component of that agent, rather than resting exclusively with the plans themselves. In order to be capable of carrying out the additional responsibilities required by these models, the plan execution component can no longer be taken to be a simple, general-purpose program interpreter. Rather, it must be seen as a highly articulated set of components, each devoted to controlling a particular aspect of behavior.
Consider, for example, a simple plan for keeping a piece safe in chess, formulated as a self-contained, program-like structure:

    while the piece is on the board do
        if a threat against the piece is detected then
            either
                a. move the piece
                b. guard the piece
                c. interpose another piece
                d. remove the threat
                e. ... etc.

There are two key points to notice about this plan. First, an agent cannot yield complete control of its behavior to a plan of this sort, because the plan will never relinquish control unless a threat is successfully executed and the piece is taken, and the agent cannot afford to execute such a plan to the exclusion of all others. Second, the details of the actions to be carried out in service of this plan cannot be specified in very much detail in advance of detecting a particular threat.

Thus, to be able to carry out such a plan, the agent must perform some version of timesharing or multitasking. The plan must relinquish control of the agent's computational, perceptual, and behavioral resources until such time as a threat against the piece is actually detected. This in turn implies that some executive component of the agent must be charged with the responsibility of returning control to the plan at the appropriate time, i.e., when such a threat is detected. Thus, a task that was formerly the responsibility of individual plans—threat detection—now becomes the responsibility of a specialized component of the planning architecture.

In light of the above discussion, we need to reconsider our original question of what is being debugged in a failure-driven approach to learning how to plan. Since a great deal of the responsibility for determining what to do has now been shifted to the agent's plan executor, any adequate approach to learning by debugging must be capable of determining the causes of, and repairs for, performance errors arising from the operation of this execution component. Approaches that consider only the plans themselves as the objects to be debugged are obviously incapable of making such determinations. Thus, as more responsibility is shifted to the plan executor, the focus of debugging effort must be shifted there as well.
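The division of labor this argument calls for can be sketched in Common Lisp as follows; the representation of suspended plans and wake conditions is an assumption for illustration, not a description of any particular implementation.

;; Sketch of an executive that holds the keep-piece-safe plan suspended
;; and returns control to it when the threat detector fires.

(defstruct agent
  suspended-plans   ; alist: wake-condition (predicate) -> plan function
  threat-detector)  ; function: world -> list of detected threats

(defun executive-step (agent world)
  "One cycle: detect threats, then resume each suspended plan whose
wake condition is satisfied by some detected threat."
  (let ((threats (funcall (agent-threat-detector agent) world)))
    (loop for (condition . plan) in (agent-suspended-plans agent)
          when (some condition threats)
            do (funcall plan world))))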
Moreover, this shift offers a basis for addressing our original concern of how to adapt a planner to new domains. Errors that arise from the operation of the plan executor are the very sorts of errors that are most likely to yield lessons of broad applicability. Because all plans make extensive use of the components that comprise the plan executor, repairing bugs in these commonly held resources has the potential to improve the execution of any plan, regardless of the domain in which it is intended to function. More generally, this argument applies to any component of the intentional architecture: When ubiquitous functions such as threat detection are assigned to specialized components of the agent's architecture, any improvement in a particular component benefits all plans utilizing that component. Thus, learning that occurs in the context of one task or domain may subsequently yield improved performance in other tasks and domains.

In a sense, the increased specialization entailed by this approach offers opportunities to increase efficiency in much the same way that manufacturing efficiency exploits specialization of workers and equipment on an assembly line: By breaking plans up into constituent pieces, and distributing responsibility for those pieces among components of the agent specialized for those purposes, we can optimize each component for its particular purpose. To extend the analogy, when a faulty item is discovered coming out of a factory, one might simply repair that item and continue on; but it is obviously more sensible to determine where in the manufacturing process the fault was introduced, and to see whether anything can be done to avoid such problems in the future. Our thesis is that a similar approach can be applied when learning how to plan. To put this in plainer terms, when a plan fails, debug the planner, not just the plan.

A MODEL-BASED APPROACH TO DEBUGGING

From the perspective outlined above, the process of debugging an intentional system must involve determining which element of that system is responsible for an observed failure. This is a difficult problem inasmuch as the architecture of the agent is, as we have argued, a rather complex mechanism. Our approach to this problem utilizes model-based reasoning, a methodology that has been developed in AI for reasoning about and debugging complex mechanisms such as electronic circuits (see, e.g., Stallman and Sussman, 1977; Davis, 1984; deKleer and Williams, 1987). In this paradigm for debugging, the diagnostic system uses a model of the device being debugged to
generate predictions about what the behavior of the device would be if it were functioning properly. These predictions are then compared with the actual behavior of the device. When a discrepancy is detected, the diagnostic system attempts to determine which of a set of possible faults is the underlying cause of the discrepancy. A fault is expressed as the failure of an assumption in the device model. For example, the model of an electronic circuit might include a number of assumptions of the following sort: that each circuit component is working according to specification, that each connection between these components is functioning properly, and that the input and output of the circuit have certain characteristics. A circuit debugger using such a model would then generate a set of predictions, for example, that the voltage across a given resistor should have a certain value. If the measured voltage were found to disagree with the prediction, the system would try to fault one or more of the assumptions included in the model. A reasonable diagnosis might thus be, for example, that a particular transistor was not functioning as specified by the model, or that the input voltage to the circuit was not as anticipated.

The key issue in model-based debugging is inferring the faulty assumptions underlying an observed symptom. The ability to relate failed predictions to underlying assumptions in this way depends upon understanding how those predictions follow from the assumptions. Inasmuch as the performance expectations are generated by inference from the model in the first place, the most straightforward approach to this task is to record these inferences in the form of explicit justification structures (see, e.g., deKleer et al., 1977; Doyle, 1979).1 By examining these justification structures, the system can then determine which assumptions of the model are relevant to an observed failure.2

Applied to our task of learning to plan by debugging, the paradigm described above comprises the following two steps: First, a model of the agent's intentional architecture is used to generate predictions about the performance of its plans; second, deviations from these predictions are used to pinpoint where in the mechanism an observed fault lies.
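In outline, the first of these steps reduces to walking a justification structure from a failed prediction down to the model assumptions that support it; the node representation below is an assumption for illustration, not the representation used in the systems cited.

;; Sketch of retrieving the assumptions underlying a failed prediction.

(defstruct justification
  conclusion    ; the predicted fact, or a model assumption at a leaf
  antecedents)  ; list of supporting justifications; empty at leaves

(defun supporting-assumptions (node)
  "Collect the leaf assumptions on which a prediction depends; these
are the candidate faults when the prediction fails."
  (if (null (justification-antecedents node))
      (list (justification-conclusion node))
      (mapcan #'supporting-assumptions (justification-antecedents node))))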
1 One of the first applications of such explicit reasoning records to the task of learning to plan can be found in Carbonell's (1986) notion of derivational traces.
2 Of course, this does not guarantee the system's ability to diagnose the cause of the failure, since it may be ambiguous which assumption was responsible for the fault. We will discuss the use of justification structures to support debugging in more detail below.
Thus, our approach entails that an intentional agent needs a model of itself in order to adequately diagnose its failures. Such a model will enable an agent to determine for itself where the responsibility for an observed failure lies.

Determining the locus of failure is only part of the problem, however. Some way must also be found to ensure that similar failures do not occur in the future. This entails modifying either the agent, its environment, or both. Our primary concern here is with learning, that is, improving the performance of the agent itself in response to failures. Our approach to this task is again based on an analogy with man-made systems, such as engines, television sets, or factory production lines. Just as optimizing the performance of such a system involves manipulating the controls of that system in response to feedback about the system's performance, improving an agent's performance will similarly involve finding the right control parameters to manipulate, and changing their settings in response to perceived plan failures.

Consider, for example, a task that must be carried out by any real-world agent: threat detection. If the agent were, in general, detecting threats too late to respond to them effectively, despite having adequate time to do so in principle, a sensible response would be to increase the rate at which threats are checked for. On the other hand, if the agent were spending too much time checking for threats, and neglecting other tasks as a result, the parameter governing this rate should be adjusted the other way, so that less time is spent on the detection process. To provide an adequate basis for a theory of learning, then, our debugging model must support reasoning of this type, i.e., it must provide a means of identifying the controllable parameters of the agent that are relevant to a given fault.

In order for the debugger to identify relevant control parameters in this way, knowledge about those parameters and their settings must be part of the agent's model of itself. If the model includes assumptions about the correctness of the current settings of controllable parameters of the system, then the diagnosis process outlined above can, in principle, determine when the setting of a given parameter is responsible for an observed fault. Such a diagnosis suggests an obvious repair, namely, adjusting the faulty parameter setting in some way. In the example given above, for instance, the model might contain an assumption stating that the rate at which the agent checks for threats is rapid enough so that every threat that arises will be detected in time to respond to it
effectively. If this assumption is faulted, for example, when the system fails to counter some threat, then the appropriate repair is clearly to increase this rate. Thus, model-based reasoning not only provides a paradigm for fault diagnosis, it also provides a basis for a theory of repair.

THE MODEL OF THE AGENT

A central issue in our approach is the development of explicit models for intentional agents that can be used in debugging their performance. We have constructed simple models of a number of important components of an intentional agent, including threat detection, execution scheduling, projection, and case retrieval and adaptation. These models have been implemented in a computer program called CASTLE,3 and applied to learning within the context of competitive games such as chess and checkers (see, e.g., Collins et al., 1989; Birnbaum et al., 1990; Birnbaum et al., 1991; Collins et al., 1991). In this section we will describe models of two aspects of a simple planner, dealing with threat detection and execution scheduling.

A model of threat detection

The task of the threat detector is to monitor the current state of the world, looking for situations that match the description of known threats. In our approach, this matching task is modeled as a simple rule-based process: The planner's threat detection knowledge is encoded as a set of condition-action rules, with each rule being responsible for recognizing a particular type of threat. In chess, for example, the planner could be expected to possess rules specifying, among other things, the various board configurations that indicate that an attack on a piece is imminent. When a threat-detection rule is triggered, the threat description associated with that rule is passed on to the plan selection component of the system, which will attempt to formulate a response to the threat. Because the system cannot, in general, afford the cognitive resources that would be necessary to apply all known threat-detection rules to all input features at all times, the threat-detection component also includes
3 CASTLE stands for "Concocting Abstract Strategies Through Learning from Expectation failures."
two mechanisms for modulating the effort spent on this task: First, the threat-detection rules are evaluated at intervals, where the length of the interval is an adjustable parameter of the system; second, attention focusing circumscribes the domain to which the threat-detection rules will be applied.

∀x ∃t. t < ert(x) ∧ detect(x, t) → added-to(x, T(t))
(Threats are placed on a threat queue when detected.)

∀x, t1, t2. added-to(x, T(t1)) ∧ t2 < ert(x) ∧ ¬∃t′ ...
(Threats remain on the threat queue.)

T2 such that (MEMBER (MOVE K2 3 4) Q T2) fails → T2 such that (HIGHEST-PRIORITY (MOVE K2 3 4) T2) fails → ...

Figure 5: Explanation for enablement of threat. (Diagram nodes: old threats; not disabled; active threats; move selection; available opportunities.)
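Operationally, the interval mechanism and the first queue axiom might look like the following sketch; the slot names and the rule representation are illustrative assumptions, not CASTLE's actual code.

;; Sketch of the threat detector with its adjustable evaluation interval.

(defstruct threat-detector
  rules            ; condition-action rules, one per threat type
  (interval 1)     ; adjustable parameter: cycles between evaluations
  (countdown 0)
  (queue '()))

(defun detector-step (d world)
  "Evaluate the threat-detection rules every INTERVAL cycles; any
detected threat is added to the threat queue."
  (when (<= (decf (threat-detector-countdown d)) 0)
    (setf (threat-detector-countdown d) (threat-detector-interval d))
    (dolist (rule (threat-detector-rules d))
      (let ((threat (funcall rule world)))
        (when threat
          (push threat (threat-detector-queue d)))))))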
One approach to the problem of identifying such a feature set is to use some form of explanation-based learning (see, e.g., DeJong and Mooney, 1986; Mitchell, Keller, and Kedar-Cabelli, 1986; Schank, Collins, and Hunter, 1986). If the system can explain how the current threat was enabled, then this account can be used to pick out features of the current situation that can be used to identify future situations in which threats have been similarly enabled. In other words, the model's specification of the trigger conditions for focus rules—namely, that they should trigger when threats are potentially enabled—can be used as an EBL target concept (Krulwich, 1991).

The first step in this process is to construct an explanation of how the threat against the rook was enabled in this instance. This explanation (see figure 5) is centered around the following chain of assertions:

• A new threat by the opponent's bishop against the rook was enabled because
• The line of attack between the two pieces became clear because
• The opponent's pawn was moved out of the line of attack

After generalizing and adjusting the leaves of this explanation, the system uses it to construct a new rule (shown in figure 6) that will correctly focus on discovered attacks. The rule directs the system's attention to potential moves through the square vacated by the previous move. Once this rule is added to the focus rule set, the system no longer falls prey to discovered attacks, regardless of the particular pieces involved or their location on the board.

(def-brule learned-focus-method25
  (focus learned-focus-method25 ?player
         (move ?player (capture ?taken-piece) ?taking-piece
               (rc->loc ?row1 ?col1) (rc->loc ?row2 ?col2))
         (world-at-time ?time2))
  ... (rc->loc ?r-interm ?c-interm) (rc->loc ?r-other ?c-other))
  ... ?player ?goal ?time1)
  (loc-on-line ?r-interm ?c-interm ?row1 ?col1 ?row2 ?col2)
  (at-loc ?player ?taking-piece (rc->loc ?row1 ?col1)
          (- gen-time2.24 2))))

Figure 6: A new focus rule
CONCLUSION

We have argued that in order to learn to plan effectively, an agent needs an explicit model of its own planning and plan execution processes. Given such a model, the agent can pinpoint those elements of its planning and execution processes that are responsible for an observed failure to perform as expected, and can formulate a repair designed to ensure that similar failures do not occur in the future. In other words, a self-model of this sort enables an agent to determine for itself what it needs to learn from a given experience.

The proposal that learning to plan entails the use of a self-model may, at first blush, appear somewhat radical. However, the fact is that a variety of everyday planning and problem-solving behaviors straightforwardly depend upon self-models of this sort. For example, many common-sense plans, such as cooking, require waiting for the appropriate moment to perform a particular task. In such situations, human planners often employ the strategy of setting an alarm that will recall their attention to the activity when the pending task needs to be performed, thus freeing them to attend to other matters in the interim. For example, chemists often put a stopper in a test tube in which they are boiling something, so that the "Pop!" that occurs when rising pressure forces the stopper out will alert them to the fact that it is time to remove the test tube from the heat.9 To set up such an alarm, an agent must have some notion of the kinds of events that will in fact attract its attention—in other words, a model of the properties of its attention-focusing component. Such a model might, for example, specify that flashing lights, loud noises, and quick movements attract the agent's attention; that once its attention is so attracted, it will attempt to explain the cause of the event; and that if the cause is due to the agent itself, it will recall its purpose in setting up the event. Armed with this theory, the agent can decide whether a particular type of event, for instance the "Pop!" of a stopper being disgorged from a test tube, can successfully serve as an alarm. Mnemonic devices, for example the proverbial string tied around one's finger, employ similar techniques to attack a slightly different problem, that of retrieving a piece of information at the appropriate time. To develop and employ such techniques, the agent must not only have a model of its attention focusing mechanisms, but also of its memory. The ubiquity of such examples in everyday life

9 Thanks to Ken Forbus for this example.
argues that any theory of planning that aims to achieve human-level performance must ultimately come to grips with the need for self-modeling, even without taking into account the issue of learning.

Acknowledgments: We thank Matt Brand, Kris Hammond, Louise Pryor, Chris Riesbeck, and Roger Schank for many useful discussions. This work was supported in part by the Office of Naval Research under contract N00014-89-J-3217, by the Air Force Office of Scientific Research under contract AFOSR-91-0341-DEF, and by the Defense Advanced Research Projects Agency, monitored by the Air Force Office of Scientific Research under contract F49620-88-C-0058. The Institute for the Learning Sciences was established in 1989 with the support of Andersen Consulting, part of The Arthur Andersen Worldwide Organization. The Institute receives additional support from Ameritech, an Institute Partner, and from IBM.

REFERENCES

Agre, P., and Chapman, D. (1987). Pengi: An implementation of a theory of activity. Proceedings of the 1987 AAAI Conference, Seattle, WA, pp. 268-272.
Birnbaum, L., and Collins, G. (1988). The transfer of experience across planning domains through the acquisition of abstract strategies. Proceedings of the 1988 Workshop on Case-Based Reasoning, Clearwater Beach, FL, pp. 61-79.
Birnbaum, L., Collins, G., Freed, M., and Krulwich, B. (1990). Model-based diagnosis of planning failures. Proceedings of the 1990 AAAI Conference, Boston, MA, pp. 318-323.
Birnbaum, L., Collins, G., Brand, M., Freed, M., Krulwich, B., and Pryor, L. (1991). A model-based approach to the construction of adaptive case-based planning systems. Proceedings of the 1991 Workshop on Case-Based Reasoning, Washington, DC, pp. 215-224.
Brooks, R. (1986). A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, vol. 2, no. 1.
Carbonell, J. (1986). Derivational analogy: A theory of reconstructive problem solving and expertise acquisition. In R. Michalski, J. Carbonell, and T. Mitchell, eds., Machine Learning: An Artificial
Intelligence Approach, Volume II, Morgan Kaufmann, Los Altos, CA, pp. 371-392.
Intelligence Approach, Volume II, Morgan Kaufmann, Los Altos, CA, pp. 371-392. Collins, G., Birnbaum, L., and Krulwich, B. (1989). An adaptive model of decision-making in planning. Proceedings of the Eleventh IJCAI, Detroit, MI, pp. 511-516. Collins, G., Birnbaum, L., Krulwich, B., and Freed, M. (1991). Plan debugging in an intentional system. Proceedings of the Twelfth IJCAI, Sydney, Australia, pp. 353-358. Davis, R. (1984). Diagnostic reasoning based on structure and behavior. Artificial Intelligence, vol. 24, pp. 347-410. DeJong, G., and Mooney, R. 1986. Explanation-based learning: An alternative view. Machine Learning, vol. 1, pp. 145-176. deKleer, J., and Williams, B. (1987). Diagnosing multiple faults. Artificial Intelligence, vol. 32, pp. 97-130. Doyle, J. (1979). A truth maintenance system. Artificial Intelligence, vol. 12, pp. 231-272. Fikes, R., and Nilsson, N. (1971). STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, vol. 2, pp. 189-208. Firby, R. (1989). Adaptive execution in complex dynamic worlds. Research report no. 672, Yale University, Dept. of Computer Science, New Haven, CT. Hammond, K. (1989a). Case-Based Planning: Viewing Planning as a Memory Task. Academic Press, San Diego. Hammond, K. (1989b). Opportunistic memory. Proceedings of the Eleventh IJCAI, Detroit, MI, pp. 504-510. Hayes-Roth, F. (1983). Using proofs and refutations to learn from experience. In R. Michalski, J. Carbonell, and T. Mitchell, eds., Machine Learning: An Artificial Intelligence Approach, Vol 7, Tioga, Palo Alto, CA, pp. 221-240.
143
Kolodner, J. (1987). Capitalizing on failure through case-based inference. Proceedings of the Ninth Cognitive Science Conference, Seattle, WA, pp. 715-726. Krulwich, B. (1991). Determining what to learn in a multi-component planning system. Proceedings of the Thirteenth Cognitive Science Conference, Chicago, IL, pp. 102-107. Mitchell, T., Keller, R., and Kedar-Cabelli, S. (1986). Explanationbased generalization: A unifying view. Machine Learning, vol. 1, pp. 47-80. Newell, A., and Simon, H. (1963). GPS, a program that simulates human thought. In E. Feigenbaum and J. Feldman, eds., Computers and Thought, McGraw-Hill, New York, pp. 279-293. Sacerdoti, E. (1974). Planning in a hierarchy of abstraction spaces. Artificial Intelligence, vol. 5, pp. 115-132. Sacerdoti, E. (1977). A Structure for Plans and Behavior. American Elsevier, New York. Schank, R. (1982). Dynamic Memory: A Theory of Reminding and Learning in Computers and People. Cambridge University Press, Cambridge, England. Schank, R., Collins, G., and Hunter, L. (1986). Transcending inductive category formation in learning. The Behavioral and Brain Sciences, vol. 9, pp. 639-686. Simmons, R. (1988). A theory of debugging plans and interpretations. Proceedings of the 1988 AAAI Conference, St. Paul, MN, pp. 94-99. Stallman, R., and Sussman, G. (1977). Forward reasoning and dependency-directed backtracking in a system for computer-aided circuit analysis. Artificial Intelligence, vol. 9, pp. 135-196. Sussman, G. (1975). A Computer Model of Skill Acquisition. American Elsevier, New York. Wilkins, D. (1984). Domain independent planning: Representation and plan generation. Artificial Intelligence, vol. 22, pp. 269-302.
Chapter 5

LEARNING FLEXIBLE CONCEPTS USING A TWO-TIERED REPRESENTATION

R. S. Michalski, F. Bergadano1, S. Matwin2 and J. Zhang
Center for Artificial Intelligence
George Mason University
Fairfax, VA 22030

ABSTRACT

Most human concepts are flexible in the sense that they inherently lack precise boundaries, and these boundaries are often context-dependent. This chapter describes a method for representing and inductively learning flexible concepts from examples. The basic idea is to represent such concepts using a two-tiered representation. Such a representation consists of two structures ("tiers"): the Base Concept Representation (BCR), which captures explicitly the basic and context-independent concept properties, and the Inferential Concept Interpretation (ICI), which characterizes allowable concept modifications and context-dependency. The proposed method has been implemented in the POSEIDON3 system (also called AQ16), and tested on various practical problems, such as learning the concept of "Acceptable union contracts" and "Voting patterns of Republicans and Democrats in the U.S. Congress." In the experiments, the system generated concept descriptions that were both more accurate and simpler than those produced by the other methods tested, such as methods employing simple exemplar-based representations, decision tree learning, and some previous methods for rule learning.

1 On leave of absence from the University of Torino, Italy.
2 On leave of absence from the University of Ottawa, Canada.
3 The system is named after POSEIDON, the Greek god of the sea, water and waves, which represent fluidity and the changing aspects of nature.
INTRODUCTION

Typical assumptions underlying a large part of machine learning research are that concepts have precise boundaries, are context-independent, and are representable by a single symbolic description. An important consequence of this assumption is that recognizing instances of such concepts, which we call crisp, is very simple: if an instance satisfies a given concept description, then it belongs to the concept; otherwise it does not. Another common assumption is that concept instances are equally representative, that is, there is no distinction in typicality among instances. In some methods, these assumptions are partially relaxed by assigning to a concept a fuzzy set membership function (e.g., Zadeh, 1974) or a probability distribution (e.g., Cheeseman et al., 1988; Fisher, 1987). However, once such a measure is defined explicitly for a given concept, the concept has a fixed, well-defined meaning. Moreover, these methods remain unsatisfactory for coping with context-dependency, handling exceptional cases, or capturing gradual changes of knowledge about the concept properties.

When one looks at human concepts, one can see that most of them inherently lack precisely defined boundaries, and that their meaning is often context-dependent. Although on the surface these properties can be viewed as undesirable, one can argue that they contribute to the cognitive economy of human knowledge representations (Michalski, 1987). Our view is that this imprecision and context-dependency can be more adequately captured by rules of inference and flexible concept matching than by a probability distribution or a numerical set membership function. In other words, we postulate that the imprecision and context-dependency often have a logical, rather than a probabilistic, character. This is supported by the observation that people usually decide about the concept membership of borderline instances through inference, that is, by reasoning from general knowledge, drawing an analogy, or performing induction, rather than by conducting a statistical analysis.

Examples of human concepts can often be characterized by a degree
of typicality in representing the concept. For example, a robin is usually viewed as a more typical bird than a penguin or an ostrich. Typicality is usually viewed as the degree to which an instance shares the common concept properties. Another property of concepts is that in different contexts they may have different meanings. For example, the concept "bird" may apply to a live, flying bird, a sculpture, a chick hatching out of the egg, or even an airplane. Thus, human concepts are flexible: their boundaries have a certain degree of fluidity, and can change with the context in which the concepts are used. It is clear that in order to learn such concepts, machine learning systems need to employ richer concept representations than are currently used.

This chapter describes an approach to learning flexible concepts based on the idea of two-tiered representation (TT), proposed by Michalski (1987). In this representation, a concept is described by two structures ("tiers"), the base concept representation (BCR) and the inferential concept interpretation (ICI). The BCR defines explicitly the basic properties of the concept, while the ICI describes implicitly, through rules and matching procedures, the allowed modifications of the explicit meaning, and its changes or extensions in different contexts. In the general definition of the two-tiered representation, the "distribution" of the meaning between the two tiers is not fixed, but depends on the properties of the reasoning agent and on the criteria for evaluating the quality of concept descriptions. In the instantiation of the two-tiered approach that applies to modeling human concept representation, the BCR is assumed to describe the most typical, common, and intended meaning of a concept, while the ICI handles the exceptional or borderline cases, and context-dependency (Michalski, 1990). The ICI for specific concepts is often inherited from more general concepts.

Early ideas, experiments and the first method for learning two-tiered concept representations were presented in (Michalski et al., 1986; Michalski, 1988; and Michalski, 1990). The general idea was to induce, in the first step, a concept description that is a complete and consistent characterization of all training examples. Such a description is often
overly complex and performs poorly on new examples if the concept has flexible and/or complex borders, or the examples are noisy. Therefore, in the second step, such descriptions are simplified or optimized according to some criterion of description "quality." The method employed a simple form of description simplification, called TRUNC, which removes those parts of the description that cover only a small fraction of the examples (the so-called light disjuncts, or light rules). Such a description change can be logically interpreted as a specialization operation. As the ICI, the method applied a flexible matching procedure. An intriguing result of that research was that the description's complexity was substantially reduced without affecting its performance on new examples.

The new method, described here, significantly extends these early ideas. One important advance is the development of a heuristic double-level search procedure, called TRUNC-SG, which explores the space of two-tiered descriptions to derive a globally optimized description. The search employs both generalization and specialization operators, and is guided by a new criterion, the general description quality measure (GDQ). This measure considers the accuracy of the description, the computational cost of both tiers (the Base Concept Representation and the Inferential Concept Interpretation), and its cognitive comprehensibility (Bergadano et al., 1988). By introducing such a general description quality measure, any form of concept learning can be viewed as a process of modifying the input concept description in order to maximize a given description quality measure. The initial concept description can be in the form of positive examples only, positive and negative examples, a complete and/or consistent concept description, an initial description supplied by a teacher, an abstract concept definition (as in explanation-based learning), or a combination of these forms.

Another advance is that flexible matching is used not only in the recognition process, as in (Michalski et al., 1986), but also in the learning process, i.e., in searching for high "quality" concept descriptions. This feature also distinguishes the method from the related work described in (Bergadano and Giordana, 1989), which does not involve deductive
reasoning in the learning phase, and evaluates the performance of generated descriptions solely on the basis of the coverage of examples. These earlier approaches may be compared to using one's hands to learn how to row a boat, and then using oars in the performance phase. The idea that learning is more effective if one uses the same instruments in the learning and performance phases was also present in some incremental learning systems (e.g., Fisher, 1987).

The work described here also represents an important advance over tree-pruning techniques (e.g., Quinlan, 1987). These techniques apply a much more restrictive description reduction operator (a tree-pruning operator that performs a generalization of the class replacing the pruned subtree, and a specialization of the other classes), and do not use deductive matching or a flexible interpretation of the learned descriptions. Other advances include the ability to take into consideration the typicality of training instances (when it is known), and the use of a rule base for the Inferential Concept Interpretation.

This chapter describes the basic ideas of two-tiered representation, the proposed method, and experimental results from comparing it with several other methods, such as variants of exemplar-based learning, decision tree learning, learning complete and consistent descriptions, and the earlier method using two-tiered representation based on the TRUNC procedure. The experiments have shown that the proposed method compares favorably with the other methods: the descriptions it learned were both simpler and more accurate in classifying testing examples.

TWO-TIERED CONCEPT REPRESENTATION

Motivation and Definition

Traditional work on concept representation has assumed that the whole meaning of a concept resides in a single structure, e.g., a semantic network, a logic-based description, or a decision tree. Such a structure is expected to capture all relevant properties of the concept(s) and define the
concept boundary (e.g., Collins and Quillian, 1972; Minsky, 1975; Smith and Medin, 1981; Sowa, 1984). When concepts have flexible boundaries, or the learning examples contain a considerable amount of noise, it may be advantageous to construct a concept representation that is partially inconsistent and/or incomplete with regard to the given examples. This idea was confirmed by the work on pruning decision trees (Quinlan, 1987), in the HILLARY system (Iba et al., 1988), and in the work on two-tiered representation (Michalski, 1987; and this chapter).

In traditional approaches, the recognition of a concept instance is typically done by directly matching the instance description with the stored concept representation. Such matching may include comparing feature values in an instance with those in the concept description, or tracing links in a semantic network, but it is not assumed to involve any complex inferential processes. More recently, researchers working on exemplar-based reasoning (e.g., Bareiss, 1989; Kolodner, 1988; and Hammond, 1989) have proposed various inference mechanisms for classifying new instances. In these methods, however, the concept representation consists of stored examples (cases). Such a representation taxes memory, and makes it difficult to compare different concepts.

The two-tiered representation employs a general concept description (BCR), and an inference mechanism (ICI) for matching the description with instances. Such a concept representation can be much simpler than one that stores individual examples, or their independent generalizations. The BCR can be viewed as a characterization of the "central tendency" of a concept; it contains the most relevant properties, and specifies the basic intention behind the concept. The ICI handles special cases, exceptions4 and context-dependency. It treats them either by extending the base concept representation (concept extension), or by specializing it (concept contraction). This process involves the background knowledge and relevant inference rules contained in the ICI. Inference allows the

4 The term "exceptions" is used here in its colloquial meaning. The subsection Types of Match gives it a precise meaning.
recognition, extension or modification of the concept meaning according to its context. When an unknown entity is to be recognized, it is first matched against the Base Concept Representation. Then, depending on the outcome, the entity may be related to the concept's inferential extensions or contractions. A simple inferential matching can be merely a probabilistic inference based on some measure of similarity, e.g., the flexible matching method (Michalski et al., 1986). Advanced matching may involve any kind of inference: deductive, analogical or inductive.

Let us illustrate the idea of two-tiered representation using the concept of "chair."

BCR:
Superclass: A piece of furniture.
Function: To seat one person.
Structure: A seat supported by legs, and a backrest attached from the side.
Physical properties: The number of legs is usually four. Often made of wood. The height of the seat is usually about 14-18 inches from the end of the legs, etc. (The BCR may also include a picture or 3D models of typical chairs.)

ICI:
Possible variations of the properties in the BCR: The number of legs can vary from one to four. The legs may be replaced by any support. The shape of the seat, the legs and the backrest, and the material of which they are made are irrelevant, as long as the function is preserved. The backrest may be very small or missing, etc.
Context dependency:
Context = museum exhibit --> the chair is not used for seating persons any more.
Context = toys --> the size can be much smaller than stated in the BCR. The chair does not serve for seating persons, but correspondingly small dolls.
Special cases:
If legs are replaced by wheels --> type(chair) = wheelchair
Chair without a backrest --> type(chair) = stool
Chair with armrests --> type(chair) = armchair

This simple example illustrates several important features of two-tiered representation. Commonly occurring cases of chairs match the BCR completely, and the ICI does not need to be involved. For such cases, the recognition time can thus be reduced. The BCR is not the same as a description of a prototype (e.g., Rosch and Mervis, 1975), as it can be a
generalization characterizing different typical cases, or a set of different prototypes. The ICI does not represent only distortions or corruptions of the prototype; it can also describe some radically different cases. When an entity does not satisfy the base representation of any relevant concept (which concepts are relevant is indicated by the context of discourse), or satisfies the base representation of more than one concept, the ICI is involved. The ICI can be changed, upgraded or extended without any change to the Base Concept Representation. While BCR-based recognition involves just direct matching, ICI-based recognition can involve a variety of transformations and any type of inference.

The ideas of two-tiered representation are supported by research on the so-called transformational model (Smith and Medin, 1981). In this model, matching object features with concept descriptions may transform object features into those specified in the concept description. Such matching is inferential. Some recent work in cognitive linguistics also seems to support the ideas of two-tiered representation. For example, Lakoff (1987), in his idealized cognitive models approach, stipulates that humans represent concepts as a structure which includes a fixed part and mappings that modify it. The fixed part is a propositional structure, defined relative to some idealized model. The mappings are metaphoric or metonymic transformations of the concept's meaning.

As mentioned before, in the general two-tiered model, the distribution of the concept meaning between the BCR and the ICI can vary, depending on the criterion of concept description quality. For example, the BCR can be just concept examples, and the ICI can be a procedure for inferential matching, as used in the case-based reasoning approach. Consequently, the case-based reasoning approach can be viewed as a special case of the general two-tiered representation.

Concept Representation Language

In the proposed method, the formalism used for concept representation is based on the variable-valued logic system VL1 (Michalski, 1975). This formalism allows us to express simply and
implemented, F maps events from the set E and concept descriptions from the set D into a degree of match from the interval [0..1]:

F: E x D --> [0..1]

The value of F for an event e and a concept description D is defined as the probabilistic sum of F for its rules. Thus, if D consists of two rules, r1 and r2, we have:

F(e, D) = F(e, r1) + F(e, r2) - F(e, r1) x F(e, r2)

A weakness of the probabilistic sum is that it is biased toward descriptions with many rules. If a concept description D has a large number of rules, the value of F(e, D) may be close to 1 even if F(e, r) for each rule r is relatively small (see Table 4). To avoid this effect, if the value of F(e, r) falls below a certain threshold, then it is assumed to be 0. (In our method this problem does not occur, because concept descriptions are typically reduced to only a few rules; see the TRUNC-SG procedure in the subsection Basic Algorithm.)

The degree of match, F(e, r), between an event e and a rule r is defined as the average of the degrees of fit for its constituent conditions, weighted by the proportion of positive examples to all examples covered by the rule:

F(e, r) = (Σi F(e, ci) / n) x #rpos / (#rpos + #rneg)
where F(e, ci) is the degree of match between the event e and the condition ci in the rule r, n is the number of conditions in r, and #rpos and #rneg are the number of positive examples and the number of negative examples covered by r, respectively.

The degree of match between an event and a condition depends on the type of the attribute in the condition. Four types of attributes are distinguished: nominal, structured-nominal, linear and structured-linear (Michalski and Stepp, 1983). Values of a structured-nominal (linear) attribute are nodes of an unordered (ordered) generalization hierarchy. In an ordered hierarchy, the children nodes of any parent node constitute a totally ordered set.
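To make the matching functions above concrete, here is a minimal sketch in Python. It is an illustration, not POSEIDON's implementation: the rule representation and the threshold value 0.2 are assumptions, and only nominal conditions are handled (linear conditions would return a distance-based degree of match, as described below).

    def match_condition(event, cond):
        # nominal condition: 1 if the event's value is in the referent, else 0
        attr, allowed = cond
        return 1.0 if event[attr] in allowed else 0.0

    def match_rule(event, conds, n_pos, n_neg):
        # F(e, r): average condition match, weighted by the rule's purity
        avg = sum(match_condition(event, c) for c in conds) / len(conds)
        return avg * n_pos / (n_pos + n_neg)

    def match_description(event, rules, threshold=0.2):
        # F(e, D): probabilistic sum of rule matches; matches below the
        # threshold are treated as 0 to counter the many-rules bias
        f = 0.0
        for conds, n_pos, n_neg in rules:
            fr = match_rule(event, conds, n_pos, n_neg)
            if fr < threshold:
                fr = 0.0
            f = f + fr - f * fr
        return f

    # example: one rule [color = red v blue] & [legs = 4], covering 20
    # positive and 2 negative examples
    rules = [([("color", {"red", "blue"}), ("legs", {4})], 20, 2)]
    print(match_description({"color": "red", "legs": 3}, rules))  # ~0.45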
In a nominal or structured-nominal condition, the referent is a single value or an internal disjunction of values, e.g., [color = red v blue v green]. The degree of match is 1 if such a condition is satisfied by an event, and 0 otherwise. In a linear or structured-linear condition, the referent is a range of values, or an internal disjunction of ranges, e.g., [weight = 1..3 v 6..9]. A satisfied condition returns the value of match 1. If the condition is not satisfied, the degree of match is a decreasing function of the distance between the value and the nearest end-point of the interval. If the maximum degree of match between an example and all the candidate concepts is smaller than a preset threshold, the result is "no match."

Inferential Concept Interpretation: Deductive Rules

In addition to flexible matching, the Inferential Concept Interpretation includes a set of deductive rules that allow the system to recognize exceptions and context-dependent cases. For example, flexible matching allows an agent to recognize an old sequoia as a tree, although it does not match the typical size requirements. Deductive reasoning is required to recognize a tree without leaves (in the winter time), or to include in the concept of tree its special instances (e.g., a fallen tree). In fact, flexible matching is most useful for covering instances that are close to the typical case, while deductive matching is appropriate for dealing with the concept transformations necessary to include exceptions, or to take into consideration context-dependency.

The deductive inference rules in the Inferential Concept Interpretation are expressed as Horn clauses. The inference process is implemented using the LOGLISP system (Robinson and Sibert, 1982). Numerical quantifiers and internal connectives are also allowed. They are represented in the annotated predicate calculus (Michalski, 1983).

Types of Match. The method recognizes three types of match between an event and a two-tiered description:

1. Strict match: An event matches the Base Concept Representation exactly, and is said to be S-covered.
2. Flexible match: An event is not S-covered, but matches the Base Concept Representation through a flexible matching function. In this case, the event is said to be F-covered.

3. Deductive match: An event is not F-covered, but matches the concept through a deductive inference using the Inferential Concept Interpretation rules. In this case, the event is said to be D-covered. (In general, this category could be extended to include also matching by analogy and induction; Michalski, 1989.)

The above concepts provide a basis for proposing a precise definition of classes of concept examples that are usually characterized only informally. Specifically, examples that are S-covered are called representative examples; examples that are F-covered are called nearly-representative examples; and examples that are D-covered are called exceptions. As mentioned earlier, one of the major advances of the presented method over previous methods using two-tiered representation (e.g., Michalski et al., 1986) is that the Inferential Concept Interpretation includes not only a flexible matching procedure, but also inference rules. Thus, using our newly introduced terminology, we can say that the method can handle not only representative or nearly-representative examples, but also exceptions.
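The three types of match can also be summarized procedurally. The sketch below builds on the matching functions sketched earlier; representing the ICI rules as boolean predicates and the flexible-match threshold of 0.5 are assumptions made purely for illustration.

    def coverage_type(event, bcr_rules, ici_rules, f_threshold=0.5):
        # S-covered: the event strictly satisfies some BCR rule
        for conds, _, _ in bcr_rules:
            if all(match_condition(event, c) == 1.0 for c in conds):
                return "S"  # representative example
        # F-covered: flexible match of the BCR above a preset threshold
        if match_description(event, bcr_rules) >= f_threshold:
            return "F"      # nearly-representative example
        # D-covered: the event is derivable via the ICI inference rules
        if any(rule(event) for rule in ici_rules):
            return "D"      # exception
        return None         # no match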
AN OVERVIEW OF THE POSEIDON SYSTEM

Basic Algorithm

The ideas presented above have been implemented in a system called POSEIDON (also called AQ16). Table 1 presents the two basic phases in which the system learns the Base Concept Representation. The first phase generates a general, consistent and complete concept description, and the second phase optimizes this description according to a General Description Quality measure. The optimization is done by applying different description modification operators.

Phase 1
  Given:     Concept examples obtained from some source
             Relevant background knowledge
  Determine: A complete and consistent description of the concept

Phase 2
  Given:     A complete and consistent description of the concept
             A general description quality (GDQ) measure
             The typicality of examples (if available)
  Determine: The Base Concept Representation that maximizes GDQ

Table 1. Basic Phases in Generating the BCR in POSEIDON.

The search process is defined by:

Search space: A tree structure, in which nodes are two-tiered concept descriptions (BCR + ICI).
Operators: Condition removal, rule removal, referent modification.
Goal: Determine a description that maximizes the general description quality criterion.

The complete and consistent description is determined by applying the AQ inductive learning algorithm (using the program AQ15; Michalski et al., 1986). The second phase improves this description by conducting a "double-level" best-first search. This search is implemented by the TRUNC-SG procedure ("SG" symbolizes the fact that the method uses both specialization and generalization operators). In this "double-level" search, the first level is guided by a general description quality measure, which ranks candidate descriptions. The second-level search is guided by heuristics controlling the search operators to be applied to a given description. The search operators simplify the description by removing some of its components, or by modifying the arguments or referents of
some of its predicates. A general structure of the system is presented in Figure 1.
[Figure 1. Learning Phases in POSEIDON: a source of examples feeds Phase 1, which generates a consistent and complete description (AQ15); Phase 2 computes the description quality and optimizes the description.]
Figure 1. Learning Phases in POSEIDON.

The goal of the search is not necessarily to find an optimal solution, as this would require a combinatorial search. Rather, the system tries to maximally improve the given concept description by expanding only a limited number of nodes in the search tree. The nodes to be expanded are suggested by various heuristics, discussed below. The BCR is learned from examples. The Inferential Concept Interpretation contains two parts: a flexible matching function and a rule base. The rule base contains rules that explain exceptional examples, and is acquired through an interaction with an expert.
Operators for Optimizing the Base Concept Representation

A description can be modified using three general operators: rule removal, condition removal and referent modification. The rule removal operator removes one or more rules from a ruleset. This is a specialization operator, because it leads to "uncovering" some examples. It is the reverse of the "adding an alternative" generalization rule (Michalski, 1983). Condition removal (from a rule) is a generalization operator, as it is equivalent to the "dropping condition" generalization rule. The referent modification operator changes the referent in a condition (i.e., the set of attribute values stated in the condition). Such changes can either generalize or specialize a description. Consequently, two types of referent modification operators are defined: condition extension, which generalizes the description, and condition contraction, which specializes the description.

To illustrate these two types of referent modification, consider the condition [size = 1..5 v 7]. Changing this condition to [size = 1..7] represents a condition extension operator. Changing it to [size = 1..5] represents a condition contraction operator. On the other hand, if the initial condition is [size ≠ 1..5 v 7], then changing it to [size ≠ 1..7] represents a condition contraction operator. Similarly, changing it to [size ≠ 1..5] represents a condition extension operator. A summary of the effect of the different operators on a description is given in Table 2:

Search operator                 Type of knowledge modification
Rule removal (RR)               Specialization
Condition removal (CR)          Generalization
Condition extension (CE)        Generalization
Condition contraction (CC)      Specialization

Table 2. Search operators and their effect on the description.
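To make the four operators concrete, here is an illustrative sketch (not POSEIDON's code), with a rule represented as a list of (attribute, set-of-allowed-values) conditions; for simplicity, the referents are plain value sets rather than VL1 ranges.

    def rule_removal(ruleset, i):
        # RR: specialization - drop the i-th rule ("removing an alternative")
        return ruleset[:i] + ruleset[i+1:]

    def condition_removal(rule, j):
        # CR: generalization - drop the j-th condition
        return rule[:j] + rule[j+1:]

    def condition_extension(rule, j, extra_values):
        # CE: generalization - enlarge the referent, e.g. [size = 1..5 v 7]
        # becomes [size = 1..7]
        attr, allowed = rule[j]
        return rule[:j] + [(attr, allowed | set(extra_values))] + rule[j+1:]

    def condition_contraction(rule, j, dropped_values):
        # CC: specialization - shrink the referent, e.g. [size = 1..5 v 7]
        # becomes [size = 1..5]
        attr, allowed = rule[j]
        return rule[:j] + [(attr, allowed - set(dropped_values))] + rule[j+1:]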
Thus, applying the above search operators can either specialize or generalize the given description. A generalized (specialized) description potentially covers a larger (smaller) number of training examples, which can be positive or negative. At any given search step, the algorithm chooses an operator on the basis of an evaluation of the changes in coverage caused by applying the operator (see the Basic Algorithm subsection).

Learning the Inferential Concept Interpretation

As indicated above, by applying a search operator (RR, CR, CE or CC) to the current Base Concept Representation, one can make it either more general or more specific. If the modified representation is more specific, some positive examples previously covered may cease to be S-covered. These examples may, however, still be covered by the existing Inferential Concept Interpretation (and thus would become F-covered or D-covered). On the other hand, if the modified base representation is more general than the original one, some negative examples, previously uncovered, may now become S-covered. They may, however, remain excluded by the existing Inferential Concept Interpretation rules.

Consequently, two types of rules in the Inferential Concept Interpretation can be distinguished: rules that cover positive examples left uncovered by the base representation ("positive exceptions"), and rules that eliminate negative examples covered by the base representation ("negative exceptions"). A problem, then, is how to acquire these rules. The rules can be supplied by an expert, inherited from higher-level concepts, or deduced from other knowledge. If the rules are supplied by an expert, they may not be operationally effective, but they can be made so through analytic learning (e.g., Mitchell et al., 1986; Prieditis and Mostow, 1987). If the expert-supplied rules are too specific or only partially correct, they may be improved inductively (e.g., Michalski and Larson, 1978; Dietterich and Flann, 1988; Mooney and Ourston, 1989). Thus, in general, rules for the Inferential Concept Interpretation can be developed by different strategies.
In the implemented method, the system identifies exceptions (i.e., examples not covered by the Base Concept Representation), and asks an expert for a justification. The expert is required to express this justification in the form of rules. The search procedure, shown in Figure 1, guides the process by determining the examples that require justification. This way, the role of the program is to learn the "core" part of the concept from the supplied examples, and to identify the exceptional examples. The role of the teacher is to provide concept examples, and to justify why the examples identified by the learning system as exceptions are also members of the concept class.

QUALITY OF CONCEPT DESCRIPTIONS

Factors Influencing the Description Quality

The learning method utilizes a general description quality measure that guides the search for an improved two-tiered description. The General Description Quality measure takes into consideration three basic characteristics of a description: its accuracy, its comprehensibility, and its cost. This section discusses these three components, and describes a method for combining them into a single measure.

The accuracy expresses the description's ability to produce correct classifications. Major factors in estimating the description's predictive power are its degrees of completeness and consistency with regard to the input examples. When learning from noisy examples, however, achieving a high degree of completeness and consistency may lead to an overly complex and overspecialized description. Such a description may be well tuned to the particular training set, but may perform poorly in classifying future examples. For that reason, when learning from imperfect inputs, it may be better to produce descriptions that are only partially complete and/or consistent.

If an intelligent system is supposed to give advice to humans, the knowledge used by such a system should be comprehensible to human experts. A "black box" classifier, even one with high predictive power, is not satisfactory in such situations. To be comprehensible, a description should
involve terms, relations and concepts that are familiar to experts, and be syntactically simple. This requirement is called the comprehensibility principle (Michalski, 1983). Since there is no established measure of a description's comprehensibility, we approximate it by its representational simplicity. Such a measure is based on the number of different operators involved in the description: disjunctions, conjunctions, and the relations embedded in individual conditions. In the case of two-tiered representations, the measure takes into account the operators occurring in both the BCR and the ICI, and weighs the relative contribution of each part to the comprehensibility of the whole description.

The third criterion, the description cost, captures the cost of storing the description and using it in computations to make a decision. Other things being equal, descriptions that are easier to store and easier to use for recognizing new examples are preferred. When evaluating the description cost, two characteristics are of primary importance. The first is the cost of measuring the values of the variables occurring in the description. In some application domains, e.g., in medicine, this is a very important factor. The second characteristic is the computational cost (time and space) of evaluating the description. Again, in some real-time applications, e.g., in speech or image recognition, there may be stringent constraints on the evaluation time. The cost and the comprehensibility of a description are frequently mutually dependent, but generally these are different criteria.

The criteria described above need to be combined into a single evaluation measure that can be used to compare different concept descriptions. One solution is to have an algebraic formula that, given numeric evaluations of the individual criteria, produces a number that represents their combined value. Such a formula may involve, e.g., a multiplication, a weighted sum, a maximum/minimum, or a t-norm/t-conorm of the component criteria (e.g., Weber, 1983).

Although the above approach is often appropriate, it also has significant disadvantages. First, it combines a set of heterogeneous
evaluations into a single number, and the meaning of this final number is hard for a human expert to understand. Second, it usually forces the system to evaluate all the criteria for each description, even when it would be sufficient to compare descriptions on the basis of just one or two of the most important ones. The latter situation occurs when one description is so much better than another according to some important criterion that it is not worthwhile to even consider the alternatives. To overcome these problems, we use a combination of a lexicographic evaluation and a linear function-based evaluation, which is described in the next section.

Combining Individual Factors Into a Single Preference Criterion

Given a set of candidate descriptions, we use the General Description Quality criterion to select the "best" description. Such a criterion consists of two measures, the lexicographic evaluation functional (LEF) and the weighted evaluation functional (WEF). The LEF, which is computationally less expensive than the WEF, is used to rapidly focus on a subset of the most promising descriptions. The WEF is used to select the final description. A general form of a LEF (Michalski, 1983) is:

LEF: <(Criterion1, T1), (Criterion2, T2), ..., (Criterionk, Tk)>

where Criterion1, Criterion2, ..., Criterionk are elementary criteria used to evaluate a description, and T1, T2, ..., Tk are the corresponding tolerances, expressed in %. The criteria are applied to every candidate description in order from left to right (reflecting their decreasing importance). At each step, all candidate descriptions whose score on a given criterion is within the tolerance range of the best-scoring description on this criterion are considered equivalent with respect to this criterion, and are kept on the CANDIDATE LIST; the other descriptions are discarded. If only one description remains on the list, it is chosen as the best. If the list is non-empty after applying all criteria, a standard solution is to choose the description that scores highest on the first criterion. In POSEIDON, we choose another approach in the latter case (see below).

The LEF evaluation scheme is not affected by the problems of using a linear function evaluation mentioned above.
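Here is a minimal sketch of the LEF filtering step, an illustration rather than POSEIDON's code; it assumes each elementary criterion returns a positive score in which higher is better, so that a tolerance of, say, 0.15 keeps every description within 15% of the best score on that criterion.

    def lef_filter(candidates, criteria, tolerances):
        # apply (criterion, tolerance) pairs left to right, keeping only
        # descriptions within tolerance of the best score at each step
        for crit, tol in zip(criteria, tolerances):
            best = max(crit(d) for d in candidates)
            candidates = [d for d in candidates if crit(d) >= best * (1 - tol)]
            if len(candidates) == 1:
                break  # a single best description remains
        return candidates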
The importance of a criterion depends on the order in which it is evaluated in the LEF, and on its tolerance. Each application of an elementary criterion reduces the CANDIDATE LIST, and thus the subsequent criterion needs to be applied only to a reduced set. This makes the evaluation process very efficient. In POSEIDON, the default LEF consists of the three elementary criteria discussed above, i.e., the accuracy, the representational simplicity and the description cost, specified in that order. The next section describes them in detail. The tolerances are program parameters, and are set by the user. If the tolerance for some criterion is too small, the chances of using the remaining criteria decrease. If the tolerance is too large, the importance of the criterion is decreased. For this reason, the LEF criteria in POSEIDON are applied with relatively large tolerances, so that all the elementary criteria are taken into account.

If, after applying the last criterion, the CANDIDATE LIST still has several candidates, the final choice is made according to a weighted evaluation functional (WEF). The WEF is a standard linear function of the elementary criteria. The description with the highest WEF is selected. Thus, the above approach uses a computationally efficient LEF to obtain a small candidate set, and then applies a more complex measure to select the best description from it.

Taking the Typicality of Examples into Consideration

Accuracy is a major criterion in determining the quality of a concept description. In determining accuracy, current machine learning methods usually assume that it depends only on the number of positive and negative examples (training and/or testing) correctly classified by the description. One can argue, however, that in evaluating accuracy one might also take into consideration the typicality of the examples (Rosch and Mervis, 1975). If two descriptions cover the same number of positive and negative examples, the one that covers more typical positive examples and fewer typical negative examples can be considered more accurate. For the above reason, we propose a measure of completeness and
consistency of a description that takes into account the typicality of the examples. In POSEIDON, the typicality of examples can be obtained in one of two ways. The first way is that the system estimates it by the frequency of the occurrence of examples in the data (notice that this is different from the usual cognitive measure of typicality, which captures primarily the degree to which an example resembles a prototypical example). The second way is that the typicality of examples is provided by the expert who supplies the training examples. If the typicality is not provided, the system makes the standard assumption that the typicality is the same for all examples.

In the measures below, the degree of completeness of a description is proportional to the typicality of the positive events covered, and the consistency is inversely proportional to the typicality of the negative events covered5. Since the system is working with a two-tiered description, other factors are taken into account. One is that, according to the idea of two-tiered representation, a "high quality" concept description should cover the typical examples explicitly, and the non-typical ones only implicitly. Thus, the typical examples should be covered by the Base Concept Representation, and the non-typical, or exceptional, ones by the Inferential Concept Interpretation.

In POSEIDON, the Base Concept Representation is inductively learned from examples provided by a teacher. Therefore, the best performance of the system will be achieved if the training set contains mostly typical examples of the concept being learned. For the exceptional examples, the teacher is expected to provide rules that explain them. These rules become part of the Inferential Concept Interpretation. An advantage of such an approach is that the system learns a description of the typical examples by itself, and the teacher needs to explain only the special cases.

5 When negative examples are instances of another concept, as is often the case, their typicality is understood as the typicality of being members of that other concept.
In view of the above, the examples covered explicitly (strictly-covered, or S-COV) are assumed to contribute to the completeness of a description more than those flexibly-covered (F-COV) or deductively-covered (D-COV).

General Description Quality Measure

This section defines the General Description Quality (GDQ) measure implemented in POSEIDON. As mentioned above, the measure combines the accuracy, the representational simplicity and the cost of a description. The accuracy is based on two factors, the typicality-based completeness, T_COM, and the typicality-based consistency, T_CON. These two factors are defined for a two-tiered concept description, D, as follows:

T_COM(D) = [ Σ(e+ in S-cov) ws x Typ(e+) + Σ(e+ in F-cov) wf x Typ(e+) + Σ(e+ in D-cov) wd x Typ(e+) ] / Σ(e+ in POS) Typ(e+)

T_CON(D) = 1 - [ Σ(e- in S-cov) ws x Typ(e-) + Σ(e- in F-cov) wf x Typ(e-) + Σ(e- in D-cov) wd x Typ(e-) ] / Σ(e- in NEG) Typ(e-)

where POS and NEG are the sets of positive and negative examples, respectively; S-cov, F-cov and D-cov denote the sets of examples that are strictly, flexibly and deductively covered by the two-tiered concept description D; and Typ(e) expresses the degree of typicality of example e for the given concept. The weights ws, wf, and wd represent the different significance of the types of coverage (S-COV, F-COV, and D-COV). The thresholds t1 and t2 reflect the desirability of a given type of coverage for a given degree of typicality:

ws = 1 if Typ(e) > t2, else w
wf = 1 if t2 >= Typ(e) > t1, else w
wd = 1 if t1 >= Typ(e), else w

where the thresholds t1 and t2 satisfy the relation 0 < t1 < t2 <= 1, and 0 < w < 1.
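As an illustration, with assumed parameter values t1 = 0.3, t2 = 0.7 and w = 0.5, and with each example given as a (typicality, coverage-tag) pair, the typicality-based completeness could be computed as follows; this is a sketch, not POSEIDON's code.

    def coverage_weight(tag, typ, t1=0.3, t2=0.7, w=0.5):
        # full weight when the kind of coverage (S, F or D) is the one
        # expected for this degree of typicality; discounted weight w otherwise
        expected = "S" if typ > t2 else ("F" if typ > t1 else "D")
        return 1.0 if tag == expected else w

    def t_completeness(pos_examples):
        # pos_examples: list of (typicality, tag) pairs, with tag in
        # {"S", "F", "D"} or None if the example is not covered
        covered = sum(coverage_weight(tag, typ) * typ
                      for typ, tag in pos_examples if tag is not None)
        return covered / sum(typ for typ, _ in pos_examples)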
The role of w is to decrease the weight of the examples that are covered in a way (S, F or D) that is not compatible with their typicality. Using the terms T_COM and T_CON, the description accuracy is defined as:

Accuracy = w1 x T_COM + w2 x T_CON

where w1 + w2 = 1. The weights w1 and w2 reflect the expert's judgment about the relative importance of completeness and consistency for the given problem. The default value of both is 0.5.

A measure of the comprehensibility of a concept description is difficult to define. As mentioned earlier, we approximate it by the representational simplicity, defined as:

RepSimplicity(D) = TC - (v1 x Σ(op in BCR(D)) C(op) + v2 x Σ(op in ICI(D)) C(op))

where TC is the sum of the complexities of all operators in the description D, BCR(D) is the set of all operator occurrences in the BCR of the description, and ICI(D) is the set of all operator occurrences in the ICI. C(op), the complexity of an operator, is a real function that maps each operator symbol into a real number representing its complexity. The complexities of the operators are chosen by an expert, subject to the following constraints:

C(range) < C(internal v) < C(=) < C(≠) < C(&) < C(v) < C(=>)

When the operator is a predicate, C increases with the number of its arguments. The parameters v1 and v2 represent the relative weights of the operators in the BCR and the ICI, respectively, with v1 + v2 = 1. The Base Concept Representation is supposed to describe the general and easy-to-define meaning of the concept, while the Inferential Concept Interpretation is mainly used to handle rare or exceptional events. As a consequence, the Base Concept Representation should be easier to comprehend than the Inferential Concept Interpretation, and thus v1 should be larger than v2.
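A small sketch of this measure follows; the numeric complexity values and the values of v1 and v2 are assumptions for illustration, not values given in the chapter.

    # illustrative operator complexities, ordered as in the constraint above
    C = {"range": 1, "internal_or": 2, "=": 3, "!=": 4, "&": 5, "or": 6, "=>": 7}

    def rep_simplicity(bcr_ops, ici_ops, v1=0.7, v2=0.3):
        # bcr_ops, ici_ops: lists of operator symbols occurring in each tier
        tc = sum(C[op] for op in bcr_ops) + sum(C[op] for op in ici_ops)
        weighted = v1 * sum(C[op] for op in bcr_ops) + v2 * sum(C[op] for op in ici_ops)
        return tc - weighted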
The cost of a description D depends on two factors:

• Measuring-Cost (MC), the cost of measuring the variables used in the concept description:

MC(D) = Σ(e in Pos+Neg) Σ(v in Vars(e)) mc(v) / (|Pos| + |Neg|)

• Evaluation-Cost (EC), the cost of evaluating the concept description:

EC(D) = Σ(e in Pos+Neg) ec(e) / (|Pos| + |Neg|)

where Vars(e) is the set of all variables occurring in the concept description, mc(v) is the cost of measuring the value of the variable v, and ec(e) is the computational cost of evaluating the concept description to classify the event e. The latter depends on the computing time and/or on the number of operators involved in the evaluation. We now define the cost of a description:

Cost(D) = u1 x MC(D) + u2 x EC(D)

where u1 and u2 are weights defining the relative importance of the measuring-cost and the evaluation-cost for a given problem.

The general description quality (GDQ) measure is in the form of a Lexicographic Evaluation Functional (LEF), in which the above-defined concepts of accuracy, representational simplicity and description cost are used as the elementary criteria. The tolerances and the other parameters defined above can be chosen by the user to reflect the problem domain, or determined experimentally. They also have default values, so the user does not have to specify them. More details about the general description quality measure can be found in (Bergadano et al., 1988).
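A sketch of the two cost components and their combination is given below; the per-variable cost table and the helper names are assumptions made for illustration.

    def measuring_cost(events, vars_of, mc):
        # MC(D): average, over all examples, of the cost of measuring the
        # variables needed to classify each example
        return sum(mc[v] for e in events for v in vars_of(e)) / len(events)

    def evaluation_cost(events, ec):
        # EC(D): average computational cost of evaluating the description
        return sum(ec(e) for e in events) / len(events)

    def description_cost(events, vars_of, mc, ec, u1=0.5, u2=0.5):
        return u1 * measuring_cost(events, vars_of, mc) + u2 * evaluation_cost(events, ec)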
LEARNING BY MAXIMIZING DESCRIPTION QUALITY

As mentioned before, learning the base concept representation (BCR) of a concept is performed in two phases. In the first phase, a complete and consistent concept description is learned inductively from examples. In the second phase, the obtained complete and consistent description is optimized according to the general description quality criterion. In POSEIDON, the first phase is done using the AQ15 learning program, described in (Michalski et al., 1986a). The following subsections describe the second phase (the TRUNC-SG procedure).

Search Heuristics for Optimizing the Base Concept Representation

The task of optimizing the BCR by directly applying the General Description Quality measure is computationally expensive. It requires that every newly generated description be matched flexibly against all training examples. To make this process more efficient, a double-level search method is employed. The first level uses a simple heuristic to determine which operator, RR, CR, CE or CC, is likely to improve the description, and the second level actually applies the operator and evaluates the resulting description according to the General Description Quality measure.

The first level applies the so-called Potential Accuracy Improvement (PAI) heuristic. The PAI is a function of the change in the coverage of positive and negative examples caused by an operator application. Specifically:

PAI = ΔP/TP - ΔN/TN

where ΔP (ΔN) is the change in the number of positive (negative) examples that would be covered by the description after applying the operator, and TP (TN) is the total number of positive (negative) examples. For the generalizing operators, CR and CE, ΔP and ΔN are non-negative; for the specializing operators, RR and CC, ΔP and ΔN are non-positive.

The advantage of the Potential Accuracy Improvement measure is that it can be computed much more efficiently than the General Description Quality. For every condition in the current description, a list of the examples covered by it is maintained using bit vectors. The set of examples covered by a ruleset (representing a complete description) is then obtained by intersection and union operations. The matching time can be improved further by also maintaining bit vectors for the examples covered by rules (the matching time trades off against the memory for storing the bit vectors). Note that computing the General Description Quality requires flexible matching, and thus cannot be done by intersection and union operations on bit vectors.
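The bit-vector bookkeeping can be sketched with Python sets of example indices standing in for the bit vectors (an illustration, not the actual implementation):

    from functools import reduce

    def covered_by_rule(cond_covers):
        # intersection of the condition coverage sets gives the rule's coverage
        return reduce(set.intersection, cond_covers)

    def covered_by_ruleset(rule_covers):
        # union of the rule coverage sets gives the ruleset's coverage
        return reduce(set.union, rule_covers, set())

    def pai(pos_before, pos_after, neg_before, neg_after, tp, tn):
        # PAI = dP/TP - dN/TN, computed from the coverage sets before and
        # after applying a candidate operator
        dp = len(pos_after) - len(pos_before)
        dn = len(neg_after) - len(neg_before)
        return dp / tp - dn / tn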
The above formula does not take into consideration the degree of reduction of the description complexity caused by applying an operator. For example, removing a rule reduces the complexity more than removing a condition. To account for this, POSEIDON assigns a higher weight (preference) to applying the RR operator (rule removal) than to applying the CR operator (condition removal).

The condition removal operator generalizes the description; therefore, the description (ruleset) resulting from its application may cover some additional examples (positive or negative). Due to this, some rule(s) may become redundant. If the CR operation produces a rule that differs from another rule only in the value of one attribute, the two rules can be merged into one, in which the attribute is related to the internal disjunction of the values (this is a case of the so-called "refunion" operation; see Michalski and Stepp, 1983). For example, the rules [shape = circle]&[size = 2..6] and [shape = square]&[size = 2..6] can be replaced by the single rule [shape = circle v square]&[size = 2..6].

It is worth noting that in the case of the operators RR and CR, the Potential Accuracy Improvement heuristic can be simplified by using an approximation:

PAI' = #P/TP - #N/TN

where #P (#N) is the number of positive (negative) examples covered by the component (rule or condition) to be removed. Such a heuristic is very efficient, because it needs to be computed only once for every condition and every rule in the initial description. This computation can be done before the search starts, and does not need to be repeated for every node in the search.

The operator that produces the largest Potential Accuracy Improvement is chosen, and applied to the description under consideration. The descriptions so generated are then subjected to an evaluation by the General Description Quality criterion. The search algorithm (TRUNC-SG) is presented in Table 3.

Search Algorithm (TRUNC-SG)

1. Identify in the search tree the best candidate description D. (Initially, D is the complete and consistent description obtained by AQ15 in Phase 1. Subsequently, it is the highest-ranked description according to the General Description Quality criterion.)

2. Apply to D the operator, selected from among the operators
   RRi  - remove the i-th rule
   CRij - remove the j-th condition from the i-th rule
   CCij - contract the referent of the j-th condition in the i-th rule
   CEij - extend the referent of the j-th condition in the i-th rule
   that maximizes the Potential Accuracy Improvement measure.

3. Compute the General Description Quality (GDQ) of the description obtained in step 2. If the GDQ of this description does not exceed the GDQ of the original D by more than Δ (an experimental threshold), then proceed to step 1. (Computing the description accuracy for the GDQ employs flexible matching.)

4. Identify the exceptional examples, that is, (a) the positive examples that cease to be covered, and (b) the negative examples that become covered. Ask an expert to provide rules explaining these examples. If such rules are obtained, add them to the Inferential Concept Interpretation; otherwise, add the exceptional example(s) to it.

5. Update the GDQ value of the new node by taking into account the added Inferential Concept Interpretation.

6. If the stopping criterion is satisfied, then STOP; otherwise proceed to step 1.

Table 3. The Algorithm for Optimizing a Concept Description.

Let us explain the motivation and the individual steps of the algorithm. Step 1 chooses the node (description) for expansion on a best-first basis, that is, it chooses the node with the highest General Description Quality. This is not always an optimal choice, because "worse" nodes can sometimes lead to better descriptions after a number of removals. Whether the search will behave in this manner depends on the adequacy of the General Description Quality as a measure of concept quality.
Step 2 chooses the "best" search operator according to the Potential Accuracy Improvement heuristic, and applies it to the current description. Step 3 computes the General Description Quality of the new node. It should be noted that, in the General Description Quality measure, the typical examples covered directly by the base representation can weigh more than those covered through flexible matching. The examples covered by Inferential Concept Interpretation rules weigh more than the ones covered through flexible matching, but less than the ones covered by the Base Concept Representation. A new description (node) is worth considering only if it is "sufficiently" better (by more than Δ) than the previous one; otherwise the control goes back to Step 1 (the reason for this is given below).

Step 4 determines the exceptional examples, and asks an expert for an explanation of them. If the explanation is provided, appropriate rules are added to the Inferential Concept Interpretation. These rules may extend or contract the Base Concept Representation. For example, the rule removal operator might uncover some positive examples that were previously covered. In this case, new rules added to the Inferential Concept Interpretation would allow the system to reason about such "special" positive examples, and explain why they should be classified as instances of the concept being learned. On the other hand, the condition removal operator might cause some negative examples to be covered. In this case, new Inferential Concept Interpretation rules would have to be added to contract the Base Concept Representation.

An important issue concerning step 4 is when an explanation should be requested from an expert (the "explainer"). The problem is that in some cases the chosen operator may not be appropriate, because it leads to a very poor description. In such a case, it is not worthwhile to ask the expert for an explanation, and the search should continue in another direction. The method employs the following strategy. Suppose that N is the node (description) to be expanded, and M is the node obtained after applying an operator, e.g., condition removal. The effort to obtain an explanation is made only if the General Description Quality of M is
"significantly" better than that of N (above a certain threshold T). In this case, the explainer is given the General Description Quality evaluations of both descriptions, N and M, and is asked for an explanation. These evaluations give the explainer a sense of the importance of the request. If the explainer cannot provide an explanation, the exceptional examples are directly added to the Inferential Concept Interpretation.

Step 5 updates the General Description Quality of the obtained two-tiered description by taking into consideration the added Inferential Concept Interpretation rules. Step 6 decides whether to stop or continue the search. The stopping criterion is satisfied when the number of nodes explored exceeds a value k1, or when the General Description Quality has not improved after the exploration of k2 nodes since the last improvement. The search parameters k1 and k2 have default values, which are modifiable by the user. When the search stops, the best node found up to this point defines the chosen two-tiered concept description.

In conclusion, let us point out the main difference between the above two-level search and the standard best-first search. The difference is that only one operator is applied to the (best-GDQ) node selected for expansion, rather than all available operators, as in the standard search. The operator applied is the "best" according to the PAI heuristic. Such a procedure helps to avoid generating low-quality nodes, and thus makes the computation of the General Description Quality for these nodes unnecessary. Other operators are applied only if the results obtained along this branch of the search tree turn out to be unsatisfactory.

An Abstract Example

An abstract example of the search process is given in Figure 2. Individual nodes represent both components of a two-tiered description (BCR and ICI) generated at a given search step, and show the coverage of the training examples by the description. The rectangular areas represent the coverage by the Base Concept Representation, and the curved lines denote the coverage by the Inferential Concept Interpretation. In the example, the accuracy is computed according to the formula described before, assuming that all examples have the same typicality.
[Figure 2: a search tree of five numbered nodes. The node annotations recoverable from the figure are: accuracy 0.76 (2 rules, 5 conditions); accuracy 0.52 (2 rules, 4 conditions); accuracy 0.89 (2 rules, 4 conditions); accuracy 0.79 (2 rules, 3 conditions); accuracy 0.92 (1 rule, 2 conditions). Branches are labeled with the operators applied, e.g., "Truncate c2," "Truncate c4," "Truncate c1 & c2."]
Figure 2. An Illustration of the Search Process. The initial description is represented by node 1. The BCR contains two rules represented by two rectangular areas, which cover five positive examples out of eight, and one negative example out of five. The
Inferential Concept Interpretation extends this coverage by recognizing one more positive example. Subsequent nodes correspond to descriptions obtained by an application of the operators marking the branches of the search tree. For example, node 3 is obtained by eliminating condition c5 in the second rule of the initial description. The new description is more accurate because all positive examples are now covered, without changing the coverage of the negative examples. By truncating the first rule in node 3, node 5 is generated. This description no longer covers negative examples, and is simpler. This node is then accepted as the optimized description resulting from the search. The other nodes lead to inferior concept representations with respect to General Description Quality, and are discarded. The quality has been computed with w1 = w2 = 0.5. For simplicity, the cost is omitted, and the complexity of the Inferential Concept Interpretation is ignored. The complexity of the Base Concept Representation is indicated by the number of rules and the number of conditions.

EXPERIMENTS

The proposed method was implemented in the POSEIDON system (also called AQ16). To evaluate its performance, it was tested, together with several other methods, in two problem domains. The other methods tested included: simple forms of exemplar-based learning, learning consistent and complete descriptions (implemented in AQ15), generating top rule descriptions (described by Michalski et al., 1986), and generating pruned decision trees (implemented in the ASSISTANT program; Cestnik, Kononenko & Bratko, 1987). All these methods were applied to the same training data, and tested on the same testing data from the two problem domains. The first domain was labor-management contracts, and the problem was to learn a general description that discriminates between acceptable and unacceptable contracts. The second domain was congressional voting, and the problem was to characterize the voting behavior of Republicans and Democrats in the US House of Representatives.
Experimental Data

Labor-management contracts. The data regarding labor-management contracts were obtained from Collective Bargaining, a review of current collective bargaining issues published by the Department of Labor of the Government of Canada. The data describe labor-management contracts negotiated between various organizations and labor unions with at least 500 members, and concluded in the second half of 1987 or the first half of 1988. The experiments focused on the personal and business services sector. This sector includes unions representing hospital staff, teachers, university professors, social workers and certain classes of administrative personnel. The data involved multivalued attributes, and thus the VL1 language was directly applicable.

Each contract is described by sixteen attributes, belonging to two main categories. One category concerns issues related to salaries, e.g., pay increases in each year of the contract, the cost of living allowance, stand-by pay, etc., and the second category concerns issues related to fringe benefits, e.g., different kinds of pension contributions, holidays, vacation, dental insurance, etc. Positive examples represent contracts that have been accepted by both parties. Negative examples represent contracts deemed unacceptable by at least one of the parties. Here is an example of an acceptable labor-management contract:

Duration of the contract = 2 years
Wage increase in the first year = 7.5%
Wage increase in the second year = 3.5%
Cost-of-living allowance = unknown
Hours of work per week = 38
Pension offer = none
Stand-by pay = $0.12/hr
Shift differential = second shift is paid 25% more than first shift
Educational allowance = offered
Holidays per year = 11 days
Vacation length = better than average in the industry
Long term disability insurance = offered by the employer
Dental insurance cost = 50% covered by the employer
Bereavement leave = available
Employer-sponsored health plan = not mentioned
The above description is represented as the following VL1 rule:

[Dur = 2] [Wage1 = 7.5] [Wage2 = 3.5] [Cola = unknown] [Workhours = 38] [Pension = none] [StbyPay = 12] [ShiftDff = 25] [Educallw = yes] [Hlds = 11] [VacLen = better] [LngTrmDisbl = true] [Dntl-ins = half] [Bereavement = yes] [EmpHlthPln = unknown] ::> [Contract Class = acceptable]

In the rule above, and in subsequent rules, the following abbreviations are used:

StbyPay for "Stand-by pay"
VacLen for "Vacation length"
Hlds for "Holidays per year"
LngTrmDisbl for "Long term disability insurance"
EmpHlthPln for "Employer-sponsored health plan"
ShiftDff for "Shift differential"
Contract Class for "Contract classification"

Also, for simplicity, conjunction is represented by concatenation. The training set consisted of 18 positive and 9 negative examples of contracts; the testing set consisted of 19 positive and 11 negative examples.

US Congress voting record. The data regarding the US Congress voting record were the same as those used by Lebowitz (1987) in his experiments on conceptual clustering. The data represent the 1981 voting records of 100 selected representatives (50 in the training set and 50 in the testing set). The problem was to learn descriptions discriminating between the voting records of Democrats and Republicans. Below is an example of the voting record of a Democrat in the US Congress:

Draft registration = no
Ban aid to Nicaragua = no
Cut expenditure on MX missiles = yes
Federal subsidy to nuclear power stations = yes
Subsidy to national parks in Alaska = yes
Fair housing bill = yes
Limit on PAC contributions = yes
Limit on food stamp program = no
Federal help to education = no
State = north east
Population = large
Occupation = unknown
Cut in Social Security spending = no
Federal help to Chrysler Corp. = vote not registered
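As an illustration of how such attributional rules can be handled programmatically, the fragment below is a minimal Python sketch, assuming a simple dictionary encoding of instances and rules; the encoding and names are illustrative, not AQ15's actual data structures.

    # A sketch of a VL1-style conjunctive rule and strict matching. A rule is
    # encoded as a dict of attribute -> required value; an instance is a dict
    # of attribute -> observed value. The encoding is illustrative only.

    rule_acceptable = {
        "Dur": 2, "Wage1": 7.5, "Wage2": 3.5, "Cola": "unknown",
        "Workhours": 38, "Pension": "none", "StbyPay": 12, "ShiftDff": 25,
        "Educallw": "yes", "Hlds": 11, "VacLen": "better",
        "LngTrmDisbl": True, "Dntl-ins": "half", "Bereavement": "yes",
        "EmpHlthPln": "unknown",
    }

    def strict_match(rule, instance):
        # True iff every condition of the conjunction is satisfied exactly
        return all(instance.get(a) == v for a, v in rule.items())

    contract = dict(rule_acceptable)   # an instance satisfying every condition
    assert strict_match(rule_acceptable, contract)
    contract["Hlds"] = 10              # one attribute off -> no strict match
    assert not strict_match(rule_acceptable, contract)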
A Description of Experiments

For each problem domain, the experiments involved the following steps:

1. Learn a complete and consistent description from the training examples (by the AQ15 program).

2. Determine the top rule description from the above description using the TRUNC method (Michalski et al., 1986). The top rule description consists of a single rule that covers the maximum number of positive examples among all the rules in the complete and consistent description. Such a description is easy to determine, because AQ15 generates rules together with measures indicating the number of examples covered totally and uniquely by each rule, denoted the t-weight and u-weight of a rule, respectively (see below). In the experiments, one top rule description was generated for the positive concept examples, and one for the negative examples (the latter from a complete and consistent description of the negative examples). An instance was classified as belonging to a concept if it best matched the top rule description of the positive examples, and was rejected if it best matched the top rule description of the negative examples. If both descriptions were matched to roughly the same degree, then the instance was classified as "no match." Learning the top rule description, and using it with flexible matching, represents a simple but important version of the two-tiered concept learning approach (Michalski, 1990).

3. Determine an optimized two-tiered description from the complete and consistent description using the TRUNC-SG procedure.

4. Determine descriptions of the given concepts using other methods, specifically, variants of the exemplar-based learning approach, and the decision tree learning program ASSISTANT.

5. Test the performance of all generated descriptions on the testing examples.

To illustrate the difference between the complete and consistent descriptions, the top rule descriptions, and the optimized descriptions created by POSEIDON, the figures below show a sample of these descriptions in the Labor Management domain. Figure 3 shows a complete and consistent description produced by AQ15. In the figures, t (t-weight) is the total number of examples covered by a rule, and u (u-weight) is the number of examples uniquely covered by the rule.
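The t-weights and u-weights follow directly from the cover relation; the short sketch below computes both for a ruleset, given an assumed predicate covers(rule, example) (a hypothetical helper, not part of AQ15).

    # Computing t-weights (total examples covered by a rule) and u-weights
    # (examples covered by that rule and by no other rule of the ruleset),
    # given a hypothetical predicate covers(rule, example).

    def t_and_u_weights(rules, examples, covers):
        cover_sets = [{i for i, ex in enumerate(examples) if covers(r, ex)}
                      for r in rules]
        weights = []
        for k, cs in enumerate(cover_sets):
            others = set().union(*cover_sets[:k], *cover_sets[k + 1:])
            weights.append((len(cs), len(cs - others)))  # (t-weight, u-weight)
        return weights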
BCR:
[cntrct-dur > 1] & [wage_incr_yr2 >= 3.0%] & [#hlds > 10]    (t = 5, u = 3)  or
[wage_incr_yr1 > 4.5%]                                       (t = 2, u = 2)  or
[wage_incr_yr1 > 4%] & [wage_incr_yr2 > 4.0%]                (t = 7, u = 7)  or
[wage_incr_yr1 > 4.5%] & [#hlds > 9]                         (t = 1, u = 1)  or
[wage_incr_yr1 > 2%] & [vacation > average]                  (t = 7, u = 7)
::> [Contract Class = acceptable]

Figure 3. A Complete and Consistent Description Produced by AQ15 (rules for the acceptable class).

BCR:
[wage_incr_yr1 = 2..4%] & [wage_incr_yr2 > 3%] & [#hlds > 10]    (t = 17, u = 17)
::> [Contract Class = acceptable]

[wage_incr_yr1 = 2..4%] & [#hlds < 10] & [vacation = average]    (t = 5, u = 5)
::> [Contract Class = unacceptable]

ICI: Flexible matching

Figure 4. Top Rule Descriptions Obtained by the TRUNC Method.
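The flexible matching referred to in the ICI can be sketched as follows. In this simplified Python version, a rule's degree of match is just the fraction of its conditions satisfied (a stand-in for the actual matching function), and the degrees of the rules in a disjunction are combined by the probabilistic sum, as in the comparison with Kibler and Aha's method discussed later.

    # A sketch of flexible matching: each rule matches to a degree (here the
    # fraction of satisfied conditions, a simplification of the actual
    # function); rule degrees within a disjunction are combined by the
    # probabilistic sum ps(a, b) = a + b - a*b.

    def rule_degree(rule, instance):
        hits = sum(1 for a, v in rule.items() if instance.get(a) == v)
        return hits / len(rule)

    def flexible_match(ruleset, instance):
        degree = 0.0
        for rule in ruleset:
            d = rule_degree(rule, instance)
            degree = degree + d - degree * d   # probabilistic sum over rules
        return degree

    def classify(descriptions, instance):
        # descriptions: dict mapping class name -> ruleset (list of rules)
        return max(descriptions,
                   key=lambda c: flexible_match(descriptions[c], instance))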
By optimizing the complete and consistent description using the TRUNC-SG method, and acquiring the ICI rules from an expert, the following optimized two-tiered description was obtained (Figure 5).

BCR:
[wage_incr_yr1 > 4.5%]  or
[wage_incr_yr2 > 3.0%]  or
[#hlds > 9]  or
[vacation > AVG]
::> [Contract Class = acceptable]
[wage_incr_yr1 < 4.0%] & [#hlds < 10]  or
[wage_incr_yr2 < 4.0%] & [vacation < average]  or
[Dur = 1] & [wage_incr_yr1 < 4.0%]  or
[wage_incr_yr2 < 3.0%]
::> [Contract Class = unacceptable]

ICI: Flexible matching plus deductive matching using rules:

[wage_incr_yr1 > 5.5%] & [vacation < average] ::> [Contract Class = acceptable]
[wage_incr_yr1 <= 3%] & [wage_incr_yr2 <= wage_incr_yr1] ::> [Contract Class = unacceptable]
[wage_incr_yr1 <= 3%] & [wage_incr_yr2 < 4.0%] & [pension = empl_contr] ::> [Contract Class = unacceptable]

Figure 5. Optimized Two-tiered Descriptions Obtained by POSEIDON.

During the BCR description optimization process, the system determined the training events that were incorrectly classified by the base representation. An expert was asked to formulate rules explaining these examples (the ICI rules in Figure 5). For example, the first ICI rule for an unacceptable contract (Figure 5) describes contracts with a wage increase in the first year lower than or equal to 3%, and an even lower increase in the second year. In such circumstances, the holiday and vacation time do not matter, and the contract is classified as unacceptable (by the union). As one can see, the optimized BCR descriptions are significantly simpler than the complete and consistent descriptions generated by AQ15.
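A minimal sketch of the resulting two-tiered decision procedure is given below; the predicates s_covered, f_degree, ici_extends, and ici_contracts are assumed helpers standing in for the BCR match, the flexible matching function, and the two kinds of ICI rules, and the control flow is an illustration rather than POSEIDON's exact recognition algorithm.

    # A sketch of two-tiered classification: the BCR is consulted first;
    # contracting ICI rules resolve multiple matches, while extending ICI
    # rules and flexible matching recover instances the BCR misses. All
    # predicates are hypothetical helpers.

    def two_tiered_classify(instance, concepts, s_covered, f_degree,
                            ici_extends, ici_contracts, threshold=0.8):
        matched = [c for c in concepts if s_covered(c, instance)]
        if len(matched) > 1:                 # contracting rules prune
            matched = [c for c in matched    # spurious multiple matches
                       if not ici_contracts(c, instance)]
        if len(matched) == 1:
            return matched[0]
        if not matched:
            flex = [c for c in concepts if f_degree(c, instance) >= threshold]
            if len(flex) == 1:               # F-covered
                return flex[0]
            ext = [c for c in concepts if ici_extends(c, instance)]
            if len(ext) == 1:                # covered by extending ICI rules
                return ext[0]
        return None                          # "no match"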
These optimized BCR descriptions also seem to represent the most important characteristics of the labor-management contracts. Specifically, a contract is acceptable when it offers a significant wage increase (the first two rules in Figure 5), or it offers many holiday days, or the vacation time is above average.

Results From Testing POSEIDON and Other Methods

As mentioned earlier, the experiments tested POSEIDON and four other methods, specifically, variants of exemplar-based learning, the method for learning consistent and complete descriptions, a method for generating top rule descriptions, and a method for generating pruned decision trees. All of these methods were employed to learn a concept description from the same set of training examples. All the learned descriptions were then applied to the same testing examples. The performance was evaluated by counting the number of examples that were classified correctly, classified incorrectly, or left unclassified. Tables 4 to 7 present the results of the different experiments. A summary of all results is shown in Table 8. In all tables, the column "Correct" specifies the percentage of the testing events that were correctly classified, and the column "No_Match" specifies the percentage of unclassified examples (i.e., the examples that did not match any description to a sufficient degree). To provide an estimate of the complexity of the descriptions learned, the tables also list the number of conditions and rules in each description. In the case of pruned decision trees, the table lists the number of nodes and leaves (the number of leaves corresponds to the number of rules that can be directly determined from the decision tree).

Experiment 1 (Table 4) tested a factual description, and variants of the exemplar-based approach (1-, 3- and 5-nearest neighbor match). A factual description is a disjunction of all the training events, and, as such, is obviously complete and consistent with regard to the training set. The first part of Experiment 1 tested the factual description on the testing examples using the strict match method. In such a method, a testing example must match exactly one of the training examples to be classified.
In this case, obviously, the description had no predictive power. It produced No_Match answers for all testing examples of the labor contract data, and for 96% of the testing examples of the congressional voting data (two examples were the same in the training and testing sets).

Simple Exemplar-based Description
Labor management problem (Labor): 27 rules and 432 conditions
Congress problem (Congress): 51 rules and 969 conditions

                            Correct               No_Match
                         Labor   Congress      Labor   Congress
Strict Match
  Training Set           100%    100%          0%      0%
  Testing Set            0%      4%            100%    96%
1-Nearest Neighbor
  Training Set           100%    100%          0%      0%
  Testing Set            77%     86%           0%      0%
3-Nearest Neighbors
  Training Set           100%    100%          0%      0%
  Testing Set            83%     84%           0%      0%
5-Nearest Neighbors
  Training Set           100%    100%          0%      0%
  Testing Set            80%     84%           0%      0%
Table 4. Results of Experiment 1.

Subsequent parts of Experiment 1 tested the factual description using the k-nearest neighbor method with different values of k. The method involves determining the k closest (best "fitting") training examples to the one being classified, and assigning to it the class of the majority of these closest examples. Such a method is equivalent to simple forms of exemplar-based learning. The 1-Nearest Neighbor row lists results from applying the factual description with a matching method somewhat similar to the one described in (Kibler and Aha, 1987). The only difference is that Kibler and Aha's method uses the maximum function for evaluating a ruleset (disjunction), while our flexible matching uses the probabilistic sum. The method was also tested with k = 3 and k = 5.
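For reference, the k-nearest-neighbor testing scheme can be sketched as follows; the attribute-overlap similarity used here is a simple stand-in for the actual "fit" measure.

    # A sketch of the k-nearest-neighbor testing used in Experiment 1.
    # training is a list of (instance_dict, class_label) pairs; the
    # similarity is plain attribute overlap, a stand-in for the real measure.

    from collections import Counter

    def similarity(a, b):
        keys = set(a) | set(b)
        return sum(a.get(k) == b.get(k) for k in keys) / len(keys)

    def knn_classify(training, instance, k=3):
        nearest = sorted(training, reverse=True,
                         key=lambda ex: similarity(ex[0], instance))[:k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]   # class of the majority of neighbors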
The second experiment used concept descriptions generated by AQ15 without truncation (Table 5). Such descriptions are consistent and complete with regard to the training examples, i.e., they classify all training examples 100% correctly when using the strict matching method. The flexible matching method did not change this result.

Complete and Consistent Description (No truncation)
Labor-mgmt problem (Labor): 11 rules and 28 conditions
Congress problem (Congress): 10 rules and 32 conditions

                            Correct               No_Match
                         Labor   Congress      Labor   Congress
Strict Match
  Training Set           100%    100%          0%      0%
  Testing Set            80%     86%           3%      0%
Flexible Match
  Training Set           100%    100%          0%      0%
  Testing Set            80%     86%           3%      0%
Table 5. Results of Experiment 2.

For the testing set, the number of correct classifications was relatively high (80-86%), and the same for the strict and flexible matching methods. Flexible matching made no difference, probably due to two factors. First, the complete and consistent descriptions include many specific rules, leaving little room for the "no match" cases (3%), in which flexible matching could help. Second, the descriptions consisted only of disjoint rules, as the program was run using the "disjoint cover" parameter. In such a situation, the "multiple match" cases do not occur, and flexible matching cannot help.

The above results are similar to those obtained in the previous experiment, which used an exemplar-based approach (Table 4). The main difference is that the AQ descriptions are much simpler in terms of the number of rules and the number of conditions involved (11 vs. 27 rules in the labor management problem, and 10 vs. 51 rules in the congress voting problem).
The simpler descriptions allow the system to be more efficient in the recognition mode.

The third experiment (Table 6) tested the top rule descriptions determined from the above complete and consistent descriptions. As shown in Table 6, the performance of these rules using flexible matching was comparable to that of the complete and consistent descriptions, as well as of the factual descriptions (compare with Tables 4 and 5).

The Top Rule Description (the TRUNC method)
Labor-mgmt problem (Labor): 2 rules and 6 conditions
Congress problem (Congress): 2 rules and 6 conditions

                            Correct               No_Match
                         Labor   Congress      Labor   Congress
Strict Match
  Training Set           52%     62%           48%     38%
  Testing Set            63%     69%           30%     24%
Flexible Match
  Training Set           81%     75%           0%      0%
  Testing Set            83%     85%           0%      0%

Table 6. Results of Experiment 3.

It may be surprising that the top rule descriptions performed better on the testing set than on the training set. This is due to the fact that the training set contained more exceptions than the testing set. The system used the TRUNC method, in which the truncation process removes rules so that only the most typical training examples remain covered. The top rule descriptions consist of only one rule per concept, and therefore they are significantly simpler than the factual and the complete and consistent descriptions (they use only 2 vs. 11 vs. 27 rules in the Labor Management problem, and 2 vs. 10 vs. 51 rules in the Congress Voting problem). It is quite revealing that such simple rules performed as well as the much more complex descriptions generated by the previous methods.
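The TRUNC step that produces such a top rule description is simple enough to sketch directly: from each class's complete and consistent ruleset, keep only the rule with the largest t-weight. The function below assumes rules carry their t-weights, as in AQ15's output; the encoding is illustrative.

    # A sketch of the TRUNC method's top rule selection: keep, for each
    # class, the single rule with the largest t-weight. A rule is any object
    # for which t_weight(rule) returns its t-weight.

    def top_rule_description(rulesets_by_class, t_weight):
        return {cls: max(rules, key=t_weight)
                for cls, rules in rulesets_by_class.items()}

    # e.g., with rules encoded as (conditions, t, u) triples:
    # tops = top_rule_description(rulesets, t_weight=lambda r: r[1])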
The fourth experiment (Table 7) tested the optimized descriptions generated by POSEIDON, i.e., derived by the TRUNC-SG method. The descriptions were tested using flexible matching alone (Flexible Match), and in combination with deductive matching (Deductive Match).

Optimized Description (POSEIDON)
Labor-mgmt problem (Labor): 9 rules and 12 conditions
Congress problem (Congress): 10 rules and 21 conditions

                            Correct               No_Match
                         Labor   Congress      Labor   Congress
Strict Match
  Training Set           84%     63%           16%     37%
  Testing Set            43%     73%           54%     23%
Flexible Match
  Training Set           85%     100%          15%     0%
  Testing Set            83%     92%           4%      0%
Deductive Match
  Training Set           96%     96%           0%      4%
  Testing Set            90%     92%           0%      0%

Table 7. Results of Experiment 4.

For comparison, the performance of these descriptions was also tested using strict match, although this is a rather impractical combination. As expected, these descriptions used with strict matching gave relatively poor performance. The optimized descriptions (BCR) combined with deductive matching (ICI) gave the best performance (90-92% correct). When used with flexible matching only, the performance was slightly lower. The descriptions are simpler than the complete and consistent descriptions, although they include the Inferential Concept Interpretation rules. They are, of course, more complex than the top rule descriptions, which do not use any interpretation rules.
For the Labor data, descriptions applied with deductive matching produced higher performance than when used with flexible matching only (90 vs. 83%).6 For the Congress data, the performance was the same for the two matching methods. This is because the deductive rules were acquired on the training set; on this particular testing set, the D-covered events were the same as the F-covered ones.

Table 8 summarizes the results of the experiments; specifically, it compares the performance and complexity of the descriptions generated by simple exemplar-based methods, the two-tiered descriptions generated by POSEIDON, and the pruned decision trees generated by ASSISTANT (a descendant of Quinlan's ID3 program; Cestnik et al., 1987).
ASSISTANT was applied to the same training and testing data as used in the previous experiments (whose results were presented in Tables 4, 5, 6 and 7). The decision trees obtained by ASSISTANT were optimized using a tree-pruning mechanism (Cestnik et al., 1987). This mechanism is compared with the TRUNC-SG method in the next section. The factual description was applied with the flexible matching function. The complexity of a rule-based description was measured by the number of rules (#Rules) and the number of conditions (#Conds). The complexity of a decision tree was measured by the number of leaves (#Leaves) and the number of nodes (#Nodes).

6 This difference, for the Labor Contract data, is not statistically significant (by the χ2 test). Nevertheless, we think that there are other reasons to prefer deductive matching over flexible matching. Deductive classification is based on rules and knowledge-based inference, and is therefore easier for humans to understand. The rules may be modified locally, while changing the flexible matching function is difficult and produces uncontrolled, global consequences. In other words, examples that are correctly recognized through ICI deductive rules are also explained, ipso facto, in terms of domain knowledge. The same cannot be said of examples correctly recognized by flexible matching, which is a knowledge-independent distance measure. To reflect this, the GDQ measure assigns a higher score to a description with deductive matching than to one with flexible matching.
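The distinction between S-covered, F-covered, and D-covered events used above can be sketched as a three-way test; the predicates and the threshold are assumed helpers, and the class labels anticipate the typicality classes discussed in the next section.

    # A sketch of classifying an example by how it is covered: S-covered
    # (strictly, by the BCR), F-covered (by flexible matching above a
    # threshold), or D-covered (by deductive ICI rules). All helpers are
    # hypothetical.

    def coverage_class(instance, s_covered, f_degree, d_covered, threshold=0.8):
        if s_covered(instance):
            return "typical"           # S-covered
        if f_degree(instance) >= threshold:
            return "nearly-typical"    # F-covered
        if d_covered(instance):
            return "non-typical"       # D-covered
        return "uncovered"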
                                            Labor Contract   Congress Voting
Simple exemplar-based method
  Performance (% Correct)
    1-nearest neighbor                      77%              86%
    3-nearest neighbor                      83%              84%
    5-nearest neighbor                      80%              84%
  Complexity (#Rules / #Conds)              27/432           51/969

Pruned decision tree (ASSISTANT + pruning)
  Performance (% Correct)                   86%              86%
  Complexity (#Leaves / #Nodes)             29/53            19/28

Complete and consistent description (AQ15 without rule truncation)
  Performance (% Correct)                   80%              86%
  Complexity (#Rules / #Conds)              11/29            10/32

Top rule two-tiered description (AQ15 with rule truncation)
  Performance (% Correct)                   83%              85%
  Complexity (#Rules / #Conds)              2/6              2/6

Optimized two-tiered description (POSEIDON)
  Performance (% Correct)                   90%              92%
  Complexity (#Rules / #Conds)              9/12             10/21

Table 8. Summary of the Results of Testing Descriptions Generated by Different Methods.
In the above experiments, for both problem domains, the learning method implemented in POSEIDON produced descriptions that are simpler (except for the top rule descriptions) and also perform better on the testing data than the other tested methods. Being simpler, these descriptions are also easier to understand, and have a lower evaluation cost. The meaning of a concept defined by such descriptions depends on the base representation (i.e., a TRUNC-SG-optimized description learned from examples), and on the inferential concept interpretation (consisting of an a priori defined flexible matching procedure and a set of deductive rules formulated by the expert).

Using rules in the inferential concept interpretation has the advantage that exceptional cases are easy to explain. In the current method, the system determines which examples are exceptional (those that are misclassified by the base representation). The expert analyzes them, and determines the rules for the ICI. The top rule descriptions were significantly simpler than any other descriptions, but performed somewhat worse than the optimized description and the decision tree. Depending on the desired trade-off between accuracy and simplicity, the top rule or the optimized description can be taken as the base representation of the concept being defined.

The Role of Parameters and Related Issues

POSEIDON has many parameters which can be controlled by a user. On the surface, this might be considered a disadvantage. In our view, a learning system that allows the user to explicitly modify parameters that affect learning processes (but which are not just method-dependent) is to be preferred over a system that does not explicitly define such parameters. The point is that in the latter systems these parameters are defined only implicitly, by the assumptions and the structure of the method. For example, many systems do not take into consideration the typicality of examples. In POSEIDON, this is equivalent to an assumption that the typicality of all examples is equal to the default value 1. As another example, consider the cost of measuring the values of attributes. If a learning program does not have parameters representing such costs, then
this is equivalent to an assumption that all costs are the same (which in reality is often not true). By being able to control such learning parameters, the user can produce results that better fit the task at hand. For example, for some tasks the accuracy of descriptions may be the decisive criterion, while for others the description simplicity may be of equal concern.

An important problem to be investigated is the sensitivity of POSEIDON to its various parameters. While a comprehensive answer to this problem goes beyond the scope of this paper, we report below a preliminary sensitivity analysis regarding the parameters controlling the trade-off between description accuracy and simplicity. These parameters are considered to have the most important effect on the performance of the learned descriptions. Specifically, they are the tolerances in the lexicographic evaluation functional measuring the description quality. To explain their role, let us briefly review the description quality measure. This measure combines several criteria, such as accuracy, simplicity, and cost. Each criterion is associated with a tolerance interval such that differences within this interval are considered unimportant. Thus, if the tolerance interval of accuracy is very narrow, then accuracy becomes the prevailing criterion in quality evaluation. On the other hand, if this tolerance interval is wide, the remaining criteria become more significant.
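The role of the tolerances can be made concrete with a small sketch of a lexicographic evaluation functional: criteria are examined in order of importance, and a difference smaller than a criterion's tolerance is treated as a tie, passing the decision to the next criterion. The criteria and tolerance values below are illustrative.

    # A sketch of a lexicographic evaluation functional (LEF) with tolerance
    # intervals. Each criterion is a (measure, tolerance) pair; measures are
    # oriented so that higher is better.

    def lef_prefer(a, b, criteria):
        for measure, tol in criteria:
            diff = measure(a) - measure(b)
            if abs(diff) > tol:               # outside the tolerance interval:
                return a if diff > 0 else b   # this criterion decides
        return a                              # a tie on all criteria

    # e.g., accuracy dominates with tolerance tau1 = 0.05; simplicity
    # (negated complexity) breaks ties within the tolerance interval:
    criteria = [(lambda d: d["accuracy"], 0.05),
                (lambda d: -d["complexity"], 0.0)]
    winner = lef_prefer({"accuracy": 0.92, "complexity": 12},
                        {"accuracy": 0.90, "complexity": 5}, criteria)
    # accuracies differ by only 0.02 < 0.05, so the simpler description wins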
An experiment was performed using the same Congress voting data as used in the experiments reported in Tables 4-7. The training set had 51 examples, while the testing set had 49 examples. The concept to be learned was the voting record of Republicans in the US Congress. The description tested in Table 7 had 10 rules and 21 conditions, and yielded an accuracy of 100% on the training set and 92% on the testing set. The description was obtained using an accuracy tolerance (τ1) value equal to 0.05. To determine the method's sensitivity to this parameter, the accuracy tolerance τ1 was set to the values 0.55, 0.35, 0.02, and 0.005, and for each value the description accuracy was measured. For these accuracy tolerances, the system's performance on the testing set was 88%, 88%, 90%, and 92%, respectively. Thus, this experiment seems to indicate that the accuracy of the descriptions slowly grows as the tolerance interval on accuracy in the description quality measure is narrowed, which confirms the intuitive expectation. In general, when the accuracy tolerance interval is wide, the simplicity of the description assumes an important role, yielding performance close to that of the top rule in the two-tiered description. Intermediate values, such as the one used in the experiments presented in Table 7 (τ1 = 0.05), produced the best results, e.g., the performance of 92% on the testing set from the Congress data. In the case of a narrow tolerance interval for accuracy, simplicity has a lower impact on the quality of the description. An interesting topic for future research is to systematically investigate the influence of such parameter changes on the performance of the descriptions.7

7 In our experiment, for the small values τ1 = 0.02 and 0.005, which emphasize the role of accuracy in the measured quality of a description, the performance on the testing set was close or equal to the performance obtained for τ1 = 0.05, and higher than the performance of 86% for AQ15 in Table 5. The reason is that in the last two experiments, as well as in the original experiment in Table 7, it was always possible to find a description that was simpler than the one produced by AQ15, but still 100% correct on the training data. Therefore, by giving more importance to accuracy, the simpler description was preferred, and better performance on the test set was obtained.

Another issue that should be explored more in the future is the role of the typicality of learning examples. In the presented method, if the input examples are assigned typicality values, the generated base concept representation will tend to cover the most typical examples, while the inferential concept interpretation will tend to cover less typical examples. A problem for future investigation is to determine the effect of typicality on the overall quality of the generated concept descriptions. When typicality information is unavailable, the system itself will assign examples to different classes of typicality. The examples covered by the base representation are classified as typical, those covered by flexible
matching as nearly-typical, and those covered by the deductive rules as non-typical.8 An interesting experiment would be to compare such classifications with human classifications.

8 This three-way classification of the examples can be viewed as a simple method of learning typicality. A similar feature is available in COBWEB (Fisher, 1987). On the other hand, if typicality information is available, it is used by POSEIDON to improve the quality of the learned description.

Another interesting issue relates to noise in the data. A preliminary analysis indicates that the proposed method has a significant ability to handle noisy data. Experiments show that noisy examples are usually covered by "light" rules, i.e., rules that cover few examples. By removing such rules from the description, the effect of noise can be significantly reduced (Zhang and Michalski, 1989). Future research should investigate these aspects of the method in greater detail.

RELATED WORK

The research presented here relates to various efforts on learning imprecise concepts, in particular, to learning methods generating pruned decision trees (e.g., Quinlan, 1987; Cestnik, Kononenko & Bratko, 1987; Fisher and Schlimmer, 1988). In these methods, a concept description (or a set of descriptions) is represented as a single tree structure ("one tier") that is supposed to account for all concept instances. An unknown instance is classified by following the nodes of the decision tree from the root to the leaf indicating the class. To avoid overfitting, some parts (subtrees) of the originally generated decision tree are pruned away. As a result, such decision trees do not cover some training examples. Since the recognition process does not use flexible matching, such pruned trees must always produce some error on the training examples, although the overall performance on new examples may increase. The two-tiered method avoids overfitting by simplifying the original descriptions, yielding base concept representations that, in the formal logical sense, are usually also incomplete and inconsistent. The two-tiered method, however, can compensate for the lack of coverage or for an
excessive coverage of the first tier (BCR) by the application of the second tier (ICI). This can be done by flexible matching and/or deductive inference rules. The latter are normally unaffected by noise, because they depend on a deeper understanding of the domain. In addition, the presented method takes into consideration the typicality of the examples (if it is available). This feature gives the method additional help in handling noisy examples.

The method presented in (Quinlan, 1987) is based on a hill-climbing approach that first truncates conditions, and then rules. No search is performed; only one alternative truncation is tried at every step. The final result might possibly be far from optimal. By avoiding the search, such a procedure should, however, be significantly faster than the one implemented in POSEIDON. If the speed of learning and the simplicity of descriptions are of central importance, then the TRUNC method (which determines the top rule descriptions without search) should be applied rather than TRUNC-SG. In the same paper (Quinlan, 1987), other methods for pruning decision trees are also described. Some of these methods require a separate testing set for the simplification phase, and others use the same training set that was used in creating the tree. The simplification phase in POSEIDON can also be done either using the original training set, or using a separate set of examples.

The experiments by Fisher and Schlimmer (1988) on pruning decision trees use a statistical measure to determine the attributes to be pruned. Such measures require a rather large data sample, and thus do not apply well to small training sets. In the two-tiered approach, training events are analyzed logically, rather than statistically, both in the phase creating a complete and consistent description, and in the optimization phase. Consequently, the two-tiered approach seems to be better suited for learning from a relatively small number of examples. An interesting possibility for future research is to integrate a statistical measure, such as the one used by Fisher and Schlimmer, into the process of rule learning
and truncating with large data sets.

The system developed by Iba et al. (1988) uses a trade-off measure that is somewhat similar to the general description quality (GDQ) measure proposed in this paper. Our GDQ measure, however, considers more factors. Besides taking into account the typicality of the instances covered by the description, it considers the different types of matching between an instance and a description. Moreover, the simplicity measured by the GDQ depends not only on the number of rules in the description, as in (Iba et al., 1988), but also on other syntactic features of the description.

The inductive algorithm implemented in CN2 uses a heuristic function to terminate search during rule construction (Clark & Niblett, 1989). The heuristic is based on an estimate of the noise present in the data. Such pruning of the search space of inductive hypotheses results in rules that may not classify all the training examples correctly, but that perform well on testing data. CN2 can be viewed as an induction algorithm that includes pre-truncation, while the algorithm reported here is based on post-truncation: CN2 applies truncation during rule generation, and POSEIDON applies truncation after rule generation. The advantage of pre-truncation is the efficiency of the learning process. On the other hand, such an approach has difficulty identifying irrelevant conditions and redundant rules.

The two-tiered method described here can also be viewed as a kind of constructive induction in the sense of (Michalski, 1983). In fact, the whole learned description may include new terms, absent from the examples used for learning. This behavior is also encountered in several other systems (e.g., Sammut and Banerji, 1986; Drastal, Czako & Raatz, 1989). However, constructive learning in POSEIDON is due to the second tier, which is based on domain knowledge characterizing non-typical examples. This is different from using domain knowledge to rewrite or augment the whole training set (e.g., Rouveirol, 1991), to generate new attributes by a data-driven approach (Bloedorn & Michalski, 1992), or by a hypothesis-driven approach (Wnek and Michalski, 1991).

The exemplar-based learning system PROTOS (Bareiss, 1989) is
similar to the presented method in that it also involves simplifying the base concept description and acquiring the matching knowledge via explanations of training events provided by a teacher. There are, however, major differences: 1) PROTOS stores exemplars as base concept descriptions, whereas POSEIDON generates simple and easy-to-understand generalizations as base concept descriptions; 2) PROTOS uses domain knowledge in classifying all new cases, whereas POSEIDON uses Inferential Concept Interpretation rules only for classifying exceptions; 3) during the learning process, PROTOS asks the teacher for explanations of all exemplars, whereas POSEIDON asks only for explanations of exceptions.

The problem of using a typicality measure of examples has so far not been given much attention in machine learning, although there have been attempts in this direction. For example, Michalski and Larson (1978) introduced the idea of "outstanding representatives" of a concept to focus the learning process on the most significant examples. In cognitive science, the concept of typicality of examples has been studied extensively (e.g., Rosch and Mervis, 1975; Smith and Medin, 1981). The concept of two-tiered representation has naturally led us to propose a precise definition of representative, nearly-representative and exceptional examples, namely, as those that are covered by the first tier, by the second tier's procedure for flexible matching, and by the second tier's inference rules, respectively. The ideas of two-tiered representation are also consistent with recent research on two-stage category construction (Ahn and Medin, 1992).

To summarize, there are several major differences between the presented method and related research described in the literature. First, the method has the ability to recover from the loss of coverage due to description truncation by using the second tier. Specifically, flexible matching or deductive rules are used to cover examples not covered explicitly. As has been demonstrated experimentally, this ability often leads to a significant reduction of concept descriptions, and at the same time, to an improvement of their predictive power. Second, the description reduction is done by independently performing both generalization and
specialization operators. Third, any part of the description may be truncated in the simplification process, not only specific parts (as, e.g., in decision tree truncation). Fourth, the method is able to take into account the typicality of the examples. Finally, the method uses a general description quality measure, which takes into consideration a number of different aspects of a description. To relate the presented two-tiered approach to other basic machine learning approaches, Table 9 characterizes them in terms of the type of concept representation used and the kind of matching applied for classification.
                   Simple Induction   Exemplar-based   Two-tiered
Representation     General            Specific         General
Matching           Precise            Inferential      Inferential
Table 9. A comparison of the two-tiered method with simple inductive and exemplar-based methods.

SUMMARY AND OPEN PROBLEMS

The most significant aspect of the presented method is that it represents concepts in a two-tiered fashion, in contrast to traditional learning methods, which represent concepts by a monolithic structure. In this representation, the first tier, the base concept representation (BCR), captures the explicit and common concept meaning, and the second tier, the inferential concept interpretation (ICI), defines allowable modifications of the base meaning and exceptions. Thus, typical concept instances match the BCR, and can therefore be recognized efficiently. Such a two-tiered representation is particularly suitable for learning flexible concepts, i.e., concepts that lack a precise definition and are context-dependent.

In the POSEIDON system that implements the method, the BCR is learned in two steps. First, a complete and consistent description is learned by a conventional learning program (AQ15).
Next, this description is optimized according to a general description quality measure. This is done by a double-level search process that uses both generalization and specialization operators. The General Description Quality measure takes into account not only the properties of the BCR, but also those of the ICI. This is done by measuring the complexity and accuracy of the total description.

The ICI has two components: one specifies a flexible matching function, and the second specifies inference rules for handling exceptions and context-dependency. The ICI rules can be of two types. The rules of the first type extend the meaning of the concept, while the rules of the second type contract this meaning. The first type of rules is employed when an instance is neither covered by the BCR (not S-covered) nor by the flexible matching function (not F-covered). The second type of rules is used when an unknown instance matches the base representation of more than one concept, or when concept membership has to be confirmed. In both cases, the rules are used deductively. An advantage of using rules for matching over other matching methods is that they can serve as an explanation of why a given instance does or does not belong to the concept.

The experimental results have strongly supported the hypothesis that two-tiered concept descriptions can be simpler and easier to understand than "single-tier" descriptions. In the experiments, these descriptions also had greater prediction accuracy, i.e., performed better on new examples. For example, the two-tiered descriptions obtained for the acceptable labor-management contracts gave a performance of over 90% correct using only about 9 rules. In contrast, the best performance of a simple exemplar-based method gave 80% correct predictions on new examples and used 27 rules, and the corresponding pruned decision tree performed at 86% and had 29 leaves (each of which may be viewed as corresponding to one rule). The system also performed better than the previous method based on the TRUNC procedure in terms of prediction accuracy (80%), but at the cost of a more complex concept description. In addition, two-tiered descriptions are relatively easy to understand, and can easily represent explicit domain knowledge.

The presented method is different in several significant ways from the
earlier method of learning two-tiered representations (Michalski et al., 1986). The flexible matching procedure is used not only in the testing phase, but also in the learning phase. In addition to a flexible matching function, the method employs rules for extending or contracting the concept meaning. The earlier TRUNC method used only one specialization operator (rule removal), while the TRUNC-SG method employed in POSEIDON uses two generalization and two specialization operators. The price is that the new method is significantly more complex.

There are many interesting problems for future research. An especially interesting problem is how to integrate the description optimization phase with the initial description generation phase (done by AQ). Another interesting problem is how to learn second tier rules from examples. In the initial method developed by Plante and Matwin (1990), the inferential concept interpretation rules are learned by a chunking process in situations where multiple explanations of positive or negative training events are provided. Future research should also address the application of constructive induction (Michalski, 1983) to the process of learning flexible concepts. In constructive induction, background knowledge is used to construct new attributes and/or higher level descriptors. As a result, the produced descriptions can capture the salient features of the concept, and can be simpler and more comprehensible. The ideas of constructive induction seem very relevant to the proposed method. For example, through constructive induction the system may be able to fold several rules into a single one, or prevent the removal of relevant rules.

The current system does not address the problem of dynamically emerging hierarchies of concepts. The system only learns one concept at a time, and concepts do not change or split as new examples become available. Another open issue is the ability of the system to reorganize itself. The distribution of knowledge between the Base Concept Representation and the Inferential Concept Interpretation should be determined by the performance of the system on large testing sets. If it turns out, for instance, that some inferential concept interpretation rules are used very often, then they could be compiled into the base representation.
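This reorganization idea can be sketched as a simple rule-promotion step; the usage counters and the promotion threshold below are hypothetical illustrations, not an implemented POSEIDON feature.

    # A sketch of the proposed self-reorganization: ICI rules that fire
    # frequently on large testing sets are "compiled" into the BCR; rare
    # exceptions stay in the ICI. Counters and threshold are hypothetical.

    def reorganize(bcr_rules, ici_rules, usage_counts, n_tested,
                   promote_at=0.10):
        promoted, remaining = [], []
        for i, rule in enumerate(ici_rules):
            if usage_counts.get(i, 0) / n_tested >= promote_at:
                promoted.append(rule)      # used often -> move into the BCR
            else:
                remaining.append(rule)     # rare exception -> keep in the ICI
        return bcr_rules + promoted, remaining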
Further research is needed on the role and importance of the different parameters used in the method, and on the trade-offs that they can control.

This paper has focused on learning attributional descriptions, which characterize entities by attributes and ignore their structural properties. Although such descriptions are quite powerful and sufficient for many practical problems, there are applications that require structural descriptions, which characterize entities as systems of components and the relationships among these components. Developing a method for learning two-tiered structural descriptions is therefore an important topic for future research. A relatively simple solution to the above problem would be to replace the AQ15 program by a version of INDUCE (e.g., Michalski, 1983) for learning the initial complete and consistent description. The basic search procedure would remain essentially the same, but would deal with a more complex knowledge representation. A structural representation would allow additional description modification operators, so the descriptions could be modified in more ways; this would increase both the flexibility and the complexity of the search process. Also, the computation of the general quality of descriptions would require appropriate modification, and flexible matching would need to be extended to handle structural concept descriptions. As practical problems frequently require only attributional descriptions, and the method is domain-independent, POSEIDON has the potential to be useful for concept learning and knowledge acquisition in a wide range of real-world applications.

ACKNOWLEDGMENTS

The authors thank Hugo de Garis, Attilio Giordana, Ken Kaufman, Elizabeth Marchut-Michalski, Doug Medin, Franz Oppacher, Lorenza Saitta, Gail Thornburg, and Gheorghe Tecuci for useful comments and criticisms. The authors express special gratitude to Alan Meyrowitz and Susan Chipman for their support of the research described in this chapter.
In addition, Alan Meyrowitz provided detailed comments on this chapter that helped in the preparation of the final version. The authors thank Zbig Koperczak for his help in acquiring the data used in the experiments.

This research was done in the Artificial Intelligence Center of George Mason University. The research activities of the Center are supported in part by the Defense Advanced Research Projects Agency under grants administered by the Office of Naval Research, No. N00014-87-K-0874 and No. N00014-91-J-1854, in part by the Office of Naval Research under grants No. N00014-88-K-0397, No. N00014-88-K-0226, No. N00014-90-J-4059, and No. N00014-91-J-1351, and in part by the National Science Foundation under grant No. IRI-9020266. The second author was supported in part by the Italian Ministry of Education (ASSI), and the third author was supported in part by the Natural Sciences and Engineering Research Council of Canada.

REFERENCES

Ahn, W. & Medin, D.L. (1992). A two-stage model of category construction. Cognitive Science, 16, 81-121.

Bareiss, R. (1989). Exemplar-based knowledge acquisition. Academic Press.

Bergadano, F. & Giordana, A. (1989). Pattern classification: An approximate reasoning framework. International Journal of Intelligent Systems.

Bergadano, F., Matwin, S., Michalski, R.S. & Zhang, J. (1988a). Learning flexible concept descriptions using a two-tiered knowledge representation: Part 1 - ideas and a method. Reports of Machine Learning and Inference Laboratory, MLI-88-4, Center for Artificial Intelligence, George Mason University.

Bergadano, F., Matwin, S., Michalski, R.S. & Zhang, J. (1988b). Measuring quality of concept descriptions. Proceedings of the Third European Working Session on Learning, Glasgow, 1-14.

Bloedorn, E. & Michalski, R.S. (1992). Data-driven constructive induction in AQ17: A method and experiments. Reports of Machine Learning and Inference Laboratory, Center for Artificial Intelligence, George Mason University (to appear).
Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W. & Freeman, D. (1988). AutoClass: A Bayesian classification system. Proceedings of the Fifth International Conf. on Machine Learning, Ann Arbor, 54-64.

Cestnik, B., Kononenko, I. & Bratko, I. (1987). ASSISTANT 86: A knowledge-elicitation tool for sophisticated users. Proceedings of the 2nd European Working Session on Learning, 31-45.

Clark, P. & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning Journal, Vol. 3, No. 4, 261-283.

Collins, A.M. & Quillian, M.R. (1972). Experiments on semantic memory and language comprehension. In L.W. Gregg (Ed.), Cognition, Learning and Memory. John Wiley.

DeJong, G. & Mooney, R. (1986). Explanation-based learning: An alternative view. Machine Learning, Vol. 1, No. 2.

Dietterich, T. (1986). Learning at the knowledge level. Machine Learning, Vol. 1, No. 3, 287-315.

Dietterich, T. & Flann, N. (1988). An inductive approach to solving the imperfect theory problem. Proceedings of the Explanation-Based Learning Workshop, Stanford University, 42-46.

Drastal, G., Czako, G. & Raatz, S. (1989). Induction in an abstraction space: A form of constructive induction. Proceedings of IJCAI 89, Detroit, 708-712.

Fisher, D. (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning, Vol. 2, 139-172.

Fisher, D.H. & Schlimmer, J.C. (1988). Concept simplification and prediction accuracy. Proceedings of the Fifth Int'l Conf. on Machine Learning, Ann Arbor, 22-28.

Hammond, K. (1989). Case-based planning: Viewing planning as a memory task. Academic Press.

Iba, W., Wogulis, J. & Langley, P. (1988). Trading off simplicity and coverage in incremental concept learning. Proceedings of the Fifth Int'l Conf. on Machine Learning, Ann Arbor, 73-79.

Kedar-Cabelli, S.T. & McCarty, L.T. (1987). Explanation-based generalization as resolution theorem proving. Proceedings of the 4th Int. Workshop on Machine Learning, Irvine.

Kibler, D. & Aha, D. (1987). Learning representative exemplars of concepts. Proceedings of the 4th International Workshop on Machine Learning, Irvine.
Kolodner, J. (Ed.) (1988). Proceedings of the Case-Based Reasoning Workshop, DARPA, Clearwater Beach, FL.

Lakoff, G. (1987). Women, Fire, and Dangerous Things: What Categories Reveal about the Mind. University of Chicago Press.

Lebowitz, M. (1987). Experiments with incremental concept formation: UNIMEM. Machine Learning Journal, Vol. 2, No. 2.

Michalski, R.S. (1975). Variable-valued logic and its applications to pattern recognition and machine learning. In D.C. Rine (Ed.), Computer Science and Multiple-Valued Logic: Theory and Applications. North-Holland Publishing Co., 506-534.

Michalski, R.S. & Larson, J.B. (1978). Selection of most representative training examples and incremental generation of VL1 hypotheses: The underlying methodology and the description of programs ESEL and AQ11. Reports of the Department of Computer Science, TR 867, University of Illinois at Urbana-Champaign.

Michalski, R.S. (1983). A theory and methodology of inductive learning. In R.S. Michalski, J.G. Carbonell & T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach. Palo Alto, CA: Tioga (now Morgan Kaufmann).

Michalski, R.S. & Stepp, R.E. (1983). Learning from observation: Conceptual clustering. In R.S. Michalski, J.G. Carbonell & T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach. Palo Alto, CA: Tioga (now Morgan Kaufmann).

Michalski, R.S., Mozetic, I., Hong, J. & Lavrac, N. (1986). The multipurpose incremental learning system AQ15 and its testing application to three medical domains. Proceedings of the 5th AAAI, 1041-1045.

Michalski, R.S. (1989). Two-tiered concept meaning, inferential matching and conceptual cohesiveness. In S. Vosniadou & A. Ortony (Eds.), Similarity and Analogy. Cambridge: Cambridge University Press.

Michalski, R.S. & Ko, H. (1988). On the nature of explanation, or why did the wine bottle shatter? AAAI Symposium: Explanation-Based Learning, Stanford University, 12-16.

Michalski, R.S. (1987). How to learn imprecise concepts: A method employing a two-tiered knowledge representation for learning. Proceedings of the Fourth International Workshop on Machine Learning, Irvine, CA, 50-58.
Michalski, R.S. (1990). Learning flexible concepts: Fundamental ideas and a methodology. In Y. Kodratoff & R.S. Michalski (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. III. San Mateo, CA: Morgan Kaufmann Publishers.

Mooney, R. & Ourston, D. (1989). Induction over the unexplained: Integrated learning of concepts with both explainable and conventional aspects. Proceedings of the 6th Int'l Workshop on Machine Learning, Ithaca, NY, 5-7.

Minsky, M. (1975). A framework for representing knowledge. In P. Winston (Ed.), The Psychology of Computer Vision.

Mitchell, T.M., Keller, R. & Kedar-Cabelli, S. (1986). Explanation-based generalization: A unifying view. Machine Learning Journal, Vol. 1, No. 1, 11-46.

Mitchell, T.M. (1977). Version spaces: An approach to concept learning. Ph.D. Dissertation, Stanford University.

Plante, B. & Matwin, S. (1990). Learning second tier rules by chunking of multiple explanations. Research Report, Department of Computer Science, University of Ottawa.

Prieditis, A.E. & Mostow, J. (1987). PROLEARN: Towards a Prolog interpreter that learns. Proceedings of IJCAI 87, Milan, 494-498.

Quinlan, J.R. (1987). Simplifying decision trees. Int. Journal of Man-Machine Studies, Vol. 27, 221-234.

Robinson, J.A. & Sibert, E.E. (1982). LOGLISP: An alternative to Prolog. In J.E. Hayes & D. Michie (Eds.), Machine Intelligence, Vol. 10, 399-419.

Rosch, E. & Mervis, C.B. (1975). Family resemblances: Studies in the internal structure of categories. Cognitive Psychology, Vol. 7, 573-605.

Rouveirol, C. (1991). Deduction and semantic bias for inverse resolution. Proceedings of IJCAI 91, Sydney, Australia.

Sammut, C. & Banerji, R.B. (1986). Learning concepts by asking questions. In R.S. Michalski, J.G. Carbonell & T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach. Palo Alto, CA: Tioga (now Morgan Kaufmann Publishers).

Smith, E.E. & Medin, D.L. (1981). Categories and Concepts. Harvard University Press.

Sowa, J.F. (1984). Conceptual Structures. Addison-Wesley.
Sturt, E. (1981). Computerized construction in Fortran of a discriminant function for categorical data. Applied Statistics, Vol. 30, 213-222.

Watanabe, S. (1969). Knowing and Guessing: A Formal and Quantitative Study. John Wiley.

Weber, S. (1983). A general concept of fuzzy connectives, negations and implications based on t-norms and t-conorms. Fuzzy Sets and Systems, Vol. 11, 115-134.

Winston, P.H. (1975). Learning structural descriptions from examples. In P. Winston (Ed.), The Psychology of Computer Vision. McGraw-Hill.

Wnek, J. & Michalski, R.S. (1991). Hypothesis-driven constructive induction in AQ17: Method and experiments. Reports of Machine Learning and Inference Laboratory, Center for Artificial Intelligence, George Mason University.

Zadeh, L.A. (1974). Fuzzy logic and its applications to approximate reasoning. Information Processing, North-Holland, 591-594.

Zhang, J. & Michalski, R.S. (1989). Rule optimization via SG-trunc method. Proceedings of the Fourth European Working Session on Learning, Glasgow.
Chapter 6

Competition-Based Learning1

John J. Grefenstette, Kenneth A. De Jong, William M. Spears
Navy Center for Applied Research in Artificial Intelligence
Information Technology Division
Naval Research Laboratory
Washington, DC 20375-5000

Abstract

This paper summarizes recent research on competition-based learning procedures performed by the Navy Center for Applied Research in Artificial Intelligence at the Naval Research Laboratory. We have focused on a particularly interesting class of competition-based techniques called genetic algorithms. Genetic algorithms are adaptive search algorithms based on principles derived from the mechanisms of biological evolution. Recent results on the analysis of the implicit parallelism of alternative selection algorithms are summarized, along with an analysis of alternative crossover operators. Applications of these results in practical learning systems for sequential decision problems and for concept classification are also presented.

INTRODUCTION

One approach to the design of more flexible computer systems is to extract heuristics from existing adaptive systems. We have focused on a class of learning systems that use competition-based procedures, called genetic algorithms (GAs). GAs are based on principles derived from one of the most impressive examples of adaptation available: the adaptation achieved by natural systems to their environment through the mechanisms of biological evolution. The principles were first elucidated in a computational framework by John Holland (1975). Holland's analysis of natural adaptive systems shows that biological evolution embodies a sophisticated kind of generate-and-test strategy that rapidly identifies and exploits regularities in
¹ Sponsored in part by the Office of Naval Research under Work Request N00014-91WX24011.
the environment. By extracting these processes from the specific context of genetics, the algorithms can be applied to a wide range of optimization and learning problems. GAs have in fact been applied successfully to routing and scheduling problems, machine vision, engineering design optimization, gas pipeline control systems, and others. In the area of machine learning, GAs have been used to learn rules for sequential decision problems as well as to learn classification rules from examples (De Jong, 1990). GAs have also been widely used for learning both the topology and the weights of neural nets.

Our research efforts for the past few years have fallen into two main categories: the analysis of genetic algorithms, and the application of genetic algorithms to machine learning problems. This article will focus primarily on recent progress in the analysis of genetic algorithms. The remainder of the article is organized as follows: The next section contains a brief tutorial on genetic algorithms. This is followed by two sections that outline recent progress in the analysis of two fundamental topics in the field: how knowledge structures are selected for reproduction, and how the selected structures are recombined to create new plausible knowledge structures. These sections are followed by a brief overview of our work in developing machine learning systems based on genetic algorithms. The final section describes the directions of current work.
OVERVIEW OF GENETIC ALGORITHMS

Genetic algorithms are adaptive search procedures based on principles derived from the dynamics of natural population genetics. GAs are distinguished from other search methods by the following features:

• A population of structures that can be interpreted as candidate solutions to the given problem.

• The competitive selection of structures for reproduction, based on each structure's fitness as a solution to the given problem.

• Idealized genetic operators that alter the selected structures in order to create new structures for further testing.

These features enable the GA to exploit the accumulating knowledge obtained during the search in such a way as to achieve an efficient balance between the need to explore new areas of the search space and the need to focus on high performance regions of the space. This section provides a general overview of a simple form of genetic algorithm. For more detailed
discussions, see (Holland, 1975; Goldberg, 1989).

    procedure GA
    begin
        t = 0;
        initialize P(t);
        evaluate structures in P(t);
        while termination condition not satisfied do
        begin
            t = t + 1;
            select P(t) from P(t-1);
            alter structures in P(t);
            evaluate structures in P(t);
        end
    end.

Figure 1: A Genetic Algorithm

A genetic algorithm simulates the dynamics of population genetics by maintaining a knowledge base of structures that evolves over time in response to the observed performance of its structures in their operational environment. A specific interpretation of each structure (e.g., as a collection of parameter settings, a condition/action rule, etc.) yields a point in the space of alternative solutions to the problem at hand, which can then be subjected to an evaluation process and assigned a measure called its fitness, reflecting its potential worth as a solution. The search proceeds by repeatedly selecting structures from the current knowledge base on the basis of fitness and applying idealized genetic search operators to these structures to produce new structures (offspring) for evaluation. The basic paradigm is shown in Figure 1 and is explained in more detail below.

At iteration t, the GA maintains a population of structures P(t) representing candidate solutions to the given problem. Population P(0) may be initialized using whatever knowledge is available about possible solutions. In the absence of such knowledge, the initial population should represent a random sample of the search space. Each structure is evaluated and assigned a measure of its fitness as a solution to the problem at hand. When each structure in the population has been evaluated, a new population of structures is formed in two steps. First, structures in the current
population are selected to be reproduced on the basis of their relative fitness. That is, high performing structures may be chosen several times for replication and poorly performing structures may not be chosen at all. In the absence of any other mechanisms, the resulting selective pressure would cause the best performing structures in the initial knowledge base to occupy a larger and larger proportion of the knowledge base over time.

Next, the selected structures are altered using idealized genetic operators to form a new set of structures for evaluation. The primary genetic search operator is the crossover operator, which combines the features of two parent structures to form two similar offspring. There are many possible forms of crossover. The simplest version operates by swapping corresponding segments of a string or list representation of the parents. For example, if the parents are represented by the lists

    (a1 a2 a3 a4 a5) and (b1 b2 b3 b4 b5)

then crossover might produce the offspring

    (a1 a2 b3 b4 b5) and (b1 b2 a3 a4 a5).
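As a concrete illustration, the following sketch (in Python, written for this presentation rather than taken from the chapter) implements the simple one-point crossover just described; the list encoding and the cut point k are assumptions of the example.

    import random

    def one_point_crossover(parent1, parent2, k=None):
        """Swap the tails of two equal-length parent lists at cut point k.

        If k is not given, a cut point is chosen uniformly at random from
        the interior positions, as in the simple crossover described above.
        """
        assert len(parent1) == len(parent2)
        if k is None:
            k = random.randint(1, len(parent1) - 1)
        child1 = parent1[:k] + parent2[k:]
        child2 = parent2[:k] + parent1[k:]
        return child1, child2

    # Example: parents (a1 ... a5) and (b1 ... b5) with a cut after
    # position 2 reproduce the offspring shown in the text.
    p1 = ["a1", "a2", "a3", "a4", "a5"]
    p2 = ["b1", "b2", "b3", "b4", "b5"]
    print(one_point_crossover(p1, p2, k=2))
    # (['a1', 'a2', 'b3', 'b4', 'b5'], ['b1', 'b2', 'a3', 'a4', 'a5'])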
Other forms of crossover operators have been defined for other representations (e.g., Whitley et al., 1989; Koza, 1989; Grefenstette, 1991b). Specific decisions as to whether both resulting structures are to be entered into the knowledge base, whether the precursors are to be retained, and which other structures, if any, are to be purged define a range of alternative implementations.

The crossover operator usually draws only on the information present in the structures of the current knowledge base in generating new structures for testing. If specific information is missing, due to storage limitations or loss incurred during the selection process of a previous iteration, then crossover is unable to produce new structures that contain it. A mutation operator, which alters one or more components of a selected structure, provides the means for introducing new information into the knowledge base. Again, a wide range of mutation operators has been proposed, ranging from completely random alterations to more heuristically motivated local search operators. In most cases, mutation serves as a secondary search operator that ensures the reachability of all points in the search space.
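A minimal sketch of such an operator, assuming a binary-string representation and a per-position mutation rate (both assumptions of this example, not specifics given in the chapter):

    import random

    def bit_flip_mutation(structure, rate=0.01):
        """Flip each bit of a 0/1 list independently with probability `rate`.

        A completely random alteration of this kind keeps every point of
        the binary search space reachable.
        """
        return [1 - bit if random.random() < rate else bit
                for bit in structure]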
The power of the GA lies not in the testing of individual structures but in the efficient exploitation of the wealth of information that the testing of structures provides with regard to the interactions among the components comprising these structures. Specific configurations of component values observed to contribute to good performance (e.g., a specific pair of parameter settings, a specific group of rule conditions, etc.) are preserved and propagated through the structures in the knowledge base in a highly parallel fashion. This, in turn, forms the basis for subsequent exploitation of larger and larger such configurations. Intuitively, we can view these structural configurations as the regularities in the space that emerge as individual structures are generated and tested. Once encountered, they serve as building blocks in the generation of new structures. That is, GAs actually search the space of all feature combinations, quickly identifying and exploiting combinations that are associated with high performance. The ability to perform such a search on the basis of the evaluation of completely specified candidate solutions is called the implicit parallelism of GAs.

To summarize, the power of a GA derives from its ability to exploit, in a near-optimal fashion, information about the utility of a very large number of structural configurations without the computational burden of explicit calculation and storage. This leads to a focused exploration of the search space wherein attention is concentrated in regions that contain structures of above average utility. The knowledge base, nonetheless, is widely distributed over the space, insulating the search from stagnation at a local optimum.

A great variety of genetic algorithms have been studied and compared. Often these comparisons take the form of empirical studies, but the generality of the results is often difficult to assess, since the results usually depend on the particular characteristics of the search space. More analytic tools for comparison need to be developed. Our recent efforts have included new analyses of the fundamental components of genetic algorithms: the rules for selecting knowledge structures for reproduction, and the effects of various crossover operators. The following two sections describe our progress on these two topics.

ANALYSIS OF SELECTION ALGORITHMS

One way to improve our understanding of genetic algorithms is to identify properties that are invariant across the many seemingly different versions of the algorithms. Grefenstette (1991a) focuses on invariances among genetic algorithms that differ along two dimensions: (1) the way the user-defined objective function is mapped to a fitness measure, and (2) the way the fitness measure is used to assign offspring to parents. The remainder of this section summarizes those results.
The process of reproducing knowledge structures in a genetic algorithm can be decomposed into four steps. First, each structure x is evaluated according to an objective function u(x) that defines the problem-specific criterion for success. Second, a fitness function is applied to the result of the evaluation to obtain f(x), the fitness of x. The range of f must be a non-negative interval, and larger values of f(x) indicate more desirable solutions to the objective function.² Third, a selection algorithm assigns a target number of offspring to each population member. Finally, a probabilistic sampling algorithm assigns to each member of the population an integer number of offspring.

The first step is, of course, entirely problem dependent, and will not concern us further. For the final step, several sampling algorithms have been investigated, culminating in one called stochastic universal sampling by Baker (1987), which appears to provide an optimal sampling method. Accordingly, variations on the sampling algorithm will not concern us further. That leaves the middle two steps open for variation, and in fact, many variations are in current use. A short discussion of some of the major variants of fitness functions and selection algorithms will give a fair indication of the range of possibilities.

The fitness function maps the raw score of the objective function to a non-negative interval. Such a mapping is always necessary if the goal is to minimize the objective function, since higher fitness values correspond to lower objective function values in that case. More generally, the fitness function often serves to scale the raw values returned by the objective function in order to provide a high level of selective pressure. Scaling that accentuates small differences is especially desirable late in the search, when the variance in objective performance tends to diminish. One popular approach to scaling (Grefenstette, 1986) is to define the fitness function as a dynamic, linear transformation of the objective value:

    f(x) = a * (u(x) - b(t))

where a is positive for maximization problems and negative for minimization problems, and b(t) represents the worst value seen in the last few generations. The trajectory of b(t) generally rises over time, providing greater
² This notation is at variance with that used in (Grefenstette, 1991a), where the roles of the two symbols were reversed. The mnemonic here is that f(x) denotes the fitness and u(x) denotes the user-defined utility (e.g., cost to be minimized or profit to be maximized). We hope that standard notation may be adopted soon, but in the meantime this paper will use the more intuitive notation.
selection pressure later in the search. This method is sensitive, however, to "lethals", i.e., poorly performing individuals that may occasionally arise through crossover or mutation. A more robust method has been called sigma scaling (Goldberg, 1989):
    f(x) = u(x) - (μ - c·σ)    if u(x) > (μ - c·σ)
    f(x) = 0                   otherwise

where μ is the mean objective function value of the current population, σ is the current population standard deviation, and c is a small positive constant. Sigma scaling provides a level of selective pressure that is sensitive to the spread of performance values in the population. Besides these two forms of fitness functions, many other variations have been proposed and implemented (Goldberg, 1989).

We next consider variations in the selection phase. The selection algorithm assigns an expected number of children C(x) to each population member x, based on the fitness values. The most widely used method is proportional selection, defined as:
    C(x) = f(x) / f̄

where f̄ is the average fitness of the current population. This method was originally proposed and analyzed by Holland, who showed that it results in a nearly optimal allocation of trials under certain circumstances (Holland, 1975). In practice, this selection algorithm may lead to premature convergence, because of the unlimited number of offspring that may be assigned to "super individuals" that may arise early in a search (Baker, 1989). Other forms of selection are less brittle in this respect. For example, rank-based selection assigns offspring according to the formula:

    C(x) = a + b * rank(x)

where rank(x) indicates the relative position of x in the population, from 0 for the worst performer to 1 for the best, and a and b are constants chosen so that a is the minimum number of offspring and a + b is the maximum. Rank-based selection eliminates the problem of premature convergence to "super individuals" by providing a strict upper bound on the number of offspring assigned to any one member in a given generation. In practice, rank-based selection tends to provide a slower, steadier rate of convergence than proportional selection. A final example is threshold selection, in which all population members whose objective function falls below a (possibly time-varying) threshold are deleted, and the survivors are assigned an equal number of offspring to fill the vacated slots.
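The following sketch (Python, written for this survey rather than taken from the systems described here) contrasts proportional and rank-based assignment of expected offspring counts; the population representation, fitness values, and the constants a and b are assumptions of the example.

    def proportional_selection(fitnesses):
        """Expected offspring C(x) = f(x) / f_bar for each member."""
        f_bar = sum(fitnesses) / len(fitnesses)
        return [f / f_bar for f in fitnesses]

    def rank_based_selection(fitnesses, a=0.5, b=1.0):
        """Expected offspring C(x) = a + b * rank(x), rank scaled to [0, 1]."""
        n = len(fitnesses)
        # rank 0 for the worst performer, 1 for the best
        order = sorted(range(n), key=lambda i: fitnesses[i])
        ranks = [0.0] * n
        for position, i in enumerate(order):
            ranks[i] = position / (n - 1)
        return [a + b * r for r in ranks]

    # With fitnesses 1, 2, 3, 6, proportional selection lets the best member
    # expect twice the average number of offspring (C = 2.0), while this
    # rank-based scheme caps every member at a + b = 1.5.
    print(proportional_selection([1.0, 2.0, 3.0, 6.0]))  # [0.33.., 0.66.., 1.0, 2.0]
    print(rank_based_selection([1.0, 2.0, 3.0, 6.0]))    # [0.5, 0.83.., 1.16.., 1.5]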
These three examples should give an indication of the range of selection algorithms that have been explored in genetic algorithms. Understanding the similarities and differences between these options is a fundamental step toward a deeper understanding of genetic algorithms.

Given these two dimensions of variation in the design of genetic algorithms, we say that a genetic algorithm is admissible if it meets what appear to be the weakest reasonable requirements along these dimensions. We can then show that any admissible genetic algorithm exhibits a form of implicit parallelism, meaning that it allocates search effort in a way that differentiates among a large number of competing areas of the search space on the basis of a limited number of explicit evaluations of knowledge structures. These results provide a sense of coherence to the field, in that commonalities are exposed among superficially different versions of the genetic algorithm. These results can also serve to spotlight the features which distinguish broad classes of genetic algorithms from one another.

A few definitions are required to make these ideas concrete. We say that a fitness function is monotonic if it preserves the ordering that the objective function induces on solutions, i.e., f(x) ≤ f(y) whenever x is no better a solution than y.

[...]

Two examples are 2-term DNF formulae and Boolean threshold functions (Σᵢ aᵢxᵢ ≥ b, where each aᵢ ∈ {0,1}). For these classes learning C by C is NP-hard. In both cases, however, by enlarging the hypothesis class we can obtain learnable classes. In the first case 2-CNF suffices, and in the second unrestricted half spaces (Pitt & Valiant, 1988). A further example of an NP-complete learning problem is the intersection of two half spaces (Megiddo, 1986). This remains NP-complete even in the case of {0,1} coefficients, corresponding to certain three-node neural nets (Blum & Rivest, 1988). NP-hardness results are also known for learning finite automata (Li & Vazirani, 1988; Pitt, 1989; Pitt & Warmuth, 1989) and other classes of neural nets (Judd, 1988; Lin & Vitter, 1989).

Representation Independent Limits

As mentioned above, there is a second reason for a class C not being learnable, in this case by any representation, and that is that C is too large. For reasons not well understood, the only techniques known for establishing a negative statement of this nature are cryptographic. The known results are all of the form that if a certain cryptographic function is hard to compute then C is not learnable by any H. For such proofs the most natural choice of H is Boolean circuits, since they are universal and can be evaluated fast given their descriptions and a candidate input. The first such result was implicit in the random function construction of Goldreich, Goldwasser and Micali (1986). It says that, assuming one-way functions exist, the class of all Boolean circuits is not learnable even for the uniform distribution and even with access to a membership oracle.
Various consequences can be deduced from this by means of reduction (Pitt & Warmuth, 1988; Warmuth, 1989). Since positive learning results are difficult to find even for much more restricted models, it was natural to seek negative results closer to the known learnable classes. In Kearns and Valiant (1989) it was shown that deterministic finite automata, unrestricted Boolean formulae (i.e., tree structured circuits), and networks of threshold elements (neural nets) of a certain constant depth are each as hard to learn as it is to compute certain number-theoretic functions, such as factoring Blum integers (i.e., the products of two primes both equal to 3 mod 4) or inverting the RSA encryption function.

MODELS USEFUL FOR ALGORITHM DISCOVERY

Having precise models of learning seems to aid the discovery of learning algorithms. It focuses the mind on what has to be achieved. One significant finding has been that different models encourage different lines of thought, and hence the availability of a variety of models is fruitful. Many of the algorithms discovered recently were developed for models that are either superficially or truly restrictions of the basic pac model. One such model is that of learning from positive examples alone. This constraint suggests its own style of learning. Another model is the deterministic one using oracles discussed in section 6. Although the results for these translate to the pac model with oracles, the deterministic formulation often seems the right one. A third promising candidate is the weak learning model. In seeking algorithms for classes not known to be learnable this offers a tempting approach which has not yet been widely exploited.

We shall conclude by mentioning two further models, both of which have proved very powerful. The first is Occam learning (Blumer et al., 1987). After seeing random examples the learner seeks to find a hypothesis that is consistent with them and somewhat shorter to describe than the number of examples seen. This model implies learnability (Blumer et al., 1987) and is essentially implied by it (Board & Pitt, 1990; Schapire, 1989). It expresses the idea that it is good to have a short hypothesis, but avoids the trap of insisting on the shortest one, which usually gives rise to
NP-completeness even in the simplest cases. Occam learning can be generalized to arbitrary domains by replacing the bound on hypothesis size by a bound on the VC dimension (Blumer et al., 1989). There are many examples of algorithms that use the Occam model. These include algorithms for decision lists (Rivest, 1987), restricted decision trees (Ehrenfeucht & Haussler, 1989), semilinear sets (Abe, 1989) and pattern languages (Kearns & Pitt, 1989).

The second model is that of worst-case mistake bounds (Littlestone, 1988). Here, after each example the algorithm makes a classification. It is required that for any sequence of examples there be only a fixed polynomial number of mistakes made. It can be shown that learnability in this sense implies pac learnability (Angluin, 1987b; Kearns et al., 1987a; Littlestone, 1989). Recently Blum (1990b) showed that the converse is false if one-way functions exist. There are a number of algorithms that are easiest to analyze for this model. The classical perceptron algorithm of Rosenblatt (1961), Minsky and Papert (1988) has this form, except that in the general case the mistake bound is exponential. Littlestone's algorithms that perform well in the presence of irrelevant attributes (Littlestone, 1988), as well as Blum's more recent ones (Blum, 1990a), are intimately tied to this model, as are a number of other algorithms including one for integer lattices (Helmbold, Sloan & Warmuth, 1990).
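As an illustration of the mistake-bound setting, here is a sketch (in Python, not from the original survey) of the multiplicative-update scheme behind Littlestone's WINNOW algorithm for learning monotone disjunctions over n Boolean attributes. The promotion factor 2 and the threshold n used below are typical parameter choices, and the example-stream interface is an assumption of the sketch.

    def winnow(examples, n):
        """Winnow-style learner for monotone disjunctions over n attributes.

        `examples` is an iterable of (x, label) pairs, where x is a 0/1 list
        of length n. Returns the final weights and the number of mistakes,
        which is roughly O(r log n) for an r-literal target disjunction.
        """
        weights = [1.0] * n
        theta = float(n)  # prediction threshold
        mistakes = 0
        for x, label in examples:
            total = sum(w * xi for w, xi in zip(weights, x))
            prediction = 1 if total >= theta else 0
            if prediction != label:
                mistakes += 1
                if label == 1:   # false negative: promote active attributes
                    weights = [w * 2 if xi else w for w, xi in zip(weights, x)]
                else:            # false positive: eliminate active attributes
                    weights = [0.0 if xi else w for w, xi in zip(weights, x)]
        return weights, mistakes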
References

Abe, N. (1989). Polynomial learnability of semilinear sets. In Proceedings of the 2nd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 25-40.

Angluin, D. (1987a). Learning regular sets from queries and counterexamples. Information and Computation, 75:87-106.

Angluin, D. (1987b). Queries and concept learning. Machine Learning, 2:319-342.

Angluin, D., Hellerstein, L., & Karpinski, M. (1989). Learning read-once formulas with queries (Technical Report No. UCB/CSD 89/528). Computer Science Division, University of California, Berkeley.

Angluin, D. & Laird, P. (1987). Learning from noisy examples. Machine Learning, 2:343-370.

Baum, E. (1990a). The perceptron algorithm is fast for non-malicious distributions. Neural Computation, 2:249-261.

Baum, E. (1990b). A polynomial time algorithm that learns two hidden unit nets. In Proceedings of the 3rd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA.

Baum, E. (1990c). When are k-nearest neighbor and back propagation accurate for feasible sized sets of examples? Lecture Notes in Computer Science, 412:2-25.

Baum, E. & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1(1):151-160.

Ben-David, S., Benedek, G., & Mansour, Y. (1989). A parametrization scheme for classifying models of learnability. In Proceedings of the 2nd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 285-302.

Benedek, G. & Itai, A. (1987). Nonuniform learnability (Technical Report TR 474). Computer Science Department, Technion, Haifa, Israel.
Benedek, G. M. & Itai, A. (1988). Learnability by fixed distributions. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 80-90.

Berman, P. & Roos, R. (1987). Learning one-counter languages in polynomial time. In Proceedings of the 28th IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington, D.C., 61-67.

Blum, A. (1990a). Learning boolean functions in an infinite attribute space. In Proceedings of the 22nd ACM Symposium on Theory of Computing, The Association for Computing Machinery, New York, NY.

Blum, A. (1990b). Separating distribution-free and mistake-bound learning models over the boolean domain. In Proceedings of the 31st IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington, D.C., 211-218.

Blum, A. & Rivest, R. (1988). Training a 3-node neural network is NP-complete. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 9-18.

Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. (1987). Occam's razor. Information Processing Letters, 25:377-380.

Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. (1989). Learnability and the Vapnik-Chervonenkis dimension. J. ACM, 36(2):929-965.

Board, R. & Pitt, L. (1990). On the necessity of Occam algorithms. In Proceedings of the 22nd ACM Symposium on Theory of Computing, The Association for Computing Machinery, New York, NY.

Boucheron, S. & Sallantin, J. (1988). Some remarks about space-complexity of learning, and circuit complexity of recognizing. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 125-138.

Dietterich, T. (1990). Machine learning. Annual Review of Computer Science, 4.
Ehrenfeucht, A. & Haussler, D. (1989). Learning decision trees from random examples. Information and Computation, 231-247.

Ehrenfeucht, A., Haussler, D., Kearns, M., & Valiant, L. (1989). A general lower bound on the number of examples needed for learning. Information and Computation, 247-261.

Floyd, S. (1989). Space-bounded learning and the Vapnik-Chervonenkis dimension. In Proceedings of the 2nd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 349-364.

Freund, Y. (1990). Boosting a weak learning algorithm by majority. In Proceedings of the 3rd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA.

Gereb-Graus, M. (1989). Lower Bounds on Parallel, Distributed and Automata Computations. PhD thesis, Harvard University.

Goldman, S., Rivest, R., & Schapire, R. (1989). Learning binary relations and total orders. In Proceedings of the 30th IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington, D.C., 46-53.

Goldreich, O., Goldwasser, S., & Micali, S. (1986). How to construct random functions. J. ACM, 33(4):792-807.

Gu, Q. & Maruoka, A. (1988). Learning monotone boolean functions by uniform distributed examples. Manuscript.

Hancock, T. (1990). Identifying μ-formula decision trees with queries. In Proceedings of the 2nd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA.

Haussler, D. (1987). Bias, version spaces and Valiant's learning framework. In Proceedings of the 4th International Workshop on Machine Learning, Morgan Kaufmann, 324-336.

Haussler, D. (1988). Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36(2):177-222.
Haussler, D. (1990). Learning conjunctive concepts in structural domains. Machine Learning, 4.

Haussler, D., Kearns, M., Littlestone, N., & Warmuth, M. (1988a). Equivalence of models of polynomial learnability. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 42-55.

Haussler, D., Littlestone, N., & Warmuth, M. (1988b). Predicting 0,1-functions on randomly drawn points. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 280-296.

Helmbold, D., Sloan, R., & Warmuth, M. (1989). Learning nested differences of intersection-closed concept classes. In Proceedings of the 2nd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 41-56.

Helmbold, D., Sloan, R., & Warmuth, M. (1990). Learning integer lattices. In Proceedings of the 3rd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA.

Judd, J. (1988). Learning in neural nets. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 2-8.

Kearns, M. (1990). The Computational Complexity of Machine Learning. MIT Press.

Kearns, M. & Li, M. (1988). Learning in the presence of malicious errors. In Proceedings of the 20th ACM Symposium on Theory of Computing, The Association for Computing Machinery, New York, NY, 267-279.

Kearns, M., Li, M., Pitt, L., & Valiant, L. (1987a). On the learnability of Boolean formulae. In Proceedings of the 19th ACM Symposium on Theory of Computing, The Association for Computing Machinery, New York, NY, 285-295.
Kearns, M., Li, M., Pitt, L., & Valiant, L. (1987b). Recent results on Boolean concept learning. In Proceedings of the 4th International Workshop on Machine Learning, Los Altos, CA, Morgan Kaufmann, 337-352.

Kearns, M., Li, M., & Valiant, L. (1989). Learning boolean formulae. Submitted for publication.

Kearns, M. & Pitt, L. (1989). A polynomial-time algorithm for learning k-variable pattern languages from examples. In Proceedings of the 2nd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 57-71.

Kearns, M. & Schapire, R. (1990). Efficient distribution-free learning of probabilistic concepts. In Proceedings of the 3rd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA.

Kearns, M. & Valiant, L. (1989). Cryptographic limitations on learning boolean formulae and finite automata. In Proceedings of the 21st ACM Symposium on Theory of Computing, The Association for Computing Machinery, New York, NY, 433-444.

Kivinen, J. (1989). Reliable and useful learning. In Proceedings of the 2nd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 365-380.

Kucera, L., Marchetti-Spaccamela, A., & Protasi, M. (1988). On the learnability of dnf formulae. In ICALP, 347-361.

Laird, P. (1989). A survey of computational learning theory (Technical Report RIA-89-01-07-0). NASA Ames Research Center.

Li, M. & Vazirani, U. (1988). On the learnability of finite automata. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 359-370.

Li, M. & Vitanyi, P. (1989). A theory of learning simple concepts under simple distributions and average case complexity for the universal distribution. In Proceedings of the 30th IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington, D.C., 34-39.
Lin, J.-H. & Vitter, S. (1989). Complexity issues in learning by neural nets. In Proceedings of the 2nd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 118-133.

Linial, N., Mansour, Y., & Nisan, N. (1989). Constant depth circuits, Fourier transforms and learnability. In Proceedings of the 30th IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington, D.C., 574-579.

Linial, N., Mansour, Y., & Rivest, R. (1988). Results on learnability and the Vapnik-Chervonenkis dimension. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 56-68.

Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: a new linear threshold algorithm. Machine Learning, 2(4):245-318.

Littlestone, N. (1989). From on-line to batch learning. In Proceedings of the 2nd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 269-284.

Megiddo, N. (1986). On the complexity of polyhedral separability (Technical Report RJ 5252). IBM Almaden Research Center.

Minsky, M. & Papert, S. (1988). Perceptrons: An Introduction to Computational Geometry. MIT Press.

Natarajan, B. (1987). On learning boolean functions. In Proceedings of the 19th ACM Symposium on Theory of Computing, The Association for Computing Machinery, New York, NY, 296-304.

Natarajan, B. (1990). Probably approximate learning over classes of distributions. Manuscript.

Ohguro, T. & Maruoka, A. (1989). A learning algorithm for monotone k-term dnf. In Fujitsu IIAS-SIS Workshop on Computational Learning Theory.

Paturi, R., Rajasekaran, S., & Reif, J. (1989). The light bulb problem. In Proceedings of the 2nd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 261-268.
Pitt, L. (1989). Inductive inference, DFAs and computational complexity. In Jantke, K. (editor), Analogical and Inductive Inference, Lecture Notes in Computer Science, Vol. 397, 18-44. Springer-Verlag.

Pitt, L. & Valiant, L. (1988). Computational limitations on learning from examples. J. ACM, 35(4):965-984.

Pitt, L. & Warmuth, M. (1988). Reductions among prediction problems: on the difficulty of predicting automata. In Proceedings of the 3rd IEEE Conference on Structure in Complexity Theory, 60-69.

Pitt, L. & Warmuth, M. (1989). The minimum consistent DFA problem cannot be approximated within any polynomial. In Proceedings of the 21st ACM Symposium on Theory of Computing, The Association for Computing Machinery, New York, NY, 421-432.

Rivest, R. (1987). Learning decision lists. Machine Learning, 2(3):229-246.

Rivest, R. & Sloan, R. (1988). Learning complicated concepts reliably and usefully. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 69-79.

Rivest, R. L. & Schapire, R. (1987). Diversity-based inference of finite automata. In Proceedings of the 28th IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington, D.C., 78-88.

Rivest, R. L. & Schapire, R. (1989). Inference of finite automata using homing sequences. In Proceedings of the 21st ACM Symposium on Theory of Computing, The Association for Computing Machinery, New York, NY, 411-420.

Rosenblatt, F. (1961). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, D.C.

Sakakibara, Y. (1988). Learning context-free grammars from structural data in polynomial time. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 330-344.
Schapire, R. (1989). On the strength of weak learnability. In Proceedings of the 30th IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington, D.C., 28-33.

Shackelford, G. & Volper, D. (1988). Learning k-dnf with noise in the attributes. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 97-105.

Shvaytser, H. (1990). A necessary condition for learning from positive examples. Machine Learning, 5:101-113.

Sloan, R. (1988). Types of noise for concept learning. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 91-96.

Valiant, L. (1984). A theory of the learnable. Comm. ACM, 27(11):1134-1142.

Valiant, L. (1985). Learning disjunctions of conjunctions. In Proceedings of the 9th International Joint Conference on Artificial Intelligence, 560-566, Los Altos, CA. Morgan Kaufmann.

Valiant, L. (1988). Functionality in neural nets. In Proceedings of the American Association for Artificial Intelligence, 629-634, San Mateo, CA. Morgan Kaufmann.

Vapnik, V. (1982). Estimation of Dependences Based on Empirical Data. Springer-Verlag.

Vapnik, V. & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280.

Vitter, J. & Lin, J.-H. (1988). Learning in parallel. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 106-124.

Warmuth, M. (1989). Toward representation independence in pac learning. In Jantke, K. (editor), Analogical and Inductive Inference, Lecture Notes in Computer Science, Vol. 397, 78-103. Springer-Verlag.
Chapter 9

The Probably Approximately Correct (PAC) and Other Learning Models*

David Haussler and Manfred Warmuth
[email protected],
[email protected] Baskin Center for Computer Engineering and Information Sciences University of California, Santa Cruz, CA 95064
ABSTRACT This paper surveys some recent theoretical results on the efficiency of machine learning algorithms. The main tool described is the notion of Probably Approximately Correct (PAC) learning, introduced by Valiant. We define this learning model and then look at some of the results obtained in it. We then consider some criticisms of the PAC model and the extensions proposed to address these criticisms. Finally, we look briefly at other models recently proposed in computational learning theory. INTRODUCTION It's a dangerous thing to try to formalize an enterprise as complex and varied as machine learning so that it can be subjected torigorousmathematical analysis. To be tractable, a formal model must be simple. Thus, inevitably, most people will feel that important aspects of the activity have been left out of the theory. Of course, they will be right. Therefore, it is not advisable to present a theory of machine learning as having reduced the entirefieldto its bare essentials. All that can be hoped for is that some aspects of the phenomenon are brought more clearly into focus using the tools of mathematical analysis, and that perhaps a few new insights are gained. It is in this light that we wish *We gratefully acknowledge the support from ONR grants N00014-86-K-0454-P00002, N00014-86-K-0454-P00003, and N00014-91-J-1162. A preliminary version of this paper appeared in Haussler (1990).
to discuss the results obtained in the last few years in what is now called PAC (Probably Approximately Correct) learning theory (Angluin, 1988). Valiant introduced this theory in 1984 (Valiant, 1984) to get computer scientists who study the computational efficiency of algorithms to look at learning algorithms. By taking some simplified notions from statistical pattern recognition and decision theory, and combining them with approaches from computational complexity theory, he came up with a notion of learning problems that are feasible, in the sense that there is a polynomial time algorithm that "solves" them, in analogy with the class P of feasible problems in standard complexity theory.

Valiant was successful in his efforts. Since 1984 many theoretical computer scientists and AI researchers have either obtained results in this theory, or complained about it and proposed modified theories, or both. The field of research that includes the PAC theory and its many relatives has been called computational learning theory. It is far from being a monolithic mathematical edifice that sits at the base of machine learning; it's unclear whether such a theory is even possible or desirable. We argue, however, that insights have been gained from the varied work in computational learning theory. The purpose of this short monograph is to survey some of this work and reveal those insights.

DEFINITION OF PAC LEARNING

The intent of the PAC model is that successful learning of an unknown target concept should entail obtaining, with high probability, a hypothesis that is a good approximation of it. Hence the name Probably Approximately Correct. In the basic model, the instance space is assumed to be {0,1}^n, the set of all possible assignments to n Boolean variables (or attributes), and concepts and hypotheses are subsets of {0,1}^n. The notion of approximation is defined by assuming that there is some probability distribution D defined on the instance space {0,1}^n, giving the probability of each instance. We then let the error of a hypothesis h with respect to a fixed target concept c, denoted error(h) when c is clear from the context, be defined by

    error(h) = Σ_{x ∈ h Δ c} D(x),
where Δ denotes the symmetric difference. Thus, error(h) is the probability that h and c will disagree on an instance drawn randomly according to D.
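To make the definition concrete, the following sketch (Python, written for this survey; the uniform distribution on {0,1}^n and the set-based concept representation are assumptions of the example) estimates error(h) empirically by sampling instances and checking membership in the symmetric difference h Δ c:

    import random

    def estimate_error(h, c, n, num_samples=10000):
        """Estimate error(h) = sum of D(x) over x in h Δ c, with D taken
        to be uniform on {0,1}^n.

        `h` and `c` are sets of instances, each instance a tuple of 0/1 values.
        """
        disagreements = 0
        for _ in range(num_samples):
            x = tuple(random.randint(0, 1) for _ in range(n))
            if (x in h) != (x in c):   # x lies in the symmetric difference
                disagreements += 1
        return disagreements / num_samples

    # Example with n = 2: c = {(1,1)} and h = {(1,1), (1,0)} disagree only
    # on (1,0), so under the uniform distribution error(h) ≈ 0.25.
    print(estimate_error({(1, 1), (1, 0)}, {(1, 1)}, n=2))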
The hypothesis h is a good approximation of the target concept c if error(h) is small.

How does one obtain a good hypothesis? In the simplest case one does this by looking at independent random examples of the target concept c, each example consisting of an instance selected randomly according to D, and a label that is "+" if that instance is in the target concept c (positive example), otherwise "-" (negative example). Thus, training and testing use the same distribution, and there is no "noise" in either phase. A learning algorithm is then a computational procedure that takes a sample of the target concept c, consisting of a sequence of independent random examples of c, and returns a hypothesis.

For each n ≥ 1 let C_n be a set of target concepts over the instance space {0,1}^n, and let C = {C_n}_{n≥1}. Let H_n, for n ≥ 1, and H be defined similarly. We can define PAC learnability as follows: The concept class C is PAC learnable by the hypothesis space H if there exists a polynomial time learning algorithm A and a polynomial p(·,·,·) such that for all n ≥ 1, all target concepts c ∈ C_n, all probability distributions D on the instance space {0,1}^n, and all ε and δ, where 0 < ε, δ < 1, if the algorithm A is given at least p(n, 1/ε, 1/δ) independent random examples of c drawn according to D, then with probability at least 1 - δ, A returns a hypothesis h ∈ H_n with error(h) at most ε.
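The sample size p(n, 1/ε, 1/δ) is often established via the standard bound for finite hypothesis classes (in the spirit of the Occam results of Blumer et al., 1987, cited above): any algorithm that returns a hypothesis consistent with m ≥ (1/ε)(ln |H_n| + ln(1/δ)) examples satisfies the PAC criterion. A small sketch (Python, written for this survey rather than drawn from the chapter) evaluates the bound; the class of monomials over n variables, with |H_n| roughly 3^n, is an assumption chosen for the example.

    import math

    def pac_sample_bound(log_h_size, epsilon, delta):
        """Sample size sufficient for a consistent learner over a finite
        class: m >= (1/epsilon) * (ln|H| + ln(1/delta)).
        """
        return math.ceil((log_h_size + math.log(1.0 / delta)) / epsilon)

    # Monomials over n = 20 Boolean variables: |H| ~ 3^20, so ln|H| = 20 ln 3.
    m = pac_sample_bound(20 * math.log(3), epsilon=0.1, delta=0.05)
    print(m)  # about 250 examples suffice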