1
INTRODUCTION
In which we try to explain why we consider artificial intelligence to be a subject most worthy ofstudy, and in which we try to decide what exactly it is, this being a good thing to decide before embarking.
INTELLIGENCE
We call ourselves
Homo sapiens-man
the wise-because our
to us. For thousands of years, we have tried to understand
intelligence
how we think;
i s so important
that is, how a mere
handful of matter can perceive, understand, predict, and manipulate a world far larger and ARTIFICIAL INTELLIGENCE
more complicated than itself. The field of attempts not just t o understand but also to
artificial intelligence,
or AI, goes further still: i t
build intelligent entities.
AI is one of the newest fields in science and engineering. Work started in eamest soon after World War II, and the name itself was coined in
1956.
Along with molecular biology,
AI is regularly cited as the "field I would most like to be in" by scientists in other disciplines. A student in physics might reasonably feel that all the good ideas have already been taken by Galileo, Newton, Einstein, and the rest. AI, on the other hand, still has openings for several full-time Einsteins and Edisons. AI currently encompasses a huge variety of subfields, ranging from t11e general (leaming and perception) to the specific, such as playing chess, proving mathematical theorems, writing poetry, driving a car on a crowded street, and diagnosing diseases.
AI is relevant to any
intellectual task; i t is truly a universal field.
1.1
WHAT Is AI?
We have claimed that AI is exciting, but w e have not said what i t
is.
In Figure
1.1
w e see
eight definitions of AI, laid out along two dimensions. The definitions on top are concemed with
thought processes and reasoning,
definitions on the left measure success in terms RATIONALITY
the bottom address behavior. The of fidelity to human performance, whereas
whereas the ones on
the ones on the right measure against an
ideal performance
measure, called
rationality.
A
system is rational if it does the "right thing," given what it knows. Historically, all four approaches to AI have been followed, each by different people with different methods. A human-centered approach must be in prut an empirical science, in-
2
Chapter
1.
Introduction
Thinking Hmnanl y
Thinking Rationally
"The exciting new effort to make comput-
"The study of mental faculties through the
ers think ... machines
use of computational models."
with minds, in the
full and literal sense." (Haugeland,
1985)
(Charniak and McDermott, 1985)
the computations that make
"[The automation of] activities that we
'The study of
associate with human thinking, activities
it possible to perceive, reason, and act."
such as decision-making, problem solv-
(Winston,
1992)
ing, learning ... "(Bellman, 1978)
Acting Humanly
Acting Rationally
'The art of creating machines that per-
"Computational Intelligence is the study
form functions that require intelligence
of the design of intelligent agents." (Poole
when performed by people." (Kurzweil,
et al., 1998)
1990) 'The study of how to make computers do
"AI ...is concemed with intelligent be-
things at which, at the moment, people are
havior in artifacts." (Nilsson, 1998)
better." (Rich and Knight, 1991)
Figure 1.1
Some definitions of artificial intelligence, organized into four categories.
1 volving observations and hypotheses about human behavior. A rationalist approach involves a combination of mathematics and engineering. The various group have bot.h disparaged and helped each other. Let us look at the four approaches in more detail. 1.1.1 TURING lEST
The
Acting humanly: The Turing Test approach
Thring Test, proposed by Alan Turing (1950), was designed to provide a satisfactory
operational definition of intelligence. A computer passes the test if a human interrogator, after posing some written questions, cannot tell whether the written responses come from a person or
from a computer. Chapter 26 discusses the details of the test and whether a computer would
really be intelligent if it passed. For now, we note that programming a computer to pass a rigorously applied test provides plenty to work on. The computer would need to possess the following capabilities: NATURAl IANGUAGF PROCESSING KNCWLEDGE REPRESENTATION AUTOMATED REASONING
•
natm·allanguage processing to enable it to communicate succes sfully i n English;
•
knowledge representation to store what it knows or hears;
•
automated reasoning to use the stored information to answer questions and to draw new conclus.ions;
MACHINE LEARNING
•
machine learning to adapt to new circumstances and to detect and extrapolate patterns.
1 By distinguishing between human and rational behavior, we are not suggesting that humans are necessarily "irrational" in the sense of "emotionally unstable" or "insane." One merely need note that we are not perfect: not all chess players are grandmasters; and, unfortunately, not everyone gets an A on the exam. Some systematic enors in hwnan reasoning are cataloged by Kahneman et al. (1982).
Section 1.1.
What Is AI?
3
Turing's test deliberately avoided direct physical interaction between the interrogator and the computer, because physical simulation of a person is unnecessary for intelligence. However, TOTALlURINGTEST
the so-called
total Turing Test includes a video signal so that the interrogator can test the
subject's perceptual abilities, as well as the opportunity for the interrogator to pass physical objects "through the hatch." To pass the total Turing Test, the computer will need coMPUTERVISION
•
computer vision to perceive objects, and
FOBOTICS
•
robotics to manipulate objects and move about.
These six disciplines compose most of AI, and Turing deserves credit for designing a test that remains relevant 60 years later. Yet AI researchers have devoted little effort to passing the Turing Test, believing that it is more important to study the underlying principles of in telligence than to duplicate an exemplar. The quest for "artificial flight " succeeded when the Wright brothers and others stopped imitating birds and started using wind tunnels and learn ing about aerodynamics. Aeronautical engineering texts do not define the goal of their field as making "machines that fly so exactly like pigeons that they can fool even other pigeons." 1.1.2
Thinking humanly: The cognitive modeling approach
If we are going to say that a given program thinks like a human, we must have some way of determining how hwnans think. We need to get
inside the actual workings of human minds.
There are three ways to do this: through introspection-trying to catch our own thoughts as they go by; through psychological experiments-observing a person in action; and through brain imaging-observing the brain in action. Once we have a sufficiently precise theory of
the mind, it becomes possible to express the theory as a computer program. If the p ro gram' s input-{)utput behavior matches corresponding human behavior, that is evidence that some of the program's mechanisms could also be operating in humans. For example, Allen Newell and Herbert Simon, who developed GPS, the "General Problem Solver " (Newell and Simon, 1961), were not content merely to have their program solve problems correctly. They were more concemed with comparing the trace of its reasoning steps to traces of human subjects coGNITIVE SCIENCE
solving the same problems. The interdisciplinary field of computer models from AI and experimental techniques
cognitive science brings together
from psychology
to construct precise
and testable theories of the human mind. Cognitive science is a fascinating field in itself, worthy of several textbooks and at least one encyclopedia (Wilson and Keil, 1999). We will occasionally comment on similarities or differences between AI techniques and human cognition. Real cognitive science, however, is necessarily based on experimental investigation of actual humans or animals. We will leave that for other books, as we assume the reader has only a computer for experimentation. ln the early days of AI there was often confusion between the approaches: an author would argue that an algorithm performs well on a task and that it is
therefore a good model
of hwnan performance, or vice versa. Modem authors separate the two kinds of claims; tl1is distinction has allowed both AI and cognitive science to develop more rapidly. The two fields continue to fertilize each other, most notably in computer vision, which incorporates neurophysiological evidence into computational models.
4
Chapter
1.1.3
1.
Introduction
Thinking rationally: The "laws of thought" approach
The Greek philosopher Aristotle was one of the first to attempt to codify "right thinking;' that SYLLOGISM
is, irrefutable reasoning processes. His
syllogisms provided
patterns for argument structures
that always yielded correct conclusions when given correct premises-for example, "Socrates is a man; all men are mortal; therefore, Socrates is mortal." These laws of thought were LOGIC
su pposed to govem the operation of the mind; their study initiated the field called
logic.
Logicians in the 19th century developed a precise notation for statements about all kinds of objects in the world and the relations among them. (Contrast this with ordinary arithmetic
numbers.) By 1965, programs existed any solvable problem described in logical notation. (Although
notation, which provides only for statements about that could, in principle, so]ve LOGICIST
if no solution exists, the program might loo p forever.) The so-called
logicist tradition
within
at1ificial intelligence ho pes to build on such programs to create intelligent systems. There are two main obstacles to this approach. First, it is not easy to take informal knowledge and state it in the formal terms required by logical notation, particularly when the knowledge is less than I 00% certain. Second, there is a big difference between solving a problem "in principle" and solving it .in practice. Even problems with just a few hundred facts can exhaust the computational resources of any computer unless it has some guidance as to which reasoning steps to try first. Although both of these obstacles apply to any attempt to build computational reasoning systems, they appeared first in the logicist tradition.
1.1.4 AGENT
Acting rationally: The rational agent approach
An agent
is just something that acts
(agent comes from the Latin agere,
all computer programs do something, but computer agents
are
to do). Of course,
expected to do more: operate
autonomously, perceive their environment, persist over a p rolonged time period, adapt to RATIONALAGENT
change, and create and pursue goals. A
rational agent is
one that acts so as to achieve the
best outcome or, when there is tmcertainty, the best expected outcome. In the "laws of thought" approach to AI, the emphasis was on correct inferences. Mak ing correct inferences is sometimes
part
of being a rational agent, because one way to act
rationally is to reason logically to the conclusion that a given action will achieve one's goals and then to act on that conclusion. On the other hand, correct inference is not
all of
ration
ality; in some situations, there is no provably correct thing to do, but something must still be done. There are also ways of acting rationally that carmot be said to involve inference. For example, recoiling from a hot stove is a reflex action that is usually more successful than a slower action taken after cru·eful deliberation. All the skills needed for the Turing Test also allow an agent to act rationally. Knowledge representation and reasoning enable agents to reach good decisions. We need to be able to generate comprehensible sentences in natu.ral language to get by in a complex society. We need learning not only for erudition, but also because it imp roves our ability to generate effective behavior. The rational-agent approach has two advantages over the other approaches. First, it is more general than the "laws of thought" a pproach because correct inference is just one of several possible mechanisms for achieving rationality.
Second, it is more amenable to
Section
1.2.
5
The Foundations of Artificial Intelligence
scientific development than are approaches based on human behavior or human thought. The standard of rationality is mathematically well defined and completely general, and can be "unpacked" to generate agent designs that provably achieve it. Human behavior, on the other hand, is well adapted for one specific environment and is defined by, well, the sum total
This book therefore concentrates on general principles of rational agents and on components for constructing them. We will see that despite the of all the things that humans do.
apparent simplicity with which the problem can be stated, an enonnous variety of issues come up when we try to solve it. Chapter 2 outlines some of these issues in more detail. One important point to keep in mind: We will see before too long that achieving perfect rationality-always doing the right thing-is not feasible in complicated environments. The computational demands are just too high. For most of the book, however, we will adopt the working hypothesis that perfect rationality is a good starting point for analysis. It simplifies the problem and provides the appropriate setting for most of
�'[J� �LITY
the field.
Chapters
5
and
17
the
deal explicitly with the issue of
fotmdational material in
limited
rationality-ac ting
appropriately when there is not enough time to do all the computations one might like.
1 .2
THE FOUNDATIONS OF ARTIFICIAL INTELLIGENCE
In this section, we provide a brief history of the disciplines that contributed ideas, viewpoints, and techniques to AI. Like any history, this one is forced to concentrate on a small nwnber of people, events, and ideas and to ignore others that also were important. We organize the history around a series of questions. We certainly would not wish to give the impression that these questions are the only ones the disciplines address or that the disciplines have all been working toward AI as their ultimate fruition.
1.2.1
• • • •
Philosophy Can formal rules be used to draw valid conclusions? How does the mind arise from a physical brain? Where does knowledge come from? How does knowledge lead to action?
Aristotle
(384-322 B.c.),
whose bust appears on the front cover of this book, was the first
to formulate a precise set of Jaws governing the rational part of the mind. He developed an informal system of syllogisms for proper reasoning, which in principle allowed one to gener ate conclusions mechanically, given initjal premises. Much later, Ramon Lull (d.
1315)
had
the idea that useful reasoning could actually be carried out by a mechanical artifact. Thomas Hobbes
(1588-1679) proposed
that reasoning was like numerical computation, that "we add
and subtract in our silent thoughts." TI1e automation of computation itself was already well w1der way. Around
1500, Leonardo da Vinci (1452-1519) designed but did not build
a me
chanical calculator; recent reconstructions have shown the design to be functional. The first
1623 by the German scientist Wilhelm Schickard (1 592-1635), although the Pascaline, built in 1642 by Blaise Pascal (1623-1662), known calculating machine was constructed arow1d
6
Chapter
1.
Introduction
is more famous. Pascal wrote that "the arithmetical machine produces effects which appear nearer to thought than all the actions of animals." Gottfried Wilhelm Leibniz
(1646-1716)
built a mechanical device intended to catTy out operations on concepts rather than numbers, but its scope was rather limited. Leibniz did surpass Pascal by building a calculator that could add, subtract, multiply, and take roots, whereas the Pascaline could only add and sub tract. Some speculated that machines might not just do calculations but actually be able to think and act on their own. In his
1651
book
Leviathan, Thomas Hobbes suggested
the idea
of an "artificial animal," arguing "For what is the herut but a spring; and the nerves, but so many strings; and the joints, but so many wheels." It's one thing to say that the mind operates, at least in part, according to logical rules, and to build physical systems that emulate some of those mles; it's another to say that the mind itself
is
such a physical system. Rene Descartes
(1596-1 650)
gave the first clear discussion
of the distinction between mind and matter and of the problems that arise. One problem with a purely physical conception of the mind is that it seems to leave little room for free will: if the mind is govemed entirely by physical Jaws, then it has no more free will than a rock
"deciding" to fall toward the center of the earth. Descartes was a strong advocate ofthe power
rationalism,
RATIONALISM
of reasoning in understanding the world, a philosophy now called
DUALISM
and one that
counts Aristotle and Leibnitz as members. But Descrutes was also a proponent of
dualism.
He held that there is a prut of the human mind (or soul or spirit) that is outside of nature, exempt from physical laws. Animals, on the other hand, did not possess this dual quality; MATERIALISM
they could be treated as machines. An alternative to dualism is that the brain's operation according to the laws of physics
materialism,
constitutes the mind.
which holds Free will is
simply the way that the perception of available choices appears to the choosing entity. Given a physical mind that manipulates knowledge, the next problem is to establish EMPIRICISM
empiricism movement, stru·ting with Francis Bacon's (15612 1626) Novum Organum, is characterized by a dictum of John Locke (1632-1704): "Nothing is in the understanding, which was not first in the senses." David Hume's (1711-1776) A Treatise of Human Nature (Hwne, 1739) proposed what is now known as the principle of
INDUCTION
induction:
the source of knowledge. The
that general mles are acquiJed by exposure to repeated associations between their
(1889-1951) and Bertrand Russell Rudolf Camap (1891-1970), developed the
elements. Building on the work of Ludwig Wittgenstein
(1872-1970), the LOGICALPOSITIVISM OBSERVATION SENTENCES CONFIRMATION THEORY
doctrine of logical
famous Vienna Circle, led by
positivism.
This doctrine holds that all knowledge can be characterized by
logical theories connected, ultimately, to
observation sentences
that correspond to sensory 3 inputs; thus logical positivism combines rationalism and empiJicism. The confirmation the
ory of edge
(1905-1997) attempted to analyze the acquisition of knowl Carnap's book The Logical Structure of the World (1928) defined an
Carnap and Carl Hempel
from experience.
explicit computational procedure for extracting knowledge from elementary experiences. It was probably the first theory of mind as a computational process.
2
The Novum Organum is an update of Aristotle's Organon., or instrument of thought. Thus Aristotle can be seen as both an empiricist and a rationalist. 3 In thls picture, all meaningful statements can be verified or falsified either by experimentation or by analysis of the meaning of the words. Because this rules out most of metaphysics, as was the intention, logical positivism was unpopular in some circles.
Section 1.2.
7
The Foundations of Artificial Intelligence
The final element in the philosophical picture of the mind is the cmmection between knowledge and action. This question is vital to AI because intelligence requires action as well as reasoning. Moreover, only by understanding how actions are justified can we understand how to build an agent whose actions are justifiable (or rational). Aristotle argued (in De Motu Animalium) that actions are justified by a logical connection between goals and knowledge of the action's outcome (the last part of this extract also appears on the front cover of this book, in the original Greek):
But how does it happen that thinking is sometimes accompanied by action and sometimes not, sometimes by motion, and sometimes not? It looks as if almost the same thing happens as in the case of reasoning and making inferences about unchanging objects. But in that case the end is a speculative proposition ... whereas here the conclusion which results from the two premises is an action. ... I need covering; a cloak is a covering. I need a cloak. What I need, I have to make; I need a cloak. I have to make a cloak. And the conclusion, the "I have to make a cloak," is an action. In the
Nicomachean Ethics
(Book 01. 3, 1112b), Aristot1e further elaborates on this topic,
suggesting an algorithm:
We deliberate not about ends, but about means. For a doctor does not deliberate whether he shall heal, nor an orator whether he shall persuade, ... They assume the end and consider how and by what means it is attained, and if it seems easily and best produced thereby; while if it is achieved by one means only they consider how it will be achieved by this and by what means this will be achieved, till they come to the first cause, ... and what is last in the order of analysis seems to be first in the order of becoming. And if we come on an impossibility, we give up the search, e.g., if we need money and this cannot be got; but if a thing appears possible we try to do it. Aristotle's algorithm was implemented 2300 years later by Newell and Simon in their G PS program. We would now call it a regression planning system (see Chapter 10). Goal-based analysis is useful, but does not say what to do when several actions will achieve the goal or when no action will achieve it completely. Antoine Arnauld (1612-1694)
to take in cases like this Utilitarianism (Mill, 1863) promoted
correctly described a quantitative formula for deciding what action (see Chapter 16). John Stuart Mill's (1806-1873) book
the idea of rational decision criteria in all spheres of hwnan activity. The more formal theory of decisions is discussed in the following section.
1.2.2
• • •
Mathematics What are the formal rules
to draw valid conclusions?
What can be computed? How do we reason with uncertain information?
Philosophers staked out some of the fundamental ideas of AI, but the leap
to a formal science
required a level of mathematical formalization in three fundamental areas: logic, computa tion, and probability. The idea of formal logic can be traced back to the philosophers of ancient Greece, but its mathematical development really began with the work of George Boole ( 1815-1864), who
8
Chapter
1.
Introduction
worked out the details of propositional, or Boolean, logic (Boole, 1847). In 1879, Gottlob Frege (1848-1925) extended Boote's logic to include objects and relations, creating the first 4 order logic that is used today. Alfred Tarski (1902-1983) introduced a theory of reference
that shows how to relate the objects in a logic to objects in the real world. The next step was to determine the limits of what could be done with logic and com ALGORITHM
putation. The first nontrivial
algorithm
greatest conunon divisors. The word
is thought to be Euclid's algorithm for computing
algorithm
(and the idea of studying them) comes from
al-Khowarazmi, a Persian mathematician of the 9th century, whose writings also introduced Arabic numerals and algebra to Europe. Boole and others discussed algorithms for logical deduction, and, by the late 19th century, efforts were under way to formalize general mathe matical reasoning as logical deduction. In 1930, Kurt Godel (1906-1978) showed that there exists an effective procedw·e to prove any tme statement in the first-order logic of Frege and Russell, but that first-order logic could not capture the principle of mathematical induction needed to characterize the natural numbers. ln 1931, GOdel showed that limits on deduc INCOMPLETENESS THEOREM
tion do exist. His
incompleteness theorem
showed that in any formal theory as strong as
Peano arithmetic (the elementary theory of natural numbers), there
are
true statements that
are undecidable in the sense that they have no proof within the theory. This fundamental result can also be interpreted as showing that some functions on the integers cannot be represented by an algorithm-that is, they cannot be computed. motivated Alan Turing (1912-1954) to try to characterize exactly which functions COMPUTABLE
putable--capable
the notion
This
are com
of being computed. This notion is actually slightly problematic because
of a computation or effective procedure really ca1mot be given a formal definition.
However, the Church-Turing thesis, which states that the Turing machine (Tw·ing, 1936) is capable of computing any computable function, is generally accepted as providing a sufficient definition. Turing also showed that there were some functions that no Turing machine can compute. For example, no machine can tell an
in general whether a
given program will return
answer on a given input or run forever. Although decidability and computability are important to an understanding of computa
TRACTABIUTY
tion, the notion of tractability has had an even greater impact. Roughly speaking, a problem is called intractable if the time required to solve instances of the problem grows exponentially with the size of the instances. The distinction between polynomial and exponential growth in complexity was first emphasized in the mid-1960s (Cobham, 1964; Edmonds, 1965). It is important because exponential growth means that even moderately large instances ca1mot be solved in any reasonable time. Therefore, one should strive to divide the overall problem of generating intelligent behavior into tractable subproblems rather than intractable ones. How can one recognize an intractable problem? The theory of NP-completeness, pio
NP-COMPLETENESS
neered by Steven Cook (1971) and Richard Karp (1972), provides a method. Cook and Karp showed the existence of large classes of canonical combinatorial search and reasoning prob lems that are NP-complete. Any problem class to which the class of NP-complete problems can be reduced is likely to be intractable. (Although it has not been proved that NP-.'Pioration that must be undertaken by
a vacuurn-deaning agent in an initia11y unknown environment. LEARNING
Our definition requires a rational agent not only to gather infonnation but also to as much as possible
from what
learn
it perceives. The agent's initial configuration could reflect
some prior knowledge of the environment, but as the agent gains experience this may be modified and augmented. There are extreme cases in which the environment is completely known a priori. In such cases, the agent need not perceive or Jearn; it simply acts correctly. Of cow·se, such agents are fragile. Consider the Jowly dung beetle. After digging its nest and laying its eggs, it fetches a ball of dung from a nearby heap to plug the entrance. I f the ball of dung is removed from its grasp en route, the beetle continues its task and pantomimes plug ging the nest with the nonexistent dung ball, never noticing that it is missing. Evolution has built an asswnption into the beetle's behavior, and when it is violated, unsuccessful behavior results. Slightly more intelligent is the sphex wasp. The female sphex will dig a burrow, go out and sting a caterpillar and drag it to the burrow, enter the burrow again to check all is well, drag
the caterpillar inside, and lay its eggs. The caterpillar serves as a food source when
the eggs hatch. So far so good, but if an entomologist moves the caterpillar a few inches away while
the sphex is doing
the check, it will revert to the "drag " step of its plan and will
continue the plan without modification, even after dozens of caterpillar-moving interventions. The sphex is unable to lean1 that its innate plan is failing, and thus will not change it. To the extent that an agent relies on the prior knowledge of its designer rather AUTONOMY
on its own percepts, we say that the agent lacks
than
autonomy. A rational agent should be
autonomous-it should learn what it can to compensate for partial or incorrect prior knowl edge.For example, a vacuum-cleaning agent that learns to foresee where and when additional dirt will appear will do better than one that does not. As a practical matter, one seldom re quires complete autonomy from the start: when the agent has had little or no experience, it would have to act randomly unless the designer gave some assistance. So, just as evolution provides animals with enough built-in reflexes to survive long enough to learn for themselves, it would be reasonable to provide an artificial intelligent agent with some initial knowledge as well as an ability to learn. After sufficient experience of its environment, the behavior of a rational agent can become effectively
independent
of its prior knowledge. Hence, the
incorporation of learning allows one to design a single rational agent that will succeed in a vast va riety of environments.
40 2.3
Chapter
2.
Intelligent Agents
THE NATURE OF ENVIRONMENTS
Now that we have a definition of rationality, w e TASKENVIRONMENT
are
rational agents. First, however, we must think about
almost ready
to think
about building
task environments, which are essen
tially the "problems" to which rational agents are the "solutions." We begin by showing how
to specify
a task envirorunent, illustrating the process with a nwnber of examples. We then
show that task environments come in a variety of flavors. The flavor of the task envirorunent directly affects the approptiate design for the agent program. 2.3.1
Specifying the task environment
In our discussion of the rationality of the simple vacuum-cleaner agent, we had to specify the performance measure, the environment, and the a.gent's actuators and sensors. We group all these under the heading of the task environment. For the acronymically minded, we can PEAS
this the PEAS (Performance, Environment, Actuators, Sensors) description. In designing an agent, the first step must always be to specify the task environment as fully as possible. The vacuum world was a simple example; let us consider a more complex problem: an automated taxi driver. We should point out, before the reader becomes alarmed, that a fully automated taxi is currently somewhat beyond the capabilities of existing technology. (page 28 describes an existing driving robot.) The full driving task is extremely
open-ended.
There is
no limit to the novel combinations of circumstances that can arise-another reason we chose it as a focus for discussion. Figure 2.4 summarizes the PEAS description for the taxi's task environment. We discuss each element in more detail in the following paragraphs. Agent Type
Perfonnance Measure
Environment
Actuators
Sensors
Taxi driver
Safe, fast, legal, comfortable trip,
Roads, other traffic,
Steering, accelemtor,
Catneras, sonar, speedometer,
maximize profits
pedestrians, customers
bmke, signal, horn, display
GPS, odometer, accelerometer, engine sensors, keyboard
Figure 2.4
PEAS description ofthe task environment for an automated taxi.
First, what is the
performance measure to which we would like our automated dtiver
to aspire? Desirable qualities include getting to the correct destination; minimizing fuel con sumption and wear and tear; minimizing the trip time or cost; minimizing violations of traffic Jaws and disturbances
to other drivers;
maximizing safety and passenger comfort; maximiz
ing profits. Obviously, some of these goals conflict, so tradeoffs will be required. Next, what is the driving
environment that the taxi will face? Any taxi driver must
deal with a variety of roads, ranging from rural lanes and urban alleys to 12-Jane freeways. The roads contain other traffic, pedestrians, stray animals, road works, police cars, puddles,
Section 2.3.
41
The Nature ofEnvironments
and potholes. The taxi must also interact with potential and actual passengers. There are also some optional choices. The taxi might need to operate in Southern California, where snow is seldom a problem, or in Alaska, where it seldom is not. It could always be driving on the right, or we might want it
to be flexible enough to drive on the left when in Britain or Japan.
Obviously, the more restricted the envirorunent, the easier the design problem. The actuators for an automated taxi include those available to a hwnan driver: control over the engine through the accelerator and control over steering and braking. In addition, it
to a display screen or voice synthesizer to talk back to the passengers, perhaps some way to commw1icat.e with other vehicles, politely or otherwise.
will need output The basic that it can see
sensors for the taxi will
and
include one or more controllable video cameras
so
the road; it might augment these with infrared or sonar sensors to detect dis
tances to other cars and obstacles. To avoid speeding tickets, the taxi should have a speedome ter, and to control the vehicle properly, especially on curves, it should have an accelerometer. To detetmine the mechanical state of the vehicle, it. will need the usual an·ay of engine, fuel, and electrical system sensors. Like many human drivers, it might. want a global positioning system (GPS) so that it doesn't get lost. Finally, it will need a keyboard or microphone for the passenger to request a destination. In Figure 2.5, we have sketched the basic PEAS elements for a number of additional agent types. Further examples appear in Exercise 2.4. It may come as a surprise ers that our list. of agent types n i cludes some programs that operate in
to some read
the entirely artificial
environment. defined by keyboard input and character output. on a screen. "Surely," one might say, "this is not a real environment, is it?" In fact., what matters is not the distinction between "real" and "artificial" environments, but the complexity of the relationship among the behav
ior of the agent, the percept sequence generated by the environment, and the performance measure. Some "real" environments are actually quite simple. For example, a robot designed to inspect parts as they come by on a conveyor belt can make use of a number of simplifying assumptions: that. the lighting is always just
so,
that. the only thing on the conveyor belt will
be patts of a kind that it knows about, and that only two actions (accept. or reject.) are possible. In contrast., some software agents (or software robots or
soflWAREAGENT soFTBOT
softbots) exist in rich, unlim-
to scan Internet news sources and advertising space to generate revenue.
it.ed domains. Imagine a softbot Web site operator designed show the interesting items to its users, while selling
To do well, that operator will need some natural language processing abilities, it will need to learn what each user and advertiser is interested in, and it will need to change its plans dynamically-for example, when the connection for one news source goes down or when a new one comes online. The Internet is an environment whose complexity rivals that of the physical world and whose inhabitants include many attificial and human agents.
2.3.2
Properties of task environments
The range of task envirorunents that might arise in AI is obviously vast. We can, however, identify a fairly small number of dimensions along which task environments can be catego rized. These dimensions detennine,
to a large
extent, the appropriate agent design and the
applicability of each of the principal families of techniques for agent implementation. First,
42
Chapter
Agent Type
Perfonnance Measure
Environment
Medical
Healthy patient,
diagnosis system
reduced costs
2.
Intelligent Agents
Actuators
Sensors
Patient, hospital,
Display of
Keyboard entry
staff
questions, tests,
of symptoms,
diagnoses,
findings, patient's
treatments,
answers
referrals
Satellite image
Correct image
Downlink from
Display of scene
Color pixel
analysis system
categorization
orbiting satellite
categorization
arrays
Part-picking
Percentage of
Conveyor belt
Jointed arm and
Camera, joint
robot
parts in correct
with parts; bins
hand
angle sensors
bins
Refinery
Purity, yield,
Refinery,
Valves, pumps,
Temperature,
controller
safety
operators
heaters, displays
pressure, chemical sensors
Interactive
Student's score
Set of students,
Display of
English tutor
on test
testing agency
exe.rcisest
Keyboard entry
suggestions, corrections
Figure2.5
Examples of agent types and their PEAS descriptions.
we list the dimensions, then we analyze several task environments to illustrate the ideas. The definitions here are informal; later chapters provide more precise statements and examples of each kind of environment. FULLY OBSERVABLE PJ'.RTIALLY OBSERVABLE
Fully observable vs. partially observable: If an agent's sensors give it access to the complete state of the environment at each point in time, then we say that the task environ ment is fully observable. A task environment is effectively fully observable if the sensors detect all aspects that are relevant to the choice of action; relevance, in turn, depends on the perfonnance measure. Fully observable envirorunents are convenient because the agent need not maintain any internal state to keep track of the world. An envirorunent might be partially observable because of noisy and inaccurate sensors or because parts of the state are simply missing from the sensor data-for example, a vacuum agent with only a local dirt sensor cannot tell whether there is dirt in other squares, and an automated taxi cannot see what other drivers are thinking.
UNOBSERVABLE
If the agent has no sensors at all then the environment is unobserv
able. One might think that in such cases the agent's plight is hopeless, but, as we discuss in Chapter 4, the agent's goals may still be achievable, sometimes with certainty.
SINGLE AGENT MULTIAGENT
Single agent vs. multiagent: The distinction between single-agent and multiagent en-
Section 2.3.
The Nature of Envirorunents
43
vironments may seem simple enough. For example, an agent solving a crossword puzzle by itself is clearly in a single-agent environment, whereas an agent playing chess is in a two agent envirorunent. There are, however, some subtle issues. First, we have described how an entity
may be viewed as an agent, but we have not explained which entities must be viewed
as agents. Does an agent A
(the taxi driver for example) have to treat an object B (another
vehicle) as an agent, or can it be treated merely as an object behaving according to the laws of physics, analogous to waves at the beach or leaves blowing in the wind? The key distinction is whether B's behavior is best described as maximizing a performance measure whose value depends on agent A's behavior. For example, in chess, the opponent entity B is trying to maximize its performance measure, which, by the rules of chess, minimizes agent A's percoMPETITIVE
formance measure. Thus, chess is a competitive multiagent environment. In the taxi-driving envirorunent, on the other hand, avoiding collisions maximizes the performance measure of
cooPERATIVE
all agents, so it is a partially
cooperative multiagent environment. It is also partially com
petitive because, for example, only one car can occupy a parking space. The agent-design problems in multiagent environments are often quite different from those in single-agent en vironments; for example,
communication often emerges as a rational behavior in multiagent environments; in some competitive environments, randomized behavior is rational because it avoids the pitfalls of predictability. DETERNINISTIC srocH>ISTIC
Deterministic vs. stochastic. If the next state of the environment is completely determined by the current state and the action executed by the agent, then we say the environment is deterministic; otherwise, it is stochastic. In principle, an agent need not worry about uncer tainty n i a fully observable, deterministic environment. (In our definition, we ignore uncer tainty that arises purely from the actions of other agents in a multiagent environment; thus,
a game can be deterministic even though each agent may be tmable to predict the actions of the others.) If the environment is partially observable, however, then it could stochastic. Most real situations
are so complex that it is impossible to keep
appear to be
track of all the
unobserved aspects; for practical purposes, they must be treated as stochastic. Taxi driving is clearly stochastic in this sense, because one can never predict the behavior of traffic exactly; moreover, one's tires blow out and one's engine seizes up without warning. The vacuum world as we described it is deterministic, but variations can include stochastic elements such as randomly appearing dirt and an unreliable suction mechanism (Exercise 2.13). We say an UNCERTAIN
envirorunent is
uncertain if it is not fully observable or not deterministic. One final note:
our use of the word "stochastic" generally implies that uncertainty about outcomes is quanNONDETERMINisnc
tified in terms of probabilities; a
nondeterministic envirorunent is one in which actions are
characterized by their possible outcomes, but no probabilities are attached to them. Nonde terministic environment descriptions are usually associated with performance measures that require the agent to succeed for all possible outcomes of its actions. EPISODe sEaUEHTIAL
Episodic vs. sequential: In an episodic task environment, the agent's experience is divided into atomic episodes. In each episode the agent receives a percept and then performs a single action. Crucially, the next episode does not depend on the actions taken in previous episodes.
Many classification tasks are episodic. For example, an agent that has to spot
defective parts on an assembly line bases each decision on the current part, regardless of previous decisions; moreover,
the
current decision doesn't affect whether the next part is
Chapter
44
Intelligent Agents
2.
defective. In sequential environments, on the other hand, the current decision could affect all future decisions.3 Chess and taxi driving are sequential: in both cases, short-tenn actions
can have Jong-tenn
consequences. Episodic environments are much simpler
than sequential
environments because the agent does not need to think ahead.
Static vs. dynamic:
STATIC 0\'NAMIC
If the environment can change while an agent is deliberating, then
we say the environment is dynamic for that agent; otherwise, it is static. Static environments are easy to deal with because the agent need not keep looking at the world while it is deciding on an action, nor need
H worry about the
passage of time. Dynamic environments, on the
other hand, are continuously asking the agent what it wants to do; if it hasn't decided yet,
that counts as deciding to do nothing.
If the environment itself does not change with the
passage of time but the agent's performance score does, then we say the environment is SEMIDI'NAMIC
semidynamic.
Taxi driving is clearly dynamic: the other cars and the taxi itself keep moving
while the driving algorithm dithers about what to do next. Chess, when played with a clock, is semidynarnic. Crossword puzzles are static.
Discrete vs. continuous:
DISCRETE CONTINUOUS
environment, to the way time
The discrete/continuous distinction applies to the state of the
is handled, and to the
percepts and actions of the agent. For
example, the chess environment has a finite number of distinct states (excluding the clock). Chess also has a discrete set of percepts and actions. Taxi driving is a continuous-state and continuous-time problem: the speed and location of the taxi and of the other vehicles sweep through a range of continuous values and do so smoothly over time. Taxi-driving actions are also continuous (steering angles, etc.). Input from digital cameras is discrete, strictly speak ing, but is typically treated as representing continuously varying intensities and locations.
Known vs. unknown:
KNCWN UNKNCWN
Strictly speaking, this distinction refers not to the environment
itself but to the agent's (or designer's) state of knowledge about the "laws of physics" of the environment.
In a known environment, the outcomes (or outcome probabilities if the
environment is stochastic) for all actions
are given.
Obviously, if the environment is unknown,
the agent will have to learn how it works in order to make good decisions.
Note that the
distinction between known and unknown environments is not the same as the one between fully and partially observable environments. to be
partially observable-for example,
It is quite possible for a
known
environment
in solitaire card games, I know the rules but am
still unable to see the cards that have not yet been turned over. Conversely, an
unknown
environment can be fully observable-in a new video game, the screen may show the entire game state but I still don't know what the buttons do until I try them. As one might expect, the hardes[ case is
partially observable, multiagent, stoclu:lstic,
sequential, dynamic, continuous, and unknown. Taxi driving is hard in all these senses, except that for the most part the driver's environment is known. Driving a rented car in a new country with unfamiliar geography and traffic laws is a Jot more exciting. Figure 2.6 lists the properties of a number of familiar environments.
Note that the
answers are not always cut and dried. For example, we describe the part-picking robot as episodic, because it normally considers each part n i isolation. But if one day there is a large 3 The word "sequential" is also used in computer science as the antonym of "parallel." largely unrelated.
The two meanings are
Section 2.3.
The Nature ofEnvirorunents
45
Observable Agents Determn i ist i c
Task Environment Crossword puzzle Chess with a clock
Fully Fully
Episodic
Static
Discrete
Single Deterministic Sequental i Multi Deterministic Sequential
Static Semi
Discrete Discrete
Static Static
Discrete Discrete
Poker Backgammon
Partially
Fully
Multi Multi
Stochastic Stochastic
Sequential Sequential
Taxi driving Medical diagnosis
Partially Partially
Multi Single
Stochastic Stochastic
Sequential Dynamic Continuous Sequential Dynamic Continuous
Image analysis Part-picking robot
Fully Partially
Single Deterministic Single Stochastic
Refinery controller Interact ive English tutor
Partially Partially
Single Multi
Figure2.6
Stochastic Stochastic
Episodic Episodic
Semi Continuous Dynamic Continuous
Sequential Dynamic Continuous Sequential Dynamic Discrete
Examples of task environments and their characteristics.
batch of defective parts, the robot should Jearn from several observations that the distribution of defects has changed, and should modify its behavior for subsequent parts. We have not included a "known/unknown" coltunn because, as explained earlier, this is not strictly a prop erty of the environment. For some environments, such as chess and poker, it is quite easy to supply the agent with full knowledge of the rules, but it is nonetheless interesting to consider how an agent might Jearn to play these games without such knowledge. Several of the answers in the table depend on how the task envirorunent is defined. We
have listed the medical-diagnosis task as single-agent because the disease process in a patient is not profitably modeled as an agent; but a medical-diagnosis system might also have to deal with recalcitrant patients and skeptical staff, so the envirorunent could have a multiagent aspect.. Furthermore, medical diagnosis is episodic if one conceives of the task as selecting a diagnosis given a Jist ofsymptoms; the problem is sequential ifthe task can include proposing a series of tests, evaluating progress over the course of treatment, and so on. Also, many envirorunents are episodic at higher levels than the agent's individual actions. For example, a chess tournament consists of a sequence of games; each game is an episode because (by and large) the contribution of the moves in one game to the agent's overall performance is not affected by the moves in its previous game. On the other hand, decision making within a single game is certainly sequential. The code repository associated with this book (aima.cs.berkeley.edu) includes imple mentations of a number of environments, together with a general-purpose envirorunent simu lator that places one or more agents in
a simulated
environment, observes their behavior over
time, and evaluates them according to a given performance measure. Such experiments are often carried out not for a single environment but for many environments drawn from ENVIAO.iMENT CLASS
vironment class. nm
an en
For example, to evaluate a taxi driver in simulated traffic, we would want to
many simulations with different traffic , lighting, and weather conditions. If we designed
the agent for a single scenario, we might be able to take advantage of specific properties of the pa1ticular case but might not identify a good design for driving in general. For this
Chapter
46 ENVIRONMENT GENERATOR
2.
Intelligent Agents
reason, the code repository also includes an environment generator for each envirorunent class that selects particular environments (with certain likelihoods) in which to run the agent. For example, the vacuum environment generator initializes the dirt pattern and agent location randomly. We are then interested in the agent's average performance over the environment class. A rational agent for a given environment class maximizes this average performance. Exercises 2.8 to 2.13 take you through the process of developing an environment class and evaluating various agents therein.
2.4
THE STRUCTURE OF AGENTS
So far we have talked about agents by describing behavior-the action that is performed after AGENT PROGRAM
any given sequence of percepts. Now we must bite the bullet and talk about how the insides work. The job of AI is to design an agent program that implements the agent fLmction
ARCHITECTURE
the mapping from percepts to actions. We asswne this program will run on some sort of computing device with physical sensors and actuators-we call tllis the architecture:
agent = architecture +program . Obviously, the program we choose has to be one that is appropriate for the architecture. If the program is going to recommend actions like
Walk, the architectw·e had better have legs.
The
architectLLre might be just an ordinary PC, or it might be a robotic car with several onboard computers, cameras, and other sensors. In general, the architecture makes the percepts from the sensors available to the program, nms the program, and feeds the program's action choices to the actuators as they are generated. Most of this book is about designing agent programs, although Chapters 24 and 25 deal directly with the sensors and actuators.
2.4.1
Agent programs
The agent programs that we design in this book all have the same skeleton: they take the
current percept as input from the sensors and return an action to the actuators.4 Notice the difference between the agent program, which takes the current percept as input, and the agent function, which takes the entire percept history. The agent program takes just the current percept as input because nothing more is available from the environment; if the agent's actions need to depend on the entire percept sequence, the agent will have to remember the percepts. We describe the agent programs in the simple pseudocode language that is defined in Appendix B . (The online code repository conta.ins implementations in real programming languages.) For example, Figure 2.7 shows a rather trivial agent program that keeps track of the percept sequence and then uses it to index into a table of actions to decide what to do. The table-an example of which is given for the vacuum world in Figw·e 2.3-represents explicitly the agent function that the agent program embodies. To build a rational agent in 4
There are other choices for the agent program skeleton; for example, we could have the agent programs be coroutines that run asynchronously with d1e environment. Each such coroutine has an input and output port and consists of a loop that rends the input port for percepts and writes actions to the output port.
Section 2.4.
The Structure of Agents
47
ftmction TABLE-DRIVEN-AGENT(pen:ept) returns an action persistent: percepts, a sequence, initially empty table, a table of actions, n i dexed by percept sequences, n i itially fully specified append percept to the end of percepts
action
Figure 3.32
�
3.
Solving Problems by Searching
� � gl� �
The track pieces in a wooden railway set; each is labeled with the number of
copies in the set. Note that curved pieces and "fork" pieces ("switches" or "points") can be flipped over so they can curve n i either direction. Each curve subtends 45 degrees.
3.14
Which of the following are true and which are false? Explain your answers.
a. Depth-first search always expands at least as many nodes as A* search with an admissib. c. d.
e.
ble heuristic. h(n) = 0 is an admissible heuristic for the 8-puzzle. A* is of no use in robotics because percepts, states, and actjons are continuous. Breadth-first search is complete even if zero step costs are allowed. Assume that a rook can move on a chessboard any number of squares in a straight line, vertically or horizontally, but cannot jump over other pieces. Manhattan distance is an admissible heuristic for the problem of moving the rook from square A to square B in the smallest number of moves.
Consider a state space where the start state is number 1 and each state k has two successors: numbers '2k and '2k + 1.
3.15
a. Draw the portion of the state space for states 1 to b. Suppose the goal state is
15.
11. List the order in which nodes will be visited for breadth
first search, depth-limited search with limit 3, and iterative deepening search. c. How well would bidirectional search work on this problem? What is the branching factor in each direction of the bidirectional search? d. Does the answer to (c) suggest a reformulation of the problem that would allow you to solve the problem of getting from state 1 to a given goal state with almost no search? e. Call the action going from k to 2k Left, and the action going to 2k + 1 Right. Can you find an algorithm that outputs the solution to this problem without any search at all? 3.16
A basic wooden railway set contains the pieces shown in Figure 3.32. The task is to
connect these pieces into a railway that has no overlapping tracks and no loose ends where a
train could run off onto the floor. a. Suppose that the pieces fit together exactly with no slack. Give a precise formulation of
the task as a search problem. b. Identify a suitable tminformed search algorithm for this task and explain your choice. c. Explain why removing any one of the "fork" pieces makes the problem tmsolvable.
117
Exercises
d. Give an upper bound on the total size of the state space defined by your formulation.
(Hint:
think about the maximum branching factor for the construction process and the
maximum depth, ignoring the problem of overlapping pieces and loose ends. Begin by pretending that every piece is unique.)
3.17
On page
90, we mentioned iterative lengthening search,
an iterative analog of uni
form cost search. The idea is to use increasing limits on path cost. If a node is generated whose path cost exceeds the current limit, it is immediately discarded. For each new itera tion, the limit is set to the lowest path cost of any node discarded in the previous iteration.
a. Show that this algorithm is optimal for general path costs. b. Consider a uniform tree with branching factor b, solution depth d, and unit step costs. How many iterations will iterative lengthening require?
c. Now consider step costs drawn from the continuous range
[E, 1], where 0 < E < 1 . How
many iterations are required in the worst case?
d. Implement the algorithm and apply it to instances of the 8-puzzle and traveling sales person problems. Compare the algorithm's perfonnance to that of unif01m-cost search, and comment on your results.
3.18
Describe a state space in which iterative deepening search pe1fonns much worse than
depth-first search (for example,
3.19
O(n2) vs. O(n)).
Write a program that will take as input two Web page URLs and find a path of links
from one to the other. What is an appropriate search strategy? Is bidirectional search a good
idea? Could a search engine be used to implement a predecessor function?
3.20
Consider the vacuum-world problem defined in Figure 2.2.
a. Which of the algorithms defined in this chapter would be appropriate for this problem? Should the algorithm use tree search or graph search?
b. Apply your chosen algorithm to compute an optimal sequence of actions for a 3 x 3 world whose initial state has dirt in the three top squares and the agent in the center.
c. Construct a search agent for the vacuum world, and evaluate its performance in a set of
3 x 3 worlds with probability 0.2 of dirt in each square.
Include the search cost as well
as path cost in the performance measure, using a reasonable exchange rate.
d. Compare your best search agent with a simple randomized reflex agent that sucks if there is dirt and otherwise moves randomly.
e. Consider what would happen if the world were enlarged to n x n. How does the per formance of the search a.gent and of the reflex agent vary with n'?
3.21
Prove each of the following statements, or give a cow1terexample:
a. Breadth-first search is a special case of uniform-cost search. b. Depth-first search is a special case of best-first tree search. c. Uniform-cost search is a special case of A* search.
Chapter
118 3.22
3.
Solving Problems by Searching
Compare the pe1formance of A* and RBFS on a set of randomly generated problems
in the 8-puzzle (with Manhattan distance) and TSP (with MST-see Exercise 3.30) domains. Discuss your results. What happens to the performance of RBFS when a small random num ber is added to the hew·istic values in the 8-puzzle domain?
3.23
Trace the operation of A* search applied to the problem of getting to Bucharest from
Lugoj using the straight-line distance heuristic. That is, show the sequence of nodes that the algorithm will consider and the
3.24
Devise a state space in which
with an HEURISTIC PATH ALGORITHM
f, g, and h score for each node.
A* using GRAPH-SEARCH returns a suboptimal solution
h(n) fw1ction that is admissible but inconsistent.
1977) is a best-first search in which the evalu ation function is f(n) = (2 - w)g(n) + wh(n). For what values of w is this complete? For what values is it optimal, assuming that h is admissible? What kind of search does this perform for w = 0, w = 1 , and w = 2'? 3.25
The heuristic
3.26
Consider the unbounded version of the regular 2D grid shown in Figure
state is at the origin, a.
path algorithm (Pohl,
(0,0), and the goal state is at (x, y).
What is the branching factor b in this state space?
b. How many distinct states are there at depth c.
3.9. The start
k (for k > 0)?
What is the maximwn number of nodes expanded by breadth-first tree search?
d. What is the maximum number of nodes expanded by breadth-first graph search? e.
Is
h = iu - xi + lv - Y l
an admissible hew·istic for a state at
(u, v)? Explain.
f. How many nodes are expanded by A* graph search using h? g. Does h remain admissible if some links are removed'? h. Does h remain admissible if some links are added between nonadjacent states?
3.27 n vehicles occupy squares (1, 1) through (n, 1) (i.e., the bottom row) of ann x n grid. The vehicles must be moved to the top row but in reverse order; so the vehicle
i that starts in
(i, 1) must end up in (n- i + 1 , n) . On each time step, every one of the n vehicles can move
one square up, down, left, or right, or stay put; but if a vehicle stays put, one other adjacent vehicle (but not more than one) can hop over it. Two vehicles cannot occupy the same square. a.
Calculate the size of the state space as a function of n.
b. Calculate the branching factor as a fw1ction of n. c.
Suppose that vehicle
i is at (xi, Yi); write a nontrivial admissible heuristic hi for the i + 1 , n), assuming no
number of moves it will require to get to its goal location ( n other vehicles are on the grid.
d. Which of the following heuristics are admissible for the problem of moving all n vehicles to their destinations'? Explain. (i) (ii) (iii)
L::=
hi. max{h1 , . . . , hn} · min {h1 , . . . , hn } · 1
1 19
Exercises 3.28
Invent a heuristic function for the 8-puzzle that sometimes overestimates, and show
how it can lead to a suboptimal solution on a particular problem. (You can use a computer to help if you want.) Prove that if h never overestimates by more than c, A* using h returns a solution whose cost exceeds that of the optimal solution by no more than c. 3.29
Prove that if a heuristic is consistent, it must be admissible. Construct an admissible
heuristic that is not consistent.
I••�
3.30
The traveling salesperson problem (TSP) can be solved with the minimum spanning -
tree (MS1) heuristic, which estimates the cost of completing a tour,
given that a pa rtial tour
has already been constructed. The MST cost of a set of cities is the smallest sum of the link costs of any tree that c01mects all the cities. a.
Show how this heuristic can be derived from a relaxed version of the TSP.
b. Show that the MST heuristic dominates straight-line distance. c.
Write a problem generator for instances of the TSP where cities are represented by random points in the unit square.
d. Find an efficient algorithm in the literature for constructing the MST, and use it with A*
graph search to solve instances of the TSP. On page 105, we defined the relaxation of the 8-puzzle in which a ti'le can move from
3.31
square A to square B if B is blank. The exact solution of this problem defines Gaschnig's
1979). Explain why Gaschnig's heuristic is at least as accurate as h1 (misplaced tiles), and show cases where it is more accurate than both h1 and h2 (Manhattan heuristic (Gaschnig,
distance). Explain how to calculate Gaschnig's heuristic efficiently.
liiii�
3.32
We gave two simple heuristics for the 8-puzzle: Manhattan distance and misplaced
tiles. Several heuristics in the literature purport to improve on this--see, for example, Nils son
(1971), Mostow and Priedi tis (1989), and Hansson eta/. ( 1 992). Test these claims by
implementing the heuristics and comparing the performance of the resulting algorithms.
BEYOND CLASSICAL
4
SEARCH
In which we relax the simplifying assumptions of the previous chapter; thereby getting closer to the real world. Chapter
3 addressed
a single category of problems: observable, deterministic, known envi
ronments where the solution is a sequence of actions. In this chapter, we look at what happens when these assumptions are relaxed. We begin with a fairly simple case: Sections cover algorithms that perfonn purely
4.1 and 4.2
local search in the state space, evaluating and modify
ing one or more current states rather than systematically exploring paths from an initial state. These algorithms are suitable for problems in which all that matters is the solution state, not the path cost to reach it. The family of local search algorithms includes methods inspired by statistical physics
(simulated annealing) and evolutionary biology (genetic algorithms).
Then, in Sections
4.3-4.4, we examine
what happens when we relax the assumptions
ofdeterminism and observability. The key idea is that if an agent cannot predict exactly what percept it will receive, then it will need to consider what to do tmder each
contingency that
its percepts may reveal. With partial observability, the agent will also need to keep track of the states it might be in. Finally, Section
4.5
investigates
online search, in which the agent is faced with a state
space that is initially unknown and must be explored.
4.1
LOCAL SEARCH ALGORITHMS AND OPTIMIZATION PROBLEMS
The search algorithms that we have seen so far are designed to explore search spaces sys tematically. This systematicity is achieved by keeping one or more paths in memory and by recording which alternatives have been explored at each point along the path. When a goal is
the path to that goal also constitutes a solution to the problem. In many problems, how ever, the path to the goal is irrelevant. For example, in the 8-queens problem (see page 71),
found,
what matters is the final configuration of queens, not the order in which they are added. The same general property holds for many important applications such as integrated-circuit de sign, factory-floor layout,job-shop scheduling, automatic programming, telecommunications network optimization, vehicle routing, and portfolio management.
120
Section 4.1.
LOCAL SEARCH CURRENT NODE
OPTIMIZATION PROBLEM OBJECTIVE FUNCTION
STATE-SPliCE LANDSCAPE
GLOBAL MINIMUM GLOBAL MAXIMUM
Local Search Algorithms and Optimization Problems
121
If the path to the goal does not matter, we might consider a different class of algo rithms, ones that do not worry about paths at all. Local search algorithms operate using a single current node (rather than multiple paths) and generally move only to neighbors of that node. Typically, the paths followed by the search are not retained. Although local search algorithms are not systematic, they have two key advantages: (1) they use very little memory-usually a constant ammmt; and (2) they can often find reasonable solutions in large or infinite (continuous) state spaces for which systematic algorithms are unsuitable. In addition to finding goals, local search algorithms are useful for solving pure op timization problems, in which the aim is to find the best state according to an objective function. Many optimization problems do not fit the "standard" search model introduced in Chapter 3. For example, nature provides an objective function-reproductive fitness-that Darwinian evolution could be seen as attempting to optimize, but there is no "goal test" and no "path cost" for this problem. To understand local search, we find it useful to consider the state-space landscape (as in Figure 4.1 ). A landscape has both "location" (defined by the state) and "elevation" (defined by the value of the heuristic cost function or objective function). If elevation corresponds to cost, then the aim is to find the lowest valley-a global minimum; if elevation corresponds to an objective function, then the aim is to find the highest peak-a global maximum. (You can convert from one to the other just by inserting a minus sign.) Local search algorithms explore this landscape. A complete local search algorithm always finds a goal if one exists; an optimal algorithm always finds a global minimum/maximum.
objective func.tion
-- global maximum
shoulder
�
.------
local maximum
'-------1---� state space current state Figure 4.1
A one-dimensional state-space landscape in which elevation corresponds to the
objective function. The aim is to find the global maximum. Hill-climbing search modifies the current state to try to m i prove it, as shown by the arrow. The various topographic features are
defined in the text.
122
Chapter
4.
Beyond Classical Search
function HILL-CLIMBING(pmblem) returns a state that is a local maximum
current ..- MAKE-NODE(problem.lNITIAL-STATE) loop do
neighbor .._ a highest-valued successor of curTent
if neighbor.VALUE ::; current.VALUE then return cun-ent.STATE
cun-ent ....- neighbor
Figure 4.2
The hill-climbing search algorithm, which is the most basic local search tech
nique. At each step the current node is replaced by the best neighbor; in this version, that means the neighbor with the highest VALUE, but if a heuristic cost estimate h is used, we would find the neighbor with the lowest h.
4.1.1
Hill-climbing search
HILLCLIMBING
TI1e
STEEPESTASCENT
simply a loop that continually moves in the direction of increasing value-that is, uphill. It terminates when it reaches a "peak" where no neighbor has a higher value. The algorithm
hill-climbing search algorithm (steepest-ascent version) is shown in Figure 4.2. It is
does not maintain a search tree, so the data structure for the current node need only record the state and the value of the objective function. Hill climbing does not look ahead beyond the immediate neighbors of the current state. This resembles trying to find the top of Mow1t Everest in a thick fog while suffering from amnesia. To illustrate hill climbing, we will use the 8-queens problem introduced on page 71. Local search algorithms typically use a complete-state formulation, where each state has 8 queens on the board, one per column. The successors of a state are all possible states generated by moving a single queen to another square in the same column (so each state has 8 x 7 = 56 successors). The hemistic cost function
h is the number of pairs of queens that
are attacking each other, either directly or indirectly. The global minimum of this function is zero, which occurs only at perfect solutions. Figure 4.3(a) shows a state with h = 17. The figure also shows the values of all its successors, with the best successors having
h = 12.
Hill-climbing algorithms typically choose randomly among the set of best successors if there is more than one. GREEDY LOCAL SEARCH
Hill climbing is sometimes called greedy local search because it grabs a good neighbor state without thinking allead about where to go next. Although greed is considered one of the seven deadly sins, it tums out that greedy algorithms often perform quite well. Hill climbing often makes rapid progress toward a solution because it is usually quite easy to improve a bad state. For example, from the state in Figure 4.3(a), it takes just five steps to reach the state in Figure 4.3(b), which has h = 1 and is very nearly a solution. Unfortunately, hill climbing often gets stuck for the following reasons:
LOCALMAXIMUM
•
Local maxima: a local maximum is a peak that is higher than each of its neighboring states but lower than the global maximum. Hill-climbing algorithms that reach the vicinity of a local maximum will be drawn upward toward the peak but will then be stuck with nowhere else to go. Figure 4.1 illustrates the problem schematically. More
Section 4.1.
Local Search Algorithms and Optimization Problems
123
14 16 14
'if
13
16
13 16
15
�
14
16
16
18
15
'i¥
15
15
15
14
14
� -
14
17
17
� -
18
14
14
14
13
'if 14
17
15
'if
16
� 16
(a)
(b)
(a) An 8-queens state with heuristic cost estimate h = 17, showing the value of
Figure 4.3
h for each possible successor obtained by moving a queen within its column. The best moves (b) A local minimum in the 8-queens state space; the state has h = 1 but every
are marked.
successor has a highercost.
concretely, the state in Figure 4.3(b) is a local maximum (i.e., a local minimum for the cost RIDGE
h); every move of a single queen makes the situation worse.
• Ridges:
a ridge is shown in rigure 4.4. Ridges result in a sequence of local maxima
that is very difficult for greedy algorithms to navigate. PLATEAU SHOULDER
• Plateaux:
a plateau is a flat area of the state-space landscape.
maximum, from which no uphill exit exists, or a
It can be a flat local
shoulder, from
which progress is
possible. (See Figure 4.1.) A hill-climbing search might get lost on the plateau. In each case, the algoritlun reaches a point at which no progress is being made. Stat1ing from a randomly generated 8-queens state, steepest-ascent hill climbing gets stuck 86% of the time, solving only 14% of problem instances. It works quickly, taking just 4 steps on average when 8 it succeeds and 3 when it gets stuck-not bad for a state space with 8 � 17 million states. The algorithm in Figure 4.2 halts if it reaches a plateau where the best successor has the same value as the current state. Might it not be a good idea to keep going-to allow a
SIDEWAYS MOVE
sideways mo,·e n i the hope that the plateau is really a shoulder, as shown in Figure 4.1? The answer is usually yes, but we must take care. If we always allow sideways moves when there are no uphill moves, an infinite loop will occur whenever the algorithm reaches a flat local maximum that is not a shoulder. One common solution is to put a limit on the number of con secutive sideways moves allowed. For example, we could allow up to, say, 100 consecutive sideways moves in the 8-queens problem. This raises the percentage of problem instances solved by hill climbing from 14% to 94%. Success comes at a cost: the algorithm averages roughly 21 steps for each successful instance and 64 for each failure.
124
Chapter
Figure 4.4
4.
Beyond Classical Search
Dlustration of why ridges cause difficulties for hill climbing. The grid of states
(dark circles) is superimposed on a ridge rising from left to right, creating a sequence oflocal maxima that are not directly connected to each other. From each local maximum, all the available actions point downhill. STOCHASriC HILL CLIMBING
Many variants of hill climbing have been invented. Stochastic hill climbing chooses at random from among the uphill moves; the probability of selection can vary with the steepness
FIRST.CHOICE HILL CLIMBING
of the uphill move. This usually converges more slowly than steepest ascent, but in some state landscapes, it finds better solutions.
First-choice hill climbing implements stochastic
hill climbing by generating successors randomly until one is generated that is better than the current state. This is a good strategy when a state has many (e.g., thousands) of successors. The hill-climbing algorithms described so far are incomplete-they often fail to find a goal when one exists because they RANOOM-RESTART HILLCLIMBING
can
get stuck on local maxima.
climbing adopts the well-known adage, "If at first you
Random-restart hill
don't succeed, try, try again." It con
1
ducts a series of hill-climbing searches from randomly generated initial states, until a goal is found. It is trivially complete with probability approaching I , because it will eventually generate a goal state as the initial state. If each hill-climbing search has a probability
p of
1/p. For 8-queens instances with � 0.14, so we need roughly 7 iterations to find a goal (6 fail
success, then the expected number of restarts required is no sideways moves allowed, p
1 success). The expected number of steps is the cost of one successful iteration plus (1 -p)jp times the cost offailure, or roughly 22 steps in all. When we allow sideways moves, 1/0.94 � 1.06 iterations are needed on average and ( 1 x 2 1 ) + (0.06/0.94) x 64 � 25 steps.
ures and
For 8-queens, then, random-restart hill climbing is very effective indeed. Even for three mil 2 lion queens, the approach can find solutions in tmder a minute. 1 Generating a random state from an implicitly specified state space can be a hard problem in itself.
2 Luby et al. ( 1993) prove that it is best, in some cases, to restart a randomized search algorithm after a particular, fixed amount of time and that this can be much more efficient than letting each search continue indefinitely. Disallowing or limiting the number of sideway� moves is an example of this idea.
Section 4.1.
125
Local Search Algorithms and Optimization Problems
The success of hill climbing depends very much on the shape of the state-space land scape: if there are few local maxima and plateaux, random-restatt hill climbing will find a good solution very quickly. On the other hand, many real problems have a landscape that looks more like a widely scattered family of balding porcupines on a flat floor, with miniature porcupines living on the tip of each porcupine needle, ad infinitum. NP-hard problems typi cally have an exponential number of local maxima to get stuck on. Despite this, a reasonably good local maximum can often be found after a smaU number of restarts. 4.1.2
SIMULATED ANNEALING
GRADIENT DESCENT
Simulated annealing
A hill-climbing algorithm that never makes "downhill" moves toward states with lower value (or higher cost) is guaranteed to be incomplete, because it can get stuck on a local maxi mum. In contrast, a purely random walk-that is, moving to a successor chosen uniformly at random from the set of successors-is complete but extremely inefficient. Therefore, it seems reasonable to try to combine hill climbing with a random walk in some way that yields both efficiency and completeness. Simulated annealing is such an algorithm. In metallurgy, annealing is the process used to temper or harden metals and glass by heating them to a high temperature and then gradually cooling them, thus allowing the material to reach a low energy crystalline state. To explain simulated annealing, we switch our point of view from hill climbing to gradient descent (i.e., minimizing cost) and imagine the task of getting a ping-pong ball into the deepest crevice in a bwnpy surface. If we just let the ball roll, it will come to rest at a local minimum. If we shake the swface, we can bounce the ball out of the local minimum. 1l1e trick is to shake just hard enough to bounce the ball out of local min ima but not hard enough to dislodge it from the global minimum. The simulated-annealing solution is to start by shaking hard (i.e., at a high temperature) and then gradually reduce the intensity of the shaking (i.e., lower the temperature). The innermost loop ofthe simulated-annealing algorithm (Figure 4.5) is quite similar to hill climbing. Instead ofpicking the best move, however, it picks a random move. Ifthe move improves the situation, it is always accepted. Otherwise, the algorithm accepts the move with some probability less than 1. The probability decreases exponentially with the "badness" of the move-the amount tJ.E by which the evaluation is worsened. The probability also de creases as the "temperature" T goes down: "bad" moves are more likely to be allowed at the start when T is high, and they become more unlikely as T decreases. If the schedule lowers T slowly enough, the algorithm will find a global optimwn with probability approaching l . Simulated annealing was first used extensively to solve VLSI layout problems in the
early 1980s. It has been applied widely to factory scheduling and other large-scale optimiza tion tasks. In Exercise 4.4, you are asked to compare its performance to that of random-restart hill climbing on the 8-queens puzzle. 4.1.3 LOCAL BEAM SEARCH
Local beam search
Keeping just one node in memory might seem to be an extreme reaction to the problem of memory limitations. The local beam search algorithm3 keeps track of k states rather than 3 Local berun search is an adaptation of beam search, which is a path based algorithm. -
Chapter 4.
126
Beyond Classical Search
function SJMULATED-ANNEALING(problem, schedule) returns a solution state inputs:
pmblem, a problem schedule, a mapping from time to "temperature"
current ..- MAKE-NODE(pmblem.INJTIAL-STATB)
for t = l to oo do
T +- schedule(t) if T = 0 then return current
next..- a randomly selected successor of current f:l.E +- next.VALUE - current. VALUE if f:l.E > 0 then cun-ent ..- next else cun-ent +-next only with probability ei:>.E/T
Figure 4.5
The simulated annealing algorithm, a version of stochastic hill climbing where
some downhill moves are allowed. Downhill moves are accepted readily early in the anneal ing schedule and then less often as time goes on. The schedule input determines the value of the temperature T as a function of time.
STOCHASTIC BEAM SEARCH
just one. It begins with k randomly generated states. At each step, all the successors of a]I k states are generated. If any one is a goal, the algorithm halts. Otherwise, it selects the k best successors from the complete list and repeats. At first sight, a local beam search with k states might seem to be nothing more than mnning k random restarts in parallel instead of in sequence. In fact, the two algorithms are quite different. In a random-restart search, each search process runs independently of the others. In a local beam search, useful information is passed among the parallel search threads. In effect, the states that generate the best successors say to the others, "Come over here, the grass is greener!" The algoritlun quickly abandons unfruitful searches and moves its resources to where the most progress is being made. In its simplest form, local beam search can suffer from a lack of diversity among the k states-they can quickly become concentrated in a small region of the state space, making the search little more than an expensive version of hill climbing. A variant called stochastic beam search, analogous to stochastic hill climbing, helps alleviate this problem. Instead of choosing the best k from the the pool of candidate successors, stochastic beam search chooses k successors at random, with the probability of choosing a given successor being an increasing function of its value. Stochastic beam search bears some resemblance to the process of natural selection, whereby the "successors" (offspring) of a "state" (organism) populate the next generation according to its "value" (fitness). 4.1.4
GENETIC ALGORITHM
Genetic algorithms
i which successor states A genetic algorithm (or GA) is a variant of stochastic beam search n are generated by combining two parent states rather than by modifying a single state. The analogy to natural selection is the same as in stochastic beam search, except that now we are dealing with sexual rather than asexual reproduction.
Section 4.1.
7
Local Search Algorithms and Optimization Problems
24748552 32752411 24415124 1 325432131
327!5 2411 24 7!4 8552 32752�11 24415!124
11
(a)
(b)
(c)
Initial Population
Fitnc.�s Function
Selection
Figure 4.6
12
�1 3274@52 1 247�2411 �I 24752411 1 3275 2l124 1 ·1 32@;2124 �I 2441541rn (d)
Crossover
(e) Mutation
The genelirtion of the plan and using that label later instead of repeating the plan itself. Thus, our cyclic solution is
[Suck, L1 (A
:
Right, if State = 5 then L1 else Suck]
.
better syntax for the looping part of this plan would be
"while State = 5 do Right.")
In general a cyclic plan may be considered a solution provided that every leaf is a goal state and that a leaf is reachable from every point in the plan.
to AND-OR-GRAPH-SEARCH
The modifications needed
are covered in Exercise 4.6. The key realization is that a loop
in the state space back to a state L translates to a loop in the plan back to the point where the subplan for state
L is executed.
Given the definition of a cyclic solution,
an
agent executing such a solution will eventu
ally reach the goal provided that each outcome ofa nondeterministic action eventually occurs. Is this condition reasonable? It depends on the reason for the nondetenninism. If the action rolls a die, then it's reasonable
to suppose
that eventually a six will be rolled. If the action is
to inse1t a hotel card key into the door lock, but it doesn't work the first time, then perhaps it will eventually work, or perhaps one has the wrong key (or the wrong room!). After seven or
Chapter 4.
138
Beyond Classical Search
eight tries, most people will assume the problem is with the key and will go back to the front desk to get a new one. One way to understand this decision is to say that the initial problem formulation (observable, nondetenninistic) is abandoned in favor of a different formulation (pattially observable, deterministic) where the failure is attributed to an unobservable prop etty of the key. We have more to say on this issue in Chapter 13.
4.4
SEARCHING WITH PARTIAL OBSERVATIONS
BELIEF STATE
We now turn to the problem of partial observability, where the agent's percepts do not suf fice to pin down the exact state. As noted at the beginning of the previous section, if the agent is in one of several possible states, then an action may lead to one of several possible outcomes-even if the environment is deterministic. The key concept required for solving partially observable problems is the belief state, representing the agent's current belief about the possible physical states it might be in, given the sequence of actions and percepts up to that point. We begin with the simplest scenario for studying belief states, which is when the agent has no sensors at all; then we add in partial sensing as well as nondeterministic actions. 4.4.1
SENSORLESS CONFORMANT
COERCION
Searching with no observation
When the agent's percepts provide no information at all, we have what is called a sensor less problem or sometimes a conformant problem. At first, one might think the sensorless agent has no hope of solving a problem if it has no idea what state it's in; in fact, sensorless problems are quite often solvable. Moreover, sensorless agents can be surprisingly useful, primarily because they don't rely on sensors working properly. In manufacturing systems, for example, many ingenious methods have been developed for orienting parts correctly from an unknown initial position by using a sequence of actions with no sensing at all. The high cost of sensing is another reason to avoid it: for example, doctors often prescribe a broad spectrum antibiotic rather than using the contingent plan of doing an expensive blood test, then waiting for the results to come back, and then prescribing a more specific antibiotic and perhaps hospitalization because the infection has progressed too far. We can make a sensorless version of the vacuum world. Assume that the agent knows the geography of its world, but doesn't know its location or the distribution of dirt. In that case, its initial state could be any element of the set {1, 2, 3, 4, 5, 6, 7, 8}. Now, consider what happens if it tries the action Right. This will cause it to be in one of the states {2, 4, 6, �}-the agent now has more information! Furthermore, the action sequence [Right,Suck] will always end up in one of the states {4, 8}. Finally, the sequence [Right,Suck,Left,Suck] is guaranteed to reach the goal state 7 no matter what the start state. We say that the agent can coerce the world into state 7. To solve sensorless problems, we search in the space ofbelief states rather than physical states.10 Notice that in belief-state space, the problem is fully observable because the agent 10
In a fully observable environment, each belief state contains one physical state. Thus, we can view the algo rithms in Chapter 3 as searching in a belief-state space of singleton belief states.
Section 4.4.
Searching with Partial Observations
139
always knows its own belief state. Furthermore, the solution (if any) is always a sequence of actions. This is because, as in the ordinary problems of Chapter 3, the percepts received after each action are completely predictable-they're always empty! So there are no contingencies to plan for. This is tme even ifthe environment is nondeterminstic. It is instmctive to see how the belief-state search problem is constmcted. Suppose the underlying physical problem P is defined by ACTIONSp, RESULTp, GOAL-TESTp, and STEP-COST p. Then we can define the corresponding sensorless problem as follows: • Belief states: The entire belief-state space contains every possible set of physical states. If P has N states, then the sensorless problem has up to 2N states, although many may be unreachable from the initial state. • Initial state: Typically the set of all states in P, although in some cases the agent will have more knowledge than this. • Actions: This is slightly tricky. Suppose the agent is in belief state b = {s1,s2}. but ACTIONSp(s1) =f. ACTIONSp(sz); then the agent is unsure of which actions are legal. If we assume that illegal actions have no effect on the environment, then it is safe to take the union of all the actions in any ofthe physical states in the current belief state b: ACTIONS(b) =
u ACTIONSp(s) .
sEb
•
On the other hand, if an illegal action might be the end of the world, it is safer to allow only the intersection, that is, the set of actions legal in all the states. For the vacuum world, every state has the same legal actions, so both methods give the same result. Transition model: The agent doesn't know which state in the belief state is the right one; so as far as it knows, it might get to any of the states resulting from applying the action to one of the physical states in the belief state. For deterministic actions, the set of states that might be reached is b' = RESULT(b,a) = {s1 : s1 = RESULTp(s, a) and s E b} . (4.4) With deterministic actions, 1! is never larger than b. With nondeterminism, we have b' = RESULT(b, a) {s' : s' E RESULTS p( s, a) and s E b} =
U RESULTSp(s, a) , sEb
PREDICTION
•
•
which may be larger than b, as shown in Figure 4.13. The process of generating the new belief state after the action is called the prediction step; the notation b' = PREDICTp(b, a) will come in handy. Goal test: The agent wants a plan that is sure to work, which means that a belief state satisfies the goal only if all the physical states in it satisfy GOAL-TESTp. The agent may accidentally achieve the goal earlier, but it won't know that it has done so. Path cost: This is also tricky. If the same action can have different costs in different states, then the cost of taking an action in a given belief state could be one of several values. (This gives rise to a new class of problems, which we explore in Exercise 4.9.) For now we assume that the cost of an action is the same in all states and so can be transfened directly from the underlying physical problem.
140
Chapter
4.
Beyond Classical Search
(b)
(a)
(a) Predicting the next belief state for the sensorless vacuum world with a deterministic action, Right. (b) Prediction for the same belief state and action in the slippery version ofthe sensorless vacuum world.
Figure 4.13
Figure world.
4.14 shows the reachable belief-state space for the deterministic, sensorless vacuum 8 There are only 12 reachable belief states out of 2 = 256 possible belief states.
The preceding definitions enable the automatic construction of the belief-state problem formulation from the definition of the underlying physical problem. Once this is done, we can apply any of than that.
the search algorithms
of Chapter
3.
In fact, we can do a little bit more
In "ordinary" graph search, newly generated states are tested to see if they are
4.14, the [Suck,Left,Suck] starting at the initial state reaches the same belief state as [Right,Left,Suck], namely, {5, 7}. Now, consider the belief state reached by [Left], namely, {1, 3, 5, 7}. Obviously, this is not identical to {5, 7}, but it is a superset. It is easy to prove (Exercise 4.8) that if an action sequence is a solution for a belief state b, it is also a solution for any subset of b. Hence, we can discard a path reaching {1, 3, 5, 7} if {5, 7} has already been generated. Conversely, if {1, 3, 5, 7} has already been generated and found to be solvable, then any subset, such as {5, 7}, is guaranteed to be solvable. This extra level of pruning may identical to existing states. This works for belief states, too; for example, in Figure action sequence
dramatically improve the efficiency of sensorless problem solving. Even with this improvement, however, sensorless problem-solving as we have described it is seldom feasible in practice. The difficulty is not so much the vastness of the belief-state space-even though it is exponentially larger than the tmderlying physical state space; in most cases the branching factor and solution length in the belief-state space and physical state space are not so different. The real difficulty lies with the size of each belief state. For
10 x 10 vacuum world contains 100 x 2100 or around 32 10 physical states-far too many if we use the atomic representaHon, which is an explicit
example, the initial belief state for the list of states.
One solution is to represent the belief state by some more compact description.
In
English, we could say the agent knows "Nothing" in the initial state; after moving Left, we could say, "Not in the rightmost column;' and so on. Chapter 7 explains how to do tl1is in a formal representation scheme. Another approach is to avoid the standard search algorithms, which treat belief states as black boxes just like any other problem state. Instead, we can look
Section 4.4.
Searching with Partial Observations
141 L
R 2 I.,. 1::0I
I
�
"" "" 3 � "" "" 1� "" "" 4[;E] , 1::01.,. I 3 1�1 I R 2� L - 41.,. 1�1 s l....dll.,. I 6 1 l:fll - 6 1 I::Oi a [E] s l....dll.,. I a [B
7 EU
7EU
4 1.,. IAOI s iAOI.,. I 1 ff] a [E]
s
L
s iAOI.,. I
7EU
R! l
ls
s
t
s iAOI.,. I 31::01 I
7EU
L
s l l:fll a [B
s
ism I�
L
R L
R
s
!R
6 1 l:fl 4 1.,. 1-o1 a[B
s
4 I.,. 1....6)I a [EJ L
·17 EUI
s
! lR
3 1�1 I
7EU
The reachable portion of the belief-state space for the deterministic, sensorless vacuum world. Each shaded box corresponds to a single belief state. At any given point, the agent is in a particular beliefstate but does not know which physical state it is in. The initial belief state (complete ignorance) is the top center box. Actions are represented by labeled links. Self-loops are omitted for clarity. Figure 4.14
INCREMENTAL BELIEF-Sl'ITE SEARCH
inside the belief states and develop incremental belief-state search algoritluns that build up the solution one physical state at a time. For example, in the sensorless vacuum world, the initial belief state is {1, 2, 3,4, 5, 6, 7, 8}, and we have to find an action sequence that works in all 8 states. We can do this by first finding a solution that works for state 1; then we check if it works for state 2; if not, go back and find a different solution for state I , and so on. Just as an AND-OR search has to find a solution for every branch at an AND node, this algoritlun has to find a solution for every state in the belief state; the difference is that AND-OR search can find a different solution for each branch, whereas an incremental belief-state search has to find one solution that works for all the states. The main advantage of the incremental approach is that it is typically able to detect failure quickly-when a belief state is unsolvable, it is usually the case that a small subset of the belief state, consisting of the first few states examined, is also unsolvable. In some cases,
142
Chapter
4.
Beyond Classical Search
this leads to a speedup proportional to the size of the belief states, which may themselves be as large as the physical state space itself. Even the most efficient solution algorithm is not of much use when no solutions exist. Many things just cannot be done without sensing. For example, the sensorless 8-puzzle is impossible. On the other hand, a little bit of sensing can go a long way. For example, every 8-puzzle instance is solvable if just one square is visible-the solution involves moving each tile in twn into the visible square and then keeping track of its location.
4.4.2
Searching with observations
For a general partially observable problem, we have to specify how the environment generates percepts for the agent. For example, we might define the local-sensing vacuum world to be one in wh.ich the agent has a position sensor and a local dirt sensor but has no sensor capable of detecting dirt in other squares. The formal problem specification includes a PERCEPT(s) function that retums the percept received in a given state. (If sensing is nondeterministic, then we use a PERCEPTS function that retums a set of possible percepts.) For example, in the local-sensing vacuum world, the PERCEPT in state 1 is [A, Dirty] . Fully observable problems are a special case in which PERCEPT(s) = s for every state s, while sensorless problems are a special case in which PERCEPT(s) = null. When observations are partial, it will usually be the case that several states could have produced any given percept. For example, the percept [A, Dirty] is produced by state 3 as well as by state 1. Hence, given this as the initial percept, the initial belief state for the local-sensing vacuum world will be {1, 3}. The ACTIONS, STEP-COST, and GOAL-TEST are constructed from the w1derlying physical problem just as for sensorless problems, but the transition model is a bit more complicated. We can think of transitions from one belief state to the next for a particular action as occurring in three stages, as shown in Figure 4.15:
• The prediction stage is the same as for sensorless problems: given the action a in belief state b, the predicted belief state is b = PREDICT(b, a). 11 • The observation prediction stage determines the set of percepts o that could be ob served in the predicted belief state:
POSSIBLE-PERCEPTS ( b) = {o : o = PERCEPT(s) and s E b} .
• The update stage determines, for each possible percept, the belief state that would result from the percept. The new belief state b0 is just the set of states in
b that could
have produced the percept:
b0 = UPDATE(b,o) = {s : o = PERCEPT(s) and s E b} . Notice that each updated belief state b0 can be no larger than the predicted belief state b; observations can only help reduce uncertainty compared to the sensorless case. More over, for deterministic sensing, the belief states for the different possible percepts will be disjoint, fotming a partition of the original predicted belief state. 11
Here, and throughout the book, the "hot" in b means an estimated or predicted value for b.
Section 4.4.
Searching with Partial Observations
143
@
(a)
@
@
(b)
Two example of transitions in local-sensing vacuum worlds. (a) In the de terministic world, Right is applied in the initial belief state, resulting in a new belief state Figure 4.15
with two possible physical states; for those states, the possible percepts are [B, Dirty] and [B, Clean:, leading to two belief states, each of which is a singleton. (b) In the slippery world, Right is applied in the initial belief state, giving a new belief state with four physi cal states; for those states, the possible percepts are [A, Dirty], [B, Dirty], and [B, Clean], leading to three belief states as shown.
Putting these three stages together, we obtain the possible belief states resulting from a given action and the subsequent possible percepts:
RESULTS(b,a)
=
{bo : b0
UPDATE(PREDICT(b, a),o) and
o E POSSIBLE-PERCEPTS(PREDICT(b,a))} .
(4.5)
Again, the nondeterminism in the partially observable problem comes from the inability to predict exactly which percept will be .received after acting; underlying nondeterminism in the physical environment may contribute to this inability by enlarging the belief state at the prediction stage, leading to more percepts at the observation stage.
4.4.3
Solving partially observable problems
The preceding section showed how to derive the RESULTS function for a nondeterministic belief-state problem from an underlying physical problem and the PERCEPT function. Given
Chapter
144
4.
Beyond Classical Search
The first level of the AND-OR search tree for a problem in the local-sensing vacuum world; Suck is the first step of the solution.
Figure 4.16
such a formulation, the AND-OR search algorithm of Figure 4.11 can be applied directly to derive a solution. Figure 4.16 shows part of the search tree for the local-sensing vacuum world, assuming an initial percept [A, Dirty]. The solution is the conditional plan
[Suck,Right,if Bstate = {6} then Suck else []] . Notice that, because we supplied a belief-state problem to the AND-OR search algorithm, it returned a conditional plan that tests the belief state rather than the actual state. This is as it should be: in a partially observable environment the agent won't be able to execute a solution that requires testing the actual state. As in the case of standard search algoritluns applied to sensorless problems, the AND OR search algorithm treats belief states as black boxes, just like any other states. One can improve on this by checking for previously generated belief states that are subsets or supersets of the cun·ent state, just as for sensorless problems. One can also derive incremental search algorithms, analogous to those described for sensorless problems, that provide substantial speedups over the black-box approach.
4.4.4 TI1e
An agent for partiaUy observable environments
design of a problem-solving agent for partially observable environments is quite similar to the simple problem-solving agent in Figure 3.1: the agent formulates a problem, calls a search algorithm (such as AND-OR-GRAPH-SEARCH) to solve it, and executes the solution. There are two main differences. First, the solution to a problem will be a conditional plan rather than a sequence; if the first step is an if-then-else expression, the agent will need to test the condition in the if-pa1t and execute the then-part or the else-part accordingly. Second, the agent will need to maintain its belief state as it performs actions and receives percepts. This process resembles the prediction-observation-update process in Equation (4.5) but is actually simpler because the percept is given by the environment rather than calculated by the
Section 4.4.
Searching with Partial Observations
Figure 4.17
145
Two prediction-update cycles of belief-state maintenance in the kindergarten
vacuum world with local sensing.
agent. Given an n i itial belief state b, an action a, and a percept o, the new belief state is:
b1 - UPDATE(PREDICT(b, a), o)
MONITORING FILTERING STATE ESTIMATION RECURSIVE
LOCALIZATION
.
(4.6)
Figure 4.17 shows the belief state being maintained in the kindergarten vacuum world with local sensing, wherein any square may become dirty at any time tmless the agent is actively cleaning it at that moment. 12 In partially observable environments-which n i clude the vast majority of real-world environments-maintaining one's belief state is a core function of any intelligent system. This fw1ction goes under various names, including monitoring, filtering and state estima tion. Equation (4.6) is called a recursive state estimator because it computes the new belief state from the previous one rather than by examining the entire percept sequence. If the agent is not to "fall behind," the computation has to happen as fast as percepts are coming in. As the environment becomes more complex, the exact update computation becomes infeasible and the agent will have to compute an approximate belief state, perhaps focusing on the im plications of the percept for the aspects of the environment that are of current interest. Most work on this problem has been done for stochastic, continuous-state environments with the tools of probability theory, as explained in Chapter 15. Here we will show an example in a discrete envirorunent with detrministic sensors and nondeterministic actions. The example concerns a robot with the task of localization: working out where it is, given a map of the world and a sequence of percepts and actions. Our robot is placed in the maze-like environment of Figure 4.18. The robot is equipped with four sonar sensors that tell whether there is an obstacle--the outer wall or a black square in the figure--in each of the four compass directions. We assume that t.he sensors give perfectly correct data, and that the robot has a correct map of the enviomment. But unfortunately the robot's navigational system is broken, so when it executes a Move action, it moves randomly to one of the adjacent squares. The robot's task is to determine its current location. Suppose the robot has just been switched on, so it does not know where it is. Thus its initial belief state b consists of the set of all locations. The the robot receives the percept 12
The usual apologies to those who are unfamiliar with the effect ofsmall children on the enviro1mtent.
146
Chapter
8
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
8
0
(a) Possible locations of robot after E 1
0
8 0
0
=
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
(b) Possible locations of robot After E1
0
0
0
0
0
0
0
0
0
0
0
0
0
N SW
0
0
Figure 4.18
0
Beyond Classical Search
8
0
G o
0
4.
0
=
0
0
0
N SW, E 2
0
0
0
0
0
=
0
0
0
0
NS
Possible positions of the robot, 0. (a) after one observation E1
=
NSW and
(b) after a second observation E2 = NS. When sensors are noiseless and the transition model is accurate, there are no other possible locations for the robot consistent with this sequence of two observations.
NSW, meaning
there are obstacles to the north, west, and south, and does an update using the equation b0 UPDATE(b), yielding the 4 locations shown in Figure 4. 18(a). You can inspect the maze to see that those are the only four locations that yield the percept NWS. Next the robot executes a Move action, but the result is nondeterministic. The new be lief state, ba PREDICT(bo, Move), contains all the locations that are one step away from the locations in b0• When the second percept, NS, arrives, the robot does UPDATE(ba, NS) and finds that the belief state has collapsed down to the single location shown in Figure 4.18(b). That's the only location that could be the result of =
=
UPDATE(PREDICT(UPDATE(b, NSW), Move), NS) .
With nondetermnistic actions the PREDICT step grows the belief state, but the UPDATE step shrinks it back down-as long as the percepts provide some useful identifying information. Sometimes the percepts don't help much for localization: If there were one or more long east-west corridors, then a robot could receive a long sequence of N S percepts, but never know where in the corridor(s) it was.
Section 4.5.
4.5
Online Search Agents and Unknown Environments
147
ONLINE SEARCH AGENTS AND UNKNOWN ENVIRONMENTS
OFFLINE SEARCH ONLINE SEARCH
So far we have concentrated on agents that use offline search algorithms. They compute a complete solution before setting foot. in the real world and then execute the solution. In
contrast, an online search13 agent interleaves computation and action: first it takes an action,
then it observes the environment and computes the next action. Online search is a good idea in dynamic or semidynamic domains-domains where there is a penalty for sitting around and computing too long. Online search is also helpful in nondeterministic domains because it allows the agent to focus its computational efforts on the contingencies that actually arise rather than those that
might happen but probably won't.
Of cow·se, there is a tradeoff: the
more an agent plans ahead, the less often it will find itself up the creek without a paddle. Online search is a necessary idea for tmknown environments, where the agent does not know what states exist or what its actions do. In this state of ignorance, the agent faces an EXPLORATION PR>IlLEM
exploration problem and must use its actions as experiments in order to learn enough to make deliberation worthwhile.
The canonical example of online search is a robot that is placed in a new building and must explore it to build a map that it can use for getting from
A to B.
Methods for escaping
from labyrinths-required knowledge for aspiring heroes of antiquity-are also examples of online search algorithms. Spatial exploration is not the only form of exploration, however. Consider a newborn baby: it has many possible actions but knows the outcomes of none of them, and it has experienced only a few of the possible states that it can reach. The baby's gradual discovery of how the world works is, in part, an online search process.
4.5.1
Online search problems
An online search problem must be solved by an agent executing actions, rather than by pure computation. We assume a deterministic and fully observable environment (Chapter 17 re laxes these assumptions), but we stipulate that the agent knows only the following: •
ACTIONS (s) , which returns a Jist of actions allowed in state s;
•
The step-cost function c(s, a, s')-note that this cannot be used until the agent !mows that s' is the outcome; and
•
GOAL-TEST (s).
Note in particular that
the agent cannot determine RESULT (s,a)
except by actually being
in s and doing a. For example, in the maze problem shown in Figure 4.19, the agent does not know that going Up from (1,1) leads to (1,2); nor, having done that, does it know that going Down will take it back to (1,1 ). This degree of ignorance can be reduced in some applications-for example, a robot explorer might know how its movement actions work and be ignorant only of the locations of obstacles. 13 The term "online" is commonly used in computer science to refer to algoritlmts that must process input data as tltey are received ratlter than waiting for the entire input data set to become available.
Chapter
148
3
4.
Beyond Classical Search
G
2
1
s 1
2
3
A simple maze problem. The agent starts at S and must reach G but knows Figure 4.19 nothing of the environment.
G
s
(a)
(b)
Figure 4.20 (a) Two state spaces that might lead an online search agent n i to a dead end. Any given agent will fail n i at least one of these spaces. (b) A two-dimensional environment that can cause an online search agent to follow an arbitrarily inefficient route to the goal. Whichever choice the agent makes, the adversary blocks that route with another long, thin wall, so that the path followed is much longer than the best possible path.
Finally, the agent might have access to an admissible heuristic function
h(s)
that es
timates the distance from the current state to a goal state. For example, in Figure 4.19, the agent might know the location of the goal and be able to use the Manhattan-distance heuristic. Typically, the agent's objective is to reach a goal state while minimizing cost. (Another possible objective is simply to explore the entire environment.) The cost is the total path cost of the path that the agent actually travels. It is common to compare this cost with the path cost of the path the agent would follow if it knew
the search space in advance-that
is, the
actual shortest path (or shortest complete exploration). In the language of online algorithms, coMPETITIVE RATIO
this is called the competitive ratio; we would like it to be as small as possible.
Section 4.5.
IRREVEfiSIBLE
DEAD END
AD.'ERSARY ARGUM:NT
SAFELYEXPLORABLE
Online Search Agents and Unknown Environments
149
Although this sounds like a reasonable request, it is easy to see that the best achievable competitive mtio is infinite in some cases. For example, if some actions are irreversible i.e., they lead to a state from which no action leads back to the previous state-the online search might accidentally reach a dead-end state from which no goal state is reachable. Per haps the tenn "accidentally" is unconvincing-after all, there might be an algorithm that happens not to take the dead-end path as it explores. Our claim, to be more precise, is that no algorithm can avoid dead ends in all state spaces. Consider the two dead-end state spaces in Figure 4.20(a). To an online search algorithm that has visited states S and A, the two state spaces look identical, so it must make the same decision in both. Therefore, it will fail in one of them. This is an example of an adversary argument-we can imagine an adversary constructing the state space while the agent explores it and putting the goals and dead ends wherever it chooses. Dead ends are a real difficulty for robot explomtion-staircases, mmps, cliffs, one-way streets, and all kinds of nattu·al terrain present opportunities for irreversible actions. To make progress, we simply assume that the state space is safely explorable--that is, some goal state is reachable from every reachable state. State spaces with reversible actions, such as mazes and 8-puzzles, can be viewed as undirected graphs and are clearly safely explorable. Even in safely explorable environments, no bounded competitive ratio can be guaran teed if there are paths of unbounded cost. This is easy to show in environments with irre versible actions, but in fact it remains true for the reversible case as well, as Figw·e 4.20(b) shows. For this reason, it is common to describe the perfonnance of online search algorithms in terms of the size of the entire state space rather than just the depth of the shaUowest goal. 4.5.2
Online search agents
After each action, an online agent receives a percept telling it what state it has reached; from this information, it can augment its map of the environment. The current map is used to decide where to go next. This interleaving of planning and action means that online search algorithms are quite different from the offline search algorithms we have seen previously. For example, offline algorithms such as A* can expand a node in one pa1t of the space and then immediately expand a node in another part of the space, because node expansion involves simulated rather than real actions. An online algorithm, on the other hand, can discover successors only for a node that it physicaUy occupies. To avoid traveling all the way across the tree to expand the next node, it seems better to expand nodes in a local order. Depth-first search has exactly this property because (except when backtracking) the next node expanded is a child of the previous node expanded. An online depth-first search agent is shown in Figure 4.21. This agent stores its map in a table, RESULT[s, a], that records the state resulting from executing action a in state s. Whenever an action from the current state has not been explored, the agent tries that action. The difficulty comes when the agent has tried all the actions in a state. In offline depth-first search, the state is simply dropped from the queue; in an online search, the agent has to backtrack physically. In depth-first search, this means going back to the state from which the agent most recently entered the current state. To achieve that, the algorithm keeps a table that
150
Chapter
4.
Beyond Classical Search
function ONLINE-DFS-AGENT(s') returns an action
  inputs: s', a percept that identifies the current state
  persistent: result, a table indexed by state and action, initially empty
              untried, a table that lists, for each state, the actions not yet tried
              unbacktracked, a table that lists, for each state, the backtracks not yet tried
              s, a, the previous state and action, initially null

  if GOAL-TEST(s') then return stop
  if s' is a new state (not in untried) then untried[s'] <- ACTIONS(s')
  if s is not null then
      result[s, a] <- s'
      add s to the front of unbacktracked[s']
  if untried[s'] is empty then
      if unbacktracked[s'] is empty then return stop
      else a <- an action b such that result[s', b] = POP(unbacktracked[s'])
  else a <- POP(untried[s'])
  s <- s'
  return a

Figure 4.21 An online search agent that uses depth-first exploration. The agent is applicable only in state spaces in which every action can be "undone" by some other action.
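To make the bookkeeping concrete, here is a minimal Python sketch of the same idea (an illustration, not the book's repository code). The agent is handed an actions_fn(state) function and a goal_test(state) predicate, which are assumed interfaces named here for convenience; as in the pseudocode, it is only applicable when every action can be undone by some other action.

import collections

class OnlineDFSAgent:
    """Online depth-first exploration; a sketch under the assumptions above."""

    def __init__(self, actions_fn, goal_test):
        self.actions_fn = actions_fn      # actions_fn(state) -> iterable of actions
        self.goal_test = goal_test        # goal_test(state) -> bool
        self.result = {}                  # (state, action) -> observed next state
        self.untried = {}                 # state -> actions not yet tried there
        self.unbacktracked = {}           # state -> predecessors not yet backtracked to
        self.s = None                     # previous state
        self.a = None                     # previous action

    def __call__(self, s1):
        """s1 is the percept identifying the current state; returns an action, or None to stop."""
        if self.goal_test(s1):
            return None
        if s1 not in self.untried:
            self.untried[s1] = list(self.actions_fn(s1))
        if self.s is not None:
            self.result[(self.s, self.a)] = s1
            self.unbacktracked.setdefault(s1, []).insert(0, self.s)
        if not self.untried[s1]:
            if not self.unbacktracked.get(s1):
                return None               # nothing left to try and nowhere to backtrack
            back = self.unbacktracked[s1].pop(0)
            # choose an action already known to lead from s1 back to `back`;
            # all actions from s1 have been tried, so result[] contains them all
            self.a = next(b for b in self.actions_fn(s1)
                          if self.result.get((s1, b)) == back)
        else:
            self.a = self.untried[s1].pop()
        self.s = s1
        return self.a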
Consider the three-player game tree shown in Figure 5.4, and suppose play has reached the node marked X, where it is C's turn to move. C must decide what to do. The two choices lead
to terminal states with utility vectors (vA = 1, vB = 2, vC = 6) and (vA = 4, vB = 2, vC = 3). Since 6 is bigger than 3, C should choose the first move. This means that if state X is reached, subsequent play will lead to a terminal state with utilities (vA = 1, vB = 2, vC = 6). Hence, the backed-up value of X is this vector. The backed-up value of a node n is always the utility vector of the successor state with the highest value for the player choosing at n.
function MINIMAX-DECISION(state) returns an action
  return argmax_{a in ACTIONS(state)} MIN-VALUE(RESULT(state, a))

function MAX-VALUE(state) returns a utility value
  if TERMINAL-TEST(state) then return UTILITY(state)
  v <- -infinity
  for each a in ACTIONS(state) do
    v <- MAX(v, MIN-VALUE(RESULT(state, a)))
  return v

function MIN-VALUE(state) returns a utility value
  if TERMINAL-TEST(state) then return UTILITY(state)
  v <- +infinity
  for each a in ACTIONS(state) do
    v <- MIN(v, MAX-VALUE(RESULT(state, a)))
  return v

Figure 5.3 An algorithm for calculating minimax decisions. It returns the action corresponding to the best possible move, that is, the move that leads to the outcome with the best utility, under the assumption that the opponent plays to minimize utility. The functions MAX-VALUE and MIN-VALUE go through the whole game tree, all the way to the leaves, to determine the backed-up value of a state. The notation argmax_{a in S} f(a) computes the element a of set S that has the maximum value of f(a).
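The recursion in Figure 5.3 can be rendered directly in Python. The sketch below is illustrative, not the book's code; it assumes a game object with actions, result, terminal_test, and utility methods, names chosen here for convenience.

import math

def minimax_decision(state, game):
    """Return the action leading to the best minimax value for MAX."""
    return max(game.actions(state),
               key=lambda a: min_value(game.result(state, a), game))

def max_value(state, game):
    if game.terminal_test(state):
        return game.utility(state)
    v = -math.inf
    for a in game.actions(state):
        v = max(v, min_value(game.result(state, a), game))
    return v

def min_value(state, game):
    if game.terminal_test(state):
        return game.utility(state)
    v = math.inf
    for a in game.actions(state):
        v = min(v, max_value(game.result(state, a), game))
    return v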
Figure 5.4 The first three plies of a game tree with three players (A, B, C). Each node is labeled with values from the viewpoint of each player. The best move is marked at the root. Leaf utility vectors, from left to right: (1,2,6), (4,2,3), (6,1,2), (7,4,1), (5,1,1), (1,5,2), (7,7,1), (5,4,5).
Anyone who plays multiplayer games, such as Diplomacy, quickly becomes aware that much more is going on than in two-player games. Multiplayer games usually involve alliances, whether formal or informal, among the players. Alliances are made and broken as the game proceeds. How are we to understand such behavior? Are alliances a natural consequence of optimal strategies for each player in a multiplayer game? It turns out that they can be. For example,
suppose A and B are in weak positions and C is in a stronger position. Then it is often optimal for both A and B to attack C rather than each other, lest C destroy each of them individually. In this way, collaboration emerges from purely selfish behavior. Of course, as soon as C weakens under the joint onslaught, the alliance loses its value, and either A or B could violate the agreement. In some cases, explicit alliances merely make concrete what would have happened anyway. In other cases, a social stigma attaches to breaking an alliance, so players must balance the immediate advantage of breaking an alliance against the long-term disadvantage of being perceived as untrustworthy. See Section 17.5 for more on these complications.

If the game is not zero-sum, then collaboration can also occur with just two players. Suppose, for example, that there is a terminal state with utilities (vA = 1000, vB = 1000) and that 1000 is the highest possible utility for each player. Then the optimal strategy is for both players to do everything possible to reach this state -- that is, the players will automatically cooperate to achieve a mutually desirable goal.
5.3 ALPHA-BETA PRUNING
The problem with minimax search is that the number of game states it has to examine is exponential in the depth of the tree. Unfortunately, we can't eliminate the exponent, but it turns out we can effectively cut it in half. The trick is that it is possible to compute the correct minimax decision without looking at every node in the game tree. That is, we can borrow the idea of pruning from Chapter 3 to eliminate large parts of the tree from consideration. The particular technique we examine is called alpha-beta pruning. When applied to a standard minimax tree, it returns the same move as minimax would, but prunes away branches that cannot possibly influence the final decision.

Consider again the two-ply game tree from Figure 5.2. Let's go through the calculation of the optimal decision once more, this time paying careful attention to what we know at each point in the process. The steps are explained in Figure 5.5. The outcome is that we can identify the minimax decision without ever evaluating two of the leaf nodes.

Another way to look at this is as a simplification of the formula for MINIMAX. Let the two unevaluated successors of node C in Figure 5.5 have values x and y. Then the value of the root node is given by

  MINIMAX(root) = max(min(3, 12, 8), min(2, x, y), min(14, 5, 2))
                = max(3, min(2, x, y), 2)
                = max(3, z, 2)   where z = min(2, x, y) <= 2
                = 3.

In other words, the value of the root and hence the minimax decision are independent of the values of the pruned leaves x and y.

Alpha-beta pruning can be applied to trees of any depth, and it is often possible to prune entire subtrees rather than just leaves. The general principle is this: consider a node n
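For readers who want to see the pruning in executable form, here is a Python sketch of alpha-beta search (again an illustration under the same assumed game interface as the minimax sketch above, not the book's code).

import math

def alphabeta_search(state, game):
    """Return the minimax action, pruning branches that cannot affect the decision."""
    best_a, best_v = None, -math.inf
    for a in game.actions(state):
        v = ab_min(game.result(state, a), game, best_v, math.inf)
        if v > best_v:
            best_v, best_a = v, a
    return best_a

def ab_max(state, game, alpha, beta):
    if game.terminal_test(state):
        return game.utility(state)
    v = -math.inf
    for a in game.actions(state):
        v = max(v, ab_min(game.result(state, a), game, alpha, beta))
        if v >= beta:            # MIN already has a better option elsewhere: prune
            return v
        alpha = max(alpha, v)
    return v

def ab_min(state, game, alpha, beta):
    if game.terminal_test(state):
        return game.utility(state)
    v = math.inf
    for a in game.actions(state):
        v = min(v, ab_max(game.result(state, a), game, alpha, beta))
        if v <= alpha:           # MAX already has a better option elsewhere: prune
            return v
        beta = min(beta, v)
    return v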
If the domain of Xi is revised, then we add to the queue all arcs (Xk, Xi) where Xk is a neighbor of Xi. We need to do that because the change in Di might enable further reductions in the domains of Dk, even if we have previously considered Xk. If Di is revised down to nothing, then we know the whole CSP has no consistent solution, and AC-3 can immediately return failure. Otherwise, we keep checking, trying to remove values from the domains of variables until no more arcs are in the queue. At that point, we are left with a CSP that is equivalent to the original CSP -- they both have the same solutions -- but the arc-consistent CSP will in most cases be faster to search because its variables have smaller domains.

The complexity of AC-3 can be analyzed as follows. Assume a CSP with n variables, each with domain size at most d, and with c binary constraints (arcs). Each arc (Xk, Xi) can be inserted in the queue only d times because Xi has at most d values to delete. Checking
consistency of an arc can be done in O(d²) time, so we get O(cd³) total worst-case time.
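A compact Python rendering may make the queue discipline clearer. This is a sketch under assumed data structures -- domains maps each variable to a set of values, neighbors maps each variable to the variables it shares a constraint with, and consistent(xi, x, xk, y) tests one binary constraint; none of these names come from the book.

from collections import deque

def ac3(domains, neighbors, consistent):
    """Enforce arc consistency; return False if some domain becomes empty."""
    queue = deque((xi, xk) for xi in domains for xk in neighbors[xi])
    while queue:
        xi, xk = queue.popleft()
        if revise(domains, consistent, xi, xk):
            if not domains[xi]:
                return False                 # no consistent solution exists
            for xj in neighbors[xi]:
                if xj != xk:
                    queue.append((xj, xi))   # re-examine arcs pointing into Xi
    return True

def revise(domains, consistent, xi, xk):
    """Delete values of Xi that no value of Xk can satisfy; report any change."""
    revised = False
    for x in set(domains[xi]):
        if not any(consistent(xi, x, xk, y) for y in domains[xk]):
            domains[xi].discard(x)
            revised = True
    return revised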
symbol is sometimes written in other books as ⊃ or →. The last of our connectives, ⇔, means "if and only if." The sentence W1,3 ⇔ ¬W2,2 is a biconditional. Some other books write this as ≡.
Sentence        →  AtomicSentence | ComplexSentence
AtomicSentence  →  True | False | P | Q | R | ...
ComplexSentence →  ( Sentence ) | [ Sentence ]
                |  ¬ Sentence
                |  Sentence ∧ Sentence
                |  Sentence ∨ Sentence
                |  Sentence ⇒ Sentence
                |  Sentence ⇔ Sentence

OPERATOR PRECEDENCE : ¬, ∧, ∨, ⇒, ⇔

Figure 7.7 A BNF (Backus-Naur Form) grammar of sentences in propositional logic, along with operator precedences, from highest to lowest.
Figure 7.7 gives a formal grammar of propositional logic; see page 1060 if you are not familiar with the BNF notation. The BNF grammar by itself is ambiguous; a sentence with several operators can be parsed by the grammar in multiple ways. To eliminate the ambiguity we define a precedence for each operator. The "not" operator (¬) has the highest precedence, which means that in the sentence ¬A ∧ B the ¬ binds most tightly, giving us the equivalent of (¬A) ∧ B rather than ¬(A ∧ B). (The notation for ordinary arithmetic is the same: -2 + 4 is 2, not -6.) When in doubt, use parentheses to make sure of the right interpretation. Square brackets mean the same thing as parentheses; the choice of square brackets or parentheses is solely to make it easier for a human to read a sentence.

7.4.2 Semantics
Having specified the syntax of propositional logic, we now specify its semantics. The semantics defines the rules for determining the truth of a sentence with respect to a particular model. In propositional logic, a model simply fixes the truth value -- true or false -- for every proposition symbol. For example, if the sentences in the knowledge base make use of the proposition symbols P1,2, P2,2, and P3,1, then one possible model is

    m1 = {P1,2 = false, P2,2 = false, P3,1 = true} .

With three proposition symbols, there are 2³ = 8 possible models -- exactly those depicted in Figure 7.5. Notice, however, that the models are purely mathematical objects with no necessary connection to wumpus worlds. P1,2 is just a symbol; it might mean "there is a pit in [1,2]" or "I'm in Paris today and tomorrow."

The semantics for propositional logic must specify how to compute the truth value of any sentence, given a model. This is done recursively. All sentences are constructed from
atomic sentences and the five connectives; therefore, we need to specify how to compute the truth of atomic sentences and how to compute the truth of sentences formed with each of the five connectives. Atomic sentences are easy:

• True is true in every model and False is false in every model.
• The truth value of every other proposition symbol must be specified directly in the model. For example, in the model m1 given earlier, P1,2 is false.
For complex sentences, we have five rules, which hold for any subsentences P and Q in any model m (here "iff" means "if and only if"):

• ¬P is true iff P is false in m.
• P ∧ Q is true iff both P and Q are true in m.
• P ∨ Q is true iff either P or Q is true in m.
• P ⇒ Q is true unless P is true and Q is false in m.
• P ⇔ Q is true iff P and Q are both true or both false in m.
The rules can also be expressed with truth tables that specify the truth value of a complex sentence for each possible assignment of truth values to its components. Truth tables for the five connectives are given in Figure 7.8. From these tables, the truth value of any sentence s can be computed with respect to any model m by a simple recursive evaluation. For example,
P      Q      ¬P     P ∧ Q   P ∨ Q   P ⇒ Q   P ⇔ Q
false  false  true   false   false   true    true
false  true   true   false   true    true    false
true   false  false  false   true    false   false
true   true   false  true    true    true    true

Figure 7.8 Truth tables for the five logical connectives. To use the table to compute, for example, the value of P ∨ Q when P is true and Q is false, first look on the left for the row where P is true and Q is false (the third row). Then look in that row under the P ∨ Q column to see the result: true.
the sentence ¬P1,2 ∧ (P2,2 ∨ P3,1), evaluated in m1, gives true ∧ (false ∨ true) = true ∧ true = true. Exercise 7.3 asks you to write the algorithm PL-TRUE?(s, m), which computes the truth value of a propositional logic sentence s in a model m.
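The recursive evaluation can be written in a few lines of Python. The sketch below is one possible answer to the exercise, not the book's official code; it assumes sentences are represented as nested tuples such as ('and', P, Q) and ('not', P), with strings for proposition symbols.

def pl_true(sentence, model):
    """Recursively evaluate a propositional sentence in a complete model."""
    if sentence is True or sentence is False:
        return sentence
    if isinstance(sentence, str):              # a proposition symbol
        return model[sentence]
    op, *args = sentence
    if op == 'not':
        return not pl_true(args[0], model)
    if op == 'and':
        return all(pl_true(s, model) for s in args)
    if op == 'or':
        return any(pl_true(s, model) for s in args)
    if op == 'implies':
        return (not pl_true(args[0], model)) or pl_true(args[1], model)
    if op == 'iff':
        return pl_true(args[0], model) == pl_true(args[1], model)
    raise ValueError("unknown connective: %r" % op)

# With m1 = {'P12': False, 'P22': False, 'P31': True}, the evaluation above becomes
# pl_true(('and', ('not', 'P12'), ('or', 'P22', 'P31')), m1), which returns True.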
The truth tables for "and," "or," and "not" are in close accord with our intuitions about the English words. The main point of possible confusion is that P ∨ Q is true when P is true or Q is true or both. A different connective, called "exclusive or" ("xor" for short), yields false when both disjuncts are true. (Latin has a separate word, aut, for exclusive or.) There is no consensus on the symbol for exclusive or; some choices are ⊻ or ≠ or ⊕.

The truth table for ⇒ may not quite fit one's intuitive understanding of "P implies Q" or "if P then Q." For one thing, propositional logic does not require any relation of causation or relevance between P and Q. The sentence "5 is odd implies Tokyo is the capital of Japan" is a true sentence of propositional logic (under the normal interpretation), even though it is a decidedly odd sentence of English. Another point of confusion is that any implication is true whenever its antecedent is false. For example, "5 is even implies Sam is smart" is true, regardless of whether Sam is smart. This seems bizarre, but it makes sense if you think of "P ⇒ Q" as saying, "If P is true, then I am claiming that Q is true. Otherwise I am making no claim." The only way for this sentence to be false is if P is true but Q is false.

The biconditional, P ⇔ Q, is true whenever both P ⇒ Q and Q ⇒ P are true. In English, this is often written as "P if and only if Q." Many of the rules of the wumpus world are best written using ⇔. For example, a square is breezy if a neighboring square has a pit, and a square is breezy only if a neighboring square has a pit. So we need a biconditional,
    B1,1 ⇔ (P1,2 ∨ P2,1) ,

where B1,1 means that there is a breeze in [1,1].
7.4.3 A simple knowledge base

Now that we have defined the semantics for propositional logic, we can construct a knowledge base for the wumpus world. We focus first on the immutable aspects of the wumpus world, leaving the mutable aspects for a later section. For now, we need the following symbols for each [x, y] location:
Px,y is true if there is a pit in [x, y].
Wx,y is true if there is a wumpus in [x, y], dead or alive.
Bx,y is true if the agent perceives a breeze in [x, y].
Sx,y is true if the agent perceives a stench in [x, y].

The sentences we write will suffice to derive ¬P1,2 (there is no pit in [1,2]), as was done informally in Section 7.3. We label each sentence Ri so that we can refer to them:

• There is no pit in [1,1]:
    R1 :  ¬P1,1 .

• A square is breezy if and only if there is a pit in a neighboring square. This has to be stated for each square; for now, we include just the relevant squares:

    R2 :  B1,1 ⇔ (P1,2 ∨ P2,1) .
    R3 :  B2,1 ⇔ (P1,1 ∨ P2,2 ∨ P3,1) .

• The preceding sentences are true in all wumpus worlds. Now we include the breeze percepts for the first two squares visited in the specific world the agent is in, leading up to the situation in Figure 7.3(b):

    R4 :  ¬B1,1 .
    R5 :  B2,1 .

7.4.4 A simple inference procedure
Our goal now is to decide whether KB ⊨ α for some sentence α. For example, is ¬P1,2 entailed by our KB? Our first algorithm for inference is a model-checking approach that is a direct implementation of the definition of entailment: enumerate the models, and check that α is true in every model in which KB is true. Models are assignments of true or false to every proposition symbol. Returning to our wumpus-world example, the relevant proposition symbols are B1,1, B2,1, P1,1, P1,2, P2,1, P2,2, and P3,1. With seven symbols, there are 2⁷ = 128 possible models; in three of these, KB is true (Figure 7.9). In those three models, ¬P1,2 is true, hence there is no pit in [1,2]. On the other hand, P2,2 is true in two of the three models and false in one, so we cannot yet tell whether there is a pit in [2,2]. Figure 7.9 reproduces in a more precise form the reasoning illustrated in Figure 7.5.

A general algorithm for deciding entailment in propositional logic is shown in Figure 7.10. Like the BACKTRACKING-SEARCH algorithm on page 215, TT-ENTAILS? performs a recursive enumeration of a finite space of assignments to symbols. The algorithm is sound because it implements directly the definition of entailment, and complete because it works for any KB and α and always terminates -- there are only finitely many models to examine. Of course, "finitely many" is not always the same as "few." If KB and α contain n symbols in all, then there are 2ⁿ models. Thus, the time complexity of the algorithm is O(2ⁿ). (The space complexity is only O(n) because the enumeration is depth-first.) Later in this chapter we show algorithms that are much more efficient in many cases. Unfortunately, propositional entailment is co-NP-complete (i.e., probably no easier than NP-complete -- see Appendix A), so every known inference algorithm for propositional logic has a worst-case complexity that is exponential in the size of the input.
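The enumeration approach can be sketched in Python on top of the pl_true function given earlier (again an illustration, not the book's code); the list of proposition symbols is passed in explicitly, and sentences use the same assumed tuple representation.

from itertools import product

def tt_entails(kb, alpha, symbols):
    """Return True if kb entails alpha, by enumerating all 2**len(symbols) models."""
    for values in product([False, True], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if pl_true(kb, model) and not pl_true(alpha, model):
            return False        # found a model of KB in which alpha is false
    return True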
Figure 7.9 A truth table constructed for the knowledge base given in the text, with columns for B1,1, B2,1, P1,1, P1,2, P2,1, P2,2, P3,1, R1 through R5, and KB. KB is true if R1 through R5 are true, which occurs in just 3 of the 128 rows (the ones underlined in the right-hand column). In all 3 rows, P1,2 is false, so there is no pit in [1,2]. On the other hand, there might (or might not) be a pit in [2,2].
function TT-ENTAILS?(KB, α) returns true or false
  inputs: KB, the knowledge base, a sentence in propositional logic
          α, the query, a sentence in propositional logic
  symbols <- a list of the proposition symbols in KB and α

GoalClauseForm → (Symbol1 ∧ ··· ∧ Symboll) ⇒ False

Figure 7.14 A grammar for conjunctive normal form, Horn clauses, and definite clauses. A clause such as A ∧ B ⇒ C is still a definite clause when it is written as ¬A ∨ ¬B ∨ C, but only the former is considered the canonical form for definite clauses. One more class is the k-CNF sentence, which is a CNF sentence where each clause has at most k literals.
2. Inference with Horn clauses can be done through the forward-chaining and backward-chaining algorithms, which we explain next. Both of these algorithms are natural, in that the inference steps are obvious and easy for humans to follow. This type of inference is the basis for logic programming, which is discussed in Chapter 9.

3. Deciding entailment with Horn clauses can be done in time that is linear in the size of the knowledge base -- a pleasant surprise.

7.5.4 Forward and backward chaining
The forward-chaining algorithm PL-FC-ENTAILS?(KB, q) determines if a single proposition symbol q -- the query -- is entailed by a knowledge base of definite clauses. It begins from known facts (positive literals) in the knowledge base. If all the premises of an implication are known, then its conclusion is added to the set of known facts. For example, if L1,1 and Breeze are known and (L1,1 ∧ Breeze) ⇒ B1,1 is in the knowledge base, then B1,1 can be added. This process continues until the query q is added or until no further inferences can be made. The detailed algorithm is shown in Figure 7.15; the main point to remember is that it runs in linear time.

The best way to understand the algorithm is through an example and a picture. Figure 7.16(a) shows a simple knowledge base of Horn clauses with A and B as known facts. Figure 7.16(b) shows the same knowledge base drawn as an AND-OR graph (see Chapter 4). In AND-OR graphs, multiple links joined by an arc indicate a conjunction -- every link must be proved -- while multiple links without an arc indicate a disjunction -- any link can be proved. It is easy to see how forward chaining works in the graph. The known leaves (here, A and B) are set, and inference propagates up the graph as far as possible. Wherever a conjunction appears, the propagation waits until all the conjuncts are known before proceeding. The reader is encouraged to work through the example in detail.
function PL-FC-ENTAILS?(KB, q) returns true or false
  inputs: KB, the knowledge base, a set of propositional definite clauses
          q, the query, a proposition symbol
  count <- a table, where count[c] is the number of symbols in c's premise
  inferred <- a table, where inferred[s] is initially false for all symbols
  agenda <- a queue of symbols, initially symbols known to be true in KB

  while agenda is not empty do
    p <- POP(agenda)
    if p = q then return true
    if inferred[p] = false then
      inferred[p] <- true
      for each clause c in KB where p is in c.PREMISE do
        decrement count[c]
        if count[c] = 0 then add c.CONCLUSION to agenda
  return false
Figure 7.15 The forward-chaining algorithm for propositional logic. The agenda keeps track of symbols known to be true but not yet "processed." The count table keeps track of how many premises of each implication are as yet unknown. Whenever a new symbol p from the agenda is processed, the count is reduced by one for each implication in whose premise p appears (easily identified in constant time with appropriate indexing). If a count reaches zero, all the premises of the implication are known, so its conclusion can be added to the agenda. Finally, we need to keep track of which symbols have been processed; a symbol that is already in the set of inferred symbols need not be added to the agenda again. This avoids redundant work and prevents loops caused by implications such as P ⇒ Q and Q ⇒ P.
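The same count/agenda bookkeeping fits in a few lines of Python. The representation below -- a list of (premise_symbols, conclusion) pairs plus a set of known facts -- is an assumption made for illustration, not the book's data structure.

from collections import deque

def pl_fc_entails(clauses, facts, q):
    """clauses: list of (set_of_premise_symbols, conclusion); facts: symbols known true."""
    count = {i: len(prem) for i, (prem, _) in enumerate(clauses)}
    inferred = set()
    agenda = deque(facts)
    while agenda:
        p = agenda.popleft()
        if p == q:
            return True
        if p in inferred:
            continue
        inferred.add(p)
        for i, (prem, concl) in enumerate(clauses):
            if p in prem:
                count[i] -= 1
                if count[i] == 0:
                    agenda.append(concl)
    return False

# The Horn knowledge base of Figure 7.16, with A and B as known facts:
# clauses = [({'P'}, 'Q'), ({'L', 'M'}, 'P'), ({'B', 'L'}, 'M'),
#            ({'A', 'P'}, 'L'), ({'A', 'B'}, 'L')]
# pl_fc_entails(clauses, {'A', 'B'}, 'Q') returns True.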
It is easy to see that forward chaining is sound: every inference is essentially an application of Modus Ponens. Forward chaining is also complete: every entailed atomic sentence will be derived. The easiest way to see this is to consider the final state of the inferred table (after the algorithm reaches a fixed point where no new inferences are possible). The table contains true for each symbol inferred during the process, and false for all other symbols. We can view the table as a logical model; moreover, every definite clause in the original KB is true in this model. To see this, assume the opposite, namely that some clause a1 ∧ ... ∧ ak ⇒ b is false in the model. Then a1 ∧ ... ∧ ak must be true in the model and b must be false in the model. But this contradicts our assumption that the algorithm has reached a fixed point! We can conclude, therefore, that the set of atomic sentences inferred at the fixed point defines a model of the original KB. Furthermore, any atomic sentence q that is entailed by the KB must be true in all its models and in this model in particular. Hence, every entailed atomic sentence q must be inferred by the algorithm.

Forward chaining is an example of the general concept of data-driven reasoning -- that is, reasoning in which the focus of attention starts with the known data. It can be used within an agent to derive conclusions from incoming percepts, often without a specific query in mind. For example, the wumpus agent might TELL its percepts to the knowledge base using
Figure 7.16 (a) A set of Horn clauses:

    P ⇒ Q
    L ∧ M ⇒ P
    B ∧ L ⇒ M
    A ∧ P ⇒ L
    A ∧ B ⇒ L
    A
    B

(b) The corresponding AND-OR graph.
an incremental forward-chaining algorithm in which new facts can be added to the agenda to initiate new inferences. In humans, a certain amount of data-driven reasoning occurs as new information arrives. For example, if I am indoors and hear rain starting to fall, it might occur to me that the picnic will be canceled. Yet it will probably not occur to me that the seventeenth petal on the largest rose in my neighbor's garden will get wet; humans keep forward chaining under careful control, lest they be swamped with irrelevant consequences.

The backward-chaining algorithm, as its name suggests, works backward from the query. If the query q is known to be true, then no work is needed. Otherwise, the algorithm finds those implications in the knowledge base whose conclusion is q. If all the premises of one of those implications can be proved true (by backward chaining), then q is true. When applied to the query Q in Figure 7.16, it works back down the graph until it reaches a set of known facts, A and B, that forms the basis for a proof. The algorithm is essentially identical to the AND-OR-GRAPH-SEARCH algorithm in Figure 4.11. As with forward chaining, an efficient implementation runs in linear time. A small worked sketch follows below.

Backward chaining is a form of goal-directed reasoning. It is useful for answering specific questions such as "What shall I do now?" and "Where are my keys?" Often, the cost of backward chaining is much less than linear in the size of the knowledge base, because the process touches only relevant facts.
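Here is a recursive backward chainer over the same assumed definite-clause representation used in the forward-chaining sketch above (an illustration, not the book's code). The ancestors set guards against cycles such as P ⇒ Q together with Q ⇒ P.

def pl_bc_entails(clauses, facts, q, ancestors=frozenset()):
    """Goal-directed proof of q from definite clauses and known facts."""
    if q in facts:
        return True
    if q in ancestors:
        return False                      # q is already a goal on this path: avoid regress
    for prem, concl in clauses:
        if concl == q and all(
                pl_bc_entails(clauses, facts, p, ancestors | {q}) for p in prem):
            return True
    return False

# With the Figure 7.16 clauses and facts {'A', 'B'}, pl_bc_entails(..., 'Q') returns True,
# touching only the clauses relevant to proving Q.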
7.6 EFFECTIVE PROPOSITIONAL MODEL CHECKING
In this section, we describe two families of efficient algorithms for general propositional inference based on model checking: one approach based on backtracking search, and one on local hill-climbing search. These algorithms are part of the "technology" of propositional logic. This section can be skimmed on a first reading of the chapter.
The algorithms we describe are for checking satisfiability: the SAT problem. (As noted earlier, testing entailment, α ⊨ β, can be done by testing unsatisfiability of α ∧ ¬β.) We have already noted the connection between finding a satisfying model for a logical sentence and finding a solution for a constraint satisfaction problem, so it is perhaps not surprising that the two families of algorithms closely resemble the backtracking algorithms of Section 6.3 and the local search algorithms of Section 6.4. They are, however, extremely important in their own right because so many combinatorial problems in computer science can be reduced to checking the satisfiability of a propositional sentence. Any improvement in satisfiability algorithms has huge consequences for our ability to handle complexity in general.

7.6.1 A complete backtracking algorithm
The first algorithm we consider is often called the Davis-Putnam algorithm, after the seminal paper by Martin Davis and Hilary Putnam (1960). The algorithm is in fact the version described by Davis, Logemann, and Loveland (1962), so we will call it DPLL after the initials of all four authors. DPLL takes as input a sentence in conjunctive normal form -- a set of clauses. Like BACKTRACKING-SEARCH and TT-ENTAILS?, it is essentially a recursive, depth-first enumeration of possible models. It embodies three improvements over the simple scheme of TT-ENTAILS?:
• Early termination: The algorithm detects whether the sentence must be true or false, even with a partially completed model. A clause is true if any literal is true, even if the other literals do not yet have truth values; hence, the sentence as a whole could be judged true even before the model is complete. For example, the sentence (A ∨ B) ∧ (A ∨ C) is true if A is true, regardless of the values of B and C. Similarly, a sentence is false if any clause is false, which occurs when each of its literals is false. Again, this can occur long before the model is complete. Early termination avoids examination of entire subtrees in the search space.
• Pure symbol heuristic: A pure symbol is a symbol that always appears with the same "sign" in all clauses. For example, in the three clauses (A ∨ ¬B), (¬B ∨ ¬C), and (C ∨ A), the symbol A is pure because only the positive literal appears, B is pure because only the negative literal appears, and C is impure. It is easy to see that if a sentence has a model, then it has a model with the pure symbols assigned so as to make their literals true, because doing so can never make a clause false. Note that, in determining the purity of a symbol, the algorithm can ignore clauses that are already known to be true in the model constructed so far. For example, if the model contains B = false, then the clause (¬B ∨ C) is already true, and in the remaining clauses C appears only as a positive literal; therefore C becomes pure.
• Unit clause heuristic: A unit clause was defined earlier as a clause with just one literal. In the context of DPLL, it also means clauses in which all literals but one are already assigned false by the model. For example, if the model contains B = true, then (¬B ∨ ¬C) simplifies to ¬C, which is a unit clause. Obviously, for this clause to be true, C must be set to false. The unit clause heuristic assigns all such symbols before branching on the remainder. One important consequence of the heuristic is that
function DPLL-SATISFIABLE?(s) returns true or false
  inputs: s, a sentence in propositional logic
  clauses <- the set of clauses in the CNF representation of s
  symbols <- a list of the proposition symbols in s
  return DPLL(clauses, symbols, { })

function DPLL(clauses, symbols, model) returns true or false
  if every clause in clauses is true in model then return true
  if some clause in clauses is false in model then return false
  P, value <- FIND-PURE-SYMBOL(symbols, clauses, model)
  if P is non-null then return DPLL(clauses, symbols - P, model ∪ {P = value})
  P, value <- FIND-UNIT-CLAUSE(clauses, model)
  if P is non-null then return DPLL(clauses, symbols - P, model ∪ {P = value})
  P <- FIRST(symbols); rest <- REST(symbols)
  return DPLL(clauses, rest, model ∪ {P = true}) or
         DPLL(clauses, rest, model ∪ {P = false})

Figure 7.17 The DPLL algorithm for checking satisfiability of a sentence in propositional logic. The ideas behind FIND-PURE-SYMBOL and FIND-UNIT-CLAUSE are described in the text; each returns a symbol (or null) and the truth value to assign to that symbol. Like TT-ENTAILS?, DPLL operates over partial models.
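A runnable Python sketch of this skeleton is given below. The clause representation -- a list of sets of (symbol, sign) pairs, so {('A', True), ('B', False)} stands for A ∨ ¬B -- is an assumption made for illustration, and none of the scaling tricks discussed next are included.

def dpll_satisfiable(clauses, symbols):
    """Return True if the clause set is satisfiable."""
    return dpll(clauses, list(symbols), {})

def dpll(clauses, symbols, model):
    unknown = []
    for clause in clauses:
        value = clause_value(clause, model)
        if value is False:
            return False                         # early termination: a clause is false
        if value is None:
            unknown.append(clause)
    if not unknown:
        return True                              # every clause is already true
    choice = find_pure_symbol(symbols, unknown, model)
    if choice is None:
        choice = find_unit_clause(unknown, model)
    if choice is not None:
        p, value = choice
        rest = [s for s in symbols if s != p]
        return dpll(clauses, rest, {**model, p: value})
    p, rest = symbols[0], symbols[1:]            # branch on the first remaining symbol
    return (dpll(clauses, rest, {**model, p: True}) or
            dpll(clauses, rest, {**model, p: False}))

def clause_value(clause, model):
    """True if some literal is true, False if all literals are false, None otherwise."""
    has_unknown = False
    for sym, sign in clause:
        if sym in model:
            if model[sym] == sign:
                return True
        else:
            has_unknown = True
    return None if has_unknown else False

def find_pure_symbol(symbols, clauses, model):
    """Return (symbol, value) for a symbol appearing with only one sign, or None."""
    for s in symbols:
        signs = {sign for c in clauses for sym, sign in c if sym == s}
        if len(signs) == 1:
            return s, signs.pop()
    return None

def find_unit_clause(clauses, model):
    """Return (symbol, value) forced by a clause with one unassigned literal, or None."""
    for clause in clauses:
        unbound = [(sym, sign) for sym, sign in clause if sym not in model]
        if len(unbound) == 1 and clause_value(clause, model) is not True:
            return unbound[0]
    return None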
any attempt to prove (by refutation) a literal that is already in the knowledge base will succeed immediately (Exercise 7.23). Notice also that assigning one unit clause can create another unit clause -- for example, when C is set to false, (C ∨ A) becomes a unit clause, causing true to be assigned to A. This "cascade" of forced assignments is called unit propagation. It resembles the process of forward chaining with definite clauses, and indeed, if the CNF expression contains only definite clauses then DPLL essentially replicates forward chaining. (See Exercise 7.24.)

The DPLL algorithm is shown in Figure 7.17, which gives the essential skeleton of the search process. What Figure 7.17 does not show are the tricks that enable SAT solvers to scale up to large problems. It is interesting that most of these tricks are in fact rather general, and we have seen them before in other guises:
1. Component analysis (as seen with Tasmania in CSPs): As DPLL assigns truth values to variables, the set of clauses may become separated into disjoint subsets, called components, that share no unassigned variables. Given an efficient way to detect when this occurs, a solver can gain considerable speed by working on each component separately.
2. Variable and value ordering (as seen in Section 6.3.1 for CSPs): Our simple implementation of DPLL uses an arbitrary variable ordering and always tries the value true before false. The degree heuristic (see page 216) suggests choosing the variable that appears most frequently over all remaining clauses.
3. Intelligent backtracking (as seen in Section 6.3 for CSPs): Many problems that cannot be solved in hours of run time with chronological backtracking can be solved in seconds with intelligent backtracking that backs up all the way to the relevant point of conflict. All SAT solvers that do intelligent backtracking use some form of conflict clause learning to record conflicts so that they won't be repeated later in the search. Usually a limited-size set of conflicts is kept, and rarely used ones are dropped.
4. Random restarts (as seen on page 124 for hill-climbing): Sometimes a run appears not to be making progress. In this case, we can start over from the top of the search tree, rather than trying to continue. After restarting, different random choices (in variable and value selection) are made. Clauses that are learned in the first run are retained after the restart and can help prune the search space. Restarting does not guarantee that a solution will be found faster, but it does reduce the variance on the time to solution.
5. Clever indexing (as seen in many algorithms): The speedup methods used in DPLL itself, as well as the tricks used in modern solvers, require fast indexing of such things as "the set of clauses in which variable Xi appears as a positive literal." This task is complicated by the fact that the algorithms are interested only in the clauses that have not yet been satisfied by previous assignments to variables, so the indexing structures must be updated dynamically as the computation proceeds.

With these enhancements, modern solvers can handle problems with tens of millions of variables. They have revolutionized areas such as hardware verification and security protocol verification, which previously required laborious, hand-guided proofs.

7.6.2
Local search algorithms
We have seen several local search algorithms so far in this book, including HILL-CLIMBING (page 122) and SIMULATED-ANNEALING (page 126). These algorithms can be applied directly to satisfiability problems, provided that we choose the right evaluation function. Because the goal is to find an assignment that satisfies every clause, an evaluation function that counts the number of unsatisfied clauses will do the job. In fact, this is exactly the measure used by the MIN-CONFLICTS algorithm for CSPs (page 221). All these algorithms take steps in the space of complete assignments, flipping the truth value of one symbol at a time. The space usually contains many local minima, to escape from which various forms of randomness are required. In recent years, there has been a great deal of experimentation to find a good balance between greediness and randomness.

One of the simplest and most effective algorithms to emerge from all this work is called WALKSAT (Figure 7.18). On every iteration, the algorithm picks an unsatisfied clause and picks a symbol in the clause to flip. It chooses randomly between two ways to pick which symbol to flip: (1) a "min-conflicts" step that minimizes the number of unsatisfied clauses in the new state, and (2) a "random walk" step that picks the symbol randomly.
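A minimal Python sketch of this loop, under the same assumed clause representation used in the DPLL sketch above, looks like the following; the parameter names are chosen for convenience rather than taken from the book's figure. The probability p trades off the random-walk and min-conflicts choices.

import random

def walksat(clauses, p=0.5, max_flips=10000):
    """clauses: list of sets of (symbol, sign) literals. Returns a model or None."""
    symbols = {sym for clause in clauses for sym, _ in clause}
    model = {s: random.choice([True, False]) for s in symbols}
    for _ in range(max_flips):
        unsatisfied = [c for c in clauses
                       if not any(model[sym] == sign for sym, sign in c)]
        if not unsatisfied:
            return model                       # every clause satisfied
        clause = random.choice(unsatisfied)
        if random.random() < p:
            sym = random.choice([sym for sym, _ in clause])   # random walk step
        else:
            # min-conflicts step: flip the symbol that leaves the fewest unsatisfied clauses
            def unsat_after_flip(s):
                model[s] = not model[s]
                n = sum(not any(model[x] == sg for x, sg in c) for c in clauses)
                model[s] = not model[s]
                return n
            sym = min((sym for sym, _ in clause), key=unsat_after_flip)
        model[sym] = not model[sym]
    return None                                # failure: no model found within max_flips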
We can think of the percept history itself as a representation of the belief state, but one that makes inference increasingly expensive as the history gets longer.
function HYBRID-WUMPUS-AGENT(percept) returns an action
  inputs: percept, a list, [stench, breeze, glitter, bump, scream]
  persistent: KB, a knowledge base, initially the atemporal "wumpus physics"
              t, a counter, initially 0, indicating time
              plan, an action sequence, initially empty

  TELL(KB, MAKE-PERCEPT-SENTENCE(percept, t))
  TELL the KB the temporal "physics" sentences for time t
  safe <- {[x, y] : ASK(KB, OK^t_{x,y}) = true}
  if ASK(KB, Glitter^t) = true then
      plan <- [Grab] + PLAN-ROUTE(current, {[1,1]}, safe) + [Climb]
  if plan is empty then
      unvisited <- {[x, y] : ASK(KB, L^t'_{x,y}) = false for all t' <= t}
      plan <- PLAN-ROUTE(current, unvisited ∩ safe, safe)
  if plan is empty and ASK(KB, HaveArrow^t) = true then
      possible_wumpus <- {[x, y] : ASK(KB, ¬W_{x,y}) = false}
      plan <- PLAN-SHOT(current, possible_wumpus, safe)
  if plan is empty then   // no choice but to take a risk
      not_unsafe <- {[x, y] : ASK(KB, ¬OK^t_{x,y}) = false}
      plan <- PLAN-ROUTE(current, unvisited ∩ not_unsafe, safe)
  if plan is empty then
      plan <- PLAN-ROUTE(current, {[1,1]}, safe) + [Climb]
  action <- POP(plan)
  TELL(KB, MAKE-ACTION-SENTENCE(action, t))
  t <- t + 1
  return action

function PLAN-ROUTE(current, goals, allowed) returns an action sequence
  inputs: current, the agent's current position
          goals, a set of squares; try to plan a route to one of them
          allowed, a set of squares that can form part of the route
  problem <- ROUTE-PROBLEM(current, goals, allowed)
  return A*-GRAPH-SEARCH(problem)

Figure 7.20 A hybrid agent program for the wumpus world. It uses a propositional knowledge base to infer the state of the world, and a combination of problem-solving search and domain-specific code for deciding what to do.
The satisfiability threshold conjecture states that, for each k >= 3, there is a sharp threshold such that as the number of variables n → ∞, instances below the threshold are satisfiable with probability 1, while those above the threshold are unsatisfiable with probability 1. The conjecture was not quite proved by Friedgut (1999): a sharp threshold exists but its location might depend on n even as n → ∞. Despite significant progress in asymptotic analysis of the threshold location for large k (Achlioptas and Peres, 2004; Achlioptas et al., 2007), all that can be proved for k = 3 is that it lies in the range [3.52, 4.51]. Current theory suggests that a peak in the run time of a SAT solver is not necessarily related to the satisfiability threshold, but instead to a phase transition in the solution distribution and structure of SAT instances. Empirical results due to Coarfa et al. (2003) support this view. In fact, algorithms such as survey propagation (Parisi and Zecchina, 2002; Maneva et al., 2007) take advantage of special properties of random SAT instances near the satisfiability threshold and greatly outperform general SAT solvers on such instances.

The best sources for information on satisfiability, both theoretical and practical, are the Handbook of Satisfiability (Biere et al., 2009) and the regular International Conferences on Theory and Applications of Satisfiability Testing, known as SAT.
The idea of building agents with propositional logic can be traced back to the seminal paper of McCulloch and Pitts (1943), which initiated the field of neural networks. Contrary to popular supposition, the paper was concerned with the implementation of a Boolean circuit-based agent design in the brain. Circuit-based agents, which perform computation by propagating signals in hardware circuits rather than running algorithms in general-purpose computers, have received little attention in AI, however. The most notable exception is the work of Stan Rosenschein (Rosenschein, 1985; Kaelbling and Rosenschein, 1990), who developed ways to compile circuit-based agents from declarative descriptions of the task environment. (Rosenschein's approach is described at some length in the second edition of this book.) The work of Rod Brooks (1986, 1989) demonstrates the effectiveness of circuit-based designs for controlling robots -- a topic we take up in Chapter 25. Brooks (1991) argues that circuit-based designs are all that is needed for AI -- that representation and reasoning are cumbersome, expensive, and unnecessary. In our view, neither approach is sufficient by itself.
Williams et al. (2003) show how a hybrid agent design not too different from our wumpus agent has been used to control NASA spacecraft, planning sequences of actions and diagnosing and recovering from faults.

The general problem of keeping track of a partially observable environment was introduced for state-based representations in Chapter 4. Its instantiation for propositional representations was studied by Amir and Russell (2003), who identified several classes of environments that admit efficient state-estimation algorithms and showed that for several other classes the problem is intractable. The temporal-projection problem, which involves determining what propositions hold true after an action sequence is executed, can be seen as a special case of state estimation with empty percepts. Many authors have studied this problem because of its importance in planning; some important hardness results were established by
Liberatore (1997). The idea of representing a belief state with propositions can be traced to Wittgenstein (1922).

Logical state estimation, of course, requires a logical representation of the effects of actions -- a key problem in AI since the late 1950s. The dominant proposal has been the situation calculus formalism (McCarthy, 1963), which is couched within first-order logic. We discuss situation calculus, and various extensions and alternatives, in Chapters 10 and 12. The approach taken in this chapter -- using temporal indices on propositional variables -- is more restrictive but has the benefit of simplicity. The general approach embodied in the SATPLAN algorithm was proposed by Kautz and Selman (1992). Later generations of SATPLAN were able to take advantage of the advances in SAT solvers, described earlier, and remain among the most effective ways of solving difficult problems (Kautz, 2006).

The frame problem was first recognized by McCarthy and Hayes (1969). Many researchers considered the problem unsolvable within first-order logic, and it spurred a great deal of research into nonmonotonic logics. Philosophers from Dreyfus (1972) to Crockett (1994) have cited the frame problem as one symptom of the inevitable failure of the entire AI enterprise. The solution of the frame problem with successor-state axioms is due to Ray Reiter (1991). Thielscher (1999) identifies the inferential frame problem as a separate idea and provides a solution. In retrospect, one can see that Rosenschein's (1985) agents were using circuits that implemented successor-state axioms, but Rosenschein did not notice that the frame problem was thereby largely solved. Foo (2001) explains why the discrete-event control theory models typically used by engineers do not have to deal explicitly with the frame problem: because they are dealing with prediction and control, not with explanation and reasoning about counterfactual situations.

Modern propositional solvers have wide applicability in industrial applications. The application of propositional inference in the synthesis of computer hardware is now a standard technique having many large-scale deployments (Nowick et al., 1993). The SATMC satisfiability checker was used to detect a previously unknown vulnerability in a Web browser user sign-on protocol (Armando et al., 2008).

The wumpus world was invented by Gregory Yob (1975). Ironically, Yob developed it because he was bored with games played on a rectangular grid: the topology of his original wumpus world was a dodecahedron, and we put it back in the boring old grid. Michael Genesereth was the first to suggest that the wumpus world be used as an agent testbed.
EXERCISES
7.1 Suppose the agent has progressed to the point shown in Figure 7.4(a), page 239, having perceived nothing in [1,1], a breeze in [2,1], and a stench in [1,2], and is now concerned with the contents of [1,3], [2,2], and [3,1]. Each of these can contain a pit, and at most one can contain a wumpus. Following the example of Figure 7.5, construct the set of possible worlds. (You should find 32 of them.) Mark the worlds in which the KB is true and those in which each of the following sentences is true:
    α2 = "There is no pit in [2,2]."
    α3 = "There is a wumpus in [1,3]."

Hence show that KB ⊨ α2 and KB ⊨ α3.

7.2 (Adapted from Barwise and Etchemendy (1993).) Given the following, can you prove that the unicorn is mythical? How about magical? Horned?

    If the unicorn is mythical, then it is immortal, but if it is not mythical, then it is a mortal mammal. If the unicorn is either immortal or a mammal, then it is horned. The unicorn is magical if it is horned.
7.3 Consider the problem of deciding whether a propositional logic sentence is true in a given model.

a. Write a recursive algorithm PL-TRUE?(s, m) that returns true if and only if the sentence s is true in the model m (where m assigns a truth value for every symbol in s). The algorithm should run in time linear in the size of the sentence. (Alternatively, use a version of this function from the online code repository.)
b. Give three examples of sentences that can be determined to be true or false in a partial model that does not specify a truth value for some of the symbols.
c. Show that the truth value (if any) of a sentence in a partial model cannot be determined efficiently in general.
d. Modify your PL-TRUE? algorithm so that it can sometimes judge truth from partial models, while retaining its recursive structure and linear run time. Give three examples of sentences whose truth in a partial model is not detected by your algorithm.
e. Investigate whether the modified algorithm makes TT-ENTAILS? more efficient.

7.4 Which of the following are correct?

a. False ⊨ True.
b. True ⊨ False.
c. (A ∧ B) ⊨ (A ⇔ B).
d. A ⇔ B ⊨ A ∨ B.
e. A ⇔ B ⊨ ¬A ∨ B.
f. (A ∧ B) ⇒ C ⊨ (A ⇒ C) ∨ (B ⇒ C).
g. (C ∨ (¬A ∧ ¬B)) ≡ ((A ⇒ C) ∧ (B ⇒ C)).
h. (A ∨ B) ∧ (¬C ∨ ¬D ∨ E) ⊨ (A ∨ B).
i. (A ∨ B) ∧ (¬C ∨ ¬D ∨ E) ⊨ (A ∨ B) ∧ (¬D ∨ E).
j. (A ∨ B) ∧ ¬(A ⇒ B) is satisfiable.
k. (A ⇔ B) ∧ (¬A ∨ B) is satisfiable.
l. (A ⇔ B) ⇔ C has the same number of models as (A
⇔
B) for any fixed set of
Exercises
281 7.5
Prove each of the following assertions:
a. a is valid if and only if
b. For any a, False
f= a.
True f= a.
c. a f= f3 if and only if the sentence (a �
d. e.
f3) is valid. a = f3 if and only if the sentence (a ¢:? {3) is valid. a f= f3 if and only if the sentence (a 1\ -.{3) is unsatisfiable.
Prove, or find a counterexample to, each of the following assertions: a. If a f= "for f3 f= "f (or both) then (a 1\ {3) f= 1
7.6
f= (f3 1\ "f) then a f= f3 and a f= "f. If a f= (f3 V 1) then a f= f3 or a f= "f (or both).
b. If a
c.
Consider a vocabulary with only four propositions, A, B, C, and D. How many models are there for the following sentences? 7.7
a.
b. c. 7.8
B V C. -.A v -.B v -.C v -.D. (A � B) 1\ A 1\ -.B 1\ C 1\ D. We have defined four binary logical connectives.
a. Are there any others that might be useful?
b. How many binary connectives can there be? c. Why are some of them not very useful"? 7.9
Using a method ofyour choice, verify each ofthe equivalences in Figure 7.11 (page 249).
Decide whether each of the following sentences is valid, unsatisfiable, or neither. Ver ify your decisions using truth tables or the equivalence rules of Figure 7.11 (page 249). 7.10
a.
b. c. d. e. f.
g.
Smoke � Smoke Smoke � Fire (Smoke � Fire) � (-.Smoke � -.Fire) Smoke V Fire V -.Fire ((Smoke 1\ Heat) � Fire) ¢:? ((Smoke � Fire) V (Heat � Fire)) (Smoke � Fire) � ((Smoke 1\ Heat) � Fire) Big v Dumb v (Big � Dumb)
Any propositional logic sentence is logically equivalent to the assertjon that each pos sible world in whlch it would be false is not the case. From this observation, prove that any sentence can be written in CNF. 7.11
7.12 Use resolution to prove the sentence ¬A ∧ ¬B from the clauses in Exercise 7.20.

7.13 This exercise looks into the relationship between clauses and implication sentences.
a. Show that the clause (¬P1 ∨ ··· ∨ ¬Pm ∨ Q) is logically equivalent to the implication sentence (P1 ∧ ··· ∧ Pm) ⇒ Q.
b. Show that every clause (regardless of the number of positive literals) can be written in the form (P1 ∧ ··· ∧ Pm) ⇒ (Q1 ∨ ··· ∨ Qn), where the Ps and Qs are proposition symbols. A knowledge base consisting of such sentences is in implicative normal form or Kowalski form (Kowalski, 1979).
c. Write down the full resolution rule for sentences in implicative normal form.
According to some political pundits, a person who is radical
(R)
is electable
(E)
if
he/she is conservative (C), but otherwise is not electable. a. Which of the following (i) (ii) (iii)
are correct representations
of this assertion'?
(R A E) ¢::::::? C R � (E ¢::::::? C) R � ((C � E) v -.E)
b. Which of the sentences in (a) can be expressed in Hom form? 7.15
This question considers representing satisfiability (SAT) problems as CSPs.
a. Draw the constraint graph corresponding to the SAT problem
(-.X1 V X2) A (-.X2 V X3) A . . . A (--.Xn- 1 V Xn) for the particular case n = 5.
b. How many solutions are there for this general SAT problem as a function of n? c. Suppose we apply BACKTRACKING-SEARCH (page 215) to find all solutions to a SAT CSP of the type given in (a). (To find all solutions to a CSP, we simply modify the basic algorithm so it continues searching after each solution is found.) Assume that variables are ordered
X1, . . . , Xn
and false is ordered before
true.
How much time
will the algorithm take to terminate? (Write an 0(·) expression as a function ofn.) d. We know that SAT problems in Horn form can be solved in linear time by forward chaining (unit propagation). We also know that every tree-structured binary CSP with discrete, finite domains can be solved in time linear in the number of variables (Sec tion 6.5). Are these two facts connected? Discuss.
7.16
Explain why every nonempty propositional clause, by itself, is satisfiable. Prove rig
orously that every set of five 3-SAT clauses is satisfiable, provided that each clause mentions exactly three distinct variables. What is the smallest set of such clauses that is unsatisfiable? Construct such a set. 7.17
A propositional 2-CNFexpression is a conjunction of clauses, each containing exactly
2 literals, e.g., (A V B) A (-.A V C) A (-.B V D) A (-.C v G) A (-.D V G) . a. Prove using resolution that the above sentence entails G.
283
Exercises b.
Two clauses are semantically distinct if they are not logically equivalent. How many semantically distinct 2-CNF clauses can be constructed from n proposition symbols? c. Using your answer to (b), prove that propositional resolution always terminates in time polynomial in n given a 2-CNF sentence containing no more than n distinct symbols. d. Explain why your argument in (c) does not apply to 3-CNF.
7.18
Consider the following sentence:
[(Food =? Party) V (Drinks =? Party)] =? [(Food 1\ Drinks) ::::;. Party] . a. Detennine, using enumeration, whether this sentence is valid, satisfiable (but not valid),
or unsatisfiable. b. Convert the left-hand and 1ight-hand sides of the main implication into CNF, showing each step, and explain how the results confinn your answer to (a). c. Prove your answer to (a) using resolution. DISJUNCTIVE NORMAL FORM
7.19 A sentence is in disjunctive normal form (DNF) if it is the disjunction of conjunctions of literals. For example, the sentence (A 1\ B 1\ •C) V (·A 1\ C) V (B 1\ -.C) is in DNF.
a. Any propositional logic sentence is logically equivalent to the assertion that some pos
sible world in which it would be true is in fact the case. From this observation, prove that any sentence can be written in DNF. b. Construct an algorithm that converts any sentence in propositional logic into DNF. (Hint: The algorithm is similar to the algoritlun for conversion to CNF given in Sec lion 7.5.2.) c. Construct a simple algorithm that takes as input a sentence in DNF and returns a satis fying assignment if one exists, or reports that no satisfying assigrunent exists. d. Apply the algorithms in (b) and (c) to the following set of sentences:
A =? B B ::::;. C C =? ·A . e. Since the algoritlun in (b) is very similar to the algoritlun for conversion to CNF, and
since the algorithm in (c) is much simpler than any algorithm for solving a set of sen tences in CNF, why is this technique not used in automated reasoning? 7.20
Convert the following set of sentences to clausal form. Sl: S2: S3: S4: S5: S6:
A ¢:? (B V E). E =? D. C 1\ F =? ·B. E =? B. B =? F. B =? C
Give a trace of the execution of DPLL on the conjunction of these clauses.
284
Chapter
7.
Logical Agents
Is a randomly generated 4-CNF sentence with n symbols and m clauses more or Jess likely to be solvable than a randomly generated 3-CNF sentence with n symbols and m cla.uses? Explain.
7.21
7.22
Minesweeper, the well-known computer game, is closely related to the wumpus world. A minesweeper world is a rectangular grid of N squares with M invisible mines scattered among them. Any square may be probed by the agent; instant death follows if a mine is probed. Minesweeper indicates the presence of mines by revealing, in each probed square, the number of mines that are directly or diagonally adjacent. The goal is to probe every unmined square. a.
b. c.
d.
e.
f.
Let Xi,j be ttue iff square [i,j] oontains a mine. Write down the assertion that exactly two mines are adjacent to [1,1] as a sentence involving some logical combination of xi,j propositions. Generalize your assertion from (a) by explaining how to construct a CNF sentence asserting that k of n neighbors contain mines. Explain precisely how an agent can use DPLL to prove that a given square does (or does not) contain a mine, ignoring the global constraint that there are exactly M mines in all. Suppose that the global constraint is constructed from your method from pa1t (b). How does t.he number of clauses depend on M and N? Suggest a way to modify DPLL so that the global constraint does not need to be represented explicitly. Are any conclusions derived by the method in part (c) invalidated when the global constraint is taken into account"? Give examples of configurations of probe values that induce long-range dependencies such that the contents of a given unprobed square would give information about the contents of a far-distant square. (Hint: consider anN x 1 board.)
7.23 How long does it take to prove KB ⊨ α using DPLL when α is a literal already contained in KB? Explain.

7.24 Trace the behavior of DPLL on the knowledge base in Figure 7.16 when trying to prove Q, and compare this behavior with that of the forward-chaining algorithm.

7.25 Write a successor-state axiom for the Locked predicate, which applies to doors, assuming the only actions available are Lock and Unlock.

7.26 Section 7.7.1 provides some of the successor-state axioms required for the wumpus world. Write down axioms for all remaining fluent symbols.

7.27 Modify the HYBRID-WUMPUS-AGENT to use the 1-CNF logical state estimation method described on page 271. We noted on that page that such an agent will not be able to acquire, maintain, and use more complex beliefs such as the disjunction P3,1 ∨ P2,2. Suggest a method for overcoming this problem by defining additional proposition symbols, and try it out in the wumpus world. Does it improve the performance of the agent?
8
FIRST-ORDER LOGIC
In which we notice that the world is blessed with many objects, some of which are related to other objects, and in which we endeavor to reason about them.
In Chapter 7, we showed how a knowledge-based agent could represent the world in which it operates and deduce what actions to take. We used propositional logic as our representation language because it sufficed to illustrate the basic concepts of logic and knowledge-based agents. Unfortunately, propositional logic is too puny a language to represent knowledge of complex environments in a concise way. In this chapter, we examine first-order logic,1 which is sufficiently expressive to represent a good deal of our commonsense knowledge. It also either subsumes or forms the foundation of many other representation languages and has been studied intensively for many decades. We begin in Section 8.1 with a discussion of representation languages in general; Section 8.2 covers the syntax and semantics of first-order logic; Sections 8.3 and 8.4 illustrate the use of first-order logic for simple representations.
7. We look at propositional logic and at other kinds oflanguages
to understand what works and what fails. Our discussion will be cursory, compressing cen turies of thought, trial, and error into a few paragraphs. Progranuning languages (such as C++ or Java or Lisp) are by far the largest class of formal languages in common use. Programs themselves represent, in a direct sense, only computational processes. Data structures within programs can represent facts; for example, a program could use a
4 x 4 array to represent the contents of the wumpus world.
programming language statement is a pit in square
[2,2].
Thus, the
World[2,2] � Pit is a fairly natural way to assert that there
(Such representations might be considered ad hoc; database systems
were developed precisely to provide a more general, domain-independent way to store and 1
Also called first-order predicate calculus, sometimes abbreviated as
285
FOL or FOPC.
286
Chapter
8.
First-Order Logic
retrieve facts.) What programming languages lack is any general mechanism for deriving
from other facts; each update to a data structure is done by a domain-specific procedure whose details are derived by the programmer from his or her own knowledge of the domain. facts
This procedural approach can be contrasted with the
declarative nature of propositional logic,
in which knowledge and inference are separate, and inference is entirely domain independent. A second drawback of data structures in programs (and of databases, for that matter) is the lack of any easy way to say, for example, "There is a pit in
[2,2] or [3,1]" or "If the
[1 ,1] then he is not in [2,2]." Programs can store a single value for each variable, and some systems allow the value to be "unknown;' but they lack the expressiveness required wumpus is in
to handle partial information. Propositional logic is a declarative language because its semantics is based on a rruth relation between sentences and possible worlds.
It also has sufficient expressive power to deal with partial information, using disjunction and negation. Propositional logic has a third property that is desirable in representation languages, namely, compositionality. In a compositional language, the meaning of a sentence is a function of the meaning of its parts. For example, the meaning of "S1,4 ∧ S1,2" is related to the meanings of "S1,4" and "S1,2". It would be very strange if "S1,4" meant that there is a stench in square [1,4] and "S1,2" meant that there is a stench in square [1,2], but "S1,4 ∧ S1,2" meant that France and Poland drew 1-1 in last week's ice hockey qualifying match. Clearly, noncompositionality makes life much more difficult for the reasoning system.

As we saw in Chapter 7, however, propositional logic lacks the expressive power to concisely describe an environment with many objects. For example, we were forced to write a separate rule about breezes and pits for each square, such as

    B1,1 ⇔ (P1,2 ∨ P2,1) .

In English, on the other hand, it seems easy enough to say, once and for all, "Squares adjacent to pits are breezy." The syntax and semantics of English somehow make it possible to describe the environment concisely.
8.1.1   The language of thought
Natural languages (such as English or Spanish) are very expressive indeed. We managed to write almost this whole book in natural language, with only occasional lapses into other languages (including logic, mathematics, and the language of diagrams). There is a long tradition in linguistics and the philosophy of language that views natural language as a declarative knowledge representation language. If we could uncover the rules for natural language, we could use it in representation and reasoning systems and gain the benefit of the billions of pages that have been written in natural language.

The modern view of natural language is that it serves as a medium for communication rather than pure representation. When a speaker points and says, "Look!" the listener comes to know that, say, Superman has finally appeared over the rooftops. Yet we would not want to say that the sentence "Look!" represents that fact. Rather, the meaning of the sentence depends both on the sentence itself and on the context in which the sentence was spoken. Clearly, one could not store a sentence such as "Look!" in a knowledge base and expect to
recover its meaning without also storing a representation of the context, which raises the question of how the context itself can be represented. Natural languages also suffer from ambiguity, a problem for a representation language. As Pinker (1995) puts it: "When people think about spring, surely they are not confused as to whether they are thinking about a season or something that goes boing--and if one word can correspond to two thoughts, thoughts can't be words."

The famous Sapir-Whorf hypothesis claims that our understanding of the world is strongly influenced by the language we speak. Whorf (1956) wrote "We cut nature up, organize it into concepts, and ascribe significances as we do, largely because we are parties to an agreement to organize it this way--an agreement that holds throughout our speech community and is codified in the patterns of our language." It is certainly true that different speech communities divide up the world differently. The French have two words, "chaise" and "fauteuil," for a concept that English speakers cover with one: "chair." But English speakers can easily recognize the category fauteuil and give it a name--roughly "open-arm chair"--so does language really make a difference? Whorf relied mainly on intuition and speculation, but in the intervening years we actually have real data from anthropological, psychological, and neurological studies. For example, can you remember which of the following two phrases formed the opening of Section 8.1?

    "In this section, we discuss the nature of representation languages . . ."
    "This section covers the topic of knowledge representation languages . . ."
Wanner (1974) did a similar experiment and found that subjects made the right choice at chance level--about 50% of the time--but remembered the content of what they read with better than 90% accuracy. This suggests that people process the words to form some kind of nonverbal representation.

More interesting is the case in which a concept is completely absent in a language. Speakers of the Australian aboriginal language Guugu Yimithirr have no words for relative directions, such as front, back, right, or left. Instead they use absolute directions, saying, for example, the equivalent of "I have a pain in my north arm." This difference in language makes a difference in behavior: Guugu Yimithirr speakers are better at navigating in open terrain, while English speakers are better at placing the fork to the right of the plate.

Language also seems to influence thought through seemingly arbitrary grammatical features such as the gender of nouns. For example, "bridge" is masculine in Spanish and feminine in German. Boroditsky (2003) asked subjects to choose English adjectives to describe a photograph of a particular bridge. Spanish speakers chose big, dangerous, strong, and towering, whereas German speakers chose beautiful, elegant, fragile, and slender. Words can serve as anchor points that affect how we perceive the world. Loftus and Palmer (1974) showed experimental subjects a movie of an auto accident. Subjects who were asked "How fast were the cars going when they contacted each other?" reported an average of 32 mph, while subjects who were asked the question with the word "smashed" instead of "contacted" reported 41 mph for the same cars in the same movie.
In a first-order logic reasoning system that uses CNF, we can see that the linguistic forms "¬(A ∨ B)" and "¬A ∧ ¬B" are the same because we can look inside the system and see that the two sentences are stored as the same canonical CNF form. Can we do that with the human brain? Until recently the answer was "no," but now it is "maybe." Mitchell et al. (2008) put subjects in an fMRI (functional magnetic resonance imaging) machine, showed them words such as "celery," and imaged their brains. The researchers were then able to train a computer program to predict, from a brain image, what word the subject had been presented with. Given two choices (e.g., "celery" or "airplane"), the system predicts correctly 77% of the time. The system can even predict at above-chance levels for words it has never seen an fMRI image of before (by considering the images of related words) and for people it has never seen before (proving that fMRI reveals some level of common representation across people). This type of work is still in its infancy, but fMRI (and other imaging technology such as intracranial electrophysiology (Sahin et al., 2009)) promises to give us much more concrete ideas of what human knowledge representations are like.

From the viewpoint of formal logic, representing the same knowledge in two different ways makes absolutely no difference; the same facts will be derivable from either representation. In practice, however, one representation might require fewer steps to derive a conclusion, meaning that a reasoner with limited resources could get to the conclusion using one representation but not the other. For nondeductive tasks such as learning from experience, outcomes are necessarily dependent on the form of the representations used. We show in Chapter 18 that when a learning program considers two possible theories of the world, both of which are consistent with all the data, the most common way of breaking the tie is to choose the most succinct theory--and that depends on the language used to represent theories. Thus, the influence of language on thought is unavoidable for any agent that does learning.

8.1.2   Combining the best of formal and natural languages
We can adopt the foundation of propositional logic--a declarative, compositional semantics that is context-independent and unambiguous--and build a more expressive logic on that foundation, borrowing representational ideas from natural language while avoiding its drawbacks. When we look at the syntax of natural language, the most obvious elements are nouns and noun phrases that refer to objects (squares, pits, wumpuses) and verbs and verb phrases that refer to relations among objects (is breezy, is adjacent to, shoots). Some of these relations are functions--relations in which there is only one "value" for a given "input." It is easy to start listing examples of objects, relations, and functions:

• Objects: people, houses, numbers, theories, Ronald McDonald, colors, baseball games, wars, centuries . . .
• Relations: these can be unary relations or properties such as red, round, bogus, prime, multistoried . . . , or more general n-ary relations such as brother of, bigger than, inside, part of, has color, occurred after, owns, comes between, . . .
• Functions: father of, best friend, third inning of, one more than, beginning of . . .

Indeed, almost any assertion can be thought of as referring to objects and properties or relations. Some examples follow:
• "One plus two equals three." Objects: one, two, three, one plus two; Relation: equals; Function: plus. ("One plus two" is a name for the object that is obtained by applying the function "plus" to the objects "one" and "two." "Three" is another name for this object.)
• "Squares neighboring the wumpus are smelly." Objects: wumpus, squares; Property: smelly; Relation: neighboring.
• "Evil King John ruled England in 1200." Objects: John, England, 1200; Relation: ruled; Properties: evil, king.
The language of first-order logic, whose syntax and semantics we define in the next section, is built around objects and relations. It has been so important to mathematics, philosophy, and artificial intelligence precisely because those fields--and indeed, much of everyday human existence--can be usefully thought of as dealing with objects and the relations among them. First-order logic can also express facts about some or all of the objects in the universe. This enables one to represent general laws or rules, such as the statement "Squares neighboring the wumpus are smelly."

The primary difference between propositional and first-order logic lies in the ontological commitment made by each language--that is, what it assumes about the nature of reality. Mathematically, this commitment is expressed through the nature of the formal models with respect to which the truth of sentences is defined. For example, propositional logic assumes that there are facts that either hold or do not hold in the world. Each fact can be in one of two states: true or false, and each model assigns true or false to each proposition symbol (see Section 7.4.2).2 First-order logic assumes more; namely, that the world consists of objects with certain relations among them that do or do not hold. The formal models are correspondingly more complicated than those for propositional logic. Special-purpose logics make still further ontological commitments; for example, temporal logic assumes that facts hold at particular times and that those times (which may be points or intervals) are ordered. Thus, special-purpose logics give certain kinds of objects (and the axioms about them) "first-class" status within the logic, rather than simply defining them within the knowledge base. Higher-order logic views the relations and functions referred to by first-order logic as objects in themselves. This allows one to make assertions about all relations--for example, one could wish to define what it means for a relation to be transitive. Unlike most special-purpose logics, higher-order logic is strictly more expressive than first-order logic, in the sense that some sentences of higher-order logic cannot be expressed by any finite number of first-order logic sentences.

A logic can also be characterized by its epistemological commitments--the possible states of knowledge that it allows with respect to each fact. In both propositional and first-order logic, a sentence represents a fact and the agent either believes the sentence to be true, believes it to be false, or has no opinion. These logics therefore have three possible states of knowledge regarding any sentence. Systems using probability theory, on the other hand, can have any degree of belief, ranging from 0 (total disbelief) to 1 (total belief).3 For example, a probabilistic wumpus-world agent might believe that the wumpus is in [1,3] with probability 0.75. The ontological and epistemological commitments of five different logics are summarized in Figure 8.1.

2 In contrast, facts in fuzzy logic have a degree of truth between 0 and 1. For example, the sentence "Vienna is a large city" might be true in our world only to degree 0.6 in fuzzy logic.
3 It is important not to confuse the degree of belief in probability theory with the degree of truth in fuzzy logic. Indeed, some fuzzy systems allow uncertainty (degree of belief) about degrees of truth.
Language              Ontological Commitment                Epistemological Commitment
                      (What exists in the world)            (What an agent believes about facts)
---------------------------------------------------------------------------------------------
Propositional logic   facts                                 true/false/unknown
First-order logic     facts, objects, relations             true/false/unknown
Temporal logic        facts, objects, relations, times      true/false/unknown
Probability theory    facts                                 degree of belief ∈ [0, 1]
Fuzzy logic           facts with degree of truth ∈ [0, 1]   known interval value

Figure 8.1  Formal languages and their ontological and epistemological commitments.
In the next section we will launch into the details of first-order logic. Just as a student of physics requires some familiarity with mathematics, a student of AI must develop a talent for working with logical notation. On the other hand, it is also important not to get too concerned with the specifics of logical notation--after all, there are dozens of different versions. The main things to keep hold of are how the language facilitates concise representations and how its semantics leads to sound reasoning procedures.
8.2   SYNTAX AND SEMANTICS OF FIRST-ORDER LOGIC
We begin this section by specifying more precisely the way in which the possible worlds of first-order logic reflect the ontological commitment to objects and relations. Then we introduce the various elements of the language, explaining their semantics as we go along.

8.2.1   Models for first-order logic
Recall from Chapter 7 that the models of a logical language are the formal structures that constitute the possible worlds under consideration. Each model links the vocabulary of the logical sentences to elements of the possible world, so that the truth of any sentence can be determined. Thus, models for propositional logic link proposition symbols to predefined truth values. Models for first-order logic are much more interesting. First, they have objects in them! The domain of a model is the set of objects or domain elements it contains. The domain is required to be nonempty--every possible world must contain at least one object. (See Exercise 8.7 for a discussion of empty worlds.) Mathematically speaking, it doesn't matter what these objects are--all that matters is how many there are in each particular model--but for pedagogical purposes we'll use a concrete example. Figure 8.2 shows a model with five
objects: Richard the Lionheart, King of England from 1189 to 1199; his younger brother, the evil King John, who ruled from 1199 to 1215; the left legs of Richard and John; and a crown.

The objects in the model may be related in various ways. In the figure, Richard and John are brothers. Formally speaking, a relation is just the set of tuples of objects that are related. (A tuple is a collection of objects arranged in a fixed order and is written with angle brackets surrounding the objects.) Thus, the brotherhood relation in this model is the set

    { (Richard the Lionheart, King John), (King John, Richard the Lionheart) } .        (8.1)
(Here we have named the objects in English, but you may, if you wish, mentally substitute the pictures for the names.) The crown is on King John's head, so the "on head" relation contains just one tuple, (the crown, King John). The "brother" and "on head" relations are binary relations--that is, they relate pairs of objects. The model also contains unary relations, or properties: the "person" property is true of both Richard and John; the "king" property is true only of John (presumably because Richard is dead at this point); and the "crown" property is true only of the crown.

Figure 8.2  A model containing five objects, two binary relations, three unary relations (indicated by labels on the objects), and one unary function, left-leg.

Certain kinds of relationships are best considered as functions, in that a given object must be related to exactly one object in this way. For example, each person has one left leg, so the model has a unary "left leg" function that includes the following mappings:

    (Richard the Lionheart) → Richard's left leg
    (King John) → John's left leg .                                                      (8.2)

Strictly speaking, models in first-order logic require total functions, that is, there must be a value for every input tuple. Thus, the crown must have a left leg and so must each of the left legs. There is a technical solution to this awkward problem involving an additional "invisible" object that is the left leg of everything that has no left leg, including itself. Fortunately, as long as one makes no assertions about the left legs of things that have no left legs, these technicalities are of no import.

So far, we have described the elements that populate models for first-order logic. The other essential part of a model is the link between those elements and the vocabulary of the logical sentences, which we explain next.

8.2.2   Symbols and interpretations
We turn now to the syntax of first-order logic. The impatient reader can obtain a complete description from the formal grammar in Figure 8.3.

The basic syntactic elements of first-order logic are the symbols that stand for objects, relations, and functions. The symbols, therefore, come in three kinds: constant symbols, which stand for objects; predicate symbols, which stand for relations; and function symbols, which stand for functions. We adopt the convention that these symbols will begin with uppercase letters. For example, we might use the constant symbols Richard and John; the predicate symbols Brother, OnHead, Person, King, and Crown; and the function symbol LeftLeg. As with proposition symbols, the choice of names is entirely up to the user. Each predicate and function symbol comes with an arity that fixes the number of arguments.

As in propositional logic, every model must provide the information required to determine if any given sentence is true or false. Thus, in addition to its objects, relations, and functions, each model includes an interpretation that specifies exactly which objects, relations, and functions are referred to by the constant, predicate, and function symbols. One possible interpretation for our example--which a logician would call the intended interpretation--is as follows:

• Richard refers to Richard the Lionheart and John refers to the evil King John.
• Brother refers to the brotherhood relation, that is, the set of tuples of objects given in Equation (8.1); OnHead refers to the "on head" relation that holds between the crown and King John; Person, King, and Crown refer to the sets of objects that are persons, kings, and crowns.
• LeftLeg refers to the "left leg" function, that is, the mapping given in Equation (8.2).

There are many other possible interpretations, of course. For example, one interpretation maps Richard to the crown and John to King John's left leg. There are five objects in the model, so there are 25 possible interpretations just for the constant symbols Richard and John. Notice that not all the objects need have a name--for example, the intended interpretation does not name the crown or the legs. It is also possible for an object to have several names; there is an interpretation under which both Richard and John refer to the crown.4 If you find this possibility confusing, remember that, in propositional logic, it is perfectly possible to have a model in which Cloudy and Sunny are both true; it is the job of the knowledge base to rule out models that are inconsistent with our knowledge.
4 Later, in Section 8.2.8, we examine a semantics in which every object has exactly one name.
Sentence        →  AtomicSentence | ComplexSentence

AtomicSentence  →  Predicate | Predicate(Term, . . .) | Term = Term

ComplexSentence →  ( Sentence ) | [ Sentence ]
                 |  ¬ Sentence
                 |  Sentence ∧ Sentence
                 |  Sentence ∨ Sentence
                 |  Sentence ⇒ Sentence
                 |  Sentence ⇔ Sentence
                 |  Quantifier Variable, . . . Sentence

Term            →  Function(Term, . . .) | Constant | Variable

Quantifier      →  ∀ | ∃
Constant        →  A | X1 | John | . . .
Variable        →  a | x | s | . . .
Predicate       →  True | False | After | Loves | Raining | . . .
Function        →  Mother | LeftLeg | . . .

OPERATOR PRECEDENCE :  ¬, =, ∧, ∨, ⇒, ⇔

Figure 8.3  The syntax of first-order logic with equality, specified in Backus-Naur form (see page 1060 if you are not familiar with this notation). Operator precedences are specified, from highest to lowest. The precedence of quantifiers is such that a quantifier holds over everything to the right of it.
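To make the grammar concrete, it can be mirrored directly as data structures. The following sketch is ours rather than the book's; the class names (Fn, Atom, Not, And, ForAll) and the convention that constants and variables are plain strings are illustrative assumptions only, kept just rich enough for the examples in this chapter.

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass(frozen=True)
    class Fn:            # a complex term: a function symbol applied to terms
        symbol: str
        args: Tuple

    @dataclass(frozen=True)
    class Atom:          # an atomic sentence: a predicate symbol applied to terms
        predicate: str
        args: Tuple

    @dataclass(frozen=True)
    class Not:           # negation
        s: object

    @dataclass(frozen=True)
    class And:           # conjunction; Or, Implies, and Iff would follow the same pattern
        left: object
        right: object

    @dataclass(frozen=True)
    class ForAll:        # universal quantification; Exists is analogous
        var: str
        body: object

    # Brother(Richard, John) ∧ ¬Brother(LeftLeg(Richard), John)
    example = And(Atom("Brother", ("Richard", "John")),
                  Not(Atom("Brother", (Fn("LeftLeg", ("Richard",)), "John"))))

    # ∀x King(x)
    forall_example = ForAll("x", Atom("King", ("x",)))

Each production of the grammar corresponds to one constructor, so any sentence the grammar admits can be built by nesting these objects.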
Figure 8.4  Some members of the set of all models for a language with two constant symbols, R and J, and one binary relation symbol. The interpretation of each constant symbol is shown by a gray arrow. Within each model, the related objects are connected by arrows.
In summary, a model in first-order logic consists of a set of objects and an interpretation that maps constant symbols to objects, predicate symbols to relations on those objects, and function symbols to functions on those objects. Just as with propositional logic, entailment, validity, and so on are defined in terms of all possible models. To get an idea of what the set of all possible models looks like, see Figure 8.4. It shows that models vary in how many objects they contain--from one up to infinity--and in the way the constant symbols map to objects. If there are two constant symbols and one object, then both symbols must refer to the same object; but this can still happen even with more objects. When there are more objects than constant symbols, some of the objects will have no names. Because the number of possible models is unbounded, checking entailment by the enumeration of all possible models is not feasible for first-order logic (unlike propositional logic). Even if the number of objects is restricted, the number of combinations can be very large. (See Exercise 8.5.) For the example in Figure 8.4, there are 137,506,194,466 models with six or fewer objects.
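As a concrete illustration of this summary, here is a small sketch (ours, with invented object names such as "RichardObj") that encodes the five-object model of Figure 8.2 and its intended interpretation as plain Python data, and then evaluates one term and one atomic sentence against it, anticipating the semantics given in Sections 8.2.3 and 8.2.4.

    # Domain of five objects, named here by plain strings.
    domain = {"RichardObj", "JohnObj", "RLeg", "JLeg", "CrownObj"}

    # Relations are sets of tuples of domain elements; functions are dicts.
    brother  = {("RichardObj", "JohnObj"), ("JohnObj", "RichardObj")}
    on_head  = {("CrownObj", "JohnObj")}
    person   = {("RichardObj",), ("JohnObj",)}
    left_leg = {("RichardObj",): "RLeg", ("JohnObj",): "JLeg"}

    # The interpretation maps symbols to objects, relations, and functions.
    interpretation = {
        "Richard": "RichardObj", "John": "JohnObj",
        "Brother": brother, "OnHead": on_head, "Person": person,
        "LeftLeg": left_leg,
    }

    def eval_term(term, interp):
        """A term is a constant symbol (str) or a tuple (function_symbol, args...)."""
        if isinstance(term, str):
            return interp[term]
        fn, *args = term
        return interp[fn][tuple(eval_term(a, interp) for a in args)]

    def holds(predicate, args, interp):
        """True iff the tuple of referents is in the predicate's relation."""
        return tuple(eval_term(a, interp) for a in args) in interp[predicate]

    # LeftLeg(John) refers to John's left leg; Brother(Richard, John) is true.
    assert eval_term(("LeftLeg", "John"), interpretation) == "JLeg"
    assert holds("Brother", ("Richard", "John"), interpretation)

Changing the interpretation dictionary (for example, mapping "Richard" to the crown) changes which sentences come out true, exactly as described above.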
8.2.3   Terms
A tenn is a logical expression that refers to an object. Constant symbols are therefore terms, but it is not always convenient to have a distinct symbol to name every object. For example, in English we might use the expression "King John's left leg" rather than giving a name
to his leg. This is what ftmction symbols are for: instead of using a constant symbol, we use
LeftLeg(John).
In the general case, a complex term is formed by a function symbol
followed by a parenthesized list of terms as argwnents to the function symbol. It is important to remember that a complex term is just a complicated kind of name. It is not a "subroutine call" that "returns a value." There is no
LeftLeg subroutine that takes a person as input and
returns a leg. We can reason about left legs (e.g., stating the general rule that everyone has one and then deducing that John must have one) without ever providing a definition of LeftLeg. 5 nus is something that cannot be done with subroutines in programming languages. The formal semantics of terms is straightforward. Consider a term function symbol
f(tl, . . . , tn)
·
The
f refers to some function in the model (call it F); the argument terms refer
to objects in the domain (call them d1, . . . , dn); and the term as a whole refers to the object that is the value of the function F applied to d1, . . . , dn . For example, suppose the LeftLeg function symbol refers to the ftmction shown in Equation (8.2) and
then
LeftLeg(John) refers to King John's left leg.
John refers to King John,
In this way, the interpretation fixes the
referent of every term. 8.2.4
Atomic sentences
Now that we have both terms for referring to objects and predicate symbols for refeiTing to relations, we can put them together to make atomic sentences that state facts. An atomic 5 >.-expressions provide a useful notation in whlch new function symbols are constructed "on the fly." For
example, the function that squares its argumem can be written as (>.x x x x) and can be applied to arguments just like any other function symbol. A >.-expression can also be defined and used as a predicate symbol. (See Chapter 22.) The lambda operator in Lisp plays exactly the same role. Notice that the use of>. in thls way does not increase the formal expressive power of first-{)rder logic, because any sentence that includes a >.-expression can be rewritten by "plugging in" its arguments to yield an equivalent sentence.
Section 8.2.
Syntax and Semantics of First-Order Logic
295
ATOMICSENTENCE
sentence (or atom for sh01t) is formed from a predicate symbol optionally followed by a
ATOM
parenthesized Jist of terms, such as
Brother(Richard, John). This states, under the intended interpretation given earlier, that Richard the Lionheart is the brother of King John.6 Atomic sentences can have complex terms as arguments. Thus,
Married (Father(Richard), Mother( John)) states that Richard the Lionheart's father is married to King Jolm's mother (again, under a suitable interpretation). An atomic sentence is true in a given model ifthe relation referred to by the predicate symbol holds among the objects referred to by the arguments. 8.2.5
Complex sentences
We can use logical connectives to construct more complex sentences, with the same syntax and semantics as .in propositional calculus. Here are four sentences that are tme in the model ofFigure 8.2 under our intended interpretation:
·Brother(LeftLeg(Richard), John) Brother(Richard, John) A Brother(John, Richm Evil(Father(John)) . UNIVERSAL INSTANTIATION G�UNDTERM
EXISTENTIAL IN3TANT"'TION
The rule of Universal Instantiation (UI for short) says that we can infer any sentence obtained by substituting a ground term (a term without variables) for the variable.1 To write out the inference rule formally, we use the notion of substitutions introduced in Section 8.3. Let SUBST(θ, α) denote the result of applying the substitution θ to the sentence α. Then the rule is written
    ∀v α
    ----------------
    SUBST({v/g}, α)

for any variable v and ground term g. For example, the three sentences given earlier are obtained with the substitutions {x/John}, {x/Richard}, and {x/Father(John)}.

In the rule for Existential Instantiation, the variable is replaced by a single new constant symbol. The formal statement is as follows: for any sentence α, variable v, and constant symbol k that does not appear elsewhere in the knowledge base,

    ∃v α
    ----------------
    SUBST({v/k}, α)

For example, from the sentence
3 x Croum(x) 1\ OnHead(x, John) we can infer the sentence
Crown(Ct) 1\ OnHead(Ct, John) as long as C1 does not appear elsewhere in the knowledge base. Basically, the existential
SKOLEM CONSTANT
INFERENTIAL EQUIVALENCE
sentence says there is some object satisfying a condition, and applying the existential instantiation rule just gives a name to that object. Of course, that name must not already belong to another object. Mathematics provides a nice example: suppose we discover that there is a number that is a little bigger than 2.71828 and that satisfies the equation d(x^y)/dy = x^y for x. We can give this number a name, such as e, but it would be a mistake to give it the name of an existing object, such as π. In logic, the new name is called a Skolem constant. Existential Instantiation is a special case of a more general process called skolemization, which we cover in Section 9.5.

Whereas Universal Instantiation can be applied many times to produce many different consequences, Existential Instantiation can be applied once, and then the existentially quantified sentence can be discarded. For example, we no longer need ∃x Kill(x, Victim) once we have added the sentence Kill(Murderer, Victim). Strictly speaking, the new knowledge base is not logically equivalent to the old, but it can be shown to be inferentially equivalent in the sense that it is satisfiable exactly when the original knowledge base is satisfiable.
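Both instantiation rules are mechanical once substitution is available. The sketch below is ours (it reuses the nested-tuple sentence encoding from the earlier model example, with lowercase strings as variables); the quantifier tags 'forall' and 'exists' and the Skolem-constant naming scheme are assumptions made purely for illustration.

    import itertools

    def subst(theta, s):
        """Apply substitution theta (a dict var -> term) to a sentence or term.
        Variables are lowercase strings; compound expressions are tuples."""
        if isinstance(s, str):
            return theta.get(s, s)
        return tuple(subst(theta, part) for part in s)

    def universal_instantiation(sentence, ground_term):
        """From ('forall', v, body) infer SUBST({v/g}, body)."""
        tag, var, body = sentence
        assert tag == 'forall'
        return subst({var: ground_term}, body)

    _counter = itertools.count()

    def existential_instantiation(sentence):
        """From ('exists', v, body) infer SUBST({v/k}, body) for a fresh
        Skolem constant k that appears nowhere else."""
        tag, var, body = sentence
        assert tag == 'exists'
        skolem = 'C' + str(next(_counter) + 1)
        return subst({var: skolem}, body)

    # forall x  King(x) ∧ Greedy(x) ⇒ Evil(x), instantiated with x/John:
    rule = ('forall', 'x', ('=>', ('and', ('King', 'x'), ('Greedy', 'x')),
                            ('Evil', 'x')))
    print(universal_instantiation(rule, 'John'))
    # ('=>', ('and', ('King', 'John'), ('Greedy', 'John')), ('Evil', 'John'))

In a real system the fresh-constant generator would also have to check the existing vocabulary of the knowledge base, as the text requires.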
new
sentence, whereas an
Chapter 9.
324 9.1.2
Inference in First-Order Logic
Reduction to propositional inference
Once we have rules for inferring nonquantified sentences from quantified sentences, it be comes possible to reduce first-order inference to propositional inference. In this section we give the main ideas; the details are given in Section 9.5. The first idea is that, just as an existentially quantified sentence can be replaced by one instantiation, a universally quantified sentence can be replaced by the set of all possible instantiations. For example, suppose our knowledge base contains just the sentences
Vx King(x) 1\ Greedy(x) King( John) Gn�edy (John) Brother(Richard, John) .
=?
Evil(x) (9.1)
Then we apply Ul to the first sentence using all possible ground-term substitutions from the vocabulary of the knowledge base-in this case, {xfJohn} and {xfRichard}. We obtain
King( John) 1\ Greedy( John) =? Evil(John) King(Richard) 1\ Greedy(Richard) =? Evil(Richard) , and we discard the tmiversally quantified sentence. Now, the knowledge base is essentially propositional if we view the ground atomic sentences-King( John), Greedy(John), and so on-as proposition symbols. Therefore, we can apply any of the complete propositional algorithms in Chapter 7 to obtain conclusions such as Evil( John). This technique of propositionalization can be made completely general, as we show in Section 9.5; that is, every first-order knowledge base and query can be propositionalized in such a way that entailment is preserved. Thus, we have a complete decision procedure
for entailment . . . or perhaps not. There is a problem: when the knowledge base includes a function symbol, the set of possible ground-term substitutions is infinite! For example, if the knowledge base mentions the Father symbol, then infinitely many nested terms such as Father(Father(Father( John))) can be constructed. Our propositional algorithms will have difficulty with an infinitely large set of sentences. Forttmately, there is a famous theorem due to Jacques Herbrand (1930) to the effect that if a sentence is entailed by the original, first-order knowledge base, then there is a proof involving just a finite subset of the propositionalized knowledge base. Since any such subset has a maximum depth of nesting among its grotmd terms, we can find the subset by first generating all the instantiations with constant symbols (Richard and John), then all terms of depth 1 (Father(Richard) and Father( John)), then all terms of depth 2, and so on, until we are able to construct a propositional proof of the entailed sentence. We have sketched an approach to first-order inference via propositionalization that is complete-that is, any entailed sentence can be proved. This is a major achievement, given that the space of possible models is infinite. On the other hand, we do not know until the proof is done that the sentence is entailed! What happens when the sentence is not entailed? Can we tell? Well, for first-order logic, it turns out that we carmot. Our proof procedure can go on and on, generating more and more deeply nested terms, but we will not know whether it is stuck in a hopeless loop or whether the proof is just about to pop out. This is very much
Section 9.2.
Unification and Lifting
325
like the halting problem for Turing machines. A1an Turing (1936) and Alonzo Church (1936) both proved, in rather different ways, the inevitability of this state of affairs. The question of
entailmentforfirst-order logic is semidecidable-that is, algorithms exist that say yes to every entailed sentence, but no algorithm exists that also says no to every nonentailed sentence. 9.2
UNIFICATION AND LIFTING
The preceding section described the understanding of first-order inference that existed up to the early 1960s. The sharp-eyed reader (and certainly the computational logicians of the early 1960s) will have noticed that the propositionalization approach is rather inefficient. For example, given the query Evil(x) and the knowledge base in Equation (9.1), it seems per verse to generate sentences such as King(Richard) 1\ G?Y�edy(Richard) =? Evil(Richard). Indeed, the inference of Evil(John) from the sentences
Vx 1\ing(x) 1\ Greedy(x) King( John) Greedy (John)
=?
.l!:vil(x)
seems completely obvious to a human being. We now show how to make it completely obvious to a computer. 9.2.1
A first-order inference rule
The inference that Jolm is evil-that is, that {xj John} solves the query Evil(x)-works like this: to use the rule that greedy kings are evil, find some x such that x is a king and x is greedy, and then infer that this x is evil. More generally, if there is some substitution () that makes each of the conjuncts of the premise of the implication identical to sentences already in the knowledge base, then we can assert the conclusion of the implication, after applying fJ. In this case, the substitution () = {xj John} achieves that aim. We can actually make the inference step do even more work. Suppose that instead of knowing Greedy( John), we know that everyone is greedy:
Vy Greedy(y) .
GENERAUZED MODUS PONENS
(9.2)
Then we would still like to be able to conclude that Evil(John), because we know that John is a king (given) and John is greedy (because everyone is greedy). What we need for this to work is to find a substitution both for the variables in the implication sentence and for the variables in the sentences that are n i the knowledge base. 1n tllis case, applying the substitution {xjJohn, yj John} to the implication premises King(x) and Greedy(x) and the knowledge-base sentences King(John) and Greedy(y) will make them identical. Thus, we can infer the conclusion of the implication. This inference process can be captured as a single inference rule that we call Gener alized Modus Ponens:2 For atomic sentences Pi, p/, and q, where there is a substitution ()
Chapter 9.
326
Inference in First-Order Logic
such that SUBST(B,p/) = SUBST(B,pi), for all i,
PI' , P21,
•
•
•
, Pn', (pi /\P2/\ . . . /\pn � q) SUBST(B,q)
There are n+ 1 premises to this mle: the n atomic sentences p/ and the one implication. The conclusion is the result of applying the substitution B to the consequent q. For our example:
PI' is King( John) PI is King(x) P21 is Greedy(y) P2 is Greedy(x) B is {xjJohn, yjJohn} q is Evil(x) SUBST(B,q) is Evil( John) . It is easy to show that Generalized Modus Ponens is a sound inference rule. First, we observe that, for any sentence p (whose variables are asswned to be universally quantified) and for any substitution B,
p F SUBST(B,p) holds by Universal Instantiation. It holds in particular for a B that satisfies the conditions of the Generalized Modus Ponens rule. Thus, from PI', . . . ,pn' we can infer
SUBST(B,pi') 1\ . . . 1\ SUBST(B,pn') and from the implication PI 1\ . . . 1\ Pn � q we can infer
SUBST(B,p1) 1\ . . . 1\SVBST(B,pn) � SUBST(B,q) .
LIFTING
Now, B in Generalized Modus Ponens is defined so that SUBST(B,p/) = SUBST(B,pi), for all i; therefore the first of these two sentences matches the premise of the second exactly. Hence, SUBST(B, q) follows by Modus Ponens. Generalized Modus Ponens is a lifted version of Modus Ponens-it raises Modus Ponens from ground (variable-free) propositional logic to first-order logic. We will see in the rest of this chapter that we can develop lifted versions of the forward chaining, backward chaining, and resolution algorithms introduced in Chapter 7. The key advantage of lifted inference rules over propositionalization is that they make only those substitutions that are required to allow particular inferences to proceed. 9.2.2
Unification
Lifted inference mles require finding substitutions that make different logical expressions UNIACATION UNIAER
look identical. This process is called
unification and
is a key component of all first-order
inference algorithms. The UNIFY algorithm takes two sentences and retw11S a unifier for them if one exists:
UNIFY(p, q) = 8 where SUBST(ll,p) = SVBST(B,q) . Let us look at some examples of how UNIFY should behave. Suppose we have a query AskVars(Knows(John, x)): whom does John know? Answers to this query can be found Generalized Modus Ponens is more general than Modus Ponens (page 249) in the sense that the known facts and the premise of the implication need match only up to a substitution, rather than exactly. On the other hand,
2
Modus Ponens allows any sentence a as the premise, rather than just a conjunction of atomic sentences.
Section 9.2.
Unification and Lifting
327
by finding all sentences in the knowledge base that unify with
Knows( John, x). Here are the
results of unification with four different sentences that might be in the knowledge base:
UNIFY(Knows(John, x), UNIFY(Knows(John, x), UNIFY(Knows( John, x), UNIFY(Knows(John, x),
Knows( John, Jane)) = {xj Jane} Knows(y, Bill)) = {xjBill, yjJohn} Knows(y, Mother(y))) = {yjJohn, xjMother( John)} Knows(x, Elizabeth)) = fail . x cannot take on the values John and Elizabeth at the Knows(x, Elizabeth) means "Everyone knows Elizabeth,"
The last unification fails because same time. Now, remember that
so
we
should be able to infer that John knows Elizabeth.
the two sentences happen to STANDIRDIZING AAIRT
by
use
The problem arises only because
the same variable name,
x.
The problem can be avoided
standardizing apart one of the two sentences being unified, which means renaming its
variables to avoid name clashes. For example, we can rename x in
X17 (a new variable name) without changing its meaning.
Knows(x, Elizabeth) to
Now the unification will work:
UNIFY(Knows(John,x), Knows(x17, Elizabeth)) = {xjElizabeth, x17/John} . Exercise 9.12 delves further into the need for standardizing apart. There is one more complication:
we said that
UNIFY should
retum a substitution
that makes the two arguments look the same. But there could be more than one such uni
UNIFY(Knows(John,x),Knows(y,z)) could return {yjJohn, xjz} or {yj John, xj John, zjJohn}. The first unifier gives Knows( John, z) as the result of unifi cation, whereas the second gives Knows( John, John). The second result could be obtained from the first by an additional substitution {zj John}; we say that the first unifier is more general than the second, because it places fewer restrictions on the values of the variables. It
fier. For example,
MOST GENERAL UNIFIER
turns out that, for every unifiable pair of expressions, there is a single most general unifier (or
MGU) that is unique up to renaming and substitution of variables. (For example, {xj John} and {yjJohn} are considered equivalent, as are {xjJohn, yjJohn} and {xjJohn, yjx}.) In this case it is {yjJohn, xjz}. An algorithm for computing most general unifiers is shown in Figure
9.1. The process
is simple: recursively explore the two expressions simultaneously "side by side;' building up a unifier along the way, but failing if two corresponding points in the structures do not match. There is one expensive step: when matching a variable against a complex term, one must check whether the variable itself occurs inside the term; if it does, the match fails because no consistent unifier can be constructed. For example, OCCUR CHECK
S(x) can't unify with S(S(x)). This so
called occur check makes the complexity of the entire algorithm quadratic in the size of the expressions being unified. Some systems, including all logic programming systems, simply omit the occur check and sometimes make unsound inferences as a result; other systems use more complex algorithms with linear-time complexity.
9.2.3
Storage and retrieval
Underlying the
TELL
and ASK functions used to inform and inteJTogate a knowledge base
STORE and FETCH functions. STORE(s) stores a sentence s into the knowledge bas e and FETCH(q) returns all unifiers such that the query q unifies with some are the more primitive
328
Chapter
9.
Inference in First-Order Logic
function UNIFY(x, y, θ) returns a substitution to make x and y identical
  inputs: x, a variable, constant, list, or compound expression
          y, a variable, constant, list, or compound expression
          θ, the substitution built up so far (optional, defaults to empty)

  if θ = failure then return failure
  else if x = y then return θ
  else if VARIABLE?(x) then return UNIFY-VAR(x, y, θ)
  else if VARIABLE?(y) then return UNIFY-VAR(y, x, θ)
  else if COMPOUND?(x) and COMPOUND?(y) then
      return UNIFY(x.ARGS, y.ARGS, UNIFY(x.OP, y.OP, θ))
  else if LIST?(x) and LIST?(y) then
      return UNIFY(x.REST, y.REST, UNIFY(x.FIRST, y.FIRST, θ))
  else return failure

function UNIFY-VAR(var, x, θ) returns a substitution
  if {var/val} ∈ θ then return UNIFY(val, x, θ)
  else if {x/val} ∈ θ then return UNIFY(var, val, θ)
  else if OCCUR-CHECK?(var, x) then return failure
  else return add {var/x} to θ
Figure 9.1  The unification algorithm. The algorithm works by comparing the structures of the inputs, element by element. The substitution θ that is the argument to UNIFY is built up along the way and is used to make sure that later comparisons are consistent with bindings that were established earlier. In a compound expression such as F(A, B), the OP field picks out the function symbol F and the ARGS field picks out the argument list (A, B).
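For readers who prefer running code to pseudocode, here is a compact Python rendering of the same algorithm (ours, not the book's code): expressions are nested tuples whose first element is a predicate or function symbol, variables are lowercase strings, and failure is signalled with None. It includes the occur check discussed in the text.

    def is_variable(x):
        return isinstance(x, str) and x[:1].islower()

    def unify(x, y, theta=None):
        """Return a substitution that makes x and y identical, or None."""
        if theta is None:
            theta = {}
        if x == y:
            return theta
        if is_variable(x):
            return unify_var(x, y, theta)
        if is_variable(y):
            return unify_var(y, x, theta)
        if isinstance(x, tuple) and isinstance(y, tuple) and len(x) == len(y):
            for xi, yi in zip(x, y):
                theta = unify(xi, yi, theta)
                if theta is None:
                    return None
            return theta
        return None

    def unify_var(var, x, theta):
        if var in theta:
            return unify(theta[var], x, theta)
        if is_variable(x) and x in theta:
            return unify(var, theta[x], theta)
        if occurs_in(var, x, theta):          # the occur check
            return None
        new_theta = dict(theta)
        new_theta[var] = x
        return new_theta

    def occurs_in(var, x, theta):
        """Does var occur inside x once bindings in theta are followed?"""
        if var == x:
            return True
        if is_variable(x) and x in theta:
            return occurs_in(var, theta[x], theta)
        if isinstance(x, tuple):
            return any(occurs_in(var, xi, theta) for xi in x)
        return False

    # unify(Knows(John, x), Knows(y, Mother(y)))
    print(unify(('Knows', 'John', 'x'), ('Knows', 'y', ('Mother', 'y'))))
    # {'y': 'John', 'x': ('Mother', 'y')}  -- chasing the binding for y gives Mother(John)

Note that this version returns bindings that may still mention other variables; applying the substitution repeatedly (or composing it) yields the fully resolved unifier quoted in the text.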
INDEXING PREDICATE INDEXING
sentence in the knowledge base. The problem we used to illustrate unification-finding all facts that unify with Knows( John, x )-is an instance of FETCHing. The simplest way to implement STORE and FETCH is to keep all the facts in one long list and unify each query against every element of the list. Such a process is inefficient, but it works, and it's all you need to understand the rest of the chapter. The remainder of this section outlines ways to make retrieval more efficient; it can be skipped on first reading. We can make FETCH more efficient by ensuring that tmificalions are attempted only with sentences that have some chance of tmifying. For example, there is no point in trying to unify Knows( John, x) with Brother(Richard, John). We can avoid such unifications by indexing the facts in the knowledge base. A simple scheme called predicate indexing puts all the Knows facts in one bucket and all the Brother facts in another. The buckets can be stored in a hash table for efficient access. Predicate indexing is useful when there are many predicate symbols but only a few clauses for each symbol. Sometimes, however, a predicate has many clauses. For example, suppose that the tax authorities want to keep track of who employs whom, using a predi cate Employs(x, y). This would be a very large bucket with perhaps millions of employers
Section 9.2.
329
Figure 9.2  (a) The subsumption lattice whose lowest node is Employs(IBM, Richard), with nodes Employs(x, y), Employs(x, Richard), Employs(IBM, y), and Employs(IBM, Richard). (b) The subsumption lattice for the sentence Employs(John, John), with nodes Employs(x, y), Employs(x, John), Employs(John, y), Employs(x, x), and Employs(John, John).
Figure 9.2
and tens of millions of employees. Answering a query such as Jimploys(x, H.ichard) with predicate indexing would require scamling the entire bucket. For this particular query, it would help if facts were indexed both by predicate and by second argument, perhaps using a combined hash table key. Then we could simply constmct the key from the query and retrieve exactly those facts that unify with the query. For other queries, such as Employs(IBM, y) , we would need to have indexed the facts by combining the predicate with the first argument. Therefore, facts can be stored under multiple index keys, rendering them instantly accessible to various queries that they might unify with. Given a sentence to be stored, it is possible to construct indices for all possible quedes that unify with it. For the fact Employs(IBM, Richard), the queries are
Employs (IBM, Richard) Does IBM employ Richard? Employs (x, Richard) Who employs Richard? Employs(IBM,y) Whom does IBM employ? Employs(x, y) Who employs whom? SUBSUMPTION LATTICE
These queries form a subsumption lattice, as shown in Figure 9.2(a). The lattice has some interesting properties. For example, the child of any node in the lattice is obtained from its parent by a single substitution; and the "highest" common descendant of any two nodes is the result of applying their most general unifier. The portion of the lattice above any ground fact can be constructed systematically (Exercise 9.5). A sentence with repeated constants has a slightly different lattice, as shown in Figure 9.2(b). Function symbols and variables in the sentences to be stored introduce still more interesting lattice structures. The scheme we have described works very well whenever the lattice contains a small number of nodes. For a predicate with n arguments, however, the lattice contains 0(2n) nodes. If function symbols are allowed, the nwnber of nodes is also exponential in the size of the te1ms in the sentence to be stored. This can lead to a huge number of indices. At some point, the benefits of indexing are outweighed by the costs of storing and maintaining all the indices. We can respond by adopting a fixed policy, such as maintaining indices only on keys composed of a predicate plus each argument, or by using an adaptive policy that creates indices to meet the demands of the kinds of queries being asked. For most AI systems, the number of facts to be stored is small enough that efficient indexing is considered a solved problem. For commercial databases, where facts number in the billions, the problem has been the subject of intensive study and technology development..
330
9.3
Chapter
9.
Inference in First-Order Logic
FORWARD CHAINING
A forward-chaining algorithm for propositional definite clauses was given in Section 7.5. The idea is simple: start with the atomic sentences in the knowledge base and apply Modus Ponens in the fmward direction, adding new atomic sentences, until no further inferences can be made. Here, we explain how the algorithm is applied to first-order definite clauses. Definite clauses such as Situation => Response are especially useful for systems that make inferences in response to newly arrived information. Many systems can be defined this way, and forward chaining can be implemented very efficiently. 9.3.1
First�order definite clauses
First-order definite clauses closely resemble propositional definite clauses (page 256): they are disjunctions of literals of which exactly one is positive. A definite clause either is atomic or is an implication whose antecedent is a conjunction of positive literals and whose conse quent is a single positive literal. The following are first-order definite clauses:
King(x) A Greedy(x) King( John) . Greedy(y) .
=>
Evil(x) .
Unlike propositional literals, first-order literals can include variables, in which case those variables are assumed to be universally quantified. (Typically, we omit universal quantifiers when writing definite clauses.) Not every knowledge base can be converted into a set of definite clauses because of the single-positive-literal restriction, but many can. Consider the following problem: The law says that it is a crime for an American to sell weapons to hostile nations. The country Nono, an enemy of America, has some missiles, and all of its missiles were sold to it by Colonel West, who is American.
We will prove that West is a criminal. First, we will represent these facts as first-order definite clauses. The next section shows how the forward-chaining algorithm solves the problem. ". . . it is a crime for an American to sell weapons to hostile nations":
American(x) 1\ Weapon(y) 1\ Sells(x, y,z) 1\ Hostile(z) => Criminal(x) .
(9.3)
"Nono . . . has some missiles." The sentence 3 x Owns( Nono, x) AMissile(x) is transformed into two definite clauses by Existential Instantiation, introducing a new constant M1:
Owns(Nono,M1) Missile(M1)
(9.4) (9.5)
"All of its missiles were sold to it by Colonel West":
Missile(x) A Owns(Nono,x)
=>
Sells( West,x, Nono) .
(9.6)
We will also need to know that missiles are weapons:
Missile(x) => Weapon(x)
(9.7)
Section 9.3.
331
Forward Chaining and we must know that an enemy of America counts as "hostile":
Enemy(x, America) � Hosti!e(x) .
(9.8)
"West, who is American . . .":
American( West) .
(9.9)
"The cotmtry Nono, an enemy of America . . . ":
Enemy(Nono, America) .
(9.10)
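Before looking at the forward-chaining algorithm itself, here is a small self-contained sketch (ours) of naive forward chaining over one possible Python encoding of clauses (9.3)-(9.10): rules are (premises, conclusion) pairs of tuple atoms, lowercase strings are variables, and inference repeats until no new facts appear.

    def is_var(x):
        return isinstance(x, str) and x[:1].islower()

    def match(pattern, fact, theta):
        """Extend theta so that pattern equals fact, or return None."""
        theta = dict(theta)
        for p, f in zip(pattern, fact):
            if is_var(p):
                if theta.setdefault(p, f) != f:
                    return None
            elif p != f:
                return None
        return theta if len(pattern) == len(fact) else None

    def satisfy_all(premises, facts, theta):
        """Yield every substitution satisfying all premises against the facts."""
        if not premises:
            yield theta
            return
        first, rest = premises[0], premises[1:]
        for fact in facts:
            t = match(first, fact, theta)
            if t is not None:
                yield from satisfy_all(rest, facts, t)

    def forward_chain(rules, facts):
        facts = set(facts)
        while True:
            new = set()
            for premises, conclusion in rules:
                for theta in satisfy_all(premises, facts, {}):
                    atom = tuple(theta.get(t, t) for t in conclusion)
                    if atom not in facts:
                        new.add(atom)
            if not new:
                return facts
            facts |= new

    rules = [([('American', 'x'), ('Weapon', 'y'), ('Sells', 'x', 'y', 'z'),
               ('Hostile', 'z')], ('Criminal', 'x')),
             ([('Missile', 'x'), ('Owns', 'Nono', 'x')],
              ('Sells', 'West', 'x', 'Nono')),
             ([('Missile', 'x')], ('Weapon', 'x')),
             ([('Enemy', 'x', 'America')], ('Hostile', 'x'))]
    facts = {('Owns', 'Nono', 'M1'), ('Missile', 'M1'),
             ('American', 'West'), ('Enemy', 'Nono', 'America')}
    print(('Criminal', 'West') in forward_chain(rules, facts))   # True

Running it derives Criminal(West) after two rounds of additions, matching the hand derivation given in the next subsection.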
This knowledge base contains no function symbols and is therefore an instance of the class
ll6.TALOO
of Datalog knowledge bases. Datalog is a language that is restricted to first-order definite clauses with no function symbols. Datalog gets its name because it can represent the type of
statements typically made in relational databases. We wi11 see that the absence of function symbols makes inference much easier. 9.3.2
A simple forward-chaining algorithm
The first forward-chaining algorithm we consider is a simple one, shown in Figure 9.3. Start ing from the known facts, it triggers all the rules whose premises are satisfied, adding their conclusions to the known facts. The process repeats until the query is answered (assuming that just one answer is required) or no new facts are added. Notice that a fact is not "new" RENAMI�G
if it is just a renaming of a known fact. One sentence is a renaming of another if they are identical except for the names of the variables. For example, Likes(x, IceCream) and
Likes(y, IceCream) are renamings of each other because they differ only in the choice of x or y; their meanings are identical: everyone likes ice cream. W� us� uur c.;rim� J.lrubkm tu illuslral� huw
FOL-FC-ASK
works.
Th� implkatiuu
sentences are (9.3), (9.6),
(9.7), and (9.8). Two iterations are required: • On the first iteration, rule (9.3) has unsatisfied premises. Rule (9.6) is satisfied with {xjMt}, and Sells(West,M1, Nono) is added. Rule (9.7) is satisfied with {xjMt}, and Weapon(M1) is added. Rule (9.8) is satisfied with {xj Nona}, and Hostile(Nona) is added. • On the second iteration, rule (9.3) is satisfied with {xj West, yjM1, zjNona}, and Criminal( West) is added. Figure 9.4 shows the proof tree that is generated. Notice that no new inferences are possible at this point because every sentence that could be concluded by forward chaining is already contained explicitly in the KB. Such a knowledge base is called a fixed point of the inference process. Fixed points reached by forward chaining with first-order definite clauses are similar to those for propositional forward chaining (page 258); the principal difference is that a first order fixed point can include universally quantified atomic sentences. FOL-FC-ASK is easy to analyze. First, it is sound, because every inference is just an application of Generalized Modus Ponens, which is sound. Second, it is complete for definite clause knowledge bases; that is, it answers every query whose answers are entailed by any knowledge base of definHe clauses. For Datalog knowledge bases, which contain no function symbols, the proof of completeness is fairly easy. We begin by counting the number of
332
Chapter
9.
Inference in First-Order Logic
function FOL-FC-ASK(KB, a) returns a substitution or false inputs: KB, the knowledge base, a set of first-order definite clauses a, the query, an atomic sentence local variables: new, the new sentences inferred on each iteration repeat tmtil new is empty new Colorable() which con-esponds to the reduced CSP shown in Figure
6.12 on page 224. Algorithms
for solving tree-structured CSPs can be applied directly to the problem of rule matching. •
We can try
to to eliminate redundant rule-matching attempts in the forward-chaining
algorithm, as described next.
Incremental forward chaining When we showed how forward chaining works on the crime example, we cheated; in partic ular, we omitted some of the rule matching done by the algorithm shown in Figure
9.3. For
example, on the second iteration, the rule
Missile(x) => Weapon(x) matches against
Missile(M1) (a.gain), and of course the conclusion Weapon(M1) is already
known so nothing happens. Such redundant rule matching
can
be avoided if we make the
Every new fact inferred on iteration t must be derived from at least one new fact inferred on iteration t - 1 . This is true because any inference that does not following observation:
require a new fact from iteration
t-1
could have been done at iteration
t-1
already.
This observation leads naturally to an n i cremental forward-chaining algorithm where,
t, we check a rule only if its premise includes a conjunct Pi that unities with a fact p� newly infen-ed at iteration t - 1 . The rule-matching step then fixes Pi to match with Jl., but at iteration
allows the other conjuncts of the rule to match with facts from any previous iteration. This algorithm generates exactly the same facts at each iteration as the algorithm in Figure 9.3, but is much more efficient. With suitable indexing, it is easy to identify all the rules that can be triggered by any given fact, and indeed many real systems operate in an "update" mode wherein forward chain ing occurs in response
to each new fact that is TELLed to the system. Inferences cascade
through the set of rules until the fixed point is reached, and then the process begins again for the next new fact. Typically, only a small fraction of the rules in the knowledge base are actually triggered by the addition of a given fact. This means that a great deal of redundant work is done in 1-epeatedly constructing partial matches that have some unsatisfied premises. Our crime ex ample is rather too small to show this effectively, but notice that a partial match is constructed on the first iteration between the rule
American(x) 1\ Weapon(y) 1\ Sells(x,y,z) 1\ Hostile(z) => C1'iminal(x) and the fact
American( West). This partial match is then discarded and t-ebuilt on the second
iteration (when the rule succeeds). It would be better to retain and gradually complete the
RETE
partial matches as new facts arrive, rather than discarding them. 3 The rete algorithm was the first to address this problem. The algorithm preprocesses
the set of rules in the knowledge base to construct a sort of dataflow network in which each
3 Rete is Latin for net. The English pronunciation rhymes with treaty.
Chapter
336
9.
Inference in First-Order Logic
node is a literal from a rule premise. Variable bindings flow through the network and are filtered out when they fail to match a literal. If two literals in a rule share a variable-for example,
Sells (x , y, z) 1\ Hostile(z)
in the crime example-then the bindings from each
literal are filtered through an equality node. A variable binding reaching a node for an n ary literal such as Sells(x, y, z) might have to wait for bindings for the other variables to be established before the process can continue. At any given point, the state of a rete network captures all the partial matches of the rules, avoiding a great deal of recomputation. Rete networks, and various improvements thereon, have been a key component of so PRODUCTION SYSTEM
called production systems, which were among the earliest forward-chaining systems in 4 widespread use. The XCON system (originally called R 1 ; McDennott, 1982) was built with a production-system architecture. X CON contained several thousand rules for designing configurations of computer components for customers of the Digital Equipment Corporation. It was one of the first clear commercial successes in the emerging field of expert systems. Many other similar systems have been built with the same underlying technology, which has
COGNITIVE ARCHITECTURES
been implemented in the general-purpose language OPS-5. Production systems are also popular in cognitive architectures-that is, models of hu man reasoning-such as ACT (Anderson, 1983) and SOAR (Laird et al., 1987). In such sys tems, the "working memory" of the system models human short-tenn memory, and the pro ductions are part of long-term memory. On each cycle of operation, productions are matched against the working memory of facts. A production whose conditions are satisfied can add or delete facts in working memory. In contrast to the typical situation n i databases, production systems often have many rules and relatively few facts. With suitably optimized matching technology, some modem systems can operate in real time with tens of millions of rules.
Irrelevant facts TI1e final source of inefficiency in forward chaining appears to be intrinsic to the approach and also arises in the propositional context. Forward chaining makes all allowable inferences based on the known facts, even ifthey are irrelevant to the goal at hand. In our crime example, there were no rules capable of drawing irrelevant conclusions, so the lack of directedness was not a problem. In other cases (e.g., if many rules describe the eating habits of Americans and the prices of missiles), FOL-FC-ASK will generate many irrelevant conclusions. One way to avoid drawing irrelevant conclusions is to use backward chaining, as de scribed in Section 9.4. Another solution is to restrict forward chaining to a selected subset of rules, as in PL-FC-ENTAILS? (page 258). A third approach has emerged in the field of de DEDUCTIVE ��BASES
ductive databases, which are large-scale databases, like relational databases, but which use forward chaining as the standard inference tool rather than SQL queries. The idea is to rewrite the mle set, using information from the goal, so that only relevant variable bindings-those
MAGIC SET
belonging to a so-called magic set
are considered dm·ing forward inference. For example, if
-
Criminal(West), the mle that concludes Criminal(x) will be rewritten to include an extra conjm1ct that constrains the value of x: Magic(x) 1\ Amel'ican(x) 1\ Weapon(y) 1\ Sells(x, y,z) 1\ Hostile(z) =? Criminal(x) . the goal is
4
The word production in production systems denote� a condition-action rule.
Section 9.4.
Backward Chaining
337
The fact Magic( West) is also added to the KB. In this way, even if the knowledge base contains facts about millions of Americans, only Colonel West will be considered during the forward inference process. The complete process for defining magic sets and rewriting the knowledge base is too complex to go into here, but the basic idea is to perform a sort of "generic" backward inference from the goal in order to work out whlch variable bindings need to be constrained. The magic sets approach can therefore be thought of as a kind of hybrid between forward inference and backward preprocessing. 9.4
BACKWARD CHAINING
The second major family of logical inference algorithms uses the backward chaining ap proach introduced in Section 7.5 for definite clauses. These algorithms work backward from the goal, chaining through rules to find known facts that support the proof. We describe the basic algorithm, and then we describe how it is used in logic programming, which is the most widely used fonn of automated reasoning. We also see that backward chaining has some disadvantages compared with forward chaining, and we look at ways to overcome them. Fi nally, we look at the close connection between logic programming and constraint satisfaction problems. 9.4.1
GENERATOR
A backward-chaining algorithm
Figure 9.6 shows a backward- ) anti l.i.uk ( 1.> , = z >= 1. Standard logic programs are just a special case of CLP in which the solution constraints must be equality constraints-that is, bindings. CLP systems incorporate various constraint-solving algorithms for the constraints al lowed in the language. For example, a system that allows linear inequalities on real-valued variables might include a linear programming algorithm for solving those constraints. CLP systems also adopt a much more flexible approach to solving standard logic programming queries. For example, instead of depth-first, left-to-right backtracking, they might use any of the more efficient algorithms discussed in Chapter 6, including heuristic conjunct ordering, backjumping, cutset conditioning, and so on. CLP systems therefore combine elements of constraint satisfaction algorithms, logic programming, and deductive databases. Several systems that allow the programmer more control over the search order for in ference have been defined. The MRS language (Genesereth and Smith, 1981; Russell, 1985) allows the programmer to write metarules to determine which conjuncts are tried first. The user could write a rule saying that the goal with the fewest variables should be tried first or could write domain-specific rules for particular predicates.
RESOLUTION
The last of our three families of logical systems is based on resolution. We saw on page 250 that propositional resolution using refutation is a complete inference procedure for proposi tional logic. In this section, we describe how to extend resolution to first-order logic. 9.5.1
Conjunctive normal form for first-order logic
As in the propositional case, first-order resolution requires that sentences be in conjunctive normal form (CNF)-that is, a conjunction of clauses, where each clause is a disjunction of literals.6 Literals can contain variables, which are assumed to be universally quantified. For example, the sentence
\fx American(x) A Weapon(y) A Sells(x,y,z) A Hostile(z)
=?
Criminal(x)
becomes, in CNF,
•Ame1'ican(x) V ..., Weapon(y) V •Sells(x,y,z) V •Hostile(z) V Criminal(x) . Every sentence offirst-order logic can be converted into an inferentially equivalent CNF sentence. In particular, the CNF sentence will be unsatisfiable just when the original sentence is unsatisfiable, so we have a basis for doing proofs by contradiction on the CNF sentences. 6 A clause can also be represented as an implication with a conjunction of atoms in the premise and a disjunction of atoms in the conclusion (Exercise 7.13). This is called implicative nonnal form or Kowalski form (especially when written with a right-to-left implication symbol (Kowalski, 1979)) and is often much easier to read.
346
Chapter
9.
Inference in First-Order Logic
The procedure for conversion to CNF is similar to the propositional case, which we saw on page 253. The principal difference arises from the need to eliminate existential quantifiers. We illustrate the procedure by translating the sentence "Everyone who loves all animals is loved by someone," or
Vx [\fy Animal(y) � Loves(x,y)] � [3y Loves(y,x)] . The steps are as follows: •
Eliminate implications:
Vx [-Ny ·Animal(y) V Loves(x,y)] V [3y Loves(y,x)] . •
Move -, inwards: In addition to the usual rules for negated connectives, we need rules for negated quantifiers. Thus, we have
-,\f x p -,3x p
becomes becomes
3 x •P V x •P .
Our sentence goes through the following transformations:
Vx [3y •(•Animal(y) V Loves(x,y))] V [3y Loves(y,x)] . Vx [3y ••Animal(y) J\ •Loves(x,y)] v [3y Loves(y,x)] . Vx [3y Animal(y) 1\ -,Loves(x,y)] V [3y Loves(y,x)] . Notice how a universal quantifier (Vy) in the premise of the implication has become an existential quantifier. The sentence now reads "Either there is some animal that x doesn't love, or (if this is not the case) someone loves x." Clearly, the meaning of the original sentence has been preserved. •
Standardize variables: For sentences like (3x P(x)) V (3 x Q(x)) which use the same variable name twice, change the name of one of the variables. This avoids confusion later when we drop the quantifiers. Thus, we have
Vx [3y Animal(y) 1\ ·Loves(x,y)] V [3z Loves(z,x)] . SKOLEMIZATION
•
Skolemize: Skolemization is the process of removing existential quantifiers by elimi nation. In the simple case, it is just like the Existential Instantiation rule of Section 9.1: translate 3 x P(x) into P(A), where A is a new constant. However, we can't apply Ex istential Instantiation to our sentence above because it doesn't match the pattem 3 v a; only parts of the sentence match the pattem. If we blindly apply the rule to the two matching parts we get
Vx [Animal(A) 1\ -,Loves(x,A)] V Loves(B,x) , whkh hali the:: wruug me::aniug e::ulirely: it says lltal e::ve::ryuue:: e:: ilht:r fails lu love:: a par
ticular animal A or is loved by some particular entity B. In fact, our original sentence allows each person to fail to love a different animal or to be loved by a different person. Thus, we want the Skolem entiHes to depend on x and z:
SKOLEM FUNCTION
Vx [Animal(F(x)) 1\ ·Loves(x,F(x))] V Loves(G(z),x) . Here F and G are Skolem functions. The general rule is that the arguments of the Skolem function are all the w1iversally quantified variables in whose scope the exis tential quantifier appears. As with Existential Instantiation, the Skolemized sentence is satisfiable exactly when the original sentence is satisfiable.
Section 9.5.
Resolution
347
• Drop universal quantifiers: At this point, all remaining variables must be universally quantified. Moreover, the sentence is equivalent to one in which all the universal quan tifiers have been moved to the left. We can therefore drop the universal quantifiers:
[Animal(F(x)) A ·Loves(x, F(x))] V Loves(G(z) , x) . • Distribute V over A:
[Animal(F(x)) V Loves(G(z),x)] A [•Loves(x,F(x)) V Loves(G(z),x)] . This step may also require flattening out nested conjunctions and disjunctions. The sentence is now in CNF and consists of two clauses. It is quite unreadable. (It may help to explain that the Skolem function F(x) refers to the animal potentially unloved by x, whereas G(z) refers to someone who might love x.) Forttmately, humans seldom need look at CNF sentences-the translation process is easily automated.
9.5.2
The resolution inference rule
The resolution rule for first-order clauses is simply a lifted version of the propositional reso lution rule given on page 253. Two clauses, which are assumed to be standardized apart so that they share no variables, can be resolved if they contain complementary literals. Propo sitional literals are complementary if one is the negation of the other; first-order literals are complementary if one unifies with the negation ofthe other. Thus, we have
£1 V V lk, m1 V V mn V £;_, V l!i+l V V l!k V m1 V Vmj - 1 V mJ+l V · · ·
· · ·
SUBST(B,£,
V
· · ·
· · ·
· · ·
· · ·
V mn)
where UNIFY(li, •mj) = B . For example, we can resolve the two clauses
[Animal(F(x)) V Loves(G(x), x)] and
[•Loves(u, t>) V ·Kills(u, v)]
by eliminating the complementary literals Loves(G(x) , x) and B = {ujG(x), vjx}, to produce the resolvent clause
•Loves(u,v), with unifier
[Animal(F(x)) v ·Kills(G(x),x)] . BINARY RESOLUTION This rule is called the binary resolution rule because it resolves exactly two literals. The
binary resolution rule by itself does not yield a complete inference procedure. The full reso lution rule resolves subsets ofliterals in each clause that are unifiable. An altemative approach is to extend factoring-the removal of redundant literals-to the first-order case. Proposi tional factoring reduces two literals to one if they are identical; first-order factoring reduces two literals to one if they are u11ijiable. The unifier must be applied to the entire clause. The combination of binary resolution and factoring is complete.
9.5.3
Example proofs
Resolution proves that KB f= a by proving KB A •a: unsatisfiable, that is, by deriving the empty clause. The algorithmic approach is identical to the propositional case, described in
Chapter
348
Inference in First-Order Logic
9.
I k ,.-..- ..,-,l" i"' 'I1-,--,-- -:::...,::- -::-:- ---:---::-:=::- ----:,..: --, -:- I I ...,.-:::-
I
-,AJ1rerican(x)fMJ-.Weapor(y�eiJ:t(x.y,;r.JfN.l-'Ho:;�i/e( Z)JN.CriminoJ(x) Ameri c a n ( lfe. \1 )
ai(WeSI) 1-Sell.r("IVe.u,y,<Jnft
� At(Spar8,Ground)
-------
82
At(Sparo,Trunk)
� At(Spar8,Axl9)
�At(Ftat,G r ound)
Classical Planning
81
,-
Re1110Y9( --,.� Spare -. .T:-ru-. rl- A LS(B) - Dumtion(A) .
Chapter
404
I \
[0,0] Start
I I '
0
[30,45]
[0,15]
1-- AddWheels1 1--
A
,;,
.,0
,;,
sO
100
1�0
tOo
Figure 11.3 A solution to the job-shop scheduling problem from Figure 11.1, taking into account resource constraints. The left-hand margin lists the three reusable resources, and actions are shown aligned horizontally with the resources they use. There are two possi-
bit: sdtedules, depc::ntling un whidt assembly uses Lite engine huisL firsL; we've shown Lite shortest-duration solution, which takes 115 minutes.
require the same EngineHoist and so cannot overlap. The "cannot overlap" constraint is a disjunction of two linear inequalities, one for each possible ordering. The introduction of disjunctions turns out to make scheduling with resource constraints NP-hard. Figure 1 1 . 3 shows the solution with the fastest completion time, 1 1 5 minutes. Tllis is 30 minutes longer than the 85 minutes required for a schedule without resource constraints. Notice that there is no time at which both inspectors are required, so we can immediately move one of our two inspectors to a more productive position. The complexity of scheduling with resource constraints is often seen in practice as well as in theory. A challenge problem posed in 1963-to find the optimal schedule for a problem involving just 10 machines and 10 jobs of 100 actions each-went unsolved for
MINIMUM SLACK
23 years (Lawler et al., 1993). Many approaches have been tried, including branch-and bound, simulated armealing, tabu search, constraint satisfaction, and other techniques from Chapters 3 ar1d 4. One simple but popular· heuristic is the minimwn slack algorithm: on each iteration, schedule for the earliest possible start whichever unscheduled action has all its predecessors scheduled and has the least slack; then update the ES and LS times for each affected action and repeat. The heuristic resembles the minimum-remaining-values (MRV) heuristic in constraint satisfaction. It often works well in practice, but for our assembly problem it yields a 1 30-minute solution, not the 1 1 5-minute solution of Figure 1 1 .3. Up to this point, we have assumed that the set of actions and ordering constraints is fixed. Under these assumptions, every scheduling problem can be solved by a nonoverlapping sequence that avoids all resource conflicts, provided that each action is feasible by itself. If a scheduling problem is proving very difficult, however, it may not be a good idea to solve it tllis way-it may be better to reconsider the actions and constraints, in case that leads to a much easier scheduling problem. Thu�, it makes sense to integrate planning and scheduling by taking into account durations and overlaps during the construction of a partial-order plan. Several of the planning algorithms in Chapter I 0 can be augmented to handle this information. For example, partial-order planners can detect resource constraint violations in much the same way they detect conflicts with causal links. Heuristics can be devised to estimate the total completion time of a plan. This is currently an active area of research.
Chapter
406 1 1 .2
Planning and Acting in the Real World
11.
HIERARCHICAL PLANNING
The problem-solving and planning methods of the preceding chapters
all operate with a fixed
set of atomic actions. Actions can be strung together into sequences or branching networks; state-of-the-art algorithms can generate solutions containing thousands of actions. For plans executed by the human brain, atomic actions are muscle activations. In very round numbers, we have about
103
muscles to activate
(639, by
some counts, but many of
them have multiple subtmits); we can modulate their activation perhaps and we are alive and awake for about
1013
109
seconds in
all.
10 times per second;
Thus, a human life contains about
actions, give or take one or two orders of magnitude. Even if we restrict ourselves to
planning over much shorter time horizons-for example, a two-week vacation in Hawaii-a detailed motor plan would contain around
1010 actions.
This is a lot more than
1000.
To bridge this gap, A I systems will probably have to do what humans appear to do: plan at higher levels of abstraction. A reasonable plan for the Hawaii vacation might be "Go to
1 1 to Honolulu; do vacation stuff for two 12 back to San Francisco; go home." Given such a plan,
San Francisco airport; take Hawaiian Airlines flight weeks; take Hawaiian Airlines flight
the action "Go to San Francisco airport" can be viewed as a planning task in itself, with a solution such as "Drive to the long-tenn parking lot; park; take the shuttle to the tenninal." Each of these actions, in tum, can be decomposed further, tmtil we reach the level of actions that can be executed without deliberation to generate the required motor control sequences. In this example, we see that planning can occur
botl1 before and during the execution
of the plan; for example, one would probably defer the problem of planning a route from a
parking sput iu Jung-lt:nrl parkiug tu tl1t: shuttlt: bus stup tmlil
a
parti�;ular parking spul has
been found during execution. Thus, that particular action will remain at an abstract level prior to the execution phase. We defer discussion of this topic until Section HIERARCHICAL DECOMPOSITION
concentrate on the aspect of hierarchical
1 1.3.
Here, we
decomposition, an idea that pervades almmt all
attempts to manage complexity. For example, complex software is created from a hierarchy of subroutines or object classes; armies operate as a hierarchy of units; govenunents and cor porations have hierarchies of departments, subsidiaries, and branch offices. The key benefit of hierarchical structure is that, at each level of the hierarchy, a computational task, military mission, or administrative function is reduced to a
small number of activities at the next lower
level, so the computational cost of finding the correct way to arrange those activities for tlle current problem is small. Nonhierarchical methods, on the other hand, reduce a task to a
large number ofindividual actions; for large-scale problems, this is completely impractical. 11.2.1
High-level actions
The basic formalism we adopt to w1destand r hierarchical decomposition comes from the area HIERARCHICALTASK NETWORK
of hierarchical task networks or HTN planning. As in classical planning (Chapter
10), we
assume full observability and determinism and the availability of a set of actions, now called PRIMITIVEACTION HIGH·LE\IELACTION
primitive actions, with standard precondition-effect schemas. The key additional concept is the high-level action or HLA-for example, the action "Go to San Francisco airport" n i the
Section
Hierarchical Planning
1 1 .2.
407
Refinement( Go( Home, SFO), STEPS: [Drive( Home, SFOLongTermParl.,"ing), Shuttle(SFOLongTermPa,·lcing, SFO)] ) Refinement( Go( Home, SFO), STEPS: [Taxi(Home, SFO)] ) Refinement(Navigate([a,b], [x, y]), PRECOND: a=x 1\ b = y STEPS: [] ) Refinement(Navigate([a,b], [x, y]), PRECOND: Connected([a, b], [a- 1, b]) STEPS: [Left, Navigate([a - 1, b], [x,y])] ) Refinement(Navigate([a,b], [x, y]), PRECOND: Connected([a, b], [a+ 1, b]) STEPS: [Right, Navigate([a + 1, b] , [x, y])] )
Figure 11.4 Definitions of possible refinements for two high-level actions: going to San Francisco airport and navigating in the vacuum world. In the latter case, note the recursive nature of the refinements and the use of preconditions.
REFINEMENT
example given earlier. Each HLA has one or more possible
refinements, into a sequence1
of actions, each of which may be an HLA or a primitive action (which has no refinements by definition). For example, the action "Go to San Francisco airport;• represented formally as
Go(Home, SFO), might have two possible refinements,
same figure shows a
as shown in Figure
1 1 .4.
The
recursive refinement for navigation in the vacuwn world: to get to a
destination, take a step, and then go to the destination. These examples show that high-level actions and their refinements embody knowledge about how to do things. For instance, the refinements for
Go( Home, SFO) say that to get to
the airport you can drive or take a taxi; buying milk, sitting down, and moving the knight to
e4 are not to be considered. IMPLEMEN'DITION
An HLA refinement that contains only primitive actions is called an implementation
[Right, Right, Down] and [Down, Right, Right] both implement the HLA Navigate([!, 3], [3, 2]). An implementation
of the HLA. For example, in the vacuum world, the sequences
of a high-level plan (a sequence of HLAs) is the concatenation of implementations of each HLA in the sequence. Given the precondition-effect definitions of each primitive action, it is straightforward to determine whether any given implementation of a high-level plan achieves
a high-level plan achieves the goalfrom a given state if at least one of its implementations achieves the goalfrom that state. The "at least one" in this definition is cmcial-not all implementations need to achieve the goal, because the agent gets the goal. We can say, then, that
1 HTN planners often allow refinement into partially ordered plans, and they allow the refinements of two different HLAs in a plan to share actions. We omit these important complications in the interest of Ullderstanding the basic concepts of hierarchical planning.
408
Chapter
11.
Planning and Acting in the Real World
to decide which implementation it will execute. Thus, the set of possible implementations in
HTN planning-each of which may have a different outcome-is not the same as the set of possible outcomes in nondeterministic planning. There, we required that a plan work for all outcomes because the agent doesn't get to choose the outcome; nature does. The simplest case is an HLA that has exactly one implementation. In that case, we can compute the preconditions and effects of the HLA from those of the implementation (see Exercise 11.3) and then treat the HLA exactly as if it were a primitive action itself. It can be shown that the right collection of HLAs can result in the time complexity of blind search dropping from exponential in the solution depth to linear in the solution depth, al though devising such a collection of HLAs may be a nontrivial task in itself. When HLAs have multiple possible implementations, there are two options: one is to search among the implementations for one that works, as in Section 1 1 .2.2; the other is to reason directly about the HLAs-despite the multiplicity of implementations-as explained in Section 1 1.2.3. The latter method enables the derivation of provably correct abstract plans, without the need to consider their implementations.
11.2.2
Searching for primitive solutions
HTN planning is often formulated with a single "top level" action cal1ed Act, where the aim is to find an implementation of Act that achieves the goal. This approach is entirely general. For
example, classical planning problems can be defined as follows: for each primitive action ai, provide one refinement of Act with steps [ai,Act] . That creates a recursive definition of Act that lets us add actions. But we need some way to stop the recursion; we do that by providing one more refinement for Act, one with an empty list of steps and with a precondition equal to the goal of the problem. This says that if the goal is already achieved, then the right implementation is to do nothing. The approach leads to a simple algorithm: repeatedly choose an HLA in the current plan and replace it with one of its refinements, until the plan achieves the goal. One possible implementation based on breadth-first tree search is shown in Figure 1 1 .5. Plans are consid ered in order of depth of nesting of the refinements, rather than number of primitive steps. It is straightforward to design a graph-search version of the algorithm as well as depth-first and iterative deepening versions. In essence, this form of hierarchical search explores the space of sequences that confonn to the knowledge contained in the HLA library about how things are to be done. A great deal of knowledge can be encoded, not just in the action sequences specified in each refinement but also in the preconditions for the refinements. For some domains, HTN planners have been able to generate huge plans with very little search. For exampie, 0-PLAN (Bell and Tate, 1 985), which combines HTN planning with scheduling, has been used to develop production plans for Hitachi. A typical problem involves a product line of 350 different products, 35 assembly machines, and over 2000 different operations. The planner generates a 30-day schedule with three 8-hour shifts a day, involving tens of millions of steps. Another important aspect of HTN plans is that they are, by definition, hierarchically structured; usually this makes them easy for humans to understand.
Section
1 1 .2.
Hierarchical Planning
409
ftmction HIERARCHICAL-SEARCH(problem, hierarchy) returns a solution, or failure
frontier..- a FIFO queue with [Act] as the only element loop do
ifEMPTY?(ft·ontier) then return failure plan ..- POP(/rontie1·) I* chooses the shallowest plan in ft-ontier *I hla ..- the first HLA in plan, or n1.1ll if none prefix,suffix ..- the action subsequences before and after hla in plan outcome t- RESULT(pmblem.INITIAL-STATE, p1•ejix) if hla is null then I* so plan is primitive and outcome is its result *I if outcome satisfies problem. GOAL then return plan else for each sequence in REFINEMBNTS(hla, outcome, hiem1·chy) do ft·ontier ..- INSERT(APPBND(prefix, sequence, suffix),frontie?') Figure 11.5 A breadth-first implementation of hierarchical forward planning search. The initial plan supplied to the algorithm is [Act]. The REFINEMENTS function returns a set of action sequences, one for each refinement of the HLA whose preconditions are satisfied by the specified state, outcome.
The computational benefits of hierarchical search can be seen by examining an ide alized case. Suppose that a planning problem has a solution with
d primitive actions.
For
a nonhierarchical, fmward state-space planner with b allowable actions at each state, the cost is
O(bd), as explained
in Chapter 3. For an HTN planner, let us suppose a very reg
ular refinement structure: each nonprimitive action has
r
possible refinements, each into
k actions at the next lower level. We want to know how many different refinement trees there are with this structure.
Now, if there
are
d
actions at the primitive level, then the
number of levels below the root is logk d, so the number of internal refinement nodes is 1 + k + k2 + + k10gkd-1 = (d - I)j(k - 1 ) . Each internal node has r possible refine ·
·
·
1 k-1) possible regular decomposition trees could be constructed. Examining ments, so r(d- )/( this fonnula, we see that keeping r small and k large can result in huge savings: essentially we are taking the kth root of the nonhierarchical cost, if b and
·r
are comparable. Small -r and
large k means a library of HLAs with a small number of refinements each yielding a long action sequence (that nonetheless allows us to solve any problem). This is not always pos sible: long action sequences that are usable across a wide range of problems
are extremely
precious. The key to HTN planning, then, is the construction of a plan library containing known methods for implementing complex, high-level actions. One method of constructing the li brary is to
learn the methods from problem-solving experience.
After the excruciating ex
petience of constructing a plan from scratch, the agent can save the plan in the library as a method for implementing the high-level action defined by the task. In this way, the agent can become more and more competent over time as new methods are built on top of old methods. One important aspect of this learning process is the ability to
generalize the
methods that
are constructed, eliminating detail that is specific to the problem instance (e.g., the name of
Chapter
410
11.
Planning and Acting in the Real World
the builder or the address of the plot of land) and keeping just the key elements of the plan. Methods for achieving this kind of generalization are described in Chapter inconceivable that humans could be as competent as they
11.2.3
19.
It seems to us
are without some such mechanism.
Searching for abstract solutions
TI1e hierarchical search algorithm in the preceding section refines HLAs all the way to primi tive action sequences to determine if a plan is workable. This contradicts common sense: one should be able to determine that the two-HLA high-level plan
[Drive(Home, SFOLongTermParking), Shuttle(SFOLongTermParking,SFO)] gets one to the airport without having to determine a precise route, choice of parking spot, and so on. The solution seems obvious: write precondition-effect descriptions of the HLAs, just as we write down what the primitive actions do. From the descriptions, it ought to be easy to prove that the high-level plan achieves the goal. This is the holy grail, so to speak, of hierarchical planning because if we derive a high-level plan that provably achieves the goal,
working in a small search space of high-level actions, then we can commit to that plan and work on the problem of refining each step of the plan. This gives us the exponential reduction we seek. For this to work, it has to be the case that every high-level plan that "claims" to achieve the goal (by virtue of the descriptions of its steps) does in fact achieve the goal in
OC.Wr.MIARD
REANEMENT PROPERTY
the sense defined earlier: it must have at least one implementation that does achieve the goal. This property has been called the
downward refinement property for HLA descriptions.
Writing HLA descriptions that satisfy the downward refinement property is, in princi ple, easy: as long as the descriptions are true, then any high-level plan that claims to achieve the goal must in fact do s�therwise, the descriptions
are making some false claim about
what the HLAs do. We have already seen how to write true descriptions for HLAs that have exactly one implementation (Exercise
1 1 .3);
a problem arises when the HLA has
multiple
implementations. How can we describe the effects of an action that can be implemented in many different ways?
One safe answer (at least for problems where aU preconditions and goals are positive) is
to include only the positive effects that are achieved by every implementation of the HLA and the negative effects of any implementation. Then the downward refinement prope1ty would be satisfied. Unfortunately, this semantics for HLAs is much too conservative. Consider again the HLA
Go(Home,SFO), which has two refinements, and suppose, for the sake of argu
ment, a simple world in which one can always drive to the airport and park, but taking a taxi requires
Cash as a precondition.
In that case,
Go(Home, SFO)
doesn't always get you to
Cash is false, and so we cannot assert At(Agent, SFO) as an effect of the HLA. This makes no sense, however; if the agent didn't have Cash, it would the airport. In particular, it fails if
drive itself. Requiring that an effect hold for every implementation is equivalent to assuming
that someone else-an adversary-will choose the implementation. It treats the HLA's mul tiple outcomes exactly as if the HLA were a nondeterministic action, as in Section
4.3.
For
our case, the agent itself will choose the implementation.
demonic nondetennin ism for the case where an adversary makes the choices, contrasting this with angelic nondeThe progra1runing languages community has coined the term
DEMONIC NONDETERMINISM
Section 11.2.
it '',
Hierarchical Planning
•
•
I I
I \
�
-
I I I I I /
411
•
•
•
•
•
•
/
•
•
•
•
•
•
•
• •
- - - - - -
•
•
•
•
'•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
(a)
(b)
Figure 11.6 Schematic examples of reachable sets. The set of goal states is shaded. Black and gray arrows indicate possible implementations of h1 and h2, respectively. (a) The reach able set of an HLA h1 in a state s. (b) The reachable set for the sequence [hbh2]· Because this intersects the goal set, the sequence achieves the goal. ANGELIC NONDETEAMINISM ANGELIC SEMANTICS REACHABLE SET
terminisrn, where the agent itself makes the choices. We borrow this term to define angelic semantics for HLA descriptions. The basic concept required for tmderstanding angelic se mantics is the reachable set of an HLA: given a state s, the reachable set for an HLA h, written as REACH(s, h), is the set of states reachable by any of the HLA's implementations. The key idea is that the agent can choose
which
element of the reachable set it ends up in
when it execut.es the HLA; thus, an HLA with multiple refinements is more "powerful" than the same HLA with fewer refinements. We can also define the reachable set of a sequences of
[h1, h2] is the union of all the reachable sets obtained by applying h2 in each state in the reachable set of h1: REACH (s , h2) . REACH(s, [h1, h2]) = u HLAs. For example, the reachable set of a sequence
'
s1EREACH(s, h1 )
Given these definitions, a high-level plan-a sequence of HLAs-achieves the goal if its reachable set intersects the set of goal states. (Compare this to the much stronger condition for demonic semantics, where every member of the reachable set has to be a goal state.) Conversely, if the reachable set doesn't intersect the goal, then the plan definitely doesn't work. Figure I 1 .6 illustrates these ideas. The notion of reachable sets yields a straightforward algorithm: search among high level plans, looking for one whose reachable set intersects the goal; once that happens, the algorithm can
commit
to that abstract plan, knowing that it works, and focus on refining
the plan further. We will come back to the algorithmic issues later; first, we consider the question of how the effects of an HLA-the reachable set for each possible initial state-are represented. As with the classical action schemas of Chapter 10, we represent the
changes
Chapter
412
11.
Planning and Acting in the Real World
made to each tluent. Think of a fluent as a state variable. A primitive action can add or delete a variable or leave it unchanged. (With conditional effects (see Section 11.3.1) there is a fowth possibility: flipping a variable to its opposite.) An HLA under angelic semantics can do more: it can control the value of a variable, setting it to true or false depending on which implementation is chosen. In fact, an HLA can have nine different effects on a variable: if the variable starts out true, it can always keep it true, always make it false, or have a choice; if the variable starts out false, it can always keep it false, always make it true, or have a choice; and the three choices for each case can be combined arbitrarily, making nine. Notationally, this is a bit challenging. We'll use the symbol to mean "possibly, if the agent so chooses." Thus, an effect +A means "possibly add A," that is, either leave A unchanged or make it true. Similarly, =A means "possibly delete A" and ±A means "possibly add or delete A." For example, the HLA Go( Home, SFO), with the two refinements shown in Figure 1 1 .4, possibly deletes Cash (if the agent decides to take a taxi), so it should have the effect =Cash. Thus, we see that the descriptions of HLAs are derivable, in principle, from the descriptions of their refinements-in fact, this is required if we want true HLA descriptions, such that the downward refinement property holds. Now, suppose we have the following schemas for the HLAs hi and �:
Action(hi, PRECOND:--.A, EFFECT: A 1\ =B) , Action(h2, PRECOND:--.B, EFFECT:.:tA 1\ ±C) . That is, hi adds A and possible deletes B, while h2 possibly adds A and has full control over C. Now, if only B is true in the initial state and the goal is A 1\ C then the sequence [hi, h2] achieves the goal: we choose an implementation of hi that makes B false, then choose an implementation of h2 that leaves A true and makes C true.
OPTIMISTIC DESCRIPTION PESSIMISTIC DESCRIPTION
The preceding discussion assumes that the effects of an HLA-the reachable set for any given initial state-can be described exactly by describing the effect on each variable. It would be nice if this were always true, but in many cases we can only approximate the ef fects because an HLA may have infinitely many implementations and may produce arbitratily wiggly reachable sets-rather like the wiggly-belief-state problem illustrated in Figure 7.21 on page 271. For example, we said that Go( Home, SFO) possibly deletes Cash; it also possibly adds At( Car, SFOLongTermParking); but it cannot do both-in fact, it must do exactly one. As with belief states, we may need to write approximate descriptions. We will use two kinds of approximation: an optimistic description REACH+( s, h) of an HLA h may overstate the reachable set, while a pessimistic description REACH- (s, h) may w1derstate the reachable set. Thus, we have
REACW(s,h)
�
REACH(s,h)
�
REACH+(s,h) . For example, an optimistic description of Go(Home, SFO) says that it possible deletes Cash and possibly adds At( Car, SFOLongTermParking). Another good example arises in the 8-puzzle, half of whose states are unreachable from any given state (see Exercise 3.4 on page 113): the optimistic description of Act might well include the whole state space, since the exact reachable set is quite wiggly. With approximate descriptions, the test for whether a plan achieves the goal needs to be modified slightly. If the optimistic reachable set for the plan doesn't intersect the goal,
Section 1 1 .2.
413
Hierarchical Planning
,
•
•
•
•
I I I I I I
.,.. - - - - - - - - - - '
•
•
• :
•
•
•
•
•
•
•
•
•
�-·
- - - - - - -,
'
•
,
_
•
•
� . I I I I \
I
_
_
_
• _
_
_
_
\
I I I I
• :
_
_
,
(a)
I I I I I I I I I
•
I
I I I I
ii
\
..- - - - - - - - - - - '
•
•
• :
•
•
•
•
•
•
•
•
'--
I I I I I I I I I
-------,
•
•
•
•
•
•
•
•
•
•
•
(b)
Figure 11.7 Goal achievement for high-level plans with approximate descriptions. The set of goal states is shaded. For each plan, the pessimistic (solid lines) and optimistic (dashed lines) reachable sets are shown. (a) The plan indicated by the black arrow definitely achieves the goal, while the plan indicated by the gray arrow definitely doesn't. (b) A plan that would need to be refined further to determine if it really does achieve the goal.
then the plan doesn't work; if the pessimistic reachable set intersects the goal, then the plan does work (Figure 1 1 .7(a)). With exact descriptions, a plan either works or it doesn't, but with approximate descriptions, there is a middle ground: if the optimistic set intersects the goal but the pessimistic set doesn't, then we cannot tell if the plan works (Figure 1 1 .7(b)). When this circwnstance arises, the uncertainty can be resolved by refining the plan. This is a very common situation in human reasoning. For example, in planning the aforementioned two-week Hawaii vacation, one might propose to spend two days on each of seven islands. Prudence would indicate that this ambitious plan needs to be refined by adding details of inter-island transportation. An algorithm for hierarchical planning with approximate angelic descriptions is shown in Figure 11.8. For simplicity, we have kept to the same overall scheme used previously in Figure 11.5, that is, a breadth-first search in the space of refinements. As just explained, the algorithm can detect plans that will and won't work by checking the intersections of the opti mistic and pessimistic reachable sets with the goal. (The details of how to compute the reach able sets of a plan, given approximate descriptions of each step, are covered in Exercise 1 1.5.) When a workable abstract plan is found, the algorithm decomposes the original problem into subproblems, one for each step of the plan. The initial state and goal for each subproblem are obtained by regressing a guaranteed-reachable goal state through the action schemas for each step of the plan. (See Section 10.2.2 for a discussion of how regression works.) Fig ure 1 1 .6(b) illustrates the basic idea: the right-hand circled state is the guaranteed-reachable goal state, and the left-hand circled state is the intermediate goal obtained by regressing the
Chapter
414
11.
Planning and Acting in the Real World
function ANGELIC-SEARCH(problem, hierm·chy, initialPlan) returns solution or fail
frontie1· ..- a FIFO queue with initialPlan as the only element loop do if EMPTY?(fnmtier) then return fail plan ..- POP(fnmtie•·) /* chooses the shallowest node in fnmtie•· *I if REACH+ (pmblem.INITIAL-STATE, plan) intersects pmblem.GOAL then if plan is primitive then return plan /* REACH+ is exact for primitive plans */
guamnteed t- REACH- (problem.INITIAL-STATE, plan)
n
problem.GOAL
if guamnteed#{ } and MAKING-PROGRESS(plan, initialPlan) then finalState .--any element of guaranteed return
DECOMPOSE(hiemt-chy, problem.INITIAL-STATE, plan,fina/State)
hla .-- some HLA in plan prefix,suffix
Red(x) 1\ Round(x) .
Thus, we can write down useful facts about categories without exact defini tions. The difficulty of providing exact definitions for most natural categories was explained in depth by Wittgenstein (1953). He used the example of games to show that members of a category shared "family resemblances" rather than necessary and sufficient characteristics: what strict definition encompasses chess, tag, soli taire, and dodgeball? The utility of the notion of strict definition was also challenged by Quine (1953). He pointed out that even the definition of "bachelor" as an un married adult male is suspect; one might, for example, question a statement such as "the Pope is a bachelor." While not strictly false, this usage is certainly infe licitous because it induces unintended inferences on the part of the listener. The tension could perhaps be resolved by distinguishing between logical definitions suitable for internal knowledge representation and the more nuanced criteria for felicitous linguistic usage. The latter may be achieved by "filtering" the assertions derived from the former. It is also possible that failures of linguistic usage serve as feedback for modifying n i ternal definitions, so that filtering becomes unnecessary.
444
Chapter 12.2.2
12.
Knowledge Representation
Measurements
In both scientific and commonsense theories of the world, objects have height, mass, cost, MEASURE
and so on.
The values that we assign for these properties are called measures.
Ordi
nary quantitative measures are quite easy to represent. We imagine that the universe in cludes abstract "measure objects," such as the length that is the length of this line seg ment:
. We can call this length 1.5 inches or 3.81 centimeters. Thus,
the same length has different names in our language.We represent the length with a units UNITS FUNCTION
function that takes a number as argwnent. (An altemative scheme is explored in Exer
L1, we can write Length(L1) = Inches(1.5) = Centimeters(3.81) .
cise 12.9.) If the line segment is called
Conversion between units is done by equating multiples of one unit to another:
Centimeters(2.54 x d) = Inches(d) . Similar axioms can be written for pounds and kilograms, seconds and days, and dollars and cents. Measures can be used to describe objects as follows:
Diameter(Basketball12) = Inches(9.5) . ListPrice(Basketball12) = $(19) . d E Days =? Dumtion(d) = Hours(24) . Note that $(1) is not a dollar bill! One can have two dollar bills, but there is only one object named $(1). Note also that, while Inches(O) and Centimeters(O) refer to the same zero length, they are not identical to other zero measures, such as Seconds(O). Simple, quantitative measures
are
easy to represent. Other measures present more of a
problem, because they have no agreed scale of values. Exercises have difficulty, desserts have deliciousness, and poems have beauty, yet numbers cannot be assigned to these qualities. One might, in a moment of pure accountancy, dismiss such properties as useless for the purpose of logical reasoning; or, still worse, attempt to impose a numerical scale on beauty. This would be a grave mistake, because it is unnecessary. The most important aspect of measures is not
the particular numerical values, but the fact that measures can be ordered.
Although measw·es are not numbers, we can still compare them, using an ordering
symbol such as >. For example, we might well believe that Norvig's exercises are tougher than Russell's, and that one scores less on tougher exercises:
e1 E Exercises 1\ e2 E Exercises 1\ Wrote(Norvig, e1) 1\ Wrote( Russell, e2) Difficulty(e1) > Difficulty( e2) . i 1\ Difficulty( e1) > Difficulty(e2) =? e1 E Exercises 1\ e2 E Exercses ExpectedScore(e1) < ExpectedScore(e2) .
=?
This is enough to allow one to decide which exercises to do, even though no nwnerical values for difficulty were ever used. (One does, however, have to discover who wrote which exer cises.) These sorts of monotonic relationships among measures form the basis for the field of qualitative physics, a subfield of AI that investigates how to reason about physical systems without plunging into detailed equations and numerical simulations. Qualitative physics is discussed in the historical notes section.
Section
12.2.
Categories and Objects
12.2.3
445
Objects: Things and stuff
The real world can be seen as consisting of primitive objects (e.g., atomic particles) and composite objects built from them. By reasoning at the level of large objects such as apples and cars, we can overcome the complexity involved in dealing with vast numbers of primitive objects individually. There is, however, a significant portion of reality that seems to defy any INDIVIDUATION
obvious
STUFF
stuff.
individuation-division into distinct objects.
We give this portion the generic name
For example, suppose I have some butter and an aardvark in front of me. I can say
there is one aardvark, but there is no obvious number of "butter-objects;' because any part of a butter-object is also a butter-object, at least until we get to very small parts indeed. This is the major distinction between
stuff and things.
If we cut an aardvark in half, we do not get
two aardvarks (unfortunately). The English language distinguishes clearly between stuff and
things.
We say "an aard
vark," but, except in pretentious Califomia restaurants, one cannot say "a butter." Linguists COUNT NOUNS
distinguish between count nouns, such as aardvarks, holes, and theorems, and mass nouns,
MASS NOUN
such as butter, water, and energy. Several competing ontologies claim to handle this distinc tion. Here we describe just one; the others are covered in the historical notes section. To represent
stuff properly,
we begin with the obvious. We need to have as objects in
our ontology at least the gross "lumps" of
stuff we
interact with. For example, we might
recognize a Jump of butter as the one left on the table the night before; we might pick it up, weigh it, sell it, or whatever. In these senses, it is an object just like the aardvark. Let us call it
Butter3. We also define the category Butter.
Informally, its elements will be all those
things of which one might say "It's butter," including
Butter3. With some caveats about very
small parts that we w omit for now, any part of a butter-object is also a butter-object:
bE Butter 1\ PartOJ(p,b)
=>
pE Butter .
We can now say that butter melts at around
bE Butter
=>
30 degrees centigrade:
MeltingPoint(b, Centigrade(30)) .
We could go on to say that butter is yellow, is less dense than water, is soft at room tempera
ture, has a high fat content, and so on. On the other hand, butter has no particular size, shape,
or weight.
We can define more specialized categories of butter such as
which is also a kind of
stuff
Note that the category
PoundOfButter,
UnsaltedButter,
which includes as
members all butter-objects weighing one pound, is not a kind of stuff I f we cut a pound of butter in half, we do not, alas, get two pounds of butter. INTRINSIC
What is actually going on is this: some properties are
intrinsic:
they belong to the very
substance of the object, rather than to the object as a whole. When you cut an instance of
stuff in half, the EXTRINSIC
two pieces retain the intrinsic properties-things like density, boiling point,
flavor, color, ownership, and so on. On the other hand, their
extrinsic properties-weight,
length, shape, and so on-are not retained under subdivision. includes in its definition only that includes
any extrinsic
A category of objects that
intrinsic properties is then a substance, or mass nmm;
the most general substance category, specifying no intrinsic properties.
a class
Stuff is The category Thing
properties in its definition is a count noun. The category
is the most general discrete object category, specifying no ext1insic properties.
Chapter
446
1 2 .3
12.
Knowledge Representation
EVENTS
In Section
10.4.2, we showed how situation calculus represents actions and their effects.
Situation calculus is limited in its applicability: it was designed to describe a world in which actions are discrete, instantaneous, and happen one at a time. Consider a continuous action, such as filling a bathtub. Situation calculus can say that the tub is empty before the action and full when the action is done, but it can't talk about what happens
during the
action. I t also
can't describe two actions happening at the same time-such as brushing one's teeth while waiting for the tub
EVENT CALCULUS
to fill. To handle such cases we introduce an altemative formalism known
as event calculus, which is based on points of time rather than on situations. 3 Event calculus reifies fluents and events. The fluent ject that refers
At(Shankar, Berkeley) is an ob
to the fact of Shankar being in Berkeley, but does not by itself say anything
about whether it is true. To assert that a fluent is actually true at some point in time we use
the predicate T, as in T(At(Shankar, Berkeley), t).
4 Events are described as instances of event categories. The event
Et of Shankar flying
from San Francisco to Washington, D.C. is described as
E1 E Flyings 1\ Flyer(Et, Shankar) 1\ Origin(E1, SF) 1\ Destination(E1, DC) . If this is too verbose, we can define an alternative three-argument version of the category of flying events and say
Et E Flyings(Shankar, SF, DC) . Happens(Et, i) to say that the event E1 took place over the time interval i, and we say the same thing in functional form with Extent(Et) = i. We represent time intervals by a (start, end) pair of times; that is, i = (t1, t2) is the time interval that starts at t1 and ends at t2. The complete set of predicates for one version of the event calculus is
We then use
T(f,t) Happens(e, i) Initiates(e, J, t) Terminates(e, J, t) Clipped(!, i) Restored(!, i)
f is true at time t Event e happens over the time interval i Event e causes fluent f to start to hold at time t Event e causes fluent f to cease to hold at time t Fluent f ceases to be true at some point during time interval i Fluent f becomes tme sometime during time interval i
Fluent
We assume a distinguished event,
Start, that describes the initial state by saying which fluents T by saying that a fluent holds at a point
are initiated or tenninated at the start time. We define
in time if the fluent was initiated by an event at some time in the past and was not made false (clipped) by an intervening event.
A fluent does not hold if it was terminated by an event and
3 The terms "event" and "action" may be used interchangeably. Informally, "action" connotes an agent while
conootes the possibility of agentless actions. Some versions ofevent calculus do not distinguish event categories from instances of the categories.
"event" 4
Section 12.3.
447
Events
not made b·ue (restored) by another event. Formally, the axioms are:
Happens(e, (t1, t2)) 1\ Initiates(e, J, t1) 1\ · Clipped(!, (t�, t)) 1\ t1 < t =? T(f, t) Happens(e, (t1, t2)) 1\ Terminates(e,J, tl) 1\ ·Restored (!, (t�, t)) 1\ t1 < t =? ·T(f, t) where Clipped and Restored are defined by Clipped(!, (t1, t2)) {::} 3 e, t, t3 Happens(e, (t, t3)) 1\ t1 � t < t2 1\ Terminates(e, f, t) Restored(!, (t1, t2)) {::} 3 e, t, t3 Happens(e, (t, t3)) 1\ t1 � t < t2 1\ Initiates(e, f, t) It is convenient to extend T to work over intervals as well as time points; a fluent holds over an interval if it holds on every point within the interval:
T(f,(t1,t2))
{::}
[\f t (t1 � t < t2)
T(f, t)]
=?
Fluents and actions are defined with domain-specific axioms that are similar to successor state axioms. For example, we can say that the only way a wumpus-world agent gets an arrow is at the start, and the only way to use up an an-ow is to shoot it.:
Initiates(e, HaveArrow(a), t)
{::}
Terminates(e, HaveArrow(a),t)
e #
=
Start
e
E
Shootings(a)
B y reifying events we make it possible to add any amount of arbitrary infonnation about them. For example, we can say that Shankar's flight was bumpy with
Bumpy(EI).
In an
ontology where events are n-ary predicates, there would be no way to add extra information like this; moving to an n + 1-ary predicate isn't a scalable solution. We can extend event calculus to make it possible to represent simultaneous events (such as two people being necessary to ride a seesaw), exogenous events (such as the wind blowing and changing the location of an object), continuous events (such as the level of water in the bathtub continuously rising) and other complications.
12.3.1 DISCRETE EVENTS
Processes
The events we have seen so far are what we call discrete events-they have a definite struc ture. Shankar's trip has a beginning, middle, and end. If inteiTupted halfway, the event would be something different-it would not be a trip from San Francisco to Washington, but instead a trip from San Francisco to somewhere over Kansas. On the other hand, the category of events denoted by
Flyings has
a different quality. If we take a small interval of Shankar's
flight, say, the third 20-minute segment (while he waits anxiously for a bag of peanuts), that event is still a member of Flyings. In fact, this is true for any subinterval. P�ESS LIQUID EVENT
Categories of events with this property are called process categories or liquid event categories. Any process
e that happens over an interval also happens over any subinterval:
(e E Pmcesses) 1\ Happens(e, (t1,t4)) 1\ (t1
Happens(e, (t2,ts)) .
The distinction between liquid and nonliquid events is exactly analogous to the difference between substances, or TEMPORAL SUBSTANCE SAITIAL SUBSTANCE
stuff,
and individual objects, or
things.
In fact, some have called
liquid events temporal substances, whereas substances like butter are spatial substances.
Chapter
448
12.3.2
12.
Knowledge Representation
Time intervals
Event calculus opens us up to the possibility of talking about time, and time intervals. We will consider two kinds of time intervals: moments and extended intervals. The distinction is that only moments have zero dm·ation:
Partition( {Maments, Extendedinteruals}, Intervals) i E Moments ¢:? Duration(i) = Seconds(O) . Next we invent a time scale and associate points on that scale with moments, giving us ab solute times. The time scale is arbitrary; we measure it in seconds and say that the moment
at midnight (GM1) on January 1, 1900, has time 0. The functions Begin and End pick out the earliest and latest moments in an interval, and the fimction Time delivers the point on the time scale for a moment. The function
Duration
gives the difference between the end time
and the start time.
Interval(i) =? Duration(i) = ( Time(End(i)) - Time(Begin(i))) . Time(Begin(AD1900)) = Seconds(O) . Time(Begin(AD200!)) = Seconds(3187324800) . Time(End(AD200!)) = Seconds (3218860800) . Duration(AD2001) = Seconds(3!536000) . To make these numbers easier to read, we also introduce a function Date, which takes
six
arguments (hours, minutes, seconds, day, month, and year) and returns a time point:
Time(Begin(AD2001)) = Date(O, 0, 0, I , Jan, 2001) Date(O, 20, 21, 24, 1, 1995) = Seconds(3000000000) . Two intervals Meet if the end time of the first equals the start time of the second. The com plete set ofinterval relations, as proposed by Allen (1983), is shown graphically in Figure 12.2 and logically below:
Meet(i,j) Before (i, j) After(j, i) During(i,j) Overlap(i, j) Begins(i,j) Finishes (i, j) Equals(i,j)
End(i) = Begin(j) End(i) < Begin(j) Before(i,j) Begin(j) < Begin(i) < End(i) < End(j) ¢:? Begin(i) < Begin(j) < End (i) < End(j) Begin(i) = Begin(j) End(i) = End(j) Begin(i) = Begin(j) 1\ End(i) = End(j) TI1ese all have their intuitive meaning, with the exception of Overlap: we tend to think of overlap as symmetric (if i overlaps j then j overlaps i), but in this definition, Overlap(i,j) only holds if i begins before j. To say that the reign of Elizabeth II immediately followed that of George VI, and the reign of Elvis overlapped with the
1950s, we can write the following:
Meets(ReignOJ( Georye VI) , ReignOf(Elizabethii)) . Overlap(Fifties, ReignOJ(Elvis)) . Begin( Fifties) = Begin(AD1950) . End(Fifties) = End(ADI959) .
Section 12.3.
449
Events
Figure 12.3
A schematic view ofthe object President( USA) for the first 15 years of its
existence.
12.3.3
Fluents and objects
Physical objects can be viewed as generalized events, in the sense that a physical object is a chunk of space-time. For example, USA can be thought of as an event that began in, say, 1776 as a union of 1 3 states and is still in progress today as a union of 50. We can describe the changing properties of USA using state Huents, such as Population( USA). A property of the USA that changes every four or eight years, batring mishaps, is its president. One might propose that
President( USA)
is a logical term that denotes a different object
at different times. Unfortunately, this is not possible, because a tenn denotes exactly one object in a given model structure. (The term
President( USA, t) can denote different objects,
depending on the value of t, but our ontology keeps time indices separate from Huents.) The
Chapter
450 only possibility is that
President( USA)
12.
Knowledge Representation
denotes a single object that consists of different
people at different times. It is the object that is George Washington
from 1789 to 1797, John
Adams from 1797 to 1801, and so on, as in Figure 12.3. To say that George Washington was president tlu-oughout 1790, we can write
T(Equals(President( USA), George Washington), AD1790) . We use the ftmction symbol Equals rather than the standard logical predicate =, because we cannot have a predicate as an argument to T, and because the interpretation is not that George Washington and President( USA) are logically identical in 1790; logical identity is not something that can change over time. The identity is between the subevents of each object that are defined by the period 1790.
1 2 .4
MENTAL EVENTS AND MENTAL OBJECTS
The agents we have constructed so far have beliefs and can deduce new beliefs. Yet none of them has any knowledge
about beliefs or about deduction.
Knowledge about one's own
knowledge and reasoning processes is useful for controlling inference. For example, suppose Alice asks "what is the square root of 1764" and Bob replies "I don't know." If Alice insists "think harder;' Bob should realize that with some more thought, this question can in fact be answered. On the other hand, if the question were "Is your mother sitting down right now?" then Bob should realize that thinking harder is unlikely to help. Knowledge about the knowledge of other agents is also important; Bob should realize that his mother knows whether she is sitting or not, and that asking her would be a way to find out. What we need is a model of the mental objects that
are in someone's head (or some
thing's knowledge base) and of the mental processes that manipulate those mental objects. The model does not have to be detailed. We do not have to be able to predict how many milliseconds it will take for a particular agent to make a deduction. We will be happy just to be able to conclude that mother knows whether or not she is sitting. PROPOSITIONAL ATTITUDE
We begin with the jects: attitudes such as
propositional attitudes that an agent can have toward mental ob
Believes, Knows, Wants, Intends, and Informs.
The difficu1ty is
that these attitudes do not behave like "normal" predicates. For example, suppose we try to assert that Lois knows that Superman can fly:
Knows(Lois, CanFly(Superman) ) . One minor issue with this is that we normally think of CanFly(Superman) as a sentence, but here it appears as a term. That issue can be patched up just be reifying
CanFly(Superman) ;
making it a fluent. A more serious problem is that, if it is true that Superman is Clark Kent, then we must conclude that Lois knows that Clark can fly:
(Superman = Clark) 1\ Knows(Lois, CanFly(Superman)) f= Knows(Lois, CanFly(Clark)) . This is a consequence of the fact that equality reasoning is built into logic. Normally that is a good thing; if our agent knows that
2 + 2 = 4 and 4 < 5, then we want our agent to know
Section
12.4.
REFERENTIAL TRANSPARENCY
Mental Events and Mental Objects
that
2+2
< 5. This property is called
451
referential transparency-it doesn't
matter what
term a logic uses to refer to an object, what matters is the object that the term names. But for propositional attitudes like believes and knows, we would like
to have referential opacity-the
terms used do matter, because not all agents know which terms are co-referential. MODAL LOGIC
Modal logic is designed to address this problem.
Regular logic is concerned with a sin
gle modality, the modality of truth, allowing us to express
"P is true."
Modal logic includes
special modal operators that take sentences (rather than terms) as arguments. For example,
"A knows P" is represented with the notation KAP, where K is the modal operator for knowl edge. It takes two arguments, an agent (written as the subscript) and a sentence. The syntax of modal logic is the same as first-order logic, except that sentences can also be formed with modal operators. The semantics of modal logic is more complicated. In first-order logic a
model
con
tains a set of objects and an interpretation that maps each name to the appropriate object, relation, or ftmction. In modal logic we want to be able to consider both the possibility that Superman's secret identity is Clark and that it isn't. Therefore, we will need a more com POSSIBLEWORLD ACCESSIBILITY RELATIONS
plicated model, one that consists of a collection of possible
worlds rather than just one true world. The worlds are connected in a graph by accessibility relations, one relation for each modal operator. We say that world w1 is accessible from world wo with respect to the modal
KA if everything in w1 is consistent with what A knows in wo, and we write this as Acc(KA, wo, w1). In diagrams such as Figure 12.4 we show accessibility as an arrow be operator
tween possible worlds. As an example, n i the real world, Bucharest is the capital of Romania,
but for an agent that did not know that, other possible worlds are accessible, including ones where the capital of Romania is Sibiu or Sofia. Presumably a world where
2 + 2 = 5 would
not be accessible to any agent. In general, a knowledge atom world accessible from
w.
KAP is true in world w if and only if P is true in every
The truth of more complex sentences is derived by recursive appli
cation of this rule and the normal mles of first-order logic. That means that modal logic can be used to reason about nested knowledge sentences: what one agent knows about another agent's knowledge. For example, we can say that, even though Lois doesn't know whether Superman's secret identity is Clark Kent, she does know that Clark knows:
KLois [Kclarkldentity(Superman, Clark) V Kclark 'Identity(Superman, Clark)] Figure
12.4 shows some possible worlds for this domain, with accessibility relations for Lois
and Superman. In the TOP-LEFT diagram, it is common knowledge that Superman knows his own iden tity, and neither he nor Lois has seen the weather report. So in wo the worlds
wo and tt12 are
accessible to Superman; maybe rain is predicted, maybe not. For Lois all four worlds are ac cessible from each other; she doesn't know anything about the report or if Clark is Superman. But she does know that Superman knows whether he is Clark, because in every world that is accessible to Lois, either Supennan knows
I, or he knows •I. Lois does not know which is
the case, but either way she knows Superman knows. In the report.
TOP-RIGHT
So in
W4
diagram it is common knowledge that Lois has seen the weather
she knows rain is predicted and in
ws
she knows rain is not predicted.
Chapter
452
w : 6
Knowledge Representation
12.
J,-.R
1-- ---- -...
(b)
w : 7
-.J,-.R
Figure 12.4 Possible worlds with accessibility relations Ksuperman (solid arrows) and KLois (dotted arrows). The proposition R means "the weather report for tomorrow is rain" and I means "Superman's secret identity is Oark Kent." All worlds are accessible to them selves; the arrows from a world to itself are not shown. Superman does not know the report, but he knows that Lois knows, because in every world that is accessible to him, either she knows In the
R or she knows -,R.
BOTIOM diagram we represent the scenario where it is common knowledge that
Superman knows his identity, and Lois might or might not have seen the weather report. We represent tllis by combining the two top scenarios, and adding arrows to show that Superman does not know which scenario actually holds. Lois does know, so we don't need arrows for her. In Lois knows
R.
wo
Superman still knows
to add any
I but not R, and now he does not know whether
From what Superman knows, he might be in wo or 1112 , in which case Lois
does not know whether which case she knows
R is true, or he could be in W4, in which case she knows R, or ws, in R.
......
There are an n i finite number of possible worlds, so the trick is you need to represent what you are trying to model.
to introduce just the ones
A new possible world is needed to talk
about different possible facts (e.g., rain is predicted or not), or to talk about different states of knowledge (e.g., does Lois know that rain is predicted). That means two possible worlds, such as W4 and
wo
in Figure
12.4, might have the same base facts about the world, but differ
in their accessibility relations, and therefore in facts about knowledge. Modal logic solves some tricky issues with the interplay of quantifiers and knowledge. The English sentence "Bond knows that someone is a spy" is ambiguous. The first reading is
Section 12.5.
Reasoning Systems for Categories
453
that there is a particular someone who Bond knows is a spy; we can write this as
3 x KBond Spy(x) ,
which in modal logic means that there is an x that, in all accessible worlds, Bond knows to be a spy. The second reading is that Bond just knows that there is at least one spy:
KBond3 X Spy(x) .
The modal logic interpretation is that in each accessible world there is an x that is a spy, but it need not be the same
x in each world.
Now that we have a modal operator for knowledge, we can write axioms for it. First, we can say that agents are able to draw deductions; if an agent knows P and knows that
implies Q, then the agent knows Q:
P
(KaP AKa(P =? Q)) =? KaQ. From this (and a few other rules about logical identities) we can establish that KA(P V •P) is a tautology; every agent knows every proposition hand, an
(KAP) V (KA•P) is not a tautology;
P is either
true or false. On the other
in general, there will be lots of propositions that
agent does not know to be true and does not know to be false.
It is said (going back to Plato) that knowledge is justified true belief. That is, if it is
true, if you believe it, and if you have an unassailably good reason, then you know it. That means that if you know something, it must be true, and we have the axiom:
KaP =? P . Furthermore, logical agents should be able to introspect on their own knowledge. If they know something, then they know that they know it:
KaP =? Ka(KaP) . We can define similar axioms for belief (often denoted by B) and other modalities. However, LOGICAL OMNISCIENCE
logical omniscience on the part of agt::uls. That is, if an agt::ut k.t1uws a st::t of axioms, tht::n it knows all t:onst::yut::nt:t::s uf
one problem with the modal logic approach is that it assumes
those axioms. This is on shaky ground even for the somewhat abstract notion of knowledge, but it seems even worse for belief, because belief has more connotation of referring to things that are physically represented in the agent, not just potentially derivable. There have been attempts to define a form of limited rationality for agents; to say that agents believe those assertions that can be derived with the application of no more than
k reasoning steps, or no
more than s seconds of computation. These attempts have been generally
12.5
WIsatisfactory.
REASONING SYSTEMS FOR CATEGORIES
Categories are the primary building blocks of hu·ge-scale knowledge representation schemes. This section describes systems specially designed for organizing and reasoning with cate gories. There are two closely related families of systems:
semantic networks provide graph
ical aids for visualizing a knowledge base and efficient algorithms for inferring properties
Chapter
454
12.
of an object on the basis of its cat.egory membership; and
Knowledge Representation
description logics provide a for
mal language for constructing and combining category definitions and efficient algorithms for deciding subset and superset relationships between categories.
12.5.1 In EXISTENTIAL GRAPHS
Semantic networks
1909, Charles S . Peirce proposed a graphical notation of nodes and edges called existential
graphs that he called "the logic
of the future." Thus began a long-running debate between
advocates of "logic" and advocates of "semantic networks." Unfortunately, the debate ob scured the fact that semantics networks-at least those with well-defined semantics--are a form of logic. The notation that semantic networks provide for certain kinds of sentences is often more convenient, but if we strip away the "human interface" issues, the underlying concepts--objects, relations, quantification, and so on-are the same.
There are many variants of semantic networks, but all are capable of representing in
dividual objects, categories of objects, and relations among objects. A typical graphical no tation displays object or category names in ovals or boxes, and connects them with labeled
12.5 has a MemberOf link between
Mary and FemalePersons, corresponding t.o the logical assertion Mary E FemalePersons; similarly, the SisterOf link between Mary and John corresponds to the assertion SisterOf(Mary, John). We can con nect categories using SubsetOf links, and so on. It is such fun drawing bubbles and arrows links. For example, Figure
that one can get carried away. For example, we know that persons have female persons as mothers, so can we draw a is no, because
Ha.sMother link from Persons to FemalePersons? The answer
HasMother is a relation between a person and his or her mother, and categories 5
do not have mothers.
For this reason, we have used a special notation-the double-boxed link-in Figure
12.5.
This link asserts that
Vx x E Persons We might also want
�
[\fy HasMother(x,y)
�
y E FemalePersons] .
to assert that persons have two legs-that is,
V x x E Persons
�
Legs(x, 2) .
As before, we need to be careful not to assert that a category has legs; the single-boxed link in Figure
12.5 is used to assert properties of every member of a category.
The semantic network notation rnakes it convenient ofthe kind introduced in Section
to perform inheritance reasoning
12.2. For example, by virtue ofbeing a person, Mary inherits
the property of having two legs. Thus, to find out how many legs Mary has, the inheritance algorithm follows the
MemberOf link from Mary to the category she belongs t.o, and then
SubsetOf links up the hierarchy until it finds a category for which there is a boxed Legs link-in this case, the Persons category. The simplicity and efficiency of this inference
follows
5 Several early systems failed to distinguish between properties of members of a category and properties of the
category as a whole. This can lend directly to inconsistencies, as pointed out by Drew McDermott (1976) in his article "Artificial Intelligence Meets Natural Stupidity." Another common problem was the use of !sA links for both subset and membership relations, in correspondence with English usage: "a cat is a mammal" and "FiJi is a cat." See Exercise 12.22 for more on these issues.
Section 12.5.
Reasoning Systems for Categories
455
Figure 12.5 A semantic network with four objects (John, Mary, 1, and 2) and four categories. Relations are denoted by labeled links.
Figure 12.6 A fragment of a semantic network showing the representation of the logical assertion Fly( Shankar, NewY01·k, NewDelhi, Yesterday).
mechanism, compared with logical theorem proving, has been one of the main attractions of semantic networks. Inheritance becomes complicated when an object can belong to more than one category
or when a category can be a subset ofmore than one other category; this is called multiple n i MULTIPLE INHER�NCE
heritance. In such cases, the inheritance algorithm might find two or more conflicting values answering the query. For this reason, multiple inheritance is banned in some object-oriented programming (OOP) languages, such as Java, that use inheritance in a class hierarchy. It is usually allowed in semantic networks, but we defer discussion ofthat until Section 12.6. The reader might have noticed an obvious drawback ofsemantic network notation, com pared to first-order logic: the fact that links between bubbles represent only binary relations. For example,
the sentence
Fly(Shankar, New York, NewDelhi, Yesterday) cannot be as can obtain the effect of n-ary asser
serted directly in a semantic network. Nonetheless, we
tions by reifying the proposition itself as an event belonging to an appropriate event category. Figure 12.6 shows the semantic network st111cture for this particular event. Notice that the restriction to binary relations forces the creation of a rich ontology of reified concepts. Reification of propositions makes it possible to represent every ground, function-free atomic sentence of first-order logic n i the semantic network notation. Certain kinds of univer-
456
Chapter
12.
Knowledge Representation
sally quantified sentences can be asserted using inverse links and the singly boxed and doubly boxed arrows applied to categories, but that still leaves us a long way short of full first-order logic. Negation, disjunction, nested function symbols, and existential quantification are all missing. Now it is possible to extend the notation to make it equivalent to first-order logic-as in Peirce's existential graphs-but doing so negates one of the main advantages of semantic networks, which is the simplicity and transparency of the inference processes. Designers can build a large network and still have a good idea about what queries will be efficient, because (a) it is easy to visualize the steps that the inference procedure will go through and (b) in some cases the query language is so simple that difficult queries cannot be posed. In cases where the expressive power proves to be too limiting, many semantic network systems provide for
procedural attachment
to fill in the gaps. Procedural attachment is a technique whereby
a query about (or sometimes an assertion of) a certain relation results in a call to a special procedure designed for that relation rather than a general inference algoritlun. One of the most important aspects of semantic networks is their ability to represent DEfAULTVALUE
default values for categories. Examining Figure 12.5 carefully, one notices that John has one leg, despite the fact that he is a person and all persons have two legs. In a strictly logical KB, this would be a contradiction, but in a semantic network, the assertion that all persons have two legs has only default status; that is, a person is assumed to have two legs unless tllis is contradicted by more specific information. The default semantics is enforced naturally by the inheritance algorithm, because it follows links upwards from the object itself (John in this case) and stops as soon as it finds a value. We say that the default is
OVERRIDING
overridden by the more
specific value. Notice that we could also override the default number of legs by creating a category of
OneLeggedPersons, a subset of Persons of which John is a member.
We can retain a strictly logical semantics for the network if we say that the Legs asser
tion for
Persons includes an exception for John:
V x x E Persons 1\ x =!= John
=?
Legs(x, 2) .
For a fixed network, this is semantically adequate but will be much Jess concise than the network notation itself if there are lots of exceptions. For a network that will be updated with more assertions, however, such an approach fails-we really want to say that any persons as yet unknown with one leg are exceptions too. Section
12.6 goes into more depth on this issue
and on default reasoning in general.
12.5.2
DE&::RIPTION
LOGIC
Description logics
TI1e syntax of first-order logic is designed to make it easy to say things about objects. scription logics are notations properties of categories.
De-
that are designed to make it easier to describe definitions and
Description logic systems evolved from semantic networks in re
sponse to pressure to formalize what the networks mean while retaining the emphasis on taxonomic structure as an organizing principle. suesuMPTION cLASSIFICATION
subsumption (checking if one classification (checking systems also include consistency of a cate
The principal inference tasks for description logics are
category is a subset of another by comparing their definitions) and whether an object belongs to a category).. Some
gory definition-whether the membersllip criteria are logically satisfiable.
Section 12.5.
Reasoning Systems for Categories
Concept
�
Path
�
Figure 12.7
457
Thlng I ConceptName And( Concept, . . . ) All(RoleName, Concept) AtLeast(Integf:r, RoleName) AtMost(Integer, RoleName) Fills(RoleName, IndividualName, . . . ) SameAs(Path, Path) OneOf(IndividualName, . . . ) [RoleName, . . .]
The syntax of descriptions in a subset of the CLASSIC language.
The CLASSIC language (Borgida et at., 1989) is a typical description logic. The syntax of CLASSIC descriptions is shown in Figure 12.7.
6 For example, to say that bachelors are
unmarried adult males we would write
Bachelor = And( Unmarried, Adult, Male) . The equivalent in first-order logic would be
Bachelo1·(x) ¢:? Unmarried(x) 1\ Adult(x) 1\ Male(x) . Notice that the description logic has
an
an algebra of operations on predicates, which of
course we can't do in first-order logic. Any description in CLASSIC can be translated into an equivalent first-order sentence, but some descriptions are more straightforward in CLASSIC. For example, to describe the set of men with at least three sons who are all unemployed and married to doctors, and at most two daughters who are all professors in physics or math departments, we would use
And(Man, AtLeast(3, Son), AtMost(2, Daughter), All(Son, And( Unemployed, Married, All(Spouse, Doctor))), All(Daughter, And(Professor, Fills(Department, Physics, Math)))) . We leave it as an exercise to translate this into first-order logic. Perhaps the most important aspect of description logics is their emphasis on tractability of inference. A problem instance is solved by describing it and then asking if it is substuned by one of several possible solution categories. In standard first-order logic systems, predicting the solution time is often impossible. It is frequently left to the user to engineer the represen
tation to detour arotmd sets of sentences that seem to be causing the system to take several weeks to solve a problem. The thrust in description logics, on the other hand, is to ensure that 7 subsumption-testing can be solved in time polynomial in the size of the descriptions.
Notice that the language does not allow one to simply state that one concept, or category, is a subset of another. This is a deliberate policy: subsumption between categories must be derivable from some aspects of the descriptions of the categories. If not, then something is missing from the descriptions. 7 CLASSIC provides efficient subsumption testing n i practice, but the worst-case run time is exponential.
6
458
Chapter
12.
Knowledge Representation
This sounds wonderful in principle, until one realizes that it can only have one of two consequences: either hard problems cannot be stated at aU, or they require exponentially large descriptions! However, the tractability results do shed light on what sorts of constructs cause problems and thus help the user to understand how different representations behave. For example, description logics usually Jack negation and disjunction. Each forces first order logical systems to go through a potentially exponential case analysis in order to ensure completeness. CLASSIC allows only a limited form of disjunction in the Fills and OneOJ constructs, which permit disjunction over explicitly enumerated individuals but not over de scriptions. With disjunctive descriptions, nested definitions can lead easily to an exponential number of alternative routes by which one category can subsume another. 1 2.6
REASONING WITH DEFAULT INFORMATION
In the preceding section, we saw a simple example of an assertion with default status: people have two legs. This default can be overridden by more specific information, such as that Long John Silver has one leg. We saw that the inheritance mechanism in semantic networks implements the overriding of defaults in a simple and natural way. In this section, we study defaults more generally, with a view toward understanding the semantics of defaults rather than just providing a procedural mechanism. 12.6.1
NONMONOTONICtTY NONMONOTONIC LOGIC
Circumscription and default logic
We have seen two examples of reasoning processes that violate the monotonicity property of logic that was proved in Chapter 7.8 In this chapter we saw that a property inherited by all members of a category in a semantic network could be overridden by more specific informa tion for a subcategory. In Section 9.4.5, we saw that under the closed-world assumption, if a proposition a is not mentioned in KB then KB f= •a, but KB 1\ a f= a. Simple introspection suggests that these failures of monotonicity are widespread in commonsense reasoning. It seems that humans often "jump to conclusions." For example, when one sees a car parked on the street, one is normally willing to believe that it has four wheels even though only three are visible. Now, probability theory can certainly provide a conclusion that the fourth wheel exists with high probability, yet, for most people, the possi bility of the car's not having four wheels does not arise unless some new evidence presents itself. Thus, it seems that the four-wheel conclusion is reached by default, in the absence of any reason to doubt it. If new evidence arrives-for example, if one sees the owner carrying a wheel and notices that the car is jacked u�hen the conclusion can be retracted. This kind of reasoning is said to exhibit nonmonotonicity, because the set of beliefs does not grow monotonically over time as new evidence arrives. Nonmonotonic logics have been devised with modified notions of truth and entailment in order to captw-e such behavior. We will look at two such logics that have been studied extensively: circumscription and default logic. 8 Recall that monotonicity requires all entailed sentences to remain entailed after new sentences are added to the KB. That is, if KB f= a: then KB A {3 f= a:.
Section 12.6.
Reasoning with Default Infonnation
459
Circumscription can be seen as a more powerful and precise version of the closed-
CIRCUMSCRIPTION
world assumption. The idea is to specify particular predicates that are assumed to be "as false as possible"-that is, false for every object except those for which they are known to be true. For example, suppose we want to assert the default rule that birds fly. We would introduce a
Abnormal1 (x) , and write
predicate, say
Bird(x) A ·Abnormalt(x)
=?
Flies(x) .
If we say that
Abnormal1 is to be circumscribed, a circwnscriptive reasoner is entitled to assume •Abnormal1(x) unless Abnormal1(x) is known to be true. This allows the con clusion Flies(Tweety) to be drawn from the premise Bird( Tweety), but the conclusion no longer holds if A bno·rmal1 ( Tweety) is asserted. MODEL PREFERENCE
Circumscription can be viewed as an example of a
model preference logic. In such
logics, a sentence is entailed (with default status) if it is true in all preferred models ofthe KB, as opposed to the requirement of truth in all models in classical logic. For circumscriptjon, 9 Let us see how thls idea
one model is preferred to another if it has fewer abnonnal objects.
works in the context of multiple inheritance in semantic networks. The standard example for
which multiple inheritance is problematic is called the "Nixon diamond." It alises from the observation that Richard Nixon was both a Quaker (and hence by default a pacifist) and a Republican (and hence by default not a pacifist). We can write this as follows:
Republican(Nixon) A Quaker(Nixon) . Republican(x) A ·Abno1·mal2(x) =? --,Pacifist(x) . Quaker(x) A ·Abnormah(x) =? Pacifist(x) . Abnormal2 and Abnormal3, there are two preferred models: one in which Abnormal2(Nixon) and Pacifist(Nixon) hold and one in which Abnormal3(N ixon) and •Pacifist( Nixon) hold. Thus, the circumscliptive reasoner remains properly agnostic as to whether Nixon was a pacifist. If we wish, in addition, to assert that religious beliefs take If we circumsclibe
PRIORITIZED CIRCUMSCRIPTION
precedence over political beliefs, we can use a fonnalism called prioritized circumscription to give preference to models where
Default logic is a formalism in which default rules can be written to generate contin
DEFAIJLT LOGIC DEFAIJLT RULES
Abnormah is minimized.
gent, norunonotonic conclusions. A default rule looks like this:
Bird(x) : Flies(x)/Flies(x) . This rule means that if Bird(x) is true, and if Flies(x) is consistent with the knowledge base, then
Flies(x) may be concluded by default. P :
J1,
·
·
·
In general, a default rule has the form
, ln/C
where P is called the prerequisite, C is the conclusion, and
Ji are the justifications-if any
one of them can be proven false, then the conclusion cannot be drawn. Any variable that
For the closed-world assumption, one model is preferred to another ifit has fewer true atoms-thatis, preferred models are minimal models. There is a natural connection between the closed-world assumption and definite clause KBs, because the fixed point reached by forward chaining on definite-clause KBs is the unique minimal model. See page 258 for more on tltis point. 9
Chapter
460 appears in
Ji
or
C must also appear in P.
12.
Knowledge Representation
The Nixon-diamond example can be represented
in default logic with one fact and two default rules:
Republican( Nixon) 1\ Quaker( Nixon) . Republican(x) : •Pacifist(x)f•Pacifist(x) . Quaker(x) : Pacifist(x)jPacifist(x) . EXTENSION
To interpret what the default rules mean, we define the notion of an
extension of a default
theory to be a maximal set of consequences of the theory. That is, an extension S consists of the original known facts and a set of conclusions from the default rules, such that no additional conclusions can be drawn from S and the justifications ofevery default conclusion inS are consistent with S. As in the case of the preferred models in circumscription, we have two possible extensions for the Nixon diamond: one wherein he is a pacifist and one wherein he is not. Prioritized schemes exist in which some default rules can be given precedence over others, allowing some ambiguities to be resolved. Since 1980, when nonmonotonic logics were first proposed, a great deal of progress has been made in understanding their mathematical properties. There are sti11 unresolved questions, however. For example, if "Cars have four wheels" is false, what does it mean to have it in one's knowledge base?
What is a good set of default rules to have?
If we
cannot decide, for each rule separately, whether it belongs in our knowledge base, then we have a serious problem ofnorunodularity. Finally, how can beliefs that have default status be used to make decisions"? This is probably the hardest issue for default reasoning. Decisions often involve tradeoffs, and one therefore needs to compare the outcomes of different actions, and the
strengths of belief
costs of making a wrong decision.
in the
In cases where the
same kinds of decisions are being made repeatedly, it is possible to interpret default rules as "threshold probability" statements. For example, the default rule "My brakes are always OK" really means "The probability that my brakes are OK, given no other information, is sufficiently high that the optimal decision is for me to drive without checking them." When the decision context changes-for example, when one is driving a heavily laden truck down a steep mountain road-the default rule suddenly becomes inappropriate, even though there is no new evidence offaulty brakes. These considerations have led some researchers to consider how to embed default reasoning within probability theory or utility theory.
12.6.2
Truth maintenance systems
We have seen that many of the inferences drawn by a knowledge representation system will have only default status, rather than being absolutely certain. Inevitably, some of these in ferred facts will tum out to be wrong and will have to be retracted in the face of new infonna BELIEF REVISION
tion. This process is called
belief revision. 10 Suppose that a knowledge base KB contains
a sentence P-perhaps a default conclusion recorded by a forward-chaining algorithm, or perhaps just an incorrect assertion-and we want to execute TELL(KB,
•P).
To avoid cre
ating a contradiction, we must first execute RETRACT(KB, P). This sounds easy enough.
10 Beliefrevision is often contrasted with beliefupdate, which occurs when a knowledge base is revised to reflect a change in the world rather than new information about a fixed world. Belief update combines belief revision with reasoning about time and change; it is also related to the process of filtering described in Chapter 15.
Section 12.6.
Reasoning with Default Infonnation Problems atise, however, if any
461
additional
sentences were inferred from P and asserted in
the KB. For example, the implication P � Q might have been used to add Q. The obvious
"solution"-retracting all sentences inferred from P-fails because such sentences may have TRUTH MAINTENANCE SYSTBI
other justifications besides P. For example, if R and R � Q are also in the KB, then Q does not have to be removed after all.
Truth maintenance systems, or TMSs, are designed
to handle exactly these kinds of complications. One simple approach to truth maintenance is to keep track of the order in which sen tences are told to the knowledge base by munbering them from P1 to Pn . When the call RETRACT(KB, ?;_) is made, the system reverts to the state just before Pi was added, thereby removing both
P;_ and any inferences that were derived from Pi· The sentences P;_+l through
Pn can then be added again. This is simple, and it guarantees that the knowledge base will be consistent, but retracting Pi requires retracting and reasserting n
-
i sentences as well as
tmdoing and redoing all the inferences drawn from those sentences. For systems to which many facts are being added-such as la.rge commercial databases-this is impractical. A more efficient approach is the justification-based truth maintenance system, or JTMS.
JTMS JUSTIFICATION
In a JTMS, each sentence in the knowledge base is annotated with a justification consisting of the set of sentences from which it was inferred.
For example, if the knowledge base
already contains P � Q, then TELL(P) will cause Q to be added with the justification {P, P � Q}.
In general, a sentence can have any number of justifications.
Justifica
tions make retraction efficient. Given the call RETRACT(P), the JTMS will delete exactly those sentences for which P is a member of every justification. So, if a sentence Q had
the single justification {P, P � Q}, it would be removed; if it had the additional justi fication {P, P
{R, P V R
=}-
VR
�
Q}, it would still be removed; but if it also had the justification
Q}, then it would be spared. In this way, the time required for retraction of P
depends only on the nwnber of sentences derived from P rather
than on the number of other
sentences added since P entered the knowledge base. The JTMS assumes that sentences that are considered once will probably be considered again, so rather than deleting a sentence from the knowledge base entirely when it loses all justifications, we merely mark the sentence as being
out
of the knowledge base. If a
subsequent assertion restores one of the justifications, then we mark the sentence as being back
in.
In this way, the JTMS retains all the inference chains that it uses and need not
rederive sentences when a justification becomes valid again. In addition to handling the retraction of incorrect information, TMSs can be used to speed up the analysis of multiple hypothetical situations.
Suppose, for example, that the
Romanian Olympic Committee is choosing sites for the swimming, athletics, and eques trian events at the 2048 Games to be held in Romania. For example, let the first hypothe sis be
Site(Swimming, Pitesti), Site(Athletics, Bucharest), and Site(Equestrian, Arad).
A great deal of reasoning must then be done to work out the logistical consequences and
hence the desirability of this selection.
If we want to consider
stead, the TMS avoids the need to start again from scratch.
Site(Athletics, Sibiu)
in
Instead, we simply retract
Site( Athletics, Bucharest) and assert Site( Athletics, Sibiu) and the TMS takes care of the necessary revisions. Inference chains generated from the choice of Bucharest can be reused with Sibiu, provided that the conclusions are the same.
462 ATMS
Chapter
12.
Knowledge Representation
An assumption-based truth maintenance system, or ATMS, makes this type of context switching between hypothetical worlds particularly efficient. In a JTMS, the maintenance of justifications allows you to move quickly from one state to another by making a few retrac tions and assertions, but at any time on1y one state is represented. An ATMS represents all the states that have ever been considered at the same time. Whereas a JTMS simply labels each sentence as being in or out, an ATMS keeps track, for each sentence, of which assumptions would cause the sentence to be true. In other words, each sentence has a label that consists of a set of asswnption sets. The sentence holds just in those cases in which all the assumptions
EXPLANATION
in one of the assumption sets hold. Truth maintenance systems also provide a mechanism for generating explanations. Technically, an explanation of a sentence P is a set of sentences E such that E entails P. If the sentences in E are already known to be true, then E simply provides a sufficient ba
ASSUMPTION
sis for proving that P must be the case. But explanations can also include asswnptions sentences that are not known to be true, but would suffice to prove P if they were true. For example, one might not have enough infOJmatjon to prove that one's car won't start, but a reasonable explanation might include the assumption that the battery is dead. This, combined with knowledge of how cars operate, explains the observed nonbehavior. In most cases, we will prefer an explanation E that is minimal, meaning that there is no proper subset of E that is also an explanation. An ATMS can generate explanations for the "car won't start" problem by making assumptions (such as "gas in car" or "battery dead") in any order we like, even if some assumptions are contradictory. Then we look at the label for the sentence "car won't start'' to read off the sets of assumptions that would justify the sentence. The exact algorithms used to implement truth maintenance systems are a little compli cated, and we do not cover them here. The computational complexity of the truth maintenance
problem is at least as great as that of propositional inference-that is, NP-hard. Therefore, you should not expect truth maintenance to be a panacea. When used carefully, however, a TMS can provide a substantial increase in the ability of a logical system to handle complex environments and hypotheses.
1 2 .7
THE INTERNET SHOPPING WORLD
In this final section we put together all we have Jeamed to encode knowledge for a shopping research agent that helps a buyer find product offers on the b1temet. The shopping agent is given a product description by the buyer and has the task of producing a Jist of Web pages that offer such a product for sale, and ranking which offers are best. In some cases the buyer's product description will be precise, as in Canon Rebel XTi digital camera, and the task is then to find the store(s) with the best offer. In other cases the description will be only partially specified, as in digital camera for
under $300, and the agent will have to compare
different products. The shopping agent's environment is the entire World Wide Web in its full complexity not a toy simulated environment. The agent's percepts are Web pages, but whereas a human
Section 12.7.
The Intemet Shopping World
463
Example Online Store
Select from our fine line of products: • Computers • Cameras • Books • Videos • Music
Example Online Store
Select from our fine line
of product s :
- Computers
- Cameras
- Books
- Videos
- Music
Figure 12.8
A Web page from a generic online store in the form perceived by
the human
user of a browser (top), and the corresponding HTML string as perceived by the browser or the shopping agent (bottom). In HTML, characters between < and > are markup directives that specify how the page is displayed. For example, the string Select means to switch to italic font, display the word Select, and then end the use of italic font. A page identifier such as http: I Iexample . com/books is called a uniform resource locator (URL). The markup Books means to create a hypertext link to uri with the anchor text Books.
Web user would see pages displayed as an array of pixels on a screen, the shopping agent will perceive
a
page as a character string consisting of ordinary words interspersed with for
matting commands in the HTML markup language. Figure 12.8 shows a Web page and a corresponding HTML character string. The perception problem for the shopping agent in volves extracting useful information from percepts of this kind. Clearly, perception on Web pages is easier than, say, perception while driving a taxi in Cairo. Nonetheless, there are complications to the Intemet perception task. The Web page in Figure 12.8 is simple compared to real shopping sites, which may include CSS, cookies, Java, Javascript, Flash, robot exclusion protocols, malfonned HTML, sound files, movies, and text that appears only as part of a JPEG image. An agent that can deal with all of the Internet is almost as complex as a robot that can move in the real world. We concentrate on a simple agent that ignores most of these complications. The agent's first task is to collect product offers that are relevant to a query. If the query is "laptops," then a Web page with a review of the latest high-end laptop would be relevant, but if it doesn't provide a way to buy, it isn't an offer. For now, we can say a page is an offer if it contains the words "buy" or "price" or "add to cart" within an HTML link or form on the
464
Chapter
12.
Knowledge Representation
page. For example, if the page contains a string of the form "" I " 2.5 GHz and price < $500." Implement a shopping agent that accepts descriptions in this language. Our description of Internet shopping omitted the all-important step of actually buying the product. Provide a formal logical description of buying, using event calculus. That is, 12.25
define the sequence of events that occurs when a buyer submits a credit-card purchase and then eventually gets billed and receives the product.
13
QUANTIFYING UNCERTAINTY
In which we see how an agent can tame uncertainty with degrees ofbelief
13.1
UNCERTAINTY
ACTING UNDER UNCERTAINTY
Agents may need to handle uncertainty, whether due to partial observability, nondetermin ism, or a combination of the two. An agent may never know for certain what state it's in or where it will end up after a sequence of actions. We have seen problem-solving agents (Chapter 4) and logical agents (Chapters 7 and 1 1) designed to handle uncertainty by keeping track of a belief state-a representation of the set of all possible world states that it might be in-and generating a contingency plan that han dles every possible eventuality that its sensors may repo11 during execution. Despite its many virtues, however, this approach has significant drawbacks when taken literally as a recipe for creating agent programs: • When interpreting partial sensor information, a logical agent must consider every log ically possible explanation for the observations, no matter how unlikely. This leads to impossible large and complex belief-state representations. • A correct contingent plan that handles every eventuality can grow arbitrarily large and must consider arbitrarily unlikely contingencies. • Sometimes there is no plan that is guaranteed to achieve the goal-yet the agent must act. It must have some way to compare the merits of plans that are not guaranteed. Suppose, for example, that an automated taxi!automated has the goal of delivering a pas senger to the airport on time. The agent forms a plan, Ago, that involves leaving home 90 minutes before the flight departs and driving at a .reasonable speed. Even though the airport is only about 5 miles away, a logical taxi agent will not be able to conclude with ce11ainty that "Plan Ago will get us to the airport in time." Instead, it reaches the weaker conclusion "Plan Ago will get us to the airport .in time, as long as the car doesn't break down or run out of gas, and I don't get into an accident, and there are no accidents on the bridge, and the plane doesn't leave early, and no meteorite hits the car, and . . . ." None of these conditions can be 480
Section 13.1.
481
Acting under Uncertainty
deduced for sure, so the plan's success cannot be inferred. This is the qualification problem (page 268), for which we so far have seen no real solution. Nonetheless, in some sense
Aoo is in fact the right thing to do.
What do we mean by
this? As we discussed in Chapter 2, we mean that out of all the plans that could be executed,
Ago is expected to maximize the agent's performance measure (where the expectation is rel ative to the agent's knowledge about the environment). The performance measure includes getting to the airport in time for the flight, avoiding a long, unproductive wait at the airport, and avoiding speeding tickets along the way. The agent's knowledge cannot guarantee any of these outcomes for Aoo, but it can provide some degree of belief that they will be achieved. Other plans, such as A1so, might increase the agent's belief that it will get to the airport on time, but also increase the likelihood of a long wait. The righr rhing to do-the rational
decision-therefore depends on both the relative importance of various goals and the likeli hood that, and degree to which, they will be achieved. The remainder of this section hones these ideas, in preparation for the development of the general theories of uncertain reasoning and rational decisions that we present in this and subsequent chapters. 13.1.1
Summarizing uncertainty
Let's consider an example of uncertain reasoning: diagnosing a dental patient's toothache. Diagnosis-whether for medicine, automobile repair, or whatever-almost always involves uncertainty. Let us try to write rules for dental diagnosis using propositional logic, so that we can see how the logical approach breaks down. Consider the following simple rule:
Toothache
�
Cavity .
The prohlem is that this rule is wrong. Not all patients with toothache.� ha.ve cavities; some of them have gmn disease, an abscess, or one of several other problems:
Toothache
�
Cavity V GumP.roblem V Abscess . . .
Unfortunately, in order to make the rule true, we have to add an almost unlimited list of possible problems. We could try turning the tu1e into a causal tu1e:
Cavity
�
Toothache .
But this ntle is not right either; not all cavities cause pain. The only way to fix the rule is to make it logically exhaustive: to augment the left-hand side with all the qualifications required for a cavity to cause a toothache. Trying to use logic to cope with a domain like medical diagnosis thus fails for three main reasons: LAZINESS
• Laziness:
It is too much work to list the complete set of antecedents or consequents
needed to ensure an exceptionless rule and too hard to use such rules. TIICORCTICAL IGNORANCE PRACTICAL IGNORANCE
• Theoretical ignorance: Medical science has no complete theory for the domain. • Practical ignorance: Even if we know all the rules, we might be tmcertain about
a
particular patient because not all the necessary tests have been or can be run. The connection between toothaches and cavities is just not a logical consequence in either direction. This is typical of the medical domain, as well as most other judgmental domains: law, business, design, automobile repair, gardening, dating, and so on. The agent's knowledge
Chapter
482 DEGREE OFBELIEF PROBAEILITY THEOR'I
OUTOOME
UTILITYTHEORY
Quantifying Uncertainty
can at best provide only a degree of belief in the relevant sentences. Ow· main tool for
dealing with degrees of belief is probability theory. In the terminology of Section 8.1, the ontological commitments of logic and probability theory are the same-that the world is composed of facts that do or do not hold in any particular case-but the epistemological commitments are different: a logical agent believes each sentence to be tme or false or has no opinion, whereas a probabilistic agent may have a numerical degree of belief between 0 (for sentences that are certainly false) and 1 (certain1y true). Probability provides a way ofsununarizing the uncertainty that comes from our lazi ness and ignorance, thereby solving the qualification problem. We might not know for sure what afflicts a particular patient, but we believe that there is, say, an 80% chance-that is, a probability of 0.8-that the patient who has a toothache has a cavity. That is, we expect that out of all the situations that are indistinguishable from the current situation as far a� our knowledge goes, the patient will have a cavity in 80% of them. This belief could be derived from statistical data-80% of the toothache patients seen so far have had cavities--or from some general dental knowledge, or from a combination of evidence sources. One confusing point is that at the time of our diagnosis, there is no uncertainty in the actual world: the patient either has a cavity or doesn't. So what does it mean to say the probability of a cavity is 0.8? Shouldn't it be either 0 or 1 ? The answer is that probability statements are made with respect to a knowledge state, not with respect to the real world. We say "The probability that the patient has a cavity, given that she has a toothache, is 0.8." If we later learn that the patient has a history of gum disease, we can make a different statement: "The probability that the patient has a cavity, given that she has a toothache and a history of gum disease, is 0.4." If we gather further conclusive evidence against a cavity, we can say "The probability that the patient has a cavity, given all we now know, is almost 0." Note that these statements do not contradict each other; each is a separate assertion about a different knowledge state. 13.1.2
PREFEFENCE
13.
Uncertainty and rational decisions
Consider again the Ago plan for getting to the airport. Suppose it gives us a 97% chance of catching our flight. Does this mean it is a rational choice? Not necessarily: there might be other plans, such as A1so, with higher probabilities. If it is vital not to miss the flight, then it is worth risking the longer wait at the airport. What about A1440, a plan that involves leaving home 24 hours in advance? In most circumstances, this is not a good choice, because although it almost guarantees getting there on time, it involves an intolerable wait-not to mention a possibly unpleasant diet of airport food. To make such choices, an agent must first have preferences between the different pos sible outcomes of the vatious plans. An outcome is a completely specified state, n i cluding such factors as whether the agent arrives on time and the length of the wait at the airport. We use utility theory to represent and rearon with preferences. (The term utility is used here in the sense of "the quality of being useful;' not in the sense of the electric company or water works.) Utility theory says that every state has a degree of usefulness, or utility, to an agent and that the agent will prefer states with higher utility.
Section 13.2.
DECISICNTHEORY
Basic Probability Notation
483
The utility of a state is relative to an agent. For example, the utility of a state in which White has checkmated Black in a game ofchess is obviously high for the agent playing White, but low for the agent playing Black. But we can't go strictly by the scores of 1 , l/2, and 0 that are dictated by the rules of tournament chess-some players (including the authors) might be thrilled with a draw against the world champion, whereas other players (including the former world champion) might not. There is no accounting for taste or preferences: you might think that an agent who prefers jalapeno bubble-gum ice cream to chocolate chocolate chip is odd or even misguided, but you could not say the agent is irrational. A utility function can accotmt for any set of preferences--quirky or typical, noble or perverse. Note that utilities can accotmt for altruism, simply by including the welfare of others as one of the factors. Preferences, as expressed by utilities, are combined with probabilities in the general theory of rational decisions called decision theory:
Decision theory =probability theory + utility theory . The fundamental idea of decision theory is that an agent is rational if and only if it chooses MAXIMl.M EXPECTED UTILITY
the action that yields the highest expected utility, averaged over all the possible outcomes of the action. This is cal1ed the principle of maximum expected utility (MEU). Note that "expected" might seem like a vague, hypothetical term, but as it is used here it has a precise meaning: it means the "average," or "statistical mean" of the outcomes, weighted by the probability of the outcome. We saw tllis principle in action in Chapter 5 when we touched briefly on optimal decisions in backgammon; it is in fact a completely general principle. Figure 13.1 sketches the structure of an agent that uses decision theory to select actions. The agent is identical, at an abstract level, to the agents described in Chapters 4 and 7 that maintain a belief state reflecting the history of percepts to date. The primary difference is that the decision-theoretic agent's belief state represents not just the possibilities for world states but also their probabilities. Given the belief state, the agent can make probabilistic predictions of action outcomes and hence select the action with highest expected utility. This chapter and the next concentrate on the task of representing and computing with probabilistic information in general. Chapter 1 5 deals with methods for the specific tasks of representing and updating the belief state over time and predicting the environment. Chapter 16 covers utility theory in more depth, and Chapter 17 develops algorithms for planning sequences of actions in w1certain environments.
1 3 .2
BASIC PROBABILITY NOTATIOK
For our agent to represent and use probabilistic information, we need a formal language. The language of probability theory has traditionally been informal, written by human math ematicians to other human mathematicians. Appendix A includes a standard introduction to elementary probability theory; here, we take an approach more suited to the needs of AI and more consistent with the concepts of formal logic.
484
Chapter
13.
Quantifying Uncertainty
function DT-AGENT(percept) returns an action persistent: belief..state, probabilistic beliefs about the current state ofthe world action, the agent's action update belief_state based on action and pe1-cept calculate outcome probabilities for actions, given action descriptions and current belief_state select action with highest expected utility given probabilities of outcomes and utility information return action Figure 13.1 13.2.1
A decision-theoretic agent that selects rational actions.
What probabilities are about
Like logical assertions, probabilistic assertions are about possible worlds. Whereas logical assertions say which possible worlds are strictly ruled out (all those in which the assertion is false), probabilistic assertions talk about how probable the various worlds are. In probability theory, the set of all possible worlds is called the sample space. The possible worlds are mutually exclusive and exhaustive: two possible worlds cannot both be the case, and one possible world must be the case. For example, if we are about to roll two (distinguishable) dice, there are 36 possible worlds to consider: (1,1), (1,2), ..., (6,6). The Greek letter Ω (uppercase omega) is used to refer to the sample space, and ω (lowercase omega) refers to elements of the space, that is, particular possible worlds. A fully specified probability model associates a numerical probability P(ω) with each possible world.1 The basic axioms of probability theory say that every possible world has a probability between 0 and 1 and that the total probability of the set of possible worlds is 1:

    0 ≤ P(ω) ≤ 1 for every ω   and   Σ_{ω ∈ Ω} P(ω) = 1 .       (13.1)

1 For now, we assume a discrete, countable set of worlds. The proper treatment of the continuous case brings in certain complications that are less relevant for most purposes in AI.
For example, if we assume that each die is fair and the rolls don't interfere with each other, then each of the possible worlds (1,1), (1,2), ..., (6,6) has probability 1/36. On the other hand, if the dice conspire to produce the same number, then the worlds (1,1), (2,2), (3,3), etc., might have higher probabilities, leaving the others with lower probabilities.

Probabilistic assertions and queries are not usually about particular possible worlds, but about sets of them. For example, we might be interested in the cases where the two dice add up to 11, the cases where doubles are rolled, and so on. In probability theory, these sets are called events, a term already used extensively in Chapter 12 for a different concept. In AI, the sets are always described by propositions in a formal language. (One such language is described in Section 13.2.2.) For each proposition, the corresponding set contains just those possible worlds in which the proposition holds. The probability associated with a proposition
is defined to be the sum of the probabilities of the worlds in which it holds:

    For any proposition φ,   P(φ) = Σ_{ω ∈ φ} P(ω) .       (13.2)
For example, when rolling fair dice, we have P(Total = 11) = P((5,6)) + P((6,5)) = 1/36 + 1/36 = 1/18. Note that probability theory does not require complete knowledge of the probabilities of each possible world. For example, if we believe the dice conspire to produce the same number, we might assert that P(doubles) = 1/4 without knowing whether the dice prefer double 6 to double 2. Just as with logical assertions, this assertion constrains the underlying probability model without fully determining it.
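A minimal sketch of Equations (13.1) and (13.2) for the two-dice example follows; the helper names are illustrative, not from the text.

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair, distinguishable dice: 36 equiprobable worlds (Eq. 13.1).
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def prob(proposition):
    """P(proposition): sum of P(w) over the worlds w where the proposition holds (Eq. 13.2)."""
    return sum(P[w] for w in omega if proposition(w))

print(prob(lambda w: w[0] + w[1] == 11))   # 1/18
print(prob(lambda w: w[0] == w[1]))        # 1/6, P(doubles) under the fair-dice model
```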
Probabilities such as P(Total = 11) and P(doubles) are called unconditional or prior probabilities (and sometimes just "priors" for short); they refer to degrees of belief in propositions in the absence of any other information. Most of the time, however, we have some information, usually called evidence, that has already been revealed. For example, the first die may already be showing a 5 and we are waiting with bated breath for the other one to stop spinning. In that case, we are interested not in the unconditional probability of rolling doubles, but the conditional or posterior probability (or just "posterior" for short) of rolling doubles given that the first die is a 5. This probability is written P(doubles | Die1 = 5), where the "|" is pronounced "given." Similarly, if I am going to the dentist for a regular checkup, the probability P(cavity) = 0.2 might be of interest; but if I go to the dentist because I have a toothache, it's P(cavity | toothache) = 0.6 that matters. Note that the precedence of "|" is such that any expression of the form P(... | ...) always means P((...) | (...)).

It is important to understand that P(cavity) = 0.2 is still valid after toothache is observed; it just isn't especially useful. When making decisions, an agent needs to condition on all the evidence it has observed. It is also important to understand the difference between conditioning and logical implication. The assertion that P(cavity | toothache) = 0.6 does not mean "Whenever toothache is true, conclude that cavity is true with probability 0.6"; rather it means "Whenever toothache is true and we have no further information, conclude that cavity is true with probability 0.6." The extra condition is important; for example, if we had the further information that the dentist found no cavities, we definitely would not want to conclude that cavity is true with probability 0.6; instead we need to use P(cavity | toothache ∧ ¬cavity) = 0.
Mathematically speaking, conditional probabilities are defined in terms of unconditional probabilities as follows: for any propositions a and b, we have

    P(a | b) = P(a ∧ b) / P(b) ,       (13.3)

which holds whenever P(b) > 0. For example,

    P(doubles | Die1 = 5) = P(doubles ∧ Die1 = 5) / P(Die1 = 5) .

The definition makes sense if you remember that observing b rules out all those possible worlds where b is false, leaving a set whose total probability is just P(b). Within that set, the a-worlds satisfy a ∧ b and constitute a fraction P(a ∧ b)/P(b).
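Building on the dice sketch above (it reuses the assumed prob() helper and sample space), here is a hedged illustration of Equation (13.3):

```python
def cond_prob(a, b):
    """P(a | b) = P(a AND b) / P(b), following Equation (13.3); requires P(b) > 0."""
    return prob(lambda w: a(w) and b(w)) / prob(b)

doubles = lambda w: w[0] == w[1]
die1_is_5 = lambda w: w[0] == 5

print(cond_prob(doubles, die1_is_5))   # 1/6, since only (5,5) satisfies both propositions
```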
The definition of conditional probability, Equation (13.3), can be written in a different form called the product rule:

    P(a ∧ b) = P(a | b) P(b) .

The product rule is perhaps easier to remember: it comes from the fact that, for a and b to be true, we need b to be true, and we also need a to be true given b.

13.2.2  The language of propositions in probability assertions

In this chapter and the next, propositions describing sets of possible worlds are written in a
notation that combines elements of propositional logic and constraint satisfaction notation. In the terminology of Section 2.4.7, it is a factored representation, in which a possible world is represented by a set of variable/value pairs. Variables in probability theory are called random variables and their names begin with an uppercase letter. Thus, in the dice example, Total and Die1 are random variables. Every random variable has a domain, the set of possible values it can take on. The domain of Total for two dice is the set {2, ..., 12} and the domain of Die1 is {1, ..., 6}. A Boolean random variable has the domain {true, false} (notice that values are always lowercase); for example, the proposition that doubles are rolled can be written as Doubles = true. By convention, propositions of the form A = true are abbreviated simply as a, while A = false is abbreviated as ¬a. (The uses of doubles, cavity, and toothache in the preceding section are abbreviations of this kind.) As in CSPs, domains can be sets of arbitrary tokens; we might choose the domain of Age to be {juvenile, teen, adult} and the domain of Weather might be {sunny, rain, cloudy, snow}. When no ambiguity is possible, it is common to use a value by itself to stand for the proposition that a particular variable has that value.

Writing π_t(x) for the probability that the chain is in state x at time t, the distribution at the next step is

    π_{t+1}(x') = Σ_x π_t(x) q(x → x') .
We say that the chain has reached its stationary distribution if π_t = π_{t+1}. Let us call this stationary distribution π; its defining equation is therefore

    π(x') = Σ_x π(x) q(x → x')   for all x' .       (14.10)
Provided the transition probability distribution q is ergodic (that is, every state is reachable from every other and there are no strictly periodic cycles), there is exactly one distribution π satisfying this equation for any given q. Equation (14.10) can be read as saying that the expected "outflow" from each state (i.e., its current "population") is equal to the expected "inflow" from all the states. One obvious way to satisfy this relationship is if the expected flow between any pair of states is the same in both directions; that is,
    π(x) q(x → x') = π(x') q(x' → x)   for all x, x' .       (14.11)

When these equations hold, we say that q(x → x') is in detailed balance with π(x).
We can show that detailed balance implies stationarity simply by summing over x in Equation (14.11). We have

    Σ_x π(x) q(x → x') = Σ_x π(x') q(x' → x) = π(x') Σ_x q(x' → x) = π(x') ,
where the last step follows because a transition from x' is guaranteed to occur. The transition probability q(x → x') defined by the sampling step in GIBBS-ASK is actually a special case of the more general definition of Gibbs sampling, according to which each variable is sampled conditionally on the current values of all the other variables. We start by showing that this general definition of Gibbs sampling satisfies the detailed balance equation with a stationary distribution equal to P(x | e) (the true posterior distribution on the nonevidence variables). Then, we simply observe that, for Bayesian networks, sampling conditionally on all variables is equivalent to sampling conditionally on the variable's Markov blanket (see page 517). To analyze the general Gibbs sampler, which samples each X_i in turn with a transition probability q_i that conditions on all the other variables, we define X̄_i to be these other variables (except the evidence variables); their values in the current state are x̄_i. If we sample a new value x_i' for X_i conditionally on all the other variables, including the evidence, we have
    q_i(x → x') = q_i((x_i, x̄_i) → (x_i', x̄_i)) = P(x_i' | x̄_i, e) .

Now we show that the transition probability for each step of the Gibbs sampler is in detailed balance with the true posterior:
    π(x) q_i(x → x') = P(x | e) P(x_i' | x̄_i, e)
                     = P(x_i, x̄_i | e) P(x_i' | x̄_i, e)
                     = P(x_i | x̄_i, e) P(x̄_i | e) P(x_i' | x̄_i, e)    (using the chain rule on the first term)
                     = P(x_i | x̄_i, e) P(x_i', x̄_i | e)               (using the chain rule backward)
                     = π(x') q_i(x' → x) .

We can think of the loop "for each Z_i in Z do" in Figure 14.16 as defining one large transition probability q that is the sequential composition q_1 ∘ q_2 ∘ ··· ∘ q_n of the transition probabilities for the individual variables. It is easy to show (Exercise 14.19) that if each of q_i and q_j has π as its stationary distribution, then the sequential composition q_i ∘ q_j does too; hence the transition probability q for the whole loop has P(x | e) as its stationary distribution. Finally, unless the CPTs contain probabilities of 0 or 1 (which can cause the state space to become disconnected), it is easy to see that q is ergodic. Hence, the samples generated by Gibbs
sampling will eventually be drawn from the true posterior distribution.

The final step is to show how to perform the general Gibbs sampling step, sampling x_i' from P(X_i | x̄_i, e), in a Bayesian network. Recall from page 517 that a variable is independent of all other variables given its Markov blanket; hence,

    P(x_i' | x̄_i, e) = P(x_i' | mb(X_i)) ,

where mb(X_i) denotes the values of the variables in X_i's Markov blanket, MB(X_i).
As shown in Exercise 14.7, the probability of a variable given its Markov blanket is proportional to the probability of the variable given its parents times the probability of each child given its respective parents:

    P(x_i' | mb(X_i)) = α P(x_i' | parents(X_i)) × Π_{Y_j ∈ Children(X_i)} P(y_j | parents(Y_j)) .       (14.12)

Hence, to flip each variable X_i conditioned on its Markov blanket, the number of multiplications required is equal to the number of X_i's children.
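The following is a minimal sketch of the Markov-blanket sampling step of Equation (14.12), assuming a toy network representation in which children maps each variable to its child variables and cpt(Y, s) returns P(Y = s[Y] | parents(Y)) under assignment s; these names are illustrative, not a fixed API.

```python
import random

def markov_blanket_weights(X, state, values, children, cpt):
    """Unnormalized distribution over X's values given its Markov blanket (Eq. 14.12):
    P(x | parents(X)) times the product of P(child | its parents) for each child of X."""
    weights = []
    for v in values[X]:
        s = dict(state, **{X: v})            # current state with X set to v
        w = cpt(X, s)                        # P(X = v | parents(X)) under s
        for Y in children[X]:
            w *= cpt(Y, s)                   # P(y_j | parents(Y_j)); X is one of Y_j's parents
        weights.append(w)
    return weights

def gibbs_step(X, state, values, children, cpt):
    """Resample one non-evidence variable X conditioned on its Markov blanket."""
    weights = markov_blanket_weights(X, state, values, children, cpt)
    state[X] = random.choices(values[X], weights=weights, k=1)[0]
```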
Figure 14.17  (a) Bayes net for a single customer C1 recommending a single book B1. Honest(C1) is Boolean, while the other variables have integer values from 1 to 5. (b) Bayes net with two customers and two books.
14.6  RELATIONAL AND FIRST-ORDER PROBABILITY MODELS
In Chapter 8, we explained the representational advantages possessed by first-order logic in comparison to propositional logic. First-order logic commits to the existence of objects and relations among them and can express facts about some or all of the objects in a domain. This often results in representations that are vastly more concise than the equivalent propositional descriptions. Now, Bayesian networks are essentially propositional: the set of random variables is fixed and finite, and each has a fixed domain of possible values. This fact limits the applicability of Bayesian networks. If we can find a way to combine probability theory with the expressive power of first-order representations, we expect to be able to increase dramatically the range of problems that can be handled.

For example, suppose that an online book retailer would like to provide overall evaluations of products based on recommendations received from its customers. The evaluation will take the form of a posterior distribution over the quality of the book, given the available evidence. The simplest solution is to base the evaluation on the average recommendation, perhaps with a variance determined by the number of recommendations, but this fails to take into account the fact that some customers are kinder than others and some are less honest than others. Kind customers tend to give high recommendations even to fairly mediocre books, while dishonest customers give very high or very low recommendations for reasons other than quality; for example, they might work for a publisher.6

For a single customer C1, recommending a single book B1, the Bayes net might look like the one shown in Figure 14.17(a). (Just as in Section 9.1, expressions with parentheses such as Honest(C1) are just fancy symbols, in this case fancy names for random variables.)
6 A game theorist would advise a dishonest customer to avoid detection by occasionally recommending a good book from a competitor. See Chapter 17.
With two customers and two books, the Bayes net looks like the one in Figure 14.17(b). For larger numbers of books and customers, it becomes completely impractical to specify the network by hand. Fortunately, the network has a lot of repeated structure. Each Recommendation(c, b) variable has as its parents the variables Honest(c), Kindness(c), and Quality(b). Moreover, the CPTs for all the Recommendation(c,b) variables are identical, as are those for all the Honest(c) variables, and so on. The situation seems tailor-made for a first-order language. We would like to say something like
    Recommendation(c, b) ∼ RecCPT(Honest(c), Kindness(c), Quality(b))

with the intended meaning that a customer's recommendation for a book depends on the customer's honesty and kindness and the book's quality according to some fixed CPT. This section develops a language that lets us say exactly this, and a lot more besides.

14.6.1  Possible worlds
Recall from Chapter 13 that a probability model defines a set Ω of possible worlds with a probability P(ω) for each world ω. For Bayesian networks, the possible worlds are assignments of values to variables; for the Boolean case in particular, the possible worlds are identical to those of propositional logic. For a first-order probability model, then, it seems we need the possible worlds to be those of first-order logic, that is, a set of objects with relations among them and an interpretation that maps constant symbols to objects, predicate symbols to relations, and function symbols to functions on those objects. (See Section 8.2.) The model also needs to define a probability for each such possible world, just as a Bayesian network defines a probability for each assignment of values to variables.
Let us suppose, for a moment, that we have figured out how to do this. Then, as usual (see page 485), we can obtain the probability of any first-order logical sentence as a sum over the possible worlds where it is true:
    P(φ) = Σ_{ω : φ is true in ω} P(ω) .       (14.13)
Conditional probabilities P(φ | e) can be obtained similarly, so we can, in principle, ask any question we want of our model (e.g., "Which books are most likely to be recommended highly by dishonest customers?") and get an answer. So far, so good.

There is, however, a problem: the set of first-order models is infinite. We saw this explicitly in Figure 8.4 on page 293, which we show again in Figure 14.18 (top). This means that (1) the summation in Equation (14.13) could be infeasible, and (2) specifying a complete, consistent distribution over an infinite set of worlds could be very difficult.

Section 14.6.2 explores one approach to dealing with this problem. The idea is to borrow not from the standard semantics of first-order logic but from the database semantics defined in Section 8.2.8 (page 299). The database semantics makes the unique names assumption; here, we adopt it for the constant symbols. It also assumes domain closure: there are no more objects than those that are named. We can then guarantee a finite set of possible worlds by making the set of objects in each world be exactly the set of constant
Figure 14.18 Top: Some members of the set of all possible worlds for a language with two constant symbols, R and J, and one binary relation symbol, under the standard semantics for first-order logic. Bottom: the possible worlds under database semantics . The interpretation of the constant symbols is fixed, and there is a distinct object for each constant symbol.
symbols that are used; as shown in Figure 14.18 (bottom), there is no uncertainty about the mapping from symbols to objects or about the objects that exist. We will call models defined in this way relational probability models, or RPMs.7 The most significant difference between the semantics of RPMs and the database semantics introduced in Section 8.2.8 is that RPMs do not make the closed-world assumption; obviously, assuming that every unknown fact is false doesn't make sense in a probabilistic reasoning system!

When the underlying assumptions of database semantics fail to hold, RPMs won't work well. For example, a book retailer might use an ISBN (International Standard Book Number) as a constant symbol to name each book, even though a given "logical" book (e.g., "Gone With the Wind") may have several ISBNs. It would make sense to aggregate recommendations across multiple ISBNs, but the retailer may not know for sure which ISBNs are really the same book. (Note that we are not reifying the individual copies of the book, which might be necessary for used-book sales, car sales, and so on.) Worse still, each customer is identified by a login ID, but a dishonest customer may have thousands of IDs! In the computer security field, these multiple IDs are called sibyls and their use to confound a reputation system is called a sibyl attack. Thus, even a simple application in a relatively well-defined, online domain involves both existence uncertainty (what are the real books and customers underlying the observed data) and identity uncertainty (which symbols really refer to the same object). We need to bite the bullet and define probability models based on the standard semantics of first-order logic, for which the possible worlds vary in the objects they contain and in the mappings from symbols to objects. Section 14.6.3 shows how to do this.
7 The name relational probability model was given by Pfeffer (2000) to a slightly different representation, but the underlying ideas are the same.
14.6.2  Relational probability models
Like first-order logic, RPMs have constant, function, and predicate symbols. (It turns out to be easier to view predicates as functions that return true or false.) We will also assume a type signature for each function, that is, a specification of the type of each argument and the function's value. If the type of each object is known, many spurious possible worlds are eliminated by this mechanism. For the book-recommendation domain, the types are Customer and Book, and the type signatures for the functions and predicates are as follows:
    Honest : Customer → {true, false}
    Kindness : Customer → {1, 2, 3, 4, 5}
    Quality : Book → {1, 2, 3, 4, 5}
    Recommendation : Customer × Book → {1, 2, 3, 4, 5}

The constant symbols will be whatever customer and book names appear in the retailer's data set. In the example given earlier (Figure 14.17(b)), these were C1, C2 and B1, B2.

Given the constants and their types, together with the functions and their type signatures, the random variables of the RPM are obtained by instantiating each function with each possible combination of objects: Honest(C1), Quality(B2), Recommendation(C1, B2), and so on. These are exactly the variables appearing in Figure 14.17(b). Because each type has only finitely many instances, the number of basic random variables is also finite.

To complete the RPM, we have to write the dependencies that govern these random variables. There is one dependency statement for each function, where each argument of the function is a logical variable (i.e., a variable that ranges over objects, as in first-order logic):
    Honest(c) ∼ (0.99, 0.01)
    Kindness(c) ∼ (0.1, 0.1, 0.2, 0.3, 0.3)
    Quality(b) ∼ (0.05, 0.2, 0.4, 0.2, 0.15)
    Recommendation(c, b) ∼ RecCPT(Honest(c), Kindness(c), Quality(b))

where RecCPT is a separately defined conditional distribution with 2 × 5 × 5 = 50 rows, each with 5 entries. The semantics of the RPM can be obtained by instantiating these dependencies for all known constants, giving a Bayesian network (as in Figure 14.17(b)) that defines a joint distribution over the RPM's random variables.8

We can refine the model by introducing a context-specific independence to reflect the fact that dishonest customers ignore quality when giving a recommendation; moreover, kindness plays no role in their decisions. A context-specific independence allows a variable to be independent of some of its parents given certain values of others; thus, Recommendation(c, b) is independent of Kindness(c) and Quality(b) when Honest(c) = false:
    Recommendation(c, b) ∼ if Honest(c) then HonestRecCPT(Kindness(c), Quality(b)) else (0.4, 0.1, 0.0, 0.1, 0.4) .
8 Some technical conditions must be observed to guarantee that the RPM defines a proper distribution. First, the dependencies must be acyclic, otherwise the resulting Bayesian network will have cycles and will not define a proper distribution. Second, the dependencies must be well-founded, that is, there can be no infinite ancestor chains, such as might arise from recursive dependencies. Under some circumstances (see Exercise 14.6), a fixed-point calculation yields a well-defined probability model for a recursive RPM.
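As a minimal sketch of how the dependency statements above could be instantiated ("unrolled") into ground random variables for given constants, the following fragment enumerates variables and their parents; the helper and variable names are illustrative, not part of any fixed RPM implementation.

```python
from itertools import product

customers = ["C1", "C2"]
books = ["B1", "B2"]

# Priors taken from the dependency statements above; RecCPT itself is left abstract here.
priors = {
    "Honest": [0.99, 0.01],
    "Kindness": [0.1, 0.1, 0.2, 0.3, 0.3],
    "Quality": [0.05, 0.2, 0.4, 0.2, 0.15],
}

def unroll(customers, books):
    """Generate the ground random variables and their parent sets for the book-recommendation RPM."""
    variables = {}
    for c in customers:
        variables[f"Honest({c})"] = []
        variables[f"Kindness({c})"] = []
    for b in books:
        variables[f"Quality({b})"] = []
    for c, b in product(customers, books):
        variables[f"Recommendation({c},{b})"] = [f"Honest({c})", f"Kindness({c})", f"Quality({b})"]
    return variables

for var, parents in unroll(customers, books).items():
    print(var, "<-", parents)
```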
Figure 14.19  Fragment of the equivalent Bayes net when Author(B2) is unknown.
This kind of dependency may look like an ordinary if-then-else statement in a programming language, but there is a key difference: the inference engine doesn't necessarily know the value of the conditional test!

We can elaborate this model in endless ways to make it more realistic. For example, suppose that an honest customer who is a fan of a book's author always gives the book a 5, regardless of quality:

    Recommendation(c, b) ∼ if Honest(c) then
                               if Fan(c, Author(b)) then Exactly(5)
                               else HonestRecCPT(Kindness(c), Quality(b))
                           else (0.4, 0.1, 0.0, 0.1, 0.4)

Again, the conditional test Fan(c, Author(b)) is unknown, but if a customer gives only 5s to a particular author's books and is not otherwise especially kind, then the posterior probability that the customer is a fan of that author will be high. Furthermore, the posterior distribution will tend to discount the customer's 5s in evaluating the quality of that author's books.

In the preceding example, we implicitly assumed that the value of Author(b) is known for every b, but this may not be the case. How can the system reason about whether, say, C1 is a fan of Author(B2) when Author(B2) is unknown? The answer is that the system may have to reason about all possible authors. Suppose (to keep things simple) that there are just two authors, A1 and A2. Then Author(B2) is a random variable with two possible values, A1 and A2, and it is a parent of Recommendation(C1, B2). The variables Fan(C1, A1) and Fan(C1, A2) are parents too. The conditional distribution for Recommendation(C1, B2) is then essentially a multiplexer in which the Author(B2) parent acts as a selector to choose which of Fan(C1, A1) and Fan(C1, A2) actually gets to influence the recommendation. A fragment of the equivalent Bayes net is shown in Figure 14.19. Uncertainty in the value of Author(B2), which affects the dependency structure of the network, is an instance of relational uncertainty.
In case you are wondering how the system can possibly work out who the author of B2 is: consider the possibility that three other customers are fans of A1 (and have no other favorite authors in common) and all three have given B2 a 5, even though most other customers find it quite dismal. In that case, it is extremely likely that A1 is the author of B2.
The emergence of sophisticated reasoning like this from an RPM model of just a few lines is an intriguing example of how probabilistic influences spread through the web of interconnections among objects in the model. As more dependencies and more objects are added, the picture conveyed by the posterior distribution often becomes clearer and clearer.

The next question is how to do inference in RPMs. One approach is to collect the evidence and query and the constant symbols therein, construct the equivalent Bayes net, and apply any of the inference methods discussed in this chapter. This technique is called unrolling. The obvious drawback is that the resulting Bayes net may be very large. Furthermore, if there are many candidate objects for an unknown relation or function (for example, the unknown author of B2), then some variables in the network may have many parents.

Fortunately, much can be done to improve on generic inference algorithms. First, the presence of repeated substructure in the unrolled Bayes net means that many of the factors constructed during variable elimination (and similar kinds of tables constructed by clustering algorithms) will be identical; effective caching schemes have yielded speedups of three orders of magnitude for large networks. Second, inference methods developed to take advantage of context-specific independence in Bayes nets find many applications in RPMs. Third, MCMC inference algorithms have some interesting properties when applied to RPMs with relational uncertainty. MCMC works by sampling complete possible worlds, so in each state the relational structure is completely known. In the example given earlier, each MCMC state would specify the value of Author(B2), and so the other potential authors are no longer parents of the recommendation nodes for B2. For MCMC, then, relational uncertainty causes no increase in network complexity; instead, the MCMC process includes transitions that change the relational structure, and hence the dependency structure, of the unrolled network.

All of the methods just described assume that the RPM has to be partially or completely unrolled into a Bayesian network. This is exactly analogous to the method of propositionalization for first-order logical inference. (See page 322.) Resolution theorem-provers and logic programming systems avoid propositionalizing by instantiating the logical variables only as needed to make the inference go through; that is, they lift the inference process above the level of ground propositional sentences and make each lifted step do the work of many ground steps. The same idea applies in probabilistic inference. For example, in the variable elimination algorithm, a lifted factor can represent an entire set of ground factors that assign probabilities to random variables in the RPM, where those random variables differ only in the constant symbols used to construct them. The details of this method are beyond the scope of this book, but references are given at the end of the chapter.
14.6.3  Open-universe probability models
We argued earlier that database semantics was appropriate for situations in which we know exactly the set of relevant objects that exist and can identify them unambiguously. (In particular, all observations about an object are correctly associated with the constant symbol that names it.) In many real-world settings, however, these assumptions are simply untenable. We gave the examples of multiple ISBNs and sibyl attacks in the book-recommendation domain (to which we will return in a moment), but the phenomenon is far more pervasive:
• A vision system doesn't know what exists, if anything, around the next corner, and may not know if the object it sees now is the same one it saw a few minutes ago.

• A text-understanding system does not know in advance the entities that will be featured in a text, and must reason about whether phrases such as "Mary," "Dr. Smith," "she," "his cardiologist," "his mother," and so on refer to the same object.

• An intelligence analyst hunting for spies never knows how many spies there really are and can only guess whether various pseudonyms, phone numbers, and sightings belong to the same individual.

In fact, a major part of human cognition seems to require learning what objects exist and being able to connect observations (which almost never come with unique IDs attached) to hypothesized objects in the world.
For these reasons, we need to be able to write so-called
open-universe probability
models or OUPMs based on the standard semantics of first-order logic, as illustrated at the top of Figure 14.18. A language for OUPMs provides a way of writing such models easily while guaranteeing a unique, consistent probability distribution over the infinite space of possible worlds. The basic idea is to understand how ordinary Bayesian networks and RPMs manage to define a unique probability model and to transfer that insight to the first-order setting. In essence, a Bayes net generates each possible world, event by event, in the topological order defined by the network structure, where each event is an assignment of a value to a variable. An RPM extends this to entire sets of events, defined by the possible instantiations of the logical variables in a given predicate or function. OUPMs go further by allowing generative steps that add objects to the possible world under construction, where the number and type
of objects may depend on the objects that are already in that world. That is, the event being generated is not the assignment of a value to a variable, but the very existence of objects.
One way to do this in OUPMs is to add statements that define conditional distributions
over the numbers of objects of various kinds. For example, in the book-recommendation domain, we might want to distinguish between customers (real people) and their login
IDs.
Suppose we expect somewhere between 100 and 10,000 distinct customers (whom we cannot 9 observe directly). We can express this as a prior log-normal distribution as follows:
# Customer � LogNormal[6.9, 2.32]() .
We expect honest customers to have just one ID, whereas dishonest customers might have anywhere between 10 and 1000 IDs:
# LoginiD(Owner=c) � if Honest(c) then Exactly(!) else LogNormal[6.9, 2.32]() . This statement defines the number of login IDs for a given owner, who is a customer. The ORIGIN fUNCTION
Owner function is called an origin function because it says where each generated object
came from. In the formal semantics of BLOG (as distinct from first-order logic), the domain elements in each possible world are actually generation histories (e.g., "the fourth login ID of the seventh customer'') rather than simple tokens.
9 A distribution LogNormal[I., o-2] (x) is equivalent to distribution N [I., o-2] (x) a
over
log. (x).
Chapter
546
14.
Probabilistic Reasoning
Subject to teclmical conditions of acyclicity and well-foundedness similar to those for RPMs, open-universe models of this kind define a unique distribution over possible worlds. Furthermore, there exist inference alg01ithms such that, for every such well-defined model and every first-order query, the answer returned approaches the true posterior arbitrarily closely in the limit. There are some tricky issues involved in designing these algorithms. For example, an MCMC algorithm cannot sample directly in the space of possible worlds when the size of those worlds is unbounded; instead, it samples finite, partial worlds, rely ing on the fact that only firutely many objects can be relevant to the query in distinct ways. Moreover, transitions must allow for merging two objects into one or splitting one into two. (Details are given in the references at the end of the chapter.) Despite these complications, the basic principle established in Equation (14.13) still holds: the probability of any sentence is well defined and can be calculated. Research in this area is still at an early stage, but already it is becoming clear that first order probabilistic reasoning yields a tremendous increase in the effectiveness of AI systems at handling uncertain information. Potential applications include those mentioned above computer vision, text tmderstanding, and intelligence analysis-as well as many other kinds of sensor interpretation.
1 4 .7
OTHER APPROACHES TO U N CERTAIN REASONING
Other sciences (e.g., physics, genetics, and economics) have long favored probability as a model for uncertainty. In 1819, Pierre Laplace said, "Probability theory is nothing but com mon sense reduced to calculation." In 1850, James Maxwell said, "The true logic for this world is the calculus of Probabilities, which takes account of the magnitude of the probabil ity which is, or ought to be, in a reasonable man's mind." Given this long tradition, it is perhaps surprising that AI has considered many alterna tives to probability. The earliest expert systems of the 1970s ignored uncertainty and used strict logical reasoning, but it soon became clear that this was impractical for most real-world domains. The next generation of expert systems (especially in medical domains) used prob abilistic techniques. Initial results were promising, but they did not scale up because of the exponential number of probabilities required in the full joint distribution. (Efficient Bayesian network algorithms were unknown then.) As a result, probabilistic approaches fell out of favor from roughly 1975 to 1988, and a vatiety of alternatives to probability were tried for a variety of reasons: •
One common view is that probability theory is essentially numerical, whereas human judgmental reasoning is more "qualitative." Certainly, we are not consciously aware of doing numerical calculations of degrees of belief. (Neither are we aware of doing urufication, yet we seem to be capable of some kind of logical reasoning.) It might be that we have some kind of numerical degrees of belief encoded directly in strengths of connections and activations in our neurons. In that case, the difficulty of conscious access to those strengths is not surprising. One should also note that qualitative reason-
Section 14.7.
Other Approaches to Uncertain Reasoning
547
ing mechanisms can be built directly on top of probability theory, so the "no numbers" argument against probability has little force. Nonetheless, some qualitative schemes have a good deal of appeal in their own right. One of the best studied is default rea soning, which treats conclusions not as "believed to a certain degree;' but as "believed until a better reason is found to believe something else." Default reasoning is covered in Chapter 12. •
•
•
Rule-based approaches to tmcertainty have also been t1ied.
Such approaches hope to build on the success of logical rule-based systems, but add a sort of "fudge factor" to each rule to accommodate uncertainty. These methods were developed in the mid-1970s and formed the basis for a large number of expert systems in medicine and other areas. One area that we have not addressed so far is the question of ignorance, as opposed to uncertainty. Consider t.he flipping of a coin. If we know that the coin is fair, then a probability of 0.5 for heads is reasonable. If we know that the coin is biased, but we do not know which way, then 0.5 for heads is again reasonable. Obviously, the two cases are different, yet the outcome probability seems not to distinguish them. The Dempster-Shafer theory uses interval-valued degrees ofbelief to represent an agent's knowledge of the probability of a proposition. Probability makes the same ontological commitment as logic: that propositions are true or false in the world, even if the agent is uncertain as to which is the case. Researchers in fuzzy logic have proposed an ontology that allows vagueness: that a proposition can be "smt of' true. Vagueness and uncertainty are in fact orthogonal issues.
The next three subsections treat some ofthese approaches in slightly more depth. We will not provide detailed technical material, but we cite references for further study. 14.7.1
Rule-based methods for uncertain reasoning
Rule-based systems emerged from early work on practical and intuitive systems for logical inference. Logical systems in general, and logical mle-based systems in particular, have three desirable properties: LOCALITY
DETACI-NENT
Locality: In logical systems, whenever we have a rule of the form A � B, we can conclude B, given evidence A, without worrying about any other rules. In probabilistic systems, we need to consider all the evidence. • Detachment: Once a logical proof is fotmd for a proposition B, the proposition can be used regardless of how it was derived. That is, it can be detached from its justification.
•
In dealing with probabilities, on the other hand, the source of the evidence for a belief is important for subsequent reasoning. IHUIH· FUNCTIONALITY
•
Truth-functionality:
In logic, the truth of complex sentences can be computed from the truth of the components. Probability combination does not work this way, except under strong global independence asswnptions.
There have been several attempts to devise w1certain reasoning schemes that retain these advantages. The idea is to attach degrees of belief to propositions and rules and to devise purely local schemes for combining and propagating those degrees of belief. The schemes
Chapter
548
are also truth-functional; for example, the degree of belief in in
A and the belief in B.
14.
Probabilistic Reasoning
A V B is a function of the belief
The bad news for rule-based systems is that the properties of locality, detachment, and
truth-functionality are simply not appropriate for uncertain reasoning. Let us look at truth
H1
T1
functionality first. Let be the event that a fair coin flip comes up heads, let be the event that the coin comes up tails on that same flip, and let be the event that the coin comes
H2
up heads on a second flip. Clearly, all three events have the same probability, 0.5, and so a truth-functional system must assign the same belief to the disjunction of any two of them.
But we can see that the probability of the disjtmction depends on the events themselves and not just on their probabilities:
P(A) P(Ht) = 0.5
P(A v B) P(B) P(Ht) = 0.5 P(Ht v Ht) = 0.50 P(Tt) = 0.5 P(Ht v Tt) = 1.00 P(H2) = 0.5 P(Ht v H2) = 0.75
It gets worse when we chain evidence together. Truth-functional systems have
rules of the
A 1-4 B that allow us to compute the belief in B as a function of the belief in the rule and the belief in A. Both forward- and backward-chaining systems can be devised. The belief
form
in the rule is assumed to be constant and is usually specified by the knowledge engineer-for
A
B.
example, as 1-40.9 Consider the wet-grass situation from Figure 14.12(a) (page 529). If we wanted to be able to do both causal and diagnostic reasoning, we would need the two rules
Rain
WetGmss Rain . These two rules form a feedback loop: evidence for Rain increases the belief in WetGmss, which in tum increases the belief in Rain even more. Clearly, uncertain reasoning systems H
WetGmss
H
and
need to keep track of the paths along which evidence is propagated. Intercausal reasoning (or explaining away) is also tricky. Consider what happens when we have the two rules
Sprinkler
H
WetCross
and
WetGmss
H
Rain .
Suppose we see that the sprinkler is on. Chaining forward through our rules, this increases the belief that the grass will be wet, which in tum increases the belief that it is raining. But this is ridiculous: the fact that the sprinkler is on explains away the wet grass and should reduce the belief in rain. A truth-functional system acts as if it also believes H
Sp1-inkler
Rain.
Given these difficulties, how can truth-functional systems be made useful in practice? The answer lies in restricting the task and in carefully engineering the rule base so that un cERTAINTY FACTOR
desirable interactions do not occur. The most famous example of a truth-functional system for uncertain reasoning is the model, which was developed for the MYCIN
certainty factors
medical diagnosis program and was widely used in expert systems of the late 1970s and 1980s. Almost all uses of certainty factors involved rule sets that were either purely diagnos
tic (as in
MYCIN) or purely causal.
Furthennore, evidence was entered only at the "roots"
of the rule set, and most rule sets were singly connected. Heckerman (1986) has shown that,
Section 14.7.
Other Approaches to Uncertain Reasoning
549
under these circumstances, a minor variation on certainty-factor inference was exactly equiv alent to Bayesian inference on polytrees. In other circumstances, certainty factors could yield disastrously incorrect degrees of belief through overcounting of evidence. As rule sets be came larger, undesirable interactions between rules became more common, and practitioners fmmd that the certainty factors of many other rules had to be "tweaked" when new rules were added. For these reasons, Bayesian networks have largely supplanted rule-based methods for tmcertain reasoning. 14.7.2 DEMI"SlBl SHAFEn THEORY
BELIEF FUNCTION
Representing ignorance: Dempster-Sbafer theory
The Dempster-Shafer theory is designed to deal with the distinction between uncertainty and ignorance. Rather than computing the probability of a proposition, it computes the probability that the evidence supports the proposition. This measure of belief is called a belief function, written Bel(X). We return to coin flipping for an example of belief functions. Suppose you pick a coin from a magician's pocket. Given that the coin might or might not be fair, what belief should you ascribe to the event that it comes up heads? Dempster-Shafer theory says that because you have no evidence either way, you have to say that the belief Bel(Heads) = 0 and also that Bel(-.Heads) = 0. This makes Dempster-Shafer reasoning systems skeptical in a way that has some intuitive appeal. Now suppose you have an expert at your disposal who testifies with 90% certainty that the coin is fair (i.e., he is 90% sure that P(Heads) = 0.5). Then Dempster-Shafer theory gives Bel(Heads) = 0.9 x 0.5 = 0.45 and likewise Bel(-.Heads) = 0.45. There is still a 10 percentage point "gap" that is not accounted for by the evidence.
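A tiny numerical sketch of the coin example above (the variable names are illustrative, not a Dempster-Shafer library):

```python
# Expert is 90% sure the coin is fair (P(heads) = 0.5); there is no other evidence.
expert_confidence = 0.9
bel_heads = expert_confidence * 0.5        # Bel(Heads) = 0.45
bel_not_heads = expert_confidence * 0.5    # Bel(not Heads) = 0.45
gap = 1 - (bel_heads + bel_not_heads)      # 0.10 of belief mass left uncommitted by the evidence
print(bel_heads, bel_not_heads, gap)
```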
MASS
The mathematical underpinnings of Dempster-Shafer theory have a similar flavor to those of probability theory; the main difference is that, instead of assigning probabilities to possible worlds, the theory assigns masses to sets of possible world, that is, to events. The masses still must add to 1 over all possible events. Bel(A) is defined to be the sum of masses for all events that are subsets of (i.e., that entail) A, including A itself. With this definition, Bel(A) and Bel(·A) sum to at most l, and the gap-the interval between Bef(A) and 1 Bel(·A)-is often interpreted as bounding the probability of A. As with default reasoning, there is a problem in connecting beliefs to actions. Whenever there is a gap in the beliefs, then a decision problem can be defined such that a Dempster Shafer system is unable to make a decision. In fact, the notion of utility in the Dempster Shafer model is not yet well understood because the meanings of masses and beliefs them selves have yet to be understood. Pearl (1988) has argued that Bel(A) should be interpreted not as a degree of belief in A but as the probability assigned to all the possible worlds (now interpreted as logical theories) in which A is provable. While there are cases in which this quantity might be of interest, it is not the same as the probability that A is true. A Bayesian analysis of the coin-flipping example would suggest that no new formalism is necessary to handle such cases. The model would have two variables: the Bias of the coin (a number between 0 and l , where 0 is a coin that always shows tails and 1 a coin that always shows heads) and the outcome of the next Flip. The ptior probability distribution for Bias -
550
Chapter
14.
Probabilistic Reasoning
would reflect our beliefs based on the source of the coin (the magician's pocket): some small probability that it is fair and some probability that it is heavily biased toward heads or tails. The conditional distribution P( Flip I Bias) simply defines how the bias operates. If P(Bias) is symmetric about 0.5, then our prior probability for the flip is
P(Flip = heads) =
11 P(Bias=x)P(Flip = heads I Bias =x) dx = 0.5 .
This is the same prediction as if we believe strongly that the coin is fair, but that does not mean that probability theory treats the two situations identically. The difference arises after the flips in computing the posterior distribution for Bias. If the coin came from a bank, then seeing it come up heads three times running would have almost no effect on our strong prior belief in its fairness; but if the coin comes from the magician's pocket, the same evidence will lead to a stronger posterior belief that the coin is biased toward heads. Thus, a Bayesian approach expresses our "ignorance" in terms of how our beliefs would change in the face of future information gathering. 14.7.3 FUZZV SETTHEORY
FUZZYLOGIC
Representing vagueness: Fuzzy sets and fuzzy logic
Fuzzy set theory is a means of specifying how well an object satisfies a vague description. For example, consider the proposition "Nate is tall." Is this true if Nate is 5' 10"? Most
people would hesitate to answer "true" or "false;' preferring to say, "sort of." Note that this is not a question of uncertainty about the external world-we are sure of Nate's height. The issue is that the linguistic term "tall" does not refer to a sharp demarcation of objects into two classes-there are degrees of tallness. For this reason, fuzzy set theory is not a method for uncertain reasoning at all. Rather, fuzzy set theory treats Tall as a fuzzy predicate and says that the tmth value of Tall( Nate) is a number between 0 and I , rather than being just true or false. The name "fuzzy set" derives from the n i terpretation of the predicate as implicitly defining a set of its members-a set that does not have sharp boundaries. Fuzzy logic is a method for reasoning with logical expressions describing membership in fuzzy
sets. Por example, the complex sentence
Tall(Nate) 1\ Ileavy(Nate) has a fuzzy
truth value that is a function of the truth values of its components. The standard rules for evaluating the fuzzy truth, T, of a complex sentence are
T(A 1\ B) = min(T(A),T(B)) T(A v B) = ma.x(T(A),T(B)) T(·A) = 1 -T(A) . Fuzzy logic is therefore a truth-functional system-a fact that causes serious difficulties. For example, suppose that T( Tall( Nate)) = 0.6 and T(Heavy(Nate)) = 0.4. Then we have T(Tall(Nate) 1\ Heavy(Nate))= 0.4, which seems reasonable, but we also get the result T( Tall(Nate) 1\ -, Tall( Nate)) = 0.4, which does not. Clearly, the problem arises from the iuabilily uf a lrulh-fuudiuual apprua�;h lu lakt:: iutu a�;�;uunl tl1t:: �;uiTdaliun� ur auli�;un·datjuus
FUZZYCONTROL
among the component propositions. Fuzzy control is a methodology for constructing control systems in which the mapping between real-valued input and output parameters is represented by fuzzy rules. Fuzzy con trol has been very successful in commercial products such as automatic transmissions, video
Section 14.8.
551
Swrunary
cameras, and electric shavers. Critics (see, e.g., Elkan, 1993) argue that these applications are successful because they have small rule bases, no chaining of inferences, and tunable parameters that can be adjusted to improve the system's performance. The fact that they
are
implemented with fuzzy operators might be incidental to their success; the key is simply to provide a concise and intuitive way to specify a smoothly interpolated, real-valued function. There have been attempts to provide an explanation of fuzzy logic in terms of probabil ity theory. One idea is to view assertions such as "Nate is Tall" as discrete observations made concerning a continuous hidden variable, Nate's actual The probability model speci
Height. fies P(Observer says Nate is tall I Height), pemaps using a probit distribution as described on page 522. A posterior distribution over Nate's height can then be calculated in the usual way, for example, if the model is part of a hybrid Bayesian network. Such an approach is not truth-functional, of course. For example, the conditional distribution P(Observer says Nate is tall and heavy
I Height, Weight)
allows for interactions between height and weight in the causing of the observation. Tlms, someone who is eight feet tall and weighs 190 pounds is very unlikely to be called "tall and heavy," even though "eight feet" counts as "tall" and "190 pounds" counts as "heavy." RANDOM SET
random Tall The probability P( Tall= 81),
Fuzzy predicates can also be given a probabilistic interpretation in terms of sets-that is, random variables whose possible values are sets of objects. For example,
is a random set whose possible values are sets of people. where is some particular set of people, is the probability that exactly that set would be identified as "tall" by an observer. Then the probability that "Nate is tall" is the swn of the
S1
probabilities of all the sets of which Nate is a member. Both the hybrid Bayesian network approach and the random sets approach appear to capture aspects of fuzziness without introducing degrees of truth. Nonetheless, there remain many open issues concerning the proper representation oflinguistic observations and contin uous quantities-issues that have been neglected by most outside the fuzzy community.
14.8
SUMMARY
This chapter has described
Bayesian networks, a well-developed representation for uncertain
knowledge. Bayesian networks play a role roughly analogous to that of propositional logic for definite knowledge. • A Bayesian network is a directed acyclic graph whose nodes correspond to random variables; each node has a conditional distribution for the node, given its parents. • Bayesian networks provide a concise way to represent
conditional independence rela
tionships in the c!omain.
• A Bayesian network specifies a full joint distribution; each joint entry is defined as the product of the corresponding entries in the local conditional distributions. A Bayesian network is often exponentially smaller than an explicitly enwnerated joint distribution. • Many conditional distributions can be represented compactly by canonical families of
Chapter
552 distiibutions.
14.
Probabilistic Reasoning
Hybrid Bayesian networks, which include both discrete and continuous
variables, use a variety of canonical distributions. •
Inference in Bayesian networks means computing the probability distribution of a set of query variables, given a set of evidence variables. Exact inference algorithms, such
variable elimination, evaluate sums of products of conditional probabilities as effi
as ciently as possible. •
In
polytrees (singly connected networks), exact inference takes time linear in the size
of the network. In the general case, the problem is intractable. •
•
likelihood weighting
Markov chain
Stochastic approximation techniques such as and can give reasonable estimates of the true posterior probabilities in a net work and can cope with much larger networks than can exact algorithms.
Monte Carlo
Probability theory can be combined with representational ideas from first-order logic to produce very powetful systems for reasoning under uncertainty.
Relational probabil
ity models (RPMs) include representational restrictions that guarantee a well-defined probability distribution that can be expressed as an equivalent Bayesian network. Open universe probability models handle existence and identity uncertainty, defining prob abilty distributions over the infinite space of first-order possible worlds.
•
Various altemative systems for reasoning under uncertainty have been suggested. Gen erally speaking, systems are not well suited for such reasoning.
truth-functional
BffiLIOGRAPHICAL AND HISTORICAL NOTES
The use of networks to represent probabilistic information began early in the 20th century, with the work of Sewall Wright on the probabilistic analysis of genetic inheritance and an imal growth factors (Wright, 1921, 1934). I. J. Good (1961), in collaboration with Alan Turing, developed probabilistic representations and Bayesian inference methods that could be regarded as a forerunner of modem Bayesian networks-although the paper is not often
cited in this context. 10 The same paper is the original source for the noisy-OR model. The representation for decision problems, which incorporated a
influence diagram
DAG representation for random variables, was used in decision analysis .in the late 1970s
(see Chapter 16), but only enwneration was used for evaluation. Judea Pearl developed the message-passing method for carrying out inference in tree networks (Pearl, 1982a) and poly tree networks (Kim and Pearl, 1983) and explained the importance of causal rather than di agnostic probability models, in contrast to the certainty-factor systems then in vogue. The first expert system using Bayesian networks was CONVINCE (Kim, 1983). Early applications in medicine included the MUNIN system for diagnosing neuromuscular disorders (Andersen et al., 1989) and the
PATHFINDER
system for pathology (Heckerman, 1991). The
CPCS system (Pradhan et al. , 1994) is a Bayesian network for internal medicine consisting
I. J. Good was chief statistician for Turing's code-breaking team in World War II. In 2001: A Space Odyssey (Clarke, 1968a), Good and Minsky are credited with making the breakthrough that led to the development of the HAL 9000 computer. 10
Bibliographical and Historical Notes
553
of 448 nodes, 906 links and 8,254 conditional probability values. (The front cover shows a portion of the network.) Applications in engineering include the Electric Power Research Institute's work on monitoring power generators (Morjaria et al. , 1995), NASA's work on displaying time critical information at Mission Control in Houston (Horvitz and BaiTy, 1995), and the general field of
network tomography, which aims to infer unobserved local properties of nodes and
links in the Internet from observations of end-to-end message performance (Castro et al., 2004). Perhaps the most widely used Bayesian network systems have been the diagnosis
and-repair modules (e.g., the Printer Wizard) in Microsoft Windows (Breese and Heckerman, 1996) and the Office Assistant in Microsoft Office (Horvitz et al. , 1998). Another impor
tant application area is biology: Bayesian networks have been used for identifying human genes by reference to mouse genes (Zhang et al. , 2003), inferring cellular networks Friedman (2004), and many other tasks in bioinfonnatics. We could go on, but instead we'll refer you to Pourret et al. (2008), a 400-page guide to applications of Bayesian networks. Ross Shachter(l986), working in the influence diagram community, developed the first complete algorithm for general Bayesian networks. His method was based on goal-directed reduction of the network using posterior-preserving transformations. Pearl (1986) developed a clustering algorithm for exact inference in general Bayesian networks, utilizing a. conversion
to a directed polytree of clusters in which message passing was used to achieve consistency over variables shared between clusters. A similar approach, developed by the statisticians David Spiegelhalter and Steffen Lauritzen (Lauritzen and Spiegelhalter, 1988), is based on MARKeN NElWORK
conversion to an undirected form of graphical model called a
Markov network.
This ap
proach is implemented in the HUGIN system, an efficient and widely used tool for uncertain reasoning (Andersen et al. , 1989). Boutilier et al. (1996) show how to exploit context-specific independence in clustering algorithms. The basic idea of variable elimination-that repeated computations within the overall sum-of-products expression can be avoided by caching-appeared in the symbolic probabilis tic inference (SPl) algorithm (Shachter et al., 1990). The elimination algorithm we describe is closest to that developed by Zhang and Poole (1994). Criteria for pruning irrelevant vari ables were developed by Geiger et al. (1990) and by Lauritzen et al. (1990); the criterion we NONSERIAL DYNAMIC PAOORIAMMING
give is a simple special case of these. Dechter (1999) shows how the variable elimination idea is essentially identical to nonserial dynamic programming (Bertele and Brioschi, 1972), an algorithmic approach that can be applied to solve a range of inference problems in Bayesian networks, for example, finding the most likely explanation for a set of observations. This connects Bayesian network algorithms to related methods for solving CSPs and gives a direct measure of the complexity of exact inference in terms of the tree width of the network. Wexler and Meek (2009) describe a method of preventing exponential growth in the size of factors computed in variable elimination; their algorithm breaks down large factors into products of smaller factors and simultaneously computes an error bound for the resulting approximation.
The inclusion of continuous random variables in Bayesian networks was considered by Pearl (1988) and Shachter and Kenley (1989); these papers discussed networks containing only continuous variables with linear Gaussian distributions. The inclusion of discrete variables has been investigated by Lauritzen and Wermuth (1989) and implemented in the
cHUGIN system (Olesen, 1993). Further analysis of linear Gaussian models, with connections to many other models used in statistics, appears in Roweis and Ghahramani (1999). The probit distribution is usually attributed to Gaddum (1933) and Bliss (1934), although it had been discovered several times in the 19th century. Bliss's work was expanded considerably by Finney (1947). The probit has been used widely for modeling discrete choice phenomena and can be extended to handle more than two choices (Daganzo, 1979). The logit model was introduced by Berkson (1944); initially much derided, it eventually became more popular than the probit model. Bishop
(1995) gives a simple justification for its use.
Cooper (1990) showed that the general problem of inference in unconstrained Bayesian networks is NP-hard, and Paul Dagum and Mike Luby (1993) showed the corresponding approximation problem to be NP-hard. Space complexity is also a serious problem in both clustering and variable elimination methods. The method of cutset conditioning, which was developed for CSPs in Chapter 6, avoids the construction of exponentially large tables. In a Bayesian network, a cutset is a set of nodes that, when instantiated, reduces the remaining nodes to a polytree that can be solved in linear time and space. The query is answered by summing over all the instantiations of the cutset, so the overall space requirement is still linear (Pearl, 1988). Darwiche (2001) describes a recursive conditioning algorithm that allows a complete range of space/time tradeoffs.
The development of fast approximation algorithms for Bayesian network inference is a very active area, with contributions from statistics, computer science, and physics. The rejection sampling method is a general technique that is long known to statisticians; it was first applied to Bayesian networks by Max Henrion (1988), who called it logic sampling. Likelihood weighting, which was developed by Fung and Chang (1989) and Shachter and Peot (1989), is an example of the well-known statistical method of importance sampling.
Cheng and Druzdzel (2000) describe an adaptive version of likelihood weighting that works well even when the evidence has very low prior likelihood. Markov chain Monte Carlo (MCMC) algorithms began with the Metropolis algorithm,
due to Metropolis et al. (1953), which was also the source of the simulated annealing algorithm described in Chapter 4. The Gibbs sampler was devised by Geman and Geman (1984) for inference in undirected Markov networks. The application of MCMC to Bayesian networks is due to Pearl (1987). The papers collected by Gilks et al. (1996) cover a wide variety of applications of MCMC, several of which were developed in the well-known BUGS package (Gilks et al., 1994).
There are two very important families of approximation methods that we did not cover in the chapter. The first is the family of variational approximation methods, which can be used to simplify complex calculations of all kinds. The basic idea is to propose a reduced version of the original problem that is simple to work with, but that resembles the original problem as closely as possible. The reduced problem is described by some variational parameters λ that are adjusted to minimize a distance function D between the original and the reduced problem, often by solving the system of equations ∂D/∂λ = 0. In many cases, strict upper and lower bounds can be obtained. Variational methods have long been used in statistics (Rustagi, 1976). In statistical physics, the mean-field method is a particular variational approximation in which the individual variables making up the model are assumed to be completely independent. This idea was applied to solve large undirected Markov networks (Peterson and Anderson, 1987; Parisi, 1988). Saul et al. (1996) developed the mathematical foundations for applying variational methods to Bayesian networks and obtained accurate lower-bound approximations for sigmoid networks with the use of mean-field methods. Jaakkola and Jordan (1996) extended the methodology to obtain both lower and upper bounds. Since these early papers, variational methods have been applied
to
many specific
families of models. The remarkable paper by Wainwright and Jordan (2008) provides a unifying theoretical analysis of the literature on variational methods.
A second important family of approximation algorithms is based on Pearl's polytree message-passing algorithm (1982a). This algorithm can be applied to general networks, as suggested by Pearl (1988). The results might be incorrect, or the algorithm might fail to terminate, but in many cases, the values obtained are close to the true values. Little attention was paid to this so-called belief propagation (or BP) approach until McEliece et al. (1998) observed that message passing in a multiply connected Bayesian network was exactly the computation performed by the turbo decoding algorithm (Berrou et al., 1993), which provided a major breakthrough in the design of efficient error-correcting codes. The implication is that BP is both fast and accurate on the very large and very highly connected networks used for decoding and might therefore be useful more generally. Murphy et al. (1999) presented a promising empirical study of BP's performance, and Weiss and Freeman (2001) established strong convergence results for BP on linear Gaussian networks. Weiss (2000b) shows how an approximation called loopy belief propagation works, and when the approximation is correct. Yedidia et al. (2005) made further connections between loopy propagation and ideas from statistical physics.
The connection between probability and first-order languages was first studied by Carnap (1950). Gaifman (1964) and Scott and Krauss (1966) defined a language in which probabilities could be associated with first-order sentences and for which models were probability measures on possible worlds. Within AI, this idea was developed for propositional logic by Nilsson (1986) and for first-order logic by Halpern (1990). The first extensive investigation of knowledge representation issues in such languages was carried out by Bacchus (1990). The basic idea is that each sentence in the knowledge base expressed a constraint on the distribution over possible worlds; one sentence entails another if it expresses a stronger constraint. For example, the sentence ∀x P(Hungry(x)) > 0.2 rules out distributions in which any object is hungry with probability less than 0.2; thus, it entails the sentence ∀x P(Hungry(x)) > 0.1. It turns out that writing a consistent set of sentences in these
languages is quite difficult and constructing a unique probability model nearly impossible unless one adopts the representation approach of Bayesian networks by writing suitable sentences about conditional probabilities.
Beginning in the early 1990s, researchers working on complex applications noticed the expressive limitations of Bayesian networks and developed various languages for writing "templates" with logical variables, from which large networks could be constructed automatically for each problem instance (Breese, 1992; Wellman et al., 1992). The most important such language was BUGS (Bayesian inference Using Gibbs Sampling) (Gilks et al., 1994), which combined Bayesian networks with the indexed random variable notation common in statistics. (In BUGS, an indexed random variable looks like X[i], where i has a defined integer
range.) These languages inherited the key property of Bayesian networks: every well-formed knowledge base defines a unique, consistent probability model. Languages with well-defined semantics based on unique names and domain closure drew on the representational capabilities of logic programming (Poole, 1993; Sato and Kameya, 1997; Kersting et al., 2000) and semantic networks (Koller and Pfeffer, 1998; Pfeffer, 2000). Pfeffer (2007) went on to develop IBAL, which represents first-order probability models as probabilistic programs in a programming language extended with a randomization primitive. Another important thread was the combination of relational and first-order notations with (undirected) Markov networks (Taskar et al., 2002; Domingos and Richardson, 2004), where the emphasis has been less on knowledge representation and more on learning from large data sets.
Initially, inference in these models was performed by generating an equivalent Bayesian network. Pfeffer et al. (1999) introduced a variable elimination algorithm that cached each computed factor for reuse by later computations involving the same relations but different objects, thereby realizing some of the computational gains of lifting. The first truly lifted inference algorithm was a lifted form of variable elimination described by Poole (2003) and subsequently improved by de Salvo Braz et al. (2007). Further advances, including cases where certain aggregate probabilities can be computed in closed form, are described by Milch et al. (2008) and Kisynski and Poole (2009). Pasula and Russell (2001) studied the application of MCMC to avoid building the complete equivalent Bayes net in cases of relational and identity uncertainty. Getoor and Taskar (2007) collect many important papers on first-order probability models and their use in machine learning.
Probabilistic reasoning about identity uncertainty has two distinct origins. In statistics, the problem of record linkage arises when data records do not contain standard unique identifiers: for example, various citations of this book might name its first author "Stuart Russell" or "S. J. Russell" or even "Stewart Russle," and other authors may use some of the same names. Literally hundreds of companies exist solely to solve record linkage problems in financial, medical, census, and other data. Probabilistic analysis goes back to work by Dunn (1946); the Fellegi-Sunter model (1969), which is essentially naive Bayes applied to matching, still dominates current practice. The second origin for work on identity uncertainty is multitarget tracking (Sittler, 1964), which we cover in Chapter 15. For most of its history, work in symbolic AI assumed erroneously that sensors could supply sentences with unique identifiers for objects. The issue was studied in the context of language understanding by Charniak and Goldman (1992) and in the context of surveillance by Huang and Russell (1998) and Pasula et al. (1999). Pasula et al. (2003) developed a complex generative model for authors, papers, and citation strings, involving both relational and identity uncertainty, and demonstrated high accuracy for citation information extraction. The first formally defined language for open-universe probability models was BLOG (Milch et al., 2005), which came with a complete (albeit slow) MCMC inference algorithm for all well-defined models. (The program code faintly visible on the front cover of this book is part of a BLOG model for detecting nuclear explosions from seismic signals as part of the UN Comprehensive Test Ban Treaty verification regime.) Laskey (2008) describes another open-universe modeling language called multi-entity Bayesian networks.
As explained in Chapter 13, early probabilistic systems fell out of favor in the early 1970s, leaving a partial vacuum to be filled by alternative methods. Certainty factors were invented for use in the medical expert system
MYCIN (Shortliffe, 1976), which was intended
both as an engineering solution and as a model of human judgment under uncertainty. The collection
Rule-Based Expert Systems (Buchanan and Shortliffe, 1984) provides a complete
overview of
MYCIN
and its descendants (see also Stefik, 1995). David Heckerman (1986)
showed that a slightly modified version of certainty factor calculations gives correct probabilistic results in some cases, but results in serious overcounting of evidence in other cases. The PROSPECTOR expert system (Duda et al., 1979) used a rule-based approach in which the rules were justified by a (seldom tenable) global independence assumption. Dempster-Shafer theory originates with a paper by Arthur Dempster (1968) proposing a generalization of probability to interval values and a combination rule for using them. Later work by Glenn Shafer (1976) led to the Dempster-Shafer theory's being viewed as a competing approach to probability. Pearl (1988) and Ruspini
et al. (1992) analyze the relationship
between the Dempster-Shafer theory and standard probability theory.
Fuzzy sets were developed by Lotfi Zadeh (1965) in response to the perceived difficulty of providing exact inputs to intelligent systems. The text by Zimmermann (2001) provides a thorough introduction to fuzzy set theory; papers on fuzzy applications are collected in Zimmermann (1999). As we mentioned in the text, fuzzy logic has often been perceived incorrectly as a direct competitor to probability theory, whereas in fact it addresses a different
set of issues.
Possibility theory (Zadeh, 1978) was introduced to handle uncertainty in fuzzy
systems and has much in common with probability.
Dubois and Prade (1994) survey the
connections between possibility theory and probability theory. The resurgence of probability depended mainly on Pearl's development of Bayesian networks as a method for representing and using conditional independence information. This resurgence did not come without a fight; Peter Cheeseman's (1985) pugnacious "In Defense of Probability" and his later article "An Inquiry into Computer Understanding" (Cheeseman, 1988, with commentaries) give something of the flavor of the debate.
Eugene Charniak helped present the ideas to AI researchers with a popular article, "Bayesian networks without tears"11 (1991), and a book (1993). The book by Dean and Wellman (1991) also helped introduce Bayesian networks to AI researchers. One of the principal philosophical objections of the logicists was that the numerical calculations that probability theory was thought to require were not apparent to introspection and presumed an unrealistic level of precision in our uncertain knowledge. The development of
qualitative probabilistic networks (Wellman,
1990a) provided a purely qualitative abstraction of Bayesian networks, using the notion of positive and negative influences between variables. Wellman shows that in many cases such information is sufficient for optimal decision making without the need for the precise specification of probability values. Goldszmidt and Pearl (1996) take a similar approach. Work by Adnan Darwiche and Matt Ginsberg (1992) extracts the basic properties of conditioning and evidence combination from probability theory and shows that they can also be applied in logical and default reasoning. Often, programs speak louder than words, and the ready availability of high-quality software such as the Bayes Net toolkit (Murphy, 2001) accelerated the adoption of the technology.
The most important single publication in the growth of Bayesian networks was undoubtedly the text Probabilistic Reasoning in Intelligent Systems (Pearl, 1988). Several excellent texts (Lauritzen, 1996; Jensen, 2001; Korb and Nicholson, 2003; Jensen, 2007; Darwiche, 2009; Koller and Friedman, 2009) provide thorough treatments of the topics we have covered in this chapter.
11 The title of the original version of the article was "Pearl for swine."
New research on probabilistic reasoning appears both in mainstream AI journals, such as Artificial Intelligence and the Journal of AI Research, and in more specialized journals, such as the International Journal of Approximate Reasoning. Many papers on graphical models, which include Bayesian networks, appear in statistical journals. The proceedings of the conferences on Uncertainty in Artificial Intelligence (UAI), Neural Information Processing Systems (NIPS), and Artificial Intelligence and Statistics (AISTATS) are excellent sources for current research.
EXERCISES
14.1  We have a bag of three biased coins a, b, and c with probabilities of coming up heads of 20%, 60%, and 80%, respectively. One coin is drawn randomly from the bag (with equal likelihood of drawing each of the three coins), and then the coin is flipped three times to generate the outcomes X1, X2, and X3.
a. Draw the Bayesian network corresponding to this setup and define the necessary CPTs.
b. Calculate which coin was most likely to have been drawn from the bag if the observed flips come out heads twice and tails once.
14.2  Equation (14.1) on page 513 defines the joint distribution represented by a Bayesian network in terms of the parameters θ(Xi | Parents(Xi)). This exercise asks you to derive the equivalence between the parameters and the conditional probabilities P(Xi | Parents(Xi)) from this definition.
a. Consider a simple network X → Y → Z with three Boolean variables. Use Equations (13.3) and (13.6) (pages 485 and 492) to express the conditional probability P(z | y) as the ratio of two sums, each over entries in the joint distribution P(X, Y, Z).
b. Now use Equation (14.1) to write this expression in terms of the network parameters θ(X), θ(Y | X), and θ(Z | Y).
c. Next, expand out the summations in your expression from part (b), writing out explicitly the terms for the true and false values of each summed variable. Assuming that all network parameters satisfy the constraint Σxi θ(xi | parents(Xi)) = 1, show that the resulting expression reduces to θ(x | y).
d. Generalize this derivation to show that θ(Xi | Parents(Xi)) = P(Xi | Parents(Xi)) for any Bayesian network.
14.3  The operation of arc reversal in a Bayesian network allows us to change the direction of an arc X → Y while preserving the joint probability distribution that the network represents (Shachter, 1986). Arc reversal may require introducing new arcs: all the parents of X also become parents of Y, and all parents of Y also become parents of X.
a. Assume that X and Y start with m and n parents, respectively, and that all variables have k values. By calculating the change in size for the CPTs of X and Y, show that the total number of parameters in the network cannot decrease during arc reversal. (Hint: the parents of X and Y need not be disjoint.)
b. Under what circumstances can the total number remain constant?
c. Let the parents of X be U ∪ V and the parents of Y be V ∪ W, where U and W are disjoint. The formulas for the new CPTs after arc reversal are as follows:
   P(Y | U, V, W) = Σx P(Y | V, W, x) P(x | U, V)
   P(X | U, V, W, Y) = P(Y | X, V, W) P(X | U, V) / P(Y | U, V, W) .
   Prove that the new network expresses the same joint distribution over all variables as the original network.
14.4  Consider the Bayesian network in Figure 14.2.
a. If no evidence is observed, are Burglary and Earthquake independent? Prove this from the numerical semantics and from the topological semantics.
b. If we observe Alarm = true, are Burglary and Earthquake independent? Justify your answer by calculating whether the probabilities involved satisfy the definition of conditional independence.
14.5  Suppose that in a Bayesian network containing an unobserved variable Y, all the variables in the Markov blanket MB(Y) have been observed.
a. Prove that removing the node Y from the network will not affect the posterior distribution for any other unobserved variable in the network.
b. Discuss whether we can remove Y if we are planning to use (i) rejection sampling and (ii) likelihood weighting.
14.6  Let Hx be a random variable denoting the handedness of an individual x, with possible values l or r. A common hypothesis is that left- or right-handedness is inherited by a simple mechanism; that is, perhaps there is a gene Gx, also with values l or r, and perhaps actual handedness turns out mostly the same (with some probability s) as the gene an individual possesses. Furthermore, perhaps the gene itself is equally likely to be inherited from either of an individual's parents, with a small nonzero probability m of a random mutation flipping the handedness.
a. Which of the three networks in Figure 14.20 claim that P(Gfather, Gmother, Gchild) = P(Gfather) P(Gmother) P(Gchild)?
b. Which of the three networks make independence claims that are consistent with the hypothesis about the inheritance of handedness?
Figure 14.20  Three possible structures, (a), (b), and (c), for a Bayesian network describing genetic inheritance of handedness.
c. Which of the three networks is the best description of the hypothesis?
d. Write down the CPT for the Gchild node in network (a), in terms of s and m.
e. Suppose that P(Gfather = l) = P(Gmother = l) = q. In network (a), derive an expression for P(Gchild = l) in terms of m and q only, by conditioning on its parent nodes.
f. Under conditions of genetic equilibrium, we expect the distribution of genes to be the same across generations. Use this to calculate the value of q, and, given what you know about handedness in humans, explain why the hypothesis described at the beginning of this question must be wrong.
14.7  The Markov blanket of a variable is defined on page 517. Prove that a variable is independent of all other variables in the network, given its Markov blanket, and derive Equation (14.12) (page 538).
Figure 14.21  A Bayesian network describing some features of a car's electrical system and engine. Each variable is Boolean, and the true value indicates that the corresponding aspect of the vehicle is in working order.
14.8  Consider the network for car diagnosis shown in Figure 14.21.
a. Extend the network with the Boolean variables IcyWeather and StarterMotor.
b. Give reasonable conditional probability tables for all the nodes.
c. How many independent values are contained in the joint probability distribution for eight Boolean nodes, assuming that no conditional independence relations are known to hold among them?
d. How many independent probability values do your network tables contain?
e. The conditional distribution for Starts could be described as a noisy-AND distribution. Define this family in general and relate it to the noisy-OR distribution.
14.9  Consider the family of linear Gaussian networks, as defined on page 520.
a. In a two-variable network, let X1 be the parent of X2, let X1 have a Gaussian prior, and let P(X2 | X1) be a linear Gaussian distribution. Show that the joint distribution P(X1, X2) is a multivariate Gaussian, and calculate its covariance matrix.
b. Prove by induction that the joint distribution for a general linear Gaussian network on X1, . . . , Xn is also a multivariate Gaussian.
14.10  The probit distribution defined on page 522 describes the probability distribution for a Boolean child, given a single continuous parent.
a. How might the definition be extended to cover multiple continuous parents?
b. How might it be extended to handle a multivalued child variable? Consider both cases where the child's values are ordered (as in selecting a gear while driving, depending on speed, slope, desired acceleration, etc.) and cases where they are unordered (as in selecting bus, train, or car to get to work). (Hint: Consider ways to divide the possible values into two sets, to mimic a Boolean variable.)
14.11  In your local nuclear power station, there is an alarm that senses when a temperature gauge exceeds a given threshold. The gauge measures the temperature of the core. Consider the Boolean variables A (alarm sounds), FA (alarm is faulty), and FG (gauge is faulty) and the multivalued nodes G (gauge reading) and T (actual core temperature).
a. Draw a Bayesian network for this domain, given that the gauge is more likely to fail when the core temperature gets too high.
b. Is your network a polytree? Why or why not?
c. Suppose there are just two possible actual and measured temperatures, normal and high; the probability that the gauge gives the correct temperature is x when it is working, but y when it is faulty. Give the conditional probability table associated with G.
d. Suppose the alarm works correctly unless it is faulty, in which case it never sounds. Give the conditional probability table associated with A.
e. Suppose the alarm and gauge are working and the alarm sounds. Calculate an expression for the probability that the temperature of the core is too high, in terms of the various conditional probabilities in the network.
Figure 14.22  Three possible networks, (i), (ii), and (iii), for the telescope problem.
14.12  Two astronomers in different parts of the world make measurements M1 and M2 of the number of stars N in some small region of the sky, using their telescopes. Normally, there is a small possibility e of error by up to one star in each direction. Each telescope can also (with a much smaller probability f) be badly out of focus (events F1 and F2), in which case the scientist will undercount by three or more stars (or if N is less than 3, fail to detect any stars at all). Consider the three networks shown in Figure 14.22.
a. Which of these Bayesian networks are correct (but not necessarily efficient) representations of the preceding information?
b. Which is the best network? Explain.
c. Write out a conditional distribution for P(M1 | N), for the case where N ∈ {1, 2, 3} and M1 ∈ {0, 1, 2, 3, 4}. Each entry in the conditional distribution should be expressed as a function of the parameters e and/or f.
d. Suppose M1 = 1 and M2 = 3. What are the possible numbers of stars if you assume no prior constraint on the values of N?
e. What is the most likely number of stars, given these observations? Explain how to compute this, or if it is not possible to compute, explain what additional information is needed and how it would affect the result.
14.13  Consider the network shown in Figure 14.22(ii), and assume that the two telescopes work identically. N ∈ {1, 2, 3} and M1, M2 ∈ {0, 1, 2, 3, 4}, with the symbolic CPTs as described in Exercise 14.12. Using the enumeration algorithm (Figure 14.9 on page 525), calculate the probability distribution P(N | M1 = 2, M2 = 2).
14.14  Consider the Bayes net shown in Figure 14.23.
a. Which of the following are asserted by the network structure?
   (i) P(B, I, M) = P(B) P(I) P(M).
   (ii) P(J | G) = P(J | G, I).
   (iii) P(M | G, B, I) = P(M | G, B, I, J).
Figure 14.23  A simple Bayes net with Boolean variables B = BrokeElectionLaw, I = Indicted, M = PoliticallyMotivatedProsecutor, G = FoundGuilty, J = Jailed.
b. Calculate the value of P(b, i, ¬m, g, j).
c. Calculate the probability that someone goes to jail given that they broke the law, have been indicted, and face a politically motivated prosecutor.
d. A context-specific independence (see page 542) allows a variable to be independent of some of its parents given certain values of others. In addition to the usual conditional independences given by the graph structure, what context-specific independences exist in the Bayes net in Figure 14.23?
e. Suppose we want to add the variable P = PresidentialPardon to the network; draw the new network and briefly explain any links you add.
14.15  Consider the variable elimination algorithm in Figure 14.11 (page 528).
a. Section 14.4 applies variable elimination to the query
   P(Burglary | JohnCalls = true, MaryCalls = true) .
   Perform the calculations indicated and check that the answer is correct.
b. Count the number of arithmetic operations performed, and compare it with the number performed by the enumeration algorithm.
c. Suppose a network has the form of a chain: a sequence of Boolean variables X1, . . . , Xn where Parents(Xi) = {Xi−1} for i = 2, . . . , n. What is the complexity of computing P(X1 | Xn = true) using enumeration? Using variable elimination?
d. Prove that the complexity of running variable elimination on a polytree network is linear in the size of the tree for any variable ordering consistent with the network structure.
14.16  Investigate the complexity of exact inference in general Bayesian networks:
a. Prove that any 3-SAT problem can be reduced to exact inference in a Bayesian network constructed to represent the particular problem and hence that exact inference is NP-hard. (Hint: Consider a network with one variable for each proposition symbol, one for each clause, and one for the conjunction of clauses.)
b. The problem of counting the number of satisfying assignments for a 3-SAT problem is #P-complete. Show that exact inference is at least as hard as this.
14.17  Consider the problem of generating a random sample from a specified distribution on a single variable. Assume you have a random number generator that returns a random number uniformly distributed between 0 and 1.
a. Let X be a discrete variable with P(X = xi) = pi for i ∈ {1, . . . , k}. The cumulative distribution of X gives the probability that X ∈ {x1, . . . , xj} for each possible j. (See also Appendix A.) Explain how to calculate the cumulative distribution in O(k) time and how to generate a single sample of X from it. Can the latter be done in less than O(k) time?
b. Now suppose we want to generate N samples of X, where N ≫ k. Explain how to do this with an expected run time per sample that is constant (i.e., independent of k).
c. Now consider a continuous-valued variable with a parameterized distribution (e.g., Gaussian). How can samples be generated from such a distribution?
d. Suppose you want to query a continuous-valued variable and you are using a sampling algorithm such as LIKELIHOOD-WEIGHTING to do the inference. How would you have to modify the query-answering process?
14.18  Consider the query P(Rain | Sprinkler = true, WetGrass = true) in Figure 14.12(a) (page 529) and how Gibbs sampling can answer it.
a. How many states does the Markov chain have?
b. Calculate the transition matrix Q containing q(y → y′) for all y, y′.
c. What does Q², the square of the transition matrix, represent?
d. What about Qⁿ as n → ∞?
e. Explain how to do probabilistic inference in Bayesian networks, assuming that Qⁿ is available. Is this a practical way to do inference?
14.19  This exercise explores the stationary distribution for Gibbs sampling methods.
a. The convex composition [α, q1; 1 − α, q2] of q1 and q2 is a transition probability distribution that first chooses one of q1 and q2 with probabilities α and 1 − α, respectively, and then applies whichever is chosen. Prove that if q1 and q2 are in detailed balance with π, then their convex composition is also in detailed balance with π. (Note: this result justifies a variant of GIBBS-ASK in which variables are chosen at random rather than sampled in a fixed sequence.)
b. Prove that if each of q1 and q2 has π as its stationary distribution, then the sequential composition q = q1 ∘ q2 also has π as its stationary distribution.
14.20  The Metropolis-Hastings algorithm is a member of the MCMC family; as such, it is designed to generate samples x (eventually) according to target probabilities π(x). (Typically we are interested in sampling from π(x) = P(x | e).) Like simulated annealing, Metropolis-Hastings operates in two stages. First, it samples a new state x′ from a proposal distribution q(x′ | x), given the current state x. Then, it probabilistically accepts or rejects x′ according to the acceptance probability
   α(x′ | x) = min(1, (π(x′) q(x | x′)) / (π(x) q(x′ | x))) .
If the proposal is rejected, the state remains at x.
a. Consider an ordinary Gibbs sampling step for a specific variable Xi. Show that this step, considered as a proposal, is guaranteed to be accepted by Metropolis-Hastings. (Hence, Gibbs sampling is a special case of Metropolis-Hastings.)
b. Show that the two-step process above, viewed as a transition probability distribution, is in detailed balance with π.
14.21  Three soccer teams A, B, and C play each other once. Each match is between two teams, and can be won, drawn, or lost. Each team has a fixed, unknown degree of quality (an integer ranging from 0 to 3) and the outcome of a match depends probabilistically on the difference in quality between the two teams.
a. Construct a relational probability model to describe this domain, and suggest numerical values for all the necessary probability distributions.
b. Construct the equivalent Bayesian network for the three matches.
c. Suppose that in the first two matches A beats B and draws with C. Using an exact inference algorithm of your choice, compute the posterior distribution for the outcome of the third match.
d. Suppose there are n teams in the league and we have the results for all but the last match. How does the complexity of predicting the last game vary with n?
e. Investigate the application of MCMC to this problem. How quickly does it converge in practice and how well does it scale?
15
PROBABILISTIC REASONING OVER TIME
In which we try to interpret the present, understand the past, and perhaps predict the future, even when very little is crystal clear.
Agents in partially observable environments must be able to keep track of the current state, to the extent that their sensors allow. In Section 4.4 we showed a methodology for doing that: an agent maintains a belief state that represents which states of the world are currently possible. From the belief state and a transition model, the agent can predict how the world might evolve in the next time step. From the percepts observed and a sensor model, the agent can update the belief state. This is a pervasive idea: in Chapter 4 belief states were represented by explicitly enumerated sets of states, whereas in Chapters 7 and 11 they were represented by logical formulas. Those approaches defined belief states in terms of which world states were possible, but could say nothing about which states were likely or unlikely. In this chapter, we use probability theory to quantify the degree of belief in elements of the belief state.
As we show in Section 15.1, time itself is handled in the same way as in Chapter 7: a changing world is modeled using a variable for each aspect of the world state at each point in time. The transition and sensor models may be uncertain: the transition model describes the probability distribution of the variables at time t, given the state of the world at past times, while the sensor model describes the probability of each percept at time t, given the current state of the world. Section 15.2 defines the basic inference tasks and describes the general structure of inference algorithms for temporal models. Then we describe three specific kinds of models: hidden Markov models, Kalman filters, and dynamic Bayesian networks (which include hidden Markov models and Kalman filters as special cases). Finally, Section 15.6 examines the problems faced when keeping track of more than one thing.
15.1
TIME AND UNCERTAINTY
We have developed our techniques for probabilistic reasoning in the context of static worlds, in which each random variable has a single fixed value. For example, when repairing a car, we assume that whatever is broken remains broken during the process of diagnosis; our job is to infer the state of the car from observed evidence, which also remains fixed.
Now consider a slightly different problem: treating a diabetic patient. As in the case of car repair, we have evidence such as recent insulin doses, food intake, blood sugar measurements, and other physical signs. The task is to assess the current state of the patient, including the actual blood sugar level and insulin level. Given this information, we can make a decision about the patient's food intake and insulin dose. Unlike the case of car repair, here the dynamic aspects of the problem are essential. Blood sugar levels and measurements thereof can change rapidly over time, depending on recent food intake and insulin doses, metabolic activity, the time of day, and so on. To assess the current state from the history of evidence and to predict the outcomes of treatment actions, we must model these changes. The same considerations arise in many other contexts, such as tracking the location of a robot, tracking the economic activity of a nation, and making sense of a spoken or written sequence of words. How can dynamic situations like these be modeled?
15.1.1
States and observations
We view the world as a series of snapshots, or time slices, each of which contains a set of random variables, some observable and some not.1 For simplicity, we will assume that the same subset of variables is observable in each time slice (although this is not strictly necessary in anything that follows). We will use Xt to denote the set of state variables at time t, which are assumed to be unobservable, and Et to denote the set of observable evidence variables. The observation at time t is Et = et for some set of values et.
Consider the following example: You are the security guard stationed at a secret underground installation. You want to know whether it's raining today, but your only access to the outside world occurs each morning when you see the director coming in with, or without, an umbrella. For each day t, the set Et thus contains a single evidence variable Umbrellat or Ut for short (whether the umbrella appears), and the set Xt contains a single state variable Raint or Rt for short (whether it is raining). Other problems can involve larger sets of variables. In the diabetes example, we might have evidence variables, such as MeasuredBloodSugart and PulseRatet, and state variables, such as BloodSugart and StomachContentst. (Notice that BloodSugart and MeasuredBloodSugart are not the same variable; this is how we deal with noisy measurements of actual quantities.)
The interval between time slices also depends on the problem. For diabetes monitoring, a suitable interval might be an hour rather than a day. In this chapter we assume the interval between slices is fixed, so we can label times by integers. We will assume that the state sequence starts at t = 0; for various uninteresting reasons, we will assume that evidence starts arriving at t = 1 rather than t = 0. Hence, our umbrella world is represented by state variables R0, R1, R2, . . . and evidence variables U1, U2, . . . . We will use the notation a:b to denote the sequence of integers from a to b (inclusive), and the notation Xa:b to denote the set of variables from Xa to Xb. For example, U1:3 corresponds to the variables U1, U2, U3.
1 Uncertainty over continuous time can be modeled by stochastic differential equations (SDEs). The models studied in this chapter can be viewed as discrete-time approximations to SDEs.
Figure 15.1  (a) Bayesian network structure corresponding to a first-order Markov process with state defined by the variables Xt. (b) A second-order Markov process.
15.1.2
Transition and sensor models
With the set of state and evidence variables for a given problem decided on, the next step is to specify how the world evolves (the transition model) and how the evidence variables get their values (the sensor model). The transition model specifies the probability distribution over the latest state variables, given the previous values, that is, P(Xt | X0:t−1). Now we face a problem: the set X0:t−1 is unbounded in size as t increases. We solve the problem by making a Markov assumption: that the current state depends on only a finite fixed number of previous states. Processes satisfying this assumption were first studied in depth by the Russian statistician Andrei Markov (1856-1922) and are called Markov processes or Markov chains. They come in various flavors; the simplest is the first-order Markov process, in which the current state depends only on the previous state and not on any earlier states. In other words, a state provides enough information to make the future conditionally independent of the past, and we have

P(Xt | X0:t−1) = P(Xt | Xt−1) .    (15.1)

Hence, in a first-order Markov process, the transition model is the conditional distribution P(Xt | Xt−1). The transition model for a second-order Markov process is the conditional distribution P(Xt | Xt−2, Xt−1). Figure 15.1 shows the Bayesian network structures corresponding to first-order and second-order Markov processes.
Even with the Markov assumption there is still a problem: there are infinitely many possible values of t. Do we need to specify a different distribution for each time step? We avoid this problem by assuming that changes in the world state are caused by a stationary process, that is, a process of change that is governed by laws that do not themselves change over time. (Don't confuse stationary with static: in a static process, the state itself does not change.) In the umbrella world, then, the conditional probability of rain, P(Rt | Rt−1), is the same for all t, and we only have to specify one conditional probability table.
Now for the sensor model. The evidence variables Et could depend on previous variables as well as the current state variables, but any state that's worth its salt should suffice to generate the current sensor values. Thus, we make a sensor Markov assumption as follows:

P(Et | X0:t, E0:t−1) = P(Et | Xt) .    (15.2)

Thus, P(Et | Xt) is our sensor model (sometimes called the observation model). Figure 15.2 shows both the transition model and the sensor model for the umbrella example. Notice the
Figure 15.2  Bayesian network structure and conditional distributions describing the umbrella world. The transition model is P(Raint | Raint−1) and the sensor model is P(Umbrellat | Raint). The conditional probability tables give P(Rt = true | Rt−1 = true) = 0.7, P(Rt = true | Rt−1 = false) = 0.3, P(Ut = true | Rt = true) = 0.9, and P(Ut = true | Rt = false) = 0.2.
direction of the dependence between state and sensors: the arrows go from the actual state of the world to sensor values because the state of the world causes the sensors to take on particular values: the rain causes the umbrella to appear. (The inference process, of course, goes in the other direction; the distinction between the direction of modeled dependencies and the direction of inference is one of the principal advantages of Bayesian networks.)
In addition to specifying the transition and sensor models, we need to say how everything gets started: the prior probability distribution at time 0, P(X0). With that, we have a specification of the complete joint distribution over all the variables, using Equation (14.2). For any t,

P(X0:t, E1:t) = P(X0) ∏_{i=1}^{t} P(Xi | Xi−1) P(Ei | Xi) .    (15.3)

The three terms on the right-hand side are the initial state model P(X0), the transition model P(Xi | Xi−1), and the sensor model P(Ei | Xi).
Whether such an assumption
is reasonable depends on the domain itself. The first-order Markov asswnption says that the state variables contain
all the
information needed to characterize the probability distribution
for the next time slice. Sometimes the assumption is exactly true-for example, if a patticle is executing a random walk along the x-axis, changing its position by
±1
at each time step,
then using the x-coordinate as the state gives a first-order Markov process. Sometimes the assumption is only approximate, as in the case of predicting rain only on the basis of whether it rained the previous day. There are two ways to improve the accuracy of the approximation:
1 . Increasing the order of the Markov process model. For example, we could make a second-order model by adding
Ra.int-2 as a parent of Raint, which might give slightly
more accurate predictions. For example, in Palo Alto, California, it very rarely rains more than two days in a row.
2.
Increasing the set of state variables.
For example, we could add
Seasont to allow
Chapter
570
15.
Probabilistic Reasoning over Time
us to incorporate historical records of rainy seasons, or we could add
Temperaturet,
Humidityt and Pressuret (perhaps at a range of locations) to allow us to use a physical
model of rainy conditions. Exercise
15.1
asks you to show that the first solution-increasing the order--can always be
reformulated as an increase in the set of state variables, keeping the order fixed. Notice that adding state variables might improve the system's predictive power but also increases the prediction
requirements: we now have to predict the new variables as well. Thus, we are
looking for a "self-sufficient" set of variables, which really means that we have to understand the "physics" of the process being modeled. The requirement for accurate modeling of the process is obviously lessened if we can add new sensors (e.g., measurements of temperature and pressure) that provide information directly about the new state variables. Consider, for example, the problem of tracking a robot wandering randomly on the X-Y
plane. One might propose t.hat the position and velocity are a sufficient set of state variables:
one can simply use Newton's laws to calculate the new position, and the velocity may change unpredictably. If the robot is battery-powered, however, then battery exhaustion would tend to have a systematic effect on the change in velocity. Because this n i turn depends on how much power was used by all previous maneuvers, the Markov property is violated. We can restore the Markov property by including the charge level make up
Xt.
for predicting
Batteryt as one of the state variables that
This helps in predicting the motion of the robot, but in tw·n requires a model
Batteryt from Batteryt-1 and the velocity.
In some cases, that can be
done
reliably, but more often we find that e1ror accumulates over time. In that case, accuracy can be improved by adding a
1 5 .2
new sensor for the battery level.
INFERENCE I N TEMPORAL MODELS
Having set up the structure of a generic temporal model, we can fonnulate the basic inference tasks that must be solved: FILTERitiG
•
BELIEF STATE STATE ESTIMATION
Filtering:
belief state-the posterior distribution over the most recent state-given all evidence to date. Filtering2 is also called state estimation. In our example, we wish to compute P(Xt I el:t) · In the umbrella example, This is the task of computing the
this would mean computing the probability of rain today, given all the observations of the wnbrella carrier made so far. Filtering is what a rational agent does to keep track of the current state so that rational decisions can be made. It turns out that an almost
identical calculation provides the likelihood of the evidence sequence, PREDICTION
•
P(el:t) ·
Prediction: This is the task ofcomputing the posterior distribution over thefuture state, given all evidence to date. That is, we wish to compute P(Xt+k I el:t ) for some k > 0. In the umbrella example, this might mean computing the probability of rain three days
from now, given all the observations to date. Prediction is useful for evaluating possible courses of action based on their expected outcomes.
2 The tenn "filtering" refers to the roots of this problem in early work on signal processing, where the problem is to filter out the noise in signal by estimating its underlying properties. a
Section
15.2.
571
Inference in Temporal Models
SMOOTHING
•
Smoothing: This is the task of computing the posterior distribution over a past state, given all evidence up to the present. That is, we wish to compute P(Xk I el:t) for some k such that 0
� k < t.
In the umbrella example, it might mean computing the probability
that it rained last Wednesday, given all the observations of the tunbrella carrier made up to today. Smoothing provides a better estimate of the state than was available at the 3 time, because it incorporates more evidence. •
Most Ukely explanation: Given a sequence of observations, we might wish to find the sequence of states that is most likely to have generated those observations. That is, we wish to comput.e a.rgmaxxl:t
P(xt:t I el:t)· For example, ifthe umbrella appears on each
of the first three days and is absent on the fourth, then the most likely explanation is that it rained on the first three days and did not rain on the fourth. Algorithms for this task are useful in many applications, including speech recognition-where the aim is to find the most likely sequence of words, given a series of sounds-and the reconstruction of bit strings transmitted over a noisy charu1el. In addition to these inference tasks, we also have •
Learning:
The transition and sensor models, if not yet known, can be learned from
observations. Just as with static Bayesian networks, dynamic Bayes net learning can be done as a by-product of inference. Inference provides an estimate of what transitions actually occtuTed and of what states generated the sensor readings, and these estimates can be used to update the models. The updated models provide new estimates, and the process iterates to convergence. The overall process is an instance of the expectation maximization or
EM algorithm. (See Section 20.3.)
Note that learning requires smoothing, rather than filtering, because smoothing provides bet ter estimates of the states of the process. Leaming with filtering can fail to converge correctly; consider, for example, the problem of learning to solve murders: unless you are an eyewit ness, smoothing is
always
required to infer what happened at the murder scene from the
observable variables. The remainder of this section describes generic algoritluns for the four inference tasks, independent of the particular kind of model employed. Improvements specific to each model are described in subsequent sections.
15.2.1
Filtering and prediction
As we pointed out in Section 7.7.3, a useful filtering algorithm needs to maintain a current state estimate and update it, rather than going back over the entire history of percepts for each update. (Otherwise, the cost of each update increases as time goes by.) In other words, given the result of filtering up to time t, the agent needs to compute the result for t + 1 from the new evidence et+1,

P(Xt+1 | e1:t+1) = f(et+1, P(Xt | e1:t)) ,

for some function f. This process is called recursive estimation. We can view the calculation as being composed of two parts: first, the current state distribution is projected forward from t to t + 1; then it is updated using the new evidence et+1. This two-part process emerges quite simply when the formula is rearranged:

P(Xt+1 | e1:t+1) = P(Xt+1 | e1:t, et+1)   (dividing up the evidence)
  = α P(et+1 | Xt+1, e1:t) P(Xt+1 | e1:t)   (using Bayes' rule)
  = α P(et+1 | Xt+1) P(Xt+1 | e1:t)   (by the sensor Markov assumption).    (15.4)

3 In particular, when tracking a moving object with inaccurate position observations, smoothing gives a smoother estimated trajectory than filtering, hence the name.
Here and throughout this chapter, α is a normalizing constant used to make probabilities sum to 1. The second term, P(Xt+1 | e1:t), represents a one-step prediction of the next state, and the first term updates this with the new evidence; notice that P(et+1 | Xt+1) is obtainable directly from the sensor model. Now we obtain the one-step prediction for the next state by conditioning on the current state Xt:

P(Xt+1 | e1:t+1) = α P(et+1 | Xt+1) Σxt P(Xt+1 | xt, e1:t) P(xt | e1:t)
  = α P(et+1 | Xt+1) Σxt P(Xt+1 | xt) P(xt | e1:t)   (Markov assumption).    (15.5)

Within the summation, the first factor comes from the transition model and the second comes from the current state distribution. Hence, we have the desired recursive formulation. We can think of the filtered estimate P(Xt | e1:t) as a "message" f1:t that is propagated forward along the sequence, modified by each transition and updated by each new observation. The process is given by

f1:t+1 = α FORWARD(f1:t, et+1) ,

where FORWARD implements the update described in Equation (15.5) and the process begins with f1:0 = P(X0). When all the state variables are discrete, the time for each update is constant (i.e., independent of t), and the space required is also constant. (The constants depend, of course, on the size of the state space and the specific type of the temporal model in question.)
The time and space requirements for updating must be constant if an agent with limited memory is to keep track of the current state distribution over an unbounded sequence of observations.
Let us illustrate the filtering process for two steps in the basic umbrella example (Figure 15.2). That is, we will compute P(R2 | U1:2) as follows:
• On day 0, we have no observations, only the security guard's prior beliefs; let's assume that consists of P(R0) = ⟨0.5, 0.5⟩.
• On day 1, the umbrella appears, so U1 = true. The prediction from t = 0 to t = 1 is

  P(R1) = Σr0 P(R1 | r0) P(r0)
        = ⟨0.7, 0.3⟩ × 0.5 + ⟨0.3, 0.7⟩ × 0.5 = ⟨0.5, 0.5⟩ .

  Then the update step simply multiplies by the probability of the evidence for t = 1 and normalizes, as shown in Equation (15.4):

  P(R1 | u1) = α P(u1 | R1) P(R1) = α ⟨0.9, 0.2⟩⟨0.5, 0.5⟩
             = α ⟨0.45, 0.1⟩ ≈ ⟨0.818, 0.182⟩ .

• On day 2, the umbrella appears, so U2 = true. The prediction from t = 1 to t = 2 is

  P(R2 | u1) = Σr1 P(R2 | r1) P(r1 | u1)
             = ⟨0.7, 0.3⟩ × 0.818 + ⟨0.3, 0.7⟩ × 0.182 ≈ ⟨0.627, 0.373⟩ ,

  and updating it with the evidence for t = 2 gives

  P(R2 | u1, u2) = α P(u2 | R2) P(R2 | u1) = α ⟨0.9, 0.2⟩⟨0.627, 0.373⟩
                 = α ⟨0.565, 0.075⟩ ≈ ⟨0.883, 0.117⟩ .
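The same arithmetic can be checked mechanically. The following minimal Python sketch (our own illustration rather than an algorithm from the text) implements the predict-then-update step of Equation (15.5) for the umbrella model and reproduces the numbers above.

```python
# One FORWARD filtering step (Equation 15.5) for the umbrella model, as a sketch.
TRANSITION = {True: 0.7, False: 0.3}       # P(R_t = true | R_{t-1})
SENSOR = {True: 0.9, False: 0.2}           # P(U_t = true | R_t)

def forward(belief, umbrella_observed):
    """Project the belief forward one step, then condition on the umbrella observation."""
    predicted = {r_next: sum((TRANSITION[r] if r_next else 1.0 - TRANSITION[r]) * belief[r]
                             for r in (True, False))
                 for r_next in (True, False)}
    updated = {r: (SENSOR[r] if umbrella_observed else 1.0 - SENSOR[r]) * predicted[r]
               for r in (True, False)}
    z = sum(updated.values())                          # the normalizing constant alpha
    return {r: p / z for r, p in updated.items()}

belief = {True: 0.5, False: 0.5}                       # P(R0)
belief = forward(belief, True)                         # day 1: about {True: 0.818, False: 0.182}
print(belief)
belief = forward(belief, True)                         # day 2: about {True: 0.883, False: 0.117}
print(belief)
```

Because each step uses only the previous belief and the new observation, the cost per update is constant, as noted above.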
Intuitively, the probability of rain increases from day 1 to day 2 because rain persists. Exercise 15.2(a) asks you to investigate this tendency further.
The task of prediction can be seen simply as filtering without the addition of new evidence. In fact, the filtering process already incorporates a one-step prediction, and it is easy to derive the following recursive computation for predicting the state at t + k + 1 from a prediction for t + k:

P(Xt+k+1 | e1:t) = Σxt+k P(Xt+k+1 | xt+k) P(xt+k | e1:t) .    (15.6)
MIXING TIME
Naturally, this computation involves only the transition model and not the sensor model. It is interesting to consider what happens as we try to predict further and further into the future. As Exercise 15.2(b) shows, the predicted distribution for rain converges to a fixed point (0.5, 0.5), after which it remains constant for all time. This is the stationary distribution of the Markov process defined by the transition model. (See also page 537.) A great deal is known about the properties of such distributions and about the mixing time roughly, the time taken to reach the fixed point. In practical terms, this dooms to failure any
attempt to predict the actual state for a number of steps that is more than a small fraction of the mixing time, unless the stationary distribution itself is strongly peaked in a small area of the state space. The more uncertainty there is in the transition model, the shorter will be the mixing time and the more the future is obscured. In addition to filtering and prediction, we can use a forward recursion to compute the likelihood of the evidence sequence, P(el t)· This is a useful quantity if we want to compare different temporal models that might have produced the same evidence sequence (e.g., two different models for the persistence of rain). For this recursion, we use a likelihood message ft:t(Xt) = P(Xt el:t)· It is a simple exercise to show that the message calculation is identical to that for filtering: :
ℓ_{1:t+1} = FORWARD(ℓ_{1:t}, e_{t+1}) .
Having computed ℓ_{1:t}, we obtain the actual likelihood by summing out x_t:
L_{1:t} = P(e_{1:t}) = Σ_{x_t} ℓ_{1:t}(x_t) .   (15.7)
Notice that the likelihood message represents the probabilities of longer and longer evidence sequences as time goes by and so becomes numerically smaller and smaller, leading to underflow problems with floating-point arithmetic. This is an important problem in practice, but we shall not go into solutions here.
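One common remedy, assumed here rather than prescribed by the text, is to carry the likelihood in log space: normalize the filtering message at each step and accumulate the log of the normalization constant, which equals P(e_t | e_{1:t-1}).

import math

def forward_loglik(evidence_seq, prior, T, O):
    """Filter while accumulating log P(e_1:t); avoids underflow on long sequences."""
    f, loglik = list(prior), 0.0
    n = len(prior)
    for e in evidence_seq:
        predicted = [sum(T[r][s] * f[r] for r in range(n)) for s in range(n)]
        unnormalized = [O[e][s] * predicted[s] for s in range(n)]
        z = sum(unnormalized)        # z = P(e_t | e_1:t-1)
        loglik += math.log(z)        # the log accumulates; it never underflows
        f = [x / z for x in unnormalized]
    return f, loglik

# With T, O, and the prior from the previous sketch,
# forward_loglik([True, True], [0.5, 0.5], T, O) returns the day-2 belief
# together with log P(u_1, u_2).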
m_{1:t} = max_{x_1 ... x_{t-1}} P(x_1, ..., x_{t-1}, X_t | e_{1:t}) ,
that is, the probabilities of the most likely path to each state x_t; and
2. the summation over x_t in Equation (15.5) is replaced by the maximization over x_t in Equation (15.11).
Thus, the algorithm for computing the most likely sequence is similar to filtering: it runs forward along the sequence, computing the m message at each time step, using Equation (15.11).
The progress of this computation is shown in Figure 15.5(b). At the end, it will have the probability for the most likely sequence reaching each of the final states. One can thus easily select the most likely sequence overall (the states outlined in bold). In order to identify the actual sequence, as opposed to just computing its probability, the algorithm will also need to record, for each state, the best state that leads to it; these are indicated by the bold arrows in Figure 15.5(b). The optimal sequence is identified by following these bold arrows backward from the best final state.
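The forward maximization with back-pointers can be sketched in a few lines for the two-state umbrella model; the function name and the list-based model format below are illustrative assumptions, not the book's implementation.

def viterbi(evidence_seq, prior, T, O):
    """Return the most likely state sequence given the evidence.
    prior is the distribution over the state at time 0; T and O are as in the
    earlier filtering sketch."""
    n = len(prior)
    # m message for the first time step: one transition from the prior, then evidence.
    m = [O[evidence_seq[0]][s] * sum(T[r][s] * prior[r] for r in range(n))
         for s in range(n)]
    back = []                                # back[t][s] = best predecessor of s
    for e in evidence_seq[1:]:
        prev, new_m = [], []
        for s in range(n):
            best_r = max(range(n), key=lambda r: m[r] * T[r][s])
            prev.append(best_r)
            new_m.append(O[e][s] * m[best_r] * T[best_r][s])
        back.append(prev)
        m = new_m
    # Follow the back-pointers from the best final state.
    s = max(range(n), key=lambda i: m[i])
    path = [s]
    for prev in reversed(back):
        s = prev[s]
        path.append(s)
    return list(reversed(path))

# viterbi([True, True, False, True, True], [0.5, 0.5], T, O)
# returns the most likely weather sequence (0 = rain, 1 = no rain).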
In other words, once the probabilities and utilities of the possible outcome states are specified, the utility of a compound lottery involving those states is completely determined. Because the outcome of a nondeterministic action is a lottery, it follows that an agent can act rationally, that is, consistently with its preferences, only by choosing an action that maximizes expected utility according to Equation (16.1).
The preceding theorems establish that a utility function exists for any rational agent, but they do not establish that it is unique. It is easy to see, in fact, that an agent's behavior would not change if its utility function U(S) were transformed according to
U'(S) = aU(S) + b ,   (16.2)
VALUE FUNCTION
ORDINAL UTILITY FUNCTION
where a and b are constants and a > 0; an affine transformation.4 This fact was noted in Chapter 5 for two-player games of chance; here, we see that it is completely general. As in game-playing, in a deterministic environment an agent just needs a preference ranking on states-the numbers don't matter. This is called a value function or ordinal utility function.
It is important to remember that the existence of a utility function that describes an agent's preference behavior does not necessarily mean that the agent is explicitly maximizing that utility function in its own deliberations. As we showed in Chapter 2, rational behavior can be generated in any number of ways. By observing a rational agent's preferences, however, an observer can construct the utility function that represents what the agent is actually trying to achieve (even if the agent doesn't know it).
4 In this sense, utilities resemble temperatures: a temperature in Fahrenheit is 1.8 times the Celsius temperature plus 32. You get the same results in either measurement system.
16.3
UTILITY FUNCTIONS
Utility is a function that maps from lotteries to real numbers. We know there are some axioms
on utilities that all rational agents must obey. Is that all we can say about utility functions?
Strictly speaking, that is it: an agent can have any preferences it likes. For example,
Figure 17.7 The policy iteration algorithm for calculating an optimal policy.
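As a rough illustration of what the figure describes, here is a small Python sketch of policy iteration with a simplified (fixed-sweep) evaluation step; the data structures and names are assumptions made for this sketch, not the book's code.

import random

def policy_iteration(S, A, P, R, gamma=0.9, eval_sweeps=20):
    """S: list of states; A(s): available actions; P[s][a]: list of
    (probability, next_state) pairs; R[s]: reward for state s."""
    U = {s: 0.0 for s in S}
    pi = {s: random.choice(A(s)) for s in S}
    while True:
        # Policy evaluation (simplified: a fixed number of Bellman backups).
        for _ in range(eval_sweeps):
            U = {s: R[s] + gamma * sum(p * U[s1] for p, s1 in P[s][pi[s]])
                 for s in S}
        # Policy improvement.
        unchanged = True
        for s in S:
            q = lambda a: sum(p * U[s1] for p, s1 in P[s][a])
            best = max(A(s), key=q)
            if q(best) > q(pi[s]):
                pi[s] = best
                unchanged = False
        if unchanged:
            return pi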
The algorithms we have described so far require updating the utility or policy for all states at once. It turns out that this is not strictly necessary. In fact, on each iteration, we can
ASYNCHRONOUS POLICY ITERATION
pick any subset of states and apply either kind of updating (policy improvement or simplified value iteration) to that subset. This very general algorithm is called asynchronous policy iteration. Given certain conditions on the initial policy and initial utility function, asynchronous policy iteration is guaranteed to converge to an optimal policy. The freedom to choose any states to work on means that we can design much more efficient heuristic algorithms-for example, algorithms that concentrate on updating the values of states that are likely to be reached by a good policy. This makes a lot of sense in real life: if one has no intention of throwing oneself off a cliff, one should not spend time worrying about the exact value of the resulting states.
17.4
PARTIALLY OBSERVABLE MDPS
The description of Markov decision processes in Section 17.1 assumed that the environment was fully observable. With this assumption, the agent always knows which state it is in.
This, combined with the Markov assumption for the transition model, means that the optimal policy depends only on the current state. When the environment is only partially observable, the situation is, one might say, much less clear. The agent does not necessarily know which state it is in, so it cannot execute the action π(s) recommended for that state. Furthermore, the utility of a state s and the optimal action in s depend not just on s, but also on how much the
PARTIALLY OBSERVABLE MDP
agent knows when it is in s. For these reasons, partially observable MDPs (or POMDPs, pronounced "pom-dee-pees") are usually viewed as much more difficult than ordinary MDPs. We cannot avoid POMDPs, however, because the real world is one.
17.4.1
Definition of POMDPs
To get a handle on POMDPs, we must first define them properly. A POMDP has the same elements as an MDP-the transition model P(s' | s, a), actions A(s), and reward function R(s)-but, like the partially observable search problems of Section 4.4, it also has a sensor model P(e | s). Here, as in Chapter 15, the sensor model specifies the probability of perceiving evidence e in state s.3 For example, we can convert the 4 x 3 world of Figure 17.1 into
a POMDP by adding a noisy or partial sensor instead of assuming that the agent knows its location exactly. Such a sensor might measure the number of adjacent walls, which happens to be 2 in all the nonterminal squares except for those in the third column, where the value is 1; a noisy version might give the wrong value with probability 0.1.
In Chapters 4 and 11, we studied nondeterministic and partially observable planning problems and identified the belief state-the set of actual states the agent might be in-as a key concept for describing and calculating solutions. In POMDPs, the belief state
b becomes a probability distribution over all possible states, just as in Chapter 15.
3 As with the reward function for MDPs, the sensor model can also depend on the action and outcome state, but again this change is not fundamental.
For example, the initial
belief state for the 4 x 3 POMDP could be the uniform distribution over the nine nonterminal states, i.e., ⟨1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 0, 0⟩. We write b(s) for the probability assigned to the actual state s by belief state b. The agent can calculate its current belief state as the conditional probability distribution over the actual states given the sequence of percepts and actions so far. This is essentially the filtering task described in Chapter 15. The basic recursive filtering equation (15.5 on page 572) shows how to calculate the new belief state from the previous belief state and the new evidence. For POMDPs, we also have an action to consider, but the result is essentially the same. If b(s) was the previous belief state, and the agent does action a and then perceives evidence e, then the new belief state is given by
b'(s') = α P(e | s') Σ_s P(s' | s, a) b(s) ,
where α is a normalizing constant that makes the belief state sum to 1. By analogy with the update operator for filtering (page 572), we can write this as
b' = FORWARD(b, a, e) .   (17.11)
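Equation (17.11) translates directly into code. The sketch below assumes dictionary-based transition and sensor models; the representation and names are illustrative, not taken from the book.

def update_belief(b, a, e, states, P, sensor):
    """b'(s') = alpha * P(e | s') * sum_s P(s' | s, a) * b(s)   -- Equation (17.11).
    P[s][a][s1] is the transition probability; sensor[s1][e] is the sensor model."""
    new_b = {}
    for s1 in states:
        new_b[s1] = sensor[s1][e] * sum(P[s][a].get(s1, 0.0) * b[s] for s in states)
    alpha = 1.0 / sum(new_b.values())
    return {s1: alpha * p for s1, p in new_b.items()}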
In the 4 x 3 POMDP, suppose the agent moves Left and its sensor reports 1 adjacent wall; then it's quite likely (although not guaranteed, because both the motion and the sensor are noisy) that the agent is now in (3,1). Exercise 17.13 asks you to calculate the exact probability values for the new belief state.
The fundamental insight required to understand POMDPs is this: the optimal action depends only on the agent's current belief state. That is, the optimal policy can be described by a mapping π*(b) from belief states to actions. It does not depend on the actual state the agent is in. This is a good thing, because the agent does not know its actual state; all it knows is the belief state. Hence, the decision cycle of a POMDP agent can be broken down into the following three steps:
1. Given the current belief state b, execute the action a = π*(b).
2. Receive percept e.
3. Set the current belief state to FORWARD(b, a, e) and repeat.
Now we can think of POMDPs as requiring a search in belief-state space, just like the methods for sensorless and contingency problems in Chapter 4. The main difference is that the POMDP belief-state space is continuous, because a POMDP belief state is a probability distribution. For example, a belief state for the 4 x 3 world is a point in an 11-dimensional continuous space. An action changes the belief state, not just the physical state. Hence, the action is evaluated at least in part according to the information the agent acquires as a result. POMDPs therefore include the value of information (Section 16.6) as one component of the decision problem.
Let's look more carefully at the outcome of actions. In particular, let's calculate the probability that an agent in belief state b reaches belief state b' after executing action a. Now, if we knew the action and the subsequent percept, then Equation (17.11) would provide a deterministic update to the belief state: b' = FORWARD(b, a, e). Of course, the subsequent percept is not yet known, so the agent might arrive in one of several possible belief states b', depending on the percept that is received. The probability of perceiving e, given that a was
performed starting in belief state b, is given by summing over all the actual states s' that the agent might reach:
P(e | a, b) = Σ_{s'} P(e | a, s', b) P(s' | a, b)
            = Σ_{s'} P(e | s') P(s' | a, b)
            = Σ_{s'} P(e | s') Σ_s P(s' | s, a) b(s) .
Let us write the probability of reaching b' from b, given action a, as P(b' | b, a). Then that gives us
P(b' | b, a) = P(b' | a, b) = Σ_e P(b' | e, a, b) P(e | a, b)
             = Σ_e P(b' | e, a, b) Σ_{s'} P(e | s') Σ_s P(s' | s, a) b(s) ,   (17.12)
where P(b' | e, a, b) is 1 if b' = FORWARD(b, a, e) and 0 otherwise. Equation (17.12) can be viewed as defining a transition model for the belief-state space. We can also define a reward function for belief states (i.e., the expected reward for the actual states the agent might be in):
ρ(b) = Σ_s b(s) R(s) .
Together, P(b' | b, a) and ρ(b) define an observable MDP on the space of belief states. Furthermore, it can be shown that an optimal policy for this MDP, π*(b), is also an optimal policy for the original POMDP. In other words, solving a POMDP on a physical state space can be reduced to solving an MDP on the corresponding belief-state space. This fact is perhaps less surprising if we remember that the belief state is always observable to the agent, by definition.
Notice that, although we have reduced POMDPs to MDPs, the MDP we obtain has a continuous (and usually high-dimensional) state space. None of the MDP algorithms described in Sections 17.2 and 17.3 applies directly to such MDPs. The next two subsections describe a value iteration algorithm designed specifically for POMDPs and an online decision-making algorithm, similar to those developed for games in Chapter 5.
17.4.2
Value iteration for POMDPs
Section 17.2 described a value iteration algorithm that computed one utility value for each state. With infinitely many belief states, we need to be more creative. Consider an optimal policy π* and its application in a specific belief state b: the policy generates an action, then, for each subsequent percept, the belief state is updated and a new action is generated, and so on. For this specific b, therefore, the policy is exactly equivalent to a conditional plan, as defined in Chapter 4 for nondeterministic and partially observable problems. Instead of thinking about policies, let us think about conditional plans and how the expected utility of executing a fixed conditional plan varies with the initial belief state. We make two observations:
1. Let the utility of executing a fixed conditional plan p starting in physical state s be α_p(s). Then the expected utility of executing p in belief state b is just Σ_s b(s) α_p(s), or b · α_p if we think of them both as vectors. Hence, the expected utility of a fixed conditional plan varies linearly with b; that is, it corresponds to a hyperplane in belief space.
2. At any given belief state b, the optimal policy will choose to execute the conditional plan with highest expected utility; and the expected utility of b under the optimal policy is just the utility of that conditional plan:
U(b) = U^{π*}(b) = max_p b · α_p .
If the optimal policy π* chooses to execute p starting at b, then it is reasonable to expect that it might choose to execute p in belief states that are very close to b; in fact, if we bound the depth of the conditional plans, then there are only finitely many such plans and the continuous space of belief states will generally be divided into regions, each corresponding to a particular conditional plan that is optimal in that region.
From these two observations, we see that the utility function U(b) on belief states, being the maximum of a collection of hyperplanes, will be piecewise linear and convex.
To illustrate this, we use a simple two-state world. The states are labeled 0 and 1, with R(0) = 0 and R(1) = 1. There are two actions: Stay stays put with probability 0.9 and Go switches to the other state with probability 0.9. For now we will assume the discount factor γ = 1. The sensor reports the correct state with probability 0.6. Obviously, the agent should Stay when it thinks it's in state 1 and Go when it thinks it's in state 0.
The advantage of a two-state world is that the belief space can be viewed as one-dimensional, because the two probabilities must sum to 1. In Figure 17.8(a), the x-axis represents the belief state, defined by b(1), the probability of being in state 1. Now let us consider the one-step plans [Stay] and [Go], each of which receives the reward for the current state followed by the (discounted) reward for the state reached after the action:
α_[Stay](0) = R(0) + γ(0.9 R(0) + 0.1 R(1)) = 0.1
α_[Stay](1) = R(1) + γ(0.9 R(1) + 0.1 R(0)) = 1.9
α_[Go](0) = R(0) + γ(0.9 R(1) + 0.1 R(0)) = 0.9
α_[Go](1) = R(1) + γ(0.9 R(0) + 0.1 R(1)) = 1.1
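These four values, and the piecewise-linear one-step utility built from them, can be reproduced with a few lines of code; the encoding below is an assumption made for illustration, not the book's code.

R = [0.0, 1.0]          # rewards for states 0 and 1
gamma = 1.0
# P[action][s][s1]: Stay keeps the state with probability 0.9, Go switches with 0.9.
P = {"Stay": [[0.9, 0.1], [0.1, 0.9]],
     "Go":   [[0.1, 0.9], [0.9, 0.1]]}

def alpha(action):
    # One-step plan: current reward plus discounted expected reward after the action.
    return [R[s] + gamma * sum(P[action][s][s1] * R[s1] for s1 in (0, 1))
            for s in (0, 1)]

a_stay, a_go = alpha("Stay"), alpha("Go")   # [0.1, 1.9] and [0.9, 1.1]

def utility_one_step(b1):
    # Belief b = (1 - b1, b1); the utility is the max over the two lines.
    b = (1 - b1, b1)
    return max(b[0] * a_stay[0] + b[1] * a_stay[1],
               b[0] * a_go[0] + b[1] * a_go[1])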
The hyperplanes (lines, in this case) for b · α_[Stay] and b · α_[Go] are shown in Figure 17.8(a) and their maximum is shown in bold. The bold line therefore represents the utility function for the finite-horizon problem that allows just one action, and in each "piece" of the piecewise linear utility function the optimal action is the first action of the corresponding conditional plan. In this case, the optimal one-step policy is to Stay when b(1) > 0.5 and Go otherwise.
Once we have utilities α_p(s) for all the conditional plans p of depth 1 in each physical state s, we can compute the utilities for conditional plans of depth 2 by considering each possible first action, each possible subsequent percept, and then each way of choosing a depth-1 plan to execute for each percept:
[Stay; if Percept = 0 then Stay else Stay]
[Stay; if Percept = 0 then Stay else Go] ...
Figure 17.8 (a) Utility of two one-step plans as a function of the initial belief state b(1) for the two-state world, with the corresponding utility function shown in bold. (b) Utilities for 8 distinct two-step plans. (c) Utilities for four undominated two-step plans. (d) Utility function for optimal eight-step plans.
DOMINATED PLAN
There are eight distinct depth-2 plans in all, and their utilities are shown in Figure 17.8(b). Notice that four of the plans, shown as dashed lines, are suboptimal across the entire belief space-we say these plans are dominated, and they need not be considered further. There are four undominated plans, each of which is optimal in a specific region, as shown in Figure 17.8(c). The regions partition the belief-state space.
We repeat the process for depth 3, and so on. In general, let p be a depth-d conditional plan whose initial action is a and whose depth-(d-1) subplan for percept e is p.e; then
α_p(s) = R(s) + γ ( Σ_{s'} P(s' | s, a) Σ_e P(e | s') α_{p.e}(s') ) .   (17.13)
This recursion naturally gives us a value iteration algorithm, which is sketched in Figure 17.9. The structure of the algorithm and its error analysis are similar to those of the basic value iteration algorithm in Figure 17.4 on page 653; the main difference is that instead of computing one utility number for each state, POMDP-VALUE-ITERATION maintains a collection of undominated plans with their utility hyperplanes.
function POMDP-VALUE-ITERATION(pomdp, ε) returns a utility function
  inputs: pomdp, a POMDP with states S, actions A(s), transition model P(s' | s, a),
            sensor model P(e | s), rewards R(s), discount γ
          ε, the maximum error allowed in the utility of any state
  local variables: U, U', sets of plans p with associated utility vectors α_p
  U' ← a set containing just the empty plan [ ], with α_[ ](s) = R(s)
  repeat
    U ← U'

n total steps. The agents are thus incapable of representing the number of remaining steps, and must treat it as an unknown. Therefore, they cannot do the induction, and are free to arrive at the more favorable (refuse, refuse) equilibrium. In this case, ignorance is bliss-or rather, having your opponent believe that you are ignorant is bliss. Your success in these repeated games depends on the other player's perception of you as a bully or a simpleton, and not on your actual characteristics.
17.5.3
EXTENSIVE FORM
Sequential games
In the general case, a game consists of a sequence of turns that need not be all the same. Such games are best represented by a game tree, which game theorists call the extensive form. The tree includes all the same information we saw in Section 5.1: an initial state S_0, a function PLAYER(s) that tells which player has the move, a function ACTIONS(s) enumerating the possible actions, a function RESULT(s, a) that defines the transition to a new state, and a partial function UTILITY(s, p), which is defined only on terminal states, to give the payoff for each player. To represent stochastic games, such as backgammon, we add a distinguished player, chance, that can take random actions. Chance's "strategy" is part of the definition of the
game, specified as a probability distribution over actions (the other players get to choose their own strategy). To represent games with nondeterministic actions, such as billiards, we break the action into two pieces: the player's action itself has a deterministic result, and then chance has a turn to react to the action in its own capricious way. To represent simultaneous moves, as in the prisoner's dilemma or two-finger Morra, we impose an arbitrary order on the players, but we have the option of asserting that the earlier player's actions are not observable to the subsequent players: e.g., Alice must choose refuse or testify first, then Bob chooses, but Bob does not know what choice Alice made at that time (we can also represent the fact that the move is revealed later). However, we assume the players always remember all their own previous actions; this assumption is called perfect recall.
The key idea of extensive form that sets it apart from the game trees of Chapter 5 is the representation of partial observability. We saw in Section 5.6 that a player in a partially observable game such as Kriegspiel can create a game tree over the space of belief states. With that tree, we saw that in some cases a player can find a sequence of moves (a strategy) that leads to a forced checkmate regardless of what actual state we started in, and regardless of what strategy the opponent uses. However, the techniques of Chapter 5 could not tell a player what to do when there is no guaranteed checkmate. If the player's best strategy depends on the opponent's strategy and vice versa, then minimax (or alpha-beta) by itself cannot
INFORMATION SETS
find a solution. The extensive form does allow us to find solutions because it represents the belief states (game theorists call them information sets) of all players at once. From that representation we can find equilibrium solutions, just as we did with normal-form games.
As a simple example of a sequential game, place two agents in the 4 x 3 world of Figure 17.1 and have them move simultaneously until one agent reaches an exit square, and gets the payoff for that square. If we specify that no movement occurs when the two agents try to move into the same square simultaneously (a common problem at many traffic intersections), then certain pure strategies can get stuck forever. Thus, agents need a mixed strategy to perform well in this game: randomly choose between moving ahead and staying put. This is exactly what is done to resolve packet collisions in Ethernet networks.
Next we'll consider a very simple variant of poker. The deck has only four cards, two aces and two kings. One card is dealt to each player. The first player then has the option to raise the stakes of the game from 1 point to 2, or to check. If player 1 checks, the game is over. If he raises, then player 2 has the option to call, accepting that the game is worth 2 points, or fold, conceding the 1 point. If the game does not end with a fold, then the payoff depends on the cards: it is zero for both players if they have the same card; otherwise the player with the king pays the stakes to the player with the ace.
The extensive-form tree for this game is shown in Figure 17.13. Nonterminal states are shown as circles, with the player to move inside the circle; player 0 is chance. Each action is depicted as an arrow with a label, corresponding to a raise, check, call, or fold, or, for chance, the four possible deals ("AK" means that player 1 gets an ace and player 2 a king). Terminal states are rectangles labeled by their payoff to player 1 and player 2. Information sets are shown as labeled dashed boxes; for example, I_{1,1} is the information set where it is player 1's turn, and he knows he has an ace (but does not know what player 2 has). In information set I_{2,1}, it is player 2's turn and she knows that she has an ace and that player 1 has raised,
Figure 17.13 Extensive form of a simplified version of poker.
but does not know what card player 1 has. (Due to the limits of two-dimensional paper, this information set is shown as two boxes rather than one.)
One way to solve an extensive game is to convert it to a normal-form game. Recall that the normal form is a matrix, each row of which is labeled with a pure strategy for player 1, and each column by a pure strategy for player 2. In an extensive game a pure strategy for player i corresponds to an action for each information set involving that player. So in Figure 17.13, one pure strategy for player 1 is "raise when in I_{1,1} (that is, when I have an ace), and check when in I_{1,2} (when I have a king)." In the payoff matrix below, this strategy is called rk. Similarly, strategy cf for player 2 means "call when I have an ace and fold when I have a king." Since this is a zero-sum game, the matrix below gives only the payoff for player 1; player 2 always has the opposite payoff:

        2:cc     2:cf     2:ff     2:fc
1:rr     0       -1/6      1       7/6
1:kr    -1/3     -1/6     5/6      2/3
1:rk     1/3       0      1/6      1/2
1:kk     0         0       0        0
This game is so simple that it has two pure-strategy equilibria, shown in bold: cf for player 2 and rk or kk for player 1. But in general we can solve extensive games by converting to normal form and then finding a solution (usually a mixed strategy) using standard linear programming methods. That works in theory. But if a player has I information sets and a actions per set, then that player will have a^I pure strategies. In other words, the size of the normal-form matrix is exponential in the number of information sets, so in practice the
SEQUENCE FORM
ABSTRACTION
approach works only for very small game trees, on the order of a dozen states. A game like Texas hold'em poker has about 10^18 states, making this approach completely infeasible. What are the alternatives? In Chapter 5 we saw how alpha-beta search could handle games of perfect information with huge game trees by generating the tree incrementally, by pruning some branches, and by heuristically evaluating nonterminal nodes. But that approach does not work well for games with imperfect information, for two reasons: first, it is harder to prune, because we need to consider mixed strategies that combine multiple branches, not a pure strategy that always chooses the best branch. Second, it is harder to heuristically evaluate a nonterminal node, because we are dealing with information sets, not individual states.
Koller et al. (1996) come to the rescue with an alternative representation of extensive games, called the sequence form, that is only linear in the size of the tree, rather than exponential. Rather than represent strategies, it represents paths through the tree; the number of paths is equal to the number of terminal nodes. Standard linear programming methods can again be applied to this representation. The resulting system can solve poker variants with 25,000 states in a minute or two. This is an exponential speedup over the normal-form approach, but still falls far short of handling full poker, with 10^18 states.
If we can't handle 10^18 states, perhaps we can simplify the problem by changing the game to a simpler form. For example, if I hold an ace and am considering the possibility that the next card will give me a pair of aces, then I don't care about the suit of the next card; any suit will do equally well. This suggests forming an abstraction of the game, one in which suits are ignored. The resulting game tree will be smaller by a factor of 4! = 24. Suppose I can solve this smaller game; how will the solution to that game relate to the original game? If no player is going for a flush (or bluffing so), then the suits don't matter to any player, and the solution for the abstraction will also be a solution for the original game. However, if any
player is contemplating a flush, then the abstraction will be only an approximate solution (but it is possible to compute bounds on the error). There are many opportunities for abstraction. For example, at the point in a game where each player has two cards, if I hold a pair of queens, then the other players' hands could be abstracted into three classes: better (only a pair of kings or a pair of aces), same (pair of queens) or worse (everything else). However, this abstraction might be too coarse. A better abstraction would divide worse into, say, medium pair (nines through jacks), low pair, and no pair. These examples are abstractions of states; it is also possible to abstract actions. For example, instead of having a bet action for each integer from 1 to 1000, we could restrict the bets to 10^0, 10^1, 10^2 and 10^3. Or we could cut out one of the rounds of betting altogether. We can also abstract over chance nodes, by considering only a subset of the possible deals. This is equivalent to the rollout technique used in Go programs. Putting all these abstractions together, we can reduce the 10^18 states of poker to 10^7 states, a size that can be solved with current techniques.
Poker programs based on this approach can easily defeat novice and some experienced human players, but are not yet at the level of master players. Part of the problem is that the solution these programs approximate-the equilibrium solution-is optimal only against an opponent who also plays the equilibrium strategy. Against fallible human players it is important to be able to exploit an opponent's deviation from the equilibrium strategy. As
COURNOT COMPETITION
BAYES-NASH EQUILIBRIUM
Gautam Rao (aka "The Count"), the world's leading online poker player, said (Billings et al., 2003), "You have a very strong program. Once you add opponent modeling to it, it will kill everyone." However, good models of human fallibility remain elusive.
In a sense, the extensive game form is one of the most complete representations we have seen so far: it can handle partially observable, multiagent, stochastic, sequential, dynamic environments-most of the hard cases from the list of environment properties on page 42. However, there are two limitations of game theory. First, it does not deal well with continuous states and actions (although there have been some extensions to the continuous case; for example, the theory of Cournot competition uses game theory to solve problems where two companies choose prices for their products from a continuous space). Second, game theory assumes the game is known. Parts of the game may be specified as unobservable to some of the players, but it must be known what parts are unobservable. In cases in which the players learn the unknown structure of the game over time, the model begins to break down. Let's examine each source of uncertainty, and whether each can be represented in game theory.
Actions: There is no easy way to represent a game where the players have to discover what actions are available. Consider the game between computer virus writers and security experts. Part of the problem is anticipating what action the virus writers will try next.
Strategies: Game theory is very good at representing the idea that the other players' strategies are initially unknown-as long as we assume all agents are rational. The theory itself does not say what to do when the other players are less than fully rational. The notion of a Bayes-Nash equilibrium partially addresses this point: it is an equilibrium with respect to a player's prior probability distribution over the other players' strategies-in other words, it expresses a player's beliefs about the other players' likely strategies.
Chance: If a game depends on the roll of a die, it is easy enough to model a chance node with uniform distribution over the outcomes. But what if it is possible that the die is unfair? We can represent that with another chance node, higher up in the tree, with two branches for "die is fair" and "die is unfair," such that the corresponding nodes in each branch are in the same information set (that is, the players don't know if the die is fair or not). And what if we suspect the other opponent does know? Then we add another chance node, with one branch representing the case where the opponent does know, and one where he doesn't.
Utilities: What if we don't know our opponent's utilities? Again, that can be modeled with a chance node, such that the other agent knows its own utilities in each branch, but we don't. But what if we don't know our own utilities? For example, how do I know if it is rational to order the Chef's salad if I don't know how much I will like it? We can model that with yet another chance node specifying an unobservable "intrinsic quality" of the salad.
Thus, we see that game theory is good at representing most sources of uncertainty-but at the cost of doubling the size of the tree every time we add another node; a habit which quickly leads to intractably large trees. Because of these and other problems, game theory has been used primarily to analyze environments that are at equilibrium, rather than to control agents within an environment.
Next we shall see how it can help design environments.
17.6
MECHANISM DESIGN
MECHANISM
CENTER
In the previous section, we asked, "Given a game, what is a rational strategy?" In this section, we ask, "Given that agents pick rational strategies, what game should we design?" More specifically, we would like to design a game whose solutions, consisting of each agent pursuing its own rational strategy, result in the maximization of some global utility function. This problem is called mechanism design, or sometimes inverse game theory. Mechanism design is a staple of economics and political science. Capitalism 101 says that if everyone tries to get rich, the total wealth of society will increase. But the examples we will discuss show that proper mechanism design is necessary to keep the invisible hand on track. For collections of agents, mechanism design allows us to construct smart systems out of a collection of more limited systems, even uncooperative systems, in much the same way that teams of humans can achieve goals beyond the reach of any individual.
Examples of mechanism design include auctioning off cheap airline tickets, routing TCP packets between computers, deciding how medical interns will be assigned to hospitals, and deciding how robotic soccer players will cooperate with their teammates. Mechanism design became more than an academic subject in the 1990s when several nations, faced with the problem of auctioning off licenses to broadcast in various frequency bands, lost hundreds of millions of dollars in potential revenue as a result of poor mechanism design. Formally, a mechanism consists of (1) a language for describing the set of allowable strategies that agents may adopt, (2) a distinguished agent, called the center, that collects reports of strategy choices from the agents in the game, and (3) an outcome rule, known to all agents, that the center uses to determine the payoffs to each agent, given their strategy choices.
17.6.1
Auctions
AUCTION
ASCENDING-BID
ENGLISH AUCTION
Let's consider auctions first. An auction is a mechanism for selling some goods to members of a pool of bidders. For simplicity, we concentrate on auctions with a single item for sale. Each bidder i has a utility value v_i for having the item. In some cases, each bidder has a private value for the item. For example, the first item sold on eBay was a broken laser pointer, which sold for $14.83 to a collector of broken laser pointers. Thus, we know that the collector has v_i ≥ $14.83, but most other people would have v_j ≪ $14.83. In other cases, such as auctioning drilling rights for an oil tract, the item has a common value-the tract will produce some amount of money, X, and all bidders value a dollar equally-but there is uncertainty as to what the actual value of X is. Different bidders have different information, and hence different estimates of the item's true value. In either case, bidders end up with their own v_i. Given v_i, each bidder gets a chance, at the appropriate time or times in the auction, to make a bid b_i. The highest bid, b_max, wins the item, but the price paid need not be b_max; that's part of the mechanism design.
The best-known auction mechanism is the ascending-bid,8 or English auction, in which the center starts by asking for a minimum (or reserve) bid b_min. If some bidder is
8 The word "auction" comes from the Latin augere, to increase.
EFFICIENT
COLLUSION
STRATEGY-PROOF
TRUTH-REVEALING
REVELATION PRINCIPLE
willing to pay that amount, the center then asks for b_min + d, for some increment d, and continues up from there. The auction ends when nobody is willing to bid anymore; then the last bidder wins the item, paying the price he bid.
How do we know if this is a good mechanism? One goal is to maximize expected revenue for the seller. Another goal is to maximize a notion of global utility. These goals overlap to some extent, because one aspect of maximizing global utility is to ensure that the winner of the auction is the agent who values the item the most (and thus is willing to pay the most). We say an auction is efficient if the goods go to the agent who values them most. The ascending-bid auction is usually both efficient and revenue maximizing, but if the reserve price is set too high, the bidder who values it most may not bid, and if the reserve is set too low, the seller loses net revenue.
Probably the most important things that an auction mechanism can do is encourage a sufficient number of bidders to enter the game and discourage them from engaging in collusion. Collusion is an unfair or illegal agreement by two or more bidders to manipulate prices. It can happen in secret backroom deals or tacitly, within the rules of the mechanism.
For example, in 1999, Germany auctioned ten blocks of cell-phone spectrum with a simultaneous auction (bids were taken on all ten blocks at the same time), using the rule that any bid must be a minimum of a 10% raise over the previous bid on a block. There were only two credible bidders, and the first, Mannesman, entered the bid of 20 million deutschmark on blocks 1-5 and 18.18 million on blocks 6-10. Why 18.18M? One of T-Mobile's managers said they "interpreted Mannesman's first bid as an offer." Both parties could compute that a 10% raise on 18.18M is 19.99M; thus Mannesman's bid was interpreted as saying "we can each get half the blocks for 20M; let's not spoil it by bidding the prices up higher." And in fact T-Mobile bid 20M on blocks 6-10 and that was the end of the bidding. The German government got less than they expected, because the two competitors were able to use the bidding mechanism to come to a tacit agreement on how not to compete. From the government's point of view, a better result could have been obtained by any of these changes to the mechanism: a higher reserve price; a sealed-bid first-price auction, so that the competitors could not communicate through their bids; or incentives to bring in a third bidder. Perhaps the 10% rule was an error in mechanism design, because it facilitated the precise signaling from Mannesman to T-Mobile.
In general, both the seller and the global utility function benefit if there are more bidders, although global utility can suffer if you count the cost of wasted time of bidders that have no chance of winning. One way to encourage more bidders is to make the mechanism easier for them. After all, if it requires too much research or computation on the part of the bidders, they may decide to take their money elsewhere. So it is desirable that the bidders have a dominant strategy. Recall that "dominant" means that the strategy works against all other strategies, which in turn means that an agent can adopt it without regard for the other strategies. An agent with a dominant strategy can just bid, without wasting time contemplating other agents' possible strategies. A mechanism where agents have a dominant strategy is called a strategy-proof mechanism.
If, as is usually the case, that strategy involves the bidders revealing their true value, v_i, then it is called a truth-revealing, or truthful, auction; the term incentive compatible is also used. The revelation principle states that any mecha-
nism can be transformed into an equivalent truth-revealing mechanism, so part of mechanism design is finding these equivalent mechanisms.
It turns out that the ascending-bid auction has most of the desirable properties. The
bidder with the highest value v_i gets the goods at a price of b_o + d, where b_o is the highest bid among all the other agents and d is the auctioneer's increment.9 Bidders have a simple dominant strategy: keep bidding as long as the current cost is below your v_i. The mechanism is not quite truth-revealing, because the winning bidder reveals only that his v_i ≥ b_o + d; we have a lower bound on v_i but not an exact amount.
A disadvantage (from the point of view of the seller) of the ascending-bid auction is that it can discourage competition. Suppose that in a bid for cell-phone spectrum there is one advantaged company that everyone agrees would be able to leverage existing customers and infrastructure, and thus can make a larger profit than anyone else. Potential competitors can see that they have no chance in an ascending-bid auction, because the advantaged company can always bid higher. Thus, the competitors may not enter at all, and the advantaged company ends up winning at the reserve price.
Another negative property of the English auction is its high communication costs. Either the auction takes place in one room or all bidders have to have high-speed, secure communication lines; in either case they have to have the time available to go through several rounds of
SEALED-BID AUCTION
bidding. An alternative mechanism, which requires much less communication, is the sealed-bid auction. Each bidder makes a single bid and communicates it to the auctioneer, without the other bidders seeing it. With this mechanism, there is no longer a simple dominant strategy. If your value is v_i and you believe that the maximum of all the other agents' bids will be b_o, then you should bid b_o + ε, for some small ε, if that is less than v_i. Thus, your bid
SEALED-BID SECOND-PRICE AUCTION VICKREYAUCTION
A small change in the mechanism for sealed-bid auctions produces the sealed-bid second-price auction, also known as a Vickrey auction. 10 In such auctions, the winner pays the price of the second-highest bid,
b0, rather than paying his own bid.
This simple modifi
cation completely eliminates the complex deliberations required for standard (or first-price) sealed-bid auctions, because the dominant strategy is now simply to bid vi; the mechanism is truth-revealing. Note that the utility of agent i in terms of his bid bi, his value vi, and the best
{ (vi - bo)
bid among the other agents, b0, is U'·
=
0
if bi >
bo
otherwise.
To see that bi = vi is a dominant strategy, note that when (vi - bo) is positive, any bid that wins the auction is optimal, and bidding vi in particular wins the auction. On the other hand, when
(vi - bo)
is negative, any bid that loses the auction is optimal, and bidding vi in
9 There is actually a small chance that the agent with highest Vi fails to get the goods, in the cnse in which b0 < Vi < b0 +d. The chance of this can be made arbitrarily small by decreasing the increment d. 10 Named after William Vickrey (1914-1996), who won the 1996 Nobel Prize in economics for this work and
died of a heart attack three days later.
Chapter
682
17.
Making Complex Decisions
particular loses the auction. So bidding vi is optimal for all possible values of b0, and in fact, vi
is the only bid that has this property. Because of its simplicity and the minimal computation
requirements for both seller and bidders, the Vickrey auction is widely used in constructing distributed AI systems. Also, Internet search engines conduct over a billion auctions a day to sell advertisements along with their search results, and on1ine auction sites handle $100 billion a year in goods, all using variants ofthe Vickrey auction. Note that the expected value to the seller is REVENUE EQUIVALENCE THEOREM
b0, which is the same expected return as the limit of the English auction as
the increment d goes to zero. This is actually a very general result: the revenue equivalence theorem
states that, with a few minor caveats, any auction mechanism where risk-neutral
bitldt:rS ftavt: va)Ut:S 'Vi kHUWII only tU tht:IIISdVt:S (but kHUW a probability UistributiUII from
which those values are sampled), will yield the same expected revenue. This principle means that the various mechanisms are not competing on the basis of revenue generation, but rather on other qualities. Although the second-price auction is tmth-revealing, it turns out that extending the idea to multiple goods and using a next-price auction is not truth-revealing. Many Internet search engines use a mechanism where they auction k slots for ads on a page. The highest bidder wins the top spot, the second highest gets the second spot, and so on. Each winner pays the price bid by the next-lower bidder, with the understanding that payment is made on1y if the searcher actually clicks on the ad. The top slots are considered more valuable because they are more likely to be noticed and clicked on. Imagine that three bidders, b1, � and b3, have valuations for a click of v1 = 200, 'V2 = 180, and V3 = 100, and thatk = 2 slots are available, where it is known that the top spot is clicked on 5% of the time and the bottom spot 2%. If all bidders bid truthfully, then b1 wins the top slot and pays 180, and has an expected return
0.05= 1 . The second slot goes to �. But b1 can see that if she were to bid anything in the range 101-179, she would concede the top slot to �, win the second slot, and yield an expected return of (200- 100) x .02 = 2. Thus, b1 can double her expected return by
of (200 - 180)
x
bidding less than her true value in this case. In general, bidders in this multislot auction must spend a lot of energy analyzing the bids of others to detennine their best strategy; there is no
simple dominant strategy. Aggarwal et at. (2006) show that there is a unique truthful auction mechanism for this multislot problem, in which the winner of slot j pays the full price for slot j just for those additional clicks that are available at slot j and not at slot j + 1 . The winner pays the price for the lower slot for the remaining clicks. In our example, b1 would bid 200 truthfully, and would pay 180 for the additional .05 - .02 = .03 clicks in the top slot, but would pay only the cost of the bottom slot, 100, for the remaining .02 clicks. Thus, the total return to
b1 would be (200 - 180)
x
.03 + (200 - 100)
x
.02 = 2.6.
Another example of where auctions can come into play within AI is when a collection of agents are deciding whether to cooperate on a joint plan. Hunsberger and Grosz (2000) show that this can be accomplished efficiently with an auction in which the agents bid for roles in the joint plan.
Section 17 .6.
Mechanism Design 17.6.2
683
Common goods
Now let's consider another type of game, in which countries set their policy for controlling air pollution. Each cotmtry has a choice: they can reduce pollution at a cost of -10 points for
implementing the necessary changes, or they can continue to pollute, which gives them a net utility of -5 (in added health costs, etc.) and also contributes -1 points to every other cotmtry (because the air is shared across countries). Clearly, the dominant strategy for each cotmtry
is "continue to pollute;' but if there are 100 countries and each follows this policy, then each country gets a total utility of -104, whereas if every country reduced pollution, they would
TRAGELY OFTHE COM!DIS
each have a utility of -10. This situation is called the tragedy of the commons: if nobody has to pay for using a common resoun:e, then it tends to be exploited in a way that leads to a lower total utility for all agents. It is similar to the prisoner's dilemma: there is another solution to the game that is better for all parties, but there appears to be no way for rational
agents to arrive at that solution.
The standard approach for dealing with the tragedy of the commons is to change the mechanism to one that charges each agent for using the commons. More generally, we need
EXTERNALITIES
to ensure that all externaliti�ffects on global utility that are not recognized in the in dividual agents' transactions-are made explicit. Setting the prices correctly is the difficult pa11. In the limit, this approach amounts to creating a mechanism in which each agent is
effectively required to maximize global utility, but can do so by making a local decision. For
tl1is example, a carbon tax would be an example of a mechanism that charges for use of the commons in a way that, if implemented well, maximizes global utility.
As a final example, consider the problem of allocating some common goods. Suppose a city decides it wants to install some free wireless Internet transceivers. However, the number of transceivers tlley can afford is less than the nwnber of neighborhoods that want them. The city wants to allocate the goods efficiently, to the neighborhoods that would value them the most. That is, they want to maximize the global utility
V = Ei vi·
The problem is that if
they just ask each neighborhood council "how much do you value this free gift?" they would all have an incentive to lie, and report a high value. It turns out there is a mechanism, known
VICKRE'fCLARKE GRCNES VCG
as the Vickrey-Ciarke-Groves, or VCG, mechanism, that makes it a dominant strategy for
each agent to report its true utility and that achieves an efficient allocation of the goods. The trick is that each agent pays a tax equivalent to the loss in global utility that occurs because
of the agent's presence in the game. The mechanism works like this: l . The center asks each agent to report its value for receiving an item. Call this bi.
2. The center allocates tl1e goods to a subset of the bidders. We call this subset A, and use
bi(A) to mean the result to i under this allocation: bi if i is in A (that is, i is a wilmer), and 0 otherwise. The center chooses A to maximize total reported utility B = Ei bi(A). the notation
3. The center calculates (for each except
i.
i) the sum of the reported utilities for all the winners
We use the notation B_i
= L:;,o;i bj(A).
The center also computes (for each
i) the allocation that would maximize total global utility if i were not in the game; call that sum w-i·
4. Each agent i pays a tax equal to W-i - B-i·
684
Chapter 17.
Making Complex Decisions
In this example, the VCG rule means that each winner would pay a tax equal to the highest repo11ed value among the losers. That is, if I report my value as 5, and that causes someone with value 2 to miss out on an allocation, then I pay a tax of 2. All wilmers should be happy because they pay a tax that is less than their value, and all losers are as happy as they can be, because they value the goods less than the required tax. Why is it that this mechanism is truth-revealing? First, consider the payoff to agent i, which is the value of getting an item, minus the tax: (17.14) Here we distinguish the agent's true utility, Vi, from his reported utility bi (but we are trying to show that a dominant strategy is bi = v,). Agent i knows that the center will maximize global utility using the reported values,
L bj(A) = bi(A) + L b;(A)
j jfi whereas agent i wants the center to maximize (17.14), which can be rewritten as vi(A) + L bj(A) - w_i . jfi Since agent i cannot affect the value of W-i (it depends only on the other agents), the only way i can make the center optimize what i wants is to report the true utility, bi =vi· 17.7
SUMMARY
nus chapter shows how to use knowledge about the world to make decisions even when the outcomes of an action are uncertain and the rewards for acting might not be reaped until many actions have passed. The main points are as follows: •
•
•
•
•
Sequential decision problems in uncertain environments, also called Markov decision processes, or MOPs, are defined by a transition model specifying the probabilistic outcomes of actions and a reward function specifying the reward in each state. The utility of a state sequence is the sum of all the rewards over the sequence, possibly discounted over time. The solution of an MOP is a policy that associates a decision with every state that the agent might reach. An optimal policy maximizes the utility of the state sequences encountered when it is executed. 1l1e utility of a state is the expected utility of the state sequences encountered when an optimal policy is executed, starting ill that state. The value iteration algOJittun for solving MOPs works by iteratively solving the equations relating the utility of each state to those of its neighbors. Policy iteration alternates between calculating the utilities of states under the cun·ent policy and improving the current policy with respect to the current utilities. Partially observable MOPs, or POMDPs, are much more difficult to solve than are MOPs. They can be solved by conversion to an MDP in the continuous space of belief
Bibliographical and Historical Notes
685
states; both value iteration and policy iteration algorithms have been devised. Optimal behavior in POMDPs includes information gathering to reduce uncertainty and there fore make better decisions in the future. • A decision-theoretic agent can be constructed for POMDP environments.
uses a
dynamic decision network
The agent
to represent the transition and sensor models, to
update its belief state, and to project forward possible action sequences. •
Game theory
describes rational behavior for agents in situations in which multiple
agents interact simultaneously. Solutions of games are files in which no agent has an incentive •
Nash equilibria-strategy pro
to deviate from the specified strategy.
Mechanism design can be used to set the rules by which agents
will interact,
in order
to maximize some global utility through the operation of individually rational agents. Sometimes, mechanisms exist that achieve this goal without requiring each agent consider the choices made by other agents. We shall return to the world of MDPs and POMDP in Chapter
21,
when we study
to
rein
forcement learning methods that allow an agent to improve its behavior from experience in sequential, uncertain environments.
BffiLIOGRAPHICAL AND HISTORICAL NOTES Richard Bellman developed the ideas underlying the modem approach to sequential decision problems while working at the RAND Corporation beginning in
1949.
According to his au
tobiography (Bellman, 1984), he coined the exciting term "dynamic programming" to hide from a research-phobic Secretary of Defense, Charles Wilson, the fact that his group was
doing mathematics. (This cannot be strictly true, because his first paper using the term (Bell man, 1952) appeared before Wilson became Secretary of Defense in 1953.) Bellman's book, Dynamic Programming (1957), gave the new field a solid foundation and introduced the basic algorithmic approaches. Ron Howard's Ph.D. thesis (1960) introduced policy iteration and the idea of average reward for solving infinite-horizon problems. Several additional results were introduced by Bellman and Dreyfus
(1962). Modified policy iteration is due to van (1976) and Puterman and Shin (1978). Asynchronous policy iteration was analyzed by Williams and Baird (1993), who also proved the policy loss bound in Equation (17.9). The analysis of discounting in terms of stationary preferences is due to Koopmans (1972). The texts by Bertsekas (1987), Puterman (1994), and Bertsekas and Tsitsiklis (1996) provide a rigorous introduction to sequential decision problems. Papadimitriou and Tsitsiklis (1987) Nunen
describe results on the computational complexity of MDPs. Seminal work by Sutton
(1988) and Watkins (1989) on reinforcement learning methods
for solving MDPs played a significant role in introducing MDPs into the AI community, as did the later survey by Barto
et al. (1995).
similar ideas, but was not taken up
(Earlier work by Werbos
to the same extent.)
AI planning problems was made first by Sven Koenig
(1977) contained many
The cormection between MDPs and
(1991), who showed how probabilistic
STRIPS operators provide a compact representation for transition models (see also Wellman,
686
FACTORED MOP
RELATIONAL MOP
Chapter
17.
Making Complex Decisions
1990b). Work by Dean et at. (1993) and Tash and Russell (1994) attempted to overcome the combinatorics of large state spaces by using a limited search horizon and abstract states. Heuristics based on the value of information can be used to select areas of the state space where a local expansion of the horizon will yield a significant improvement in decision qual ity. Agents using this approach can tailor their effort to handle time pressure and generate some interesting behaviors such as using familiar "beaten paths" to find their way around the state space quickly without having to recompute optimal decisions at each point. As one might expect, AI researchers have pushed MOPs in the direction of more ex pressive representations that can accommodate much larger problems than the traditional atomic representations based on transition matrices. The use of a dynamic Bayesian network to represent transition models was an obvious idea, but work on factored MDPs (Boutilier et at., 2000; Koller and Parr, 2000; Guestrin et at. , 2003b) extends the idea to structured representations of the value function with provable improvements in complexity. Relational MOPs (Boutilier et at., 2001; Guestrin et at., 2003a) go one step further, using structured representations to handle domains with many related objects. The observation that a partially observable MOP can be transformed into a regular MDP over belief states is due to Astrom (1965) and Aoki (1965). The first complete algorithm for the exact solution of POMDPs--essentially the value iteration algorithm presented in this chapter-was proposed by Edward Sondik (1971) in his Ph.D. thesis. (A later journal paper by Smallwood and Sondik (1973) contains some errors, but is more accessible.) Lovejoy (1991) surveyed the first twenty-five years of POMDP research, reaching somewhat pes simistic conclusions about the feasibility of solving large problems. The first significant contribution within AI was the Witness algorithm (Cassandra et at., 1994; Kaelbling et at. , 1998), an improved version of POMDP value iteration. Other algorithms soon followed, in cluding an approach due to Hansen (1998) that constructs a policy incrementally in the fonn of a finite-state automaton. In this policy representation, the belief state corresponds directly to a particular state in the automaton. More recent work in AI has focused on point-based value iteration methods that, at each iteration, generate conditional plans and a-vectors for a finite set of belief states rather than for the entire belief space. Lovejoy (1991) proposed such an algorithm for a fixed grid of points, an approach taken also by Bonet (2002). An influential paper by Pineau et at. (2003) suggested generating reachable points by simulat ing trajectories in a somewhat greedy fashion; Spaan and Vlassis (2005) observe that one need generate plans for only a small, randomly selected subset of points to improve on the plans from the previous iteration for all points in the set. Current point-based methods such as point-based policy iteration (Ji et at. , 2007Man generate near-optimal solutions for POMDPs with thousands of states. Because POMDPs are PSPACE-hard (Papadimitriou and Tsitsiklis, 1987), further progress may require taking advantage of various kinds of structure within a factored representation. The online approach-using look-ahead search to select an action for lhe cun-ent belief state-was first examined by Satia and Lave (1973). 
The use of sampling at chance nodes was expl01-ed analytically by Keams et at. (2000) and Ng and Jordan (2000). The basic ideas for an agent architecture using dynamic decision networks were proposed by Dean and Kanazawa (1989a). The book Planning and Control by Dean and Wellman (1991) goes
Bibliographical and Historical Notes
687
into much greater depth, making connections between DBN/DDN models and the c1assical control literature on filtering. Tatman and Shachter (1990) showed how to apply dynamic programming algorithms to DDN models. Russell (1998) explains various ways in which such agents can be scaled up and identities a number of open research issues. The roots of game theory can be traced back to proposals made in the 17th century by Clu·istiaan Huygens and Gottfried Leibniz to study competitive and cooperative human interactions scientitically and mathematically. Throughout the 19th century, several leading economists created simple mathematical examples to analyze particular examples of com petitive situations. The first formal results in game theory are due to Zermelo (1913) (who had, the year before, suggested a form of minimax search for games, albeit an incorrect one). Emile Borel (1921) introduced the notion of a mixed strategy. John von Neumann (1928) proved that every two-person, zero-sum game has a maximin equilibrium in mixed strategies and a well-defined value. Von Neumann's collaboration with the economist Oskar Morgen stem led to the publication in 1944 of the Theory of Games and Economic Behavior, the defining book for game theory. Publication of the book was delayed by the wartime paper shortage until a member of the Rockefeller family personally subsidized its publication. In 1950, at the age of21, John Nash published his ideas concerning equilibria in general (non-zero-sum) games. His definition of an equilibrium solution, although originating in the work of Coumot (1838), became known as Nash equilibrium. After a long delay because of the schizophrenia he suffered from 1959 onward, Nash was awarded the Nobel Memorial Prize in Economics (along with Reinhart Selten and John Harsanyi) in 1994. The Bayes-Nash equilibrium is described by Harsanyi (1967) and discussed by Kadane and Larkey (1982). Some issues in the use of game theory for agent control are covered by Birunore (1982). The prisoner's dilemma was invented as a classroom exercise by Albert W. Tucker in 1950 (based on an example by Merrill Flood and Melvin Dresher) and is covered extensively by Axelrod (1985) and Pmmdstone (1993). Repeated games were introduced by Luce and Raiffa (1957), and games of partial information in extensive form by Kuhn (1953). The first practical algoritlun for sequential, partial-information games was developed within AI by Koller et al. (1996); the paper by Koller and Pfeffer (1997) provides a readable introduction to the field and describe a working system for representing and solving sequential games. The use of abstraction to reduce a game tree to a size that can be solved with Koller's technique is discussed by Billings et al. (2003). Bowling et al. (2008) show how to use importance sampling to get a better estimate of the value of a strategy. Waugh et al. (2009) show that the abstraction approach is vulnerable to making systematic errors in approximating the equilibrium solution, meaning that the whole approach is on shaky ground: it works for some games but not others. Korb et al. (1999) experiment with an opponent model in the form of a Bayesian network. It plays five-card stud about as well as experienced humans. (Zinkevich et al., 2008) show how an approach that minimizes regret can find approximate equilibria for abstractions with 1 012 states, 100 times more than previous methods. 
Game theory and MDPs are combined in the theory of Markov games, also called stochastic games (Littman, 1994; Hu and Wellman, 1998). Shapley (1953) actually described the value iteration algorithm independent.ly of Bellman, but his results were not widely ap preciated, perhaps because they were presented in the context of Markov games. Evolu-
688
Chapter
17.
Making Complex Decisions
tionary game theory (Smith, 1982; Weibull, 1995) looks at strategy drift over time: if your opponent's strategy is changing, how should you react? Textbooks on game theory from an economics point of view include those by Myerson (1991), Fudenberg and Tirole (1991), Osbome (2004), and Osbome and Rubinstein (1994); Mailath and Samuelson (2006) concen trate on repeated games. From an AI perspective we have Nisan et al. (2007), Leyton-Brown and Shoham (2008), and Shoham and Leyton-Brown (2009). The 2007 Nobel Memorial Prize in Economics went to Hurwicz, Maskin, and Myerson "for having laid the foundations of mechanism design theory" (Hurwicz, 1973). The tragedy of the commons, a motivating problem for the field, was presented by Hardin ( 1968). The rev elation principle is due to Myerson (1986), and the revenue equivalence theorem was devel oped independently by Myerson (1981) and Riley and Samuelson (1981). Two economists, Milgrom (1997) and K1emperer (2002), write about the multibillion-dollar spectmm auctions they were involved in. Mechanism design is used in multiagent planning (Hunsberger and Grosz, 2000; Stone et al., 2009) and scheduling (Rassenti et al. , 1982). Varian (1995) gives a brief overview with cormections to the computer science literature, and Rosenschein and Zlotkin (1994) present a book-length treatment with applications to distributed AI. Related work on distributed AI also goes under other names, including collective intelligence (Turner and Wolpert, 2000; Segaran, 2007) and market-based control (Clearwater, 1996). Since 2001 there has been an annual Trading Agents Competition (TAC), in which agents try to make the best profit on a series of auctions (Wellman et al. , 2001; Arunachalam and Sadeh, 2005). Papers on computational issues n i auctions often appear in the ACM Conferences on Electronic Commerce.
EXERCISES 17.1 For the 4 x 3 world shown in Figure 17.1, calculate which squares can be reached from (1 ,1) by the action sequence [Up, Up, Right, Right, Right] and with what probabilities.
Explain how this computation is related to the prediction task (see Section 1 5.2.1) for a hidden Markov model.
17.2 Select a specific member of the set of policies that are optimal for R(s) > 0 as shown in Figure 1 7.2(b), and calculate the fraction of time the agent spends in each state, in the limit, if the policy is executed forever. (Hint: Construct the state-to-state transition probability matrix corresponding to the policy and see Exercise 15.2.)
17.3 Suppose that we define the utility of a state sequence to be the maximum reward ob tained in any state in the sequence. Show that this utility function does not result in stationary
preferences between state sequences. Is it sti11 possible to define a utility function on states such that MEU decision making gives optimal behavior?
17.4 Sometimes MOPs are formulated with a reward function R(s,a) that depends on the action taken or with a reward function
R(s, a, s') that also depends on the outcome state.
a. Write the Bellman equations for these formulations.
Exercises
689 b.
Show how an MOP with reward function R(s, a, s') can be transfonned into a different MOP with reward function R( s, a), such that optjmal policies in the new MOP corre spond exact.ly to optimal policies in the original MOP. c. Now do the same to conve1t MOPs with R(s,a) into MOPs with R(s).
17.5 For the environment shown in Figure 17.1, find all the threshold values for R(s) such that the optimal policy changes when the threshold is crossed. You will need a way to calcu late the optimal policy and its value for fixed R(s). (Hint: Prove that the value of any fixed policy varies linearly with R(s).) 17.6 Equation (17.7) on page 654 states that the Bellman operator is a contraction.
a. Show that, for any functions f and g,
I maxf(a) - maxg(a)l :::; max If(a) - g(a)l . a
b.
a
a
Write out an expression for I(B Ui - B UI)(s)l and then apply the result from (a) to complete the proof that the Bellman operator is a contraction.
17.7 This exercise considers two-player MOPs that correspond to zero-sum, tum-taking games like those in Chapter 5. Let the players be A and B, and let R(s) be the reward for player A in state s. (The reward for B is always equal and opposite.)
a. Let UA(s) be the utility of state s when it is A's tum to move in s, and let UB(s) be the utility of state s when it is B's tum to move in s. All rewards and utilities are calculated from A's point of view (just as in a minimax game b·ee). Write down Bellman equations defining UA(s) and UB(s). b. Explain how to do two-player value iteration with these equations, and define a suitable termination criterion. c. Consider the game described in Figure 5.17 on page 197. Draw the state space (rather than the game tree), showing the moves by A as solid lines and moves by B as dashed lines. Mark each state with R(s). You will find it helpful to mange the states (sA, BB) on a two-dimensional grid, using SA and SB as "coordinates." d. Now apply two-player value iteration to solve thls game, and derive the optimal policy. 17.8 Consider the 3 x 3 world shown in Figure 17.14(a). The transition model is the same as in the 4 x 3 Figure 17.1: 80% ofthe time the agent goes in the direction it selects; the rest of the time it moves at right angles to the intended direction. Implement value iteration for this world for each value of r below. Use discounted rewards with a discount factor of 0.99. Show the policy obtained in each case. Explain intuitively why the value of r leads to each policy.
a. r = 100 b. r = -3 c. r = 0 d. r = +3
Chapter
690
-I -I -I -I -I r
1+!01 -I -I
+50 -I
-
1
-I
Start
-50 +I +I +I
(a)
17.
.
.
.
.
.
.
.
.
.
Making Complex Decisions
-I -I -I 0 +I +I +I B
(b)
Figure 17.14 (a) 3 X 3 world for Exercise 17.8. The reward for each state is indicated. The upper right square is a terminal state. (b) 101 x 3 world for Exercise 17.9 (omitting 93 identical columns in the middle). The start state has reward
0.
Consider the 1 01 x 3 world shown in Figure 17.14(b). In the start state the agent has
17.9
a choice of two deterministic actions, Up or Down, but in the other states the agent has one deterministic action, Right. Assuming a discounted reward ftmction, for what values of the discount 1 should the agent choose Up and for which Down? Compute the utility of each action
as
a function of "f. (Note that this simple example actually reflects many real-world
situations in which one must weigh the value of an immediate action versus the potential continual long-term consequences, such as choosing to dump pollutants into a lake.)
17.10 Consider an tmdiscotmted MOP having three states, (1. 2, 3), with rewards -1, -2, 0, respectively. State 3 is a terminal state. In states 1 and 2 there are two possible actions: a and b. The transition model is as follows: •
In state 1 , action a moves the agent to state 2 with probability stay put with probability
•
0.8 and makes the agent
0.2.
In state 2, action a moves the agent to state 1 with probability
0.8 and makes the agent 0.2. In either state 1 or state 2, action b moves the agent to state 3 with probability 0.1 and makes the agent stay put with probability 0.9. stay put with probability
•
Answer the following questions:
a. What can be determined qualitatively about the optimal policy in states 1 and 2'?
b.
Apply policy iteration, showing each step in full, to determine the optimal policy and the values of states 1 and
2.
Assume that the initial policy has action bin both states.
c. What happens to policy iteration if the initial policy has action a in both states? Does discounting help? Does the optimal policy depend on the discount factor?
17.11
Consider the 4 x 3 world shown in Figure 17.1.
a. Implement an environment simulator for this environment, such that the specific geog raphy of the environment is easily altered. Some code for doing this is already in the online code repository.
691
Exercises
b.
Create an agent that uses policy iteration, and measure its performance in the environ ment simulator from various starting states.
Perfonn several experiments from each
starting state, and compare the average total reward received per run with the utility of the state, as determined by your algorithm.
c.
Experiment with increasing the size of the environment.
How does the run time for
policy iteration vary with the size of the environment?
17.12
How can
the value
determination algoritlun be used to calculate the expected Joss
experienced by an agent using a given set of utility estimates
U and an estimated model P,
compared with an agent using correct values?
17.13 Let the initial belief state bo for the 4 x 3 POMDP on page 658 be the uniform dis. . . 1 1 1 1 1 1 1 1 1 tn"butJon over h t e nontermmaI states, J.e., ( "!!" > "!!" > "!!" > "!!" > "!!" > "!!" > "!!" > "!!" > "!!" > 0, 0) CalcuIate t he exact belief state b1 after the agent moves Left and its sensor reports 1 adjacent wall. Also calculate •
�
assuming that the same thing happens again.
17.14
What is the time complexity of d steps of POMDP value iteration for a sensorless
environment?
17.15 Consider a version of the two-state POMDP on page 661 in which the sensor is 90% reliable in state 0 but provides no information in state 1 (that is, it reports 0 or 1 with equal probability). Analyze, either qualitatively or quantitatively, the utility function and the optimal policy for this problem.
17.16
Show that a dominant strategy equilibrium is a Nash equilibriwn, but not vice versa.
17.17
In the children's game of rock-paper-scissors each player reveals at the same time
a choice of rock, paper, or scissors. Paper wraps rock, rock blunts scissors, and scissors cut paper.
In the extended version rock-paper-scissors-fire-water, fire beats rock, paper, and
scissors; rock, paper, and scissors beat water; and water beats fire.
Write out the payoff
matrix and find a mixed-strategy solution to this game.
17.18
The following payoff matrix, from Blinder (1983) by way of Bernstein
(1996), shows
a game between politicians and the Federal Reserve. Fed: contract Pol: contract Pol: do nothing Pol: expand
Fed: do nothing
Fed: expand
F = 7, P = 1 F = 9, P = 4 F = 6, P = 6 F = 8, ? = 2 F = 5,P = 5 F = 4, P = 9 F = 3, P = 3 F = 2, P = 7 F = l, P = 8
Politicians can expand or contract fiscal policy, while the Fed can expand or contract mon etary policy. (And of course either side can choose to do nothing.) Each side also has pref erences for who should do what-neither side wants to look like the bad guys. The payoffs shown are simply the rank orderings:
9 for first choice
through
1
for last choice. Find the
Nash equilibrium of the game in pure strategies. Is this a Pareto-optimal solution? You might wish to analyze the policies of recent administrations in this light.
Chapter
692 17.19
17.
Making Complex Decisions
A Dutch auction is similar in an English auction, but rather than starting the bidding
at a low price and increasing, in a Dutch auction the seller starts at a high price and gradually lowers the price until some buyer is willing to accept that price. (If multiple bidders accept the price, one is arbitrarily chosen as the winner.) More formally, the seller begins with a price p and gradually Jowers p by increments of d until at least one buyer accepts the price. Assuming all bidders act rationally, is it true that for arbitrarily small d, a Dutch auction will always result in the bidder with the highest value for the item obtaining the item? If so, show mathematically why. If not, explain how it may be possible for the bidder with highest value for the item not to obtain it. 17.20
Imagine an auction mechanism that is just like an ascending-bid auction, except that
at the end, the winning bidder, the one who bid
bmax. pays only bmax/2 rather than bmax·
Assuming all agents are rational, what is the expected revenue to the auctioneer for this mechanism, compared with a standard ascending-bid auction? 17.21
Teams in the National Hockey League historically received 2 points for winning a
game and 0 for losing. If the game is tied, an overtime period is played; if nobody wins in uvt:rlilm:, lltt: game:: is a lit: and t:adt team gt:ls 1 puiut. But lt:agut: ufftdals fdl that lt:ams were playing too conservatively in overtime (to avoid a loss), and it would be more exciting if overtime produced a winner. So in 1999 the officials experimented in mechanism design: the rules were changed, giving a team that loses in overtime 1 point, not 0. It is still 2 points for a win and 1 for a tie. a.
Was hockey a zero-sum game betore the mle change? After?
b.
Suppose that at a certain time t in a game, the home team has probability p of winning in regulation time, probability 0.78 - p of losing, and probability 0.22 of going into overtime, where they have probability q of winning, .9 - q of losing, and .1 of tying. Give equations for the expected value for the home and visiting teams.
c.
Imagine that it were legal and ethical for the two teams to enter into a pact where they agree that they will skate to a tie in regulation time, and then both try in eamest to win in overtime. Under what conditions, in terms of p and q, would it be rational for both teams to agree to this pact?
d.
Longley and Sankaran (2005) report that since the mle change, the percentage of games with a winner in overtime went up 18.2%, as desired, but the percentage of overtime games also went up 3.6%. What does that suggest about possible collusion or conser vative play after the mle change?
18
LEARNING FROM EXAMPLES
In which we describe agents that can improve their behavior through diligent study of their own experiences.
LEARNING
18.1
An agent is learning if it improves its performance on future tasks after making observations about the world. Learning can range from the trivial, as exhibited by jotting down a phone number, to the profound, as exhibited by Albert Einstein, who inferred a new theory of the tmiverse. In this chapter we will concentrate on one class of learning problem, which seems restricted but actually has vast applicability: from a collection of input-Qut.put pairs, learn a function that predicts the output for new inputs. Why would we want an agent to learn? If the design of the agent can be improved, why wouldn't the designers just program in that improvement to begin with? There are three main reasons. First, the designers cannot anticipate all possible situations that the agent might find itself in. For example, a robot designed to navigate mazes must leam the layout of each new maze it encounters. Second, the designers cannot anticipate all changes over time; a program designed to predict tomorrow's stock market prices must learn to adapt when conditions change from boom to bust. Third, sometimes hwnan programmers have no idea how to program a solution themselves. For example, most people are good at recognizing the faces of family members, but even the best programmers are unable to program a computer to accomplish that task, except by using learning algorithms. This chapter first gives an overview of the various forms of learning, then describes one popular approach, decision tree learning, in Section 18.3, followed by a theoretical analysis of learning in Sections 18.4 and 18.5. We look at various learning systems used in practice: linear models, nonlinear models (in particular, neural networks), nonparametric models, and support vector machines. Finally we show how ensembles of models can outperform a single model. FORMS OF LEARNING
Any component of an agent can be improved by learning from data. The improvements, and t.he techniques used to make them, depend on four major factors: •
Which component is to be improved. 693
18.
Chapter
694 •
What prior knowledge the agent already has.
•
What
•
Learning from Examples
representation is used for the data and the component. Whatfeedback is available to learn from.
Components to be learned Chapter 2 described several agent designs. The components of these agents include:
1. 2. 3.
A direct mapping
from conditions on the cun-ent state to actions.
A means to infer relevant properties of the world from the percept sequence. Infonnation about the way the world evolves and about the results of possible actions the agent can take.
4. Utility information indicating the desirability of world states. 5. Action-value information indicating the desirability of actions.
6.
Goals that describe classes of states whose achievement maximizes the agent's utility.
Each of these components can be learned. Consider, for example, an agent training to become a taxi driver. Every time the instructor shouts "Brake!" the agent might leam a condition action rule for when to brake (component
1);
the agent also learns every time the instructor
does not shout. By seeing many camera images that it is told contain buses, it can leam
to recognize them (2). By trying actions and observing the results-for example, braking hard on a wet road-it can learn the effects of its actions
(3).
Then, when it receives no tip
from passengers who have been thoroughly shaken up dming the trip, it can leam a useful component of its overall utility function
(4).
Representation and prior knowledge We have seen several examples of representations for agent components: propositional and first-order logical sentences for the components in a logical agent; Bayesian networks for the inferential components of a decision-theoretic agent, and so on. Effective teaming algo rithms have been devised for all of these representations. This chapter (and most of current machine learning research) covers inputs that foO"O a
factored representation-a vector
of
attribute values-and outputs that can be either a continuous numerical value or a discrete value. Chapter
19
covers functions and prior knowledge composed of first-order logic sen
tences, and Chapter 20 concentrates on Bayesian networks. There is another way
to look at the various types of learning. We say that learning
a (possibly incorrect) general function or rule from specific input-output pairs is called
INDUCTlVE LEARNIPIG DEDUCTIVE LEARNIPIG
ductive learning. We will see in Chapter 19 that we learning: going from a known general rule to a new
can also do
in analytical or deductive
rule that is logically entailed, but is
useful because it allows more efficient processing.
Feedback to learn from There are three
UNSUPERVISED LEARNitiG CLUSTERING
types offeedback that determine the tlm�e main types of learning: In unsupervised learning the agent learns patterns in the input even though no explicit feedback is supplied. The most common unsupervised learning task is clustering: detecting
Section 1 8.2.
REINFORCEMENT LEARNING
SUPERVISED LEARNING
SEM�SUPERIISED LEARNING
18.2
695
Supervised Learning
potentially useful clusters of n i put examples. For example, a taxi agent might gradually develop a concept of "good traffic days" and "bad traffic days" without ever being given labeled examples of each by a teacher. In reinforcement learning the agent leams from a series of reinforcements-rewards or punishments. For example, the lack of a tip at the end of the journey gives the taxi agent an indication that it did something wrong. The two points for a win at the end of a chess game tells the agent it did something right. It is up to the agent to decide which of the actions prior to the reinforcement were most responsible for it. In supervised learning the agent observes some example input-output pairs and learns a function that maps from input to output. In component 1 above, the inputs are percepts and the output are provided by a teacher who says "Brake!" or "Turn left." In component 2, the inputs are camera images and the outputs again come from a teacher who says "that's a bus." In 3, the theory of braking is a function from states and braking actions to stopping distance in feet. In this case the output value is available directly from the agent's percepts (after the fact); the environment is the teacher. In practice, these distinction are not always so crisp. In semi-supervised learning we are given a few labeled examples and must make what we can of a large collection of un labeled examples. Even the labels themselves may not be the oracular truths that we hope for. Imagine that you are trying to build a system to guess a person's a.ge from a photo. You gather some labeled examples by snapping pictures of people and asking their age. That's supervised learning. But n i reality some of the people lied about their age. It's not just that there is random noise in the data; rather the inaccuracies are systematic, and to uncover them is an unsupervised learning problem involving linages, self-reported ages, and true (un known) ages. Thus, both noise and lack of labels create a continuum between supervised and unsupervised learning.
SUPERVISE D LEARNING
The task of supervised learning is this:
TRAINING SET
Given a training set of N example input-output pairs
HYPOTHESIS
(x1, Yl), (x2, Y2), (xN , YN ) , where each Yj was generated by an unknown function y = f (x), discover a function h that approximates the true function f. Here x and y can be any value; they need not be nwnbers. The function h is a hypothesis. 1
TESTSET
Learning is a search through the space of possible hypotheses for one that will perform well, even on new examples beyond the training set. To measw·e the accuracy of a hypothesis we give it a test set of examples that are distinct from the training set. We say a hypothesis
·
·
·
1 A note on notation: except where noted, we will use j to index the N examples; Xj will always be the input and Y3 the output. In cases where the input is specifically a vector of attribute values (beginning with Section 18.3), we will use x3 for the jth example and we will use i to index the n attributes of each example. The elements of Xj are written Xj,l, x;,2, . . . , x;,n.
696
Chapter f(x)
/�)
0
0
0
0
(a)
0
0
Learning from Examples f(x)
/�)
0
0
X
18.
0
0
0
0
0
0
�
� (b)
X
(c)
X
0
0
0
0
0
0
0
(d)
X
Figure 18.1 (a) Example (x.f(x)) pairs and a consistent, linear hypothesis. (b) A consistent, degree-7 polynominl hypothesis for the snme dntn set. (c) A different dntn set, which admits an exact degree-6 polynomial fit or an approximate linear fit. (d) A simple, exact sinusoidal fit to the same data set.
GENERALIZATION
CLASSIFICATION REGRESSION
generalizes well if it correctly predicts the value of y for novel examples. Sometimes the fimction f is stochastic-it is not strictly a function of x, and what we have to learn is a conditional probability disttibution, P(Y I x). When tile output y is one of a finite set of values (such as sunny, cloudy or rainy), the learning problem is called classification, and is called Boolean or binary classification if there are only two values. When y is a nwnber (such as tomorrow's temperature), the learning problem is called regression. (Technically, solving a regression problem is finding a conditional expectation or average value of y, because the probability that we have found exactly the right real-valued number for y is 0.) Figure I &.I shows a familiar example: fitting a function of a single variable to some data
HYPOTHESIS SPACE
CONSISTENT
OCKHAM'S RAZOR
points. The examples are points in the (x, y) plane, where y = f(x). We don't know what f is, but we will approximate it with a function h selected from a hypothesis space, H, which for this example we will take to be the set of polynomials, such as x5 +3x2 +2. Figure 18.1 (a) shows some data with an exact fit by a straight line (the polynomial 0.4x + 3). The line is called a consistent hypothesis because it agrees with all the data. Figure 18.1(b) shows a high degree polynomial that is also consistent with the same data. This illustrates a fundamental problem in inductive learning: how do we choosefrom among multiple consistent hypotheses? One answer is to prefer the simplest hypothesis consistent wilh the data. This principle is called Ockham's razor, after the 14th-century English philosopher William ofOckham, who used it to argue sharply a.gainst all sorts of complications. Defining simplicity is not easy, but it seems clear that a degree-I polynomial is simpler than a degree-7 polynomial, and thus (a) should be preferred to (b). We will make this intuition more precise in Section 18.4.3. Figure 18.1(c) shows a second data set. There is no consistent straight line for this data set; in fact, it requires a degree-6 polynomial for an exact fit. There are just 7 data points, so a polynomial with 7 parameters does not seem to be finding any pattem in the data and we do not expect it to generalize well. A straight line that is not consistent with any of the data points, but might generalize fairly well for unseen values of x, is also shown in (c). In general, there is a tradeoff between complex hypotheses that fit the training data well and simpler hypotheses that may generalize better. In Figure 18.l(d) we expand the
Section 1 8.3.
REALIZABLE
Learning Decision Trees
697
hypothesis space 1{ to allow polynomials over both x and sn i (x) , and find that the data in (c) can be fitted exactly by a simple function of the form ax+ b + csin(x). This shows the importance of the choice of hypothesis space. We say that a learning problem is reatizable if the hypothesis space contains the true function. Unfortunately, we cannot always tell whether a given learning problem is realizable, because the true ftmction is not known. In some cases, an analyst looking at a problem is willing to make more fine-grained distinctions about the hypothesis space, to say-even before seeing any data-.not just that a hypothesis is possible or impossible, but rather how probable it is. Supervised learning can be done by choosing the hypothesis h* that is most probable given the data:
h* = argmax P(hldata) . hE1t
By Bayes' mle this is equivalent to
h* = argmax P( data l h) P(h) . hE1t
Then we can say that the prior probability P(h) is high for a degree-] or -2 polynomial, lower for a degree-7 polynomial, and especially low for degree-7 polynomials with large, sharp spikes as in Figure 18.1 (b). We allow unusual-looking functions when the data say we really need them, but we discourage them by giving them a low prior probability. Why not Jet 1{ be the class of all Java programs, or Turing machines? After all, every computable function can be represented by some Turing machine, and that is the best we can do. One problem with this idea is that it does not take into account the computational complexity of learning. There is a tradeoff between the expressiveness ofa hypothesis space and the complexity offinding a good hypothesis within that space. For example, fitting a straight line to data is an easy computation; fitting high-degree polynomials is somewhat harder; and fitting Turing machines is in general tmdecidable. A second reason to prefer simple hypothesis spaces is that presumably we will want to use h after we have learned it, and computing h(x) when h is a linear function is guaranteed to be fast, while computing an arbitrary Turing machine program is not even guaranteed to terminate. For these reasons, most work on learning has focused on simple representations. We will see that the expressiveness-complexity tradeoff is not as simple as it first seems: it is often the case, as we saw with first-order logic in Chapter 8, that an expressive language makes it possible for a simple hypothesis to fit the data, whereas restricting the expressiveness of the language means that any consistent hypothesis must be very complex. For example, the rules of chess can be written in a page or two of first-order logic, but require thousands of pages when written in propositional logic.
18.3
LEARNING DECISION TREES
Decision tree induction is one of the simplest and yet most successful forms of machine learning. We first describe the representation-the hypothesis space-and then show how to learn a good hypothesis.
Chapter
698 18.3.1
DECISION TREE
POSITIVE NEGATIVE
GOAL PREDICATE
18.
Learning from Examples
The decision tree representation
A decision tree represents a function that takes as input a vector of attribute values and returns a "decision"-a single output value. The input and output values can be discrete or continuous. For now we will concentrate on problems where the inputs have discrete values and the output has exactly two possible values; this is Boolean classification, where each example input will be classified as true (a positive example) or false (a negative example). A decision tree reaches its decision by performing a sequence of tests. Each internal node in the tree corresponds to a test of the value of one of the input attributes, Ai, and the branches from the node are labeled with the possible values of the attribute, Ai = Vik· Each leaf node in the tree specifies a value to be returned by the function. The decision tree representation is natural for humans; indeed, many "How To" manuals (e.g., for car repair) are writt.en entirely as a single decision tree stretching over hundreds of pages. As an example, we will build a decision tree to decide whether to wait for a table at a restaurant. The aim here is to learn a definition for the goal predicate WillWait. First we list the attributes that we will consider as part of the input: 1 . Alternate: whether there is a suitable alternative restatu·ant nearby. 2. Bar: whether the restaurant has a comfortable bar area to wait in. 3. FrifSat: true on Fridays and Saturdays.
4. Hungry: whether we are hungry. 5.
6. 7.
8. 9. 10.
Patrons: how many people are in the restaurant (values are None, Some, and Full). Price: the restaurant's price range ($, $$, $$$). Raining: whether it is raining outside. Reservation: whether we made a reservation. Type: the kind of restaw·ant (French, Italian, Thai, or burger). WaitEstimate: the wait estimated by the host (0-10 minutes, 10-30, 30-60, or >60).
Note that every variable has a small set of possible values; the value of WaitEstimate, for example, is not an integer, rather it is one of the four discrete values 0-10, 10-30, 30-60, or >60. The decision tree usually used by one of us (SR) for this domain is shown in Figure 18.2. Notice that the tree ignores the Price and Type attributes. Examples are processed by the tree starting at the root and following the appropriate branch until a leaf is reached. For instance, an example with Patrons = Pull and WaitEstimate = 0-10 will be classified as positive (i.e., yes, we will wait for a table). 18.3.2
Expressiveness of decision trees
A Buult:an dcxisiun ln::e:: is lugkally e::yuivalt:ul tu tht:: asse::rliun lhal the:: gual attribute:: is true:: if and only if the input attributes satisfy one of the paths leading to a leaf with value true. Writing this out in propositional logic, we have
( Path1 V Path2 V · · ·) , where each Path is a conjunction of att.ribute-value tests required to follow that path. Thus, Goal
¢>
the whole expression is equivalent to disjunctive normal form (see page 283), which means
Section 18.3.
Learning Decision Trees
699
that any function in propositional logic can be expressed as a decision tree. As an example, the rightmost path in Figure 18.2 is
Path = (Patrons = Full 1\ WaitEstimate =0-10) . For a wide variety of problems, the decision tree format yields a nice, concise result. But some functions cannot be represented concisely. For example, the majority function, which returns true if and only if more than half of the inputs are true, requires an exponentially large decision tree. In other words, decision trees are good for some kinds of functions and bad for others. Is there any kind of representation that is efficient for all kinds of functions? Unfortunately, the answer is no. We can show this in a general way. Consider the set of all Boolean functions on n attributes. How many different functions are in this set? This is just the number of different truth tables that we can write down, because the function is defined by its truth table. A truth table over n attributes has 2" rows, one for each combination of values of the attributes. We can consider the "answer" column ofthe table as a 2n-bit number that defines the ftmction. That means there are 22" different functions (and there will be more than that nwnber of trees, since more than one tree can compute the same function). This is a scary number. For example, with just the ten Boolean attributes of our restaurant problem there are 21024 or about 10308 different functions to choose from, and for 20 attributes there are over 10300•000 . We will need some ingenious algorithms to find good hypotheses in such a large space. 18.3.3
Inducing decision trees from examples
An example for a Boolean decision tree consists of an (x, y) pair, where xis a vector of values for the input attributes, and y is a single Boolean output value. A training set of 12 examples
Figure 18.2
A decision tree for deciding whether to wait for a table.
700
Chapter
Xs X6 X7
xs
Xg X10
xu
Xl2
Learning from Examples
Input Attributes
Example Xl X2 X3 X4
18.
Alt
Ba1·
les Jes No les les No No No
No No
No
les No Yes
Figure 18.3
Jilri Hun
Pat
Yes No
No No No Yes Yes No No No
Some Full Some Full Full Some None Some
Yes
Yes
Yes No
Yes No Yes
Yes
No No
Yes
Yes
Yes Yes No Yes No Yes No Yes No
Yes No Yes
Full
Full None Full
Goal
Price Rain Res $$$ $ $ $
$$$ $$ $ $$ $
$$$ $ $
No No No Yes No Yes Yes Yes Yes
No No No
Yes No No No Yes Yes No Yes
No
Yes No No
Est
WillWait
French
0-10
Thai
30--60
Burger
0-10
Thai
10-30
Type
French
>60
Italian
0-10
Burger
0-10
Thai
0-10
Yl = Yes Y2 = No Y3 = Yes Y4 = Yes Ys = No Y6 = Yes Yr = No YS = Yes
Italian
10-30
YlO = No
Thai
0-10
Yll
Burger
30--60
Y12
Burger
>60
yg
= No
= No = Yes
Examples for the restaurant domain.
is shown in Figure 18.3. The positive examples are the ones in which the goal Will Wait is tme (x1, X3, . . . ); the negative examples are the ones in which it is false (x2, xs, . . . ) . We want a tree that is consistent with the examples and is as small as possible. Un fortunately, no matter how we measure size, it is an intractable problem to find the smallest consistent tree; there is no way to efficiently search through the 22" trees. With some simple heuristics, however, we can find a good approximate solution: a small (but not smallest) con sistent tree. The DECISION-TREE-LEARNING algorithm adopts a greedy divide-and-conquer strategy: always test the most important attribute first. This test divides the problem up into smaller subproblems that can then be solved recursively. By "most important attribute," we mean the one that makes the most difference to the classification of an example. That way, we hope to get to the correct classification with a small number of tests, meaning that all paths in the tree will be short and the tree as a whole will be shallow. Figure 18.4(a) shows that Type is a poor attribute, because it leaves us with four possible outcomes, each of which has the same number of positive as negative examples. On the other hand, in (b) we see that Patrons is a fairly important attribute, because if the value is None or Some, then we are left with example sets for which we can answer definitively (No and Yes, respectively). If the value is Full, we are left with a mixed set of examples. In general, after the first attribute test splits up the examples, each outcome is a new decision tree learning problem in itself, with fewer examples and one less attribute. There are four cases to consider for these recursive problems: 1 . If the remaining examples are all positive (or all negative), then we are done: we can answer Yes or No. Figure 18.4(b) shows examples of this happening in the None and Some branches. 2. If there are some positive and some negative examples, then choose the best attribute to split them. Figure 18.4(b) shows Hungry being used to split the remaining examples. 3. If there are no examples left, it means that no example has been observed for this com-
Section 1 8.3.
701
Learning Decision Trees
None
om
•
(a) Figure
18.4
(b)
Splitting the examples by testing on attributes. At each node we show the
positive (light boxes) and negative (dark boxes) examples remaining. (a) Splitting on Type
brings us no nearer to distinguishing between positive and negative examples. (b) Splitting
on Patrons does a good job of separating positive and negative examples. After splitting on Pahvns, Hungry is a fairly good second test.
bination of attribute va1ues, and we return a default value calculated from the plurality classification of all the examples that were used in constructing the node's parent. These are passed along in the variable parent-examples.
NOISE
4. If there are no attributes left, but both positive and negative examples, it means that these examples have exactly the same description, but different classifications. This can happen because there is an error or noise in the data; because the domain is nondeter ministic; or because we can't observe an attribute that would distinguish the examples. The best we can do is retum the plurality classification of the remaining examples. The DECISION-TREE-LEARNING algorithm is shown in Figure 18.5. Note that the set of examples is crucial for constructing the tree, but nowhere do the examples appear in the tree itself. A tree consists of just tests on attributes in the interior nodes, values of attributes on the branches, and output values on the leaf nodes. The details of the IMPORTANCE function are given in Section 18.3.4. The output of the learning algorithm on our sample training set is shown in Figure 18.6. The tree is clearly different from the original tree shown in Figure 18.2. One might conclude that the leaming algorithm is not doing a very good job of leaming the correct function. This would be the wrong conclusion to draw, however. The learning algorittun looks at the examples, not at the correct function, and in fact, its hypothesis (see Figure 18.6) not only is consistent with all the examples, but is considerably simpler than the original tree! The leaming algorithm has no reason to include tests for Raining and Reservation, because it can classify all the examples without them. It has also detected an interesting and previously unsuspected pattern: the first author will wait for Thai food on weekends. It is also bound to make some mistakes for cases where it has seen no examples. For example, it has never seen a case where the wait is 0-10 minutes but the restaurant is full.
Chapter
702
function
18.
Learning from Examples
DECISION-TREE-LEARNING(examples, att?'ibutes, parent-examples)
a tree
returns
if examples is empty then return PLURALITY-VALUE(parent_examples) else if all
examples have the same classification then return the classification
else if attributes is empty then return PLURALITY-VALUE(examples) else
A f- argmaxa E attributes IMPORTANCE( a, examples) tree ._ a new decision tree with root test A
for each value Vk of A do
exs ._ {e : e E examples and e.A = v�c } sttbt?'ee f- DECISION-TREE-LEARNING(exs, attributes - A, examples) add a branch to tree with label (A = v�c) and subtree subtree
return tree
Figure 18.5
The decision-tree learning algorithm. The function
IMPORTANCE is de
scribed in Section 18.3.4. The function PLURALITY-VALUE selects the most common output value among a set of examples, breaking ties randomly.
Figure 18.6
LEARNING CURVE
The decision tree n i duced from the 12-example training set.
In that case it says not to wait when Hungry is false, but I (SR) would certainly wait. With more training examples the teaming program could correct this mistake. We note there is a danger of over-interpreting the tree that the algorithm selects. When there are several variables of similar importance, the choice between them is somewhat arbi trary: with slightly different input examples, a different variable would be chosen to split on first, and the whole tree would look completely different. The function computed by the tree would still be similar, but the structure of the tree can vary widely. We can evaluate the accuracy of a leaming algorithm with a learning curve, as shown in Figw-e 18.7. We have 100 examples at our disposal, which we split nto i a training set and
Section 1 8.3.
703
Learning Decision Trees
i.i � §
tl !!
8
.§
J
''( . rv-.'VV
0.9 0.8 0.7 0.6 0.5 0.4
...._
_ _ _ _ _ _ _ _ _ _ _ _
0
20
40
60
80
100
Training set size
Figure
18.7
A learning curve for the decision tree learning algorithm on
100
randomly
generated examples in the restaurant domain. Each data point is the average of 20 trials.
a test set. We learn a hypothesis h with the training set and measure its accuracy with the test set. We do this starting with a training set of size 1 and increasing one at a time up to size 99. For each size we actually repeat the process of randomly splitting 20 times, and average the results of the 20 trials. The curve shows that as the training set size grows, the accuracy increases. (For this reason, learning curves are also called happy graphs.) In tllis graph we reach 95% accuracy, and it looks like the curve might continue to increase with more data. 18.3.4
Choosing attribute tests
The greedy search used in decision tree learning is designed to approximately minimize the
ENTROPY
depth of the final tree. The idea is to pick the attribute that goes as far as possible toward providing an exact classification of the examples. A perfect attribute divides the examples into sets, each of which are all positive or all negative and thus will be leaves of the tree. The Patrons attribute is not perfect, but it is fairly good. A really useless attribute, such as Type, leaves the example sets with roughly the same proportion of positive and negative examples as the original set. All we need, then, is a formal measure of"fairly good" and "really useless" and we can implement the IMPORTANCE function of Figure 18.5. We will use the notion of information gain, which is defined in terms of entropy, the fundamental quantity in information theory (Shannon and Weaver, 1 949). Entropy is a measure of the uncertainty of a random variable; acquisition of information corresponds to a reduction in entropy. A random variable with only one value-a coin that always comes up heads-has no uncertainty and thus its entropy is defined as zero; thus, we gain no information by observing its value. A tlip of a fair coin is equally likely to come up heads or tails, 0 or I , and we will soon show that this counts as "1 bit" of entropy. The roll of a fairfour-sided die has 2 bits of entropy, because it takes two bits to describe one of four equally probable choices. Now consider an unfair coin that comes up heads 99% of the time. Intuitively, this coin has less uncertainty than the fair coin-if we guess heads we'll be wrong only 1 % of the time-so we would like it to have an entropy measure that is close to zero, but
704
Chapter
Learning from Examples
18.
positive. In general, the entropy of a random variable V with values Vk, each with probability P(Vk), is defined as Entropy:
H(V) = LP(vk) log2
k
1 = - LP(vk) log2 P(vk) . P(Vk) k
We can check that the entropy of a fair coin flip is indeed 1 bit: H(Fair) =
-(0.5log2 0.5 + 0.5log2 0.5) = 1 .
Ifthe coin is loaded to give 99% heads, we get
-(0.99log2 0.99 + 0.01 log2 0.01) 0.08 bits. It will help to define B(q) as the entropy of a Boolean random variable that is true with probability q: B(q) = -(q log2 q + (1 - q) log2(1 - q)) . Thus, H(Loaded) = B(0.99) 0.08. Now Jet's get back to decision tree learning. If a �
H(Loaded) =
p p-) . H( Goal) = n(p+ n
�
training set contains positive examples and n negative examples, then the entropy of the goal attribute on the whole set is
has p
The restaurant training set in Figw·e 18.3 = n = 6, so the corresponding entropy is B(0.5) or exactly I bit. A test on a single attribute A might give us only part of this 1 bit. We can measure exactly how much by looking at the entropy remaining a fter the attribute test. An attribute A with d distinct values divides the training set E into subsets E1, . . . , Ea. Each subset Ek has Pk positive examples and nk negative examples, so if we go along that branch, we will need an additional B(pk/(pk + nk)) bits of information to answer the ques tion. A randomly chosen example from the training set has the kth value for the attribute with probability (pk + nk)/(p + n), so the expected entropy remaining after testing attribute A is d
Remainder(A) = L v;!�kB(�)
k=l
INFORMATION GAIN
.
The information gain from the attribute test on A is the expected reduction in entropy: Gain(A) = B(�) - Remainder(A) . In fact Gain(A) is just what we need to implement the IMPORTANCE function. Returning to the attributes considered in Figure 18.4, we have
1 - [fiB(�) + �B(t) + -&BCi)] 0.541 bits, Gain( Type) = 1 - [1�B(�) + � 1 B(�) + � 1 B( �) + �B(�)] = 0 bits,
Gain( Patrons ) =
�
confirming our intuition that Patmns is a better attribute to split on. In fact, Patrons has the maximwn gain of any of the attributes and would be chosen by the decision-tree learning algorithm as the root.
Section 18.3.
Learning Decision Trees
18.3.5
705
Generalization and overfitting
On some problems, the DECISION-TREE-LEARNING algorithm will generate a large tree when there is actually no pattern to be found. Consider the problem of trying to predict whether the roll of a die will come up as 6 or not. Suppose that experiments are carried out with various dice and that the attributes describing each training example include the color of the die, its weight, the time when the roll was done, and whether the experimenters had their fingers crossed. If the dice are fair, the right thing to learn is a tree with a single node that says "no," But the DECISION-TREE-LEARNING algorithm will seize on any pattern it can find in the input. If it turns out that there are 2 rolls of a 7-gram blue die with fingers
ovERFITTING
crossed and they both come out 6, then the algorithm may construct a path that predicts 6 in that case. This problem is called overfitting. A general phenomenon, overfitting occurs with all types of learners, even when the target function is not at all random. In Figure 18.1(b) and (c), we saw polynomial functions overfitting the data. Overfitting becomes more likely as the hypothesis space and the number of input attributes grows, and less likely as we increase the number of training examples.
g����TREE
For decision trees, a technique called decision tree pruning combats overfitting. Pruning works by eliminating nodes that are not clearly relevant. We strut with a full tree, as generated by DECISION-TREE-LEARNING. We then look at a test node that has only leaf nodes as descendants. If the test appears to be irrelevant-detecting only noise in the data then we eliminate the test, replacing it with a leaf node. We repeat this process, considering each test with only leaf descendants, until each one has either been pruned or accepted as is. The question is, how do we detect that a node is testing an irrelevant attribute? Suppose we are at a node consisting ofp positive and n negative examples. If the attribute is irrelevant,
we would expect that it would split the examples into subsets that each have roughly the same
proportion of positive examples as the whole set, pf(p + n), and so the infonnation gain will 2 be close to zero. Thus, the information gain is a good clue to irrelevance. Now the question is, how large a gain should we require :in order to split on a particular attribute?
SIGNIFICANCE TEST NuLL HYPOTHESis
We can answer this question by using a statistical significance test. Such a test begins by assuming that there is no underlying pattern (the so-called null hypothesis). Then the ac
tual data are analyzed to calculate the extent to which they deviate from a perfect absence of pattern. If the degree of deviation is statistically unlikely (usually taken to mean a 5% prob
ability or less), then that is considered to be good evidence for the presence of a significant pattern in the data. The probabilities ar·e calculated from standard distributions of the runount of deviation one would expect to see in random sampling.
In this case, the null hypothesis is that the attribute is irrelevant atld, hence, that the infonnation gain for an infinitely large sample would be zero. We need to calculate the probability that, under the null hypothesis, a sample of size v = n + p would exhibit the observed deviation from the expected distribution of positive and negative examples. We can measure the deviation by comparing the actual numbers of positive and negative examples in 2 The gain wiU be strictly positive except for the unlikely case where all the proportions are exactly the same.
(See Exercise 18.5.)
Chapter
706
18.
Learning from Examples
each subset, Pk and nk. with the expected numbers, Pk and nk. assmning true irrelevance:
Pk = p x •
, Pk + nk nk = n x :....:.:.... ...;;:. p+n
Pk + nk p+n
_
A convenient measure of the total deviation is given by /:::;. =
d
� L__;
k=l
(pk - Pk)2 •
�
Pk
+
• )2 (nk - nk �
nk
Under the null hypothesis, the value of /:::;. is distributed according to the
2
x2
(chi-squared)
distribution with v - 1 degrees of freedom. We can use a x table or a standard statistical library routine to see if a particular � value confirms or rejects the null hypothesis. For
example, consider the restaurant type attribute, with four values and thus three degrees of freedom. A value of /:::;. = 7.82 or more would reject the null hypothesis at the 5% level (and a value of /:::;. = 1 1 .35 or more would reject at the 1 % level). Exercise 18.8 asks you to extend the
x2 PAJNING
DECISION-TREE-LEARNING algmithm to implement this form of pruning, which is known as x2
pruning. With pruning, noise in the examples can be tolerated. Errors in the example's label (e.g.,
an example (x,
Yes) that should be (x, No)) give a linear increase in prediction error, whereas
eiTors in the desctiptions of examples (e.g., Price = $ when it was actually Price = $$) have an asymptotic effect that gets worse as the tree shrinks down to smaller sets. Pruned trees perform significantly better than unpruned trees when the data contain a large amount of
are often much smaller and hence easier to understand. 2 One final warning: You might think that x pruning and information gain look similar, so why not combine them using an approach called early stopping have the decision tree noise. Also, the pmned trees
EAOLV eTOr'r'INC
algorithm stop generating nodes when there is no good attribute to split on, rather than going to aU the trouble of generating nodes and then pruning them away. The problem with early stopping is that it stops us
from recognizing situations where there is no one good attribute,
but there are combinations of attributes that are informative. For example, consider the XOR ftmction of two binary attributes. If there are roughly equal nwnber of examples for all four combinations of input values, then neither attribute will be informative, yet the correct thing to do is to split on one of the attributes (it doesn't matter which one), and then at the second level we will get splits that are informative. Early stopping would miss this, but generate and-then-prune handles it correctly. 18.3.6
Broadening the applicability of decision trees
In order to extend decision tree induction to a wider variety of problems, a number of issues must be addressed. We will briefly mention several, suggesting that a full understanding is best obtained by doing the associated exercises: •
Missing data:
In many domains, not all the attribute values will be known for every
example. The values might have gone
unrecorded,
or they might be too expensive to
obtain. nus gives rise to two problems: First, given a complete decision tree, how should one classify an example that is missing one of the test attributes? Second, how
Section 18.3.
Learning Decision Trees
707
should one modify the information-gain formula when some examples have unknown values for the attribute? These questions are addressed in Exercise 18.9. •
Multivalued attributes: When an attribute has many possible values, the information gain measure gives an inappropriate indication of the attribute's usefulness. In the extreme case, an attribute such as ExactTime has a different value for every example, which means each subset of examples is a singleton with a unique classification, and the information gain measure would have its highest value for this attribute. But choosing this split first is unlikely to yield the best tree. One solution is to use the gain ratio
GAIN RATIO
(Exercise 18.10). Another possibility is to allow a Boolean test of the form A = v_k, that is, picking out just one of the possible values for an attribute, leaving the remaining values to possibly be tested later in the tree. •
Continuous and integer-valued input attributes: Continuous or integer-valued attributes such as Height and Weight have an infinite set of possible values. Rather than generate infinitely many branches, decision-tree learning algorithms typically find the
SPLIT POINT
split point that gives the highest information gain. For example, at a given node in the tree, it might be the case that testing on Weight > 160 gives the most information. Efficient methods exist for finding good split points: start by sorting the values of the attribute, and then consider only split points that are between two examples in sorted order that have different classifications, while keeping track of the running totals of positive and negative examples on each side of the split point (a small code sketch of this procedure follows the list). Splitting is the most expensive part of real-world decision tree learning applications. •
REGRESSION TREE
Continuous-valued output attributes: If we are trying to predict a numerical output value, such as the price of an apartment, then we need a regression tree rather than a classification tree. A regression tree has at each leaf a linear function of some subset of numerical attributes, rather than a single value. For example, the branch for two-bedroom apartments might end with a linear function of square footage, number of bathrooms, and average income for the neighborhood. The learning algorithm must decide when to stop splitting and begin applying linear regression (see Section 18.6) over the attributes.
A decision-tree learning system for real-world applications must be able to handle all of these problems. Handling continuous-valued variables is especially important, because both
physical and financial processes provide numerical data. Several commercial packages have been built that meet these criteria, and they have been used to develop thousands of fielded systems. In many areas of industry and commerce, decision trees are usually the first method tried when a classification method is to be extracted from a data set. One important property of decision trees is that it is possible for a human to understand the reason for the output of the learning algorithm. (Indeed, this is a legal requirement for financial decisions that are subject to anti-discrimination laws.) This is a property not shared by some other representations, such as neural networks.
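Here is a minimal sketch, not the book's code, of split-point selection for a continuous attribute as described in the list above: sort the examples by value, then consider only boundaries between adjacent examples with different labels, keeping running positive/negative counts. The helper names are hypothetical.

import math

def entropy(p, n):
    total = p + n
    if total == 0 or p == 0 or n == 0:
        return 0.0
    q = p / total
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def best_split_point(values, labels):
    # values: attribute values; labels: True for positive examples.
    pairs = sorted(zip(values, labels))
    p = sum(labels)
    n = len(labels) - p
    base = entropy(p, n)
    p_left = n_left = 0
    best_gain, best_split = -1.0, None
    for i in range(len(pairs) - 1):
        if pairs[i][1]:
            p_left += 1
        else:
            n_left += 1
        # Only boundaries between differently labeled, distinct neighbors can be optimal.
        if pairs[i][1] == pairs[i + 1][1] or pairs[i][0] == pairs[i + 1][0]:
            continue
        p_right, n_right = p - p_left, n - n_left
        remainder = ((p_left + n_left) * entropy(p_left, n_left) +
                     (p_right + n_right) * entropy(p_right, n_right)) / (p + n)
        gain = base - remainder
        if gain > best_gain:
            best_gain = gain
            best_split = (pairs[i][0] + pairs[i + 1][0]) / 2   # midpoint between the two examples
    return best_split, best_gain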
18.4
EVALUATING AND CHOOSING THE BEST HYPOTHESIS
STATIONARITY ASSUMPTION
We want to learn a hypothesis that fits the future data best. To make that precise we need to define "future data" and "best." We make the stationarity assumption: that there is a probability distribution over examples that remains stationary over time. Each example data point (before we see it) is a random variable E_j whose observed value e_j = (x_j, y_j) is sampled from that distribution, and is independent of the previous examples:
P(E_j | E_{j-1}, E_{j-2}, ...) = P(E_j) ,
and each example has an identical prior probability distribution:
P(E_j) = P(E_{j-1}) = P(E_{j-2}) = ··· .
I.I.D.
ERROR RATE
HOLDOUT CROSS-VALIDATION
K-FOLD CROSS-VALIDATION
LEAVE-ONE-OUT CROSS-VALIDATION
LOOCV
PEEKING
Examples that satisfy these assumptions are called independent and identically distributed, or i.i.d. An i.i.d. assumption connects the past to the future; without some such connection, all bets are off-the future could be anything. (We will see later that learning can still occur if there are slow changes in the distribution.)
The next step is to define "best fit." We define the error rate of a hypothesis as the proportion of mistakes it makes-the proportion of times that h(x) ≠ y for an (x, y) example. Now, just because a hypothesis h has a low error rate on the training set does not mean that it will generalize well. A professor knows that an exam will not accurately evaluate students if they have already seen the exam questions. Similarly, to get an accurate evaluation of a hypothesis, we need to test it on a set of examples it has not seen yet. The simplest approach is the one we have seen already: randomly split the available data into a training set from which the learning algorithm produces h and a test set on which the accuracy of h is evaluated. This method, sometimes called holdout cross-validation, has the disadvantage that it fails to use all the available data; if we use half the data for the test set, then we are only training on half the data, and we may get a poor hypothesis. On the other hand, if we reserve only 10% of the data for the test set, then we may, by statistical chance, get a poor estimate of the actual accuracy.
We can squeeze more out of the data and still get an accurate estimate using a technique called k-fold cross-validation. The idea is that each example serves double duty-as training data and test data. First we split the data into k equal subsets. We then perform k rounds of learning; on each round 1/k of the data is held out as a test set and the remaining examples are used as training data. The average test set score of the k rounds should then be a better estimate than a single score. Popular values for k are 5 and 10, enough to give an estimate that is statistically likely to be accurate, at a cost of 5 to 10 times longer computation time. The extreme is k = n, also known as leave-one-out cross-validation or LOOCV.
Despite the best efforts of statistical methodologists, users frequently invalidate their results by inadvertently peeking at the test data. Peeking can happen like this: A learning algorithm has various "knobs" that can be twiddled to tune its behavior-for example, various different criteria for choosing the next attribute in decision tree learning. The researcher generates hypotheses for various different settings of the knobs, measures their error rates on the test set, and reports the error rate of the best hypothesis. Alas, peeking has occurred! The
reason is that the hypothesis was selected
on the basis of its test set error rate, so information about the test set has leaked into the learning algorithm.
Peeking is a consequence of using test-set performance to both choose a hypothesis and evaluate it. The way to avoid this is to really hold the test set out-lock it away until you are completely done with learning and simply wish to obtain an independent evaluation of the final hypothesis. (And then, if you don't like the results ... you have to obtain, and lock away, a completely new test set if you want to go back and find a better hypothesis.) If the test set is locked away, but you still want to measure performance on unseen data as a way of selecting a good hypothesis, then divide the available data (without the test set) into a training
VALIDATION SET
set and a validation set. The next section shows how to use validation sets to find a good tradeoff between hypothesis complexity and goodness of fit.
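A minimal sketch of k-fold cross-validation as described above, not the book's code. It assumes `learner` is any function mapping a training set to a hypothesis h, and `error_rate(h, examples)` measures the proportion of mistakes h makes on a held-out set; both names are hypothetical.

import random

def cross_validation(learner, k, examples, error_rate):
    examples = list(examples)
    random.shuffle(examples)
    fold_errors = []
    for fold in range(k):
        validation = examples[fold::k]                          # every k-th example is held out
        training = [e for i, e in enumerate(examples) if i % k != fold]
        h = learner(training)
        fold_errors.append(error_rate(h, validation))
    return sum(fold_errors) / k                                  # average held-out error over the k rounds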
18.4.1
Model selection: Complexity versus goodness of fit
In Figure 18.1 (page 696) we showed that higher-degree polynomials can fit the training data better, but when the degree is too high they will overfit, and perform poorly on validation data.
MODEL SELECTION
Choosing the degree of the polynomial is an instance of the problem of model selection. You can think of the task of finding the best hypothesis as two tasks: model selection defines the
OPTIMIZATION
hypothesis space and then optimization finds the best hypothesis within that space. In this section we explain how to select among models that
are parameterized by size.
For example, with polynomials we have size = 1 for linear functions, size = 2 for quadratics,
and so on. For decision trees, the size could be the number of nodes in the tree. In all cases
we want to find the value of the size parameter that best balances underfitting and overfitting to give the best test set accuracy.
An algorithm to perform model selection and optimization is shown in Figure 18.8. It
WRAPPER
is a
wrapper that takes a learning algorithm as an argument (DECISION-TREE-LEARNING, for example). The wrapper enumerates models according to a parameter, size. For each size, it uses cross-validation on Learner to compute the average error rate on the training and test sets. We start with the smallest, simplest models (which probably underfit the data), and iterate, considering more complex models at each step, until the models start to overfit. In Figure 18.9 we see typical curves: the training set error decreases monotonically (although there may in general be slight random variation), while the validation set error decreases at first, and then increases when the model begins to overfit. The cross-validation procedure picks the value of size with the lowest validation set error: the bottom of the U-shaped curve. We then generate a hypothesis of that size, using all the data (without holding out any of it). Finally, of course, we should evaluate the returned hypothesis on a separate test set. This approach requires that the learning algorithm accept a parameter,
size, and deliver
a hypothesis of that size. As we said, for decision tree learning, the size can be the number of nodes. We can modify DECISION-TREE-LEARNER so that it takes the number of nodes as an input, builds the tree breadth-first rather than depth-first (but at each level it still chooses the highest-gain attribute first), and stops when it reaches the desired number of nodes.
function CROSS-VALIDATION-WRAPPER(Learner, k, examples) returns a hypothesis
  local variables: errT, an array, indexed by size, storing training-set error rates
                   errV, an array, indexed by size, storing validation-set error rates
  for size = 1 to ∞ do
    errT[size], errV[size] ← CROSS-VALIDATION(Learner, size, k, examples)
    if errT has converged then do
      best_size ← the value of size with minimum errV[size]
      return Learner(best_size, examples)

function CROSS-VALIDATION(Learner, size, k, examples) returns two values:
        average training set error rate, average validation set error rate
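The following is a minimal sketch of the wrapper idea in Figure 18.8, not the book's code. It assumes hypothetical helpers: `learner(size, examples)` returns a hypothesis of the given size, and `cross_validation(learner, size, k, examples)` returns average (training error, validation error).

def cross_validation_wrapper(learner, k, examples, cross_validation, max_size=50):
    err_t, err_v = {}, {}
    for size in range(1, max_size + 1):
        err_t[size], err_v[size] = cross_validation(learner, size, k, examples)
        # Crude convergence test: stop once training error has stopped improving.
        if size > 2 and abs(err_t[size] - err_t[size - 1]) < 1e-4:
            break
    best_size = min(err_v, key=err_v.get)        # size with the lowest validation error
    return learner(best_size, examples)           # retrain on all the data at that size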
But in 200 dimensions, over 98% of the points fall within this thin shell-almost all the points are outliers. You can see an example of a poor nearest-neighbors fit on outliers if you look ahead to Figure 18.28(b).
The NN(k, x_q) function is conceptually trivial: given a set of N examples and a query x_q, iterate through the examples, measure the distance to x_q from each one, and keep the best k. If we are satisfied with an implementation that takes O(N) execution time, then that is the end of the story. But instance-based methods are designed for large data sets, so we would like an algorithm with sublinear run time. Elementary analysis of algorithms tells us that exact table lookup is O(N) with a sequential table, O(log N) with a binary tree, and O(1) with a hash table. We will now see that binary trees and hash tables are also applicable for finding nearest neighbors.
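A minimal sketch of the O(N) NN(k, x_q) scan just described, not the book's code: iterate over the examples and keep the k closest under Euclidean distance.

import heapq
import math

def nn(k, xq, examples):
    # Return the k examples (lists of numbers) nearest to the query point xq.
    def dist(x):
        return math.sqrt(sum((xi - qi) ** 2 for xi, qi in zip(x, xq)))
    return heapq.nsmallest(k, examples, key=dist)

# Example: nn(2, [0.5, 0.5], [[0, 0], [1, 1], [0.4, 0.6], [0.9, 0.1]])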
18.8.2
K-D TREE
Finding nearest neighbors with k-d trees
A balanced binary tree over data with an arbitrary number of dimensions is called a k-d tree, for k-dimensional tree. (In our notation, the number of dimensions is n, so they would be n-d trees.) The construction of a k-d tree is similar to the construction of a one-dimensional balanced binary tree. We start with a set of examples and at the root node we split them along the ith dimension by testing whether x_i ≤ m. We choose the value m to be the median of the examples along the ith dimension; thus half the examples will be in the left branch of the tree
Figure 18.27    The curse of dimensionality: (a) The length of the average neighborhood for 10-nearest-neighbors in a unit hypercube with 1,000,000 points, as a function of the number of dimensions. (b) The proportion of points that fall within a thin shell consisting of the outer 1% of the hypercube, as a function of the number of dimensions. Sampled from 10,000 randomly distributed points.
and half in the right. We then recursively make a tree for the left and right sets of examples, stopping when there are fewer than two examples left. To choose a dimension to split on at each node of the tree, one can simply select dimension i mod n at level i of the tree. (Note that we may need to split on any given dimension several times as we proceed down the tree.) Another strategy is to split on the dimension that has the widest spread of values.
Exact lookup from a k-d tree is just like lookup from a binary tree (with the slight complication that you need to pay attention to which dimension you are testing at each node). But nearest neighbor lookup is more complicated. As we go down the branches, splitting the examples in half, in some cases we can discard the other half of the examples. But not always. Sometimes the point we are querying for falls very close to the dividing boundary. The query point itself might be on the left-hand side of the boundary, but one or more of the k nearest neighbors might actually be on the right-hand side. We have to test for this possibility by computing the distance of the query point to the dividing boundary, and then searching both sides if we can't find k examples on the left that are closer than this distance. Because of this problem, k-d trees are appropriate only when there are many more examples than dimensions, preferably at least 2^n examples. Thus, k-d trees work well with up to 10 dimensions with thousands of examples or up to 20 dimensions with millions of examples. If we don't have enough examples, lookup is no faster than a linear scan of the entire data set.
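A minimal sketch of the k-d tree construction described above, not the book's code: split on dimension i mod n at depth i, at the median value, and recurse until fewer than two examples remain.

def build_kdtree(points, depth=0):
    if len(points) < 2:
        return {"leaf": points}
    i = depth % len(points[0])                     # dimension to split on at this level
    points = sorted(points, key=lambda p: p[i])
    mid = len(points) // 2                         # median index: half the points on each side
    return {"dim": i,
            "split": points[mid][i],
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid:], depth + 1)}

# Nearest-neighbor lookup would descend by comparing the query to "split" along "dim",
# and also search the other branch whenever the query is closer to the dividing boundary
# than the k-th best distance found so far, as the text explains.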
18.8.3
LOCALITY-SENSITIVE HASH
Locality-sensitive hashing
Hash tables have the potential to provide even faster lookup than binary trees. But how can we find nearest neighbors using a hash table, when hash codes rely on an exact match? Hash codes randomly distribute values among the bins, but we want to have near points grouped together in the same bin; we want a locality-sensitive hash (LSH).
APPROXIMATE NEAR-NEIGHBORS
We can't use hashes to solve NN(k, x_q) exactly, but with a clever use of randomized algorithms, we can find an approximate solution. First we define the approximate near-neighbors problem: given a data set of example points and a query point x_q, find, with high probability, an example point (or points) that is near x_q. To be more precise, we require that if there is a point x_j that is within a radius r of x_q, then with high probability the algorithm will find a point x_j' that is within distance cr of x_q. If there is no point within radius r then the algorithm is allowed to report failure. The values of c and "high probability" are parameters of the algorithm.
To solve approximate near neighbors, we will need a hash function g(x) that has the property that, for any two points x_j and x_j', the probability that they have the same hash code is small if their distance is more than cr, and is high if their distance is less than r. For simplicity we will treat each point as a bit string. (Any features that are not Boolean can be encoded into a set of Boolean features.)
The intuition we rely on is that if two points are close together in an n-dimensional space, then they will necessarily be close when projected down onto a one-dimensional space (a line). In fact, we can discretize the line into bins-hash buckets-so that, with high probability, near points project down to exactly the same bin. Points that are far away from each other will tend to project down into different bins for most projections, but there will always be a few projections that coincidentally project far-apart points into the same bin. Thus, the bin for point x_q contains many (but not all) points that are near to x_q, as well as some points that are far away.
The trick of LSH is to create multiple random projections and combine them. A random projection is just a random subset of the bit-string representation. We choose ℓ different random projections and create ℓ hash tables, g_1(x), ..., g_ℓ(x). We then enter all the examples into each hash table. Then when given a query point x_q, we fetch the set of points in bin g_k(x_q) for each k, and union these sets together into a set of candidate points, C. Then we compute the actual distance to x_q for each of the points in C and return the k closest points. With high probability, each of the points that are near to x_q will show up in at least one of the bins, and although some far-away points will show up as well, we can ignore those. With large real-world problems, such as finding the near neighbors in a data set of 13 million Web images using 512 dimensions (Torralba et al., 2008), locality-sensitive hashing needs to examine only a few thousand images out of 13 million to find nearest neighbors; a thousand-fold speedup over exhaustive or k-d tree approaches.
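A minimal sketch of the bit-string LSH scheme just described, not the book's code: each of the ℓ hash functions keeps a random subset of bit positions, and the candidate set is the union of the query's buckets. The parameter values and names are hypothetical.

import random
from collections import defaultdict

def make_lsh(examples, num_bits, num_tables=8, bits_per_projection=12):
    # examples: dict mapping an example id to a tuple of 0/1 features of length num_bits.
    projections = [random.sample(range(num_bits), bits_per_projection)
                   for _ in range(num_tables)]
    tables = [defaultdict(set) for _ in range(num_tables)]
    for name, bits in examples.items():
        for proj, table in zip(projections, tables):
            key = tuple(bits[i] for i in proj)     # g_k(x): the projected bit pattern
            table[key].add(name)
    return projections, tables

def candidates(query_bits, projections, tables):
    cand = set()
    for proj, table in zip(projections, tables):
        cand |= table[tuple(query_bits[i] for i in proj)]
    return cand   # rank these few candidates by their true distance to the query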
Nonparametric regression
Now we'll look at nonparametric approaches to regression rather than classification. Figure 18.28 shows an example of some different models. In (a), we have perhaps the simplest method of all, known informally as "connect-the-dots," and superciliously as "piecewise linear nonparametric regression." This model creates a function h(x) that, when given a query x_q, solves the ordinary linear regression problem with just two points: the training examples immediately to the left and right of x_q. When noise is low, this trivial method is actually not too bad, which is why it is a standard feature of charting software in spreadsheets.
Q says that if any examples match on P, then they must also match on Q. A determination is therefore consistent with a set of examples if every pair that matches on the predicates on the left-hand side also matches on the goal predicate. For example, suppose we have the following examples of conductance measurements on material samples:
function MINIMAL-CONSISTENT-DET(E, A) returns a set of attributes
  inputs: E, a set of examples
          A, a set of attributes, of size n
  for i = 0 to n do
    for each subset A_i of A of size i do
      if CONSISTENT-DET?(A_i, E) then return A_i

function CONSISTENT-DET?(A, E) returns a truth value
  inputs: A, a set of attributes
          E, a set of examples
  local variables: H, a hash table
  for each example e in E do
    if some example in H has the same values as e for the attributes A
       but a different classification then return false
    store the class of e in H, indexed by the values for attributes A of the example e
  return true

Figure 19.8    An algorithm for finding a minimal consistent determination.
Sample   Mass   Temperature   Material   Size   Conductance
S1       12     26            Copper     3      0.59
S1       12     100           Copper     3      0.57
S2       24     26            Copper     6      0.59
S3       12     26            Lead       2      0.05
S3       12     100           Lead       2      0.04
S4       24     26            Lead       4      0.05
The minimal consistent determination is Material ∧ Temperature ≻ Conductance. There is a nonminimal but consistent determination, namely, Mass ∧ Size ∧ Temperature ≻ Conductance. This is consistent with the examples because mass and size determine density
and, in our data set, we do not have two different materials with the same density. As usual, we would need a larger sample set in order to eliminate a nearly correct hypothesis. There are several possible algorithms for finding minimal consistent determinations.
The most obvious approach is to conduct a search through the space of determinations, checking all determinations with one predicate, two predicates, and so on, until a consistent determination is found. We will assume a simple attribute-based representation, like that used for decision tree learning in Chapter 18. A determination d will be represented by the set of attributes on the left-hand side, because the target predicate is assumed to be fixed. The basic algorithm is outlined in Figure 19.8.
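A minimal Python sketch of the algorithm in Figure 19.8, not the book's code. It assumes each example is a dict of attribute values with its classification stored under the hypothetical key "class".

from itertools import combinations

def consistent_det(attrs, examples):
    seen = {}                                          # the hash table H from the figure
    for e in examples:
        key = tuple(e[a] for a in attrs)
        if key in seen and seen[key] != e["class"]:
            return False                               # same left-hand side, different class
        seen[key] = e["class"]
    return True

def minimal_consistent_det(examples, attributes):
    for i in range(len(attributes) + 1):               # try subsets of size 0, 1, 2, ...
        for subset in combinations(attributes, i):
            if consistent_det(subset, examples):
                return set(subset)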
The time complexity of this algorithm depends on the size of the smallest consistent determination. Suppose this determination has p attributes out of the n total attributes. Then the algorithm will not find it until searching the subsets of A of size p. There are
(n choose p) = O(n^p)
y and y ≤ z can be added. As in decision-tree learning, a constant threshold value can be chosen to maximize the discriminatory power of the test. The resulting branching factor in this search space is very large (see Exercise 19.6), but FOIL can also use type information to reduce it. For example, if the domain included numbers as
function FOIL(examples, target) returns a set of Horn clauses
  inputs: examples, set of examples
          target, a literal for the goal predicate
  local variables: clauses, set of clauses, initially empty
  while examples contains positive examples do
    clause ←
μ = (1/N) Σ_j x_j ,    σ = sqrt( (1/N) Σ_j (x_j − μ)² ) .    (20.4)
That is, the maximum-likelihood value of the mean is the sample average and the maximum likelihood value of the standard deviation is the square root of the sample variance. Again, these are comforting results that confirm "commonsense" practice.
Figure 20.4    (a) A linear Gaussian model described as y = θ₁x + θ₂ plus Gaussian noise with fixed variance. (b) A set of 50 data points generated from this model.
Now consider a linear Gaussian model with one continuous parent X and a continuous child Y. As explained on page 520, Y has a Gaussian distribution whose mean depends linearly on the value of X and whose standard deviation is fixed. To learn the conditional distribution P(Y | X), we can maximize the conditional likelihood
P(y | x) = (1 / (σ sqrt(2π))) e^(−(y − (θ₁x + θ₂))² / (2σ²)) .    (20.5)
Here, the parameters are θ₁, θ₂, and σ. The data are a collection of (x_j, y_j) pairs, as illustrated in Figure 20.4. Using the usual methods (Exercise 20.5), we can find the maximum-likelihood values of the parameters. The point here is different. If we consider just the parameters θ₁ and θ₂ that define the linear relationship between x and y, it becomes clear that maximizing the log likelihood with respect to these parameters is the same as minimizing the numerator (y − (θ₁x + θ₂))² in the exponent of Equation (20.5). This is the L2 loss, the squared error between the actual value y and the prediction θ₁x + θ₂. This is the quantity minimized by the standard linear regression procedure described in Section 18.6. Now we can understand why: minimizing the sum of squared errors gives the maximum-likelihood straight-line model, provided that the data are generated with Gaussian noise of fixed variance.
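A minimal sketch of this point, not the book's code: since maximizing the conditional likelihood of the linear Gaussian model is equivalent to ordinary least squares, fitting θ₁ and θ₂ by minimizing squared error recovers the maximum-likelihood line, and the residual spread estimates the fixed noise level.

import numpy as np

def fit_linear_gaussian(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    theta1, theta2 = np.polyfit(x, y, deg=1)        # slope and intercept by least squares
    residuals = y - (theta1 * x + theta2)
    sigma = np.sqrt(np.mean(residuals ** 2))        # maximum-likelihood noise standard deviation
    return theta1, theta2, sigma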
20.2.4
HYPOTHESIS PRIOR
Bayesian parameter learning
Maximum-likelihood learning gives rise to some very simple procedures, but it has some serious deficiencies with small data sets. For example, after seeing one cherry candy, the maximum-likelihood hypothesis is that the bag is 100% cherry (i.e., θ = 1.0). Unless one's hypothesis prior is that bags must be either all cherry or all lime, this is not a reasonable conclusion. It is more likely that the bag is a mixture of lime and cherry. The Bayesian approach to parameter learning starts by defining a prior probability distribution over the possible hypotheses. We call this the hypothesis prior. Then, as data arrives, the posterior probability distribution is updated.
Figure 20.5    Examples of the beta[a, b] distribution for different values of [a, b].
BETA DISTRIBUTION
HYPERPARAMETER
The candy example in Figure 20.2(a) has one parameter, θ: the probability that a randomly selected piece of candy is cherry-flavored. In the Bayesian view, θ is the (unknown) value of a random variable Θ that defines the hypothesis space; the hypothesis prior is just the prior distribution P(Θ). Thus, P(Θ = θ) is the prior probability that the bag has a fraction θ of cherry candies. If the parameter θ can be any value between 0 and 1, then P(Θ) must be a continuous distribution that is nonzero only between 0 and 1 and that integrates to 1. The uniform density P(θ) = Uniform[0, 1](θ) is one candidate. (See Chapter 13.)
It turns out that the uniform density is a member of the family of beta distributions. Each beta distribution is defined by two hyperparameters3 a and b such that
beta[a, b](θ) = α θ^(a−1) (1 − θ)^(b−1) ,    (20.6)
CONJUGATE PRIOR
for θ in the range [0, 1]. The normalization constant α, which makes the distribution integrate to 1, depends on a and b. (See Exercise 20.7.) Figure 20.5 shows what the distribution looks like for various values of a and b. The mean value of the distribution is a/(a + b), so larger values of a suggest a belief that Θ is closer to 1 than to 0. Larger values of a + b make the distribution more peaked, suggesting greater certainty about the value of Θ. Thus, the beta family provides a useful range of possibilities for the hypothesis prior. Besides its flexibility, the beta family has another wonderful property: if Θ has a prior beta[a, b], then, after a data point is observed, the posterior distribution for Θ is also a beta distribution. In other words, beta is closed under update. The beta family is called the conjugate prior for the family of distributions for a Boolean variable.4 Let's see how this works. Suppose we observe a cherry candy; then we have
3 They are called hyperparameters because they parameterize a distribution over θ, which is itself a parameter.
4 Other conjugate priors include the Dirichlet family for the parameters of a discrete multivalued distribution
and the Normal-Wishart family for the parameters of a Gaussian distribution. See Bernardo and Smith (1994).
Figure 20.6    A Bayesian network that corresponds to a Bayesian learning process. Posterior distributions for the parameter variables Θ, Θ₁, and Θ₂ can be inferred from their prior distributions and the evidence in the Flavor_i and Wrapper_i variables.
P(θ | D₁ = cherry) = α P(D₁ = cherry | θ) P(θ)
                   = α' θ · beta[a, b](θ) = α' θ · θ^(a−1) (1 − θ)^(b−1)
                   = α' θ^a (1 − θ)^(b−1) = beta[a + 1, b](θ) .
VIRTUAL COUNTS
Thus, after seeing a cherry candy, we simply increment the a parameter to get the posterior; similarly, after seeing a lime candy, we increment the b parameter. Thus, we can view the a and b hyperparameters as virtual counts, in the sense that a prior beta[a, b] behaves exactly as if we had started out with a uniform prior beta[1, 1] and seen a − 1 actual cherry candies and b − 1 actual lime candies.
By examining a sequence of beta distributions for increasing values of a and b, keeping the proportions fixed, we can see vividly how the posterior distribution over the parameter Θ changes as data arrive. For example, suppose the actual bag of candy is 75% cherry. Figure 20.5(b) shows the sequence beta[3, 1], beta[6, 2], beta[30, 10]. Clearly, the distribution is converging to a narrow peak around the true value of Θ. For large data sets, then, Bayesian learning (at least in this case) converges to the same answer as maximum-likelihood learning.
PARAMETER INDEPENDENCE
Now let us consider a more complicated case. The network in Figure 20.2(b) has three parameters, θ, θ₁, and θ₂, where θ₁ is the probability of a red wrapper on a cherry candy and θ₂ is the probability of a red wrapper on a lime candy. The Bayesian hypothesis prior must cover all three parameters-that is, we need to specify P(Θ, Θ₁, Θ₂). Usually, we assume parameter independence:
P(Θ, Θ₁, Θ₂) = P(Θ) P(Θ₁) P(Θ₂) .
With this assumption, each parameter can have its own beta distribution that is updated separately as data arrive. Figure 20.6 shows how we can incorporate the hypothesis prior and any data into one Bayesian network. The nodes Θ, Θ₁, Θ₂ have no parents. But each time we make an observation of a wrapper and corresponding flavor of a piece of candy, we add a node Flavor_i, which is dependent on the flavor parameter Θ:
P(Flavor_i = cherry | Θ = θ) = θ .
We also add a node Wrapper_i, which is dependent on Θ₁ and Θ₂:
P(Wrapper_i = red | Flavor_i = cherry, Θ₁ = θ₁) = θ₁
P(Wrapper_i = red | Flavor_i = lime, Θ₂ = θ₂) = θ₂ .
Now, the entire Bayesian learning process can be formulated as an inference problem. We add new evidence nodes, then query the unknown nodes (in this case, Θ, Θ₁, Θ₂). This formulation of learning and prediction makes it clear that Bayesian learning requires no extra "principles of learning." Furthermore, there is, in essence, just one learning algorithm-the inference algorithm for Bayesian networks. Of course, the nature of these networks is somewhat different from those of Chapter 14 because of the potentially huge number of evidence variables representing the training set and the prevalence of continuous-valued parameter variables.
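A minimal sketch of the virtual-count updating described in this section, not the book's code: starting from a beta[a, b] prior on Θ, each cherry observation increments a and each lime observation increments b; the posterior mean a/(a + b) is pulled from the prior toward the observed fraction.

def update_beta(a, b, observations):
    for flavor in observations:
        if flavor == "cherry":
            a += 1
        else:
            b += 1
    return a, b

a, b = 1, 1                                   # uniform prior beta[1, 1]
a, b = update_beta(a, b, ["cherry", "cherry", "lime", "cherry"])
posterior_mean = a / (a + b)                  # 4/6, between the prior mean 0.5 and the data fraction 0.75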
20.2.5
Learning Bayes net structures
So far, we have assumed that the structure of the Bayes net is given and we are just trying to learn the parameters. The structure of the network represents basic causal knowledge about the domain that is often easy for an expert, or even a naive user, to supply. In some cases, however, the causal model may be unavailable or subject to dispute-for example, certain corporations have long claimed that smoking does not cause cancer-so it is important to understand how the structure of a Bayes net can be learned from data. This section gives a brief sketch of the main ideas.
The most obvious approach is to search for a good model. We can start with a model containing no links and begin adding parents for each node, fitting the parameters with the methods we have just covered and measuring the accuracy of the resulting model. Alternatively, we can start with an initial guess at the structure and use hill-climbing or simulated annealing search to make modifications, retuning the parameters after each change in the structure. Modifications can include reversing, adding, or deleting links. We must not introduce cycles in the process, so many algorithms assume that an ordering is given for the variables, and that a node can have parents only among those nodes that come earlier in the ordering (just as in the construction process in Chapter 14). For full generality, we also need to search over possible orderings.
There are two alternative methods for deciding when a good structure has been found. The first is to test whether the conditional independence assertions implicit in the structure are actually satisfied in the data. For example, the use of a naive Bayes model for the restaurant problem assumes that
P(Fri/Sat, Bar | WillWait) = P(Fri/Sat | WillWait) P(Bar | WillWait)
and we can check in the data that the same equation holds between the corresponding conditional frequencies. But even if the structure describes the true causal nature of the domain, statistical fluctuations in the data set mean that the equation will never be satisfied exactly, so we need to perform a suitable statistical test to see if there is sufficient evidence that the independence hypothesis is violated. The complexity of the resulting network will depend on the threshold used for this test-the stricter the independence test, the more links will be added and the greater the danger of overfitting.
An approach more consistent with the ideas in this chapter is to assess the degree to which the proposed model explains the data (in a probabilistic sense). We must be careful how we measure this, however. If we just try to find the maximum-likelihood hypothesis, we will end up with a fully connected network, because adding more parents to a node cannot decrease the likelihood (Exercise 20.8). We are forced to penalize model complexity in some way. The MAP (or MDL) approach simply subtracts a penalty from the likelihood of each structure (after parameter tuning) before comparing different structures. The Bayesian approach places a joint prior over structures and parameters. There are usually far too many structures to sum over (superexponential in the number of variables), so most practitioners use MCMC to sample over structures.
Penalizing complexity (whether by MAP or Bayesian methods) introduces an important connection between the optimal structure and the nature of the representation for the conditional distributions in the network. With tabular distributions, the complexity penalty for a node's distribution grows exponentially with the number of parents, but with, say, noisy-OR distributions, it grows only linearly. This means that learning with noisy-OR (or other compactly parameterized) models tends to produce learned structures with more parents than does learning with tabular distributions.
20.2.6
NONPARAMETRIC DENSITY ESTIMATION
Density estimation with nonparametric models
It is possible to learn a probability model without making any assumptions about its structure and parameterization by adopting the nonparametric methods of Section 18.8. The task of nonparametric density estimation is typically done in continuous domains, such as that shown in Figure 20.7(a). The figure shows a probability density function on a space defined by two continuous variables. In Figure 20.7(b) we see a sample of data points from this density function. The question is, can we recover the model from the samples?
First we will consider k-nearest-neighbors models. (In Chapter 18 we saw nearest-neighbor models for classification and regression; here we see them for density estimation.) Given a sample of data points, to estimate the unknown probability density at a query point x we can simply measure the density of the data points in the neighborhood of x. Figure 20.7(b) shows two query points (small squares). For each query point we have drawn the smallest circle that encloses 10 neighbors-the 10-nearest-neighborhood. We can see that the central circle is large, meaning there is a low density there, and the circle on the right is small, meaning there is a high density there. In Figure 20.8 we show three plots of density estimation using k-nearest-neighbors, for different values of k. It seems clear that (b) is about right, while (a) is too spiky (k is too small) and (c) is too smooth (k is too big).
Figure 20.7    (a) A 3D plot of the mixture of Gaussians from Figure 20.11(a). (b) A 128-point sample of points from the mixture, together with two query points (small squares) and their 10-nearest-neighborhoods (medium and large circles).
Figure 20.8    Density estimation using k-nearest-neighbors, applied to the data in Figure 20.7(b), for k = 3, 10, and 40 respectively. k = 3 is too spiky, 40 is too smooth, and 10 is just about right. The best value for k can be chosen by cross-validation.
Figure 20.9    Kernel density estimation for the data in Figure 20.7(b), using Gaussian kernels with w = 0.02, 0.07, and 0.20 respectively. w = 0.07 is about right.
Another possibility is to use kernel functions, as we did for locally weighted regression. To apply a kernel model to density estimation, assume that each data point generates its own little density function, using a Gaussian kernel. The estimated density at a query point x is then the average density as given by each kernel function:
P(x) = (1/N) Σ_{j=1}^{N} K(x, x_j) .
We will assume spherical Gaussians with standard deviation w along each axis:
K(x, x_j) = (1 / (w sqrt(2π))^d) e^(−D(x, x_j)² / (2w²)) ,
where d is the number of dimensions in x and D is the Euclidean distance function. We still have the problem of choosing a suitable value for the kernel width w; Figure 20.9 shows values that are too small, just right, and too large. A good value of w can be chosen by using cross-validation.
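A minimal sketch of the kernel density estimate defined above, not the book's code, with spherical Gaussian kernels of width w.

import numpy as np

def kernel_density(x, data, w):
    x, data = np.asarray(x, float), np.asarray(data, float)
    d = data.shape[1]                                    # number of dimensions
    sq_dist = np.sum((data - x) ** 2, axis=1)            # D(x, x_j)^2 for every data point
    kernels = np.exp(-sq_dist / (2 * w ** 2)) / (w * np.sqrt(2 * np.pi)) ** d
    return kernels.mean()                                # average over the N kernel functions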
20.3
LEARNING WITH HIDDEN VARIABLES: THE EM ALGORITHM
LATENT VARIABLE
The preceding section dealt with the fully observable case. Many real-world problems have hidden variables (sometimes called latent variables), which are not observable in the data that are available for learning. For example, medical records often include the observed symptoms, the physician's diagnosis, the treatment applied, and perhaps the outcome of the
EXPECTATION-MAXIMIZATION
treatment, but they seldom contain a direct observation of the disease itself! (Note that the diagnosis is not the disease; it is a causal consequence of the observed symptoms, which are in turn caused by the disease.) One might ask, "If the disease is not observed, why not construct a model without it?" The answer appears in Figure 20.10, which shows a small, fictitious diagnostic model for heart disease. There are three observable predisposing factors and three observable symptoms (which are too depressing to name). Assume that each variable has three possible values (e.g., none, moderate, and severe). Removing the hidden variable from the network in (a) yields the network in (b); the total number of parameters increases from 78 to 708. Thus, latent variables can dramatically reduce the number of parameters required to specify a Bayesian network. This, in turn, can dramatically reduce the amount of data needed to learn the parameters.
Hidden variables are important, but they do complicate the learning problem. In Figure 20.10(a), for example, it is not obvious how to learn the conditional distribution for HeartDisease, given its parents, because we do not know the value of HeartDisease in each case; the same problem arises in learning the distributions for the symptoms. This section describes an algorithm called expectation-maximization, or EM, that solves this problem in a very general way. We will show three examples and then provide a general description. The algorithm seems like magic at first, but once the intuition has been developed, one can find applications for EM in a huge range of learning problems.
Figure 20.10    (a) A simple diagnostic network for heart disease, which is assumed to be a hidden variable. Each variable has three possible values and is labeled with the number of independent parameters in its conditional distribution; the total number is 78. (b) The equivalent network with HeartDisease removed. Note that the symptom variables are no longer conditionally independent given their parents. This network requires 708 parameters.
20.3.1 UNSUPERVISED CLUSTERING
Unsupervised clustering: Learning mixtures of Gaussians
Unsupervised clustering is the problem of discerning multiple categories in a collection of objects. The problem is unsupervised because the category labels are not given. For example, suppose we record the spectra of a hundred thousand stars; are there different types of stars revealed by the spectra, and, if so, how many types and what are their characteristics? We are all familiar with terms such as "red giant" and "white dwarf," but the stars do not carry
MIXTURE DISTRIBUTION
COMPONENT
these labels on their hats-astronomers had to perform unsupervised clustering to identify these categories. Other examples include the identification of species, genera, orders, and so on in the Linnaean taxonomy and the creation of natural kinds for ordinary objects (see Chapter 12).
Unsupervised clustering begins with data. Figure 20.11(b) shows 500 data points, each of which specifies the values of two continuous attributes. The data points might correspond to stars, and the attributes might correspond to spectral intensities at two particular frequencies. Next, we need to understand what kind of probability distribution might have generated the data. Clustering presumes that the data are generated from a mixture distribution, P. Such a distribution has k components, each of which is a distribution in its own right. A data point is generated by first choosing a component and then generating a sample from that component. Let the random variable C denote the component, with values 1, ..., k; then the mixture distribution is given by
P(x) = Σ_{i=1}^{k} P(C = i) P(x | C = i) ,
MIXTURE OF GAUSSIANS
where x refers to the values of the attributes for a data point. For continuous data, a natural choice for the component distributions is the multivariate Gaussian, which gives the so-called mixture of Gaussians family of distributions. The parameters of a mixture of Gaussians are
Figure 20.11    (a) A Gaussian mixture model with three components; the weights (left-to-right) are 0.2, 0.3, and 0.5. (b) 500 data points sampled from the model in (a). (c) The model reconstructed by EM from the data in (b).
w_i = P(C = i)
(the weight of each component), μ_i (the mean of each component), and Σ_i (the covariance of each component). Figure 20.11(a) shows a mixture of three Gaussians; this mixture is in fact the source of the data in (b) as well as being the model shown in Figure 20.7(a) on page 815.
The unsupervised clustering problem, then, is to recover a mixture model like the one in Figure 20.11(a) from raw data like that in Figure 20.11(b). Clearly, if we knew which component generated each data point, then it would be easy to recover the component Gaussians: we could just select all the data points from a given component and then apply (a multivariate version of) Equation (20.4) (page 809) for fitting the parameters of a Gaussian to a set of data. On the other hand, if we knew the parameters of each component, then we could, at least in a probabilistic sense, assign each data point to a component. The problem is that we know neither the assignments nor the parameters.
The basic idea of EM in this context is to pretend that we know the parameters of the model and then to infer the probability that each data point belongs to each component. After that, we refit the components to the data, where each component is fitted to the entire data set with each point weighted by the probability that it belongs to that component. The process iterates until convergence. Essentially, we are "completing" the data by inferring probability distributions over the hidden variables (which component each data point belongs to) based on the current model. For the mixture of Gaussians, we initialize the mixture-model parameters arbitrarily and then iterate the following two steps:
1. E-step: Compute the probabilities p_ij = P(C = i | x_j), the probability that datum x_j was generated by component i. By Bayes' rule, we have p_ij = α P(x_j | C = i) P(C = i). The term P(x_j | C = i) is just the probability at x_j of the ith Gaussian, and the term P(C = i) is just the weight parameter for the ith Gaussian. Define n_i = Σ_j p_ij, the effective number of data points currently assigned to component i.
2. M-step: Compute the new mean, covariance, and component weights using the following steps in sequence:
μ_i ← Σ_j p_ij x_j / n_i
Σ_i ← Σ_j p_ij (x_j − μ_i)(x_j − μ_i)ᵀ / n_i
w_i ← n_i / N ,
where N is the total number of data points.
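A minimal sketch of one EM iteration for a mixture of Gaussians, not the book's code, alternating the E-step (responsibilities p_ij) and the M-step (weighted refitting) described above; the function and argument names are hypothetical.

import numpy as np
from scipy.stats import multivariate_normal

def em_step(x, weights, means, covs):
    # x: (N, d) data; weights: (k,); means: list of (d,) vectors; covs: list of (d, d) matrices.
    N, k = len(x), len(weights)
    # E-step: p_ij proportional to P(x_j | C = i) P(C = i), normalized over components.
    p = np.array([w * multivariate_normal.pdf(x, mean=m, cov=c)
                  for w, m, c in zip(weights, means, covs)])     # shape (k, N)
    p /= p.sum(axis=0)
    # M-step: refit each component to all the data, weighted by its responsibilities.
    n = p.sum(axis=1)                                            # effective counts n_i
    new_means = [p[i] @ x / n[i] for i in range(k)]
    new_covs = [(p[i][:, None] * (x - new_means[i])).T @ (x - new_means[i]) / n[i]
                for i in range(k)]
    new_weights = n / N
    return new_weights, new_means, new_covs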
Figure 21.7    Performance of the exploratory ADP agent, using R+ = 2 and N_e = 5. (a) Utility estimates for selected states over time. (b) The RMS error in utility values and the associated policy loss.
leads to a good destination, but because of nondeterminism in the environment the agent ends up in a catastrophic state. The TD update rule will take this as seriously as if the outcome had been the normal result of the action, whereas one might suppose that, because the outcome was a fluke, the agent should not worry about it too much. In fact, of course, the unlikely outcome will occur only infrequently in a large set of training sequences; hence in the long run its effects will be weighted proportionally to its probability, as we would hope. Once again, it can be shown that the TD algorithm will converge to the same values as ADP as the number of training sequences tends to infinity.
There is an alternative TD method, called Q-learning, which learns an action-utility representation instead of learning utilities. We will use the notation Q(s, a) to denote the value of doing action a in state s. Q-values are directly related to utility values as follows:
U(s) = max_a Q(s, a) .    (21.6)
MODEL-FREE
Q-functions may seem like just another way of storing utility information, but they have a very important property: a TD agent that learns a Q-function does not need a model of the form P(s' | s, a), either for learning or for action selection. For this reason, Q-learning is called a model-free method. As with utilities, we can write a constraint equation that must hold at equilibrium when the Q-values are correct:
Q(s, a) = R(s) + γ Σ_{s'} P(s' | s, a) max_{a'} Q(s', a') .    (21.7)
As in the ADP learning agent, we can use this equation directly as an update equation for an iteration process that calculates exact Q-values, given an estimated model. This does, however, require that a model also be learned, because the equation uses P(s' | s, a). The temporal-difference approach, on the other hand, requires no model of state transitions-all
function Q-LEARNING-AGENT(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  persistent: Q, a table of action values indexed by state and action, initially zero
              Nsa, a table of frequencies for state-action pairs, initially zero
              s, a, r, the previous state, action, and reward, initially null
  if TERMINAL?(s) then Q[s, None] ← r'
  if s is not null then
    increment Nsa[s, a]
    Q[s, a] ← Q[s, a] + α(Nsa[s, a]) (r + γ max_{a'} Q[s', a'] − Q[s, a])
  s, a, r ← s', argmax_{a'} f(Q[s', a'], Nsa[s', a']), r'
  return a

Figure 21.8    An exploratory Q-learning agent. It is an active learner that learns the value Q(s, a) of each action in each situation. It uses the same exploration function f as the exploratory ADP agent, but avoids having to learn the transition model because the Q-value of a state can be related directly to those of its neighbors.

it needs are the Q values. The update equation for TD Q-learning is
Q(s, a) ← Q(s, a) + α (R(s) + γ max_{a'} Q(s', a') − Q(s, a)) ,    (21.8)
which is calculated whenever action a is executed in state s leading to state s'.
SARSA
WIDROW-HOFF RULE
DELTA RULE
This is called the Widrow-Hoff rule, or the delta rule, for online least-squares. For the linear function approximator Ûθ(s) in Equation (21.10), we get three simple update rules:
θ₀ ← θ₀ + α (u_j(s) − Ûθ(s)) ,
θ₁ ← θ₁ + α (u_j(s) − Ûθ(s)) x ,
θ₂ ← θ₂ + α (u_j(s) − Ûθ(s)) y .
3 We do know that the exact utility function can be represented in a page or two of Lisp, Java, or C++. That is, it can be represented by a program that solves the game exactly every time it is called. We are interested only in function approximators that use a reasonable amount of computation. It might in fact be better to learn a very simple function approximator and combine it with a certain amount of look-ahead search. The tradeoffs involved are currently not well understood.
We can apply these rules to the example where Ûθ(1, 1) is 0.8 and u_j(1, 1) is 0.4. θ₀, θ₁, and θ₂ are all decreased by 0.4α, which reduces the error for (1,1). Notice that changing the parameters θ in response to an observed transition between two states also changes the values of Ûθ for every other state! This is what we mean by saying that function approximation allows a reinforcement learner to generalize from its experiences.
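A minimal sketch of the Widrow-Hoff updates above, not the book's code, for the linear approximator U_theta(x, y) = theta0 + theta1*x + theta2*y given an observed trial value u_j; the learning rate alpha is a hypothetical choice.

def widrow_hoff_update(theta, x, y, u_j, alpha=0.05):
    theta0, theta1, theta2 = theta
    error = u_j - (theta0 + theta1 * x + theta2 * y)   # u_j(s) - U_theta(s)
    return (theta0 + alpha * error,
            theta1 + alpha * error * x,
            theta2 + alpha * error * y)

# For the example in the text, with U_theta(1, 1) = 0.8 and u_j(1, 1) = 0.4, the error is -0.4,
# so all three parameters are decreased (each by 0.4 * alpha, since x = y = 1).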
We expect that the agent will learn faster if it uses a function approximator, provided that the hypothesis space is not too large, but includes some functions that are a reasonably good fit to the true utility function. Exercise 21.5 asks you to evaluate the performance of direct utility estimation, both with and without function approximation. The improvement in the 4 x 3 world is noticeable but not dramatic, because this is a very small state space to begin with. The improvement is much greater in a 10 x 10 world with a +1 reward at (10,10). This world is well suited for a linear utility function because the true utility function is smooth and nearly linear. (See Exercise 21.8.) If we put the +1 reward at (5,5), the true utility is more like a pyramid and the function approximator in Equation (21.10) will fail miserably.
All is not lost, however! Remember that what matters for linear function approximation is that the function be linear in the parameters-the features themselves can be arbitrary nonlinear functions of the state variables. Hence, we can include a term such as θ₃ f₃(x, y) = θ₃ sqrt((x − x_g)² + (y − y_g)²) that measures the distance to the goal.
(21.3 on page 836 and 21.8 on page 844)
are given by
, , 8Uo(s) Bi � Bi + a [R(s) + 'Y Uo(s') - Uo (s)] oBi
for utilities and
(21.12)
Q��' a)
Bi � Bi + a [R(s) + 'Y ma�Q0(s' , a') - Q0(s, a)]0
(21.13)
for Q-values. For passive TD learning, the update rule can be shown to converge to the closest possible approximation4 to the true function when the function approximator is linear in the parameters. With active learning and
nonlinear functions such
as neural networks, all bets
are off: There are some very simple cases in which the parameters can go off to infinity even though there are good solutions in the hypothesis space. There are more sophisticated algorithms that can avoid these problems, but at present reinforcement learning with general function approximators remains a delicate
art.
Function approximation can also be very helpful for learning a model of the environ ment. Remember that learning a model for an
observable environment is a supervised learn
ing problem, because the next percept gives the outcome state. Any of the supervised learning methods in Chapter
18 can be used, with suitable adjustments for the fact that we need to pre
dict a complete state description rather than just a Boolean classification or a single real value. For a partially
observable environment,
the learning problem is much more difficult. If we
know what the hidden variables are and how they are causally related to each other and to the 4
The definition of distance between utility functions is rather technical; see Tsitsiklis and Van Roy (1997).
observable variables, then we can fix the structure of a dynamic Bayesian network and use the EM algorithm to learn the parameters, as was described in Chapter 20. Inventing the hidden variables and learning the model structure are still open problems. Some practical examples are described in Section 21.6.
21.5
POLICY SEARCH
The final approach we will consider for reinforcement learning problems is called policy search. In some ways, policy search is the simplest of all the methods in this chapter: the idea is to keep twiddling the policy as long as its performance improves, then stop. Let us begin with the policies themselves. Remember that a policy π is a function that maps states to actions. We are interested primarily in parameterized representations of π that have far fewer parameters than there are states in the state space (just as in the preceding section). For example, we could represent π by a collection of parameterized Q-functions, one for each action, and take the action with the highest predicted value:
π(s) = argmax_a Q̂θ(s, a) .    (21.14)
STOCHASTIC POLICY
SOFTMAX FUNCTION
Each Q-function could be a linear function of the parameters θ, as in Equation (21.10), or it could be a nonlinear function such as a neural network. Policy search will then adjust the parameters θ to improve the policy. Notice that if the policy is represented by Q-functions, then policy search results in a process that learns Q-functions. This process is not the same as Q-learning! In Q-learning with function approximation, the algorithm finds a value of θ such that Q̂θ is "close" to Q*, the optimal Q-function. Policy search, on the other hand, finds a value of θ that results in good performance; the values found by the two methods may differ very substantially. (For example, the approximate Q-function defined by Q̂θ(s, a) = Q*(s, a)/10 gives optimal performance, even though it is not at all close to Q*.) Another clear instance of the difference is the case where π(s) is calculated using, say, depth-10 look-ahead search with an approximate utility function Ûθ. A value of θ that gives good results may be a long way from making Ûθ resemble the true utility function.
One problem with policy representations of the kind given in Equation (21.14) is that the policy is a discontinuous function of the parameters when the actions are discrete. (For a continuous action space, the policy can be a smooth function of the parameters.) That is, there will be values of θ such that an infinitesimal change in θ causes the policy to switch from one action to another. This means that the value of the policy may also change discontinuously, which makes gradient-based search difficult. For this reason, policy search methods often use a stochastic policy representation π_θ(s, a), which specifies the probability of selecting action a in state s. One popular representation is the softmax function:
π_θ(s, a) = e^(Q̂θ(s,a)) / Σ_{a'} e^(Q̂θ(s,a')) .
Softmax becomes nearly deterministic if one action is much better than the others, but it always gives a differentiable function of θ; hence, the value of the policy (which depends in
POLICY VALUE
POLICY GRADIENT
a continuous fashion on the action selection probabilities) is a differentiable function of θ. Softmax is a generalization of the logistic function (page 725) to multiple variables.
Now let us look at methods for improving the policy. We start with the simplest case: a deterministic policy and a deterministic environment. Let ρ(θ) be the policy value, i.e., the expected reward-to-go when π_θ is executed. If we can derive an expression for ρ(θ) in closed form, then we have a standard optimization problem, as described in Chapter 4. We can follow the policy gradient vector ∇_θ ρ(θ) provided ρ(θ) is differentiable. Alternatively, if ρ(θ) is not available in closed form, we can evaluate π_θ simply by executing it and observing the accumulated reward. We can follow the empirical gradient by hill climbing, i.e., evaluating the change in policy value for small increments in each parameter. With the usual caveats, this process will converge to a local optimum in policy space.
When the environment (or the policy) is stochastic, things get more difficult. Suppose we are trying to do hill climbing, which requires comparing ρ(θ) and ρ(θ + Δθ) for some small Δθ. The problem is that the total reward on each trial may vary widely, so estimates of the policy value from a small number of trials will be quite unreliable; trying to compare two such estimates will be even more unreliable. One solution is simply to run lots of trials, measuring the sample variance and using it to determine that enough trials have been run to get a reliable indication of the direction of improvement for ρ(θ). Unfortunately, this is impractical for many real problems where each trial may be expensive, time-consuming, and perhaps even dangerous.
For the case of a stochastic policy π_θ(s, a), it is possible to obtain an unbiased estimate of the gradient at θ, ∇_θ ρ(θ), directly from the results of trials executed at θ. For simplicity, we will derive this estimate for the simple case of a nonsequential environment in which the reward R(a) is obtained immediately after doing action a in the start state s₀. In this case, the policy value is just the expected value of the reward, and we have
'\lep(B) = Ve :L:>e(so,a)R(a) = L('\le7re(so,a))R(a) . a
a
Now we perform a simple trick so that this summation can be approximated by samples generated from the probability distribution defined by π_θ(s_0, a). Suppose that we have N trials in all and the action taken on the jth trial is a_j. Then

$$\nabla_\theta \rho(\theta) = \sum_a \pi_\theta(s_0, a)\, \frac{\big(\nabla_\theta \pi_\theta(s_0, a)\big) R(a)}{\pi_\theta(s_0, a)} \approx \frac{1}{N} \sum_{j=1}^{N} \frac{\big(\nabla_\theta \pi_\theta(s_0, a_j)\big) R(a_j)}{\pi_\theta(s_0, a_j)} .$$
Thus, the true gradient of the policy value is approximated by a sum of terms involving the gradient of the action-selection probability in each trial. For the sequential case, this generalizes to

$$\nabla_\theta \rho(\theta) \approx \frac{1}{N} \sum_{j=1}^{N} \frac{\big(\nabla_\theta \pi_\theta(s, a_j)\big) R_j(s)}{\pi_\theta(s, a_j)}$$
for each state s visited, where a_j is executed in s on the jth trial and R_j(s) is the total reward received from state s onwards in the jth trial. The resulting algorithm is called REINFORCE (Williams, 1992); it is usually much more effective than hill climbing using lots of trials at each value of θ. It is still much slower than necessary, however.
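To make the nonsequential estimator above concrete, here is a minimal Python sketch (not from the text). It assumes a softmax policy whose Q̂_θ is linear in a per-action feature vector; the names softmax_policy, reinforce_gradient, and state_features are illustrative choices rather than anything defined in the chapter. It uses the identity (∇_θ π_θ / π_θ) R = (∇_θ log π_θ) R.

```python
import numpy as np

def softmax_policy(theta, state_features):
    """Action probabilities pi_theta(s0, .) for a linear Q-hat passed through softmax.

    state_features[a] is the feature vector phi(s0, a); Q-hat(s0, a) = theta . phi(s0, a).
    """
    q_hat = state_features @ theta
    z = np.exp(q_hat - q_hat.max())          # subtract the max for numerical stability
    return z / z.sum()

def reinforce_gradient(theta, state_features, trials):
    """Unbiased estimate of grad rho(theta) from N trials in a nonsequential environment.

    Each trial is a pair (action_index, reward), with the action sampled from pi_theta.
    """
    probs = softmax_policy(theta, state_features)
    grad = np.zeros_like(theta)
    for a, reward in trials:
        # For a linear softmax policy, grad log pi(a) = phi(a) - sum_a' pi(a') phi(a').
        grad_log_pi = state_features[a] - probs @ state_features
        grad += grad_log_pi * reward         # (grad pi / pi) * R  =  (grad log pi) * R
    return grad / len(trials)
```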
CORRELATED SAMPLING

Consider the following task: given two blackjack programs (blackjack is also known as twenty-one or pontoon), determine which is best. One way to do this is to have each play against a standard "dealer" for a certain number of hands and then to measure their respective winnings. The problem with this, as we have seen, is that the winnings of each program fluctuate widely depending on whether it receives good or bad cards. An obvious solution is to generate a certain number of hands in advance and have each program play the same set of hands. In this way, we eliminate the measurement error due to differences in the cards received. This idea, called correlated sampling, underlies a policy-search algorithm called PEGASUS (Ng and Jordan, 2000). The algorithm is applicable to domains for which a simulator is available so that the "random" outcomes of actions can be repeated. The algorithm works by generating in advance N sequences of random numbers, each of which can be used to run a trial of any policy. Policy search is carried out by evaluating each candidate policy using the same set of random sequences to determine the action outcomes. It can be shown that the number of random sequences required to ensure that the value of every policy is well estimated depends only on the complexity of the policy space, and not at all on the complexity of the underlying domain.
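The correlated-sampling idea is easy to sketch in code. The following is a minimal illustration (not the PEGASUS implementation): it assumes a simulator simulate(policy, rng) that draws every "random" outcome from the generator it is given, so that two policies evaluated with the same seeds see exactly the same sequence of chance events; all names are illustrative.

```python
import random

def evaluate(policy, seeds, simulate):
    """Average return of a policy over trials whose randomness is fixed in advance."""
    returns = []
    for seed in seeds:
        rng = random.Random(seed)            # the same "cards" for every policy
        returns.append(simulate(policy, rng))
    return sum(returns) / len(returns)

def compare(policy_a, policy_b, simulate, n_trials=1000):
    """Correlated-sampling comparison: both policies see identical random outcomes."""
    seeds = [random.randrange(2**32) for _ in range(n_trials)]
    return evaluate(policy_a, seeds, simulate) - evaluate(policy_b, seeds, simulate)
```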
21.6
APPLICATIONS OF REINFORCEMENT LEARNING

We now turn to examples of large-scale applications of reinforcement learning. We consider applications in game playing, where the transition model is known and the goal is to learn the utility function, and in robotics, where the model is usually unknown.
21.6.1
Applications to game playing
The first significant application of reinforcement learning was also the first significant learning program of any kind: the checkers program written by Arthur Samuel (1959, 1967). Samuel first used a weighted linear function for the evaluation of positions, using up to 16 terms at any one time. He applied a version of Equation (21.12) to update the weights. There were some significant differences, however, between his program and current methods. First, he updated the weights using the difference between the current state and the backed-up value generated by full look-ahead in the search tree. This works fine, because it amounts to viewing the state space at a different granularity. A second difference was that the program did not use any observed rewards! That is, the values of terminal states reached in self-play were ignored. This means that it is theoretically possible for Samuel's program not to converge, or to converge on a strategy designed to lose rather than to win. He managed to avoid this fate by insisting that the weight for material advantage should always be positive. Remarkably, this was sufficient to direct the program into areas of weight space corresponding to good checkers play.

Gerry Tesauro's backgammon program TD-GAMMON (1992) forcefully illustrates the potential of reinforcement learning techniques. In earlier work (Tesauro and Sejnowski, 1989), Tesauro tried learning a neural network representation of Q(s, a) directly from examples of moves labeled with relative values by a human expert. This approach proved extremely tedious for the expert. It resulted in a program, called NEUROGAMMON, that was strong by computer standards, but not competitive with human experts. The TD-GAMMON project was an attempt to learn from self-play alone. The only reward signal was given at the end of each game. The evaluation function was represented by a fully connected neural network with a single hidden layer containing 40 nodes. Simply by repeated application of Equation (21.12), TD-GAMMON learned to play considerably better than NEUROGAMMON, even though the input representation contained just the raw board position with no computed features. This took about 200,000 training games and two weeks of computer time. Although that may seem like a lot of games, it is only a vanishingly small fraction of the state space. When precomputed features were added to the input representation, a network with 80 hidden nodes was able, after 300,000 training games, to reach a standard of play comparable to that of the top three human players worldwide. Kit Woolsey, a top player and analyst, said that "There is no question in my mind that its positional judgment is far better than mine."
21.6.2 Application to robot control

CART-POLE
INVERTED PENDULUM

The setup for the famous cart-pole balancing problem, also known as the inverted pendulum, is shown in Figure 21.9. The problem is to control the position x of the cart so that the pole stays roughly upright (θ ≈ π/2), while staying within the limits of the cart track as shown. Several thousand papers in reinforcement learning and control theory have been published on this seemingly simple problem. The cart-pole problem differs from the problems described earlier in that the state variables x, θ, ẋ, and θ̇ are continuous. The actions are usually discrete: jerk left or jerk right, the so-called bang-bang control regime.

BANG-BANG CONTROL

Figure 21.9  Setup for the problem of balancing a long pole on top of a moving cart. The cart can be jerked left or right by a controller that observes x, θ, ẋ, and θ̇.

The earliest work on learning for this problem was carried out by Michie and Chambers (1968). Their BOXES algorithm was able to balance the pole for over an hour after only about 30 trials. Moreover, unlike many subsequent systems, BOXES was implemented with a
real cart and pole, not a simulation. The algorithm first discretized the four-dimensional state space into boxes, hence the name. It then ran trials until the pole fell over or the cart hit the end of the track. Negative reinforcement was associated with the final action in the final box and then propagated back through the sequence. It was found that the discretization caused some problems when the apparatus was initialized in a position different from those used in training, suggesting that generalization was not perfect. Improved generalization and faster learning can be obtained using an algorithm that adaptively partitions the state space according to the observed variation in the reward, or by using a continuous-state, nonlinear function approximator such as a neural network. Nowadays, balancing a triple inverted pendulum is a common exercise, a feat far beyond the capabilities of most humans.

Still more impressive is the application of reinforcement learning to helicopter flight (Figure 21.10). This work has generally used policy search (Bagnell and Schneider, 2001) as well as the PEGASUS algorithm with simulation based on a learned transition model (Ng et al., 2004). Further details are given in Chapter 25.

Figure 21.10  Superimposed time-lapse images of an autonomous helicopter performing a very difficult "nose-in circle" maneuver. The helicopter is under the control of a policy developed by the PEGASUS policy-search algorithm. A simulator model was developed by observing the effects of various control manipulations on the real helicopter; then the algorithm was run on the simulator model overnight. A variety of controllers were developed for different maneuvers. In all cases, performance far exceeded that of an expert human pilot using remote control. (Image courtesy of Andrew Ng.)
21.7
SUMMARY

This chapter has examined the reinforcement learning problem: how an agent can become proficient in an unknown environment, given only its percepts and occasional rewards. Reinforcement learning can be viewed as a microcosm for the entire AI problem, but it is studied in a number of simplified settings to facilitate progress. The major points are:
• The overall agent design dictates the kind of information that must be learned. The three main designs we covered were the model-based design, using a model P and a utility function U; the model-free design, using an action-utility function Q; and the reflex design, using a policy π.
• Utilities can be learned using three approaches:
  1. Direct utility estimation uses the total observed reward-to-go for a given state as direct evidence for learning its utility.
  2. Adaptive dynamic programming (ADP) learns a model and a reward function from observations and then uses value or policy iteration to obtain the utilities or an optimal policy. ADP makes optimal use of the local constraints on utilities of states imposed through the neighborhood structure of the environment.
  3. Temporal-difference (TD) methods update utility estimates to match those of successor states. They can be viewed as simple approximations to the ADP approach that can learn without requiring a transition model. Using a learned model to generate pseudoexperiences can, however, result in faster learning.
• Action-utility functions, or Q-functions, can be learned by an ADP approach or a TD approach. With TD, Q-learning requires no model in either the learning or action selection phase. This simplifies the learning problem but potentially restricts the ability to learn in complex environments, because the agent cannot simulate the results of possible courses of action.
• When the learning agent is responsible for selecting actions while it learns, it must trade off the estimated value of those actions against the potential for learning useful new information. An exact solution of the exploration problem is infeasible, but some simple heuristics do a reasonable job.
• In large state spaces, reinforcement learning algorithms must use an approximate functional representation in order to generalize over states. The temporal-difference signal can be used directly to update parameters in representations such as neural networks.
• Policy-search methods operate directly on a representation of the policy, attempting to improve it based on observed performance. The variation in the performance in a stochastic domain is a serious problem; for simulated domains this can be overcome by fixing the randomness in advance.
Because of its potential for eliminating hand coding of control strategies, reinforcement learning continues to be one of the most active areas of machine learning research. Applications in robotics promise to be particularly valuable; these will require methods for handling continuous, high-dimensional, partially observable environments in which successful behaviors may consist of thousands or even millions of primitive actions.
BIBLIOGRAPHICAL AND HISTORICAL NOTES

Turing (1948, 1950) proposed the reinforcement-learning approach, although he was not convinced of its effectiveness, writing, "the use of punishments and rewards can at best be a part of the teaching process." Arthur Samuel's work (1959) was probably the earliest successful machine learning research. Although this work was informal and had a number of flaws, it contained most of the modern ideas in reinforcement learning, including temporal differencing and function approximation. Around the same time, researchers in adaptive control theory (Widrow and Hoff, 1960), building on work by Hebb (1949), were training simple networks using the delta rule. (This early connection between neural networks and reinforcement learning may have led to the persistent misperception that the latter is a subfield of the former.) The cart-pole work of Michie and Chambers (1968) can also be seen as a reinforcement learning method with a function approximator. The psychological literature on reinforcement learning is much older; Hilgard and Bower (1975) provide a good survey. Direct evidence for the operation of reinforcement learning in animals has been provided by investigations into the foraging behavior of bees; there is a clear neural correlate of the reward signal in the form of a large neuron mapping from the nectar intake sensors directly to the motor cortex (Montague et al., 1995).
Research using single-cell recording suggests that the dopamine system in primate brains implements something resembling value function learning (Schultz et al., 1997). The neuroscience text by Dayan and Abbott (2001) describes possible neural implementations of temporal-difference learning, while Dayan and Niv (2008) survey the latest evidence from neuroscientific and behavioral experiments.

The connection between reinforcement learning and Markov decision processes was first made by Werbos (1977), but the development of reinforcement learning in AI stems from work at the University of Massachusetts in the early 1980s (Barto et al., 1981). The paper by Sutton (1988) provides a good historical overview. Equation (21.3) in this chapter is a special case for λ = 0 of Sutton's general TD(λ) algorithm. TD(λ) updates the utility values of all states in a sequence leading up to each transition by an amount that drops off as λ^t for states t steps in the past. TD(1) is identical to the Widrow-Hoff or delta rule. Boyan (2002), building on work by Bradtke and Barto (1996), argues that TD(λ) and related algorithms make inefficient use of experiences; essentially, they are online regression algorithms that converge much more slowly than offline regression. His LSTD (least-squares temporal differencing) algorithm is an online algorithm for passive reinforcement learning that gives the same results as offline regression. Least-squares policy iteration, or LSPI (Lagoudakis and Parr, 2003), combines this idea with the policy iteration algorithm, yielding a robust, statistically efficient, model-free algorithm for learning policies.

The combination of temporal-difference learning with the model-based generation of simulated experiences was proposed in Sutton's DYNA architecture (Sutton, 1990). The idea of prioritized sweeping was introduced independently by Moore and Atkeson (1993) and
Peng and Williams (1993). Q-learning was developed in Watkins's Ph.D. thesis (1989), while SARSA appeared in a technical report by Rummery and Niranjan (1994). Bandit problems, which model the problem of exploration for nonsequential decisions, are analyzed in depth by Berry and Fristedt (1985). Optimal exploration strategies for several settings are obtainable using the technique called Gittins indices (Gittins, 1989). A variety of exploration methods for sequential decision problems are discussed by Barto et al. (1995). Kearns and Singh (1998) and Brafman and Tennenholtz (2000) describe algorithms that explore unknown environments and are guaranteed to converge on near-optimal policies in polynomial time. Bayesian reinforcement learning (Dearden et al., 1998, 1999) provides another angle on both model uncertainty and exploration.

CMAC

Function approximation in reinforcement learning goes back to the work of Samuel, who used both linear and nonlinear evaluation functions and also used feature-selection methods to reduce the feature space. Later methods include the CMAC (Cerebellar Model Articulation Controller) (Albus, 1975), which is essentially a sum of overlapping local kernel functions, and the associative neural networks of Barto et al. (1983). Neural networks are currently the most popular form of function approximator. The best-known application is TD-Gammon (Tesauro, 1992, 1995), which was discussed in the chapter. One significant problem exhibited by neural-network-based TD learners is that they tend to forget earlier experiences, especially those in parts of the state space that are avoided once competence is achieved. This can result in catastrophic failure if such circumstances reappear. Function approximation based on instance-based learning can avoid this problem (Ormoneit and Sen, 2002; Forbes, 2002).

The convergence of reinforcement learning algorithms using function approximation is an extremely technical subject. Results for TD learning have been progressively strengthened for the case of linear function approximators (Sutton, 1988; Dayan, 1992; Tsitsiklis and Van Roy, 1997), but several examples of divergence have been presented for nonlinear functions (see Tsitsiklis and Van Roy, 1997, for a discussion). Papavassiliou and Russell (1999) describe a new type of reinforcement learning that converges with any form of function approximator, provided that a best-fit approximation can be found for the observed data.

Policy search methods were brought to the fore by Williams (1992), who developed the REINFORCE family of algorithms. Later work by Marbach and Tsitsiklis (1998), Sutton et al. (2000), and Baxter and Bartlett (2000) strengthened and generalized the convergence results for policy search. The method of correlated sampling for comparing different configurations of a system was described formally by Kahn and Marshall (1953), but seems to have been known long before that. Its use in reinforcement learning is due to Van Roy (1998) and Ng and Jordan (2000); the latter paper also introduced the PEGASUS algorithm and proved its formal properties.
As we mentioned in the chapter, the performance of a stochastic policy is a continuous function of its parameters.

$$\sum_k \lambda_k\, f_k(x_{i-1}, x_i, \mathbf{e}, i)$$
The λ_k parameter values are learned with a MAP (maximum a posteriori) estimation procedure that maximizes the conditional likelihood of the training data. The feature functions are the key components of a CRF. The function f_k has access to a pair of adjacent states, x_{i-1} and x_i, but also the entire observation (word) sequence e, and the current position in the temporal sequence, i. This gives us a lot of flexibility in defining features. We can define a simple feature function, for example one that produces a value of 1 if the current word is ANDREW and the current state is SPEAKER:

$$f_1(x_{i-1}, x_i, \mathbf{e}, i) = \begin{cases} 1 & \text{if } x_i = \text{SPEAKER and } e_i = \text{ANDREW} \\ 0 & \text{otherwise} \end{cases}$$
How are features like these used? It depends on their corresponding weights. If λ_1 > 0, then whenever f_1 is true, it increases the probability of the hidden state sequence x_{1:N}. This is another way of saying "the CRF model should prefer the target state SPEAKER for the word ANDREW." If, on the other hand, λ_1 < 0, the CRF model will try to avoid this association, and if λ_1 = 0, this feature is ignored. Parameter values can be set manually or can be learned
from data. Now consider a second feature function:

$$f_2(x_{i-1}, x_i, \mathbf{e}, i) = \begin{cases} 1 & \text{if } x_i = \text{SPEAKER and } e_{i+1} = \text{SAID} \\ 0 & \text{otherwise} \end{cases}$$
This feature is true if the current state is SPEAKER and the next word is "said." One would therefore expect a positive λ_2 value to go with the feature. More interestingly, note that both f_1 and f_2 can hold at the same time for a sentence like "Andrew said . . . ." In this case, the two features overlap each other and both boost the belief in x_1 = SPEAKER. Because of the independence assumption, HMMs cannot use overlapping features; CRFs can. Furthermore, a feature in a CRF can use any part of the sequence e_{1:N}. Features can also be defined over transitions between states. The features we defined here were binary, but in general, a feature function can be any real-valued function. For domains where we have some knowledge about the types of features we would like to include, the CRF formalism gives us a great deal of flexibility in defining them. This flexibility can lead to accuracies that are higher than with less flexible models such as HMMs.
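As a concrete illustration (not code from the text), the two features above can be written as ordinary functions of the tag sequence and the word sequence, and the CRF's unnormalized log-score is simply the weighted sum of feature values over all positions. The function names, the SPEAKER/O tag strings, and the weights below are illustrative assumptions.

```python
def f1(prev_state, state, words, i):
    """1 if the current state is SPEAKER and the current word is "Andrew"."""
    return 1 if state == "SPEAKER" and words[i].lower() == "andrew" else 0

def f2(prev_state, state, words, i):
    """1 if the current state is SPEAKER and the next word is "said"."""
    return 1 if (state == "SPEAKER"
                 and i + 1 < len(words)
                 and words[i + 1].lower() == "said") else 0

def score(states, words, features, weights):
    """Unnormalized log-score: sum over i and k of lambda_k * f_k(x_{i-1}, x_i, e, i)."""
    total = 0.0
    for i in range(len(words)):
        prev_state = states[i - 1] if i > 0 else None
        for f, lam in zip(features, weights):
            total += lam * f(prev_state, states[i], words, i)
    return total

# Both features fire at position 0 of "Andrew said ..." when x_1 = SPEAKER:
print(score(["SPEAKER", "O"], ["Andrew", "said"], [f1, f2], [0.5, 0.8]))   # 1.3
```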
22.4.4
Ontology extraction from large corpora
So far we have thought of information extraction as finding a specific set of relations (e.g., speaker, time, location) in a specific text (e.g., a talk announcement). A different application of extraction technology is building a large knowledge base or ontology of facts from a corpus. This is different in three ways: First, it is open-ended: we want to acquire facts about all types of domains, not just one specific domain. Second, with a large corpus, this task is dominated by precision, not recall, just as with question answering on the Web (Section 22.3.6). Third, the results can be statistical aggregates gathered from multiple sources, rather than being extracted from one specific text. For example, Hearst (1992) looked at the problem of learning an ontology of concept categories and subcategories from a large corpus. (In 1992, a large corpus was a 1000-page encyclopedia; today it would be a 100-million-page Web corpus.) The work concentrated on templates that are very general (not tied to a specific domain) and have high precision (are
almost always correct when they match) but low recall (do not always match). Here is one of the most productive templates:
NP such as NP (, NP)* (,)? ((and | or) NP)? .
Here the bold words and commas must appear literally in the text, but the parentheses are for grouping, the asterisk means repetition of zero or more, and the question mark means optional. NP is a variable standing for a noun phrase; Chapter 23 describes how to identify noun phrases; for now just assume that we know some words are nouns and other words (such as verbs) that we can reliably assume are not part of a simple noun phrase. This template matches the texts "diseases such as rabies affect your dog" and "supports network protocols such as DNS," concluding that rabies is a disease and DNS is a network protocol. Similar templates can be constructed with the key words "including," "especially," and "or other." Of course these templates will fail to match many relevant passages, like "Rabies is a disease." That is intentional. The "NP is a NP" template does indeed sometimes denote a subcategory relation, but it often means something else, as in "There is a God" or "She is a little tired." With a large corpus we can afford to be picky; to use only the high-precision templates. We'll miss many statements of a subcategory relationship, but most likely we'll find a paraphrase of the statement somewhere else in the corpus in a form we can use.
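A rough, illustrative rendering of this template as a regular expression is sketched below (it is not how Hearst's system was implemented). It assumes noun phrases have already been chunked and rewritten as single tokens of the form NP<...>; note also that Python keeps only the last repetition of a repeated group, so long comma-separated lists are only partially captured.

```python
import re

NP = r"NP<([^>]+)>"
# "NP such as NP (, NP)* (,)? ((and|or) NP)?" with NPs pre-chunked as NP<...> tokens.
SUCH_AS = re.compile(
    NP + r" such as " + NP + r"(?:, " + NP + r")*(?:,)?(?: (?:and|or) " + NP + r")?")

def subcategories(chunked_text):
    """Yield (subcategory, category) pairs found in one chunked sentence."""
    for m in SUCH_AS.finditer(chunked_text):
        category = m.group(1)
        for sub in m.groups()[1:]:
            if sub:
                yield (sub, category)

print(list(subcategories("NP<diseases> such as NP<rabies> affect your dog")))
# [('rabies', 'diseases')]
```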
22.4.5
Automated template construction
The subcategory relation is so fundamental that it is worthwhile to handcraft a few templates to help identify instances of it occurring in natural language text. But what about the thousands of other relations in the world? There aren't enough AI grad students in the world to create
and debug templates for all of them. Fortunately, it is possible to learn templates from a few
examples, then use the templates to learn more examples, from which more templates can be learned, and so on. In one of the first experiments of this kind, Brin (1998) started with a data set of just five examples:
("Isaac Asimov", "The Robots of Dawn") ("David Brio", "Startide Rising") ("James Gleick", "Chaos-Making a New Science") ("Charles Dickens", "Great Expectations") ("William Shakespeare", "The Comedy of Errors") Clearly these are examples of the author-title relation, but the learning system had no knowl edge of authors or titles. The words in these examples were used in a search over a Web
corpus, resulting in
199 matches. Each match is defined as a tuple of seven strings,
(Author, Title, Order, Prefix, Middle, Postfix, URL) , where
Order is true
if the author came first and false if the title came first,
Middle
is the
Prefix is the 10 characters before the match, Suffix is URL is the Web address where the match was made.
characters between the author and title, the I 0 characters after the match, and
Given a set of matches, a simple template-generation scheme can find templates to explain the matches. The language of templates was designed to have a close mapping to the matches themselves, to be amenable to automated learning, and to emphasize high precision
(possibly at the risk of lower recall). Each template has the same seven components as a match. The Author and Title are regexes consisting of any characters (but beginning and ending in letters) and constrained to have a length from half the minimum length of the examples to twice the maximum length. The prefix, middle, and postfix are restricted to literal strings, not regexes. The middle is the easiest to learn: each distinct middle string in the set of matches is a distinct candidate template. For each such candidate, the template's Prefix is then defined as the longest common suffix of all the prefixes in the matches, and the Postfix is defined as the longest common prefix of all the postfixes in the matches. If either of these is of length zero, then the template is rejected. The URL of the template is defined as the longest prefix of the URLs in the matches. In the experiment run by Brin, the first 199 matches generated three templates. The most productive template was

Title by Author (        URL: www.sff.net/locus/c
The three templates were then used to retrieve 4047 more (author, title) examples. The examples were then used to generate more templates, and so on, eventually yielding over 15,000 titles. Given a good set of templates, the system can collect a good set of examples. Given a good set of examples, the system can build a good set of templates. The biggest weakness in this approach is the sensitivity to noise. If one of the first few templates is incorrect, errors can propagate quickly. One way to limit this problem is to not accept a new example unless it is verified by multiple templates, and not accept a new template unless it discovers multiple examples that are also found by other templates.
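The template-generation scheme just described is simple enough to sketch directly. The code below is an illustrative reconstruction, not Brin's implementation: matches are grouped by their middle string, the prefix is the longest common suffix of the match prefixes, the postfix and URL are longest common prefixes, and templates with empty context are rejected. All names and the sample matches are made up for the example.

```python
import os

def common_suffix(strings):
    """Longest string that every element ends with."""
    reversed_common = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_common[::-1]

def make_template(matches):
    """Derive one (prefix, middle, postfix, url) template from matches sharing a middle.

    Each match is a dict with 'prefix', 'middle', 'postfix', and 'url' fields, as in the
    seven-string tuples above (Order and the Author/Title regexes are omitted for brevity).
    """
    middles = {m["middle"] for m in matches}
    assert len(middles) == 1, "group matches by middle string before calling"
    prefix = common_suffix([m["prefix"] for m in matches])
    postfix = os.path.commonprefix([m["postfix"] for m in matches])
    if not prefix or not postfix:
        return None                          # reject templates with empty context
    url = os.path.commonprefix([m["url"] for m in matches])
    return (prefix, middles.pop(), postfix, url)

matches = [
    {"prefix": "reviews: ", "middle": " by ", "postfix": " (1987)", "url": "www.sff.net/locus/c1"},
    {"prefix": "new books: ", "middle": " by ", "postfix": " (1990)", "url": "www.sff.net/locus/c2"},
]
print(make_template(matches))                # ('s: ', ' by ', ' (19', 'www.sff.net/locus/c')
```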
22.4.6 Machine reading

MACHINE READING
Automated template construction is a big step up from handcrafted template construction, but it still requires a handful of labeled examples of each relation to get started. To build a large ontology with many thousands of relations, even that amount of work would be onerous; we would like to have an extraction system with no human input of any kind, a system that could read on its own and build up its own database. Such a system would be relation-independent; it would work for any relation. In practice, these systems work on all relations in parallel, because of the I/O demands of large corpora. They behave less like a traditional information extraction system that is targeted at a few relations and more like a human reader who learns from the text itself; because of this the field has been called machine reading. A representative machine-reading system is TEXTRUNNER (Banko and Etzioni, 2008). TEXTRUNNER uses cotraining to boost its performance, but it needs something to bootstrap from. In the case of Hearst (1992), specific patterns (e.g., such as) provided the bootstrap, and for Brin (1998), it was a set of five author-title pairs. For TEXTRUNNER, the original inspiration was a taxonomy of eight very general syntactic templates, as shown in Figure 22.3. It was felt that a small number of templates like this could cover most of the ways that relationships are expressed in English. The actual bootstrapping starts from a set of labeled examples that are extracted from the Penn Treebank, a corpus of parsed sentences. For example, from the parse of the sentence "Einstein received the Nobel Prize in 1921," TEXTRUNNER is able
to extract the relation ("Einstein," "received," "Nobel Prize").
Given a set of labeled examples of this type, TEXTRUNNER trains a linear-chain CRF
to extract further examples from unlabeled text. The features in the CRF include function words like "to" and "of" and "the," but not nouns and verbs (and not noun phrases or verb phrases). Because TEXTRUNNER is domain-independent, it cannot rely on predefined lists of nouns and verbs.

Type               Template                        Example                  Frequency
Verb               NP1 Verb NP2                    X established Y          38%
Noun-Prep          NP1 NP Prep NP2                 X settlement with Y      23%
Verb-Prep          NP1 Verb Prep NP2               X moved to Y             16%
Infinitive         NP1 to Verb NP2                 X plans to acquire Y     9%
Modifier           NP1 Verb NP2 Noun               X is Y winner            5%
Noun-Coordinate    NP1 (, | and | - | :) NP2 NP    X-Y deal                 2%
Verb-Coordinate    NP1 (, | and) NP2 Verb          X, Y merge               1%
Appositive         NP1 NP (: | ,)? NP2             X hometown : Y           1%

Figure 22.3  Eight general templates that cover about 95% of the ways that relations are expressed in English.
TEXTRUNNER achieves a precision of 88% and recall of 45% (F1 of 60%) on a large Web corpus. TEXTRUNNER has extracted hundreds of millions of facts from a corpus of a half-billion Web pages. For example, even though it has no predefined medical knowledge, it has extracted over 2000 answers to the query [what kills bacteria]; correct answers include antibiotics, ozone, chlorine, Cipro, and broccoli sprouts. Questionable answers include "water," which came from the sentence "Boiling water for at least 10 minutes will kill bacteria." It would be better to attribute this to "boiling water" rather than just "water."
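For reference, the F1 figure quoted here is the harmonic mean of the stated precision P = 0.88 and recall R = 0.45:

$$F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.88 \times 0.45}{0.88 + 0.45} \approx 0.60 .$$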
With the techniques outlined in this chapter and continual new inventions, we are starting to get closer to the goal of machine reading.

22.5
SUMMARY

The main points of this chapter are as follows:
• Probabilistic language models based on n-grams recover a surprising amount of information about a language. They can perform well on such diverse tasks as language identification, spelling correction, genre classification, and named-entity recognition.
• These language models can have millions of features, so feature selection and preprocessing of the data to reduce noise is important.
• Text classification can be done with naive Bayes n-gram models or with any of the classification algorithms we have previously discussed. Classification can also be seen as a problem in data compression.
• Information retrieval systems use a very simple language model based on bags of words, yet still manage to perform well in terms of recall and precision on very large corpora of text. On Web corpora, link-analysis algorithms improve performance.
• Question answering can be handled by an approach based on information retrieval, for questions that have multiple answers in the corpus. When more answers are available in the corpus, we can use techniques that emphasize precision rather than recall.
• Information-extraction systems use a more complex model that includes limited notions of syntax and semantics in the form of templates. They can be built from finite-state automata, HMMs, or conditional random fields, and can be learned from examples.
• In building a statistical language system, it is best to devise a model that can make good use of available data, even if the model seems overly simplistic.
BIBLIOGRAPHICAL AND HISTORICAL NOTES
N-gram letter models for language modeling were proposed by Markov (1913). Claude Shannon (Shannon and Weaver, 1949) was the first to generate n-gram word models of English. Chomsky (1956, 1957) pointed out the limitations of finite-state models compared with context-free models, concluding, "Probabilistic models give no particular insight into some of the basic problems of syntactic structure." This is true, but probabilistic models do provide insight into some other basic problems, problems that context-free models ignore. Chomsky's remarks had the unfortunate effect of scaring many people away from statistical models for two decades, until these models reemerged for use in speech recognition (Jelinek, 1976). Kessler et al. (1997) show how to apply character n-gram models to genre classification, and Klein et al. (2003) describe named-entity recognition with character models. Franz and Brants (2006) describe the Google n-gram corpus of 13 million unique words from a trillion words of Web text; it is now publicly available. The bag of words model gets its name from a passage from linguist Zellig Harris (1954), "language is not merely a bag of words but a tool with particular properties." Norvig (2009) gives some examples of tasks that can be accomplished with n-gram models.
Add-one smoothing, first suggested by Pierre-Simon Laplace (1816), was formalized by Jeffreys (1948), and interpolation smoothing is due to Jelinek and Mercer (1980), who used it for speech recognition. Other techniques include Witten-Bell smoothing (1991), Good-Turing smoothing (Church and Gale, 1991) and Kneser-Ney smoothing (1995). Chen and Goodman (1996) and Goodman (2001) survey smoothing techniques.
Simple n-gram letter and word models are not the only possible probabilistic models. Blei et al. (2001) describe a probabilistic text model called latent Dirichlet allocation that views a document as a mixture of topics, each with its own distribution of words. This model can be seen as an extension and rationalization of the latent semantic indexing model of (Deerwester et al., 1990) (see also Papadimitriou et al. (1998)) and is also related to the multiple-cause mixture model of (Sahami et al., 1996).
Manning and Schütze (1999) and Sebastiani (2002) survey text-classification techniques. Joachims (2001) uses statistical learning theory and support vector machines to give a theoretical analysis of when classification will be successful. Apte et al. (1994) report an accuracy of 96% in classifying Reuters news articles into the "Earnings" category. Koller and Sahami (1997) report accuracy up to 95% with a naive Bayes classifier, and up to 98.6% with a Bayes classifier that accounts for some dependencies among features. Lewis (1998) surveys forty years of application of naive Bayes techniques to text classification and retrieval. Schapire and Singer (2000) show that simple linear classifiers can often achieve accuracy almost as good as more complex models and are more efficient to evaluate. Nigam et al. (2000) show how to use the EM algorithm to label unlabeled documents, thus learning a better classification model. Witten et al. (1999) describe compression algorithms for classification, and show the deep connection between the LZW compression algorithm and maximum-entropy language models.
Many of the n-gram model techniques are also used in bioinformatics problems. Biostatistics and probabilistic NLP are coming closer together, as each deals with long, structured sequences chosen from an alphabet of constituents.
The field of information retrieval is experiencing a regrowth in interest, sparked by the wide usage of Internet searching. Robertson (1977) gives an early overview and introduces the probability ranking principle. Croft et al. (2009) and Manning et al. (2008) are the first textbooks to cover Web-based search as well as traditional IR. Hearst (2009) covers user interfaces for Web search. The TREC conference, organized by the U.S. government's National Institute of Standards and Technology (NIST), hosts an annual competition for IR systems and publishes proceedings with results. In the first seven years of the competition, performance roughly doubled.
The most popular model for IR is the vector space model (Salton et al., 1975). Salton's work dominated the early years of the field. There are two alternative probabilistic models, one due to Ponte and Croft (1998) and one by Maron and Kuhns (1960) and Robertson and Sparck Jones (1976). Lafferty and Zhai (2001) show that the models are based on the same joint probability distribution, but that the choice of model has implications for training the parameters. Craswell et al. (2005) describe the BM25 scoring function and Svore and Burges (2009) describe how BM25 can be improved with a machine learning approach that incorporates click data: examples of past search queries and the results that were clicked on.
Brin and Page (1998) describe the PageRank algorithm and the implementation of a Web search engine. Kleinberg (1999) describes the HITS algorithm. Silverstein et al. (1998) investigate a log of a billion Web searches. The journal Information Retrieval and the proceedings of the annual SIGIR conference cover recent developments in the field.
Early information extraction programs include GUS (Bobrow et al., 1977) and FRUMP (DeJong, 1982). Recent information extraction has been pushed forward by the annual Message Understanding Conferences (MUC), sponsored by the U.S. government. The FASTUS finite-state system was done by Hobbs et al. (1997). It was based in part on the idea from Pereira and Wright (1991) of using FSAs as approximations to phrase-structure grammars. Surveys of template-based systems are given by Roche and Schabes (1997), Appelt (1999),
and Muslea (1999). Large databases of facts were extracted by Craven et al. (2000), Pasca et al. (2006), Mitchell (2007), and Durme and Pasca (2008).
Freitag and McCallum (2000) discuss HMMs for Information Extraction. CRFs were introduced by Lafferty et al. (2001); an example of their use for information extraction is described in (McCallum, 2003) and a tutorial with practical guidance is given by (Sutton and McCallum, 2007). Sarawagi (2007) gives a comprehensive survey.
Banko et al. (2002) present the ASKMSR question-answering system; a similar system is due to Kwok et al. (2001). Pasca and Harabagiu (2001) discuss a contest-winning question-answering system. Two early influential approaches to automated knowledge engineering were by Riloff (1993), who showed that an automatically constructed dictionary performed almost as well as a carefully handcrafted domain-specific dictionary, and by Yarowsky (1995), who showed that the task of word sense classification (see page 756) could be accomplished through unsupervised training on a corpus of unlabeled text with accuracy as good as supervised methods.
cotraining has
stuck. Similar early work,
tmder the name of bootstrapping, was done by Jones et al. (1999). The method was advanced by the QXTRACT (Agichtein and Gravano, 2003) and KNOWITALL (Etzioni et al., 2005) systems. Machine reading was introduced by Mitchell (2005) and Etzioni et al. (2006) and is the focus of the TEXTRUNNER project (Banko
et al., 2007; Banko and Etzioni, 2008).
This chapter has focused on natural language text, but it is also possible to do informa
tion extraction based on the physical structure or layout of text rather than on the linguistic
are home to data that can be extracted and consolidated (Hurst, 2000; Pinto et al., 2003; Cafarella et al., 2008).
structure. HT\1L lists and tables in both HTML and relational databases
The Association for Computational Linguistics (ACL) holds regular conferences and publishes the journal Computational linguistics. There is also an International Conference on Computational Linguistics (COLING). The textbook by Manning and Schiitze (1999) cov ers statistical language processing, while Jurafsky and Martin (2008) give a comprehensive introduction to speech and natural language processing.
EXERCISES
22.1 This exercise explores the quality of the n-gram model of language. Find or create a monolingual corpus of 100,000 words or more. Segment it into words, and compute the frequency of each word. How many distinct words are there? Also count frequencies of bigrams (two consecutive words) and trigrams (three consecutive words). Now use those frequencies to generate language: from the unigram, bigram, and trigram models, in turn, generate a 100-word text by making random choices according to the frequency counts. Compare the three generated texts with actual language. Finally, calculate the perplexity of each model.

22.2 Write a program to do segmentation of words without spaces. Given a string, such as the URL "thelongestlistofthelongeststuffatthelongestdomainnameatlonglast.com," return a list of component words: ["the," "longest," "list," . . . ]. This task is useful for parsing URLs, for spelling correction when words run together, and for languages such as Chinese that do not have spaces between words. It can be solved with a unigram or bigram word model and a dynamic programming algorithm similar to the Viterbi algorithm.
22.3
STYLOMETRY
(Adapted from Jurafsky and Martin (2000).) In this exercise you will develop a classifier for authorship: given a text, the classifier predicts which of two candidate authors wrote the text. Obtain samples of text from two different authors. Separate them into training and test sets. Now train a language model on the training set. You can choose what features to use; n-grams of words or letters are the easiest, but you can add additional features that you think may help. Then compute the probability of the text under each language model and choose the most probable model. Assess the accuracy of this technique. How does accuracy change as you alter the set of features? This subfield of linguistics is called stylometry; its successes include the identification of the author of the disputed Federalist Papers (Mosteller and Wallace, 1964) and some disputed works of Shakespeare (Hope, 1994). Khmelev and Tweedie (2001) produce good results with a simple letter bigram model.
22.4 This exercise concerns the classification of spam email. Create a corpus of spam email and one of non-spam mail. Examine each corpus and decide what features appear to be useful for classification: unigram words? bigrams? message length, sender, time of arrival? Then train a classification algorithm (decision tree, naive Bayes, SVM, logistic regression, or some other algorithm of your choosing) on a training set and report its accuracy on a test set.

22.5 Create a test set of ten queries, and pose them to three major Web search engines. Evaluate each one for precision at 1, 3, and 10 documents. Can you explain the differences between engines?

22.6 Try to ascertain which of the search engines from the previous exercise are using case folding, stemming, synonyms, and spelling correction.

22.7 Write a regular expression or a short program to extract company names. Test it on a corpus of business news articles. Report your recall and precision.

22.8 Consider the problem of trying to evaluate the quality of an IR system that returns a ranked list of answers (like most Web search engines). The appropriate measure of quality depends on the presumed model of what the searcher is trying to achieve, and what strategy she employs. For each of the following models, propose a corresponding numeric measure.
a. The searcher will look at the first twenty answers returned, with the objective of getting as much relevant information as possible.
b. The searcher needs only one relevant document, and will go down the list until she finds the first one.
c. The searcher has a fairly narrow query and is able to examine all the answers retrieved. She wants to be sure that she has seen everything in the document collection that is relevant to her query. (E.g., a lawyer wants to be sure that she has found all relevant precedents, and is willing to spend considerable resources on that.)
d. The searcher needs just one document relevant to the query, and can afford to pay a research assistant for an hour's work looking through the results. The assistant can look through 100 retrieved documents in an hour. The assistant will charge the searcher for the full hour regardless of whether he finds it immediately or at the end of the hour.
e. The searcher will look through all the answers. Examining a document has cost $A; finding a relevant document has value $B; failing to find a relevant document has cost $C for each relevant document not found.
f. The searcher wants to collect as many relevant documents as possible, but needs steady encouragement. She looks through the documents in order. If the documents she has looked at so far are mostly good, she will continue; otherwise, she will stop.
23
NATURAL LANGUAGE FOR COMMUNICATION

In which we see how humans communicate with one another in natural language, and how computer agents might join in the conversation.

COMMUNICATION
SIGN

Communication is the intentional exchange of information brought about by the production and perception of signs drawn from a shared system of conventional signs. Most animals use signs to represent important messages: food here, predator nearby, approach, withdraw, let's mate. In a partially observable world, communication can help agents be successful because they can learn information that is observed or inferred by others. Humans are the most chatty of all species, and if computer agents are to be helpful, they'll need to learn to speak the language. In this chapter we look at language models for communication. Models aimed at deep understanding of a conversation necessarily need to be more complex than the simple models aimed at, say, spam classification. We start with grammatical models of the phrase structure of sentences, add semantics to the model, and then apply it to machine translation and speech recognition.
23.1
PHRASE STRUCTURE GRAMMARS
The n-gram language models of Chapter 22 were based on sequences of words. The big issue for these models is data sparsity: with a vocabulary of, say, 10^5 words, there are 10^15 trigram probabilities to estimate, and so a corpus of even a trillion words will not be able to supply reliable estimates for all of them. We can address the problem of sparsity through generalization. From the fact that "black dog" is more frequent than "dog black" and similar observations, we can form the generalization that adjectives tend to come before nouns in English (whereas they tend to follow nouns in French: "chien noir" is more frequent). Of course there are always exceptions; "galore" is an adjective that follows the noun it modifies. Despite the exceptions, the notion of a lexical category (also known as a part of speech) such as noun or adjective is a useful generalization, useful in its own right, but more so when we string together lexical categories to form syntactic categories such as noun phrase or verb phrase, and combine these syntactic categories into trees representing the phrase structure of sentences: nested phrases, each marked with a category.
LEXICAL CATEGORY
SYNTACTIC CATEGORIES
PHRASE STRUCTURE
GENERATIVE CAPACITY

Grammatical formalisms can be classified by their generative capacity: the set of languages they can represent. Chomsky (1957) describes four classes of grammatical formalisms that differ only in the form of the rewrite rules. The classes can be arranged in a hierarchy, where each class can be used to describe all the languages that can be described by a less powerful class, as well as some additional languages. Here we list the hierarchy, most powerful class first:
Recursively enumerable grammars use unrestricted rules: both sides of the rewrite rules can have any number of terminal and nonterminal symbols, as in the rule A B C → D E. These grammars are equivalent to Turing machines in their expressive power.
Context-sensitive grammars are restricted only in that the right-hand side must contain at least as many symbols as the left-hand side. The name "context-sensitive" comes from the fact that a rule such as A X B → A Y B says that an X can be rewritten as a Y in the context of a preceding A and a following B. Context-sensitive grammars can represent languages such as a^n b^n c^n (a sequence of n copies of a followed by the same number of bs and then cs).
In context-free grammars (or CFGs), the left-hand side consists of a single nonterminal symbol. Thus, each rule licenses rewriting the nonterminal as the right-hand side in any context. CFGs are popular for natural-language and programming-language grammars, although it is now widely accepted that at least some natural languages have constructions that are not context-free (Pullum, 1991). Context-free grammars can represent a^n b^n, but not a^n b^n c^n.
Regular grammars are the most restricted class. Every rule has a single nonterminal on the left-hand side and a terminal symbol optionally followed by a nonterminal on the right-hand side. Regular grammars are equivalent in power to finite-state machines. They are poorly suited for programming languages, because they cannot represent constructs such as balanced opening and closing parentheses (a variation of the a^n b^n language). The closest they can come is representing a*b*, a sequence of any number of as followed by any number of bs.
The grammars higher up in the hierarchy have more expressive power, but the algorithms for dealing with them are less efficient. Up to the 1980s, linguists focused on context-free and context-sensitive languages. Since then, there has been renewed interest in regular grammars, brought about by the need to process and learn from gigabytes or terabytes of online text very quickly, even at the cost of a less complete analysis. As Fernando Pereira put it, "The older I get, the further down the Chomsky hierarchy I go." To see what he means, compare Pereira and Warren (1980) with Mohri, Pereira, and Riley (2002) (and note that these three authors all now work on large text corpora at Google).
PROBABILISTIC CONTEXT-FREE GRAMMAR
GRAMMAR
LANGUAGE
NON-TERMINAL SYMBOLS
TERMINAL SYMBOL
There have been many competing language models based on the idea of phrase structure; we will describe a popular model called the probabilistic context-free grammar, or PCFG.1 A grammar is a collection of rules that defines a language as a set of allowable strings of words. "Context-free" is described in the sidebar on page 889, and "probabilistic" means that the grammar assigns a probability to every string. Here is a PCFG rule:
VP → Verb [0.70]
   | VP NP [0.30] .

Here VP (verb phrase) and NP (noun phrase) are non-terminal symbols. The grammar also refers to actual words, which are called terminal symbols. This rule is saying that with probability 0.70 a verb phrase consists solely of a verb, and with probability 0.30 it is a VP followed by an NP. Appendix B describes non-probabilistic context-free grammars.
We now define a grammar for a tiny fragment of English that is suitable for communication between agents exploring the wumpus world. We call this language E0. Later sections
improve on Eo to make it slightly closer to real English. We are unlikely ever to devise a complete grammar for English, if only because no two persons would agree entirely on what constitutes valid English.
23.1.1 The lexicon of E0

LEXICON
First we define the lexicon, or list of allowable words. The words are grouped into the lexical categories familiar to dictionary users: nouns, pronouns, and names to denote things; verbs to denote events; adjectives to modify nouns; adverbs to modify verbs; and function words: articles (such as the), prepositions (in), and conjunctions (and). Figure 23.1 shows a small lexicon for the language E0. Each of the categories ends in . . . to indicate that there are other words in the category. For nouns, names, verbs, adjectives, and adverbs, it is infeasible even in principle to list all the words. Not only are there tens of thousands of members in each class, but new ones, like iPod or biodiesel, are being added constantly. These five categories are called open classes. For the categories of pronoun, relative pronoun, article, preposition, and conjunction we could have listed all the words with a little more work. These are called closed classes; they have a small number of words (a dozen or so). Closed classes change over the course of centuries, not months. For example, "thee" and "thou" were commonly used pronouns in the 17th century, were on the decline in the 19th, and are seen today only in poetry and some regional dialects.

OPEN CLASS
CLOSED CLASS
23.1.2 The Grammar of E0

PARSE TREE

The next step is to combine the words into phrases. Figure 23.2 shows a grammar for E0, with rules for each of the six syntactic categories and an example for each rewrite rule.2 Figure 23.3 shows a parse tree for the sentence "Every wumpus smells." The parse tree

1 PCFGs are also known as stochastic context-free grammars, or SCFGs.
2 A relative clause follows and modifies a noun phrase. It consists of a relative pronoun (such as "who" or "that") followed by a verb phrase. An example of a relative clause is that stinks in "The wumpus that stinks is in 2 2." Another kind of relative clause has no relative pronoun, e.g., I know in "the man I know."
Noun      → stench [0.05] | breeze [0.10] | wumpus [0.15] | pits [0.05] | . . .
Verb      → is [0.10] | feel [0.10] | smells [0.10] | stinks [0.05] | . . .
Adjective → right [0.10] | dead [0.05] | smelly [0.02] | breezy [0.02] | . . .
Adverb    → here [0.05] | ahead [0.05] | nearby [0.02] | . . .
Pronoun   → me [0.10] | you [0.03] | I [0.10] | it [0.10] | . . .
RelPro    → that [0.40] | which [0.15] | who [0.20] | whom [0.02] | . . .
Name      → John [0.01] | Mary [0.01] | Boston [0.01] | . . .
Article   → the [0.40] | a [0.30] | an [0.10] | every [0.05] | . . .
Prep      → to [0.20] | in [0.10] | on [0.05] | near [0.10] | . . .
Conj      → and [0.50] | or [0.10] | but [0.20] | yet [0.02] | . . .
Digit     → 0 [0.20] | 1 [0.20] | 2 [0.20] | 3 [0.20] | 4 [0.20] | . . .

Figure 23.1  The lexicon for E0. RelPro is short for relative pronoun, Prep for preposition, Conj for conjunction. The sum of the probabilities for each category is 1.
E0:
S         → NP VP                 [0.90]   I + feel a breeze
          | S Conj S              [0.10]   I feel a breeze + and + It stinks
NP        → Pronoun               [0.30]   I
          | Name                  [0.10]   John
          | Noun                  [0.10]   pits
          | Article Noun          [0.25]   the + wumpus
          | Article Adjs Noun     [0.05]   the + smelly dead + wumpus
          | Digit Digit           [0.05]   3 4
          | NP PP                 [0.10]   the wumpus + in 1 3
          | NP RelClause          [0.05]   the wumpus + that is smelly
VP        → Verb                  [0.40]   stinks
          | VP NP                 [0.35]   feel + a breeze
          | VP Adjective          [0.05]   smells + dead
          | VP PP                 [0.10]   is + in 1 3
          | VP Adverb             [0.10]   go + ahead
Adjs      → Adjective             [0.80]   smelly
          | Adjective Adjs        [0.20]   smelly + dead
PP        → Prep NP               [1.00]   to + the east
RelClause → RelPro VP             [1.00]   that + is smelly

Figure 23.2  The grammar for E0, with example phrases for each rule. The syntactic categories are sentence (S), noun phrase (NP), verb phrase (VP), list of adjectives (Adjs), prepositional phrase (PP), and relative clause (RelClause).
Figure 23.3  Parse tree for the sentence "Every wumpus smells" according to the grammar E0. Each interior node of the tree is labeled with its probability. The probability of the tree as a whole is 0.9 × 0.25 × 0.05 × 0.15 × 0.40 × 0.10 = 0.0000675. Since this tree is the only parse of the sentence, that number is also the probability of the sentence. The tree can also be written in linear form as [S [NP [Article every] [Noun wumpus]] [VP [Verb smells]]].
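The probability computation in the caption can be made concrete with a small sketch (not from the text): represent each rule and lexical entry of E0 with its probability and multiply over the tree. The dictionary-based encoding below is an illustrative choice.

```python
# Rule and lexical-entry probabilities taken from Figures 23.1 and 23.2 (only the
# entries needed for this one tree are included).
rules = {
    ("S", ("NP", "VP")): 0.90,
    ("NP", ("Article", "Noun")): 0.25,
    ("VP", ("Verb",)): 0.40,
}
lexicon = {
    ("Article", "every"): 0.05,
    ("Noun", "wumpus"): 0.15,
    ("Verb", "smells"): 0.10,
}

def tree_probability(tree):
    """tree is (category, children), where children is a list of subtrees or a word."""
    category, children = tree
    if isinstance(children, str):                      # lexical rule: category -> word
        return lexicon[(category, children)]
    p = rules[(category, tuple(child[0] for child in children))]
    for child in children:
        p *= tree_probability(child)
    return p

every_wumpus_smells = ("S", [("NP", [("Article", "every"), ("Noun", "wumpus")]),
                             ("VP", [("Verb", "smells")])])
print(tree_probability(every_wumpus_smells))           # 0.0000675
```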
The parse tree gives a constructive proof that the string of words is indeed a sentence according to the rules of ε0. The ε0 grammar generates a wide range of English sentences such as the following:

   John is in the pit
   The wumpus that stinks is in 2 2
   Mary is in Boston and the wumpus is near 3 2

Unfortunately, the grammar overgenerates: that is, it generates sentences that are not grammatical, such as "Me go Boston" and "I smell pits wumpus John." It also undergenerates: there are many sentences of English that it rejects, such as "I think the wumpus is smelly." We will see how to learn a better grammar later; for now we concentrate on what we can do with the grammar we have.

23.2   SYNTACTIC ANALYSIS (PARSING)

Parsing is the process of analyzing a string of words to uncover its phrase structure, according to the rules of a grammar. Figure 23.4 shows that we can start with the S symbol and search top down for a tree that has the words as its leaves, or we can start with the words and search bottom up for a tree that culminates in an S. Both top-down and bottom-up parsing can be inefficient, however, because they can end up repeating effort in areas of the search space that lead to dead ends. Consider the following two sentences:

   Have the students in section 2 of Computer Science 101 take the exam.
   Have the students in section 2 of Computer Science 101 taken the exam?

Even though they share the first 10 words, these sentences have very different parses, because the first is a command and the second is a question. A left-to-right parsing algorithm would have to guess whether the first word is part of a command or a question and will not be able to tell if the guess is correct until at least the eleventh word, take or taken. If the algorithm guesses wrong, it will have to backtrack all the way to the first word and reanalyze the whole sentence under the other interpretation.
List of items                          Rule
S
NP VP                                  S → NP VP
NP VP Adjective                        VP → VP Adjective
NP Verb Adjective                      VP → Verb
NP Verb dead                           Adjective → dead
NP is dead                             Verb → is
Article Noun is dead                   NP → Article Noun
Article wumpus is dead                 Noun → wumpus
the wumpus is dead                     Article → the

Figure 23.4   Trace of the process of finding a parse for the string "The wumpus is dead" as a sentence, according to the grammar ε0. Viewed as a top-down parse, we start with the list of items being S and, on each step, match an item X with a rule of the form (X → ...) and replace X in the list of items with (...). Viewed as a bottom-up parse, we start with the list of items being the words of the sentence, and, on each step, match a string of tokens (...) in the list against a rule of the form (X → ...) and replace (...) with X.
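To make the item-rewriting view concrete, here is a minimal Python sketch (not the book's code) of the bottom-up reading of Figure 23.4: it tries every way of replacing a matching right-hand side with its left-hand-side category until only S remains. The rule set is a hand-picked fragment of ε0, used here only for illustration.

# Item-list rewriting in the style of Figure 23.4 (a toy illustration).
# Each rule maps a right-hand side (a tuple of items) to its left-hand-side category.
RULES = {
    ("the",): "Article", ("wumpus",): "Noun", ("is",): "Verb", ("dead",): "Adjective",
    ("Verb",): "VP", ("VP", "Adjective"): "VP",
    ("Article", "Noun"): "NP", ("NP", "VP"): "S",
}

def parses_as_sentence(items):
    """Bottom-up: try every way of replacing a matching (...) with X until only S remains."""
    if items == ["S"]:
        return True
    for i in range(len(items)):
        for length in (1, 2):
            rhs = tuple(items[i:i + length])
            if rhs in RULES and parses_as_sentence(
                    items[:i] + [RULES[rhs]] + items[i + length:]):
                return True
    return False   # note the repeated re-exploration; this is what chart parsing avoids

print(parses_as_sentence("the wumpus is dead".split()))   # True

The blind search over reduction orders is exactly the wasted effort discussed next; dynamic programming stores intermediate results instead of recomputing them.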
To avoid this source of inefficiency we can use dynamic programming: every time we analyze a substring, store the results so we won't have to reanalyze it later. For example, once we discover that "the students in section 2 of Computer Science 101" is an NP, we can record that result in a data structure known as a chart. Algorithms that do this are called chart parsers. Because we are dealing with context-free grammars, any phrase that was found in the context of one branch of the search space can work just as well in any other branch of the search space. There are many types of chart parsers; we describe a bottom-up version called the CYK algorithm, after its inventors, John Cocke, Daniel Younger, and Tadao Kasami. The CYK algorithm is shown in Figure 23.5. Note that it requires a grammar with all rules in one of two very specific formats: lexical rules of the form X → word, and syntactic rules of the form X → Y Z. This grammar format, called Chomsky Normal Form, may seem restrictive, but it is not: any context-free grammar can be automatically transformed into Chomsky Normal Form. Exercise 23.8 leads you through the process.

The CYK algorithm uses space of O(n²m) for the P table, where n is the number of words in the sentence and m is the number of nonterminal symbols in the grammar, and takes time O(n³m). (Since m is constant for a particular grammar, this is commonly described as O(n³).) No algorithm can do better for general context-free grammars, although there are faster algorithms on more restricted grammars. In fact, it is quite a trick for the algorithm to complete in O(n³) time, given that it is possible for a sentence to have an exponential number of parse trees. Consider the sentence

   Fall leaves fall and spring leaves spring.

It is ambiguous because each word (except "and") can be either a noun or a verb, and "fall" and "spring" can be adjectives as well. (For example, one meaning of "Fall leaves fall" is
function CYK-PARSE(words, grammar) returns P, a table of probabilities

   N ← LENGTH(words)
   M ← the number of nonterminal symbols in grammar
   P ← an array of size [M, N, N], initially all 0
   /* Insert lexical rules for each word */
   for i = 1 to N do
      for each rule of form (X → wordsᵢ [p]) do
         P[X, i, 1] ← p
   /* Combine first and second parts of right-hand sides of rules, from short to long */
   for length = 2 to N do
      for start = 1 to N − length + 1 do
         for len1 = 1 to length − 1 do
            len2 ← length − len1
            for each rule of the form (X → Y Z [p]) do
               P[X, start, length] ← MAX(P[X, start, length],
                                         P[Y, start, len1] × P[Z, start + len1, len2] × p)
   return P

Figure 23.5   The CYK algorithm for parsing. Given a sequence of words, it finds the most probable derivation for the whole sequence and for each subsequence. It returns the whole table, P, in which an entry P[X, start, len] is the probability of the most probable X of length len starting at position start. If there is no X of that size at that location, the probability is 0.
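As a cross-check on the pseudocode, here is a small Python rendering of the same dynamic program (an illustrative sketch, not the book's implementation). The grammar is assumed to be given as dictionaries of Chomsky Normal Form rules with probabilities, and indices are 0-based rather than the 1-based indices of the figure; the toy rules and probabilities below are loosely inspired by ε0.

from collections import defaultdict

def cyk_parse(words, lexical, syntactic):
    """lexical:   {(X, word): p}  for rules X -> word [p]
       syntactic: {(X, Y, Z): p}  for rules X -> Y Z  [p]
       Returns P with P[(X, start, length)] = probability of the best X
       spanning `length` words beginning at position `start` (0-based)."""
    n = len(words)
    P = defaultdict(float)
    for i, w in enumerate(words):                     # insert lexical rules
        for (X, word), p in lexical.items():
            if word == w:
                P[(X, i, 1)] = max(P[(X, i, 1)], p)
    for length in range(2, n + 1):                    # combine shorter spans
        for start in range(n - length + 1):
            for len1 in range(1, length):
                len2 = length - len1
                for (X, Y, Z), p in syntactic.items():
                    q = P[(Y, start, len1)] * P[(Z, start + len1, len2)] * p
                    if q > P[(X, start, length)]:
                        P[(X, start, length)] = q
    return P

# Toy CNF grammar (assumed for illustration): S -> NP VP, NP -> Article Noun, plus lexical rules.
lex = {("Article", "the"): 0.4, ("Noun", "wumpus"): 0.15, ("VP", "stinks"): 0.05}
syn = {("S", "NP", "VP"): 0.9, ("NP", "Article", "Noun"): 0.25}
P = cyk_parse("the wumpus stinks".split(), lex, syn)
print(P[("S", 0, 3)])   # probability of the best S covering all three words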
equivalent to "Autumn abandons autumn.) With £o the sentence has four parses:
[S [S [NP Fall leaves] fall] and [S [NP spring leaves] spring] [S [S [NP Fall leaves] fall] and [S spring [ VP leaves spring]] [S [S Fall [ VP leaves fall]] and [S [NP spring leaves] spring] [S [S Fall [ VP leaves fall]] and [S spring [ VP leaves spring]]
.
If we had c two-ways-ambiguous conjoined subsentences, we would have 2c ways of choos 3 ing parses for the subsentences. How does the CYK algorithm process these 2c parse trees in
O(c3)
time? The answer is that it doesn't examine all the parse trees; all it has to do is
compute the probability of the most probable tree. The subtrees are all represented in the
P
table, and with a little work we could enumerate them all (in exponential time), but the beauty of the CYK algorithm is that we don't have to emunerate them unless we want to. In practice we are usually not interested in all parses; just the best one or best few. Think of the CYK algoritlun as defining the complete state space defined by the "apply grammar rule" operator. It is possible to search just part of this space using
A* search. Each state
in this space is a list of items (words or categories), as shown in the bottom-up parse table (Figure
23.4). The start state is a list of words, and a goal state is the single item S. The
There also would be O(c!) ambiguity in the way the components conjoin-for example, (X and (Y nod Z)) versus ((X and Y) and Z). But that is another story, one told well by Church and Pntil (1982).
3
[ [S [NP-SBJ-2 Her eyes]
     [VP were
         [VP glazed [NP *-2]
             [SBAR-ADV as if
                 [S [NP-SBJ she]
                    [VP did n't
                        [VP [VP hear [NP *-1]]
                            or
                            [VP [ADVP even] see [NP *-1]]
                            [NP-1 him]]]]]]]]
  .]

Figure 23.6   Annotated tree for the sentence "Her eyes were glazed as if she didn't hear or even see him." from the Penn Treebank. Note that in this grammar there is a distinction between an object noun phrase (NP) and a subject noun phrase (NP-SBJ). Note also a grammatical phenomenon we have not covered yet: the movement of a phrase from one part of the tree to another. This tree analyzes the phrase "hear or even see him" as consisting of two constituent VPs, [VP hear [NP *-1]] and [VP [ADVP even] see [NP *-1]], both of which have a missing object, denoted *-1, which refers to the NP labeled elsewhere in the tree as [NP-1 him].
The cost of a state is the inverse of its probability as defined by the rules applied so far, and there are various heuristics to estimate the remaining distance to the goal; the best heuristics come from machine learning applied to a corpus of sentences. With the A* algorithm we don't have to search the entire state space, and we are guaranteed that the first parse found will be the most probable.

23.2.1   Learning probabilities for PCFGs
A PCFG has many rules, with a probability for each rule. This suggests that learning the grammar from data might be better than a knowledge engineering approach. Learning is easiest if we are given a corpus of correctly parsed sentences, commonly called a treebank. The Penn Treebank (Marcus et al., 1993) is the best known; it consists of 3 million words which have been annotated with part of speech and parse-tree structure, using human labor assisted by some automated tools. Figure 23.6 shows an annotated tree from the Penn Treebank.

Given a corpus of trees, we can create a PCFG just by counting (and smoothing). In the example above, there are two nodes of the form [S [NP ...] [VP ...]]. We would count these, and all the other subtrees with root S in the corpus. If there are 100,000 S nodes of which 60,000 are of this form, then we create the rule

   S → NP VP [0.60] .

What if a treebank is not available, but we have a corpus of raw unlabeled sentences? It is still possible to learn a grammar from such a corpus, but it is more difficult. First of all, we actually have two problems: learning the structure of the grammar rules and learning the
probabilities associated with each rule. (We have the same distinction in learning Bayes nets.) We'll assume that we're given the lexical and syntactic category names. (If not, we can just assume categories X₁, ..., Xₙ and use cross-validation to pick the best value of n.) We can then assume that the grammar includes every possible (X → Y Z) or (X → word) rule, although many of these rules will have probability 0 or close to 0.

We can then use an expectation-maximization (EM) approach, just as we did in learning HMMs. The parameters we are trying to learn are the rule probabilities; we start them off at random or uniform values. The hidden variables are the parse trees: we don't know whether a string of words wᵢ ... wⱼ is or is not generated by a rule (X → ...). The E step estimates the probability that each subsequence is generated by each rule. The M step then estimates the probability of each rule. The whole computation can be done in a dynamic-programming fashion with an algorithm called the inside-outside algorithm, in analogy to the forward-backward algorithm for HMMs.

The inside-outside algorithm seems magical in that it induces a grammar from unparsed text. But it has several drawbacks. First, the parses that are assigned by the induced grammars are often difficult to understand and unsatisfying to linguists. This makes it hard to combine handcrafted knowledge with automated induction. Second, it is slow: O(n³m³), where n is the number of words in a sentence and m is the number of grammar categories. Third, the space of probability assignments is very large, and empirically it seems that getting stuck in local maxima is a severe problem. Alternatives such as simulated annealing can get closer to the global maximum, at a cost of even more computation. Lari and Young (1990) conclude that inside-outside is "computationally intractable for realistic problems."

However, progress can be made if we are willing to step outside the bounds of learning solely from unparsed text. One approach is to learn from prototypes: to seed the process with a dozen or two rules, similar to the rules in ε1. From there, more complex rules can be learned more easily, and the resulting grammar parses English with an overall recall and precision for sentences of about 80% (Haghighi and Klein, 2006). Another approach is to use treebanks, but in addition to learning PCFG rules directly from the bracketings, also learning distinctions that are not in the treebank. For example, note that the tree in Figure 23.6 makes the distinction between NP and NP-SBJ. The latter is used for the pronoun "she," the former for the pronoun "her." We will explore this issue in Section 23.6; for now let us just say that there are many ways in which it would be useful to split a category like NP. Grammar induction systems that use treebanks but automatically split categories do better than those that stick with the original category set (Petrov and Klein, 2007c). The error rates for automatically learned grammars are still about 50% higher than for hand-constructed grammars, but the gap is decreasing.
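The counting step for the treebank case is easy to sketch. The following Python fragment is an illustration only (the nested-tuple tree format is an assumption, and it omits the smoothing a real system would add): it tallies each production and divides by the count of its left-hand side, giving maximum-likelihood rule probabilities.

from collections import Counter

def count_productions(tree, rule_counts, lhs_counts):
    """tree is (Category, child1, child2, ...), with plain strings as words."""
    if isinstance(tree, str):
        return
    lhs, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_counts[(lhs, rhs)] += 1
    lhs_counts[lhs] += 1
    for c in children:
        count_productions(c, rule_counts, lhs_counts)

def pcfg_from_treebank(trees):
    rule_counts, lhs_counts = Counter(), Counter()
    for t in trees:
        count_productions(t, rule_counts, lhs_counts)
    # Maximum-likelihood estimates; a real system would also smooth these counts.
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

toy_treebank = [
    ("S", ("NP", ("Name", "John")), ("VP", ("Verb", "stinks"))),
    ("S", ("NP", ("Pronoun", "I")), ("VP", ("Verb", "smell"), ("NP", ("Noun", "pits")))),
]
for rule, p in pcfg_from_treebank(toy_treebank).items():
    print(rule, round(p, 2))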
23.2.2   Comparing context-free and Markov models
The problem with PCFGs is that they are context-free. That means that the difference between P("eat a banana") and P("eat a bandanna") depends only on P(Noun → "banana") versus P(Noun → "bandanna") and not on the relation between "eat" and the respective objects. A Markov model of order two or more, given a sufficiently large corpus, will know that "eat
a banana" is more probable. We can combine a PCFG and Markov model to get the best of both. The simplest approach is to estimate the probability of a sentence with the geometric mean ofthe probabilities computed by both models. Then we would know that "eat a banana" is probable from both the grammatical and lexical point of view. But it still wouldn't pick up the relation between "eat" and "banana" in "eat a sljghtly aging but still palatable banana" because here the relation is more than two words away. Increasing the order of the Markov model won't get at the relation precisely; to do that we
can
use a
Iexicalized
PCFG, as
described in the next section. Another problem with PCFGs is that they tend to have too strong a preference for shorter sentences. is about
In a corpus such as the
Wall Street Journal,
the average length of a sentence
25 words. But a PCFG will usually assign fairly high probability to many shott
sentences, such as "He slept," whereas in the Journal we're more likely to see something like "It has been reported by a reliable source that the allegation that he slept is credible." It seems that the phrases in the Journal really are not context-free; instead the writers have an idea of the expected sentence length and use that length as a soft global constraint on their sentences.
This is hard to reflect n i a PCFG.
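The model-combination idea mentioned above is easy to sketch (purely illustrative; the probabilities below are made-up placeholders standing in for a PCFG parser's score and an n-gram model's score for the same sentence):

import math

def combined_score(p_pcfg, p_ngram):
    """Geometric mean of the two models' sentence probabilities."""
    return math.sqrt(p_pcfg * p_ngram)

# "eat a banana" vs. "eat a bandanna": suppose the PCFG scores them identically,
# but an n-gram model strongly prefers the first.
print(combined_score(1e-6, 2e-7))   # eat a banana
print(combined_score(1e-6, 1e-9))   # eat a bandanna: much lower combined score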
23.3   AUGMENTED GRAMMARS AND SEMANTIC INTERPRETATION

In this section we see how to extend context-free grammars to say that, for example, not every NP is independent of context, but rather, certain NPs are more likely to appear in one context, and others in another context.
23.3.1   Lexicalized PCFGs
To get at the relationship between the verb "eat" and the nouns "banana" versus "bandanna," we can use a lexicalized PCFG, in which the probabilities for a rule depend on the relationship between words in the parse tree, not just on the adjacency of words in a sentence. Of course, we can't have the probability depend on every word in the tree, because we won't have enough training data to estimate all those probabilities. It is useful to introduce the notion of the head of a phrase, the most important word. Thus, "eat" is the head of the VP "eat a banana" and "banana" is the head of the NP "a banana." We use the notation VP(v) to denote a phrase with category VP whose head word is v. We say that the category VP is augmented with the head variable v. Here is an augmented grammar that describes the verb-object relation:

   VP(v) → Verb(v) NP(n)                [P1(v, n)]
   VP(v) → Verb(v)                      [P2(v)]
   NP(n) → Article(a) Adjs(j) Noun(n)   [P3(n, a)]
   Noun(banana) → banana                [pn]

Here the probability P1(v, n) depends on the head words v and n. We would set this probability to be relatively high when v is "eat" and n is "banana," and low when n is "bandanna."
Note that since we are considering only heads, the distinction between "eat a banana" and "eat a rancid banana" will not be caught by these probabilities. Another issue with this approach is that, in a vocabulary with, say, 20,000 nouns and 5,000 verbs, P1 needs 100 million probability estimates. Only a few percent of these can come from a corpus; the rest will have to come from smoothing (see Section 22.1.2). For example, we can estimate P1(v, n) for a (v, n) pair that we have not seen often (or at all) by backing off to a model that depends only on v. These objectless probabilities are still very useful; they can capture the distinction between a transitive verb like "eat" (which will have a high value for P1 and a low value for P2) and an intransitive verb like "sleep," which will have the reverse. It is quite feasible to learn these probabilities from a treebank.
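One way to realize that backoff is sketched below. This is an illustration under stated assumptions, not the book's estimator: the corpus format, the count threshold, and the choice to spread the verb-only estimate over a smoothed noun unigram are all assumptions made for the example.

from collections import Counter

class VerbObjectModel:
    """Illustrative estimator for P1(v, n) with backoff to a verb-only model."""
    def __init__(self, pairs, min_count=5):
        # pairs: observed (verb, object) uses, with object None for intransitive uses.
        self.pair_counts = Counter(pairs)
        self.verb_counts = Counter(v for v, _ in pairs)
        self.noun_counts = Counter(n for _, n in pairs if n is not None)
        self.total_objects = sum(self.noun_counts.values())
        self.min_count = min_count

    def p_transitive(self, v):
        """Verb-only model: how often v takes any object at all."""
        with_obj = sum(c for (verb, n), c in self.pair_counts.items()
                       if verb == v and n is not None)
        return with_obj / self.verb_counts[v]

    def p_noun(self, n):
        """Add-one-smoothed unigram over objects, so unseen nouns keep a small probability."""
        vocab = len(self.noun_counts) + 1
        return (self.noun_counts[n] + 1) / (self.total_objects + vocab)

    def p1(self, v, n):
        if self.pair_counts[(v, n)] >= self.min_count:
            return self.pair_counts[(v, n)] / self.verb_counts[v]
        # Backoff: keep only the verb-dependent part and spread it over nouns.
        return self.p_transitive(v) * self.p_noun(n)

corpus = ([("eat", "banana")] * 60 + [("eat", "apple")] * 30 + [("eat", None)] * 10 +
          [("sleep", None)] * 95 + [("sleep", "hours")] * 5)
m = VerbObjectModel(corpus)
print(m.p1("eat", "banana"))     # high: the pair was seen often
print(m.p1("eat", "bandanna"))   # small: unseen pair, backed-off estimate
print(m.p1("sleep", "banana"))   # backed-off estimate scaled by how rarely "sleep" takes an object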
23.3.2   Formal definition of augmented grammar rules
Augmented rules are complicated, so we will give them a formal definition by showing how an augmented rule can be translated into a logical sentence. The sentence will have the form of a definite clause (see page 256), so the result is called a definite clause grammar, or DCG. We'll use as an example a version of a rule from the lexicalized grammar for NP with one new piece of notation:

   NP(n) → Article(a) Adjs(j) Noun(n) {Compatible(j, n)} .

The new aspect here is the notation {constraint} to denote a logical constraint on some of the variables; the rule only holds when the constraint is true. Here the predicate Compatible(j, n) is meant to test whether adjective j and noun n are compatible; it would be defined by a series of assertions such as Compatible(black, dog). We can convert this grammar rule into a definite clause by (1) reversing the order of right- and left-hand sides, (2) making a conjunction of all the constituents and constraints, (3) adding a variable sᵢ to the list of arguments for each constituent to represent the sequence of words spanned by the constituent, and (4) adding a term for the concatenation of words, Append(s₁, ...), to the list of arguments for the root of the tree. That gives us

   Article(a, s₁) ∧ Adjs(j, s₂) ∧ Noun(n, s₃) ∧ Compatible(j, n)
      ⇒ NP(n, Append(s₁, s₂, s₃)) .

This definite clause says that if the predicate Article is true of a head word a and a string s₁, and Adjs is similarly true of a head word j and a string s₂, and Noun is true of a head word n and a string s₃, and if j and n are compatible, then the predicate NP is true of the head word n and the result of appending strings s₁, s₂, and s₃.

The DCG translation left out the probabilities, but we could put them back in: just augment each constituent with one more variable representing the probability of the constituent, and augment the root with a variable that is the product of the constituent probabilities times the rule probability.

The translation from grammar rule to definite clause allows us to talk about parsing as logical inference. This makes it possible to reason about languages and strings in many different ways. For example, it means we can do bottom-up parsing using forward chaining or top-down parsing using backward chaining. In fact, parsing natural language with DCGs was
one of the first applications of (and motivations for) the Prolog logic programming language. It is sometimes possible to run the process backward and do language generation as well as parsing. For example, skipping ahead to Figure 23.10 (page 903), a logic program could be given the semantic form Loves(John, Mary) and apply the definite-clause rules to deduce

   S(Loves(John, Mary), [John, loves, Mary]) .

This works for toy examples, but serious language-generation systems need more control over the process than is afforded by the DCG rules alone.
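To see the "parsing as logical inference" reading concretely, here is a toy Python rendering of the definite clause above (an illustration under simplifying assumptions, not a real DCG system): facts pair a head word with the string it spans, and one forward-chaining step derives an NP fact by conjoining compatible constituents and appending their strings.

# Facts: category -> set of (head, words) pairs; Compatible is a set of (adjective, noun) pairs.
facts = {
    "Article": {("the", ("the",))},
    "Adjs":    {("black", ("black",))},
    "Noun":    {("dog", ("dog",))},
}
compatible = {("black", "dog")}

def forward_chain_np(facts):
    """One forward-chaining step for:
       Article(a,s1) & Adjs(j,s2) & Noun(n,s3) & Compatible(j,n) => NP(n, Append(s1,s2,s3))."""
    derived = set()
    for a, s1 in facts["Article"]:
        for j, s2 in facts["Adjs"]:
            for n, s3 in facts["Noun"]:
                if (j, n) in compatible:
                    derived.add((n, s1 + s2 + s3))
    return derived

print(forward_chain_np(facts))   # {('dog', ('the', 'black', 'dog'))}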
ε1:
   S         →  NPs VP | ...
   NPs       →  Pronouns | Name | Noun | ...
   NPo       →  Pronouno | Name | Noun | ...
   VP        →  VP NPo | ...
   PP        →  Prep NPo | ...
   Pronouns  →  I | you | he | she | it | ...
   Pronouno  →  me | you | him | her | it | ...

ε2:
   S(head)           →  NP(Sbj, pn, h) VP(pn, head) | ...
   NP(c, pn, head)   →  Pronoun(c, pn, head) | Noun(c, pn, head) | ...
   VP(pn, head)      →  VP(pn, head) NP(Obj, p, h) | ...
   PP(head)          →  Prep(head) NP(Obj, pn, h)
   Pronoun(Sbj, 1S, I)     →  I
   Pronoun(Sbj, 1P, we)    →  we
   Pronoun(Obj, 1S, me)    →  me
   Pronoun(Obj, 3P, them)  →  them

Figure 23.7   Top: part of a grammar for the language ε1, which handles subjective and objective cases in noun phrases and thus does not overgenerate quite as badly as ε0. The portions that are identical to ε0 have been omitted. Bottom: part of an augmented grammar for ε2, with three augmentations: case agreement, subject-verb agreement, and head word. Sbj, Obj, 1S, 1P, and 3P are constants, and lowercase names are variables.
23.3.3   Case agreement and subject-verb agreement
We saw in Section 23.1 that the simple grammar for ε0 overgenerates, producing nonsentences such as "Me smell a stench." To avoid this problem, our grammar would have to know that "me" is not a valid NP when it is the subject of a sentence. Linguists say that the pronoun "I" is in the subjective case, and "me" is in the objective case.⁴ We can account for this by

⁴ The subjective case is also sometimes called the nominative case and the objective case is sometimes called the accusative case. Many languages also have a dative case for words in the indirect object position.
splitting NP into two categories, NPs and NPo, to stand for noun phrases in the subjective and objective case, respectively. We would also need to split the category Pronoun into the two categories Pronouns (which includes "I") and Pronouno (which includes "me"). The top part of Figure 23.7 shows the grammar for case agreement; we call the resulting language ε1. Notice that all the NP rules must be duplicated, once for NPs and once for NPo.

Unfortunately, ε1 still overgenerates. English requires subject-verb agreement for person and number of the subject and main verb of a sentence. For example, if "I" is the subject, then "I smell" is grammatical, but "I smells" is not. If "it" is the subject, we get the reverse. In English, the agreement distinctions are minimal: most verbs have one form for third-person singular subjects (he, she, or it), and a second form for all other combinations of person and number. There is one exception: the verb "to be" has three forms, "I am / you are / he is." So one distinction (case) splits NP two ways, another distinction (person and number) splits NP three ways, and as we uncover other distinctions we would end up with an exponential number of subscripted NP forms if we took the approach of ε1. Augmentations are a better approach: they can represent an exponential number of forms as a single rule.

In the bottom of Figure 23.7 we see (part of) an augmented grammar for the language ε2, which handles case agreement, subject-verb agreement, and head words. We have just one NP category, but NP(c, pn, head) has three augmentations: c is a parameter for case, pn is a parameter for person and number, and head is a parameter for the head word of the phrase. The other categories also are augmented with heads and other arguments. Let's consider one rule in detail:

   S(head) → NP(Sbj, pn, h) VP(pn, head) .

This rule is easiest to understand right-to-left: when an NP and a VP are conjoined they form an S, but only if the NP has the subjective (Sbj) case and the person and number (pn) of the NP and VP are identical. If that holds, then we have an S whose head is the same as the head of the VP. Note the head of the NP, denoted by the dummy variable h, is not part of the augmentation of the S. The lexical rules for ε2 fill in the values of the parameters and are also best read right-to-left. For example, the rule

   Pronoun(Sbj, 1S, I) → I

says that "I" can be interpreted as a Pronoun in the subjective case, first-person singular, with head "I." For simplicity we have omitted the probabilities for these rules, but augmentation does work with probabilities. Augmentation can also work with automated learning mechanisms. Petrov and Klein (2007c) show how a learning algorithm can automatically split the NP category into NPs and NPo.
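The way the augmentations act as constraints can be sketched in a few lines of Python (an illustration only; the tiny lexicon, the collapsing of NP to a single pronoun and VP to a single verb, and the treatment of "smell" as simply 1S are all simplifying assumptions):

# Toy lexicon of augmented categories: word -> (category, case, person-number, head).
LEXICON = {
    "I":      ("Pronoun", "Sbj", "1S", "I"),
    "me":     ("Pronoun", "Obj", "1S", "me"),
    "smell":  ("Verb", None, "1S", "smell"),    # non-3S verb form, simplified to 1S here
    "smells": ("Verb", None, "3S", "smell"),
}

def sentence_head(np_word, vp_word):
    """Apply S(head) -> NP(Sbj, pn, h) VP(pn, head); return the head, or None if it fails."""
    np_cat, np_case, np_pn, _np_head = LEXICON[np_word]
    vp_cat, _, vp_pn, vp_head = LEXICON[vp_word]
    if np_cat == "Pronoun" and vp_cat == "Verb" and np_case == "Sbj" and np_pn == vp_pn:
        return vp_head
    return None

print(sentence_head("I", "smell"))    # 'smell'  (accepted)
print(sentence_head("me", "smell"))   # None     (case agreement fails)
print(sentence_head("I", "smells"))   # None     (subject-verb agreement fails)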
23.3.4   Semantic interpretation
To show how to add semantics to a grammar, we start with an example that is simpler than English: the semantics of arithmetic expressions. Figure 23.8 shows a grammar for arithmetic expressions, where each rule is augmented with a variable indicating the semantic interpretation of the phrase. The semantics of a digit such as "3" is the digit itself. The semantics of an expression such as "3 + 4" is the operator "+" applied to the semantics of the phrase "3" and
Exp(x)      →  Exp(x₁) Operator(op) Exp(x₂)   {x = Apply(op, x₁, x₂)}
Exp(x)      →  ( Exp(x) )
Exp(x)      →  Number(x)
Number(x)   →  Digit(x)
Number(x)   →  Number(x₁) Digit(x₂)           {x = 10 × x₁ + x₂}
Digit(x)    →  x                               {0 ≤ x ≤ 9}
Operator(x) →  x                               {x ∈ {+, −, ÷, ×}}

Figure 23.8   A grammar for arithmetic expressions, augmented with semantics. Each variable xᵢ represents the semantics of a constituent. Note the use of the {test} notation to define logical predicates that must be satisfied, but that are not constituents.
[Parse tree for "3 + (4 ÷ 2)": the root Exp(5) expands to Exp(3), Operator(+), and Exp(2); the second Exp(2) is the parenthesized expression ( Exp(2) ), which in turn expands to Exp(4), Operator(÷), and Exp(2), each Exp bottoming out in a Number and a Digit.]

Figure 23.9   Parse tree with semantic interpretations for the string "3 + (4 ÷ 2)".
the phrase "4." The ndes obey the principle of compositional semantics-the semantics of a phrase is a function ofthe semantics ofthe subphrases. Figw·e
3 + ( 4 -;- 2)
23.9 shows the parse tree for
according to this grammar. The root of the parse tree is
whose semantic interpretation is
5.
Exp(5),
an expression
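The compositional principle is easy to see in code. The following sketch (an illustration; the nested-tuple tree format is an assumption) evaluates a parse tree like the one in Figure 23.9 bottom-up, computing each node's semantics from its children's semantics.

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}

def semantics(tree):
    """tree is an integer (a Digit), an operator string, or a tuple:
       ('Exp', sub) for unary expansions, ('Exp', left, op, right) for binary ones."""
    if isinstance(tree, int):
        return tree
    if isinstance(tree, str):
        return OPS[tree]
    if len(tree) == 2:                       # Exp -> ( Exp ), Exp -> Number, etc.
        return semantics(tree[1])
    _, left, op, right = tree                # Exp -> Exp Operator Exp
    return semantics(op)(semantics(left), semantics(right))

# "3 + (4 / 2)" as a parse tree; its semantics is built from the subphrases' semantics.
tree = ("Exp", ("Exp", 3), "+", ("Exp", ("Exp", 4), "/", ("Exp", 2)))
print(semantics(tree))   # 5.0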
Now let's move on to the semantics of English, or at least of ε0. We start by determining what semantic representations we want to associate with what phrases. We use the simple example sentence "John loves Mary." The NP "John" should have as its semantic interpretation the logical term John, and the sentence as a whole should have as its interpretation the logical sentence Loves(John, Mary). That much seems clear. The complicated part is the VP "loves Mary." The semantic interpretation of this phrase is neither a logical term nor a complete logical sentence. Intuitively, "loves Mary" is a description that might or might not
apply to a particular person. (In this case, it applies to John.) This means that "loves Mary" is a predicate that, when combined with a term that represents a person (the person doing the loving), yields a complete logical sentence. Using the λ-notation (see page 294), we can represent "loves Mary" as the predicate

   λx Loves(x, Mary) .

Now we need a rule that says "an NP with semantics obj followed by a VP with semantics pred yields a sentence whose semantics is the result of applying pred to obj":

   S(pred(obj)) → NP(obj) VP(pred) .

The rule tells us that the semantic interpretation of "John loves Mary" is

   (λx Loves(x, Mary))(John) ,

which is equivalent to Loves(John, Mary). The rest of the semantics follows in a straightforward way from the choices we have made so far. Because VPs are represented as predicates, it is a good idea to be consistent and represent verbs as predicates as well. The verb "loves" is represented as λy λx Loves(x, y), the predicate that, when given the argument Mary, returns the predicate λx Loves(x, Mary). We end up with the grammar shown in Figure 23.10 and the parse tree shown in Figure 23.11.
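The same composition can be played out directly with Python lambdas (a sketch; representing the logical sentence Loves(x, y) as a plain tuple is an assumption made for the example): the verb denotes λy λx Loves(x, y), the VP applies it to Mary, and the S rule applies the resulting predicate to John.

# Semantic values as callables and tuples: Loves(x, y) is written as the tuple ('Loves', x, y).
loves = lambda y: lambda x: ("Loves", x, y)   # Verb(λy λx Loves(x, y)) -> loves

vp = loves("Mary")    # VP(pred(obj)) -> Verb(pred) NP(obj):  λx Loves(x, Mary)
s  = vp("John")       # S(pred(obj))  -> NP(obj) VP(pred):    Loves(John, Mary)

print(s)              # ('Loves', 'John', 'Mary')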
We could just as easily have added semantics to ε2; we chose to work with ε0 so that the reader can focus on one type of augmentation at a time. Adding semantic augmentations to a grammar by hand is laborious and error prone. Therefore, there have been several projects to learn semantic augmentations from examples. CHILL (Zelle and Mooney, 1996) is an inductive logic programming (ILP) program that learns a grammar and a specialized parser for that grammar from examples. The target domain is natural language database queries. The training examples consist of pairs of word strings and corresponding semantic forms, for example:

   What is the capital of the state with the largest population?
   Answer(c, Capital(s, c) ∧ Largest(p, State(s) ∧ Population(s, p)))

CHILL's task is to learn a predicate Parse(words, semantics) that is consistent with the examples and, hopefully, generalizes well to other examples. Applying ILP directly to learn this predicate results in poor performance: the induced parser has only about 20% accuracy. Fortunately, ILP learners can improve by adding knowledge. In this case, most of the Parse predicate was defined as a logic program, and CHILL's task was reduced to inducing the control rules that guide the parser to select one parse over another. With this additional background knowledge, CHILL can learn to achieve 70% to 85% accuracy on various database query tasks.
23.3.5   Complications
The grammar of real English is endlessly complex. We will briefly mention some examples.

Time and tense: Suppose we want to represent the difference between "John loves Mary" and "John loved Mary." English uses verb tenses (past, present, and future) to indicate
S(pred(obj))    →  NP(obj) VP(pred)
VP(pred(obj))   →  Verb(pred) NP(obj)
NP(obj)         →  Name(obj)
Name(John)      →  John
Name(Mary)      →  Mary
Verb(λy λx Loves(x, y))  →  loves

Figure 23.10   A grammar that can derive a parse tree and semantic interpretation for "John loves Mary" (and three other sentences). Each category is augmented with a single argument representing the semantics.
[S(Loves(John, Mary)) [NP(John) [Name(John) John]] [VP(λx Loves(x, Mary)) [Verb(λy λx Loves(x, y)) loves] [NP(Mary) [Name(Mary) Mary]]]]

Figure 23.11   A parse tree with semantic interpretations for the string "John loves Mary".
the relative time of an event. One good choice to represent the time of events is the event calculus notation of Section 12.3. In event calculus we have

   John loves Mary:   E₁ ∈ Loves(John, Mary) ∧ During(Now, Extent(E₁))
   John loved Mary:   E₂ ∈ Loves(John, Mary) ∧ After(Now, Extent(E₂)) .

This suggests that our two lexical rules for the words "loves" and "loved" should be these:

   Verb(λy λx e ∈ Loves(x, y) ∧ During(Now, e))  →  loves
   Verb(λy λx e ∈ Loves(x, y) ∧ After(Now, e))   →  loved .

Other than this change, everything else about the grammar remains the same, which is encouraging news; it suggests we are on the right track if we can so easily add a complication like the tense of verbs (although we have just scratched the surface of a complete grammar for time and tense). It is also encouraging that the distinction between processes and discrete events that we made in our discussion of knowledge representation in Section 12.3.1 is actually reflected in language use. We can say "John slept a lot last night," where Sleeping is a process category, but it is odd to say "John found a unicorn a lot last night," where Finding
is a discrete event category. A grammar would reflect that fact by having a low probability for adding the adverbial phrase "a lot" to discrete events.

Quantification:
Consider the sentence "Every agent feels a breeze." The sentence has
only one syntactic parse under
£o, but it is actually
semantically ambiguous; the preferred
Chapter
904
23.
Natural Language for Communication
meaning is "For every agent there exists a breeze that the agent feels," but an acceptable alternative meaning is "There exists a breeze that every agent feels." 5 The two interpretations
can be represented as V a a E Agents � 3 b b E Breezes ! \3 e e E Feel(a, b) 1\ During( Now, e) ; 3 b b E Breezes V a a E Agents � 3 e e E Feel(a, b) 1\ During( Now, e) . TI1e standard approach to quantification is for the grammar to define not an actual logical QUASI-l.OOICAI.. FORM
semantic sentence, but rather a quasi-logical form that is then turned into a logical sentence by algorithms outside of the parsing process. Those algorithms can have preference rules for preferring one quantifier scope over another-preferences that need not be reflected directly in the grammar.
Pragmatics: We have shown how an a.gent can perceive a string of words and use a
PRAGMATICS
grammar to derive a set of possible semantic interpretations. Now we address t.he problem of completing the interpretation by adding context-dependent information about the current situation. The most obvious need for pragmatic information is in resolving the meaning of INDEXICAL
indexicals, which are phrases that refer directly to the current situation. For example, in the sentence "I am in Boston today," both "I" and "today" are indexicals. The word "I" would be represented by the fluent
Speaker, and it would be up to the hearer to resolve the meaning of
the fluent-that is not considered part of the grammar but rather an issue of pragmatics; of using the context of t.he current situation to interpret fluents. Another part of pragmatics is interpreting the speaker's intent. The speaker's action is SPEECH Ia
considered a speech act, and it is up lo the hearer to decipher what type of action it is-a question, a statement, a promise, a warning, a command, and so on. A command such "go to 2 2" implicitly refers to the hearer. So far, our grammar for
as
S covers only declarative
sentences. We can easily extend it to cover commands. A command can be fonned from a VP, where the subject is implicitly the hearer. We need to distinguish commands from statements, so we alter the rules for S to include the type of speech act:
S(Statement(Speaker,pred(obj))) � NP(obj) VP(pred) S(Command(Speaker,pred(Hearer))) � VP(pred) . Long-distance dependencies: Questions introduce a new grammatical complexity.
LONG·DISTANCE DEPENDENCIES
In
"Who did the agent tell you to give the gold to?" the final word "to" should be parsed as TRACE
[PP to �], where the "w" denotes a gap or trace where an NP is missing; the missing NP is licensed by the first word of the sentence, "who." A complex system of augmentations is used to make sure that the missing NPs match up with the licensing words in just the right way, and prohibit gaps in the wrong places. For example, you can't have a gap in one branch of an
NP conjunction:
"What did he play
[NP
you can have the same gap in both branches of a AMBIGUITY
Dungeons and �]?" is ungrammatical. But
VP conjunction:
"What did you
[ VP [ VP
smell �] and [ VP shoot an arrow at w]]?" Ambiguity: In some cases, hearers are consciously aware of ambiguity in an utterance. Here are some examples taken from newspaper headlines: 5 If this interpreat t ion seems unlikely, consider "Every Protestant believes in a just God."
Section 23.3.
Augmented Grammars and Semantic Interpretation
905
Squad helps dog bite victim. Police begin campaign to run down jaywalkers. Helicopter powered by human flies. Once-sagging cloth diaper industry saved by full dumps. Portable toilet bombed; police have nothing to go on. Teacher strikes idle kids. Include your children when baking cookies. Hospitals are sued by 7 foot doctors. Milk drinkers are turning to powder. Safety experts say school bus passengers should be belted. But most of the time the language we hear seems llllambiguous. Thus, when researchers first began to use computers to analyze language in the 1960s, they were quite surprised to learn that almost every utterance is highly ambiguous, even though the alternative interpretations
might not be apparent to a native speaker. A system with a large grammar and lexicon might LEXICALAMBIGUITY
find thousands of interpretations for a perfectly ordinary sentence. Lexical ambiguity, in which a word has more than one meaning, is quite common; "back" can be an adverb (go back), an adjective (back door), a noun (the back of the room) or a verb (back up your files). "Jack" can be a name, a nmm (a playing card, a six-pointed metal game piece, a nautical flag, a fish, a socket, or a device for raising heavy objects), or a verb (to jack up a car, to hunt with
SYNTACTIC AMBIGUITY
a light, or to hit a baseball hard). Syntactic ambiguity refers to a phrase that has multiple parses: "I smelled a wumpus in 2,2" has two parses: one where the prepositional phrase "in 2,2" modifies the nollll and one where it modifies the verb. The syntactic ambiguity leads to a
SEMANTIC
semantic ambiguity,
AMBIGUITY
because one parse means that the wumpus is in 2,2 and the other means
that. a stench is in 2,2. In this case, getting the wrong interpretation could be a deadly mistake for the agent. Finally, there can be ambiguity between literal and figurative meanings. Figures of speech are important in poetry, but are surprisingly common in everyday speech as well. A METONYMY
metonymy
is a figure of speech in which one object is used to stand for another. When
we hear "Chrysler announced a new model," we do not interpret it as saying that compa nies can talk; rather we understand that a spokesperson representing the company made the annow1cement. Metonymy is common and is often interpreted unconsciously by human hear ers. Unfortllltely, la our grammar as it is written is not so facile. To handle the semantics of metonymy properly, we need to introduce a whole new level of ambiguity. We do this by pro viding two objects for the semantic interpretation of every phrase in the sentence: one for the object that the phrase literally refers to (Chrysler) and one for the metonymic reference (the spokesperson). We then have to say that there is a relation between the two. In our cun-ent grammar, "Chrysler announced" gets interpreted as x
= Chrysler 1\ e E Announce(x) 1\ After(Now, Extent(e)) .
We need to change that to x
= Chrysler 1\ e E Announce(m) 1\ After(Now, Extent(e)) 1\
Metonymy (m,x) .
906
Chapter 23.
Natural Language for Communication
This says that there is one entity x that is equal to Chrysler, and another entity m that did the announcing, and that the two are in a metonymy relation. The next step is to define what kinds of metonymy relations can occur. The simplest case is when there is no metonymy at all-the literal object x and the metonymic object m are identical:
Vm,x (m = x)
=?
Metonymy (m, x) .
For the Chl)'sler example, a reasonable generalization is that an organization can be used to stand for a spokesperson of that organization:
Vm,x x E Organizations A Spokesperson(m,x)
METAPHOR
DISAMBIGUATION
=?
Metonymy(m,x) .
Other metonymies include the author for the works (I read Shakespeare) or more generally the producer for the product (I drive a Honda) and the part for the whole (TI1e Red Sox need a strong arm). Some examples of metonymy, such as "The ham sandwich on Table 4 wants another beer," are more novel and are interpreted with respect to a situation. A metaphor is another figure of speech, in which a phrase with one literal meaning is used to suggest a different meaning by way of an analogy. Thus, metaphor can be seen as a kind of metonymy where the relation is one of similarity. Disambiguation is the process of recovering the most probable intended meaning of an utterance. In one sense we already have a framework for solving this problem: each rule has a probability associated with it, so the probability of an interpretation is the product of the probabilities of the rules that led to the interpretation. Unfortunately, the probabilities reflect how common the phrases are in the corpus from which the grarrunar was learned, and thus reflect general knowledge, not specific knowledge of the cw·rent situation. To do disambiguation properly, we need to combine four models: I . The world model: the likelihood that a proposition occurs in the world. Given what we know about the world, it is more likely that a speaker who says "I'm dead" means "I am in big trouble" rather than "My life ended, and yet I can still talk." 2. The mental model: the likelihood that the speaker forms the intention of communicat ing a certain fact to the hearer. This approach combines models of what the speaker believes, what the speaker believes the hearer believes, and so on. For example, when a politician says, "I am not a crook;' the world model might assign a probability of only 50% to the proposition that the politician is not a criminal, and 99.999% to the proposition that he is not a hooked shepherd's staff. Nevertheless, we select the former interpretation because it is a more likely thing to say. 3. TI1e language model: the likelihood that a certain string of words will be chosen, given that the speaker has the intention of communicating a certain fact. 4. The acoustic model: for spoken communication, the likelihood that a particular se quence of sounds will be generated, given that the speaker has chosen a given string of words. Section 23.5 covers speech recognition.
23.4   MACHINE TRANSLATION
Machine translation is the automatic translation oftext from one natural language (the source) to another (the target). It was one of the first application areas envisioned for computers (Weaver, 1949), but it is only in the past decade that the technology has seen widespread usage. Here is a passage from page 1 of this book: AI is one of the newest fields in science and engineering. Work started in earnest soon after World War IT, and the name itself was coined in 1956. Along with molecular biol
ogy, AI is regularly cited as the "field I would most like to be in" by scientists in other disciplines.
And here it is translated from English to Danish by an online tool, Google Translate: AI er en af de nyeste 01nrilder inden for videnskab og teknik. Arbejde startede for alvor
lige efter Anden Verdenskrig, og navuet i sig selv var opfundet i 1956. Sammen med molekylrer biologi, er AI jrevn1igt nrevnt som "feltet Jeg ville de fleste gerne vrere i" af
forskere i andre discipliner.

[Figure: a speech waveform divided into frames, each summarized by discretized acoustic feature values.]

Figure 23.15   Translating the acoustic signal into a sequence of frames. In this diagram each frame is described by the discretized values of three acoustic features; a real system would have dozens of features.
Phone HMM for [m]:
   [Transition diagram over the states Onset, Mid, and End, with self-loop probabilities 0.3, 0.9, and 0.4 respectively.]

Output probabilities for the phone HMM:
   Onset:  C1: 0.5   C2: 0.2   C3: 0.3
   Mid:    C3: 0.2   C4: 0.7   C5: 0.1
   End:    C4: 0.1   C6: 0.5   C7: 0.4

Figure 23.16   An HMM for the three-state phone [m]. Each state has several possible outputs, each with its own probability. The MFCC feature labels C1 through C7 are arbitrary, standing for some combination of feature values.
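The figure's numbers can be written down directly as data and used in the standard HMM forward recursion. The sketch below is an illustration, not the book's code: it assumes the probability of leaving each state is one minus its self-loop probability, that the model starts in Onset, and that a sequence may end in any state.

# The three-state phone HMM for [m], as plain dictionaries (values from Figure 23.16).
STATES = ["Onset", "Mid", "End"]
SELF_LOOP = {"Onset": 0.3, "Mid": 0.9, "End": 0.4}
EMIT = {
    "Onset": {"C1": 0.5, "C2": 0.2, "C3": 0.3},
    "Mid":   {"C3": 0.2, "C4": 0.7, "C5": 0.1},
    "End":   {"C4": 0.1, "C6": 0.5, "C7": 0.4},
}

def sequence_probability(frames):
    """Forward algorithm over the left-to-right model: start in Onset and
       sum over paths ending in any state after the last frame."""
    alpha = {s: 0.0 for s in STATES}
    alpha["Onset"] = EMIT["Onset"].get(frames[0], 0.0)
    for frame in frames[1:]:
        new = {}
        for i, s in enumerate(STATES):
            stay = alpha[s] * SELF_LOOP[s]
            enter = alpha[STATES[i - 1]] * (1 - SELF_LOOP[STATES[i - 1]]) if i > 0 else 0.0
            new[s] = (stay + enter) * EMIT[s].get(frame, 0.0)
        alpha = new
    return sum(alpha.values())

print(sequence_probability(["C1", "C3", "C4", "C4", "C6"]))   # a small positive probability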
(a) Word model with dialect variation.   (b) Word model with coarticulation and dialect variations.

Figure 23.17   Two pronunciation models of the word "tomato." Each model is shown as a transition diagram with states as circles and arrows showing allowed transitions with their associated probabilities. (a) A model allowing for dialect differences. The 0.5 numbers are estimates based on the two authors' preferred pronunciations. (b) A model with a coarticulation effect on the first vowel, allowing either the [ow] or the [ah] phone.
a phone as ttuee states, the onset, middle, and end. For example, the [t] phone has a silent beginning, a small explosive burst of sound in the middle, and (usually) a hissing at the end. Figure 23.16 shows an example for the phone [m]. Note that in normal speech, an average phone has a duration of 50-100 milliseconds, or 5-10 frames. The self-loops in each state allows for variation in this duration. By taking many self-loops (especially in the mid state), we can represent a long "mmmmmmmmmmm" sound. Bypassing the self-loops yields a short "m" sound. In Figure 23.17 the phone models are strung together to form a pronunciation model for a word. According to Gershwin (1937), you say [t ow m ey t ow] and I say [t ow m aa t ow]. Figure 23.17(a) shows a transition model that provides for this dialect variation. Each of the circles in this diagram represents a phone model like the one in Figure 23.16. In addition to dialect variation, words can have coarticulation variation. For example, the [t] phone is produced with the tongue at the top of the mouth, whereas the [ow] has the tongue near the bottom. When speaking quickly, the tongue doesn't have time to get into position for the [ow], and we end up with [t ah] rather than [t ow]. Figure 23.17(b) gives a model for "tomato" that takes this coarticulation effect into account. More sophisticated phone models take into account the context o fthe surrounding phones.
There can be substantial variation in pronunciation for a word. The most common pronunciation of "because" is [b iy k ah z], but that only accounts for about a quarter of uses. Another quarter (approximately) substitutes [ix], [ih] or [ax] for the first vowel, and the remainder substitute [ax] or [aa] for the second vowel, [zh] or [s] for the final [z], or drop "be" entirely, leaving "cuz." 23.5.2
Language model
For general-purpose speech recognition, the langua.ge model can be an n-gram model of text learned from a corpus of written sentences. However, spoken language has different characteristics than written language, so it is better to get a corpus of transcripts of spoken language. For task-specific speech recognition, the corpus should be task-specific: to build your airline reservation system, get transcripts of prior calls. It also helps to have task-specific vocabulary, such as a Jist of all the airports and cities served, and all the flight numbers. Part ofthe design of a voice user interface is to coerce the user into saying things from a limited set ofoptions, so that the speech recognizer will have a tighter probability distribution to deal with. For example, asking "What city do you want to go to?" elicits a response with a highly constrained language model, while asking "How can I help you?" does not. 23.5.3
Building a speech recognizer
The quality of a speech recognition system depends on the quality of all of its components the language model, the word-pronunciation models, the phone models, and the signal processing algorithms used to extract spectral features from the acoustic signal. We have discussed how the langua.ge model can be constructed from a corpus of written text, and we leave the details of signal processing to other textbooks. We are left with the prommciation and phone models. The structure of the pronunciation models-such as the tomato models in
918
Chapter
23.
Natural Language for Communication
Figure 23.17-is usually developed by hand. Large pronunciation dictionaries are now avail able for English and other languages, although their accuracy varies greatly. The structure of the three-state phone models is the same for all phones, as shown in Figw·e 23.1 6. That leaves the probabilities themselves. As usual, we will acquire the probabilities from a corpus, this time a corpus of speech. The most common type of corpus to obtain is one that includes the speech signal for each sentence paired with a transcript of the words. Building a model from this corpus is more difficult than building an n-gram model of text, because we have to build a hidden Markov model-the phone sequence for each word and the phone state for each time frame are hidden variables. In the early days of speech recognition, the hidden variables were provided by laborious hand-labeling of spectrograms. Recent systems use expectation-maximization to automatically supply the missing data. The idea is simple: given an HMM and an observation sequence, we can use the smoothing algorithms from Sections 15.2 and 15.3 to compute the probability of each state at each time step and, by a simple extension, the probability of each state-state pair at consecutive time steps. These probabilities can be viewed as uncertain labels. From the uncertain labels, we can estimate new transition and sensor probabilities, and the EM procedure repeats. The method is guaranteed to increase the fit between model and data on each iteration, and it generally converges to a much better set of parameter values t.ha.n those provirlerl hy the initial, ha.nrl-lahelerl estimates.
The systems with the highest accuracy work by training a different model for each speaker, thereby capturing differences in dialect as well as male/female and other variations. This training can require several hours of interaction with the speaker, so the systems with the most widespread adoption do not create speaker-specific models. The accuracy of a system depends on a number of factors. First, the quality ofthe signal matters: a high-quality directional microphone aimed at a stationary mouth in a padded room will do much better than a cheap microphone transmitting a signal over phone lines from a car in traffic with the radio playing. The vocabulary size matters: when recognizing digit strings with a vocabulary of 1 1 words (1-9 plus "oh" and "zero"), the word error rate wil1 be below 0.5%, whereas it rises to about 10% on news stories with a 20,000-word vocabulary, and 20% on a corpus with a 64,000-word vocabulary. The task matters too: when the system is trying to accomplish a specific task-book a flight or give directions to a restaurant-the task can often be accomplished perfectly even with a word error rate of 10% or more.
23.6   SUMMARY
Natural language understanding is one of the most important subfields of AI. Unlike most other areas of AI, natural language understanding requires an empirical investigation of actual human behavior--which turns out to be complex and interesting. •
Formal language theory and phrase structure grammars (and in particular, context free grammar) are useful tools for dealing with some aspects of natural language. The probabilistic context-free grammar (PCFG) fonnalism is widely used.
919
Bibliographical and Historical Notes •
• • •
Sentences in a context-free language can be parsed in
O(n3)
time by a
chart parser such as the CYK algorithm, which requires grammar rules to be in Chomsky Normal Form. A treebank can be used to Jearn a grammar. It is also possible to learn a grammar from an unparsed corpus of sentences, but this is less successful. A
lexicalized PCFG allows us to represent that some relationships between words are
more common than others. It is convenient to augment a grammar to handle such problems as subject-verb agree ment and pronow1 case.
Definite clause grammar (DCG) is a formalism that allows for
augmentations. With DCG, parsing and semantic interpretation (and even generation) can be done using logical inference. • •
Semantic interpretation can also be handled by an augmented grammar. Ambiguity is a very important problem in natural language Lmderstanding;
most sen
tences have many possible interpretations, but usually only one is appropriate. Disam biguation relies on knowledge about the world, about the current situation, and about language use.
•
Machine translation
systems have been implemented using a range of techniques,
from full syntactic and semantic analysis to statistical techniques based on phrase fre quencies. Currently the statistical models are most popular and most successful.
• •
Speech recognition systems
are also plimarily based on statistical principles. Speech
systems are popular and useful, albeit imperfect.
Together, machine translation and speech recognition are two of the big successes of natural language teclmology. One reason that the models perform well is that large corpora are available-both translation and speech are tasks that are performed "in the wild" by people every day.
In contrast, tasks like parsing sentences have been less
successful, in part because no large corpora of parsed sentences are available "in the wild" and in part because parsing is not useful in and of itself.
BffiLIOGRAPHICAL AND HISTORICAL NOTES Like semantic networks, context-free grammars (also known as phrase structure grammars) are a reinvention of a technique first used by ancient Indian grammarians (especially Panini, ca. 350 B.C.) studying Shastlic Sanskrit (Ingerman, 1967). They were reinvented by Noam Chomsky (1956) for the analysis of English syntax and independently by John Backus for the analysis of Algol-58 syntax. Peter Naur extended Backus's notation and is now credited (Backus, 1996) with the "N" in BNF, which originally stood for "Backus N01mal Form." ATIRIBL'TE GRAMM.IA
Knuth (1968) defined a kind of augmented grammar called attribute ful for progranuning languages.
grammar
that is use
Definite clause grarrunars were introduced by Colmer
auer (1975) and developed and popularized by Pereira and Shieber (1987).
Probabilistic context-free granunars
were investigated by Booth (1969) and Salo
maa (1969). Other algorithms for PCFGs are presented in the excellent short monograph by
920
Chapter
23.
Natural Language for Communication
Chamiak (1993) and the excellent long textbooks by Manning and Schiitze (1999) and Juraf
sky and Martin (2008). Baker (1979) introduces the inside-outside algorithm for learning a PCFG, and Lari and Young (1990) describe its uses and limitations. Stolcke and Omohundro (1994) show how to learn grammar rules with Bayesian model merging; Haghighi and Klein (2006) describe a learning system based on prototypes.
Lexicalized PCFGs (Charniak,
1997; Hwa, 1998) combine the best aspects of PCFGs
and n-gram models. Collins (1999) describes PCFG parsing that is lexicalized with head
features. Petrov and Klein (2007a) show how to get the advantages of lexicalization without actual lexical augmentations by learning specific syntactic categories from a treebank that has
general categories; for example, the treebank has the category categories such as NPo and NPs can be learned.
NP, from which more specific
There have been many attempts to write formal gra1runars of natural languages, both
in "pure" linguistics and in computational linguistics. There are several comprehensive but
informal grammars of English (Quirk et al., 1985; McCawley, 1988; Huddleston and Pullum, 2002). Since the mid-1980s, there has been a trend toward putting more information in the
lexicon and less in the grammar. Lexical-functional grammar, or LFG (Bresnan, 1982) was the first major grammar formalism to be highly lexicalized. If we carry lexicalization to an extreme, we end up with
categorial grammar (Clark and Curran, 2004), n i which there can dependency grammar (Smith and Eisner, 2008;
be as few as two grammar rules, or with
KUbler et al., 2009) in which there are no syntactic categories, only relations between words. Sleator and Temperley (1993) describe a dependency parser. Paskin (2001) shows that a version of dependency grammar is easier to learn than PCFGs. The first computerized parsing algoritluns were demonstrated by Yngve (1955). Ef
ficient algorithms were developed in the late 1960s, with a few twists since then (Kasami, 1965; Younger, 1967; Earley, 1970; Graham
et al.,
1980). Maxwell and Kaplan (1993) show
how chart parsing with augmentations can be made efficient in the average case. Church
and Patil (1982) address the resolution of syntactic ambiguity. Klein and Manning (2003)
describe A* parsing, and Pauls and Klein (2009) extend that to K-best A* parsing, in which the result is not a single parse but the K best. Leading parsers today include those by Petrov and Klein (2007b), which achieved 90.6% accuracy on the Wall Street Journal corpus, Chamiak and Johnson (2005), which achieved 92.0%, and Koo
et al.
(2008), which achieved 93.2% on the Penn treebank. These
numbers are not directly comparable, and there is some criticism ofthe field that it is focusing too narrowly on a few select corpora, and perhaps overfitting on them.
Formal semantic interpretation of natural languages originates within philosophy and formal logic, particularly Alfred Tarski's (1935) work on the semantics of formal languages. Bar-Hillel (1954) was the first to consider the problems of pragmatics and propose that they could be handled by formal logic. For example, he introduced C. S. Peirce's (1902) tenn
indexical into linguistics.
Richard Montague's essay "English as a formal language" (1970)
is a kind of manifesto for the logical analysis of language, but the books by Dowty
et al.
(1991) and Portner and Partee (2002) are more readable. The first NLP system to solve an actual task was probably the BASEBALL question answering system (Green et al., 1961 ), which handled questions about a database of baseball
Bibliographical and Historical Notes
921
statistics. Close after that was Woods's (1973) LUNAR, which answered questions about the rocks brought back from the moon by the Apollo program. Roger Schank and his students built a series of programs (Schank and Abelson, 1977; Schank and Riesbeck, 1981) that all had the task of understanding language. Modern approaches to semantic interpretation usually assume that the mapping from syntax to semantics will be learned from examples (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005).
Hobbs et al. (1993) describes a quantitative nonprobabilistic framework for interpretation. More recent work follows an explicitly probabilistic framework (Charniak and Goldman, 1992; Wu, 1993; Franz, 1996). In linguistics, optimality theory (Kager, 1999) is based on the idea of building soft constraints into the grammar, giving a natural ranking to interpretations (similar to a probability distribution), rather than having the grammar generate all possibilities with equal rank. Norvig (1988) discusses the problems of considering multiple simultaneous interpretations, rather than settling for a single maximum-likelihood interpretation. Literary critics (Empson, 1953; Hobbs, 1990) have been ambiguous about whether ambiguity is something to be resolved or cherished. Nunberg (1979) outlines a formal model of metonymy. Lakoff and Johnson (1980) give an engaging analysis and catalog of common metaphors in English. Martin (1990) and Gibbs (2006) offer computational models of metaphor interpretation.
UNIVERSAL GRAMMAR
The first important result on grammar induction was a negative one: Gold (1967) showed that it is not possible to reliably learn a correct context-free grammar, given a set of strings from that grammar. Prominent linguists, such as Chomsky (1957) and Pinker (2003), have used Gold's result to argue that there must be an innate universal grammar that all children have from birth. The so-called Poverty of the Stimulus argument says that children aren't given enough input to learn a CFG, so they must already "know" the grammar and be merely tuning some of its parameters. While this argument continues to hold sway throughout much of Chomskyan linguistics, it has been dismissed by some other linguists (Pullum, 1996; Elman et al., 1997) and most computer scientists. As early as 1969, Horning showed that it is possible to learn, in the sense of PAC learning, a probabilistic context-free grammar. Since then, there have been many convincing empirical demonstrations of learning from positive examples alone, such as the ILP work of Mooney (1999) and Muggleton and De Raedt (1994), the sequence learning of Nevill-Manning and Witten (1997), and the remarkable Ph.D. theses of Schütze (1995) and de Marcken (1996). There is an annual International Conference on Grammatical Inference (ICGI). It is possible to learn other grammar formalisms, such as regular languages (Denis, 2001) and finite state automata (Parekh and Honavar, 2001). Abney (2007) is a textbook introduction to semi-supervised learning for language models.
Wordnet (Fellbaum, 2001) is a publicly available dictionary of about 100,000 words and phrases, categorized into parts of speech and linked by semantic relations such as synonym, antonym, and part-of. The Penn Treebank (Marcus et al., 1993) provides parse trees for a 3-million-word corpus of English. Charniak (1996) and Klein and Manning (2001) discuss parsing with treebank grammars. The British National Corpus (Leech et al., 2001) contains 100 million words, and the World Wide Web contains several trillion words; Brants et al. (2007) describe n-gram models over a 2-trillion-word Web corpus.
In the 1930s Petr Troyanskii applied for a patent for a "translating machine," but there were no computers available to implement his ideas. In March 1947, the Rockefeller Foundation's Warren Weaver wrote to Norbert Wiener, suggesting that machine translation might be possible. Drawing on work in cryptography and information theory, Weaver wrote, "When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in strange symbols. I will now proceed to decode.'" For the next decade, the community tried to decode in this way. IBM exhibited a rudimentary system in 1954. Bar-Hillel (1960) describes the enthusiasm of this period. However, the U.S. government subsequently reported (ALPAC, 1966) that "there is no immediate or predictable prospect of useful machine translation." However, limited work continued, and starting in the 1980s, computer power had increased to the point where the ALPAC findings were no longer correct.
The basic statistical approach we describe in the chapter is based on early work by the IBM group (Brown et al., 1988, 1993) and the recent work by the ISI and Google research groups (Och and Ney, 2004; Zollmann et al., 2008). A textbook introduction on statistical machine translation is given by Koehn (2009), and a short tutorial by Kevin Knight (1999) has been influential. Early work on sentence segmentation was done by Palmer and Hearst (1994). Och and Ney (2003) and Moore (2005) cover bilingual sentence alignment.
The prehistory of speech recognition began in the 1920s with Radio Rex, a voice-activated toy dog. Rex jumped out of his doghouse in response to the word "Rex!" (or actually almost any sufficiently loud word). Somewhat more serious work began after World War II. At AT&T Bell Labs, a system was built for recognizing isolated digits (Davis et al., 1952) by means of simple pattern matching of acoustic features. Starting in 1971, the Defense Advanced Research Projects Agency (DARPA) of the United States Department of Defense funded four competing five-year projects to develop high-performance speech recognition systems. The winner, and the only system to meet the goal of 90% accuracy with a 1000-word vocabulary, was the HARPY system at CMU (Lowerre and Reddy, 1980). The final version of HARPY was derived from a system called DRAGON built by CMU graduate student James Baker (1975); DRAGON was the first to use HMMs for speech. Almost simultaneously, Jelinek (1976) at IBM had developed another HMM-based system. Recent years have been characterized by steady incremental progress, larger data sets and models, and more rigorous competitions on more realistic speech tasks. In 1997, Bill Gates predicted, "The PC five years from now--you won't recognize it, because speech will come into the interface." That didn't quite happen, but in 2008 he predicted "In five years, Microsoft expects more Internet searches to be done through speech than through typing on a keyboard." History will tell if he is right this time around.
Several good textbooks on speech recognition are available (Rabiner and Juang, 1993; Jelinek, 1997; Gold and Morgan, 2000; Huang et al., 2001). The presentation in this chapter drew on the survey by Kay, Gawron, and Norvig (1994) and on the textbook by Jurafsky and Martin (2008). Speech recognition research is published in Computer Speech and Language, Speech Communications, and the IEEE Transactions on Acoustics, Speech, and Signal Processing and at the DARPA Workshops on Speech and Natural Language Processing and the Eurospeech, ICSLP, and ASRU conferences.
Ken Church (2004) shows that natural language research has cycled between concentrating on the data (empiricism) and concentrating on theories (rationalism). The linguist John Firth (1957) proclaimed "You shall know a word by the company it keeps," and linguistics of the 1940s and early 1950s was based largely on word frequencies, although without the computational power we have available today. Then Noam Chomsky (1956) showed the limitations of finite-state models, and sparked an interest in theoretical studies of syntax, disregarding frequency counts. This approach dominated for twenty years, until empiricism made a comeback based on the success of work in statistical speech recognition (Jelinek, 1976). Today, most work accepts the statistical framework, but there is great interest in building statistical models that consider higher-level models, such as syntactic trees and semantic relations, not just sequences of words.
Work on applications of language processing is presented at the biennial Applied Natural Language Processing conference (ANLP), the conference on Empirical Methods in Natural Language Processing (EMNLP), and the journal Natural Language Engineering. A broad range of NLP work appears in the journal Computational Linguistics and its conference, ACL, and in the Computational Linguistics (COLING) conference.
EXERCISES
23.1 Read the following text once for understanding, and remember as much of it as you can. There will be a test later.
The procedure is actually quite simple. First you arrange things into different groups. Of course, one pile may be sufficient depending on how much there is to do. If you have to go somewhere else due to lack of facilities that is the next step, otherwise you are pretty well set. It is important not to overdo things. That is, it is better to do too few things at once than too many. In the short run this may not seem important but complications can easily arise. A mistake is expensive as well. At first the whole procedure will seem complicated. Soon, however, it will become just another facet of life. It is difficult to foresee any end to the necessity for this task in the immediate future, but then one can never tell. After the procedure is completed one arranges the material into different groups again. Then they can be put into their appropriate places. Eventually they will be used once more and the whole cycle will have to be repeated. However, this is part of life.
23.2 An HMM grammar is essentially a standard HMM whose state variable is N (nonterminal, with values such as Det, Adjective, Noun, and so on) and whose evidence variable is W (word, with values such as is, duck, and so on). The HMM model includes a prior P(N0), a transition model P(Nt+1 | Nt), and a sensor model P(Wt | Nt). Show that every HMM grammar can be written as a PCFG. [Hint: start by thinking about how the HMM prior can be represented by PCFG rules for the sentence symbol. You may find it helpful to illustrate for the particular HMM with values A, B for N and values x, y for W.]
23.3 Consider the following PCFG for simple verb phrases:
0.1 : VP → Verb
0.2 : VP → Copula Adjective
0.5 : VP → Verb the Noun
0.2 : VP → VP Adverb
0.5 : Verb → is
0.5 : Verb → shoots
0.8 : Copula → is
0.2 : Copula → seems
0.5 : Adjective → unwell
0.5 : Adjective → well
0.5 : Adverb → well
0.5 : Adverb → badly
0.6 : Noun → duck
0.4 : Noun → well
a. Which of the following have a nonzero probability as a VP? (i) shoots the duck well well well (ii) seems the well well (iii) shoots the unwell well badly
b. What is the probability of generating "is well well"?
c. What types of ambiguity are exhibited by the phrase in (b)?
d. Given any PCFG, is it possible to calculate the probability that the PCFG generates a string of exactly 10 words?
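Part (b) only requires multiplying rule probabilities along a derivation. The following is a hedged Python sketch of that bookkeeping; the dictionary encoding and the helper name score_derivation are illustrative choices rather than part of the exercise, and the printed value is the probability of one derivation only, not the total over all derivations.

# Encode the verb-phrase PCFG as (lhs, rhs) -> probability, then multiply
# the probabilities of the rules used in one hand-chosen derivation.
RULES = {
    ("VP", ("Verb",)): 0.1,
    ("VP", ("Copula", "Adjective")): 0.2,
    ("VP", ("Verb", "the", "Noun")): 0.5,
    ("VP", ("VP", "Adverb")): 0.2,
    ("Verb", ("is",)): 0.5,
    ("Verb", ("shoots",)): 0.5,
    ("Copula", ("is",)): 0.8,
    ("Copula", ("seems",)): 0.2,
    ("Adjective", ("unwell",)): 0.5,
    ("Adjective", ("well",)): 0.5,
    ("Adverb", ("well",)): 0.5,
    ("Adverb", ("badly",)): 0.5,
    ("Noun", ("duck",)): 0.6,
    ("Noun", ("well",)): 0.4,
}

def score_derivation(steps):
    """Product of the probabilities of the rules used in one derivation."""
    p = 1.0
    for lhs, rhs in steps:
        p *= RULES[(lhs, tuple(rhs))]
    return p

# One derivation of "is well well": VP -> VP Adverb, then VP -> Copula Adjective.
print(score_derivation([
    ("VP", ["VP", "Adverb"]),
    ("VP", ["Copula", "Adjective"]),
    ("Copula", ["is"]),
    ("Adjective", ["well"]),
    ("Adverb", ["well"]),
]))  # 0.2 * 0.2 * 0.8 * 0.5 * 0.5 = 0.008 for this one derivation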
23.4 Outline the major differences between Java (or any other computer language with which you are familiar) and English, commenting on the "understanding" problem in each case. Think about such things as grammar, syntax, semantics, pragmatics, compositionality, context-dependence, lexical ambiguity, syntactic ambiguity, reference finding (including pronouns), background knowledge, and what it means to "understand" in the first place.
23.5 This exercise concerns grammars for very simple languages.
a. Write a context-free grammar for the language a^n b^n.
b. Write a context-free grammar for the palindrome language: the set of all strings whose second half is the reverse of the first half.
c. Write a context-sensitive grammar for the duplicate language: the set of all strings whose second half is the same as the first half.
23.6 Consider the sentence "Someone walked slowly to the supermarket" and a lexicon consisting of the following words:
Pronoun → someone
Verb → walked
Adv → slowly
Prep → to
Article → the
Noun → supermarket
Which of the following three grammars, combined with the lexicon, generates the given sentence? Show the corresponding parse tree(s).
(A): S → NP VP
     NP → Pronoun
     NP → Article Noun
     VP → VP PP
     VP → VP Adv Adv
     VP → Verb
     PP → Prep NP
     NP → Noun

(B): S → NP VP
     NP → Pronoun
     NP → Noun
     NP → Article NP
     VP → Verb Vmod
     Vmod → Adv Vmod
     Vmod → Adv
     Adv → PP
     PP → Prep NP

(C): S → NP VP
     NP → Pronoun
     NP → Article NP
     VP → Verb Adv
     Adv → Adv Adv
     Adv → PP
     PP → Prep NP
     NP → Noun
For each of the preceding three grammars, write down three sentences of English and three sentences of non-English generated by the grammar. Each sentence should be significantly different, should be at least six words long, and should include some new lexical entries (which you should define). Suggest ways to improve each grammar to avoid generating the non-English sentences.
23.7 Collect some examples of time expressions, such as "two o'clock," "midnight," and "12:46." Also think up some examples that are ungrammatical, such as "thirteen o'clock" or "half past two fifteen." Write a grammar for the time language.
23.8 In this exercise you will transform E0 into Chomsky Normal Form (CNF). There are five steps: (a) Add a new start symbol, (b) Eliminate ε rules, (c) Eliminate multiple words on right-hand sides, (d) Eliminate rules of the form (X → Y), (e) Convert long right-hand sides into binary rules.
a. The start symbol, S, can occur only on the left-hand side in CNF. Add a new rule of the form S' → S, using a new symbol S'.
b. The empty string, ε, cannot appear on the right-hand side in CNF. E0 does not have any rules with ε, so this is not an issue.
c. A word can appear on the right-hand side in a rule only of the form (X → word). Replace each rule of the form (X → ... word ...) with (X → ... W' ...) and (W' → word), using a new symbol W'.
d. A rule (X → Y) is not allowed in CNF; it must be (X → Y Z) or (X → word). Replace each rule of the form (X → Y) with a set of rules of the form (X → ...), one for each rule (Y → ...), where (...) indicates one or more symbols.
e. Replace each rule of the form (X → Y Z ...) with two rules, (X → Y Z') and (Z' → Z ...), where Z' is a new symbol.
Show each step of the process and the final set of rules.
23.9 Using DCG notation, write a grammar for a language that is just like E1, except that it enforces agreement between the subject and verb of a sentence and thus does not generate ungrammatical sentences such as "I smells the wumpus."
23.10 Consider the following PCFG:
S → NP VP [1.0]
NP → Noun [0.6] | Pronoun [0.4]
VP → Verb NP [0.8] | Modal Verb [0.2]
Noun → can [0.1] | fish [0.3] | ...
Pronoun → I [0.4] | ...
Verb → can [0.01] | fish [0.1] | ...
Modal → can [0.3] | ...
The sentence "I can fish" has two parse trees with this grammar. Show the two trees, their prior probabilities, and their conditional probabilities, given the sentence.
23.11 An augmented context-free grammar can represent languages that a regular context-free grammar cannot. Show an augmented context-free grammar for the language a^n b^n c^n. The allowable values for augmentation variables are 1 and SUCCESSOR(n), where n is a value. The rule for a sentence in this language is
S(n) → A(n) B(n) C(n).
Show the rule(s) for each of A, B, and C.
23.12 Augment the E1 grammar so that it handles article-noun agreement. That is, make sure that "agents" and "an agent" are NPs, but "agent" and "an agents" are not.
23.13 Consider the following sentence (from The New York Times, July 28, 2008):
Banks struggling to recover from multibillion-dollar loans on real estate are curtailing loans to American businesses, depriving even healthy companies of money for expansion and hiring.
a. Which of the words in this sentence are lexically ambiguous?
b. Find two cases of syntactic ambiguity in this sentence (there are more than two).
c. Give an instance of metaphor in this sentence.
d. Can you find semantic ambiguity?
23.14 Without looking back at Exercise 23.1, answer the following questions:
a. What are the four steps that are mentioned?
b. What step is left out?
c. What is "the material" that is mentioned in the text?
d. What kind of mistake would be expensive?
e. Is it better to do too few things or too many? Why?
23.15 Select five sentences and submit them to an online translation service. Translate them from English to another language and back to English. Rate the resulting sentences for grammaticality and preservation of meaning. Repeat the process; does the second round of iteration give worse results or the same results? Does the choice of intermediate language make a difference to the quality of the results? If you know a foreign language, look at the translation of one paragraph into that language. Count and describe the errors made, and conjecture why these errors were made.
23.16 The Di values for the sentence in Figure 23.13 sum to 0. Will that be true of every translation pair? Prove it or give a counterexample.
23.17 (Adapted from Knight (1999).) Our translation model assumes that, after the phrase translation model selects phrases and the distortion model permutes them, the language model can unscramble the permutation. This exercise investigates how sensible that assumption is. Try to unscramble these proposed lists of phrases into the correct order:
a. have, programming, a, seen, never, I, language, better
b. loves, john, mary
c. is the, communication, exchange of, intentional, information brought, by, about, the, production, perception of, and signs, from, drawn, a, of, system, signs, conventional, shared
d. created, that, we hold these, to be, all men, truths, are, equal, self-evident
Which ones could you do? What type of knowledge did you draw upon? Train a bigram model from a training corpus, and use it to find the highest-probability permutation of some sentences from a test corpus. Report on the accuracy of this model.
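For the last part, a bigram model amounts to counting adjacent word pairs and scoring a candidate ordering by the product of smoothed conditional probabilities. A minimal Python sketch follows; the toy corpus, the add-alpha smoothing, and the function names are illustrative assumptions rather than a prescribed solution.

import math
from collections import Counter
from itertools import permutations

def train_bigram(sentences):
    """Count unigrams and bigrams over a list of tokenized sentences."""
    uni, bi = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def log_prob(order, uni, bi, vocab_size, alpha=1.0):
    """Add-alpha smoothed log-probability of one ordering of the phrases."""
    toks = ["<s>"] + list(order) + ["</s>"]
    lp = 0.0
    for prev, cur in zip(toks, toks[1:]):
        lp += math.log((bi[(prev, cur)] + alpha) / (uni[prev] + alpha * vocab_size))
    return lp

corpus = [["john", "loves", "mary"], ["mary", "loves", "john"]]   # toy training corpus
uni, bi = train_bigram(corpus)
phrases = ["loves", "john", "mary"]
best = max(permutations(phrases), key=lambda p: log_prob(p, uni, bi, len(uni)))
print(best)   # the bigram model prefers a word order it has seen in training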
23.18 Calculate the most probable path through the HMM in Figure 23.16 for the output sequence [C1, C2, C3, C4, C4, C6, C7]. Also give its probability.
23.19 We forgot to mention that the text in Exercise 23.1 is entitled "Washing Clothes." Reread the text and answer the questions in Exercise 23.14. Did you do better this time? Bransford and Johnson (1973) used this text in a controlled experiment and found that the title helped significantly. What does this tell you about how language and memory work?
24
PERCEPTION
In which we connect the computer to the raw, unwashed world.
PERCEPTION
SENSOR
Perception provides agents with information about the world they inhabit by interpreting the response of sensors. A sensor measures some aspect of the environment in a form that can be used as input by an agent program. The sensor could be as simple as a switch, which gives one bit telling whether it is on or off, or as complex as the eye. A variety of sensory modalities are available to artificial agents. Those they share with humans include vision, hearing, and touch. Modalities that are not available to the unaided human include radio, infrared, GPS, and wireless signals. Some robots do active sensing, meaning they send out a signal, such as radar or ultrasound, and sense the reflection of this signal off of the environment. Rather than trying to cover all of these, this chapter will cover one modality in depth: vision.
We saw in our description of POMDPs (Section 17.4, page 658) that a model-based decision-theoretic agent in a partially observable environment has a sensor model--a probability distribution P(E | S) over the evidence that its sensors provide, given a state of the world. Bayes' rule can then be used to update the estimation of the state.
OBJECT MODEL
RENDERING MODEL
For vision, the sensor model can be broken into two components: An object model describes the objects that inhabit the visual world--people, buildings, trees, cars, etc. The object model could include a precise 3D geometric model taken from a computer-aided design (CAD) system, or it could be vague constraints, such as the fact that human eyes are usually 5 to 7 cm apart. A rendering model describes the physical, geometric, and statistical processes that produce the stimulus from the world. Rendering models are quite accurate, but they are ambiguous. For example, a white object under low light may appear as the same color as a black object under intense light. A small nearby object may look the same as a large distant object. Without additional evidence, we cannot tell if the image that fills the frame is a toy Godzilla or a real monster. Ambiguity can be managed with prior knowledge--we know Godzilla is not real, so the image must be a toy--or by selectively choosing to ignore the ambiguity. For example, the vision system for an autonomous car may not be able to interpret objects that are far in the distance, but the agent can choose to ignore the problem, because it is unlikely to crash into an object that is miles away.
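As a reminder of how such a sensor model is used, here is a minimal, hypothetical sketch of the update P(S | E) ∝ P(E | S) P(S) over a two-state world; the two states and the likelihood numbers are invented purely for illustration.

# Hedged illustration: posterior over world states after one piece of visual evidence.
prior = {"toy": 0.999, "monster": 0.001}          # P(S): prior knowledge, invented numbers
likelihood = {"toy": 0.3, "monster": 0.8}         # P(E = "fills the frame" | S), invented

unnormalized = {s: likelihood[s] * prior[s] for s in prior}
z = sum(unnormalized.values())
posterior = {s: p / z for s, p in unnormalized.items()}
print(posterior)   # prior knowledge keeps "toy" overwhelmingly more probable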
A decision-theoretic agent is not the only architecture that can make use of vision sensors. For example, fruit flies (Drosophila) are in part reflex agents: they have cervical giant fibers that form a direct pathway from their visual system to the wing muscles that initiate an escape response--an immediate reaction, without deliberation. Flies and many other flying animals make use of a closed-loop control architecture to land on an object. The visual system extracts an estimate of the distance to the object, and the control system adjusts the wing muscles accordingly, allowing very fast changes of direction, with no need for a detailed model of the object.
Compared to the data from other sensors (such as the single bit that tells the vacuum robot that it has bumped into a wall), visual observations are extraordinarily rich, both in the detail they can reveal and in the sheer amount of data they produce. A video camera for robotic applications might produce a million 24-bit pixels at 60 Hz, a rate of 10 GB per minute. The problem for a vision-capable agent then is: Which aspects of the rich visual stimulus should be considered to help the agent make good action choices, and which aspects should be ignored? Vision--and all perception--serves to further the agent's goals, not as an end in itself.
FEATURE EXTRACTION
RECOGNITION
RECONSTRUCTION
We can characterize three broad approaches to the problem. The feature extraction approach, as exhibited by Drosophila, emphasizes simple computations applied directly to the sensor observations. In the recognition approach an agent draws distinctions among the objects it encounters based on visual and other information. Recognition could mean labeling each image with a yes or no as to whether it contains food that we should forage, or contains Grandma's face. Finally, in the reconstruction approach an agent builds a geometric model of the world from an image or a set of images.
The last thirty years of research have produced powerful tools and methods for addressing these approaches. Understanding these methods requires an understanding of the processes by which images are formed. Therefore, we now cover the physical and statistical phenomena that occur in the production of an image.
24.1 IMAGE FORMATION
Imaging distorts the appearance of objects. For example, a picture taken looking down a long straight set of railway tracks will suggest that the rails converge and meet. As another example, if you hold your hand in front of your eye, you can block out the moon, which is not smaller than your hand. As you move your hand back and forth or tilt it, your hand will seem to shrink and grow in the image, but it is not doing so in reality (Figure 24.1). Models of these effects are essential for both recognition and reconstruction.
24.1.1 Images without lenses: The pinhole camera
SCENE
IMAGE
Image sensors gather light scattered from objects in a scene and create a two-dimensional image. In the eye, the image is formed on the retina, which consists of two types of cells: about 100 million rods, which are sensitive to light at a wide range of wavelengths, and 5 million cones.
Figure 24.1 Imaging distorts geometry. Parallel lines appear to meet in the distance, as in the image of the railway tracks on the left. In the center, a small hand blocks out most of a large moon. On the right is a foreshortening effect: the hand is tilted away from the eye, making it appear shorter than in the center figure.
PIXEL
PINHOLE CAMERA
PERSPECTIVE PROJECTION
Cones, which are essential for color vision, are of three main types, each of which is sensitive to a different set of wavelengths. In cameras, the image is formed on an image plane, which can be a piece of film coated with silver halides or a rectangular grid of a few million photosensitive pixels, each a complementary metal-oxide semiconductor (CMOS) or charge-coupled device (CCD). Each photon arriving at the sensor produces an effect, whose strength depends on the wavelength of the photon. The output of the sensor is the sum of all effects due to photons observed in some time window, meaning that image sensors report a weighted average of the intensity of light arriving at the sensor.
To see a focused image, we must ensure that all the photons from approximately the same spot in the scene arrive at approximately the same point in the image plane. The simplest way to form a focused image is to view stationary objects with a pinhole camera, which consists of a pinhole opening, O, at the front of a box, and an image plane at the back of the box (Figure 24.2). Photons from the scene must pass through the pinhole, so if it is small enough then nearby photons in the scene will be nearby in the image plane, and the image will be in focus.
The geometry of scene and image is easiest to understand with the pinhole camera. We use a three-dimensional coordinate system with the origin at the pinhole, and consider a point P in the scene, with coordinates (X, Y, Z). P gets projected to the point P' in the image plane with coordinates (x, y, z). If f is the distance from the pinhole to the image plane, then by similar triangles, we can derive the following equations:
-x/f = X/Z,  -y/f = Y/Z   ⇒   x = -fX/Z,  y = -fY/Z.
These equations define an image-formation process known as perspective projection. Note that the Z in the denominator means that the farther away an object is, the smaller its image will be.
Figure 24.2 Each light-sensitive element in the image plane at the back of a pinhole camera receives light from the small range of directions that passes through the pinhole. If the pinhole is small enough, the result is a focused image at the back of the pinhole. The process of projection means that large, distant objects look the same as smaller, nearby objects. Note that the image is projected upside down.
Also, note that the minus signs mean that the image is inverted, both left-right and up-down, compared with the scene.
Under perspective projection, distant objects look small. This is what allows you to cover the moon with your hand (Figure 24.1). An important result of this effect is that parallel lines converge to a point on the horizon. (Think of railway tracks, Figure 24.1.) A line in the scene in the direction (U, V, W) and passing through the point (X0, Y0, Z0) can be described as the set of points (X0 + λU, Y0 + λV, Z0 + λW), with λ varying between -∞ and +∞. Different choices of (X0, Y0, Z0) yield different lines parallel to one another. The projection of a point Pλ from this line onto the image plane is given by
( f(X0 + λU)/(Z0 + λW), f(Y0 + λV)/(Z0 + λW) ).
VANISHING POINT
As λ → ∞ or λ → -∞, this becomes P∞ = (fU/W, fV/W) if W ≠ 0. This means that two parallel lines leaving different points in space will converge in the image: for large λ, the image points are nearly the same, whatever the value of (X0, Y0, Z0) (again, think railway tracks, Figure 24.1). We call P∞ the vanishing point associated with the family of straight lines with direction (U, V, W). Lines with the same direction share the same vanishing point.
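These formulas translate directly into code. The sketch below (plain Python, with an arbitrary focal length, line direction, and sample points chosen only for illustration) projects points on two parallel scene lines and shows their images approaching the common vanishing point; the image inversion is ignored for simplicity.

def project(X, Y, Z, f=1.0):
    """Perspective projection of scene point (X, Y, Z); minus signs (inversion) omitted."""
    return (f * X / Z, f * Y / Z)

def point_on_line(X0, Y0, Z0, lam, U, V, W):
    """Point at parameter lam on the line through (X0, Y0, Z0) with direction (U, V, W)."""
    return (X0 + lam * U, Y0 + lam * V, Z0 + lam * W)

# Two parallel "rails" receding in direction (U, V, W) = (0, 0.2, 1).
U, V, W = 0.0, 0.2, 1.0
for lam in [1.0, 10.0, 100.0, 1000.0]:
    left = project(*point_on_line(-1.0, -1.0, 2.0, lam, U, V, W))
    right = project(*point_on_line(+1.0, -1.0, 2.0, lam, U, V, W))
    print(lam, left, right)      # both image points converge as lam grows

print("vanishing point:", (1.0 * U / W, 1.0 * V / W))   # (fU/W, fV/W) with f = 1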
24.1.2 Lens systems
MOTION BLUR
The drawback of the pinhole camera is that we need a small pinhole to keep the image in focus. But the smaller the pinhole, the fewer photons get through, meaning the image will be dark. We can gather more photons by keeping the pinhole open longer, but then we will get motion blur--objects in the scene that move will appear blurred because they send photons to multiple locations on the image plane. If we can't keep the pinhole open longer, we can try to make it bigger. More light will enter, but light from a small patch of object in the scene will now be spread over a patch on the image plane, causing a blurred image.
Figure 24.3 Lenses collect the light leaving a scene point in a range of directions, and steer it all to arrive at a single point on the image plane. Focusing works for points lying close to a focal plane in space; other points will not be focused properly. In cameras, elements of the lens system move to change the focal plane, whereas in the eye, the shape of the lens is changed by specialized muscles.
LENS
DEPTH OF FIELD
FOCAL PLANE
Vertebrate eyes and modern cameras use a lens system to gather sufficient light while keeping the image in focus. A large opening is covered with a lens that focuses light from nearby object locations down to nearby locations in the image plane. However, lens systems have a limited depth of field: they can focus light only from points that lie within a range of depths (centered around a focal plane). Objects outside this range will be out of focus in the image. To move the focal plane, the lens in the eye can change shape (Figure 24.3); in a camera, the lenses move back and forth.
24.1.3 Scaled orthographic projection
SCALED ORTHOGRAPHIC PROJECTION
Perspective effects aren't always pronounced. For example, spots on a distant leopard may look small because the leopard is far away, but two spots that are next to each other will have about the same size. This is because the difference in distance to the spots is small compared to the distance to them, and so we can simplify the projection model. The appropriate model is scaled orthographic projection. The idea is as follows: If the depth Z of points on the object varies within some range Z0 ± ΔZ, with ΔZ ≪ Z0, then the perspective scaling factor f/Z can be approximated by a constant s = f/Z0. The equations for projection from the scene coordinates (X, Y, Z) to the image plane become x = sX and y = sY. Scaled orthographic projection is an approximation that is valid only for those parts of the scene with not much internal depth variation. For example, scaled orthographic projection can be a good model for the features on the front of a distant building.
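The quality of the approximation is easy to check numerically. In this hedged sketch (all numbers invented for illustration), points whose depths differ by only a few percent of Z0 project almost identically under full perspective and under the constant factor s = f/Z0.

f, Z0 = 1.0, 100.0           # illustrative focal length and nominal depth
s = f / Z0                   # scaled orthographic factor

points = [(10.0, 5.0, 98.0), (10.0, 5.0, 100.0), (10.0, 5.0, 102.0)]  # small depth spread
for X, Y, Z in points:
    perspective = (f * X / Z, f * Y / Z)
    orthographic = (s * X, s * Y)
    print(perspective, orthographic)
# With the depth variation much smaller than Z0, the two projections agree
# to within about two percent; with large depth variation they would not.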
24.1.4 Light and shading
The brightness of a pixel in the image is a function of the brightness of the surface patch in the scene that projects to the pixel. We will assume a linear model (current cameras have nonlinearities at the extremes of light and dark, but are linear in the middle).
Figure 24.4 A variety of illumination effects. There are specularities on the metal spoon and on the milk. The bright diffuse surface is bright because it faces the light direction. The dark diffuse surface is dark because it is tangential to the illumination direction. The shadows appear at surface points that cannot see the light source. Photo by Mike Linksvayer (mlinksva on flickr).
Image brightness is a strong, if ambiguous, cue to the shape of an object, and from there to its identity. People are usually able to distinguish the three main causes of varying brightness and reverse-engineer the object's properties.
OVERALL INTENSITY
REFLECT
The first cause is the overall intensity of the light. Even though a white object in shadow may be less bright than a black object in direct sunlight, the eye can distinguish relative brightness well, and perceive the white object as white. Second, different points in the scene may reflect more or less of the light. Usually, the result is that people perceive these points as lighter or darker, and so see texture or markings on the object.
SHADING
Third, surface patches facing the light are brighter than surface patches tilted away from the light, an effect known as shading. Typically, people can tell that this shading comes from the geometry of the object, but sometimes get shading and markings mixed up. For example, a streak of dark makeup under a cheekbone will often look like a shading effect, making the face look thinner.
DIFFUSE REFLECTION
Most surfaces reflect light by a process of diffuse reflection. Diffuse reflection scatters light evenly across the directions leaving a surface, so the brightness of a diffuse surface doesn't depend on the viewing direction. Most cloth, paints, rough wooden surfaces, vegetation, and rough stone are diffuse. Mirrors are not diffuse, because what you see depends on the direction in which you look at the mirror.
SPECULAR REFLECTION
SPECULARITIES
The behavior of a perfect mirror is known as specular reflection. Some surfaces--such as brushed metal, plastic, or a wet floor--display small patches where specular reflection has occurred, called specularities. These are easy to identify, because they are small and bright (Figure 24.4). For almost all purposes, it is enough to model all surfaces as being diffuse with specularities.
Figure 24.5 Two surface patches are illuminated by a distant point source, whose rays are shown as gray arrowheads. Patch A is tilted away from the source (θ is close to 90°) and collects less energy, because it cuts fewer light rays per unit surface area. Patch B, facing the source (θ is close to 0°), collects more energy.
DISTANT POINT LIGHT SOURCE
The main source of illumination outside is the sun, whose rays all travel parallel to one another. We model this behavior as a distant point light source. This is the most important model of lighting, and is quite effective for indoor scenes as well as outdoor scenes. The amount of light collected by a surface patch in this model depends on the angle θ between the illumination direction and the normal to the surface.
DIFFUSE ALBEDO
LAMBERT'S COSINE LAW
A diffuse surface patch illuminated by a distant point light source will reflect some fraction of the light it collects; this fraction is called the diffuse albedo. White paper and snow have a high albedo, about 0.90, whereas flat black velvet and charcoal have a low albedo of about 0.05 (which means that 95% of the incoming light is absorbed within the fibers of the velvet or the pores of the charcoal). Lambert's cosine law states that the brightness of a diffuse patch is given by
I = ρ I0 cos θ,
where ρ is the diffuse albedo, I0 is the intensity of the light source, and θ is the angle between the light source direction and the surface normal (see Figure 24.5). Lambert's law predicts that bright image pixels come from surface patches that face the light directly and dark pixels come from patches that see the light only tangentially, so that the shading on a surface provides some shape information. We explore this cue in Section 24.4.5.
SHADOW
INTERREFLECTIONS
AMBIENT ILLUMINATION
If the surface is not reached by the light source, then it is in shadow. Shadows are very seldom a uniform black, because the shadowed surface receives some light from other sources. Outdoors, the most important such source is the sky, which is quite bright. Indoors, light reflected from other surfaces illuminates shadowed patches. These interreflections can have a significant effect on the brightness of other surfaces, too. These effects are sometimes modeled by adding a constant ambient illumination term to the predicted intensity.
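Lambert's law reduces to a dot product once the surface normal and light direction are unit vectors. A minimal sketch, with made-up albedo, light direction, and ambient values that are not from the text:

import math

def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)

def lambert_brightness(albedo, I0, normal, light_dir, ambient=0.0):
    """I = rho * I0 * cos(theta), clamped at zero for patches facing away from
    the light, plus an optional constant ambient term."""
    n = normalize(normal)
    l = normalize(light_dir)
    cos_theta = max(0.0, sum(a * b for a, b in zip(n, l)))
    return albedo * I0 * cos_theta + ambient

# Patch B faces the light (theta near 0); patch A is nearly tangential (theta near 90 degrees).
print(lambert_brightness(0.9, 1.0, normal=(0, 0, 1), light_dir=(0, 0.1, 1)))   # bright
print(lambert_brightness(0.9, 1.0, normal=(1, 0, 0), light_dir=(0, 0.1, 1)))   # near zero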
24.1.5 Color
Fruit is a bribe that a tree offers to animals to carry its seeds around. Trees have evolved to have fruit that turns red or yellow when ripe, and animals have evolved to detect these color changes. Light arriving at the eye has different amounts of energy at different wavelengths; this can be represented by a spectral energy density function. Human eyes respond to light in the 380-750 nm wavelength region, with three different types of color receptor cells, which have peak receptiveness at 420 nm (blue), 540 nm (green), and 570 nm (red). The human eye can capture only a small fraction of the full spectral energy density function--but it is enough to tell when the fruit is ripe.
PRINCIPLE OF TRICHROMACY
The principle of trichromacy states that for any spectral energy density, no matter how complicated, it is possible to construct another spectral energy density consisting of a mixture of just three colors--usually red, green, and blue--such that a human can't tell the difference between the two. That means that our TVs and computer displays can get by with just the three red/green/blue (or R/G/B) color elements. It makes our computer vision algorithms easier, too. Each surface can be modeled with three different albedos for R/G/B. Similarly, each light source can be modeled with three R/G/B intensities. We then apply Lambert's cosine law to each to get three R/G/B pixel values. This model predicts, correctly, that the same surface will produce different colored image patches under different-colored lights.
COLOR CONSTANCY
In fact, human observers are quite good at ignoring the effects of different colored lights and are able to estimate the color of the surface under white light, an effect known as color constancy. Quite accurate color constancy algorithms are now available; simple versions show up in the "auto white balance" function of your camera. Note that if we wanted to build a camera for mantis shrimp, we would need 12 different pixel colors, corresponding to the 12 types of color receptors of the crustacean.
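The trichromatic model described here amounts to running the shading computation once per channel. A short, hedged continuation of the previous sketch, with invented albedos and light intensities:

# Per-channel Lambert shading: three albedos for the surface, three intensities for the light.
surface_albedo = (0.9, 0.3, 0.2)      # reddish surface (R, G, B), invented values
light_rgb      = (1.0, 1.0, 0.8)      # slightly warm light, invented values
cos_theta      = 0.8                  # geometry term shared by all three channels

pixel = tuple(rho * I0 * cos_theta for rho, I0 in zip(surface_albedo, light_rgb))
print(pixel)   # the same surface under a different-colored light gives a different pixel value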
24.2
EARLY IMAGE-PROCESSING OPERATIONS
We have seen how light reflects off objects in the scene to form an image consisting of, say, five million 3-byte pixels. With all sensors there will be noise in the image, and in any case there is a lot of data to deal with. So how do we get started on analyzing this data? In this section we will study three useful image-processing operations: edge detection, texture analysis, and computation of optical flow. These are called "early" or "low-level" operations because they are the first in a pipeline of operations. Early vision operations are characterized by their local nature (they can be carried out in one part of the image without regard for anything more than a few pixels away) and by their lack of knowledge: we can perform these operations without consideration of the objects that might be present in the
scene. This makes the low-level operations good candidates for implementation in parallel hardware-either in a graphics processor unit (GPU) or an eye. We will then look at one mid-level operation: segmenting the image into regions.
Figure 24.6 Different kinds of edges: (1) depth discontinuities; (2) surface orientation discontinuities; (3) reflectance discontinuities; (4) illumination discontinuities (shadows).
24.2.1 Edge detection
EDGE
Edges are straight lines or curves in the image plane across which there is a "significant" change in image brightness. The goal of edge detection is to abstract away from the messy, multimegabyte image and toward a more compact, abstract representation, as in Figure 24.6. The motivation is that edge contours in the image correspond to important scene contours. In the figure we have three examples of depth discontinuity, labeled
1 ; two surface-normal
discontinuities, labeled 2; a reflectance discontinuity, labeled 3; and an ilhunination discon tinuity (shadow), labeled
4. Edge detection is concerned only with the image, and thus does
not distinguish between these different types of scene discontinuities; later processing will. Figure 24.7(a) shows an image of a scene containing a stapler resting on a desk, and (b) shows the output of an edge-detection algorithm on this image. As you can see, there is a difference between the output and an ideal line drawing. There are gaps where no edge appears, and there are "noise" edges that do not correspond to anything of significance in the scene. Later stages of processing will have to correct for these errors. How do we detect edges in an image? Consider the profile of image brightness along a one-dimensional cross-section perpendicular to an edge-for example, the one between the left edge of the desk and the wall. It looks something like what is shown in Figure 24.8 (top). Edges correspond to locations in images where the brightness undergoes a sharp change, so
a naive idea would be to differentiate the image and look for places where the magnitude
of the derivative I'(x) is large. That almost works. In Figure 24.8 (middle), we see that there is indeed a peak at x = 50 , but there are also subsidiary peaks at other locations (e.g., x = 75).
These arise because of the presence of noise in the image. If we smooth the image first, the spurious peaks are diminished, as we see in the bottom of the figure.
Figure 24.7 (a) Photograph of a stapler. (b) Edges computed from (a).
Figure 24.8 Top: Intensity profile I(x) along a one-dimensional section across an edge at x = 50. Middle: The derivative of intensity, I'(x). Large values of this function correspond to edges, but the function is noisy. Bottom: The derivative of a smoothed version of the intensity, (I * Gσ)', which can be computed in one step as the convolution I * G'σ. The noisy candidate edge at x = 75 has disappeared.
The measurement of brightness at a pixel in a CCD camera is based on a physical process involving the absorption of photons and the release of electrons; inevitably there will be statistical fluctuations of the measurement--noise. The noise can be modeled with
GAUSSIAN FILTER
a Gaussian probability distribution, with each pixel independent of the others. One way to smooth an image is to assign to each pixel the average of its neighbors. This tends to cancel out extreme values. But how many neighbors should we consider--one pixel away, or two, or more? One good answer is a weighted average that weights the nearest pixels the most, then gradually decreases the weight for more distant pixels. The Gaussian filter does just that. (Users of Photoshop recognize this as the Gaussian blur operation.) Recall that the Gaussian function with standard deviation σ and mean 0 is
Nσ(x) = 1/√(2πσ²) · e^(-x²/(2σ²))   in one dimension, or
Nσ(x, y) = 1/(2πσ²) · e^(-(x²+y²)/(2σ²))   in two dimensions.
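The smoothing-plus-differentiation step of Figure 24.8 can be written as a single convolution with the derivative of a Gaussian, since (I * Gσ)' = I * G'σ. The sketch below uses NumPy; the kernel radius, the value of σ, and the synthetic step-plus-noise signal are illustrative choices, not values taken from the text.

import numpy as np

def gaussian_deriv_kernel(sigma, radius=None):
    """Sampled derivative of a Gaussian, G'_sigma(x) = -x / sigma^2 * G_sigma(x)."""
    if radius is None:
        radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    return -x / sigma**2 * g

# Synthetic 1-D intensity profile: a step edge at x = 50 plus noise.
rng = np.random.default_rng(0)
I = np.concatenate([np.zeros(50), np.ones(50)]) + 0.1 * rng.standard_normal(100)

response = np.convolve(I, gaussian_deriv_kernel(sigma=2.0), mode="same")
print(np.argmax(np.abs(response)))   # close to 50: the smoothed derivative peaks at the edge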
The action Roll(D) has the effect of changing state (x, y, φ) to (x + D cos(φ), y + D sin(φ), φ), and the action Rotate(θ) has the effect of changing state (x, y, φ) to (x, y, φ + θ).
a. Suppose that the robot is initially at (0, 0, 0) and then executes the actions Rotate(60°), Roll(1), Rotate(25°), Roll(2). What is the final state of the robot?
Figure 25.33 Simplified robot in a maze. See Exercise 25.9.
b. Now suppose that the robot has imperfect control of its own rotation, and that, if it attempts to rotate by θ, it may actually rotate by any angle between θ - 10° and θ + 10°. In that case, if the robot attempts to carry out the sequence of actions in (a), there is a range of possible ending states. What are the minimal and maximal values of the x-coordinate, the y-coordinate, and the orientation in the final state?
c. Let us modify the model in (b) to a probabilistic model in which, when the robot attempts to rotate by θ, its actual angle of rotation follows a Gaussian distribution with mean θ and standard deviation 10°. Suppose that the robot executes the actions Rotate(90°), Roll(1). Give a simple argument that (a) the expected value of the location at the end is not equal to the result of rotating exactly 90° and then rolling forward 1 unit, and (b) that the distribution of locations at the end does not follow a Gaussian. (Do not attempt to calculate the true mean or the true distribution.)
The point of this exercise is that rotational uncertainty quickly gives rise to a lot of positional uncertainty and that dealing with rotational uncertainty is painful, whether uncertainty is treated in terms of hard intervals or probabilistically, due to the fact that the relation between orientation and position is both non-linear and non-monotonic.
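Parts (a) and (c) lend themselves to a few lines of simulation. A hedged Python sketch follows; the Gaussian rotation noise matches the model stated in part (c), while the sample size and the function name simulate are assumptions made only for illustration.

import math
import random

def simulate(actions, rotation_noise_deg=0.0, rng=random):
    """Apply a sequence of ('rotate', degrees) / ('roll', distance) actions to the
    state (x, y, phi), with optional Gaussian noise added to each rotation."""
    x, y, phi = 0.0, 0.0, 0.0
    for kind, arg in actions:
        if kind == "rotate":
            phi += math.radians(arg + rng.gauss(0.0, rotation_noise_deg))
        else:  # roll
            x += arg * math.cos(phi)
            y += arg * math.sin(phi)
    return x, y, phi

plan = [("rotate", 60), ("roll", 1), ("rotate", 25), ("roll", 2)]
print(simulate(plan))                          # deterministic final state for part (a)

# Part (c): Monte Carlo estimate of where Rotate(90 degrees), Roll(1) actually ends up.
samples = [simulate([("rotate", 90), ("roll", 1)], rotation_noise_deg=10.0)
           for _ in range(10000)]
mean_x = sum(s[0] for s in samples) / len(samples)
mean_y = sum(s[1] for s in samples) / len(samples)
print(mean_x, mean_y)   # the mean y falls slightly short of 1, illustrating the argument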
25.9 Consider the simplified robot shown in Figure 25.33. Suppose the robot's Cartesian
coordinates are known at all times, as are those of its goal location. However, the locations
of the obstacles are unknown. The robot can sense obstacles in its immediate proximity, as illustrated in this figure. For simplicity, let us assume the robot's motion is noise-free, and the state space is discrete.

In Proc. IEEE Conference on Data Engineering.
Agmon, S. (1954). The relaxation method for linear inequalities. Canadian Journal of Mathematics, 6(3), 382-392.
Agre, P. E. and Chapman, D. (1987). Pengi: An implementation of a theory of activity. In IJCAI-87, pp. 268-272.
Aho, A. V., Hopcroft, J., and Ullman, J. D. (1974). The Design and Analysis of Computer Algorithms. Addison-Wesley.
Aizerman, M., Braverman, E., and Rozonoer, L. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25, 821-837.
Ai-Chang, M., Bresina, J., Charest, L., Chase, A., Hsu, J., Jonsson, A., Kanefsky, B., Morris, P., Rajan, K., Yglesias, J., Chafin, B., Dias, W., and Maldague, P. (2004). MAPGEN: Mixed-initiative planning and scheduling for the Mars Exploration Rover mission. IEEE Intelligent Systems, 19(1), 8-12.
Albus, J. S. (1975). A new approach to manipulator control: The cerebellar model articulation controller (CMAC). J. Dynamic Systems, Measurement, and Control, 97, 220-227.
Aldous, D. and Vazimni, U. (1994). "Go with the winners" lllgorilhms. InFOCS-94. pp. 492...0 .5 1.
Alekhnovich, M., Hirsch, E. A., and llsykson, D.
(200S).
Exponential low..- bounds
time of DPLL algorilhms
on
for lhe running
sarisfuble formulas.
JAR, 35(1-3), S1-72. Allais, M. (19S3). Le oomportmeol de l'bomme
rorionnel de van1 Ia risque: critique des postulats et uicmes de l'Ooole Americaine. Econometrica, 21, 501-546.
Allrn, J. F. (1983).
Maintaining knowledge abool
temporal intervals. CACM, 26(1 I), &32-843.
Allrn, J. F. (1984). Towards a geneml lbeory of ac tion and time. All, 21, 123-IS,-e-bosed approach 10 connect four. The game is solved: White wins. Mas ter•s thesis. Vrije Univ.. Arnsterdam. Almuallim, H. and Dietterich, T. (1991). Learning
with many irrelevant pp. 541-552.
realures.
In AAAI-91, Vol. 2,
ALPAC (1966). Language and machines: Computers in translation and linguistics. Tech. rep. 1416, The Automatic Language Processing Advisory Committee of the National Academy of Sciences.
Alterman, R. (1988). Adaptive planning. Cognitive Science, 12, 393-422.
Amarel, S. (1967). An approoch lo heuristic problem-solving and theorem proving in lhe propo sitional calculus. In Hart, J. and Takasu, S. (Eds.), SysttmsllfldCoffJ>ulUScience.University of Toronto Press. An1arel, S. (1968). On representations of prob lems of reasooing about actions. In Michie, D. (Ed.), Machine bttdligenct 3, Vol. 3, pp. 131-171. Elsevier/Nonh-Holl.m:l. Amir, E. and Russell,S.I. (2003). Logical filtering. In UCAI.Ol.
Amit, D., Gulfrennd, H., and Sompolinsky. H. (198S). Spin-glass models ofneuraloetwod<s. Phys ical &vkw,A 32, 1007-1018.
Aodustn. S. K., Olesen. K. G., Jensen, F. V., and Ienseo, F. (1989). HUGIN-A sheD fa bJildin8
Bayesian belief �...-ses for expert systems. In IJCAI-89, Vol. 2, pp. 1080-1085.
Andersoo, J. R. (1980). Cognitiw< Psyclwlogyand lu lrrv>licarM. io W.H. Fl-eeman. Anderson, J. R. (1983). TheArrhitecturt ofCogni
tion. Harvard University Press.
Andoni, A. and lndyk, P. (2006). Near-optimal hash ing: algorithms for approximate nearest neighbor in
high dimensions. In FOCS-06. Andre, D. and Russel�S.1.(2002). State abstraction for programmable reinforcement learning agents. In AAAI-02, pp. 119-125. Anthony, M. aod Bartlen, P. (1999). NI!W'al Net wort I.Laming: Theoretical Foundations. Cam bridge Unjversity Press.
Aoki, M. (1965). Optimal control of partially ob servable Marln. In Neural lnfomll1tion Processing Systems,Vol. 14.
Blinder, A. S. (1983). Issues in the coordination of monetary and fiscal policies. In Monetary Palicy Issues n i the 1980s. Federal Reserve Bank, Kansas Cit y, Missouri.
Bliss, C. I. (1934). The methodofprobits. Science, 79(2037), 38-39.
Olock, H. D., Knight, B., and Rosenblau, F. (1962). Analysis of a four-layer series-coupled perceptron. Rev. Modem Physics, 34(1), 275-282.
Blum, A. L. and Furst, M. (1995). Fast planning through planning graph analysis. In IJCAJ-95, pp. 1636-1642.
Blum, A. L. and Furst, M. (1997). Fast planning through planning graph arualysis. AIJ, 90(1-2), 281300.
Blum, A. L. (1996). On-ine l algorithms in machine learning. In Proc. IVorkshop on On-Line Algorithms, Dagstuhl, pp. 306-325. Blum, A. L. and MitcbeU, T. M. (1998). Combin ing labeled and unlabeled data with co-training. In COLT-98, pp. 92-100.
lllumer, A., Ebrenfeucht, A., Haussler, D., and War muth, M. (1989). Leamability and the Vapnik Cbervonenkis dimension. JACM, 36(4), 929-965.
Bobrow, D. G. (1967). Natural language input for a
Bishop, C. M. (1995). Neural Networksfor Pattern Recognition. Oxford University Press.
Boden, M. A. (Ed.). (1990). The Philosophy ofAr
tijiciallnteUigence. Oxford University Press.
Bolognesi, A. and Cianca.rini, P. (2003). Computer programming ofkriegspiel endings: The case of KR i CompttterGantes /0, vs. k. lnAdmnces n
Bonet, B. and Geffner, H. (1999).
Planning as
372.
Bonet, B. and Geffner, H. (2000). Planning with
Bishop, C. M. (2007). Pattern Recognition and Ma
incomplete information as heuristic search in belief space. In ICAPS.OO,pp. 52-61.
Bisson, T.(l990). They'remadeoutofmeat Omni
ter than AC/'? InAAAI-OS.
chine Learning. Springer-Verlag.
Magazint.
Bistarelli, S., Montanar� U., and Rossi, F. (1997).
Semiring:�ased constraint satisfaction andoptimiza tion. lACM, 44(2),201-236.
Bitner, J. R. and Reingold, E. M. (1975). Backtrnck
programming techniq ues. CACM, /8(1 1), 651-656.
Bizer, C., Auer, S., Kobilarov, G., Lehmann, J., and Cyganiak , R. (2007). DBPedia -querying wikipeia d
like a database. In Developers Track. Prtsentalion at the 16th International Conference on World IVide Web.
Borgida, A., Bracltman, R. J., McGuinness, D., and Alperin Resnick, L. (1989). CLASSIC: A structural datamodel forobjects. SIGMODRecord, /8(2), 5867.
Boroditsky, L. (2003). Linguistic relativity. In Nadel, L. (Ed.), Encyclopedia ofCognitive Science, pp. 917-921. Macmillan.
Boser, B., Ouyon, L, and Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In COU-92. Bosse, M., Newman, P., Leonard, J., Soika, M., Feiten, W., and Teller, S. (2004). Simultaneous localization and map building in large-scale cyclic environments using the atlas framework. Int. J.
Robotics Res•arr:h, 23(12), 1113-1139. Bourzutschky, endgames
M. (2006). 7-man with pawns. CCRL Discussion
Board, kirill -kryukov. com/chess; discussion-board;viewcoplc.php?t= 805.
Boutilier, C. and Brafman, R. I. (2001). Panial order planning with concurrent interacting actio ns.
JAJR, /4, 105-136.
Boutilier, C., Dearden, R., and Goldszmidt, M.
(2000). Stochastic dynamic programmiug with fac toredrepresentations. All, 121, 49-107.
Boden, M. A. (1977). Anificial Intelligence and NaturalMan. Basic Books.
heuristic search: New results. In ECP-99, pp. 360-
Essays on Fundaons ti of
tion, 7(3), 278-288.
Boozy, B. and Cazenave, T. (2001). Computer go: AnAJoriented survey. AJJ, 132(1), 39-103.
Bobrow, D. G., Kaplan, R-, Kay, M.,Nonnan, D. A., Thompson, H., and Winograd, T. (1977). GUS, a frame driven dialog system. All, 8, 155-173.
Game Theory. Pitman.
Binmore, K. (1982).
Y. (1991). The vector field histogram-Fast obstacle avoidance for mobile robots. IEEE Tronsactions on Robotics andAutonUl
215. MIT Press.
Science and Cybernetics Conference, Miami.
puter. Invited paper presented at the IEEE Systems
Bore.nstein, J. and Koren.,
Boutilier, C., Reiter, R., and Price, B. (2001). Sym bolic dynamic programming forfirst-order MOPs. In IJCAI-01, PP- 467-472.
computerproblem solving system. In Minsky, M. L. (Ed.), Semantic lnfomJation Processing, pp. 133-
Bonet, B. (2002). An epsilon-optimal grid-based algorithm for partially observable Markov decision processes. InICML-02, pp. 51-58.
Binford, T. 0. (1971). Visual perception by com
Borenstein, J., E•'eretl, B., and Feng, L. (1996). Navigan ti gMobile Robots: Systems and Techniques. A. K. Peters, Ltd.
Bonet, B. and Geffner, H. (2005). An algorithm bet
Boole, G. (1847). The Mathematical Anal ysis of Logic: Being an Essay towardsa Calculus ofDeduc tive Reasoning. Macmillan, Barclay, andMacmillan, Cambridge.
Booth, T. L. (1969). Probabilistic representation of formA! l:m&W�se•. In IEEE Confmnu R�cord of the 1969 TtnthAnnualS ymposium on Switching and
Automata Theory, pp. 74-81.
Borel, E. (1921). La theorie du jeu et lesequations
intt�grales a noyau sy&:nel!rique. Comptes Rendus Hebdomadairts des seances de l'Acadimie des Sci ences, 173, 1304-1308.
Boutilier, c_, Friedman, N., Goldszrnidt, M., and Koller, D. (1996). Context-specific independence n i Bayesiannetworks. In UAI-96,pp. l l5-123.
Bowerman, M. and Levinson, S. (2001). Language
acquisition. and conceptualdevelopment. Cambridge
University Press.
Bowling, M., Johanson, M., Burcb, N.,andSzafron, D. (2008). Strategy evaluation in extensive games with importance sampling. In /CM/.,08.
Box, G. E. P. (1957). Evolutionary operation: A i method ofincreasing industrialproductivty.Apple id
Statistics,6, 81-101.
Box, G. E. P:., Jenkins, G., and Reinsel, G. (1994). Ttme SeriesAnalysis: Forecasting and Control (3rd edition). Prentice Hall.
Boyan, J. A. (2002). Technical update: Least squares temporal difference learning. Machine Learning, 49(2-3), 233-246.
Boyan, J. A. and Moore, A. W. (1998). Learn ing evaluation functions for global optimization and Boolean satisfiability. In AAAI-98. Bod, r ghe, L. (2004). Convex Op y S. and Vandenbe
timization. CambridgeU niversity Press.
Boyen, X., Friedman, N., andKoller,D. (1 999). Dis covering the hidden structure of complex dynamic systems. In UAJ-99.
Boyer, R. S. and Moore, J. S. (1979). A Computa tional Logic. Academjc Press,
Boyer, R. S. andMoore, J. S. (1984). Proofchecking the RSApublic key encryption algorithm. American
MathematicalMonthly, 9/(3), 181-189.
Brachman, R. J. (1979). On the epistemological status of semantic networks. In Findler, N. V. (Ed.), Associative Networks: Representation and Use of Knowledge by Computers, pp. 3-50. Academic Press.
Brachman, R. J., Fikes, R. E., and Levesque, H. J. (1983). Krypton: A functional approach to knowledge representation. Computer, 16(10), 67-73.
Brachman, R. J. and Levesque, H. J. (Eds.). (1985).
Readings n i Knowledge Representation. Morgan Kaufmann. Bradtke,S. J. and Barto, A. G. (1996). Linear least
squares algorithms for temporal difference learning Machineuaming, 22, 33-57.
Brallnan, 0. and Braftnan, R. (2009). Sway: The Irresistible Pull ofln'ational Behavior. Broadway
Business.
Brafman, R.I. andDomshlak, C. (2008). From one
to many: Planning for loosely coupled multi-agent systems. InICAPS-08, pp. 28-35.
Brafman, R.I. and Tennen!holtz, M. (2000). A near
optimal polynomial time algorithm for learning in oertain classesofstochastic games. AJJ, 121,31-47.
Braitenberg, V. (1984). Vthic/es: Experiments in Synthetic Psychology. MIT Press. Bransford, J. and Johnsoo, M. (1973). Considera
tion of some problems in comprehension. In Chase. W. G. (Ed), ViStlal Informao i g. Aca ti n Processn demic Press.
Brants, T., Popat, A. C., Xu.,P., Och, F. J., and Dean, J. (2007). Large language models in machine trans lation. In PMNLP-CoNIL-2007: Proc. 2007 Joint Conference on Empirical Methods n i Natural Lan guage Processing and Con�putarional Nahlml lAn guageuaming, pp. 85�67. Bratko, I. (1986). Prolog Programmingfor Artifi ciallntelligrnce (1st edition). Addison-Wesley.
Bratko, I. (2001). Prolog Programming for Artifi ciallnrelligtnce (Third edition). Addison-Wesley.
Bratman, M. E. (1987). Intention, Plans, and Prac ticalReason. Harvard University Press. Bratman, M. E. (1992). Planning and the stability ofintention. Minds andMa,chines, 2(1), l-16.
Breese, J. S. (1992). Construction ofbelief and de cisionnetworlcs. Computational Intelligence, 8(4),
624-647.
Breese, J. S. and Heckerman, D. (1996). O.cision
theorec ti troubleshooting: A frarneworlc for repair and experiment. In UAI-96, pp. 124-132. Breiman, L. (1996). Baggingpredictors. Machine
Learning, 24(2), 123-140.
Breiman, L., Friedman, J., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees.
Wadsworth International Group.
Brelaz, D. (1979). Newmethods to colorthe vertioes ofa graph. CACM, 22(4), 251-256.
Brent, R. P. (1973). Algrithms for minimization without derivatil·es. Prentioe-Hall.
Bresnan, J. (1982). The Mental Representation of Gmmmarical Relations. MITPress. Bre.wka, G., Dix, J., and Konolige, K. (1997). NoMnotonic Reasoning: An Oven•iew. CSU Publi
cations.
Brickley, D. and Gu.ha, R. V. (2004). RDF vocab
ulary description language J.0: RDF schema. Tech.
rep., W3C.
Bridle, J. S. (1990). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Fogelman Soulié, F. and Hérault, J. (Eds.), Neurocomputing: Algorithms, Architectures and Applications.
Cheeseman, P., Self, M., Kelly, J., and Stutz, J. (1988). Bayesian classification. In AAAI-88, Vol. 2, pp. 607-611.
Cheeseman, P. and Stutz, J. (1996). Bayesian classification (AutoClass): Theory and results. In Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R. (Eds.), Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press.
Chen, S. F. and Goodman, J. (1996). An empirical study of smoothing techniques for language modeling. In ACL-96, pp. 310-318.
Cheng, J. and Druzdzel, M. J. (2000). AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks. JAIR, 13, 155-188.
Cheng, J., Greiner, R., Kelly, J., Bell, D. A., and Liu, W. (2002). Learning Bayesian networks from data: An information-theory based approach. AIJ, 137, 43-90.
Chklovski, T. and Gil, Y. (2005). Improving the design of intelligent acquisition interfaces for collecting world knowledge from web contributors. In Proc. Third International Conference on Knowledge Capture (K-CAP).
Chomsky, N. (1956). Three models for the description of language. IRE Transactions on Information Theory, 2(3), 113-124.
Chomsky, N. (1957). Syntactic Structures. Mouton.
Choset, H. (1996). Sensor Based Motion Planning: The Hierarchical Generalized Voronoi Graph. Ph.D. thesis, California Institute of Technology.
Cimatti, A., Roveri, M., and Traverso, P. (1998). Automatic OBDD-based generation of universal plans in non-deterministic domains. In AAAI-98, pp. 875-881.
Clark, A. (1998). Being There: Putting Brain, Body, and World Together Again. MIT Press.
Clark, A. (2008). Supersizing the Mind: Embodiment, Action, and Cognitive Extension. Oxford University Press.
Clark, K. L. (1978). Negation as failure. In Gallaire, H. and Minker, J. (Eds.), Logic and Data Bases, pp. 293-322. Plenum.
Cohen, W. W. and Page, C. D. (1995). Polynomial learnability and inductive logic programming: Methods and results. New Generation Computing, 13(3-4), 369-409.
Cohn, A. G., Bennett, B., Gooday, J. M., and Gotts, N. (1997). RCC: A calculus for region-based qualitative spatial reasoning. GeoInformatica, 1, 275-316.
Collin, Z., Dechter, R., and Katz, S. (1999). Self-stabilizing distributed constraint satisfaction. Chicago Journal of Theoretical Computer Science.
Colmerauer, A. (1975). Les grammaires de métamorphose. Tech. rep., Groupe d'Intelligence Artificielle, Université de Marseille-Luminy.
Colmerauer, A., Kanoui, H., Pasero, R., and Roussel, P. (1973). Un système de communication homme-machine en français. Rapport, Groupe d'Intelligence Artificielle, Université d'Aix-Marseille II.
Condon, J. H. and Thompson, K. (1982). Belle chess hardware. In Clarke, M. R. B. (Ed.), Advances in Computer Chess 3, pp. 45-54. Pergamon.
Congdon, C. B., Huber, M., Kortenkamp, D., Bidlack, C., Cohen, C., Huffman, S., Koss, F., Raschke, U., and Weymouth, T. (1992). CARMEL versus Flakey: A comparison of two robots. Tech. rep., Papers from the AAAI Robot Competition RC-92-01, American Association for Artificial Intelligence.
Conlisk, J. (1989). Three variants on the Allais example. American Economic Review.
Cooper, G. (1990). The computational complexity of probabilistic inference using Bayesian belief networks. AIJ, 42, 393-405.
Cooper, G. and Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309-347.
Copeland, J. (1993). Artificial Intelligence: A Philosophical Introduction. Blackwell.
Cox, R. T. (1946). Probability, frequency, and reasonable expectation. American Journal of Physics, 14(1), 1-13.
Craig, J. (1989). Introduction to Robotics: Mechanics and Control (2nd edition). Addison-Wesley.
Craik, K. J. (1943). The Nature of Explanation. Cambridge University Press.
Craswell, N., Zaragoza, H., and Robertson, S. E. (2005). Microsoft Cambridge at TREC-14: Enterprise track. In Proc. Fourteenth Text REtrieval Conference.
Crauser, A., Mehlhorn, K., Meyer, U., and Sanders, P. (1998). A parallelization of Dijkstra's shortest path algorithm. In Proc. 23rd International Symposium on Mathematical Foundations of Computer Science, pp. 722-731.
Cross, S. E. and Walker, E. (1994). DART: Applying knowledge-based planning and scheduling to crisis action planning. In Zweben, M. and Fox, M. S. (Eds.), Intelligent Scheduling, pp. 711-729. Morgan Kaufmann.
Cruse, D. A. (1986). Lexical Semantics. Cambridge University Press.
Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T. M., Nigam, K., and Slattery, S. (2000). Learning to construct knowledge bases from the World Wide Web. AIJ, 118(1/2), 69-113.
Crawford, J. M. and Auton, L. D. (1993). Experimental results on the crossover point in satisfiability problems. In AAAI-93, pp. 21-27.
Cristianini, N. and Hahn, M. (2007). Introduction to Computational Genomics: A Case Studies Approach. Cambridge University Press.
Cristianini, N. and Schölkopf, B. (2002). Support vector machines and kernel methods: The new generation of learning machines. AIMag, 23(3), 31-41.
Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press.
Crockett, L. (1994). The Turing Test and the Frame Problem: AI's Mistaken Understanding of Intelligence. Ablex.
Croft, B., Metzler, D., and Strohman, T. (2009). Search Engines: Information Retrieval in Practice. Addison-Wesley.
Culberson, J. and Schaeffer, J. (1996). Searching with pattern databases. In Advances in Artificial Intelligence.
Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human detection. In CVPR.
Dantzig, G. B. (1949). Programming of interdependent activities: II. Mathematical model. Econometrica, 17, 200-211.
Darwiche, A. (2001). Recursive conditioning. AIJ, 126, 5-41.
Darwiche, A. and Ginsberg, M. L. (1992). A symbolic generalization of probability theory. In AAAI-92, pp. 622-627.
Darwin, C. (1859). On the Origin of Species by Means of Natural Selection. J. Murray, London.
Dasgupta, P., Chakrabarti, P. P., and de Sarkar, S. C. (1994). Agent searching in a tree and the optimality of iterative deepening. AIJ, 71, 195-208.
de Marcken, C. (1996). Unsupervised Language Acquisition. Ph.D. thesis, MIT.
De Morgan, A. (1864). On the syllogism, No. IV, and on the logic of relations. Transactions of the Cambridge Philosophical Society, X, 331-358.
De Raedt, L. (1992). Interactive Theory Revision: An Inductive Logic Programming Approach. Academic Press.
de Salvo Braz, R., Amir, E., and Roth, D. (2007). Lifted first-order probabilistic inference. In Getoor, L. and Taskar, B. (Eds.), Introduction to Statistical Relational Learning. MIT Press.
Deacon, T. W. (1997). The Symbolic Species: The Co-evolution of Language and the Brain. W. W. Norton.
Deale, M., Yvanovich, M., Schnitzius, D., Kautz, H., Carpenter, B., Zweben, M., Davis, G., and Daun, B. (1994). The space shuttle ground processing scheduling system. In Zweben, M. and Fox, M. S. (Eds.), Intelligent Scheduling, pp. 423-449. Morgan Kaufmann.
Dillenburg, J. F. and Nelson, P. C. (1994). Perimeter search. AIJ, 65(1), 165-178.
Dissanayake, G., Newman, P., Clark, S., Durrant-Whyte, H., and Csorba, M. (2001). A solution to the simultaneous localisation and map building (SLAM) problem. IEEE Transactions on Robotics and Automation, 17(3), 229-241.
Do, M. B. and Kambhampati, S. (2001). Sapa: A domain-independent heuristic metric temporal planner. In ECP-01.
Do, M. B. and Kambhampati, S. (2003). Planning as constraint satisfaction: Solving the planning graph by compiling it into CSP. AIJ, 132(2), 151-182.
Doctorow, C. (2001). Metacrap: Putting the torch to seven straw-men of the meta-utopia. www.well.com/~doctorow/metacrap.htm.
Domingos, P. and Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103-130.
Domingos, P. and Richardson, M. (2004). Markov logic: A unifying framework for statistical relational learning. In Proc. ICML-04 Workshop on Statistical Relational Learning.
Donninger, C. and Lorenz, U. (2004). The chess monster Hydra. In Proc. 14th International Conference on Field-Programmable Logic and Applications, pp. 927-932.
Doorenbos, R. (1994). Combining left and right unlinking for matching a large number of learned rules. In AAAI-94.
Doran, J. and Michie, D. (1966). Experiments with the graph traverser program. Proc. Royal Society of London, 294, Series A, 235-259.
Dorf, R. C. and Bishop, R. H. (2004). Modern Control Systems (10th edition). Prentice-Hall.
Doucet, A. (1997). Monte Carlo methods for Bayesian estimation of hidden Markov models: Application to radiation signals. Ph.D. thesis, Université de Paris-Sud.
Doucet, A., de Freitas, N., and Gordon, N. (2001). Sequential Monte Carlo Methods in Practice. Springer-Verlag.
Doucet, A., de Freitas, N., Murphy, K., and Russell, S. J. (2000). Rao-Blackwellised particle filtering for dynamic Bayesian networks. In UAI-00.
Dowling, W. F. and Gallier, J. H. (1984). Linear-time algorithms for testing the satisfiability of propositional Horn formulas. J. Logic Programming, 1, 267-284.
Dowty, D., Wall, R., and Peters, S. (1991). Introduction to Montague Semantics. D. Reidel.
Doyle, J. (1979). A truth maintenance system. AIJ, 12(3), 231-272.
Doyle, J. (1983). What is rational psychology? Toward a modern mental philosophy. AIMag, 4(3), 50-53.
Doyle, J. and Patil, R. (1991). Two theses of knowledge representation: Language restrictions, taxonomic classification, and the utility of representation services. AIJ, 48(3), 261-297.
Drabble, B. (1990). Mission scheduling for spacecraft: Diaries of T-SCHED. In Expert Planning Systems, pp. 76-81. Institute of Electrical Engineers.
Dredze, M., Crammer, K., and Pereira, F. (2008). Confidence-weighted linear classification. In ICML-08, pp. 264-271.
Dreyfus, H. L. (1972). What Computers Can't Do: A Critique of Artificial Reason. Harper and Row.
Dreyfus, H. L. (1992). What Computers Still Can't Do: A Critique of Artificial Reason. MIT Press.
Dreyfus, H. L. and Dreyfus, S. E. (1986). Mind over Machine: The Power of Human Intuition and Expertise in the Era of the Computer. Blackwell.
Dreyfus, S. E. (1969). An appraisal of some shortest-paths algorithms. Operations Research, 17, 395-412.
Dubois, D. and Prade, H. (1994). A survey of belief revision and updating rules in various uncertainty models. Int. J. Intelligent Systems, 9(1), 61-100.
Duda, R. O., Gaschnig, J., and Hart, P. E. (1979). Model design in the Prospector consultant system for mineral exploration. In Michie, D. (Ed.), Expert Systems in the Microelectronic Age, pp. 153-167. Edinburgh University Press.
Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley.
Duda, R. O., Hart, P. E., and Stork, D. G. (2001). Pattern Classification (2nd edition). Wiley.
Dudek, G. and Jenkin, M. (2000). Computational Principles of Mobile Robotics. Cambridge University Press.
Duffy, D. (1991). Principles of Automated Theorem Proving. John Wiley & Sons.
Dunn, H. L. (1946). Record linkage. Am. J. Public Health, 36(12), 1412-1416.
Durfee, E. H. and Lesser, V. R. (1989). Negotiating task decomposition and allocation using partial global planning. In Huhns, M. and Gasser, L. (Eds.), Distributed AI, Vol. 2. Morgan Kaufmann.
Durme, B. V. and Pasca, M. (2008). Finding cars, goddesses and enzymes: Parametrizable acquisition of labeled instances for open-domain information extraction. In AAAI-08, pp. 1243-1248.
Dyer, M. (1983). In-Depth Understanding. MIT Press.
Dyson, G. (1998). Darwin Among the Machines: The Evolution of Global Intelligence. Perseus Books.
Džeroski, S., Muggleton, S. H., and Russell, S. J. (1992). PAC-learnability of determinate logic programs. In COLT-92, pp. 128-135.
Earley, J. (1970). An efficient context-free parsing algorithm. CACM, 13(2), 94-102.
Edelkamp, S. (2009). Scaling search with symbolic pattern databases. In Model Checking and Artificial Intelligence (MOCHART), pp. 49-65.
Edmonds, J. (1965). Paths, trees, and flowers. Canadian Journal of Mathematics, 17, 449-467.
Edwards, P. (Ed.). (1967). The Encyclopedia of Philosophy. Macmillan.
Een, N. and Sörensson, N. (2003). An extensible SAT-solver. In Giunchiglia, E. and Tacchella, A. (Eds.), Theory and Applications of Satisfiability Testing: 6th International Conference (SAT 2003). Springer-Verlag.
Eiter, T., Leone, N., Mateis, C., Pfeifer, G., and Scarcello, F. (1998). The KR system dlv: Progress report, comparisons and benchmarks. In KR-98, pp. 406-417.
Elio, R. (Ed.). (2002). Common Sense, Reasoning, and Rationality. Oxford University Press.
Elkan, C. (1993). The paradoxical success of fuzzy logic. In AAAI-93, pp. 698-703.
Elkan, C. (1997). Boosting and naive Bayesian learning. Tech. rep., Department of Computer Science and Engineering, University of California, San Diego.
Ellsberg, D. (1962). Risk, Ambiguity, and Decision. Ph.D. thesis, Harvard University.
Elman, J., Bates, E., Johnson, M., Karmiloff-Smith, A., Parisi, D., and Plunkett, K. (1997). Rethinking Innateness. MIT Press.
Empson, W. (1953). Seven Types of Ambiguity. New Directions.
Enderton, H. B. (1972). A Mathematical Introduction to Logic. Academic Press.
Epstein, R., Roberts, G., and Beber, G. (Eds.). (2008). Parsing the Turing Test. Springer.
Erdmann, M. A. and Mason, M. (1988). An exploration of sensorless manipulation. IEEE Journal of Robotics and Automation, 4(4), 369-379.
Ernst, H. A. (1961). MH-1, a Computer-Operated Mechanical Hand. Ph.D. thesis, Massachusetts Institute of Technology.
Ernst, M., Millstein, T., and Weld, D. S. (1997). Automatic SAT-compilation of planning problems. In IJCAI-97, pp. 1169-1176.
Erol, K., Hendler, J., and Nau, D. S. (1994). HTN planning: Complexity and expressivity. In AAAI-94, pp. 1123-1128.
Erol, K., Hendler, J., and Nau, D. S. (1996). Complexity results for HTN planning. AIJ, 18(1), 69-93.
Etzioni, A. (2004). From Empire to Community: A New Approach to International Relations. Palgrave Macmillan.
Etzioni, O. (1989). Tractable decision-analytic control. In Proc. First International Conference on Knowledge Representation and Reasoning, pp. 114-125.
Etzioni, O., Banko, M., and Cafarella, M. J. (2006). Machine reading. In AAAI-06.
Etzioni, O., Banko, M., Soderland, S., and Weld, D. S. (2008). Open information extraction from the web. CACM, 51(12).
Etzioni, O., Cafarella, M. J., Downey, D., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D. S., and Yates, A. (2005). Unsupervised named-entity extraction from the web: An experimental study. AIJ, 165(1), 91-134.
Etzioni, O., Hanks, S., Weld, D. S., Draper, D., Lesh, N., and Williamson, M. (1992). An approach to planning with incomplete information. In KR-92.
Etzioni, O. and Weld, D. S. (1994). A softbot-based interface to the Internet. CACM, 37(7), 72-76.
Evans, T. G. (1968). A program for the solution of a class of geometric-analogy intelligence-test questions. In Minsky, M. L. (Ed.), Semantic Information Processing, pp. 271-353. MIT Press.
Fagin, R., Halpern, J. Y., Moses, Y., and Vardi, M. Y. (1995). Reasoning about Knowledge. MIT Press.
Fahlman, S. E. (1974). A planning system for robot construction tasks. AIJ, 5(1), 1-49.
Faugeras, O. (1993). Three-Dimensional Computer Vision: A Geometric Viewpoint. MIT Press.
Faugeras, O., Luong, Q.-T., and Papadopoulo, T. (2001). The Geometry of Multiple Images. MIT Press.
Fearing, R. S. and Hollerbach, J. M. (1985). Basic solid mechanics for tactile sensing. Int. J. Robotics Research, 4(3).
Felzenszwalb, P. and Huttenlocher, D. (2000). Efficient matching of pictorial structures. In CVPR.
Felzenszwalb, P. and McAllester, D. A. (2007). The generalized A* architecture. JAIR.
Ferguson, T. (1973). Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2), 209-230.
Ferguson, T. (1992). Mate with knight and bishop in kriegspiel. Theoretical Computer Science, 96(2), 389-403.
Ferguson, T. (1995). Mate with the two bishops in kriegspiel. www.math.ucla.edu/~tom/papers.
Ferraris, P. and Giunchiglia, E. (2000). Planning as satisfiability in nondeterministic domains. In AAAI-00, pp. 748-753.
Ferriss, T. (2007). The 4-Hour Workweek. Crown.
Fikes, R. E., Hart, P. E., and Nilsson, N. J. (1972). Learning and executing generalized robot plans. AIJ, 3(4), 251-288.
Fikes, R. E. and Nilsson, N. J. (1971). STRIPS: A new approach to the application of theorem proving to problem solving. AIJ, 2(3-4), 189-208.
Fikes, R. E. and Nilsson, N. J. (1993). STRIPS, a retrospective. AIJ, 59(1-2), 227-232.
Fine, S., Singer, Y., and Tishby, N. (1998). The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32, 41-62.
Finney, D. J. (1947). Probit Analysis: A Statistical Treatment of the Sigmoid Response Curve. Cambridge University Press.
Firth, J. (1957). Papers in Linguistics. Oxford University Press.
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A 222, 309-368.
Forgy, C. (1982). A fast algorithm for the many patterns/many objects match problem. AIJ, 19(1), 17-37.
Forsyth, D. and Ponce, J. (2002). Computer Vision: A Modern Approach. Prentice Hall.
Gershwin, G. (1937). Let's call the whole thing off. Song.
Gelernter, H. (1959). Realization of a geometry-theorem proving machine. In Proc. an International Conference on Information Processing, pp. 273-282. UNESCO House.
Gelfond, M. and Lifschitz, V. (1988). Compiling circumscriptive theories into logic programs. In Non-Monotonic Reasoning: 2nd International Workshop Proceedings, pp. 74-99.
Gelfond, M. (2008). Answer sets. In van Harmelen, F., Lifschitz, V., and Porter, B. (Eds.), Handbook of Knowledge Representation, pp. 285-316. Elsevier.
Gelly, S. and Silver, D. (2008). Achieving master level play in 9 x 9 computer Go. In AAAI-08.
Gent, I., Petrie, K., and Puget, J.-F. (2006). Symmetry in constraint programming. In Rossi, F., van Beek, P., and Walsh, T. (Eds.), Handbook of Constraint Programming. Elsevier.
Getoor, L. and Taskar, B. (Eds.). (2007). Introduction to Statistical Relational Learning. MIT Press.
Ghahramani, Z. and Jordan, M. I. (1997). Factorial hidden Markov models. Machine Learning, 29, 245-274.
Ghahramani, Z. (1998). Learning dynamic Bayesian networks. In Adaptive Processing of Sequences and Data Structures, pp. 168-197.
Ghahramani, Z. (2005). Tutorial on nonparametric Bayesian methods. Tutorial presentation at the UAI Conference.
Ghallab, M., Howe, A., Knoblock, C. A., and McDermott, D. (1998). PDDL - The planning domain definition language. Tech. rep. DCS TR-1165, Yale Center for Computational Vision and Control.
Ghallab, M. and Laruelle, H. (1994). Representation and control in IxTeT, a temporal planner. In AIPS-94, pp. 61-67.
Ghallab, M., Nau, D. S., and Traverso, P. (2004). Automated Planning: Theory and Practice. Morgan Kaufmann.
Gibbs, R. W. (2006). Metaphor interpretation as embodied simulation. Mind, 21(3), 434-458.
Gibson, J. J. (1950). The Perception of the Visual World. Houghton Mifflin.
Gibson, J. J. (1979). The Ecological Approach to Visual Perception. Houghton Mifflin.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (Eds.). (1996). Markov Chain Monte Carlo in Practice. Chapman and Hall.
Gilks, W. R., Thomas, A., and Spiegelhalter, D. J. (1994). A language and program for complex Bayesian modelling. The Statistician, 43, 169-178.
Gilmore, P. C. (1960). A proof method for quantification theory: Its justification and realization. IBM Journal of Research and Development, 4, 28-35.
Ginsberg, M. L. (1993). Essentials of Artificial Intelligence. Morgan Kaufmann.
Ginsberg, M. L. (1999). GIB: Steps toward an expert-level bridge-playing program. In IJCAI-99, pp. 584-589.
Ginsberg, M. L., Frank, M., Halpin, M. P., and Torrance, M. C. (1990). Search lessons learned from crossword puzzles. In AAAI-90, Vol. 1, pp. 210-215.
Ginsberg, M. L. (2001). GIB: Imperfect information in a computationally challenging game. JAIR, 14, 303-358.
Gionis, A., Indyk, P., and Motwani, R. (1999). Similarity search in high dimensions via hashing. In Proc. 25th Very Large Database (VLDB) Conference.
Gittins, J. C. (1989). Multi-Armed Bandit Allocation Indices. Wiley.
Glanc, A. (1978). On the etymology of the word "robot". SIGART Newsletter, 67, 12.
Glover, F. and Laguna, M. (Eds.). (1997). Tabu Search. Kluwer.
Gödel, K. (1930). Über die Vollständigkeit des Logikkalküls. Ph.D. thesis, University of Vienna.
Goldman, R. and Boddy, M. (1996). Expressive planning and explicit knowledge. In AIPS-96, pp. 110-117.
Goldszmidt, M. and Pearl, J. (1996). Qualitative probabilities for default reasoning, belief revision, and causal modeling. AIJ, 84(1-2), 57-112.
Golomb, S. and Baumert, L. (1965). Backtrack programming. JACM, 14, 516-524.
Golub, G., Heath, M., and Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2).
Gomes, C., Selman, B., Crato, N., and Kautz, H. (2000). Heavy-tailed phenomena in satisfiability and constraint processing. JAR, 24, 67-100.
Gomes, C., Kautz, H., Sabharwal, A., and Selman, B. (2008). Satisfiability solvers. In van Harmelen, F., Lifschitz, V., and Porter, B. (Eds.), Handbook of Knowledge Representation. Elsevier.
Gomes, C. and Selman, B. (2001). Algorithm portfolios. AIJ, 126, 43-62.
Gomes, C., Selman, B., and Kautz, H. (1998). Boosting combinatorial search through randomization. In AAAI-98, pp. 431-437.
Gonthier, G. (2008). Formal proof - The four-color theorem. Notices of the AMS, 55(11), 1382-1393.
Good, I. J. (1961). A causal calculus. British Journal of the Philosophy of Science, 11, 305-318.
Good, I. J. (1965). Speculations concerning the first ultraintelligent machine. In Alt, F. L. and Rubinoff, M. (Eds.), Advances in Computers, Vol. 6, pp. 31-88. Academic Press.
Good, I. J. (1983). Good Thinking: The Foundations of Probability and Its Applications. University of Minnesota Press.
Goodman, D. and Keene, R. (1997). Man versus Machine: Kasparov versus Deep Blue. H3 Publications.
Goodman, J. (2001). A bit of progress in language modeling. Tech. rep. MSR-TR-2001-72, Microsoft Research.
Goodman, J. and Heckerman, D. (2004). Fighting spam with statistics. Significance, the Magazine of the Royal Statistical Society, 1, 69-72.
Goodman, N. (1954). Fact, Fiction and Forecast. University of London Press.
Goodman, N. (1977). The Structure of Appearance (3rd edition). D. Reidel.
Gopnik, A. and Glymour, C. (2002). Causal maps and Bayes nets: A cognitive and computational account of theory-formation. In Carruthers, P., Stich, S., and Siegal, M. (Eds.), The Cognitive Basis of Science. Cambridge University Press.
Gordon, D. M. (2000). Ants at Work. Norton.
Gordon, D. M. (2007). Control without hierarchy. Nature, 446(8), 143.
Gordon, M. J., Milner, A. J., and Wadsworth, C. P. (1979). Edinburgh LCF. Springer-Verlag.
Gordon, N. (1994). Bayesian Methods for Tracking. Ph.D. thesis, Imperial College.
Gordon, N., Salmond, D. J., and Smith, A. F. M. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F (Radar and Signal Processing), 140(2), 107-113.
Gorry, G. A. (1968). Strategies for computer-aided diagnosis. Mathematical Biosciences, 2(3-4), 293-318.
Gorry, G. A., Kassirer, J. P., Essig, A., and Schwartz, W. B. (1973). Decision analysis as the basis for computer-aided management of acute renal failure. American Journal of Medicine, 55, 473-484.
Gottlob, G., Leone, N., and Scarcello, F. (1999a). A comparison of structural CSP decomposition methods. In IJCAI-99, pp. 394-399.
Gottlob, G., Leone, N., and Scarcello, F. (1999b). Hypertree decompositions and tractable queries. In PODS-99, pp. 21-32.
Graham, S. L., Harrison, M. A., and Ruzzo, W. L. (1980). An improved context-free recognizer. ACM Transactions on Programming Languages and Systems, 2(3), 415-462.
Grama, A. and Kumar, V. (1995). A survey of parallel search algorithms for discrete optimization problems. ORSA Journal of Computing, 7(4), 365-385.
Grassmann, H. (1861). Lehrbuch der Arithmetik. Th. Chr. Fr. Enslin, Berlin.
Grayson, C. J. (1960). Decisions under uncertainty: Drilling decisions by oil and gas operators. Tech. rep., Division of Research, Harvard Business School.
Green, B., Wolf, A., Chomsky, C., and Laugherty, K. (1961). BASEBALL: An automatic question answerer. In Proc. Western Joint Computer Conference, pp. 219-224.
Green, C. (1969a). Application of theorem proving to problem solving. In IJCAI-69, pp. 219-239.
Green, C. (1969b). Theorem-proving by resolution as a basis for question-answering systems. In Meltzer, B., Michie, D., and Swann, M. (Eds.), Machine Intelligence 4, pp. 183-205. Edinburgh University Press.
Green, C. and Raphael, B. (1968). The use of theorem-proving techniques in question-answering systems. In Proc. 23rd ACM National Conference.
Greenblatt, R. D., Eastlake, D. E., and Crocker, S. D. (1967). The Greenblatt chess program. In Proc. Fall Joint Computer Conference, pp. 801-810.
Greiner, R. (1989). Towards a formal analysis of EBL. In ICML-89, pp. 450-453.
Grinstead, C. and Snell, J. (1997). Introduction to Probability. AMS.
Grove, W. and Meehl, P. (1996). Comparative efficiency of informal (subjective, impressionistic) and formal (mechanical, algorithmic) prediction procedures: The clinical-statistical controversy. Psychology, Public Policy, and Law, 2, 293-323.
Gruber, T. (2004). Interview of Tom Gruber. AIS SIGSEMIS Bulletin, 1(3).
Gu, J. (1989). Parallel Algorithms and Architectures for Very Fast AI Search. Ph.D. thesis, University of Utah.
Guard, J., Oglesby, F., Bennett, J., and Settle, L. (1969). Semi-automated mathematics. JACM, 16, 49-62.
Guestrin, C., Koller, D., Gearhart, C., and Kanodia, N. (2003a). Generalizing plans to new environments in relational MDPs. In IJCAI-03.
Guestrin, C., Koller, D., Parr, R., and Venkataraman, S. (2003b). Efficient solution algorithms for factored MDPs. JAIR, 19, 399-468.
Guestrin, C., Lagoudakis, M. G., and Parr, R. (2002). Coordinated reinforcement learning. In ICML-02, pp. 227-234.
Guibas, L. J., Knuth, D. E., and Sharir, M. (1992). Randomized incremental construction of Delaunay and Voronoi diagrams. Algorithmica, 7, 381-413. See also 17th Int. Coll. on Automata, Languages and Programming, 1990, pp. 414-431.
Gumperz, J. and Levinson, S. (1996). Rethinking Linguistic Relativity. Cambridge University Press.
Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. JMLR, pp. 1157-1182.
Hacking, I. (1975). The Emergence of Probability. Cambridge University Press.
Haghighi, A. and Klein, D. (2006). Prototype-driven grammar induction. In COLING-06.
Hald, A. (1990). A History of Probability and Statistics and Their Applications before 1750. Wiley.
Halevy, A. (2007). Dataspaces: A new paradigm for data integration. In Brazilian Symposium on Databases.
Halevy, A., Norvig, P., and Pereira, F. (2009). The unreasonable effectiveness of data. IEEE Intelligent Systems, March/April, 8-12.
Halpern, J. Y. (1990). An analysis of first-order logics of probability. AIJ, 46(3), 311-350.
Halpern, J. Y. (1999). Technical addendum, Cox's theorem revisited. JAIR, 11, 429-435.
Halpern, J. Y. and Weissman, V. (2008). Using first-order logic to reason about policies. ACM Transactions on Information and System Security, 11(4).
Hamming, R. W. (1991). The Art of Probability for Scientists and Engineers. Addison-Wesley.
Hammond, K. (1989). Case-Based Planning: Viewing Planning as a Memory Task. Academic Press.
Hamscher, W., Console, L., and de Kleer, J. (1992). Readings in Model-based Diagnosis. Morgan Kaufmann.
Han, X. and Boyden, E. (2007). Multiple-color optical activation, silencing, and desynchronization of neural activity, with single-spike temporal resolution. PLoS One, e299.
Hand, D., Mannila, H., and Smyth, P. (2001). Principles of Data Mining. MIT Press.
Handschin, J. E. and Mayne, D. Q. (1969). Monte Carlo techniques to estimate the conditional expectation in multi-stage nonlinear filtering. Int. J. Control, 9(5), 547-559.
Hansen, E. (1998). Solving POMDPs by searching in policy space. In UAI-98, pp. 211-219.
Hansen, E. and Zilberstein, S. (2001). LAO*: A heuristic search algorithm that finds solutions with loops. AIJ, 129(1-2), 35-62.
Hansen, P. and Jaumard, B. (1990). Algorithms for the maximum satisfiability problem. Computing, 44(4), 279-303.
Hastie, T. and Tibshirani, R. (1996). Discriminant adaptive nearest neighbor classification and regression. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E. (Eds.), NIPS 8, pp. 409-415. MIT Press.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag.
Hanski, I. and Cambefort, Y. (Eds.). (1991). Dung Beetle Ecology. Princeton University Press.
Hansson, O. and Mayer, A. (1989). Heuristic search as evidential reasoning. In UAI 5.
Hansson, O., Mayer, A., and Yung, M. (1992). Criticizing solutions to relaxed models yields powerful admissible heuristics. Information Sciences, 63(3), 207-227.
Haralick, R. M. and Elliot, G. L. (1980). Increasing tree search efficiency for constraint satisfaction problems. AIJ, 14(3), 263-313.
Hardin, G. (1968). The tragedy of the commons. Science, 162, 1243-1248.
Hardy, G. H. (1940). A Mathematician's Apology. Cambridge University Press.
Harman, G. H. (1983). Change in View: Principles of Reasoning. MIT Press.
Harris, Z. (1954). Distributional structure. Word, 10(2/3).
Harrison, J. R. and March, J. G. (1984). Decision making and postdecision surprises. Administrative Science Quarterly, 29, 26-42.
Harsanyi, J. (1967). Games with incomplete information played by Bayesian players. Management Science, 14, 159-182.
Hart, P. E., Nilsson, N. J., and Raphael, B. (1968). A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, SSC-4(2), 100-107.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference and Prediction (2nd edition). Springer-Verlag.
Haugeland, J. (Ed.). (1985). Artificial Intelligence: The Very Idea. MIT Press.
Hauk, T. (2004). Search in Trees with Chance Nodes. Ph.D. thesis, Univ. of Alberta.
Haussler, D. (1989). Learning conjunctive concepts in structural domains. Machine Learning, 4(1), 7-40.
Havelund, K., Lowry, M., Park, S., Pecheur, C., Penix, J., Visser, W., and White, J. L. (2000). Formal analysis of the remote agent before and after flight. In Proc. 5th NASA Langley Formal Methods Workshop.
Havenstein, H. (2005). Spring comes to AI winter. Computer World.
Hawkins, J. and Blakeslee, S. (2004). On Intelligence. Henry Holt and Co.
Hayes, P. J. (1978). The naive physics manifesto. In Michie, D. (Ed.), Expert Systems in the Microelectronic Age. Edinburgh University Press.
Hayes, P. J. (1979). The logic of frames. In Metzing, D. (Ed.), Frame Conceptions and Text Understanding, pp. 46-61. de Gruyter.
Hayes, P. J. (1985a). Naive physics I: Ontology for liquids. In Hobbs, J. R. and Moore, R. C. (Eds.), Formal Theories of the Commonsense World, chap. 3, pp. 71-107. Ablex.
Hayes, P. J. (1985b). The second naive physics manifesto. In Hobbs, J. R. and Moore, R. C. (Eds.), Formal Theories of the Commonsense World, chap. 1, pp. 1-36. Ablex.
Heinz, E. A. (2000). Scalable Search in Computer Chess. Vieweg.
Held, M. and Karp, R. M. (1970). The traveling salesman problem and minimum spanning trees. Operations Research, 18, 1138-1162.
Helmert, M. (2001). On the complexity of planning in transportation domains. In ECP-01.
Helmert, M. (2003). Complexity results for standard benchmark domains in planning. AIJ, 143(2), 219-262.
Helmert, M. (2006). The fast downward planning system. JAIR, 26, 191-246.
Helmert, M. and Richter, S. (2004). Fast downward - Making use of causal dependencies in the problem representation. In Proc. International Planning Competition at ICAPS, pp. 41-43.
Helmert, M. and Röger, G. (2008). How good is almost perfect? In AAAI-08.
Hendler, J., Carbonell, J. G., Lenat, D. B., Mizoguchi, R., and Rosenbloom, P. S. (1995). VERY large knowledge bases - Architecture vs engineering. In IJCAI-95, pp. 2033-2036.
Henrion, M. (1988). Propagation of uncertainty in Bayesian networks by probabilistic logic sampling. In Lemmer, J. F. and Kanal, L. N. (Eds.), UAI 2, pp. 149-163. Elsevier/North-Holland.
Henzinger, T. A. and Sastry, S. (Eds.). (1998). Hybrid Systems: Computation and Control. Springer-Verlag.
Herbrand, J. (1930). Recherches sur la théorie de la démonstration. Ph.D. thesis, University of Paris.
Hewitt, C. (1969). PLANNER: A language for proving theorems in robots. In IJCAI-69, pp. 295-301.
Hierholzer, C. (1873). Über die Möglichkeit, einen Linienzug ohne Wiederholung und ohne Unterbrechung zu umfahren. Mathematische Annalen, 6, 30-32.
Hilgard, E. R. and Bower, G. H. (1975). Theories of Learning (4th edition). Prentice-Hall.
Hintikka, J. (1962). Knowledge and Belief. Cornell University Press.
Hart, P. E., Nilsson, N. J., and Raphael, B. (1972). Correction to "A formal basis for the heuristic determination of minimum cost paths". SIGART Newsletter, 37, 28-29.
Hart, T. P. and Edwards, D. J. (1961). The tree prune (TP) algorithm. Artificial Intelligence Project Memo 30, Massachusetts Institute of Technology.
Hartley, H. (1958). Maximum likelihood estimation from incomplete data. Biometrics, 14, 174-194.
Hartley, R. and Zisserman, A. (2000). Multiple View Geometry in Computer Vision. Cambridge University Press.
Haykin, S. (2008). Neural Networks: A Comprehensive Foundation. Prentice Hall.
Hays, J. and Efros, A. A. (2007). Scene completion using millions of photographs. ACM Transactions on Graphics (SIGGRAPH), 26(3).
Hearst, M. A. (1992). Automatic acquisition of hyponyms from large text corpora. In COLING-92.
Hinton, G. E. and Anderson, J. A. (1981). Parallel Models of Associative Memory. Lawrence Erlbaum Associates.
Hinton, G. E. and Nowlan, S. J. (1987). How learning can guide evolution. Complex Systems, 1(3), 495-502.
Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527-1554.
Haslum, P., Botea, A., Helmert, M., Bonet, B., and Koenig, S. (2007). Domain-independent construction of pattern database heuristics for cost-optimal planning. In AAAI-07, pp. 1007-1012.
Haslum, P. and Geffner, H. (2001). Heuristic planning with time and resources. In Proc. IJCAI-01 Workshop on Planning with Resources.
Haslum, P. (2006). Improving heuristics through relaxed search - An analysis of TP4 and HSP*a in the 2004 planning competition. JAIR, 25, 233-267.
Haslum, P., Bonet, B., and Geffner, H. (2005). New admissible heuristics for domain-independent planning. In AAAI-05.
Hearst, M. A. (2009). Search User Interfaces. Cambridge University Press.
Hebb, D. O. (1949). The Organization of Behavior. Wiley.
Heckerman, D. (1986). Probabilistic interpretation for MYCIN's certainty factors. In Kanal, L. N. and Lemmer, J. F. (Eds.), UAI 2, pp. 167-196. Elsevier/North-Holland.
Heckerman, D. (1991). Probabilistic Similarity Networks. MIT Press.
Heckerman, D. (1998). A tutorial on learning with Bayesian networks. In Jordan, M. I. (Ed.), Learning in Graphical Models. Kluwer.
Heckerman, D., Geiger, D., and Chickering, D. M. (1994). Learning Bayesian networks: The combination of knowledge and statistical data. Technical report MSR-TR-94-09, Microsoft Research.
Heidegger, M. (1927). Being and Time. SCM Press.
Hinton, G. E. and Sejnowski, T. (1983). Optimal perceptual inference. In CVPR, pp. 448-453.
Hinton, G. E. and Sejnowski, T. (1986). Learning and relearning in Boltzmann machines. In Rumelhart, D. E. and McClelland, J. L. (Eds.), Parallel Distributed Processing, chap. 7, pp. 282-317. MIT Press.
Hirsh, H. (1987). Explanation-based generalization in a logic programming environment. In IJCAI-87.
Hobbs, J. R. (1990). Literature and Cognition. CSLI Press.
Hobbs, J. R., Appelt, D., Bear, J., Israel, D., Kameyama, M., Stickel, M. E., and Tyson, M. (1997). FASTUS: A cascaded finite-state transducer for extracting information from natural-language text. In Roche, E. and Schabes, Y. (Eds.), Finite-State Devices for Natural Language Processing, pp. 383-406. MIT Press.
Hobbs, J. R. and Moore, R. C. (Eds.). (1985). Formal Theories of the Commonsense World. Ablex.
Hobbs, J. R., Stickel, M. E., Appelt, D., and Martin, P. (1993). Interpretation as abduction. AIJ, 63(1-2), 69-142.
Hoffmann, J. (2001). FF: The fast-forward planning system. AIMag, 22(3), 57-62.
Hoffmann, J. (2005). Where "ignoring delete lists" works: Local search topology in planning benchmarks. JAIR, 24, 685-758.
Hoffmann, J. and Brafman, R. I. (2005). Contingent planning via heuristic forward search with implicit belief states. In ICAPS-05.
Hoffmann, J. and Brafman, R. I. (2006). Conformant planning via heuristic forward search: A new approach. AIJ, 170(6-7), 507-541.
Hoffmann, J. and Nebel, B. (2001). The FF planning system: Fast plan generation through heuristic search. JAIR, 14, 253-302.
Hoffmann, J., Sabharwal, A., and Domshlak, C. (2006). Friends or foes? An AI planning perspective on abstraction and search. In ICAPS-06, pp. 294-303.
Hogan, N. (1985). Impedance control: An approach to manipulation. Parts I, II, and III. J. Dynamic Systems, Measurement, and Control, 107(3), 1-24.
Hoiem, D., Efros, A. A., and Hebert, M. (2008). Putting objects in perspective. IJCV, 80(1).
Horowitz, E. and Sahni, S. (1978). Fundamentals of Computer Algorithms. Computer Science Press.
Horswill, I. (2000). Functional programming of behavior-based systems. Autonomous Robots, 9, 83-93.
Horvitz, E. J. (1987). Problem-solving design: Reasoning about computational value, trade-offs, and resources. In Proc. Second Annual NASA Research Forum, pp. 26-43.
Horvitz, E. J. (1989). Rational metareasoning and compilation for optimizing decisions under bounded resources. In Proc. Computational Intelligence 89. Association for Computing Machinery.
Horvitz, E. J. and Barry, M. (1995). Display of information for time-critical decision making. In UAI-95, pp. 296-305.
Horvitz, E. J., Breese, J. S., Heckerman, D., Hovel, D., and Rommelse, K. (1998). The Lumiere project: Bayesian user modeling for inferring the goals and needs of software users. In UAI-98, pp. 256-265.
Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press.
Holland, J. H. (1995). Hidden Order: How Adaptation Builds Complexity. Addison-Wesley.
Holte, R. and Hernadvolgyi, I. (2001). Steps towards the automatic creation of search heuristics. Tech. rep. TR04-02, CS Dept., Univ. of Alberta.
Holzmann, G. J. (1997). The Spin model checker. IEEE Transactions on Software Engineering, 23(5), 279-295.
Hood, A. (1824). Case 4th - 28 July 1824 (Mr. Hood's cases of injuries of the brain). The Phrenological Journal and Miscellany.
Jonsson, A., Morris, P., Muscettola, N., Rajan, K., and Smith, B. (2000). Planning in interplanetary space: Theory and practice. In AIPS-00, pp. 177-186.
Jennings, H. S. (1906). Behavior of the Lower Organisms. Columbia University Press.
Jurafsky, D. and Martin, J. H. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall.
Jurafsky, D. and Martin, J. H. (2008). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2nd edition). Prentice-Hall.
Kaelbling, L. P. and Rosenschein, S. J. (1990). Action and planning in embedded agents. Robotics and Autonomous Systems, 6(1-2), 35-48.
Kadane, J. B. and Simon, H. A. (1977). Optimal strategies for a class of constrained sequential problems. Annals of Statistics, 5, 237-255.
Kadane, J. B. and Larkey, P. D. (1982). Subjective probability and the theory of games. Management Science, 28(2), 113-120.
Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. AIJ, 101, 99-134.
Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. JAIR, 4, 237-285.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. J. Basic Engineering, 82, 35-46.
Kambhampati, S. (1994). Exploiting causal structure to control retrieval and refitting during plan reuse. Computational Intelligence, 10, 213-244.
Kambhampati, S., Mali, A. D., and Srivastava, B. (1998). Hybrid planning for partially hierarchical domains. In AAAI-98, pp. 882-888.
Kanal, L. N. and Kumar, V. (1988). Search in Artificial Intelligence. Springer-Verlag.
Kanazawa, K., Koller, D., and Russell, S. J. (1995). Stochastic simulation algorithms for dynamic probabilistic networks. In UAI-95, pp. 346-351.
Kantorovich, L. V. (1939). Mathematical methods of organizing and planning production. Published in translation in Management Science, 6(4), 366-422, July 1960.
Kaplan, D. and Montague, R. (1960). A paradox regained. Notre Dame Journal of Formal Logic, 1(3), 79-90.
Karmarkar, N. (1984). A new polynomial-time algorithm for linear programming. Combinatorica, 4, 373-395.
Karp, R. M. (1972). Reducibility among combinatorial problems. In Miller, R. E. and Thatcher, J. W. (Eds.), Complexity of Computer Computations, pp. 85-103. Plenum.
Kartam, N. A. and Levitt, R. B. (1990). A constraint-based approach to construction planning of multi-story buildings. In Expert Planning Systems, pp. 245-250. Institute of Electrical Engineers.
Kasami, T. (1965). An efficient recognition and syntax analysis algorithm for context-free languages. Tech. rep. AFCRL-65-758, Air Force Cambridge Research Laboratory.
Kasparov, G. (1997). IBM owes me a rematch. Time, 149(21), 66-67.
Kaufmann, M., Manolios, P., and Moore, J. S. (2000). Computer-Aided Reasoning: An Approach. Kluwer.
Kautz, H. (2006). Deconstructing planning as satis
fiability. InAAAI-06.
Kautz, H., McAIIester, D. A., and Selman, B.
(1996). Encoding plans n i propositional logic. In KR-96, pp. 374-384. Kautz, H. and Selman, B. (1992). Planning as satis
fiability. In ECAI-92, pp-. 359-363.
Kautz, H. and Selman, B. (1998). BLACKBOX: A new approach to the application of theorem proving to problem solving. Working Notes of the AIPS-98 Workshop on Planning as Combinatorial Search.
Kavraki, L., Svestka, P., Latombe, J.-C., and Overmars, M. (1996). Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation, 12(4), 566-580.
Kim, J. H. and Pearl, J. (1983). A computational model for combined causal and diagnostic reasoning in inference systems. In IJCAI-83, pp. 190-193.
Kim, J.-H., Lee, C.-H., Lee, K.-H., and Kuppuswamy, N. (2007). Evolving personality of a genetic robot in ubiquitous environment. In The 16th IEEE International Symposium on Robot and Human Interactive Communication, pp. 848-853.
Kocsis, L. and Szepesvari, C. (2006). Bandit-based Monte-Carlo planning. In ECML-06.
Koditschek, D. (1987). Exact robot navigation by means of potential functions: Some topological considerations. In ICRA-87, Vol. 1, pp. 1-6.
Koehler, J., Nebel, B., Hoffmann, J., and Dimopoulos, Y. (1997). Extending planning graphs to an ADL subset. In ECP-97, pp. 273-285.
Kay, M., Gawron, J. M., and Norvig, P. (1994). Verbmobil: A Translation System for Face-to-Face Dialog. CSLI Press.
Kearns, M. (1990). The Computational Complexity of Machine Learning. MIT Press.
Kearns, M. and Mansour, Y. (1998). A fast, bottom-up decision tree pruning algorithm with near-optimal generalization. In ICML-98, pp. 269-277.
Kearns, M., Mansour, Y., and Ng, A. Y. (2000). Approximate planning in large POMDPs via reusable trajectories. In Solla, S. A., Leen, T. K., and Müller, K.-R. (Eds.), NIPS 12. MIT Press.
Kearns, M. and Singh, S. P. (1998). Near-optimal reinforcement learning in polynomial time. In ICML-98, pp. 260-268.
Kearns, M. and Vazirani, U. (1994). An Introduction to Computational Learning Theory. MIT Press.
Kebeasy, R. M., Hussein, A. I., and Dahy, S. A. (1998). Discrimination between natural earthquakes and nuclear explosions using the Aswan Seismic Network. Annali di Geofisica, 41(2), 127-140.
Keeney, R. L. (1974). Multiplicative utility functions. Operations Research, 22, 22-34.
Keeney, R. L. and Raiffa, H. (1976). Decisions with Multiple Objectives: Preferences and Value Tradeoffs. Wiley.
Kemp, M. (Ed.). (1989). Leonardo on Painting: An Anthology of Writings. Yale University Press.
Kephart, J. O. and Chess, D. M. (2003). The vision of autonomic computing. IEEE Computer, 36(1), 41-50.
Kersting, K., Raedt, L. D., and Kramer, S. (2000). Interpreting Bayesian logic programs. In Proc. AAAI-2000 Workshop on Learning Statistical Models from Relational Data.
Kessler, B., Nunberg, G., and Schütze, H. (1997). Automatic detection of text genre. CoRR, cmp-lg/9707002.
King, R. D., Rowland, J., Oliver, S. G., and Young, M. (2009). The automation of science. Science, 324(5923), 85-89.
Kirk, D. E. (2004). Optimal Control Theory: An Introduction. Dover.
Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220, 671-680.
Kister, J., Stein, P., Ulam, S., Walden, W., and Wells, M. (1957). Experiments in chess. JACM, 4, 174-177.
Kisynski, J. and Poole, D. (2009). Lifted aggregation in directed first-order probabilistic models. In IJCAI-09.
Kitano, H., Asada, M., Kuniyoshi, Y., Noda, I., and Osawa, E. (1997a). RoboCup: The robot world cup initiative. In Proc. First International Conference on Autonomous Agents, pp. 340-347.
Kitano, H., Asada, M., Kuniyoshi, Y., Noda, I., Osawa, E., and Matsubara, H. (1997b). RoboCup: A challenge problem for AI. AIMag, 18(1), 73-85.
Koehn, P. (2009). Statistical Machine Translation. Cambridge University Press.
Koenderink, J. J. (1990). Solid Shape. MIT Press.
Koenig, S. (2000). Exploring unknown environments with real-time search or reinforcement learning. In Solla, S. A., Leen, T. K., and Müller, K.-R. (Eds.), NIPS 12. MIT Press.
Koenig, S. (2001). Agent-centered search. AIMag, 22(4), 109-131.
Koller, D., Megiddo, N., and von Stengel, B. (1996). Efficient computation of equilibria for extensive two-person games. Games and Economic Behavior, 14(2), 247-259.
Koller, D. and Pfeffer, A. (1997). Representations and solutions for game-theoretic problems. AIJ, 94(1-2), 167-215.
Keynes, J. M. (1921). A Treatise on Probability. Macmillan.
Khare, R. (2006). Microformats: The next (small) thing on the semantic web? IEEE Internet Computing, 10(1), 68-75.
Kjaerulff, U. (1992). A computational scheme for reasoning in dynamic probabilistic networks. In UAI-92, pp. 121-129.
Klein, D. and Manning, C. (2001). Parsing with treebank grammars: Empirical bounds, theoretical models, and the structure of the Penn treebank. In ACL-01.
Klein, D. and Manning, C. (2003). A* parsing: Fast exact Viterbi parse selection. In HLT-NAACL-03, pp. 119-126.
Klein, D., Smarr, J., Nguyen, H., and Manning, C. (2003). Named entity recognition with character-level models. In Conference on Natural Language Learning (CoNLL).
Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. JACM, 46(5), 604-632.
Klemperer, P. (2002). What really matters in auction design. J. Economic Perspectives, 16(1).
Kneser, R. and Ney, H. (1995). Improved backing-off for M-gram language modeling. In ICASSP-95, pp. 181-184.
Koenig, S. (1991). Optimal probabilistic and decision-theoretic planning using Markovian decision theory. Master's report, Computer Science Division, University of California.
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
Koller, D. and Pfeffer, A. (1998). Probabilistic frame-based systems. In AAAI-98, pp. 580-587.
Koller, D. and Milch, B. (2003). Multi-agent intlu ence diagrams for representing and solving games.
Games and Economic Behavior,45, 181-221.
Koller, D. and Parr, R. (2000). Policy iteration for factored MOPs. In UAT-00, pp. 326-334.
Koller, D. and Sahami, M. (1997). HierarchicaUy classifying docwnents using very few words. In TCML-97, pp. 170-178. Kolmogorov, A. N. (1941). Interpolation und ex trapolation von stationaren zufalligen folgen. Bul
letin of the Academy of Sciences ofthe USSR, Ser. Math. 5, �14.
Kolmogorov, A. N. (1950). Foundaons ti ofthe The ory ofProbability. Chelsea.
Kolmogorov, A. N. (1963). On tables of random numbers. Sankhya, the lndian Journal ofStatiscs, ti
Knight, K. (1999). A statistical MT tutorial worlc book. Prepared in connection with the Johns Hop kins University swnmer workshop.
SeriesA 25.
Research, 5(1), 90-98.
Knuth, D. E. (1964). Representing nmnbers using only one 4. Mathematics Magazine, J7(Nov/I)ec),
lnfomwtion Transmission, /(1), 1-7.
Khmelev, D. V. and Tweedie, F. J. (2001). Using Marlcov chains for identification of writer. Litemry
Knuth, D. E. (1968). Semantics forcontext-free lan
Khatib, 0. (1986). Real-time obstacle avoidance for robotmanipulator andmobile robots. Tnt. J.IIDbocs ti
and Linguistic Computing, !6(3), 299-307.
Kietz, J.-U. and Duzeroski, S. (1994). Inductive
logic programmins and le:unAbility. SIGART Bul letin, 5(1), 22-32.
Kilgarriff, A. and Grefensterte, G. (2006). Intro duction to the special issue on the web as corpus. Computational Linguistics, 29(3), 333-347.
Kim, J. H. (1983). CONVTNCE: A Conversational lnfe,.nce Consolidation Engine. PhD. thesis, De partment of Computer Science, University of Cali fornia at Lns Angeles.
308-310.
guages. Mathematical Systems Theory, 2(2), 127145.
Knuth, D. E. (1973). ThtArt ofComputerProgram
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of infonnation. Problems n i
Kolodner, J. (1983). Reconstructive memory: A computer model. Cognitive Science, 7, 281-328.
Kolodner, J. (1993). Case-Baud Reasoning. Mor gan Kaufmann. Kondrnk, G. and van Beek, P. (1997). A theoretical
ming (second edition)., Vol. 2: Fundamental Algo rithms. Addison Wesley.
evaluation of selected backtracking algorithms. All,
Knuth, D. E. (1975). An analysis of alpha-beta pruning. All, 6(4), 29�326.
Konolige, K. (1997). COLBERT: Alanguage forre active control in Saphira. In Kanstliche lntelligenz: Adwmces n i Artificial lntelligence, LNAI, pp. 31-
-
Knuth, D. E. and Bendix, P. B. (1970). Simple word problems in universal algebras. In Leech, J.
(Ed.), Computational Problems n i AbstractAlgebm, pp. 263-267. Pergamon.
89. 365-387.
52.
Konolige, K. (2004). Large-scale map-making. In AAA/-04, pp. 457-463.
Konolige, K. (1982). A first-order formalization of knowledge and action for a multi-agent planning system. In Hayes, J. E., Michie, D., and Pao, Y.-H. (Eds.), Machine Intelligence 10. Ellis Horwood.
Konolige, K. (1994). Easy to be hard: Difficult problems for greedy algorithms. In KR-94, pp. 374-378.
Koo, T., Carreras, X., and Collins, M. (2008). Simple semi-supervised dependency parsing. In ACL-08.
Koopmans, T. C. (1972). Representation of preference orderings over time. In McGuire, C. B. and Radner, R. (Eds.), Decision and Organization. Elsevier/North-Holland.
Korb, K. B. and Nicholson, A. (2003). Bayesian Artificial Intelligence. Chapman and Hall.
Koza, J. R. (1994). Genetic Programming II: Automatic Discovery of Reusable Programs. MIT Press.
Langley, P., Simon, H. A., Bradshaw, G. L., and Zytkow, J. M. (1987). Scientific Discovery: Computational Explorations of the Creative Processes. MIT Press.
Langton, C. (Ed.). (1995). Artificial Life. MIT Press.
Laplace, P. (1816). Essai philosophique sur les probabilités (3rd edition). Courcier Imprimeur, Paris.
Laplev, I. and Perez, P. (2007). Retrieving actions in movies. In /CCV, pp. 1-8.
Lege.ndre, A. M. (1805). Nou1·elles mithodes pour Ia dltemJination dts orbites des comiues. .
4, 35-56.
lenat, D. B. (1983). EURJSKO: A program that
Lari, K. and Young, S. J. (1990). The estimruion of stochasticcontext-free grammars usingthe inside outside algorithm. ComputerSpeech and Language,
Lehrer, J. (2009). How We Decide. Houghton Mif flin.
Larrailaga, P., Kuijpers, C., Murg��, R., Inza, I., and
learns new heuristics and domain concepts: The na ture f heuristics, ill: Program design and results.
13, 129-170.
LenaI, D. B. and Brown, J. S. (1984). WhyAM and EURJSKO appear to wodc. AIJ, 23(3), 269-294.
Dizdarevic, S. (1999). Genetic algorithms for the travelling salesman problem: A review of represen tations and operators. Artificiallntelligtnce Review,
AJJ, 2/(1-2), 61-98.
lenat, D. B. andGuba, R. V. (1990). Building Lo.!f!e
Lightbill, J. (1973). Artificial intelligence: A gen eral survey. In Lightbill, J., Sutherland, N. S., Need ham, R. M., Longuet-Hi!J8ins, H. C., and Michie, D. (Eds.), Arrijicial lntelligrnce: A Paper Symposiunc Science Research Council fGreat Britain. Lin, S. (1965). Computer solutins of the travelling salesman problem. Bell Systtms Technical Joumal, 44(10), 2245-2269.
Lin, S. and Kernighan, B . W. (1973). An effective heuristic algorithm for the travelling-salesman prob lem. Operations Resrarch, 2/(2), 498-516.
Larson, S. C. (1931). The shrinkage of the coef ficient of multiple correlation. J. Educaonal ti Psy chology, 22, 45-55.
Knowledge-Baud System•: Representation and In ference in the CYCProjec.t. Addison-Wesley.
Robot Motion Planning.
l.eonard, J. andDurrant-Whyte, H. (1992). Directed
ficiallntdligtnctfor Organic Chemistry: The DEN
LeSniewski, S. (1916). mnogoSci. Moscow.
Littman, M. L (1994). M:ukov games as a fcame work for multi-agent reinforcement learning. In ICML-94, pp. 157-163.
Laskey, K. B. (2008). MEBN: A language for first order bayesianknowledge bases. AJJ, 172, 140-178.
Latombe, J.-C. (1991).
Kluwer.
Lauritzen, S. (1995). The EM algorithm for graphi cal association models with missing data. Computa
tional Statistics and Data Analysis, 19, 191-201.
Lauritzen, S. (1996). Graphical models. Oxford
i Press. Universty
Lauritzen, S., Dawid, A. P., Lassen, B., and leim.,.,
H. (1990). Independence properties of directed Markov fields. Networl:s, 20(5),491-505.
Lauritzen, S. and Spiegelbalter,D. J. (1988). Local computations with probabilities on graphical struc tures andtheir applicationte:q>ert systems. J. Royal
Statistical Society, B 50(2), 157-224.
Lauritzen, S. and Wennutlh, N. (1989). Graphical models for associations between variables, some of which are qualitative and some quantitative. Annals
ofStatistics, 17, 31-57. LaValle, S.
(2006). Planning Algorithms.
bridge Un iversity Press.
Cam
Lavrauc, N. and Duzeroski, S. (1994). Inductive
Logic Programming: Techniques and Applications. Ellis Horwood.
Lawter, E. L., Lenstra, J.K., Kan, A., and Shmoys,
D. B. (1992). The Tm,·ellingSalesman Problem. Wi ley lnterscience.
Lawter, E. L., Lenstra, J. K., Kan, A., and Shmoys,
D. B. (1993). Sequencing and scheduling: Algo
rithms and complexity. In Graves, S. C., Zipkin, P. H., and Kan, A. H. G. R. (Eds.), Lo.gistics ofPro
duction and Inventory: Handbooks n i Operations Research and Managemtnr Scitnct) Volume 4. pp.
leonard, H. S. and Goodman, N. (1940). The cal culus of individuals and its uses. JSL, 5(2), 45-55.
sonarstnsingfor mobile robot navigation. KJuwer.
P<XIstawy og61nej teorii
lettvin, J. Y., Maturana, H. R., McCulloch, W. S., and Pitts, W. (1959). What the frog's eye tells the frog's brain. Proc. IRE,47(ll),
1940-1951.
letz, R., Schumann, J., Bayer!, S., and Bibe � W. (1992). SElHEO: A hjgh-p...Connance thoorem
prover. JAR, 8(2), 183-212.
Levesque, H. J. and Braclunan, R. J. (1987). Ex
pressiveness and tractabmty in kno wled gere resen p tation and reasoning. Computational I ntellig ence,
3(2), 78-93.
levin, D. A., Peres, Y., and Wilmer, E. L. (2008). Markov Chains and Mixing Times. American Math .,..atical Society.
levitt, G. M. (2000). The Turl:, Chess Alllomaton. McFarland and Company.
Compmer Chess Com
levy, D. (Ed). (1988a).
pendium. Springer-'ktlag.
levy, D. (Ed.). (1988b).
Springer.Verlag.
Computer Games.
levy, D. (1989). The ntilli oo poond bridge program.
In l.evy, D. and Beal, D. (Eds.), Heuristic Program ming n i Artificial Intelligence. Ellis Horwood.
levy, D. (2007). Lave and St.t with Robots. Harper.
lewis, D. D. (1998). Naive Bayes at forty: The in
dependence assumption in information retrieval. ECML-98, pp. �15.
In
445-522. North-Holland.
Lawter, E. L. and Wood, D. E. (1966). Branch-and bound methds: A survey. Operations Research,
lewis, D. K. (1966). An
14(4), 699-719.
lewis, D.K. (1980). Madpain and Martian pain. In
Lazatllls,A. and Latmbe, J..c. (1992). Landmark based robot navigation. In AAA/-92, pp. 816-822
Levton-Brown, K. and Shobam, Y. (2008). Essen
leCun, Y., Jackel, L., Boser, B., and Denker, J.
(1989). Handwritten dig it recogrt t i io n:
f.
Ap l ica s andautomatic earn
cbip
tions of neural network ing. IEEE Communications Magazine, 27(11),
theory.
argume n t for the identity
( 1), 1 7-25. h y, 63 J. Phloso i p
Block, N. (Ed.), Readings in Philosophy ofPsycho/. ogy, Vol. 1, pp. 216-222 Harvard University Press.
tiais of Game Theory: A Concise, Multidisciplinary Introduction. Morgan Claypool.
Li, C. M. and Anbulagan (1997). Heuristics based
41-
on unit propag��tion for s�tisfiability problems. In
leCun, Y., Jacke� L., Bottou, L., Bruoot, A., Cortes, C., Denker, J., Drucker, H., Guyon, 1.,
Li, M. and Vitanyi, P. M. B. (1993). An Introduc tion ro Kolmogorov Complexiry and ItsApplications.
46.
Muller, U., Seckinger, E., Simard, P., and Vapnik, V. N. (1995). Comparison of leanting algorithms for handwritten digit rec()gnitioo. Conference on
In Int.
Arrijicial NeuralNetworl:s,pp. 53-60.
leen English' Based on tht British National Corpus. Longman.
IJCAI-97, pp. 366-371. Springer-Verlag.
Liberatore, P. (1997). The complexity of the lan guage A. Electroni c Tmnsactions onArtificial Intel
ligence, I, 13-38.
Lifschitz, V. (2001). Answer set programming and plan generation. AIJ, 138(1-2), 39-54.
Lindley, D. V. (1956). On a measure of the information provided by an experiment. Annals of Mathematical Statistics, 27(4), 986-1005.
Lindsay, R. K., Buchanan, B. G., Feigenbaum, E. A., and Lederberg, J. (1980). Applications of Artificial Intelligence for Organic Chemistry: The DENDRAL Project. McGraw-Hill.
Littman, M. L., Keim, G. A., and Shazeer, N. M. (1999). Solving crosswords with PROVERB. In AAAI-99, pp. 914-915.
Liu, J. S. and Chen, R. (1998). Sequential Monte Carlo methods for dynamic systems. JASA, 93, 1022-1031.
Livescu, K., Glass, J., and Bilmes, J. (2003). Hidden feature modeling for speech recognition using dynamic Bayesian networks. In EUROSPEECH-2003, pp. 2529-2532.
Livnat, A. and Pippenger, N. (2006). An optimal brain can be composed of conflicting agents. PNAS, 103(9), 3198-3202.
Locke, J. (1690). An Essay Concerning Human Understanding. William Tegg.
Lodge, D. (1984). Small World. Penguin Books.
Loftus, E. and Palmer, J. (1974). Reconstruction of automobile destruction: An example of the interaction between language and memory. J. Verbal Learning and Verbal Behavior, 13, 585-589.
Lohn, J. D., Kraus, W. F., and Colombano, S. P. (2001). Evolutionary optimization of Yagi-Uda antennas. In Proc. Fourth International Conference on Evolvable Systems, pp. 236-243.
Longley, N. and Sankaran, S. (2005). The NHL's overtime-loss rule: Empirically analyzing the unintended effects. Atlantic Economic Journal.
Longuet-Higgins, H. C. (1981). A computer algorithm for reconstructing a scene from two projections. Nature, 293, 133-135.
Loo, B. T., Condie, T., Garofalakis, M., Gay, D. E., Hellerstein, J. M., Maniatis, P., Ramakrishnan, R., Roscoe, T., and Stoica, I. (2006). Declarative networking: Language, execution and optimization. In SIGMOD-06.
Love, N., Hinrichs, T., and Genesereth, M. R. (2006). General game playing: Game description language specification. Tech. rep. LG-2006-01, Stanford University Computer Science Dept.
Lovejoy, W. S. (1991). A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28(1-4), 47-66.
Loveland, D. (1970). A linear format for resolution. In Proc. IRIA Symposium on Automatic Demonstration, pp. 147-162.
Lowe, D. (1987). Three-dimensional object recognition from single two-dimensional images. AIJ, 31, 355-395.
Lowe, D. (1999). Object recognition using local scale-invariant features. In ICCV.
Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. IJCV, 60(2), 91-110.
Löwenheim, L. (1915). Über Möglichkeiten im Relativkalkül. Mathematische Annalen, 76, 447-470.
Lowerre, B. T. (1976). The HARPY Speech Recognition System. Ph.D. thesis, Computer Science Department, Carnegie-Mellon University.
Lowerre, B. T. and Reddy, R. (1980). The HARPY speech recognition system. In Lea, W. A. (Ed.), Trends in Speech Recognition, chap. 15. Prentice-Hall.
Lowry, M. (2008). Intelligent software engineering tools for NASA's crew exploration vehicle. In Proc. ISMIS.
Loyd, S. (1959). Mathematical Puzzles of Sam Loyd: Selected and Edited by Martin Gardner. Dover.
Lozano-Perez, T. (1983). Spatial planning: A configuration space approach. IEEE Transactions on Computers, C-32(2), 108-120.
Lozano-Perez, T., Mason, M., and Taylor, R. (1984). Automatic synthesis of fine-motion strategies for robots. Int. J. Robotics Research, 3(1), 3-24.
Lu, F. and Milios, E. (1997). Globally consistent range scan alignment for environment mapping. Autonomous Robots, 4, 333-349.
Luby, M., Sinclair, A., and Zuckerman, D. (1993). Optimal speedup of Las Vegas algorithms. Information Processing Letters, 47, 173-180.
Lucas, J. R. (1961). Minds, machines, and Gödel. Philosophy, 36.
Lucas, J. R. (1976). This Gödel is killing me: A rejoinder. Philosophia, 6(1), 145-148.
Lucas, P. (1996). Knowledge acquisition for decision-theoretic expert systems. AISB Quarterly, 94, 23-33.
Lucas, P., van der Gaag, L., and Abu-Hanna, A. (2004). Bayesian networks in biomedicine and health-care. Artificial Intelligence in Medicine.
Luce, D. R. and Raiffa, H. (1957). Games and Decisions. Wiley.
Ludlow, P., Nagasawa, Y., and Stoljar, D. (2004). There's Something About Mary. MIT Press.
Luger, G. F. (Ed.). (1995). Computation and Intelligence: Collected Readings. AAAI Press.
Lyman, P. and Varian, H. R. (2003). How much information? www.sims.berkeley.edu/how-much-info-2003.
MacKay, D. J. C. (1992). A practical Bayesian framework for back-propagation networks. Neural Computation, 4(3), 448-472.
MacKay, D. J. C. (2002). Information Theory, Inference and Learning Algorithms. Cambridge University Press.
MacKenzie, D. (2004). Mechanizing Proof. MIT Press.
Mackworth, A. K. (1977). Consistency in networks of relations. AIJ, 8(1), 99-118.
Mahanti, A. and Daniels, C. J. (1993). A SIMD approach to parallel heuristic search. AIJ, 60(2), 243-282.
Mailath, G. and Samuelson, L. (2006). Repeated Games and Reputations: Long-Run Relationships. Oxford University Press.
Majercik, S. M. and Littman, M. L. (2003). Contingent planning under uncertainty via stochastic satisfiability. AIJ, pp. 119-162.
Malik, J. and Perona, P. (1990). Preattentive texture discrimination with early vision mechanisms. J. Opt. Soc. Am. A, 7(5), 923-932.
Malik, J. and Rosenholtz, R. (1994). Recovering surface curvature and orientation from texture distortion: A least squares algorithm and sensitivity analysis. In ECCV, pp. 353-364.
Malik, J. and Rosenholtz, R. (1997). Computing local surface orientation and shape from texture for curved surfaces. IJCV, 23(2), 149-168.
Maneva, E., Mossel, E., and Wainwright, M. J. (2007). A new look at survey propagation and its generalizations. JACM, 54(4).
Manna, Z. and Waldinger, R. (1971). Toward automatic program synthesis. CACM, 14(3), 151-165.
Manna, Z. and Waldinger, R. (1985). The Logical Basis for Computer Programming: Volume 1: Deductive Reasoning. Addison-Wesley.
Manning, C. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
Manning, C., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Mannion, M. (2002). Using first-order logic for product line model validation. In Software Product Lines: Second International Conference. Springer.
Manzini, G. (1995). BIDA*: An improved perimeter search algorithm. AIJ, 72(2), 347-360.
Marbach, P. and Tsitsiklis, J. N. (1998). Simulation-based optimization of Markov reward processes. Technical report LIDS-P-2411, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology.
Marcus, G. (2009). Kluge: The Haphazard Evolution of the Human Mind. Mariner Books.
Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19(2), 313-330.
Markov, A. A. (1913). An example of statistical investigation in the text of "Eugene Onegin" illustrating coupling of "tests" in chains. Proc. Academy of Sciences of St. Petersburg, 7.
Maron, M. E. (1961). Automatic indexing: An experimental inquiry. JACM, 8(3), 404-417.
Maron, M. E. and Kuhns, J.-L. (1960). On relevance, probabilistic indexing and information retrieval. CACM, 7, 219-244.
Marr, D. (1982). Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman.
Marriott, K. and Stuckey, P. J. (1998). Programming with Constraints: An Introduction. MIT Press.
Marsland, A. T. and Schaeffer, J. (Eds.). (1990). Computers, Chess, and Cognition. Springer-Verlag.
Marsland, S. (2009). Machine Learning: An Algorithmic Perspective. CRC Press.
Martelli, A. (1977). On the complexity of admissible search algorithms. AIJ, 8(1), 1-13.
Martelli, A. and Montanari, U. (1973). Additive AND/OR graphs. In IJCAI-73, pp. 1-11.
Martelli, A. and Montanari, U. (1978). Optimizing decision trees through heuristically guided search. CACM, 21, 1025-1039.
Marthi, B., Pasula, H., Russell, S. J., and Peres, Y. (2002). Decayed MCMC filtering. In UAI-02, pp. 319-326.
Marthi, B., Russell, S. J., Latham, D., and Guestrin, C. (2005). Concurrent hierarchical reinforcement learning. In IJCAI-05.
Marthi, B., Russell, S. J., and Wolfe, J. (2007). Angelic semantics for high-level actions. In ICAPS-07.
Marthi, B., Russell, S. J., and Wolfe, J. (2008). Angelic hierarchical planning: Optimal and online algorithms. In ICAPS-08.
Martin, D., Fowlkes, C., and Malik, J. (2004). Learning to detect natural image boundaries using local brightness, color, and texture cues. PAMI, 26(5), 530-549.
Martin, J. H. (1990). A Computational Model of Metaphor Interpretation. Academic Press.
Mason, M. (1993). Kicking the sensing habit. AIMag, 14(1), 58-59.
Mason, M. (2001). Mechanics of Robotic Manipulation. MIT Press.
Mason, M. and Salisbury, J. (1985). Robot Hands and the Mechanics of Manipulation. MIT Press.
Mataric, M. J. (1997). Reinforcement learning in the multi-robot domain. Autonomous Robots, 4(1), 73-83.
Mates, B. (1953). Stoic Logic. University of California Press.
Matuszek, C., Cabral, J., Witbrock, M., and DeOliveira, J. (2006). An introduction to the syntax and semantics of Cyc. In Proc. AAAI Spring Symposium on Formalizing and Compiling Background Knowledge and Its Applications to Knowledge Representation and Question Answering.
Maxwell, J. and Kaplan, R. (1993). The interface between phrasal and functional constraints. Computational Linguistics, 19(4), 571-590.
McAllester, D. A. (1980). An outlook on truth maintenance. AI memo 551, MIT AI Laboratory.
McAllester, D. A. (1988). Conspiracy numbers for min-max search. AIJ, 35(3), 287-310.
McAllester, D. A. (1998). What is the most pressing issue facing AI and the AAAI today? Candidate statement, election for Councilor of the American Association for Artificial Intelligence.
McAllester, D. A. and Rosenblitt, D. (1991). Systematic nonlinear planning. In AAAI-91, Vol. 2, pp. 634-639.
McCallum, A. (2003). Efficiently inducing features of conditional random fields. In UAI-03.
McCarthy, J. (1958). Programs with common sense. In Proc. Symposium on Mechanisation of Thought Processes, Vol. 1, pp. 77-84.
McCarthy, J. (1963). Situations, actions, and causal laws. Memo 2, Stanford University Artificial Intelligence Project.
McCarthy, J. (1968). Programs with common sense. In Minsky, M. L. (Ed.), Semantic Information Processing, pp. 403-418. MIT Press.
McCarthy, J. (1980). Circumscription: A form of non-monotonic reasoning. AIJ, 13(1-2), 27-39.
McCarthy, J. (2007). From here to human-level AI. AIJ, 171(18), 1174-1182.
McCarthy, J. and Hayes, P. J. (1969). Some philosophical problems from the standpoint of artificial intelligence. In Meltzer, B., Michie, D., and Swann, M. (Eds.), Machine Intelligence 4, pp. 463-502. Edinburgh University Press.
McCarthy, J., Minsky, M. L., Rochester, N., and Shannon, C. E. (1955). Proposal for the Dartmouth summer research project on artificial intelligence. Tech. rep., Dartmouth College.
McCawley, J. D. (1988). The Syntactic Phenomena of English (2 volumes). University of Chicago Press.
McCorduck, P. (2004). Machines Who Think: A Personal Inquiry into the History and Prospects of Artificial Intelligence (Revised edition). A K Peters.
McCulloch, W. S. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-137.
McCune, W. (1992). Automated discovery of new axiomatizations of the left group and right group calculi. JAR, 9(1), 1-24.
McCune, W. (1997). Solution of the Robbins problem. JAR, 19(3), 263-276.
McDermott, D. (1976). Artificial intelligence meets natural stupidity. SIGART Newsletter, 57, 4-9.
McDermott, D. (1978a). Planning and acting. Cognitive Science, 2(2), 71-109.
McDermott, D. (1978b). Tarskian semantics, or, no notation without denotation! Cognitive Science, 2(3).
McDermott, D. (1985). Reasoning about plans. In Hobbs, J. and Moore, R. (Eds.), Formal Theories of the Commonsense World. Intellect Books.
McDermott, D. (1987). A critique of pure reason. Computational Intelligence, 3(3), 151-237.
McDermott, D. (1996). A heuristic estimator for means-ends analysis in planning. In ICAPS-96, pp. 142-149.
McDermott, D. and Doyle, J. (1980). Non-monotonic logic: I. AIJ, 13(1-2), 41-72.
McDermott, J. (1982). R1: A rule-based configurer of computer systems. AIJ, 19(1), 39-88.
McEliece, R. J., MacKay, D. J. C., and Cheng, J.-F. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communications, 16(2), 140-152.
McGregor, J. J. (1979). Relational consistency algorithms and their application in finding subgraph and graph isomorphisms. Information Sciences, 19(3), 229-250.
McIlraith, S. and Zeng, H. (2001). Semantic web services. IEEE Intelligent Systems, 16(2), 46-53.
McLachlan, G. J. and Krishnan, T. (1997). The EM Algorithm and Extensions. Wiley.
McMillan, K. L. (1993). Symbolic Model Checking. Kluwer.
Meehl, P. (1955). Clinical vs. Statistical Prediction. University of Minnesota Press.
Mendel, G. (1866). Versuche über Pflanzen-Hybriden. Verhandlungen des Naturforschenden Vereins, Abhandlungen, Brünn, 4, 3-47. Translated into English by C. T. Druery, published by Bateson (1902).
Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A, 209, 415-446.
Merleau-Ponty, M. (1945). Phenomenology of Perception. Routledge.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. (1953). Equations of state calculations by fast computing machines. J. Chemical Physics, 21, 1087-1091.
Metzinger, T. (2009). The Ego Tunnel: The Science of the Mind and the Myth of the Self. Basic Books.
Mézard, M. and Nadal, J.-P. (1989). Learning in feedforward layered networks: The tiling algorithm. J. Physics, 22, 2191-2204.
Michalski, R. S. (1969). On the quasi-minimal solution of the general covering problem. In Proc. First International Symposium on Information Processing, pp. 125-128.
Michalski, R. S., Mozetic, I., Hong, J., and Lavrac, N. (1986). The multi-purpose incremental learning system AQ15 and its testing application to three medical domains. In AAAI-86, pp. 1041-1045.
Michie, D. (1966). Game-playing and game-learning automata. In Fox, L. (Ed.), Advances in Programming and Non-Numerical Computation, pp. 183-200. Pergamon.
Michie, D. (1972). Machine intelligence at Edinburgh. Management Informatics, 2(1), 7-12.
Michie, D. (1974). Machine intelligence at Edinburgh. In On Intelligence, pp. 143-155. Edinburgh University Press.
Michie, D. and Chambers, R. A. (1968). BOXES: An experiment in adaptive control. In Dale, E. and Michie, D. (Eds.), Machine Intelligence 2, pp. 125-133. Elsevier/North-Holland.
Michie, D., Spiegelhalter, D. J., and Taylor, C. (Eds.). (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood.
Milch, B., Marthi, B., Sontag, D., Russell, S. J., Ong, D., and Kolobov, A. (2005). BLOG: Probabilistic models with unknown objects. In IJCAI-05.
Milch, B., Zettlemoyer, L. S., Kersting, K., Haimes, M., and Kaelbling, L. P. (2008). Lifted probabilistic inference with counting formulas. In AAAI-08, pp. 1062-1068.
Milgrom, P. (1997). Putting auction theory to work: The simultaneous ascending auction. Technical Report 98-0002, Stanford University Department of Economics.
Mill, J. S. (1843). A System of Logic, Ratiocinative and Inductive: Being a Connected View of the Principles of Evidence, and Methods of Scientific Investigation. J. W. Parker, London.
Mill, J. S. (1863). Utilitarianism. Parker, Son and Bourn, London.
Miller, A. C., Merkhofer, M. M., Howard, R. A., Matheson, J. E., and Rice, T. R. (1976). Development of automated aids for decision analysis. Technical report, SRI International.
Minker, J. (2001). Logic-Based Artificial Intelligence. Kluwer.
Minsky, M. L. (1975). A framework for representing knowledge. In Winston, P. H. (Ed.), The Psychology of Computer Vision, pp. 211-277. McGraw-Hill. Originally an MIT AI Laboratory memo; the 1975 version is abridged, but is the most widely cited.
Minsky, M. L. (1986). The Society of Mind. Simon and Schuster.
Minsky, M. L. (2007). The Emotion Machine: Commonsense Thinking, Artificial Intelligence, and the Future of the Human Mind. Simon and Schuster.
Minsky, M. L. and Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry (first edition). MIT Press.
Minsky, M. L. and Papert, S. (1988). Perceptrons: An Introduction to Computational Geometry (Expanded edition). MIT Press.
Minsky, M. L., Singh, P., and Sloman, A. (2004). The St. Thomas common sense symposium: Designing architectures for human-level intelligence. AIMag, 25(2), 113-124.
Minton, S. (1984). Constraint-based generalization: Learning game-playing plans from single examples. In AAAI-84, pp. 251-254.
Minton, S. (1988). Quantitative results concerning the utility of explanation-based learning. In AAAI-88, pp. 564-569.
Minton, S., Johnston, M. D., Philips, A. B., and Laird, P. (1992). Minimizing conflicts: A heuristic repair method for constraint satisfaction and scheduling problems. AIJ, 58(1-3), 161-205.
Misak, C. (2004). The Cambridge Companion to Peirce. Cambridge University Press.
Mitchell, M. (1996). An Introduction to Genetic Algorithms. MIT Press.
Mitchell, M., Holland, J. H., and Forrest, S. (1996). When will a genetic algorithm outperform hill climbing? In Cowan, J., Tesauro, G., and Alspector, J. (Eds.), NIPS 6. MIT Press.
Mitchell, T. M. (1977). Version spaces: A candidate elimination approach to rule learning. In IJCAI-77, pp. 305-310.
Mitchell, T. M. (1982). Generalization as search. AIJ, 18(2), 203-226.
Mitchell, T. M. (1990). Becoming increasingly reactive (mobile robots). In AAAI-90, Vol. 2, pp. 1051-1058.
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.
Mitchell, T. M. (2005). Reading the web: A breakthrough goal for AI. AIMag, 26(3), 12-16.
Mitchell, T. M. (2007). Learning, information extraction and the web. In ECML/PKDD, p. 1.
Mitchell, T. M., Keller, R., and Kedar-Cabelli, S. (1986). Explanation-based generalization: A unifying view. Machine Learning, 1, 47-80.
Mitchell, T. M., Shinkareva, S. V., Carlson, A., Chang, K.-M., Malave, V. L., Mason, R. A., and Just, M. A. (2008). Predicting human brain activity associated with the meanings of nouns. Science, 320, 1191-1195.
Mitchell, T. M., Utgoff, P. E., and Banerji, R. (1983). Learning by experimentation: Acquiring and refining problem-solving heuristics. In Michalski, R. S., Carbonell, J. G., and Mitchell, T. M. (Eds.), Machine Learning: An Artificial Intelligence Approach, pp. 163-190. Morgan Kaufmann.
Mohr, R. and Henderson, T. C. (1986). Arc and path consistency revisited. AIJ, 28(2), 225-233.
Mohri, M., Pereira, F., and Riley, M. (2002). Weighted finite-state transducers in speech recognition. Computer Speech and Language, 16(1), 69-88.
Montague, P. R., Dayan, P., Person, C., and Sejnowski, T. J. (1995). Bee foraging in uncertain environments using predictive Hebbian learning. Nature, 377, 725-728.
Montague, R. (1970). English as a formal language. In Linguaggi nella Società e nella Tecnica, pp. 189-224. Edizioni di Comunità.
Montague, R. (1973). The proper treatment of quantification in ordinary English. In Hintikka, K. J. J., Moravcsik, J. M. E., and Suppes, P. (Eds.), Approaches to Natural Language. D. Reidel.
Montanari, U. (1974). Networks of constraints: Fundamental properties and applications to picture processing. Information Sciences, 7(2), 95-132.
Morjaria, M. A., Rink, F. J., Smith, W. D., Klempner, G., Burns, C., and Stein, J. (1995). Elicitation of probabilities for belief networks: Combining qualitative and quantitative information. In UAI-95, pp. 141-148.
Morrison, P. and Morrison, E. (Eds.). (1961). Charles Babbage and His Calculating Engines: Selected Writings by Charles Babbage and Others. Dover.
Montemerlo, M. and Thrun, S. (2004). Large-scale robotic 3-D mapping of urban structures. In Proc. International Symposium on Experimental Robotics. Springer Tracts in Advanced Robotics (STAR).
Montemerlo, M., Thrun, S., Koller, D., and Wegbreit, B. (2002). FastSLAM: A factored solution to the simultaneous localization and mapping problem. In AAAI-02.
Mooney, R. (1999). Learning for semantic interpretation: Scaling up without dumbing down. In Proc. 1st Workshop on Learning Language in Logic, pp. 7-15.
Moore, A. and Wong, W.-K. (2003). Optimal reinsertion: A new search operator for accelerated and more accurate Bayesian network structure learning. In ICML-03.
Moore, A. W. and Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13, 103-130.
Moore, A. W. and Lee, M. S. (1997). Cached sufficient statistics for efficient machine learning with large datasets. JAIR, 8, 67-91.
Moore, E. F. (1959). The shortest path through a maze. In Proc. an International Symposium on the Theory of Switching, Part II, pp. 285-292. Harvard University Press.
Moore, R. C. (1980). Reasoning about knowledge and action. Artificial intelligence center technical note 191, SRI International.
Moore, R. C. (1985). A formal theory of knowledge and action. In Hobbs, J. R. and Moore, R. C. (Eds.), Formal Theories of the Commonsense World, pp. 319-358. Ablex.
Moore, R. C. (2005). Association-based bilingual word alignment. In Proc. ACL-05 Workshop on Building and Using Parallel Texts, pp. 1-8.
Moravec, H. P. (1983). The Stanford Cart and the CMU Rover. Proc. IEEE, 71(1), 872-884.
Moravec, H. P. and Elfes, A. (1985). High resolution maps from wide angle sonar. In ICRA-85, pp. 116-121.
Moravec, H. P. (1988). Mind Children: The Future of Robot and Human Intelligence. Harvard University Press.
Moravec, H. P. (2000). Robot: Mere Machine to Transcendent Mind. Oxford University Press.
Morgenstern, L. (1998). Inheritance comes of age: Applying nonmonotonic techniques to problems in industry. AIJ, 103, 237-271.
Moskewicz, M. W., Madigan, C. F., Zhao, Y., Zhang, L., and Malik, S. (2001). Chaff: Engineering an efficient SAT solver. In Proc. 38th Design Automation Conference (DAC 2001), pp. 530-535.
Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. Addison-Wesley.
Mostow, J. and Prieditis, A. E. (1989). Discovering admissible heuristics by abstracting and optimizing: A transformational approach. In IJCAI-89, Vol. 1, pp. 701-707.
Motzkin, T. S. and Schoenberg, I. J. (1954). The relaxation method for linear inequalities. Canadian Journal of Mathematics, 6(3), 393-404.
Moutarlier, P. and Chatila, R. (1989). Stochastic multisensory data fusion for mobile robot location and environment modeling. In ISRR-89.
Mueller, E. T. (2006). Commonsense Reasoning. Morgan Kaufmann.
Muggleton, S. H. (1991). Inductive logic programming. New Generation Computing, 8, 295-318.
Muggleton, S. H. (1992). Inductive Logic Programming. Academic Press.
Muggleton, S. H. (1995). Inverse entailment and Progol. New Generation Computing, 13(3-4), 245-286.
Muggleton, S. H. (2000). Learning stochastic logic programs. Proc. AAAI 2000 Workshop on Learning Statistical Models from Relational Data.
Muggleton, S. H. and Buntine, W. (1988). Machine invention of first-order predicates by inverting resolution. In ICML-88, pp. 339-352.
Muggleton, S. H. and De Raedt, L. (1994). Inductive logic programming: Theory and methods. J. Logic Programming, 19/20, 629-679.
Muggleton, S. H. and Feng, C. (1990). Efficient induction of logic programs. In Proc. Workshop on Algorithmic Learning Theory, pp. 368-381.
Müller, M. (2002). Computer Go. AIJ, 134(1-2), 145-179.
Murphy, K. and Russell, S. J. (2001). Rao-Blackwellised particle filtering for dynamic Bayesian networks. In Doucet, A., de Freitas, N., and Gordon, N. J. (Eds.), Sequential Monte Carlo Methods in Practice. Springer-Verlag.
Murphy, K. and Weiss, Y. (2001). The factored frontier algorithm for approximate inference in DBNs. In UAI-01, pp. 378-385.
Murphy, R. (2000). Introduction to AI Robotics. MIT Press.
Murray-Rust, P., Rzepa, H. S., Williamson, J., and Willighagen, E. L. (2003). Chemical markup, XML and the world-wide web. 4. CML schema. J. Chem. Inf. Comput. Sci., 43, 752-772.
Murthy, C. and Russell, J. R. (1990). A constructive proof of Higman's lemma. In LICS-90, pp. 257-269.
Muscettola, N. (2002). Computing the envelope for stepwise-constant resource allocations. In CP-02, pp. 139-154.
Muscettola, N., Nayak, P., Pell, B., and Williams, B. (1998). Remote Agent: To boldly go where no AI system has gone before. AIJ, 103, 5-48.
Muslea, I. (1999). Extraction patterns for information extraction tasks: A survey. In Proc. AAAI-99 Workshop on Machine Learning for Information Extraction.
Müller, M. (2003). Conditional combinatorial games, and their application to analyzing capturing races in Go. Information Sciences, 154(3-4), 189-202.
Mumford, D. and Shah, J. (1989). Optimal approximations by piece-wise smooth functions and associated variational problems. Commun. Pure Appl. Math., 42, 577-685.
Murphy, K. (2001). The Bayes net toolbox for MATLAB. Computing Science and Statistics, 33.
Murphy, K. (2002). Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. thesis, UC Berkeley.
Murphy, K. and Mian, I. S. (1999). Modelling gene expression data using Bayesian networks. people.cs.ubc.ca/~murphyk/Papers/ismb99.pdf.
Murphy, K., Weiss, Y., and Jordan, M. I. (1999). Loopy belief propagation for approximate inference: An empirical study. In UAI-99, pp. 467-475.
Myerson, R. (1981). Optimal auction design. Mathematics of Operations Research, 6, 58-73.
Myerson, R. (1986). Multistage games with communication. Econometrica, 54, 323-358.
Myerson, R. (1991). Game Theory: Analysis of Conflict. Harvard University Press.
Nagel, T. (1974). What is it like to be a bat? Philosophical Review, 83, 435-450.
Nalwa, V. S. (1993). A Guided Tour of Computer Vision. Addison-Wesley.
Nash, J. (1950). Equilibrium points in N-person games. PNAS, 36, 48-49.
Nau, D. S. (1980). Pathology on game trees: A summary of results. In AAAI-80, pp. 102-104.
Nau, D. S. (1983). Pathology on game trees revisited, and an alternative to minimaxing. AIJ, 21(1-2), 221-244.
Nau, D. S., Kumar, V., and Kanal, L. N. (1984). General branch and bound, and its relation to A* and AO*. AIJ, 23, 29-58.
Nayak, P. and Williams, B. (1997). Fast context switching in real-time propositional reasoning. In AAAI-97, pp. 50-56.
Neal, R. (1996). Bayesian Learning for Neural Networks. Springer-Verlag.
Nebel, B. (2000). On the compilability and expressive power of propositional planning formalisms. JAIR, 12, 271-315.
Nefian, A., Liang, L., Pi, X., Liu, X., and Murphy, K. (2002). Dynamic Bayesian networks for audio-visual speech recognition. EURASIP Journal of Applied Signal Processing, 11, 1-15.
Nesterov, Y. and Nemirovski, A. (1994). Interior-Point Polynomial Methods in Convex Programming. SIAM (Society for Industrial and Applied Mathematics).
Netto, E. (1901). Lehrbuch der Combinatorik. B. G. Teubner.
Nevill-Manning, C. G. and Witten, I. H. (1997). Identifying hierarchical structure in sequences: A linear-time algorithm. JAIR, 7, 67-82.
Newell, A. (1982). The knowledge level. AIJ, 18(1), 82-127.
Newell, A. (1990). Unified Theories of Cognition.
Harvard University Press.
Neweii,A. and Ernst, G. (1965). The search for gen erality. In Proc. JFIP Congress, Vol. 1, pp. 17-24.
Newell, A., Shaw, J. C., and Simon, H. A. (1957). Empirical explorations with the logic theory machine. Proc. Western Joint Computer Conference, 15, 218-239. Reprinted in Feigenbaum and Feldman (1963).
Newell, A., Shaw, J. C., and Simon, H. A. (1958). Chess playing programs and the problem of complexity. IBM Journal of Research and Development, 4(2), 320-335.
Newell, A. and Simon, H. A. (1961). GPS, a program that simulates human thought. In Billing, H. (Ed.), Lernende Automaten, pp. 109-124. R. Oldenbourg.
Newell, A. and Simon, H. A. (1972). Human Problem Solving. Prentice-Hall.
Newell, A. and Simon, H. A. (1976). Computer science as empirical inquiry: Symbols and search. CACM, 19, 113-126.
Nilsson, N. J. (1984). Shakey the robot. Technical
note 323, SRI InternationaL
Nilsson, N.J. (1986). Probabilistic lngic. AIJ,28(1), 71-87.
Nilsson, N. J. (1991). Logic and artificial intelli gence. AJJ, 47(1-3), 3 1-56.
Nilsson, N. J. (1995). Eye on the prize. AIMag, /6(2), 9-17. Nilsson, N. J. (1998). ArtificialIntelligence: A New Synthesis. Morgan Kaufmann.
Nilsson, N. J. (2005). Human-le>�l artificial intelli gence? be serious! A/Mag, 26(4), 68-75. Nilsson, N. J. (2009). The QuestforArtificial Intel
ligence: AHistory ofIdeas andAchievements. Cam bridge University Press.
Nisan, N., Roughgarden, T., Tardos, E,, and Yazi rani, V. (Eds.). (2007). Algorithmic Game Theory. Cambridge University Press. Noe, A. (2009). Out of Our Heads: IVhy You Are
Not YourBrain, and OtherLessonsfrom the Biology ofConsciousness. Hil l and Wang.
Norvig, P. (1988). Multiple simultaneous interpretations of ambiguous sentences. In COGSCI-88.
Newton, I. (1664-1671). Methodus fluxionum et serierum infinitarum. Unpublished notes.
Norvig, P. (1992). Paradigms of Artificial Intelligence Programming: Case Studies in Common Lisp. Morgan Kaufmann.
Ng, A. Y. (2004). Feature selection, 11 vs. !2 regu larization, and rotational invariance. In JCM/.,04.
Norvig, P. (2009). Natural language corpus data. In Segaran, T. and Hammerbacher, J. (Eds.), Beautiful
Ng, A. Y. and Jordan, M. L (2000). PEGASUS: A policy search method forlarge MOPs and POMDPs.
Nowick, S. M., Dean, M. E., Dill, D. L., and Horowitz, M. (1993). The design of a high performance cache controller: A case study n i asyn chronous synthesis. Integration: The VLSJJournal,
Ng, A. Y., Kim, H. J., Jordan, M. 1., and Sastry, S. (2004). Autonomous helicopterflight via reinforce
Nunbe.rg, G. (1979). The non-uniqueness of seman tic solutions: Polysemy. Language and Philosophy,
Ng, A. Y., Harada, D., and Russeii,S. J. (1999). Pol icy invariance underrewardtransformations: Theory and application to reward shaping. In JCML-99. In UAI-00, pp.406-415.
ment learning.
In NIPS 16.
Nguyen, X. and Kambhrunpati, S. (2001). Reviving partial orderplanning. In JJCAI-01, pp. 459-466.
Nguyen, X., Kambhampati, S., and Nigenda, R. S. (2001). Planning graph as the basis for deriving heuristics for plan synthesis by state space and CSP search. Tech. rep., Computer Science and Engineering Department, Arizona State University.
Nicholson, A. and Brady, J. M. (1992). The data as sociation problem when monitoring robot vehicles using dynamic belief networlcs. In ECAI-92, pp. 689-693.
Niemelä, I., Simons, P., and Syrjänen, T. (2000). Smodels: A system for answer set programming. In Proc. 8th International Workshop on Non-Monotonic Reasoning.
Nigam, K., McCallum, A., Thrun, S., and Mitchell, T. M. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning,
39(2-3), 103-134.
Niles, L and Pease, A. (2001). Towards a standard In FOJS '0I: Proc. n i ternational i lnfornllltion Sys conference on FomJal Ontology n
upper ontology.
tems,pp. 2-9.
Nilsson, D. and Lauritzen, S. (2000). Evaluating influence diagrams using LIMIDs. In UAI-00, pp. 436-445.
Nilsson, N. J. (1965). Learning Machines: FOun dations of Trainable Rlttem-Classifying Systems. McGraw-Hill. Republished in 1990. Nilsson, N. J. (1971). Problem-Solving Methods in
Artificial Intelligence. McGraw-Hil L
Data. O'Reilly.
/5(3), 241-262.
3(2), 143-184.
Nussbaum, M. C. (1978). Aristotle's De Motu Animalium. Princeton University Press.
O'Reilly, U.-M. and Oppacher, F. (1994). Program search with a hierarchical variable length representation: Genetic programming, simulated annealing and hill climbing. In Proc. Third Conference on Parallel Problem Solving from Nature, pp. 397-406.
Ormoneit, D. and Sen, S. (2002). Kernel-based reinforcement learning. Machine Learning, 49(2-3), 161-178.
Osborne, M. J. (2004). An Introduction to Game Theory. Oxford University Pres. Osborne, M. J. and Rubinstein,A. (1994). A Course n i Game Theory. MIT Press.
Osherson, D. N., Stob, M.,
and Weinstein, S.
(1986). Systems That Leam: An Introduction to Learning Theory for Cogniti••e and Computer Sci ensts. ti MIT Press.
Padgham; L, and Winikoff, M. (2004). !kl'cloping
IntelligentAgemSystems: A PracticalGuide. Wiley. Page, C. D. and Srinivasan, A. (2002). ILP: A short
look back and a longer look forward. Submitted to Journal of Machine Learning Research. Palacios, H. and Geffner, fl (2007). From confor mant into classicaJ planning-: Efficient translations that may be complete too. In JCAPS-07. .
Palay, A. J. (1985). Scan:hing with Probabilities.
Pitman.
Palmer, D. A. and Hearst, M. A. (1994). Adapti� sentenceboun dary dismbig uation.InProc. Confer a n guage Processing, pp. ence on App li e d NaturolLD 78-83. Palmer, S. (1999). VisionScience: Photons to Phe nomenology. MIT Press. Papadimilriou, C. H. (1994). Computational Com plexiry. Addison Wesley.
Papadimitriou, C. H., Tamaki, H., Raghavan, P., and Vempala, S. (1998). Latent semantic indexing: A probabilisticanalysis. In PODS-98, pp. 159-168. Papadimilriou, C. H. and Tsitsiklis, J. N. (1987). The complexity of Maikov decision processes.
Mathematics of Operations Research, 12(3), 441-450.
Och, F. J. and Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19-51.
Oaksford, M. and Chater, N. (Eds.). (1998). Rational Models of Cognition. Oxford University Press.
Och, F. J. and Ney, H. (2004). The alignment template approach to statistical machine translation.
Computaonal ti Linguistics, 30,417-449.
Ogawa, S., Lee, T.-M., Kay, A. R., and Tank, D. W. (1990). Brain magneticresonance imagingwith con trast dependent on blood oxygenation. PNAS, 87, 9868-9872.
Oh, S., Russell, S. J., and Sastry, S. (2009). Markov
chain Monte Carlo data association for multi-target tracking. IEEE Transactions on Automatic Control, 54(3), 481-497. Olesen, K. G. (1993). Causal probabilistic networks with both discrete and conrinuous variables. PAM/,
/5(3), 275-279.
Oliver, N., Garg, A., and Horvitz, E. J. (2004). Lay ered representations for learning and inferring office activity from multiple sensory channels. Computer Vision andlnJage Understanding, 96, 163-180.
OJi,,.r, R. M. and Smith, J. Q. (Eds.). (1990). Influ ence Diagrams, BeliefNtts and Decision Analysis.
Wiley. Omohundro, S. (2008).
The basic AI
dri•�s.
In
AGI-08 Workshop on rile Sociocultural, Ethical and Futurological Implications ofArtiftcial lnte/ligen.ce.
Papadimitriou, C. H. and Yannakakis, M. (1991). Shortest paths without a map. Theoretical Computer Science, 84(1), 127-150.
Papavassiliou, V.and Russel� S. J. (1999). Conver gence of reinforcement learning with general func tion approximators. In /JCAI-99, pp. 748-757.
Parekh, R. and Honavar, V. (2001). DFA learning from simple examples. Mac/tine Learning, 44, 935. Pa.risi, G. (1988). Statisticalfieldtheory. Addison Wesley. Pa.risi, M. M. G. and Zecchina, R.
(2002) Ana lytic and algorithmic solution of random satisfiabil ity problems. Scien.ce, 297, 812-815. .
Pa.rker, A., Nau, D. S., and Subrahmanian, V. S. (2005). Game-tree searchwith combinatorially !urge belief states. In JJCAI-05, pp. 254-259.
Parker, D. B. (1985). Learning logc. Tech nical report TR-47, Center for Computational Re
i
search in Economics and Management Science, Massachusetts Institute ofTechnology.
Parker, L E. (1996). On the design of behavior based multi-robot teams. J. Advanced Rnbotics, /0(6).
Parr, R. and Russell, S. J. (1998). Reinforcement
learning with hierarchies of machines. In Jordan, M. 1., Kearns, M., and Solla, S. A. (Eds.), NIPS 10. MIT Press.
Parz.<m, E. (1962). On estimat ion of a probability density function and mode. Annals ofMathematical Stastics, ti 33, 1065-1076. Pasca, M. andHarabagiu, S. M. (2001 ). Highperfor manoe question/answering. In SIGIR-01, pp. 366374. Pasca, M., Lin, D., Bigham, J., Lifchits, A., and Jain, A. (2006). Organizing and searching the world
wide web of facts-Srep one: The one-million fact extraction challenge. InAAA/.06. Paskin, M. (2001). Grammatc i al bigrams. In NIPS.
Pasula, fl, Marthi, B., Milch, B., Russell, S. J., and
Shpitser, I. (2003). Identity uncertainty and citation matching. In NIPS I5. MIT Press.
Pasula, H. and Russell, S. J. (2001). Approximate interence for first-order probabilistic languages. In JJCAJ.ol.
Pasula, H., Russell, S. J., Osdand, M., and Ritov, Y. (1999). Tracking many objects with many sensors. In IJCA/-99.
Patashnik, 0. (1980). Qubic: 4x4x4 tic-tac-toe. Mathematics Magazine, 53(4), 202-216. Patrick, B. G., Almulla, M., and Newborn, M. (1992). An upper bound on the time complexity of iterative-deepening-A•. All, 5(2-4), 265-278.
Paul, R. P. (1981). RobotManipulators: Mathemar ics, Progmmming, and Control. MIT Press.
Pauls, A. and Klein, D. (2009). K-best A* parsing.
lnACL-09.
Peano, G. (1889).
Arirhmetices principia, nova
me.rhodo exposita. Fratres Bocca, Turin.
Pearce.,J., Tambe, M., and Mabeswaran, R. (2008). Solving multiagent networks using distribured con straintoptimizatioo. A/Mag, 29(3), 47-62.
Peirce, C. S. (1870). Descriptionofanotation forthe logic of relatives, resulting from an amplification of the conceptions of Boote's calculus of logic. Mtm
oirs ofthe AmericanAcademy ofArts and Sciences, 9, 317-378.
Peirce, C.S. (1883). A theory ofprobable inferenoe. Nore B. The logic ofrelatives. In Studies in Logic by Members oftlre Johns Hopkins University, pp. 187203, Boston. Peirce, C. S. (1902). Logic as semiotc i : The the ory of signs. Unpublished manuscript; reprinted in (Buchler 1955). Peirce, C. S. (1909). Exisrential graphs. Unpub lished manuscript; reprinted n i (Buchler 1955). Pelikan, M., Goldberg, D. E., and Cantu-Paz, E. e sian optimization algo (1999). BOA: The Bay
rithm. In GECC0-99: Proc. G<mttic and Evolution ary Computation Conference, pp. 525-532.
Pemberton, J. C. and Korf, R. E. (1992). Incremen tal planning on graphs with cycles. InA/PS-92, pp. 525-532. Penberthy, J. S. and Weld, D. S. (1992). UCPOP: A souod, complere, partial order planner for ADL In KR-92, pp. 103-114. Peng, J. and Williams, R. J. (1993). Efficient learn
ingand planning within th.e Dyna framework. Adap tive Btha�•ior, 2, 437454.
Penrose., R. (1989). The Emperor's New Mind. Ox ford University Press.
Penrose, R. (1994). Shadows of the Mind. Oxford University Press.
Peot, M. and Smith, D. E. (1992). Conditional nonlinear planning. In ICAPS-92, pp. 189-197.
Pearl, J. (1982a). Reverend Bayes on inference engines: A distributed hierarchical approach. In AAAI-82, pp. 133-136.
Pearl, J. (1982b). The solution for the branching factor of the alpha-beta pruning algorithm and its optimality. CACM, 25(8), 559-564.
Pearl, J. (1984). Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley.
Pearl, J. (1986). Fusion, propagation, and structuring in belief networks. AIJ, 29, 241-288.
Pereira, F. and Shieber, S. (1987). Prolog and Natural-Language Analysis. Center for the Study of Language and Information (CSLI).
Pereira, F. and Warren, D. H. D. (1980). Definite clause grammars for language analysis: A survey of the formalism and a comparison with augmented transition networks. AIJ, 13, 231-278.
Raphson, J. (1690). Analysis aequationum universalis. Apud Abelem Swalle, London.
Rashevsky, N. (1936). Physico-mathematical aspects of excitation and conduction in nerves. In Cold Springs Harbor Symposia on Quantitative Biology. IV: Excitation Phenomena, pp. 90-97.
Riazanov, A. and Voronkov, A. (2002). The design and implementation of VAMPIRE. AI Communications, 15(2-3), 91-110.
Rich, E. and Knight, K. (1991). Artificial Intelligence (second edition). McGraw-Hill.
Richards, M. and Amir, E. (2007). Opponent modeling in Scrabble. In IJCAI-07.
Rashevsky, N. (1938). Mathtnratical Biophysics:
Richardson, M., Bilmes, J., and Diorio, C. (2000). Hidden-articulator Markov models: Perlormance impro,•ements and robustness to noise. In ICASSP00.
Gundenoo, K. (Ed.), Lllngua[ll', Mind and Knowl edge: Minnesota Studies n i the Philosophy ofSci ence.University of Minnesota Press.
Rasmussen,
Richter, S. and Westphal, M. (2008). The LAMA planner. In Proc. Inremational Planning Competi
MIT Press.
tion at !CAPS.
Pylyshyn, Z. W. (1974). Minds, machines and phe nomenology: Some reftections on Dreyfus' 'What Compulers Can't Do". lnt J. Cognitive Psychology, 3(1), 57-77.
Rassenti, S., Smith, V., and Buffin, R. (1982). A
Ridley, M. (2004). Evolution. Oxford Reader.
Mind, and Rtlig;on, pp. 37-48. University of Pitts burgh Press.
Putnam, H. (1975). The meaning of"meaning". In
Physico-Mathematical Foundarions ofBiology. Uni •�rsity ofChicago Press. C. E. and Williams,
C. K. I.
(2006). Gaussian Processesfor Machine Learning.
combinatoriaJ auction mechanism for airport time slot allocation. Bell Journal ofEconomics, 13, 402417.
Rieger, C. (1976). Anorganizatiooofknowledge for
problem solving and language comprehension. AJJ, 7,89-127.
Riley, J. and Samuelson, W. (1981). Optimal auctions. American Economic Review, 71, 381-392.
Riloff, E. (1993). Automatically constructing a dictionary for information extraction tasks. In AAAI-93, pp. 811-816.
Rintanen, J. (1999). Improvements to the evaluation of quantified Boolean formulae. In IJCAI-99, pp. 1192-1197.
Rintanen, J. (2007). Asymptotically optimal encodings of conformant planning in QBF. In AAAI-07, pp. 1045-1050.
Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, IT-30(4), 629-636.
Rissanen, J. (2007). lnfomrotion and Complexity n i
Statistical Modeling. Springer.
Ritchie, G. D. and Hanna, F. K. (1984). AM: A case study in AI methodology. AIJ, 21(3), 249-268.
Rivest, R. (1987). learning decision lists. Machn ie Learn n i g, 2(3), 229-246.
Roberts, L. G. (1963). Machine perception of three-dimensional solids. Technical report 315, MIT Lincoln Laboratory.
Robertson, N. and Seymour, P. D. (1986). Graph minors. II. Algorithmic aspects of tree-width. J. Algorithms, 7(3), 309-322.
Robertson, S. E. (1977). The probability ranking principle in IR. J. Documentation, 33, 294-304.
Robertson, S. E. and Sparck Jones, K. (1976). Relevance weighting of search terms. J. American Society for Information Science, 27, 129-146.
Robinson, A. and Voronkov, A. (2001). Handbook of Automated Reasoning. Elsevier.
Robinson, J. A. (1965). A machine-oriented logic based on the resolution principle. JACM, 12, 23-41.
Roche, E. and Schabes, Y. (1997). Finite-State Language Processing (Language, Speech and Communication). Bradford Books.
Rock, I. (1984). Perception. W. H. Freeman.
Rosenblatt, F. (1957). The perceptron: A perceiving and recognizing automaton. Report 85-460-1, Project PARA, Cornell Aeronautical Laboratory.
Rosenblatt, F. (1960). On the convergence of reinforcement procedures in simple perceptrons. Report VG-1196-G-4, Cornell Aeronautical Laboratory.
Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan.
Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. Annals of Mathematical Statistics, 27, 832-837.
A., Wiener, N., and Bigelow, J. Behavior, purpose, and teleology. Philos
Rosenblueth,
(1943).
ophy ofScience, 10, 1 8-24.
ROilenschein, J. S. and Zlo�kin, G. (1994). Rules of
Encounter. MIT Press.
S. J. (1985). Formal theories of knowledge n i AI and robotics. New Generaon ti
Rosenschein,
Computing, 3(4), 345-357.
Ross,
P. E. (2004). Psyching out computer chess players. IEEESpectmm, 41(2), 14-15. Ross, S. M. (1988). A Fir.st Coursr n i Probability (third edition). Macmillan.
Rossi, F., van Beek, P., and Walsh, T. (2006). Hand
Russell, S. J. and Subr a manian, D. (1995). Provably
Roussel, P. (1975).
Russell, S. 1., Subramanian, D., and Parr, R. (1993). Provably bounded oplimal agents. In IJCA/-93, pp.
book ofConstraint Proc�ssing. Elsevier.
Prolog: Manual de reference et d'utilization. Tech_ rep., Groupe d'lntelligence Arti ficielle, Universite d'Aix-Marseille. Rouveirol, C. and Puget, J.-F. (1989). A simple and general solution for inverting resolution. In Proc.
European Workn i g Session on Learnn i g, pp. 201210.
Rowat, P. F. (1979). Representing the Spatial Ex perience and Solving Spaal ti problems n i a Simu lated Robot Environment. PhD. thesis, University
ofBritish Columbia_
Roweis,S. T. and Ghahramani, Z (1999). A unify ing reviewof linear Gaussian Models. Neural Com putation, I1(2), 305-345.
Rowley, H., Baluja, S., an.dKanade, T. (1996). Neu ral netwrk-based fuce detection_ In CVPR,pp. 203-
208.
Roy, N., Gordon, G., and Thrun, S. (2005). Finding
approximate POMDP solutions through belief oom pression. JAIR, 23, 1-40. Rubin, D. (1988). Using the sm algorithm to sim ulate posterior distributio·ns. Jn Bemasdo, J. M., de Groot, M. H., Lindley, D.V.,and Smith, A F. M. (Eds.), Bayesian Statistics 3, pp. 395-402. Oxford University Press. Rumelbart, D. E.,Hinton, G. E., and Williams, R. J. (1986a). Learning internal representations by error propagation. In Rurnelhart, D. E. and McClelland, J. L (Eds.), Parallel Distributed Processing, Vol. I, chap. 8, pp. 318-362. MITPress. Rumelb.,rt, D. E., H.inton, G. E., and Williams, R. J. (1986b). learning representations by back propagating errors. Naturo, 321, 533-536. Rumelbart, D. E. and McClelland, J. L. (Eds.). (1986). Paml/el Distribut-.r sity Engineering Department. Ruspini, E. H., Lowrance, J. D., and Strat, T. M. (1992). Understanding evidential reasoning. /JAR, 6(3), 401-424.
Russell, J. G. B. (1990). Is
screening for abdom inal aortic aneurysm wort:hwhile? Clinical Radiol
ogy,41, 182-184.
Russell, S. J. (1985). The compleat guide to MRS. Report STAN-CS-85-1080, Computer Science De
partment, Stanford University. Russell, S. J. (1986). A quantitati>-. analysis of anal ogy by similarity. InAAA/-86, pp. 284-288. Russell, S. J. (1988). Tree-structuredbias. InAAAI88, Vol. 2, pp. 641-645. Russell, S. J. (1992). Efficient memory-bounded search methods. Jn ECA/-92, pp. 1-5. Russell, S. J. (1998). Learning agents for uncertain environments (extended abstract). In COU-98, pp. 101-103.
Russell, S. J., Binder, J., Koller, D., and Kanazawa,
K- (1995). Local learning n i probabilistic netwolks with hidden variables. In JJCA/-95, pp. I 146-52. Russell, S. J. and Grosof, B. (1987). A declasarive approach to bias n i concept learning. InAAAI-87. Russell, S. J. and Norvig, P. (2003).ArtificialIntelli gence: A Madero Approach (2ud edition). Prentice Hall.
bounded-optimal agents_ JAIR, 3, 515-609.
33&-345.
Russell, S. J. and Wefald, E. H. (1989). On optimal
game-tree search using rational meta-reasoning. In IJCA/-89, pp. 334-340. Russeii,S. J. and Wefald, E. H. (1991). Do the Right Thing: Studies n i Limited Rationality. MIT Press. Russell, S. J. and Wolre, J. (2005). Efficient belief-state AND-OR search, with applications to Kriegspiel In IJCAI-05, pp. 278-285. Q Russell, S. J. and Zimdars, A. (2003). decomposition of reinforcement letu"ning agents. In ICM£.1J3.
Rustagi,J. S. (1976). lbriational Mtthods n i Stas ri tics. Academic Press.
Sabin, D. and Freuder, E. C. (1994).
Contradicting
conventional wis:lom ia constraint satisfaction. In ECA/-94, pp. 125-129.
Saoordoti, E. D. (1974)_ Planning n i a hieraschy of
abstraction spaces. AIJ, 5(2), 115-135. E. D. (1975). The nonlinear nature of plans. Jn IJCAI-75, pp. 206-214.
Saoordoti,
Saoordoti, E. D. (1977)_ A Stmcturefor Plans and Behavior. Elsevier/North-Holland. Sadri, F. and Kowalski, R. (1995). Variants of the
event calculus. In ICLP-95, pp. 67�1. Sabami, M., Dumais, S. T., Heckerman, D., and Horvitz, E. J. (1998). A Bayesian approach to fil teringjunk E-mail. In Learning for Te.tt Categoriza tion: Papersfrom the 19.98\Vorkshop.
Sahami, M_, Heasst, M.. A, and Saund, E. (1996).
Applying the mulple ti cause mixture model to text categorization. In ICML-96, pp. 435-443. Sabin, N. T., Pinker, S., Cash, S. S., Schomer, D., and Halgren, E. (2009)- Sequential processing of lexical, grammatical, and phonological information within Broca's area. Science, 326(5291), 445-449. Snkuta, M. and Iida, H. (2002). AND/OR-tree search for solving problems with uncertainty: A case study using screen-shogi problems. IPSJ Journal, 43(01).
Sulomaa,
A. (1969). Probabilistic and weighted grammars. Information and Control, 15, 529-544.
Sulton, G., Wong, A., and Yang, C. S. (1975). A vector space model for automatic indexing. CACM, IB(I 1), 61:w;2o. Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Joumal
ofResearchand De>•e!opmenr, 3(3), 210-229. Samuel, A. L. (1967).
Some studies in machine learning using the game of checkers ll-Recent progress. IBM Joumal of Research and Develop ment, 1/(6),601�17.
Samuelsson, C. and Rayner, M. (1991).
Quantita tive evaluation of explanation-based learning as an optimization tool for a large-scale natura) language sys�m. In/JCAI-91, pp. 609-615. Sara,.agi, S. (2007). Information extraction. Foun dations and Trends n i Databases, I(3), 26I-377. Satia, J. K- and Lave, R. E. l(1973). Markovian decisionprocesses with probabiistic observationof states. Management Science,20(1), 1-13. Sato, T. and Kameya, Y. (1997). PRISM: A symbolic-statistical modleling language. Jn IJCAI97, pp.
1330-1335.
Saul, L. K., Jaakkola, T., and Jordan, M. I. (1996).
Meanfieldtheory for sigmoid beliefnetworl<s. JAIR, 4,61-16.
Scolt, D. and Krauss, P. (1966). Assigning probabil ities to logical fonnulas. Ln Hintikka, J. and Suppes, P. (Eds.),Aspectsoflnductil'e Logic. North-Holland.
Wiley.
BBS. 3, 417-457.
Savage, L. J. (1954). The Foundations ofStatistics.
Sayre, K. (1993). Three more flaws in the compu tational model. Paper presented at the APA (Central Division) Annual Conference, Chicago, Dlinois.
Searle, J. R. (1980). Mirods, brains, and programs.
Searle, J. R. (1984). Mi nds, Brains and Science. Harvard Uni1�rsity Press.
Searle, J. R. (1990). Is the brain's mind a oomputer
Schaeffer, J. (2008). One Jump Ahead: Computer Perfection at Checkers. Springer-Verlag.
Tadepalli, P., Givan, R., and Driessens, K. (2004). Relational reinforcement learning: An overview. In ICML-04.
j
357-390.
Teyssier, M. and Koller,D. (2005). Ordering-based search: A simple and effective algorithm for learning
Bnycsi.nnnetworks. In UAJ-05, pp. 58�590.
Thaler, R. (1992). The Winner's Curse: Parado.tes nndAnomalies ofEconomic lif e. Princeton Univer sity Press. Thaler, R. andSunstein, C. (2009).
Tail, P. G. (1880). Note on the theory of the "15 puzzle". Proc. RoyalSocietyofEdi nburgh, 10,664-
Nudge: Improv ingDecisionsAbout Heall/; IVealth, andHappiness. Penguin.
Tamaki, H. and Sato, T. (1986). OLD resolution with tabulation. InICLP-86, pp. 84-98.
(2004).
665.
Thoocharous, G., Murphy, K., and Kaelbling, L. P.
Tarjan, R. E. (1983). Data Stmcturesand Network
RepresenTing hierarchical POMDPs as DBNs for multi-scale rolx>t localization. Jn ICRA04.
Tarski, A. (1935). Die Wahrheitshegriff in den for malisierten Spract.en. Studia Philosophi ca, I, 261-
Thiele, T. (1880). Om anvendelse af mindste kvadraters methode i nogle tilf.,lde, hl'or en kom plikation af visse slags uensartede tilf.,ldige fejlk ildergiver fejleoe en ' systernatisk' karakter. Vidensk. Selsk. Skr. 5. Rk., naturvid. og mat. Aft/., 12, 381-
CBMS-NSF Regional Confereoce Se ries n i Appled i MAthematics. SIAM (Society for In dustrial and Applied Mathematics).
Algorithms.
405.
Tikhonov, A. N. (1963).
Solution of incorrectly fotmulated problems and the regularization method.
Titterington, D. M., Smith, A. F. M., and Makov, U. E. (1985). Starisrical analysis ofjinitr mitture
Future Shack, Bantam.
Tomer, A. (1970).
Swift, T.andWarren, D. S. (1994). AnalysisofSLG WAM evaluation ofdefinite programs. In Logic Pro
Tadepalli, P. (199�). Learning from queries and ex
661-692.
Tomasi, C. andKanade, T. (1992). Shape and mo
277.
amples with tree-struct:ured bias. In TC!fL-93, pp.
Thrun, S. (2006). Stanley, the robot that won the DARPAGrand Challenge. J. Field Robocs, ri 23(9),
acrions on Systems, MQil and Cybernetics, 20{2),
Tesauro, G. (1995). Temporal differenoe learning andTO-Gammon. CACM, 38(3), 5�.
Syrjiinen, T. (20JO). Lparse 1.0 user's manual. saturn.tcs.hut.fi/software/smodels.
29-53.
365-379.
vations. J. AstronauticalSciences. 6,46-52.
gmmming. Proc. !994 ln.temarional ymposium S Logicprogmmmillg, pp. 219-235.
Proba
Thrun, S., Fox, D., and Busgard, W. (1998). A prob abilistic approach to concurrent rnapping and local ization for mobile robots. Machine Learning, 3/,
distributions. Wiley.
Taylor, G., Stensrud, B., Eitelman, S., and Dnnhatn, C. (2007). Towards autcmating airspace manage ment. In Proc. Computational IntelligenceforSectt rity and Defense Applicarions (CISDA) Conference, pp. 1-.5.
agement.
Tbrun, S., Busgard, W., andFox, D. (2005). bilisticRobotics. MIT Press.
Thtman, J. A. and Shachter, R. D. (1990). Dynamic programming and influence cliagrarns. IEEE Trans
216-224.
Svo...,, K. and Busges, C. (2009). A machine learning approach for improved bD\25 retrieval. In Proc. Conference on lnfornl(lrion Knowledge Man
Thompson, K. (1996). 6-p iece endgames . J. Inter nuliuu«l Cumputu Clu::i:s A�·�·(.l(;iutiur1, 19(4), 215226.
Soviet Math. DokL, 5, 1035-1038.
A Thotuand End-Games: A Collection of Chess Posiri ons That Can be \Ibn or Drmw' by th
Sutton, R. S. and Barto. A. G. (1998). Reinforce ment Learning: AR Introduction. MIT Press.
299.
Thompson, K. (1986). Rettograde analysis of oer tain endgames. J.lnternationalComptlterChessAs sociation,May, 131-139.
Tate, A. and Whiter, A. M. (1984). Planning with multiple reso\l.J'Ce constraints and an application to a naval planning problem. In Proc. First Conference on AIApplications, pp. 410-416.
Sutton, R. S. (1990). Integrated architectures for
learning, plannins, and reacting based on approx mating dynatnic programming. In ICML,.YO, pp. i
Thielscher, M. (1999). From situation calculus to fluent calculus: State update axioms as a solution to the inferential frame problem. AlJ, II/{1-2), 277-
408.
tion from image streatns under orthography: A fac torization method. JJCV, 9, 137-154.
Torr-alba, A., r-crsu.s, R., and Weiu. Y. (2008). Small codes and large image databases for recogni tion. In CVPR, pp. 1-8. Trucco, E. and Verri, A. (1998). Introductory Tech niquesfor3-D ComputerVision. Prentice HaJJ. Tsitsiklis, J. N. andVan Roy, B. (1997). An analy sis oftemporal-difference learning with function ap . IEEE T m nsach ons ' en Automa.tic Con proximation trol, 42(5),674-690. Tuoner, K. and Wolpert, D. (20001. Co ll etti, .. intel ligenoe and braess' pasadox. Jn AAA/-00, pp. 104-
109.
Turrolle, M., Muggleton, S. H., and Sternberg, M. J. E. (2001). Automated discover/ of structural sig natures of protein fold and function.. J. Molecular Biology, 31)5, 591-605. Turing, A. (1936). On computable numbers, with an application to the Entscheidungsproblem. Proc. London Mathemarical Society, 2nd series, 42, 230265. Turing, A. (1948). Intelligentmarhinery. Tech. rep., National Physical Laboratory. reprinted in (Jnce, 1992). Turing, A. (1950). Computing machinery and intel ligeoce. Mind, 59, 433-460. Turing, A., Strachey, C., Bates, M. A., andBowden, B. V. (1953). Digital computers awlie d to games. In Bowden, B. V. (Ed.), Fasterthan fhought,pp. 286310. Pitman. Tversky, A. and Kahneman, D. (1982). Causal scl>emata in judgements under uncertanty. In Kah i neman, D., Slovic, P., and r,..rsky, A. (Eds.), Judge mer�t Uncertainty: Heurish'cs and Biases. CambridgeUniversity Press.
Under
Ullman, J. D. (1985). Jmplerne: Windows and durations for activities and goals. PAM/, 5, 246-267.
(Eds.), Handbook ofConsrmint Progmmming. Else-
Ve.rma, V Gordon, G., Simmons, R and Thrun, S. (2004). Particle filters for rover fault diagnosis.
CPian: A con su-aint programming approach to planning. InAAA/99, pp. 585-590.
Vinge, V. (1993). The eming technological sin gulari ty: How to survive n i the post-human era. In VISION-21 Symposium. NASA Lewis Research Center and the Ohio Aerospace Institute.
ner. van Beek, P. and Chen, X. (1999).
van Ueek, P. and Maachak, D. (1996). The design and experimental analysis ofalgorithms for temporal reasoning. JAJR,4, 1-18. van Bentham, J. and ter Meu.Jen, A. (1997). Hand
book ofLogic and Longuage. MITPress.
Van Emden, M. H. and Kowalski, R. (1976). The semantics of p-edicate logic as a prograrruning lan guage. JACM, 23(4), 733-742.
Rarmelen, F. and Bundy, A. (1988). Explanation-based generalisation partial evalu atioo. AJ/,36(3), 401-412. van
=
van Harmelen, F., Lifschitz, V., and
Porter, B.
(2007). The Handbook of Knowledge Representa tion. Elsevier. van Heijenoort, J. (Ed.). (1967).
From Frege to Glide!: A Source Book in Mathematical Logc, i
.•
.•
IEEE Robotics andAutomation Magazine, June.
Viola, P. andJones, M. (2002a). Fast and robust clas sification using asymmetric adaOOost and a detector cascade. In NIPS 14.
Viola, P. and Jones, M. (:2002b). Robust real-time object detection. /CCV.
Visser, U. and Burkhard, H.-D. (2007). RoboCup 2006: achievements andgoals for the future. AJMag, 28(2), 115-130. Visser, U., Ribeiro, F., Obash� T., and Dellaert, F.
(Eds.). (2008). RoboCup 2007: RobotSoccer World
CupXI. Springer.
Viterbi, A. J. (1967). Erro:rbounds for convolutional codes and an asymptotically optimum decoding al gorithm. IEEETmnsactions on lnfom!aon ti Theory,
/3(2), 260-269.
Warren, D. H. D. (1983). An abstract Prolog instruction set. Technical Note 309, SRI International.
Warren, D. H. D., Pereira, L. M., and Pereira, F. (1977). PROLOG: The language and its implementation compared with LISP. SIGPLAN Notices, 12(8), 109-115.
Wasserman, L. (2004). All of Statistics. Springer.
Watkins, C. J. (1989). Models of Delayed Reinforcement Learning. Ph.D. thesis, Psychology Department, Cambridge University.
Watson, J. D. and Crick, F. H. C. (1953). A structure for deoxyribose nucleic acid. Nature, 171, 737.
Waugh, K., Schnizlein, D., Bowling, M., and Szafron, D. (2009). Abstraction pathologies in extensive games. In AAMAS-09.
Weaver, W. (1949). Translation. In Locke, W. N. and Booth, D. (Eds.), Machine Translation of Languages: Fourteen Essays, pp. 15-23. Wiley.
Webber, B. L. and Nilsson, N. J. (Eds.). (1981). Readings in Artificial Intelligence. Morgan Kaufmann.
Weibull, J. (1995). Evolutionary Game Theory. MIT Press.
Weidenbach, C. (2001). SPASS: Combining superposition, sorts and splitting. In Robinson, A. and Voronkov, A. (Eds.), Handbook of Automated Reasoning. MIT Press.
Weiss, G. (2000a). Multiagent Systems. MIT Press.
Vlassis, N. (2008). A Concise Introduction to Multiagent Systems and Distributed Artificial Intelligence. Morgan and Claypool.
Van Hentenryck, P., Saraswat, V., and Deville, Y. (1998). Design, implementation, and evaluation of the constraint language cc(FD). J. Logic Programming, 37(1-3), 139-164.
Weiss, Y. (2000b). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1-41.
von Mises, R. (1928). Wahrscheinlichkeit, Statistik und Wahrheit. J. Springer.
Weiss, Y. and Freeman, W. (2001). Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13(10), 2173-2200.
van Hoeve, W.-J. (2001). The alldifferent constraint: A survey. In 6th Annual Workshop of the ERCIM Working Group on Constraints.
van Hoeve, W.-J. and Katriel, I. (2006). Global constraints. In Rossi, F., van Beek, P., and Walsh, T. (Eds.), Handbook of Constraint Programming, pp. 169-208. Elsevier.
von Neumann, J. (1928). Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100, 295-320.
von Neumann, J. and Morgenstern, O. (1944). Theory of Games and Economic Behavior (first edition). Princeton University Press.
van Lambalgen, M. and Hamm, F. (2005). The Proper Treatment of Events. Wiley-Blackwell.
van Nunen, J. A. E. E. (1976). A set of successive approximation methods for discounted Markovian decision problems. Zeitschrift für Operations Research, Serie A, 20(5), 203-208.
Van Roy, B. (1998). Learning and value function approximation in complex decision processes. Ph.D. thesis, Laboratory for Information and Decision Systems, MIT.
Van Roy, P. L. (1990). Can logic programming execute as fast as imperative programming? Report UCB/CSD 90/600, Computer Science Division, University of California, Berkeley, California.
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley.
Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16, 264-280.
von Winterfeldt, D. and Edwards, W. (1986). Decision Analysis and Behavioral Research. Cambridge University Press.
Vossen, T., Ball, M., Lotem, A., and Nau, D. S. (2001). Applying integer programming to AI planning. Knowledge Engineering Review, 16, 85-100.
Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Machine Learning, 1(1-2), 1-305.
Waldinger, R. (1975). Achieving several goals simultaneously. In Elcock, E. W. and Michie, D. (Eds.), Machine Intelligence 8, pp. 94-138. Ellis Horwood.
Wallace, A. R. (1858). On the tendency of varieties to depart indefinitely from the original type. Proc. Linnean Society of London, 3, 53-62.
Waltz, D. (1975). Understanding line drawings of scenes with shadows. In Winston, P. H. (Ed.), The Psychology of Computer Vision. McGraw-Hill.
Wang, Y. and Gelly, S. (2007). Modifications of UCT and sequence-like simulations for Monte-Carlo Go. In IEEE Symposium on Computational Intelligence and Games, pp. 175-182.
Varian, H. R. (1995). Economic mechanism design for computerized agents. In USENIX Workshop on Electronic Commerce, pp. 13-21.
Vauquois, B. (1968). A survey of formal grammars and algorithms for recognition and transformation in mechanical translation. In Proc. IFIP Congress, pp. 1114-1122.
Warren, D. H. D. (1974). WARPLAN: A System for Generating Plans. Department of Computational Logic Memo 76, University of Edinburgh.
Wanner, E. (1974). On remembering, forgetting and understanding sentences. Mouton.
Weizenbaum, J. (1976). Computer Power and Human Reason. W. H. Freeman.
Weld, D. S. (1994). An introduction to least commitment planning. AIMag, 15(4), 27-61.
Weld, D. S. (1999). Recent advances in AI planning. AIMag, 20(2), 93-122.
Weld, D. S., Anderson, C. R., and Smith, D. E. (1998). Extending Graphplan to handle uncertainty and sensing actions. In AAAI-98, pp. 897-904.
Weld, D. S. and de Kleer, J. (1990). Readings in Qualitative Reasoning about Physical Systems. Morgan Kaufmann.
Weld, D. S. and Etzioni, O. (1994). The first law of robotics: A call to arms. In AAAI-94.
Wellman, M. P. (1985). Reasoning about preference models. Technical report MIT/LCS/TR-340, Laboratory for Computer Science, MIT.
Wellman, M. P. (1988). Formulation of Tradeoffs in Planning under Uncertainty. Ph.D. thesis, Massachusetts Institute of Technology.
Wellman, M. P. (1990a). Fundamental concepts of qualitative probabilistic networks. AIJ, 44(3), 257-303.
Wellman, M. P. (1990b). The STRIPS assumption for planning under uncertainty. In AAAI-90, pp. 198-203.
Wellman, M. P. (1995). The economic approach to artificial intelligence. ACM Computing Surveys, 27(3), 360-362.
Wellman, M. P., Breese, J. S., and Goldman, R. (1992). From knowledge bases to decision models.
Knowledge Engineering Review, 7(1), 35-53.
Wellman, M. P. and Doyle, J. (1992). Modular utility representation for decision-theoretic planning. In ICAPS-92, pp. 236-242.
Wellman, M. P., Wurman, P., O'Malley, K., Bangera, R., Lin, S., Reeves, D., and Walsh, W. (2001). A trading agent competition. IEEE Internet Computing.
Wells, H. G. (1898). The War of the Worlds. William Heinemann.
Werbos, P. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. thesis, Harvard University.
Werbos, P. (1977). Advanced forecasting methods for global crisis warning and models of intelligence. General Systems Yearbook, 22, 25-38.
Wesley, M. A. and Lozano-Perez, T. (1979). An algorithm for planning collision-free paths among polyhedral objects. CACM, 22(10), 560-570.
Wexler, Y. and Meek, C. (2009). MAS: A multiplicative approximation scheme for probabilistic inference. In NIPS 21.
Whitehead, A. N. (1911). An Introduction to Mathematics. Williams and Northgate.
Whitehead, A. N. and Russell, B. (1910). Principia Mathematica. Cambridge University Press.
Whorf, B. (1956). Language, Thought, and Reality. MIT Press.
Widrow, B. (1962). Generalization and information storage in networks of adaline "neurons". In Self-Organizing Systems 1962, pp. 435-461.
Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. In 1960 IRE WESCON Convention Record, pp. 96-104.
Wiedijk, F. (2003). Comparing mathematical provers. In Mathematical Knowledge Management, pp. 188-202.
Wiegley, J., Goldberg, K., Peshkin, M., and Brokowski, M. (1996). A complete algorithm for designing passive fences to orient parts. In ICRA-96.
Wiener, N. (1942). The extrapolation, interpolation, and smoothing of stationary time series. OSRD 370, Report to the Services 19, Research Project DIC-6037, MIT.
Wiener, N. (1948). Cybernetics. Wiley.
Wilensky, R. (1978). Understanding goal-based stories. Ph.D. thesis, Yale University.
Wilensky, R. (1983). Planning and Understanding. Addison-Wesley.
Wilkins, D. E. (1980). Using patterns and plans in chess. AIJ, 14(2), 165-203.
Wilkins, D. E. (1988). Practical Planning: Extending the AI Planning Paradigm. Morgan Kaufmann.
Wilkins, D. E. (1990). Can AI planners solve practical problems? Computational Intelligence, 6(4), 232-246.
Williams, B., Ingham, M., Chung, S., and Elliott,
Williams, R. J. and Baird, L. C. I. (1993). Tight performance bounds on greedy policies based on imperfect value functions. Tech. rep. NU-CCS-93-14, College of Computer Science, Northeastern University.
Wilson, R. A. and Keil, F. C. (Eds.). (1999). The MIT Encyclopedia