MACHINE INTELLIGENCE: QUO VADIS?
ADVANCES IN FUZZY SYSTEMS - APPLICATIONS AND THEORY
Honorary Editor: Lotfi A. Zadeh (Univ. of California, Berkeley)
Series Editors: Kaoru Hirota (Tokyo Inst. of Tech.), George J. Klir (Binghamton Univ.-SUNY), Elie Sanchez (Neurinfo), Pei-Zhuang Wang (West Texas A&M Univ.), Ronald R. Yager (Iona College)
Vol. 7: Genetic Algorithms and Fuzzy Logic Systems: Soft Computing Perspectives (Eds. E. Sanchez, T. Shibata and L. A. Zadeh)
Vol. 8: Foundations and Applications of Possibility Theory (Eds. G. de Cooman, D. Ruan and E. E. Kerre)
Vol. 9: Fuzzy Topology (Y. M. Liu and M. K. Luo)
Vol. 10: Fuzzy Algorithms: With Applications to Image Processing and Pattern Recognition (Z. Chi, H. Yan and T. D. Pham)
Vol. 11: Hybrid Intelligent Engineering Systems (Eds. L. C. Jain and R. K. Jain)
Vol. 12: Fuzzy Logic for Business, Finance, and Management (G. Bojadziev and M. Bojadziev)
Vol. 13: Fuzzy and Uncertain Object-Oriented Databases: Concepts and Models (Ed. R. de Caluwe)
Vol. 14: Automatic Generation of Neural Network Architecture Using Evolutionary Computing (Eds. E. Vonk, L. C. Jain and R. P. Johnson)
Vol. 15: Fuzzy-Logic-Based Programming (Chin-Liang Chang)
Vol. 16: Computational Intelligence in Software Engineering (W. Pedrycz and J. F. Peters)
Vol. 17: Non-additive Set Functions and Nonlinear Integrals (Forthcoming) (Z. Y. Wang)
Vol. 18: Factor Space, Fuzzy Statistics, and Uncertainty Inference (Forthcoming) (P. Z. Wang and X. H. Zhang)
Vol. 19: Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases (O. Cordón, F. Herrera, F. Hoffmann and L. Magdalena)
Vol. 20: Uncertainty in Intelligent and Information Systems (Eds. B. Bouchon-Meunier, R. R. Yager and L. A. Zadeh)
Vol. 21: Machine Intelligence: Quo Vadis? (Eds. P. Sinčák, J. Vaščák and K. Hirota)
Advances in Fuzzy Systems - Applications and Theory - Vol. 21
MACHINE INTELLIGENCE: QUO VADIS?

Peter Sinčák
Ján Vaščák
Technical University of Košice, Slovakia
Kaoru Hirota
Tokyo Institute of Technology, Japan
World Scientific
NEW JERSEY · LONDON · SINGAPORE · BEIJING · SHANGHAI · HONG KONG · TAIPEI · CHENNAI
Published by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224
USA office: Suite 202, 1060 Main Street, River Edge, NJ 07661
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
MACHINE INTELLIGENCE: QUO VADIS?
Copyright © 2004 by World Scientific Publishing Co. Pte. Ltd.
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-238-751-X
Printed in Singapore by World Scientific Printers (S) Pte Ltd
FOREWORD
More than fifty years ago, Norbert Wiener published his seminal book Cybernetics: or Control and Communication in the Animal and the Machine. The significance of this book is twofold. Firstly, it contains the first serious and comprehensive examination of the important unifying role of transdisciplinary concepts in science and technology; these are concepts such as information, control, organization, learning, adaptability, complexity, and the like. Secondly, it examines the various capabilities of living organisms as role models for technology, in particular computer-based technology.

At this time, the heritage of cybernetics is perhaps most visible in a recently emerging branch of artificial intelligence, for which three names have alternatively been used: machine intelligence, computational intelligence, and intelligent systems. This branch of artificial intelligence is concerned with human-made systems (machines) that are capable of achieving highly complex tasks in a human-like, intelligent way. The qualifier "human-like" is crucial, and it distinguishes this branch from the current mainstream in the broader area of artificial intelligence. In machine intelligence (computational intelligence, intelligent systems), the human mind is viewed as a role model and the aim is to understand and emulate its various cognitive capabilities, which allow human beings to perform remarkably complex tasks. These capabilities include:

• perceiving a given environment, recognizing in it features that are relevant for accomplishing a given task, and acting upon them;
• predicting changes in the environment on the basis of its model as well as additional knowledge, which is often expressed in natural language;
• using information about the environment and other available knowledge to reason, often in terms of natural language, about the task to be performed;
• coordinating perception, planning, and action in a purposeful way;
• using natural language to communicate and collaborate with other human beings;
• learning and generalizing from previous experience;
• adapting behavior as needed to successfully complete the given task.
Achieving these capabilities in machines is contingent upon advanced computer technology, advanced instrumentation technology, and significant progress in cognitive science. In addition, however, it is also contingent upon the development of appropriate computational tools for making it possible for machines to emulate the remarkable capability of humans to perform a
wide variety of physical and mental tasks by using perceptions in a purposeful way and approximating perceptions in natural language. The ultimate aim in developing these computational tools, which are referred to as tools of soft computing, is to compute with perceptions.

The basic thesis of soft computing is that the requirement of high precision and certainty carries a high computational cost. When the cost becomes prohibitive, the requirement needs to be softened. The principal aim of soft computing is to exploit the tolerance for imprecision and uncertainty to achieve tractability, robustness, and low cost. Precision and certainty are thus utilized in soft computing as commodities that are traded for reduction in computational complexity and for enhancing robustness. This recognition that imprecision and uncertainty play a useful role in taming unmanageable complexity is an important paradigmatic change in science and engineering. Imprecision and uncertainty are no longer viewed as an unavoidable plague, but rather as a valuable resource.

Using perceptions in purposeful ways and approximating perceptions by statements in natural language are remarkable capabilities of the human mind. Studying these and other capabilities of the human mind and emulating them via the various tools of soft computing in machines is the primary focus of research in the area of machine intelligence. For a quick orientation, it is useful to recognize the following six categories of tools subsumed under soft computing:

(i) The first category consists of the various versions of fuzzy set theory and the associated fuzzy logic. These are formalized languages with great expressive power. They are capable of capturing linguistic imprecision embedded in natural language and, as a consequence, they allow us to represent knowledge and formalize reasoning in terms of statements expressed in natural language. These statements are, in turn, capable of approximating perceptions.

(ii) The second category contains all recognized types of artificial neural networks, which were inspired by their biological counterparts. Their distinctive characteristics are the capability of learning from experience and the capability of recognizing patterns in data.

(iii) The third category is formed by the various types of evolutionary algorithms, which were originally inspired by the processes of biological evolution. They are surprisingly efficient in searching for approximate solutions of many complex problems, in particular optimization problems, in which exact solutions are difficult or impossible to obtain.

(iv) The fourth category contains theories of rough sets of various types. The aim of these theories is to capture uncertainty that is caused by the limited resolution power of our measuring instruments and sensors. They allow
us to formalize and deal with approximations of given sets via lower-level resolution capabilities.

(v) The fifth category is based on the theory of monotone measures, which is a generalization of classical measure theory. In monotone measures, the requirement of additivity of classical measures is replaced with a weaker requirement of monotonicity with respect to subset ordering. This theory allows us to measure and deal with properties that, due to some synergetic or inhibitory effects, do not satisfy the requirement of additivity of classical measures.

(vi) The sixth category consists of the various theories of uncertainty and principles based upon them. Included are theories of precise as well as imprecise probabilities. This class may also be viewed as a generalized uncertainty-based information theory, in which the amount of information obtained by some action is measured by the amount of uncertainty reduced by this action.

These various components of soft computing are often combined. By each combination, some new capabilities are gained. Combining, for example, fuzzy sets with rough sets, we obtain mathematical structures considerably more expressive than those of fuzzy sets or rough sets alone. While fuzzy sets are capable of capturing linguistic imprecision, which is characteristic of natural language, the aim of rough sets is to capture uncertainty caused by the limited resolution power of our measuring instruments and sensors. Moreover, there are two possible ways of combining these two types of sets, which are very different from one another. One leads to fuzzy rough sets, which are rough sets based on fuzzy equivalence relations. The other leads to rough fuzzy sets, which are rough approximations of fuzzy sets. This example is mentioned here just to illustrate the enormous scope of soft computing.

Machine intelligence covers a broad spectrum of intelligence. Among machines that require a relatively low level of intelligence are those that have lately been employed in various consumer products. The best known of these are perhaps intelligent washing machines, which are capable of selecting the best washing method based on the amount of clothes, quality of clothes, type of dirt, and amount of dirt by reasoning in fuzzy logic that emulates the reasoning of an experienced user. Much higher intelligence is needed in a machine that is capable of controlling a helicopter without a pilot by simple commands in natural language via wireless communication. Such a sophisticated machine, heavily based on soft computing, was designed, implemented and successfully tested by Michio Sugeno in Japan a few years ago. Examples of machines with an even higher level of intelligence are the prospective mobile home robots. These robots, which we are not able to build as yet, will be required to be highly adaptable for performing a broad variety of tasks in varying environments
according to requests expressed in natural language by their human users. Such machines will have to be equipped with highly sophisticated visual and auditory perception capabilities, as well as all the other requisite capabilities of intelligent machines mentioned earlier.

In order to be able to compare the levels of intelligence of various machines, it has already been suggested to introduce a suitable measure of machine intelligence, analogous to the intelligence quotient that is routinely used for measuring human intelligence. Although some ways of measuring machine intelligence have already been suggested in the literature, no consensus has been reached on this issue thus far.

Machine intelligence cannot be successfully studied within any single discipline. It requires the cooperation of experts from multiple disciplines. The primary disciplines include computer science, mathematics, electrical engineering, mechanical engineering, cognitive science, biology, linguistics, and systems science. To facilitate cooperation among the disciplines involved, multidisciplinary research centers for machine intelligence (computational intelligence, intelligent systems) have lately been established in numerous places. This is an indicator that the importance of the relatively new area of machine intelligence, an unorthodox branch in the broader field of artificial intelligence, is already recognized.

The editors of Machine Intelligence: Quo Vadis? should be congratulated for preparing this excellent and timely book on a subject whose importance is likely to grow significantly over the years. The book covers almost all aspects of machine intelligence and will undoubtedly become a benchmark against which future progress in machine intelligence will be measured.
George J. Klir State University of New York Binghamton, New York, USA
FOREWORD

Intelligence - Science and Technology in the 21st Century
Intelligence is a fundamental feature of humans. It emerges from the activities of our brain and evolves during our life and social interactions. Even though much progress has been made in brain research and understanding, very challenging questions remain in the 21st century. For example, how is intelligence generated in the brain, and how are we to build intelligent systems using modern technology? The book Machine Intelligence: Quo Vadis? contends with these problems, providing state-of-the-art research results with the integration of various approaches.

The 20th century can be characterized by a number of important developments and inventions in science and technology. The progress was led by a revolution in physics which resulted in our surprisingly rich modern technology. Mathematics has also achieved a number of new results and a high level of abstraction. In the middle of the century new results were revealed in biological research. Molecular biology emerged from the efforts to further understand the secrets and origin of life based on molecular principles. Since then, progress has brought us to where we now understand and study the genetic make-up of man. Information and computer science have also progressed since the middle of the 20th century. The achievements in computer technology have been so remarkable that we are now facing a new civilization, the Information Society, supported by computers and Internet technology.

It must be said that until now we have not been very successful in understanding how our brain works and how to design and build intelligent systems. This is the biggest challenge remaining to us in the 21st century: how to embed intelligence into a man-made machine. What are to be the main principles for approaching this challenge? For the last century the sciences have been divided into specific areas or disciplines and were partially isolated. However, today we are facing fundamental and important global problems which cannot be solved in one area by a single discipline, e.g. environmental protection, energy production, various illnesses, global communications and many others. We need to approach these problems by integrating all our knowledge and scientific disciplines. Modelling the human intellect and creating Artificial Intelligence is one of these important and fundamental problems. Dynamic activities of the brain are evolving our
intelligence. The brain is a biological organ composed of biological subsystems. However, it functions as an information processor, on which our mind depends. This situates brain science between the biological and information sciences.

Artificial Intelligence is a synthetic approach to modelling the intellect of living systems. Its goal is to build functional, human-like intelligent systems based on modern computer technology, not necessarily based on the physical model of the brain. While building a system that will behave like the brain, we have to keep in mind the global functional analysis of the brain, which is not a simple task. This is again a reason to include various disciplines which may be helpful in this analysis. There have been a number of proposed approaches based on neural networks, fuzzy systems, evolutionary computation and the new agent technologies. Following the isolated development of the above-mentioned approaches, now is the time to integrate all of them, understand the brain and build an intelligent system. The time has come to fill the gap between the functional biological brain and computer-science-oriented technology.
Shun-ichi Amari RIKEN Brain Science Institute Saitama, Japan
PREFACE
Machine Intelligence (MI) is very important in building intelligent systems with various degrees of autonomous behavior. These groups of tools support features such as the learning ability and adaptability of intelligent systems in various types of environments and situations. The current and future Information Society is expected to bring into technology and everyday life the Ambient Intelligence (AmI) approach. This will provide a wide range of potential applications for Machine Intelligence tools to support the implementation of the AmI concept. A number of studies indicate that this approach is inevitable and will play an essential and central role in the development of the Information Society in the near future.

The important role of Machine Intelligence in this historic challenge highlights the responsibility the MI community has to include all the different fields, such as brain-like research and applications, fuzzy logic, neural networks, evolutionary computation, multi-agent systems, artificial life, expert systems, symbolic approaches based on logical reasoning, knowledge discovery, mining, replication and many other related fields, in the support, development, and creation of intelligent systems. The embedding of these systems in various types of technology should be profitable for us and bring a different role for mankind in production and everyday life. We expect to see intelligent technology, solutions, and even humanoid robots helping mankind improve and preserve the ideals of humanity and democracy.

The Machine Intelligence Quotient will play an important role in the future ability to evaluate a specific system's degree of autonomy. It is believed that it will be a domain-oriented problem and that it will be important for humans to use this information in decision-making, e.g. in the evaluation of information systems in the commercial world, to choose the system with the highest MIQ. The usefulness of this parameter will depend on many influences, including technological, domain-oriented and also the commercial aspects of the MI application in various systems. The commercial demand for "intelligent" solutions and products should increase the interest in MI tools.

One of the most challenging fields is brain science and the application of this research in technology and man-made products. There is much important research being done into biologically inspired systems, but the problems of knowledge acquisition, storage, replication and integration into "superknowledge" systems seem to be very important. The solutions to these problems could lead to many new technological applications within the AmI concept. Many fields within MI are supportive of this effort. This could lead to
a system with a multi-accessible knowledge database with incremental learning ability, which would revolutionize the application potential of MI man-made solutions and products.

This multi-author book includes contributions from several leading authorities in the field of Machine Intelligence, covering the areas of fuzzy logic, neural networks, evolutionary computation and hybrid systems, which are all generally believed to play an important role in the field of Machine Intelligence. There is no longer a dispute over specific fields of technology, but there are still questions about the most effective way to achieve the solution to a task within the means of MI. Only the cooperation of all the MI tools will yield profitable research and applications for creating Intelligent Technology and Systems and their applications in technology and everyday life. We can see through history that the differences between the classical Artificial Intelligence communities and the Computational Intelligence communities are slowly diminishing. This era of cooperation seems to be opening the way to achieving the common goal of supporting the development of Machine Intelligence tools and their role in the AmI concept.

This book is a contribution to the pooled knowledge and presentation of important problems and advances in Machine Intelligence theory and its applications. The secondary goal of this book is to bring to mind the question "Quo Vadis Machine Intelligence?" in the sense of "where is MI going?" We want to do this by considering the possible misuse of these technologies. We are living in a very complex time, especially after September 11, 2001. It is important therefore to study MI's potential to benefit mankind while being conscious of, or at least discussing, the potential misuse of MI in AmI in an Information Society.

We are thankful for the interesting forewords written by Prof. George Klir from Binghamton University, New York, USA and Prof. S. Amari from RIKEN, Japan, pointing out interesting aspects of Machine Intelligence. We would also like to thank all the contributors to this book, as well as the students from the Center for Intelligent Technologies at the Technical University of Košice, Slovakia, for the many hours they contributed to preparing and formatting the manuscript for publication. Finally, we would like to thank our publisher, World Scientific Publishing, for their encouragement and for helping to spread these ideas to the global scientific community.

March, 2002
Dr. Peter Sinčák and Dr. Ján Vaščák
Center for Intelligent Technologies
Technical University of Košice
Košice, Slovakia

Dr. Kaoru Hirota
Hirota Laboratory
Tokyo Institute of Technology
Tokyo, Japan
CONTENTS

Forewords
    G. J. Klir ... v
    S. Amari ... ix

Preface
    P. Sinčák, J. Vaščák and K. Hirota ... xi

Chapter 1  INTRODUCTION ... 1

Quo Vadis Computational Intelligence?
    W. Duch and J. Mańdziuk ... 3

Chapter 2  MATHEMATICAL TOOLS FOR MACHINE INTELLIGENCE ... 29

Mappings between High-dimensional Representations in Connectionistic Systems
    V. Kůrková ... 31

The Stimulating Role of Fuzzy Set Theory in Mathematics and its Applications
    E. E. Kerre and D. Van der Weken ... 47

K-order Additive Fuzzy Measures: A New Tool for Intelligent Computing
    R. Mesiar ... 65

On-line Adaptation of Recurrent Radial Basis Function Networks using the Extended Kalman Filter
    B. Todorović, M. Stanković and C. Moraga ... 73

Iterative Evaluation of Anytime PSGS Fuzzy Systems
    O. Takács and A. R. Várkonyi-Kóczy ... 93

Kolmogorov's Spline Network
    B. Igelnik and N. Parikh ... 107

Extended Kalman Filter Based Adaptation of Time-varying Recurrent Radial Basis Function Networks Structure
    B. Todorović, M. Stanković and C. Moraga ... 115

A Multi-NF Approach with a Hybrid Learning Algorithm for Classification
    D. Rutkowska and A. Starczewski ... 125

A Neural Fuzzy Classifier Based on MF-ARTMAP
    P. Sinčák, M. Hric, R. Valo, P. Horanský and P. Karel ... 141

Mathematical Properties of Various Fuzzy Flip-flops as a Basis of Fuzzy Memory Modules
    K. Hirota and S. Yoshida ... 161

Generalized T-operators
    I. J. Rudas ... 179

Fuzzy Rule Extraction from Input/Output Data
    L. T. Kóczy, J. Botzheim, A. B. Ruano, A. Chong and T. D. Gedeon ... 199

Knowledge Discovery from Continuous Data Using Artificial Neural Networks
    R. Setiono and J. Zurada ... 217

Chapter 3  ADVANCED APPLICATIONS WITH MACHINE INTELLIGENCE ... 233

Review of Fuzzy Logic in the Geological Sciences: Where We Have Been and Where We Are Going
    R. V. Demicco ... 235

Bayesian Neural Networks in Prediction of Geomagnetic Storms
    G. Andrejková ... 251

Adaptation in Intelligent Systems: Case Studies from Process Industry
    K. Leiviskä ... 259

Estimation and Control of Non-linear Process Using Neural Networks
    A. Jadlovská ... 275

The Use of Non-linear Partial Least Square Methods for On-line Process Monitoring as an Alternative to Artificial Neural Networks
    P. F. Fantoni, M. Hoffmann, W. Hines, B. Rasmussen and A. Kirschner ... 283

Recurrent Neural Networks for Real-time Computation of Inverse Kinematics of Redundant Manipulators
    J. Wang and Y. Zhang ... 299

Towards Perception-based Fuzzy Modeling: An Extended Multistage Fuzzy Control Model and its Use in Sustainable Regional Development Planning
    J. Kacprzyk ... 321

Chapter 4  MACHINE INTELLIGENCE FOR HIGH LEVEL INTELLIGENT SYSTEMS ... 339

Neural Network Models for Vision
    K. Fukushima ... 341

Humanoid Robot for Kansei Communication: Computer Must Have Body
    S. Hashimoto ... 357

Intelligent Space and Human Centered Robotics
    T. Yamaguchi and N. Ando ... 371

Evolution of Unconscious Cooperation in Proportional Discrete-time Harvesting
    J. Pospíchal ... 385

Artificial Chemistry, Replicators, and Molecular Darwinian Evolution in Silico
    V. Kvasnička ... 403

Human Centered Support System Using Intelligent Space and Ontological Neural Networks
    T. Yamaguchi, H. Murakami and D. Chen ... 427

The K.E.I. (Knowledge, Emotion and Intention) Based Robot Using Human Motion Recognition
    T. Yamaguchi, T. Saito and N. Ando ... 435

Auditory and Visual-based Signal Processing with Interactive Evolutionary Computation
    H. Takagi ... 443

Index ... 455
Introduction
QUO VADIS, COMPUTATIONAL INTELLIGENCE?

WŁODZISŁAW DUCH AND JACEK MAŃDZIUK

Department of Informatics, Nicolaus Copernicus University,
ul. Grudziądzka 5, 87-100 Toruń, Poland
E-mail: [email protected]

Faculty of Mathematics and Information Science, Warsaw University of Technology,
Plac Politechniki 1, 00-661 Warsaw, Poland
What are the most important problems of computational intelligence? A sketch of the road to intelligent systems is presented. Several experts have made interesting comments on the most challenging problems.
Keywords: computational intelligence, artificial intelligence, cognition, neural networks, evolution
1 Introduction.
In the introduction to the "MIT Encyclopedia of Cognitive Sciences" M. Jordan and S. Russell [33] used the term "computational intelligence" to cover two views of artificial intelligence (AI): engineering and empirical science. Traditional AI started as an engineering discipline concerned with the creation of intelligent machines. Computational modeling of human intelligence is an empirical science. Both are based on computations.

Artificial Intelligence (AI) established its identity quite early, during the Dartmouth conference in 1956 [6]. It had clearly defined goals, exemplified by great early projects, such as the General Problem Solver of Simon and Newell. There are many definitions of AI [48,64], for example: "... the science of making machines do things that would require intelligence if done by humans" (Marvin Minsky), or "The study of how to make computers do things at which, at the moment, people are better" [48]. In essence AI tries to solve problems for which effective algorithms do not exist, using knowledge-based methods.

In the 1970s AI contributed to the development of cognitive science and to the goal of creating "unified theories of cognition", as Allen Newell called it. Ambitious theories of high cognitive functions were formalized by John Anderson in his ACT* theory [3], and by Newell and his collaborators in the Soar theory [43]. Both were very successful and led to many practical (and commercial) applications. In the last decade intelligent agents became the focus of AI: entities that can perceive and act in a rational, goal-directed way to achieve some objectives.

Machine learning has been important in AI from the beginning. Samuel's checker-playing system (1959) learned to play checkers far better than its creator. Although research on perceptrons initially developed as a part of AI, in the late
1950s machine learning became preoccupied with inductive, rule-based knowledge [50]. AI development has always been predominantly concerned with high-level cognition, where symbolic models are appropriate.

In 1973 the book of Duda and Hart on pattern recognition appeared [18]. The authors wrote that "pattern recognition might appear to be a rather specialized topic". It is obviously a very broad topic now, including a good part of neural network research [5]. In 1982 the Hopfield network [26], in 1986 the backpropagation algorithm [49], and a year later the PDP books [7] brought the neural network field to the center of attention. Since that time the field of neural computing has been growing rapidly in many directions and became very popular in the early 1990s. Computational neuroscience, connectionist systems in psychology and neural networks for data analysis are very different branches with rather different goals. The last of these branches gained solid foundations in statistics and the Bayesian theory of learning [5,25].

Soft computing conferences started to draw people from the neural, fuzzy sets and evolutionary algorithms communities. Applications of these methods overlap with those dealt with by the pattern recognition, AI and optimization communities. Computational Intelligence (CI) is used as a name to cover many existing branches of science. This name is sometimes used to replace artificial intelligence, both by book authors [47] and some journals (for example, "Computational Intelligence. An International Journal", by Blackwell Publishers). There are several computational intelligence journals dedicated to the theory and applications of artificial neural networks, fuzzy systems, evolutionary computation and hybrid systems. In our opinion the name should be used to cover all branches of science and engineering that are concerned with understanding and implementing functions for which effective algorithms do not exist. From this point of view some areas of AI and a good part of pattern recognition, image analysis and computational neuroscience are subfields of CI.

What is the ultimate goal of computational intelligence, and what are the short-term and long-term challenges to the field? What is it trying to achieve? Without setting up clear goals and yardsticks to measure progress on the way, without a clear sense of direction, many efforts will end up nowhere, going in circles and repeating the same problems. We hope that this paper will start a discussion about the future of CI that should clarify some of the issues involved. First we shall make some speculations about the goals of computational intelligence, think how to reach them, and raise some questions worth answering. Then we will write about some challenges. We have asked several experts what they consider to be the greatest challenges in their field. Finally some conclusions will be given.
2 The ultimate goals of CI.
From the perspective of cognitive science, artificial intelligence is concerned with high-level cognition, dealing with such problems as understanding of language, problem solving, reasoning, planning and knowledge engineering at the symbolic level. Knowledge has complex structure, the main problems are combinatorial in nature and their solution requires heuristic search techniques. Learning is used to gain knowledge that expert systems may employ for reasoning, and is frequently based on logic. Other branches of computational intelligence are concerned with lower-level problems; they are more on the pattern recognition side, closer to perception, at the subsymbolic level. Complex knowledge structures do not play an important role, and most methods work in fixed-dimensional feature spaces. The combinatorial character of problems and knowledge-based heuristic search are not an issue. Numerical methods are used more often than discrete mathematics.

The ultimate goal of AI is to create a program that would pass the Turing test, that is, would understand human language and be able to think in a way similar to humans. The ultimate AI project is perhaps CYC, a super-expert system with over a million logical assertions describing all aspects of the world. The ultimate goal of other CI branches may be to build an artificial rat (this was the conclusion of a discussion panel on the challenges to CI in the XXI century, at the World Congress on Computational Intelligence in Anchorage, Alaska, in 1998). Problems involved in building an artificial animal that may survive in a hostile environment are rather different from problems related to the Turing test. Instead of symbolic knowledge, problems related to perception, direction of attention, orientation, motor control and motor learning have to be solved. Behavioral intelligence embodied in the Cog project is perhaps the most ambitious project of this kind [1].

Each branch of CI has its natural areas of application and the overlap between them is sometimes small. For example, with very few exceptions AI experts are separated from communities belonging to other CI branches, and vice versa. Even the neural networks and pattern recognition communities, despite a considerable overlap in applications, tend to be separated. Is there a common ground where the two fields could meet? The ultimate challenge for CI seems to be a robot that combines high behavioral competence with human-level higher cognitive competencies. Building creative systems of this kind will require all branches of CI, including symbolic AI and lower-level pattern recognition methods. At one end of the spectrum we have neurons, at the other brains and societies.
3 A roadmap to creative systems.
The brain is not a homogeneous, huge neural network, but has quite a specific modular and hierarchical structure. Neural network models are inspired by processes at a low level of this hierarchy, while symbolic AI works at the highest level. Little work has been devoted to the description and understanding of intermediate levels, although investigation of connections between them can be quite fruitful [9]. Below we have sketched a roadmap from the level of single neurons to the highest level of creative societies of brains, presenting some speculations and research directions that seem unexplored. Cooperation of individual elements that have some local knowledge leads to the emergence of a higher-order unit that should be regarded on its own footing. The same principles may operate at different scales of complexity. A major challenge for CI is to create models and learn how to scale up systems to reach a higher level.

3.1 Threshold neurons and perceptrons
Neurons in simple perceptrons have only one internal parameter, the threshold for their activity, and the synaptic weights that determine their interactions. Combined together, perceptrons create the popular multi-layer perceptron (MLP) networks that are quite powerful, able to learn any multidimensional mapping starting from examples of required input/output relations. Usually the network aspect is stressed when learning processes are discussed: the whole, with interacting elements, is bigger than its parts. Real biological networks involve a huge number of neurons with thousands of connections each. Instead of looking at the fixed architecture of a neural network it may be better to imagine that synaptic connections define interactions between subsets of individual elements. Clusters of activity, or forming subnetworks, have been observed in networks of spiking neurons [24]. Similar effects have not been investigated in MLP networks.

Perceptron neural networks may be regarded as collections of primitive processing elements (PEs). Individual elements do not understand the task the whole collection is faced with, but they are able to adjust to the flow of information, performing local transformations of the incoming data and being criticized or corrected by other members of the team (i.e. the network). The Hebb principle provides reinforcement for PEs, regulating the level of their activity in solving different cooperative problems. The backpropagation procedure provides another kind of critique of the activity of single neurons. Some parameters are internal to the neural units (thresholds), while other parameters are shared, allowing for interactions between units during the learning procedure. Neural networks use many units (neurons) that cooperate to solve problems that are beyond the capabilities of a single unit. The interactions and local knowledge of simple PEs determine the type of problems that networks of such elements may solve.
Networks are able to generalize what has been learned, creating a model of the states of the local environment they are embedded in. Generalization is not yet creativity, but it is a step in the right direction.
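To make the picture concrete, here is a minimal sketch of an MLP as a team of simple PEs (Python with NumPy is our choice of notation, not anything used in the works cited above): each hidden unit owns a single threshold (bias), and backpropagation supplies the "critique" that corrects every unit's local behavior until the collection learns the XOR mapping, a task beyond any single perceptron unit.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)      # XOR target

    W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)        # hidden PEs
    W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)        # output PE
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    for epoch in range(20000):
        h = sigmoid(X @ W1 + b1)                 # local transformations
        o = sigmoid(h @ W2 + b2)                 # cooperative output
        d_o = (o - y) * o * (1 - o)              # critique of the output PE
        d_h = (d_o @ W2.T) * h * (1 - h)         # critique passed down
        W2 -= 0.5 * h.T @ d_o; b2 -= 0.5 * d_o.sum(0)
        W1 -= 0.5 * X.T @ d_h; b1 -= 0.5 * d_h.sum(0)

    print(o.round(2).ravel())    # typically approaches [0, 1, 1, 0]

No individual unit "knows" XOR; the mapping emerges from shared weights and the error signals exchanged between the two layers.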
3.2 Increasing complexity of internal PE states
The next step beyond the single parameter (threshold) describing the internal state of a neuron is to add more internal parameters, allowing each PE to realize a bit more than a hyperplane discrimination. Perceptrons are not able to solve the famous connectedness problem and other problems posed by Minsky and Papert [39] as a challenge for neural networks. Adding more network layers does not help (see the second edition of [39]); the problem scales exponentially with the growing size of images. This problem may be solved with neural oscillator networks in a biologically plausible way [61], but rather complex networks are required. Adding one additional internal parameter (phase) is sufficient to solve this problem [32]. What is the complexity class of problems that may be solved this way? Can all problems of finding topological invariants be solved? What can be gained by adding more parameters? The answers are not clear yet.

Computational neuroscience investigates models of cortical columns or Hebbian cell assemblies. Modular neural networks may be regarded as networks with super-PEs that adapt to the requirements of a complex environment. Instead of simple units with little internal knowledge and fixed relations (the fixed architecture of MLP networks), more powerful PEs dynamically forming various configurations (virtual networks) should be used. More complex internal knowledge and interaction patterns of PEs are worth investigating.

The simplest extension of network processing elements that adds more internal parameters requires abandoning sigmoidal neurons and using more complex transfer functions. A Gaussian node in a Radial Basis Function network [5] has at least N internal parameters, defining the position of the center of the function in N-dimensional space. Weights define the inverse of the dispersions for each dimension, determining interaction with other network nodes through adaptation of parameters to the data flow. Although research efforts have been primarily devoted to the improvement of neural training algorithms and architectures, there are good reasons to think that transfer functions may significantly influence the rate of convergence, the complexity of the network and the quality of the solution it provides [16].

What do these more complex PEs represent? If their inputs are values of some features, they model areas of the feature space that may be associated with some objects, frequently appearing input patterns. Recurrent neural networks, including networks of spiking neurons, are used as autoassociative memories that store prototype memories as attractors of the network dynamics [2]. Basins of these attractors define areas of the feature space associated with each attractor. A single complex PE, or a combination of a few PEs, represents such areas directly, replacing a subnetwork of simpler neurons.
This was the original motivation for the development of the Feature Space Mapping (FSM) networks [13,9]. Nodes of FSM networks use separable transfer functions G(X) = Π_i G_i(x_i) instead of radial functions (the only separable radial function is the Gaussian). Their outputs model the probability of recognizing a particular combination of input features as some object. Each PE may be treated as a fuzzy prototype of an object, while each component G_i(x_i) may be treated as a membership function for feature x_i. Thus FSM is a neurofuzzy system that allows for control of the decision borders around each prototype by modifying the internal parameters of the PEs (transfer functions). Precise control of the basins of attractors in dynamical networks is usually impossible. In contrast to MLP neural networks, many types of functions with different internal parameterizations are used. First steps towards neural networks with heterogeneous PEs have been made [11,17,29]. Theoretically they should allow for discovery of an inductive bias in the data, selecting or adapting transfer functions to the data using a minimal number of parameters. Creation of efficient algorithms for networks with heterogeneous PEs is quite a challenging task. Each complex PE represents a module that adapts to the data flow, adjusting its basin of influence in the feature space. Is this approximation sufficient to replace dynamical networks with spiking neurons or recurrent networks? What are the limitations? Many questions are still to be answered.
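The sketch below (with hypothetical parameter values, again in Python purely for illustration) shows one such node: a separable transfer function G(X) = Π_i G_i(x_i) whose factors act as per-feature membership functions, so that widening a single dispersion stretches the node's decision region along that feature axis alone.

    import numpy as np

    def membership(x, center, dispersion):
        # Gaussian membership of one feature in the stored prototype
        return np.exp(-((x - center) / dispersion) ** 2)

    def fsm_node(x, centers, dispersions):
        # separable transfer function: product of per-feature factors
        return np.prod([membership(xi, c, d)
                        for xi, c, d in zip(x, centers, dispersions)])

    # a fuzzy prototype of some object in a 3-dimensional feature space
    centers, dispersions = [1.0, 0.5, 2.0], [0.5, 0.2, 1.0]
    print(fsm_node([1.1, 0.45, 1.8], centers, dispersions))  # high
    print(fsm_node([3.0, 0.45, 1.8], centers, dispersions))  # near 0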
3.3 Increasing complexity of PE interactions
A rigorous transition from attractor networks to equivalent FSM networks may be based on a fuzzy version of symbolic dynamics [4] or on the cell mapping method [28]. It should be possible to characterize not only the asymptotic properties of dynamical models, but also to provide simplified trajectories preserving sufficient information about basins of attractors and transition probabilities. This level of description is more detailed than finite state automata, since each state is an object represented in the feature space. Such models are a step from neural networks to networks representing low-level cognitive processes. They are tools to model processes taking place in feature spaces. FSM networks use clusterization-based procedures to create an initial model of the input data and then learn by adaptation of parameters. Adding knowledge to the feature space is easy by creating, deleting and merging nodes of the network. FSM may work as an associative memory, an unsupervised learning or pattern completion system, or a fuzzy inference system. Constraints on variables, such as arithmetic relations, or laws Y = F(X_1, ..., X_N), may be directly represented in feature spaces.

Although using complex PEs in networks adds internal degrees of freedom, interactions between the nodes are still fixed by the network architecture. Even if nodes are added and deleted, the initial feature space is fixed. An animal has a very large number of receptors and is able to pay attention to different combinations of sensory stimuli. Attractor networks are combinatorially productive, activating many
combinations of neural modules. Feedforward networks, even with complex PEs, have a fixed path of data flow. Although internal representations of PEs go beyond logical predicates, they are not dynamic. Thus they are not able to model networks of modules that interact in a flexible way depending on the internal states of their modules. One reason for changes in the internal states of cortical brain modules is recent history (priming effects); another is changes in neuromodulation controlled by rough recognition and emotional responses in the limbic areas.

A simplified model of interacting modules should include the fact that all internal parameters should depend either directly on inputs P(X), or indirectly on hidden parameters P(H(X)) characterizing internal states of other modules. Each module should estimate how competent it is in a given situation and add its contribution to the interaction with other modules only if its competence is sufficient. Recently this idea has been applied to create committees of competent classifiers [15]. A committee is a network of networks, or a network where each element has been replaced by a very complex PE made from an individual network. Outputs O(X; M_j) from all network modules (classifiers) M_j are combined together with weights W_j in a perceptron-like architecture. The weights of these combinations are modulated (multiplied) by factors F(X; M_j) that are small in the feature space areas where the model M_j makes many errors and large where it works well. Thus the effective weights depend on the current state of the network, W_j(X) = W_j F(X; M_j). This method may be used to create virtual subnetworks with different effective paths of information flow; a sketch is given below. Modulation of the activity of modules is effective only if the information about the current state is distributed to all modules simultaneously. In the brain this role may be played by the working memory (cf. Newman and Baars [42]).

The step from associations to sequential processing is usually modeled by recurrent networks. Here we have networks of modules adjusting their internal states (the local knowledge each module has learned) and their interactions (modulations of weights) to the requirements of the information flow through the system. At this level systematic search processes may operate. In [13] we have shown that a complex problem requiring a combinatorial approach may be solved quite easily by search processes that activate simple FSM modules. The electrical circuit example from the PDP book [7] has been used to demonstrate it. Each FSM module has learned to qualitatively analyze relations between 3 variables, such as Ohm's law U = IR, etc. The amazing result is [9] that almost any relation ΔA = f(ΔB, ΔC) representing changes of variables leads to the same objects in the feature space model. In the electric circuit example there are 7 variables and 5 laws that may be applied to this circuit. If the values of some variables are fixed, the activity of the 5 FSM modules (each corresponding to a 3-term relation, and each identical) that are competent to add something new to the solution is sufficient to specify the behavior of the remaining variables.
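A small sketch of the competence-weighted committee idea follows; the toy modules and the Gaussian competence model are illustrative assumptions of ours, a simplified stand-in for the method cited above. Each module's fixed weight W_j is multiplied by a factor F(X; M_j) that drops towards zero near regions where that module has made errors.

    import numpy as np

    def competence(x, error_centers, width=1.0):
        # F(X; M_j): low near feature-space regions where module M_j erred
        if not error_centers:
            return 1.0
        d2 = min(float(np.sum((x - c) ** 2)) for c in error_centers)
        return 1.0 - np.exp(-d2 / width)

    def committee_predict(x, modules, weights, error_lists):
        # effective weight W_j(X) = W_j * F(X; M_j) depends on the input,
        # forming a virtual subnetwork of the currently competent modules
        votes = [w * competence(x, errs) * m(x)
                 for m, w, errs in zip(modules, weights, error_lists)]
        return np.sign(sum(votes))

    m1 = lambda x: np.sign(x[0])          # two toy classifiers
    m2 = lambda x: np.sign(x[1])
    x = np.array([0.8, -0.3])
    print(committee_predict(x, [m1, m2], [1.0, 1.0],
                            [[np.array([0.9, -0.2])], []]))
    # -1.0: the module that erred near x is effectively silenced

Note that the effective path of information flow changes with every input, which is exactly what a fixed-architecture feedforward network cannot do.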
Thus modular networks, such as the FSM model, may be used as powerful heuristics to solve problems requiring reasoning. The solution is found by systematic search, as in reasoning systems, but each logical (search) step is supported by the intuitive knowledge of the whole system (the level of activity of the competent modules). Such systems may be used for simple symbolic processing, but creating flexible modular networks of this type that could compete with expert systems is still a challenge.

3.4 Beyond the vector space concept
Feature space representation lies at the foundation of pattern recognition [18], but it is doubtful that it plays such an important role in the brain. Even at the level of visual perception, similarity and discrimination may be sufficient to provide the information needed for visual exploration of the world [44]. At the higher cognitive levels, in abstract reasoning or sentence analysis processes, vector spaces with a fixed number of features are of little use. In such applications complex knowledge structures are created and manipulated by knowledge-based AI expert systems. Although a general framework for processing of structural data, based on recurrent neural networks and hidden Markov models, has been introduced [20], it is rather difficult to implement and use. Perhaps a simpler approach could be sufficient.

The two most common knowledge representation schemes in AI are based on the state or the problem description [48,64]. The initial state and the goal state are represented in the same way, the goal being usually a desired state, or a simple problem that has a known solution. A set of operators is defined, transforming the initial object (state, problem) into the final object (goal). Each operator has some costs associated with its use. Solutions are represented by paths in the search graph. The best solution has the lowest costs of transforming the initial object into the final object. An efficient algorithm to compute such distances may be based on dynamic programming [36]; a sketch is given below. The lowest costs of transformation that connect two complex objects are a measure of the similarity of these objects. Mental operations behind evaluations of similarity are rather complex and are not modeled directly at this level. Similarity is sufficient for categorization, and once it has been evaluated the original features are not needed any more. At the level of perception sensory information is of course feature-based, but different types of higher-level features are created for different objects from the raw sensory impressions. At the higher cognitive level "intuitive thinking" is probably based on similarity evaluation that cannot be analyzed by logical rules. Crisp or fuzzy rules have limited expressive powers [12]; prototype-based rules that evaluate similarity are a more powerful alternative [14]. A general framework for similarity-based systems includes most types of neural networks as special cases [10]. Pattern recognition methods that are based on similarity or dissimilarity matrices and do not require vector spaces based on features have been published (cf. [45]).
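As a concrete instance of the transformation-cost idea, the sketch below (our illustration, with unit operator costs assumed) computes by dynamic programming the cheapest sequence of insert, delete and substitute operators turning one string into another; the resulting cost serves as a dissimilarity measure that needs no feature vectors at all.

    def transformation_cost(a, b):
        # D[i][j]: lowest cost of transforming a[:i] into b[:j]
        m, n = len(a), len(b)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            D[i][0] = i                          # delete all of a[:i]
        for j in range(n + 1):
            D[0][j] = j                          # insert all of b[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                D[i][j] = min(D[i - 1][j] + 1,                   # delete
                              D[i][j - 1] + 1,                   # insert
                              D[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
        return D[m][n]

    print(transformation_cost("kitten", "sitting"))   # 3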
Another research direction may be inspired by Lev Goldfarb's criticism of the vector space as a foundation for inductive class generalization [23]. His system of evolving transformations tries to synthesize new operators for object transformation and similarity functions, allowing for evaluation of similarities between two objects that have quite different structures. This is necessary, for example, in chemistry, when comparing molecules that have different structures although they belong to the same class (have the same activity or other high-level properties). In other words, some kind of measure of functional isomorphism or similarity (not necessarily corresponding to the structural one) is required in such applications.
3.5 Flexible incremental approaches
One of the fundamental impediments in building large, scalable learning systems based on neural networks is the problem of catastrophic forgetting. In order to alleviate this problem several ideas concerning both the network structures and the training algorithms have been introduced. The main approaches reported in the literature include modular networks [60,52,41] and constructive approaches [19,21]. In modular networks the problem to be learned is divided into subproblems, each of which is learned independently by a separate module, and then the solution to the whole problem is obtained as a proper combination of the subproblem solutions. In constructive approaches the network starts off with a small number of nodes and its final structure is built systematically by adding nodes and links whenever necessary. Both types of methods are well known in the community, so their advantages and weak points will not be discussed here.

Other examples of flexible incremental approaches are the lifelong learning methods [57,58], in which learning new tasks becomes relatively easier as the number of already learned tasks increases. One possible approach of this type is to start the training procedure with very simple, "atomic" problems. Structures developed while solving these atomic problems are frozen and consequently will not be obliterated in subsequent learning; only fine tuning would be permitted. These small atomic networks will serve as building blocks for solving more complicated problems, say level 1 problems. Larger structures (networks) obtained in the course of training on level 1 problems will serve as blocks for building even larger structures capable of solving more complex problems (level 2 ones), etc. Once in a while the whole system is tested on the previously learned (or similar) atomic, level 1, level 2, etc. problems.

The above scheme can be viewed as an example of a constructive approach; however, unlike in typical constructive approaches, it is postulated that the network starts off with a number of nodes and links a few times exceeding the number actually required (i.e. "enough" spare nodes and links are available in the system). Hence the potential informational capacity of the system is available right from the beginning of the learning process (similarly to biological brains). After
completion of the training process, the nodes and links not involved in the problem representation are pruned, unless the system is going to be exposed to another training task in the future. We have used a scheme similar to the above for solving supervised classification problems. The training scheme, called Incremental Class Learning (ICL), was successfully applied to the unconstrained Handwritten Digit Recognition problem [34,35]. The system was trained digit by digit (class by class), and atomic features developed in the course of learning were frozen and made available in subsequent learning. These frozen features were shared among several classes. The ICL approach not only takes advantage of existing knowledge when learning a new problem, it also offers a large degree of immunity from the catastrophic interference problem; a conceptual sketch is given below. The ICL idea can possibly be extended to the case of multimodal systems performing several learning tasks where different tasks are characterized by different features. This would require adaptation of the above scheme to the case of multimodal feature spaces.

The above-mentioned incremental learning methods are suitable for supervised, off-line classification tasks in which a multi-pass procedure is acceptable. Alternative approaches, probably based on unsupervised training, must be used in problem domains requiring real-time learning ability. Ideally, large, scalable network structures should be suited to immediate, one-pass incremental learning schemes. Examples, to some extent, of such fast trainable networks are Probabilistic Neural Networks [53] and General Regression Neural Networks [54], often applied to financial prediction problems [51]. However, the cost of the fast training ability is a tremendous increase in memory requirements, since all training patterns must be memorized in the network. The other disadvantage is the relatively slow response of the system in the testing phase. The search for efficient, fast incremental training algorithms and suitable network architectures is regarded as one of the challenges in computational intelligence.
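The conceptual sketch below captures the freezing idea under heavily simplified assumptions of ours (the "feature" developed for each class is just its normalized mean direction, a toy stand-in for learned features, and has nothing to do with the actual system of [34,35]): weights learned for earlier classes are frozen and shared, so adding a new class cannot overwrite them.

    import numpy as np

    class ICLClassifier:
        def __init__(self):
            self.frozen_features = []    # weight vectors, never updated
            self.class_heads = {}        # per-class read-out weights

        def add_class(self, label, X_class):
            # develop one new feature from this class's data, then freeze it
            w = X_class.mean(axis=0)
            self.frozen_features.append(w / np.linalg.norm(w))
            # the new read-out may use ALL features, shared old ones included
            self.class_heads[label] = self._features(X_class).mean(axis=0)

        def _features(self, X):
            return np.stack([X @ w for w in self.frozen_features], axis=1)

        def predict(self, x):
            phi = self._features(x[None, :])[0]
            # an old class scores only over the features that existed when
            # it was learned; later classes also exploit the frozen old ones
            return max(self.class_heads,
                       key=lambda c: float(phi[:len(self.class_heads[c])]
                                           @ self.class_heads[c]))

    clf = ICLClassifier()
    clf.add_class(0, np.array([[1.0, 0.1], [0.9, 0.0]]))
    clf.add_class(1, np.array([[0.1, 1.0], [0.0, 0.9]]))
    print(clf.predict(np.array([0.05, 0.95])))        # 1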
3.6 Evolution of networks
Another important research direction is the change from static (deterministic) networks to evolving (context dependent) ones. Possible approaches here include networks composed of nodes with local memory that process information step-wise, depending on the previous state(s). Evolving networks should be capable of adding and pruning nodes and links along with the learning process. Moreover, the internal knowledge representation should be based on redundant feature sets, as opposed to highly distributed representations. Non-determinism and context dependence can, for example, be achieved by using nodes equipped with simple fuzzy rules (stored in their local memories) that would allow for intelligent, non-deterministic information processing. These fuzzy rules should take into account both local parameters (e.g. the number of active incoming links, the degree of local weight density, etc.)
as well as global ones (e.g. the average level of global activation, the "temperature" of the system, the global level of wiring of the system, etc.); a speculative sketch of such a rule-gated node is given at the end of this subsection. Knowledge representation should allow for off-line learning, which will be performed by separate parts of the system not involved in the very fast, on-line learning. The off-line learning should allow for fine tuning of the knowledge representation and would also be responsible for implementation of appropriate relearning schemes. One of the possible approaches are the ECOS (Evolving COnnectionist Systems) introduced by Kasabov [31], which implement off-line re-training schemes based on an internal representation of the training examples in the system. A similar idea was also introduced in our paper [35], where the network was trained layer by layer and the upper layer was trained based on the feature representation developed in the lower layer.

Another claim concerning flexible learning algorithms and network structures is that the structures of network modules as well as the training methods should have some degree of fuzziness or randomness. Ideally, several network modules starting with exactly the same structure and internal parameters should, after some training period, diverge from one another though still stay functionally isomorphic. Some amount of randomness in the training procedure would allow for better generalization capabilities and higher flexibility of these modules.

Flexibility and hierarchy of information (knowledge) can be partly realized by the use of multidimensional links. Very simple associations will be represented by classical one-dimensional links (from one neuron to another neuron). More complex facts will be represented by groups of links joined together and governed by sophisticated fuzzy rules taking into account context information. In other words, a multidimensional link will be a much more complex and powerful structure than the simple sum of all the one-dimensional links being its parts. The dimension of a link will be proportional to the degree of complexity of the information it represents. This idea has its roots in the design of associative memories, where depending on the nature and complexity of the stored associations a suitable type of memory can be used (autoassociative, bidirectional or multidirectional).
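Here is the promised speculative sketch of a fuzzy-rule-gated node; the piecewise-linear membership shapes and the particular rule are assumptions made only to illustrate the mechanism, not a design from the literature. The node fires when its local input activity is high AND the global activation of the system is not.

    def mu_high(v, lo=0.3, hi=0.7):
        # piecewise-linear membership of "high" for a value in [0, 1]
        return min(1.0, max(0.0, (v - lo) / (hi - lo)))

    def fuzzy_gate(active_input_fraction, global_activation):
        # rule: IF local input is high AND the system is not saturated
        #       THEN pass the signal (min acts as the fuzzy AND)
        return min(mu_high(active_input_fraction),
                   1.0 - mu_high(global_activation))

    print(fuzzy_gate(0.8, 0.2))   # 1.0: fires strongly
    print(fuzzy_gate(0.8, 0.9))   # 0.0: suppressed by the global state

The same node thus responds differently to identical local input depending on the context supplied by the rest of the system, which is the non-determinism and context dependence argued for above.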
3.7
Transition to symbolic processing
AI has concentrated on the symbolic level of information. Problems related to perception, the analysis of visual or auditory scenes, or the analysis of olfactory stimuli are solved by real brains working on spatiotemporal patterns. There are many challenges facing the computational cognitive neuroscience field that deals with modeling such brain functions. Spiking networks may have some advantages in such applications [61]. Several journals specialize in such topics, and a book subtitled "Towards Neuroscience-inspired computing" appeared recently [63], discussing modular organization, timing and synchronization, and learning and memory models inspired by understanding of the brain.
We are interested here only in the identification of promising routes to simplified models that may be used for processing dynamic spatiotemporal patterns, going from low- to high-level cognition. One mechanism, proposed by Hopfield and Brody [27], is based on recognition of a spatiotemporal pattern via transient synchrony of the action potentials of a group of neurons. Recognition in their model is invariant to uniform time warp and uniform intensity change of the input events. Although modeling recognition in feature spaces is rather straightforward, such invariance is rather difficult to obtain. Recognition or categorization of spatiotemporal patterns allows for their symbolic labeling, although such labeling may sometimes be a crude approximation. The transition from recurrent neural networks (RNNs) to finite-state automata, rules and symbols may be done in several ways: extracting transition rules from the dynamics of RNNs, learning finite-state behavior by RNNs, or encoding finite-state automata in neural networks [63,22] (the sketch below illustrates the first route). Although a lot of effort has been devoted to this subject, most papers assume only two internal states (active or not) for automata and for network PEs, severely restricting their possibilities. Relations between more complex PEs and automata with complex internal states are very interesting, but not much is known about them. Sequential processes in modular networks, composed of subnetworks with some local memory, should roughly correspond to information processing by the neocortex. These processes could be approximated by probabilistic multi-state fuzzy automata. Complex network processing elements with local memory may process information step-wise, depending on their history. Modules, or subnetworks, should specialize in solving fragments of the problem. Such an approach may be necessary to achieve the level of non-trivial grammar that should emerge from the analysis of transitions allowed in the finite-state automata corresponding to networks.
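To illustrate the first of the routes listed above, here is a sketch of rule extraction by quantizing the hidden states of a small RNN; the network weights are random stand-ins for a trained model, and the thresholding state abstraction is a deliberately crude assumption of ours.

import numpy as np

rng = np.random.default_rng(0)
W_h = rng.normal(size=(4, 4)) * 0.5   # hidden-to-hidden weights
W_x = rng.normal(size=(4, 2))         # input-to-hidden (alphabet of 2 symbols)

def step(h, symbol):
    x = np.eye(2)[symbol]
    return np.tanh(W_h @ h + W_x @ x)

def quantize(h):
    # crude state abstraction: threshold each hidden unit at zero
    return tuple((h > 0).astype(int))

transitions = {}   # (abstract_state, symbol) -> abstract_state
h = np.zeros(4)
for symbol in rng.integers(0, 2, size=200):
    q = quantize(h)
    h = step(h, int(symbol))
    transitions[(q, int(symbol))] = quantize(h)

print(f"extracted automaton: {len({s for s, _ in transitions})} states, "
      f"{len(transitions)} transitions")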
3.8
Up to brains and the societies of brains
The same scheme may be used at higher levels: the modular networks described above are used to process information in a way that roughly corresponds to the functions of various brain areas, and these networks become modules that are used to build next-level "supernetworks", functional equivalents of whole brains. The principles at each level are similar: networks of interacting modules adjust to the flow of information, changing their internal knowledge and their interactions with other modules. Only at a quite low level, with very simple interactions and local knowledge of PEs, are efficient learning algorithms known. The process of learning leads to the emergence of novel, complex behaviors and competencies. Maximization of the system's information capacity may be one guiding principle in building such systems: if the supernetwork is not able to model all relations in the environment, then it should recruit additional members that will specialize in learning the facts, relations or behaviors that have been missing.
At present, all systems that reach the level of higher cognitive functions and are used for commonsense reasoning and natural language understanding are based on artificial intelligence expert system technology. The CYC system (www.cyc.com), with over one million assertions and tens of thousands of concepts, does not use any neural technology or cognitive inspirations; it is a brute-force symbolic approach. Other successful AI models, such as the Soar [43] or ACT [3] systems, have also developed quite far while remaining at the level of purely symbolic processing. Can such technology be improved using subsymbolic computational intelligence ideas? Belief networks may be integrated into such systems relatively easily, but it is still a big challenge for neural systems to scale up to such applications. DISCERN was the only really ambitious project that used a neural lexicon for natural language processing [38], but it did not go too far and has been abandoned. Very complex supernetworks, such as individual brains, may be further treated as units that co-operate to create higher-level structures, such as groups of experts, institutions, think-tanks or universities, commanding the huge amounts of knowledge required to solve the problems facing the whole society. Brainstorming is an example of an interaction that may bring up ideas that are further evaluated and analyzed in a logical way by groups of experts. The difficult part is to create ideas. Creativity requires novel combination and generalization of the knowledge that each unit has, applying it in novel ways. This process may not fundamentally differ from generalization in neural networks, although it takes place at a much higher level of complexity. The difficult part is to create a system that has a sufficiently rich, dense representation of useful knowledge to be able to solve the problem by combining or adding new concepts/elements.
4
Problems pointed out by experts
Certainly, the statements presented in the previous sections, reflecting the authors' point of view on the subject, do not pretend to be a complete and comprehensive answer to the question "Quo vadis, computational intelligence?". The field of CI is very broad and still expanding, so - in a sense - even listing all of its branches or subfields may be considered a challenge in itself. With that in mind, we decided that a good way to search for the challenging problems effectively was to pose this question to a group of well-known experts in several branches of CI. Therefore, we asked a few leading scientists working in the field of computational intelligence (understood in a very broad sense) what - according to them - would be the most challenging problems for the next 5-10 years in their area of expertise, and what solutions (if known) are on the horizon. The CI disciplines represented by the experts included neural networks, genetic algorithms, evolutionary computing, swarm optimization, artificial life, Bayesian methods, brain sciences, neuroinformatics, robotics, computational biology, fuzzy
systems, rough sets, mean field methods, control theory, and related disciplines. Both theoretical and applicative challenges were asked for. Our first idea was to collect the answers and then try to identify a number of common problems that may be of general interest to the computational intelligence community. However, after collecting the responses we decided that a presentation of the individual experts' opinions, with some comments from us, would be more advantageous to potential readers. In order to express the views and opinions provided by the experts precisely, we have decided to present several citations from their responses. For the sake of clarity, in the following text all citations of experts' opinions are distinguished by italic font. The problems posed by the experts can be divided into two main categories: general CI problems related to human-like intelligence, and specific problems within various CI disciplines.
4.1
General CI problems related to human-like intelligence
Among the general problems envisioned by the experts, two were proposed by Lee Giles. The first one concerns bringing robotics into the mainstream world. Effectively bringing robotics (and CI in general) into the mainstream will probably require the development of CI-based, user-friendly everyday applications able to convince people of the usefulness and practical value of CI research. Several devices of that kind already exist, e.g. intelligent adaptive fuzzy controllers installed in public lifts or various household machines. These bottom-level, practical successes of CI are, however, not well advertised and therefore not well known (or actually not known at all) to the general public. Paradoxically, events which seem to be much more "abstract" achievements of AI (at least for non-professionals) have recently become very influential signs of AI success. These include the defeat of Kasparov by the Deep Blue supercomputer and the design of artificial dogs - Aibo and Poo-Chi. The other challenge pointed out by Giles is integrating the separate successes of AI - vision, speech, etc. - into an intelligent SYSTEM. In fact, the building of intelligent agents has been of primary concern to AI experts for about a decade now. The development of new robotic toys, such as the Aibo dogs, requires the integration of many branches of CI. What capabilities will such toy robots show in 20 years? Perhaps progress similar to that in personal computer hardware (for example graphics) and software (from DOS to Windows XP) should be expected here. Several advanced robotics projects are currently being developed in industrial labs. The most ambitious one concerning humanoid robots - developed from and around the Cog project at MIT - demanded the integration of several perceptual and motor systems [1]. Social interaction with humans demands much more: identification and analysis of emotional cues during interactions with humans, speech prosody, shared attention and visual search, learning through imitation, and building a theory of mind. A similar challenge, concerning the design of advanced user interfaces using natural language, speech, and visualizations, is listed by Erkki Oja. According to
Oja, the realization of such integrated, human-friendly interfaces requires very advanced and robust pattern recognition methods. As a possible approach to tackling these tasks, Oja proposes the application of some kind of machine learning, as well as probabilistic modeling methods aimed at finding - in an unsupervised manner - a suitable compressed representation of the data. The underlying idea is that when models are learned from the actual data, they are able to explain the data, and meaningful inferences and decisions can be based on the compressed models. These issues are also connected with questions about the functioning of learning algorithms in human brains. If we can really find out the learning algorithms that the brain is using, this will have an enormous impact on both neuroscience and on the artificial neural systems (Oja). A related challenge from the domain of intelligent human-like interfaces is also stated by John Taylor: how is human language understanding achieved? An answer is needed to enable language systems to improve and to allow human-machine interaction. Before answering this question, two other major challenges in the area of building intelligent autonomous systems need to be considered. The first one is concerned with the problem of how attention-controlled processing is achieved to create (by learning) internal goals for an autonomous system. This requires a reward learning structure, but more specifically a way of constructing (prefrontal-like) representations of goals of action/object character at the basis of schema development. Current research on reinforcement learning draws little inspiration from brain research; perhaps the subject is not understood well enough. On the other hand, considerable progress has been achieved by Ai Enterprises in building a "child machine", trained by reinforcement learning to respond to symbols like an infant [59]. The transcripts from the program have been evaluated by a developmental psychologist as the healthy babbling of an 18-month-old baby. This is still only babbling, and it will be fascinating to see how far one can go in this way. The other challenging problem is answering the question of how automatisation of response is achieved by learning. The initial controlled response needs to be 'put on automatic' in order to enable an autonomous system to concentrate on other tasks. This may be solved by further understanding of the processes occurring in the frontal lobes in their interaction with the basal ganglia. At the same time episodic and working memory are clearly crucially involved (Taylor). In other words, how does a task initially requiring conscious decisions taken by the brain at the highest level, such as learning to drive, become quite automatic? What is the role of working memory here? Perhaps it is needed only to provide reinforcement by observing and evaluating the actions that the brain has planned and executed? Is this the main role of consciousness? Relating one's own performance to memorized episodes of performance observed previously requires evaluation and comparison followed by emotional reactions that provide reinforcement and increase neuromodulation, facilitating rapid learning. Working memory is essential to perform such a complex task, and after the skill is learned there is no need for reinforcement and it becomes
automatic (subconscious). Unfortunately, working memory models are not well developed. Similarly to Oja, the need for an appropriate data (state) representation is also stressed by Christoph von der Malsburg: In the classical field of AI, this question is left entirely open, a myriad of different applications being dealt with by a myriad of different data formats. In the field of Artificial Neural Networks, there is a generic data format - neurons acting as elementary symbols - but this data format is too impoverished, having no provision for representing hierarchical structures, and having no provision of the equivalent of the pointers of AI. A related challenging problem pointed out by von der Malsburg is the design of autonomous self-organization processes in the (hierarchical) state organization - the state of an intelligent system must be built up under the control of actual inputs and of short-term and long-term stored information. The algorithmic approach to state construction ... must be overcome and be replaced by autonomous organization. State organization must conform to a general underlying idea of the ability of intelligent systems to generalize based on the current state: Intelligent systems relate particular situations to more general patterns. This is the basis for generalization, the hallmark of intelligence. Each situation we meet is new in detail. It is important to recognize general patterns in them so that known tools and reactions can be applied. To recognize a specific situation as an instance of a general pattern, the system must find correspondences between sub-patterns, and must be able to represent the result by expressing these correspondences as a set of links. Finding such sets of links is an exercise in network self-organization (von der Malsburg). On the other hand, self-organization alone does not seem powerful enough to create intelligent systems (behaviors) in limited time and with limited resources. Therefore some kind of learning with a teacher seems to be indispensable. An important sub-category of learning is guided by teaching. Essential instruments of teaching are: showing of examples, pointing, naming and explanation. To provide the necessary instruments that underlie these activities constitutes a considerable array of sub-problems. Human intelligence is a social phenomenon and is based on teaching. The alternative is evolution, but we will hardly have the patience to create the intelligence of even a mouse or a frog by purely evolutionary mechanisms (von der Malsburg). Another problem stressed by von der Malsburg is the ability of intelligent autonomous systems to learn from natural environments: Intelligent systems must be able to pick up significant structure from their environment. Machine learning in AI is limited to pre-coded application fields. Artificial neural networks promise learning from input, finding significant structure on the basis of input statistics, but that concept fails beyond a few hundred bits of information per input pattern, requiring astronomical learning times. Animals and humans demonstrate learning from one or a few examples. To emulate this, mechanisms must be found for identifying significant structure in single scenes. Similarly to Oja, von der Malsburg underlines the role of interaction with the natural environment - intelligent systems must be able to
autonomously interact with their environment, by interpreting the signals they receive and closing the loop from action to perception. The next step on the way to building intelligent systems is the stage of hierarchical integration of separate modules or operational paradigms into one coherent organizational structure. Two major challenges concerning this issue were put forward by von der Malsburg. The first problem is subsystem integration: An intelligent system is to be composed of (a hierarchy of) individual modules, each representing an independent source of information or computational process, and problems are to be solved by coupling these modules in a coherent way. This process may be likened to a negotiation process, in which the different players try to reach agreement with each other by adjusting internal parameters. If there is sufficient redundancy in the system, a globally coherent state of the system arises by self-organization. This process is the basis for the creativity of intelligent systems. The problem is to find the general terms and laws which make subsystem integration possible. The other - closely related - challenge is the structuring of general goal hierarchies: Whatever intelligent systems do, they are pursuing goals, which they themselves set out with or recognize as important for a given scene or application area. To organize goal-oriented behavior, a system starts with rather generally defined goals (survive, don't get hurt, feed yourself, ...) and must be able to autonomously break those goals down to specific settings, and to self-organize consistent goal hierarchies. The key issues in the development of the computational intelligence field, according to Harold Szu, lie in the area of unsupervised learning. CI science is now at the cross road of taking advantage of the exponential growth of information sciences modeling and the linear growth of neurosciences experiments. The key is to find the proper representation of the complex neuroscience experiment data that can couple the two together. The idea of learning without a teacher has a "natural" support in the biological world, since we - people - have pairs of eyes, ears, etc. Therefore, the proper representation is a vector time series whose components are the input of a pair of eyes, ears, etc. - smart sensor pairs. Szu believes that the unsupervised-learning Hebb rule results from the thermodynamic Helmholtz free energy (see [56] for a mathematical formulation). According to Szu, one of the intermediate problems that needs to be solved on the way is the development of appropriate and efficient procedures for sampling information from the environment. Among the key sub-issues are the redundancy problem and the problem of dimensionality reduction. Another central challenge is stated by Paul Werbos in his recent paper [62], submitted by invitation to the IEEE CDC'02 conference: Artificial neural networks offer both a challenge to control theory and some ways to help meet that challenge. We need new efforts/proposals from control theorists and
others to make progress towards the key long-term challenge: to design generic families of intelligent controllers such that one system (like the mammal brain) has a general-purpose ability to adapt to a wide variety of large nonlinear stochastic environments, and learn a strategy of action to maximize some measure of utility across time. New results in nonlinear function approximation and approximate dynamic programming put this goal in sight, but many parallel efforts will be needed to get there. Concepts from optimal control, adaptive control and robust control need to be unified more effectively. The above citation presents a general statement concerning the need for new ideas/proposals that might influence research in the intelligent control area. Going further, Werbos states several goals and suggests possible approaches to achieving them. In fact, the paper [62] was written with a similar intention to our work, and we encourage anybody interested in the subject to read it. Since we are unable to present all the ideas from this paper, we have chosen only two problems that appear to us to be very important. One of them addresses the problem of an appropriate balance between a problem-independent approach to learning in intelligent systems and methods taking advantage of problem-specific knowledge. In the most challenging applications, the ideal strategy may be to look for a learning system as powerful as possible, a system able to converge to the optimal strategy without any prior knowledge at all - and then initialize that system to an initial strategy and model as close as possible to the most extensive prior knowledge we can find. Another suggestion is to regard artificial intelligent systems in a rational framework, which means defining our goals and expectations towards them in a realistic way. We cannot expect the brain or any other physical device to guarantee an exact optimal strategy of action in the general case. That is too hard for any physically realizable system. We will probably never be able to build a device to play a perfect game of chess or a perfect game of Go. ... We look for the best possible approximations, trying to be as exact as we can, but not giving up on the true nonlinear problems of real interest. We would definitely agree with that. In any real situation where non-trivial goals are to be achieved, the optimal strategy cannot be "calculated" in a reasonable time. We believe that one of the main obstacles in the way of developing intelligent autonomous systems was - right from the beginning - too-high expectations regarding their abilities and the lack of properly defined, achievable, realistic goals. Brains are not all-powerful devices, but have been prepared by millions of years of evolution to make reasonable decisions in situations that are natural from the environmental point of view. In many unnatural situations humans suffer from "cognitive illusions" [46].
4.2
General problems within certain CI disciplines
Several problems stated by the experts concerned the particular disciplines that constitute computational intelligence. In the context of neural networks, two of the
proposed problems were connected with the reduction of data dimension, in both theoretical and applicative aspects. One of them, known as the curse of dimensionality, is pointed out by Vera Kurkova: for some tasks, implementation of theoretically optimal approximation procedures becomes unfeasible because of an unmanageably large number of parameters. In particular, high-dimensional tasks are limited by the "curse of dimensionality", i.e., an exponentially fast scaling of the number of parameters with the number of variables. One of the challenges of the mathematical theory of neurocomputing is to gain some understanding of what properties make high-dimensional connectionist models efficient, and what attributes of multivariable mappings guarantee that their approximation by certain types of neural networks does not exhibit the "curse of dimensionality". A similar challenging problem, concerned with unmanageable data dimensionality, is put forward by Lipo Wang: which features are relevant and which of them are important for the task at hand? In several application domains, high-dimensional data, besides being computationally infeasible, is also difficult to interpret properly. In other words, when data dimensionality is high, information can be obscured by the presence of irrelevant features. This is important for many data mining tasks, such as classification, clustering, and rule extraction. In most practical problems, the estimation of the relative importance of particular data properties comes from experts' knowledge or experience; quite rarely does it become available as a result of theoretical analysis. In the neural networks domain, some general methods supporting that kind of analysis have already been developed. The most popular example is Principal Component Analysis (PCA), which reduces data dimensionality by orthogonalisation and selection of the most relevant dimensions (a minimal sketch follows this paragraph). The other well-known method is Independent Component Analysis, which allows for blind source separation in the case of multi-source, noisy data. Both methods perform well in many cases; however, their applicability is not unconditional. For example, application of the PCA method to highly interrelated data (e.g. data sampled from multidimensional chaotic systems) may lead to a degradation of performance compared to using data that was not pre-processed by PCA [30]. The need for reliable identification of relevant features in multidimensional data is especially important within popular, fast-growing disciplines where the increase in the amount of available data is enormous. Indeed, in bioinformatics - for example - tens or even hundreds of thousands of features are defined for some problems, and the selection of information becomes a central issue. One possible approach is to use feature aggregation instead of feature selection. Such hierarchical processing of information allows for the integration of very rich input data into simpler, but more informative, higher-level structures. Integration of distributions of time-dependent input (sensory) signals creates distributions at the higher levels. Although interval arithmetic is known, the relevant mathematics for computing with arbitrary distributions has not yet been formulated. The other promising idea - similar to the way in which attention facilitates control - is selecting subsets of relevant features.
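Here is the minimal sketch of PCA-style dimensionality reduction via the singular value decomposition announced above; the data, function name and number of retained components are illustrative choices of ours.

import numpy as np

def pca_reduce(X, k):
    Xc = X - X.mean(axis=0)              # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)  # variance share per component
    return Xc @ Vt[:k].T, explained[:k]    # project onto top-k directions

X = np.random.randn(500, 50) @ np.random.randn(50, 50)  # correlated features
Z, ratios = pca_reduce(X, k=5)
print(Z.shape, ratios.round(3))          # (500, 5) plus variance ratios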
Such dynamic feature selection results in a process serving the short- and long-term goals of the system. Another challenging issue connected with data processing is the design and construction of intelligent systems capable of providing a focused search through the huge amount of available data (e.g. published on the Internet). According to Oja, in the short and medium term, we will have a great demand for fast and reliable computer systems to manage and analyze the vast and ever increasing data masses (text, images, measurements, digital sound and video, etc.) available in databases and the Web. How to extract information and knowledge, to be used by humans, from this kind of scattered data storage? The problem is well known and various solutions are being proposed. One group of solutions is based on visualization techniques, for example using the Web-SOM variant of Kohonen's networks. Many other clustering methods and visualization techniques are certainly worth using. Data mining techniques for modeling user interest and extracting knowledge from data are at the experimental stage. Latent Semantic Indexing [8] is based on the analysis of the term-document frequency matrix using Singular Value Decomposition to find Principal Components that are treated as "concepts" (see the sketch after this paragraph). Unfortunately, these concepts are vector coefficients; they are useful only for estimating the similarity of documents and are not understandable concepts interesting to humans. The Interspace Research Project is aimed at the semantic indexing of multimedia information and at facilitating communication between different communities of experts that use similar concepts. Only a few CI methods have been applied to this field so far. Certainly, the problem of the explosive growth of the amount of accessible data has a great impact on artificial systems' (and also humans') ability to preprocess and analyze this data and consequently make optimal (or at least efficient) decisions. Gathering a compact set of relevant and complete information concerning a given task requires much more efficient search engines than those available now. One of the underlying features of these "future" search engines must be the ability to analyze data contextually. Ultimately, understanding of texts requires sophisticated natural language processing techniques. At this stage, the best programs for natural language understanding (NLU) are based on huge ontologies - human-created hierarchical descriptions of concepts and their interrelations. The release of the OpenCyc tools by CycCorp in 2001 made such applications easier, but hybrid systems, combining the NLU techniques developed by AI experts with the data mining techniques developed by CI experts, have not yet been created. One possible direction on the way to designing intelligent decision support systems capable of extracting useful information and knowledge from large data repositories is a distributed multi-agent approach, in which a set of agents automatically searches various databases and professional services (e.g. Internet ones) in real time in order to provide up-to-date, relevant information. This would also require soft mechanisms for checking information reliability. Furthermore, efficient mechanisms of automated reasoning based on CI techniques need to be applied in such systems.
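Returning to Latent Semantic Indexing as described above, the following sketch shows the SVD-based construction of a "concept" space on a made-up term-document matrix; it illustrates why the resulting concepts support similarity estimates without being human-readable.

import numpy as np

# rows = terms, columns = documents (raw term frequencies; invented data)
A = np.array([[2, 0, 1, 0],
              [1, 0, 2, 0],
              [0, 3, 0, 1],
              [0, 1, 0, 2]], dtype=float)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                   # number of latent "concepts" kept
docs = (np.diag(S[:k]) @ Vt[:k]).T      # documents in concept space

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(docs[0], docs[2]))  # documents 0 and 2 share terms: high similarity
print(cos(docs[0], docs[1]))  # documents 0 and 1 do not: low similarity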
Another challenge was identified in the area of combinatorial optimization (Lipo Wang): evolutionary computation and neural networks are effective approaches to solving optimization problems. How can we make them more powerful? This question is really hard to answer. In the framework of neural networks, two main approaches to solving constrained optimization problems exist: evolving template matching and Hopfield-type networks. Both of them suffer from serious intrinsic limitations. Template matching methods require that the problem to be solved has an appropriate geometrical representation. Hopfield-type approaches suffer from the gradient minimization scheme and the lack of general recipes for defining the constraint coefficients in the energy function (a sketch at the end of this section illustrates the coefficient-tuning issue). Despite the enormous number of papers devoted to the above two types of approaches, and despite the development of various modifications of their original formulations, it seems that the efficacy of neural network-based optimization methods - although significantly increased compared to the initial approaches - cannot be proven for problems exceeding a certain level of complexity. A similar situation exists in the evolutionary computation domain, where, for example, no general rules have yet been developed concerning an efficient coding scheme or choosing a priori a suitable form of the crossover operation or an appropriate mutation probability. In most cases the above very basic choices are still decided by trial-and-error methods. Consequently, in complex problem domains the time required to achieve reasonably good solutions is prohibitive. Similar problems concerning the scalability of evolutionary computation algorithms are pointed out by Xin Yao: There have been all kinds of evolutionary techniques, methods and algorithms published in the literature which appear to work very well for certain classes of problems with a relatively small size. However, few can deal with large and complex real world problems. It is well known that divide-and-conquer is an effective and often the only strategy that can be used to tackle a large problem. It is unclear, though, how to divide a large problem and how to put the individual solutions back together in a knowledge-lean domain. Automated approaches to divide-and-conquer will be a challenge, as well as an opportunity to tackle the scalability issue, for evolutionary computation researchers. An issue closely related to the scalability problem is the lack of theoretical estimates of the computational complexity of evolutionary methods: We still know very little about the computational time complexity of evolutionary algorithms on various problems. It is interesting to observe that a key concern in the analysis of algorithms in mainstream computer science is computational time complexity, while very few complexity results have been established for evolutionary algorithms. It is still unclear where the real power, if any, of evolutionary algorithms is (Yao). Another problem emphasized by Yao is the need for suitable mechanisms that allow the promotion of "team work" rather than the best (highest-scored) individuals: Evolutionary computation emphasizes populations. While one can use a very large population size, it is often the best individual that we are after. This is in sharp contrast to our own (human) experience in problem solving, where we tend to use a group of people to solve a large and complex problem. Clearly, we need to rethink
our endeavour in finding the best individual. Instead, we need the best team to solve a large and complex problem. This challenges us to think about questions such as how to evolve/design the best team and how to scale up the team to deal with increasingly large and complex problems.
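To close this section, here is a toy sketch of the Hopfield-type energy-minimization approach discussed earlier; the penalty coefficients A, B, C are hand-picked illustrative values, and finding such coefficients reliably is precisely the open problem noted there. Bad choices leave the network in an infeasible or fractional state.

import numpy as np

# Drive a 3x3 unit matrix toward a permutation matrix (one active unit
# per row and column) by gradient descent on a penalty-based energy.
rng = np.random.default_rng(1)
A, B, C, lr, gain = 2.0, 2.0, 1.0, 0.05, 0.2
u = rng.normal(scale=0.1, size=(3, 3))            # unit potentials

def act(u):
    return 1.0 / (1.0 + np.exp(-u / gain))        # sigmoid activations

for _ in range(2000):
    v = act(u)
    row_err = v.sum(axis=1, keepdims=True) - 1    # row sums should be 1
    col_err = v.sum(axis=0, keepdims=True) - 1    # column sums too
    # gradient of E = A/2*sum(row_err^2) + B/2*sum(col_err^2)
    #               + C*sum(v*(1-v))   (the last term favors 0/1 values)
    grad = A * row_err + B * col_err + C * (1.0 - 2.0 * v)
    u -= lr * grad                                # descend the energy

print(np.round(act(u)))   # ideally a permutation matrix, if A, B, C are lucky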
5
Conclusions
In this short article, obviously, only a few of the challenges facing the various branches of computational intelligence could be identified. According to several suggestions, the underlying issues are related to - generally speaking - the emulation of human-type intelligent behavior. Within this research area several specific goals and challenges are identified, e.g.: flexible data (state) representations and suitable, context (state) dependent training methods; training methods involving both supervised and unsupervised paradigms, allowing one to combine learning from examples with the self-organizing, evolutionary development of the system; further investigation of the working mechanisms of biological brains; integration of the solutions achieved for partial (individual) problems into more complex, efficiently working systems; and theoretical investigations of the complexity, potential applicability and limitations of the explored ideas. We have tried to show some promising directions that should allow modeling of certain brain-like functions, going beyond the current applications of neural networks in pattern recognition. The UCI repository of data for machine learning methods [37] has played a very important role in providing the pattern recognition problems to be solved. A collection of more ambitious problems for testing new approaches, going beyond classification and approximation, is urgently needed. We hope that the issues pointed out by the professionals and by ourselves will serve as useful pointers - especially for young and less experienced researchers looking for interesting problems - in developing computational intelligence in promising directions.
Acknowledgments
We would like to thank our expert colleagues who supported this project by sending descriptions of the problems that, according to them, are the most challenging issues in the field of computational intelligence. We gratefully acknowledge the helpful comments from C. Lee Giles (The Pennsylvania State University), Vera Kurkova (Academy of Sciences of the Czech Republic), Christoph von der Malsburg (Ruhr-Universität Bochum), Erkki Oja (Helsinki University of Technology), Harold Szu (Naval Research Laboratory), John G. Taylor (King's College London), Lipo Wang
(Nanyang Technological University), Xin Yao (University of Birmingham) and Paul Werbos (National Science Foundation). W.D. would like to thank the Polish State Committee for Scientific Research for support, grant no. 8 T11C 006 19.
References
1. Adams, B., Breazeal, C., Brooks, R., Scassellati, B.: Humanoid Robots: A New Kind of Tool. IEEE Intelligent Systems 15 (2000) 25-31
2. Amit, D.J.: The Hebbian paradigm reintegrated: local reverberations as internal representations. Brain and Behavioral Science 18 (1995) 617-657
3. Anderson, J.R.: Rules of the Mind. Erlbaum, Hillsdale, N.J. (1993)
4. Bedford, T., Keane, M., Series, C.: Ergodic theory, symbolic dynamics and hyperbolic spaces. Oxford University Press, Oxford, UK (1991)
5. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press (1995)
6. Crevier, D.: AI: The Tumultuous History of the Search for Artificial Intelligence. Basic Books, New York (1993)
7. McClelland, J.L., Rumelhart, D.E. and the PDP research group: Parallel distributed processing. The MIT Press, Cambridge, MA (1987)
8. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6) (1990) 391-407
9. Duch, W.: Platonic model of mind as an approximation to neurodynamics. In: Brain-like computing and intelligent information systems, ed. S. Amari, N. Kasabov. Springer, Singapore (1997) 491-512
10. Duch, W.: Similarity-Based Methods. Control and Cybernetics 4 (2000) 937-968
11. Duch, W., Adamczak, R., Diercksen, G.H.F.: Constructive density estimation network based on several different separable transfer functions. 9th European Symposium on Artificial Neural Networks (ESANN), Brugge. De-facto publications (2001) 107-112
12. Duch, W., Adamczak, R., Grabczewski, K.: Methodology of extraction, optimization and application of crisp and fuzzy logical rules. IEEE Transactions on Neural Networks 12 (2001) 277-306
13. Duch, W., Diercksen, G.H.F.: Feature Space Mapping as a universal adaptive system. Computer Physics Communications 87 (1995) 341-371
14. Duch, W., Grudzinski, K.: Prototype based rules - a new way to understand the data. Int. Joint Conference on Neural Networks, Washington D.C., July 2001, 1858-1863
15. Duch, W., Itert, L., Grudzinski, K.: Competent undemocratic committees. Int. Conf. on Neural Networks and Soft Computing, Zakopane, Poland (in print, 2002)
16. Duch, W., Jankowski, N.: Survey of neural transfer functions. Neural Computing Surveys 2 (1999) 163-213
17. Duch, W., Jankowski, N.: Transfer functions: hidden possibilities for better neural networks. 9th European Symposium on Artificial Neural Networks (ESANN), Brugge. De-facto publications (2001) 81-94
18. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd Ed. John Wiley & Sons, New York (2001)
19. Fahlman, S.E., Lebiere, C.: The cascade-correlation learning architecture. In: D. Touretzky (Ed.), Advances in Neural Information Processing Systems 2, Morgan Kaufmann (1990) 524-532
20. Frasconi, P., Gori, M., Sperduti, A.: A General Framework for Adaptive Processing of Data Structures. IEEE Transactions on Neural Networks 9 (1998) 768-786
21. Frean, M.: The upstart algorithm: a method for constructing and training feedforward neural networks. Neural Computation 2 (1990) 198-209
22. Giles, L.C., Gori, M. (Eds): Adaptive processing of sequences and data structures. Springer, Berlin (1998)
23. Goldfarb, L., Nigam, S.: The unified learning paradigm: A foundation for AI. In: V. Honavar, L. Uhr (Eds.), Artificial Intelligence and Neural Networks: Steps Toward Principled Integration. Academic Press, Boston (1994)
24. Golomb, D., Hansel, D., Shraiman, B., Sompolinsky, H.: Clustering in globally coupled phase oscillators. Phys. Rev. A 45 (1992) 3516-3530
25. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series in Statistics, New York (2001)
26. Hopfield, J.J.: Neural networks and physical systems with emergent collective computational abilities. Proc. National Academy of Science USA 79 (1982) 2554-2558
27. Hopfield, J.J., Brody, C.D.: What is a moment? Transient synchrony as a collective mechanism for spatiotemporal integration. PNAS 98 (2001) 1282-1287
28. Hsu, C.S.: Global analysis by cell mapping. J. of Bifurcation and Chaos 2 (1994) 727-771
29. Jankowski, N., Duch, W.: Optimal transfer function neural networks. 9th European Symposium on Artificial Neural Networks (ESANN), Brugge. De-facto publications (2001) 101-106
30. Jaruszewicz, M., Mandziuk, J.: Short-term weather forecasting with neural nets. International Conference on Neural Networks and Soft Computing, Zakopane, Poland (submitted, 2002)
31. Kasabov, N.: ECOS - A framework for evolving connectionist systems and the 'eco' training method. Proc. of ICONIP'98 - The Fifth International Conference on Neural Information Processing, Kitakyushu, Japan, 3 (1998) 1232-1235
32. Kunstman, N., Hillermeier, C., Rabus, B., Tavan, P.: An associative memory that can form hypotheses: a phase-coded neural network. Biological Cybernetics 72 (1994) 119-132
33. MIT Encyclopedia of Cognitive Sciences. Ed. M.A. Wilson, F.C. Keil, MIT Press, Cambridge, MA (1999)
34. Mandziuk, J., Shastri, L.: Incremental Class Learning - an approach to longlife and scalable learning. Proc. International Joint Conference on Neural Networks (IJCNN'99), Washington D.C., USA (1999) (6 pages on CD-ROM)
35. Mandziuk, J., Shastri, L.: Incremental Class Learning approach and its application to the Handwritten Digit Recognition problem. Information Sciences (in press, 2002)
36. Marczak, M., Duch, W., Grudzinski, K., Naud, A.: Transformation Distances, Strings and Identification of DNA Promoters. Int. Conf. on Neural Networks and Soft Computing, Zakopane, Poland (in print, 2002)
37. Merz, C.J., Murphy, P.M.: UCI repository of machine learning databases, http://www.ics.uci.edu/pub/machine-learning-databases
38. Miikkulainen, R.: Subsymbolic natural language processing: an integrated model of scripts, lexicon and memory. MIT Press, Cambridge, MA (1993)
39. Minsky, M., Papert, S.: Perceptrons. MIT Press, Cambridge, MA (1969), 2nd ed. (1988)
40. Mitchell, T.: Machine learning. McGraw Hill (1997)
41. Mitra, P., Mitra, S., Pal, S.K.: Staging of cervical cancer using Soft Computing. IEEE Transactions on Biomedical Engineering 47(7) (2000) 934-940
42. Newman, J., Baars, B.J.: Neural Global Workspace Model. Concepts in Neuroscience 4 (1993) 255-290
43. Newell, A.: Unified Theories of Cognition. Harvard University Press, Cambridge, MA (1990)
44. O'Regan, J.K., Noe, A.: A sensorimotor account of vision and visual consciousness. Behavioral and Brain Sciences 24(5) (2001, in print)
45. Pekalska, E., Paclik, P., Duin, R.P.W.: A generalized kernel approach to dissimilarity-based classification. J. Machine Learning Research 2 (2001) 175-211
46. Piattelli-Palmarini, M.: Inevitable Illusions: How Mistakes of Reason Rule Our Minds. John Wiley & Sons (1996)
47. Poole, D., Mackworth, A., Goebel, R.: Computational Intelligence. A Logical Approach. Oxford University Press, New York (1998)
48. Rich, E., Knight, K.: Artificial Intelligence. McGraw Hill Inc, Int'l Edition (1991)
49. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323 (1986) 533-536
50. Russell, S.J., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, N.J. (1995)
51. Saad, E.W., Prokhorov, D.V., Wunsch II, D.C.: Comparative study of stock trend prediction using time delay, recurrent and probabilistic neural networks. IEEE Transactions on Neural Networks 9(6) (1998) 1456-1470
52. Shastri, L., Fontaine, T.: Recognizing handwritten digit strings using modular spatio-temporal connectionist networks. Connection Science 7(3) (1995) 211-235
53. Specht, D.: Probabilistic neural networks. Neural Networks 3 (1990) 109-118
54. Specht, D.: A general regression neural network. IEEE Transactions on Neural Networks 2 (1991) 568-576
55. Sun, R., Giles, L. (Eds): Sequence learning. Springer Verlag, Berlin (2001)
56. Szu, H., Kopriva, I.: Constrained equal a priori entropy for unsupervised remote sensing. IEEE Transactions on Geoscience Remote Sensing (2002)
57. Thrun, S.: Explanation based neural network learning. A lifelong learning approach. Kluwer Academic Publishers, Boston / Dordrecht / London (1996)
58. Thrun, S., Mitchell, T.M.: Learning one more thing. Technical Report CMU-CS-94-184 (1994)
59. Treister-Goren, A., Hutchens, J.L.: Creating AI: A unique interplay between the development of learning algorithms and their education. Technical Report, Ai Enterprises, Tel-Aviv (2000). Available from http://www.a-i.com
60. Waibel, A.: Consonant recognition by modular construction of large phonemic time-delay neural networks. In: D. Touretzky (Ed.), Advances in Neural Information Processing Systems 1, Morgan Kaufmann (1989) 215-223
61. Wang, D.: On Connectedness: A Solution Based on Oscillatory Correlation. Neural Computation 12 (2000) 131-139
62. Werbos, P.: Neural Networks for Control: Research Opportunities and Recent Developments. IEEE CDC'02 Conference (submitted, 2002)
63. Wermter, S., Austin, J., Willshaw, D. (Eds.): Emergent neural computational architectures based on neuroscience. Towards Neuroscience-inspired computing. Springer, Berlin (2001)
64. Winston, P.: Artificial Intelligence. 3rd ed., Addison Wesley (1992)
Mathematical Tools for Machine Intelligence
MAPPINGS BETWEEN HIGH-DIMENSIONAL REPRESENTATIONS IN CONNECTIONISTIC SYSTEMS
VĚRA KŮRKOVÁ
Institute of Computer Science, Academy of Sciences of the Czech Republic
Pod vodárenskou věží 2, 182 07 Prague 8, Czech Republic
e-mail: [email protected]
Approximation of high-dimensional mappings by neural networks is investigated in the context of nonlinear approximation theory. It is shown that the "curse of dimensionality" can be avoided when certain norms are kept low. Properties of such norms, and methods for deriving estimates of them, are described. The results are applied to perceptron and RBF networks.
1
Introduction
Classical artificial intelligence has modeled cognitive tasks using rule-based manipulation of symbols. Typically, symbolic representations have been discrete and low-dimensional, which had both conceptual and computational advantages. In contrast to classical AI, connectionistic computational models employ learning in systems of distributed representations. Such representations tend to be high-dimensional, as they describe objects in terms of many parameters. Connectionistic systems map one type of distributed representation to another one. The primary computational technique used in such systems is the approximation of high-dimensional mappings implemented in neural networks of various types. For example NETtalk [33], which performs "reading aloud", maps binary vectors of length over 200 (coding segments of written text) to vectors with 26 real entries (coding phonemes). A vowel-recognizer performing "lip-reading" [34] transforms vectors of length 500 (corresponding to pixels from video) to a phonetic code of length 32.

One of the goals of research in computational intelligence is to understand what properties make such high-dimensional connectionistic models efficient and flexible. A theoretical understanding can provide guidelines for the design of computationally feasible procedures that are flexible in the sense that they are capable of performing a variety of high-dimensional tasks by merely changing the procedures' parameters.

Since the 19th century, mathematics has studied families of mappings, such as polynomials, rational functions and trigonometric sums, which are sufficiently flexible so that, with a proper choice of coefficients, they are able to approximate arbitrarily well any continuous or measurable mapping defined on a compact (closed and bounded) subset of a multidimensional space. Later, it has been shown that splines, sums of wavelets and many types of feedforward networks also possess this flexibility, which is sometimes called the "universal approximation capability". However, for some tasks, implementation of theoretically optimal approximation procedures becomes unfeasible because of an unmanageably large number of parameters. In particular, high-dimensional tasks are limited by the "curse of dimensionality" [4], i.e., an exponentially fast scaling of the number of parameters with the number of variables. However, some mappings between high-dimensional representations have been successfully implemented using connectionistic systems of moderate complexity (see, e.g., [33, 34]).

Approximation theory offers some explanation of the feasibility of such implementations. It has derived various upper bounds on the complexity of approximating systems depending on the number of variables of the functions to be approximated together with their other characteristics. Inspection of such bounds shows that we can cope with the curse of dimensionality by constraining the characteristics involved. For example, upper bounds on worst-case errors in linear approximation are of the form $O(n^{-s/d})$, where $d$ is the number of variables, $s$ the degree of smoothness of the functions to be approximated, and $n$ the number of parameters of the linear approximating family (see, e.g., [29]). Thus in linear approximation, the curse of dimensionality can be avoided by increasing smoothness together with the number of variables.

In this paper, we describe ways to cope with the curse of dimensionality in nonlinear approximation schemes corresponding to connectionistic systems. We study such systems in the framework of variable-basis approximation, which includes feedforward networks as well as other nonlinear systems with free parameters. We show that for variable-basis approximation, the role of the characteristic to be kept low to cope with the curse of dimensionality is played by a norm tailored to the particular basis. In the case of feedforward networks, such a basis corresponds to the type of computational units (e.g., perceptrons or radial units). We show that, in contrast to linear approximation, where with an increasing number of variables the sets of functions that do not exhibit the curse of dimensionality are more and more constrained (as the requirements on the degree of their smoothness increase), in the case of feedforward networks such sets of $d$-variable functions can be embedded into the corresponding sets of $(d+1)$-variable functions. We derive properties of such sets and methods for estimating the norms that define them.

The paper is organized as follows. In section 2, concepts and notation concerning fixed and variable-basis approximation (which
includes feedforward neural networks) are introduced, while in section 3, upper bounds on rates of these two types of approximation are stated in terms of norms of the functions to be approximated: Sobolev norms and variation with respect to a set of functions. In section 4, the concept of variation is illustrated by examples of real-valued Boolean functions with "small" variation with respect to perceptron networks. In sections 5 and 6, methods for estimating variation from above and from below are presented. Section 7 is a brief discussion.
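To make the linear-approximation rate $O(n^{-s/d})$ quoted above concrete, here is a small worked instance (the numbers are chosen for illustration and are not taken from the text):
\[
\text{error} = O(n^{-s/d}) \quad\Longrightarrow\quad n \approx \varepsilon^{-d/s} \ \text{parameters for accuracy } \varepsilon .
\]
\[
s = 1,\ \varepsilon = 0.1:\qquad d = 2 \;\Rightarrow\; n \approx 10^{2}, \qquad d = 10 \;\Rightarrow\; n \approx 10^{10} .
\]
If, however, the smoothness grows with the dimension ($s = d$), then $n \approx \varepsilon^{-1}$ for every $d$, which is the escape route described above.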
2
Fixed and variable-basis approximation
Tasks representable as mappings can be performed by devices computing functions depending on two vector variables: an input vector (corresponding to a coded pattern to be recognized or transformed) and a parameter vector (to be adjusted during the learning mode). Due to error-correcting after-processing (such as best guess [33]), it is sufficient that such devices compute mappings that perform the tasks only approximately.

Classical linear approximation theory has explored properties of parametrized sets formed by linear combinations of the first $n$ elements of a set of basis functions with a fixed ordering (e.g., polynomials ordered by degree or sines ordered by frequencies). Thus it can be called fixed-basis approximation. Connectionistic systems exploit parametrized families with a flexible choice of basis functions that belong to a nonlinear approximation scheme called variable-basis approximation. Formally, such a scheme is defined for any subset $G$ of a real linear space $X$ as the set of all linear combinations of at most $n$ elements of $G$, which is denoted $\mathrm{span}_n G$. With proper choices of $G$, it includes feedforward networks with a single linear output unit and any number of hidden layers, as well as other nonlinear families with free inner parameters such as free-nodes splines, rational functions with free poles, and trigonometric sums with free frequencies.

Since in applications all parameters are bounded, we shall also consider variable-basis approximation schemes with constraints on the coefficients of linear combinations of basis functions. If such a constraint is defined in terms of a norm on the space $\mathbb{R}^n$ of coefficient vectors, then the set of functions computable by such a scheme is, for a proper scalar $c$, contained in the set of all convex combinations of at most $n$ elements of the scaled set $G(c) = \{wg : g \in G, |w| \le c\}$, i.e., in $\mathrm{conv}_n G(c)$. This can easily be checked for a constraint defined in terms of the $\ell_1$-norm on $\mathbb{R}^n$, and as all norms on $\mathbb{R}^n$ are equivalent, it can be further extended to a general norm.
A computational unit can be formally described as a function $\phi : A \times K \to \mathbb{R}$, where $A \subseteq \mathbb{R}^q$ is the set of "inner" parameters of the unit, while $K \subseteq \mathbb{R}^d$ is the set of its inputs (we shall restrict our considerations to inputs in either the Euclidean cube $[0,1]^d$ or the Boolean one $\{0,1\}^d$). Denote by $G_\phi(K, A) = \{\phi(a, \cdot) : K \to \mathbb{R} : a \in A\}$ the set of functions on $K$ computable by a computational unit $\phi$ with all possible choices of parameters $a \in A$. For $A = \mathbb{R}^q$, we shall write only $G_\phi(K)$, while for $A = \mathbb{R}^q$ and $K = [0,1]^d$, we shall write only $G_\phi$. The set of functions computable by a neural network with $n$ hidden units computing $\phi$ is $\mathrm{span}_n G_\phi$ if all coefficients in linear combinations are allowed (i.e., output weights are unconstrained), or it is a subset of $\mathrm{conv}_n G_\phi(c)$ if a constraint in the form of a norm on output weights is imposed.

Standard types of computational units used in neurocomputing are perceptrons and RBF units, but this formalism also includes other types of parametrized mappings, like trigonometric functions with frequencies playing the role of parameters. A perceptron with an activation function $\psi : \mathbb{R} \to \mathbb{R}$ computes $\phi((v, b), x) = \psi(v \cdot x + b) : A \times K \to \mathbb{R}$, and an RBF unit with a radial function $\psi : \mathbb{R}_+ \to \mathbb{R}$ ($\mathbb{R}_+$ denotes the set of positive reals) computes $\phi((v, b), x) = \psi(b\,\|x - v\|) : A \times K \to \mathbb{R}$. Standard types of activation functions are sigmoidal functions, i.e., nondecreasing functions $\sigma : \mathbb{R} \to \mathbb{R}$ with $\lim_{t \to -\infty} \sigma(t) = 0$ and $\lim_{t \to +\infty} \sigma(t) = 1$. The Heaviside activation function $\vartheta$ is defined as $\vartheta(t) = 0$ for $t < 0$ and $\vartheta(t) = 1$ for $t \ge 0$. A standard radial function is the Gaussian function $\gamma(t) = e^{-t^2}$.

We shall denote by $P_d(\psi)$ the set of functions computable by perceptrons with activation $\psi$, i.e., $P_d(\psi) = G_\phi$ for $\phi((v, b), x) = \psi(v \cdot x + b) : \mathbb{R}^{d+1} \times [0,1]^d \to \mathbb{R}$. Similarly, $F_d(\psi, \|\cdot\|)$ denotes the set of functions computable by RBF units with radial function $\psi$, i.e., $F_d(\psi) = G_\phi$ for $\phi((v, b), x) = \psi(b\,\|x - v\|) : \mathbb{R}^{d+1} \times [0,1]^d \to \mathbb{R}$. Since the set of functions computable by perceptrons with the Heaviside activation, $P_d(\vartheta)$, is equal to the set of characteristic functions of closed half-spaces of $\mathbb{R}^d$ restricted to $[0,1]^d$, we shall denote it by $H_d$.
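A short sketch of the two standard units just defined, with concrete parameter values chosen purely for illustration:

import numpy as np

def perceptron(v, b, x, psi=lambda t: 1.0 / (1.0 + np.exp(-t))):
    """phi((v,b), x) = psi(v . x + b), here with a logistic sigmoid psi."""
    return psi(np.dot(v, x) + b)

def rbf_unit(v, b, x, psi=lambda t: np.exp(-t ** 2)):
    """phi((v,b), x) = psi(b * ||x - v||), here with the Gaussian psi."""
    return psi(b * np.linalg.norm(x - v))

x = np.array([0.2, 0.7])                          # input in [0,1]^2
print(perceptron(np.array([1.0, -2.0]), 0.5, x))  # half-space-like response
print(rbf_unit(np.array([0.5, 0.5]), 3.0, x))     # localized response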
3
Upper bounds on rates of approximation of multivariable functions
For many types of computational units $\phi$, the sets $\bigcup_{n \in \mathbb{N}_+} \mathrm{span}_n G_\phi$ are dense in the space of all continuous functions $\mathcal{C}([0,1]^d)$ with the supremum norm or in $\mathcal{L}_p([0,1]^d)$ with $\mathcal{L}_p$-norms, $p \in [1, \infty)$ (see, e.g., [19, 30]). Although density guarantees arbitrarily close approximation of all functions from $\mathcal{C}([0,1]^d)$ or $\mathcal{L}_p([0,1]^d)$, resp., for practical purposes its implications are limited to functions for which a sufficient accuracy can be achieved by $\mathrm{span}_n G_\phi$ with $n$ small enough to allow implementation. Thus it is useful to study the speed of decrease of errors in approximation by $\mathrm{span}_n G$ and $\mathrm{conv}_n G$ with $n$ increasing.

Maurey [31], Jones [12] and Barron derived an upper bound on approximation by $\mathrm{conv}_n G$ that allows one to describe conditions on $d$-variable functions that guarantee approximation without the curse of dimensionality. For a subset $M$ of a normed linear space $(X, \|\cdot\|)$, $\mathrm{cl}\,M$ denotes the closure of $M$ in the norm-induced topology; for $f \in X$, the distance of $f$ from $M$ is denoted by $\|f - M\|$; $\mathrm{conv}\,G = \bigcup_{n \in \mathbb{N}_+} \mathrm{conv}_n G$ and $\mathrm{span}\,G = \bigcup_{n \in \mathbb{N}_+} \mathrm{span}_n G$, where $\mathbb{N}_+$ denotes the set of positive integers. The following theorem presents a slightly modified version of the Maurey-Jones-Barron upper bound.

Theorem 3.1 Let $G$ be a bounded subset of a Hilbert space $(X, \|\cdot\|)$ and $s_G = \sup_{g \in G} \|g\|$. Then for every $f \in \mathrm{cl\,conv}\,G$ and for every positive integer $n$,
\[
\|f - \mathrm{conv}_n G\| \le \sqrt{\frac{s_G^2 - \|f\|^2}{n}} \, .
\]

For simplicity, we present here upper bounds on variable-basis approximation only for Hilbert spaces; however, upper bounds similar to Theorem 3.1 have been derived for $\mathcal{L}_p$-spaces, $p \in (1, \infty)$, and in [2, 9, 10, 28, 24] for spaces of continuous functions with the supremum norm. For certain sets $G$, tight improvements of the Maurey-Jones-Barron bound have been obtained (e.g., for $G$ orthonormal in [24, 21] and for $G = P_d(\sigma)$ in [2, 28, 23]).

Since $\mathrm{conv}_n G \subseteq \mathrm{span}_n G$, the upper bound from Theorem 3.1 also applies to rates of approximation by $\mathrm{span}_n G$. Replacing the set $G$ by $G(c) = \{wg : g \in G, |w| \le c\}$, we can apply Theorem 3.1 to all elements of $\bigcup_{c \in \mathbb{R}_+} \mathrm{cl\,conv}\,G(c)$ (as $\mathrm{conv}_n G(c) \subseteq \mathrm{span}_n G(c) = \mathrm{span}_n G$ for any $c \in \mathbb{R}$). This approach can be mathematically formulated in terms of a norm tailored to a set $G$. Let $(X, \|\cdot\|)$ be a normed linear space and $G$ be its subset; then $G$-variation (variation with respect to $G$) is defined as the Minkowski functional of the set $\mathrm{cl\,conv}(G \cup -G)$, i.e.,
l l f l l ~ = inf{c > 0 : f / c
E
cZconv(G U -G)}.
X . It G-variation is a norm on the subspace {f E X : l l f l l ~ < I X I } was introduced in as an extension of the concept of variution with respect to half-spaces (Hd-variation) introduced in '. For functions of one variable, variation with respect to half-spaces coincides, up to a constant, with the notion of total variation studied in the integration theory '. Moreover in Lpspaces, Hd-variation is equal to Pd(cT)-variation for any sigmoidal activation function CJ 20. Thus t o derive consequences of Maurey- Jones-Barron's theorem for sigmoidal perceptron networks, it is sufficient to study properties of H d variation.
36
V. KGrkovd
When a set of functions has to be approximated, it is useful to estimate worst-case error which is mathematically formalized by the concept of deviation 6 ( B , Y ) = supfEBIlf - YI(. The following theorem from l8 is a corollary of Theorem 3.1 formulated in terms of G-variation. If ( X , 11.11) is a normed linear space and b > 0 , then ~ b ( ~ ~ s.b (~l l .~l l ) , denote the ball, sphere, resp., of radius b centered at zero, i.e., Bb(ll.11) = {f E X : l l f l l 5 b} and Sb(ll.ll) = {f E : llfll = b}. Theorem 3.2 Let G be a bounded subset of a Hilbert space ( X , 11.11), SG = s u p g E1~ 1g11, b > 0, 0 5 T I SGb and n be a positive integer. Then
x
(i) f o r every f E X ,
[If
- span,Gll
5
d
m
;
%.
(iai) h(Bb(II.IIG),sWnnG) I
This upper bound shows how to cope with the curse of dimensionality in variable-basis approximation. If {Gd C & ( [ O , lid) : d E N+} is a sequence of sets of functions of d variables, we can keep rate of approximation by spanned within the order of O ( l / f i ) for all d by restricting approximation to functions from balls in Gd-variation. Thus Gd-variation plays a similar role in variable-basis approximation as the Sobolev norm of the order d does in linear approximation. Recall that for R an open subset of Rd,p E [l,oo), the Sobolev space Ws,p(R) is the subspace of L,(R) formed by all functions f : R + R such that for every multi-index a = ( a ~. . .,, a d ) satisfying 0 5 la1 5 s, oafE L,(O), where Da = DF1 . . .D:d is a partial derivative in weak (distributional) sense (see
1. The norm of f E ws,p(R) is defined as llfllf,p= (Coslalss llDafIl;)l’p. The worst-case error in approximation of the unit ball in WS,,(R) with [0, lId C R by an n-dimensional linear subspace X , of Lp(f2)is of the order of O(n-’/‘) 29. More precisely, infx, ~(BI(II.II~,~),X,)O(n-s ld ),where the infimum is taken over all n-dimensional subspaces of L,(R). So if we are free to choose for each n an “optimal” n-dimensional subspace, then we can approximate all functions from Sobolev balls of the order s 2 d within the accuracy O(n-l). So in linear approximation, one can achieve rates of the order of O(n-l) by restricting approximation of functions of d variables only to functions with bounded by a fixed constant. With d increasing, balls Sobolev norms Il.l :,p in these norms are “shrinking” as they are defined by bounds on the sum of L,-norms of all iterated partial derivatives up to the order d. Note that some ~ ~ ~be, extended p to functions in the unit ball in the Sobolev norm ~ ~ . cannot
-
Mappings between High-dimensional Representations
37
+
functions of d 1 variables from the unit ball in the Sobolev norm ~ ~ In contrast to balls in Sobolev norms of degree d, balls in Gd-variation need not to be shrinking with d increasing. Proposition 3.3 L e t d be a positive integer, Gd and Gd+l be bounded subsets of L2([0,lId) and L2([0,lId+’, resp., such t h a t f o r ever9 g € Gd, g : [O, lId+l + k ‘! defined as g(x1,. ..,q + 1 ) = g(x1,. . .,a) belongs t o Gd+l. T h e n there exists a n embedding Vd : B~(ll.ll~,) 4& ( ~ ~ . ~ ~ G d + l ) . Proof. Define Ud on &(ll.ll~,> = cZmv(Gd U Gd) as Ud(g) = g, where g(x1,... ,xd+l) = g(xl,. . . ,xd). As ud on Gd is an embedding, it is also an 0 embedding on cl wnv(Gd U -Gd).
It is easy to check that balls in variation with respect to perceptrons (Pd(q)-variation) and RBF (Fd($)-variation) with any activation or radial function satisfy the assumptions of Proposition 3.3. So in neural network approximation, there exist families of sets of functions of d-variables approximable with rates (3(r1-’/~), which are not shrinking with d increasing. Sets of functions computable by networks with n sigmoidal perceptrons are much larger than a single n-dimensional subspace (they are formed by the union of all n-dimensional subspaces spanned by n-tuples of elements of G). However upper bounds on linear approximation cannot be automatically extended to such unions of n-dimensional subspaces as they may not contain the “optimal” n-dimensional subspaces for linear approximation of balls in Sobolev norms. Nevertheless, for a sufficiently large radius (depending on both d and s), a ball in Hd-variation contains the unit ball in the Sobolev norm of the order s. It was shown in that for a bounded open set R such that [0, lId c R and s > d / 2 1, Br(11.11$,2) is contained in B2rb,(II.JI~d(~) where $J
+
b.3 = ( J R I ( 1 + l l v l l 2(s-1))-1dy)1’2 2 (note that for 2(s - 1) > d , b, is finite). So for s > d/2 + 1, the unit balls in Sobolev norms 11. 1 1f, 2 are contained in balls in Hd-variation of the radius 2bs. The worst-case errors in approximation by n-dimensional linear subspaces of such Sobolev balls bounded from above by C3(n-(1/2+1/d)),while worst-case errors in approximation by neural networks with n sigmoidal perceptrons are by the improvement of MaureyJones-Barron’s bound from 28 bounded from above by b,n-(1/2f1/d). So both upper bounds imply the estimates of rates of approximation of the same order. However, there exist sets of functions with considerably faster rates of approximation by neural networks than rates achievable by linear approximation methods (see 3 , 22). Also properties of projection operators in linear and neural network approximation are quite different - due to geometrical prop-
38
V. K I ~ T ~ O V ~
erties of sets span,G (nonconvexity) they cannot be continuous (see 17).
4
14, 15,
Rates of approximation of real-valued Boolean functions
One of the simplest cases where Maurey-Jones-Barron's theorem gives description of sets of functions that can be approximated by neural networks without curse of dimensionality is space of real-valued Boolean function. This space, denoted by B((0, l}d)and endowed with the inner product defined for f , g E fi({o, as f - g = f(x)g(x), is afinite-dimensional Hilbert space isomorphic t o the 2d-dimensional Euclidean space R2dwith the Z2-norm. Let H d denotes the set of functions on {O,l}d computable by signum -+ R : f(x) = sgn(v. x b),v E R ~b E, perceptrons, i.e., H d = {f : (0, R}. From technical reasons we consider perceptrons with the signum (bipolar) activation function, defined as sgn(t) = -1 for t < 0 and sgn(t) = 1 for t 2 0, instead of more common Heaviside function that assigns zero to negative numbers. For an orthonormal basis G, G-variation can be more easily estimated because it is equivalent to Z1-norm with respect to G (see 22). Expressing elements of such a basis as a linear combination of signum perceptrons, we can obtain description of subsets of B({O,l}d) which can be approximated by preceptron networks without curse of dimensionality. We can use two orthonormal bases. The first one is the Euclidean orthonormal basis defined as E d = {e, : u E { O , l } d } , where e,(u) = 1 and for every x E { o , l } d with x # u, e,(x) = 0. The second one is the Fourier orthonormal basis (see, e.g., 3 5 ) defined as F d = {&(-l)u'z : u E {O,l}d}. Every
xzE(O,lld
+
&
f E fi({0,1)~) can be represented as f(x) = CuE{O,lld f(u)(-~)".~, where f ( u ) = CzE(O,lld f(z)(-l)"'". The Zl-norm with respect to the Fourier basis, llflll,Fd= Ilf^llll = CuE{O,lld lf(u)l, called the spectra2 norm, is equal to Fd-variation ( 21, 24). For a subset I C (0, l}d,I-parity is defined by p l ( u ) = 1 if &I u;is odd, and p I ( u ) = 0 otherwise. If we interpret the output 1 as -1 and 0 as 1, then the elements of the Fourier basis F d correspond to the generalized parity functions. As Fd and Ed are orthonormal, for all f E a({(), I l f l I 1 , E d = llfllEd and Ilf^lll = I!flll,Fd = IlfllFd. It is easy t o see that spun,Fd c qmndn+lHd (as f,(z) = ( - 1 ) ~ - x= + C,d,l(-l)jsgn(u . x - j + f ) and that span,+lEd C spannHd (as e,(x) = Ygn(v';+b)+l for appropriate 2) and b). Thus by Maurey-Jones-
&
Mappings between High-dimensional Representations
39
JT. LLfast17 rates of approximation are guaranteed for functions SO
with either L‘small”variation with respect to signum perceptrons, “small” spectral norm, or “small” norm with respect to the Euclidean basis More interesting classes of functions that can be well approximated by a “moderate” number of Boolean signum perceptrons, are functions with only a “small” number of nonzero Fourier coefficients. The following estimate can be obtained from the embedding of span,Fd into spn,d+lHd combined with the Cauchy-Schwartz inequality (see 2 4 ) . Proposition 4.1 Let d , n, and m be positive integers, m 5 2 d , c > 0 and f E B ( ( 0 ,l}d)be a function with at most m Fourier coeficients nonzero and with l l f l l 5 c. Then [If - spand,+1 H d l l 5 Another example of functions that can be efficiently approximated by perceptron networks are functions representable by “small” decision trees (such trees play an important role in machine learning, see, e.g., A decision tree is a binary tree with labeled nodes and edges. The size of a decision tree is the number of its leaves. A function f : ( O , l } d + R is representable by a decision tree if there exists a tree with internal nodes labeled by variables X I , . . . ,X d , all pairs of edges outgoing from a node labeled by 0s and Is, and all leaves labeled by real numbers, such that f can be computed by this tree as follows. The computation starts at the root and after reaching an internal node labeled by xi, continues along the edge whose label coincides with the actual value of the variable xi; finally a leaf is reached and its label is equal to f ( x 1 , . . . , xd). The following proposition is a corollary of a results from 24. Proposition 4.2 Let d, s be positive integers, b 2 0 , f E B ( ( 0 ,l } d ) be representable b y a decision tree of size s such that for all x E ( 0 , l } d , f ( 2 ) # 0
gE.
and 5
max,~{o,l]d
If(l)l
min,~{a,i}d
if(x)l
5 b. Then 1l.f - spandn+l Bdll 5
&.
Upper bounds on variation
Besides of being a generalization of the concept of total variation, G-variation is also an extension of the notion of Z1-norm: for G a countable orthonormal basis of x,l l f l l ~ = IlfllG,l = lf.gl 2 2 . Thus for G countable orthonormal, the unit ball in G-variation is equal to the unit ball in Z1-norm. Even for some nonorthonormal G, properties of balls in G-variation can be investigated using images of balls in Zl or Ll-norms under certain linear operators. The following proposition shows that when G is finite, balls in G-variation are images of balls in ll(R”),where m = cardG. By 11.111 is denoted El-norm.
xgEG
40
V. Kfirkovd
Proposition 5.1 Let G = (91,.. . ,gm} be a subset of a normed linear space ( X ,11.11) and T : R” + X be a linear operator defined for every w = (w1,...,w,) E R” as T(w)= wigi. T h e n for every T > 0 , B~(1 1. IIG ) = T ( B(1 ~I -1 111) . Proof. By Proposition 2.3 from 24, l l f l l ~ = min{llwlll : w E Rm,f= T(w)}.Thus T(BT(ll.lll))c B,.(ll.lla). To show the opposite inclusion, consider f E BT(ll.ll~). By the definition of G-variation, f = limj+,oofj, where fj = ELl wjigi and llwjlll = lwjil 5 T . Hence there exist w = ( ~ 1 ,... ,w,) E R” such that for all i = 1,... ,m, the sequences {wji : j E N+}converge to wi subsequentially. So the sequence {fj : j E N+} converges subsequentially to f’ = wigi. Since every norm-induced topol0 ogy is Hausdorff, f = f’ and so f = T(w).
ELl
ELl
Even when G is infinite, in some cases G-variation can be estimated in terms of L1-norm. Let q5 E L2(A x K ) , where A C Rq, K c Rd and w E & ( A ) , then Tq defined as Tq(w)= JAw(a)q5(a,x)da is a bounded linear operator Tq : &(A) + L 2 ( K ) (see ). When A, K are compact, Tq is a compact operator and when q5 E C(A x K ) , then Tq : C(A)+ C ( K ) . Intuitively, any function in T+(&(A))can be represented as a “neural network with a continuum of hidden units computing $”, w ( a ) 4 ( a x)da. , Gq-variation of such functions can be estimated from above by ,C1-norm of the “output weight function” w. Theorem 5.2 Let d , m be positive integers, A C R”, K C Rd be compact, q5 E & ( A x K ) and f E & ( K ) be such that f = Tq(w)for some w E & ( A ) . Then IlfllG, 5 IIwIIL1(A) = JA Iw(a)Ida. The proof of this theorem is a modification of an argument used in 2o t o derive a similar result for C(K). Integral representations of the form of a “neural network with a continuum of hidden units” has been originally studied as a tool for derivation of the universal approximation property of various types of neural networks (see, e.g., 5, ll). Fourier representation was combined with Maurey- Jones-Barron’s bound in to derive an upper bound on approximation by sigmoidal perceptrons via estimates for cosine activation. For functions with compactly supported Fourier transforms and continuous partial derivatives of the order s > d / 2 , an integral representation of the form of a neural network with Gaussian RBF units was derived in a. The following theorem from l6 (see also 2 0 ) gives an integral representation of the form of Heaviside perceptron network. For e E Sd-l and b E R,we denote He,b = { x E Rd : e . x b = 0 ) . The half-spaces bounded by this hyperplane are denoted H:6 = { x E Rd : e . x b 2 0 ) and HeTb= {x E Rd :
sA
+
+
Mappings between High-dimensional Representations
e . z + b < 0 ) . By A is denoted the Laplacian operator A ( h ) =
d
41
a2h
w.
Theorem 5.3 Let d be a positive integer and let f : Rd + R be compactly supported and d 2-times continuously diflerentiable. T h e n
+
w f ( e ,b)19(e . z
+ b)dedb,
d-lxR
where ford odd, w f ( e ,b) = a d
sH-Akdf ( y ) d y , e,b
kd
independent of f , while f o r d even, w f ( e ,b) = a d
+
t for t where q ( t ) = -tlogltl constant independent of f . 6
y,and
=
sH-
A k d
e.b
# 0 and ~ ( 0=) 0,
kd
=
ad
is a constant
f ( y ) V ( e .y
+ b)dy,
F,and a d i s a
Lower bounds on variation
The following theorem from 24 gives a geometric characterization of Gvariation in a Hilbert space. By G I is denoted the orthogonal complement of G, i.e., G’- = {f E X : (Vg E G)(f - 9= 0 ) ) . Theorem 6.1 Let ( X , 11-11) be a Hilbert space, G be its bounded non-empty subset and f E X be such that l l f l l ~ < 00. T h e n Ilf IIG = S U P h ~ ~ - G L .h ig.hl. Theorem 6.1 implies a lower bound on G-variation of the form
sup,Efc
So functions that have small inner products with all elements of G (are “almost orthogonal” t o G) have large G-variation. This bound was used in 24 to show that certain Boolean functions have &-variation at least of the order of 0 (2 d /6 ). The proof was based on properties of rather special Boolean functions called bent functions. They can be extended t o functions defined on [0,lld with Hd-variation of the same order. It was shown in 28 that for G = H d , Maurey-Jones-Barron’s upper Ud- H d ) ) 5 r / f i , can be slightly improved to bound, ~ ( B T ( ( ( . ( ( H d ) , C 0 7 1 2 ) 1 2 ( H a tight bound of the form r/n1/2+1/d(see 23 for extension t o other sets G). So in the ball BT(ll.II~d), where T is of the order of 0(2d/6), there must be a function, for which the worst-case error r/n1/2+1/din approximation by cOnV,(Hd(r) U - - H ~ ( T ) ) is achieved. Such function cannot be approximated by perceptron networks efficiently. Another method of demonstrating existence of functions with large variation is based on comparison of cardinality of G with certain covering numbers.
42
V. KZirkova’
Define a binary relation p on the unit ball Sl(ll.ll) of a Hilbert space ( X , 11-11) by p ( f , g) = arccos I f .g[. It is easy to see that p is a pseudometrics measuring distance as the minimum of the two angles between f and g and between f and -g (it is a pseudometrics as the distance of antipodal vectors is zero). For a subset G of S1(11.11) define extent of G as aG
=inf{a E [ o , r / 2 ]: (vf E s1(ll-11))(3g E G ) ( p ( f , g )
5 a)]-
Note that QG is the infimum of all a for which G is an €-net in &(lI.\l), i.e., balls of radius E centered at elements of G cover S1(11.11). For G compact, we can replace inf in the definition of GIG by min. This is the case of Hd which is compact in LP([O,lid) for any p E [l,00) and any positive integer d lo. When QG is small, G is “almost dense” in S1(11.11), while when a~ is large, G is “sparse” in S l ( ~ ~If. ~ QIG ~is) .close to f , then there exists an element in Sl(ll.ll)which has a large “angular distance” from all elements of G (is almost orthogonal t o G) and hence it has a large G-variation. Metric entropy studies “size” of sets in Banach spaces in terms of covering numbers. The size of the smallest a-net in &(((.I\) is the covering number cOv,S1(((.I() (see 25). When for some a close to ~ / 2 the , cardinality of G is smaller than a-covering number of S1(11.11), then there exists a function in S 1(1 1. I I) with “large” G-variation. Proposition 6.2 Let G be a subset of the unit sphere S1(ll.ll) in a Hilbert space ( X , 11. 1 1) and a E [ O , T / ~ ] be such that cardG < c ~ v ~ S ~ Then ( ~ ~there . ~ exists f E Sl(lI.[l)for which llfllG 2 l l c o s a . Proof. If cardG < cosa, then LYG 5 a. So there exists f E Sl(ll.ll) such that for all g E G, p( f lg ) 2 a and hence 1 f . g1 5 C O S ~ .So by Theorem 6.1, l\fllG 2 l/cOsa0 Covering numbers of S”-l with the angular pseudometrics p grow for fixed a asymptotically exponentially with the dimension m 13. So if a subset G of the space of real-valued Boolean functions has cardinality that depends on 2d only polynomially, then there exists a function with “large” G-variation.
7 Discussion We have shown that it is possible to cope with the curse of dimensionality when approximation is restricted to functions with certain norms bounded by a fixed constant independent of the number of variables. For linear approximation, such norms are Sobolev norms of order increasing with the dimension while for neural networks, such norms are variations with respect to computational units. For sigmoidal perceptrons they are all equal to variation with
Mappings between High-dimensional Representations 43
respect to half-spaces and for RBF units, they are variations with respect to spherical waves of the shape corresponding to radial functions. However, examples of functions with variations depending exponentially on the number of variables show that interpretations of Maurey-JonesBarron’s theorem as proof of “dimension-independent” approximation capabilities of neural networks are misleading. It is not possible to approximate all &variable functions by neural networks with rates of the order C J ( T X - ~ To achieve such rates, approximation has to be restricted to functions from balls in variation with respect to the type of computational units. We have described methods of estimation of such variations. Upper bounds can be derived from integral representations of the form of a neural network with a (‘continuum of hidden units” of a given type and lower bounds from geometrical considerations (angular relationships to hidden unit functions). Existence and properties of the norms described in this paper partially explain efficiency of neural networks in performing high-dimensional tasks.
Acknowledgement This work was partially supported by GA CR grant number 201/00/1489.
References 1. Adams, R. A. (1975). Sobolev Spaces, New York: Academic Press. 2. Barron, A. R. (1992). Neural net approximation. In Proceedings of the 7th Yale Workshop on Adaptive and Learning Systems (pp. 69-72). 3. Barron, A. R. (1993). Universal approximation bounds for superposition of a sigmoidal function. IEEE Transactions on Information Theory, 39, 930-945. 4. Bellman, R. (1957). Dynamic Programming. Princeton: Princeton University Press. 5. Carroll, S. M. & Dickinson, B. W. (1989). Construction of neural nets using the Radon transform. In Proceedings of IJCNN’89 (pp. I. 607-611). New York: IEEE Press. 6. Darken, C., Donahue, M., Gurvits, L. & Sontag, E. (1993). Rate of approximation results motivated by robust neural network learning. In Proceedings of the 6th Annual ACM Conference on Computational Learning Theory (pp. 303-309). New York: ACM. 7. F’riedman, A. Foundations of Modern Analysis. New York: Dover, 1992. 8. Girosi, F., & Anzellotti, G. (1993). Rates of convergence for radial basis function and neural networks. In Artificial Neural Networks for Speech and Vision (pp.97-113). London: Chapman & Hall.
44
V. Kfirkovd
9. Girosi, F. (1995). Approximation error bounds that use VC-bounds. In Proceedings of ICANN’95 (pp. 295-302). Paris: EC2 & Cie. 10. Gurvits, L. & Koiran, P. (1997). Approximation and learning of convex superpositions. Journal of Computer and System Sciences, 55, 161-170. 11. Ito, Y. (1991). Representations of functions by superpositions of a step or sigmoid function and their applications to neural network theory. Neural Networks, 4, 385-394. 12. Jones, L. K. (1992). A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Annals of Statistics, 20, 608-613. 13. Kainen, P. C., KdrkovB, V. (1993). Quasiorthogonal dimension of Euclidean spaces. Applied Math. Letters, 6,7-10. 14. Kainen, P. C., KdrkovB, V. & Vogt, A. (1999). Approximation by neural networks is not continuous. Neurocomputing, 29, 47-56. 15. Kainen, P. C., Kdrkovb, V. & Vogt, A. (2000). Best approximation by Heaviside perceptron networks. Neural Networks, 13, 645-647. 16. Kainen, P. C., Kirkovi, V. & Vogt, A. (2000). An integral formula for Heaviside neural networks. Neural Network World, 10, 313-319. 17. Kainen, P. C., Kirkovb, V. & Vogt, A. (2001). Continuity of approximation by neural networks in &-spaces. Annals of Operational Research, 101, 143-147. 18. KdrkovB, V. (1997). Dimension-independent rates of approximation by neural networks. In Computer-Intensive Methods i n Control and Signal Processing: Curse of Dimensionality (Eds. Warwick, K., KArnjr, M.) (pp. 261-270). Boston: Birkhauser. 19. Kirkovb, V. (2002). Universality and complexity of approximation of multivariable functions by feedforward networks. In Softcomputing in Industrial Applications - Recent Advances (Eds. R. Roy, M. Koeppen, S. Ovaska, T. F’uruhashi, F. Hoffmann). London: Springer-Verlag (to appear). 20. KdrkovB, V., Kainen, P. C. & Kreinovich, V. (1997). Estimates of the number of hidden units and variation with respect to half-spaces. Neural Networks, 10, 1061-1068. 21. Kfirkovb, V. & Sanquinetti, M. (2001). Bounds on rates of variable-basis and neural network approximation. IEEE Trans. on Information Theory, 47, 26592665. 22. KilrkovB, V. & Sanquineti, M. (2002). Comparison of worst-case errors in linear and neural network approximation. IEEE Trans. on Information Theory, 48, 264-275. 23. Kdrkovb, V. & Sanquineti, M. (2002). Tight bounds on rates of variable-basis approximation via estimates of covering numbers. Research Report ICS-02-865. 24. KdrkovB, V., Savickjr, P. & HlavBEkovb, K. (1998). Representations and rates of approximation of real-valued Boolean functions by neural networks. Neural Networks, 11, 651-659. 25. Kolmogorov, L.N. (1956). Asymptotic characteristics of some completely
Mappings between High-dimensional Representations 45
bounded metric spaces. Dokl. Akad. Nauk, 108,385-389. 26. Kushilewicz, E. & Mansour, Y. (1993). Learning decision trees using the Fourier spectrum. S I A M J . o n Computing, 22, 1331-1348. 27. Milman, V. D., Schechtman, G. (1986). Asymptotic Theory of Finite Dirnensional Normed Spaces. Berlin: Springer, 1986. 28. Makovoz., Y. (1996). Random approximants and neural networks. J . of A p proximation Theory, 8 5 , pp. 98-109. 29. Pinkus, A. (1986). n - W i d t h in Approximation Theory. Berlin: Springer. 30. Pinkus, A. (1998). Approximation theory of the MLP model in neural networks. Acta Numerica, 8 , 277-283. 31. Pisier, G. (1981). Remarques sur un resultat non publiC de B. Maurey. In Seminaire d’AnaZyse Fonctionelle I., n.12. 32. Rudin, W. (1973). Functional Analysis. New York: McGraw-Hill. 33. Sejnowski, T. J. & Rosenberg, C. (1987). Parallel networks that learn t o pronounce English text. Complex Systems 1, 145-168. 34. Sejnowski, T. J. & Yuhas, B. P. (1991). Mappings between high-dimensional representations of acoustic and visual speech signals. In Computation and Cognition (pp.52-68). Philadelphia: Siam. 35. Weaver, H. J. (1983). Applications of Discrete and Continuous Fourier Analysis
This page intentionally left blank
THE STIMULATING ROLE OF FUZZY SET THEORY IN MATHEMATICS AND ITS APPLICATIONS ETIENNE E. KERRE - DIETRICH VAN DER WEKEN Fuzziness and Uncertainty Modelling Research Unit Department of Applied Mathematics and Computer Science Ghent University, Krijgslaan 281 (Building S9), 9000 Gent, Belgium E-mail: etienne.kerre, dietrich.vanderweken@rug. ac. be In order to cope with the incapability of classical yes-or-no mathematics to capture with incomplete information, several mathematical models to represent and to process imprecise and uncertain information were introduced. We mention explicitly : fuzzy set theory, flou set theory, L-fuzzy set theory, rough set theory, intuitionistic fuzzy set theory, twofold fuzzy sets, ... As fuzzy set theory being the most developed one, we will focus on the enrichment of the existing mathematical structures and their domains of applications by the introduction of fuzzy set theory. More precisely we will give some examples of the basic and applied research performed at the Fuzziness & Uncertainty Modelling research unit of Ghent University.
Keywords: f u z z y set theory, fuzzification, fuzzy topology, fuzzy image processing. 1
Introduction
Nowadays the scientific community widely recognizes the need for appropriate mathematical models to represent and to process imprecise and uncertain information. At the same time one agrees about the incapability of classical yes-or-no mathematics to capture with incomplete information. During the past three decennia many new theories have been initiated and developed to model the colourful real world and extend in this way the black-or-white model of classical mathematics. We mention explicitly : fuzzy set theory, flou set theory, L-fuzzy set theory, rough set theory, intuitionistic fuzzy set theory, twofold fuzzy sets, .... Undoubtly fuzzy set theory being the most developed one. Let us note that this model is all but fuzzy : a better name should have been : theory of fuzziness, mathematics of fuzziness, logic of fuzziness. As nowadays science in general and mathematics in particular heavily are based on Cantors set theory, one may expect that the enlargement of the basis by means of fuzzy set theory, will remarkably enrich the exciting mathematical structures and their domains of application. Indeed it is hard to find a domain of mathematics (pure and applied) that has not been infiltrated by fuzzifications. There exists already an extensive literature on fuzzy topology,
47
48
E. E. Kerre and D. van der Weken
fuzzy algebraic structures, fuzzy measure theory, fuzzy relational calculus, fuzzy reliability theory, fuzzy relational databases, fuzzy control, fuzzy decision making. For more than a quarter century basic and applied research has been performed at the Fuzziness & Uncertainty Modelling research unit of the Ghent University, resulting in about 250 scientific papers. In this contribution we want to show the enrichment of mathematics by the introduction of fuzzy set theory. We will consider three stages in the process of fuzzification. The first stage has been mainly developed during the seventies and consists of a moreor-less straightforward fuzzification of the classical mathematical structures. The second stage, mainly developed during the eighties reveals an explosion of the possible generalizations of the classical structures. The current third stage started in the nineties consists on the one hand of modes of standardization, axiomatizations and characterizations and on the other hand of some successful applications of fuzzy set theory. We will provide some examples from our own research group in each of these stages. 2
The stage of straightforward fuzzification
Fuzzy sets have been introduced in 1965 in the most cited seminal paper of L. Zadeh '. Very soon after its introduction fuzzy sets started their infiltration in the domain of fuzzy topology by the paper of C.L. Chang '. A fuzzy topological space (XIT) consists of a universe X together with a subclass T of the fuzzy powerclass F ( X ) of X satisfying : (0.1)0 E
T
and X E
T;
+ O1 n O2 E T; ( 0 . 3 ) (b'j E J ) ( O j E T) + u Oj E T; (0.2)
01
E
T
and
0 2
ET
j€J
where J denotes an arbitrary index set. The elements of T are called 7-open fuzzy sets or simply open fuzzy sets. A fuzzy set A on X is called closed iff coA is open. The class of all closed fuzzy sets in ( X ,T) is denoted as T'. The interior of a fuzzy set A on X, denoted IntA, is defined as the greatest open fuzzy set contained in A , i.e
IntA = u(O10 E
T
and 0 C A } .
The closure of a fuzzy set A on X , denoted clA, is defined as the smallest
The Stimulating Role of Fuzzy Set Theory 49
closed fuzzy set which contains A , i.e.
cZA = n{FIF E
T’
and A C_ F } .
Straightforward verifications show that a fuzzy topology may be characterized in terms of closed fuzzy sets, in terms of interior and in terms of closure as in the crisp case. Problems arise when one tried to obtain a characterization of a fuzzy topology in terms of neighbourhood systems. In it has been shown that straightforward fuzzification of the classical axioms for neighbourhood systems applied to fuzzy singletons didn’t Iead to a characterization of a fuzzy topology. So let p be a fuzzy singleton on a universe X , i.e. a fuzzy set on X whose support is a crisp singleton in X . A fuzzy set A on X is called a neighbourhood of a fuzzy singleton p on X in a fuzzy topological space ( X ,T ) iff (30 E ~ ) ( Cp 0 C_ A ) . The class of all neighbourhoods of p , denoted U p ,satisfies the following fuzzy equivalents of the crisp neighbourhoods :
(N.1) UP # 0; (N.2) A1 E Up and A2 E U p + A1 n A2 E U p ; (N.3) A1 E Up and A1
C A2
* A2 E U p ;
(N.4) (YAl E Up)(3A2E Up)(A2C_ A1 and (Yq C &)(A1 E u,)). But conversely, starting with a system of neighbourhoods for each fuzzy singleton p on X satisfying (N.l)-(N.4) and defining a class T as T
=
(010 E
F(X)and
(Yp C 0)(0E U p ) }
does not necessarily lead to a fuzzy topology on X. The main reason for this deviation from classical topology is that only the implication
(3E
J)(P
G O j ) =+ P c
u
0j
j€J
holds, and no longer the reverse. Another example concerns the fuzzification of the notion of Cartesian product of a finite number of sets. Starting from A1 E F ( X ) and A2 E F(X), the Cartesian product A1 x A2 is defined as: Al
In
A2(27
y) = min(Al(z),A z ( y ) ) .
the following deviations from classical set theory have been outlined:
50
E. E. Kerre and D. v a n der W e k e n
Al x A2 = A2 x Al +? A1 = A2; (A1 x Az)\(AI x A3) $ AI x (A2\A3); A1 x A2 5 A’, x A$ +? A1 C A: and A2 C A;. for Al, A‘,, Az, A$,A3 fuzzy sets on X and \ the fuzzy generalization of the set substraction operation. So in the first stage direct or straightforward fuzzifications of many classical mathematical notions have been introduced and their deviations have been studied.
3
T h e stage of explosion of possible fuzzifications
Due to the introduction of the triangular norms and conorms from the domain of probalistic metric spaces ‘, the intersection and union of fuzzy sets could be extended to a 7-intersection and S-union, as for A E F(X), B E F(X), 7 a triangular norm and S a triangular conorm :
A nT B ( z ) = 7 ( A ( z )B(z)),b!’~: , EX A Us B ( z ) = S ( A ( z )B(z)),b”z , EX. During the eighties we have noticed a deep study of the alternative operations on a logical level (conjunction, disjunction, negation, implication) and on a set-theoretic level (intersection, union, complementation, inclusion). Many researchers among them Bellman, Zadeh, Giertz, Weber, Yager, Dubois, Prade, Klir, Zimmermann, Zysno, Bandler, Kohout have obtained interesting and deep results on these issues. Another characteristic of the second stage lies in the enrichment of the classical structures due to the non-equivalency of the possible generalizations. Let’s provide some examples. For Fl and Fz two crisp subsets of a universe X we have : F1
n Fz
= 0 @ FI
C COFZ,
but for fuzzy sets on X we only have: F1
n Fz
=0
=+ Fl
CCOF~
and hence not the reverse of this implication. It is well-known that many theorems in classical mathematics involve disjoint sets. So one can imagine that the disappearance of the above equivalency leads to different notions in fuzzy set theory. As an example we considered the concept of normality in a fuzzy topological space. A (crisp) topological space (X, T ) is called normal iff
( ~ ( F Fz) I , E T”)(Fi n Fz
=8
* (3(01,02) E T Z ) ( O ln
0 2 =
0 and FI C 0
1
and FZ 2
02),
T h e Stimulating Role of Fuzzy Set Theory 51
or equivalently, stating the famous Urysohn’s lemma, (YO E T ) ( V F ,F E T‘ and F
0)(3V X ) ( F C IntV and clV C 0), where again T’ denotes the class of all closed sets in ( X ,7’). In 1975 B. Hutton defined a normal fuzzy topological space ( X ,T ) as: (VO E r)(V’F,F E r’ and F
C 0)(3V E F ( X ) ) ( F
IntV and clV
C 0),
and hence a straightforward fuzzification of the Urysohn’s form of normality. We wondered why the usual form of crisp normality was not been taken for fuzzification. The answer lies in the choice of the definition of disjointness. So in we defined: (X,r ) is normal
’
0
(VJ(Fi,F2) E T ’ ~ ) ( FCI cop2 + ( 3 ( 0 1 , 0 2 )E T ~ ) ( CO coo2 ~ and F1 C ( X ,T ) is weakly normal
0 1
and F2
C02))
Q
( V ( F l ,F ~ E) T’W~ n F~ # 0 + ( ~ ( O I0 ,2 ) E r2)(OI C coo2 and
F1 C 0 1 and F 2 C 0 2 ) ) . We also have shown that our definition of normality completely coincides with the one of Hutton and that weakly normality is a weaker concept, i.e. the fuzzy Sierpinski space is weakly normal but not normal. In we introduced an even stronger concept of normality: ( X ,T ) is completely normal
Q
(V(A1,Az) E F ( X ) 2 ) ( ( A 1C co(clA2) and A2 C co(clA1) ( 3 ( 0 1 , 0 2 )E r2)(A1C 0 1 and A2 C 0 2 and 01 C ~ 0 0 2 ) ) .
We however found that the well-known Tietze characterization theorem no longer holds in fuzzy set theory, i.e., only the following implication holds: ( X ,r ) is completely normal + every fuzzy subspace of ( X ,T ) is normal. A second example of the huge amount of possible fuzzifications concerns the concept of neighbourhood in a fuzzy topological space. Here the basic notions are the fuzzification of the notions ”point” and ”membership relation”. Some authors have defined the concept neighbourhood for a crisp point in the underlying universe while authors used ”fuzzy points” and ”fuzzy singletons”. A fuzzy singleton in a universe X , denoted x, where 6 ~ ] 0 , 1and ] x E X is defined as: x, : x 4 [0,1] X
W
E
y
H
0,ifyfx.
E. E. Kerre and D. van der Weken
52
while a fuzzy point, also denoted x , , where E E]O,l[ and x E X is defined in the same way. The different membership relations are defined for A E F ( X ) and: - for fuzzy singletons: x , C A H E 5 A(%) x,qA H ~ ( z C , COA)H
+
A ( x ) E > 1. The first relation is Zadeh’s inclusion relation g , while the second one is P u and Liu’s quasi-coincidence relation q .
- for fuzzy points:
x, 5A H
E
< A(%).
Note that fuzzy singletons and fuzzy points reveal a complementary attitude towards Zadeh’s fuzzy set theoretic operations. Indeed, let (Aj)jEJ be a family of fuzzy sets on X , s a fuzzy singleton in X and p a fuzzy point in X . Then we obtain 0
for the Zadeh intersection:
but the converse only holds for finite J .
but the converse only holds for finite J . for the Zadeh union: (3j E
J)(s
Aj) + s
C
UAj, j€J
but the converse only for finite J .
The Stimulating Role of Fuzzy Set Theory 53
Based on these notions several neighbourhood concepts have been introduced. Suppose ( X ,T ) being a fuzzy topological space, IC E X , s a fuzzy singleton in X and p a fuzzy point in x. Then we have: A is a Ludescher lo neighbourhood of x ( 3 0 E T ) ( I CE supp 0 and 0 E A ) .
A is a Kerre3 neighbourhood of s H (30 E r ) ( s C 0 C A). l1 neighbourhood of x 0 and 0 C A and O(x) = A ( x ) ) .
A is a Warren H ( 3 0 E T ) ( I CE supp
A is a Pu l2 neighbourhood of s H (30 E T)(S q 0 and 0 C A ) . A is a Mashhour l 3 neighbourhood of p (30 E ~ ) ( p s and O 0 C A). The resulting neighbourhood systems will be denoted as: C, ( T ) , Ic, (T), wz(~), Ps(7)and M p ( 7 ) . In a series of papers we could prove the following results concerning the characterization of a fuzzy topology by means of the different neighbourhood systems. (1) All neighbourhood systems, except Warren’s, satisfy the purely formal translated properties of a classical neighbourhood system. See section 2 as illustrated for the Kerre neighbourhood system. (2) A Ludescher neighbourhood system defines a fuzzy topology which is however not necessarily unique, i.e. different fuzzy topologies can lead t o the same Ludescher neighbourhood system and hence Ludescher’s definition gives no characterization. (3) A Kerre neighbourhood system defines no fuzzy topology but only a base for a fuzzy topology. Hence Kerre’s definition gives no characterization, contradicting the conjecture made by P u and Liu in 12. (4) A Pu and Liu neighbourhood system defines a unique fuzzy topology and
hence it provided a characterization. ( 5 ) A Mashhour-Ghanim-Kerre neighbourhood system defines a unique fuzzy topology and hence provided a characterization.
Due to a lattice-theoretic study of the different approaches to the concept of a neighbourhood we could introduce a skala of fuzzy topological spaces :
54 E. E. K e r r e and D. v a n d e r W e k e n
surjective spaces, conditionally closed Kerre spaces, conditionally closed P u spaces, conditionally closed Mashhour spaces, fuzzy co-countable spaces, fuzzy Sierpinski spaces and the 2-space. Moreover we could partly fill up the gap between Chang fuzzy topological spaces and Lowen spaces. For more details we refer to l6 and *. A third example concerns the fuzzification of a real number and the ordertheoretic structure of the real numbers. A walk through the fuzzy literature reveals the existence of many different definitions for this fundamental notion. We mention explicitly : the original definition of Mizumoto and Tanaka imposing a convexity condition, the different concepts introduced by Dubois and Prade using a monotonicity property and several forms of continuity and the approach of Rodabaugh related to probability distributions. The diversity of concepts is dangerous since the corresponding calculus of fuzzy quantities highly depends on the definition. In l8 we gave a theoretic consistent description of fuzzy numbers and their properties and we made an attempt to link the different definitions. As a final example of the extension of classical mathematics by fuzzy mathematics we would like to mention the fuzzy relational calculus. Relations are considered as one of the basic concepts in science and in particular in mathematics. Due to the pioneering work of Bandler and Kohout in the eighties, the machinery of relational calculus has been substantially extended, theoretically as well as practically in databases, information retrieval and preference modelling in decision making. For an extensive overview of fuzzy relational calculus and its applications we refer t o l9 or '. 4
4.1
The current stage Standardization, axiomatization and L-fuzzification
During the nineties up to now a major part of the ongoing research has been dedicated to more fundamental issues such as standardization, axiomatization and L-fuzzification. Problems of standardization concern the motivated choice of a standard for basic notions: fuzzy point versus fuzzy singleton as the most primitive building-stone; which concept of neighbourhood should be choosen as standard; what is the most suitable t-norm for modelling intersection of fuzzy sets; which definition should be choosen for the fuzzification of a subgroup and other basic algebraic structures. As a result of this process of standardization one should clear up the papers entitled "fuzzy ... redefined". Once a standard has been agreed, the use of deviations from the standard should be stated explicitly and thorougly motivated.
The Stimulating Role of Fuzzy Set Theory 55
A second important topic of research in the third stage concerns axiomatization in order to reach a consensus with respect t o the desirable properties of typical fuzzy concepts such as: fuzzy implication, measures of fuzziness, fuzzy measures, fuzzy preference relations, fuzzy ranking methods, fuzzy similarity relations, defuzzification methods. A third characteristic of the fuzzy research in the nineties is the so-called L-fuzzification, where L denotes a complete lattice. In the original fuzzy set theory the membership degrees are taken from the unit interval [0,1] and hence no incomparability could be taken into account. In order t o represent incomparable elements, states, systems, ..., Goguen introduced in 1967 the extension t o L-fuzzy sets. Now a huge amount of research results exists on L-fuzzy groups, L-fuzzy topology, L-fuzzy modifiers. Let’s give some concrete examples of these important directions of the latest fuzzy research. The first example stews from fuzzy topology. Let f be a mapping between two crisp topological spaces ( X I ,T I )and ( X 2 , T z ) . Then the following equivalent formulations for the continuity of f are well-known: f is continuous
(YO2 E T 2 ) W 1 ( 0 2 ) E TI) (VFZ E Td)(f-l(Fz) E Ti) @ P A 1 E R X l ) ) ( f ( C h ( A l ) ) c_ clz(f(A1)) e (VAz E P(X2))(f-1(intz(A2)) intl(f-l(A2)) @
w
c
It has been shown that not all of these equivalencies are kept in the framework of fuzzy topology. As a consequence a huge number of continuity concepts have been introduced in the eighties : fuzzy continuous, fuzzy weak continuous, fuzzy almost continuous, fuzzy semi continuous, fuzzy weak semi continuous, fuzzy 8-semi continuous, fuzzy almost semi continuous, fuzzy almost semi continuous, fuzzy S-continuous, fuzzy semi irresolute continuous, fuzzy strongly irresolute continuous, fuzzy irresolute continuous, fuzzy &continuous, fuzzy strong @-continuous,fuzzy almost strong 8-continuous, fuzzy super continuous, fuzzy weak 8-continuous, fuzzy weak precontinuous, fuzzy semi strongly 8-continuous, fuzzy weak X-continuous, fuzzy (8, S)-continuous, fuzzy quasi irresolute continuous, fuzzy semi weak continuous. Due to the introduction of the notions of an operation, we have been able 2o to unify all these concepts, i.e., to propose a general form of fuzzy continuity, such that all the foregoing concepts are special instances of this general form. A second example concerns axiomatization of fuzzy implication operators. In 21 Smets and Magrez introduced the following axioms for a fuzzy implication operator Z on the unit interval. Let 1 be a [0, 112- [0, I] mapping and
E. E. Xerre and D. van der Weken
56
(z, y, 2) E
[o, 113:
(A.l) Axiom of contraposition : Z(x, y) = Z(1 - y, 1 - z); (A.2) Exchange principle: Z(z,z(Y,.))
= W(z,y),z);
(A.3) Axiom of hybrid monotonicity: Z(., z) is decreasing and Z(z, .) is increasing;
(A.4) Boundary conditions: z 5 y =+ Z(z, y) = 1; (A.5) Neutrality principle: Z(1,z) = s; (A.6) Axiom of continuity: Z is continuous. In 22 we have tested 19 widely used fuzzy implication operators w.r.t. these axioms and we have built a fuzzy set of good fuzzy implication operators. Other examples of the stimulating role of fuzzy set theory in the development of pure and applied mathematics are:
- a critical comparison of the fuzzy model and other recently developed uncertainty models such as : rough sets, twofold fuzzy sets, flou sets, intuitionistic fuzzy sets 23;
- an overview of fuzzy quantifiers - a unified method
24J5;
for ranking fuzzy quantities
26,27;
- axiomatization and classification of the different methods for defuzzification 4.2
28.
Applications of fuzzy set theory
In this section we will focus on possible applications of fuzzy set theory, more precisely fuzzy techniques in image processing. We will illustrate how fuzzy techniques are used in establishing measures for image quality evaluation and in constructing fuzzy filters for image noise reduction. An important problem in image processing constitutes the comparison of images: if different algorithms are applied to an image, we need an objective measure to compare the different output images. It is well-known that
The Stimulating Role of f i z z y Set Theory
57
classical measures, such as the MSE (mean square error), do not always give convincing results. Since gray-scale images can be identified with fuzzy sets, it is interesting to investigate whether similarity measures, i.e. measures that are developed to express the degree of similarity between fuzzy sets, can also be applied in image processing. In the literature a lot of measures that express the similarity between two fuzzy sets can be found. In most cases, a similarity measure is formally defined as a fuzzy binary relation in F(X), i.e. a F ( X ) x F ( X ) + [O, 11 mapping that is reflexive, symmetric and min-transitive. However, not every measure in the literature satisfies this definition. Therefore, we give a larger interpretation to the notion of a similarity measure: a similarity measure is any measure to compare two fuzzy sets. We investigated 33 similarity measures 29 w.r.t. their applicability for image quality evaluation. In order to do this, we evaluated the similarity measures w.r.t. a list of relevant properties. For example: a similarity measure needs to be reflexive and symmetric. Furhtermore, a good similarity measure should not be affected too much due to noise, and should be decreasing with respect to an increasing noise-percentage. From a total of the 33 similarity measures, only six similarity measures satisfied the list of relevant properties. A complete overview of the appropriate similarity measures can be found in 30. Let’s give the following two examples:
where A, B E F ( X ) ,with X = {(z, y)l0 of image points.
5 z 5 M , 0 5 1~ 5 N } a discrete set
Another illustration of the application of fuzzy techniques in image processing is the construction of fuzzy filters for image noise reduction. Already several fuzzy filters for noise reduction have been developed, e.g. the well-known FIRE-filter from Russo the weighted fuzzy mean filter from Lee 3 3 , and the iterative fuzzy control based filter from Farbiz and Menhaj 34. However, most techniques are not specifically designed for gaussian(-like) noise or do not show convincing results when applied to this type of noise. The GOA filter 35 (named after the research project it was developed in) is, in contrast to most other fuzzy filters, specifically designed for the reduction 31y32,
58
E. E. K e r r e and D. v a n d e r W e k e n
of gaussian-like noise. The general idea is to average a pixel using other pixels from its neighbourhood, but simultaneously t o take care of important image structures such as edges. To accomplish this goal, two important features are presented. First, to distinguish between local variations due to noise and due to image structure the filter estimates a fuzzy gradient for each direction; second, the membership functions are adapted accordingly to the noise level t o perform fuzzy smoothing. The filter is applied iteratively.
Figure 1. Pixels involved in the calculation of
vEw(i,j)
First, a value og(i,j) that expresses the degree in which the gradient in the direction D is small is derived. For each direction D (see Fig. 1 for an example), this is done by using the classical gradient values in the processed pixel and the two neighbouring pixels perpendicular to direction D (gradient values denoted by ~ ~ ( i , v j ~ ) ,( i l , j land ) v ~ ( i z , j z ) and ) , the following fuzzy rule: IF ( vo(i,j)is s m a l l AND v ~ ( i l , jisl )s m a l l ) OR ( ~ ~ ( i is, js m) a l l AND v ~ ( i 2 , j ais) small ) OR ( v ~ ( i 1 , jisl ) s m a l l AND o ~ ( i z , j 2is) s m a l l ) THEN vg(i,j)is s m a l l , where small is a triangular fuzzy number. The membership function p s m of this fuzzy set is adapted in each iteration] depending upon the residual amount of noise. The principle behind this is similar to the one behind the EIFCF filter: a compression of the fuzzy set s m a l l keeps the filtering capacity of the filter preserved. The further construction of the filter is based on the observation that a small fuzzy gradient most likely is caused by noise, while a large fuzzy gradient most likely is caused by an edge in the image (this is due t o the fact that the fuzzy gradient is derived from 3 classical gradient values, perpendicular t o the considered direction). Consequently, for each direction the following fuzzy rules that take this observation into account are applied:
The Stimulating Role of Fuzzy Set Theory 59
~ g ( i , jis)small AND ~ ~ ( i is, jp o) s i t i v e THEN y A ( i , j ) is p o s i t i v e , IF vg(i,j)is small AND v ~ ( i ,isjn)e g a t i v e THEN y i ( i , j ) is n e g a t i v e ,
IF
where p o s i t i v e and n e g a t i v e are linear fuzzy sets with membership functions ppOs and pneg. The truth values y$(i,j) and y;(i,j) of both rules are calculated for each direction, e.g. yi(i,j) = m i n ( ~ ~ m ( o ~ ( i , j ) ) , ~ L p , s ( ~and ~ ( the i , j )correction )), term is then given by: Y ( i 7 j ) = 255.
c
(YA(i,j) - Y&,j)),
DEdir
with dir the set of all directions. To illustrate the performance of the GOA filter, we have applied the filter to a noisy “cameraman” image. For comparison, we have also applied the classical Wiener filter, and the weighted fuzzy mean filter. The numerical results are displayed in Table 1. Besides the MSE-values, we have also calculated the values of the similarity measures S1 ( r = 1) and SZ.
Table 1. Numerical results corresponding t o the images in Fig. 2.
The results not only confirm the good performance of the new filter, but also illustrate that similarity measures (in particular the used measures S1 and ,572) are a better tool for image comparison. Our new filter performs best w.r.t. both the MSE and the similarity measures. Regarding the WFM filter: performs worst (MSE much higher, similarity measure lower). Regarding the Wiener filter: performs bad w.r.t. MSE, but shows an improvement w.r.t. the similarity measures S1 and S,. The latter is in accordance with the visual result, and illustrates that the similarity measures better reflect the visual observations than the MSE measure. The previously discussed applications are only two of many successful applications of fuzzy set theory in the area of image processing. Other possible applications are: fuzzy image enhancement, fuzzy edge detection, fuzzy image segmentation, fuzzy processing of color images, and applications in medical imaging and robot vision. We refer to 36 for an up-to-date and state-of-the-art coverage of diverse aspects related to fuzzy techniques in image processing.
60 E. E. Kerre and D. van der W e k e n
Figure 2. Top row: from left to right: original “cameraman” image and “cameraman” image with gaussian noise (u = 5.7); bottom row: from left to right: result of the GOA filter, result of the classical Wiener filter, result of the weighted fuzzy mean filter.
References
1. L.A. Zadeh, Fuzzy sets, in : Information and Control 8 (1965), pp. 338353. 2. C.L. Chang, Fuzzy Topological Spaces, in: J. Math. Anal. Appl. 24 (1968), pp. 182-192.
The Stimulating Role of f i z z y Set Theory 61
3. E.E. Kerre, Fuzzy topologizing with preassigned operations, in: Intermational Congress for Mathematicians (Helsinki, 1978), pp. 61-62. 4. E.E. Kerre, Introduction to the Basic Principles of Fuzzy Set Theory and some of its Applications, second revised edition, Communication and Cognition, Gent, (1993). 5. B. Schweizer, A. Sklar, Probabilistic Metric Spaces, Elsevier Science Publishing Company, New York, (1983). 6. B. Hutton, Normality in fuzzy topological spaces, in: J . Math. Anal. Appl. 50 (1975), pp. 74-79. 7. E.E. Kerre, Characterization of normality in fuzzy topological spaces, in: Simon Stevin 53 (1979), pp. 239-248. 8. E.E. Kerre, Fuzzy Sierpinski space and its generalizations, in: J. Math. Anal. Appl. 74 (1980), pp. 318-324. 9. E.E. Kerre, P. Ottoy, Fuzzy subspace of a fuzzy topological space, in: S. Ovchinnikov, Ed. Proceedings NAFIPS 88 (San Francisco, 1988), pp. 141-146. 10. H. Ludescher, E. Roventa, Sur les topologies floues definies I’aide des voisinages, in: C.R. Acad. Sci. Paris 283 (1976), pp. 575-577. 11. R.H. Warren, Fuzzy topologies characterized by neighbourhood systems, Rocky Mountain J. Math. 9 (1979), pp. 761-764. 12. Fuzzy Topology I, in: J. Math. Anal. Appl. 76 (1980), pp. 571-599. 13. M. Ghanim, E.E. Kerre, A. Mashhour, Separation axioms, subspaces and sums in fuzzy topology, in: J. Math. Anal. Appl. 102 (1984), pp. 189202. 14. E.E. Kerre, P. Ottoy, On the different notions of neighbourhood systems in Chang-Goguen fuzzy topological spaces, in: Simon Stevin 61 (1987), pp. 131-146. 15. E.E. Kerre, P. Ottoy, On the characterization of a Chang fuzzy topology by means of a Kerre neighbourhood system, in: J.L. Chameau, J . Yao, Eds., Proceedings of NAFIPS 87 (Purdue University Press, Purdue 1987), pp. 302-307. 16. E.E. Kerre, P. Ottoy, Lattice properties of neighbourhood systems in Chang fuzzy topological spaces, in: Fuzzy Sets and Systems 30 (1989), pp. 205-213. 17. E.E. Kerre, P. Ottoy, A comparison of the different notions of neighbourhood systems for Chang topologies, in: J. Kacprzyck, A. Straszak, Eds., Proceedings of the First Joint IFSA-EG EURO- WG Workshop on Progress in Fuzzy Sets in Europe, Prace ibs pan, Warsaw, vol. 169 (1989), pp. 241-251. 18. E.E. Kerre, A. Van Schooten, A deeper look on fuzzy numbers from a the-
62
E. E. Kerre and D. van der Weken
19. 20. 21.
22.
23. 24. 25. 26. 27. 28. 29.
30.
31.
32.
33. 34.
oretical as well as a practical point of view, in: M. Gupta, T . Yawakawa, Eds., Fuzzy Logic in Knowledge-Based Systems, Decision and Control, North-Holland, Amsterdam, (1988), pp. 173-196. B. De Baets, E.E. Kerre, Fuzzy relations and applications, in: Advances in Electronics and Electron Physics 89 (1994), pp. 255-324. A. Kandil, E.E. Kerre, A. Nouh, Operations and mappings on fuzzy topological spaces, in: Annales de la Socie'te' Scientzjique de Bruxelles, T.105, 4 (1991), pp. 167-188. Ph. Smets, P. Magrez, Implication in fuzzy logic, in: Internat. J. App. Reasoning 1 (1987), pp. 327-347. Ruan Da, E.E. Kerre, Fuzzy implication operators and generalized fuzzy method of cases, in: Fuzzy Sets and Systems 54 (1993), pp. 23-37. E.E. Kerre, A First View on the Alternatives of Fuzzy Set Theory, in : B. Reusch, K.-H. Temme, Eds. Computational Intelligence in Theory and Practice, Physica Verlag, Heidelberg (2001), pp. 55-72. Y. Liu, E.E. Kerre, An overview of fuzzy quantifiers, Part I : Interpretations, in: Fuzzy Sets and Systems 95 (1998), pp. 1-22. Y. Liu, E.E. Kerre, An overview of fuzzy quantifiers, Part I1 : Reasoning and applications, in: Fuzzy Sets and Systems 95 (1998), pp. 135-146. X. Wang, E.E. Kerre, Reasonable properties for the ordering of fuzzy quantities, Part I, in: Fuzzy Sets a n d Systems 118 (2001), pp. 375-386. X. Wang, E.E. Kerre, Reasonable properties for the ordering of fuzzy quantities, Part 11, in: Fuzzy Sets and Systems 118 (2001) pp. 387-406. W. Van Leekwijck, E.E. Kerre, Defuzzification: criteria and classification, in: Fuzzy Sets and Systems 108 (1999), pp. 159-178. D. Van der Weken, M. Nachtegael, E.E. Kerre, The applicability of similarity measures in image processing, in: Proceedings of the 8th International Conference on Intelligent Systems and Computer Science, Moskou, december 2000. D. Van der Weken, M. Nachtegael, E.E. Kerre, An overview of similarity measures for images, Accepted for the 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, May 13-17, 2002, Orlando, USA. F. RUSSO,G. Ramponi, A fuzzy filter for images corrupted by impulse noise, in: IEEE Signal Processing Letters 3 (1996), pp. 168-170. F. RUSSO, Fire Operators for image processing, in: Fuzzy Sets and Systems 109 (1999), pp. 265-275. K. Lee, Chang-Shing, Yau-Hwang, P.T. Yu, Weighted fuzzy mean filters for image processing, in: Fuzzy Sets and Systems 89 (1997), pp. 157-180. F. Farbiz, M.B. Menhaj, A fuzzy logic control based approach for image
The Stimulating Role of Fuzzy Set Theory 63
filtering, in: E.E. Kerre, M. Nachtegael, Eds., Fuzzy Techniques in Image Processing, Springer Verlag (2000), pp. 194-221. 35. D. Van De Ville, M. Nachtegael, D. Van der Weken, E.E. Kerre, W. Philips, I. Lemahieu, Noise reduction by fuzzy image filtering, in: IEEE Transactions on Fuzzy Systems, submitted. 36. E.E. Kerre, M. Nachtegael, Eds., Fuzzy Techniques in Image Processing, Springer Verlag (2000).
This page intentionally left blank
K-ORDER ADDITIVE FUZZY MEASURES: A NEW TOOL FOR INTELLIGENT COMPUTING RADKO MESIAR Department of Mathematics, Fac. of Civil Engineering, Slovak University of Technology, RadlinskGho 11, 819 68 Bratislava, Slovakia e-mail:
[email protected],stuba.sk Several classes of fuzzy measures are recalled and characterized. A new class of Ic-order additive fuzzy measures offering new perspectives for intelligent computing is discussed, including some construction and identification methods. Several examples are given.
Keywords 1
: Additivity,
belief function, Choquet integral, fuzzy measure.
Introduction
Fuzzy measures (monotone set functions vanishing in the empty set, premeasures) extend the concept of additive set functions (measures, probabilities), see, e.g., For a recent overview of the topic we recommend 1 4 . Recall that for a fixed measurable space (X, A),the class P of all probabilities on (XIA) is a convex set. Dealing with the case when X = {XI,. . . , 2,) is a finite space and A = 2x, P is a simplex with vertices h i z i ) (Dirac measures). Moreover, P is closed under duality of set functions since any probability P is self-dual. However, P is not closed under many other aggregations, e.g., under min, Prod, max, etc. The smallest class of set functions closed under any aggregation in the spirit of 15,? is just the class of all fuzzy measures. The smallest convex class containing P and closed under min is the class of all belief functions (the same holds when Prod is taken for the aggregation). Again, B is a simplex with vertices BA (generalized Dirac measure), 31135726,?.
dA
= min ( d i z i ) I xi E A ) =
dlzi}. xi € A
Note that generalized Dirac measures are known in the game theory as unamity games 7&13 and they can be characterized by the set function 6A(B)
=
if A c B , 0 otherwise.
{1
Observe that similarly we can discuss the fuzzy measures linked to some pseudo-addition @ 3 2 , e.g., linked to max operator V (possibility measures).
65
66
R. Mesiar
2
k-order additive belief functions
Dempster-Shafer theory was successfully applied in several important areas linked to the intelligent computing, especially in multicriteria decision making However, the identification of a relevant belief function (fuzzy measure) fitting the modeled situation is a problem with high complexity, practically inaccessible in reality. Hence, a specific subclass of belief functions (fuzzy measures) , with acceptable complexity but sufficiently large applicability was proposed recently by Grabisch 1 2 . Grabisch’s concept of k-order additivity can be easily formulated by means of generalized Dirac functions. The class 131, of all k-order additive belief functions on X = (x1,. . . , xn} with k < n, is just the simplex with vertices b ~cardA , 5 k. 353?.
We have shown recently that equivalently, m E Dk if and only if m(A) = P(Ak) for some probability measure P on X k ,see, e.g., Recall also that then 20321123.
where fk(xtl , . . . , x i k ) = min ( f ( x i l ) ,. . . , f(qk)). Similarly, k-order additive fuzzy measures can be introduced 20,21,23. In the case k = 1, the standard additivity is recovered, and then we have two equivalent characterizations of 1-additive belief functions (fuzzy measures), namely (i) VA, B with A n B
=0 :
m(A U B ) = m(A)
+ m(B),
(additivity)
(ii) V A , B : m ( A U B ) + m ( A n B ) = m ( A ) + m ( B ) , (valuation property) For a general k we have the next two equivalent characterizations
21134:
(i’) V A 1 , . . . ,Ak+l pairwise disjoint :
(k-additivity) (ii’) VA1, . . . ,A k + l denote Bi =
Ai
\ U Aj, j#i
i = l,...,k
+ 1, and Bo
=
K - o r d e r Additive Fuzzy M e a s u r e s kfl
U Ai \
i=l
67
‘$ Bi: i=l
(I(odd
lJl even
j €J
(k-valuation property).
3
Construction a n d identification of 2-order additive belief functions
To construct a 2-order additive belief function m on (X,A), we need first
to construct a probability P on (X,A)’. However, any such probability is determined by the marginal probabilities PI and P 2 on (X,A) by means of some copula C 2 5 . If PI = P 2 are uniform and C is the product copula then the corresponding m = Pf is a symmetric belief function,
m ( A )=
card A
2
(7 I
and the relevant Choquet integral is the OWA operator induced by quantifier q(x) = x2,see 3 6 . To identify a 2-order additive belief function m we have to determine the weights wi corresponding to b{zil and wi,j linked to 6{2,,z3}1 that is, weights constrained by the non-negativity and the sum equal to 1. Because of (1) for any input function f : X -+ R, the Choquet integral output (that is, the global evaluation of f ) is given by
(c> -
/
c n
f dm =
i=l
wif(xi)
+
c
wi,j min
( f ( z i > ,f(xj>)
i<j
where the weights are constrained by the normalization condition: i
i<j
Now, having a training data, we estimate the weights by means of the least square method, leading to a quadratic programming problem, see also 24.
As an illustration we present an example for n = 2, that is, with two inputs x and y only. The corresponding Choquet integral based output will be denoted by z . The observed data are given in Table 1.
68
R. Mesiar
2 0 3
5 5 0
9 25 9 47
3 2 1
6 15 3 26
Suppose we want to find the best fitting probability measure describing observed data, that is, we look for p E [0,1] such that z=pz+(l-p)y
Following the least squares method, p should minimize the value
c 4
LP(P) =
c 4
(Zi - pzz
- (1 - p)y,)2 =
i=l
((Zi
2
- yz) - p ( z i - yi)) ,
i=l
that is, L$(p) = 0 yielding
26
-
1363
-
29
Note that then Lp ( E ) - T T - A. Now, let us model our data by means of a 2-order additive believe function m, that is, we need to determine the weights w1, w2, 1 - w1 - w2 = w1,2 so that
z = w1z
+ w2y + (I - w1
-
w2)
min(z,y).
Now we have t o optimize the value 4
L,(w~, w2) =
C (zi -
wlxi
- w2yi - (1 - w1 - w2) min(zi, yi))2
i= 1
c 4
=
((zi- min(zi, yi)) - w1 max(0, zi - yi) - w2 max(yi - xi,0))2
i=l
We have to solve the system of equations
dLm -= 0, awl
dLm = 0. aw2
K-order Additive Fuzzy Measures
69
Due t o the fact that max(0,xi - yi) max(yi - xi, 0) = 0 for all i = 1 , . . . , 4 , the relevant solution to the system is, see Table 2, 4
C (zi- min(zi, yi)) max(0, xi - yi)
w1 =
i= 1
c (max(o,xi 4
i=l
-
1 3
-_ -
and
Table 2 Note that an analytical solution for w1,wq and W I , ~= 1 - w1 - wz need not fulfill the non-negativity requirement, in general. Then the quadratic programming should be applied. However, in our case the weights 1 15 31 w1 = 3 ' w 2 = - and w1,2 = 114 38 fulfill all requirements and hence determine the optimal 2-order additive belief function m. Note that now the sum of square differences is efficiently smaller, 114
3
= 0.079
< 0.617 =
Further recall that the best approximation of a belief function m by means of a probability measure P in the sense C ( m ( A )- P(A))' = min, is called a pignistic probability and denoted by P*. Due t o results in concerning the Mobius transform, it can be shown that if m is a 2-order additive belief function with weights wi and wi,j, for i , j E ( 1 , . . . , n } , i < j , then P" = ( P l , . . . , P n ) with pi = wi
+ -21 ( g w k , i +
2
j=i+l
Wij)
70
R. Mesiar
In the case discussed above, we have n = 2 and for m optimizing our training data given in Table 1, we get 121 107 -15+ - 1 -31 = P I = -1 + -1- 31 - - p2 3 2'114 228' 38 2'114 228' that is ,
s)
As we have already shown, the best probability measure P fitting our training data is given by P = (%, , verifying the fact that the optimal belief function is not linked to the optimal probability measure. 4
Conclusions
The concept of k-order additivity of fuzzy measures allows to model the interaction which can never be caught by an additive set function, but still several additivity advantages are preserved. Therefore several interesting applications in the framework of uncertainty modeling are expected. The included example has shown 87% improvement of the sum of square differences when applying 2-order additive belief function instead of a probability measure, which supports our expectations.
Acknowledgement The support of the grants VEGA 1/8331/01 and 1/7146/20 is kindly announced. References 1. Benvenuti, P. and Mesiar, R.: Integrals with respect t o a general fuzzy measure. In: Fuzzy Measures and Integrals. Theory and Applications, M. Grabisch, T. Murofushi, M. Sugeno, eds., Physica-Verlag, Heidelberg, 2000, pp. 205-232. 2. Calvo T., Kolesarova A., Komornikova M. and Mesiar R.: A Review of Aggregation Operators. Univ. of AlcalA, AlcalA de Henares, Spain, 2001. 3. Chateauneuf, A. and Jaffray, J.Y.: Some characterizations of lower probabilities and other monotone capacities through the use of Mobius inversion. Mathematical Social Sciences 17 (1989) 263-283.
K-order Additive Fuzzy Measures
71
4. Choquet, G.: Theory of capacities. Ann. Inst. Fourier 5 (1953/54) 131-295. 5. De Finetti, B.: Sull’impostazione assiomatica del calcolo delle probabilit8. Annali Uniu. Trieste 19 (1949) 3-55. 6. Dempster, A.P.: Upper and lower probabilities induced by a multivalued mapping. Annals of Mathematical Statistics 38 (1967) 325-339. 7. Denneberg, D.: Non-additive Measure and Integral, Kluwer Academic Publishers, Dordrecht, 1994. 8. Denneberg, D.: Non-additive measure and integral, basic concepts and their role for applications. In: Fuzzy Measures and Integrals. Theory and Applications, M. Grabisch, T. Murofushi, M. Sugeno, eds., PhysicaVerlag, Heidelberg, 2000, pp. 42-69. 9. Dubois, D. and Prade, H.: Possibility Theory, Plenum Press, New York, 1988. 10. Grabisch M.: Fuzzy integral in multicriteria decision making. Fuzzy Sets and Systems 69 (1995) 279-298. 11. Grabisch, M.: k-order additive fuzzy measures. Proceedings IPMU’96, Granada, 1996, pp. 1345-1350. 12. Grabisch, M.: k-order additive discrete fuzzy measures and their representation. Fuzzy Sets and Systems 92 (1997) 167-189. 13. Grabisch, M.: The interaction and Mobius representation of fuzzy measures on finite spaces, k-additive measures. In: Fuzzy Measures and Integrals. Theory and Applications, M. Grabisch, T. Murofushi, M. Sugeno, eds., Physica-Verlag, 2000, pp. 70-93. 14. M. Grabisch, T . Murofushi, M. Sugeno, eds.: Fuzzy Measures and Integrals. Theory and Applications, Physica-Verlag, 2000. 15. Klir, G.J. and Folger, T.: Fuzzy Sets, Uncertainty and Information. Prentice Hall, Englewood Cliffs, 1988. 16. Koleskov6, A.: Mobius fitting aggregation operators. Kybernetika, submitted. 17. Marichal, J.L.: Aggregations operators for multicriteria decision aid, PhD. thesis, University of Liege, 1998. 18. Marinacci, M.: Decomposition and representation of coalition games. Mathematics of Operations Research 21 (1996) 1000-1015. 19. Mesiar, R.: k-order Pan discrete fuzzy measures. Proceedings IFSA ’97, Prague, 1997, pp. 488-490. 20. Mesiar, R.: k-order additive measures. Int. Jour. of Uncertainty, Fuzziness and Knowledge-Based Systems 6 (1999) 561-568. 21. Mesiar, R.: Three alternative definitions of k-order additive fuzzy measures. Busefal 83 (2000) 57-62.
72
R. M e s i a ~
22. Mesiar, R.: Maxitive and k-order maxitive measures. Proceedings IFA C, Prague, 2001. 23. Mesiar R.: Generalized Mobius transform and k-order additive fuzzy measures. Int. J. Gen. Syst, submitted. 24. Miranda P. and Grabisch M.: Optimization issues for fuzzy measures. Proceedings IPMU’98, Paris,1998, pp. 1204-121 1. 25. Nelsen R. B.: An Introduction to Copulas. Lecture Notes in Statistic 139, Springer Verlag, 1999. 26. Pap, E.: Null-additive Set Functions, Kluwer, Dordrecht, 1995. 27. Rota, G.C.: On the foundations of combinatorial theory I. Theory of Mobius functions. Zeitschrifi fur Wahrscheinlichkeitstheorie und verwandte Gebiete 2 (1964) 340-368. 28. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press, Princeton, 1976. 29. Shafer, G.: Allocations of probability. Ann. Probab. 7 (1979) 827-839. 30. Shilkret, N.: Maxitive measures and integration. Indag. Math. 33 (1971) 109-116. 31. Sugeno, M. Theory of Fuzzy Integrals and Applications, PhD. thesis, Tokyo Inst. of Technology, 1974. 32. Sugeno, M. and Murofushi, T.: Pseudo-additive measures and integrals. J. Math. Anal. Apll. 122 (1987) 197-222. 33. SipoS, J.: Non-linear integrals. Math. Slovaca 29 (1979) 257-270. 34. ValBSkovB, L.: A note to the 2-order additivity. Proc. MAGIA’ 2001, KoCovce 2001, pp. 53-55. 35. Wang, Z. and Klir, G.J.: Fuzzy Measure Theory, Plenum Press, New York, 1992. 36. Yager R.R.: On ordered weighted averaging operators in multicriteria decisionmaking. IEEE Trans. Syst., Man Cybern. 18 (1988) 183-190. 37. Zadeh, L.A.: Fuzzy sets as a basis for the theory of possibility. Fuzzy Sets and Systems 1 (1978) 3-28.
ON-LINE ADAPTATION OF RECURRENT RADIAL BASIS FUNCTION NETWORKS USING THE EXTENDED KALMAN FILTER BRANIMIR TODOROVIC’*, MIOMIR STANKOVIC’, CLAUDIO MORAGA~** I Faculty of Occupational Safety, University of Nii, 18000 NiS, Yugoslavia E-mail:
[email protected] Department of Artificial Intelligence, Polytechnical University of Madrid, Spain Department of Computer Science, University of Dortmund, Germany E-mail:
[email protected] We have applied the extended Kalman filter to the parameter, state and structure estimation of a recurrent radial basis function network. The architecture of a recurrent radial basis function network implements a nonlinear autoregresive model with exogenous inputs. The on-line structure adaptation of the network is achieved by combining growing and pruning of the hidden units and connections of the network. Statistical criteria for growing and pruning were derived using the Kalman filter’s innovation statistics and state estimation error. Examples of non-stationary dynamic system modeling are given to illustrate the proposed algorithm.
Keywords: recurrent RBF, on-line learning, structure adaptation, network growing, pruning, extended Ka fmanfilter
1
Introduction
The fundamental property of any learning system including neural networks, is adaptation in a changing environment. In most real world applications the process or environment to be modeled or controlled is non-stationary, that is, the underlying dynamics changes over time. An algorithm for continuous adaptation of a learning system in a non-stationary environment should resolve the following dilemmas: a) Biadvaraince dilemma: sequential adaptation should maintain the optimal complexity of the learning system. b) Stability/plasticity dilema: while the system should be able to follow changes of the non-stationary environment as quickly and accurately as possible (i.e. optimal plasticity), previously acquired information should not be “forgotten” (ie. stability).
* The work of B. TodoroviC was supported by a Scholarship of the German Academic Exchange Service (DAAD) under the Stability Pact for South East Europe. ** The work of C. Moraga was supported by the Spanish State Secretary of Education and Universities of the Ministry of Education, Culture and Sports (Grant SAB2000-0048), and by the Social Fund of the European Community.
73
74
B. TodorouiC, M . StankoviC and C. Moraga
c) Noisehon-stationarity dilemma: a new data sample which significantly differs fiom the aquired knowledge is either an outlier or the evidence of non-stationarity of the task. Our paper discusses a problem of continuous neural network adaptation (often called on-line learning or sequential adaptation) in non-stationary environment. We shall consider the supervised learning scenario where at each time step k an input/output data sample is presented to the network. After that the sample is discarded and cannot be used again for adaptation. The joint inputloutput data distribution is unknown and varies with time. The network should learn the deterministic dependence between the input and output. The knowledge in aneural network is represented by parameters - weights of the connections between neurons. The complexity of the neural network as a model is defined as the number of adaptable parameters, which in turn depends on the number of neurons in the network. The biashariance dilemma is a well known problem in the neural network community, whether the task of interest is stationary or not. A neural network with a small number of adaptable (free) parameters (weightshodes) is biased in the sense that it is unable to model a desired dependence. The network that is too large for the problem suffers from poor generalization, because estimated parameters tend to have a large variance. Also, such a network is more susceptible to noise and error in the training data. There are several approaches for aiming at solving the biashariance dilemma. Constructive approaches start with small number of units and add new units during adaptation until some performance criterion is satisfied [2,6]. Destructive approaches start with a large number of neurons (possibly too large for a given task) and prune connections and neurons as long as deterioration of network performance is not significant [ 1,3]. Combined approaches switch between constructive and destructive phase [9,10]. In order to track changes in non-stationary environment, parameters and structure of the neural network must be time-varying. The ratio between plasticity and stability must be tuned by suitable forgetting factor so as to reject the knowledge that has become invalid due to the non-stationarity. The networks of neurons with localized activation functions are better suited for learning in the nonstationary environment than networks of neurons with global activation functions such as sigmoidal. When data sample in a particular area of the input space is presented, the nonlinear mapping represented by neurons with sigmoidal activation functions may change in other regions of the input space, thus forgetting still valid knowledge. Such behaviour is known as “catastrophic interference”. In this paper we consider the on-line learning of the Recurrent Radial Basis Function (RRBF) network by applying the Extended Kalman Filter (EKF). Among first, the applications of the EKF to recurrent neural network (RNN) learning were described by Matthews[S] and Williams[ll]. In [ l l ] dynamic of the neuron outputs and parameters of the RNN were represented by the state-space model, and the EKF is applied to the resulting nonlinear estimation problem. In this paper we extend this
On-line Adaptation of Recurrent Radial Basis f i n c t i o n Networks
75
idea by applying the EKF to the simultaneous estimation of states, time-varying parameters and structure of the RRBF network. The on-line structure adaptation of the RRBF network is achieved by combining the growing and pruning of the hidden neurons and connections. The Kalman filter consistency test is used as the criterion for adding new hidden neurons - network growing. The on-line pruning algorithm is derived based on a criterion similar to the off-line pruning method Optimal Brain Surgeon (OBS). It uses the statistics estimated by the EKF to determine the significance of the connections and neurons in order to decide whether to prune or not. As a solution to noiselnonstationarity dilemma we propose “add first - confirm later” paradigm. A new data sample which significantly differs from the aquired network knowledge, will cause the addition of a new hidden neuron. However, at the moment of the data arrival the cause of its novelty is not known. Only upcoming data may confirm or negate the fact that current data sample contains significant information, or it is an outlier. If it is an outlier, the hidden neuron that is generated will become insignificant and consequently pruned. Otherwise, it will become specialized, and permanent in the network structure. The remainder of the paper is organized as follows. In Section 2 we give the short description of a RRBF network as the nonlinear autoregressive model of dynamic system with exogenous inputs. Noise filtering and time-varying parameter adaptation of the RRBF network are put in the framework of nonlinear state estimation applying the EKF in Section 3. In Section 4, we derive the criteria for growing and pruning, in order to obtaine the on-line algorithm for structure adaptation of the RRBF network. Examples of nonstationary nonlinear dynamic system modeling in Section 5 are followed by conclusions in Section 6. 2
Dynamic system modeling using the RRBF networks
Let us consider the nonlinear autoregressive model of dynamic system with exogenous inputs and additive observation noise:
~ ( k=)f ( s(k-l) ,..., ~ ( k - A ~ ) , ~ ( k,... - lu(k-A,),w) ) Y ( k )=
+v(k)
(1)
where s ( k ) corresponds to the true (noiseless) output of the system, ~ ( k is) the input at time step k, An and As are the input and the output order, and f ( . ) is a nonlinear function parametrized by w. The only available measurement y ( k ) contains additive noise v(k) . We apply the RRBF network to implement this model by approximating the function f(.).Without loss of generality we shall consider the system with a one-dimensional output (see Fig. 1). In that case the output of the RRBF network is given by:
76
B. TodomwiC, M. StankowiC and C. Moraga
f(S(k-l),u(k-l),w)
"H
+ Cqh(s(k-l),u(k-l),w) .
(2)
i=l
We have used w to denote the n , dimensional vector of unknown parameters (bias a o , weights a;,centers m i l , mij7 and widths o;, , o i j T )and , nH is the number of hidden neurons. The output of the i-th hidden neuron is given by:
where ~ ( -1) k
= [s(k -i)...s(k
and u(k - 1) = [u(k -
...u(k -
-
a,)]fA,is the vector of previous network outputs is the vector of previous inputs.
In the remainder of this paper we will develop the algorithm for simultaneous sequential estimation of noiseless outputs (filtering) and sequential estimation of weights and structure (learning) of the RRBF network by applying the extended Kalman filter.
Fig. 1. Recurrent Radial Basis Function Network
3
State and parameter estimation applying EKF
Kalman Filter (KF) is an optimal linear minimum-mean-square-error state estimator for stochastic linear systems in a state form. In the case of a nonlinear model, the Extended Kalman Filter (EKF) is a suboptimal estimator, obtained by linearizing the model arround current state estimate, and applying the KF equations to the resulting time-varying linear model.
On-line Adaptation of Recurrent Radial Basis Function Networks
77
Let us consider the state estimation of the nonlinear dynamic sistem represented by the following state-space model:
where d ( k ) and v ( k ) are the zero mean Gaussian noise processes. The state vector x(k) evolves according to a nonlinear, non-stationary Markov dynamics driven by input vector u(k) and process noise d ( k ) . Non-stationarity means that fk and the covariance of the process noise are time variant. The Markov property implies that the probability density function of x(k + 1) depends on the knowledge of the current state x(k) and not previous statesx(k-I), I = l,Z,.... The measurement (observation) vector y ( k ) is a nonlinear, nonstationary and noisy mapping of the current state x(k) and current input u ( k ) . The RRBF network parameter estimation can be put in the framework of nonlinear state estimation using the EKF by augmenting the base state s, which is in our case defined as the previous As outputs of the RRBF network, with the vector of network parameters w. The state space representation of the RRBF network parameter and output dynamics is given by:
x(k)= @(x(k-l),u(k-l))+d,(k-I), Y W
=f
w k ) + v ( k ) v(k) 9
d,(k-l)-N(O,Q,(k-l)),
(5a)
- N(O,R(k))
(5b)
The process noise d,(k) and observation noise b-2, it is because the innovations have mean different from zero. That is exactly the reason why a new hidden neuron should be added to the network. However, in the case of multidimensional network output ( n o >1), the test
v p ( k ) > b2 cannot identify the source of the problem. In that case, the separate bias test, whether the mean of each component of the innovation e(k) is nonzero or not, has to be carried out. This can be done by dividing each component of the innovation e(k): el(k), I = 1,2,...,no by its standard deviation, which makes it normal N(0,l) , and testing to see if its mean can be accepted as nonzero. The hypothesis that the mean of the 1-th compoment of the innovation is zero H o : E[e(k)]= 0 is tested using the folowing sample mean:
where &(k) is the I-th diagonal element of the innovation covariance S ( k ) . It can be shown that the variance of the sample mean (14 is 1/N . Under the hypothesis H , and for large enough N , the variable ( k ) should be normal with zero mean and variance equal to one. We shall accept that the mean of the I-th innovation component is not zero if the following criterion is satisfied:
dil
On-line Adaptation of Recurrent Radial Basis Function Networks 81
Threshold
y~
is
determined
based
on
the
probability
P{Jf i & ( k ) I< T ~ / H=~1-a } and a is usually chosen to be 0.05. A Kalman filter is not consistent if at least one of the innovation components has mean not equal to zero. Note that one can also use the test (15) if N = 1 . A new hidden neuron should be added if a) the consistency test is not satisfied and b) only specialized hidden neurons are activated by the current network input. A hidden neuron is referred to as specialized if its input and output parameters have accumulated certain level of knowledge, and new observations cannot significantly improve it. The moment of neuron specialization is determined based on the number of samples that have activated the neuron above certain threshold. By applying criterion b) in adition to the consistency test, we insure that some period of time (i.e. some number of samples) is given to the parameters of existing non-specialized neurons to adapt before a new neuron is added. A new hidden neuron will be added if the current input to the network activates only specialized hidden neurons and if at least one of no criteria (15) is satisfied. Initially a new hidden neuron will be connected to all input neurons and to those output neurons whose innovations satisfy (15).
4.2
RRBF network pruning
During adaptation to a time-varying environment some connections or hidden neurons may become insignificant and should be pruned. A connection is insignificant if its parameter and the parameter change are both insignificant. In a RRBF network the significance of the input connection is determined based on the width o u r ,the significance of the recurrent connections, based on the width o i l , and the significance of the output connection, based on the weight a , . The wellknown pruning method OBS [ 11, ranks parameters according to the saliency, which is defined as the change in the training error when the particular parameter is eliminated. The parameter with the smallest saliency is pruned. However, OBS was developed for the off-line trained networks with fixed training and test set. We have derived an analogous on-line pruning method for RRBF network [lo], by establishing the relation between the parameter saliency and the statistical significance of the parameter. Additional criterion is introduced in order to test the the significance of time-varying parameters. The inverse of the Hessian of the cost function, needed for the significance test, is recursively updated by the EKF. Therefore, the pruning method does not significantly increase the overall computation complexity of the learning algortihm.
82
B. TodoroviC, M. StankoviC and C. Moraga
Parameter saliency Saliency of the parameters, estimated using an extended Kalman filter, is defined as the minimal change of the EKF cost function (8) if the particular parameter is pruned. The change of the cost function a ( x ( k ) ) when the current a posteriori estimate i(k) is changed to i ( k ) + & ( k ) , can be obtained from the local1 approximaton of the cost (8) using the Taylor series expansion around i ( k ) . Taking into account that V , J ( i ( k ) ) = 0 , we obtain:
1 a(i(k))=-Gx(k).V,(VJ(i(k)))T 2
.6x(k)
(16)
The estimate of the change 6 i ( k ) should minimize (16), subject to the constraint: & ( k ) T U P = -i(k)
T
up
(17)
where u p is the unit vector for the p-th parameter. The minimum of (16), obtained under the constraint (17) by applying the Lagrangian multiplier approach, represents the saliency coefficient of the p-th parameter:
The matrix T ( k ) = V,(V,J(~(X)))~denotes the Hessian of the cost function (8). The saliency (18) is obtained for the parameter change: &(k) = -
i ( k ) Tu p
T (k)-' u p
u;T(k)-i u p
In order to obtain (1 8) and (I 9), the inverse of the Hessian should be calculated. The Hessian of the cost function (8) is given by: T ( k ) = P-(k)-'
+ HTR(k)-!H .
(20)
It can be shown that the inverse of the Hessian (20) is P ( k ) = T ( k ) - ' . Therefore, the saliency coefficient of the p-th parameter can be rewritten as:
On-line Adaptation of Recurrent Radial Basis Function Networks
83
Parameter significance Assuming that the process and measurement noise are normal with zero mean and known variances, the estimate of the p-th component of the parameter vector is normal with the mean equal to the unknown true value xp(k), and with variance Ppp( k ) .The
hypothesis
that
the
parameter
is
statistically
insignificant
H o : x p ( k ) = 0 , is accepted as true if the folowing criterion is satisfied:
The P( 1 i ; ( k )
test
threshold
is
obtained
from
the
probability
constraint
I< y / H o ) = 1-a .
Comparing the saliency coefficient S,(k) of the p-th parameter, and the i i ( k ) we conclude that I $ i ( k ) /= therefore the parameter insignificance test can be rewritten as: -/,
b2 is satisfied. 4.2
RRBF network pruning
During adaptation to time-varying environment some connections or hidden neurons may become insignificant and should be pruned. A connection is insignificant if its parameter and the parameter change are both insignificant. In RRBF network the significance of the input connection is determined based on
120 B. Todorovid,
M.Stankovid and C. Moraga
width c g r the , significance of the recurrent connections based on the width uil, and the significance of the output connection based on weight ai . The well-known pruning method OBS [I], ranks parameters according to the saliency, which is defined as the change in the training error when the particular parameter is eliminated. The parameter with the smallest saliency is pruned. The similar idea is used in [4,5] to derive the on-line pruning method for parameters of the feed forward RBF network, estimated by extended Kalman filter. The same pruning method is applied to the RRBF network [6]. The saliency coefficient of the p-th parameter S,, defined as the minimal increase of the cost (10) when the p-th parameter is pruned, and the corresponding parameter update &(k/k) are given by:
where i ( k / k ) is the parameter estimate and P ( k / k ) is the state estimation error covariance in time step k; u p is p-th unit vector. The hypothesis H o that the parameter is statistically insignificant, is accepted if [6]:
I i ; ( k / k ) I< y, i ; ( k / k ) = Pp,(k/k)-*’*ip(k/k) where y is obtained from P(I $ ( k / k )
i ; ( k / k ) we conclude that I i ; ( k / k ) I=
(15)
I< y/Ho) = 1- a . Comparing
S , ( k ) and
J2s,(k), therefore parameter insignificance
test can be rewritten as: J2Sp(k) < y . In order to test significance of the parameter change, the following fading memory sum forp-th parameter is considered [6]: 6,W
= ~ ~ - ~ ~ ~ ~ ~ ~ ~ ~ , ~ ~ - ~ ~ + ( ~(16) , 7
where 0 < p < 1 and &,,(k - 1) is the p-th parameter’s process noise variance and $,(k) is the activity of the i-th hidden neuron, to whom parameter belongs. The distribution of (16) is approximately the scaled chi-square 6,(k)
- qf, where
c = 1/(2 - p) represents the scaling factor and n’ = (2 - p ) / p is the number of degrees of freedom. The hypothesis H o , that the estimated parameter change is consistent with the process noise variance, is accepted if S P ( k )~ [ b , b zwhere ] , the acceptance interval is chosen to be (1 -a).100% probability concentration region
for 6,(k).
Extended Kalman Falter Based Adaptation
121
The specialized or rarely activated hidden neuron should be pruned if all of its output connections have statistically insignificant parameters or at least one of its input connections has insignificant, small width. Hidden neuron is significant if all of its input connections have significant width and at least one of the output connections has significant weight. 5
Experiments
We have developed the NARX RRBF network to be applied in modeling and control of nonlinear dynamic systems with time-varying dynamics. In our preliminary experiments the network was reduced to the recurrent NAR model (no exogenous inputs) for non-stationary time series prediction. The following examples show that learning algorithm produces very compact networks with small number of hidden neurons and adaptable parameters.
5.1
Non-stationary Mackey-Glass time series prediction
EKF trained RRJ3F was applied to prediction of non-stationary Mackey-Glass time series, defined by differential delay equation:
i ( t ) = -bs(t)
+ as(t - r)/(l
- s(t - z)")
,
(17)
with a = 0.2, b = 0.1. We integrated the equation (17) for 0 < t < 3 100s using forth order Runge Kutta with step size 0.1, and the history initialized to 1.2. During integration delay z was varied according to: t
= 23.5 + 6.5(0.7sin(2xt/3100)+0.3sin(5nt/3100)),
(18)
Every tenth sample was used, and samples from first 100s were discarded, leaving s(k),k = 1,..,3000 samples for training. Observation were obtained according to y ( k ) = s ( k ) - cos(3d/3000) + v(k) , where the v(k) was Gaussian white noise with a variance which gave the signal to noise ratio SNR=40dB. The RRBF network was trained to predict x(k + 6) from x(k) , x(k - 6), x(k - 12), x(k - 18), x ( k - 24), where x(k) was the output of the RREiF network at time step
k, and observations y ( k ) ,k = 30,...,3000 were used for training. At the end of data sequence learning algorithm produced the RRBF network with 5 recurrent connections and 4 hidden neurons (3 significant), i.e. 45 adaptable parameters. The root mean square error, measured on the last 1000 data samples was FWSE=0.0209. Note that learning was sequential with each training sample presented only once to the network. Network was always asked for prediction before the parameter or structure adaptation. Therefore we can consider that the given RMSE is a test error measure.
122
B. TodoroviC. M . StankoviC and C. Moraga 2.5
---"
2.5,
y(k)
- RRBF output
2
2
2 G
'c
$ 1.5
-
1
y
0.5
I
f j
1.5
1
V
0.5 -0.5
I
I 2600
2700 2800 Time steps: k
2900
500
3000
1000
1500
2000
2500
3000
T h e steps: k
a) Comparison
b) Comparison
21.5-
s
10.5-
P
0
500
1000
1500
2000
2500
3000
Time steps: k
c) Innovation
Time steps. k
d) Growth and pruning pattern
Fig. 2. Results of non-stationary Mackey-Glass time series prediction
5.2
Non-stationary Lorenz time series prediction
We have applied recurrent radial basis function network with adaptive structure to model the chaotic system described by Lorenz equations [3]:
where p , o and p are adjustable parameters. Equations (19) were integrated for 0 < t < 75s using forth order Runge Kutta method with step size 0.01. Parameters p , o and p were varied according to:
Extended Kalman Filter Based Adaptation
123
The time series from the variable s2 (sampled at a period of 0.05 seconds) was scaled between [-1,1] and Gaussian white noise v(k) was added to obtain the observations y(k),k = 1,...,1500 with SNR=40&. The FW3F network was trained
to predict x(k + 1) from x(k) , x ( k - 1) , x ( k - 2 ) , x(k - 3), where x ( k ) was the output of the RRBF network at time step k.
i
1
8
0.5
tm 2
-
0
5 -0.5 1 I
I
I100
1200 1300 Time steps: k
1400
1500
a) Comparison
l
-
1000
500
200
400
600 800 1000 1200 1400 Time steps k
b) Companson
1500
Time steps: k
c) Innovation
d) Growth and pruning pattern
Fig. 3. Results of non-stationary Lorenz time series prediction
Each training sample was presented only once to the network. At the end of data sequence the RRBF network with 4 recurrent connections had 3 hidden neurons (2 significant), i.e. 28 adaptable parameters. The root mean square error, measured on the last 1000 data samples was RMSE=0.0147. 6
Conclusions
Extended Kalman Filter is applied to the parameter, state and structure estimation of recurrent radial basis fiinction network. The architecture of recurrent radial basis function network is based upon nonlinear autoregressive model with exogenous
124
B. TodoroviC, M. StanlcoviC and
C. Moraga
inputs. Augmented state of the recurrent radial basis function network is the stacked vector consisting of parameters and outputs of the network. This state is estimated using extended Kalman filter. The on-line structure adaptation of the network is achieved by combining growing and pruning of the hidden units and connections of the network. Statistical criteria for growing and pruning were derived using the state estimation error and innovation statistics. Recurrent network trained by extended Kalman filter is applied in non-stationary time series prediction. 7
References 1. Hassibi, B., Stork, D., Wolff, G. J.: Optimal Brain Surgeon and General Network Pruning, IEEE Int. Conf: Neural Networks, San Francisco (1993) 293299. 2. Kadirkamanathan, V.: A statistical inference based growth criterion for the RBF network, In Proc. IEEE Workshop on Neural Networks for Signal Processing (1 994) 3. Lorenz, E. N.: Deterministic non-periodic flow, J. Atm. Science, vol. 20, pp. 130-141, 1963 4. TodoroviC, B.: Incremental adaptation of RBF network structure, Master Thesis, University of NiS (2000) 5. TodoroviC, B., StankoviC, M, Todorovic-Zarkula, S.: Structurally adaptive RBF network in non-stationary time series prediction, In Proc.IEEE AS-SPCC, Lake Louise, Alberta, Oct. 1-4 (2000) 224-229 6 . TodoroviC, B., StankoviC, M : Training recurrent radial basis function network using extended Kalman filter: parameter, state and structure estimation, In Proc. South-Eastern Europe Workshop on Computational Intelligence and Information Technology, Press T. University of Nis, Yugoslavia (200 1) 7. Williams, R.J.: Some observations on the use of the extended Kalman filter as a recurrent network learning algorithm, Technical Report NU-CCS-92-1. Boston: Northeastern University, College of Computer Science (1992)
A MULTI-NF APPROACH WITH A HYBRID LEARNING ALGORITHM FOR CLASSIFICATION DANUTA RUTKOWSKA AND ARTUR STARCZEWSKI Technical University of Czestochowa, Department of Computer Engineering, Czesfochowa, Poland E-maii:
[email protected] The paper presents an approach to classification based on neuro-fuzzy systems and hybrid learning algorithms. A new method of rule generation is proposed. The rules are used in order to create a connectionist neuro-fuzzy architecture of the multi-NF system. Parameters of the rules are then adjusted by a gradient algorithm. Thus, the system can be employed to solve multi-class classification problems. Some examples are depicted. Keywords: neuro-jizq systems, connectionist networks, learning methods, rule generation, classificationproblems, intelligent systems
1
Introduction
Many different systems have been applied to classification problems. In the area of Computational Intelligence, neural networks, fuzzy systems and neuro-fuzzy systems are widely employed as classifiers; see e.g. [5], [S], [17]. In Section 2 , a neuro-fuzzy system that solves the well-known IRIS classification task [3] is presented. This kind of system can be used successfully in many problems concerning classification, as well as control or function approximation; for details, see [9]. However, in order to achieve better performance of the classification in the case of many classes, the multi-NF system described in Section 3 has been proposed. This system is composed of connectionist neuro-fuzzy (NF) networks, similar to the NF network illustrated in Fig. 1. In Section 4, rule generation algorithms are depicted and a new method is proposed in order to create the NF networks. The number of these rules determines the number of elements (nodes) of the networks. Hybrid learning methods of the NF networks are described in Section 5. A gradient algorithm is applied to parameter tuning. Thus, the parameters of membership functions of the fuzzy IF-THEN rules can be adjusted. Examples of classification problems, solved by means of the multi-NF approach, are depicted in Section 6. Apart from the IRIS classification task, medical diagnosis applications have been considered. The medical data are available on the Internet [7]. Conclusions and final remarks are presented in Section 7. It should be mentioned that the multi-NF approach was introduced in [ 151, also presented in [13], [16], as well as [9], and referred to as the multi-segment or
125
126
D. Rutkowska and A . Starczewski
hierarchical system. However, the learning methods differed from the new algorithm proposed in this paper. 2
Neuro-fuzzy systems for classification
As mentioned in Section 1, classification problems can be solved by means of neural networks, fuzzy systems, and neuro-fuzzy systems. In this paper, neuro-fuzzy systems in the form of connectionist multi-layer networks are considered as classifiers. Fig.1 illustrates such a network that can be used in order to solve the well-known IRIS classification task. The neuro-fuzzy system (network), which is also called the fuzzy inference neural network, is a multi-layer architecture, similar to classical neural networks. However, the elements (nodes, neurons) perform different functions than neurons of neural networks, except of two classical (linear) neurons that realize the sum operations. The neuro-fuzzy network shown in Fig.1 represents a fuzzy system that employs the singleton fuzzifier, Mamdani approach to fuzzy inference with product operation as the Cartesian product, and center-average defuzzification method; see e.g. [ 181, [9], for details. The system portrayed in Fig.1 performs the fuzzy inference using 3 fuzzy IF-THEN rules in the following form: R' : IF x, is A / AND ... AND x,,is A: THEN y is B J
(1)
where x1,.. . ,x, and y are linguistic variables, A,', ...,A,' and B are fuzzy sets; j = 1,. . . ,N . The linguistic variables correspond to the inputs and output of the system. In Fig. 1, there are 3 rules, and 4 inputs, which means that N = 3 and n = 4 . Values of the linguistic variables can be crisp or fuzzy; XI, X2,X 3 , X4 and j j denote crisp input and output values, respectively. The fuzzy sets A : , . . . ,AnJand B are defined in the universes of discourse
XI,..., X , , , Y c R , w h e r e x,,?, E X , and y , y ~ Y , f o ri = l , ..., n . The membership functions of the fuzzy sets A { , . . . ,A,/ and B are usually chosen as Gaussian or triangular functions. In this paper, the Gaussian functions are employed. These functions are expressed as follows:
and
A Multi-NF Approach f o r Classajication 127
vJ,
where T,',o;' and oJare center and width parameters of these membership functions, respectively. In the neuro-fuzzy network illustrated in Fig.1, elements of the first layer, denoted as A,' , for i = 1, ... ,n , j = 1,. ..,N , realize the membership functions (2). Elements of the second layer perform the Cartesian product of the fuzzy sets A / , ...,A,' , using the product operation of the membership functions of these fuzzy sets. The output values of these elements represent the antecedent matching degree. This part of the network corresponds to the IF part (antecedent part) of the rules (1). The next part of the network shown in Fig.1 realizes the center-average defuzzification. The two linear neurons and the element that performs the division operation constitute the defuzzification layer. The weights of the first linear neuron, denoted as v',v2,v3, have the interpretation of the centers of the membership functions of the consequent fuzzy sets B / , for j = 1,2,3. Hence, v J = . These fuzzy sets can also be chosen as singletons, which means that their membership functions equal 1 for y = for y
f
vJ.
Fig. 1. Neuro-fuzzy system for IRIS classification
v J and
0
128
D. Rutkowska and A . Starczewski
As we see in Fig.1, the elements of the second layer, which produce the antecedent matching degree, also called the degree of activation of the rule (or rule firing level) are connected with the appropriate v J , for j = 1,2,3, corresponding to the conclusion part (THEN part) of the rule. The neuro-fuzzy network presented in Fig. 1, as mentioned earlier, can be used in order to solve the IRIS classification task. The problem is to classify the vectors of features of the iris flowers to the proper class of iris species. There are three species of iris: Sestosa, Versicolor, and Virginica. Thus, 3 classes are distinguished. Four features of the iris flowers are measured: sepal length, sepal width, petal length, petal width. Hence, the data vectors that contain the flower measurements include 4 components. The well-known Fisher’s iris data set [3] is composed of 150 data items (vectors), 50 for each of the iris species. The IRIS classification task can be solved using the neuro-fuzzy system shown in Fig.1, based on 3 rules that correspond to the iris species. Thus, the consequent parts of these rules represent the classes of the Sestosa, Versicolor, and Virginica. Values of the weights v’ ,v 2 ,v 3 can be chosen as 1,2, 3, respectively, which means the first (Sestosa), second (Versicolor), and third (Virginica) class. The inputs X,,X,,X3,X, are components of the data vectors. Now, the problem is to choose proper values of the parameters (center and width) of the Gaussian membership functions (2). Let us analyze values of the data of iris flowers. The values of the first feature (sepal length) range from 4.3 to 5.8 for the Sestosa, from 4.9 to 7.0 for the Versicolor, and from 4.9 to 7.9 for the Virginica. The values of the second feature (sepal width) range from 2.3 to 4.4 for the Sestosa, from 2.0 to 3.4 for the Versicolor, and from 2.2 to 3.8 for the Virginica. The values of the third feature (petal length) range from 1.0 to 1.9 for the Sestosa, from 3.0 to 5.1 for the Versicolor, and from 4.5 to 6.9 for the Virginica. The values of the fourth feature (petal width) range from 0.1 to 0.6 for the Sestosa, from 1.0 to 1.8 for the Versicolor, and from 1.4 to 2.5 for the Versicolor. It is easy to notice that the ranges of the sepal length and sepal width features overlap for every class of the iris species. The ranges of the petal length and petal width for the Sestosa are separated from the ranges of these features for the Versicolor and Virginica, which overlap. Thus, it is more difficult to correctly classify the data vectors that belong to the Versicolor and Virginica than to the Sestosa class. In the case of the iris classification, we can determine the parameters of the antecedent membership functions based on the ranges of the feature values, portrayed above. It seems reasonable to choose the center parameters as the centers of these ranges, and the width parameters as the half of these ranges. Hence, we obtain the following values of the center and width parameters of the membership functions (2), for i = 1,2,3,4, and j = 1,2,3 :
A Multi-NF Approach for Classijication 129
A,' : i?: = 5.05,
0:= 0.75
; A: : i?; = 5.95 , 0;= 1.05 ; A: :
Y:
= 6.4, oI = 1.5 3
A : : Z : = 3 . 3 5 , 0:=1.05 ; A ~ : ? ~ = 2 . 7 0 , 0 ~ = 0 . 7A; ~ : Z ~ = 3 . 0 , ~ ~ : A::Zi =1.45, 0: =0.45; A,' : Z,'=4.05, 0: =1.05; A: : ? .: =5.7, o3 3 = 1.2
A::?: =0.35,
D',
=0.25;
A,' : Y,' =1.4,
0: =0.4;
A: : 2; =1.95,
0: ~
0.55
The neuro-fuzzy system shown in Fig.1, with the antecedent membership functions presented above, performs very well solving the iris classification problem. However, several mistakes occur with regards to the data vectors that belong to the overlapping region of the Versicolor and Virginica classes; even after the gradient learning. In order to eliminate these misclassifications, more rules should be applied. The multi-NF approach, described in the next section, when employed to the iris classification task, uses more rules but allows to perform the classification without mistakes, also concerning the overlapping region. 3
Multi-NF approach
The neuro-fuzzy system presented in Section 2 can be employed to various classification tasks. Of course, the number of inputs as well as elements of the first and second layers differ depending on the problem to be solved. The systems of this kind usually perform very well in the case of two distinct classes. It is more difficult to create a similar system in order to classify data to many classes, especially in the case when the regions of the data vectors - associated with the particular classes - overlap in the high dimensional space. Therefore, the multi-NF system has been proposed in [ 151. This system is composed of single NF networks corresponding to the separate classes; see Fig.2. The output module (OM) produces the output value of the system that informs about the overall classification result that is determined based on the outputs of the NF networks.
130
D. Rutkowska and A . Starczewski
lnput vector
_I, NF1 Classification result
b NF2
NFM
OM
b
'
Fig.2. Multi-NF system for classification
The idea of the multi-NF system is to decompose a multi-class classification problem to simpler tasks when data vectors that belong to one class must be separated from others. Each NF network is associated with one individual class and is responsible for recognizing data vectors belonging to this class. The number of the networks in the multi-NF system, M, equals to the number of classes. However, it can equal to the number of classes minus one, since it is obvious that if a data vector is not accepted by every NF network of the system, it belongs to the one remain class which is not associated with the NF networks. There are two possible ways of realizing the multi-NF system. The input vectors can enter into every NF network in parallel or only the first network receives all the input data vectors entered into the system. In the latter case, an input vector is entered into the second NF network if it is discarded by the first one as not belonging to the first class. Thus, the second network as well as the next ones do not receive the input data vectors recognized as members of the first class which is associated with the first NF network. Similarly, the data vectors accepted by the second NF network as belonging to the second class (associated with this network) do not need to be entered into the input of the third and subsequent NF networks. The output values of both systems, for a given input data vector, equal to the output of the NF network which has recognized this input vector as belonging to its class. Thus, the output module transmits the proper NF network output to the output of the system. The output value of the NF network which accepts an input data vector as a member of its class differs from zero, while the output values of the preceding networks of the system equal zero. This kind of performance is resulted from the NF network architecture. The connectionist multi-layer architecture of the network is illustrated in Fig.3. The first NF network, in the multi-NF system, with non-zero output value is taken into account by the OM unit to produce the output value of the system.
A Multi-NF Approach for Classification
131
The NF network is similar to the well-known multi-layer network that represents the fuzzy system based on the Mamdani approach to fuzzy inference, singleton fuzzifier, center-average defuzzification method, Gaussian membership functions, and product operation as the Cartesian product; see [ 181, [9], for details. The first layer refers to antecedent fuzzy sets. The elements of this layer realize Gaussian membership functions of the fuzzy sets in the antecedent part of the IFTHEN rules. The second layer consists of the elements that perform the product operation. Both layers correspond to the first layer of the RBF network which is equivalent to the fuzzy system (in the case when width parameters of the Gaussian functions are equal); for details see [4]. The next layer contains the elements that realize sigmoidal functions. This is the additional layer, introduced in [15] especially for the NF network of the multiNF system in application to classification problems.
Fig.3. NF network of the multi-NF system
The last layers, which include two classical linear neurons and the element that perform the division operation, realize the center-average defuzzification. In addition, the constant R that represents the so-called zero rule was introduced in [ 151for the NF network of the multi-NF system. It is worth emphasizing that the neuro-fuzzy systems for classification do not need the defuzzification layers since the classification result reflects the outputs of the second layer. The elements of the second layer produce values of the so-called antecedent matching degree, that is the degree of activation of the rule (or rule firing level). It represents the membership of the input vector in the Cartesian product of the antecedent fuzzy sets. In classification tasks, each rule is associated with a corresponding class, so the maximal value of the antecedent matching degree indicates the class to which the input data vector belongs. The additional sigmoidal
132
D . Rutkowska and A . Starczewski
layer, as well as the zero rule, can be introduced in order to produce the proper classification decision at the output of the system. Of course, the system output gives the same classification results as concluded from the outputs of the second layer. The sigmoidal layer and the constant R play an important role in this network since every rule corresponds to the same class. As mentioned earlier, each NF network is responsible for recognizing data vectors that belong to one class, associated with this network. The weight value v of the classical (linear) neuron represents this class in the consequent part of the IF-THEN rules. The NF networks can be trained in the similar way to classical neural networks, using the gradient algorithm based on the steepest descent optimization method. This kind of algorithm is analogous to the back-propagation algorithm which is commonly applied to train neural networks [17]. The gradient method for learning classical neuro-fuzzy networks is presented in [ 181, [9]. Learning methods of the NF networks are described in Section 5. 4
Rule generation algorithms
The multi-NF system for classification, illustrated in Fig.2, is composed of M networks shown in Fig.3. Each NF network consists of different number of elements (nodes) in the first three layers. The number of these nodes depends on the number of fuzzy IF-THEN rules. Each NF network is responsible for one class, associated with this network. Different number of the fuzzy rules can be applied to recognize data vectors that belong to an individual class. Each NF network of the multi-NF system represents a fuzzy system that should decide whether an input data vector belongs or does not belong to the class associated with this network. The decision is inferred based on the collection of fuzzy IF-THEN rules that have the same conclusion part, that is “THEN class ck “ for the NF k network, where k = 1, ... ,M . The problem is how to generate this collection of the rules for each NF k network. In this paper, the following algorithm is proposed: 1. For k = 1,...,M , collect the labeled data vectors z I , for 1 = 1,.. . qk , associated with class k , where qk is the number of the data vectors labeled to the class k. 2. Set k = l . 3. Fix small values of D and z . 4. Set c = l , a n d i = l , Z=1. 5. Create cluster Vi with the prototype (cluster center) v j = z I . 6. For 1 = 2,. . .qk , check the Euclidean distance between data vector z I and the prototype vi ,according to the condition:
A Multi-NF Approach for Classijcation 133
(4)
J ~ z-/vi))5 D
If inequality (4) is satisfied, include data vector zI into cluster Vi , and set w J. = z I , for j = 1,. ..,oi , where oiis the number of the data vectors in the cluster Vi . 7 . Update the cluster center v i using the following formula: Wi 1 v; := - v; + x w j I+.;( j=,
)
8. For $l = 2, \dots, q_k$, and $i = 1, \dots, c$, if $z_l \notin V_i$ for every $i$, create new clusters by returning to step 5 with $c := c + 1$.
9. If stopping criteria (for example, a desired number of clusters) are not met, increase the value of D as follows:
$$D := D + \tau \qquad (6)$$
and replace the set of the labeled data vectors $z_l$, for $l = 1, \dots, q_k$, by the set of the prototypes $v_i$, for $i = 1, \dots, c$. Then, return to step 4.
10. If the stopping criteria are met, or formula (5) does not change the values of the prototypes, use the obtained clusters $V_i$, $i = 1, \dots, c$, in order to formulate the fuzzy IF-THEN rules. The number of the rules equals the number of the clusters, that is, c. The membership functions of the antecedent fuzzy sets are Gaussian functions with center and width parameters determined as components of the prototype vectors and the value of D, respectively.
11. If $k < M$, then $k := k + 1$, and go to step 3 to generate fuzzy IF-THEN rules for the next class. Otherwise, stop.
The idea of the algorithm proposed above is to decrease the size of the initial data set, replacing the data by the prototypes. Then, the number of the clusters obtained is decreased in a similar way, when the prototypes are treated as the data vectors and new (bigger) clusters are created. The fuzzy IF-THEN rules are formulated based on the final clusters and their prototypes, determined by this clustering algorithm. Other methods for rule generation can also be applied. The algorithm presented in this section is similar to that introduced in [15] and described in [9]-[12]. Those algorithms allow one to obtain clusters and generate fuzzy rules, based on the cluster centers, but use a different method of updating the prototypes. The well-known fuzzy clustering algorithm, called the fuzzy c-means [2], can also be employed. However, it is worth emphasizing that in this case the number of clusters, c, is not determined by this method and has to be fixed. This means that we should know the number of rules before using this algorithm.
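As an illustration, the following Python sketch implements the procedure above under stated assumptions: Euclidean distance for condition (4), the averaging update (5), and a target number of clusters as the stopping criterion; the function and variable names (cluster_pass, generate_rules, target_clusters) are illustrative, not part of the original formulation.

import numpy as np

def cluster_pass(vectors, D):
    # One pass of steps 5-8: each vector joins the first cluster whose
    # prototype lies within distance D (condition (4)); otherwise it
    # seeds a new cluster (step 8).
    prototypes, members = [], []
    for z in vectors:
        for i, v in enumerate(prototypes):
            if np.linalg.norm(z - v) <= D:
                members[i].append(z)
                break
        else:
            prototypes.append(z.copy())
            members.append([])
    # Step 7: update each prototype by formula (5), averaging the old
    # center with its o_i member vectors.
    for i, v in enumerate(prototypes):
        o = len(members[i])
        if o > 0:
            prototypes[i] = (v + np.sum(members[i], axis=0)) / (1 + o)
    return prototypes

def generate_rules(class_vectors, D=0.1, tau=0.05, target_clusters=5):
    # Steps 9-10: cluster, then treat the prototypes as data and enlarge
    # D by tau (formula (6)) until the stopping criterion is met.
    data = [np.asarray(z, dtype=float) for z in class_vectors]
    while True:
        prototypes = cluster_pass(data, D)
        if len(prototypes) <= target_clusters or len(prototypes) == len(data):
            break
        data, D = prototypes, D + tau
    # Each final prototype becomes the center of a Gaussian antecedent
    # fuzzy set, with D as its width parameter (step 10).
    return [(center, D) for center in prototypes]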
5 Hybrid learning methods
The rule generation method proposed in Section 4 is employed in order to construct the connectionist multi-layer architectures of the NF networks presented in Fig.3. As mentioned earlier, the number of elements in the first three layers of these networks depends on the number of the fuzzy IF-THEN rules. The clustering algorithm described in Section 4 determines the number of rules, resulting in the information about the elements in the particular layers. Thus we know the architecture of the networks as well as the form of the rules (1). For each class, the number of rules, N, equals the number of clusters, c. The clustering algorithm not only allows one to construct the NF architectures but also provides initial values of the parameters of the membership functions. This means that this algorithm can be treated as a learning method that finds values of the parameters (centers and widths of the membership functions) based on the labeled data vectors. However, these values are not optimal and should be tuned using another learning method, for example, a gradient algorithm. Thus, the hybrid approach (see Fig.4) is recommended in order to determine optimal values of the parameters. The clustering algorithm is depicted in Section 4. The gradient algorithm for the NF network shown in Fig.3 is presented in [15]. The sigmoidal functions, realized by the elements in the third layer of the NF network portrayed in Fig.3, are expressed by the following formula:
$$s(\tau) = \frac{1}{1 + \exp[-p(\tau - h)]} \qquad (7)$$
where $h \in (0,1)$, $p > 0$. It is easy to notice that the network illustrated in Fig.3 is described by the function:
$$\bar{y} = \frac{v \sum_{j=1}^{N} s(\tau_j)}{R + \sum_{j=1}^{N} s(\tau_j)} \qquad (8)$$
where
$$\tau_j = \prod_{i=1}^{n} \mu_{A_i^j}(x_i) \qquad (9)$$
is the antecedent matching degree (rule firing level, degree of activation of the rule), and $\mu_{A_i^j}$ is defined by Equation (2).
The constant h, in formula (7), is interpreted as a threshold membership value that refers to the rule firing level. The constant p defines the slope of the sigmoidal
function. A large value of p is better for the system performance (for the testing phase), but a lower value is used by the learning method presented below. The gradient algorithm, introduced in [15], adjusts the parameters $\bar{x}_i^j$, $\sigma_i^j$ of the membership functions defined by Equation (2), for $i = 1, \dots, n$ and $j = 1, \dots, N$, realized by the elements of the first layer of the NF network shown in Fig.3. This algorithm is based on the steepest descent optimization method, and is formulated analogously to the gradient learning procedure proposed in [18] for classical neuro-fuzzy systems (Fig.1). The formulas that represent the algorithm for tuning the parameters $\bar{x}_i^j$, $\sigma_i^j$ are expressed as follows:
where $y^d$ is the desired output value of the NF network, $t = 0, 1, 2, \dots$, and $\alpha = 2\eta p$, while $\eta \in (0,1)$ is the stepsize constant of the steepest descent algorithm, also called the learning rate; usually, $\alpha = 0.7$, while $p = 10$, and $h = 0.36$. The parameter tuning by means of formulas (10), (11) is performed based on the learning sequence, which is composed of the pairs of the input data vectors with the desired output values. Although every labeled data vector can be included in the learning sequence, the following procedure is recommended for the multi-NF system. The first NF network, constructed using fuzzy rules associated with the first class, is trained based on the whole learning data set. The second NF network employs a learning sequence that is composed of fewer data vectors, because the vectors labeled to the first class are not included. Similarly, the learning sequence applied to the third NF network does not contain the data vectors labeled to the first and second classes, and so on. Thus, the last NF network uses the shortest learning sequence, which does not contain the data vectors that are labeled to any of the preceding classes. It is better to apply scaled data vectors. Initial values of the parameters $\bar{x}_i^j$, $\sigma_i^j$, for $t = 0$, are derived from the clustering algorithm described in Section 4.
Optimal values of these parameters are thus obtained by use of the hybrid learning approach illustrated in Fig.4.
Fig.4. Hybrid learning approach
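Because formulas (10) and (11) are not reproduced above, the following Python sketch only illustrates the overall hybrid scheme of Fig.4: Gaussian centers and widths initialized by the clustering stage are refined by a generic steepest-descent step. The finite-difference gradient on a squared-error cost and the single-output reduction via max are assumptions of this sketch, not the authors' exact formulas.

import numpy as np

def firing_levels(x, centers, widths):
    # Antecedent matching degrees (rule firing levels): products of
    # Gaussian memberships over the input dimensions, cf. Equations (2), (9).
    return np.exp(-((x - centers) / widths) ** 2).prod(axis=1)

def sigmoid(tau, p=10.0, h=0.36):
    # Third-layer sigmoid, formula (7), with the constants quoted in the
    # text (p = 10, h = 0.36).
    return 1.0 / (1.0 + np.exp(-p * (tau - h)))

def loss(x, d, centers, widths):
    # Squared error of a simplified network output (an assumed reduction).
    return 0.5 * (sigmoid(firing_levels(x, centers, widths)).max() - d) ** 2

def hybrid_train(X, Y, centers, widths, eta=0.035, epochs=50, eps=1e-5):
    # Gradient stage of the hybrid approach (Fig.4): parameters come from
    # the clustering algorithm of Section 4 and are tuned by steepest
    # descent over the learning sequence of (input, desired output) pairs.
    for _ in range(epochs):
        for x, d in zip(X, Y):
            for P in (centers, widths):
                grad = np.zeros_like(P)
                for idx in np.ndindex(P.shape):
                    base = loss(x, d, centers, widths)
                    P[idx] += eps
                    grad[idx] = (loss(x, d, centers, widths) - base) / eps
                    P[idx] -= eps
                P -= eta * grad
    return centers, widths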
6 Classification examples
In Section 2 a neuro-fuzzy system for the IRIS classification problem is presented. This system solves the task very well (except for the overlapping region) based on the fuzzy IF-THEN rules which have been formulated by analyzing the data set. It is worth emphasizing that this method of rule generation does not work for many other examples, especially when the ranges of feature values overlap. Therefore, the multi-NF system, depicted in Section 3, and the algorithms presented in Sections 4 and 5 are useful. To achieve better classification performance, usually more rules than the number of classes should be employed. The multi-NF system, when applied to the iris classification task, solves this problem without any mistakes. However, this system is composed of two NF networks, and one of them is constructed using 8 rules. The network associated with the Setosa class does not need more than one rule but, as explained in Section 2, this class is much easier to separate from the others. The multi-NF system has been used in order to solve some medical diagnosis tasks. The medical data sets are available on the Internet [7]. One of them is the breast cancer database, which was obtained from the University of Wisconsin Hospitals, Madison, Wisconsin, USA; see also [6], [1]. Another medical database, supplied by the Cleveland Clinic Foundation, concerns the heart disease diagnosis. The data set that contains items representing the thyroid disease has also been classified by this system. The breast cancer database has been applied to illustrate how the multi-NF system classifies the data vectors in the case when two classes are distinguished. From this database, we obtained 683 data items that contain values of 9 attributes: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitoses. The values of these attributes are integers that range from 1 to 10. In addition, each data item includes the diagnosis that corresponds to the instance represented by the attribute values. Each instance belongs to one of 2 possible
classes of the diagnosis: benign or malignant. The former is expressed in the database by the integer 2, the latter by 4. The distribution of both class instances in this set of medical data is: benign - 65.5%, malignant - 34.5%. The two classes of the breast cancer diagnosis are linearly inseparable [1]. The multi-NF system, applied to the breast cancer problem, reduces to only one NF network. However, as mentioned in Section 3, the system composed of two NF networks (one for each class) can also be employed; see [16]. Since only one is sufficient, the network corresponding to the malignant class was constructed. The system with 9 inputs (for 9-component data vectors) and 1 output (for the diagnosis) was created based on 7 rules. The rules were determined by the clustering algorithm described in Section 4. Then the gradient learning method, depicted in Section 5, tuned the parameters of the membership functions, adjusting the rules to the breast cancer task. Using this NF network, we obtained about 98% correct classification decisions inferred by the system. The heart disease database, available on the Internet, contains data items that represent instances expressed by values of 13 attributes. The features taken into account as the attributes are, e.g.: age (in years), sex (1 - male, 0 - female), chest pain type (four different types - values 1, 2, 3 or 4), resting blood pressure (in mm Hg), serum cholestoral (in mg/dl), fasting blood sugar (1 - if greater than 120 mg/dl, and 0 - otherwise), resting electrocardiographic results (three states - values 0, 1, 2), maximum heart rate achieved, exercise induced angina (1 - yes, 0 - no). Each of the data items representing instances of the heart disease is associated with one of 5 classes, i.e. diagnoses, expressed by the integers 0, 1, 2, 3, 4. However, almost all published experiments with this database distinguish only two types of diagnosis: presence or absence of the disease. Let us notice that the data set of 297 items includes 160, 54, 35, 35, and 13 items which belong to classes 0, 1, 2, 3, and 4, respectively. It is much easier to classify the heart disease data vectors to one of two classes: negative diagnosis (class 0) and positive diagnosis (classes 1, 2, 3, 4 treated as one class). In this case, the former contains 160 and the latter 137 data vectors. The heart disease problem with two classes of the diagnosis has been solved by means of classical neuro-fuzzy systems (in the form of the network shown in Fig.1) as well as the multi-NF system. In [11], [12] some results concerning the use of fuzzy inference neural networks, in application to this task, are presented. Those systems employ fuzzy IF-THEN rules generated by clustering algorithms that differ from the one proposed in this paper. The multi-NF system, applied to solve this problem, reduces to only one NF network, similarly to the breast cancer diagnosis task. This network was created based on 18 rules obtained using the algorithm depicted in Section 4. Then, the network was trained according to the hybrid approach described in Section 5. This system inferred about 95% correct diagnosis answers when the heart disease data vectors were entered into its inputs. The multi-NF system is especially useful for classification tasks when more than two classes are distinguished. Therefore, we applied this system in order to solve the heart disease problem with 5 classes, i.e. one class of negative diagnosis
and four classes of positive diagnosis. The multi-NF system composed of four NF networks, associated with the positive diagnoses, was employed. The networks NF 1, NF 2, NF 3, and NF 4 correspond to classes 4, 3, 2, and 1, respectively. Each network was created based on fuzzy rules generated by the algorithm proposed in Section 4. Thus, these networks incorporate 8, 18, 13, and 16 rules, respectively. The classification result is at least 90% correct diagnoses and can be better if more rules are used. It is worth emphasizing that there is a very small number of the heart disease data vectors assigned to classes 3, 4, and especially class 5. Therefore, it is almost impossible to split the data sets into learning and testing vectors. When we attempted to separate learning and testing sets, the system performed even better on the learning data vectors (96.7% correct answers) but much worse on the testing data. Of course, this system incorporated fewer rules (6, 12, 11, 14, for the particular NF networks). In this case, the learning and testing sets are composed of 212 and 85 data vectors, respectively, randomly chosen from the database containing 297 data items. The multi-NF system composed of the NF networks constructed based on 2, 7, 3, and 9 rules, respectively, performed better on the testing set than the previous system (with the larger number of rules) but worse for the input vectors that belong to the training sequence. The thyroid medical data, available on the Internet, have a learning set and a testing set already prepared. The former contains 3772 instances and the latter 3428 data vectors. Each instance is represented by values of 21 attributes, and belongs to one of 3 classes of diagnosis: one class of negative diagnosis and two classes of positive diagnosis. More precisely, the first class refers to normal (not hypothyroid), and the next ones mean hyperfunction and subnormal functioning, respectively. These classes are expressed in the database by the integers 1, 2, 3. The feature values, similarly to the heart disease data, are in the form of integers or real numbers. There are 15 attributes having binary values, and 6 attributes characterized by continuous values ranging from 0 to 1. Since the thyroid data vectors include more components, corresponding to the features (attributes), than those of the heart disease database, the problem is more difficult to classify. On the other hand, this data set contains many more data items and, in addition, prepared learning and testing sets. Thus, the performance of the system designed to solve the thyroid problem can be checked more precisely. The multi-NF system, used in order to classify the thyroid data vectors, consists of two NF networks that correspond to the positive diagnoses (hyperfunction and subnormal functioning classes). The first network was created based on 25 rules, while the second one incorporates 53 rules, generated by the clustering algorithm proposed in Section 4. The system was trained using the gradient method described in Section 5. When the learning vectors were employed to test the system performance, we observed 94.2% of diagnoses correctly inferred. Using the testing vectors, the percentage of correct answers was 93%. It is explained in [7] that a good classifier, applied to the thyroid problem, should produce more than 92%
correct diagnosis answers. Thus, the classifier proposed in this paper fulfils this requirement.
7 Conclusions
From the simulation results presented in Section 6, we conclude that the multi-NF system, although introduced to solve classification problems with many classes, is also very useful when only two classes are distinguished. The idea of this system is to accept data vectors that belong to the class represented by the NF network and to reject those which are members of another class. Thus, if the data vector activates the rules incorporated by the NF network, a non-zero output value is produced by the system; otherwise the output value equals zero. In the case of many classes, the multi-NF system, composed of many NF networks, realizes the idea of accepting an input data vector by a particular network as a member of its class and rejecting this vector if it does not belong to this class. Thus the system works according to the main idea of classification, which is to recognize similar objects as members of one class and distinguish them from other classes. The clustering method proposed in this paper generates fuzzy IF-THEN rules that are employed to construct the NF networks - the components of the multi-NF system. This method is a part of the hybrid approach that also includes the gradient algorithm to tune the parameters of the membership functions. The multi-NF system with the hybrid learning method can be treated as an intelligent system [9]. Such a system can create its architecture and adjust to the classification problem. This kind of system differs from those which use a rule base prepared by experts. It is worth mentioning that the idea of using a multi-network architecture is also employed by other authors, for example, applying ARTMAP networks [14].
References
1. Bennett K.P., Mangasarian O.L.: Robust linear programming discrimination of two linearly inseparable sets, Optimization Methods and Software 1, Gordon & Breach Science Publishers, 1992, pp.23-34.
2. Bezdek J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
3. Fisher R.A.: The use of multiple measurements in taxonomic problems, Ann. Eugenics, Vol.7, 1936, pp.179-188.
4. Jang J.-S.R., Sun C.-T.: Functional equivalence between radial basis function networks and fuzzy inference systems, IEEE Transactions on Neural Networks, Vol.4, No.1, 1993, pp.156-159.
5. Kuncheva L.I.: Fuzzy Classifier Design, Physica-Verlag, A Springer-Verlag Company, Heidelberg, New York, 2000.
6. Mangasarian O.L., Wolberg W.H.: Cancer diagnosis via linear programming, SIAM News, Vol.23, No.5, 1990, pp.1-18.
7. Merz C.J., Murphy P.M.: UCI repository of machine learning databases, http://www.ics.uci.edu/pub/machine-learning-databases.
8. Nauck D., Klawonn F., Kruse R.: Foundations of Neuro-Fuzzy Systems, John Wiley & Sons, 1997.
9. Rutkowska D.: Neuro-Fuzzy Architectures and Hybrid Learning, Physica-Verlag, A Springer-Verlag Company, Heidelberg, 2002.
10. Rutkowska D., Starczewski A.: Two-stage clustering algorithm, Proc. 4th Conference on Neural Networks and Their Applications, Zakopane, Poland, 1999, pp.220-221.
11. Rutkowska D., Starczewski A.: Neuro-fuzzy system with clustering algorithm in application to medical diagnosis, Proc. 4th Conference on Neural Networks and Their Applications, Zakopane, Poland, 1999, pp.533-538.
12. Rutkowska D., Starczewski A.: Fuzzy inference neural networks and their applications to medical diagnosis, In: Fuzzy Systems in Medicine, Szczepaniak P.S., Lisboa P.J.G., Kacprzyk J. (eds.), Physica-Verlag, A Springer-Verlag Company, Heidelberg, 2000, pp.503-518.
13. Rutkowska D., Starczewski A.: A neuro-fuzzy classifier that can learn from mistakes, Proceedings of the 10th International Conference on System Modelling Control, Zakopane, 2001, pp.189-194.
14. Sincak P., Hric M., Sarnovsky J., Kopco N.: Fuzzy cluster identification in the feature space using neural networks, Proc. 6th International Conference on Soft Computing (IIZUKA 2000), Iizuka, Japan, 2000, pp.849-854.
15. Starczewski A.: Hybrid Learning of Neuro-Fuzzy Systems Using Clustering Algorithms, Ph.D. Thesis, Technical University of Czestochowa, Poland, 1999; in Polish.
16. Starczewski A., Rutkowska D.: New hierarchical structure of neuro-fuzzy systems, Proc. 5th Conference on Neural Networks and Soft Computing, Zakopane, Poland, 2000, pp.383-388.
17. Zurada J.M.: Introduction to Artificial Neural Systems, West Publishing Company, 1992.
A NEURAL FUZZY CLASSIFIER BASED ON MF-ARTMAP

PETER SINCAK, MARCEL HRIC, RICHARD VALO, PAVOL HORANSKY, PAVEL KAREL

Center for Intelligent Technologies, Department of Cybernetics and AI, Faculty of EE and Informatics, Technical University of Kosice, Slovakia; Siemens PSE&I, AG Vienna, ECANSE Group, Austria; Tatrabanka a.s. Bratislava, Slovakia

Abstract: The project deals with further progress in the MF-ARTMAP (Membership Function ARTMAP) approach for classification purposes. The research was accomplished to extend the method with the utilization of various membership functions in the classification procedure. MF-ARTMAP is similar to the ARTMAP family of neural networks, but provides additional information about the degree of membership of an input to the detected fuzzy class in the feature space. The ability to have various types of membership functions had an impact in reducing the number of nodes in the recognition layer and achieving better generalization of the system. Experiments were accomplished on image data and also on financial data. The results are encouraging for continuing the research in the direction of rule extraction using these technologies.

Keywords: Neural Networks, ART neural networks, fuzzy logic, feature space, classification procedures, fuzzy cluster, fuzzy class, classification accuracy assessment
1 Introduction
Intelligent technologies play an important role in decision systems. The decision procedure is a very important part of any complex decision support system. Decision making and pattern recognition are very close procedures and could be viewed as similar processes. Both work with a feature space, and in both approaches the final procedure is a classification which makes the final decision or categorization. Even prediction could be viewed as trend classification in a selected feature space, where the final step is again classification with a time-series-like input and output value. In all cases the classification is considered as unknown function approximation, and if it is a function we can make a decision, categorization or even prediction. Classification is a very important procedure in the feature space with very high application potential. It is in fact cluster identification in the feature space and the association of clusters with classes of interest. The main problem is in the definition of a cluster in the feature space, which could be very complex; this leads to a fuzzy approach to cluster identification, and also to the frequent situation when two or more different clusters in the feature space are associated with the same class.
(a) This project is supported by the Vega Project of the Ministry of Education of the Slovak Republic "Intelligent Technologies in Modeling Intelligent Systems", partially by an EU Marie Curie Individual Fellowship, and also by Tatrabanka a.s. Slovakia.
This first problem is associated with cluster shape determination, which is in fact already given by the training data; the real difficulty is the overlap of two or more clusters in the feature space. Therefore, knowledge about the fuzzy categorization of an unknown input could be a solution to increase the accuracy of the classification system. The latter problem is associated with the nature of the data and the fact that classes determined by experts could be represented by various clusters in completely different locations of the feature space. The identification of such classes is rather difficult with approaches like neural networks based on the BP approach, which must approximate a rather complicated discrimination function. The approaches based on ART or ARTMAP technology are rather flexible and much more suitable for these situations, when a class consists of various clusters in the feature space. These two considerations led us to design the MF-ARTMAP approach and to work on its further development, mainly for classification purposes.

2 Motivation of the project
Classification is a mapping from the feature space into the space of classes. Considering the supervised mode, the determination of training sets is extremely important. It often happens that the training sites produce non-homogeneous training data, which are represented in the feature space as a union of clusters in different locations of the feature space.
Figure 1: The basic concept of the ARTMAP neural network (with ρ adjustment / match tracking)
From this perspective, the classification approach which uses labeled clustering is the most appropriate way of handling the classification task. Figure 1 shows the basic concept of the ARTMAP neural network, which approaches the problem as a labeled clustering task. The role of the mapfield neural layer is to associate the various clusters with the desired class. The coefficient ρ is adjusted in the training phase and reflects the plasticity of the system. The number of nodes in the recognition layer reflects the number of identified clusters in the feature space. It very often happens that it is very difficult to decide if a certain point in the feature space belongs to a certain class. Therefore, an approach based on fuzzy sets has many advantages in reducing misclassification results. Sometimes it is more convenient to have results in the form of transparent information concerning the relations of the observed point in the feature space to all classes of interest. Instead of the crisp classifier output, we can be more satisfied with outputs based on fuzzy sets, namely values of membership functions of the observed input to fuzzy clusters and fuzzy classes. The notions of fuzzy cluster and fuzzy class are described in the next part of this paper. The motivation is to provide the end-user with a smaller number of misclassifications and higher readability of the classification results. The output of these classification results is a vector of values describing the relation of the input to each class of interest. The desire is to have a highly parallel tool with incremental learning ability similar to the ARTMAP family of neural networks.

3 Description of the Method
The project is based on the assumption that data in the feature space are organized in fuzzy clusters. A fuzzy cluster is considered as a fuzzy relation A in the multidimensional feature space:
$$A = \{(x, \mu_A(x; X_s, E, F)) \mid x \in Y\} \qquad [1]$$

where A is a fuzzy relation, Y is the feature space, x is a point in the feature space, and $X_s$, E, F are parameters of the fuzzy relation. Each fuzzy relation A is defined by equation [2] as a combination of the partial functions $f_i$ for each dimension.
The partial function for each dimension is defined by equation [3].
There are many fuzzy clusters in the feature space, and a certain set of fuzzy clusters creates a fuzzy class. A fuzzy class is the union of the fuzzy clusters belonging to a considered class defined by the training set, e.g.

$$CL = \bigcup_{i=1}^{n} A_i \qquad (4)$$
Generally, we can consider a fuzzy class as a set of fuzzy clusters $A_i$ representing the variety of the numerical representation of the class. The relation between $\mu_{CL}(x)$ and $\mu_{A_i}(x)$ must be as follows:
$$\mu_{CL}(x) = \max_{i=1,\dots,n} \mu_{A_i}(x) \qquad (5)$$
where $A_i$ is a fuzzy cluster which belongs to class CL, and n is the number of fuzzy clusters creating class CL. MF-ARTMAP is intended to be a tool to calculate the values of the membership functions of x to each class of interest in the feature space. In Figure 2 we can find an example of a 2-dimensional fuzzy relation with various parameters in the single-dimensional directions of the space. The space is (n+1)-dimensional, where n is the number of features and the (n+1)-st dimension is the fuzzy relation value. As is clear from the basic equation of the membership function, the influence on the shape comes from the variables E and F. Basically, these 2 variables influence the shape of the membership function of the cluster in the feature space. The membership function should be adapted to the cluster as it exists in the feature space, which means each cluster should have a completely different shape of the membership function, with the ability to describe the non-linearity of the clusters in the feature space. The variables E and F are different in each cluster, and even each dimension of the feature space could lead to the fitted shape of the membership
Figure 2: Two-dimensional membership function with various parameters
function to each detected cluster. In Figure 3 there are examples of the influence of E and F on the shape of the membership function.

Figure 3: The shapes of the membership functions when F=1, E=0.1, E=0.4 (top) and E=0.2, F=1, F=2 (bottom)

The representation in a more than 2-dimensional space is rather complex and could lead to an efficient description of the
membership to the clusters, which means the ability to reduce the number of nodes in the recognition layer of the MF-ARTMAP neural network. The learning procedure on the training data searches for the proper values of E and F in every direction of the feature space and every cluster that is detected by MF-ARTMAP. The procedure is efficient; only a few iterations are needed to achieve stability, and it is able to assign inputs from the training set to different fuzzy clusters and consequently to fuzzy classes. This gives the possibility of a more complex view of categorization and the relations of categories in the feature space, which is non-trivial information if we have a high-dimensional feature space defined by an interpreter for pattern recognition purposes.
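Since equations [2] and [3] are not reproduced above, the Python sketch below assumes a generalized bell-shaped partial function in which E widens the curve and F sharpens its slope (consistent with Figure 3), and a product combination across dimensions; both assumptions, as well as the function names, are illustrative only.

import numpy as np

def partial_membership(x, xs, E, F):
    # Assumed bell-shaped partial function for one dimension: xs is the
    # cluster position, E controls the width, F the slope (cf. Figure 3).
    return 1.0 / (1.0 + ((x - xs) ** 2 / E) ** F)

def cluster_membership(x, xs, E, F):
    # Fuzzy-relation value of one cluster: combination of the partial
    # functions over all dimensions (a product t-norm is assumed here).
    return float(np.prod([partial_membership(xi, ci, ei, fi)
                          for xi, ci, ei, fi in zip(x, xs, E, F)]))

def class_membership(x, clusters):
    # Fuzzy-class membership: the maximum over the member clusters,
    # equation (5).
    return max(cluster_membership(x, xs, E, F) for xs, E, F in clusters)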
3.1 Description of the neural network topology
The topology of MF-ARTMAP is based on an architecture similar to ARTMAP. In Figure 4 the general topology of the MF-ARTMAP neural network with 4 neural layers can be seen. The input layer maps the input into the comparison layer, where the partial function f_i for each dimension is calculated by equation [3]; then the final value of the fuzzy relation for each cluster is calculated by equation [4], and the comparison between the relation values and the threshold value is done. The input pattern is tested as to whether it belongs to one of the clusters. If not, the second layer changes dynamically according to the number of clusters in the 3rd layer. So the 2nd and 3rd layers extend according to the number of clusters found in the feature space. The recurrent connection between the 2nd and 3rd layer encodes the Xs, E and F parameters associated with a particular membership function. The 4th layer represents the mapfield-like part of the neural network, whose role is to integrate the clusters into a resulting class. The input to the mapfield comes from the 3rd layer and also from outside of the neural network, as the associated output of the overall MF-ARTMAP neural network. Basically, the neural layers of MF-ARTMAP can be listed as follows:
Layer #1 - input mapping layer; number of neurons equal to "n", where n is the dimensionality of the feature space,
Layer #2 - comparison layer; number of neurons equal to "n x nc", where nc is the number of clusters identified in the recognition layer,
Layer #3 - recognition layer; number of neurons equal to "nc", where nc is the number of clusters,
Layer #4 - mapfield layer; number of neurons equal to "M", where M is the number of classes for the classification procedure.
More info can be found in [14].
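Building on the previous sketch (it reuses cluster_membership), the following illustrates the layer-by-layer data flow just described; the threshold comparison standing in for the vigilance test, and all names, are assumptions of this sketch.

def mf_artmap_classify(x, clusters, cluster_to_class, n_classes, threshold=0.5):
    # Layers 2-3: partial functions per dimension and the fuzzy-relation
    # value per recognition-layer cluster (one neuron per cluster).
    relation = [cluster_membership(x, xs, E, F) for xs, E, F in clusters]
    # Layer 4 (mapfield): aggregate cluster memberships into class
    # memberships, equation (5).
    outputs = [0.0] * n_classes
    for mu, k in zip(relation, cluster_to_class):
        outputs[k] = max(outputs[k], mu)
    best = max(range(n_classes), key=outputs.__getitem__)
    # If no cluster matches the input well enough, a new cluster would be
    # created during training (dynamic growth of layers 2 and 3).
    return best if outputs[best] >= threshold else None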
Figure 4: General MF-ARTMAP topology with a dynamic number of neurons in the 2nd and 3rd neural network layers

3.2 Parallel MF-ARTMAP
The notion of a modular neural network has been known for many years. It is a very promising idea of solving complex problems by distributing them into more sub-problems that are easy to solve. Basically, the "divide and conquer" principle is usually used in modular neural networks. There are some difficult questions about the separability problems and discrimination hyper-plane determination. The key problem is to answer the following question: "Is it easier to separate one particular class from the feature space, or to identify more classes among each other?" So in fact the question is about the difficulty of dichotomous classification compared with the multi-class approach. The first impression could be that dichotomous classification is always easier than the multi-class approach, but it is very difficult to conclude this in general. For investigating these ideas, a parallel ARTMAP approach was designed and tested. The basic philosophy of this approach is illustrated in Figure 5. MF-ARTMAP is suitable for solving the conflict in this approach, because the values of membership functions to fuzzy clusters are good
indicators for conflict resolution among more experts, as indicated in Figure 5. The basic advantages of Parallel MF-ARTMAP are as follows:
1. Ultra-fast learning abilities on highly parallel systems, e.g. PC farms. This feature can be very useful in the case of large databases, with easier handling of larger amounts of data.
2. Easy and comfortable extension of the classes of interest by adding a new expert network and training it on the new class training data. This is very interesting in the case of frequent additions of new classes to the list of classes of interest.
3. Easy identification of the class "unknown" by measuring the membership function value of the unknown input to the fuzzy classes. If the value is lower than a given value, the input is rejected and proclaimed to be of class "unknown". This feature is very important when large data sets with many classes are considered.
4. Easy readability of fuzzy classes as unions of fuzzy clusters and identification of their basic parameters.
On the other side, the basic disadvantage of Parallel MF-ARTMAP is the necessity of determining more parameters, considering the starting values of the vigilance parameters for each expert network separately.

Figure 5: The basic philosophy of Parallel MF-ARTMAP
This can be avoided by designing a meta-expert controller setting up all the parameters for the overall complex of experts, e.g. a fuzzy system. Also, the final decision on the output of the modular system could be adapted according to a priori knowledge or any external information.
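A minimal sketch of the arbitration described above, assuming each expert network exposes the membership of the input to its own fuzzy class; the 0.7 rejection threshold is only an example value.

def parallel_decide(x, experts, threshold=0.7):
    # Each expert reports its class membership; conflicts are resolved by
    # the largest membership value, and inputs whose best membership stays
    # below the threshold are proclaimed class "unknown" (advantage 3).
    scores = {name: expert(x) for name, expert in experts.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"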
4 Experimental results
Experiments using benchmark and real-world data were done during this project. The aim was a comparative analysis of known CI systems with those modified or developed during the research, namely MF-ARTMAP and MF-ARTMAP with F parameter adaptation. Basically, the same real-world data were used, so the comparison can be done assuming the same training and testing data, which means the same amount of knowledge was used for the classification procedure.

4.1 Accuracy assessment
Accuracy assessment was evaluated using contingency table analysis. A contingency table was used in a basic comparison study between all methods which were investigated and developed. Some details about contingency table analysis for accuracy assessment of classification results can be found in [8].

4.2 Experiments on benchmark data
Two benchmark data sets were used for testing classification results for comparative purposes. The circle-in-the-square and the double-spiral problems were used for dichotomous classification testing. These two benchmark data sets are used the most for estimating a classifier's level of sophistication. If results on these benchmark data are good, there is a good assumption that the classifier will be successful in main applications. The following tables show the results from the selected benchmark classification procedures. Results are on testing data.
                       MF Artmap      Parallel MF Artmap   MF Artmap with       Parallel MF Artmap
                                                           F adaptation         with F adaptation
Actual class:           A      B       A      B             A      B             A      B
Classified A'         97,62   2,92   98,14   1,46         98,32   1,82         97,90   2,34
Classified B'          2,38  97,08    1,86  98,54          1,68  98,18          2,10  97,66

Table 1: Results on the "circle in the square" dichotomous classification
Figure 6: Classification of "circle in the square" without and with F adaptation
                       MF Artmap      Parallel MF Artmap   MF Artmap with       Parallel MF Artmap
                                                           F adaptation         with F adaptation
Actual class:           A      B       A      B             A      B             A      B
Classified A'         87,54  10,76   87,96   7,86         88,72   9,95         88,94  10,28
Classified B'         12,46  89,24   12,04  92,14         11,28  90,05         11,06  89,72

Table 2: Results on the "double spiral" dichotomous classification

As is clear in Figure 6, the picture with F adaptation shows nodes which are more complex and able to fit more nonlinear data than the approach without F adaptation. The F parameter adaptation shows a positive response in decreasing the number of nodes in
the recognition layer of MF-ARTMAP, which brings lower computational demands and faster computation. This shows better generalization of the system over the data in the training and testing sets.

Figure 7: Classification of "double spiral"
4.3 Experiments on real-world data

4.3.1 Experiments on multi-spectral image data
Basically, the behavior of the methods was observed on multi-spectral image data with the aim of obtaining the best classification accuracy on the test data subset. The Kosice data consist of a training set of 3164 points in the feature space and of a test set of 3167 points of the feature space. A point in the feature space has 7 real-valued coordinates normalized into the interval (0,1) and 7 binary output values. The class of a fact is determined by the output which has a value of one; the other six output values are zero. The data represent 7 attributes of the color spectrum sensed from the Landsat satellite. The representation set was determined by a geographer and was supported by a ground verification procedure. The main goal was land use identification using the most precise classification procedure for achieving accurate results. The image was taken over the eastern Slovakia region, particularly the City of Kosice region. There were seven classes of interest picked for the classification procedure. The results of the classification with F adaptation are in Table 3 and show a slight improvement over the results presented in [15].
Table 3: Contingency table of Landsat TM classification on the test sites for MF-ARTMAP (per-class accuracies on the diagonal: 95.51, 83.16, 100, 96.66, 87.29, 99.49, 83.10)
Table 4: Contingency table of Landsat TM classification on the test sites for MF-ARTMAP with F adaptation
Figure 8: Original image. Highlighted areas were classified by an expert (A - urban area, B - barren fields, C - bushes, D - agricultural fields, E - meadows, F - forests, G - water)
Figure 9: Classification results in the form of a thematic map. The results correspond to the contingency Table 4.

4.3.2 Experiments on financial fraud data
The application of MF-ARTMAP to financial fraud transaction data is an interesting challenge for this approach. The theoretical advantage of this approach is to reveal relations between types of frauds, and to indicate and recommend changes to the feature space with the aim of being able to discriminate the type of the fraud. The financial fraud data consist of a training set of 9000 points in the feature space and of a test set of 9000 points of the feature space. A point in the feature space has 22 attributes, 10 real and 12 nominal, and 7 binary output values. The class of a fact is determined by the output which has a value of one; the other six output values are zero. Class "0" represents a clear financial transaction; classes "1"-"6" represent some kinds of fraudulent financial transactions. The data represent 22 attributes of the financial transaction. The representation set was determined by financial fraud experts. The pre-analysis of the data for processing was made in the following aspects:
2.
analysis of the non-contradictory labeling of the data by expert means the expert should not label similar data with different labels analysis of statistical occurrences of the fraud types - if there is no sufficient data representation of the fraud type - there is difficult to make decision about fraud identification
3. the way to encode the linguistic features into numerical form, to be able to process the complex fraud description in the 22-dimensional feature space.
The analysis of point 1 revealed some problems, which were cleaned up, and some corrections in the representative set of frauds were made. Concerning point 2, the analysis showed that fraud types 2 and 6 had only 4 cases each in the representative set, so we removed them from the classification analysis and further processing. For classification processing we therefore used classes 0 - no fraud - and 1, 3, 4, 5 - the selected frauds. Fraud types 2 and 6 were not included in the classification because of insufficient representation in the representative set. The classification was made, and Tables 5 and 6 present the classification results on the financial fraud data with 2 randomly selected subsets of the representative sets. It is clear that both results are similar, and if the representative set per fraud is sufficient, it is possible to use these approaches for this type of data. In the case where the representation of a fraud is poor - there are only a few examples - the identification will be done using a rule-based approach implemented into the overall classification approach. It will be a matter of further research to integrate the rule-based approach and neural technology into a final decision support system. The complex system must also take into consideration the type of error: we have to take extreme care when a fraud is classified as non-fraud, which is a more serious problem than when a non-fraud is classified as a fraud. These considerations will be included in the future research. The numbers of clusters associated with each class give an interesting view of how ART technology can easily handle non-uniform classes consisting of several different clusters which, according to the expert, belong to one class. Table 7 shows the number of clusters associated with each type of class after the training procedure. MF-ARTMAP also reveals the relation of these frauds to the other types of fraud and gives a more complex view of the classification procedure, which is important for the improvement of the overall process.
Table 5: Financial fraud transaction classification on test data

Table 6: Financial fraud transaction classifications with changing training and test data (results on test data)
Type of the class                  0     1    3    4    5
Number of clusters in the class  763    30   16    5    2

Table 7: Number of clusters in the feature space associated with each type of fraud

Regarding the results presented in Tables 5 and 6, of particular concern are the membership function values of misclassified results assigned to class 0 which, according to the expert identification, were in fact frauds and not type 0 (non-fraud); this is a very serious problem. Therefore we did an analysis of the misclassified patterns and of the values of the membership functions produced by MF-ARTMAP. There are 2 major results of this analysis to prevent misclassifications, as follows:
1. to give feedback to the expert to re-evaluate the misclassified patterns; if a pattern has a high membership function value to non-fraud and is still a fraud according to the expert, some further analysis must be done,
2. if the membership function of a misclassified pattern to the class "non-fraud" is the highest but does not exceed a certain threshold, e.g. 0.7, it is probably a new type of fraud which does not belong to any of the selected examples and should be classified as another type of cluster (class). This information should be discussed with the expert to confirm these considerations and implement them into the system.
The above considerations can be made only with MF-ARTMAP, because it provides the values of the membership functions of the patterns to the fuzzy clusters or fuzzy classes. The ARTMAP technology does not provide this information, and it is not possible to give feedback to the expert to this extent. The values in Table 8 show the membership functions of the patterns which were misclassified from fraud type 4 to class 0 (non-fraud). The values lie in the intervals shown and could be a solid base for further discussion with the expert to improve the classification results and decrease the misclassification of fraud 4 to class 0.
Class     Membership values of the misclassified patterns (interval)
Class 0   from 0.9801 to 0.9993
Class 4   from 0.9598 to 0.9890

Table 8: Membership values to classes 0 and 4 of the patterns which were identified by the expert as class 4. This table could be an inspiration for discussion with the expert to improve the classification results and decrease the misclassification of the frauds.
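A small Python sketch of the two feedback rules above, assuming each misclassified pattern carries its membership value to the class "non-fraud"; the 0.7 threshold follows the example quoted in the text.

def review_misclassified(patterns, mu_nonfraud, threshold=0.7):
    # Rule 1: high membership to non-fraud but labeled as fraud by the
    # expert -> send back for expert re-evaluation.
    # Rule 2: best membership below the threshold -> candidate new fraud
    # type (a new cluster/class).
    for pattern, mu in zip(patterns, mu_nonfraud):
        if mu >= threshold:
            yield pattern, "re-evaluate with expert"
        else:
            yield pattern, "candidate new fraud type"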
5 Conclusion
The paper presents further research on the MF-ARTMAP with F adaptation approach to the classification procedure. The advantage of this approach is the higher readability of the neural network in providing the values of membership functions of the input to the fuzzy clusters or fuzzy classes identified by this approach. The results of these neural networks are comparable with MF-ARTMAP [14] on benchmark and real-world data and, in addition, provide more useful information about the measure of membership to all fuzzy classes and fuzzy clusters discovered in the feature space. Adaptation of the F parameter seems to be a useful tool to investigate in the future. This seems to be an interesting advantage of this approach. The experiments on satellite image and financial fraud data have been presented, and MF-ARTMAP is able to provide useful feedback information to the expert in fraud identification to reconsider the fraud labeling. The expert should also provide an explanation of why a misclassified pattern has a close metric distance to the class "non-fraud"; in this case some additional features should be considered to distinguish fraud from non-fraud transactions. In future research we will provide further development of MF-ARTMAP to be able to retrieve some explicit knowledge from the network topology, as well as to be able to implement some a priori knowledge into the system. Besides parallel MF-ARTMAP, we will also focus our attention on a hierarchical classification system to achieve the highest classification accuracy in the form of a modular classification complex, and on implementations of MF-ARTMAP as an autonomous agent and a part of an intelligent multi-agent system.
References:
[1] P.M. Atkinson and A.R.L. Tatnall, "Neural networks in remote sensing," Int. J. Remote Sensing, vol. 18, no. 4, 1997, pp. 711-725.
[2] J. Richards, Remote Sensing Digital Image Analysis: An Introduction, Springer-Verlag: Berlin, 1993.
[3] B.G. Lees and K. Ritman, "Decision-tree and rule-induction approach to integration of remotely sensed and GIS data in mapping vegetation in disturbed or hilly environments," Environmental Management, vol. 15, 1991, pp. 823-831.
[4] C.H. Chen, "Trends on information processing for remote sensing," in Proceedings of the International Geoscience and Remote Sensing Symposium (IGARSS), vol. 3, Aug. 3-8 1997, pp. 1190-1192.
[5] P. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, Ph.D. thesis, Harvard University, 1974.
[6] G.A. Carpenter, M.N. Gjaja, S. Gopal, and C.E. Woodcock, "ART neural networks for remote sensing: Vegetation classification from Landsat TM and terrain data," IEEE Transactions on Geoscience and Remote Sensing, vol. 35, no. 2, 1997, pp. 308-325.
[7] D. Rutkowska, R. Nowicki, "Implication-based neuro-fuzzy architectures," International Journal of Applied Mathematics and Computer Science, vol. 10, no. 4, 2000, pp. 675-701.
[8] P. Sincak, H. Veregin, and N. Kopco, "Conflation techniques in multispectral image processing," Geocarto Int., March 2000, pp. 11-19.
[9] S. Grossberg, "Adaptive pattern classification and universal recoding, I: Feedback, expectation, olfaction, and illusions," Biological Cybernetics, vol. 23, 1976, pp. 187-202.
[10] G.A. Carpenter, B.L. Milenova, and B.W. Noeske, "Distributed ARTMAP: a neural network for fast distributed supervised learning," Neural Networks, vol. 11, no. 5, Jul. 1998, pp. 793-813.
[11] J.R. Williamson, "Gaussian ARTMAP: A neural network for fast incremental learning of noisy multidimensional maps," Neural Networks, vol. 9, 1996, pp. 881-897.
[12] R.K. Cunningham, Learning and Recognizing Patterns of Visual Motion, Color, and Form, Unpublished Ph.D. thesis, Boston University, Boston, MA: 1998.
[13] R. Duda and P. Hart, Pattern Classification and Scene Analysis, Wiley, New York: 1973.
[14] Sincak, Kopco, Hric, Veregin: MF-ARTMAP to identify fuzzy clusters in feature space, accepted to IIZUKA 2000, 6th International Conference on Computational Intelligence, Iizuka, Japan, October 1-4, 2000.
[15] Sincak, Hric, Vascak: Pattern Recognition with MF-ARTMAP Neural Networks, accepted to INTECH 2001, 2nd International Conference on Intelligent Technologies, Bangkok, Thailand, November 27-29, 2001.
[16] Ocelikova, E., Nguyen Hong, T.: Differential Equations for Maximum Entropy Image Restoration. In: Proc. of the 3rd International Conference "Informatics and Algorithms '99", Presov, September 9-10, 1999, pp. 214-218, ISBN 80-88941-05-9.
[17] Ocelikova, E., Klimesova, D.: Clustering by Boundary Detection. In: Proc. of the 4th International Scientific Technical Conference "Process Control 2000", Pardubice, June 2000, pp. 108, ISBN 80-7194-271-5.
[18] Klimesova, D., Ocelikova, E.: GIS and Spatial Data Network. In: Proc. of the International Conference "Agrarian Perspectives X - Sources of Sustainable Economic Growth in the Third Millennium. Globalisation versus Regionalism", Sept. 18-19, 2001, Prague, Czech Republic, ISBN 80-213-0799-4.
MATHEMATICAL PROPERTIES OF VARIOUS FUZZY FLIP-FLOPS AS A BASIS OF FUZZY MEMORY MODULES

KAORU HIROTA, SHINICHI YOSHIDA

Department of Computational Intelligence and Systems Science, Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, 4259 Nagatsuta-cho, Midori-ku, Yokohama 226-8502, Japan
E-mail: [email protected]

Abstract: D, T, SR, and JK fuzzy flip-flops are proposed and their characteristics are graphically shown in four operation systems: the max-min, algebraic, bounded, and drastic logical operation systems. Some properties of their logical forms are analytically shown. The circuits of the proposed flip-flops are designed and simulated on a VHDL circuit simulator. The result of synthesis shows that the areas of the D, T, and SR fuzzy flip-flops are nearly 0, 2/3, and 1/2 of that of the JK fuzzy flip-flop, and the delay times of the D, T, and SR fuzzy flip-flops are nearly 0, 2/3, and 2/3 of that of the JK type, respectively.

Keywords: fuzzy flip-flop, memory element, fuzzy logic, FPGA, logic circuit
1 Introduction

In binary logic, flip-flop circuits are used as basic memory modules in sequential circuits and computers. The concept of the fuzzy flip-flop, which is a fundamental circuit of a fuzzy memory element, was proposed in 1989 [1] and implemented with analog transistor and TTL digital circuits [2]. Although their theoretical consideration from the viewpoint of max-min fuzzy logic has been studied, those works deal only with the JK flip-flop and mainly use the max-min logical operation system $(1-\cdot, \wedge, \vee)$. So D, T, and SR fuzzy flip-flops, which are less functional but simpler and faster compared with JK fuzzy flip-flops, have been proposed [6]. This paper surveys D, T, SR, and JK fuzzy flip-flops. Their FPGA circuits are also designed on Synopsys Design Compiler, and their circuit areas and delay times are measured.
JK Fuzzy Flip-Flop
The min-term expression of a chatacteristic equation of JK flip-flop is
$$Q(t+1) = J\bar{K}\bar{Q} + J\bar{K}Q + JK\bar{Q} + \bar{J}\bar{K}Q, \qquad (1)$$
simplified as
$$Q(t+1) = J\bar{Q} + \bar{K}Q, \qquad (2)$$
where the time variable (t) is omitted on the right-hand side for simplicity. On the other hand, the max-term expression is
$$Q(t+1) = (J+K+Q) \cdot (J+\bar{K}+Q) \cdot (J+\bar{K}+\bar{Q}) \cdot (\bar{J}+\bar{K}+\bar{Q}), \qquad (3)$$
which can be simplified as
$$Q(t+1) = (J+Q) \cdot (\bar{K}+\bar{Q}). \qquad (4)$$
Of course, these two equations, Eq. (2) and Eq. (4), are equivalent in binary logic (Boolean algebra). Their fuzzy extensions, however, are not generally equivalent. Fuzzy extensions of (2) and (4) are
$$Q_R(t+1) = (J \otimes \bar{Q}) \oplus (\bar{K} \otimes Q), \qquad (5)$$
and
$$Q_S(t+1) = (J \oplus Q) \otimes (\bar{K} \oplus \bar{Q}), \qquad (6)$$
where $\otimes$, $\oplus$, and $\bar{\cdot}$ indicate a t-norm, an s-norm, and a fuzzy negation, which are fuzzy extensions of AND, OR, and NOT, respectively. Their characteristics are graphically illustrated in Fig. 1 and Fig. 2, respectively. When Q(t) = 0.5, Fig. 1(b) shows that no value larger than 0.5 can be input for any input values J(t) and K(t). On the other hand, 0.5 and values less than 0.5 can always be input, i.e. it can always be reset. For this reason, Eq. (5) is called the reset-type JK fuzzy flip-flop. Similarly, Eq. (6) is called the set-type JK fuzzy flip-flop.
Proposition 1 If the following relation (De Morgan's law) concerning the fuzzy negation, t-norm, and s-norm is valid
$$\overline{A \otimes B} = \bar{A} \oplus \bar{B},$$
then
$$Q_S(t+1)\big|_{(J,K,Q)} = \overline{Q_R(t+1)\big|_{(K,J,\bar{Q})}}$$
will be obtained.
Figure 1. Characteristics of min-max set-type JK fuzzy flip-flop: (a) Q(t)=0, (b) Q(t)=0.5, (c) Q(t)=1
Figure 2. Characteristics of min-max reset-type JK fuzzy flip-flop: (a) Q(t)=0, (b) Q(t)=0.5, (c) Q(t)=1
Proof
2.1 Max-Min operation system

Eq. (5) and Eq. (6), using the logical operation system $(1-\cdot, \wedge, \vee)$ as $(\bar{\cdot}, \otimes, \oplus)$, are
$$Q_R(t+1) = \{J \wedge (1-Q)\} \vee \{(1-K) \wedge Q\}, \qquad (10)$$
and
$$Q_S(t+1) = (J \vee Q) \wedge \{(1-K) \vee (1-Q)\}. \qquad (11)$$
Eq. (10) and Eq. (11) cannot be used as an element of a memory module, because they cannot memorize an arbitrary value in [0,1] at any time. In order to avoid this problem, another JK fuzzy flip-flop was proposed by combining these two JK fuzzy flip-flops (Eq. (12)).
Eq. (10) and Eq. (11) are continuously connected on the line segment J = K. This can be shown analytically as follows.
1) In the case $J = K < Q$, Eq. (10) is expressed as
$$\{J \wedge (1-Q)\} \vee \{(1-J) \wedge Q\}, \qquad (13)$$
so
$$Q > J \ge J \wedge (1-Q) \qquad (14)$$
$$(1-J) > (1-Q) \ge J \wedge (1-Q) \qquad (15)$$
should hold, and
$$(1-J) \wedge Q \ge J \wedge (1-Q) \qquad (16)$$
is obtained. Thus Eq. (10) is equal to
$$(1-J) \wedge Q. \qquad (17)$$
On the other hand, Eq. (11) is clearly equal to
$$Q \wedge (1-J), \qquad (18)$$
thus both Eq. (10) and Eq. (11) are the same.
2) In the case $J = K = Q$, Eq. (10) and Eq. (11) are clearly equal to the same value,
$$J \wedge (1-J). \qquad (19)$$
3) In the case $J = K > Q$, Eq. (10) is expressed as
$$\{J \wedge (1-Q)\} \vee \{(1-J) \wedge Q\}. \qquad (20)$$
Here
$$J > Q \ge (1-J) \wedge Q \qquad (21)$$
$$(1-Q) > (1-J) \ge (1-J) \wedge Q \qquad (22)$$
and
$$J \wedge (1-Q) > (1-J) \wedge Q, \qquad (23)$$
thus Eq. (10) is equal to
$$J \wedge (1-Q). \qquad (24)$$
On the other hand, Eq. (11) is clearly equal to
$$J \wedge (1-Q), \qquad (25)$$
so both are the same. Therefore the combined JK fuzzy flip-flop Eq. (12) is continuous. The combined JK fuzzy flip-flop can also be expressed in a unified form (Eq. (26), Eq. (27)).
$$Q(t+1) = \{J \wedge (1-Q)\} \vee \{(1-K) \wedge Q\} \vee \{J \wedge (1-K)\} \qquad (26)$$
$$Q(t+1) = \{J \vee (1-K)\} \wedge (J \vee Q) \wedge \{(1-K) \vee (1-Q)\} \qquad (27)$$
Its characteristics are shown in Fig. 3, Fig. 4, Fig. 5, and Fig. 6, under the logical, algebraic, bounded, and drastic operation systems, respectively. Any value in [0,1] can always be input into this fuzzy flip-flop. This type of JK fuzzy flip-flop has the same abilities as the binary JK flip-flop, i.e. the set and reset operations.
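The two unified forms can be written down directly in Python; the grid check at the end verifies numerically that, under the max-min system, Eqs. (26) and (27) produce the same values (the check itself is an addition of this sketch, not part of the original text).

import itertools

def jk_minterm(J, K, Q):
    # Combined JK fuzzy flip-flop, unified min-term form, Eq. (26).
    return max(min(J, 1 - Q), min(1 - K, Q), min(J, 1 - K))

def jk_maxterm(J, K, Q):
    # Combined JK fuzzy flip-flop, unified max-term form, Eq. (27).
    return min(max(J, 1 - K), max(J, Q), max(1 - K, 1 - Q))

grid = [i / 10 for i in range(11)]
assert all(abs(jk_minterm(j, k, q) - jk_maxterm(j, k, q)) < 1e-12
           for j, k, q in itertools.product(grid, repeat=3))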
2.2 Algebraic operation system

Under the algebraic operation system $(\cdot, \hat{+}, 1-\cdot)$ as $(\otimes, \oplus, \bar{\cdot})$, Eq. (5) and Eq. (6) are expressed as
.a), characteristics
QR(~
+ 1) = { J . (1 - Q > ) i { ( l -K ) . Q}
(28)
Q s (+ ~ 1) = ( J + Q > .((1- W i ( 1 - Q)},
(29)
which can be transformed as
QR(t
+ 1) = J + Q - 2 J Q - K Q + J Q 2 + J Q K - J Q K 2 ,
(30)
166
K. Hirota and S. Yoshida
1
0.5
0
0
0
0
(4 Q(t)=O
(b) Q(t)=0.5
(c) Q(t)=l
Figure 3. Characteristics of JK fuzzy flip-flop (Max-Min)
011 1
0.5 0
0
(4 Q(t)=O
(b) Q(t)=0.5
(c) Q(t)=l
Figure 4. Characteristics of J K fuzzy flip-flop (Algebraic)
Qs(t + 1) = J
+Q
-
J Q - J K Q - KQ2 + JKQ’.
+
The difference between Q R ( ~ 1) and Qs(t + 1) is
therefore
(31)
Fuzzy Flip-flops as a Basis of h z z y Memory Modules
167
o(
1
05 0
B
(4 Q(t)=O
(b) Q(t)=0.5
(c) Q(t)=l
Figure 5. Characteristics of J K fuzzy flip-flop(Bounded)
(4 Q(t)=O
(b) Q(t)=0.5
(c) Q(t>=1
Figure 6. Characteristics of J K fuzzy flip-flop(Drastic)
2.3 Bounded operation system
Under bounded operation system (0, @, 1- .), characteristics Eq. (5) and Eq. ( 6 ) are expressed as
168 K. Harota and S. Yoshida
(J-K J>K,KLQ J-K 5 2 K,K 2 Q Q-K J = (T A Q @ )v (T@A Q ) v ( T A T @ ) v (Q A Q @ )
QzAX(t
(47)
Since the value of third and fourth term in the right hand side are at most 1/2, minimum operations between them and the other terms whose values are greater than or equal to 1/2 keeps the same value (Kleene's equality). (47) = (T A Q @ )v (T@A Q ) v { ( T A T @ ) A (Q v Q @ ) }v {(Q A Q @ )A (T v T
(11) Case of
(11i) Case of
Since
is less that or equal to
in all cases,
Fuzzy Flip-flops as a Basis of Fuzzy M e m o r y Modules
171
From the above results,
More general results, discussed below, are also obtained. Proposition 3 If (.@,0 , ~ always ) satisfies following relation
AO(BQC) 2 (AOB)O(AOC), A@(BOC)I: (AOB)O(AOC),
(56)
then the inequality
+ 1) I Q M A X ( ~+ 11,
QMIN(~
(57)
holds. Proof
Figure 8 shows the characteristics of the equation (44) using logical, algebraic, bounded, and drastic operation systems, respectively, while Figure 9 shows those of equation (45).
172
K. Hirota and S. Yoshida
(a) logical
(b) algebraic
(c) bounded
(d) drastic
Figure 8. Characteristics of T fuzzy flip-flop (minterm)
From these figures, we can see that
holds.
3.3 SR fuzzy flip-flop Binary SR flip-flop which has three functions-bit set, bit reset, and hold-is is a basic element of a memory module. If the set input S = 1, then the next state Q(t 1) = 1. If the reset input R = 1, then the next state Q(t 1) = 0. If S = R = 0, then the next state holds current state, i.e., Q ( t + 1) = Q ( t ) . The input S = R = 1is forbidden. But in order to construct the characteristic equation of SR flip-flop, there exist two types of SR fuzzy flip-flop. One is set-type, whose Q ( t + 1) = 1 when S = R = 1, and another is reset-type, whose Q ( t + 1) = 0 when S = R = 1.
+
+
Fuzzy Flip-flops as a Basis of Fuzzy M e m o r y Modules
(a) logical
(b) algebraic
(c) bounded
(d) drastic
173
Figure 9. Characteristics of T fuzzy flip-flop (maxterm)
Eq. (60) and Eq.(61) are the characteristic equations of set-type and reset-type SR fuzzy flip-flop, respectively.
Qs(t + 1) = SO(R"@Q)
+ 1) = @@(SO&)
QR(~
(60)
(61)
Now we show an order relation between set-type and reset-type SR fuzzy flip-flop in some case. Namely under max-min, algebraic, and bounded operation systems, fuzzy truth value of set-type SR fuzzy flip-flop is always greater than or equal t o that of reset-type. Needless t o say, such a relation stands in binary logic. Proposition 4 If a operation system (.@, @,@) satisfies
A@(BOC)L (A@B)O(A@C), AO(BOC) 2 (AOB)O(AOC), then it satisfies
Q s ( t + 1) 2
+ 1).
QR(~
174 K. Hirota and S. Yoshida
Proof Qs(t + 1) = S(t)O(R@(t)OQ(t))
2 (R@OS)O(R@OQ) 2 R@O(S@&)
+ 1)
(64)
QR(~
Q.E.D. Corollary 1 If (.a,@,@) is logical operation system (1 - A , V) or algebraic operation system (1 - .,.,-l), characteristic of SR fuzzy flip-flop satisfies Eq. (63). Proposition 5 If the operation system (.@,0,0) is bounded operation system (1 - ., 0,e),Eq.(63) holds.
Proof
Qs(t + 1) = s(t)@ (R@(t)0Q(t)) S + Q - R (0 5 Q - R 5 1- S ) (SA) (0 5 1 - S 5 Q - R ) (SB) (Q - R 5 0) (SC)
={I
+
Q R ( ~ 1) = R@(t) o ( S ( t )e Q ( t ) ) S Q - R ( R 5 S Q 5 1) (RA) ={0 ( S + Q5 R 51) (RBI 1-R (15S+Q) (RC)
+
(65)
+
(66)
(SB) of Eq.(65) is greater than or equal to all cases of (RA), (RB), and (RC) of Eq.(66), and (RB) of Eq.(66) is smaller than or equal to all cases of
(SA), (SB), and (SC) of Eq.(65), and (SA)-(RA) is 0. Therefore] (SA)-(RC) = S
+ Q - R - (1 - R) = S + Q - 1
2 0 (.: 1 5 S + Q ) (SC)-(RA) = S - ( S Q - R) = -(Q
+
(67) -
R)
2 0 (... Q - R 5 0) (SC)-(RC) = S - (1 - R ) = R >O -
-
(68)
(1 - S )
(..'Q-R Q R ( ~+ 1).
(75)
. Therefore
4
Performance of fuzzy flip-flops
Table 1 shows the comparison of functions that can be performed by various fuzzy flip-flops. The symbol “A”in JK-FFF[l] means that it can input arbitrary value only if both set-type and reset-type are used together and are configured appropriately. Figure 12 shows circuit areas of D, T, and SR fuzzy flip-flops and that of unified form of JK fuzzy flip-flops[2] using logical, algebraic, bounded, and
Fuzzy Flip-flops as a Basis of Fuzzy Memory Modules
177
drastic operation systems, respectively. Figure 13 shows their delay times. “Area” in Figure 12 indicates the number of gates, while “Delay” in Figure 13 indicates the time(ns) that the signal runs from inputs t o outputs. Compared with J K fuzzy flip-flops, T and SR fuzzy flip-flops use 213 and 1/2 area of circuit resources respectively, and their delay times are improved t o 213 of that of JK’s in every operation system. This fact shows that the circuit area of fuzzy flip-flops is proportional to the number of t-norm, s-norm, and fuzzy negation. As D fuzzy flip-flop is composed of output latches only, both its circuit area and delay time are negligible. 5
Summary
We define JK, D, T, and SR fuzzy flip-flop as a basic element of fuzzy memory module. Their characteristics are shown under four operation systems: maxmin logical (l-.,A,V), algebraic (l-.,.,$), bounded (l-.,a,@), anddrastic (1 - ., A, V) operation systems. And then the inequalities between set-tyep and reset-type JK fuzzy flip-flops, between maxterm-expressed and mintermexpressed T fuzzy flip-flops, and between set-type and reset-type SR fuzzy flip-flops are analytically shown. Their circuits are designed using VHDL for digital programmable devices FPGA or CPLD. The result of circuit areas and delay times shows that the areas of D, T, and SR fuzzy flip-flops decrease 213 t o 113 of JK’s, and delay times of them decrease 213 t o 1/2 of JK’s. Concept of these fuzzy flip-flops gives the foundation of the realization of fuzzy sequential circuits, multi-stage fuzzy inference processors, and fuzzy computers.
References
1. K.Hirota, K.Ozawa: “Concept of Fuzzy Flip-Flop” ,IEEE Transactions on Systems, Man, and Cybernetics, Vo1.19 No.5, pp.980-997 (1989) 2. K.Hirota, K.Ozawa: “Fuzzy Flip-Flop and Fuzzy Registers” ,Fuzzy Sets and Systems (North-Holland) , Vo1.32 No.2, pp.139-148 (1989) 3. K.Hirota, W.Pedrycz: “Designing sequential systems with fuzzy J-K flipflops”, Fuzzy Sets and Systems (North-Holland), Vo1.39 No.3,pp.261-278 (1991) 4. J.Diamond, W.Pedrycz, D.McLeod: “Fuzzy J K Flip-Flop as Computational Structures Design and Implemantation”, IEEE Transactions on Circuits and Systems 1I:Analog and Digital Signal Processing), Vo1.41 No.3, pp.215-226 (1994) 5. K.Hirota, W.Pedrycz: “Design of Fuzzy Systems With Fuzzy Flip-Flops”,
178 K . Hirota and S. Yoshida
IEEE Transactions on Systems, Man, and Cybernetics, Vo1.25 No.1, pp.169-176 (1995) 6. S.Yoshida, Y.Takama, K.Hirota: “Fuzzy Flip-Flops and their Applications to Fuzzy Memory Element and Circuit Design using FPGA”, Jounral of Advanced Computational Intelligence, vo1.4 No.5, pp.380-386 (2000)
GENERALIZED T-OPERATORS IMRE J . RUDAS Budapest Polytechnic H-1081 Budapest Nbpszinhaz u. 8. Hungary E-mail: rudas@bmJhu Fuzzy set theory provides a host of attractive aggregation operators for integrating the membership values representing uncertain information. The variety of these operators might be confusing and make it difficult to decide which one to use in a specific model or situation. The tutorial gives a survey of the existing aggregation connectives starting from the classical Zadehian-operators, through the theory of t-operators, till the most up-to-date operators, containing the results of the author and his colleagues on entropy and evolutionary operators.
Keywords: f-operators, evolutionary operators, distance-based operators, entropy-based firzzy connectives, generalized operations.
1
Introduction
Many applications of fuzzy set theory involve the use of a fuzzy rule base to model complex and approximately or not well known systems. The most typical applications are fuzzy logic control, fuzzy expert systems and fuzzy systems modeling. The rule base consists of a set of n IF-THEN rules. In case of two-inputsingle output system the rules are of the form [14]: %, : if x is A, and y is B, then z is C,
also
%, : if x is A, and y is B, then z is C,
also
............ also
1. 2. 3. 4.
:if x is A, and y is B, then z is C,
The hzzy inference process consists of the following four step algorithm [14]: Determination of the relevance or matching of each rule to the current input value. Determination of the output of each rule as hzzy subset of the output space. These individual rule outputs will be denoted by R j , Aggregation of the individual rule outputs to obtain the overall hzzy system output as hzzy subset of the output space. We shall denote this overall output by R. Selection of some actions based upon the output set.
179
180 I. J . Rudas
Our purpose here is to investigate the requirements for the operations that can be used to implement this reasoning process. 2
T-operators, negation and some basic properties
Original fuzzy set theory was formulated in terms of Zadeh’s standard operations of mimimum, maximum and complement. Since 1965 for each of these operations several classes of operators, satisfying appropriate axioms, have been introduced. By accepting some basic conditions, a broad class of set of operations for union and intersection is formed by t-operators. Definition 1 A mapping T :[0,1] x [0,1] + [0,1] is a t-norm if it is commutative, associative, non-decreasing and T ( x , l ) = x , for all x E [0,1]. Definition 2 A mapping S: [O,l]x[O,l] + [0,1] is a t-conorm it is commutative, associative, non-decreasing and S(x,O) = x , for all x E [0,1]. Definition 3 A mapping N :[0,1] -+ [0,1] N is a negation, if non-increasing andN(O)= 1 andN(1)=0. N is a strict negation if N is strictly decreasing and N is a continuous function. N is a strong negation if N is strict and N(N(a))= a, that is, N is involutive. Further it is assumed that T is a t-norm, S is a t-conorm and N is a strict negation. 3
Uninorms
Uninorms are such kind of generations of t-norms and t-conorms where the neutral element can be any number from the unit interval. The class of uninorms seems to play an important role both in theory and application [ 151. Definition 4 [13] A uninorm U is a commutative, associative and increasing binary operator with a neutral element e E [0,1], i.e. U (x, e)= x, Vx E [0,1]. The neutral element e is clearly unique. The case e = 1 leads to t-conorm and the case e = 0 leads to t-norm. The first uninorms were given by Yager and Rybalov [27]
and
Generalized T-operators 181
U , is a conjunctive right-continuous uninorm and U , is a disjunctive leftcontinuous uninorm. Regarding the duality of uninorms Yager and Rybalov have proved the following theorem [27]. Theorem 1 Assume U is a uninorm with identity element e, then U(x, y ) = 1- U(1- x,l - y ) is also a uninorm with neutral element 1- e .
4
Nullnorms
Definition 5 [5] A mapping V :[0,1] x [0,1] + [0, I]. nullnorm, if there exists an absorbing element a E [0,1], i.e., ~ ( x , a=)a , ~x E [oJ], v is commutative, v is associative, non-decreasing and satisfies V(X,O)= x for a11 x E [0, a ]
V(X,I)= x for all x E [a,11
(3) (4)
The Frank equation was studied by Calvo, De Baets, and Fodor in case of uninorms and nullnorms, and they found the followings. Theorem 1 Consider a uninorm U with neutral element e E [0,1], then there exists no nullnorm V with absorbing element e such that the pair (U,V) is a solution of the Frank equation ~ ( xy ), + ~ ( xy ), = x + y for a11 (x,y ) E [0, 11x [0, I]. ( 5 ) 2. Consider a nullnorm V with absorbing element e E [0,1], then there exists no uninorm U with neutral element e such that the pair (U,V) is a solution of the Frank equation.
5
Compensative operations
We have seen that there are no t-operators lying between the minimum and maximum operators. This could be a disadvantage of the application of t-operators as aggregation operators in several intelligent systems where hzzy set theory is used to handle uncertain information. A union operator produces a high output whenever at least one of the input values representing degrees of satisfaction of different features or criteria is high. An intersection operator produces a high output only when all of the inputs are high. In real applications, for example at decision making it would be required that a higher degree of satisfaction of one of the criteria could be compensated for a lower degree of satisfaction of another criteria to a certain extent. In this sense, union
182
I. J . Rudas
provides h l l compensation, while in case of intersection there is no compensation at all. To handle the problem Zimmermann and Zysno [30] has introduced the socalled y-operator as the first compensatory operator. Since than compensative operators have been studied by several authors. Definition 6 An operator M is said to be a compensative if and only if
6
Averaging operators
Averaging operators represent a wide range of aggregation operators [ 141 Definition 7 An averaging operator Mis a mapping
M :[0,1]x [0,1] + [0,1]
(7)
satisfies the following properties: A4.a M ( x ,x) = x, Vx E [0,1] ; idempotency, A4.b M ( x ,y ) = M ( y ,x), Vx,y E [0,1]; commutativity, A4.c M(0,O) = 0, M(1,l) = 1 ,; boundary conditions, A4.d M ( x ,y ) I M ( z , w), if x I z and y I w , monotony, A4.e M is continuous. The next proposition shows that for any averaging operator M, the global evaluation of an action will lie between the worst and the best local rating [ 151. Proposition 8[14] If M is an averaging operator, then
7
Absorbing-norms
Definition 8 Let A be a mapping
A:[071]x[0,1]+[0,1]. A is an absorbing-norm, if for all
y , z E [0,1] satisfies the following axioms: A1 .a There exists an absorbing element a E [0,1], i.e., A(x,a) = a, VX E [0,1]. X,
A1 .b A(x,y) = A b , x ) that is, A is commutative, A1 .c A(A(x,y),z)= A(x,A(y,z))that is, A is associative,
Generalized T-operators
183
It is clear that a is an idempotent element A(a,a)= a, hence the absorbing element is unique. If there would exist at least two absorbing elements a,, a 2 ,a,, f a2 for which A(a, , a 2 )= a,, and A(a,,a,) = a2 ,so thus a, ,= a2 T-operators are special absorbing-operators, namely for any t-norm T, T(0,x)= 0, b'x E [0,1] and for any t-conom S, S(1,x) = 1, b'x E [0,1]. As a direct consequence of the definition we have if x 5 a then A(x,a) = a = max(x,a), if x 2 u then A(x,a) = a = min(x,a). These properties provide the background to define some simple absorbingnorms. The trivial absorbing-norm A, : [O, I] x [0,I] + [O, I] with absorbing element a is A, :(x, y) + a, v(x, Y) E
[w]x [OJ]
(9)
Theorem 2. (Rudas [22] ) The mapping
Amin: [O, 11x [0, I] -+ [0,1] defined as
and the mapping A,,,,,: [0,1] x [0,1] -+ [0,1] defined as mi&, y), if (x,y) E [a,11x [a,11 max(x, y ),elsewhere
are absorbing-norms with absorbing element a. The structures of these absorbing-operatorsare shown in Figure 1,2.
Corollary 1. From the structure of Amin and A,,, the following properties can be concluded:
(w) =
(OJ) = Amin(40) = 0 , Amin(0) =0> 4,, (1J) = A,,(OJ) = AJW) = 1 > A,,, (L1) = 1 .
4 i "
Amin
With the combination of Amin,A,, .and A, fkther absorbing-norms can be defined.
184 I. J . Rudas
a
I
min
min
I
a Figure 1. The structure of
Ain
a Figure 2. The structure of
A,, .
1
4
I
1
Theorem 3. (Rudas, [22]) The mapping A& : [O,l]x [0,1]-+ [0,1] dejined as
and the mapping
g,, : [0,1] x [O, 11+ [0,1] defined as
are absorbing-norms with absorbing element a. The structures are illustrated in Figure 3.
Generalized T-operators
I
a
Figure 3 The structure of
xi,,and A:=
185
1
Theorem 4 (Rudas [22]). Assume that A is an absorbing-norm with absorbing element a. The dual operator of A denoted by 2 A(x,y) = 1 - A(1- x,l - y). is an absorbing-norm with absorbing element 1-a. Let us define a kind of complements of A,,,in and A,,,,, .replacing the operator min with max and the max with min as follows.
Definition 9
We have received the first uninorms given by Yager and Rybalov [27] Because of the constructions of these operators for the pairs (Amin,Ud)and
(Aa,,U , ) the laws of absorption and distributivity are fulfilled. Theorem 5. For the pairs (Amin,U,) and (A,,,, ,U,) the following hold 1. Absorption laws A,~,(u,(x,~),x) = x for all x E [OJ] , u,(A,~,(x,~),x)= x for all x E [OJ] ,
186
I. J . Rudas A,,,= (u,( x , z ) , x ) = x for all x E [OJ] , ~~(~,(x,z),x)= x for all x E [OJ] .
2.
Laws of distributivity For all x E [0,1]
8
distance-based evolutionary operators
Let e be an arbitrary element of the closed unit interval [0,1] and denote by d(x,y) the distance of two elements x and y of [0,1]. The idea of definitions of distancebased operators is generated from the reformulation of the definition of the min and max operators as follows min(x, y ) =
x,if d ( x , ~I)d ( y , ~ ) Y j f d(X?O)'
44)
Definition 10 The maximum distance minimum operator with respect to e E [OJ] is defined as
1' 1'
if d(x,e ) > d ( y ,e ) if d(x, e ) < d ( y ,e) . maxr(x,y) = y , min(n, y)jf d(x,e)= d ( y 7e)
(24)
Definition 11 The maximum distance maximum operator with respect to e E [0,1] is defined as
if d(x,e) > d ( y ,e ) if d(x, e) < d ( y ,e ) .
m a x y (x,y ) = y , ma+, y)jf d(x,e ) = d ( y ,e )
(25)
Definition 12 The minimum distance minimum operator with respect to e E [OJ] is defined as
Generalized T-operators 187
1'
if d(x,e)< d ( y ,e )
if d(x,e ) > d ( y , e ) . miny(x,y) = y, min(x, y)jf d(x,e ) = d(y,e )
(26)
Definition 13 The minimum distance maximum operator with respect to e E [OJ] is defined as minTa (x,y ) =
9
The structure of evolutionary operators
It can be proved by simple computation that the distance-based evolutionary operators can be expressed by means of the min and max operators as follows. max?
=
1 1 1 1
max(x, y)jf y > 2e - x min(x, y), if y < 2e - x min(x, y),if y
= 2e-x
min(x, y), if y > 2e - x
minp
=
maxy
=
ma+, y)jf y < 2e - x min(x, y ), if y = 2e-x max(x, y)jf y > 2e - x min(x, y),if y < 2e - x max(x, y)jf y = 2e-x
mi&, y),if y > 2e - x
miny
=
max(x, y)jf y < 2e - x max(x, y)jf y = 2e-x
The structures of the m a x p and the m i n y operators are illustrated in Fig. 4-5.
188 I. J . Rudas
Figure 4 Maximum distance minimum ( max
)
Figure 5 Minimum distance minimum ( m i n p )
10 Properties of distance-based operators
Theorem 6 The distance-based operators have the following properties (Rudas 1221) maxy 0
max? (x, x) = x, Vx E [O, 11 ,that is m a x y is idempotent, m a x r (e, x ) = x that is, e is the neutral element, m a x y is commutative and associative, m a x r is left continuous,
Generalized T-operators 189
max;’” is increasing on each place of [0,1] x [0,1] maxy maxr” (x,x) = x, ‘dx E [0,1] ,that is maxr” is idempotent, maxra (e, x ) = x that is, e is the neutral element, m a x r is commutative and associative, maxfaax is right continuous, m a x y is increasing on each place of [OJ] x [O,l] . min y minry(x,x) = x, ‘dx E [0,1] ,that is m i n T T is idempotent, m i n y (e,x) = e that is, e is an absorbing element, m i n r is right continuous, m i n y is commutative and associative. minr” minfaax(x, x) = x, ‘dx E [0,1] ,that is m i n y is idempotent, m i n y (e,x)
=e
that is, e is the absorbing element,
m i n y is left continuous, m i n y is commutative and associative.
..
Corollary 2 max? and m a x y are uninorms, both of the operators are compensative ones. Regarding duality of uninorms Yager and Rybalov have proved the following proposition.
Proposition 2 [27] Assume U is a uninorm with identity element e, then f i ( x , y ) = 1 - U(1- x,l - y ) is also a uninorm with identity i? = 1 - e .
Corollary 3 a) The dual operators of the uninorms m a x y is maxyex,and
190
I. J . Rudas
b) the dual operators of the uninorms max? If e = 0 then m a x g = m a x t y . Proposition 3 The Pairs the absorption laws
is max?:
and
satisfy
11 Distance-based operators as parametric evolutionary operators The min and max operators As special cases of distance-based operators can be obtained depending on e as follows: a) i f e = 0 then maxy'" (x,y ) = max(x, y ) , max y (x,y ) = ma&, y ) , min,"'" (x,y >= min(x, y ) , m i n y ( x , y ) = mi&, y ) ,
b) i f e = 1 then m a x p (x,y ) = mi+, y ) , max ;lax (x,y ) = min(x, y ) , m i n y (x,y ) = max(x, y ) , m i n y (x,y ) = max(x, y ) . This means that the distance-based operators form a parametric farnib with parameter e. They are also evolutionaiy types in the sense that if for example in case of m a x r while e is increasing starting from zero till e = 1 the max operator is developing into the min operator.
Generalized T-operators 191
12 Entropy-based fuzzy connectives
The special case of distance-based operators when e = 0.5 leads to the entropy based-operators, namely the maximum fuzziness maximum and the minimum fuzziness minimum operations, introduced by Rudas and Kaynak [20]. The original definitions as fuzzy connectives are summarized in this section with an extension to the minimum fuzziness maximum and the maximum fuzziness minimum operations. The definitions are based on the concept of elementary entropy function, which assigns a value to each element of a fuzzy subset that characterizes its degree of fuzziness. The operations are defined on the basis of selecting the more or the less k z y membership degree as the output of the connectives. Definition 14 Let A be a fuzzy subset of X and A is its conventional fuzzy complement, i.e.
The elementary entropyfunction o f A is [18], [20]
qA:xa
I
if
'1 1
- PA ().,
if
(44)
5
Definition 15 Let A and B be two fuzzy subsets of the universe of discourse X and denote pAand qBtheir elementary entropy functions, respectively. The minimum fuzziness generalized minimum is defined as
The minimum fuzziness generalized maximum is defined as
192
I. J . Rudas
Definition 16 Let A and B be two fuzzy subsets of the universe of discourse X and denote pAand pBtheir elementary entropy functions, respectively. The maximumfuzziness generalized maximum is defined as
u r = u r ( A , B ) = {( X ' P u . " " " ( X ) ) / X E
1
PA
Puy :
'a
XPUF(X)E
('19
Pug('),
maxba
(xX
PB
(x)),
[0,11}> where
if PA (')
V)B
if
P B
' (1' ('> > (1' p.4
if
PA
(x) = PB (x)
'
(47)
The maximumfuzziness generalized minimum, is defined as
,y ='_"(A,B)={
(X,PC"(X))jXEX.PF"(X)t
Ply : a
1
PA PB('), min(PA
[0,11}, where if FA
7 )'(
('),
PB
('))9
(1' (1'
> 0 ) B (')
if P B > PA (') if PA (') = V ) B (')
(48)
Theorem 7 The membership functions of these operators can be expressed in terms of conventional min and max operations as follows [ 181, [20]:
It can be verified easily that the entropy-based operations correspond to the distance-based operations with respect to e as follows:
Generalized T-operators 193
13 Some other families of generalized operations Some methods of generation of generalized operations were introduced by Batyrshin et.al. [1,2,3].
Definition 17 Let T be a mapping T :[0,1] x [0,1] + [0,1] .
(57)
T is a conjunction operation if Definition 18 Let S be a mapping
s:[0,1] x [0,1] + [0,1] S is a disjunction operation
if
=a
(58)
for all a
It can be seen easily, that conjunction and disjunction operations satisfy the following properties:
T(0,O)= T(0,l) = T(1,O) = 0, T(1,l) = 1, T(0,a) = T(a,O) = 0, S(0,l) = S(1,O)= S(1,l) = 1 S(0,O) = 0. S(1,a) = S(a,l) = 1 Including these properties in the axiom systems, the concepts of quasioperations are introduced. These axioms together with commutativity and associativity the axiom skeleton of generalized conjunction and disjunction operations given by Klir and Folger are obtained [ 171.
Definition 19 Let T be a mapping
T : [0,1] x [0,1] -+ [0,1] T is a quasi-conjunction operation if
(65)
194
I. J . Rudas
T(0,O) = T(0,l) = T( 1,O) = 0, T( 1,l) = 1. Definition 20 Let S be a mapping
s:[0,1] x [0,1] + [0,1] S is a quasi-disjunction operation if S(O,1)
= S(1,O) = S(1,l) =
1, S(0,O) = 0.
Definition 21 Let T be a mapping T :[0,1] x [OJ]
+ [0,1] .
T is a pseudo-conjunction operation if T(0,l) = T(1,O) = 0, Definition 22 Let S be a mapping
s:[O,l] x [OJ] -+ [0,1] S is a pseudo-disjunction operation if S(0,l) = S(1,O) = 1.
Easy to see that the axioms imply T(O,O)=OandS(l,l)= 1, but T( 1,l) = 1 and S(0,O)= 0 do not follow. Theorem 8
Suppose T1, T2, are conjunctions, S1, and S2 are pseudo-
disjunctions, g1, g2:[0,1] [0,1] are non-decreasing functions such that g1(1)= g2(1)=1; then the following functions
are conjunction operations. As basic conjunctions T2 and Tl and basic disjunction S1 one can use, for example, the simplest T-norms and T-conorms. Disjunction operations may be generated dually or obtained from conjunctions by means of negation operation. Examples of simple parametric conjunctions obtained in this way are the following [1,31:
Generalized T-operators 195
T(x,y) = min(x,y)max{l-p(1-$,I-q(1-y), 0}, T(.,y) = min(x,y)mm(2’,yq), T(x,y) = min(x,y)min(l,2”+yq), T(X,Y) = xy(x + y - xy) p , T(x,y) = min(x,y)(2’ + yq - 2’ yq)
where p,q are positive numbers.
Theorem 9.[3]. Suppose T is a quasi-conjunction, S is a quasi-disjunction andf; g, h:[O,1]+[0, I ] are non-decreasing functions such thatAO)= g(0) = h(0) =0, A1)= g( 1) = h( 1) = 1 ; then thefunctions Tl(X,Y)=AT(g(x),hO” s1(XJY)=AS(g(x),h O ) ) ,
are a quasi-conjunction and a quasi-disjunction respectively.
Theorem 10. Suppose TI, T, are quasi-conjunctions, Sl and Sz are pseudodisjunctions, h, gl, g2:[0,1] +[O, 11 are non-decreasing functions such that gl(l) = gZ(1) = I ; then the following functions T h y ) = TZ(Tl(X,Y),&(gl(x)>g2O>)>, T ( a 4 = TZ(Tl(X,Y),g1 (SI(4Y)N, T(X,Y)= T2(Tl(X>Y), s2(h(4mx,Y))),
are quasi-conjunctions. From Theorems 9 and 10 we can obtain recursively the following simplest parametric quasi-conjunction operations: T(x,y) = min(xP,yq), T(X,Y) =xpyq, T(X,Y) =(xY)p(x + Y - XY) q, where p,q are positive real numbers. We see that the new definition of conjunction and disjunction operations provides the possibility to build the simplest parametric classes of conjunction and disjunction operations. In the following sections we discuss some applications of the new conjunctions in fuzzy modeling.
Acknowledgement The authors gratefully acknowledge the support by the FANUC‘s “Financial Assistance to Research and Development Activities in the Field of Advanced
196 I. J . Rudas
Automation Technology Fund” for 200 1, and the support by the Hungarian National Research Fund OTKA in the projects T 34651 and T 034212.
References 1. I. Batyrshin, 0. Kaynak, “Generation of generalized conjunction and disjunction operations,” in: IPMU-98, Paris, La Sorbonne, pp. 1762-1768, 1998. 2. I. Batyrshin, 0. Kaynak, I. Rudas. Generalized conjunction and disjunction operations for fuzzy control, in: EUFIT’98, Aachen, Germany, 1998, vol. 1, pp. 52-57. 3. I. Batyrshin and 0. Kaynak, “Parametric classes of generalized conjunction and disjunction operations for fuzzy modeling,” IEEE Trans. Fuzzy Syst., vol. 7, N 5, pp. 586-596, 1999. 4. I. Batyrshin and M. Wagenknecht, “Noninvolutive negations on [O,l],” J. Fuzzy Math., vol. 5, N 4, pp. 997-1010, 1997. 5. 0. Cervinka, “Automatic tuning of parametric T-norms and T-conorms in fuzzy modeling,” in Proc. 7th IFSA World Congress. Prague: ACADEMIA, 1997, V O ~1,. pp. 416-421. 6. De Baets, B.: Uninorms: the known classes. Fuzzy Logic ans Intelligent Technologies for Nuclear Science and Industry. (D. Ruan, H.A. Abderrahim, P.D’hondt and E. Kerre, eds). Proc. Third Int. FLINS Workshop (Antwerp. Belgium), World Scientific Publishing, Singapore, 1988, pp. 2 1-28. 7. G. De Cooman and E.E. Kerre, “Order norms on bounded partially ordered sets,” J. Fuzzy Math., vol. 2, pp. 281-3 10, 1994. 8. D. Dubois and H. Prade, “A review of fuzzy set aggregation connectives,” Inform. Sci.,vol. 36, pp. 85-121, 1985. 9. J.C. Fodor, “Strict preference relations based on weak t-norms,” Fuzzy Sets Syst., V O ~ 43, . pp. 327-336, 1991. 10. J.C. Fodor, “A new look at fuzzy connectives,” Fuzzy Sets and Syst., vol. 57, pp. 141 - 148, 1993. 11. J. Fodor, M. Roubens: Fuzzy Preference Modelling and Multicriteria Decision Support. Kluwer Academic Publishers, 1994, The Netherlands. 12. Fodor, J., Yager, R., Rybalov, A.: Structure of uninorms. International Journal Uncertuinq, Fuzziness, and Knowledge Based Systems. 5 (1997) pp. 41 1-427. 13. M. M. Gupta and J. Qi, “Theory of T-norms and fuzzy inference methods,” Fuzzy Sets Syst., vol. 40, pp. 431-450, 1991. 14. R. Fulltr: Introduction to Neuro-Fuzzy Systems. Physica-Verlag. , Heidelberg, 2000. 15. R. Fuller: Fuzzy Reasoning and Fuzzy Optimization. Turku Center for Computer Sciences. No 9. September 1998.
Generalized T-operators 197
16. E. P: Klement, “Construction of fuzzy o-algebras using triangular norms,” J. Math. Anal. Appl., vol. 85, pp. 543-565, 1982. 17. G. J. Klir and T.A: Folger, Fuzzy Sets, Uncertainty, and Information. PrenticeHall International, 1988. 18. I. J. Rudas, 0. Kaynak: New Types of Generalized Operations. Computational Intelligence: Soji Computing and Fuzzy-Neuro Integraiion with Applications. Springer NATO ASI Series. Series F: Computer and Systems Sciences, Vol. 192. 1998. (0.Kaynak, L. A. Zadeh, B. Tiirksen, I. J. Rudas editors), pp. 128156. 19. I. J. Rudas, A. Szeghegyi, J. F: Bitb, G. Geary:. Non Monotone Generalized Fuzzy Operations for Fuzzy Logic Controllers. 5th IEEE International Workshop on Robotics in Alpe-Adria-Danube Region. June, 1996, Budapest, pp. 529-533. 20. I. J. Rudas, M. 0. Kaynak: Minimum and maximum fizziness generalized operators Fuzzy Sets and Systems 98 (1998) 83-94. 21. Rudas, I.J., Kaynak, M.O.: Entropy-Based Operations on Fuzzy Sets. IEEE Transactions on Fuzzy Systems, ~01.6,no. 1, February 1998. pp. 33-40. 22. I.J. Rudas: Evolutionary operators new parametric type operator families The International Journal of Fuzzy Systems. . 1999. Vo1.23. No.2. pp. 149-166. 23. M. Sugeno, “An introductory survey offuzzy control,” Inform. Sci., vol. 36, pp. 59-83, 1985. 24. I. B. Turksen, “Intelligent hzzy system modeling,” in Kaynak et al. (eds), Computational Intelligence - Soft Computing and Fuzzy-Neuro Integration with applications, Springer-Verlag: 1998, pp. 157 - 176. 25. L.-X. Wang, A Course in Fuzzy Systems and Control. Prentice Hall PTR. Upper Saddle River, NJ, 1997. 26. S . Weber, “A general concept of fuzzy connectives, negations and implications based on t-norms and t-conorms,”Fuzzy Sets Syst., vol. 11, pp. 115-134, 1983. 27. Yager, R., Rybalov, A.: Uninorm aggregation operators. Fuzzy Sets and Systems 80 (1998), pp. 105-136. 28. Yager, R., Aggregation operators and fuzzy system modeling. Fuzzy Sets and Systems 67 (1994), pp. 129-145. 29. L.A. Zadeh, “Fuzzy sets,” Inform. Contr., vol. 8, pp. 338-353, 1965. 30. Zimmermann, H.-J., Zysno, P. (1980), “Latent connectives in human decision making. 31. Zimmermann, H.Fuzzy set theory and its applications. Kluwer Academic Publishers., 1981.
This page intentionally left blank
FUZZY RULE EXTRACTION FROM INPUT/OUTPUT DATA L. T. KOCZY*, J. BOTZHEIM*, A. B. RUANO**, A. CHONG ***, T. D. GEDEON*** *Department of Telecommunication and Telematics, Budapest University of Technology and Economics, H-l I 1 7 Budapest, Pcizrndny P. sktany Ud,Hungavy
[email protected] botzheim@,alpha.tit.bme.hu
** Department of Electronic Engineering and Computing, Faculty of Sciences and Technology, University of Algarve, 8000 Faro, Portugal antano@,ualg.ut * ** Department of Information Technology Murdoch University, Australia South Street, Murdoch 6150 WA cchong@,murdoch.edu.au taedeon@,central.murdoch.edu.au This paper discusses the question how the membership functions in a fuzzy rule based system can be extracted without human interference. There are several training algorithms, which have been developed initially for neural networks and can be adapted to fuzzy systems. Other algorithms for the extraction of fuzzy rules are inspired by biological evolution. In this paper one of the most successful neural networks training algorithm, the Levenberg-Marquardt algorithm, is discussed, and a very novel evolutionary method, the so-called “bacterial algorithm”, are introduced. The class of membership functions investigated is restricted to the trapezoidal one as it is general enough for practical applications and is anyway the most widely used one. The method can be easily extended to arbitrary piecewise linear functions as well. Apart from the neural networks and evolutional algorithms, fuzzy clustering has also been used for rule extraction. One of the clustering-based rule extraction algorithms that works on the projection of data is also reported in the paper.
Keywords: fuzzy systems, fuzzy rules extraction, Levenberg-Marquardt algorithm, bacterial algorithm.
1
Introduction
In the application of fuzzy systems to modelling and control one of the most important tasks is to find the optimal rule base. This might be given by a human expert or might be given a priori by the linguistic description of the modelled system. If, however, neither a suitable expert, nor the necessary linguistic descriptions are available, the system has to be designed by other methods based on numerical data. In training, the objective is to tune the membership functions in the
199
200 L. T. Koczy et al.
fuzzy system such that the system performs a desired mapping of input to output. The mapping is given by a set of examples of this function, the so-called pattern set. Each pattern pairp of the pattern set consists of an input activation vector x@)and its target activation vector fi’.After training the membership functions, when an input activation x@’ is presented, the resulting output vector y@)of the fuzzy system should equal the target vector fi’. The distance between the target and the actual output vector must be minimised for each pattern. There are several methods to minimise these distances. These methods can be adopted from the field of neural networks or from the evolution phenomenon of living beings. The paper is organised as follows. Section 2 describes the basics of fuzzy systems. The neural network algorithm is shown in the Section 3. In Section 4 the bacterial algorithm is described. In section 5, 6, 7, 8 and 9, we complete our discussion by introducing a fuzzy clustering based rule extraction technique. Section 10 is the conclusion of the paper.
2
Fuzzysystems
The theory of fuzzy logic was developed by Zadeh in the early 1960s. His theory was essentially the rediscovering the multivalued logic created by Lukasiewicz, however, with going much further in some application related aspects. In 1973 he pointed out that the new fuzzy concept could be excellently used for describing very complex problems with a system of fuzzy relations represented by a fuzzy rule base [l]. A fuzzy rule base contains fuzzy rules Ri:
Ri: IF (XI is A i l ) AND (x2 is Ai2) AND ... AND (xn is Ain) THEN (y is Bi), (1) where A , and Bi are fuzzy sets, xi and y are fuzzy inputs and output. The meaning of the structure of a rule is the following:
IF Premise THEN Conclusion (2) where the premise consists of antecedents linked by fuzzy AND operators. The Centre of Gravity (COG) defuzzification method is used here because it is general and easy to compute. This method calculates the crisp output by the sums of the centre of gravities of the conclusions. Thus, a fuzzy inference system can compute output y of an input vector x. The main purpose is to make the best solution possible for each input vector, therefore the optimum rule base need to be found. 3
The Levenberg-Marquardt algorithm
Our goal is to find the optimal rule base. This means that the distances between the targets and the corresponding output vector gives smaller error than in the case of another rule base. The error is measured by the following function:
Fuzzy Rule Extraction from Input/Output Data
& ,
201
p=l
where P is the number of patterns in the pattern set, to is thepthtarget vector, yo is the prhoutput vector. The most used method to minimise (3) is the Error-Back-Propagation (BP) algorithm, [7] which is a steepest descent algorithm. A newer method is the Levenberg-Marquardt algorithm. Denoting the parameter vector by L, and the Jacobean matrix by - :
~ [ k=]~ [ k-] ~ [ -k11 the LM update, is given as the solution of (5) In (5), a is a regularization parameter, which controls the both the search direction and the magnitude of the update. ( 5 ) can be recast as:
The complexity of this operation is of U ( n ’ ) , where n is the number of
J- . If we apply this algorithm in fuzzy systems then the parameters z columns of must be found. The structure of the fuzzy system is the following:
A grid in the input space is defined. Vectors of knots must be defined, one for each input dimension. These vectors determine the place of the membership functions. In each ithaxis Aj,j will be defined, where
j = OJ, . . . , q .They are arranged in such a way that
is the minimal and
Ai,1 is a given parameter, usually between 2 and 4. 3. Determine the parameters of the left slope, x1 and x2, as: a) Let us initialize x, as the last data point which has smaller membership degree than m and xi be the next point of the convex hull: p(xj+l)= p(xi) > m. A X j 1 < m; (If there is no such point x in the convex hull which satisfies p(x) < m then x, and xi is the first two leftmost point of the convex hull). Further the parameters of the left X I = dmin and xz = dmin. b) Let xI1 and x * be ~ the location of the intersection made by the support and the core, respectively, with the line passing through the points (xj ; p(xj)) and (xi; p(xi)) (see Figure 5). c) If > x1 then x1 := xI1 d) If x*z < x2 or i = j + 1 then x2 := xt2 e) If xi+l 5 x,, then let i := i + 1 and go to step (3b), otherwise continue 4. Determine the parameters of the left slope, x3 and x4, analogously as in the previous step. 5. Order the parameters according to x1 5 x2 5 x3 5 x4. For the convenience of later steps, we convert the trapezoidal clusters to ruspini partition as illustrated in f i g y 1. M = 213 ,a!&
____________________________ / Fig.5. Determination of X’, and X;
m = 113 range
214
9
L. T. K d c t y et al.
Merging Scheme
The reconstruction of multi-dimensional clusters by combining the 1D clusters identified at each dimension can be problematic. Let t be the average number of clusters identified at each dimension. The total number of possible combination is k. Since the number of combination grows exponentially with the increase of dimensions, examining every combination of the 1D clusters is computationally infeasible. In this section, we propose a fast merging technique. The merging process involves the use of a threshold t. The cluster in the multidimensional space is determined to be the region where the number of projected points in the region exceeds t. A point p is contained in the cluster Ciif p,&) > p&) for all jgi. The 4-step algorithm is presented. I . Find one of the multi-dimensional clusters C where the number of points that falls into all its projection exceeds the threshold t. 2 . Remove all data points that are contained in the cluster C approximated. 3. Repeat steps 1 - 2 until no more cluster can be found. The pseudo-C-code for step 1 of the algorithm is presented as follows. 10 PROCEDURE find-MD-cluster
Let Ui be the set of one-dimensional clusters in dimension i Let mdCluster = [ ] for i = 1 to k for each unit u E Ui utemp = mdCluster x u if utemp is dense denseunit = utemp break end if end for end for For the convenience of discussion, we define [ ] as the zero-dimensional (empty) cluster where [ 3 x C, = C,. The algorithm scans through each of the k dimensions to find one of the multi-dimensional clusters in the data giving the complexity O(k). Having identified a cluster, all the data points that is contained within the cluster is removed. The process is repeated until no more cluster can be found. The overall complexity is O(ck) where c is the total number of clusters in the data. Since the complexity of the algorithm is linear, it is computational feasible to deal with data with very large number of dimensions.
Fuzzy Rule Extraction from Input/Output Data
215
11 Conclusions The Leveneberg-Marquardt and the bacterial evolutionary algorithm were described in this paper. The bacterial algorithm seems to be simpler and robust. In this method, the fuzzy systems can be described easier, and the algorithm allows using not only Ruspini-partition. Apart from the neural and evolutionary algorithm, a clustering based rule extraction technique has also been reported. The technique has the advatage of being computationally efficient.
References 1. L.A.Zadeh: Outline of a new approach to the analysis of complex systems and decision processes, IEEE Tr. Systems, Man and Cybernetics 3 (1973), pp. 2844. 2. J.H.Holland: Adaptation in Nature and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence, MIT Press, Cambridge, 1992. 3. L.J. Fogel, A.J.Owens, and M.J.Walsh: Artificial Intelligence through Simulated Evolution, Wiley, New York, 1966. 4. M.Salmeri, M.Re, E. Petrongari, and G.C.Cardarilli: A Novel Bacterial Algorithm to Extract the Rule Base from a Training Set, Dept. of Electronic Engineering, University of Rome, 1999. 5. N.E.Nawa, and T.Furuhashi: Fuzzy System Parameters Discovery by Bacterial Evolutionary Algorithm, IEEE Tr. Fuzzy Systems 7 (1999), pp. 608-616. 6. J.Botzheim, B.Hhmori, and L.T.K6czy: Extracting trapezoidal membership functions of a hzzy rule system by bacterial algorithm, 7'h Fuzzy Days, Dortmund 2001,Springer-Verlag, pp. 2 18-227. 7. Werbos, P., Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, PhD. Dissertation, Appl. . Math., Harvard University, USA, 1974 8. Marquardt, D., An Algorithm for Least-Squares Estimation of Nonlinear Parameters, SIAM J. Appl. Math., 1 1, 1963, pp. 43 1-441 9. A.E.Ruano, C. Cabrita, J.V.Oliveira, L.T.K6czy, D. Tikk: Supervised Training Algorithms for B-Spline Neural Networks and Fuzzy Systems, Joint gth IFSA World Congress and 20th NAFIPS International Conference, Vancouver, Canada, 200 1. 10. Wong, K.W., Fung, C.C., and Wong, P.M. A self-generating fuzzy rules inference systems for petrophysical properties prediction. in Proceedings of IEEE International Conference on Intelligent Processing Systems. 1997. Beijing. 11. Wang, L.X. and Mendel, J.M., Generating fuzzy rules by learning from examples. IEEE Transactions on Systems, Man, and Cybernetics, 1992. 22(6): p. 1414-1427.
216
L. T.Kdczy et
al.
12. Sugeno, M. and Yasukawa, T., A fuzzy-logic-based approach to qualitative modeling. IEEE Transactions on Fuzzy Systems, 1993. l(1): p. 7-31. 13. Ihara, J., Group method of data handling towards a modelling of complex systms - IV. Systems and Control (in Japanese), 1980.24: p. 158-168. 14. Bezdek, J.C., Pattern Reconition with Fuzzy Objective Function Algorithms. 1981, New York: Plenum Press. 15. Fukuyama, Y. and Sugeno, M. A new method of choosing the number of clusters for fuzzy c-means method. in Proceedings of the 5‘h Fuzzy System Symposium. 1989. 16. Tikk, D., Gedeon, T. D., Koczy, L. T., and Biro, G. Implementation details of problems in Sugeno and Yasukawa’s qualitative modelling. Research Working Paper RWP-IT-02-2001, School of Information Technology, Murdoch University, Perth, W.A., 2001. P. 17. 17. Wong, K.W., Kbczy, L.T., Gedeon, T.D., Chong, A., Tikk, D. (2001) “Improvement of the Clusters Searching Algorithm in Sugeno and Yasukawa’s Qualitative Modeling Approach” in Reusch, B. (Ed), Computational Intelligence: Theory and Applications, Springer-Verlag, Berlin, Proceedings of 7th Fuzzy Days in Dortmund - International Conference on Computational Intelligence, October 200 1, Dortmund, pp. 536-549. 18. Yang, M.S., Wu, K.L., A New Validity Index For Fuzzy Clustering, in Proceedings of IEEE International Conference on Fuzzy Systems, December, Melbourne, 4 pages.
KNOWLEDGE DISCOVERY FROM CONTINUOUS DATA USING ARTIFICIAL NEURAL NETWORKS RUDY SETIONO, JACEK ZURADA National University of Singapore, 3 Science Drive 2, Singapore 11 7543 University of Louisville, Computational Intelligence Lab, Louisville, K Y 40208, USA We describe a method for knowledge discovery from continuous data using neural networks. The method extracts linear regression rules from trained neural networks. Each rule in the extracted rule set corresponds to a subregion of the input space and a linear function involving the relevant input attributes of the data approximates the network output for all data samples in this subregion. Having such linear regression rules is desirable in some application as the output from neural networks are highly nonlinear and generally difficult to comprehend. In contrast, better insights may be obtained from the data if the relationship between the input and the output is expressed as linear functions. The method that we proposed approximates the nonlinear output of a trained neural network by a small number of linear functions while at the same time maintains its predictive accuracy. Illustrations on how the method works on real world data sets are given.
Keywords: knowledge discovery, neural networks, linear regression rules. 1
Introduction
The problem of knowledge discovery from data with continuous attributes has been attracting interests from both the machine learning and statistics communities. Many methods that have been developed for discovering interesting relationship among the data attributes generate decision trees. For example, the model-tree predictor method M5’ generates binary decision trees with linear regression functions at the leaf nodes [l]. This method is an improved re-implementation of Quinlan’s M5 [2]. Tree generation is achieved by splitting each non-leaf node so as to minimize the intra-subset variation in the function values down each branch of the tree. Nodes in the tree are pruned by estimating the expected error at the nodes for the test data. A linear regression model is computed for each non-leaf node by using only the attributes that appear in nodes below this non-leaf node. The model is simplified by dropping terms that would increase the predictive accuracy of the tree. Several tree generating methods first convert the regression problem into a classification problem by discretizing the target values. Once the decision tree
217
218
R. Setiono and J . Zurada
for classification has been generated, a regression tree where the predictions at the leaf nodes are computed by different models can be obtained. The simplest model predicts all samples in a leaf node as constant value, which is the average of the target values of all training samples that fall into this node. A more complex model such as linear regression function of the input attributes can also be built to improve accuracy. The crucial question is then how t o divide the range of the continuous target values into a number of subintervals. Too many subintervals will introduce many classes, which in turn will generate large decision trees. The leaf nodes of such trees represent small regions of the input space with only a small number of samples in each region. The generated linear models are likely t o overfit the data as a result. On the other hand, if there are too few subintervals, the overall predictive accuracy of the regression tree may be poor due t o inadequacy of the linear models to fit the data. The RECLA system [3] resolves the problem of determining the number of subintervals by employing the wrapper approach [4]. Using this approach, a decision tree generating algorithm is wrapped around a method which discretizes the continuous target values. The possible ways for discretization include computing: (1) equally probable intervals, where each subinterval contains the same number of data samples; (2) equal width intervals; and (3) k-means clustering, where the sum of the distances of all data samples in an interval t o its center-of-gravity (centroid) is minimized. The best set of subintervals is determined according to the accuracy of the generated decision trees on the cross-validation set. The Relative Unsupervised Discretization (RUDE) algorithm [5] discretizes not only the continuous target values but also all of the continuous input variables in the data. A key component of this algorithm is a clustering algorithm which groups values of the target feature into subintervals that are characterized by similar values of some input attributes. Once the variables have been discretized, C4.5 [6] is applied for solving the original regression problem. The experimental results on five benchmark data sets show that when compared to the trees from the data sets discretized using the equal width interval and the k-means clustering methods, the decision trees generated from RUDE-discretized data sets have fewer nodes but lower predictive accuracy. CART [7] is a method for regression (as well as classification) that does not require the user to discretize the target values. An optimal regression tree is generated in two steps. In the first step, a tree that overfits the data is generated. All the data samples are assigned to the root node. If the prediction errors are too large, the samples in the node are partitioned into
Knowledge Discovery from Continuous Data 219
two groups by selecting an attribute value that would result in a maximum reduction in errors. Two leaf nodes are generated for the partitioned data set. By recursively splitting the nodes, an overfitting tree is constructed. In order t o improve the prediction accuracy, the second step of CART prunes the tree by combining nodes along the tree branches upward. Cross-validation samples can be used t o determine which nodes are t o be combined. Once an optimal tree has been constructed, prediction for a new sample is usually computed by simply averaging the target values of all samples that fall in the corresponding leaf node. Neural networks, which have been shown t o be universal function approximators, would be an excellent choice for knowledge discovery from data with continuous attributes. The main drawback, however, is the complex nature of the input-output relation of the data as represented by a trained network. The simplest network architecture commonly applied for function approximation is the feedforward neural network with one layer of hidden units. As the computation of the hidden unit activation involves nonlinear function such as the hyperbolic or the sigmoidal tangent functions, it is not easy to explain the predictions from the network as meaningful rules that may be useful for knowledge discovery. In this paper, we present the method REFANN (Rule Extraction from Function Approximating Neural Networks). The key component of this method is the approximation of the hidden unit activation function as a piecewise linear function. Having a piece-wise linear hidden unit activation function enables us to represent the network's prediction as a set of regression rules that is simple enough for users t o understand. Rules for function approximation normally take the form: if (condition is satasfied), then predict y = f (x), where f (x)is either a constant or a linear function of x,the input attributes of the data. This type of rules is suitable because of their similarity to the traditional statistical approach of parametric regression and to those generated by decision tree methods. More than one rule is usually needed to approximate the nonlinear inputoutput mapping of the network well. The approximation of the nonlinear hidden unit activation function divides the input space into smaller subregions. The method presented here predicts the target values of all samples that fall in the same subregion by a single rule consequent in the form of a linear equation whose coefficients are determined by the weights of the network connections. The rules generated are almost as accurate as the original networks from which the rules are extracted. For many data sets that we tested, the number of rules is sufficiently small that useful knowledge about the problem domain can be discovered.
220
2
R. Setiono and J. Zumda
Network training and pruning algorithm
The available data samples (I,, y p ) , p = 1 , 2 , . . . , K where input I, E IRN and target yp E IR,are first randomly divided into 3 subsets: the training, the cross-validation and the test sets. Using the training data set, a network with H hidden units is trained, so as t o minimize the sum of squared errors E(w,v) augmented with a penalty term O(w,v):
where €1 , € 2 , ,5 are positive penalty parameters, w i j is the weight of the connections from input unit j to hidden unit i and wi is the weight of the connection from hidden unit i t o the output unit. The penalty term O(w,v) when minimized pushes the weight values towards the origin of the weight space, and in practice results in many final weights taking values near or at 0. Network connections with such weights may be removed from the network without sacrificing the network accuracy [8]. The hidden unit activation value Ai, for input I, and its predicted function value Y p are computed as follows:
c H
~p
=
ViAip,
(3)
i=I
Ijp is the value of input j for pattern p . The function h ( z ) is the hidden unit activation function. This function is normally the sigmoid function or the hyperbolic tangent function. We have used the hyperbolic tangent function tanh(c) = (et - e-c)/(et + e d ) . Once the network has been trained, its hidden and input units are inspected as candidates for possible removal by a network pruning algorithm. A pruning algorithm called N2PFA (Neural Network Pruning for Function Approximation) [9] has been developed. This algorithm removes redundant and irrelevant units by computing the mean absolute error (MAE) of the network’s prediction. In particular, ET and E X , respectively the MAE’S on the training set 7 and the cross-validation set X , are used to detcrmine when
Knowledge Discovery from Continuous Data 221
pruning should be terminated:
where 1 7 1 and (XI are the cardinality of the training and cross-validation sets, respectively.
Algorithm NSPFA Given: Dataset ( I ,,yp),p= 1 , 2 , . . . , K . Objective: Find a neural network with reduced number of hidden and input units that fits the data and generalizes well. Step 1. Split the data into 3 subsets: training, cross-validation, and test sets. Step 2. Train a network with a sufficiently large number of hidden units to minimize the error function (1). Step 3. Compute E T and EX, and set ETbest = ET, EXbest = EX, Emax = max{ ETbest, EXbest}. Step 4. Remove redundant hidden units:
1. For each i = 1 , 2 , . . . , HI set vi = 0 and compute the prediction errors ETi. 2. Retrain the network with vh = 0 where ETh = mini ETi, and compute ET and EX of the retrained network.
+
3. If E T 5 (1 a)Emax and EX 5 (1+ a)Emax, then Remove hidden unit h. Set ETbest = min{ET, ETbest}, EXbest = min{EX, EXbest} and Emax = max{ ETbest, EXbest}. 0 Set H = H - 1 and go to Step 4.1. Else use the previous setting of network weights. 0
Step 5 . Remove irrelevant inputs: 1. For each j = 1 , 2 , . . . ,N , set wij = 0 for all i and compute the prediction errors E Tj . 2. Retrain the network with win = 0 for all i where ET, = minj E T j , and compute ET and EX of the retrained network.
222
R. Setiono and J . Zurada
+
3. If E T 5 (1 a)Emaz and EX 5 (1
+ a)Emaz, then
Remove input unit n. Set ETbest = min{ET, ETbest}, EXbest = min{EX, EXbest} and Emaa: = max{ETbest, EXbest}. Set N = N - 1 and go to Step 5.1. Else use the previous setting of network weights.
Step 6. Report the accuracy of the network on the test data set. The value of Emaa: is used t o determine if a network unit can be removed. Typically, at the beginning of the algorithm when there are many hidden units in the network, the training mean absolute error E T will be much smaller than the cross-validation mean absolute error EX. The value of E T increases as more and more units are removed. As the network approaches its optimal structure] we expect EX to decrease. As a result, if only ETbest is used to determine whether a unit can be removed, many redundant units can be expected to remain in the network when the algorithm terminates because ETbest tends to be small initially. On the other hand, if only EXbest is used, then the network would perform well on the cross-validation set but may not necessarily generalize well on the test set. This could be caused by the small number of samples available for cross-validation or an uneven distribution of the data in the training and cross-validation sets. Therefore, Emax is assigned the larger of ETbest and EXbest so as t o remove as many redundant units as possible without sacrificing the generalization accuracy. The parameter 0: > 0 is introduced to control the chances that a unit will be removed. With a larger value of a , more units can be removed. However, the accuracy of the resulting network on the test data set may deteriorate. We have conducted extensive experiments to find a value for this parameter that works well for the majority of our test problems.
3 Approximating Hidden Unit Activation Function
Having produced the pruned networks, we can now proceed to extract rules that explain the network outputs as a collection of linear functions. The first step in our rule extraction method is to approximate the hidden unit activation function h(z) = tanh(z) by a 3-piece linear function. It is sufficient to illustrate the approximation just for values of z ≥ 0. Suppose that the input z ranges from 0 to z_m. A simple approximation of h(z) is to over-estimate it by the piecewise linear function L(z), as shown in Fig. 1.
Figure 1. The tanh(z) function (solid curve) for z ∈ [0, z_m] is approximated by a piecewise linear function (dashed lines).
To ensure that L(z) is larger than h(z) everywhere between 0 and z_m, the line on the left should intersect the origin with a gradient of h′(0) = 1, and the line on the right should intersect the coordinate (z_m, h(z_m)) with a gradient of h′(z_m) = 1 − h²(z_m). Thus, L(z) can be written as
L(z) = z,                              if 0 ≤ z ≤ z_0,
     = h′(z_m)(z − z_m) + h(z_m),      if z > z_0.       (5)
The point of intersection z_0 of the two line segments is obtained by equating the two linear pieces:

z_0 = (h(z_m) − z_m h′(z_m)) / (1 − h′(z_m)).       (6)
The total error EA of estimating h(z) by L(z) is given by

EA = ∫_0^{z_m} (L(z) − h(z)) dz = (1/2)[z_0² + (z_m − z_0)(z_0 + h(z_m))] − ln cosh z_m → −1/2 − ln 0.5       (7)

as z_m → ∞.
That is, the total error is bounded by a constant value. Another simple linearization method of approximating h(z) is to under-estimate it by a 3-piece linear function. It can be shown that the total error
of the under-estimation method is unbounded and is larger than that of the over-estimation method for z_m > 2.96.
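As a hedged illustration of Eqns. 5 and 6, the short Python sketch below builds the over-estimating 3-piece approximation on the whole real line (using symmetry for z < 0); the choice z_m = 2.0 is only an example value.

```python
import numpy as np

def three_piece_tanh(z, zm):
    # Over-estimating 3-piece linear approximation of tanh (Eqns. 5-6),
    # extended to z < 0 by symmetry.
    h = np.tanh(zm)                    # h(z_m)
    dh = 1.0 - h ** 2                  # h'(z_m)
    z0 = (h - zm * dh) / (1.0 - dh)    # intersection point (Eqn. 6)
    z = np.asarray(z, dtype=float)
    return np.where(z < -z0, (z + zm) * dh - h,
           np.where(z > z0, (z - zm) * dh + h, z))

z = np.linspace(-3, 3, 7)
print(three_piece_tanh(z, zm=2.0))     # compare against np.tanh(z)
```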
4 Rule Generation
REFANN generates rules from a pruned neural network according to the following steps:
Algorithm REFANN
Given: Data set (I_p, y_p), p = 1, 2, ..., K, and a pruned network with H hidden units.
Objective: Generate linear regression rules from the network.
Step 1. For each hidden unit i = 1, 2, ..., H:
1. Determine z_im from the training samples.
2. Approximate the hidden unit activation function as follows:
   • Compute z_i0 (Eqn. 6).
   • Define the function L_i(z):

   L_i(z) = (z + z_im) h′(z_im) − h(z_im),   if z < −z_i0,
          = z,                               if −z_i0 ≤ z ≤ z_i0,
          = (z − z_im) h′(z_im) + h(z_im),   if z > z_i0.
3. Using the pair of points −z_i0 and z_i0 of the function L_i(z), divide the input space into 3^H subregions.
Step 2. For each non-empty subregion, generate a rule as follows:
1. Define a linear equation that approximates the network's output ŷ_p for input sample p in this subregion as the consequent of the extracted rule:

   ŷ_p = Σ_{i=1}^{H} v_i L_i(s_ip),       (8)

   where

   s_ip = Σ_{j=1}^{N} w_ij I_jp.       (9)

2. Generate the rule condition (C_1 and C_2 and ... and C_H), where C_i is either s_ip < −z_i0, −z_i0 ≤ s_ip ≤ z_i0, or s_ip > z_i0.
Step 3. (Optional) Apply C4.5 [6] to simplify the rule conditions.
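A minimal sketch of how the extracted rules can be evaluated follows, assuming the pruned network is stored as numpy arrays (W for input-to-hidden weights, V for hidden-to-output weights) with per-unit linearization points z_m and z_0; the helper names are ours, not from the paper.

```python
import numpy as np

def refann_regions_and_outputs(X, W, V, zm, z0):
    # W: (H, N) input-to-hidden weights; V: (H,) hidden-to-output weights;
    # zm, z0: (H,) linearization points for each hidden unit.
    S = X @ W.T                                   # s_ip (Eqn. 9)
    h, dh = np.tanh(zm), 1.0 - np.tanh(zm) ** 2
    L = np.where(S < -z0, (S + zm) * dh - h,
        np.where(S > z0, (S - zm) * dh + h, S))   # L_i(s_ip) per unit
    codes = np.where(S < -z0, 0, np.where(S > z0, 2, 1))
    y_hat = L @ V                                 # Eqn. 8
    return codes, y_hat
```

Each distinct row of `codes` identifies one of the (up to) 3^H non-empty subregions; within a subregion every L_i is a fixed affine piece, so substituting those pieces into Eqn. 8 yields the linear rule consequent.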
Table 1. Attributes of the pollution data set.

Attribute   Description
PRE         average annual precipitation in inches
JAT         average January temperature in degrees F
JUT         average July temperature in degrees F
OVR65       percentage of population aged 65 or older
POPN        average household size
EDUC        median school years completed by those over 22
HOU         percentage of housing units which are sound and with all facilities
DENS        population per square mile in urbanized areas in 1960
NOW         percentage of non-white population in urbanized areas in 1960
WWDRK       percentage of those employed in white collar occupations
POOR        percentage of families with income less than $3000
HC          relative hydrocarbon pollution potential
NOX         relative nitric oxides pollution potential
SO2         relative sulphur dioxide pollution potential
HUMID       annual average % relative humidity at 1 pm
In general, a rule condition C_i is defined in terms of the weighted sum of the inputs s_ip (Eqn. 9), which corresponds to an oblique hyperplane in the input space. This type of rule condition can be difficult for the users to interpret. In some cases, the oblique hyperplanes can be replaced by hyperplanes that are parallel to the axes without affecting the prediction accuracy of the rules on the data set. Consequently, the hyperplanes can be defined in terms of the isolated inputs, and are easier for the users to understand. In some cases of real-life data, this enhanced interpretability would come at a possible cost of reduced accuracy. If the replacement of rule conditions is still desired, it can be achieved by employing a classification method such as C4.5 in the optional Step 3.
5 Illustrative Examples
The following examples of applying REFANN on two different data sets illustrate the algorithm in more detail. The input attributes of the first data set are continuous, while those of the second data set are mixed.
Example 1. Pollution data set
The data set has 15 continuous attributes, as listed in Table 1. The goal is to predict the total age-adjusted mortality rate per 100,000 (MORT). The values of all 15 input attributes were linearly scaled to the interval [0, 1], while the target MORT was scaled so that it ranged in the interval [0, 4]. One of the networks that had been trained for this data set was selected to illustrate in detail how the rules were extracted by REFANN. This network originally had eight hidden units, but only one hidden unit remained after pruning. The numbers of training, cross-validation and test samples were 48, 6, and 6, respectively. Many input units were also removed from the network. Only the connections from six inputs: PRE, JAT, JUT, HOU, NOW, and SO2 were still present after the network had been pruned. The weighted input value with the largest magnitude was taken as z_m, and the value of z_0 was computed according to Eqn. 6 to be 0.6393. Therefore, the hyperbolic tangent function was approximated by

L_1(s_1p) = −0.4155 + 0.3501 s_1p,   if s_1p < −0.6393,
          = s_1p,                     if −0.6393 ≤ s_1p ≤ 0.6393,
          = 0.4155 + 0.3501 s_1p,    if s_1p > 0.6393.
The three subsets of the input space were defined by the following inequalities:

• Region 1: s_1p < −0.6393 ⟺ −0.76 PRE + 0.48 JAT + 0.29 JUT + 0.24 HOU − 0.98 NOW − 0.79 SO2 + 0.24 < −0.6393
• Region 2: −0.6393 ≤ s_1p ≤ 0.6393 ⟺ −0.6393 ≤ −0.76 PRE + 0.48 JAT + 0.29 JUT + 0.24 HOU − 0.98 NOW − 0.79 SO2 + 0.24 ≤ 0.6393
• Region 3: s_1p > 0.6393 ⟺ −0.76 PRE + 0.48 JAT + 0.29 JUT + 0.24 HOU − 0.98 NOW − 0.79 SO2 + 0.24 > 0.6393
It should be noted that the coefficients of the two parallel hyperplanes that divide the input space into the three regions are equal to the weights w_1j from the j-th input unit to the hidden unit. Upon multiplying the coefficients of L_1(s_1p) by the connection weight value from the hidden unit to the output unit (Eqn. 8) and re-scaling the input and output data back into their original values, we obtain the following rules:

Rule Set 1:
Rule 1: if Region 1, then ŷ = Y_1.
Rule 2: if Region 2, then ŷ = Y_2.
Rule 3: if Region 3, then ŷ = Y_3.
Table 2. The errors of a pruned network, two rule sets extracted from it, and multiple linear regression for the pollution data on the test set. The error rates are computed in terms of the Root Mean Squared Error (RMSE), the Relative Root Mean Squared Error (RRMSE, Eqn. 10), the Mean Absolute Error (MAE), and the Relative Mean Absolute Error (RMAE, Eqn. 11).

                     RMSE    RRMSE    MAE     RMAE
Pruned network       21.62   48.45    18.50   49.35
Rule set 1           20.63   46.22    17.12   45.67
Rule set 1a          20.64   46.25    17.17   45.81
Linear regression    69.61   155.96   53.42   142.51
The predicted value ŷ is given by one of the following 3 equations:

Y_1 = 1030.51 + 0.95 PRE − 0.59 JAT − 0.70 JUT − 0.55 HOU + 1.39 NOW + 0.16 SO2
Y_2 = 1075.68 + 2.28 PRE − 1.43 JAT − 1.68 JUT − 1.31 HOU + 3.34 NOW + 0.37 SO2
Y_3 = 940.45 + 0.95 PRE − 0.59 JAT − 0.70 JUT − 0.55 HOU + 1.39 NOW + 0.16 SO2

An optional final step of REFANN is to describe the 3 input subspaces by rule conditions generated by C4.5. All training samples p in Region 1 were given a target value of 1, in Region 2 a target value of 2, and in Region 3 a target value of 3. The following rules were generated by C4.5:

Rule Set 1a:
Rule 1: if SO2 > 146, then ŷ = Y_1.
Rule 2: if NOW > 27.1, then ŷ = Y_1.
Rule 3: if PRE > 27.1 and NOW ≤ 27.1, then ŷ = Y_2.
Rule 4: if PRE ≤ 13, then ŷ = Y_2.
Default Rule: ŷ = Y_3.
The errors of the pruned neural network and the rule sets are shown in Table 2. We combined the training and cross-validation sets and obtained the coefficients of the linear regression that fit the samples. Using the backward regression option of SAS [10], none of the input attributes was found to be significant at the default significance level of 0.10. The errors from linear regression on the test set are included in Table 2 for comparison. In addition to the mean absolute error (MAE) and the root mean squared error (RMSE),
we also show the relative root mean squared error (RRMSE) and the relative mean absolute error (RMAE):

RRMSE = 100 × √( Σ_p (y_p − ŷ_p)² / Σ_p (y_p − ȳ)² ),       (10)

RMAE = 100 × Σ_p |y_p − ŷ_p| / Σ_p |y_p − ȳ|,       (11)

where the summations are computed over the samples in the test set and ȳ is the average value of y_p in the test set. These relative errors are sometimes preferred over the usual sum of squared errors or the mean absolute error because they normalize the differences in the output ranges of different data sets. An RRMSE or an RMAE that is greater than 100 indicates that the method performs worse than the method that simply predicts the output using the average value of the samples. The result highlights the effectiveness of the REFANN algorithm in splitting the input space into subspaces, where in each subspace a linear equation is generated. By applying different equations depending on the input, the accuracy of the predictions is improved. The error statistics also indicate that the prediction quality of the rules is very close to that of the pruned network.
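The two relative errors are straightforward to compute; the short sketch below assumes the definitions of Eqns. 10 and 11.

```python
import numpy as np

def rrmse(y, y_hat):
    # Relative RMSE (Eqn. 10): values above 100 mean the predictor is
    # worse than simply predicting the mean of the test targets.
    return 100.0 * np.sqrt(np.sum((y - y_hat) ** 2) /
                           np.sum((y - y.mean()) ** 2))

def rmae(y, y_hat):
    # Relative MAE (Eqn. 11), with the same interpretation of 100.
    return 100.0 * np.sum(np.abs(y - y_hat)) / np.sum(np.abs(y - y.mean()))
```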
Example 2. AutoMpg data set

The target to be predicted in this problem is the city-cycle fuel consumption of different car models in miles per gallon. The 3 discrete attributes of the data are (1) cylinders, with possible values of 3, 4, 5, 6, and 8; (2) model, with possible values of 70 through 82; and (3) origin, with possible values of 1, 2, and 3. The 4 continuous attributes are (1) displacement, (2) horsepower, (3) weight, and (4) acceleration. The training set contained 318 samples, while the cross-validation and test sets contained 40 samples each. The unary-coded data required the neural network to have 26 input units. One pruned network has 1 hidden and 7 input units left. The relevant network inputs are the following (we have used the 0/0.2 instead of the 0/1 encoding scheme): (1) I_4 = 0.2 iff cylinders is greater than 3; (2) I_9 = 0.2 iff model is later than 78; (3) I_11 = 0.2 iff model is later than 76; (4) I_14 = 0.2 iff model is later than 73; (5) I_21 = 0.2 iff origin is 1; (6) I_23 is horsepower; and (7) I_24 is weight. A rule set consisting of just 2 rules was obtained:

Rule Set 2:
Rule 1: if Region 1, then ŷ = Y_1.
Rule 2: if Region 2, then ŷ = Y_2.

The two subregions of the input space are defined as follows:

• Region 1: s_1p < −0.8873 ⟺ 0.41 I_4 + 1.07 I_9 + 0.67 I_11 + 0.52 I_14 − 0.41 I_21 − 0.003 I_23 − 0.0004 I_24 < −1.7198
• Region 2: s_1p ≥ −0.8873 ⟺ 0.41 I_4 + 1.07 I_9 + 0.67 I_11 + 0.52 I_14 − 0.41 I_21 − 0.003 I_23 − 0.0004 I_24 ≥ −1.7198

and the two corresponding linear equations are

Y_1 = 16.26 + 0.54 I_4 + 1.40 I_9 + 0.88 I_11 + 0.69 I_14 − 0.53 I_21 − 0.0046 I_23 − 0.0005 I_24
Y_2 = 46.73 + 7.85 I_4 + 20.30 I_9 + 12.72 I_11 + 9.96 I_14 − 7.73 I_21 − 0.0661 I_23 − 0.0079 I_24

We obtained the following rule set from C4.5 after executing the optional Step 3 of the algorithm REFANN:

Rule Set 2a:
Rule 1: if (I_9 = 0) and (I_23 > 115) and (I_24 > 3432), then ŷ = Y_1.
Rule 2: if (I_11 = 0) and (I_24 > 3574), then ŷ = Y_1.
Rule 3: if (I_11 = 0) and (I_23 > 130), then ŷ = Y_1.
Rule 4: if (I_23 ≤ 98), then ŷ = Y_2.
Rule 5: if (I_23 ≤ 130) and (I_24 ≤ 3432), then ŷ = Y_2.
Rule 6: if (I_11 = 0.2) and (I_24 ≤ 3432), then ŷ = Y_2.
Rule 7: if (I_11 = 0.2) and (I_23 ≤ 115), then ŷ = Y_2.
Rule 8: if (I_9 = 0.2), then ŷ = Y_2.
Default rule: ŷ = Y_2.

We compare the predictive accuracy of the extracted rules with the neural network and multiple linear regression in Table 3. The multiple linear regression model has 14 parameters that are significant at α = 0.10. Fitting the data with more input attributes, however, does not give a better model, as shown by the RMSE and MAE of this model. By using a pruned neural network to divide the input space into two regions and having a linear equation in each of these regions for prediction, the RMSE and MAE are reduced by 18% and 27%, respectively.
Table 3. The errors of a pruned network, two rule sets extracted from it, and multiple linear regression for the autoMpg data on the test set.

                     RMSE   RRMSE   MAE    RMAE
Pruned network       2.91   35.31   2.03   29.62
Rule set 2           3.00   36.33   2.07   30.10
Rule set 2a          3.00   36.36   2.07   30.18
Linear regression    3.65   44.25   2.85   41.53

6 Conclusion
We have presented a method for discovering knowledge from data sets where the underlying problem of interest is that of predicting a continuous-valued target variable. The relationship between the input variables and the target variable for many practical data sets can be expected to be nonlinear. The task of knowledge discovery from such a data set involves generating rules that are meaningful to humans, and it is a challenging one. Our proposed method represents the nonlinear multivariable relationship as a set of rules where each rule consequent is a linear function. The conditions of the rules divide the input space of the data into smaller disjoint subspaces. The key components of the method are a feedforward backpropagation neural network and a simple linearization of the hidden unit activation function of the network. Neural networks are particularly suitable for learning nonlinear relationships among data variables. By approximating the nonlinear hidden unit activation function of the network as a piecewise linear function, the network's outputs can also be expressed as linear functions of the input variables. In this paper, we have described the algorithm N2PFA for pruning neural networks that have been trained for regression and the algorithm REFANN for extracting linear regression rules from the pruned networks. The algorithm N2PFA produces pruned networks whose predictions are as accurate as those of other regression methods for many of the problems that we have tested [9]. The algorithm REFANN attempts to provide an explanation for the network outputs by replacing the nonlinear mapping of a pruned network with a set of linear regression equations. Using the weights of a trained network, REFANN divides the input space of the data into a small number of subregions such that the prediction for the samples in the same subregion can be computed by a single linear equation. REFANN approximates the nonlinear hyperbolic tangent activation function of the hidden units using a simple 3-piece linear
function. It then generates rules in the form of linear equations from the trained network. The conditions in these rules divide the input space into one or more subregions. For each subregion, a linear equation that approximates the network output is generated. We have conducted extensive experiments using a wide range of real-world data sets, and the results confirm the effectiveness of the algorithm. An extended version of this paper reports our findings in detail and will appear in a journal [11].

References
1. Wang, Y. and Witten, I. H. (1997) Induction of model trees for predicting continuous classes. In Proc. of the Poster Papers of the European Conference on Machine Learning. Prague: University of Economics, Faculty of Informatics and Statistics, 128-137.
2. Quinlan, R. (1992) Learning with continuous classes. In Proc. of the Australian Joint Conference on Artificial Intelligence, Singapore, 343-348.
3. Torgo, L. and Gama, J. (1997) Search-based class discretization. In Proc. of the 9th European Conference on Machine Learning, ECML-97, Lecture Notes in AI 1224, Springer, M. van Someren and G. Widmer (Eds), Prague, 266-273.
4. John, G. H., Kohavi, R., and Pfleger, K. (1994) Irrelevant features and the subset selection problem. In Proc. of the 11th International Conference on Machine Learning, Morgan Kaufmann Publishers, 121-129.
5. Ludl, M.-C. and Widmer, G. (2000) Relative unsupervised discretization for regression problems. In Proc. of the 11th ECML, ECML 2000, Lecture Notes in AI 1810, Springer, R. A. Mantaras and E. Plaza (Eds), Barcelona, 246-253.
6. Quinlan, R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, California.
7. Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984) Classification and Regression Trees. Wadsworth International Group, Belmont, California.
8. Setiono, R. (1997) A penalty function approach for pruning feedforward neural networks. Neural Computation 9(1), 185-204.
9. Setiono, R. and Leow, W. K. (2000) Pruned neural networks for regression. In Proc. of the 6th Pacific Rim Conference on Artificial Intelligence, Lecture Notes in AI 1886, Springer, R. Mizoguchi and J. Slaney (Eds), Melbourne, 500-509.
10. SAS Technical Report A-102, SAS Regression Applications, SAS Institute Inc., Cary, NC, USA.
11. Setiono, R., Leow, W. K., and Zurada, J. (2002) Extraction of rules from artificial neural networks for nonlinear regression. IEEE Transactions on Neural Networks, forthcoming.
Advanced Applications with Machine Intelligence
REVIEW OF FUZZY LOGIC IN THE GEOLOGICAL SCIENCES: WHERE WE HAVE BEEN AND WHERE WE ARE GOING

ROBERT V. DEMICCO
Department of Geological Sciences and Center for Intelligent Systems, Binghamton University, Binghamton, New York 13902-6000
E-mail: [email protected]

Geology is the application of physics, chemistry and biology to the study of the Earth. This report is a survey of approximately 70 recent papers in major geological science journals where fuzzy logic has been used to tackle Earth Science problems. These papers are briefly reviewed and grouped into nine categories: 1) geotechnical engineering; 2) surface hydrology; 3) subsurface hydrology; 4) hydrocarbon exploration; 5) ground-water risk assessment; 6) seismology; 7) soil science and landscape development; 8) deposition of sediments; and 9) a miscellaneous group. Papers cited in each category are intended to give an interested worker access to recent refereed papers in major geologically oriented journals that should be available in even modest university libraries. These papers should serve as an introduction into the literature of applications of fuzzy logic in the geological sciences by Earth Scientists.

Keywords: fuzzy logic, geotechnical engineering, hydrology, hydrocarbon exploration, seismology, soil science.
1 INTRODUCTION
Many fields in the geological sciences are beginning to exploit the potential of machine intelligence. This report focuses on recent literature in geologically oriented journals on what is one of the most commonly employed "soft computing" techniques in the geological sciences: fuzzy logic. Since the mid 1980s, when fuzzy logic was initially developed for industrial control [1], many successful applications of fuzzy logic have been developed in other areas of engineering [2], as well as in decision making, business management, operations research and other professional areas [3], [4]. Although applications of fuzzy logic in the sciences are comparatively less developed, the utility of fuzzy logic has been demonstrated in chemistry [5], quantum physics [6], [7], [8], economics [9], ecology [10], [11], and geography [12]. The trend in the Earth Sciences over the last ten years has been to view the Earth as a system and treat the hydrosphere, atmosphere, biosphere and lithosphere as interconnected subsystems. This approach is interdisciplinary and has been largely fueled by concern about the Earth's present and past environments, with a growing realization that what happens in any one of the Earth's "spheres" has impact on the others. Geology is the application of chemical, physical, and biological principles to the study of the lithosphere. Most problems in geology involve systems with large
numbers of components and rich interactions among the components that are usually nonlinear and non-random. Such problems of organized complexity typify geologic systems and are exemplified by the geological systems that operate at the surface of the Earth. In response to the recognition that geology deals with the realm of organized complexity, there has been a recent explosive growth in the theory and application of fuzzy logic and other related "soft" computing techniques in the Earth Sciences. These techniques are now opening new ways of geologic modeling based on knowledge expressed in natural language. Geology, of all of the natural sciences, most readily lends itself to analysis and modeling by the emerging "soft computing" techniques being explored by this conference. In particular, fuzzy logic, in the broad sense, has at least a twenty-year history (and a growing body) of refereed works where it has been successfully applied to many areas of geological research. There are a number of reasons for this. First, geology is primarily a field science that began as an outgrowth of the mineral extraction industry. As such, the variables that geologists have routinely measured for hundreds of years are continua that commonly vary over many orders of magnitude. For example, the size of sedimentary particles ranges at least from 10^{-4} mm through 10^4 mm. Hydraulic conductivity, the constant k in Darcy's Law

Q = −k (δh/δl) · A,       (1)

varies from approximately 10^{-11} to 10^{1} m/s for water and Earth materials¹ [where Q is directional volume flow rate (m³/s), h is the hydraulic head (m) (a proxy for a fluid potential field made up of potential energy and pressure energy terms), l is a length (m) over which the potential change is measured, and A is the cross-sectional area of the fluid flow]. Currently, most of these naturally continuous variables are, more often than not, broken up into arbitrary "pigeon holes" by geologists seeking to "classify". Second, because geological research is field based, it is commonly carried out over fairly broad regions of tens to hundreds of square kilometers in size. And, where subsurface data are added, the volume of a typical rock body being studied even on a modest scale, for minerals, gas, ground-water, or information about conditions on the ancient Earth, ranges up to 1000 km³. Direct sampling of the entire rock or sediment body is clearly prohibitive, and so much of the three-dimensional distribution of rock properties is measured over a relatively tiny percentage of the study area and inferred over the entire volume. The inference is commonly based on "expert knowledge" from a variety of sources to model the disposition of rock properties.

¹ A closely related, curious unit that takes into account the viscosity and density of different fluids as well as the cross-sectional area of flow is known as a darcy. A darcy has the dimensions of [L²] and 1000 millidarcies = 1 darcy ≈ 10^{-5} m/s.
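As a simple numerical illustration of Eqn. 1 (the values below are invented for the example, not taken from any study cited here):

```python
def darcy_discharge(k, dh, dl, area):
    # Q = -k * (dh/dl) * A: k in m/s, head change dh (m) over length dl (m),
    # cross-sectional area in m^2; Q comes out in m^3/s.
    return -k * (dh / dl) * area

# A sandy aquifer with k ~ 1e-4 m/s, a 2 m head drop over 100 m,
# through a 50 m^2 cross section:
print(darcy_discharge(k=1e-4, dh=-2.0, dl=100.0, area=50.0))  # 1e-4 m^3/s
```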
Third, remote sensing has always been used in geological data gathering. The two most common remote sensing techniques are exploration seismic surveys and geophysical measurements of boreholes. In the former, the differential impedance of rocks to artificial vibrational energy introduced into the subsurface produces what amounts to an "ultrasound" of subsurface rock disposition. In borehole geophysical surveys, the gamma ray output of rocks and the electrical resistivity of rocks to an induced potential are measured as proxies for rock type, permeability, etc. These techniques require correspondences between the proxy measures and the desired rock properties to be established. These correspondences are commonly not one-to-one. In addition to these subsurface methods, there is growing reliance on Geographical Information Systems (GIS) and satellite remote sensing in a number of geological fields interested in the land's surface and its development. In spite of the above considerations, which highlight the imprecise nature of much geological information, as of this writing, most geophysicists and many geologists are wedded to the application of Newtonian mechanics, and particularly, partial differential equations, to modeling the Earth. The purpose of this paper is to briefly review the range of uses of fuzzy logic in the geological sciences literature to allow an interested worker access to the literature. The sources are major geological sciences journals that will be available even in a modest research library.
2 REVIEW OF AREAS OF GEOLOGY WHERE FUZZY LOGIC HAS BEEN EMPLOYED: (Where We Have Been)
The following synopsis divides the current literature of fuzzy logic into eight specific categories and one miscellaneous category. The papers cited in each category do not comprise a complete bibliography of materials. Instead, they represent recent refereed papers in major geological journals that will serve as an introduction into the literature. The categories chosen are briefly described here. It should come as no surprise that geotechnical engineering and areas closely allied to engineering, surface hydrology, subsurface hydrology, and hydrocarbon exploration, have seen early and extensive use of fuzzy logic. There have been a number of management-oriented models for ground-water risk assessment based on fuzzy logic. These models incorporate surface and subsurface hydrologic data. Exploration geophysicists in hydrocarbon exploration have adapted fuzzy logic into seismic processing and evaluation. Earthquake seismology is a discipline still very closely wedded to Newtonian mechanics. However, a growing number of geophysicists have recently adapted fuzzy logic and there are scattered uses mentioned below. Soil science and landscape development have seen extensive use of fuzzy logic and models of modern and ancient deposition of sediments have also seen modest use of fuzzy logic to simulate sediment production, sediment erosion, sediment transportation and sediment deposition. Finally, there is scattered miscellaneous literature in the “core” geological sciences outlining potential uses of fuzzy logic in specific areas not included in the categories outlined here.
3 GEOTECHNICAL ENGINEERING
There are a number of models for river discharge control and flood management using fuzzy control systems. These include both simple, one-dam systems [13] and an optimization model for simultaneous flood gate control of the Yangtze River involving the Three Gorges Dam, flood basin areas along the river, and eight flood control dams on tributaries [14]. Optimized watershed management plans have also been developed for the Lake Erhai basin in southern China using a fuzzy multi-objective management program [15]. The Lake Erhai watershed is under intense developmental pressure from a variety of competing land uses (agricultural, scenic, light industry, etc.). A number of applications of fuzzy logic have been developed for excavation and mining operations. A fuzzy expert system developed to evaluate the failure potential of road-cut slopes and embankments was applied along a highway in a landslide-prone area of Jordan [16]. Rock trencher performance was modeled with fuzzy logic [17], and fuzzy clustering algorithms have been developed to identify fracture sets encountered in exploration drilling [18].
4 SURFACE HYDROLOGY
The discharge of a stream is the volume flow per unit time through a cross section of the stream at a point along the stream's course. Obviously, reliable prediction of discharge, especially low flows during droughts and high flows that lead to flooding, is important for a variety of reasons. The discharge of a river represents a complicated, highly non-linear response of a watershed area to precipitation. Complicating factors include: whether the soils are frozen or thawed, and, if thawed, their moisture content; the topography of the watershed; the plant cover of the watershed, its growth cycle, and whether the leaves are wet or dry; the intensity, duration and location of precipitation; etc. [19]. A number of deterministic models have been developed to try to predict discharge from precipitation records or forecasts, with real-time predictions as the goal. These models vary in sophistication, but, as most were developed for a specific geographic area, they are difficult to apply globally and are highly parameterized. One of the first uses of fuzzy logic was to help refine the parameters input to these models. One well-known deterministic model, employed to study a small watershed in Brittany, France, was augmented by incorporation of fuzzy sets [20]. In this area, fuzzy slope measurements were combined with a fuzzy image analysis of the terrain (as measured by synthetic aperture radar) to evaluate the saturation state of the catchment area. This information, in turn, fed into the standard deterministic model. Another pre-existing deterministic model of rainfall/runoff for a small catchment in Taiwan [21] was modified to use a fuzzy multi-objective function to calibrate the model parameters. Deterministic watershed response models on experimental, well instrumented and well-studied watersheds in
the western U.S. have also incorporated fuzzy logic based inferences from soils maps to provide parameters into the model [22]. Finally, the results of 5 deterministic watershed response models applied to 11 catchment areas in Taiwan were combined into a single response model using an affine Takagi-Sugeno model [23]. There are also a number of recent watershed response models that are entirely based on fuzzy logic. An adaptive neural fuzzy inference system (ANFIS) analysis of long-term data from a watershed in Tuscany, Italy, extracted fuzzy rules that were used in the development of a rainfall/response model [24]. Back propagation fuzzy-neural networks have been applied on long-term data from a watershed in Taiwan for the same purpose [25]. Somewhat closely related to these watershed response models are studies that have attempted to model and predict precipitation input onto watersheds. Pongracz and colleagues [26] tried to develop a drought prediction model for the Great Plains of the U.S. They developed a set of fuzzy rules that related two inputs: 1) the Southern Oscillation Index (SOI) as a proxy for the El Nino Southern Oscillation (ENSO); and 2) the geopotential height field of the 500 hPa level over a large area of the western hemisphere; to a long-term record of droughts in 8 regions of Nebraska. This approach was extended to the general stochastic prediction of a time series for precipitation over Europe [27]. In the general European model, a fuzzy classification of point measurements of geopotential atmospheric pressure surfaces over a large-scale grid of Europe served as an input into a more conventional stochastic model.
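To make the idea of an affine (first-order) Takagi-Sugeno model of the kind used in [23] concrete, here is a minimal Python sketch with a single scalar input; the membership functions, rule coefficients, and input value are illustrative assumptions, not parameters from the studies cited above.

```python
import numpy as np

def ts_output(x, centers, widths, coeffs, intercepts):
    # First-order Takagi-Sugeno inference: each rule r has a Gaussian
    # antecedent and an affine consequent y_r = a_r * x + b_r; the output
    # is the firing-strength-weighted average of the consequents.
    w = np.exp(-((x - centers) ** 2) / (2.0 * widths ** 2))
    y_r = coeffs * x + intercepts
    return np.sum(w * y_r) / np.sum(w)

# Two rules, e.g. "low flow" and "high flow" regimes of a scaled predictor:
print(ts_output(0.7,
                centers=np.array([0.0, 1.0]), widths=np.array([0.5, 0.5]),
                coeffs=np.array([0.2, 1.5]), intercepts=np.array([0.1, -0.3])))
```

Because the consequents are affine, the model output blends smoothly between local linear models as the input moves between rule regions.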
5 Subsurface Hydrology
Ground-water flow has been most commonly modeled by the piece-wise solution of the diffusion equation, either by finite difference approximations [28] or finite elements [29]. Solute transport in ground-water systems has likewise been modeled by finite difference or finite element approximations of the advective-dispersive equation [30]. These types of models are built around the empirical Darcy's law (equation 1) and are very commonly applied to problems of ground-water well-field development or, where pollutants are dissolved in the ground-water, remediation plans. There is a growing recognition that such models (although quite common) may be inadequate [31] due to the inherent imprecision of knowledge of the three-dimensional distribution of hydraulic conductivity. More fundamentally, the generally fuzzy nature of the variables hydraulic conductivity, hydraulic head and storativity (specific yield) themselves has been incorporated in both steady-state [32] and transient [33] ground-water flow models. These models used fuzzy numbers in the differential equations and a fuzzy technique for solution of the equations. A similar model for flow in the unsaturated zone (the surface-most soil zone where hydraulic conductivities vary with the state of pore saturation) has also been developed [34]. Recently, fuzzy rule-based models for solute transport in the
unsaturated zone have been developed [35]. This change from using fuzzy numbers to represent imprecise variables in traditional finite difference solutions to models entirely based on fuzzy rules has mirrored a similar development in surface flow models. Another type of flow modeling seeks to understand ground-water flow from first principles instead of the empirically derived Darcy flow equation. In one such model, the degree of interconnectedness among pores in a porous medium was modeled with fuzzy sets [36].
6 Ground Water Risk Assessment
Assessing the risk of contaminant pollution of ground-water from various anthropogenic sources is an important element of municipal and agricultural well-field planning. Aquifer vulnerability to contamination depends on soil properties, precipitation, topography, etc., all of which can be modeled with fuzzy sets. Models have been developed to assess non-point source pesticide pollution of aquifers [37] and to assess the potential of industrial ground-water contamination [38], [39].
7 Hydrocarbon Exploration
Not surprisingly, this area has seen the most development of all so-called "soft computing" technologies, including neural networks, fuzzy logic, genetic algorithms, and data analysis. There have been a number of recent journal issues solely dedicated to this area, and they are an excellent place to gain access to the literature. Computers and Geosciences (volume 26, number 8, October 2000) produced an issue entitled "Applications of Virtual Intelligence to Petroleum Engineering". This issue [40] was edited by Shahab Mohagheg and contained nine technical papers in addition to an introductory note. Most of the papers are dedicated to neural networks, many of which rely heavily on fuzzy logic. In addition to the special issue cited above, numerous single papers on applications of soft computing to the Earth Sciences are found in Computers and Geosciences. The Journal of Petroleum Geology (volume 24, number 4, October 2001) published a thematic issue entitled "Field applications of intelligent computing techniques". This issue was edited by Wong and Nikravesh [41] and included five technical papers and an introduction by the editors. Most of the papers were applications using neural networks. However, there was a specific application of fuzzy logic to the biostratigraphic interpretation of mudstones in a North Sea oil field [42] and a specific application of fuzzy partitioning to the classification and interpretation of remotely sensed resistivity and spontaneous potential wire-line well logs [43].
The Journal of Petroleum Science and Engineering produced two special issues dedicated to "Soft computing and Earth Sciences", edited by Nikravesh, Aminzadeh and Zadeh. Part one [44] (volume 29, numbers 3-4, May 2001) contained seven technical papers plus an introduction by the editors. Part two [45] (volume 31, numbers 2-4, January 2001) carried nine papers. These two special issues have a bit more variety of applications of soft computing to the Earth Sciences but still mostly deal with applications of neural networks to data analysis. An extended version of these two theme issues is currently in press.
8 Earthquake Seismology
Deyi and Xihui [46] presented the results of an international symposium on fuzzy mathematics in earthquake research. Since the publication of this volume, application of fuzzy logic to seismology has focused on earthquake prediction, including assessment of the magnitude of an earthquake (the absolute amount of energy a seismic event puts into the ground) [47], [48], and the pattern of surface disruption. The surface effects of passage of seismic waves can vary dramatically in an urban area depending on the material properties of the area, the type and depth to bedrock in the area, and the amplification effects of earthquake waves by the shape of the resonating deposits. Prediction of the ground motion at a site depends not only on these properties, but also on the properties of the incident waves, including their orientation. Fuzzy-based neural-network approaches have been developed to integrate the properties of the waves and the properties of the ground to predict ground motion [49], [50], [51]. Locating the source site of earthquake waves arriving at a distant site has been one of the staples of seismology since its beginning. A number of deterministic models have been developed to make these calculations. Recently, Lin and Sanford [52] described an inversion technique wherein deviations between theoretical and observed arrival times are assessed with fuzzy logic.
9 Soil science and landscape development
Maps that depict the distribution of different soil types are a standard product of geological and agricultural surveys throughout the world and have a variety of critical uses. Such uses include agricultural planning, conservation, input into watershed models, input into ground-water models, and the legal definition of wetlands, among others. Soils are three-dimensional mantles of weathered Earth materials with a complex biogeochemical evolutionary history depending on parent material, precipitation, temperature, topography, ground-water levels, etc. [53]. Traditional soil classifications, and maps produced from them, ignored intergradations of soil types, both horizontally and vertically, and were based on widely spaced sampling pits. Although the actual microscopic structure of soils is
difficult to assess, a two-dimensional, fuzzy model of soil element disposition has been developed [54]. In recent years soil science has been revolutionized by the development of Geographical Information Systems (GIS), improvements of remote sensing capabilities (to patches 10 meters or so on a side), and a fuzzy approach toward soil classifications and mapping [55], [56], [57]. Examples of this approach have been applied to different geographical areas, including an alluvial flood plain in western Greece [58], an area in New South Wales, eastern Australia [59], and areas in Wisconsin and Montana [60]. In each of these studies, horizontal and vertical measurements of soil properties in test pits are employed to devise fuzzy soil classification systems, i.e. systems wherein a point can belong to more than one soil type. In this way, intergradations among soil types are naturally handled and small areas of slightly different soil types within larger areas can be identified. These soil types are then mapped from remotely sensed images of an area at a resolution of blocks that are approximately a few tens of meters on a side. Remote sensing commonly measures the intensity of various wavelengths of radiation reflected off the Earth's surface or produced at the surface. A transfer function (commonly fuzzy) is then developed to relate wavelength and intensity of radiation to a soil type. These techniques have also been used to interpret glacial features from the eastern Italian Alps [61] and Greece [62]. Finally, fuzzy logic has been employed to understand geochemical aspects of soils and recent stream sediments. Fuzzy c-means clustering of elements measured in stream samples from the Alps of Austria established 4 categories of background levels of various elements [63]. These categories were better able to screen out non-anomalous concentrations of metals. Similarly, fuzzy clustering established relationships among glacial tills, bedrock and elements measured in soils of Finland [64].
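The membership computation at the heart of fuzzy c-means clustering, as used in studies such as [63] and [64], can be sketched in a few lines of Python; the sample data and cluster centers below are purely illustrative.

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0):
    # Standard fuzzy c-means membership update: u[p, k] is the degree to
    # which sample p belongs to cluster k; each row sums to 1.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

X = np.array([[0.2, 0.1], [0.9, 0.8], [0.5, 0.5]])  # e.g. two soil properties
C = np.array([[0.0, 0.0], [1.0, 1.0]])              # two tentative soil types
print(fcm_memberships(X, C))
```

The fractional memberships are exactly what lets a sampling point grade between two soil types rather than being forced into a single "pigeon hole".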
10 Deposition of sediments

Modern sedimentary systems comprise volumes of the uppermost tens of meters of lithosphere, and the overlying hydrosphere and atmosphere, where sediments accumulate as, for example, on the delta of the Mississippi River or on the floor of Death Valley, California. Earth scientists have limited knowledge of the physical, chemical and biological processes that control sediment accumulation in modern sedimentary systems. This presents a difficulty insofar as analogous ancient sedimentary systems are where the sediments and sedimentary rocks, which hold much of the direct, long-term evidence of the history of the biosphere, atmosphere, hydrosphere and lithosphere of this planet, accumulated. In order to infer such important information as long and short term variations in geochemical cycles, ecosystems, and climate, we seek to recover from ancient sedimentary deposits just these records of the physical, chemical and biological processes that operated on those ancient surfaces. Our poor understanding of modern depositional processes is
due to: the large sizes of modern sedimentary systems (10²-10⁵ km²); difficulties in instrumentation (especially during rare events such as hurricanes and floods); restrictions of observations on active processes to a few hundred years; and labor-intensive data gathering. The thorniest problem faced by modelers is simulating sediment erosion, sediment transportation, and sediment accumulation within a model. In coastal and shallow marine systems, waves, wave-induced currents, tidal currents and storm-induced (i.e. 'event') waves and currents lead to ever-changing patterns of sediment erosion, transportation, and accumulation. Modeling such events entails handling physical laws and empirically derived relationships. These physical laws and empirical relationships are generally described by nonlinear, complex sets of partial differential equations [65], [66], [67]. Moreover, these equations must be coupled during solution. Furthermore, some parameters that cannot be easily formalized, such as antecedent topography and changing boundary conditions, and the incorporation of "rare" events, need to be taken into account. When we consider carbonate depositional systems, we are also confronted by the in situ formation of the sediments themselves, both as reefs and bank-interior sediments. Coastal oceanographic modelers have made great strides in dealing with the complexities of coupled solutions as well as wave dynamics, current dynamics and sediment transport. However, finite difference and finite element numerical simulations have two drawbacks when applied to stratigraphic models. First, they are site specific and depend on rigorous application of boundary conditions, initial conditions, and wave and tidal forcing functions over a discrete domain. Secondly, these process-response models operate at tens to hundreds of year time scales, which are very short in comparison to basin-filling models. As a result, the effects of large, complex storm events, which are suspected of being important agents in ancient depositional systems, are only rarely included in coastal models. Indeed, such complexities lead to questioning the applicability of even short-term coastal models built around dynamic sedimentary process simulators [68]. Sedimentary models that wed GIS-based observations of processes and sedimentary product through neural networks [69] offer the potential of a new generation of models to assess sedimentation in harbors and coastal areas. As a result of the difficulties outlined above, much of what we know about the evolution of sedimentary systems on geological time scales (10⁵-10⁸ y) is increasingly based on quantitative models. Traditionally, these models have relied on the piece-wise solution of partial differential equations for fluid flow and sediment transport over grids of points using highly parameterized initial, internal, and boundary conditions. The disadvantages of such models are that they are quite computationally intensive and require a large degree of user judgment in the coefficients and boundary conditions employed. Fuzzy logic was initially introduced into stratigraphic models to overcome the computational difficulties of sedimentary process modeling [70], [71]. FUZZIM [71] is a shareware program that replaces sedimentary physics with common-sense rules based on hard and soft
information developed by sedimentologists over the past 100 years. Other modelers have incorporated fuzzy logic into long-term stratigraphic models of various sedimentary environments [72], [73], [74]. There are very few models wherein fuzzy logic has been utilized to describe the diagenetic processes that transform deposited sediment into rock. However, fuzzy clustering was used to develop a classification of geochemical and rock magnetic effects in deep-sea sediments due to flux of hydrothermal fluids [75].

11 Miscellaneous
There have been a number of general papers advocating the use of fuzzy logic in the geological sciences [76], [77]. Bardossy and Duckstein [78] authored a useful introductory text where various applications of fuzzy rule-based modeling in biological and engineering applications as well as geosciences are developed. In addition to these general references, there have been the following specific applications of fuzzy logic to geological problems outside of the main areas listed above. These include the study of volcanoes [79] and a fuzzy rule-based model to simulate latent heat fluxes of coniferous forests [80]. Fuzzy set theory was also applied to thermodynamic parameters in aqueous chemical equilibrium calculations [81]. Finally, fuzzy clustering of paleomagnetic measurements on deep-sea sediments produced populations that were a proxy for estimating orbital forcing of climate over the last 276,000 years [82].
12 Conclusions

The papers cited in each category above by no means comprise a complete bibliography of materials on the use of fuzzy logic in the geological sciences. Instead, they are intended to give a potential worker access to recent refereed papers in major journals that should be available in even modest university libraries. These papers and their bibliographies should serve as an introduction into the literature. One final observation is in order. There seems to be a major trend in the use of fuzzy logic in the geological sciences. Initially, fuzzy sets were used to capture the continuous nature of geologic data, and various techniques were developed to use fuzzy sets in previously developed deterministic models of geological phenomena. More and more, fuzzy rule-based models are beginning to supersede the older deterministic models. This trend will no doubt continue into the future.

Acknowledgements

The United States National Science Foundation Grant EAR9909336 to R.V.D. and G. Klir supported this research.
References
1. Sugeno, M., 1985, Industrial Applications of Fuzzy Control. Elsevier, Amsterdam, New York, 269 p.
2. Ross, T. J., 1995, Fuzzy Logic with Engineering Applications. McGraw-Hill, New York.
3. Ruspini, E. H., Bonissone, P. P., and Pedrycz, W., 1998, Handbook of Fuzzy Computation. Institute of Physics, Philadelphia.
4. Dubois, D., and Prade, H., (eds.), 1999-2000, The Handbooks of Fuzzy Sets Series. Kluwer, Boston.
5. Rouvray, D. H., 1997, Fuzzy Logic in Chemistry. Academic Press, San Diego.
6. Hsu, J. P., 1991, Theory of fuzzy transitions between quantum and classical mechanics and proposed experimental tests. Physical Review A, v. 43, p. 3227-3231.
7. Pykacz, J., 1993, Fuzzy quantum logic I. International Journal of Theoretical Physics, v. 32, p. 1691.
8. Cattaneo, G., 1993, Fuzzy quantum logic II. The logics of unsharp quantum mechanics. International Journal of Theoretical Physics, v. 32, p. 1709.
9. Billot, A., 1992, Economic Theory of Fuzzy Equilibria: An Axiomatic Analysis. Springer-Verlag, New York, 180 p.
10. Salski, A., 1992, Fuzzy knowledge-based models in ecological research. Ecological Modelling, v. 63, p. 103-112.
11. Libelli, S. M., and Cianchi, P., 1996, Fuzzy Modelling: Paradigms and Practice. Kluwer, Boston.
12. Gale, S., 1972, Inexactness, fuzzy sets, and the foundation of behavioral geography. Geographical Analysis, v. 4, p. 5-16.
13. Teegavarapu, R. S. V., and Simonovic, S. P., 1999, Modeling uncertainty in reservoir loss functions using fuzzy sets. Water Resources Research, v. 35, p. 2815-2823.
14. Cheng, C., 1999, Fuzzy optimal model for the flood control system of the upper and middle reaches of the Yangtze River. Hydrological Sciences Journal, v. 44, p. 573-582.
15. Huang, G. H., Liu, L., Chakma, A., Wu, S. M., Wang, X. H., and Yin, Y. Y., 1999, A hybrid GIS-supported watershed modeling system: application to the Lake Erhai basin, China. Hydrological Sciences Journal, v. 44, p. 597-610.
16. Al-Homoud, A. S., and Al-Masri, G. A., 1999, CSEES: an expert system for analysis and design of cut slopes and embankments. Environmental Geology, v. 39, p. 75-89.
17. Grima, A. M., and Verhoef, P. N. W., 1999, Forecasting rock trencher performance using fuzzy logic. International Journal of Rock Mechanics and Mining Sciences & Geomechanics Abstracts, v. 36, p. 413-432.
18. Hammah, R. E., and Curran, J. H., 1998, Fuzzy cluster algorithm for the automatic identification of joint sets. International Journal of Rock Mechanics and Mining Sciences & Geomechanics Abstracts, v. 35, p. 889-905.
19. Dunne, T., and Leopold, L. B., 1978, Water in Environmental Planning. W. H. Freeman, San Francisco, 818 p.
20. Franks, S. W., Gineste, P., Beven, K. J., and Merot, P., 1998, On constraining the predictions of a distributed model: the incorporation of fuzzy estimates of saturated areas into the calibration process. Water Resources Research, v. 34, p. 787-797.
21. Yu, P.-S., and Yang, T.-C., 2000, Fuzzy multi-objective function for rainfall-runoff model calibration. Journal of Hydrology, v. 238, p. 1-14.
22. Zhu, A. X., and Mackay, D. S., 2001, Effects of spatial detail of soil information on watershed modeling. Journal of Hydrology, v. 248, p. 54-77.
23. Xiong, L., Shamseldin, A. Y., and O'Connor, K. M., 2001, A non-linear combination of the forecasts of rainfall-runoff models by the first-order Takagi-Sugeno fuzzy system. Journal of Hydrology, v. 245, p. 196-217.
24. Gautam, D. K., and Holz, K. P., 2001, Rainfall-runoff modeling using adaptive neuro-fuzzy systems. Journal of Hydroinformatics, v. 3, p. 3-10.
25. Chang, F.-J., and Chen, Y.-C., 2001, A counterpropagation fuzzy-neural network modeling approach to real time streamflow prediction. Journal of Hydrology, v. 245, p. 153-164.
26. Pongracz, R., Bogardi, I., and Duckstein, L., 1999, Application of fuzzy rule-based modeling technique to regional drought. Journal of Hydrology, v. 224, p. 100-114.
27. Stehlik, J., and Bardossy, A., 2002, Multivariate stochastic downscaling model for generating daily precipitation series based on atmospheric circulation. Journal of Hydrology, v. 256, p. 120-141.
28. McDonald, M. G., and Harbaugh, A. W., 1988, Chapter A1, A modular three-dimensional finite-difference ground-water flow model. Techniques of Water-Resources Investigations of the United States Geological Survey, Book 6.
29. Istok, J., 1989, Groundwater Modeling by the Finite Element Method. Water Resources Monograph 13, American Geophysical Union, 495 p.
30. Freeze, R. A., and Cherry, J. A., 1979, Groundwater. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 604 p.
31. Konikow, L. F., and Ewing, R. C., 1999, Is a probabilistic performance assessment enough? Ground Water, v. 37, p. 481.
32. Dou, C., Woldt, W., Bogardi, I., and Dahab, M., 1995, Steady state groundwater flow simulation with imprecise parameters. Water Resources Research, v. 31, p. 2709-2719.
33. Dou, C., Woldt, W., Dahab, M., and Bogardi, I., 1997, Transient ground-water flow simulation using a fuzzy set approach. Ground Water, v. 35, p. 205-215.
34. Schulz, K., and Huwe, B., 1997, Water flow modeling in the unsaturated zone with imprecise parameters using a fuzzy approach. Journal of Hydrology, v. 201, p. 211-229.
35. Dou, C., Woldt, W., and Bogardi, I., 1999, Fuzzy rule-based approach to describe solute transport in the unsaturated zone. Journal of Hydrology, v. 220, p. 74-85.
36. Zeng, X., Vasseur, C., and Fayala, F., 2000, Modeling microgeometric structures of porous media with a predominant axis for predicting diffusive flow in capillaries. Applied Mathematical Modelling, v. 24, p. 969-986.
37. Freissinet, C., Vauclin, M., and Erlich, M., 1999, Comparison of first-order analysis and fuzzy set approach for the evaluation of imprecision in a pesticide groundwater pollution screening model. Journal of Contaminant Hydrology, v. 37, p. 21-43.
38. Zhou, H., Wang, G., and Yang, Q., 1999, A multi-objective fuzzy pattern recognition model for assessing groundwater vulnerability based on the DRASTIC system. Hydrological Sciences Journal, v. 44, p. 611-618.
39. Ozdamar, L., Demirhan, M., Ozpinar, A., and Kilanc, B., 2000, A fuzzy areal assessment approach for potentially contaminated sites. Computers & Geosciences, v. 26, p. 309-318.
40. Mohagheg, S., (ed.), 2000, Applications of Virtual Intelligence to Petroleum Engineering. Computers & Geosciences, v. 26.
41. Wong, P. M., and Nikravesh, M., (eds.), 2001, Field Applications of Intelligent Computing Techniques. Journal of Petroleum Geology, v. 24, n. 4.
42. Wakefield, M. I., Cook, R. J., Jackson, H., and Thompson, P., 2001, Interpreting biostratigraphical data using fuzzy logic; the identification of regional mudstones within the Fleming Field, UK North Sea. Journal of Petroleum Geology, v. 24, p. 417-440.
43. Finol, J. J., Guo, Y. K., and Jing, X. D., 2001, Fuzzy partitioning systems for electrofacies classification; a case study for the Maracaibo Basin. Journal of Petroleum Geology, v. 24, p. 441-458.
44. Nikravesh, M., Aminzadeh, F., and Zadeh, L. A., (eds.), 2001, Soft computing and Earth Sciences: Part 1. Journal of Petroleum Science and Engineering, v. 29, n. 3-4.
45. Nikravesh, M., Aminzadeh, F., and Zadeh, L. A., (eds.), 2001, Soft computing and Earth Sciences: Part 2. Journal of Petroleum Science and Engineering, v. 31, n. 2-4.
46. Deyi, F., and Xihui, L., (eds.), 1985, Fuzzy mathematics in earthquake researches, Proceedings of International Symposium on Fuzzy Mathematics in Earthquake Researches, Seismological Press, Beijing, 646 p.
47. Wang, W., Wu, G., Huang, B., Zhuang, K., Zhou, P., Jiang, C., Li, D., and Zhou, Y., 1997, The FAM (fuzzy associative memory) neural network model and its application in earthquake prediction. Acta Seismologica Sinica, v. 10, p. 321-328.
48. Wang, X.-Q., Zheng, Z., Qian, J., Yu, H.-Y., and Huang, X.-L., 1999, Research on the fuzzy relationship between the precursory anomalous elements and earthquake elements. Acta Seismologica Sinica, v. 12, p. 676-683.
49. Muller, S., Legrand, J.-F., Muller, J.-D., Cansi, Y., and Crusem, R., 1998, Seismic events discrimination by neuro-fuzzy-based data merging. Geophysical Research Letters, v. 25, p. 3449-3452.
50. Muller, S., Garda, P., Muller, J.-D., and Cansi, Y., 1999, Seismic events discrimination by neuro-fuzzy merging of signal and catalogue features. Physics and Chemistry of the Earth (A), v. 24, p. 201-206.
51. Huang, C., and Leung, Y., 1999, Estimating the relationship between isoseismal area and earthquake magnitude by a hybrid fuzzy-neural-network method. Fuzzy Sets and Systems, v. 107, p. 131-146.
52. Lin, K., and Sanford, A. R., 2001, Improving regional earthquake locations using a modified G matrix and fuzzy logic. Bulletin of the Seismological Society of America, v. 91, p. 82-93.
53. Birkeland, P. W., 1974, Pedology, Weathering, and Geomorphological Research. Oxford University Press, New York, 285 p.
54. Moran, C. J., and McBratney, A. B., 1997, A two-dimensional fuzzy random model of soil pore structure. Mathematical Geology, v. 29, p. 755-777.
55. Galbraith, J. M., Bryant, R. B., and Ahrens, R. J., 1998, An expert system of soil taxonomy. Soil Science, v. 163, p. 748.
56. Galbraith, J. M., and Bryant, R. B., 1998, A functional analysis of soil taxonomy in relation to expert system techniques. Soil Science, v. 163, p. 739-747.
57. Wilson, J. P., and Burrough, P. A., 1999, Dynamic modeling, geostatistics, and fuzzy classification: new sneakers for a new geography? Annals of the Association of American Geographers, v. 89, p. 736-746.
58. Kollias, V. J., Kalivas, D. P., and Yassoglou, N. J., 1999, Mapping the soil resources of a recent alluvial plain in Greece using fuzzy sets in a GIS environment. European Journal of Soil Science, v. 50, p. 261-273.
59. Triantafilis, J., Ward, W. T., Odeh, I. O. A., and McBratney, A. B., 2001, Creation and interpolation of continuous soil layer classes in the Lower Namoi Valley. Soil Science Society of America Journal, v. 65, p. 403-413.
60. Zhu, A. X., Hudson, B., Burt, J., Lubich, K., and Simonson, D., 2001, Soil mapping using GIS, expert knowledge, and fuzzy logic. Soil Science Society of America Journal, v. 65, p. 1463-1472.
61. Serandrei-Barbero, R., Rabagliati, R., Binaghi, E., and Rampini, A., 1999, Glacial retreat in the 1980s in the Breonie, Aurine and Pusteresi groups (eastern Alps, Italy) in Landsat TM images. Hydrological Sciences Journal, v. 44, p. 279-296.
62. Smith, G. R., Woodward, J. C., Heywood, D. I., and Gibbard, P. L., 2000, Interpreting Pleistocene glacial features from SPOT HRV data using fuzzy techniques. Computers & Geosciences, v. 26, p. 479-490.
Review of Fuzzy Logic in the Geological Sciences
249
63. Rantitsch, G., Application of fuzzy clusters to quantify lithological background concentrations in stream-sediment geochemistry. Journal of Geochemical Exploration, v. 71, p. 73-82. 64. Lahdenpera, A. -M., Tamminen, P., and Tarvainen, T., 2001, Relationships between geochemistry of basal till and chemistry of surface soil at forested sites in Finland. Applied Geochemistry, v. 16, p. 123-136. 65. Li, M. Z., and Amos, C. L., 1995; SEDTRANS92: a sediment transport model of continental shelves. Computers & Geosciences, v. 21, p. 553-554. 66. Acinas, J. R., and Brebbia, C. A., (eds.), 1997, Computer Modelling of Seas and Coastal Regions. Computational Mechanics, Inc., Billerica, MA, 442 p. 67. Harff, J., Lemke, W., and Stattegger, K., (eds.), 1999, Computerized Modeling of Sedimentary Systems. Springer, New York, 452 p. 68. Pilkey, 0. H., and Thieler, E. R., 1996, Mathematical modeling in coastal geology. Geotimes, v. 41, p. 5. 69. Yang, Y., and Rosenbaum, M. S., 2001, Artificial neural networks linked to GIS for determining sedimentology in harbours. Journal of Petroleum Science and Engineering, v. 29, p. 213-220. 70. Nordlund, U., 1996, Formalizing geological knowledge - with and example of modeling stratigraphy using fuzzy logic: Journal of Sedimentary Research, v. 66, p. 689-712. 71. Nordlund, U., 1999, FUZZIM: forward stratigraphic modeling made simple. Computers & Geosciences, v. 25, p. 449-456. 72. Edington, D. H., Poeter, E. P., and Cross, T. A., 1998, FLUVSIM; a fuzzylogic forward model of fluvial systems. Abstracts with Programs - Geological Society of America Annual Meeting, v. 30, p. A105. 73. Parcell, W. C., Mancini, E. A,, Benson, D. J., Chen, H., and Yang, W., 1998, Geological and computer modeling of 2-D and 3-D carbonate lithofacies trends in the Upper Jurassic (Oxfordian), Smackover Formation, Northeastern Gulf Coast. Abstracts with Programs - Geological Society of America Annual Meeting, v. 30, p. A338. 74. Demicco, R. V., and Klir, G. J., 2001, Stratigraphic simulations using fuzzy logic to model sediment dispersal. Journal of Petroleum Science and Engineering, v. 31, p. 135-155. 75. Urbat, M., Dekkers, M. J., and Krumsiek, K., 2000, Discharge of hydrothermal fluids through sediment at the Escanaba Trough, Gorda Ridge (ODP Leg 169): assessing the effects on the rock magnetic signal. Earth and Planetary Science Letters, v. 176, p. 481-494. 76. Fang, J. H., 1997, Fuzzy logic and geology. Geotimes, v. 42, p. 23-26. 77. Fang, J. H., and Chen, H. C., 1990, Uncertainties are better handled by fuzzy arithmetic. American Association of Petroleum Geologists Bulletin, v. 74, p. 1228- 1233.
250
R. V . Demicco
78. Bardossy, A., and Duckstein, L., 1995, Fuzzy Rule-Based Modeling with Applications to Geophysical, Biological and Engineering Systems. CRC Press, Boca Raton, 323 p 79. Cagnoli, B., 1998, Fuzzy logic in volcanology. Episodes, v. 21, p. 94-96. 80. Van Wijk, M. T., and Bouten, W., 2000, Analyzing latent heat fluxes of coniferous forests with fuzzy logic. Water Resources Research, v. 36, p. 18651872. 81. Schulz, K., Huwe, B., and Peiffer, S., 1999, Parameter uncertainty in chemical equilibrium calculations using fuzzy set theory. Journal of Hydrology, v. 2 17, p. 119-134. 82. Kruiver, P. P., Kok, Y. S., Dekkers, M. J., Langereis, C. G., and Laj, C., 1999, A psuedo-Thellier relative palaeointensity record, and rock magnetic and geochemical parameters in relation to climate during the last 276 kyr in the Azores region. Geophysical Journal International, v. 136, p. 757-770.
BAYESIAN NEURAL NETWORKS IN PREDICTION OF GEOMAGNETIC STORMS
GABRIELA ANDREJKOVÁ
Department of Computer Science, Faculty of Science, P. J. Šafárik University, Jesenná 5, 041 54 Košice, Slovakia
e-mail: [email protected]

Bayesian probability theory provides a framework for data modelling. In this framework it is possible to find models that are well matched to the data and to use these models to make nearly optimal predictions. In connection with neural networks, and especially with neural network learning, the theory is interpreted as inference of the most probable parameters for the model given the training data. This article describes an application of Bayesian probability theory to the physical problem of the prediction of geomagnetic storms.
Keywords: Bayesian probability theory; Neural Network; Geomagnetic Storm; Prediction.

1 Introduction
Neural networks continue to offer an attractive paradigm for the design and analysis of adaptive, intelligent systems for many applications in artificial intelligence [4, 5]. This is true for a number of reasons: for example, amenability to adaptation and learning, robustness in the presence of noise, and the potential for massively parallel computation. Predictions of the hourly Dst index from interplanetary magnetic field and solar plasma data, based on Artificial Neural Networks (ANN), were made and analysed by Lundstedt and Wintoft (1994) (feedforward networks) and Andrejková et al. (1996, 1999) (recurrent networks, fuzzy neural networks) [1, 2]. Recent results have shown that it is possible to use dynamic neural networks for GMS prediction and for modelling the solar wind-magnetosphere coupling. In this study we report preliminary results using a Bayesian neural network model.
There has been increased interest in combining artificial neural networks with Bayesian probability theory [8]. Bayesian probability theory has proved to be very successful in a variety of applications, for example D. J. C. MacKay (1995) [8, 9], M. I. Schlesinger and V. Hlaváč [13], and P. Müller and D. R. Insua [10]. The effectiveness of models representing nonlinear input-output relationships depends on the representation of the input-output space.
The designed neuro-Bayesian model will predict the occurrence of geomagnetic storms on the basis of the input parameters $n$, $v$, $B_z$ and $\sigma_{B_z}$: $n$ - the plasma density of the solar wind, $v$ - the bulk velocity of the solar wind, $B_z$, $\sigma_{B_z}$ - the z-component of the interplanetary magnetic field and its fluctuation.
To follow the changes of the geomagnetic field values we use the $D_{st}$ index. Its values lie in the interval $\pm 10$ nT under normal conditions, but during a geomagnetic storm they can decrease by as much as several hundred nT in a few hours. In Section 2, we describe some basic definitions and properties of Bayesian probability theory. In Section 3, we briefly describe neural networks as probabilistic models. Section 4 contains the starting point for finding the weights of the neural networks. Some interesting results for GMS prediction are described in Section 5.
2 Bayesian probability theory
A Bayesian data-modeller's aim is to develop a probabilistic model that is well matched to the data and to make optimal predictions using that model. Bayesian inference satisfies the likelihood principle: inferences depend only on the probabilities assigned to the data that were received, not on properties of data which might have occurred. We will use the following notation for conditional probabilities:
- $\Omega$, $\Omega \neq \emptyset$ - the space of elementary events;
- $\mathcal{H}$ - an algebra of some nonempty subsets of $\Omega$ (a model of computation);
- $A$, $B$ - events; $P(A)$, $P(B)$ - the probabilities of the events $A$ and $B$;
- $(\Omega, \mathcal{H}, P)$ - a probability space;
- $P(A \mid B, \mathcal{H})$ is pronounced "the probability of $A$, given $B$ and $\mathcal{H}$" and denotes the conditional probability; the statements $B$ and $\mathcal{H}$ are the conditional assumptions on which this measure of plausibility is based.
The Bayesian approach requires:
- specifying a set of prior distributions for all weights in the network (and for the variance of the error), and
- computing the posterior distributions of the weights using Bayes' Theorem.
The prior distribution is a probability distribution on the unknown parameter vector $w \in \Omega$ in the probability model, typically described by its density function $P(w)$, which encapsulates the available information about the unknown value of $w$. In our case the vector of weights $w$ does not have a known prior distribution, which means the prior distribution will be replaced by a reference prior function. The posterior distribution is a probability distribution on the unknown parameter vector $w \in \Omega$ in the probability model, typically described by its density function $P(w \mid D)$; conditionally on the model, it encapsulates the available information about the unknown value of $w$, given the observed data $D$ and whatever knowledge about $w$ the prior distribution $P(w)$ might contain. It is obtained by Bayes' Theorem.

Bayes' Theorem: Given data $D = \{x^{(i)}, y^{(i)}\}$ generated by the probability model $\{P(D \mid A), A \in \Omega\}$ and a prior distribution $P(A)$, the posterior distribution of $A$ is
$$P(A \mid D) \propto P(D \mid A) \, P(A).$$
The proportionality constant is $\left\{ \int_{\Omega} P(D \mid A) \, P(A) \, dA \right\}^{-1}$.
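As a minimal numeric illustration of Bayes' Theorem and its normalizing constant (not from the paper; the prior, noise model and data below are invented), the posterior over a scalar parameter can be approximated on a grid:

```python
import numpy as np

grid = np.linspace(-5.0, 5.0, 1001)             # candidate parameter values
da = grid[1] - grid[0]
prior = np.exp(-0.5 * grid**2)                  # unnormalized N(0, 1) prior
data = np.array([0.8, 1.1, 0.9])                # invented observations
# Likelihood P(D|a): independent unit-variance Gaussian noise around a.
log_lik = -0.5 * ((data[:, None] - grid[None, :]) ** 2).sum(axis=0)
post = prior * np.exp(log_lik)
post /= post.sum() * da                         # the {integral}^{-1} constant
print("posterior mean:", (grid * post).sum() * da)
```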
Two approaches have been tried for finding the posterior probability:
- to find the most probable parameters (weights) using methods similar to conventional training, and then to approximate the distribution over weights using information available at this maximum;
- to use a Monte Carlo method to sample from the distribution over weights. We applied this method, using Markov chains.
There are two rules of probability which can be used:
- The product rule relates the joint probability of $A$ and $B$, $P(A, B \mid \mathcal{H})$, to the conditional probability:
$$P(A, B \mid \mathcal{H}) = P(A \mid B, \mathcal{H}) \, P(B \mid \mathcal{H}) \qquad (1)$$
- The sum rule relates the marginal probability distribution of $A$, $P(A \mid \mathcal{H})$, to the joint and conditional distributions:
$$P(A \mid \mathcal{H}) = \sum_{B} P(A, B \mid \mathcal{H}) = \sum_{B} P(A \mid B, \mathcal{H}) \, P(B \mid \mathcal{H}) \qquad (2)$$

Having specified the joint probability of all variables as in these equations, we can use the rules of probability to evaluate how our beliefs and predictions should change when we get new information.
3 Neural Networks as probabilistic models
A supervised neural network is a non-linear parametrized mapping from an input $x$ to an output $\hat{y} = f(x, w; A)$. The output is a continuous function of the parameters $w$, which are called weights, and $A$ is the architecture of the network. The network is trained in the classical way on a data set $D = \{x^{(m)}, y^{(m)}\}$ by the backpropagation algorithm, i.e. the following sum-squared error is minimized:
$$E_D = \frac{1}{2} \sum_{m} \left( \hat{y}(x^{(m)}; w) - y^{(m)} \right)^2 \qquad (3)$$
Weight decay is often included in the objective function for the minimization:
$$M(w) = \beta E_D + \alpha E_W, \qquad (4)$$
where $E_W = \sum_i w_i^2$.
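The objective (4) can be written down directly. The following sketch assumes a one-hidden-layer tanh network with bias columns, as in Section 3; the layer sizes and data are placeholders, not the paper's configuration:

```python
import numpy as np

def mlp_forward(x, W1, W2):
    """One-hidden-layer tanh MLP; the appended 1.0 supplies the bias column."""
    h = np.tanh(W1 @ np.append(x, 1.0))
    return W2 @ np.append(h, 1.0)

def objective(ws, shapes, X, Y, alpha, beta):
    """M(w) = beta * E_D + alpha * E_W, equation (4), for flattened weights ws."""
    n1 = shapes[0][0] * shapes[0][1]
    W1 = ws[:n1].reshape(shapes[0])
    W2 = ws[n1:].reshape(shapes[1])
    E_D = 0.5 * sum(float(np.sum((mlp_forward(x, W1, W2) - y) ** 2))
                    for x, y in zip(X, Y))
    E_W = float(np.sum(ws ** 2))               # weight-decay term E_W
    return beta * E_D + alpha * E_W

# Toy usage: 3 inputs, 4 hidden units, 1 output (sizes are placeholders).
shapes = [(4, 3 + 1), (1, 4 + 1)]
ws = np.random.default_rng(0).standard_normal(sum(r * c for r, c in shapes))
print(objective(ws, shapes, [np.zeros(3)], [np.zeros(1)], alpha=0.01, beta=1.0))
```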
The learning process above has the following probabilistic interpretation. The error function is interpreted as minus the log likelihood for a noise model:
$$P(D \mid w, \beta) = \frac{1}{Z_D(\beta)} \exp(-\beta E_D), \qquad (5)$$
where the parameter $\beta$ defines a noise level, $\sigma_\nu^2 = 1/\beta$ (and, for the weight prior below, $\sigma_w^2 = 1/\alpha$).
The function $M$ corresponds to the inference of the parameters $w$ from the data $D$; the posterior over the weights is
$$P(w \mid D, \alpha, \beta) \propto \exp(-M(w)). \qquad (6)$$
Bayesian inference for modelling problems may be implemented by analytical methods, by Monte Carlo sampling, or by deterministic methods using Gaussian approximations.
4 Starting points for the application
We deal only with neural networks used for regression. Assuming a Gaussian noise model, the conditional distribution of the output vector given the input vector based on this mapping is
$$P(y \mid x, w) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left( -\frac{\lVert y - f(x, w) \rVert^2}{2\sigma^2} \right), \qquad (7)$$
where $d$ is the dimension of the output vector and $\sigma$ is the level of the noise in the outputs.
In the Bayesian approach to statistical prediction, one does not use a single "best" vector of weights, but rather integrates the predictions of all possible weight vectors over the posterior weight distribution, which combines the data with the prior over the weights. The best prediction for a given input $x_{n+1}$ from the testing data can be expressed as
$$\hat{y}_{n+1} = \int_{\mathbb{R}^d} f(x_{n+1}, w) \, P(w \mid D) \, dw, \qquad (8)$$
where $d$ is the dimension of the weight vector. The posterior probabilities of weight vectors are the following:
$$P(w \mid D) = \frac{P(D \mid w) \, P(w)}{\int P(D \mid w') \, P(w') \, dw'} \qquad (9)$$
For the full formulation of the Bayesian problem it is necessary to add the prior distribution of the weights. One of the possibilities is a Gaussian prior:
$$P(w) = \frac{1}{(2\pi/\alpha)^{d/2}} \exp\left( -\frac{\alpha}{2} \lVert w \rVert^2 \right) \qquad (10)$$
Computing the above integrals is a very time-consuming problem. It is possible to use the Metropolis algorithm, which is the basis of the Monte Carlo method that we used in the prediction of GMS. We used the Monte Carlo method for the construction of our models. The algorithm was applied according to the construction of R. M. Neal [11].
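A schematic random-walk Metropolis sampler over the weight vector, averaging predictions over posterior samples as in equation (8). The step size, sample count and burn-in rule are illustrative assumptions; Neal's construction [11] uses more elaborate Markov chain methods (hybrid Monte Carlo) than this plain Metropolis sketch:

```python
import numpy as np

def metropolis_predict(f, x_new, M, d, n_samples=5000, step=0.05, seed=0):
    """Sample weights w from the density proportional to exp(-M(w)) with
    random-walk Metropolis and average f(x_new, w) over the samples, as in (8)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    m_w = M(w)
    preds = []
    for _ in range(n_samples):
        w_new = w + step * rng.standard_normal(d)         # random-walk proposal
        m_new = M(w_new)
        if rng.random() < np.exp(min(0.0, m_w - m_new)):  # accept/reject
            w, m_w = w_new, m_new
        preds.append(f(x_new, w))
    return np.mean(np.asarray(preds[n_samples // 2:]), axis=0)  # drop burn-in

# Usage with placeholder model and objective:
# y_hat = metropolis_predict(lambda x, w: w @ x, x_test, M=my_objective, d=32)
```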
5 Results of GMS predictions
We have to discuss various implementation issues which are necessary for the real prediction. The data are available from the NASA "OMNI tape" and are distributed by the National Space Science Data Center and WDC-A for Rockets & Satellites. In the period 1963-1999, the following quantities were measured and saved at each hour: $B_z$, $\sigma_{B_z}$, $n$, $v$ and $D_{st}$. Some data are not complete, and we use linear interpolation to fill the gaps, but only if a gap is shorter than 30 hours. The reconstructed data are used to choose the samples for the training set according to the following criterion: if the value of $D_{st}$ decreases by at least 40 nT during two hours, then a training sample (the storm) is created from the measured values 36 hours before the decrease, the 2 hours of the identified decrease, and 108 hours after the decrease. The file of values has to fulfil the requirement of completeness of measurements. It means that 144 hours describe one GMS event. One storm is used for the learning of the neural network by moving an 8-hour window. We have prepared the training data set and two testing data sets A and B. To prepare the A and B sets we used data from the years 1980-1984 and 1989-1999, because we had continuous values of the parameters $n$, $v$, $B_z$, $\sigma_{B_z}$ and $D_{st}$. The prepared data were represented by a sequence of vectors $p_t$, which can be treated as a time series.
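A sketch of the storm-selection and windowing rules described above. The array layout, the completeness test, the scan step, and the interpretation of the 8-hour window (an 8-hour window over the four input quantities, giving the 32 network inputs reported below) are our assumptions:

```python
import numpy as np

def extract_storms(dst, features, drop=40, before=36, during=2, after=108):
    """Cut out a candidate storm wherever Dst falls by at least `drop` nT
    within two hours, taking the hours before, during and after the drop
    as specified in the text, and requiring complete measurements."""
    events = []
    t = before
    while t < len(dst) - during - after:
        if dst[t] - dst[t + during] >= drop:
            seg = features[t - before : t + during + after]
            if not np.isnan(seg).any():          # completeness requirement
                events.append(seg)
            t += during + after                  # skip past this event
        else:
            t += 1
    return events

def training_samples(event, window_h=8, step_h=1):
    """Slide a window over one event; with the four inputs n, v, Bz, sigma_Bz
    an 8-hour window gives 8 * 4 = 32 values, matching the 32 input neurons
    reported in Section 5 (our interpretation; the step size is assumed)."""
    return [event[s : s + window_h].ravel()
            for s in range(0, len(event) - window_h + 1, step_h)]
```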
Table 1. Experimental Results

Data  #Iterations  #Good Predictions  #Bad Predictions  Average Error  Success
A     4000         49                 87                2.78377        36.03 %
B     4000         84                 52                1.86015        61.76 %
A     6000         62                 74                1.90585        45.59 %
B     6000         101                35                0.48040        74.26 %
A     12000        76                 60                1.19665        55.88 %
B     12000        113                23                0.23863        83.09 %
A     18000        86                 50                0.77771        63.24 %
B     18000        109                27                0.23801        80.15 %
The software of M. Levický described in [6] was modified and used in the present application. The algorithm, based on the works of Neal and MacKay, was written in Delphi 5. The first computed results are given in Table 1; the models are still being tested. We present results computed with the two data sets A and B. Prediction performance is measured by #Good Predictions, #Bad Predictions, Average Error and % of Success. The total number of test samples in testing sets A and B is 272; the number of input neurons in the neural network is 32, the number of hidden neurons is 28, and the number of output neurons is 1. The computed results are interesting from the following points of view:
- With a higher number of iterations, the average error decreases; this is one of the criteria for the evaluation of the model.
- After 18000 iterations, the success grows very slowly in the case of testing data set A and decreases in case B.
- The Bayesian neural networks that we used in the prediction of geomagnetic storms appear to be a very good model: they move the weight vector to the most probable part of the weight space.
References
1. Andrejková, G., Azorová, J., Kudela, K.: Artificial Neural Networks in Prediction of the Dst Index. Proceedings of the 1st Slovak Neural Network Symposium, ELFA, Košice, 1996, pp. 51-59.
2. Andrejková, G., Tóth, H., Kudela, K.: Fuzzy Neural Networks in the Prediction of Geomagnetic Storms. Proceedings of "Artificial Intelligence in Solar-Terrestrial Physics", European Space Agency, Lund, 1997, pp. 173-179.
3. Bernardo, J. M.: Bayesian Reference Analysis, A Postgraduate Tutorial Course, Facultat de Matematiques, Valencia, 1998.
4. Hassoun, M. H.: Fundamentals of Artificial Neural Networks, MIT Press, Cambridge, 1995.
5. Hertz, J., Krogh, A., Palmer, R. G.: Introduction to the Theory of Neural Computation, LN Vol. 1, Santa Fe Institute Studies in the Science of Complexity, Addison-Wesley, 1991.
6. Levický, M.: Neural Networks in the Analysis and the Document Classification, Diploma Thesis, P. J. Šafárik University, Košice, 2002.
7. Lundstedt, H., Wintoft, P.: Prediction of geomagnetic storms from solar wind data with the use of a neural network. Ann. Geophysicae 12, EGS-Springer-Verlag, 1994, pp. 19-24.
8. MacKay, D. J. C.: Bayesian Methods for Neural Networks: Theory and Applications. Neural Network Summer School, 1995.
9. MacKay, D. J. C.: A Practical Bayesian Framework for Backprop Networks. Neural Computation 4, pp. 448-472.
10. Müller, P., Insua, D. R.: Issues in Bayesian Analysis of Neural Network Models. Neural Computation 10, pp. 749-770.
11. Neal, R. M.: Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical Report CRG-TR-93-1, University of Toronto, 1993.
12. Neal, R. M.: Bayesian Training of Backpropagation Networks by the Hybrid Monte Carlo Method. Technical Report CRG-TR-92-1, University of Toronto, 1992.
13. Schlesinger, M. I., Hlaváč, V.: Deset přednášek z teorie statistického a strukturního rozpoznávání. ČVUT, Praha, 1999.
ADAPTATION IN INTELLIGENT SYSTEMS - CASE STUDIES FROM PROCESS INDUSTRY
KAUKO LEIVISKA
University of Oulu, Control Engineering Laboratory, P.O. Box 4300, FIN-90014 Oulun yliopisto, Finland
E-mail: [email protected]

Intelligent methods are increasingly applied to several tasks in the process industry as well. In industrial applications, the changing production environment creates an urgent need for adaptation. This paper concerns applications of Smart Adaptive Systems to industrial problems, using some case studies as examples.
Keywords: Intelligent systems, fuzzy logic, neural networks, paper mills, steel mills
1 Introduction
Many industrial applications of intelligent methods, including fuzzy logic, neural networks, methods from machine learning, and evolutionary computing, have recently been launched, especially in cases where an explicit analytical model is difficult to obtain. Examples come from the domains of measurements, monitoring, process control, diagnostics, quality control, forecasting, data mining, and many others. While being powerful and contributing to increased efficiency and economy of industrial processes, most solutions using intelligent methods lack one important property: they are not adaptive (or not adaptive enough) when the production environment changes. In other words, most such systems have to be redesigned - in many cases from scratch - when the production settings or process parameters change significantly. This may look strange, because at first sight adaptivity seems to be the central issue for intelligent technology. The typical "learning by example" concept combined with techniques inspired by biological concepts is supposed to provide enhanced adaptivity. However, after looking more carefully, it becomes clear that most techniques for "learning" are used for one-time estimation of models from data, which remain fixed during routine application. "True" online learning, in most cases, has not reached a level mature enough for industrial applications. The target setting for EUNITE, the European Network on Intelligent Technologies for Smart Adaptive Systems, was originally written in the following way:
- "to join forces within the area of Intelligent Technologies for better understanding of the potential of hybrid systems and to provide guidelines for exploiting their practical implementations and particularly,
- to foster synergies that contribute towards building Smart Adaptive Systems implemented in industry as well as in other sectors of the economy."
These targets are to be gained by various means: Roadmap activities, collecting case studies, writing best practice guidelines, concentrating on training and technology transfer, etc.
2 Smart Adaptive Systems
Early definitions of adaptation are from the 1950s to the 1970s, but they are still valid today (see for instance Sagasti, 1970; Ashby, 1972). Control Engineering also offers a starting point in defining adaptive systems, when it defines an adaptive controller simply as follows: "An adaptive controller is a controller with adjustable parameters and a mechanism for adjusting the parameters". Further on, Control Engineering applications are defined as:
- Gain scheduling, which adapts to known changes in the environment;
- Model Reference Adaptive Control, which adapts the system parameters so that the system follows the model behaviour;
- Self-Tuning Control, which adapts the controller so that the system response remains optimal in the sense of an objective function.
We can carry knowledge about adaptive systems from control to other areas by analogy. An adaptive system is a well-defined concept within the automatic control community. Breakthroughs could occur by applying definitions, structures and theory to other areas such as management, transportation or healthcare. However, it should be remembered that Control Engineering methods are mostly concerned with parameter adaptation, which is only one alternative when speaking about Intelligent Systems. "Smart system" can be interpreted as a synonym for "Intelligent system". A short definition is proposed in the first version of the EUNITE Roadmap:
"A smart system is aware of its state and operation and can predict what will happen to it. This knowledge can also lead to adaptation."
A short definition of the Smart Adaptive System (SAS) comes from Anguita (2001): "A SAS can:
- adapt to a changing environment,
- adapt to similar settings without explicitly being ported to them,
- adapt to a new/unknown application."
The first level is the easiest form of adaptation, and it concerns systems that adapt their operation in a changing environment by using their intelligence to recognise the changes and react accordingly. The second level considers a change of the whole environment and the system's ability to respond to it. The third level is the most demanding one, and it requires tools to learn the system's behaviour
from very modest initial knowledge. Some examples of it exist in the Machine Learning field (Anguita, 2001). Both process and production industries face considerable control challenges, especially in the consistent production of high-quality products, more efficient use of energy and raw materials, and stable operation in varying conditions. The processes are nonlinear, complex, multivariable and highly interactive. Usually, the important quality variables can be estimated only from other measured variables. Constraints, e.g. physical limitations of actuators, must be taken into account. Significant interactions between process variables cause interactions between the controllers. Various time delays depend strongly on operating conditions and can dramatically limit the performance and even destabilise the closed-loop system. Uncertainty is an unavoidable part of process control in real-world applications. Systems for quality, environment and safety seem to be integrating, and different kinds of diagnostic functions will play a more and more important role in the future. Production scheduling and resource allocation also present methodological challenges. The following chapters refer to potential SAS applications in the process industry based on some case studies. The chapters are arranged according to the usual control hierarchy, starting from measurements (Software Sensors) and control systems, and proceeding to diagnostics and quality control. The case studies come from applications developed in the Control Engineering Laboratory, University of Oulu.
3 Software Sensors

3.1 Basic Principles
Software Sensors are used to make existing measurements more efficient or to replace non-existing measurements with software systems that form the measurement signals, e.g., from other existing measurements, laboratory analyses and a priori expert knowledge. A good example is the combination of information from several temperature or concentration measurements, e.g. from the blast furnace, to form a single measurement or an indication of the process state (Alaraasakka et al. 1998). Another possibility is to use process information to construct a Software Sensor for a non-existing quality measurement, e.g. in paper or biotechnical processes, where on-line analysers are difficult to develop and expensive to install and maintain. Adaptation needs come from the necessity to react correctly to changing raw material quality, to different product specifications, or even to different processing alternatives. Portability is important from the commercial point of view: generic applications that guarantee a long range of applications are needed. Software sensors can be developed with various modelling techniques. In early fermentation applications, fuzzy systems were used as a conversion estimator
(Kivikunnas, Bergonzini Corradini and Juuso, 1996) and as a trend analyser (Kivikunnas, Ibatici and Juuso, 1996). Some comments on using neural networks and adaptive algorithms are given in de Assis et al. (2000).
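As a minimal illustration of the software-sensor idea (not one of the cited fermentation systems), an unmeasured quality variable can be estimated from routinely measured signals calibrated against infrequent laboratory analyses; all variable names, numbers and the linear model form are placeholders:

```python
import numpy as np

# Hypothetical soft sensor: estimate an unmeasured quality variable from
# existing measurements (e.g. a temperature and a conductivity reading).
X_lab = np.array([[80.2, 1.31], [82.5, 1.40], [79.8, 1.28], [83.1, 1.45]])
y_lab = np.array([24.1, 26.0, 23.5, 26.8])       # infrequent lab analyses

# Fit a least-squares model on the historical lab data (with an intercept).
A = np.column_stack([X_lab, np.ones(len(X_lab))])
coef, *_ = np.linalg.lstsq(A, y_lab, rcond=None)

def soft_sensor(measurements):
    """On-line estimate of the quality variable from process measurements."""
    return float(np.append(measurements, 1.0) @ coef)

print(soft_sensor(np.array([81.0, 1.35])))       # runs every control cycle
```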
3.2 Case - Kappa Number Prediction in Pulp Cooking
The Kappa number describes the extent of cooking of wood chips in the first stage of fibre production for the paper industry. The Kappa number cannot be measured directly during processing, but some measurement possibilities exist afterwards. However, knowledge of the Kappa number development during cooking is crucial for controlling product quality. Kappa number prediction with measurements from ABB's Cooking Liquor Analyser (CLA 2000) has been used as a comparison case for different methodologies for building Software Sensors (Leiviska et al. 2001, Isokangas and Juuso, 2000, Murtovaara et al., 1999): fuzzy logic (FL), the partial least squares method (MLR), artificial neural networks (ANN) and linguistic equations (LE). While the cooking progresses, some wood components start to dissolve into the cooking liquor. Usually, alkali components in the liquor are analysed by measuring the conductivity of the liquor sample, solid contents by measuring the refractive index, and dissolved lignin by measuring the UV absorbency of the sample. These measurements form the core of the Kappa number prediction. Neural network models and linguistic equation models seem to learn the process behaviour in a similar manner. Differences arise in using the models in the process environment (Murtovaara et al., 1999). Neural network models are suitable for processes where process conditions are stable and a lot of data is available. The linguistic equation models are not so sensitive to changes in process conditions. The performance of fuzzy models depends on the shape and number of membership functions. The first level of adaptivity is easily gained with linguistic equations that adapt to changes in, e.g., production level using multiple models (Murtovaara et al. 1999). The second level corresponds to changing from a continuous to a batch process, which at this moment requires manual tuning and also dynamic modelling. The third level could mean porting the analyser to a totally different process, but no experiences are available.
3.3 Case - Adaptive Modelling of Carbon Dioxide in a Burning Process
The amount of carbon dioxide (CO2) in combustion gases is one of the main variables that should be monitored in a burning process. The concentration of CO2 is usually measured with different types of gas analysers, but unfortunately, present methods are available only for large power plants. Thus, the need for alternative methods is evident in smaller wood-burning processes. One possibility is to model the CO2 concentration using existing process measurements.
A fuzzy, adaptive modelling approach for estimating the carbon dioxide concentration of a burning process is presented in Ruusunen and Leiviska (2002). The adaptation mechanism to changes in operating conditions is based on the recognition of three different combustion regimes: ignition, burning and charring. These regimes are described by three fuzzy membership functions. For each burning phase, two TS-type fuzzy models are constructed. The selection of model input variables is based on combustion theory, and it is an important part of the model development. A gradient method and supervised learning are used for identification of the consequent parameters. Modelling results showed that the present model has a good capacity to approximate the CO2 concentration of a highly changing burning process. In comparison with other models, the performance of the fuzzy model was superior due to an adaptation mechanism that also enabled a robust and simple structure. The small size of the training data and the nonlinear process limited the performance of other models, while the local learning and generalisation capabilities of the fuzzy model compensated for these disturbing factors. This system fulfils, by nature, the first level of the SAS definition. It also makes it easy to convert the model to another burning process, according to the second-level definition. The methodology itself is robust enough to allow porting to totally different areas where only limited data is available.
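A minimal sketch of the regime-weighted structure described above: membership functions for ignition, burning and charring blend local models of CO2 concentration. The breakpoints, inputs and coefficients are invented for illustration, and one local linear model per regime is used here for brevity (the actual system uses two TS-type models per phase):

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    return np.maximum(0.0, np.minimum((x - a) / (b - a), (c - x) / (c - b)))

def co2_estimate(temp, airflow):
    """Blend three local linear models weighted by combustion-regime membership.
    The regime variable (temperature here) and all constants are hypothetical."""
    mu = np.array([tri(temp, 0, 150, 400),       # ignition
                   tri(temp, 300, 600, 900),     # burning
                   tri(temp, 700, 950, 1200)])   # charring
    local = np.array([0.002 * temp + 0.5 * airflow,         # ignition model
                      0.010 * temp + 0.2 * airflow + 2.0,   # burning model
                      0.004 * temp - 0.1 * airflow + 4.0])  # charring model
    w = mu / (mu.sum() + 1e-12)                  # normalized firing strengths
    return float(w @ local)

print(co2_estimate(temp=500.0, airflow=3.0))
```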
4 Adaptive Intelligent Control

4.1 Basic Principles
Adaptive controllers must include two elements that make adaptation possible: a detection system that reveals changes in process characteristics, and an adaptation mechanism that updates the controller parameters. Changes in process characteristics can be detected through on-line identification of the process model, or by assessment of the control response. Adaptation mechanisms rely on parameter estimates of the process model, e.g. gain, dead time and time constant. The choice of performance measures depends on the type of response the control system designer wishes to achieve. Alternative measures include overshoot, rise time, settling time, decay ratio, frequency of oscillations, gain and phase margins, and various error signals. Fuzzy logic controllers (FLC) can cope with a fairly large amount of nonlinearity in the process behaviour, but applying the following mechanisms extends their operating area: Changes in the amount of nonlinearity can be taken into account by changing scaling factors for the appropriate variables. Altering the scaling factor is
similar to gain tuning in standard PID controllers, as it changes the sensitivity of the controller to the input. Changing the membership functions will alter the shapes of the fuzzy sets, and thus the gain can be modified within a specific region. Altering the rules can change the controller operation drastically, and thus should be the last alternative in the adaptation procedure. These types of controllers are called self-organizing controllers; a vast literature is available (see references). The multilevel linguistic equation (LE) controllers correspond to adaptive fuzzy controllers, but also here the adaptation can be further improved by the following mechanisms (Juuso et al. 1997, 1998): Altering the strengths of braking and asymmetrical action will fine-tune the controller operation for larger disturbances and remove the steady-state error. Changes in the strength of the nonlinearity can be taken into account by scaling the membership definitions for the appropriate variables; this mechanism, similar to the scaling in fuzzy controllers, is the basic technique in operating-point adaptation. Asymmetrical scaling is the main technique in braking and asymmetrical actions. Changing the membership definitions will alter the gain within a specific region; this mechanism is used if the adaptation of the multilevel LE controller is not sufficient. Changing the equations, which means that the control strategy is changed, is seldom used. The first direct linguistic equation controller was implemented in 1996 for a solar power plant in Spain (Juuso et al. 1997, 1998), and later the multilevel LE controller was installed for a lime kiln at the UPM-Kymmene Pietarsaari mills (Juuso 1999a; Juuso 1999b). Both implementations are very compact and easier to tune than the corresponding fuzzy controllers were. Modularity is beneficial for tuning the controller to various operating conditions, and most importantly, the same controller can operate over the whole working area.
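A toy sketch of the first mechanism above: adjusting a fuzzy controller's output scaling factor acts like gain tuning in a PID loop. The inference step is replaced by a stand-in clipped linear map, and the adjustment rule is invented for illustration:

```python
def fuzzy_pi_step(error, d_error, scale_out):
    """Skeleton fuzzy PI step: the rule base and membership functions are
    replaced by a clipped linear map; scale_out is the output scaling factor."""
    u_norm = max(-1.0, min(1.0, 0.6 * error + 0.4 * d_error))
    return scale_out * u_norm

def adapt_scale(scale_out, error, big=0.5, factor=1.05):
    """Illustrative adaptation: raise the output gain while the error is large,
    lower it near the set point to reduce the controller's sensitivity."""
    return scale_out * factor if abs(error) > big else scale_out / factor

# Usage per control cycle: u = fuzzy_pi_step(e, de, s); s = adapt_scale(s, e)
```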
4.2 Case - Rotary Dryer Control
Rotary dryers are extensively used in the chemical industry and in minerals processing. Controlling a rotary dryer is difficult due to the long, and also varying, time delays involved, and also due to changes in raw materials. Variations in the input variables, such as the moisture content, temperature or flow of the solids, will disturb the process for long periods of time until they are observed in the output variables. Therefore, pure feedback control is inadequate for keeping the most important controlled variable, the output moisture content of the solids, at its target value with
acceptable variations. Also, the development of model-based control systems has proved difficult and time-consuming due to the complexity of the process. Increasing demands for uniform product quality and for economic and environmental performance have necessitated improvements in dryer control. In the Control Engineering Laboratory, University of Oulu, several methods for rotary dryer control have been studied: conventional PI control, model-based cascade control, fuzzy control, and neural network control. Also, two types of hybrid controllers, where fuzzy logic and neural nets have been connected with PI control, have been introduced (Yliniemi 2001, 1999, Koskinen et al. 1998, Yliniemi et al. 1998). Lately, three different fuzzy adaptive controllers were tested for rotary dryer control. The first method was based on the previously applied hybrid PI-FLC controller with a simple gain-adjusting approach: the output scaling factor is adjusted based on an experimentally defined rule base and membership functions. Good results were achieved, and the controller performance was clearly improved. The second method introduced a fuzzy PI controller and adjusted all scaling factors, for both inputs and the output. The results were not so encouraging, probably because both the controller and the tuning part utilised rule bases and membership functions taken from the literature. The third approach was a simple procedure for tuning a normal PI controller with a fuzzy tuner. Once again, the adaptive controller performed better than the original one. All these controllers fulfil the first-level definition of SAS; they can adapt to changes in the operating point of the dryer. They can also be transferred to another similar application with ease. The third level, porting them to a totally new application area, must be approached with some hesitation. The methods themselves are generic, but being rule-based they need process knowledge to be applied successfully. The results gained with the second method above seem to show this fact very clearly.
5 Diagnostics

5.1 Basic Principles
Fault diagnosis has been an area of strong theoretical development and industrial application for several years. Various approaches have been studied and applied, and they differ mostly in the following aspects:
- What sources of faults have been studied: sensors, actuators, controllers, control loops, process equipment, process parameters, operator-induced faults;
- How the faults have been described: Boolean, quantitative, qualitative; static or dynamic; deterministic or stochastic; and
- What methods have been used in fault detection: testing against thresholds, pattern recognition, rule-based reasoning, fuzzy logic, neural networks, etc.
Fault diagnosis is already a classical area for fuzzy logic applications (Isermann 1997). Compared with algorithmic fault diagnosis, the biggest advantage is that fuzzy logic makes it possible to follow the human way of diagnosing faults and to handle different information and knowledge more efficiently. Applications vary from troubleshooting of hydraulic and electronic systems using vibration analysis to revealing slowly proceeding changes in process operation. The essential feature in the future is the combination of fuzzy logic with model-based diagnosis and other conventional tools, and finally their integration within condition monitoring systems. Fault diagnosis is usually divided into three separate stages or tasks:
- Fault detection, which generates symptoms based on the monitoring of the object system;
- Fault localisation, which isolates fault causes based on the generated symptoms; and
- Fault recovery, which decides upon the actions required to bring the operation of the object system back to a satisfactory level.
The fault detection problem is, in principle, a classification problem that classifies the state of the process as normal or faulty. It is based on thresholds, feature selection and classification, and trend analysis. Fault localisation starts from the symptoms generated in the previous stage, and it either reports fault causes to the operator or automatically starts the corrective actions. In the first case, the importance of the man-machine interface must be emphasised. Actually, this stage includes defining the size, type, location and detection time of the fault in question. The frequency range of phenomena studied in fault detection varies tremendously, from milliseconds in vibration analysis to hours and days in analysing slowly developing process faults. This sets different requirements for system performance and computational efficiency in real-time situations. Different methods are also needed in data pre-processing. Vibration analysis uses spectral and correlation analysis to recognise the "fingerprints" of faults, whereas regression analysis, moving averages or various trend indicators are used in revealing slowly developing faults. Especially for the slow processes met in the chemical, pulp and paper, and biotechnical industries, temporal reasoning and trend analysis are very valuable tools for diagnosing and controlling the process. Environmental changes together with changes in quality and production requirements lead to the need for adaptation. For instance, in the surface quality testing of steel, a change in the steel grade changes the patterns of single faults, and it can also introduce totally new faults. This might require re-calibration of the camera system and training new patterns into the fault detection system. Reasons for surface faults can be traced back to earlier stages of production, to the dosing of different chemicals, to inadequate control of process operations, and also to different routing of customer orders through the production line. Here, the steel grade change would
require new models to be taken into use, because other variables affect the surface quality. Fault diagnosis is in many cases essential for quality control and for optimising the performance of the machinery. For example, the overall performance of paper machines can be related to the operating efficiency, which varies depending on breaks and the duration of the operations caused by the web break. Most of the indistinct breaks are due to dynamic changes in the nonlinear chemical process with many long, time-varying delays. There are also process feedbacks on several levels, closed control loops, factors that exist but cannot be measured, and interactions between physical and chemical factors. Combining expertise with data-driven methods is necessary in complex industrial applications.
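For the slowly developing faults discussed above, a moving-average trend indicator with a threshold test is one standard detection scheme; the window length, reference period and threshold below are arbitrary choices:

```python
import numpy as np

def detect_slow_fault(signal, window=24, threshold=3.0):
    """Flag samples where a moving average drifts more than `threshold`
    standard deviations away from a fault-free reference period."""
    ref = signal[:window]                        # assumed fault-free reference
    ref_mean, ref_std = ref.mean(), ref.std() + 1e-9
    trend = np.convolve(signal, np.ones(window) / window, mode="valid")
    return np.where(np.abs(trend - ref_mean) > threshold * ref_std)[0]
```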
5.2 Nozzle Clogging in Continuous Casting of Steel
Surface faults of steel originate from different sources along the long production line. One important source is the clogging of the submerged entry nozzle that connects the tundish and the mold in the continuous casting of steel. It is a tube with a diameter of about 15 cm, and it divides the molten steel into two directions on both sides of the nozzle. The nozzle transfers molten steel from the tundish to the mold and separates the steel from the atmosphere. The nozzle is clogged when material collects on its inner surface and stops the flow of molten steel. This causes production losses and quality impairment in continuous casting, especially with aluminium-killed steels. Variations in stopper rod position and casting speed provide the operator with the first information on an increased risk of nozzle clogging. This cannot, however, answer the question of how long the casting can continue and when the nozzle should be changed. Neural networks are used to predict the amount of steel that can be cast without changing the nozzle (Leiviska et al. 2002). These models are based on data collected from the Rautaruukki Steel Mill's converter plant. Nozzle clogging has been modelled on two casters, numbers 5 and 6. Feedforward networks with backpropagation were used. Data from 5800 heats was available, but only a small part of it was used. First of all, only clogging cases were used in training, and they were divided first according to casters and secondly based on steel grades. Practice showed that different variables dominate with different grades, and no general models could be made. The results seem promising; in several cases the cast tons are estimated with an accuracy of ±60 tons (one heat) in more than 80% of cases. Cross-testing considered using the model developed for caster 5 on the data from caster 6. It showed that different variables affect nozzle clogging in different casters. When testing the models with successful castings, it was found that the models never gave too high predictions.
This is an example of a hybrid system. The risk of clogging is determined based on the stopper rod position and casting speed variations. Neural network models are used when the risk increases. The system itself is not adaptive at all; adaptivity is gained by changing the model according to the caster and the grade. Basically, this corresponds to SAS second-level adaptation, but no automatic operations are connected to it in this application.
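A schematic of the hybrid arrangement just described: variation in stopper rod position and casting speed flags increased clogging risk, and only then is a caster- and grade-specific neural model consulted for the remaining castable tons. The thresholds, feature window and model registry are invented for illustration:

```python
import numpy as np

# Hypothetical registry: one trained model per (caster, steel grade) pair,
# mirroring the per-caster, per-grade models described in the text.
models = {}  # e.g. models[(5, "Al-killed-1")] = trained_predictor

def clogging_monitor(stopper_pos, casting_speed, caster, grade,
                     pos_limit=0.02, speed_limit=0.01):
    """Stage 1: detect increased clogging risk from signal variations.
    Stage 2: if risk is high, ask the matching NN model for remaining tons."""
    risk = (np.std(stopper_pos) > pos_limit or
            np.std(casting_speed) > speed_limit)
    if not risk:
        return None                   # normal operation, no prediction needed
    model = models.get((caster, grade))
    if model is None:
        return "no model for this caster/grade"  # no general model exists
    features = np.concatenate([stopper_pos[-10:], casting_speed[-10:]])
    return model(features)            # predicted tons before a nozzle change
```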
5.3 Web Break Indicator for a Paper Machine
Paper web breaks can be divided into three classes: process-related, mechanics-related and automation-related. Process-related breaks are considered the most important ones, especially in the short circulation and the wet end of the paper machine, from the mixing chest to the first drying group. This area includes nonlinear processes, and there are many long delays that change with time and with process conditions. It also includes closed control loops, and there exist factors that cannot be measured. There are also interactions between physical and chemical factors. A prototype web break sensitivity indicator was developed together with Metso Corp. to give the process operators a continuous indication of web break sensitivity in an easily understandable way (Juuso et al. 1998). The prototype has been implemented as a Case-Based Reasoning type application, based on recorded historical data of the main process measurements. For modelling, the break sensitivity of the paper machine was divided into five classes, starting from a "no breaks" class and ending at an "a lot of breaks" class. Models were trained using the Linguistic Equations approach. The indicator contains 25 separate LE models, five models for each class. The indicator compares on-line data to these models in the database and estimates how well the present measurements fit each of the break sensitivity classes. The final result is deduced with fuzzy logic and presented as a "break sensitivity index". The system has been tested at two UPM-Kymmene paper mills. Tests showed that the prototype indicator operates as planned. The existing system is not adaptive. The LE approach makes it easy to add first-level adaptation, but in this application it is not essential. More important is second-level adaptation, i.e. being able to deal with different grades and actually with different paper machines. This could, in theory, also be done using LE and automatic generation of equations, but in practice a lot of experience and knowledge is required in selecting variables and equations. The same applies to the third level, for instance porting the method to a steel rolling mill application.
6 Quality Control

6.1 Basic Principles
Quality control within the process industries is used both for registration of product quality and for direct feedback to the process control system. When product quality is used directly in the control of the process, it is very important to have a measurement system that makes the quality figure available as fast as possible. Many quality control systems are based on computer vision. It is interesting to note that intelligent methods are used extensively in almost all stages of the computer vision process, from sensing to scene interpretation. If the time delay between taking a sample of the product and the quality analysis becoming available is too long, it is often necessary to base quality control on measurements related to the product quality, i.e. Software Sensors. Diagnostic information is also used in many cases. Quality control is very often based on modelling the relationships between continuous measurements and quality analyses. For processes which are difficult to model mathematically, techniques such as neural nets, fuzzy logic and genetic algorithms are likely to produce good results. Especially a combination of methods is a promising way to as good a quality control basis as possible. Quality data originates from several sources; it is partly connected with raw materials, the production processes, and final or intermediate products. There are two types of data: discrete data (attributes), describing for instance the number of defective products in a sample or the number of defects in a product, and continuous data (variables), describing for instance analysed or measured values. The processing of continuous quality variables (e.g. paper quality, results from quality analysers in chemical processes) resembles the problem setting in process control and fault diagnosis. The analysis of attribute data proceeds in three stages: feature extraction, classification and decision. The applications of quality control vary both in terms of on-line processing and man-machine interface requirements. Advisory systems can only tell the user what kinds of defects have been found and record the defect data, e.g. for possible customer reclamations. They can also give advice on product rejection. On-line systems can automatically take a product that does not fulfil quality standards out of the production line and reject it, or return it to production for repair.
6.2 Case - Quality Control of a TMP Plant
TMP is a common process to produce wood fibres for paper production. The quality control in a TMP plant (maintaining freeness and fibre length at optimum levels) has to cope with two different kinds of disturbances: slowly proceeding wear of the refiner plates and faster variations in wood quality (Myllyneva et al. 2001). Pulp
quality is usually obtained from laboratory testing of pulp or finished paper samples, so this information is a few hours old and therefore of no use for real-time control. Also, delays in on-line freeness and fibre length measurements have limited the use of automatic on-line quality control. Generally, refiner control systems are based on the principle that freeness is directly related to the specific energy consumption in the process, usually controlled via the motor load. The strategy presented in Myllyneva et al. (2001) is based on the fact that the refining consistency is another key variable in TMP-plant control, because it affects how energy is transferred to the fibres. When the consistency and motor load are stabilised, there exists a good basis for the actual quality control of the TMP plant. In this case, this stabilisation is done using conventional adaptive controllers. The fuzzy quality control is based on on-line measurements of freeness, fibre length and consistency. The fuzzy quality controller works as a master controller, adjusting the set points of the motor loads and the primary-stage consistency. Its inputs are the difference between the freeness set point and measurement, the consistency set point, and the mean fibre length measurement. The consistency set point is used as a control input instead of the measurement, because the actual consistency always includes short-term fluctuations around the set point and could therefore cause unnecessary output changes. The fuzzy quality controller has been implemented in the mill's automation system using the standard fuzzy controller toolbox available in the system. The minimum-maximum method was used in inference, and singletons were used for the outputs. The controller has been tuned so that the main controlled variable is the fibre length; the freeness level is therefore allowed to increase when the motor load control opens the plate position more than the consistency controller requests. The fuzzy quality controller was implemented as a part of the AutoTMP control system at four refiner lines at the Holmen Paper Hallstavik mill in Sweden (Myllyneva et al. 2001). The system features TCA consistency measurement in the blow line. Based on it, the refining consistency and motor load are controlled using adaptive control algorithms. The fuzzy control strategy maintains the fibre length at an acceptable level, and the freeness is controlled to be on a target level. Operator confidence in the control system has been very high; 95-100 % uptime for the consistency and motor load controls, and over 90 % uptime for the fuzzy quality controls, have been experienced. A six-month follow-up period showed a 20-50 % decrease in the standard deviations of the key quality and process variables. In this case, the quality controller is a fuzzy supervisory controller without adaptive features. The system is a hybrid one, where adaptivity is present at the direct process control level. It more or less represents a first-level SAS. The second-level SAS (porting to similar settings) is achieved easily for the process control level, but requires tuning of the fuzzy part. The system is based on special
measurements and deep process knowledge, so the third level SAS is not applicable here.
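A schematic of the supervisory loop described in this case: a master controller turns the freeness error, consistency set point and mean fibre length into set-point corrections for motor load and primary-stage consistency, with fibre length dominating. The linear gains below stand in for the real fuzzy rule base (min-max inference with singleton outputs):

```python
def quality_supervisor(freeness_err, fibre_len, cons_setpoint,
                       fibre_target=2.0):
    """Toy supervisory step: fibre length is the main controlled variable,
    so its error dominates the set-point corrections (illustrative gains)."""
    fibre_err = fibre_target - fibre_len
    d_load = 0.8 * fibre_err + 0.2 * freeness_err    # motor-load correction
    d_cons = 0.1 * fibre_err - 0.05 * freeness_err   # consistency correction
    return cons_setpoint + d_cons, d_load
```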
7 Conclusions
This paper has introduced applications of Smart Adaptive Systems in the process industries, using some case studies as examples. The applications are high in number, and future development seems to favour adaptive systems. The need for adaptivity comes from changes in process conditions, but also from the fact that generic, easily tunable systems are desirable from the commercial point of view. EUNITE is the European Network on Intelligent Technologies for Smart Adaptive Systems that started operations at the beginning of 2001. Inside EUNITE, Smart Adaptive Systems have been defined to cover three levels of adaptation: adaptation to changes in the environment, adaptation to a new application in a new but similar environment, and adaptation to a totally new/unknown application in a different environment. The findings in this paper show that this definition is valid in process applications and can facilitate the analysis of system applicability.

References

1. Alaraasakka, M., Leiviska, K., Seppänen, M. (1998): Neural Network based Classifier of Blast Furnace Gas Distribution. Proceedings of TOOLMET '98 Symposium, Oulu, Finland, 40-44.
2. Anguita, D. (2001): Smart Adaptive Systems - State of the Art and Future Directions for Research. First EUNITE Annual Symposium, Tenerife, Spain, Dec. 13-14, 2001.
3. Ashby, W. R. (1972): Design for a Brain. Science Paperbacks, Chapman and Hall, London.
4. Chung, H. Y., Chen, B. C., Lin, C. C. (1998): A PI-type fuzzy controller with self-tuning scaling factors. Fuzzy Sets and Systems 93, 23-28.
5. Daugherity, W. C., Rathakrishnan, B., Yen, Y. (1992): Performance evaluation of a self-tuning fuzzy controller. In: Proc. IEEE International Conference on Fuzzy Systems, 389-397.
6. de Assis, A. J., Filho, R. M. (2000): Soft sensors development for on-line bioreactor state estimation. Computers & Chemical Engineering 24, 1099-1103.
7. He, S. Z., Tan, S., Xu, F. L. (1993): Fuzzy self-tuning of PID controllers. Fuzzy Sets and Systems 56, 37-46.
8. Ikaheimonen, J., Leiviska, K., Matkala, J. (2002): Nozzle clogging prediction in continuous casting of steel. Accepted to the 2002 IFAC World Congress, Barcelona.
9. Isermann, R. (1997): Supervision, fault-detection and fault-diagnosis methods. XIV IMEKO World Congress, 1-6 June 1997, Tampere, Finland.
10. Isokangas, A., Juuso, E. K. (2000): Development of Fuzzy Systems from Linguistic Equations for Kappa Number Prediction in Continuous Cooking. In: L. Yliniemi and E. Juuso (eds.), Proceedings of TOOLMET 2000 Symposium, Oulu, April 13-14, pp. 63-77.
11. Jung, C. H., Ham, C. S., Lee, K. I. (1995): A real-time self-tuning fuzzy controller through scaling factor adjustment for the steam generator of NPP. Fuzzy Sets and Systems 74, 53-60.
12. Juuso, E. K. (1999a): Fuzzy Control in Process Industry: The Linguistic Equation Approach. In: H. B. Verbruggen, H.-J. Zimmermann, R. Babuska (eds.), Fuzzy Algorithms for Control, International Series in Intelligent Technologies, pp. 243-300, Kluwer, Boston.
13. Juuso, E. K. (1999b): Intelligent Dynamic Simulation of a Lime Kiln with Linguistic Equations. In: H. Szczerbicka (ed.), ESM'99: Modelling and Simulation: A Tool for the Next Millennium, 13th European Simulation Multiconference, Volume 2, pp. 395-400, Delft, The Netherlands, SCS.
14. Juuso, E. K., Balsa, P., Valenzuela, L., Leiviska, K. (1998): Robust Intelligent Control of a Distributed Solar Collector Field. 3rd Portuguese Conference on Automatic Control, Sept. 9-11, 1998, Coimbra, Portugal, 6 pp.
15. Juuso, E. K., Balsa, P., Leiviska, K. (1997): Linguistic Equation Controller Applied to a Solar Collector Field. Proceedings of ECC'97, Paper 267, 6 pp.
16. Juuso, E., Ahola, T., Oinonen, K., Leiviska, K. (1998): Web Break Sensitivity Indicator for a Paper Machine. Proceedings of the 1998 EUFIT Symposium, Aachen, Germany.
17. Kivikunnas, S., Bergonzini Corradini, M., Juuso, E. K. (1996): Fuzzy conversion estimation in fermentation control. In: L. Yliniemi, E. Juuso (eds.), Proceedings of TOOLMET'96 (Oulu, April 1-2, 1996), University of Oulu, Control Engineering Laboratory, Report A No 4, pp. 177-186.
18. Kivikunnas, S., Ibatici, K., Juuso, E. (1996): Process trend analysis and fuzzy reasoning in fermentation control. In: B. G. Mertzios, P. Liatsis (eds.), Proceedings of IWISP'96 (Manchester, Nov. 4-7, 1996), pp. 137-140.
19. Koskinen, J., Yliniemi, L., Leiviska, K. (1998): Fuzzy modelling of a pilot plant rotary dryer. In: Proc. UKACC International Conference on CONTROL'98, Swansea, 1, 515-518.
20. Leiviska, K., Juuso, E., Isokangas, A. (2001): Intelligent Modelling of Continuous Pulp Cooking. In: Industrial Applications of Soft Computing (K. Leiviska, ed.), Studies in Fuzziness and Soft Computing, Vol. 71, Springer Verlag.
21. Lui, H. C., Gun, M. K., Goh, T. H., Wang, P. Z. (1994): A self-tuning adaptive resolution (STAR) fuzzy control algorithm. In: Proc. 3rd IEEE World Congress on Computational Intelligence, 3, 1508-1513.
22. Mudi, R. K., Pal, N. R. (1999): A Robust Self-Tuning Scheme for PI- and PD-Type Fuzzy Controllers. IEEE Transactions on Fuzzy Systems, 7, 2-16.
23. Mudi, R. K., Pal, N. R. (2000): A self-tuning fuzzy PI controller. Fuzzy Sets and Systems, 115, 327-338.
24. Murtovaara, S., Leiviska, K., Juuso, E., Sutinen, R. (1999): Modelling of Pulp Characteristics in Kraft Cooking. University of Oulu, Control Engineering Laboratory, Report A No 9, December 1999, 20 p.
25. Myllyneva, J., Karlsson, L., Joensuu, I. (2001): Fuzzy Quality Control of a TMP Plant. In: Industrial Applications of Soft Computing (K. Leiviska, ed.), Studies in Fuzziness and Soft Computing, Vol. 71, Springer Verlag.
26. Ramkumar, K. B., Chidambaram, M. (1995): Fuzzy self-tuning PI controller for bioreactors. Bioprocess Engineering 12(5), 263-267.
27. Ruusunen, M., Leiviska, K. (2002): Fuzzy Modelling of Carbon Dioxide in a Burning Process. Accepted for the IFAC World Congress, Barcelona, July.
28. Sagasti, F. (1970): A Conceptual and Taxonomic Framework for the Analysis of Adaptive Behaviour. General Systems, Vol. XV, 1970.
29. Yliniemi, L. (1999): Advanced control of a rotary dryer. Dissertation, University of Oulu, Department of Process Engineering.
30. Yliniemi, L. (2001): Adaptive Fuzzy Control of a Rotary Dryer. In: Industrial Applications of Soft Computing (K. Leiviska, ed.), Studies in Fuzziness and Soft Computing, Vol. 71, Springer Verlag.
31. Yliniemi, L., Koskinen, J., Leiviska, K. (1998): Advanced control of a rotary dryer. In: Heidepriem, J. (ed.), Automation in Mining, Mineral and Metal Processing, Preprints. Elsevier Science, New York, pp. 127-132.
This page intentionally left blank
ESTIMATION AND CONTROL OF NON-LINEAR PROCESS USING NEURAL NETWORKS
ANNA JADLOVSKA
Department of Cybernetics and Artificial Intelligence, Technical University, Letná 9, 040 01 Košice, Slovak Republic
E-mail:
[email protected]

A neural network application to system identification and control of a non-linear process is described in this paper. Non-linear identification mostly uses feed-forward neural networks as a useful mathematical tool to build a non-parametric model between the input and output of a real non-linear process. The possibility of on-line estimation of the actual parameters from an off-line trained neural model of the non-linear process using the gain matrix is considered. This linearization technique is used in an algorithm for on-line tuning of the controller parameters based on pole-placement control design for a non-linear SISO process.

Keywords: neural network, non-linear parameter estimation, non-linear control, instantaneous linearization

1 Introduction
The purpose of this paper is to show how a feed-forward neural network (Multi-Layer Perceptron, MLP) can be used for modeling and control of a non-linear process. When the mathematical model of the process cannot be derived with an analytical method, the only way is to use the relationship between the input and output of the process. Fitting the model from the data is known as identification of the process. For linear processes this technique is generally well known [7]. For processes which are complex or difficult to model, non-linear identification can use a feed-forward neural network (MLP) as a useful mathematical tool to build a non-parametric model between the input and the output of the real non-linear process [2], [3], [5], [9]. We will consider the possibility of an on-line estimation of the actual parameters from an off-line trained neural model of the non-linear process using the gain matrix, introduced later. This linearization technique is used to perform an on-line tuning of the controller parameters based on pole-placement control design, in a control structure whose functional behaviour is similar to gain scheduling control [4], [8]. The advantage of using instantaneous linearization is that the controller parameters can be changed in response to process changes.
2 Non-linear System Identification
In this part we will discuss some basic aspects of non-linear system identification using, among the numerous neural network structures, only the Multi-Layer Perceptron (MLP, a feed-forward neural network), with respect to model-based neural control, where the control law is based upon the neural model. In this paper we will use a feed-forward MLP with a single hidden layer. This structure is shown in matrix notation in Fig. 1 [8].
Fig. 1. Matrix block diagram of an MLP
The matrix $\mathbf{W}_1$ represents the input weights and the matrix $\mathbf{W}_2$ represents the output weights; F represents the vector function containing the non-linear (tanh) neuron functions. The '1' shown in Fig. 1, together with the last column in $\mathbf{W}_1$, gives the offset in the network. The net input is represented by the vector $Z_{in}$ and the net output is represented by the vector $Z_{out}$. The mismatch between the desired output and $Z_{out}$ is the prediction error E.
The output from the MLP can be written as:
$$Z_{out} = \mathbf{W}_2\, F(\mathbf{W}_1 Z_{in}). \qquad (1)$$
From a trained MLP (trained by the Back-Propagation-Error algorithm, a first-order gradient method, or by the Gauss-Newton algorithm, a second-order gradient method) which has $m_0$ inputs and $m_2$ outputs, a gain matrix N can be found by differentiating with respect to the input vector of the network. The gain matrix N can be calculated from (1)
as
$$N = \frac{\partial Z_{out}}{\partial Z_{in}} = \mathbf{W}_2\, \mathrm{diag}\big(F'(\mathbf{W}_1 Z_{in})\big)\, \mathbf{W}_1', \qquad (2)$$
where $\mathbf{W}_1' = \mathbf{W}_1$ (excl. the last column). The above-mentioned gain matrix N allows an on-line estimation of the actual model parameters from the off-line trained neural model (MLP) of the non-linear process. In our paper we will apply the idea of using input-output parametric non-linear ARMAX (NARMAX) models in non-linear system identification by neural networks [2], [6]. The non-linear ARMAX (NARMAX) model can be defined as
$$Y(k) = F\big(Y(k-1), \dots, Y(k-p),\, U(k-1), \dots, U(k-m),\, E(k-1), \dots, E(k-p),\, \theta\big),$$
$$Y(k) = \hat{Y}(k) + E(k), \qquad (3)$$
where F is a non-linear vector function, $\theta$ represents the parameters and $E(k)$ is the prediction error. Here p and m denote the numbers of delayed outputs and inputs. The neural NARMAX model with input and output vectors
$$Z_{in}(k) = \big[Y(k-1), \dots, Y(k-p),\, U(k-1), \dots, U(k-m),\, E(k-1), \dots, E(k-p)\big], \qquad Z_{out}(k) = \hat{Y}(k), \qquad (4)$$
is shown in Fig. 2, which is a recurrent network. After training the neural network MLP, the actual gain matrix $N(k)$ can be estimated on-line and calculated by (2), and for the NARMAX model by (5):
$$N(k) = \frac{\partial \hat{Y}(k)}{\partial Z_{in}(k)} = \frac{\partial \hat{Y}(k)}{\partial \{Y(k-1), \dots, E(k-p)\}} = \big[\hat{a}_1(k) \dots \hat{a}_p(k)\;\; \hat{b}_1(k) \dots \hat{b}_m(k)\;\; \hat{c}_1(k) \dots \hat{c}_p(k)\big], \qquad (5)$$
where $\hat{a}_i(k)$ for $i = 1, \dots, p$, $\hat{b}_i(k)$ for $i = 1, \dots, m$, and $\hat{c}_i(k)$ for $i = 1, \dots, p$ are the estimated parameters of the neural NARMAX model at step k. Because the neural NARMAX model in Fig. 2 contains feedback loops around the MLP, we will apply for the training of this recurrent network a second-order Recursive Prediction Error Method (RPEM) using the Gauss-Newton search direction [2], [4], [7].
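The instantaneous-linearization step can be made concrete with a short sketch. The following is a minimal illustration, not the authors' code: it assumes a single-hidden-layer tanh MLP whose offset is realized by the unit appended to the input (the last column of $\mathbf{W}_1$, as in Fig. 1), and a linear output layer without its own offset.

```python
import numpy as np

def mlp_forward(W1, W2, z_in):
    """Network output Z_out = W2 * F(W1 * [z_in; 1]) as in equation (1)."""
    return W2 @ np.tanh(W1 @ np.append(z_in, 1.0))

def gain_matrix(W1, W2, z_in):
    """Gain matrix N = dZ_out/dZ_in of equation (2): the offset column of W1
    drops out of the derivative, and tanh'(x) = 1 - tanh(x)**2."""
    d = 1.0 - np.tanh(W1 @ np.append(z_in, 1.0)) ** 2
    return W2 @ (d[:, None] * W1[:, :-1])

# For the NARMAX model with p = m = 2 (6 inputs, 8 hidden neurons) used
# later, the single row of N(k) holds the a-, b- and c-estimates at step k.
rng = np.random.default_rng(0)            # random weights, shape checking only
W1, W2 = rng.normal(size=(8, 7)), rng.normal(size=(1, 8))
print(gain_matrix(W1, W2, rng.normal(size=6)))   # 1 x 6 gain matrix
```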
Fig. 2. Input-output neural NARMAX model
3 Non-linear Control
The model of the non-linear process and the training method have been considered generally for the multivariable case in Section 2. Next we consider control design for a non-linear SISO process using the neural NARMAX model. A trained neural NARMAX model, representing a model of the non-linear process, is used for an on-line estimation of the actual process parameters via the gain matrix $N(k)$. This linearization technique, called instantaneous linearization, allows an on-line tuning of the controller parameters using the pole-placement control strategy. This control concept, well known from linear control theory, will be implemented with an RST controller [1]. An example of the control structure using the estimated process parameters from the neural NARMAX model, applied for on-line tuning of the RST controller parameters by pole-placement design, is illustrated in Fig. 3.

3.1 Simulation Results of Non-linear Control Using Parameter Estimation
The idea and the results of estimating the process parameters from an off-line trained neural NARMAX model, and of using them for tuning the parameters of an RST controller
designed by the pole-placement strategy (non-linear system control) are presented for the non-linear test SISO process [4]:
$$y(k+1) = \frac{0.95\, y(k) + 0.25\, u(k) + 0.58\, u(k)\, y(k)}{1 + y(k)^2}. \qquad (6)$$
We consider a NARMAX model with 6 inputs ($p = m = 2$) and 8 neurons in the hidden layer. The activation function in the hidden layer is the tanh function, and in the output layer a linear function is selected. The actual gain matrix $N(k)$ can be calculated by (2), and the actual values of the estimated $\hat{a}$- and $\hat{b}$-parameters can be obtained from $N(k)$ by (5). These parameters can be used to calculate the RST controller parameters.
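For orientation, a sketch of the data-collection step (cf. Fig. 5) follows; the piecewise-constant excitation signal and its range are assumptions, chosen only to cover the operating region of process (6).

```python
import numpy as np

def process(y, u):
    """Non-linear test SISO process of equation (6)."""
    return (0.95 * y + 0.25 * u + 0.58 * u * y) / (1.0 + y ** 2)

rng = np.random.default_rng(1)
steps = 1200
u = np.repeat(rng.uniform(0.0, 2.5, steps // 40), 40)  # piecewise-constant input
y = np.zeros(steps)
for k in range(steps - 1):
    y[k + 1] = process(y[k], u[k])

# Regressor rows [y(k-1), y(k-2), u(k-1), u(k-2), e(k-1), e(k-2)] for the
# p = m = 2 structure; the error regressors start at zero and are filled
# with the prediction errors during recurrent (RPEM-style) training.
Z = np.column_stack([y[1:-1], y[:-2], u[1:-1], u[:-2],
                     np.zeros(steps - 2), np.zeros(steps - 2)])
target = y[2:]
```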
Fig. 3. Control scheme using a trained neural network for parameter estimation
In Fig. 4 the input and output signals used to train the neural net in the NARMAX structure are plotted; they were collected with the simulation scheme in Fig. 5. The neural NARMAX model is trained with the Gauss-Newton algorithm based on RPEM. The model is validated using the time validation test in Fig. 6 [2]; it is clear that the neural model can be accepted by this test. The results of the non-linear control, i.e. control with an RST controller designed by pole placement using on-line parameter estimation from the off-line trained neural model, are illustrated in Fig. 7 for a change of one parameter of the non-linear SISO process (6). This example shows the real power of neural modeling using the ARMAX structure known from the theory of linear identification, and the possibility to apply
the pole-placement method known from linear control theory to the control of a non-linear SISO process, in a control structure similar to self-tuning control. A sketch of one such tuning step follows.
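The following is a hedged sketch of the pole-placement arithmetic under stated assumptions: a second-order linearized model $A(q^{-1}) = 1 + a_1 q^{-1} + a_2 q^{-2}$, $B(q^{-1}) = b_1 q^{-1} + b_2 q^{-2}$ taken from the gain matrix $N(k)$, a minimal-degree RST controller, and a desired characteristic polynomial with two poles at $z = 0.6$ (an illustrative choice, not the paper's setting).

```python
import numpy as np

def rst_pole_placement(a1, a2, b1, b2, ac=(-1.2, 0.36, 0.0)):
    """Solve the Diophantine equation A*R + B*S = Ac for R = 1 + r1*q^-1
    and S = s0 + s1*q^-1; T = t0 is scaled for unit static gain."""
    ac1, ac2, ac3 = ac
    M = np.array([[1.0, b1, 0.0],
                  [a1,  b2, b1],
                  [a2, 0.0, b2]])
    r1, s0, s1 = np.linalg.solve(M, [ac1 - a1, ac2 - a2, ac3])
    t0 = (1.0 + ac1 + ac2 + ac3) / (b1 + b2)
    return r1, s0, s1, t0

# One step of the control law R*u = T*ref - S*y:
#   u(k) = t0*ref(k) - s0*y(k) - s1*y(k-1) - r1*u(k-1),
# recomputed at every sample from the freshly estimated a-, b-parameters.
```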
Fig. 4. The input and output training signals
Fig. 5. The simulation scheme for collection of the training data
Fig. 6. Validation of the neural model
Fig. 7. The RST controller based on actual parameter estimation from the neural NARMAX model
4 Conclusions
In this paper the neural NARMAX model is trained as a one-step predictor for a non-linear SISO process. After training, this NARMAX model can be used in the closed control loop for an on-line estimation of the process parameters, which allows tuning of the controller parameters by the pole-placement method. Practical simulations in MATLAB/SIMULINK with the Neural Toolbox illustrate that this control strategy, using the linearization technique via the gain matrix from the neural NARMAX model, produces excellent performance for control of a non-linear SISO process. This controller design, however, can be applied only to non-linear processes which do not contain hard nonlinearities.

References
1. Åström, K.J. and B. Wittenmark (1990). Computer Controlled Systems: Theory and Design, Prentice-Hall, second edition
2. Chen, S., S. Billings and P. Grant (1990). Non-linear system identification using neural networks. In: International Journal of Control, Vol. 51, No. 6, pp. 1191-1214
3. Jadlovska, A. (2000). An Optimal Tracking Neuro-Controller for Nonlinear Dynamic Systems. In: Control System Design, A Proceedings Volume from the IFAC Conference, Bratislava, Slovak Republic, 18-20 June, Published for IFAC by Pergamon, an Imprint of Elsevier Science, pp. 493-499, ISBN 0-08-043546-7
4. Jadlovska, A. (2001). Modeling and Control of Nonlinear Processes Using Neural Networks, Ph.D. thesis, FEI - TU, Košice, 156 pages
5. Jadlovska, A. and J. Sarnovsky (2001). Neural Predictive Control of Non-linear System. In: Proceedings of the 13th International Scientific-Technical Conference PROCESS CONTROL '01, Štrbské Pleso, High Tatras, Slovak Republic, CD-disk, 5 pages, ISBN 80-227-1542-5
6. Leontaritis, I.J. and S.A. Billings (1985). Input-output parametric models for non-linear systems, part 1 and 2. In: International Journal of Control, Vol. 41, No. 2, pp. 303-344
7. Ljung, L. (1987). System Identification: Theory for the User, Prentice Hall, first edition
8. Najvarek, J. (1996). Matlab and neural networks, FEI VUT, Brno
9. Narendra, K.S. and A.U. Levin (1990). Identification and control of dynamical systems using neural networks. In: IEEE Transactions on Neural Networks, Vol. 1, No. 1, pp. 4-27
THE USE OF NON LINEAR PARTIAL LEAST SQUARE METHODS FOR ON-LINE PROCESS MONITORING AS AN ALTERNATIVE TO ARTIFICIAL NEURAL NETWORKS
PAOLO F. FANTONI, MARIO HOFFMANN
IFE OECD Halden Reactor Project, Norway
E-mail:
[email protected]
WESLEY HINES, BRANDON RASMUSSEN
The University of Tennessee, TN, USA
ANDREAS KIRSCHNER
IPM, University of Applied Sciences Zittau/Görlitz (FH)

On-line monitoring evaluates instrument channel performance by assessing its consistency with other plant indications. Industry and EPRI experience at several plants has shown this overall approach to be very effective in identifying instrument channels that are exhibiting degrading or inconsistent performance characteristics. On-line monitoring of instrument channels provides information about the condition of the monitored channels through accurate, more frequent monitoring of each channel's performance over time. This type of performance monitoring is a methodology that offers an alternative approach to traditional time-directed calibration. On-line monitoring of these channels can provide an assessment of instrument performance and a basis for determining whether adjustments are necessary. Elimination or reduction of unnecessary field calibrations can reduce associated labour costs, reduce personnel radiation exposure and reduce the potential for miscalibration. PEANO is a system for on-line calibration monitoring, developed in the years 1995-2000 at the Institutt for energiteknikk (IFE), Norway, which makes use of Artificial Intelligence techniques for this purpose. The system has been tested successfully in Europe in off-line tests with EDF (France), Tecnatom (Spain) and ENEA (Italy). PEANO is currently installed and used for on-line monitoring at the HBWR reactor in Halden. A major problem in the use of Artificial Neural Networks, as in PEANO, is their limited retraining capability (retraining is necessary whenever process component changes occur) and their exponential complexity increase with the number of monitored signals. To overcome these limitations, an approach based on Non Linear Partial Least Square (NLPLS), an extension of the well-known PLS method, is proposed. In this work the NLPLS algorithm is implemented in the PEANO architecture and its performance is compared with the current PEANO version, based on ANN. For this purpose, real data from an operating PWR are used for testing both systems.
Keywords: Signal Validation, Process Monitoring, Pattern Recognition, Artificial Neural Networks, NLPLS
1 Introduction
Signal validation models are often constructed using sets of highly correlated variables. Variable matrices characterized by linear dependencies between columns are said to be collinear and present an ill-posed problem to empirical modeling
tasks. This causes an increase in the noise level of predictions and leads to unstable and unrepeatable results. The problem arises due to the numerical complexities of inverting collinear matrices. The data sets are often highly correlated due to the number of sensors used to monitor large-scale processes, such as a nuclear power plant or another type of electrical generation facility. These sensors are located throughout the system to monitor parameters that are physically related; thus, their measurements are also related. The use of redundant sensors to monitor safety-critical parameters exacerbates the problem by increasing the average correlations of the data set. One class of methods well suited for empirical modeling with collinear data is projection-based techniques. Partial least squares (PLS) is a member of this class, whereby the predictor variables are transformed to an equivalent set of orthogonal variables. Since orthogonal vectors are independent, this eliminates the problems associated with collinearity. Though other projection-based methods, such as principal component regression (PCR), are available, PLS poses an advantage in that its projections to latent variables are supervised by the desired response, whereas in other methods the projections are unsupervised with respect to the desired response. PLS has many attractive features making it an important model to consider for signal validation purposes. The creation of orthogonal bases, from which univariate regression models can be derived, eliminates the numerical problems associated with collinearity. An additional feature is the inherent regularization of the method, which provides reproducible, stable solutions.
2 The Method

2.1 The PLS Method
First introduced by H. Wold in the field of econometrics [1], PLS has become an important technique in many areas, including psychology, economics, chemistry, medicine, pharmaceutical science, and process modeling [2]. PLS is a class of techniques for modeling the associations between blocks of observed variables by means of latent variables [3]. These latent variables are created through a supervised transformation, whereby an orthogonal basis results. The PLS algorithm used in this work is a special case of the standard technique, where the Y block consists of a single column or response variable. This case of PLS modeling is referred to as PLS-1 [4] and is employed here due to its regularization properties, putting it in the same class as the methods of Ridge Regression (RR) and Principal Component Regression (PCR) [3]. Extensive discussions of PLS methods for the general case of more than one response variable are available in the literature [5-8]. All discussions presented here refer to the PLS-1 technique. The PLS-1 technique is an inferential modeling method. The inferential
structure provides the additional benefit of preventing perturbations in a specific variable from directly propagating to the prediction of that variable. This is due to the architecture of inferential models, in which a variable's response is inferred through the use of other variables correlated with it. PLS-1 is an inferential modeling technique of the class of regularization techniques. It has been directly compared to RR and truncated singular value decomposition (TSVD) in a signal validation study [9], as well as to RR and PCR in a chemical application [10]. In both cases, favorable results were reported for the PLS algorithm comparisons, and stable solutions were obtained. The chemical application resulted in a PLS model exhibiting a solution which was stabilized in comparison to the OLS solution, and comparable in prediction error to RR. The signal validation study was based on a predictor variable block plagued with collinearity, and reported a highly stable PLS solution in comparison to unregularized neural networks and OLS. This stability results from the orthogonalization of the predictor variable block, and the reduced variance of the estimates due to the elimination of higher-ordered latent variable pairs. These higher-ordered latent variable pairs contain the least amount of variation related to the response variable, and are generally considered to be noise. The inner relationships of the standard PLS algorithm are constructed using univariate regressions on the latent vectors. In attempts to enhance the technique, quadratic functions have been introduced into the inner relationship [11-14]. However, quadratics are still linear in their parameters and do not guarantee a proper solution [15]. The use of single hidden layer feedforward neural networks (NN) has been suggested [16], and applied in the field of chemistry [17-21]. These methods have been under study at the University of Tennessee, for the purposes of signal validation in large-scale processes, beginning in 1999. The method used herein uses fully connected feedforward neural networks in the inner relationships and maintains the linear orthogonalization in the outer relationships. This method will be referred to as neural network partial least squares (NNPLS), to avoid confusion with the various nonlinear methods utilizing quadratics or radial basis functions [22]. A NNPLS signal validation system has been implemented, on a trial basis, at the 9th unit of Tennessee Valley Authority's Kingston fossil plant, in Kingston, Tennessee, USA [23-26]. Two features of the PLS algorithm provide extensive benefits for designing empirical models for signal validation tasks. The first is its elimination of numerical problems associated with collinearity. A matrix is said to be collinear if the columns in X are approximately or exactly linearly dependent. Collinearity means that the matrix X will have some dominating types of variability that carry most of the available information [8]. A square matrix is said to be singular if there is at least one linear dependency among the rows (or columns) of the matrix. For the case of a singular matrix the determinant is zero: $|X| = 0$. The rank of a matrix is defined as the maximum number of linearly independent columns (or rows). Based on this, matrices containing linearly dependent columns are said to be rank-deficient, or ill-conditioned. In relation to empirical modeling, the matrix of concern is $X^TX$. Any linear dependence that exists in a matrix X will be preserved in the new matrix $X^TX$. Situations where the determinant of $X^TX$ is near zero lead to
unstable estimates of the regression coefficients in multiple linear regression (MLR), which may be unreasonably large or have the wrong sign [27]. When the determinant of a matrix is zero its inverse does not exist, and when it is near zero, its inverse contains values of extremely large magnitude. Collinearity in a data set leads to an ill-posed problem that causes inconsistent results when data-based models such as MLR or neural networks are used [4]. Data sets from large-scale processes, such as an electrical generating facility, are often plagued with collinearity. Thus, MLR is not a suitable method of empirical modeling for these situations, due to the required matrix inversion to obtain the solution. The use of autoassociative artificial neural networks (AANN), used extensively for signal validation [28-31], is also inhibited by collinearity for similar reasons [32]. Some type of regularization method should be utilized when dealing with collinear data. Regularization can be integrated into neural network development through the use of cross-validation training and robust training techniques; however, these methods are time consuming and require significant oversight and knowledge [33-34]. Other methods of regularization include truncated singular value decomposition (TSVD) and ridge regression (RR) [35-36]. The utility of the PLS algorithm in eliminating the problems associated with collinearity is an indispensable feature of the method. Though this feature is also available in PCR, the orthogonal projection to latent structures in PCR is unsupervised with regard to the response variables, focusing only on the variability contained in the predictor variable set. PLS, on the other hand, is supervised in that its orthogonal projections are performed to capture the maximum covariance between the predictor variables and the desired response in its latent variables. The first latent variable contains the maximum covariance in a given direction and subsequent latent variables contain the maximum covariance remaining, in a direction orthogonal to all of the previous latent variables. The second beneficial feature of PLS is its elimination of the higher-ordered latent vectors from passing through to the inner relationships of the model. These higher-order latent variables contain decreasing amounts of variance related to the response variable block, Y. This inherent regularization, via supervised dimensionality reduction, provides reproducible and stable solutions. The benefits associated with regularization, to stabilize the solutions of ill-posed problems, are making its incorporation into the design of signal validation systems essential [37]. There are two sets of relationships involved in the PLS technique, the outer relationships and the inner relationships. Consider an $m \times n$ block of predictor variables, X, and an $m \times 1$ response variable, Y. Note that the restriction of the response variable matrix to a single variable is the case for PLS-1. A diagram of the mapping structure of the PLS-1 model is shown in Figure 1.
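A small synthetic demonstration of the effect discussed above: two nearly dependent predictor columns make $X^TX$ almost singular, so the MLR normal equations amplify noise, while a single projection direction (the PLS idea) stays well behaved. The data and the chosen direction are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=200)])  # collinear pair
y = X @ np.array([1.0, 1.0]) + 0.01 * rng.normal(size=200)

print(np.linalg.cond(X.T @ X))             # enormous condition number
print(np.linalg.solve(X.T @ X, X.T @ y))   # wild, unstable MLR coefficients

t = X @ np.array([1.0, 1.0]) / np.sqrt(2)  # one stable latent direction
print((t @ y) / (t @ t))                   # well-behaved rank-1 estimate
```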
The outer relationships are linear relationships between the predictor variables, X, and the latent variables for the same block. The creation of latent variable pairs is an iterative process. The diagram of the mapping performed by a single iteration of the PLS-1 algorithm provides further illustration.
Figure 1. Structure of the PLS Method
Following the creation of a latent variable pair, the predictor variables are successively deflated by the information extracted by the current latent variable pair. For this reason, the observed variable block is often referred to as a residual matrix, and herein will be referred to as the input residual matrix. Similarly, there is an output residual matrix created by the deflation of the response variable matrix at the end of each iteration. The deflation operations are not an essential step in the PLS algorithm [38]; however, the residual matrices provide information regarding the amount of variance contained in each successive set of latent variable pairs [19]. The outer relationships for the predictor variables and the subsequent deflation of the input residual matrix are given below. Define:
E = input residual matrix
f = output residual matrix
w = predictor variable transformation weights
t = input latent variable
b = regression coefficient
p = input loading vector
with $a = 1, \dots, R$ (iteration index) and $j = 1, \dots, n$ (column index). The outer relationship can be explicitly written as:
$$w^a = \frac{(E^{a-1})^T f^{a-1}}{\big\|(E^{a-1})^T f^{a-1}\big\|}, \qquad t^a = E^{a-1} w^a.$$
The deflation of the input residual matrix is given by:
$$E^a = E^{a-1} - t^a (p^a)^T, \qquad p^a = \frac{(E^{a-1})^T t^a}{(t^a)^T t^a}.$$
Note that $E^0 = X$; subscripts indicate the matrix indices, and superscripts indicate the iteration index. The number of iterations of the outer relationship, $a$, dictates the number of pairs of latent variables, and constitutes selection of the model. The number of latent variable pairs computed is the rank of the PLS model, $R$, and is often determined via a cross-validation technique [9, 40]. The PLS-1 algorithm does not require the transformation of the response variable, since it is a rank-one matrix. Hence, there is no corresponding outer relationship for the response variable. The optional task of deflation of the output residual vector may still be carried out if desired, to provide a quantification of the variability explained by each consecutive input latent variable. The deflation of the output residual is given by:
$$f^a = f^{a-1} - b^a t^a.$$
Note that $f^0 = Y$, and $b^a$ is the coefficient from the univariate regression of $t^a$ onto $f^{a-1}$. All of the discussions regarding the input outer relationships of the PLS algorithm also apply to the NNPLS algorithm. Differences between the methods occur in the inner relationships, and slightly modify the equation for the deflation of the output residual. The inner relationships of the PLS-1 algorithm are a set of univariate regressions between each latent variable pair up to the rank of the model. Substituting these univariate regressions with feedforward neural networks provides the algorithm with nonlinear mapping capabilities, and will often produce a more parsimonious model [19]. Following the method of S.J. Qin and T.J. McAvoy, each latent variable pair is mapped with a single input single output (SISO) neural network [20]. The use of very simple neural networks circumvents the overparameterized problem of the direct neural network approach. The combination of the PLS-1 linear outer relationships and the neural network inner relationships will be referred to as NNPLS-1.
2.2 The NNPLS Approach
The neural networks used in the inner relation contain a single hidden layer with two neurons containing hyperbolic tangent activation functions, and a single output neuron with a linear activation function. The neural networks were trained with the scaled conjugate gradient (SCG) training algorithm [40]. Cross validation training was used for early stopping of the SCG algorithm. Initialization of the neural network weight and bias values can be based on the univariate regression solution. This method of initialization will decrease the time required to train the neural networks, and due to the SCG training algorithm finding the nearest local minimum from the initial point, the solution will be better than the initial linear model [20].
Each latent variable pair requires a SISO neural network. The neural network weights and biases for each set of score vectors are determined during the current iteration and the explained variation is subtracted from the output residual via network simulation. The decision to use two hidden neurons was to provide the neural networks with a significant number of degrees of freedom without increasing the complexity of the model beyond that which is required. Through experimentation, it was noted that two hidden neurons often provided a better solution than a single hidden neuron, and greater than two hidden neurons did not result in improvements great enough to warrant their inclusion. For the purposes of a large-scale signal validation system the required architecture is autoassociative such that for a given set of input variables, the model provides a prediction at the output for each variable in the input set. For each variable a separate inferential NNPLS-1 model is required. The combination of a full set of inferential models mimics the standard autoassociative architecture.
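As a sketch of one inner relationship in NNPLS-1: the univariate regression $b \cdot t$ is replaced by a 1-2-1 network (two tanh hidden neurons, one linear output neuron), trained here by plain batch gradient descent for brevity rather than the scaled conjugate gradient algorithm with cross-validation stopping used by the authors.

```python
import numpy as np

def fit_siso_net(t, f, lr=0.05, epochs=5000, seed=0):
    """Fit f ~ g(t) with a single-input single-output 1-2-1 tanh network."""
    rng = np.random.default_rng(seed)
    w1, b1 = 0.5 * rng.normal(size=2), np.zeros(2)   # hidden layer
    w2, b2 = 0.5 * rng.normal(size=2), 0.0           # linear output neuron
    n = len(t)
    for _ in range(epochs):
        h = np.tanh(np.outer(t, w1) + b1)            # (n, 2) hidden outputs
        e = h @ w2 + b2 - f                          # prediction error
        dh = (e[:, None] * w2) * (1.0 - h ** 2)      # backpropagated signal
        w2 -= lr * (h.T @ e) / n; b2 -= lr * e.mean()
        w1 -= lr * (dh.T @ t) / n; b1 -= lr * dh.mean(axis=0)
    return lambda tt: np.tanh(np.outer(tt, w1) + b1) @ w2 + b2
```

Each latent variable pair gets its own such network, and the fitted $g(t^a)$ replaces $b^a t^a$ in the deflation of the output residual.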
2.3 The PEANO System
Artificial Neural Networks (ANN) and Fuzzy Logic can be combined to exploit the learning and generalization capability of the first technique together with the approximate reasoning embedded in the second approach [41]. Real-time process signal validation is an application field where the use of this technique can improve the diagnosis of faulty sensors and the identification of outliers in a robust and reliable way. PEANO [42-48] implements a fuzzy-possibilistic clustering algorithm [49-52] to classify the operating region in which the validation process has to be performed. The possibilistic approach (rather than probabilistic) allows a "don't know" classification that results in a fast detection of unforeseen plant conditions or outliers. The fuzzy classifier identifies the incoming signal pattern (a set of reactor process signals) as a member of one of the clusters covering the entire universe of discourse, represented by the possible combinations of steady-state and transient values of the input set in the n-dimensional input space. Each cluster is associated with one ANN, previously trained only with data belonging to that cluster, for the input set validation process. During operation, while the input point moves in the n-dimensional space (because of process state changes or transients), the classifier provides an automatic switching mechanism to allow the best-tuned ANN to do the job. There are two main advantages in using this architecture: the accuracy and generalization capability is increased compared to the case of a single network working in the entire operating region, and the ability to identify abnormal conditions, where the system is not capable of operating with a satisfactory accuracy, is improved.
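A heavily simplified sketch of this gating idea follows; the Gaussian membership function, its per-cluster widths and the 0.2 acceptance threshold are all assumptions made for illustration, not PEANO's actual classifier.

```python
import numpy as np

def validate(x, centers, widths, models, threshold=0.2):
    """Route input x to the model of its best-matching cluster, or answer
    "don't know" when every possibilistic membership is low."""
    # Possibilistic memberships: each cluster is judged on its own, so an
    # outlier can score low everywhere (probabilistic memberships would be
    # forced to sum to one and hide this).
    mu = np.exp(-np.sum(((x - centers) / widths) ** 2, axis=1))
    best = int(np.argmax(mu))
    if mu[best] < threshold:
        return None, mu            # unforeseen plant condition or outlier
    return models[best](x), mu     # estimate from the best-tuned model
```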
The structure of PEANO is shown in Figure 2. PEANO has a client-server architecture. The server is connected to the process through a TCP/IP communication protocol, and the results of the validation activity are transferred to the client programs, also using TCP/IP.
Figure 2. The structure of PEANO
Figure 3 shows the display of a PEANO client during an on-line validation test. The error bands in the mismatch plots are calculated by PEANO during the training, according to the expected error of prediction for each particular cluster and signal [53]. The error bands should be interpreted as follows (see the sketch below):
Narrow error band: normally set at 2 standard deviations of the expected error. Exceeding this band is considered a first warning, especially if the situation persists.
Wide error band: normally set at 3 standard deviations of the expected error. Exceeding this band is considered a definite alarm condition.
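The band logic amounts to a few lines; `sigma` here stands for the per-cluster, per-signal expected prediction error learned during training (the name is illustrative).

```python
def band_status(measured, estimated, sigma):
    """Classify one mismatch sample against the two error bands."""
    mismatch = abs(measured - estimated)
    if mismatch > 3.0 * sigma:
        return "alarm"       # outside the wide band: definite alarm
    if mismatch > 2.0 * sigma:
        return "warning"     # outside the narrow band: first warning
    return "ok"
```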
2.4 Tests using ANN and NLPLS inside PEANO
Figures 4 to 6 describe some results achieved with PEANO-ANN, using real data from a US PWR nuclear plant (data provided by EPRI). PEANO was used to monitor 55 process signals up to the secondary side of the steam generators. The data were taken from 3 months of operation (March to May 2000) at different operating conditions.
Figure 3. The Client display of PEANO
Figure 4 shows the re-calibration of a steam flow sensor while the plant was in operation. It is visible that after this intervention, the mismatch was well within the 2-standard-deviation limit. Figure 5 shows the monitoring of a steam line pressure sensor during steady-state operation at full power. Note how the system acts like a low-pass filter, removing process and instrument noise. The error is always at the center of the tolerance band. Figure 6 is an interesting example of span drift of a steam flow sensor. Span drifts are difficult to detect because they show up only at some locations of the instrument range. In this real-life example, the instrument was perfectly inside the calibration range until the power level came close to the rated level (the high end of the instrument range). At this point the instrument started to drift and eventually finished outside the allowed tolerance band. At the plant, this drift was discovered only one month later. In PEANO-NLPLS, the bank of ANNs has been modified to use the NLPLS algorithm described in this paper. The total development time for this application, using the same EPRI data, was 0.5 hours, while for PEANO-ANN it was 3 days. This dramatic improvement in the training time makes any needed retraining possible and easy to perform. Figure 7 and Figure 8 show the same tests as in Figure 4 and Figure 5, with the NLPLS algorithm in PEANO. The performance and accuracy are in the same range, and in both cases the miscalibrated instruments were detected correctly.
3 Conclusions
This paper shows how an approach based on NLPLS techniques can estimate and predict efficiently the behavior of a non-linear process, in comparison with a traditional approach based on ANN. The use of Non Linear Partial Least Square methods is expected to improve the performance of a signal validation and estimation system, solving some of the problems embedded in any ANN approach, such as retraining capability, training speed and large-scale applications.
Figure 4. Drift detection with PEANO-ANN (steam flow FT-1-28A, April 2000)
Figure 5. Noise removal with PEANO-ANN (steam line pressure PT-1-2B, April 2001)
Figure 6. Span drift detection with PEANO-ANN (steam flow, end of March 2000)
Figure 7. NLPLS drift detection (compare with Figure 4)
Figure 8. NLPLS noise removal of steam line pressure PT-1-2B, April 2001 (compare with Figure 5)
References
1. Wold, H. (1966), "Non-linear Estimation by Iterative Least Squares Procedures," In: Research Papers in Statistics, David, F. (Ed.), Wiley, NY.
2. Rännar, S., Lindgren, F., Geladi, P., and S. Wold (1994), "A PLS Kernel Algorithm for Data Sets with Many Variables and Fewer Objects, Part 1: Theory and Algorithm," Journal of Chemometrics, 8, 111-125.
3. Wegelin, J.A. (2000), "A Survey of PLS Methods, with Emphasis on the Two-Block Case," University of Washington Technical Report No. 371, Seattle.
4. Gil, J.A., and R. Romera (1998), "On Robust Partial Least Square (PLS) Methods," Journal of Chemometrics, 12, 365-378.
5. Geladi, P., and B.R. Kowalski (1986), "Partial Least Squares Regression: A Tutorial," Analytica Chimica Acta, 185, 1-17.
6. Höskuldsson, A. (1988), "PLS Regression Methods," Journal of Chemometrics, 2, 211-228.
7. Höskuldsson, A. (1995), "A Combined Theory for PCA and PLS," Journal of Chemometrics, 9, 91-123.
8. Martens, H., and T. Naes (1989), Multivariate Calibration, John Wiley and Sons, Chichester.
9. Hines, J.W., A.V. Gribok, I. Attieh, and R.E. Uhrig (1999), "Regularization Methods for Inferential Sensing in Nuclear Power Plants," Fuzzy Systems and Soft Computing in Nuclear Engineering, Ed. Da Ruan, Springer.
10. Wold, S., Ruhe, A., Wold, H., Dunn, W.J. III (1984), "The collinearity problem in linear regression: The partial least squares approach to generalized inverses," SIAM J. Sci. Stat. Comput., vol. 5, pp. 735-743.
11. Baffi, G., Martin, E.B., Morris, A.J. (1999a), "Non-linear projection to latent structures revisited: the quadratic PLS algorithm," Computers in Chemical Engineering, vol. 23, pp. 395-411.
12. Ni, Y., and Z. Peng (1995), "Determination of Mixed Metal Ions by Complexometric Titration and Nonlinear Partial Least Squares Calibration," Analytica Chimica Acta, 304, 217-222.
13. Wold, S., N.K. Wold, and B. Skagerberg (1989), "Nonlinear PLS Modeling," Chemometrics and Intelligent Laboratory Systems, 7, 53-65.
14. Wold, S. (1992), "Nonlinear Partial Least Squares Modeling II. Spline Inner Relation," Chemometrics and Intelligent Laboratory Systems, 14, 71-84.
15. Bro, R. (1995), "Algorithm for Finding an Interpretable Simple Neural Network Solution Using PLS," Journal of Chemometrics, 9, 423-430.
16. Gemperline, P.J., Long, J.R., and V.G. Gregoriou (1991), Anal. Chem., 63, 2313.
17. Baffi, G., Martin, E.B., Morris, A.J. (1999b), "Non-linear projection to latent structures revisited: the neural network PLS algorithm," Computers in Chemical Engineering, vol. 23, pp. 1293-1307.
18. Holcomb, T.R., and M. Morari (1992), "PLS/Neural Networks," Computers Chem. Engng., 16, 393-411.
19. Malthouse, E.C., A.C. Tamhane, and R.S.H. Mah (1997), "Nonlinear Partial Least Squares," Computers Chem. Engng., 21, 875-890.
20. Qin, S.J., and T.J. McAvoy (1992), "Nonlinear PLS Modeling Using Neural Networks," Computers in Chemical Engineering, vol. 16, n. 4, pp. 379-391.
21. Qin, S.J., and T.J. McAvoy (1996), "Nonlinear FIR Modeling via Neural Net PLS Approach," Computers Chem. Engng., 20, 147-159.
22. Wilson, D.J.H., G.W. Irwin, and G. Lightbody (1997), "Nonlinear PLS Modeling Using Radial Basis Functions," Proceedings of the American Control Conference, Albuquerque, NM, pp. 3275-3276.
23. Hines, J.W., B. Rasmussen, and R.E. Uhrig (2000), "An On-line Sensor Calibration System," Presented at the 13th International Congress and Exhibition on Condition Monitoring and Diagnostic Engineering Management, Houston, Texas.
24. Hines, J.W., and B. Rasmussen (2001), "Continuous Calibration Verification at a Coal Fired Power Plant," Presented at ICOMS, The International Conference of Maintenance Societies, Melbourne, Australia, June.
25. Rasmussen, B., J.W. Hines, and R.E. Uhrig (2000a), "Nonlinear Partial Least Squares Modeling for Instrument Surveillance and Calibration Verification," published in the proceedings of the Maintenance and Reliability Conference (MARCON), Knoxville, TN, May 7-10.
26. Rasmussen, B., J.W. Hines, and R.E. Uhrig (2000b), "A Novel Approach to Process Modeling for Instrument Surveillance and Calibration Verification," Proceedings of The Third American Nuclear Society International Topical Meeting on Nuclear Plant Instrumentation and Control and Human-Machine Interface Technologies, Washington DC, November 13-17.
27. Massart, D.L., B.G.M. Vandeginste, S.N. Deming, Y. Michotte, and L. Kaufman (1988), Chemometrics: A Textbook, Elsevier Science Publishers, Amsterdam.
28. Fantoni, P., S. Figedy, A. Racz (1998), "A Neuro-Fuzzy Model Applied to Full Range Signal Validation of PWR Nuclear Power Plant Data," FLINS-98, Antwerpen, Belgium.
29. Hines, J.W., A.V. Gribok, I. Attieh, and R.E. Uhrig (2000), "Neural Network Regularization Techniques for a Sensor Validation System," American Nuclear Society Annual Meeting, San Diego, California, June 4-8.
30. Upadhyaya, B.R., and E. Eryurek (1992), "Application of Neural Networks for Sensor Validation and Plant Monitoring," Nuclear Technology, vol. 97, pp. 170-176.
31. Xu, X., J.W. Hines, and R.E. Uhrig (1999), "Sensor Validation and Fault Detection Using Neural Networks," Proceedings of the Maintenance and Reliability Conference (MARCON), Gatlinburg, TN, May 10-12.
32. Qin, S. (1997), "Neural networks for intelligent sensors and control - Practical issues and some solutions," In Neural Systems for Control, Chapter 8, Edited by O. Omidvar and D.L. Elliott, Academic Press.
33. Xu, X. (2000), PhD dissertation, "Automated Neural Network-Based Instrument Validation System," The University of Tennessee, Nuclear Engineering Department, Knoxville, TN.
34. Hines, J.W., A.V. Gribok, I. Attieh, and R.E. Uhrig (1999), "The Use of Regularization in Inferential Measurements," Presented at the Enlarged Halden Programme Group (EHPG) Meeting, Loen, Norway, May 24-29.
35. Hoerl, A.E., and R.W. Kennard (1970), "Ridge Regression: Biased Estimation for Nonorthogonal Problems," Technometrics, 12, 55-67.
36. Hines, J.W., A.V. Gribok, I. Attieh, and R.E. Uhrig (2000), "Improved Methods for On-Line Calibration," Proceedings of the 8th International Conference on Nuclear Engineering, Baltimore, MD, April 2-6.
37. Dayal, B.S., and J.F. MacGregor (1997), "Improved PLS Algorithms," Journal of Chemometrics, 11, 73-85.
38. Höskuldsson, A. (1996), "Experimental Design and Priority PLS Regression," Journal of Chemometrics, 10, 637-668.
39. Wold, H. (1982), "Soft Modeling, The Basic Design and Some Extensions," In Systems Under Indirect Observation, I-II, K.G. Jöreskog and H. Wold (Eds.), North-Holland, Amsterdam.
40. Møller, M.F. (1993), "A scaled conjugate gradient algorithm for fast supervised learning," Neural Networks, vol. 6, pp. 525-533.
41. Fantoni, P., Mazzola, A. (1994), Applications of Autoassociative Neural Networks for Signal Validation in Accident Management. In: Proceedings of the IAEA Specialist Meeting on Advanced Information Methods and Artificial Intelligence in Nuclear Power Plant Control Rooms, Halden, Norway.
42. Fantoni, P., Mazzola, A. (1995), Transient and Steady State Signal Validation in Nuclear Power Plants using Autoassociative Neural Networks and Pattern Recognition. In: Proceedings of SMORN VII, Avignon, France.
43. Fantoni, P., Mazzola, A. (1996), A Pattern Recognition-Artificial Neural Networks Based Model for Signal Validation in Nuclear Power Plants. In: Annals of Nuclear Energy, Vol. 23, No. 13, pp. 1069-1076.
44. Fantoni, P., Mazzola, A. (1996), Multiple Failure Signal Validation in Nuclear Power Plants using Artificial Neural Networks. In: Nuclear Technology, Vol. 113, No. 3, pp. 368-374.
45. Fantoni, P. (1996), Neuro-Fuzzy Models Applied to Full Range Signal Validation in NPP. In: Proceedings of NPIC&HMIT'96, The Pennsylvania State Univ., PA.
46. Fantoni, P., Figedy, S., Racz, A. (1998), PEANO, A Toolbox for Real-Time Process Signal Validation and Estimation. HWR-515, OECD Halden Reactor Project (restricted).
47. Fantoni, P., Renders, J. (1998), On-Line Performance Estimation and Condition Monitoring using Neuro-Fuzzy Techniques. In: Proceedings of the Workshop on On-Line Fault Detection and Supervision in the Chemical Process Industries, Lyon, France.
48. Fantoni, P. (2000), A Neuro-Fuzzy Model Applied to Full Range Signal Validation of PWR Nuclear Power Plant Data. In: International Journal of General Systems, Vol. 29(2), pp. 305-320.
49. Tou, J.T., Gonzalez, R.C. (1974), Pattern Recognition Principles. Addison-Wesley Publishing Company, Reading, MA.
50. Bezdek, J.C. (1981), Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York, NY.
51. Gustafson, D.E., Kessel, W.C. (1979), Fuzzy Clustering with a Fuzzy Covariance Matrix. In: Proceedings of IEEE CDC, San Diego, CA.
52. Krishnapuram, R., Keller, J. (1993), A Possibilistic Approach to Clustering. In: IEEE Transactions on Fuzzy Systems, Vol. 1, No. 2, pp. 98-110.
53. Fantoni, P., Mazzola, A. (1996), Accuracy Estimate of Artificial Neural Networks Based Models for Industrial Applications. In: Artificial Intelligence in the Petroleum Industry: Symbolic and Computational Applications 2, Chapter 17, B. Braunschweig & B. Bremdal (Eds.).
RECURRENT NEURAL NETWORKS FOR REAL-TIME COMPUTATION OF INVERSE KINEMATICS OF REDUNDANT MANIPULATORS
JUN WANG AND YUNONG ZHANG
DEPARTMENT OF AUTOMATION AND COMPUTER-AIDED ENGINEERING
THE CHINESE UNIVERSITY OF HONG KONG
SHATIN, NEW TERRITORIES, HONG KONG
EMAIL:
[email protected]

Recurrent neural networks are discussed for real-time inverse kinematic control of redundant manipulators, with three recurrent neural network models: the Lagrangian neural network, the primal-dual neural network, and the dual neural network. We begin with the Lagrangian neural network for the inverse kinematics computation based on the Euclidean norm of the joint velocities, to show the feasibility. Next, we present the primal-dual neural network for minimum infinity-norm kinematic control of redundant manipulators. To reduce the model complexity and increase the computational efficiency, the dual neural network is developed with the advantages of simple architecture and exponential convergence. Simulation results based on the PA10 robot manipulator substantiate the effectiveness of the present recurrent neural network approach.
Keywords: kinematically redundant manipulators, recurrent neural networks, inverse kinematics.

1 Introduction
Kinematically redundant manipulators are those having more degrees of freedom than required to perform given tasks. The redundancy of such manipulators includes intrinsic redundancy and functional redundancy [1]. The use of kinematically redundant manipulators is expected to increase dramatically in the future because of their ability to avoid obstacles, joint limits, and singularities, and to optimize various performance criteria, while conducting the end-effector motion task. The real-time computation of inverse kinematics solutions is very time-consuming in high degree-of-freedom sensor-based robotic systems, especially when multiple performance criteria and/or dynamic physical constraints are considered, due to the time-varying nature and the real-time calculation requirement. Parallel computation methods such as neural network approaches are effective and efficient alternatives for inverse kinematics
solution computation [10]-[12]. In recent years, new interest in neural network research has been generated to reduce the computational complexity and to improve the computational efficiency of kinematic control of redundant manipulators. Various neural network models have thus been developed, including feedforward networks and recurrent neural networks [13]-[28]. Unlike feedforward neural networks, most recurrent neural networks do not need off-line supervised learning and thus are more suitable for real-time robot control in uncertain environments. For example, Wang et al. [19] developed a recurrent neural network, called the Lagrangian network, to resolve manipulator redundancy at the velocity level. The Lagrangian network generates a real-time solution to the inverse kinematics problem formulated as a time-varying quadratic optimization problem. In contrast to the minimum two-norm kinematic control scheme presented in [18][19], a primal-dual neural network together with a pseudo-inverse neural network was developed in [18], and a primal-dual neural network with reduced architecture complexity was developed in [25] by minimizing the infinity-norm of robot joint velocities. To reduce model complexity, a single-layer dual neural network model was first presented in [26] for the inverse kinematics of redundant manipulators. The number of neurons of the dual network is just equal to the dimensionality of the robot joint, and moreover the dual neural network was extended for bi-criteria kinematic control of redundant manipulators in [28]. The dual neural network was proved to be globally exponentially stable [26]-[28]. The remainder of this chapter is organized in four sections. The problem formulation and preliminaries are presented in Section 2. The Lagrangian neural network, the primal-dual neural network and the dual neural network are reviewed respectively in Subsections 3.1, 3.2 and 3.3 of Section 3. Simulation results based on the PA10 intelligent robot arm are illustrated in Section 4. Section 5 concludes this chapter with final remarks.

2 Problem formulation
The forward kinematics equation in robotics is concerned with the transformation of position and orientation information in a joint space to a Cartesian space, described as:
$$r(t) = f(\theta(t)), \qquad (1)$$
where $\theta(t)$ is an n-vector of joint variables, $r(t)$ is an m-vector of Cartesian position and orientation variables, and $f(\cdot)$ is a continuous nonlinear function with a known structure and parameters for a given manipulator.
The inverse kinematics problem is to find the joint variables, given the desired positions and orientations of the robot end-effector, through the inverse mapping of (1):
$$\theta(t) = f^{-1}(r(t)). \qquad (2)$$
Solving the inverse kinematics problem of redundant manipulators is of vital importance in robotics [1]-[9]. The inverse kinematics problem involves the existence and uniqueness of a solution, and the effectiveness and efficiency of solution methods. The inverse kinematics problem is thus much more difficult to solve than the forward kinematics problem for serial-link manipulators. The difficulties are compounded by the requirement of real-time solution in sensor-based robotic operations. Therefore, real-time solution procedures for the inverse kinematics problem of redundant manipulators are of great importance in robotics. The most direct way to solve (2) is to derive a closed-form solution from (1). Unfortunately, obtaining a closed-form solution is difficult for most manipulators due to the nonlinearity of $f(\cdot)$. Moreover, the solution is often not unique for redundant manipulators due to the extra degrees of freedom. Making use of the relation between the joint velocity $\dot\theta(t)$ and the Cartesian velocity $\dot r(t)$ is a common indirect approach to the inverse kinematics problem; i.e.,
$$J(\theta)\dot\theta(t) = \dot r(t), \qquad (3)$$
where $J(\theta) \in \Re^{m\times n}$ is the Jacobian matrix defined as $J(\theta) = \partial f(\theta)/\partial\theta$. In a redundant manipulator, (1) and (3) are underdetermined since $m < n$, and hence they may admit an infinite number of solutions. The indirect approach begins with the desired velocity of the end-effector $\dot r(t)$, based on a planned trajectory and a required completion time T. The corresponding joint vector $\theta(t)$ is obtained by integration of $\dot\theta(t)$ for a given $\theta(0)$. The resulting $\theta(t)$ is then used to control the manipulator. Much effort has been devoted to numerical solutions of the inverse kinematics problem. For example, in the table lookup method the inverse Jacobian matrix is stored in memory a priori instead of being computed in real time. In the pivot method, the inverse Jacobian is broken down into manageable submatrices. In the extended pivot method, the joint velocity is obtained by directly computing the joint velocities. In the residue arithmetic method, the pseudoinverse of the Jacobian is computed in a parallel manner. In the least-squares method, $\dot\theta(t)$ is computed directly without solving the pseudoinverse of the Jacobian explicitly. Other numerical methods such as Newton's method have also been developed for solving the inverse kinematics problem. For a review and an in-depth discussion, see [4][5].
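As a baseline for the indirect approach built on (3), the sketch below computes the minimum two-norm joint velocity through the Jacobian pseudoinverse and integrates it with the Euler rule; the redundant planar 3-link arm (n = 3, m = 2) and its link lengths are assumptions made purely for illustration.

```python
import numpy as np

L = np.array([1.0, 0.8, 0.5])                    # link lengths (assumed)

def jacobian(theta):
    """Jacobian of a planar serial arm: column j sums links j..n-1."""
    c, s = np.cos(np.cumsum(theta)), np.sin(np.cumsum(theta))
    return np.array([[-np.sum(L[j:] * s[j:]) for j in range(3)],
                     [ np.sum(L[j:] * c[j:]) for j in range(3)]])

theta, dt = np.array([0.3, 0.4, 0.5]), 1e-3
rdot = np.array([0.05, -0.02])                   # desired end-effector velocity
for k in range(2000):                            # Euler integration of (3)
    theta = theta + dt * (np.linalg.pinv(jacobian(theta)) @ rdot)
```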
Since $\dot\theta$ is underdetermined in a kinematically redundant manipulator, another way to determine $\dot\theta(t)$ without computing the pseudoinverse is to solve on-line the following time-varying optimization problem with equality constraints:
$$\begin{aligned} \text{minimize}\quad & h(\dot\theta(t)),\\ \text{subject to}\quad & J(\theta)\dot\theta(t) = \dot r(t), \end{aligned} \qquad (4)$$
where $h(\dot\theta(t))$ is a general performance measure and can be selected as $\dot\theta^T W \dot\theta/2$, with $W \in \Re^{n\times n}$ being a symmetric positive-definite weighting matrix, or as $\|\dot\theta\|_\infty := \max_i |\dot\theta_i|$. The superscript T denotes the transpose operator. If W is the identity matrix I, then the objective function to be minimized is equivalent to the Euclidean norm of the joint velocity, $\|\dot\theta(t)\|_2^2$. If W is the inertia matrix, then the objective function to be minimized is the local kinetic energy. The problem formulation (4) is only a basic and common form of the inverse kinematics problems involved in this chapter. The formulation may become more realistic when considering more performance criteria or adding more physical constraints such as joint limits and joint velocity limits.
Figure 1. Block diagram for the neural network based kinematic control process (Tang and Wang, A recurrent neural network for minimum infinity-norm kinematic control of redundant manipulators with an improved problem formulation and reduced architectural complexity, IEEE Trans. Syst., Man, Cybern., © 2001 IEEE).
Various neural network models have been developed in the past decade for the inverse kinematics of redundant manipulators. In the early 1990's, the research focused on feedforward neural networks such as the multilayered perceptron trained via supervised learning using the backpropagation algorithm or its variants, which need off-line training and may violate the real-time computation requirement of industrial robots. Since the mid-1990's, recurrent neural networks with feedback connections have also been applied to kinematic control. In particular, three recurrent neural network models are discussed successively in the following section: the Lagrangian network, the primal-dual network and the dual neural network of [19][25]-[28]. Fig. 1 delineates the kinematic control process based on recurrent neural networks.
3 Recurrent neural network models

3.1 The Lagrangian neural network
where A ( t ) is an rn-dimensional column vector of Lagrangian multipliers at time t. By setting the partial derivatives of L(6,A) t o zero, the Lagrange necessary condition gives rise to the following time-varying algebraic equations:
It can be shown that the optimality condition is also sufficient. Multiplying both sides of (6) by -1, then rewriting (6) and (7) in a combined matrix form, we have
Let the state vectors of output neurons and hidden neurons be denoted by v ( t ) and u ( t ) , an n-vector representing estimated e ( t ) and an rn-vector representing estimated A ( t ) , respectively. The dynamic equations of the proposed Lagrangian network can be expressed by the following time-varying linear differential equations:
ClW(t)= -Wv(t) - J(O)Tu(t), CZiL(t)= J ( Q ) v ( t ) i(t), where C1 E X n x n and C2 E X m x m are positive diagonal capacitive matrices. The positive diagonal capacitive matrices C1 and C2 are used to precondition the system matrix and scale the convergence rate of the Lagrangian network. Fig. 2 illustrates the kinematic control process based on the Lagrangian network. In this context, the desired velocity vector T ( t ) is input into the Lagrangian network, and a t the same time the Lagrangian network outputs the computed joint velocity vector e(t). In details, (9) shows that the symmetric connection weight matrix among the neurons represented by v ( t ) is -W, and that the time-varying connection weight matrix from the neurons represented
by u(t) to the neurons represented by v(t) is $-J(\theta)^T$. (10) shows that the time-varying connection weight matrix from the neurons represented by v(t) to the neurons represented by u(t) is $J(\theta)$, and that the external input vector to the hidden layer is $-\dot r(t)$.
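A sketch of simulating the network (9)-(10) by Euler integration follows, with the same assumed planar 3-link arm as in the earlier pseudoinverse sketch; W = I (minimum two-norm resolution) and the scalar capacitances c1 = c2 = 0.01 are illustrative choices, not values from [19].

```python
import numpy as np

L = np.array([1.0, 0.8, 0.5])                    # link lengths (assumed)

def jacobian(theta):
    c, s = np.cos(np.cumsum(theta)), np.sin(np.cumsum(theta))
    return np.array([[-np.sum(L[j:] * s[j:]) for j in range(3)],
                     [ np.sum(L[j:] * c[j:]) for j in range(3)]])

n, m = 3, 2
W = np.eye(n)
c1 = c2 = 1e-2                    # small capacitances = large network gains
theta = np.array([0.3, 0.4, 0.5])
v, u = np.zeros(n), np.zeros(m)   # v estimates theta_dot, u estimates lambda
dt = 1e-4
rdot = np.array([0.05, -0.02])    # desired end-effector velocity
for k in range(20000):
    J = jacobian(theta)
    v = v + (dt / c1) * (-W @ v - J.T @ u)   # output-neuron dynamics (9)
    u = u + (dt / c2) * (J @ v - rdot)       # hidden-neuron dynamics (10)
    theta = theta + dt * v                   # joints driven by network output
```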
Figure 2. Block diagram of the Lagrangian network for robot kinematic control (Wang, Hu, and Jiang, A Lagrangian network for kinematic control of redundant manipulators, IEEE Trans. Neural Net., © 1999 IEEE).
Before obtaining the global stability results summarized below, the Lagrangian network is written as the following time-varying linear dynamic system:
$$C\dot\xi(t) = A(t)\xi(t) + b(t), \qquad (11)$$
where $C = \mathrm{diag}(C_1, C_2)$, $\xi(t) = [v(t)^T, u(t)^T]^T$, $A(t)$ is the coefficient matrix of (8), and $b(t) = [0^T, -\dot r(t)^T]^T$. Given an initial point $\xi_0$, we say the solution $\xi(t)$ of the system starting from $\xi_0$ is stable if for any positive real number $\delta > 0$ there exists a positive real number $\varepsilon > 0$ such that, for any initial point $\xi(0)$ in the $\varepsilon$-neighborhood of $\xi_0$, the corresponding solution of (11) remains in the $\delta$-neighborhood of $\xi(t)$ for $t \in [0, +\infty)$. We also say that the v state of the solution $\xi(t)$ is asymptotically stable if the v part of the components of the corresponding solution converges to $v(t)$ as $t \to +\infty$. As the system defined in (11) is linear, these types of stability are equivalent to the stability of the zero solution of the corresponding homogeneous system, namely the system without the $b(t)$ term. The Lagrangian network defined in (11) is shown to be globally stable. Furthermore, the v part of the state vector is globally asymptotically stable [19]. The above analytical results show that no pole of the neural system is located in the right half or on the imaginary axis of the complex plane, and only one pole is located at the origin if the rank of $A(t)$ is $n + m - 1$. Therefore, if the redundant manipulator loses at most one degree of freedom
at any time (i.e., rank(A(t)) 2 n m - l), the Lagrangian network with sufficiently large gains (i.e., sufficiently small values of C1 and 6 ' 2 ) has the asymptotic tracking capability. At the limiting state, limt-tco tj(t) = limt+oo G ( t ) = 0,
which satisfies the Lagrange optimality condition (8).
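To make the dynamics above concrete, the following is a minimal numerical sketch of the network (9)-(10) driving a planar 3-link arm. The link lengths, gains, desired path, and integration step are our illustrative assumptions, not values from this chapter.

```python
import numpy as np

# Minimal sketch of the Lagrangian network (9)-(10) for a planar 3-link arm
# (n = 3 joints, m = 2 task coordinates).  All numeric values are assumptions.
L = np.array([1.0, 0.8, 0.6])            # link lengths (illustrative)
W = np.eye(3)                            # weighting matrix of the quadratic cost
C1_inv = C2_inv = 1e3                    # inverse capacitances: large gains

def jacobian(theta):
    """Analytic Jacobian of the arm's end-effector position."""
    c = np.cumsum(theta)
    Jx = -np.array([np.sum(L[i:] * np.sin(c[i:])) for i in range(3)])
    Jy = np.array([np.sum(L[i:] * np.cos(c[i:])) for i in range(3)])
    return np.vstack([Jx, Jy])           # 2 x 3 time-varying J(theta)

def r_dot(t):
    """Desired end-effector velocity: a slow circular motion (assumed)."""
    return 0.2 * np.array([-np.sin(t), np.cos(t)])

dt, T = 1e-4, 3.0
theta = np.array([0.3, 0.4, 0.5])        # initial joint configuration
v = np.zeros(3)                          # output neurons: estimate of theta_dot
u = np.zeros(2)                          # hidden neurons: estimate of lambda

for k in range(int(T / dt)):
    t = k * dt
    J = jacobian(theta)
    v = v + dt * C1_inv * (-W @ v - J.T @ u)   # (9): C1 v' = -W v - J(theta)^T u
    u = u + dt * C2_inv * (J @ v - r_dot(t))   # (10): C2 u' = J(theta) v - r'(t)
    theta = theta + dt * v                     # the arm is driven by the output v

print("tracking residual:", np.linalg.norm(jacobian(theta) @ v - r_dot(T)))
```

With large gains (small C_1, C_2) the network should settle quickly onto the minimum two-norm joint velocity, in line with the asymptotic tracking property stated above.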
3.2 The primal-dual neural network
The minimum two-norm of the joint velocity vector, as discussed in Subsection 3.1, is widely used as an optimization criterion in many robotics applications. The two-norm optimization scheme minimizes the squared sum of joint velocities, which does not necessarily minimize the magnitudes of the individual joint velocities. It is used as the optimization criterion by researchers more because it is mathematically tractable than because it is physically desirable [4]. The minimum infinity-norm of the joint velocities (i.e., ‖θ̇(t)‖_∞), however, minimizes the largest component of the joint velocity vector in magnitude and is consistent with the physical velocity limits. Moreover, the minimization of the infinity-norm of the joint velocity vector enables a more direct monitoring and control of the magnitudes of the individual joint velocities than that of the 2-norm of the joint velocities [9]. It is therefore more desirable in situations where low individual joint velocity is of primary concern, e.g., robotic surgery. In this subsection, we first convert the infinity-norm minimization problem (4) into a linear program which can be solved by the primal-dual neural network. For θ̇ = [θ̇_1, θ̇_2, …, θ̇_n]^T ∈ ℝ^n, its infinity-norm is defined as

‖θ̇‖_∞ = max_{1≤j≤n} |e_j^T θ̇|,
where |·| denotes the absolute value of a component, and e_j ∈ ℝ^n is the jth column of the identity matrix I. Let the objective function ‖θ̇(t)‖_∞ in (4) be s(t); i.e., s(t) = max_{1≤j≤n} |e_j^T θ̇(t)|. The minimum infinity-norm inverse-kinematics problem (4) is then equivalent to

minimize s(t), subject to |e_j^T θ̇(t)| ≤ s(t), j = 1, …, n, and J(θ) θ̇(t) = ṙ(t),
which can be re-formulated as

minimize s(t), subject to −1 s(t) ≤ θ̇(t) ≤ 1 s(t), J(θ) θ̇(t) = ṙ(t),    (12)

where 1 := [1, 1, …, 1]^T and 0 := [0, 0, …, 0]^T denote respectively the one and null vectors with appropriate dimensions. Rewriting (12) in a standard matrix form, we have

minimize c^T y, subject to A_1 y ≥ b_1, A_2 y = b_2,    (13)

where

y := [θ̇(t)^T, s(t)]^T ∈ ℝ^{n+1},  A_1 = [−I, 1; I, 1] ∈ ℝ^{2n×(n+1)},  b_1 = 0 ∈ ℝ^{2n},
A_2 = [J(θ), 0] ∈ ℝ^{m×(n+1)},  b_2 = ṙ(t) ∈ ℝ^m,  c = [0, 0, …, 0, 1]^T ∈ ℝ^{n+1}.
Now, by duality theory [29], the dual linear program corresponding to the linear program (13) is

maximize b_2^T z_2, subject to A_1^T z_1 + A_2^T z_2 = c, z_1 ≥ 0, z_2 unrestricted,    (14)
where z_1 ∈ ℝ^{2n} and z_2 ∈ ℝ^m are the dual decision variables. In view of the primal and dual programs (13) and (14), define the following energy function:
E(y, z_1, z_2) = (c^T y − b_2^T z_2)²/2 + ‖A_2 y − b_2‖²/2 + ‖A_1^T z_1 + A_2^T z_2 − c‖²/2 + (A_1 y)^T (A_1 y − |A_1 y|)/4 + z_1^T (z_1 − |z_1|)/4.    (15)

The first term in (15) is the squared duality gap. The second and third terms are for the equality constraints in (13) and (14), respectively. The fourth and last terms are for the nonnegativity constraints in (13) and (14), respectively. Clearly, the energy function (15) is convex and continuously differentiable. It can be seen that E(y*, z_1*, z_2*) = 0 if and only if (y*, z_1*, z_2*) is the optimal solution of the primal and dual linear programs defined in (13) and (14).
Figure 3. Block diagram of the primal-dual neural network for robot kinematic control (Tang and Wang, A recurrent neural network for minimum infinity-norm kinematic control of redundant manipulators with an improved problem formulation and reduced architectural complexity, IEEE Trans. Syst., Man, Cybern., © 2001 IEEE).
In view of the energy function (15), the dynamics of the neural network solving (13) and (14) can be defined as the negative gradient system. Note that the network state is v := [y^T, z_1^T, z_2^T]^T, and that for any column vector x, x − |x| = 2x⁻, where x⁻ = [x_1⁻, x_2⁻, …]^T and x_i⁻ = min{0, x_i}. The network dynamical equations can thus be expressed as

ẏ = −μ [c (c^T y − b_2^T z_2) + A_2^T (A_2 y − b_2) + A_1^T (A_1 y)⁻],    (16)
ż_1 = −μ [A_1 (A_1^T z_1 + A_2^T z_2 − c) + z_1⁻],    (17)
ż_2 = −μ [−b_2 (c^T y − b_2^T z_2) + A_2 (A_1^T z_1 + A_2^T z_2 − c)],    (18)
where μ ∈ ℝ is a strictly positive parameter used to scale the convergence rate of the network and should be selected as large as possible. Fig. 3 shows the block diagram of the architecture of the primal-dual neural network. In this context, the desired velocity vector of the end-effector ṙ(t) is fed into the neural network, and the neural network generates at the same time the command signal y, which contains the minimum infinity-norm joint velocity vector θ̇. The stability property of the primal-dual neural network is as follows [25]:
the primal-dual neural network defined in (16)-(18) is proved to be globally stable and convergent to the optimal solutions of the linear programs (13) and (14).
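As a concrete illustration of (15)-(18), here is a minimal sketch that integrates the negative-gradient dynamics for one fixed arm configuration. The Jacobian, desired velocity, gain μ, step size, and iteration count are illustrative stand-ins of ours, not values from the chapter.

```python
import numpy as np

# Sketch of the primal-dual network: gradient descent on the energy (15) for
# the minimum-infinity-norm problem at one fixed configuration.  J and r_dot
# are illustrative stand-ins for J(theta) and the desired velocity.
rng = np.random.default_rng(0)
n, m = 4, 2
J = rng.standard_normal((m, n))
r_dot = rng.standard_normal(m)

A1 = np.block([[-np.eye(n), np.ones((n, 1))],
               [ np.eye(n), np.ones((n, 1))]])     # |theta_dot_j| <= s as A1 y >= 0
A2 = np.hstack([J, np.zeros((m, 1))])              # J theta_dot = r_dot
b2 = r_dot
c = np.zeros(n + 1); c[-1] = 1.0                   # objective: the slack s

y, z1, z2 = np.zeros(n + 1), np.zeros(2 * n), np.zeros(m)
mu, dt = 50.0, 1e-4                                # convergence-rate parameter mu

def neg(x):                                        # x^- with components min{0, x_i}
    return np.minimum(x, 0.0)

for _ in range(200_000):
    gap = c @ y - b2 @ z2                          # duality gap c^T y - b2^T z2
    eq_p = A2 @ y - b2                             # primal equality residual
    eq_d = A1.T @ z1 + A2.T @ z2 - c               # dual equality residual
    y  -= dt * mu * (c * gap + A2.T @ eq_p + A1.T @ neg(A1 @ y))   # (16)
    z1 -= dt * mu * (A1 @ eq_d + neg(z1))                          # (17)
    z2 -= dt * mu * (-b2 * gap + A2 @ eq_d)                        # (18)

theta_dot, s = y[:n], y[n]
# At convergence the energy (15) should approach zero, so theta_dot is the
# minimum-infinity-norm joint velocity and s its infinity norm.
print("infinity norm:", np.max(np.abs(theta_dot)), " slack s:", s)
print("constraint residual:", np.linalg.norm(J @ theta_dot - r_dot))
```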
3.3 The dual neural network
Though the Lagrangian neural network is effective for solving equality-constrained quadratic programs, it is known that when solving inequality-constrained quadratic programs the Lagrangian network may exhibit the premature defect, and that the dimensionality of the Lagrangian network will be much larger than that of the original problem due to the introduction of slack and surplus variables [12][19]. The primal-dual neural network presented in Subsection 3.2 handles the primal program and its dual problem simultaneously by minimizing the duality gap with the Kuhn-Tucker conditions. Unfortunately, because of the use of the gradient descent method, the dynamic equations of the primal-dual neural network are usually very complicated and contain high-order nonlinear terms; e.g., (16)-(18). To reduce the scheme complexity and increase the computational efficiency, a dual neural network model has been developed for inverse kinematics control. Though the dual network first proposed in [26] solves equality-constrained programs only, its design method can be generalized to handle quadratic programs under equality and bound constraints [27][28]. A practical problem arising in robotics is joint limit avoidance. Among the previous studies on recurrent-neural-network-based inverse kinematics [18]-[26], it is always assumed implicitly that there exists no joint limit when solving such inverse kinematic problems. However, joint limits are physical constraints of the work space of a robot, and they do exist for all kinds of robots. If a solution exceeds a mechanical joint rotation limit, the desired path becomes impossible to follow and the solution is then inapplicable. Therefore, in this section the feature of joint limit avoidance, as a basic requirement, is incorporated into the redundancy resolution scheme (4); i.e., the following time-varying optimization problem:

minimize (1/2) θ̇(t)^T W θ̇(t), subject to J(θ) θ̇(t) = ṙ(t), θ⁻ ≤ θ(t) ≤ θ⁺.    (19)
The limited joint range [θ⁻, θ⁺] in (19) has to be converted into some kind of dynamically-updated bound constraints on the joint velocity variable θ̇.
The proposed effective conversion scheme is to replace θ⁻ ≤ θ(t) ≤ θ⁺ with

ξ⁻(t) ≤ θ̇(t) ≤ ξ⁺(t),    (20)
where ξ⁻(t) := α(ρθ⁻ − θ(t)) and ξ⁺(t) := α(ρθ⁺ − θ(t)). The coefficient 0 < ρ < 1, usually 0.99, defines the critical areas [θ⁻, ρθ⁻] and [ρθ⁺, θ⁺] that prevent the robot arm from hitting its joint limits. The coefficient α > 0, usually 1 in the ensuing PA10 simulations, whose maximal allowable value is limited by the bounds on the joint velocities or the actuator parameters, determines the deceleration magnitude when the robot enters such critical areas. The repelling mechanism at the joint limits is based on the existence of the critical areas and the dynamical updating of the joint velocity bound constraint. For any i, if the ith joint variable θ_i(t) is within its feasible range (ρθ_i⁻, ρθ_i⁺), then ξ_i⁻ ≤ 0 and ξ_i⁺ ≥ 0, and with an appropriate value of α it follows from (20) that θ̇_i may be a positive or negative solution to the inverse kinematic problem (19), if one exists. But if θ_i(t) arrives at the critical area [ρθ_i⁺, θ_i⁺], then ξ_i⁻ ≤ 0 and ξ_i⁺ ≤ 0; that is, the bound constraint (20) becomes θ̇_i ≤ 0, which implies a deceleration opposite to the original joint movement and drives the ith joint away from its physical limit. The same holds for the other critical area [θ_i⁻, ρθ_i⁻]. In summary, the inverse-kinematic problem for limited-joint-range redundant manipulators is re-formulated as

minimize (1/2) θ̇(t)^T W θ̇(t), subject to J(θ) θ̇(t) = ṙ(t), ξ⁻(t) ≤ θ̇(t) ≤ ξ⁺(t).
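A minimal sketch of the conversion (20) follows, with illustrative joint limits and a sample configuration of ours (the PA10 values are not reproduced here).

```python
import numpy as np

# Sketch of the bound conversion (20): the static joint range [theta-, theta+]
# becomes time-varying velocity bounds [xi-(t), xi+(t)].  Limits are assumed.
rho, alpha = 0.99, 1.0                    # critical-area and deceleration coefficients

def velocity_bounds(theta, theta_minus, theta_plus):
    xi_minus = alpha * (rho * theta_minus - theta)   # xi-(t) = alpha(rho theta- - theta)
    xi_plus  = alpha * (rho * theta_plus  - theta)   # xi+(t) = alpha(rho theta+ - theta)
    return xi_minus, xi_plus

theta_minus = np.array([-2.0, -2.0, -2.0])
theta_plus  = np.array([ 2.0,  2.0,  2.0])
theta       = np.array([ 0.0,  1.99, -1.99])         # joints 1 and 2 in critical areas

lo, hi = velocity_bounds(theta, theta_minus, theta_plus)
for i in range(3):
    print(f"joint {i}: {lo[i]:+.3f} <= theta_dot <= {hi[i]:+.3f}")
# Joint 0 may move either way; joint 1 is forced to decelerate (hi < 0);
# joint 2 is pushed back up (lo > 0), away from its lower limit.
```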
(X_t, S_t, μ_{G_o^t}(X_t), μ_{G_s^t}(S_t))]
where s_k = [s_k^1, …, s_k^n], k = 1, …, t, is the social satisfaction resulting from X_k. Basically, if H_t is encouraging, then the inhabitants may become more demanding, and the subjective fuzzy goals defined through H_t may move up. On the other hand, if H_t is discouraging, then they may move down (cf. Kacprzyk, 1983, 1997a). Very often, however, one can limit the analysis to the reduced trajectory [cf. (13)]. This important aspect, discussed in Section 3, will not be considered here. The social satisfaction at t is now

S_t = s_t^1 ∧ ⋯ ∧ s_t^n,    (22)
where '∧' again reflects a pessimistic, safety-first attitude, and a lack of substitutability.
The social satisfaction S_t is subjected to a subjective fuzzy goal μ_{G_s^t}(S_t), which is meant similarly to its objective counterpart shown in Figure 5. The effectiveness of stage t is meant as a relation of what has been attained (the life quality indices and their respective social satisfactions) to what has been "paid for" (the respective investments), i.e. it is a benefit-cost relationship. Formally, the (fuzzy) effectiveness of stage t is expressed as

μ_{E^t}(u_{t−1}, X_t, S_t) = μ_{C^{t−1}}(u_{t−1}) ∧ μ_{G_o^t}(X_t) ∧ μ_{G_s^t}(S_t),    (23)
and the aggregation reflects the nature of a compromise between the interests of the authorities (for whom the fuzzy constraints and the objective fuzzy goal matter) and those of the inhabitants (for whom the subjective fuzzy goal, and to some extent the objective fuzzy goal, matter); the minimum reflects a safety-first attitude, hence a "more just" compromise. Then, the effectiveness measures of the particular stages t = 1, …, N, μ_{E^t}(u_{t−1}, X_t, S_t) given by (23), are aggregated to yield the fuzzy effectiveness measure for the whole development:

μ_E(H_N) = μ_{E^1}(u_0, X_1, S_1) ∧ ⋯ ∧ μ_{E^N}(u_{N−1}, X_N, S_N).    (24)
The fuzzy decision is

μ_D(u_0, …, u_{N−1} | X_0, B_N) = [μ_{C^0}(u_0) ∧ μ_{G_o^1}(X_1) ∧ μ_{G_s^1}(S_1)] ∧ ⋯ ∧ [μ_{C^{N−1}}(u_{N−1}) ∧ μ_{G_o^N}(X_N) ∧ μ_{G_s^N}(S_N)],    (25)
and it expresses some crucial compromises between, e.g., the fuzzy constraints and the (objective and subjective) fuzzy goals, the interests of the authorities and the inhabitants, etc. The problem is now to find an optimal sequence of controls (investments)
u_0*, …, u_{N−1}* such that (under a given policy B_N; the optimization of the policy is a separate problem which will not be considered here):

μ_D(u_0*, …, u_{N−1}* | X_0, B_N) = max_{u_0,…,u_{N−1}} {[μ_{C^0}(u_0) ∧ μ_{G_o^1}(X_1) ∧ μ_{G_s^1}(S_1)] ∧ ⋯ ∧ [μ_{C^{N−1}}(u_{N−1}) ∧ μ_{G_o^N}(X_N) ∧ μ_{G_s^N}(S_N)]}.    (26)
For illustration we will show a simple example [cf. Kacprzyk (1997a)].

Example 1. The region, predominantly agricultural, has a population of ca. 120,000 inhabitants, and its arable land is ca. 450,000 acres. For simplicity, the region's development will be considered over the next 3 development stages
(years, for simplicity). The life quality index consists of the four life quality indicators:

x_t^I - average subsidies in US$ per acre (per year),
x_t^II - sanitation expenditures (water and sewage) in US$ per capita (per year),
x_t^III - health care expenditures in US$ per capita (per year), and
x_t^IV - expenditures for paved roads (new roads and maintenance of the existing ones) in US$ (per year).
Suppose now that the investments are partitioned into parts devoted to the improvement of the above life quality indicators by the fixed partitioning rule a_{t−1}(u_{t−1}, i): 5% for subsidies, 25% for sanitation, 45% for health care, and 25% for infrastructure. Let the initial values of the life quality indicators, at t = 0, be: x_0^I = 0.5, x_0^II = 15, x_0^III = 27, x_0^IV = 1,700,000.
For clarity, we will only take into account the following two scenarios (policies):

Policy 1: u_0 = $8,000,000, u_1 = $8,000,000, u_2 = $8,000,000;
Policy 2: u_0 = $7,500,000, u_1 = $8,000,000, u_2 = $8,500,000.
Under Policy 1 and Policy 2, the values of the life quality indicators attained are:

Policy 1:
Year t   u_t          x_t^I   x_t^II   x_t^III   x_t^IV
0        $8,000,000   0.5     15       27        $1,700,000
1        $8,000,000   0.88    16.7     30        $2,000,000
2        $8,000,000   0.88    16.7     30        $2,000,000
3        -            0.88    16.7     30        $2,000,000

Policy 2:
Year t   u_t          x_t^I   x_t^II   x_t^III   x_t^IV
0        $7,500,000   0.5     15       27        $1,700,000
1        $8,000,000   0.83    15.6     28.1      $1,875,000
2        $8,500,000   0.88    16.7     30        $2,000,000
3        -            0.94    17.7     31.9      $2,125,000
For the evaluation of the above two development trajectories, for simplicity and readability we will only take into account the effectiveness of development, and the objective evaluation only. The consecutive fuzzy constraints and objective fuzzy subgoals are assumed piecewise linear, i.e. their definition requires two values only (cf. Figure 4 and Figure 5): the aspiration level (i.e., the fully acceptable value) and the lowest (or highest) possible (still acceptable) value, which are:
t = 0:  C^0:        aspiration $7,500,000,     highest acceptable $8,500,000
t = 1:  C^1:        aspiration $7,750,000,     highest acceptable $9,000,000
        G_o^{1,I}:   lowest acceptable 0.6,         aspiration 0.85
        G_o^{1,II}:  lowest acceptable 14,          aspiration 16
        G_o^{1,III}: lowest acceptable 27,          aspiration 29
        G_o^{1,IV}:  lowest acceptable $1,800,000,  aspiration $1,900,000
t = 2:  C^2:        aspiration $8,000,000,     highest acceptable $10,000,000
        G_o^{2,I}:   lowest acceptable 0.7,         aspiration 0.9
        G_o^{2,II}:  lowest acceptable 15,          aspiration 17
        G_o^{2,III}: lowest acceptable 28,          aspiration 30
        G_o^{2,IV}:  lowest acceptable $1,900,000,  aspiration $2,000,000
t = 3:  G_o^{3,I}:   lowest acceptable 0.75,        aspiration 1
        G_o^{3,II}:  lowest acceptable 16,          aspiration 18.5
        G_o^{3,III}: lowest acceptable 29,          aspiration 31
        G_o^{3,IV}:  lowest acceptable $1,950,000,  aspiration $2,100,000
Using the '∧' to reflect a safety-first attitude, which is clearly preferable in the situation considered (a rural region plagued by the aging of the society, out-migration to neighboring urban areas, economic decay, etc.), the evaluation of the two investment policies is:
Policy 1:

μ_D($8,000,000; $8,000,000; $8,000,000 | ·) =
= μ_{C^0}($8,000,000) ∧ (μ_{G_o^{1,I}}(0.88) ∧ μ_{G_o^{1,II}}(16.7) ∧ μ_{G_o^{1,III}}(30) ∧ μ_{G_o^{1,IV}}($2,000,000)) ∧
∧ μ_{C^1}($8,000,000) ∧ (μ_{G_o^{2,I}}(0.88) ∧ μ_{G_o^{2,II}}(16.7) ∧ μ_{G_o^{2,III}}(30) ∧ μ_{G_o^{2,IV}}($2,000,000)) ∧
∧ μ_{C^2}($8,000,000) ∧ (μ_{G_o^{3,I}}(0.88) ∧ μ_{G_o^{3,II}}(16.7) ∧ μ_{G_o^{3,III}}(30) ∧ μ_{G_o^{3,IV}}($2,000,000)) =
= 0.5 ∧ (1 ∧ 1 ∧ 1 ∧ 1) ∧ 0.8 ∧ (0.9 ∧ 0.85 ∧ 1 ∧ 1) ∧ 1 ∧ (0.52 ∧ 0.28 ∧ 0.5 ∧ 0.33) =
= 0.5 ∧ 0.8 ∧ 0.28 = 0.28
Policy 2:

μ_D($7,500,000; $8,000,000; $8,500,000 | ·) =
= μ_{C^0}($7,500,000) ∧ (μ_{G_o^{1,I}}(0.83) ∧ μ_{G_o^{1,II}}(15.6) ∧ μ_{G_o^{1,III}}(28.1) ∧ μ_{G_o^{1,IV}}($1,875,000)) ∧
∧ μ_{C^1}($8,000,000) ∧ (μ_{G_o^{2,I}}(0.88) ∧ μ_{G_o^{2,II}}(16.7) ∧ μ_{G_o^{2,III}}(30) ∧ μ_{G_o^{2,IV}}($2,000,000)) ∧
∧ μ_{C^2}($8,500,000) ∧ (μ_{G_o^{3,I}}(0.94) ∧ μ_{G_o^{3,II}}(17.7) ∧ μ_{G_o^{3,III}}(31.9) ∧ μ_{G_o^{3,IV}}($2,125,000)) =
= 1 ∧ (0.92 ∧ 0.8 ∧ 0.55 ∧ 0.75) ∧ 0.8 ∧ (0.9 ∧ 0.85 ∧ 1 ∧ 1) ∧ 0.75 ∧ (0.76 ∧ 0.68 ∧ 1 ∧ 1) =
= 0.55 ∧ 0.8 ∧ 0.68 = 0.55
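The min-aggregation above is easy to check mechanically. Below is a minimal sketch, assuming the piecewise-linear memberships defined by the value pairs tabulated earlier; all function and variable names are ours, not the chapter's.

```python
import numpy as np

# Reproduces the two policy evaluations: piecewise-linear memberships built
# from the (lowest acceptable, aspiration) pairs, aggregated by min.
def mu_goal(x, lo, asp):          # increasing goal: 0 below lo, 1 above asp
    return float(np.clip((x - lo) / (asp - lo), 0.0, 1.0))

def mu_constraint(u, asp, hi):    # decreasing constraint: 1 below asp, 0 above hi
    return float(np.clip((hi - u) / (hi - asp), 0.0, 1.0))

C = [(7_500_000, 8_500_000), (7_750_000, 9_000_000), (8_000_000, 10_000_000)]
G = [  # (lo, asp) for indicators I..IV at stages t = 1, 2, 3
    [(0.60, 0.85), (14, 16.0), (27, 29), (1_800_000, 1_900_000)],
    [(0.70, 0.90), (15, 17.0), (28, 30), (1_900_000, 2_000_000)],
    [(0.75, 1.00), (16, 18.5), (29, 31), (1_950_000, 2_100_000)],
]

def mu_D(controls, trajectory):   # trajectory[t] holds the stage t+1 indicators
    terms = []
    for t in range(3):
        terms.append(mu_constraint(controls[t], *C[t]))
        terms += [mu_goal(x, *G[t][k]) for k, x in enumerate(trajectory[t])]
    return min(terms)             # the min ('^') aggregation of (25)

policy1 = [8_000_000] * 3
traj1 = [(0.88, 16.7, 30, 2_000_000)] * 3
policy2 = [7_500_000, 8_000_000, 8_500_000]
traj2 = [(0.83, 15.6, 28.1, 1_875_000),
         (0.88, 16.7, 30, 2_000_000),
         (0.94, 17.7, 31.9, 2_125_000)]

print(f"{mu_D(policy1, traj1):.2f}")   # 0.28
print(f"{mu_D(policy2, traj2):.2f}")   # 0.55 -- the second policy wins
```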
The second policy is therefore better.

5 Concluding remarks
We extended the basic Bellman and Zadeh (1970) model of multistage decision making (control) in a fuzzy environment to include both objective and subjective evaluations of how well the fuzzy constraints on the decisions (controls) applied and the fuzzy goals on the states attained are satisfied. To show how perceptions can be reflected by using this model, we presented its application to regional development planning with the quality of life as a main element; its attainment is clearly strongly related to human perception. We hope that the model can convince the reader that computing with words can provide tools for devising perception-related models of real world problems.
Bibliography

Bellman R.E. and L.A. Zadeh (1970) Decision making in a fuzzy environment. Management Sci. 17: 141-164.
Francelin R.A., F.A.C. Gomide and J. Kacprzyk (1995) A class of neural networks for dynamic programming. Proc. of Sixth IFSA World Congress (São Paulo, Brazil), Vol. II, pp. 221-224.
Francelin R.A., J. Kacprzyk and F.A.C. Gomide (2001a) A biologically inspired neural network for dynamic programming. Int. Journal of Neural Systems 11: 561-572.
Francelin R.A., J. Kacprzyk and F.A.C. Gomide (2001b) Neural network based algorithm for dynamic system optimization. Asian Journal of Control 3: 131-142.
Kacprzyk J. (1978) A branch-and-bound algorithm for the multistage control of a nonfuzzy system in a fuzzy environment. Control and Cybernetics 7: 51-64.
Kacprzyk J. (1979) A branch-and-bound algorithm for the multistage control of a fuzzy system in a fuzzy environment. Kybernetes 8: 139-147.
Kacprzyk J. (1983) Multistage Decision Making under Fuzziness. Verlag TÜV Rheinland, Cologne.
Kacprzyk J. (1996) Multistage control under fuzziness using genetic algorithms. Control and Cybernetics 25: 1181-1215.
Kacprzyk J. (1997a) Multistage Fuzzy Control. Wiley, Chichester.
Kacprzyk J. (1997b) A genetic algorithm for the multistage control of a fuzzy system in a fuzzy environment. Mathware and Soft Computing IV: 219-232.
Kacprzyk J. (1998) Multistage control of a stochastic system in a fuzzy environment using a genetic algorithm. International Journal of Intelligent Systems 13: 1011-1023.
Kacprzyk J. and A.O. Esogbue (1996) Fuzzy dynamic programming: main developments and applications. Fuzzy Sets and Systems 81: 31-46.
Kacprzyk J., R.A. Romero and F.A.C. Gomide (1999) Involving objective and subjective aspects in multistage decision making and control under fuzziness: dynamic programming and neural networks. International Journal of Intelligent Systems 14: 79-104.
Kacprzyk J. and A. Straszak (1984) Determination of stable trajectories for integrated regional development using fuzzy decision models. IEEE Trans. on Systems, Man and Cybernetics SMC-14: 310-313.
Zadeh L.A. and J. Kacprzyk (1999) Computing with words in information/intelligent systems. Part 1: Foundations, Part 2: Applications. Physica-Verlag (Springer-Verlag), Heidelberg and New York.
Machine Intelligence for High Level Intelligent Systems
NEURAL NETWORK MODELS FOR VISION

KUNIHIKO FUKUSHIMA
Tokyo University of Technology, Hachioji, Tokyo 196-0982, Japan
E-mail: [email protected]

This paper introduces two neural network models for visual information processing. The first topic is handwritten digit recognition by a "neocognitron" of an improved version. The neocognitron showed a recognition rate of 98.6% for a blind test set consisting of 3000 digits randomly sampled from a large database of handwritten digits (ETL-1), and 100% for the training set. The second topic is a neural network model that has the ability to recognize and repair partly occluded patterns. It is a multi-layered hierarchical neural network, in which visual information is processed by the interaction of bottom-up and top-down signals. When a learned pattern is occluded, the model recognizes it and tries to complete the shape using the learned information. The model does not use a simple template matching method. It can accept even deformed versions of the learned patterns. If the pattern is unfamiliar to the model, the model tries to reconstruct the original shape by extrapolating the contours of the unoccluded part of the pattern.
Keywords: neural networks, vision, neocognitron, handwritten digit recognition.

1 Introduction
This paper introduces two neural network models for visual information processing: (1) handwritten digit recognition by a neocognitron of an improved version, and (2) recognition and completion of occluded patterns. The first model uses only bottom-up signals in a hierarchical network, but the second one has both bottom-up and top-down signal flows.
2 Neocognitron for Handwritten Digit Recognition
The author previously proposed a neural network model, the neocognitron, for robust visual pattern recognition [1]. It was initially proposed as a neural network model of the visual system. It has a hierarchical multilayered architecture similar to the classical hypothesis of Hubel and Wiesel [2,3]. It acquires the ability to recognize visual patterns robustly through learning. The neocognitron consists of layers of S-cells, which resemble simple cells in the primary visual cortex, and layers of C-cells, which resemble complex
cells. These layers of S-cells and C-cells are arranged alternately in a hierarchical manner. S-cells are feature-extracting cells, whose input connections are variable and are modified through learning. C-cells, whose input connections are fixed and unmodified, exhibit an approximate invariance to the position of the stimuli within their receptive fields. The C-cells in the highest stage work as recognition cells, which indicate the result of the pattern recognition. This section discusses an improved version of the neocognitron designed for handwritten digit recognition [4]. To improve the recognition rate, several modifications have been applied, such as the inhibitory surround in the connections from S-cells to C-cells, a contrast-extracting layer between the input and edge-extracting layers, supervised competitive learning at the highest stage, a staggered arrangement of S- and C-cells, and so on. These modifications allow the removal of accessory circuits that were appended to the previous versions, resulting in an improvement of the recognition rate as well as a simplification of the network architecture. We will demonstrate the ability of this network using a large database of handwritten digits (ETL-1).

2.1 Architecture of the Network
Figure 1 shows the architecture of the proposed network, which has 4 stages of S- and C-cell layers. The stimulus pattern is presented to the input layer U_0. The contrast-extracting layer U_G, which corresponds to the retinal ganglion cell layer or the lateral geniculate nucleus, follows U_0. Layer U_G consists of two cell-planes: a cell-plane of cells with concentric on-center receptive fields, and a cell-plane of cells with off-center receptive fields. The former cells extract positive contrast in brightness, whereas the latter extract negative contrast from the images presented to the input layer. The output of U_G is sent to the S-cell layer of the first stage (U_S1). Cells of U_S1 have been trained using supervised learning to extract edge components of various orientations from the input image. The output of layer U_Sl (the S-cell layer of the lth stage) is fed to a C-cell layer U_Cl, where a blurred version of the response of U_Sl is generated. In the conventional neocognitron, the input connections of a C-cell, which converge from a group of S-cells, consisted of only excitatory components with a circular spatial distribution. An inhibitory surround is newly introduced around the excitatory connections. By the effect of this concentric inhibitory surround, an end point of a line usually elicits a larger response from C-cells than a middle part of the line. This endows the C-cells with the characteristics of end-stopped cells, and C-cells behave like hypercomplex cells in the visual cortex.
Figure 1. The architecture of the proposed neocognitron. (Labels in the diagram: contrast-extraction layer, edge-extraction layer.)
Figure 2. The blurred responses produced by two independent features can be separated by the inhibitory surround in the input connections to a C-cell. (a) No inhibitory surround (conventional). (b) Inhibitory surround (new version).
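The effect sketched in Figure 2 can be illustrated in one dimension: blurring with a purely excitatory kernel merges two nearby lines into one lump, while a center-surround (difference-of-Gaussians) kernel leaves a silent zone between them. All widths, gains, and separations below are illustrative assumptions of ours, not parameters of the network.

```python
import numpy as np

# 1-D sketch of the benefit of the inhibitory surround (cf. Fig. 2).
x = np.arange(-15, 16)
gauss = np.exp(-x**2 / (2 * 2.0**2))                   # excitatory-only blur
dog = np.exp(-x**2 / (2 * 1.0**2)) - 0.5 * np.exp(-x**2 / (2 * 3.0**2))

signal = np.zeros(81)
signal[38] = signal[42] = 1.0                          # two lines, 4 cells apart

plain = np.convolve(signal, gauss, mode="same")
rect = np.maximum(np.convolve(signal, dog, mode="same"), 0.0)  # C-cells rectify

print("excitatory blur, line/midpoint:", round(plain[38], 3), round(plain[40], 3))
print("with surround,  line/midpoint:", round(rect[38], 3), round(rect[40], 3))
# The first response peaks at the midpoint (one merged lump); the second is
# zero there, so the next stage can still count two distinct features.
```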
Bend points and end points of lines are important features for pattern recognition. In the networks of previous versions (e.g., [5]), an extra S-cell layer, which was called a bend-extracting layer, was placed after the line-extracting stage. In the proposed network, C-cells, whose input connections have inhibitory surrounds, participate in the extraction of bend points and end points of lines while they are performing the blurring operation. This allows the removal of the accessory layer of bend-extracting cells, resulting in a simplification of the network architecture and an increased recognition rate as well.

The inhibitory surrounds in the connections also have another benefit. The blurring operation by C-cells, which usually is effective for improving robustness against deformation of input patterns, sometimes makes it difficult to detect whether a lump of blurred response is generated by a single feature or by two independent features of the same kind. For example, a single line and a pair of parallel lines of a very narrow separation generate a similar response when they are blurred. The inhibitory surround in the connections to C-cells creates a non-responding zone between the two lumps of blurred responses (Fig. 2(b)). This silent zone makes the S-cells of the next stage easily detect the number of original features even after blurring.

The density of the cells in each cell-plane is reduced between layers U_Sl and U_Cl. A staggered arrangement of S- and C-cells is utilized to reduce a harmful side effect of the thinning-out of cells.

The S-cells of the intermediate stages (U_S2 and U_S3) are self-organized using unsupervised competitive learning. Every time a training pattern is presented, seed cells are determined by a kind of winner-take-all process. Each input connection to a seed cell is increased by an amount proportional to the output of the cell from which the connection leads. Because of the shared connections within each cell-plane, all cells in the cell-plane come to have the same set of input connections as the seed cell. Training of the network is performed from the lower stages to the higher stages: after the training of a lower stage has been completely finished, the training of the succeeding stage begins. The same set of training patterns is used for the training of all stages except U_S1. Incidentally, the method of dual thresholds is used: higher threshold values are used for S-cells in the learning phase than in the recognition phase.

Layer U_S4 at the highest stage is trained through supervised competitive learning. The learning rule resembles the competitive learning used to train U_S2 and U_S3, but the class names of the training patterns are also utilized for the learning. When the network learns varieties of deformed training patterns through competitive learning, more than one cell-plane per class is usually generated in U_S4. Therefore, when each cell-plane first learns a training pattern, the class name of the training pattern is assigned to the cell-plane. Thus, each cell-plane of U_S4 has a label indicating one of the 10 digits. Every time a training pattern is presented, competition occurs among all S-cells in the layer. If the winner of the competition has the same label as the training pattern, the winner becomes the seed cell and learns the training pattern in the same way as the seed cells of the lower stages. If the winner has a wrong label (or if all S-cells are silent), however, a new cell-plane is generated and is given the label of the class name of the training pattern. During the recognition phase, the label of the maximum-output S-cell of U_S4 determines the final result of recognition.
Figure 3. An example of the response of the neocognitron. The input pattern is recognized correctly as '5'. (The panels show, from left to right, the input layer U_0, the contrast-extracting on- and off-center planes, the edge-extracting stage, the higher-order feature layers U_C1-U_C3, and the recognition layer U_C4 with outputs 0-9.)
Competition among S-cells also occurs in the recognition phase, and only one maximum-output S-cell within the whole layer U_S4 can transmit its output to U_C4.
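The supervised competitive learning rule for the top layer can be sketched in a few lines. The toy implementation below is ours: random feature vectors stand in for the responses of the preceding C-cell layer, and the sizes and learning rate are illustrative assumptions.

```python
import numpy as np

# Sketch of supervised competitive learning: winner-take-all with a label
# check; a new labelled cell-plane is created when the winner's label is
# wrong or all cells are silent.
rng = np.random.default_rng(0)

planes, labels = [], []                 # one weight vector per cell-plane

def train(x, label, lr=0.5):
    if planes:
        scores = [w @ x for w in planes]
        k = int(np.argmax(scores))
        if scores[k] > 0 and labels[k] == label:
            planes[k] += lr * x         # the winner becomes the seed cell
            return
    planes.append(x.copy())             # wrong label or all silent:
    labels.append(label)                # a new labelled cell-plane is generated

def recognize(x):
    return labels[int(np.argmax([w @ x for w in planes]))]

# toy "digit" features: two classes around two random prototypes (assumed data)
protos = rng.standard_normal((2, 16))
for _ in range(200):
    c = int(rng.integers(0, 2))
    train(protos[c] + 0.3 * rng.standard_normal(16), c)

test = protos[1] + 0.3 * rng.standard_normal(16)
print("classes per plane:", labels, " -> recognized:", recognize(test))
```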
2.2 Computer Simulation
We tested the behavior of the proposed network using handwritten digits (free writing) randomly sampled from the ETL-1 database. Incidentally, ETL-1 is a database of segmented handwritten characters and is distributed by the Electrotechnical Laboratory, Tsukuba, Japan. The recognition rate varies depending on the number of learning patterns. When we used 3000 patterns (300 patterns for each digit) for the learning, for example, the recognition rate was 98.6% for a blind test sample (3000 patterns), and 100% for the learning set. Figure 3 shows a typical response of the network that has finished the learning using 3000 learning patterns. The responses of layer U_0, layer U_G and the layers of C-cells of all stages are displayed in series from left to right, and the responses of the S-cell layers are omitted from the figure. The rightmost layer, U_C4, is the recognition layer, whose response shows the final result of recognition. It can be seen that the input pattern is recognized correctly as '5'.

…the yield is hN, and the model is described by

dN/dt = r N (1 − N/K) − h N.    (3)

The stationary states are at N = 0 and N = K(1 − h/r). The second solution is asymptotically stable if h < r. The steady yield is maximized by h = r/2 and has the maximum value rK/4 (the so-called maximum sustainable yield, or MSY). The textbook solutions [7,8] for harvesting described above become more complicated if we study a system where the logistic growth is continuous, but the harvest occurs periodically at discrete time intervals. If we use e.g. a system where K = 5, N(0) = 0.01, r = 0.5, and h = r/2, there is a noticeable difference between the results for a continuous harvest and a discretized one. The discretization for a time interval t_p = 1 then looks like a mixture of a continuous logistic model with harvesting and a discrete logistic model with harvesting. Although the forthcoming equations (4), (5) look like (2), the similarity is partially misleading. While equation (2) can be used to calculate the dynamics of one system throughout time, equations (4) and (5) are designed to calculate only one size of a population after a fixed harvesting time interval t_p, when a continuous growth expressed by the first part of the right-hand side of the equations is followed by a sharp harvesting cut controlled by the term with the harvesting rate h. The system can be discretized in two ways: a) the harvest is proportional to the last population size. This looks like a logical extension of (3); however, in (3) the difference between the current population size and its size in the next moment was infinitesimally small, while this is not true for the current equation
N(t_p) = e^{r t_p} N(0) K / (K − N(0) + e^{r t_p} N(0)) − h t_p N(0),    (4)
b) the harvest is proportional to the current population size:

N(t_p) = [e^{r t_p} N(0) K / (K − N(0) + e^{r t_p} N(0))] (1 − h t_p).    (5)
While for a time period approaching zero both results would approach the continuous harvesting curve, for a time period t_p = 1 the differences are quite substantial. Even though in all three cases the basic shapes of the functions describing the development of the population size with time are similar, the final population size for a time approaching infinity differs (even if we disregard the
cyclical change for the discrete-time harvest), see Fig. 1. The approach (a) is more natural for mathematicians. However, since it would be rather illogical to control the amount of harvest by the population size in the last season instead of the current one, only the second case shall be further considered.
Fig. 1. Logistic growth with a proportional harvesting: the difference between a continuous harvesting and a discrete-time harvesting with a time period t_p = 1; all system control variables are the same. (Curves: continuous growth; discrete-time harvesting proportional to the last population; discrete-time harvesting proportional to the current population; continuous harvesting.)
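For readers who want to reproduce Fig. 1 numerically, here is a minimal sketch of the two discretizations (4) and (5), using the closed-form logistic growth between harvests; the iteration count is an arbitrary choice of ours, long enough to reach the steady cycle.

```python
import numpy as np

# Discretized harvesting: continuous logistic growth over a period t_p,
# followed by a harvesting cut per eq. (4) or eq. (5).
K, r, N0, tp = 5.0, 0.5, 0.01, 1.0

def grow(N, tp):
    """Closed-form logistic growth over one period (right-hand side of (2))."""
    return np.exp(r * tp) * N * K / (K - N + np.exp(r * tp) * N)

def step_last(N, h, tp):     # eq. (4): harvest proportional to the last size N(0)
    return grow(N, tp) - h * tp * N

def step_current(N, h, tp):  # eq. (5): harvest proportional to the grown size
    return grow(N, tp) * (1.0 - h * tp)

h = r / 2
Na = Nb = N0
for _ in range(200):         # enough periods to reach the steady cycle
    Na, Nb = step_last(Na, h, tp), step_current(Nb, h, tp)

print("steady state, harvest on last size   :", round(Na, 4))
print("steady state, harvest on current size:", round(Nb, 4))
print("continuous-harvest prediction K(1-h/r):", K * (1 - h / r))
```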
Fig. 2 shows that the steady yield (in a steady state after several hundreds of harvesting periods) depends on the length of the harvesting period, so the theoretical values for a continuous harvesting are not valid in this case. When we solve (5) for N(t_p) = N(0) and express the population size of a steady state, where the growth during a time period equals the harvest, we get N(0) = 0 or

N(0) = K (e^{r t_p}(1 − h t_p) − 1) / (e^{r t_p} − 1).    (6)
Fig. 2. Dependence of the steady yield (amount of harvest/harvesting period) on the time period between harvesting, shown for h = r/2.4 and h = r/1.6. The graphs show that as the time period approaches zero (at t_p = 0.01), the steady yield approaches its theoretical maximum value rK/4 = 0.625 for h = r/2, but this is not true for a greater time period t_p = 1, when a smaller harvest rate constant h gives a better steady yield.
Whether the population will vanish or survive depends on the parameters h and t_p; see Fig. 3.
Fig. 3. The steady state population size N depending on the harvest rate h and the harvesting time period t_p, for K = 5 and r = 0.5. For t_p = 1 the critical harvest rate equals h = 0.393469; for a greater harvest rate the population vanishes.
When we take the formula for the size of the population N(t_p) just before harvest in the time period t_p from the right-hand side of (2) with t → t_p, substitute N(0) by the right term of (6) so that the state before harvesting is calculated for a steady-state cycle with periodically repeated conditions, multiply the result by h t_p to get a yield for a unit of time, differentiate the result with respect to the harvest rate h, equate the result to zero and solve the equation with respect to h, we get the ideal harvest rate h giving the maximum steady yield (h = 0.221199 for K = 5, r = 0.5 and t_p = 1; the other existing solution would have h > 1).
By setting (6) equal to zero and solving it for h, we get the critical value of h above which a total destruction of resources appears (h = 0.393469 for r = 0.5 and t_p = 1).
The dependence of the steady yield on the harvest rate h and the harvesting time period t_p, for K = 5 and r = 0.5, is displayed in Fig. 4.
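The quoted constants are easy to verify numerically. The sketch below evaluates the steady state (6) and the corresponding steady yield on a fine grid of harvest rates; the grid resolution is an arbitrary choice of ours.

```python
import numpy as np

# Verifies the quoted constants for K=5, r=0.5, t_p=1: the ideal harvest
# rate, the maximum steady yield, and the critical rate where (6) hits zero.
K, r, tp = 5.0, 0.5, 1.0
E = np.exp(r * tp)

def N_steady(h):                       # eq. (6): fixed point of (5), 0 if negative
    return max(K * (E * (1 - h * tp) - 1) / (E - 1), 0.0)

def steady_yield(h):
    N0 = N_steady(h)
    N_pre = E * N0 * K / (K - N0 + E * N0)   # size just before the harvest cut
    return h * N_pre                         # harvest per period divided by t_p

hs = np.linspace(1e-4, 0.5, 200_000)
ys = np.array([steady_yield(h) for h in hs])
h_ideal = hs[np.argmax(ys)]
h_crit = (1 - np.exp(-r * tp)) / tp          # root of (6)

print("ideal h   :", round(h_ideal, 6), " max yield:", round(ys.max(), 6))
print("critical h:", round(h_crit, 6))       # expect ~0.221199, 0.621765, 0.393469
```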
Fig. 4. Steady yield depending on the harvest rate h and the harvesting time period t_p, for K = 5 and r = 0.5. For a time period t_p and a harvest rate approaching zero, the maximum value of the steady yield approaches the ideal value rK/4 = 0.625 for a continuous harvesting, but for a time period t_p = 1 the ideal harvest rate is h = 0.221199 with the maximum yield 0.621765; see the second graph.
Fig. 5. The dynamics of the size of the population (biomass) in time for the ideal harvest rate h = 0.221199 and for K = 5, r = 0.5, and t_p = 1, for different initial sizes of the population.
Fig. 5 shows the dynamics of the system in time for the ideal harvest rate h = 0.221199 and for K = 5, r = 0.5, and t_p = 1 for different initial sizes of the population. Although the presented curves are smooth, in fact we are only showing the joined lower points of zig-zag functions like those in Fig. 1.

2.2 Two lumberjacks and the optimal harvest
An optimal harvesting strategy for more lumberjacks can be deduced from the previous section. When the maximum of the harvest rates proposed by the single lumberjacks is accepted, this maximum harvest rate must be equal to the ideal harvest rate for the previously studied case of one lumberjack. Since the harvest is divided among the lumberjacks proportionally to their proposed harvest rates, in order to get an equal division of the harvest, each lumberjack should propose this maximum harvest rate.

Fig. 6 shows that when one lumberjack proposes the harvest rate which would give the maximum total yield for both, it is advantageous for the other lumberjack to cheat and get a higher yield by proposing a higher harvest rate. In consequence the first lumberjack has a much lower yield and the total yield is also smaller. This is the core of the social dilemma in the presented model.
Fig. 6. The first figure shows the steady yield for the first lumberjack depending on the harvest rates of both, for t_p = 1. The second part shows the section of this three-dimensional graph for a constant harvest rate of the second lumberjack h_2 = 0.221199.
The problem can be visualized in Fig. 7, where contours of one function and a vector field, represented by arrows and defining another approach to the problem, are shown. The independent variables for both plots are the harvest rates h_1 and h_2 proposed by two lumberjacks sharing the same resource. The contours show a function defined by the sum of the steady yields of these two lumberjacks, where the maximum yield in the contour diagram can be found at a ridge shaped like a right wing, where the greater of the two harvest rates equals 0.221199. If the lumberjacks should share the total yield evenly, they should both propose this rate (shown in the figure by the first heavy dot on the secondary diagonal). This is the point of convergence of cooperative lumberjacks maximizing the total steady yield and dividing it equally between them, h_1 = h_2 = 0.221199.

The arrows describe the decisions of two selfish lumberjacks, when at every point each lumberjack suggests a change in his harvest rate. This change is proportional to the derivative of his steady yield with respect to his proposed rate, and the change has such a direction as should increase the lumberjack's steady yield, presuming that the other lumberjack's harvest rate will remain unchanged. These changes proposed independently by both lumberjacks would be represented by a horizontal arrow for the first lumberjack and a vertical arrow for the second lumberjack. If we combine these separate decisions of both lumberjacks, they define a vector of the direction of moves of selfish shortsighted agents in the space of their harvest rates. Arrows defined by such vectors represent a gradient plot of the combined moves of couples of selfish independent lumberjacks directed towards a supposed increase of their steady yields. The "ideal" selfish strategy point, to which such selfish agents can converge, is at both harvest rates equal to 0.255038. It is a point (see the second symbol, a square on the secondary diagonal of Fig. 7) where an increase in the proposed harvest rate of one of the agents will cause a greater proportion of his share in the steady harvest yield, but this increase in his share will be exactly matched by the decrease of the total steady yield caused by overharvesting, so that the agent gains nothing. At the point h_1 = h_2 = 0.255038 any greater change of either lumberjack's harvest rate would decrease his steady yield, even though the other lumberjack's harvest rate would not change.
Fig. 7. The axes correspond to the harvest rates of both lumberjacks. The degree of shade defines contours of the sum of the steady yields of the two lumberjacks; the arrows show a gradient combined from the perpendicular gradients of both lumberjacks aiming for a better individual steady yield. Different strategies of the agents lead to four separate points of convergence. (Annotations in the plot: the "ideal" selfish convergence point; the cooperative point; points of convergence for different strategies of two lumberjacks; arrows of the gradient combined from the selfish decisions of both lumberjacks; the contour plot of the sum of the steady yields of both lumberjacks.)
However, if the strategy of each of the agents-lumberjacks is based on the derivative of his steady yield with respect to his harvest rate, so that the final strategy is taken at a fixed point of iterative moves maximizing their individual steady yields according to (9), the ideal selfish harvest rates equal to 0.255038 are achieved by this approach if the starting harvest rates of both lumberjacks are exactly equal to each other, h_1(0) = h_2(0):

h_i(n+1) = h_i(n) + learning_rate · d steady_yield_i(h_1(n), h_2(n)) / d h_i(n),  i = 1, 2.    (9)
The other possibility how to achieve this result is to have the moves of the single lumberjacks calculated not independently, but one after the other, taking the results of the other's current move, like

h_2(n+2) = h_2(n) + learning_rate · d steady_yield_2(h_1(n+1), h_2(n)) / d h_2(n),    (10)
or simply by the alternative approach h_2(n+2) ← h_1(n+1), when the second lumberjack copies the moves of the first lumberjack. It should be emphasized that the iteration counter n is different from either the time t or the time period t_p, since the expressions for steady_yield_1 and steady_yield_2 for both lumberjacks are calculated for a prescribed time period t_p and a time t going to infinity. If the initial harvest rates of both lumberjacks are not the same, and both are taking their decisions at the same time, they arrive at harvest rates oscillating around 0.277178; the "derivative based" selfish strategy point of convergence starting from different harvesting rates is h_1 = h_2 = 0.277178. The last point of convergence in Fig. 7 is the result of competitive agents, which do not compare absolute values of yield. When each agent is trying to get a better steady yield than the other agent-lumberjack, the strategies result in a total exhaustion of resources, h_1 = h_2 = 0.393469.

What is the reason why very similar selfish strategies converge to two different points? The explanation can be found in the shape of the function of the steady yield of the first lumberjack depending on the harvest rates of both lumberjacks. The explanation of the first convergence point of selfish strategies can be visualized in Fig. 8. While at the maximum steady yield for both lumberjacks the first lumberjack still gets more by increasing his harvest rate (arrow pointing up in the left-hand-side diagram), in the second case the yield gained by "cheating" his partner is lowered by a decrease of the total yield (arrow pointing at horizontal level), so that there is no incentive to cheat. Fig. 9 explains the existence of the second convergence point. The first convergence point is placed at the position where the gradient arrows meet perpendicularly. The other convergence point is placed at the position where the gradient arrows meet each other in exactly opposite directions and the slopes of the steady yield function on both sides of the convergence point are the same. While the first point is "ideal", i.e. no selfish agent will leave it, the selfish agents driven by derivatives mostly converge to the second point. The smaller the learning rate, the smaller the jumps across the secondary diagonal of the first plot in Fig. 9 and the closer their oscillation comes to the second convergence point.

When we take the formula for the size of the population N(t_p) just before a harvest in the time period t_p from the right-hand side of (2) with t → t_p, substitute N(0) by the right term of (6) with h → h_1, so that the state before harvesting is calculated for a steady-state cycle with periodically repeated conditions, multiply the result by h_1 t_p to get the total yield for a unit of time, multiply by h_1/(h_1 + h_2) to get the part of the yield going to the first lumberjack, and differentiate the result with respect to the harvest rate h_1, we get the tendency of the first agent-lumberjack to increase its harvest rate h_1.
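The tendency just described can be simulated directly. The sketch below uses a forward finite-difference estimate of each lumberjack's derivative and simultaneous moves as in (9); the learning rate, difference step, and iteration count are our choices, and, as noted above, the exact oscillation depends on the learning rate.

```python
import numpy as np

# Two derivative-driven lumberjacks, eq. (9): the larger proposed rate is
# applied to the resource, and shares are proportional to h_i/(h1+h2).
K, r, tp = 5.0, 0.5, 1.0
E = np.exp(r * tp)

def yield1(h1, h2):
    hmax = max(h1, h2)                                   # the larger rate is applied
    N0 = max(K * (E * (1 - hmax * tp) - 1) / (E - 1), 0.0)
    N_pre = E * N0 * K / (K - N0 + E * N0)               # size just before the cut
    return h1 / (h1 + h2) * hmax * N_pre                 # first lumberjack's share

def grad(hi, hj, eps=1e-7):                              # forward difference in h_i
    return (yield1(hi + eps, hj) - yield1(hi, hj)) / eps

def run(h1, h2, lr=0.01, iters=6000):
    for _ in range(iters):
        h1, h2 = h1 + lr * grad(h1, h2), h2 + lr * grad(h2, h1)  # parallel moves
    return round(h1, 4), round(h2, 4)

print(run(0.20, 0.20))   # equal start: the "ideal" selfish point ~0.255038
print(run(0.10, 0.30))   # unequal start: zig-zag, oscillation maxima near ~0.277178
```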
Fig. 8. Plot explaining the existence of the first convergence point of selfish agents. The first part shows a diagonal section of the graph of the steady yield of the first lumberjack shown in Fig. 6 (passing through the cooperative and "ideal" selfish convergence points). The second part explains, by showing the top of one section of the graph in Fig. 6 at the "ideal" selfish convergence point, why the first selfish lumberjack stops increasing his harvest rate: a horizontal plateau, so there is no incentive to increase the harvest rate.
If it is presumed that the second lumberjack's rate is smaller, his tendency to increase his harvest rate would be similar, only getting the part of the yield going to the second lumberjack would require a multiplication by h_2/(h_1 + h_2), and the result would be differentiated with respect to the harvest rate h_2. Since near the convergence point these tendencies exactly oppose each other, they should be equated with opposite signs. The second point can be derived by solving

d steady_yield_1(h_1, h_2) / d h_1 = − d steady_yield_2(h_1, h_2) / d h_2,    (11a)

where N(t_p) is replaced by the right-hand side of (2) with the time t → t_p and N(0) is then replaced by the right term of (6),
followed by the differentiation and the replacement of h_2 by h_1 (at the convergence point both harvest rates should be equal), from which we get the result

h_1 = (−1 + 4 e^{r t_p} − √(1 + 8 e^{r t_p})) / (4 e^{r t_p} t_p),

which for r = 0.5 and t_p = 1 equals 0.277178.
Fig. 9. The plots show why selfish agents driven by decisions based on the derivatives of their yields with respect to their proposed harvest rates have two convergence points, depending on the starting point and the exact implementation. When these agents are driven by the derivative and they have a smaller harvest rate than the other agent, their derivative is big and nearly constant up to the value of the harvest rate of the other agent. Driven by this derivative, they tend to "jump too far" over their optimum, so that they nearly swap their positions with the symmetric position of the other agent. This jumping back and forth continues to go zig-zag as in the figure, until both agents have the same value of the derivative, where they remain oscillating, each reaching the convergence value every second iteration; for the constants used it is h_1 = h_2 = 0.277178, while the "ideal" selfish convergence point is h_1 = h_2 = 0.255038.
Fig. 7 with results based on the theory has its correspondence in Fig. 10, where the results represent trajectories of the iterative positions of actual couples of lumberjacks driven by derivatives computed for both lumberjacks at the same time. Arrows show a gradient plot of the steady yields of the lumberjacks combined from their separate decisions as in Fig. 7. Two close points are the oscillating points of convergence of selfish lumberjacks maximizing their individual steady yields, when each lumberjack's strategy is based on the derivative of his steady yield with respect to his rate; the oscillating points have maximal horizontal and vertical positions at h_1 = h_2 = 0.277178, while the other position depends on the learning rate. The "ideal" point of convergence of selfish lumberjacks is lower on the secondary diagonal. It results for selfish lumberjacks maximizing their individual steady yields when the starting harvest rates are equal to each other, h_1 = h_2. Any change of either lumberjack's harvest rate would decrease his steady yield at h_1 = h_2 = 0.255038. Fig. 11 shows a trajectory of selfish lumberjacks which decide their moves one after another, taking the results of the other's current move, based on the derivative of the steady yield function. Even though the final result is the same "ideal" point of convergence as for selfish lumberjacks which decide their moves in parallel and independently but start with the same initial harvest rate proposal, the trajectory is quite different. For the independent agents deciding in parallel, the trajectory follows the secondary diagonal, moving only in one direction towards the equilibrium point.
In contrast, the sequentially moving agents may wander near the point of extinction before going back to the equilibrium point.

Fig. 10. Trajectories of couples of selfish lumberjacks, starting at the points at the edges of the figure (the axes are the harvest rates h_1 and h_2 proposed by the first and the second lumberjack). All of the trajectories come to oscillate between the points {0.271969, 0.277178} and {0.277178, 0.271969}, where the lesser coordinate depends on the learning rate multiplying the derivative value. The only exceptions are starting points on the secondary diagonal, which converge to the "ideal" convergence point {0.255038, 0.255038}.

Fig. 11. Trajectory of a couple of selfish lumberjacks deciding their moves sequentially, maximizing their steady yields. The result is better for both than for independent decisions starting at a different position, but the agents may get near to extinction first.
3 Evolutionary algorithm solution of the lumberjacks' dilemma
The theoretical and experimental results from the previous section give us some estimates of what can be expected from the evolutionary algorithm solving the same problem of couples of interacting lumberjacks. However, these results do not predict the outcome of the evolutionary algorithm precisely. It is clear that the harvest rates will be somewhere between the ideal harvest rate for both, h_1 = h_2 = 0.221199, and the total exhaustion of resources corresponding to competitive agents, h_1 = h_2 = 0.393469. Agents in an evolutionary algorithm are competitive, but on the other hand, they are not compared just within one couple of agents, but over all the interactions. The "collaborating" agents can gather fitness in their mutual interactions if there are enough of them, while "selfish" agents will gather relatively more yield in a one-to-one interaction with a "collaborating" agent, but it may still be a small amount.
harvest
0.35 0.34 0.33 0.32 0.31 0.3
(average for species)
0.29 0.28
0.27 0.26 0.25
generations 0
200
400
600
800
1000
Fig. 12. The first diagram shows a series of histograms of numbers of individuals within a certain harvest rate collected from 100 runs throughout generations (the harvest rate was divided into 40 equidistant parts in the interval (0-0.4)). The second diagram shows plots of average values of harvest rates of the 10 species against number of generations. The central line is average of those averages from 100 runs, the lines adjacent above and below corresponds to average*standard deviation and the top(bottom) line correspond to the maximum (minimum) of the averages.
The presented study uses a simple version of an evolutionary algorithm (adapted from Akiyama and Kaneko [1,2], so that the results can be to some extent compared). In the game world of the lumberjacks' dilemma game there are 10 species of lumberjacks and 100 wooded hills. On each hill, two lumberjacks live and a single tree grows. Each of the 200 individuals in the population was randomly selected from one of the 10 species. The individuals are randomly distributed into 100 couples; each couple is placed on "one hill" with a "tree", which starts with a "population level of biomass" N equal to 0.1. A species is characterized by only one value, its proposal for a harvest rate. Random numbers from the interval (0, 0.4) were used to generate the initial species. On each of the hills these couples of lumberjacks conduct the game repeatedly, each agent with its constant level of the proposed harvest rate. The processes are simultaneously ongoing on all the hills for 1000 harvest iterations. This is called one generation of the game.

After a generation the species are evaluated by the average of the last yield of their players from all the hills. Then the 3 species with the lowest average of the yield are removed (become extinct), replaced by 3 new species, mutants of parents randomly chosen from the 7 surviving species. A mutation consists of replacing the harvest rate of a parent by a new harvest rate created as the sum of the parental harvest rate and a random number with a zero mean and a standard deviation equal to 0.01 (this mutation is subject to the condition that the resulting harvest rate is still in the interval (0, 0.4)). Then 200 new individuals are generated, each selected from one of the new generation of species. They are again randomly distributed into 100 couples, each couple is placed on "one hill" competing for its "tree", which again starts with a fresh "population level of biomass" N equal to 0.1, and the process of evaluation by harvesting is repeated. The same procedure is repeated in each generation. After 1000 generations of such species replacements the final result is taken as the average of the proposed harvest rates of the last generation of species.

The resulting harvest rate is h = 0.2755, which is slightly less than the second convergence point h = 0.277178 (the result for "parallel" selfish agents, cf. (9)) but more than the first convergence point h = 0.255038 (the result for sequentially deciding selfish agents, cf. (10)). Fig. 12 shows that the convergence is quite fast. The harvest rates start at a low level, then comes a maximum, and after that the average harvest rate converges to h = 0.2755, while mild excesses occur toward overharvesting.

The results show that a conscious mutual understanding is not necessary for an emergence of cooperation in the presented problem. The results, however, do not remove the tragedy of the commons. If more than a couple of lumberjacks is placed on "one hill", the theoretical solution shows that the equilibrium is closer to overharvesting than the ideal harvest rate for all. This shift of the equilibrium point towards overharvesting is not so great for the theoretical equilibrium, but the experimental results from the evolutionary algorithm get much closer to the extinction point for a great number of lumberjacks "on one hill" competing for the same resource. One must take into account that this is true for competing individuals, but not for a competition of species, where the individuals share their profit and only a bilateral interaction takes place. A future study shall concentrate on an explanation of this dissonance and on a theoretical prediction of the convergence point and its dependence on the parameters of the logistic growth of resources, the sharing mechanism, the selection pressure and the value of the standard deviation caused by mutation in evolutionary algorithms.
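A compact sketch of the algorithm follows. The species, hill, and generation counts are taken from the description above; as a simplification of ours, the 1000 harvest iterations on each hill are replaced by the steady cycle of (6) that they approach.

```python
import numpy as np

# Sketch of the evolutionary algorithm for the lumberjacks' dilemma.
rng = np.random.default_rng(1)
K, r, tp = 5.0, 0.5, 1.0
E = np.exp(r * tp)

def pair_yields(h1, h2):
    """Per-period yields of one couple at the steady cycle (stand-in for the
    last of the 1000 simulated harvest iterations)."""
    hmax = max(h1, h2)                                   # the larger rate is applied
    N0 = max(K * (E * (1 - hmax * tp) - 1) / (E - 1), 0.0)
    N_pre = E * N0 * K / (K - N0 + E * N0)
    total, s = hmax * tp * N_pre, h1 + h2
    return (0.0, 0.0) if s == 0 else (total * h1 / s, total * h2 / s)

species = rng.uniform(0.0, 0.4, 10)                      # one harvest rate per species
for generation in range(1000):
    gains = [[] for _ in range(10)]
    pairs = rng.integers(0, 10, (100, 2))                # 200 individuals, 100 hills
    for a, b in pairs:
        ya, yb = pair_yields(species[a], species[b])
        gains[a].append(ya)
        gains[b].append(yb)
    fitness = np.array([np.mean(g) if g else 0.0 for g in gains])
    order = np.argsort(fitness)
    parents = rng.choice(order[3:], 3)                   # 3 worst species replaced by
    species[order[:3]] = np.clip(                        # mutants of random survivors
        species[parents] + rng.normal(0.0, 0.01, 3), 0.0, 0.4)

print("average evolved harvest rate:", round(species.mean(), 4))  # text reports ~0.2755
```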
4 Conclusions
A new version of a social dilemma problem has been described and analyzed, which allows cooperation to be controlled by a continuous variable instead of discrete YES/NO moves as in most applications of the prisoner's dilemma or in the previous variant of the lumberjacks' dilemma [1,2]. While an ideal continuous harvesting has been thoroughly analyzed before [17,19,21], discontinuous harvesting of continuously growing resources has been studied only recently in a few papers, mostly in stock exchange settings [27] or in predator-prey systems [9]. In these papers mostly the dynamics of such systems were studied, while the main subject of the presented approach is the evolution of cooperation. The evolutionary study of a similar problem with a discrete type of moves resulted in cooperation only for an algorithm exploiting a memory of past moves and an information exchange between the involved agents [1,2].

Theoretical examinations of iterated interactions in a multi-agent system were conducted and compared with the empirical results of both the gradient-driven agents and the evolutionary algorithm model. The results showed that an unconscious development of cooperation is entirely possible, and while the players in the studied simulations will not end up in an ideal cooperation bringing the maximum steady yield for everybody, neither will they end up in a total "cheating war", bringing the yields to the absolute minimum by the greediness of competing players. It is not necessary to resort to a trait or group selection to get cooperation. The couples of lumberjacks in the present version of evolution do not survive as a group past a single generation, as was the case in [4,12,22]. The computational results of our approach show that for a selected type of problems a conscious cooperation is not necessary to avoid the trap of "free riders", which otherwise causes a spread of defective moves, destructive in the final outcome for everybody, that can be avoided only by restraints [14,25] or a group selection.

Acknowledgements

The work was supported by the grants 1/7336/20 and 1/8107/01 of the Scientific Grant Agency of the Slovak Republic.

References

1. Akiyama, E., and Kaneko, K. (2000): Dynamical Systems Game Theory and Dynamics of Games. Physica D 147(3-4), 221-258.
2. Akiyama, E., and Kaneko, K. (2000): Evolution of Cooperation in Social Dilemma - Dynamical Systems Game Approach. In: Bedau, M.A., McCaskill, J.S., Packard, N.H., Rasmussen, S. (eds.): Artificial Life 7. The MIT Press, pp. 186-195.
3. Alonso-Sanz, R., Martin, M. C., and Martin, M. (2001): The Effect of Memory in the Spatial Continuous-Valued Prisoner's Dilemma. International Journal of Bifurcation and Chaos 11, 2061-2083.
4. Avilés, L. (1999): Cooperation and non-linear dynamics: An ecological perspective on the evolution of sociality. Evol. Ecol. Research 1, 459-477.
5. Axelrod, R., and Hamilton, D.H. (1981): The evolution of cooperation. Science 211, 1390-1396.
6. Birk, A. (1999): Evolution of Continuous Degrees of Cooperation in an N-Player Iterated Prisoner's Dilemma, working paper, VUB AI-MEMO 99-6, http://arti.vub.ac.be/~cyrano/PUBLICATIONS/jce_cnpd.ps.gz
7. Bulmer, M. (1994): Theoretical evolutionary ecology. Sinauer Associates, Inc., Sunderland, Mass.
8. Clark, C. W. (1990): Mathematical bioeconomics: The optimal management of renewable resources. John Wiley and Sons, New York.
9. Costa, M.I.S., Kaszkurewicz, E., Bhaya, A., Hsu, L. (2000): Achieving global convergence to an equilibrium population in predator-prey systems by the use of a discontinuous harvesting policy. Ecological Modelling 128, 89-99.
10. Darwen, P.J., and Yao, X. (2001): Why More Choices Cause Less Cooperation in Iterated Prisoner's Dilemma. Congress on Evolutionary Computation (CEC'2001), Seoul, Korea, 27-30 May 2001, pp. 987-994.
11. Fader, P.S., and Hauser, J. R. (1988): Implicit coalitions in a generalized Prisoner's Dilemma. Journal of Conflict Resolution 32(3), 553-582.
12. Fletcher, J.A., and Zwick, M. (2000): N-Player Prisoner's Dilemma in Multiple Groups: A Model of Multilevel Selection. In: Boudreau, E., and Maley, C. (eds.): Proceedings of the Artificial Life VII Workshops. Portland, Oregon.
13. Frean, M.R. (1996): The evolution of degrees of cooperation. Journal of Theoretical Biology 182, 549-559.
14. Hardin, G. (1968): The Tragedy of the Commons. Science 162, 1243-1248.
15. Kvasnička, V., Pospichal, J. (1999): Evolutionary study of interethnic cooperation. Adv. Complex Systems 2(4), 395-421.
16. Kvasnička, V., Pospichal, J. (2000): An Emergence of Coordinated Communication in Populations of Agents. Artificial Life 5, 319-342.
17. LeBel, P. (2001): Optimal Pricing of Biodiverse Natural Resources for Sustainable Economic Growth. Working paper, http://alpha.montclair.edu/~lebelp/plebel.html
18. Lindgren, K., and Johansson, J. (2002): Coevolution of strategies in n-person Prisoner's Dilemma. In: Crutchfield, J., and Schuster, P. (eds.): Evolutionary Dynamics - Exploring the Interplay of Selection, Neutrality, Accident, and Function. Oxford University Press, New York.
19. Lueck, D., and Caputo, M.R. (1999): A Theory of Natural Resource Use under Common Property Rights. The Fondazione Eni Enrico Mattei, working paper, http://www.feem.it/gnee/paplists/papal.html
20. Mar, G., and St. Denis, P. (1994): Chaos in Cooperation: Continuous-Valued Prisoner's Dilemmas in Infinite-Valued Logic. International Journal of Bifurcation and Chaos 4, 943-958.
21. Noailly, J., van den Bergh, J., and Withagen, C. (2001): Evolution of harvesting strategies: replicator and resource dynamics. The 6th annual workshop on economics with heterogeneous interacting agents, Jun 7-9, Maastricht, http://meritbbs.unimaas.nl/WEHIA/Full/noailly.pdf
22. Pepper, J.W., and Smuts, B.B. (2001): Agent-based modeling of multilevel selection: the evolution of feeding restraint as a case study. In: Pitt, W. C. (ed.): Swarmfest 2000, Proceedings of the 4th Annual Swarm User Group Conference. Natural Resources and Environmental Issues, Volume XIII, S. J. and Jessie E. Quinney Natural Resources Research Library, Logan, UT, pp. 57-68.
23. Pospichal, J. (2001): Tragedy of the commons in transportation networks. Proceedings of MENDEL 2001, PC-DIR, Brno, ISBN 80-214-1894-X, pp. 97-102.
24. Sen, S., and Mundhe, M. (2000): Evolving agent societies that avoid social dilemmas. In: GECCO 2000. San Francisco, CA: Morgan Kaufmann Pub., pp. 809-816.
25. Sethi, R., and Somanathan, E. (1996): The Evolution of Social Norms in Common Property Resource Use. American Economic Review 86, 766-788.
26. Verhoeff, T. (1998): The Trader's Dilemma: A Continuous Version of the Prisoner's Dilemma. Computing Science Notes 93/02, Faculty of Mathematics and Computing Science, Eindhoven University of Technology, The Netherlands. January 1993, revised January 1998, http://wwwpa.win.tue.nl/~wstomv/publications/td.pdf
27. Wirl, F. (1995): The cyclical exploitation of renewable resource stocks may be optimal. Journal of Environmental Economics and Management 29, 252-261.
ARTIFICIAL CHEMISTRY, REPLICATORS, AND MOLECULAR DARWINIAN EVOLUTION IN SILICO

VLADIMÍR KVASNIČKA

Department of Mathematics, Slovak Technical University, 812 37 Bratislava, Slovakia
E-mail: [email protected]

A simplified model of Darwinian evolution at the molecular level is studied by applying the methods of artificial chemistry. A chemical reactor (chemostat) is composed of molecules that are represented by strings of tokens, and these strings are autoreplicated with a probability proportional to their fitness. Moreover, the process of autoreplication is not fully correct; sporadic mutations may appear and produce new offspring strings that are slightly different from their parental templates. The dynamics of such an autoreplicating system is described by Eigen's differential equations. These equations have a unique asymptotically stable state, which corresponds to those strings that have the highest rate constants (fitness). A generalized version of a rugged fitness landscape realized by a Kauffman KN function is used for an evaluation of strings. Recently, Newman and Engelhardt have demonstrated that this simple type of fitness surface simulates in fact almost all basic results about molecular Darwinian evolution achieved by Schuster and his associates. Schuster et al. used a physical model of RNA molecules with fitness specified by their ability to be folded into a secondary structure. The presented model with the Kauffman rugged function offers a detailed look at the mechanisms of molecular Darwinian evolution, in particular at the meaning and importance of neutral mutations.

Keywords: Artificial life, artificial chemistry, Kauffman KN function, fitness landscape, molecular Darwinian evolution, neutral mutations, neutral evolution
1 Introduction
Darwinian evolution belongs to the standard subjects of interest of Artificial Life. In particular, the main stimulus was observed at the end of the eighties, when evolutionary algorithms suddenly emerged as a new paradigm of computer science based on the metaphor of Darwinian evolution. This paradigm may be traced back to 1932, when Sewall Wright [26] postulated an adaptive landscape (nowadays called the fitness landscape or fitness surface) and characterized Darwinian evolution as an adaptive process (nowadays we would say an optimization process), where the genotype of a population is adapted in such a way that it reaches a local (or even global) maximum on the fitness surface. Forty years later, this ingenious idea was used by John Holland [13] as a metaphor for the creation of genetic algorithms, which may now be interpreted as an abstraction of Darwinian evolution in the form of a universal optimization algorithm (see Dennett's seminal book [4]). The purpose of this paper is to present a very simple computational model of Darwinian evolution that may reflect some of its most elementary aspects appearing on the biomacromolecular level (e.g. see the experiments of Spiegelman [25] from the end of the sixties). This "molecular" model of Darwinian evolution is capable of offering a
detailed quantitative look at many of its basic concepts and notions; e.g. the role of neutral mutations may be studied as an auxiliary device for overcoming local valleys of the fitness surface in the course of the adaptation process. A very important role in the present study is played by methods of artificial chemistry [1,3,5,6,11,12], which may be considered a subfield of artificial life based on the metaphor of a chemical reactor (in our forthcoming discussions called the chemostat). It is composed of "molecules" that are represented by abstract objects (e.g. strings composed of symbols, trees, formulae constructed within an algebra, etc.), which are transformed stochastically into other feasible objects by "chemical reactions". A probability of these transformations is strictly determined by the structure of the incoming objects; the resulting (outgoing) objects are returned to the chemostat. The kinetics of the processes running in the chemostat is well described by Eigen's replicator differential equations [7,8], which were constructed on the basis of the well-known physico-chemical law of mass action. The main objectives of artificial chemistry are (i) to study formal systems that are based on the chemical metaphor and that are potentially able to perform parallel computation, and (ii) to generate formal autocatalytic chemical systems (molecules are represented by structured objects) for purposes of "in silico" simulations of an emergence of living systems. Recently, Newman and Engelhardt [22] have demonstrated that many aspects of molecular Darwinian evolution in silico may be studied by making use of a fitness surface based on a generalization of the Kauffman KN [15,16,22] rugged function with a tunable degree of neutrality. They demonstrated that almost all basic results obtained by Peter Schuster's Vienna group [10,23,24], based on a realistic model of RNA molecules and their folding, may be simply and immediately obtained by this simple model of fitness surface. We use this simple model of fitness surface in the present paper and we demonstrate that the obtained results are formally closely related or even almost identical to the theoretical results predicted by Eigen's replicator differential equations.
2 Eigen's replicators
Manfred Eigen published at the beginning of the seventies a seminal paper entitled "Self-organization of matter and the evolution of biological macromolecules" [7,8], where he postulated a hypothetical chemical system composed of so-called replicators. This system mimics Darwinian evolution even on an abiotic level. Eigen and Schuster [8] discussed the proposed model as a potentially possible abiotic mechanism of a driving force for an increase of complexity on the border of abiotic and biotic systems. Let us consider biomacromolecules (called the replicators) $X_1, X_2, \ldots, X_n$ that are capable of the following chemical reactions:

$$X_i \xrightarrow{\;k_i\;} 2X_i \quad (i = 1,2,\ldots,n) \tag{1a}$$

$$X_i \xrightarrow{\;\phi\;} \emptyset \quad (i = 1,2,\ldots,n) \tag{1b}$$
The first reaction (1a) means that a molecule is replicated onto itself with a rate constant $k_i$, whereas the second reaction (1b) means that $X_i$ becomes extinct with a rate parameter $\phi$ (this parameter is called the "dilution flux" and will be specified further). Applying the mass-action law of chemical kinetics, we get the following system of differential equations

$$\dot{x}_i = x_i \left( k_i - \phi \right) \quad (i = 1,2,\ldots,n) \tag{2a}$$

The dilution flux $\phi$ is a free parameter and it will be determined in such a way that the following condition is satisfied: the sum of the time derivatives of the concentrations $x$'s is vanishing, $\sum_i \dot{x}_i = 0$; we get

$$\phi = \sum_{j=1}^{n} k_j x_j \tag{2b}$$

where the condition $\sum_i x_i = 1$ is used without a loss of generality of our considerations. Its analytical solution looks as follows

$$x_i(t) = \frac{x_i(0)\, e^{k_i t}}{\sum_{j=1}^{n} x_j(0)\, e^{k_j t}} \tag{3}$$

This solution has an asymptotic property, where only one type of molecules (with the maximal rate constant $k_{max}$) is surviving while the other ones become extinct

$$\lim_{t \to \infty} x_i(t) = \begin{cases} 1 & (\text{for } k_i = k_{max} = \max\{k_1,\ldots,k_n\}) \\ 0 & (\text{otherwise}) \end{cases} \tag{4}$$
Loosely speaking, each type of molecules may be considered as a type of species with a fitness specified by the rate constant k. In a chemostat only those molecules - species "survive" that are best fitted, i.e. that have the highest rate constant $k_{max}$, and all the other molecules with smaller rate constants become extinct, see Fig. 1. The condition of invariability of the sum of concentrations (i.e. $\sum_i x_i = 1$) introduces a "selection pressure" on the replicated molecules; only those molecules will survive that are best fitted with the maximal rate constant. The proposed model may be simply generalized in such a way that mutations are introduced into the process of replication; the system (2) is modified as follows

$$\dot{x}_i = x_i \left( k_{ii} - \phi \right) + \sum_{j \neq i} k_{ji} x_j \quad (i = 1,2,\ldots,n) \tag{5}$$

where $k_{ij}$ is a rate constant assigned to a modified reaction (1a)

$$X_i \xrightarrow{\;k_{ij}\;} X_i + X_j \quad (i,j = 1,2,\ldots,n) \tag{6}$$
Fig. 1. A plot of relative concentrations of a four-component system with rate constants $k_1=1$, $k_2=2$, $k_3=3$, and $k_4=4$. We see that only molecules $X_4$ survive at the end.
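As a sanity check of the asymptotic property (4), a few lines of Python evaluate the closed-form solution (3) for the rate constants of Fig. 1; the equal initial concentrations $x_i(0) = 1/4$ are an assumption, since the caption does not state them.

```python
import math

k = [1.0, 2.0, 3.0, 4.0]                 # rate constants of Fig. 1
x0 = [0.25, 0.25, 0.25, 0.25]            # assumed equal initial concentrations

def x(t):
    """Analytical solution (3): x_i(t) = x_i(0) e^{k_i t} / sum_j x_j(0) e^{k_j t}."""
    w = [xi * math.exp(ki * t) for xi, ki in zip(x0, k)]
    s = sum(w)
    return [wi / s for wi in w]

for t in (0.0, 1.0, 2.0, 5.0, 10.0):
    print(t, [round(v, 4) for v in x(t)])
# the last component tends to 1 while the others vanish: only X4 survives
```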
It is postulated that the rate-constant matrix $K = (k_{ij})$ has dominant diagonal elements, i.e. the nondiagonal elements are much smaller than the diagonal ones. This requirement directly follows from the assumption that imperfect replications (6) are very rare; the product $X_j$ is considered a weak mutation of the autoreplicated $X_i$, $X_j = O_{mut}(X_i)$. The dilution flux $\phi$ from (5) is determined by the condition that the sum of the time derivatives of the concentrations is vanishing, $\sum_i \dot{x}_i = 0$; we get

$$\phi = \sum_{i=1}^{n} \sum_{j=1}^{n} k_{ij} x_i \tag{7}$$

The analytical solution of (5) with the dilution flux specified by (7) is [14]

$$\mathbf{x}(t) = \frac{e^{K t}\, \mathbf{x}(0)}{\left\| e^{K t}\, \mathbf{x}(0) \right\|_{1}}, \qquad e^{K t} = Q\, e^{\Lambda t}\, Q^{-1} \tag{8}$$

where $Q = (q_{ij})$ is a nonsingular matrix that diagonalizes the rate matrix $K$, $Q^{-1} K Q = \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$. Since we have postulated that the nondiagonal elements of $K$ are much smaller than its diagonal elements, its eigenvalues $\lambda$'s are very close to the diagonal elements, $\lambda_i \approx k_{ii}$, and the transformation matrix $Q$ is tightly related to a unit matrix, $q_{ij} \approx \delta_{ij}$ (a Kronecker's delta symbol). It means that an introduction of weak mutations does not change dramatically the general properties of the above simple replicator system without mutations. In particular, the final (for $t \to \infty$) chemostat will be composed almost entirely of molecules with the greatest rate constant $k_{max}$.
These molecules are weakly accompanied by other replicators with rate constants k's slightly smaller than $k_{max}$.

3 Chemical metaphor - chemostat
Let us consider a chemostat (chemical reactor) composed of formal objects called the molecules. It is postulated that the chemostat is not spatially structured (in chemistry it is said that the reactor is well stirred). Molecules are represented by formal structured objects (e.g. token strings, rooted trees, λ-expressions, etc.). An interaction between molecules is potentially able to transform information, which is stored in the composition of the molecules. Therefore a chemical reaction (it causes changes in the internal structure of the reacting molecules) can be considered as an act of information processing. The capacity of the information processing depends on the complexity of the molecules and of the chemical reactions between them. General ideas of the chemostat approach will be explained by an example of the chemostat as a binary function optimizer. Let us consider a binary function

$$f : \{0,1\}^N \to [0,1] \tag{9}$$

This function $f(x)$ maps binary strings $x = (x_1, x_2, \ldots, x_N) \in \{0,1\}^N$ of the length $N$ onto real numbers from the closed interval $[0,1]$. We look for an optimal solution

$$x_{opt} = \arg\max_{x \in \{0,1\}^N} f(x) \tag{10}$$

Since the cardinality of the set $\{0,1\}^N$ of solutions is equal to $2^N$, the CPU time necessary for the solution of the above optimization problem grows exponentially

$$t_{CPU} \approx 2^N \tag{11}$$
It means that the solution of the binary optimization problem (10) belongs to a class of hard numerical NP-complete problems. This is the main reason why the optimization problems (10) are solved by the so-called evolutionary algorithms [9,13], which represent very efficient numerical techniques for solving binary optimization problems. The purpose of this subsection is to demonstrate that the metaphor of replicators provides an efficient stochastic optimization algorithm.

    P := randomly generated chemostat of molecules x;
    epoch := 0;
    for epoch := 1 to epoch_max do
    begin
      select randomly a molecule x;
      if random < f(x) then
      begin
        x' := O_mut(x);
        x' substitutes a randomly selected molecule of P;
      end;
    end;

Algorithm 1. Pseudo-Pascal code of the replicator algorithm.

The chemostat is driven by the replication reaction

$$x \xrightarrow{\;f(x)\;} x + x' \tag{12}$$
where the formed molecule $x'$ substitutes a randomly selected molecule from the chemostat. The term $f(x)$ assigned to the chemical reaction is interpreted as a probability (rate constant) of a performance of reaction (12). In evolutionary algorithms a selection pressure in the population of solutions (chromosomes) is created by a reproduction process based on the chromosome fitness (a measure of the quality of chromosomes). Chromosomes with a greater fitness have a greater chance to take part in the reproduction process; on the other hand, chromosomes with a small fitness are rarely used in the reproduction process. This simple manifestation of Darwin's natural selection ensures a gradual evolution of the whole population. In the present approach the mentioned principle of fitness selection of molecules is preserved, but it is now combined with an additional selection pressure due to the constancy of the number of molecules in the chemostat. A molecule incoming to the reaction is randomly selected from the chemostat. After an evaluation of the quality of the selected molecule it is stochastically decided whether the reaction is performed or not (see Algorithm 1), and moreover, the resulting molecule substitutes another randomly selected molecule. Finally, we specify the product $x'$ from the right-hand side of (12) as a mutation [13] of the incoming molecule $x$

$$x' = O_{mut}(x) \tag{13}$$

where $O_{mut}$ is a stochastic mutation operator that changes single bits with a probability $P_{mut}$. A pseudo-Pascal code for the replicator algorithm is presented in Algorithm 1. As an illustrative example we will study the chemostat approach specified for a simple unimodal function determined over binary strings of the length 6. Let us postulate that a chemostat is formed by a multiset composed of binary strings of the length 6
$$P = \{\ldots, (110011), \ldots\} \subset \{0,1\}^6 \tag{14}$$

Each binary vector $a$ is evaluated by a rational number from the closed interval $[0,1]$

$$real(a) = \frac{1}{2^6 - 1}\, int(a) \tag{15}$$

where $int(a)$ is the nonnegative integer assigned to $a$. A rate constant $k$ assigned to the binary string is specified as follows

$$k(a) = f\left( real(a) \right) = \frac{1}{2}\left( 1 + \sin\left( 2\pi \cdot real(a) \right) \right) \tag{16}$$

with an optimal solution $a_{opt} = (010000)$, where $real(a) = 16/63$ and $f(16/63) = 0.999845$. The chemostat is composed of 1000 randomly generated binary strings and the mutation operator $O_{mut}$ is specified by a 1-bit probability $P_{mut} = 0.01$. The obtained numerical results are displayed in Fig. 2. We see that those binary strings are spontaneously emerging in the chemostat which correspond to the optimal solution with a rational numerical value closely related to $real_{opt} \approx 0.25$. The main results of this Section may be summarized as follows:
1. A metaphor of Eigen's replicators offers an effective stochastic optimization algorithm, where
2. a proof of its convergence to a global solution immediately follows from the existence of a unique asymptotically stable solution with the greatest rate constant.
3. This algorithm is very similar to standard genetic algorithms [13], but it is based on an entirely different metaphor than GA; in particular the metaphor of Darwinian evolution is substituted by a new metaphor of the chemostat of replicators.
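A direct Python transcription of Algorithm 1 for the illustrative function (16) may look as follows; the population size 1000 and $P_{mut} = 0.01$ follow the text, while the number of time steps is shortened here for illustration (the figure uses 5×10^6).

```python
import math, random

N, P_MUT, SIZE = 6, 0.01, 1000

def real(a):                              # eq. (15)
    return int("".join(map(str, a)), 2) / (2 ** N - 1)

def k(a):                                 # eq. (16): fitness / rate constant
    return 0.5 * (1.0 + math.sin(2.0 * math.pi * real(a)))

def mutate(a):                            # O_mut: flip each bit with P_mut
    return [b ^ (random.random() < P_MUT) for b in a]

P = [[random.randint(0, 1) for _ in range(N)] for _ in range(SIZE)]
for step in range(200_000):               # shortened; the figure uses 5x10^6
    x = random.choice(P)                  # select a random molecule
    if random.random() < k(x):            # reaction (12) fires with prob. f(x)
        P[random.randrange(SIZE)] = mutate(x)  # x' replaces a random molecule

best = max(P, key=k)
print("".join(map(str, best)), real(best), k(best))   # typically (010000)
```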
Fig. 2. Plot of frequencies of appearance of all 64 binary strings of the length 6. The chemostat was initiated by 1000 randomly generated binary strings; after 5×10^6 time steps the most dominant solution is the optimal solution $a_{opt} = (010000)$. Another solution $a = (001111)$, $real(a) = 17/63$, is persisting; it is juxtaposed to the optimal solution but lies at a great Hamming distance (in the theory of GA this effect is called the Hamming cliff).
4 Chemostat as a simulator of molecular Darwinian evolution
The Eigen's system of replicators with mutations (i.e. with imperfect replications) presented in the previous Section can be simply used for a description of molecular Darwinian evolution. Let us consider a hypothetical reaction system composed of four replicators $X_1, X_2, X_3$, and $X_4$. They are endowed with the property that an imperfect replication of $X_i$ may produce the juxtaposed replicators $X_{i \pm 1}$, see Fig. 3, diagram A. If the initial concentration of $X_1$ is $x_1(0) = 1$, then in the course of evolution there exist concentration waves that are sequentially assigned to $X_2$, $X_3$, and $X_4$, see Fig. 3, diagram B. This fact may be simply interpreted as a manifestation of molecular Darwinian evolution, where the fitness of single "species" is specified by the diagonal rate constants $k_{ii}$. The evolution process was started by a population composed entirely of $X_1$. Since its replication is imperfect, it may occasionally produce (as specified by the rate constant $k_{12}$) the next replicator $X_2$ with a greater fitness (rate constant $k_{22}$) than its predecessor $X_1$ ($k_{11} < k_{22}$), i.e. this new "species" $X_2$ will survive; the same mechanism then produces and fixes the new "species" $X_3$. This process is finished when the last replicator $X_4$ has appeared, initially as a consequence of an imperfect replication of $X_3$, and then its concentration is increased to 1 by its autoreplication. In order to formalize the above considerations on a semiquantitative level, let us assume that the replicator system at a time $t_0$ is situated in such a transient state where the concentration of $X_i$ is almost unit, whereas the concentrations of the two juxtaposed replicators $X_{i-1}$ and $X_{i+1}$ are negligibly small, i.e. $x_i(t_0) = 1 - 2\delta$, $x_{i-1}(t_0) = x_{i+1}(t_0) = \delta$, and $x_j(t_0) = 0$ for the other remaining concentrations. For $i = 2$ it means that the dilution flux $\phi(t_0)$ (7) is specified by $\phi(t_0) = \delta(k_{11} + k_{12}) + (1 - 2\delta)(k_{22} + k_{21} + k_{23}) + \delta(k_{33} + k_{32} + k_{34})$. Then the differential equation (5) for $i = 2$ looks as follows:
$$\frac{dx_2}{dt} = x_2 \left( k_{22} - \phi \right) + x_1 k_{12} + x_3 k_{32} \tag{17}$$

If we introduce here the above specifications of the concentrations at the time $t_0$ and the assumption that $dx_2(t_0)/dt = 0$, we get $0 = -(k_{21} + k_{23}) + \delta (k_{12} + k_{32})$, or

$$\delta = \frac{k_{21} + k_{23}}{k_{12} + k_{32}} \ll 1 \tag{18}$$
It means that the assumption of a good separability of the concentration waves gives a strong condition for the rate constants touching the replicator $X_2$; in particular, we may say that for a particular replicator the sum of its "outgoing" rate constants must be much smaller than the sum of its "incoming" rate constants. In other words, loosely speaking, the "probability" of creation of a particular replicator from its juxtaposed replicators by their imprecise replications must be much greater than the "probability" of destroying the respective replicator by its own imprecise replication.
Fig. 3. Diagram (A) represents a 4-replicator system, where a replicator $X_i$ produces by imperfect replications the juxtaposed replicators $X_{i-1}$ and $X_{i+1}$. Edges of the diagram are evaluated by rate constants; their numerical values are specified by the matrix $K$ (19), where the diagonal elements are well separated and much greater than the nondiagonal ones. Diagram (B) displays plots of the replicator concentration profiles that form a sequence of concentration waves. This diagram also contains a plot of the mean fitness specified by $\bar{k} = k_{11} x_1 + \cdots + k_{44} x_4$, which forms a typical nondecreasing "stair" function.
The above simple considerations are numerically verified for a simple 4-replicator system with rate constants specified by the following matrix of rate coefficients

$$K = \begin{pmatrix} 0.1 & 10^{-5} & 0 & 0 \\ 10^{-7} & 0.55 & 10^{-5} & 0 \\ 0 & 10^{-7} & 0.8 & 10^{-5} \\ 0 & 0 & 10^{-7} & 1 \end{pmatrix} \tag{19}$$
This matrix satisfies both above postulated conditions: (i) its diagonal matrix elements are much greater than its nondiagonal ones, and (ii) the nondiagonal rate constants satisfy inequalities of the type (18). It means that we may expect a Darwinian behavior of the replicator system. This expectation nicely coincides with our numerical results displayed in Fig. 3, diagram B, where the concentration profiles of single replicators are shown that form a sequence of concentration waves typical for Darwinian evolution. Summarizing the present Section, we may say that Eigen's phenomenological theory of replicators forms a proper theoretical framework for numerical studies of molecular Darwinian evolution (i.e. of biomacromolecules that are capable of a replication process, like RNA or DNA).
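The following Euler integration of eq. (5) with a matrix of the type (19) reproduces the concentration waves of Fig. 3, diagram B. Note that the placement of the $10^{-5}$/$10^{-7}$ off-diagonal elements below is part of the reconstruction of (19), so the matrix is an assumption consistent with the text rather than a verbatim copy.

```python
K = [[0.1,  1e-5, 0.0,  0.0],
     [1e-7, 0.55, 1e-5, 0.0],
     [0.0,  1e-7, 0.8,  1e-5],
     [0.0,  0.0,  1e-7, 1.0]]            # K[i][j]: rate of X_i producing X_j
x, dt = [1.0, 0.0, 0.0, 0.0], 0.01       # evolution starts from pure X1

for step in range(20001):
    # dilution flux (7): keeps the total concentration equal to one
    phi = sum(x[i] * K[i][j] for i in range(4) for j in range(4))
    dx = [x[i] * (K[i][i] - phi)
          + sum(K[j][i] * x[j] for j in range(4) if j != i)   # eq. (5)
          for i in range(4)]
    x = [xi + dt * di for xi, di in zip(x, dx)]
    if step % 2000 == 0:
        mean_fitness = sum(K[i][i] * x[i] for i in range(4))
        print(step * dt, [round(v, 3) for v in x], round(mean_fitness, 3))
# the printouts show the waves X1 -> X2 -> X3 -> X4 and a "stair"-like
# growth of the mean fitness, as in Fig. 3, diagram B
```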
5 Fitness surface specified by generalized Kauffman's KN rugged functions
Let $g$ be a string composed of $N$ integers from $\{0,1,\ldots,p-1\}$

$$g = (g_1 g_2 \ldots g_N) \in G = \{0,1,\ldots,p-1\}^N \tag{20}$$
Each entry index $1 \le i \le N$ is evaluated by a subset composed of $K+1$ randomly selected indices (including $i$) from $\{1,2,\ldots,N\}$ (this subset is called the neighborhood),

$$\Gamma(i) = \{ j_1 < j_2 < \cdots < j_{K+1} \} \subseteq \{1,2,\ldots,N\} \tag{21}$$

Fig. 4. An illustrative example of Kauffman's rugged function specified for $N=6$ and $K=2$, where the subsets $\Gamma$ are specified as follows: $\Gamma(1) = \{1,2,5\}$, $\Gamma(2) = \{1,2,3\}$, $\Gamma(3) = \{2,3,4\}$, $\Gamma(4) = \{2,4,5\}$, $\Gamma(5) = \{1,5,6\}$, $\Gamma(6) = \{3,4,6\}$.
Generalized Kauffman's rugged function maps p-nary vectors of the length $N$ onto positive real numbers from the interval $[0,1]$; this mapping is determined with respect to the subsets $\Gamma(i)$ as follows:

$$f(g) = \frac{1}{N(F-1)} \sum_{i=1}^{N} \varphi\left( g_{j_1}, g_{j_2}, \ldots, g_{j_{K+1}} \right), \qquad \{j_1,\ldots,j_{K+1}\} = \Gamma(i) \tag{22}$$

where $F-1$ is the maximal positive integer that can be assigned to an auxiliary function $\varphi$ randomly specified by

$$\varphi\left( a_{j_1}, a_{j_2}, \ldots, a_{j_{K+1}} \right) = random\left( int\left( a_{j_1}, a_{j_2}, \ldots, a_{j_{K+1}} \right), F \right) \tag{23}$$

where the integer $int\left( a_{j_1}, a_{j_2}, \ldots, a_{j_{K+1}} \right) = \sum_{l=0}^{K} a_{j_{K+1-l}}\, p^{l}$ is used as a RandSeed of a particular random number generator with a uniform distribution of positive integers from $\{0,1,\ldots,F-1\}$.
Fig. 5. Schematic outline of the composite mapping (27). Both respective mappings are of the many-to-one type, i.e. many gene strings are mapped onto one string representing the phenotype, and similarly, many phenotype strings are mapped onto one value of fitness. It means that there exists a huge redundancy of the genotype coding; many different genes (strings) may be evaluated by one value of fitness. This property of the huge redundancy of the genotype coding is of considerable importance for the existence of neutral stases in Darwinian evolutionary theory.
For a better understanding of the above ideas let us consider a Kauffman's function with the sets specified in Fig. 4. For instance, a string $g = (022311)$ (i.e. $p=4$) is evaluated by the generalized Kauffman's rugged function

$$f(g) = \frac{1}{6(F-1)} \left( \varphi(021) + \varphi(022) + \varphi(223) + \varphi(231) + \varphi(011) + \varphi(231) \right)$$
$$= \frac{1}{6(F-1)} \left( random(9,F) + random(10,F) + random(43,F) + random(45,F) + random(5,F) + random(45,F) \right)$$

where, e.g., $random(9,F)$ represents a random number generator initiated by $RandSeed = 9$ that produces a nonnegative integer smaller than $F$. A string $g = (g_1 g_2 \ldots g_N) \in G = \{0,1,\ldots,p-1\}^N$ will be in our forthcoming considerations interpreted as a genotype that specifies the basic "genetic" information coding a hypothetical figment, which is the subject of our simulations of Darwinian evolution. A phenotype $ph(g) = (ph_1, ph_2, \ldots, ph_N) \in Ph = \{0,1,\ldots,F-1\}^N$ of the figment is specified by

$$ph_i = \varphi\left( g_{j_1}, g_{j_2}, \ldots, g_{j_{K+1}} \right), \qquad \{j_1,\ldots,j_{K+1}\} = \Gamma(i) \tag{25}$$
Finally, the fitness of the genotype $g = (g_1 g_2 \ldots g_N)$ is determined by the corresponding phenotype $ph(g) = (ph_1, ph_2, \ldots, ph_N)$ as follows (see eq. (22))

$$fitness(g) = f(g) = \frac{1}{N(F-1)} \sum_{i=1}^{N} ph_i \tag{26}$$

It means that both the genotype and the phenotype are determined by integer vectors of the length $N$; the actual value of the phenotype is specified not only by the subsets $\Gamma$, but also by an actual implementation of the random number generator $random(RandSeed, F)$.
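A minimal Python sketch of the construction (21)-(26) follows. Python's seeded `random.Random` stands in for $random(RandSeed, F)$, which is an implementation choice (any deterministic generator yields an equally valid instance of $\varphi$), and the seed 2003 used to draw the neighborhoods is arbitrary; indices are 0-based.

```python
import random

N, K, p, F = 6, 2, 4, 2
rng = random.Random(2003)
# neighborhoods Gamma(i): K+1 indices including i, cf. eq. (21) and Fig. 4
GAMMA = [sorted(rng.sample([j for j in range(N) if j != i], K) + [i])
         for i in range(N)]

def phi(args):
    """Eq. (23): a seeded uniform integer from {0,...,F-1}."""
    seed = 0
    for a in args:                      # int(a_j1,...,a_jK+1) in base p
        seed = seed * p + a
    return random.Random(seed).randrange(F)

def phenotype(g):                       # eq. (25): ph_i = phi over Gamma(i)
    return [phi([g[j] for j in GAMMA[i]]) for i in range(N)]

def fitness(g):                         # eqs. (22)/(26)
    return sum(phenotype(g)) / (N * (F - 1))

g = (0, 2, 2, 3, 1, 1)                  # the worked example string
print(GAMMA)
print(phenotype(g), fitness(g))
```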
Fig. 6. The fitness landscape (or more precisely, fitness surface) was initially introduced into the theory of Darwinian evolution by Sewall Wright in 1932 [26], who characterized the Darwinian process as an optimization process over the fitness landscape. The resulting population genotype corresponds to a point - the optimal genotype $g_{opt}$ - with a maximal fitness.
An interrelationship between the elements of the triad "genotype-phenotype-fitness" is formally expressed as a sequence of two mappings (see Fig. 5)

$$G \xrightarrow{\;ph\;} Ph \xrightarrow{\;fitness\;} [0,1] \tag{27}$$

It means that the basic entity is the genotype; it is initially mapped onto the phenotype, and then the phenotype is mapped onto the fitness. An abbreviated form of this composite mapping looks as follows

$$G \xrightarrow{\;f \,=\, fitness\,\circ\, ph\;} [0,1] \tag{28}$$
This new mapping immediately maps the genotype strings onto fitness without a necessity to consider explicitly the intermediate called the phenotype. In this connection one may ask why it is worthwhile to introduce the phenotype as a mediator between the genotype and the fitness. Of course, such a question is fully acceptable from a purely mathematical point of view, but it must be noted that the concept of phenotype represents a very effective and fruitful heuristic for the interpretation of Darwinian evolutionary theory. In particular, a given form of phenotype is usually considered as an evolutionary goal, and therefore we may say that Darwinian evolution is represented mainly as a sequence of phenotypes that are progressively closer and closer to the evolutionary phenotype goal. Sewall Wright in 1932 [26] introduced one of the most fundamental concepts of Darwinian evolution called the fitness surface (see Fig. 6). Moreover, by making use of this concept he characterized Darwinian evolution as an optimization process, where the evolved population looks for a global maximum (or another solution closely related to this global one)

$$g_{opt} = \arg\max_{g \in G} f(g) \tag{29}$$
This complex combinatorial optimization problem will be solved in the forthcoming part of this paper by the methods of artificial chemistry based on the metaphor of Eigen's replicators. It will be demonstrated that the "chemostat" method offers an effective formalism for the solution of optimization problems of the form (29), i.e. the "replicator" methods of artificial chemistry are very well suited to mimic molecular Darwinian evolution.
Fig. 7. The genotype graph $\mathcal{G}$ is composed of all possible genes - strings; two genes are connected by an edge if their Hamming distance is equal to one. A neutral (sub)graph $\mathcal{G}_f \subset \mathcal{G}$ is induced by those vertices of $\mathcal{G}$ that are evaluated by the same fitness $f$.
Fig. 8. (A) Histogram that represents the frequency of appearance of randomly generated strings for the parameters $N=40$, $p=4$, $F=2$, and $K=3$. (B) Histogram of the distribution of cluster sizes for the same parameters.
6 General properties of Kauffman's rugged fitness surface

One of the fundamental concepts in the theoretical description of Darwinian evolution at the molecular level are neutral graphs [10,23,24], which are determined as subgraphs of the so-called genotype graph (see Fig. 7)

$$\mathcal{G} = \left( V = \{0,1,\ldots,p-1\}^N, E \right) \tag{30}$$

Its vertices are formed from all strings (genes), and two vertices (strings) are connected by an edge iff their Hamming distance equals one (i.e. one string may be transformed onto another one by a single 1-point mutation).
The genotype graph $\mathcal{G}$ has $|V| = p^N$ vertices and

$$|E| = p^N \cdot N \cdot (p-1)/2 \tag{31}$$

edges. It may be decomposed onto the so-called neutral graphs $\mathcal{G}_f = (V_f, E_f)$, which are induced by the subsets of vertices with the same fitness (see Figure 7)

$$V_f = \{ g \in V;\ fitness(g) = f \} \tag{32}$$
A distribution of fitness in the genotype graph $\mathcal{G}$ for specific values of the parameters $K$, $N$, $p$, and $F$ is shown in Fig. 8. For a sufficiently great parameter $K$ (the number of neighbors) randomly generated quaternary strings ($p=4$) have the most frequent fitness around 0.5. For smaller values of $K$, the maximal frequency of fitness appearance may be substantially displaced from the value 0.5 and is strongly determined by a random specification of the auxiliary function $\varphi$, see eq. (23). This dependence of $\varphi$ on its actual implementation is suppressed by an increase of the parameter $K$. Figure 8 also shows a histogram of the cluster sizes of the neutral graphs $\mathcal{G}_f$, where the term "cluster" denotes a subset of its strings (vertex set) that induces a connected subgraph. It means that a particular neutral graph $\mathcal{G}_f$ may be decomposed onto disconnected subgraphs called the clusters. Vertices within a particular cluster are connected by one or more paths with edges equivalent to 1-point mutations, but there do not exist direct interconnections between different clusters of a neutral graph.
Fig. 9. Numerical experiments with the effect of 1-point mutations on fitness changes of randomly generated strings for the parameters $N=40$, $p=4$, $F=2$, and $K=4$. Diagram (A) shows a histogram of the fitness of strings created by 1-point mutations of randomly generated strings. Diagrams (B-C) correspond to two different cases, when the fitness of the mutated strings was fixed, and the histograms visualize the appearance of strings created by 1-point mutations. From diagrams (B-C) we may see that an increase of the fitness of the initial strings causes an asymmetry in the distribution of the fitness of the created strings, which is shifted to smaller values of fitness.
The effect of a 1-point mutation on the change of string fitness is of great importance for a better understanding of the detailed "microscopic" mechanism of the present model of molecular Darwinian evolution. In particular, it is necessary to describe the possibilities of 1-mutation "jumps" from one neutral graph $\mathcal{G}_f$ to another neutral graph $\mathcal{G}_{f'}$ with a greater fitness than its original counterpart (i.e. $f' > f$). Our numerical experiments with the effect of 1-point mutations on fitness changes are summarized in Figure 9. The most important results are presented in diagrams (B-C). These diagrams correspond to histograms of the fitness of mutated strings created from randomly generated strings with a prescribed value of fitness. It is possible to see that the distribution of fitness for higher values of the fitness of the input strings is asymmetric and shifted to the left-hand side (smaller values) with respect to the fixed fitness of the input strings. For instance, cf. diagram (C), if the fitness of the input strings is relatively high, $f \approx 0.8$, then almost all strings produced by a 1-point mutation have a fitness smaller than their original "template"; only a small fraction of the produced strings has a greater fitness.
Fig. 10. In order to generalize the above observation from our numerical calculations, we introduce a concept of transition probabilities which quantitatively characterize 1-point mutational "jumps" from one neutral graph to another. Neutral graphs $\mathcal{G}_f$ are lined up horizontally with increasing fitness. It means that a string belonging to a neutral graph $\mathcal{G}_f$ may be transformed by a 1-point mutation to another string belonging to $\mathcal{G}_{f'}$, where $f' \neq f$, only if both respective neutral graphs are connected by an edge.
Fig. 11. Set-theoretic visualization of the 1-point-mutation transition probability from a neutral subgraph $\mathcal{G}_f$ to another neutral subgraph $\mathcal{G}_{f'}$. Since the probabilities are not symmetric, $Pr(\mathcal{G}_f \to \mathcal{G}_{f'}) \neq Pr(\mathcal{G}_{f'} \to \mathcal{G}_f)$, an edge in the condensed genotype graph should be substituted by two edges oriented in opposite ways, each of them evaluated by the respective probability.
In a schematic condensed representation of genotype graphs as graphs with vertices assigned to single neutral graphs (e.g. see Figure 10), each edge may be evaluated by a probability of transition from one neutral graph to the other. Single edges in the condensed genotype graph may be evaluated by probabilities $Pr(\mathcal{G}_f \to \mathcal{G}_{f'})$ of a stochastic transition from one neutral graph $\mathcal{G}_f$ to another neutral graph $\mathcal{G}_{f'}$ (see Figure 11) [23]

$$Pr\left( \mathcal{G}_f \to \mathcal{G}_{f'} \right) = \frac{\left| V_{f'} \cap \partial V_f \right|}{\left| \partial V_f \right|} \tag{33}$$

where $\partial V_f$ represents the neighborhood of the neutral subset $V_f$ constructed by 1-point mutations. The size of this probability depends on the ratio of the set cardinalities $|V_{f'} \cap \partial V_f|$ and $|\partial V_f|$; since $|V_{f'} \cap \partial V_f| \le |\partial V_f|$, we immediately get $0 \le Pr(\mathcal{G}_f \to \mathcal{G}_{f'}) \le 1$. If $|V_{f'} \cap \partial V_f| \ll |\partial V_f|$, then the probability is very small, and loosely speaking, we may say that a transition from the neutral graph $\mathcal{G}_f$ to the neutral graph $\mathcal{G}_{f'}$ is a very rare event. Diagrams (B-C) in Figure 9 provide these transition probabilities in such a way that they are proportional to the lengths of the respective bars. For instance, from diagram C we may immediately deduce that the probability $Pr(\mathcal{G}_{0.8} \to \mathcal{G}_{0.9})$ is an extremely small number, i.e. such a transition may be classified as a rare event.
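For a small instance such as that of Fig. 4 ($N=6$, $p=4$) the genotype graph has only $4^6 = 4096$ vertices, so the transition probabilities of eq. (33) can be measured exactly rather than sampled. The sketch below is self-contained but repeats the $\varphi$ construction of the earlier listing; the last column printed is the exact ratio of eq. (33) for every reachable fitness level $f'$.

```python
from collections import Counter
from itertools import product
import random

N, K, p, F = 6, 2, 4, 2
rng = random.Random(2003)
GAMMA = [sorted(rng.sample([j for j in range(N) if j != i], K) + [i])
         for i in range(N)]

def fitness(g):
    total = 0
    for i in range(N):
        seed = 0
        for j in GAMMA[i]:
            seed = seed * p + g[j]       # int(...) of eq. (23) in base p
        total += random.Random(seed).randrange(F)
    return total / (N * (F - 1))

def neighbours(g):                       # all strings at Hamming distance 1
    for i in range(N):
        for v in range(p):
            if v != g[i]:
                yield g[:i] + (v,) + g[i + 1:]

V = {}                                   # neutral sets V_f, eq. (32)
for g in product(range(p), repeat=N):
    V.setdefault(fitness(g), set()).add(g)

for f in sorted(V):
    boundary = {h for g in V[f] for h in neighbours(g)} - V[f]   # dV_f
    hits = Counter(fitness(h) for h in boundary)
    for f2 in sorted(hits):              # eq. (33) for every reachable f'
        print(f"Pr(G_{f:.2f} -> G_{f2:.2f}) = {hits[f2] / len(boundary):.3f}")
```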
Fig. 12. Hill-climbing optimization of single randomly generated strings with the parameters $N=40$, $p=4$, $F=2$, and $K=3$. Diagram (A) shows a histogram of the fitness of strings that resulted from the hill-climbing process initiated by randomly generated strings. Diagram (B) represents the mean fitness of the actual solutions of the hill-climbing optimizations. We have to emphasize that both diagrams (A) and (B) were created as a result of $10^4$ hill-climbing optimizations. Finally, diagram (C) shows an illustrative (randomly selected) example of the hill-climbing optimization; we see that the fitness plot forms a nondecreasing "stair" function, which is characteristic for descriptions of molecular Darwinian evolution [26].
Finally, randomly generated strings were studied from the standpoint of their ability to be optimized by a simple hill-climbing procedure (see Fig. 12). The hill-climbing procedure consists of a repetition of the following local optimization step: All possible strings that can be formed by 1-point mutations of the actual string are generated. In this "neighborhood" we look for a better solution (with a greater fitness than the fitness of the actual solution). If such a solution exists, then we substitute the actual solution by this best solution. In the opposite case, when such a better solution does not exist, we take randomly another "neutral" solution from the created neighborhood. This recurrent process is repeated a prescribed number of times (in our calculations 100 times). Figure 13 shows a typical situation frequent in the hill-climbing optimization of a single solution. In particular, the initial solution may appear at the initial stage of the optimization process within such a cluster of the neutral graph $\mathcal{G}_f$ that does not contain any solution able to "jump" to a better solution in a forthcoming neutral graph, i.e. the hill-climbing method is stopped within this cluster. Numerical results of our hill-climbing optimizations are presented in Figure 12. Diagram (A) shows a histogram of the resulting
solutions, when the hill-climbing optimization was initialized by randomly generated strings. We see that the most frequent solutions are situated in the interval 0.8 - 0.9. The fact that hill-climbing solutions most frequently do not reach the maximal fitness 1.0 is simply explained by our above observation that there exist "traps" in solution clusters without a possibility to "jump" to another neutral graph with a higher fitness. The mean fitness calculated separately for each hill-climbing step is displayed in diagram (B); the plot of the mean fitness is monotonously increasing and approaches a saturation value slightly smaller than 0.9. Diagram (C) shows a typical plot of fitness for a randomly selected single hill-climbing optimization. A very interesting feature of this plot is the presence of relatively long neutral stases that correspond to series of neutral mutations within single clusters of a neutral graph (see Figure 13). In the course of these stases the hill-climbing optimization looks for a chance to escape from the respective neutral graph by a "jump" to a forthcoming neutral graph with higher fitness. For this search a "trial and error" seeking of neutral mutations is used. Theoretically, if there exists a cluster of an enormous size, then the stasis time is considerably greater than the other stasis periods. Summarizing, we may conclude that the simple hill-climbing optimization procedure offers results that essentially mimic molecular Darwinian evolution, but this procedure could not produce globally maximal solutions as a consequence of the existence of the above-mentioned neutral-cluster traps.
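The hill-climbing procedure with neutral moves just described translates directly into Python; the sketch below reuses `fitness`, `neighbours`, `N`, and `p` from the preceding listing, and the 100-step budget follows the text.

```python
import random

def hill_climb(g, steps=100):
    trace = [fitness(g)]
    for _ in range(steps):
        nbrs = list(neighbours(g))
        best = max(nbrs, key=fitness)
        if fitness(best) > fitness(g):
            g = best                      # strict 1-point improvement
        else:
            neutral = [h for h in nbrs if fitness(h) == fitness(g)]
            if neutral:                   # otherwise: a neutral-cluster trap
                g = random.choice(neutral)
        trace.append(fitness(g))
    return g, trace

g0 = tuple(random.randrange(p) for _ in range(N))
g_end, trace = hill_climb(g0)
print(trace)    # a nondecreasing "stair" sequence, cf. Fig. 12, diagram C
```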
Fig. 13. Schematic outline of two neutral stases between "jumps" from the neutral graph $\mathcal{G}_f$ into another neutral graph $\mathcal{G}_{f'}$. At the first neutral stage, hill-climbing forms a sequence of actual solutions within a cluster; the first solution in this sequence corresponds to a result of the previous "jump", and the last solution is able to "jump" to the forthcoming neutral graph $\mathcal{G}_{f'}$. Then the neutral stasis is repeated until a proper solution is found (within the respective cluster in $\mathcal{G}_{f'}$) which is able to perform the next "jump" to a further neutral graph. The lower left part of the diagram contains a cluster of strings (within the neutral graph $\mathcal{G}_f$) that is unable to perform a "jump" to a neutral graph with a fitness higher than $f$. This cluster represents a trap for the hill-climbing method; the method cannot escape the cluster.
7 Molecular Darwinian evolution

The previous section demonstrated that a simple hill-climbing optimization procedure applied to a single string offers results that are very similar to the expected
results of Darwinian evolution, see Fig. 12. This oversimplified approach to a study of Darwinian evolution will now be substituted by a more general look. Let a chemostat be composed of p-nary strings of the length $N$

$$P = \{ \ldots, (001 \ldots p\!-\!1), \ldots \} \subset \{0,1,\ldots,p-1\}^N \tag{34}$$

The chemostat will be characterized by two parameters, the population homogeneity and the population entropy; their application allows us to give a more detailed interpretation of Darwinian evolution. Let the i-th string of the population $P$ be denoted $g_i = \left( g_1^{(i)} g_2^{(i)} \ldots g_N^{(i)} \right)$, where $0 \le g_j^{(i)} \le p-1$.
Probabilities of appearance of an entry $0 \le g \le p-1$ at the i-th position over the whole population are specified by the following p-dimensional vector

$$m_i = \left( m_0^{(i)}, m_1^{(i)}, \ldots, m_{p-1}^{(i)} \right), \qquad m_g^{(i)} = \frac{1}{|P|} \sum_{k=1}^{|P|} \delta\!\left( g, g_i^{(k)} \right) \tag{35}$$

where Kronecker's delta is $\delta(i,g) = 1$ (for $i = g$) and $\delta(i,g) = 0$ (for $i \neq g$). If all the entries in $m_i$ are equal to $1/p$, then the i-th values in all population strings are fully random. The homogeneity of the i-th entry in all population strings may be specified by a coefficient $\chi_i$ (36); if $\chi_i = 0$ ($\chi_i = 1$), then the i-th entries in the whole population are fully random (unambiguous). Finally, a population homogeneity coefficient is determined as an arithmetic mean of all the particular homogeneity coefficients

$$\chi = \frac{1}{N} \sum_{i=1}^{N} \chi_i \tag{37}$$
A fitness entropy of the population is determined by the probability distribution $0 \le w_f \le 1$ (where $\sum_f w_f = 1$) of the fitness values over the population by

$$\sigma(P) = -\sum_{f} w_f \ln w_f \tag{38}$$

If all strings in the population $P$ have the same fitness $f$, then the fitness entropy is $\sigma(P) = 0$; in the opposite case, if the population is composed of strings of different fitness, then the fitness entropy is $\sigma(P) > 0$ (a positiveness of the entropy corresponds to a measure of the population heterogeneity with respect to fitness). In a similar way as in Section 3, we consider in the chemostat a simple replication reaction accompanied by low-rate mutations

$$g \xrightarrow{\;f(g)\;} g + g' \tag{39}$$
where $g' = O_{mut}(g)$ is a stochastic mutation of $g$ that changes its single entries - "bits" - with a probability $P_{mut}$. The algorithm of the chemostat is the same as the one specified by Algorithm 1 (see Section 3). The Kauffman's rugged function was specified by $N=40$, $p=4$, $F=2$, and $K=3$, whereas the chemostat algorithm used a population of the size 1000 and $P_{mut} = 10^{-6}$. The initial chemostat population was composed entirely of the same randomly generated string. Numerical results are displayed in Fig. 14. We see that the plot of the mean fitness manifests the typical nondecreasing "stair" function of the same type as was obtained for the simple hill-climbing procedure applied to a single randomly generated string (see Fig. 12). The plot of the population entropy in Fig. 14 has remarkable peaks that accompany the sudden increases of fitness; after such a "jump", during a neutral stasis, the population entropy is negligibly small, influenced only by stochastic fluctuations. The population homogeneity factor during the whole evolution was slightly less than or equal to one. From this observation we may deduce that in the course of the whole evolution the population manifests a very high order of homogeneity. On the particular plot in Fig. 14 a relatively substantial stochastic fluctuation of the population homogeneity is observable, which may be interpreted as a manifestation of neutral mutations in the course of long-term neutral stases. Loosely speaking, as was already mentioned in Section 6 (see Fig. 13), we may say that these neutral stases are necessary for the preparation of a population for a spontaneous emergence of a mutational "jump" to another neutral graph with a higher fitness; moreover, the lengths of these neutral stases are inversely proportional to the probabilities $Pr(\mathcal{G}_f \to \mathcal{G}_{f'})$ introduced in Section
6. If the respective probability is very small or almost negligible, then the time duration of a neutral stasis may be extraordinarily long. Our above-mentioned conclusion, that the population homogeneity coefficient in the course of the whole evolution is slightly less than or equal to one, immediately implies that the whole evolutionary process may be simply interpreted as a hill-climbing optimization of a single p-nary string, and the plot of its fitness may be identified with the plot of the mean population fitness (cf. Fig. 12, diagram C, with Figure 14). Why did Nature not use this much simpler approach to the optimization problem (29); why did Nature use the much more complicated evolutionary technique based on the concept of a population? An answer to this question is very simple: as follows from our simulations (see Fig. 12), simple hill-climbing optimizations most frequently finish in local maxima and only scarcely produce solutions with a fitness equal to or slightly below one. It means that a standard Darwinian evolutionary approach with a large-size population represents a very
robust optimization technique over a rugged fitness surface, which is frequently able to provide a final solution of high quality that is closely related (or even equal) to a globally optimal solution.
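The two population diagnostics used above can be computed as in the sketch below. The exact form of the per-position homogeneity (36) was lost in reproduction, so the normalized maximum-frequency coefficient used here - which has the stated endpoints, 0 for fully random entries and 1 for unambiguous ones - is one plausible stand-in rather than the author's exact definition; the entropy follows the Shannon form of eq. (38).

```python
import math
from collections import Counter

def homogeneity(P, p):
    """Eq. (37): mean of per-position homogeneities; the per-position
    coefficient is an assumed stand-in for the lost eq. (36)."""
    N = len(P[0])
    chi = []
    for i in range(N):
        m = Counter(g[i] for g in P)              # frequencies, eq. (35)
        chi.append((p * max(m.values()) / len(P) - 1) / (p - 1))
    return sum(chi) / N

def fitness_entropy(P, fitness):
    """Eq. (38): Shannon entropy of the fitness distribution w_f."""
    w = Counter(fitness(g) for g in P)
    n = len(P)
    return -sum((c / n) * math.log(c / n) for c in w.values())
```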
Fig. 14. Numerical results of the chemostat simulation of molecular Darwinian evolution. The lower plot corresponds to the population fitness entropy, the middle plot corresponds to the mean population fitness, and the top plot corresponds to the population homogeneity. At the first stage of evolution there exists a very fast increase of fitness without a neutral stasis (stage A), followed by several long-term neutral stases (stages B-F); the last two stages G and H represent genuine long-term neutral stases.
8 Conclusions

The main purpose of this paper was to study a simple model of molecular evolution recently suggested by Newman and Engelhardt [22]. These authors elaborated a model of Darwinian molecular evolution based on strings of integers that are evaluated by a fitness based on a generalized Kauffman's rugged function [15,16,22]. Their principal conclusion was the observation that the model is able to reproduce all the basic results obtained by the Vienna group [10,23,24] with a genuine physico-chemical model of RNA sequences and their folding to secondary structures. Our results and observations may be summarized as follows:
1. Eigen's theory of replicators (see Section 2) forms a sound phenomenological foundation of molecular Darwinian evolution. Plots of its differential-equation solutions, specified by a proper selection of the rate constants, offer graphs very similar to those postulated hypothetically for Darwinian evolution (see Fig. 3).
2. Darwinian evolution is an interplay between Monodian chance and necessity, between deterministic and stochastic processes that are its integral parts. Darwinian evolution is composed of parts that are fully deterministic and predictable (e.g. the genotype-phenotype mapping), but its integral parts are also processes of strongly stochastic character that cannot be well predicted; we may speak only about their basic statistical characteristics (e.g. mutations).
3. Sewall Wright's idea of the fitness surface (adaptive landscape) is of great heuristic importance and may be considered one of the greatest achievements of the theory of Darwinian evolution. Following this concept, Darwinian evolution may be interpreted as a form of an evolutionary algorithm [9,13] for the solution of complicated (obviously NP-complete) combinatorial optimization problems.
4. The used model provides a simple mechanism for an explanation of neutral mutations and their importance for Darwinian evolution (see Fig. 14). The existence of neutral mutations on the fitness surface is of great importance for overcoming evolutionary traps of local maxima. Darwinian evolution is divided into long-term neutral stases, where the mean fitness of the population remains unchanged, but the composition of the population strings slowly and stochastically wanders towards strings with a possibility of 1-point mutational jumps to strings with greater fitness.
5. The time orientation of Darwinian evolution is unambiguous; it is manifested, for example, by the existence of a nondecreasing population mean-fitness plot. Fitter strings reproduce more frequently than strings with smaller fitness; in the course of evolution fitter strings have an evolutionary advantage with respect to other strings. In other words, we may say that Darwinian evolution is a progressive change of the mean population genotype such that the corresponding mean fitness is nondecreasing during the whole evolution.
6. Two different time scales may be distinguished in Darwinian evolution, namely adaptive stages and neutral stages. An adaptive stage corresponds to a sudden change of mean fitness (see Fig. 14), where two different phenotypes coexist; the old phenotype has a smaller fitness than the new phenotype. Since the probability of replication of strings is proportional to their fitness, strings corresponding to the new phenotype have a greater chance to be reproduced than the old ones. New strings with a greater fitness win in the course of several evolutionary steps; consequently, an adaptive stage, when one string - species is substituted by another string - species, seems to an external observer an extremely short evolutionary stage. On the other hand, a neutral stage of Darwinian evolution consists of a long-term stasis, where an appearance of neutral mutations stochastically prepares a sudden emergence of the next adaptive stage.
7. What are the limits of the present model? The used forms of the genotype and its mapping onto the phenotype are extremely simple; they do not reflect more complex organisms other than viruses, bacteriophages, and some prokaryotic bacteria. For a complex organism the used model of the genotype (and its mapping onto the phenotype) must be much more complex; it should take into account such concepts as a variable length, a hierarchical structure, mutations other than 1-point ones, etc. The most important problem of the recent theory of Darwinian
evolution is the search for a theoretical model explaining an emergence of the so-called "irreducible complexities" [2]. The present model is unable to explain this problem even on an elementary level. Recent efforts in Artificial Life are concentrated on areas that might be of importance for the elaboration of a more general theory of Darwinian evolution than the one presented here. In particular, modular aspects of the genotype are recently very intensively studied [18,19] and the problem of symbiosis is modelled [17], where both problems require much more complex genotypes than the ones studied here (linear strings of symbols with constant lengths).

Acknowledgements
I thank Jiří Pospíchal for a critical reading of the manuscript and removing many misprints and incorrect formulations. The work was supported by the grants 1/7336/20 and 1/8107/01 of the Scientific Grant Agency of the Slovak Republic.

References

1. Banzhaf W., Dittrich P., Eller B. (1999). Topological Interactions in a Binary String System. Physica D 125, 85.
2. Behe, M. J. (1996). Darwin's black box. Simon & Schuster, New York.
3. Banâtre J.-P., Le Métayer D. (1990). The Gamma Model and its Discipline of Programming. Sci. Comp. Progr. 15, 55.
4. Dennett D. C. (1995). Darwin's Dangerous Idea - Evolution and the Meaning of Life. Penguin Press, London.
5. Berry G., Boudol G. (1992). The Chemical Abstract Machine. Theoret. Comp. Sci. 96, 217.
6. Dittrich P. (1999). Artificial Chemistries (tutorial material). A tutorial held at ECAL'99, 13-17 September 1999, Lausanne, Switzerland.
7. Eigen M. (1971). Self organization of matter and the evolution of biological macro molecules. Naturwissenschaften 58, 465.
8. Eigen M., Schuster P. (1977). The Hypercycles: A Principle of Natural Evolution. Naturwissenschaften 64, 541; 65, 7; 65, 341.
9. Fogel, D. B. (1995). Evolutionary Computation. Toward a New Philosophy of Machine Intelligence. The IEEE Press, New York.
10. Fontana W. and Schuster P. (1998). Continuity in Evolution: On the Nature of Transitions. Science 280, 1451.
11. Fontana W. (1991). Algorithmic Chemistry. In: Langton C.G. (ed.): Artificial Life II. Addison Wesley, Reading, MA, p. 159.
12. Gillespie D. T. (1977). Exact Stochastic Simulation of Coupled Chemical Reactions. J. Phys. Chem. 81, 2340.
13. Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor.
14. Jones B.L., Enns R.H., Rangnekar S.S. (1976). On the theory of selection of coupled macromolecular systems. Bulletin of Mathematical Biology 38, 15.
15. Kauffman S.A. (1993). The Origins of Order: Self-organization and Selection in Evolution. Oxford University Press, New York.
16. Kauffman S.A. and Johnsen S. (1991). Coevolution to the edge of chaos: Coupled fitness landscapes, poised states, and coevolutionary avalanches. J. Theor. Biology 149, 467.
17. Kvasnička V. (2000). An evolutionary model of symbiosis. In: P. Sinčák, J. Vaščák (eds.): Quo Vadis Computational Intelligence? Physica-Verlag, Heidelberg, p. 293.
18. Kvasnička V. (2001). An evolutionary simulation of modularity emergence of genotype-phenotype mappings. Neural Network World 5, 473.
19. Kvasnička V. (2002). A modularity emergence of genotype-phenotype mappings. Artificial Life (sent for publication).
20. Kvasnička V., Pospichal J. (2001). Autoreplicators and Hypercycles in Typogenetics. Journal of Molecular Structure (Theochem) 547, 119.
21. Kvasnička V., Pospichal J. (2001). A study of autoreplicators and hypercycles by typogenetics. In: J. Kelemen, P. Sosík: Advances in Artificial Life, ECAL 2001, LNAI 2159, Springer 2001, p. 37.
22. Newman M.E.J. and Engelhardt R. (1998). Effects of neutral selection on the evolution of molecular species. Proc. Roy. Soc. London B 265, 1333.
23. Schuster P. (2002). Molecular Insight into Evolution of Phenotypes. In: Evolutionary Dynamics - Exploring the Interplay of Accident, Selection, Neutrality, and Function, ed. by Crutchfield J.P. and Schuster P., Oxford University Press.
24. Schuster P. and Fontana W. (1999). Chance and Necessity in Evolution: Lessons from RNA. Physica D 133, 427.
25. Spiegelman S. (1971). An Approach to the Experimental Analysis of Precellular Evolution. Quart. Rev. Biophysics 4, 213.
26. Wright, S. (1931). Evolution in Mendelian Populations. Genetics 16, 97.
HUMAN CENTERED SUPPORT SYSTEM USING INTELLIGENT SPACE AND ONTOLOGICAL NEURAL NETWORKS

TORU YAMAGUCHI*, HIROKI MURAKAMI**, DAYON CHEN**

*/**Department of Electronic System Engineering, Tokyo Metropolitan Institute of Technology, 6-6, Asahigaoka, Hino, Tokyo, 191-0065 JAPAN
*PRESTO, Japan Science and Technology Corporation (JST)
E-mail: [email protected], [email protected], chen@fml.ec.tmit.ac.jp

This paper aims at realizing safe and comfortable driving support by showing the information required by the driver in advance, taking the driver's intention into consideration. We propose an Intelligent Space based on a distributed sensory intelligence architecture and an intention recognition model utilizing Conceptual Fuzzy Sets (CFS), a kind of fuzzy associative memory realized in a neural network. From this model we construct an engineering ontology, which captures concepts common to people. This paper shows an improvement of the detection accuracy of pedestrians and inductive learning in this intention recognition model using an ontological neural network in the Intelligent Space.

Keywords: Intelligent Space, Conceptual Fuzzy Sets, Ontology
1 Introduction

Intelligent Transport Systems (ITS), which use the most advanced information and communication technology, make it possible for people, roads, and vehicles to communicate with each other. "Safe driving support", one of the development fields of ITS, is expected to have a big effect on the reduction of traffic accidents, and an early realization of driving support systems is therefore expected. However, conventional systems do not take the comfort and safety of people into account. It is especially important to consider the driving of elderly people, whose capacities of consciousness, cognition, judgment, and driving have declined. If a system is able to recognize the intention of people, people can obtain only the information they require. In this paper, we propose a safe and comfortable human centered driving support that shows the information required by a driver, by re-composing environmental information and recognizing the driver's intention.
2 Space with distributed sensory intelligence: Intelligent Space

2.1 Intelligent Space and its sensory intelligence model
This section describes the Intelligent Space based on a distributed sensory intelligent architecture. This architecture distributes sensory intelligence units and connects each
other by a network. Each sensory intelligence unit has a sensor (such as a CCD camera) and intelligence (such as a processor), and the units share and compensate information with each other. In this way, the Intelligent Space behaves as if the whole space had high intelligence. As the sensory intelligence model in the Intelligent Space, we assume a hierarchical intelligent model that extends the hierarchical model of Rasmussen [1]. The sensor section of the sensory intelligence model extracts environmental information from color, motion, and form information in the environment. Using the Attention mechanism described below, the intelligence section performs the intelligent processing of re-composing environmental information.
2.2 Re-composition of Information in Intelligent Space
In the re-composition of information in the Intelligent Space, each sensory intelligence unit integrates the environmental information acquired by its own sensor section with information from the surrounding sensory intelligence units, updating the information step by step. In this way, information over a wide range can be re-composed.
Fig. 1 Re-composition of information model by Attention (symbol layer and pattern layer (neocortex))
The hippocampus is a part of the memory system of the brain connected with human cognition, consciousness, and related functions. The neocortico-hippocampal model [2] in Fig. 1 models the hippocampus. The neocortico-hippocampal model is very effective in recognizing and learning using multiple elements. This model consists of two layers: a pattern layer and a symbol layer. Knowledge is expressed as pairs of an element expression in the pattern layer and a sign expression in the symbol layer. When certain things and concepts are recollected, the cell groups of both the pattern layer and the symbol layer are stimulated. At this time, an attention vector controls the pattern layer so that only a part of the elements is used, or restricts the retrieval of a part of the symbol layer. Although there are various
functions of attention, "Attention" in this paper is the function that controls the pattern layer and the symbol layer. Equation (1) expresses how an attention vector is controlled by Attention based on the neocortico-hippocampal model.
M = attn₁M₁ + attn₂M₂ + … + attnₙMₙ,   attnᵢ ∈ [0, 1]   (1)
Here, attn₁, attn₂, …, attnₙ are the variables that control attention for each attribute, M₁, M₂, …, Mₙ are the individual associative matrices, and M is the associative matrix integrated by attention. In Fig. 1, when multiple attribute patterns retrieve one symbol, Attention controls which attribute is observed, so a single result can be obtained. For example, suppose there are attribute₁ and attribute₂. If we want to observe attribute₁ but not attribute₂, we simply bring the values of attn₁ and attn₂ close to 1.0 and 0.0, respectively. In this way, Attention can control the attention given to each attribute by setting attnᵢ to a value between 0 and 1. As a result, an agent can re-compose the whole space from the sensing information that each agent obtained.
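As an illustration, the following is a minimal sketch of Eq. (1): attribute-specific associative matrices are blended with attention weights, and the integrated matrix recalls symbol activations from a pattern. The matrix sizes, the random contents, and the simple matrix-vector recall are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def integrate_by_attention(matrices, attentions):
    """Eq. (1): M = attn_1*M_1 + attn_2*M_2 + ... + attn_n*M_n."""
    M = np.zeros_like(matrices[0])
    for M_i, attn_i in zip(matrices, attentions):
        M += attn_i * M_i
    return M

rng = np.random.default_rng(0)
M1 = rng.random((3, 4))  # associative matrix for attribute 1 (e.g. motion)
M2 = rng.random((3, 4))  # associative matrix for attribute 2 (e.g. color)

# Observe attribute 1 only: bring attn_1 to 1.0 and attn_2 to 0.0.
M = integrate_by_attention([M1, M2], [1.0, 0.0])

pattern = np.array([1.0, 0.0, 0.5, 0.0])  # pattern-layer activation
symbols = M @ pattern                     # recalled symbol-layer activation
print(symbols)
```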
Fig. 2 Human detected by motion and color attributes
2.3 Pedestrian detection experiment
The human is detected using the motion attribute (attribute₁) and the color attribute (attribute₂). The left of Fig. 2 shows the human's motion area detected by the motion attribute, and the right of Fig. 2 shows the human's skin color (face and hands) distinguished by the color attribute. Detection by the motion attribute has the fault that a stable relative position cannot be calculated, since the detected area is not fixed; in addition, a pedestrian who is not moving cannot be detected at all. On the other hand, when a pedestrian is detected only by the color attribute, the error of the relative position calculation is small provided the position of the face is reliably detectable. However, when colors close to skin color exist in the scene, it is hard to detect the exact position of a human face. Since pedestrian detection by the color attribute alone is difficult in various environments, its robustness is low. Therefore, in this section, position information is re-composed from the two attributes "motion" and "color". By harnessing their mutual advantages and
compensating for their faults, it is shown that the accuracy of pedestrian position calculation and the robustness of pedestrian detection are both high. The accuracy of detecting the pedestrian's position is improved by re-composing information. We compared the error of the pedestrian position calculated from the motion attribute alone, from the color attribute alone, and from the re-composed information. The average error of pedestrian detection was 115.6 cm with the motion attribute only and 67.7 cm with the color attribute only, while re-composition of information using both the motion and the color attribute reduced the average error to 59.7 cm. Thus the accuracy of detecting the pedestrian position improved by re-composing information, and the robustness of pedestrian detection improved by using both the motion attribute and the color attribute.
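The paper does not state how the two position estimates are combined; the sketch below assumes a simple blend that weights each attribute's estimate by the inverse of its reported average error, which already yields an estimate better than either attribute alone. The coordinates are hypothetical.

```python
import numpy as np

def fuse_positions(p_motion, p_color, err_motion=115.6, err_color=67.7):
    """Blend two position estimates, weighting each by 1 / average error (cm)."""
    w_m, w_c = 1.0 / err_motion, 1.0 / err_color
    return (w_m * p_motion + w_c * p_color) / (w_m + w_c)

p_motion = np.array([310.0, 95.0])  # hypothetical (x, y) estimate from motion, cm
p_color = np.array([285.0, 102.0])  # hypothetical (x, y) estimate from color, cm
print(fuse_positions(p_motion, p_color))  # fused estimate, closer to the color one
```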
3 Intention recognition model

This chapter describes a system that uses CFS knowledge representation as the method for constructing knowledge about driving intention, which is a fuzzy concept.

3.1 Conceptual fuzzy sets
The label of a fuzzy set represents the name of a concept, and the fuzzy set represents the meaning of the concept. According to the theory of meaning representation from use proposed by Wittgenstein [3], the various meanings of a label (word) may be represented by other labels (words), and grades of activation, which show the degree of compatibility between different labels, can thus be assigned. This distributed knowledge is called a Conceptual Fuzzy Set (CFS) and is shown in Fig. 3. Since the distribution changes depending on which labels are activated, i.e. on the conditions, the activation resulting from the CFS displays a context-dependent meaning. Thus, a CFS can represent not only logical knowledge, but also knowledge whose logical representation is impossible. In this research, the CFS is constructed with a bidirectional associative memory (BAM).

Fig. 3 Image of conceptual fuzzy sets in associative memories
Fig. 4 Recollection of "pet" and "fish" in CFS
Fig. 4 shows the CFS representing the meaning of "pet fish" evoked by the activation of "pet" and "fish". Through the properties of associative memories, and without any explicit procedure, the CFS recollects the abstracted upper concept that explains the given instances: the nearest upper concept arises from the activations of the instances. This abstraction behaves robustly against errors. The left of Fig. 4 shows the recollection of "pet" from the activation of "guppy" and "cat" with a small activation of "tuna" as an error. Similarly, the right of Fig. 4 shows the recollection of "fish" from the activation of "guppy" and "tuna".
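To make the recollection mechanism concrete, here is a minimal sketch of the "pet"/"fish" example with a correlation-encoded bidirectional associative memory. The bipolar encoding and the Hebbian outer-product rule are standard BAM conventions assumed here; the paper does not give its exact network.

```python
import numpy as np

# Instance layer: tuna, guppy, cat, canary. Concept layer: pet, fish.
# Bipolar membership patterns (+1 = member, -1 = non-member).
pet = np.array([-1, +1, +1, +1])   # guppy, cat, canary are pets
fish = np.array([+1, +1, -1, -1])  # tuna, guppy are fish

# Hebbian outer-product encoding of (concept, instance-pattern) pairs.
W = np.outer(np.array([+1, -1]), pet) + np.outer(np.array([-1, +1]), fish)

# Activate "guppy" and "cat", plus a small erroneous activation of "tuna".
x = np.array([0.2, 1.0, 1.0, 0.0])
y = W @ x
print(dict(zip(["pet", "fish"], y)))  # "pet" dominates despite the error
```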
3.2 Intention recognition model
When drivers initiate a driving action, they perform its component actions in almost the same order every time. The Intelligent Space detects these actions, and the intention of the driver is recognized. We limit driving intentions to going straight, curving to the right, and curving to the left. This section explains the intention recognition model that recognizes these driving intentions. The network consists of three stages. The lowest stage is an input layer in which each action is expressed by the membership degree of a fuzzy label, so that the activation of nodes in the middle stage increases according to the degree of agreement. The network between the middle stage and the highest stage is composed of CFS networks, and the distribution of node activations converges to satisfy the context after reverberation. The details of the network design are as follows. The lowest stage of the network has three layers, corresponding to the first face action, the second face action, and braking. The characteristic vector extracted from each action is input to the nodes of the corresponding layer. The layers in the middle stage are divided into three parts: the first face-action part, the second face-action part, and the braking part. Moreover, each part has three layers that correspond to the three driving actions. Each layer contains nodes for a number of instances. A fuzzy associative memory is realized, and when the characteristic values are input to the network, the node with the largest matching value is activated; a sketch of this bottom-up matching is given below.
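The following is a minimal sketch of the lowest and middle stages, assuming triangular fuzzy memberships and hypothetical instance prototypes, and collapsing the two face actions into a single feature for brevity; the CFS reverberation of the top stage is omitted, and none of the numbers come from the paper.

```python
def tri_membership(x, center, width=1.0):
    """Triangular fuzzy membership of feature x around a prototype value."""
    return max(0.0, 1.0 - abs(x - center) / width)

# Hypothetical instance prototypes per intention:
# (face-action feature, braking feature) for each learned instance.
instances = {
    "straight": [(0.0, 0.0)],
    "right": [(0.8, 0.3), (0.9, 0.5)],
    "left": [(-0.8, 0.3)],
}

def recognize(face_action, braking):
    """Middle-stage activation: best-matching instance node per intention."""
    return {
        intention: max(
            min(tri_membership(face_action, f), tri_membership(braking, b))
            for f, b in protos
        )
        for intention, protos in instances.items()
    }

print(recognize(face_action=0.85, braking=0.4))  # "right" dominates
```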
3.3 Construction of ontology
Even when people with different languages and cultures communicate with each other, there is a common abstract mapping of humans in their brains, and through it they can reach a certain degree of mutual understanding. Ontology is an extension of this idea.
Fig. 5 Intention recognition model (inputs: the changed distance of the face, the facial direction, and the image of the brake)
In the usual knowledge representation, a complicated concept is divided into smaller concepts. The systematization of this division of a concept into smaller concepts, together with their characterization, is the engineering ontology model. We aim at constructing the common part of the knowledge structure. At present, there is research that implements ontology using CFS and expresses environmental "situations" and "labels" through the interaction with people and the environment. This research forms concrete instances from abstract expressions or, conversely, abstract expressions from concrete instances through the interaction with people, and constructs a kind of association database. Because the intention recognition model in this paper acquires the knowledge of individual actions by arranging instance nodes in the middle stage, a new context that partitions the knowledge is formed. For example, a driving action will differ according to the driver's feeling at the time. In Fig. 6, if a learned instance is the driving action performed when irritated, a node corresponding to "irritation" will be formed.
Fig. 6 Formation of ontology from instances
The formation of a new ontology becomes possible by utilizing the new paired expression of this context and text (situation and label). We extend such a method of concept description and concept formation to conversation between intelligent
agents. Instances can then be shared by matching the formed ontologies, and people are also able to communicate smoothly with the intelligent agents.
Fig. 7 Experiment system (recognition part and CG display part)
3.4 The simulation experiment
As shown in Fig. 7, this system uses a large display, a steering wheel, CCD cameras, and two processors (one for recognizing face motion, the other for the CG that renders realistic driving conditions). The driver watches the display and operates the steering wheel, while at the same time performing the action of checking the right side. The system recognizes the driving action from the mechanical input of the steering wheel, and recognizes the face direction and position through color information from the CCD camera. The direction of the face is estimated from the positional relation between the face and the lips; a simple sketch of such an estimate follows. The number of learned instances in the network ranges from 1 to 3, and the input data to the system are driving actions that were not used for learning. The intention recognition rate is shown in Table 1. When the number of instances is 3, the recognition rate improves by 12% compared with the case of a single instance.
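The paper does not give the formula for this face-direction estimate; the sketch below assumes a simple linear yaw model in which the lips shift horizontally within the face region as the head turns. The gain and the pixel coordinates are hypothetical.

```python
import numpy as np

def face_yaw(face_center_x, face_width, lips_center_x, gain=90.0):
    """Approximate yaw (degrees) from the lips' offset within the face region."""
    offset = (lips_center_x - face_center_x) / (face_width / 2.0)
    return gain * float(np.clip(offset, -1.0, 1.0))

# Hypothetical pixel coordinates: lips shifted right of the face center.
print(face_yaw(face_center_x=160, face_width=80, lips_center_x=175))  # ~33.8 deg
```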
Table 1 The number of learned instances (1, 2, 3) and the corresponding intention recognition rate
When the action detection part uses a stereo view, the accuracy of driving-action detection rises, and the intention recognition rate improves further, to 84.3%.
4 Conclusions
This paper has shown that the robustness of detection is improved in an Intelligent Space based on the distributed sensory intelligent architecture, and that the intention recognition rate improves as the number of learned instances in the CFS increases, with inductive learning being carried out. These results show the effectiveness of the proposed architecture and intention recognition system.

Acknowledgements
This research is supported by PRESTO, JST.

References
1. J. Rasmussen: Skills, Rules, and Knowledge; Signals, Signs, and Symbols, and Other Distinctions in Human Performance Models, IEEE Trans. on SMC, Vol. 13, pp. 257-266 (1983)
2. T. Omori: Computation Theory of Pattern and Symbol Interaction, Journal of Japanese Neural Network Society, Vol. 3, No. 2, pp. 65-67 (1996)
3. L. Wittgenstein: Philosophical Investigations, Basil Blackwell, Oxford (1953)
4. T. Yamaguchi, D. Chen, H. Nitta: Distributed Sensory Intelligence Architecture for Human Centered Systems, Proc. of 2001 IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA-2001), Banff, Alberta, Canada, pp. 194-199 (CD-ROM) (2001)
5. A. Imura, T. Yamaguchi et al.: Distributed Fuzzy Knowledge Processing Using Associative Memories, Proc. IEEE Int. Workshop on Neuro-Fuzzy Control, pp. 176-180 (1993)
6. T. Takagi, T. Yamaguchi and M. Sugeno: Conceptual Fuzzy Sets, Proc. Fuzzy Engineering toward Human Friendly Systems IFES'91, pp. 261-272 (1991)
7. H. Ushida, T. Takagi and T. Yamaguchi: Recognition of Facial Expressions Using Conceptual Fuzzy Sets, Proc. IEEE Int. Conf. on Fuzzy Systems, pp. 594-599 (1993)
8. T. Yahagi, M. Hagiwara, T. Yamaguchi: Neural Network and Fuzzy Signal Processing, Corona Publishing Co. (1998)
THE K.E.I. (KNOWLEDGE, EMOTION AND INTENTION) BASED ROBOT USING HUMAN MOTION RECOGNITION

TORU YAMAGUCHI**, TAKEO SAITO*, NOBUHIRO ANDO*

*/**Department of Electronic System Engineering, Tokyo Metropolitan Institute of Technology, 6-6 Asahigaoka, Hino, Tokyo, 191-0065 Japan
**PRESTO, Japan Science and Technology Corporation (JST)

In present-day Japan, with its rapidly aging society, welfare support systems are attracting much attention. In this paper, a welfare support robot system is considered necessary to support human beings. For this purpose, accurate position information and recognition of the human's actions are required. Therefore, each agent re-composes color-attribute information from more than one camera by Attention, in order to obtain accurate position information and actions of the human being.

Keywords: Attention, The K.E.I. Model, Human Centered Interface
1 Introduction
The aging of society is becoming a serious problem in Japan: by 2050, people aged 65 and over are expected to account for about one third of the whole population. Therefore, the realization of a support system for the daily life of aged people is desired. This paper realizes an interface that detects the position and actions of a human being in a room and provides them to K.E.I. based robots. In doing so, each agent re-composes the color attributes obtained from more than one camera by Attention.
2 Distributed-sensory-intelligent architecture
The distributed-sensory-intelligent architecture distributes sensory intelligence units and connects them to each other by a network. Each sensory intelligence unit has a sensor (such as a CCD camera) and intelligence (such as a processor), and the units share and compensate information with each other. Furthermore, each sensory intelligence unit can switch its role autonomously according to environmental information. In this way, the Intelligent Space behaves as if the whole space had high intelligence.
Human centered interface
The human centered interface, which is shown in Figure 1, is composed of four pieces of Distribute-sensory-intelligence architecture (The CCD camera + CPU). It detects a human being, it re-composes a color attribute by attention and it gets information about the position of the human being, on the facial direction, and so on. Moreover, it recognizes the command of the human being from the camera on the agent. 435
An intelligent robot is used here as the agent.
Figure 1. Human centered interface
4 Re-composition of information by Attention
The hippocampus is a part of the memory system of the brain connected with human cognition, consciousness, and related functions. The neocortico-hippocampal model in Figure 3 models the hippocampus. The neocortico-hippocampal model is very effective in recognizing and learning using multiple elements.
1
1
1
1
.
1
symbol
pattern attribute,
I
T T T
attribute,
stfrlbule,
f
f
f
f
f
attribute.
f
pattern layer(neo cortex) I
l
ei;viranm riital i n f o r m a t i o n
l
Ti
Figure 3. Re- composition of information model by Attention
This model consists of two layers: a pattern layer and a symbol layer. Knowledge is expressed as pairs of an element expression in the pattern layer and a sign expression in the symbol layer. When certain things and concepts are recollected, the cell groups of both the pattern layer and the symbol layer are stimulated. At this time, an attention vector controls the pattern layer so that only a part of the elements is used, or restricts the retrieval of a part of the symbol layer. Although there are
various functions of attention, "Attention" in this paper is the function that controls the pattern layer and the symbol layer. Formula (1) expresses how an attention vector is controlled by Attention based on the neocortico-hippocampal model.

M = attn₁M₁ + attn₂M₂ + … + attnₙMₙ,   attnᵢ ∈ [0, 1]   (1)

Here, attn₁, attn₂, …, attnₙ are the variables that control attention for each attribute, M₁, M₂, …, Mₙ are the individual associative matrices, and M is the associative matrix integrated by Attention. In Figure 3, when multiple attribute patterns retrieve one symbol, Attention controls which attribute is observed, so a single result can be obtained. For example, suppose there are attribute₁ and attribute₂. If we want to observe attribute₁ but not attribute₂, we simply bring the values of attn₁ and attn₂ close to 1.0 and 0.0, respectively. In this way, Attention can control the attention given to each attribute by setting attnᵢ to a value between 0 and 1. As a result, the agent can re-compose the whole space from the sensing information that each agent obtained.
5 Position information detection using re-composition of information by Attention

5.1 The experiment system
The experiment system is composed of five units of the distributed sensory intelligence architecture, each consisting of a CCD camera and a PC for color detection. All five PCs exchange the acquired color attributes over socket communication and re-compose them into 3-D position information by the re-composition of information model by Attention (Fig. 4). In this way each agent knows the 3-D position of the human being.
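The paper does not detail how the per-camera detections are turned into a 3-D position; one standard way, sketched below under that assumption, is to back-project each camera's detection as a viewing ray and take the least-squares intersection of the rays. The camera positions and target are hypothetical.

```python
import numpy as np

def intersect_rays(origins, directions):
    """Least-squares point closest to all rays (origin o_i, direction d_i)."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        d = np.asarray(d, dtype=float)
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)  # projector onto the plane normal to the ray
        A += P
        b += P @ o
    return np.linalg.solve(A, b)

# Three hypothetical wall-mounted cameras looking at a person at (2.0, 1.5, 1.0) m.
origins = [np.array([0.0, 0.0, 2.5]), np.array([4.0, 0.0, 2.5]),
           np.array([0.0, 3.0, 2.5])]
target = np.array([2.0, 1.5, 1.0])
directions = [target - o for o in origins]  # ideal, noise-free viewing rays
print(intersect_rays(origins, directions))  # ~ [2.0, 1.5, 1.0]
```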
5.2 Re-composition of the position information
Each agent re-composes, by Attention, the color attributes from the cameras whose fields of view contain the human being. The resulting detection is robust, and the position can be recognized with high precision.
5.3 The experiment result
The position error of the human being measured in the experiment is shown in Fig. 5.
Human detection by color attribute (per-camera views)