Lecture Notes in Artificial Intelligence Subseries of Lecture Notes in Computer Science Edited by J. G. Carbonell and J. Siekmann
Lecture Notes in Computer Science Edited by G.Goos, J. Hartmanis, and J. van Leeuwen
1778
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Singapore Tokyo
Stefan Wermter Ron Sun (Eds.)
Hybrid Neural Systems
Series Editors
Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA
Jörg Siekmann, University of Saarland, Saarbrücken, Germany

Volume Editors
Stefan Wermter
University of Sunderland, Centre of Informatics, SCET
St Peters Way, Sunderland, SR6 0DD, UK
E-mail: [email protected]

Ron Sun
University of Missouri-Columbia, CECS Department
201 Engineering Building West, Columbia, MO 65211-2060, USA
E-mail: [email protected]

Cataloging-in-Publication Data applied for

Die Deutsche Bibliothek - CIP-Einheitsaufnahme
Hybrid neural systems / Stefan Wermter ; Ron Sun (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 2000
(Lecture notes in computer science ; Vol. 1778 : Lecture notes in artificial intelligence)
ISBN 3-540-67305-9
CR Subject Classification (1991): I.2.6, F.1, C.1.3, I.2 ISBN 3-540-67305-9 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer is a company in the BertelsmannSpringer publishing group. c Springer-Verlag Berlin Heidelberg 2000 Printed in Germany Typesetting: Camera-ready by author data conversion by PTP Berlin, Stefan Sossna Printed on acid-free paper SPIN: 10719871 06/3142 543210
Preface
The aim of this book is to present a broad spectrum of current research in hybrid neural systems, and advance the state of the art in neural networks and artificial intelligence. Hybrid neural systems are computational systems which are based mainly on artificial neural networks but which also allow a symbolic interpretation or interaction with symbolic components. This book focuses on the following issues related to different types of representation: How does neural representation contribute to the success of hybrid systems? How does symbolic representation supplement neural representation? How can these types of representation be combined? How can we utilize their interaction and synergy? How can we develop neural and hybrid systems for new domains? What are the strengths and weaknesses of hybrid neural techniques? Are current principles and methodologies in hybrid neural systems useful? How can they be extended? What will be the impact of hybrid and neural techniques in the future? In order to bring together new and different approaches, we organized an international workshop. This workshop on hybrid neural systems, organized by Stefan Wermter and Ron Sun, was held during December 4–5, 1998 in Denver. In this well-attended workshop, 27 papers were presented. Overall, the workshop was wide-ranging in scope, covering the essential aspects and strands of hybrid neural systems research, and successfully addressed many important issues of hybrid neural systems research. The best and most appropriate paper contributions were selected and revised twice. This book contains the best revised papers, some of which are presented as state-of-the-art surveys, to cover the various research areas of the collection. This selection of contributions is a representative snapshot of the state of the art in current approaches to hybrid neural systems. This is an extremely active area of research that is growing in interest and popularity. We hope that this collection will be stimulating and useful for all those interested in the area of hybrid neural systems. We would like to thank Garen Arevian, Mark Elshaw, Steve Womble and in particular Christo Panchev, from the Hybrid Intelligent Systems Group of the University of Sunderland for their important help and assistance during the preparations of the book. We would like to thank Alfred Hofmann from Springer for his cooperation. Finally, and most importantly, we thank the contributors to this book. January 2000
Stefan Wermter Ron Sun
Table of Contents
An Overview of Hybrid Neural Systems . . . . . . . . . . . . . . . . . . 1
Stefan Wermter and Ron Sun
Structured Connectionism and Rule Representation

Layered Hybrid Connectionist Models for Cognitive Science . . . . . . . . 14
Jerome Feldman and David Bailey

Types and Quantifiers in SHRUTI: A Connectionist Model of Rapid Reasoning and Relational Processing . . . . . . . . 28
Lokendra Shastri

A Recursive Neural Network for Reflexive Reasoning . . . . . . . . 46
Steffen Hölldobler, Yvonne Kalinke and Jörg Wunderlich

A Novel Modular Neural Architecture for Rule-Based and Similarity-Based Reasoning . . . . . . . . 63
Rafal Bogacz and Christophe Giraud-Carrier

Addressing Knowledge-Representation Issues in Connectionist Symbolic Rule Encoding for General Inference . . . . . . . . 78
Nam Seog Park

Towards a Hybrid Model of First-Order Theory Refinement . . . . . . . . 92
Nelson A. Hallack, Gerson Zaverucha, and Valmir C. Barbosa

Distributed Neural Architectures and Language Processing

Dynamical Recurrent Networks for Sequential Data Processing . . . . . . . . 107
Stefan C. Kremer and John F. Kolen

Fuzzy Knowledge and Recurrent Neural Networks: A Dynamical Systems Perspective . . . . . . . . 123
Christian W. Omlin, Lee Giles, and Karvel K. Thornber

Combining Maps and Distributed Representations for Shift-Reduce Parsing . . . . . . . . 144
Marshall R. Mayberry and Risto Miikkulainen

Towards Hybrid Neural Learning Internet Agents . . . . . . . . 158
Stefan Wermter, Garen Arevian, and Christo Panchev
A Connectionist Simulation of the Empirical Acquisition of Grammatical Relations . . . . . . . . 175
William C. Morris, Garrison W. Cottrell, and Jeffrey Elman

Large Patterns Make Great Symbols: An Example of Learning from Example . . . . . . . . 194
Pentti Kanerva

Context Vectors: A Step Toward a “Grand Unified Representation” . . . . . . . . 204
Stephen I. Gallant

Integration of Graphical Rules with Adaptive Learning of Structured Information . . . . . . . . 211
Paolo Frasconi, Marco Gori, and Alessandro Sperduti

Transformation and Explanation

Lessons from Past, Current Issues, and Future Research Directions in Extracting the Knowledge Embedded in Artificial Neural Networks . . . . . . . . 226
Alan B. Tickle, Frederic Maire, Guido Bologna, Robert Andrews, and Joachim Diederich

Symbolic Rule Extraction from the DIMLP Neural Network . . . . . . . . 240
Guido Bologna

Understanding State Space Organization in Recurrent Neural Networks with Iterative Function Systems Dynamics . . . . . . . . 255
Peter Tiňo, Georg Dorffner, and Christian Schittenkopf

Direct Explanations and Knowledge Extraction from a Multilayer Perceptron Network that Performs Low Back Pain Classification . . . . . . . . 270
Marilyn L. Vaughn, Steven J. Cavill, Stewart J. Taylor, Michael A. Foy, and Anthony J.B. Fogg

High Order Eigentensors as Symbolic Rules in Competitive Learning . . . . . . . . 286
Hod Lipson and Hava T. Siegelmann

Holistic Symbol Processing and the Sequential RAAM: An Evaluation . . . . . . . . 298
James A. Hammerton and Barry L. Kalman

Robotics, Vision and Cognitive Approaches

Life, Mind, and Robots: The Ins and Outs of Embodied Cognition . . . . . . . . 313
Noel Sharkey and Tom Ziemke

Supplementing Neural Reinforcement Learning with Symbolic Methods . . . . . . . . 333
Ron Sun
Self-Organizing Maps in Symbol Processing . . . . . . . . 348
Timo Honkela

Evolution of Symbolization: Signposts to a Bridge Between Connectionist and Symbolic Systems . . . . . . . . 363
Ronan G. Reilly

A Cellular Neural Associative Array for Symbolic Vision . . . . . . . . 372
Christos Orovas and James Austin

Application of Neurosymbolic Integration for Environment Modelling in Mobile Robots . . . . . . . . 387
Gerhard Kraetzschmar, Stefan Sablatnög, Stefan Enderle, and Günther Palm

Author Index . . . . . . . . 403
An Overview of Hybrid Neural Systems

Stefan Wermter¹ and Ron Sun²

¹ University of Sunderland, Centre for Informatics, SCET, St. Peter's Way, Sunderland, SR6 0DD, UK
² University of Missouri, CECS Department, Columbia, MO 65211-2060, USA
Abstract. This chapter provides an introduction to the field of hybrid neural systems. Hybrid neural systems are computational systems which are based mainly on artificial neural networks but also allow a symbolic interpretation or interaction with symbolic components. In this overview, we will describe recent results of hybrid neural systems. We will give a brief overview of the main methods used, outline the work that is presented here, and provide additional references. We will also highlight some important general issues and trends.
1 Introduction
In recent years, the research area of hybrid and neural processing has seen a remarkably active development [62,50,21,4,48,87,75,76,25,49,94,13,74,91]. Furthermore, there has been an enormous increase in the successful use of hybrid intelligent systems in many diverse areas such as speech/natural language understanding, robotics, medical diagnosis, fault diagnosis of industrial equipment and financial applications. Looking at this research area, the motivation for examining hybrid neural models is based on different viewpoints.

First, from the point of view of cognitive science and neuroscience, a purely neural representation may be most attractive, but symbolic interpretation of a neural architecture is also desirable, since the brain not only has a neuronal structure but also the capability to perform symbolic reasoning. This leads to the question of how different processing mechanisms can bridge the large gap between, for instance, acoustic or visual input signals and symbolic reasoning. The brain uses specialization of different structures. Although a lot of the functionality of the brain is not yet known in detail, its architecture is highly specialized and organized at various levels of neurons, networks, nodes, cortex areas and their respective connections [10]. Furthermore, different cognitive processes are not homogeneous and it is to be expected that they are based on different representations [73]. Therefore, there is evidence from cognitive science and neuroscience that multiple architectural representations are involved in human processing.

Second, from the point of view of knowledge-based systems, hybrid symbolic/neural representations have some advantages, since different, mutually complementary properties can be combined. Symbolic representations have advantages of easy interpretation, explicit control, fast initial coding, dynamic variable
binding and knowledge abstraction. On the other hand, neural representations show advantages for gradual analog plausibility, learning, robust fault-tolerant processing, and generalization. Since these advantages are mutually complementary, a hybrid symbolic neural architecture can be useful if different processing strategies have to be supported. While from a neuroscience or cognitive science point of view it is most desirable to explore exclusively neural network representations, for knowledge engineering in complex real-world systems, hybrid symbolic/neural systems may be very useful.
2 Various Forms of Hybrid Neural Architectures
Various classification schemes of hybrid systems have been proposed [77,76,89,47]. Other characterizations of architectures covered specific neural architectures, for instance recurrent networks [38,52], or they covered expert systems/knowledge-based systems [49,29,75]. Essentially, a continuum of hybrid neural architectures emerges which contains neural and symbolic knowledge to various degrees. However, as a first introduction to the field, we present a simplified taxonomy here: unified neural architectures, transformation architectures, and hybrid modular architectures.
2.1 Unified Neural Architectures
Unified neural architectures, which have also been referred to as unified hybrid systems [47], are a type of hybrid neural system that relies solely on connectionist representations, although symbolic interpretations of nodes or links are possible. Often, specific knowledge of the task is built into a unified neural architecture. Much early research on unified neural architectures can be traced back to work by Feldman and Ballard, who provided a general framework of structured connectionism [16]. This framework was extended in many different directions including, for instance, parsing [14], explanation [12], and logical reasoning [30,40,70,71,72]. Recent work along these lines focuses also on the so-called NTL, Neural Theory of Language, which attempts to bridge the large gap between neurons and cognitive behavior [17,65]. A question that naturally arises is: why should we use neural models for symbol processing, instead of symbolic models? One possible reason is that neural models are a more apt framework for capturing a variety of cognitive processes, as is argued in [15,66,86,72]. Some inherent processing characteristics of neural models, such as similarity-based processing [72,6], make them more suitable for certain tasks such as cognitive modeling. Moreover, learning processes, such as gradient descent [63] and its various approximations, Expectation-Maximization, and even Inductive Logic Programming methods [26], may be more easily developed in neural models. There can be two types of representations [77]: Localist connectionist architectures contain one distinct node for representing each concept [42,71,67,3,58,31,66].
Distributed neural architectures comprise a set of non-exclusive, overlapping nodes for representing each concept [60,50,27]. The work of researchers such as Feldman [16,17], Ajjanagadde and Shastri [67], Sun [72], and Smolensky [69] has demonstrated why localist connectionist networks are suitable for implementing symbolic processes usually associated with higher cognitive functions. On the other hand, “radical connectionism” [13] is a distributed neural approach to modeling intelligence. Usually, it is easier to incorporate prior knowledge into localist models since their structures can be made to correspond directly to the structure of symbolic knowledge [19], whereas neural learning usually leads to distributed representations. Furthermore, there has been work on integrating localist and distributed representations [28,72,87].
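To make the two representation types concrete, here is a minimal sketch (Python with NumPy); the concepts, vector sizes and sparsity are invented for illustration and are not taken from any of the cited systems.

```python
import numpy as np

concepts = ["cup", "bowl", "plate"]

# Localist coding: one dedicated node per concept (a one-hot vector),
# so the single active unit can be read off symbolically.
def localist(concept):
    vec = np.zeros(len(concepts))
    vec[concepts.index(concept)] = 1.0
    return vec

# Distributed coding: each concept is a pattern over many shared units;
# patterns overlap, which supports generalization but makes individual
# units hard to interpret symbolically.
rng = np.random.default_rng(0)
codebook = {c: (rng.random(16) > 0.7).astype(float) for c in concepts}

def distributed(concept):
    return codebook[concept]

print(localist("cup"))      # exactly one unit is active
print(distributed("cup"))   # several overlapping units are active
```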
2.2 Transformation Architectures
Hybrid transformation architectures transform symbolic representations into neural representations or vice versa. The main processing is performed by neural representations, but there are automatic procedures for transferring neural representations to symbolic representations or vice versa. Using a transformation architecture it is possible to insert symbolic knowledge into, or extract it from, a neural architecture. Hybrid transformation architectures differ from unified neural architectures in this automatic transfer: while certain units in unified neural architectures may be interpreted symbolically by an observer, hybrid transformation architectures actually allow the transfer of knowledge into symbolic rules, symbolic automata, grammars, etc. Examples of such transformation architectures include the work on activation-based automata extraction from recurrent networks [54,90]. Alternatively, a weight-based transformation between symbolic rules and feedforward networks has been extensively examined in knowledge-based artificial neural networks [68,20]. The most common transformation architectures are rule extraction architectures where symbolic rules are extracted from neural networks [19,1]. These architectures have received a lot of attention since rule extraction discovers the hyperplane positions of units in neural networks and transforms them into if-then-else rules. Rule extraction has been performed mostly with multi-layer perceptron networks [79,5,8,11], Kohonen networks, radial basis functions [2,33] and recurrent networks [53,90]. Extraction of symbolic knowledge from neural networks also plays an important role in this volume, e.g. [81,7,84]. Furthermore, insertion of symbolic knowledge can be either gradual through practice [23] or one-shot.
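To make the basic idea of rule extraction concrete, the following is a purely illustrative sketch, not one of the specific extraction algorithms cited above: a single trained unit is treated as a hyperplane over binary features, and the input combinations on its positive side are read off as conjunctive if-then rules. The feature names, weights and threshold are invented for the example.

```python
import itertools

# A "trained" unit: it fires when w.x + b > 0 (its hyperplane).
weights = {"has_wings": 2.0, "lays_eggs": 1.5, "has_fur": -2.5}
bias = -1.0

def fires(assignment):
    return sum(weights[f] * assignment[f] for f in weights) + bias > 0

# Naive extraction: enumerate the binary input combinations, keep those
# that make the unit fire, and report each as a conjunctive rule.
features = list(weights)
for values in itertools.product([0, 1], repeat=len(features)):
    assignment = dict(zip(features, values))
    if fires(assignment):
        conds = " AND ".join(f"{f}={v}" for f, v in assignment.items())
        print(f"IF {conds} THEN class = positive")
```

Practical extraction algorithms avoid this exhaustive enumeration and produce more compact rule sets, but the underlying idea of reading rules off decision boundaries is the same.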
2.3 Hybrid Modular Architectures
Hybrid modular architectures contain both symbolic and neural modules appropriate to the task. Here, symbolic representations are not just initial or final representations as in a transformation architecture. Rather, they are combined
and integrated with neural representations in many different ways. Examples in this class include CONSYDERR [72], SCREEN [95], and robot navigators in which sensor data and neural processing are fused with symbolic top-down expectations [37]. A variety of distinctions can be made. Neural and symbolic modules in hybrid modular architectures can be loosely coupled, tightly coupled or completely integrated [48].

Loosely Coupled Architectures. A loosely coupled hybrid architecture has separate symbolic and neural modules. The control flow is sequential in the sense that processing has to be finished in one module before the next module can begin. Only one module is active at any time, and the communication between modules is unidirectional. There are several loosely coupled hybrid modular architectures for semantic analysis of database queries [9], dialog processing [34], and simulated navigation [78]. Another example of a loosely coupled architecture has been described in a model for structural parsing [87] combining a chart parser and feedforward networks. Other examples of loose coupling, which is sometimes also called passive coupling, include [45,36]. In general, this loose coupling enables various loose forms of cooperation among modules [73]. One form of cooperation is in terms of pre/postprocessing vs. main processing: while one or more modules take care of pre/postprocessing, such as transforming input data or rectifying output data, a main module focuses on the main part of the processing task. Commonly, the pre/postprocessing is done using a neural network, while the main task is accomplished through the use of symbolic methods. Another form of cooperation is a master-slave relationship: while one module maintains control of the task at hand, it can signal other modules to handle some specific aspects of the task. Yet another form of cooperation is the equal partnership of multiple modules.

Tightly Coupled Architectures. A tightly coupled hybrid architecture contains separate symbolic and neural modules whose control and communication are via common shared internal data structures in each module. The main difference between loosely and tightly coupled hybrid architectures is the use of common data structures which allow bidirectional exchanges of knowledge between two or more modules. This makes communication faster and more active but also more difficult to control. Therefore, tightly coupled hybrid architectures have also been referred to as actively coupled hybrid architectures [47]. As examples of tightly coupled architectures, systems for neural deterministic parsing [41] and inferencing [28] have been built where the control changes between symbolic marker passing and neural similarity determination. Furthermore, a hybrid system developed by Tirri [83] consists of a rule base, a fact base and a neural network of several trained radial basis function networks [57,59]. In general, a tightly coupled hybrid architecture allows multiple exchanges of knowledge between two or more modules. The result of a neural module can have a direct influence on a symbolic module or vice versa before it finishes its global
processing. For instance, CDP is a system for deterministic parsing [41], and SCAN contains a tightly coupled component for structural processing and semantic classification [87]. While the neural network chooses which action to perform, the symbolic module carries out the action. During the process of parsing, control is switched back and forth between these modules. Other tightly coupled hybrid architectures for structural processing have been described in more detail in [89]. CLARION is also a system that couples symbolic and neural representations to explore their synergy.

Fully Integrated Architectures. In a fully integrated hybrid architecture there is no discernible external difference between symbolic and neural modules, since the modules have the same interface and are embedded in the same architecture. The control flow may be parallel. Communication may be bidirectional between many modules, although not all possible communication channels have to be used. One example of an integrated hybrid architecture is SCREEN, which was developed for exploring integrated hybrid processing for spontaneous language analysis [95,92]. In fully integrated and interleaved systems, the constituent modules interact through multiple channels (e.g., various possible function calls), or may even have node-to-node connections across two modules, such as CONSYDERR [72] in which each node in one module is connected to a corresponding node in the other module. Another hybrid system, designed by Lees et al. [43], interleaves case-based reasoning modules with several neural network modules.
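As a toy illustration of the loose-coupling style described above (sequential control flow, unidirectional communication), the sketch below pipelines an invented neural scorer into an invented symbolic rule module; none of the modules or rules correspond to the cited systems.

```python
# Neural module: a stand-in scorer that maps an utterance to label scores.
def neural_classifier(utterance):
    scores = {"request": 0.2, "command": 0.2}
    if "please" in utterance:
        scores["request"] += 0.6
    if utterance.endswith("!"):
        scores["command"] += 0.6
    return scores

# Symbolic module: rules that act on the neural module's final output.
def symbolic_interpreter(label):
    rules = {"request": "queue politely", "command": "execute immediately"}
    return rules.get(label, "ask for clarification")

# Loosely coupled pipeline: module 1 finishes before module 2 starts,
# and only a single symbolic result is handed over.
def loosely_coupled(utterance):
    scores = neural_classifier(utterance)
    label = max(scores, key=scores.get)
    return symbolic_interpreter(label)

print(loosely_coupled("please pass the data"))
```

In a tightly coupled or fully integrated architecture the two functions would instead exchange partial results through shared data structures, possibly many times, before either finishes.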
3 Directions for Hybrid Neural Systems
In Feldman and Bailey’s paper, it was proposed that there are the following distinct levels [15]: cognitive linguistic level, computational level, structured connectionist level, computational biology level and biological level. A condition for this vertical hybridization is that it should be possible to bridge the different levels, and the higher levels should be reduced to, or grounded in, lower levels. A top-down research methodology is advocated and examined for concepts towards a neural theory of language. Although the particulars of this approach are not universally agreed upon, researchers generally accept the overall idea of multiple levels of neural cognitive modeling. In this view, models should be constructed entirely of neural components; both symbolic and subsymbolic processes should be implemented in neural networks. Another view, horizontal hybridization, argues that it may be beneficial, and sometimes crucial, to “mix” levels so that we can make better progress on understanding cognition. This latter view is based on realistic assessment of the state of the art of neural model development, and the need to focus on the essential issues (such as the synergy between symbolic and subsymbolic processes [78]) rather than nonessential details of implementation. Horizontal approaches have been used successfully for real-world hybrid systems, for instance in speech/language
analysis [95]. Purely neural systems in vertical hybridization are more attractive for neuroscience, but hybrid systems of horizontal hybridization are currently also a tractable way of building large-scale hybrid neural systems.

Representation, learning and their interaction are among the major issues for developing symbol processing neural networks. Neural networks designed for symbolic processing often involve complex internal structures consisting of multiple components and several different representations [67,71,3]. Thus learning is made more difficult. There is a need to address the problems of what type of representation to adopt, how the representational structure in such systems is built up, how the learning processes involved affect the representation acquired and how the representational constraints may facilitate or hamper learning. In terms of what is being learned in hybrid neural systems, we can have (1) learning contents for a fixed architecture, (2) learning architectures for given contents, or (3) learning both contents and architecture at the same time. Although most hybrid neural learning systems fall within the first two categories, e.g. [18,46], there are some hybrid models that belong to the third category, e.g. [50,92]. Furthermore, there is some current work on parallel neural and symbolic learning, which includes using (1) two separate neural/symbolic algorithms applied simultaneously [78], (2) two separate algorithms applied in succession, (3) integrated neural/symbolic learning [80,35], and (4) purely neural learning of symbolic knowledge, e.g. [46,51].

The issues described above are important for making progress in theories and applications of hybrid systems. Currently, there is not yet a theory of “hybrid systems”. There has been some early work towards a theoretical framework for neural/symbolic representations, but to date there is still a lack of an overall theoretical framework that abstracts away from the details of particular applications, tasks and domains. One step in such a direction may be the research into the relationship between automata theory and neural representations [39,24,88].

Processing natural language has been and will continue to be a very important test area for exploring hybrid neural architectures. It has been argued that “language is the quintessential feature of human intelligence” [85]. While certain learning mechanisms and architectures in humans may be innate, most researchers in neural networks argue for the importance of development and environment during language learning [87,94]. For instance, it was argued [51] that syntax is not innate, that it is a process rather than a representation, and that abstract categories, like subject, can be learned bottom-up. The dynamics of learning natural language is also important for designing parsers using techniques like SRN and RAAM. SARDSRN and SARDRAAM were presented in the context of shift-reduce parsing [46] to avoid the problem associated with SRN and RAAM (that is, losing constituent information). Interestingly, it has been argued that compositionality and systematicity in neural networks arise from an associationistic substrate [61] based on principles from evolution.
Also, research into improving the use of the WWW with neural networks may be promising [93]. While currently most search engines only employ fairly traditional search strategies, machine learning and neural networks could improve the processing of heterogeneous unstructured multimedia data. Another important and promising research area is knowledge extraction from neural networks in order to support text mining and information retrieval [81]. Inductive learning techniques from neural networks and symbolic machine learning algorithms could be combined to analyze the underlying rules for such data. A crucial task for applying neural systems, especially learning distributed systems, is the design of appropriate vector representations for scaling up to real-world tasks. Large context vectors are also essential for learning document retrieval [22]. Due to the size of the data, only linear computations are useful for full-scale information retrieval. However, vector representations are still often restricted to co-occurrences, rather than covering syntax, discourse, logic and so on [22]; more complex representations may be formed and analyzed using fractal approaches [82].

Hard real-world applications are important. A system was built for foreign exchange rate prediction that uses a SOM for reduction and that generates a symbolic representation as input for a recurrent network which can produce rules [55]. Another self-organizing approach for symbol processing was described for classifying Usenet texts and presenting the classification as a hierarchical two-dimensional map [32]. Related neural classification work for text routing has been described [93]. Neural network representations have also been used for important parts of vision and association [56].

Finally, there is promising progress in neuroscience. Computational neuroscience is still in its infancy but it may be very relevant to the long-term progress of hybrid symbolic/neural systems. Related to that, more complex high order neurons may be one possibility for building more powerful functionality [44]. Another way would be to focus more on global brain architectures, for instance for building biologically inspired robots with rooted cognition [64]. It was argued [85] that in 20 years computer power will be sufficient to match human capabilities, at least in principle. But meaning and deep understanding are still lacking. Other important issues are perception, situation assessment and action [78], although perceptual pattern recognition is still in a very primitive state. Rich perception also requires links with rich sets of actions. Furthermore, it has been argued that language is the “quintessential feature” of human intelligence [85] since it is involved in many intelligent cognitive processes.
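As a minimal sketch of the context-vector idea mentioned above, the example below uses bag-of-words co-occurrence counts as stand-in context vectors and retrieves documents with dot products; real context-vector systems such as [22] differ in how the vectors are trained, and the vocabulary and documents here are invented.

```python
import numpy as np

vocab = ["neural", "network", "symbol", "rule", "robot"]

def context_vector(text):
    # Bag-of-words counts as a stand-in for learned context vectors.
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

docs = ["neural network rule extraction",
        "robot control with symbolic rule systems",
        "recurrent neural network dynamics"]
doc_vecs = [context_vector(d) for d in docs]

def retrieve(query):
    q = context_vector(query)
    # Retrieval reduces to linear operations: one (normalized) dot product per document.
    sims = [q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9) for v in doc_vecs]
    return docs[int(np.argmax(sims))]

print(retrieve("neural rule"))
```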
4 Concluding Remarks
In summary, further work towards a theory and fundamental principles of hybrid neural systems is needed. First of all, there is promising work towards relating automata theory with neural networks, or logics with such networks. Furthermore, the issue of representation needs more focus. In order to tackle larger real world tasks using neural networks, for instance in information retrieval, learning
internet agents, or large-scale classification, further research on the underlying vector representations for neural networks is important. Vertical forms of neural/symbolic hybridization models are widely used in cognitive processing, logic representation and language processing. Horizontal forms of neural/symbolic hybridization exist for larger tasks, such as speech/language integration, knowledge engineering, intelligent agents or condition monitoring. Furthermore, it will be interesting to see in the future to what extent computational neuroscience will offer further ideas and constraints for building more sophisticated forms of neural systems.
References 1. R. Andrews, J. Diederich, and A. B. Tickle. A survey and critique of techniques for extracting rules from trained artificial networks. Technical report, Queensland University of Technology, 1995. 2. R. Andrews and S. Geva. Rules and local function networks. In Proceedings of the Rule Extraction From Trained Artificial Neural Networks Workshop, Artificial Intelligence and Simulation of Behaviour, Brighton UK, 1996. 3. J. Barnden. Complex symbol-processing in Conposit. In R. Sun and L. Bookman, editors, Architectures incorporating neural and symbolic processes. Kluwer, Boston, 1994. 4. J. A. Barnden and K. J. Holyoak, editors. Advances in connectionist and neural computation theory, volume 3. Ablex Publishing Corporation, 1994. 5. J. Benitz, J. Castro, and J. I. Requena. Are artificial neural networks black boxes? IEEE Transactions on Neural Networks, 8(5):1156–1164, 1997. 6. R. Bogacz and C. Giraud-Carrier. A novel modular neural architecture for rulebased and similarity-based reasoning. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 7. G. Bologna. Symbolic rule extraction form the DIMLP neural network. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 8. G. Bologna and C. Pellegrini. Accurate decomposition of standard MLP classification responses into symbolic rules. In International Work Conference on Artificial and Natural Neural Networks, IWANN’97, pages 616–627, Lanazrote, Canaries, 1997. 9. Y. Cheng, P. Fortier, and Y. Normandin. A system integrating connectionist and symbolic approaches for spoken language understanding. In Proceedings of the International Conference on Spoken Language Processing, pages 1511–1514, Yokohama, 1994. 10. P. S. Churchland and T. J. Sejnowski. The Computational Brain. MIT Press, Cambridge, MA, 1992. 11. T. Corbett-Clarke and L. Tarassenko. A principled framework and technique for rule extraction from multi-layer perceptrons. In Proceedings of the 5th International Conference on Artificial Neural Networks, pages 233–238, Cambridge, England, July 1997. 12. J. Diederich and D. L. Long. Efficient question answering in a hybrid system. In Proceedings of the International Joint Conference on Neural Networks, Singapore, 1992. 13. G. Dorffner. Neural Networks and a New AI. Chapman and Hall, London, UK, 1997.
14. M. A. Fanty. Learning in structured connectionist networks. Technical Report 252, University of Rochester, Rochester, NY, 1988. 15. J. Feldman and D. Bailey. Layered hybrid connectionist models for cognitive science. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 16. J. A. Feldman and D. H. Ballard. Connectionist models and their properties. Cognitive Science, 6:205–254, 1982. 17. J. A. Feldman, G. Lakoff, D. R. Bailey, S. Narayanan, T. Regier, and A. Stolcke. L0 - the first five years of an automated language acquisition project. AI Review, 8, 1996. 18. P. Frasconi, M. Gori, and A. Sperduti. Integration of graphical rules with adaptive learning of structured information. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 19. L.M. Fu. Rule learning by searching on adapted nets. In Proceedings of the National Conference on Artificial Intelligence, pages 590–595, 1991. 20. L.M. Fu. Neural Networks in Computer Intelligence. McGraw-Hill, Inc., New York, NY, 1994. 21. S. I. Gallant. Neural Network Learning and Expert Systems. MIT Press, Cambridge, MA, 1993. 22. S. I. Gallant. Context vectors: a step toward a grand unified representation. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 23. J. Gelfand, D. Handleman, and S. Lane. Integrating knowledge-based systems and neural networks for robotic skill. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 193–198, San Mateo, CA., 1989. 24. L. Giles and C. W. Omlin. Extraction, insertion and refinement of symbolic rules in dynamically driven recurrent neural networks. Connection Science, 5:307–337, 1993. 25. S. Goonatilake and S. Khebbal. Intelligent Hybrid Systems. Wiley, Chichester, 1995. 26. N. A. Hallack, G. Zaverucha, and V. C. Barbosa. Towards a hybrid model of firstorder theory refinement. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 27. J. A. Hammerton and B. L. Kalman. Holistic symbol computation and the sequential RAAM: An evaluation. In Hybrid Neural Systems (this volume). SpringerVerlag, 2000. 28. J. Hendler. Developing hybrid symbolic/connectionist models. In J. A. Barnden and J. B. Pollack, editors, Advances in Connectionist and Neural Computation Theory, Vol.1: High Level Connectionist Models, pages 165–179. Ablex Publishing Corporation, Norwood, NJ, 1991. 29. M. Hilario. An overview of strategies for neurosymbolic integration. In Proceedings of the Workshop on Connectionist-Symbolic Integration: From Unified to Hybrid Approaches, pages 1–6, Montreal, 1995. 30. S. H¨ olldobler. A structured connectionist unification algorithm. In Proceedings of the National Conference of the American Association on Artificial Intelligence 90, pages 587–593, Boston, MA, 1990. 31. S. H¨ olldobler, Y. Kalinke, and J. Wunderlich. A recursive neural network for reflexive reasoning. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 32. S. Honkela. Self-organizing maps in symbol processing. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 33. J. S. R. Jang and C. T. Sun. Functional equivalence between radial basis function networks and fuzzy inference systems. IEEE Transactions Neural Networks, 4(1):156–159, 1993.
34. D. Jurafsky, C. Wooters, G. Tajchman, J. Segal, A. Stolcke, E. Fosler, and N. Morgan. The Berkeley Restaurant Project. In Proceedings of the International Conference on Speech and Language Processing, pages 2139–2142, Yokohama, 1994. 35. P. Kanerva. Large patterns make great symbols: an example of learning from example. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 36. C. Kirkham and T. Harris. Development of a hybrid neural network/expert system for machine health monitoring. In R. Rao, editor, Proceedings of the 8th International Congress on Condition Monitoring and Engineering Management, COMADEM95, pages 55–60, 1995. 37. G.K. Kraetzschmar, S. Sablatnoeg, S. Enderle, and G. Palm. Application of neurosymbolic integration for environment modelling in mobile robots. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 38. S. C. Kremer. A theory of grammatical induction in the connectionist paradigm. Technical Report PhD dissertation, Dept. of Computing Science, University of Alberta, Edmonton, 1996. 39. S.C. Kremer and J. Kolen. Dynamical recurrent networks for sequential data processing. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 40. F. Kurfeß. Unification on a connectionist simulator. In T. Kohonen, K. M¨ akisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 471–476. North-Holland, 1991. 41. S. C. Kwasny and K. A. Faisal. Connectionism and determinism in a syntactic parser. In N. Sharkey, editor, Connectionist natural language processing, pages 119–162. Lawrence Erlbaum, Hillsdale, NJ, 1992. 42. T. Lange and M. Dyer. High-level inferencing in a connectionist network. Connection Science, 1:181–217, 1989. 43. B. Lees, B. Kumar, A. Mathew, J. Corchado, B. Sinha, and R. Pedreschi. A hybrid case-based neural network approach to scientific and engineering data analysis. In Proceedings of the Eighteenth Annual International Conference of the British Computer Society Specialist Group on Expert Systems, pages 245–260, Cambridge, 1998. 44. H. Lipson and H.T. Siegelmann. High order eigentensors as symbolic rules in competitive learning. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 45. J. MacIntyre and P. Smith. Application of hybrid systems in the power industry. In L. Medsker, editor, Intelligent Hybrid Systems, pages 57–74. Kluwer Academic Press, 1995. 46. M.R. Mayberry and R. Miikkulainen. Combining maps and distributed representations for shift-reduce parsing. In Hybrid Neural Systems (this volume). SpringerVerlag, 2000. 47. K. McGarry, S. Wermter, and J. MacIntyre. Hybrid neural systems: from simple coupling to fully integrated neural networks. Neural Computing Surveys, 2:62–94, 1999. 48. L. R. Medsker. Hybrid Neural Network and Expert Systems. Kluwer Academic Publishers, Boston, 1994. 49. L. R. Medsker. Hybrid Intelligent Systems. Kluwer Academic Publishers, Boston, 1995. 50. R. Miikkulainen. Subsymbolic Natural Language Processing. MIT Press, Cambridge, MA, 1993. 51. W. C. Morris, G. W. Cottrell, and J. L. Elman. A connectionist simulation of the empirical acquisition of grammatical relations. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
52. M. C. Mozer. Neural net architectures for temporal sequence processing. In A. Weigend and N. Gershenfeld, editors, Time series prediction: Forecasting the future and understanding the past, pages 243–264. Addison-Wesley, Redwood City, CA, 1993. 53. C. W. Omlin and C. L. Giles. Extraction and insertion of symbolic information in recurrent neural networks. In V. Honavar and L. Uhr, editors, Artificial Intelligence and Neural Networks:Steps Towards principled Integration, pages 271–299. Academic Press, San Diego, 1994. 54. C. W. Omlin and C. L. Giles. Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1):41–52, 1996. 55. C.W. Omlin, L. Giles, and K. K. Thornber. Fuzzy knowledge and recurrent neural networks: A dynamical systems perspective. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 56. C. Orovas and J. Austin. A cellular neural associative array for symbolic vision. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 57. J. Park and I. W. Sandberg. Universal approximation using radial basis function networks. Neural Computation, 3:246–257, 1991. 58. N. S. Park. Addressing knowledge representation issues in connectionist symbolic rule encoding for general inference. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 59. T. Peterson and R. Sun. An RBF network alternative for a hybrid architecture. In International Joint Conference on Neural Networks, Ancorage, AK, May 1998. 60. J. B. Pollack. Recursive distributed representations. Artificial Intelligence, 46:77– 105, 1990. 61. R. Reilly. Evolution of symbolisation: Signposts to a bridge between connectionist and symbolic systems. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 62. R. G. Reilly and N. E. Sharkey. Connectionist Approaches to Natural Language Processing. Lawrence Erlbaum Associates, Hillsdale, NJ, 1992. 63. D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, pages 318–362. MIT Press, Cambridge, MA, 1986. 64. N. Sharkey and N. T. Ziemke. Life, mind and robots: The ins and outs of embodied cognition. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 65. L. Shastri. A model of rapid memory formation in the hippocampal system. In Proceedings of the Meeting of the Cognitive Science Society, pages 680–685, Stanford, 1997. 66. L. Shastri. Types and quantifiers in SHRUTI: a connectionist model of rapid reasoning and relational processing. In Hybrid Neural Systems (this volume). SpringerVerlag, 2000. 67. L. Shastri and V. Ajjanagadde. From simple associations to systematic reasoning: A connectionist representation of rules, variables and dynamic bindings. Behavioral and Brain Sciences, 16(3):417–94, 1993. 68. J. Shavlik. A framework for combining symbolic and neural learning. In V. Honavar and L. Uhr, editors, Artificial Intelligence and Neural Networks: Steps towards principled Integration, pages 561–580. Academic Press, San Diego, 1994. 69. P. Smolensky. On the proper treatment of connnectionism. Behavioral and Brain Sciences, 11(1):1–74, March 1988. 70. A. Sperduti, A. Starita, and C. Goller. Learning distributed representations for the classifications of terms. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 494–515, Montreal, 1995.
71. R. Sun. On variable binding in connectionist networks. Connection Science, 4(2):93–124, 1992. 72. R. Sun. Integrating Rules and Connectionism for Robust Commonsense Reasoning. Wiley, New York, 1994. 73. R. Sun. Hybrid connectionist-symbolic models: A report from the IJCAI95 workshop on connectionist-symbolic integration. Artificial Intelligence Magazine, 1996. 74. R. Sun. Supplementing neural reinforcement learning with symbolic methods. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 75. R. Sun and F. Alexandre. Proceedings of the Workshop on Connectionist-Symbolic Integration: From Unified to Hybrid Approaches. McGraw-Hill, Inc., Montreal, 1995. 76. R. Sun and F. Alexandre. Connectionist Symbolic Integration. Lawrence Erlbaum Associates, Hillsdale, NJ, 1997. 77. R. Sun and L.A. Bookman. Computational Architectures Integrating Neural and Symbolic Processes. Kluwer Academic Publishers, Boston, MA, 1995. 78. R. Sun and T. Peterson. Autonomous learning of sequential tasks: experiments and analyses. IEEE Transactions on Neural Networks, 9(6):1217–1234, 1998. 79. S. Thrun. Extracting rules from artificial neural networks with distributed representations. In G.Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7. MIT Press, San Mateo, CA, 1995. 80. S. Thrun. Explanation-Based Neural Network Learning. Kluwer, Boston, 1996. 81. A. Tickle, F. Maire, G. Bologna, R. Andrews, and J. Diederich. Lessons from past, current issues and future research directions in extracting the knowledge embedded in artificial neural networks. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 82. P. Tino, G. Dorffner, and C. Schittenkopf. Understanding state space organization in recurrent neural networks with iterative function systems dynamics. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 83. H. Tirri. Replacing the pattern matcher of an expert system with a neural network. In S. Goonatilake and S.Khebbal, editors, Intelligent Hybrid Systems, pages 47–62. John Wiley and Sons, 1995. 84. M.L. Vaughn, S.J. Cavill, S.J. Taylor, M.A. Foy, and A.J.B. Fogg. Direct knowledge extraction and interpretation from a multilayer perceptron network that performs low back pain classification. In Hybrid Neural Systems (this volume). SpringerVerlag, 2000. 85. D. Waltz. The importance of importance. In Presentation at Workshop on Hybrid Neural Symbolic Integration, Breckenridge, CO., 1998. 86. D. L. Waltz and J. A. Feldman. Connectionist Models and their Implications. Ablex, 1988. 87. S. Wermter. Hybrid Connectionist Natural Language Processing. Chapman and Hall, Thomson International, London, UK, 1995. 88. S. Wermter. Preference Moore machines for neural fuzzy integration. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 840–845, Stockholm, 1999. 89. S. Wermter. The hybrid approach to artificial neural network-based language processing. In R. Dale, H. Moisl, and H. Somers, editors, A Handbook of Natural Language Processing. Marcel Dekker, 2000. 90. S. Wermter. Knowledge extraction from transducer neural networks. Applied Intelligence: The International Journal of Artificial Intelligence, Neural Networks, and Complex Problem-Solving Techniques, 12:27–42, 2000.
91. S. Wermter, G. Arevian, and C. Panchev. Towards hybrid neural learning internet agents. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 92. S. Wermter and M. Meurer. Building lexical representations dynamically using artificial neural networks. In Proceedings of the International Conference of the Cognitive Science Society, pages 802–807, Stanford, 1997. 93. S. Wermter, C. Panchev, and G. Arevian. Hybrid neural plausibility networks for news agents. In Proceedings of the National Conference on Artificial Intelligence, pages 93–98, Orlando, USA, 1999. 94. S. Wermter, E. Riloff, and G. Scheler. Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing. Springer, Berlin, 1996. 95. S. Wermter and V. Weber. SCREEN: Learning a flat syntactic and semantic spoken language analysis using artificial neural networks. Journal of Artificial Intelligence Research, 6(1):35–85, 1997.
Layered Hybrid Connectionist Models for Cognitive Science

Jerome Feldman and David Bailey

International Computer Science Institute, Berkeley, CA 94704
Abstract. Direct connectionist modeling of higher cognitive functions, such as language understanding, is impractical. This chapter describes a principled multi-layer architecture that supports AI-style computational modeling while preserving the biological plausibility of structured connectionist models. As an example, the connectionist realization of Bayesian model merging as recruitment learning is presented.
1 Hybrid Models in Cognitive Science
Almost no one believes that connectionist models will suffice for the full range of tasks in creating and modeling intelligent systems. People whose goals are primarily performance programs have no compunction about deploying hybrid systems, and rightly so. But many connectionists are primarily interested in modeling human and other animal intelligence, and it is not as clear what methodology is most appropriate in this enterprise. This paper provides one answer that has been useful to our group in our decade of effort in modeling language acquisition in the NTL (originally L0) project.

Connectionists of all persuasions agree that intelligence will best be explained in terms of its neural foundations, using computational models with simple notions of spreading activation and experience-based weight change. But no one claims that a model containing millions (much less billions) of units is itself a scientific description of some phenomenon, such as vision or language understanding. Beyond this basic agreement there is a major bifurcation into two approaches. The (larger) PDP community believes that progress is best made by training large back-propagation networks (and more recently Elman-style recurrent nets) to perform specific functions and then examining the learned weights for patterns and insight. There is a lot of good current work on extracting higher-level representations from learned weights, and this is discussed in other chapters. But there is no evidence that PDP networks can just learn the full range of intelligent behavior, and they will not be discussed in this chapter.

The other main approach to connectionist modeling is usually called structured because varying amounts of specifically designed computational mechanism are built into the model. This work is almost always localist in character, because it is much more natural to postulate that the pre-wired computational mechanisms are realized by localized circuits, especially if one is actually building the
model. In principle, structured connectionist models (SCM) could capture the exact brain structure underlying some behavior and, in fact, models of this sort are common in computational neuroscience. But for the complex behaviors underlying intelligence, a complete SCM would not be any simpler to understand than the neural circuitry itself. And, of course, we don't know nearly enough about either the brain or cognition to even approximate such iconic models. One suggestion that came up in the early years was to just use conventional symbolic models for so-called higher functions and restrict connectionist modeling to simple sensory-motor behaviors. This defeatist stance was never very popular because it left higher cognition without the parallel, fault-tolerant, evidential computational mechanisms that are the heart of connectionism. Something like this is now more feasible because the conventional AI approach has become a probabilistic belief network and thus more likely to be mappable to connectionist models. Even so, if one is serious about cognitive modeling, there are good reasons to restrict choices to computational mechanisms that are at least arguably within the size, speed and learnability constraints of the brain. For example, in general it takes exponential time for an arbitrary belief network to settle and thus specializations would be needed for a plausible cognitive model. Although the idea was implicit in earlier structured connectionist work, it is only recently that we have enunciated a systematic philosophy on how to build hybrid connectionist cognitive models. The central idea is hierarchical reducibility: any construct posited at a higher modeling level must have a computationally and biologically plausible reduction to the level below. The table below depicts our current five-level structure. We view this approach as a perfectly ordinary instance of the scientific method as routinely practiced in the physical and life sciences. But it does seem to provide us with a good working style providing tractability while maintaining connectionist principles and the potential for direct experimental testing.

cognitive:      words, concepts
computational:  f-structs, x-schemas (see below)
connectionist:  structured models, learning rules
comp. neuro.:   detailed neural models
neural:         [implicit]
Our computational level is analogous to Marr’s and comprises a mixture of familiar notions like feature structures and a novel representation, executing schemas, described below. Apart from providing a valuable scientific language for specifying proposed structures and mechanisms, these representational formalisms can be implemented in simulations to allow us to test our hypotheses. They also support computational learning algorithms so we can use them in experiments on acquisition. Importantly, these computational mechanisms are all reducible to structured connectionist models so that embodiment can be realized. It is not necessarily easy to carry out these reductions and a great deal of effort has gone into understanding them. Perhaps the most challenging problem is
the connectionist representation of variable binding. This has been addressed for over a decade by Shastri and his students [13,12] and also by a number of other groups [9,15]. This body of work has shown that connectionist models can indeed encode a large body of systematic knowledge and perform interesting inferences in parallel via spreading activation. A recent extension of these techniques [7] supports the connectionist realization of the X-schema formalism discussed below. For contrast, we focus here on a rather different kind of problem: the mapping of the Bayesian model merging technique [14] to the standard structured connectionist mechanism of recruitment learning [6]. The particular context is David Bailey's dissertation [2], a model of how children learn the words for simple actions of their hand.
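For readers unfamiliar with Bayesian model merging, the following is a deliberately simplified sketch of its general flavor over a single discrete feature; the data, prior and smoothing constants are invented, and this is not Bailey's actual algorithm, whose connectionist realization is presented later in this chapter.

```python
import math
from collections import Counter

# Toy training data: the observed "posture" feature for seven uses of a verb.
examples = ["palm", "palm", "index", "palm", "index", "index", "palm"]
VALUES = ["palm", "index", "grasp"]

ALPHA = 0.5    # Dirichlet smoothing over feature values
PRIOR = -1.5   # log-prior cost per word sense: fewer senses are preferred

def sense_loglik(counts):
    # Log-probability of the examples assigned to one sense under that
    # sense's own smoothed value distribution.
    n = sum(counts.values())
    return sum(c * math.log((c + ALPHA) / (n + ALPHA * len(VALUES)))
               for c in counts.values())

def score(senses):
    return sum(sense_loglik(s) for s in senses) + PRIOR * len(senses)

# Model merging: start with one candidate sense per example, then greedily
# merge the pair of senses whose merge most improves the posterior score.
senses = [Counter([e]) for e in examples]
improved = True
while improved and len(senses) > 1:
    improved = False
    best_score, best_senses = score(senses), None
    for i in range(len(senses)):
        for j in range(i + 1, len(senses)):
            merged = senses[:i] + senses[i + 1:j] + senses[j + 1:] + [senses[i] + senses[j]]
            s = score(merged)
            if s > best_score:
                best_score, best_senses = s, merged
    if best_senses is not None:
        senses, improved = best_senses, True

print(len(senses), "sense(s):", [dict(s) for s in senses])
```

On this toy data the procedure stops at two senses (one dominated by palm postures, one by index-finger postures), illustrating how merging trades data fit against a prior that favors fewer word senses.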
2 Verblearn and Its Connectionist Reduction
Bailey's Verblearn system has three major subparts as depicted at the computational level in Figure 1. The bottom section of Figure 1 depicts the underlying actions, encoded as X-schemas. The top third depicts various possible word senses associated with an action, encoded as feature structures. The task is to learn the best set of word senses for describing the variations expressed in the training language. The crucial linking mechanism (also encoded as a feature structure) is shown in the center of Figure 1. As will be described below, the system learns an appropriate set of word senses and can demonstrate its knowledge by labeling novel actions or carrying out commands specified with newly learned words. The goal of this paper is to show how Bailey's computational level account can be reduced to the structured connectionist level in a principled way. The two main data structures we need to map are feature structures and executing schemas (X-schemas). As mentioned above, X-schemas can be modeled at the connectionist level as an extension of Shruti [7]. The modeling of individual feature values follows the standard structured connectionist strategy (and a biologically plausible one as well) of using place coding [6]. That is, for each feature, there is a dedicated connectionist unit (i.e. a separate place) for each possible value of the feature. Within the network representing the possible values of a feature, we desire a winner-take-all (WTA) behavior. That is, only one value unit should be active at a time, once the network has settled, again in the standard way [6]. Representing the link between entities, feature names and feature values is done, again as usual, with triangle nodes.
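As an illustration only, place coding and winner-take-all settling can be sketched as follows; the feature, its values, and the update rule are simplified stand-ins, not the network actually used in the model.

```python
import numpy as np

# Place coding: one unit per possible value of the "posture" feature.
values = ["grasp", "palm", "index"]
activation = np.array([0.2, 0.9, 0.6])   # external evidence for each value

# Winner-take-all settling: each value unit excites itself and inhibits its
# rivals until only one unit remains active.
def settle(act, self_excite=1.1, inhibit=0.4, steps=30):
    act = act.copy()
    for _ in range(steps):
        inhibition = inhibit * (act.sum() - act)   # lateral inhibition from rivals
        act = np.clip(self_excite * act - inhibition, 0.0, 1.0)
    return act

final = settle(activation)
print(dict(zip(values, final.round(2))))
print("winner:", values[int(final.argmax())])   # the palm unit wins here
```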
2.1 Triangle Units
The essential building block is the triangle unit [6,4], shown in Figure 2(a). A triangle unit is an abstraction of a neural circuit which effects a three-way binding. In the figure, the units A, B and C represent arbitrary “concepts” which are bound by the triangle unit. All connections shown are bidirectional and excitatory. The activation function of a triangle unit is such that activation
Fig. 1. An overview of the verb-learner at the computational level, showing details of the Slide x-schema, some linking features, and two verbs: push (with two senses) and shove (with one sense).
Fig. 2. (a) A simple triangle unit which binds A, B and C. (b) One possible neural realization.
on any two of its incoming connections causes an excitatory signal to be sent out over all three outgoing connections. Consequently, the triangle unit allows activation of A and B to trigger C, or activation of A and C to trigger B, etc. Triangle nodes will be used here as abstract building blocks, but Figure 2(b) illustrates one possible connectionist realization. A single unit is employed to implement the binding, and each concept unit projects onto it. Concept units are assumed to fire at a uniform high rate when active and all weights into the main unit are equal. As a result, each input site of the triangle unit can be thought of as producing a single 0-or-1 value (shown as lower-case a, b and c) indicating whether its corresponding input unit is active. The body of the binding unit then just compares the sum of these three values to the threshold of 2. If the threshold is met, the unit fires. Its axon projects to all three concept units, and the connections are strong enough to activate all of the concept units, even those receiving no external input.
Fig. 3. Using a triangle unit to represent the value (palm) of a feature (posture) for an entity ("push").
A particularly useful type of three-way binding consists of an entity, a feature, and a value for the feature, as shown in Figure 3. With this arrangement, if posture and palm are active, then "push" will be activated—a primitive version of the labelling process. Alternatively, if "push" and posture are active, then palm will be activated—a primitive version of obeying. The full story [1] requires a more complex form of triangle units, but the basic ideas can be conveyed with just the simple form.
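The following fragment is a minimal sketch (ours, not Bailey's code) of this simple form of triangle unit, using the entity/feature/value triple of Figure 3; the threshold of 2 gives both the labelling and the obeying behavior.

# Sketch of the simple triangle unit of Figure 2(b): a three-way binding in which
# activity on any two of the bound concepts (threshold 2) re-activates all three.

class TriangleUnit:
    def __init__(self, a, b, c, threshold=2):
        self.concepts = (a, b, c)     # names of the bound concept units
        self.threshold = threshold

    def fire(self, active):
        """Given the set of currently active concepts, return the set after one step."""
        inputs = sum(1 for name in self.concepts if name in active)
        if inputs >= self.threshold:
            return set(active) | set(self.concepts)   # unit fires; all three concepts turn on
        return set(active)

# The entity/feature/value binding of Figure 3: ("push", posture, palm).
push_posture = TriangleUnit('"push"', "posture", "palm")

# Labelling: the action's posture value is known, the word is recovered.
print(push_posture.fire({"posture", "palm"}))       # adds '"push"'

# Obeying: the word and the feature are given, the motor value is recovered.
print(push_posture.fire({'"push"', "posture"}))     # adds "palm"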
2.2 Connectionist Level Network Architecture
This section describes a network architecture which implements (approximately) the multiple-sense verb representation and its associated algorithms for labelling and obeying. The architecture is shown in Figure 4, whose layout is intended to be reminiscent of the upper half of Figure 1.
Fig. 4. A connectionist version of the model, using a collection of triangle units for each word sense.
On the top is a “vocabulary” subnetwork containing a unit for each known verb. Each verb is associated with a collection of phonological and morphological details, whose connectionist representation is not considered here but is indicated by the topmost “blob” in the figure. Each verb unit can be thought of as a binding unit which ties together such information. The verb units are connected in a winner-take-all fashion to facilitate choosing the best verb for a given situation. On the bottom is a collection of subnetworks, one for each linking feature. The collection is divided into two groups. One group—the motor-parameter features—is bidirectionally connected to the motor control system, shown here as a blob for simplicity. The other group—the world-state features—receives connections from the perceptual system, which is not modelled here and is indicated by the bottom-right blob. Each feature subnetwork consists of one unit for each possible value. Within each feature subnetwork, units are connected in a winner-take-all fashion. A separate unit also represents each feature name. The most interesting part of the architecture is the circuitry connecting the verb units to the feature units. In the central portion of Figure 4 the connectionist representation of two senses of push are shown, each demarcated by a box. Each sense requires several triangle units with specialized functions. One triangle unit for each sense can be thought of as primary; these are drawn larger and labelled “push1” and “push2”. These units are of the soft conjunctive type and serve to integrate information across the features which the sense is concerned about. Their left side connects to the associated verb unit. Their right
side has multiple connections to a set of subsidiary triangle units, one for each world-state feature (although only one is shown in the figure). The lower side of the primary triangle unit works similarly, but for the motor-parameter features (two are shown in the figure). Note also that the primary triangle units are connected into a lexicon-wide winner-take-all network.
2.3 Labelling and Obeying
We can now illustrate how the network performs labelling and obeying. Essentially, these processes involve providing strong input to two of the three sides of some word sense’s primary triangle unit, resulting in activation of the third side. For labelling, the process begins when x-schema execution and the perceptual system activate the appropriate feature and value units in the lower portion of Figure 4. In response—and in parallel—every subsidiary triangle unit connected to an active feature unit weighs the suitability of the currently active value unit according to its learned connection strengths. In turn, these graded responses are delivered to the lower and right-hand sides of each word sense’s primary triangle unit. The triangle units become active to varying degrees, depending on the number of activated subsidiary units and their degrees of activation. The winner-take-all mechanism ensures that only one primary unit dominates, and when that occurs the winning primary unit turns on its associated verb unit. For obeying, we assume one verb unit has been activated (say, by the auditory system) and the appropriate world-state feature and value units have been activated (by the perceptual system). As a result, the only primary triangle units receiving activation on more than one side will be those connected to the command verb. This precipitates a competition amongst those senses to see which has the most strongly active world-state subsidiary triangle units—that is, which sense is most applicable to the current situation. The winner-take-all mechanism boosts the winner and suppresses the others. When the winner’s activation peaks, it sends activation to its motor-parameter subsidiary triangle units. These, in turn, will activate the motor-parameter value units in accordance with the learned connection strengths. Commonly this will result in partial activation on multiple values for some features. The winner-take-all mechanism within each feature subnetwork chooses a winner. (Alternatively, we might prefer to preserve the distributed activation pattern for use by smarter x-schemas which can reason with probabilistic specification of parameters. E.g., if all the force value units are weakly active, the x-schema knows it should choose a suitable amount of force.)
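A schematic of the labelling pass can be written down as follows. This is our own illustration, not the system's code: the connection strengths are invented (they loosely echo the numbers shown in Figure 1), and the max over sense scores stands in for the lexicon-wide winner-take-all competition.

# Illustrative sketch of the labelling pass: each word sense scores the currently
# active feature values through its subsidiary units, the primary units integrate
# those scores, and a lexicon-wide WTA picks one sense, which turns on its verb.

# Learned connection strengths of each sense's subsidiary triangle units (made up).
senses = {
    "push1": {"schema": {"slide": 1.0}, "posture": {"palm": 0.7}, "aspect": {"once": 0.8}},
    "push2": {"schema": {"depress": 1.0}, "posture": {"index": 0.9}, "accel": {"low": 0.7}},
    "shove": {"schema": {"slide": 1.0}, "posture": {"palm": 0.9}, "accel": {"high": 0.9}},
}
verb_of_sense = {"push1": "push", "push2": "push", "shove": "shove"}

def label(active_features):
    """Score every sense against the active feature values and return the winning verb."""
    scores = {
        sense: sum(links.get(feat, {}).get(val, 0.0) for feat, val in active_features.items())
        for sense, links in senses.items()
    }
    winner = max(scores, key=scores.get)     # stands in for the winner-take-all network
    return verb_of_sense[winner], scores

# A gentle palm slide is labelled "push"; a fast one comes out as "shove".
print(label({"schema": "slide", "posture": "palm", "aspect": "once"}))
print(label({"schema": "slide", "posture": "palm", "accel": "high"}))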
3 Learning - Connectionist Account
The ultimate goal of the system is to learn the right word senses from labeled experience. The verb learning model assumes that the agent has already acquired various x-schemas for the actions of one hand manipulating an object on a table and that an informant labels actions that the agent is performing. The algorithm starts by assuming that each instance (e.g. of a word sense) is a new category
and then proceeds to merge these until a total information criterion no longer improves. More technically, the learning task is an optimization problem, in that we seek, amongst all possible lexicons, the “best” one given the training set. We seek the lexicon model m that is most probable given the training data t:

argmax_m P(m | t)    (1)
The probability being maximized is the a posteriori probability of the model, and our algorithm is a “maximum a posteriori (MAP) estimator”. The fundamental insight of Bayesian learning is that this quantity can be decomposed, using Bayes’ rule, into components which separate the fit to the training data and an a priori preference for certain models over others:

P(m | t) ∝ P(m) P(t | m)    (2)
Here, as usual, the prior term P(m) reflects the complexity of the model, favoring simpler models, and the likelihood term P(t | m) is a measure of how well the model fits the data, in this case the labeled actions. The goal is to adjust the model to optimize the overall fit; model merging is one algorithm for doing so. In general terms, the algorithm is:

Model merging algorithm:
1. Create a simple model for each example in the training set.
2. Repeat the following until the posterior probability decreases:
   a) Find the best candidate pair of models to merge.
   b) Merge the two models to form a possibly more complex model, and remove the original models.

In our case, “model” in the name “model merging” refers to an individual word sense f-struct. The learning algorithm creates a separate word sense for every occurrence of a word, and then merges these word sense f-structs so long as the reduction in the number of word senses outweighs the loss of training-set likelihood resulting from the merge. A major advantage of the model merging algorithm is that it is one-shot. After a single training example for a new verb, the system is capable of using the verb in a meaningful, albeit limited, way. Model merging is also relatively efficient since it does not backtrack. Yet it often successfully avoids poor local minima because its bottom-up rather than top-down strategy is less likely to make premature irreversible commitments. We now consider how the model merging algorithm can be realized in a connectionist manner, so that we will have a unified connectionist story for the entire system. At first glance, the model merging algorithm does not appear particularly connectionist. Two properties cause trouble. First, the algorithm is constructivist. That is, new pieces of representation (word senses) need to be built, as
opposed to merely gradually changing existing structures. Second, the criterion for merging is a global one, rather than depending on local properties of word senses. Nevertheless, we have a proposed connectionist solution employing a learning technique known as recruitment learning.
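Before turning to the connectionist realization, here is a minimal, runnable caricature of the computational-level merging loop. This is our own sketch, not Bailey's implementation: a "sense" is a map from feature name to the set of values it allows, and the scoring function is only a crude stand-in for the true prior and likelihood.

import math

ALPHA = 2.0          # complexity cost per word sense (stand-in for the prior P(m))

def loglik(sense, example):
    """Log-probability of one example under one sense (uniform over allowed values)."""
    total = 0.0
    for feature, value in example.items():
        if value not in sense[feature]:
            return float("-inf")
        total += math.log(1.0 / len(sense[feature]))
    return total

def score(senses, examples):
    # complexity penalty plus a crude likelihood (stand-in for P(t | m))
    return -ALPHA * len(senses) + sum(max(loglik(s, e) for s in senses) for e in examples)

def merge(s1, s2):
    return {f: s1[f] | s2[f] for f in s1}

def model_merging(examples):
    senses = [{f: {v} for f, v in e.items()} for e in examples]     # one sense per example
    while len(senses) > 1:
        candidates = [
            [s for k, s in enumerate(senses) if k not in (i, j)] + [merge(senses[i], senses[j])]
            for i in range(len(senses)) for j in range(i + 1, len(senses))
        ]
        best = max(candidates, key=lambda c: score(c, examples))
        if score(best, examples) <= score(senses, examples):
            break                                                    # posterior stops improving
        senses = best
    return senses

examples = [{"posture": "palm", "force": "low"},
            {"posture": "palm", "force": "med"},
            {"posture": "index", "force": "low"}]
print(model_merging(examples))   # settles on two senses rather than one per example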
3.1 Recruitment Learning
Recruitment learning [5,11] assumes a localist representation of bindings such as the triangle unit described in §2.1, and provides a rapid-weight-change algorithm for forming such “effective circuits” from previously unused connectionist units. Figure 5 illustrates recruitment with an example. Recall that a set of triangle nodes is usually connected in a winner-take-all (WTA) fashion to ensure that only one binding reaches an activation level sufficiently high to excite its third member. For recruitment learning, we further posit that there is a pool of “free” triangle units which also take part in the WTA competition. The units are free in that they have low, random weights to the various “concept units” amongst which bindings can occur. Crucially, though, they do have connections to these concept units. But the low weights prevent these free units from playing an active role in representing existing bindings.
Fig. 5. Recruitment of triangle unit T3 to represent the binding E–F–G.
This architecture supports the learning of new bindings as follows. Suppose, as in Figure 5, several triangle units already represent several bindings, such as
T1, which represents the binding of A, C and F. (The bindings for T2 are not shown.) Suppose further that concept units E, F and G are currently active, and the WTA network of triangle units is instructed (e.g. by a chemical mechanism) that this binding must be represented. If there already exists a triangle unit representing the binding, it will be activated by the firing of E, F and G, and that will be that. But if none of the already-recruited triangle units represents the binding, then it becomes possible for one of the free triangle units (e.g. T3)—whose low, random weights happen to slightly bias it toward this new binding—to become weakly active. The WTA mechanism selects this unit and increases its activation, which then serves as a signal to the unit to rapidly strengthen its connections to the active concept units. It thereby joins the pool of recruited triangle units. As described, the technique seems to require full connectivity and enough unrecruited triangle units for all possible conjunctions. Often, though, the overall architecture of a neural system provides constraints which greatly reduce the number of possible bindings, compared to the number possible if the pool of concept units is considered as an undifferentiated whole. For example, in our connectionist word sense architecture, it is reasonable to assume that the initial neural wiring is predisposed toward binding words to features—not words to words, or feature units to value units of a different feature. The view that the brain starts out with appropriate connectivity between regions on a coarse level is bolstered by the imaging studies of [3] which show, for example, different localization patterns for motor verbs (nearer the motor areas) vs. other kinds of verbs. Still, the number of potential bindings and connections may be daunting. It turns out, though, that sparse random connection patterns can alleviate this apparent problem [5]. The key idea is to use a multi-layered scheme for representing bindings, in which each binding is represented by paths amongst the to-be-bound units rather than direct connections. The existence of such paths can be shown to have high probability even in sparse networks, for reasonable problem sizes [16].
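The recruitment step itself can be sketched as follows (illustrative only; the unit count, weights and response threshold are invented): an unrepresented binding is claimed by whichever free unit happens to respond most strongly, and that unit's weights to the active concepts are rapidly strengthened.

import random

# Sketch of recruitment learning: a pool of triangle units is fully connected to the
# concept units with low random weights; committed units represent existing bindings,
# free units wait to be recruited for new ones.

CONCEPTS = ["A", "B", "C", "D", "E", "F", "G"]

class Pool:
    def __init__(self, n_units, seed=0):
        rng = random.Random(seed)
        # weight[u][concept]: committed bindings use weight 1.0, free units ~0.0-0.1
        self.weights = [{c: 0.1 * rng.random() for c in CONCEPTS} for _ in range(n_units)]
        self.recruited = set()

    def commit(self, unit, triple):
        for c in triple:
            self.weights[unit][c] = 1.0      # rapid, lasting weight change (LTP-like)
        self.recruited.add(unit)

    def response(self, unit, active):
        return sum(self.weights[unit][c] for c in active)

    def bind(self, triple):
        # If a recruited unit already represents the binding, it simply wins the WTA.
        for u in self.recruited:
            if self.response(u, triple) >= 2.0:
                return u
        # Otherwise the free unit biased most toward this triple wins and is recruited.
        free = [u for u in range(len(self.weights)) if u not in self.recruited]
        winner = max(free, key=lambda u: self.response(u, triple))
        self.commit(winner, triple)
        return winner

pool = Pool(n_units=4)
pool.commit(0, ("A", "C", "F"))              # T1 already represents A-C-F (Figure 5)
print(pool.bind(("A", "C", "F")))            # existing binding: unit 0 just wins
print(pool.bind(("E", "F", "G")))            # new binding: a free unit is recruited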
3.2 Merging Via Recruitment
The techniques of recruitment learning can be put to use to create the word sense circuitry shown earlier in Figure 4. The connectionist learning procedure does not exactly mimic the algorithm given above but captures the main ideas. To illustrate our connectionist learning procedure, we will assume that the two senses of push shown in Figure 4 have already been learned, and a new training example has just occurred. That is, the “push” unit has just become active, as have some of the feature value units reflecting the just-executed action.
Footnote 1: This kind of rapid and permanent weight change, often called long-term potentiation or LTP, has been documented in the nervous system. It is a characteristic of the NMDA receptor, but may not be exclusive to it. It is hypothesized to be implicated in memory formation. See [10] for details on the neurobiology, or [12] for a more detailed connectionist model of LTP in memory formation.
The first key observation is that when a training example occurs, external activation arrives at a verb unit, motor-parameter feature value units, and world-state feature value units. This three-way input is the local cue to the various triangle units that adaptation should occur—labelling and obeying never produce such three-way external input to the triangle units. Depending on the circumstances, there are three possible courses of action the net may take:

– Case 1: The training example’s features closely match those of an existing word sense. This case is detected by activation of the primary triangle unit of the matching sense—strong enough activation to dominate the winner-take-all competition. In this case, an abbreviated version of merging occurs. Rather than create a full-fledged initial word sense for the new example, only to merge it into the winning sense, the network simply “tweaks” the winning sense to accommodate the current example’s features. Conveniently, the winning sense’s primary triangle unit can detect this situation using locally available information, namely: (1) it is highly active; and (2) it is receiving activation on all three sides. The tweaking itself is a version of Hebb’s Rule [8]: the weights on connections to active value units are incrementally strengthened. With an appropriate weight update rule, this strategy can mimic the probability distributions learned by the model merging algorithm.

– Case 2: The training example’s features do not closely match any existing sense. This case is detected by failure of the winner-take-all mechanism to elevate any word sense above a threshold level. In this case, standard recruitment learning is employed. Pools of unrecruited triangle units are assumed to exist, pre-wired to function as either primary or subsidiary units in future word senses. After the winner-take-all process fails to produce a winner from the previously-recruited set of triangle units, recruitment of a single new primary triangle unit and a set of new subsidiary units occurs. The choice will depend on the connectivity and initial weights of the subsidiary units to the feature value units, but will also depend on the connections amongst the new units which are needed for the new sense to cohere. Once chosen, these units’ weights are set to reflect the currently active linking feature values, thereby forming a new word sense which essentially is a copy of the training example.

– Case 3: The training example’s features are a moderate match to two (or more) existing word senses. This case is detected by a protracted competition between the two partially active senses which cannot be resolved by the winner-take-all mechanism. Figure 6 depicts this case. As indicated by the darkened ovals, the training example is labelled “push” but involved medium force applied to a small size object—a combination which doesn’t quite match either existing sense. This case triggers recruitment of triangle units to form a new sense as described for case 2, but with an interesting twist. The difference is that the weights of the new subsidiary triangle units will reflect not only the linking features of the current training example, but also the distribution of values
Fig. 6. Connectionist merging of two word senses via recruitment of a new triangle unit circuit.
represented in the partially active senses. Thus, the newly recruited sense will be a true merge of the two existing senses (as well as the new training example). Figure 6 illustrates this outcome by the varying thicknesses on the connections to the value units. If you inspect these closely you will see that the new sense “push12” encodes broader correlations with the force and size features than those of the previous senses “push1” and “push2”. In other words, “push12” basically codes for dir = away, force not high. How can this transfer of information be accomplished, since there are no connections from the partially active senses to the newly recruited sense? The trick is to use indirect activation via the feature value units. The partially active senses, due to their partial activation, will deliver some activation to the value units—in proportion to their outgoing weights. Each value unit adds any such input from the various senses which connect to it. Consequently, each feature subnetwork will exhibit a distributed activation pattern reflecting an average of the distributions in the two partially active senses (plus extra activation for the value associated with the current action). This distribution will then be effectively copied into the weights in the newly recruited triangle units, using the usual weight update rule for those units. A final detail for case 3: to properly implement merging, the two original senses must be removed from the network and returned to the pool of unrecruited units. If they were not removed, the network would quickly accumulate an implausible number of word senses. After all, part of the purpose
of merging is to produce a compact model of each verb’s semantics. But there is another reason to remove the original senses. The new sense will typically be more general than its predecessors. If the original senses were kept, they would tend to “block” the new sense by virtue of their greater specificity (i.e. more peaked distributions). The new sense would rarely get a chance to become active, and its weights would weaken until it slipped back into unrecruited status. So to force the model to use the new generalization, the original senses must be removed. Fortunately, the cue for removal is available locally to these senses’ triangle units: the protracted period of partial activation, so useful for synthesizing the new sense, can serve double duty as a signal to these triangle units to greatly weaken their own weights, thus returning them to the unrecruited pool. The foregoing description is only a sketch, and activation functions have not been fully worked out. It is possible, for example, that the threshold distinguishing case 2 from case 3 could prove too delicate to set reliably for different languages. These issues are left for future work. Nonetheless, several consequences of this particular connectionist realization of a model-merging-like algorithm are apparent. First, the strategy requires presentation of an intermediate example to trigger merging of two existing senses. The architecture does not suddenly “notice” that two existing senses are similar and merge them. Another consequence of the architecture is that it never performs a series of merges as a “batch” as happens in model merging. On the other hand, the architecture does, in principle, allow each merge operation to combine more than two existing senses at a time. Indeed, technically speaking, the example illustrated in Figure 6 is a three-way merge of “push1,” “push2” and the current training example. The relative merits of these two strategies are another good question to pursue.
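The three cases can be summarized as a dispatch on locally available activation levels. The thresholds below are invented and, as noted above, setting them reliably is an open problem; this is a schematic of the decision logic only.

# Schematic dispatch among the three adaptation cases.  `responses` maps each existing
# word sense to the activation of its primary triangle unit for the current example.

STRONG = 0.8     # activation that dominates the winner-take-all competition (assumed)
WEAK = 0.4       # below this, a sense is not considered a partial match (assumed)

def adapt(responses, example):
    strong = [s for s, a in responses.items() if a >= STRONG]
    partial = [s for s, a in responses.items() if WEAK <= a < STRONG]
    if strong:                                  # Case 1: tweak the winning sense (Hebbian update)
        return ("tweak", strong[0])
    if len(partial) >= 2:                       # Case 3: recruit a new sense merging the
        return ("merge", partial)               #         partially active senses and the example
    return ("recruit", example)                 # Case 2: recruit a fresh sense for the example

print(adapt({"push1": 0.9, "push2": 0.1}, {"force": "low"}))
print(adapt({"push1": 0.5, "push2": 0.5}, {"force": "med"}))
print(adapt({"push1": 0.2, "push2": 0.1}, {"force": "high", "dir": "down"}))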
4 Conclusion
We have shown that the two seemingly connectionist-unfriendly aspects of model merging—its constructiveness and its use of a global optimization criterion—can be overcome by using recruitment learning and a modified winner-take-all mechanism. This, hopefully, elucidates the general point of this chapter. Of the many ways of constructing hybrid connectionist models, one seems particularly well suited for cognitive science. For both computational and explanatory purposes, it is convenient to do some (sometimes all) of our modeling at a computational level that is not explicitly connectionist. By requiring a biologically and computationally plausible reduction of all computational level primitives to the (structured) connectionist level, we retain the best features of connectionist models and promote the development of an integrated Cognitive Science. And it is a lot of fun.
References

1. David R. Bailey. When Push Comes to Shove: A Computational Model of the Role of Motor Control in the Acquisition of Action Verbs. PhD thesis, Computer Science Division, EECS Department, University of California at Berkeley, 1997.
2. David R. Bailey, Jerome A. Feldman, Srini Narayanan, and George Lakoff. Modeling embodied lexical development. In Proceedings of the 19th Cognitive Science Society Conference, pages 19–24, 1997.
3. Antonio R. Damasio and Daniel Tranel. Nouns and verbs are retrieved with differently distributed neural systems. Proceedings of the National Academy of Sciences, 90:4757–4760, 1993.
4. Joachim Diederich. Knowledge-intensive recruitment learning. Technical Report TR-88-010, International Computer Science Institute, Berkeley, CA, 1988.
5. Jerome A. Feldman. Dynamic connections in neural networks. Biological Cybernetics, 46:27–39, 1982.
6. Jerome A. Feldman and Dana Ballard. Connectionist models and their properties. Cognitive Science, 6:205–254, 1982.
7. Dean J. Grannes, Lokendra Shastri, Srini Narayanan, and Jerome A. Feldman. A connectionist encoding of schemas and reactive plans. Poster presented at 19th Cognitive Science Society Conference, 1997.
8. Donald O. Hebb. The Organization of Behavior. Wiley, New York, NY, 1949.
9. J.E. Hummel and I. Biederman. Dynamic binding in a neural network for shape recognition. Psychological Review, 99:480–517, 1992.
10. Gary Lynch and Richard Granger. Variations in synaptic plasticity and types of memory in corticohippocampal networks. Journal of Cognitive Neuroscience, 4(3):189–199, 1992.
11. Lokendra Shastri. Semantic Networks: An evidential formalization and its connectionist realization. Morgan Kaufmann, Los Altos, CA, 1988.
12. Lokendra Shastri. A model of rapid memory formation in the hippocampal system. In Proceedings of the 19th Cognitive Science Society Conference, pages 680–685, 1997.
13. V. Ajjanagadde and L. Shastri. Rules and variables in neural nets. Neural Computation, 3:121–134, 1991.
14. Andreas Stolcke and Stephen Omohundro. Best-first model merging for hidden Markov model induction. Technical Report TR-94-003, International Computer Science Institute, Berkeley, CA, January 1994.
15. Ron Sun. On variable binding in connectionist networks. Connection Science, 4:93–124, 1992.
16. Leslie Valiant. Circuits of the mind. Oxford University Press, New York, 1994.
Types and Quantifiers in shruti — A Connectionist Model of Rapid Reasoning and Relational Processing

Lokendra Shastri

International Computer Science Institute, Berkeley CA 94704, USA,
[email protected], WWW home page: http://icsi.berkeley/˜shastri
Abstract. In order to understand language, a hearer must draw inferences to establish referential and causal coherence. Hence our ability to understand language suggests that we are capable of performing a wide range of inferences rapidly and spontaneously. This poses a challenge for cognitive science: How can a system of slow neuron-like elements encode a large body of knowledge and perform inferences with such speed? shruti attempts to answer this question by demonstrating how a neurally plausible network can encode a large body of semantic and episodic facts, and systematic rule-like knowledge, and yet perform a range of inferences within a few hundred milliseconds. This paper describes a novel representation of types and instances in shruti that supports the encoding of rules and facts involving types and quantifiers, enables shruti to distinguish between hypothesized and asserted entities, and facilitates the dynamic instantiation and unification of entities during inference.
1 Introduction
In order to understand language, a hearer must draw inferences to establish referential and causal coherence, generate expectations, make predictions, and recognize the speaker’s intent. Hence our ability to understand language suggests that we are capable of performing a wide range of inferences rapidly, spontaneously and without conscious effort — as though they are a reflex response of our cognitive apparatus. In view of this, such reasoning has been described as reflexive reasoning [22]. This remarkable human ability poses a challenge for cognitive science and computational neuroscience: How can a system of slow neuron-like elements encode a large body of systematic knowledge and perform a wide range of inferences with such speed? The neurally plausible (connectionist) model shruti attempts to address the above challenge. It demonstrates how a network of neuron-like elements could encode a large body of structured knowledge and perform a variety of inferences within a few hundred milliseconds [3][22][14][23][20]. shruti suggests that the encoding of relational information (frames, predicates, etc.) is mediated by neural circuits composed of focal-clusters and the
dynamic representation and communication of relational instances involves the transient propagation of rhythmic activity across these clusters. A role-entity binding is represented within this rhythmic activity by the synchronous firing of appropriate cells. Systematic mappings — and other rule-like knowledge — are encoded by high-efficacy links that enable the propagation of rhythmic activity across focal-clusters, and a fact in long-term memory is a temporal pattern matcher circuit. The possible role of synchronous activity in dynamic neural representations has been suggested by other researchers (e.g., [28]), but shruti offers a detailed computational account of how synchronous activity can be harnessed to solve problems in the representation and processing of high-level conceptual knowledge. A rich body of neurophysiological evidence has emerged suggesting that synchronous activity might indeed play an important role in neural computation [26] and several models using synchrony to solve the binding problem during inference have been developed (e.g., [9]). As an illustration of shruti’s inferential ability consider the following narrative: “John fell in the hallway. Tom had cleaned it. He got hurt.” Upon being presented with the above narrative, shruti reflexively infers the following: Tom had mopped the floor. The floor was wet. John was walking in the hallway. John slipped and fell because the floor was wet. John got hurt because he fell. Notice that shruti draws inferences required to establish referential and causal coherence. It explains John’s fall by making the plausible inference that John was walking in the hallway and he slipped because the floor was wet. It also infers that John got hurt because of the fall. Moreover, it determines that “it” in the second sentence refers to the hallway, and that “He” in the third sentence refers to John, and not to Tom. The representational and inferential machinery developed in shruti can be applied to other problems involving relational structures, systematic but context-sensitive mappings between such structures, and rapid interactions between persistent and dynamic structures. The shruti model meshes with the “Neural Theory of Language” project [4] on language acquisition and provides neurally plausible solutions to several representational and computational requirements arising in the project. The model also offers a plausible framework for realizing the “Interpretation as Abduction” approach to language understanding described in [8]. Moreover, shruti’s representational machinery has been extended to realize control and coordination mechanisms required for modeling actions and reactive plans [24]. This paper describes a novel representation of types and instances in shruti. This representation supports the encoding of rules and facts involving types and
quantifiers, and at the same time allows shruti to distinguish between hypothesized entities and asserted entities. This in turn facilitates the dynamic instantiation and unification of entities and relational instances during inference. For a detailed description of various aspects of shruti’s representational machinery refer to [22][20][25]. The rest of the chapter is organized as follows: Section 2 provides an overview of how relational knowledge is encoded in shruti. Section 3 discusses the representation of types and instances. Section 4 describes the representation of dynamic bindings. Section 5 explains how phase separation between incompatible entities is enforced in the type hierarchy via inhibitory mechanisms, and how phases are merged to unify entities. Section 6 describes the associative potentiation of links in the type hierarchy. Next, Section 7 reviews the encoding of facts, and Section 8 outlines the encoding of rules (or mappings) between relational structures. A simple illustrative example is presented in Section 9.

Footnote 1: For other solutions to the binding problem within a structured connectionist framework see [11][5][27].
Footnote 2: Each sentence in the narrative is conveyed to shruti as a set of dynamic bindings (see Section 4). The sentences are presented in the order of their occurrence in the narrative. After each sentence is presented, the network is allowed to propagate activity for a fixed number of cycles.
Footnote 3: A detailed discussion of this example appears in [25].
Fig. 1. An overview of shruti’s representational machinery.
2 An Overview of shruti’s Representational Machinery
All long-term (persistent) knowledge is encoded in shruti via structured networks of nodes and links. Such long-term knowledge includes generic relations, instances, types, general rules, and specific facts. In contrast, dynamic aspects of knowledge are represented via the activity of nodes, the propagation of activity along excitatory and inhibitory links, and the integration of incident activity at nodes. Such dynamic knowledge includes active (dynamic) facts and bindings, propagation of bindings, fusion of evidence, competition among incompatible entities, and the development of coherence. Figure 1 provides an overview of some of the key elements of shruti’s representational machinery. The network fragment shown in the figure depicts a partial encoding of the following rules, facts, instances, and types:

1. ∀(x:Agent y:Agent z:Thing) give(x,y,z) ⇒ own(y,z) [800,800];
2. ∀(x:Agent y:Thing) buy(x,y) ⇒ own(x,y) [900,980];
3. EF: give(John, Mary, Book-17) [1000];
4. TF: ∀(x:Human y:Book) buy(x,y) [50];
5. is-a(John, Human);
6. is-a(Mary, Human);
7. is-a(Human, Agent);
8. is-a(Book-17, Book).
Item (1) is a rule which captures a systematic relationship between giving and owning. It states that when an entity x of type Agent gives an entity z of type Thing to an entity y of type Agent, then the latter comes to own it. Similarly, item (2) is a rule which states that whenever any entity of the type Agent buys something, it comes to own it. The pair of weights [a,b] associated with a rule has the following interpretation: a indicates the degree of evidential support for the antecedent being the probable cause (or explanation) of the consequent, and b indicates the degree of evidential support for the consequent being a probable effect of the antecedent. Item (3) corresponds to a long-term “episodic” fact (or E-fact) which states that John gave Mary a specific book (Book-17). Item (4) is a long-term “taxon” fact (or T-fact) which states that the prior evidential support for a given (random) human buying a given (random) book is 50. Item (5) states that John is a human. Items (6–8) are read similarly.

Footnote 4: Weights in shruti lie in the interval [0,1000]. The mapping of probabilities and evidential supports to weights in shruti is non-linear and loosely defined. The initial weights can be set approximately, and subsequently fine-tuned to model a given domain via learning.
Footnote 5: The time required for drawing an inference is estimated by c ∗ π, where c is the number of cycles of rhythmic activity it takes shruti to draw an inference (see Section 9), and π is the period of rhythmicity. A plausible value of π is 25 milliseconds [22].

Given the above knowledge, shruti can rapidly draw inferences of the following sort within a few hundred milliseconds (numbers in [] indicate strength of inference):
1. own(Mary, Book-17) [784]; Mary owns a particular book (referred to as Book-17).
2. ∃x:Book own(Mary,x) [784]; Mary owns a book.
3. ∃(x:Agent y:Thing) own(x,y) [784]; Some agent owns something.
4. buy(Mary,Book-1) [41]; Mary bought a particular book (referred to as Book-1).
5. is-a(Mary, Agent); Mary is an agent.

Figure 2 depicts a schematized response of the shruti network shown in Figure 1 to the query “Does Mary own a book?” (∃ x:Book own(Mary, x)?). We will revisit this activation trace in Section 9 after we have reviewed shruti’s representational machinery, and discussed the encoding of instances and types in more detail. For now it suffices to observe that the query is conveyed to the network by activating appropriate “?” nodes (?:own, ?:Mary and ?e:Book) and appropriate role nodes (owner and o-obj). This leads to a propagation of activity in the network which eventually causes the activation of the nodes +:own and +:Book-17. This signals an affirmative answer (Yes, Mary owns Book-17). Note that bindings between roles and entities are expressed by the synchronous activation of bound role and entity nodes.
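For readers who want the knowledge-level content of this example in one place, the sketch below restates the rules, facts and type hierarchy of Figure 1 as plain data and derives inferences (1) and (5) symbolically. This is only a paraphrase of what the network computes; shruti itself arrives at such answers by propagating rhythmic activity, not by the symbol manipulation shown here.

# The knowledge of Figure 1 restated as plain data, with the one-step rule application
# behind inference (1) and the type-hierarchy walk behind inference (5).

is_a = {"John": "Human", "Mary": "Human", "Human": "Agent", "Book-17": "Book"}
rules = [("give", "own", {"recip": "owner", "g-obj": "o-obj"}),   # give(x,y,z) => own(y,z)
         ("buy",  "own", {"buyer": "owner", "b-obj": "o-obj"})]   # buy(x,y)    => own(x,y)
e_facts = [("give", {"giver": "John", "recip": "Mary", "g-obj": "Book-17"})]

def derive(facts):
    """Apply each rule once to each matching fact (a stand-in for one wave of activity)."""
    new = []
    for pred, bindings in facts:
        for ante, cons, role_map in rules:
            if pred == ante:
                new.append((cons, {role_map[r]: v for r, v in bindings.items() if r in role_map}))
    return new

def ancestors(entity):
    chain = []
    while entity in is_a:
        entity = is_a[entity]
        chain.append(entity)
    return chain

print(derive(e_facts))     # [('own', {'owner': 'Mary', 'o-obj': 'Book-17'})]  (inference 1)
print(ancestors("Mary"))   # ['Human', 'Agent']                                (inference 5)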
2.1 Different Node Types and Their Computational Behavior
Nodes in shruti are computational abstractions and correspond to small ensembles of cells. Moreover, a connection from a node A to a node B corresponds to several connections from cells in the A ensemble to cells in the B ensemble. shruti makes use of four node types: m-ρ nodes, τ-and nodes, τ-or nodes of type 1, and τ-or nodes of type 2. This classification is based on the computational properties of nodes, and not on their functional or representational role. In particular, nodes serving different representational functions can be of the same computational type. The computational behavior of m-ρ nodes and τ-and nodes is described below: m-ρ nodes: An m-ρ node with threshold n becomes active and fires upon receiving n synchronous inputs. Here synchrony is defined relative to a window of temporal integration ω. Thus all inputs arriving at a node with a lead/lag of no more than ω are deemed to be synchronous. Thus an m-ρ node A receiving above-threshold periodic inputs from m-ρ nodes B and C (where B and C may be firing in different phases) will respond by firing in phase with both B and C. A similar node type has been described in [15]. A scalar level (strength) of activity is associated with the response of an m-ρ node. This level of activity is computed by the activation combination function
Footnote 6: The response-level of an m-ρ node in a phase can be governed by the number of cells in the node’s cluster firing in that phase.
Fig. 2. A schematized activation trace of selected nodes for the query own(Mary,Book-17)?.
(ECF) associated with the node. Some ECFs used in the past are sum, max, and sigmoid. Other combination functions are under investigation [25]. τ-and nodes: A τ-and node becomes active on receiving an uninterrupted and above-threshold input over an interval ≥ π_max, where π_max is a system parameter. Computationally, this sort of input can be idealized as a pulse whose amplitude exceeds the threshold, and whose duration is greater than or equal to π_max. Physiologically, such an input may be identified with a high-frequency burst of spikes. Thus a τ-and node behaves like a temporal AND node and becomes active upon receiving adequate and uninterrupted inputs over an interval π_max. Upon becoming active, such a node produces an output pulse of width ≥ π_max. The level of output activation is determined by the ECF associated with the node for combining the weighted inputs arriving at the node. The model also makes use of inhibitory modifiers that can block the flow of activation along a link. This blocking is phasic and lasts only for a duration ω.
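The two firing conditions can be sketched as follows (our own illustration, not shruti code; the numeric values of ω and π_max are placeholders, and spike times are in milliseconds).

# Sketch of the two firing conditions described above.

OMEGA = 5.0        # window of temporal integration for m-rho nodes (assumed value)
PI_MAX = 25.0      # minimum duration of sustained input for tau-and nodes (assumed value)

def m_rho_fires(input_spikes, threshold):
    """Fire if `threshold` inputs arrive within a single window of width OMEGA."""
    spikes = sorted(input_spikes)
    for i in range(len(spikes)):
        window = [t for t in spikes if spikes[i] <= t <= spikes[i] + OMEGA]
        if len(window) >= threshold:
            return True
    return False

def tau_and_fires(input_level, duration, threshold=0.5):
    """Fire if an above-threshold input is sustained for at least PI_MAX."""
    return input_level >= threshold and duration >= PI_MAX

print(m_rho_fires([10.0, 11.5, 13.0], threshold=3))   # True: three inputs within one window
print(m_rho_fires([10.0, 30.0, 55.0], threshold=3))   # False: inputs fall in different phases
print(tau_and_fires(input_level=0.9, duration=30.0))  # True: burst longer than pi_max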
2.2 Encoding of Relational Structures
Each relation (in general, a frame or a predicate) is represented by a focal-cluster which serves as an anchor for the complete encoding of a relation. Such focal-clusters are depicted as dotted ellipses in Figure 1. The focal-cluster for the relation give is depicted toward the top and the left of Figure 1. For the purpose of this example, it is assumed that give has only three roles: giver, recipient and give-object. Each of these roles is encoded by a separate node labeled giver, recip and g-obj, respectively. The focal-cluster of give also includes an enabler node labeled ? and two collector nodes labeled + and –. The positive and negative collectors are mutually inhibitory (inhibitory links are depicted by filled blobs). In general, the focal-cluster for an n-place relation contains n role nodes, one enabler node, one positive collector node and one negative collector node. We will refer to the enabler, the positive collector, and the negative collector of a relation P as ?:P, +:P, and –:P, respectively. The collector and enabler nodes of relations behave like τ-and nodes. Role nodes and the collector and enabler nodes of instances behave like m-ρ nodes. Semantic Import of Enabler and Collector Nodes. Assume that the roles of a relation P have been dynamically bound to some fillers and thereby represent an active instance of P (we will see how this is done shortly). The activation of the enabler ?:P means that the system is seeking an explanation for the active instance of P. In contrast, the activation of the collector +:P means that the system is affirming the active instance of P. Similarly, the activation of the collector -:P means that the system is affirming the negation of the active instance of P. The activation levels of ?:P, +:P and -:P signify the strength with which information about P is being sought, believed, or disbelieved, respectively. For example, if the roles giver, recipient and object are dynamically bound to John, Mary, and a book, respectively, then the activation of ?:give means that the system is asking whether “John gave Mary a book” matches a fact in memory, or whether it can be inferred from what is known. In contrast, the
activation of +:P with the same role bindings means that the system is asserting “John gave Mary a book”. Degrees of Belief: Support, no Information and Contradiction. The levels of activation of the positive and negative collectors of a relation measure the effective degree of support offered by the system to the currently active relational instance. Thus the activation levels of the collectors +:P and -:P encode a graded belief ranging continuously from no on the one extreme (only -:P is active), to yes on the other (only +:P is active), and don’t know in between (neither collector is very active). If both the collectors receive comparable and strong activation then a contradiction is indicated. Significance of Collector to Enabler Connections. Links from the collector nodes to the enabler node of a relation convert a dynamic assertion of a relational instance into a query about the assertion. Thus the system continually seeks an explanation for active assertions. The weight on the link from +:P (or -:P) to ?:P is a sum of two terms. The first term is proportional to the system’s propensity for seeking explanations — the more skeptical the system, the higher the weight. The second term is inversely proportional to the probability of occurrence of a positive (or negative) instance of P — the more unlikely a fact, the more intense the search for an explanation. The links from the collectors of a relation to its enabler also create positive feedback loops of activation and thereby create stable coalitions of active cells under appropriate circumstances. If the system seeks an explanation for an instance of P and finds support for this instance, then a stable coalition of activity arises consisting of ?:P, other ensembles participating in the explanation, +:P and finally ?:P. Such activity leads to priming (see Section 6), and the formation of episodic memories (see [19,21]).
3 Encoding Instances and Types
The encoding of types and instances is illustrated in Figure 3. The focal-cluster of each entity consists of a ? and a + node. In contrast, the focal-cluster of each type consists of a pair of ? nodes (?e and ?v) and a pair of + nodes (+e and +v). While the nodes +v and ?v participate in the expression of knowledge (facts and attributes) involving the whole type, the nodes +e and ?e participate in the encoding of knowledge involving particular instances of the type. Thus nodes v and e signify universal and existential quantification, respectively. All nodes participating in the representation of types are m-ρ nodes.
3.1 Interconnections within Focal-Clusters of Instances and Types
The interconnections shown in Figure 3 among nodes within the focal-cluster of an instance and among nodes within the focal-cluster of a type lead to the following functionality (I refers to an instance, T1 refers to a type):
– Because of the link from +:I to ?:I, any assertion about an instance leads to a query or a search for a possible explanation of the assertion.
– Because of the link from +v:T1 to +e:T1, any assertion about the type leads to the same assertion being made about an unspecified member of the type (e.g., “Humans are mortal” leads to “there exists a mortal human”).
– Because of the link from +v:T1 to ?v:T1, any assertion about the whole type leads to a query or search for a possible explanation of the assertion (e.g., the assertion “Humans are mortal” leads to the query “Are humans mortal?”).
– Because of the link from +e:T1 to ?e:T1, any assertion about an instance of the type leads to a query or search for a possible instance that would verify the assertion (e.g., the assertion “There is a human who is mortal” leads to the query “Is there a human who is mortal?”).
– Because of the link from ?e:T1 to ?v:T1, any query or search for an explanation about a member of the type leads to a query about the whole type (one way of determining whether “A human is mortal” is to find out whether “Humans are mortal”).
– Moreover, paths formed by the above links lead to other behaviors. For example, given the path from +v:T1 to ?e:T1, any assertion about the whole type leads to a query or search for an explanation of the assertion applied to a given subtype/member of the type (e.g., “Humans are mortal” leads to the query “Is there a human who is mortal?”).

Note that the closure between the “?” and “+” nodes is provided by the matching of facts (see Section 7).
3.2 The Interconnections Between Focal-Clusters of Instances and Types
The interconnections between nodes in the focal-clusters of instances and types lead to the following functionality:

– Because of the link from +v:T1 to +:I, any assertion about the type T1 leads to the same assertion about the instance I (“Humans are mortal” leads to “John is mortal”).
– Because of the link from +:I to +e:T1, any assertion about I leads to the same assertion about a member of T1 (“John is mortal” leads to “A human is mortal”).
– Because of the link from ?:I to ?v:T1, a query about I leads to a query about T1 as a whole (one way of determining whether “John is mortal” is to determine whether “Humans are mortal”).
– Because of the link from ?e:T1 to ?:I, a query about a member of T1 leads to a query about I (one way of determining whether “A human is mortal” is to determine whether “John is mortal”).
Footnote 7: shruti infers the existence of a mortal human given that all humans are mortal, though this is not entailed in classical logic.
Similarly, interconnections between sub- and supertypes lead to the following functionality:

– Because of the link from +v:T2 to +v:T1, any assertion about the supertype T2 leads to the same assertion about the subtype T1 (“Agents can cause change” leads to “Humans can cause change”).
– Because of the link from +e:T1 to +e:T2, any assertion about a member of T1 leads to the same assertion about a member of T2 (“Humans are mortal” leads to “mortal agents exist”).
– Because of the link from ?v:T1 to ?v:T2, a query about T1 as a whole leads to a query about T2 as a whole (one way of determining whether “Humans are mortal” is to determine whether “Agents are mortal”).
– Because of the link from ?e:T2 to ?e:T1, a query about a member of T2 leads to a query about a member of T1 (one way of determining whether “an Agent is mortal” is to determine whether “a Human is mortal”).
Fig. 3. The encoding of types and (specific) instances. See text for details.
Fig. 4. The rhythmic activity representing the dynamic bindings give(John, Mary, a-Book). Bindings are expressed by the synchronous activity of bound role and entity nodes.
4 Encoding of Dynamic Bindings
The dynamic encoding of a relational instance corresponds to a rhythmic pattern of activity wherein bindings between roles and entities are represented by the synchronous firing of appropriate role and entity nodes. With reference to Figure 1, the rhythmic pattern of activity shown in Figure 4 is the dynamic representation of the relational instance (give: ⟨giver=John⟩, ⟨recipient=Mary⟩, ⟨give-object=a-Book⟩) (i.e., “John gave Mary a book”). Observe that the collector ensembles +:John, +:Mary and +e:Book are firing in distinct phases, but in phase with the roles giver, recip, and g-obj, respectively. Since +:give is also firing, the system is making an assertion. The dynamic representation of the query “Did John give Mary a book?” would be similar except that the enabler node would be active and not the collector node. The rhythmic activity underlying the dynamic representation of relational instances is expected to be highly variable, but it is assumed that over short durations — ranging from a few hundred milliseconds to about a second — such activity may be viewed as being composed of k interleaved quasi-periodic activities where k equals the number of distinct entities filling roles in active relational instances. The period of this transient activity is at least k ∗ ω_int where ω_int is the window of synchrony, i.e., the amount by which two spikes can lead/lag and still be treated as being synchronous. As speculated in [22], the activity of role and entity cells engaged in dynamic bindings might correspond to γ band activity (∼ 40 Hz).
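A toy rendering of this phase-coded representation (ours; real shruti activity is continuous and noisy rather than a fixed timetable) assigns each of the k entities one phase slot and lets each role fire in the slot of its filler.

# Sketch of the phase-coded bindings of Figure 4: each distinct entity occupies one
# phase slot of the cycle, and a role is bound to an entity by firing in that slot.

OMEGA_INT = 25.0                                 # window of synchrony, in ms (assumed)
entities = ["John", "Mary", "a-Book"]            # k = 3 distinct role-fillers
period = len(entities) * OMEGA_INT               # period of the transient rhythmic activity

phase_of = {e: i * OMEGA_INT for i, e in enumerate(entities)}    # slot start within a cycle
bindings = {"giver": "John", "recip": "Mary", "g-obj": "a-Book"}

def spike_times(node, n_cycles=3):
    """Firing times of a role or entity node: once per cycle, in its phase slot."""
    phase = phase_of[bindings.get(node, node)]   # a role fires in the phase of its filler
    return [cycle * period + phase for cycle in range(n_cycles)]

print(spike_times("John"))      # [0.0, 75.0, 150.0]
print(spike_times("giver"))     # same phase as John, expressing the giver=John binding
print(spike_times("recip"))     # in phase with Mary instead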
5 Mutual Exclusion and Collapsing of Phases
Instances in the type hierarchy can be part of a phase-level mutual exclusion cluster (ρ-mex cluster). The + node of every entity in a ρ-mex cluster sends inhibitory links to, and receives inhibitory links from, the + node of all other
entities in the cluster. As a result of this mutual inhibition, only the most active entity within a ρ-mex cluster can remain active in any given phase. A similar ρ-mex cluster can be formed by +e: nodes of mutually exclusive types as well as +v: nodes of mutually exclusive types. Another form of inhibitory interaction between siblings in the type hierarchy leads to an “explaining away” phenomenon in shruti. Let us illustrate this inhibitory interaction with reference to the type hierarchy shown in Figure 1. The link from +:John to +e:Human sends an inhibitory modifier to the link from ?e:Human to ?:Mary. Similarly, the link from +:Mary to +e:Human sends an inhibitory modifier to the link from ?e:Human to ?:John (such modifiers are not shown in the figure). Analogous connections exist between all siblings in the type hierarchy. As a result of such inhibitory modifiers, if ?e:Human propagates activity to ?:John and ?:Mary in phase ρ1, then the strong activation of +:John in phase ρ1 attenuates the activity arriving from ?e:Human into ?:Mary. In essence, the success of the query “Is it John?” in the context of the query “Is it human?” makes the query “Is it Mary?” unimportant. This use of inhibitory connections for explaining away is motivated by [2]. As discussed in Section 8, shruti supports the introduction of “new” phases during inference. In addition, shruti also allows multiple phases to coalesce into a single phase during inference. In the current implementation, such phase unification can occur under two circumstances. First, phase collapsing can occur whenever a single entity dominates multiple phases (for example, if the same entity comes to be the answer of multiple queries). Second, phase collapsing can occur if two unifiable instantiations of a relation arise within a focal-cluster. For example, an assertion own(Mary, Book-17) alongside the query ∃ x:Book own(Mary,x)? (“Does Mary own a book?”) will result in a merging of the two phases for “a book” and “Book-17”. Note that the type hierarchy will map the query ∃ x:Book own(Mary,x)? into own(Mary,Book-17)?, and hence, lead to a direct match between own(Mary,Book-17) and own(Mary,Book-17)?.
6 Priming: Associative Short-Term Potentiation of Weights
Let I be an instance of T1. If ?:I receives activity from ?e:T1 and concurrent activity from +:I, then the weight of the link from ?e:T1 to ?:I increases (i.e., gets potentiated) for a short duration. Let T2 be a supertype of T1. If ?e:T1 receives activity from ?e:T2, and concurrent activity from +e:T1, then the weight of the link from ?e:T2 to ?e:T1 also increases for a short duration. Analogous weight increases can occur along the link from ?v:T1 to ?v:T2 if ?v:T2 receives concurrent activity from +v:T2 and ?v:T1. Similarly, the weight of the link from ?:I to ?v:T1 can undergo a short-term increase if ?v:T1 receives concurrent activity from +v:T1 and ?:I.

Footnote 8: This is modeled after the biological phenomenon of short-term potentiation (STP) [6].
Footnote 9: In principle, short-term weight increases can occur along the link from +e:T1 to +e:T2 if +e:T2 receives concurrent activity from +v:T2 and +e:T1. Similarly, the weight of the link from +:I to +e:T1 can undergo a short-term increase if +e:T1 receives concurrent activity from +v:T1 and +:I.
The potentiation of link weights can affect the system’s response time as well as the response itself. Let us refer to an entity whose incoming links are potentiated as a “primed” entity. Since a primed entity would become active sooner than an unprimed entity, a query whose answer is a primed entity would be answered faster (all else being equal). Furthermore, all else being equal, a primed entity would dominate an unprimed entity in a ρ-mex cluster, and hence, if a primed and an unprimed entity compete to be the filler of a role, the primed entity would emerge as the role-filler.
7 Facts in Long-Term Memory: E-Facts and T-Facts
Currently shruti encodes two types of relational instances (i.e., facts) in its long-term memory (LTM): episodic facts (E-facts) and taxon facts (T-facts). While an E-fact corresponds to a specific instance of a relation, a T-fact corresponds to a distillation or statistical summary of various instances of a relation (e.g., “Days tend to be hot in June”). An E-fact E1 associated with a relation P becomes active whenever all the dynamic bindings specified in the currently active instantiation of P match those encoded in E1. Thus an E-fact is sensitive to any mismatch between the bindings it encodes and the currently active dynamic bindings. In contrast, a T-fact is sensitive only to matches between its bindings and the currently active dynamic bindings. Note that both E- and T-facts tolerate missing bindings, and hence, respond to partial cues. The encoding of E-facts is described below – the encoding of T-facts is described in [20]. Figure 5 illustrates the encoding of E-facts love(John, Mary) and ¬love(Tom, Susan). Each E-fact is encoded using a distinct fact node (these are labeled F1 and F2 in Figure 5). A fact node sends a link to the + or – collector of the relation depending on whether the fact encodes a positive or a negative assertion. Given the query love(John,Mary)?, the E-fact node F1 will become active and activate +:love, +:John and +:Mary nodes indicating a “yes” answer to the question. Similarly, given the query love(Tom,Susan)?, the E-fact node F2 will become active and activate –:love, +:Tom and +:Susan nodes indicating a “no” answer to the query. Finally, given the query love(John,Susan)?, neither +:love nor –:love would become active, indicating that the system can neither affirm nor deny whether John loves Susan (the nodes +:John and +:Susan will also not receive any activation). Types can also serve as role-fillers in E-facts (e.g., Dog in “Dogs chase cats”) and so can unspecified instances of a type (e.g., a dog in “a dog bit John”). Such E-facts are encoded by using the appropriate nodes in the focal-cluster for “Dog”. In general, if an existing instance, I, is a role-filler in a fact, then ?:I provides the input to the fact cluster and +:I receives inputs from the binder node in the fact cluster. If the whole type T is a role-filler in a fact, then ?v:T provides the input to the fact cluster and +v:T receives inputs from the binder node in the fact cluster. If an unspecified instance of type T is a role-filler in a long-term fact, then a new instance of type T is created and its “?” and “+” nodes are used to encode the fact.
Fig. 5. (a) The encoding of E-facts: love(John,Mary) and ¬love(Tom,Susan). The pentagon shaped nodes are “fact” nodes and are of type τ -and. The dark blobs denote inhibitory modifiers. The firing of a role node without the synchronous firing of the associated filler node blocks the activation of the fact node. Consequently, the E-fact is blocked whenever there is a mismatch between the dynamic binding of a role and its binding specified in the E-fact. (b) Links from the fact node back to role-fillers are shown only for the fact love(John,Mary) to avoid clutter. The circular nodes are m-ρ nodes with a high threshold which is satisfied only when both the role node and the fact node are firing. Consequently, a binder node fires in phase with the associated role node, if the fact node is firing. Weights α1 and α2 indicate strengths of belief.
8 Encoding of Rules
A rule is encoded via a mediator focal-cluster that mediates the flow of activity and bindings between antecedent and consequent clusters (mediators are depicted as parallelograms in Figure 1). A mediator consists of a single collector (+), an enabler (?), and as many role-instantiation nodes as there are distinct variables in the rule. A mediator establishes links between nodes in the antecedent and consequent clusters as follows: (i) The roles of the consequent and antecedent relation(s) are linked via appropriate role-instantiation nodes in the mediator. This linking reflects the correspondence between antecedent and consequent roles specified in the rule. (ii) The enabler of the consequent is connected to the enabler of the antecedent via the enabler of the mediator. (iii) The appropriate (+/–) collector of the antecedent relation is linked to the appropriate (+/–) collector of the consequent relation via the collector of the mediator. A collector-to-collector link originates at the + (–) collector of an antecedent relation if the relation appears in its positive (negated) form in the antecedent. The link terminates at the + (–) collector of the consequent relation if the relation appears in a positive (negated) form in the consequent. (The design of the mediator was motivated, in part, by discussions the author had with Jerry Hobbs.)
Consider the encoding of the following rule in Figure 1: ∀ x:agent y:agent z:thing give(x,y,z) ⇒ own(y,z) [800,800]. This rule is encoded via the mediator, med1, containing three role-instantiation nodes r1, r2, and r3. The weight on the link from ?:med1 to ?:give indicates the degree of evidential support for give being the probable cause (or explanation) of own. The weight on the link from +:med1 to +:own indicates the degree of evidential support for own being a probable effect of give. These strengths are defined on a non-linear scale ranging from 0 to 1000.

A role-instantiation node is an abstraction of a neural circuit with the following functionality. If a role-instantiation node receives activation from the mediator enabler and one or more consequent role nodes, it simply propagates the activity onward to the connected antecedent role nodes. If, on the other hand, the role-instantiation node receives activity only from the mediator enabler, it sends activity to the ?e node of the type specified in the rule as the type restriction for this role. This causes the ?e node of this type to become active in an unoccupied phase. (A similar phase-allocation mechanism is used in [1] for realizing function terms. Currently, an unoccupied phase is assigned in software, but eventually this will result from inhibitory interactions between nodes in the type hierarchy.) The ?e node of the type conveys activity in this phase to the role-instantiation node, which in turn propagates this activity to the connected antecedent role nodes. The links between role-instantiation nodes and nodes in the type hierarchy have not been shown in Figure 1. shruti can encode rules involving multiple antecedents and consequents (see [20]). Furthermore, shruti allows a bounded number of instantiations of the same predicate to be simultaneously active during inference [14].
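The behaviour of a role-instantiation node can be caricatured in a few lines. The sketch below is an interpretation, not Shastri's implementation; the data structures, role names and the phase counter are assumptions introduced here.

    # A mediator copies role bindings from the consequent to the antecedent;
    # an unbound antecedent role is given a fresh phase drawn from its type
    # restriction, mimicking the role-instantiation nodes described above.
    from itertools import count

    fresh_phase = count(3)  # phases 1 and 2 are assumed to be already in use

    def fire_mediator(consequent_bindings, role_map, type_restrictions):
        """role_map: antecedent role -> consequent role (or None)."""
        antecedent_bindings = {}
        for ante_role, cons_role in role_map.items():
            if cons_role in consequent_bindings:
                # activity from a consequent role node is simply propagated onward
                antecedent_bindings[ante_role] = consequent_bindings[cons_role]
            else:
                # only the enabler is active: activate ?e of the restricting type
                # in an unoccupied phase (an unspecified instance of the type)
                antecedent_bindings[ante_role] = (type_restrictions[ante_role],
                                                  next(fresh_phase))
        return antecedent_bindings

    # Rule give(x, y, z) => own(y, z): posing own(Mary, Book-17)? yields the
    # antecedent query give(<some agent>, Mary, Book-17)?
    print(fire_mediator({"owner": "Mary", "oobj": "Book-17"},
                        {"giver": None, "recip": "owner", "g-obj": "oobj"},
                        {"giver": "agent"}))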
9 An Example of Inference
Figure 2 depicts a schematized response of the shruti network shown in Figure 1 to the query "Does Mary own a book?" (∃ x:Book own(Mary, x)?). This query is posed by activating the ?:Mary and ?e:book nodes, the role nodes owner and oobj, and the enabler ?:own, as shown in Figure 2. We will refer to the phases of activation of ?:Mary and ?e:book as ρ1 and ρ2, respectively. Activation from the focal-cluster for own reaches the mediator structure of rules (1) and (2). Consequently, nodes r2 and r3 in the mediator med1 become active in phases ρ1 and ρ2, respectively. Similarly, nodes s1 and s2 in the mediator med2 become active in phases ρ1 and ρ2, respectively. At the same time, the activation from ?:own activates the enablers ?:med1 and ?:med2 in the two mediators. Since r1 does not receive activation from any of the roles in its consequent's focal-cluster (own), it activates the node ?e:agency in the type hierarchy in a free phase (say ρ3). The activation from nodes r1, r2 and r3 reaches the roles giver, recip and g-obj in the give focal-cluster, respectively. Similarly, activation from nodes s1 and s2 reaches the roles buyer and b-obj in the buy focal-cluster, respectively. In essence, the system has created new bindings for give and buy wherein giver is bound to an undetermined agent, recipient is bound to Mary, g-obj is bound to a book, buyer is bound to Mary, and b-obj is bound to a book. These bindings together with the activation of the enabler nodes ?:give and ?:buy encode two new queries: "Did some agent give Mary a book?" and "Did Mary buy a book?". At the same time, activation travels in the type hierarchy and thereby maps the query to a large number of related queries such as "Did a human give Mary a book?", "Did John give Mary Book-17?", "Did Mary buy all books?", etc. The E-fact give(John, Mary, Book-17) now becomes active as a result of matching the query give(John, Mary, Book-17)? and causes +:give to become active. This in turn causes +:med1 to become active and transmit activity to +:own. This results in an affirmative answer to the query and creates a reverberant loop of activity involving the clusters own, med1, give, the fact node F1, and the entities John, Mary, and Book-17.
10 Conclusion
The type structure described above, together with other enhancements such as support for negation, priming, and evidential rules, allows shruti to support a rich set of inferential behaviors, and perhaps sheds some light on the nature of symbolic neural representations. shruti identifies a number of constraints on the representation and processing of relational knowledge and predicts the capacity of the active (working) memory underlying reflexive reasoning [17][22]. First, on the basis of neurophysiological data pertaining to the occurrence of synchronous activity in the γ band, shruti leads to the prediction that a large number of facts (relational instances) can be active simultaneously and a large number of rules can fire in parallel during an episode of reflexive reasoning. However, the number of distinct entities participating as role-fillers in these active facts and rules must remain very small (≈ 7). Recent experimental findings as well as computational models lend support to this prediction (e.g., [12][13]). Second, since the quality of synchronization degrades as activity propagates along a chain of cell clusters, shruti predicts that as the depth of inference increases, binding information is gradually lost and systematic inference reduces to a mere spreading of activation. Thus shruti predicts that reflexive reasoning has a limited inferential horizon. Third, shruti predicts that only a small number of instances of any given relation can be active simultaneously.

A number of issues remain open. These include the encoding of rules and facts involving complex nesting of quantifiers. While the current implementation supports multiple existential and universal quantifiers, it does not support the occurrence of existential quantifiers within the scope of a universal quantifier. Also, the current implementation does not support the encoding of complex types such as radial categories [10]. Another open issue is the learning of new relations and rules (mappings). In [18] it is shown that a recurrent network can learn rules involving variables and semantic restrictions using gradient-descent learning. While this work serves as a proof of concept, it does not address issues of
scaling and catastrophic interference. Several researchers are pursuing solutions to the problem of learning in the context of language acquisition (e.g., [16][4][7]). In collaboration with M. Cohen, B. Thompson, and C. Wendelken, the author is also augmenting shruti to integrate the propagation of belief with the propagation of utility. The integrated system will be capable of seeking explanations, making predictions, instantiating goals, constructing reactive plans, and triggering actions that maximize the system's expected future utility.

Acknowledgment. This work was partially funded by grants NSF SBR-9720398 and ONR N00014-93-1-1149, and subcontracts from Cognitive Technologies Inc. related to contracts ONR N00014-95-C-0182 and ARI DASW01-97-C-0038. Thanks to M. Cohen, J. Feldman, D. Grannes, J. Hobbs, D.R. Mani, B. Thompson, and C. Wendelken.
References

1. Ajjanagadde, V.: Reasoning with function symbols in a connectionist network. In: Proceedings of the 12th Conference of the Cognitive Science Society, Cambridge, MA (1990) 285–292
2. Ajjanagadde, V.: Abductive reasoning in connectionist networks: Incorporating variables, background knowledge, and structured explananda. Technical Report WSI 91-6, Wilhelm-Schickard Institute, University of Tübingen, Germany (1991)
3. Ajjanagadde, V., Shastri, L.: Efficient inference with multi-place predicates and variables in a connectionist network. In: Proceedings of the 11th Conference of the Cognitive Science Society, Ann Arbor, MI (1989) 396–403
4. Bailey, D., Chang, N., Feldman, J., Narayanan, S.: Extending Embodied Lexical Development. In: Proceedings of the 20th Conference of the Cognitive Science Society, Madison, WI (1998) 84–89
5. Barnden, J., Srinivas, K.: Encoding Techniques for Complex Information Structures in Connectionist Systems. Connection Science 3(3) (1991) 269–315
6. Bliss, T.V.P., Collingridge, G.L.: A synaptic model of memory: long-term potentiation in the hippocampus. Nature 361 (1993) 31–39
7. Gasser, M., Colunga, E.: Where Do Relations Come From? Indiana University Cognitive Science Program, Technical Report 221 (1998)
8. Hobbs, J.R., Stickel, M., Appelt, D., Martin, P.: Interpretation as Abduction. Artificial Intelligence 63(1-2) (1993) 69–142
9. Hummel, J.E., Holyoak, K.J.: Distributed representations of structure: a theory of analogical access and mapping. Psychological Review 104 (1997) 427–466
10. Lakoff, G.: Women, Fire, and Dangerous Things — What categories reveal about the mind. University of Chicago Press, Chicago (1987)
11. Lange, T.E., Dyer, M.G.: High-level Inferencing in a Connectionist Network. Connection Science 1(2) (1989) 181–217
12. Lisman, J.E., Idiart, M.A.P.: Storage of 7 ± 2 Short-Term Memories in Oscillatory Subcycles. Science 267 (1995) 1512–1515
13. Luck, S.J., Vogel, E.K.: The capacity of visual working memory for features and conjunctions. Nature 390 (1997) 279–281
14. Mani, D.R., Shastri, L.: Reflexive Reasoning with Multiple-Instantiation in a Connectionist Reasoning System with a Typed Hierarchy. Connection Science 5(3&4) (1993) 205–242
15. Park, N.S., Robertson, D., Stenning, K.: An extension of the temporal synchrony approach to dynamic variable binding in a connectionist inference system. Knowledge-Based Systems 8(6) (1995) 345–358
16. Regier, T.: The Human Semantic Potential: Spatial Language and Constrained Connectionism. MIT Press, Cambridge, MA (1996)
17. Shastri, L.: Neurally motivated constraints on the working memory capacity of a production system for parallel processing. In: Proceedings of the 14th Conference of the Cognitive Science Society, Bloomington, IN (1992) 159–164
18. Shastri, L.: Exploiting temporal binding to learn relational rules within a connectionist network. TR-97-003, International Computer Science Institute, Berkeley, CA (1997)
19. Shastri, L.: A Model of Rapid Memory Formation in the Hippocampal System. In: Proceedings of the 19th Annual Conference of the Cognitive Science Society, Stanford University, CA (1997) 680–685
20. Shastri, L.: Advances in shruti — A neurally motivated model of relational knowledge representation and rapid inference using temporal synchrony. Applied Intelligence 11 (1999) 79–108
21. Shastri, L.: Recruitment of binding and binding-error detector circuits via long-term potentiation. Neurocomputing 26–27 (1999) 865–874
22. Shastri, L., Ajjanagadde, V.: From simple associations to systematic reasoning: A connectionist encoding of rules, variables and dynamic bindings using temporal synchrony. Behavioral and Brain Sciences 16(3) (1993) 417–494
23. Shastri, L., Grannes, D.J.: A connectionist treatment of negation and inconsistency. In: Proceedings of the 18th Conference of the Cognitive Science Society, San Diego, CA (1996)
24. Shastri, L., Grannes, D.J., Narayanan, S., Feldman, J.A.: A Connectionist Encoding of Schemas and Reactive Plans. In: Kraetzschmar, G.K., Palm, G. (eds.): Hybrid Information Processing in Adaptive Autonomous Vehicles. Lecture Notes in Computer Science, Springer-Verlag, Berlin (to appear)
25. Shastri, L., Wendelken, C.: Knowledge Fusion in the Large — taking a cue from the brain. In: Proceedings of the Second International Conference on Information Fusion, FUSION'99, Sunnyvale, CA, July (1999) 1262–1269
26. Singer, W.: Synchronization of cortical activity and its putative role in information processing and learning. Annual Review of Physiology 55 (1993) 349–74
27. Sun, R.: On variable binding in connectionist networks. Connection Science 4(2) (1992) 93–124
28. von der Malsburg, C.: Am I thinking assemblies? In: Palm, G., Aertsen, A. (eds.): Brain Theory. Springer-Verlag (1986)
A Recursive Neural Network for Reflexive Reasoning

Steffen Hölldobler¹, Yvonne Kalinke²*, and Jörg Wunderlich³**

¹ Dresden University of Technology, Dresden, Germany
² Queensland University of Technology, Brisbane, Australia
³ Neurotec Hochtechnologie GmbH, Friedrichshafen, Germany

* The author acknowledges support from the German Academic Exchange Service (DAAD) under grant no. D/97/29570.
** The results reported in this paper were achieved while the author was at the Dresden University of Technology.
Abstract. We formally specify a connectionist system for generating the least model of a datalogic program which uses linear time and space. The system is shown to be sound and complete if only unary relation symbols are involved, and complete but unsound otherwise. For the latter case a criterion is defined which guarantees correctness. Finally, we compare our system to the forward reasoning version of Shruti.
1 Introduction
Connectionist systems exhibit many desirable properties of intelligent systems like, for example, being massively parallel, context–sensitive, adaptable and robust (see eg. [10]). It is strongly believed that intelligent systems must also be able to represent and reason about structured objects and structure–sensitive processes (see eg. [12,25]). Unfortunately, we are unaware of any connectionist system which can handle structured objects and structure–sensitive processes in a satisfying way. Logic systems were designed to cope with such objects and processes and, consequently, it is a long–standing research goal to combine the advantages of connectionist and logic systems in a single system. There have been many results on such a combination which involves propositional logic (cf. [24,26]).

In [15] we have shown that a three–layered feed forward network of binary threshold units can be used to compute the meaning function of a logic program. The input and output layer of such a network consists of a vector of units, each representing a propositional letter. The activation patterns of these layers represent an interpretation I with the understanding that the unit representing the propositional letter p is active iff p is true under I. For certain classes of logic programs it is well–known that they admit a least model, and that this model can be computed as the least fixed point of the program's meaning function applied to an arbitrary initial interpretation [1,11]. To
compute the least fixed point in such cases, the feed forward network mentioned in the previous paragraph is turned into a recurrent one by connecting each unit in the output layer to the corresponding unit in the input layer. We were able to show — among other results — that such so-called Rnns, ie. recursive neural networks with a feed forward kernel, converge to a stable state which represents the least model of the corresponding logic program. Moreover, in [6] it was shown that the binary threshold units in the hidden layer of the kernel can be replaced by units with a sigmoidal activation function. Consequently, the networks can be trained by backpropagation and after training new refined program clauses (or rules) can be extracted using, for example, the techniques presented in [31]. Altogether, this is a good example of how the properties of logic systems can be combined with the inherent properties of connectionist systems.

Unfortunately, this does not solve the aforementioned problem because structured objects and structure–sensitive processes cannot be modeled within propositional logic but only within first- and higher–order logics. For these logics, however, similar results combining connectionist and logic systems are not known. One of the problems is that as soon as the underlying alphabet contains a single non–nullary function symbol and a single constant, then there are infinitely many ground atoms, which cannot be represented locally in a connectionist network.

In [17,18] we have shown that for certain classes of first–order logic programs interpretations can be mapped onto real numbers such that the program's meaning function can be encoded as a continuous function on the real numbers. Applying a result from [13] we conclude that three–layered feed forward networks with a sigmoidal activation function for the units occurring in the hidden layer and a linear activation function for the units occurring in the input and output layer can approximate the meaning function of first–order logic programs arbitrarily well. Moreover, turning this feed forward kernel into an Rnn, the Rnn computes an approximation of the least fixed point, i.e. the least model, of a given logic program. The notion of an approximation is based on a distance function between interpretations such that — loosely speaking — the distance is inversely proportional to the number of atoms on which both interpretations agree. Unfortunately, the result reported in [17,18] is purely theoretical and we have not yet developed a real connectionist system which uses it. One of the main obstacles for doing so is that we need to find a connectionist representation for terms. There are various alternatives:

• We may use a structured connectionist network as in [14]: In this case the network is completely local, and all computations like, for example, unification can be performed, but it is not obvious at all how such networks can be learned. The structure is by far too complex for current learning algorithms based on the recruitment paradigm [9].
• We may use a vector of fixed length to represent terms as in the recursive auto–associative memory [28], the labeling recursive auto–associative memory [30] or in the memory based on holographic reduced representations [27]. Unfortunately, in extensive tests none of these proposals has led to
satisfying results: the systems could not safely store and recall terms of depth larger than five [20].
• We may use hybrid systems, where terms are represented and manipulated in a conventional way. But this is not the kind of integration that we were hoping for, because in this case results from connectionist systems cannot be applied to the conventional part.
• We may use connectionist encodings of conventional data structures like counters and stacks [16,21], but currently the models are still too simple.
• We may use a phase–coding to bind constants to terms as suggested in Shruti [29]: In this case we restrict our first–order language to contain only constants and multi–place relation symbols.

Considering this current state of the art in connectionist term representations we propose in this paper to extend our connectionist model developed in [15] to handle constants and multi–place relations by turning the units into phase–sensitive ones and solving the variable binding problem as suggested in Shruti. Because our system generates models for logic programs in a forward reasoning manner, it is necessary to consider the version of Shruti where forward reasoning is performed. There are three main difficulties with such a Shruti system:

• In almost any derivation more than one copy of the rules is needed, which leads to sequential processing.
• The structure of the system is quite complex: there are many different types of units with a complex connection structure, and it is not clear at all how such structures can be learned.
• The logical foundation of the system has not been developed yet.

In [2] we have developed a logical calculus for the backward reasoning version of Shruti by showing that reflexive reasoning as performed by Shruti is nothing but reasoning by reductions in a conventional, but parallel, logic system based on the connection method [3]. In this paper we develop a logic system for forward reasoning in a first–order calculus with constants and multi–place relations and specify a recurrent neural network implementing this system. The logic system is again based on the connection method using reduction techniques which link logic systems to database systems. We define a calculus called Bur (for bottom–up reductions) which has the following properties:

• For unary relation symbols the calculus is sound and complete.
• For relation symbols with an arity larger than one the calculus is complete but not necessarily sound.
• We develop a criterion which guarantees that the results achieved in the case where the relation symbols have an arity larger than one are sound.
• Computations require only linear parallel time and linear parallel space.

Furthermore, we extend the feed forward neural networks developed in [15] by turning the units into phase–sensitive ones. We formally show that the Bur calculus can be implemented in these networks. Compared to Shruti our networks consist only of two types of phase–sensitive units. The connection structure is an Rnn with a four–layered feed forward neural network as kernel and, thus,
is considerably simpler than the connection structure of Shruti. Besides giving a rigorous formal treatment of the logic underlying Shruti if run in a forward reasoning manner, this line of research may also lead to networks for reflexive reasoning, which can be trained using standard techniques like backpropagation.

The paper is organized as follows: In the following Section 2 we repeat some basic notions, notations and results concerning logic programming and reflexive reasoning. The Bur calculus is formally defined in Section 3. Its connectionist implementation is developed in Section 4. The properties of the implementation and its relation to the Shruti system are discussed in Sections 5 and 6 respectively. In the final Section 7 we discuss our results and point out future research.
2 Logic Programs and Reflexive Reasoning
We assume the reader to have some background in logic programming and deduction systems (see eg. [23,5]) as well as in connectionist systems and, in particular, in the Shruti system [29]. Thus, in this section we will just briefly repeat the basic notions, notations and results.

A (definite) logic program is a set of clauses, ie. universally closed formulas (all variables are assumed to be universally quantified) of the form A ← A1 ∧ . . . ∧ An, where A, Ai, 1 ≤ i ≤ n, are first–order atoms. A and A1 ∧ . . . ∧ An are called head and body respectively. A clause of a logic program is said to be a fact if its body is empty and its head does not contain any occurrences of a variable; otherwise it is called a rule. A logic program is said to be a datalogic program if all function symbols occurring in the program are nullary, ie. if all function symbols are constants. For example, the database in Shruti is a datalogic program (to be precise, existentially bound variables in a Shruti database must be replaced by new constants; see [2]).

Definite logic programs enjoy many nice properties, among which is the one that each program P admits a least model. This model contains precisely all the logical consequences of the program. Moreover, it can be computed iteratively as the least fixed point of a so–called meaning function TP which is defined on interpretations I as

  TP(I) = {A | there exists a ground instance A ← A1 ∧ . . . ∧ An of a clause in P such that {A1, . . . , An} ⊆ I},

where an interpretation is a set of ground atoms. In case of a datalogic program P over a finite set of constants, the least fixed point of TP can be computed in finite, albeit exponential, time (in the worst case) with respect to the size of P. Following the argumentation in [29], datalogic programs are thus unsuitable to model reflexive reasoning. Only by imposing additional conditions on the syntactic structure of datalogic programs as well as on their runtime behavior was it possible to show that a backward reasoning version of Shruti is able to answer questions in linear time.
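The iteration of the meaning function is easy to illustrate. The following is a minimal sketch of my own (not the authors' code), restricted to ground clauses represented as (head, body) pairs, so that TP(I) simply collects every head whose body is contained in I.

    def t_p(program, interpretation):
        return {head for head, body in program
                if all(atom in interpretation for atom in body)}

    def least_model(program):
        interpretation = set()
        while True:
            new = t_p(program, interpretation)
            if new == interpretation:        # least fixed point reached
                return interpretation
            interpretation = new

    # p(a,b).   q(a,b) <- p(a,b).   r(b,a) <- q(a,b).
    program = [(("p", "a", "b"), []),
               (("q", "a", "b"), [("p", "a", "b")]),
               (("r", "b", "a"), [("q", "a", "b")])]
    print(least_model(program))   # {('p','a','b'), ('q','a','b'), ('r','b','a')}

Starting from the empty interpretation, the sequence TP(∅), TP²(∅), ... grows monotonically and is finite for a datalogic program, so the loop terminates at the least model.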
3 Bottom–Up Reductions: The BUR Calculus
In this section we develop a new calculus called Bur based on the idea to apply reduction techniques to a given knowledge base in a bottom–up manner, whereby the reduction techniques can be efficiently applied in parallel. We are particularly interested in reduction techniques which can be applied in linear time and space.

Let C be a finite set of constant symbols and R a finite set of relation symbols. (Throughout the paper we will make use of the following notational conventions: a, b, . . . denote constants, p, q, . . . relation symbols and X, Y, . . . variables.) A Bur knowledge base P is a number of formulas that are either facts or rules. Thus, a Bur knowledge base is simply a datalogic program. Before turning to the definition of the reduction techniques we consider a knowledge base P1 with the facts

  p(a, b) and p(c, d)    (1)

as well as

  q(a, c) and q(b, c)    (2)

and a single rule

  r(X, Y, Z) ← p(X, Y) ∧ q(Y, Z),    (3)

where C = {a, b, c, d} is the set of constants, R = {p, q, r} the set of relation symbols and X, Y, Z are variables. Using a technique known as database (or DB) reduction in the connection method (see [4]) the facts in (1) can be equivalently replaced by

  p(X, Y) ← (X, Y) ∈ {(a, b), (c, d)}    (4)

and, likewise, the facts in (2) can be equivalently replaced by

  q(X, Y) ← (X, Y) ∈ {(a, c), (b, c)}.    (5)

Although the transformations seem to be straightforward, they have the desired side–effect that there is now only one possibility to satisfy the conditions p(X, Y) and q(Y, Z) in the body of rule (3), viz. by using (4) and (5) respectively. Technically speaking, there is an isolated connection between p(X, Y) occurring in the body of (3) and the head of (4) and, likewise, between q(Y, Z) occurring in the body of (3) and the head of (5) [4]. Such isolated connections can be evaluated. Applying the corresponding reduction technique yields

  r(X, Y, Z) ← (X, Y, Z) ∈ π1,2,4(p ⋈p/2=q/1 q),    (6)

where ⋈ denotes the (natural equi-)join of the relations p and q, p/2 = q/1 denotes the constraint that the second argument of the relation p should be identical to the first argument of q, and π1,2,4(s) denotes the projection of the relation s to the first, second and fourth argument. Evaluating the database operations occurring in equation (6) leads to the reduced expression

  r(X, Y, Z) ← (X, Y, Z) ∈ {(a, b, c)}.    (7)
In general, after applying database reductions to facts the evaluation of isolated connections between the reduced facts and the atoms occurring in the body of rules leads to expressions containing the database operations union (∪), intersection (∩), projection (π), Cartesian product (⊗) and join (⋈). These are the standard operations of a relational database (see eg. [32]). The most costly operation is the join, which in the worst case requires exponential space and time with respect to the number of arguments of the involved relations and the number of atoms occurring in the body of a rule. Because it is our goal to set up a calculus which allows reasoning within linear time and space boundaries, we must avoid the join operation. This can be achieved if we replace database reductions by so–called pointwise database reductions: In our example, the facts in (1) and (2) are replaced by

  p(X, Y) ← X ∈ {a, c} ∧ Y ∈ {b, d}    (8)

and

  q(X, Y) ← X ∈ {a, b} ∧ Y ∈ {c}    (9)

respectively. After evaluating isolated connections (3) now becomes

  r(X, Y, Z) ← X ∈ π1(p) ∧ Y ∈ π2(p) ∩ π1(q) ∧ Z ∈ π2(q),    (10)

which can be further evaluated to

  r(X, Y, Z) ← X ∈ {a, c} ∧ Y ∈ {b} ∧ Z ∈ {c}.    (11)
In general, the use of pointwise database reductions instead of database reductions leads to expressions involving only the database operations union, intersection, projection and Cartesian product, all of which can be computed in linear time and space using an appropriate representation. The drawback of this approach is that now so–called spurious tuples may occur in relations. For example, according to (11) not only (a, b, c) is in relation r (as in (7)) but also (c, b, c). r(a, b, c) is a logical consequence of the example knowledge base, whereas r(c, b, c) is not. It is easy to see that spurious tuples may occur only if multiplace relation symbols are involved. We will come back to this problem later in this section.

After this introductory example we can now formally define the reduction rules of the Bur calculus. One should keep in mind that these rules are used to compute the least fixed point of TP for a given Bur database P. Without loss of generality we may assume that the head of each rule contains only variable occurrences: Any occurrence of a constant c in the head of a rule may be replaced by a new variable X if the condition X ∈ {c} is added to the body of the rule (this is called the homogeneous form in [8]). A similar transformation can be applied to facts, ie. each fact of the form p(c1, . . . , cn) can be replaced by

  p(X1, . . . , Xn) ← X1 ∈ {c1} ∧ . . . ∧ Xn ∈ {cn}.

We will call such expressions generalized facts.
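The example can be reproduced in a few lines. The representation below (one set of constants per argument position) is an assumption made for illustration and is not taken from the paper's implementation.

    # Pointwise DB reduction keeps, for every relation, one set of constants per
    # argument position; evaluating the isolated connections of rule (3) then
    # intersects the sets contributed by each body occurrence of a variable.
    facts = {"p": [{"a", "c"}, {"b", "d"}],     # from p(a,b), p(c,d)  -- eq. (8)
             "q": [{"a", "b"}, {"c"}]}          # from q(a,c), q(b,c)  -- eq. (9)

    # rule (3): r(X,Y,Z) <- p(X,Y) & q(Y,Z); body occurrences of each head variable
    occurrences = {"X": [("p", 0)], "Y": [("p", 1), ("q", 0)], "Z": [("q", 1)]}

    bindings = {}
    for var, occs in occurrences.items():
        sets = [facts[rel][pos] for rel, pos in occs]
        bindings[var] = set.intersection(*sets)

    print(bindings)   # {'X': {'a', 'c'}, 'Y': {'b'}, 'Z': {'c'}}  -- eq. (11)
    # Reading tuples off argument-wise yields (a,b,c) and the spurious (c,b,c).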
The Bur calculus contains the following two rules:

• Pointwise DB reduction: Let

  p(X1, . . . , Xn) ← X1 ∈ C1 ∧ . . . ∧ Xn ∈ Cn  and  p(X1, . . . , Xn) ← X1 ∈ D1 ∧ . . . ∧ Xn ∈ Dn

be two generalized facts in P. Replace these facts by

  p(X1, . . . , Xn) ← X1 ∈ C1 ∪ D1 ∧ . . . ∧ Xn ∈ Cn ∪ Dn.

• Evaluation of isolated connections: Let C be the set of constants,

  p(X1, . . . , Xm) ← p1(t11, . . . , t1k1) ∧ . . . ∧ pn(tn1, . . . , tnkn)    (12)

be a rule in P such that there are also generalized facts of the form

  pi(Yi1, . . . , Yiki) ← Yi1 ∈ Ci1 ∧ . . . ∧ Yiki ∈ Ciki, 1 ≤ i ≤ n,

in P. Let Dj = ∩{Cil | Xj occurs at the l-th position in pi(ti1, . . . , tiki)} for each variable Xj occurring in (12). If Dj ≠ ∅ for each j, then add the following generalized fact to P:

  p(X1, . . . , Xm) ← X1 ∈ C ∩ D1 ∧ . . . ∧ Xm ∈ C ∩ Dm.

One should observe that by evaluating isolated connections generalized facts will be added to a Bur knowledge base P. Because pointwise DB reductions do not decrease the number of facts and C is finite, this process will eventually terminate in that no new facts are added. Let M denote the largest set of facts obtained from P by applying the Bur reduction rules.

Theorem 1.
1. The Bur calculus is sound and complete if all relation symbols occurring in P are unary, ie. M is precisely the least model of P.
2. The Bur calculus is complete but not necessarily sound if there are multiplace relation symbols in P, ie. M is a superset of the least model of P.

The proof of this theorem can be found in [33]. The first part of Theorem 1 confirms the fact that considering unary relation symbols and a finite set of constants neither extends the expressive power of a Bur knowledge base compared to propositional Horn logic nor affects the time and space requirements for computing the minimal model of a program (see [7]). Because the reduction techniques in the Bur calculus can be applied in linear time and
space the minimal model of such a Bur knowledge base can be computed in linear time and space as well.

The second part of Theorem 1 confirms the fact that considering multiplace relation symbols and a finite set of constants does not change the expressive power of a Bur knowledge base compared to propositional Horn logic but does affect the time and space requirements. Turning a Bur knowledge base into an equivalent propositional logic program may lead to exponentially more rules and facts. Hence, the best we can hope for if we apply reduction techniques bottom–up and in linear time and space is a pruning of the search space. In the worst case the pruning can be neglected. In some cases, however, the application of the reduction techniques may lead to considerable savings. Such a beneficial case is characterized in the following theorem.

Theorem 2. If after d applications of the reduction techniques all relations have at most one argument for which more than one binding is generated, then all facts derived so far are logical consequences of the knowledge base.

In other words, the precondition of this theorem defines a correctness criterion in that the Bur calculus is also sound for multiplace relations if the criterion is met in the limit. The proof of the theorem can again be found in [33].
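The criterion of Theorem 2 is easy to test on the per-argument binding sets. The check below is my own reading of the precondition, using the same assumed representation as in the earlier sketch.

    def criterion_holds(relations):
        """relations: dict mapping relation name -> list of per-argument sets."""
        return all(sum(len(args) > 1 for args in arg_sets) <= 1
                   for arg_sets in relations.values())

    print(criterion_holds({"p": [{"a", "c"}, {"b"}], "q": [{"a", "b"}, {"c"}]}))  # True
    print(criterion_holds({"r": [{"a", "c"}, {"b"}, {"b", "c"}]}))                # False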
4 A Connectionist Implementation of the BUR Calculus
The connectionist implementation of the Bur calculus is based on two main ideas: (1) use the kernel of the Rnn model to encode the logical structure of the Bur knowledge base and its recursive part to encode successive applications of the reduction techniques, and (2) use the temporal synchronous activation model of Shruti to solve the dynamic binding problem.

In the Bur model two types of phase–sensitive binary threshold units are used, which are called btu–p–units and btu–c–units, respectively. They have the same functionality as the ρ–btu and τ–and–units in the Shruti model, ie. the outputs of a btu–p–unit and a btu–c–unit in a phase πc in a cycle ω are

  o_btu–p(πc) = 1 if i_btu–p(πc) ≥ θ_btu–p, and 0 otherwise,

and

  o_btu–c(πc) = 1 if there is a phase πc′ ∈ ω with i_btu–c(πc′) ≥ θ_btu–c, and 0 otherwise,
respectively, where θ denotes the threshold and i(πc ) the input of the unit in the phase πc . Because all connections in the network will be defined as weighted with 1 the input i(πc ) of a btu–p and btu–c–unit equals the sum of the outputs of all units that are connected to that unit in the phase πc . The number of phases in a cycle ω is determined by the number of constant symbols occurring in a given Bur knowledge base.
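The two unit types can be sketched as follows (a toy rendering with names of my own; the real units are phase-sensitive threshold circuits):

    # A btu-p unit fires in exactly those phases in which its input reaches the
    # threshold; a btu-c unit fires in every phase of the cycle as soon as the
    # threshold is reached in at least one phase.
    def btu_p_output(inputs_per_phase, threshold):
        return {phase: int(value >= threshold)
                for phase, value in inputs_per_phase.items()}

    def btu_c_output(inputs_per_phase, threshold):
        fired = any(value >= threshold for value in inputs_per_phase.values())
        return {phase: int(fired) for phase in inputs_per_phase}

    cycle = {"pi_a": 2, "pi_b": 0}        # summed inputs in the two phases
    print(btu_p_output(cycle, 2))         # {'pi_a': 1, 'pi_b': 0}
    print(btu_c_output(cycle, 2))         # {'pi_a': 1, 'pi_b': 1}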
For a given Bur knowledge base we construct a four–layered feed forward network according to the following algorithm. To shorten the notation the superscripts I, O, 1 and 2 indicate whether the unit belongs to the input, output, first or second hidden layer of the network respectively.

Definition 3. The network corresponding to a Bur knowledge base P is an Rnn with an input, two hidden and an output layer constructed as follows:

1. For each constant c occurring in P add a unit btu–p^I_c with threshold 1.
2. For each relation symbol p with arity k occurring in P add units btu–p^I_p[1], . . . , btu–p^I_p[k] and btu–p^O_p[1], . . . , btu–p^O_p[k], each with threshold 1.
3. For each formula F of the form p(. . .) ← p1(. . .) ∧ . . . ∧ pn(. . .) in P do:
   3.1 For each variable X occurring in the body of F add a unit btu–c^1_X. Draw connections from each unit btu–p^I_p[j] to this unit iff relation p occurs in the body of F and its j-th argument is X. Set the threshold of the new unit btu–c^1_X to the number of incoming connections.
   3.2 For each constant c occurring in F add a unit btu–c^1_c. Draw a connection from unit btu–p^I_c to this unit and connections from each unit btu–p^I_p[j] iff relation p occurs in the body of F and its j-th argument is c. Set the threshold of the new unit btu–c^1_c to the number of incoming connections.
   3.3 For each unit btu–c^1_X that was added in step 3.1 add a companion unit btu–p^1_X iff variable X occurs in the head of F. For each unit btu–c^1_c that was added in step 3.2 add a companion unit btu–p^1_c iff constant c occurs in the head of F. Draw connections from the input layer to the companion units such that these units receive the same input as their companion units btu–c^1_X and btu–c^1_c, and assign the same threshold.
   3.4 If k is the arity of the relation p(. . .) occurring in the head of F then add units btu–p^2_p[1], . . . , btu–p^2_p[k]. Draw a connection from each btu–c^1 unit added in steps 3.1 to 3.3 to each of these units.
   3.5 Draw a connection from btu–p^1_X to btu–p^2_p[j] iff variable X occurs at position j in p(. . .). Draw a connection from btu–p^1_c to btu–p^2_p[j] iff constant c occurs at position j in p(. . .). Set the threshold of the btu–p^2 units to the number of incoming connections.
   3.6 For each 1 ≤ j ≤ k draw a connection from unit btu–p^2_p[j] to unit btu–p^O_p[j].
4. For each relation p with arity k occurring in P and for each 1 ≤ j ≤ k draw a connection from unit btu–p^O_p[j] to unit btu–p^I_p[j].
5. Set the weights of all connections in the network to 1.

The network is a recursive network with a feed forward kernel. This kernel is constructed in steps (1) to (3.6). It is extended in step (4) to an Rnn. As an example consider the following Bur knowledge base P2:

  p(a, b)
  q(a, b) ← p(a, b)
  p(Y, X) ← q(X, Y)
  r(X) ← p(X, Y) ∧ q(Y, X)
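The following is a purely structural sketch of the construction of Definition 3 for P2, under assumptions of my own: the bookkeeping (string unit names, edge lists) and the decision to wire only rules (with the fact being presented to the input layer, as in the text) are illustrative simplifications, and the phase dynamics are omitted entirely.

    def build_bur_kernel(constants, relations, rules):
        units, edges = {}, []                 # unit name -> threshold, (src, dst)
        for c in constants:                                          # step 1
            units[f"I:{c}"] = 1
        for p, arity in relations.items():                           # step 2
            for j in range(1, arity + 1):
                units[f"I:{p}[{j}]"] = units[f"O:{p}[{j}]"] = 1
        for idx, (hp, hargs, body) in enumerate(rules, start=2):     # step 3
            terms = {t for _, args in body for t in args} | set(hargs)
            c1 = {}
            for t in terms:                                          # 3.1 / 3.2
                srcs = [f"I:{p}[{j}]" for p, args in body
                        for j, arg in enumerate(args, 1) if arg == t]
                if t in constants:
                    srcs.append(f"I:{t}")
                c1[t] = f"c1:{idx}:{t}"
                units[c1[t]] = len(srcs)
                edges += [(s, c1[t]) for s in srcs]
                if t in hargs:                                       # 3.3
                    units[f"p1:{idx}:{t}"] = len(srcs)
                    edges += [(s, f"p1:{idx}:{t}") for s in srcs]
            for j, t in enumerate(hargs, 1):                         # 3.4 - 3.6
                h2 = f"p2:{idx}:{hp}[{j}]"
                incoming = list(c1.values()) + [f"p1:{idx}:{t}"]
                units[h2] = len(incoming)
                edges += [(s, h2) for s in incoming]
                edges.append((h2, f"O:{hp}[{j}]"))
        for p, arity in relations.items():                           # step 4
            for j in range(1, arity + 1):
                edges.append((f"O:{p}[{j}]", f"I:{p}[{j}]"))
        return units, edges                        # all weights are 1 (step 5)

    units, edges = build_bur_kernel(
        {"a", "b"}, {"p": 2, "q": 2, "r": 1},
        [("q", ["a", "b"], [("p", ["a", "b"])]),                   # clause 2
         ("p", ["Y", "X"], [("q", ["X", "Y"])]),                   # clause 3
         ("r", ["X"], [("p", ["X", "Y"]), ("q", ["Y", "X"])])])    # clause 4
    print(len(units), "units and", len(edges), "connections")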
1 r
Fig. 1. The Bur network for a simple knowledge base. btu–p–units are depicted as squares and btu–c–units as circles. For presentation clarity we have dropped the recurrent connections between corresponding units in the output and input layer.
Fig. 1 shows the corresponding Bur network. Because this knowledge base contains just the two constants a and b , the cycle ω is defined by {πa , πb } . The inference process is initiated by presenting the only fact p(a, b) to the input layer of the network. More precisely, the units labelled p1 and a in the input layer of Fig. 1 are clamped in phase πa , whereas the units labelled p2 and b are clamped in phase πb . The external activation is maintained throughout the inference process. Fig. 2 shows the activation of the units during the computation. After five cycles (equals 10 time steps) the spreading of activation reaches a stable state and the generated model can be read off as will be explained in the following paragraph. Analogous to the Rnn model the input and the output layer of a Bur network represent interpretations for the knowledge base. The instantiation of a variable X (occurring as an argument of some relation) by a constant c is realized by activating the unit that represents X in phase πc representing c . A ground atom p(c1 , . . . , ck ) is an element of the interpretation encoded in the activation patterns of the input (and output) layer in cycle ω iff for each 1 ≤ j ≤ k the 5 units btu–pIp[j] (and btu–pO p[j] ) are activated in phases πcj ∈ ω . Because all facts of a Bur knowledge base are ground, the set of facts represents interpretation for the given knowledge base. A computation is initialized 5
Because each relation is represented only once in the input and the output layer and each argument p[j] may be activated in several phases during one cycle, these ac-
56
S. H¨ olldobler, Y. Kalinke and J. Wunderlich r q[2] q[1] p[2] p[1] b a 0
2
4
6
8
10
time
Fig. 2. The activation of the units within the computation in the Bur network shown in Fig. 1. Each cycle ω consists of two time steps, where the first one corresponds to the phase πa and the second one to the phase πb . The bold line after two cycles marks the point in time up to which the condition in Theorem 2 is fulfilled, ie. each relation argument is bound to one constant only. During the third cycle the arguments p[1] and p[2] are both bound to both constants a and b .
by presenting the facts to the network. This is done by clamping all units representing constants and all units representing the arguments of these facts. Thereafter the activation is propagated through the network. We refer to the propagation of activation from the input layer to the output layer as a comd putation step. Let IBU R denote the interpretation represented by the output layer of a Bur network after d ≥ 1 computation steps. The computation in the Bur network terminates if the network reaches a stable state, ie. if for two interpretations computed in successive computation steps d and d + 1 we find d+1 d IBU R = IBU R . Such a stable state will always be reached in finite time because all rules in a Bur knowledge base are definite clauses and there are only finitely many constants and no other function symbols.
5
Properties of the BUR Network
It is straightforward to verify that a Bur network encodes the reduction rules of the Bur calculus. A condition X ∈ C for some argument j of a relation p is encoded by activating the unit p[j] occurring in the input (and output) layer in all phases corresponding to the constants occurring in C . This basically covers pointwise DB reductions. The hidden layers and their connections are constructed such that they precisely realize the evaluation of isolated connections. This is in fact an enrichment of a McCulloch–Pitts network [24] by a phase–coding of bindings. Let us now first consider the case where the Bur knowledge base P contains only unary relation symbols. In this case the computation of the Bur network with respect to an interpretation I within one computation step equals the computation of the meaning function TP (I) for the logic program P . One tivation pattern represent several bindings and, thus, several instances of the relation p simultaneously. This may lead to crosstalk as will be discussed later.
A Recursive Neural Network for Reflexive Reasoning
57
should observe that TP(∅) is precisely the set of all facts occurring in P and, thus, corresponds precisely to the activation pattern presented as external input to initialize the Bur network. Hence, it is not too difficult to show by induction on the number of computation steps that the following proposition holds.

Proposition 4. Let TP be the meaning function for a Bur knowledge base P and d be the number of computation steps. If P contains only unary relation symbols, then I^d_BUR = T_P^{d+1}(∅).

Because for a Bur knowledge base P the least fixed point of TP exists and can be computed in finite time, Proposition 4 ensures that in the case of unary relation symbols the Bur network computes the least model of the Bur knowledge base. Moreover, by Theorem 1(1) we learn that the Bur network is a sound and complete implementation of the Bur calculus in this case.

The result is restricted to a knowledge base with unary relation symbols only, because in the general case of multi–place relation symbols the so–called crosstalk problem may occur. If several instances of a multi–place relation are encoded in an interpretation, the relation arguments can each be bound to several constants. Because it is not encoded which argument binding belongs to which instance, instances that are not actually elements of the interpretation may by mistake be supposed to be. This problem corresponds precisely to the problem of whether spurious tuples are computed in the Bur calculus.

Reconsider P2, for which Fig. 2 shows the input to the Bur network and the activation of the output units during the first five computation steps of the computation. During the second computation step the units btu–p^O_p[1] and btu–p^O_p[2] are both activated in the phases πa and πb, because they have to represent the bindings p[1] = a ∧ p[2] = b of the instance p(a, b) and p[1] = b ∧ p[2] = a of the instance p(b, a) of the computed interpretation. But these activations also represent the instances p(a, a) and p(b, b). We can show, however, that despite the instances that are erroneously represented as a result of the crosstalk problem, all instances that result from an application of the meaning function to a given interpretation are computed correctly.

Proposition 5. Let P be a Bur knowledge base, TP the meaning function for P and d be the number of computation steps. Then, I^d_BUR ⊇ T_P^{d+1}(∅).

One can be even more precise by showing that the Bur network is again a sound and complete implementation of the Bur calculus in that the stable state of the network precisely represents M (see Theorem 1(2)). As a consequence of the one–to–one correspondence between the Bur calculus and its connectionist implementation we can now apply the precondition of Theorem 2 as a criterion that, if met, ensures that all instances represented by an interpretation in the Bur network belong to the least model of the Bur knowledge base. More precisely, because the computation in the Bur network yields all ground atoms that actually are logical consequences of P, and using the criterion stated in Theorem 2, we can determine the computation step d in
which ground atoms that are not logical consequences of P are computed for the first time.

A Novel Modular Neural Architecture

As with relations, conditions are represented by triples of the form <object, attribute, value> and subsequently stored in assemblies of the form described in section 2.1.1. The various assemblies representing the conditions of a rule can then be connected into a network of assemblies that encodes the rule. Figure 4 shows such a network for the above rule. In the network, there are connections between each unit from the LHS assemblies and each unit from the RHS assembly. The weights between assemblies are set according to the Hebb rule. Therefore, if one knows one side of the rule, one can retrieve the other. To accommodate rules with varying numbers of conditions in LHS and to provide a uniform network topology for both rules and relations (rather than a set of disconnected networks), assemblies are organised into a large hexagonal
network, as shown in Figure 5. With such an architecture, each rule may have a maximum of 6 conditions in its LHS. Larger numbers of conditions in LHS can, of course, be handled by chaining rules appropriately, using new variables (e.g., IF A AND B AND C THEN D can be rewritten as IF A AND B THEN E, and IF E AND C THEN D).

Fig. 4. Rule Encoding in Network (for simplicity, only one neuron per element of relation is shown)
Fig. 5. Hexagonal Network of Assemblies. Each line between assemblies denotes connections between all units from the assemblies.

When reasoning with rules, BRAINN implements backward chaining. Hence, the network retrieves the LHS of a rule upon delivery of its RHS. The following, along with Figure 6, details how this takes place in the network. For the sake of argument, assume that a rule consists of four conditions in its LHS. The RHS is stored in one assembly and the four conditions from the LHS are stored in adjacent assemblies. Upon activation of the assembly corresponding to the RHS, the network must retrieve the four conditions of the LHS.
Fig. 6. Rule Retrieval: a) Storage; b) Retrieval - i) RHS sent to all assemblies, ii) Relaxation, iii) LHS sent to adjacent assemblies and iv) Relaxation
First, the RHS is delivered to all the assemblies in the hexagonal network. Once all of the Hopfield networks have relaxed, only the assembly storing the RHS has stabilised on the delivered pattern, since for that assembly, delivered and stored patterns are the same. Then, the assembly storing the RHS sends its pattern (vector of activation) to all six adjacent assemblies. Each adjacent assembly receives the pattern vector of the RHS assembly multiplied by the matrix of weights between the RHS assembly and itself. All of the adjacent assemblies relax after receiving the vector. In the case of the four LHS assemblies, the vector received is one of the patterns remembered in the local Hopfield network, so these assemblies will be stable. The other two assemblies will not be stable and will thus change their state. Moreover, the four LHS assemblies now send their patterns back to the RHS assembly. The RHS assembly receives these patterns multiplied by the matrix of weights between the LHS assemblies and itself. Thus, the pattern received by the RHS assembly is equal to its own pattern of activation. A kind of resonance is achieved, allowing the retrieval of the correct LHS. Note that LHS assemblies recruited by the aforementioned process are implicitly conjoined, i.e., the left-hand side of the rule is the conjunction of the conditions found in all of the LHS assemblies retrieved. To avoid confusion during backward chaining when several rules have the same right-hand sides (e.g., IF A AND B THEN C, and IF D THEN C), the right-hand sides are stored in different assemblies.
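The core of this storage/retrieval scheme can be sketched in a few lines. The sketch below is a toy rendering of my own: Hopfield relaxation inside the assemblies is omitted and only the Hebbian inter-assembly weights and the pattern recovery are shown; pattern sizes and names are assumptions.

    import numpy as np

    def hebb(lhs_pattern, rhs_pattern):
        """Weight matrix between an LHS assembly and the RHS assembly."""
        return np.outer(lhs_pattern, rhs_pattern)

    rng = np.random.default_rng(0)
    rhs = rng.choice([-1, 1], size=12)                  # pattern stored in the RHS assembly
    lhs_conditions = [rng.choice([-1, 1], size=12) for _ in range(4)]
    weights = [hebb(lhs, rhs) for lhs in lhs_conditions]

    # Retrieval: the RHS assembly sends its pattern to the adjacent assemblies,
    # each of which receives it multiplied by its weight matrix and relaxes
    # towards the stored LHS condition (approximated here by taking the sign).
    retrieved = [np.sign(w @ rhs) for w in weights]
    print(all((r == lhs).all() for r, lhs in zip(retrieved, lhs_conditions)))  # True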
Rules with Variables

The rules discussed so far are essentially propositional. It is often useful, and even necessary, to encode and use more general rules, which include variables. To reason in the presence of such rules, an effective way of binding variables is required. In BRAINN, variable binding is achieved by using special weight values between LHS and RHS assemblies, as shown in Figure 7 for the rule IF <&X, drinks, milk> THEN <&X, is, strong>. Let &X be the variable. Then, the weights between the units representing &X in LHS and the units representing &X in RHS are equal to 1, whilst the weights between the units representing &X and all other units are equal to 0.
Fig. 7. Weight Setting for a Simple Rule. The LHS assembly holds <someone, drinks, milk> and the RHS assembly holds <someone, is, strong>. Legend: weights equal to 1; weights set up according to the Hebb rule; if there is no line between two units, the weight is equal to 0.
With such a set of weights, the pattern for the variable is sent between assemblies without any modifications nor interactions with the rest of the information in the assembly. The weights inside the LHS assemblies and the RHS assembly must also satisfy similar conditions. That is, the weight of self-connection for all units representing a variable is equal to 1, whilst the weight between each unit representing a variable and any other unit is equal to 0. These latter conditions guarantee the stability of the assembly, which is critical to the reasoning algorithm.
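A minimal sketch of this weight setting follows; it reads the "equal to 1" weights for the variable units as an identity block (so the binding pattern is copied unchanged), which is an interpretation on my part, and the toy patterns and indices are assumptions.

    import numpy as np

    def rule_weights(lhs, rhs, var_units):
        w = np.outer(lhs, rhs).astype(float)   # Hebb rule for the ordinary units
        for i in var_units:
            w[i, :] = 0.0
            w[:, i] = 0.0
            w[i, i] = 1.0                      # variable unit copies its own value
        return w

    lhs = np.array([1, -1, 1, -1, 1, 1])       # <&X, drinks, milk>  (toy coding)
    rhs = np.array([1, -1, -1, 1, -1, 1])      # <&X, is, strong>
    w = rule_weights(lhs, rhs, var_units=[0, 1])   # first two units encode &X
    print(w[:2, :2])                           # identity block for the variable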
2.2 Functional Overview
Although BRAINN’s knowledge implementation is inspired by biological considerations, its information processing mechanisms are not biologically plausible. A high-level view of BRAINN’s overall architecture is shown in Figure 8.
Fig. 8. BRAINN's Architecture

The system's knowledge (i.e., rules and relations) is stored in the Long Term Memory (LTM), the hexagonal network of assemblies. Temporary, run-time information is stored in the Short Term Memory (STM) and the reasoning goal is stored in a dedicated variable. Reasoning is effected by a form of backward chaining. The following sections detail the reasoning mechanisms implemented by the Control Process.

2.3 Rule-Based Reasoning
To facilitate reasoning, BRAINN's assemblies are labelled with the type of information they store: SN for a (semantic net's) relation, LHS for a rule's left-hand side, and RHS for a rule's right-hand side. The label is represented by a unique sequence of 4 bits, stored in a few additional units in each assembly. Hence, each assembly actually consists of 3N + 4 units. As previously stated, BRAINN's rule-based reasoning engine implements a form of backward chaining. The pseudocode for the algorithm is described in Figure 9. If more than one rule can be used, the rules are sorted by ascending number of conditions in their LHS. The algorithm checks that an LHS condition is satisfied by (recursively) asking the network to produce its value.
ApplyRule(question)
1. Deliver question to all assemblies
2. Relax the network
3. If there is a SN assembly containing question Then return corresponding answer
4. Else
   (a) For all RHS assemblies containing question
       i. Retrieve LHS of rule
       ii. Sort rules by ascending number of LHS conditions
   (b) For all rules in above order
       i. Load rule to STM (both RHS and LHS assemblies)
       ii. For each LHS condition of rule
           – If LHS.value ≠ ApplyRule(...) Then try next rule
       iii. Give the answer from RHS of rule
Fig. 9. BRAINN’s Backward Chaining Algorithm
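The procedure of Fig. 9 can be rendered purely symbolically as follows. This is a hedged sketch of my own, not BRAINN's network mechanics: dictionaries stand in for the relaxed assemblies, and the rule is given as a ground instance (the variable is assumed to be already bound) to keep the example short.

    def apply_rule(question, relations, rules):
        """question: (object, attribute); relations: dict (object, attribute) -> value;
        rules: list of (lhs_conditions, rhs) with conditions/rhs = (obj, attr, value)."""
        if question in relations:                    # an SN assembly settles on it
            return relations[question]
        candidates = [r for r in rules if (r[1][0], r[1][1]) == question]
        candidates.sort(key=lambda r: len(r[0]))     # fewest LHS conditions first
        for lhs, rhs in candidates:
            if all(apply_rule((o, a), relations, rules) == v for o, a, v in lhs):
                return rhs[2]                        # answer from the RHS
        return None

    relations = {("Garfield", "drinks"): "milk"}
    rules = [([("Garfield", "drinks", "milk")], ("Garfield", "is", "strong"))]
    print(apply_rule(("Garfield", "drinks"), relations, rules))  # milk
    print(apply_rule(("Garfield", "is"), relations, rules))      # strong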
For example, the algorithm checks the condition sky has colour blue by asking the question <sky, has colour, ?>. Although adequate for single-valued attributes, this may cause problems for multi-valued attributes.

The following illustrates the working of the rule application algorithm on a simple reasoning task. Assume that BRAINN's knowledge base consists of the following relation and rule: <Garfield, drinks, milk> and IF <&X, drinks, milk> THEN <&X, is, strong>. For simplicity, also assume that the hexagonal network consists of only 3 assemblies, organised as shown in Figure 10. The divisions in the assemblies represent subsets of units, one for each element of information (i.e., object, attribute, value and label). Also assume that the relation is stored in the upper assembly and the rule in the lower and right assemblies, as shown in Figure 10.
Fig. 10. Simple Hexagonal Network
The simplest question that the user can ask is about what Garfield drinks, i.e., <Garfield, drinks, ?>. The algorithm delivers the question to all the assemblies. The network, after relaxation, is shown in Figure 11. A label over a division denotes that the activation of units in that division is equal to the binary representation of that label. If there is no label over a division, the network is in a spurious attractor.
Fig. 11. Network after Relaxation

After relaxation, the bottom assembly is in a spurious attractor, the right assembly has settled to one of the patterns stored in that assembly, and the top assembly has settled to the relation (tag is SN) containing the question. Hence, the system gives the answer from this assembly, i.e., milk. The following question, which asks what Garfield is, causes BRAINN's rule-based reasoning mechanisms to be applied. As before, the question is delivered to all the assemblies. The network, after relaxation, is shown in Figure 12.
Fig. 12. Network after Relaxation

Two assemblies are empty, because the network has settled in spurious attractors. Sequences of bits in those assemblies have no meaning. The lower assembly
stores the RHS condition, which contains the question. Neighbours of that assembly receive its pattern of activation multiplied by the matrices of weights between assemblies. The resulting network is shown in Figure 13.
Fig. 13. Network after Retrieving LHS of Rule

The upper assembly is clear because the weights between the upper and lower assemblies are equal to zero (no rule is stored). In the right assembly, the LHS has been retrieved. The rule, IF <&X, drinks, milk> THEN <&X, is, strong>, is written to STM and the question, <Garfield, drinks, ?>, is delivered to the network. The behaviour of the network for this question is as described above. The value returned is the same as the value in the LHS of the retrieved rule, hence the answer for the question, <Garfield, is, ?>, is given from the RHS of the rule, i.e., strong. Currently, the system cannot answer questions involving variables (e.g.,