Distributed Prediction and Hierarchical Knowledge Discovery by ARTMAP Neural Networks Gail A. Carpenter Department of Cognitive and Neural Systems, Boston University 677 Beacon Street, Boston, MA 02215 USA
[email protected] Abstract Adaptive Resonance Theory (ART) neural networks model real-time prediction, search, learning, and recognition. ART networks function both as models of human cognitive information processing [1,2,3] and as neural systems for technology transfer [4]. A neural computation central to both the scientific and the technological analyses is the ART matching rule [5], which models the interaction between top-down expectation and bottom-up input, thereby creating a focus of attention which, in turn, determines the nature of coded memories. Sites of early and ongoing transfer of ART-based technologies include industrial venues such as the Boeing Corporation [6] and government venues such as MIT Lincoln Laboratory [7]. A recent report on industrial uses of neural networks [8] states: “[The] Boeing … Neural Information Retrieval System is probably still the largest-scale manufacturing application of neural networks. It uses [ART] to cluster binary templates of aeroplane parts in a complex hierarchical network that covers over 100,000 items, grouped into thousands of self-organised clusters. Claimed savings in manufacturing costs are in millions of dollars per annum.” At Lincoln Lab, a team led by Waxman developed an image mining system which incorporates several models of vision and recognition developed in the Boston University Department of Cognitive and Neural Systems (BU/CNS). Over the years a dozen CNS graduates (Aguilar, Baloch, Baxter, Bomberger, Cunningham, Fay, Gove, Ivey, Mehanian, Ross, Rubin, Streilein) have contributed to this effort, which is now located at Alphatech, Inc. Customers for BU/CNS neural network technologies have attributed their selection of ART over alternative systems to the model's defining design principles. In listing the advantages of its THOT® technology, for example, American Heuristics Corporation (AHC) cites several characteristic computational capabilities of this family of neural models, including fast on-line (one-pass) learning, “vigilant” detection of novel patterns, retention of rare patterns, improvement with experience, “weights [which] are understandable in real world terms,” and scalability (www.heuristics.com). Design principles derived from scientific analyses and design constraints imposed by targeted applications have jointly guided the development of many variants of the basic networks, including fuzzy ARTMAP [9], ART-EMAP [10], ARTMAP-IC [11],
Gaussian ARTMAP [12], and distributed ARTMAP [3,13]. Comparative analysis of these systems has led to the identification of a default ARTMAP network, which features simplicity of design and robust performance in many application domains [4,14]. Selection of one particular ARTMAP algorithm is intended to facilitate ongoing technology transfer. The default ARTMAP algorithm outlines a procedure for labeling an arbitrary number of output classes in a supervised learning problem. A critical aspect of this algorithm is the distributed nature of its internal code representation, which produces continuous-valued test set predictions distributed across output classes. The character of their code representations, distributed vs. winner-take-all, is, in fact, a primary factor differentiating various ARTMAP networks. The original models [9,15] employ winner-take-all coding during training and testing, as do many subsequent variations and the majority of ART systems that have been transferred to technology. ARTMAP variants with winner-take-all (WTA) coding and discrete target class predictions have, however, shown consistent deficits in labeling accuracy and post-processing adjustment capabilities. The talk will describe a recent application that relies on distributed code representations to exploit the ARTMAP capacity for one-to-many learning, which has enabled the development of self-organizing expert systems for multi-level object grouping, information fusion, and discovery of hierarchical knowledge structures. A pilot study has demonstrated the network's ability to infer multi-level fused relationships among groups of objects in an image, without any supervised labeling of these relationships, thereby pointing to new methodologies for self-organizing knowledge discovery.
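As a concrete illustration of the computations outlined above, the following is a minimal Python sketch of winner-take-all category choice under the ART matching rule, in the general style of fuzzy ART/ARTMAP [5, 9]. It is not the default ARTMAP algorithm of [14]; the parameter values, the complement-coded input convention, and the function names are illustrative assumptions.

import numpy as np

def fuzzy_and(a, b):
    # Component-wise minimum: the fuzzy intersection used by fuzzy ART/ARTMAP.
    return np.minimum(a, b)

def choose_category(I, weights, alpha=0.001, rho=0.75):
    """Winner-take-all category choice under the ART matching rule.

    I       : complement-coded input vector (so its city-block norm is constant)
    weights : list of learned weight vectors w_j, one per committed category
    alpha   : small choice parameter; rho : vigilance (higher = stricter match)
    Returns the index of the resonating category, or None if a new category
    should be committed.
    """
    T = [fuzzy_and(I, w).sum() / (alpha + w.sum()) for w in weights]
    for j in np.argsort(T)[::-1]:          # search categories in order of choice value
        match = fuzzy_and(I, weights[j]).sum() / I.sum()
        if match >= rho:                    # matching rule satisfied: resonance
            return j
        # otherwise this category is reset and the search continues
    return None

In a supervised ARTMAP setting a predictive error would additionally raise vigilance (match tracking) and restart the search, while the distributed variants discussed above retain a graded activation pattern across several categories instead of a single winner.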
References
[1] S. Grossberg, “The link between brain, learning, attention, and consciousness,” Consciousness and Cognition, vol. 8, pp. 1-44, 1999. ftp://cns-ftp.bu.edu/pub/diana/Gro.concog98.ps.gz
[2] S. Grossberg, “How does the cerebral cortex work? Development, learning, attention, and 3D vision by laminar circuits of visual cortex,” Behavioral and Cognitive Neuroscience Reviews, in press, 2003, http://www.cns.bu.edu/Profiles/Grossberg/Gro2003BCNR.pdf
[3] G.A. Carpenter, “Distributed learning, recognition, and prediction by ART and ARTMAP neural networks,” Neural Networks, vol. 10, pp. 1473-1494, 1997, http://cns.bu.edu/~gail/115_dART_NN_1997_.pdf
[4] O. Parsons and G.A. Carpenter, “ARTMAP neural networks for information fusion and data mining: map production and target recognition methodologies,” Neural Networks, vol. 16, 2003, http://cns.bu.edu/~gail/ARTMAP_map_2003_.pdf
[5] G.A. Carpenter and S. Grossberg, “A massively parallel architecture for a self-organizing neural pattern recognition machine,” Computer Vision, Graphics, and Image Processing, vol. 37, pp. 54-115, 1987.
[6] T.P. Caudell, S.D.G. Smith, R. Escobedo, and M. Anderson, “NIRS: Large scale ART 1 neural architectures for engineering design retrieval,” Neural Networks, vol. 7, pp. 1339-1350, 1994, http://cns.bu.edu/~gail/NIRS_Caudell_1994_.pdf
[7] W. Streilein, A. Waxman, W. Ross, F. Liu, M. Braun, D. Fay, P. Harmon, and C.H. Read, “Fused multi-sensor image mining for feature foundation data,” In Proceedings of 3rd International Conference on Information Fusion, Paris, vol. I, 2000.
[8] P. Lisboa, “Industrial use of safety-related artificial neural networks,” Contract Research Report 327/2001, Liverpool John Moores University, 2001. http://www.hse.gov.uk/research/crr_pdf/2001/crr01327.pdf
[9] G.A. Carpenter, S. Grossberg, N. Markuzon, J.H. Reynolds, and D.B. Rosen, “Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps,” IEEE Transactions on Neural Networks, vol. 3, pp. 698-713, 1992, http://cns.bu.edu/~gail/070_Fuzzy_ARTMAP_1992_.pdf
[10] G.A. Carpenter and W.D. Ross, “ART-EMAP: A neural network architecture for object recognition by evidence accumulation,” IEEE Transactions on Neural Networks, vol. 6, pp. 805-818, 1995, http://cns.bu.edu/~gail/097_ART-EMAP_1995_.pdf
[11] G.A. Carpenter and N. Markuzon, “ARTMAP-IC and medical diagnosis: Instance counting and inconsistent cases,” Neural Networks, vol. 11, pp. 323-336, 1998. http://cns.bu.edu/~gail/117_ARTMAP-IC_1998_.pdf
[12] J.R. Williamson, “Gaussian ARTMAP: A neural network for fast incremental learning of noisy multidimensional maps,” Neural Networks, vol. 9, pp. 881-897, 1998, http://cns.bu.edu/~gail/G-ART_Williamson_1998_.pdf
[13] G.A. Carpenter, B.L. Milenova, and B.W. Noeske, “Distributed ARTMAP: a neural network for fast distributed supervised learning,” Neural Networks, vol. 11, pp. 793-813, 1998, http://cns.bu.edu/~gail/120_dARTMAP_1998_.pdf
[14] G.A. Carpenter, “Default ARTMAP,” Proceedings of the International Joint Conference on Neural Networks (IJCNN'03), 2003. http://cns.bu.edu/~gail/Default_ARTMAP_2003_.pdf
[15] G.A. Carpenter, S. Grossberg, and J.H. Reynolds, “ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network,” Neural Networks, vol. 4, pp. 565-588, 1991, http://cns.bu.edu/~gail/054_ARTMAP_1991_.pdf
Biography Gail Carpenter (http://cns.bu.edu/~gail/) obtained her graduate training in mathematics at the University of Wisconsin (PhD, 1974) and taught at MIT and Northeastern University before moving to Boston University, where she is a professor
in the departments of cognitive and neural systems (CNS) and mathematics. She is director of the CNS Technology Lab and CNS director of graduate studies; serves on the editorial boards of Brain Research, IEEE Transactions on Neural Networks, Neural Computation, Neural Networks, and Neural Processing Letters; has been elected to successive three-year terms on the Board of Governors of the International Neural Network Society (INNS) since its founding, in 1987; and was elected member-at-large of the Council of the American Mathematical Society (1996-1999). She has received the INNS Gabor Award and the Slovak Artificial Intelligence Society Award. She regularly serves as an organizer and program committee member for international conferences and workshops, and has delivered many plenary and invited addresses. Together with Stephen Grossberg and their students and colleagues, Professor Carpenter has developed the Adaptive Resonance Theory (ART) family of neural networks for fast learning, pattern recognition, and prediction, including both unsupervised (ART 1, ART 2, ART 2-A, ART 3, fuzzy ART, distributed ART) and supervised (ARTMAP, fuzzy ARTMAP, ARTEMAP, ARTMAP-IC, ARTMAP-FTR, distributed ARTMAP, default ARTMAP) systems. These ART models have been used for a wide range of applications, such as remote sensing, medical diagnosis, automatic target recognition, mobile robots, and database management. Other research topics include the development, computational analysis, and application of neural models of vision, nerve impulse generation, synaptic transmission, and circadian rhythms.
The Brain's Cognitive Dynamics: The Link between Learning, Attention, Recognition, and Consciousness Stephen Grossberg Center for Adaptive Systems and Department of Cognitive and Neural Systems Boston University, 677 Beacon Street, Boston, MA 02215
[email protected] http://www.cns.bu.edu/Profiles/Grossberg
Abstract The processes whereby our brains continue to learn about a changing world in a stable fashion throughout life are proposed to lead to conscious experiences. These processes include the learning of top-down expectations, the matching of these expectations against bottom-up data, the focusing of attention upon the expected clusters of information, and the development of resonant states between bottom-up and top-down processes as they reach a predictive and attentive consensus between what is expected and what is there in the outside world. It is suggested that all conscious states in the brain are resonant states, and that these resonant states trigger learning of sensory and cognitive representations when they amplify and synchronize distributed neural signals that are bound by the resonance. Thus, processes of learning, intention, attention, synchronization, and consciousness are intimately bound up together. The name Adaptive Resonance Theory, or ART, summarizes the predicted link between these processes. Illustrative psychophysical and neurobiological data have been explained and quantitatively simulated using these concepts in the areas of early vision, visual object recognition, auditory streaming, and speech perception, among others. It is noted how these mechanisms seem to be realized by known laminar circuits of the visual cortex. In particular, they seem to be operative at all levels of the visual system. Indeed, the mammalian neocortex, which is the seat of higher biological intelligence in all modalities, exhibits a remarkably uniform laminar architecture, with six characteristic layers and sublamina. These known laminar ART, or LAMINART, models illustrate the emerging paradigm of Laminar Computing which is attempting to answer the fundamental question: How does laminar computing give rise to biological intelligence? These laminar circuits also illustrate the fact that, in a rapidly growing number of examples, an individual model can quantitatively simulate the recorded dynamics of identified neurons in anatomically characterized circuits and the behaviors that they control. In this precise sense, the classical Mind/Body problem is starting to get solved. It is further noted that many parallel processing streams of the brain often compute properties that are complementary to each other, much as a lock fits a key or the pieces of a puzzle fit together. Hierarchical and parallel interactions within and between these processing streams can overcome their complementary deficiencies by generating emergent properties that compute complete information about a prescribed
form of intelligent behavior. This emerging paradigm of Complementary Computing is proposed to be a better paradigm for understanding biological intelligence than various previous proposals, such as the postulate of independent modules that are specialized to carry out prescribed intelligent tasks. Complementary computing is illustrated by the fact that sensory and cognitive processing in the What processing stream of the brain, that passes through cortical areas V1-V2-V4-IT on the way to prefrontal cortex, obey top-down matching and learning laws that are often complementary to those used for spatial and motor processing in the brain's Where/How processing stream, that passes through cortical areas V1-MT-MST-PPC on the way to prefrontal cortex. These complementary properties enable sensory and cognitive representations to maintain their stability as we learn more about the world, while allowing spatial and motor representations to forget learned maps and gains that are no longer appropriate as our bodies develop and grow from infanthood to adulthood. Procedural memories are proposed to be unconscious because the inhibitory matching process that supports their spatial and motor processes cannot lead to resonance. Because ART principles and mechanisms clarify how incremental learning can occur autonomously without a loss of stability under both unsupervised and supervised conditions in response to a rapidly changing world, algorithms based on ART have been used in a wide range of applications in science and technology.
Biography Stephen Grossberg is Wang Professor of Cognitive and Neural Systems and Professor of Mathematics, Psychology, and Biomedical Engineering at Boston University. He is the founder and Director of the Center for Adaptive Systems, founder and Chairman of the Department of Cognitive and Neural Systems, founder and first President of the International Neural Network Society (INNS), and founder and co-editor-in-chief of Neural Networks, the official journal of INNS, the European Neural Network Society (ENNS), and the Japanese Neural Network Society (JNNS). Grossberg has served as an editor of many other journals, including Journal of Cognitive Neuroscience, Behavioral and Brain Sciences, Cognitive Brain Research, Cognitive Science, Adaptive Behavior, Neural Computation, Journal of Mathematical Psychology, Nonlinear Analysis, IEEE Expert, and IEEE Transactions on Neural Networks. He was general chairman of the IEEE First International Conference on Neural Networks and played a key role in organizing the first annual INNS conference. Both conferences have since fused into the International Joint Conference on Neural Networks (IJCNN), the largest conference on biological and technological neural network research in the world. His lecture series at MIT Lincoln Laboratory on neural network technology was instrumental in motivating the laboratory to initiate the national DARPA Study on Neural Networks. He has received a number of awards, including the 1991 IEEE Neural Network Pioneer award, the 1992 INNS Leadership Award, the 1992 Thinking Technology Award of the Boston Computer Society, the 2000 Information Science Award of the Association for Intelligent Machinery, the 2002 Charles River Laboratories prize of the Society for Behavioral Toxicology, and the 2003 INNS Helmholtz award. He was elected a fellow of the American
Psychological Association in 1994, a fellow of the Society of Experimental Psychologists in 1996, and a fellow of the American Psychological Society in 2002. He and his colleagues have pioneered and developed a number of the fundamental principles, mechanisms, and architectures that form the foundation for contemporary neural network research, particularly those which enable individuals to adapt autonomously in real-time to unexpected environmental changes. Such models have been used both to analyse and predict interdisciplinary data about mind and brain, and to suggest novel architectures for technological applications. Grossberg received his graduate training at Stanford University and Rockefeller University, and was a Professor at MIT before assuming his present position at Boston University. Core modeling references from the work of Grossberg and his colleagues are listed below, covering neural models of working memory and short-term memory, learning and long-term memory, expectation, attention, resonance, synchronization, recognition, categorization, memory search, hypothesis testing, and consciousness in vision, visual object recognition, audition, speech, cognition, and cognitive-emotional interactions. Some articles since 1997 can be downloaded from http://www.cns.bu.edu/Profiles/Grossberg
References
[1] Baloch, A.A. and Grossberg, S. (1997). A neural model of high-level motion processing: Line motion and formotion dynamics. Vision Research, 37, 3037-3059.
[2] Banquet, J-P. and Grossberg, S. (1987). Probing cognitive processes through the structure of event-related potentials during learning: An experimental and theoretical analysis. Applied Optics, 26, 4931-4946. Reprinted in Carpenter, G.A. and Grossberg, S. (1991). Pattern Recognition by Self-Organizing Neural Networks. Cambridge, MA: MIT Press.
[3] Boardman, I., Grossberg, S., Myers, C., and Cohen, M.A. (1999). Neural dynamics of perceptual order and context effects for variable-rate speech syllables. Perception and Psychophysics, 61, 1477-1500.
[4] Bradski, G. and Grossberg, S. (1995). Fast-learning VIEWNET architectures for recognizing three-dimensional objects from multiple two-dimensional views. Neural Networks, 8, 1053-1080.
[5] Bradski, G., Carpenter, G.A., and Grossberg, S. (1992). Working memory networks for learning temporal order with application to three-dimensional visual object recognition. Neural Computation, 4, 270-286.
[6] Bradski, G., Carpenter, G.A., and Grossberg, S. (1994). STORE working memory networks for storage and recall of arbitrary temporal sequences. Biological Cybernetics, 71, 469-480.
[7] Carpenter, G.A. (1989). Neural network models for pattern recognition and associative memory. Neural Networks, 2, 243-257. Reprinted in Carpenter, G.A. and Grossberg, S. (1991). Pattern Recognition by Self-Organizing Neural Networks. Cambridge, MA: MIT Press.
[8] Carpenter, G.A. (1997). Distributed learning, recognition, and prediction by ART and ARTMAP neural networks. Neural Networks, 10, 1473-1494.
[9] Carpenter, G.A. and Grossberg, S. (1987). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37, 54-115. Reprinted in Grossberg, S. (1988). Neural Networks and Natural Intelligence. Cambridge, MA: MIT Press.
[10] Carpenter, G.A. and Grossberg, S. (1987). ART 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26, 4919-4930. Reprinted in Carpenter, G.A. and Grossberg, S. (1991). Pattern Recognition by Self-Organizing Neural Networks. Cambridge, MA: MIT Press.
[11] Carpenter, G.A. and Grossberg, S. (1990). ART 3: Hierarchical search using chemical transmitters in self-organizing pattern recognition architectures. Neural Networks, 3, 129-152. Reprinted in Carpenter, G.A. and Grossberg, S. (1991). Pattern Recognition by Self-Organizing Neural Networks. Cambridge, MA: MIT Press.
[12] Carpenter, G.A. and Grossberg, S. (1991). Pattern Recognition by Self-Organizing Neural Networks. Cambridge, MA: MIT Press.
[13] Carpenter, G.A. and Grossberg, S. (1992). A self-organizing neural network for supervised learning, recognition, and prediction. IEEE Communications Magazine, September, 38-49.
[14] Carpenter, G.A. and Grossberg, S. (1993). Normal and amnesic learning, recognition, and memory by a neural model of cortico-hippocampal interactions. Trends in Neurosciences, 16, 131-137.
[15] Carpenter, G.A., Grossberg, S., Markuzon, N., Reynolds, J.H., and Rosen, D.B. (1992). Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Transactions on Neural Networks, 3, 698-713.
[16] Carpenter, G.A., Grossberg, S., and Reynolds, J.H. (1991). ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks, 4, 565-588. Reprinted in Carpenter, G.A. and Grossberg, S. (1991). Pattern Recognition by Self-Organizing Neural Networks. Cambridge, MA: MIT Press.
[17] Carpenter, G.A., Grossberg, S., and Reynolds, J.H. (1995). A fuzzy ARTMAP nonparametric probability estimator for nonstationary pattern recognition problems. IEEE Transactions on Neural Networks, 6, 1330-1336.
[18] Carpenter, G.A., Grossberg, S., and Rosen, D.B. (1991). ART 2-A: An adaptive resonance algorithm for rapid category learning and recognition. Neural Networks, 4, 493-504.
[19] Chey, J., Grossberg, S., and Mingolla, E. (1997). Neural dynamics of motion grouping: From aperture ambiguity to object speed and direction. Journal of the Optical Society of America A, 14, 2570-2594.
[20] Cohen, M.A. and Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 13, 815-826. Reprinted in S. Grossberg (1987). The Adaptive Brain, Vol. I. Amsterdam: Elsevier Science.
[21] Cohen, M.A. and Grossberg, S. (1986). Neural dynamics of speech and language coding: Developmental programs, perceptual grouping, and competition for short term memory. Human Neurobiology, 5, 1-22. Reprinted in S. Grossberg (1987). The Adaptive Brain, Vol. II. Amsterdam: Elsevier Science.
[22] Cohen, M.A. and Grossberg, S. (1987). Masking fields: A massively parallel neural architecture for learning, recognizing, and predicting multiple groupings of patterned data. Applied Optics, 26, 1866-1891. Reprinted in Grossberg, S. (1988). Neural Networks and Natural Intelligence. Cambridge, MA: MIT Press.
[23] Cohen, M.A., Grossberg, S., and Stork, D.G. (1988). Speech perception and production by a self-organizing neural network. In Evolution, Learning, Cognition, and Advanced Architectures (Y.C. Lee, Ed.). Singapore: World Scientific. Reprinted in Carpenter, G.A. and Grossberg, S. (1991). Pattern Recognition by Self-Organizing Neural Networks. Cambridge, MA: MIT Press.
[24] Ellias, S.A. and Grossberg, S. (1975). Pattern formation, contrast control, and oscillations in the short-term memory of shunting on-center off-surround networks. Biological Cybernetics, 20, 69-98.
[25] Gove, A., Grossberg, S., and Mingolla, E. (1995). Brightness perception, illusory contours, and corticogeniculate feedback. Visual Neuroscience, 12, 1027-1052.
[26] Granger, E., Rubin, M., Grossberg, S., and Lavoie, P. (2001). A what-and-where fusion neural network for recognition and tracking of multiple radar emitters. Neural Networks, 14, 325-344.
[27] Grossberg, S. (1969). On learning and energy-entropy dependence in recurrent and nonrecurrent signed networks. Journal of Statistical Physics, 1, 319-350.
[28] Grossberg, S. (1971). Pavlovian pattern learning by nonlinear neural networks. Proceedings of the National Academy of Sciences, 68, 828-831.
[29] Grossberg, S. (1972). Pattern learning by functional-differential neural networks with arbitrary path weights. In Delay and functional-differential equations and their applications (K. Schmitt, Ed.). New York: Academic Press.
[30] Grossberg, S. (1973). Contour enhancement, short term memory, and constancies in reverberating neural networks. Studies in Applied Mathematics, LII, 213-257. Reprinted in Grossberg, S. (1982). Studies of Mind and Brain. New York: Kluwer/Reidel.
[31] Grossberg, S. (1974). Classical and instrumental learning by neural networks. In Progress in Theoretical Biology, Vol. 3 (R. Rosen and F. Snell, Eds.), pp. 51-141. New York: Academic Press.
[32] Grossberg, S. (1976). Adaptive pattern classification and universal recoding, I: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121-134. Reprinted in Grossberg, S. (1982). Studies of Mind and Brain. New York: Kluwer/Reidel.
[33] Grossberg, S. (1976). Adaptive pattern classification and universal recoding, II: Feedback, expectation, olfaction, and illusions. Biological Cybernetics, 23, 187-202.
[34] Grossberg, S. (1977). Pattern formation by the global limits of a nonlinear competitive interaction in n dimensions. Journal of Mathematical Biology, 4, 237-256.
[35] Grossberg, S. (1978). A theory of human memory: Self-organization and performance of sensory-motor codes, maps and plans. In Progress in Theoretical Biology, Vol. 5 (R. Rosen and F. Snell, Eds.), pp. 233-374. New York: Academic Press. Reprinted in Grossberg, S. (1982). Studies of Mind and Brain. New York: Kluwer/Reidel.
[36] Grossberg, S. (1978). Behavioral contrast in short term memory: serial binary memory models or parallel continuous memory models. Journal of Mathematical Psychology, 17, 199-219. Reprinted in Grossberg, S. (1982). Studies of Mind and Brain. New York: Kluwer/Reidel.
[37] Grossberg, S. (1978). Competition, decision, and consensus. Journal of Mathematical Analysis and Applications, 66, 470-493. Reprinted in Grossberg, S. (1982). Studies of Mind and Brain. New York: Kluwer/Reidel.
[38] Grossberg, S. (1978). Decisions, patterns, and oscillations in nonlinear competitive systems with applications to Volterra-Lotka systems. Journal of Theoretical Biology, 73, 101-130.
[39] Grossberg, S. (1980). Biological competition: Decision rules, pattern formation, and oscillations. Proceedings of the National Academy of Sciences, 77, 2338-2342.
[40] Grossberg, S. (1980). How does a brain build a cognitive code? Psychological Review, 87, 1-51. Reprinted in Grossberg, S. (1982). Studies of Mind and Brain. New York: Kluwer/Reidel.
[41] Grossberg, S. (1982). Studies of Mind and Brain: Neural principles of learning, perception, development, cognition, and motor control. New York: Kluwer/Reidel.
[42] Grossberg, S. (1982). Associative and competitive principles of learning and development: The temporal unfolding and stability of STM and LTM patterns. In Competition and Cooperation in Neural Nets (S. Amari and M. Arbib, Eds.). Lecture Notes in Biomathematics, 45, 295-341. New York: Springer-Verlag. Reprinted in S. Grossberg (1987). The Adaptive Brain, Vol. I. Amsterdam: Elsevier Science.
[43] Grossberg, S. (1982). Processing of expected and unexpected events during conditioning and attention: A psychophysiological theory. Psychological Review, 89, 529-572.
[44] Grossberg, S. (1984). Some normal and abnormal behavioral syndromes due to transmitter gating of opponent processes. Biological Psychiatry, 19, 1075-1118.
[45] Grossberg, S. (1984). Some psychophysiological and pharmacological correlates of a developmental, cognitive, and motivational theory. In Brain and Information: Event Related Potentials, 425, 58-151 (R. Karrer, J. Cohen, and P. Tueting, Eds.). New York Academy of Sciences. Reprinted in S. Grossberg (1987). The Adaptive Brain, Vol. I. Amsterdam: Elsevier Science.
[46] Grossberg, S. (1984). Unitization, automaticity, temporal order, and word recognition. Cognition and Brain Theory, 7, 263-283.
[47] Grossberg, S. (1987). Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, 11, 23-63. Reprinted in Grossberg, S. (1988). Neural Networks and Natural Intelligence. Cambridge, MA: MIT Press.
[48] Grossberg, S. (1987). The Adaptive Brain, Vols. I and II. Amsterdam: Elsevier Science.
[49] Grossberg, S. (1988). Neural Networks and Natural Intelligence. Cambridge, MA: MIT Press.
[50] Grossberg, S. (1988). Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks, 1, 17-61. Reprinted in Carpenter, G.A. and Grossberg, S. (1991). Pattern Recognition by Self-Organizing Neural Networks. Cambridge, MA: MIT Press.
[51] Grossberg, S. (1995). The attentive brain. American Scientist, 83, 438-449.
[52] Grossberg, S. (1999). How does the cerebral cortex work: Learning, attention, and grouping by the laminar circuits of visual cortex. Spatial Vision, 12, 163-186.
[53] Grossberg, S. (1999). The link between brain learning, attention, and consciousness. Consciousness and Cognition, 8, 1-44.
[54] Grossberg, S. (1999). Pitch-based streaming in auditory perception. In Musical Networks: Parallel Distributed Perception and Performance (N. Griffith and P. Todd, Eds.). Cambridge, MA: MIT Press, pp. 117-140.
[55] Grossberg, S. (2000). The complementary brain: Unifying brain dynamics and modularity. Trends in Cognitive Sciences, 233-246.
[56] Grossberg, S. (2000). The imbalanced brain: From normal behavior to schizophrenia. Biological Psychiatry, 81-98.
[57] Grossberg, S. (2000). How hallucinations may arise from brain mechanisms of learning, attention, and volition. Journal of the International Neuropsychological Society, 6, 583-592.
[58] Grossberg, S., Boardman, I., and Cohen, M.A. (1997). Neural dynamics of variable-rate speech categorization. Journal of Experimental Psychology: Human Perception and Performance, 23, 481-503.
[59] Grossberg, S. and Grunewald, A. (1997). Cortical synchronization and perceptual framing. Journal of Cognitive Neuroscience, 9, 117-132.
[60] Grossberg, S. and Howe, P.D.L. (2002). A laminar cortical model of stereopsis and three-dimensional surface perception. Vision Research, in press.
[61] Grossberg, S. and Levine, D. (1976). Some developmental and attentional biases in the contrast enhancement and short-term memory of recurrent neural networks. Journal of Theoretical Biology, 53, 341-380.
[62] Grossberg, S. and Levine, D. (1987). Neural dynamics of attentionally modulated Pavlovian conditioning: Blocking, interstimulus interval, and secondary conditioning. Applied Optics, 26, 5015-5030. Reprinted in Grossberg, S. (1988). Neural Networks and Natural Intelligence. Cambridge, MA: MIT Press.
[63] Grossberg, S. and Merrill, J.W.L. (1992). A neural network model of adaptively timed reinforcement learning and hippocampal dynamics. Cognitive Brain Research, 1, 3-38.
[64] Grossberg, S. and Merrill, J.W.L. (1996). The hippocampus and cerebellum in adaptively timed learning, recognition, and movement. Journal of Cognitive Neuroscience, 8, 257-277.
[65] Grossberg, S., Mingolla, E., and Ross, W.D. (1994). A neural theory of attentive visual search: Interactions of boundary, surface, spatial, and object representations. Psychological Review, 101, 470-489.
[66] Grossberg, S., Mingolla, E., and Ross, W.D. (1997). Visual brain and visual perception: How does the cortex do perceptual grouping? Trends in Neurosciences, 20, 106-111.
[67] Grossberg, S., Mingolla, E., and Viswanathan, L. (2001). Neural dynamics of motion integration and segmentation within and across apertures. Vision Research, 41, 2521-2553.
[68] Grossberg, S. and Myers, C. (2000). The resonant dynamics of speech perception: Interword integration and duration-dependent backward effects. Psychological Review, 107, 735-767.
[69] Grossberg, S. and Raizada, R.D.S. (2000). Contrast-sensitive perceptual grouping and object-based attention in the laminar circuits of primary visual cortex. Vision Research, 40, 1413-1432.
[70] Grossberg, S. and Schmajuk, N.A. (1987). Neural dynamics of attentionally modulated Pavlovian conditioning: Conditioned reinforcement, inhibition, and opponent processing. Psychobiology, 15, 195-240. Reprinted in Grossberg, S. (1988). Neural Networks and Natural Intelligence. Cambridge, MA: MIT Press.
[71] Grossberg, S. and Somers, D. (1991). Synchronized oscillations during cooperative feature linking in a cortical model of visual perception. Neural Networks, 4, 453-466.
[72] Grossberg, S. and Stone, G. (1986). Neural dynamics of attention switching and temporal order information in short term memory. Memory and Cognition, 14, 451-468. Reprinted in Grossberg, S. (1988). Neural Networks and Natural Intelligence. Cambridge, MA: MIT Press.
[73] Grossberg, S. and Stone, G. (1986). Neural dynamics of word recognition and recall: Attentional priming, learning, and resonance. Psychological Review, 93, 46-74. Reprinted in S. Grossberg (1987). The Adaptive Brain, Vol. II. Amsterdam: Elsevier Science.
[74] Grossberg, S. and Williamson, J.R. (1999). A self-organizing neural system for learning to recognize textured scenes. Vision Research, 39, 1385-1406.
[75] Grossberg, S. and Williamson, J.R. (2001). A neural model of how horizontal and interlaminar connections of visual cortex develop into adult circuits that carry out perceptual grouping and learning. Cerebral Cortex, 11, 37-58.
[76] Grunewald, A. and Grossberg, S. (1998). Self-organization of binocular disparity tuning by reciprocal corticogeniculate interactions. Journal of Cognitive Neuroscience, 10, 100-215.
[77] Olson, S.J. and Grossberg, S. (1998). A neural network model for the development of simple and complex cell receptive fields within cortical maps of orientation and ocular dominance. Neural Networks, 11, 189-208.
[78] Raizada, R.D.S. and Grossberg, S. (2001). Context-sensitive binding by the laminar circuits of V1 and V2: A unified model of perceptual grouping, attention, and orientation contrast. Visual Cognition, 8, 431-466.
[79] Raizada, R. and Grossberg, S. (2003). Towards a Theory of the Laminar Architecture of Cerebral Cortex: Computational Clues from the Visual System. Cerebral Cortex, 13, 100-113.
Adaptive Data Based Modelling and Estimation with Application to Real Time Vehicular Collision Avoidance Chris J. Harris Department of Electronics and Computer Science, University of Southampton Highfield, Southampton SO17 1BJ, UK
[email protected] Abstract The majority of control and estimation algorithms are based upon linear time invariant models of the process, yet many dynamic processes are nonlinear, stochastic and non-stationary. In this presentation an online data based modelling and estimation approach is described which produces parsimonious dynamic models, which are transparent and appropriate for control and estimation applications. These models are linear in the adjustable parameters – hence are provable, real time and transparent but exponential in the input space dimension. Several approaches are introduced – including automatic structure algorithms to reduce the inherent curse of dimensionality of the approach. The resultant algorithms can be interpreted in rule based form and therefore offer considerable transparency to the user as to the underlying dynamics, equally the user can control the resultant rule base during learning. These algorithms will be applied to (a) helicopter flight control (b) auto-car driving and (c) multiple ship guidance and control.
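To make the "linear in the adjustable parameters" point concrete, here is a minimal Python sketch, assuming a Gaussian radial-basis-function expansion fitted by regularised least squares; the basis centres, width, ridge term, and synthetic data are illustrative and are not taken from the cited algorithms.

import numpy as np

def rbf_design_matrix(X, centres, width):
    # Gaussian radial basis functions: one column per centre.
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def fit_linear_in_parameters(X, y, centres, width, ridge=1e-6):
    # Because the model y = Phi(X) w is linear in w, the weights have a
    # closed-form regularised least-squares solution: fast to compute and
    # directly interpretable as the strengths of local, rule-like basis units.
    Phi = rbf_design_matrix(X, centres, width)
    A = Phi.T @ Phi + ridge * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

# Illustrative use on synthetic one-dimensional data.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3.0 * X[:, 0]) + 0.05 * rng.standard_normal(200)
centres = np.linspace(-1.0, 1.0, 9).reshape(-1, 1)
w = fit_linear_in_parameters(X, y, centres, width=0.3)

Because each basis function is local, each fitted weight can be read as the strength of a rule centred on a region of the input space, which is the transparency property emphasised above; the exponential growth of such grids with input dimension is the curse of dimensionality that the structure-selection algorithms address.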
References
[1] Harris C.J., Hong X., Gan Q. Adaptive Modelling Estimation and Fusion from Data. Springer Verlag, Berlin (2002)
[2] Chen S., Hong X., Harris C.J. Sparse multioutput RBF network construction using combined locally regularised OLS and D-optimality. IEE Proc. CTA, Vol. 150 (2) (March 2002) pp. 139-146
[3] Hong X., Harris C.J., Chen S. Robust neurofuzzy model knowledge extraction. To appear, Trans. IEEE SMC (2003)
Biography Professor Harris has degrees from the Universities of Southampton, Leicester and Oxford. He is a Fellow of the Royal Academy of Engineering, Honorary Professor at University of Hong Kong; Holder of the 2001 Faraday Medal and the 1998 IEE Senior Achievement Medal for research into Nonlinear Signal Processing. Author of
7 research books in nonlinear adaptive systems and control theory and over 300 learned papers. Currently he is Emeritus Professor of Computational Intelligence at Southampton University and Director of the UK – MOD Defence Technology Centre in Data and Information Fusion – a £10M a year initiative involving three companies and eight leading UK Universities, supporting over 70 researchers.
Creating a Smart Virtual Personality Nadia Magnenat-Thalmann MIRALab - University of Geneva 24, Rue General-Dufour,1211 Geneva, Switzerland
[email protected] www.miralab.ch
As people are getting more and more dependent on using the computer for a variety of tasks, providing interfaces that are intelligent and easy to interact with has become an important issue in computer science. Current research in computer graphics, artificial intelligence and cognitive psychology aims to give computers a more human face, so that interacting with computers becomes more like interacting with humans. With the emergence of 3D graphics, we are now able to create very believable 3D characters that can move and talk. However, the behaviour that is expressed by such characters is far from believable in a lot of systems. We feel that these characters lack individuality. This talk addresses some of the aspects of creating 3D virtual characters that have a form of individuality, driven by state-of-the-art personality and emotion models (see Figure 1). 3D characters will be personalized not only on an expressive level (for example, generating emotion expressions on a 3D face), but also on an interactive level (response generation that is coherent with the personality and emotional state of the character) and a perceptive level (having an emotional reaction to the user and her/his environment). In order to create a smart virtual personality, we need to concentrate on several different research topics. Firstly, the simulation of personality and emotional state requires extensive psychological research of how real humans act/react emotionally to events in their surroundings, as well as an investigation into which independent factors cause humans to act/react in a certain way (also called the personality). Once this has been determined, we need to investigate whether or not such models are suitable for computer simulations, and if so, we have to define a concrete form of these notions and how they interact. Secondly, the response generation mechanism used by a virtual character needs to take personality and emotions into account. It is crucial to find generic constructs for linking personality and emotions with a response generation system (which can be anything from rule-based pattern matching systems to full-scale logical reasoning engines). And finally, the expressive virtual character should have speech capabilities and face and body animation. The animation should be controlled by high-level parameters such as facial expressions and gestures. Furthermore, the bodies and faces of virtual characters should be easily exchangeable and animation sequences should be character-independent (an example of this is face and body representation and animation using the MPEG-4 standard).
Fig. 1. 3D virtual character overview. From the user’s behaviour, the system determines the impact on its emotional state. This information is then used (together with semantic information of the user’s behaviour) to generate an appropriate response, which is expressed through a 3D virtual character linked with a text-to-speech system
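A minimal Python sketch of the three levels described above (perceptive, interactive, expressive), assuming a deliberately simple two-trait personality and two-emotion state; the trait names, update rule, and expression mapping are illustrative assumptions rather than any particular published personality or emotion model.

from dataclasses import dataclass

@dataclass
class Personality:
    # Illustrative trait pair in [0, 1]; the talk does not prescribe a model.
    extraversion: float = 0.5
    neuroticism: float = 0.5

@dataclass
class EmotionalState:
    joy: float = 0.0
    distress: float = 0.0
    decay: float = 0.9        # emotions fade between interactions

    def perceive(self, appraisal, p):
        # Perceptive level: blend an appraisal of the user's behaviour into the
        # state, with personality traits biasing how strongly events register.
        self.joy = self.decay * self.joy + (0.5 + 0.5 * p.extraversion) * appraisal.get("positive", 0.0)
        self.distress = self.decay * self.distress + (0.5 + 0.5 * p.neuroticism) * appraisal.get("negative", 0.0)

def expression(state):
    # Expressive level: map the state to a high-level animation parameter.
    return "smile" if state.joy >= state.distress else "frown"

# Interactive level (not shown): the response generator would read the same
# state and personality when choosing what the character says.
character = EmotionalState()
character.perceive({"positive": 0.8}, Personality(extraversion=0.9))
print(expression(character))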
Biography Prof. Nadia Magnenat-Thalmann has pioneered research into virtual humans over the last 20 years. She obtained several Bachelor's and Master's degrees in various disciplines and a PhD in Quantum Physics from the University of Geneva. From 1977 to 1989, she was a Professor at the University of Montreal in Canada. She moved to the University of Geneva in 1989, where she founded MIRALab. She has received several scientific and artistic awards for her work in Canada and in Europe. In 1997, she was elected to the Swiss Academy of Technical Sciences, and more recently, she was nominated as a Swiss personality who has contributed to the advance of science in the 150-year history CD-ROM produced by the Swiss Confederation Parliament, 1998, Bern, Switzerland. She has been invited to give hundreds of lectures on various topics, all related to virtual humans. Author and co-author of a very large number of research papers and books, she has directed and produced several films and real-time mixed reality shows, among the latest are CYBERDANCE (1998), FASHION DREAMS (1999) and the UTOPIANS (2001). She is editor-in-chief of the Visual Computer Journal published by Springer Verlag and editor of several other research journals.
Intelligent Navigation on the Mobile Internet Barry Smyth1,2
1 Smart Media Institute, University College Dublin, Dublin, Ireland
2 ChangingWorlds, South County Business Park, Leopardstown Road, Dublin, Ireland
[email protected]
1 Summary
For many users the Mobile Internet means accessing information services through their mobile handsets - accessing mobile portals via WAP phones, for example. In this context, the Mobile Internet represents both a dramatic step forward and a significant step backward, from an information access standpoint. While it offers users greater access to valuable information and services “on the move”, mobile handsets are hardly the ideal access device in terms of their screen-size and input capabilities. As a result, mobile portal users are often frustrated by how difficult it is to quickly access the right information at the right time. A mobile portal is typically structured as a hierarchical set of navigation (menu) pages leading to distinct content pages (see Fig. 1). An ideal portal should present a user with relevant content without the need for spurious navigation. However, the reality is far from ideal. Studies highlight that while the average user expects to be able to access content within 30 seconds, the reality is closer to 150 seconds [1]. WAP navigation effort can be modelled as click-distance ([5, 4]) - the number of menu selections and scrolls needed to locate a content item. Our studies indicate that, ideally, content should be positioned no more than 10-12 ‘clicks’ from the portal home page. However, we have also found that many portals suffer from average click-distances (home page to content items) in excess of 20 [3] (see Fig. 1). Personalization research seeks to develop techniques for learning and exploiting user preferences to deliver the right content to the right user at the right time (see [2]). We have shown how personalization techniques for adapting the navigation structure of a portal can reduce click-distance and thus radically reduce navigation effort and improve portal usability ([5]). Personalization technology has led to the development of the ClixSmart Navigator™ product-suite, developed by ChangingWorlds Ltd. (www.changingworlds.com). ClixSmart Navigator has been deployed successfully with some of Europe's leading mobile operators. The result: users are able to locate information and services more efficiently through their mobile handsets and this in turn has led to significant increases in portal usage. In fact, live-user studies, encompassing thousands of users, indicate usage increases in excess of 30% and dramatic improvements in the user's online experience ([5, 4]).
[Figure: a sequence of WAP menu screens leading from the portal home page, through the Entertainment and Cinema menus and county listings, to the Ormonde cinema entry.]
Fig. 1. To access their local cinema (the Ormonde) this user must engage in an extended sequence of navigation actions; 16 clicks in total (selects and scrolls) are needed in this example
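A minimal Python sketch of the click-distance measure discussed above, assuming one plausible counting convention (one select for the chosen option plus one scroll for each option skipped on the way to it); the toy portal mirrors the navigation path of Fig. 1 in simplified form, so the count here is smaller than the 16 clicks of the full example.

def click_distance(menu, target, selects=0, scrolls=0):
    """Clicks (selects + scrolls) needed to reach `target` from this menu.

    `menu` maps option labels either to nested menus (dicts) or to content
    items (None). Scrolling down to the i-th option counts as i scrolls and
    choosing it as one select -- one plausible reading of the click-distance
    model in [5, 4]. Returns the total, or None if the target is unreachable.
    """
    for position, (label, child) in enumerate(menu.items()):
        cost = selects + 1 + scrolls + position
        if label == target:
            return cost
        if isinstance(child, dict):
            found = click_distance(child, target, selects + 1, scrolls + position)
            if found is not None:
                return found
    return None

# Tiny illustrative portal: Home -> Entertainment -> Cinema -> Ormonde.
portal = {"News": None,
          "Sport": None,
          "Entertainment": {"TV": None,
                            "Cinema": {"Omniplex": None, "Ormonde": None}}}
print(click_distance(portal, "Ormonde"))   # 7 clicks in this simplified tree

Personalization in this setting amounts to reordering or promoting menu options for each user so that frequently visited content items end up at smaller click-distances.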
2 Biography
Prof. Barry Smyth is the Digital Chair of Computer Science at University College Dublin, Ireland. His research interests cover many aspects of artificial intelligence, case-based reasoning and personalization. Barry’s research has led to the development of ChangingWorlds Ltd, which develops AI-based portal software for mobile operators, including Vodafone and O2 . He has published widely and received a number of international awards for his basic and applied research, including best paper awards at the International Joint Conference on Artificial Intelligence (IJCAI) and the European Conference on Artificial Intelligence Prestigious Applications of Intelligent Systems (ECAI - PAIS).
References
[1] M. Ramsey and J. Nielsen. The WAP Usability Report. Nielsen Norman Group, 2000.
[2] D. Reiken. Special issue on personalization. Communications of the ACM, 43(8), 2000.
[3] B. Smyth. The Plight of the Mobile Navigator. MobileMetrix, 2002.
[4] B. Smyth and P. Cotter. Personalized Adaptive Navigation for Mobile Portals. In Proceedings of the 15th European Conference on Artificial Intelligence - Prestigious Applications of Artificial Intelligence. IOS Press, 2002.
[5] B. Smyth and P. Cotter. The Plight of the Navigator: Solving the Navigation Problem for Wireless Portals. In Proceedings of the 2nd International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems (AH'02), pages 328-337. Springer-Verlag, 2002.
The Evolution of Evolutionary Computation Xin Yao School of Computer Science, The University of Birmingham Edgbaston, Birmingham B15 2TT, UK
[email protected] http://www.cercia.ac.uk
Abstract. Evolutionary computation has enjoyed a tremendous growth for at least a decade in both its theoretical foundations and industrial applications. Its scope has gone far beyond binary string optimisation using a simple genetic algorithm. Many research topics in evolutionary computation nowadays are not necessarily “genetic” or “evolutionary” in any biological sense. This talk will describe some recent research efforts in addressing several fundamental as well as more applied issues in evolutionary computation. Links with traditional computer science and artificial intelligence will be explored whenever appropriate.
Evolutionary Algorithms as Generate-and-Test
Evolutionary algorithms (EAs) can be regarded as population-based stochastic generate-and-test [1, 2]. The advantage of formulating EAs as a generate-and-test algorithm is that the relationships between EAs and other search algorithms, such as simulated annealing (SA), tabu search (TS), hill-climbing, etc., can be made clearer and thus easier to explore and understand. Under the framework of generate-and-test, different search algorithms investigated in artificial intelligence, operations research, computer science, and evolutionary computation can be unified.
Computational Time Complexity of Evolutionary Algorithms
Most work in evolutionary computation has been experimental. Although there have been many reported results of EAs in solving difficult optimisation problems, the theoretical results on EA's average computation time have been few. It is unclear theoretically what and where the real power of EAs is. It is also unclear theoretically what role a population plays in EAs. Some recent work has started tackling several fundamental issues in evolutionary computation, such as the conditions under which an EA will exhibit polynomial/exponential time behaviours [3, 4], the conditions under which a population can make a difference in terms of complexity classes [5], and the analytical tools and frameworks that facilitate the analysis of EA's average computation time [6].
Two Heads Are Better than One
Although one of the key features of evolutionary computation is a population, most work did not actually exploit this. We can show through a number of examples that exploiting population
information, rather than just the best individual, can lead to many benefits in evolutionary learning, e.g., improved generalisation ability [7, 8, 9] and better fault-tolerance [10].
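A minimal Python sketch of an evolutionary algorithm phrased as population-based stochastic generate-and-test, in the spirit of [1, 2]; the bit-string objective, mutation operator, and truncation selection are illustrative choices, not a specific algorithm from the cited work.

import random

def evolve(fitness, generate, select, pop_size=20, generations=100):
    # A population-based stochastic generate-and-test loop: `generate` proposes
    # new candidates (the "generate" role) and `select` keeps the survivors
    # (the "test" role) -- the two roles under which EAs, simulated annealing,
    # tabu search and hill-climbing can all be compared.
    population = [generate(None) for _ in range(pop_size)]
    for _ in range(generations):
        offspring = [generate(random.choice(population)) for _ in range(pop_size)]
        population = select(population + offspring, fitness, pop_size)
    return max(population, key=fitness)

# Illustrative instantiation: maximise the number of ones in a 30-bit string.
def onemax(bits):
    return sum(bits)

def mutate(parent, n=30, rate=0.05):
    if parent is None:                                   # initial random candidate
        return [random.randint(0, 1) for _ in range(n)]
    return [1 - b if random.random() < rate else b for b in parent]

def truncation(candidates, fitness, k):
    return sorted(candidates, key=fitness, reverse=True)[:k]

best = evolve(onemax, mutate, truncation)

Setting pop_size to 1 and keeping the better of parent and offspring recovers a simple hill-climber, which is one way of seeing what, if anything, the population itself contributes.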
References [1] X. Yao, “An overview of evolutionary computation,” Chinese Journal of Advanced Software Research (Allerton Press, Inc., New York, NY 10011), vol. 3, no. 1, pp. 12–29, 1996. 19 [2] X. Yao, ed., Evolutionary Computation: Theory and Applications. Singapore: World Scientific Publishing Co., 1999. 19 [3] J. He and X. Yao, “Drift analysis and average time complexity of evolutionary algorithms,” Artificial Intelligence, vol. 127, pp. 57–85, March 2001. 19 [4] J. He and X. Yao, “Erratum to: Drift analysis and average time complexity of evolutionary algorithms: [artificial intelligence 127 (2001) 57-85],” Artificial Intelligence, vol. 140, pp. 245–248, September 2002. 19 [5] J. He and X. Yao, “From an individual to a population: An analysis of the first hitting time of population-based evolutionary algorithms,” IEEE Transactions on Evolutionary Computation, vol. 6, pp. 495–511, October 2002. 19 [6] J. He and X. Yao, “Towards an analytic framework for analysing the computation time of evolutionary algorithms,” Artificial Intelligence, vol. 145, pp. 59–97, April 2003. 19 [7] P. J. Darwen and X. Yao, “Speciation as automatic categorical modularization,” IEEE Transactions on Evolutionary Computation, vol. 1, no. 2, pp. 101–108, 1997. 20 [8] X. Yao and Y. Liu, “Making use of population information in evolutionary artificial neural networks,” IEEE Trans. on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 28, no. 3, pp. 417–425, 1998. 20 [9] Y. Liu, X. Yao, and T. Higuchi, “Evolutionary ensembles with negative correlation learning,” IEEE Transactions on Evolutionary Computation, vol. 4, pp. 380–387, November 2000. 20 [10] T. Schnier and X. Yao, “Using negative correlation to evolve fault-tolerant circuits,” in Proceedings of the 5th International Conference on Evolvable Systems (ICES-2003), LNCS 2606, Springer, Germany, March 2003. 20
Speaker Bio Xin Yao is a professor of computer science and Director of the Centre of Excellence for Research in Computational Intelligence and Applications (CERCIA) at the University of Birmingham, UK. He is also a visiting professor of the University College, the University of New South Wales, the Australian Defence Force Academy, Canberra, the University of Science and Technology of China, Hefei, the Nanjing University of Aeronautics and Astronautics, Nanjing, and the Northeast Normal University, Changchun. He is an IEEE fellow, the editor in chief of IEEE Transactions on Evolutionary Computation, an associate editor or an editorial board member of five other international journals, and the chair of IEEE NNS Technical Committee on Evolutionary Computation. He is the recipient of the 2001 IEEE Donald G. Fink Prize Paper Award and has given more than 20 invited keynote/plenary speeches at various conferences. His major
The Evolution of Evolutionary Computation
21
research interests include evolutionary computation, neural network ensembles, global optimization, computational time complexity and data mining.
A Unified Model Maintains Knowledge Base Integrity John Debenham University of Technology, Sydney, Faculty of Information Technology, PO Box 123, NSW 2007, Australia
[email protected] Abstract. A knowledge base is maintained by modifying its conceptual model and by using those modifications to specify changes to its implementation. The maintenance problem is to determine which parts of that model should be checked for correctness in response to a change in the application. The maintenance problem is not computable for first-order knowledge bases. Two things in the conceptual model are joined by a maintenance link if a modification to one of them means that the other must be checked for correctness, and so possibly modified, if consistency of the model is to be preserved. In a unified conceptual model for first-order knowledge bases the data and knowledge are modelled formally in a uniform way. A characterisation is given of four different kinds of maintenance links in a unified conceptual model. Two of these four kinds of maintenance links can be removed by transforming the conceptual model. In this way the maintenance problem is simplified.
1 Introduction
Maintenance links join two things in the conceptual model if a modification to one of them means that the other must be checked for correctness, and so possibly modified, if consistency of that model is to be preserved. If that other thing requires modification then the links from it to yet other things must be followed, and so on until things are reached that do not require modification. If node A is linked to node B which is linked to node C then nodes A and C are indirectly linked. In a coherent knowledge base everything is indirectly linked to everything else. A good conceptual model for maintenance will have a low density of maintenance links [1]. The set of maintenance links should be minimal in that none may be removed. Informally, one conceptual model is "better" than another if it leads to less checking for correctness. The aim of this work is to generate a good conceptual model. A classification into four classes is given here of the maintenance links for conceptual models expressed in the unified [2] knowledge representation. Methods are given for removing two of these classes of link, so reducing the density of maintenance links. Approaches to the maintenance of knowledge bases are principally of two types [3]. First, approaches that take the knowledge base as presented and then try to control the maintenance process [4]. Second, approaches that engineer a model of the knowledge base so that it is in a form that is inherently easy to maintain [5] [6]. The approach described here is of the second type.
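A minimal Python sketch of why maintenance-link density matters: given a link structure, the things that must be checked after a modification are found by following links transitively (the indirect links described above). The dictionary encoding of links and the node names are illustrative assumptions.

from collections import deque

def things_to_check(links, modified):
    """All things that must be checked when `modified` changes.

    `links` maps each thing in the conceptual model to the things joined to it
    by a maintenance link. Following links transitively yields the indirectly
    linked things; in a coherent knowledge base this closure can reach
    everything, which is why a low density of maintenance links matters.
    """
    to_check, frontier = set(), deque([modified])
    while frontier:
        thing = frontier.popleft()
        for other in links.get(thing, ()):
            if other not in to_check and other != modified:
                to_check.add(other)
                frontier.append(other)
    return to_check

# Illustrative model: A -> B -> C gives an indirect link from A to C.
links = {"A": ["B"], "B": ["C"], "C": []}
print(things_to_check(links, "A"))   # {'B', 'C'}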
The terms data, information and knowledge are used here in the following sense. The data things in an application are the fundamental, indivisible things. Data things can be represented as simple constants or variables. If an association between things cannot be defined as a succinct, computable rule then it is an implicit association. Otherwise it is an explicit association. An information thing in an application is an implicit association between data things. Information things can be represented as tuples or relations. A knowledge thing in an application is an explicit association between information and/or data things. Knowledge can be represented either as programs in an imperative language or as rules in a declarative language.
2 Conceptual Model
Items are a formalism for describing all data, information and knowledge things in an application [7]. Items incorporate two classes of constraints, and a single rule of decomposition is specified for items. The key to this unified representation is the way in which the "meaning" of an item, called its semantics, is specified. The semantics of an item is a function that recognises the members of the "value set" of that item. The value set of an item will change in time τ, but the item's semantics should remain constant. The value set of an information item at a certain time τ is the set of tuples that are associated with a relational implementation of that item at that time. Knowledge items have value sets too. Consider the rule "the sale price of parts is the cost price marked up by a universal mark-up factor"; this rule is represented by the item named [part/sale-price, part/cost-price, mark-up] with a value set of corresponding quintuples. The idea of defining the semantics of items as recognising functions for the members of their value set extends to complex, recursive knowledge items. An item is a named triple A[ S_A, V_A, C_A ] with item name A, where S_A is called the item semantics of A, V_A is called the item value constraints of A and C_A is called the item set constraints of A. The item semantics, S_A, is a λ-calculus expression that recognises the members of the value set of item A. The expression for an item's semantics may contain the semantics of other items {A_1, ..., A_n}, called that item's components:

$$ \lambda y^1_1 \cdots y^1_{m_1} \cdots y^n_1 \cdots y^n_{m_n} \bullet [\, S_{A_1}(y^1_1,\ldots,y^1_{m_1}) \wedge \cdots \wedge S_{A_n}(y^n_1,\ldots,y^n_{m_n}) \wedge J(y^1_1,\ldots,y^1_{m_1},\ldots,y^n_1,\ldots,y^n_{m_n}) \,]\bullet $$

The item value constraints, V_A, is a λ-calculus expression:

$$ \lambda y^1_1 \cdots y^1_{m_1} \cdots y^n_1 \cdots y^n_{m_n} \bullet [\, V_{A_1}(y^1_1,\ldots,y^1_{m_1}) \wedge \cdots \wedge V_{A_n}(y^n_1,\ldots,y^n_{m_n}) \wedge K(y^1_1,\ldots,y^1_{m_1},\ldots,y^n_1,\ldots,y^n_{m_n}) \,]\bullet $$

that should be satisfied by the members of the value set of item A as they change in time; so if a tuple satisfies S_A then it should satisfy V_A [8]. The expression for an item's value constraints contains the value constraints of that item's components. The item set constraints, C_A, is an expression of the form:

$$ C_{A_1} \wedge C_{A_2} \wedge \cdots \wedge C_{A_n} \wedge (L)_A $$

where L is a logical combination of:
• Card lies in some numerical range;
• Uni(A_i) for some i, 1 ≤ i ≤ n, and
• Can(A_i, X) for some i, 1 ≤ i ≤ n, where X is a non-empty subset of {A_1, ..., A_n} − {A_i};
subscripted with the name of the item A. "Uni(a)" means that "all members of the value set of item a must be in this association". "Can(b, A)" means that "the value set of the set of items A functionally determines the value set of item b". "Card" means "the number of things in the value set". The subscripts indicate the item's components to which that set constraint applies. For example, each part may be associated with a cost-price subject to the "value constraint" that parts whose part-number is less than 1,999 should be associated with a cost price of no more than $300. The information item named part/cost-price then is:

part/cost-price[ λxy•[ S_part(x) ∧ S_cost-price(y) ∧ costs(x, y) ]•,
  λxy•[ V_part(x) ∧ V_cost-price(y) ∧ ((x < 1999) → (y ≤ 300)) ]•,
  C_part ∧ C_cost-price ∧ (Uni(part) ∧ Can(cost-price, {part}))_part/cost-price ]

Rules, or knowledge, can also be defined as items, although it is neater to define knowledge items using "objects". "Objects" are item building operators. The knowledge item [part/sale-price, part/cost-price, mark-up], which means "the sale price of parts is the cost price marked up by a uniform mark-up factor", is:

[part/sale-price, part/cost-price, mark-up][
  λx₁x₂y₁y₂z•[ (S_part/sale-price(x₁, x₂) ∧ S_part/cost-price(y₁, y₂) ∧ S_mark-up(z)) ∧ ((x₁ = y₁) → (x₂ = z × y₂)) ]•,
  λx₁x₂y₁y₂z•[ (V_part/sale-price(x₁, x₂) ∧ V_part/cost-price(y₁, y₂) ∧ V_mark-up(z)) ∧ ((x₁ = y₁) → (x₂ > y₂)) ]•,
  C_[part/sale-price, part/cost-price, mark-up] ]

Two different items can share common knowledge and so can lead to a profusion of maintenance links. This problem can be avoided by using objects. An n-adic object is an operator that maps n given items into another item for some value of n. Further, the definition of each object will presume that the set of items to which that object may be applied are of a specific "type". The type of an m-adic item is determined both by whether it is a data item, an information item or a knowledge item and by the value of m. The type is denoted respectively by Dm, Im and Km. Items may also have unspecified, or free, type, which is denoted by Xm. The formal definition of an object is similar to that of an item. An object named A is a typed triple A[E, F, G] where E is a typed expression called the semantics of A, F is a typed expression called the value constraints of A and G is a typed expression called the set constraints of A. For example, the part/cost-price item can be built from the items part and cost-price using the costs operator:

part/cost-price = costs(part, cost-price)

costs[ λP:X1 Q:X1 • λxy•[ S_P(x) ∧ S_Q(y) ∧ costs(x, y) ]••,
  λP:X1 Q:X1 • λxy•[ V_P(x) ∧ V_Q(y) ∧ ((1000 < x < 1999) → (y ≤ 300)) ]••,
  λP:X1 Q:X1 •[ C_P ∧ C_Q ∧ (Uni(P) ∧ Can(Q, {P}))_n(costs,P,Q) ]• ]

where n(costs, P, Q) is the name of the item costs(P, Q). Objects also represent value constraints and set constraints in a uniform way. A decomposition operation for objects is defined in [2].
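As a concrete reading of the item formalism, the following minimal sketch (an illustration under assumed names and data, not the paper's notation) represents an item as a named triple of a semantics recogniser, a value-constraint check and a set-constraint check, instantiated for part/cost-price.

```python
# Illustrative sketch (assumptions, not the paper's formal notation):
# an item as a named triple of predicates.
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

@dataclass
class Item:
    name: str
    semantics: Callable[..., bool]           # recognises members of the value set
    value_constraints: Callable[..., bool]   # must hold for every member
    set_constraints: Callable[[Iterable[Tuple]], bool]  # holds for the set as a whole

costs = {(1234, 150.0), (1500, 275.0)}       # assumed relational implementation

part_cost_price = Item(
    name="part/cost-price",
    semantics=lambda x, y: (x, y) in costs,
    value_constraints=lambda x, y: (y <= 300) if x < 1999 else True,
    # Uni(part) and Can(cost-price, {part}): part functionally determines cost-price.
    set_constraints=lambda vs: len({x for x, _ in vs}) == len(list(vs)),
)

members = [(x, y) for (x, y) in costs if part_cost_price.semantics(x, y)]
assert all(part_cost_price.value_constraints(x, y) for x, y in members)
assert part_cost_price.set_constraints(members)
```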
A conceptual model consists of a set of items and a set of maintenance links. The items are constructed by applying a set of object operators to a set of fundamental items called the basis. The maintenance links join two items if modification to one of them necessarily means that the other item has at least to be checked for correctness if consistency is to be preserved. Item join provides the basis for item decomposition [7]. Given items A and B, the item with name A ⊗_E B is called the join of A and B on E, where E is a set of components common to both A and B. Using the rule of composition ⊗, knowledge items, information items and data items may be joined with one another regardless of type. For example, the knowledge item:

[cost-price, tax][ λxy•[ S_cost-price(x) ∧ S_tax(y) ∧ x = y × 0.05 ]•,
  λxy•[ V_cost-price(x) ∧ V_tax(y) ∧ x < y ]•,
  C_[cost-price, tax] ]

can be joined with the information item part/cost-price on the set {cost-price} to give the information item part/cost-price/tax. In other words:

[cost-price, tax] ⊗_{cost-price} part/cost-price =
  part/cost-price/tax[ λxyz•[ S_part(x) ∧ S_cost-price(y) ∧ S_tax(z) ∧ costs(x, y) ∧ z = y × 0.05 ]•,
    λxyz•[ V_part(x) ∧ V_cost-price(y) ∧ V_tax(z) ∧ ((1000 < x …

State-Based Planning with Numerical Knowledge
Joseph Zalaket and Guy Camilleri

Table 1. The ferry domain actions

Load(R,X)
Prec: Rive(R), Integer(X), >(X, 0), emptyF, nbC(R, X), atF(R)
Add: fullF, nbC(R, −(X, 1))
Del: emptyF, nbC(R, X)

UnLoad(R,X)
Prec: Rive(R), Integer(X), fullF, nbC(R, X), atF(R)
Add: emptyF, nbC(R, +(X, 1))
Del: fullF, nbC(R, X)

Move(R1,R2)
Prec: Rive(R1), Rive(R2), ≠(R1, R2), atF(R1)
Add: atF(R2)
Del: atF(R1)
Note: As in STRIPS, all variables are present in the parameter list. Therefore, numerical variables (like X) should also be present in the parameter list. In this example, the numerical parameter X represents the number of cars in a place R. The functions + and - are used to carry out calculations of the numerical terms. Definition 3 A world state is described by a set of atoms without function symbols or variable symbols. Definition 4 Substitution. The following two particular substitutions are used:
The first substitution, noted σ, replaces variable symbols by constant symbols and function terms without variable symbols by constant symbols. The second substitution, noted θ, is such that θ(X) is a substitution of X where:
– if X = {p1(...); ...; pn(...)} is a set of atoms without variable symbols, then θ(X) = {θ(p1(...)); ...; θ(pn(...))};
– if X = P(t1, ..., tn) is an atom without variable symbols, then θ(X) = P(θ(t1), ..., θ(tn));
– if X = t is a term without variable symbols, then
  • if t is a constant symbol c, then θ(t) = c;
  • if t = f(c1, ..., cn) with c1, ..., cn constant symbols, then there exists a constant symbol c such that {c/f(c1, ..., cn)} ⊆ θ and θ(f(c1, ..., cn)) = c;
  • (recursive part) if t = f(t1, ..., tn) with t1, ..., tn terms without variable symbols, then θ(t) = θ(f(t1, ..., tn)) = θ(f(θ(t1), ..., θ(tn))).
Remarks:
– These two substitutions are used to replace variables by constants and to evaluate the functions (by replacing function terms by constants).
– Since in the presented language L all terms are finite, the recursive definition of θ terminates.
– The substitution θ(f(c1, ..., cn)) corresponds to the application of the function f. For example, if θ(+(1, 2)) = 3, then in the domain theory T the result of 1 + 2 is 3.
– If an infinite domain definition is used for the numerical variables, then the set C can be infinite.
– In the domain theory T, a function must only return a constant value and must not have any side effects.
Definition 5 An action α can be instantiated by two substitutions σ and θ (defined as above) iff for all p(t1, ..., tn) ∈ Prec(α) with p ∈ Pt, T ⊢ θ(σ(p(t1, ..., tn))).
In the ferry example (Table 1), Load(R,X) can be instantiated by σ = {r1/R; 1/X} and {0/−(1, 1)} ⊆ θ because the theory T uses the classical interpretation of the comparator >; then θ(σ(>(X, 0))) = >(1, 0), θ(σ(Rive(R))) = Rive(r1), θ(σ(Integer(X))) = Integer(1) and T ⊢ Rive(r1) ∧ Integer(1) ∧ >(1, 0).
Definition 6 a is a ground action of α iff there exist two substitutions θ and σ such that α can be instantiated by θ and σ. a is defined by:
– Param(a) = σ(Param(α)).
– Prec(a) = {p(t1, ..., tn) such that there exists p′(t′1, ..., t′n) ∈ Prec(α), p′ ∈ Pp and p(t1, ..., tn) = θ(σ(p′(t′1, ..., t′n)))}.
– Add(a) = {p(t1, ..., tn) such that there exists p′(t′1, ..., t′n) ∈ Add(α), p′ ∈ Pp and p(t1, ..., tn) = θ(σ(p′(t′1, ..., t′n)))}.
– Del(a) = {p(t1, ..., tn) such that there exists p′(t′1, ..., t′n) ∈ Del(α), p′ ∈ Pp and p(t1, ..., tn) = θ(σ(p′(t′1, ..., t′n)))}.
Remark: Each variable symbol and each function symbol is replaced by the application of θ ∘ σ. Therefore, the generated ground actions are identical to the ground actions generated in a pure STRIPS paradigm: they do not contain any atom of Pt.
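The following minimal sketch (an assumption about the procedure, not the authors' Java implementation; the data structures are invented for illustration) shows how a schema such as Load(R,X) can be grounded by applying σ and then evaluating function terms with θ, producing a pure STRIPS-like ground action.

```python
# Illustrative sketch (assumed representation, not the authors' code):
# grounding an action schema with sigma, then evaluating functions (theta).
from dataclasses import dataclass

@dataclass
class Schema:
    name: str
    params: tuple
    prec: list     # atoms as tuples, e.g. ("nbC", "R", ("-", "X", 1))
    add: list
    delete: list

LOAD = Schema(
    name="Load", params=("R", "X"),
    prec=[("Rive", "R"), ("Integer", "X"), (">", "X", 0),
          ("emptyF",), ("nbC", "R", "X"), ("atF", "R")],
    add=[("fullF",), ("nbC", "R", ("-", "X", 1))],
    delete=[("emptyF",), ("nbC", "R", "X")],
)

FUNCS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b}
THEORY = {">": lambda a, b: a > b, "Rive": lambda r: True,
          "Integer": lambda x: isinstance(x, int)}

def theta(term, sigma):
    """Apply sigma, then evaluate function terms to constants (the theta step)."""
    if isinstance(term, tuple) and term[0] in FUNCS:
        return FUNCS[term[0]](*(theta(t, sigma) for t in term[1:]))
    return sigma.get(term, term)

def ground(schema, sigma):
    inst = lambda atoms: [(a[0], *(theta(t, sigma) for t in a[1:])) for a in atoms]
    keep = lambda atoms: [a for a in atoms if a[0] not in THEORY]  # planning atoms (Pp)
    # theory atoms (Pt) must hold in T for the instantiation to be allowed
    assert all(THEORY[a[0]](*(theta(t, sigma) for t in a[1:]))
               for a in schema.prec if a[0] in THEORY)
    return schema.name, inst(keep(schema.prec)), inst(schema.add), inst(keep(schema.delete))

print(ground(LOAD, {"R": "r1", "X": 1}))
# ('Load', [('emptyF',), ('nbC', 'r1', 1), ('atF', 'r1')],
#  [('fullF',), ('nbC', 'r1', 0)], [('emptyF',), ('nbC', 'r1', 1)])
```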
3 Numerical Planning
Usually planners generate the ground actions from the domain of action variables. In our framework, numerical knowledge can lead to an infinite number of ground actions. For example, the action Load(r1,1) defined below is a ground action of the Load(R,X) action.

Load(r1,1)
Prec: emptyF, nbC(r1, 1), atF(r1)
Add: fullF, nbC(r1, 0)
Del: emptyF, nbC(r1, 1)
3.1 Planning Graph
We present an approach based on the relaxation of the planning problem through the building of a planning graph for heuristic calculation. In this numerical paradigm, the generation of ground actions has to be completed progressively during the building of the graph. The idea is to instantiate actions from the current level during the graph building in a forward pass. In this way, the use of bijective (invertible) functions is not required. The resulting ground actions are identical to pure STRIPS ground actions, which makes it possible to use symbolical algorithms based on planning graph building, like GRAPHPLAN [1], FF [7] and HSP [2], to solve numerical problems. However, search completeness cannot be guaranteed, as the use of functions and numbers makes the planning process undecidable. To avoid an infinite flow during the planning process we give the modeler the possibility of adding a lower limit and an upper limit for each numerical type, depending on the planning problem to be solved. For example, in the ferry domain of Table 1, if the problem consists of transferring 50 cars from place r1 to place r2 and the initial state contains 50 cars in place r1 (nbC(r1, 50)) and 0 cars in place r2 (nbC(r2, 0)), the interval of upper and lower limits for the integer variable X in the Load and UnLoad action definitions is [0, 50]. Thus, extra preconditions could be added to the list of preconditions: <(X, 50) for the Load(R, X) action, and >(X, 0), <(X, 50) for the UnLoad(R, X) action.
3.2 Numerical Fast Forward (NFF)
In contrast to FF, the ground actions in NFF are not computed at the beginning from the definition domain of the variables.
Fig. 1. The relaxed planning graph of the problem P
The NFF algorithm can be roughly described in the following way:
1. Only the symbolical part of actions is instantiated.
2. The planning graph is built by completing the action instantiation (numerical part), then by applying the instantiated actions.
3. The relaxed plan is extracted from the planning graph; the length of this relaxed plan constitutes the heuristic value.
4. The heuristic value previously calculated is used to guide the search of an algorithm close to FF's Hill Climbing algorithm.
The heuristic calculation (steps 2), 3) and 4)) is done for each state in the main algorithm.
Relaxed Planning Graph Building. Let us consider the following planning problem P = (A, I, G) where the action set A = {Move; Load; UnLoad} (see Table 1), the initial state I = {atF(r1); emptyF; nbC(r1, 1); nbC(r2, 0)} and the goal G = {nbC(r2, 1); atF(r1)}. The relaxed planning graph of the problem P is described in Figure 1. The graph starts at level 0 with the initial state I. The actions Move(r1,r2) and Load(r1,1) are applicable at this level. The partially instantiated action Load(r1,X) is completed by σ = {r1/R; 1/X} because {nbC(r1, 1); atF(r1); emptyF} ⊆ Level 0 and T ⊢ >(1, 0). Moreover, θ, according to the domain theory T, replaces −(1, 1) by 0, that is {0/−(1, 1)} ⊆ θ. The action Load(r1,1) is defined by: Pre(Load(r1,1)) = {emptyF; nbC(r1, 1); atF(r1)}, Add(Load(r1,1)) = {fullF; nbC(r1, 0)} and Del(Load(r1,1)) = {emptyF; nbC(r1, 1)}. The two actions Move(r1,r2) and Load(r1,1) are applied and their add lists are added to level 1.
Remark: In the relaxed graph building algorithm, we only defined the σ substitution. It seems reasonable that the θ substitution be common for all the actions in a planning problem. In our implementation, all functional calculations are carried out in the interpretation domain. The action completion algorithm is described in Algorithm 1.
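The loop just described can be made concrete with a small, self-contained sketch (an assumed reading of the procedure, not the authors' Java implementation; the ferry grounding is hard-coded for brevity): delete lists are ignored, new ground actions are generated from the atoms of the current level, and the heuristic is the number of actions in a relaxed plan extracted backwards.

```python
# Illustrative sketch (assumptions, not the authors' code): relaxed planning
# graph and heuristic for the ferry problem P described above.

def ferry_actions(facts):
    """Ground the ferry schemas (Table 1) against the atoms of the current level."""
    acts, rives = [], {"r1", "r2"}              # places are fixed in this sketch
    for atom in facts:
        if atom[0] != "nbC":
            continue
        _, r, x = atom
        if x > 0:                               # Load(r, x)
            acts.append((("Load", r, x),
                         [("emptyF",), ("nbC", r, x), ("atF", r)],
                         [("fullF",), ("nbC", r, x - 1)]))
        acts.append((("UnLoad", r, x),          # UnLoad(r, x)
                     [("fullF",), ("nbC", r, x), ("atF", r)],
                     [("emptyF",), ("nbC", r, x + 1)]))
    for a in rives:
        for b in rives - {a}:                   # Move(a, b)
            acts.append((("Move", a, b), [("atF", a)], [("atF", b)]))
    return [act for act in acts if set(act[1]) <= facts]   # applicable only

def relaxed_plan_length(init, goal, max_levels=50):
    facts, achiever = set(init), {}
    for _ in range(max_levels):
        if goal <= facts:
            break
        new = set()
        for name, pre, add in ferry_actions(facts):
            for f in add:
                if f not in facts:
                    new.add(f)
                    achiever.setdefault(f, (name, pre))   # first achiever of f
        if not new:
            return None                          # relaxed problem unsolvable
        facts |= new
    if not goal <= facts:
        return None
    # backward extraction of a relaxed plan (simplified FF-style)
    closed = set(init)
    plan, open_goals = set(), list(goal - closed)
    while open_goals:
        f = open_goals.pop()
        if f in closed:
            continue
        closed.add(f)
        name, pre = achiever[f]
        plan.add(name)
        open_goals.extend(pre)
    return len(plan)

I0 = {("atF", "r1"), ("emptyF",), ("nbC", "r1", 1), ("nbC", "r2", 0)}
G = {("nbC", "r2", 1), ("atF", "r1")}
print(relaxed_plan_length(I0, G))   # 3: Load(r1,1), Move(r1,r2), UnLoad(r2,0)
```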
Algorithm 1 Action completion algorithm
for all p(...) ∈ Pre(α) with p ∈ Pp do
    for all predicate symbols p′ in the current level for which p = p′, and for all c′i ∈ Param(p′) and ci ∈ Param(p) such that c′i = ci do
        for all variable terms Xi ∈ Param(p) and ci ∈ Param(p′) do
            {ci/Xi} is added to the set S of σ
        end for
    end for
end for
for all σ ∈ S do
    if α can be instantiated by σ and θ then
        generate the corresponding ground action
    end if
end for
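One way to read the completion step is as a matching of the planning preconditions of a schema against the atoms of the current level to collect candidate substitutions σ; the following self-contained sketch (an illustrative reading with invented data structures, not the authors' code) does exactly that for Load at level 0. The collected σ would then still be filtered with θ and the theory atoms such as >(X, 0), as in the second loop of Algorithm 1.

```python
# Sketch of the sigma-collection step of Algorithm 1 (assumed reading).
from itertools import product

def match(atom, fact, sigma):
    """Extend sigma so that the schema atom matches the ground fact, or return None."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    sigma = dict(sigma)
    for t, c in zip(atom[1:], fact[1:]):
        if isinstance(t, str) and t.isupper():     # variable term, e.g. "R", "X"
            if sigma.get(t, c) != c:
                return None
            sigma[t] = c
        elif t != c:                               # constant term must coincide
            return None
    return sigma

def complete(prec_pp, level):
    """Return all sigma matching every planning precondition against the level."""
    candidates = [{}]
    for atom in prec_pp:
        candidates = [s2 for s, fact in product(candidates, level)
                      if (s2 := match(atom, fact, s)) is not None]
    return candidates

level0 = [("atF", "r1"), ("emptyF",), ("nbC", "r1", 1), ("nbC", "r2", 0)]
load_prec_pp = [("emptyF",), ("nbC", "R", "X"), ("atF", "R")]
print(complete(load_prec_pp, level0))   # [{'R': 'r1', 'X': 1}]
```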
4 Empirical Results
We have implemented NFF in Java; our main objective was to test the feasibility of certain types of numerical problems, so little effort was made on code optimization. The machine used for the tests is an Intel Celeron (Pentium III) with 256 MB of RAM.
The water jug problem: consists of 3 jugs j1, j2 and j3 where Capacity(j1,36), Capacity(j2,45), Capacity(j3,54). The initial state is: Volume(j1,16), Volume(j2,27), Volume(j3,34). The goal is: Volume(j1,25), Volume(j2,0), Volume(j3,52) → time = 8.36 s.
The ferry domain: the initial state is: atF(r1), emptyF, nbC(r1, 50), nbC(r2, 0). For the goals: nbC(r2, 5) → time = 0.571 s; nbC(r2, 10) → time = 1.390 s; nbC(r2, 30) → time = 21.937 s.
5 Conclusion
We have presented a STRIPS extension to support the definition of numerical and symbolical domains. We have proposed a way of substituting actions progressively during the planning process to reduce ground-action generation for numerical domains. The main goal of introducing numbers into planning in the presented way is to allow the definition of domains closer to the real world. In our framework, world objects can be retrieved from numerical variables by function evaluation instead of only being constant symbols as in pure STRIPS. We have not addressed resource optimization (time or metrics) in this work, but we believe this task can be added easily to (or retrieved from) our planning framework.
References
[1] Avrim L. Blum and Merrick L. Furst. Fast planning through planning graph analysis. Proceedings of the 14th International Joint Conference on AI (IJCAI-95), pages 1636–1642, 1995.
[2] B. Bonet and H. Geffner. Planning as heuristic search. Artificial Intelligence, 129:5–33, 2001.
[3] Minh B. Do and Subbarao Kambhampati. Sapa: A domain-independent heuristic metric temporal planner. European Conference on Planning, 2001.
[4] R. E. Fikes and N. Nilsson. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2:189–208, 1971.
[5] M. Fox and D. Long. PDDL2.1: An extension to PDDL for expressing temporal planning domains. AIPS, 2002.
[6] P. Haslum and H. Geffner. Heuristic planning with time and resources. Proc. IJCAI-01 Workshop on Planning with Resources, 2001.
[7] J. Hoffmann. FF: The fast-forward planning system. AI Magazine, 22:57–62, 2001.
[8] J. Hoffmann. Extending FF to numerical state variables. In Proceedings of the 15th European Conference on Artificial Intelligence, Lyon, France, 2002.
KAMET II: An Extended Knowledge-Acquisition Methodology* Osvaldo Cairó and Julio César Alvarez Instituto Tecnológico Autónomo de México (ITAM) Department of Computer Science, Río Hondo 1, Mexico City, Mexico
[email protected] [email protected] Abstract. The Knowledge-Acquisition (KA) community needs more effective ways to elicit knowledge in different environments. Methodologies like CommonKADS [8], MIKE [1] and VITAL [6] are able to produce knowledge models using their respective Conceptual Modeling Languages (CML). However, sharing and reuse are nowadays a must in knowledge engineering (KE) methodologies and domain-specific KA tools, in order to permit Knowledge-Based System (KBS) developers to work faster with better results, and to give them the chance to produce and utilize reusable Open Knowledge Base Connectivity (OKBC)-constrained models. This paper presents the KAMET II1 methodology, which is the diagnosis-specialized version of KAMET [2,3], as an alternative for creating knowledge-intensive systems while attacking KE-specific risks. We describe here one of the most important characteristics of KAMET II, which is the use of Protégé 2000 for implementing its CML models through ontologies.
1 Introduction
The KAMET [2,3] methodology was born during the last third of the nineties. The life-cycle of KAMET's first version was constituted mainly by two phases. The first one analyzes and acquires the knowledge from the different sources involved in the system. The second one models and processes this knowledge. The mechanism proposed in KAMET for KA was based on progressive improvements of models, which allowed knowledge to be refined and its optimal structure to be obtained in an incremental fashion. KAMET was inspired by two basic ideas: the spiral model proposed by Boehm and the essence of cooperative processes. Both ideas are closely related to the concept of risk reduction.
* This project has been funded by CONACyT, as project number 33038-A, and Asociación Mexicana de Cultura, A.C.
1 KAMET II is a project that is being carried out in collaboration with the SWI Group at Amsterdam University and Universidad Politécnica de Madrid.
Fig. 1. KAMET II's Methodological Pyramid
Fig. 1 shows the Methodological Pyramid of KAMET II. KAMET II's life cycle represents the methodology itself; it is the starting point for reaching the goal. Auxiliary tools provide automated or semi-automated support for the process and the methods. A KBS model represents a detailed description of a KBS, in the form of models. This has led KAMET to successful applications in the medicine, telecommunications and human resources areas. The integration of diagnosis specialization with the ideas of the KAMET methodology converges in the KAMET II methodology, which is specialized in the diagnosis area and focuses on KA from multiple sources. Although KAMET provides a framework for managing the knowledge modeling phase in a systematic and disciplined way, the manner of obtaining additional organizational objectives like predictability, consistency and improvement in quality and productivity is still not well defined within the KAMET Life-Cycle Model (LCM). KAMET II supports most aspects of a KBS development project, including organizational analysis, KA from multiple sources, conceptual modeling and user interaction. We will present in this section a brief description of Project Management (PM) in KAMET II. We want a PM activity that helps us measure as much as possible, with the purpose of making better plans and reaching commitments. We believe that the best way to deal with these problems is to analyze not only the "what" and the "how" of a software system, but also "why" we are using it. This is best done by starting the KA process with the early requirements, where one analyzes the domain within which the system will operate, studies how the use of the system will modify the environment, and progressively refines this analysis down to the actual implementation of single knowledge modules. The purpose of KAMET II PM is the elimination of project mismanagement and its consequences in KA projects [4, 5]. This new approach intends to build reliable KBSs that fulfill customer and organization quality expectations by providing the project manager with effective knowledge and PM techniques. The purpose of this paper is not to present the PM dimension in detail, but it is important to emphasize this new key piece in KAMET II.
2 Diagnosis within KAMET II
Diagnosis differs from classification in the sense that the desired output is a malfunction of the system. In diagnosis the underlying knowledge typically contains knowledge about system behavior, such as a causal model. The output of diagnosis can take many forms: it can be a faulty component, a faulty state, or a causal chain. Diagnosis tasks are frequently encountered in the area of technical and medical systems.
Fig. 2. KAMET II's Environment
KAMET II tries to mitigate the problem of constructing solutions for diagnosis problems with a complete Problem-Solving Method (PSM) library. This is achieved by replacing the elaboration of the answer from scratch with the tailoring and refinement of PSMs with predefined tasks. KAMET II will have as part of its assets a PSM library for technical diagnosis, in order to carry out diagnosis task modeling by selecting a subset of assumptions and roles from this library. The objective is to turn KAMET II into a methodology that concentrates as much experience as possible in the form of a PSM library, so that whenever a diagnosis problem needs to be treated, the steps to tackle it are always available. Fig. 2 shows how the PSM Diagnosis Library is the base for the creation of knowledge models by KAMET II. The philosophy of KAMET II is sharing and reuse. However, PSMs are not the only reusable KE artifacts in KAMET II; the knowledge models are as well, as shown next.
3 KAMET II Models and Protégé 2000 Ontologies
Constructing knowledge systems is viewed as a modeling activity for developing structured knowledge and reasoning models. To ensure well-formed models, the use of some KE methodology is crucial. Additionally, reusing models can significantly reduce the time and cost of building new applications [2]. The goal is to have shareable knowledge, by encoding domain knowledge using a standard vocabulary based on the ontology. The KAMET II CML is presented after a discussion of what the potential problems in KA are. Pure rule representation, as well as object-modeling languages, data dictionaries and entity-relationship diagrams, among other methods, are no longer considered sufficient either for the purpose of system construction or for that of knowledge representation. Knowledge is too rich to be represented with notations like the Unified Modeling Language (UML). This requires stronger modeling facilities. A knowledge modeling method should provide a rich vocabulary in which the expertise can be expressed in an appropriate way. Knowledge and reasoning should be modeled in such a way that models can be exploited in a very flexible fashion [2]. The KAMET II CML has three levels of abstraction. The first corresponds to structural constructors and components. The second level of abstraction corresponds to nodes and composition rules. The third level of abstraction corresponds to the global model [3].
Fig. 3. KAMET II implemented in a Protégé 2000 ontology
In the following lines, it will be described how the Protégé 2000 model can be used to implement KAMET II knowledge models visually by means of the Protégé 2000 frame-based class model (or metaclass model, although it is not necessary for KAMET II as it is for CommonKADS [9]) and the Diagram_Entity implementation. The structural constructors and the structural components of the KAMET II CML are mapped to Protégé 2000 in the following way. An abstract subclass Construct of the abstract class :THING is created as a superclass of the Problem, Classification and Subdivision classes that implement the corresponding concepts in the CML. The same is done for the structural components, creating an abstract subclass Component as the superclass of Symptom, Antecedent, Solution, Time, Value, Inaccuracy, Process, Formula and Examination. However, with the purpose of creating diagrammatic representations of knowledge models, composition rules need to be subclasses of the abstract class Connector in order to be used in a KAMETIIDiagram instance. [9] should be consulted for a complete description of the Protégé 2000 knowledge model and its flexibility of adaptation to other knowledge models like KAMET II's. Nevertheless, the importance of adapting KAMET II to Protégé 2000 lies in the visual modeling potential available in the tool. Diagrams are one way to accomplish the latter goal. A diagram is, visually, a set of nodes and connectors that join the nodes. Underlying these, there are instances of classes. Nodes map to domain objects and connectors map to instances of Connector (a class the diagram widget uses to
represent important relationships). When users see a diagram, they see a high-level summary of a large number of instances, without seeing many of the details of any particular instance. That is why a Diagram Widget is presented. This widget has the representativeness mentioned above. The mapping permits the use of an automated aid to construct KAMET II models. Fig. 3 (left) shows KAMET II in a Protégé 2000 ontology [9]. Fig. 3 (right) presents the Diagram Widget for KAMET II, which is nothing but a special kind of form [9] used to obtain instances of the classes (or concepts) involved in diagnosis-specific problems. This is the way to edit the KAMET II models. As is visible in the left hierarchy, all the KAMET II diagrams will be instances of the KAMETIIDiagram class, which is a direct subclass of the Network class. Fig. 3 (right) shows the structural components and constructors and the composition rules. Developers used to the original graphical notation [2] will not have any problem getting used to the symbols implemented in Protégé 2000. It is important to mention that inaccuracies are represented as entities and not as probabilistic slots. Thus, better intrinsic notation and representation are necessary to have better working models. Probabilistic networks [10] need a more elaborate analysis before they could be expressed in these graphical terms. Fig. 4 shows a simple model in the KAMET II CML modeled in Protégé 2000. The model expresses that the problem P1 can occur due to two different situations. In the first one, the model expresses that if the symptoms S1 and S2 are known to be true then we can deduce that the problem P1 is true with probability 0.70. In the second one, the model shows that if symptoms S1 and S5 are observed then we can conclude that the problem is P1 with probability 0.70. On the other hand, we can deduce that the problem P3 is true with probability 0.40 if symptoms S1 and S4 are known to be true. Finally, we can reach the conclusion that the problem is P2 with probability 0.90 if problems P1 and P3 and the symptom S3 are observed.
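The Construct/Component/Connector mapping described above can be pictured, outside of Protégé 2000, as a plain class hierarchy; the following sketch is only an illustration of that structure (it is not the Protégé 2000 API, and the root placeholder Thing stands in for :THING).

```python
# Illustrative class-hierarchy sketch (an assumption, not the Protégé 2000 API).
class Thing: ...                      # stands in for the Protégé root :THING

class Construct(Thing): ...           # structural constructors
class Problem(Construct): ...
class Classification(Construct): ...
class Subdivision(Construct): ...

class Component(Thing): ...           # structural components
class Symptom(Component): ...
class Antecedent(Component): ...
class Solution(Component): ...
class Inaccuracy(Component): ...
# Time, Value, Process, Formula and Examination are defined analogously.

class Connector(Thing): ...           # composition rules used by the diagram widget

class KAMETIIDiagram:                 # a diagram: a set of nodes plus connectors
    def __init__(self):
        self.nodes, self.connectors = [], []
```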
Fig. 4. Electrical Diagnosis modeled in Protégé 2000
Fig. 4 shows an example implemented in Protégé 2000 using the facilities it provides for diagram construction with the Diagram Widget in the Instances tab.
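For readers without access to the tool, the composition rules of the Fig. 4 model can also be written down directly as data and evaluated; the sketch below is only an illustration of the stated rules (the probabilities are simply attached to the conclusions, not propagated or combined).

```python
# Sketch of the Fig. 4 electrical-diagnosis model as data (an illustration,
# not the Protégé project file): antecedents -> problem, with the inaccuracy
# (probability) given in the text.
rules = [
    ({"S1", "S2"}, "P1", 0.70),
    ({"S1", "S5"}, "P1", 0.70),
    ({"S1", "S4"}, "P3", 0.40),
    ({"P1", "P3", "S3"}, "P2", 0.90),
]

def conclude(observed):
    """Fire the composition rules forward until no new problem can be deduced."""
    known, derived = set(observed), {}
    changed = True
    while changed:
        changed = False
        for antecedents, problem, prob in rules:
            if antecedents <= known and problem not in known:
                known.add(problem)
                derived[problem] = prob
                changed = True
    return derived

print(conclude({"S1", "S2", "S3", "S4"}))
# {'P1': 0.7, 'P3': 0.4, 'P2': 0.9}
```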
4 Conclusions
In this paper we have described various aspects of the new KE methodology KAMET II. The reasons why KAMET II is a complete diagnosis methodology, through its PSM Diagnosis Library for the procedural knowledge part of KBS construction, were discussed. Not only simple-cause problems can be tackled; a great variety of them can be solved using a PSM library. It was also shown how the modeling phase of declarative knowledge in KAMET II can be carried out using the Protégé 2000 automated tool through ontologies. The purpose was to map the KAMET II CML models to the Protégé 2000 frame-based model in order to provide sharing and reuse to knowledge engineers. These objectives can be achieved in the tool thanks to the API it provides for reusing the knowledge stored in the knowledge representation schemes it facilitates, and to the OKBC compliance of the tool. The PM dimension of the methodology was also presented, in order to diminish KE-specific risks in KBS development.
References
[1] Angele, J., Fensel, D., and Studer, R.: Developing Knowledge-Based Systems with MIKE. Journal of Automated Software Engineering.
[2] Cairó, O.: A Comprehensive Methodology for Knowledge Acquisition from Multiple Knowledge Sources. Expert Systems with Applications, 14 (1998), 1-16.
[3] Cairó, O.: The KAMET Methodology: Content, Usage and Knowledge Modeling. In Gaines, B. and Musen, M., editors, Proceedings of the 11th Banff Knowledge Acquisition for Knowledge-Based Systems Workshop, pages 1-20. Department of Computer Science, University of Calgary, SRGD Publications.
[4] Cairó, O., Barreiro, J., and Solsona, F.: Software Methodologies at Risk. In Fensel, D. and Studer, R., editors, 11th European Workshop on Knowledge Acquisition, Modeling and Management, volume 1621 of LNAI, pages 319-324. Springer Verlag.
[5] Cairó, O., Barreiro, J., and Solsona, F.: Risks Inside-out. In Cairó, O., Sucar, L. and Cantu, F., editors, MICAI 2000: Advances in Artificial Intelligence, volume 1793 of LNAI, pages 426-435. Springer Verlag.
[6] Domingue, J., Motta, E., and Watt, S.: The Emerging VITAL Workbench.
[7] Medsker, L., Tan, M., and Turban, E.: Knowledge Acquisition from Multiple Experts: Problems and Issues. Expert Systems with Applications, 9(1), 35-40.
[8] Schreiber, G., and Akkermans, H.: Knowledge Engineering and Management: the CommonKADS Methodology. MIT Press, Cambridge, Massachusetts, 1999.
[9] Schreiber, G., Crubézy, M., and Musen, M.: A Case Study in Using Protégé-2000 as a Tool for CommonKADS. In Dieng, R. and Corby, O., editors, 12th International Conference, EKAW 2000, Juan-les-Pins, France.
[10] van der Gaag, L., and Helsper, E.: Experiences with Modeling Issues in Building Probabilistic Networks. In Gómez-Pérez, A. and Benjamins, R., editors, 13th International Conference, EKAW 2002, volume 2473 of LNAI, pages 21-26. Springer Verlag.
Automated Knowledge Acquisition by Relevant Reasoning Based on Strong Relevant Logic* Jingde Cheng Department of Information and Computer Sciences, Saitama University Saitama, 338-8570, Japan
[email protected] Abstract. Almost all existing methodologies and automated tools for knowledge acquisition are somehow based on classical mathematical logic or its various classical conservative extensions. This paper proposes a new approach to knowledge acquisition problem: automated knowledge acquisition by relevant reasoning based on strong relevant logic. The paper points out why any of the classical mathematical logic, its various classical conservative extensions, and traditional relevant logics is not a suitable logical basis for knowledge acquisition, shows that strong relevant logic is a more hopeful candidate for the purpose, and establishes a conceptional foundation for automated knowledge acquisition by relevant reasoning based on strong relevant logic.
1 Introduction
From the viewpoint of knowledge engineering, knowledge acquisition is the purposive modeling process of discovering and learning knowledge about some particular subject from one or more knowledge sources, and then abstracting, formalizing, representing, and transferring the knowledge in some explicit and formal forms suitable to computation on computer systems [10]. Automated knowledge acquisition is concerned with the execution of computer programs that assist in knowledge acquisition. As Sestito and Dillon pointed out: The main difficulty in extracting knowledge from experts is that they themselves have trouble expressing or formalizing their knowledge. Experts also have a problem in describing their knowledge in terms that are precise, complete, and consistent enough for use in a computer program. This difficulty stems from the inherent nature of knowledge that constitutes human expertise. Such knowledge is often subconscious and may be approximate, incomplete, and inconsistent [15]. Therefore, the intrinsically characteristic task in knowledge acquisition is discovery rather than justification, and the task has to be performed under the condition that working with approximate, incomplete, inconsistent knowledge is the rule rather than the exception. As a result, if we want to establish a sound methodology for knowledge acquisition in knowledge engineering practices, we have to consider the issue of how to discover new knowledge from one or more knowledge sources where approximateness, incompleteness, and inconsistency are present to some degree. Until now, almost all existing methodologies and automated tools for knowledge acquisition are somehow based on classical mathematical logic (CML for short) or its various classical conservative extensions. This approach, however, may be suitable to searching for and describing a formal proof of a previously specified statement, but not necessarily suitable to forming a new concept and discovering a new statement, because the aim, nature, and role of CML is descriptive and non-predictive rather than prescriptive and predictive. This paper proposes a new approach to the knowledge acquisition problem: knowledge acquisition by relevant reasoning based on strong relevant logic. The paper points out why none of classical mathematical logic, its various classical conservative extensions, and traditional relevant logics is a suitable logical basis for knowledge acquisition, shows that strong relevant logic is a more hopeful candidate for the purpose, and establishes a conceptional foundation for automated knowledge acquisition by relevant reasoning based on strong relevant logic.
* This work is supported in part by The Ministry of Education, Culture, Sports, Science and Technology of Japan under Grant-in-Aid for Exploratory Research No. 09878061 and Grant-in-Aid for Scientific Research (B) No. 11480079.
2 Reasoning and Proving
Reasoning is the process of drawing new conclusions from given premises, which are already known facts or previously assumed hypotheses (Note that how to define the notion of “new” formally and satisfactorily is still a difficult open problem until now). Therefore, reasoning is intrinsically ampliative, i.e., it has the function of enlarging or extending some things, or adding to what is already known or assumed. In general, a reasoning consists of a number of arguments (or inferences) in some order. An argument (or inference) is a set of declarative sentences consisting of one or more sentences as its premises, which contain the evidence, and one sentence as its conclusion. In an argument, a claim is being made that there is some sort of evidential relation between its premises and its conclusion: the conclusion is supposed to follow from the premises, or equivalently, the premises are supposed to entail the conclusion. Therefore, the correctness of an argument is a matter of the connection between its premises and its conclusion, and concerns the strength of the relation between them (Note that the correctness of an argument depends neither on whether the premises are really true or not, nor on whether the conclusion is really true or not). Thus, there are some fundamental questions: What is the criterion by which one can decide whether the conclusion of an argument or a reasoning really does follow from its premises or not? Is there the only one criterion, or are there many criteria? If there are many criteria, what are the intrinsic differences between them? It is logic that deals with the validity of argument and reasoning in general.
70
Jingde Cheng
A logically valid reasoning is a reasoning such that its arguments are justified based on some logical validity criterion provided by a logic system in order to obtain correct conclusions (Note that here the term “correct” does not necessarily mean “true.”). Today, there are so many different logic systems motivated by various philosophical considerations. As a result, a reasoning may be valid on one logical validity criterion but invalid on another. For example, the classical account of validity, which is one of fundamental principles and assumptions underlying CML and its various conservative extensions, is defined in terms of truth-preservation (in some certain sense of truth) as: an argument is valid if and only if it is impossible for all its premises to be true while its conclusion is false. Therefore, a classically valid reasoning must be truth-preserving. On the other hand, for any correct argument in scientific reasoning as well as our everyday reasoning, its premises must somehow be relevant to its conclusion, and vice versa. The relevant account of validity is defined in terms of relevance as: for an argument to be valid there must be some connection of meaning, i.e., some relevance, between its premises and its conclusion. Obviously, the relevance between the premises and conclusion of an argument is not accounted for by the classical logical validity criterion, and therefore, a classically valid reasoning is not necessarily relevant. Proving is the process of finding a justification for an explicitly specified statement from given premises, which are already known facts or previously assumed hypotheses. A proof is a description of a found justification. A logically valid proving is a proving such that it is justified based on some logical validity criterion provided by a logic system in order to obtain a correct proof. The most intrinsic difference between reasoning and proving is that the former is intrinsically prescriptive and predictive while the latter is intrinsically descriptive and non-predictive. The purpose of reasoning is to find some new conclusion previously unknown or unrecognized, while the purpose of proving is to find a justification for some specified statement previously given. Proving has an explicitly given target as its goal while reasoning does not. Unfortunately, until now, many studies in Computer Science and Artificial Intelligence disciplines still confuse proving with reasoning. Discovery is the process to find out or bring to light of that which was previously unknown. For any discovery, both the discovered thing and its truth must be unknown before the completion of discovery process. Since reasoning is the only way to draw new conclusions from given premises, there is no discovery process that does not invoke reasoning. As we have mentioned, the intrinsically characteristic task in knowledge acquisition is discovery rather than justification. Therefore, the task is concerning reasoning rather than proving. Below, let us consider the problem what logic system can satisfactorily underlie knowledge discovery.
3 The Notion of Conditional as the Heart of Logic
What is logic? Logic is a special discipline which is considered to be the basis for all other sciences, and therefore, it is a science prior to all others, which contains the
ideas and principles underlying all sciences [9, 16]. Logic deals with what entails what or what follows from what, and aims at determining which are the correct conclusions of a given set of premises, i.e., to determine which arguments are valid. Therefore, the most essential and central concept in logic is the logical consequence relation that relates a given set of premises to those conclusions, which validly follow from the premises. In general, a formal logic system L consists of a formal language, called the object language and denoted by F(L), which is the set of all well-formed formulas of L, and a logical consequence relation, denoted by meta-linguistic symbol |−L, such that for P ⊆ F(L) and c ∈ F(L), P |−L c means that within the framework of L, c is a valid conclusion of premises P, i.e., c validly follows from P. For a formal logic system (F(L), |−L), a logical theorem t is a formula of L such that φ |−L t where φ is the empty set. We use Th(L) to denote the set of all logical theorems of L. Th(L) is completely determined by the logical consequence relation |−L. According to the representation of the logical consequence relation of a logic, the logic can be represented as a Hilbert style formal system, a Gentzen natural deduction system, a Gentzen sequent calculus system, or other type of formal system. A formal logic system L is said to be explosive if and only if {A, ¬A} |−L B for any two different formulas A and B; L is said to be paraconsistent if and only if it is not explosive. Let (F(L), |−L) be a formal logic system and P ⊆ F(L) be a non-empty set of sentences (i.e., closed well-formed formulas). A formal theory with premises P based on L, called a L-theory with premises P and denoted by TL(P), is defined as TL(P) =df Th(L) ∪ ThLe(P), and ThLe(P) =df {et | P |−L et and et ∉ Th(L)} where Th(L) and ThLe(P) are called the logical part and the empirical part of the formal theory, respectively, and any element of ThLe(P) is called an empirical theorem of the formal theory. A formal theory TL(P) is said to be directly inconsistent if and only if there exists a formula A of L such that both A ∈ P and ¬A ∈ P hold. A formal theory TL(P) is said to be indirectly inconsistent if and only if it is not directly inconsistent but there exists a formula A of L such that both A ∈ TL(P) and ¬A ∈ TL(P). A formal theory TL(P) is said to be consistent if and only if it is neither directly inconsistent nor indirectly inconsistent. A formal theory TL(P) is said to be explosive if and only if A ∈ TL(P) for arbitrary formula A of L; TL(P) is said to be paraconsistent if and only if it is not explosive. An explosive formal theory is not useful at all. Therefore, any meaningful formal theory should be paraconsistent. Note that if a formal logic system L is explosive, then any directly or indirectly inconsistent L-theory TL(P) must be explosive. In the literature of mathematical, natural, social, and human sciences, it is probably difficult, if not impossible, to find a sentence form that is more generally used for describing various definitions, propositions, and theorems than the sentence form of “if ... then ... .” In logic, a sentence in the form of “if ... then ...” is usually called a conditional proposition or simply conditional which states that there exists a relation of sufficient condition between the “if” part and the “then” part of the sentence. 
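For readability, the formal-theory notation introduced above can be restated with its subscripts and superscripts made explicit; the block below is only a typeset summary of the definitions already given in this section, with no new content.

```latex
% Typeset restatement of the definitions above (no new content).
\begin{align*}
  Th(L) &= \{\, t \in F(L) \mid \emptyset \vdash_L t \,\}
    && \text{(the logical theorems of } L\text{)}\\
  Th^{e}_{L}(P) &=_{df} \{\, \mathit{et} \mid P \vdash_L \mathit{et},\ \mathit{et} \notin Th(L) \,\}
    && \text{(the empirical part of the theory)}\\
  T_{L}(P) &=_{df} Th(L) \cup Th^{e}_{L}(P)
    && \text{(the formal } L\text{-theory with premises } P\text{)}
\end{align*}
```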
Scientists always use conditionals in their descriptions of various definitions, propositions, and theorems to connect a concept, fact, situation or conclusion to its sufficient conditions. The major work of almost all scientists is to discover some sufficient condition relations between various phenomena, data, and laws in their research
72
Jingde Cheng
fields. Indeed, Russell said in 1903, "Pure Mathematics is the class of all propositions of the form 'p implies q,' where p and q are propositions containing one or more variables, the same in the two propositions, and neither p nor q contains any constants except logical constants" [14]. In general, a conditional must concern two parts which are connected by the connective "if ... then ..." and called the antecedent and the consequent of that conditional, respectively. The truth of a conditional depends not only on the truth of its antecedent and consequent but also, and more essentially, on a necessarily relevant and conditional relation between them. The notion of conditional plays the most essential role in reasoning because any reasoning form must invoke it, and therefore, it has historically always been the most important subject studied in logic and is regarded as the heart of logic [1]. In fact, the notion of conditional was already discussed by the ancient Greeks. For example, the extensional truth-functional definition of the notion of material implication was given by Philo of Megara in about 400 B.C. [11, 16]. When we study and use logic, the notion of conditional may appear in both the object logic (i.e., the logic we are studying) and the meta-logic (i.e., the logic we are using to study the object logic). In the object logic, there usually is a connective in its formal language to represent the notion of conditional, and the notion of conditional, usually represented by a meta-linguistic symbol, is also used for representing a logical consequence relation in its proof theory or model theory. On the other hand, in the meta-logic, the notion of conditional, usually in the form of natural language, is used for defining various meta-notions and describing various meta-theorems about the object logic. From the viewpoint of object logic, there are two classes of conditionals. One class is empirical conditionals and the other class is logical conditionals. For a logic, a conditional is called an empirical conditional of the logic if its truth-value, in the sense of that logic, depends on the contents of its antecedent and consequent and therefore cannot be determined only by its abstract form (i.e., from the viewpoint of that logic, the relevant relation between the antecedent and the consequent of that conditional is regarded to be empirical); a conditional is called a logical conditional of the logic if its truth-value, in the sense of that logic, depends only on its abstract form but not on the contents of its antecedent and consequent, and therefore, it is considered to be universally true or false (i.e., from the viewpoint of that logic, the relevant relation between the antecedent and the consequent of that conditional is regarded to be logical). A logical conditional that is considered to be universally true, in the sense of that logic, is also called an entailment of that logic. Indeed, the most intrinsic difference between various logic systems lies in which class of conditionals they regard as entailments, as Diaz pointed out: "The problem in modern logic can best be put as follows: can we give an explanation of those conditionals that represent an entailment relation?" [7]
4 The Logical Basis for Automated Knowledge Acquisition
Any science is established based on some fundamental principles and assumptions such that removing or replacing one of them by a new one will have a great influence on the contents of the science and even lead to creating a completely new branch of the science. CML was established in order to provide formal languages for describing the structures with which mathematicians work, and the methods of proof available to them; its principal aim is a precise and adequate understanding of the notion of mathematical proof. Given its mathematical method, it must be descriptive rather than prescriptive, and its description must be idealized. CML was established based on a number of fundamental assumptions. Some of the assumptions concerning our subject are as follows:
• The classical abstraction: The only properties of a proposition that matter to logic are its form and its truth-value.
• The Fregean assumption: The truth-value of a proposition is determined by its form and the truth-values of its constituents.
• The classical account of validity: An argument is valid if and only if it is impossible for all its premises to be true while its conclusion is false.
• The principle of bivalence: There are exactly two truth-values, TRUE and FALSE. Every declarative sentence has one or other, but not both, of these truth-values.
The classical account of validity is the logical validity criterion of CML by which one can decide whether the conclusion of an argument or a reasoning really does follow from its premises or not in the framework of CML. However, since the relevance between the premises and conclusion of an argument is not accounted for by the classical validity criterion of CML, a reasoning based on CML is not necessarily relevant, i.e., its conclusion may be not relevant at all, in the sense of meaning and context, to its premises. In other words, in the framework of CML, even if a reasoning is classically valid, the relevance between its premises and its conclusion cannot be guaranteed necessarily. Note that this proposition is also true in the framework of any classical conservative extension or non-classical alternative of CML where the classical account of validity is adopted as the logical validity criterion. On the other hand, taking the above assumptions into account, in CML the notion of conditional, which is intrinsically intensional but not truth-functional, is represented by the truth-functional extensional notion of material implication (denoted by → in this paper) that is defined as A→B =df ¬(A∧¬B) or A→B =df ¬A∨B. This definition of material implication, with the inference rule of Modus Ponens for material implication (from A and A→B to infer B), can adequately satisfy the truth-preserving requirement of CML, i.e., the conclusion of a classically valid reasoning based on CML must be true (in the sense of CML) if all premises of the reasoning are true (in the sense of CML). This requirement is basic and adequate for CML to be used as a formal description tool by mathematicians. However, the material implication is intrinsically different from the notion of conditional in meaning (semantics). It is no more than an extensional truth-function of its antecedent and consequent but does not require that there is a necessarily relevant
and conditional relation between its antecedent and consequent, i.e., the truth-value of the formula A→B depends only on the truth-values of A and B, though there could exist no necessarily relevant and conditional relation between A and B. It is this intrinsic difference in meaning between the notion of material implication and the notion of conditional that leads to the well-known “implicational paradox problem” in CML. The problem is that if one regards the material implication as the notion of conditional and regards every logical theorem of CML as an entailment or valid reasoning form, then a great number of logical axioms and logical theorems of CML, such as A→(B→A), B→(¬A∨A), ¬A→(A→B), (¬A∧A)→B, (A→B)∨(¬A→B), (A→B)∨(A→¬B), (A→B)∨(B→A), ((A∧B)→C)→((A→C)∨(B→C)), and so on, present some paradoxical properties and therefore they have been referred to in the literature as “implicational paradoxes” [1]. Because all implicational paradoxes are logical theorems of any CML-theory TCML(P), for a conclusion of a reasoning from a set P of premises based on CML, we cannot directly accept it as a correct conclusion in the sense of conditional, even if each of the given premises is regarded to be true and the conclusion can be regarded to be true in the sense of material implication. For example, from any given premise A ∈ P, we can infer B→A, C→A, ... where B, C, ... are arbitrary formulas, by using the logical axiom A→(B→A) of CML and Modus Ponens for material implication, i.e., B→A ∈ TCML(P), C→A ∈ TCML(P), ... for any A ∈ TCML(P). However, from the viewpoint of scientific reasoning as well as our everyday reasoning, these inferences cannot be regarded to be valid in the sense of conditional because there may be no necessarily relevant and conditional relation between B, C, ... and A and therefore we cannot say “if B then A,” “if C then A,” and so on. Obviously, no scientists did or will reason in such a way in their scientific discovery. This situation means that from the viewpoint of conditional or entailment, the truth-preserving property of reasoning based on CML is meaningless. Note that any classical conservative extension or non-classical alternative of CML where the notion of conditional is directly or indirectly represented by the material implication has the similar problems as the above problems in CML. Consequently, in the framework of CML and its various classical or non-classical conservative extensions, even if a reasoning is classically valid, neither the necessary relevance between its premises and conclusion nor the truth of its conclusion in the sense of conditional can be guaranteed necessarily. From the viewpoint to regard reasoning as the process of drawing new conclusions from given premises, any meaningful reasoning should be ampliative but not circular and/or tautological, i.e., the truth of conclusion of the reasoning should be recognized after the completion of the reasoning process but not be invoked in deciding the truth of premises of the reasoning. As an example, let us see the most typical human logical reasoning form, Modus Ponens. The natural language representation of Modus Ponens may be “if A holds then B holds, now A holds, therefore B holds.” When we reason using Modus Ponens, what we know? We know “if A holds then B holds” and “A holds.” Before the reasoning is performed, we do not know whether or not “B holds.” If we know, then we would not need to reason at all. Therefore, Modus Ponens should be ampliative but not circular and/or tautological. 
How, then, can we know "B holds" by using Modus Ponens? Indeed, by using Modus Ponens, we can
know “B holds,” which is unknown until the reasoning is performed, based on the following reasons: (i) “A holds,” (ii) “There is no case such that A holds but B does not hold,” and (iii) we know (ii) without investigating either “whether A holds or not” or “whether B holds or not.” Note that the Wright-Geach-Smiley criterion (see below) for entailment is corresponding to the above (ii) and (iii). From this example, we can see that the key point in ampliative and non-circular reasoning is the primitive and intensional relevance between the antecedent and consequent of a conditional. Because the material implication in CML is an extensional truth-function of its antecedent and consequent but not requires the existence of a necessarily relevant relation between its antecedent and consequent, a reasoning based on the logic must be circular and/or tautological. For example, Modus Ponens for material implication is usually represented in CML as “from A and A→B to infer B.” According to the extensional truth-functional semantics of the material implication, if we know “A is true” but do not know the truth-value of B, then we cannot decide the truth-value of “A→B.” In order to know the truth-value of B using Modus Ponens for material implication, we have to know the truth-value of B before the reasoning is performed! Obviously, Modus Ponens for material implication is circular and/or tautological if it is used as a reasoning form, and therefore, it is not a natural representation of Modus Ponens. Moreover, in general, our knowledge about a domain as well as a scientific discipline may be incomplete and inconsistent in many ways, i.e., it gives us no evidence for deciding the truth of either a proposition or its negation, and it directly or indirectly includes some contradictions. Therefore, reasoning with incomplete (and some time inconsistent) information and/or knowledge is the rule rather than the exception in our everyday real-life situations and almost all scientific disciplines. Also, even if our knowledge about a domain or scientific discipline seems to be consistent at present, in future, we may find a new fact or rule that is inconsistent with our known knowledge, i.e., we find a contradiction. In these cases, we neither doubt the “logic” we used in our everyday logical thinking nor reason out anything from the contradictions, but we must consider that there are some wrong things in our knowledge and will investigate the causes of the contradictions, i.e., we do reason under inconsistency in order to detect and remove the causes of the contradictions. Indeed, in scientific research, the detection and explanation of an inconsistency between a new fact and known knowledge often leads to formation of new concepts or discovery of new principles. How to reason with inconsistent knowledge is an important issue in scientific discovery and theory formation. For a paraconsistent logic with Modus Ponens as an inference rule, the paraconsistence requires that the logic does not have “(¬A∧A)→B” as a logical theorem where “A” and “B” are any two different formulas and “→” is the relation of implication used in Modus Ponens. If a logic is not paraconsistent, then infinite propositions (even negations of those logical theorems of the logic) may be reasoned out based on the logic from a set of premises that directly or indirectly include a contradiction. However, CML assumes that all the information is on the table before any deduction is performed. 
Moreover, it is well known that CML is explosive but not paraconsistent, and therefore, any directly or indirectly inconsistent CML-theory TCML(P) must be explosive. This is because CML uses Modus Ponens for material implication as its inference rule, and has “(¬A∧A)→B”
as a logical theorem, which, in fact, is the most typical implicational paradox. In fact, reasoning under inconsistency is impossible within the framework of CML. Through the above discussion, we have seen that reasoning based on CML is not necessarily relevant, that the classical truth-preserving property of reasoning based on CML is meaningless in the sense of conditional, that reasoning based on CML must be circular and/or tautological but not ampliative, and that reasoning under inconsistency is impossible within the framework of CML. These facts also hold for the classical and non-classical conservative extensions of CML. What these facts tell us is that CML and its various classical or non-classical conservative extensions are not a suitable logical basis for automated knowledge acquisition.
Traditional relevant logics were constructed during the 1950s in order to find a mathematically satisfactory way of grasping the elusive notion of relevance of antecedent to consequent in conditionals, and to obtain a notion of implication which is free from the so-called “paradoxes” of material and strict implication [1, 2, 8, 12, 13]. Some major traditional relevant logic systems are “system E of entailment”, “system R of relevant implication”, and “system T of ticket entailment”. A major feature of these relevant logics is that they have a primitive intensional connective to represent the notion of conditional and that their logical theorems include no implicational paradoxes. Von Wright, Geach, and Smiley suggested some informal criteria for the notion of entailment, the so-called “Wright-Geach-Smiley criterion” for entailment: “A entails B, if and only if, by means of logic, it is possible to come to know the truth of A→B without coming to know the falsehood of A or the truth of B” [1]. However, it has so far proved hard to know exactly how to formally interpret such epistemological phrases as “coming to know” and “getting to know” in the context of logic. Anderson and Belnap proposed variable-sharing as a necessary but not sufficient formal condition for the relevance between the antecedent and consequent of an entailment. The underlying principle of these relevant logics is the relevance principle: for any entailment provable in E, R, or T, its antecedent and consequent must share a sentential variable. Variable-sharing is a formal notion designed to reflect the idea that there should be a meaning-connection between the antecedent and consequent of an entailment [1, 2, 12, 13]. It is this relevance principle that excludes the implicational paradoxes from the logical axioms and theorems of relevant logics. However, although the traditional relevant logics reject the implicational paradoxes, there still exist some logical axioms or theorems in these logics which are not so natural in the sense of conditional. Such logical axioms or theorems are, for instance, (A∧B)⇒A, (A∧B)⇒B, (A⇒B)⇒((A∧C)⇒B), A⇒(A∨B), B⇒(A∨B), (A⇒B)⇒(A⇒(B∨C)) and so on, where ⇒ denotes the primitive intensional connective of the logics representing the notion of conditional. The present author named these logical axioms or theorems ‘conjunction-implicational paradoxes' and ‘disjunction-implicational paradoxes' [3, 4, 6]. For example, from any given premise A⇒B, we can infer (A∧C)⇒B, (A∧C∧D)⇒B, and so on, by using the logical theorem (A⇒B)⇒((A∧C)⇒B) of T, E, and R together with Modus Ponens for the conditional.
However, from the viewpoint of scientific reasoning as well as our everyday reasoning, these inferences cannot be regarded as valid in the sense of conditional because there may be no necessarily relevant and conditional relation between C, D, ... and B and therefore we cannot say ‘if A and C then B', ‘if A and C and D then B', and so on.
In order to establish a satisfactory logical calculus of conditional to underlie relevant reasoning, the present author has proposed some strong relevant logics (or strong relevance logics), named Rc, Ec, and Tc [3, 4, 6]. These logics require that the premises of an argument represented by a conditional include no unnecessary conjuncts and that the conclusion of that argument includes no unnecessary disjuncts. As a modification of the traditional relevant logics R, E, and T, the strong relevant logics Rc, Ec, and Tc reject all conjunction-implicational paradoxes and disjunction-implicational paradoxes in R, E, and T, respectively. What underlies the strong relevant logics is the strong relevance principle: if A is a theorem of Rc, Ec, or Tc, then every sentential variable in A occurs at least once as an antecedent part and at least once as a consequent part. Since the strong relevant logics are free not only of implicational paradoxes but also of conjunction-implicational and disjunction-implicational paradoxes, within their framework, if a reasoning is valid, then both the relevance between its premises and its conclusion and the validity of its conclusion in the sense of conditional can be guaranteed in a certain sense of strong relevance. The strong relevant logics are therefore promising candidates for the fundamental logic to satisfactorily underlie automated knowledge acquisition. First, the strong relevant logics can certainly underlie relevant reasoning in a certain sense of strong relevance. Second, reasoning based on the strong relevant logics is truth-preserving in the sense of the intensional primitive semantics of the conditional. Third, since the Wright-Geach-Smiley criterion for entailment is accounted for by the notion of conditional, reasoning based on the strong relevant logics is ampliative but not circular and/or tautological. Finally, the strong relevant logics can certainly underlie paraconsistent reasoning.
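As a small illustration of the relevance principle mentioned above (a sketch added for this discussion, with an assumed tuple encoding of formulas), the following check tests whether the antecedent and consequent of an entailment share at least one sentential variable; the strong relevance principle further constrains where those variables may occur.

```python
# Necessary variable-sharing condition for relevance: the antecedent and the
# consequent of an entailment A => B must share a sentential variable.
def variables(f):
    if isinstance(f, str):
        return {f}
    _, *args = f
    return set().union(*(variables(a) for a in args))

def shares_variable(entailment):
    op, antecedent, consequent = entailment
    assert op == '=>'
    return bool(variables(antecedent) & variables(consequent))

print(shares_variable(('=>', ('and', 'A', 'B'), 'A')))           # True
print(shares_variable(('=>', ('and', 'A', ('not', 'A')), 'B')))  # False: paradox rejected
```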
5
Automated Knowledge Acquisition by Relevant Reasoning
Knowledge in various domains is often represented in the form of conditionals. From the viewpoint of a logical calculus of conditional, the problem of automated knowledge acquisition can be regarded as the problem of automated (empirical) theorem finding from one or more knowledge sources. Since a formal theory T_L(P) is generally an infinite set of formulas, even though the premises P are finite, we have to find some method to limit the range of candidates for “new and interesting theorems” to a finite set of formulas. The strategy the present author adopted is to sacrifice the completeness of knowledge representation and reasoning in order to obtain a finite set of candidates. This is based on the author's conjecture that almost all “new and interesting theorems” of a formal theory can be deduced from the premises of that theory by finitely many inference steps involving a finite number of low-degree entailments. Let (F(L), |−L) be a formal logic system and k a natural number. The kth degree fragment of L, denoted Th^k(L), is the set of logical theorems of L inductively defined as follows (in terms of a Hilbert-style formal system): (1) if A is a jth (j ≤ k) degree formula and an axiom of L, then A ∈ Th^k(L); (2) if A is a jth (j ≤ k) degree formula which is the result of applying an inference rule of L to some
members of Th^k(L), then A ∈ Th^k(L); and (3) nothing else is a member of Th^k(L), i.e., only the formulas obtained from repeated applications of (1) and (2) are members of Th^k(L). Obviously, the definition of the kth degree fragment of logic L is constructive. Note that the kth degree fragment of logic L does not necessarily include all kth degree logical theorems of L, because it is possible that deductions of some kth degree logical theorems of L must invoke logical theorems whose degrees are higher than k. On the other hand, the following obviously holds: Th^0(L) ⊂ Th^1(L) ⊂ ... ⊂ Th^(k−1)(L) ⊂ Th^k(L) ⊂ Th^(k+1)(L) ⊂ ...
Let (F(L), |−L) be a formal logic system, P ⊂ F(L), and k and j two natural numbers. A formula A is said to be jth-degree-deducible from P based on Th^k(L) if and only if there is a finite sequence of formulas f1, ..., fn such that fn = A and, for all i (i ≤ n): (1) fi ∈ Th^k(L), or (2) fi ∈ P and the degree of fi is not higher than j, or (3) fi, whose degree is not higher than j, is the result of applying an inference rule to some members f_j1, ..., f_jm (j1, ..., jm < i) of the sequence. If P ≠ ∅, then the set of all formulas which are jth-degree-deducible from P based on Th^k(L) is called the jth degree fragment of the formal theory with premises P based on Th^k(L), denoted by T^j_Th^k(L)(P). A formula is said to be jth-degree-deductive from P based on Th^k(L) if and only if it is jth-degree-deducible from P based on Th^k(L) but not (j−1)th-degree-deducible from P based on Th^k(L). Note that in the above definitions we do not require j ≤ k. The notion of jth-degree-deductive can be used as a metric to measure the difficulty of deducing an empirical theorem from given premises P based on logic L; the difficulty is relative to the complexity of the problem being investigated as well as to the strength of the underlying logic L. Based on the above definitions, we have the following result. Let T_SRL(P) be a formal theory where SRL is a strong relevant logic, and let k and j be two natural numbers. If P is finite, then T^j_Th^k(SRL)(P) must be finite. This is true even if T^j_Th^k(SRL)(P) is directly or indirectly inconsistent. This means that there exists a fixed point P′ such that P ⊆ P′ and T^j_Th^k(SRL)(P′) = P′, even if T^j_Th^k(SRL)(P) is directly or indirectly inconsistent. The above result about SRL does not hold for paradoxical logics such as CML and its various classical or non-classical conservative extensions, nor for the traditional (weak) relevant logics T, E, and R, because these logics accept implicational, conjunction-implicational, or disjunction-implicational paradoxes as logical theorems. Let T_SRL(P) be a formal theory where SRL is a strong relevant logic and let k be a natural number. SRL is said to be kth-degree-complete for T_SRL(P) if and only if for any empirical theorem et of Th^e_SRL(P) there is a finite natural number j such that et is jth-degree-deducible from P based on Th^k(SRL), i.e., all empirical theorems of Th^e_SRL(P) are somehow deducible from P based on Th^k(SRL). Taking a strong relevant logic SRL as the fundamental logic, and constructing, say, the 3rd degree fragment of SRL in advance, for any given premises P we can find the fixed point T^j_Th^3(SRL)(P). Since SRL is free of implicational, conjunction-implicational, and disjunction-implicational paradoxes, we can obtain finitely many meaningful empirical theorems as candidates for “new and interesting theorems” of T^j_Th^3(SRL)(P). Moreover, if SRL is 3rd-degree-complete for T_SRL(P),
then we can obtain all candidates for “new and interesting theorems” of T_SRL(P). These statements hold even if T_SRL(P) is inconsistent. Of course, SRL may not be 3rd-degree-complete for T_SRL(P); in this case, a fragment of SRL whose degree is higher than 3 must be used in order to find more empirical theorems. Up to now, we have established a conceptual foundation for empirical theorem finding within the framework of strong relevant logic. The next problem is how to develop programs to find empirical theorems automatically. Since no backward and/or refutation deduction system can serve as an autonomous reasoning mechanism to form and/or discover completely new things, what we need is an autonomous forward reasoning system. We are developing an automated forward deduction system for general-purpose entailment calculus, named EnCal, which provides its users with the following major facilities [5]. For a logic L, which may be a propositional logic, a first-order predicate logic, or a second-order predicate logic, a non-empty set P of formulas as premises, a natural number k (usually k < 5), and a natural number j, all specified by the user, EnCal can do the following tasks: (1) reason out all logical theorem schemata of the kth degree fragment of L; (2) verify whether or not a formula is a logical theorem of the kth degree fragment of L and, if yes, give the proof; (3) reason out all empirical theorems of the jth degree fragment of the formal theory with premises P based on Th^k(L); and (4) verify whether or not a formula is an empirical theorem of the jth degree fragment of the formal theory with premises P based on Th^k(L) and, if yes, give the proof. Now, for one or more given knowledge sources, we can first represent the explicitly known knowledge as logical formulas, then regard this set of logical formulas as the set of premises of a formal theory for the subject under investigation, and use relevant reasoning based on strong relevant logic and EnCal to find new knowledge from the premises.
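The following Python sketch illustrates the general idea of degree-bounded forward deduction; it is a toy illustration and not the EnCal system itself, and the tuple encoding of formulas, the degree function, and the premise set are assumptions made for the example.

```python
# Degree-bounded forward deduction: starting from finite premises, repeatedly
# apply Modus Ponens for the conditional '=>' but keep only conclusions whose
# degree (nesting depth of '=>') does not exceed j.

def degree(f):
    """Nesting depth of the conditional '=>' in a formula."""
    if isinstance(f, str):
        return 0
    op, *args = f
    d = max(degree(a) for a in args)
    return d + 1 if op == '=>' else d

def forward_closure(premises, j):
    derived = set(premises)
    changed = True
    while changed:
        changed = False
        conditionals = [f for f in derived if isinstance(f, tuple) and f[0] == '=>']
        for (_, ante, cons) in conditionals:
            if ante in derived and cons not in derived and degree(cons) <= j:
                derived.add(cons)
                changed = True
    return derived

premises = {('=>', 'A', 'B'), ('=>', 'B', ('=>', 'C', 'D')), 'A', 'C'}
print(forward_closure(premises, j=1))  # additionally derives 'B', ('=>', 'C', 'D'), 'D'
```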
References
[1] Anderson, A.R., Belnap Jr., N.D.: Entailment: The Logic of Relevance and Necessity, Vol. I. Princeton University Press, Princeton (1975)
[2] Anderson, A.R., Belnap Jr., N.D., Dunn, J.M.: Entailment: The Logic of Relevance and Necessity, Vol. II. Princeton University Press, Princeton (1992)
[3] Cheng, J.: Logical Tool of Knowledge Engineering: Using Entailment Logic rather than Mathematical Logic. Proc. ACM 19th Annual Computer Science Conference. ACM Press, New York (1991) 228-238
[4] Cheng, J.: The Fundamental Role of Entailment in Knowledge Representation and Reasoning. Journal of Computing and Information, Vol. 2, No. 1, Special Issue: Proceedings of the 8th International Conference of Computing and Information. Waterloo (1996) 853-873
[5] Cheng, J.: EnCal: An Automated Forward Deduction System for General-Purpose Entailment Calculus. In: Terashima, N., Altman, E. (eds.): Advanced IT Tools, Proc. IFIP World Conference on IT Tools, IFIP 96 - 14th World Computer Congress. Chapman & Hall, London (1996) 507-514
[6] Cheng, J.: A Strong Relevant Logic Model of Epistemic Processes in Scientific Discovery. In: Kawaguchi, E., Kangassalo, H., Jaakkola, H., Hamid, I.A. (eds.): Information Modelling and Knowledge Bases XI. IOS Press, Amsterdam (2000) 136-159
[7] Diaz, M.R.: Topics in the Logic of Relevance. Philosophia Verlag, München (1981)
[8] Dunn, J.M., Restall, G.: Relevance Logic. In: Gabbay, D., Guenthner, F. (eds.): Handbook of Philosophical Logic, 2nd Edition, Vol. 6. Kluwer Academic, Dordrecht (2002) 1-128
[9] Gödel, K.: Russell's Mathematical Logic. In: Schilpp (ed.): The Philosophy of Bertrand Russell. Open Court Publishing Company, Chicago (1944)
[10] Hayes-Roth, F., Waterman, D.A., Lenat, D.B. (eds.): Building Expert Systems. Addison-Wesley, Boston (1983)
[11] Kneale, W., Kneale, M.: The Development of Logic. Oxford University Press, Oxford (1962)
[12] Mares, E.D., Meyer, R.K.: Relevant Logics. In: Goble, L. (ed.): The Blackwell Guide to Philosophical Logic. Blackwell, Oxford (2001) 280-308
[13] Read, S.: Relevant Logic: A Philosophical Examination of Inference. Basil Blackwell, Oxford (1988)
[14] Russell, B.: The Principles of Mathematics. 2nd edition. Cambridge University Press, Cambridge (1903, 1938). Norton Paperback edition. Norton, New York (1996)
[15] Sestito, S., Dillon, T.S.: Automated Knowledge Acquisition. Prentice Hall, Upper Saddle River (1994)
[16] Tarski, A.: Introduction to Logic and to the Methodology of the Deductive Sciences. 4th edition, revised. Oxford University Press, Oxford (1941, 1946, 1965, 1994)
CONCEPTOOL: Intelligent Support to the Management of Domain Knowledge Ernesto Compatangelo and Helmut Meisel Department of Computing Science, University of Aberdeen AB24 3UE Scotland, UK {compatan,hmeisel}@csd.abdn.ac.uk
Abstract. We have developed ConcepTool, a system which supports the modelling and the analysis of expressive domain knowledge and thus of different kinds of application ontologies. The knowledge model of ConcepTool includes semantic constructors (e.g. own slots, enumerations, whole-part links, synonyms) separately available in most framebased, conceptual, and lexical models. Domain knowledge analysis in ConcepTool explicitly takes into account the differences between distinct categories of concepts such as classes, associations, and processes. Moreover, our system uses lexical and heuristic inferences in conjunction with logical deductions based on subsumption in order to improve the quantity and the quality of analysis results. ConcepTool can also perform approximate reasoning by ignoring in a controlled way selected parts of the expressive power of the domain knowledge under analysis.
1
Context, Background and Motivations
Knowledge technologies are based on the concept of a “lifecycle” that encompasses knowledge acquisition, modelling, retrieval, maintenance, sharing, and reuse. Domain knowledge is currently a major application area for the lifecycle-oriented approach to knowledge management [15]. This area includes most ontologies, which are formal shared specifications of domain knowledge concepts [6]. Domain knowledge is continuously created and modified over time by diverse actors. The effective management of this knowledge thus needs its continuous Verification and Validation (V&V). Verification analyses whether the modelling intentions are met and thus the intended meaning matches the actual one. Validation analyses whether there is consistency, and thus lack of contradictions. In practice, sizeable domain knowledge bodies cannot be properly verified and validated (i.e. they cannot be semantically analysed) without a computer-based support. However, semantic analysis requires computers to understand the structure and the lexicon of this knowledge, thus making its implications explicit (verification) and detecting inconsistencies in it (validation). In recent years, Description Logics (DLs) were proposed as a family of languages for representing domain knowledge in a way that enables automated reasoning about it [1]. However, DL reasoners (e.g. FaCT [10], (Power)Loom [12],
CLASSIC [3], RACER [9]) and, consequently, the analysis tools based on them (e.g. i•COM [8], OilEd [2]) do not fully support domain knowledge V&V. There are two main reasons why DL-based systems do not fully support the verification and the validation of domain knowledge (and application ontologies): – They do not really operate at the conceptual level, as DL reasoning algorithms fail to distinguish between different deductive mechanisms for diverse categories of concepts. For instance, the differences between hierarchies of classes and hierarchies of functions are not taken into account [4]. – They do not complement taxonomic deductions with other logical, lexical and heuristic inferences in order to maximise conceptual analysis. Therefore, they do not detect further (potential) correlations between concepts, attributes, associative links, and partonomic links [5]. We have developed ConcepTool, the Intelligent Knowledge Management Environment (IKME) described in this paper, in order to overcome the above drawbacks of current DL-based tools for domain knowledge analysis.
2
The Knowledge Model of CONCEPTOOL
Existing modelling-oriented environments aim to provide an adequate expressive power to represent and combine a rich variety of knowledge. Modelling-oriented focus on expressiveness is not influenced by the potential undecidability or by the potential complexity of semantic analysis. Conversely, existing analysis-oriented environments aim to provide an adequate set of specialised reasoning services within a decidable, complete and computationally tractable framework. However, current analysis focus on worst-case decidability, completeness and tractability actually limits the allowed knowledge expressiveness. Research on ontologies highlighted that expressive domain knowledge models are needed to enable sharing and reuse in distributed heterogeneous contexts [7]. This implies allowing incomplete (approximate) deductions whenever decidable, complete and tractable reasoning is not theoretically or practically possible [14]. ConcepTool provides a “reasonable framework” for domain knowledge which supports expressive modelling and “heterogeneous” analysis. This framework is based on a modular approach that (i) expresses domain concepts by composing basic semantic constructors and (ii) combines diverse kinds of specialised inferences about concepts. The constructors that characterise the knowledge model of ConcepTool are described below, while the heterogeneous approach to reasoning used in our system is discussed in the following section. In devising the knowledge model (and the expressiveness) of ConcepTool, we focused on the following aspects which enhance its semantic interoperability: – Allow the analyst to specify domain knowledge at the “conceptual level”, using different concept types such as classes, associations, functions, goals. – Include constructors used in frame-based conceptual, and lexical knowledge models (e.g. own slots, enumerations, whole-part links, synonyms).
– Introduce the notion of individual, which generalises the notion of instance to denote an element that is not necessarily a member of an explicitly defined concept in the considered domain. We describe the ConcepTool knowledge model using the resource ontology in Figure 1, which integrates semantic constructors from frame-based, DL-based, conceptual, and lexical models. This ontology includes three kinds of elements, namely concepts (i.e. class and individual frames), roles (i.e. slots) and properties of inheritable concept-role links (i.e. facets of inheritable slots). All elements have non-inheritable properties (own slots). Concepts also have inheritable properties (template slots), which constrain them to other concepts as specified by facets. Concepts can be interpreted either as individuals or as sets of individuals (i.e. classes, associations, functions, etc.). Roles are interpreted as sets of binary links. Concept-role links (slot attachments) are interpreted as constraints of inheritable concept properties (e.g. attributes, parts, associative links, global axioms). Concept and role constructors (slots and facets) are interpreted according to the standard set-theoretic semantics used in description logics [1]. Non-inheritable concept properties (own slots) include, among others, the following constructors — the first three ones are not applicable to individuals. – The concept type specifies whether the concept is a class, an association, a process, or an individual belonging to one of these three categories. – The concept definition states whether the concept is either totally or only partially specified by its description. In the latter case, the concept is a generic subset of the (dummy) set fully specified by the description. – The instantiability states whether the concept can have any instances. – An enumeration explicitly defines a concept as the set characterised by the listed individuals. Note that concepts defined in this way (i.e. starting from the empty set — bottom — and adding individuals) cannot be compared with concepts defined starting from the universe (Top or Thing) and adding constraints. In other words, no subsumption relationship can be derived between concepts belonging to these two complementary groups. Each labelled field defines a couple (key name and corresponding value) that can be used to introduce a new own property such as author, creation date, version. All these user-definable properties can be used for different knowledge management purposes (e.g. versioning, annotations, retrieval). Inheritable concept properties (template slots) include the following constructors. Attributes are individual properties shared by all the individuals that belong to the same set concept. Parts are either individual components in structured individuals or set concept components in structured set concepts. Associative links are bi-directional connections between different types of concepts (e.g. classes and associations). Global axioms are mutual constraints between set concepts (e.g. disjointness or full coverage in terms of some subclasses). Non-inheritable role properties include inverse roles, used to define bidirectional associative links, and synonym roles, used to specify property name aliases.
Restrictions on inheritable concept-role links (i.e. slot facets) include the following constructors. Minimum and maximum cardinalities constrain the multiplicity of a property (i.e. its lower and upper number of allowed values). Fillers constrain all individuals belonging to a concept to share the same value(s) for a property. The domain of an attribute constrains all its values to be selected from the class (or combination of classes) specified as domain. A domain enumeration explicitly lists all the allowed values (individuals) in the domain.
Fig. 1. The ConcepTool resource ontology. (a) Resource ontology overview: concepts (frames), roles (slot frames), and concept-role links (frame-slot attachments), each connected (0,N) to role properties (slots on slot frames), local restrictions on roles (facets), and global restrictions on roles. (b) Concepts — non-inheritable properties (own slots): Concept Type 1:1 {Class | ...}; Name 1:1 String; Definition 1:1 {Partial | Total}; Superconcepts 0:M Concept; Instances 0:M Concept Instance; Instantiability 1:1 {Concrete | Abstract}; Enumeration 0:M Individual; Labelled Fields 0:M (Key, Value) Pair. Inheritable properties (template slots): Attributes 0:M Concept Attribute; Parts 0:M Concept Part; Associative Links 0:M Concept; Global Axioms 0:M Assertion. (c) Concept-role links — restrictions on inheritable properties (facets): Min Cardinality 1:1 Integer; Max Cardinality 1:1 Integer; Fillers 0:M Individual; Domain 0:1 Class (and|or Class)*; Domain Enumeration 0:M Individual. (d) Roles — non-inheritable properties (own slots): Name 1:1 String; Superroles 0:M Global Role; Inverse Role 0:1 Global Role; Synonym Roles 0:M Global Role; Labelled Fields 0:M (Key, Value) Pair.
Individuals in ConcepTool either explicitly or implicitly commit to the structure of some set concepts. Individuals directly belonging to a set concept are denoted as concept instances, while those belonging to any of its subsets are denoted as concept elements. This interpretation of individuals extends the one in DL-based and in frame-based approaches.
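A possible way to picture the knowledge model described in this section is the following data-structure sketch; it is an illustration only (all class and field names are assumptions, not ConcepTool's actual implementation), mirroring the own slots, template slots, and facets of Figure 1.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Dict

@dataclass
class Role:                               # (d) roles: non-inheritable properties
    name: str
    superroles: List[str] = field(default_factory=list)
    inverse_role: Optional[str] = None
    synonym_roles: List[str] = field(default_factory=list)

@dataclass
class ConceptRoleLink:                    # (c) facets restricting an inheritable property
    role: str
    min_cardinality: int = 0
    max_cardinality: Optional[int] = None
    fillers: List[str] = field(default_factory=list)
    domain: Optional[str] = None
    domain_enumeration: List[str] = field(default_factory=list)

@dataclass
class Concept:                            # (b) concepts: own slots and template slots
    name: str
    concept_type: str = "class"           # class | association | process | individual
    definition: str = "partial"           # partial | total
    instantiability: str = "concrete"     # concrete | abstract
    superconcepts: List[str] = field(default_factory=list)
    enumeration: List[str] = field(default_factory=list)
    labelled_fields: Dict[str, str] = field(default_factory=dict)
    attributes: List[ConceptRoleLink] = field(default_factory=list)
    parts: List[ConceptRoleLink] = field(default_factory=list)
    associative_links: List[ConceptRoleLink] = field(default_factory=list)
```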
3
Knowledge Management Services in CONCEPTOOL
In devising the services to be provided by ConcepTool, we focused on a wide range of inferences and management support functionalities for very expressive domain knowledge. Our previous research highlighted the benefits gained (i) by combining different kinds of inferences and/or (ii) by including incomplete deductions and/or (iii) by reducing the expressiveness of the knowledge subject to analysis [5, 11]. We have incorporated all these features in ConcepTool, which currently combines functionalities from three different levels: – At the epistemological level, deductions from DL engines are used as the fundamental logic-based analysis services. These deductions are considered as a kind of reference ontology that can be compared with the domain knowledge under analysis. However, DL engines do not correctly compute subsumption (and thus the analysis results based on it) between concepts with “non-standard” hierarchical rules, such as associations (relationships) and functions. Therefore, conceptual emulators [4] are used to transform DL deductions into the result expected at the conceptual level whenever needed. – At the conceptual level, deductions include (i) the generation of different concept hierarchies (one for each concept type), (ii) the identification of inconsistent concepts, and (iii) the explicitation of new concept constraints and properties which are not explicitly stated in a domain knowledge body. The possibility of setting a specific level of expressive power to be used in conceptual reasoning allows the generation of further (and / or different) analysis results. This is performed by selecting those semantic constructors (e.g. attribute multiplicities, domains, or fillers) that can be ignored while reasoning with a domain knowledge body. The reduction of the expressive power subject to analysis is particularly useful for aligning and articulating domain knowledge during the early stages of ontology sharing and reuse [5]. – At the lexical level, deductions include lexical subsumption, synonymity, antonymity, and partonymity provided by lexical databases such as WordNet [13]. These deductions are complemented by a set of lexical heuristic inferences based on string containment. Inferential results from the three levels are presented to the analyst who can decide which ones to accept and which ones to reject. Both accepted and rejected inferences can be incorporated into the knowledge body under analysis by adding new concept constraints or by removing existing ones. The different kinds of epistemological, conceptual and lexical inferences can be performed in any user-defined sequence in order to maximise analysis results.
In other words, the user (i) can decide which inferential services to invoke, (ii) compares the results of each invoked services, and (iii) decides which step to perform next. However, (partial) results from a specific inferential service (e.g. a constraint satisfaction check) could be automatically passed to another inferential service (e.g. a subsumption computation algorithm). In this way, a reasoning system could derive results about expressive knowledge that cannot be completely interpreted and processed by each single inferential service alone. The combination of different kinds of inferences has been performed with a prototypical version of ConcepTool which dynamically fuses taxonomic deductions from a DL engine with inferences from a constraint satisfaction reasoner [11]. In this approach, constraints on (numeric) concrete domains are transformed into a hierarchy of concrete domain classes which are then used to compute subsumption on the whole domain knowledge base. Preliminary results showed how this “inference fusion” mechanism can successfully extend the automated reasoning capabilities of a DL engine. Because the inferences provided by the constraint satisfaction reasoner are in no way complete, this same approach shows that incomplete deductions are nonetheless useful to derive meaningful (although potentially incomplete) analysis results. ConcepTool also provides an environment where different domain knowledge bodies can share the lexical namespace of their concept properties (attributes, associative links, partonomic links). Synonyms (e.g. last name and surname) are first proposed by lexical databases and then validated by ConcepTool on the basis of their mutual structural features. Full synonymity between two properties with lexical synonymity is rejected if their structural features (e.g. super-properties, inverse) are different. This approach, where distinct domain knowledge bases can share the same project space, and possibly the namespace associated to the project, explicitly supports both cooperative domain knowledge development and the alignment of existing domain knowledge bodies.
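The user-driven combination of inference services described above can be pictured with the following sketch, in which each service proposes inferences and the analyst accepts or rejects them; the service names and the accept callback are illustrative assumptions rather than ConcepTool's interface.

```python
from typing import Callable, Dict, List, Set

Inference = str
Service = Callable[[Set[str]], List[Inference]]

def analyse(knowledge: Set[str], services: Dict[str, Service],
            sequence: List[str], accept: Callable[[Inference], bool]) -> Set[str]:
    for name in sequence:                 # user-defined order of invocation
        for proposal in services[name](knowledge):
            if accept(proposal):          # the analyst decides per proposal
                knowledge.add(proposal)
    return knowledge

# Stub services standing in for the epistemological, conceptual, and lexical levels.
services = {
    "epistemological": lambda kb: ["Car subsumed-by Vehicle"],
    "conceptual":      lambda kb: ["inconsistent: Bicycle"],
    "lexical":         lambda kb: ["synonym: surname ~ last name"],
}
result = analyse(set(), services, ["lexical", "epistemological", "conceptual"],
                 accept=lambda p: not p.startswith("inconsistent"))
print(result)
```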
4
Discussion and Future Work
ConcepTool introduces three major novelties with respect to existing intelligent knowledge management environments. Firstly, its expressive knowledge model contains semantic components, such as individuals, which are neither jointly nor separately available in other models. Secondly, ConcepTool performs deductions at the conceptual level, thus providing correct analysis results for non-standard concepts like associations and functions. It also allows to ignore selectable properties and restrictions while reasoning. Moreover, it complements conceptual deductions with both lexical and heuristic inferences of different kinds. Thirdly, the ConcepTool interface explicitly supports cooperative domain knowledge development by providing an environment where new knowledge bodies can be directly linked to (and later decoupled from) existing ontologies or other reference knowledge bases.
The modelling and analysis functionalities provided in ConcepTool have been successfully tested using a number of sizeable domain ontologies. A preliminary version that demonstrates some relevant features of our system can be downloaded from http://www.csd.abdn.ac.uk/research/IKM/ConcepTool/. Future releases are expected to include (i) functionalities to import from / export to frame-based and DL-based environments using XML and RDF(S), (ii) a more general version of the articulation functionality already available in a previous version of ConcepTool that uses an entity-relationship knowledge model [5], and (iii) inference fusion for approximate reasoning with complex constraints [11].
Acknowledgements This work is supported by the British Engineering & Physical Sciences Research Council (EPSRC) under grants GR/R10127/01 and GR/N15764 (IRC in AKT).
References
[1] F. Baader et al., editors. The Description Logic Handbook. Cambridge University Press, 2003.
[2] S. Bechhofer et al. OilEd: a Reason-able Ontology Editor for the Semantic Web. In Proc. of the Joint German/Austrian Conf. on Artificial Intelligence (KI'2001), number 2174 in Lecture Notes in Computer Science, pages 396–408. Springer-Verlag, 2001.
[3] R. J. Brachman et al. Reducing CLASSIC to practice: knowledge representation meets reality. Artificial Intelligence, 114:203–237, 1999.
[4] E. Compatangelo, F. M. Donini, and G. Rumolo. Engineering of KR-based support systems for conceptual modelling and analysis. In Information Modelling and Knowledge Bases X, pages 115–131. IOS Press, 1999. Revised version of a paper published in the Proc. of the 8th European-Japanese Conf. on Information Modelling and Knowledge Bases.
[5] E. Compatangelo and H. Meisel. EER-ConcepTool: a “reasonable” environment for schema and ontology sharing. In Proc. of the 14th IEEE Intl. Conf. on Tools with Artificial Intelligence (ICTAI'2002), pages 527–534, 2002.
[6] A. Duineveld et al. Wondertools? A comparative study of ontological engineering tools. In Proc. of the 12th Banff Knowledge Acquisition for Knowledge-based Systems Workshop (KAW'99), 1999.
[7] R. Fikes and A. Farquhar. Distributed Repositories of Highly Expressive Reusable Ontologies. Intelligent Systems, 14(2):73–79, 1999.
[8] E. Franconi and G. Ng. The i•com Tool for Intelligent Conceptual Modelling. In Proc. of the 7th Intl. Workshop on Knowledge Representation meets Databases (KRDB'00), 2000.
[9] V. Haarslev and R. Möller. RACER System Description. In Proc. of the Intl. Joint Conf. on Automated Reasoning (IJCAR'2001), pages 701–706, 2001.
[10] I. Horrocks. The FaCT system. In Proc. of the Intl. Conf. on Automated Reasoning with Analytic Tableaux and Related Methods (Tableaux'98), number 1397 in Lecture Notes in Artificial Intelligence, pages 307–312. Springer-Verlag, 1998.
[11] B. Hu, I. Arana, and E. Compatangelo. Facilitating taxonomic reasoning with inference fusion. Knowledge-Based Systems, to appear. Revised version of the paper published in the Proc. of the 22nd Intl. Conf. of the British Computer Society's Specialist Group on Artificial Intelligence (ES'2002).
[12] R. MacGregor. A description classifier for the predicate calculus. In Proc. of the 12th Nat. Conf. on Artificial Intelligence (AAAI'94), pages 213–220, 1994.
[13] G. A. Miller. WordNet: a Lexical Database for English. Comm. of the ACM, 38(11):39–41, 1995.
[14] H. Stuckenschmidt, F. van Harmelen, and P. Groot. Approximate terminological reasoning for the semantic web. In Proc. of the 1st Intl. Semantic Web Conference (ISWC'2002), 2002.
[15] The Advanced Knowledge Technologies (AKT) Consortium. The AKT Manifesto, Sept. 2001. http://www.aktors.org/publications
Combining Revision Production Rules and Description Logics Chan Le Duc and Nhan Le Thanh Laboratoire I3S, Université de Nice - Sophia Antipolis, France
[email protected] [email protected]
Abstract. The knowledge of any application area is heterogeneous and evolves over time. Therefore, a combination of several formalisms, on which revision operations are defined, is required for the knowledge representation of such a hybrid system. In this paper, we introduce several revision policies for DL-based knowledge bases and show how revision operators are computed for the DL language ALE. From this, we define revision production rules, whose consequent is a revision operator. Such rules allow us to represent context rules in the translation between DL-based ontologies. Finally, we introduce the formal semantics of the formalism combining revision production rules and Description Logics (ALE).
1
Introduction
Description Logics can be used as a formalism for designing the ontologies of an application domain. In order to reduce the size of ontologies, different ontologies for different subdomains or user profiles are derived from a shared common ontology. The shared concepts in the common ontology can be redefined in a derived ontology with the aim of fitting an adaptable context. This is necessary to determine a more precise meaning of a shared concept in the context of the current users. The redefinition can be performed at run time owing to context rules. The antecedent of a context rule is a predicate which describes a context condition, and its consequent is an operation which translates a concept definition into another. For this reason, we need another formalism to capture the semantics of such ontologies. It consists of specific production rules, namely revision production rules, whose consequent is an operation that changes a shared concept definition. Previous hybrid languages, for example CARIN [4], combine Horn rules and Description Logics to obtain a formalism whose expressiveness is significantly improved. However, it seems that these languages cannot capture context rules, since the consequent of a Horn rule is still a predicate. On the other hand, since the definition of a concept in the knowledge base can be modified by the application of a revision production rule, revision operations must be defined on a DL-based terminology. Previous works on how to define revision operations have concentrated on the operations TELL (to add a concept description fragment to a concept definition), FORGET (to delete a concept description fragment
from the concept definition), and on simple languages which do not permit existential restrictions [1]. Moreover, operation FORGET requires that a deleted fragment must be a part of the concept definition to be revised. Thus, the main contributions of this paper are, first, to propose several revision policies some of which require the definition of a new operation, namely PROPAG, in order to conserve subsumption relationships. Second, we propose to extend revision operations to language ALE. This extension will allow us to define the semantics of a formalism combining revision production rules and language ALE.
2 Knowledge Base Revision and Revision Policies
2.1 Preliminaries
Revision Operators Informally, TELL is interpreted as an operation which adds knowledge to the knowledge base ∆, whereas FORGET deletes knowledge from ∆. If we use a DL language that allows us both to normalize concept descriptions and to rewrite concept definitions as conjunctions of simple concepts, the revision operators can be written more formally as follows:
– TELL(C, Expr) := C ⊓ Expr
– FORGET(C ⊓ Expr, Expr) := C
where C is the concept to be revised and Expr is a simple concept. In order to have reasonable definitions of the revision operators, it is paramount that several constraints be respected. However, if all of these constraints are to be respected, certain problems are inevitable. In the following subsection, we attempt to identify some of the important constraints for the revision operators.
Criteria for Revision The majority of these criteria are proposed and discussed in [1]. We only recall and reformulate some of them for the purposes of this paper.
1. Adequacy of the Request Language. The request language should be interpretable in the formalism the KB uses.
2. Closure. Any revision operation should lead to a state of the KB representable by the formalism used.
3. Success. This criterion requires that any revision operation must be successful, i.e., after a TELL the new knowledge (Expr) must be derivable from the KB, and after a FORGET the deleted knowledge (Expr) must no longer be derivable from the KB. More formally, there exists an interpretation I and an individual a^I in the domain O^I such that a^I ∈ C^I and a^I ∉ (TELL(C, Expr))^I. Similarly, there exists an interpretation I and an individual a^I in the domain O^I such that a^I ∉ C^I and a^I ∈ (FORGET(C, Expr))^I.
4. Minimal Change. This criterion requires that revision operations cause a minimal change of the KB. Obviously, this requires a definition of the distance between two KBs. Therefore, a pragmatic consideration is necessary in each real application.
5. Subsumption Conservation. The terminological subsumption hierarchy of the KB should be conserved by any revision operation. A priori, criteria 4 and 5 are difficult to guarantee simultaneously since subsumption conservations may change the majority of the hierarchy. 2.2
Revision Policies
In the previous subsection, we introduced the general principles for revision of a DL-based knowledge base. In certain cases there may exist a principle that we cannot entirely adhere to. We now present pragmatic approaches in which some of these criteria are respected while the others are not. In order to facilitate the user's choice of policy, we attempt to identify the advantages and disadvantages of each. The first policy takes into account criterion 5 (Subsumption Conservation), whereas the second policy allows us to change the subsumption hierarchy. The third policy can be considered as a compromise between the first and the second.
Conservative Policy This policy allows us to modify the definition of a concept C (its concept description) by adding a fragment to, or deleting a fragment from, the definition. Concept C and all concepts defined via C then have a new definition. However, the modification of C should not violate any subsumption relationship; otherwise, no operation is performed. By that we mean: if A ⊑ B holds before the modification for some concepts A, B in the Tbox, then A ⊑ B still holds after the modification. This requirement is in accordance with the viewpoint in which the subsumption relationships are considered as invariant knowledge of the KB. This condition is clearly compatible with the object-oriented database model, provided that the hierarchical structure of classes is persistent, although class definitions may be changed: in this model, each time a class definition is changed, all existing instances of the old class must remain instances of the new class. In fact, the evolution of our understanding of the world leads to more precise definitions. In other words, we need to change the definition of a concept C while all subsumption relationships and all assertions C(a) in the Abox remain valid. This can lead to some incompatibilities between the considered criteria for definitions of inferences in the open world. Indeed, suppose that revision operator TELL is defined as follows: TELL(C, Expr) := C ⊓ Expr. According to the assertion conservation proposed by this policy, if C^I(a^I) holds then (TELL(C, Expr))^I(a^I) holds as well for every interpretation I. This means that C ⊑ TELL(C, Expr). Hence C ≡ TELL(C, Expr), since TELL(C, Expr) ⊑ C holds for every interpretation I. This contradicts criterion 3 (success). In order to avoid this incompatibility, the addition of an epistemic operator can be required [5]. We now try to provide a framework for the revision operators. Suppose that the Tbox is acyclic and does not contain general inclusions. Since the conservation of subsumption relationships is required in this policy, a modification of C may be propagated throughout the Tbox. In the following, we investigate this propagation
for each operator TELL and FORGET. The definitions of TELL, FORGET and PROPAG with respect to criteria 3 and 4 (success and minimal change) for a concrete language are investigated in the next section.
TELL(C, Expr). Obviously, this operation makes a concept D more specific where D contains concept C in its definition. On the other hand, concepts not containing concept C may also have to be modified because of the conservation of the subsumption relationships. The unfolded definition of a concept D containing C is written as D̃(C), and the unfolded definition of a D not containing C is written as D̃(C̄). D̃(C) and D̃(C̄) can be obtained by replacing all concept names by their definitions, except concept C. Let E be a direct super concept of F in the hierarchy. There are two cases: i) if E = Ẽ(C̄), then the relationship F ⊑ E is automatically conserved, since E is not changed and F either is not changed or becomes more specific; ii) if E = Ẽ(C), whether F = F̃(C̄) or F = F̃(C), we need a new operator, namely PROPAG_TELL, that allows us to compute a new definition of F such that the subsumption relationship is conserved. Operator PROPAG_TELL should be defined with respect to criterion 4 in some way.
FORGET(C, Expr). This operator is, in fact, computed via another operator, namely MINUS, which can be interpreted as a deletion of Expr from C. In the case where C = C′ ⊓ Expr, we must have MINUS(C, Expr) = C′. As with TELL, the modification made by FORGET may be propagated throughout the Tbox.
Revision of Literal Definition This revision policy has been proposed in [1]. Its main idea is to allow us to modify the definition of a concept without taking subsumption relationships into account. In addition, it requires revising the Abox with respect to the new Tbox, because a modification in the Tbox may cause inconsistencies in the Abox. This policy is convenient when we need to correct a misunderstanding in terminology design: the individuals of a “bad” definition now belong to a better definition. In fact, in this approach concept definitions are considered as literal knowledge, while subsumption relationships are derived from these definitions. As a result, a reclassification of the concept hierarchy can be required after a revision. Note that this policy does not apply operator PROPAG, since subsumption conservation is omitted. Operators TELL(C, Expr) and FORGET(C, Expr) are defined owing to the operator MINUS described above. Revision on Abox. An important task in this policy is to revise the Abox when the Tbox is changed. All assertions D(a), where D is defined directly or indirectly via C, must be revised. It is clear that operator FORGET does not cause any inconsistency in the Abox, since if F(C)(a) holds, where F(C) is a concept description via C, then FORGET(F(C), Expr)(a) holds as well. In contrast, there may be an inconsistency in the Abox after TELL(C, Expr). In this case, the revision on the Abox requires an inference based on the added part Expr: in particular, D(a) holds, for a concept D = F(C) ≠ C, iff Expr(a) holds.
Mixed Policy As presented above, this third policy, the mixed policy, can be considered as a compromise between the two policies just presented. It allows us to modify the definition of a concept while respecting all subsumption relationships. Moreover, following a revision on the Tbox, a revision on the Abox is necessary. Technically, this policy does not require any additional operator in comparison with the two policies above.
3 Revision Operators for Language ALE
3.1 Preliminaries
Let NC be a set of primitive concepts and NR a set of primitive roles. The language ALE uses the following constructors to build concept descriptions: conjunction (C ⊓ D), value restriction (∀r.C), existential restriction (∃r.C), primitive negation (¬P), and the concepts TOP (⊤) and BOTTOM (⊥). Let ∆ be a non-empty set of individuals, and let .I be a function that maps each primitive concept P ∈ NC to P^I ⊆ ∆ and each primitive role r ∈ NR to r^I ⊆ ∆ × ∆. The semantics of a concept description is defined inductively with respect to the interpretation I = (∆, .I) in the table below.
Syntax              Semantics
⊤                   ∆
⊥                   ∅
C ⊓ D               C^I ∩ D^I
∀r.C, r ∈ NR        {x ∈ ∆ | ∀y: (x, y) ∈ r^I → y ∈ C^I}
∃r.C, r ∈ NR        {x ∈ ∆ | ∃y: (x, y) ∈ r^I ∧ y ∈ C^I}
¬P, P ∈ NC          ∆ \ P^I
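The table above can be read operationally, as in the following sketch, which evaluates the extension C^I of an ALE concept description over a finite interpretation; the tuple encoding of descriptions is an assumption introduced for the example.

```python
def extension(c, delta, prim, role):
    """C^I for a finite interpretation: prim maps names to sets, role maps names to sets of pairs."""
    if c == 'TOP':    return set(delta)
    if c == 'BOTTOM': return set()
    op = c[0]
    if op == 'prim':  return set(prim.get(c[1], set()))
    if op == 'not':   return set(delta) - prim.get(c[1], set())   # primitive negation only
    if op == 'and':   return extension(c[1], delta, prim, role) & extension(c[2], delta, prim, role)
    succ = lambda x: {y for (a, y) in role.get(c[1], set()) if a == x}
    if op == 'forall':
        cc = extension(c[2], delta, prim, role)
        return {x for x in delta if succ(x) <= cc}
    if op == 'exists':
        cc = extension(c[2], delta, prim, role)
        return {x for x in delta if succ(x) & cc}
    raise ValueError(c)

# Extension of ∃r.(P ⊓ Q) over a two-element domain:
delta = {1, 2}
print(extension(('exists', 'r', ('and', ('prim', 'P'), ('prim', 'Q'))),
                delta, {'P': {2}, 'Q': {2}}, {'r': {(1, 2)}}))   # {1}
```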
Least Common Subsumer (LCS) [8] Subsumption Let C, D be concept descriptions. D subsumes C, written C ⊑ D, iff C^I ⊆ D^I for all interpretations I. Least Common Subsumer Let C1, C2 be concept descriptions in a DL language. A concept description C is a least common subsumer of C1 and C2 (for short, lcs(C1, C2)) iff i) Ci ⊑ C for all i, and ii) C is the least concept description with this property, i.e., if C′ is a concept description such that Ci ⊑ C′ for all i, then C ⊑ C′. According to [8], the lcs of two or more descriptions always exists for the language ALE, and the authors proposed an exponential algorithm for computing it.
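For intuition, the following sketch computes the lcs for the restricted ∃/⊓ fragment only (no value restrictions or primitive negation), where the lcs of two description trees is simply their product; the tree encoding is an assumption, and the full ALE algorithm of [8] is more involved.

```python
def node(labels, edges=()):
    return {'labels': set(labels), 'edges': list(edges)}

def lcs(c, d):
    """Product of two ∃/⊓ description trees: intersect labels, pair up equal-role edges."""
    children = []
    for r1, sub1 in c['edges']:
        for r2, sub2 in d['edges']:
            if r1 == r2:
                children.append((r1, lcs(sub1, sub2)))
    return {'labels': c['labels'] & d['labels'], 'edges': children}

# lcs(∃r.(P ⊓ Q), ∃r.(Q ⊓ R)) = ∃r.Q
C = node([], [('r', node(['P', 'Q']))])
D = node([], [('r', node(['Q', 'R']))])
print(lcs(C, D))   # one r-edge whose child carries only the shared label Q
```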
Language ALE and Structural Subsumption Characterizing The idea underlying the structural characterization of subsumption proposed in [8] is to transform concept descriptions into a normal form. This normal form allows one to build a description tree whose nodes are labeled by sets of primitive concepts. The edges of this tree correspond to value restrictions (∀r-edges) and existential restrictions (r-edges). Hence, the subsumption relationship between two concept descriptions can be translated into the existence of a homomorphism between the two corresponding description trees. The following theorem results from this idea. Theorem for structural subsumption characterization [8] Let C, D be ALE-concept descriptions. Then C ⊑ D if and only if there exists a homomorphism from G_D to G_C. Simple Concept Descriptions in Tbox In a Tbox, a concept definition can depend on other concept definitions. This dependence must be conserved by revision operations. For this reason, we cannot define a simple concept as a conjunct of the unfolded form. In what follows, we propose a definition of simple concepts which fits this idea. An ALE-concept description is called simple if it does not contain any conjunction on the top level. A concept description is called normal if it is a conjunction of simple concept descriptions. We denote by S_C the set of simple concept descriptions of a concept C in a Tbox. Note that a simple concept can be considered part of the meaning of a concept. As a consequence, the sets of simple concepts of equivalent concepts can be different. A detailed discussion of this subject is given in [1].
3.2
Operator TELL
Operator TELL Let C1 ⊓ ... ⊓ Cn be the normal form of C and let Expr be a simple concept. The operator TELL is defined as follows: If C ⊑ Expr, then TELL(C, Expr) := C. Otherwise, TELL(C, Expr) := C ⊓ Expr and S_D = S_C ∪ {Expr}. If the knowledge to be added is derivable from C before the execution of TELL, i.e., C ⊑ Expr, nothing more remains to be done and the KB is not changed. Otherwise, we need to guarantee that the added knowledge is derivable from C after the execution of TELL. In fact, since TELL(C, Expr) = C ⊓ Expr, we have TELL(C, Expr) ⊑ Expr. Moreover, if C ⋢ Expr, this definition of operator TELL satisfies criterion 3 (success). Criterion 4 (minimal change) is semantically satisfied, since C ⊓ Expr is the most general concept description that is subsumed by both C and Expr. Operator PROPAG_TELL We will use the notions presented in Section 2.2. In addition, denote by E(C, D) and Ẽ(C, D) the original and unfolded forms of E in which each occurrence of C is replaced by D. Let E be a direct super concept of a concept F
F in the hierarchy. Operator PROPAG TELL(E, F, Expr ) will compute a new concept description of F while taking into account criterion 4 (minimal change) and F E. ˜ ). We denote S 0 (C ) as a set of all sE ∈ SE such Assume that E = E(C E 1 ¯ that sE depends on C, denoted by sE (C ), and SE (C) as a set of all sE ∈ SE ¯ We have : such that sE does not depend on C, denoted by sE (C). ˜ PROPAG TELL(E, F, Expr ):=F (C, C Expr ) sE ∈SE0 (C) sE (C, C Expr ) As a consequence, PROPAG TELL(E, F, Expr ) E. Note that PROPAG TELL(E, F, Expr ) is the most concept description ˜ which is less than F˜ (C, C Expr ) and E(C, C Expr ).
Example 1. Let T1 = { C := ∃r.Q, E := ∃r.C, F := ∃r.∃r.(Q ⊓ R) } and Expr := ∃r.P, where P, Q, R ∈ NC and r ∈ NR. We have F ⊑ E, E = Ẽ(C), and F = F̃(C̄); S_E = {∃r.C}, S_F = {∃r.∃r.(Q ⊓ R)}. Note that F̃(C, C ⊓ Expr) ⋢ Ẽ(C, C ⊓ Expr), since Ẽ(C, C ⊓ Expr) = ∃r.(∃r.Q ⊓ ∃r.P) and F̃(C, C ⊓ Expr) = ∃r.∃r.(Q ⊓ R). We have PROPAG_TELL(E, F, Expr) = ∃r.∃r.(Q ⊓ R) ⊓ ∃r.(C ⊓ ∃r.P) and S_PROPAG_TELL(E,F,Expr) = {∃r.∃r.(Q ⊓ R), ∃r.(C ⊓ ∃r.P)}.
3.3
Operator FORGET
The operator FORGET(C, Expr) defined here also allows Expr ∉ S_C. This extension is necessary because in some cases we need to forget a term Expr from C where C ⊑ Expr and Expr ∉ S_C: such knowledge can be derived from C. Operator MINUS between Two ALE-Description Trees Let C, D be propagated normal ALE-concept descriptions with C ⊑ D. MINUS(C, D) is defined as the remainder of the description tree G_C after the deletion of the image ϕ(G_D) in G_C for all homomorphisms ϕ from G_D to G_C. More formally, let G_C = (N_C, E_C, n_0, l_C) and G_D = (N_D, E_D, m_0, l_D). Then MINUS(C, D) = (N_C, E_C, n_0, l′_C), where l′_C(ϕ(m)) = l_C(ϕ(m)) \ l_D(m) for all m ∈ N_D and all ϕ. Example 2. Let E, F be the concept descriptions of Example 1. We have MINUS(F, E) = ∃r.∃r.R.
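The following sketch illustrates the MINUS operation on description trees under an assumed representation (nodes with a label set and typed edges); it enumerates homomorphisms as defined above and erases the mapped labels, and it presupposes that both descriptions are already in propagated normal form.

```python
def node(labels, edges=()):
    return {'labels': set(labels), 'edges': list(edges)}

def homomorphisms(d, c):
    """All ways to map the tree rooted at d into the tree rooted at c:
    d's labels must be contained in the labels of its image, edges must match in kind and role."""
    if not d['labels'] <= c['labels']:
        return []
    maps = [[(d, c)]]
    for kind, role, child in d['edges']:
        targets = [cc for k, r, cc in c['edges'] if k == kind and r == role]
        extended = []
        for m in maps:
            for t in targets:
                for sub in homomorphisms(child, t):
                    extended.append(m + sub)
        maps = extended
    return maps

def minus(c_tree, d_tree):
    """Erase, for every homomorphism, the labels of each D-node from its image in C."""
    for hom in homomorphisms(d_tree, c_tree):
        for d_node, c_node in hom:
            c_node['labels'] -= d_node['labels']
    return c_tree

# Example 2 above, with E unfolded: F = ∃r.∃r.(Q ⊓ R), E = ∃r.∃r.Q; MINUS(F, E) = ∃r.∃r.R.
F = node([], [('exists', 'r', node([], [('exists', 'r', node(['Q', 'R']))]))])
E = node([], [('exists', 'r', node([], [('exists', 'r', node(['Q']))]))])
print(minus(F, E))  # the innermost label set becomes {'R'}
```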
Proposition 3.1 Let C be a propagated normal ALE-concept description, and let D be a propagated normal ALE-simple concept description with D = D1 ⊓ ... ⊓ Dn and C ⊑ D. If C = X ⊓ D and X ⋢ Di for all i, then MINUS(C, D) ≡ X. A proof of Proposition 3.1 can be found in [9]. We denote by S⁰_{C−D} the set of all s_C ∈ S_C such that s_C ⋢ Di for all i, and by S¹_{C−D} the set of the remainders, after the operation MINUS, of the s_C ∉ S⁰_{C−D}. Proposition 3.1 guarantees that the s_C ∈ S⁰_{C−D} are not changed. We have S_MINUS(C,D) = S⁰_{C−D} ∪ S¹_{C−D}.
We can now define operator FORGET. Let C1 ⊓ ... ⊓ Cn be the normal form of C and let Expr be a simple concept. Operator FORGET is defined as follows: If C ⋢ Expr, then FORGET(C, Expr) := C. Otherwise, FORGET(C, Expr) := MINUS(C, Expr) and S_FORGET(C,Expr) = S⁰_{C−Expr} ∪ S¹_{C−Expr}. If C ⋢ Expr, we may leave out the fragment Expr, as it does not contribute to the definition of C. In this case there is nothing to do, since the knowledge to be forgotten is not derivable from C: indeed, there exists an interpretation I and an individual a such that a^I ∈ C^I and a^I ∉ Expr^I. By that we mean that the knowledge Expr is not derivable from concept C following the execution of operation FORGET. Therefore, the operator defined in this way satisfies criterion 3 (success). If C ⊑ Expr, operator MINUS ensures the unique existence of the resulting concept description; hence operator FORGET is well defined. Moreover, operator MINUS guarantees criteria 3 and 4 for operator FORGET owing to Proposition 3.1 [9]. Note that if C = X ⊓ Expr, then FORGET(C, Expr) = X. Operator PROPAG_FORGET Let E be a direct super concept of a concept F in the hierarchy. Operator PROPAG_FORGET(E, F, Expr) computes a new concept description of E while taking into account criterion 4 (minimal change). Computing the lcs is therefore necessary for this aim. Assume that F = F̃(C). We define:
PROPAG_FORGET(E, F, Expr) := lcs{Ẽ(C, MINUS(C, Expr)), F̃(C, MINUS(C, Expr))}
Computed in this way, PROPAG_FORGET(E, F, Expr) is the least concept description that subsumes both Ẽ(C, MINUS(C, Expr)) and F̃(C, MINUS(C, Expr)). We now compute S_PROPAG_FORGET(E,F,Expr) according to the definition above. For brevity, write E1 = Ẽ(C, MINUS(C, Expr)) and E2 = PROPAG_FORGET(E, F, Expr). Since E1 ⊑ E2, we can define the intersection of all images ϕ(G_E2) in G_E1: K = ∩_ϕ ϕ(G_E2), where ϕ ranges over the homomorphisms from G_E2 to G_E1. We denote by S⁰_{E1} the set of all s_E1 ∈ S_E1 such that G_{s_E1} ⊆ K, and by S¹_{E2} the set of all s_E2 ∈ S_E2 such that ϕ(G_{s_E2}) ⊈ K for some homomorphism ϕ from G_E2 to G_E1. We define S′_E := S⁰_{E1} ∪ S¹_{E2}. It is not difficult to check that ⊓_{s ∈ S′_E} s ≡ PROPAG_FORGET(E, F, Expr).
Example 3. Let T1 = { C := ∃r.(P ⊓ Q) ⊓ ∀r.R, E := ∃r.S ⊓ ∃r.∃r.(P ⊓ Q), F := ∃r.S ⊓ ∃r.C } and Expr := ∃r.P where P, Q, R, S ∈ N_C, r ∈ N_R.
We have F ⊑ E and F = F̃(C), E = Ẽ(C); S_E = {∃r.S, ∃r.∃r.(P ⊓ Q)}, S_F = {∃r.S, ∃r.C}, MINUS(C, Expr) = ∃r.Q ⊓ ∀r.R.
Note that F̃(C, MINUS(C, Expr)) ⋢ Ẽ(C, MINUS(C, Expr)) since F̃(C, MINUS(C, Expr)) = ∃r.S ⊓ ∃r.(∃r.Q ⊓ ∀r.R) and Ẽ(C, MINUS(C, Expr)) = ∃r.S ⊓ ∃r.∃r.(P ⊓ Q).
We have PROPAG FORGET(E, F, Expr) = ∃r.S ⊓ ∃r.∃r.Q; K = {∃r.S ⊓ ∃r.∃r.Q}; S^0_{E_1} = {∃r.S, ∃r.∃r.Q} and S^1_{E_2} = ∅.
As a consequence, S_{PROPAG FORGET(E,F,Expr)} = {∃r.S, ∃r.∃r.Q}.

3.4 Revision on Abox
In the previous subsections, a concept is considered as a conjunction of simple concepts, and the revision operators on the Tbox add or delete terms to/from these sets. Let E be a modified concept definition in the Tbox. Let S_E and S'_E be the sets of simple concepts of E, respectively before and after a revision on the Tbox. Let I be a model of the KB. For each assertion E^I(a^I), revision on the Abox performs a check of all assertions s^I(a^I) where s ∈ S'_E \ S_E. If there is some s^I(a^I) which is not verified, E^I(a^I) becomes inconsistent and it is deleted from the Abox. Otherwise, if every s^I(a^I) is verified, the assertion E^I(a^I) is conserved in the Abox.
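A minimal sketch of this Abox check, assuming assertions are stored as (concept, individual) pairs and that verification of a simple concept for an individual is delegated to a hypothetical `holds` callback (e.g. an external reasoner):

```python
def revise_abox(abox, E, S_E_before, S_E_after, holds):
    """Keep an assertion E(a) only if every newly required simple concept
    s in S'_E \\ S_E is also verified for a; otherwise delete E(a)."""
    new_simple_concepts = set(S_E_after) - set(S_E_before)
    revised = []
    for (concept, individual) in abox:
        if concept == E and not all(holds(s, individual)
                                    for s in new_simple_concepts):
            continue                      # E(a) became inconsistent: drop it
        revised.append((concept, individual))
    return revised
```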
4 Revision Production Rules and Description Logics

4.1 Revision Production Rules
The rule defined below is a specific form of production rule, namely the revision production rule. In fact, the consequent of such a rule is simply a revision operator on a terminology. The general form of these rules in a KB ∆ = (∆T, ∆A, ∆PR) is given as follows: for r ∈ ∆PR,
r : p ⇒ TELL(Concept, Expr) or r : p ⇒ FORGET(Concept, Expr)
where p is a predicate formed from assertions over ∆T with the conjunction constructor, p = p_1(X̄_1) ∧ ... ∧ p_n(X̄_n), and the X̄_i are tuples of variables or constants. The predicates p_1, ..., p_n are either concept names, role names or ordinary predicates. Revision operators TELL and FORGET have been defined in section 3. An example which illustrates how to exploit these rules can be found in [9].

4.2 Semantics of Combining Formalism
We can now define the semantics of the formalism combining revision production rules and the description logic ALE, based on the semantics of the revision operators. We define the semantics of the formalism on the assumption that the propagation operators are not taken into account; it would not be too difficult to extend the following definition to a KB with these operators. Let ∆ be a KB composed of three components, ∆ = (∆T, ∆A, ∆PR). An interpretation I is a pair (O, .^I) where O is a non-empty set and the function .^I maps every constant a in ∆ to an object a^I ∈ O. An interpretation I is a model of ∆ if it is a model of each component of ∆. Models of the terminological component ∆T are defined as usual in description logics. An interpretation I is a model of a ground fact p(ā) in ∆A if ā^I ∈ p^I.
An interpretation I is a model of r : p ⇒ TELL(C, Expr) if, whenever the function .^I maps the variables of r to the domain O such that (X̄_i)^I ∈ p_i^I for every atom of the antecedent of r, then
1. if a^I ∈ (C^I ∩ Expr^I) then a^I ∈ (TELL(C, Expr))^I
2. if a^I ∈ C^I and a^I ∉ Expr^I then a^I ∉ (TELL(C, Expr))^I
An interpretation I is a model of r : p ⇒ FORGET(C, Expr) if, whenever the function .^I maps the variables of r to the domain O such that (X̄_i)^I ∈ p_i^I for every atom of the antecedent of r, then
if a^I ∈ C^I then a^I ∈ (FORGET(C, Expr))^I
5 Conclusion and Future Work
We introduced several revision policies and a framework for revision operators on a DL-based knowledge base. In the first policy, the conservation of assertions in the Abox despite modifications of concept definitions in the Tbox focuses our attention on inferences with the epistemic operator in a closed world. However, the interaction between the epistemic operator and the revision operators was not investigated in this paper. Next, we showed how to compute the revision operators for the language ALE. A deeper study of the complexity of these operators is necessary for a possible improvement of the computation. Another question arises concerning the extension of the revision operators to languages allowing for disjunction (ALC). This extension requires a structural characterization of subsumption for these languages. On the other hand, if we allow Horn rules as a third component, which interactions will take place in the combining formalism? In this case, the obtained formalism will be an extension of the language CARIN [4]. Hence, an extension of the inference services of CARIN to the combining formalism deserves investigation.
References
[1] B. Nebel. Reasoning and Revision in Hybrid Representation Systems. Thesis, 1990.
[2] B. Nebel. Terminological reasoning is inherently intractable. Artificial Intelligence, 1990.
[3] M. Schmidt-Schauß, G. Smolka. Attributive concept descriptions with complements. Artificial Intelligence, 1991.
[4] A. Y. Levy and M.-C. Rousset. Combining Horn Rules and Description Logics in CARIN. Artificial Intelligence, Vol 104, 1998.
[5] F. M. Donini, M. Lenzerini, D. Nardi, W. Nutt, A. Schaerf. An Epistemic Operator for Description Logics. Artificial Intelligence, Vol 100, 1998.
[6] F. M. Donini, M. Lenzerini, D. Nardi, A. Schaerf. Reasoning in Description Logics. CSLI Publications, 1997.
[7] R. Küsters. Non-Standard Inferences in Description Logics. Thesis, Springer, 2001.
[8] F. Baader, R. Küsters and R. Molitor. Computing least common subsumers in Description Logics with existential restrictions. Proc. of the 16th Int. Joint Conf. on Artificial Intelligence (IJCAI-99).
[9] C. Le Duc, N. Le Thanh. Combining Revision Production Rules and Description Logics. Technical Report, 2003. See http://www.i3s.unice.fr/~cleduc.
Knowledge Support for Modeling and Simulation
Michal Ševčenko
Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Computer Science
[email protected]
Abstract. This paper presents intermediate results of the KSMSA project (Knowledge Support for Modeling and Simulation Systems). The aim of the project is to apply artificial intelligence and knowledge engineering techniques to provide support for users engaged in modeling and simulation tasks. Currently we are focused on offering search services on top of a knowledge base, supported by a domain ontology.
1 Introduction
System modeling and simulation is an important part of an engineer's work. It is an important tool for, e.g., analyzing the behavior of systems, predicting the behavior of new systems, or diagnosing discrepancies in faulty systems. However, modeling and simulation tasks are highly knowledge-intensive. They require a lot of knowledge directly needed to perform modeling and simulation tasks, as well as a lot of background knowledge, such as mathematics, physics, etc. A great deal of this knowledge is tacit, hard to express, and thus hard to transfer. The knowledge often has a heuristic nature, which complicates implementing it directly in computer systems.

We propose to design a knowledge-based system supporting modeling and simulation tasks. At present, we are focused on offering search services for the users. The prerequisite to our system is the development of a formal model of concepts in the domain of modeling and simulation, or of engineering in general. Such a model is commonly referred to as an ontology. Ontologies basically contain a taxonomy of concepts related to each other by axioms. Axioms are statements that are generally recognized to be true, represented as logical formulas. Ontologies help us to represent, in computer memory, knowledge normally possessed only by humans, and to perform simple reasoning steps that resemble human reasoning. Ontologies also help to structure large knowledge bases, and provide a means to search such knowledge bases with significantly better precision than conventional search engines. An ontology can be accompanied by a lexicon, which provides a mapping between concepts in the ontology and natural language. This allows users to refer to concepts in the ontology using plain English words or phrases, or to automatically process natural language texts and estimate their logical content. We have developed a prototype of the ontology and lexicon for our domain.
The new generation of search engines is based on the concept of metadata. Metadata are information associated with the documents being indexed and searched by search engines, but invisible to the user when a document is normally displayed. They are used to focus the search, thus improving its precision. Metadata that refer to the document as a whole include the name of the document author, the date of the last modification, the application domain, bindings to other documents, etc. This kind of metadata provides additional information about the document that is not explicitly specified in its content. However, metadata can also be associated with parts of documents, such as sections, paragraphs, sentences or even words, often for the purpose of associating them with certain semantics, i.e. meaning. For instance, the string ‘John Smith' can be associated with metadata stating that this string expresses the name of some person. Such metadata in general do not provide additional information to humans, as they understand the meaning of words from the context. However, they help to compensate for the lack of intelligence of indexing and search engines that do not have such capability.

Of course, ontologies and metadata bear on each other. Ontologies provide a means of describing the structure of metadata, so that metadata can be authored in a uniform way. A resource (document) is said to commit to an ontology if its metadata are consistent with that ontology. An indexing engine can then search a group of documents committing to one ontology, and it can provide users with means for focusing their searches according to concepts in that ontology. This scenario demonstrates the need for standardized ontologies that are committed to by many resources, so that ontology-based search engines can index all of them.
2 Our Approach
We propose to design an ontology-based indexing and search engine for engineering applications, and for modeling and simulation in particular. To do this, we need to design an ontology for this domain. This section describes the most fundamental design goals of our search engine; the next section presents an overview of the content of our prototype ontology. The architecture of our knowledge system is quite conventional. It consists of the following components:
– knowledge base
– knowledge base index
– ontology and lexicon
– user's interface (search service)
– administrator's interface
The knowledge base is a collection of documents on the Internet available through the HTTP protocol. These are mostly HTML documents, containing either unrestricted natural language text relevant to our domain (such as electronic textbooks), or semi-structured documents (collection of solved examples,
catalogs of system components, model libraries, etc). Documents in the knowledge base may be provided by different authors. Since the knowledge in these documents is expressed implicitly, it is the user's responsibility to extract and interpret this knowledge. Our system merely retrieves documents that it considers useful to the user. This is done with the help of the metadata stored in the knowledge base index.

The knowledge base index is a central repository that stores metadata associated with documents in the knowledge base. It is used by the search engine to determine documents relevant to users' queries. For semi-structured documents in the knowledge base, metadata have a database-like nature. For example, documents in a collection of solved examples are annotated with attributes like application domain, physical domain, modeling technique, or the simulation engine used. Documents containing unrestricted text may also be annotated with keywords structured according to concepts in the ontology. Since the index is connected to the ontology, which defines mutual couplings between concepts, the search engine can perform simple reasoning steps during searches, such as generalizations, specializations, and natural-language disambiguations.

The design decision that the knowledge base index is centralized is pragmatic. A centralized index is easier to maintain and is more accessible to the search engine that needs it for its operation. One might think that the metadata would be better supplied by authors of the documents and stored directly with these documents. This is the idea behind the initiative called the Semantic Web. However, today's practice is that only few documents on the web are annotated with metadata (at least in the engineering domain), as web-document authors are conservative and are not willing to invest extra effort into the development of their web pages. Our centralized approach allows us to annotate documents developed by third parties.

The user's interface allows users to express their queries and submit them to the search engine. We propose two kinds of search interfaces. For searching for semi-structured documents, an administrator can design an ad-hoc form for entering the values of attributes. This enables users to focus on attributes that are relevant to the group of documents they are interested in. The second kind of interface enables users to search according to keywords. This service is very similar to conventional keyword-based search engines, but thanks to its exploitation of the ontology and lexicon, it should exhibit better precision and recall. Keywords entered by the user are looked up in the lexicon and related to the respective concepts in the ontology. In the case of ambiguity (for polysemous keywords), users are prompted to resolve the ambiguity by selecting a subset of meanings (ontology concepts) corresponding to the keyword. The lexicon can also contain multiple words (synonyms) for a single ontology concept, thus reducing the risk of missing documents that use synonymous words for the requested concepts. These simple techniques partially reduce the impact of natural language ambiguity and ensure better search quality.

The final component of our knowledge server is the administrator's interface. It is an integrated environment for maintaining the ontology, lexicon and
knowledge base index. It contains an ontology and lexicon browser and editor, a document annotation tool, and analytical tools supporting the maintenance and evolution of the knowledge server.
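As an illustration of the keyword-based search interface described in this section, the following sketch shows one way a lexicon lookup with synonym and polysemy handling could work. The lexicon contents, concept names and the `ask_user` disambiguation callback are hypothetical examples, not the project's actual data or API.

```python
# Hypothetical lexicon: each keyword maps to the set of ontology concepts it
# may denote (polysemy); several keywords may map to one concept (synonymy).
LEXICON = {
    "model":   {"EngineeringModel", "FashionModel"},
    "diagram": {"BlockDiagram"},
    "bond":    {"BondGraph", "FinancialBond"},
}

def keywords_to_concepts(keywords, lexicon, ask_user):
    """Map query keywords to ontology concepts, prompting the user to pick a
    meaning whenever a keyword is ambiguous (polysemous)."""
    concepts = set()
    for kw in keywords:
        meanings = lexicon.get(kw.lower(), set())
        if len(meanings) > 1:
            meanings = ask_user(kw, sorted(meanings))  # user resolves ambiguity
        concepts.update(meanings)
    return concepts

# Usage example: a non-interactive stand-in for the disambiguation prompt.
chosen = keywords_to_concepts(
    ["model", "bond"], LEXICON,
    ask_user=lambda kw, options: {options[0]},
)
```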
3 The Ontology
This section presents a brief overview of the ontology that makes up the core of our knowledge system. Our ontology is based on a general-purpose, or upper, ontology called the Suggested Upper Merged Ontology (SUMO) [1]. This implies that our ontology is represented in the Knowledge Interchange Format (KIF). One of the unique features of SUMO is that it is mapped to a large general-purpose lexicon, WordNet [2]. This enables users to refer to concepts in the SUMO ontology using plain English words or phrases. We started to create our domain ontology on top of SUMO, and added corresponding terms to the lexicon by embedding them in the ontology using special predicates. Unlike other engineering ontologies, such as EngMath [3] or PhysSys [4], our ontology is intended to represent basic common-sense concepts from the domain, rather than to formally represent engineering models. Our knowledge base should represent the knowledge about modeling, not the models themselves. Our ontology is divided into several sections. The first section describes a taxonomy of models and modeling methodologies that engineers use to describe modeled systems, e.g. differential equations, block diagrams, multipole diagrams, bond graphs, etc. Although these concepts could be defined quite formally, we decided to provide only a basic categorization accompanied by loose couplings providing informal common-sense navigation among the concepts. The second section describes a taxonomy of engineering components (devices) and processes, i.e. entities that are subject to modeling and simulation. The taxonomy in the current version of our ontology is only a snapshot of the terminology used in engineering, but should provide a solid base for extension by domain experts. The third section of the ontology contains a taxonomy of typical engineering tasks (problems) and methods that can be used to solve them. The last section describes structured metadata that can be associated with individual documents in the knowledge base.
4 Conclusions and Future Directions
In this paper, we have presented our approach to the design of a knowledge system supporting modeling and simulation tasks. We presented only the general idea; more technical details can be found in [5]. We have implemented a first prototype of the whole system, which can be accessed from the project homepage [6]. The system includes all the mentioned components, i.e. the search interface, the knowledge base index, the ontology and lexicon, and the administration tool. In the following paragraphs we propose some future work that should be done within the project.
Evolution of the Ontology, Lexicon and Knowledge Base. Although the current version of the ontology and the knowledge base can already demonstrate some concepts of our approach, it is clear that these resources must be much bigger, both to be useful in practice and to prove that the approach is correct. We must thus dedicate some time to the evolution of these resources—we must add new documents to the knowledge base and amend the ontology and lexicon accordingly. This work should also provide useful feedback for improving our tools for knowledge base maintenance. Enlarging the ontology and knowledge base will also raise some questions about performance and scalability.

Improving the Functionality of Interfaces. The administrator's interface and user's interface should be improved. Especially the administration tool provides wide space for improvement. We plan to add some analytical tools based on information extraction and statistical methods, which should help to evolve and validate the knowledge base. We also plan to develop a tool that will automatically evaluate the performance of our knowledge system based on data recorded by the search interface, i.e. questions asked by users and answers provided by the knowledge system.

Formalizing the Task Knowledge. Besides the static knowledge implicitly contained in the knowledge base, we propose to explicitly formalize the task knowledge required to perform certain classes of typical tasks taking place in the modeling and simulation domain. This task knowledge should provide more direct support for the users than mere access to static knowledge.

Integration with the Existing Environment. Another important goal is to integrate our prototype tools into an existing modeling and simulation environment, DYNAST, which we developed within another project. We will seek new user-interface and human-computer interaction paradigms that can directly exploit the ontology and knowledge structures developed within this project.
References
[1] I. Niles and A. Pease, "Toward a standard upper ontology," in Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), 2001.
[2] "WordNet homepage," http://www.cogsci.princeton.edu/~wn/.
[3] T. R. Gruber and R. G. Olsen, "An ontology for engineering mathematics," in Fourth International Conference on Principles of Knowledge Representation and Reasoning, Bonn, Germany, 1994.
[4] W. N. Borst, Construction of Engineering Ontologies for Knowledge Sharing and Reuse, Ph.D. thesis, University of Twente, September 1997.
[5] Michal Ševčenko, "Intelligent user support for modeling and simulation of dynamic systems," Tech. Rep., Czech Technical University in Prague, January 2003, Postgraduate Study Report DC-PSR-2003-01.
[6] "Knowledge Support for Modeling and Simulation (KSMSA) project homepage," http://virtual.cvut.cz/ksmsa/index.html.
Expert System for Simulating and Predicting Sleep and Alertness Patterns
Udo Trutschel, Rainer Guttkuhn, Anneke Heitmann, Acacia Aguirre, and Martin Moore-Ede
Circadian Technologies, Inc., 24 Hartwell Avenue, Lexington, MA 02421
[email protected]
Abstract. The Expert System “Circadian Alertness Simulator” predicts employee sleep and alertness patterns in 24/7 work environments using rules of complex interactions between human physiology, operational demands and environmental conditions. The system can be tailored to the specific biological characteristics of individuals as well as to specific characteristics of groups of individuals (e.g., transportation employees). This adaptation capability of the system is achieved through a built-in learning module, which allows transferring information from actual data on sleep-wake-work patterns and individual sleep characteristics into an internal knowledge database. The expert system can be used as a fatigue management tool for minimizing the detrimental biological effects of 24-hour operations by providing feedback and advice for work scheduling and/or individual-specific lifestyle training. Consequently, it can help reduce accident risk and human-related costs of the 24/7 economy.
1 Introduction
Expert systems are proven tools for solving complex problems in many areas of everyday life, including fault analysis, medical diagnosis, scheduling, system planning, economic predictions, stock investments, sales predictions, management decisions and many more. Despite the numerous applications, expert systems that allow the prediction of human behaviour are relatively rare. One important application in this area is the prediction of sleep and alertness patterns of individuals and/or industry-specific employee groups exposed to 24/7 operations. The last 40 years have seen the evolution of a non-stop 24/7 (24 hours a day, 7 days a week) world driven by the need to increase productivity, enhance customer service and utilize expensive equipment around the clock. However, humans are biologically designed to sleep at night. Millions of years of biological evolution led to the development of an internal biological clock in the brain (suprachiasmatic nucleus), which generates the human body's circadian (approximately 24 hour) rhythms of sleep and wakefulness, and synchronizes them with day and night using
daylight as a resetting tool (Moore-Ede 1982). In this non-stop 24/7 world, people are constantly exposed to mismatches between the biological clock and the work demands, leading to compromised sleep and alertness, decreased performance and accidents, and an increase in human error-related costs. Circadian Technologies has developed the Expert System “Circadian Alertness Simulator” (CAS) that predicts employee sleep and alertness patterns in a 24/7 work environment. This expert system is based on complex rules and interactions between human physiology (e.g., light sensitivity, sleep pressure, circadian sleep and alertness processes), operational demands (e.g., work schedules) and environmental conditions (e.g., light effects). The system can be tailored to the specific biological characteristics of individuals as well as to specific characteristics of groups of individuals (e.g., specific employee populations). This adaptation capability of the system is achieved through a built-in learning module, which applies the system rules and algorithms to actual data on sleep-wake-work patterns and individual characteristics in a training process and stores the results into an internal knowledge database. The Expert System CAS is used as a fatigue management tool for minimizing the detrimental biological effects of 24-hour operations by providing advice for work scheduling or individual-specific lifestyle training. It can therefore help reduce the human-related costs of the 24/7 economy.
2 Expert System CAS – Structure and Learning Capabilities
The Expert System CAS is based on a general concept with model functions, specific rules and free parameters for storing the information about the light sensitivity, circadian and homeostatic components of sleep/wake behavior and alertness, and other individual features of a person, such as chronotype (morningness/eveningness), habitual sleep duration (long/short sleeper), napping capabilities and sleep flexibility (the individual capability to cope with mismatches between the internal clock and sleeping time constraints). The knowledge base of the system represents a complex interaction of model functions, free parameters and rules. The most important features are described below. A two-process model of sleep and wake regulation assumes an interaction of a homeostatic component (modeled by exponential functions and two free parameters) and a circadian component (modeled by multiple sine functions and ten free parameters) based on the following rules: (1) If ‘the homeostatic factor reaches the upper circadian threshold' then ‘the system switches from sleep to wake', and (2) If ‘the homeostatic factor reaches the lower circadian threshold' then ‘the system switches from wake to sleep', provided that sleep is allowed; otherwise the system always switches to sleep below the lower threshold as soon as sleep is permitted (Daan et al. 1984). The characteristics of the circadian component depend on the complex interaction of biological rhythms, individual light sensitivity (modeled by the Phase Response Curve (PRC)) and environmental light exposure, which is strongly correlated with the 24-hour daily cycle; this can be summarized by the next two rules: (3) If ‘light is applied before the minimum of the circadian component' then ‘the phase of the biological clock is delayed', and (4) If ‘light is applied after the minimum of the circadian component' then ‘the phase of the biological clock is advanced' (Moore-Ede et al. 1982).
As a consequence, the circadian curve is shifted to a new position over the next day, modifying all future sleep/wake behaviour. As a further consequence, the sleep/wake behaviour will determine the timing and intensity of light exposure, expressed by the following rule: (5) If ‘the state of a person is sleep' then ‘the light exposure is blocked', leading to an unchanged phase of the circadian component. The expert knowledge for the sleep and alertness prediction is obtained through a training process, in which the expert system learns from actual data on individual sleep personality, sleep/wake activity and light exposure (Fig. 1).
Fig. 1. Training procedure for the Expert System CAS
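The threshold rules (1) and (2) above can be illustrated with a toy simulation. The sketch below is not the CAS model: the time constants, the single-sine circadian term and the threshold offsets are arbitrary placeholder values standing in for the system's free parameters.

```python
import math

# Toy two-process simulation of rules (1)-(2): the homeostatic factor H
# recovers exponentially during sleep and decays during wake; sleep/wake
# switches occur at circadian upper/lower thresholds. All constants are
# arbitrary placeholders, not the fitted CAS parameters.
DT = 0.1                            # time step in hours
TAU_WAKE, TAU_SLEEP = 18.0, 4.0     # placeholder homeostatic time constants

def circadian(t_hours):
    return 0.1 * math.sin(2 * math.pi * (t_hours - 8.0) / 24.0)

def simulate(hours=72.0, sleep_allowed=lambda t: True):
    t, H, asleep, log = 0.0, 0.7, False, []
    while t < hours:
        upper = 0.8 + circadian(t)      # wake-up threshold (rule 1)
        lower = 0.2 + circadian(t)      # sleep-onset threshold (rule 2)
        if asleep:
            H += (1.0 - H) * DT / TAU_SLEEP      # recovery during sleep
            if H >= upper:
                asleep = False                    # rule (1): sleep -> wake
        else:
            H -= H * DT / TAU_WAKE                # depletion during wake
            if H <= lower and sleep_allowed(t):
                asleep = True                     # rule (2): wake -> sleep
        log.append((round(t, 1), asleep))
        t += DT
    return log
```

Calling, for instance, `simulate(72.0, sleep_allowed=lambda t: not (9 <= t % 24 <= 17))` forbids sleep during a simple day shift, mimicking the "sleep is allowed" clause of rule (2).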
The target of the training is to determine the free parameters that result in the best match between actual and simulated sleep and alertness data. After the training is complete, a specific data set is stored in the knowledge database of the Expert System CAS and can be used to simulate and predict sleep and alertness patterns for given work-rest schedules (Fig. 2).
Fig. 2. Sleep and alertness predictions of the trained Expert System CAS
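The training step can be read as an ordinary parameter-fitting problem: choose the free parameters that minimize the mismatch between recorded and simulated sleep. A hedged sketch, assuming a hypothetical `simulate_sleep(params, schedule)` model and a recorded sleep/wake series, might look as follows; the error measure and the optimizer are illustrative choices, not the CAS training algorithm.

```python
import numpy as np
from scipy.optimize import minimize

def fit_free_parameters(observed, schedule, simulate_sleep, initial_params):
    """Fit the model's free parameters by minimizing the mismatch between
    observed and simulated sleep (here: fraction of time steps on which the
    predicted sleep/wake state disagrees with the recorded one)."""
    observed = np.asarray(observed, dtype=float)   # 1 = asleep, 0 = awake

    def target(params):
        predicted = np.asarray(simulate_sleep(params, schedule), dtype=float)
        return np.mean(np.abs(predicted - observed))

    result = minimize(target, x0=np.asarray(initial_params),
                      method="Nelder-Mead")        # derivative-free search
    return result.x
```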
The simulated and predicted sleep behavior of a person has direct consequences for the alertness level as a function of time, expressed in this paper by two overall, easy-to-understand output parameters: (1) the Fatigue Risk Score (0 = lowest fatigue, 100 = highest fatigue), which quantifies the cumulative fatigue risk, and (2) the Percentage of Work Time below a critical alertness threshold.
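A minimal sketch of how these two summary outputs could be derived from a simulated alertness time series and a work-time mask; the 0-100 scaling and the critical threshold value are assumptions, not the CAS formulas.

```python
def fatigue_outputs(alertness, work_mask, critical=0.3):
    """Compute (1) a 0-100 fatigue risk score and (2) the percentage of work
    time spent below a critical alertness threshold. 'alertness' holds values
    in [0, 1] per time step; 'work_mask' flags time steps spent at work."""
    work_alertness = [a for a, w in zip(alertness, work_mask) if w]
    if not work_alertness:
        return 0.0, 0.0
    mean_fatigue = 1.0 - sum(work_alertness) / len(work_alertness)
    risk_score = 100.0 * mean_fatigue                      # assumed scaling
    below = sum(1 for a in work_alertness if a < critical)
    pct_below_threshold = 100.0 * below / len(work_alertness)
    return risk_score, pct_below_threshold
```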
3 “Short-Lark” and “Long-Owl” Test Cases for Shift Scheduling
To illustrate the capabilities of the Expert System CAS, two artificial people were simulated in a shiftwork environment, working an 8-hour backward-rotating shift schedule (7 working days, 3 shift start times). They are called “Short-Lark” (an extreme morning type with a short habitual sleep time) and “Long-Owl” (an extreme evening type with a long habitual sleep time). Without any external constraints such as work or other blocked times for sleep, the “Short-Lark” sleeps every day from 2200 to 0400 and the “Long-Owl” sleeps every day from 0100 to 1100. The expert system was trained with both sleep patterns, and the results were stored in the knowledge database as person-specific input vectors. The predicted sleep and alertness behavior of our test subjects when working the 8-hour backward-rotating shift schedule, with the assumption that sleep is forbidden during the blue-marked working times, is shown in Fig. 3 and Fig. 4.
Fig. 3. Predicted sleep pattern for “Short-Lark” working 8-hour shift schedule
It is clear from Fig. 3 that the evening and especially the night shift interfere with the normal sleep behavior (black bars) of the “Short-Lark,” creating conflicts with the internal biological clock. This should have severe consequences for the alertness level
during work times (blue bars), as expressed by a high overall Fatigue Risk Score of 72 and 37.4% of work performed at extreme risk times.
Fig. 4. Predicted sleep pattern for “Long-Owl” working 8-hour shift schedule
The sleep pattern of the “Long-Owl” (Fig. 4) is completely different. Here the work schedule interferes with sleep (black bars) only during the day shift, and the disturbed night sleep is recovered immediately after the shift. Therefore, the conflict with the internal biological clock is only modest and reasonable alertness levels during work times are maintained, as expressed by a lower overall Fatigue Risk Score of 57 and only 12.3% of work at extreme risk times.
4 Group-Specific Application in the Transportation Industry
Fatigue-related safety concerns are particularly relevant for safety-critical occupations, such as vehicle operators in the rail, road transport or bus industry. Accidents caused by driver fatigue are often severe, as the drowsy driver may not take evasive action (i.e. brake or steer) to avoid a potential collision. In recognition of this, the U.S. Department of Transportation has identified human fatigue as the “Number One” safety problem, with a cost in excess of $12 billion per year. To address this need, Circadian Technologies, Inc. has, over the past ten years and in partnership with the rail and trucking industry, collected work/rest/sleep and alertness data from over 10,000 days of rail and truck operators in actual operating conditions. Using the training process depicted in Fig. 1, an Expert System “CAS-Transportation” was created and applied to a specific trucking operation. “CAS-Transportation” was used to simulate
sleep and alertness behavior and predict fatigue risk from the work-rest data of truck drivers. The results were correlated with the actual accident rates and costs of a 500-truck road transport operation. Using “CAS-Transportation” as a feedback tool, non-experts (managers, dispatchers) were provided with the relative risk of accidents due to driver fatigue for any planned sequence of driving and resting hours, and were asked to make adjustments to the working hours to minimize the overall Fatigue Risk Score. As a result of the intervention, the mean Fatigue Risk Score of the group was significantly reduced from 46.8 to 28.9 (dark line in Fig. 5a, 5b). The reduction of the Fatigue Risk Score was associated with a reduction in the frequency and severity of accidents (Moore-Ede et al., 2003). The total number of truck accidents dropped 23.3% and the average cost per accident was significantly reduced (65.8%).
Fig. 5a. Fatigue Risk Score before feedback from Expert System CAS-Transportation (histogram of frequency in percent vs. Fatigue Index, 0-100)
Fig. 5b. Fatigue Risk Score after feedback from Expert System CAS-Transportation (histogram of frequency in percent vs. Fatigue Index, 0-100; p < 0.0001)
Using the feedback from the Expert System CAS in a risk-informed, performance-based safety program gives managers and dispatchers incentives to address some of the most important causes of driver fatigue, and therefore of fatigue-related highway accidents. This risk-informed, performance-based approach to fatigue minimization enables non-experts (managers and dispatchers) to make safety-conscious operational decisions while having sufficient flexibility to balance the specific business needs of their operation.
5 Conclusion
The Expert System CAS is able to acquire knowledge about individual sleep personalities or industry-specific groups in order to predict Fatigue Risk Scores and/or alertness patterns for given shift schedule options or real-world work/wake data. Missing sleep patterns can be replaced by simulated sleep patterns with considerable certainty. From a database of many shift schedules, acceptable schedule options can be selected by exposing these options to different sleep personalities or industry-specific groups and predicting the associated fatigue risks. In addition, the simulation capabilities of the Expert System CAS support the decision process during schedule design. The examples (“Short-Lark”, “Long-Owl”, “Transportation Employees”) presented in this paper, as well as the shift schedule design process in general, can be considered as forward-chaining applications of the Expert System CAS. Shift schedule design starts with the operational requirements, investigates the large number of possible solutions for shift durations, shift starting times, direction of rotation and sequence of days ‘on' and days ‘off', and makes the schedule selection based on the fatigue risk and/or alertness criteria specified by the user. In addition, the Expert System CAS can be used in the investigation of a fatigue-related accident. This specific application represents an example of a backward-chaining approach, where fatigue is assumed as a probable cause of the accident and the Expert System CAS attempts to find supporting evidence to verify this assumption.
References
[1] Daan S., Beersma D.G., Borbely A. Timing of human sleep: recovery process gated by a circadian pacemaker. Am. J. Physiology 1984; 246: R161-R178.
[2] Moore-Ede M., Sulzman F., Fuller C. The clocks that time us. Harvard University Press, Cambridge, 1982.
[3] Moore-Ede M., Heitmann A., Dean C., Guttkuhn R., Aguirre A., Trutschel U. Circadian Alertness Simulator for Fatigue Risk Assessment in Transportation: Application to Reduce Frequency and Severity of Truck Accidents. Aviation, Space and Environmental Medicine 2003, in press.
Two Expert Diagnosis Systems for SMEs: From Database-Only Technologies to the Unavoidable Addition of AI Techniques
Sylvain Delisle1 and Josée St-Pierre2
Institut de recherche sur les PME, Laboratoire de recherche sur la performance des entreprises, Université du Québec à Trois-Rivières
1 Département de mathématiques et d'informatique
2 Département des sciences de la gestion
C.P. 500, Trois-Rivières, Québec, Canada, G9A 5H7
Phone: 1-819-376-5011 + 3832, Fax: 1-819-376-5185
{sylvain_delisle,josee_st-pierre}@uqtr.ca
www.uqtr.ca/{~delisle, dsge}
Abstract. In this application-oriented paper, we describe two expert diagnosis systems we have developed for SMEs. Both systems are fully implemented and operational, and both have been put to use on data from actual SMEs. Although both systems are packed with knowledge and expertise on SMEs, neither has been implemented with AI techniques. We explain why and how both systems relate to knowledge-based and expert systems. We also identify aspects of both systems that will benefit from the addition of AI techniques in future developments.
1 Expertise for Small and Medium-Sized Enterprises (SMEs)
The work we describe here takes place within the context of the Research Institute for SMEs—the Institute's core mission (www.uqtr.ca/inrpme/anglais/index.html) is to support fundamental and applied research to foster the advancement of knowledge on SMEs and to contribute to their development. The specific lab in which we have conducted the research projects we refer to in this paper is the LaRePE (LAboratoire de REcherche sur la Performance des Entreprises: www.uqtr.ca/inrpme/larepe/). This lab is mainly concerned with the development of scientific expertise on the study and modeling of SMEs' performance, including a variety of interrelated subjects such as finance, management, information systems, production, technology, etc. The vast majority of research projects carried out at the LaRePE involve both theoretical and practical aspects, often necessitating in-field studies with SMEs. As a result, our research projects always attempt to provide practical solutions to real problems confronting SMEs.
In this application-oriented paper we briefly describe two expert diagnosis systems we have developed for SMEs. Both can be considered as decision support systems—see [15] and [18]. The first is the PDG system [5]: benchmarking software that evaluates production and management activities, and the results of these activities in terms of productivity, profitability, vulnerability and efficiency. The second is the eRisC system [6]: software that helps identify, measure and manage the main risk factors that could compromise the success of SME development projects. Both systems are fully implemented and operational. Moreover, both have been put to use on data from actual SMEs. What is of particular interest here, especially from a knowledge-based systems perspective, is the fact that although both the PDG and the eRisC systems are packed with knowledge and expertise on SMEs, neither has been implemented with Artificial Intelligence (AI) techniques. However, if one looks at them without paying attention to how they have been implemented, they qualify as “black-box” diagnostic expert systems. In the following sections, we provide further details on both systems and how they relate to knowledge-based and expert systems. We also identify aspects of both systems that could benefit from the addition of AI techniques in future developments.
2 The PDG System: SME Performance Diagnostic

2.1 An Overview of the PDG System
The PDG system evaluates a SME from an external perspective and on a comparative basis in order to produce a diagnosis of its performance and potential, complemented with relevant recommendations. Although we usually refer to the PDG system as a diagnostic system, it is in fact a hybrid diagnostic-recommendation system, as it not only identifies the evaluated SME's weaknesses but also makes suggestions on how to address these weaknesses in order to improve the SME's performance. An extensive questionnaire is used to collect relevant information items on the SME to be evaluated. Data extracted from the questionnaire is computerized and fed into the PDG system. The latter performs an evaluation in approximately 3 minutes by contrasting the particular SME with an appropriate group of SMEs for which we have already collected relevant data. The PDG's output is a detailed report in which 28 management practices (concerning human resources management, production systems and organization, market development activities, accounting, finance and control tools), 20 results indicators and 22 general information items are evaluated, leading to 14 recommendations on short-term actions the evaluated SME could undertake to improve its overall performance. As shown in Figure 1, the PDG expert diagnosis system is connected to an Oracle database which collects all the relevant data for benchmarking purposes—the PDG also uses the SAS statistics package, plus Microsoft Excel for various calculations and the generation of the final output report. The PDG reports are constantly monitored by a team of multidisciplinary human experts in order to ensure that the recommendations are valuable for the entrepreneurs. This validation phase, which always takes place before the report is sent to the SME, is an occasion to make further
improvements to the PDG system, whenever appropriate. It is also a valuable means for the human experts to update their own expertise on SMEs. Figure 1 also shows that an intermediary partner is part of the process in order to guarantee confidentiality: nobody in our lab knows which companies the data are associated with.
Fig. 1. The PDG system: evaluation of SMEs, from an external perspective and on a comparative basis, in order to produce a diagnosis of their performance and potential
The current version of the PDG system has been in production for 2 years. So far, we have produced more than 600 reports and accumulated in the database the evaluation results of approximately 400 different manufacturing SMEs. A recent study was made of 307 Canadian manufacturing SMEs that have used the PDG report, including 49 that have done so more than once. Our results show that the PDG's expert benchmarking evaluation allows these organisations to improve their operational performance, confirming the usefulness of benchmarking but also the value of the recommendations included in the PDG report concerning short-term actions to improve management practices [17].

2.2 Some Details on the PDG System
The PDG's expertise is located in two main components: the questionnaire and the benchmarking results interpretation module—in terms of implementation, the PDG uses an Oracle database, the SAS statistical package, and Microsoft Excel. The first version of the questionnaire was developed by a multidisciplinary team of researchers in the following domains: business strategy, human resources, information systems, industrial engineering, logistics, marketing, economics, and finance. The questionnaire development team was faced with two important challenges that quickly became crucial goals: 1) find a common language (a shared ontology) that would allow researchers to understand each other and, at the same time, would be accessible to entrepreneurs when answering the questionnaire, and 2) identify long-term performance indicators for SMEs, as well as problem indicators, while keeping contents to a minimum since in-depth evaluation was not adequate.
The team was able to meet these two goals by assigning a “knowledge integrator” role to the project leader. During the 15-month period of its development, the questionnaire was tested with entrepreneurs in order to ensure that it was easy to understand both in terms of a) contents and question formulation, and b) report layout and information visualization. All texts were written with a clear pedagogical emphasis since the subject matter was not all that trivial and the intended readership was quite varied and heterogeneous. Several prototypes were presented to entrepreneurs, who showed a marked interest in graphics and colours. Below, Figure 2 shows a typical page of the 10-page report produced by the PDG system.
Fig. 2. An excerpt from a typical report produced by the PDG system. The evaluated SME's performance is benchmarked against that of a reference group
The researchers' expertise was precious in the identification of vital information that would allow the PDG system to rapidly produce a general diagnosis of any manufacturing SME. The diagnosis also needed to be reliable and complete, while being
comprehensible by typical entrepreneurs, as we pointed out before. This was pioneering research work that the whole team was conducting. Indeed, other SME diagnosis systems are generally financial and based on valid quantitative data. The knowledge integrator mentioned above played an important part in this information engineering and integration process. Each expert had to identify practices, systems, or tools that had to be implemented in a manufacturing SME to ensure a certain level of performance. Then, performance indicators had to be defined in order to measure to what extent these individual practices, systems, or tools were correctly implemented and allowed the enterprise to meet specific goals—the relationship between practices and results is a distinguishing characteristic of the PDG system. Next, every selected performance indicator was assigned a relative weight by the expert and the knowledge integrator. This weight is used to position the enterprise being diagnosed with regard to its reference group, thus allowing the production of relevant comments and recommendations. The weight is also used to produce a global evaluation that will be displayed in a synoptic table. Contrary to many performance diagnostic tools in which the enterprise's information is compared to norms and standards (e.g. [11]), the PDG system evaluates an enterprise relative to a reference group selected by the entrepreneur. Research conducted at our institute seriously questions this use of norms and standards: it appears to be dubious for SMEs as they simply are too heterogeneous to support the definition of reliable norms and standards.

Performance indicators are implemented as variables in the PDG system—more precisely in its database, and in the benchmarking results interpretation module (within the report production module). These variables are defined in terms of three categories: 1) binary variables, which are associated with yes/no questions; 2) scale variables, which are associated with the relative ranking of the enterprise along a 1 to 4 or a 1 to 5 scale, depending on the question; and 3) continuous (numerical) variables, which are associated with numerical figures such as the export rate or the training budget. Since variables come in different types, they must also be processed differently at the statistical level, notably when computing the reference group used for benchmarking purposes. In order to characterize the reference group with a single value, a central tendency measure that is representative of the reference group's set of observations is used. Depending on the variable category and its statistical distribution, means, medians, or percentages are used in the benchmarking computations. Table 1 shows an example of how the evaluated enterprise's results are ranked and associated with codes that will next be used to produce the various graphics in the benchmarking report. The resulting codes (see CODE in Table 1) indicate the evaluated enterprise's benchmarking result for every performance indicator. They are then used by the report generation module to produce the benchmarking output report, which contains many graphical representations, as well as comments and recommendations. The codes are used to assign colours to the enterprise, while the reference group is always associated with the beige colour. For instance, if the enterprise performs better than its reference group, CODE = 4 means the colour is forest green.
However, in the opposite situation, CODE = 4 would mean the colour is red. Other colours with other meanings are yellow, salmon, and olive green. Figure 2 above illustrates what these coloured graphics look like (although they appear only in black and white here).
Table 1. Some aspects of the representation of expertise within the PDG system, with performance indicators implemented as variables. The table shows three (3) variables: one scale variable (participative management), one binary (remuneration plan), and one continuous numerical (fabrication cost). Legend: SME = variable value for the evaluated enterprise; MEA = mean value of the variable in the reference group; RG = reference group; MED = median value of the variable in the reference group; CODE = resulting code for the evaluated enterprise

Scale variable (example: participative management):
if SME >= (1.25 x MEA), then CODE = 4
if SME >= (1.10 x MEA), then CODE = 3
if SME >= (1.00 x MEA), then CODE = 2
if SME >= (0.90 x MEA), then CODE = 1
if SME >= (0.75 x MEA), then CODE = 0

Binary variable (example: remuneration plan):
if SME = 1 and 10% of RG = 1, then CODE = 4
if SME = 1 and 25% of RG = 1, then CODE = 3
if SME = 1 and 50% of RG = 1, then CODE = 2
if SME = 1 and 75% of RG = 1, then CODE = 1
if SME = 1 and 90% of RG = 1, then CODE = 0

Continuous (numerical) variable (example: fabrication cost):
if SME >= (1.25 x MED), then CODE = 4
if SME >= (1.10 x MED), then CODE = 3
if SME >= (1.00 x MED), then CODE = 2
if SME >= (0.90 x MED), then CODE = 1
if SME >= (0.75 x MED), then CODE = 0
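The coding rules of Table 1 translate directly into a small routine. The sketch below is a transcription of the table under one assumption: the conditions for binary variables are read as "at most X% of the reference group has the practice". It is illustrative only, not the PDG's actual implementation.

```python
def code_scale_or_continuous(sme_value, reference_stat):
    """CODE for scale variables (reference_stat = MEA) and continuous
    variables (reference_stat = MED), as in Table 1."""
    for code, factor in ((4, 1.25), (3, 1.10), (2, 1.00), (1, 0.90), (0, 0.75)):
        if sme_value >= factor * reference_stat:
            return code
    return None   # below all thresholds: no code listed in Table 1

def code_binary(sme_value, share_of_rg_with_1):
    """CODE for binary variables: the rarer the practice in the reference
    group, the higher the code when the evaluated SME has it (assumed
    'at most' reading of the Table 1 percentages)."""
    if sme_value != 1:
        return None
    for code, share in ((4, 0.10), (3, 0.25), (2, 0.50), (1, 0.75), (0, 0.90)):
        if share_of_rg_with_1 <= share:
            return code
    return 0
```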
3 The eRisC System: Risk Assessment of SME Development Projects

3.1 An Overview of the eRisC System
SMEs often experience difficulties accessing financing to support their activities in general, and their R&D and innovation activities in particular—see [1], [4], [8], and [9]. Establishing the risk levels of innovation activities can be quite complex, and there is no formalized tool to help financial analysts assess them and correctly implement compensation and financing terms that will satisfy both lenders and entrepreneurs. This situation creates a lot of pressure on the cash resources of innovating SMEs. Based on our team's experience with SMEs and expertise with the assessment of risk, and thanks to the contribution of several experts who constantly deal with SME development projects, we have developed a state-of-the-art Web-based software called eRisC (see Figure 3). The eRisC (https://oraprdnt.uqtr.uquebec.ca/erisc/index.jsp) expert diagnosis system identifies, measures and helps manage the main risk factors that could compromise the success of SME development projects, including expansion, export and innovation projects, each of which is the object of a separate section of the software. An extensive dynamic, Web-based questionnaire is used to collect relevant information items on the SME expansion project to be evaluated.
Fig. 3. The eRisC system: a Web-based software that helps identify, measure and manage the main risk factors involved in SME development projects
The contents of the questionnaire are based on an extensive review of the literature, in which we identified over 200 risk factors acting upon the success of SME development projects. For example, factors associated with the export activity are export experience, commitment/planning, target market, product, distribution channel, shipping and contractual/financial aspects. These seven elements are broken down into 21 subelements involving between 58 and 216 questions—the number of questions ranges from 59 to 93 for an expansion project, from 58 to 149 for an export project, and from 86 to 216 for an innovation project. Data extracted from the questionnaire is fed into an elaborate knowledge-intensive algorithm that computes risk levels and identifies the main risk elements associated with the evaluated project. As shown in Figure 3, the eRisC expert diagnosis system is connected to an Oracle database which collects all the relevant data. Since eRisC was developed after the PDG system, it benefited from the most recent Web-based technologies (e.g. Oracle Java) and was right from the start designed as a fully automated system. More precisely, contrary to the PDG reports, there is no need to constantly monitor eRisC's output reports—thus the dotted arrows on the right-hand side of Figure 3 above. eRisC was developed for and validated by entrepreneurs, economic agents, lenders and investors, to identify the main risk factors of SME development projects in order to improve their success rates and facilitate their financing. As of now, various organizations are starting to put eRisC to use in real-life situations, allowing us to collect precious information in eRisC's database on SME projects and their associated risk assessment. We have a group of 30 users, from various organizations and domains, who currently use eRisC for real-life projects and who provide us with useful feedback for marketing purposes.
3.2 Some Details on the eRisC System
eRisC's contents were developed by combining various sources of information, knowledge and expertise: the literature on business failure factors and on project management, our colleagues' expertise on SMEs, and invaluable information from various agents dealing with these issues on a day-to-day basis, such as lenders, investors, entrepreneurs, economic advisors and experts. Based on this precious and abundant information, we first assembled a long list of potential risk factors that could significantly disturb or influence the development of SME projects. In a second phase, we had to reduce the original list of risk factors, which was simply too long to be considered in its entirety in real-life practical situations. In order to do that, we considered the relative importance and influence of risk factors on the failure of evaluated projects. Once this pruning was completed, and after we ensured that we had not discarded important factors, the remaining key factors were grouped into meaningful generic categories. We then developed sets of questions and subquestions that would support the measurement of the actual risk level of a project. This also allowed us to add a risk management dimension to our tool by inviting the user to identify with greater precision facets that could compromise the success of the project, thus allowing better control through the implementation of appropriate corrective measures. A relatively complex weight system was also developed in order to associate a quantitative measure with individual risk elements, to rank these elements, and to compute a global risk rating for the evaluated project—see Figure 4 below.
Fig. 4. An excerpt from the expansion project questionnaire. The only acceptable answers to questions are YES, NO, NOT APPLICABLE, DON'T KNOW
In a third and final phase, the contents of eRisC were validated with many potential users, and their feedback was taken into consideration to make adjustments to several aspects such as question formulation, term definition, confidentiality of information, etc. At this point, the tool was still “on paper”, as an extensive questionnaire (grid), and had not been implemented yet. So an important design decision had to be made at the very beginning of the implementation phase: how to convert the on-paper, large, and static questionnaire into an adequate form to be implemented in the eRisC software? As we examined various possibilities, we gradually came to look at it more and more as an interactive and dynamic document. In this dynamic perspective, the questionnaire would be adaptable to the users' needs for the specific project at hand. In a sense, the questionnaire is at the meeting point of three complementary dimensions: the risk evaluation model as defined by domain experts, the user's perspective as a domain practitioner, and the computerized rendering of the previous two dimensions. Moreover, from a down-to-earth, practical viewpoint, users would only be interested in the resulting software if it proved to be quick, user-friendly, and better than their current non-automated tool.
Fig. 5. Risk Assessment Results Produced by eRisC
With regard to the technological architecture, eRisC is based on the standard 3-tiered Web architecture, for which we selected Microsoft's Internet Explorer (Web browser) for the client side, the Tomcat Web server for the middleware, and, for the data server, the Oracle database server (Oracle Internet Application Server 8.1.7 Enterprise Edition) running on a Unix platform available at our University in a secured environment. All programming was done with JSP (JavaServer Pages) and JavaScript. A great advantage of the 3-tiered model is that it supports dynamic Web applications
in which the contents of the Web pages shown in the user's (client's) Web browser are computed "on the fly", i.e. dynamically, by the Web server from the information it fetches from the database server in response to the user's (client's) request. The five (5) main steps of processing involved in a project risk evaluation with eRisC are: 1) dynamic creation of the questionnaire, according to the initial options selected by the user; 2) project evaluation (question answering: see Figure 4) by the user; 3) saving of the data (the user's answers) to the database; 4) computation of the results; and 5) presentation of the results in an online and printable report. Once phases 1 to 3 are completed, after some 30 minutes on average, eRisC only takes a minute or so to produce the final results, all of this taking place online. The final results include a numerical value representing the risk rating (a relative evaluation between 0 and 100) for the specific SME project just evaluated, combined with the identification of at least the five most important risk factors (to optionally perform risk mitigation) within the questionnaire's sections used to perform the evaluation, plus a graphical (pie) representation showing the risk associated with every section and their respective weights in the computation of the global project risk rating—see Figure 5 above and Figure 6 below. The user can change these weights to adjust the evaluation according to the project's characteristics, or to better reflect her/his personal view on risk evaluation. These "personal" weights can also be saved by eRisC in the user's account so that the software can reuse them the next time around.
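The published description does not include eRisC's actual scoring formulas; the following Python sketch only illustrates the general idea of a weighted, section-based rating on a 0-100 scale with identification of the top risk factors. All section names, factor names, weights and scores below are hypothetical.

```python
# Minimal sketch (not the real eRisC algorithm): weighted average of section
# risk scores (each on a 0-100 scale) plus identification of the top factors.
def global_risk_rating(section_scores, section_weights):
    total_w = sum(section_weights.values())
    return sum(section_scores[s] * section_weights[s] for s in section_scores) / total_w

def top_risk_factors(factor_scores, n=5):
    # the n factors with the highest individual risk scores
    return sorted(factor_scores, key=factor_scores.get, reverse=True)[:n]

# illustrative sections, weights and factor scores (hypothetical values)
scores  = {"market": 62, "finance": 48, "team": 35, "technology": 70}
weights = {"market": 0.3, "finance": 0.3, "team": 0.2, "technology": 0.2}
factors = {"no export experience": 80, "single customer": 75, "weak cash flow": 60,
           "unproven technology": 85, "no distribution channel": 55, "tight schedule": 40}

print(round(global_risk_rating(scores, weights), 1), top_risk_factors(factors))
```

Because the section weights are simple dictionary entries, letting the user adjust them and re-running the rating (as eRisC allows) amounts to recomputing the same weighted average with new values.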
Fig. 6. Mitigation Report and Risk Assessment Simulation in eRisC
When sufficient data have been accumulated in eRisC's database, it will be possible to establish statistically based weight models for every type of user. Amongst various possibilities, this will allow entrepreneurs to evaluate their projects with the weights used by bankers, helping them to better understand the bankers' viewpoint when asking for financing assistance. Finally, mitigation elements are associated with many of the risk factors listed in eRisC's output report. Typically associated with the most important risk factors, these mitigation elements suggest ways to reduce the risk rating just computed. The user can even re-compute the risk level under the hypothesis that the selected mitigation elements have been put in place, in order to assess the impact they may have on the project's global risk level. A new graphic is then produced showing a comparison of the risk levels before and after the mitigation process.
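As a rough illustration of the mitigation simulation described above (the real eRisC computation is not published), the sketch below recomputes a weighted risk level under the hypothesis that selected mitigation elements reduce their factors' scores by an assumed fraction; the factor names, weights and the 50% reduction are made up.

```python
# Sketch of a before/after mitigation comparison (assumed behaviour, not the
# published eRisC formula).
def simulate_mitigation(factor_scores, weights, mitigated, reduction=0.5):
    adjusted = {f: s * (1 - reduction) if f in mitigated else s
                for f, s in factor_scores.items()}
    def rating(scores):
        return sum(scores[f] * weights[f] for f in scores) / sum(weights.values())
    return rating(factor_scores), rating(adjusted)

factors = {"weak cash flow": 70, "single customer": 60, "tight schedule": 40}
weights = {"weak cash flow": 0.4, "single customer": 0.4, "tight schedule": 0.2}
before, after = simulate_mitigation(factors, weights, mitigated={"weak cash flow"})
print(before, after)   # risk levels before and after the mitigation process
```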
4   Conclusion: AI-less Intelligent Decision Support Systems
A good deal of multi-domain expertise and informal knowledge engineering was invested in the design of the PDG and eRisC expert diagnosis systems. In fact, at the early stage of the PDG project, which was developed before eRisC, it was even hoped that an expert-system approach would apply naturally to the task we were facing. Using an expert system shell, a prototype expert system was in fact developed for a subset of the PDG system dealing only with human resources. However, reality turned out to be much more difficult than anticipated. In particular, the knowledge acquisition, knowledge modelling, and knowledge validation/verification phases ([7], [12], [11], [16], [3]) were too demanding given our resource constraints, especially in a multidisciplinary domain such as that of SMEs, for which little formalized knowledge exists. Indeed, many people were involved, all of them in various specialization fields (i.e. management, marketing, accounting, finance, human resources, engineering, technical, information technology, etc.) and with various backgrounds (researchers, graduate students, research professionals and, of course, entrepreneurs). One of the main difficulties that hindered the development of the PDG as an expert system was the continuous change both the questionnaire and the benchmarking report underwent during the first three years of the project. So, at the same time as the research team was trying to develop a multidisciplinary model of SME performance evaluation, users' needs had to be considered, software development had to be carried out, and evaluation reports had to be produced for participating SMEs. This turned out to be a rather complicated situation. The prototype expert system mentioned above was developed in parallel with the current version, although only for the subset dealing with human resources—see [10] and [19] for examples of expert systems in finance. The project leader's knowledge-engineer role was very difficult since several experts from different domains were involved and the extraction and fusion of these various fields of expertise had never been done before. Despite the experts' valuable experience, knowledge, and good will, they had never been part of a similar project before. The modelling of such rich, complex, and vast information, especially for SMEs, was an entirely new challenge both scientifically and technically. Indeed, because of their heterogeneous nature, and contrary to large enterprises, SMEs are
much more difficult to model and evaluate. For instance, the implementation of certain management practices may be necessary and usual for traditional manufacturing enterprises, but completely inappropriate for a small enterprise subcontracting for a large company or a prime contractor. These important considerations and difficulties, not to mention the consequences they had on the project's schedule and budget, led to the abandonment of the expert-system approach after the development of a simple prototype. As for the eRisC system, since it was another multi-domain, multi-expert project, and thanks to our prior experience with the PDG system, it was quickly decided to stay away from AI-related approaches and techniques. During the development of eRisC's questionnaires, we realized how risk experts always tended to model risk assessment from their own perspective and from their own personal knowledge, as reported in the literature. This is why we built our risk assessment model from many sources, thanks to a comprehensive review of the literature and the availability of several experts, in order to ensure we ended up with an exhaustive list of risk-determining factors for SME projects. The main perspectives are the following (see e.g. [14]):
• Bankers and lenders care mostly about financial aspects and tend to neglect qualitative dimensions that indicate whether the enterprise can solve problems and meet challenges in risky projects.
• Entrepreneurs do not realize that their involvement in the project can in fact constitute a major risk from their partners' viewpoint.
• Economic consultants and advisors have a specialized background that may prevent them from having a global perspective on the project.
Obviously, it is the fusion of all these diverse and complementary sources of expertise that would have been used to develop the knowledge base of an expert-system version of the current eRisC system. However, this was simply impossible given the timetable and resources available to us. Of course, this does not mean that AI tools were inappropriate for those two projects. As a research team involved in an applied project, we made a rational decision based on our experience with a smaller-scale experiment (i.e. the PDG prototype expert system on human resources), on our time and budget constraints, and on the well-documented fact that multi-domain, multi-expert knowledge acquisition and modelling constitutes a great challenge. Yet another factor that had a great influence on our design decisions was the fact that both projects started out on paper as questionnaires, which led naturally to database building and to database-related software development. Thus, both the PDG and eRisC systems ended up as knowledge-packed systems built on database technology. However, as we briefly discuss in Section 5 below, we are now at a stage where we plan the addition of AI-related techniques and tools. The current versions of the PDG and eRisC systems, although not implemented with AI techniques, e.g. a knowledge base of rules and facts, an inference engine, etc. (see, e.g., [13], [18]), qualify as "black-box" expert diagnosis systems. These unique systems are based on knowledge, information and algorithms that allow them to produce outputs that only a human expert, or in fact several human experts in different domains, would be able to produce in terms of diagnosis and recommendation quality. These reports contain mostly coloured diagrams and simple explanations formulated in plain English (or French) so that SME entrepreneurs can easily understand them.
The PDG is the only system that can be said to use some relatively old AI techniques. Indeed, the comments produced in the output report are generated via a template-based approach, an early technique used in natural language processing.
5   Future Work: Bringing Back AI Techniques into the Picture
The PDG and eRisC systems are now at a stage where we can reconsider the introduction of AI techniques in new developments. The main justification for this is the need to eliminate human intervention while preserving high-quality outputs, based on rare, highly skilled knowledge and expertise. We have started to develop new modules that will further increase the intelligent features of both systems. Here is a short, non-exhaustive list accompanied by brief explanations:
• Development of data warehouses and data mining algorithms to facilitate statistical processing of data and extend knowledge extraction capabilities. Such extracted knowledge will be useful to improve the systems' meta-knowledge level, which could be used in the systems' explanations for instance, and also to broaden human experts' domain knowledge. This phase is already in progress.
• The huge number of database attributes and statistical variables manipulated in both systems is overwhelming. A conceptual taxonomy, coupled with an elaborate data dictionary, has now become a necessary addition. For instance, the researcher should be able to find out quickly to what concepts a particular attribute (or variable) is associated, to what computations or results it is related, and so on. This phase has recently begun.
• Development of an expert system to eliminate the need for any human intervention in the PDG system. Currently, a human expert must revise all reports before they are sent to the SME. Most of the time, only minor adjustments are required. The knowledge used to perform this final revision takes into consideration individual results produced in various parts of the benchmarking report and analyzes potential consequences of interrelationships between them in order to ensure that the conclusions and recommendations for the evaluated SME are both valid and coherent. This phase is part of our future work.
• Augmentation of the current systems with case-based reasoning and related machine learning algorithms. In several aspects of both systems, evaluation of the problem at hand could be facilitated if it were possible to establish relationships with similar problems (cases) already solved before. Determining the problems' salient features to support this approach would also offer good potential to lessen the users' burden during the initial data collection phase. This phase is part of our future work.
• Study of the potential of agent technology to reengineer some elements of both systems, especially from a decision support system perspective [2]. This could be especially interesting for the modelling and implementation of distributed sources of expertise that contribute to decision processing. For example, in the PDG system each source of expertise in the performance evaluation of an SME could be associated with a distinct agent controlling and managing its own knowledge base. Interaction and coordination between these agents would be crucial aspects of a PDG system based on a community of cooperative problem-solving agents.
References
[1] Beaudoin R. and J. St-Pierre (1999). "Le financement de l'innovation chez les PME", Working paper for Développement Économique Canada, 39 pages. Available: http://www.DEC-CED.gc.ca/fr/2-1.htm
[2] Bui T. and J. Lee (1999). "An Agent-Based Framework for Building Decision Support Systems", Decision Support Systems, 25, 225-237.
[3] Caulier P. and B. Houriez (2001). "L'approche mixte expérimentée (modélisation des connaissances métiers)", L'Informatique Professionnelle, 32 (195), juin-juillet, 30-37.
[4] Chapman R.L., C.E. O'Mara, S. Ronchi and M. Corso (2001). "Continuous Product Innovation: A Comparison of Key Elements across Different Contingency Sets", Measuring Business Excellence, 5(3), 16-23.
[5] Delisle S. and J. St-Pierre (2003). "An Expert Diagnosis System for the Benchmarking of SMEs' Performance", First International Conference on Performance Measures, Benchmarking and Best Practices in the New Economy (Business Excellence '03), Guimaraes (Portugal), 10-13 June 2003, to appear.
[6] Delisle S. and J. St-Pierre (2003). "SME Projects: A Software for the Identification, Assessment and Management of Risks", 48th World Conference of the International Council for Small Business (ICSB-2003), Belfast (Ireland), 15-18 June 2003, to appear.
[7] Fensel D. and F. Van Harmelen (1994). "A Comparison of Languages which Operationalize and Formalize KADS Models of Expertise", The Knowledge Engineering Review, 9(2), 105-146.
[8] Freel M.S. (2000). "Barriers to Product Innovation in Small Manufacturing Firms", International Small Business Journal, 18(2), 60-80.
[9] Menkveld A.J. and A.R. Thurik (1999). "Firm Size and Efficiency in Innovation: Reply", Small Business Economics, 12, 97-101.
[10] Nedovic L. and V. Devedzic (2002). "Expert Systems in Finance—A Cross-Section of the Field", Expert Systems with Applications, 23, 49-66.
[11] Matsatsinis N.F., M. Doumpos and C. Zopounidis (1997). "Knowledge Acquisition and Representation for Expert Systems in the Field of Financial Analysis", Expert Systems with Applications, 12(2), 247-262.
[12] Rouge A., J.Y. Lapicque, F. Brossier and Y. Lozinguez (1995). "Validation and Verification of KADS Data and Domain Knowledge", Expert Systems with Applications, 8(3), 333-341.
[13] Santos J., Z. Vale and C. Ramos (2002). "On the Verification of an Expert System: Practical Issues", Lecture Notes in Artificial Intelligence #2358, 414-424.
[14] Sarasvathy D.K., H.A. Simon and L. Lave (1998). "Perceiving and Managing Business Risks: Differences Between Entrepreneurs and Bankers", Journal of Economic Behavior and Organization, 33, 207-225.
[15] Shim J.P., M. Warkentin, J.F. Courtney, D.J. Power, R. Sharda and C. Carlsson (2002). "Past, Present, and Future of Decision Support Technology", Decision Support Systems, 33, 111-126.
[16] Sierra-Alonso A. (2000). "Definition of a General Conceptualization Method for the Expert Knowledge", Lecture Notes in Artificial Intelligence #1793, 458-469.
[17] St-Pierre J., L. Raymond and E. Andriambeloson (2002). "Performance Effects of the Adoption of Benchmarking and Best Practices in Manufacturing SMEs", Small Business and Enterprise Development Conference, The University of Nottingham.
[18] Turban E. and J.E. Aronson (2001). Decision Support Systems and Intelligent Systems, Prentice Hall.
[19] Wagner W.P., J. Otto and Q.B. Chung (2002). "Knowledge Acquisition for Expert Systems in Accounting and Financial Problem Solving", Knowledge-Based Systems, 15, 439-447.
Using Conceptual Decision Model in a Case Study Miki Sirola Helsinki University of Technology Laboratory of Computer and Information Science P.O. Box 5400, FIN-02015 HUT, Finland
[email protected] Abstract. Decision making is largely based on decision concepts and decision models built into decision support systems. The type of decision problem determines the application. This paper presents a case study analysed with a conceptual decision model that utilises rule-based methodologies, numerical algorithms and procedures, statistical methodologies including distributions, and visual support. The selection of the decision concepts used is based on case-based needs. Fine-tuning of the model is done during the construction of the computer application and the analysis of the case examples. A kind of decision table is built, including pre-filtered decision options and carefully chosen decision attributes. Each attribute is weighted, the decision table values are given, and finally a total score is calculated. This is done with a many-step procedure including various elements. The computer application is built on the G2 platform. The case example, choice of career, is analysed in detail. The developed prototype should be considered mostly as an advisory tool in decision making. More important than the numerical result of the analysis is to learn about the decision problem. Evaluation expertise is needed in the development process. The model constructed is a kind of completed multi-criteria decision analysis concept. This paper is also an example of using a theoretical methodology in solving a practical problem.
1   Introduction
There are aspects of decision making that need special attention. Various decision concepts have been composed and many kinds of decision models have been built to provide decision support systems with the best possible aid. The decision maker has to be able to integrate all the valuable information available and distil a good enough decision in each particular decision case. Visual support is often very valuable for the decision maker. Visual support can of course mean several things. Visualisation of process data is a basic example of giving the decision maker information that can be valuable. Visualisation of process variables begins with simple plots and time series, and may evolve into more and more complicated forms.
Support methodologies such as decision tables, decision trees, flow diagrams, and rule-based methods ought to be mentioned. Calculation algorithms, e.g. for optimisation, are often needed as well. Selection criteria formation and decision option generation are also important parts of the decision process, used systematically for example in the multi-criteria decision analysis methodology. Statistical methodologies, distributions, object models, agents and fuzzy models have also been introduced as parts of decision models. Simulation for tracking purposes and prediction should not be forgotten either. In system-level applications several of these concepts are needed and utilised further. Decision types can be divided into long-term decisions and short-term decisions. The number of objectives, possible uncertainty, time dependence, etc. also affect the decision problem perspective. Sometimes the decisions need to be made on-line and sometimes off-line. All these factors need to be kept in mind when the problem-solving methodology is chosen. Comparing risk and cost is a common methodology in decision making. Choosing the preferences among competing priorities is an important point. The cumulative quality function [1] and chained paired comparisons [2] are examples of more specific methodologies used in decision making. Measures in decision making are also interesting. Deterministic decision making has its own measures, mostly based on value and utility theory, while stochastic decision making uses statistical measures such as distributions. Decision concepts have been reviewed in reference [3]. Although decision making is applied in many areas, the literature seems to concentrate on economy and production planning. There is more variation in the methodologies used. For instance, the decision analysis approach and knowledge-based technologies are commonly used concepts. Decision making in handling sequences, resource allocation, network design, sorting problems and classification are examples of problem types studied in detail. These issues are also discussed in references [4], [5] and [6]. The author has dealt with other decision analysis case examples e.g. in references [7], [8] and [9]. In the conceptual decision model presented in this paper only carefully considered features have been used. Some of the previously presented techniques have been selected for this model. The model utilises rule-based methodologies, numerical algorithms and procedures, statistical methodologies including e.g. distributions, and visual support. Rule-based methodologies are used for instance in the preliminary elimination of decision options, algorithms and procedures e.g. in the calculation of weight coefficients, and statistical distributions in evaluating the values in a kind of decision table. This utilisation is explained in more detail in the next chapter about the decision concept and model. The selection of the decision concepts used in the model is based on case-based needs. By analysing real decision problems the most suitable features and methodologies have been taken into use. If some feature is noticed to be unnecessary in the decision process, it has been left out of the final decision model. Missing features have also been completed along the way. The concept of the model is first planned on a rough level, and then fine-tuned during the examination of the case examples. The type of the decision problem determines the application. In a decision situation there often exist alternatives.
Sometimes the alternatives appear in frequent sets. In each decision situation there is a certain history that can be more or less known.
Statistical support, production of solutions, filtering and selection are all needed at certain steps. Situation-based assessment is very important, for instance in a case such as checkers. This game, played on a draught-board, may be analysed using different area sizes or numbers of elements. The number of possible combinations increases very fast with the size. Other possible cases worth mentioning are e.g. the selection of a career, a buying decision (a car being the classical example), or the optimisation of a travel route. The selection of a career may include such attributes as inclination, interest, economy, risk, etc. The buying decision has such possible attributes as price, technical qualities, colour, model, etc. In this paper the case choice of career is analysed in more detail. The computer application of this case example has been built with the G2 expert system shell.
2   Decision Concept of the Model
The conceptual decision model is presented here first on a rough level, and then in more and more detail when getting closer to the examined case itself. It was found that rule-based methods, algorithms and procedures, statistical methods such as distributions, and visual support are the most suitable methodologies and give the most desired features for the decision model in question. The whole concept in use consists of these elements just mentioned. In the model a kind of decision table is built, including decision options and decision attributes (see Figure 1). The decision attributes can also be called decision criteria. Each decision attribute has a weight coefficient, and each decision option can be valued with regard to each attribute. This far the table is quite similar to the one used in the quantitative analysis of multi-criteria decision analysis. In fact, the formula calculating the final numerical result is also exactly the same:

    v_tot(a_i) = Σ_{j=1}^{n} w_j v_j(a_i)                      (1)
where a_i is decision alternative i, w_j is the weight coefficient of criterion j and v_j(a_i) is the scaled score of decision alternative i with respect to criterion j. The decision problem takes shape during the analysis process. First the decision attributes are defined for the case in question. Then the decision options are created. The decision options are filtered with a rule-based examination, and only the most suitable ones are selected for the final analysis. A similar procedure can be applied to the decision attributes as well. The weight coefficients are calculated with an algorithm based on pairwise comparisons and step-by-step adjustment through all attributes in the final analysis. This procedure is adjusted for each case separately. Statistical distributions come into the picture when the values in the decision table are given. Some attributes, with regard to their corresponding decision option, are valued by using information in statistical tables. Historical data is one of the elements used in constructing these tables. This method and its realisation are explained in more detail in the next chapter about the computer application.
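To make formula (1) concrete, here is a minimal Python sketch of scoring pre-filtered decision options against weighted criteria; the criterion weights and the 0-5 option values are purely illustrative, not the values used in the paper's G2 application.

```python
# Minimal sketch of formula (1): total score of each decision option as the
# weighted sum of its per-criterion values.
def total_score(option_values, weights):
    # v_tot(a_i) = sum_j w_j * v_j(a_i)
    return sum(weights[c] * option_values[c] for c in weights)

weights = {"inclination": 0.40, "interest": 0.30, "economy": 0.20, "risk": 0.10}
options = {                      # illustrative 0-5 scores, not data from the paper
    "mathematician": {"inclination": 3, "interest": 5, "economy": 3, "risk": 2},
    "lawyer":        {"inclination": 5, "interest": 3, "economy": 5, "risk": 2},
    "trucker":       {"inclination": 2, "interest": 1, "economy": 2, "risk": 4},
}
scores = {name: total_score(vals, weights) for name, vals in options.items()}
print(max(scores, key=scores.get), scores)   # best-scoring option and all totals
```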
The decision table also helps the decision maker to perceive the decision problem visually. It is of course possible to use many other kinds of visualisation methodologies in addition—visualisations that help the decision maker to picture the decision problem better than just plain numbers in a decision table. But even the decision problem shaped into a decision table already gives great help in understanding the decision problem better.
3   Computer Application
The computer application is realised in the G2 expert system shell. The presented realisation scheme (see Figure 1) is for the case choice of career, but it can easily be generalised to any of the cases mentioned in this paper. Some key features of the realisation are also documented here. The decision table itself is realised as a freedom table. The changing values come either from arrays such as choice of career (options), value of attribute (criteria) and weight of attribute (weight coefficients), or from some of the functions explained later on (numerical values in the decision table). The result is calculated with formula (1), as mentioned in chapter 2. The array values are also calculated as outputs of procedures or functions. An object class career has been defined, with four subclasses inclination, interest, economy and risk. The career candidates such as mathematician, natural scientist, linguist, lawyer, economist and trucker are defined as instances of the object class career. The rule base (not seen in the figure) takes care, for instance, of the filtering of the initial decision options. Minimum limit values are defined for the weight coefficients of the attributes, and based on these comparisons the final decision options are selected. As already mentioned, there are procedures and functions calculating the weight coefficients and the numerical values in the decision table. The procedures and functions calculating the numerical values in the decision table also utilise accessory tables including information from statistical distributions about the attributes in the decision table. The realisation scheme of the calculation of the weight coefficients and numerical table values is not described in every detail in this paper.
4   Case Example about the Choice of Career
In this case the choice of career of more than thirty persons is analysed. The selection process is more or less the same with each person, or at least so similar that only one such example is discussed in detail. No statistical analysis is done for the whole random sample. The qualities of the decision model and the concept used are considered more important, and these features are best illustrated through an illustrative example. The decision table for person number 17 is seen in Figure 1. For person number 17 the decision option filtering gives out three options: mathematician, lawyer and trucker. So in the decision table of Figure 1 career1 is mathematician, career2 is lawyer, and career3 is trucker. The decision attributes are
inclination, interest, economy and risk. Here inclination means the genetic feasibility of the person for such a career. Interest means the subjective willingness to choose the career. Economy means the statistical income of each career type, and risk the statistical probability of becoming unemployed in each profession.
Fig. 1. A window of the G2 application where the case choice of career is analysed for person number 17
The inputs to the procedure calculating the weight coefficients include each person's subjective evaluation and a kind of general importance measure. Pairwise comparisons are made between each two attributes and the final values are calculated step by step. As already mentioned, the attributes economy and risk are valued by using tables including information from statistical distributions, while the other attributes are valued by different means. The statistical distributions are taken from public databases. A subjective measure is used in valuing the attribute interest, while the attribute inclination is valued by a combined technique including subjective, statistical and quality-type measures. Person number 17 seems to have the most inclination for the career of lawyer, and the most interest for the career of mathematician. The statistics show the best income for the career of lawyer, and the smallest risk for the career of trucker. Note that the attribute risk is on an inverse scale, so a big number means a small risk (which is considered to be a good quality), while all other attributes are on a normal scale (a big number means big inclination, interest or economy). The scale of the values in the decision table is from 0 (worst possible) to 5 (best possible).
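The paper does not spell out the exact pairwise-comparison algorithm, so the sketch below uses one simple illustrative scheme: pairwise preference judgements are collected in a matrix and its normalized row sums are taken as weight coefficients. The judgement values are hypothetical, not person 17's real inputs.

```python
# Illustrative pairwise-comparison weighting (assumed scheme, not the paper's).
import numpy as np

attributes = ["inclination", "interest", "economy", "risk"]
# P[i, j] = 1 if attribute i is judged more important than j, 0.5 if equal, 0 otherwise
P = np.array([
    [0.5, 1.0, 1.0, 1.0],
    [0.0, 0.5, 1.0, 1.0],
    [0.0, 0.0, 0.5, 1.0],
    [0.0, 0.0, 0.0, 0.5],
])
weights = P.sum(axis=1) / P.sum()          # normalize row sums so weights sum to 1
print(dict(zip(attributes, np.round(weights, 2))))
```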
The numerical result shows clearly that person number 17 should choose the option lawyer. The second best choice would be mathematician, and clearly last comes trucker. This result is so clear that a sensitivity analysis is not needed for this person. In many cases the sensitivity analysis shows the weak points of the analysis by making clear how different parameters affect the final result. It must be noted, though, that the attribute risk was given a very small weight. Still, many people consider this attribute rather important in this context. In this case inclination has been given rather high emphasis. The importance of the attributes economy and interest is located somewhere in between. This tool should be considered as a kind of advisory tool, and by no means as an absolute reference for the final choice. Although for person number 17 two of the attributes clearly point to the choice of a lawyer career, there are still two important attributes that disagree with this opinion. For instance, many people think that one should follow the voice of interest in such things as choosing a career. Still, this kind of analysis can be very informative for the decision maker in finding out the other, often more hidden, motives for such choices.
5   Discussion
The conceptual decision model has been built by first assembling the desired decision concept, combining existing decision support methodologies in a new way. The decision model has then been built by iterating the details while the computer application is being built and the case examples analysed. The computer application is programmed on the G2 platform. The case example of choice of career is analysed in more detail. Other case examples have also been used in the development of the computer application, and in the fine-tuning of the concept and model itself. The developed prototype should be considered mostly as an advisory tool in decision making. The features have been selected with thorough care, and therefore the desired qualities are mostly found in the final application. The realisation of course also places some restrictions on the possibilities, but in general I think that we can be rather satisfied with the results achieved. The decision model includes rule-based methodologies, numerical algorithms and procedures, statistical methodologies such as distributions, and visual support. The use of these methodologies in the model has been explained in more detail in the previous chapters. The role of visual support is not very remarkable in this application, although it can generally be considered quite important and also an essential part of the concept. There exist many possibilities for making errors, for instance in calculating the weight coefficients and in producing the values in the decision table, but mostly the sensitivity analysis helps to find the weak points in the analysis. Plausible comparison of different attributes is also a problem. By giving the weight coefficients, the attributes are put into a kind of order of importance, but justifying the comparison itself is often problematic. For instance, the comparison of risk and cost is not always considered valid by common opinion. The numerical result of the analysis is not the most important result achieved. The most important result is to find out what the most important decisive factors are in the
whole decision making process, and therefore to learn more about the problem itself. Already a better understanding of the problem helps in finding a good solution, even if the decision maker does not agree with the numerical result given by the tool. The decisive factors and their order of importance are found from the decision attributes (criteria) of the formulated decision problem. Sometimes the revelation of hidden motives helps the decision maker. This often happens during the long procedure of working with the decision problem, from the beginning of the problem formulation to the final analysis and even documentation. Better understanding by learning during the decision process is one of the key issues that this paper is trying to present. Evaluation expertise is also very important to include in the solution procedure of the decision problem. Otherwise we are just playing with random numbers without a real connection to the decision that we are trying to solve. This paper is an example of using a theoretical methodology in solving a practical problem. The model constructed is a kind of completed multi-criteria decision analysis (MCDA) concept. The decision table itself is very similar, but the difference is in the way the different components and values of the table are produced. In the ordinary MCDA method more handwork is done, while this system is rather far automated. The case about choice of career is a typical one-time decision. The case of buying a new car is very similar in this regard. On the other hand, the game of checkers is a completely different type of decision problem. In that case rather similar decision situations come up repeatedly. The role of historical data, retrieval and prediction becomes more important. The case of optimising a travel route is again different, a third type of problem. Only the first one of these problems was analysed in this paper, although the other cases have also been calculated with the tool. The others were only very briefly introduced here. Reporting a large amount of analysis results would be impossible and inappropriate in a short paper, and this has also allowed a better focus on the chosen particular problem. With a different type of decision problem, the methodology used would change in many parts. Such a case could be the topic of another paper. In this paper the methodology used has been reflected through a single application in order to point out the findings in detail. Although many persons were analysed in the choice of career case example experiment, no statistical analysis for the whole random sample was made. Concentrating more on this issue is one clear future need. As a tool capable of handling rather large amounts of data in a moderate time has been built, a natural way to proceed is to widen the scope in this direction. In the analysis of the other persons choosing their career, numerous variations in the results, and also in the different phases of the analysis, were found. Because from the methodological point of view the procedure is similar, reporting this part of the analysis has been omitted. The realisation platform for the computer application is the G2 expert system shell. This environment is quite suitable for this kind of purpose. G2 is very strong in heuristics, and not so good in numerical calculation. Although both of them are needed in this application, very heavy numerical calculation is not essential. As a kind of combinatorial methodology has been developed, other platforms could also be considered.
G2 was just a natural choice for making the first prototype to test the ideas presented in this paper. I owe an acknowledgement to the Laboratory of Automation Technology of our university for the possibility of using the G2 expert system shell as the platform of the computer application, which made the analysis of the case examples possible.
References
[1] Zopounidis C., M. Doumpos. Stock evaluation using preference disaggregation methodology. Decision Sciences. Atlanta (1999)
[2] Ra J. Chainwise paired comparisons. Decision Sciences. Atlanta (1999)
[3] Sirola M. Decision concepts. To be published.
[4] Ashmos D., Duchon D., McDaniel R. Participation in strategic decision making: the role of organizational predispositions and issue interpretation. Decision Sciences. Atlanta (1998)
[5] Santhanam R., Elam J. A survey of knowledge-based systems research in decision sciences (1980-1995). The Journal of the Operational Research Society. Oxford (1998)
[6] Rummel J. An empirical investigation of costs in batching decisions. Decision Sciences. Atlanta (2000)
[7] Sirola M. Computerized decision support systems in failure and maintenance management of safety critical processes. VTT Publications 397. Espoo, Finland (1999)
[8] Laakso K., Sirola M., Holmberg J. Decision modelling for maintenance and safety. International Journal of Condition Monitoring and Diagnostic Engineering Management. Birmingham (1999) Vol. 2, No. 3, ISSN 1363-7681, pp. 13-17
[9] Sirola M. Applying decision analysis method in process control problem in accident management situation. Proceedings of XIV International Conference on System Science. Wroclaw, Poland (2001)
Automated Knowledge Acquisition Based on Unsupervised Neural Network and Expert System Paradigms Nazar Elfadil and Dino Isa Division of Engineering, University of Nottingham in Malaysia No. 2 Jln Conlay, 50450 Kuala Lumpur, Malaysia Tel: 60321408137, Fax: 603-27159979
[email protected] Abstract. This paper presents an approach to automated knowledge acquisition using Kohonen self-organizing maps and k-means clustering. To illustrate the overall system architecture and validate it, a data set representing world animals has been used as the training data set. The verification of the produced knowledge base was done using a conventional expert system.
1   Problem Background
In our daily life, we cannot avoid making decisions. Decision-making may be defined as reaching a conclusion or determination on a problem at hand. However, in recent years, the problems to be solved have become more complex. Consequently, knowledge-based decision-making systems have been developed to aid us in solving complex problems. Nevertheless, the knowledge base itself has become the bottleneck, as it is the part of the system that is still developed manually. As the size and complexity of the problems increase, and experts become scarce, the manual extraction of knowledge becomes very difficult. Hence, it is imperative that the task of knowledge acquisition be automated. The demand for automated knowledge acquisition systems has increased dramatically. Previous approaches to automated knowledge acquisition are based on decision trees, progressive rule generation, and supervised neural networks [1]. All the above-mentioned approaches are supervised learning methods, requiring training examples combined with their target output values. In real-world cases, target data are not always known and thus cannot be provided to the system before training the data set starts [2]. This paper is organized as follows: Section 2 presents an overview of automated knowledge acquisition. Section 3 demonstrates an illustrative application. Finally, Section 4 presents the conclusion and future work.
2   Automated Knowledge Acquisition
The paper proposes an automated knowledge acquisition method in which knowledge (connectionist) is extracted from data that have been classified by a Kohonen self-organizing map (KSOM) neural network. This knowledge (at this stage) is of the intermediate-level concept rule hierarchy. The final concept rule hierarchy is generated by applying a rule generation algorithm that is aided by an expert system inference engine. The resulting knowledge (symbolic) may be used in the construction of the symbolic knowledge base of an expert system. The proposal is rationalized from the realization that most complex real-world problems are not solvable using just symbolic or just adaptive processing alone. However, not all problems are suitable for neural expert system integration [3]. The most suitable ones are those that seek to classify inputs into a small number of groups. Figure 1 illustrates the top-level architecture that integrates the neural network and the expert system.
Fig. 1. Neural Expert system architecture
3   Illustrative Application
To illustrate the details of the various tasks in the proposed method, a simple and illustrative case study will be used. The system has been trained with the animal data set illustrated in Table 1. Each pattern consists of 11 attributes, and the whole animal data set with its various attributes represents 11 kinds of animals, as shown in Table 1. The training data are composed of 1023 examples.
Table 1. Portion of animal data set input data
(Is: Small/Medium/Big; Has: 2 legs/4 legs/Hair/Feathers; Likes to: Hunt/Run/Fly/Swim)

Pattern   Small  Medium  Big  2 legs  4 legs  Hair  Feathers  Hunt  Run  Fly  Swim
Dove        1      0      0     1       0      0       1       0     0    1    0
Duck        1      0      0     1       0      0       1       0     0    0    1
Goose       1      0      0     1       0      0       1       0     0    1    1
Hawk        1      0      0     1       0      0       1       1     0    1    0
Cat         1      0      0     0       1      1       0       1     0    0    0
Eagle       0      1      0     1       0      0       1       1     0    1    0
Fox         0      1      0     0       1      1       0       1     0    0    0
Dog         0      1      0     0       1      1       0       0     1    0    0
Wolf        0      1      0     0       1      1       0       1     1    0    0
Lion        0      0      1     0       1      1       0       1     1    0    0
Horse       0      0      1     0       1      1       0       0     1    0    1
Step 1: Initialize the weights to small random values and set the initial neighborhood to be large.
Step 2: Stimulate the net with a given input vector.
Step 3: Calculate the Euclidean distance between the input and each output node, and select the output node j for which
        D(j) = Σ_i (w_ij − x_i)²
        is a minimum.
Step 4: Update the weights for the selected node and the nodes within its neighborhood:
        w_ij(new) = w_ij(old) + α (x_i − w_ij(old))
Step 5: Repeat from Step 2 unless the stopping condition is met.

Fig. 2. KSOM learning algorithm
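The following compact numpy sketch follows the steps of Fig. 2 on a small rectangular map with a bubble neighborhood; the grid size, learning-rate and radius schedules are illustrative choices, not the settings used in the paper.

```python
# Illustrative KSOM training loop (not the authors' implementation).
import numpy as np

def train_ksom(data, rows=4, cols=4, epochs=50, lr0=0.5, radius0=2):
    rng = np.random.default_rng(0)
    n_inputs = data.shape[1]
    w = rng.random((rows * cols, n_inputs)) * 0.1          # Step 1: small random weights
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)])
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)                    # decaying learning rate
        radius = max(radius0 * (1 - epoch / epochs), 0.5)  # shrinking neighborhood
        for x in data:                                     # Step 2: present an input
            d = ((w - x) ** 2).sum(axis=1)                 # Step 3: distance D(j)
            winner = int(d.argmin())
            hood = np.abs(grid - grid[winner]).max(axis=1) <= radius  # bubble neighborhood
            w[hood] += lr * (x - w[hood])                  # Step 4: update winner + neighbors
    return w                                               # Step 5: repeat until done

# two patterns from Table 1 (Dove, Horse) as an illustrative subset of the input
data = np.array([[1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
                 [0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1]])
weights = train_ksom(data)
```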
The data-preprocessing phase consists of the normalization and initialization of the input data. In normalization, the objective is to ensure that no data dominate over the rest of the input data vectors [4]. The initialization phase involves three tasks, namely: weight initialization, topology initialization, and neighborhood initialization. The hexagonal lattice type is chosen as the map topology in the animal data set case study. The choice of the number of output nodes is made through comprehensive trials [5]. The weights of the neural network are initialized either by linear or random initialization. The random initialization technique is chosen here. For the neighborhood function, Gaussian or Bubble are the suitable choices [3]. The bubble function is considered the simpler but adequate one, and it is applied here.
Table 2. Portion of KSOM output

Output node    Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9   Q10  Q11
(1,1)           0   1.0   0    0   1.0  1.0   0   1.0   0   1.0  1.0
(1,2)           0   1.0   0    0   1.0  1.0   0   0.9   0   0.9  1.0
(1,3)           0   1.0   0    0   1.0  1.0   0   0.9   0   0.9  1.0
(1,4)           0   1.0   0    0   1.0  1.0   0   0.5   0   0.5  1.0
(1,5)           0   1.0   0    0   1.0  1.0   0   0.1   0   0.1  0.9
The machine learning and clustering phase composed of KSOM learning and Kmeans clustering algorithms. The proposed method employs unsupervised learning, and this is the key contribution of this research. Figure 2 outlines the KSOM learning algorithm. Table 2 illustrates a portion of the result of the KSOM training session. While Figure 3 and Figure 4 demonstrates the visual representation of KSOM output and Kmeans output, respectively. The selection of neighborhood range and learning rate techniques well defined in [5]. The next step is classifying the updated weights explicitly, and the process is referred as clustering. The terminology comes from the appearance of an incoming sequence of feature vectors which arrange themselves into clusters, groups of points that are closer to each other and their own centers rather than to other groups. When a feature vector is input to the system, its distance to the existing cluster representatives is determined, and it is either assigned to the cluster, with minimal distance or taken to be a representative if a new cluster. The required clustering process carried out by the modified K-means algorithm. The K-means algorithm self-organizes its input data to create clusters [5]. Figure 4 shows a visual representation for clustering session output. Each gray shade represents a certain cluster. The step now is to find the codebook vector and the indices for each cluster. This data contains the weights that distinguish and characterize each cluster. In this stage of the knowledge acquisition process, the extraction of a set of symbolic rules that map the input nodes into output nodes (with respect to each cluster) is performed. The antecedents of the rules that define these concepts consist of contributory and inhibitory input weights. In a KSOM network, each output node is connected to every input node, with the strength of interconnection reflected in the associated weight vector. The larger the weight vector associated with a link, the greater is the contribution of the corresponding input node to the output node. The input with the largest weight link makes the largest contribution to the output node [4]. To distinguish the contributory inputs from inhibitory inputs, then binarize the weights. If contributive, the real-valued weight is converted to 1, and is converted to 0 if inhibitory. There are two approaches to do this, namely: threshold and breakpoint technique [4]. The threshold technique has chosen. The threshold is set at 50% (i.e. below 0.5 is considered as 0 and above 0.5
(i.e. a weight below 0.5 is considered as 0 and a weight above 0.5 as 1). This threshold selection makes the probabilities of false positive and true negative equal. More information concerning this selection is found in [5].
Fig. 3. Output of Kohonen NN learning
Fig. 4. Output of modified K-means clustering
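For the clustering step, a brief sketch using the standard k-means implementation from scikit-learn is shown below (the paper uses its own modified k-means); the weight matrix here is a random stand-in for the trained KSOM weights.

```python
# Cluster the trained KSOM weight vectors into groups of output nodes.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
ksom_weights = rng.random((16, 11))          # stand-in for the trained map weights
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(ksom_weights)
print(km.labels_)                            # cluster index of every output node
print(np.round(km.cluster_centers_, 2))      # codebook vector of each cluster
```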
The final set of antecedents in each cluster usually contains some duplicated patterns. This redundancy is now removed. We can now symbolically map the antecedents to each cluster and obtain the rules for each cluster. The symbolic rule extraction algorithm is an inductive learning procedure. The algorithm, as provided in Figure 5, is self-explanatory.
Fig. 5. Symbolic rule extraction algorithm
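The exact Fig. 5 algorithm is not reproduced here; the sketch below only captures the idea described in the text: binarize each cluster's codebook vector at the 0.5 threshold, drop duplicate antecedent patterns, and emit one IF-THEN rule per cluster. The codebook values are made up.

```python
# Sketch of symbolic rule extraction from cluster codebook vectors.
ATTRIBUTES = ["Small", "Medium", "Big", "2_legs", "4_legs", "Hair",
              "Feathers", "Hunt", "Run", "Fly", "Swim"]

def extract_rules(cluster_codebooks, threshold=0.5):
    rules, seen = {}, set()
    for label, weights in cluster_codebooks.items():
        # keep contributory inputs (weight above threshold) as the antecedent
        antecedent = tuple(a for a, w in zip(ATTRIBUTES, weights) if w > threshold)
        if antecedent not in seen:                      # remove duplicated patterns
            seen.add(antecedent)
            rules[label] = "IF " + " & ".join(antecedent) + " THEN " + label
    return rules

# illustrative codebook vectors (values made up, in the spirit of Table 2)
codebooks = {"Lion": [0, 0, .9, .1, .9, .9, .1, .9, .8, 0, .1],
             "Duck": [.9, 0, 0, .9, 0, 0, .9, .1, 0, .2, .9]}
for rule in extract_rules(codebooks).values():
    print(rule)
```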
The system provides rules that recognize the various animal types. Given an input pattern of the animal data, the clusters can now be recognized using this rule base. To make the rule antecedents more comprehensible we removed the inhibitory parts, as shown in Table 3. To test and evaluate the final symbolic rules, an expert system was developed using the C language, and the rules were evaluated using the expert system inference engine.

Table 3. Elaborated extracted symbolic rule base
Rule No.  Antecedent parts                            Conclusion
1         (Big)&(4_legs)&(hair)&(hunt)&(run)          Lion
2         (Big)&(4_legs)&(hair)&(run)&(swim)          Horse
3         (Small)&(2_legs)&(feathers)&(swim)          Duck
4         (Small)&(2_legs)&(feathers)&(fly)&(swim)    Goose
5         (Small)&(4_legs)&(hair)&(hunt)              Cat
6         (Small)&(2_legs)&(feathers)&(fly)           Dove
7         (Medium)&(4_legs)&(hair)&(run)              Dog
8         (Small)&(2_legs)&(feathers)&(hunt)&(fly)    Hawk
9         (Medium)&(4_legs)&(hair)&(hunt)             Fox
10        (Medium)&(4_legs)&(hair)&(hunt)&(run)       Wolf
11        (Medium)&(2_legs)&(feathers)&(hunt)&(fly)   Eagle

4   Conclusion
The animal data set case study shows that the proposed automated knowledge acquisition method can successfully extract knowledge in the form of production rules from a numerical data set representing the salient features of the problem domain. This study has demonstrated that symbolic knowledge extraction can be performed using unsupervised-learning KSOM neural networks, where no target output vectors are available during training. The system is able to learn from examples via its neural network section. The extracted knowledge can form the knowledge base of an expert system, from which explanations may be provided. Large, noisy and incomplete data
sets can be handled. The system demonstrates the viability of integrating neural networks and expert systems to solve real-world problems.
References
[1] T. S. Dillon, S. Sestito, M. Witten, M. Suing. "Automated Knowledge Acquisition Using Unsupervised Learning". Proceedings of the Second IEEE Workshop on Emerging Technologies and Factory Automation (EFTA'93). Cairns, Sept. 1993, pp. 119-128.
[2] M. S. Kurzyn. "Expert Systems and Neural Networks: A Comparison", IEEE Expert, pp. 222-223, 1993.
[3] Sabrina Sestito, Automated Knowledge Acquisition, Prentice Hall, 1994, Australia.
[4] N. Elfadil, M. Khalil, S. M. Nor, and S. Hussein. "Kohonen self-organizing maps & expert system for disk network performance prediction". Journal of Systems Analysis Modeling & Simulation (SAMS), 2002, Vol. 42, pp. 1025-1043. England.
[5] N. Elfadil, M. Khalil, S. M. Nor, and S. Hussein. "Machine Learning: The Automation of Knowledge Acquisition Using Kohonen Self-Organizing Maps Neural Networks". Malaysian Journal of Computer Science, June 2001, Vol. 14, No. 1, pp. 68-82.
Selective-Learning-Rate Approach for Stock Market Prediction by Simple Recurrent Neural Networks Kazuhiro Kohara Chiba Institute of Technology 2-17-1, Tsudanuma, Narashino, Chiba 275-0016, Japan
[email protected] Abstract. We have investigated selective learning techniques for improving the ability of back-propagation neural networks to predict large changes. The prediction of daily stock prices was taken as an example of a noisy real-world problem. We previously proposed the selective-presentation and selective-learning-rate approaches and applied them to feed-forward neural networks. This paper applies the selective-learning-rate approach to three types of simple recurrent neural networks. We evaluated their performance through experimental stock-price prediction. Using the selective-learning-rate approach, the networks can learn the large changes well, and the profit per trade was improved for all of the simple recurrent neural networks.
1   Introduction
Prediction using back-propagation neural networks [1] has been extensively investigated (e.g., [2-5]), and various attempts have been made to apply neural networks to financial market prediction (e.g., [6-18]), electricity load forecasting (e.g., [19, 20]) and other areas (e.g., flour price prediction [21]). In the usual approach, all training data are presented equally to a neural network (i.e., presented in each cycle) and the learning rates are equal for all the training data, independently of the size of the changes in the prediction-target time series. Generally, the ability to predict large changes is more important than the ability to predict small changes, as we mentioned in our previous paper [16]. When all training data are presented equally with an equal learning rate, the neural network will learn the small and large changes equally well, so it cannot learn the large changes more effectively. We have investigated selective learning techniques for improving the ability of a neural network to predict large changes. We previously proposed the selective-presentation and selective-learning-rate approaches and applied them to feed-forward neural networks [16, 17]. In the selective-presentation approach, the training data corresponding to large changes in the prediction-target time series are presented more often. In the selective-learning-rate approach, the learning rate for training data corresponding to small changes is reduced. This paper applies the selective-learning-rate approach to three types of simple recurrent neural networks.
We evaluate their performance using the prediction of daily stock prices as a noisy real-world problem.
2   Selective-Learning-Rate Approach
To allow neural networks to learn about large changes in the prediction-target time series more effectively, we separate the training data into large-change data (L-data) and small-change data (S-data). L-data (S-data) have next-day changes that are larger (smaller) than a preset value. In the selective-learning-rate approach [17], all training data are presented in every cycle; however, the learning rate of the back-propagation training algorithm for S-data is reduced compared with that for L-data. The outline of the approach is as follows; an illustrative sketch is given after the list.

Selective-Learning-Rate Approach:
1. Separate the training data into L-data and S-data.
2. Train back-propagation networks with a lower learning rate for the S-data than for the L-data.
3. Stop network learning at the point satisfying a certain stopping criterion.
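A minimal sketch of step 1 and the learning-rate selection (not the authors' code): samples whose absolute next-day change exceeds a preset threshold are treated as L-data and keep the full learning rate, while S-data get a reduced rate; the resulting per-sample rate would then be used in the per-pattern back-propagation update. The example daily-change values are made up; the rates 0.7/0.14 and the median threshold follow the experiments reported later in the paper.

```python
# Per-sample learning rates for the selective-learning-rate approach (sketch).
import numpy as np

def split_learning_rates(next_day_changes, threshold, lr_large=0.7, lr_small=0.14):
    """Return lr_large for L-data (|change| > threshold) and lr_small for S-data."""
    changes = np.abs(np.asarray(next_day_changes, dtype=float))
    return np.where(changes > threshold, lr_large, lr_small)

# illustrative next-day TOPIX changes; threshold at the median of |change|
deltas = np.array([3.1, -20.4, 15.0, -7.8, 31.2, -14.9])
thr = np.median(np.abs(deltas))
print(split_learning_rates(deltas, thr))   # per-sample epsilon for back-propagation
```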
3   Simple Recurrent Neural Prediction Model
We considered the following types of knowledge for predicting Tokyo stock prices. These types of knowledge involve numerical economic indicators.
1. If interest rates decrease, stock prices tend to increase, and vice versa.
2. If the dollar-to-yen exchange rate decreases, stock prices tend to decrease, and vice versa.
3. If the price of crude oil increases, stock prices tend to decrease, and vice versa.
We used the following five indicators as inputs to the neural network, in the same way as in our previous work [16, 17].
• TOPIX: the chief Tokyo stock exchange price index
• EXCHANGE: the dollar-to-yen exchange rate (yen/dollar)
• INTEREST: an interest rate (3-month CD, new issue, offered rates) (%)
• OIL: the price of crude oil (dollars/barrel)
• NY: New York Dow-Jones average of the closing prices of 30 industrial stocks (dollars)
TOPIX was the prediction target. EXCHANGE, INTEREST and OIL were chosen based on the knowledge of numerical economic indicators. The Dow-Jones average was used because Tokyo stock market prices are often influenced by New York exchange prices. The information for the five indicators was obtained from the Nihon Keizai Shinbun (a Japanese financial newspaper).
The daily changes in these five indicators (e.g. ∆TOPIX(t) = TOPIX(t) − TOPIX(t-1)) were input into the neural networks, and the next-day change in TOPIX was presented to the neural network as the desired output. The back-propagation algorithm [1] was used to train the network. All the daily-change data were scaled to the interval [0.1, 0.9]. We considered three types of simple recurrent neural networks (SRN). The structures of the three SRNs are shown in Figures 1, 2 and 3. The SRN-1 is similar to the recurrent network proposed by Elman [22], where the inputs C(t) to the context layer at time t are the outputs of the hidden layer at time t-1, H(t-1): C(t) = H(t-1). In the SRN-1 training, C(t) was initialized to 0 when t = 0. The structure of the SRN-1 was 25-20-1 (the 25 includes 5 in the input layer and 20 in the context layer).
Fig. 1. SRN-1
The SRN-2 is similar to the recurrent network proposed by Jordan [23], where the input C(t) to the context layer at time t is the output of the output layer at time t−1, O(t−1): C(t) = O(t−1). In the SRN-2 training, C(t) was initialized to 0 at t = 0. The structure of the SRN-2 was 6-6-1 (the 6 comprises 5 units in the input layer and 1 in the context layer). The SRN-3 is our original network, in which the input C(t) to the context layer at time t is the prediction error at time t−1: C(t) = ∆TOPIX(t−1) − O(t−1). In the SRN-3 training, C(t) was initialized to 0 at t = 0. The structure of the SRN-3 was also 6-6-1 (the 6 comprises 5 units in the input layer and 1 in the context layer). A code sketch of these three feedback rules is given after Fig. 3.
Fig. 2. SRN-2
Fig. 3. SRN-3
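The three recurrent structures differ only in what is fed back into the context layer. The fragment below is a rough sketch of these feedback rules, not the authors' code; the function name and array shapes are illustrative assumptions, while the zero initialisation at t = 0 and the layer sizes follow the descriptions above.

```python
import numpy as np

def context_input(kind, H_prev, O_prev, actual_prev):
    """Return the context-layer input C(t) for the three SRN variants.

    SRN-1 (Elman):  C(t) = H(t-1), the previous hidden-layer activations.
    SRN-2 (Jordan): C(t) = O(t-1), the previous network output.
    SRN-3 (paper):  C(t) = dTOPIX(t-1) - O(t-1), the previous prediction error.
    """
    if kind == "SRN-1":
        return H_prev                   # 20 context units
    if kind == "SRN-2":
        return O_prev                   # 1 context unit
    if kind == "SRN-3":
        return actual_prev - O_prev     # 1 context unit
    raise ValueError(kind)

# at t = 0 the context is initialised to zero, as in the paper
H_prev, O_prev, actual_prev = np.zeros(20), np.zeros(1), np.zeros(1)
print(context_input("SRN-3", H_prev, O_prev, actual_prev))
```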
4   Evaluation through Experimental Stock-Price Prediction
4.1   Experiments
We used data from a total of 409 days (from August 1, 1989 to March 31, 1991): 300 days for training, 109 days for making predictions.
In Experiments 1, 3 and 5, all training data were presented to the SRN-1, SRN-2 and SRN-3, respectively, in each cycle with an equal learning rate (ε = 0.7). In Experiments 2, 4 and 6, the learning rate for the S-data was reduced to 20% of that for the L-data (i.e., ε = 0.7 for the L-data and ε = 0.14 for the S-data) in the SRN-1, SRN-2 and SRN-3 training, respectively. Here, the large-change threshold was 14.78 points (about US$ 1.40), which was the median (the 50% point) of the absolute values of the TOPIX daily changes in the training data. In each experiment, network learning was stopped after 3000 learning cycles. The momentum parameter α was 0.7. All the weights and biases in the neural network were initialized randomly between -0.3 and 0.3. When a large change in TOPIX was predicted, we calculated "Profit" as follows: when the predicted direction was the same as the actual direction, the daily change in TOPIX was earned, and when it was different, the daily change in TOPIX was lost. This calculation of profit corresponds to the following experimental TOPIX trading system. A buy (sell) order is issued when the predicted next-day rise (fall) in TOPIX is larger than a preset value corresponding to a large change. When a buy (sell) order is issued, the system buys (sells) TOPIX shares at the current price and subsequently sells (buys) them back at the next-day price. Transaction costs were ignored in calculating the profit. The more accurately a large change is predicted, the larger the profit.
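As a concrete reading of this trading rule, the sketch below computes "Profit" from predicted and actual next-day changes in TOPIX; the function and array names are assumptions, the predictions are supplied externally, and transaction costs are ignored as in the paper.

```python
import numpy as np

def simulated_profit(pred_change, actual_change, prediction_threshold):
    """Trade only when a large change is predicted; earn the daily TOPIX change
    when the predicted direction is right, lose it when it is wrong."""
    pred_change = np.asarray(pred_change, dtype=float)
    actual_change = np.asarray(actual_change, dtype=float)
    trades = np.abs(pred_change) > prediction_threshold
    gains = np.where(np.sign(pred_change) == np.sign(actual_change),
                     np.abs(actual_change), -np.abs(actual_change))
    profit = gains[trades].sum()
    n_trades = trades.sum()
    return profit, n_trades, (profit / n_trades if n_trades else 0.0)

# example with made-up predictions over five test days
profit, n, per_trade = simulated_profit([20, -5, 35, -18, 2],
                                        [15, -8, -30, -25, 10],
                                        prediction_threshold=14.78)
```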
4.2   Results
In each experiment the neural network was run ten times on the same training data with different initial weights, and the average was taken. The first experimental results are shown in Table 1, where "Profit" was calculated according to the above trading method whenever a change larger than 14.78 points (the 50% point) in TOPIX was predicted (i.e., the "prediction threshold" equals 14.78). When the prediction threshold was comparatively low, the number of trades was too large. In actual stock trading, the larger the number of trades becomes, the higher the transaction costs become.
Table 1. Experimental results 1 (prediction threshold = 14.78)
                    SRN-1                 SRN-2                 SRN-3
                    Ex. 1     Ex. 2       Ex. 3     Ex. 4       Ex. 5     Ex. 6
                    (equal)  (selective)  (equal)  (selective)  (equal)  (selective)
Profit               466       768         486       781         367       760
Number of trades     24.2      45.0        24.4      44.4        17.4      36.7
Profit per trade     19.2      17.0        19.9      17.5        21.1      20.7
The results of the second experiment are shown in Table 2, where "Profit" was calculated whenever a change larger than 31.04 points (the 75% point) in TOPIX was predicted. When the prediction threshold becomes high, the number of trades becomes small. With the equal-learning-rate approach (Experiments 1, 3 and 5), the networks cannot learn the large changes well and their output values cannot reach a high level, so the number of trades was very small. With our selective-learning-rate approach
(Experiments 2, 4 and 6), the networks learn the large changes well and the number of trades became larger than with the equal-learning-rate approach. Using the selective-learning-rate approach, the number of trades became larger and the profit per trade improved for all three types of simple recurrent neural networks. The SRN-3 achieved the best profit per trade in these experiments.
Table 2. Experimental results 2 (prediction threshold = 31.04)
                    SRN-1                 SRN-2                 SRN-3
                    Ex. 1     Ex. 2       Ex. 3     Ex. 4       Ex. 5     Ex. 6
                    (equal)  (selective)  (equal)  (selective)  (equal)  (selective)
Profit                27       256           8       180          20       161
Number of trades      1.4      10.5         0.5       6.9         0.8       5.2
Profit per trade     19.2      24.3        16.0      26.0        25.0      31.0
5   Conclusion
We investigated selective learning techniques for stock market forecasting by neural networks. We applied our selective-learning-rate approach, in which the learning rate for training data corresponding to small changes is reduced, to three types of simple recurrent neural networks. The results of several experiments on stock-price prediction showed that with the selective-learning-rate approach the networks learn large changes well, and the profit per trade was improved for all three types of simple recurrent neural networks. Next, we will apply these techniques to other real-world forecasting problems. We also plan to develop a forecasting method that integrates statistical analysis with neural networks.
References
[1] Rumelhart D, Hinton G, Williams R. Learning internal representations by error propagation. In: Rumelhart D, McClelland J and the PDP Research Group (eds). Parallel Distributed Processing 1986; 1. MIT Press, Cambridge, MA
[2] Weigend A, Gershenfeld N (eds). Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley, Reading, MA, 1993
[3] Vemuri V, Rogers R (eds). Artificial Neural Networks: Forecasting Time Series. IEEE Press, Los Alamitos, CA, 1994
[4] Pham D, Liu X. Neural Networks for Identification, Prediction and Control. Springer, 1995
[5] Kil D, Shin F. Pattern Recognition and Prediction with Applications to Signal Characterization. American Institute of Physics Press, 1996
[6] Azoff E. Neural Network Time Series Forecasting of Financial Markets. John Wiley and Sons, West Sussex, 1994
[7] Refenes A, Azema-Barac M. Neural network applications in financial asset management. Neural Comput & Applic 1994; 2(1): 13-39
[8] Goonatilake S, Treleaven P (eds). Intelligent Systems for Finance and Business. John Wiley and Sons, 1995
[9] White H. Economic prediction using neural networks: the case of IBM daily stock return. In: Proc of Int Conf Neural Networks 1988; II-451-II-458. San Diego, CA
[10] Baba N, Kozaki M. An intelligent forecasting system of stock price using neural networks. In: Proc of Int Conf Neural Networks 1992; I-371-I-377. Singapore
[11] Dutta S, Shekhar S. Bond rating: a non-conservative application of neural networks. In: Proc of Int Conf Neural Networks 1988; II-443-II-450. San Diego, CA
[12] Freisleben B. Stock market prediction with backpropagation networks. In: Belli F, Rademacher J (eds). Lecture Notes in Computer Science 604, pp. 451-460. Springer-Verlag, Heidelberg, 1992
[13] Kamijo K, Tanigawa T. Stock price pattern recognition – a recurrent neural network approach. In: Proc of Int Conf Neural Networks 1990; I-215-I-221. San Diego, CA
[14] Kimoto T, Asakawa K, Yoda M, Takeoka M. Stock market prediction with modular neural networks. In: Proc of Int Conf Neural Networks 1990; I-1-I-6. San Diego, CA
[15] Tang Z, Almeida C, Fishwick P. Time series forecasting using neural networks vs. Box-Jenkins methodology. Simulation 1991; 57(5): 303-310
[16] Kohara K, Fukuhara Y, Nakamura Y. Selective presentation learning for neural network forecasting of stock markets. Neural Comput & Applic 1996; 4(3): 143-148
[17] Kohara K, Fukuhara Y, Nakamura Y. Selectively intensive learning to improve large-change prediction by neural networks. In: Proc of Int Conf Engineering Applications of Neural Networks 1996; 463-466. London
[18] Kohara K. Neural networks for economic forecasting problems. In: Leondes CT (ed). Expert Systems – The Technology of Knowledge Management and Decision Making for the 21st Century. Academic Press, 2002
[19] Park D, El-Sharkawi M, Marks II R, Atlas L, Damborg M. Electric load forecasting using an artificial neural network. IEEE Trans Power Syst 1991; 6(2): 442-449
[20] Caire P, Hatabian G, Muller C. Progress in forecasting by neural networks. In: Proc of Int Conf Neural Networks 1992; II-540-II-545. Baltimore, MD
[21] Chakraborty K, Mehrotra K, Mohan C, Ranka S. Forecasting the behavior of multivariate time series using neural networks. Neural Networks 1992; 5: 961-970
[22] Elman JL. Finding structure in time. CRL Tech. Rep. 8801, Center for Research in Language, Univ. of California, San Diego, 1988
[23] Jordan MI. Attractor dynamics and parallelism in a connectionist sequential machine. In: Proc 8th Annual Conf Cognitive Science Society 1986; 531-546. Erlbaum
A Neural-Network Technique for Recognition of Filaments in Solar Images
V.V. Zharkova¹ and V. Schetinin²
¹ Department of Cybernetics, University of Bradford, BD7 1DP, UK
[email protected]
² Department of Computer Science, University of Exeter, EX4 4QF, UK
[email protected]
Abstract. We describe a new neural-network technique developed for the automated recognition of solar filaments visible in hydrogen Hα-line full disk spectroheliograms. This technique deploys an artificial neural network (ANN) with one input neuron, one output neuron and two hidden neurons associated with either the filament or the background pixels in a fragment. The ANN learns to recognize a filament depicted on a local background from a single image fragment labelled manually. The trained neural network properly recognized filaments in the testing image fragments, depicted on backgrounds of varying brightness caused by atmospheric distortions. Using a parabolic activation function, this technique was extended to the recognition of multiple solar filaments occasionally appearing in selected fragments.
1   Introduction
Solar images observed from ground- and space-based observatories in various wavelengths have been digitised and stored in different catalogues, which are to be unified under grid technology. Robust techniques, including limb fitting, removal of geometrical distortion, centre position and size standardisation and intensity normalisation, were developed to put the Hα and Ca K line full disk images taken at the Meudon Observatory (France) into a standardised form [1]. There is a growing interest in widespread ground-based daily observations of the full solar disk in the hydrogen Hα line, which can provide important information on long-term solar activity variations over months or years. The European Grid of Solar Observations project [2] was designed to deal with the automated detection of various features associated with solar activity, such as sunspots, active regions and filaments, or solar prominences. Filaments are the projections on the solar disk of prominences seen as very bright, large-scale features on the solar limb [1]. Their location and shape do not change very much for a long time and, hence, their lifetime is likely to be much longer than one solar rotation. However, there are visible changes seen in the filament
elongation, position with respect to an active region and magnetic field configuration. For this reason the automated detection of solar filaments is a very important task to tackle for understanding the physics of prominence formation, support and disruption. Quite a few techniques have been explored for different levels of feature detection, such as rough detection with a mosaic threshold technique [3] and image segmentation and region-growing techniques [4] - [7]. Artificial Neural Networks (ANNs) [8], [9] applied to the filament recognition problem commonly require a representative set of image data available for training. The training data have to represent image fragments depicting filaments under the different conditions in which the ANN has to solve the recognition problem. For this reason the number of training examples must be large. However, the image fragments are still taken from the solar catalogues manually. In this paper we describe a new neural-network technique which is able to learn to recognize filaments from a few image fragments labelled manually. This technique deploys an artificial neural network (ANN) with one input neuron, one output neuron and two hidden neurons associated with either the filament or the background pixels in a fragment. Despite the differences in background in the selected fragments of solar images containing filaments, the trained network properly recognized all the filaments in these fragments. Using a parabolic activation function, this technique was extended to recognize multiple solar filaments occasionally appearing in some fragments.
2   A Recognition Problem
First, let us introduce the image data as an n×m matrix X = {xij}, i = 1, …, n, j = 1, …, m, consisting of pixels whose brightness ranges between 1 and 255. This matrix depicts a filament, which is no more than a dark elongated feature observable on the solar surface against a high background brightness. Then a given pixel xij ∈ X may belong to a filament region, class Ω0, or to a non-filament region, class Ω1. Note that the brightness of the non-filament region varies markedly over the solar surface. Dealing with images, we can make the realistic assumption that the influence of neighbouring pixels on the central pixel xij is restricted to k elements. Using this assumption we can easily define a k×k window matrix P(i,j) with central pixel xij and k nearest neighbours. The background of the filament elements is assumed to be additive to xij, which allows us to evaluate it and subtract it from the brightness values of all the elements of matrix P. Now we can define a background function u = ϕ(X; i, j), which reflects the total contribution of background elements to the pixel xij. The parameters of this function can be estimated from the image data X. In order to learn the background function ϕ and then decide whether a pixel x is a filament element or not, we can use image fragments whose pixels are manually labelled and assigned either to class Ω0 or to class Ω1. A natural way to do this is to use a neural network which is able to learn to recognize filaments from a small number of labelled image fragments.
3   The Neural-Network Technique for Filament Recognition
As we stated in the Introduction, this technique deploys an artificial neural network with one input neuron, one output neuron and two hidden neurons associated with either the filament or the background pixels in a fragment. In general, neural networks perform a threshold technique of pattern recognition using one hidden neuron. However, in the case of solar images it turned out that the background on which the darker filaments occur varies from image to image. In order to make filament detection independent of the seeing conditions, this technique has to take into account the variability of the background elements in images. Hence, the idea behind the proposed method is to use additional information on the contribution of the variable background elements, which is represented by the function u = ϕ(X; i, j). This function, as we assume, can be learnt from image data. One possible way to estimate the function ϕ is to approximate its values for each pixel xij of a given image X. For filament recognition we can use either a parabolic or a linear approximation of this function. Whilst the first type is suitable for small image fragments, the second is used for relatively large fragments of the solar surface. Below we describe our neural-network technique exploiting the latter type of approximation. For image processing, our algorithm exploits a standard sliding-window technique by which a given image matrix X is transformed into a column matrix Z consisting of q = (n – k + 1)(m – k + 1) columns z(1), …, z(q). Each column z presents the r pixels taken from the matrix P, where r = k². The pixels x11, x12, …, x1k, …, xk1, xk2, …, xkk of this matrix are placed in the columns of the matrix Z so that the central element of P is located in the (r + 1)/2 position of the column z. Let us now introduce a feed-forward ANN consisting of two hidden neurons and one output neuron, as depicted in Fig 1. The first hidden neuron is fed by the r elements of the column vector z(j). The second hidden neuron evaluates the value uj of the background for the jth vector z(j). The output neuron makes a decision, yj ∈ {0, 1}, on the central pixel in the column vector z(j).
Fig. 1. The feed-forward neural network consisting of the two hidden and one output neurons
Assuming that the first hidden neuron is fed by r elements of the column vector z, its output s is calculated as follows: sj = f1(w0(1), w(1); z(j)), j = 1, …, q,
(1)
where w0(1), w(1), and f1 are the bias term, weight vector and activation function of the neuron, respectively. The activity of the second hidden neuron is proportional to the brightness of a background and can be described by the formula: uj = f2(w0(2), w(2); j), j = 1, …, q.
(2)
The bias term w0(2) and weight vector w(2) of this neuron are updated so that the output u becomes an estimate of the background component contributed to the pixels of the jth column z(j). The parameters w0(2) and w(2) may be learnt from the image data Z. Taking into account the outputs of the hidden neurons, the output neuron makes a final decision, yj ∈ {0, 1}, for each column vector z(j) as follows: yj = f3(w0(3), w(3); sj, uj), j = 1, …, q.
(3)
Depending on activities of the hidden neurons, the output neuron assigns a central pixel of the column z(j) either to the class Ω0 or Ω1.
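To make the sliding-window representation and the three-neuron decision concrete, the fragment below is a rough sketch under assumed values: the window size, the choice of the first hidden neuron's activation (a simple window mean) and the placeholder weights are illustrative, standing in for the parameters fitted in Section 4.

```python
import numpy as np

def to_columns(X, k=3):
    """Slide a k-by-k window over image X (n x m); each window becomes one column
    of Z, so Z has r = k*k rows and q = (n-k+1)(m-k+1) columns, with the window's
    central pixel in row (r + 1)/2 (1-based)."""
    n, m = X.shape
    cols = [X[i:i + k, j:j + k].ravel()
            for i in range(n - k + 1) for j in range(m - k + 1)]
    return np.array(cols).T

def classify(X, k=3, w_bg=(0.0, 0.0, 0.0), w_out=(-160.0, -1.0, 0.0)):
    """Placeholder weights stand in for the fitted parameters of Section 4."""
    Z = to_columns(X, k)
    q = Z.shape[1]
    j = np.arange(q)
    s = Z.mean(axis=0)                       # eq. (1): first hidden neuron (here: window mean)
    b0, b1, b2 = w_bg                        # eq. (4): parabolic "background" neuron
    u = b0 + b1 * j + b2 * j ** 2
    w0, w1, w2 = w_out                       # eq. (3)/(6): thresholded output neuron
    y = (w1 * s + w2 * u >= w0).astype(int)  # assumed convention: 1 = filament, 0 = background
    return y.reshape(X.shape[0] - k + 1, X.shape[1] - k + 1)

# toy 8x8 fragment: a darker diagonal "filament" on a bright background
img = np.full((8, 8), 200.0)
np.fill_diagonal(img, 60.0)
mask = classify(img)
```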
4   A Training Algorithm
In order to train our feed-forward ANN depicted in Fig 1, we can use the back-propagation algorithm, which provides a global solution. This requires recalculating the output sj for all q columns of matrix Z in each training epoch. However, there are also local solutions in which the hidden and output neurons are trained separately. Providing acceptable accuracy, the local solutions can be found much more easily than the global ones. First, we need to fit the weight vector of the second hidden, or "background", neuron, which evaluates the contribution of the background elements. Fig 2 depicts an example in which the top left plot shows the image matrix X presenting a filament on an unknown background and the top right plot reveals the corresponding filament elements depicted in black. The two bottom plots in Fig 2 show the outputs s (the left plot) and the weighted sum of s and u (the right one) plotted versus the columns of matrix Z. From the top left plot we see that the brightness of the background varies from the lowest level at the bottom left corner to the highest at the top right corner. Such variations of the background increase the output value u calculated over the q columns of matrix Z; see the increasing curve depicted in the bottom left plot. This plot shows that the background component changes over j = 1, …, q and can be fitted by a parabola. Based on this finding we can define a parabolic activation function of the "background" neuron as: uj = f2(w0(2), w(2); j) = w0(2) + w1(2)j + w2(2)j2.
(4)
Fig. 2. The example of image matrix X depicting the filament on the unknown background
The weight coefficients w0(2), w1(2), and w2(2) of this neuron can be fitted to the image data Z so that the squared error e between the outputs uj and sj becomes minimal: e = Σj(uj – sj)2 = Σj(w0(2) + w1(2)j + w2(2)j2 – sj)2 → min, j = 1, …, q.
(5)
The desired weight coefficients can be found with the least-squares method. Using a recursive learning algorithm [10], we can improve the estimates of these coefficients owing to its robustness to non-Gaussian noise in the image data. So the "background" neuron can be trained to evaluate the background component u. In the bottom right plot of Fig 2 we present the normalized values of sj, which are no longer affected by the background component. The recognized filament elements are shown in the top right plot of Fig 2. By comparing the left and right top plots in Fig 2 we can see that the second hidden neuron has successfully learnt to evaluate the background component from the given image data Z. Before training the output neuron, the weights of the first hidden neuron have to be found. For this neuron a local solution is achieved by setting its coefficients equal to 1. If it is necessary to improve the recognition accuracy, one can update these weights by using the back-propagation algorithm. After defining the weights for both hidden neurons, it is possible to train the output neuron, which makes the decision between 0 and 1. Let us re-write the output yj of this neuron as follows: yj = 0, if w1sj + w2uj < w0, and yj = 1, if w1sj + w2uj ≥ w0.
(6)
Then the weight coefficients w0, w1, and w2 can be fit in such way that the recognition error e is minimal: e = Σi| yi – ti | → min, i = 1,… , h,
(7)
where | ⋅ | denotes the modulus operator, ti ∈ {0, 1} is the ith element of a target vector t, and h is the number of its components, namely the number of training examples. In order to minimize the error e one can apply any supervised learning method, for example the perceptron learning rule [8], [9].
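As a rough illustration of this two-stage local training (a sketch, not the authors' code; the toy data and the labelling convention are assumptions), the fragment below fits the parabolic background neuron of eq. (4) to the outputs s by least squares, following eq. (5), and then fits the output threshold of eq. (6) with a simple perceptron rule that reduces the error of eq. (7).

```python
import numpy as np

def fit_background(s):
    """Least-squares fit of u_j = w0 + w1*j + w2*j**2 to the outputs s_j (eq. 5)."""
    j = np.arange(len(s), dtype=float)
    A = np.column_stack([np.ones_like(j), j, j ** 2])
    w, *_ = np.linalg.lstsq(A, s, rcond=None)
    return A @ w                              # estimated background component u

def train_output_neuron(s, u, t, epochs=100, lr=0.01):
    """Perceptron-style fit of (w0, w1, w2) in eq. (6) to labels t in {0, 1} (eq. 7)."""
    w0, w1, w2 = 0.0, 0.0, 0.0
    for _ in range(epochs):
        for sj, uj, tj in zip(s, u, t):
            yj = 1 if w1 * sj + w2 * uj >= w0 else 0
            err = tj - yj                     # update only on mistakes
            w1 += lr * err * sj
            w2 += lr * err * uj
            w0 -= lr * err
    return w0, w1, w2

# toy example: a slowly rising background plus a few darker "filament" columns
q = 200
s = 100 + 0.2 * np.arange(q) + 0.001 * np.arange(q) ** 2
labels = np.zeros(q, dtype=int)
s[80:90] -= 60; labels[80:90] = 1             # assumed convention: 1 marks filament columns
u = fit_background(s)
w = train_output_neuron(s, u, labels)
```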
5   Results and Discussion
The neural-network technique described above, with two hidden neurons, one input neuron and one output neuron, was applied to the recognition of dark filaments in solar images. The full disk solar images obtained at the Meudon Observatory (France) during the period March-April 2002 were considered for the identification [1]. The fragments with filaments were picked from the Meudon full disk images taken on various dates and for various regions on the solar disk, with different brightness and inhomogeneity in the solar atmosphere. There were 55 fragments selected depicting filaments on various backgrounds; one of them was used for training the ANN, and the remaining 54 were used for testing the trained ANN. Visually comparing the resultant and original images, we can conclude that our algorithm recognised these testing filaments well. Using a parabolic approximation we can now process large image fragments which may contain multiple filaments, as depicted in the left plot in Fig 3.
Fig. 3. The recognition of the multiple filaments
The recognized filament elements, depicted here in black, are shown in the right plot. A visual comparison of the resulting and original images confirms that the proposed algorithm has recognized all four filaments, closely matching their location and shape.
6   Conclusions
The automated recognition of filaments on solar disk images is still a difficult problem because of the variable background and inhomogeneities in the solar atmosphere. The proposed neural-network technique, with two hidden neurons responsible for the filament and background pixel values, can learn the recognition rules from a single image fragment depicting a solar filament, labelled visually. The recognition rule has been successfully tested on the 54 other image fragments depicting filaments on various backgrounds. Despite the background differences, the trained neural network has properly recognized both single and multiple filaments presented in the testing image fragments. Therefore, the proposed neural-network technique can be effectively used for the automated recognition of filaments in solar images.
References
[1] Zharkova, V.V., Ipson, S.S., Zharkov, S.I., Benkhalil, A., Aboudarham, J., Bentley, R.D.: A full disk image standardisation of the synoptic solar observations at the Meudon Observatory. Solar Physics (2002) accepted
[2] Bentley, R.D. et al.: The European grid of solar observations. Proceedings of the 2nd Solar Cycle and Space Weather Euro-Conference, Vico Equense, Italy (2001) 603
[3] Qahwaji, R., Green, R.: Detection of closed regions in digital images. The International Journal of Computers and Their Applications 8 (4) (2001) 202-207
[4] Bader, D.A., Jaja, J., Harwood, D., Davis, L.S.: Parallel algorithms for image enhancement and segmentation by region growing with experimental study. The IEEE Proceedings of IPPS'96 (1996) 414
[5] Turmon, M., Pap, J., Mukhtar, S.: Automatically finding solar active regions using SOHO/MDI photograms and magnetograms (2001)
[6] Turmon, M., Mukhtar, S., Pap, J.: Bayesian inference for identifying solar active regions (2001)
[7] Gao, J., Zhou, M., Wang, H.: A threshold and region growing method for filament disappearance area detection in solar images. The Conference on Information Science and Systems, The Johns Hopkins University (2001)
[8] Bishop, C.M.: Neural networks for pattern recognition. Oxford University Press (1995)
[9] Nabney, I.T.: NETLAB: Algorithms for pattern recognition. Springer-Verlag (1995)
[10] Schetinin, V.: A Learning Algorithm for Evolving Cascade Neural Networks. Neural Processing Letters, Kluwer, 1 (2003)
Learning Multi-class Neural-Network Models from Electroencephalograms
Vitaly Schetinin¹, Joachim Schult², Burkhart Scheidt², and Valery Kuriakin³
¹ Department of Computer Science, Harrison Building, University of Exeter, EX4 4QF, UK
[email protected]
² Friedrich-Schiller-University of Jena, Ernst-Abbe-Platz 4, 07740 Jena, Germany
[email protected]
³ Intel Russian Research Center, N. Novgorod, Russia
Abstract. We describe a new algorithm for learning multi-class neural-network models from large-scale clinical electroencephalograms (EEGs). This algorithm trains hidden neurons separately to classify all the pairs of classes. To find the best pairwise classifiers, our algorithm searches for input variables which are relevant to the classification problem. Despite patient variability and heavily overlapping classes, a 16-class model learnt from the EEGs of 65 sleeping newborns correctly classified 80.8% of the training and 80.1% of the testing examples. Additionally, the neural-network model provides a probabilistic interpretation of decisions.
1   Introduction
Learning classification models from electroencephalograms (EEGs) is still a complex problem [1] - [7] for the following reasons: first, the EEGs are strongly non-stationary signals which depend on the individual Background Brain Activity (BBA) of patients; second, the EEGs are corrupted by noise and muscular artifacts; third, a given set of EEG features may contain features which are irrelevant to the classification problem and may seriously hurt the classification results; and fourth, the clinical EEGs are large-scale data recorded over several hours, so the learning time becomes crucial. In general, multi-class problems can be solved by using one-against-all binary classification techniques [8]. However, a natural way to induce multi-class concepts from real data is to use Decision Tree (DT) techniques [9] - [12], which exploit a greedy heuristic or hill-climbing strategy to find input variables which efficiently split the training data into classes. To induce linear concepts, multivariate or oblique DTs have been suggested which exploit threshold logical units or so-called perceptrons [13] - [16]. Such multivariate DTs, known also as Linear Machines (LMs), are able to classify linearly separable
examples. Using the algorithms [8], [13] - [15], the LMs can also learn to classify non-linearly separable examples. However, such DT methods applied to inducing multi-class concepts from real large-scale data become impractical due to the large computational cost [15], [16]. Another approach to multiple classification is based on pairwise classification [17]. The basic idea behind this method is to transform a q-class problem into q(q - 1)/2 binary problems, one for each pair of classes. In this case the binary decision problems are presented with fewer training examples and the decision boundaries may be considerably simpler than in the case of one-against-all binary classification. In this paper we describe a new algorithm for learning multi-class neural-network models from large-scale clinical EEGs. This algorithm trains hidden neurons separately to classify all the pairs of classes. To find the best pairwise classifiers, our algorithm searches for input variables which are relevant to the classification problem. Additionally, the neural-network model provides a probabilistic interpretation of decisions. In the next section we define the classification problem and describe the EEG data. In section 3 we describe the neural-network model and the algorithm for learning pairwise classification. In section 4 we compare our technique with standard data mining techniques on the clinical EEGs, and finally we discuss the results.
2   A Classification Problem
In order to recognize some brain pathologies of newborns whose prenatal ages range between 35 and 51 weeks, clinicians analyze their EEGs recorded during sleep and then evaluate an EEG-based index of brain maturation [4] - [7]. In the pathological cases, the EEG index does not match the prenatal maturation. Following [6], [7], we can use the EEGs recorded from healthy newborns and define this problem as a multi-class one, i.e., a 16-class concept. Then all the EEGs of healthy newborns should be classified properly according to their prenatal ages, but the pathological cases should not. To build up such a multi-class concept, we used the 65 EEGs of healthy newborns recorded via the standard electrodes C3 and C4. Then, following [4], [5], [6], these records were segmented and represented by 72 spectral and statistical features calculated for each 10-sec segment over 6 frequency bands: sub-delta (0-1.5 Hz), delta (1.5-3.5 Hz), theta (3.5-7.5 Hz), alpha (7.5-13.5 Hz), beta 1 (13.5-19.5 Hz), and beta 2 (19.5-25 Hz). Additional features were calculated as the spectral variances. An EEG expert manually deleted the artifacts from these EEGs and then assigned the normal segments to the 16 classes according to the age of the newborns. The total number of labeled EEG segments was 59069. Analyzing the EEGs, we can see how heavily they depend on the BBA of the newborns. Formally, we can define the BBA as the sum of spectral powers over all the frequency bands. As an example, Fig. 1 depicts the BBA calculated for two newborns aged 49 weeks. We can see that the BBA, depicted as a dark line, varies chaotically during sleep and causes variations of the EEG which significantly alter the class boundaries.
Clearly, we can calculate the BBA beforehand and then subtract it from all the EEG features. Using this pre-processing technique we can remove the chaotic oscillations from the EEGs and expect to improve the classification accuracy (a sketch of this step is given after Fig. 1).
Fig. 1. EEG segments of two sleeping newborns aged 49 weeks
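A minimal sketch of this BBA-subtraction step follows; the feature-matrix layout, the column indices of the band-power features and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def subtract_bba(features, band_power_cols):
    """Remove the background brain activity (BBA) from the EEG features.

    The BBA of each 10-second segment is taken as the sum of its spectral powers
    over all frequency bands; it is then subtracted from every feature of that
    segment. The column indices of the band-power features are assumed.
    """
    features = np.asarray(features, dtype=float)
    bba = features[:, band_power_cols].sum(axis=1, keepdims=True)
    return features - bba

# toy example: 3 segments x 8 features, of which the first 6 are band powers
rng = np.random.default_rng(1)
X = rng.uniform(0, 5e4, size=(3, 8))
X_detrended = subtract_bba(X, band_power_cols=range(6))
```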
Below we describe the neural-network technique we developed to learn a multi-class concept from the EEGs.
3   The Neural-Network Model and Learning Algorithm
The idea behind our method of multiple classification is to train the hidden neurons of the neural network separately and combine them to approximate the dividing surfaces. These hidden neurons learn to divide the examples from each pair of classes. For q classes, therefore, we need to learn q(q - 1)/2 binary classifiers. The hidden neurons that deal with one class are combined into one group, so that the number of groups corresponds to the number of classes. The hidden neurons combined into one group approximate the dividing surfaces for the corresponding class. Let fi/j be the threshold activation function of the hidden neuron which learns to divide the examples x of the ith and jth classes Ωi and Ωj, respectively. The output y of the hidden neuron is y = fi/j(x) = 1, ∀ x ∈ Ωi, and y = fi/j(x) = – 1, ∀ x ∈ Ωj.
(1)
Assume a q = 3 classification problem with overlapping classes Ω1, Ω2 and Ω3 centred at C1, C2, and C3, as Fig. 2(a) depicts. The number of hidden neurons for this example is equal to 3. In Fig. 2(a) the lines f1/2, f1/3 and f2/3 depict the hyperplanes of the hidden neurons trained to divide the examples of the three pairs of classes, which are (1) Ω1 and Ω2, (2) Ω1 and Ω3, and (3) Ω2 and Ω3.
Fig. 2. The dividing surfaces (a) g1, g2, and g3, and the neural network (b) for q = 3 classes
Combining these hidden neurons into q = 3 groups, we build up the new hyperplanes g1, g2, and g3. The first one, g1, is a superposition of the hidden neurons f1/2 and f1/3, i.e., g1 = f1/2 + f1/3. The second and third hyperplanes are g2 = f2/3 – f1/2 and g3 = – f1/3 – f2/3, correspondingly. For hyperplane g1, the outputs f1/2 and f1/3 are summed with weights +1 because both give positive outputs on the examples of class Ω1. Correspondingly, for hyperplane g3, the outputs f1/3 and f2/3 are summed with weights –1 because they give negative outputs on the examples of class Ω3. Fig. 2(b) depicts for this example a neural network structure consisting of three hidden neurons f1/2, f1/3, and f2/3 and three output neurons g1, g2, and g3. The weight vectors of the hidden neurons here are w1, w2, and w3. These hidden neurons are connected to the output neurons with weights equal to (+1, +1), (–1, +1) and (–1, –1), respectively. We can see that in the general case of q > 2 classes, the neural network consists of q output neurons g1, …, gq and q(q – 1)/2 hidden neurons f1/2, …, fi/j, …, fq-1/q, where i < j = 2, …, q. Each output neuron gi is connected to (q – 1) hidden neurons which are partitioned into two groups. The first group consists of the hidden neurons fi/k for which k > i. The second group consists of the hidden neurons fk/i for which k < i. So, the weights of output neuron gi connected to the hidden neurons fi/k and fk/i are equal to +1 and –1, respectively. As the EEG features may be irrelevant to the binary classification problems, for learning the hidden neurons we use a bottom-up search strategy which selects features providing the best classification accuracy [11]. Below we discuss the experimental results of our neural-network technique applied to a real multi-class problem.
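The wiring just described can be summarised compactly: output neuron gi sums the q − 1 pairwise classifiers involving class i, with weight +1 for fi/k (k > i) and −1 for fk/i (k < i). The sketch below illustrates this combination scheme; the pairwise decision functions are hand-made placeholders rather than trained classifiers.

```python
import numpy as np
from itertools import combinations

def build_output_weights(q):
    """Weight matrix W (q x q(q-1)/2): W[i, p] = +1 if pair p is (i, k), -1 if (k, i)."""
    pairs = list(combinations(range(q), 2))
    W = np.zeros((q, len(pairs)))
    for p, (i, k) in enumerate(pairs):
        W[i, p] = +1.0      # f_{i/k} votes for class i with output +1
        W[k, p] = -1.0      # ... and for class k with output -1
    return W, pairs

def predict(x, pairwise, q):
    """Combine the +/-1 outputs of the pairwise classifiers into a class label."""
    W, pairs = build_output_weights(q)
    f = np.array([pairwise[(i, k)](x) for (i, k) in pairs])   # each in {+1, -1}
    g = W @ f                                                 # output-neuron activations
    return int(np.argmax(g))

# toy demonstration with q = 3 and hand-made pairwise rules on a scalar input
rules = {(0, 1): lambda x: 1 if x < 1.0 else -1,
         (0, 2): lambda x: 1 if x < 2.0 else -1,
         (1, 2): lambda x: 1 if x < 2.0 else -1}
print(predict(0.5, rules, q=3), predict(1.5, rules, q=3), predict(2.5, rules, q=3))
```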
4   Experiments and Results
To learn the 16-class concept from the 65 EEG records, we used the neural-network model described above. For training and testing we used 39399 and 19670 EEG segments, respectively. Correspondingly, for the q = 16 class problem, the neural network includes q(q – 1)/2 = 120 hidden neurons, or binary classifiers, with the threshold activation function (1).
The testing errors of the binary classifiers varied from 0 to 15%, as depicted in Fig. 3(a). The learnt classifiers exploit different sets of EEG features; the number of these features varies between 7 and 58, as depicted in Fig 3(b). Our method correctly classified 80.8% of the training and 80.1% of the testing examples taken from the 65 EEG records. Summing all the segments belonging to one EEG record, we can improve the classification accuracy up to 89.2% and 87.7% of the 65 EEG records for training and testing, respectively.
Fig. 3. The testing errors (a) and the number of features (b) for each of 120 binary classifiers
In Fig. 4 we depict the outputs of our model summed over all the testing EEG segments of two patients, who belong to the second and third age groups, respectively. In both cases, most of the segments were correctly classified. In addition, summing the outputs over all the testing EEG segments, we may provide a probabilistic interpretation of decisions. For example, we assign the patients to classes 2 and 3 with probabilities 0.92 and 0.58, respectively, calculated as the fractions of correctly classified segments. These probabilities give additional information about the confidence of the decisions. We compared the performance of our neural-network technique with standard data mining techniques on the same EEG data. First, we tried to train a standard feedforward neural network consisting of q = 16 output neurons and a predefined number of hidden neurons and input nodes. The number of hidden neurons was varied between 5 and 20, and the number of input nodes between 72 and 12 by using standard principal component analysis. Note that in our experiments we could not use more than 20 hidden neurons because even the fast Levenberg-Marquardt learning algorithm provided by MATLAB required an enormous computational effort. Because of the large number of training examples and classes, we could also not use more than 25% of the training data and, as a result, the trained neural network correctly classified less than 60% of the testing examples.
Fig. 4. Probabilistic interpretation of decisions for two patients
Second, we trained q = 16 binary classifiers to distinguish each class from the others. We defined the same activation function for these classifiers and trained them on the whole data set. However, the classification accuracy was 72% on the testing examples. Third, we induced a decision tree consisting of (q – 1) = 15 binary decision trees trained to split (q – 1) subsets of the EEG data. That is, the first classifier learnt to divide the examples taken from classes Ω1, …, Ω8 and Ω9, …, Ω16, the second classifier learnt to divide the examples taken from classes Ω1, …, Ω4 and Ω5, …, Ω8, and so on. However, the classification accuracy on the testing data was 65%.
5   Conclusion
For learning multi-class concepts from large-scale, heavily overlapping EEG data, we developed a neural-network technique and learning algorithm. Our neural network consists of hidden neurons which perform the binary classification for each pair of classes. The hidden neurons are trained separately and their outputs are then combined in order to perform the multiple classification. This technique has been used to learn a 16-class concept from 65 EEG records represented by 72 features, some of which were irrelevant. Having compared our technique with other classification methods on the same EEG data, we found that it gives better classification accuracy within an acceptable learning time. Thus, we conclude that the new technique we developed for learning multi-class neural-network models performs well on the clinical EEGs. We believe that this technique may also be used to solve other large-scale multi-class problems represented by many features, some of which are irrelevant.
Acknowledgments The research has been supported by the University of Jena (Germany). The authors are grateful to Frank Pasemann for enlightening discussions, Joachim Frenzel from the University of Jena for the clinical EEG records, and to Jonathan Fieldsend from the University of Exeter (UK) for useful comments.
References
[1] Riddington, E., Ifeachor, E., Allen, N., Hudson, N., Mapps, D.: A Fuzzy Expert System for EEG Interpretation. Neural Networks and Expert Systems in Medicine and Healthcare, University of Plymouth (1994) 291-302
[2] Anderson, C., Devulapalli, S., Stolz, E.: Determining Mental State from EEG Signals Using Neural Networks. Scientific Programming 4 (1995) 71-183
[3] Galicki, M., Witte, H., Dörschel, J., Doering, A., Eiselt, M., Grießbach, G.: Common Optimization of Adaptive Preprocessing Units and a Neural Network During the Learning: Application in EEG Pattern Recognition. J. Neural Networks 10 (1997) 1153-1163
[4] Breidbach, O., Holthausen, K., Scheidt, B., Frenzel, J.: Analysis of EEG Data Room in Sudden Infant Death Risk Patients. J. Theory Bioscience 117 (1998) 377-392
[5] Holthausen, K., Breidbach, O., Schiedt, B., Frenzel, J.: Clinical Relevance of Age Dependent EEG Signatures in Detection of Neonates at High Risk of Apnea. J. Neuroscience Letter 268 (1999) 123-126
[6] Holthausen, K., Breidbach, O., Scheidt, B., Schult, J., Queregasser, J.: Brain Dismaturity as Index for Detection of Infants Considered to be at Risk for Apnea. J. Theory Bioscience 118 (1999) 189-198
[7] Wackermann, J., Matousek, M.: From the EEG Age to Rational Scale of Brain Electric Maturation. J. Electroencephalogram Clinical Neurophysiology 107 (1998) 415-421
[8] Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis. Wiley and Sons, New York (1973)
[9] Quinlan, J.: Induction of Decision Trees. J. Machine Learning 1 (1986) 81-106
[10] Cios, K., Liu, N.: A Machine Learning Method for Generation of Neural Network Architecture. J. IEEE Transaction of Neural Networks 3 (1992) 280-291
[11] Galant, S.: Neural Network Learning and Expert Systems. MIT Press, Cambridge, MA (1993)
[12] Kononenko, I., Šimec, E.: Induction of Decision Trees using RELIEFF. In: Kruse, R., Viertl, R., Riccia, G. Della (eds.). CISM Lecture Notes, Springer Verlag (1994)
[13] Utgoff, P., Brodley, C.: Linear Machine Decision Trees. COINS Technical Report 91-10, University of Massachusetts, Amherst, MA (1991)
[14] Brodley, C., Utgoff, P.: Multivariate Decision Trees. COINS Technical Report 92-82, University of Massachusetts, Amherst, MA (1992)
[15] Frean, M.: A Thermal Perceptron Learning Rule. Neural Computation 4 (1992)
[16] Murthy, S., Kasif, S., Salzberg, S.: A System for Induction of Oblique Decision Trees. J. Artificial Intelligence Research 2 (1994) 1-33
[17] Hastie, T., Tibshirani, R.: Classification by Pairwise Coupling. Advances in Neural Information Processing Systems 10 (NIPS-97) (1998) 507-513
Establishing Safety Criteria for Artificial Neural Networks
Zeshan Kurd and Tim Kelly
Department of Computer Science, University of York, York, YO10 5DD, UK
{zeshan.kurd,tim.kelly}@cs.york.ac.uk
Abstract. Artificial neural networks are employed in many areas of industry such as medicine and defence. There are many techniques that aim to improve the performance of neural networks for safety-critical systems. However, there is a complete absence of analytical certification methods for neural network paradigms. Consequently, their role in safety-critical applications, if any, is typically restricted to advisory systems. It is therefore desirable to enable neural networks for highly-dependable roles. This paper defines safety criteria which, if enforced, would contribute to justifying the safety of neural networks. The criteria are a set of safety requirements for the behaviour of neural networks. The paper also highlights the challenge of maintaining performance in terms of adaptability and generalisation whilst providing acceptable safety arguments.
1   Introduction
Typical uses of ANNs (Artificial Neural Networks) in safety-critical systems include areas such as medical systems, industrial process and control, and defence. An extensive review of many areas of ANN use in safety-related applications has been provided by a recent U.K. HSE (Health & Safety Executive) report [1]. However, the roles for these systems have been restricted to advisory ones, since no convincing safety argument has yet been produced. They are typical examples of achieving improved performance without providing sufficient safety assurance. There are many techniques for traditional multi-layered perceptrons [2, 3] which aim to improve generalisation performance. However, these performance-related techniques do not provide acceptable forms of safety argument. Moreover, any proposed ANN model that attempts to satisfy safety arguments must carefully ensure that the measures incorporated do not diminish the advantages of the ANN. Many of the existing approaches which claim to contribute to safety-critical applications focus more upon improving generalisation performance than on generating suitable safety arguments. One particular type of ANN model employs diversity [2]: an ensemble of ANNs in which each member is devised by different methods to encapsulate as much of the target function as possible. Results demonstrate improvement in
generalisation performance over some test set, but lack the ability to analyse and determine the function performed by each member of the ensemble. Other techniques, such as the validation of ANNs and error bars [3], may help deal with uncertainty in network outputs. However, the ANN is still viewed as a black box and lacks satisfactory analytical processes to determine its overall behaviour. Finally, many development lifecycles [4] for ANNs lack provision for analytical processes such as hazard analysis and focus more upon black-box approaches. However, work performed on fuzzy-ANNs [5] attempts to approach development based upon refining partial specifications.
2   The Problem
The criticality of safety-critical software can be defined in terms of directly or indirectly contributing to the occurrence of a hazardous system state [6]. A hazard is a situation which is potentially dangerous to man, society or the environment [7]. When developing safety-critical software, there is a set of requirements which must be enforced; these are formally defined in many safety standards [8]. One of the main tools for determining requirements is the use of a safety case to encapsulate all safety arguments for the software. A safety case is defined in Defence Standard 00-55 [8] as: "The software safety case shall present a well-organised and reasoned justification based on objective evidence, that the software does or will satisfy the safety aspects of the Statement of Technical Requirements and the Software Requirements specification."
Establishing Safety Criteria for Artificial Neural Networks
165
it forms the backbone of all following processes. It aims to identify, manage and control all potential hazards in the proposed system. Risk Analysis and Functional Hazard Analysis (FHA) analyses the severity and probability of potential accidents for each identified hazard. Preliminary System Safety Assessment (PSSA) [9] ensures that the proposed design will refine and adhere to safety requirements and help guide the design process. System Safety Analysis (SSA) demonstrates through evidence that safety requirements have been met. It uses inductive and deductive techniques to examine the completed design and implementation. Finally, the Safety Case [11] generated throughout development delivers a comprehensible and defensible argument that the system is acceptably safe to use in a given context. Any potential safety case must overcome problems associated with typical neural networks. Some typical problems may be concerned with ANN structure and topology. These are factors that may influence the generalisation performance of the ANN. Another problem lies in determining the training and test set where they must represent the desired function using a limited number of samples. Dealing with noise during training is also problematic to ensure that the network does not deviate from the required target function. Other issues related to the learning or training process may involve forgetting of data particularly when using the back-propagation algorithm. This could lead to poor generalisation and long training times. Furthermore, problems during training may result in learning settling in local minima instead of global and deciding upon appropriate stopping points for the training process. This could be aided by cross-validation [12] but relies heavily on test sets. One of the key problems is the inability to analyse the network such that a whitebox view of the behaviour can be presented. This contributes to the need for using test sets to determine generalisation performance as an overall error and the lack of explanation mechanisms for network outputs. The inability to analyse also makes it difficult to identify and control potential hazards in the system and provide assurance that some set of requirements are met.
3   Safety Criteria for Neural Networks
We have attempted to establish safety criteria for ANNs (Artificial Neural Networks) that defines minimum behavioural properties which must be enforced for safetycritical contexts. By defining requirements from a high-level perspective, the criteria are intended to apply to most types of neural networks. Figure 1.0 illustrates the safety criteria in the form of Goal Structuring Notation (GSN) [11] which is commonly used for composing safety case patterns. Each symbol in figure 1.0 has a distinct meaning and is a subset of the notation used in GSN. The boxes illustrate goals or sub-goals which need to be fulfilled. Rounded boxes denote the context in which the corresponding goal is stated. The rhomboid represents strategies to achieve goals. Diamonds underneath goals symbolise that the goal requires further development leading to supporting arguments and evidence.
[Figure 1 presents the safety criteria as a GSN goal structure. Top goal G1: "Neural network is acceptably safe to perform a specified function within the safety-critical context", argued via strategy S1: "Argument over key safety criteria", in the context of C1 (neural network model definition), C2 (use of the network in a safety-critical context must ensure specific requirements are met) and C3 ("acceptably safe" will be determined by the satisfaction of the safety criteria). Sub-goals: G2 "Pattern matching functions for the neural network have been correctly mapped" (C4: the function may partially or completely satisfy the target function), G3 "Observable behaviour of the neural network must be predictable and repeatable" (C5: known and unknown inputs), G4 "The neural network tolerates faults in its inputs" (C6: a fault is classified as an input that lies outside the specified input set), and G5 "The neural network does not create hazardous outputs" (C7: a hazardous output is defined as an output outside a specified set or target function).]
Fig. 1. Preliminary Safety Criteria for Artificial Neural Networks
The upper part of the safety criteria consists of a top goal, a strategy and other contextual information. The goal G1, if achieved, allows the neural network to be used in safety-critical applications. C1 requires that a specific ANN model is defined, such as the multi-layered perceptron or other models. C2 intends for the ANN to be used when conventional software or other systems cannot provide the desired advantages. C3 highlights that 'acceptably safe' is related to product- and process-based arguments and will rely heavily on sub-goals. The strategy S1 will attempt to generate safety arguments from the sub-goals (which form the criteria) to fulfil G1. The goals G2 to G5 present the following criteria: Criterion G2 ensures that the function performed by the network represents the target or desired function. The function represented by the network may be considered as input-output mappings, and the term 'correct' refers to the target function. As expressed in C4, the network may also represent a subset of the target function. This is a more realistic condition if all hazards can be identified and mitigated for the subset. This may also avoid concerns about attempting to solve a problem where analysis may not be able to determine whether totality of the target function is achieved. Previous work on dealing with and refining partial specifications for neural networks [5] may apply. However, additional methods to analyse specifications in terms of performance and safety (the existence of potential hazards) may be necessary. Forms of sub-goals or strategies for arguing G2 may involve using analytical methods such as decompositional approaches [13]. These attempt to extract behaviour
by analysing the intrinsic structure of the ANN, such as each neuron or weight. This will help analyse the function performed by the ANN and help present a white-box view of the network. Techniques to achieve this may involve determining suitable ANN architectures in which a meaningful mapping exists from each network parameter to some functional description. On the other hand, pedagogical approaches involve determining the function by analysing outputs for input patterns, as in sensitivity analysis [14]. This methodology, however, maintains a black-box perspective and will not be enough to provide satisfactory arguments for G2. Overall, approaches must attempt to overcome problems associated with the ability of the ANN to explain its outputs and generalisation behaviour. Criterion G3 will provide assurance that safety is maintained during ANN learning. The 'observable behaviour' of the neuron means the input and output mappings that take place regardless of the weights stored on each connection. The ANN must be predictable given examples learnt during training. 'Repeatable' ensures that any previously valid mapping or output does not become flawed during learning. The forgetting of previously learnt samples must not occur given adaptation to a changing environment. Possible forms of arguments may be associated with functional properties. This may involve providing assurance that learning maintains safety by abiding by some set of behavioural constraints identified through processes such as hazard analysis. Criterion G4 ensures the ANN is robust and safe under all input conditions. An assumption is made that the ANN might be exposed to training samples that do not represent the desired function. Achievement of this goal is optional, as it could be considered specific to the application context. The network must either detect these inputs and suppress them, or ensure a safe state with respect to the output and weight configuration. Other possible forms of argument and solutions may involve 'gating networks' [15], which receive data within a specific validation area of the input space. Criterion G5 is based upon arguments similar to G2. However, this goal focuses solely on the network output. This will result in a robust ANN which, through training and utilisation, ensures that the output is not hazardous regardless of the integrity of the input. For example, output monitors or bounds might be used as possible solutions (a simple sketch is given at the end of this section). Other possible forms of arguments may include derivatives of 'guarded networks' [15], which receive all data but are monitored to ensure that behaviour does not deviate too much from expectations. The above criteria focus on product-based arguments, in contrast to the process-based arguments commonly used for conventional software [16]. This follows an approach which attempts to provide arguments based upon the product rather than a set of processes that have been routinely carried out [16]. However, process- and product-based arguments may be closely coupled for justifying the safety of ANNs. For example, methods such as developing ANNs using formal languages [17] might reduce faults incorporated during implementation. However, the role of analytical tools is highly important and will involve hazard analysis for identifying potential hazards. Some factors which prevent this type of analysis can be demonstrated in typical monolithic neural networks. These scatter their functional behaviour around their
168
Zeshan Kurd and Tim Kelly
weights in incomprehensible fashion resulting in black-box views and pedagogical approaches. Refining various properties of the network such as structural and topological factors may help overcome this problem such as modular ANNs [18]. It may help improve the relationship between intrinsic ANN parameters and meaningful descriptions of functional behaviour. This can enable analysis in terms of domainspecific base functions associated with the application domain [19]. Other safety processes often used for determining existence of potential hazards may require adaptation for compatibility with neural network paradigms.
4
Performance vs. Safety Trade-off
One of the main challenges when attempting to satisfy the criteria is constructing suitable ANN models. The aim is to maintain feasibility by providing acceptable performance (such as generalisation) whilst generating acceptable safety assurance. Certain attributes of the model may need to be limited or constrained so that particular arguments are possible. However, this could lead to over-constraining the performance of ANNs. Typical examples may include permissible learning during development but not whilst deployed. This may provide safety arguments about the behaviour of the network without invalidating them with future changes to implementation. The aim is to generate acceptable safety arguments providing comparable assurance to that achieved with conventional software.
5
Conclusion
To justify the use of ANNs within safety critical applications will require the development of a safety case. For high-criticality applications, this safety case will require arguments of correct behaviour based both upon analysis and test. Previous approaches to justifying ANNs have focussed predominantly on (test-based) arguments of high performance. In this paper we have presented the key criteria for establishing an acceptable safety case for ANNs. Detailed analysis of potential supporting arguments is beyond the scope of this paper. However, we have discussed how it will be necessary to constrain learning and other factors, in order to provide analytical arguments and evidence. The criteria also highlight the need for adapting current safety processes (hazard analysis) and devising suitable ANN models that can accommodate them. The criteria can be considered as a benchmark for assessing the safety of artificial neural networks for highly-dependable roles.
References [1] [2]
Lisboa, P., Industrial use of safety-related artificial neural networks. Health & Safety Executive 327, (2001). Sharkey, A.J.C. and N.E. Sharkey, Combining Diverse Neural Nets, in Computer Science. (1997), University of Sheffield: Sheffield, UK.
Establishing Safety Criteria for Artificial Neural Networks
[3] [4] [5]
[6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19]
169
Nabney, I., et al., Practical Assessment of Neural Network Applications. (2000), Aston University & Lloyd's Register: UK. Rodvold, D.M. A Software Development Process Model for Artificial Neural Networks in Critical Applications. in Proceedings of the 1999 International Conference on Neural Networks (IJCNN'99). (1999). Washington D.C. Wen, W., J. Callahan, and M. Napolitano, Towards Developing Verifiable Neural Network Controller, in Department of Aerospace Engineering, NASA/WVU Software Research Laboratory. (1996), West Virginia University: Morgantown, WV. Leveson, N., Safeware: system safety and computers. (1995): Addison-Wesley. Villemeur, A., Reliability, Availability, Maintainability and Safety Assessment. Vol. 1. (1992): John Wiley & Sons. MoD, Defence Standard 00-55: Requirements for Safety Related Software in Defence Equipment. (1996), UK Ministry of Defence. SAE, ARP 4761 - Guidelines and Methods for Conducting the Safety Assessment Process on Civil Airborne Systems and Equipment. (1996), The Society for Automotive Engineers,. MoD, Interim Defence Standard 00-58 Issue 1: HAZOP Studies on Systems Containing Programmable Electronics. (1996), UK Ministry of Defence. Kelly, T.P., Arguing Safety – A Systematic Approach to Managing Safety Cases, in Department of Computer Science. (1998), University of York: York, UK. Kearns, M., A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split, AT&T Bell Laboratories. Andrews, R., J. Diederich, and A. Tickle, A survey and critique of techniques for extracting rules from trained artificial neural networks. (1995), Neurocomputing Research Centre, Queensland University of Technology. Kilimasaukas, C.C., Neural nets tell why. Dr Dobbs's, (April 1991): p. 16-24. Venema, R.S., Aspects of an Integrated Neural Prediction System. (1999), Rijksuniversiteit, Groningen: Groningen, Netherlands. Weaver, R.A., J.A. McDermid, and T.P. Kelly. Software Safety Arguments: Towards a Systematic Categorisation of Evidence. in International System Safety Conference. (2002). Denver, CO. Dorffner, G., H. Wiklicky, and E. Prem, Formal neural network specification and its implications on standardisation. Computer Standards and Interfaces, (1994). 16 (205-219). Osherson, D.N., S. Weinstein, and M. Stoli, Modular Learning. Computational Neuroscience, Cambridge - MA, (1990): p. 369-377. Zwaag, B.J. and K. Slump. Process Identification Through Modular Neural Networks and Rule Extraction. in 5th International FLINS Conference. (2002). Gent, Belgium: World Scientific.
Neural Chaos Scheme of Perceptual Conflicts Haruhiko Nishimura1 , Natsuki Nagao2 , and Nobuyuki Matsui3 1
Hyogo University of Education 942-1 Yashiro-cho, Hyogo 673-1494, Japan
[email protected] 2 Kobe College of Liberal Arts 1-50 Meinan-cho, Akashi-shi, Hyogo 673-0001, Japan
[email protected] 3 Himeji Institute of Technology 2167 Shosya, Himeji-shi, Hyogo 671-2201, Japan
[email protected] Abstract. Multistable perception is perception in which two (or more) interpretations of the same ambiguous image alternate while an obserber looks at them. Perception undergoes involuntary and random-like change. The question arises whether the apparent randomness of alternation is real (that is, due to a stochastic process) or whether any underlying deterministic structure to it exists. Upon this motivation, we have examined the spatially coherent temporal behaviors of multistable perception model based on the chaotic neural network from the viewpoint of bottom-up high dimensional approach. In this paper, we focus on dynamic processes in a simple (minimal) system which consists of three neurons, aiming at further understanding of the deterministic mechanism.
1
Introduction
Multistable perception is perception in which two (or more) interpretations of the same ambiguous image alternate spontaneously while an obserber looks at them. Figure-ground, perspective (depth) and semantic ambiguities are well known (As an overview, for example, see [1] and [2]). Actually, when we view the Necker cube which is a classic example of perspective alternation, a part of the figure is perceived either as front or back of a cube and our perception switches between the two different interpretations. In this circumstance the external stimulus is kept constant, but perception undergoes involuntary and random-like change. The measurements have been quantified in psychophysical experiments and it has become evident that the frequency of the time intervals spent on each percept T (n) in Fig.1 is approximately Gamma distributed [3, 4]. The Gestalt psychologist Wolfgang K¨ ohler [5] claimed that perceptual alternation occurs owing to different sets of neurons getting “tired” of firing after they have done so for a long time. The underlying theoretical assumptions are that: 1. different interpretations are represented by different patterns of neural activity, 2. perception corresponds to whichever pattern is most active in the V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 170–176, 2003. c Springer-Verlag Berlin Heidelberg 2003
Neural Chaos Scheme of Perceptual Conflicts
171
t
T(2)
T(1)
T(3)
Fig. 1. A schematic example of switching behavior brain at the time, 3. neural fatigue causes different patterns of activation to dominate at different times. Upon this neural fatigue hypothesis, synergetic Ditzinger-Haken model [6, 7] was proposed and is known to be able to reproduce the Gamma distribution of T well by subjecting the time-dependent attention parameters to stochastic forces (white noises). However, the question arises whether the apparent randomness of alternation is real (that is, due to a stochastic process) or whether any underlying deterministic structure to it exists. Until now diverse types of chaos have been confirmed at several hierarchical levels in the real neural systems from single cells to cortical networks (e.g. ionic channels, spike trains from cells, EEG) [8]. This suggests that artificial neural networks based on the McCulloch-Pitts neuron model should be re-examined and re-developed. Chaos may play an essential role in the extended frame of the Hopfield neural network beyond the only equilibrium point attractors. Following this idea, we have examined the spatially coherent temporal behaviors of multistable perception model based on the chaotic neural network [9, 10] from the viewpoint of bottom-up high dimensional approach. In this paper, we focus on dynamic processes in a simple (minimal) system which consists of three neurons, aiming at further understanding of the deterministic mechanism.
2
Model and Method
The chaotic neural network (CNN) composed of N chaotic neurons is described as [11, 12] Xi (t + 1) = f (ηi (t + 1) + ζi (t + 1)) ηi (t + 1) =
N
wij
j=1
ζi (t + 1) = −α
t
(1)
kfd Xj (t − d)
(2)
krd Xi (t − d) − θi
(3)
d=0 t d=0
where Xi : output of neuron i(−1 ≤ Xi ≤ 1), wij : synaptic weight from neuron j to neuron i, θi : threshold of neuron i, kf (kr ) : decay factor for the feed-
172
Haruhiko Nishimura et al.
back(refractoriness) (0 ≤ kf , kr < 1), α : refractory scaling parameter, f : output function defined by f (y) = tanh(y/2ε) with the steepness parameter ε. Owing to the exponentially decaying form of the past influence, Eqs.(2) and (3) can be reduced to ηi (t + 1) = kf ηi (t) +
N
wij Xj (t)
(4)
j=1
ζi (t + 1) = kr ζi (t) − αXi (t) + a
(5)
where a is temporally constant a ≡ −θi (1 − kr ). All neurons are updated in parallel, that is, synchronously. The network corresponds to the conventional discrete-time Hopfield network when α = kf = kr = 0 (Hopfield network point (HNP)). Under an external stimulus, Eq.(1) is influenced as (6) Xi (t + 1) = f ηi (t + 1) + ζi (t + 1) + σi where {σi } is the effective term by external stimulus. The two competitive interpretations are stored in the network as minima of the energy map : E=−
1 wij Xi Xj 2 ij
(7)
at HNP. The conceptual picture of our model is shown in Fig.2. Under the external stimulus {σi }, chaotic activities arise on the neural network and cause the transitions between stable states of HNP. This situation corresponds to the dynamic multistable perception.
3
Simulations and Results
To carry out computational experiments, we consider a minimal network which consists of three neurons (N = 3). Of 23 possible states, Two interpretation patterns {ξi11 } = (1, −1, 1) and {ξi12 } = (−1, 1, −1) are chosen and then the corresponding bistable landscape is made on the network dynamics by 0 −2 2 1 {wij } = −2 0 −2 . 3 2 −2 0 In the case of HNP, the stored patterns {ξi11 } and {ξi12 } are stable (with E = −2) and the remaining 6 patterns are all unstable (with E = 23 ). (−1, −1, 1), (1, 1, 1) and (1, −1, −1) converge to {ξi11 }, and (1, 1, −1), (−1, −1, −1) and (−1, 1, 1) converge to {ξi12 } as shown in Fig.3.
Neural Chaos Scheme of Perceptual Conflicts
173
1 {σi } = s { ξ i }
stimulus
E{X}
(−1, 1, 1)
( 1, 1, 1)
ambiguous 1 { ξi }(1,1,0) 12
stable { ξi } (−1, 1,−1)
( 1, 1,−1) 11
11
{ξ i }
1
{ξ i }
12
{ξ i }
stable { ξi } (1, −1, 1)
(−1,−1, 1)
{X} (−1,−1,−1)
Fig. 2. Conceptual picture illustrating state transitions induced by chaotic activity
( 1,−1,−1)
Fig. 3. Pattern states for N=3 neurons corresponding to the ambiguous figure {ξi1 } and its interpretations {ξi11 } and {ξi12 }, and flow of the network at HNP
Figure 4 shows a time series evolution of CNN (kf = 0.5, kr = 0.8, α = 0.46, a = 0, ε = 0.015) under the stimulus {σi } = 0.59{ξi1}. Here, m11 (t) =
N 1 11 ξ Xi (t) N i=1 i
(8)
and is called the overlap of the network state {Xi } and the interpretation pattern {ξi11 }. A switching phenomenon between {ξi11 } (m11 = 1.0) and {ξi12 } (m11 = −1.0) can be observed. Bursts of switching are interspersed with prolonged periods during which {Xi } trembles near {ξi11 } or {ξi12 }. Evaluating the maximum Lyapunov exponent to be positive (λ1 = 0.30), we find that the network is dynamically in chaos. Figure 5 shows the return map of the active potential hi (t) = ηi (t) + ζi (t) + σi of a neuron(#3). In the cases λ1 < 0, such switching phenomena do not arise. From the 2×105 iteration data (until t = 2×105 ) of Fig.4, we get 1545 events T (1) ∼ T (1545) staying near one of the two interpretations, {ξi12 }. They have various persistent durations which seem to have a random time course. From the evaluation of the autocorrelation function for T (n), C(k) =< T (n + k)T (n) > − < T (n + k) >< T (n) > (here, means an average over time), we find the lack of even short term correlations (−0.06 < C(k)/C(0) < 0.06 for all k). This indicates that successive durations T (n) are independent. The frequency of
174
Haruhiko Nishimura et al.
neuron#:3 2
hi(t+1)
ξ
ξ
ξ
0
−2
−2
0
2
hi(t)
Fig. 4. Time series of the overlap with Fig. 5. Return map of the active {ξi11 } and the energy map under the stim- potential hi of #3 neuron for the ulus {ξi1 } data up to t=5000. Solid line traces its behavior for a typical T term t=3019 to 3137 0.3
Frequency
0.25 0.2 0.15 0.1 0.05 0 30
60
90
120 150 180 210 240 270 300 T
Fig. 6. Frequency distribution of the persistent durations T (n) and the corresponding Gamma distribution
occurrence of T is plotted for 1545 events in Fig.6. The distribution is well fitted by Gamma distribution ˜
G(T˜ ) =
bn T˜ n−1 e−bT Γ (n)
(9)
with b = 2.03, n = 8.73(χ2 = 0.0078, r = 0.97), where Γ (n) is the Euler-Gamma function. T˜ is the normalized duration T /15 and here 15 step interval is applied to determine the relative frequencies.
Neural Chaos Scheme of Perceptual Conflicts
175
Frequency
0.3
TI=50
0.2 0.1 0
Lower cortical area
Higher cortical area
Frequency
0.3
TI=75
0.2 0.1 0
Neuron assembly
Neuron assembly
Frequency
0.3
TI=100
0.2 0.1 0
Ping-pong
Fig. 7. Conceptual diagram of cortico-cortical interaction within the cortex
30
60
90
120
150
180
210
240
270
300
T
Fig. 8. Frequency distribution of the persistent durations T (n) and the corresponding Gamma distribution, under the background cortical rhythm with bs=0.005
The results are in good agreement with the characteristics of psychophysical experiments [3, 4]. It is found that aperiodic spontaneous switching does not necessitate some stochastic description as in the synergetic D-H model [6, 7]. In our model, the neural fatigue effect proposed by K¨ ohler [5] is considered to be supported by the neuronal refractriness α, and its fluctuation originates in the intrinsic chaotic dynamics of the network. The perceptual switching in binocular rivalry shares many features with multistable perception, and might be the outcome of a general (common) neural mechanism. Ping-pong style matching process based on the interaction between the lower and higher cortical area [13] in Fig.7 may serve as a candidate of this mechanism. According to the input signal from the lower level, the higher level feeds back a suitable candidate among the stored templates to the lower level. If the lower area cannot get a good match, the process starts over again and lasts until a suitable interpretation is found. The principal circuitry among the areas by the cortico-cortical fibers is known to be uniform over the cortex. As a background rhythm from the ping-pong style matching process, we introduce the following sinusoidal stimulus changing between bistable patterns {ξi11 } and {ξi12 }: S1 (t) = bs · cos(2πt/T0 ), S2 (t) = −bs · cos(2πt/T0 ), and S3 (t) = bs · cos(2πt/T0 ), where T0 = 2TI is the period and bs is the background strength. As we can see from the three cases TI = 50, 75, and 100 in Fig.8, the mean value of T is well controlled by the cortical matching period.
176
4
Haruhiko Nishimura et al.
Conclusion
We have shown that the deterministic neural chaos leads to perceptual alternations as responses to ambiguous stimuli in the chaotic neural network. Its emergence is based on the simple process in a realistic bottom-up framework. Our demonstration suggests functional usefulness of the chaotic activity in perceptual systems even at higher cognitive levels. The perceptual alternation appears to be an inherent feature built in the chaotic neuron assembly.It will be interesting to study the brain with the experimental technique (e.g., fMRI) under the circumstance where the perceptual alternation is running.
References [1] Attneave, F.: Multistability in perception. Scientific American 225 (1971) 62–71 170 [2] Kruse, P., Stadler, M., eds.: Ambiguity in Mind and Nature. Springer-Verlag (1995) 170 [3] Borsellino, A., Marco, A. D., Allazatta, A., Rinsei, S., Bartolini, B.: Reversal time distribution in the perception of visual ambiguous stimuli. Kybernetik 10 (1972) 139–144 170, 175 [4] Borsellino, A., Carlini, F., Riani, M., Tuccio, M. T., Marco, A. D., Penengo, P., Trabucco, A.: Effects of visual angle on perspective reversal for ambiguous patterns. Perception 11 (1982) 263–273 170, 175 [5] K¨ ohler, W.: Dynamics in psychology. Liveright, New York (1940) 170, 175 [6] Ditzinger, T., Haken, H.: Oscillations in the perception of ambiguous patterns: A model based on synergetics. Biological Cybernetics 61 (1989) 279–287 171, 175 [7] Ditzinger, T., Haken, H.: The impact of fluctuations on the recognition of ambiguous patterns. Biological Cybernetics 63 (1990) 453–456 171, 175 [8] Arbib, M. A.: The Handbook of Brain Theory and Neural Networks. MIT Press (1995) 171 [9] Nishimura, H., Nagao, N., Matsui, N.: A perception model of ambiguous figures based on the neural chaos. In Kasabov, N., et al., eds.: Progress in ConnectionistBased Information Systems. Volume 1. Springer-Verlag (1997) 89–92 171 [10] Nagao, N., Nishimura, H., Matsui, N.: A neural chaos model of multistable perception. Neural Processing Letters 12 (2000) 267–276 171 [11] Aihara, K., Takabe, T., Toyoda, M.: Chaotic neural networks. Phys.Lett. A 144 (1990) 333–340 171 [12] Nishimura, H., Katada, N., Fujita, Y.: Dynamic learning and retrieving scheme based on chaotic neuron model. In Nakamura, R., et al., eds.: Complexity and Diversity. Springer-Verlag (1997) 64–66 171 [13] Mumford, D.: On the computational architecture of the neocortex ii. the role of cortico-cortical loops. Biol. Cybern. 66 (1992) 241–251 175
Learning of SAINNs from Covariance Function: Historical Learning Paolo Crippa and Claudio Turchetti Dipartimento di Elettronica ed Automatica, Universit` a Politecnica delle Marche Via Brecce Bianche, I-60131 Ancona, Italy {pcrippa, turchetti}@ea.univpm.it
Abstract. In this paper the learning capabilities of a class of neural networks named Stochastic Approximate Identity Neural Networks (SAINNs) have been analyzed. In particular these networks are able to approximate a large class of stochastic processes from the knowledge of their covariance function.
1
Introduction
One attracting property of neural networks is their capability in approximating stochastic processes or, more in general, input-output transformations of stochastic processes. Here we consider a wide class of nonstationary processes for which a canonical representation can be defined from the Karhunen-Lo`eve theorem [1]. Such a representation constitutes a model that can be used for the definition of stochastic neural networks based on the Approximate Identity (AI) functions whose approximating properties are widely studied in [3, 2]. The aim of this work is to show that these stochastic neural networks are able to approximate a given stochastic process belonging to this class, provided only its covariance function is known. This property has been called historical learning.
2
Approximation of Stochastic Processes by SAINNs
Let us consider a stochastic process (SP) admitting the canonical representation ϕ (t) = Φ (t, λ) η (dλ) , (1) Λ
where η (dλ) is a stochastic measure and Φ (t, λ) is a family of complete realvalued functions of λ ∈ Λ depending on the parameter t ∈ T such that 2 Φ (t, λ) Φ (s, λ)F (dλ) with F (∆λ) = E{|η (∆λ)| } (2) B (s, t) = Λ
In this Section we want to show that a SP ϕ(t) of this kind can be approximated in mean square by neural networks of the kind ψm um (t) (3) ψ (t) = m
V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 177–184, 2003. c Springer-Verlag Berlin Heidelberg 2003
178
Paolo Crippa and Claudio Turchetti
where ψm are random variables and um (t) are AI functions [2]. These networks are named Stochastic Approximate Identity Neural Networks (SAINNs). For this purpose, let be given the SP ϕ (t) admitting the canonical representation 2 (1) where Φ (t, λ) is L2 -summable, i.e. T Λ |Φ (t, λ)| F (dλ) dt < +∞. Let us consider the function Ψ (t, λ) = am um (t) um (λ) (4) m
where um (·) are AI functions. As it has been shown in [2], the set of functions Ψ (t, λ) is dense in L 2 (Λ × T ). Thus, ∀ ε > 0 it results d (Ψ (t, λ) , Φ (t, λ)) < ε, ∀ t ∈ T , λ ∈ Λ, where d( , ) is the usual distance. As the stochastic measure η (∆) establishes a correspondence U between L 2 (Λ) and L 2 (ϕ), therefore a process ψ (t) corresponds through U to the function Ψ (t, λ) ∈ L 2 (Λ) with canonical representation given by Ψ (t, λ) η (dλ) . (5) ψ (t) = Λ
Due to the isometry property of U the following equality holds E{|ϕ (t) − ψ (t)|2 } 2 = Λ |Φ (t, λ) − Ψ (t, λ)| F (dλ), ∀ t ∈ T . By taking advantage of the dense prop 2 erty of the functions Ψ (t, λ) we have T Λ |Φ (t, λ) − Ψ (t, λ)| F (dλ)dt → 0 that 2 implies E 2 (t) = Λ |Φ (t, λ) − Ψ (t, λ)| F (dλ) → 0 in all Λ but a set belonging 2 to Λ of null measure, and thus also E{|ϕ (t) − ψ (t)| } → 0. From what showed above we conclude that the SP ψ (t) defined by (5) is able to approximate in mean square the given process ϕ (t) [4]. We can also write ψ (t) = am um (t) um (λ) η (dλ) = am um (t) um (λ) η (dλ) . (6) Λ
m
m
Λ
By defining the random variables ψm = am
Λ
um (λ) η (dλ) ,
(7)
we finally demonstrate that (3) holds. Eq. (3) can be viewed as a stochastic neural network expressed in terms of linear combination of AI functions, through the coefficients um , being um random variables whose statistics depends on the process ϕ (t) to be approximated. It is clear the similarity of (3) to Karhunen-Lo`eve representation, although in K.L. expansion the functions of time are eigenfunctions dependent on the covariance function, while in (3) they are AI functions.
3
Learning of SAINNs from Covariance Function
The process ψ (t) in (3) defines a stochastic neural network that, on the basis of what previously proven, is able to approximate any given process ϕ (t) belonging to the class of SP’s (nonstationary in general) admitting a canonical form. As
Learning of SAINNs from Covariance Function: Historical Learning
179
the process ψ (t) depends on function Ψ (t, λ) = m am um (t) um (λ), which approximates the function Φ (t, λ), it is essential to know Φ (t, λ) in order to perform the learning of the neural network. To this end it is convenient to treat the case of finite T separately to the case of infinite T . Finite T In this case the set Λ is countable, as it follows from Kahrunen-Lo`eve theory [1, 4], thus the approximating process ψ(t) reduces to: ψ (t) = ψm um (t) with ψm = am um (λj ) η (λj ) . (8) m
j
Moreover we can write (8) as ψ (t) = am um (t) um (λj ) η (λj ) . j
m
(9)
Karhunen-Lo`eve theory establishes that ϕ (t) can be expanded as ϕ (t) = Φ (t, λj ) η (λj ) .
(10)
Comparing (9) and (10), and assuming the approximation Φ (t, λj ) ≈ am um (t) um (λj )
(11)
j
m
then we have ϕ (t) ≈ ψ (t) .
(12)
More rigorously
2
E |ϕ (t) − ψ (t)|2 = E
am um (t) um (λj ) η (λj )
Φ (t, λj ) −
j m
2
= am um (t) um (λj ) E |η (λj )| 2
Φ (t, λj ) −
m j
2
= am um (t) um (λj ) λ2j → 0 , (13)
Φ (t, λj ) −
m
j
where the orthogonality of the stochastic measure η (λj ) and the approximating property of the AINN’s have been adopted. In the case of finite T the functions Φ (t, λj ) are known, since they are the eigenfunctions of the integral operator defined by the covariance function B (s, t). Therefore (13) can be viewed as a learning relationship of the stochastic neural network. The other learning relationship is given by η (λj ) = ϕ (t) Φ (t, λ j ) dt . (14) T
180
Paolo Crippa and Claudio Turchetti
Thus, in conclusion, in the case of finite T from the knowledge of covariance function B (s, t) (or an estimation of it) we can define the stochastic neural network ψ (t) = ψm um (t) , ψm = am um (λj ) η (λj ) (15) m
m
approximating the process ϕ (t), through the relationships am um (t) um (λj )| 2 → 0 i) | Φ (t, λj ) − m ii ) η (λj ) = ϕ (t) Φ (t, λ j ) dt .
(16)
T
It is worth to note that when ϕ (t) is Gaussian, η (λj ) are independent Gaussian random variables. This case is of particular interest for applications because the statistical behavior of η (λj ) is completely specified by the variance E{|η (λj )|2 }, and thus the eigenvalues λj . As an example of the complexity of the approximating problem let us consider a stationary process ϕ (t) with covariance function B (s, t) = a exp (−a |t − s|)
(17)
on the interval [0, T ]. The eigenvalue equation in this case is
T
a exp (−a |t − s|) Φ (s) ds = λ Φ (t) .
(18)
0
In order to solve this integral equation it is useful to rewrite the integral as the convolution over the entire real axis (−∞, +∞) h (t) ∗ Φ (t) = λ Φ (t)
with h (t) = a exp (−a |t|)
(19)
constraining (19) to the boundary conditions as derived from (18) at the boundaries of the time interval [0,T], so that (19) is identical of (18). By Fourier transform (19) becomes H (jω) · U (jω) = λU (jω)
(20)
where H (jω), U (jω) are the Fourier transforms of h (t), Φ (t) respectively. Since it results H (jω) = 2a2 a2 + ω 2 , (21) eq. (20) reduces to 2 2 2a a + ω 2 U (jω) = λU (jω)
(22)
and the eigenvalues λ are given by λ = 2a2
2 a + ω2 .
(23)
Learning of SAINNs from Covariance Function: Historical Learning
181
Applying the inverse Fourier transform to (22), a second-order differential equation results Φ (t) + [(2 − λ) /λ] a2 Φ (t) = 0 (24) Solving this equation requires two boundary conditions on Φ (t) and its derivative Φ (t) that can be easily obtained from (18): Φ (T ) /Φ (T ) = −1/a and Φ (0) /Φ (0) = 1/a. The general solution of (24) is given by Φ (t) = A sin (ωt + α)
(25)
and the above boundary conditions take the form tan (ωT + α) = −ω/a and tan α = ω/a, respectively. After some manipulations these boundary conditions reduce to a unique equation tan (ωk T ) = −2aωk a2 + ωk (26) where the index k means that a countable set of values of ωk satisfying (26) exists. Eq. (23) establishes that a corresponding set of values λk exists, given by ∆λk = 2a2
2 a + ωk2
, k = 1, 2, . . .
(27)
thus all the eigenfunctions are expressed as Φk (t) = Ak sin (ωk t + αk )
(28)
where the constants Ak , αk are determined from orthonormalization conditions T Φk (t) Φl (t) dt = δkl . (29) 0
As for as the stochastic neural network is concerned, (16) apply in this case where Φ is given by (28). Infinite T In this case the set Λ is not countable, and the problem of determining a complete set of functions Φ (t, λ) falls in the theory of operators whose eigenvalues form a noncountable set. Since this subject is out of the scope of this work, we restrict the treatment of this case to a specific class of random processes. Let us assume Φ (t, λ) = a (t) exp (iλt)
(30)
where a (t) is an arbitrary function (such that ∀ λ ∈ Λ, Φ (t, λ) is L2 -integrable), then the set of functions {Φ (t, λ)}, yielded by varying the parameter t ∈ T , is a complete family in L2 (F ). Any linear combination of Φ (t, λ) can be written as bk Φ (tk , λ) = bk a (tk ) exp (iλtk ) = ck exp (iλtk ) . (31) k
k
k
Yet from the theory of Fourier integral it is known that the set {exp (iλt)} is complete in L2 (F ), meaning that linear combinations of such kind generate the
182
Paolo Crippa and Claudio Turchetti
entire space L2 (F ). As a consequence also the set { k bk Φ (tk , λ)} has the same property of generating the space L2 (F ), which is equivalent to say that, given any function f (λ) ∈ L2 (F ), it results f (λ) = limN →∞ N k=1 bk Φ (tk , λ). The covariance function of the process ϕ (t) can be written as B (s, t) = Φ (t, λ) Φ (s, λ)F (dλ) = a (t) a (s) exp [iλ (t − s)] F (dλ) (32) Λ
Λ
and the canonical representation (1) reduces to ϕ (t) = a (t) ζ (t)
(33)
where ζ (t) =
exp (iλt) η (dλ)
(34)
Λ
is a stationary SP since the canonical representation holds for it. Thus we conclude that the canonical representation in (1), with Φ (t, λ) given by (30), is valid for the class of nonstationary processes which can be expressed in the form ϕ (t) = a (t) ζ (t), being a (t) an arbitrary function and ζ (t) a stationary process. From (1) it can be shown that the stochastic measure η (∆λ), with ∆λ = [λ1 , λ2 ) can be written as η (∆λ) = ϕ (t) ξ∆λ (t) dt (35) T
where ξ∆λ (t) is defined through the integral equation 1 ξ∆λ (t) Φ (t, λ) dt = χ∆λ (λ) with χ∆λ (λ) = 0 T
if λ ∈ ∆λ . if λ ∈ / ∆λ
(36)
As far as the inversion formula (35) is concerned, by inserting Φ (t, λ), given by (30), in (36) and performing the Fourier inverse transform, it results ξ∆λ (t) a (t) =
exp(−iλ2 t)−exp(−iλ1 t) . −2πit
Finally from (35) and (37) we have exp(−iλ2 t)−exp(−iλ1 t) ζ (t) dt . η (∆λ) = ζ (t) a (t) ξ∆λ (t) dt = −2πit T T
(37)
(38)
Eq. (38) establishes that the stochastic measure η (∆λ) is only dependent on the stationary component ζ (t) of the process ϕ (t) = a (t) ζ (t). The measure F (∆λ) can be derived from general equation (35) F (∆λ) = E{|η (∆λ)|2 } = ξ∆λ (t) ξ∆λ (s) B (s, t) dt ds T T = ξ∆λ (t) ξ∆λ (s)a (t) a (s) exp [iλ (t − s)] F (dλ)dtds (39) T
T
Λ
Learning of SAINNs from Covariance Function: Historical Learning
183
which, by taking into account of (37), reduces to F (∆λ) = exp(−iλ2 t)−exp(−iλ1 t) exp(−iλ2 s)−exp(−iλ1 s) Bζ (t−s)dtds(40) −2πit −2πis T T where Bζ (t − s) is the covariance function of ζ (t) and is defined by B (s, t) = a (t) a (s)Bζ (t − s) .
(41)
Coming back to the problem of approximating the SP ϕ (t) = a (t) ζ (t), it is equivalent to approximating the resulting Φ (t, λ) = a (t) eiλt by the AINN, i.e. am um (t) um (λ) . (42) a (t) eiλt ≈ m
Indeed the mean square error between the SP ϕ (t) and the approximating process ψ (t) = Ψ (t, λ) η (dλ) with Ψ (t, λ) = am um (t) um (λ) , (43) Λ
m
is given by
2 E |ϕ (t) − ψ (t)|2 = E Λ Φ (t, λ) η (dλ) − Λ Ψ (t, λ) η (dλ) 2 = Λ |Φ (t, λ) − g (t, λ)| F (dλ) = Λ |a (t) eiλt − m am um (t) um (λ)|2 F (dλ) → 0
(44)
The stochastic neural network is defined by (44), with η (∆λ) given by (38). If the process ϕ (t) is Gaussian the random variables η (∆λ) are independent and 2 completely specified by the variance E{|η (∆λ)| } = F (∆λ) so that Bζ (t − s) completely characterizes the process as it follows from (40), (41).
4
Conclusions
In this work the historical learning, i.e. the SAINNs capability of approximating a wide class of nonstationary stochastic processes from their covariance function has been demonstrated. In particular the two cases where the ‘time’ t belongs to a finite or infinite set T have been analyzed separately.
References [1] Karhunen, K.: Uber lineare Methoden in der Wahrscheinlicherechnung, Ann. Acad. Sci. Fennicae, Ser. A. Math Phys., Vol. 37, (1947) 3–79 177, 179 [2] Turchetti, C., Conti, M., Crippa, P., Orcioni, S.: On the Approximation of Stochastic Processes by Approximate Identity Neural Networks. IEEE Trans. Neural Networks, Vol. 9, 6 (1998) 1069–1085 177, 178
184
Paolo Crippa and Claudio Turchetti
[3] Belli, M. R., Conti, M., Crippa, P., Turchetti, C.: Artificial Neural Networks as Approximators of Stochastic Processes. Neural Networks, vol. 12, 4-5 (1999) 647– 658 177 [4] Doob, J. L.: Stochastic Processes. J. Wiley & Sons, New York, USA, (1990) 178, 179
Use of the Kolmogorov's Superposition Theorem and Cubic Splines for Efficient Neural-Network Modeling Boris Igelnik Pegasus Technologies, Inc. 5970 Heisley Road, Suite 300, Mentor OH 44060, USA
[email protected] Abstract. In this article an innovative neural-network architecture, called the Kolmogorov's Spline Network (KSN) and based on the Kolmogorov's Superposition Theorem and cubic splines, is proposed and elucidated. The main result is the Theorem giving the bound on the approximation error and the number of adjustable parameters, which favorably compares KSN with other one-hidden layer feed-forward neural-network architectures. The sketch of the proof is presented. The implementation of the KSN is discussed.
1
Introduction
The Kolmogorov's Superposition Theorem (KST) gives the general and very parsimonious representation of a multivariate continuous function through superposition and addition of univariate functions [1]. The KST states that any function, f ∈ C ( I d ) , has the following exact representation on I d f ( x) =
∑ g ∑ λψ ( x )
2 d +1
d
n =1
i =1
i
ni
i
(1)
with some continuous univariate function g depending on f, while univariate functions, ψ n , and constants, λi , are independent of f. Hecht-Nielsen [2] was first who recognized that the KST could be utilized in neural network computing. Using early Sprecher's enhancement of the KST [3] he proved that the Kolmogorov's superpositions could be interpreted as a four-layer feedforward neural network. Girosi and Poggio [4] pointed out that the KSN is irrelevant to neural-network computing, because of high complexity of computation of the functions g and ψ n from the finite set of data. However Kurkova [5] noticed that in the Kolmogorov's proof of the KST the fixed number of basis function, 2d + 1, can be replaced by a variable N, and the task of function representation can be replaced by the task of function approximation. She also demonstrated [6] how to approximate Hecht-Nielsen's network by the traditional neural network.
V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 184-190, 2003. Springer-Verlag Berlin Heidelberg 2003
Use of the Kolmogorov’s Superposition Theorem and Cubic Splines
185
Numerical implementation of the Kolmogorov's superpositions was analyzed in [7]-[10]. In particular Sprecher [9], [10] derived the elegant numerical algorithms for implementation of both internal and external univariate functions in [1]. The common starting point of the works [2], [3], [5]-[10] is the structure of the Kolmogorov's superpositions. Then the main effort is made: to represent the Kolmogorov's superpositions as a neural network [2], [3]; to approximate them by a neural network [5], [6]; to directly approximate the external and internal univariate functions in the Kolmogorov's representation by a numerical algorithm [7]-[10]. In our view it is an attempt to preserve the efficiency of the Kolmogorov's theorem in representation of a multivariate continuous function in its practical implementation. Especially attractive is the idea of preserving the universal character of internal functions. If implemented with reasonable complexity of the suggested algorithms this feature can make a breakthrough in building efficient approximations. However, since the estimates of the complexity of these algorithms are not available so far, the arguments of Girosi and Poggio [4] are not yet refuted. The approach adopted in this paper is different. The starting point is the function approximation from the finite set of data, by a neural net of the following type N d f N ,W ( x ) = ∑ an g n ∑ψ i ( xi , wni ) , W = {an , wni } n =1 i =1
(2)
The function, f, to be approximated belongs to the class, Φ ( I d ) of continuously differentiable functions with bounded gradient, which is wide enough for applications. We are looking for the qualitative improvement of the approximation f ≈ f N ,W , f ∈ Φ ( I d )
(3)
using some ideas of the KST proof. In particular we see from the Kolmogorov's proof that it is important to vary, dependent on data, the shape of external univariate function, g, in contrast to traditional neural networks with fixed-shape basis functions. Use of variable-shape basis functions in neural networks is described in [11]-[15]. In this paper innovative neural-network architecture, KSN, is introduced. The distinctive features of this architecture are: it is obtained from (1) by replacing fixed number of basis functions, 2d + 1, by the variable N, and by replacing both the external function, and the internal functions, ψ n , by the cubic spline functions, s (., γ n ) and s (., γ ni ) respectively. Use of cubic splines allows to vary the shape of basis functions in the KSN by adjusting the spline parameters γ n and γ ni . Thus the KSN, f Ns ,W , is defined as follows N d f Ns ,W ( x ) = ∑ s ∑ λi s ( xi , γ ni ) , γ n n =1 i =1
(4)
where λ1 ,...λd > 0 are rationally independent numbers [16] satisfying the condition
∑
d i =1
λi ≤ 1 . These numbers are not adjustable on the data and are independent of an
application. In Section 2 we give a sketch of the proof for the main result of this
186
Boris Igelnik
paper. It states that the rate of convergence of approximation error to zero with N → ∞ is significantly higher for the KSN than the corresponding nets for traditional neural networks described by equation (2). Simultaneously we proved that the complexity of the approximation of KSN with N → ∞ tends to infinity significantly slower than the complexity of the traditional neural networks. The complexity is defined as the number of adjustable parameters needed to achieve the same approximation error for both types of neural networks. This main result is the justification for introducing the KSN. Thus, utilizing some of the ideas of the KST proof, a significant gain, both in accelerating rate of convergence of approximation error to zero and reducing the complexity of the approximation algorithm can be achieved by the KSN. Section 3 describes some future work in this direction.
2
Main Result
Main result is contained in the following theorem: Theorem. For any function f ∈ Φ ( I d ) and any natural N there exists a KSN defined by equation (4), with the cubic spline univariate functions g ns ,ψ nis 1, defined on [0,1] and rationally independent numbers λ1 > 0,...λd > 0, ∑ i =1 λi ≤ 1 , such that i=d
f − f Ns ,W = Ο (1/ N ) ,
(5)
f − f Ns ,W = max f ( x ) − f Ns ,W ( x ) . d
(6)
where x∈ I
The number of network parameters, P, satisfies the equation P = Ο( N 3/ 2 ) .
(7)
This statement favorably compares the KSN with the one-hidden layer feedforward networks currently in use. Most of the existing such networks provides the estimate of approximation [17]-[22] error by the following equation
(
)
f − f N ,W = Ο 1/ N .
(8)
Denote N∗ and P∗ the number of basis functions and the number of adjustable parameters for a network currently in use with the same approximation error as the KSN with N basis functions. Comparison of (5) and (8) shows that N∗ = Ο ( N 2 ) and P∗ = Ο ( N 2 ) . Therefore, P∗ >> P for large values of N, which confirms the significant
advantage of the KSN in complexity. 1
These notations are used in section 2 instead of notations s (., γ n ) , s (., γ ni ) in formula (4)
Use of the Kolmogorov’s Superposition Theorem and Cubic Splines
187
The detailed proof of the theorem is contained in [23]. The proof is rather technical and lengthy. In particular large efforts were spent on proving that the conditions imposed on the function f are sufficient for justification of equations (5) and (7) In this paper we concentrate on the description of the major steps of the proof. The first step of the proof follows the arguments of the first step of the KST proof, except two things: it considers variable number of basis functions, N, and differently defines the internal functions, ψ i in formula (2). In the latter case piecewise linear functions of the Kolmogorov's construction are replaced by almost everywhere piecewise linear functions with connections between adjacent linear parts made from nine degree spline interpolates on the arbitrarily small intervals. As a result of this first step a network, f N ,W , described by equation (2) is obtained, satisfying the equation f − f N ,W = Ο (1/ N ) .
(9)
The univariate functions, g n and ψ i , in the first step of the proof have uncontrolled complexity. That is why they are replaced in the second step by their cubic spline interpolates g ns and ψ nis . The number of spline knots, M, depends on N, and was chosen so that the spline network, f Ns ,W , defined as N d s f Ns ,W ( x ) = ∑ g n ∑ λψ i ni ( xi , γ ni ) , γ n n =1 i =1
(10)
approximates f with approximation error and complexity satisfying equations (5) and (7). The functions, g n and ψ i , defined in the first step of proof, should be four times continuously differentiable in order to be replaced by their cubic spline interpolates, g ns and ψ nis respectively, with estimated approximation error. That is why in the construction of g n and ψ i nine degree spline interpolates should be used.
3
Implementation of the KSN
In the implementation of the KSN we use the Ensemble Approach (EA) [24], [25]. EA contains two optimization procedures, the Recursive Linear Regression (RLR) [26] and the Adaptive Stochastic Optimization (ASO), and one module of the specific neural net architecture (NNA). Both RLR and ASO operate on the values of the T × N design matrices defined for N = 1,...N max
(
Γ1 x (1) , w1N ... Γ1 x (T ) , w1N
(
) )
(
)
1 ... Γ N x ( ) , wNN , ... ... T ... Γ N x ( ) , wNN
(
)
(11)
188
Boris Igelnik
and
the
matrix
of
the
target
output
values
y (1) ,... y (T ) ,
where
d t (t ) s Γ n x ( ) , wnN = g ns ∑ λψ i ni xi , γ ni , γ n , n = 1,...N , t = 1,...T , T is the number of i =1 points in the training set. Since these matrices are numerical, operation of ASO and RLR is independent of the specifics of NNA. The NNA for spline functions is defined in detail in [23]. The use of the EA can be supplemented by the use of the Ensemble Multi-Net (EMN) [27]-[29] in order to increase the generalization capability of the KSN (it can be used as well for training of other predictors with the same purpose). The main features of the EMN are as follows. First the ensemble of the nets
(
{f
)
n ,Wn , En
(
)
}
, n = 1,...N , having N 0 basis functions and trained on different subsets,
En , n = 1,...N , of the training set E, is created. Then these nets are combined in one
net, FN ,W , E , trained (if needed) on the set E. We prefer the method of linear combining FN ,W , E ( x ) = ∑ an f n,W∗ N , En ( x )
(12)
where f n,W* N , En , n = 1,...N are the trained nets. The parameters an , n = 1,...N are the only adjustable parameters on the right side of equation (12). The training sets are chosen so that to minimize the possibility of linear dependences among basis functions f n,W* N , En , n = 1,...N . One of the methods used for creation of training sets is called bagging [29]. For each ( x, y ) ∈ E the pair ( x, y + N ( 0, σ% ) ) is included in En with the probability 1/ E , where N ( 0, σ% ) is a Gaussian random variable with the zero mean and the standard deviation σ% equal to the estimate of the standard deviation of the net output noise. Inclusions in En are independent of inclusions in other sets, Em , m ≠ n . The number of basis functions, N 0 , in each net f n,W* N , En , n = 1,...N is small [27], [28] compared to N. It was proven [29], both theoretically and experimentally, that this method might lead to significant decrease of generalization error. Given relatively small size of the nets f n,W* N , En , n = 1,...N the EMN can be called the “parsimonious cross-validation”.
4
Conclusion and Future Work
In this article we have laid down the justification for a new, highly adaptive modeling architecture, the KSN, and developed the methods of its implementation and learning. Several theoretical issues remain for future work. In particular we are interested in deriving a bound on the estimation error [2] of the KSN and finding the methods for automatic choosing the intervals for the shape parameters of the KSN. Not less
Use of the Kolmogorov’s Superposition Theorem and Cubic Splines
189
important is the experimental work where not only the practical advantages and disadvantages of the KSN as a modeling tool will be checked but the KSN coupling with the preprocessing and post-processing tools.
References [1]
[2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]
[16]
Kolmogorov, A.N.: On the Representation of Continuous Functions of Many Variables by Superposition of Continuous Functions of One Variable and Addition. Dokl. Akad. Nauk SSSR 114 (1957) 953-956, Trans. Amer. Math. Soc. 2 (28) (1963) 55-59 Hecht-Nielsen, R.: Counter Propagation Networks. Proc. IEEE Int. Conf. Neural Networks. 2 (1987) 19-32 Sprecher, D.A.: On the Structure of Continuous Functions of Several Variables. Trans. Amer. Math. Soc. 115 (1965) 340-355 Girosi F., Poggio, T.: Representation Properties of Networks: Kolmogorov's Theorem Is Irrelevant. Neur.Comp. 1 (1989) 465-469 Kurkova, V.: Kolmogorov's Theorem Is Relevant. Neur.Comp. 3 (1991) 617622 Kurkova, V.: Kolmogorov's Theorem and Multilayer Neural Networks. Neural Networks. 5 (1992) 501-506 Nakamura, M., Mines, R., and Kreinovich, V.: Guaranteed Intervals for Kolmogorov's Theorem (and their Possible Relations to Neural Networks) Interval Comp. 3 (1993) 183-199 Nees, M.: Approximative Versions of Kolmogorov's Superposition Theorem, Proved Constructively. J. Comp. Appl. Math. 54 (1994) 239-250 Sprecher, D.A.: A Numerical Implementation of Kolmogorov's Superpositions. Neural Networks. 9 (1996) 765-772 Sprecher, D.A.: A Numerical Implementation of Kolmogorov's Superpositions. Neural Networks II. 10 (1997) 447-457 Guarnieri, S., Piazza, F., and Uncini, A.: Multilayer Neural Networks with Adaptive Spline-based Activation Function. Proc. Int. Neural Network Soc. Annu. Meet. (1995) 1695-1699 Vecci, L., Piazza, F., and Uncini, A.: Learning and Generalization Capabilities of Adaptive Spline Activation Function Neural Networks. Neural Networks. 11 (1998) 259-270 Uncini, A., Vecci, L., Campolucci, P., and Piazza, F.: Complex-valued Neural Networks with Adaptive Spline Activation Function for Digital Radio Links Nonlinear Equalization. IEEE Trans. Signal Proc. 47 (1999) 505-514 Guarnieri, S., Piazza, F., and Uncini, A.: Multilayer Feedforward Networks with Adaptive Spline Activation Function. IEEE Trans. Neural Networks. 10 (1999) 672-683 Igelnik, B.: Some New Adaptive Architectures for Learning, Generalization, and Visualization of Multivariate Data. In: Sincak, P., Vascak, J. (eds.): Quo Vadis Computational Intelligence? New Trends and Approaches in Computational Intelligence. Physica-Verlag, Heidelberg (2000) 63-78 Shidlovskii, A.V.: Transcendental Numbers. Walter de Gruyter, Berlin (1989)
190
Boris Igelnik
[17] Barron, A.R., Universal Approximation Bounds for Superpositions of a Sigmoidal Function. IEEE Trans. Inform. Theory. 39 (1993) 930-945 [18] Breiman, L.: Hinging Hyperplanes for Regression, Classification, and Function Approximation. IEEE Trans. Inform. Theory. 39 (1993) 999-1013 [19] Jones, L.K.: Good Weights and Hyperbolic Kernels for Neural Networks, Projection Pursuit, and Pattern Classification: Fourier Strategies for Extracting Information from High-dimensional Data. IEEE Trans. Inform. Theory. 40 (1994) 439-454 [20] Makovoz, Y.: Random Approximants and Neural Networks. Jour. Approx. Theory. 85 (1996) 98-109 [21] Scarcelli, F and Tsoi, A.C.: Universal Approximation Using Feedforward Neural Networks: a Survey of Some Existing Methods and Some New Results. Neural Networks. 11 (1998) 15-37 [22] Townsend, N. W. and Tarassenko, L.: Estimating of Error Bounds for NeuralNetwork Function Approximators. IEEE Trans. Neural Networks. 10 (1999) 217-230 [23] Igelnik, B., Parikh, N.: Kolmogorov's Spline Network. IEEE Trans. Neural Networks. (2003) Accepted for publication [24] Igelnik, B., Pao, Y.-H., LeClair, S.R., and Chen, C. Y.: The Ensemble Approach to Neural Net Training and Generalization. IEEE Trans. Neural Networks. 10 (1999) 19-30 [25] Igelnik, B., Tabib-Azar, M., and LeClair, S.R.: A Net with Complex Coefficients. IEEE Trans. Neural Networks. 12 (2001) 236-249 [26] Albert, A.: Regression and the Moore-Penrose Pseudoinverse. Academic Press, New York (1972) [27] Shapire, R. E.: The Strength of Weak Learnability. Machine Learning. 5 (1990) 197-227 [28] Ji, S and Ma, S.: Combinations of Weak Classifiers. IEEE Trans. Neural Networks. 8 (1997) 32-42 [29] Breiman, L.: Combining Predictors. In: Sharkey, A.J.C. (ed.): Combining Artificial Neural Nets. Ensemble and Modular Nets. Springer, London (1999) 31-48
The Influence of Prior Knowledge and Related Experience on Generalisation Performance in Connectionist Networks F.M. Richardson 1,2, N. Davey1, L. Peters1, D.J. Done2, and S.H. Anthony2 1
Department of Computer Science, University of Hertfordshire, College Lane, Hatfield, Hertfordshire, AL10 9AB, UK {F.1.Richardson, N.Davey, L.Peters}@herts.ac.uk 2 Department of Psychology, University of Hertfordshire, College Lane, Hatfield, Hertfordshire, AL10 9AB, UK {D.J.Done, S.H.1.Anthony}@herts.ac.uk
Abstract. The work outlined in this paper explores the influence of prior knowledge and related experience (held in the form of weights) on the generalisation performance of connectionist models. Networks were trained on simple classification and associated tasks. Results regarding the transfer of related experience between networks trained using backpropagation and recurrent networks performing sequence production, are reported. In terms of prior knowledge, results demonstrate that experienced networks produced their most pronounced generalisation performance advantage over navï e networks when a specific point of difficulty during learning was identified and an incremental training strategy applied at this point. Interestingly, the second set of results showed that knowledge learnt about in one task could be used to facilitate learning of a different but related task. However, in the third experiment, when the network architecture was changed, prior knowledge did not provide any advantage and indeed when learning was expanded, even found to deteriorated.
1
Introduction
Some complex tasks are difficult for neural networks to learn. In such circumstances an incremental learning approach, which places initial restrictions on the network in terms of memory or complexity of training data has been shown to improve learning [1], [2], [3]. However, the purpose of the majority of networks is not simply to learn the training data but to generalise to unseen data. Therefore, it can be expected, but not assumed, that using incremental learning may also improve generalisation performance. The work reported in this paper extends upon the original work of Elman [3], in which networks trained incrementally showed a dramatic improvement in learning. In this paper three different ways of breaking down the complexity of the task are V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 191-198, 2003. Springer-Verlag Berlin Heidelberg 2003
192
F.M. Richardson et al.
investigated, with specific reference to generalisation performance. In the first experiment an incremental training regime, in which the training set is gradually increased in size, is evaluated against the standard method of presenting the complete training set. In the second experiment the hypothesis that knowledge held in the weights of a network from one task might be useful to a network learning a related task, is explored. The third experiment completes the investigation by determining whether knowledge transfer between networks such as those in the second experiment, may prove beneficial to learning across different types of networks, performing related tasks. The use of well-known learning rules (back-propagation and recurrent learning) allows this work to complement previously mentioned earlier work, allowing a more complete picture of learning and generalisation performance.
2
Experiment One: Investigating Prior Knowledge
The aim of this experiment was to compare the generalisation performance of networks trained using incremental learning in which the input to the network is staged (experienced networks), with equivalent networks initialised with random weights (navï e networks). Networks were trained with the task of classifying static depictions of four simple line-types as seen in Figure 1. Classification of the line-type was to be made irrespective of its size and location on the input array. In order to accomplish this task the network, must learn the spatial relationship of units in the input and then give the appropriate output in the form of the activation of a single classification unit on the output layer.
Horizontal
Vertical
Diagonal acute
Diagonal obtuse
Fig. 1. Shows the four basic line-types which the network was required to classify
2.1
Network Architecture
A simple feed-forward network consisting of an input layer of 49 units arranged as a 7x7 grid was fully connected to a hidden layer of 8 units, which was also connected to an output layer of 4 units, each unit representing a single line type was used (see Figure 2). 2.2
Training and Testing of Naïve Networks
A full set of patterns for all simple line-types of lengths ranging from 3 to 7 units, for all locations upon the input grid, were randomly allocated to one of two equal-sized training and testing sets (160 patterns per set). Three batches of networks (each consisting of 10 runs) were initialised with random weights (navï e networks). The first batch of networks was trained to classify two different line-types (horizontal and vertical lines), the second three and the third, all four different line-types. All
The Influence of Prior Knowledge and Related Experience
193
networks were trained using back-propagation with the same learning rate (0.25) and momentum (0.9) to the point where minimal generalisation error was reached. At this stage the number of epochs that each network had taken to reach the stop-criterion of minimal generalisation error was noted. These networks formed the basis of comparison with experienced networks. 2.3
Training and Testing of Experienced Networks
These networks were trained using the same parameters as those used for navï e networks. The training and testing of the four line-types was divided into three increments, the first increment consisting of two line-types (horizontal and vertical lines) with an additional line-type being added at each subsequent increment. The network progressed from one increment to the next upon attaining minimal generalisation error for the current increment. At this point the weights of the network were saved and then used as the starting point for learning in the following increment, patterns for the additional line-type were added along with an additional output unit (the weights for the additional unit were randomly initialised). O u tp u t lay er
H id den la y er
Inp u t lay er
Fig. 2. Shows the network architecture of the 7x7-classification network. The network consisted of an input layer with units arranged as a grid, a hidden layer, and an output layer consisting of a number of classification units. Given a static visual depiction of a line of any length as input, the network was required to classify the input accordingly by activating the corresponding unit on the output layer. In the example shown the network is given the input of a vertical line with a length of 3 units, which is classified as a vertical line-type
2.4
Results
Generalisation performance was assessed in terms of the number of output errors and poor activations produced by each type of network. Outputs were considered errors if activation of the target unit was less than 0.50, or if a non-target unit had an activation of 0.50 or higher. Poor activations were marginally correct classifications (activation of between 0.50-0.60 on the target unit, and/or activation of 0.40-0.49 on non-target units) and were used to give a more detailed indicator of the level of classification success.
194
F.M. Richardson et al.
The generalisation performance of navï e networks trained on two line-types was good, with networks classifying on average, 89.84% of previously unseen patterns correctly. However, performance decreased as the number of different line-types in the training set increased, with classification performance for all four line-types dropping to an average of 79.45%. In comparison, experienced networks proved marginally better, with an average of 81.76%. Further comparisons between navï e and expe rienced networks revealed that a navï e network trained upon three line-types produced a better generalisation performance than an experienced network at the same stage. It seemed that the level of task difficulty increased between the learning of three and four line-types. So the weights from navï e networks trained with three lin e-types were used as a starting point for training further networks upon four line-types, resulting in a two-stage incremental strategy. This training regime resulted in a further improvement in generalisation performance, with an average of 84.48%. A comparison of the results for the three different training strategies implemented can be seen in Figure 3.
Fig. 3. Shows a comparison of generalisation performance between networks trained using the three different strategies. It can be seen that naïve networks produced the lowest generalisation success. Of the experienced networks, those trained using the two-stage strategy produced the best generalisation performance
3 Experiment Two: Investigating Related Experience I
The aim of this experiment was to attempt to determine whether the knowledge acquired by networks trained to classify line-types in Experiment One would aid the generalisation performance of networks trained upon a different but related task. In the new task, networks were given the same type of input patterns as those used for the classification network, but were required to produce a displacement vector as output. This displacement vector contained information as to the length of the line and the direction in which the line would be drawn if produced (see Figure 4). It was hypothesised that the knowledge held in the weights from the previous network would aid learning in the related task because the task of determining how to
interpret the input layer had already been solved by the previous set of weights. The divisions between line-types created upon output in the classification task were also relevant to the related task, in that the same line-types shared activation properties upon the output layer; for example, all vertical lines are the result of activation upon the y-axis.
3.1 Network Architecture
All networks consisted of the same input layer as used in Experiment One, a hidden layer of 12 units and an output layer of 26 units.
Fig. 4. Shows the static input to the network of a vertical line of a length of five units and the desired output of the network. Output is encoded in terms of movement from the starting point along the x and y-axis. This form of thermometer encoding preserves isometry and is position invariant [3]
3.2 Training and Testing
All networks were trained and tested in the same manner and using the same parameters as those used in Experiment One. Naïve networks were initialised with a random set of weights. For experienced networks, weights from the networks producing the best generalisation performance in Experiment One were loaded. Additional connections required for these networks (due to an increase in the number of hidden units) were initialised randomly. Generalisation performance was assessed.
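The weight-loading step can be sketched as follows; the size of the Experiment One hidden layer used here is an assumption, since the paper only reports the 12 hidden and 26 output units of the new network.

```python
import numpy as np

rng = np.random.default_rng(0)

def grow_weights(W_old, new_shape, scale=0.1):
    # start from random values for the new connections, then copy every
    # previously learned weight into the top-left block
    W_new = rng.normal(0.0, scale, new_shape)
    r, c = W_old.shape
    W_new[:r, :c] = W_old
    return W_new

W1_old = rng.normal(0.0, 0.1, (49, 8))      # hypothetical saved input-to-hidden weights
W1_new = grow_weights(W1_old, (49, 12))     # hidden layer enlarged to 12 units
```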
3.3 Results
Naïve networks reached an average generalisation performance of 59.59%. Experienced networks were substantially better, with an average generalisation performance of 70.42%. This result, as seen in Figure 5, clearly demonstrates the advantage of related knowledge about lines and the spatial relationship of units on the input grid in the production of displacement vectors.
4 Experiment Three: Investigating Related Experience II
The aim of this experiment was to examine whether weights from a classification task could be used to aid learning in a recurrent network required to carry out an extension of the static task. Initially, a standard feed-forward network (as shown in Figure 2) was trained to classify the line-types of simple two-line shapes. Following this, a recurrent
network [5] was trained to carry out this task in addition to generating the sequence in which the line-segments for each shape would be produced if drawn (as shown in Figure 6).
Fig. 5. Shows a comparison of generalisation performance between naïve and experienced networks. It is clear that the experienced networks produced the best generalisation performance
Fig. 6. Shows an example of the input and output for the sequential shape classification task. The input was a simple shape composed of two lines of different types. The output consisted of two components: the static component, identifying the line-types, and the sequential component, generating the order of production. Networks used in Experiment One were trained to give the static output of the task. Weights from these networks were then used by recurrent networks to produce both the static and sequential outputs as shown above
4.1 Training and Testing
Both naïve and experienced networks were trained on an initial sequence production task, using simple shapes composed of diagonal line-types only. Following this, the initial task was extended for both networks to include shapes composed of horizontal and vertical line-types.
4.2 Results
The generalisation performance of naïve and experienced networks for the initial and extended task was assessed. Initially, comparison showed no notable difference
between naïve and experienced networks. However, performance deteriorated for experienced networks with the addition of the extended task. This drop in performance was attributed to a reduction in performance upon the initial task.
Fig. 7. Shows a comparison of generalisation performance between naïve and experienced networks, performing the initial sequence production task followed by the extended task. It can be seen that performance for the two types of networks for the initial task is relatively equal. However, for the extended task, performance of the experienced networks is poor in comparison to naïve networks trained with random weights
5 Discussion
The experiments conducted have provided useful insights into how prior knowledge and related experience may be used to improve the generalisation performance of connectionist networks. Firstly, it has been demonstrated that incremental learning was of notable benefit. Secondly, selecting the point at which learning becomes difficult as the time to increment the training set produces a further advantage. Thirdly, an interesting result is that knowledge learnt in one task can be used to facilitate learning in a different but related task. Finally, the exploration into whether knowledge can transfer and aid learning between networks of different architectures has a less clear outcome; prior knowledge was successfully transferred, but performance was found to deteriorate as the network attempted to expand its learning. Further work involves exploring methods by which knowledge transfer between static and recurrent networks may prove beneficial in both learning and generalisation performance.
References
[1] Altmann, G.T.M.: Learning and Development in Connectionist Learning. Cognition (2002) 85, B43-B50
[2] Clarke, A.: Representational Trajectories in Connectionist Learning. Minds and Machines (1994) 4, 317-322
[3] Elman, J.L.: Learning and Development in Neural Networks: the Importance of Starting Small. Cognition (1993) 48, 71-99
[4] Richardson, F.M., Davey, N., Peters, L., Done, D.J., Anthony, S.H.: Connectionist Models Investigating Representations Formed in the Sequential Generation of Characters. Proceedings of the 10th European Symposium on Artificial Neural Networks. D-side publications, Belgium (2002) 83-88
[5] Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning Internal Representations by Error Propagation. In: Parallel Distributed Processing. Vol. 1, Chapter 8. MIT Press, Cambridge (1986)
Urinary Bladder Tumor Grade Diagnosis Using On-line Trained Neural Networks
D.K. Tasoulis (1,2), P. Spyridonos (3), N.G. Pavlidis (1,2), D. Cavouras (4), P. Ravazoula (5), G. Nikiforidis (3), and M.N. Vrahatis (1,2)
(1) Department of Mathematics, University of Patras, GR-26110 Patras, Greece, [email protected]
(2) University of Patras Artificial Intelligence Research Center (UPAIRC)
(3) Computer Laboratory, School of Medicine, University of Patras, GR-26110 Patras, Greece
(4) Department of Medical Instrumentation Technology, TEI of Athens, Ag. Spyridonos Street, Aigaleo, GR-12210 Athens, Greece
(5) Department of Pathology, University Hospital, GR-26110 Patras, Greece
Abstract. This paper extends the line of research that considers the application of Artificial Neural Networks (ANNs) as an automated system for the assignment of tumor grade. One hundred twenty-nine cases were classified according to the WHO grading system by experienced pathologists into three classes: Grade I, Grade II and Grade III. 36 morphological and textural cell nuclei features represented each case. These features were used as input to the ANN classifier, which was trained using a novel stochastic training algorithm, namely the Adaptive Stochastic On-Line method. The resulting automated classification system achieved classification accuracies of 90%, 94.9% and 97.3% for tumors of Grade I, II and III respectively.
1 Introduction
Bladder cancer is the fifth most common type of cancer. Superficial transitional cell carcinoma (TCC) is the most frequent histological type of bladder cancer [13]. Currently, these tumors are assessed using a grading system based on a variety of histopathological characteristics. Tumor grade, which is determined by the pathologist from tissue biopsy, is associated with tumor aggressiveness. The most widely accepted grading system is the WHO (World Health Organization) system, which stratifies TCCs into three categories: tumors of Grade I, Grade II and Grade III. Grade I tumors are not associated with invasion or metastasis but present a risk for the development of recurrent lesions. Grade II carcinomas are associated with a low risk of further progression, yet they frequently recur. Grade III tumors are characterized by a much higher risk of progression
and also a high risk of association with disease invasion [7]. Although histological grade has been recognized as one of the most powerful predictors of the biological behavior of tumors and significantly affects patients' management, it suffers from low inter- and intra-observer reproducibility due to the subjectivity inherent to visual observation [12]. Digital image analysis techniques and classification systems constitute alternative means to perform tumor grading in a less subjective manner. Numerous research groups have proposed quantitative assessments to address this problem. H.-K. Choi et al. [3] developed an automatic grading system using texture features on a large region of interest, covering a typical area in the histological section. The texture-based system produced an overall accuracy of 84.3% in assessing tumor grade. In a different study [6], researchers employed tissue architectural features and classified tumors with an accuracy of 73%. More recent studies have focused on the analysis of cell nuclei characteristics to perform tumor grade classification, with success rates that do not significantly exceed 80% [2, 17]. In this study, we present a methodology which considerably improves the level of diagnostic accuracy in assigning tumor grade. The method is based on the application of an ANN as a classifier system. The input data for the ANN describe a number of nuclear morphological and textural features that were obtained through an automatic image processing analysis technique. It is worth noting that the prognostic and diagnostic value of these features has been confirmed [4].
2 Materials and Methods
129 tissue sections (slides) from 129 patients (cases) with superficial TCC were retrieved from the archives of the Department of Pathology of Patras University Hospital in Greece. Tissue sections were routinely stained with Haematoxylin-Eosin. All cases were reviewed independently by the experts to safeguard reproducibility. Cases were classified following the WHO grading system as follows: thirty-three cases as Grade I, fifty-nine as Grade II and thirty-seven as Grade III. Images from tissue specimens were captured using a light microscopy imaging system. The method of digitization and cell nuclei segmentation for analysis has been described in previous work [17]. Finally, from each case 36 features were estimated: 18 features were used to describe information concerning nuclear size and shape distribution. The rest were textural features that encoded the chromatin distribution of the cell nucleus [17]. These features were used as input to the ANN classifier.
3 Artificial Neural Networks
Back Propagation Neural Networks (BPNNs) are the most popular artificial neural network models. The efficient supervised training of BPNNs is a subject of considerable ongoing research and numerous algorithms have been proposed to this end. Supervised training amounts to the global minimization of the network learning error.
Applications of supervised learning can be divided into two categories: stochastic (also called on-line) and batch (also called off-line) learning. Batch training can be viewed as the minimization of the error function E. This minimization corresponds to updating the weights once per epoch and, to be successful, it requires a sequence of weight iterates {w^k}, k = 0, 1, 2, ..., where k indicates epochs, which converges to a minimizer w*. In on-line training, network weights are updated after the presentation of each training pattern. This corresponds to the minimization of the instantaneous error of the network E(p) for each pattern p individually. On-line training may be chosen for a learning task either because of a very large (or even redundant) training set or because we want to model a gradually time-varying system. Moreover, it helps in escaping local minima. Given the inherent efficiency of stochastic gradient descent, various schemes have been proposed recently [1, 18, 19]. Unfortunately, on-line training suffers from several drawbacks, such as sensitivity to learning parameters [16]. Another disadvantage is that most advanced optimization methods, such as conjugate gradient, variable metric, simulated annealing etc., rely on a fixed error surface, and thus it is difficult to apply them to on-line training [16]. Regarding the topology of the network, it has been proven [5, 20] that standard feedforward networks with a single hidden layer can approximate any continuous function uniformly on any compact set and any measurable function to any desired degree of accuracy. This implies that any lack of success in applications must arise from inadequate learning, an insufficient number of hidden units, or the lack of a deterministic relationship between inputs and targets. Keeping these theoretical results in mind, we restrict the network topology of the ANNs used in this study to one hidden layer.
4 Training Method
For the purpose of training the neural networks, an on-line stochastic method was employed. For recently proposed on-line training methods, as well as applications in medicine, see [8, 9, 10, 11, 14]. This method uses a learning rate adaptation scheme that exploits gradient-related information from the previous patterns. The algorithm is described in [9], and is based on the stochastic gradient descent proposed in [1]. The basic algorithmic scheme is exhibited in Table 1. As pointed out in Step 4, the algorithm adapts the learning rate using the dot product of the gradients from the previous two patterns. This algorithm produced both fast and stable learning in all the experiments we performed, and very good generalization results.
5 Results and Discussion
To measure the ANN efficiency, the dataset was randomly permuted five times. Each time it was split into a training set and a test set. The training set contained about 2/3 of the original dataset from each class. For each permutation the
Table 1. Stochastic On-Line Training with adaptive stepsize
0: Initialize the weights w^0, the stepsize η^0, and the meta-stepsize K.
1: Repeat for each input pattern p:
2:   Calculate E(w^p) and then ∂E(w^p).
3:   Update the weights: w^{p+1} = w^p − η^p ∂E(w^p).
4:   Calculate the stepsize to be used with the next pattern p+1: η^{p+1} = η^p + K ⟨∂E(w^{p−1}), ∂E(w^p)⟩.
5: Until the termination condition is met.
6: Return the final weights w*.
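A minimal sketch of the scheme in Table 1 follows (this is not the authors' C++ implementation); the toy least-squares usage and all parameter values are assumptions made for illustration only.

```python
import numpy as np

def online_train(grad, w, patterns, eta=0.01, K=1e-4, max_cycles=100):
    # grad(w, p) returns the gradient of the instantaneous error E(p) at w
    g_prev = None
    for _ in range(max_cycles):                # one of the two stopping rules in the text
        for p in patterns:
            g = grad(w, p)
            w = w - eta * g                    # step 3: per-pattern weight update
            if g_prev is not None:
                eta = eta + K * float(np.dot(g_prev, g))  # step 4: stepsize adaptation
            g_prev = g
    return w

# toy usage: on-line least squares on random data
X, y = np.random.randn(50, 3), np.random.randn(50)
grad = lambda w, i: 2.0 * (X[i] @ w - y[i]) * X[i]
w = online_train(grad, np.zeros(3), range(len(y)))
```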
Table 2. Accuracy of Grade I, II and III for various topologies
           36-1-3   36-2-3   36-5-3   36-16-3
Grade I    18.18%   81.81%   90.90%   81.81%
Grade II   100%     100%     100%     100%
Grade III  100%     100%     100%     100%
network was trained with the Stochastic On-Line method with the adaptive stepsize discussed previously. Two terminating conditions were used: the maximum number of cycles over the entire training set was set to 100, and the correct classification of all the training patterns. Alternatively, the Leave-One-Out (LOO) method [15] was employed to validate the ANN classification accuracy. According to this method, the ANN is initialized with the training set including all patterns except one. The excluded pattern is used to assess the classification ability of the network. This process is repeated for all the patterns available and the results are recorded in the form of a truth table. The software used for this task was developed under the Linux Operating System using the C++ programming language and the gcc ver. 2.96 compiler. A great number of different ANN topologies (number of nodes in the hidden layer) were tested for the grade classification task. Some of these tests are exhibited in Table 2. Best results were obtained using the topology 36-5-3. Table 3 illustrates analytically the ANN performance for each crossover permutation. The ANN exhibited high classification accuracy for each grade category. It is worth noting that Grade I tumors were differentiated successfully from Grade III tumors. In four out of five crossovers neither Grade I to III nor Grade III to I errors occurred. As can be seen from Table 3, in one permutation only 1 case of Grade I was misclassified as Grade III. From a clinical point of view, it is important to distinguish low grade tumors, which can generally be treated
Table 3. Crossover Results For the ANNs
Crossover I
Histological finding   ANN: Grade I   Grade II   Grade III   Accuracy (%)
GRADE I                10             1          0           90.9
GRADE II               0              19         1           95
GRADE III              0              1          12          92.3
Overall Accuracy                                             93.2

Crossover II
Histological finding   ANN: Grade I   Grade II   Grade III   Accuracy (%)
GRADE I                10             1          0           90.9
GRADE II               0              20         0           100
GRADE III              0              2          11          84.62
Overall Accuracy                                             93.2

Crossover III
Histological finding   ANN: Grade I   Grade II   Grade III   Accuracy (%)
GRADE I                10             1          0           90.9
GRADE II               0              19         1           95
GRADE III              0              1          12          92.3
Overall Accuracy                                             93.2

Crossover IV
Histological finding   ANN: Grade I   Grade II   Grade III   Accuracy (%)
GRADE I                10             0          1           90.9
GRADE II               0              19         1           95
GRADE III              0              0          13          100
Overall Accuracy                                             95.4

Crossover V
Histological finding   ANN: Grade I   Grade II   Grade III   Accuracy (%)
GRADE I                10             1          0           90.9
GRADE II               0              18         2           90
GRADE III              0              0          13          100
Overall Accuracy                                             93.2
conservatively, in contrast to high-grade tumors. The latter often require a more aggressive therapy because of a high risk of cancer progression. The results can also be interpreted in terms of specificity and sensitivity. That is, specificity is the percentage of Grade I tumors correctly classified and sensitivity is the percentage of Grade III tumors correctly classified. ANN grade classification safeguarded high sensitivity, which is of vital importance for the patients' treatment course, retaining
Table 4. Leave-One-Out Results for the ANNs
Histological finding   ANN: Grade I   Grade II   Grade III   Accuracy (%)
GRADE I                30             1          2           90
GRADE II               0              56         3           94.9
GRADE III              0              1          36          97.3
Overall Accuracy                                             94.06
at the same time high specificity. Another important outcome is that the intermediate Grade II tumors were recognized with high confidence and distinguished from Grades I and III. This would be particularly helpful for pathologists who encounter difficulties in assessing Grade II tumors, since some of them fall into the gray zone bordering on either Grade I or Grade III, and the decision is subject to the judgment of the pathologist. The simplicity and efficiency of the training method enabled us to verify the ANN classification accuracy by employing the LOO method (the whole procedure required 46 seconds to complete on an Athlon CPU running at 1500 MHz). It is well known that this method is optimal for testing the performance of a classifier when small data sets are available, but the procedure is computationally expensive when used with conventional training methods. Classification results employing the LOO method are shown in Table 4. The consistency of the system in terms of high sensitivity (no Grade III to Grade I error occurred) was verified. In [3], a texture-based system produced an overall accuracy of 84.3% in assessing tumor grade. In a different study [6], researchers employed tissue architectural features and classified tumors with an accuracy of 73%. More recent studies have focused on the analysis of cell nuclei characteristics to perform tumor grade classification, with success rates that do not significantly exceed 80% [2, 17]. The ANN methodology proposed in this paper improved significantly the tumor grade assessment, with success rates of 90%, 94.9%, and 97.3% for Grade I, II and III respectively.
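The leave-one-out protocol described above can be sketched as follows; here train and predict stand for the ANN training and classification routines of Section 4 and are placeholders, not part of the authors' software.

```python
import numpy as np

def leave_one_out(X, y, train, predict):
    # X: feature matrix (n cases x 36 features), y: grade labels
    n = len(y)
    confusion = {}
    for i in range(n):
        keep = [j for j in range(n) if j != i]
        model = train(X[keep], y[keep])      # train on all patterns but one
        pred = predict(model, X[i])          # classify the excluded pattern
        confusion[(y[i], pred)] = confusion.get((y[i], pred), 0) + 1
    return confusion                          # truth table: (true grade, predicted grade) counts
```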
6 Conclusions
In this study an ANN was designed to improve the automatic characterization of TCCs employing nuclear features. The ANN exhibited high performance in correctly classifying tumors into three categories, utilizing all the available diagnostic information carried by nuclear size, shape and texture descriptors. The proposed ANN could be considered an efficient and robust classification engine able to generalize in making decisions about complex input data, improving significantly the diagnostic accuracy. The present study extends previous work in terms of the features used and reinforces the belief that objective measurements on nuclear morphometry and texture offer a potential solution for the accurate
characterization of tumor aggressiveness. The novelty of this paper resides in the results obtained, since they are the highest reported in the literature. Since most Grade I tumors are considered to have a good prognosis, while Grade III is associated with a bad prognosis, the 0% misclassification of Grade III tumors as Grade I makes the proposed methodology a strong candidate to be part of a fully automated computer-aided diagnosis system.
References [1] L. B. Almeida, T. Langlois, L. D. Amaral, and A. Plankhov. Parameter adaption in stohastic optimization. On-Line Learning in Neural Networks, pages 111–134, 1998. 201 [2] N. Belacel and M. R. Boulassel. Multicriteria fuzzy assignment method: a useful tool to assist medical diagnosis. Artificial Intelligence in Medicine, 21:201–207, 2001. 200, 204 [3] H.-K. Choi, J. Vasko, E. Bengtsson, T. Jarkrans, U. Malmstrom, K. Wester, and C. Busch. Grading of transitional cell bladder carcinoma by texture analysis of histological sections. Analytical Cellular Pathology, 6:327–343, 1994. 200, 204 [4] C. De Prez, Y. De Launoit, R. Kiss, M. Petein, J.-L. Pasteels, and A. Verhest. Computerized morphonuclear cell image analysis of malignant disease in bladder tissues. Journal of Urology, 143:694–699, 1990. 200 [5] K. Hornik. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359–366, 1989. 201 [6] T. Jarkrans, J. Vasko, E. Bengtsson, H.-K. Choi, U. Malmstrom, K. Wester, and C. Busch. Grading of transitional cell bladder carcinoma by image analysis of histological sections. Analytical Cellular Pathology, 18:135–158, 1995. 200, 204 [7] I. E. Jonathan, B. A. Mahul, R.R Victor, and F.K Mostofi. (transitional cell) neoplasms of the urinary bladder. The American Journal of Surgical Pathology, 22(12):1435–1448, 1998. 200 [8] G. D. Magoulas, V. P. Plagianakos, and M. N. Vrahatis. Global learning rate adaptation in on–line neural network training. In Proceedings of the Second International Symposium in Neural Computation May 23–26, Berlin, Germany, 2000. 201 [9] G. D. Magoulas, V. P. Plagianakos, and M. N. Vrahatis. Adaptive stepsize algorithms for on-line training of neural networks. Nonlinear Analysis, T. M. A., 47(5):3425–3430, 2001. 201 [10] G. D. Magoulas, V. P. Plagianakos, and M. N. Vrahatis. Hybrid methods using evolutionary algorithms for on–line training. In INNS–IEEE International Joint Conference on Neural Networks (IJCNN), July 14–19, Washington, D. C., U. S. A., volume 3, pages 2218–2223, 2001. 201 [11] G. D. Magoulas, V. P. Plagianakos, and M. N. Vrahatis. Improved neural networkbased interpretation of colonoscopy images through on-line learning and evolution. In D. Dounias and D. A. Linkens, editors, European Network of Excellence on Intelligent Technologies for Smart Adaptive Systems, pages 38–43. 2001. 201 [12] E. Ooms, W. Anderson, C. Alons, M. Boon, and R. Veldhuizen. Analysis of the performance of pathologists in grading of bladder tumors. Human Pathology, 14:140–143, 1983. 200 [13] S. L. Parker, T. Tony, S. Bolden, and P. A. Wingo. Cancer statistics. Cancer Statistics 1997. CA Cancer J Clin, 47(5):5–27, 1997. 199
[14] V. P. Plagianakos, G. D. Magoulas, and M. N. Vrahatis. Tumor detection in colonoscopic images using hybrid methods for on–line neural network training. In G. M. Papadourakis, editor, Neural Networks and Expert Systems in Medicine and Healthcare (NNESMED), pages 59–64. Technological Educational Institute of Crete, Heraklion, 2001. 201 [15] S. J. Raudys and A. K. Jain. Small sample size effects in statistical pattern recognition: Recommendations for practitioners. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 252–264, 1991. 202 [16] N. N. Schraudolf. Gain adaptation in stochastic gradient descend. 1999. 201 [17] P. Spyridonos, D. Cavouras, P. Ravazoula, and G. Nikiforidis. Neural network based segmentation and classification system for the automatic grading of histological sections of urinary bladder carcinoma. Analytical and Quantitative Cytology and Histology, 24:317–324, 2002. 200, 204 [18] R. S. Suton. Adapting bias by gradient descent: an incremental version of deltabar-delta. In Proc. 10th National Conference on Artificial Intelligence, pages 171–176. MIT Press, 1992. 201 [19] R. S. Suton. Online learning with random representations. In Proc. 10th International Conference on Machine Learning, pages 314–321. Morgan Kaufmann, 1993. 201 [20] H. White. Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings. Neural Networks, 3:535–549, 1990. 201
Invariants and Fuzzy Logic
Germano Resconi and Chiara Ratti
Dipartimento di Matematica, Università Cattolica, via Trieste 17, 25128 Brescia, Italy
[email protected]
Abstract. In this paper, with the meta-theory based on modal logic, we study the possibility of building invariants in fuzzy logic. Every fuzzy expression will be compensated by other expressions so as to obtain a global expression whose value is always true. The global expression is called an invariant because it always assumes the value true even if the single components of the expression are fuzzy and assume values between false and true. Key words: Fuzzy logic, meta-theory, modal logic, invariant, compensation, tautology.
1 Introduction
In classical modal logic, given a proposition p and a world W, p can be true or false. In the meta-theory, we consider several worlds. p is valued on a set of worlds understood as a single indivisible entity. If in the same set of worlds we have different evaluations, uncertainty arises. In classical logic, if we consider a tautology, its evaluation is always true; but if the evaluation is done on a set of worlds according to the meta-theory, the same expression may no longer be true. This fuzzy expression can be compensated by other expressions. We obtain a global expression whose value is always true. It is an invariant.
2 Meta-theory Based on Modal Logic
In a series of papers (Klir et al., 1994, 1995; Resconi et al., 1992, 1993, 1994, 1996) a meta-theory was developed. The new approach is based on the Kripke model of modal logic. A Kripke model is given by the structure <W, R, V>. Resconi et al. (1992-1996) suggested adjoining a function Ψ : W → R, where R is the set of real numbers assigned to the worlds W, in order to obtain the new model S1 = <W, R, V, Ψ>. (1) That is, for every world there is an associated real number assigned to it. With the model S1, we can build the hierarchical meta-theory where we can calculate the expression for the membership function of truthood in fuzzy set theory to verify a given sentence via a computational method based
on {1, 0} values corresponding to the truth values {T, F} assigned to a given sentence as the response of a world, a person, or a sensor, etc. At this point, we should ask: "what are the linkages between the concepts of a population of observers, a population of possible worlds and the algebra of fuzzy subsets of a well defined universe of discourse?" To restate this question, we need to point out that fuzzy sets were introduced for the representation of imprecision in natural languages. However, imprecision means that a word representing an "entity" (temperature, velocity) cannot have a crisp logic evaluation. The meaning of a word in a proposition may usually be evaluated in different ways for different assessments of an entity by different agents, i.e. worlds. An important principle is: "we cannot separate the assessments of an entity without some loss of property in the representation of that entity itself". Different and in some cases conflicting evaluations for the same proposition may come up for the same entity. The presence of conflicting properties within the entity itself is the principal source of the imprecision in the meaning representation of the entity. For example, suppose the entity is a particular temperature of a room, and we ask for the property cold, when we have no instrument to measure that property. The meaning of the entity, "temperature", is composed of assessments that are the opinions of a population of observers that evaluate the predicate "cold". Without the population of observers, and their assessments, we cannot have the entity "temperature" and the predicate "cold". When we move from crisp sets to fuzzy sets, we move from atomic elements, i.e., individual assessments, to non-atomic entities, i.e., aggregate assessments. In an abstract way, the population of the assessments of the entity becomes the population of the worlds with which we associate the set of propositions with crisp logic evaluation. Perception-based data embedded in a context generate imprecise properties. Imprecise properties are generated by the individual generalization to the other worlds of the evaluation of the properties themselves. For example, when four persons give four different evaluations and each person thinks that his evaluation is right and the others are wrong, we have in the four worlds a superposition of the individual evaluations, and this generates uncertainty or fuzziness. With the perception-based data, as assessments of an entity in context, we evaluate the properties of the entity. The evaluations can be conflicting. In such cases, the world population assessment, composed of individual assessments, gives the context structure of the perception-based data. If we know only the name of a person (entity), we cannot know if s/he is "old". Additional observation-based information that s/he is married, is a manager, and that s/he plays with toys is conflicting information for the proposition that associates her/him with the predicate "old", etc. The aim of this paper is to show that with a model of perception-based imprecision generated by worlds, i.e., agents, we can on the one hand simplify the definitions of the operations in fuzzy logic and on the other hand expose and explain deeper issues embedded within fuzzy theory. Consider a sentence such as "John is tall", where x ∈ X is a person, "John", in a population of people X, and "tall" is a linguistic term of a linguistic
variable, the height of people. Let the meta-linguistic expression that represents the proposition "John is tall" be written in fuzzy set canonical form as follows:
p_A(x) ::= "x ∈ X isr A",  (2)
where "isr" means "x ∈ X is in relation to a fuzzy information granule A", and p_A(x) is the proposition that relates x to A. Next consider that the imprecise sentence "John is tall" can be modelled by different entities, i.e., different measures of the physical height h. With each height h, let us associate the opinions, based on their perceptions, of a set of observers or sensors that give an evaluation for the sentence "John is tall". Any observer or sensor can be modelled in an abstract way by a world w_k ∈ W, where W is the set of all possible worlds that compose the indivisible entity associated with a particular h, and k is the index 1, 2, 3, ..., n associated with each assessment of the entity given by the population of observers. It should be noted that the world, i.e., the person or the sensor, does not say anything about the qualification, i.e., the descriptive gradation of "John's being tall", but just verifies on the basis of a valuation schema. With these preliminaries, we can then write for short: p_A(x) such that V(p_A(x), w_k) evaluates to T, true. Next let us assign η(w_k, x, A) = 1 if V(p_A(x), w_k) = T, or η(w_k, x, A) = 0 otherwise. With this background, we next define the membership expression of truthood of a given atomic sentence in a finite set of worlds w_k ∈ W as follows:
μ_{p_A}(x) = |set of worlds where p_A(x) is true| / |W(x)| = Σ_k η(w_k, x, A) / |W(x)|.  (3)
In equation (3), for any value of the variable x we associate the set of worlds W(x) that represent the entity with the opinions, based on their perceptions, of the population of observers, where p_A represents the proposition of the atomic sentence "John is tall", for x = "John" and A = "tall"; and |W| is the cardinality of the set of worlds in our domain of concern. Recall once again that these worlds, w_k ∈ W, may be agents, sensors, persons, etc. Let us define the subset of worlds W_A = {w_k ∈ W | V(p_A(x), w_k) = T}. We can then write expression (3) as follows:
μ_{p_A}(x) = |W_A| / |W|,  (4)
with the understanding that W_A represents the subset of the worlds W where the valuation of p_A(x) is "true" in the Kripke sense. For the special case where the relation R in the Kripke model is w_k R w_k, i.e. at any world w_k only itself is accessible (any world w_k is isolated from the others), the membership expression is computed as the value of Ψ in S1 stated in (1) above. It is computed by the expression Ψ = 1/|W| for any (single) world w in W. Thus we can write expression (3) as follows:
μ_{p_A}(x) = Σ_k η(w_k) Ψ_k.  (5)
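The membership expression (3)-(4) reduces to counting the worlds in which the crisp evaluation is true. A minimal sketch:

```python
def membership(evaluations):
    # evaluations: one boolean per world w_k (observer, sensor, ...)
    return sum(evaluations) / len(evaluations)

# "John is tall" judged by five observers:
print(membership([True, True, False, False, False]))  # 0.4
```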
3 Transformation by External Agent
Given a set of worlds with a parallel evaluation of the truth value in every world, we can use an operator Γ to change the logic value of the propositions in the worlds. In modal logic we have the possibility (♦) and necessity (□) unary operators that change the logic value of the proposition p. One proposition p in one world w is necessarily true, or □p is true, when p is true in all accessible worlds. One proposition p in one world w is possibly true, or ♦p is true, when p is true in at least one world accessible from w. Alongside these operators, we adjoin a new operator Γ. One proposition p in one world is true under the agent's action, or Γp is true, when there is an agent under whose action p is true. The action of the agent can be neutral, so that in the expression Γp the proposition p does not change its logic value. The action of the agent can also change the logic value of p to false. With the agent operator Γ we can generate any type of t-norm or t-conorm and also non-classical complement operators.
Example (6). Given five worlds (w1, w2, w3, w4, w5) and two propositions p and q, suppose the logic evaluation of p in the five worlds is
V(p) = (True, True, False, False, False)  (7)
and
V(q) = (False, True, True, True, False).  (8)
For (7) and (8) we have μ_p = 2/5 and μ_q = 3/5. The evaluation is V(p OR q) = (True, True, True, True, False), where we use the classical logic operation OR in every world. So we have that
μ_{p or q} = 4/5 > max(2/5, 3/5) = 3/5,  (9)
but when we introduce an external agent whose action changes the truth value of p in this way, V(Γp) = (False, True, True, False, False), we have that
μ_{Γp or q} = 3/5 = max(2/5, 3/5) = 3/5.  (10)
The agent has transformed the general fuzzy logic operation OR into the Zadeh max rule for the same operation OR. In conclusion, the agent Γ can move from one fuzzy logic operation to another with the same membership function values for p and q. We know that, given the proposition p and the membership function μ_p, the Zadeh complement is μ_cp = 1 − μ_p. In this case we have that μ_{p and cp} = min(μ_p, 1 − μ_p), which in general is different from zero, i.e. from the false value. We know that for Zadeh fuzzy logic the absurd condition is not always false.
Example (11). When we use the world image for the uncertainty, we have evaluation (7) and the evaluation of its complement:
V(cp) = (False, False, True, True, True); we have that V(p and cp) = (False, False, False, False, False); then V(p and cp) is always false. In this case the absurd condition is always false. But when an agent changes the truth value positions of the cp expression in this way, if V(Γcp) = (True, True, True, False, False), we have that V(p and Γcp) = (True, True, False, False, False) and
μ_{p and Γcp} = min(2/5, 3/5) = 2/5,  (12)
which is different from false. We remark that V(Γcp) = V(cΓp). To have a uniform language with classical logic we write cp = ¬p.
4 Invariants in Fuzzy Logic
We know that in classical logic we have the simple tautology T = p ∨ ¬p
(13)
which is always true for every logic value of p. When we evaluate the previous expression on the worlds and we introduce an external agent that uses the operator Γ to change the logic value of p, we have the fuzzy expression of T, denoted T_F:
T_F = p ∨ Γ¬p = p ∨ ¬Γp.  (14)
Because in (14) we introduce a new variable, Γp, expression (14) is not formally equal to (13), so it can also assume the value false when the variable Γp is different from the variable p. In fact, (14) can be written, for every world, in this way:
T_F = Γp → p,  (15)
which is false when Γp is true and p is false. So a tautology in classical logic is not a tautology in fuzzy logic. In fact, with the Zadeh operations we have
μ_{p or cp} = max(μ_p, 1 − μ_p) ≤ 1.  (16)
To find an invariant inside fuzzy logic, we change the variables in expression (13) in this way:
p → p ∨ ¬Γp.  (17)
Because expression (13) is a tautology, it is always true for every substitution of the variable. So we have that
T = (p ∨ ¬Γp) ∨ ¬(p ∨ ¬Γp)  (18)
is always true.
So T can be separated into two parts: one is expression (14) and the other is the compensation part. In conclusion, by the De Morgan rule, we have
(p ∨ ¬Γp) ∨ (¬p ∧ Γp).  (19)
Example (20). When we use the evaluation in Example (11), we have V(p) = (True, True, False, False, False), V(¬Γp) = (True, True, True, False, False), V(Γp) = (False, False, False, True, True), V(¬p) = (False, False, True, True, True). So we have V(p ∨ ¬Γp) = (True, True, True, False, False) and V(¬p ∧ Γp) = (False, False, False, True, True). When we come back to the original fuzzy logic with the Zadeh operations we obtain
max[μ_p, (1 − μ_p)] + min[(1 − μ_p), μ_p] = 1,
(21)
where max[μ_p, (1 − μ_p)] is associated with the expression V(p ∨ ¬Γp) and min[(1 − μ_p), μ_p] with V(¬p ∧ Γp). The term min[(1 − μ_p), μ_p] is the compensation term. We can extend the same process to other tautologies and in this way find new invariants in fuzzy logic.
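A small worked check of this invariant, using the world evaluations of Example (20): the code below simply recomputes the two parts of (19) and verifies that their membership values sum to 1.

```python
p  = [True, True, False, False, False]   # evaluation (7)
gp = [False, False, False, True, True]   # Γp: the agent's transformed evaluation

mu = lambda v: sum(v) / len(v)
left  = [a or (not b) for a, b in zip(p, gp)]   # p OR NOT Γp
right = [(not a) and b for a, b in zip(p, gp)]  # NOT p AND Γp (compensation term)

assert sum(left) + sum(right) == len(p)         # memberships sum to 1
print(mu(left), mu(right))                      # 0.6 0.4
```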
5 Conclusion
In fuzzy logic, given a proposition p, we have a fractional evaluation of p. Then we can study t-norms, t-conorms and complements as general operations. In classical logic we have general properties for operations; they are always true for every evaluation (tautology). In fuzzy logic with invariants, we want to introduce quantities that do not change if the single evaluation of p changes. In this way, invariants give a general formalization to fuzzy operations. The idea is to build a formal structure for fuzzy operations and to get a global expression that is always true even if every component is fuzzy.
References [1] G.Resconi,G.J Klir e U.St.Clair Hierarchical Uncertainity Metatheory Based Upon Modal Logic Int J. of General System, Vol. 21, pp. 23–50, 1992. [2] G.Resconi e I. B. Turksen. Canonical forms of fuzzy truthoods by meta theory based upon modal logic Information Sciences 131/2001 pp. 157–194. [3] G.Resconi, T.Murai. Field Theory and Modal Logic by Semantic Field to Make Uncertainity Emerge from Information IntJ. General System 2000. [4] G.Klir e Bo Yuan. Fuzzy sets and Fuzzy Logic Prentice Hall, 1995. [5] T.Muray, M.Nakata e M.Shimbo. Ambiguity, Inconsistency, and Possible-Worlds: A New Logical Approach Ninth Conference on Intelligence Technologies in HumanRelated Sciences, Leon, Spain, pp. 177, 184, 1996. [6] L. A.Zadeh. Fuzzy Sets Information and Control, Vol. 8, pp. 87–100, 1995.
Newsvendor Problems Based on Possibility Theory Peijun Guo Faculty of Economics, Kagawa University Takamatsu, Kagawa 760-8523, Japan
[email protected]
Abstract. In this paper, the uncertainty of the market for new products with a short life cycle is characterized by a possibility distribution. Two possibilistic models for the newsvendor problem are proposed: one is based on an optimistic criterion and the other on a pessimistic criterion, to reflect different preferences for risk in such a one-shot decision problem. These models are very different from the conventional newsvendor problem based on a probability distribution, in which maximizing expected utility is the goal.
1 Introduction
The newsvendor problem, also known as the newsboy or single-period problem, is a well-known inventory management problem. In general, the newsvendor problem has the following characteristics. Prior to the season, the buyer must decide how many units of goods to purchase. The procurement lead-time tends to be quite long relative to the selling season, so the buyer cannot observe demand prior to placing the order. Due to the long lead-time, there is no opportunity to replenish inventory once the season has begun. Excess products cannot be sold (or only at a trivial price) after the season. As is well known, the newsvendor problem derives its name from a common problem faced by a person selling newspapers on the street. Interest in such a problem has increased over the past 40 years, partially because of the increased dominance of service industries, for which the newsboy problem is very applicable in both retailing and service organizations. Also, the reduction in product life cycles makes the newsboy problem more relevant. Many extensions have been made in the last decade, such as different objectives and utility functions, different supplier pricing policies, and different newsvendor pricing policies [11,17]. Almost all of these extensions have been made in the probabilistic framework, that is, the uncertainty of demand and supply is characterized by a probability distribution, and the objective is to maximize the expected profit or the probability measure of achieving a target profit. However, some papers [3,8,9,10,13,14,15,18] have dealt with inventory problems using fuzzy set theory. There are few papers dealing with the uncertainty in newsvendor problems by fuzzy methods. Buckley [1] used a possibility distribution to represent a decision-maker's linguistic expression of demand, such as "good" and "not good" etc., and introduced a fuzzy goal to express the decision-maker's satisfaction. The order quantity was obtained
based on possibility and necessity measures, which made the possibility of not achieving the fuzzy goal sufficiently low and the possibility of achieving the fuzzy goal sufficiently high. Ishii et al. [6] investigated a fuzzy newsboy problem in which the shortage cost was vague and expressed as a fuzzy number. An optimal order quantity was obtained by a fuzzy maximal order. Petrovic et al. [16] gave a fuzzy newsboy model where the uncertain demand was represented by a group of fuzzy sets and the inventory cost was given by a fuzzy number. A defuzzification method was used to obtain an optimal order quantity. Kao et al. [7] obtained the optimal order quantity minimizing the fuzzy cost by comparing the areas of fuzzy numbers. Li et al. [12] proposed two models; in one the demand was probabilistic while the cost components were fuzzy, and in the other the costs were deterministic but the demand was fuzzy. The profit was maximized through ordering fuzzy numbers with respect to their total integral values. Guo et al. [4,5] proposed some new newsboy models with possibilistic information. In this paper, new newsboy models are proposed, with the emphasis that the proposed models are for new products with short life cycles, such as fashion goods and seasonal presents, where no data can be used for statistical analysis to predict the coming demand. The uncertainty of demand is characterized by a possibility distribution, where possibility degrees are used to capture the potential of the market determined by the matching degree between customer needs and the features of the goods, which can be obtained by detailed market investigation before the selling season. In other words, the uncertain demand is not described by linguistic terms, such as "good", "better" etc., that only reflect the subjective judgment of the decision-maker. The second point to be noted is that the newsboy problem is a typical one-shot decision problem, so that maximizing expected profit or a probability measure seems less meaningful. In this paper, the optimal order is determined by maximizing the optimistic or pessimistic value of the order quantity instead of maximizing a mean value as in probabilistic models or ranking fuzzy numbers as in fuzzy models. The third point is that a general model is given in which the possibility distributions and utility functions are not given as specified functions, so that a general analysis and conclusions can be made.
2 Newsvendor Model Based on Possibility Theory
Consider a retailer who sells a short life cycle, or single-period, new product. The retailer orders q units before the season at the unit wholesale price W. When demand x is observed, the retailer sells units (limited by the supply q and the demand x) at unit revenue R, with R > W. Any excess units can be salvaged at the unit salvage price S_o, with W > S_o. If there is a shortage, the lost-chance price is S_u. The profit function of the retailer is as follows:
r(x, q) = Rx + S_o(q − x) − Wq,  if x < q;
r(x, q) = (R − W)q − S_u(x − q),  if x ≥ q.  (1)
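A direct transcription of the profit function (1) is given below; the parameter values in the usage lines are arbitrary illustrations, not data from the paper.

```python
def profit(x, q, R, W, S_o, S_u):
    # R: unit revenue, W: wholesale price, S_o: salvage price, S_u: lost-chance price
    if x < q:
        return R * x + S_o * (q - x) - W * q      # leftover units are salvaged
    return (R - W) * q - S_u * (x - q)            # shortage: lost-chance penalty

print(profit(x=80, q=100, R=10, W=6, S_o=1, S_u=2))   # demand below the order quantity
print(profit(x=120, q=100, R=10, W=6, S_o=1, S_u=2))  # demand above the order quantity
```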
Fig. 1. The possibility distribution of demand
The plausible information of demand x is represented by a possibility distribution. The possibility distribution of x, denoted as π_D(x), is defined by the following continuous function:
π_D : [d_l, d_u] → [0, 1].  (2)
There exists d_c ∈ [d_l, d_u] with π_D(d_c) = 1, π_D(d_l) = 0 and π_D(d_u) = 0. π_D increases within x ∈ [d_l, d_c] and decreases within x ∈ [d_c, d_u]. d_l and d_u are the lower and upper bounds of demand, respectively, and d_c is the most possible amount of demand, as shown in Fig. 1. Because demand lies inside the interval [d_l, d_u], a reasonable supply should also lie inside this region. The highest profit of the retailer is r_u = (R − W)d_u; that is, the retailer orders the most, d_u, and the demand is the largest, d_u. The lowest profit is r_l = (d_l R + (d_u − d_l)S_o − d_u W) ∧ (d_l R − (d_u − d_l)S_u − d_l W), which is determined by the minimum of two cases: one is that the retailer orders the most but the demand is the lowest, the other is that the retailer orders the lowest but the demand is the highest. Without loss of generality, the assumption W ≥ S_o + S_u is made, which leads to r_l = d_l R + (d_u − d_l)S_o − d_u W.
Definition 1. The utility function of the retailer is defined by the following continuous strictly increasing function of profit r:
U : [r_l, r_u] → [0, 1],  (3)
where U(r_l) = 0 and U(r_u) = 1. Equation (3) gives a general form of the utility function of the decision-maker, where the utility of the lowest profit is 0 and the utility of the highest profit is 1.
Definition 2. The optimistic value of supply q, denoted as V_o(q), is defined as follows:
Fig. 2. The optimistic value of supply q
V_o(q) = max_x min(π_D(x), U(r(x, q))).  (4)
It can be seen that V_o(q) is similar to the concept of the possibility measure of a fuzzy event, illustrated by Fig. 2, where U(r(x, q)) corresponds to the fuzzy membership function.
Definition 3. The pessimistic value of supply q, denoted as V_p(q), is defined as follows:
V_p(q) = min_x max(1 − π_D(x), U(r(x, q))).  (5)
It can be seen that V_p(q) is similar to the concept of the necessity measure of a fuzzy event, illustrated by Fig. 3, where U(r(x, q)) corresponds to the fuzzy membership function. The retailer should decide an optimal order quantity, which maximizes V_o(q) or V_p(q), that is,
q*_o = arg max_q V_o(q),  (6)
q*_p = arg max_q V_p(q),  (7)
where q*_o and q*_p are called the optimistic order quantity and the pessimistic order quantity, respectively. It is obvious that an order quantity q will be given a higher evaluation by the optimistic criterion (4) if this order can lead to a higher utility with a high possibility. On the other hand, an order quantity q will be given a lower evaluation by the pessimistic criterion (5) if this order can lead to a lower utility with a high possibility. It is clear that the possibility-theory-based approach makes decisions that balance plausibility and satisfaction. The optimistic criterion and the pessimistic criterion were initially proposed by Yager [20] and Whalen [19], respectively. These criteria have been axiomatized in the style of Savage by Dubois, Prade and Sabbadin [2].
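As a numerical illustration of criteria (4)-(7), the sketch below discretises the demand interval and evaluates V_o and V_p directly; the triangular possibility distribution, the linear utility and all parameter values are assumptions made for the example, not data from the paper.

```python
import numpy as np

d_l, d_c, d_u = 50, 100, 150
R, W, S_o, S_u = 10, 6, 1, 2                     # satisfies W >= S_o + S_u
r_l = d_l * R + (d_u - d_l) * S_o - d_u * W
r_u = (R - W) * d_u

pi = lambda x: (x - d_l) / (d_c - d_l) if x <= d_c else (d_u - x) / (d_u - d_c)
U  = lambda r: (r - r_l) / (r_u - r_l)           # linear utility (an assumption)
r  = lambda x, q: R*x + S_o*(q - x) - W*q if x < q else (R - W)*q - S_u*(x - q)

xs = np.linspace(d_l, d_u, 501)
V_o = lambda q: max(min(pi(x), U(r(x, q))) for x in xs)
V_p = lambda q: min(max(1 - pi(x), U(r(x, q))) for x in xs)

qs = np.linspace(d_l, d_u, 501)
print("optimistic order :", qs[np.argmax([V_o(q) for q in qs])])
print("pessimistic order:", qs[np.argmax([V_p(q) for q in qs])])
```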
Fig. 3. The pessimistic value of supply q
Lemma 1. In the case of π_D(q) ≥ U(r(q, q)), arg max_{q ∈ [d_l, d_u]} ( max_{x ∈ [d_l, d_u]} min(π_D(x), U(r(x, q))) ) = q*_o, where q*_o is the solution of the equation π_D(q) = U(r(q, q)) with q > d_c.
Proof. Considering (1), max_x U(r(x, q)) = U(r(q, q)) holds. Then min(π_D(x), U(r(x, q))) ≤ U(r(x, q)) ≤ U(r(q, q)) holds for any x ∈ [d_l, d_u]. π_D(q) ≥ U(r(q, q)) makes min(π_D(q), U(r(q, q))) = U(r(q, q)) hold, so that V_o(q) = max_{x ∈ [d_l, d_u]} min(π_D(x), U(r(x, q))) = U(r(q, q)). Because U(r(q, q)) is a strictly increasing function of q, maximizing V_o(q) drives q upward, so q > d_c holds. Within x ∈ [d_c, d_u], increasing q makes U(r(q, q)) increase and π_D(q) decrease. Taking the condition π_D(q) ≥ U(r(q, q)) into account, V_o(q) reaches its maximum when π_D(q) = U(r(q, q)) holds, and the optimal q is denoted q*_o.
Lemma 2. In the case of π_D(q) < U(r(q, q)), max_{x ∈ [d_l, d_u]} min(π_D(x), U(r(x, q))) < U(r(q*_o, q*_o)) holds for any q < d_c.
Lemma 3. π_D(q) ≥ U(r(q, q)) holds for d_c < q < q*_o.
Lemma 4. In the case of π_D(q) < U(r(q, q)), max_{x ∈ [d_l, d_u]} min(π_D(x), U(r(x, q))) < U(r(q*_o, q*_o)) holds for any q > q*_o.
Theorem 1. q*_o is the solution of the following equation:
π_D(q) = U((R − W)q),  where q ∈ [d_c, d_u].  (8)
Proof. The relation between π_D(x) and U(r(x, q)) can be divided into three cases, that is, Case 1: π_D(q) ≥ U(r(q, q)); Case 2: π_D(q) < U(r(q, q)) and q < d_c; and Case 3: π_D(q) < U(r(q, q)) and q > d_c. It should be noted that there is no case with π_D(q) < U(r(q, q)) and q = d_c, because 1 = π_D(d_c) > U(r(d_c, d_c)). It is straightforward from Lemmas 1, 2, 3 and 4 that the optimal order quantity is the solution of the equation U(r(q, q)) = π_D(q) with q ∈ [d_c, d_u]. Considering (1), π_D(q) = U((R − W)q) is obtained. Equation (8) can easily be solved by Newton's method with the condition q ∈ [d_c, d_u]. Theorem 1 shows that the optimal order quantity obtained from the optimistic criterion is based only on the revenue and the wholesale price, which means that the retailer has enough confidence that he can sell what he orders.
Theorem 2. The pessimistic order q*_p is the solution of the following equation:
U(r(d*_pl, q)) = U(r(d*_pu, q)),  (9)
where d*_pl and d*_pu are the horizontal coordinates of the intersections of U(r(x, q)) and 1 − π_D(x) within [d_l, min[q, d_c]] and [max[q, d_c], d_u], respectively.
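Equation (8) is a one-dimensional root-finding problem on [d_c, d_u]. The text suggests Newton's method; the sketch below deliberately substitutes a simple bisection on f(q) = π_D(q) − U((R − W)q), reusing the assumed pi and U from the previous sketch.

```python
def solve_optimistic(pi, U, R, W, d_c, d_u, tol=1e-6):
    # f is positive at d_c (pi = 1) and negative at d_u (pi = 0), so bisection applies
    f = lambda q: pi(q) - U((R - W) * q)
    lo, hi = d_c, d_u
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# assumes pi, U, R, W, d_c, d_u from the earlier numerical sketch
print(solve_optimistic(pi, U, R, W, d_c, d_u))
```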
3 Conclusions
In this paper, the uncertainty of the market for new products with a short life cycle is characterized by possibility degrees that capture the potential of the market, determined by the matching degree between customer needs and the features of the goods. Two possibilistic models for the newsvendor problem are proposed: one is based on an optimistic criterion and the other on a pessimistic criterion, to reflect different preferences for risk in such a one-shot decision problem; they are very different from the conventional newsboy problem of maximizing expected utility. The optimal order quantity based on the optimistic criterion depends only on the revenue and the wholesale price, which means that the retailer has enough confidence that he can sell what he has ordered. On the contrary, the optimal order quantity based on the pessimistic criterion needs all the information about the market, which reflects a more conservative attitude. Because the newsvendor problem is a typical one-shot decision problem and the uncertainty of the coming market is easily and reasonably characterized by a possibility distribution, it can be believed that the proposed models can make suitable decisions for the newsvendor problem.
References
[1] Buckley, J.J., Possibility and necessity in optimization, Fuzzy Sets and Systems 25 (1988) 1-13
[2] Dubois, D., Prade, H. and Sabbadin, R., Decision-theoretic foundations of possibility theory, European Journal of Operational Research 128 (2001) 459-478.
[3] Gen, M., Tsujimura, Y. and Zheng, D., An application of fuzzy set theory to inventory control models, Computers Ind. Engng 33 (1997) 553-556.
[4] Guo, P., The newsboy problem based on possibility theory, Proceeding of the 2001 Fall National Conference of the Operations Research Society of Japan (2001) 96-97.
[5] Guo, P. and Chen, Y., Possibility approach to newsboy problem, Proceedings of the First International Conference on Electronic Business (2001) 385-386.
[6] Ishii, H. and Konno, T., A stochastic inventory problem with fuzzy shortage cost, European Journal of Operational Research 106 (1998) 90-94.
[7] Kao, C. and Hsu, W., A single-period inventory model with fuzzy demand, Computers and Mathematics with Applications 43 (2002) 841-848.
[8] Katagiri, H. and Ishii, H., Some inventory problems with fuzzy shortage cost, Fuzzy Sets and Systems 111 (2000) 87-97.
[9] Katagiri, H. and Ishii, H., Fuzzy inventory problem for perishable commodities, European Journal of Operational Research 138 (2002) 545-553.
[10] Lee, H. and Yao, J., Economic order quantity in fuzzy sense for inventory without backorder model, Fuzzy Sets and Systems 105 (1999) 13-31.
[11] Lippman, S.A. and McCardle, K.F., The competitive newsboy, Operations Research 45 (1997) 54-65.
[12] Li, L., Kabadi, S.N. and Nair, K.P.K., Fuzzy models for single-period inventory problem, Fuzzy Sets and Systems (In Press)
[13] Lin, D. and Yao, J., Fuzzy economic production for inventory, Fuzzy Sets and Systems 111 (2000) 465-495.
[14] Kacpryzk, P. and Staniewski, P., Long-term inventory policy-making through fuzzy decision-making models, Fuzzy Sets and Systems 8 (1982) 117-132.
[15] Park, K.S., Fuzzy set theoretic interpretation of economic order quantity, IEEE Trans. Systems Man Cybernet. SMC 17(6) (1996) 1082-1084.
[16] Petrovic, D., Petrovic, R. and Vujosevic, M., Fuzzy model for newsboy problem, Internat. J. Prod. Econom. 45 (1996) 435-441.
[17] Porteus, E.L., Stochastic Inventory Theory, Handbooks in OR and MS, Vol. 2, Heyman, D.P. and Sobel, M.J. eds., Elsevier Science Publisher, 605-652.
[18] Roy, T.K. and Maiti, M., A fuzzy EOQ model with demand-dependent unit cost under limited storage capacity, European Journal of Operational Research 99 (1997) 425-432.
[19] Whalen, T., Decision making under uncertainty with various assumptions about available information, IEEE Transaction on Systems, Man and Cybernetics 14 (1984) 888-900.
[20] Yager, R.R., Possibilistic decision making, IEEE Transaction on Systems, Man and Cybernetics 9 (1979) 388-392.
Uncertainty Management in Rule Based Systems Application to Maneuvers Recognition T. Benouhiba, and J. M. Nigro Université de Technologie de Troyes 12 rue Marie Curie, BP 2060 10010 Troyes Cedex, France {toufik.benouhiba, nigro}@utt.fr
Abstract. In this paper we study uncertainty management in expert systems. This task is particularly important when noisy data are used, as in the CASSICE project presented below. Indeed, using a classical expert system in this case generally produces mediocre results. We investigate uncertainty management using the Dempster-Shafer theory of evidence and discuss the benefits and drawbacks of this approach by comparing the results obtained with those of an expert system based on fuzzy logic. Keywords: Expert system, evidence theory, uncertain reasoning.
1 Introduction
Expert systems are a widely used approach in Artificial Intelligence; they can easily model expert knowledge in a well-known field. Expert systems have been used successfully in diagnosis, failure detection, recognition, etc. Nevertheless, they suffer from several problems. On one hand, they are closed systems, i.e., they cannot react well when the environment changes. On the other hand, they can neither perform uncertain reasoning nor handle noisy or vague data. Since the first versions of MYCIN [16], many theoretical works have studied uncertainty in expert systems, using for example probabilities [7] and fuzzy logic [18]. However, another promising method has not been studied as thoroughly: the theory of evidence. It offers a mathematical framework for subjectivity and total ignorance but does not manage uncertainty directly. In this paper we study the application of this theory to expert systems by combining it with some concepts of fuzzy logic. We present a real application of this approach to the recognition of driving maneuvers within the framework of the CASSICE project [12]. This paper is organized as follows. First, section 2 briefly describes the CASSICE project. Next, section 3 presents the Dempster-Shafer theory of evidence. Section 4 then shows how this theory can be used to perform uncertain reasoning in expert systems. After that, section 5 presents an application example of the proposed approach in the CASSICE project; we discuss the results obtained by comparing them with those of another expert system based on fuzzy logic. Finally, we give the limitations and perspectives of this work.
2 The CASSICE Project
The CASSICE project aims to help psychologists understand driver behavior in order to model it. The final goal is to improve the comfort and safety of the driver. The concrete objective of CASSICE is to build a computer system which indexes real driving situations. This system allows psychologists to search efficiently for driving situations in a given context through a multi-criteria search interface. The project uses a car equipped with a set of proprioceptive sensors and cameras. Until now, the data used in the recognition have come from a simulator. Table 1 shows the data types used. Table 1. Simulator data types
Data    Signification
Time    Clock (second)
Acc     Acceleration of EV compared to TV
Phi     EV front wheels angle
Rd      EV distance from the right border
Rg      EV distance from the left border
Teta    Angle with TV
V       Speed of EV compared to TV
X,Y     Position of TV compared to EV (x and y axes)

(TV denotes the target vehicle and EV the experimental vehicle.)
It should be noted that two classical expert systems have already been developed. The first one, called IDRES [13], performs an exhaustive recognition of the states (we consider that a maneuver is composed of a succession of several states). A maneuver is recognized when all the states that compose it are detected in the right order. The second system, called DRSC [11], models a maneuver as an automaton. A maneuver is recognized when the sensor data allow the recognition process to pass from the initial state to the last state of the automaton. In this paper we restrict the recognition to the overtaking maneuver, which we consider to be composed of ten successive states: wait to overtake, overtaking intention, beginning of the lane change to the left, crossing the left discontinuous line, end of the lane change to the left, passing, end of passing, beginning of the lane change to the right, crossing the right discontinuous line, and end of the lane change to the right.
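As an illustration of the exhaustive-recognition idea (a maneuver is accepted only when its states appear in the prescribed order), a minimal Python sketch follows; it is not the IDRES or DRSC implementation, and the observation stream is invented.

```python
# Minimal sketch of order-preserving state recognition for the overtaking
# maneuver (illustration only; not the IDRES/DRSC code).

OVERTAKING_STATES = [
    "wait to overtake",
    "overtaking intention",
    "beginning of the lane change to the left",
    "crossing the left discontinuous line",
    "end of the lane change to the left",
    "passing",
    "end of passing",
    "beginning of the lane change to the right",
    "crossing the right discontinuous line",
    "end of the lane change to the right",
]

def maneuver_recognized(observed_states, maneuver=OVERTAKING_STATES):
    """Return True if all maneuver states appear in observed_states
    in the right order (other states may be interleaved)."""
    it = iter(observed_states)
    return all(any(obs == state for obs in it) for state in maneuver)

# Hypothetical observation stream produced by the state-recognition rules.
observations = OVERTAKING_STATES[:3] + ["unknown state"] + OVERTAKING_STATES[3:]
print(maneuver_recognized(observations))  # True
```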
3 Uncertainty Management

3.1 Evidence Theory Elements
This theory can be regarded as a general extension of Bayesian theory. It can deal with total ignorance and with the subjectivity of experts by combining several points of view. This last point is the most important, because it allows several data uncertainties to be managed at the same time. The evidence theory originates in the work of Dempster on the theory of probabilities with upper and lower bounds [2]. It was given its definitive form by his student Shafer in 1976 in the book “A Mathematical Theory of Evidence” [15]. Let Ω be a finite set of hypotheses. An evidence mass m is a function defined on 2^Ω such that Σ_{A⊆Ω} m(A) = 1; m(A) represents the exact belief in A. Any subset of Ω for which the evidence mass is non-zero is called a focal element. If Ω = {a, b}, then m({a, b}) measures the fact that one is indifferent between a and b; in other words, it is an ignorance measure. This theory is a generalization of probability theory: if all focal elements are singletons, the measure is a probability [16]. Many other measures are defined in this theory, but the most used are the belief (bel) and the plausibility (pl) measures:

bel(A) = Σ_{B⊆A} m(B),    pl(A) = Σ_{A∩B≠∅} m(B)   (1)
The belief represents the confidence that one has in A or in any subset of A, whereas the plausibility represents the extent to which we fail to disbelieve A. These two measures are, in fact, the lower and upper bounds of the probability of A. The Dempster combination operator is one of the most important elements of the evidence theory. It makes it possible to combine several evidence masses (several points of view) about the same evidence. Let m1 and m2 be two evidence masses; the combination of the two points of view gives:

m1 ⊕ m2 (C) = 0, if C = ∅;
m1 ⊕ m2 (C) = [ Σ_{A∩B=C} m1(A)·m2(B) ] / [ 1 − Σ_{A∩B=∅} m1(A)·m2(B) ], otherwise.   (2)
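For illustration, a minimal Python sketch of the combination operator (2); the frame Ω and the two masses are invented, and subsets are encoded as frozensets.

```python
# Sketch of the Dempster combination operator (2) for masses defined on
# subsets of a frame Omega; subsets are encoded as frozensets.

def dempster_combine(m1, m2):
    combined = {}
    conflict = 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            c = a & b
            if c:
                combined[c] = combined.get(c, 0.0) + ma * mb
            else:
                conflict += ma * mb          # mass assigned to the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: combination undefined")
    return {c: v / (1.0 - conflict) for c, v in combined.items()}

# Two (invented) points of view about the frame {a, b}.
A, B, AB = frozenset("a"), frozenset("b"), frozenset("ab")
m1 = {A: 0.6, AB: 0.4}          # partial ignorance
m2 = {A: 0.3, B: 0.5, AB: 0.2}
print(dempster_combine(m1, m2))
```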
3.2 Adaptation of the Evidence Theory to Expert Systems
The evidence theory has been successfully used in many fields such as medical imagery [4], defect classification by imagery [6], robot navigation systems [10][17] and decision making [2][3]. In general, this theory has shone in all fields related to data fusion. However, few works have applied it to the expert systems field; this is probably due to the combination operator. In fact, this operator cannot directly combine evidence masses associated with different variables. Some theoretical works have tried to solve this problem; in this paper we use the method presented in [14]. In the following formulae, A and B are two distinct facts, m1 is the evidence mass associated with A, m2 is the evidence mass associated with B, and m is the evidence mass associated with the fact (A and B). Writing ¬A for the negation of A and A∪¬A for the ignorance about A (and similarly for B):

m(A ∩ B) = m1(A)·m2(B)   (3)

m(¬(A ∩ B)) = m1(¬A)·m2(B) + m1(A)·m2(¬B) + m1(¬A)·m2(¬B) + m1(¬A)·m2(B∪¬B) + m1(A∪¬A)·m2(¬B)   (4)

m((A ∩ B) ∪ ¬(A ∩ B)) = m1(A)·m2(B∪¬B) + m1(A∪¬A)·m2(B) + m1(A∪¬A)·m2(B∪¬B)   (5)
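A small numeric sketch of (3)-(5), assuming each fact is described by the triple (m(X), m(¬X), m(X∪¬X)); the values are invented and the grouping of terms follows the reconstruction above.

```python
# Sketch of formulas (3)-(5): masses for the conjunction "A and B" from
# the triples (m(X), m(not X), m(X or not X)) of two distinct facts.

def conjunction_mass(mA, mB):
    a, na, ia = mA          # m1(A), m1(not A), m1(A u not A)
    b, nb, ib = mB          # m2(B), m2(not B), m2(B u not B)
    true_ = a * b                                            # (3)
    false_ = na * b + a * nb + na * nb + na * ib + ia * nb   # (4)
    ignorance = a * ib + ia * b + ia * ib                    # (5)
    return true_, false_, ignorance

mA = (0.7, 0.1, 0.2)   # invented evidence about fact A
mB = (0.5, 0.3, 0.2)   # invented evidence about fact B
t, f, i = conjunction_mass(mA, mB)
print(t, f, i, t + f + i)   # the three masses sum to 1
```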
4 Uncertain Reasoning

4.1 Generating Evidence Masses
To generate evidence masses, we consider each input datum as an intuitionistic number (see intuitionistic fuzzy sets [1]). The input value is the centre of this number, while the choice of its parameters depends on the scale and the variations of the data. A condition is a conjunction of simple conditions, each comparing a variable with a bound through an operator from the set {=, >, <}; for instance, a condition of the form (x > inf) and (x < sup) receives the mass M(x > inf) ⊕ M(x < sup). When the combined mass exceeds a threshold, the rule concludes on a state, e.g. State = («Wait to overtake», E).
However, another problem arises: useless rules may be fired. In reality, this problem cannot be avoided, because of the overhead introduced by the uncertainty management, but it can be partially solved by using a threshold. Even with a dedicated tool for writing our expert system this problem would remain, but the code of the system would be more readable and easier to maintain.
5 Experiments and Results

Some considerations must be mentioned about the driving maneuver recognition. The parameters are chosen according to the scale and the variations of the data. The flexibility of the system depends on the choice of the thresholds: if the threshold is large, the system behaves like a classical expert system; if, on the contrary, it is too small, the search tends to become random.

5.1 Results
The developed system enables us to plot the variation of the evidence masses of a given state through time (see Section 2). For example, Figure 1 shows the variation of the evidence masses associated with the “true” value (m(T)) for two states: “overtaking intention” (continuous line) and “crossing the left discontinuous line” (dotted line):
Fig. 1. Variations of the evidence masses of two states
Figure 2 shows the variation of the evidence mass associated with “true” for the whole maneuver:
Fig. 2. Evidence mass variations for the overtaking maneuver
Numbers on the graph refer to the rank of the recognized state within the maneuver. The evidence mass associated with the “true” value is a decreasing function which sometimes drops sharply to zero. This is explained by the fact that we are in a total ignorance situation, i.e., we cannot decide which state is occurring (unknown state). We note that some states do not appear on the graph; this is because these states are not indispensable for the maneuver recognition.

5.2 A Recognition System Based on Fuzzy Logic
We developed a second expert system, based on fuzzy logic, in order to better evaluate the performance of the first one. Here we used a specific tool, FUZZYCLIPS [8], an extension of the CLIPS tool that manages uncertainty using fuzzy logic. In this system, input data are transformed into fuzzy numbers [9]; this conversion uses several parameters, as in the first system. Figure 3 shows the variation of the certainty factors (CF) of the two states “overtaking intention” (continuous line) and “crossing the left discontinuous line” (dotted line), while Figure 4 shows the variation of the CF of the whole maneuver:
Fig. 3. The CF variations of two states
Fig. 4. The CF variations of the maneuver
The cuts in Figure 4 are explained by the fact that the system cannot find the current state, i.e., these are total ignorance situations. Fuzzy logic cannot manage these situations because it has no measure to quantify ignorance.

5.3 Discussion
The results obtained by the two systems are quite similar. Both can handle imprecise data, but each uses its own method. The first system uses the concept of intuitionistic fuzzy sets, because they enable us to generate evidence masses; the second system uses vague predicates and the possibility and necessity calculations. In addition, thanks to the evidence theory, the first system can perform data fusion: if a rule's right-hand side inserts numeric data about the same variable into the facts base, a first firing of the rule inserts the data into the facts base, while a second firing merges the new data with the data already inserted. Both systems enable us to see when the car gets ready to enter a given state or to leave it, which makes it easier to understand the progress of the maneuver through time. The rules of the second system are easier to write because of the use of vague predicates. In the first system, rules are neither easy to read nor easy to design, because CLIPS does not directly support uncertainty management. In addition, moving premises from the left-hand side to the right-hand side of the rules means that many useless rules may be fired. The first problem can be solved by designing a specific tool, by analogy with FUZZYCLIPS, to hide all the computing details. The second problem cannot be solved; it is due to the overhead introduced by the uncertainty management. However, because FUZZYCLIPS controls the uncertainty itself, the second system is not as flexible. The first system does not suffer from this limitation: its flexibility can easily be modified by changing the parameters αi and βi of the intuitionistic numbers. It also offers several types of measures that help us better understand the progress of the maneuver.
6 Conclusion and Perspectives
In this paper we have studied two ways to manage uncertainty in expert systems by applying them to a driving maneuver recognition application (the CASSICE project). Both systems recognize driving maneuvers from data coming from several sensors; only the overtaking maneuver has been considered here. The first system is based on the Dempster-Shafer theory of evidence, combined with some fuzzy logic concepts in order to manage uncertainty in the rule-based system. The second system relies on fuzzy set theory but does not support total ignorance. The results obtained by the two systems are quite similar. We can detect the moments when the driver gets ready to pass from one state to another. However, the first system allows a better understanding of the maneuver progress through its additional measures. For example, when the ignorance mass is equal to 1 we can deduce that the system could not recognize the current state, and consequently that the rule set does not contain all the necessary knowledge. There are two perspectives for this work. First, the developed systems should be generalized to recognize other maneuvers; one can imagine that the system would have a maneuver database containing the descriptions of all maneuvers. Second, a specific tool for the evidence theory could be developed, by analogy with FUZZYCLIPS, in order to facilitate the design of the expert system.
References

[1] Atanassov, K.: Intuitionistic fuzzy sets. Fuzzy Sets and Systems 20 (1986) 87-96.
[2] Beynon, M., Cosker, D., Marshall, D.: An expert system for multi-criteria decision making using Dempster-Shafer theory. Expert Systems with Applications 20 (2001) 357-367.
[3] Beynon, M., Curry, B., Morgan, P.: The Dempster-Shafer theory of evidence: an alternative approach to multi-criteria decision modeling. The International Journal of Management Science (2000) 37-50.
[4] Bloch, I.: Some aspects of Dempster-Shafer evidence theory for classification of multi-modality medical images taking partial volume effect into account. Pattern Recognition Letters 17 (1996) 905-919.
[5] CLIPS: www.ghg.net/clips/download/documentation
[6] Dongping, Z., Conners, T., Schmoldt, D., Araman, P.: A prototype vision system for analyzing CT imagery of hardwood logs. IEEE Transactions on Systems, Man and Cybernetics B 26(4) (1996) 522-532.
[7] Fagin, R., Halpern, J.Y.: Reasoning about knowledge and probability. Proceedings of the Second Conference on Theoretical Aspects of Reasoning about Knowledge, Morgan Kaufmann (1988) 277-293.
[8] FuzzyCLIPS: ai.iit.nrc.ca/cgi-bin/FuzzyCLIPS_log
[9] Heilpern, S.: Representation and application of fuzzy numbers. Fuzzy Sets and Systems 91 (1997) 259-268.
[10] Kak, A., Andress, K., Lopez-Abadia, C., Carol, M., Lewis, R.: Hierarchical evidence accumulation in the PSEIKI system. Uncertainty in Artificial Intelligence, vol. 5, North-Holland (1990).
[11] Loriette, S., Nigro, J.M., Jarkass, I.: Rule-based approaches for the recognition of driving maneuvers. ISTA'2000 (International Conference on Advances in Intelligent Systems: Theory and Applications), Canberra, Australia, Volume 59 of Frontiers in Artificial Intelligence, IOS Press, 2-4 February 2000.
[12] Nigro, J.M., Loriette-Rougegrez, S., Rombaut, M., Jarkass, I.: Driving situation recognition in the CASSICE project – towards an uncertainty management. ITSC 2000 (2000) 71-76.
[13] Nigro, J.M., Loriette, S.: Characterization of driving situation. MS'99 (International Conference on Modeling and Simulation), May 17-19 (1999) 287-297.
[14] Nigro, J.M., Loriette-Rougegrez, S., Rombaut, M.: Driving situation recognition with uncertainty management and rule-based systems. Engineering Applications of Artificial Intelligence 15(3-4) (2002) 217-228.
[15] Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press (1976).
[16] Shortliffe, E.H., Buchanan, B.: A model of inexact reasoning in medicine. Mathematical Biosciences 23 (1975) 351-379.
[17] Wang, J., Wu, Y.: Detection for mobile robot navigation based on multisensor fusion. Proceedings of the SPIE, vol. 2591 (1995) 182-192.
[18] Zadeh, L.A.: Fuzzy sets. Information and Control 8 (1965) 338-353.
Fuzzy Coefficients and Fuzzy Preference Relations in Models of Decision Making

Petr Ekel¹, Efim Galperin², Reinaldo Palhares³, Claudio Campos¹, and Marina Silva¹

¹ Graduate Program in Electrical Engineering, Pontifical Catholic University of Minas Gerais, Ave. Dom José Gaspar, 500, 30535-610, Belo Horizonte, MG, Brazil
[email protected], [email protected], [email protected]
² Department of Mathematics, University of Quebec at Montreal, C.P. 8888, Succ. Centre-Ville, Montreal, Quebec, Canada H3C 3P8
[email protected]
³ Department of Electronics Engineering, Federal University of Minas Gerais, Ave. Antônio Carlos, 6627, 31270-010, Belo Horizonte, MG, Brazil
[email protected]

Abstract. Analysis of < X, R > models is considered as part of a general approach to solving a wide class of optimization problems with fuzzy coefficients. This approach consists in formulating and solving one and the same problem within the framework of interrelated models to maximally cut off dominated alternatives. The subsequent contraction of the decision uncertainty region is based on reduction of the problem to multiobjective decision making in a fuzzy environment with applying techniques based on fuzzy preference relations. The results of the paper are of a universal character and are already being used to solve power engineering problems.
1 Introduction
Investigations of recent years show the benefits of applying fuzzy set theory to deal with various types of uncertainty, particularly for optimization problems where there are advantages of a fundamental nature (the possibility of validly obtaining less “cautious” solutions) and computational character [1]. The uncertainty of goals is the notable kind of uncertainty related to a multiobjective character of many optimization problems. It is possible to classify two types of problems, which need the use of a multiobjective approach [2]: (a) problems in which solution consequences cannot be estimated on the basis of a single criterion, that involves the necessity of analyzing a vector of criteria, and (b) problems that may be solved on the basis of a single criterion but their unique
solutions are not achieved because the uncertainty of information produces so-called decision uncertainty regions, and the application of additional criteria can serve as a means to reduce these regions [3]. According to this, two classes of models (so-called < X, M > and < X, R > models) may be constructed. When analyzing < X, M > models, a vector of objective functions is considered for their simultaneous optimization. The lack of clarity in the concept of “optimal solution” is the fundamental methodological complexity in analyzing < X, M > models. When applying the Bellman-Zadeh approach [4], this concept is defined with reasonable validity because the maximum degree of implementing goals serves as the optimality criterion. This conforms to the principle of guaranteed result and provides a constructive line in obtaining harmonious solutions [1, 5]. Some specific questions of using the approach are discussed in [1, 2]. Taking this into account, the paper considers primarily < X, R > models.
2 Optimization Problems with Fuzzy Coefficients
Many problems related to complex system design and control may be formulated as follows:

maximize F̃(x1, ..., xn)   (1)

subject to the constraints

g̃j(x1, ..., xn) ⊆ B̃j,  j = 1, ..., m,   (2)

where the objective function (1) and constraints (2) include fuzzy coefficients, as indicated by the ∼ symbol. The following problem can also be defined:

minimize F̃(x1, ..., xn),   (3)

subject to the same constraints (2). An approach [3] to handle constraints such as (2) involves replacing each of them by a finite set of nonfuzzy constraints. Depending on the essence of the problem, it is possible to convert constraints (2) to constraints

gj(x1, ..., xn) ≤ bj,  j = 1, ..., d ≥ m,   (4)

or to constraints

gj(x1, ..., xn) ≥ bj,  j = 1, ..., d ≥ m.   (5)
Problems with fuzzy coefficients only in the objective functions can be solved by modifying traditional methods [2, 3]. For example, the algorithms [1, 3] of solving discrete problems (1), (4) and (3), (5) are based on modifying the methods of [6]. In their use, the need arises to compare alternatives on the basis of relative fuzzy values of the objective function. This may be done by applying the methods classified in [7]. One of their groups is related to building fuzzy
preference relations providing the most justified way to compare alternatives [8]. Considering this, it is necessary to distinguish the choice function or fuzzy number ranking index based on the conception of a generalized preference relation [9]. If the membership functions corresponding to the values F1 and F2 of the objective function to be maximized are µ(f1) and µ(f2), the quantities

η{µ(f1), µ(f2)} = sup_{f1, f2 ∈ F, f1 ≥ f2} min{µ(f1), µ(f2)},   (6)

and

η{µ(f2), µ(f1)} = sup_{f1, f2 ∈ F, f2 ≥ f1} min{µ(f1), µ(f2)},   (7)
are the degrees of preference of F1 over F2 and of F2 over F1, respectively. Applying (6) and (7), it is possible to judge the preference of any of the alternatives compared. However, if the membership functions µ(f1) and µ(f2) correspond to so-called flat fuzzy numbers, the use of (6) and (7) can lead to η{µ(f1), µ(f2)} = η{µ(f2), µ(f1)}. In such situations the algorithms do not allow one to obtain a unique solution. This also occurs with other modifications of optimization methods when the uncertainty and relative stability of optimal solutions produce decision uncertainty regions. In this connection other choice functions (for example, [10, 11]) may be used. However, these functions occasionally result in choices which appear intuitively inconsistent, and their use does not permit one to close the question of building an order on a set of fuzzy numbers [3]. A better validated and more natural approach is associated with the transition to the multiobjective choice of alternatives. The application of additional criteria can serve as a convincing means to contract the decision uncertainty region.
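As an illustration of (6) and (7), a short sketch that evaluates the two preference degrees on a discrete grid for two triangular fuzzy values; the membership parameters are invented.

```python
import numpy as np

# Sketch of the preference degrees (6)-(7) for two fuzzy objective values,
# computed on a discrete grid (triangular memberships, invented parameters).

def tri(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def preference_degree(mu1, mu2):
    """eta{mu1, mu2} = sup_{f1 >= f2} min(mu1(f1), mu2(f2)) on a grid."""
    m = np.minimum.outer(mu1, mu2)        # m[i, j] = min(mu1(f_i), mu2(f_j))
    i, j = np.indices(m.shape)
    return float(m[i >= j].max())         # keep only pairs with f_i >= f_j

f = np.linspace(0.0, 10.0, 501)
mu1 = tri(f, 3.0, 5.0, 7.0)               # F1 "around 5"
mu2 = tri(f, 4.0, 6.0, 8.0)               # F2 "around 6"
print(preference_degree(mu1, mu2))        # degree of F1 preferred to F2
print(preference_degree(mu2, mu1))        # degree of F2 preferred to F1
```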
3 Multiobjective Choice and Fuzzy Preference Relations
First of all, it should be noted that considerable contraction of the decision uncertainty region may be obtained by formulating and solving one and the same problem within the framework of mutually interrelated models: (a) maximizing (1) while satisfying constraints (4) interpreted as convex down, and (b) minimizing (3) while satisfying constraints (5) interpreted as convex up [1, 3]. Assume we are given a set X of alternatives (from the decision uncertainty region) that are to be examined by q criteria of quantitative and/or qualitative nature to make an alternative choice. The problem is presented as < X, R >, where R = {R1, ..., Rq} is a vector of fuzzy preference relations. Therefore

Rp = [X × X, µRp(Xk, Xl)],  p = 1, ..., q,  Xk, Xl ∈ X,   (8)

where µRp(Xk, Xl) is the membership function of the fuzzy preference relation. The availability of fuzzy or linguistic estimates Fp(Xk), p = 1, ..., q, Xk ∈ X, with the membership functions µ[fp(Xk)], p = 1, ..., q, Xk ∈ X, permits one to build Rp, p = 1, ..., q, on the basis of expressions similar to (6) and (7).
If R is a single preference relation, it can be put in correspondence with a strict fuzzy preference relation RS = R\R⁻¹ [9] with the membership function

µSR(Xk, Xl) = max{µR(Xk, Xl) − µR(Xl, Xk), 0}.   (9)

Since µSR(Xl, Xk) describes the fuzzy set of alternatives which are strictly dominated by Xl, its complement 1 − µSR(Xl, Xk) gives the set of nondominated alternatives. To choose all these alternatives, it is enough to find the intersection of all 1 − µSR(Xl, Xk), Xk ∈ X, over all Xl ∈ X, with

µR(Xk) = inf_{Xl∈X} [1 − µSR(Xl, Xk)] = 1 − sup_{Xl∈X} µSR(Xl, Xk).   (10)

Because µR(Xk) is the degree of nondominance, it is natural to choose

X = {Xk | Xk ∈ X, µR(Xk) = sup_{Xk∈X} µR(Xk)}.   (11)
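A compact sketch of (9)-(11) for a single fuzzy preference relation given as a matrix; the relation values are invented.

```python
import numpy as np

# Sketch of (9)-(11): strict preference relation, nondominance degrees and
# the choice set, for a single fuzzy preference relation given as a matrix.

def nondominance(R):
    S = np.maximum(R - R.T, 0.0)          # (9)  strict preference mu_S
    mu_nd = 1.0 - S.max(axis=0)           # (10) 1 - sup_l mu_S(X_l, X_k)
    choice = np.flatnonzero(np.isclose(mu_nd, mu_nd.max()))   # (11)
    return S, mu_nd, choice

# Invented 3x3 preference relation over alternatives X1, X2, X3.
R = np.array([[1.0, 0.8, 0.6],
              [0.7, 1.0, 0.9],
              [0.5, 0.4, 1.0]])
S, mu_nd, choice = nondominance(R)
print(mu_nd)                              # nondominance degree of each alternative
print([f"X{k + 1}" for k in choice])
```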
If we have a vector fuzzy preference relation, expressions (9)-(11) can serve as the basis for a lexicographic procedure of step-by-step introduction of criteria. It generates a sequence X¹, X², ..., X^q such that X ⊇ X¹ ⊇ X² ⊇ ··· ⊇ X^q, with the use of the following expressions:

µpR(Xk) = 1 − sup_{Xl ∈ X^{p−1}} µSRp(Xl, Xk),  p = 1, ..., q,   (12)

X^p = {X^p_k | X^p_k ∈ X^{p−1}, µpR(X^p_k) = sup_{Xk ∈ X^{p−1}} µpR(Xk)},   (13)
constructed on the basis of (10) and (11), respectively. If a uniquely determined order is difficult to build, it is possible to apply another procedure. In particular, the expressions (9)-(11) are applicable if we take R = ∩_{p=1}^{q} Rp, i.e.,

µR(Xk, Xl) = min_{1≤p≤q} µRp(Xk, Xl),  Xk, Xl ∈ X.   (14)

The use of this procedure leads to the set X that fulfils the role of the Pareto set [12]. If necessary, its contraction is possible using the convolution

µT(Xk, Xl) = Σ_{p=1}^{q} λp µRp(Xk, Xl),  Xk, Xl ∈ X,   (15)

where λp > 0, p = 1, ..., q, are importance factors normalized as Σ_{p=1}^{q} λp = 1. The construction of µT(Xk, Xl), Xk, Xl ∈ X, allows one to obtain the membership function µT(Xk) of the subset of nondominated alternatives using an expression similar to (9). Its intersection with µR(Xk), defined as

µ(Xk) = min{µR(Xk), µT(Xk)},  Xk ∈ X,   (16)
provides us with

X = {Xk | Xk ∈ X, µ(Xk) = sup_{Xk∈X} µ(Xk)}.   (17)
Finally, it is possible to apply a third procedure, based on presenting (9) as

µRp(Xk) = 1 − sup_{Xl∈X} µSRp(Xl, Xk),  p = 1, ..., q,   (18)

to build the membership functions of the subset of nondominated alternatives for all preference relations. Since the membership functions (18) play a role identical to the membership functions replacing the objective functions in < X, M > models [1, 2], it is possible to build

µ(Xk) = min_{1≤p≤q} µRp(Xk)   (19)

to obtain X in accordance with (17). If it is necessary to differentiate the importance of the different fuzzy preference relations, it is possible to transform (19) as follows:

µ(Xk) = min_{1≤p≤q} [µRp(Xk)]^{λp}.   (20)
The use of (20) does not require the normalization of λp, p = 1, ..., q. It is natural that the use of the second procedure may lead to solutions different from the results obtained on the basis of the first procedure. However, solutions based on the second procedure and on the third procedure (which is preferable from the substantive point of view), although related to a single generic basis, may at times also be different. Considering this, it should be stressed that the possibility of obtaining different solutions is natural, and the choice of the approach is a prerogative of the decision maker. All procedures have been implemented within the framework of the decision making system DMFE (developed in C++ in the Borland Builder environment). Its flexibility and user-friendly interaction with a decision maker make it possible to use DMFE for solving complex problems of multiobjective choice of alternatives with criteria of quantitative and qualitative nature.
4 Illustrative Example
The results described above have found applications in solving power engineering problems. As an example, we dwell on using the multiobjective choice of alternatives for substation planning. Without discussing substantial considerations (they are given in [13]), the problem consists in comparing three alternatives, which could not be distinguished from the point of view of their total costs, using criteria “Alternative Costs”, “Flexibility of Development”, and “Damage to Agriculture”. The fuzzy preference relations corresponding to these criteria,
built with the use of (6) and (7) on the basis of the alternative membership functions presented in [13], are the following:

µR1(Xk, Xl) =
[ 1      1      1 ]
[ 1      1      1 ]
[ 1      0.912  1 ]   (21)
µR2(Xk, Xl) =
[ 1      1      0.909 ]
[ 1      1      0.909 ]
[ 1      1      1     ]   (22)

and

µR3(Xk, Xl) =
[ 1      1      1 ]
[ 0.938  1      1 ]
[ 0.625  0.938  1 ]   (23)
Consider the application of the first approach if the criteria are arranged, for example, in the order p = 1, p = 2, p = 3. Using (21), it is possible to form the strict fuzzy preference relation

µSR1(Xk, Xl) =
[ 0  0  0     ]
[ 0  0  0.088 ]
[ 0  0  0     ]   (24)

Following (12) and (13), we obtain on the basis of (24)

µ1R(Xk) = ( 1  1  0.912 )   (25)

and X¹ = {X1, X2}. From (22), restricted to X¹, we have

µ2R2(Xk, Xl) =
[ 1  1 ]
[ 1  1 ]

which leads to X² = {X1, X2}. Finally, from (23) we can obtain

µ3R3(Xk, Xl) =
[ 1      1 ]
[ 0.938  1 ]

providing us with X³ = {X1}. Consider now the application of the second approach. As a result of the intersection of (21), (22), and (23), we obtain

µR(Xk, Xl) =
[ 1      1      0.909 ]
[ 0.938  1      0.909 ]
[ 0.625  0.912  1     ]   (26)
It permits one to construct
µSR(Xk, Xl) =
[ 0  0.062  0.284 ]
[ 0  0      0     ]
[ 0  0.003  0     ]   (27)
Following (11), we obtain µR(Xk) = ( 1  0.938  0.716 ) and X = {X1}. Let us consider the use of the third approach. The membership function of the subset of nondominated alternatives for µR1(Xk) is (25). Using (22) and (23), we obtain µR2(Xk) = ( 0.909  0.909  1 ) and µR3(Xk) = ( 1  0.938  0.625 ), respectively. As a result of their intersection, we have X = {X1, X2}. Thus, the use of the first and second approaches leads to choosing the first alternative. The last approach discards the third alternative but cannot distinguish the first and second alternatives on the basis of the information reflected by the fuzzy preference relations (21)-(23).
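The computations of the second procedure can be checked numerically; the following sketch reproduces matrices (26) and (27) and the nondominance degrees from the relations (21)-(23).

```python
import numpy as np

# Sketch reproducing the second procedure of the example: intersection (26)
# of the relations (21)-(23), strict preference (27), and the nondominance
# degrees, which single out the first alternative.

R1 = np.array([[1, 1, 1], [1, 1, 1], [1, 0.912, 1]])
R2 = np.array([[1, 1, 0.909], [1, 1, 0.909], [1, 1, 1]])
R3 = np.array([[1, 1, 1], [0.938, 1, 1], [0.625, 0.938, 1]])

R = np.minimum(np.minimum(R1, R2), R3)      # (14) -> matrix (26)
S = np.maximum(R - R.T, 0.0)                # (9)  -> matrix (27)
mu_nd = 1.0 - S.max(axis=0)                 # (10)
print(np.round(R, 3))
print(np.round(S, 3))
print(np.round(mu_nd, 3))                   # [1.0, 0.938, 0.716] -> X = {X1}
```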
5 Conclusion
Two classes of problems that need the application of a multiobjective approach have been classified. According to this, < X, M > and < X, R > models may be constructed. The use of < X, R > models is based on applying a general approach to solving optimization problems with fuzzy coefficients. It consists in analyzing one and the same problem within the framework of mutually interrelated models. The contraction of the obtained decision uncertainty region is based on reducing the problem to the multiobjective choice of alternatives with procedures based on fuzzy preference relations. The results of the paper are of a universal character and can be applied to the design and control of systems and processes of different nature, as well as to the enhancement of intelligent decision support systems. These results are already being used to solve power engineering problems.
References

[1] Ekel, P.Ya.: Fuzzy Sets and Models of Decision Making. Int. J. Comp. Math. Appl. 44 (2002) 863-875.
[2] Ekel, P.Ya.: Methods of Decision Making in Fuzzy Environment and Their Applications. Nonlin. Anal. 47 (2001) 979-990.
[3] Ekel, P., Pedrycz, W., Schinzinger, R.: A General Approach to Solving a Wide Class of Fuzzy Optimization Problems. Fuzzy Sets Syst. 97 (1998) 49-66.
[4] Bellman, R., Zadeh, L.A.: Decision Making in a Fuzzy Environment. Manage. Sci. 17 (1970) 141-164.
[5] Ekel, P.Ya., Galperin, E.A.: Box-Triangular Multiobjective Linear Programs for Resource Allocation with Application to Load Management and Energy Market Problems. Math. Comp. Mod., to appear.
[6] Zorin, V.V., Ekel, P.Ya.: Discrete-Optimization Methods for Electrical Supply Systems. Power Eng. 18 (1980) 19-30.
[7] Chen, S.J., Hwang, C.L.: Fuzzy Multiple Attribute Decision Making: Methods and Applications. Springer-Verlag, Berlin Heidelberg New York (1992).
[8] Horiuchi, K., Tamura, N.: VSOP Fuzzy Numbers and Their Fuzzy Ordering. Fuzzy Sets Syst. 93 (1998) 197-210.
[9] Orlovsky, S.A.: Decision Making with a Fuzzy Preference Relation. Fuzzy Sets Syst. 1 (1978) 155-167.
[10] Fodor, J., Roubens, M.: Fuzzy Preference Modelling and Multicriteria Decision Support. Kluwer, Boston Dordrecht London (1994).
[11] Lee-Kwang, H.: A Method for Ranking Fuzzy Numbers and Its Application to Decision-Making. IEEE Trans. Fuzzy Syst. 7 (1999) 677-685.
[12] Orlovsky, S.A.: Problems of Decision Making with Fuzzy Information. Nauka, Moscow (1981), in Russian.
[13] Ekel, P.Ya., Terra, L.D.B., Junges, M.F.D.: Methods of Multicriteria Decision Making in Fuzzy Environment and Their Applications to Power System Problems. Proceedings of the 13th Power Systems Computation Conference, Trondheim (1999) 755-761.
Mining Fuzzy Rules for a Traffic Information System Alexandre G. Evsukoff and Nelson F. F. Ebecken COPPE/Federal University of Rio de Janeiro P.O.Box 68506, 21945-970 Rio de Janeiro RJ, Brazil {Evsukoff,Nelson}@ntt.ufrj.br
Abstract. This work presents a fuzzy system for pattern recognition in a real application: the selection of traffic information messages to be displayed in Variable Message Signs located at the main routes of the city of Rio de Janeiro. In this application, flow and occupancy rate data is used to fit human operators' evaluation of traffic condition, which is currently done from images of strategically located cameras. The fuzzy rule-base mining is presented considering the symbolic relationships between linguistic terms describing variables and classes. The application presents three classifiers built from data.
1 Introduction
The work presents the development of a fuzzy system for the selection of messages to be displayed on Variable Message Signs (VMS). Such devices were initially conceived to provide drivers with information about traffic incidents, environmental problems, special events, etc. [1], [4]. However, due to the increasing levels of congestion in big cities, VMS have been used mainly to display current traffic conditions. In the city of Rio de Janeiro, the VMS are systematically used to inform drivers of the traffic conditions on the main routes located downstream of each panel. Operators in the traffic control centre analyse images from strategically located cameras and classify the conditions on each route into three categories: fluid, dense and slow. The classification is not standardised and depends on the operator's evaluation, which can differ for similar situations. On the other hand, many streets in Rio are equipped with inductive loop detectors. These sensors provide flow and occupancy rates, whose data are used for traffic-light planning but not to support VMS operation. Fuzzy reasoning techniques are a key to human-friendly computerised devices, allowing the symbolic generalisation of large amounts of data by fuzzy sets and providing linguistic interpretability [2], [3], [5]. The application described in this work uses flow and occupancy data to mine fuzzy rules to support operators' evaluation in the selection of the messages to be displayed on VMS. The following section introduces the current approach. The third section describes the mining of fuzzy rules for the fuzzy model. The fourth section presents and discusses the results achieved by three classifiers built with this methodology. Finally, some concluding remarks are presented.
2 Fuzzy Systems for Pattern Recognition
Consider a pattern recognition problem where observations are described as an N-dimensional vector x in a feature space X^N and classes are represented by the set C = {C1, ..., Cm}. The solution consists in assigning a class label Cj ∈ C to an observation x(t) ∈ X^N. The problem of designing a fuzzy system for pattern recognition is to build a classifier which correctly executes the mapping X^N → C. Each input variable x(t) ∈ X is described using ordered linguistic terms in a descriptor set A = {A1, ..., An}. The meaning of each term Ai ∈ A is given by a fuzzy set. The collection of fuzzy sets used to describe the input variable forms a fuzzy partition of the input variable domain. For a given input x(t) ∈ X, the membership vector (or fuzzification vector) u(t) is computed by the fuzzy sets in the fuzzy partition of the input variable domain as:

u(t) = (µA1(x(t)), ..., µAn(x(t))).   (1)
In pattern recognition applications, the rules' conclusions are the classes in the set C = {C1, ..., Cm}. The fuzzy rule base thus has to relate input linguistic terms Ai ∈ A to the classes Cj ∈ C, in rules such as:

if x(t) is Ai then class is Cj with cf = ϕij   (2)
where ϕij ∈ [0,1] is a confidence factor (cf) that represents the rule certainty. The confidence factor weights for all rules define the symbolic fuzzy relation Φ, defined on the Cartesian product A × C. A value µΦ(Ai, Cj) = ϕij represents how much the term Ai is related to the class Cj in the model described by the rule base. A value ϕij > 0 means that the rule (i, j) occurs in the rule base with the confidence factor ϕij. The rule base can be represented by the matrix Φ = [ϕij], of which the lines i = 1, ..., n are related to the terms in the input variable descriptor set A and the columns j = 1, ..., m are related to the classes in the set C. The output of the fuzzy system is the class membership vector (or the fuzzy model output vector) y(t) = (µC1(x(t)), ..., µCm(x(t))), where µCj(x(t)) is the output membership value of the input x(t) to the class Cj. The class membership vector y(t) can also be seen as a fuzzy set defined on the class set C. It is computed from the input membership vector u(t) and the rule base weights in Φ using the fuzzy relational composition:

y(t) = u(t) ∘ Φ   (3)
Adopting strong and normalised fuzzy partitions, and using the sum-product composition operator for the fuzzy inference, the class membership vector y(t) is easily computed as a standard vector-matrix product:

y(t) = u(t)·Φ.   (4)
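A minimal sketch of the single-variable inference (1) and (4): fuzzification on a strong triangular partition followed by the vector-matrix product. The partition centres, the rule-weight matrix and the input value are invented placeholders; only the class names (fluid, dense, slow) come from the application.

```python
import numpy as np

# Sketch of the single-variable inference (1)-(4): fuzzification on a strong
# triangular partition followed by the product u(t).Phi (invented values).

def fuzzify(x, centres):
    """Membership vector on a strong triangular partition (sums to 1)."""
    u = np.zeros(len(centres))
    if x <= centres[0]:
        u[0] = 1.0
    elif x >= centres[-1]:
        u[-1] = 1.0
    else:
        i = np.searchsorted(centres, x) - 1
        w = (x - centres[i]) / (centres[i + 1] - centres[i])
        u[i], u[i + 1] = 1.0 - w, w
    return u

centres = np.array([0.0, 40.0, 80.0])        # hypothetical "low/medium/high" flow
Phi = np.array([[0.9, 0.1, 0.0],             # rule weights: terms x classes
                [0.2, 0.7, 0.1],             # classes: fluid, dense, slow
                [0.0, 0.3, 0.7]])

u = fuzzify(55.0, centres)                   # (1)
y = u @ Phi                                  # (4)
classes = ["fluid", "dense", "slow"]
print(y, classes[int(np.argmax(y))])         # "maximum rule" decision
```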
When two or more variables occur in the antecedent of the rules, the input variables are the components of the feature vector x ∈ X^N and the rules are written as:

if x(t) is Bi then class is Cj with cf = ϕij   (5)

Each term Bi in the multi-variable descriptor set B = {B1, ..., BM} represents a combination of terms in the input variables' descriptor sets. It is a symbolic term used for computation purposes only and does not necessarily need to have a linguistic interpretation. All combinations must be considered so that the model is complete, i.e. it produces an output for any input values. The multi-variable fuzzy model described by a set of rules (5) is analogous to the single-variable fuzzy model described by rules of type (2), and the model output y(t) = (µC1(x(t)), ..., µCm(x(t))) is computed as:

y(t) = w(t)·Φ   (6)
The combination of terms in multiple-antecedent rules of type (5) implies an exponential growth of the number of rules in the rule base. For problems with many variables, reasonable results [2] can be obtained by aggregating partial conclusions computed by single-antecedent rules of type (2). The final conclusion is the membership vector y(t) = (µC1(x(t)), ..., µCm(x(t))), which is computed by the aggregation of the partial conclusions yi(t), as in multi-criteria decision making [6]. Each component µCj(x(t)) is computed as:

µCj(x(t)) = H(µCj(x1(t)), ..., µCj(xN(t)))   (7)
where H : [0,1]N → [0,1] is an aggregation operator. The best aggregation operator must be chosen according to the semantics of the application. Generally, a conjunctive operator, such as the “minimum” or the “product”, gives good results to express that all partial conclusions must agree. A weighted operator like OWA may be used to express some compromise between partial conclusions. The final decision is computed by a decision rule. The most usual decision rule is the “maximum rule”, where the class is chosen as the one with greatest membership value. The “maximum rule” is often used to determine the most suitable solution. Nevertheless, other decision rules can be used including risk analysis and reject or ambiguity distances. When all variables are used in single variable rules as (2) and their outputs are aggregated as (7), the fuzzy classifier behaves as a standard Bayesian classifier. The
current approach is flexible enough so that some partial conclusions can be computed from the combination of two or three variables in multi-variable rules (5). An aggregation operator computes a final conclusion from partial conclusions obtained from all sub-models. A decision rule (the “maximum” rule) computes the final class as shown in Fig. 1. x1 (t )
A1
u 1 (t )
xk (t )
Bk
M x N (t )
AN
y 1 (t )
M
M xk +1 (t )
Φ1
w k (t )
u N (t )
Φk
M ΦN
y k (t )
H
y (t ) decision
C0
y N (t )
Fig. 1. The fuzzy system approach for pattern recognition
The rule base is the core of the model described by the fuzzy system and its determination from data or human knowledge is described in the next section.
3 Fuzzy Rule Base Mining
Consider a labelled data set T, where each sample is a pair (x(t), v(t)), of which x(t) is the vector containing the feature values and v(t) is the vector containing the assigned membership values of x(t) to each class. For each fuzzy sub-model k, the rule base weights matrix Φk is computed by minimising the rule output error of the sub-model, defined as:

J = (1/N) Σ_{t=1..N} (y_k(t) − v(t))²   (8)

where y_k(t) = u_k(t)·Φk if sub-model k is described by single-antecedent rules as in (2), or y_k(t) = w_k(t)·Φk if it is described by multiple-antecedent rules as in (5). The rule base weights matrix Φk is obtained as the solution of the linear system:

Uk Φk = V   (9)
where the matrix U k is the fuzzification matrix, of which lines are the input membership vector for each sample. Each line of the matrix U k can be the single variable input membership vector u k (t ) or the multi-variable input membership vector w k (t ) , if the sub-model k is described by single or multiple antecedent rules. The
matrix V stores, in each line, the class membership vector of the corresponding sample. Equation (9) can be solved by any standard numerical method. Fuzzy rule base weights are computed for each sub-model and used to compute partial conclusions on the class descriptions. Final conclusions are computed by aggregation of the partial conclusions of all sub-models. Heuristic and experimental knowledge can thus be combined as different sub-models whose outputs are aggregated to compute the final conclusions.
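A short sketch of the mining step (8)-(9), solving Uk Φk = V in the least-squares sense with numpy; the fuzzification and label matrices are random placeholders, and the final clipping of the weights to [0, 1] is an added choice, not part of the text.

```python
import numpy as np

# Sketch of the rule-base mining step (8)-(9): the weights Phi are obtained
# as the least-squares solution of U Phi = V. Random placeholder data stand
# in for the fuzzification matrix U and the class-label matrix V.

rng = np.random.default_rng(0)
N, n_terms, n_classes = 200, 4, 3

U = rng.random((N, n_terms))
U /= U.sum(axis=1, keepdims=True)            # rows: input membership vectors
labels = rng.integers(0, n_classes, size=N)
V = np.eye(n_classes)[labels]                # rows: class membership vectors

Phi, *_ = np.linalg.lstsq(U, V, rcond=None)  # solves (9) in the LS sense
Phi = np.clip(Phi, 0.0, 1.0)                 # added choice: keep cf in [0, 1]
Y = U @ Phi                                  # model outputs for the samples
print(Phi.shape, float(np.mean((Y - V) ** 2)))
```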
4 Application
The application described in this work uses flow and occupancy rate data to automatically select messages to support operators' evaluation. One of the data sets used in this application is shown in Fig. 2. The interpolation of the data, represented by the solid line, shows a typical relationship between flow and density. Density is not directly measured, but it can be approximated by the occupancy rate. For low occupancy rate and flow values, the traffic condition is classified as clearly “fluid”, since the road is being operated under its capacity. As flow increases, the occupancy rate also increases until a saturation point, represented by the limit of the road capacity, where the traffic is classified as “dense”. Beyond this limit, the effects of congestion reduce the traffic flow, and thus the mean velocity, while the occupancy rate still increases. In this situation, the traffic condition is classified as “slow”.
Fig. 2. A training data set
Operators classify traffic conditions visually from camera images. Thus, for close data values, the classification can be completely different, due to the subjectivity of the operators' evaluation. The methodology presented above allowed different classifiers to be built; three of them are discussed in this section. The first classifier is obtained by considering the flow/occupancy ratio as the input variable. Using this analytical relation, the rule base design becomes trivial, since a low mean velocity implies slow traffic, a medium mean velocity implies dense traffic
and a high mean velocity implies fluid traffic. The membership functions were centred at the mean values of the flow/occupancy ratio for each class in the training set. The output class membership vector was computed simply by fuzzification. The operators agreed that flow can be described by three terms, while occupancy is better described by four symbolic terms. The fuzzy meanings have been defined by equally spaced fuzzy sets with triangular membership functions; no optimisation of the fuzzy set locations was performed. The second classifier was obtained by considering the features separately, as sub-models with single-antecedent rules like (2). The rule base weights were computed as in (9) for each feature. The partial conclusions were aggregated by the t-norm “minimum” operator. The third classifier considers all the combinations of terms in the variables' descriptor sets in two-variable antecedent rules like (5). The fourth classifier is a standard Bayes classifier assuming normal probability distributions. The collection of the data set was not a simple task. As the operators have many occupations other than monitoring traffic conditions, each data sample (acquired at a rate of one sample per 15 minutes) was included in the data set only if the operator had just changed the message on the VMS. One data set was collected for each route. As the routes differ in capacity, a different classifier was also developed for each route. The data sets were divided into a training set and a test set. The classification error rates on the test set are shown in Table 2. The large error rates for the third route are due to inconsistent data in the training set. Table 2. Classification results (error rates, %)
Classifier   Route 1   Route 2   Route 3
#1           34.78     27.83     47.83
#2           31.87     31.19     46.09
#3           35.16     23.85     43.48
#4           32.96     31.19     44.34
In the real VMS application, many detectors are available for each route. In this case, the solution provided by each data collection station should be considered as an independent partial conclusion, and the final decision could be obtained by a weighted aggregation over all data collection stations of each route, in a data fusion scheme.
5 Conclusions
This work has presented a fuzzy system approach to represent knowledge elicited from experts and mined from a data set. The main advantage of this approach is its flexibility to combine single or multiple-variable fuzzy rules and to derive fuzzy rules from heuristic and/or experimental knowledge. In the application, three classifiers were discussed for a simple but real application of a traffic information system. It has been shown that the operator evaluation from visual inspection is very subjective and corresponds roughly to measured data. The classification results reflect the lack of standardization in operators' evaluation. The results obtained with fuzzy classifiers (# 1 to # 3) were more or less similar
to the results of the Bayes classifier (#4), although the rule bases of the fuzzy classifiers are readable by domain experts. The rule bases obtained by the data-driven classifiers (classifiers #2 and #3) correspond roughly to the operators' expectation of how flow and occupancy are related to the traffic condition, as expressed in the heuristic classifier (classifier #1). An automatic procedure to support operators' evaluation could help to standardise the messages to be displayed on VMS, allowing a better understanding of traffic messages by the drivers. The parameters of the fuzzy partitions of each variable are important design issues that can improve classification rates and have not been considered in this work. They are left for further development of the symbolic approach presented in this work.
Acknowledgements. The authors are grateful for the kind co-operation of the traffic control operators and to CET-Rio, which provided the data for this work.
References

[1] Dudek, C.L. (1991). Guidelines on the Use of Changeable Message Signs – Summary Report. Publication No. FHWA-TS-91-002. U.S. Department of Transportation, Federal Highway Administration.
[2] Evsukoff, A., Branco, A.C.S., Gentil, S. (1997). A knowledge acquisition method for fuzzy expert systems in diagnosis problems. Proc. 6th IEEE International Conference on Fuzzy Systems – FUZZ-IEEE'97, Barcelona.
[3] Isermann, R. (1998). On fuzzy logic applications for automatic control, supervision and fault diagnosis. IEEE Trans. on Systems, Man and Cybernetics – Part A: Systems and Humans, 28(2), pp. 221-234.
[4] Sethi, V., Bhandari, N. (1994). Arterial incident detection using fixed detector and probe vehicle data. Transportation Research.
[5] Zadeh, L. (1996). Fuzzy logic = computing with words. IEEE Trans. on Fuzzy Systems, 4(2), pp. 103-111.
[6] Zimmermann, H.-J. (1996). Fuzzy Set Theory and its Applications. Kluwer.
Possibilistic Hierarchical Fuzzy Model Paulo Salgado Universidade de Trás-os-Montes e Alto Douro 5000-911 Vila Real, Portugal
[email protected] Abstract. This paper presents a Fuzzy Clustering of Fuzzy Rules Algorithm (FCFRA) with dancing cones that allows the automatic organisation of the sets of fuzzy IF … THEN rules of one fuzzy system in a Hierarchical Prioritised Structure (HPS). The algorithm belongs to a new methodology for organizing linguistic information, SLIM (Separation of Linguistic Information Methodology), and is based on the concept of relevance of rules. The proposed FCFRA algorithm has been successfully applied to the clustering of an image.
1 Introduction
A fuzzy reasoning model is considered as a set of rules in IF-THEN form that describe the input-output relations of a complex system. We use rules to describe this relation, instead of classical function approximation techniques, mainly because of the transparency of the resulting fuzzy model. When building fuzzy systems from experts or automatically from data, we need procedures that divide the input space into fuzzy granules. These granules are the building blocks of the fuzzy rules. To keep interpretability we usually require that the fuzzy sets be specified in local regions. If this requirement is not fulfilled, many rules must be applied and aggregated simultaneously, so that the final result becomes more difficult to grasp. Moreover, aiming for a high approximation quality, we tend to use a large number of rules. In these cases, it is no longer possible to interpret the fuzzy system. The main objective of this article is to show that automated modelling techniques can be used to obtain not only accurate, but also transparent rule-based models from system measurements. This is achieved by organizing the flat fuzzy system f(x) into a set of n fuzzy sub-systems f1(x), f2(x), ..., fn(x), each corresponding to a readable and interpretable fuzzy system that may contain information related to particular aspects of the system f(x). This objective can be reached by using an algorithm that implements Fuzzy Clustering of Fuzzy Rules (FCFRA) with dancing cones [1]. The proposed algorithm allows the grouping of a set of rules into c subgroups (clusters) of similar rules, producing an HPS representation of the fuzzy system. The paper is organized as follows. A brief introduction to hierarchical HPS fuzzy systems and a review of the concept of relevance are given in Section 2. In Section 3 the FCFRA strategy is proposed. An example is presented in Section 4. Finally, the main conclusions are outlined in Section 5.
2 The Hierarchical Prioritized Structure
The SLIM methodology organizes a fuzzy system f(x) as a set of n fuzzy systems f1(x), f2(x), ..., fn(x). A clustering algorithm is used in this work to implement the separation of information among the various subsystems [2]. The HPS structure, depicted in Figure 1, allows the prioritization of the rules by using a hierarchical representation, as defined by Yager [3]. If i < j, the rules in level i have a higher priority than those in level j. Consider a system with levels i = 1, ..., n−1, each level with Mi rules:

I.  If U is Aij and V̂i−1 is low, then Vi is Bij; otherwise rule II is used.
II. Vi is V̂i−1.
Rule I is activated if two conditions are satisfied: U is Aij and V̂i−1 is low. V̂i−1, which is the maximum value of the output membership function of Vi−1, may be interpreted as a measure of the satisfaction of the rules in the previous levels. If these rules are relevant, i.e. V̂i−1 is not low, the information conveyed by these rules will not be used. On the other hand, if the rules in the previous levels are not relevant, i.e. V̂i−1 is low, this information is used. Rule II states that the output of level i is the union of the output of the previous level with the output of level i.
The fuzzy system of figure 1 is a universal function approximator. With the appropriated number of rules it can describe any nonlinear relationship. The output of a generic level i is given by the expression [4]: Mi Gi = (1 − α i −1 ) ∧ ∪ Fi l ∪ Gi −1 l =1
(1)
246
Paulo Salgado
where Fi l = Ail ( x* ) ∧ Bil , l = 1,
, M i is the output membership function of rule l in
level i. The coefficient αi translates the relevance of the set of rules in level i (for level 0: α0=0, G0 =∅). Equation (1) can be interpreted as an aggregation operation of the relevance, for the hierarchical structure. For the characterization of the relative importance of sets of rules, in the modelling process, it is essential to define a relevance function. Depending on the context where the relevance is to be measured, different metrics may be defined [5]. Definition 1. Consider ℑ a set of rules from U into V, covering the region S = U × V in the product space. Any function of relevance must be of the form ℜ S : P ( ℑ ) → [ 0 , 1]
(2)
where P ( ℑ) is the power set of ℑ. As a generalization of equation (1), we propose a new definition of relevance for the HPS structure [5]. Definitions 2. (Relevance of fuzzy system just i level) Let Si be the input-output region covered by the set of rules of level i. The relevance of the set of rules in level i is defined as:
(
αi = S αi −1 , T (1 − αi −1 ) , ℜS i
({
where ℜSi = S ℜSi ( Fi l ) , l = 1, , M i
)
(3)
}) and S and T are, respectively, S-norm e T-
norm operations. ℜSi ( Fi l ) represents the relevance of rule l in level i.
Using the product implication rule, and considering Bil a centroid of amplitude δil centred in y=ÿ, then ℜ Si ( Fi l ) = Ail ( x* ) ⋅ δ il
(4)
When the relevance of a level i is 1, the relevance of all the levels below is null. The relevance of the set of rules just i level, as definition 2, is here simplified by using the following expression:
αi −1 + ℜS ; if αi −1 + ℜS ≤ 1 αi = i
1
i
; otherwise
(5)
Mi
where ℜSi = ∑ ℜSi ( Fi l ) . l =1
In next section we present a new method to cluster the fuzzy rules of a fuzzy system. The result of this process corresponds to transforming the original fuzzy system (a flat structure) into a HPS fuzzy model.
Possibilistic Hierarchical Fuzzy Model
3
247
The Possibilistic Clustering Algorithm of Fuzzy Rules
The objective of the fuzzy clustering partition is the separation of a set of fuzzy rules ℑ={R1, R2,..., RM} into c clusters, according to a “similarity” criterion, finding the optimal clusters centre, V, and the partition matrix, U. Each value uik represents the membership degree of the kth rule, Rk, belonging to the ith cluster i, Ai, and obeys simultaneously to the equations (6) and (7): n
0 < ∑ uik < n, ∀i ∈ { 1, 2,L , c}
(6)
k =1
and c
M
∑∑ ℜ ( x ) ⋅ u l
k
il
⋅ α i = 1 , ∀xk ∈ S
(7)
i =1 l =1
Equation (7) can be interpreted as the sum of the products between the relevance of the rules l in the xk point with the degree of the rule l belonging to the cluster i, uil α i . This last item reflects the joint contribution of rule l to the ith hierarchical system, uil , with the relevance of the hierarchical level i, α i . Thus, for the Fuzzy Clustering of Fuzzy Rules Algorithm, FCFRA, the objective is to find a U=[uik] and V = [v1 , v2 ,L , vC ] with vi ∈ R p where: n m c M J (U , V ) = ∑ ∑∑ ( uil ℜl ( xk ) ) xk − vi k =1 i =1 l =1
2 A
m + ηi (1 − uil )
(8)
is minimized, with a weighing constant m>1, with constrain of equation (7). The ⋅ is an inner product norm x
2 A
= x T A x . The parameter ηi is fixed for each cluster with
membership distance of ½. It can be shown that the following algorithm may lead the pair (U*,V*) to a minimum. The models specified by the objective function (8) were minimized using alternating optimization. The results can be expressed in following algorithm:
Possibilistic Fuzzy Clustering algorithms of fuzzy rules – P-FCAFR Step 1: For a set of points X={x1,..., xn}, with xi∈S, and a set of rules ℑ={R1, R2,..., RM}, with relevance ℜl ( xk ) , k= 1, … , M, keep c, 2 ≤ c < n, and initialize U(0)∈ Mcf. Step 2: On the rth iteration, with r= 0, 1, 2, ... , compute the c mean vectors. ∑ ( u ( ) ) ⋅ ∑ ( ℜ ( x ) ) M
vi( ) = r
l =1
r il
m
(r ) ∑ uil l =1 M
( )
⋅ xk k =1 , where u ( r ) = U ( r ) , i=1, ... ,c. np il m m ⋅ ∑ ( ℜl ( x k ) ) k =1 np
l
m
k
(9)
248
Paulo Salgado
Step 3: Compute the new partition matrix U(r+1) using the expression: uil( r +1) =
1 ∑ ∑ (ℜ ( x )) np
c
j =1
k =1
l
m
k
⋅ xk − vi( r )
ηi
2 m −1
, with 1≤ i ≤ c , 1≤ l ≤ M.
(10)
A
In this algorithm, we interpret the values of {ui ( xk )} (discrete support) as observations from the extended membership function µi ( l ) : µi ( l ) = ui ( l ) , l=1,…, M (continuous support). Step 4: Compare U(r) with U(r+1): If || U(r+1)-U(r)|| < ε then the process ends. Otherwise let r = r+1 and go to step 2. ε is a small real positive constant. The applications of the P-FCAFR algorithm to fuzzy system rules with singleton fuzzifier, inference product and centroid defuzzifier result in a fuzzy system with HPS structure, i.e., in the form: c M f ( x ) = ∑ α i ∑ θ l ⋅ µ l ( x ) ⋅ uil i =1 l =1
∑ α ∑ µ ( x ) ⋅ u c
M
i
i =1
l
il
l =1
(11)
If the rules describe one region S, instead of a set of points, and the membership of relevance function of the rules was symmetrical the equation (10) will be reformulate ( r +1)
µil
=1
∑( c
j =1
(r )
xl − vi
A
η
2 * m −1 i
)
(12)
where xl is the centre of rule l. The shapes of this membership function are determined by the update of below iterative procedure in order to minimize the objective function(8). However, the user might be interested in choosing memberships functions shapes that are considered more useful for a given application, although he abandons the objective function. The P-FCAFR algorithm and the Alternating Cluster Estimation Fuzzy rules [1][6], with Cauchy membership function, were used in the identification of the Abington Cross.
4
Experimental Results
The Abington Cross image (Fig. 2a) was used to illustrate the proposed approach. It is a critical image for obtaining the cross clustering due to the high amount of noise. The image is a 256×256 pixels, grey level image. The intensity dynamic range has been normalized to the [0,1] interval. The image is a set of R3 p points, with coordinates (x,y,I), where I (x,y) is the intensity of image at point (x, y). I = 0 corresponds to the white colour, and I = 1 corresponds to the black colour. The aim is to partition the fuzzy system into five clusters. In the first step, the system is modelled in a set of rules, using the nearest neighbourhood method [7]. The
Possibilistic Hierarchical Fuzzy Model
249
resulting system after the learning process has 1400 fuzzy rules. The output of the system at this stage is shown in Fig. 2.b. (b)
(a)
Fig. 2. Grey scale image, with the shape of Abington Cross: a) original image. b) Fuzzy image
The second step consists in the segmentation of the image into 5 clusters (with m=2), each representing a fuzzy system in a HPS structure, using the FCAFR algorithm presented in section 3. Four of them represent the background (each representing the background rectangle in each corner of the image) and one represents the area containing the white cross. The degree of membership of each rule to each rule cluster has been measured using the weighed distance of the centre of the rule to the centre of the cluster, η. Figs. 3.a.-3.e. show the individual outputs response of each hierarchical fuzzy model. The original image can be described as the aggregation (equation (11)).
5
Conclusions
In this work, the mathematical fundaments for Possibilistic fuzzy clustering of fuzzy rules were presented, where the relevance concept has a significant importance. Based in this concept, it is possible to make a fuzzy clustering algorithm of fuzzy rules, which is naturally a generalization of fuzzy possibilistic clustering algorithms. Now, many others clusters algorithms (crisp) can be applied to the fuzzy rules. The P-FCAFR Algorithm organizes the rules of fuzzy systems in the HPS structure, based on the SLIM methodology, where the partition matrix can be interpreted as containing the values of the relevance of the sets of rules in each cluster. The proposed clustering algorithm has demonstrated the ability to deal with an image strongly corrupted by noise, the Abington Cross image. The test image was first identified using the Nearest Neighbourhood Method. The resulting fuzzy rules were successfully grouped into 5 clusters using the proposed P-FCAFR Algorithm. Each resulting cluster represents one meaningful region of the image. This helps in the corroboration of the proposed approach.
250
Paulo Salgado
(a)
(c)
(b)
(d)
(e)
Fig. 3. Images representing the output result of the five background clusters
References [1] [2] [3] [4] [5] [6] [7]
Hoppner, Frank, and al.: Fuzzy Cluster Analysis, Methods for classification data analysis and image recognition, Wiley, (1999). Salgado, Paulo, Clustering and hierarchization of fuzzy systems, submitted to Soft Computer Journal. Yager, R.: On a Hierarchical Structure for Fuzzy Modeling and Control, IEEE Trans. On Syst., Man, and Cybernetics, 23, (1993), 1189-1197. Yager, R.: On the Construction of Hierarchical Fuzzy Systems Models, IEEE Trans. On Syst., Man, and Cyber. –Part C, 28, (1998), 55-66. Salgado, P.: Relevance of the fuzzy sets and fuzzy systems. In: “Systematic Organization of Information in Fuzzy Logic”, NATO Advanced Studies Series, IOS Press, In publication. Runkler, T. A, Bezdek, C.: Alternating Cluster Estimation: A new Tool for Clustering and Function Approximation, IEEE Trans. on Fuzz. Syst., 7, (1999), 377-393. Wang, Li-Xin: Adaptive Fuzzy System and Control, Design and stability analysis, Prentice Hall, Englewood Cliffs, NY 07632 (1994).
Fuzzy Knowledge Based Guidance in the Homing Missiles Mustafa Resa Becan and Ahmet Kuzucu System Dynamics and Control Unit, Mechanical Engineering Faculty Istanbul Technical University, Turkey
Abstract. A fuzzy knowledge based tail pursuit guidance scheme is proposed as an alternative to the conventional methods. The purpose of this method is to perform the tracking and interception using a simplified guidance law based on the fuzzy controller. Noise effects in homing missile guidance must be reduced to get an accurate control. In this work, developed fuzzy tail pursuit control algorithm reduces the noise effects. Then, a knowledge based algorithm is derived from the fuzzy control behavior to obtain a simplified guidance law. Simulation results show that the new algorithm led to satisfactory performance for the purpose of this work.
1
Introduction
The tail pursuit is the one of the most common conventional guidance methods in the homing missile area. But, the homing missile guidance has some uncertain parameters for the target maneuver and target behavior is observed through noisy measurements. In these cases, conventional guidance methods may not be sufficient to obtain the tracking and interception. Fuzzy control has suitable properties to eliminate such difficulties. Fuzzy controller has been used in the systems where variables have fuzzy characteristics or deterministic mathematical models are difficult or impossible [1], [2]. Recently, developed neuro-fuzzy techniques serve as possible approaches for the nonlinear flight control problems [3-5]. However, a limited number of papers have been adressed to the issue of fuzzy missile guidance design [6-8]. Fuzzy tail pursuit guidance is designed considering the noise sourced from the thermal and radar detection sources at the system input. This noise affects the guidance system entirely. Fuzzy tail pursuit (FTP) guidance is applied to a homing missile with known aerodynamic coefficients and their variations. The purpose of this paper is not only to perform the FTP application but also is to search a simplified knowledge based guidance inspired by the fuzzy control results. The objective of this research is to obtain a rule based low cost guidance control free from the computational complexities that a full fuzzy controller requires.
V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 251-257, 2003. Springer-Verlag Berlin Heidelberg 2003
252
Mustafa Resa Becan and Ahmet Kuzucu
2
Definition of the Guidance Control System
Figure 1 shows the interception geometry for the missile model and it presents some symbols used in this study. Missile motions are constrained within the horizontal X-Y plane. State variables from Fig.1 are the flight path angle γ , missile velocity Vm, attitude angle θ, angle of attack α , pitch rate q, missile and target positions in the inertial space Xm, Ym, XT, YT and the target angle β. The missile is represented by an aerodynamic model with known coefficients.
Fig. 1. Interception Geometry
2.1
Conventional Tail Pursuit (CTP) Design
Line of sight (LOS) is the distance between the missile and target. The purpose of the tail pursuit guidance is to keep the flight path angle γ equal to the line of sight angle σ. Τhis goal leads to the design of a closed loop control around LOS angle.
Fig. 2. Block Diagram of CTP
Fig. 2 shows the simple block diagram of CTP. Controller inputs are the error e and first derivative of error e& . Noise effect is symbolized as w and σ& is the variation of LOS angle. The CTP output can be written in PD form as shown below: u = K p .e + K v .e&
(1)
where Kp and Kv are the control coefficients chosen for a desired system performance.
Fuzzy Knowledge Based Guidance in the Homing Missiles
3
253
Fuzzy Tail Pursuit (FTP) Design
Target position measurement is not precise and some noise effect is superposed to it. This particularity allows the fuzzy tail pursuit as an alternative to the conventional version. The noise is modeled as a gaussian density function in this study . The input and output variables of the fuzzy controller are linguistic variables taking linguistic values. The input linguistic variables are the error ( e ) and change of error ( e& ) and the linguistic output variable is the control signal u. The linguistic variables are expressed by linguistic sets. Each of these variables is assumed to take seven linguistic sets defined as negative big (NB), negative medium (NM), negative small (NS), zero (ZE), positive small (PS), positive medium (PM), positive big (PB). Triangular membership functions (MF) for error input are scaled in synchronization with the gaussian density function distribution shown on Fig. 3. This approach is preferred to match the membership function distribution with the noisy measurement distribution. Universe of discourse of MF (e/emax) is choosen in regular form at first, but the satisfactory results could not be obtained. Then it is changed using trial and error to provide the accordance with the density function and performed the best results. Consequently, the most reasonable way to design MF is determined as presented on Fig.3.
Fig. 3. Density Function and MF Design
A set of fuzzy PD rules has been performed in a real time application [9]. Table 1 shows the fuzzy PD rules set following this approach.
254
Mustafa Resa Becan and Ahmet Kuzucu Table 1. Rule Table of Fuzzy Tail Pursuit
Minimum (Mamdani type) inference is used to obtain the best possible conclusions [8]. This type of inference allowed easy and effective computation and it is appropriate for real time control applications.The outputs of the linguistic rules are fuzzy, but the guidance command must be crisp. Therefore, the outputs of the linguistic rules must be defuzzified before sending them to the plant. The crisp control action is calculated here using the center of gravity (COG) defuzzification method.
4
Control Applications
CTP and FTP strategy is applied in PD form to a homing missile model. Aerodynamic coefficients are evaluated from tabulated experimental values in the simulation environment. Fuzzy control coefficients are scaled according to the system limitations. Fig. 4 display the interception trajectory and LOS performance of the application. Results show that the FTP perform the interception without time delay.
Fig. 4. FTP Behaviors
Fuzzy Knowledge Based Guidance in the Homing Missiles
255
Fig. 5. Comparison of the Control Behaviors
Control behavior obtained by CTP and FTP are shown on Fig. 5. PD controller coefficients have been determined considering a second order system model with a natural freqency of wn= 6 Hz. and a damping ratio of ζ = 0.7 in CTP application. It is clear that better noise filtering performance may be obtained using prefilters or more sophisticated PD type controller design, so the comparison may seem unfair. But still the improvement in noise filtering of the fuzzy controller is impressing. Taking into account the distribution of measurements instead of instantaneous measurement during fuzzification , and the use of COG method in defuzzification both having some ‘averaging' property, reduce considerably the propagation of noise to the control variable behavior. On the other hand it is clear that a pure theoretical derivative effect in the CTP scheme will amplify the noise effect if a suitable filtering is not added to controller for the frequency band under consideration.
5
Derivation of Knowledge Based Control
Homing missiles are launched for once only. Therefore low cost controllers are needed in such applications. A knowledge based algorithm is developed using the fuzzy results to obtain a guidance controller less equipped than the full fuzzy control. Our goal is to obtain a guidance scheme related to the fuzzy tail pursuit, a new algorithm requiring less sophisticated computing power. Fig. 6 shows the block diagram of the proposed guidance control system.
Fig. 6. Knowledge Based Control Block Diagram
256
Mustafa Resa Becan and Ahmet Kuzucu
Fuzzy surface of fuzzy tail pursuit controller obtained through off-line simulation studies is shown on Fig. 7 Membership functions for knowledge based control are rearrenged using this surface. Table 2 presents the rules rewritten using the FTPcontrol behavior on Fig. 5 Control values are crisp and changed from –3 to +3 to match the control variable variation on Fig. 5. Fig. 7 shows that the derivated guidance law performed the interception within 7.55 s. using the proposed knowledge based discontinuous control. This value is very close to the interception time obtained with the fuzzy tail pursuit guidance.
Fig. 7. Fuzzy Surface Between the Input-Output Values
Table 2. Knowledge Based Rules
Fig. 8. Knowledge Based Guidance Behaviors
Fuzzy Knowledge Based Guidance in the Homing Missiles
6
257
Conclusion
A fuzzy tail pursuit guidance for a homing missile has been presented in this work. The simulation results showed the proposed method achieved a relatively noise free guidance without interception time delay using the filtering property of the fuzzy control. Then, a knowledge based guidance law is derived based on the fuzzy tail pursuit results, to obtain a simple discontinuous control. This work shows that the guidance can be performed using knowledge based algorithm because the line of sight behavior is quite satisfactory. The results encourage future studies through the fuzzy tail pursuit and knowledge based control on the homing missile guidance area.
References [1] [2] [3]
[4] [5] [6] [1] [8] [9]
Lee C.C., “Fuzzy Logic in Control Systems: Fuzzy Logic Controller: Part I & Part II”, IEEE Trans. Syst. Man and Cybernetics, 20, 404-435, 1990 Williams T., “Fuzzy Logic Simplifies Complex Control Problems”, Computer Design, 90- 102, 1991 Huang C., Tylock J., Engel S., and Whitson J., “Comparison of Neural Network Based , Fuzzy Logic Based, and Numerical Nonlinear Inverse Flight Control”, Proceedings of the AIAA Guidance, Navigation and Control Conference, AIAA Washington DC, 1994, pp. 922-929 Geng Z.J., and McCullough C.L., “Missile Control Using Fuzzy Cerebellar Model Arithmetic Computer Neural Networks”, Journal of Guidance, Control and Dynamics, Vol. 20, No.3, 1997, pp. 557-565 Lin C.L., and Chen Y.Y., “Design of an Advanced Guidance Law Against High Speed Attacking Target”, Proceedings of the National Science Council, Part A, Vol.23, No.1, 1999, pp.60-74 Mishra K., Sarma I.G., and Swamy K.N., “Performance Evaluation of the Fuzzy Logic Based Homing Guidance Schemes”, Journal of Guidance, Control and Dynamics, Vol.17, No.6, 1994, pp. 1389-1391 Gonsalves P.G., and Çağlayan A.K., “Fuzzy Logic PID Controller for Missile Terminal Guidance”, Proceedings of 1995 IEEE International Symposium and Intelligent Control, IEEE Publications, Piscataway, NJ.1995, pp.377-382 Lin C.L., and Chen Y.Y., “Design of Fuzzy Logic Guidance Law Against High Speed Target”, Journal of Guidance, Control and Dynamics, Vol.23, No.1, January-February 2000, pp. 17-25 Hoang L.H., and Maher H., “Control of a Direct-Drive DC Motor by Fuzzy Logic”, IEEE Industry Applications Society, Vol. VI, 1993, pp. 732-738
Evolutionary Design of Rule Changing Cellular Automata Hitoshi Kanoh1 and Yun Wu2 1
Institute of Information Sciences and Electronics, University of Tsukuba Tsukuba Ibaraki 305-8573, Japan
[email protected] 2 Graduate School of Systems and Information Engineering University of Tsukuba, Tsukuba Ibaraki 305-8573, Japan
[email protected] Abstract. The difficulty of designing cellular automatons' transition rules to perform a particular problem has severely limited their applications. In this paper we propose a new programming method of cellular computers using genetic algorithms. We consider a pair of rules and the number of rule iterations as a step in the computer program. The present method is meant to reduce the complexity of a given problem by dividing the problem into smaller ones and assigning a distinct rule to each. Experimental results using density classification and synchronization problems prove that our method is more efficient than a conventional one.
1
Introduction
Recently, evolutionary computations by parallel computers have gained attention as a method of designing complex systems [1]. In particular, parallel computers based on cellular automata (CAs [2]), meaning cellular computers, have the advantages of vastly parallel, highly local connections and simple processors, and have attracted increased research interest [3]. However, the difficulty of designing CAs' transition rules to perform a particular task has severely limited their applications [4]. The evolutionary design of CA rules has been studied by the EVCA group [5] in detail. A genetic algorithm (GA) was used to evolve CAs. In their study, a CA performed computation means that the input to the computation is encoded as the initial states of cells, the output is decoded from the configuration reached at some later time step, and the intermediate steps that transform the input to the output are taken as the steps in the computation. Sipper [6] has studied a cellular programming algorithm for non-uniform CAs, in which each cell may contain a different rule. In that study, programming means the coevolution of neighboring, non-uniform CAs' rules with parallel genetic operations. In this paper we propose a new programming method of cellular computers using genetic algorithms. We consider a pair of rules and the number of rule iterations as a V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 258-264, 2003. Springer-Verlag Berlin Heidelberg 2003
Evolutionary Design of Rule Changing Cellular Automata
259
step in the computer program, whereas the EVCA group considers an intermediate step of transformation as a step in the computation. The present method is meant to reduce the complexity of a given problem by dividing the problem into smaller ones and assigning a distinct rule to each one. Experimental results using density classification and synchronization problems prove that our method is more efficient than a conventional method.
2
Present Status of Research
2.1
Cellular Automata
In this paper we address one-dimensional CAs that consist of a one-dimensional lattice of N cells. Each cell can take one of k possible states. The state of each cell at a given time depends only on its own state at the previous time step and the state of its nearby neighbors at the previous time step according to a transition rule R. A neighborhood consists of a cell and its r neighbors on either side. A major factor in CAs is how far one cell is from another. The CA rules can be expressed as a rule table that lists each possible neighborhood with its output bit, that is, the update value of the central cell of the neighborhood. Figure 1 shows an example of a rule table when k=2 and r=1. We regard the output bit “11101000” as a binary number. The number is converted to a decimal, which is 232, and we will denote the rule in Fig. 1 as rule 232. Here we describe CAs with a periodic boundary condition. The behavior of one-dimensional CAs is usually displayed by space-time diagrams in which the horizontal axis depicts the configuration at a certain time t and the vertical axis depicts successive time steps. The term “configuration” refers to the collection of local states over the entire lattice, and S(t) denotes a configuration at the time t.
Fig. 1. An example of a rule table (k=2, r=1)
2.2
Computational Tasks for CAs
We used the density classification and the synchronization tasks as benchmark problems [4, 7]. The goal for the density classification task is to find a transition rule that decides whether or not the initial configuration S(0) contains a majority of 1s. Here ρ(t) denotes the density of 1s in the configuration at the time t. If ρ(0) > 0.5, then within M time steps the CA should go to the fixed-point configuration of all 1s (ρ(t M) =1 ); otherwise, within M time steps it should produce the fixed-point configuration of all 0s (ρ(t M) = 0 ). The value of constant M depends on the task. The second task is one of synchronization: given any initial configuration S(0), the CA should reach a final configuration, within M time steps, that oscillates between all 0s and all 1s in successive time steps.
260
2.3
Hitoshi Kanoh and Yun Wu
Previous Work
The evolutionary design of CA rules has been studied by the EVCA group [3, 4] in detail. A genetic algorithm (GA) was used to evolve CAs for the two computational tasks. That GA was shown to have discovered rules that gave rise to sophisticated emergent computational strategies. Sipper [6] has studied a cellular programming algorithm for 2-state non-uniform CAs, in which each cell may contain a different rule. The evolution of rules is here performed by applying crossover and mutation. He showed that this method is better than uniform (ordinary) CAs with a standard GA for the two tasks. Meanwhile, Land and Belew [8] proved that the perfect two-state rule for performing the density classification task does not exist. However, Fuks [9] showed that a pair of human written rules performs the task perfectly. The first rule is iterated t1 times, and the resulting configuration of CAs is processed by another rule iterated t2 times. Fuks points out that this could be accomplished as an “assembly line” process. Other researchers [10] have developed a real-world application for modeling virtual cities that uses rule-changing, two-dimensional CAs to lay out the buildings and a GA to produce the time series of changes in the cities. The GA searches the sequence of transition rules to generate the virtual city required by users. This may be the only application using rule changing CAs.
3
Proposed Method
3.1
Computation Based on Rule Changing CAs
In this paper, a CA in which an applied rule changes with time is called a rule changing CA, and a pair of rules and the number of rule iterations can be considered as a step in the computer program. The computation based on the rule changing CA can thus operate as follows: Step 1: The input to the computation is encoded as the initial configuration S(0). Step 2: Apply rule R1 to S(0) M1 times; ……; apply rule Rn to S(M1+…+Mn-1) Mn times. Step 3: The output is decoded from the final configuration S(M1+…+Mn). In this case, n is a parameter that depends on the task, and rule Ri and the number of rule iterations Mi (i=1, …, n) can be obtained by the evolutionary algorithm described in the next section. The present method is meant to reduce the complexity of a given task by dividing the task into n smaller tasks and assigning (Ri, Mi) to the i-th task. 3.2
Evolutionary Design
Each chromosome in the population represents a candidate set of Ri and Mi as shown in Fig. 2. The algorithm of the proposed method is shown in Fig. 3.
Evolutionary Design of Rule Changing Cellular Automata
261
R1 | M1| ……… | Rn | Mn Fig. 2. Chromosome of the proposed method (Ri: rule, Mi: the number of rule iteration) procedure main initialize and evaluate population P(0); for g = 1 to the upper bound of generations { apply genetic operators to P(g) and create a temporary population P'(g); evaluate P' (g); get a new population P(g+1) by using P(g) and P'(g); } procedure evaluate for i = 1 to the population size { operate CA with the rules on the i-th individual; calculate fitness of the i-th individual; } Fig. 3. Algorithm of the proposed method (procedure main and procedure evaluate)
3.3
Application
In this section we describe an application that uses the CA with two rules for density classification and synchronization tasks. Chromosome Encoding In the case of the CA with two rules, a chromosome can generally be expressed by the left part in Fig. 4. The right part in Fig. 4 shows an example of the chromosome when k=2 and r=1. Each rule and the number of iterations are expressed by the output bit of the rule table and a nonnegative integer, respectively. R1 | M1| R2 | M2 = 1 1 1 0 1 0 0 0 | 45 | 0 1 0 0 1 0 0 1 | 25 Fig. 4. Example of the chromosome of the proposed method for the CA with two rules
Fitness Calculation The fitness of an individual in the population is the fraction of the Ntest initial configurations in which the individual produced the correct final configurations. Here the initial configurations are uniformly distributed over ρ(0)∈[0.0, 1.0]. Genetic Operations First, generate Npop individuals as an initial population. Then the following operations are repeated for Ngen generations. Step 1: Generate a new set of Ntest initial configurations, and calculate the fitness on this set for each individual in the population. Step 2: The individuals are ranked in order of fitness, and the top Nelit elite individuals are copied to the next generation without modification. Step 3: The remaining (Npop-Nelit) individuals for the next generation are formed by single-point crossovers between the individual randomly selected from Nelit
262
Hitoshi Kanoh and Yun Wu
elites and the individual selected from the whole population by roulette wheel selection. Step 4: The rules R1 and R2 on the offspring from crossover are each mutated at exactly two randomly chosen positions. The numbers of iterations M1 and M2 are mutated with probability 0.5 by substituting random numbers less than an upper bound.
4
Experiments
4.1
Experimental Methods
The experiments were conducted under the following conditions: the number of states k=2, the size of lattice N=149, and M1+M2=149 for the CA; the population size Npop=100, the upper bound of generations Ngen=100, the number of elites Nelit=20, and the number of the initial configuration for test Ntest=100 for the GA. 4.2
Density Classification Task (r=1)
Table 1 shows the best rules in the population at the last generation. The obtained rules agree with the human written rules shown by Fuks [9], which can perform this task perfectly. Table 1. Rules obtained by the genetic algorithm for the density classification task (r=1, k=2, N=149)
Experiment-1 Experiment-2 4.3
R1 184 226
M1 124 73
R2 232 232
M2 25 76
Density Classification Task (r=3)
Figure 5 shows the comparison of the proposed method with the conventional method that uses only one rule as the chromosome [5]. Each point in the figure is the average of ten trials. It is seen from Fig. 5 that the fitness of the former at the last generation (0.98) is higher than that of the latter (0.91). Table 2 shows the best rules obtained by the proposed method, and Fig. 6 shows examples of the space-time diagrams for these rules. The rules in Table 2 have been converted to hexadecimal. Table 2. The best rules obtained by the genetic algorithm for the density classification task (r=3, k=2, N=149)
R1 01000100111C0000 000004013F7FDFFB
M1 101
R2 97E6EFF6E8806448 4808406070040000
M2 48
Evolutionary Design of Rule Changing Cellular Automata
263
Fig. 5. The best fitness at each generation for density classification tasks (r=3, k=2, N=149)
Fig. 6. Examples of space-time diagrams for a density classification task (r=3, k=2, N=149)
4.4
Synchronization Task (r=3)
Figure 7 shows the results of the same comparison as in Fig 5 for the synchronization task. The fitness of the proposed method reaches 1.0 by the 3rd generation, whereas the conventional method requires more than 20 generations.
Fig. 7. The best fitness at each generation for the synchronization task (r=3, k=2, N=149)
264
Hitoshi Kanoh and Yun Wu
5
Conclusions
In this paper we proposed a new programming method of cellular computers. The experiments were conducted using a rule changing CA with two rules. In a forthcoming study, the performance of a CA with more than two rules will be examined.
References [1]
Alba, E. and Tomassini, M.: Parallelism and Evolutionary Algorithms. IEEE Transactions on Evolutionary Computation, Vol. 6, No. 5 (2002) 443-462. [2] Wolfram, S.: A New Kind of Science. Wolfram Media Inc. (2002). [3] Sipper, M.: The Emergence of Cellular Computing. IEEE Computer, Vol. 32, No. 7 (1999). [4] Mitchell, M., Crutchfield, J., and Das, R.: Evolving Cellular Automata with Genetic Algorithms: A Review of Recent Work. Proceedings of the First International Conference on Evolutionary Computation and Its Applications (1996). [5] Mitchell, M., Crutchfield, J. P., and Hraber, P.: Evolving Cellular Automata to Perform Computations: Mechanisms and Impediments. Physica D 75 (1994) 361-391. [6] Sipper, M.: Evolving of Parallel Cellular Machines: The Cellular Programming Approach. Lecture Notes in Computer Science Vol. 1194. Springer-Verlag (1997). [7] Das, R., Crutchfield, L., Mitchell, M., and Hanson, J.: Evolving Globally Synchronized Cellular Automata. Proceedings of the 6-th ICGA (1995) 336-343. [8] Land, M. and Belew, R.: NO Two-State CA for Density Classification Exists. Physical Review Letters 74 (1995) 5148. [9] Fuks, H.: Solution of the Density Classification Problem with Two Cellular Automata Rules. Physical Review E, 55(3) (1997) 2081-2084. [10] Kato, N., Okuno, T., Suzuki, R., and Kanoh, H.: Modeling Virtual Cities Based on Interaction between Cells. IEEE International Conference on Systems, Man, and Cybernetics (2000) 143-148.
Dynamic Control of the Browsing-Exploitation Ratio for Iterative Optimisations L. Baumes1, P. Jouve1, D. Farrusseng2, M. Lengliz1, N. Nicoloyannis1, and C. Mirodatos1 1
Institut de Recherche sur la Catalyse-CNRS 2, Avenue Albert Einstein - F-69626 Villeurbanne Cedex France 2 Laboratoire Equipe de Recherche en Ingénierie des Connaissances Univ. Lumière Lyon2 -5, Avenue Pierre Mendès-France 69676 Bron Cedex France
[email protected] Abstract. A new iterative optimisation method based on an evolutionary strategy. The algorithm is proposed, which proceeds on a binary search space, combines a Genetic Algorithm and a knowledge extraction engine that is used to monitor the optimisation process. In addition of the boosted convergence to the optima in a fixed number of iterations, this method enables to generate knowledge as association rules. The new algorithm is applied for the first time for the design of heterogeneous solids in the frame of a combinatorial program of catalysts development.
1
Introduction
Today, the Genetic Algorithm (GA) approach is a standard and powerful iterative optimisation concept mainly due to its unique adaptation skills to highly diverse contexts. In a sequential manner, it consists at designing population of individuals that shall exhibit superior fitness, generation after generation. When it proceeds, the population shall focus on the optima of the space parameters. The optimisation process is over when criteria on the objective function are achieved. Very recently, this approach has been successfully applied for the optimisation of advanced materials (e.g. catalysts) for the petro-chemistry [1,2]. In this frame, the population consists at around 50 catalysts that are prepared and tested by automated working stations within a few days [3]. In addition of a quick optimisation demand, the total number of iteration was imposed. In this case, convergence of GAs may not be reached due to the "time" constraint. In turn, we face a particular problem: “Knowing that the number of iterations is low and a priori fixed, can efficient optimisations be carried out?” Thus, our strategy is to “act” on the stochastic character of genetic operators in order to guide the GA, without any modification of its algorithmic structure. It comes clear that the relevance of the population design shall be closely linked to the V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 265-270, 2003. Springer-Verlag Berlin Heidelberg 2003
266
L. Baumes et al.
remaining time (e.g. numbers of iterations left). Indeed the Browsing-Exploitation ratio shall be monitored by the number of iterations left that allows the algorithm to get the best out of all the iterations at its disposal. Except Case Based Reasoning [4] and Supervised Learning methods [5] a very limited numbers of works deal for this purpose. This work presents a learning method based on extraction of equivalence classes, which accumulates strategic information on the whole previously explored search space, and that enables the dynamic control of the Browsing/Exploitation ratio.
2
Context and of the Catalysts Developments and Optimisation Requirements
Because industrial equipments are designed to prepare and test collection of catalysts (called libraries), iterative optimisation and more specifically the evolutionary approach is very well suited. The goal is to optimise properties of catalysts (produce quality). The optimisation cycle consists at designing a new population of individuals (catalysts) in light of previous individuals testing. In the context of heterogeneous catalysis, it does exist neither strong general theories nor empirical models that allow optimising a catalysts formulation for new specificity and requirements in a straightforward manner. Moreover, the probable behaviour of individuals cannot be estimated by simulation. In this study, we face the problem that the number of iterations is a priori fixed and fixed at a low value. It is not the purpose of this study to seek efficient algorithms that reduce the number of iterations. In addition, since the synthesis and testing of catalysts takes half-day to a few days, the calculation time is not a requirement. The context can be depicted in an industrial design process (Figure 1). Final Product
Specifications (compounds, processes, knowledge, objective function, time, …)
Cycle end Yes/No
Population Design
Population Production
Data analysis & decision
Population Test
Fig. 1. Iterative Design Process
On a general point of view, the data analysis and decision step, which follows each population test, emphases new guidelines for individuals improvement and also to update previous guidelines, which have been already considered. Thus, knowledge about the individuals (catalysts) and their properties (yields) is created, up-dated in the course of the screening and re-injected in the design process. In turn, each population design takes benefits of previous experiments. The knowledge mainly
Dynamic Control of the Browsing-Exploitation Ratio for Iterative Optimisations
267
arises from the discovery of relationships between catalysts variables and the respective yields. In addition, the discovery of relationships between the properties of the variables and the yields can provide occasionally very useful guidelines. Indeed, it allows passing form categorical variables to such as the elements of the Mendeleeï v periodic table to continuous variables such as their respective electronegativity values. When introducing these additional information at the starting point, the optimisation process can be boosted because of the wider knowledge accumulated. Because of the very limited iterations number, usual optimisation methods such as GAs, would not be efficient to address this issue. Furthermore, GAs do not take into account the information linked to the catalysts (variables properties) so that the knowledge that would be extracted in the course of the screening could not be re-injected in the optimisation process. In addition, the stochastic feature of genetic operators can become a major problem when the number of iteration is very limited. In the industrial process, the browsing on the search space shall be promoted at the early stage of the screening whereas the exploitation shall take more and more priority when the end of the optimisation process is approaching. The genetic operators work simultaneously on both browsing and exploitation. However, this BrowsingExploitation ratio cannot be directly and dynamically controlled by usual AG in the course of the screening.
3
A New Hybrid Algorithm for Iterative Optimisation
In order to extract and manage information in the optimisation cycle, we have developed a new approach that consists at hybridising a GA with a Knowledge Based System (KBS), which extracts empirical association rules on the whole set of previously tested catalysts and re-injects these strategic information to guideline the optimisation process. The guideline is performed by monitoring the Browsing/Exploitation ratio with the KBS, which enables to consider dynamically the number of iterations. 3.1
AG and Knowledge-Based System Search Spaces
The algorithm operates with all kinds of GAs. The GA search space can consider ordered and non-ordered qualitative variables that can be are either pure binary variable (presence or absence of elements), or discretized continuous variable coded by a set of binary value (element concentration) or nominal variable coded by a set of binary value (Figure 2). The KBS search space includes the GA search space with additional information, which are inherent features of the GA variables such as the size of the pore or the elemental composition of discrete type of zeolites. Table 1. GA and KBS representation GA search space P re pa ra tion va ria ble s Ze olite Ca tion/m e thod KL ZSM-5 NaY NaX Na-beta Li/imp Li/ie 1 0 1 0 0 0 1 0 2 0 0 0 1 0 0 1 3 1 0 0 0 1 0 0
Extended search space … … … … …
Ca tion cha rge "+1" "+2" 1 0 1 0 0 1
Additionna l va ria ble s O-ring Ele ctrone ga tivity 10 12 0.79-0.82 0.89-0.95 0.98-1 1.31 1 0 0 0 1 0 0 1 0 0 1 0 1 1 1 1 0 0
… … … … …
268
L. Baumes et al.
3.2
The Knowledge-Based System
The mechanism of the KBS cannot be extensively depicted in this extended abstract; details can be found elsewhere [6-8]. The KBS search space is binning in different zones that form a lattice. The space zone extraction consists in obtaining equivalence classes by the maximum square algorithm . Each of these zones is characterized by two coefficients B and E, which are characteristics of the Browsing and Exploration level, respectively. The relevance R of an individual x that is in a zone Zi is calculated a function of B and E. We define the followings:
R(h1 ( x), h2 ( x), t , µ) with h = ( f o g )
h1 = f1 (B(Zi )), h2 = f 2 (E (Zi )) and B(Zi ) = g1 ( x, t , µ),
E(Zi ) = g 2 ( x, t , µ)
For example, we can define B, E and R as below:
B( Z i ) =
(number of individuals x previously tested in Zi ) card(Zi )
E ( Z i ) = mean fitness of individuals x previously tested in Zi R(xk ) =
∑ (E (Z ) × (1 − B(Z )) i
i
∀Z i , xk ∈ Z i
i
In this example R is actually the Browsing-Exploitation ratio, others more complex function can be used. The total number of iteration (µ) can be here injected. The Browsing coefficient function enables to estimate the “real” diversity of an individual with respects to the occupancy rates of all the classified zones and quantifies its contribution to the extent of browsing of the total search space. The second coefficient E that evaluates the performances, takes into account the results of all individuals x in the same classification zones (for example the mean value). This second function quantifies the relevance of an individual for improving exploitation. In order to get a unique selection criteria, a third coefficient which combines the two former ones and the number of remaining individuals still to be tested t is defined R(h1(x),h2(x),t). It enables to evaluate the global relevance of an individual with respect to its interest for both the browsing and exploitation. In this manner, the relevance of individuals constituting each new generation can be increased and controlled as a function of time. 3.3
Hybridization Mechanism
The individual that exhibits the highest Relevance shall constitute the next generation. Within the traditional framework, GAs take as input a population of size K and proposes a new population of identical size (figure 4). In order to increase the number of individuals in the population of output (size K) with high T(x) values, one forces the GA to propose a number µK (µ>1, µ∈IN) of individuals, among which K valuers showing the highest relevance R are selected. The choice of the various functions f1, f2, g1, g2 and R is of major importance and completely defines the manner of controlling the optimisation process.
Dynamic Control of the Browsing-Exploitation Ratio for Iterative Optimisations Initial population
269
K individuals selection of K individuals by the KBS
Evaluation
µ x K individuals
Final population of K individuals
end criteria GA
Fig. 2. Mechanism of the AG-KBS coupling
4
Tests and Discussions
A validation of the algorithm has been carried out on the following benchmark, which simulate a real case of industrial optimisation. • •
•
AG Space: ∀i, {x1,…,x6}∈ℜ+ ; xi ∈[0...1] - binary coding on 4 bits KBS Space: xi, min(xi≠0), max(xi) in {[0..0,25[ [0,25..0,5[ [0,5..0,75[ [0,75..1[} x1, x4 and (x1&x4) presence or absence ratio = max / min in {[0..1] ]1..2] ]2..4] ]4..8] ]8..16]} Number of element (NbE) in {1;2;3;4;5;6} Fitness Function: (4Σ(x1+x2+x3)+2Σ(x4+x5)-x6) × promotor × factor × feasible with: If (x1 or x4)>0.15 then promotor = 0 else 1 If NbE < 4 then factor = NbE else (7-NbE) If ratio > 10 then feasible = 0 else 1
A conventional GA is used for comparison purpose for which the features are: one-point crossover (number of parents: 2), roulette wheel selection, 24 (6*4) of chromosome size and a generation size (K) of 20. For the GA-KBS algorithm two studies have been carried out: with µ1 fixed at 10 (figure 5) and µ2=2 which increases of one unity each 10 generation. Fitness
Fitness
Fitness
Generation 25000
50000
75000
Generation
Generation 5000
10000
100
200
500
1000
Fig. 3. Test results, SGA (left), AG-KBS with µ=10 (middle), zoom on both (right)
270
L. Baumes et al.
In figure 3, the curves represent the mean value for maxima found on 100 runs. The areas around the curves show the dispersion. The hybrid algorithm gives much better results even if the parameters are not optimised. We noticed that with µ2 the algorithm shows better results, KBS cannot learn ‘good' rules on few data at the beginning (low µ) but needs many different chromosomes (high µ) in order to choose whereas GA generations often converge to local optima. Indeed, the KBS decreases the dispersion of GA runs and increases fitness results. The SGA converges to local optima and does not succeed to reach the global maximum (72) before a long time. At the same time, KBS tries to keep diversity in generations. Therefore higher mutation and crossover rates are required in order to get individuals which belong to different KBS zones.
References [1] [2] [3]
[4] [5] [6] [7] [8]
Wolf, D.; Buyevskaya, O. V.; Baerns, M. Appl. Catal., A 2000, 200, 63-77. Corma, A.; Serra, J. M.; Chica, A. In Principles and methods for accelerated catalyst design and testing; Derouane, E. G., Parmon, V., Lemos, F., Ribeiro, F. R., Eds.; Kluver Academic Publishers: Dordrecht - NL, 2002, pp 153-172. Farrusseng, D.; Baumes, L.; Vauthey, I.; Hayaud, C.; Denton, P.; Mirodatos, C. In Principles and methods for accelerated catalyst design and testing; Derouane, E. G., Parmon, V., Lemos, F., Ribeiro, F. R., Eds.; Kluver Academic Publishers: Dordrecht - NL, 2002, pp 101-124. Ravisé, C.; Sebag, M.; Schoenauer, M. In Artificial Evolution; Alliot, J. M., Lutton, E., Ronald, Schoenauer, M., Snyers, D., Eds.; Springer-Verlag, 1996, pp 100-119. Rasheed, K.; Hirsh, H. In The Seventh International Conference on Genetic Algorithms (ICGA'97), 1997. Norris, E. M. Revue Roumaine de Mathématiques Pures et Appliquées 1978, 23, 243-250. Agrawal, R.; Imielinski, T.; Swami, A. In Proceedings of the ACM SIGMOD Conference, 1993. Belkhiter, N.; Bourhfi, C.; Gammoudi, M. M.; Jaoua, A.; Le Thanh, N.; Reguig, M. INFOR 1994, 32, 33-53.
Intelligent Motion Generator for Mobile Robot by Automatic Constructed Action Knowledge-Base Using GA Hirokazu Watabe, Chikara Hirooka, and Tsukasa Kawaoka Chikara Hirooka and Tsukasa Kawaoka Dept. of Knowledge Engineering & Computer Sciences, Doshisha University Kyo-Tanabe, Kyoto, 610-0394, Japan
Abstract. Intelligent robots, which can become a partner to us humans, need ability to act naturally. In actions of the robot, it is very important to recognize environment and move naturally. In this paper, the learning algorithm to construct knowledge of action in order to achieve tasks that is given to the mobile robot is proposed. Conventionally, the knowledge of action for the robot was mostly constructed by humans, but the more complex the information of the environment, action, and task becomes, the more trial and error must be repeated. Therefore, the robot must learn and construct the knowledge of action by itself. In this study, the action to achieve a task in an environment is generated by genetic algorithm, and it is shown that by repeatedly extracting knowledge of action, the construction of the Action Knowledge-Base concerning a task of any situation is possible.
1
Introduction
In the field of mobile robot control, many approaches have been tried to generate robot control program or strategy by giving only the objective task [2][3][4][5][6][7][8]. To achieve such a problem is not so difficult if the size of problem is small or a number of states of the robot is small. It becomes, however, very difficult when the size of problem is big or the number of states of the robot is big. It is assumed that a mobile robot has some sensors and acts some actions. The robot can properly acts, in some environment, if all action rules are given. The action rule is a kind of IF-THEN rule. A part of IF is one of the states of sensors and a part of THEN is one of the actions that the robot can perform. The problem treated in this paper is to decide all action rules automatically. If the number of states of sensors is small and the number of possible actions is also small, the problem is not so difficult. But, if these numbers are relatively big, the problem becomes difficult to solve because the size of solution space is proportional to the combination of those numbers. The problem becomes too difficult to solve if the number of states of sensors is infinite. Originally, the number of states of sensors is infinite since the V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 271-278, 2003. Springer-Verlag Berlin Heidelberg 2003
272
Hirokazu Watabe et al.
value of each sensor is a continuous variable. However, in conventional methods, each sensor variable must be quantitized and the number of states of sensors must be finite and as small as possible. This constraint is hard to solve the practical problems. In this paper, the learning method to construct the action knowledge base, which consists of all proper action rules to achieve the task, by which the robot can properly act in any situation different from learning situation, is proposed. The learning method is basically constructed by genetic algorithm (GA)[1]. GA automatically generates some action rules in the learning environments. To evaluate the proposed method, the mobile robot simulator is made, and expected results are demonstrated. subgoal
G
S Fig. 1. Road Like Environment
d
subgoal b
L
a
R Fig. 2. Mobile Robot
Intelligent Motion Generator for Mobile Robot
2
Target Environment and Mobile Robot
2.1
Assuming Environment
273
In this paper, the environment for the robot is road like one shown in figure 1. That is, there are two places. One place is the place where the robot can move and the other place is the place where the robot cannot move. These two places are separated by some kinds of wall. In this environment, the robot can generate subgoals which are intermediate points from start point to goal point using his sonar sensors. Each subgoal has position and direction.
Fig. 3. Subgoal generator
2.2
Definition of Mobile Robot
In this paper, Pioneer 2 type mobile robot is assumed. Figure 2 shows the structure of the robot. It is assumed that the robot recognizes the following five parameters. (Unit length is a distance between left and right wheel of the robot.) 1. 2.
A distance to the subgoal: d (0 is normalized conditionally positive definite for every c>0 and hence ϕ(g):= exp[ψ(g)] is positive definite in the usual sense. Mercer kernels K may be obtained from these functions for example by setting K(g1, g2):= ϕ(g −1 g1) or 2 1 − K(g1, g2):= ψ (g g1) − ψ (g −1) − ψ (g1) 2 2 Proof: See example 2, lemma 1.2, and theorem3.2 in [Fal2]. Unfortunately the 1-cohomology for topological groups is by no means easy to calculate explicitly in general, although some results are available for ℜn, cf. [Fal2]. However, strangely enough some known facts about non-commuting random products for semi-simple Lie groups prove helpful to obtain very concrete results.
3
Semi-simple Lie Groups with Finite Centre
Note first of all that in theorem 1 above an explicit knowledge of δ may not be required. Indeed, suppose that G is a semi-simple Lie group with finite centre. Let G = KAN, with K compact, A abelian, and N nilpotent, be its Iwasawa decomposition, cf. e.g. [Hel] for the technical details. Suppose further that a nontrivial cocycle of G associated with U satisfies δ(k) = 0 ∀k ∈ K . Then it follows that U cannot have a non-trivial vector vector invariant under K, cf. [Erv], p. 88. From this fact one obtains: Lemma 1: The function ψ given in theorem 1 satisfies the functional equation
312
Bernd-Jürgen Falkowski
∫ ψ(g1k g 2)dk = ψ(g1) + ψ(g 2) ∀ g1 , g 2 ∈ G ------------------------------(*)
K
where dk denotes the Haar measure on K. Proof: See [Erv], p. 90. The equation (*) above is quite remarkable since it appears in many different contexts, cf. e.g. [Fur] for non-commuting random products and laws of large numbers, or [Gan], [Fal1] for Levy-Schoenberg Kernels. If it has a unique nonnegative solution and a non-trivial cocycle is known to exist, then the negative of its solution is conditionally positive definite. In [Fur] non-negative solutions of (*) are described as follows: Let G = KAN as above, and let MAN, where M is the centralizer of A in K, be a minimal parabolic subgroup, cf. e.g. [Fur] for the technical details. Consider B = KAN/MAN. Then K acts transitively on this homogeneous space and in fact there exists a unique K-invariant probability measure m on B, cf. [Fur]. In order to describe the solutions of (*) a further definition of [Fur] is needed. Definition 2: If G acts transitively on a topological space X, then an A-cocycle on X is a real-valued continuous function ρ on G×X satisfying (i) (ii)
ρ(g1g2,x) = ρ(g1,g2x) + ρ(g2,x) ρ(k,x) = 0 ∀k∈K
∀ g1, g2 ∈ G, x ∈ X
Then the following theorem holds, cf.[Fur], [Fal1]: Theorem 2: The function ψρ defined by ψρ(g):= ∫ ρ(g, x )dm( x ) B
is a non-negative solution of (*). Note that the ψρ thus constructed is biinvariant with respect to K, i.e. satisfies the condition ψρ(k1gk2) = ψρ(g) ∀ k1, k2 ∈ K, ∀ g ∈ G, which is rather useful since any element in G may be written as g = k1ak2 with k1, k2 ∈ K, a ∈ A in view of the Cartan decomposition of G, cf. [Hel]. Hence it will suffice to compute ψρ on A.
4
Concrete Examples (SOe(n;1) and SU(n;1))
As examples the real and complex Lorentz groups will be treated in this section (actually in the real case only the connected component of the identity will be dealt with). Definition 3: SOe(n;1) (SU(n;1)) are real (complex) (n+1)×(n+1) matrix groups which leave the real (hermitian) quadratic forms n −1 ∑ x i2 − x 2 n i=0
invariant.
n −1 ( ∑ | x i |2− | x n |2) i=0
Mercer Kernels and 1-Cohomology of Certain Semi-simple Lie Groups
313
The homogeneous G-space B = KAN/MAN in either case turns out to be the (real (complex)) Sphere Sn-1 on which an element g := [gij] for 0 ≤ i, j ≤ n of G acts by fractional linear transformations, i.e. n −1 n −1 (gx)i:= ( ∑ g ik x k + g in ) /( ∑ g nk x k + g nn ) k=0 k=0 n −1 n −1 ∑ x i2 = 1 ( ∑ (| x i | 2 = 1) and x := (x0, x2, …, xn-1). i=0 i=0 It is in fact known, cf. [Fur], that there is, up to scalar multiples, precisely one Acocycle in either case, which is described in:
where
Lemma 2: Let g ∈ SOe(n;1) (SU(n;1)) be given by g := [gij] for 0 ≤ i, j ≤ n. Then ρ(g,x) := log| where
n −1 ∑ g nk x k + g nn |, k=0
x := (x0, x2, …, xn-1), is an A-cocycle.
Proof: Computation (note here that the invariance conditions given in definition 3 of course impose certain restrictions on the gij which have not been given here but must be used in the computation). Q.E.D. Next note that the abelian group A appearing in the Cartan decomposition is given by the special elements g := [gij] for 0 ≤ i, j ≤ n satisfying gij = δij (Kronecker delta) for 0 ≤ i, j ≤ n-2, for some t ∈ ℜ, gn-1n-1 = gnn = cosht, gn-1n = gnn-1 = sinht all other gij being zero. Hence A is isomorphic in either case to the group of hyperbolic rotations in the plane or equivalently (ℜ, +). If one denotes a typical element of A by a(t) in this case, then ρ(a(t),x) = log|xn-1*sinht + cosht|. From this one then immediately obtains the following theorem: Theorem 3: The negative of the function ψρ(t) defined by π
ψρ(t):= ∫ log(sinh t cos u + cosh t ) sin n − 2 u du 0
is conditionally positive definite on the real line for every n ≥ 2. Proof: By theorem 2, lemma 2, and the remarks above the negative of the function fρ(t) originally defined by fρ(t):=
∫ ρ(a ( t ), x )dm( x ) n −1
S
on SOe(n;1) is conditionally positive definite on the real line for every n. But from this one immediately obtains the expression given in theorem 3 (modulo a constant) on transforming to polar coordinates, cf. e.g. [Mag]. Q.E.D.
314
Bernd-Jürgen Falkowski
Note that the integral for n = 3 may be evaluated and one thus obtains an interesting explicit result. Corollary to Theorem 3: The negative of the function ψρ(t) defined by ψρ(t):= t*coth t – 1 is conditionally positive definite on the real line. Hence the kernels K1 and K2 defined by K1(t1,t2):= exp[1-(t1-t2)*coth (t1-t2)] respectively K2(t1,t2):= -(t1-t2)*coth (t1-t2) + t2*coth t2+ t1*coth t1 –1 are Mercer kernels on the real line. π
Proof: Note that ∫ log(sinh t cos u + cosh t ) sin u du = 2*(t*coth t – 1), discard the 0
factor 2 (clearly any non-negative constant factor could appear in the kernels without altering the fact that they are Mercer kernels) and apply theorem 1. Q.E.D. Remark: If one is willing to invest a little more effort, then the case n = 2 can be evaluated explicitly as well, cf. e.g. [Gan] for a similar calculation. This computation has not been performed here since the result is similar to the case discussed in theorem 4 below. In the complex case the calculations are slightly more complicated and thus only an interesting special case will betreated here. For SU(1;1) one obtains the following result: Theorem 4: The negative of the function ψρ(t) defined by ψρ(t):= log(cosht) is conditionally positive definite on the real line. Hence the kernels K3 and K4 defined by K3(t1,t2):= 1/cosh(t1-t2) respectively K4(t1,t2):= log[(cosht1*cosht2)/cosh(t1-t2)] are Mercer kernels on the real line. Proof: For SU(1;1) an invariant measure on the complex unit circle is given by (1/2πi)dx0/x0 where |x0| = 1. Hence 1 / 2πi ∫ log | sinh t x 0 + cosh t | d x 0 / x 0 = log(cosh t) by the Cauchy residue 1
S
theorem, cf. e.g. [Ahl]. Again apply theorem 1 to obtain the result.
Q.E.D.
Mercer Kernels and 1-Cohomology of Certain Semi-simple Lie Groups
5
315
Concluding Remarks
The functions described in theorem 2 describe the Gaussian part of a noncommutative analogue of the Levy-Khintchin formula, cf. e.g. [Gan]. Indeed, if one wishes to consider more general Mercer kernels, then this is perfectly possible, see e.g. [Fal1] for a more thorough discussion of the situation, which would have exceeded the framework of this paper. However, there is also an interesting connection to a law of large numbers which should not be overlooked. If {Xn} is a sequence of independent identically distributed random variables with values in a multiplicative group of positive real numbers, then the law of large numbers can be stated as follows: (1/n)log(X1X2 … Xn) → E(logXn), where E denotes the expectation value. The function ψ(t):= log t is characterized up to a constant multiple by the equation log (t1t2) = log t1 + log t2. However, a semi-simple Lie group does not admit any non-trivial homomorphisms and hence it would be pointless to look for an exact analogue of this function. The relevant analogue is the functional equation (*) given above and using solutions of this functional equation a law of large numbers can indeed be proved in the non-commutative situation prevailing in semisimple Lie groups as discussed above, for details see e.g. [Fur]. Since only Mercer kernels have been described above it may be worthwhile to remind the reader that from these the relevant feature spaces may be constructed as Hilbert spaces. In this case the Mercer kernels are used to define the inner product (that is to say they are reproducing kernels). Alternatively they may be used to define positive integral operators where the feature maps then arise as eigenfunctions belonging to positive eigenvalues. For a more detailed discussion of these aspects the reader is referred to [Sch] chapter 1. Finally, from a practical point of view, for example the very explicit result in the corollary to theorem 3 seems noteworthy, since even for the hyperbolic tangent kernel the positive definiteness criterion is satisfied only for some parameter values, cf. e.g. [Hay], p. 333.
References [Ahl] [Cri] [Erv] [Fal1] [Fal2]
Ahlfors, L.V.: Complex Analysis. Second Edition, (1966) Cristianini, N.; Shawe-Taylor, J.: An Introduction to Support Vector Machines and other Kernel-Based Learning Methods. Cambridge University Press, (2000) Erven, J.; Falkowski, B.-J.: Low Order Cohomology and Applications. Springer Lecture Notes in Mathematics, Vol. 877, (1981) Falkowski, B.-J.: Levy-Schoenberg Kernels on Riemannian Symmetric Spaces of Non-Compact Type. In: Probability Measures on Groups, Ed. H. Heyer, Springer Lecture Notes in Mathematics, Vol. 1210, (1986) Falkowski, B.-J.: Mercer Kernels and 1-Cohomology. In: Proc. of the 5th Intl. Conference on Knowledge Based Intelligent Engineering Systems & Allied Technologies (KES 2001), Eds. N. Baba, R.J. Howlett, L.C. Jain, IOS Press, (2001)
316
[Fur] [Gan] [Hay] [Hel] [Mag] [Sch]
Bernd-Jürgen Falkowski
Furstenberg, H.: Noncommuting Random Products. Trans. of the AMS,Vol. 108, (1962) Gangolli, R.: Positive Definite Kernels on Homogeneous Spaces. In: Ann. Inst. H. Poincare B, Vol. 3, (1967) Haykin, S.: Neural Networks, a Comprehensive Foundation. Prentice-Hall, (1999) Helgason, S.: Differential Geometry and Symmetric Spaces. Academic Press, New York, (1963) Magnus, W.; Oberhettinger, F.: Formeln und Sätze frü die speziellen Funktionen der Mathematischen Physik. Springer, (1948) Schölkopf, B.; Burges, J.C.; Smol a, A.J. (Eds.): Advances in Kernel Methods, Support Vector Learning. MIT Press, (1999)
On-line Profit Sharing Works Efficiently Tohgoroh Matsui1 , Nobuhiro Inuzuka2 , and Hirohisa Seki2 1
2
Department of Industrial Administration, Faculty of Science and Technology Tokyo University of Science, 2641 Yamazaki, Noda-shi, Chiba 278-8510, Japan
[email protected] Department of Computer Science and Engineering, Graduate School of Engineering Nagoya Institute of Technology Gokiso-cho, Showa-ku, Nagoya 466-8555, Japan {inuzuka@elcom,seki@ics}.nitech.ac.jp
Abstract. Reinforcement learning constructs knowledge containing state-to-action decision rules from agent’s experiences. Most of reinforcement learning methods are action-value estimation methods which estimate the true values of state-action pairs and derive the optimal policy from the value estimates. However, these methods have a serious drawback that they stray when the values for the “opposite” actions, such as moving left and moving right, are equal. This paper describes the basic mechanism of on-line profit-sharing (OnPS) which is an actionpreference learning method. The main contribution of this paper is to show the equivalence of off-line and on-line in profit sharing. We also show a typical benchmark example for comparison between OnPS and Q-learning.
1
Introduction
Reinforcement learning constructs knowledge containing state-to-action decision rules from agent’s experiences. Most of reinforcement learning methods are based on action-value function which is defined as the expected return starting from state-action pairs, and called action-value estimation methods [5]. Action-value estimation methods estimate the true values of state-action pairs and derive the optimal policy from the value estimates. However, these methods have a serious drawback that they stray when the values for the “opposite” actions, such as moving left and moving right, are equal. This paper describes the basic mechanism of on-line profit-sharing (OnPS). Profit sharing [2, 3] is an action-preference learning method, which maintains preferences for each action instead of action-value estimates. Profit sharing1 has an advantage of the number of parameters to be set, because it only needs a discount rate parameter. Moreover, Arai et al. [1] have shown that profit sharing outperforms Q-learning in a multi-agent domain. 1
Note that profit sharing sometimes converges to a locally optimal solution like EM algorithm or neural networks.
V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 317–324, 2003. c Springer-Verlag Berlin Heidelberg 2003
318
Tohgoroh Matsui et al.
Profit sharing is usually an off-line updating method, which does not change its knowledge until the end of a task. One drawback is that ordinary profit sharing requires unbounded memory space to store all of the selected stateaction pairs during the task. Another disadvantage is that intermediate rewards until the end of the task are considered to be zero. Although OnPS is one of the profit-sharing methods, it can be implemented with bounded memory space and use intermediate rewards during the task for updating its knowledge. The main contribution of this paper is to show the equivalence of off-line and on-line in profit sharing. We also give the result of a typical benchmark example for comparison between OnPS and Q-learning [6].
2
Profit Sharing
Profit sharing [2, 3] considers the standard episodic reinforcement learning framework, in which an agent interacts with an environment [5]. In profit sharing, state-action pairs in a single episode of T time steps, s0 , a0 , r1 , s1 , a1 , r2 , · · · , rT , sT with states st ∈ S, actions at ∈ A, and rewards rt ∈ , are reinforced at the end of the episode. We consider the initial state, s0 , to be given arbitrarily. In episodic reinforcement learning tasks, each episode ends in a special state called the terminal state. Profit sharing learns the action-preferences, P , of state-action pairs. After an episode is completed, P is updated by P (st , at ) ← P (st , at ) + f (t, rT , T ),
(1)
for each st ∈ S, at ∈ A in the episode, where f is a function, called credit assignment. The rationality theorem [4] guarantees that unnecessary state-action pairs, that is, pairs evermore make up the loop, are always given smaller increments than the necessary pairs, if the credit-assignment function satisfies f (t, rT , T ) > L
t−1
f (i, rT , T ),
(2)
i=0
where L is the maximum number of pairs, except for unnecessary ones, in a state. In many cases, a function f (t, rT , T ) = γ T −t−1 rT ,
(3)
is used, where γ is a parameter, also called the discount rate. This function satisfies Equation (2) with γ ≤ 1/ max |A(s)| ≤ 1/L. We also use this function in this paper. The shape of the function is shown in Figure 1. The increments for P (st , at ) decrease as the state-action pair, st and at , was visited earlier. Profit
On-line Profit Sharing Works Efficiently
319
rT
f (t, r T , T) = γ T − t − 1 r T γ1 rT γ2 rT ...
T− 3
T−2
T−1
Step t
Fig. 1. A credit-assignment function in the form of Equation (3), in profit sharing. The increments of P (st , at ) decrease as the difference between t and T increases
sharing chooses action a with probability in proportion to action preferences at state s: P (s, a) Pr(at = a|st = s) = . a ∈A(s) P (s, a ) This is called weighted-roulette. Figure 2 shows the algorithm for the profitsharing method.
Initialize, for all s ∈ S, a ∈ A(s): P (s, a) ← a small constant Repeat (for each episode): Initialize s Repeat (for each step of episode): Choose a from s using weighted-roulette derived from P Take action a; observe reward, r, and next state, s s ← s until s is terminal For each pair st , at appearing in the episode: P (st , at ) ← P (st , at ) + f (t, rT , T )
Fig. 2. Off-line profit-sharing algorithm. The function f is a credit-assignment function
3
On-line Profit-Sharing Method
The ordinary profit-sharing method has the problem that its memory and computational requirements increase without bound, because the method must store
320
Tohgoroh Matsui et al. c(s,a) Steps
Visiting to this state-action pair
Fig. 3. The credit trace for a state-action pair. It decays by γ on each step and is incremented by one when the state-action pair is visited in a step, in accordance with Equation (4)
all the experienced states, the selected actions, and the final reward. That is, it requires s0 , a0 , s1 , a1 , · · · , sT −1 , aT −1 , rT to calculate the updates of all of P for state-action pairs st and at in an episode. On the other hand, incremental methods compute the (k + 1)th values from the kth values, so that it is not necessary to keep earlier state-action pairs. Instead, our incremental implementation method requires additional but bounded memory for credit traces. Credit traces, c, maintain credit assignments for each state-action pair. They are based on eligibility traces [5] which are one of the basic mechanisms of reinforcement learning. We apply this technique for implementing the profit-sharing method incrementally. We call this version of the profit-sharing method the online profit-sharing method (OnPS) because updates are done at each step during an episode, as soon as the increment is computed. Because the number of elements in the credit trace, c, is equal to that in the preference, P , the size of required memory is bounded. In each step, the credit traces for all states decay by γ, while the trace for the state-action pair visited on the step is incremented by one: γct−1 (s, a) + 1 if s = st and a = at ; ct (s, a) = γct−1 (s, a) otherwise, for all non-terminal states s ∈ S, a ∈ A. The credit trace accumulates each time when the state-action pair is visited, then decays gradually while the state-action pair is not visited, as illustrated in Figure 3. The increments are given as follows: ∆Pt (s, a) = rt+1 ct (s, a) for all s, a. Figure 4 shows the algorithm for OnPS.
(5)
On-line Profit Sharing Works Efficiently
321
Initialize, for all s ∈ S, a ∈ A(s): P (s, a) ← a small constant Repeat (for each episode): Initialize s c(s, a) = 0, for all s, a Repeat (for each step of episode): Choose a from s using weighted-roulette derived from P c(s, a) ← c(s, a) + 1 Take action a; observe reward, r, and next state, s For all s, a: P (s, a) ← P (s, a) + rc(s, a) c(s, a) ← γc(s, a) s ← s until s is terminal
Fig. 4. On-line profit-sharing (OnPS) algorithm
4
Equivalence of Off-line and On-line Updating in Profit-Sharing Methods
In this section, we show that the sum of all the updates is exactly the same for the off-line and on-line profit-sharing methods over an episode. Theorem 1. If the rewards are zero until the system reaches one of the goal states and Equation (3) is used as the credit-assignment function, then the offline updating given in Equation (1) is equivalent to the on-line updating given in Equation (5). Proof. First, an accumulating eligibility trace can be written non-recursively as ct (s, a) =
t
γ t−k Issk Iaak ,
k=0
where sk and ak are states and actions that appeared in the episode respectively, and Ixy is an identity-indicator function, equal to one if x = y and equal to zero otherwise. The update of a step in the on-line profit-sharing method is Equation (5). Thus, the sum of all the updates can be written as ∆P (s, a) =
T −1
rt+1 ct (s, a) =
t=0
T −1 t=0
rt+1
t
γ t−k Issk Iaak .
(6)
k=0
On the other hand, the sum of updates in the off-line profit-sharing method can be written using the identity-indicator functions as ∆P (s, a) =
T −1 t=0
f (t, rT , T )Isst Iaat .
322
Tohgoroh Matsui et al.
G
F
S
G
F
Fig. 5. 11×11 grid world example. S is the initial state, and G’s are goal states. G’s and F’s are terminal states
Using Equation (3) as the credit-assignment function, f , this can be transformed to T −1 T −1 T −t−1 γ rT Isst Iaat = rT γ T −t−1 Isst Iaat . (7) ∆P (s, a) = t=0
t=0
Now, we rewrite the right-hand side of Equation (6). Because all the intermediate rewards until the end of the episode are zero, r1 = r2 = · · · = rT −1 = 0, only the term for t = T − 1 remains: ∆P (s, a) =
T −1
rt+1
t=0
=r1
0
t
γ t−k Issk Iaak
k=0
γ 0−k Issk Iaak + · · · + rT
k=0
=rT
T −1
T −1
γ (T −1)−k Issk Iaak
k=0
γ T −k−1 Issk Iaak .
k=0
Therefore, the sums of all the updates in the off-line and on-line profit-sharing methods are same, if we use Equation (3) as the credit-assignment function, and all of the intermediate rewards until the end of the episode are zero. ✷
5
An Example
We used 11×11 grid-world shown in Figure 5. The MDP is deterministic and has 4 actions, moving up, down, left, or right. If an agent bumps into a wall, it remains in the same state. The four corner states are terminal. The agent receives a reward of 1 for the actions entering the bottom-right and upper-left corners. All of the other rewards are 0, including for entering the other two corners. The initial state is in the center.
0.1
0.1
0.08
0.08
Rewards per Step
Rewards per Step
On-line Profit Sharing Works Efficiently
0.06 0.04 0.02
Q-learning 1
10
100
OnPS
0.06 0.04 0.02
OnPS
0
1000 1e+4 1e+5 1e+6 1e+7 1e+8 Number of Steps
323
Q-learning
0 1
10
100
1000 1e+4 1e+5 1e+6 1e+7 1e+8 Number of Steps
Fig. 6. The performances of greedy policy (left) and either weighted-roulette or softmax policy which are used during the learning (right) in the 11×11 environment. The optimal performance is 0.1
We have implemented OnPS with the initial preferences are 1/|A| = 0.25, γ = 1/|A| = 0.25, and Q-learning using Gibbs-distribution softmax action-selection with α = 0.01, γ = 0.9 and τ = 0.2. The results are shown in Figure 6. For each evaluation 101 trials are run with a maximum step cutoff at 101 steps. We ran a series of 30 runs and show the average. The left panel shows the performances of greedy policy derived from the same knowledge at each total time step. The right panel shows the performance comparison of probabilistic policies, that is, between weighted-roulette and softmax, which are used during the learning. The results indicate that OnPS outperforms Q-learning for accomplishing the task when they learn. Since the true values for moving all directions are the same at center, actionvalue estimation methods stray as the estimation approaches the true value. Therefore, the performance of Q-learning was poor using the stochastic policy.
6
Discussion and Conclusion
Monte Carlo control method (MC-control) [5] is the most closest method to OnPS. MC-control acquires the average returns per selection for each stateaction pair in order to estimate action-values. The essential difference is that MCcontrol is an action-value estimation method, while OnPS is an action-preference learning method. Hence, MC-control suffers from even-value actions when it explores. We proved the equivalence of off-line and on-line updating in profit sharing, if all of the intermediate rewards are zero. Profit sharing can accomplish the task even during the learning, although the acquired knowledge is not guaranteed to be optimal. Nevertheless, OnPS worked well in several benchmark problems [5]. OnPS is superior to off-line profit-sharing method from the view point of required memory space. Furthermore, OnPS can deal with intermediate rewards during
324
Tohgoroh Matsui et al.
the task, though details of the effects have not examined yet. Those theoretical analysis remains as future work.
References [1] Sachiyo Arai, Katia Sycara, and Terry R. Payne. Experience-based reinforcement learning to acquire effective behavior in a multi-agent domain. In R. Mizoguchi and J. Slaney, editors, Proceedings of The 6th Pacific Rim International Conference on Artificial Intelligence (PRICAI-2000), volume 1886 of Lecture Notes in Artificial Intelligence, pages 125–135. Springer-Verlag, 2000. 317 [2] John J. Grefenstette. Credit assignment in rule discovery systems based on genetic algorithms. Machine Learning, 3:225–245, 1988. 317, 318 [3] John H. Holland. Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, editors, Machine Learning: An Artificial Intelligence Approach, volume 2. Morgan Kaufmann Publishers, 1986. 317, 318 [4] Kazuteru Miyazaki, Masayuki Yamamura, and Shigenobu Kobayashi. On the rationality of profit sharing in reinforcement learning. In Proceedings of The 3rd International Conference on Fuzzy Logic, Neural Nets and Soft Computing, pages 285–288. Fuzzy Logic Systems Institute, 1994. 318 [5] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998. 317, 318, 320, 323 [6] Christopher J. C. H. Watkins and Peter Dayan. Technical note: Q-learning. Machine Learning, 8(3/4):279–292, 1992. 318
Fast Feature Ranking Algorithm Roberto Ruiz, Jos´e C. Riquelme, and Jes´ us S. Aguilar-Ruiz Departamento de Lenguajes y Sistemas, Universidad de Sevilla. Avda. Reina Mercedes S/N. 41012 Sevilla, Espa˜ na {rruiz,riquelme,aguilar}@lsi.us.es Abstract. The attribute selection techniques for supervised learning, used in the preprocessing phase to emphasize the most relevant attributes, allow making models of classification simpler and easy to understand. The algorithm has some interesting characteristics: lower computational cost (O(m n log n) m attributes and n examples in the data set) with respect to other typical algorithms due to the absence of distance and statistical calculations; its applicability to any labelled data set, that is to say, it can contain continuous and discrete variables, with no need for transformation. In order to test the relevance of the new feature selection algorithm, we compare the results induced by several classifiers before and after applying the feature selection algorithms.
1
Introduction
It is advisable to apply to the database preprocessing techniques to reduce the number of attributes or the number of examples in such a way as to decrease the computational time cost. These preprocessing techniques are fundamentally oriented to either of the next goals: feature selection (eliminating non-relevant attributes) and editing (reduction of the number of examples by eliminating some of them or calculating prototypes [1]). Our algorithm belongs to the first group. Feature selection methods can be grouped into two categories from the point of view of a method’s output. One category is about ranking feature according to same evaluation criterion; the other is about choosing a minimum set of features that satisfies an evaluation criterion. In this paper we present a new feature ranking algorithm by means of Projections and the hypothesis on which the heuristic is based is: ”place the best attributes with the smallest number of label changes (NLC)”.
2
Feature Evaluation
2.1
Description
To describe the algorithm we will use the well-known data set IRIS, because of the easy interpretation of their two-dimensional projections.
This work has been supported by the Spanish Research Agency CICYT under grant TIC2001-1143-C03-02.
V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 325–331, 2003. c Springer-Verlag Berlin Heidelberg 2003
Roberto Ruiz et al.
5
5
4,5
4,5
4
4
3,5
3,5
sepalwidth
sepalwidth
326
3 2,5 2 1,5
3 2,5 2 1,5
1
1
0,5
0,5 0
0 0
2
(a)
4 sepallength
6
8
10
0
0,5
(b)
1
1,5
2
2,5
3
petalwidth
Fig. 1. Representation of Attributes (a) Sepalwidth-Sepallength and (b) Sepalwidth-Petalwidth
Three projections of IRIS have been made in two-dimensional graphs. In Figure 1(a) it is possible to observe that if the projection of the examples is made on the abscissas or ordinate axis we can not obtain intervals where any class is a majority, only can be seen the intervals [4.3,4.8] of Sepallength for the Setosa class or [7.1,8.0] for Virginica. In Figure 1(b) for the Sepalwidth parameter in the ordinate axis clear intervals are not appraised either. Nevertheless, for the Petalwidth attribute is possible to appreciate some intervals where the class is unique: [0,0.6] for Setosa, [1.0,1.3] for Versicolor and [1.8,2.5] for Virginica. SOAP is based on this principle: to count the label changes, produced when crossing the projections of each example in each dimension. If the attributes are in ascending order according to the NLC, we will have a list that defines the priority of selection, from greater to smaller importance. Finally, to choose the more advisable number of features, we define a reduction factor, RF, in order to take the subset from attributes formed by the first of the aforementioned list. Before formally exposing the algorithm, we will explain with more details the main idea. We considered the situation depicted in Figure 1(b): the projection of the examples on the abscissas axis produces an ordered sequence of intervals (some of then can be a single point) which have assigned a single label or a set of them: [0,0.6] Se, [1.0,1.3] Ve, [1.4,1.4] Ve-Vi, [1.5,1.5] Ve-Vi, [1.6,1.6] Ve-Vi, [1.7,1.7] Ve-Vi, [1.8,1.8] Ve-Vi, [1.9,2.5] Vi. If we apply the same idea with the projection on the ordinate axis, we calculate the partitions of the ordered sequences: Ve, R, R, Ve, R, R, R, R, R, R, R, R, R, R, Se, R, Se, R, Se, where R is a combination of two or three labels. We can observe that we obtain almost one subsequence of the same value with different classes for each value from the ordered projection. That is to say, projections on the ordinate axis provide much less information that on the abscissas axis. In the intervals with multiple labels we will consider the worst case, that being the maximum number of label changes possible for a same value. The number of label changes obtained by the algorithm in the projection of each dimension is: Petalwidth 16, Petallength 19, Sepallenth 87 and Sepalwidth 120. In this way, we can achieve a ranking with the best attributes from the point
Fast Feature Ranking Algorithm
327
Table 1. Main Algorithm Input: E training (N examples, M attributes) Output: E reduced (N examples, K attributes) for each attribute Ai ∈ 1..M QuickSort(E,i) NLCi ← NumberChanges(E,i) NLC Attribute Ranking Select the k first
Table 2. NumberChanges function Input: E training (N examples, M attributes), i Output: number of label changes for each example ej ∈ E with j in 1..N if att(u[j],i) ∈ subsequence of the same value changes = changes + ChangesSameValue() else if lab(u[j]) lastLabel) changes = changes + 1 return(changes)
of view of the classification. This result agrees with what is common knowledge in data mining, which states that the width and length of petals are more important than those related to sepals. 2.2
Algorithm
The algorithm is very simple and fast, see Table 1. It has the capacity to operate with continuous and discrete variables as well as with databases which have two classes or multiple classes. In the ascending-order-task for each attribute, the QuickSort [5] algorithm is used. This algorithm is O(n log n), on average. Once ordered by an attribute, we can count the label changes throughout the ordered projected sequence. NumberChanges in Table 2, considers whether we deal with different values from an attribute, or with a subsequence of the same value (this situation can be originated in continuous and discrete variables). In the first case, it compares the present label with that of the following value. Whereas in the second case, where the subsequence is of the same value, it counts as many label changes as are possible (function ChangesSameValue). After applying QuickSort, we might have repeated values with the same or different class. For this reason, the algorithm firstly sorts by value and, in
328
Roberto Ruiz et al.
case of equality, it will look for the worst of the all possible cases (function ChangesSameValue). We could find the situation as depicted in Figure 2(a). The examples sharing the same value for an attribute are ordered by class. The label changes obtained are two. The next execution of the algorithm may find another situation, with a different number of label changes. The solution to this problem consists of finding the worst case. The heuristic is applied to obtain the maximum number of label changes within the interval containing repeated values. In this way, the ChangesSameValue method would produce the output shown in Figure 2(b), seven changes. This can be obtained with low cost. It can be deduced counting the class’ elements. ChangesSameValue stores the relative frequency for each class within the interval. It is possible to be affirm that: if rf i > (nelem/2) them (nelem − rf i) ∗ 2
else nelem − 1
(1)
rfi: relative frequency for each class, with i in {1,. . . ,k} classes. nelem: number of elements within the interval. In Figure 2(a) we can observe a subsequence of the same value with eight elements: three elements are class A, four class B and one C. Applying formula 2 there is no relative frequency greater than half of the elements. Then, the maximum number of label changes is nelem-1, seven. In Figure 2(b) we verify it. Ranking algorithms produce a ranked list, according to the evaluation criterion applied. The methods need an external parameter to take the subset from attributes formed by the first features of the aforementioned list. This parameter produces different results with different data sets. Therefore, in order to establish the number of attributes in each case, we put the range of value of the ranked lists between [0,1], i.e. the punctuation of the first attribute of the list will be 1, and the last attribute 0. Then, we select attributes over the parameter named Reduction Factor (RF). We do not realize an especial analyzed on each data set.
3
Experiments
In order to compare the effectiveness of SOAP as a feature selector for common machine learning algorithms, experiments were performed using sixteen
3
3
3
3
3
3
3
3
values
A
A
A
B
B
B
B
C
classes changes
3
3
3
3
3
3
3
3
values
B
A B
A
B
A B
C
classes changes
Fig. 2. Subsequence of the same value (a) two changes (b) seven changes
Fast Feature Ranking Algorithm
329
Table 3. Data sets, number of selected features, the percentage of the original features retained and time in milliseconds DATA Data Set Inst. Atts N◦ Cl. autos 205 25 7 breast-c 286 9 2 breast-w 699 9 2 diabetes 768 8 2 glass2 163 9 2 heart-c 303 13 5 heart-st 270 13 2 hepatit. 155 19 2 horse-c. 368 27 2 hypothy. 3772 29 4 iris 150 4 3 labor 57 16 2 lymph 148 18 4 sick 3772 29 2 sonar 208 60 2 vote 435 16 2 Average
SOAP Atts ( %)t-ms 2.9 (11.8) 15 1.5 (16.7) 4 5.2 (57.6) 6 2.8 (34.9) 6 3.2 (35.7) 2 6.3 (48.2) 6 5.4 (41.8) 4 2.6 (13.6) 4 2.3 ( 8.6) 16 1.7 ( 5.7) 180 2.0 (50.0) 3 4.3 (27.0) 1 1.8 ( 9.9) 3 1.0 ( 3.4) 120 3.0 ( 5.0) 21 1.6 (10.0) 9 (23.7)
Atts 5.3 4.1 9.0 3.1 4.0 6.4 6.3 8.7 2.0 1.0 1.9 3.3 8.9 1.0 17.8 1.0
CFS ( %)t-ms (21.3) 50 (45.9) 6 (99.7) 35 (38.9) 39 (43.9) 9 (49.1) 10 (48.2) 12 (45.6) 9 ( 7.4) 43 ( 3.4) 281 (48.3) 3 (20.8) 3 (49.2) 7 ( 3.4) 252 (29.7) 90 ( 6.3) 4 (35.1)
Atts 10.9 3.7 8.1 0.0 0.3 6.9 6.3 13.3 2.3 5.2 4.0 8.8 11.8 7.1 3.9 15.5
RLF ( %) t-ms ( 43.7) 403 ( 41.6) 174 ( 89.4) 1670 ( 0.0) 1779 ( 3.6) 96 ( 53.4) 368 ( 48.2) 365 ( 70.0) 135 ( 8.6) 941 ( 18.0)94991 (100.0) 44 ( 55.3) 21 ( 65.8) 109 ( 24.5)93539 ( 6.5) 920 ( 96.9) 651 ( 45.3)
standard data sets from the UCI repository [4]. The data sets and their characteristics are summarized in Table 3. The percentage of correct classification with C4.5, averaged over ten ten-fold cross-validation runs, were calculated for each algorithm-data set combination before and after feature selection by SOAP (RF 0.75), CFS and ReliefF (threshold 0.05). For each train-test split, the dimensionality was reduced by each feature selector before being passed to the learning algorithms. The same fold were used for each feature selector-learning scheme combination. To perform the experiment with CFS and ReliefF we used the Weka1 (Waikato Environment for Knowledge Analysis) implementation. Table 3 shows the average number of features selected and the percentage of the original features retained. SOAP is a specially selective algorithm compared with CFS and RLF. If SOAP and CFS are compared, only in one data set (labor) is the number of characteristics significantly greater than those selected by CFS. In six data sets there are no significant differences, and in nine, the number of features is significantly smaller than CFS. Compare to RLF, only in glass2 and diabetes, SOAP obtains more parameters in the reduction process (threshold 0.05 is not sufficient). It can be seen that SOAP retained 23,7% of the attributes on average. Table 4 shows the results for attribute selection with C4.5 and compares the size (number of nodes) of the trees produced by each attribute selection scheme 1
http://www.cs.waikato.ac.nz/˜ml
330
Roberto Ruiz et al.
Table 4. Result of attribute selection with C4.5. Accuracy and size of trees. ◦, • Statistically significant improvement or degradation (p=0.05) DATA Set Ac. autos 82.54 breast-c 74.37 breast-w 95.01 diabetes 74.64 glass2 78.71 heart-c 76.83 heart-stat 78.11 hepatitis 78.97 horse-c.OR. 66.30 hypothyroid 99.54 iris 94.27 labor 80.70 lymph 77.36 sick 98.66 sonar 74.28 vote 96.53 Average 82.93
Size 63.32 12.34 24.96 42.06 24.00 43.87 34.58 17.06 1.00 27.84 8.18 6.93 28.05 49.02 27.98 10.64 26.36
SOAP Ac. Size 73.37 • 45.84 70.24 • 6.61 94.64 21.28 74.14 7.78 78.96 14.88 77.06 34.02 80.67 ◦ 19.50 80.19 5.62 66.30 1.00 95.02 • 4.30 94.40 8.12 78.25 3.76 72.84 • 7.34 93.88 • 1.00 70.05 • 7.00 95.63 • 3.00 80.98 11.94
◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
Ac. 74.54 72.90 95.02 74.36 79.82 77.16 80.63 81.68 66.30 96.64 94.13 80.35 75.95 96.32 74.38 95.63 82.24
CFS Size • 55.66 18.94 24.68 14.68 14.06 29.35 ◦ 23.84 ◦ 8.68 1.00 • 5.90 7.98 6.44 20.32 • 5.00 28.18 • 3.00 16.73
◦ • ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
Ac. 74.15 70.42 95.02 65.10 53.50 79.60 82.33 80.45 66.28 93.52 94.40 80.00 74.66 93.88 70.19 96.53 79.38
RLF Size • 85.74 • 11.31 24.68 • 1.00 • 1.70 ◦ 28.72 ◦ 14.78 11.26 1.36 • 12.52 8.16 5.88 24.10 • 1.00 • 9.74 10.64 15.79
• ◦ ◦ ◦ ◦ ◦ • ◦ ◦ ◦ ◦ ◦
against the size of the trees produced by C4.5 with no attribute selection. Smaller trees are preferred as they are easier to interpret, but accuracy is generally degraded. The table shows how often each method performs significantly better (denoted by ◦) or worse (denoted by •) than when performing no feature selection (column 2 and 3). Throughout we speak of results being significantly different if the difference is statistically at the 5% level according to a paired two-sided t test. Each pair of points consisting of the estimates obtained in one of the ten, ten-fold cross-validation runs, for before and after feature selection. For SOAP, feature selection degrades performance on seven data sets, improves on one and it is equal on eight. The reason for why the algorithm is not as accurate is the number of attribute selected, less than three feature. Five of these seven data sets obtain a percentage less than 10% of the original features. The results are similar to ReliefF and a little worse than those provided by CFS. Analyzing the data sets in which SOAP lost to CFS, we can observe breast-c, lymph and sonar, where the number of feature selected by SOAP is 25% of CFS (breast-c 4,1 to 1,5 with SOAP, lymph 8,9-1,8 and sonar 17,8-3). Nevertheless the accuracy reduction is small: breast-c 72,9 (CFS) to 70,24 with SOAP, lymph 75,95-72,84 and sonar 74,38-70,05. It is interesting to compare the speed of the attribute selection techniques. We measured the time taken in milliseconds to select the final subset of attributes. SOAP is an algorithm with a very short computation time. The results shown
Fast Feature Ranking Algorithm
331
in Table 3 confirm the expectations. SOAP takes 400 milliseconds2 in reducing 16 data sets whereas CFS takes 853 milliseconds and RLF more than 3 minutes. In general, SOAP is faster than the other methods and it is independent of the classes number. Also it is possible to be observed that ReliefF is affected very negatively by the number of instances in the data set, it can be seen in ”hypothyroid” and ”sick”. Even though these two data sets were eliminated, SOAP is more than 3 times faster than CFS, and more than 75 times than ReliefF.
4
Conclusions
In this paper we present a deterministic attribute selection algorithm. It is a very efficient and simple method used in the preprocessing phase. A considerable reduction of the number of attributes is produced in comparison to other techniques. It does not need distance nor statistical calculations, which could be very costly in time (correlation, gain of information, etc.). The computational cost is lower than other methods O(m n log n).
References [1] Aguilar-Ruiz, Jes´ us S., Riquelme, Jos´e C. and Toro, Miguel. Data Set Editing by Ordered Projection. Intelligent Data Analysis Journal. Vol. 5, n◦ 5, pp. 1-13, IOS Press (2001). 325 [2] Almuallim, H. and Dietterich, T. G. Learning boolean concepts in the presence of many irrelevant features. Artificial Intelligence, 69(1-2):279-305 (1994). [3] Blake, C. and Merz, E. K. UCI Repository of machine learning databases (1998). [4] Hall M. A. Correlation-based feature selection for machine learning. PhD thesis, Department of Computer Science, University of Waikato, Hamilton, New Zealand (1998). [5] Hoare, C. A. R. QuickSort. Computer Journal, 5(1):10-15 (1962). 327 [6] Kira, K. and Rendell, L. A practical approach to feature selection. In Proceedings of the Ninth International Conference on Machine Learning. pp. 249-256, Morgan Kaufmann (1992). [7] Kohavi, R. and John, G. H. Wrappers for feature subset selection. Artificial Intelligence, 97, 273-324 (1997). [8] Kononenko, I. Estimating attibutes: Analisys and extensions of relief. In Proceedings of the Seventh European Conference on Machine Learning. pp. 171-182, Springer-Verlag (1994). [9] Quinlan, J. C4.5: Programs for machine learning. Morgan Kaufmann (1993). ˇ [10] Robnik-Sikonja, M. And Kononenko, I. An adaption of relief for attribute estimation in regression. In Proceedings of the Fourteenth International Conference on Machine Learning. pp. 296-304, Morgan Kaufmann (1997). [11] Setiono, R., and Liu, H. A probabilistic approach to feature selection-a filter solution. In Proceedings of International Conference on Machine Learning, 319327 (1996). 2
This is a rough measure. Obtaining true cpu time from within a Java program is quite difficult.
Visual Clustering with Artificial Ants Colonies Nicolas Labroche, Nicolas Monmarch´e, and Gilles Venturini Laboratoire d’Informatique de l’Universit´e de Tours, ´ Ecole Polytechnique de l’Universit´e de Tours-D´epartement Informatique, 64, avenue Jean Portalis 37200 Tours, France {labroche,monmarche,venturini}@univ-tours.fr http://www.antsearch.univ-tours.fr/
Abstract. In this paper, we propose a new model of the chemical recognition system of ants to solve the unsupervised clustering problem. The colonial closure mechanism allows ants to discriminate between nestmates and intruders by the mean of a colonial odour that is shared by every nestmate. In our model we associate each object of the data set to the odour of an artificial ant. Each odour is defined as a real vector with two components, that can be represented in a 2D-space of odours. Our method simulates meetings between ants according to pre-established behavioural rules, to ensure the convergence of similar odours (i.e. similar objects) in the same portion of the 2D-space. This provides the expected partition of the objects. We test our method against other well-known clustering method and show that it can perform well. Furthermore, our approach can handle every type of data (from numerical to symbolic attibutes, since there exists an adapted similarity measure) and allows one to visualize the dynamic creation of the nests. We plan to use this algorithm as a basis for a more sophisticated interactive clustering tool.
1
Introduction
Computer scientists have successfully used bio mimetic approaches to imagine performing heuristics. For example, Dorigo et al. modelled the pheromone trails of real ants to solve optimisation problems in the well-known Ant Colony Optimization heuristic (ACO [1, 2]). Genetic algorithms have been applied to optimisation problems and clustering problems [3, 4]. The collective behaviours of ants have also inspired researchers to solve the graph partitioning problem with co-evolving species [5] and the unsupervised clustering problem [6, 7]. In these studies, researchers model real ants abilities to sort their brood. Artificial ants move on a 2D discrete grid on which objects are randomly placed. Ants may pick up or carry one or more objects and may drop them according to given probabilities. After a given time, these artificial ants build groups of similar objects that provide the final partition. In [8], the authors present AntClust, an ant-based clustering algorithm that relies on a chemical recognition model of ants. In this work, each artificial ant is associated to an object by the mean of its cuticular odour. This odour is representative of the belonging nest of each ant and coded with a single value. This value may change according to behavioural V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 332–339, 2003. c Springer-Verlag Berlin Heidelberg 2003
Visual Clustering with Artificial Ants Colonies
333
rules and meetings with other ants, until each ant finds the nest that best fits the object it represents. In this paper, we propose an alternative model of this chemical recognition system. Our goal is to see if the previous discrete odour model implies limitations to the method. Furthermore, we want to be able to visualize dynamically the building of the nests. In this case, an expert could interact with the system, while the results are displayed. Our paper is organised as follows: the section 2 briefly describes the main principles of the chemical recognition system of ants and introduces the new model of an artificial ant (the parameters and the behavioural rules for the meetings). The section 3 presents the Visual AntClust algorithm and discuss its parameters settings. Then, the section 4 describes the clustering method that we use to evaluate Visual AntClust and the data sets used as benchmarks. Finally, the section 5 concludes and introduces futures developments of our work.
2 2.1
The Model The Chemical Recognition System of Real Ants
As many other social insects (such as termites, wasps and bees), ants have developed a mechanism of colonial closure that allows them to discriminate between nestmates and intruders. Each ant generates its own cuticular odour, called the label. This label is partly defined by the genome of the ant and partly by chemical substances extracted from its environment (mainly nest materials and food). According to the ”Gestalt odour theory” [9, 10], the continuous chemical exchanges between nestmates lead to the establishment of a colonial odour that is recognised by all the members of the colony. 2.2
Parameters
The main idea of Visual AntClust is to associate one object of the data set to the genome of one artificial ant. The genome allows ants to generate their own odour and is used to learn or update their recognition template. This template is used in every meeting to decide if both ants should accept each others or not. For one ant a, we define the following parameters. The label labela is a vector in the 2D-space of odours. Each of its components reflects the chemical concentration of one compound in the label of the artificial ant and is defined in [0, 1]. As no assumption can be done concerning the final partition, the labels vectors are randomly initialised for every ant. The genome genomea corresponds to an object of the data set and can not evolve during the algorithm. The recognition template ta , is used by each ant during meetings to appreciate the similarities between genomes (i.e. between 2 objects). ta is learned during a fixed number of iterations (set experimentally to 50). During this period, the ant a encounters other randomly chosen ants and estimates the similarity
334
Nicolas Labroche et al.
between their genomes. At the end, the ant a uses the mean and maximal similarities values Sim(a, ·) and max (Sim(a, ·)) observed during this period to set its template threshold ta as follows: ta =
max (Sim(a, ·)) + Sim(a, ·) 2
(1)
The satisfaction estimator sa . At the beginning of the algorithm, it is set to 0 and is increased each time the artificial ant a meets and accepts another ant in a local portion of the 2D-space of odour. sa estimates for each ant if it is well placed in the 2D-odour space. When it increases, the ant limits the area of the 2D-space in which it accepts other ants to help the algorithm to converge. 2.3
Meeting Algorithm
We detail hereafter the meeting process between two ants i and j. This mechanism allows ants to exchange and mutually reinforce their labels if they are close enough and if their genomes are similar enough. Meeting(Ant i, Ant j) (1) (2) (3) (4) (5) (6) (7) (8)
D ← Compute the Euclidian distance between Labeli and Labelj if D ≤ (1 − max(si , sj )) then if there is acceptance between ants i and j then Increase si and sj because ants i and j are well-placed Compute Ri and Rj , the odour exchange rates between ant i and ant j Update Labeli(j) according to Ri(j) and si(j) endif endif
The algorithm determines if both ants labels are similar enough to allow the meeting. It is known that real ants prefer to reinforce their labels with nestmates that have similar odours. We consider also that an ant should have a need to exchange chemical substances to meet another ant. We use si and sj , the satisfaction estimators, to evaluate this need (line 2). An artificial ant i, that is well placed in the odour-space (si is close to 1), already massively accepts its neighbours. That means that the object corresponding to ant i is already assigned to a good cluster. In the opposite case, if si is close to 0, either ant i is alone in a portion of the odour-space, or it is in a region where the neighbours are not similar enough (i.e. the object is misclassified). In this case, the ant needs to encounter other ants to increase its chances to belong to a good nest. If an ant corresponding to the previous criteria is found, there must be acceptance between ants (line 3). In this case, the similarity between genomes of two ants i and j must be greater or equal than both ants’ template thresholds ti and tj as shown in the following equation 2. (2) Acceptance(Anti , Antj ) ← Sim(Anti , Antj ) ≥ max (tAnti , tAntj )
Visual Clustering with Artificial Ants Colonies
335
Finally, if both ants are equally satisfied, they mutually reinforce their labels according to an impregnation rate R computed as follows for one ant i that accepts an other ant j. (3) Ri ← Sim(i, j) − ti This ensures the convergence of the algorithm by preventing ants that are wellplaced to update their label. Thus, only the less satisfied ants can change their label consequently to a meeting, which improves the efficiency of the algorithm.
3
Visual AntClust Visual AntClust(data set with N objects, Niter ) (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11)
Initialise N ants according to parameters described in the section 2.2 while (Niter iterations are not reached) do Draw N ants in the 2D odor space for c ← 1 to N do Meeting(i,j) where i and j are randomly chosen ants endfor endwhile Compute Dmax , the maximal distance in odor space that allows an ant to find a nestmate Group in the same nest all the ants within a local perimeter of value Dmax Delete nests n that are too small (size(n) < 0.5 × size∀ nests ) Re-assign the ants that have no more nest to the nest of the most similar ant
Our algorithm depends on two main parameters that are namely, the number of global iterations Niter and the maximal distance under which a nestmate may be found Dmax . The first parameter Niter has been set experimentally to 2.500 iterations. In fact, some data sets need more or less iterations, but we do not develop, for the time being, a heuristic to determine automatically the number of iterations. To the last extent, our goal is to integrate this algorithm in an interactive framework, in which an expert will have the possibility to start and stop the visual clustering process and manipulate the groups. The second parameter Dmax is computed for each ant a according to an estimation of the mean D(i, ·) and minimal min(D(i, ·)) distances (here an Euclidean distance measure) between its label and the labels of the other ants as follows: (4) Dmax ← (1 − β)D(i, ·) + β min(D(i, ·)) β has been set experimentally to 0.95 to get the best results.
4 4.1
Experiments and Results Experiments
We apply a traditional k-Means approach because of its time complexity Θ(N ), N being the number of objects to cluster. The algorithm inputs an initial
336
Nicolas Labroche et al.
Table 1. Main characteristics of the data sets Art1 Art2 Art3 Art4 Art5 Art6 Iris Glass P ima Soybean T hyroid N 400 1000 1100 200 900 400 150 214 798 47 215 M 2 2 2 2 2 8 4 9 8 35 5 K 4 2 4 2 9 4 3 7 2 4 3
partition of k clusters. This partition is then refined at each iteration by the assignment of an object to its nearer cluster center. The cluster centers are updated after each object assignment until the algorithm converges. The method stops when the intra class inertia becomes stable. In our experiment, we randomly generate the initial partition with the expected number of clusters to get the best results with this approach. We also compare our new approach to AntClust, an other ant-based clustering algorithm that also relies on the chemical recognition system of ants and that has been described in [8]. In this method the label of each ant is coded by a single value that reflects the nest the ant belongs to. The algorithm is able to find good partitions without any assumption concerning the number of expected clusters and can treat every type of data. We use several numerical data sets to evaluate the contribution of our method. There are real data sets such as Iris, P ima, Soybean, T hyroid and Glass data sets and also artificial data sets called Art1,2,3,4,5,6 . They are generated according to distinct laws (Gaussian and uniform) and with distinct difficulties. The number of objects N , the number of attributes M and the number of clusters K of each data sets are described in the table 1. We use a clustering error measure that evaluates the differences between the final partition output by each method and the expected one, by comparing each pair of objects and by verifying each time if they are clustered similarly or not in both partitions. 4.2
Results
The table 2 shows that, even with its simple heuristic to create the partition from the 2D-space of odours, Visual AntClust is able to perform well. For some data sets like Art1 , Art4 and Art6 , Iris and Glass, it slightly outperforms AntClust though it shows generally a greater variability in its performances. Furthermore, Visual AntClust is even better than the k-Means approach in two cases for Art6 and Soybean, where the clustering error is equal to 0. We notice also that Visual AntClust does not manage to estimate the number of expected clusters for P ima and T hyroid. This may mean that the parameters of the algorithm are not well suited for these data sets or that the heuristic that finds the nests is not accurate enough in these cases to find a good partition. It is very important to notice that Visual AntClust is slower than AntClust and of course k-Means because at each iteration, it has to draw the on-going partition of the data set. The figure 1
Visual Clustering with Artificial Ants Colonies
Beginning of the algorithm: 0 iteration
337
Convergence: about 100 iterations
Clusters apparition: about 500 iterations Clusters definition: about 1000 iterations
Fig. 1. Four steps of Visual AntClust in the 2D-space of odours to build the final partition for Art1
represents four successive steps of Visual AntClust as it dynamically builds the final partition.
5
Conclusion and Perspectives
In this paper, we introduce a new ant based clustering algorithm named Visual AntClust. It associates one object of the data set to the genome of one artificial ant and simulates meetings between them to dynamically build the expected partition. The method is compared to the k-Means approach and AntClust, an other ant-based clustering method over artificial and real data sets. We show that Visual AntClust performs well generally and sometimes very well. Furthermore, it can treat every type of data unlike k-Means that is limited to numerical data and allow one to visualize the partition as it is constructed. In the near future, we will try to develop a method to automate the parameters’ setting. We will also use Visual AntClust as a base for an interactive clustering and data visualization tool.
338
Nicolas Labroche et al.
Table 2. Mean number of clusters (# clusters) and mean clustering error (ClusteringError) and their standard deviation ([std]) for each data set and each method computed over 50 runs K-Means Data sets mean [std] Art1 3.98 [0.14] 2.00 [0.00] Art2 3.84 [0.37] Art3 2.00 [0.00] Art4 8.10 [0.75] Art5 4.00 [0.00] Art6 Iris 2.96 [0.20] Glass 6.88 [0.32] P ima 2.00 [0.00] Soybean 3.96 [0.20] T hyroid 3.00 [0.00]
# clusters Clustering Error AntClust VAntClust K-Means AntClust VAntClust mean [std] mean [std] mean [std] mean [std] mean [std] 4.70 [0.95] 4.66 [1.04] 0.11 [0.00] 0.22 [0.03] 0.17 [0.07] 2.30 [0.51] 3.76 [2.36] 0.04 [0.00] 0.07 [0.02] 0.14 [0.06] 2.72 [0.88] 3.54 [2.11] 0.22 [0.02] 0.15 [0.02] 0.23 [0.07] 4.18 [0.83] 2.16 [0.37] 0.00 [0.00] 0.23 [0.05] 0.03 [0.04] 6.74 [1.66] 4.5 [1.69] 0.09 [0.02] 0.26 [0.02] 0.29 [0.12] 4.06 [0.24] 3.98 [0.14] 0.01 [0.04] 0.05 [0.01] 0.00 [0.02] 2.82 [0.75] 2.28 [0.49] 0.14 [0.03] 0.22 [0.01] 0.19 [0.06] 5.90 [1.23] 6.2 [1.16] 0.32 [0.01] 0.36 [0.02] 0.35 [0.03] 10.66 [2.33] 19.38 [7.68] 0.44 [0.00] 0.46 [0.01] 0.49 [0.01] 4.16 [0.55] 4.00 [0.00] 0.09 [0.08] 0.07 [0.04] 0.00 [0.00] 4.62 [0.90] 11.66 [3.44] 0.18 [0.00] 0.16 [0.03] 0.36 [0.07]
References [1] A. Colorni, M. Dorigo, and V. Maniezzo, “Distributed optimization by ant colonies,” in Proceedings of the First European Conference on Artificial Life (F. Varela and P. Bourgine, eds.), pp. 134–142, MIT Press, Cambridge, 1991. 332 [2] E. Bonabeau, M. Dorigo, and G. Theraulaz, From natural to artificial swarm intelligence. New York: Oxford University Press, 1999. 332 [3] Y. Chiou and L. W. Lan, “Genetic clustering algorithms,” European journal of Operational Research, no. 135, pp. 413–427, 2001. 332 [4] L. Y. Tseng and S. B. Yang, “A genetic approach to the automatic clustering problem,” Pattern Recognition, no. 34, pp. 415–424, 2001. 332 [5] P. Kuntz and D. Snyers, “Emergent colonization and graph partitioning,” in Cliff et al. [11], pp. 494–500. 332 [6] E. Lumer and B. Faieta, “Diversity and adaptation in populations of clustering ants,” in Cliff et al. [11], pp. 501–508. 332 [7] N. Monmarch´e, M. Slimane, and G. Venturini, “On improving clustering in numerical databases with artificial ants,” in Lecture Notes in Artificial Intelligence (D. Floreano, J. Nicoud, and F. Mondala, eds.), (Swiss Federal Institute of Technology, Lausanne, Switzerland), pp. 626–635, 1999. 332 [8] N. Labroche, N. Monmarch´e, and G. Venturini, “A new clustering algorithm based on the chemical recognition system of ants,” in Proc. of 15th European Conference on Artificial Intelligence (ECAI 2002), Lyon FRANCE, pp. 345–349, 2002. 332, 336 [9] N. Carlin and B. H¨ olldobler, “The kin recognition system of carpenter ants(camponotus spp.). i. hierarchical cues in small colonies,” Behav Ecol Sociobiol, vol. 19, pp. 123–134, 1986. 333 [10] B. H¨ olldobler and E. Wilson, The Ants. Springer Verlag, Berlin, Germany, 1990. 333
Visual Clustering with Artificial Ants Colonies
339
[11] D. Cliff, P. Husbands, J. Meyer, and S. W., eds., Third International Conference on Simulation of Adaptive Behavior: From Animals to Animats 3, MIT Press, Cambridge, Massachusetts, 1994. 338
Maximizing Benefit of Classifications Using Feature Intervals ˙ Nazlı Ikizler and H. Altay G¨ uvenir Bilkent University Department of Computer Engineering 06533, Ankara Turkey {inazli,guvenir}@cs.bilkent.edu.tr
Abstract. There is a great need for classification methods that can properly handle asymmetric cost and benefit constraints of classifications. In this study, we aim to emphasize the importance of classification benefits by means of a new classification algorithm, Benefit-Maximizing classifier with Feature Intervals (BMFI) that uses feature projection based knowledge representation. Empirical results show that BMFI has promising performance compared to recent cost-sensitive algorithms in terms of the benefit gained.
1
Introduction
Classical machine learning applications try to reduce the quantity of the errors and usually ignore the quality of errors. However, in real-world applications, the nature of the error is very crucial. Further, the benefit of correct classifications may not be the same for all cases. Cost-sensitive classification research addresses this imperfection and evaluates the effects of predictions rather than simply measuring the predictive accuracy. By incorporating cost(and benefit) knowledge to the process of classification, the effectiveness of the algorithms in real-world situations can be evaluated more rationally. In this study, we concentrate on costs of misclassifications and try to minimize those costs, by maximizing the total benefit gained during the process of classification. Within this framework, we propose a new cost-sensitive classification technique, called Benefit-Maximizing classifier with Feature Intervals (BMFI for short),that uses the predictive power of feature projection method previously proposed in [6]. In BMFI, voting procedure has been changed to impose the cost-sensitivity property. Generalization techniques are implemented to avoid overfitting and to eliminate redundancy. BMFI has been tested over several benchmark datasets and a number of real-world datasets that we have compiled. The rest of the paper is organized as follows: In Section 2, benefit maximization problem is addressed. Section 3 gives the algorithmic descriptions of BMFI algorithm along with the details of feature intervals concept, voting method and generalizations. Experimental evaluation of BMFI is presented in Section 4. Finally, Section 5 reviews the results and presents future research directions on the subject. V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 339–345, 2003. c Springer-Verlag Berlin Heidelberg 2003
2   Benefit Maximization Problem
Recent research in machine learning has used the terminology of costs when dealing with misclassifications. However, those studies mostly lack the information that correct classifications may have different interpretations. Besides implying no cost, accurate labeling of instances may entail indisputable gains. Elkan points out the importance of these gains [3]. He states that doing accounting in terms of benefits is commonly preferable because there is a natural baseline from which all benefits can be measured, and thus it is much easier to avoid mistakes. The benefit concept is more appropriate to real-world situations, since the net flow of gain is more accurately denoted by the benefits attained. If a prediction is profitable from the decision agent's point of view, its benefit is said to be positive. Otherwise, it is negative, which is the same as the cost of a wrong decision. To incorporate this natural knowledge of benefits into cost-sensitive learning, we have used benefit matrices. B = [b_ij] is an n × m benefit matrix of domain D if n equals the number of prediction labels, m equals the number of possible class labels in D, and the b_ij's are such that

b_ij ≥ 0 if i = j,   b_ij < b_ii otherwise.    (1)

Here, b_ij represents the benefit of classifying an instance of true class j as class i. The structure of the benefit matrix is similar to that of the cost matrix, with the extension that entries can have either positive or negative values. In addition, diagonal elements should be non-negative values, ensuring that correct classifications can never have negative benefits. Given a benefit matrix B, the optimal prediction for an instance x is the class i that maximizes the expected benefit (EB), that is

EB(x, i) = \sum_{j} P(j|x) × b_ij    (2)

where P(j|x) is the probability that x has true class j. The total expected benefit of the classifier model M over the whole test data is

EB_M = \sum_{x} \arg\max_{i ∈ C} EB(x, i) = \sum_{x} \arg\max_{i ∈ C} \sum_{j} P(j|x) × b_ij    (3)

where C is the set of possible class labels in the domain.
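The decision rule in (2) amounts to a matrix-vector product followed by an argmax over candidate labels. The following minimal NumPy sketch illustrates it; the benefit matrix B and the posterior values are illustrative assumptions, not taken from any of the paper's datasets.

```python
import numpy as np

# Hypothetical 2-class benefit matrix: rows = predicted class, columns = true class.
# Diagonal entries are non-negative and each off-diagonal entry is smaller than the
# diagonal entry of its row, as required by (1).
B = np.array([[1.0, -5.0],
              [-1.0, 10.0]])

def expected_benefit(posterior, B):
    """EB(x, i) = sum_j P(j|x) * b_ij for every candidate prediction i (Eq. 2)."""
    return B @ posterior

def predict(posterior, B):
    """Pick the class that maximizes expected benefit rather than posterior probability."""
    return int(np.argmax(expected_benefit(posterior, B)))

# With these posteriors an accuracy-driven rule would choose class 0,
# but the benefit-maximizing prediction is class 1.
posterior = np.array([0.7, 0.3])
print(expected_benefit(posterior, B))   # [-0.8  2.3]
print(predict(posterior, B))            # 1
```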
3
Benefit Maximization with Feature Intervals
As shown in [6], feature projection based classification is a fast and accurate method, and the rules it learns are easy for humans to verify. For this reason, we have chosen to extend its predictive power to involve benefit knowledge. In a particular classification problem, given a training dataset which consists of p features, an instance x can be thought of as a point in a p-dimensional space with an associated class label x_c. It is represented as a vector of nominal or linear feature values together with its associated class, i.e., <x_1, x_2, ..., x_p, x_c>. Here, x_f represents the value of the f-th feature of the instance x. If we consider each feature separately, and take x's projection onto each feature dimension, then we can represent x by the combination of its feature projections.

    train(TrainingSet, BenefitMatrix)
    begin
      for each feature f
        sort(f, TrainingSet)
        i_list ← make_point_intervals(f, TrainingSet)
        for each interval i in i_list
          vote_i(c) ← voting_method(i, f, BenefitMatrix)
        if f is linear
          i_list ← generalize(i_list, BenefitMatrix)
    end.

Fig. 1. Training phase of BMFI

The training process of the BMFI algorithm is given in Fig. 1. In the beginning, for each feature f, all training instances are sorted with respect to their value for f. This sort operation is identical to forming projections of the training instances for each feature f. A point interval is constructed for each projection. Initially, the lower and upper bounds of the interval are equal to the f value of the corresponding training instance. If the f value of a training instance is unknown, it is simply ignored. If there are several point intervals with the same f value, they are combined into a single point interval by adding the class counts. At the end of point interval construction, the vote for each class label is determined by using one of two voting methods. The first one is the voting method of the CFI algorithm [5], called VM1 in our context. VM1 can be formulated as follows:

VM1(c, I) = \frac{N_c}{classCount(c)}    (4)

where N_c is the number of instances that belong to class c in interval I and classCount(c) is the total number of instances of class c in the entire training set. This voting method favors the prediction of the minority class in proportion to its occurrence in the interval. The second voting method, called VM2, is basically founded on the optimal prediction approximation given by (2) and makes direct use of the benefit matrix. VM2 casts votes for class c in interval I as

VM2(c, I) = \sum_{k ∈ C} b_ck × P(k|I)    (5)

where P(k|I) is the estimated probability that an instance falling into interval I will have the true class k, and is calculated as

P(k|I) = \frac{N_k}{classCount(k)}    (6)
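A small sketch of the two voting rules (4)-(6) is given below. The interval counts, class totals and benefit matrix are invented for illustration; the row/column order of B is assumed to follow the order of the classes list.

```python
import numpy as np

def vm1(interval_counts, class_counts):
    """VM1 (Eq. 4): vote of class c is N_c / classCount(c), favouring minority classes."""
    return {c: interval_counts.get(c, 0) / class_counts[c] for c in class_counts}

def vm2(interval_counts, class_counts, B, classes):
    """VM2 (Eq. 5): vote of class c is sum_k b_ck * P(k|I), with P(k|I) as in Eq. (6)."""
    p = np.array([interval_counts.get(k, 0) / class_counts[k] for k in classes])
    return dict(zip(classes, B @ p))

# Hypothetical two-class interval: 3 instances of class 'pos', 10 of 'neg'.
classes = ['pos', 'neg']
interval_counts = {'pos': 3, 'neg': 10}
class_counts = {'pos': 30, 'neg': 200}        # totals over the whole training set
B = np.array([[10.0, -1.0],                    # benefits of predicting 'pos'
              [-2.0,  1.0]])                   # benefits of predicting 'neg'

print(vm1(interval_counts, class_counts))      # {'pos': 0.1, 'neg': 0.05}
print(vm2(interval_counts, class_counts, B, classes))
```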
    generalize(interval_list)
    begin
      I ← first interval in interval_list
      while I is not empty do
        I' ← interval after I
        I'' ← interval after I'
        if merge_condition(I, I', I'') is true
          merge I' (and/or) I'' into I
        else
          I ← I'
    end.

Fig. 2. Generalization of intervals step in BMFI
After the initial assignment of votes, for linear features, intervals are generalized to form range intervals in order to eliminate redundancy and avoid overfitting. The generalization process is illustrated in Fig. 2. Here, merge_condition() is a comparison function that evaluates relative properties of each interval and returns true if a sufficient level of similarity between neighboring intervals is reached. Besides adding more prediction power to the algorithm, proper generalization reduces the number of intervals and in this way decreases the classification time. In this work, we have experimented with three interval generalization methods. The first one, called SF (same frequent), joins two consecutive intervals if the most frequently occurring class of both is the same. The second method, SB (same beneficial), joins two consecutive intervals if they have the same beneficial class. A class c is the beneficial class of an interval i iff for all j ∈ C with j ≠ c,

\sum_{x ∈ i} B(x, c) ≥ \sum_{x ∈ i} B(x, j).

If the beneficial classes of two consecutive intervals are the same, then it can be more profitable to unite them into a single interval. The third method, HC (high confidence), combines three consecutive intervals into a single one when the middle interval has less confidence in its votes than the other two. The confidence of an interval is measured as the difference between the votes of the most beneficial class and the second beneficial class.

Table 1. List of evaluated cost-sensitive algorithms

    Name     Description
    MetaNB   MetaCost on Naive Bayes
    MetaJ48  MetaCost on J4.8
    C1NB     CostSensitiveClassifier with reweighting on Naive Bayes
    C2NB     CostSensitiveClassifier with direct minimization on Naive Bayes
    C1J48    CostSensitiveClassifier with reweighting on J4.8
    C2J48    CostSensitiveClassifier with direct minimization on J4.8
    C1VFI    CostSensitiveClassifier with reweighting on VFI
    C2VFI    CostSensitiveClassifier with direct minimization on VFI
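A minimal sketch of the SF and SB merge conditions described before Table 1; the interval and benefit-matrix data structures (dictionaries keyed by class label) are our own illustrative choices, not the authors' implementation.

```python
def most_frequent_class(interval):
    """Class with the largest count in the interval (used by the SF condition)."""
    return max(interval['counts'], key=interval['counts'].get)

def beneficial_class(interval, B, classes):
    """Class c maximizing sum_{x in i} B(x, c); with a benefit matrix this is
    sum over true classes k of counts[k] * B[c][k] (used by the SB condition)."""
    totals = {c: sum(interval['counts'].get(k, 0) * B[c][k] for k in classes)
              for c in classes}
    return max(totals, key=totals.get)

def sf_merge(i1, i2):
    return most_frequent_class(i1) == most_frequent_class(i2)

def sb_merge(i1, i2, B, classes):
    return beneficial_class(i1, B, classes) == beneficial_class(i2, B, classes)
```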
    classify(q)
    begin
      for each class c
        v_c ← 0
      for each feature f
        if q_f is known
          I ← search_interval(f, q_f)
          for each class c
            v_c ← v_c + interval_vote(c, I)
      prediction ← argmax_c(v_c)
    end.

Fig. 3. Classification step in BMFI
The classification process of the BMFI algorithm is given in Fig. 3. The choice of voting method to be used depends on the characteristics of the domain. Based on our empirical results, we propose to use VM1 voting together with SF, SB and HC techniques when the correct classification of the minority class is more beneficial than the other classes. On the contrary, when the benefit matrix is not correlated with the distribution, VM2 can be employed together with SB and HC to boost up the benefit performance. Experimental results presented in Sect. 4 are achieved by using this general rule-of-thumb.
4
Experimental Results
For evaluation purposes, we have used benchmark datasets from UCI ML Repository [1]. These data sets do not have predefined benefit matrices, so we formed their benefit matrices in the following manner. In binary datasets, one class is assumed to be more important to predict correctly than the other by a constant benefit ratio, b. We have tested our algorithm by using five different b values that are 2, 5, 10, 20, 50. Note that when b is equal to 1, the problem reduces to the classical classification problem. Further, we have compiled four new datasets. Their benefit matrices have been defined by experts of each domain. For more information about the datasets and benefit matrices the reader is referred to [7]. We have compared BMFI with MetaCost [2] and CostSensitiveClassifier of Weka [4] on well-known base classifiers which are Naive Bayesian Classifier, C4.5 decision tree learner and VFI [6]. Table 1 lists these algorithms with their base classifiers (J4.8 is Weka’s implementation of C4.5 in Java). MetaCost is a wrapper algorithm that takes a base classifier and makes it sensitive to costs of classification [2]. It operates with a bagging logic beneath and learns multiple classifiers on multiple bootstrap replicates of the training set. MetaCost has become a benchmark for comparing cost-sensitive algorithms. In addition to MetaCost, we have compared our algorithm with two cost sensitive classifiers provided in Weka. The first method uses reweighting of training
Table 2. Comparative evaluation of BMFI with wrapper cost-sensitive algorithms. The entries are benefit per instance values. Best results are shown in bold

    domain           MetaNB  MetaJ48  C1NB  C2NB  C1J48  C2J48  C1VFI  C2VFI  BMFI
    breast-cancer    4.0     4.0      3.8   4.0   3.9    3.7    3.7    2.8    3.9 (VM1)
    pima-diabetes    2.8     3.0      2.8   2.7   2.9    2.5    -1.5   2.8    2.7 (VM1)
    ionosphere       5.7     6.1      6.5   6.0   6.5    5.7    6.4    6.1    6.5 (VM2)
    liver disorders  5.3     5.2      5.3   5.4   5.4    4.4    4.3    5.3    5.4 (VM2)
    sonar            3.3     4.5      4.6   4.0   4.6    3.3    0.0    4.0    4.9 (VM2)
    bank-loans       -0.8    -0.9     -0.4  -0.6  0.1    -0.5   -1.2   -2.8   -0.1 (VM1)
    bankruptcy       7.8     7.7      7.5   7.4   7.5    7.3    7.7    7.8    7.9 (VM1)
    dermatology      7.5     7.5      7.2   7.5   7.2    7.3    6.9    5.6    7.4 (VM2)
    lesion           8.7     8.9      7.8   9.0   7.8    7.7    6.4    4.0    9.0 (VM1)
instances and the second method makes direct cost-minimization based on probability distributions [8]. We call these two classifiers C1 and C2, respectively. Experimental results are presented in Table 2. In this table, results of binary datasets are benefit per instance values for b=10. All results are recorded by using 10-fold cross validation. As the results demonstrate, BMFI algorithm is very successful in most of the domains and remarkably comparable to other algorithms in all of the domains. In ionosphere, liver, sonar, bankruptcy and lesion domains, BMFI attains the maximum benefit per instance value. In the remaining datasets its performance is very high and comparable to other algorithms. We have observed that benefit achieved is highly dependent on the nature of the domain, i.e., benefit matrix information, distribution of classes, etc, as expected. In addition, it is worthwhile to note that BMFI outperforms cost-sensitive versions of its base classifier VFI (C1VFI and C2VFI). This observation suggests that using benefit knowledge inside the algorithm itself is more effective than wrapping a meta-stage around to transform it into a cost-sensitive classifier. In binary datasets, we observed that the success of BMFI increases as the benefit ratio increases. This is an important highlight of BMFI and is mostly due to its high sensitivity to benefit of classifications. This aspect of BMFI has been illustrated with the results of pima-diabetes dataset given in Table 3.
Table 3. Benefit per instance values of pima-diabetes dataset with different benefit ratios. Best results are shown in bold

    b   MetaNB  MetaJ48  C1NB  C2NB  C1J48  C2J48  C1VFI  C2VFI  BMFI
    2   0.5     0.6      0.7   0.6   0.6    0.6    0.0    0.0    0.5
    5   1.2     1.2      1.5   1.3   1.2    1.2    -0.5   1.1    1.2
    10  2.8     2.8      3.0   2.7   2.9    2.5    -1.5   2.8    2.7
    20  5.8     5.8      6.2   6.1   6.1    5.6    -3.3   6.3    6.3
    50  16.6    16.2     16.2  16.6  16.3   14.7   -9.0   16.7   16.8
5   Conclusions and Future Work
In this study, we have focused on the problem of making predictions when the outcomes have different benefits associated with them. We have implemented a new algorithm, namely BMFI that uses the predictive power of feature intervals concept in maximizing the total benefit of classifications. We make direct use of benefit matrix information provided to the algorithm in tuning the prediction so that the resultant benefit gain is maximized. BMFI has been compared to MetaCost and two other cost-sensitive classification algorithms provided in Weka. These generic algorithms are wrapped over NBC, C4.5 and VFI. The results show that BMFI is very effective in maximizing the benefit per instance values. It is more successful in domains where the prediction of a certain class is particularly important. Empirical results we obtained also show that using benefit information directly in the algorithm itself is more effective than using a meta-stage around the base classifier. In benefit maximization problem, we have observed that individual characteristics of the datasets influence results significantly, due to the extreme correlation between cost-sensitivity and class distributions. As future work, feature-dependent domains can be explored in depth and feature-dependency aspect of BMFI can be improved. Benefit maximization can be extended to include the feature costs. Feature selection mechanisms that are sensitive to individual costs of features can be utilized.
References
[1] Blake C. L. and Merz C. J.: UCI repository of machine learning databases. University of California, Irvine, Department of Information and Computer Sciences (1998) [http://www.ics.uci.edu/~mlearn/MLRepository.html]
[2] Domingos P.: MetaCost: A general method for making classifiers cost-sensitive. In: Proceedings of the International Conference on Knowledge Discovery and Data Mining, San Diego, CA (1999) 155-164
[3] Elkan C.: The Foundations of Cost-Sensitive Learning. In: Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (2001)
[4] Frank E. et al.: Weka 3 - Data Mining with Open Source Machine Learning Software in Java. The University of Waikato (2000) [http://www.cs.waikato.ac.nz/~ml/weka]
[5] Güvenir H. A.: Detection of abnormal ECG recordings using feature intervals. In: Proceedings of the Tenth Turkish Symposium on Artificial Intelligence and Neural Networks (2001) 265-274
[6] Güvenir H. A., Demiröz G., and İlter N.: Learning differential diagnosis of erythemato-squamous diseases using voting feature intervals. Artificial Intelligence in Medicine, Vol. 13(3) (1998) 147-165
[7] İkizler N.: Benefit Maximizing Classification Using Feature Intervals. Technical Report BU-CE-0208, Bilkent University (2002)
[8] Ting K. M.: An instance weighting method to induce cost-sensitive trees. IEEE Transactions on Knowledge and Data Engineering, Vol. 14(3) (2002) 659-665
Parameter Estimation for Bayesian Classification of Multispectral Data

Refaat M Mohamed and Aly A Farag

Computer Vision and Image Processing Laboratory, University of Louisville, Louisville, KY 40292, USA
{refaat,farag}@cvip.uofl.edu
www.cvip.uofl.edu
Abstract. In this paper, we present two algorithms for estimating the parameters of a Bayes classifier for remote sensing multispectral data. The first algorithm uses the Support Vector Machines (SVM) as a multi-dimensional density estimator. This algorithm is a supervised one in the sense that it needs, in advance, the specification of the number of classes and a training sample for each class. The second algorithm employs the Expectation Maximization (EM) algorithm, in an unsupervised way, for estimating the number of classes and the parameters of each class in the data set. Performance comparison of the presented algorithms shows that the SVM-based classifier outperforms those based on Gaussian and Parzen window algorithms. We also show that the EM-based classifier provides results comparable to the Gaussian-based and Parzen window-based classifiers while being unsupervised. Keywords: Bayes classification, density estimation, support vector machines (SVM), expectation maximization (EM), multispectral data.
1
Introduction
The Bayes classifier constitutes the basic setup for a large category of classifiers. To implement the Bayes classifier, there are two main parameters to be addressed [1]. The first parameter is the estimation of the class conditional probability for each class, and the second one is the prior probability of each class. In this paper, we address both of these parameters. The Support Vector Machines (SVM) approach was developed to solve the classification problem, but has recently been extended to regression problems [2]. SVM have been shown to perform well for density estimation, where the probability distribution function of the feature vector x can be inferred from a random sample Ð. SVM represent Ð by a small number of support vectors and the associated kernels [3]. This paper employs the SVM as a density estimator for multi-dimensional feature spaces, and uses this estimate in a Bayes classifier. SVM as a density estimator work
in a supervised setup; the number of classes and a design data sample from each class need to be available. There are a number of unsupervised algorithms for parameter estimation for the Bayes classifier, e.g. the well-known k-means algorithm [4]. The k-means algorithm requires knowledge of the number of classes. In this paper, we propose a fully unsupervised approach for estimating the Bayes classifier parameters. The Expectation Maximization (EM) algorithm is a general method of finding the maximum-likelihood estimate (MLE) of the parameters of an underlying distribution from a given data set, when the data is incomplete or has missing values. The proposed approach incorporates the EM algorithm with some generalizations to estimate the number of classes, the parameters of each class, as well as the prior probability of each class. The two algorithms introduced in this paper are the following: Algorithm 1, SVM as a density estimator for a Bayes classifier; and Algorithm 2, the EM algorithm for estimating the number of classes, the class conditional probability parameters, and the prior probabilities for Bayesian classification of multispectral data. We describe the details of the algorithms and their implementation for classification of seven-dimensional multispectral Landsat data in a Bayes classifier setup. We evaluate the performance of the two algorithms with respect to the following classical approaches: Gaussian-based and Parzen window-based classifiers. The proposed algorithms show high performance against these algorithms.
2
Support Vector Machines for Density Estimation
Support Vector Machines (SVM) have been developed by Vapnik [5] and are gaining popularity due to many attractive features and promising empirical performance. The formulation embodies the Structural Risk Minimization (SRM) principle, which has been shown to be superior to the traditional Empirical Risk Minimization (ERM) principle employed in conventional learning algorithms (e.g. neural networks). SRM minimizes an upper bound on the generalization error, as opposed to ERM, which minimizes the error on the training data. It is this difference which makes SVM more attractive in statistical learning applications. In this section, we present a brief outline of the SVM approach for density estimation.

2.1  The Density Estimation Problem

The probability density function p(x) of the random vector x is a nonnegative quantity which is defined by:

F(x) = \int_{-∞}^{x} p(x') dx'    (1)

where F(x) is the cumulative distribution function. Hence, in order to estimate the probability density, we need to solve the integral equation:
\int_{-∞}^{x} p(x', α) dx' = F(x)    (2)

on a given set of densities p(x, α), where the integration in (2) is a vector integration, and α is the set of parameters to be determined. The estimation problem in (2) can be regarded as solving the linear operator equation:

A p(x) = F(x)    (3)

where the operator A is a one-to-one mapping of the elements p(x) of the Hilbert space E_1 into elements F(x) of the Hilbert space E_2.

2.2  Support Vector Machines
In support vector machines, we look for a solution of the density estimation problem in the form:

p(x) = \sum_{k=1}^{N} β_k K(x, x_k)    (4)

where N is the size of the training sample, K(x, x_k) is the kernel that defines a Reproducing Kernel Hilbert Space (RKHS), and the β_k's are the parameters to be estimated using the SVM method. The density estimation problem in (3) is known to be an ill-posed problem, see [6]. One solution for ill-posed problems is to introduce a semi-continuous, positive functional (Λ(p(x)) ≤ c; c > 0) in a RKHS. Also, we define p(x) as a trade-off between Λ(p(x)) and ||A p(x) − F_N(x)||. An example of such a method is that by Phillips [6]:

\min_{p(x)} Λ(p(x))    (5)

such that:

||A p(x) − F_N(x)|| < ε_N,  ε_N > 0,  ε_N → 0

or simply,

\max_k \left| F_N(x) − \int_{-∞}^{x} p(t) dt \right|_{x = x_k} = ε_k    (6)

where F_N(x) is an estimate for F(x) which is characterized by an estimate error ε_k for each instant x_k, see [6]. Using the properties of the RKHS, (5) is rewritten as:

Λ(p(x)) = (p(x), p(x))_H = \left( \sum_{k=1}^{N} β_k K(x, x_k), \sum_{j=1}^{N} β_j K(x, x_j) \right)_H = \sum_{k=1}^{N} β_k \sum_{j=1}^{N} β_j (K(x, x_k), K(x, x_j))_H = \sum_{k=1}^{N} \sum_{j=1}^{N} β_k β_j K(x_k, x_j)    (7)
Therefore, to solve the estimation problem in (3) we minimize the functional:

W(β) = Λ(p(x), p(x)) = \sum_{k=1}^{N} \sum_{j=1}^{N} β_k β_j K(x_k, x_j)    (8)

subject to the constraints (6), and:

β_k ≥ 0,  \sum_{j=1}^{N} β_j = 1    (9)

where 1 ≤ k ≤ N. The constraints in (9) are imposed to obtain the solution in the form of a mixture of densities. The only remaining issue for the SVM description is the choice of the functional form for the kernel. To obtain a solution in the form of a mixture of densities, we choose a nonnegative kernel which satisfies the following conditions:

K_γ(x, x_k) = a(γ) K\left(\frac{x − x_k}{γ}\right);   a(γ) \int K\left(\frac{x − x_k}{γ}\right) dx = 1;   and K(0) = 1    (10)
Following is a listing of the implementation steps for SVM as a density estimator.

Algorithm 1. SVM density estimator
Step 1: Get the random sample Ð from the density function to be estimated.
Step 2: Use Ð to construct F_N(x) and the ε_k's.
Step 4: Select a kernel function that satisfies (10).
Step 5: Set up the objective function (8) for the optimization algorithm.
Step 6: Set up the constraints (6) and (9).
Step 7: Apply the optimization algorithm. The nonzero parameters (β_k's) and their corresponding data vectors are considered as the support vectors.
Step 8: Calculate the density function value from (4) corresponding to a feature vector x.
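As a rough one-dimensional illustration of these steps, the sketch below sets up objective (8) with constraints (6) and (9) and hands them to a general-purpose solver. The Gaussian kernel, the width gamma and the tolerance eps are ad-hoc assumptions, and SciPy's SLSQP stands in for a dedicated QP solver; none of these choices are taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.sort(rng.normal(size=60))             # 1-D training sample (stand-in for the sample in Step 1)
N = len(x)
gamma = 0.5                                  # kernel width, chosen by hand here

# Empirical distribution function F_N at the sample points and a uniform tolerance
# eps (Step 2); the paper derives per-point eps_k from F_N.
F_N = np.arange(1, N + 1) / N
eps = 1.5 / np.sqrt(N)

K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / gamma) ** 2)   # Gram matrix for (8)
# CDF of the Gaussian kernel centred at x_j evaluated at x_k, needed for constraint (6).
Kcdf = norm.cdf((x[:, None] - x[None, :]) / gamma)

objective = lambda b: b @ K @ b              # W(beta) = beta^T K beta  (Eq. 8)
constraints = [
    {'type': 'eq',   'fun': lambda b: b.sum() - 1.0},                 # (9): sum beta = 1
    {'type': 'ineq', 'fun': lambda b: eps - np.abs(F_N - Kcdf @ b)},  # (6): |F_N - F_beta| <= eps
]
res = minimize(objective, np.full(N, 1.0 / N),
               bounds=[(0.0, None)] * N,     # (9): beta_k >= 0
               constraints=constraints, method='SLSQP')
beta = res.x
support = np.flatnonzero(beta > 1e-4)        # Step 7: nonzero beta_k mark the support vectors

def density(t):                              # Eq. (4): mixture of normalised kernels
    return (beta * norm.pdf((t - x) / gamma) / gamma).sum()
```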
3
Unsupervised Density Estimation Using the EM Algorithm
The EM algorithm has been proposed as an iterative optimization algorithm to find maximum likelihood estimates (e.g., [7][8]). In this paper, we use the EM algorithm to estimate the parameters for a mixture of densities. For an observed data set X of size N, the probabilistic model is defined as:

p(x | Θ) = \sum_{j=1}^{M} α_j p_j(x | θ_j)    (11)

where,
M : number of different components "classes" in the mixture,
α_j : weight of component j in the mixture; \sum_{j=1}^{M} α_j = 1,
θ_j : parameters indexing the density of component j in the mixture,
Θ = (α_j's, θ_j's) : parameters indexing the mixture density.
The likelihood for the mixture density can be written as:

l(Θ | X) = \prod_{i=1}^{N} p(x_i | Θ)    (12)

The optimization problem can be significantly simplified if we assume that the data set X is incomplete and there is an unobserved data set w = {w_i}_{i=1}^{N}, each component of which is associated with a component of X. The values of w determine which density in the mixture (i.e. class) generated the particular data instance. Thus, we have w_i = k, for w_i ∈ {1, 2, ..., M}, if sample i of X is generated by mixture component k. If w is known, (12) can be written as:

log(l(Θ | X, w)) = log(P(X, w | Θ)) = \sum_{i=1}^{N} log(P(x_i | w_i) P(w_i)) = \sum_{i=1}^{N} log(α_{w_i} p_{w_i}(x_i | θ_{w_i}))    (13)
The parameter set Θ = (α_j's, θ_j's) can be obtained if the components of w are known. In the case where w is unknown, we can still proceed by assuming w to be a random vector. In the remainder of this section, we will develop an expression for the distribution of the unobserved data w. We start by guessing initial values Θ^c of the mixture density parameters, i.e. we assume Θ^c = (α_j^c, θ_j^c; j = 1...M). Using Bayes rule, the conditional probability density for the value w_i, the "class of site i", is written as:

p(w_i | x_i, Θ^c) = \frac{α_{w_i}^c p_{w_i}(x_i | θ_{w_i}^c)}{p(x_i | Θ)} = \frac{α_{w_i}^c p_{w_i}(x_i | θ_{w_i}^c)}{\sum_{k=1}^{M} α_k^c p_k(x_i | θ_k^c)}    (14)

Then, the joint density for w can be written as:

p(w | X, Θ^c) = \prod_{i=1}^{N} p(w_i | x_i, Θ^c)    (15)
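Under the Gaussian class densities adopted later in Sect. 3.2, Eq. (14) is simply a normalised, weighted density evaluation. A small NumPy/SciPy sketch, offered as an illustration rather than the authors' code:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, weights, means, covs):
    """Eq. (14): p(w_i = j | x_i, Theta_c) for every sample i and component j,
    assuming Gaussian class densities."""
    N, M = X.shape[0], len(weights)
    post = np.empty((N, M))
    for j in range(M):
        post[:, j] = weights[j] * multivariate_normal.pdf(X, means[j], covs[j])
    post /= post.sum(axis=1, keepdims=True)   # divide by the mixture density p(x_i | Theta)
    return post
```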
3.1  Maximizing the Conditional Expectation

The EM algorithm first finds the expected value (Q(Θ, Θ^c)) of the log-likelihood of the complete data (the observed data X and the unobserved data w) with respect to the unknown data w. The pieces of information which are known here are the observed data set X and the current parameter estimates Θ^c. Thus, we can write:

Q(Θ, Θ^c) = E[(log p(X, w | Θ)) | X, Θ^c]    (16)

where Θ are the parameters that we optimize to maximize the conditional expectation Q. In (16), X and Θ^c are constants, Θ is a regular variable that we try to adjust, and w is a random variable governed by the density function defined in (15). Thus, using (13) and (15), the conditional expectation in (16) can be rewritten as:

Q(Θ, Θ^c) = \int_{w ∈ W} log p(X, w | Θ) p(w | X, Θ^c) dw = \sum_{w ∈ W} log(l(Θ | X, w)) p(w | X, Θ^c)    (17)
With some mathematical simplification of (17), the conditional expectation can be rewritten as:

Q(Θ, Θ^c) = \sum_{j=1}^{M} \sum_{i=1}^{N} log(α_j p_j(x_i | θ_j)) p(j | x_i, Θ^c) = \sum_{j=1}^{M} \sum_{i=1}^{N} log(α_j) p(j | x_i, Θ^c) + \sum_{j=1}^{M} \sum_{i=1}^{N} log(p_j(x_i | θ_j)) p(j | x_i, Θ^c)    (18)

In (18), the term containing α_j and the term containing θ_j are independent. Hence, those terms can be maximized independently, which significantly simplifies the optimization problem.

3.2  Parameters Estimation
In this section we present the implementation steps to estimate the different parameters of the density function underlying the distribution of the classes in a data set. The density is assumed to be a mixture of densities. Thus, the parameters to be estimated are the number of components in the mixture, the mixing coefficients, and the parameters of each density component. As stated before, the first term of the conditional expectation in (18) can be maximized independently to obtain the mixing coefficients α_j. Maximizing the first term in (18) results in the value of a mixing coefficient as:

α_j = \frac{1}{N} \sum_{i=1}^{N} p(j | x_i, Θ^c)    (19)
We stated before that we use the EM algorithm under the assumption that we set a parametric form for the density of each class. Theoretically, the algorithm can be applied to any form of the class densities. For computational purposes, we assume the class densities to be multivariate Gaussian. Under this assumption, the parameters will be the mean vector and the covariance matrix, i.e. θ_j = (μ_j, Σ_j). Under the normality assumption, the second term of (18) becomes:

\sum_{j=1}^{M} \sum_{i=1}^{N} log(p_j(x_i | θ_j)) p(j | x_i, Θ^c) = \sum_{j=1}^{M} \sum_{i=1}^{N} \left( −\frac{1}{2} log(|Σ_j|) − \frac{1}{2} (x_i − μ_j)^T Σ_j^{−1} (x_i − μ_j) \right) p(j | x_i, Θ^c)    (20)
Maximizing (20) with respect to μ and Σ will give:

μ_j = \frac{\sum_{i=1}^{N} x_i p(j | x_i, Θ^c)}{\sum_{i=1}^{N} p(j | x_i, Θ^c)}   and   Σ_j = \frac{\sum_{i=1}^{N} p(j | x_i, Θ^c)(x_i − μ_j)(x_i − μ_j)^T}{\sum_{i=1}^{N} p(j | x_i, Θ^c)}    (21)
The essence of the proposed algorithm is that we can automatically estimate the number of classes M in the scene. The optimum value of M is the value at which Q is maximum. This can be done simply by assuming a number of classes M, optimizing (16) to obtain the other parameters, and at the same time calculating the final optimized value of Q. Then, we change M and redo the optimization process. This iterative procedure is continued until a maximum value of Q is obtained. The value of M at which Q is maximum is the optimum number of classes defined in the data set.

Algorithm 2.
Step 1: Set a value for M.
Step 2: Define an initial estimate for Θ.
Step 3: Compute the value of Q from (18) using these parameters.
Step 4: Compute the new values of Θ from (19) and (21).
Step 5: Compute the new value of Q.
Step 6: Repeat steps 4 and 5 until an acceptably small difference between subsequent values of Q is reached.
Step 7: Increase M by one and iterate from step 2 until Q is maximum.
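A compact sketch of the full loop follows: an E-step based on (14), the M-step updates (19) and (21), and a sweep over M in the spirit of Step 7. It assumes Gaussian class densities and multi-dimensional data; the random initialisation is a simplification, and the model-size selection below uses BIC as a penalised stand-in for the paper's Q-based choice (the raw likelihood alone never decreases with M).

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """Eq. (14): posterior component memberships, plus the observed-data log-likelihood."""
    dens = np.column_stack([w * multivariate_normal.pdf(X, m, c)
                            for w, m, c in zip(weights, means, covs)])
    return dens / dens.sum(axis=1, keepdims=True), np.log(dens.sum(axis=1)).sum()

def m_step(X, post):
    """Eqs. (19) and (21): update mixing weights, mean vectors and covariance matrices."""
    Nj = post.sum(axis=0)
    weights = Nj / len(X)                                     # (19)
    means = (post.T @ X) / Nj[:, None]                        # (21), means
    covs = np.array([(post[:, j, None] * (X - means[j])).T @ (X - means[j]) / Nj[j]
                     for j in range(post.shape[1])])          # (21), covariances
    return weights, means, covs

def fit_mixture(X, M, n_iter=50, seed=0):
    """Algorithm 2, steps 2-6, for a fixed M."""
    rng = np.random.default_rng(seed)
    weights = np.full(M, 1.0 / M)
    means = X[rng.choice(len(X), size=M, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(X.shape[1])] * M)
    score = -np.inf
    for _ in range(n_iter):
        post, score = e_step(X, weights, means, covs)
        weights, means, covs = m_step(X, post)
    return (weights, means, covs), score

def select_number_of_classes(X, candidates=range(1, 8)):
    """Step 7: sweep M and keep the best-scoring model (BIC used as the score here)."""
    N, d = X.shape
    best_M, best_bic, best_params = None, np.inf, None
    for M in candidates:
        params, loglik = fit_mixture(X, M)
        n_free = (M - 1) + M * d + M * d * (d + 1) // 2       # weights, means, covariances
        bic = -2.0 * loglik + n_free * np.log(N)
        if bic < best_bic:
            best_M, best_bic, best_params = M, bic, params
    return best_M, best_params
```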
4
Experimental Results
In this section we present the classification results of using the above two algorithms as density estimators in the design of a Bayesian classifier. For the SVM approach, a 7-D Gaussian-like kernel function has been used. The experiments were carried out
using real data acquired from the Landsat Thematic Mapper (TM) for two different data sets as will be discussed below. The proposed two algorithms are evaluated against the Gaussian-based and Parzen window-based algorithms. Details of these algorithms are elsewhere (e.g. [1]). 4.1
Golden Gate Bay Area
The first part of the experiments is carried out using real multispectral data (7 bands) for the Golden Gate Bay area of the city of San Francisco, California, USA. Five classes are defined on this image: Trees, Streets (light urban), Water, Buildings (dense urban) and Earth. The available ground truth data for this data set includes about 1000 points per class, which is approximately 1% of the image scene. The classification results in Table 1 and Fig. 1 show that the SVM-based classifier outperforms the Gaussian-based and Parzen window-based algorithms, which reflects that the SVM works well as a density estimator in multi-dimensional feature spaces. For the second algorithm, Fig. 2 shows the evolution of the conditional expectation with the number of components assumed in the area. The figure shows that the maximum value of Q occurs at M = 5, which is the nominal value. The parameters estimated using the proposed algorithm are used in a Bayes classifier. Table 1 shows that the unsupervised classifier provides comparable results, which reflects its applicability for the problem.
Fig. 1. Classified Images for the Golden Gate Area. From upper left to lower right, original, Gaussian-based, Parzen Window-based, SVM-based, and EM-based Bayes classifier
4.2
Agricultural Area
The first part of this experiment was carried out using real multispectral data for an agricultural area, Fig 3. It is known that this area contains: Background, Corn, Soybean, Wheat, Oats, Alfalfa, Clover, Hay/Grassland, and Unknown. The ground
truth data for that area is available. Again, the results in Fig. 1 and Table 2 lead to the same conclusions as for the Golden Gate Bay area data set above.
Fig. 2. Evolution of the conditional expectation with the number of classes; left: Golden Gate, right: Agricultural area (actual value is Q × 10^4)
Fig. 3. Classified Images for the Agricultural Area. From upper left to lower right, original, Gaussian-based, Parzen Window-based, SVM-based, and EM-based Bayes classifier
5   Conclusion and Future Work
This paper presented two algorithms for estimating the parameters of a Bayes classifier for multispectral data classification. The first one is a supervised algorithm which uses the SVM as an estimator for the class conditional probabilities. The second one is unsupervised and assumes the density distribution of a class to be a mixture of different densities. The algorithm is used to estimate the number of components of this mixture (the number of classes), the parameters of each component, and the proportion of each component in the mixture (which is regarded as the prior probability of the class). The results show that the SVM is capable of modeling density functions well in high dimensions. Also, the results illustrated that the unsupervised algorithm performs well with the data sets used in the experiments. For future work, we will study methods for enhancing the SVM performance and for automatically choosing its parameters. For the unsupervised algorithm, we will relax the normality assumption of the densities. Also, we will adjust the algorithm to estimate the distribution of the observed data set assuming nonparametric forms for the densities of the classes in the data.
References
[1] A. Farag, R. M. Mohamed and H. Mahdi, "Experiments in Image Classification and Data Fusion," Proceedings of the 5th International Conference on Information Fusion, Annapolis, MD, Vol. 1, pp. 299-308, July 2002.
[2] V. Vapnik, S. Golowich and A. Smola, "Support Vector Method for Function Approximation, Regression Estimation, and Signal Processing," Neural Information Processing Systems, Vol. 9, MIT Press, Cambridge, MA, 1997.
[3] V. Vapnik and S. Mukherjee, "Support Vector Method for Multivariate Density Estimation," Neural Information Processing Systems 1999, Vol. 12, MIT Press, 2000.
[4] R. O. Duda et al., Pattern Classification, John Wiley & Sons, 2nd edition, 2001.
[5] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, 2nd edition, 2000.
[6] A. Farag and R. Mohamed, "Classification of Multispectral Data Using Support Vector Machines Approach for Density Estimation," Proceedings of the 7th INES 2003, Assiut, Egypt, March 4-6, 2003.
[7] D. Phillips, "A Technique for the Numerical Solution of Certain Integral Equations of the First Kind," Journal of the Association for Computing Machinery, 9:84-97, 1962.
[8] G. A. Rempala and R. A. Derrig, "Modeling Hidden Exposures in Claim Severity via the EM Algorithm," TR-MATH, Mathematics Department, University of Louisville, December 2002.
Goal Programming Approaches to Support Vector Machines

Hirotaka Nakayama (1), Yeboon Yun (2), Takeshi Asada (3), and Min Yoon (4)

(1) Konan University, Kobe 658-8501, Japan
(2) Kagawa University, Kagawa 761-0396, Japan
(3) Osaka University, Osaka 565-0871, Japan
(4) Yonsei University, Seoul 120-749, Republic of Korea
Abstract. Support vector machines (SVMs) are gaining much popularity as effective methods in machine learning. In pattern classification problems with two class sets, their basic idea is to find a maximal margin separating hyperplane which gives the greatest separation between the classes in a high dimensional feature space. However, the idea of maximal margin separation is not quite new: in 1960’s the multi-surface method (MSM) was suggested by Mangasarian. In 1980’s, linear classifiers using goal programming were developed extensively. This paper considers SVMs from a viewpoint of goal programming, and proposes a new method based on the total margin instead of the shortest distance between learning data and separating hyperplane.
1
Introduction
Recently, Support Vector Machines (SVMs, in short) have been gaining much popularity for machine learning. One of the main features of SVMs is that they are kernel-based linear classifiers with maximal margin in the feature space. The idea of maximal margin in linear classifiers is intuitive, and its reasoning in connection with perceptrons was given in the early 1960's (e.g., Novikoff [6]). The maximal margin is effectively applied for discrimination analysis using mathematical programming, e.g., MSM (Multi-Surface Method) by Mangasarian [4]. Later, linear classifiers with maximal margin were formulated as linear goal programming, and extensively studied from the 1980's to the beginning of the 1990's. The pioneering work was given by Freed and Glover [2], and a good survey can be found in Erenguc and Koehler [1]. This paper discusses the goal programming approach to SVMs, and shows a close relationship between the so-called ν-SVM and our suggested total margin method.
2
Goal Programming Approach to Linear Discrimination Analysis
Let X be a space of conditional attributes. For binary classification problems, the value of +1 or −1 is assigned to each data point x_i according to its class A or B. The aim of machine learning is to predict which class newly observed data belong to by virtue of a certain discrimination function f(x) constructed on the basis of the given data set (x_i, y_i) (i = 1, ..., l), where y_i = +1 or −1. Namely,

f(x) ≥ 0 ⇒ x ∈ A
f(x) < 0 ⇒ x ∈ B

When the discrimination function is linear, namely if f(x) = w^T x + b, this is referred to as a linear classifier. In this case, the discrimination boundary is given by the hyperplane defined by w^T x + b = 0. In 1981, Freed and Glover suggested finding a hyperplane separating two classes with as few misclassified data as possible by using goal programming [2]. Let ξ_i denote the exterior deviation, which is the deviation from the hyperplane for a point x_i improperly classified. Similarly, let η_i denote the interior deviation, which is the deviation from the hyperplane for a point x_i properly classified. Some of the main objectives in this approach are as follows:

i) Minimize the maximum exterior deviation (decrease errors as much as possible)
ii) Maximize the minimum interior deviation
iii) Maximize the weighted sum of interior deviations
iv) Minimize the weighted sum of exterior deviations

Of course, we can combine some of them. Note that objective ii) corresponds to maximizing the shortest distance between the learning data and the hyperplane. Although many models have been suggested, the one considering iii) and iv) above may be given by the following linear goal programming: letting y_i = +1 for i ∈ I_A and y_i = −1 for i ∈ I_B,

(GP)  minimize  \sum_{i=1}^{l} (h_i ξ_i − k_i η_i)
      subject to  y_i(x_i^T w + b) = η_i − ξ_i,  ξ_i, η_i ≥ 0,  i = 1, ..., l.

Here, h_i and k_i are positive constants. In order for ξ_i and η_i to have the meaning of the exterior deviation and the interior deviation respectively, the condition ξ_i η_i = 0 for every i = 1, ..., l must hold.

Lemma 1. If h_i > k_i for i = 1, ..., l, then we have ξ_i η_i = 0 for every i = 1, ..., l at the solution to (GP).

Proof. Easy due to Lemma 7.3.1 of [7].

It should be noted that the above formulation may yield some unacceptable solutions such as w = 0 and unbounded solutions.
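For concreteness, (GP) can be set up directly as a linear program. The sketch below uses scipy.optimize.linprog and is only an illustration of the formulation above; the variable ordering and solver choice are ours. Per Lemma 1, h_i > k_i should hold, and without an additional normalization on w the solver may report an unbounded problem, as the example that follows illustrates.

```python
import numpy as np
from scipy.optimize import linprog

def goal_programming_classifier(X, y, h, k):
    """(GP): minimise sum(h_i*xi_i - k_i*eta_i) subject to
    y_i(x_i^T w + b) = eta_i - xi_i and xi_i, eta_i >= 0.
    Variable order: [w (n), b (1), xi (l), eta (l)]."""
    l, n = X.shape
    c = np.concatenate([np.zeros(n + 1), h, -k])            # objective coefficients
    A_eq = np.hstack([y[:, None] * X,                       # y_i * x_i^T  -> w
                      y[:, None],                           # y_i          -> b
                      np.eye(l),                            # + xi_i
                      -np.eye(l)])                          # - eta_i
    b_eq = np.zeros(l)
    bounds = [(None, None)] * (n + 1) + [(0, None)] * (2 * l)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    if not res.success:                                     # e.g. unbounded objective
        return None
    return res.x[:n], res.x[n]                              # w, b
```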
Example. Let z_1 = (−1, 1), z_2 = (0, 2) ∈ A and z_3 = (1, −1), z_4 = (0, −2) ∈ B. The constraint functions are given by

z_1 : w_1(−1) + w_2(1) + b = η_1 − ξ_1
z_2 : w_1(0) + w_2(2) + b = η_2 − ξ_2
z_3 : w_1(1) + w_2(−1) + b = η_3 − ξ_3
z_4 : w_1(0) + w_2(−2) + b = η_4 − ξ_4

Here, it is clear that ξ = 0 at the optimal solution. The constraints include η_i added on the right hand side. Note that the feasible region in this formulation moves to the north-west by increasing η_i. Maximizing η_i yields an unbounded optimal solution unless further constraints on w are added.

In the goal programming approach to linear classifiers, therefore, some appropriate normality condition must be imposed on w in order to provide a bounded optimal solution. One such normality condition is ||w|| = 1. If the classification problem is linearly separable, then using the normalization ||w|| = 1, the separating hyperplane H : w^T x + b = 0 with maximal margin can be given by

(GP_1)  maximize  η
        subject to  y_i(x_i^T w + b) ≥ η,  i = 1, ..., l,
                    ||w|| = 1.

However, this normality condition makes the problem one of nonlinear optimization. Note that the following SVM formulation, with the objective function minimizing ||w||, can avoid this unboundedness handily. Instead of maximizing the minimum interior deviation in (GP_1) stated above, we use the following equivalent formulation with the normalization w^T z + b = ±1 at points with the minimum interior deviation:

(SVM)  minimize  ||w||
       subject to  y_i(w^T z_i + b) ≥ 1,  i = 1, ..., l,

where y_i is +1 or −1 depending on the class of z_i. Several kinds of norm are possible. When ||w||_2 is used, the problem is reduced to quadratic programming, while the problem with ||w||_1 or ||w||_∞ is reduced to linear programming (see, e.g., Mangasarian [5]). For the above example, we have the following conditions in the SVM formulation:

z_1 : w_1(−1) + w_2(1) + b ≥ 1
z_2 : w_1(0) + w_2(2) + b ≥ 1
z_3 : w_1(1) + w_2(−1) + b ≤ −1
z_4 : w_1(0) + w_2(−2) + b ≤ −1

Since it is clear that the optimal hyperplane has b = 0, the constraint functions for z_3 and z_4 are identical to those for z_1 and z_2. The feasible region in the
(w_1, w_2)-plane is given by w_2 ≥ w_1 + 1 and w_2 ≥ 1/2. Minimizing the objective function of the SVM yields the optimal solution (w_1, w_2) = (−1/2, 1/2) for the QP formulation. Similarly, we obtain a solution on the line segment {w_2 = w_1 + 1} ∩ {−1/2 ≤ w_1 ≤ 0}, depending on the initial solution, for the LP formulation.

On the other hand, Glover [3] shows the following necessary and sufficient condition for avoiding unacceptable solutions:

\left( −l_A \sum_{i ∈ I_B} x_i + l_B \sum_{i ∈ I_A} x_i \right)^T w = 1,    (1)

where l_A and l_B denote the number of data for the categories A and B, respectively. Geometrically, the normalization (1) means that the distance between two hyperplanes passing through the centers of the data for A and B is scaled by l_A l_B.

Taking into account that η_i/||w|| represents the margin of correctly classified data x_i from the hyperplane w^T x + b = 0, a larger value of η_i and a smaller value of ||w|| are more desirable in order to maximize the margin. On the other hand, since ξ_i/||w|| stands for the margin of misclassified data, the value of ξ_i should be minimized. The methods considering all of the ξ_i/||w|| and η_i/||w|| are referred to as total margin methods. Now, we have the following formulation for getting a linear classifier with maximal total margin:

(GP_2)  minimize  ||w|| + \sum_{i=1}^{l} (h_i ξ_i − k_i η_i)
        subject to  y_i(x_i^T w + b) = η_i − ξ_i,  i = 1, ..., l,
                    ξ_i, η_i ≥ 0,  i = 1, ..., l.

If we maximize the smallest margin for correctly classified data instead of the sum of all margins, the formulation is reduced to the well-known ν-SVM by setting η_i ≡ ρ, h_i = 1/l, and k_i = ν:

(ν-SVM)  minimize  ||w|| − νρ + \frac{1}{l} \sum_{i=1}^{l} ξ_i
         subject to  y_i(x_i^T w + b) ≥ ρ − ξ_i,  i = 1, ..., l,
                     ξ_i, η_i ≥ 0,  i = 1, ..., l.

In the following, we use a simplified formulation of (GP_2) as follows:

(GP_3)  minimize  \frac{1}{2} ||w||_2^2 + C_1 \sum_{i=1}^{l} ξ_i − C_2 \sum_{i=1}^{l} η_i
        subject to  y_i(x_i^T w + b) ≥ η_i − ξ_i,  i = 1, ..., l,
                    ξ_i, η_i ≥ 0,  i = 1, ..., l.
Note that the equality constraints in (GP_2) are changed into inequality constraints in (GP_3), which yield an equivalent solution. When the data set is not linearly separable, we can consider linear classifiers on the feature space mapped from the original data space by some nonlinear map Φ, in a similar fashion to SVMs. Considering the dual problem to (GP_3), we have a simple formulation by using a kernel K(x_i, x_j) with the property K(x_i, x_j) = <Φ(x_i), Φ(x_j)> as follows:

(GP_3-SVM)  minimize  \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j α_i α_j K(x_i, x_j)
            subject to  \sum_{i=1}^{l} α_i y_i = 0,
                        C_2 ≤ α_i ≤ C_1,  i = 1, ..., l.

Let α* be the optimal solution to the problem (GP_3-SVM). Then,

w* = \sum_{i=1}^{l} y_i α_i^* x_i,

b* = \frac{1}{n_+ + n_-} \left( (n_+ − n_-) − \sum_{j=1}^{n_+ + n_-} \sum_{i=1}^{l} y_i α_i^* K(x_i, x_j) \right),

where n_+ is the number of x_j with C_2 < α_j^* < C_1 and y_j = +1, and n_- is the number of x_j with C_2 < α_j^* < C_1 and y_j = −1.

Likewise, similar formulations following i) and/or iv) of the main objectives of goal programming are possible:

(GP_4)  minimize  \frac{1}{2} ||w||_2^2 + C_1 σ − C_2 \sum_{i=1}^{l} η_i
        subject to  y_i(x_i^T w + b) ≥ η_i − σ,  i = 1, ..., l,
                    σ, η_i ≥ 0,  i = 1, ..., l.

(GP_5)  minimize  \frac{1}{2} ||w||_2^2 + C_1 σ − C_2 ρ
        subject to  y_i(x_i^T w + b) ≥ ρ − σ,  i = 1, ..., l,
                    σ, ρ ≥ 0.

(GP_4) and (GP_5) can be reformulated using kernels in a similar fashion to (GP_3-SVM):

(GP_4-SVM)  minimize  \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j α_i α_j K(x_i, x_j)
            subject to  \sum_{i=1}^{l} α_i y_i = 0,
                        \sum_{i} α_i ≤ C_1,
                        C_2 ≤ α_i,  i = 1, ..., l.
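A sketch of solving the (GP_3-SVM) dual and recovering b* from the formula above; the RBF kernel and the use of SciPy's SLSQP in place of a dedicated QP solver are illustrative choices, not part of the paper.

```python
import numpy as np
from scipy.optimize import minimize

def gp3_svm_dual(X, y, C1, C2, gamma=1.0):
    """(GP3-SVM): minimise (1/2) sum_{i,j} y_i y_j a_i a_j K(x_i, x_j)
    subject to sum_i a_i y_i = 0 and C2 <= a_i <= C1."""
    l = len(y)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)                                  # RBF kernel, chosen for illustration
    Q = (y[:, None] * y[None, :]) * K                        # y_i y_j K(x_i, x_j)

    res = minimize(lambda a: 0.5 * a @ Q @ a,
                   x0=np.full(l, 0.5 * (C1 + C2)),
                   jac=lambda a: Q @ a,
                   bounds=[(C2, C1)] * l,                    # C2 <= a_i <= C1
                   constraints=[{'type': 'eq', 'fun': lambda a: a @ y}],
                   method='SLSQP')
    alpha = res.x

    # b* from the margin points with C2 < alpha_i < C1, following the formula above.
    margin = np.flatnonzero((alpha > C2 + 1e-6) & (alpha < C1 - 1e-6))
    n_plus = int(np.sum(y[margin] == 1))
    n_minus = int(np.sum(y[margin] == -1))
    f = K[:, margin].T @ (y * alpha)                         # sum_i y_i alpha_i K(x_i, x_j)
    b = ((n_plus - n_minus) - f.sum()) / max(n_plus + n_minus, 1)
    return alpha, b
```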
Table 1. Classification rate of various SVMs (Liver, 90% training data)

           SVM(hard)  SVM(soft)  ν-SVM   GP3 SVM  GP4 SVM  GP5 SVM  GP
    AVE    –          67.57      68.38   71.11    63.53    59.08    69.12
    STDV   –          8.74       7.10    8.59     8.10     6.79     6.55
    TIME   –          221        109.96  192.14   1115     93.83    2.29
Table 2. Classification rate of various SVMs (Liver, 70% training data)

           SVM(hard)  SVM(soft)  ν-SVM   GP3 SVM  GP4 SVM  GP5 SVM  GP
    AVE    66.51      65.32      67.35   68.33    65.05    55.48    68.16
    STDV   2.43       4.83       4.2     4.68     6.39     5.28     2.94
    TIME   20.2       161.81     28.57   61.33    109.83   19.98    1.23
Table 3. Classification rate of various SVMs (Cleveland, 90% training data)

           SVM(hard)  SVM(soft)  ν-SVM   GP3 SVM  GP4 SVM  GP5 SVM  GP
    AVE    74.67      80.0       80.59   79.89    71.67    72.81    80.67
    STDV   6.86       4.88       4.88    5.85     5.63     5.63     5.33
    TIME   17.57      35.06      22.4    36.04    6.86     6.86     2.25
(GP_5-SVM)  minimize  \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j α_i α_j K(x_i, x_j)
            subject to  \sum_{i=1}^{l} α_i y_i = 0,
                        C_2 ≤ \sum_{i} α_i ≤ C_1,
                        0 ≤ α_i,  i = 1, ..., l.
3
Numerical Example
We compare several SVMs through well-known benchmark problems: 1) BUPA liver disorders (345 instances with 7 attributes), 2) Cleveland heart-disease (303 instances with 14 attributes) (http://www.ics.uci.edu/~mlearn/MLSummary.html).
Table 4. Classification rate of various SVMs (Cleveland, 70% training data)

           SVM(hard)  SVM(soft)  ν-SVM   GP3 SVM  GP4 SVM  GP5 SVM  GP
    AVE    75.6       81.07      81.97   81.03    79.01    75.06    83.19
    STDV   6.27       3.29       3.95    2.88     2.58     6.08     2.78
    TIME   6.6        12.64      9.18    14.41    12.51    5.63     1.16
We made 10 trials with randomly selected training data of 90% and 70% of the original data set and test data of the remaining 10% and 20%, respectively. Tables 1-4 show the results. We used a computer with an Athlon 1800+ CPU. As can be seen in Table 1, SVM (hard margin) cannot produce any solution for the 90% training data of the BUPA liver disorders problem. This seems to be because the problem is not linearly separable in a strict sense, even in the feature space. On the other hand, however, the linear classifier of GP provides a good result. Therefore, we may conclude that the problem is nearly linearly separable, but not in a strict sense. The data around the boundary may be affected by noise, and this is why nonlinear decision boundaries do not necessarily yield good classification ability.
4
Concluding Remarks
SVMs were discussed from a viewpoint of goal programming. Since the model (GP) is a linear program, the computation time is extremely short. However, GP sometimes produces undesirable solutions. It is seen through our experiments that the stated normalization cannot get rid of this difficulty completely. SVMs with an LP formulation are of course possible. We shall report numerical experiments with LP-based SVMs on another occasion.
References
[1] Erenguc, S. S. and Koehler, G. J. (1990) Survey of Mathematical Programming Models and Experimental Results for Linear Discriminant Analysis, Managerial and Decision Economics, 11, 215-225
[2] Freed, N. and Glover, F. (1981) Simple but Powerful Goal Programming Models for Discriminant Problems, European J. of Operational Research, 7, 44-60
[3] Glover, F. (1990) Improved Linear Programming Models for Discriminant Analysis, Decision Sciences, 21, 771-785
[4] Mangasarian, O. L. (1968) Multisurface Method of Pattern Separation, IEEE Transactions on Information Theory, IT-14, 801-807
[5] Mangasarian, O. L. (2000) Generalized Support Vector Machines, in Advances in Large Margin Classifiers, A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans (eds.), MIT Press, Cambridge, pp. 135-146
[6] Novikoff, A. B. (1962) On Convergence Proofs on Perceptrons, Symposium on the Mathematical Theory of Automata, 12, 615-622
[7] Sawaragi, Y., Nakayama, H. and Tanino, T. (1994) Theory of Multiobjective Optimization, Academic Press
[8] Schölkopf, B. and Smola, A. S. (1998) New Support Vector Algorithms, NeuroCOLT2 Technical Report Series NC2-TR-1998-031
Asymmetric Triangular Fuzzy Sets for Classification Models J. F. Baldwin and Sachin B. Karale* A.I. Group, Department of Engineering Mathematics, University of Bristol Bristol BS8 1TR, UK {Jim.Baldwin,S.B.Karale}@bristol.ac.uk
Abstract. Decision trees have already proved to be important in solving classification problems in various fields of application in the real world. The ID3 algorithm by Quinlan is one of the well-known methods to form a classification tree. Baldwin introduced probabilistic fuzzy decision trees in which fuzzy partitions were used to discretize continuous feature universes. Here, we introduce a way of fuzzy partitioning in which we can have asymmetric triangular fuzzy sets for the mass assignment approach to fuzzy logic. In this paper we show with examples that the use of asymmetric and unevenly spaced triangular fuzzy sets will reduce the number of fuzzy sets and will also increase the efficiency of the probabilistic fuzzy decision tree. Keywords. Decision trees, fuzzy decision trees, fuzzy sets, mass assignments, ID3 algorithm
1
Introduction
ID3 is a classic decision tree generation algorithm for classification problems. It was later extended to C4.5 to allow continuous variables with the use of crisp partitioning [1]. If attribute A is partitioned into two sets, one having values of A > α and the other A ≤ α, for some parameter α, then small changes in attribute value can result in inappropriate changes to the assigned class. So, most of the time, the ID3 algorithm becomes handicapped in the presence of continuous attributes. There are many more partitioning techniques available to deal with continuous attributes [2,3]. Fuzzy ID3 extends ID3 by allowing attribute values to be fuzzy sets. This allows discretization of continuous variables using fuzzy partitions. The use of fuzzy sets to partition the universe of continuous attributes gives much more significant results [2]. Baldwin's mass assignment theory is used to translate attribute values into a probability distribution over the fuzzy partitions. Fuzzy decision rules are less sensitive to small changes in attribute values near partition boundaries and thus help to obtain considerably better results from the ID3 algorithm. *
Author to whom correspondence should be addressed.
A fuzzy partition in the context of this paper is a set of triangular fuzzy sets such that for any attribute value the sum of membership of the fuzzy sets is 1. The use of symmetric, evenly spaced triangular fuzzy sets generates a decision tree, which normally gives good classification on training sets but for test data sets it may tend to give erroneous results, depending on the data set used as training set. This can be avoided by using asymmetric, unevenly spaced triangular fuzzy sets. This is discussed in the following sections and supported with examples.
2
Symmetric and Asymmetric Triangular Fuzzy Partitioning
The need to partition the universe has introduced a new problem into decision tree induction. The use of asymmetric, unevenly spaced triangular fuzzy sets can help to improve the results compared to those obtained by symmetric, evenly spaced triangular fuzzy sets. Asymmetric fuzzy sets will give improved or similar results, depending on circumstances. 2.1
Symmetric, Evenly Spaced Triangular Fuzzy Sets
Symmetric, evenly spaced triangular fuzzy partitioning is very popular and has been used for various applications in a number of research areas. If fuzzy sets {f_1, ..., f_n} form a fuzzy partition then we have

\sum_{i=1}^{n} prob(f_i) = 1    (1)

where prob(f_i) is the membership function of f_i with respect to a given distribution over the universe [4]. While forming the fuzzy partitions, the total universe is evenly divided into a number of symmetric triangular partitions, decided by the user, which generally look like the fuzzy partitions shown in Fig. 1.
Fig. 1. Symmetric, evenly spaced triangular fuzzy sets
Depending on the training data set available this may result into forming one or more fuzzy partitions, which have no data point associated with them. In this case if there is a data point in a test data set which lies in this fuzzy partition then obviously
the model gives equal probability to each class for that data point and thus there is the possibility of erroneous results.

2.2  Asymmetric, Unevenly Spaced Triangular Fuzzy Sets

As with symmetric fuzzy sets, in this case we also have fuzzy sets {f_1, ..., f_n} forming a fuzzy partition with the probability distribution given by formula (1). However, the fuzzy partitions formed in this method depend on the training data set available. The training data set is divided so that each fuzzy partition has almost the same number of data points associated with it. This results in asymmetric fuzzy sets, which look like the fuzzy sets shown in Fig. 2.

Fig. 2. Asymmetric, unevenly spaced triangular fuzzy sets

This will minimise the possibility of developing rules that give equal probability prob(f_i) to each class for data points in the test data set, and will help to identify the class to which a data point belongs. In turn, it will increase the average accuracy of the decision model. If the training data has data points clustered at a few points of the universe of the attribute instead of scattered all over the universe, or if there are gaps in the universe of the attribute without any data points, this modification will also help to minimize the number of fuzzy sets required and/or will indicate the maximum number of fuzzy sets possible (i.e. in such cases the algorithm will provide the exact number of fuzzy sets required).
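One simple way to realise such data-dependent partitions is to place the fuzzy-set centres at equal-frequency quantiles of the training values, as in the sketch below; this is an illustration of the idea under that assumption, not the authors' implementation.

```python
import numpy as np

def asymmetric_partition(values, n_sets):
    """Place the fuzzy-set centres (points of maximum membership) at equal-frequency
    quantiles of the training values, so each set covers roughly the same number of
    data points; np.unique guards against duplicated quantiles."""
    centres = np.quantile(values, np.linspace(0.0, 1.0, n_sets))
    return np.unique(centres)

def memberships(v, centres):
    """Membership of value v in each triangular set; the memberships of the two
    enclosing sets always sum to 1, as required by formula (1)."""
    m = np.zeros(len(centres))
    if v <= centres[0]:
        m[0] = 1.0
    elif v >= centres[-1]:
        m[-1] = 1.0
    else:
        j = np.searchsorted(centres, v) - 1          # centres[j] < v <= centres[j+1]
        t = (v - centres[j]) / (centres[j + 1] - centres[j])
        m[j], m[j + 1] = 1.0 - t, t
    return m
```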
3
Defuzzification
After having formed a decision tree, we can obtain probabilities for each data point towards each class of the target attribute. It is then another important task to defuzzify these probabilities and obtain the class to which the data point belongs. Here, we have used the value at the maximum membership of the fuzzy set (i.e. the point of the fuzzy set at which data points show 100% dependency on that fuzzy set) for defuzzification. This can be easily justified with an example as below:
Let us assume we have an attribute with universe from 0 to 80 and five fuzzy sets on this attribute as shown in Fig. 2. If we have a data point having value 30 on this attribute then after fuzzification we get membership of 70% on the second fuzzy set and 30% membership on the third fuzzy set as shown in Fig. 3.
Fig. 3. Example diagram for defuzzification
If we defuzzify this point using the values at the maximum membership of the fuzzy sets, then we will get:

• Second fuzzy set: 15 (value at maximum membership) × 0.7 = 10.50, and
• Third fuzzy set: 65 (value at maximum membership) × 0.3 = 19.50
i.e. we get back the total 10.50 + 19.50 = 30.00, which is exactly the value at the initial point before fuzzification. Thus the final output value is given by

V = \sum_{i=1}^{n} α_i v_i    (2)

where n is the number of fuzzy sets, each associated with probability α_i, and v_i is the value at maximum membership of each fuzzy set. In contrast, if any other value is used for defuzzification, for example the mean value of the fuzzy set, then the value obtained after defuzzification will not be the same as the value before fuzzification.
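The worked example above can be reproduced in a few lines of Python; the centre values other than 15 and 65 are made up purely for illustration, and the fuzzify helper assumes an interior value of the universe.

```python
import numpy as np

# Centres (points of maximum membership) of five fuzzy sets on the universe 0-80;
# 15 and 65 match the second and third sets of the example, the rest are assumed.
centres = np.array([0.0, 15.0, 65.0, 72.0, 80.0])

def fuzzify(v):
    m = np.zeros(len(centres))
    j = np.searchsorted(centres, v) - 1              # v lies between centres[j] and centres[j+1]
    t = (v - centres[j]) / (centres[j + 1] - centres[j])
    m[j], m[j + 1] = 1.0 - t, t
    return m

def defuzzify(m):
    """Eq. (2): V = sum_i alpha_i * v_i with v_i the value at maximum membership."""
    return float(m @ centres)

m = fuzzify(30.0)
print(np.round(m, 2))        # [0.   0.7  0.3  0.   0. ]
print(defuzzify(m))          # 30.0 -- the original value is recovered exactly
```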
4
Results and Discussion
4.1
Example 1
Consider an example having six attributes, one of which is a categorical, non-continuous attribute with four classes, while the other five are non-categorical, continuous attributes. Universes for all continuous attributes are shown in Table 1.
Table 1. Universes for continuous attributes

    Non-categorical attribute   Universe
    1                           10 - 100
    2                           0 - 550
    3                           3 - 15
    4                           0 - 800
    5                           0 - 1000
A training data set of 500 data points is generated such that the first and the fourth non-categorical attributes have a gap in their universe where no data point is available. For this example the gap is adjusted so that there is one fuzzy set in the symmetric fuzzy partitioning of each of the first and the fourth attributes without any data points belonging to it. For example, Fig. 4 shows the symmetric triangular partitioning for attribute one, where no data point is associated with the third fuzzy set.
Fig. 4. Symmetric fuzzy partitions on the first attribute
Fig. 5. Asymmetric fuzzy partitions on the first attribute
On use of asymmetric fuzzy sets, the partitioning is such that each fuzzy partition has almost same number of data points, as shown in Fig. 5. Results obtained for both symmetric and asymmetric approaches are shown in Table 2a. Test data set of 500 data points is generated which contains the data points scattered all over the universe of the attributes instead of clustered. We can see from Table 2a that, on use of asymmetric fuzzy sets the percentage accuracy is significantly higher than that on the use of symmetric fuzzy sets for both training and test data sets. In this problem, the use of four symmetric fuzzy sets avoid forming empty fuzzy partitions and gives more accuracy than the accuracy obtained by the use of five fuzzy sets. But still the accuracy cannot be improved up to that obtained by asymmetric fuzzy sets.
Asymmetric Triangular Fuzzy Sets for Classification Models
369
Table 2a. Comparison for use of symmetric and asymmetric fuzzy sets
Fuzzy Sets
4 5 6 7 4.2
Symmetric Triangular Fuzzy Sets %Accuracy %Accuracy on Training on Test Data Set Data Set 54.56 67.87 55.45 47.51 55.91 51.83 65.17 36.87
Asymmetric Triangular Fuzzy Sets %Accuracy %Accuracy on Training on Test Data Set Data Set 83.60 74.20 85.37 71.75 88.84 75.90 91.35 77.12
Example 2
In this example also the training data set of 500 data points with five non categorical attributes with universes as shown in Table 1 and one categorical attribute with four classes is used. This time the training data set is generated so that the gap in the first example is reduced to 50% on third fuzzy partition for the first attribute, as shown in Fig. 6. Even on a reduction in the gap, we can see from Table 2b that the results obtained by the use of asymmetric fuzzy sets are better than those obtained by the use of symmetric fuzzy sets. The test data set used in this case is same as that used in Example 1.Fig. 7 shows the fuzzy partitioning on use of asymmetric fuzzy sets with reduced gap.
Fig. 6. Symmetric fuzzy partitions on the first attribute with reduced gap
Fig. 7. Asymmetric fuzzy partitions on the first attribute with reduced gap
370
J. F. Baldwin and Sachin B. Karale
Table 2b. Comparison for use of symmetric and asymmetric fuzzy sets with reduced gap
Fuzzy Sets
4 5 6 7
5
Symmetric Triangular Fuzzy Sets %Accuracy %Accuracy on Training on Test Data Set Data Set 63.93 69.68 57.25 66.58 77.09 81.93 62.66 69.08
Asymmetric Triangular Fuzzy Sets %Accuracy %Accuracy on Training on Test Data Set Data Set 83.10 82.77 84.64 86.11 87.03 83.86 86.92 84.13
Conclusions
The new attempt made to use Fuzzy ID3 algorithm generate a decision tree which can give same or improved results compared to the use of symmetric fuzzy sets in Fuzzy ID3 algorithm for the real-world classification problems. The use of asymmetric, unevenly spaced triangular fuzzy sets makes Fuzzy ID3 more efficient and increases the reliability of the algorithm for variety of data. This approach of fuzzy partitioning for mass assignment ID3 can improve the results either by reducing the number of fuzzy sets or by increasing the accuracy, depending on the problem.
References [1] [2] [3] [4]
Quinlan, J.R.: C4.5: Programs for Machine Learning. San Mateo, Morgan Kaufmann Publishers (1993) Baldwin, J.F., Lawry, J., Martin, T. P.: A Mass Assignment Based ID3 Algorithm for Decision Tree Induction. International Journal of Intelligent Systems, 12 (1997) 523-552 Fayyad, U.M., Irani, K.B.: On the handling of continuous-valued attributes in decision tree generation. Mach. Learn., 8 (1992) 87-102 Baldwin, J.F., Martin, T.P., Pilsworth, B.W.: FRIL- Fuzzy and Evidential Reasoning in A.I. Research Studies Press, Wiley, New York (1995)
Machine Learning to Detect Intrusion Strategies Steve Moyle and John Heasman Oxford University Computing Laboratory Wolfson Building, Parks Road, Oxford, OX1 3QD, England
Abstract. Intrusion detection is the identification of potential breaches in computer security policy. The objective of an attacker is often to gain access to a system that they are not authorized to use. The attacker achieves this by exploiting a (known) software vulnerability by sending the system a particular input. Current intrusion detection systems examine input for syntactic signatures of known intrusions. This work demonstrates that logic programming is a suitable formalism for specifying the semantics of attacks. Logic programs can then be used as a means of detecting attacks in previously unseen inputs. Furthermore the machine learning approach provided by Inductive Logic Programming can be used to induce detection clauses from examples of attacks. Experiments of learning ten different attack strategies to exploit one particular vulnerability demonstrate that accurate detection rules can be generated from very few attack examples.
1
Introduction
The importance of detecting intrusions in computer based systems continues to grow. In early 2003 the Slammer worm crippled the Internet taking only 10 minutes to spread across the world. It contained a simple, fast scanner to find vulnerable machines in a small worm with a total size of only 376 bytes. Fortunately the bug did not contain a malicious payload. Intrusion detection is the automatic identification of potential breaches in computer security policy. The objective of an attacker is often to gain access to a system that they are not authorized to use. The attacker achieves this by exploiting a (known) software vulnerability by sending the system a particular input. In this work the buffer over-flow vulnerability is studied, and its technical basis described (in section 1). This is followed by the description of a semantic intrusion detection framework in section 2. Section 3 describes an experiment to use Inductive Logic Programming techniques to automatically produce intrusion detection rules from examples of intrusions. These results are discussed in the final section of the paper. 1.1
The Buffer Overflow Attack
Buffer overflow attacks are a type of security vulnerability relating to incomplete input validation. The circumstances in which a buffer overflow can occur are well V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 371–378, 2003. c Springer-Verlag Berlin Heidelberg 2003
372
Steve Moyle and John Heasman
understood and wholly avoidable if the software is developed according to one of many secure programming checklists [3, ch. 23].1 From a technical perspective the concepts behind exploiting a buffer overflow vulnerability are simple but the implementation details are subtle. They are discussed thoroughly from the view of the attacker in the seminal paper “Smashing the Stack for Fun and Profit” [1]. Sufficient background and implementation details are reproduced here so that the reader may understand how such attacks can be detected. Buffer overflow attacks exploit a lack of bounds checking on a particular user-supplied field within a computer software service that performs some transaction. The attacker supplies an overly large input transaction to the service. In the course of processing the transaction, the service copies the field into a buffer of a fixed size.2 As a result, memory located immediately after the buffer is overwritten with the remainder of the attacker’s input. The memory that is overwritten may have held other data structures, the contents of registers, or, assuming the system is executing a procedure, the return address from which program execution should continue once the procedure completes. It is the return address that is of most interest to the attacker. The function that copies data from one location to another without regard for the data size is referred to in the sequel as simply the dangerous function. Before the service executes the procedure containing the function call that leads to the overflow, the return address is pushed onto the stack so that program execution can continue from the correct place once the procedure is complete. Information about the processor’s registers is then pushed onto the stack so that the original state can be restored. Space for variables local to the procedure is then allocated. The attacker aims to overwrite the return address thus controlling the flow of the program. A malicious transaction input of a buffer overflow attack consists of several components as shown in Fig. 1. The dangerous function call takes the attacker’s input and overwrites the intended buffer in the following order: 1) the space allocated for the procedure’s local variables and the register save area, and 2) the return address. The purpose of the sequence of idle instructions at the beginning of the exploit code stream is to idle the processor. Idle instructions are used to reduce the uncertainty of the attacker having to guess the correct return address which must be hard coded in the attack stream. By including a sequence of idle instructions at the start of the exploit code, the return address need only be set to an address within this sequence which greatly improves the chances of an exploit succeeding. This sequence of idle instructions is often referred to as a NOP sledge. NOP is the opcode that represents a no operation instruction within the CPU. The attacker aims to set the return address to a location within the idle sequence so that his exploit code is executed correctly. He supplies multiple copies of his desired return address to compensate for the uncertainty of where 1 2
This does not demean the target of this work – although preventable – buffer overflow vulnerabilities are still prolific in computer software! The size of the buffer is smaller than that necessary to store the entire input provided by the attacker.
Machine Learning to Detect Intrusion Strategies
373
Arbitrary Sequence of idle Exploit code Multiple copies of data instructions return address
Fig. 1. attack
Components of a malicious transaction containing a buffer overflow
the actual return address is located. Once the procedure has completed, values from the (overwritten) register save area are popped off the stack back into system registers and program execution continues from the new return address – now executing the attacker’s code. For the attacker to correctly build the malicious field within the transaction, he must know some basic properties of the target system, including the following: – The processor architecture of the target host as machine code instructions have different values in different processors. – The operating system the target is running as each operating system has a different kernel and different methods of changing between user and kernel mode. – The stack start address for each process which is fixed for each operating system/kernel. – The approximate location of the stack pointer. This is a guess at how many bytes have been pushed onto the stack prior to the attack. This information together with the stack start address allow the attacker to correctly calculate a return address that points back into the idle sequence at the beginning of the exploit code. The work described in the sequel is limited to the discussion of buffer overflow exploits for the Intel x86 family of processors running the Linux operating system. These are only pragmatic constraints. The work is easily extended to different processor architectures, operating systems, and some vulnerabilities. 1.2
An IDS for Recognizing Buffer Overflow Attacks
An attacker must provide a stream of bytes, that, when executed after being stored in an overflowed buffer environment, will gain him control of the CPU, which will in-turn, perform some task for him (say execute a shell on a Unix system). When crafting a string of bytes the attacker is faced with many alternatives, given the constraints that he must work within (e.g. the byte stream must be approximately the size of the vulnerable buffer). These alternatives are both at the strategic and implementation levels. At the strategic level, the attacker must decide in what order the tasks will be performed by the attack code. For example, will the address of the command to the shell be calculated and stored before the string that represents the shell type to be called? Having decided on the components that make up the strategy, at the implementation level the attacker has numerous alternative encodings to choose from. Which register, for example, should the null byte be stored in? It
374
Steve Moyle and John Heasman
is clear that for any given attack strategy, there are vast numbers of alternative attack byte code streams that will, when executed, enact the strategy. This also means that simple syntax checking intrusion detection systems can easily be thwarted even by attackers using the same strategy that the syntactic IDS has been programmed to detect. A semantic Intrusion Detection System (IDS) was developed as a Prolog program in [4], which was further used for the basis of this work. The IDS program includes a dis-assembler that encodes the semantics of Intel x86 class processor byte codes. The IDS includes a byte stream generator that utilizes the same dis-assembler components. This byte stream generator produces a stream of bytes by producing a randomized implementation of the chosen intrusion strategy. This stream of bytes is then compiled into a dummy executable which simulates the process of the byte stream being called. This simply emulates the situation of the buffer having already been over-flowed, and not the process of over flowing it per se. Each attack byte stream is verified in its ability to open a shell (and hence its success for an attacker).
2
Inductive Logic Programming
Inductive Logic Programming (ILP) [6] is a field of machine learning involved in learning theories in the language of Prolog programs. ILP has demonstrated that such a powerful description language (First Order Predicate Calculus) can be applied to a large range of domains including natural language induction [2], problems in bio-informatics [7], and robot programming [5]. This work is closet to that of natural language programming in that it is concerned with inducing context dependent parsers from examples of “sentences”. The general setting for ILP is as follows. Consider an existing partial theory (e.g. a Prolog program) B (known as Background knowledge), from which one is not able to derive (or predict) the truth of an externally observed fact. Typically these facts are known as examples3 , E. This situation is often presented as B |= E + . The objective of the ILP system is to produce an extra Prolog program H known as a theory, that, when combined with the background knowledge enables all the examples to be predicted. It is this new theory that is the knowledge discovered by the ILP system. B ∪ H |= E In the context of the semantic intrusion detection system, the basic assembler of the CPU architecture and the general semantic parser program, BIDS make up the background knowledge. The examples are the byte streams known to produce a successful attack. Consider the following single example of e+ 1 , where the attack is represented as a list of bytes in Prolog. 3
In general, the examples in ILP can be considered as either positive E + examples, or negative E − examples such that E = E + ∪ E − and E + ∩ E − = ∅.
Machine Learning to Detect Intrusion Strategies
e+ 1 =
375
attack([0x99,0x90,0x90,0xf8,0x9b,0x9b,0xeb,0x1c,0x5b,0x89,0x5b,0x08,0xba,0xff,0xff,
0xff,0xff,0x42,0x88,0x53,0x07,0x89,0x53,0x0c,0x31,0xc0,0xb0,0x0b,0x8d,0x4b,0x08,0x8d, 0x53,0x0c,0xcd,0x80,0xe8,0xdf,0xff,0xff,0xff,0x2f,0x62,0x69,0x6e,0x2f,0x73,0x68]).
Such an example is a single instance of the SSFFP buffer overflow exploit. It can be described (and parsed) by the following high level strategy (written in Prolog). shown below (which covers semantic variations in the exploit code taken from “Smashing the Stack for Fun and Profit” [1]). attack(A):- consume_idles(A,B), consume_load(B,C,Addr,Len), consume_place_addr(C,D,Addr,Len), consume_null_long(D,E,Null,Addr), consume_null_operation(E,F,Len,Null,Addr), consume_set_sys_call(F,G,Null,0x0b,Addr), consume_params(G,H,Addr,Null,Len), consume_system_call(H,[]).
Consider for example, the predicate consume idles(A,B) which transforms a sequence of bytes A into a suffix sequence of bytes B by removing the contiguous idle instructions from the front of the sequence. The above attack predicate is clearly much more powerful than a syntactic detection system in that it is capable detecting the whole class of byte streams for that particular strategy. However, it does not detect other strategies of attacks that have different structural properties. A simple solution is to add more rules that cover variations in the exploit code structure but this requires substantial human effort. If the expert is faced with new real world examples he believes can be expressed using the background information that is already encoded, he might ask “is there a way of automating this process?”. This is the motivation for the following section, which demonstrates that inducing updates to the semantic rule set is possible.
3
An Experiment to Induce Semantic Protection Rules
This section describes controlled experiments to (re-)learn semantic intrusion detection rules using the ILP system ALEPH [9]. The motivation for the experiment can be informally stated as testing the hypothesis that ILP techniques can recover reasonably accurate rules describing attack strategies. The basic method was to use correct and complete Prolog programs to randomly generate sequences of bytes that successfully exploit a buffer overflow4. Such byte sequences are considered as positive examples, E + , of attacks. The byte sequence generating predicates were then removed from the Prolog program leaving a common background program to be used as background knowledge, BIDS , to ALEPH. This background program was then used, along with positive examples of attack byte sequences, to induce rules similar to the byte sequence generating predicates. 3.1
Results
The learning curves for the recovery of each of the ten intrusion strategies are presented in Figure 2. Here, accuracy is measured as the proportion of test set 4
All such generated sequences, when supplied as part of a transaction with a vulnerable service, enabled the attacker to open a shell.
376
Steve Moyle and John Heasman × ♦
100
× + ♦
× + ♦
× + ♦
+ × ♦
80 Average Accuracy (%) on hold-out × sets 60
+
+ ♦
40 3
9 27 Cardinality of training set (logarithmic scale)
81
Fig. 2. Learning curves for the recovery of ten different intrusion strategies from examples. Each line on the graph represents a different intrusion attack strategy (i.e. of the form attack(A) :- consume. . . ). Each point on the graph represents the average accuracy of 10 randomly chosen training and test sets
examples which are detected by the recovered rule. The proportion of the test set examples which are not detected by the recovered rule is the intrusion detection false negative rate. It can be seen from Figure 2 that, on average, for the attack strategies studied, relatively few examples of attacks are required to recover rules that detect a high proportion of attacks. In fact, for all of the ten studied attack strategies, only ten positive attack examples were required to recover a rules that were 100% accurate on the test set. This finding is discussed in the following section.
4
Discussion and Conclusions
The graphs in Figure 2 demonstrate that it is possible to induce semantic detection rules for buffer overflows attacks that are 100% accurate. Furthermore, these results can be obtained using, on average, a low number of examples for the induction. The need for few examples can be attributed to the relatively low size of the hypothesis space, which was a result of the high input and output connectedness of the background predicates. The average estimated size of the hypothesis space for the recovery of these attack strategies is 797 clauses and is comparable to the classic ILP test-bed problem, the Trains Srinivasan provides estimates of the hypothesis space for the trains problem and other common ILP data sets in [8].
Machine Learning to Detect Intrusion Strategies
377
Even the rules with lower accuracies can be understood by a domain expert and have some intrusion detection capabilities. This result demonstrates that ILP was well suited to the intrusion problem studied. It may well be suited to real world intrusion detection problems, provided the background theory is sufficient. These experiments indicate that the background information was highly accurate and relevant. The importance of relevancy has been studied by Srinivasan and King [10]. This application domain is very “crisp” and well defined. In a particular context, a byte op-code always performs a particular function. For example when the byte 0x9B is executed on an Intel x86 processor it always triggers the FWAIT functions to be performed. Such well defined, and cleanly specified domain is well suited to being represented in a logic program and susceptible to ILP. Contrast this with a biological domain where the activity of a particular system varies continuously with respect to many inputs – for example the concentration of certain chemicals. The work described has shown that it is possible to encode the operational semantics of attack transactions in a logical framework, which can then be used as background knowledge to an ILP system. Sophisticated attack strategies can then be represented in such a framework. In studying the particular application of a buffer overflow attack, it has been shown that ILP techniques can be used to learn rules that detect – with low false negative rates – attack strategies from relative few examples of attacks. Furthermore, only positive examples of attacks are necessary for the induction of attack strategy rules.
References [1] ”Aleph One”. Smashing The Stack For Fun And Profit. Phrack 49, 1996. 372, 375 [2] J. Cussens. Part-of-speech tagging using Progol. In Inductive Logic Programming: Proceedings of the 7th International Workshop (ILP-97), p. 93-108. Prague, 1997. Springer. 374 [3] S. Garfinkel, G. Spafford. Practical UNIX and Internet Security. Sebastopol, O’Reilly and Associates, 1996. 372 [4] S. A. Moyle and J. Heasman. Applying ILP to the Learning of intrusion strategies. Technical Report PRG-RR-02-03, Oxford University Computing Laboratory, University of Oxford, 2002. 374 [5] S. Moyle. Using Theory Completion to learn a Robot Navigation Control Program. In S. Matwin, editor, Proceedings of the 12th International Workshop on Inductive Logic Programming, 2002. 374 [6] S. Muggleton. Inverse Entailment and Progol. New Generation Computing, 13(3 and 4):245–286, 1995. 374 [7] S. H. Muggleton, C. H. Bryant, A. Srinivasan, A. Whittaker, S. Topp, and C. Rawlings. Are grammatical representations useful for learning from biological sequence data? – a case study. Computational Biology, 8(5):493-522, 2001. 374 [8] A. Srinivasan. A study of two probabilistic methods for searching large spaces with ILP. Technical Report PRG-TR-16-00, Oxford University Computing Laboratory, University of Oxford, 2000. 376
378
Steve Moyle and John Heasman
[9] A. Srinivasan. The Aleph Manual. http://web.comlab.ox.ac.uk/oucl/research/ ~areas/machlearn/Aleph/, 2001. 375 [10] A. Srinivasan, R. D. King. An empirical study of the use of relevance information in Inductive Logic Programming. Technical Report PRG-RR-01-19, Oxford University Computing Laboratory, University of Oxford, 2001. 377
On the Benchmarking of Multiobjective Optimization Algorithm Mario K¨ oppen Fraunhofer IPK Dept. Security and Inspection Technologies, Pascalstr. 8-9, 10587 Berlin, Germany
[email protected] Abstract. The ”No Free Lunch” (NFL) theorems state that in average each algorithm has the same performance, when no a priori knowledge of single-objective cost function f is assumed. This paper extends the NFL theorems to the case of multi-objective optimization. Further it is shown that even in cases of a priori knowledge, when the performance measure is related to the set of extrema points sampled so far, the NFL theorems still hold. However, a procedure for obtaining function-dependent algorithm performance can be constructed, the so-called tournament performance, which is able to gain different performance measures for different multiobjective algorithms.
1
Introduction
The ”No Free Lunch” (NFL) theorems state the equal average performance of any optimization algorithm, when measured against the set of all possible cost functions and if no domain knowledge of the cost function is assumed [3]. Usually, the NFL theorem is considered in a context of design of algorithms, especially it became well-known in the scope of evolutionary computation. However, the NFL theorem has also some other facettes, one of which is the major concern of this paper. So, the NFL theorem can also be seen as stating the impossibility to obtain a concise mathematical definition of algorithm performance. In this context, this paper considers multi-objective optimization and how the NFL theorems apply in this field. After recalling some basic definitions of multi-objective optimization in section 2, esp. the concept of Pareto front, the standard NFL theorem is proven for the multi-objective case in section 3. Then, the proof is extended to the case where sampling of extrema is also involved in the performance measure in section 4, proving that there is no gain in using such a measure. Only the case that two algorithms are compared directly give rise to so-called tournament performance and a heuristic procedure to measure algorithm performance. This will be presented in section 5.
2
Basic Definitions
In multi-objective optimization, optimization goal is given by more than one objective to be extreme [1]. Formally, given a domain as subset of Rn , there V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 379–385, 2003. c Springer-Verlag Berlin Heidelberg 2003
380
Mario K¨ oppen
are assigned m functions f1 (x1 , . . . , xn ), . . . , fm (x1 , . . . , xn ). Usually, there is not a single optimum but rather the so-called Pareto set of non-dominated solutions: For two vectors a and b it is said that a (Pareto-)dominates b, when each component of a is less or equal to the corresponding component of b, and at least one component is smaller: a >D b ←→ ∀i(ai ≤ bi ) ∧ ∃k(ak < bk ).
(1)
Note that in a similar manner Pareto dominance can be related to ”>”-relation. The subset of all vectors of a set M of vectors, which are not dominated by any other vector of M is the Pareto set (also Pareto front) P F . The Pareto set for univariate data (single objective) contains just the maximum of the data. The task of multi-objective optimization algorithm is to sample points of the Pareto front. A second instantiation (often called decision maker) is needed to further select from the Pareto front.
3
NFL-Theorem for Multi-objective Optimization Algorithms
A slight modification extends the proof of the single-objective NFL theorems given in [2] to the multi-objective case. Be X a finite set and Y a set of k finite domains Yi with i = 1, . . . , k. Then we consider the set of all sets of k cost functions fi : X → Yi with i = 1, . . . , k, or f : X → Y for simplicity. Let m be a non-negative integer < |X|. Define dm as a set {(dxm (i), dym (i) = = dxm (j). (f (dxm (i)))}, i = 1, . . . , m where dxm (i) ∈ X ∀ i and ∀ i, j, dxm (i) Now consider a deterministic search algorithm a which assign to every possible dm an element of X \ dxm (see fig. 1): dxm+1 (m + 1) = a[dm ] ∈ {dxm }.
(2)
Fig. 1. A deterministic algorithm derives next sampling point dxm+1 (m + 1) from the outcome of foregoing sampling dm
On the Benchmarking of Multiobjective Optimization Algorithm
381
Define Y (f, m, a) to be the sequence of m Y values produced by m successive applications of the algorithm a to f . Let δ(., .) be the Kronecker delta function that equals 1 if its arguments are identical, 0 otherwise. Then the following holds: Lemma 1. For any algorithm a and any dym ,
δ(dym , Y (f, m, a)) =
k
|Yi ||X|−m .
i=1
f
Proof. Consider all cost functions f+ for which δ(dym , Y (f+ , m, a)) takes the value 1, 2 asf. of the sequence dym : i) f+ (a(∅)) = dym (1) ii) f+ (a[dm (1)]) = dym (2) iii) f+ (a[dm (1), dm (2)]) = dym (3) ... where dm (j) ≡ (dxm (j), dym (j)). So the value of f+ is fixed for exactly m distinct elements from X. For the remaining |X| − m elements from X, the corresponding value of f+ can be assigned freely. Hence, out of the i |Yi ||X| separate f , exactly i |Yi ||X|−m will result in a summand of 1 and all others will be 0. Then, we can continue with the proof of NFL theorem in multi-objective case. Take any performance measure c(.), mapping sets dym to real numbers. Theorem 1. For any two deterministic algorithms a and b, any performance value K ∈ R, and any c(.), δ(K, c(Y (f, m, a))) = δ(K, c(Y (f, m, b))). f
f
dym
Proof. Since more than one may give the same value of the performance measure K, for each K the l.h.s. is expanded over all those possibilities: δ(K, c(Y (f, m, a))) = f = δ(K, c(dym ))δ(dym , Y (f, m, a)) (3) m f,dy m ∈Y
=
m dy m ∈Y
=
m dy m ∈Y
=
k i=1
δ(K, c(dym ))
δ(dym , Y (f, m, a))
f
δ(K, c(dym ))
|Yi ||X|−m
k
|Yi ||X|−m (by Lemma 1)
i=1
m dy m ∈Y
δ(K, c(dym ))
The last expression does not depend on a but only on the definition of c(.).
(4)
382
4
Mario K¨ oppen
Benchmarking Measures
The formal proof of the NFL theorems assumes no a priori knowledge of the function f . This can be easily seen in the proof of Theorem 1, when the expansion over f is made (line 3 of the proof): it is implicitely assumed that the performance measure c(.) does not depend on f . There are performance measures depending on f , for which Theorem 1 does not hold and that can be easily constructed (as e.g. derived from the requirement to scan (x, y) pairs in a given order). This is a reasonable assumption for evaluating an algorithm a. Domain knowledge of f could result in algorithm a somehow designed in a manner to show increased performance on some benchmark problems. However, common procedure to evaluate algorithms is to apply them onto a set of so-called ”benchmark problems.” This also holds in the multi-objective case. From a benchmark function f , usually analytic properties (esp. the extrema points) are given in advance. In [1], an extensive suite of such benchmark problems is proposed, in order to gain understanding of abilities of multi-objective optimization algorithms. So, for each benchmark problem, a description of the Pareto front of the problem is provided. The task given to a multi-objective optimization algorithm is to sample as many points from the Pareto front as possible. To name it here again: clearly, such a performance measure is related to a priori of f itself. NFL theorems given with Theorem 1 do not cover this case. However, in the following, it will be shown that NFL theorems even apply in such a case. It is based on the following lemma: Lemma 2. For any algorithm a it holds {a ◦ f } |Y = Y |X| f
A given algorithm a applied to any f gives a sequence of values from Y . The union of all those sequences will be the set of all possible sequences of |X| elements chosen from Y . Or, in other words: each algorithm, applied to all possible f will give a permutation of the set of all possible sequences, with each sequence appearing exactly once. Proof. Assume that for two functions f1 and f2 algorithm a will give the same sequence of y-values (y1 , y2 , . . . , y|X| ). This also means that the two corresponding x-margins are permutations of X. Via induction we show that then follows f1 = f2 . Verification. Since we are considering deterministic algorithms, the choice of the first element x1 is fixed for an algorithm a (all further choices for x values are functions of the foregoing samplings). So, both f1 and f2 map x1 to y1 . Step. Assume f1 (xi ) = f2 (xi ) for i = 1, . . . , k (and k < |X|). Then according to eq. 2, algorithm a will compute the same dxk+1 (k +1) since this computation only depends on the sequence dm that is equal for f1 and f2 by proposition. Since the y-margins are also equal in the position (k + 1), for both f1 and f2 xk+1 = dxk+1 (k + 1) is mapped onto yk+1 .
On the Benchmarking of Multiobjective Optimization Algorithm
383
Table 1. Performance measure Pareto sampling after two steps in the example case Y = {0, 1}2 and |X| = 3 y1 y2 00 00 00 00 00 00 00 00 00 01 00 01 00 01 00 01 00 10 00 10 00 10 00 10 00 11 00 11 00 11 00 11 Sum
y3 00 01 10 11 00 01 10 11 00 01 10 11 00 01 10 11
P F c(2) y1 y2 y3 P F c(2) y1 y2 y3 P F c(2) y1 y2 y3 P F c(2) 00 1 01 00 00 00 1 10 00 00 00 1 11 00 00 00 1 00 1 01 00 01 00 1 10 00 01 00 1 11 00 01 00 1 00 1 01 00 10 00 1 10 00 10 00 1 11 00 10 00 1 00 1 01 00 11 00 1 10 00 11 00 1 11 00 11 00 1 00 1 01 01 00 00 0 10 01 00 00 0 11 01 00 00 0 00 1 01 01 01 01 1 10 01 01 01, 10 2 11 01 01 01 1 00 1 01 01 10 01, 10 1 10 01 10 01, 10 2 11 01 10 01, 10 1 00 1 01 01 11 01 1 10 01 11 01, 10 2 11 01 11 01 1 00 1 01 10 00 00 0 10 10 00 00 0 11 10 00 00 0 00 1 01 10 01 01, 10 2 10 10 01 01, 10 1 11 10 01 01, 10 1 00 1 01 10 10 01, 10 2 10 10 10 10 1 11 10 10 10 1 00 1 01 10 11 01, 10 2 10 10 11 10 1 11 10 11 10 1 00 1 01 11 00 00 0 10 11 00 00 0 11 11 00 00 0 00 1 01 11 01 01 1 10 11 01 01, 10 1 11 11 01 01 0 00 1 01 11 10 01, 10 1 10 11 10 10 1 11 11 10 10 0 00 1 01 11 11 01 1 10 11 11 10 1 11 11 11 11 1 16 16 16 11 Average Performance 59/64 ∼ 0.92
This completes the proof. It has to be noted that not each permutation of Y |X| can be accessed by an algorithm (what can be easily seen from the fact that there are much more permutations than possible algorithm specifications). Following this lemma, all performance calculations that are independent of the sorting of the elements of Y |X| will give the same average performance, independent on a. Sampling of Pareto front elements after m algorithm steps is an example for such a measure. For illustration, table 1 gives these compuations for the simple case Y = {0, 1} × {0, 1} and X = {a, b, c}. Table 1 displays all possible functions f : X → Y and the corresponding Pareto set P F . The column c(2) shows the number of Pareto set elements that have already been sampled after two steps. The computation of the average performance does not depend on the order in which the functions are listed, thus each algorithm a will have the same average performance cav (2) = 59/64 . A remark on the single-objective case: in the foregoing discussion, multiobjectivity of f was not referenced explicitely. Hence, the discussion holds also for the ”single-objective” version, in which an algorithm is judged by its ability to find extrema points within a fixed number of steps. The NFL theorems also appy to this case.
5
Tournament Performance
Among the selection of function-dependent performance measures, one should be pointed out in the rest of this paper. For obtaining ”position-dependence” of the measure on a single function f , the value obtained by applying a base algorithm A is taken. Algorithm a now runs competively against A. In such
384
Mario K¨ oppen
Table 2. Tournament performance of algorithm a after two steps f (a) f (b) 01 00 01 00 01 00 01 00 01 01 01 01 01 01 01 01 01 10 01 10 01 10 01 10 01 11 01 11 01 11 01 11 Total
f (c) 00 01 10 11 00 01 10 11 00 01 10 11 00 01 10 11
P F (dy 2 (A, f )) 00 00 00 00 01 01 01 01 01, 10 01, 10 01, 10 01, 10 01 01 01 01
P F (dy 2 (a, f )) 00 01 01, 10 01 00 01 01, 10 01 00 01 01, 10 01 00 01 01, 10 01
y P F (dy 2 (A, f )) \ P F (d2 (a, f )) 00 00 00 00 01 01 01 01, 10 01, 10 01, 10 01 01 01
|.| 0 0 0 0 -1 0 0 0 -2 0 0 0 -1 0 0 0 -4
a case, the NFL theorem does not hold. For seeing this, it is sufficient to provide a counterexample. Before, we define the difference of two Pareto sets P Fa and P Fb as the set P Fa \ P Fb , in which in P Fa all elements are removed, which are dominated by any element of P Fb . Be Y = {0, 1}2 and X = {a, b, c}, as in the foregoing example. Now algorithm A is as follows: Algorithm A: Take the dxm in the following order: a, b, c. And be Algorithm a: Take a as first choice. If f (a) = {0, 1} then select c as next point, otherwise b. Table 2 shows the essential part of all possible functions f , in which algorithms A and a behave different. For ”measuring” the performance of a at step m, we compute the size of the Pareto set difference |P F (dym (A, f )) \ P F (dym (a, f ))| .
(5)
and take the average over all possible f as average performance. For functions that do not start with f (a) = {0, 1}, both algorithms are identical, so in these cases = 0. For functions mapping x-value a onto {0, 1}, we see = −4. Now taking other algorithms: Algorithm b: Take a as first choice. If f (a) = {1, 0} then select c as next point, otherwise b. Algorithm c: Take a as first choice. If f (a) = {1, 1} then select c as next point, otherwise b.
On the Benchmarking of Multiobjective Optimization Algorithm
385
For b and c, a similar computation gives = −4 and = −5 respectively. In this sense ”strongest” algorithm (i.e. in comparison to A) is to sample in the order a, c, b with a performance = −13. It should be noted that this performance measure is also applicable to the single-objective case. However, more studies on this measure have to be performed. Based on this, a heuristic procedure to measure performance of multi-objective optimization algorithm a might look like: 1. Let algorithm a run for k evaluations of cost function f and take the set M1 of non-dominated points from Y obtained by the algorithm. 2. Select k random domain values of X and compute the Pareto set M2 of the corresponding Y values. 3. Compute the set M3 of elements of M2 that are not dominated by any element of M1 . The relation of |M1 | to |M3 | gives a measure how algorithm a performs against random search.
References [1] Carlos A. Coello Coello, David A. Van Veldhuizen, Gary B. Lamont. Evolutionary Algorithms for Solving Multi-Objective Problems. Kluwer Academic Publishers, 2002. 379, 382 [2] Mario K¨ oppen, David H. Wolpert and William G. Macready, “Remarks on a recent paper on the ”no free lunch” theorems,” IEEE Transactions on Evolutionary Computation, vol. 5, no. 3, pp. 295–296, 2001. 380 [3] David H. Wolpert and William G. Macready, “No free lunch theorems for optimization,” IEEE Transactions on Evolutionary Computation, vol. 1, no. 1, pp. 67–82, 1997. 379
Multicategory Incremental Proximal Support Vector Classifiers Amund Tveit and Magnus Lie Hetland Department of Computer and Information Science Norwegian University of Science and Technology, N-7491 Trondheim, Norway {amundt,mlh}@idi.ntnu.no
Abstract. Support Vector Machines (SVMs) are an efficient data mining approach for classification, clustering and time series analysis. In recent years, a tremendous growth in the amount of data gathered has changed the focus of SVM classifier algorithms from providing accurate results to enabling incremental (and decremental) learning with new data (or unlearning old data) without the need for computationally costly retraining with the old data. In this paper we propose an efficient algorithm for multicategory classification with the incremental proximal SVM introduced by Fung and Mangasarian.
1
Introduction
Support Vector Machines (SVMs) are an efficient data mining approach for classification, clustering and time series analysis [1, 2, 3]. In recent years, a tremendous growth in the amount of data gathered (for example, in e-commerce and intrusion detection systems) has changed the focus of SVM classifier algorithms from providing accurate results to enabling incremental (and decremental) learning with new data (or unlearning old data) without the need for computationally costly retraining with the old data. Fung and Mangasarian [4] introduced the Incremental and Decremental Linear Proximal Support Vector Machine (PSVM) for binary classification and showed that it could be trained extremely efficiently, with one billion examples (500 increments of two million examples) in two hours and twenty-six minutes on relatively low-end hardware (400 MHz Pentium II). In this paper we propose an efficient algorithm based on memoization, in order to support Multicategory Classification for the Incremental PSVM.
2
Background Theory
The standard binary SVM classification problem with soft margin (allowing some errors) is shown visually in Fig. 1(a) and as a constrained quadratic programming problem in (1). Intuitively, the problem is to maximize the margin between the solid planes and at the same time permit as few errors as possible, errors being positive class points on the negative side (of the solid line) or vice versa. V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 386–392, 2003. c Springer-Verlag Berlin Heidelberg 2003
Multicategory Incremental Proximal Support Vector Classifiers
(a) Standard SVM classifier
387
(b) Proximal SVM classifier
Fig. 1. SVM and PSVM
min
(w,γ,y)∈Rn+1+m
{ve y + 12 w w} (1)
s.t. D(Aw − eγ) + y ≥ e y≥0 A ∈ Rm×n , D ∈ {−1, +1}m×1 , e = 1m×1
Fung and Mangasarian [5] replaced the inequality constraint in (1) with an equality constraint. This changed the binary classification problem, because the points in Fig. 1(b) are no longer bounded by the planes, but are clustered around them. By solving the equation for y and inserting the result into the expression to be minimized, one gets the following unconstrained optimization problem: min
(w,γ)∈Rn+1+m
Setting ∇f =
f (w, γ) = ν2 D(Aw − eγ) − e2 + 12 (w w + γ 2 )
∂f ∂f ∂w , ∂γ
(2)
= 0 one gets:
−1 −1 I w A De A A + νI −A e + E = E E De = γ −e De −e A ν1 + m ν B
X
(3)
A−1
E = [A − e], E ∈ Rm×(n+1) Fung and Mangasarian [6] later showed that (3) can be rewritten to handle increments (E i , di ) and decrements (E d , dd ), as shown in (4). This decremental approach is based on time windows.
388
Amund Tveit and Magnus Lie Hetland
X = =
w γ
I + E E + (E i ) E i − (E d ) E d ν
−1
E d + (E i ) di − (E d ) dd
,
(4)
where d = De .
3
Incremental Proximal SVM for Multiple Classes
In the multicategorical classification case, the (incremental) class label vector di consists of mi numeric labels in the range {0, . . . , c − 1}, where c is the number of classes, as shown in (5). w0 . . . wc−1 (5) X = = A−1 B γ0 . . . γc−1 3.1
The Naive Approach
In order to apply the proximal SVM classifier in a “one-against-the-rest” manner, the class labels must be transformed into vectors with +1 for the positive class and −1 for the rest of the classes, that is, Θ(cmi ) operations in total, and later Θ(cmi n) for calculating (E i ) d for each class. The latter (column) vectors are collected in a matrix B ∈ R(n+1)c . Because the training features represented 2 by E i are the same for all the classes, it is enough to calculate A ∈ R(n+1) once, giving Θ(mi (n + 1)2 + (n + 1)2 ) operations for calculating (E i ) E i and adding it to νI + E E. The specifics are shown in shown in Algorithm 1. Theorem 1. The running time complexity of Algorithm 1 is Θ(cminc n). Proof. The conditional statement in lines 3–7 takes Θ(1) time and is performed minc times (inner loop, lines 2–8) per iteration of classId (outer loop, line 1–10). Calculation of the matrix-vector product B[classId, ] in line 9 takes Θ((n + 1)minc ) per iteration of classId. This gives a total running time of Θ(c · (minc + minc (n + 1))) = Θ(cminc n) .
Multicategory Incremental Proximal Support Vector Classifiers
389
Algorithm 1 calcB Naive(Einc , dinc ) Require: Einc ∈ Rminc x(n+1) , dinc ∈ {0, . . . , c − 1}minc and n, minc ∈ N Ensure: B ∈ R(n+1)xc 1: for all classId in {0, . . . , c − 1} do 2: for all idx in {0, . . . , minc − 1} do 3: if dinc [idx ] = classId then 4: dclassId [idx, ] = +1 5: else 6: dclassId [idx, ] = −1 7: end if 8: end for dclassId 9: B[classId, ] = Einc 10: end for 11: return B
3.2
The Memoization Approach
The dclassId vectors, c in all, (in line 3 of Algorithm 1) are likely to be unbalanced, that is, have many more −1 values than +1 values. However, if there are more than two classes present in the increment di , the vectors will at least share one index position where the value is −1. With several classes present in the increment di , the matrix-vector products (E i ) dclassId actually perform duplicate calculations each time there exists two or more dclassId vectors that have −1 values in the same position. The basic idea for the memoization approach (Algorithm 3) is to only calculate the +1 positions for each vector dclassId by first creating a vector F = −[E.i j ]
(a vector with the negated sum of E’s columns, equivalent to multiplying E i with a vector filled with −1) and then to calculate the dclassId vectors using F and only switching the −1 to a +1 by adding the row vector of E twice if the row in dclassId is equal to +1. In order to do this efficiently, an index of di for each class ID has to be created (Algorithm 2). Algorithm 2 buildClassMap(dinc ) Require: dinc ∈ {0, . . . , c − 1}minc and minc ∈ N 1: classMap = array of length c containing empty lists 2: for all idx = 0 to minc − 1 do 3: append idx to classMap[dinc [idx, ]] 4: end for 5: return classMap
Theorem 2. The running time complexity of Algorithm 2 is Θ(minc ).
390
Amund Tveit and Magnus Lie Hetland
Proof. Appending idx to a the tail of a linked list takes Θ(1) time, lookup of classMap[dinc [idx, ]] in the directly addressable arrays classMap and dinc also takes Θ(1) time, giving a total for line 3 of Θ(1) time per iteration of idx. idx is iterated minc times, giving a total of Θ(minc ) time.
Algorithm 3 calcB Memo(Einc , dinc , Einc Einc ) Require: Einc ∈ Rminc ×(n+1) , dinc ∈ {0, . . . , c − 1}minc and n, minc ∈ N Ensure: B ∈ R(n+1)xc , F ∈ R(n+1) 1: classMap = buildClassMap(dinc ) 2: for all classId in {0, . . . , c − 1} do 3: B[classId, ] = Einc Einc [n] 4: for all idx in classMap[classId, ] do 5: B[idx, classId, ] = B[idx, classId, ] + 2 · n j=0 Einc [idx , j] 6: end for 7: end for 8: return B
Theorem 3. The running time complexity of Algorithm 3 is Θ(n(c + minc ). Proof. Calculation of classMap (line 1) takes Θ(minc ) time (from Theorem 2). Line 3 takes Θ(n + 1) time per iteration of classId, giving a total of Θ(c(n + 1)). Because classMap provides a complete indexing (| c−1 u=0 classMap[u]| = minc ) of the class labels in d inc , and because there are no repeated occurrences of idx for c−1 different classId s ( u=0 classMap[u] = ∅), line 5 will run a total of minc times. This gives a total running time of Θ(minc + (n + 1)minc + c(n + 1) + minc ) = Θ(n(c + minc )) .
Corollary 1. Algorithms 1 and 3 calculate the same B if provided with the same input.
4
Empirical Results
In order to test and compare the computational performance of the incremental multicategory proximal SVMs with the naive and lazy algorithms, we have used three main types of data: 1. Forest cover type, 580012 training examples, 7 classes and 54 features (from UCI KDD Archive [7])
50
391
● ●
5
●
30
Seconds
●
Seconds
●
Naive Lazy 20
10
●
Naive Lazy
●
40
15
Multicategory Incremental Proximal Support Vector Classifiers
●
●
10
● ●
● ● ●
●●
0e+00
0
0
● ●
1e+05
2e+05
3e+05
4e+05
5e+05
● ●
● ●
200
Number of Examples
(a) Runtime vs examples (cover type)
●
400
●
●
●
●
600
800
1000
Number of Classes
(b) Runtime vs classes (synthetic)
Fig. 2. Computational performance: training time
2. Synthetic datasets with a large number of classes (up to 1000 classes) and 30 features 3. Synthetic dataset with a large number of examples (10 million), 10 features and 10 classes The results for the first two data sets are shown in Fig. 4; the average time from tenfold cross-validation is used. For the third data set, the average classifier training times were 18.62 s and 30.64 s with the lazy and naive algorithm, respectively (training time for 9 million examples, testing on 1 million). The incremental multicategory proximal SVM was been implemented in C++ using the CLapack and ATLAS libraries. The tests were run on an Athlon 1.53 GHz PC with 1 GB RAM running Red Hat Linux 2.4.18.
Acknowledgements We would like to thank Professor Mihhail Matskin and Professor Arne Halaas. This work is supported by the Norwegian Research Council in the framework of the Distributed Information Technology Systems (DITS) program, and the ElComAg project.
5
Conclusion and Future Work
We have introduced the multiclass incremental proximal SVM and shown a computational improvement for training the multiclass incremental proximal SVM, which works particularly well for classification problems with a large number of
392
Amund Tveit and Magnus Lie Hetland
classes. Another contribution is the implementation of the system (available on request). Future work includes applying the algorithm to demanding incremental classification problems, for example, web page prediction based on analysis of click streams or automatic text categorization. Algorithmic improvements that need to be done include (1) develop balancing mechanisms (in order to give hints for pivot elements to the applied linear system solver for reduction of numeric errors), (2) add support for decay coefficients for efficient decremental unlearning, (3) investigate the appropriateness of parallelized incremental proximal SVMs, (4) strengthen implementation with support for tuning set, kernels as well as one-against-one classifiers.
References [1] Burbidge, R., Buxton, B. F.: An introduction to support vector machines for data mining. In Sheppee, M., ed.: Keynote Papers, Young OR12, University of Nottingham, Operational Research Society, Operational Research Society (2001) 3–15 386 [2] Huang, J., Shao, X., Wechsler, H.: Face pose discrimination using support vector machines (svm). In: Proceedings of 14th Int’l Conf. on Pattern Recognition (ICPR’98), IEEE (1998) 154–156 386 [3] Muller, K. R., Smola, A. J., Ratsch, G., Scholkopf, B., Kohlmorgen, J., Vapnik, V.: Predicting time series with support vector machines. In: ICANN. (1997) 999–1004 386 [4] Fung, G., Mangasarian, O. L.: Incremental support vector machine classification. In Grossman, R., Mannila, H., Motwani, R., eds.: Proceedings of the Second SIAM International Conference on Data Mining, SIAM (2002) 247–260 386 [5] Fung, G., Mangasarian, O. L.: Multicategory Proximal Support Vector Classifiers. Submitted to Machine Learning Journal (2001) 387 [6] Schwefel, H. P., Wegener, I., Weinert, K., eds.: 8. Natural Computing. In: Advances in Computational Intelligence: Theory and Practice. Springer-Verlag (2002) 387 [7] Hettich, S., Bay, S. D.: The UCI KDD archive. http://kdd.ics.uci.edu (1999) 390
Arc Consistency for Dynamic CSPs Malek Mouhoub Department of Computer Science University of Regina, 3737 Waskana Parkway, Regina SK, Canada, S4S 0A2
[email protected] Abstract. Constraint Satisfaction problems (CSPs) are a fundamental concept used in many real world applications such as interpreting a visual image, laying out a silicon chip, frequency assignment, scheduling, planning and molecular biology. A main challenge when designing a CSPbased system is the ability to deal with constraints in a dynamic and evolutive environment. We talk then about on line CSP-based systems capable of reacting, in an efficient way, to any new external information during the constraint resolution process. We propose in this paper a new algorithm capable of dealing with dynamic constraints at the arc consistency level of the resolution process. More precisely, we present a new dynamic arc consistency algorithm that has a better compromise between time and space than those algorithms proposed in the literature, in addition to the simplicity of its implementation. Experimental tests on randomly generated CSPs demonstrate the efficiency of our algorithm to deal with large size problems in a dynamic environment. Keywords: Constraint Satisfaction, Search, Dynamic Arc Consistency
1
Introduction
Constraint Satisfaction problems (CSPs) [1, 2] are a fundamental concept used in many real world applications such as interpreting a visual image, laying out a silicon chip, frequency assignment, scheduling, planning and molecular biology. This motivates the scientific community from artificial intelligence, operations research and discrete mathematics to develop different techniques to tackle problems of this kind. These techniques become more popular after they were incorporated into constraint programming languages [3]. A main challenge when designing a CSP-based system is the ability to deal with constraints in a dynamic and evolutive environment. We talk then about on line CSP-based systems capable of reacting, in an efficient way, to any new external information during the constraint resolution process. A CSP involves a list of variables defined on finite domains of values and a list of relations restricting the values that the variables can take. If the relations are binary we talk about binary CSP. Solving a CSP consists of finding an assignment of values to each variable such that all relations (or constraints) are satisfied. A CSP is known to be an NP-Hard problem. Indeed, looking for V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 393–400, 2003. c Springer-Verlag Berlin Heidelberg 2003
394
Malek Mouhoub
a possible solution to a CSP requires a backtrack search algorithm of exponential complexity in time1 . To overcome this difficulty in practice, local consistency techniques are used in a pre-processing phase to reduce the size of the search space before the backtrack search procedure. A k-consistency algorithm removes all inconsistencies involving all subsets of k variables belonging to N . The kconsistency problem is polynomial in time, O(N k ), where N is the number of variables. A k-consistency algorithm does not solve the constraint satisfaction problem, but simplifies it. Due to the incompleteness of constraint propagation, in the general case, search is necessary to solve a CSP problem, even to check if a single solution exists. When k = 2 we talk about arc consistency. An arc consistency algorithm transforms the network of constraints into an equivalent and simpler one by removing, from the domain of each variable, some values that cannot belong to any global solution. We propose in this paper a new technique capable of dealing with dynamic constraints at the arc consistency level. More precisely, we present a new dynamic arc consistency algorithm that has a better compromise between time and space than those algorithms proposed in the literature [4, 5, 6], in addition to the simplicity of its implementation. In order to evaluate the performance in time and memory space costs of the algorithm we propose, experimental tests on randomly generated CSPs have been performed. The results demonstrate the efficiency of our methods to deal with large size dynamic CSPs. The rest of the paper is organized as follows. In the next section we will present the dynamic arc consistency algorithms proposed in the literature. Our dynamic arc consistency algorithm is then presented in section 3. Theoretical comparison of our algorithm and those proposed in the literature is covered in this section. The experimental part of our work is presented in section 5. Finally, concluding remarks and possible perspectives are listed in section 6.
2
Dynamic Arc-Consistency Algorithms
2.1
Arc Consistency Algorithms
The key AC algorithm was developed by Mackworth[1] called AC-3 over twenty years ago and remains one of the easiest to implement and understand today. There have been many attempts to best its worst case time complexity of O(ed3 ) and though in theory these other algorithms (namely AC-4[7], AC-6[8] and AC7[9]) have better worst case time complexities, they are harder to implement. In fact, the AC-4 algorithm fares worse on average time complexity than the AC-3 algorithm[10]. It was not only until recently when Zhang and Yap[11]2 proposed 1
2
Note that some CSP problems can be solved in polynomial time. For example, if the constraint graph corresponding to the CSP has no loops, then the CSP can be solved in O(nd2 ) where n is the number of variables of the problem and d the domain size of the different variables. Another arc consistency algorithm (called AC-2001) based on the same idea as AC3.1 was proposed by Bessi`ere and R´egin [12]. We have chosen AC-3.1 for the simplicity of its implementation.
Arc Consistency for Dynamic CSPs
395
an improvement directly derived from the AC-3 algorithm into their algorithm AC-3.1. The worst case time complexity of AC-3 is bounded by O(ed3 ) [13] where e is the number of constraints and d is the domain size of the variables. In fact this complexity depends mainly on the way the arc consistency is enforced for each arc of the constraint graph. Indeed, if anytime a given arc (i, j) is revised, a support for each value from the domain of i is searched from scratch in the domain of j, then the worst case time complexity of AC-3 is O(ed3 ). Instead of a search from scratch, Zhang and Yap [11] proposed a new view that allows the search to resume from the point where it stopped in the previous revision of (i, j). By doing so the worst case time complexity of AC-3 is achieved in O(ed2 ). 2.2
2.2 Dynamic Arc Consistency Algorithms
The arc consistency algorithms we have seen in the previous section can easily be adapted to update the variable domains incrementally when adding a new constraint. This simply consists of enforcing arc consistency between the variables sharing the new constraint and propagating the change to the rest of the constraint network. However, the way an arc consistency algorithm has to proceed with constraint relaxation is more complex. Indeed, when a constraint is retracted the algorithm should be able to put back those values removed because of the relaxed constraint and propagate this change to the entire graph. Thus, traditional arc consistency algorithms have to be modified so that they are able to find those values which need to be restored whenever a constraint is relaxed.
Bessière has proposed DnAC-4 [4], which is an adaptation of AC-4 [7] to deal with constraint relaxations. This algorithm stores a justification for each deleted value. These justifications are then used to determine the set of values that have been removed because of the relaxed constraint, so relaxations can be processed incrementally. DnAC-4 inherits the bad time and space complexity of AC-4. Indeed, compared to AC-3 for example, AC-4 has a bad average time complexity [10]. The worst-case space complexity of DnAC-4 is O(ed² + nd) (e, d and n being respectively the number of constraints, the domain size of the variables and the number of variables). To overcome the drawback of AC-4 while keeping an optimal worst case time complexity, Bessière proposed AC-6 [8]. Debruyne then proposed DnAC-6, adapting the idea of AC-6 to dynamic CSPs by using justifications similar to those of DnAC-4 [5]. While keeping an optimal worst case time complexity (O(ed²)), DnAC-6 has a lower space requirement (O(ed + nd)) than DnAC-4. To solve the problem of space complexity, Neveu and Berlandier proposed AC|DC [6]. AC|DC is based on AC-3 and does not require data structures for storing justifications. Thus it has a very good space complexity (O(e + nd)) but is less efficient in time than DnAC-4. Indeed, with its O(ed³) worst case time complexity, it is not the algorithm of choice for large dynamic CSPs. Our goal here is to develop an algorithm that offers a better compromise between running time and memory space than the above three algorithms. More precisely, our ambition is to have an algorithm with the O(ed²) worst case time
complexity of DnAC-6 but without the need for the complex and space-expensive data structures used to store justifications. We have therefore decided to adapt the algorithm proposed by Zhang and Yap [11] to deal with constraint relaxations. The details of the new dynamic arc consistency algorithm we propose, which we call AC-3.1|DC, are presented in the next section.

3 AC-3.1|DC

The basic idea is to integrate AC-3.1 into the AC|DC algorithm, since the latter is based on AC-3. The problem with the AC|DC algorithm is that it relies solely on the AC-3 algorithm and does not keep support lists like DnAC-4 or DnAC-6, causing the restriction and relaxation of a constraint to be fairly time consuming. This is also the reason for its worst case time complexity of O(ed³). If AC-3.1 is integrated into the AC|DC algorithm, then in theory the worst case time complexity of a constraint restriction should be O(ed²). In addition, the worst case space complexity remains the same as for the original AC|DC algorithm, namely O(e + nd). The more interesting question is whether this time complexity can be preserved during retractions. Following the same idea as AC|DC, our AC-3.1|DC algorithm deals with a relaxation as follows (see the pseudo-code of the algorithm in Fig. 1). For any retracted constraint (k, m) between the variables k and m, we perform the following three phases:

1. An estimation (over-estimation) of the set of values that have been removed because of the constraint (k, m) is first determined by looking for the values removed from the domains of k and m that have no support on (k, m). Indeed, those values already suppressed from the domain of k (resp. m) which do have a support on (k, m) do not need to be put back, since they were suppressed because of another constraint. This phase is handled by the procedure Propose. The over-estimated values are put in the array propagate_list[k] (resp. propagate_list[m]).
2. The above set is then propagated to the other variables. In this phase, for each value (k, a) (resp. (m, b)) added to the domain of k (resp. m) we look for those values removed from the domains of the variables adjacent to k (resp. m) that are supported by (k, a) (resp. (m, b)). These values are then propagated to the adjacent variables. The array propagate_list contains the list of values to be propagated for each variable. After we propagate the values in propagate_list[i] of a given variable i, these values are removed from propagate_list and added to restore_list in order to be added later to the domain of variable i. This way we avoid propagating the values more than once.
3. Finally a filtering procedure (the function Filter) based on AC-3.1 is performed to remove from the estimated set the values which are not arc consistent with respect to the relaxed problem.
The worst case time complexity of the first phase is O(d²). AC-3.1 is applied in the third phase and thus its complexity is O(ed²). Since the values
in propagate_list are propagated only once, the complexity of the second phase is also O(ed²). Thus the overall complexity of the relaxation is O(ed²). In terms of space complexity, the arrays propagate_list and restore_list require O(nd). AC-3.1 requires an array storing the resume point for each variable value (in order to achieve the O(ed²) time complexity). The space required by this array is O(nd) as well. If we add to this the O(e + nd) space requirement of the
Function Relax(k, m)
 1. propagate_list ← nil
 2. Remove (k, m) from the set of constraints
 3. Propose(k, m, propagate_list)
 4. Propose(m, k, propagate_list)
 5. restore_list ← nil
 6. Propagate(k, m, propagate_list, restore_list)
 7. Filter(restore_list)
 8. for all i ∈ V do
 9.    domain_i ← domain_i ∪ restore_list[i]

Function Propose(i, j, propagate_list)
 1. for all values a ∈ dom[i] − D[i] do
 2.    support ← false
 3.    for all b ∈ D[j] do
 4.       if ((i, a), (j, b)) is satisfied by (i, j) then
 5.          support ← true
 6.          exit
 7.    if support = false then
 8.       propagate_list[i] ← propagate_list[i] ∪ {a}

Function Propagate(k, m, propagate_list, restore_list)
 1. L ← {k, m}
 2. while L ≠ nil do
 3.    i ← pop(L)
 4.    for all j such that (i, j) ∈ the set of constraints do
 5.       S ← nil
 6.       for all b ∈ dom[j] − (D[j] ∪ restore_list[j] ∪ propagate_list[j]) do
 7.          for all a ∈ propagate_list[i] do
 8.             if ((i, a), (j, b)) is satisfied by (i, j) then
 9.                S ← S ∪ {b}
10.                exit
11.       if S ≠ nil then
12.          L ← L ∪ {j}
13.          propagate_list[j] ← propagate_list[j] ∪ S
14.    restore_list[i] ← restore_list[i] ∪ propagate_list[i]
15.    propagate_list[i] ← nil
Fig. 1. Pseudo code of the dynamic arc consistency algorithm
Table 1. Comparison in terms of time and memory costs of the four algorithms

                     DnAC-4        DnAC-6       AC|DC       AC-3.1|DC
  Space complexity   O(ed² + nd)   O(ed + nd)   O(e + nd)   O(e + nd)
  Time complexity    O(ed²)        O(ed²)       O(ed³)      O(ed²)
traditional AC-3 algorithm, the overall space requirement is O(e + nd) as well. Compared to the three dynamic arc consistency algorithms mentioned in the previous section, ours offers, in theory, a better compromise between time and space costs, as illustrated in Table 1.
4 Experimentation
Theoretical comparison of the four dynamic arc consistency algorithms shows that AC-3.1|DC offers a better compromise between time and space costs than the other three algorithms. In order to see whether the same conclusion holds in practice, we have performed comparative tests on randomly generated dynamic CSPs, as described below. The criteria used to compare the algorithms are the running time and the memory space required by each algorithm to achieve arc consistency. The experiments were performed on a Sun Sparc 10 and all procedures are coded in C/C++. Given n the number of variables and d the domain size, each CSP instance is randomly obtained by generating n sets of d natural numbers. n(n−1)/2 constraints are then picked randomly from a set of arithmetic relations {=, ≠, <, ≤, >, ≥, …}. The generated CSPs are characterized by their tightness, which can be measured, as shown in [14], as the fraction of all possible pairs of values from the domains of two variables that are not allowed by the constraint. Figure 2 shows the time taken by each arc consistency algorithm to achieve arc consistency in a dynamic environment, as follows. Starting from a CSP having n = 100 variables, d = 50
Fig. 2. Comparative tests of the dynamic arc-consistency algorithms: running time (seconds) versus constraint tightness for DnAC-4, DnAC-6, AC|DC and AC-3.1|DC
Table 2. Comparative results in terms of memory cost (n = 500)

  d           100     90      80      70      60      50      40      30     20     10     5
  DnAC-6      343MB   336MB   278MB   243MB   203MB   160MB   130MB   98MB   67MB   37MB   23MB
  AC-3.1|DC   55MB    51MB    45MB    42MB    36MB    32MB    26MB    16MB   16MB   12MB   10MB
and 0 constraints, restrictions are done by adding the relations from the random CSP until a complete graph (number of constraints = n(n−1)/2) is obtained. Afterwards, relaxations are performed until the graph is 50% constrained (number of constraints = n(n−1)/4). These tests are performed at various degrees of tightness to determine whether one type of problem (over-constrained, middle-constrained or under-constrained) favours any of the algorithms. As we can easily see, AC-3.1|DC fares better than AC|DC and DnAC-4 in all cases. Moreover, AC-3.1|DC is comparable to, if not better than, DnAC-6 (which has the best running time of the three dynamic arc consistency algorithms from the literature), as can be seen in figure 2. Table 2 shows the comparative results of DnAC-6 and AC-3.1|DC in terms of memory space. The tests are performed on randomly generated CSPs in the same way as the previous ones. As we can easily see, AC-3.1|DC requires much less memory space than DnAC-6, especially for large problems with large domain sizes.
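For concreteness, the following is a small sketch of the random generation and restriction/relaxation protocol described above. It is an illustrative reconstruction in Python (function names, the callback interface and the value ranges are our own assumptions), not the authors' C/C++ test harness.

    import random

    RELATIONS = ["==", "!=", "<", "<=", ">", ">="]   # the arithmetic relation set

    def random_dynamic_csp(n, d, seed=0):
        rng = random.Random(seed)
        domains = [set(rng.sample(range(10 * d), d)) for _ in range(n)]
        # all n(n-1)/2 possible binary constraints, each with a random relation
        constraints = [(i, j, rng.choice(RELATIONS))
                       for i in range(n) for j in range(i + 1, n)]
        rng.shuffle(constraints)
        return domains, constraints

    def run_protocol(n=100, d=50, maintain=lambda op, c: None):
        """maintain(op, c) stands for the dynamic arc consistency maintenance
        call ('add' or 'remove' a constraint) of the algorithm under test."""
        domains, constraints = random_dynamic_csp(n, d)
        for c in constraints:                            # restrictions: 0 -> n(n-1)/2
            maintain("add", c)
        for c in constraints[: len(constraints) // 2]:   # relax back to 50% constrained
            maintain("remove", c)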
5 Conclusion and Future Work
In this paper we have presented a new algorithm for maintaining the arc consistency of a CSP in a dynamic environment. Theoretical and experimental comparisons of our algorithm with those proposed in the literature demonstrate that our algorithm offers a better compromise between time and memory costs. In the near future we are looking at integrating our dynamic arc consistency algorithm into the backtrack search phase in order to handle the addition and relaxation of constraints. For instance, if a value from a variable domain is deleted during the backtrack search, would it be worthwhile to use a dynamic arc consistency algorithm to determine its effect, or would that be more costly than simply continuing with the backtrack search?
References

[1] A. K. Mackworth. Consistency in networks of relations. Artificial Intelligence, 8:99-118, 1977.
[2] R. M. Haralick and G. L. Elliott. Increasing tree search efficiency for Constraint Satisfaction Problems. Artificial Intelligence, 14:263-313, 1980.
[3] P. Van Hentenryck. Constraint Satisfaction in Logic Programming. The MIT Press, 1989.
[4] C. Bessière. Arc-consistency in dynamic constraint satisfaction problems. In AAAI'91, pages 221-226, Anaheim, CA, 1991.
[5] R. Debruyne. Les algorithmes d'arc-consistance dans les CSP dynamiques. Revue d'Intelligence Artificielle, 9:239-267, 1995.
[6] B. Neveu and P. Berlandier. Maintaining arc consistency through constraint retraction. In ICTAI'94, pages 426-431, 1994.
[7] R. Mohr and T. Henderson. Arc and path consistency revisited. Artificial Intelligence, 28:225-233, 1986.
[8] C. Bessière. Arc-consistency and arc-consistency again. Artificial Intelligence, 65:179-190, 1994.
[9] C. Bessière, E. Freuder, and J. C. Régin. Using inference to reduce arc consistency computation. In IJCAI'95, pages 592-598, Montréal, Canada, 1995.
[10] R. J. Wallace. Why AC-3 is almost always better than AC-4 for establishing arc consistency in CSPs. In IJCAI'93, pages 239-245, Chambery, France, 1993.
[11] Yuanlin Zhang and Roland H. C. Yap. Making AC-3 an optimal algorithm. In Seventeenth International Joint Conference on Artificial Intelligence (IJCAI'01), pages 316-321, Seattle, WA, 2001.
[12] C. Bessière and J. C. Régin. Refining the basic constraint propagation algorithm. In Seventeenth International Joint Conference on Artificial Intelligence (IJCAI'01), pages 309-315, Seattle, WA, 2001.
[13] A. K. Mackworth and E. Freuder. The complexity of some polynomial network consistency algorithms for constraint satisfaction problems. Artificial Intelligence, 25:65-74, 1985.
[14] D. Sabin and E. C. Freuder. Contradicting conventional wisdom in constraint satisfaction. In Proc. 11th ECAI, pages 125-129, Amsterdam, Holland, 1994.
Determination of Decision Boundaries for Online Signature Verification

Masahiro Tanaka¹, Yumi Ishino², Hironori Shimada², Takashi Inoue², and Andrzej Bargiela³

¹ Department of Information Science and Systems Engineering, Konan University, Kobe 658-8501, Japan
[email protected] http://cj.is.konan-u.ac.jp/~tanaka-e/
² R & D Center, Glory Ltd., Himeji 670-8567, Japan
{y ishino,shimada,inoue}@dev.glory.co.jp
³ Department of Computing and Mathematics, The Nottingham Trent University, Burton St, Nottingham NG1 4BU, UK
[email protected]

Abstract. We are developing methods for online (or dynamic) signature verification. Our method first moves the test signature toward the sample signature so that DP matching can be done well, and then compares the pen information along the matched points of the signatures. However, it is not easy to determine how to use the several elements of discrepancy between them. In this paper, we propose a verification method based on discrimination by an RBF network.
1 Introduction
We are developing methods for online signature verification. The data is a multidimensional time series obtained by a tablet and an electronic pen ([1, 4, 3]). The information available in our system includes the x and y coordinates, the pressure of the pen, and the azimuth and altitude of the pen. Hence the data can be seen as a sequence of vectors. Our method is first to modify the test signature toward the template signature so that DP matching can be done well, and then to compare the pen information along the matched points of the signatures. One of the most intuitive methods for verification or recognition of handwritten characters or signatures is to extract the corresponding parts of two drawings and then compare them. For this objective, we use only the coordinates of the drawings and neglect the other elements. We developed this method as a preprocessing tool applied before the main verification procedure [5, 6, 7]. However, it is not easy to determine how to use the several elements of difference between them. In this paper, we propose a verification method based on discrimination by a neural network.
2 Data Categories
Suppose a q-dimensional time-series vector z ∈ R^q is available from the tablet at a fixed sampling interval, whose elements include the x and y coordinates, pressure, azimuth and altitude of the pen. We will use k as the discrete time index. For a signature verification problem, the following three kinds of data category should be prepared.

– Template. The template signature is the one registered by the true person. We could use more than two template signatures, but we use one template here for simplicity of treatment. Let S ∈ R^{q×n_s} represent it.
– Training signatures. We need both genuine signatures and forgeries. It is possible to build a model with only genuine signatures and establish a probability density function model, but the discrimination ability is low. It is better to use both positive and negative cases; the more training signatures, the higher the accuracy. Let T_1, T_2, …, T_m be the genuine signatures, and let F_{ij}, j = 0, …, m(i); i = 1, …, n be the forgeries, where i is the ID of the person and j is the file number.
– Test signatures. We also need both genuine signatures and forgeries for checking the verification model.

If we need to authenticate n signers, we need to prepare n sets of the above.
3 Matching Method
By considering the diversity of genuine signatures as a partial drift of speed during the signing procedure, it is appropriate to apply DP matching. Here we use only x and y, which are elements of z, for matching two vectors. One sequence is the x, y elements of S; we denote it as S̄ = [p(1), …, p(n_s)] ∈ R^{2×n_s}. Another sequence to compare with it is a 2-dimensional sequence denoted here as Q = [q(1), …, q(n_q)]. For this problem, plain DP matching did not work sufficiently well, yet we considered this step to be particularly important for verification. So we proposed the matching method in [5], which applies DP matching and coordinate modification iteratively. Next we summarise the DP matching and coordinate modification procedures.
3.1 DP Matching
DP matching as used here is the method that derives the corresponding points giving the minimal cost for two vector sequences, and the following recursive form is well known:

D(i, j) = min{ D(i−1, j) + d(i, j),  D(i−1, j−1) + 2 d(i, j),  D(i, j−1) + d(i, j) }    (1)
where d(i, j) is the distance between p(i) and q(j), which is usually the Euclidean distance.
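A compact illustration of recursion (1), assuming two 2-D point sequences and Euclidean point distances, is given below. The function name and array layout are our own choices; this is an illustrative sketch, not the authors' implementation.

    import numpy as np

    def dp_match(P, Q):
        """DP matching following recursion (1). P: (ns, 2), Q: (nq, 2) arrays of
        x, y points. Returns the total cost and the matched index pairs (i, j)."""
        ns, nq = len(P), len(Q)
        d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)  # pairwise distances
        D = np.full((ns, nq), np.inf)
        D[0, 0] = d[0, 0]
        for i in range(ns):
            for j in range(nq):
                if i == 0 and j == 0:
                    continue
                cands = []
                if i > 0:
                    cands.append(D[i - 1, j] + d[i, j])
                if i > 0 and j > 0:
                    cands.append(D[i - 1, j - 1] + 2 * d[i, j])
                if j > 0:
                    cands.append(D[i, j - 1] + d[i, j])
                D[i, j] = min(cands)
        # backtrack to recover the matched index pairs
        path, i, j = [(ns - 1, nq - 1)], ns - 1, nq - 1
        while (i, j) != (0, 0):
            steps = []
            if i > 0:
                steps.append((D[i - 1, j], (i - 1, j)))
            if i > 0 and j > 0:
                steps.append((D[i - 1, j - 1], (i - 1, j - 1)))
            if j > 0:
                steps.append((D[i, j - 1], (i, j - 1)))
            i, j = min(steps)[1]
            path.append((i, j))
        return D[-1, -1], path[::-1]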
3.2 Coordinate Modification
This is a procedure for moving the points of the test signature toward the sample signature. The model is the following:

x′(k) = a(k) x(k) + b(k)    (2)
y′(k) = a(k) y(k) + c(k)    (3)

where k denotes the time index, x(k) and y(k) are the elements of p(k), x′(k) and y′(k) are the transformed values which are taken as the elements of q(k), and a(k), b(k), c(k) are time-variant transformation parameters. Now we define the parameter vector θ as θ(k) = [a(k) b(k) c(k)]^T. If the parameters are allowed to change slowly and independently, the model can be written as follows:

θ(k + 1) = θ(k) + w(k)    (4)

where w(k) is a zero-mean random vector with covariance matrix Q = diag(q₁, q₂, q₃). If the diagonal elements q₁, q₂, q₃ are small, the elements of θ(k) are only allowed to change slowly. Now we fix the template data p(k). Then it is expected that the test data transformed by (2)-(3) is close to the template data p(k). We can express this assertion by the following equation, obtained from equations (2) and (3):

z(k) = H(k) θ(k) + v(k)    (5)

where z(k) = [x′(k) y′(k)]^T is the test data, and

H(k) = [ x(k)  1  0
         y(k)  0  1 ]    (6)

includes the elements of the model data, and v(k) is a 2-dimensional random vector independent of w, with zero mean and covariance R; it absorbs the unmodelled portion of the observation data. Based on equations (4) and (5), we can estimate the time-variant parameter θ(k) by using linear estimation theory. If we had to estimate the parameters on-line, we would use a Kalman filter or a fixed-lag smoother. Here, however, the data must also be matched by DP matching, which requires processing the data off-line. Since we work in an off-line manner, a "fixed-interval smoother" yields the best result, and hence we use the fixed-interval smoother.
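For concreteness, the following is a minimal sketch of estimating θ(k) from the state-space model (4)-(5) with a Kalman filter followed by a Rauch-Tung-Striebel fixed-interval smoother. It is a generic textbook implementation; the function name, array shapes, and the choice of Q, R, θ(0) and P(0) are our own assumptions, not the authors' code.

    import numpy as np

    def estimate_theta(z, H, Q, R, theta0, P0):
        """z: (T, 2) matched test points, H: (T, 2, 3) built from p(k) as in (6),
        Q: (3, 3), R: (2, 2). Returns the smoothed parameter sequence (T, 3)."""
        T, n = len(z), len(theta0)
        theta_p = np.zeros((T, n)); P_p = np.zeros((T, n, n))   # predicted
        theta_f = np.zeros((T, n)); P_f = np.zeros((T, n, n))   # filtered
        th_pred, P_pred = np.asarray(theta0, float), np.asarray(P0, float)
        for k in range(T):
            theta_p[k], P_p[k] = th_pred, P_pred
            S = H[k] @ P_pred @ H[k].T + R                      # innovation covariance
            K = P_pred @ H[k].T @ np.linalg.inv(S)              # Kalman gain
            th = th_pred + K @ (z[k] - H[k] @ th_pred)
            P = (np.eye(n) - K @ H[k]) @ P_pred
            theta_f[k], P_f[k] = th, P
            th_pred, P_pred = th, P + Q                         # random-walk prediction (F = I)
        theta_s = theta_f.copy(); P_s = P_f.copy()
        for k in range(T - 2, -1, -1):                          # RTS backward pass
            G = P_f[k] @ np.linalg.inv(P_p[k + 1])
            theta_s[k] = theta_f[k] + G @ (theta_s[k + 1] - theta_p[k + 1])
            P_s[k] = P_f[k] + G @ (P_s[k + 1] - P_p[k + 1]) @ G.T
        return theta_s

Each row of the returned sequence gives the smoothed a(k), b(k), c(k) used to transform the test points via (2)-(3) before the next DP matching iteration.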
4 Verification Method
We have constructed the signature verification method based on the matching of signatures. Here we will describe the method in detail.
4.1 Normalisation of Training Data
We have found experimentally that data normalisation is very useful before matching. Let p be the original 2-D signal, and let R be the covariance matrix of p. By orthogonalisation we have RV = V Λ, where V is an orthonormal matrix and Λ is a diagonal matrix. Since V is orthonormal, we have V^{-1} = V^T and V^T R V = Λ.
Now, let p̄ be the mean of p. Also define r̄ = V^T p̄ and r = V^T p − r̄. Then we have

E[r r^T] = V^T E[(p − p̄)(p − p̄)^T] V = Λ.

If we want the data whose standard deviation along the horizontal axis is 1000, we can normalise it by

r = (1000/√λ₁) (V^T p − r̄),

and thus we have

E[r r^T] = diag(10⁶, (λ₂/λ₁) × 10⁶).
4.2 Matching Data
Here we apply the iterative procedure of DP matching and coordinate modification of the data as described in section 3.
4.3 Extracting Feature Vectors
After the matching of two signatures s and t_i, we can use the squared distances between their elements as

d(i) = [d₁(i) d₂(i) ⋯ d₇(i)],  i = 1, …, m or m(j).    (7)

The above criteria are common to the genuine signatures and the forgeries. Each of the elements in (7) is the mean square difference between the template signature and the training signature, and each criterion has the following meaning. Suppose (i, j) is an index pair after matching for one of the vectors. Then we trace back the original position (i′, j′) and calculate the velocity by

vel = (x(i′−1) − x(i′))² + (y(j′−1) − y(j′))² + (x(i′) − x(i′+1))² + (y(j′) − y(j′+1))²    (8)
Table 1. Elements of Criteria

  element number   meaning
  1                length of the data
  2                modified coordinate x after matching
  3                modified coordinate y after matching
  4                pressure of matched points
  5                angle of matched points
  6                direction of matched points
  7                velocity of matched points
This trace back is necessary because the matched index does not necessarily proceed smoothly, and so would not show the actual velocity of the signature. The meaning of the other elements is straightforward, so we omit the detailed explanation.
4.4 Model Building by RBF Network
Thus we have the verification model. A training vector d(i) is 7-dimensional, and the output o(i) is 1 (for genuine) or 0 (for forgery). We have m cases for o(i) = 1 and m(1) + ⋯ + m(n) cases for o(i) = 0. However, the scales of the criteria vary a lot, so we need further normalisation. Hence we normalised the criteria as follows:

1. Find the maximum max_j among the j-th elements of the training vectors, and
2. Divide all j-th elements by max_j.

This max_j is used again in the verification. Let us denote the training data by d(k), k = 1, …, N. By using all the samples as kernels of the function, we have the model

f(x) = Σ_{k=1}^{N} θ_k exp(−‖x − d_k‖² / σ²)    (9)

for a general x ∈ R⁷. This can be rewritten as

Y = M θ    (10)

where Y ∈ R^N, M ∈ R^{N×N}, and θ ∈ R^N. By using the training data d_k, k = 1, …, N, we obtain the N × N matrix M, whose rank is N. So we have a unique parameter θ. It is clear that equation (9) produces the exact values for the training data d_k, k = 1, …, N. Various kinds of neural networks can adapt to nonlinear, complicated boundary problems. However, we applied the RBF network for a special reason. It has
a universal approximation property like the multi-layer perceptron, but the RBF network has another property that is good for pattern recognition of this kind: due to its functional form, it intrinsically produces nearly zero output if the input vector is not similar to any of the training data. Other recognisers such as the multi-layer perceptron do not have this property.
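A minimal sketch of the exact interpolation in equations (9)-(10) follows; function names and the NumPy-based linear solve are our own assumptions, not the authors' implementation.

    import numpy as np

    def fit_rbf(D, o, sigma):
        """D: (N, 7) normalised training vectors, o: (N,) labels (1 genuine, 0 forgery).
        Returns theta and a predictor f(x) implementing equation (9)."""
        sq = np.sum((D[:, None, :] - D[None, :, :]) ** 2, axis=2)   # pairwise ||d_i - d_k||^2
        M = np.exp(-sq / sigma ** 2)                                # N x N kernel matrix
        theta = np.linalg.solve(M, o)                               # unique theta when M has full rank
        def f(x):
            return np.exp(-np.sum((x - D) ** 2, axis=1) / sigma ** 2) @ theta
        return theta, f

A test feature vector is first normalised with the stored max_j values and then accepted as genuine when f(x) exceeds a threshold (0.4 in the experiments reported below).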
4.5 Verification
We can use the model to verify whether a given signature is genuine or a forgery. When we receive the data, we process it as follows:
1. Normalise the original data.
2. Normalise the feature vector.
3. Input the feature vector to the recogniser.
5 Experimental Results
Fig. 1 shows the ranges of each criterion. The sub-figures are numbered in order from the top-left to the top-right and then to the second row. Sub-figure 1 shows the values of d₁(i), i = 1, …, 9; the same holds for sub-figure 2 with the values d₂(i), i = 1, …, 9, and so on. The first bar corresponds to the genuine signature. Using each criterion alone, it is almost impossible to divide the space clearly. However, by using the RBF network we have obtained a good separation result, as shown in Fig. 2. The first 15 cases are the genuine signatures and all the others are forgeries. If we put the threshold at 0.4, the errors of the first kind are 3 out of 15 and the errors of the second kind are 3 out of 135. This is a fairly good result.
Fig. 1. Ranges of the criteria
Fig. 2. Output of RBF network for test data (output value versus sample #)
6 Conclusions
The performance of the algorithm should improve if we apply the matching procedure more times. We will also have to apply the algorithm with other persons' data as the template. Further comparison will be our future work.
References

[1] L. L. Lee, T. Berger and E. Aviczer, "Reliable On-line Human Signature Verification Systems," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 18, pp. 643-647, 1996.
[2] M. L. Minsky and S. A. Papert, Perceptrons – Expanded Edition, MIT, 1969.
[3] V. S. Nalwa, "Automatic On-line Signature Verification," Proceedings of the IEEE, Vol. 85, pp. 215-239, 1997.
[4] R. Plamondon and S. N. Srihari, "On-line and Off-line Handwriting Recognition: A Comprehensive Survey," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 22, pp. 63-84, 2000.
[5] M. Tanaka, Y. Ishino, H. Shimada and T. Inoue, "DP Matching Using Kalman Filter as Pre-processing in On-line Signature Verification," 8th International Workshop on Frontiers in Handwriting Recognition, pp. 502-507, Niagara-on-the-Lake, Ontario, Canada, August 6-8, 2002.
[6] M. Tanaka, Y. Ishino, H. Shimada and T. Inoue, "Dynamical Scaling in Online Hand-written Characters' Matching," 9th International Conference on Neural Information Processing, Singapore, November 19-22, 5 pages (CD ROM), 2002.
[7] M. Tanaka, Y. Ishino, H. Shimada, T. Inoue and A. Bargiela, "Analysis of Iterative Procedure of Matching Two Drawings by DP Matching and Estimation of Time-Variant Transformation Parameters," The 34th International Symposium on Stochastic Systems Theory and Its Applications, accepted.
On the Accuracy of Rotation Invariant Wavelet-Based Moments Applied to Recognize Traditional Thai Musical Instruments

Sittisak Rodtook and Stanislav Makhanov

Information Technology Program, Sirindhorn International Institute of Technology, Thammasat University, Pathumthani 12121, Thailand
{sittisak,makhanov}@siit.tu.ac.th

Abstract. Rotation invariant moments constitute an important technique applicable to a wide range of applications associated with pattern recognition. However, although the moment descriptors are invariant with regard to spatial transformations, in practice the spatial transformations themselves affect the invariance. This phenomenon jeopardizes the quality of pattern recognition. Therefore, this paper presents an experimental analysis of the accuracy and the efficiency of discrimination under the impact of rotation. We evaluate experimentally the behavior of the noise induced by the rotation versus the most popular wavelet-based basis functions. As an example, we consider a particular but interesting case: the traditional Thai musical instruments. Finally, we present a semi-heuristic pre-computing technique to construct a set of descriptors suitable for discrimination under the impact of spatial transformations.
1 Introduction
It has been very well documented that the performance of pattern recognition may critically depend on whether the employed descriptors are invariant with respect to spatial transformations. A popular class of invariant shape descriptors is based on moment techniques, first introduced by Hu [1]. However, a dramatic increase in complexity when increasing the order makes Hu's moments impractical. Shortly after Hu's paper, a variety of invariant moment-based techniques designed to recognize moving objects were proposed and analyzed [2]-[6]. The major developments of the rotation-invariant moment-based methods are the orthogonal Zernike, orthogonal Fourier-Mellin and complex moments. Finally, Shen [3] introduced the wavelet moments and showed that the multi-resolution accomplished by the wavelets makes it possible to construct adaptable moment descriptors best suited for a particular set of objects. Sensitivity of the moment descriptors to image noise has been repeatedly mentioned in the literature. An interesting consequence of this is that the moment descriptors are invariant only when they are computed from the ideal analog images. Even in the absence of noise induced by physical devices, there
always exists a noise due to the spatial transformations. Therefore, although the moment descriptors are invariant with regard to the spatial transformations, in practice the spatial transformations themselves affect the invariance. This phenomenon could seriously jeopardize the quality of pattern recognition. Therefore, this paper considers the accuracy and discriminative properties of the wavelet-based moment descriptors under the impact of rotation. We perform the experiments as applied to an interesting case: the Thai musical instruments. The impact of the rotation transforms is profound due to the elongated shape of the instruments. We analyze the range of the errors and the accuracy. Finally, we present a combination of a variance-based procedure with a semi-heuristic pre-computing technique to construct the appropriate wavelet-based descriptors.
2 Rotationally Invariant Wavelet-Based Moment
A general rotation invariant moment M^{Type}_{Order}, with respect to a moment function F^{Type}_{Order}(r, θ) defined in polar coordinates (with the origin at the centroid of the object), is given by

M^{Type}_{Order} = ∫₀^{2π} ∫₀^{1} f(r, θ) F^{Type}_{Order}(r, θ) r dr dθ.

We assume that F^{Type}_{Order}(r, θ) = R^{Type}_{Ω}(r) G_α(θ), where R^{Type}_{Ω}(r) denotes the radial moment function and G_α(θ) the angular moment function. Note that the angular function defined by G_α(θ) = e^{iαθ} provides rotational invariance, since it leads to the circular Fourier transform. Therefore |M^{Type}_{Order}| = |MR^{Type}_{Order}| [3], where MR^{Type}_{Order} denotes the corresponding moment of the object rotated by an angle φ. In other words, rotation of the object affects the phase but not the magnitude. In order to compute the rotation invariant moments of a given image, we first calculate the centroid of the object image. Next, we define the polar coordinates and map the object onto the parametric unit circle. Finally, we represent the integral by

M^{Type}_{Order} = ∫₀^{2π} ∫₀^{1} f(r, θ) F^{Type}_{Order}(r, θ) r dr dθ = ∫₀^{1} R^{Type}_{Ω}(r) S_α(r) r dr,

where S_α(r) = ∫₀^{2π} f(r, θ) G_α(θ) dθ. The continuous image function is obtained by standard bilinear interpolation applied to the discrete image. Finally, not only |M^{Type}_{Order}| is rotation invariant, but ∫₀^{1} R^{Type}_{Ω}(r) |S_α(r)| r dr and |∫₀^{1} R^{Type}_{Ω}(r) S_α(r) r dr| are rotation invariant as well.
It is plain that, from the viewpoint of functional analysis, each object is represented by an infinite and unique set of rotational invariants. In other words, if the set R^{Type}_{Ω}(r) constitutes a basis in L²[0, 1], then S_α(r) can be represented with a prescribed accuracy. However, in practice we always have a finite set of moment descriptors affected by noise. Wavelets, being well localized in time and frequency, are an efficient tool to construct appropriate moment descriptors. Consider a wavelet-based radius function given by R_{m,n}(r) = (1/√m) ψ((r − n)/m), where ψ(r) is the mother wavelet, m the dilation parameter (the scale index) and n the shifting parameter. Unser et al. show the usefulness of the biorthogonal spline wavelet transform for texture analysis [7]. In this paper we apply the cubic B-spline to construct the appropriate moment descriptors. In the case of a B-spline the mother wavelet is given by

ψ(r) = (4 a^{k+1} / √(2π(k + 1))) σ_w cos(2π f₀ (2r − 1)) exp(−(2r − 1)² / (2σ_w²(k + 1))),

where k = 3, a = 0.7, f₀ = 0.41 and σ_w² = 0.56. The basis functions are given by

ψ_{m,n}(r) = 2^{m/2} ψ(2^m r − 0.5 n).
3 Errors of the Moment Descriptors Applied to Thai Musical Instruments
We analyze the accuracy of the wavelet-based moments as applied to the Thai musical instrument images in the presence of the geometric errors induced by the rotational transforms and the subsequent binarization. Photographs of the instruments are rotated through 360°, with an increment of 5° (see Fig. 1), using Adobe Photoshop. In order to eliminate accumulation of errors due to multiple re-sampling, each rotation has been performed by rotating the original photograph corresponding to 0°. Fig. 2 shows a typical impact of rotation in the case of the spline wavelet moment. We evaluate the accuracy by measuring the standard deviation of the normalized spline moment descriptor. The error varies from 0.0 to 27.81%, with the maximum error produced by the "SUENG" rotated by 50° for |M_{0,1,5}|. The wavelets make it possible to control not only the spatial position of the basis function but the frequency range as well. However, without an appropriate adaptation, the rotation may drastically affect a wavelet descriptor. We analyze the accuracy of the spline wavelet descriptors |M_{m,n,α}|, where α is the angular order. The comparison of the accuracy versus the position and the main frequency of the wavelet basis function shows that the maximum error is only 7.21% for a 130° rotated "SAW OU" with |M_{1,2,5}|. Note that the extrema of ψ_{1,2}(r) and the extrema of |S₅(r)| almost coincide, which results in a large numerical value of the moment. Furthermore, Fig. 3 indicates the best wavelet for particular frequencies. Fig. 4 shows why the wavelets ψ_{1,4}(r), ψ_{2,1}(r), ψ_{2,4}(r) and ψ_{3,1}(r) produce poor results. The noise either has been substantially magnified at various positions
Fig. 1. Gray-scale photos and Silhouettes of the Thai musical instruments (a). ”SUENG”, Lute. (b). ”SAW SAM SAI”, Fiddle. (c). ”PEE CHAWA”, Oboe. (d). ”PEE NOKE”, Pipe. (e). ”SAW DUANG”, Fiddle. (f). ”SAW OU”, Fiddle
Fig. 2. Impact of rotation on |S₅(r)|

by ψ_{2,1}(r) and ψ_{2,4}(r), or has been "washed out" along with the peak itself by ψ_{1,4}(r) or ψ_{3,1}(r). In other words, although ψ_{1,4}(r) and ψ_{3,1}(r) eliminate the noise, they "wash out" information about the object as well. Such functions are easily detected by the energy threshold (see for instance [9]). Unfortunately, the rotation noise in the frequency domain often appears at low frequencies, which also characterize the signal. The rotated object produces a shifted version of S_α(r); in other words, the noise often "replicates" S_α(r). That is why it is difficult to construct a conventional filter in this case.
Table 1. The standard deviation of the normalized spline wavelet moment magnitude |M_{m,n,5}| corresponding to "SAW OU"

  m    n=0      n=1      n=2      n=3      n=4
  0    0.0800   0.0255   0.0221   0.0231   0.0225
  1    0.0438   0.0213   0.0201   0.0455   0.0679
  2    0.0450   0.0714   0.0270   0.0230   0.0681
  3    0.1093   0.0526   0.1053   0.1009   0.1534
Fig. 3. ψ_{m,n}(r) versus |S₅(r)|
Fig. 4. Rotation noise. (a) |S₅(r)| (solid line: the original "SAW OU"; dashed-dotted line: 335° rotated "SAW OU"). (b)-(f) Wavelet versus |S₅(r)|. (g)-(k) Normalized |ψ(r)S₅(r)r|. (b),(g) ψ_{1,2}(r); (c),(h) ψ_{1,4}(r); (d),(i) ψ_{2,1}(r); (e),(j) ψ_{2,4}(r); (f),(k) ψ_{3,1}(r)
4 Pre-Computing Techniques to Classify the Wavelet-Based Descriptors Applied to the Thai Musical Instruments
Our experiments reveal that it is difficult to find an appropriate moment descriptor which provides both an accurate and a discriminative representation. The most accurate moment descriptors might be different for different instruments, and it is not always possible to find one small set of wavelet basis functions suitable to represent all the angular moments. Note that in the case of dissimilar, unsymmetrical objects, a moment descriptor suitable for discrimination can be derived with α = 0. However, if the objects have similar shapes, the zero angular order is not sufficient. Moreover, it is difficult to decide which angular order is the most representative, since different angular orders magnify different frequencies of the noise. We therefore find the best discriminative moment descriptor for each good angular order and construct an appropriate vector of wavelet-based moment descriptors (such as (|M_{m1,n1,q1}|, |M_{m2,n2,q2}|, …, |M_{mk,nk,qk}|)). Furthermore, given the magnitudes of the wavelet-based moments, we apply the variance-based classification technique (Otsu's algorithm) introduced in [8], which invokes the least inner-class / largest inter-class variance ratio representing the discriminative measures of the descriptors.
Fig. 5. (a)-(b) |FT(S₅(r))| (†: the original "SAW OU", ∞: 335° rotated "SAW OU"). (c) |FT(N)|, N: the spatial noise in the frequency domain

Next, we propose a concatenation of the above variance-based procedure with the pre-computing techniques as follows:

1. Pre-compute an appropriate set of angular orders α = {q₁, q₂, …, q_n} by considering the least circular Fourier transform square error for each q_l,

   ε_{q_l} = (1/(M N)) Σ_{i=1}^{M} Σ_{j=1}^{N} ( Σ_{k=1}^{N′} diff_{i,j}(r_k)² Δr ),  Δr = 1/N′,  r_k = kΔr,  k = 0, 1, 2, …, N′,

   where diff_{i,j}(r_k) = S_{q_l}(r_k)_{i,Original} − S_{q_l}(r_k)_{i,j}. Here N′ denotes the number of points employed for numerical integration, N the number of rotations and M the number of instruments.
2. Select a set of wavelet basis functions ψ_{m,n}(r). The functions must belong to a basis in L²[0, 1].
3. Check the following condition: if |S_α(r)| is large at r, then at least one of the basis functions ψ must be large at r. The condition can be replaced by ⋃_i {r : ψ_i(r) > ε} = [0, 1] for some ε.
4. For each angular order and for each musical instrument, find the set of moment descriptors having the best normalized standard deviation.
5. Threshold the wavelet-based moment descriptors by the energy [9].
6. Collect the best descriptors.
7. Apply the variance-based technique to the set of descriptors for each angular order (a small illustrative sketch of this step is given below).
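The sketch below illustrates the variance-based selection of step 7: for a pair of instrument classes it ranks candidate descriptors by the ratio of within-class to between-class variance of their magnitudes over the rotated training images. The two-class grouping, function names and NumPy usage are our own assumptions; this is not the authors' implementation of Otsu's method.

    import numpy as np

    def variance_ratio(values_a, values_b):
        """Within-class / between-class variance ratio for one descriptor and two
        instrument classes; smaller means more discriminative."""
        a, b = np.asarray(values_a, float), np.asarray(values_b, float)
        within = 0.5 * (a.var() + b.var())
        between = (a.mean() - b.mean()) ** 2 + 1e-12
        return within / between

    def best_descriptor(candidates, samples_a, samples_b):
        """candidates: descriptor names; samples_x[name]: magnitudes of that
        descriptor over all rotated images of instrument x."""
        return min(candidates,
                   key=lambda c: variance_ratio(samples_a[c], samples_b[c]))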
The technique applied to the case of the Thai musical instruments makes it possible to decrease the computational time by 10-15%. The discriminative measures in Table 2 demonstrate that the B-spline wavelet moment descriptors are well separated. The discriminative measure averages of the first three best discriminative descriptors, |M_{1,2,5}|, |M_{1,0,5}| and |M_{1,3,5}| at α = 5, are 0.0726, 0.0915 and 0.1163 respectively. The energy threshold is 2.0.
5 Conclusion
Rotations produce noise which can substantially affect the quality of descriptors and their discriminative properties. Wavelets constitute a suitable class
Table 2. The appropriate spline wavelet descriptors for discrimination and discriminative measures at α = 5

            SAM SAI            CHA WA             NOKE               DUANG              OU
  SUENG     |M1,0,5| 0.0039    |M1,0,5| 0.0035    |M1,2,5| 0.0013    |M0,1,5| 0.0031    |M1,0,5| 0.0043
  SAM SAI                      |M0,1,5| 0.0149    |M1,2,5| 0.0039    |M1,3,5| 0.0054    |M1,3,5| 0.0121
  CHA WA                                          |M1,2,5| 0.0043    |M1,3,5| 0.0039    |M0,3,5| 0.0064
  NOKE                                                               |M1,0,5| 0.0054    |M1,2,5| 0.0023
  DAUNG                                                                                 |M1,2,5| 0.0030
of basis functions to perform recognition under the impact of rotations. The proposed algorithm, based on standard deviation and energy thresholds combined with the variance-based procedure, makes it possible to efficiently construct a set of descriptors suitable for discrimination under the impact of rotation. The best discrimination properties for the set of Thai musical instruments are displayed by {|M_{0,1,1}|, |M_{1,2,3}|, |M_{1,2,5}|, |M_{0,1,6}|, |M_{1,1,7}|}, with appropriate α = {1, 3, 5, 6, 7}.
References

[1] Hu, M. K.: Visual pattern recognition by moment invariants. IRE Trans. Inform. Theory, 8 (1962) 179-187
[2] Liao, S. X., Pawlak, M.: On the Accuracy of Zernike Moments for Image Analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 20, 12 (1998) 1358-1364
[3] Shen, D., Ip, H. H.: Discriminative Wavelet Shape Descriptors for Recognition of 2-D Patterns. Pattern Recognition, 32 (1999) 151-165
[4] Kan, C., Srinath, M. D.: Orthogonal Fourier-Mellin moments with centroid bounding circle scaling for invariant character recognition. 1st IEEE Conf. on Electro Information Technology, Chicago, Illinois (2000)
[5] Kan, C., Srinath, M. D.: Invariant Character Recognition with Zernike and Orthogonal Fourier-Mellin Moments. Pattern Recognition, 35 (2002) 143-154
[6] Flusser, J.: On the inverse problem of rotation moment invariants. Pattern Recognition, 35 (2002) 3015-3017
[7] Unser, M., Aldroubi, A., Eden, M.: Family of polynomial spline wavelet transforms. Signal Process., 30 (1993) 141-162
[8] Otsu, N.: A Threshold Selection Method from Gray Level Histograms. IEEE Trans. on Systems, Man, and Cybernetics, SMC-9 (1985) 377-393
[9] Thuillard, M.: Wavelets in Soft Computing. Series in Robotics and Intelligent Systems, 25, World Scientific, London, 84-85
A Multi-agent System for Knowledge Management in Software Maintenance

Aurora Vizcaíno¹, Jesús Favela², and Mario Piattini¹

¹ Escuela Superior de Informática, Universidad de Castilla-La Mancha
{avizcain,mpiattini}@inf-cr.uclm.es
² CICESE, Ensenada, México
[email protected]

Abstract. Knowledge management has become an important topic as organisations wish to take advantage of the information that they produce and that can be brought to bear on present decisions. This work describes a system to manage the information and knowledge generated during the software maintenance process, which consumes a large part of the software lifecycle costs. The architecture of the system is formed from a set of agent communities, each of which is in charge of managing a specific type of knowledge. The agents can learn from previous experience and share their knowledge with other agents or communities.
Keywords: Multi-agent systems, Knowledge Management, Software Maintenance
1 Introduction
Knowledge is a crucial resource for organizations: it allows them to fulfil their mission and become more competitive. For this reason, companies are currently researching techniques and methods to manage their knowledge systematically. In fact, nearly 80% of companies worldwide have some knowledge management effort under way [5]. Organizations have different types of knowledge that are often related to each other and which must be managed in a consistent way. For instance, software engineering involves the integration of various knowledge sources that are constantly changing. The management of this knowledge, and how it can be applied to software development and maintenance efforts, has so far received little attention from the software engineering research community [3]. Tools and techniques are necessary to capture and process knowledge in order to facilitate subsequent development and maintenance efforts. This is particularly true for software maintenance, a knowledge intensive activity that depends on information generated during long periods of time and by large numbers of people, many of whom may no longer be in the organisation. This paper presents a multi-agent system (KM-MANTIS) in charge of managing
the knowledge that is produced during software maintenance. The contents of this paper are organized as follows: Section 2 presents the motivation for using a Knowledge Management (KM) system to support software maintenance. Section 3 describes the implementation of KM-MANTIS. Finally, conclusions are presented in Section 4.
2 Advantages of Using a KM System in Software Maintenance
Software maintenance consumes a large part of overall lifecycle costs [2, 8]. The incapacity to change software quickly and reliably causes organizations to lose business opportunities. Thus, in recent years we have seen an important increase in research directed towards addressing these issues. On the other hand, software maintenance is a knowledge intensive activity. This knowledge comes not only from the expertise of the professionals involved in the process, but is also intrinsic to the product being maintained, to the reasons that motivate maintenance (new requirements, user complaints, etc.), and to the processes, methodologies and tools used in the organization. Moreover, the diverse types of knowledge are produced at different stages of the maintenance process (MP). Software maintenance activities are generally undertaken by groups of people. Each person has partial information that is necessary to other members of the group. If the knowledge exists only in the software engineers, and there is no system in charge of transferring the tacit knowledge (contained in the employees) to explicit knowledge (stored on paper, in files, etc.), then when an employee leaves the organization a significant part of the intellectual capital goes with him/her. Another well-known issue that complicates the MP is the scarce documentation related to a specific software system; even if detailed documentation was produced when the original system was developed, it is seldom updated as the system evolves. For example, legacy software written by other units often has little or no documentation describing the features of the software. By using a KM system the diverse kinds of knowledge generated may be stored and shared. Moreover, new knowledge can be produced, obtaining maximum benefit from the current information. By reusing information and producing relevant knowledge, the high costs associated with software maintenance could also be decreased [4]. Another advantage of KM systems is that they help employees build a shared vision, since the same codification is used and misunderstandings in staff communications may be avoided. Several studies have shown that a shared vision may hold together a loosely coupled system and promote the integration of an entire organisation [7].
3 A Multi-agent System to Manage Knowledge in Software Maintenance
The issues explained above motivated us to design a KM system to capture, manage, and disseminate knowledge in a software maintenance organisation, thus increasing
the workers' expertise, the organisation's knowledge and its competitiveness, while decreasing the costs associated with the software MP. KM-MANTIS is a multi-agent system where different types of agent manage the diverse types of information generated during the software maintenance process. Agents interchange data and take advantage of the information and experience acquired by other agents. In order to foster the interchange of information, the system uses an open format to store data and metadata, XMI (XML Metadata Interchange) [6]. This is an important advantage of the system, since data and metadata defined with other tools that support XMI can also be managed by KM-MANTIS, and it facilitates the interchange of information between agents since they all use the same information representation.
3.1 The KM-MANTIS Architecture
The system is formed of a set of agent communities which manage different types of knowledge. There are several reasons why agents are recommendable for managing knowledge. First of all, agents are proactive, this means they decide to act when they find it necessary to do so. Moreover, agents can manage both distributed and local information. This is an important feature since software maintenance information is generated by different sources and often from different places. One aspect related to the previous one is that agents may cooperate and interchange information. Thus, each agent can share its knowledge with others or ask them for advice, benefiting from the other agents' experience. Therefore, there is reuse and knowledge management in the architecture of the system itself. Another important issue is that agents can learn from their own experience. Consequently the system is expected to become more efficient with time since the agents have learnt from their previous mistakes and successes. On the other hand, each agent may utilize different reasoning techniques depending on the situation. For instance, an agent can use an ID3 algorithms to learn from previous experiences and use case-based reasoning to advise a client on how to solve a problem. The rationale for designing KM-MANTIS with several communities is that during the software MP different types of information are produced, each with its own specific features. The types of information identified were: information related to the products to be maintained; information related to the activities to be performed in order to maintain the products; and, peopleware involved during software maintenance [10]. Therefore, KM-MANTIS has three communities: a community termed the "products community", another called the “activities community”, and the last community denoted as the "peopleware community". In what follows we describe each community in more detail. Products Community: This community manages the information related to the products to be maintained. Since each product has its own features and follows a specific evolution this community has one agent per product. The agents have information about the initial requirements, changes made to the product, and about metrics that evaluate features related to the maintainability of the product, (this
information is obtained from different documents such as modification requests, see Figure 1, perfective, corrective or preventive actions performed or product measurements). Therefore, the agents monitor the product's evolution in order to have up to date information about it at each moment. Each time an agent detects that information about its product is being introduced or modified in KM-MANTIS (the agent detects this when the application identification number that it represents is introduced or displayed in the interface of KM-MANTIS) the agent analyses the new information, or comparing it to that previously held in order to detect inconsistencies, or checking the differences and storing the relevant information in order to have up-to-date information. Information relevant to each product (data) is stored in an XMI repository. The XMI repository also stores rules (knowledge) produced by the agents through induction and decision trees-based algorithms. The decision to use XMI documents based on the MOF (Meta Object Facility) standard makes it possible for agents to have access to the different levels of information and knowledge that they need to process and classify their information and the queries that they receive. Activities community: Each new change demanded implies performing one or more activities. This community, which has one agent per activity, is in charge of managing the knowledge related to the different activities including methods, techniques and resources used to perform an activity.
Fig. 1. KM-MANTIS Interface
Activities agents can also obtain new knowledge from their own experience or by being taught. For instance, an activity agent learns which person usually carries out a specific activity or which techniques are most often used to perform an activity. Furthermore, activities agents use case-based reasoning techniques in order to detect whether a similar change was previously requested under analogous circumstances. When this is the case, the agent informs the users of how that problem was previously solved, taking advantage of the organization's experience.
Peopleware Community: Three profiles of people can be clearly differentiated in the MP [9]: the maintenance engineer, the customer and the user. The peopleware community has three types of agent, one per profile detected. One agent is in charge of the information related to staff (maintainers): this is the staff agent. Another manages information related to the clients (customers) and is called the client agent. The last one is in charge of the users and is termed the user agent. The staff agent monitors the personal data of the employees, the activities they have worked on, and the products they have maintained. Of course, the agent also has current information about each member of the staff; therefore it knows where each person is working at each moment. The agent can utilise the information it has to generate new knowledge. For instance, the staff agent calculates statistics to estimate the performance of an employee. The client agent stores the information of each client, their requirements (including the initial requirements, if available) and requests. The client agent also tries to gather new knowledge. For instance, it tries to anticipate future requirements from previous requirements, or it estimates the costs of changes that the client wants to make, warning him, for instance, of the high costs associated with a specific change request. The user agent is in charge of knowing the requirements of the users of each product, their background and also their complaints and comments about the products. New knowledge could also be generated from this information, for example by testing to what degree the users' characteristics influence the maintenance of the product. Each type of agent has a database containing the information that it needs. In this case there is no community repository because there is no data common to the three types of agents.
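As a purely illustrative sketch of the community-based routing described above, the fragment below shows agents of the three communities inspecting a maintenance event and reacting only when it concerns their product, activity or staff member. The classes, message fields and plain-Python style are our own assumptions for illustration; KM-MANTIS itself is a JADE/FIPA-based system, not this code.

    class ProductAgent:
        def __init__(self, product_id):
            self.product_id, self.history = product_id, []
        def on_event(self, event):
            if event.get("product_id") == self.product_id:
                self.history.append(event)          # keep product knowledge up to date

    class ActivityAgent:
        def __init__(self, activity):
            self.activity, self.cases = activity, []
        def on_event(self, event):
            if event.get("activity") == self.activity:
                self.cases.append(event)            # case base for case-based reasoning

    class StaffAgent:
        def __init__(self):
            self.assignments = {}
        def on_event(self, event):
            engineer = event.get("engineer")
            if engineer is not None:
                self.assignments.setdefault(engineer, []).append(event)

    def broadcast(event, communities):
        """Every agent of every community sees the event and decides whether to act."""
        for community in communities:
            for agent in community:
                agent.on_event(event)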
3.2 Implementation Considerations
In order to manage the XML documents, different middleware alternatives were studied, some being object-relational databases such as ORACLE 9 and Microsoft SQL Server 2002, and others, such as Tamino, designed specifically for XML documents. Finally Tamino, a Software AG product, was chosen, because KM-MANTIS needs to store a huge amount of XML documents and manage them efficiently, and the fact that an object-relational database needs to translate XML to tables and vice versa considerably reduces its efficiency. On the other hand, the platform chosen for creating the multi-agent system is JADE [1], which is FIPA compliant. The agent communication language is FIPA ACL. Agents interchange information in order to take advantage of the knowledge that
others have; the architecture itself thereby supports the reuse of information and knowledge.
4 Conclusions
Software maintenance is one of the most important stages of the software life cycle. This process takes a great deal of time and effort and incurs significant costs. It also generates a huge amount of different kinds of knowledge that must be suitably managed. This paper describes a multi-agent system in charge of managing this knowledge in order to improve the MP. The system has different types of agents in order to deal with the different types of information produced during the software maintenance process. Agents generate new knowledge and take advantage of the organization's experience. In order to facilitate the management of data and metadata (knowledge), XMI repositories have been used. XMI uses the MOF standard, which enables the description of information at different levels of abstraction.
Acknowledgements This work is partially supported by the TAMANSI project (grant number PBC-02001) financed by the Consejería de Ciencia y Tecnología of the Junta de Comunidades de Castilla-La Mancha.
References

[1] Bellifemine, A., Poggi, G., and Rimassa, G. Developing multi-agent systems with a FIPA-compliant agent framework. Software Practice & Experience, (2001) 31: 103-128.
[2] Bennet, K.H., and Rajlich, V.T. (2000). Software Maintenance and Evolution: a Roadmap. In Finkelstein, A. (Ed.), The Future of Software Engineering, ICSE 2000, June 4-11, Limerick, Ireland, pp. 75-87.
[3] Henninger, S., and Schlabach, J. (2001). A Tool for Managing Software Development Knowledge. 3rd International Conf. on Product Focused Software Process Improvement, PROFES 2001, Lecture Notes in Computer Science, Kaiserslautern, Germany, pp. 182-195.
[4] de Looff, L.A. Information Systems Outsourcing Decision Making: a Managerial Approach. Hershey, PA: Idea Group Publishing, 1990.
[5] Mesenzani, M., Schael, T., and Alblino, S. (2002). Multimedia platform to support knowledge processes anywhere and anytime. In Knowledge-Based Intelligent Information Engineering Systems & Allied Technologies (KES 2002), Damiani, E., Howlett, R.J., Jain, L.C., Ichalkaranje, N. (Eds.), pp. 1434-1435.
[6] OMG XML Metadata Interchange (XMI), v. 1.1, Nov-2000.
[7] Orton, J.D., and Weick, K.E. (1990). Loosely coupled systems: A reconceptualization. Academy of Management Review, 15(2), pp. 203-223.
[8] Pigoski, T.M. (1997). Practical Software Maintenance. Best Practices for Managing Your Investment. John Wiley & Sons, USA, 1997.
[9] Polo, M., Piattini, M., Ruiz, F. and Calero, C.: Roles in the Maintenance Process. Software Engineering Notes, vol. 24, no. 4, 84-86, ACM (1999).
[10] Vizcaíno, A., Ruíz, F., Favela, J., and Piattini, M. A Multi-Agent Architecture for Knowledge Management in Software Maintenance. In Proceedings of the International Workshop on Practical Applications of Agents and Multiagent Systems (IWPAAMS'02), Salamanca, Spain, 23-25 October, (2002) 39-52.
D²G²A: A Distributed Double Guided Genetic Algorithm for Max_CSPs Sadok Bouamama, Boutheina Jlifi, and Khaled Ghédira SOI²E (ex URIASIS) SOI²E/ISG/Université Tunis, B. 204, Département d'Informatique 41 rue de la liberté, 2000 cité Bouchoucha. Tunisie
[email protected] [email protected] [email protected] Abstract. Inspired by the distributed guided genetic algorithm (DGGA), D²G²A is a new multi-agent approach which addresses Maximal Constraint Satisfaction Problems (Max_CSPs). On the one hand, GA efficiency provides good solution quality for Max_CSPs; on the other hand, the approach benefits from multi-agent principles that reduce GA temporal complexity. In addition, the approach is enhanced by a new parameter called the guidance operator, which allows not only diversification but also escape from local optima. D²G²A and DGGA have been applied to a number of randomly generated Max_CSPs, and an experimental comparison is provided in order to show the advantages of D²G²A. The guidance operator is also studied experimentally in order to determine its best value. Keywords: Maximal constraint satisfaction problems, multi-agent systems, genetic algorithms, Min-conflict-heuristic, guidance operator
1
Introduction
The CSP formalism consists of variables associated with domains and constraints involving subsets of these variables. A CSP solution is an instantiation of all variables with values from their respective domains, and this instantiation must satisfy all constraints. A CSP solution, as defined above, is costly to obtain and does not necessarily exist for every problem. In such cases, it is better to search for an instantiation of all variables that satisfies the maximal number of constraints. Such problems, called Maximal CSPs and referred to as Max_CSPs, make up the framework of this paper. Max_CSPs have been dealt with by complete or incomplete methods. The former are able to provide an optimal solution; unfortunately, combinatorial explosion thwarts this advantage. The latter, such as Genetic Algorithms [4], try to avoid the trap of local optima and sacrifice completeness for efficiency. There is another incomplete but distributed method
known as Distributed Simulated Annealing (DSA) that has been successfully applied to Max_CSPs [2]. As DSA outperforms centralized Simulated Annealing in terms of optimality and quality, the same idea is adopted here for Centralized Genetic Algorithms (CGAs), which are especially known to be expensive; the result was the distributed guided genetic algorithm for Max_CSPs [3]. This paper aims to enhance the DGGA in order to escape from local optima and to obtain better search diversification. It is organized as follows: the next section presents the Distributed Double Guided Genetic Algorithm (context and motivations, basic concepts and global dynamic); the third section details the experimental design and results; finally, concluding remarks and possible extensions of this work are proposed.
2
Distributed Double Guided Genetic Algorithm
2.1
Context and Motivations
The DGGA cannot be considered a classic GA. In classic GAs the mutation aims to diversify the population and thus to avoid population degeneration [4]. In the DGGA, the mutation operator is used unconventionally, since it is considered a betterment operator for the chromosome under consideration. However, if a gene value is absent from the population, there is no means of obtaining it by the cross-over process. Thus, it is sometimes necessary to apply a random mutation in order to generate the possibly missing gene values. The DGGA approach is a local search method. The first known improvement mechanism for local search is the diversification of the search process in order to escape from local optima [6]. No doubt the simplest mechanism to diversify the search is to add a noise component to the process: the search executes a random movement with probability p and follows the normal process with probability 1-p [5]. In figure 1, an example of a local-optimum attraction basin is introduced: in a maximization case, S2, which is a local maximum, is better than S1. Passing through S1 from S2 is considered a destruction of the solution, but it gives the search process a better chance of reaching S3, the global optimum. For all these reasons, the newly proposed approach is a DGGA enhanced by a randomness-providing operator, whose probability will be called the guidance probability Pguid. Thus, in addition to the cross-over and mutation operators, the generation number and the initial population size, the D²G²A possesses a guidance operator. The approach conserves the same basic principles and the same agent structure used in the DGGA approach [3].
Fig. 1. An example of attraction basin of local optima
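To make the role of the guidance operator concrete, the sketch below shows how a mutating step might choose between the guided and the random variant using Pguid, in the spirit of Fig. 9 further down. It is plain Java rather than the authors' Smalltalk/ACTALK implementation, and the method names are invented for the example.

import java.util.Random;

// Minimal sketch: choosing between guided and random mutation with probability Pguid.
public class MutationChoice {
    private static final Random RNG = new Random();
    private static final int DOMAIN_SIZE = 20;        // d = 20 in the experiments reported below

    // pMut: mutation probability; pGuid: guidance probability (both in [0,1]).
    static int[] mutate(int[] chromosome, double pMut, double pGuid) {
        if (RNG.nextDouble() >= pMut) {
            return chromosome;                        // no mutation for this chromosome
        }
        return (RNG.nextDouble() < pGuid)
                ? guidedMutation(chromosome)          // min-conflict style betterment
                : randomMutation(chromosome);         // pure diversification
    }

    // Stand-in for the min-conflict repair of one gene (see the Java rendering after Fig. 10).
    static int[] guidedMutation(int[] c) { return c.clone(); }

    static int[] randomMutation(int[] c) {
        int[] copy = c.clone();
        copy[RNG.nextInt(copy.length)] = RNG.nextInt(DOMAIN_SIZE);   // random gene, random value
        return copy;
    }
}

With Pguid near 1 the operator behaves like the original DGGA betterment mutation; with Pguid near 0 it degenerates into pure random noise, which is exactly the trade-off studied experimentally in Section 3.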
1. m ← getMsg (mailBox)
2. case m of
3.   optimization-process (sub-population):
4.     apply-behavior (sub-population)
5.   take-into-account (chromosome):
6.     population-pool ← population-pool ∪ {chromosome}
7.   inform-new-agent (Speciesnvc):
8.     list-acquaintances ← list-acquaintances ∪ {Speciesnvc}
9.   stop-process: stop-behavior

Fig. 2. Message processing relative to Speciesnvc
2.2
Global Dynamic
The Interface agent randomly generates the initial population and then partitions it into sub-populations according to their specificities. It then creates Species agents, to which it assigns the corresponding sub-populations, and asks these Species agents to perform their optimization processes (figure 2 line 3). Before starting its optimization process, i.e. its behavior (figure 3), each Species agent Speciesn initializes all the templates corresponding to its chromosomes (figure 3 line 3). After that it carries out its genetic process on its initial sub-population, i.e. the sub-population that the Interface agent associated to it at the beginning. This process, detailed in the following subsection, returns a sub-population "pop" (figure 3 line 4) that has been submitted to the crossing and mutating steps only once, i.e. corresponding to one generation. For each chromosome of pop, Speciesn computes the number of violated constraints "nvc" (figure 3 line 6). Consequently, two cases may occur. The first one corresponds to a chromosome violating the same number of constraints as its parents; in this case, the chromosome replaces one of them, randomly chosen (figure 3 line 8). In the second case, this number nvc is different from n, i.e. the specificity of the corresponding Speciesn. The chromosome is then sent to another Speciesnvc (figure 3 line 10) if such an agent already exists; otherwise it is sent to the Interface agent (figure 3 line 11), which creates a new agent having nvc as specificity and transmits the quoted chromosome to it. Whenever a new Species agent is created, the Interface agent informs all the other agents about this creation (figure 2 line 7) and then asks the new Species agent to perform its optimization process (figure 2 line 3). Note that message processing is given priority: whenever an agent receives a message, it stops its behavior, saves the context, updates its local knowledge, and restores the context before resuming its behavior. If no Species agent has found a chromosome violating zero constraints by the end of its behavior, each successively transmits one of its randomly chosen chromosomes, together with its specificity, to the Interface agent. The latter determines and displays the best chromosome, namely the one which violates the minimal number of constraints. The syntax used in the figures is the following: sendMsg (sender, receiver, 'message'): 'message' is sent by "sender" to "receiver"; getMsg (mailBox): retrieves the first message in mailBox.
D²G²A: A Distributed Double Guided Genetic Algorithm for Max_CSPs
Apply-behavior (initial-population)
1. init-local-knowledge
2. for i := 1 to number-of-generations do
3.   template-updating (initial-population)
4.   pop ← genetic-process (initial-population)
5.   for each chromosome in pop do
6.     nvc ← compute-violated-constraints (chromosome)
7.     if (nvc = n)
8.     then replace-by (chromosome)
9.     else if exist-ag (Speciesnvc)
10.      then sendMsg (Speciesn, Speciesnvc, 'take-into-account (chromosome)')
11.      else sendMsg (Speciesn, Interface, 'create-agent (chromosome)')
12. sendMsg (Speciesn, Interface, 'result (one-chromosome, specificity)')
Fig. 3. Behavior relative to Speciesn
genetic-process
1. mating-pool ← matching (population-pool)
2. template-updating (mating-pool)
3. offspring-pool-crossed ← crossing (mating-pool)
4. offspring-pool-mutated ← mutating (offspring-pool-crossed)
5. return offspring-pool-mutated
Fig. 4. Genetic process
Crossing (mating-pool)
  if (mating-pool size < 2) then return mating-pool
  for each pair in mating-pool do
    if (random [0,1] < Pcross)
    then offspring ← cross-over (first-pair, second-pair)
         nvc ← compute-violated-constraints (offspring)
         if (nvc = 0)
         then sendMsg (Speciesn, Interface, 'stop-process (offspring)')
         else offspring-pool ← offspring-pool ∪ {offspring}
  return offspring-pool

Fig. 5. Crossing process relative to Speciesn

cross-over (chromosomei1, chromosomei2)
1. for j := 1 to size (chromosomei1) do
2.   sum ← templatei1,j + templatei2,j
3.   if (random-integer [0, sum – 1] < templatei1,j)
4.   then genei3,j ← genei2,j
5.   else genei3,j ← genei1,j
6. return chromosomei3
Fig. 6. Cross-over operator
425
426
Sadok Bouamama et al.
Guided_Mutation (chromosomei)
1. min-conflict-heuristic (chromosomei)
2. return chromosomei

Fig. 7. Guided Mutation relative to chromosomei

Random_Mutation (chromosomei)
1. choose randomly a genei,j
2. choose randomly a value vi in the domain of genei,j
3. value (genei,j) ← vi
4. return chromosomei

Fig. 8. Random Mutation relative to chromosomei

mutating (offspring-pool)
1. for each chromosome in offspring-pool do
2.   if (random [0,1] < Pmut)
3.     if (random [0,1] < Pguid)
4.     then Guided_Mutation (chromosomei)
5.     else Random_Mutation (chromosomei)
6.   nvc* ← violated_constraints_number (chromosomei)
7.   if (nvc* = 0)
8.   then sendMsg (Speciesn, Interface, 'stop-process (chromosomei)')
9.   else offspring-pool-mutated ← offspring-pool-mutated ∪ {chromosomei}
10. return offspring-pool-mutated

Fig. 9. New mutating sub-process relative to Speciesn

min-conflict-heuristic (chromosomei)
1. δi,j ← max (templatei)   /* δi,j is associated with genei,j, which is in turn associated with the variable vj */
2. nvc* ← nc                /* nc is the total number of constraints */
3. for each value in the domain of vj do
4.   nvc ← compute-violated-constraint (value)
5.   if (nvc < nvc*)
6.   then nvc* ← nvc
7.        value* ← value
8. value (genei,j) ← value*
9. update (templatei)
10. return nvc*
Fig. 10. Min-conflict-heuristic relative to chromosomei
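For readers who prefer an executable form, the following is a small self-contained Java rendering of the min-conflict value choice of Fig. 10. The explicit constraint-checking interface is an assumption made for the example; it is not the data structure used in the authors' implementation.

// Sketch of min-conflict value selection for one variable of a binary Max_CSP.
public class MinConflict {

    /** Constraint view of the problem: is the pair (valI for varI, valJ for varJ) forbidden? */
    interface Constraints {
        boolean forbids(int varI, int valI, int varJ, int valJ);
    }

    /**
     * Re-assigns variable v to the domain value that violates the fewest binary
     * constraints against the rest of the assignment, and returns that count (nvc*).
     */
    static int minConflict(int[] assignment, int v, int domainSize, Constraints c) {
        int bestValue = assignment[v];
        int bestNvc = Integer.MAX_VALUE;            // plays the role of nc in Fig. 10
        for (int value = 0; value < domainSize; value++) {
            int nvc = 0;
            for (int other = 0; other < assignment.length; other++) {
                if (other != v && c.forbids(v, value, other, assignment[other])) {
                    nvc++;                          // one more violated constraint for this value
                }
            }
            if (nvc < bestNvc) {                    // keep the least-conflicting value so far
                bestNvc = nvc;
                bestValue = value;
            }
        }
        assignment[v] = bestValue;                  // commit the chosen value
        return bestNvc;
    }
}

Calling this method on the gene singled out by the template (the one with the largest δ) reproduces the guided mutation of Fig. 7.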
3
Experimentation
The goal of our experimentation is to compare two distributed implementations: the first is the Distributed Double Guided Genetic Algorithm (D²G²A), whereas the second is the Distributed Guided Genetic Algorithm
(DGGA). Both implementations have been done with ACTALK [1], a concurrent object language implemented on top of the object-oriented language SMALLTALK-80.
3.1
Experimental Design
Our experiments are performed on randomly generated binary CSP samples. The generation is guided by the classical CSP parameters: number of variables (n), domain size (d), constraint density p (a number between 0 and 100% indicating the ratio of the number of effective constraints of the problem to the number of all possible constraints, i.e. those of a complete constraint graph) and constraint tightness q (a number between 0 and 100% indicating the ratio of the number of pairs of values forbidden by the constraint to the size of the domain cross product). As numerical values, we use n = 20 and d = 20. Having chosen the values 0.1, 0.3, 0.5, 0.7 and 0.9 for the parameters p and q, we obtain 25 density-tightness combinations. For each combination, we randomly generate 30 examples, which gives 750 examples in total. Moreover, considering the random aspect of genetic algorithms, we have performed 10 runs per example and taken the average without considering outliers. For each density-tightness combination, we also take the average of the 30 generated examples. Regarding GA parameters, all implementations use a number of generations (NG) equal to 10, an initial-population size equal to 1000, a cross-over probability equal to 0.5, a mutation probability equal to 0.2 and random replacement. The performance is assessed by the following two measures:
- Run time: the CPU time required for solving a problem instance,
- Satisfaction: the number of satisfied constraints.
The first one reflects the complexity whereas the second reveals the quality. In order to have a quick and clear comparison of the relative performance of the two approaches, we compute ratios of DGGA and D²G²A performance using the run time and the satisfaction, as follows:
- CPU-ratio = DGGA-Run-time / D²G²A-Run-time
- Satisfaction-ratio = D²G²A-Satisfaction / DGGA-Satisfaction
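The (n, d, p, q) generation model just described is easy to reproduce in code. The sketch below is one plausible reading of that model, written in Java for illustration; it is not the generator actually used in the experiments.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// One plausible generator for random binary Max_CSP instances with parameters (n, d, p, q).
public class RandomBinaryCsp {
    final int n, d;
    final boolean[][][][] forbidden;                 // forbidden[i][j][a][b], only for constrained pairs
    final List<int[]> constrainedPairs = new ArrayList<>();

    RandomBinaryCsp(int n, int d, double p, double q, long seed) {
        this.n = n;
        this.d = d;
        this.forbidden = new boolean[n][n][][];
        Random rng = new Random(seed);
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (rng.nextDouble() < p) {          // density: keep this constraint
                    boolean[][] table = new boolean[d][d];
                    for (int a = 0; a < d; a++) {
                        for (int b = 0; b < d; b++) {
                            table[a][b] = rng.nextDouble() < q;   // tightness: forbid this value pair
                        }
                    }
                    forbidden[i][j] = table;
                    constrainedPairs.add(new int[] { i, j });
                }
            }
        }
    }

    /** Number of constraints violated by a complete assignment (the quantity to minimise). */
    int violatedConstraints(int[] assignment) {
        int nvc = 0;
        for (int[] pair : constrainedPairs) {
            if (forbidden[pair[0]][pair[1]][assignment[pair[0]]][assignment[pair[1]]]) {
                nvc++;
            }
        }
        return nvc;
    }

    public static void main(String[] args) {
        RandomBinaryCsp csp = new RandomBinaryCsp(20, 20, 0.5, 0.5, 42);  // n = d = 20 as above
        System.out.println("violated by all-zero assignment: " + csp.violatedConstraints(new int[20]));
    }
}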
Thus, DGGA performance is the numerator when measuring the CPU-time ratio and the denominator when measuring the satisfaction ratio; hence any value greater than 1 indicates superior performance by D²G²A.
3.2
Experimental Results
Figure 11 shows the performance ratios, from which we draw the following results:
- From the CPU-time point of view, D²G²A requires less time for the over-constrained and most strongly tight sets of examples. For the remaining sets of examples the CPU-time ratio is always over 1; on average, this ratio is equal to 2.014.
- From the satisfaction point of view, D²G²A always finds the same or better satisfaction than DGGA. It finds about 1.23 times more for the most strongly
constrained and most weakly tight set of problems. The satisfaction-ratio average is about 1.05. Note that the experiments do not show a clear dependency between NG and the evolution of either the CPU-time or the satisfaction ratio [3].
3.3
Guidance Operator Study
Attention is next focused on the study of the guidance probability in order to determine its best value. To accomplish this task, we have assembled the CPU-time averages and satisfaction averages for different values of Pguid. Both plots of figure 12 show that the best value of Pguid is about 0.5. This value can be explained by the fact that not only random mutations but also guided mutations are important: the guided mutating sub-process provides guidance and so helps the algorithm to reach optima, while the random one helps it to escape from local-optimum attraction basins.
4
Conclusion and Perspective
We have developed a new approach called D²G²A, a distributed guided genetic algorithm enhanced by a new parameter called the guidance probability. Compared to the DGGA, our approach has been experimentally shown to be always at least as good, and usually better, in terms of the number of satisfied constraints and CPU time.
Fig. 11. CPU-time ratio and Satisfaction ratio
Fig. 12. CPU-time and Satisfaction relative to different values of Pguid (the two panels plot CPU time and the number of satisfied constraints against the guidance probability)
The improvement is due both to diversification, which improves convergence by escaping from local-optimum attraction basins, and to guidance, which helps the algorithm to attain optima. In this way our approach gives the optimization process a better chance of visiting the whole search space. We have come to this conclusion thanks to the proposed mutation sub-process, which is sometimes random, aiming to diversify the search, and sometimes guided, in order to increase the number of satisfied constraints. The guidance operator has been tested too; its best value has been found to be about 0.5. No doubt further refinement of this approach would allow its performance to be improved. Further work could focus on applying this approach to other hard problems like valued CSPs and CSOPs.
References
[1] Briot, J.P.: Actalk: a testbed for classifying and designing actor languages in the Smalltalk-80 environment. In: Proceedings of the European Conference on Object-Oriented Programming (ECOOP'89), British Computer Society Workshop Series, Cambridge University Press, July 1989.
[2] Ghédira, K.: A distributed approach to Partial Constraint Satisfaction Problems. In: Distributed Software Agents and Applications, Lecture Notes in Artificial Intelligence 1069, Perram, J.W., and Müller, J.P. (Eds.), Springer-Verlag, Berlin Heidelberg, 1996.
[3] Ghédira, K., and Jlifi, B.: A Distributed Guided Genetic Algorithm for Max_CSPs. Journal of Sciences and Technologies of Information (RSTI), Artificial Intelligence Series (RIA), volume 16, no. 3, 2002.
[4] Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, Mass., 1989.
[5] Schiex, T., Fargier, H., and Verfaillie, G.: Valued constraint satisfaction problems: hard and easy problems. In: Proceedings of the 14th IJCAI, Montreal, Canada, August 1995.
[6] Tsang, E.P.K., Wang, C.J., Davenport, A., Voudouris, C., and Lau, T.L.: A family of stochastic methods for Constraint Satisfaction and Optimization. University of Essex, Colchester, UK, November 1999.
Modelling Intelligent Agents for Organisational Memories Alvaro E. Arenas and Gareth Barrera-Sanabria Laboratorio de Cómputo Especializado, Universidad Autónoma de Bucaramanga, Calle 48 No 39 - 234, Bucaramanga, Colombia {aearenas,gbarrera}@unab.edu.co
Abstract. In this paper we study the modelling of intelligent agents to manage organisational memories. We apply the MAS-CommonKADS methodology to the development of an agent-based knowledge management system applied to project administration in a research centre. Particular emphasis is made on the development of the expertise model, where knowledge is expressed as concepts of the CommonKADS Conceptual Model Language. These concepts are used to generate annotated XML-documents, facilitating their search and retrieval. Keywords: Agent-Oriented Software Engineering; Knowledge Management; Organisational Memory; MAS-CommonKADS
1
Introduction
An organisational memory is an explicit, disembodied and persistent representation of crucial knowledge and information in an organisation. This type of memory is used to facilitate access, sharing and reuse by members of the organisation for individual or collective tasks [3]. In this paper we describe the development of a multi-agent system for managing a document-based organisational memory distributed through the Web using the MAS-CommonKADS methodology [5]. MAS-CommonKADS is a general purpose multi-agent analysis and design methodology that extends the CommonKADS [10] design method by combining techniques from object-oriented methodologies and protocol engineering. It has been successfully applied to the optimisation of industrial systems [6], the automation of travel assistants [1] and the development of e-commerce applications [2], among others. In this paper particular emphasis is made on the development of the expertise model, where knowledge is expressed as concepts of the CommonKADS Conceptual Model Language. These concepts are used to generate annotated XML documents, which are central elements of the organisational memory, thus facilitating their search and retrieval.
2
Applying MAS-CommonKADS to the Development of Organisational Memories
The concepts we develop here are applicable to any organisation made of distributed teams of people who work together online. In particular, we have selected a research centre made of teams working on particular areas as our case study. A team consists of researchers who belong to particular research projects. The organisational memory is made of electronic documents distributed throughout the Intranet of that centre. The MAS-CommonKADS methodology starts with a conceptualisation phase where an elicitation task is carried out, aiming to obtain a general description of the problem by following a user-centred approach based on use cases [9]. We have identified three human actors for our system: Document-Author, End-User and Administrator. A Document-Author is a person working in a research project and who is designated for updating information about such a project. An EndUser is any person who makes requests to the system; requests could be related to a team, a project or a particular researcher. For instance, a typical request is ‘Find the deliverables of the project named ISOA’. Finally, an Administrator is the person in charge of maintaining the organisational memory. The outcome of this phase is a description of the different actors and their use cases. 2.1
The Agent Model
The agent model specifies the characteristics of an agent, and plays the role of a reference point for the other models. An agent is defined as any entity, human or software, capable of carrying out an activity. The identification of agents is based on the actors and use-case diagrams generated in the conceptualisation phase. We have identified four classes of agents:
- User Agent, a software agent corresponding to an interface agent that helps to carry out the actions associated with human actors; it has three subclasses, one for each actor: Author Agent, End-User Agent and Administrator Agent.
- Search Agent, a software agent responsible for searching information in the organisational memory; it has three subclasses, one for each kind of search: Search Team Agent, Search Project Agent and Search Researcher Agent.
- Update Agent, a software agent responsible for updating information in the organisational memory; it has three subclasses, one for each kind of update: Update Team Agent, Update Project Agent and Update Researcher Agent.
- Create Agent, a software agent responsible for creating new information in the organisational memory; it has three subclasses, one for each possible creation: Create Team Agent, Create Project Agent and Create Researcher Agent.
2.2
The Task Model
The task model describes the tasks that an agent can carry out. Tasks are decomposed according to a top-down approach. We use UML activity diagrams
Table 1. Textual Template for Organise Search Criteria Task
Task: Organise Search Criteria
Purpose: Classify the search criteria
Description: Once the criteria are received, they are classified and the information to which they will be applied is identified
Input: General criteria
Output: Criteria classified
Precondition: Existence of criteria
Ingredient: Pattern of criteria classification
to represent the activity flow of the tasks and textual templates to describe each activity. For instance, the Search Agent carries out tasks such as requesting information from other agents, organising search criteria, selecting information and notifying its status to other agents. Table 1 shows the textual template of the task Organise Search Criteria.
2.3
The Organisation Model
This model aims to specify the structural relationships between human and/or software agents, and the relationships with the environment. It includes two submodels: the human organisation model and the multi-agent organisation model. The human organisation model describes the current situation of the organisation. It includes a description of different constituents of the organisation such as structure, functions, context and processes. As a way of illustration, the process constituent for our case study includes the reception of information about new teams, projects or researchers; once the information is validated, it is registered in data bases and new pages are generated in the organisation’s Intranet; such information can be updated by modifying the corresponding records; finally, the information is offered to the different users. The multi-agent organisation model describes the structural organisation of the multi-agent system, the inheritance relationship between agents and their structural constituent. The structural organisation and the inheritance relationships are derived directly from the agent model. The structural constituent of the multi-agent system specifies the geographic distribution of the agents and the organisational memory. Figure 1 presents an abstract diagram of the structural constituent of our case study. 2.4
The Coordination and Communication Models
The coordination model shows the dynamic relationships between the agents. This model begins with the identification of the conversations between agents, where use cases play again an important role. At this level, every conversation consists of just one single interaction and the possible answer, which are
Fig. 1. Abstract Diagram of the Structural Constituent of the MAS
described by means of textual templates. Next, the data exchanged in each interaction are modelled by specifying speech acts and synchronisation, and collect all this information in the form of sequence diagrams. Finally, interactions are analysed in order to determine their complexity. We do not emphasize this phase due to lack of space. The communication model includes interactions between human agents and other agents. We use templates similar to those of the coordination model, but taking into consideration human factors such as facilities for understanding recommendations given by the system. 2.5
The Expertise Model
The expertise model describes the reasoning capabilities of the agents needed to carry out specific tasks and achieve goals. It consists of the development of the domain knowledge, inference knowledge and task knowledge. The domain knowledge represents the declarative knowledge of the problem, modelled as concepts, properties, expressions and relationships using the Conceptual Modelling Language (CML) [10]. In our problem, we have identified concepts such as User, Project, Team, Profile, Product and properties such as the description of the products generated by a project, the description of the profile of an user. Table 2 presents a fragment of the concept definition of our system. Such a description constitutes the general ontology of the system, it is an essential component for generating the XML-annotated documents that make the organisational memory. The inference knowledge describes the inference steps performed for solving a task. To do so, we use inference diagrams [10]. An inference diagram consists of three elements: entities, representing external sources of information that are necessary to perform the inference, which are denoted by boxes; inferences, denoted by ovals and flow of information between entities and inferences, denoted
Table 2. Fragment of the Concept Definition of the System
Concept: User
  Description: Metaconcept that represents the types of users in the system. It does not have specific properties.
Concept: External User
  Description: Subconcept of concept User. This concept represents the external users of the system, i.e. people from other organisations who want to consult our system.
Concept: Employee
  Description: Subconcept of concept User that details the general information of the employees of the organisation.
  Properties:
    Identity: String; Min = 8; Max = 15
    Name: String; Max = 100
    Login: String; Min = 6; Max = 8
    Type: Integer; Max = 2
    ...
by arrows. Typical processes that require inference steps include validating the privileges of a user or the search of information in the organisational memory. Figure 2 shows the inference diagram for validating the privileges of a user. The task knowledge identifies the exact knowledge required by each task and determines its most suitable solving method. To do so, each task is decomposed in simpler subtasks, from which we extract relevant knowledge to structure the organisational memory. 2.6
The Design Model
This model aims to structure the components of the system. It consists of three submodels: the agent network, the agent design and the platform submodels. The agent network design model describes the network facilities (naming services, security, encryption), knowledge facilities (ontology servers, knowledge representation translators) and the coordination facilities (protocol servers, group management facilities, resource allocation) within the target system. In our system, we do not use any facilitator agent, so that this model is not defined. The agent design model determines the most suitable architecture for each agent. To do so, we use a template for each agent type that describes the functionalities of each subsystem of such an agent type. Finally, in the platform design model, the software and hardware platform for each agent are selected and described. We have selected Java as the programming language for the implementation of the agent subsystems. The documents that make the organisational memory are represented as XML documents. We use JDOM [4], an API for accessing, manipulating, and outputting XML data from
Fig. 2. Inference Diagram for Validating the Privileges of a User
Java code, to program the search and retrieval of documents. Lastly, Java Server Pages (JSP) is used for implementing the system dynamic pages.
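As a rough illustration of how such XML-annotated documents can be produced and queried from Java with JDOM, consider the sketch below. It uses the current org.jdom2 package names, and the element names are invented for the example rather than taken from the system's actual ontology.

import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;

// Building and querying a tiny XML-annotated "project" document with JDOM.
public class ProjectMemoryExample {
    public static void main(String[] args) {
        // Build a document whose structure mirrors a Project concept of the domain knowledge.
        Element project = new Element("project");
        project.setAttribute("name", "ISOA");
        Element deliverables = new Element("deliverables");
        deliverables.addContent(new Element("deliverable").setText("Requirements report"));
        deliverables.addContent(new Element("deliverable").setText("Prototype v0.1"));
        project.addContent(deliverables);
        Document doc = new Document(project);

        // Serialise it, since the organisational memory stores plain XML documents.
        System.out.println(new XMLOutputter(Format.getPrettyFormat()).outputString(doc));

        // A search agent can then answer "find the deliverables of the project named ISOA".
        for (Element e : doc.getRootElement().getChild("deliverables").getChildren("deliverable")) {
            System.out.println("deliverable: " + e.getText());
        }
    }
}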
3
Implementation
The implementation of our system consisted of three phases: creation of a document based organisational memory, development of user interfaces and construction of the multi-agent system. The organisational memory consisted of XML-annotated documents that can be updated and consulted by the users according to their privileges. In the current implementation, such documents are generated straightforwardly from the concepts and relationships of the domain knowledge. The structure of the user interfaces is derived directly from the communication model and developed with JSP. In the current version of the system, each agent type is implemented as a Java class. We are implementing a new version with the aid of the AgentBuilder tool.
4
Conclusions and Future Work
The agent-based approach to system development offers a natural means of conceptualising, designing and building distributed systems. The successful practice of this approach requires robust methodologies for agent-oriented software
engineering. This paper applies MAS-CommonKADS, a methodology for agentbased system development, to the development of a knowledge management system for administrating a document-based organisational memory distributed throughout the Web. We have found the application of MAS-CommonKADS useful in this type of applications. First, the organisational model was crucial in the specification of the relationship between system and society, which is a central aspect in this type of applications. In our application, it was useful for defining the structure of the information within the Intranet of the organisation. Second, the knowledge submodel of the expertise model was the basis for creating the organisational memory. From such concepts we derive XML-annotated documents that make the organisational memory. Several projects have studied ontologies for both knowledge management and Web search. The OSIRIX project also proposes the use of XML-annotated documents to build an organisational memory [8]. They develop a tool that allows users to translate a corporate ontology into a document type definition, provided that the ontology is represented in the CommonKADS Conceptual Modelling Language. Although we share similar objectives, our work was focused on the methodological aspects to develop this kind of systems. Klein et al study the relation between ontologies and XML for data exchange [7]. They have devised a procedure for translating an OIL ontology into a specific XML schema. Intended future work will focus on the automation of the generation of the XML-documents, and the inclusion of domain-specific ontologies in the system.
Acknowledgements The authors are grateful to José Pérez-Alcázar and Juan Carlos García-Ojeda for valuable comments and suggestions. This work was partially funded by the Colombian Research Council (Colciencias-BID).
References
[1] A. E. Arenas and G. Barrera-Sanabria. Applying the MAS-CommonKADS Methodology to the Flights Reservation Problem: Integrating Coordination and Expertise. Frontiers of Artificial Intelligence and Applications Series, 80:3-12, 2002.
[2] A. E. Arenas, N. Casas, and D. Quintanilla. Integrating a Consumer Buying Behaviour Model into the Development of Agent-Based E-Commerce Systems. In IIWAS 2002, The Fourth International Conference on Information Integration and Web-Based Applications and Services. SCS, European Publishing House, 2002.
[3] R. Dieng. Knowledge Management and the Internet. IEEE Intelligent Systems, 15(4):14-17, 2000.
[4] E. R. Harold. Processing XML with Java. Addison-Wesley, 2002.
[5] C. A. Iglesias, M. Garijo, J. Centeno-Gonzalez, and J. R. Velasco. Analysis and Design of Multiagent Systems using MAS-CommonKADS. In Agent Theories, Architectures, and Languages, Lecture Notes in Artificial Intelligence, pages 313-327, 1997.
[6] C. A. Iglesias, M. Garijo, M. González, and J. R. Velasco. A Fuzzy-Neural Multiagent System for Optimisation of a Roll-Mill Application. In IEA/AIE, Lecture Notes in Artificial Intelligence, pages 596-605, 1998.
[7] M. Klein, D. Fensel, F. van Harmelen, and I. Horrocks. The Relation between Ontologies and XML Schemas. Electronic Transactions on Artificial Intelligence, 6(4), 2001.
[8] A. Rabarijoana, R. Dieng, O. Corby, and R. Ouaddari. Building and Searching an XML-Based Corporate Memory. IEEE Intelligent Systems, 15(3):56-63, 2000.
[9] J. Rumbaugh, I. Jacobson, and G. Booch. The Unified Modelling Language Reference Manual. Addison Wesley, 1999.
[10] G. Schreiber, H. Akkermans, A. Anjewierden, R. de Hoog, N. Shadbolt, W. Van de Velde, and B. Wielinga. Knowledge Engineering and Management: The CommonKADS Methodology. The MIT Press, 2000.
A Study on the Multi-agent Approach to Large Complex Systems Huaglory Tianfield School of Computing and Mathematical Sciences Glasgow Caledonian University, 70 Cowcaddens Road, Glasgow G4 0BA, UK
[email protected] Abstract. This paper explores the multi-agent approach to large complex systems. Firstly, system complexity research is briefly reviewed. Then, self-organization of multi-agent systems is analyzed. Thirdly, complexity approximation via multi-agent systems is presented. Finally, distinctive features of multi-agent approach are summarized.
1
Introduction
Large complex systems (LCS) are found in abundance. Practically all disciplines, ranging from natural sciences to engineering, from management/decision/behavior sciences to social sciences, have encountered various LCS. Natural and man-made instances of LCS include, for example, biological body systems, biological society systems, environmental and weather systems, space/universe systems, machine systems, traffic/transport systems, and human organizational/societal (e.g., political, economic, military) systems, to mention just a few. Complexity research and its applications have received very broad attention from different disciplines. In particular, the different models employed have largely determined the approaches to LCS, as depicted in Table 1. An LCS may have hundreds or even thousands of heterogeneous components, between which are complicated interactions ranging from primitive data exchanges to natural language communications. The information flows and patterns of dynamics in an LCS are intricate. Generally, LCS have three intrinsic attributes, i.e. the perception-decision link, multiple gradation, and nesting [1-3]. As emergent behaviors dominate the system, the study of the behaviors of LCS is very hard. Given that

System Geometry ∆∇= components + interactions between components        (1)

where ∆∇= denotes "packaging" and "de-packaging", system complexity is much more attributable to the interactions between components than to the components themselves. The study of architecture is also very hard: apparently there is certainly a hierarchy in an LCS, but it is unclear how many levels there should be and how a level can be stratified. The integration of LCS is very hard as well; fundamentally, there is not even an effective approach.
Table 1. Models in complexity research

Model | Originating area | Characteristics of complexity
Statistical models | Thermodynamics | Entropy, complexity measure, dissipative structure, synergetics, self-organization
Deterministic models | Non-linear science | Chaos, fractal geometry, bifurcation, catastrophe, hypercycle
Biological species models | Biology | Evolution, learning, emergence
Game theoretical models | Operations research | Game, dynamic game, co-ordination
Artificial life, adaptive models | Cybernetics, adaptation | Co-evolution, cellular automata, replicator, self-organized criticality, edge of chaos
Time-series dynamic models | System science and engineering | Time series dynamics, black-box quantitative dynamics, system controllability, system observability
Macroscopic dynamic models | Economics and sociology, systems research | Hierarchy, self-organizing, adaptation, architecture of complexity, systems dynamics
Multi-agent models | Distributed artificial intelligence | To be explored
All of these make it rather difficult to investigate the basic problems of LCS, including complexity mechanism, modeling, architecture, self-organization, evolution, performance evaluation, system development, etc. Intelligent agents and multi-agents systems have been widely accepted as a type of effective coarse-granularity metaphors for perception, modeling and decision making, particularly in systems where humans are integrated [4, 5]. They will be very effective in dealing with the heterogeneous natures of the components of LCS as intelligent agents and their emergent interaction are ubiquitous modeling metaphors whether the real-world systems are hardware or software, machines or humans, low or high level.
2
Self-Organization of Multi-agent Systems
2.1
Geometry of Multi-agent Systems
Prominent advantage of agentization is that it enables a separation of social behavior (problem solving, decision making and reasoning) at social communication level from individual behavior (routine data and knowledge processing, and problem solving) at autonomy level. Due to such a separation of the two levels, every time a social communication is conducted, the corresponding agents have to be associated beforehand. Therefore, social communication level and autonomy level, along with agent-task association, predominate the geometry of a multi-agent system on the basis of its infrastructure, as depicted in Fig. 1. Infrastructure of a multi-agent system includes agents, agent platform(s), agent management, and agent communication language.
Fig. 1. Geometry of multi-agent system (ACL: agent communication language)
Agent-task association is dynamic. The dynamics can be as long as a complete round of problem solving, and can also be as short as a round of social communication, or even a piece of social communication (the exchange of one phrase). Social communication is emergent: it is purely a case-by-case process, non-predictable beforehand and only appearing in the situation. Moreover, social communication is at knowledge level. Agents themselves have sufficient abilities; the purpose of social communication is not solving problems, but initiating, activating, triggering, invoking, inducing, and prompting the abilities of individual agents. However, the essential decision making relies upon the corresponding agents, not on the social communication between agents.
2.2
Uncertainty and Self-Exploration of Multi-agent Problem Solving Process
The multi-agent problem solving process involves three aspects:
• Agent-task association: which agents are associated, which are not, and the alternation and mixture of association and non-association in problem solving;
• Social communication: organizationally constrained and non-constrained social communication, and their alternation and mixture in problem solving;
• Progression of problem solving: what tasks or intermediate solutions are achieved, how they are aggregated, and what further tasks remain to achieve the complete solution.
The autonomy of agents, the dynamic property of agent-task association, and the emergent and knowledge-level properties of social communication between agents make the multi-agent problem solving process uncertain and self-exploring. The process is self-exploring in that, at the beginning, the solving process is unknown; with the progression of problem solving, the process is made clearer and clearer.
In redundant multi-agent systems, the agents to be associated vary uncertainly from one round, or even one piece, of social communication to another. Overlapping between the abilities of individual agents in a society forms the basis of the uncertainty of the multi-agent problem solving process. In non-redundant multi-agent systems, agents are functionally fixed under a given organizational structure, social communication is organizationally constrained, and there is no provision for competition or alternation between agents. The total problem solving is just a mechanical additive aggregation of the abilities of individual agents over an uncertain process. This appears as if the problem solving of individual agents forms a Markov chain.

Total task = PS(Agent-1) +^{t1} PS(Agent-2) +^{t2} ··· +^{tN} PS(Agent-N)        (2)
where PS is abbreviation of problem solving, and “ + ” denotes a mechanical aggregation of results from the problem solving of individual agents which takes place at asynchronous instants. However, the social communication in non-redundant multi-agent systems still varies in time, content and/or quality from one round or even one piece to another. So, even in this case the multi-agent problem solving process is uncertain. For instance, within an enterprise, even provided that there is an adequate organizational structure between functional departments, from one time to another, functional departments may have failures in delivery dates and qualities, depending on how cooperation and/or negotiations are made between functional departments. Therefore, in both redundant and non-redundant multi-agent systems, multi-agent problem solving process is uncertain.
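The "mechanical additive aggregation at asynchronous instants" expressed by Equation (2) can be pictured with a few lines of Java: independent problem-solving tasks finish at their own instants and their partial results are simply accumulated. The cost values below are purely illustrative stand-ins for each agent's own problem solving.

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

// Illustration of Equation (2): partial results are merged whenever each agent finishes.
public class AsynchronousAggregation {
    public static void main(String[] args) {
        AtomicInteger totalTask = new AtomicInteger();   // the mechanically aggregated result

        List<CompletableFuture<Void>> agents = List.of(1, 2, 3, 4).stream()
                .map(i -> CompletableFuture.runAsync(() -> {
                    int partial = solveSubProblem(i);    // PS(Agent-i), completing at its own instant t_i
                    totalTask.addAndGet(partial);        // "+^t": aggregation at an asynchronous instant
                }))
                .collect(Collectors.toList());

        agents.forEach(CompletableFuture::join);         // wait for every agent to finish
        System.out.println("Total task = " + totalTask.get());
    }

    static int solveSubProblem(int agentId) {
        return agentId * 10;                             // stand-in for an agent's individual problem solving
    }
}

Note that the order in which the additions happen is not fixed, which is precisely the uncertainty discussed above; only the final aggregate is deterministic in this toy example.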
3
Approximating Complexity through Flexible Links of Simplicities
3.1
Flexibility of Multi-agent Modeling of Real-World Systems
Consider

A system ≜ ((Agent-i)_N)^T ( ⇔_ij^t )_{N×N} ((Agent-j)_N)        (3)
where parentheses denote a vector or matrix, superscript T refers to transposition of a vector, ⇔ ijt denotes social communication from agent j to agent i at asynchronous instant t. It can refer to pattern, discourse, or phrase of social communication, or all of these. Due to the dynamic property of agent-task association, the number of agents to be associated, N, varies from one round, or even one piece of social communication to another. The basic idea of multi-agent based modeling is packaging various real-world LCS into multi-agent systems, as depicted in Fig. 2. Thus Equation (1) becomes Large Complex System Geometry ∆∇ = Agents + Interactions between agents
(4)
Fig. 2. Multi-agent systems as metaphor of various real-world LCS
Then, multi-agent modeling involves specifications of the packaging, the agents, the agent communication language, and the agent platform. This apparently requires a specification language. Multi-agent packaging has many prominent advantages, e.g.:
• a unified systems engineering process for LCS: architecture, analysis, design, construction, performance evaluation, evolution, etc.;
• unified concepts, measures, theory, methods, etc. of LCS;
• unified studies on different types of real-world LCS.
Given a fixed infrastructure, the self-organization of multi-agent systems makes it possible for a multi-agent system to adapt itself to a wide variety of real-world systems.
3.2
Multi-agent Based System Development
Principal impact of multi-agent paradigm upon system development is the transition from individual focus to interaction focus. High level social behavior may result from a group of low level individuals. Developing systems by multi-agent paradigm immediately turns out developing the corresponding multi-agent systems. To develop a multi-agent system, there are two parts to complete, i.e., first to develop the infrastructure of the multi-agent system, including agents, agent platform(s), agent management, and agent communication language, and then to develop the agent-task association and the social communication. For the former, the development of infrastructure of multi-agent system is just similar to that of any traditional systems. Traditional paradigms of system development, including analysis/design methods and life-cycle models are applicable.
Table 2. Multi-agent approach versus traditional complexity research

Scope. Conventional complexity research: non-linearity, chaos, self-organization, quantum mechanics, thermodynamics, etc. Multi-agent approach: very natural to incorporate humans/computers as basic components of the systems, and able to deal with the largeness and architecture of systems.

Destination. Conventional complexity research: basically physical-mechanism oriented, i.e., trying to discover the primary physics of non-linear and thermodynamic complex systems. Multi-agent approach: not physical-mechanism oriented, but black-box systematic-behavior oriented.

Highlight. Conventional complexity research: more on the microscopic/quantum level, ignoring the effective incorporation of humans/computers in the systems, and incapable of dealing with the genuine largeness and structural complexity of a system. Multi-agent approach: macroscopic, meta-level, i.e., using coarse-granularity (pseudo-)computational entities (agents) as the primary elements of the characterisation and modeling of LCS; the power is not in the primary elements themselves, but in the dynamic, uncertain, emergent, knowledge-level interactions between these primary elements.
For the latter, strictly speaking, the agent-task association and the social communication are not something that can be developed, only influenced. In such a circumstance, traditional paradigms are generally inapplicable; actually, there is no apparent life-cycle concept, and developers work interactively with the multi-agent system. In order to exert influence, extra requirements are imposed on the development of the infrastructure of the multi-agent system: when the infrastructure is developed, consideration should already be given to how it can be influenced later on. Essentially, the geometry of a multi-agent system should have been designed before the development of its infrastructure. By means of the multi-agent paradigm, the infrastructure of multi-agent systems can be very easily inherited from one system to another. This prominent provision greatly facilitates the reusability and evolvability of systems.
4
Summary
Advantages of the self-organization of multi-agent systems include:
• System modeling becomes greatly alleviated. Given a fixed infrastructure, the multi-agent system can adapt itself to many varieties of real-world systems;
• Completeness and organic links between system perspectives are readily guaranteed;
• Reusability and evolvability of systems are thoroughly enhanced. Provided with the infrastructure of a multi-agent system, new systems are rapidly available, with a considerably shorter time to market;
• Late user requirements can be readily resolved and adapted to. Actually, users work interactively with the multi-agent system and can change (more exactly, influence) its behavior.
Disadvantages of multi-agent systems may be:
• The multi-agent problem solving process is uncertain and self-exploring, which leaves the problem solving process non-transparent and untraceable to users;
• While the efficiency of multi-agent problem solving can be influenced by incorporating heuristics into the multi-agent system, the speed of problem solving is uncontrollable.
The traditional models/methods for LCS are relatively dependent upon their originating areas, and lack a unified research process for systems from a variety of different domains. Multi-agent systems provide a unified approach to the analysis, design, construction, performance, evolution, etc. of LCS. A comparison is made in Table 2.
References
[1] Tianfield, H.: Formalized analysis of structural characteristics of large complex systems. IEEE Trans. on Systems, Man, and Cybernetics, Part A, 31 (2001) 559-572
[2] Tianfield, H.: Structuring of large-scale complex hybrid systems: From illustrative analysis toward modeling. J. of Intelligent and Robotic Systems: Theory and Applications, 30 (2001) 179-208
[3] Tianfield, H.: An innovative tutorial on large complex systems. Artificial Intelligence Review, 17 (2002) 141-165
[4] Tianfield, H.: Agentization and coordination. Int. J. of Software Engineering and Knowledge Engineering, 11 (2001) 1-5
[5] Tianfield, H.: Towards advanced applications of agent technology. Int. J. of Knowledge-based Intelligent Engineering Systems, 5 (2001) 258-259
Multi-layered Distributed Agent Ontology for Soft Computing Systems Rajiv Khosla School of Business La Trobe University, Melbourne, Victoria – 3086, Australia
[email protected] Abstract. In this paper we outline a multi-layered distributed agent ontology for providing system modeling support to practitioners and problem solvers at four levels, namely the distributed, tool (technology), optimization and task levels respectively. We describe the emergent characteristics of the architecture from an architectural viewpoint. We also outline how the ontology facilitates the development of human-centered soft computing systems. The ontology has been applied in a range of areas including alarm processing, web mining, image processing, sales recruitment and medical diagnosis. We also outline the definition of agents in the different layers of the ontology.
1
Introduction
Soft computing agents today are being applied in a range of areas including image processing, engineering, process control, data mining, the internet and others. In the process of applying soft computing agents to complex real-world problems, three phenomena have emerged. Firstly, the application of soft computing agents in distributed environments has resulted in a merger of techniques from the soft computing area with those of distributed artificial intelligence. Secondly, in an effort to improve the quality of the soft computing solution, researchers have been combining and fusing technologies, which has resulted in hybrid configurations of soft computing technologies. Finally, from a practitioner's perspective, as soft computing technologies have moved out of the laboratory into the real world, the need for knowledge- or task-level, practitioner-centered, technology-independent constructs has emerged. In this paper we describe four levels of intelligent soft computing agent design which correspond to the three phenomena described above. These four levels or layers are the distributed, tool, optimization and problem solving or task layers. Distributed-level support is provided, among other aspects, for fetching and depositing data across different machines in a distributed environment. Tool support is provided in terms of applying various soft and hard computing technologies like fuzzy logic, neural networks, genetic algorithms, and knowledge based systems. Optimization-level support is provided in terms of hybrid configurations of soft computing technologies for designing and developing optimum models. The optimization level
assists in improving the quality of solution of the technologies in the tool layer and the range of tasks covered by them. Finally, task-level support is provided in terms of modelling users' tasks and problem solving models in a technology-independent manner. The layered architecture is also motivated by the human-centred approach and criteria outlined in the 1997 NSF workshop on human-centered systems, and by the consistent problem solving structures/strategies employed by practitioners while designing solutions to complex problems or situations. The paper is organised as follows. Section 2 describes work done in the process of developing the ontology. Section 3 outlines the multi-layered agent ontology for soft computing systems and some aspects of the emergent behavior or characteristics of the multi-layered ontology. Section 4 concludes the paper.
2
Background
In this section we first look at the evolution of intelligent technologies along two dimensions, namely quality of solution and range of tasks. Then we outline some of the existing work on problem-solving ontologies.
2.1
Evolution of Intelligent Technologies
The four most commonly used intelligent technologies are symbolic knowledge based systems (e.g. expert systems), artificial neural networks, fuzzy systems and genetic algorithms [3, 5, 8]. The computational and practical issues associated with these intelligent technologies have led researchers to start hybridizing various technologies in order to overcome their limitations. However, the evolution of hybrid systems is not only an outcome of the practical problems encountered by these intelligent methodologies but also an outcome of the deliberative, fuzzy, reactive, self-organizing and evolutionary aspects of the human information processing system [2]. Hybrid systems can be grouped into three classes, namely fusion systems, transformation systems and combination systems [3, 6, 7]. These classes, along with the individual technologies, are shown in Fig. 1 along two dimensions, namely quality of solution and range of tasks. In fusion systems, the representation and/or information processing features of technology A are fused into the representation structure of another technology B. From a practical viewpoint, this augmentation can be seen as a way by which a technology addresses its weaknesses and exploits its existing strengths to solve a particular real-world problem. Transformation systems are used to transform one form of representation into another and are used to alleviate the knowledge acquisition problem. For example, neural nets are used for transforming numerical/continuous data into symbolic rules which can then be used by a symbolic knowledge based system for further processing. Combination systems involve explicit hybridization. Instead of fusion, they model the different levels of information processing and intelligence by using the technologies that best model a particular level. These systems involve a modular arrangement of two or more technologies to solve real-world problems.
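A combination system in the above sense is essentially a modular composition. The sketch below shows the shape of such an arrangement in Java; the interfaces and toy stages are invented for illustration and are not taken from the paper, and each stage would in practice wrap a neural, fuzzy or symbolic component.

import java.util.Arrays;
import java.util.List;

// Shape of a "combination" hybrid: independent technology modules arranged in a pipeline.
public class CombinationSystem {

    /** A stage implemented by whichever technology best fits it (fuzzy, neural, symbolic, ...). */
    interface TechnologyModule {
        double[] process(double[] input);
    }

    private final List<TechnologyModule> stages;

    CombinationSystem(TechnologyModule... stages) {
        this.stages = Arrays.asList(stages);
    }

    double[] solve(double[] input) {
        double[] current = input;
        for (TechnologyModule stage : stages) {
            current = stage.process(current);        // little or no knowledge transfer between stages
        }
        return current;
    }

    public static void main(String[] args) {
        // Toy stages standing in for, e.g., a fuzzy pre-processor and a rule-based classifier.
        TechnologyModule fuzzifier = in -> Arrays.stream(in).map(x -> Math.min(1.0, x / 10.0)).toArray();
        TechnologyModule classifier = in -> new double[] { in[0] > 0.5 ? 1.0 : 0.0 };

        CombinationSystem system = new CombinationSystem(fuzzifier, classifier);
        System.out.println(Arrays.toString(system.solve(new double[] { 7.0 })));   // prints [1.0]
    }
}

The modularity is what gives combination systems their flexibility across a range of tasks, and also why, as noted next, the quality of solution can suffer when the modules exchange little knowledge.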
Fig. 1. Technologies, Hybrid Configurations, Quality of Solution and Range of Tasks (figure: symbolic AI, fuzzy systems, genetic algorithms and neural networks, together with the fusion, transformation, combination and associative system classes, plotted against the two axes quality of solution and range of tasks)
However, these hybrid architectures also suffer from some drawbacks. These drawbacks can be explained in terms of the quality of solution and range of tasks covered, as shown in Fig. 1. Fusion and transformation architectures on their own do not capture all aspects of human cognition related to problem solving. For example, fusion architectures result in conversion of explicit knowledge into implicit knowledge, and as a result lose the declarative aspects of problem solving. Thus, they are restricted in terms of the range of tasks covered by them. The transformation architectures, with their bottom-up strategy, get into problems with increasing task complexity. Therefore the quality of solution suffers when there is heavy overlap between variables, where the rules are very complicated, the quality of data is poor, or the data is noisy. The combination architectures cover a range of tasks because of their inherent flexibility in terms of selection of two or more technologies. However, because of the lack of (or minimal) knowledge transfer among different modules, the quality of solution suffers for the very reasons the fusion and transformation architectures are used. It is useful to associate these architectures in a manner that maximizes the quality of solution as well as the range of tasks that can be covered. This class of systems is called associative systems, as shown in Fig. 1. As may be apparent from Fig. 1, associative systems consider various technologies and their hybrid configurations as technological primitives that are used to accomplish tasks. The selection of these technological primitives is contingent upon satisfaction of task constraints. In summary, it can be seen from the discussion in this section that associative systems represent an evolution from a technology-centered approach to a task-centered approach.
2.2
Strengths and Weaknesses of Existing Problem Solving Ontologies
In order to pursue the task-centered approach one tends to look into work done in the area of problem solving ontologies. The research on problem solving ontologies or knowledge-use level architectures has largely been done in artificial intelligence. The research at the other end of the spectrum (e.g., radical connectionism, soft computing)
is based more on understanding the nature of human or animal behavior than on developing ontologies for dealing with complex real world problems on the web as well as in conventional applications. A discussion of the strengths and some limitations of existing problem solving ontologies can be found in [7, 8]. Some of the limitations include lack of modelling constructs for soft computing applications, human-centeredness, context, response time, external or perceptual representations and some others. Firstly, especially with the advent of the internet and the web, human (or user) centeredness has become an important issue (1997 NSF workshop on human-centered systems; Takagi [10]). Secondly, from a cognitive science viewpoint, distributed cognition (which among other aspects involves consideration of external and internal representations for task modelling) has emerged as a system modelling paradigm, as compared to the traditional cognitive science approach based on internal representations only. Thirdly, human-centered problem solving among other aspects involves context, focus on practitioner goals and tasks, human evaluation of tasks modelled by technological artifacts, flexibility, adaptability and learning. Fourthly, unlike knowledge based systems, soft computing systems, because of their imprecise and approximate nature, require modelling of constructs like optimization. Finally, as outlined in the last section, there are a range of hard and soft computing technologies and hybrid configurations which lend flexibility to the problem solver, as against trying to force fit a particular technology or method onto a system design (as has traditionally been done). These aspects can be considered as pragmatic constraints. Most existing approaches do not meet one or more of the pragmatic constraints. Besides, from a soft computing perspective, most existing approaches do not facilitate component based modelling at the optimization, task and technology levels respectively.
3
Distributed Multi-layered Agent Ontology for Soft Computing Systems
The multi-layered agent ontology is shown in Fig. 2. It is derived from the integration of characteristics of intelligent artifacts like fuzzy logic, neural networks and genetic algorithms, agents, objects and a distributed process model with a problem solving ontology model [6, 7]. It consists of five layers. The object layer defines the data architecture or structural content of an application. The tool or technology agent layer defines the constructs for various intelligent and soft computing tools. The optimization agent layer defines constructs for fusion, combination and transformation technologies, which are used for optimizing the quality of solution (e.g., accuracy). Finally, the problem solving ontology (task) agent layer defines the constructs related to the problem solving agents, namely, preprocessing, decomposition, control, decision and postprocessing. This layer models practitioners' tasks in the domain under study. Some generic goals and tasks associated with each problem solving agent are shown in Table 1. This layer employs the services of the other four layers for accomplishing various tasks. The five layers facilitate a component based approach for agent based software design.
Fig. 2. Multi-Layered Distributed Agent Ontology for Soft Computing (figure: the problem solving (task) agent layer with global preprocessing, decomposition, control, decision and postprocessing phase agents; the optimization agent layer with fusion, transformation, combination and associative agents; the tool agent layer with expert system, fuzzy logic, supervised neural network, self-organising and genetic algorithm agents; the distributed agent layer with distributed communication and processing agents 1..N; and the object layer)
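As a purely illustrative sketch of the component-based layering described above (all class and method names are hypothetical and not taken from the paper), a task-layer agent can be written so that it only sees the run interface of whatever tool- or optimization-layer agent it is given:

```python
# Hypothetical sketch of task-layer delegation to optimization- and tool-layer agents.

class FuzzyLogicAgent:                        # tool layer
    def run(self, data):
        return {"label": "high" if sum(data) > 10 else "low"}

class NeuralNetworkAgent:                     # tool layer
    def run(self, data):
        return {"score": sum(data) / max(len(data), 1)}

class FusionAgent:                            # optimization layer: fuses several tool agents
    def __init__(self, *tools):
        self.tools = tools
    def run(self, data):
        merged = {}
        for tool in self.tools:
            merged.update(tool.run(data))
        return merged

class DecisionPhaseAgent:                     # task layer: technology-independent construct
    def __init__(self, engine):
        self.engine = engine                  # any tool- or optimization-layer agent
    def decide(self, data):
        result = self.engine.run(data)
        return ("accept" if result.get("score", 0) > 3 else "reject"), result

if __name__ == "__main__":
    task_agent = DecisionPhaseAgent(FusionAgent(FuzzyLogicAgent(), NeuralNetworkAgent()))
    print(task_agent.decide([2, 5, 7]))       # task layer is unchanged if the tools are swapped
```

The point of the sketch is only that the task layer depends on an interface rather than on any particular soft computing technology, which is the flexibility the ontology is arguing for.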
Table 1. Some Goals and Tasks of Problem Solving Agents

Phase: Preprocessing
  Goal: Improve data quality
  Some Tasks: Noise filtering; input conditioning

Phase: Decomposition
  Goal: Restrict the context of the input from the environment at the global level by defining a set of orthogonal concepts; reduce the complexity and enhance overall reliability of the computer-based artifact
  Some Tasks: Define orthogonal concepts

Phase: Control
  Goal: Determine decision selection knowledge constructs within an orthogonal concept for the problem under study
  Some Tasks: Define decision level concepts within each orthogonal concept as identified by users; determine conflict resolution rules

Phase: Decision
  Goal: Provide decision instance results in a user defined decision concept
  Some Tasks: Define decision instance of interest to the user

Phase: Post-processing
  Goal: Establish outcomes as desired outcomes
  Some Tasks: Concept validation; decision instance result validation
Fig. 3. Decision Paths Taken by a Medical Practitioner and Agent Sequence (figure: number of paths plotted against decision paths such as diagnosis to treatment, diagnosis to symptoms and treatment, and symptoms to diagnosis to treatment, with the corresponding decomposition, control, decision and postprocessing agent sequences)

Fig. 4. Optimization Level Modelling of an Image Processing Application (figure: an intelligent control agent coordinating a neural network agent, a moment invariant transformation agent, a water immersion segmentation agent and mathematical morphology segmentation agents 1..N)
3.1
Some Emerging Characteristics of Soft Computing Agent Ontology
In this section we establish the human-centeredness of the multi-layered agent ontology from two perspectives. Firstly, the multi-layered agent ontology facilitates human involvement, evaluation and feedback at three levels, namely, task, optimization and technology respectively. The constructs used at the task level are based on consistent problem solving structures employed by users in developing solutions to complex problems. Thus the five problem solving agents facilitate mapping of users' tasks in a systematic and scalable fashion [7]. Further, the component based nature of the five problem solving agents allows them to be used in different problem solving contexts, as shown in Fig. 3. These sequences assist in modeling different problem solving or situational contexts. The optimization level allows an existing technology based solution to be optimized based on human feedback. For example, in an unstained mammalian cell image processing application we employ a neural network (as shown in Fig. 4) for prediction and optimization of the segmentation quality of images segmented by the morphological and water immersion agents. At the technology level, human evaluation and feedback is modelled by asking the user to define the search parameters or fitness
function to be used by a soft computing agent like a GA to create the correct combination of a shirt design [7]. Further, the generic tasks of the five problem solving agents have been grounded in the human-centered criteria outlined in the 1997 NSF workshop on human-centered systems. These criteria state that a) human-centered research and design is problem driven rather than driven by logic theory or any particular technology; b) human-centered research and design focuses on practitioners' goals and tasks rather than system developers' goals and tasks; and c) human-centered research and design is context bound. The context criterion relates to social/organizational context, representational context (where people use perceptual as well as internal representations to solve problems) and problem solving context. The generic tasks employed by the five problem solving agents are based on consistent problem solving structures used by practitioners solving complex problems in engineering, process control, image processing, management and other areas. The component based nature of the problem solving agents shown in Fig. 2 enables them to be used in different sequences and different problem solving contexts. Further, the representational (external and internal representations), social and organizational contexts (not described in this paper) have also been modelled [6,7]. Additionally, the five layers of the agent ontology lead to component based distributed (collaboration and competition) soft computing system design. The ontology also provides flexibility of technologies, learning and adaptation, and different forms of knowledge (fuzzy, rule based and distributed pattern based) to be used for modelling component based software design.
4
Conclusion
In this paper we have outlined a multi-layered distributed agent ontology for developing soft computing applications. The ontology provides modelling support at four levels, namely, the task level, optimization level, technology level, and distributed processing level. Further, it takes into consideration pragmatic constraints like human-centeredness, context, distributed cognition, and flexibility of technologies. The ontology has been applied in a range of areas including web mining, image processing, e-commerce, alarm processing, medical diagnosis and others.
References
[1] Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Advanced Applications in Pattern Recognition. Plenum Press, USA (1981)
[2] Bezdek, J.C.: What is Computational Intelligence? In: Marks-II, R., et al. (eds.): Computational Intelligence: Imitating Life. IEEE Press, New York (1994)
[3] Chiaberage, M., Bene, G.D., Pascoli, S.D., Lazzerini, B., Maggiore, A.: Mixing fuzzy, neural & genetic algorithms in an integrated design environment for intelligent controllers. In: 1995 IEEE Int. Conf. on SMC, Vol. 4 (1995) 2988-93
[4] Khosla, R., Dillon, T.: Learning Knowledge and Strategy of a Generic Neuro-Expert System Architecture in Alarm Processing. IEEE Transactions on Power Systems, Vol. 12, No. 12 (Dec. 1997) 1610-18
[5] Khosla, R., Dillon, T.: Engineering Intelligent Hybrid Multi-Agent Systems. Kluwer Academic Publishers, MA, USA (August 1997)
[6] Khosla, R., Sethi, I., Damiani, E.: Intelligent Multimedia Multi-Agent Systems – A Human-Centered Approach. Kluwer Academic Publishers, MA, USA (October 2000)
[7] Khosla, R., Damiani, E., Grosky, W.: Human-Centered E-Business. Kluwer Academic Publishers, MA, USA (April 2003)
[8] Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA (1989)
[9] Takagi, H.K.: Interactive Evolutionary Computation: Fusion of the Capabilities of EC Optimization and Human Evaluation. Proceedings of the IEEE, Vol. 89, No. 9 (September 2001)
[10] Zhang, J., Norman, D.A.: Distributed Cognitive Tasks. Cognitive Science (1994) 84-120
A Model for Personality and Emotion Simulation Arjan Egges, Sumedha Kshirsagar, and Nadia Magnenat-Thalmann MIRALab - University of Geneva 24, Rue General-Dufour,1211 Geneva, Switzerland {egges,sumedha,thalmann}@miralab.unige.ch http://www.miralab.ch
Abstract. This paper describes a generic model for personality, mood and emotion simulation for conversational virtual humans. We present a generic model for describing and updating the parameters related to emotional behaviour. Also, this paper explores how existing theories for appraisal can be integrated into the framework. Finally we describe a prototype system that uses the described models in combination with a dialogue system and a talking head with synchronised speech and facial expressions.
1
Introduction
With the emergence of 3D graphics, we are now able to create very believable 3D characters that can move and talk. However, an important part often missing in this picture is the definition of the force that drives these characters: the individuality. In this paper we explore the structure of this entity as well as its link with perception, dialogue and expression. In emotion simulation research so far, appraisal is popularly done by a system based on the OCC model [1]. This model specifies how events, agents and objects from the universe are appraised according to an individual's goals, standards and attitudes. These three (partly domain-dependent) parameters determine the 'personality' of the individual. More recent research indicates that personality can be modelled in a more abstract, domain-independent way [2, 3]. In this paper we will investigate the relationship between such personality models and the OCC model. The effect of personality and emotion on agent behaviour has been researched quite a lot, whether it concerns a general influence on behaviour [4] or a more traditional planning-based method [5]. Various rule based models [6] and probabilistic models [7] have been reported in the past. How behaviour should be influenced by personality and emotion depends on the system that is used and is out of the scope of this paper. How personality and emotion affect expression (speech intonation, facial expressions, etc.) is partly dealt with by Kshirsagar et al. in [8], which describes a system that simulates personalized facial animation with speech and expressions, modulated through mood. There have been very few researchers who have tried to simulate mood [9, 10].
Fig. 1. Overview of the emotional state and personality in an intelligent agent framework

Figure 1 depicts how we view the role of personality and emotion as the glue between perception, dialogue and expression. Perceptive data is interpreted on an emotional level by an appraisal model. This results in an emotion influence that determines, together with the personality, what the new emotional state and mood will be. An intelligent agent uses the emotional state, mood and the personality to create behaviour. This paper is organized as follows. Section 2 and Section 3 present the generic personality, mood and emotion model. In Section 4, we investigate the relationship between the OCC model and personality. Finally, we present an example of how to use our model with an expressive virtual character1.
2
A Generic Personality and Emotion Model
The intention of this section is to introduce the concepts that we will use in our model. In the next section we will explain how these concepts interact. We will first start by defining a small scenario. Julie is standing outside. She has to carry a heavy box upstairs. A passing man offers to help her carry the box upstairs. Julie's personality has a big influence on her perception and on her behaviour. If she has an extravert personality, she will be happy that someone offers her some help. If she has a highly introvert and/or neurotic personality, she will feel fear and distress and she will respond differently. As someone is never 100% extravert or 100% neurotic, each personality factor will have a weight in determining how something is perceived and what decisions are being taken. Surely, Julie's behaviour will not only be based
1 Please read Egges et al. [11] and Kshirsagar et al. [12, 13] for more details.
on emotional concepts, but also on intellectual concepts. A dialogue system or intelligent agent will require concrete representations of concepts such as personality, mood and emotional state so that it can decide what behaviour it will portray [4, 14]. As such, we need to define exactly what we mean by personality, mood, and emotion before we can simulate emotional perception, behaviour and expression.
2.1
Basic Definitions
An individual is an entity that is constantly changing. So, when we speak of an individual, we always refer to it relative to a time t. The moment that the individual starts existing is defined by t = 0. The abstract entity that represents the individual at a time t we will call It from now on. An individual has a personality and an emotional state (we do not yet take mood into consideration). The model based on this assumption is called PE. The personality is constant and initialized with a set of values at t = 0. The emotional state is dynamic and it is initialized to 0 at t = 0. Thus we define It as a tuple (p, et), where p represents the personality and et represents the emotional state at time t. In our example, Julie will portray emotions (that change over time) based on what happens, but how she obtains these emotions and the behaviour that results from it depends on a static part of her being, the personality. There exist many personality models, each of them consisting of a set of dimensions, where every dimension is a specific property of the personality. Take for example the OCEAN model [3], which has five dimensions (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism), or the PEN model [2], which has three dimensions. Generalizing from these theories, we assume that a personality has n dimensions, where each dimension is represented by a value in the interval [0, 1]. A value of 0 corresponds to an absence of the dimension in the personality; a value of 1 corresponds to a maximum presence of the dimension in the personality. The personality p of an individual can then be represented by the following vector:

pT = (α1 . . . αn), ∀i ∈ [1, n] : αi ∈ [0, 1]   (1)
The emotional state has a similar structure to the personality, but it changes over time. The emotional state is a set of emotions that have a certain intensity. For example, the OCC model defines 22 emotions, while Ekman [15] defines 6 emotions for facial expression classification. We define the emotional state et as an m-dimensional vector, where all m emotion intensities are represented by a value in the interval [0, 1]. A value of 0 corresponds to an absence of the emotion; a value of 1 corresponds to a maximum intensity of the emotion. This vector is given as follows:

eTt = (β1 . . . βm), ∀i ∈ [1, m] : βi ∈ [0, 1], if t > 0; eTt = 0, if t = 0   (2)
Furthermore, we define an emotional state history ωt that contains all emotional states until et, thus:

ωt = (e0, e1, . . . , et)   (3)
An extended version of the PE model, the PME model, is given by including mood. We now define the individual It as a triple (p, mt, et), where mt represents the mood at a time t. We define mood as a rather static state of being, that is less static than personality and less fluent than emotions [8]. Mood can be one-dimensional (being in a good or a bad mood) or perhaps multi-dimensional (feeling in love, being paranoid). Whether or not it is justified from a psychological perspective to have a multi-dimensional mood is not in the scope of this paper. However, to increase generality, we will provide for a possibility of having multiple mood dimensions. We define a mood dimension as a value in the interval [−1, 1]. Supposing that there are k mood dimensions, the mood can be described as follows:

mTt = (γ1 . . . γk), ∀i ∈ [1, k] : γi ∈ [−1, 1], if t > 0; mTt = 0, if t = 0   (4)

Just like for the emotional state, there is also a history of mood, σt, that contains the moods m0 until mt:

σt = (m0, m1, . . . , mt)   (5)

3
Emotional State and Mood Updating
From perceptive input, an appraisal model such as OCC will obtain emotional information. This information is then used to update the mood and emotional state. We define the emotional information as a desired change in emotion intensity for each emotion, defined by a value in the interval [0, 1]. The emotion information vector a (or emotion influence) contains the desired change of intensity for each of the m emotions:

aT = (δ1 . . . δm), ∀i ∈ [1, m] : δi ∈ [0, 1]   (6)
The emotional state can then be updated using a function Ψe(p, ωt, a). This function calculates, based on the personality p, the current emotional state history ωt and the emotion influence a, the change of the emotional state. A second part of the emotion update is done by another function, Ωe(p, ωt), that represents internal change (such as a decay of the emotional state). Given these two components, the new emotional state et+1 can be calculated as follows:

et+1 = et + Ψe(p, ωt, a) + Ωe(p, ωt)   (7)
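The functions Ψe and Ωe are left abstract here; Section 5 only notes that the prototype uses a linear, matrix-based implementation. The following is a minimal sketch of one possible linear instantiation in Python with NumPy; the gain matrix, decay factor and example values are our assumptions, not the authors' parameters.

```python
import numpy as np

def update_emotional_state(e_t, p, a, gain=None, decay=0.1):
    """One possible linear PE update: e_{t+1} = e_t + Psi_e + Omega_e.

    e_t : current emotional state, shape (m,), values in [0, 1]
    p   : personality vector, shape (n,), values in [0, 1]
    a   : emotion influence from the appraisal model, shape (m,), values in [0, 1]
    gain: (m, n) matrix mapping personality to per-emotion susceptibility (illustrative)
    """
    m, n = len(e_t), len(p)
    if gain is None:
        gain = np.full((m, n), 1.0 / n)       # neutral susceptibility by default
    psi = (gain @ p) * a                      # personality-weighted response to the influence
    omega = -decay * e_t                      # internal change: simple exponential decay
    return np.clip(e_t + psi + omega, 0.0, 1.0)

# Example: a strongly extravert personality reacting to a joy influence of 0.58
p = np.array([0.5, 0.5, 0.9, 0.5, 0.2])       # OCEAN-style vector (illustrative values)
e = np.zeros(24)                              # 22 OCC emotions + disgust + surprise
a = np.zeros(24); a[0] = 0.58                 # index 0 assumed to be "joy" in this sketch
print(update_emotional_state(e, p, a)[0])
```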
In the PME model (which includes the mood), the update process slightly changes. When an emotion influence has to be processed, the update now happens in two steps. The first step consists of updating the mood; the second step consists of updating the emotional state. The mood is updated by a function Ψm(p, ωt, σt, a) that calculates the mood change, based on the personality, the emotional state history, the mood history and the emotion influence. The mood is internally updated using a function Ωm(p, ωt, σt). Thus the new mood mt+1 can be calculated as follows:

mt+1 = mt + Ψm(p, ωt, σt, a) + Ωm(p, ωt, σt)   (8)
The emotional state can then be updated by an extended function Ψe that also takes into account the mood history and the new mood. The internal emotion update, which now also takes mood into account, is defined as Ωe(p, ωt, σt+1). The new emotion update function is given as follows:

et+1 = et + Ψe(p, ωt, σt+1, a) + Ωe(p, ωt, σt+1)   (9)

4
Personality, Emotion and the OCC Model of Appraisal
As OCC is the most widely accepted appraisal model, we will present some thoughts about how to integrate personality models with OCC. OCC uses goals, standards and attitudes. These three notions are for a large part domain dependent. As multi-dimensional personality models are domain-independent, we need to define the relationship between this kind of personality model and the personality model as it is used in OCC. We choose an approach where we assume that the goals, standards and attitudes in the OCC model are fully defined depending on the domain. Our personality model will then serve as a selection criterion that indicates what and how many goals, standards and attitudes fit with the personality. Because the OCEAN model is widely accepted, we will use this model to illustrate our approach. For an overview of the relationship between OCEAN and the goals, standards, and attitudes, see Table 1. The intensity of each personality factor will determine the final effect on the goals, standards and attitudes.
5
Application
5.1
Overview
In order to demonstrate the model in a conversational context, we have built a conversational agent represented by a talking head. The update mechanisms of emotions and mood are implemented using a linear approach (using simple matrix representations and computations). The center of the application is a dialogue system, using Finite State Machines, that generates different responses
Table 1. Relationship between OCEAN and OCC parameters (Goals / Standards / Attitudes)

Openness: Standards: making a shift of standards in new situations; Attitudes: attitude towards new elements
Conscientiousness: Goals: abandoning and adopting goals, determination
Extraversion: Goals: willingness to communicate
Agreeableness: Goals: abandoning and adopting goals in favor of others; Standards: compromising standards in favor of others; Attitudes: adaptiveness to other people
Neuroticism
based on the personality, mood and emotional state. The personality and emotion model is implemented using the 5 factors of the OCEAN model of personality, one mood dimension (a good-bad mood axis) and the 22 emotions from OCC plus 2 additional emotions (disgust and surprise) to have maximum flexibility of facial expression. A visual front-end uses the dialogue output to generate speech and facial expressions. The dialogue system annotates its outputs with emotional information. An example of such a response is (the tag encodes a joy emotion percentage of 58): |JO58|Thanks! I like you too!
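A small sketch of how such annotated responses could be parsed by a front-end; the general tag grammar and the emotion codes other than JO are assumptions on our part, since only the |JO58| example is given in the paper.

```python
import re

# Assumed tag format: a two-letter emotion code followed by an intensity percentage,
# e.g. "|JO58|Thanks! I like you too!" -> joy at 58%. Codes other than JO are hypothetical.
EMOTION_CODES = {"JO": "joy", "DI": "distress", "FE": "fear"}

def parse_annotated_response(response):
    tags = [(EMOTION_CODES.get(code, code), int(pct) / 100.0)
            for code, pct in re.findall(r"\|([A-Z]{2})(\d{1,3})\|", response)]
    text = re.sub(r"\|[A-Z]{2}\d{1,3}\|", "", response).strip()
    return tags, text

print(parse_annotated_response("|JO58|Thanks! I like you too!"))
# -> ([('joy', 0.58)], 'Thanks! I like you too!')
```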
5.2
Visual Front-End
Our visual front-end comprises a 3D talking head capable of rendering speech and facial expressions in synchrony with synthetic speech. The facial animation system interprets the emotional tags in the responses, generates lip movements for the speech and blends the appropriate expressions for rendering in real-time with lip synchronization2. Facial dynamics are considered during the expression change, and appropriate temporal transition functions are selected for facial animation. We use MPEG-4 Facial Animation Parameters as low level facial deformation parameters [12]. However, for defining the visemes and expressions, we use the Principal Components (PCs) [13]. The PCs are derived from the statistical analysis of the facial motion data and reflect independent facial movements observed during fluent speech. The use of PCs facilitates realistic speech animation, especially when blended with various facial expressions. We use available text-to-speech (TTS) software that provides phonemes with temporal information. The co-articulation rules are applied based on the algorithm of Cohen et al. [16]. The expressions are embedded in the text in terms of tags
2 For more details on expressive speech animation, see Kshirsagar et al. [13].
Fig. 2. Julie’s behaviour as an extravert (a) and an introvert (b) personality
and associated intensities. An attack-sustain-decay-release type of envelope is applied for these expressions and they are blended with the previously calculated co-articulated phoneme trajectories. Periodic eye-blinks and minor head movements are applied to the face for increased believability. Periodic display of facial expressions is also incorporated, which depends on the recent expression displayed with the speech, as well as the mood of the character.
5.3
Example Interaction
As an example, we have developed a small interaction system that simulates Julie's behaviour. We have performed these simulations with different personalities, which gives different results in the interaction and the expressions that the face is showing. The interaction that takes place for an extravert personality (90% extravert) is shown in Figure 2(a). An introvert personality (5% extravert) significantly changes the interaction (see Figure 2(b)).
6
Conclusions and Future Work
In this paper we have presented a basic framework for personality and emotion simulation. Subsequently we have shown how this framework can be integrated with an application such as an expressive MPEG-4 talking head with speech synthesis. Our future work will focus on user studies to validate the personality
and emotion model that is used. We will also explore the possibility of having multiple mood dimensions. Furthermore we will explore how personality and emotion are linked to body behaviour and what computational methods are required to simulate this link.
Acknowledgment This research has been funded through the European Project MUHCI (HPRN-CT-2000-00111) by the Swiss Federal Office for Education and Science (OFES).
References
[1] Ortony, A., Clore, G.L., Collins, A.: The Cognitive Structure of Emotions. Cambridge University Press (1988) 453
[2] Eysenck, H.J.: Biological dimensions of personality. In Pervin, L.A., ed.: Handbook of personality: Theory and research. New York: Guilford (1990) 244–276 453, 455
[3] Costa, P.T., McCrae, R.R.: Normal personality assessment in clinical practice: The NEO personality inventory. Psychological Assessment (1992) 5–13 453, 455
[4] Marsella, S., Gratch, J.: A step towards irrationality: Using emotion to change belief. In: Proceedings of the 1st International Joint Conference on Autonomous Agents and Multi-Agent Systems, Bologna, Italy (2002) 453, 455
[5] Johns, M., Silverman, B.G.: How emotions and personality effect the utility of alternative decisions: a terrorist target selection case study. In: Tenth Conference On Computer Generated Forces and Behavioral Representation. (2001) 453
[6] André, E., Klesen, M., Gebhard, P., Allen, S., Rist, T.: Integrating models of personality and emotions into lifelike characters. In: Proceedings International Workshop on Affect in Interactions. Towards a New Generation of Interfaces. (1999) 453
[7] Ball, G., Breese, J.: Emotion and personality in a conversational character. In: Proceedings of the Workshop on Embodied Conversational Characters. (1998) 83–84 and 119–121 453
[8] Kshirsagar, S., Magnenat-Thalmann, N.: A multilayer personality model. In: Proceedings of 2nd International Symposium on Smart Graphics, ACM Press (2002) 107–115 453, 456
[9] Velásquez, J.: Modeling emotions and other motivations in synthetic agents. In: Proceedings of AAAI-97, MIT Press (1997) 10–15 453
[10] Moffat, D.: Personality parameters and programs. In: Lecture Notes in Artificial Intelligence: Creating Personalities for Synthetic Actors: Towards Autonomous Personality Agents. (1995) 453
[11] Egges, A., Kshirsagar, S., Zhang, X., Magnenat-Thalmann, N.: Emotional communication with virtual humans. In: The 9th International Conference on Multimedia Modelling. (2003) 454
[12] Kshirsagar, S., Garchery, S., Magnenat-Thalmann, N.: Feature Point Based Mesh Deformation Applied to MPEG-4 Facial Animation. In: Deformable Avatars. Kluwer Academic Publishers (2001) 33–43 454, 458
[13] Kshirsagar, S., Molet, T., Magnenat-Thalmann, N.: Principal components of expressive speech animation. In: Proceedings Computer Graphics International. (2001) 59–69 454, 458
[14] Elliott, C.D.: The affective reasoner: a process model of emotions in a multiagent system. PhD thesis, Northwestern University, Evanston, Illinois (1992) 455
[15] Ekman, P.: Emotion in the Human Face. Cambridge University Press, New York (1982) 455
[16] Cohen, M.M., Massaro, D. In: Modelling co-articulation in synthetic visual speech. Springer-Verlag (1993) 139–156 458
Using Loose and Tight Bounds to Mine Frequent Itemsets Lei Jia, Jun Yao, and Renqing Pei School of Mechatronics and Automation Shanghai University, Shanghai, 200072, China {jialei7829,junyao0529,prq44}@hotmail.com
Abstract. Mining frequent itemsets forms a core operation in many data mining problems. The operation, however, is data intensive and produces a large output. Furthermore, we also have to scan the database many times. In this paper, we propose to use loose and tight bounds to mine frequent itemsets. We use loose bounds to remove the candidate itemsets whose support cannot satisfy the preset threshold. Then, we check whether we can determine the frequency of the remaining candidate itemsets with the tight bounds. For the itemsets that cannot be decided this way, we scan the database. Using this new method, we can decrease not only the number of candidate frequent itemsets that have to be tested, but also the number of database scans.
1
Introduction
Data mining aims to efficiently discover interesting rules from large collections of data. Mining frequent itemsets forms a core operation in many data mining problems. The frequent itemset problem is stated as follows. Assume we have a finite set of items I. A transaction is a subset of I, together with a unique identifier. A database D is a finite set of transactions. A subset of I is called an itemset. The support of an itemset equals the fraction of the transactions in D that contain it. If the support is above the preset threshold, we call it a frequent itemset. Many researchers have devoted themselves to designing efficient structures or algorithms [1,2,3,4,5,6,7,8] to mine frequent itemsets. We find that, in these algorithms, we often spend a lot of time in scanning the database. If we can optimize the scanning processes, we will improve the efficiency of the mining dramatically. In this paper, unlike the existing algorithms, we propose to use loose and tight bounds to mine the frequent itemsets. We use loose bounds to remove the candidate itemsets whose support cannot satisfy the preset threshold. Then, we check whether we can determine the frequency of the remaining candidate itemsets with the tight bounds. For the itemsets that cannot be decided this way, we scan the database. In section 2, we introduce loose bounds and tight bounds. In section 3, we show how to use loose and tight bounds to mine frequent itemsets by the LTB algorithm (loose
and tight bounds based algorithm). We present the experimental results in section 4, and section 5 is a brief conclusion.
2
Loose Bounds and Tight Bounds
2.1
Loose Bounds
Suppose we have two itemsets A, B with supports sup(A), sup(B), and let minsup be the preset support threshold. According to the Apriori trick, if at least one of them is not frequent, we should not consider AB. If A and B are both frequent, we have to generate and test the candidate frequent itemset AB. Not only do we generate it, but we also scan the database to determine its frequency. We know that sup(AB) ∈ [max(0, sup(A)+sup(B)−1), min(sup(A), sup(B))]. Because we already know sup(A) and sup(B) from the last level, we can immediately deduce that AB is a frequent itemset if max(0, sup(A)+sup(B)−1) ≥ minsup. In the same way, if min(sup(A), sup(B)) < minsup, we confirm that it is not a frequent itemset. We call this bound a loose bound. We define a symbol ∇ to denote the cross union between the itemsets A and B. Given a database D and s1, s2 as minimum and maximum support levels, we define Fs as the itemsets with frequency above s, and Fs1,s2 as the itemsets occurring in at least s1, but less than s2, percent of the transactions in D. We get Fs1 − Fs2 ⊆ Fs1,s2, and also Fs1,s2 ∪ Fs2 ⊆ Fs1. Now we can apply this definition to the loose bound. We let ∆s = s1+s3−1; if ∆s > 0, then Fs1,s2 ∇ Fs3,s4 ⊆ F∆s,min(s2,s4). For example, suppose sup(A)=60%, sup(B)=70%, sup(C)=10%, sup(D)=30% and minsup=20%. We can easily get 30% ≤ sup(AB) ≤ 60%, 0 ≤ sup(AC) ≤ 10%, and 0 ≤ sup(AD) ≤ 30%. So we conclude that itemset AB must be frequent, AC cannot be frequent, and we do not know whether AD is frequent or not. In the existing algorithms, we would have to scan the database for itemsets like AB and AD. But now, we can use the following tight bounds to determine the exact frequency of itemsets like AB.
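As a minimal illustration (ours, not the authors' implementation), the following Python sketch applies the loose-bound check to a candidate pair given the supports of its two generators; the function name is ours.

```python
def loose_bound(sup_a, sup_b, minsup):
    """Classify candidate AB from sup(A), sup(B) using the loose bound
    sup(AB) in [max(0, sup(A)+sup(B)-1), min(sup(A), sup(B))]."""
    low = max(0.0, sup_a + sup_b - 1.0)
    high = min(sup_a, sup_b)
    if low >= minsup:
        return "frequent"        # no scan needed
    if high < minsup:
        return "infrequent"      # pruned, no scan needed
    return "unknown"             # must be counted, or passed to the tight bounds

# Example from the text: minsup = 20%
for pair, (sa, sb) in {"AB": (0.6, 0.7), "AC": (0.6, 0.1), "AD": (0.6, 0.3)}.items():
    print(pair, loose_bound(sa, sb, 0.2))   # AB frequent, AC infrequent, AD unknown
```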
2.2
Tight Bounds
We find that, using the tight bounds [9,10], the number of database scans and the number of itemsets that have to be counted can be reduced significantly. Consider the following example. Suppose we know sup(A)=sup(B)=sup(C)=2/3, sup(AB)=sup(AC)=sup(BC)=1/3 and minsup=1/3. With the idea used in Apriori-like algorithms we cannot judge whether the itemset ABC is frequent or not. However, we can calculate its frequency with the following inequalities.
We get the following results:
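One set of inequalities consistent with the supports above, following the deduction rules of [9,10] (our reconstruction rather than a quotation of the original display), is:

```latex
\begin{align*}
\sup(ABC) &\ge \sup(AB) + \sup(AC) - \sup(A) = \tfrac{1}{3} + \tfrac{1}{3} - \tfrac{2}{3} = 0\\
\sup(ABC) &\ge \sup(AB) + \sup(BC) - \sup(B) = 0, \qquad
\sup(ABC)  \ge \sup(AC) + \sup(BC) - \sup(C) = 0\\
\sup(ABC) &\le 1 - \sup(A) - \sup(B) - \sup(C) + \sup(AB) + \sup(AC) + \sup(BC) = 1 - 2 + 1 = 0
\end{align*}
```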
So we conclude that sup(ABC)=0. We find that we can determine the frequency without scanning the database. This method is called tight bounds. We define tX as the frequency of transactions whose item set is exactly X. For example, tAB means the frequency of transactions that contain exactly AB in the database, and it is independent of tA and tB. Then sup(A) = tA + tAB + tAC + … + tABC + tABD + … + tI. Then we have the following equalities.
We know that each tX is non-negative. So we can solve the above equalities recursively from (1) to (N) to get the following solutions.
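One standard way to write the solutions, following [9,10] (our formulation, not necessarily the authors' exact display), is the inclusion-exclusion form

```latex
t_Y \;=\; \sum_{Y \subseteq Z \subseteq I} (-1)^{\lvert Z \setminus Y \rvert}\, \sup(Z)
\qquad \text{for every itemset } Y \subseteq I .
```

Requiring every tY to be non-negative is what produces the lower and upper limits on the support of a candidate itemset from the supports of its subsets.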
We find that if we know the supports of all subsets of an itemset I, we can accurately calculate the lower and upper limits of its support with the above solutions. If the lower and upper limits are the same, we can determine its frequency without scanning the database. We prune the supersets of I with the downward-closed property (Apriori trick). When we cannot decide the support because the lower and upper limits of the tight bounds are not the same, we have to scan the database for those itemsets. We also scan the database for the itemsets, like AD in the last subsection, that the loose bounds cannot decide.
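A compact sketch of how such tight bounds can be computed for a candidate itemset from the supports of its proper subsets, in the style of the deduction rules of [9,10]; this is an illustration in Python, not the authors' implementation.

```python
from itertools import combinations

def tight_bounds(candidate, sup):
    """Tight lower/upper bounds on sup(candidate) from the supports of all
    proper subsets (sup maps frozenset -> support; the empty set has support 1)."""
    I = frozenset(candidate)
    lower, upper = 0.0, 1.0
    for r in range(len(I)):                       # every proper subset X of I
        for X in map(frozenset, combinations(sorted(I), r)):
            # delta = sum over X <= J < I of (-1)^(|I\J|+1) * sup(J)
            delta = 0.0
            for k in range(r, len(I)):
                for J in map(frozenset, combinations(sorted(I), k)):
                    if X <= J:
                        delta += (-1) ** (len(I) - len(J) + 1) * sup[J]
            if (len(I) - len(X)) % 2 == 1:
                upper = min(upper, delta)         # odd |I\X| gives an upper bound
            else:
                lower = max(lower, delta)         # even |I\X| gives a lower bound
    return lower, upper

# Example from the text: sup(A)=sup(B)=sup(C)=2/3, support of each pair = 1/3
sup = {frozenset(): 1.0}
for x in "ABC":
    sup[frozenset(x)] = 2 / 3
for x, y in combinations("ABC", 2):
    sup[frozenset((x, y))] = 1 / 3
print(tight_bounds("ABC", sup))   # both bounds are 0 (up to float rounding): no scan needed
```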
3
LTB Algorithm (Loose and Tight Bounds Based Algorithm)
Through the above analysis, we find that if we make full use of the loose and tight bounds, we can decrease the number of itemsets that need to be tested and the number of times we have to scan the database. So we propose the LTB algorithm in this section.
LTB algorithm:
Input: database D, minsup and t (the highest level at which we use the loose bounds);
Output: frequent itemsets L;
1) L = φ;
2) L1 = {frequent 1-itemsets};
3) L1 = order L1 according to their frequency in ascending order;
4) T[1] = get_filtered(D, L1);
5) for (k=2; Lk-1 ≠ φ; k++)
6)   Ck = apriori_gen(Lk-1);
7)   if (k ≤ t)
8)     use the loose bounds to get L' whose support is above minsup;
9)     L'' = Ck - L';
10)    calculate the support of itemsets M1 (M1 ⊆ L') with the tight bounds;
11)    if the low and upper limits are the same
12)      M2 = L' - M1;
13)      Lk = Lk ∪ M1;
14)    end;
15)    scan the database T[1] for the itemsets x (x ∈ (L'' ∪ M2));
16)    if sup(x) ≥ minsup
17)      Lk = Lk ∪ x;
18)  else
19)    calculate the support of itemsets in Ck with the tight bounds;
20)    if the low and upper limits of M3 (M3 ⊆ Ck) are the same
21)      M4 = Ck - M3;
22)      Lk = Lk ∪ M3;
23)    end;
24)    scan the database T[1] for the itemsets y in M4;
25)    if sup(y) ≥ minsup
26)      Lk = Lk ∪ y;
27)  end
28) end
Answer L = ∪k Lk;
The algorithm can be divided into three parts. From line 1 to line 4, we find the frequent itemsets at level 1 and use them to filter the database D into T[1]. If the database contains infrequent 1-items, this filtering can shorten the database significantly.
From line 5 to line 17, we use the loose and tight bounds to mine the frequent itemsets at levels up to t, where t is the preset highest level at which we use the loose bounds. We utilize the apriori_gen function, as in the traditional Apriori algorithm, to make full use of the downward-closed property to prune the search space. We use the loose bounds to select the itemsets whose frequency needs to be calculated with the tight bounds. Some of these have the same low and upper limits, so we can determine their support. We scan the database for the others, and for the itemsets whose frequency cannot be determined by the loose bounds. The last part is from line 18 to the end. In this part, because the loose bounds become too loose, we only calculate the supports with the tight bounds. We scan the database again if we cannot determine the support from them. Using this algorithm, we can use the loose and tight bounds to mine frequent itemsets efficiently.
4
Experiment
To study the effectiveness and efficiency of the algorithm we proposed in the above section, we implemented it in VB and tested it on a 1GHz Pentium PC with 128 megabytes of main memory. The test uses a set of synthetic transaction databases generated with a randomized itemset generation algorithm similar to the algorithm described in [2]. Table 1 shows the databases used, in which N is the average size of the itemsets and T is the size of the item set we randomly choose from.

Table 1.
Database 1: N = 2, T = 5, size of transactions 100K
Database 2: N = 3, T = 5, size of transactions 100K
Fig. 1. Database 1, N=2, MinSup=20% (running time in ms of Apriori and LTB vs. size of transactions in K)

Fig. 2. Database 2, N=3, MinSup=20% (running time in ms of Apriori and LTB vs. size of transactions in K)

Fig. 3. Database 1, N=2, MinSup=40% (running time in ms of Apriori and LTB vs. size of transactions in K)

Fig. 4. Database 2, N=3, MinSup=40% (running time in ms of Apriori and LTB vs. size of transactions in K)
We compare our LTB algorithm with the classic Apriori algorithm [2]. Figure 1 and Figure 2 show the running time of the two approaches with respect to the number of transactions in database 1 and database 2, where the support threshold MinSup is 20%. In the same way, we get Figure 3 and Figure 4 respectively when we change MinSup to 40%. When N=2, the LTB algorithm is about 6% (MinSup=20%) or 3% (MinSup=40%) less efficient than the Apriori algorithm. The reason is that we cannot find itemsets that satisfy the tight bounds, so we have to scan the database for them all, and the time spent calculating the tight bounds is wasted. When N=3, the LTB approach is about 20% (MinSup=20%) or 12% (MinSup=40%) more efficient, just because we can determine the support of some itemsets without scanning the database. In practice, there are many candidate frequent itemsets and the database is very large, so we can save more time if we use the LTB algorithm.
5
Conclusion
In this paper we propose a new method, based on loose and tight bounds, to mine frequent itemsets. The algorithm we propose is called the LTB algorithm (loose and tight bounds based algorithm). The loose bounds help us narrow the search space, and the tight bounds help us obtain the frequency of itemsets without scanning the database. We also use the ordering and the Apriori trick in the algorithm to make the mining process efficient. The detailed experiments demonstrate the effectiveness of our method.
References
[1] R. Agrawal, T. Imielinski and A. Swami. Mining association rules between sets of items in large databases. SIGMOD (1993) 207–216
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB (1994) 487–499
[3] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB (1995) 420–431
[4] R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. In Proc. 1997 Int. Conf. Knowledge Discovery and Data Mining (KDD'97) 67–73
[5] J. Pei, J. Han, and L.V.S. Lakshmanan. Mining frequent itemsets with convertible constraints. In ICDE'01, 323–332
[6] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD'00, 1–12
[7] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Efficient mining of association rules using closed itemset lattices. Information Systems (1999) 25–46
[8] J.-F. Boulicaut and A. Bykowski. Frequent closures as a concise representation for binary data mining. In Proceedings of the Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'00) 62–73
[9] T. Calders. Deducing bounds on the frequency of itemsets. In EDBT Workshop DTDM Database Techniques in Data Mining, 2002
[10] T. Calders and B. Goethals. Mining all non-derivable frequent itemsets. Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, 2002
Mining Association Rules with Frequent Closed Itemsets Lattice Lei Jia, Jun Yao, and Renqing Pei School of Mechatronics and Automation Shanghai University, Shanghai, 200072, China {jialei7829,junyao0529,prq44}@hotmail.com
Abstract. One of the most important tasks in the field of data mining is the problem of finding association rules. In the past few years, frequent closed itemset mining has been introduced. It is a condensed representation of the data and generates a small set of rules without information loss. In this paper, based on the theory of Galois connections, we introduce a new framework called the frequent closed itemsets lattice. Compared with the traditional itemsets lattice, it is simpler and only contains the itemsets that can be used to generate association rules. Using this framework, we get the support of frequent itemsets and mine association rules directly. We also extend it to a fuzzy frequent closed itemsets lattice, which is more efficient at the expense of precision.
1
Introduction
Data mining aims to extract previously unknown and potentially useful information from large databases. The problem of mining association rules [1,2] has been the subject of numerous studies. However, the traditional methods generate too many frequent itemsets. We know that if the size of the largest itemset is N, the candidate frequent itemset space is 2^N. This is a complex and non-trivial task. In the past few years, based on the theory of Formal Concept Analysis, a new technology, frequent closed itemset mining, has been introduced [3,4,5,6,7]. It extracts a particular subset of the frequent itemsets that can regenerate the whole set without information loss. In this paper, we introduce a new framework. It is called the Frequent Closed Itemsets Lattice. We use an Apriori-style algorithm to build the framework. It makes full use of Galois connection theory and only contains the itemsets that can be used to form association rules. We also extend the frequent closed itemsets lattice to a fuzzy frequent closed itemsets lattice. It is more efficient at the expense of information loss. In Section 2, we present the theoretical basis of the frequent closed itemsets lattice. In Section 3, we introduce the Apriori-style FCIL algorithm to build the new framework. In section 4, we discuss how to generate the informative association rules under the
framework. In section 5, we extend the framework to the fuzzy frequent closed itemsets lattice. We present our experiments in section 6, and section 7 is a brief conclusion.
2
Theoretical Basis
In this section, we define the data mining context, Galois connection, Galois closure operator, closed itemsets and the frequent closed itemsets lattice [8].

Definition 1 (Data mining context). A data mining context is a triple D = (Γ, τ, R). The rows are transactions Γ and the columns are the items τ. R ⊆ Γ × τ is a relation between the transactions and items.

Definition 2 (Galois connection). Let D = (Γ, τ, R) be a data mining context. For T ⊆ Γ and I ⊆ τ, we define:

p(T): 2^Γ → 2^τ, p(T) = {i ∈ τ | ∀t ∈ T, (t,i) ∈ R}
q(I): 2^τ → 2^Γ, q(I) = {t ∈ Γ | ∀i ∈ I, (t,i) ∈ R}

Because 2^Γ and 2^τ are the power sets of Γ and τ, (p, q) forms a Galois connection. The connection has the following properties:

I1 ⊆ I2 → q(I1) ⊇ q(I2)
T1 ⊆ T2 → p(T1) ⊇ p(T2)
T ⊆ q(I) → I ⊆ p(T)

Definition 3 (Galois closure operator). Given the Galois connection, if we choose h = p(q(I)), g = q(p(T)), with I, I1, I2 ⊆ τ and T, T1, T2 ⊆ Γ, we have the following properties:

Extension: I ⊆ h(I); T ⊆ g(T)
Idempotency: h(h(I)) = h(I); g(g(T)) = g(T)
Monotonicity: I1 ⊆ I2 → h(I1) ⊆ h(I2); T1 ⊆ T2 → g(T1) ⊆ g(T2)

Definition 4 (Closed itemsets and frequent closed itemsets). The closure of an itemset I is the maximal superset of I which has the same support as I. A closed itemset C is an itemset whose support is equal to the support of its closure. If the support is higher than the minimum support, the closed itemset is called a frequent closed itemset.

Definition 5 (Frequent closed itemsets lattice). We use all the itemsets C' in the closure to build a complete lattice (C', ≤) called the frequent closed itemsets lattice.
The lattice has the following properties: 1) The closure lattice is simpler than the frequent itemsets lattice; it is composed of the frequent closed itemsets. 2) All sub-itemsets of a frequent closed itemset in the closure lattice are frequent (anti-monotonicity property). 3) We can get the association rules directly from the lattice, as shown in section 4.
3
Building Frequent Closed Itemsets Lattice with Apriori-Style FCIL Algorithm
Considering that the frequent closed itemsets lattice has the anti-monotonicity property, we build the framework with an Apriori-style algorithm. Suppose we have the 6 transactions in Table 1. The second column lists the items contained in the transaction and the class label is in the third column. Now, we start to build the lattice. Generally, the itemsets are sorted in lexicographic order, but we find that putting the items in a specific order improves the efficiency of the process. We use the support-based order introduced in [7]. We also sort the frequent 1-items in support-based ascending order and use the same Itemset-Transaction structure.

Table 1.
1: ACTW, class 1
2: CDW, class 2
3: ACTW, class 1
4: ACDW, class 1
5: ACDTW, class 2
6: CDT, class 2

Fig. 1. Frequent closed itemsets lattice
Level-wise algorithms like Apriori [2] have proved to be among the most efficient algorithms in the field of association rule mining. So we propose the Apriori-style FCIL algorithm (Frequent Closed Itemsets Lattice algorithm). The FCIL algorithm:
Input: T[1] (in ascending order), MinSup, the minimum support threshold, and a function t(x) that returns the transactions that contain x.
Output: frequent closed itemsets lattice
1) L1 = {large 1-itemsets X1 × t(X1), in ascending order};
2) T[2] = order T[1] according to L1;
3) construct the lattice of level 1;
4) for (k=2; Lk-1 ≠ φ; k++)
5)   Ck = apriori_gen(Lk-1), where Lk = Lk-1 ∪ Lk-1 and t(Lk) = t(Lk-1) ∩ t(Lk-1);
6)   forall transactions t ∈ T[2]
7)     Ct = subset(Ck, t);
8)     forall candidates c ∈ Ct do
9)       c.count++;
10)    end
11)  end
12)  Lk = {c ∈ Ck | c.count ≥ MinSup};
13)  if t(x) = t(y) where x ∈ Lk, y ∈ Lk-1
14)    add the item to the lattice and connect them;
15)  else add the item to the lattice;
16)  end
17) end
return the frequent closed itemsets lattice

Using the FCIL algorithm, we can get the frequent closed itemset collections from the lattice easily. Each {} groups the frequent itemsets that have the same closure, and the parentheses show the transactions that contain the itemsets. They are {D, DC (2456)}, {DW, DWC (245)}, {T, TC (1356)}, {TA, TW, TAW, TAC, TWC (135)}, {A, AW, AC, AWC (1345)}, {W, WC (12345)} and {C (123456)}.
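To make the Galois closure concrete, the following small Python sketch (ours, not the FCIL algorithm itself) computes h(X) = p(q(X)) directly on the transactions of Table 1 and reproduces two of the groupings listed above.

```python
# Transactions from Table 1 (transaction id -> set of items).
DB = {1: set("ACTW"), 2: set("CDW"), 3: set("ACTW"),
      4: set("ACDW"), 5: set("ACDTW"), 6: set("CDT")}

def q(itemset):                    # transactions containing every item of the itemset
    return {tid for tid, items in DB.items() if set(itemset) <= items}

def p(tids):                       # items common to all the given transactions
    if not tids:
        return set().union(*DB.values())   # p of the empty transaction set = all items
    return set.intersection(*(DB[t] for t in tids))

def closure(itemset):              # h(X) = p(q(X)), plus the supporting tid-set
    return "".join(sorted(p(q(itemset)))), sorted(q(itemset))

print(closure("W"))    # -> ('CW', [1, 2, 3, 4, 5]): W and WC share the same closure
print(closure("TA"))   # -> ('ACTW', [1, 3, 5])
```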
4
Forming Association Rules
After building the frequent closed itemsets lattice, we try to form association rules from the structure directly. Association rules can be divided into two classes. For a rule I1 → I2−I1, if both I1 and I2 are in the same closure, the confidence of the rule is 1; these are the so-called exact rules. If I1 and I2 are in different closures, the rule may belong to the approximate rules. We should compare the confidence with the preset confidence threshold to determine whether they are our targeted association rules. We use ⇒ and → to show exact and approximate rules
respectively; we also present the confidence of an approximate rule in brackets behind the rule. Using the following inference techniques, we elicit the informative rules directly. 1) The informative rules are those from which we can infer the other rules. We should pay attention to the following cases: i. If the antecedents of two rules are the same, the one with the largest consequent is the informative one. ii. If the conjunction of the antecedent and consequent of two rules is the same, the rule with the smallest antecedent is the informative rule. 2) According to the Guigues-Duquenne basis and the Luxenburger basis [8], we can prune some rules. (1) Guigues-Duquenne basis: i. from X ⇒ Y and W ⇒ Z we can derive XW ⇒ YZ; ii. from X ⇒ Y and Y ⇒ Z we can derive X ⇒ Z. We can also get: from X → Y and Y ⇒ Z we can derive X → Z. (2) Luxenburger basis: i. The association rule X → Y has the same support and confidence as the rule close(X) → close(Y), where close(X) means the closed itemset of X. ii. For any three closed itemsets I1, I2 and I3 such that I1 ⊆ I2 ⊆ I3, the confidence of the rule I1 → I3 is equal to the product of the confidences of the rules I1 → I2 and I2 → I3, and its support is equal to the support of the rule I2 → I3. (3) When two rules are equipotent (the support of the antecedent and of the conjunction of the antecedent and consequent of the two rules are the same), we can delete one according to the ascending order. We get the informative approximate rules and exact rules respectively, then compare them with the above principles again to get the final results. According to the transactions in Table 1, we get the final informative rules: C → W (5/6), W → AC (0.8), D ⇒ C, T ⇒ C and TW ⇒ AC. We note that, unlike the traditional algorithms, we do not generate a rule like W → A and then prune it when we find W → AC. Based on the frequent closed itemsets lattice and the above inference techniques, we can get W → AC directly (the rule is determined by the smallest itemset in one closure together with the largest itemset in another closure). We are sure it will cover the information of rules like W → A. Through analyzing the example, we find that if we combine the frequent closed itemsets lattice and the inference techniques, we can mine association rules efficiently.
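As an illustrative cross-check (ours, not part of the paper), the supports and confidences of the listed rules can be recomputed directly from the Table 1 transactions:

```python
DB = {1: set("ACTW"), 2: set("CDW"), 3: set("ACTW"),
      4: set("ACDW"), 5: set("ACDTW"), 6: set("CDT")}

def support(itemset):
    return sum(1 for items in DB.values() if set(itemset) <= items)

def confidence(antecedent, consequent):
    return support(set(antecedent) | set(consequent)) / support(antecedent)

for lhs, rhs in [("C", "W"), ("W", "AC"), ("D", "C"), ("T", "C"), ("TW", "AC")]:
    print(f"{lhs} -> {rhs}: confidence {confidence(lhs, rhs):.3f}")
# C -> W: 0.833 (= 5/6), W -> AC: 0.800, D -> C: 1.000, T -> C: 1.000, TW -> AC: 1.000
```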
5
Fuzzy Frequent Closed Itemsets Lattice
In reality, the more correlated the data is, the more association rules we will find. In such cases, the collection of frequent closed itemsets is more compact than that of frequent itemsets. However, we often find the following situation: itemsets A and B
both occur in almost the same transactions, with small exceptions. Case 1: |A|=|B|, which means they are at the same level. Case 2: |A|=|B|−n, which means A is a subset of B. In the above definition, if the frequent itemsets have the same support (occur in the same transactions), they are in the same frequent closed itemset. The frequent closed itemsets are a concise representation. However, in many applications, it costs a lot of time and memory. So we propose to loosen the concise representation to an ε-adequate representation, so that we can gain efficiency at the expense of precision. We introduce the notion of fuzzy frequent closed itemsets. Formally, if an itemset X occurs in t transactions of the database, we say that the itemset Y is in the same fuzzy frequent closed itemset as X if the difference between sup(X) and sup(X ∪ {Y}) is less than the threshold δ. If δ = 0, the fuzzy frequent closed itemsets degenerate to frequent closed itemsets. The fuzzy frequent closed itemsets form the fuzzy frequent closed itemsets lattice. In the FCIL algorithm, if we change step 13 to step 13': if |t(x) − t(y)| ≤ δ where x ∈ Lk-1, y ∈ Lk, we can construct the fuzzy frequent closed itemsets as well.
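A minimal sketch of the δ-tolerant test behind step 13', written by us against explicit tid-sets; it assumes, as in the algorithm, that y extends x so that t(y) ⊆ t(x).

```python
def same_fuzzy_closure(tids_x, tids_y, delta):
    """Step 13'-style check: x (level k-1) and y (level k) are grouped together
    when their tid-sets differ by at most delta transactions."""
    return len(tids_x) - len(tids_y) <= delta     # valid when tids_y is a subset of tids_x

# Example from Table 1: t(D) = {2,4,5,6}, t(DT) = {5,6}. With delta = 2 they would be
# merged; with delta = 0 the check degenerates to the exact condition t(x) = t(y).
print(same_fuzzy_closure({2, 4, 5, 6}, {5, 6}, 2))   # True
print(same_fuzzy_closure({2, 4, 5, 6}, {5, 6}, 0))   # False
```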
6
Experiment
To study the effectiveness and efficiency of the algorithm we proposed in the above section, we implemented it in Basic and tested on a 1GHz Pentium PC computer with 128 megabytes of the main memory. The test consists of a set of synthetic transaction database generated using a randomized itemset generation algorithm similar to the algorithm described in [2]. The average size of these itemsets N is 3, the size of item sets T we randomly choose from is 5 and the size of database is 10K. The minimum support threshold is 70%. Using the traditional algorithm like Apriori [2], we have 35 connections if we want to build the frequent itemsets lattice. Now, we use FICL algorithm to build frequent closed itemsets lattice, the number reduces to 17. After building the lattice, we begin to mine the association rules. Within Apriori algorithm, we firstly generate candidate rules, then using the inference technology to prune the useless rules. If we have already built the frequent closed itemsets lattice, we can directly induce the exact association rules and approximation association rules respectively. We only use the inference technology to find whether some exact rules can be covered by some approximation rules. The time used to create all candidate rules and prune uninformative ones can be reduced. The experiment result is summarized in Table 2. Because the time spent in finding frequent itemset is same, we only show the time used to induce the association rules in the time list. Clearly, in practice, we will save more time when N increases. Table 2. Algorithm
Table 2.

Algorithm   Connections   Candidate Rules   (Exact+Approximation) Rules   Result Rules   Time (ms)
Apriori     35            43                -                             3              13
FCIL        17            -                 7+5                           3              2
7 Conclusion
Association rule mining has been extensively studied since its introduction and has been used in many applications. However, we often run into trouble because we have to face a large number of candidate frequent itemsets. In the past few years, frequent closed itemset mining has been introduced; it generates a small set of rules compared with traditional frequent itemset mining, without information loss. In this paper, we introduce a new framework called the frequent closed itemsets lattice to mine association rules. Compared with the traditional itemsets lattice, the framework is simple and contains only the itemsets we need to form association rules. Under this framework, we obtain the support of the frequent itemsets and mine association rules directly. We also extend the structure to the fuzzy frequent closed itemsets lattice, which is more efficient at the expense of precision. Finally, the experiment demonstrates the effectiveness of our method.
References
[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD 93, 207-216
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB 94, 487-499
[3] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In 7th Intl. Conf. on Database Theory, 1999
[4] D. Cristofor, L. Cristofor, and D. Simovici. Galois connection and data mining. Journal of Universal Computer Science (2000) 60-73
[5] J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. In SIGMOD Int'l Workshop on Data Mining and Knowledge Discovery, 2000
[6] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Efficient mining of association rules using closed itemset lattices. Information Systems (1999) 25-46
[7] M. Zaki. Generating non-redundant association rules. In Proceedings of the 6th ACM-SIGKDD International Conference on Knowledge Discovery and Data Mining (2000) 34-43
[8] B. A. Davey and H. A. Priestley. Introduction to Lattices and Order. Cambridge University Press, Fourth edition (1994)
Mining Generalized Closed Frequent Itemsets of Generalized Association Rules Kritsada Sriphaew and Thanaruk Theeramunkong Sirindhorn International Institute of Technology, Thammasat University 131 Moo 5, Tiwanont Rd., Bangkadi, Muang, Pathumthani, 12000, Thailand {kong,ping}@siit.tu.ac.th
Abstract. In the area of knowledge discovery in databases, generalized association rule mining is an extension of traditional association rule mining: given a database and a taxonomy over the items in the database, more intuitive and informative knowledge can be discovered. In this work, we propose a novel approach of generalized closed itemsets. A smaller set of generalized closed itemsets can act as the representative of a larger set of generalized itemsets. We also present an algorithm, called cSET, to mine only a small set of generalized closed frequent itemsets following some constraints and conditional properties. In a number of experiments, the cSET algorithm outperforms the traditional approaches to mining generalized frequent itemsets by an order of magnitude when the database is dense, especially on real datasets, and the minimum support is low.
1 Introduction
The task of association rule mining (ARM) is one important topic in the area of knowledge discovery in databases (KDD). ARM focuses on finding the set of all subsets of items (called itemsets) that frequently occur in database records or transactions, and then extracting the rules representing how a subset of items influences the presence of another subset [1]. However, the rules may not provide informative knowledge about the database, since they are limited by the granularity of the items. For this purpose, generalized association rule mining (GARM) was developed, using the information of a pre-defined taxonomy over the items. The taxonomy may classify products (or items) by brands, groups, categories, and so forth. Given a taxonomy in which only leaf items are present in the database, more intuitive and informative rules (called generalized association rules) can be mined from the database. Each rule contains a set of items from any level of the taxonomy. In the past, most previous work focused on efficiently finding all generalized frequent itemsets. As an early intensive work, Srikant et al. [2] proposed five algorithms that apply the horizontal database format and a breadth-first search strategy like the Apriori algorithm. These algorithms waste a lot of time in repeatedly scanning the database. A more recent algorithm, Prutax, was proposed in [3] by
applying a vertical database format to reduce the time needed for database scanning. Nevertheless, a limitation of this work is the cost of checking, using a hash tree, whether the ancestor itemsets are frequent or not. There exists a slightly different task dealing with multiple minimum supports, as shown in [4, 5, 6]. A parallel algorithm was proposed in [7]. Recent applications of GARM are shown in [8, 9]. Our efficient approach to mining all generalized frequent itemsets is presented in [10]. Furthermore, to improve the time complexity of the mining process, the concept of closed itemsets has been proposed in [11, 12, 13]. The main idea of these approaches is to find only a small set of closed frequent itemsets, which is the representative of a large set of frequent itemsets. This technique helps reduce the computational time. Thus, we intend to apply this concept to the generalized itemsets of GARM. In this work, we propose a novel concept of generalized closed itemsets, and present an efficient algorithm, named cSET, to mine only the generalized closed frequent itemsets.
2 Problem Definitions
A generalized association rule can be formally stated as follows. Let I = {A, B, C, D, E, U, V, W} be a set of distinct items and T = {1, 2, 3, 4, 5, 6} be a set of transaction identifiers (tids). The database can be viewed in two formats, i.e., the horizontal format shown in Fig. 1A and the vertical format shown in Fig. 1B. Fig. 1C shows the taxonomy, a directed acyclic graph on the items. An edge in the taxonomy represents an is-a relationship. V is called an ancestor item of U, C, A and B. A is called a descendant item of U and V. Note that only leaf items of the taxonomy are present in the original database. Intuitively, the database can be extended to contain the ancestor items by adding, for each ancestor item, a record whose tidset is the union of its children's tidsets, as shown in Fig. 1D. A set IG ⊆ I is called a generalized itemset (GI) when IG is a set of items in which no item is an ancestor of another. The support of IG, denoted by σ(IG), is defined as the percentage of transactions in which IG occurs as a subset, relative to the total number of transactions. Only a GI whose support is greater than or equal to a user-specified minimum support (minsup) is called a generalized frequent itemset (GFI). A rule is an implication of the form R: I1 → I2, where I1, I2 ⊆ I, I1 ∩ I2 = ∅, I1 ∪ I2 is a GFI, and no item in I2 is an ancestor of any item in I1. The confidence of a rule, defined as σ(I1 ∪ I2)/σ(I1), is the conditional probability that a transaction contains I2, given that it contains I1. The rule is called a generalized association rule (GAR) if its confidence is greater than or equal to a user-specified minimum confidence (minconf). The task of GARM can be divided into two steps: 1) finding all GFIs and 2) generating the GARs. The second step is straightforward, while the first step takes intensive computational time. We try to improve the first step by exploiting the concept of closed itemsets in GARM, finding only a small set of generalized closed itemsets to reduce the computational time.
Fig. 1. Databases and Taxonomy
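The support computation over the extended database can be sketched as follows (Python). The tidset of an ancestor item is the union of its children's tidsets, and the support of a GI is the size of the intersection of its items' tidsets divided by the number of transactions. The leaf tidsets and taxonomy below are illustrative assumptions patterned on Fig. 1, not a reproduction of it.

# Sketch: support of generalized itemsets from vertical tidsets extended by a taxonomy
leaf_tids = {                     # illustrative leaf-item tidsets (assumption)
    "A": {1, 3, 4, 5}, "B": {2, 6}, "C": {1, 2, 3, 4, 5},
    "D": {2, 4, 5}, "E": {6},
}
children = {"U": ["A", "B"], "V": ["U", "C"], "W": ["D", "E"]}   # is-a taxonomy (assumption)
n_transactions = 6

def tidset(item):
    """Tidset of a leaf item, or the union of the children's tidsets for an ancestor."""
    if item in leaf_tids:
        return set(leaf_tids[item])
    return set().union(*(tidset(child) for child in children[item]))

def support(generalized_itemset):
    common = set.intersection(*(tidset(i) for i in generalized_itemset))
    return len(common) / n_transactions

print(support({"V"}))        # ancestor item alone
print(support({"V", "D"}))   # mixed-level generalized itemset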
3 Generalized Closed Itemset (GCI)
In this section, the concept of GCI is defined by adapting the traditional concept of closed itemsets in ARM [11, 12, 13]. We show that a small set of generalized closed frequent itemsets is sufficient to be the representative of a large set of GFIs.

3.1 Generalized Closed Itemset Concept
Definition 1. (Galois connection): Let the binary relation δ ⊆ I × T be the extended database. For arbitrary x ∈ I and y ∈ T, we write xδy when x is related to y in the database. Let X ⊆ I and Y ⊆ T. Then the mapping functions
t: I → T, t(X) = {y ∈ T | ∀x ∈ X, xδy}
i: T → I, i(Y) = {x ∈ I | ∀y ∈ Y, xδy}
define a Galois connection between the power set of I (P(I)) and the power set of T (P(T)). The following properties hold for all X, X1, X2 ⊆ I and Y, Y1, Y2 ⊆ T: 1. X1 ⊆ X2 =⇒ t(X1) ⊇ t(X2); 2. Y1 ⊆ Y2 =⇒ i(Y1) ⊇ i(Y2); 3. X ⊆ i(t(X)) and Y ⊆ t(i(Y)).
Definition 2. (Generalized closure): Let X ⊆ I and Y ⊆ T. The compositions of the two mappings, gcit: P(I) → P(I) and gcti: P(T) → P(T), are the generalized closure operators on itemsets and tidsets respectively, with gcit(X) = i ◦ t(X) = i(t(X)) and gcti(Y) = t ◦ i(Y) = t(i(Y)).
Definition 3. (Generalized closed itemset and tidset): X is called a generalized closed itemset (GCI) when X = gcit(X), and Y is called a generalized closed tidset (GCT) when Y = gcti(Y).
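A compact sketch of the two mappings and the composite closure operators follows (Python, over a small illustrative relation that is an assumption, not the data of Fig. 1). The same functions also make Lemma 1 of Section 3.2 easy to check numerically, since t(X) and t(gcit(X)) come out identical.

# Sketch: Galois connection t, i and the closure operators gc_it, gc_ti
relation = {              # tid -> items, i.e., the extended binary relation delta (assumption)
    1: {"A", "C", "W"}, 2: {"C", "D", "W"}, 3: {"A", "C", "W"},
    4: {"A", "C", "D", "W"}, 5: {"A", "C", "D", "W"}, 6: {"C", "D"},
}
all_items = set().union(*relation.values())

def t(X):          # itemset -> tidset
    return frozenset(y for y, items in relation.items() if X <= items)

def i(Y):          # tidset -> itemset
    return frozenset(x for x in all_items if all(x in relation[y] for y in Y))

def gc_it(X):      # generalized closure on itemsets
    return i(t(X))

def gc_ti(Y):      # generalized closure on tidsets
    return t(i(Y))

X = frozenset({"A"})
print(sorted(gc_it(X)))                 # the closed itemset containing A
print(t(X) == t(gc_it(X)))              # Lemma 1: supports coincide -> True
print(gc_it(gc_it(X)) == gc_it(X))      # idempotence (Galois property 3) -> True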
Fig. 2. Galois Lattice of Concepts
For X ⊆ I and Y ⊆ T, the generalized closure operators gcit and gcti satisfy the following properties (Galois properties): 1. Y ⊆ gcti(Y). 2. X ⊆ gcit(X). 3. gcit(gcit(X)) = gcit(X), and gcti(gcti(Y)) = gcti(Y). For any GCI X, there exists a corresponding GCT Y with the property that Y = t(X) and X = i(Y). Such a GCI and GCT pair X × Y is called a concept. All possible concepts form a Galois lattice of concepts, as shown in Fig. 2.

3.2 Generalized Closed Frequent Itemsets (GCFIs)
The support of a concept X × Y is the size of the GCT (i.e., |Y|). A GCI is frequent when its support is greater than or equal to minsup. Lemma 1. For any generalized itemset X, its support is equal to the support of its generalized closure (σ(X) = σ(gcit(X))). Proof. Given X, its support is σ(X) = |t(X)|/|T|, and the support of its generalized closure is σ(gcit(X)) = |t(gcit(X))|/|T|. To prove the lemma, we have to show that t(X) = t(gcit(X)). Since gcti is a generalized closure operator, it satisfies the first property, so t(X) ⊆ gcti(t(X)) = t(i(t(X))) = t(gcit(X)); thus t(X) ⊆ t(gcit(X)). Conversely, X ⊆ gcit(X) by the Galois property, and therefore t(X) ⊇ t(gcit(X)) [11]. We can conclude that t(X) = t(gcit(X)). Implicitly, the lemma states that all GFIs can be uniquely determined by the GCFIs, since the support of any GI is equal to that of its generalized closure. In the worst case, the number of GCFIs is equal to the number of GFIs, but typically it is much smaller. In the previous example, there are 10 GCIs, which are the representatives of a much larger number of GIs, as shown in Fig. 2. With minsup = 50%, only 7 concepts (in bold font) are GCFIs.
4 Algorithm

4.1 cSET Algorithm
In our previous work [10], all GFIs can be enumerated by applying two constraints, i.e., subset-superset and parent-child, to prune GIs. We propose an algorithm, called the cSET algorithm, which specifies the order of set enumeration by using these two constraints together with the generalized closures to generate only GCIs. The two constraints state that only descendant and superset itemsets of GFIs should be considered in the enumeration process. To generate only GCFIs, the following conditional properties must be checked when generating the child itemsets by joining X1 × t(X1) with X2 × t(X2). 1. If t(X1) = t(X2), then (1) replace X1 and the children under X1 with X1 ∪ X2, (2) generate the taxonomy-based child itemsets of X1 ∪ X2, and (3) remove X2 (if any). 2. If t(X1) ⊂ t(X2), then (1) replace X1 with X1 ∪ X2 and (2) generate the taxonomy-based child itemsets of X1 ∪ X2. 3. If t(X1) ⊃ t(X2), then (1) generate the join-based child itemset X1 ∪ X2 under X1, (2) add X1 ∪ X2 to the hash table, and (3) remove X2 (if any). 4. If t(X1) ≠ t(X2) and t(X1 ∪ X2) is not contained in the hash table, then generate the join-based child itemset X1 ∪ X2 under X1. Using the example in Fig. 1 with minsup = 50%, the cSET algorithm starts with an empty set. We then add all frequent items in the second level of the taxonomy, that is, items V and W, forming the second level of the tree shown in Fig. 3. Each itemset has to generate two kinds of child itemsets, i.e., taxonomy-based and join-based itemsets, respectively. We first generate a taxonomy-based itemset by joining the last item of an itemset with its child according to the taxonomy. One taxonomy-based itemset of V is VU. The first property holds for VU, which results in replacing V with VU and then generating VUA and VUB. The second taxonomy-based itemset is joined with the current itemset (VU), which produces VUC. Again, the first property holds for VUC, which results in replacing VU and the children in the tree under VU with VUC. Next, the join-based child itemset of V, VW, is generated. The third property holds for VW, which results in removing W and then generating VW under V. In the same manner, the process continues recursively until no new GCFIs are generated. Finally, a complete itemset tree is constructed without excessive checking cost, as shown in Fig. 3. All remaining itemsets in Fig. 3, except the crossed-out ones, are GCFIs.
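The dispatch on the four conditional properties above can be sketched roughly as follows (Python). The node and hash-table data structures are simplified assumptions, and the full algorithm additionally regenerates taxonomy-based children after Properties 1 and 2, as described above; this is only the tidset comparison logic.

# Sketch: dispatch on the four conditional properties when joining X1 x t(X1) with X2 x t(X2)
def cset_property(x1, t1, x2, t2, tree, hash_table):
    """x1, x2: frozenset itemsets; t1, t2: frozenset tidsets.
    tree maps an itemset to the set of its child itemsets; hash_table stores seen tidsets."""
    union = x1 | x2
    t_union = t1 & t2                       # tidset of the joined itemset
    if t1 == t2:                            # Property 1
        tree[union] = tree.pop(x1, set())        # replace X1 (and its subtree) by X1 u X2
        tree.pop(x2, None)                       # remove X2 if present
    elif t1 < t2:                           # Property 2: t(X1) strictly contained in t(X2)
        tree[union] = tree.pop(x1, set())        # replace X1 by X1 u X2
    elif t1 > t2:                           # Property 3
        tree.pop(x2, None)
        if t_union not in hash_table:
            tree.setdefault(x1, set()).add(union)
            hash_table.add(t_union)
    else:                                   # Property 4: incomparable tidsets
        if t_union not in hash_table:
            tree.setdefault(x1, set()).add(union)
            hash_table.add(t_union)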
4.2 Pseudo-code Description
The formal pseudo-code of cSET, extended from SET in [10], is shown below. The main procedure is cSET-MAIN, and a function called cSET-EXTEND creates a subtree following the proposed set enumeration. cSET-EXTEND is executed recursively to create all itemsets under the root itemsets. The NewChild function
Fig. 3. Finding GCFIs using cSET with minsup=50%
creates a child itemset. For instance, NewChild(V,U) creates a child itemset VU of a parent itemset V, and adds the new child to a hash table. The GenTaxChild function returns the taxonomy-based child itemsets of a GI. Lines 8-11 generate the join-based child itemsets. The function cSET-PROPERTY checks the four conditional properties of GCIs and performs the corresponding operations on the generated itemset. Following the cSET algorithm, we obtain a tree of all GCFIs.

cSET-MAIN(Database, Taxonomy, minsup):
1.  Root = Null Tree                                 // Root node of set enumeration
2.  NewChild(Root, GFIs from second level of taxonomy)
3.  cSET-EXTEND(Root)

cSET-EXTEND(Father):
4.  For each Fi in Father.Child
5.    C = GenTaxChild(Fi)                            // Generate taxonomy-based child itemset
6.    If supp(C) >= minsup then
7.      cSET-PROPERTY(Nodes, C)
8.    For j = i+1 to |Father.Child|                  // Generate join-based child itemset
9.      C = Fi U Fj
10.     If supp(C) >= minsup then
11.       cSET-PROPERTY(Nodes, C)
12.   If Fi.Child ≠ NULL then cSET-EXTEND(Fi)

cSET-PROPERTY(Node, C):
13. if t(Fi) = t(Fj) and Child(Fi) = ∅ then          // Prop. 1
14.   Remove(Fj); Replace(Fi) with C
15. else if t(Fi) ⊂ t(Fj) and Child(Fi) = ∅ then     // Prop. 2
16.   Replace(Fi) with C
17. else if t(Fi) ⊃ t(Fj) then                       // Prop. 3
18.   Remove(Fj); if !Hash(t(C)) then NewChild(Fi, C)
19. else if !Hash(t(C)) then NewChild(Fi, C)         // Prop. 4
5 Experimental Results
Since the concept of GCIs has not appeared in previous research, there are no existing algorithms for finding GCFIs. In our experiment, the cSET algorithm is evaluated by comparison with a current efficient algorithm for mining GFIs, i.e., the SET algorithm [10]. All algorithms are coded in the C language and the experiment was done on a 1.7 GHz Pentium IV with 640 Mb of main memory running Windows 2000. Synthetic and real datasets are used in our experiment. The synthetic datasets are automatically generated by a generator tool from IBM Almaden with slightly modified default values. Two real datasets from the UC Irvine Machine Learning Database Repository, i.e., mushroom and chess, are used with our own generated taxonomies, in which the original items form the leaf level of the taxonomy. Table 1 shows the comparison of using SET and cSET to enumerate all GFIs and GCFIs, respectively. In the real datasets, the number of GCFIs is much smaller than that of GFIs. For the same dataset, the ratio of the number of GFIs to that of GCFIs typically increases when we lower minsup. The higher the ratio is, the more time reduction is gained. The ratio can grow up to around 7,915 times, which results in a reduction of running time of around 3,878 times. Note that in the synthetic datasets, the number of GFIs is only slightly different from the number of GCFIs. This indicates that the real datasets are dense while the synthetic datasets are sparse. This result makes it possible to reduce even more computational time by using cSET in real situations.
Table 1. Number of itemsets and Execution Time (GFIs vs. GCFIs)
6 Conclusion and Further Research
A large number of generalized frequent itemsets may cause high computational cost. Instead of mining all generalized frequent itemsets, we can mine only a small set of generalized closed frequent itemsets, thereby reducing the computational time. We proposed an algorithm, named cSET, which applies some constraints and conditional properties to efficiently enumerate only the generalized closed frequent itemsets. The advantage of cSET becomes more pronounced when the minimum support is low and/or the dataset is dense. This makes it possible to mine data in real situations. In further research, we intend to propose a method to extract only a set of important rules from these generalized closed frequent itemsets.
Acknowledgement. This paper has been supported by the Thailand Research Fund (TRF) and NECTEC under project number NT-B-06-4C-13-508.
References
[1] Agrawal, R., Imielinski, T., Swami, A. N.: Mining association rules between sets of items in large databases. In Buneman, P., Jajodia, S., eds.: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C. (1993) 207-216
[2] Srikant, R., Agrawal, R.: Mining generalized association rules. Future Generation Computer Systems 13 (1997) 161-180
[3] Hipp, J., Myka, A., Wirth, R., Güntzer, U.: A new algorithm for faster mining of generalized association rules. In: Proceedings of the 2nd European Conference on Principles of Data Mining and Knowledge Discovery (PKDD '98), Nantes, France (1998) 74-82
[4] Chung, F., Lui, C.: A post-analysis framework for mining generalized association rules with multiple minimum supports (2000). Workshop Notes of KDD'2000 Workshop on Post-Processing in Machine Learning and Data Mining
[5] Han, J., Fu, Y.: Mining multiple-level association rules in large databases. Knowledge and Data Engineering 11 (1999) 798-804
[6] Lui, C. L., Chung, F. L.: Discovery of generalized association rules with multiple minimum supports. In: Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2000), Lyon, France (2000) 510-515
[7] Shintani, T., Kitsuregawa, M.: Parallel mining algorithms for generalized association rules with classification hierarchy. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (1998) 25-36
[8] Michail, A.: Data mining library reuse patterns using generalized association rules. In: International Conference on Software Engineering (2000) 167-176
[9] Hwang, S. Y., Lim, E. P.: A data mining approach to new library book recommendations. In: Lecture Notes in Computer Science, ICADL 2002, Singapore (2002) 229-240
[10] Sriphaew, K., Theeramunkong, T.: A new method for finding generalized frequent itemsets in generalized association rule mining. In Corradi, A., Daneshmand, M., eds.: Proc. of the Seventh International Symposium on Computers and Communications, Taormina-Giardini Naxos, Italy (2002) 1040-1045
[11] Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Discovering frequent closed itemsets for association rules. Lecture Notes in Computer Science 1540 (1999) 398-416
[12] Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Efficient mining of association rules using closed itemset lattices. Information Systems 24 (1999) 25-46
[13] Zaki, M. J., Hsiao, C. J.: CHARM: An efficient algorithm for closed itemset mining. In Grossman, R., Han, J., Kumar, V., Mannila, H., Motwani, R., eds.: Proceedings of the Second SIAM International Conference on Data Mining, Arlington, VA (2002)
Qualitative Point Sequential Patterns Aomar Osmani LIPN- UMR CNRS 7030 Universit´e de Paris 13, 99 avenue J.-B. Cl´ement 93430 Villetaneuse, France
[email protected] Abstract. We have introduced [Osm03] a general model for representing and reasoning about STCSPs (Sequential Temporal Constraint Satisfaction Problems) to deal with patterns in data mining and in applications which generate large quantities of data used to understand or to explain given situations (diagnosis of dynamic systems, alarm monitoring, event prediction, etc.). One important issue in sequence reasoning concerns the recognition problem. This paper presents the STCSP model with qualitative point primitives using a frequency evaluation function. It gives the problem formalization and the usual problems concerned with this kind of approach, and it proposes some algorithms to deal with sequences.
1 Introduction
Sequence generation, and more generally reasoning about sequences, arises as a problem in many disciplines including artificial intelligence, databases, cognitive sciences, and engineering [Se02]. Sequence reasoning, also called sequence mining, is the discovery of sets of characteristics shared through time by a great number of objects ordered in time [Zak01]. Various problems are concerned with reasoning about sequences, i.e., sequence prediction, sequence generation, and sequence recognition. The most considered one is the task of discovering all frequent sequences in large databases. It is a quite challenging problem; for instance, the search space is extremely large: with m variables there are O(m^k) potentially frequent sequences of length at most k. Many techniques have been proposed to mine temporal databases for the frequently occurring sequences [Zak01, JBS99, Lin92, Se02, aSD94, BT96].
2 Definitions
Let us consider the set E = {e1, . . . , en} of qualitative points and their possible relationships {<, =, >}. Let (e1, . . . , em) be a vector of observations or events such that ∀i ∈ {1..m}, ei ∈ E, and let R be a matrix of binary relations on E. Definition 1. A sequential pattern P is defined as a couple ((e1, . . . , en), R) such that, for all i ∈ {1..n − 1}, ei and ei+1 are related by the corresponding relation in R.
In(i, j) = 1 if I0(i, j) > I0(m, n) for all (m, n) ∈ N8(i, j) with m ≠ i, n ≠ j; In(i, j) = 0 otherwise.   (4)
Choosing Rule: Only a point (i, j) whose final value In(i, j) equals one can be used as a candidate point for matching. In fact, the gray levels of sea ice images always change only slightly, so a point that is a local maximum but has a small initial value may be selected as a candidate point, which would corrupt the final result. To avoid this situation we set a threshold on the initial value so that only points with a sufficiently large initial value are kept. In our estimation of the sea ice velocity, the threshold is selected by computing the maximum initial value of a sea ice background image. Fig. 1(a) and (b) are two adjacent images; Fig. 1(c) and (d) are the final results in which the candidate points are displayed.
Fig. 1. Candidate points for two adjacent images: (a) image 1, (b) image 2, (c) candidate points for image 1, (d) candidate points for image 2
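A sketch of such a candidate-point selector is given below (Python with NumPy). The initial interest measure used here, a local gray-level variance, is an assumed stand-in for the operator defined earlier in the paper; the routine keeps only pixels whose initial value exceeds the background-derived threshold and is the unique maximum over the 8-neighbourhood, i.e., In(i, j) = 1 in equation (4).

# Sketch: selecting candidate points as thresholded local maxima of an interest value
import numpy as np

def initial_interest(image, win=2):
    """Assumed I0: local gray-level variance in a (2*win+1)^2 window."""
    img = image.astype(float)
    h, w = img.shape
    out = np.zeros_like(img)
    for i in range(win, h - win):
        for j in range(win, w - win):
            out[i, j] = img[i - win:i + win + 1, j - win:j + win + 1].var()
    return out

def candidate_points(image, threshold):
    """Keep points whose I0 exceeds the threshold and is the unique 8-neighbourhood maximum."""
    i0 = initial_interest(image)
    points = []
    h, w = i0.shape
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            neighbourhood = i0[i - 1:i + 2, j - 1:j + 2]
            centre = i0[i, j]
            if centre > threshold and centre == neighbourhood.max() and (neighbourhood == centre).sum() == 1:
                points.append((i, j))
    return points

In practice the threshold would be taken as the maximum initial value computed over a sea ice background image, as described above.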
3 A Locally Parallel Model for Matching Candidate Points
After finding the two sets of candidate points, we would like to construct a set of matches. Because some points may be occluded, shadowed, or not visible in the next image for some reason, an initial set of possible matches is constructed by pairing each candidate point from the current image with each candidate from the next image. We regard the set of matches as a collection of "nodes" {ai}. Associated with each node ai is a coordinate (xi, yi) in image 1 and a set of labels Li which represents the possible disparities that may be assigned to the point. Each label in Li is either a disparity vector (lx, ly) (lx ∈ [−r, r], ly ∈ [−r, r], where r is the maximum detectable disparity) or a distinguished label l* denoting "undefined disparity". Each possible match is assigned an initial likelihood using (5), (6), (7) and (8).
wi(l) = 1 / (1 + c · si(l)),  l ≠ l*                                    (5)

pi^0(l*) = 1 − max_{l ≠ l*} wi(l)                                       (6)

pi^0(l) = pi(l | i) · (1 − pi^0(l*)),  l ≠ l*                           (7)

pi(l | i) = wi(l) / Σ_{l' ≠ l*} wi(l')                                  (8)
si(l) is the sum of the squares of the differences between a small window from image 1 centered on (xi, yi) and a window from image 2 centered on (xi + lx, yi + ly); c is a constant. After constructing the initial set of possible matches, we iteratively refine these likelihoods using (9), (10) and (11).
qi^k(l) = Σ_{j ≠ i, aj near ai} [ Σ_{l': ||l − l'|| ≤ Θ} pj^k(l') ],  l ≠ l*            (9)

p̂i^(k+1)(l) = pi^k(l) · (A + B · qi^k(l)),  l ≠ l*;   p̂i^(k+1)(l*) = pi^k(l*)          (10)

pi^(k+1)(l) = p̂i^(k+1)(l) / Σ_{l' ∈ Li} p̂i^(k+1)(l')                                   (11)

A and B are constants. Labels are considered consistent if they satisfy the relation

||l − l'|| ≤ Θ                                                                          (12)

A node aj is considered near ai if

max(|xi − xj|, |yi − yj|) ≤ R                                                           (13)

The probability of the label l* is affected only by the normalization (11). If

Σ_{l ≠ l*} p̂i^(k+1)(l) < Σ_{l ≠ l*} p̂i^k(l)                                            (14)
the probability that ai has label l* increases. Otherwise, i.e., if

Σ_{l ≠ l*} p̂i^(k+1)(l) ≥ Σ_{l ≠ l*} p̂i^k(l)                                            (15)
it decreases or remains the same. This iteration procedure is repeated a fixed number of times. Matching Rule: if Pi(l) ≥ T, the pair of points is considered matched; if Pi(l) < T, the pair is discarded.
T is the threshold for estimating the likelihood. Some nodes may remain ambiguous, with several potential matches retaining nonzero probabilities.
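A condensed sketch of the whole relaxation step follows (Python with NumPy). It assumes the candidate points and the window-based SSD scores si(l) are already available, stores the probabilities as dictionaries keyed by discrete disparity labels plus None standing for the undefined label l*, and applies equations (5)-(11) together with the matching rule; this data layout is an implementation assumption, not the authors' code.

# Sketch: initial likelihoods (5)-(8), one refinement step (9)-(11), and the matching rule
import numpy as np

def initial_probabilities(ssd, c=0.001):
    """ssd: dict label -> s_i(l) for one node; returns dict including the undefined label None."""
    w = {l: 1.0 / (1.0 + c * s) for l, s in ssd.items()}            # (5)
    p_star = 1.0 - max(w.values())                                  # (6)
    total_w = sum(w.values())
    p = {l: (wl / total_w) * (1.0 - p_star) for l, wl in w.items()} # (7), (8)
    p[None] = p_star
    return p

def refine(nodes, probs, A=0.6, B=3.0, theta=1.0, R=15.0):
    """One iteration of (9)-(11). nodes: list of (x, y); probs: list of dicts as above."""
    new_probs = []
    for i, (xi, yi) in enumerate(nodes):
        q = {}
        for l in probs[i]:
            if l is None:
                continue
            s = 0.0
            for j, (xj, yj) in enumerate(nodes):
                if j == i or max(abs(xi - xj), abs(yi - yj)) > R:    # nearness test (13)
                    continue
                s += sum(pjl for lj, pjl in probs[j].items()
                         if lj is not None and np.hypot(l[0] - lj[0], l[1] - lj[1]) <= theta)  # (12)
            q[l] = s                                                 # (9)
        hat = {l: (probs[i][l] * (A + B * q[l]) if l is not None else probs[i][None])
               for l in probs[i]}                                    # (10)
        z = sum(hat.values())
        new_probs.append({l: v / z for l, v in hat.items()})         # normalization (11)
    return new_probs

def matched_pairs(nodes, probs, T=0.5):
    """Matching rule: keep the best disparity label if its probability reaches T."""
    result = []
    for (x, y), p in zip(nodes, probs):
        label, best = max(((l, v) for l, v in p.items() if l is not None), key=lambda kv: kv[1])
        if best >= T:
            result.append(((x, y), label))
    return result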
4 Estimation of the Velocity of the Sea Ice Movement
Because the gray levels of sea ice images always change only slightly, an invalid match sometimes occurs. We therefore present an orientation-selection and probability-decision method to estimate the velocity of the sea ice, as follows. After matching each pair of neighboring images, we obtain a set of matching pairs:

Ai : {(x1i, y1i), (x2i, y2i)}                                                           (16)
Ai is the i-th matching pair of the two neighboring images; (x1i, y1i) are the coordinates of the i-th matching point in the first image, and (x2i, y2i) are the coordinates of the i-th matching point in the next image. Compute the velocity vector of each matching pair:
Vi = [(x2i − x1i), (y2i − y1i)]^T                                                       (17)
Fig. 2. Matching results
According to the angle of Vi, each matching pair can be allocated to one of the eight orientation subsets (each orientation subset represents one orientation):
Subset(j) = {Am},  j = 1, 2, …, 8,  Am ⊂ A                                              (18)
Here, Subset(j) is one of the eight orientation subsets. The eight orientations are: up, down, left, right, left-up, left-down, right-up and right-down. The probability that the velocity orientation of the sea ice belongs to Subset(j) is defined as:

P(j) = Number(j) / Σ_{k=1}^{8} Number(k)                                                (19)
Here Number( j ) is the number of the elements in Subset ( j ) . Find the maximum probability,
Pmax = max_j P(j),  j = 1, 2, …, 8                                                      (20)
Assume that the maximum probability corresponds to Subset(n); then all the elements in Subset(n) are used to estimate the velocity of the sea ice, and the final estimate of the velocity is V:

V = ( Σ_{Am ⊂ Subset(n)} Vm ) / Number(n)                                               (21)

Here Am ranges over all the matching pairs belonging to Subset(n), and Vm is the velocity vector of the matching pair corresponding to Am.
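The orientation-subset vote of equations (16)-(21) can be sketched as follows (Python with NumPy). Binning the vector angle into eight 45-degree sectors is an assumption about how the subsets are formed; the matched coordinates below are illustrative.

# Sketch: estimating the ice velocity from matched pairs by eight-way orientation voting
import numpy as np

def estimate_velocity(matching_pairs):
    """matching_pairs: list of ((x1, y1), (x2, y2)) matched point coordinates."""
    vectors = [np.array([x2 - x1, y2 - y1], dtype=float)
               for (x1, y1), (x2, y2) in matching_pairs]                 # (17)
    subsets = {j: [] for j in range(8)}                                  # (18): eight orientation bins
    for v in vectors:
        angle = np.arctan2(v[1], v[0]) % (2 * np.pi)
        subsets[int(angle // (np.pi / 4)) % 8].append(v)
    counts = {j: len(s) for j, s in subsets.items()}
    total = sum(counts.values())
    prob = {j: counts[j] / total for j in subsets}                       # (19)
    n = max(prob, key=prob.get)                                          # (20): winning orientation
    return sum(subsets[n]) / counts[n]                                   # (21): mean vector of Subset(n)

pairs = [((10, 10), (13, 11)), ((40, 25), (43, 26)), ((60, 70), (62, 71)), ((5, 90), (4, 88))]
print(estimate_velocity(pairs))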
5 Results and Conclusion
According to the characteristics of sea ice images, we adopt a locally parallel matching model to estimate the velocity of the sea ice. First, an interest operator is used to choose the candidate points. Then, a locally parallel model is adopted to match the neighboring sea ice images. Finally, we present a method that selects the maximum-probability subset for estimating the velocity. An important property of this matching algorithm is that it works for any mode of disparity and does not require precise information about camera orientation, position and distortion; and the method of selecting the maximum-probability subset for estimating the velocity can eliminate some invalid matches. By mounting a camera on the ice breaker and using a general-purpose image acquisition board, we have applied the above system to estimate the velocity of the sea ice in the Bohai Sea of north China. In practice, the parameters in the above expressions are selected as follows. In Section 3, the window size for matching is 5×5, and the constant c in
equation (5) equals 0.001. The constants A and B in equation (10) are selected as 0.6 and 3. In equations (12) and (13), Θ = 1 and R = 15. The iteration procedure is repeated eight times. In the matching rule, the threshold T = 0.5. These methods have proved to give good and stable results in the practice of estimating the velocity of sea ice, and the precision of the estimated velocity meets the demands of the application.
Evaluation of a Combined Wavelet and a Combined Principal Component Analysis Classification System for BCG Diagnostic Problem
Xinsheng Yu (1), Dejun Gong (2), Siren Li (2), and Yongping Xu (2)
(1) Marine Geology College, Ocean University of China, Qingdao 266003, P. R. China
(2) Institute of Oceanology, Chinese Academy of Sciences, Qingdao 266071, P. R. China
Abstract. Heart disease is one of the main causes of death in the developed countries. Over several decades, a variety of electronic and computer technologies have been developed to assist clinical practice in cardiac performance monitoring and heart disease diagnosis. Among these methods, ballistocardiography (BCG) has the interesting feature that no electrodes need to be attached to the body during the measurement. Thus, it offers a potential means of assessing a patient's heart condition at home. In this paper, a comparison is made between two neural-network-based BCG signal classification models. One system uses a principal component analysis (PCA) method, and the other a discrete wavelet transform, to reduce the input dimensionality. It is shown that the combined wavelet transform and neural network has a more reliable performance than the combined PCA and neural network system. Moreover, the wavelet transform requires no prior knowledge of the statistical distribution of the data samples, and the computational complexity and training time are reduced.
1 Introduction
Traditionally, physicians in hospitals need to interpret characteristics of the measured records and to calculate relevant parameters to determine whether or not the heart shows signs of cardiac disease. Recently, advances in computer and electronic technology have provided a basis for automatic cardiac performance monitoring and heart disease diagnosis that assists clinical practice by saving diagnostic time. These technologies, along with artificial intelligence research, have also established entirely new applications in detecting the risk of heart disease [4]. However, the demands of practical health care require that these technologies are not limited to hospital environments but are able to detect the vital signs of heart disease during daily life under unrestricted conditions. For example, it would be helpful for both doctors and patients if the preliminary heart condition could be monitored regularly at home before making the decision whether or not it is necessary to visit
hospital for further assessment and treatment; time and transport expenses could thus be saved. These requirements have led to increasing innovation in portable systems, with reductions in size, weight and power consumption to the extent that the instruments are available for routine monitoring and diagnosis at home. BCG provides a potential means of monitoring heart condition at home because no electrodes are required to be attached to the body during the measurements. In the past several years, the BCG has been used in a variety of clinical studies such as prognosis, monitoring, screening, physical conditioning, stress tests, evaluation of therapy, cardiovascular surgery and clinical conditions [1]-[4]. It would therefore be of value if the BCG could be used as a non-invasive alternative for the prediction of subjects at risk of heart attack, which could help clinicians to establish early treatment. Although studies have been made in the last few decades on improving BCG measurement technology, relatively little work has been done on improving BCG processing and analysis methods using modern signal processing technology. There is little work on developing computer-assisted analysis systems that are capable of operating in real time, especially those incorporating artificial intelligence methods for BCG pattern classification. Moreover, no reports have been made in the literature of implementing decision-making algorithms in a portable system for real-world application [5]. In this paper, two BCG classification models are evaluated. One is a combined PCA and three-layer neural network system; the other uses the wavelet transform to decompose the original signal into a series of views for a neural network classifier. The performance and computational efficiency of the two models are compared. It is shown that the combined wavelet transform and backpropagation neural network has a more reliable performance than the combined PCA and neural network system.
2 BCG Signal Feature Selection
Classification of the whole BCG waveform by a neural network classifier can be computationally intensive. High-dimensional inputs force us to use a network with a larger number of free parameters, which means a large number of training samples is required to estimate these parameters, and the time involved in training the network is considerable. Moreover, the presence of irrelevant feature combinations can actually obscure the impact of the more discriminating inputs. A small input dimension means a smaller neural network structure is required, and thus savings in time and in system implementation for real-world applications.

2.1 BCG Signal and Preprocessing
The BCG data set used in this study was provided by the Medical Informatics Unit of Cambridge University. The signal was sampled at 100 samples/second from the seat of a chair in a clinical trial. During the BCG recording, reference ECG signals were recorded simultaneously from the arms of the chair at the same sampling rate. The R-wave of the ECG signal is used to identify each BCG cycle. The BCG
signal is filtered and then averaged to reduce the background noise. Each averaged BCG signal was then normalized to a standard 80 points. Altogether, there are 58 normal subjects, 275 mild hypertension subjects and 6 subjects who died suddenly due to myocardial infarction within 24 months of the BCG recording. Figure 1 shows the standard-length BCG signals of normal and hypertension subjects.
Fig. 1. BCG of normal (A) and hypertension (B)
2.2 Using PCA for Feature Extraction
Principal components are usually derived by numerically computing the eigenvectors of the covariance matrix. The eigenvectors with the largest eigenvalues (variance) are used as features onto which the data are projected [7],[8]. This method may be considered optimal in the sense that the mean square linear reconstruction error is minimised. For a given data set X = (X1, X2, ..., XN) with zero mean, a simple approach to PCA is to compute the sample covariance matrix
C = (1/N) Σ_{i=1}^{N} Xi Xi^T                                                           (1)
The next step is to use any of the existing standard eigenvector analysing algorithms to find the m largest eigenvalues and their corresponding eigenvectors of the data covariance matrix:
Cφ = λφ                                                                                 (2)
where λ is an eigenvalue and φ is its corresponding eigenvector. The m principal components of n-dimensional data are the m orthogonal directions in n-space which capture the greatest variation in the data; these are given by the eigenvectors φ. In this study, it was found that 10 principal components are able to retain the information required by the neural classifier.
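A short sketch of this feature-extraction step is given below (Python with NumPy). The random 80-point matrix stands in for the real recordings; the code keeps the 10 leading eigenvectors of the sample covariance matrix and projects each normalized beat onto them.

# Sketch: projecting zero-mean 80-point BCG beats onto the 10 leading principal components
import numpy as np

rng = np.random.default_rng(0)
beats = rng.normal(size=(144, 80))            # 144 training beats of 80 samples (stand-in data)
beats -= beats.mean(axis=0)                   # zero mean, as required by equation (1)

C = beats.T @ beats / beats.shape[0]          # sample covariance matrix, equation (1)
eigvals, eigvecs = np.linalg.eigh(C)          # eigen-decomposition, equation (2)
order = np.argsort(eigvals)[::-1][:10]        # indices of the 10 largest eigenvalues
components = eigvecs[:, order]                # 80 x 10 projection matrix

features = beats @ components                 # 10-dimensional inputs for the neural classifier
print(features.shape)                         # (144, 10)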
2.3 BCG Signal Analysis Using Discrete Wavelet Transform
In the BCG signal classification, most of the information needed for classification is in the BCG shape and related parameters, such as the wave amplitudes of H, I, and J,
slopes of the baseline to H, H-J, and I-J, and the times of H-J and I-J as well as H-I and H-J as a percentage of the heart period [6]. These kinds of features are often dependent on the degree of resolution. At a coarse resolution, information about the large features of the complex waves is represented, while at a fine resolution, information about more localised features of individual waves is formed. By discarding selected detail, one can obtain the information required by the classifier without much signal distortion. Not every resolution is equally important for providing the classification information; therefore, it is possible to choose a proper resolution and discard the selected components without much signal distortion. Using the wavelet transform, one can project the raw BCG data into a small-dimensional space and decompose the original signal into a series of sub-signals at different scales. The coarse components, which still carry the significant shape information, can be presented to the neural network classifier. One of the most useful features of wavelets is that one can choose the defining coefficients for a given wavelet system to be adapted to a given problem. In Mallat's original paper [9], he developed a specific family of wavelets that is good for representing polynomial behaviour. As there is no prior information to suggest that one wavelet function would be better suited to BCG analysis than others, the filter coefficients derived by Mallat [9] are adopted for this study. In this study, the signal reconstruction procedure is not required; thus, only low-pass filter processing is used to approximate the original BCG signal. The fast discrete wavelet transform algorithm is implemented in C code for BCG signal decomposition. As the original data frame has 80 sample points, at each successive resolution level the pyramidal algorithm compresses the data by a factor of 2. At the first processing level, the N-dimensional signal f(x) is taken as C0 and is decomposed into two bands: a coarse component Cj,n and a detail component Dj,n, which both have N/2 samples. At the next level, only the low-pass output signal is further split into two bands. Signals at various resolutions give information about different features; in time scale, they have 40, 20, and 10 samples respectively. According to our experimental results, the level with 20 coarse components gives relatively good classification performance.
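The pyramidal low-pass decomposition can be sketched as follows (Python with NumPy). A simple two-tap averaging filter is used here purely for illustration; the study itself uses the filter coefficients from Mallat [9], which are not reproduced.

# Sketch: pyramidal low-pass decomposition of an 80-sample BCG frame
import numpy as np

def coarse_levels(signal, levels=3):
    """Return the coarse approximations at successive resolutions (40, 20, 10 samples)."""
    approximations = []
    current = np.asarray(signal, dtype=float)
    for _ in range(levels):
        current = 0.5 * (current[0::2] + current[1::2])   # low-pass filter + downsample by 2
        approximations.append(current)
    return approximations

frame = np.sin(np.linspace(0, 4 * np.pi, 80))             # stand-in for a normalized BCG beat
c1, c2, c3 = coarse_levels(frame)
print(len(c1), len(c2), len(c3))                          # 40 20 10 -- the 20-sample level feeds the classifier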
3 BCG Classification
The neural network model used in this study is a variation of the backpropagation neural network with a single hidden layer. The network parameters are empirically defined. The whole BCG data set is partitioned randomly into two data sets, as shown in Table 1. One data set is used for training the neural classifier and the other is used for testing; we then swap the training and testing data sets and repeat the training and testing procedure. The same partitioning of the data set was used in all runs of this investigation, and the generalization capability of the neural classifier was evaluated using the same strategy.
Table 1. Training and Testing Data Sets

Subjects                     Training Data Set   Testing Data Set
Normal (N)                   30                  28
Hypertension (H)             111                 164
Risk of Heart Attack (RHA)   3                   3

4 Results

4.1 Recognition Performance
After training the neural network, the weights were fixed and the network was evaluated. The results of using data set 1 as the training set for the combined PCA and neural classifier are shown in Table 2, and Table 3 shows the classification results when data set 2 is used as the training set. The results for the combined wavelet transform and neural network classification system are shown in Table 4 and Table 5 respectively.
Table 2. Classification Results Using 10 Principal Components

          Training Data          Testing Data
Class     N     H     RHA        N     H     RHA
N         30    0     0          22    6     0
H         0     111   0          3     160   1
RHA       0     1     2          1     2     0
Overall Performance: 99.31% (training), 93.33% (testing)
Table 3. Classification Results of Swapped Data Sets Using 10 Principal Components

          Training Data          Testing Data
Class     N     H     RHA        N     H     RHA
N         28    0     0          25    5     0
H         1     163   0          2     105   4
RHA       0     1     2          0     2     1
Overall Performance: 98.97% (training), 90.97% (testing)
Table 4. Classification Results Using Combined Wavelet Transform and Neural Classifier

          Training Data          Testing Data
Class     N     H     RHA        N     H     RHA
N         28    0     0          25    2     1
H         0     111   0          2     159   3
RHA       0     1     2          0     1     2
Overall Performance: 99.3% (training), 95.38% (testing)
Table 5. Classification Results of Using Swapped Data Sets

          Training Data          Testing Data
Class     N     H     RHA        N     H     RHA
N         28    0     0          27    3     0
H         0     164   0          0     107   4
RHA       0     1     2          0     1     2
Overall Performance: 99.49% (training), 94.44% (testing)

4.2 Computational Complexity
For real time applications, the reduction of computation complexity is of greatest concern for the engineering design. Table 6 shows a comparison of the forward pass computation complexity of the two combined classification systems. It shows that the computation operation required for the combined wavelet transformation is similar to that of the combined PCA network with 10 outputs. However, the memory storage required for the combined wavelet transform and neural classifier system is much smaller. Moreover, unlike the PCA method, no training procedure is required to generate the features for classifier inputs. This provides a great advantage for real time implementation. The comparison results of performance and computation efficiency suggest that the combined wavelet transformation and neural network system is suitable for real time implementation.
5 Conclusion
A study was made to determine the usefulness of combining different types of features for BCG classification. The simulation results using the partitioned training scheme show that the wavelet transform with neural network system has better performance than the combined PCA and neural network approach. The simulation results also suggest that the improvement in system performance is constrained by the limited number of data samples. The results indicate that the number of data samples and the balance of the class patterns in the samples have a significant influence on the generalization capability of the combined PCA and neural network system. Hoffbeck and Landgrebe [11] have suggested that statistical methods need a large training data set to provide reliable statistical estimation, for example, to accurately estimate a sample covariance matrix. Although constructing a compactly supported mother wavelet involves complex mathematics, the calculation of the wavelet transform is relatively simple; the wavelet transform can be implemented as finite impulse response (FIR) filters [10]. It has been demonstrated that the wavelet transform has compact computational complexity. From the classification performance and real-time implementation points of view, the multiresolution wavelet is promising for on-line dimensionality reduction in BCG analysis in terms of performance, storage and operations per decision.
Table 6. Comparison of the Computation Complexity for BCG Classification
             Combined PCA Method                      Combined Wavelet Method
             Multiplication  Addition  Storage        Multiplication  Addition  Storage
Feature      800             808       808            720             660       12
Classifier   115             115       115            99              99        99
Total        915             923       923            819             759       111
References
[1] I. Starr and F. C. Wood, "Twenty years studies with the ballistocardiograph, the relation between the amplitude of the first record of 'healthy' adults and eventual mortality and morbidity from heart disease," Circulation, Vol. 23, pp. 714-732, 1961
[2] C. E. Kiessling, "Preliminary appraisal of the prognostic value of ballistocardiography," Bibl. Cardiol., Vol. 26, pp. 292-295, 1970
[3] T. N. Lynn and S. Wolf, "The prognostic significance of the ballistocardiogram in ischemic heart disease," Am. Heart J., Vol. 88, pp. 277-280, 1974
[4] R. A. Marineli, D. G. Penney, W. A. Marineli, and F. A. Baciewicz, "Rotary motion in the heart and blood vessels: A review," Journal of Applied Cardiology, Vol. 6, pp. 421-431, 1991
[5] X. Yu, "Ballistocardiogram classifier prototyping with CPLDs," Electronic Engineering, Vol. 68, No. 834, pp. 39-40, 1996
[6] W. K. Harrison, S. A. Talbot and Baltimore, "Discrimination of the Quantitative Ultralow-frequency Ballistocardiogram in Coronary Heart Disease," American Heart Journal, Vol. 74, pp. 80-87, 1967
[7] I. T. Jolliffe, "Principal Component Analysis," Springer-Verlag Press, 1986
[8] E. Oja, H. Ogawa, and J. Wangviwattana, "Principal component analysis by homogeneous neural networks, Part I & Part II: The Weighted Subspace Criterion," IEICE Trans. Inf. & Syst., Vol. E75-D, No. 3, pp. 366-381, 1992
[9] S. G. Mallat, "A theory for multiresolution signal decomposition: The wavelet representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, No. 7, pp. 674-693, 1989
[10] O. Rioul and M. Vetterli, "Wavelets and signal processing," IEEE Signal Processing Magazine, Vol. 8, pp. 14-38, 1991
[11] J. P. Hoffbeck and D. A. Landgrebe, "Covariance matrix estimation and classification with limited training data," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, No. 7, pp. 763-767, 1996
Learning from the Expert: Improving Boundary Definitions in Biomedical Imagery Stewart Crawford-Hines Colorado State University Fort Collins, Colorado USA
[email protected] Abstract. Defining the boundaries of regions of interest in biomedical imagery has remained a difficult real-world problem in image processing. Experience with fully automated techniques has shown that it is usually quicker to manually delineate a boundary rather than correct the errors of the automation. Semi-automated, user-guided techniques such as Intelligent Scissors and Active Contour Models have proven more promising, since an expert guides the process. This paper will report and compare some recent results of another user-guided system, the Expert's Tracing Assistant, a system which learns a boundary definition from an expert, and then assists in the boundary tracing task. The learned boundary definition better reproduces expert behavior, since it does not rely on the a priori edge-definition assumptions of the other models.
1 Background
The system discussed in this paper provides a computer-aided assist to human experts in boundary tracing tasks, through the application of machine learning techniques to the definition of structural image boundaries. Large imagery sets will usually have a repetition and redundancy on which machine learning techniques can capitalize. A small subset of the imagery can be processed by a human expert, and this base can then be used by a system to learn the expert's behavior. The system can then semiautomate the task with this knowledge. The biomedical domain is a rich source of large, repetitive image sets. For example, in a computed tomographic (CT) scan, cross-sectional images are generated in parallel planes typically separated by millimeters. At a 2mm separation between image planes, approximately 75 images would be generated in imaging the complete brain. Images such as this, generated along parallel planes, are called sectional imagery. Such sectional imagery abounds in medical practice: X-ray, MRI, PET, confocal imagery, electron microscopy, ultrasound, and cryosection (freezing and slicing) technologies all produce series of parallel-plane 2D images. Generating a three dimensional polygonal model of a structure from sectional imagery requires bounding the structure across the whole image set. Currently, the V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 653-659, 2003. Springer-Verlag Berlin Heidelberg 2003
reference standard for high-quality outlining tasks is an expert's delineation of the region. The state-of-the-practice is that this is done manually, which is a repetitive, tedious, and error-prone process for a human. There has been much research directed toward the automatic edge detection and segmentation of images, from which the extraction of a boundary outline could then proceed. Systems based on this work have run into two significant problems: (1) the cost of user adjustments when the system runs into troublesome areas often exceeds the cost of manually tracing the structure from the start, and (2) the a priori assumptions implicit in these methods impact the ability of the system to match the expert's performance on boundary definition tasks where an expert's judgement is called into play. The system discussed herein, the Expert's Tracing Assistant (ETA), provides viable assistance for such tracing tasks which has proven beneficial in several regards:
• the specialist's time can be significantly reduced;
• errors brought on by the tedium of tracing similar boundaries over scores of similar images can be reduced; and
• the automated tracing is not subject to human variability and is thus reproducible and more consistent across images.
The thrust of this work is to learn the boundary definitions of structures in imagery, as defined by an expert, and to then assist the expert when they need to define such boundaries in large image sets. The basic methodology of this Expert's Tracing Assistant is:
1) in one image, an expert traces a boundary for the region of interest;
2) that trace is used in the supervised training of a neural network;
3) the trained network replicates the expert's trace on similar images;
4) the expert overrides the learned component when it goes astray.
The details of the neural network architecture and training regimen for this system have been previously documented by Crawford-Hines & Anderson [1]. This paper compares these learned boundaries to the semi-automated boundary definition methods of Intelligent Scissors by Barrett & Mortensen [2] , and Active Contour Models begun by Kass, Witkin, & Terzopoulos [3].
2 Methods for Boundary Tracing
To understand the relative merits of learning boundary contours, the Expert's Tracing Assistant (ETA) was studied in comparison to other user-guided methods representing the current best state of the practice for boundary delineation. The techniques of Active Contour Models (ACM) and Intelligent Scissors (IS) were chosen for comparison to ETA because they have been brought into practice, they have been studied and refined in the literature, and they represent benchmarks against which other novel methods are being compared. The ground truth for the comparison of these methods is an expert's manual tracing of a structure's boundary in an image.
The structures chosen for comparison were taken from the imagery of the Visible Human dataset, from the National Library of Medicine [4], which is also becoming a benchmark set upon which many image processing and visualization methods are being exercised. Several structures were selected as representative cross-sections. For each, the IS, ACM, and ETA methods were used to define the structure's boundary and an expert manually delineated the boundary in two independent trials. Figure 1 shows three structures to be compared. The leg bone and skin are shown clearly, without much confusion. The leg muscle is fairly typical, surrounded by highly contrasting fatty tissue. However sometimes only a thin channel of tissue separates one muscle from the next. IS, also known as Live-Wire, is a user-guided method. With an initial mouse click, the user places a starting point on a boundary of interest; the system then follows the edges in an image to define a path from that initial control point to the cursor's current screen location. As the cursor moves, this path is updated in real time and appears to be a wire snapping around on the edges in an image, hence the terminology "live wire" for this tool. ACM, another user-guided methodology, uses an energy minimizing spline, that is initialized close to a structure of interest and then settles into an energy minima over multiple iterations. The energy function is defined so these minima correspond to boundaries of interest in the imagery. Since the initial contour is closed, the final result will always be a closed, continuous curve.
Fig. 1. A transverse image of the leg, highlighting the femur (bone), the biceps femoris (muscle), and the skin
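For orientation, a bare-bones version of the live-wire idea can be sketched as a least-cost path over gradient-derived edge costs (Python with NumPy). The single inverse-gradient cost term below is a simplified assumption, not Mortensen and Barrett's full feature set.

# Sketch: live-wire as a least-cost path on a gradient-derived cost image
import heapq
import numpy as np

def live_wire(image, start, end):
    """start and end are (row, col) seed points; returns the least-cost pixel path."""
    img = image.astype(float)
    gy, gx = np.gradient(img)
    cost = 1.0 / (1.0 + np.hypot(gx, gy))          # strong edges become cheap to follow
    h, w = cost.shape
    dist = np.full((h, w), np.inf)
    prev = {}
    dist[start] = 0.0
    heap = [(0.0, start)]
    while heap:
        d, (i, j) = heapq.heappop(heap)
        if (i, j) == end:
            break
        if d > dist[i, j]:
            continue
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                ni, nj = i + di, j + dj
                if (di or dj) and 0 <= ni < h and 0 <= nj < w:
                    nd = d + cost[ni, nj] * np.hypot(di, dj)
                    if nd < dist[ni, nj]:
                        dist[ni, nj] = nd
                        prev[(ni, nj)] = (i, j)
                        heapq.heappush(heap, (nd, (ni, nj)))
    path, node = [], end
    while node != start:
        path.append(node)
        node = prev[node]
    return [start] + path[::-1]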
3 Comparing Boundaries
For a ground truth in this comparison, an expert was asked to manually trace the structures. The expert traced the structures twice, generating two independent contours for each structure. This permits a basic measure of the variation within the expert's manual tracings to be quantified. It might be argued that this ground truth is not really a truth, it is only one user's subjective judgement of a structural boundary. But the expert user brings outside knowledge to bear on the problem and is dealing with more than simple pixel values when delineating a boundary. And for a system to be useful and acceptable as an assistant to an expert, it should replicate what the expert is attempting to do, rather than do what is dictated by some set of a priori assumptions over which the expert has no input or control. The boundary definitions are to be quantitatively compared to each other and to the ground truth of the expert. The boundaries produced by each of these methods are basically sets of ordered points which can be connected by lines or curves or splines to create a visually continuous bound around a region. To compare two boundaries, A and B, we first connect the points of B in a piecewise linear curve, and measure the minimum distance from each point in A to the curve of B. We then flip the process around, and measure from each point in B to the piecewise linear curve of A. The collection of these measures is called a difference set. Figure 2 illustrates several visualizations of this difference set. The first graph, in the upper half of the figure, plots the distance between the curves (on the vertical axis) as a function of position on the curve (on the horizontal axis). In this example, there are perhaps three places where the curves are significantly more than one pixel apart from each other, shown in the plot by the six excursions of the graph over the distance of 1.0 (remembering the plot measures A to B and B to A, thus typically excursions show up twice). If the goal is to have the curves within one pixel of each other, this indicates that there are three places where operator intervention would be required to adjust the curves so as to meet that objective. The lower left of Figure 2 is a simple histogram of the distance set, with the number of occurrences on the vertical axis and distance bins on the horizontal. The lower right is an empirical cumulative distribution function (CDF) over the distance set. The vertical axis measures the fraction of occurrences that are within a tolerance specified on the horizontal axis. The CDF allows quantification of the inter-curve distances by selecting a tolerance such as one pixel and stating, "The curves are within one pixel of each other 86% of the time" or by selecting a percentile such as 90% and stating, "The curves are within 1.1 pixels of each other 90% of the time".
Fig. 2. Three visualizations of the distance set
4 Key Results
The three structures of Figure 1 typify the range of results found so far across many images. The expert manually outlined each structure on two independent trials. The expert's first boundary is used as the Ground Truth (GT), while the second manually traced boundary (M2T) is used to provide a measure of intra-expert variation, i.e., the inherent variation a user shows in tracing a boundary at different times. Given that there is variation within what an expert might manually trace, a good boundary delineation method needn't exactly match any specific expert's boundary definition, but it should be within the range of that expert's variance. Looking at the left-hand side of Figure 3, this is exactly what is observed. The black curve illustrates the CDF of M2T compared to the Ground Truth. Note the three methods are roughly comparable, all close to the bound of the black CDF. The right-hand side of the figure shows the performance for the muscle. The performance is consistently worse overall. Figure 4 shows a detail of the five boundaries superimposed on the original image; the expert's traces are in black, the semi-automated traces in white. In the lower-left of the muscle, however, there is no consistency of definition: even the expert made different traces at different times. All did equally poorly. Figure 5 shows the results for the leg skin. Here the performance difference is dramatic between ETA (far left) and the IS and ACM methods (to the right). Figure 6 illustrates what is happening in this situation. The expert judgement of the skin boundary places the boundary slightly inside what a more classically-defined boundary would be; note that both IS and ACM agree on where the boundary lies, and a priori this appears to be a sensible boundary to draw. In this case, however, the body was encased in a gel before freezing, and the expert is accounting for both gel effects and the image pre-processing in locating the actual skin boundary. The expert is consistent in this judgement, and the ETA system has learned this behavior and replicated it.
Fig. 3. CDFs of M2T (black) and IS, ACM, and ETA (grey) for the bone (left) and the muscle (right) from Figure 1
Fig. 4. Detail of the five boundaries
Fig. 5. Results for the leg's skin: CDFs for, from left to right: ETA, M2T, ACM, and IS
Fig. 6. In this detail of the skin, the expert (in black) has traced consistently to the inside of what would be classically considered the image's edge; ETA (white) follows the expert's lead, while IS and ACM follow more traditional edge definitions.
The range of results in Figures 3 and 5 typifies what has been seen so far: the learned boundary was either consistent with the classically defined IS and ACM methods, or it did better when expert judgement came into play.
References
[1] S. Crawford-Hines & C. Anderson, "Learning Expert Delineations in Biomedical Image Segmentation", ANNIE 2000 - Artificial Neural Networks In Engineering, November 2000.
[2] E.N. Mortensen & W.A. Barrett, "Interactive Segmentation with Intelligent Scissors", Graphical Models and Image Processing, v.60, 1998, pp.349-384.
[3] M. Kass, A. Witkin, D. Terzopoulos, "Snakes: Active Contour Models", First International Conference on Computer Vision, 1987, pp.259-268.
[4] National Library of Medicine, as of Jan 2001: http://www.nlm.nih.gov/research/visible
Low Complexity Functions for Stationary Independent Component Mixtures
K. Chinnasarn¹, C. Lursinsap¹, and V. Palade²
¹ AVIC Center, Department of Mathematics, Faculty of Science, Chulalongkorn University, Bangkok, 10330, Thailand
[email protected] [email protected]
² Oxford University Computing Laboratory, Parks Road, Oxford, OX1 3QD, UK
[email protected]
Abstract. A low complexity activation function and an online sub-block learning method for non-gaussian mixtures are presented in this paper. The paper deals with independent component analysis with mutual information as a cost function. First, we propose a low complexity activation function for non-gaussian mixtures, and then an online sub-block learning method for stationary mixtures is introduced. The size of the sub-blocks is larger than the maximal frequency Fmax of the principal component of the original signals. Experimental results proved that the proposed activation function and the online sub-block learning method are more efficient in terms of computational complexity as well as in terms of learning ability. Keywords: Blind signal separation, independent component analysis, mutual information, unsupervised neural networks
1 Introduction
In "independent component mixture" or "non-gaussian mixture" problems, the source signals are mutually independent and no information about the mixture environment is available. The recovering system receives unknown mixed signals from the receivers, such as microphones or sensors. In order to recover the source signals, some unsupervised intelligent learning systems are needed. We used, in our approach, a combined learning system, including unsupervised neural networks, principal component analysis, and independent component analysis using mutual information with natural gradient. The learning rule is based on a Mutual Information (MI) using the Kullback-Leibler distance measure:

KL(p(y) \| p(\tilde{y})) = \int p(y) \log \frac{p(y)}{p(\tilde{y})} \, dy    (1)
This work is fully supported by a scholarship from the Ministry of University Affairs of Thailand.
where p(y) is the true probability density function of y, and p(ỹ) is an estimate of the probability density function of y. For low computational time, we propose some approximation functions of 2nd order for non-gaussian mixtures. In our previous paper [3], we presented only the batch mode learning, which needs a computational complexity of at least O(K^3). In this paper, we propose an online sub-block learning method which requires a complexity of at most O(K·k^2), where k < K. The paper is organized as follows: Section 2 describes some background on the ICA problem; Section 3 proposes some low complexity approximations of the activation functions; the experimental design and results are presented in sections 4 and 5, respectively; Section 6 presents analytical considerations on complexity; some conclusions are presented in section 7.
2 Independent Component Analysis Problem
An Independent Component Analysis (ICA) approach is adequate because, usually, the sources s_i are statistically independent of each other. A very well-known practical example of an ICA application is the cocktail party problem. This problem assumes there are some people talking simultaneously in a room, which is provided with some microphones for receiving what they are talking about. Herein, we assume that there are n people and m microphones, as illustrated in figure 1 (for n = m = 3). Each microphone Mic_j gives a recorded time signal, denoted as x_j(t), where 1 ≤ j ≤ m and t is an index of time. Each of these recorded signals is a linear combination of the original signals s_i (1 ≤ i ≤ n) using the mixing matrix A, as given below:

x_j(t) = \sum_{i=1}^{n} a_{ji} s_i(t)    (2)
where a_{ji}, 1 ≤ j, i ≤ n, are the weighted sum parameters, which depend on the distance between the microphones and the speakers [7]. If the sources s_i are near to the receivers Mic_j, the mixing matrix A is similar to a diagonal matrix, a special case of the ICA problem. In the case of a mixture in a diagonal environment, it is said that there is no mixing occurrence between the sources s_i, because an original signal is only scaled and/or permuted by the diagonal mixing matrix. Commonly, the elements of the mixing matrix A and the probability density function of s_i(t) are unknown in advance. The only basic assumption of the cocktail party problem is that all of the sources s_i(t) are independent and identically distributed (iid). The basic background of ICA was presented in [1] [4] [7]. The objective of an ICA problem is to recover the source signals s̃ = y = Wx from an observed signal x = As, where each component of the recovered signals y_i(t) is iid. The equation for transforming the mixed signals is the following:

s̃ = y = Wx = WAs = A^{-1}As = Is = s    (3)
Fig. 1. The cocktail party problem with 3 speakers and 3 receivers
Equation (3) shows that the full rank de-mixing matrix W is needed for recovering the mixed signals x_i. There are many successful algorithms for finding the de-mixing matrix W, such as minimization of mutual information [1], infomax and maximum likelihood estimation [2], projection pursuit [7], etc. Herein, we prefer the mutual information with natural gradient (MING) algorithm, which was proposed by Amari et al. [1] in 1996. Lee et al. [9] and Chinnasarn and Lursinsap [3] added a momentum term β∆W_t for speeding up the convergence. The learning equation is described below:

W_{t+1} = W_t + \eta [I - \phi(y) y^T] W_t + \beta \Delta W_t    (4)

where η is the learning rate, β is the momentum rate, I is an identity matrix, t is the time index, φ_i(y) = ∂p(y_i)/∂y_i is the nonlinear activation function which depends on the probability density function of the source signals s_i, and y = Wx. Let us consider that the activation functions for de-mixing of the sub-gaussian and the super-gaussian distributions are φ(y) = y^3 and φ(y) = tanh(αy), respectively. In this paper, we are looking for some activation functions of lower complexity than the functions given above.
3 Low Complexity Activation Functions
In the ICA problem, at most one gaussian channel is allowed, because the transformation of two gaussianities is also gaussian in another variable [7]. The non-gaussian signals can be classified into the super-gaussian and the sub-gaussian distributions. A super-gaussian signal has a sharp peak and a large tail in its probability density function (pdf). On the other hand, sub-gaussian signals have a flat pdf. As we described in the previous section, the nonlinear activation function φ_i(y) in equation (4) is determined by the sources' distribution. In this paper, we used the kurtosis [4, 5] for selecting an appropriate activation function.

Kurtosis(s) = \frac{E[s^4]}{(E[s^2])^2} - 3    (5)

where Kurtosis(s) values are negative, zero, and positive for sub-gaussianity, gaussianity, and super-gaussianity, respectively.
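In code, the selection reduces to the sign of the empirical kurtosis of Eq. (5); a sketch assuming zero-mean signals, as produced by the centering step of Sect. 4.1 (numpy as in the previous sketch):

def kurtosis(s):
    # Empirical kurtosis of a zero-mean signal, Eq. (5)
    s = np.asarray(s, dtype=float)
    return np.mean(s**4) / np.mean(s**2)**2 - 3.0

def activation_family(s):
    # Negative kurtosis -> sub-gaussian activation, positive -> super-gaussian activation
    return "sub-gaussian" if kurtosis(s) < 0 else "super-gaussian"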
3.1 Approximation for Super-Gaussianity
In [8], Kwan presented the KTLF (Kwan Tanh-Like activation Function), which is a 2nd order function. This function is an approximation of the tanh(2y) function. He divided the approximation curve into 3 regions, which are the upper bound y(t) ≥ L, the nonlinear logistic tanh-like area −L < y(t) < L, and the lower bound y(t) ≤ −L. All regions are described below:

\phi(y) = \begin{cases} 1, & y \ge L \\ y(\gamma - \theta y), & 0 \le y < L \\ y(\gamma + \theta y), & -L < y < 0 \\ -1, & y \le -L \end{cases}    (6)

The shape of the KTLF curve is controlled by γ = 2/L and θ = 1/L^2. The approximation function given in equation (6) corresponds to the tanh(2y) function. Consequently, the term α/2 is needed for controlling y, and we also suggest L = 1. Then the modified equation can be rewritten as follows:

\phi(y) = \begin{cases} 1, & y \ge 1 \\ \tilde{y}(2 - \tilde{y}), & 0 \le y < 1 \\ \tilde{y}(2 + \tilde{y}), & -1 < y < 0 \\ -1, & y \le -1 \end{cases}    (7)

where ỹ = αy/2 and α is the upper peak of the derivative of the activation function. Figure 2 shows tanh(αy), its approximation (the dashed line) and their derivatives. From the figure we can conclude that the fraction α/2 fits all tanh(αy).
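A direct transcription of Eq. (7), using numpy as before (the vectorised form and the function name are ours):

def ktlf(y, alpha=2.0):
    # Modified KTLF, Eq. (7): 2nd-order approximation of tanh(alpha*y) with L = 1
    y = np.asarray(y, dtype=float)
    yt = alpha * y / 2.0                        # y-tilde = (alpha/2) * y
    out = np.where(y >= 0.0, yt * (2.0 - yt), yt * (2.0 + yt))
    out = np.where(y >= 1.0, 1.0, out)          # upper saturation region
    out = np.where(y <= -1.0, -1.0, out)        # lower saturation region
    return out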
Fig. 2. (a) An activation function φ(y) = tanh(2y), its approximation from equation (7) and (b) their derivatives
3.2 Approximation for Sub-gaussianity
In this subsection, we propose a 2nd order approximation of φ(y) = y^11 and φ(y) = y^3, presented by Amari et al. [1] and Cichocki et al. [4], respectively.
Fig. 3. (a) Graphical representation of the 11th, 3rd, and 2nd order activation functions and (b) their derivatives
Given the graphical representation of the sub-gaussian activation functions illustrated in figure 3, it can be seen that the sub-gaussian activation functions can be separated into 2 regions: the positive and the negative regions. For de-mixing the sub-gaussian distribution, we propose the bisection paraboloid function given in equation (8), which is a good approximation of the functions previously reported in the literature.

\phi(y) = \begin{cases} +y^2, & y \ge 0 \\ -y^2, & y < 0 \end{cases}    (8)
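Equation (8) is equally short in code (same conventions as the earlier sketches):

def bisection_parabola(y):
    # Sub-gaussian activation, Eq. (8): +y^2 for y >= 0, -y^2 for y < 0
    y = np.asarray(y, dtype=float)
    return np.sign(y) * y * y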
4 Experimental Design
For de-mixing efficiency, the signal separating procedure is divided into 2 sub-procedures, which are the preprocessing and the blind source separating procedure.
4.1 Preprocessing
For speeding up the procedure of signal transforming, some preprocessing steps are required. We used 2 steps of preprocessing as described in [7], which are centering and prewhitening. The first step is centering. For an ICA learning procedure, an input signal s_i, and also x_i, must be zero mean, E[s_i] = 0. This is an important assumption because a zero mean signal causes the covariance E[(s_i − m_i)(s_j − m_j)^T] to be equal to the correlation s_i s_j^T, where s_j^T is the transpose of s_j, and m_i is a channel average or statistical mean. Consequently, we can calculate an eigenvalue and an eigenvector from both the covariance and the correlation. If the input signal is nonzero mean, some constant, such as its mean, must be subtracted from it.
The second step is prewhitening. This step will decorrelate the existing correlation among the observed channels x_i. Sometimes, the convergence speed will be increased by the prewhitening step. Principal Component Analysis (PCA) is a practical and well-known algorithm in multivariate statistical data analysis. It has many useful properties, including data dimensionality reduction, principal and minor feature selection, data classification, etc. In this paper, without loss of generality, we assume that all sources are principal components. Then the PCA is designed as a decorrelating filter. The PCA procedure is the projection x̃ = V^T x, where V^T is the transpose of the eigenvector matrix of the covariance of the observed signals E[xx^T]. After the projection of x, the covariance becomes E[x̃x̃^T] = I, where I is an identity matrix.
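A sketch of these two steps in the same numpy style; note that the projection alone makes the covariance diagonal (i.e. it decorrelates the channels), and obtaining exactly the identity covariance mentioned above would additionally require dividing each projected channel by the square root of its eigenvalue (an extra step not shown here).

def preprocess(x):
    # x: observed signals, one channel per row
    x = x - x.mean(axis=1, keepdims=True)       # centering: make each channel zero-mean
    cov = (x @ x.T) / x.shape[1]                # covariance E[x x^T]
    _, V = np.linalg.eigh(cov)                  # eigenvectors of the covariance
    return V.T @ x, V                           # projection x-tilde = V^T x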
4.2 Blind Source Separating Processing
In this subsection, we describe an online sub-block ICA learning algorithm. We used an unsupervised multi-layer feed-forward neural network for de-mixing the non-gaussian channels. Our network learning method is a combination of the online and the batch learning techniques. Unknown signals x_i(k_0 : k_0 + k) are fed into the input layer, where k_0 is the start index of the sub-block, k ≥ Fmax is the length of the sub-block, and Fmax is the value of the maximal frequency in the time domain of the principal source s_i. Output signals y_i(k_0 : k_0 + k) are produced by y_i(k_0 : k_0 + k) = W x_i(k_0 : k_0 + k), where W is called the de-mixing matrix. If the output channels y_i(k_0 : k_0 + k) depend on each other, the natural gradient descent in equation (4) still updates the de-mixing matrix W. We then repeat producing the output signals y_i(k_0 : k_0 + k) until they become independent. The increase of the convergence speed of the online sub-block method is proved by the following theorem.

Theorem 1. ICA online sub-block learning converges faster than batch learning.

Proof. Consider that K is the total time index of the signal and k is the time index for each sub-block, where k < K. The learning equation (4) can be rewritten as follows:

W_{t+1} = W_t + \eta [I - \phi(y) y^T] W_t + \beta \Delta W_t    (9)

The computational complexity of equation (9) depends on the correlation φ(y)y^T, where y^T is the transpose of y. For batch learning with time index K, the complexity of (9) is O(K^3). On the other hand, for the online sub-block learning, we have K/k sub-blocks. The computational complexity of equation (9) is then (K/k)·O(k^3) = O(K·k^2). It holds that O(K·k^2) < O(K^3), where k < K. Hence, the ICA online sub-block learning method is faster than the batch learning method.
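The sub-block loop can be sketched as follows, reusing ica_update and preprocess from the earlier sketches. The fixed per-block iteration budget stands in for the ΔKL convergence test of Sect. 5, the learning-rate decay and momentum ratio are the values reported there, and the initial de-mixing matrix W0 would be taken from the covariance eigenvectors as described in that section.

def subblock_ica(x_tilde, W0, phi, k, eta=0.05, beta_ratio=0.01,
                 iters_per_block=100, decay=1.005):
    # Online sub-block learning over K/k blocks of length k (k >= Fmax);
    # x_tilde is the centred and decorrelated signal from preprocess()
    W, dW, lr = W0.copy(), np.zeros_like(W0), eta
    for k0 in range(0, x_tilde.shape[1] - k + 1, k):
        block = x_tilde[:, k0:k0 + k]
        for _ in range(iters_per_block):          # iterate Eq. (4)/(9) on the current sub-block
            W, dW = ica_update(W, block, phi, dW, lr, beta_ratio * lr)
            lr /= decay                           # learning-rate decay: eta <- eta / 1.005
        # the final W of this sub-block initialises the next one (weight inheritance)
    return W @ x_tilde, W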
5 Experimental Results
Some simulations have been made on both super-gaussian and sub-gaussian signals, which contained 191,258 data points for each channel. The super-gaussian data sets were downloaded from http://sound.media.mit.edu/ica-bench/, and consist of 3 sound sources. For sub-gaussianity, we simulated our algorithm using the following three synthesized signals:
1. s_1(t) = 0.1 sin(400t) cos(30t)
2. s_2(t) = 0.01 sign[sin(500t + 9 cos(40t))]
3. s_3(t) = uniform noise in the range [-0.05, 0.05]
A mixing matrix A is randomly generated. An initial de-mixing matrix W is set to the transpose of the eigenvector matrix of the covariance of an observed signal x. As presented in [3], the convergence criterion should be set so that the Kullback-Leibler distance is less than 0.00001 (∆KL ≤ 0.00001). We have run 10 simulations using a variable learning rate with initial value 0.05 and a momentum rate value of 0.01η. Anyway, both the learning rate value and the momentum rate value can be arbitrarily set in the range (0..1]. At each learning iteration, the learning rate was decreased by a factor of 1.005 (η = η/1.005). Some of the experimental results were presented in our previous paper [3]. For improving the learning performance, we determined the relationship between the online sub-blocks. The final de-mixing matrix W of sub-block j is set as the initial de-mixing matrix of sub-block j + 1, where 1 ≤ j ≤ Tb, Tb is the total number of blocks K/k, k ≥ Fmax = 20,000 Hz, and K is the total time index. This weight inheritance will maintain the output channels in unknown mixture environments. Figures 4(a) and 4(b) display the original sources and recovered sources of the sub-gaussian and the super-gaussian distributions, respectively. As an algorithmic performance measurement, we used the performance correlation index, derived from the performance index proposed by Amari et al. in [1]. In practice, the performance index can be replaced with the performance correlation index.
E = \sum_{i=1}^{N} \left( \sum_{j=1}^{N} \frac{|c_{ij}|}{\max_k |c_{ik}|} - 1 \right) + \sum_{j=1}^{N} \left( \sum_{i=1}^{N} \frac{|c_{ij}|}{\max_k |c_{kj}|} - 1 \right)    (10)
The matrix C = φ(y)y^T is close to the identity matrix when the signals y_i and y_j are mutually uncorrelated or linearly independent. The results are averaged over the 10 simulations. Figure 5 illustrates the performance correlation index on a logarithmic scale, during the learning process, using our proposed activation functions from section 3, and the typical activation functions for the non-gaussian mixtures from section 2. Figure 5(a) corresponds to the mixture of super-gaussian signals. The curve of tanh(.) was matched by its approximation function, and it can be seen that they converge over the saturated region with the same speed. Figure 5(b) corresponds to the mixture of sub-gaussian distributions. The proposed activation function φ(y) = ±y^2 converges faster than
φ(y) = y^3 in the beginning, and slows down when the outputs y_i are close to the saturated region. The derivative of the function φ(y) = ±y^2 is smaller than the derivative of the function φ(y) = y^3 when |y| ≥ 0.667; see the slope of each function in figure 3.

Fig. 4. The mixing and de-mixing of (a) sub-gaussianity and (b) super-gaussianity
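Returning to the performance measure: the index of Eq. (10) is easy to compute from the matrix C = φ(y)y^T defined above. A small sketch in the same numpy style (the function name is ours):

def performance_index(C):
    # Performance correlation index, Eq. (10), for a square matrix C
    C = np.abs(np.asarray(C, dtype=float))
    row_term = (C / C.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
    col_term = (C / C.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
    return row_term.sum() + col_term.sum()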
6 Analytical Considerations on Complexity
For the mixture of super-gaussian signals, the unknown source signals can be recovered by tanh(αy) and its approximation as given in equation (7). Considering the same input vector, both activation functions produce a similar output vector, because the curve of the approximation was matched to the curve of tanh(αy), as illustrated in figure 2. Hence, they required the same number of epochs for recovering the source signals, as shown in figure 5(a). But the approximation function requires fewer computational micro-operations per instruction than tanh(αy), and is more suitable for hardware implementation. Regarding the recovery of the mixture of sub-gaussian signals, the curve of φ(y) = ±y^2 did not exactly match either y^3 or y^11, but they produced the same results with different convergence speeds, as shown in figure 5(b). The lower-order activation function needs a smaller memory representation during the running process. And, also, φ(y) = ±y^2 requires only XOR and complementary micro-operations per instruction.
Fig. 5. Performance correlation index for the separation of (a) the super-gaussian mixtures and (b) the sub-gaussian mixtures (averaged over 10 simulations)
7 Conclusions
In this paper, we presented a low complexity framework for the independent component analysis problem. We proposed 2nd order approximations for the de-mixing of super-gaussian and sub-gaussian signals. Moreover, the number of multiplication operations required by the separating equation was reduced by using an online sub-block learning method. The proposed activation functions and the online sub-block learning algorithm are efficient methods for de-mixing non-gaussian mixtures, with respect to convergence speed and learning abilities.
References
[1] S.-I. Amari, A. Cichocki, and H. H. Yang. A New Learning Algorithm for Blind Signal Separation, MIT Press, pp. 757-763, 1996.
[2] J.-F. Cardoso. Infomax and Maximum Likelihood for Blind Source Separation, IEEE Signal Processing Letters, Vol. 4, No. 4, pp. 112-114, 1997.
[3] K. Chinnasarn and C. Lursinsap. Effects of Learning Parameters on Independent Component Analysis Learning Procedure, Proceedings of the 2nd International Conference on Intelligent Technologies, Bangkok/Thailand, pp. 312-316, 2001.
[4] A. Cichocki and S.-I. Amari. Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications, John Wiley & Sons, Ltd., 2002.
[5] M. Girolami. An Alternative Perspective on Adaptive Independent Component Analysis Algorithms, Neural Computation, Vol. 10, No. 8, pp. 201-215, 1998.
[6] S. Haykin. Neural Networks: A Comprehensive Foundation, 2nd ed., Prentice Hall, 1999.
[7] A. Hyvarinen and E. Oja. Independent Component Analysis: Algorithms and Applications, Neural Networks, Vol. 13, pp. 411-430, 2000.
[8] H. K. Kwan. Simple Sigmoid-like Activation Function Suitable for Digital Hardware Implementation, Electronics Letters, Vol. 28, No. 15, pp. 1379-1380, 1992.
[9] T.-W. Lee, M. Girolami and T. J. Sejnowski. Independent Component Analysis Using an Extended Infomax Algorithm for Mixed Sub-Gaussian and Super-Gaussian Sources, Neural Computation, Vol. 11, No. 2, pp. 409-433, 1999.
Knowledge-Based Automatic Components Placement for Single-Layer PCB Layout
Tristan Pannérec
Laboratoire d'Informatique de Paris VI, Case 169, 4 place Jussieu, 75005 Paris, France
[email protected]
Abstract. Printed Circuit Board layout involves the skillful placement of the required components in order to facilitate the routing process and minimize or satisfy size constraints. In this paper, we present the application to this problem of a general system, which is able to combine knowledge-based and search-based solving and which has a meta-level layer to control its reasoning. Results have shown that, with the given knowledge, the system can produce good solutions in a short time.
1 Introduction
The design of a PCB (Printed Circuit Board) for a product goes through three main stages. The first one is the design of the logical scheme, which defines the components used (part list) and their interconnections (net list). The second one is the layout of the PCB, with the definition of positions for all components and tracks, which link the components together. Generally, these two steps (component placement and routing) are carried out sequentially. The last stage is the industrial production, where the PCBs are obtained by chemical processes and components are soldered. In this paper, we are interested in the component placement stage. The problem is to find a position on the board and a direction for each component in such a way that the components do not overlap and that the placement allows a good routing result. A good routing result means that all wires are routed and that tracks are as short as possible. Depending on the nets, the length of the tracks can be of varying importance. For example, unprotected power-supply wires or wires between quartz and microcontrollers must absolutely be short for physical reasons (the quality of the product will depend on these lengths). A secondary objective during the component placement step can be to minimize the surface of the PCB if no predefined form is given. While the routing step has been done automatically for a long time in most CAD tools, fully automated component placement is a more recent feature. Most applications thus allow "interactive" placement, where the user can instantly see the consequence of placing a component in a given position, but few can achieve good automatic placement, especially for single-layer PCBs, where the problem is much harder.
In this paper we report on the use of our MARECHAL framework for solving this complex problem of automatic component placement for a single-layer PCB (the method works of course for multi-layer PCB as well). The rest of this paper is organized as follows: the second section presents some related work. Then, the MARECHAL framework is described briefly in section 3. Section 4 is dedicated to the knowledge we have given to this framework and section 5 presents the results we have achieved. Some final conclusions are drawn in section 6.
2 Related Work
The placement of components is a well-known problem, which has been extensively studied. Research has mainly focused on VLSI design. The methods used for VLSI layout can be divided into two major classes [1]: constructive methods and iterative improvement methods. The first group (also referred to as global or relative placement) builds a global placement derived from the net-cell connections. In this class, we can find the force-directed (analytical) methods [3] and the min-cut (partitioning-based) methods [2]. The iterative methods start with an initial random placement and try to improve it by applying local changes, like reflecting a cell or swapping two cells. Deterministic methods and randomized methods (either with Simulated Annealing [9] or Genetic Algorithms [11]) belong to this class. Hybrid methods have also been investigated [4] but are limited to the use of a constructive method to produce the initial placement for an iterative method. In our approach, the two processes are more integrated. Little academic work addresses specifically the PCB framework, although the problem is broadly different from the VLSI one. There are not hundreds of thousands of cells, which all have the same shape and have to be organized in predefined rows. Instead, we have from ten to a few hundred components, with very variable sizes and numbers of pins (from 2 to 30 or more). Components cannot be modeled as points as in the common VLSI methods, and the objective function has to be more precise. In VLSI layout, objective functions use the semi-perimeter method [10] to estimate wiring length and density estimation to avoid crossings between wires. In a single-layer PCB, these approximations are not precise enough and more complex models have to be considered, as we will see in the next section. But these more complex models prevent the use of classical constructive methods and lead to more CPU-consuming objective functions, which also prevent the use of classical iterative approaches (we have for example tested a genetic algorithm with our objective function but, even after 2 hours of computation, the function was not correctly optimized and the solution was of no interest). We use an objective function based on the minimum spanning tree (MST) model for each net (cf. Fig. 1 and Fig. 4). On the basis of these trees, the evaluation function is the sum of two terms. The first term is the weighted sum of all segment lengths (the weights depend on the wire type). The second term represents the crossing value: when two segments cross, a value is added to the evaluation function depending on the a priori possibilities to resolve the crossing and proportionally to the avoiding path if it can be easily computed (cf. Fig. 2). The resulting objective function takes a long time to compute but is more precise.
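To illustrate the structure of this evaluation function, here is a simplified Python sketch (all names are ours). It computes, for each net, a minimum spanning tree over the pin coordinates and sums the weighted segment lengths; the crossing term of the real system, which depends on the a priori possibilities of resolving a crossing and on the length of an avoiding path, is collapsed here into a constant penalty per crossing, so this is only a rough stand-in for the actual objective function.

import numpy as np
from itertools import combinations

def mst_edges(pins):
    # Prim's algorithm on Euclidean distances; returns index pairs of the MST edges
    n, in_tree, edges = len(pins), {0}, []
    while len(in_tree) < n:
        i, j = min(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: np.linalg.norm(pins[e[0]] - pins[e[1]]))
        edges.append((i, j))
        in_tree.add(j)
    return edges

def _orient(a, b, c):
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def segments_cross(p1, p2, q1, q2):
    # True when the segments p1-p2 and q1-q2 properly intersect
    return (_orient(p1, p2, q1) * _orient(p1, p2, q2) < 0 and
            _orient(q1, q2, p1) * _orient(q1, q2, p2) < 0)

def placement_cost(nets, weights, crossing_penalty=10.0):
    # Weighted sum of MST segment lengths plus a penalty for each pair of crossing segments
    segments, cost = [], 0.0
    for pins, w in zip(nets, weights):
        pins = [np.asarray(p, dtype=float) for p in pins]
        for i, j in mst_edges(pins):
            segments.append((pins[i], pins[j]))
            cost += w * np.linalg.norm(pins[i] - pins[j])
    for (a1, a2), (b1, b2) in combinations(segments, 2):
        if segments_cross(a1, a2, b1, b2):
            cost += crossing_penalty
    return cost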
Fig. 1. Example of an MST for a 5-pin net
Fig. 2. Example of importance of crossing
3 The Marechal System
The MARECHAL system is a general problem solver, which uses domain-dependent knowledge. For the moment, it is limited to optimisation problems where the goal is to coordinate numerous concurrent decisions to optimise a given criterion (e.g. the evaluation function defined in the previous section). Besides automatic component placement, it is also currently applied to a complex semi-concurrent game [7] and timetabling. The approach materialized in the MARECHAL framework is first of all based on three principles. The first one is the integration of knowledge-based solving with search-based solving [6]. The system uses domain-dependent knowledge to select the best a priori choices, but it can also question these choices to explore the search space. The second principle is to use abstract reasoning levels to limit the search to coherent solutions. The problem is thus decomposed into sub-problems (thanks to domain-specific knowledge) and, for each decomposition level, the system can make abstract choices (choices about intentions, stage orders…). The third principle is to use auto-observation to control the solving process. By means of domain-dependent and domain-independent meta-level rules, the system can analyse its own reasoning and determine how to improve the current solution [5]. Instead of randomly trying changes to the current solution, it can thus limit the search to interesting improvement possibilities. These three important features of our approach allow dealing with complex problems where pure search methods cannot be applied (because of huge
search space and irregular time-consuming objective functions) and where good specific heuristics cannot be found for all possible instances of the problem. The MARECHAL system materializes this approach by providing a kind of interpreter for a specific language used to define the domain-specific knowledge. It is built on a meta-level architecture: the first part is the basic system, which is able to construct solutions, and the second part is the meta level, which is responsible for managing and monitoring the solving process by observing the first part [8]. To achieve that, a bi-directional communication mechanism is used: on the one hand, the basic system sends notifications and traces about what it is doing and what has happened (a sub-problem has been successfully solved…) and, on the other hand, the supervision layer sends orders or recommendations, possibly with new constraints or objectives, to precisely drive the resolving process.
4 Knowledge
In this section, we describe in more detail the application of the MARECHAL system to component placement, with a particular emphasis on the knowledge given to solve this problem. The knowledge allows the system to decompose the problem (decomposition expertise), to construct a solution for each sub-problem (construction expertise) and to intelligently explore the solution space by improving the solutions already constructed (improvement expertise).
4.1 Decomposition and Construction Knowledge
Compared with the other applications, the decomposition knowledge for the PCB layout problem is simple. The main problem is decomposed into two stages. First, the system tries to generate macro-components by looking for repeated patterns in the circuit. Components belonging to a macro-component are then placed locally and replaced by a single new component. This stage is especially useful for large circuits. The second stage is the main placement stage. If we have n components, n calls to the "Single Component Placement" (SCP) sub-problem are made, according to the best a priori order. Heuristics used to define this order are based on the number of connections and the size of the components (big components with many connections are set up first). The SCP sub-problem first chooses an already placed "brother" component and tries to place the new component close to this brother by scanning the local area (the last point is the most complex one, because it poses difficult perception problems). To find a "brother" component, the heuristics emphasize the components which share many connections. Heuristics used to choose the concrete positions are based on a simulation of the local "rat nest" that will be computed for the component to be placed.
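As a small illustration of the ordering and "brother" heuristics just described (the component representation and attribute names are hypothetical, not taken from the system):

def placement_order(components, n_connections):
    # Big components with many connections are placed first
    return sorted(components,
                  key=lambda c: (n_connections[c.name], c.width * c.height),
                  reverse=True)

def choose_brother(component, placed, shared_connections):
    # Prefer the already placed component that shares the most connections
    return max(placed, key=lambda p: shared_connections[component.name, p.name])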
4.2 Improvement Knowledge
With the above knowledge, the system is thus able to construct a solution but, as the choices are done a priori (next component to consider, position of the components…),
they can be proven wrong after some attempts, when a future component cannot be correctly placed. That is why a priori choices have to be questioned by means of improvement tests, which allow a partial exploration of the solution space. Two types of improvement are tested by the system for the automatic component placement application. The first type is called "immediate tests" because they are executed as soon as the possibilities are discovered, without waiting for a full solution. For example, if a component Cx cannot be placed next to Cy (where it would be well placed) because of the presence of Cz, the position of Cz is immediately questioned before placing any other component: the system tries to place Cz in another place that will allow Cx to be placed next to Cy. For example, in Fig. 3, RL3 is placed first, but then U6 cannot be placed close to the pre-placed component B4, although a short connection is required between U6 and B4. The position of RL3 is thus put into question to obtain the layout on the right, which is better. Then, the placement process goes on with a new component. A total of seven rules produce improvement possibilities of this type.
Fig. 3. Example of immediate improvement
The second type of improvement acts after having constructed a full solution. When all components have been placed, the system tries to modify the worst placed component or to minimize the size of the circuit by generating new form constraints. The process stops after a fixed amount of time and the best-constructed solution is sent to the CAD program.
5 Results
Our system has been tested on several circuits involving from 20 to 50 components, with one to three minutes to produce each solution. Two examples of solutions produced by the system for simple circuits are given in Fig. 4. The figure displays the associated MSTs: the bold lines represent connections that have to be especially minimized. In all cases, the solutions produced by the system had a better value than those produced manually by a domain expert, in terms of the evaluation function presented in section 3 (cf. Fig. 5). Of course, this function is not perfect, as only a full routing process can determine if the placement is really good.
Fig. 4. Examples of solutions produced by the system
Fig. 5. Comparison of the objective function values obtained by our system and by human experts (the values have to be minimized)
That is why the solutions have also been evaluated by a domain expert, who was asked to build a final solution (with the tracks) from the solution produced by the system. It appeared that only minor changes were necessary to obtain a valid solution. In addition, these manual routing processes did not use more "straps" (bridges to allow crossings) than manual routing performed on the human placements.
6 Conclusion
In this paper, we have presented the application of the MARECHAL system to automatic component placement for single-layer PCB layout.
Thanks to the meta-level solving control and the combination of knowledge-based and search-based approaches, the system is able to produce good solutions in a short time, compared to human solutions. As the single-layer problem is more difficult than the multi-layer problem, the knowledge could easily be adapted to deal with multi-layer PCB design. More constraints, such as component connection order, could also easily be added. Future work will first consist in improving the current knowledge to further minimize the necessary human intervention. A more important direction is to implement an automatic router within the MARECHAL framework and to connect the two modules. Thus, the router will serve as an evaluation function for the placement module, which will be able to modify component positions according to the difficulties encountered in the routing process. The system will then be able to produce a fully routed solution from scratch and without any human intervention.
References
[1] H.A.Y. Etawil: Convex optimization and utility theory: new trends in VLSI circuit layout. PhD Thesis, University of Waterloo, Canada (1999).
[2] D.J.H. Huang & A.B. Kahng: Partitioning-based standard-cell global placement with an exact objective. Inter. Symp. on Physical Design (ISPD) (1997), 18-25.
[3] F. Johanes, J.M. Kleinhaus, G. Sigl & K. Antereich: Gordian: VLSI placement by quadratic programming and slicing optimization. IEEE Trans. on CAD, 10(3) (1991), 356-365.
[4] A. Kennings: Cell placement using constructive and iterative methods. PhD Thesis, University of Waterloo, Canada (1997).
[5] T. Pannérec: Using Meta-level Knowledge to Improve Solutions in Coordination Problems. Proc. 21st SGES Conf. on Knowledge Based Systems and Applied Artificial Intelligence, Springer, Cambridge (2001), 215-228.
[6] T. Pannérec: An Example of Integrating Knowledge-based and Search-based Approaches to Solve Optimisation Problems. Proc. of the 1st European STarting A.I. Researchers Symp. (STAIRS 02), p. 21-22, Lyon, France (2002).
[7] T. Pannérec: Coordinating Agent Movements in a Semi-Concurrent Turn-Based Game of Strategy. Proceedings of the 3rd Int. Conf. on Intelligent Games and Simulation (GameOn 02), p. 139-143, Harrow, England (2002).
[8] J. Pitrat: An intelligent system must and can observe his own behavior. Cognitiva 90, Elsevier Science Publishers (1991), 119-128.
[9] C. Sechen: VLSI placement and global routing using simulated annealing. Kluwer Academic Publishers (1988).
[10] C. Sechen & A. Sangiovanni: The TimberWolf 3.2: A new standard cell placement and global routing package. Proc. 23rd DAC, IEEE/ACM, Las Vegas (1986), 408-416.
[11] K. Shahookar & P. Mazumder: A genetic approach to standard cell placement using meta-genetic parameter optimization. IEEE Trans. on Computers, 9(5) (1990), 500-511.
Knowledge-Based Hydraulic Model Calibration
Jean-Philippe Vidal¹, Sabine Moisan², and Jean-Baptiste Faure¹
¹ Cemagref, Hydrology-Hydraulics Research Unit, Lyon, France {vidal,faure}@lyon.cemagref.fr
² INRIA Sophia-Antipolis, Orion Project, France
[email protected]
Abstract. Model calibration is an essential step in physical process modelling. This paper describes an approach for model calibration support that combines heuristics and optimisation methods. In our approach, knowledge-based techniques have been used to complement standard numerical modelling ones in order to help end-users of simulation codes. We have both identified the knowledge involved in the calibration task and developed a prototype for calibration support dedicated to river hydraulics. We intend to rely on a generic platform to implement artificial intelligence tools dedicated to this task.
1 Introduction
Simulation codes are scientific and technical programs that build numerical models of physical systems, and especially environmental ones. We are interested in river hydraulics, where simulation codes are based on the discretisation of the simplified fluid mechanics equations (de Saint-Venant equations) that model streamflows. These codes have been evolving during the last forty years from basic numerical solvers to efficient and user-friendly hydroinformatics tools [1]. When a numerical model is built up for a river reach and its corresponding hydraulic phenomena (e.g., flood propagation), the model must be as representative as possible of physical reality. To this end, some numerical and empirical parameters must be adjusted to make numerical results match observed data. This activity – called model calibration – can be considered as a task in the artificial intelligence sense. This task has a predominant role in good modelling practice in hydraulics [2] and in water-related domains [3]. Users of simulation codes currently carry out model calibration either by relying on their modelling experience or by resorting to an optimisation code. The most widely used method is trial-and-error, which consists in running the code, analysing its outcomes, modifying parameters and restarting the whole process until a satisfactory match between field observations and model results has been reached. Few indications on how to carry out this highly subjective task in an efficient way are given in reference books [4], and experienced modellers follow heuristic rules to modify parameters. This task thus requires not only a high degree of expertise to analyse model results but also fast computers to perform numerous time-consuming simulation runs.
In order to overcome these difficulties, many automatic calibration methods have been developed over the last thirty years [5]. These methods rely on three main elements: an objective function that measures the discrepancy between observations and numerical results, an optimisation algorithm that adjusts parameters to reduce the value of the function, and a convergence criterion that tests its current value. The major drawback of this kind of calibration lies in the equifinality problem – as defined by Beven [6] – which predicts that the same result might be achieved by different parameter sets. Consequently, local minima of the objective function might not be identified by the algorithm and thus lead to unrealistic models. Artificial intelligence techniques have recently been used to improve these automatic methods, with the use of genetic algorithms, alone [7] or combined with case-based reasoning systems [8]. Our objective is to propose a more symbolic approach for hydraulic model calibration support. The aim is to make hydroinformatics expert knowledge available for end-users of simulation codes; a knowledge-based system encapsulating the expertise of developers and experienced engineers can guide the calibration task. Contrary to Picard [9], we focus more on the operational use of simulation codes and not on their internal contents. The paper first presents the cognitive modelling analysis of the calibration task and the hydroinformatics domain, then outlines the techniques used for the development of a knowledge-based system dedicated to model calibration.
2 Descriptive Knowledge Modelling
From a cognitive modelling perspective, we have first identified and organised the knowledge in computational hydraulics that may influence the model calibration task. Our approach was to divide this knowledge into generic knowledge, corresponding to concepts associated with numerical simulation of physical processes in general, and domain knowledge, corresponding to concepts specific to open-channel flow simulation. Concerning generic computational knowledge, we defined a model as a composition of data, parameters, and a simulation code (see Fig. 1). Following Amdisen [10], we distinguished static data from dynamic data. Static data are components of the model; they depend only on the studied site (e.g., in our application domain, topography and hydraulic structures). Dynamic data represent inputs and outputs of the simulation code; they include observed data (e.g., water levels measured during a flood) and corresponding computed data. Concerning domain knowledge, we extracted from river hydraulics concepts a hierarchy of bidimensional graphical objects which instantiate the previous meta-classes of data. Figure 2 shows a partial view of this hierarchy of classes and displays objects commonly used in 1-D unsteady flow modelling. For instance, CrossSection in Fig. 2 derives from StaticData in Fig. 1 and DischargeHydrograph may derive from any subclass of DynamicData.
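One possible rendering of the class structure of Fig. 1 in code, as a sketch only (the class names follow the figure and the text above; the attributes and everything else are assumptions):

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class StaticData:                 # depends only on the studied site (topography, structures)
    description: str

@dataclass
class DynamicData:                # inputs and outputs of the simulation code
    values: List[float] = field(default_factory=list)

class CalibrationInputData(DynamicData): ...    # e.g. boundary-condition hydrographs
class CalibrationOutputData(DynamicData): ...   # e.g. floodmarks observed along the reach
class SimulationOutputData(DynamicData): ...    # computed water-surface profiles

@dataclass
class Model:
    static_data: List[StaticData]
    parameters: Dict[str, float]  # e.g. roughness coefficients adjusted by calibration
    simulation_code: str          # reference to the simulation code used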
Fig. 1. UML class diagram of generic computational knowledge
Fig. 2. Simplified UML class diagram of graphical objects
Fig. 3. UML activity diagram of problem solving in modelling
3 Operative Knowledge Modelling
The second category of knowledge concerns the problem solving strategy of the calibration task. Modellers of physical processes may face two situations [11]: solving a direct problem, which means computing the output of a model, given an input, or solving an inverse problem, which means finding out the model which transforms a given input into a given output. The first situation occurs during productive use of simulation codes. In hydraulics, the second situation amounts to determining the model parameters – since hydraulic modelling follows deterministic laws – and corresponds in fact to a calibration task. Figure 3 shows the relation between these two situations – calibration must of course be performed before using simulation codes for prediction – and emphasises the first-level decomposition of both modelling problems. We determined the problem solving strategy of this task essentially from interviews of experts, since literature on the calibration task is very scarce. We identified six main steps, listed in Fig. 3. We detail them below for a typical case of unsteady flow calibration of a river reach model (see the sketch after this list):
1. dispatch: Available dynamic data are dispatched among calibration input and output data. Boundary conditions compose the core of calibration input data (e.g., a flood discharge hydrograph determined at the upstream cross-section and the corresponding stage hydrograph at the downstream cross-section). Field observations like floodmarks (maximum flood water levels) measured along the reach constitute calibration output data.
2. define: Generally, only one model parameter is defined at first: here, a single riverbed roughness coefficient for the whole reach;
3. initialise: Parameter default value(s) is (are) set up, often thanks to literature tables;
4. simulate: The code is run to produce simulation output data, usually in the form of water-surface profiles for each time step;
5. compare: Simulation output data are compared to calibration output data, e.g., by putting maximum envelopes of water-surface profiles and floodmarks side by side;
6. evaluate: Model error is judged; an unsatisfactory calibration leads to adjusting the coefficient value (back to step 3) or to giving another definition of the model parameters, e.g., to take into account a spatially distributed roughness coefficient (back to step 2).
Prototype for Calibration Support
We intend to rely on a platform for knowledge-based system design that has been developed at Inria and applied to program supervision [12]. Program supervision is a task that emulates expert decisions in the skilled use of programs, such as their choice and organisation, their parameter tuning, or the evaluation of their results. Like calibration, this task involves activities such as initialise parameters, run codes, evaluate results, etc. Both tasks also use very similar concepts, such as the common notion of parameters. We plan to take advantage of these similarities to reuse – with slight modifications – both an existing engine and the attached knowledge description language, named Yakl. The previous sections provide us with a conceptual view of the calibration task, which serves as a basis to determine the required modifications. From this conceptual view it is possible to derive modifications in both the language to describe the task concepts and the inference engine to manipulate them. For instance, while initialise and evaluate are existing reasoning steps in the program supervision task, dispatch is a new one that should be added for the calibration task. In parallel to this conceptual view, we also conducted an experiment to implement a calibration knowledge-based system, by using the program supervision language and engine. This experiment allowed us to refine the specifications for new tools dedicated to calibration task. The platform will support the implementation of these tools by extension of program supervision ones. Two concepts provided by the program supervision task have proven useful for our purpose: operators and criteria. Operators correspond to actions, either simple ones (e.g., run a code) or composite ones, which are combinations of operators (sequence, parallel or alternative decompositions) that solve abstract processing steps. Operators have arguments corresponding to inputs/outputs
Knowledge-Based Hydraulic Model Calibration
Initialisation criteria Rule { name determine base value for gravel comment ”Determination of base value for gravel roughness coefficient” Let obs an observation If obs.bed material == gravel Then nb.lower := 0.028, nb.upper := 0.035 }
681
Choice criteria Rule { name select flood dynamics operator comment ”Selection of operator for flood dynamics modelling” Let s a study If s.phenomena == flood dynamics Then use operator of characteristic unsteady flow }
Fig. 4. Examples of criteria and attached rules in Yakl syntax of actions. Both kinds of operators also include various criteria composed of production rules, capturing the know-how of experts on the way to perform actions. For instance, in program supervision, criteria express how to initialise input argument of programs (initialisation criteria), to evaluate the quality of their results (assessment criteria), to modify parameter values of the current operator (adjustment criteria), or to transmit failure information in case of poor results (repair criteria). Additional criteria are related to composite operators (for choices among sub-operators or optional applications of sub-operators in a sequence). This range of criteria mostly suits our needs. Figure 4 shows examples of rules that belong to our prototype knowledgebased calibration system. The first rule is connected to the initialise step and expresses a conversion from an observation of bed material nature to an appropriate range of numerical values for roughness coefficient. The second rule is related to the simulate step. It helps to select an appropriate sub-operator for flood dynamics modelling, in the line of current research on model selection [13].
5 Conclusion and Perspectives
We have currently achieved an important phase, which was the specification of artificial intelligence tools dedicated to the calibration task, with a focus on its application to hydraulics. Our approach is based on a conceptual view of the task and on a prototype knowledge-based system. A generic platform will support the implementation of these tools by extending program supervision ones. The specification of the new calibration engine, following the problem solving method in Fig. 3, has been completed and the implementation is under way. We reuse the notions of operators and rules from program supervision for the simulate, initialise and compare subtasks of calibration. Checking the resulting system with real data will allow us not only to test calibration results but also to improve the encapsulated knowledge, which should mimic an experienced modeller's reasoning in all situations. We intend to use existing river models (e.g., on the Ardèche river) that have been built up with different simulation codes (Mage, Hec-Ras). We will compare the problem solving processes with and without support: since simulation code use is difficult to grasp, we expect that a knowledge layer should reduce the time spent by end-users without impairing the quality of the results.
The presented approach allows experts to share and to transmit their knowledge and favours an optimal use of simulation codes. The complementarity between numerical and symbolic components makes the resulting system highly flexible and adaptable. Data flow processing is thus automated while taking into account the specificities of each case study. Our approach appears to be a relevant alternative to current automatic calibration methods, since it includes knowledge regarding both physical phenomena and numerical settings, and may thus avoid equifinality pitfalls. As we have already mentioned, this approach, while focusing on the hydraulic domain, is general and can be applied with minimal modifications to any physical process modelling that requires a calibration.
References
[1] Faure, J. B., Farissier, P., Bonnet, G.: A Toolbox for Free-Surface Hydraulics. In: Proceedings of the Fourth International Conference on Hydroinformatics, Cedar Rapids, Iowa, Iowa Institute of Hydraulic Research (2000)
[2] Abbott, M. B., Babovic, V. M., Cunge, J. A.: Towards the Hydraulics of the Hydroinformatics Era. Journal of Hydraulic Research 39 (2001) 339-349
[3] Scholten, H., van Waveren, R. H., Groot, S., van Geer, F. C., Wösten, J. H. M., Koeze, R. D., Noort, J. J.: Good Modelling Practice in Water Management. In: Proceedings of the Fourth International Conference on Hydroinformatics, Cedar Rapids, Iowa, Iowa Institute of Hydraulic Research (2000)
[4] Cunge, J. A., Holly, F. M., Verwey, A.: Practical Aspects of Computational River Hydraulics. Volume 3 of Monographs and Surveys in Water Resources Engineering. Pitman, London, U.K. (1980)
[5] Khatibi, R. H., Williams, J. J. R., Wormleaton, P. R.: Identification Problem of Open-Channel Friction Parameters. Journal of Hydraulic Engineering 123 (1997) 1078-1088
[6] Beven, K. J.: Prophecy, Reality and Uncertainty in Distributed Hydrological Modeling. Advances in Water Resources 16 (1993) 41-51
[7] Chau, K. W.: Calibration of Flow and Water Quality Modeling Using Genetic Algorithm. In McKay, B., Slaney, J. K., eds.: Proceedings of the 15th Australian Joint Conference on Artificial Intelligence. Volume 2557 of Lecture Notes in Artificial Intelligence, Canberra, Australia, Springer (2002) 720
[8] Passone, S., Chung, P. W. H., Nassehi, V.: The Use of a Genetic Algorithm in the Calibration of Estuary Models. In van Harmelen, F., ed.: ECAI 2002 - Proceedings of the Fifteenth European Conference on Artificial Intelligence. Volume 77 of Frontiers in Artificial Intelligence and Applications, Lyon, France, IOS Press (2002) 183-187
[9] Picard, S., Ermine, J. L., Scheurer, B.: Knowledge Management for Large Scientific Software. In: Proceedings of the Second International Conference on the Practical Application of Knowledge Management PAKeM'99, London, UK, The Practical Application Company (1999) 93-114
[10] Amdisen, L. K.: An Architecture for Hydroinformatic Systems Based on Rational Reasoning. Journal of Hydraulic Research 32 (1994) 183-194
[11] Hornung, U.: Mathematical Aspects of Inverse Problems, Model Calibration, and Parameter Identification. The Science of the Total Environment 183 (1996) 17-23
[12] Thonnat, M., Moisan, S., Crubézy, M.: Experience in Integrating Image Processing Programs. In Christensen, H. I., ed.: Proceedings of the International Conference on Computer Vision Systems, ICVS'99. Volume 1542 of Lecture Notes in Computer Science, Las Palmas, Spain, Springer (1999) 200-215
[13] Chau, K. W.: Manipulation of Numerical Coastal Flow and Water Quality Models. Environmental Modelling and Software 18 (2003) 99-108
Using Artificial Neural Networks for Combustion Interferometry
Victor Abrukov, Vitaly Schetinin, and Pavel Deltsov
Chuvash State University, Moskovsky prosp. 15, 428015 Cheboksary, Russia
[email protected]
Abstract. We describe an application of artificial neural networks (ANNs) to combustion interferometry. Using an ANN, we calculate the integral and local characteristics of a flame from an incomplete set of features that characterize its interferometric image. Our method performs these calculations faster than the standard analytical approaches. In prospect, the method can be used in automated systems for the control and diagnostics of combustion processes.
1 Introduction
Interferometry has wide possibilities in combustion research. It allows one to determine simultaneously the local characteristics of a flame, such as the temperature or density field, as well as its integral characteristics, such as the mass of the flame, the Archimedean force acting upon it, the quantity of heat in it, etc. [1-3]. Among the other integral characteristics which can be extracted from interferometric images are the non-stationary mass burning rate and heat release power during ignition, the heat release rate, the force of powder, the change of mechanical impulse of a non-stationary gas flow, the mechanical impulse of an arising flow, the profile of heat release rate in a stationary burning wave, etc. [1, 3]. However, to determine the mentioned characteristics it is first of all necessary to measure many values (a large set of discrete values) of the phase difference distribution function on the interferogram plane, S(x,y). The results often have a subjective nature and depend greatly on the user's experience in interferometric measurement. Besides this, the measurement of many values of S(x,y) is labor-intensive. These circumstances limit the use of interferometry in the quantitative analysis of combustion. We also note that, from the point of view of interferometric requirements, the mentioned characteristics of a flame can be determined only under ideal conditions of the interferometric experiment; under many real experimental conditions it is impossible to measure S(x,y) accurately and to exploit the wide possibilities of interferometry. In any case, after accurately measuring many values of S(x,y), it is necessary either to integrate S(x,y) over the plane of the interferogram (a direct task, dealing with the determination of integral characteristics) or to differentiate S(x,y)
(an inverse task, dealing with the determination of local characteristics). As one can see, the full realization of all the possibilities of interferometry requires solving problems that are not simple and whose solution is not always possible. In this paper we describe methods of applying artificial neural networks (ANNs) to problems of combustion interferometry under real experimental conditions. The main problem addressed in our work was how to determine the integral and local characteristics of a flame without having to measure a complete set of data about the features of the interferometric image, first of all without having to measure many values of S(x,y), i.e. by means of an incomplete set of values of S(x,y) (in the case of the local characteristics of the flame) or even without them (in the case of the integral characteristics of the flame). This problem is very important for combustion research under real experimental conditions as well as for the creation of automated systems for the diagnostics and control of combustion processes.
2 The Artificial Neural Network Method
Using ANNs we can solve problems which have no algorithmic solutions. ANNs are applicable when problems are multivariate and the relationships between variables are unknown. ANN methods can induce realistic models of complex systems from experimental data. In special cases, ANNs are able to induce relations between variables which can be represented in a knowledge-based form [4].
2.1 Integral Characteristics of Flame. Direct Task
The integral characteristics that we consider here are the mass of the flame, m, the Archimedean lifting force acting on the flame, Fa, and the quantity of heat in the flame, H. We use the following geometrical parameters of the interferometric images: the maximal height, h, and width, w, of the image, its area, s, and its perimeter, e. We consider these parameters as an incomplete set of features characterizing the interferometric images because, in the usual case, we would have to measure the complete set of values of S(x,y). In our work we used the feed-forward ANN tool Neural Networks Wizard 1.7 (NNW) developed by BaseGroup Labs [4]. For training the NNW, we used combinations of the above-listed features and integral characteristics. The integral characteristics that we have to calculate are assigned as the target variables. Under these assumptions we aim to train the NNW on the input data to calculate the desired integral characteristics of flames. The number of hidden neurons was 8 and the learning rate was 0.1. Learning stops when the error is less than 0.01. Table 1 presents six sets of the integral characteristics (m, Fa, H) determined by the usual numerical method for six interferometric images of a flame, as well as the six sets of corresponding geometrical parameters of the images (h, w, s, e).
Table 1. Sets of values of the features characterizing the interferometric images (geometrical parameters) and the integral characteristics of the flames

No.   w, cm (10^-2)   h, cm (10^-2)   S, cm2   e, cm   m, g (10^-3)   Fa, dynes   H, J
1           65              65         0.32     1.99       0.08         0.08     0.024
2           88              85         0.62     2.77       0.22         0.22     0.070
3          108             147         1.26     4.08       0.52         0.56     0.175
4          130             222         2.25     5.61       1.25         1.06     0.324
5          142             274         2.98     6.68       1.90         1.42     0.430
6          151             341         3.91     8.11       2.67         1.90     0.575
The geometrical parameters were used as input data and the integral characteristics (m, Fa, H) as output data. Five sets were used for training the ANN, and set No. 4 was used for testing it. We used 12 different combinations of the geometrical parameters for each set of integral characteristics. One of the results of testing is shown in Fig. 1. The horizontal line in Fig. 1 is the target value, and the vertical columns correspond to the calculated outputs of the ANN. Each column corresponds to a unique set of geometrical parameters; for example, the label hse means that the height h, the area s and the perimeter e were used as the input data. This example shows that the ANN can calculate the integral characteristics of the flame successfully. Analyzing this result, we can see that the accuracy of the calculation depends on the combination of features. For example, if the ANN uses a combination of the height and perimeter of the image, the error is minimal; conversely, the error is large if the width and height of the image are combined. A more detailed analysis shows that combinations of features which include the width and area of the image give a large error, whereas a smaller error is achieved if we use combinations including the height and perimeter of the image. So we conclude that the values of the height and perimeter of the image are more essential for the calculation of the integral characteristics of the flame than its width and area. Fig. 2 depicts the significance of the geometrical parameters for the calculation of the integral characteristics. The first column corresponds to the average error of the calculations over the combinations which include the width of the image, the second column to the average error over the combinations which include the height, etc.
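The following is a minimal sketch, in Python with scikit-learn rather than the Neural Networks Wizard 1.7 tool actually used, of the training setup described above: a feed-forward network with 8 hidden neurons and learning rate 0.1 maps the four geometrical parameters of Table 1 to the quantity of heat, is trained on five of the six image sets and is tested on set No. 4. The feature scaling and the max_iter value are assumptions made for illustration.

```python
# Minimal sketch (not the Neural Networks Wizard 1.7 tool used by the authors):
# a small feed-forward network that maps geometrical image features to one
# integral characteristic, trained on five images and tested on image No. 4.
import numpy as np
from sklearn.neural_network import MLPRegressor

# Columns: w (10^-2 cm), h (10^-2 cm), S (cm^2), e (cm) -- from Table 1
X = np.array([
    [ 65,  65, 0.32, 1.99],
    [ 88,  85, 0.62, 2.77],
    [108, 147, 1.26, 4.08],
    [130, 222, 2.25, 5.61],
    [142, 274, 2.98, 6.68],
    [151, 341, 3.91, 8.11],
])
H = np.array([0.024, 0.070, 0.175, 0.324, 0.430, 0.575])   # quantity of heat, J

test = 3                                    # set No. 4 held out, as in the paper
train = [i for i in range(6) if i != test]

# Normalise each feature by its maximum, as done for the model in Eq. (1)
Xn = X / X.max(axis=0)

# 8 hidden neurons and learning rate 0.1, mirroring the settings quoted above
net = MLPRegressor(hidden_layer_sizes=(8,), learning_rate_init=0.1,
                   max_iter=5000, random_state=0)
net.fit(Xn[train], H[train])

pred = net.predict(Xn[[test]])[0]
print(f"predicted H = {pred:.3f} J, target H = {H[test]:.3f} J")
```

With only five training examples the fit is of course only illustrative; the point is the mapping from image geometry to an integral characteristic.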
Fig. 1. Results of the calculation of the quantity of heat for different combinations of features (vertical axis: quantity of heat, J; horizontal axis: the feature combinations e, h, he, w, wh, whe, hs, hse, s, se, whs, whse)
Fig. 2. Errors of calculation for the geometrical parameters and integral characteristics of flames (vertical axis: average error of the calculation, %; horizontal axis: the parameters w, h, s, e; the average errors range from about 4.3% to 8.3%)
Fig. 2 shows that, using the height and perimeter of the image, we can improve the accuracy by a factor of about 1.5 compared with using the area of the image and by a factor of about 2 compared with using its width. Our results show that, in order to calculate the integral characteristics of the flame, we should use the following combinations of geometrical parameters of the image: e, h, he, whe, hs. In some cases we are interested in an analytical representation of the relations induced from experimental data representing the inputs and outputs of real systems. Below we give an analytical model of the flame mass for the ignition of powder by laser radiation:

m = 0.0534 + 1.5017 x1 + 2.3405 x2 - 2.5898 x3 - 0.3031 x4 ,    (1)
where x1, …, x4 are the normalized width, height, area and perimeter of the flame image, respectively; xi = zi/max(zi), where zi is the current value and max(zi) is the maximum value of the variable. The ANN model has no hidden neurons. A standard least-squares error method was used during the creation of the ANN model. The coefficients in Equation 1 can obviously be interpreted as the contributions of the variables x1, …, x4 to the value of m. This equation is an example of an interferometric diagnostics model.
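As a rough illustration of how a linear (no hidden layer) model of the form of Equation 1 can be obtained, the sketch below fits such a model by ordinary least squares to the normalized features of Table 1. The fitted coefficients will not reproduce those of Equation 1, which were obtained on the authors' own laser-ignition data set.

```python
# Sketch of a linear (no hidden layer) model of flame mass, in the spirit of
# Eq. (1): ordinary least squares on the normalised features x1..x4.
import numpy as np

# Features from Table 1 (w, h, S, e) and flame mass m (10^-3 g)
X = np.array([
    [ 65,  65, 0.32, 1.99],
    [ 88,  85, 0.62, 2.77],
    [108, 147, 1.26, 4.08],
    [130, 222, 2.25, 5.61],
    [142, 274, 2.98, 6.68],
    [151, 341, 3.91, 8.11],
])
m = np.array([0.08, 0.22, 0.52, 1.25, 1.90, 2.67])

Xn = X / X.max(axis=0)                        # xi = zi / max(zi)
A = np.column_stack([np.ones(len(Xn)), Xn])   # intercept + x1..x4
coef, *_ = np.linalg.lstsq(A, m, rcond=None)
print("m = {:.4f} + {:.4f}*x1 + {:.4f}*x2 + {:.4f}*x3 + {:.4f}*x4".format(*coef))
```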
2.2 The Local Characteristics of Flame. Optical Inverse Problem
In this section we apply Neural Networks Wizard 1.7 (NNW) to an inverse problem of optics and demonstrate an example of the calculation of a refractive index distribution in a flame. We consider the determination of the values of an integrand on the basis of incomplete values of the integral, and examine the Abel integral equation for the case of cylindrical symmetry. The main problem is how to solve the inverse problem using an incomplete set of features representing the phase difference distribution function in the interferogram plane. We use only values of the integral, which allow us to calculate all the values of the integrand, i.e. the refractive index distribution and the temperature distribution in the flame. To solve this problem we write a dimensionless Abel equation:

S(p) = 2 ∫₀^√(1−p²) (n0 − n(r)) dz
where z² + p² = r², z is the ray path in the object, p is the aim distance (0 < p < 1), and r is a variable radius (see Fig. 3).

Fig. 3. Geometrical interpretation (aim distance p, radius r, ray path z, unit outer radius)
Using this equation, we can calculate the integrals S(p) for different integrands of the form n0 − n(r) = 1 + ar − br², where a and b are given constants. In total we used seven different integrands that reflect real refractive index distributions in flames. The training data were prepared as follows. For various r, the values n0 − n(r) were calculated; then the integral of each of these seven integrands was calculated. For example, for the function n0 − n(r) = 1 + 4r − 5r², we obtain the expression

S(p) = √(1−p²) · (8/3 − (20/3) p²) + 2p² · ln[(1 + √(1−p²)) / (1 − √(1−p²))].
Here the values of S(p) are calculated for different p. The input data for training the NNW were the values of S(p), p and r; the values of the integrand corresponding to each radius are the target values. In total, we used about 700 sets of these parameters. The number of hidden layers was 5, the number of neurons in each was 8, the learning rate was 0.1, and the stopping condition was that the maximum errors during both learning and testing be less than 0.01. The results of testing the NNW for the integrand n0 − n(r) = 1 + 0.5r − 1.5r² are represented in Fig. 4.
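A minimal sketch of this data-generation step is given below: for integrands of the form n0 − n(r) = 1 + ar − br², the forward Abel integral S(p) is evaluated numerically and paired with (p, r, n0 − n(r)) samples to form a training set. The particular (a, b) pairs, grids and quadrature settings are illustrative assumptions.

```python
# Sketch of the training-data generation described above: for integrands of the
# form f(r) = n0 - n(r) = 1 + a*r - b*r**2, compute S(p) = 2 * integral of f(r) dz
# from z = 0 to sqrt(1 - p**2), with r = sqrt(z**2 + p**2), and pair it with
# (p, r, f(r)) samples.
import numpy as np

def abel_forward(f, p, n=2000):
    """Numerical forward Abel integral for a profile f(r) at aim distance p."""
    zmax = np.sqrt(1.0 - p**2)
    z = np.linspace(0.0, zmax, n)
    r = np.sqrt(z**2 + p**2)
    return 2.0 * np.trapz(f(r), z)

profiles = [(4.0, 5.0), (0.5, 1.5), (2.0, 3.0)]   # (a, b) pairs; seven were used in the paper
rows = []
for a, b in profiles:
    f = lambda r, a=a, b=b: 1.0 + a * r - b * r**2
    for p in np.linspace(0.1, 0.9, 9):
        S = abel_forward(f, p)
        for r in np.linspace(0.1, 0.9, 9):
            rows.append((S, p, r, f(r)))          # inputs (S, p, r) -> target f(r)

print(len(rows), "training samples, e.g.", rows[0])
```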
Fig. 4. Relative errors of ANN integrand calculation
The relative errors of the integrand calculated in the interval of p and r between 0.1 and 0.9 were lower than 6%. We have also executed the complete leave-one-out
procedure in the computational experiment. The average errors over all seven tests were in the narrow range from 5% to 6%. A more complete train-test experiment using different initial weights of the neural networks will be carried out in the near future. We also explored the robustness of our calculation to errors in the input data: the calculation error increases by 3% per 2% of noise added to the input data, which means that our calculation is quite stable. In total, the results show that the ANN can solve the inverse problem with acceptable accuracy in the case of cylindrical symmetry. The measurement of a single integral value (and of its changes over time in the case of non-stationary objects) can be automated, so a cheap (microprocessor-based) ANN can be embedded into automated control systems. Further perspectives of this work concern the implementation of neural networks for solving the inverse optical problems of other optical methods and for solving practical problems of the control of combustion processes.
3 Conclusions
The ANN applied to combustion interferometry performs well. Despite an incomplete set of features characterizing the flame images, we can solve both the direct and the inverse optics problems required for the diagnostics of combustion processes. The main advantages of the ANN application are:

1. It will be possible to calculate the distribution of local thermodynamic characteristics, including the density of separate components, by means of measurements at one point of the plane of signal registration, for example in the case of the laser-diode technique [5].
2. The ANN method does not require additional real experiments for solving inverse problems. An ANN model for solving inverse problems can be obtained by means of a knowledge base created using fairly simple numerical calculations, and the same approach can be realized for the direct task (the determination of the integral characteristics of flames and other objects, including industrial objects).
3. The method can be extended to other optical methods for which data are integrated along a line of registration.
4. For diagnostic purposes, the ANN can be applied to the analysis of optical images and signals of flows in various types of engines, in particular in the very promising pulse detonation engine [5].
We therefore conclude that the ANN method allows us to extend significantly the possibilities of optical diagnostics techniques and to improve the control systems of combustion processes as well as of other industrial processes.
References
[1] Abrukov, V.S., Ilyin, S.V., Maltsev, V.M., Andreev, I.V.: Proc. VSJ-SPIE Int. Conference, AB076 (1998). http://www.vsj.or.jp/vsjspie/
[2] Abrukov, V.S., Andreev, I.V., Kocheev, I.G.: J. Chemical Physics 5 (2001) 611 (in Russian)
[3] Abrukov, V.S., Andreev, I.V., Deltsov, P.V.: Optical Methods for Data Processing in Heat and Fluid Flow. Professional Engineering Publishing, London (2002) 247
[4] The BaseGroup Laboratory. http://www.basegroup.ru
[5] Sanders, S.T., Jenkins, T.P., Baldwin, J.A., et al.: Control of Detonation Processes. Moscow, Elex-KM Publisher (2000) 156-159
An Operating and Diagnostic Knowledge-Based System for Wire EDM
Samy Ebeid 1, Raouf Fahmy 2, and Sameh Habib 2
1 Faculty of Engineering, Ain Shams University, 11517 Cairo, Egypt
[email protected]
2 Shoubra Faculty of Engineering, Zagazig University, Egypt
Abstract. The selection of machining parameters and machine settings for wire electrical discharge machining (WEDM) depends mainly on the technologies and experience provided by machine tool manufacturers. The present work designs a knowledge-based system (KBS) to select the optimal process parameter settings and to diagnose the machining conditions for WEDM. Moreover, the present results supply users of WEDM with useful data, avoiding further experimentation. Various materials are tested and sample results are presented for improving WEDM performance.
1 Introduction
Wire electrical discharge machining (WEDM), shown in Fig. 1, is a special form of electrical discharge machining wherein the electrode is a continuously moving conductive wire. The mechanism of material removal involves a complex erosion effect from electric sparks generated by a pulsating direct-current power supply. With WEDM technology, complicated, difficult-to-machine shapes can be cut. The high degree of accuracy obtainable and the fine surface finishes make WEDM valuable [1]. One of the serious problems of WEDM is wire breakage during the machining process. Wire rupture increases the machining time, decreases accuracy and reduces the quality of the machined surface [2,3,4]. WEDM is a thermal process in which the electrodes experience intense local heat in the vicinity of the ionized channel. Whilst a high erosion rate of the workpiece is a requirement, removal of the wire material leads to its rupture. Wire vibration is an important problem and leads to machining inaccuracy [5,6,7]. There is a great demand to improve machining accuracy in WEDM as it is applied to the manufacture of intricate components. The term knowledge engineering has appeared quite recently and is now commonly used to describe the process of building knowledge-based systems [8,9]. The knowledge base is obviously the heart of any KBS. The process of collecting and structuring knowledge in a problem domain is called knowledge acquisition. As the user consults the expert system, answers and results build up in the working memory. In order to make use of the expertise contained in the knowledge base, the
system must possess an inference engine, which can scan facts and rules, providing answers to the queries posed by the user. Shells are the most widely available tools for programming knowledge, and using a shell can speed up the development of a KBS. Snoeys et al. [10] constructed a KBS for the WEDM process. Three main modules were built, namely work preparation, operator assistance and fault diagnosis, and process control. The software was written in the PROLOG language. They concluded that the knowledge-based system improved the performance of the WEDM process. Sadegh Amalink et al. [11] introduced an intelligent knowledge-based system for evaluating wire electro-erosion dissolution machining in a concurrent engineering environment using the NEXPERT shell. The KBS was used to identify the conditions needed to machine a range of design features, including various hole shapes with required surface roughnesses.
Fig. 1. Principle of WEDM
2 Experimental Work
The present experiments have been performed on a high-precision five-axis CNC WED machine commercially known as Robofil 290, manufactured by Charmilles Technologies Corporation. The Robofil 290 allows the operator to choose input parameters according to the material, the height of the workpiece and the wire material from a manual supplied by the manufacturer. The machine uses wire diameters ranging from 0.1 to 0.3 mm; the wire used in the present tests is hard brass of 0.25 mm diameter, with a wire travelling speed ranging between 8 and 12 m/min. To construct the knowledge-based system, the experimental work was planned for four categories of materials. The first is alloy steel 2417 (55 Ni Cr Mo V7), while the second category includes two types of cast iron. The third category is aluminum alloys (Al-Mg-Cu and Al-Si-Mg 6061). Conductive composites form the fourth category, of which seven types are selected (see Table 1).
3 Knowledge-Based System Design
The proposed WEDM knowledge-based system is based on the production-rule method of representing knowledge. The system has been built in the "CLIPS" shell
software. CLIPS is a productive development and delivery expert system tool which provides a complete environment for the construction of rule- and/or object-based expert systems [12]. In a CLIPS program the method of representing knowledge is the (defrule) statement instead of (IF…THEN) statements. A number of such (defrule) rules using the forward-chaining method have been built into the proposed KBS. The general layout of the architecture of the proposed system for the WEDM process is depicted in Fig. 2. Four main modules are designed, namely: machining parameters, WEDM machine settings, problem diagnosis and machine technology database.
Table 1. Types of composite materials (values in %; R = remainder)

No.   Al   Cu   Gr   SiC   SiO2   Al2O3   Zn
1     -    R    10   -     -      -       -
2     R    -    -    -     10     -       -
3     R    -    4    10    -      -       -
4     R    -    4    15    -      -       -
5     R    -    4    30    -      -       -
6     R    -    4    -     -      24      -
7     5    -    -    -     -      -       R
3.1 System Modules
3.1.1 Machining Parameters Module
This module includes the required workpiece information such as the material to be machined, its height and the desired final machining quality. The model is based upon the following:

CS = Vf * H    (1)

MRR = (dw + 2 Sb) * Vf * H    (2)
where CS = cutting speed, mm²/min; dw = wire diameter, mm; Sb = spark gap, mm; Vf = machining feed rate, mm/min; H = workpiece height, mm; MRR = material removal rate, mm³/min.

3.1.2 WEDM Machine Settings Module
The machine settings module includes the required parameters for operating the machine, such as pulse frequency, average machining voltage, injection flushing flow rate, wire speed, wire tension and spark gap. The equations of the above two modules are presented in a general polynomial form given by:

P = a H⁴ + b H³ + c H² + d H + e    (3)
where P denotes any of the KBS outputs, such as cutting speed, MRR, etc. The constants a to e have been calculated according to the type of material, the height of the workpiece and the roughness value.

3.1.3 Problem Diagnosis Module
The problems and faults occurring during the WEDM process are very critical when dealing with intricate and accurate workpieces. Consequently, faults and problems must be resolved quickly in order to minimize wire rupture and dead times. The program suggests the possible faults that cause these problems and recommends to operators the possible actions that should be taken. The correction of these faults might be a simple repair or a change of some machine settings such as feed rate, wire tension, voltage, etc.

3.1.4 Machine Technology Database Module
This module contains a database for recording and organizing the heuristic knowledge of the WEDM process. The machining technology data have been stored in the form of facts. The reasoning mechanism of the proposed KBS proceeds in the same manner as that of the operator in order to determine the machine settings. The required information on workpiece material type, height and required surface roughness is acquired through keyboard input by the operator. On the basis of these data, the KBS searches through the technology database and determines the recommended WED machine settings. The flow chart of the KBS program is shown in Fig. 3. The knowledge in the present work has been acquired from:
1. Published data of WEDM manuals concerning machining conditions for some specific materials such as steel, copper, aluminum, tungsten carbide and fine graphite.
2. Experimental tests: the system has been built according to the number of workpiece materials = 12 (refer to the section on experimental work), the number of surface roughness grades = 2 (ranging between 2.2 and 4 µm, depending on each specific material) and the number of workpiece heights for each material = 5 (namely 10, 30, 50, 70 and 90 mm).
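The sketch below illustrates, under assumed (not measured) cutting-speed values, how the constants a to e of Eq. (3) could be fitted for one material and roughness grade from the five tested workpiece heights, and how Eqs. (1) and (2) then yield the feed rate and material removal rate. The spark-gap value sb used here is a placeholder.

```python
# Sketch of the settings model of Eq. (3): fit P = a*H^4 + b*H^3 + c*H^2 + d*H + e
# for one material/roughness grade from the five tested heights, then evaluate
# CS and MRR with Eqs. (1)-(2).  The cutting-speed values below are illustrative
# placeholders, not data from the authors' technology database.
import numpy as np

H  = np.array([10.0, 30.0, 50.0, 70.0, 90.0])     # workpiece heights, mm
CS = np.array([35.0, 55.0, 68.0, 66.0, 62.0])     # cutting speed, mm^2/min (hypothetical)

a, b, c, d, e = np.polyfit(H, CS, 4)              # constants of Eq. (3)

def cutting_speed(h):
    return np.polyval([a, b, c, d, e], h)

def feed_rate(h):                                 # Vf from Eq. (1): CS = Vf * H
    return cutting_speed(h) / h

def mrr(h, dw=0.25, sb=0.03):                     # Eq. (2); spark gap sb assumed
    return (dw + 2.0 * sb) * feed_rate(h) * h

h = 40.0
print(f"H = {h} mm: CS = {cutting_speed(h):.1f} mm^2/min, MRR = {mrr(h):.1f} mm^3/min")
```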
Fig. 2. Architecture of KBS.
Fig. 3. Flow chart of KBS
4 Results and Discussion
Figure 4 shows a sample result, taken from the above twelve materials, for the variation of cutting speed with workpiece height for two grades of surface roughness, namely Ra = 3.2 and Ra = 2.2 µm, for alloy steel 2417. The results show that the cutting speed increases until it reaches its maximum value at a thickness of 50 mm and then decreases slightly again for both grades of roughness. The results also indicate that the values of cutting speed for Ra = 3.2 µm are higher than those for Ra = 2.2 µm. Nevertheless, the work of Levy and Maggi [13] did not show any dependence of roughness on wire feed rate. Figs. 5 and 6 show the variation of material removal rate with cutting speed for alloy steel 2417 and Al 6061, respectively, for two grades of roughness. Both charts indicate a linear increase in MRR with cutting speed. Fig. 5 shows that the optimal zone for machining steel 2417 lies between roughly 40 and 70 mm²/min (Ra = 3.2 µm) and between 20 and 40 mm²/min (Ra = 2.2 µm) for workpiece thicknesses ranging from 10 to 90 mm, whereas for Al 6061 this zone shows higher values, between 120 and 180 mm²/min (Ra = 4 µm) and between 50 and 90 mm²/min (Ra = 2.2 µm), for thickness values from 10 to 60 mm.
The variation of spark gap with cutting speed for both alloy steel 2417 and Al 6061 is shown in Fig. 7 for Ra = 3.2 and 4 µm, respectively. The spark gap widens as the cutting speed increases. This result is in accordance with the work of Luo [14] for cutting speeds up to 200 mm²/min. It was observed during the tests that the variation of the spark gap with cutting speed at Ra = 2.2 µm for both steel 2417 and Al 6061 was nearly negligible.
Fig. 4. Variation of cutting speed with workpiece thickness
Fig. 5. Variation of MRR with cutting speed for steel 2417
Fig. 6. Variation of MRR with cutting speed for Al 6061
Fig. 7. Variation of spark gap with cutting speed
5 Conclusions
The present results of the KBS supply users of WEDM with useful data, avoiding further experimentation. The designed system enables operation and diagnosis for the WEDM process. The system proves to be reliable and powerful, as it allows fast retrieval of information and easy modification or appending of data. The sample results for alloy steel 2417 and Al 6061, out of the twelve tested materials, are presented in the form of charts to aid WEDM users in improving the performance of the process.
References
[1] Lindberg, R.A.: Processes and Materials of Manufacture. Allyn and Bacon, Boston (1990) 796-802
[2] Rajurkar, K.P., Wang, W.M., McGeough, J.A.: WEDM identification and adaptive control for variable height components. CIRP Ann. of Manuf. Tech. 43, 1 (1994) 199-202
[3] Liao, Y.S., Chu, Y.Y., Yan, M.T.: Study of wire breaking process and monitoring of WEDM. Int. J. Mach. Tools and Manuf. 37, 4 (1997) 555-567
[4] Yan, M.T., Liao, Y.S.: Monitoring and self-learning fuzzy control for wire rupture prevention in wire electrical discharge machining. Int. J. of Mach. Tools and Manuf. 36, 3 (1996) 339-353
[5] Dauw, D.F., Beltrami, I.: High-precision wire-EDM by online wire positioning control. CIRP Ann. of Manuf. Tech. 43, 1 (1994) 193-196
[6] Beltrami, I., Bertholds, A., Dauw, D.: A simplified post process for wire cut EDM. J. of Mat. Proc. Tech. 58, 4 (1996) 385-389
[7] Mohri, N., Yamada, H., Furutani, K., Narikiyo, T., Magara, T.: System identification of wire electrical discharge machining. CIRP Ann. of Manuf. Tech. 47, 1 (1998) 173-176
[8] Smith, P.: An Introduction to Knowledge Engineering. Int. Thomson Computer Press, London, UK (1996)
[9] Dym, C.L., Levitt, R.E.: Knowledge-Based Systems in Engineering. McGraw-Hill, Inc., New York, USA (1991)
[10] Snoeys, R., Dekeyser, W., Tricarico, C.: Knowledge based system for wire EDM. CIRP Ann. of Manuf. Tech. 37, 1 (1988) 197-202
[11] Sadegh Amalink, M., El-Hofy, H.A., McGeough, J.A.: An intelligent knowledge-based system for wire-electro-erosion dissolution in a concurrent engineering environment. J. of Mat. Proc. Tech. 79 (1998) 155-162
[12] CLIPS user guide. www.ghgcorp.com/clips (1999)
[13] Levy, G.N., Maggi, F.: WED machinability comparison of different steel grades. Ann. CIRP 39, 1 (1990) 183-185
[14] Luo, Y.F.: An energy-distribution strategy in fast-cutting wire EDM. J. of Mat. Proc. Tech. 55, 3-4 (1995) 380-390
The Application of Fuzzy Reasoning System in Monitoring EDM
Xiao Zhiyun and Ling Shih-Fu
Department of Manufacturing and Production Engineering, Nanyang Technological University, Singapore 639798
Abstract. EDM (electrical discharge machining) is a very complicated and stochastic process. It is difficult to monitor its working conditions effectively because adequate knowledge of the discharge mechanism is lacking. This paper proposes a new method to monitor this process. In this method, the electrical impedance between the electrode and the workpiece is taken as the monitoring signal. By analyzing this signal and using a fuzzy reasoning system as the classifier, sparks and arcs are differentiated effectively, which is difficult with other conventional monitoring methods. The proposed method first partitions the collected voltage and current trains into separate pulses using the continuous wavelet transform, then applies the Hilbert transform to calculate the electrical impedance of each pulse, extracts features from this signal to form a feature vector, and finally uses a fuzzy logic reasoning system to classify the pulses as sparks or arcs.
1 Introduction
The discharge pulses of EDM are commonly classified into five kinds: open, spark, arc, short-circuit and off [1]. Open, short-circuit and off stages can be distinguished easily thanks to their distinctive characteristics, but it is very difficult to distinguish sparks and arcs effectively because they are quite similar in many respects. Traditionally, the signals used to monitor EDM include the magnitude of the voltage and current, the high-frequency components of the discharge, the delay time of the discharge pulses and the radio frequency during discharge. From the literature, although these signals have been widely used, the results are not good enough, especially in differentiating sparks and arcs [2-3]. Electrical impedance is an inherent characteristic of an electrical system that does not change with the input signals, and its change reflects changes in the electrical system. Though EDM is a complicated process, it can still be regarded approximately as an electrical system because the discharge is the result of the applied voltage. As the discharge is so complicated and we lack sufficient knowledge of the discharge mechanism, it is quite difficult to find an accurate mathematical model for sparks and arcs. Fuzzy reasoning systems have advantages in processing fuzzy, uncertain information and have many successful applications in engineering. This paper
proposes a new monitoring method for EDM: by deriving a new monitoring signal, extracting features from it, and using a fuzzy reasoning system, the pulses are classified into sparks and arcs correctly. In Section 2, the wavelet transform and the partition of the discharge process are studied. In Section 3, the definition and calculation of the electrical impedance signal are given. In Section 4, the feature extraction from the impedance and voltage signals is introduced. In Section 5, a Takagi-Sugeno fuzzy reasoning system is developed to differentiate sparks and arcs. In the last section, conclusions are drawn.
2 Signal Segmentation
The signal segmentation method we developed partitions the voltage and current of the whole discharge process into time slices representing separate discharge pulses. Theoretically, between every two pulses there is an off stage, during which there is no voltage and no current. During this stage the dielectric fluid is de-ionized and prepared for the next discharge. We can regard this off stage as the edge of a discharge pulse, a kind of singularity, which commonly appears both in one-dimensional signals and in two-dimensional images. Until recently, the Fourier transform was the main mathematical tool for analyzing singularities. The Fourier transform is global and provides a description of the overall regularity of signals, but it is not well adapted to finding the location and the spatial distribution of singularities. The wavelet transform is a remarkable mathematical tool for analyzing singularities, including edges: by decomposing signals into elementary building blocks that are well localized both in space and in frequency, the wavelet transform can characterize the local regularity of signals and, further, detect them effectively. Significant studies related to this topic have been done by Mallat, Hwang and Zhong [4][5]. The continuous wavelet transform of a function is defined as
W_f(a, b) = a^(−1/2) ∫ f(t) h((t − a)/b) dt    (1)
Here, f(t) is the function to which the wavelet transform is applied, a is the dilation, b is the translation, and h((t − a)/b) is the basic wavelet function.
Wavelet analysis provides a windowing technique with variable-sized regions, in contrast to other methods of signal analysis. It employs longer time intervals where more precise low-frequency information is required and shorter regions for high-frequency information. This allows wavelet analysis to capture aspects like trends, breakdown points, discontinuities in higher derivatives and self-similarity that are missed by other signal analysis techniques. Singularities carry high-frequency information, so using the low-scale coefficients of the CWT we can locate the singularities correctly. Calculation of the wavelet transform is a key process in applications. A fast algorithm for computing W_s f(x) for detecting edges and reconstructing the signal can
be found in [6] when ψ(x) is a dyadic wavelet as defined in that article. However, in edge detection the reconstruction of signals is not required, so the choice of the wavelet function is not restricted to the conditions presented in [6]. Many wavelets other than dyadic ones can be utilized; in fact, almost all general integral wavelets suit this particular application. Figure 1 shows the voltage and current signals from the experiment. The experimental settings are: open voltage 80 V, peak current 1.0 A, on-time 2.4 µs, off-time 0.8 µs, and a sampling frequency of 10 MHz. Figure 2 shows the segmentation result. The second part of this figure shows the CWT coefficients at scale 5. It shows that we can extract pulse 1 to pulse 6 from the original signal accurately.
Fig. 1. Original voltage and current
Fig. 2. Segmentation result
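A minimal sketch of the segmentation idea is given below. Instead of a particular wavelet toolbox, a Mexican-hat wavelet at a single low scale is applied as a direct convolution to a synthetic gap-voltage train; the large-magnitude coefficients mark the edges around the off stages, from which pulse boundaries are taken. The synthetic waveform, the wavelet choice and the thresholds are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the segmentation step: a low-scale CWT (Mexican-hat wavelet,
# implemented directly as a convolution) marks the sharp edges around the
# off stages of a synthetic gap-voltage train.
import numpy as np

def mexican_hat(scale, length):
    t = np.arange(-length // 2, length // 2 + 1) / scale
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def cwt_single_scale(x, scale):
    w = mexican_hat(scale, 10 * scale)
    return np.convolve(x, w, mode="same") / np.sqrt(scale)

# Synthetic voltage train: three pulses of 80 V separated by off stages
rng = np.random.default_rng(0)
n = 3000
v = np.zeros(n)
for start, width in [(200, 600), (1000, 500), (1800, 700)]:
    v[start:start + width] = 80.0
v += rng.normal(0.0, 1.0, n)                 # measurement noise

coeffs = cwt_single_scale(v, scale=5)        # scale 5, as used for Fig. 2
edges = np.where(np.abs(coeffs) > 0.5 * np.abs(coeffs).max())[0]

# Group neighbouring edge samples into pulse boundaries
boundaries = [edges[0]]
for i in edges[1:]:
    if i - boundaries[-1] > 50:
        boundaries.append(i)
print("detected pulse boundaries near samples:", boundaries)
```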
3 Electrical Impedance of Discharge
Electrical impedance is a very important property of an electrical system. It is widely used to study stationary processes and usually appears as a frequency-dependent variable. For an electrical system, the impedance is usually calculated as voltage over current in the frequency domain. The ratio of the voltage and current amplitudes gives the absolute value of the impedance at the given frequency, and the phase shift between the current and the voltage gives its phase. The electrical impedance of an electrical process is defined as:

Z = V / I    (2)
The difficulty in calculating the impedance in the time domain is that the sensed voltage and current are AC signals and cannot be divided directly. To avoid this problem, we first convert the sensed voltage and current from their real representations into the corresponding complex representations; the converted signals are called analytic signals. The electrical impedance can then be calculated by dividing the analytic voltage by the analytic current. The electrical impedance can be expressed in complex form, with a real part and an imaginary part; the real part represents the resistance and the imaginary part represents the capacitance and inductance.
Z(t) = R(t) + jX(t)    (3)
It can also be expressed in the magnitude and phase representation:
Z(t) = V(t)/I(t) = |Z(t)| e^(jφ(t))    (4)

|Z(t)| = √(R(t)² + X(t)²),    φ(t) = arctan(X(t)/R(t))    (5)
The commonly used transform is the FFT, and the result is the electrical impedance at a given frequency. In order to obtain the electrical impedance in the time domain and study its variation with time, we use here the Hilbert transform to convert the measured voltage and current into their corresponding analytic signals and then calculate the electrical impedance. The Hilbert transform of a signal x(t) and its inverse are defined as
x̂(t) = H[x(s)] = (1/π) ∫₋∞^+∞ x(s)/(t − s) ds    (6)

x(t) = H⁻¹[x̂(s)] = −(1/π) ∫₋∞^+∞ x̂(s)/(t − s) ds    (7)
This shows that the Hilbert transform is defined using the kernel Φ(t, s) = 1/[π(s − t)] and the conjugate kernel ψ(t, s) = 1/[π(t − s)]; that is, the kernels differ only by sign. Here the variable s is a time variable. As a result, the Hilbert transform of a function of time is another function of time with a different shape. Accordingly, the analytic signals corresponding to v(t) and i(t) are, respectively,

V(t) = v(t) + j v̂(t) = a(t) exp(jωt + φ1)    (8)

I(t) = i(t) + j î(t) = b(t) exp(jωt + φ2)    (9)
In the above two equations, v(t) and i(t) are the real signals, and v̂(t) and î(t) are the Hilbert transforms of v(t) and i(t). After the transformation into complex form, the following calculation becomes possible for evaluating the electrical impedance:
Z = (|V(t)| / |I(t)|) e^(j(φ1 − φ2))    (10)
Figure 5 shows the calculated electrical impedance of each pulse using the Hilbert transform. In the next step, we extract features from these signals, which will be used as the inputs of the fuzzy classifier.
Fig. 5. Magnitude of electrical impedance
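The computation of Eqs. (6)-(10) can be sketched as follows using scipy.signal.hilbert, which returns the analytic signal x(t) + j x̂(t) directly; the single-pulse voltage and current used here are synthetic placeholders, not measured EDM data.

```python
# Sketch of Eqs. (6)-(10): scipy.signal.hilbert returns the analytic signal,
# so the time-domain impedance of a pulse is the ratio of the analytic
# voltage and current.  The pulse below is synthetic.
import numpy as np
from scipy.signal import hilbert

fs = 10e6
t = np.arange(0, 5e-6, 1.0 / fs)                      # one 5 us window
v = 25.0 * np.sin(2 * np.pi * 2e5 * t) + 30.0         # gap voltage (illustrative)
i = 0.2 * np.sin(2 * np.pi * 2e5 * t - 0.4) + 0.4     # gap current (illustrative)

V = hilbert(v)                                        # analytic voltage, Eq. (8)
I = hilbert(i)                                        # analytic current, Eq. (9)
Z = V / I                                             # complex impedance, Eq. (10)

print("average |Z| over the pulse: %.1f ohm" % np.mean(np.abs(Z)))
print("average phase: %.2f rad" % np.mean(np.angle(Z)))
```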
4 Feature Extraction
In this step we extract features from the electrical impedance and voltage signals and determine the features that are decisive for classification. We define a feature to be any property of the impedance or voltage signal within a pulse that is useful for describing the normal or abnormal behavior of this pulse. Based on electrical discharge machining and signal processing knowledge, we have found the following features to be useful in determining whether a pulse is a spark or an arc:

Voltage average: as the breakdown voltage is lower and the delay time is shorter in an arc than in a spark, the average voltage of an arc should be smaller than that of a spark.

Current average: because the dielectric is not fully de-ionized or the debris between the electrode and the workpiece is not flushed away effectively, the average current of an arc should be larger than that of a spark.

Pulse duration: as arcs are easy to start, the shorter the duration, the more likely the pulse is an arc.

Average impedance: this is the most important feature in differentiating sparks and arcs, because an arc is believed to occur at the same spot as the previous pulse, where the dielectric between the electrode and the workpiece is not fully de-ionized.

Delay time: extracted from the voltage signal, which can be obtained easily when partitioning the voltage signal using the low-scale coefficients. Commonly, a spark is believed to have a delay time before the discharge starts, while an arc has no observable delay time.
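A sketch of how such a feature vector might be assembled for one segmented pulse is shown below; the threshold used to detect the onset of current flow (and hence the delay time) is an assumption for illustration.

```python
# Sketch of the feature vector for one segmented pulse: average voltage,
# average current, average impedance magnitude, pulse duration and delay time.
# The "current flowing" threshold i_on is an illustrative assumption.
import numpy as np
from scipy.signal import hilbert

def pulse_features(v, i, fs, i_on=0.05):
    """v, i: voltage/current samples of one pulse; fs: sampling frequency."""
    V, I = hilbert(v), hilbert(i)
    on = i > i_on                                   # samples with current flowing
    z_mag = np.abs(V[on] / I[on]) if on.any() else np.array([np.inf])
    first = np.argmax(on) if on.any() else len(i)   # first sample with current
    return {
        "avg_V": float(np.mean(v)),
        "avg_I": float(np.mean(i)),
        "avg_Z": float(np.mean(z_mag)),
        "duration_us": len(v) / fs * 1e6,
        "delay_us": first / fs * 1e6,
    }

# Tiny synthetic example: 1 us open-voltage delay, then discharge
fs = 10e6
t = np.arange(0, 4e-6, 1.0 / fs)
v = np.where(t < 1e-6, 80.0, 30.0)
i = np.where(t < 1e-6, 0.0, 0.35)
print(pulse_features(v, i, fs))
```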
Table 1 gives the values of these features for pulse 1 to pulse 6. In the next section we study the use of a fuzzy reasoning system, taking all the features as inputs, to differentiate sparks and arcs.

Table 1. Features extracted from pulse 1 to pulse 6

                    P1       P2       P3       P4       P5       P6
Average V (V)     61.448   74.098   24.733   28.256   22.027   25.638
Average I (A)      0.181    0.157    0.350    0.287    0.290    0.291
Average Im (Ω)   409.06   467.09    61.31    91.65    58.92    47.11
Duration (µs)     11.8      4.2      3.4      4.2      4.2      4.2
Delay time (µs)    7.2      3.2      0.6      0.9      0.8      1.6
5 Fuzzy Reasoning System
At present, fuzzy systems and neural networks are widely and successfully used in monitoring and controlling EDM [8-9]. The theory of fuzzy logic is aimed at the development of a set of concepts and techniques for dealing with sources of uncertainty, imprecision and incompleteness. The nature of fuzzy rules and the relationship between fuzzy sets of different shapes provide a powerful capability for incrementally modeling a system whose complexity makes traditional expert systems, mathematical models and statistical approaches very difficult. The most challenging problem in differentiating sparks and arcs is that many characteristics of these two kinds of pulses are similar, and the knowledge about how to differentiate them is incomplete and vague due to the complexity of the discharge phenomena. This uncertainty leads us to seek a solution using a fuzzy logic reasoning method. Fuzzy reasoning is performed within the context of a fuzzy system model, which consists of control and solution variables, fuzzy sets, proposition statements, and the underlying control mechanisms that tie all these together into a cohesive reasoning environment. The fuzzy rules can be completely characterized by a set of control variables, X = {x1, x2, …, xn}, and solution variables, y1, y2, …, yk. In our application we have five control variables, corresponding to the extracted features, and one solution variable representing the discharge state, i.e. a value of 1 indicates a normal discharge while a value of 0 indicates an abnormal discharge. Each control variable xi is associated with a set of fuzzy terms Σi = {α1, α2, …, αpi}, and the solution
variable has its own fuzzy terms. Each fuzzy variable is associated with a set of fuzzy membership functions corresponding to the fuzzy terms of the variable. A fuzzy membership function of a control variable can be interpreted as a control surface that responds to a set of expected data points. The fuzzy membership functions associated with a fuzzy variable can be collectively defined by a set of critical parameters that uniquely describe the characteristics of the membership function, and the characteristic of an inference engine is largely affected by these critical parameters.
The Takagi-Sugeno fuzzy reasoning system was first introduced in 1985 [10]. It is similar to the Mamdani method in many respects; in fact the first two parts of the fuzzy inference process, fuzzifying the inputs and applying the fuzzy operator, are exactly the same. The main difference between Mamdani-type and Sugeno-type fuzzy inference is that the output membership functions are only linear or constant for the Sugeno type. A typical fuzzy rule in a zero-order Sugeno fuzzy model has the form: if x is A and y is B, then z = k, where A and B are fuzzy sets in the antecedent, while k is a crisply defined constant in the consequent. When the output of each rule is a constant like this, the similarity with Mamdani's method is striking. The only distinctions are that all output membership functions are singleton spikes, and that the implication and aggregation methods are fixed and cannot be edited: the implication method is simply multiplication, and the aggregation operator just includes all of the singletons. In our application, as the fuzzy reasoning system has only one output, namely whether a pulse is a spark or an arc, it is more convenient to use a Sugeno-type reasoning system than the conventionally used Mamdani system. The fuzzy reasoning system we construct is shown in Figure 6. It has five inputs, corresponding to the five features we obtained; every input has three fuzzy terms: low, medium and high. There is only one output, the discharge state, which has two values: zero corresponding to arc and one corresponding to spark. Table 2 gives the fuzzy reasoning results. We can see that pulse 1 and pulse 2 were classified as sparks, and pulse 3 to pulse 6 were classified as arcs.
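The sketch below shows a zero-order Sugeno system in the spirit of the one described above: triangular low/medium/high membership functions over normalized features, product implication, and a weighted average of the rule constants. The membership parameters and the two rules shown are illustrative assumptions, not the authors' actual rule base.

```python
# Sketch of a zero-order Takagi-Sugeno classifier in the spirit of the system
# described above.  Membership parameters and the two rules are illustrative.
def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    return max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))

# low / medium / high terms on features normalised to [0, 1]
terms = {
    "low":    (-0.5, 0.0, 0.5),
    "medium": ( 0.0, 0.5, 1.0),
    "high":   ( 0.5, 1.0, 1.5),
}

def mu(term, x):
    return tri(x, *terms[term])

def classify(f):
    """f: dict of normalised features -> discharge state in [0, 1] (1 = spark)."""
    rules = [
        # if avg_Z is high and delay is high then spark (z = 1)
        (mu("high", f["avg_Z"]) * mu("high", f["delay"]), 1.0),
        # if avg_Z is low and avg_I is high then arc (z = 0)
        (mu("low", f["avg_Z"]) * mu("high", f["avg_I"]), 0.0),
    ]
    w = sum(r[0] for r in rules)
    return sum(r[0] * r[1] for r in rules) / w if w > 0 else 0.5

# Pulse 1 of Table 1, normalised by the maxima over the six pulses
p1 = {"avg_Z": 409.06 / 467.09, "delay": 7.2 / 7.2, "avg_I": 0.181 / 0.350}
print("pulse 1 ->", "spark" if classify(p1) >= 0.5 else "arc")
```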
Fig. 6. Fuzzy reasoning system (inputs: average voltage, average current, average impedance, delay time, pulse duration; output: discharge state, 0 = arc, 1 = spark)

Table 2. Fuzzy reasoning results for pulse 1 to pulse 6
Pulse               1          2          3         4         5         6
Reasoning result    1 (spark)  1 (spark)  0 (arc)   0 (arc)   0 (arc)   0 (arc)
6 Conclusions
This paper has proposed a new monitoring method for the EDM process. By analyzing the electrical impedance signal, extracting features from it, and using a fuzzy reasoning system as the classifier, the method can differentiate sparks and arcs effectively. The proposed method has the following advantages: (1) it can detect the off stages easily when segmenting the voltage and current signals; (2) it
can process each pulse individually, so the monitoring result is precise and can be quantified, rather than being only a varying trend as when frequency is taken as the monitoring signal; (3) it can effectively differentiate sparks and arcs; (4) the monitoring signal is an inherent characteristic of the EDM system, so the monitoring result is more credible; (5) the method is easier to implement, and the cost of the monitoring system is lower because only the voltage and current across the discharge gap need to be collected.
References
[1] Summary specifications of pulse analyzers for spark-erosion machines. CIRP scientific technical committee E (1979)
[2] Weck, M., Dehmer, J.M.: Analysis and adaptive control of EDM sinking process using the ignition delay time and fall time as parameter. Annals of the CIRP 41, 1 (1992)
[3] Ahmed, M.S.: Radio frequency based adaptive control for electrical discharge texturing. EDM Digest, Sept./Oct. (1987) 8-10
[4] Mallat, S., Hwang, W.L.: Singularity detection and processing with wavelets. IEEE Trans. on Information Theory 38 (1992) 617-643
[5] Mallat, S., Zhong, S.: Characterization of signals from multiscale edges. IEEE Trans. on Pattern Analysis and Machine Intelligence 14, 7 (1992) 710-732
[6] Mallat, S.: Multiresolution approximations and wavelet orthonormal bases of L2(R). Trans. Amer. Math. Soc. 315 (1989) 69-87
[7] Hahn, S.L.: Hilbert Transforms in Signal Processing. Artech House (1996)
[8] Tarng, Y.S., Tseng, C.M., Chung, L.K.: A fuzzy pulse discriminating system for electrical discharge machining. International Journal of Machine Tools and Manufacture 37, 4 (1997) 511-522
[9] Kao, J.Y., Tarng, Y.S.: A neural network approach for the on-line monitoring of the electrical discharge machining process. Journal of Materials Processing Technology 69 (1997) 112-119
[10] Sugeno, M.: Industrial Applications of Fuzzy Control. Elsevier Science Pub. Co. (1985)
Knowledge Representation for Structure and Function of Electronic Circuits
Takushi Tanaka
Department of Computer Science and Engineering, Fukuoka Institute of Technology, 3-30-1 Wajiro-Higashi, Higashi-ku, Fukuoka 811-0295, Japan
[email protected] Abstract. Electronic circuits are designed as a hierarchical structure of functional blocks. Each functional block is decomposed into sub-functional blocks until the individual parts are reached. In order to formalize knowledge of these circuit structures and functions, we have developed a type of logic grammar called Extended-DCSG. The logic grammar not only defines the syntactic structures of circuits but also defines relationships between structures and their functions. The logic grammar, when converted into a logic program, parses circuits and derives their functions.
1 Introduction
Electronic circuits are designed as a hierarchical structure of functional blocks. Each functional block consists of sub-functional blocks, and each sub-functional block is in turn decomposed into sub-sub-functional blocks until the individual parts are reached. In other words, each functional block, even a resistor, has a special goal within its containing functional block. Each circuit as a final product can itself also be viewed as a functional block designed for a special goal of its users. As these hierarchical structures are analogous to the syntactic structures of language, we developed a grammatical method for knowledge representation of electronic circuits [1]. In that study, each circuit was viewed as a sentence and its elements as words. The structures of functional blocks were defined by a logic grammar called DCSG (Definite Clause Set Grammar) [2]. DCSG is a DCG [3]-like logic grammar developed for analyzing word-order free languages. A set of grammar rules, when converted into Prolog clauses, forms a logic program which executes top-down parsing. Thus knowledge of circuit structures was successfully represented by DCSG, but knowledge of circuit functions could not be represented. In this study, we extend DCSG by introducing semantic terms into grammar rules. The semantic terms define relationships between syntactic structures and their meanings. By regarding functions as the meaning of structures, we can represent the relationships between the functions and structures of electronic circuits.
2 DCSG

2.1 Word-Order Free Language
A word-order free language L(G') is defined by modifying the definition of a formal grammar [2]. We define a context-free word-order free grammar G' to be a quadruple <VN, VT, P, S> where: VN is a finite set of non-terminal symbols, VT is a finite set of terminal symbols, P is a finite set of grammar rules of the form

A −→ B1, B2, ..., Bn.  (n ≥ 1),  A ∈ VN, Bi ∈ VN ∪ VT (i = 1, ..., n),

and S (∈ VN) is the starting symbol. The above grammar rule means rewriting a symbol A not with the string of symbols "B1, B2, ..., Bn", but with the set of symbols {B1, B2, ..., Bn}. A sentence in the language L(G') is a set of terminal symbols which is derived from S by successive application of grammar rules. Here the sentence is a multi-set which admits multiple occurrences of elements taken from VT. Each non-terminal symbol used to derive a sentence can be viewed as a name given to a subset of the multi-set.
2.2 DCSG Conversion
The general form of the conversion procedure from a grammar rule

A −→ B1, B2, ..., Bn.    (1)

to a Prolog clause is:

subset(A, S0, Sn) :−
    subset(B1, S0, S1),
    subset(B2, S1, S2),
    ...,
    subset(Bn, Sn−1, Sn).    (1)'
Here, all symbols in the grammar rule are assumed to be non-terminal symbols. If "[Bi]" (1 ≤ i ≤ n) is found in the right-hand side of a grammar rule, where "Bi" is assumed to be a terminal symbol, then "member(Bi, Si−1, Si)" is used instead of "subset(Bi, Si−1, Si)" in the conversion. The arguments S0, S1, ..., Sn in (1)' are multi-sets of VT, represented as lists of elements. The predicate "subset" is used to refer to a subset of an object set which is given as the second argument, while the first argument is the name of its subset. The third argument is a complementary set, which is the remainder of the second argument less the first; e.g. "subset(A, S0, Sn)" states that "A" is a subset of S0 and that Sn is the remainder. The predicate "member" is defined by the following Prolog clauses:

member(M, [M|X], X).
member(M, [A|X], [A|Y]) :− member(M, X, Y).    (2)
The predicate “member” has three arguments. The first is an element of a set. The second is the whole set. The third is the complementary set of the first.
3 Extended-DCSG
In order to define relationships between syntactic structures and their meaning, we introduce semantic terms into grammar rules. The semantic terms can be placed on both sides of a grammar rule. Unlike in DCSG, we do not distinguish terminal symbols from non-terminal symbols in grammar rules, so that we can treat the meaning of words in the same way as the meaning of structures.
3.1 Semantic Term in Left-Hand Side
The semantic terms are placed in curly brackets as:

A, {F1, F2, ..., Fm} −→ B1, B2, ..., Bn.    (3)

This grammar rule can be read as: the symbol A with meaning {F1, F2, ..., Fm} consists of the syntactic structure B1, B2, ..., Bn. This rule is converted into a Prolog clause in Extended-DCSG as:

ss(A, S0, Sn, E0, [F1, F2, ..., Fm|En]) :−
    ss(B1, S0, S1, E0, E1),
    ss(B2, S1, S2, E1, E2),
    ...,
    ss(Bn, Sn−1, Sn, En−1, En).    (3)'

As the conversion is different from that of DCSG, we use the predicate "ss" instead of "subset". When the rule is used in parsing, the goal ss(A, S0, Sn, E0, E) is executed, where the variable S0 is bound to an object set and the variable E0 to an empty set. The subsets "B1, B2, ..., Bn" are successively identified in the object set S0. After all of these subsets are identified, the remainder of these subsets (the complementary set) is substituted into Sn. Meanwhile, the semantic information of B1 is added to E0 and substituted into E1, the semantic information of B2 is added to E1 and substituted into E2, ..., and the semantic information of Bn is added to En−1 and substituted into En. Finally, the semantic information {F1, F2, ..., Fm}, which is the meaning associated with symbol A, is added, and the whole semantic information is substituted into E.
3.2 Semantic Term in Right-Hand Side
Semantic terms in the right-hand side define semantic conditions for the grammar rule. For example, the following rule (4) is converted into the Prolog clause (4)':

A −→ B1, {C1, C2}, B2.    (4)

ss(A, S0, S2, E0, E2) :−
    ss(B1, S0, S1, E0, E1),
    member(C1, E1, _),
    member(C2, E1, _),
    ss(B2, S1, S2, E1, E2).    (4)'
When the clause (4)’ is used in parsing, conditions C1 , C2 are tested whether the semantic information E1 fills these conditions after identifying the symbol B1 . If it succeeds, the parsing process goes on to identify the symbol B2 .
3.3 Terminal Symbol
Both terminal and non-terminal symbols are converted using the predicate "ss". The only difference is that terminal symbols are defined by rules which have no right-hand side to rewrite. The terminal symbol A with meaning {F1, F2, ..., Fm} is written as (5).

A, {F1, F2, ..., Fm}.    (5)

This rule is converted into the following clause (5)' in Extended-DCSG:

ss(A, S0, S1, E0, [F1, F2, ..., Fm|E0]) :- member(A, S0, S1).    (5)'
That is, when the rule (5) is used in parsing, the terminal symbol A is searched for in the object set S0. If it is found, the complementary set is returned in S1, and the semantic term {F1, F2, ..., Fm} associated with A is added to the current semantic information E0 to form the fifth argument of "ss". Thus, the meaning of words can be treated in the same manner as that of syntactic structures.
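As a rough, language-neutral illustration of how the converted "ss" clauses thread the object set and the accumulated semantic information, the following Python sketch mimics clauses (3)', (4)' and (5)' for a toy grammar. The rule table, symbol names and facts are invented placeholders, not part of the circuit grammar defined later.

```python
# Toy rule table (hypothetical): non-terminal -> (facts added by the LHS semantic
# term, right-hand side); RHS items are sub-symbols or ("cond", fact) checks.
RULES = {
    "vbeReg": (["controls_voltage"], ["dtr", ("cond", "conductive"), "res"]),
}
# Terminal symbols with their semantic terms, cf. clause (5)'.
TERMINALS = {"dtr": ["conductive"], "res": ["ohmic"]}

def ss(symbol, objects, sem):
    """Yield (remaining objects, extended semantics) for every way `symbol`
    can be identified in the multi-set `objects` (mimics ss/5)."""
    if symbol in TERMINALS:                                # clause (5)'
        for i, obj in enumerate(objects):
            if obj == symbol:
                yield objects[:i] + objects[i + 1:], sem + TERMINALS[symbol]
        return
    facts, rhs = RULES[symbol]                             # clause (3)'
    def expand(items, objs, s):
        if not items:
            yield objs, s + facts
            return
        head, *tail = items
        if isinstance(head, tuple) and head[0] == "cond":  # clause (4)'
            if head[1] in s:
                yield from expand(tail, objs, s)
            return
        for objs2, s2 in ss(head, objs, s):
            yield from expand(tail, objs2, s2)
    yield from expand(rhs, objects, sem)

# Example: list(ss("vbeReg", ["res", "dtr"], []))
#   -> [([], ["conductive", "ohmic", "controls_voltage"])]
```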
4 Knowledge Representation

4.1 Representation of Circuits
In the previous study [1], the circuit in Figure 1 is represented as the free word-order sentence (6). The compound term "resistor(r1, 2, 10)" is a word of the sentence. It denotes a resistor named r1 connecting node 2 and node 10. The word "npnTr(q1, 3, 5, 6)" denotes an NPN transistor named q1 with the base connected to node 3, the emitter to node 5, and the collector to node 6.

[ resistor(r1, 2, 10), resistor(r2, 9, 1), npnTr(q1, 3, 5, 6), npnTr(q2, 4, 5, 7), npnTr(q3, 10, 1, 5), npnTr(q4, 10, 1, 10), npnTr(q5, 10, 1, 8), npnTr(q6, 8, 9, 2), ... , terminal(t4, 9), terminal(t5, 1) ]    (6)

4.2 Grammar Rules
The rule (7) defines an NPN transistor in the active state as a terminal symbol. Its semantic term defines the relationships between voltages and currents in that state. Each compound term, such as "gt(voltage(C, E), vst)", is a logical sentence about circuit functions. Here, "gt(voltage(C, E), vst)" states that "the voltage between C and E is greater than the collector saturation voltage". Similar rules are defined for the other states of the transistor.

npnTR(Q, B, E, C),
{ state(Q, active), gt(voltage(C, E), vst), equ(voltage(B, E), vbe), gt(current(B, Q), 0), gt(current(C, Q), 0), cause(voltage(B, E), current(B, Q)), cause(current(B, Q), current(C, Q)) }.    (7)
Fig. 1. Operational Amplifier cd42

The rule (8), which was originally introduced to refer to a resistor as a non-polar element [1], defines the causalities of voltage and current on the resistor.

res(R, A, B),
{ cause(voltage(A, B), current(A, R)), cause(current(A, R), voltage(A, B)) }
−→ ( resistor(R, A, B); resistor(R, B, A) ).
(8)
The rule (9) defines the simple voltage regulator in Figure 2. The semantic term on the left-hand side expresses the function of this structure: the circuit named vreg(D, R) controls the voltage between Out and Com. The right-hand side consists of the syntactic structures of diode D and resistor R, and a semantic term which specifies an electrical condition on diode D.

vbeReg(vreg(D, R), Vp, Com, Out),
{ control(vreg(D, R), voltage(Out, Com)) }
−→ dtr(D, Out, Com), { state(D, conductive) }, res(R, Vp, Out).    (9)

Fig. 2. Voltage regulator
Fig. 3. Current Sink
The rule (10) is defined for the current sink in Figure 3. The semantic term in the left-hand side expresses the function of this structure. The right-hand side states the conditions of this circuit. The disjunction ";" specifies either a structural condition or a semantic condition; this becomes useful for analyzing context-dependent circuits [1]. The next line specifies the connections of transistor Q, and requires that transistor Q be in the active state.

cSink(sink(VR, Q), In, Com),
{ control(sink(VR, Q), current(In, Q)) }
−→ ( vbeReg(VR, _, Com, B); { control(VR, voltage(B, Com)) } ),
npnTr(Q, B, Com, In), { state(Q, active) }.    (10)

The rule (11) defines the active load (current mirror) in Figure 4. Functional blocks such as the common-emitter, the emitter-follower, and the differential amplifier are also defined as grammar rules.

activeLoad(al(D, Q), Ref, Vp, Ld),
{ control(al(D, Q), current(Q, Ld)), cause(current(al(D, Q), Ref), current(Q, Ld)), equ(current(Q, Ld), current(al(D, Q), Ref)) }
−→ dtr(D, Vp, Ref), { state(D, conductive) }, pnpTr(Q, Ref, Vp, Ld), { state(Q, active) }.    (11)

Fig. 4. Active Load
5 Deriving Circuit Functions
The grammar rules in the previous section are converted into Prolog clauses. Using these clauses, the goal (12) parses the circuit in Figure 1. The identified circuit is bound to the variable X. The first argument of the compound term is a name given to the circuit, and represents the syntactic structure as shown in Figure 5. The circuit functions, which consist of more than 100 logical sentences, are bound to the variable Y as the meaning of the circuit structure.

?- cd42(Circuit), ss(X, Circuit, [ ], [ ], Y).
(12)
X = opAmp(opAmp(sdAmp(ecup(q1, q2), al(pdtr(q8), q7), sink(vreg(dtr(q4), r1), q3)), pnpCE(q9, sink(vreg(dtr(q4), r1), q5)), npnEF(q6, r2)), 3, 4, 9, 2, 1)    (13)

Y = [ input(opAmp(...), voltage(3, 4)), output(opAmp(...), voltage(9, 1)), cause(voltage(3, 4), voltage(9, 1)), enable(sdAmp(...), amplify(opAmp(...), differential_inputs)), enable(pnpCE(...), high(voltage_gain(opAmp(...)))), enable(npnEF(...), low(output_impedance(opAmp(...)))), input(npnEF(...), voltage(8, 1)), output(npnEF(...), voltage(9, 1)), cause(voltage(8, 1), voltage(9, 1)), high(input_impedance(npnEF(...))), low(output_impedance(npnEF(...))), equ(voltage_gain(npnEF(...)), 1), state(q6, active), ... ]    (14)
6 Conclusion
We have introduced semantic terms into grammar rules to extend DCSG. The semantic terms define relationships between syntactic structures and their meanings. We have also changed the notation for terminal symbols. Unlike DCSG, terminal symbols are not distinguished from non-terminal symbols, but are defined by grammar rules lacking a right-hand side. This enables us to treat the meaning of words in the same way as the meaning of structures. That is, Extended-DCSG defines not only the surface structure of a language, but also the meanings hidden beneath the surface.
Fig. 5. Parse tree of cd42
We regarded circuit functions as the meaning of circuit structures, and extended circuit grammars to define the relationships between structure and function. The grammar rules were translated into Prolog clauses, which derived structures and functions from the given circuits. We now have a method of encoding knowledge about electronic circuits. If we encode more knowledge, more circuits can be parsed. Currently, our system outputs circuit structures and functions as logical sentences. In order to improve readability for the user, we are developing a natural language interface. Circuit simulators such as SPICE [5] derive voltages and currents from given circuits. Such a simulator replaces measurements on experimental circuits with computation. Our system, in contrast, derives structures and functions from given circuits, acting as a simulator of understanding. It helps the user to understand how circuits work in design, analysis, and trouble-shooting. These two different kinds of system will work in a complementary way.
References
[1] Tanaka, T.: Parsing Circuit Topology in a Logic Grammar. IEEE Trans. Knowledge and Data Eng., Vol. 5, No. 2, pp. 225-239, 1993.
[2] Tanaka, T.: Definite Clause Set Grammars: A Formalism for Problem Solving. J. Logic Programming, Vol. 10, pp. 1-17, 1991.
[3] Pereira, F. C. N. and Warren, D. H. D.: Definite Clause Grammars for Language Analysis. Artificial Intell., Vol. 13, pp. 231-278, 1980.
[4] Tanaka, T. and Bartenstein, O.: DCSG-Converters in Yacc/Lex and Prolog. Proc. 12th International Conference on Applications of Prolog, pp. 44-49, 1999.
[5] Tuinenga, P. W.: SPICE - A Guide to Circuit Simulation & Analysis Using PSpice. Prentice Hall (1988).
A Web-Based System for Transformer Design J.G. Breslin and W.G. Hurley Power Electronics Research Centre, National University of Ireland Galway, Ireland {john.breslin,ger.hurley}@nuigalway.ie http://perc.nuigalway.ie/
Abstract. Despite the recent use of computer software to aid in the design of power supply components such as transformers and inductors, there has been little work done on investigating the usefulness of a web-based environment for the design of these magnetic components. Such an environment would offer many advantages, including the potential to share and view previous designs easily along with platform/OS independence. This paper presents a web-based transformer design system whereby users can create new optimised transformer designs and collaborate or comment on previous designs through a shared information space.
1 Introduction
Despite the recent use of computer software to aid in the design of magnetic components [3], to date there has been little work done on investigating the usefulness of a web-based environment for magnetic component design. Such an environment would offer many advantages, including the potential to share and view previous designs easily along with platform and operating system independence. This paper proposes a web-based prototype for magnetic component design. The system is based on an existing transformer design methodology [1], and is implemented using a web programming language, PHP. Magnetic component material information and designs created by students and instructors are stored in a MySQL database on the system server. It will therefore be possible for users of this system to collaborate and comment on previous designs through a shared information space. We will begin with a summary of existing methods for transformer design, followed by details of the system design, and finally an overview of the web-based implementation.
2 Related Research and Other Systems

2.1 Transformer Design Methodologies
The basic area product methodology [4] often results in designs that are not optimal in terms of either losses or size since the method is orientated towards low frequency
transformers. A revised arbitrary waveform methodology was proposed [1] that allows designs at both low and high frequencies, and is suitable for integration with high frequency winding resistance minimisation techniques [2]. The selection of the core is optimised in this methodology to minimise both the core and winding losses. The design process can take different paths depending on whether a flux density constraint is violated or not. If the flux density exceeds a set limit (the saturation value for a particular core material), it is reset to be equal to this limit and one path is used to calculate the core size; otherwise the initial flux density is used and a different path is taken to find the core size.

2.2 Previous Windows-Based Packages
Many magnetic design companies have used computer spreadsheets to satisfy their own design needs and requirements. They thus tend to be solely linked to a company and its products, and remain unpublished as their content is only of interest to their direct users and competitors. Some of the limitations of these spreadsheets include: difficulties in incorporating previous designer expertise due to limited decision or case statements; non-conformity with professional database formats used by manufacturers; problems with implementing most optimisation routines due to slow iterative capabilities and a reduced set of mathematical functions; basic spreadsheet input mechanisms that lack the features possible with a customised GUI. A computer aided design package has previously been developed [3] for the Windows environment based on the arbitrary waveform methodology [1]. This package allows the design of transformers by both novice and expert users for multiple application types through “try-it-and-see” design iterations. As well as incorporated design knowledge and high frequency winding optimisation, the system allows access to customisable or pre-stored transformer geometries which were usually only available by consulting catalogues. Other systems such as [6] provide more detailed information on winding layouts and SPICE models, but lack the high frequency proximity and skin effects details of [3]. Another Windows-based package has recently been released [7] which allows the comparison of designs based on different parameters. Some companies have also advertised web-based selection of magnetic components [8]; however these tend to be online spreadsheets and therefore suffer from the problems previously mentioned.
3 System Design
All of the information and associated programs of the system reside on a web server, and any user can access the system using a web browser through the appropriate URL. The inputs to the system are in the form of specifications such as desired voltage, current, frequency, etc. The output from the system is an optimised design for the specifications given. Fig. 1 shows the steps taken in creating and managing a design. The system is comprised of the following elements: a relational database management system (RDBMS); a web-based graphical user interface (GUI); optimisation
techniques; a repository of knowledge; and a shared information space. We will now describe these in some more detail.

Fig. 1. Flow chart of design steps
3.1 Relational Database Management System
In transformer design, huge amounts of core and winding data have to be managed for effective exchange of information between various design stages. The magnetic designer should not get lost in the data management process as their concentration on the real design problem may be affected. A RDBMS can be used to save designers from having to search through books with manufacturer's data or from manipulating
data themselves with lower level external programs. All transformer core and winding data is accessible using the sophisticated database storage and retrieval capabilities of a relational database engine incorporated into the design application. The database contains the following tables of data: designs (where each saved design is identified by a unique design_id); designers (users of the system); areas (consisting of parent areas which contain child areas for related sets of designs); comments (each relating to a particular design_id); cores and windings (either read-only items added by an administrator or modifiable items created by a particular designer); and tables for core and winding materials, shapes, manufacturers and types.

3.2 Web-Based Graphical User Interface
In our system, we require over 250 HTML objects and form controls for interaction with the user; these include text boxes for both inputs and calculated outputs, labels describing each text box, radio buttons and checkboxes for selection of discrete or Boolean variables, option group menus, graphs of waveforms, etc. Proper categorisation and presentation of data in stages is our solution to the problem of organising this data in a meaningful way, whereby images identify links to the distinct steps in the design process, and only information related to a particular step is shown at any given time. Some of the main GUI features incorporated in the system are: numbered and boxed areas for entry of related data in "sub-steps"; input and output boxes colour-coded according to whether data entry is complete, required, or just not permitted; scrollbars for viewing large tables of data in small areas; and pop-up message windows for recommendations and errors.

3.3 Optimisation Techniques
The performance level of an engineering design is a very important criterion in evaluating the design. Optimisation techniques based on mathematical routines provide the magnetic designer with robust analytical tools, which help them in their quest for a better design. The merits of a design are judged on the basis of a criterion called the measure of merit function (or the figure of merit if only a single measure exists). Methods for optimising AC resistance for foil windings [2] and total transformer loss [1] are implemented in the web-based system; these variables are our measures of merit.

3.4 Repository of Knowledge
A “repository of knowledge” is incorporated into our system, to allow a program design problem to be supplemented by rules of thumb and other designer experience. In the early design stages, the designer generates the functional requirements of the transformer to be designed, and the expertise of previous users can play a very important role at this stage. For example, on entry of an incompatible combination of transformer specifications, the designer will be notified by a message informing them that a design error is imminent. The system will also suggest recommended “expert” values for certain
variables. Although such a system is useful for novices, it can also be used by experts who may already know of certain recommended values and who want to save time setting them up in the first place.

3.5 Shared Information Space
The system allows collaboration between users working on a design through a shared information space, with features similar to those of a discussion forum. Designs are filed in folders, where each folder may be accessed by a restricted set of users. To accommodate this, user and group permissions are managed through an administration panel. Access to design folders is controlled by specifying either the users or the groups that have permission to view and add designs to that folder. Users can comment on each design, and can also send private messages to other users.
4 Implementation and Testing
A popular combination for creating data-driven web sites is the PHP language with a MySQL database, and this was chosen for the implementation of the system. PHP is a server-side scripting language that allows code to be embedded within an HTML page. The web server processes the PHP script and returns regular HTML back to the browser. MySQL is a RDBMS that can be run as a small, compact database server and is ideal for small to medium-sized applications. MySQL also supports standard ANSI SQL statements. The PHP / MySQL combination is cross-platform; this allowed the development of the system in Windows while the server runs on a stable BSD Unix platform. A typical PHP / MySQL interaction in the system is as follows. After initial calculations based on the design specifications to find the optimum core size (as mentioned in section 2.1 and detailed in [1]), a suitable core geometry is obtained from the cores database table using the statement:

    $suitable_core_array = mysql_query("SELECT * FROM cores
        WHERE (core_ap >= $optimum_ap
           AND corematerial_id = $chosen_corematerial_id
           AND coretype_id = $chosen_coretype_id)
        ORDER BY core_ap LIMIT 1");

where the names of calculated and user-entered values are prefixed by the dollar symbol ($), and fields in the database table have no prefix symbol. Fig. 2 shows the user interface, with each of the design steps clearly marked at the top and the current step highlighted (i.e. "Specifications"). An area at the bottom of the screen is available for designer comments. The underlying methodologies have previously been tested by both the authors [1] and external institutions [5]. Design examples carried out using the system produce identical results to those calculated manually. The system is being tested as a computer-aided instruction tool for an undergraduate engineering class.
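The same selection step can be expressed in any RDBMS client. The following Python/sqlite3 sketch is not part of the system described here; the connection details and parameter values are hypothetical, but the table and column names are taken from the statement above.

```python
# Illustrative sketch only: parameterised version of the core-selection query.
import sqlite3

def select_core(conn, optimum_ap, corematerial_id, coretype_id):
    """Return the smallest core whose area product satisfies the optimum value."""
    cur = conn.execute(
        "SELECT * FROM cores "
        "WHERE core_ap >= ? AND corematerial_id = ? AND coretype_id = ? "
        "ORDER BY core_ap LIMIT 1",
        (optimum_ap, corematerial_id, coretype_id),
    )
    return cur.fetchone()

# Example (assuming a populated database file named macode.db):
# conn = sqlite3.connect("macode.db")
# core = select_core(conn, optimum_ap=0.35, corematerial_id=3, coretype_id=1)
```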
Fig. 2. Screenshot of web-based system
5 Conclusions and Future Work
With the current trend towards miniaturisation in power converters, the magnetics designer should now expect accurate computer aided techniques that will allow the design of any magnetic component while incorporating existing techniques in the area of web-based collaboration. This paper presented a web-based transformer design system based on a previous methodology. This system is an improvement on previous automated systems because: previous designer expertise and optimisation routines are incorporated into the design method; database integration avoids the need for consultation of catalogues; and a user-friendly interface, with advanced input mechanisms, allows for collaboration among users where designs can be shared and analysed. This system can be further developed for more transformer applications, and with a revised methodology the system could also be updated to incorporate inductors and integrated planar magnetics.
References
[1] Hurley, W.G., Wölfle, W., Breslin, J.G.: Optimized Transformer Design: Inclusive of High Frequency Effects. IEEE Trans. on Power Electronics, Vol. 13, No. 4 (1998) 651-659
[2] Hurley, W.G., Gath, E., Breslin, J.G.: Optimizing the AC Resistance of Multilayer Transformer Windings with Arbitrary Current Waveforms. IEEE Trans. on Power Electronics, Vol. 15, No. 2 (2000) 369-376
[3] Hurley, W.G., Breslin, J.G.: Computer Aided High Frequency Transformer Design Using an Optimized Methodology. IEEE Proc. of COMPEL 2000 (2000) 277-280
[4] McLyman, W.T.: Transformer and Inductor Design Handbook. Marcel Dekker, New York (1978)
[5] Lavers, J.D., Bolborici, V.: Loss Comparison in the Design of High Frequency Inductors and Transformers. IEEE Transactions on Magnetics, Vol. 35, No. 5 (1999) 3541-3543
[6] Intusoft Magnetics Designer: http://www.intusoft.com/mag.htm
[7] Ansoft PExprt: http://www.ansoft.com/products/em/pexprt/
[8] Cooper Electronic Technologies Press Release for Coiltronics Versa Pac: Web-Based Selection for Transformers and Inductors. http://www.electronicstalk.com/news/hnt/hnt105.html and http://www.cooperet.com/products_magnetics.asp
A Fuzzy Control System for a Small Gasoline Engine S.H. Lee, R.J. Howlett, and S.D. Walters Intelligent Systems & Signal Processing Laboratories Engineering Research Centre, University of Brighton Moulsecoomb, Brighton, BN2 4GJ, UK {S.H.Lee,R.J.Howlett,S.D.Walters}@Brighton.ac.uk
Abstract. Small spark-ignition gasoline-fuelled internal-combustion engines can be found all over the world performing in a variety of roles including power generation, agricultural applications and motive power for small boats. To attain low cost, these engines are typically air-cooled, use simple carburettors to regulate the fuel supply and employ magneto ignition systems. Electronic control, of the sort found in automotive engines, has seldom proved cost-effective for use with small engines. The future trend towards engines that have low levels of polluting exhaust emissions will make electronic control necessary, even for small engines where previously this has not been economic. This paper describes a fuzzy control system applied to a small engine to achieve regulation of the fuel injection system. The system determines the amount of fuel required from a fuzzy algorithm that uses the engine speed and manifold air pressure as input values. A major advantage of the fuzzy control technique is that it is relatively simple to set up and optimise compared to the labour-intensive process necessary when the conventional "mapped" engine control method is used. Experimental results are presented to show that a considerable improvement in fuel regulation was achieved compared to the original carburettor-based engine configuration, with associated improvements in emissions. It is also demonstrated that the system produces improved output power and torque curves compared to those achieved when the original mechanical fuel regulation system was used.

1 Introduction
Electronic control of the air-fuel ratio (AFR) and ignition timing of a spark ignition (SI) engine is an effective way to achieve improved combustion efficiency and performance, as well as the reduction of exhaust emissions. The AFR essentially sets the operating point of the engine, and in conjunction with the ignition timing angle, determines the output power and the resulting levels of emissions. In an engine with electronic control, the amount of fuel that is supplied to the engine is controlled by an engine control unit (ECU). This is a micro-processor based
system that controls the frequency and width of the control pulse supplied to the fuel injector. The AFR is important in the combustion and calibration processes. If there is too much fuel, not all of it is burnt, causing high fuel consumption and increased emissions of HC and CO. Too little fuel can result in overheating and engine damage such as burnt exhaust valves. Conventional ECUs use three-dimensional mappings (3-D maps), in the form of look-up tables, to represent the non-linear behaviour of the engine in real-time [1]. A modern vehicle ECU can contain up to 50 or more of these maps to realise complex functions. In addition the engine will be equipped with a wide range of sensors to gather input data for the control system. A major disadvantage of the look-up table representation is the time taken to determine the values it should contain for optimal engine operation; a process known as calibration of the ECU. These 3-D maps are typically manually calibrated, or tuned, by a skilled technician using an engine dynamometer to obtain desired levels of power, emissions and efficiency. The calibration process is an iterative one that requires many cycles of engine measurements and is very time consuming. Techniques that reduce the time and effort required for the calibration process are of considerable interest to engine manufacturers. This is especially the case where the engine is a small capacity non-automotive engine. These engines are particularly price sensitive and any additional cost, including the cost of extended calibration procedures, is likely to make the engine un-economic to manufacture. For similar economic reasons, any control strategy intended for application to a small engine has to be achievable using only a small number of low-cost sensors. This paper makes a useful contribution to research in this area by proposing a technique for the rapid calibration of a small-engine ECU, requiring only sensors for speed and manifold air pressure.
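For orientation only: each plane of such a 3-D map is, in effect, a two-dimensional calibration table interpolated between measured break-points. The sketch below shows a generic bilinear look-up in Python; the break-points and pulse-width values are invented placeholders, not data from any real ECU.

```python
import bisect

def lookup(table, xs, ys, x, y):
    """Bilinear interpolation in a calibration map: table[i][j] holds the output
    at break-points xs[i], ys[j]."""
    i = max(1, min(bisect.bisect_left(xs, x), len(xs) - 1))
    j = max(1, min(bisect.bisect_left(ys, y), len(ys) - 1))
    tx = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    ty = (y - ys[j - 1]) / (ys[j] - ys[j - 1])
    low = table[i - 1][j - 1] * (1 - ty) + table[i - 1][j] * ty
    high = table[i][j - 1] * (1 - ty) + table[i][j] * ty
    return low * (1 - tx) + high * tx

# Hypothetical example: speed (RPM) and load (%) break-points with a fuel pulse
# width (ms) stored at each grid point.
speeds = [1000, 2000, 3000]
loads = [0, 50, 100]
fpw_map = [[2.0, 2.5, 3.0],
           [2.5, 3.0, 3.5],
           [3.0, 3.5, 4.0]]
# lookup(fpw_map, speeds, loads, 2300, 40) -> interpolated pulse width
```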
2 Fuzzy Control
Fuzzy logic is a ‘soft computing' technique, which mimics the ability of the human mind to learn and make rational decisions in an uncertain and imprecise environment [2]. Fuzzy control has the potential to decrease the time and effort required in the calibration of engine control systems by easily and conveniently replacing the 3-D maps used in conventional ECUs. Fuzzy logic provides a practicable way to understand and manually influence the mapping behaviour. In general, a fuzzy system contains three main components, the fuzzification, the rule base and the defuzzification. The fuzzification is used to transform the so-called crisp values of the input variables into fuzzy membership values. Afterwards, these membership values are processed within the rule base using conditional ‘if-then' statements. The outputs of the rules are summed and defuzzified into a crisp analogue output value. The effects of variations in the parameters of a Fuzzy Control System (FCS) can be readily understood and this facilitates optimisation of the system. The system inputs, which in this case are the engine speed and the throttle angle, are called linguistic variables, whereas ‘large' and ‘very large' are linguistic values which are characterised by the membership function. Following the evaluation of the rules, the defuzzification transforms the fuzzy membership values into a crisp output
value, for example, the fuel pulse width. The complexity of a fuzzy logic system with a fixed input-output structure is determined by the number of membership functions used for the fuzzification and defuzzification and by the number of inference levels. The advantage of fuzzy methods over conventional 3-D mappings in engine control applications is the relatively small number of parameters needed to describe the equivalent 3-D map using a fuzzy logic representation. The time needed to tune an FCS can therefore be significantly less than that required to achieve the equivalent level of 3-D map look-up control.
3 The Fuzzy Control System

3.1 Feedforward Control
The aim of the control strategy was to govern the value of the AFR in the engine, keeping it at a desired optimal value and minimising the influence of changes in speed and load. Figure 1 shows the block diagram of the test system. Engine load was estimated indirectly by measurement of the inlet manifold air pressure (MAP). The parameters of the fuzzy control system and the contents of its rule-base were determined during test-rig trials and embedded in the control unit as a system reference. The details of the creation of such a system for this experiment are explained in the next section of the paper. The minor drawback of this feedforward control is the lack of feedback information; factors such as wear and spark plug deterioration will detract from the optimum fuel injection quantity in what is still effectively an open-loop system. Feedback control of AFR is often provided in automotive engines, but this is seldom economic on small engines. A suitable model was created to predict throttle position by using the MAP and the engine rotation speed. The feedforward fuzzy control scheme was used in order to reduce deviations in the lambda-value, λ, where λ is an alternative way of expressing the AFR (λ = 1.0 for an AFR of approximately 14.7:1, the value for complete combustion of gasoline). The scheme also has the benefit of reducing the sensitivity of the system to disturbances which enter the system outside the control loop. This fuzzy model offers the possibility of identifying a single multi-input single-output non-linear model covering a range of operating points [3].
Fig. 1. Block diagram for feedforward and fuzzy logic control scheme
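As a quick numeric illustration of the lambda convention just quoted (the stoichiometric AFR of roughly 14.7:1 for gasoline is taken from the text above):

```python
STOICH_AFR = 14.7  # approximate stoichiometric air-fuel ratio for gasoline

def lambda_value(afr):
    """Normalised air-fuel ratio: lambda = 1.0 at stoichiometry."""
    return afr / STOICH_AFR

# lambda_value(14.7) -> 1.0      (complete combustion)
# lambda_value(12.5) -> ~0.85    (rich mixture)
```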
3.2 Experimental Arrangement
The experimental fuzzy control algorithm was implemented using a test facility based on a Bosch Suffolk single-cylinder engine having a capacity of 98 cc. The engine parameters are summarised in Table 1. The engine had a single camshaft and side-valve arrangement, and was capable of generating manufacturer-listed peak power and torque outputs of 1.11 kW at 3000 revolutions per minute (RPM) and 3.74 Nm at 2100 RPM respectively. Load was applied to the engine via a DC dynamometer with a four-quadrant speed controller. A strain gauge load cell system was incorporated and a frequency-to-voltage converter was used to provide speed information. The dynamometer was capable of either motoring or absorbing power from the engine. A PC-based data acquisition system utilising an Advantech PCL818HD analogue-to-digital converter (ADC) card was used. Various sensors were provided to measure the engine operating parameters: speed, torque, manifold vacuum pressure, temperatures, AFR, etc. The ignition system used was the standard fitment magneto. A modification was made to the air-induction system in order to accommodate a fuel injector as well as the original carburettor. Thus, the engine could be conveniently switched so as to use the carburettor or the fuel injection system. The fuel injector electronic system consisted of a programmable counter/interval timer (Intel 82C54) which generates a pulse of the required length, feeding an automotive specification Darlington-configuration power transistor, driving the fuel injector solenoid. The fuel pulse width (FPW) governed the quantity of fuel injected into the engine.

Table 1. Basic engine specification

Bore (mm): 57.15
Stroke (mm): 38.1
Compression ratio: 5.2 : 1
Capacity: 98 cc
Valve arrangement: Sidevalve
Carburettor: Fixed jet
Ignition system: Flywheel magneto

3.3 Engine Load Estimation
In a spark-ignition engine the induction manifold pressure varies with engine speed and throttle opening according to a non-linear mapping. Figure 2 shows the 3-D relationship between these operating parameters for the Bosch Suffolk engine. By measuring these two variables, the engine load/throttle position can be determined. A conventional look-up table can be used, although in the case of this work fuzzy logic was used to represent the non-linear relationship between these variables. An optical sensor was used for speed measurement, and a low-cost pressure sensor was applied to measure the MAP. These formed the major control inputs to the fuzzy control loop.
Fig. 2. Variation of MAP with speed and throttle opening
3.4 Fuzzy Control Algorithm
The fuzzy control system was devised using a Fuzzy Development Environment (FDE) which was the outcome of a linked piece of work. The FDE is an MS Windows-based application that consists of a Fuzzy Set Editor and Fuzzy Rule Editor. Fuzzy sets, membership functions and rule sets for this project were all created, and modified where required, using the FDE. Parameters derived from the FDE, specific to the particular set-up devised, were transferred to an execution module, known as the Fuzzy Inference Kernel (FIK). The FIK was a module programmed in C++ code. To make it possible to embed the FIK directly into an ECU, the code was compiled to .obj format, and incorporated into the rest of the control code by the linker.
Fig. 3. Air-fuel ratio fuzzy control loop
The fuzzy control loop illustrated in Figure 3 was implemented in order to optimise the AFR. To determine the effectiveness of the control loop, the AFR was monitored using a commercial instrument, known as an Horiba Lambda Checker. The engine speed was determined by an optical sensor while the MAP was measured by a pressure sensor located in the intake manifold. These instruments sampled individual parameters and through the medium of signal conditioning circuitry provided analogue output voltage levels proportional to their magnitude. These were converted to digital form and the crisp digital signals were then applied to a fuzzy algorithm implemented in the C programming language on a PC. The crisp output from the algorithm was the width of the pulse applied to the fuel injector (the FPW).
Fig. 4. Fuzzy input set – engine speed
The fuzzy sets shown in Figures 4 and 5 were used in the fuzzy controller. The engine speed fuzzy set used three trapezoidal membership functions for the classes Low, Medium and High. The MAP fuzzy set consisted of four trapezoidal membership functions for the classes Very Low, Low, High and Very High. Experimental adjustment of the limits of the membership classes enabled the response of the control kernel to be tailored to the physical characteristics of the engine.
Fig. 5. Fuzzy input set – vacuum pressure
The contents of the rule-base underwent experimental refinement as part of the calibration process. The final set of rules contained in the rule-base is shown in Figure 6.
Fig. 6. The fuzzy rule base
The fuzzified values for the outputs of the rules were classified into membership sets similar to the input values. An output membership function of output singletons, illustrated in Figure 7, was used. This was defuzzified to a crisp value of FPW.
Fig. 7. Fuzzy output set – FPW (ms)
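To make the inference scheme concrete, the sketch below reproduces its structure in Python: trapezoidal input sets for speed and MAP, a rule table, and singleton defuzzification to a fuel pulse width. All break-points, rule outputs and units are invented placeholders; the actual sets and rules used in the experiments are those of Figures 4 to 7.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: rises over (a, b), flat on [b, c], falls over (c, d)."""
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0

# Placeholder fuzzy sets (the real limits were tuned on the engine test rig).
SPEED = {                      # engine speed in RPM
    "low":    (0, 0, 1800, 2200),
    "medium": (1800, 2200, 2400, 2600),
    "high":   (2400, 2600, 4000, 4000),
}
MAP = {                        # manifold vacuum pressure, arbitrary units
    "very_low":  (0, 0, 10, 20),
    "low":       (10, 20, 35, 45),
    "high":      (35, 45, 60, 70),
    "very_high": (60, 70, 100, 100),
}
# Placeholder rule table: (speed class, MAP class) -> FPW output singleton (ms).
RULES = {
    ("low", "very_low"): 2.0, ("low", "low"): 2.5,
    ("low", "high"): 3.0, ("low", "very_high"): 3.5,
    ("medium", "very_low"): 2.5, ("medium", "low"): 3.0,
    ("medium", "high"): 3.5, ("medium", "very_high"): 4.0,
    ("high", "very_low"): 3.0, ("high", "low"): 3.5,
    ("high", "high"): 4.0, ("high", "very_high"): 4.5,
}

def fuel_pulse_width(speed_rpm, map_value):
    """Min-inference over all rules, weighted-average defuzzification of singletons."""
    num = den = 0.0
    for (s_cls, m_cls), fpw in RULES.items():
        weight = min(trapezoid(speed_rpm, *SPEED[s_cls]),
                     trapezoid(map_value, *MAP[m_cls]))
        num += weight * fpw
        den += weight
    return num / den if den else 0.0

# e.g. fuel_pulse_width(2300, 40)
```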
3.5 The Mapping
Engine control typically requires a two-dimensional plane of steady state operating points with engine speed along the horizontal axis and throttle position along the vertical axis. The control surface in Figure 8 shows the crisp value of FPW at different combinations of speed and vacuum pressure using FCS. Each of these intersection points indicates the differing requirement for fuel, which is determined by the design of fuzzy sets and membership functions. The control surface acts as a means of determining the FPW needed for each combination of speed and MAP value.
Fig. 8. Three-dimensional FCS map
4 Experimental Results
The performance of the engine running with the FCS was experimentally compared with that of the engine running using the conventional mechanical fuel regulation and delivery system. A monitoring sub-routine was created to capture performance data, under conventional operation and using the FCS, under the experimental conditions described in Table 2. The experimental evaluation was carried out using a combination of six speed settings and five values of Throttle Position Setting (TPS) as illustrated in Table 2. Values of engine torque and power were recorded for each combination of speed and TPS.
The results are presented graphically in Figures 9 to 16 and discussed in Sections 4.1 and 4.2.

Table 2. Experimental conditions

Engine speed (RPM): 1800, 2000, 2200, 2400, 2600, 2700
Throttle position (%): 0, 25, 50, 75, 100

4.1 Power and Torque
Figures 9 to 12 illustrate that the power produced by the engine with the FCS exhibited an increase of between 2% and 21% with an average of approximately 12% compared with the original mechanical fuel delivery system. A corresponding improvement in output torque also resulted from the use of the fuel injection system with the FCS compared to when the original fuel delivery system was used. Figures 13 to 16 show that the mean torque exhibited an increase of between 2% and 20% with an overall average of 12%. These increases in engine performance are partly due to the improvement in charge preparation achieved by the fuel injection process; the improvement in fuel metering also results in improved combustion efficiency hence increased engine power.
Fig. 9. Engine power when TPS=25%
Fig. 10. Engine power when TPS=50%
Fig. 11. Engine power when TPS=75%
Fig. 12. Engine power when TPS=100%
Fig. 13. Engine torque when TPS=25%
Fig. 14. Engine torque when TPS=50%
Fig. 15. Engine torque when TPS=75%
Fig. 16. Engine torque when TPS=100%

4.2 Air-Fuel Ratio
The AFR was monitored, over a range of speeds and load conditions, using both the original fuel delivery system and the fuzzy-controlled fuel-injection system, in order to comparatively evaluate the variation in AFR that occurred. The control objective was to stabilise the AFR such that λ = 1.0 was achieved under all engine operating conditions. Figures 17 and 18 illustrate how the value of λ varied with different combinations of speed and throttle position using the original fuel regulation system and the fuzzy-controlled fuel-injection system, respectively. Figure 17 shows that wide variations in λ occurred when the original fuel regulation system was used, this being due to non-linearities in the characteristic of the carburettor. This resulted in an excessively rich mixture at small throttle openings and an excessively weak mixture when the throttle opening was large. The large
variations in λ suggested poor combustion efficiency and higher, harmful, exhaust emissions. An improved and refined contour was found to occur when the FCS was employed. Reasonable regulation of λ was achieved, the value being maintained between 0.8 and 1.0 in approximately 90% of the experimental operating region. Exceptions occurred in two extreme conditions:

• high engine speed with a very small throttle opening; and
• low engine speed with the throttle wide open.
Neither of these conditions is likely to occur frequently in normal engine operation. There were a number of limitations in the mechanical and electronic components of the fuel injection system which adversely affected the stabilisation of the AFR. Firstly, the fuel injector was one that was conveniently available for the experiment, but it was too big for the size of the engine, making it difficult to make small changes in the amount of fuel delivered. Secondly, the resolution of the counter that determined the fuel pulse width was too coarse, again causing difficulty in making fine adjustments to the quantity of fuel delivered. Finally, the chamber where the fuel injector was installed and the inlet manifold were not optimised for fuel injection. Even with such a non-optimal system, it was possible to conveniently and quickly adjust the parameters of the fuzzy control system to produce a close to optimal solution.
Fig. 17. Variation in lambda with original fuel regulation system
Fig. 18. Variation in lambda with fuzzy-controlled fuel-injection system

5 Conclusion
This paper has introduced an improved technique for the computer control of the fuel supply of a small internal combustion engine. The technique provides significant time savings in ECU calibration, and improved performance. The fuzzy logic control scheme eliminates the requirement for skilful and time-consuming calibration of the conventional three-dimensional map while at the same time achieving good fuel regulation, leading to improved control of polluting exhaust emissions.
It was demonstrated that the entire tuning process, including the set-up of the membership functions and the derivation of the rule-base, took as little as one hour to deliver results. This was significantly faster than comparable manual calibration of the equivalent mapped control. Faster times could be achieved with experience and practice. Laboratory tests showed that the fuzzy-controlled fuel-injection system achieved increased engine power and torque over that obtained with mechanical fuel delivery. In addition, it was shown that the system was capable of maintaining the variation of λ within a narrow range, leading to reduced emissions. Areas where further work will have a significant impact include the development of further fuzzy sets that refine the control strategy by including ignition timing, cold start enrichment, etc.
References
[1] "Representation of 3-D Mappings for Automotive Control Applications using Neural Networks and Fuzzy Logic", H. Holzmann, Ch. Halfmann, R. Isermann, IEEE Conference on Control Applications - Proceedings, pp. 229-234, 1997.
[2] "Fuzzy Logic and Neural Networks", L. Chin, D. P. Mital, IEEE Region 10th Annual International Conference, Proceedings/TENCON, Vol. 1, pp. 195-199, 1999.
[3] "Model Comparison for Feedforward Air/Fuel Ratio Control", D. G. Copp, K. J. Burnham, F. P. Locket, UKACC International Conference on Control 98, pp. 670-675, 1998.
[4] "Air-Fuel Ratio Control System using Pulse Width and Amplitude Modulation at Transient State", Takeshi Takiyama, Eisuke Shiomi, Shigeyuki Morita, Society of Automotive Engineers of Japan, pp. 537-544, 2001.
[5] "Neural Network Techniques for Monitoring and Control of Internal Combustion Engines", R. J. Howlett, M. M. de Zoysa, S. D. Walters and P. A. Howson, Int. Symposium on Intelligent Industrial Automation, 1999.
[6] "Fuelling Control of Spark Ignition Engines", David J. Stroh, Mathew A. Franchek and James M. Kerns, Vehicle System Dynamics 2001, Vol. 36, No. 4-5, pp. 329-358.
[7] "Exhaust Gas Sensors for Automotive Emission Control", J. Riegel, H. Neumann, H. M. Wiedenmann, Solid State Ionics, 8552, 2002.
[8] "Emission from In-use Lawn-mowers in Australia", M. W. Priest, D. J. Williams, H. A. Bridgman, Atmospheric Environment, 34, 2000.
[9] "Effects of Varying Engine Block Temperature on Spark Voltage Characterisation for the Measurement of AFR in IC Engines", M. M. de Zoysa, R. J. Howlett, S. D. Walters, Paper No. K312, International Conference on Knowledge-based Information Engineering Systems and Allied Technologies, University of Brighton, Sussex, 30th August - 1st September 2000.
[10] "Engine Control using Neural Networks: A New Method in Engine Management Systems", R. Muller, H. H. Hemberger and K. Baier, Meccanica 32, pp. 423-430, 1997.
Faults Diagnosis through Genetic Matching Pursuit* Dan Stefanoiu and Florin Ionescu FH-University of Applied Sciences in Konstanz, Dept. of Mechatronics Brauneggerstraße 55, 78462 Konstanz, Germany Tel.: + 49 07 531 206-415,-289; Fax: + 49 07 531 206-294 {Danstef,Ionescu}@fh-konstanz.de
Abstract. Signals that carry information regarding the existing defects or possible failures of a system are sometimes difficult to analyze because of various corrupting noises. Such signals are usually acquired in difficult conditions, far from the place where the defects are located and/or within a noisy environment. Detecting and diagnosing the defects then requires quite sophisticated methods that are able to distinguish between the noises encoding the defects and other parasite signals, all mixed together in an unknown way. Such a method is introduced in this paper. The method combines time-frequency-scale analysis of signals with a genetic algorithm.
1 Introduction
The problem of fault detection and diagnosis (fdd) using signals provided by a monitored system is approached in this paper through the hybrid combination of a modern Signal Processing (SP) method and an optimization strategy drawn from the field of Evolutionary Computing (EC). The signals to analyze are mechanical vibrations produced by bearings in service and have been acquired in difficult conditions: far from the location of the tested bearing and without a synchronization signal. The resulting vibrations are therefore corrupted by interference and environmental noises (the SNR is quite small) and the main rotation frequency can only be estimated. Moreover, the rotation speed varies during the measurements, because of load and power supply fluctuations. In general, it is difficult to extract the defect information from such signals by simple methods. The method that will be succinctly described in this paper (an in-depth presentation is given in [4]) relies on the Matching Pursuit Algorithm (MPA) introduced in [2]. MPA has a number of properties that can be exploited in order to perform fdd of noisy vibrations. The most important property is the capacity of denoising. By denoising, the util component is separated from the noisy component of the analyzed signal with a controlled accuracy. Another useful characteristic is the distribution of the denoised signal energy over the time-frequency (tf) plane, which in general reveals the signal features better than the classical spectrum. Vibrations affected by defects and noises are non-stationary signals that require tf analysis rather than classical spectral methods.

* Research developed with the support of the Alexander von Humboldt Foundation (www.avh.de).
2 Matching Pursuit in a tfs Dictionary
The approach presented in [2] has been generalized in the sense that the tf dictionary of waveforms is replaced here by a time-frequency-scale (tfs) dictionary. SP applications have proved that the use of time, frequency and scale together can lead to better results than when only time and frequency or only time and scale are employed. Hence, the dictionary is generated by applying scaling with scale factor s0, time shifting with step u0 and harmonic modulation with pulsation ω0 to a basic signal g referred to as the mother waveform (mw). The mw can be selected from a large class of known signals, but its nature should be related to the analyzed signals. In the case of vibrations, the unit-energy Gauss function has been preferred:

$$ g(t) \stackrel{\mathrm{def}}{=} A\, e^{-\frac{(t-t_0)^2}{2\sigma^2}}, \quad \forall t \in \mathbb{R} \qquad (1) $$

where A = (πσ²)^(−1/4) is the magnitude, σ > 0 is the sharpness and t0 ∈ ℝ is the central instant. It is well known that the Gauss mw (1) has a time support length of 6σ and a frequency support length of 6/σ, because of the Uncertainty Principle (UP). The supports measure the tf localization. Starting from the selected mw, the dictionary consists of the following discrete atoms:
$$ g_{[m,n,k]}[l] \stackrel{\mathrm{def}}{=} s_0^{-m/2}\, e^{-j\,\omega_0 s_0^{-m} k\, l T_s}\, g\!\left(s_0^{-m}(l T_s - n u_0)\right) \qquad (2) $$
where l ∈ ℤ is the normalized instant, Ts is the sampling period of the vibration, m ∈ {0, ..., Ms − 1} is the scale index, n ∈ {1 − N, ..., N − 1} is the time shifting index, and k ∈ {0, ..., Km − 1} is the harmonic modulation index. The ranges of variation of the indexes can be derived by tuning the dictionary on the vibration data v. Thus, the number of scales Ms, as well as the number of frequency sub-bands per scale Km (variable because of the UP), are determined by the vibration bandwidth set by pre-filtering. The number of time shifting steps depends on the vibration data length N. Also, the central instant t0 is naturally set to (N − 1)Ts/2, which implies σ = t0/3 (the support of the mw extends over the data support). The operators applied to the mw are defined by: s0 = 1/2, u0 = Ts and ω0 = 2 ln 2/σ (see [4]). Note that the atoms (2) are not necessarily orthogonal to each other. The tfs dictionary, denoted by D[g], is redundant and the spectra of every 2 adjacent atoms overlap. It generates a subspace D[g] of finite-energy signals, as shown in Fig. 1(a). Vibration
v may not belong to D[g], but, if projected onto D[g], the util signal vD is obtained. The residual signal Δv is orthogonal to D[g] and corresponds to unwanted noises. Fig. 1(a) illustrates, in fact, the signal denoising principle. Since the atoms (2) are redundant, vD cannot easily be expressed. Moreover, vD is an infinite linear combination of atoms from D[g]. Thus, it can only be estimated with a controlled accuracy.
Fig. 1. (a) Principle of denoising within a tfs dictionary. (b) Principle of GA elitist strategy
The MPA is concerned with the estimation of the util signal vD, using the concept of the best matching atom (bma). When projecting the signal onto the dictionary, the bma is the atom whose projection coefficient has the maximum magnitude. Within MPA, the util signal is estimated by approximating the residual through the recursive process:
$$ \Delta^{q+1}x \equiv \Delta^{q}x - \langle \Delta^{q}x,\, g_{[m_q,n_q,k_q]} \rangle\, g_{[m_q,n_q,k_q]}, \quad \forall q \ge 0 \qquad (3) $$
The approximation process (3) starts with the vibration signal Δ0x ≡ v as the first (coarsest) estimate of the residual. The corresponding bma g[m0,n0,k0] is found and the projected signal is subtracted from the current residual. The new resulting residual Δ1x (which refines the estimation of the noisy part) then looks for its bma in D[g] and, after finding it, a finer residual estimate Δ2x is produced, etc. The iterations stop when the residual energy falls below a threshold set a priori, i.e. after Q bmas have been found. The util signal is then estimated by:

$$ x_D \cong \sum_{q=0}^{Q-1} \langle \Delta^{q}x,\, g_{[m_q,n_q,k_q]} \rangle\, g_{[m_q,n_q,k_q]} \qquad (4) $$
Note that in (4) the projection coefficients come from successive residuals and not only from the initial vibration data. Moreover, a remarkable energy conservation property has been proved in [2]:

$$ \left\| \Delta^{q}x \right\|^2 \equiv \left| \langle \Delta^{q}x,\, g_{[m_q,n_q,k_q]} \rangle \right|^2 + \left\| \Delta^{q+1}x \right\|^2, \quad \forall q \ge 0, \qquad (5) $$
even though the atoms are not necessarily orthogonal. In fact, the iterations in (3) can be stopped thanks to (5), which shows that the energy of the util signal increases, while the residual energy decreases, with every newly extracted bma.
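A small numerical sketch (not the authors' implementation) may help fix ideas: it builds Gabor-like Gaussian atoms and runs the pursuit recursion (3) with the stopping rule implied by (5), using a naive exhaustive search for each bma; Section 3 below replaces exactly this search by a genetic algorithm. All sizes and dictionary parameters are arbitrary placeholders.

```python
import numpy as np

def atom(n_samples, scale, shift, freq):
    """Unit-energy Gaussian envelope, time-shifted and harmonically modulated
    (a simplified stand-in for the atoms (2))."""
    t = np.arange(n_samples)
    g = np.exp(-0.5 * ((t - shift) / scale) ** 2) * np.exp(1j * freq * t)
    return g / np.linalg.norm(g)

def matching_pursuit(x, dictionary, energy_ratio=0.05, max_iter=100):
    """Greedy pursuit: peel off best matching atoms until the residual energy
    falls below `energy_ratio` times the signal energy."""
    residual = x.astype(complex)
    e0 = np.vdot(residual, residual).real
    util = np.zeros_like(residual)
    for _ in range(max_iter):
        # naive exhaustive search for the bma (replaced by a GA in the paper)
        coeffs = np.array([np.vdot(g, residual) for g in dictionary])
        q = int(np.argmax(np.abs(coeffs)))
        util += coeffs[q] * dictionary[q]      # accumulate util signal, cf. (4)
        residual -= coeffs[q] * dictionary[q]  # update residual, cf. (3)
        if np.vdot(residual, residual).real < energy_ratio * e0:
            break
    return util, residual

# Example toy dictionary over a 256-sample window:
# D = [atom(256, s, u, w) for s in (8, 16, 32)
#                         for u in range(0, 256, 16)
#                         for w in np.linspace(0.1, 3.0, 8)]
```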
3 A Genetic Matching Pursuit Algorithm
Finding the bma corresponding to the current residual Δq x means solving the following non-linear maximization problem:

$$ \max_{m,n,k}\; \left| \sum_{l} \Delta^{q}x[l]\, g^{*}_{[m,n,k]}[l] \right| \qquad (6) $$
where a ∗ is the complex conjugate of a . The sum in (6) is actually finite, because the supports of residual and atoms are finite. Although the searching space D[g ] is finite, it usually includes a huge number of atoms to test. The exhaustive search is inefficient. The gradient based optimization methods are also impractical, because the cost function is extremely irregular and changes in every step of iteration. A promising approach is to use an optimization technique coming from EC. The combination between MPA and evolutionary techniques is a very recent idea that has been used in analysis of satellite images [1]. In this framework MPA is joined to a Genetic Algorithm (GA) [3]. The symbiosis between the 2 algorithms resulted in a Genetic Matching Pursuit Algorithm (GMPA). The GA was designed according to MPA characteristics. Dictionary atoms are uniquely located into the tfs plane by 3 indexes: scaling, time shifting and harmonic modulation. In general M s (with cavities on inner and outer races). Fig. 2(a) also shows the value of sampling rate (ν s = 20 kHz ), the nominal estimated rotation speed of shaft (ν s = 44.3 Hz ) and the natural frequencies of bearing in decreasing order. The resulted vibration data have small SNR, especially in case of defect encoding vibrations (under 6 dB, i.e. more than 30% of energy
consist of undesirable noise). A band-pass filter of 0.5-10 kHz was applied in order to remove some noises and the main rotation harmonics up to order 10. Fig. 2 displays data and spectra for bearings (b) and (c). The spectra of the other 2 bearings are quasi-identical to (c) and thus the defects cannot be discriminated. After projecting the vibrations on the tfs dictionary, the distributions of bmas over the tfs plane have been drawn, as depicted in Fig. 3, for bearings and . The tf planes have been horizontally stacked, in order of scales, with time on the vertical axis. On the horizontal axis, the frequency band is split into a variable number of sub-bands, depending on the scale index (from 0 to 8). Atoms are located in rectangles of variable size, depending on scale, because the frequency resolution varies along scales (a larger scale index involves bigger rectangles). Also, the rectangles overlap in frequency, because the atom spectra overlap. The fitness value (in dB) of every bma is represented by a color (or gray level) according to the scale on the right side of the images. It starts with light colors (blue or gray) for small values and ends with dark colors (red or black) for large values. Thanks to the absence of defect-encoding noise, the bmas in the left window are almost uniformly distributed over the tfs plane, without groups of high-fitness atoms concentrated in a specific zone. One can remark that the noise (scales #7 & #8) is located at low frequency, albeit some more high-frequency atoms appeared in the end. The distribution corresponding to the bearing with inner race defects (right side image) reveals a higher energy concentration on scale #0 (with more than 12% of the initial energy) within a small number of bmas. The 2 abnormal groups of bmas are located around sub-band #571, which gives a frequency of about 10.569 kHz, i.e. approximately 47 × BPFI ("BPFI" stands for Ball Pass Frequency on the Inner race). Thus, the defect is quite clearly decoded. Moreover, the noise is now located at high frequencies (scale #8). For the other 2 bearings, the defects are correctly diagnosed as well [4], albeit all 3 "defective" spectra are practically indistinguishable from each other. The defect severity has also been estimated, even in the case of multiple defects. The fdd results were confirmed after dismounting the bearings.
Fig. 3. Distributions of atoms over tfs plane: (left) and (right)
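To make the combination of greedy pursuit and evolutionary search more concrete, the following is a minimal sketch of one possible GMPA-style loop. It is not the authors' implementation: the Gabor-like atom, the population size, the mutation scales and the plain truncation selection are illustrative assumptions standing in for the GA designed in the paper.

```python
import numpy as np

def atom(n, scale, shift, modulation):
    """Illustrative Gabor-like dictionary atom indexed by (scale, shift, modulation)."""
    t = np.arange(n)
    g = np.exp(-0.5 * ((t - shift) / (2.0 ** scale)) ** 2) * np.cos(modulation * t)
    return g / (np.linalg.norm(g) + 1e-12)

def gmpa(signal, n_atoms=10, pop=60, gens=20, rng=np.random.default_rng(0)):
    """One GMPA-style decomposition: a population-based search (standing in for
    the GA) looks for the best matching atom of the current residual."""
    n, residual, decomposition = len(signal), np.asarray(signal, float).copy(), []
    for _ in range(n_atoms):
        # crude evolutionary search over the 3 indexes of the dictionary
        population = np.column_stack([rng.integers(0, 9, pop),      # scale index
                                      rng.integers(0, n, pop),      # time shift
                                      rng.uniform(0, np.pi, pop)])  # modulation
        for _ in range(gens):
            fitness = np.array([abs(residual @ atom(n, *p)) for p in population])
            parents = population[np.argsort(fitness)[-pop // 2:]]
            children = parents + rng.normal(0, [0.5, 5.0, 0.05], parents.shape)
            children[:, 0] = np.clip(np.round(children[:, 0]), 0, 8)
            children[:, 1] = np.clip(np.round(children[:, 1]), 0, n - 1)
            population = np.vstack([parents, children])
        best = population[np.argmax([abs(residual @ atom(n, *p)) for p in population])]
        g = atom(n, *best)
        coeff = residual @ g
        residual -= coeff * g                     # greedy MPA update
        decomposition.append((best, coeff))
    return decomposition, residual
```

Each outer iteration extracts one best matching atom (bma) and subtracts its contribution from the residual, mirroring the greedy MPA update described earlier.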
5 Conclusion
Processing of noisy signals usually requires methods with a high degree of complexity. Some of these methods lead to greedy procedures, similar to the Matching Pursuit Algorithm presented in this paper. The complexity of the method described above stems from a non-linear optimization problem that cannot be solved efficiently by means of classical, gradient-based techniques. Therefore, an evolutionary approach using Genetic Algorithms has been proposed. The resulting Genetic Matching Pursuit Algorithm proved to have interesting features, such as denoising and extraction of the useful signal with a desired SNR, or decoding of information related to fdd. Moreover, its performance can be improved by operating with different mother waveforms, adapted to the nature of the analyzed signal. The algorithm can be used for a larger class of one-dimensional signals than vibrations, in a framework where a method issued from Signal Processing and a strategy relying on Evolutionary Computing proved that these two fields can work together in a promising alliance.
References
[1] Figueras i Ventura R.M., Vandergheynst P., Frossard P. – Evolutionary Multiresolution Matching Pursuit and its relations with the Human Visual System, Proceedings of EUSIPCO 2002, Toulouse, France, Vol. II, 395-398 (September 2002).
[2] Mallat S., Zhang S. – Matching Pursuit with Time-Frequency Dictionaries, IEEE Transactions on Signal Processing, Vol. 41, No. 12, 3397-3415 (December 1993).
[3] Mitchell M. – An Introduction to Genetic Algorithms, The MIT Press, Cambridge, Massachusetts, USA (1995).
[4] Stefanoiu D., Ionescu F. – Vibration Faults Diagnosis by Using Time-Frequency Dictionaries, Research Report AvH-FHKN-0301, Konstanz, Germany (January 2003).
A Fuzzy Classification Solution for Fault Diagnosis of Valve Actuators C. D. Bocănială1, J. Sa da Costa2, and R. Louro2 1
University “Dunărea de Jos” of Galati, Computer Science and Engineering Dept. Domneasca 47, Galati 6200, Romania
[email protected] 2 Technical University of Lisbon, Instituto Superior Tecnico, Dept. of Mechanical Engineering, GCAR/IDMEC Avenida Rovisco Pais, Lisboa 1096, Portugal
[email protected] [email protected] Abstract. This paper proposes a fuzzy classification solution for fault diagnosis of valve actuators. The belongingness of the current state of the system to the normal and/or a faulty state is described with the help of fuzzy sets. The theoretical aspects of the classifier are presented. Then the case study, the DAMADICS benchmark flow control valve, is briefly introduced, together with the method used to generate the data for designing and testing the classifier. Finally, the simulation results are discussed.
1 Introduction
Fault diagnosis is a suitable application field for classification methods as its main purpose is to achieve an optimal mapping of the current state of the monitored system into a prespecified set of behaviors, normal and faulty [6]. Fault diagnosis is performed in two main stages: fault detection, which indicates if a fault occurred in the system, and fault isolation, which identifies the specific fault that affected the system [8]. Classification methods are usually used to perform fault isolation. Recent research in the field largely proposes classification methods based on soft computing methodologies: neural networks [8], fuzzy reasoning [4],[5] and combinations of them, neuro-fuzzy systems [2],[7]. This paper proposes a fuzzy classification solution for fault diagnosis of valve actuators. Diagnosis is performed in one single stage. Based on the raw sensor measurements, the classifier identifies the behavior of the system. The classifier uses fuzzy sets to describe the belongingness of the current state of the system to a certain class of system behavior (normal or faulty). The class of behavior with the largest degree of belongingness represents the current behavior of the system. The structure of the paper is as follows. Section 2 presents the theoretical aspects of the classifier. Section 3 introduces the case study valve and details the method used to
generate the data for designing and testing the classifier. The last section, Section 4, describes the simulation results obtained using MATLAB. The paper ends with some conclusions and further work considerations.
2 The Fuzzy Classifier
Cluster analysis studies methods for splitting a group of objects into a number of subgroups on the basis of a chosen measure of similarity. The similarity of objects within a subgroup is larger than the similarity of objects belonging to different subgroups. Usually, a subgroup is considered a classical set, which means that an object either belongs to the set or not. This description lacks nuances. In contrast, an object can be assigned to a fuzzy set with a varying degree of membership from 0 to 1. Baker [1] proposes a cluster model that builds an optimal decomposition of the given group of objects into a collection of fuzzy sets. If u and v belong to the analyzed group of objects, the similarity between them, s(u,v), is measured via a dissimilarity measure, d(u,v) = 1 − s(u,v). The dissimilarity measure is expressed using a distance function hβ that maps the distance between u and v, δ(u,v), into the [0, 1] interval (Eq. 1). If δ(u,v) decreases towards 0, hβ(δ(u,v)) also decreases towards 0 and [1 − hβ(δ(u,v))] increases towards 1. If δ(u,v) increases towards β or is larger than β, hβ(δ(u,v)) increases towards 1 or, respectively, is 1. In the latter case, [1 − hβ(δ(u,v))] decreases towards 0 or, respectively, is 0. These facts justify the choice of hβ as the dissimilarity measure.
hβ(δ(u,v)) = δ(u,v)/β, for δ(u,v) ≤ β;  hβ(δ(u,v)) = 1, otherwise.     (1)
The fuzzy classifier proposed in this paper uses ideas from the previously mentioned cluster model and is described in the following. In order to design and test the classifier, the set of all available data for the problem to be solved is split into three distinct subsets: the reference patterns set, the parameters tuning set, and the test set. The elements in the reference patterns set are grouped according to their membership in the prespecified classes. That is, all elements in the reference patterns set belonging to a specific class are in the same subgroup. Let the obtained partition be C = {Ci}, i = 1,…,n, where n is the number of prespecified classes. If u is the input of the classifier, the subset affinity measure defined in [1] can be used to express its average similarity to a specific subset Ci (Eq. 2). The notation ni stands for the cardinality of Ci.
r(u, Ci) = 1 − (1/ni) ∑v∈Ci hβ(δ(u,v)) .     (2)
Given the subset affinity measure, a fuzzy membership function can describe the degree of belongingness of input u to subset Ci (Eq. 3) [1]. The notations ni and n stand for the cardinality of Ci and of C, respectively.
fi(u) = [ ni − ∑v∈Ci hβ(δ(u,v)) ] / [ n − ∑v∈C hβ(δ(u,v)) ] .     (3)
The classification task is performed taking into account the values of the fuzzy membership functions of the input u for all subsets Ci. Recalling that the subset Ci contains all elements in the reference pattern set belonging to the i-th prespecified class, the input u is classified as a member of the class whose corresponding fuzzy membership value is the largest (Eq. 4). In case of ties, the vector to be classified is rejected.
u ∈ Cmax ⇔ fmax(u) = max i=1,…,m fi(u) .     (4)
Baker [1] uses a single value of the β parameter. In this paper, each class has an associated distance function hβ with a dedicated value of the parameter β. That is, the classifier has as many parameters as the number of prespecified classes. This modification improves the performance of the classifier. The parameters of the classifier are tuned by increasing the performance of the classifier on the parameters tuning subset towards an optimal value. The tuning algorithm must search an n-dimensional space for the parameter vector (β1, …, βn) that offers optimal performance when applying the classifier on the parameters tuning set. Genetic Algorithms can be used to perform this search. Each individual of the population contains n strings corresponding to the n parameters of the classifier. The genetic operators are applied on pairs of strings corresponding to the same parameter. Consecutive populations are produced applying elitism (the six fittest individuals survive into the next generation), followed by reproduction, crossover and mutation.
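As an illustration of Eqs. (1)-(4) with one β per class, here is a minimal sketch; the Euclidean distance used for δ, the dictionary-based data layout and the tie test are assumptions, and the β values are taken as given (in the paper they are tuned by the GA described above).

```python
import numpy as np

def h_beta(delta, beta):
    """Distance function of Eq. (1), mapping distances into [0, 1]."""
    return np.minimum(delta / beta, 1.0)

class FuzzyFaultClassifier:
    """Sketch of the classifier of Eqs. (2)-(4) with one beta per class."""

    def __init__(self, references, betas):
        # references: dict class -> (n_i x d) array of reference patterns C_i
        # betas: dict class -> beta_i (assumed already tuned, e.g. by a GA)
        self.references, self.betas = references, betas
        self.n = sum(len(r) for r in references.values())

    def memberships(self, u):
        # sum of h_beta over each C_i, with Euclidean delta (an assumption)
        sums = {c: h_beta(np.linalg.norm(r - u, axis=1), self.betas[c]).sum()
                for c, r in self.references.items()}
        denom = max(self.n - sum(sums.values()), 1e-12)   # n - sum over all of C
        return {c: (len(self.references[c]) - s) / denom for c, s in sums.items()}

    def classify(self, u):
        f = self.memberships(u)
        best = max(f, key=f.get)
        ties = [c for c in f if np.isclose(f[c], f[best])]
        return None if len(ties) > 1 else best            # reject ties (Eq. 4)
```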
3 Case Study
The DAMADICS benchmark flow control valve was chosen as the case study for this method. More information on the DAMADICS benchmark is available via the web site [3]. The valve has the purpose of supplying water to a steam-generating boiler. It has three main parts: a valve body, a spring-and-diaphragm actuator and a positioner [9]. The valve body is the equipment that sets the flow through the pipe system. The flow is proportional to the minimum flow area inside the valve, which, in its turn, is proportional to the position of a rod. The spring-and-diaphragm actuator determines the position of this rod. The spring-and-diaphragm actuator is composed of a rod, which at one end is connected to the valve body while the other end has a plate, which is placed inside an
airtight chamber. The plate is connected to the walls of the chamber by a flexible diaphragm. This assembly is supported by a spring. The position of the rod is proportional to the pressure inside the chamber, which is determined by the positioner. The positioner is basically a control element. It receives three signals: a measurement of the position of the rod (x), a reference signal for the position of the rod (CV) and a pneumatic signal from a compressed air circuit in the plant. The positioner returns an airflow signal, which is determined by a classic feedback control loop of the rod position. The airflow signal changes the pressure inside the chamber. Two of the several sensors that are applied to the valve, the sensor for measuring the position of the rod (x) and the sensor for measuring the water flow through the valve (F), provide variables that contain information relative to the faults. The valve is subject to a total of 19 faults that affect all of its components. For the work detailed in this paper only 5 of these faults were considered. A detailed description of the selected faults follows. Sometimes the movement of the rod is limited by an external mechanical event. If this happens the rod will not be able to move above a certain position. This situation is known as fault F1.
Fig. 1. Effects of the faults on the position of the rod
There are large pressure differences inside a valve. Under certain conditions, if the water pressure drops below the vapor pressure, the water may undergo a phase change from liquid to gas. If this happens the flow will be governed by the laws of compressible flow, one of which states that if the water vapor reaches the speed of sound a further increase in the pressure difference across the valve will not lead to an increase in flow. Therefore there is a limit to the flow. The effects of this fault will also be visible on x because it is dependent on the flow. During the normal operation of the valve this phenomenon is unlikely to occur; however, there are sometimes temperature increases which allow the occurrence of this fault. This situation is known as fault F2.
Fig. 2. Effects of the faults on the flow
During the normal operation of the system the upstream and downstream water pressures remain in a given range. However, due to a pump failure or to a process disturbance the values of these pressures may change and leave the above-mentioned range. If this happens the value of the flow will deviate from the normal behavior, with some effects on the rod position as well. This situation is known as fault F3. The valve is placed in a pipe circuit that has a circuit parallel to the valve. This parallel circuit is intended to replace the valve without cutting the water feed to the boiler. It has a gate valve that is closed during normal operation. However, due to a fault in the valve or to employee mishandling, this gate valve may be opened. This increases the flow of water to the boiler. This situation is known as fault F4. The flow sensor is subject to a fault occurrence due to wiring failure or to an electronic fault. If this happens it will cause the flow measurements to be biased. This situation is known as fault F5. The changes that each fault causes to the normal behavior can be seen in Fig. 1 and Fig. 2. The figures show that the effects of all the faults, with the exception of F3, remain constant from the moment in which they start. The effects of fault F3 are visible only temporarily on x and F. There is a small effect that endures in F, very similar to the effects of F4. This behavior will make fault F3 difficult to correctly classify.
4 Simulation Results
There is no data available on the operation of the system while a fault is occurring, due to the economic costs that this would imply. Therefore there is the need to develop a program that models the system and to artificially introduce faults in this model in order to obtain data on the faulty behavior. The valve was extensively modeled using first principles. A MATLAB / SIMULINK program was then developed on the basis of this model [9]. For the simulation of the effects of the faults, the inputs to the program were taken from the available data from the real plant. This makes the fault-effect data obtained through simulation more realistic. The development of FDI systems based on these data is more difficult because they contain both dynamic and static behavior. As noted in Section 3, two measurements provide the best distinction among the considered classes of behavior: the rod position (x) and the flow through the valve (F). The first-order difference on x (dx) can be added to improve the discrimination between different faults. The input of the classifier at time instance t is the 3-tuple (xt, Ft, dxt) that represents the values of the three features at time instance t. The five selected faults were simulated for 20 values of fault strength, uniformly distributed between 5% and 100%, and different conditions for the reference signal (CV). These settings approximate very well all possible faulty situations involving the five selected faults. The simulation data was used as follows: 5% for the reference pattern set, 2.5% for the parameters tuning set, and the remaining 92.5% for the test set. The GA mentioned in Section 2 reached the maximum value of the objective function in 23 generations. The time necessary for producing a successive generation was 50 minutes on an Intel Pentium 4 (1.70 GHz CPU, 256 MB RAM). The performance of the classifier on the test set is shown in Table 1. The classifier correctly detects and identifies the faults F1, F2, F4 and F5 within 2 seconds of their appearance in the system. The exception is fault F3, which needs about 3 seconds for detection and 3 seconds for identification. These results are explained by the fact that F3 is mainly distinguishable from the normal state only on F (see Fig. 1 and Fig. 2).
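A small sketch of how the classifier inputs described above could be assembled; the NumPy-based layout and the random shuffling before the 5% / 2.5% / 92.5% split are assumptions.

```python
import numpy as np

def build_inputs(x, F):
    """Assemble the 3-tuples (x_t, F_t, dx_t); dx is the first-order difference on x."""
    x, F = np.asarray(x, float), np.asarray(F, float)
    dx = np.diff(x, prepend=x[0])
    return np.column_stack([x, F, dx])        # one row per time instance t

def split_data(data, rng=None):
    """Split rows into reference (5%), tuning (2.5%) and test (92.5%) sets."""
    rng = rng or np.random.default_rng(0)     # shuffling is an assumption
    idx = rng.permutation(len(data))
    n_ref, n_tune = int(0.05 * len(data)), int(0.025 * len(data))
    return (data[idx[:n_ref]],
            data[idx[n_ref:n_ref + n_tune]],
            data[idx[n_ref + n_tune:]])
```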
5 Summary
The paper introduced a fuzzy classification solution for fault diagnosis of valve actuators. The complexity of the problem is given by two facts. First, the data used are driven by real data and cover very well all possible faulty situations. Second, the fault F3 is mainly visible in the flow measurement (F) and poorly visible in the other measurement, x. These facts put a special emphasis on the good performance of the classifier. Further research needs to be carried out on the performance of the classifier when adding other faults to the five faults selected in this paper.
Table 1. The confusion matrix for the test data. The main diagonal contain the percent of wellclassified data per class
N F1 F2 F3 F4 F5
N 98.71 0.18 0.18 4.66 0.18 0.18
F1 0.14 99.82 0.00 0.79 0.00 0.00
F2 0.00 0.00 99.82 0.00 0.00 0.00
F3 0.29 0.00 0.00 90.74 0.24 0.00
F4 0.00 0.00 0.00 0.00 2.78 99.82
F5 0.00 0.00 0.00 0.00 0.18 0.00
References
[1] Baker, E.: Cluster Analysis by Optimal Decomposition of Induced Fuzzy Sets. PhD thesis, Delft University of Technology, Holland (1978)
[2] Calado, J. M. F., Sá da Costa, J. M. G.: An expert system coupled with a hierarchical structure of fuzzy neural networks for fault diagnosis. International Journal of Applied Mathematics and Computer Science 9(3) (1999) 667-688
[3] EC FP5 Research Training Network DAMADICS: Development and Application of Methods for Actuator Diagnosis in Industrial Control Systems, (http://www.eng.hull.ac.uk/research/control/damadics1.htm)
[4] Frank, P. M.: Analytical and qualitative model-based fault diagnosis – a survey and some new results. European Journal of Control 2 (1996) 6-28
[5] Koscielny, J. M., Syfert, M., Bartys, M.: Fuzzy-logic fault isolation in large-scale systems. International Journal of Applied Mathematics and Computer Science 9(3) (1999) 637-652
[6] Leonhardt, S., Ayoubi, M.: Methods of Fault Diagnosis. Control Eng. Practice 5(5) (1997) 683-692
[7] Palade, V., Patton, R. J., Uppal, F. J., Quevedo, J., Daley, S.: Fault diagnosis of an industrial gas turbine using neuro-fuzzy methods. Preprints of the 15th IFAC World Congress, Barcelona, Spain (2002)
[8] Patton, R. J., Lopez-Toribio, C. J., Uppal, F. J.: Artificial intelligence approaches to fault diagnosis for dynamic systems. International Journal of Applied Mathematics and Computer Science 9(3) (1999) 471-518
[9] Sá da Costa, J., Louro, R.: Modelling and Simulation of an Industrial Actuator Valve for Fault Diagnosis Benchmark. Proceedings of the Fourth International Symposium on Mathematical Modelling, Vienna (2003) 1057-1066
Deep and Shallow Knowledge in Fault Diagnosis Viorel Ariton “Danubius” University from Galati Lunca Siretului no.3, 6200 – Galati, Romania
[email protected] Abstract. Diagnostic reasoning is fundamentally different from the reasoning used in modelling or control: the latter is deductive (from causes to effects) while the former is abductive (from effects to causes). Fault diagnosis in real complex systems is difficult due to multiple effects-to-causes relations and to various running contexts. In deterministic approaches, deep knowledge is used to find "explanations" for effects in the target system (impractical when the modelling burden becomes large); in soft-computing approaches, shallow knowledge from experiments is used to link effects to causes (unrealistic for running real installations). The paper proposes a way to combine shallow knowledge and deep knowledge on conductive flow systems at faults, and offers a general approach for diagnostic problem solving.
1 Introduction
Fault diagnosis is basically abductive reasoning: from manifestations (as effects) one must infer faults (as causes). The target system's utilities consist of some wanted outputs and also appear as effects. Faults are causes that bring the utility far from the desired values; they are unexpected and, usually, not included in the cause-effect model of the target system's operation. Such causes are related to the physical installation (state and age of components, accidents), to human operators (technological discipline, maintenance) or to the environment (season, working conditions, other installations). From the human point of view, the target system goes in normal running from known causes to expected effects, while in faulty running it goes from unknown causes to unexpected effects. Diagnostic reasoning, and abductive reasoning in general, involves an open space of causes and, possibly, an open space of effects. The effects-to-causes mapping is based on incomplete knowledge about the relations between causes and effects. Deterministic approaches seem not appropriate for diagnostic problem solving: they assume quantitative and precise values as well as crisp causal relations (deep knowledge), hence closed sets of parameters and finite value ranges. Soft-computing approaches, fuzzy logic and artificial neural networks (ANN), seem better suited, being able to deal with incomplete knowledge and imprecise data. Usually, they require a large amount of data from practice or from experiments (shallow knowledge), acquired over a long time; if obtained by provoking faults in running installations, this looks unrealistic. Mixing deep knowledge (compact and general) with shallow
knowledge (diverse and specific) will bring an advantage. The paper develops considerations on deep and shallow knowledge in fault diagnosis for the class of Conductive Flow Systems (CFSs), and proposes a methodology to mix them and to apply techniques to detect and isolate faults. It unifies deep and shallow knowledge representation by assuming qualitative aspects of the real system's structure and behaviour, similar to the way human diagnosticians proceed. Faults in Conductive Flow Systems spread effects from the place they occur (as primary effects) throughout the entire system (as secondary effects). Using deep knowledge on flow conduction laws and the CFS's structure of components, it is possible to reject secondary effects and to use only primary effects directly, in a recognizing scheme, for fault isolation. The deep and shallow knowledge mixture for fault diagnosis briefly refers to: a) qualitative means-end and bond-graph models of the components and of the whole target CFS, regarding normal behaviour and structure; b) generic anomalies defined for the faulty behaviour of modules and components; c) patterns of qualitative deviations of flow variables in bond graph junctions for each generic anomaly; d) generic types of primary effects and their links to faults at components; e) recognizing schemes and the architecture of the diagnosis application. Items a to c represent theoretical premises for the way the fault diagnosis proceeds in d, e; deep and shallow knowledge appear in a, b and c, d respectively.
2 Qualitative Abstraction on Structure and Behaviour
Means-end models refer to goals (ends) of a target system and the functions it performs, as means to reach the ends. For example, the MFM means-end modelling approach [4] considers a set of flow-functions for low-level operations of components (e.g. valve open - transport flow function, valve shut - barrier flow function, etc.). A network of such flow functions performs an end (utility) and corresponds to a module.
2.1 Qualitative Means-End Modelling of Conductive Flow Systems
However, for the proposed qualitative approach and for fault diagnosis it is not necessary to work with detailed flow functions (transport, barrier, etc.) but with some qualitative flow functions. [6] refers to three orthogonal operational facets of a technical system, see Table 1, suited to the three qualitative flow functions proposed below.

Table 1. Functional orthogonal facets of the real-world components

Concept     Process          Flow            Store
Activity    Transformation   Transportation  Preservation
Aspect      Matter           Location        Time
Let us consider for each concept in Table 1 a generic flow function:
• flow processing function (fpf) – as chemical or physical transformation of the piece of flux (to an end, or utility);
• flow transport function (ftf) – as space location change of the piece of flux (by pipes, conveyors, etc.);
• flow store function (fsf) – as time delay of the piece of flux, by accumulation of mass or energy in some storing or inertial components.
A real component may perform one or more generic flow functions; faults affect each of them in specific ways. A component that performs a unique flow function is a primary component (PC). It is the final location for fault isolation, so modules and components will be "split" into as many primary components as necessary. The set of all primary components defines the "granularity of the diagnosis". A module (M) is a structure of components (as means), the set of all its flow functions completing a certain end. The abstraction above is useful for modelling the CFS's qualitative behaviour. When faults occur to physical components, their function(s) get disturbed in specific ways (see below). Fault isolation is made down to components performing only one generic flow function.
2.2 Bond Graph Modelling and Qualitative Means-End Approach
CFSs involve the transport of matter and energy as flows along given paths. The bond graph modelling approach [3] represents Kirchhoff's laws with explicit components and structures. It allows modularization of the CFS model – originally, applied only to the whole system. The notions used in the paper are:
• effort and flow: extensive / intensive power variables (pressure and flow-rate like);
• bond graph passive components: R (resistance like), storage C (capacity), inertial component I (inductance like);
• bond graph active components (power conversion): T transformer, G gyrator.
It is worth noting that bond graph components correspond to primary flow functions (and primary components PC): R to the flow transport function (ftf), C and I to the flow store function (fsf), T and G to flow processing (fpf). To each type of Kirchhoff's law corresponds a bond graph junction:
• 0-junction (node): the flow rate is the common power variable, the effort the summed one;
• 1-junction (loop): the effort is the common power variable and the flow the summed one.
In the presented approach, bond graph model of the target CFS is not meant for the normative running (as usually encountered) but only to obtain the modular structure of modules and components necessary for isolating faults, as presented below.
2.3 Deep Knowledge on the Target CFS Structure
Knowledge on components, on the structure of components and on their functions (flow functions), along with hybrid modelling by bond graphs and modules' activities [5], represents deep knowledge on the target CFS. Module M corresponds to a bond graph junction of components; the entire CFS corresponds to a junction of modules. The target CFS requires hierarchical decomposition: from the CFS to block units, then modules, finally PCs [1]. Because bond graph modelling has no means to represent relative positions of modules (or components) along flow paths or inside junctions, let us introduce the up-stream weak order between units in each junction of the target CFS [1]. Up-stream / down-stream relations between modules are crucial for the transport anomaly identification procedure (see 4.1).
2.4 Imprecise and Incomplete Knowledge on Manifestations
Human diagnosticians evaluate deviations of observed variables in an imprecise manner; they deal with incomplete knowledge and with the (normal) drift of values in the installation's real running. So, a value is "normal" (no) when it falls inside a given range; it is "too high" (hi) or "too low" (lo) when falling in neighbouring ranges; the hi and lo deviations represent qualitative values for an observed variable. Common representations for such qualitative values are fuzzy attributes (ranges with imprecise limits); hi and lo are fuzzy attributes [2]. Fuzzy representation has another advantage: from continuous values one may obtain discrete ones (attributes). The human diagnostician deals with discrete pieces of knowledge when linking effects to faults, then mapping and recognizing them (as proposed in 4.2). Attributes are, in fact, primary effects, and along with the mapping they represent shallow knowledge. Fuzzy attributes of continuous variables (hi, lo), and intrinsic specific attributes of other discrete variables (muddy, noisy, etc.), are manifestations directly linked to faults. Let us denote the manifestations related to flow conduction FMAN (deviations of flow and effort variables, which propagate), and the manifestations from other variables MAN (which do not propagate).
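The mapping of a continuous measurement into the qualitative values lo / no / hi can be sketched as follows; the trapezoidal shape of the fuzzy attributes and the numeric limits are assumptions, since the paper does not fix them.

```python
def fuzzy_attributes(value, lo_limit, hi_limit, fuzz=0.1):
    """Map a continuous variable into fuzzy attributes lo / no / hi.

    lo_limit, hi_limit bound the 'normal' range; fuzz is the width of the
    imprecise transition zone, as a fraction of the normal range (an assumption).
    """
    width = (hi_limit - lo_limit) * fuzz
    def ramp(x):                       # clip membership values to [0, 1]
        return max(0.0, min(1.0, x))
    lo = ramp((lo_limit - value) / width + 0.5)
    hi = ramp((value - hi_limit) / width + 0.5)
    no = ramp(1.0 - lo - hi)
    return {"lo": lo, "no": no, "hi": hi}

# Example: a pressure reading slightly above its (hypothetical) normal range
print(fuzzy_attributes(5.3, lo_limit=2.0, hi_limit=5.0))
```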
3 Faulty Behaviour Modelling
Faulty generic components exhibit anomalies. To each generic flow function a generic anomaly is attached, as follows:
a) Process anomaly (AnoP) – a deviation from the normal value (too high or too low) of an end-variable; it refers to the transformations the flow undergoes.
b) (Flow) Transport anomaly (AnoT) – changes of the flow variables or of the inner structure of a component, relative to flow transport along flow paths.
c) Store anomaly (AnoS) – a deviation from the normal value of the delay specific to a storing (capacitor-like) or inertial (inductance-like) component.
Works on fault diagnosis deal with concepts such as "leakage" or "obstruction", but no complete set of transport anomalies is defined. Such a set is presented below [1]:
a) Obstruction – consists in a change (increase) of the transport resistance-like parameter of a component, without flow path modification (clogged pipe).
b) Tunnelling – consists in a change (decrease) of the transport resistance-like parameter, without flow path modification (broken-through pipe).
c) Leakage – consists in a structure change (balance low) of the flow transport, involving flow path modification.
d) Infiltration – consists in a structure change (balance high) of the flow transport, with flow path modification.
Transport anomalies are orthogonal in pairs (obstruction to tunnelling and leakage to infiltration), each pair orthogonal to the other. A fault causes a unique transport anomaly that appears at the respective component and module. Actually, transport anomalies are primary effects that may enter the recognizing scheme for fault isolation. Transport anomalies refer to FMAN – deep knowledge – while for MAN the faulty behaviour modelling is a faults-to-effects mapping – shallow knowledge.
4 Fault Detection and Isolation
In order to efficiently isolate faults, the recognizing schemes have to use only primary effects, i.e. those linked to faults. Secondary effects appear at non-faulty components and each effect is specific to a transport anomaly, but they are irrelevant information for fault isolation. Separating primary effects from the set of all effects means rejecting secondary effects.
4.1 Fault Detection and Isolation Procedure
Transport anomalies are also primary effects, as they appear only at the faulty module or component. Fault detection consists in anomaly detection. Fault isolation at module level is similar: transport anomaly detection. Fault isolation inside a module is based on the causal relations between faults and primary effects. It is possible to generate patterns of qualitative deviations, in order to identify a transport anomaly at a (faulty) module, depending on the junction type and on the up-stream/down-stream relations relative to the other modules in the junction [2].

Table 2. Patterns of qualitative deviations for the power variables of components' inputs, for each transport anomaly occurring at the faulty component

Transport anomaly  | 1-junction type,            | 0-junction type,            | 0-junction type,
                   | faulty component located:   | fault located down-stream:  | fault located up-stream:
                   | up-stream  | down-stream    | up-stream  | down-stream    | up-stream  | down-stream
Obstruction (Ob)   | lo-hi      | lo-lo          | lo-hi      | hi-hi          | hi-lo      | lo-lo
Tunnelling (Tu)    | hi-lo      | hi-hi          | hi-lo      | lo-lo          | lo-hi      | hi-hi
Infiltration (In)  | lo-hi      | hi-hi          | lo-hi      | hi-hi          | lo-hi      | hi-hi
Leakage (Le)       | hi-lo      | lo-lo          | hi-lo      | lo-lo          | hi-lo      | lo-lo
Faulty module isolation proceeds by finding the patterns of Table 2 at the inputs of the neighbouring modules in each bond graph junction in which they appear.
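A minimal sketch of how the patterns of Table 2 could be used to point at candidate transport anomalies. The dictionary below encodes the table as recovered above; the exact semantics of the up-stream / down-stream columns and the flow-effort ordering of the qualitative pairs are assumptions.

```python
# Direct encoding of Table 2 (qualitative flow-effort deviation pairs).
# 1-junction: one pattern per fault location (up-stream / down-stream).
# 0-junction: a pair of patterns (observed up-stream, observed down-stream)
#             per fault location.
TABLE_2 = {
    "1-junction": {
        "Obstruction":  {"up": "lo-hi", "down": "lo-lo"},
        "Tunnelling":   {"up": "hi-lo", "down": "hi-hi"},
        "Infiltration": {"up": "lo-hi", "down": "hi-hi"},
        "Leakage":      {"up": "hi-lo", "down": "lo-lo"},
    },
    "0-junction": {
        "Obstruction":  {"down": ("lo-hi", "hi-hi"), "up": ("hi-lo", "lo-lo")},
        "Tunnelling":   {"down": ("hi-lo", "lo-lo"), "up": ("lo-hi", "hi-hi")},
        "Infiltration": {"down": ("lo-hi", "hi-hi"), "up": ("lo-hi", "hi-hi")},
        "Leakage":      {"down": ("hi-lo", "lo-lo"), "up": ("hi-lo", "lo-lo")},
    },
}

def candidate_anomalies(junction, observed):
    """Return (anomaly, fault_location) pairs whose Table 2 entry matches
    the observed qualitative deviation pattern(s)."""
    hits = []
    for anomaly, by_location in TABLE_2[junction].items():
        for location, pattern in by_location.items():
            if pattern == observed:
                hits.append((anomaly, location))
    return hits

# Example: in a 1-junction, an input showing flow "lo" and effort "hi" is
# consistent with an obstruction or an infiltration located up-stream.
print(candidate_anomalies("1-junction", "lo-hi"))
```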
4.2 Architecture of ANN Blocks for Fault Diagnosis
Human experts have good knowledge of the faulty behaviour of components and modules but poor knowledge of the entire system's faulty behaviour. The proposed approach tries to cope with both: faulty module isolation is based on deep knowledge (flow transport anomalies), while fault isolation inside a module is based on the expert's shallow knowledge (the mapping of primary effects to faults). To each module correspond two artificial neural network (ANN) blocks, see Fig. 1: one for module isolation (detecting the patterns of Table 2) and one for component isolation (Fault Isolation Block), which recognizes faults from patterns of primary effects, i.e. the fuzzy attributes lo, hi of observed variables. Since bond graphs are modular and junctions are independent, each module has its own neural block for module isolation inside its junction (Module Isolation Block). The neural network Fault Isolation Block is a mapping of non-flow-related manifestations (MAN) and flow-related manifestations (FMAN, see 2.4) to the faults of each PCi.
Fig. 1. Architecture of ANN blocks for isolation of faulty module and component
5 Case Study on a Simple Hydraulic Installation
The fault diagnosis was meant for a simple hydraulic installation in a rolling mill plant, comprising three modules: Supply Unit (pump, tank and pressure valve), Hydraulic Brake (control valve, brake cylinder) and Conveyor (control valve, self, the conveyor cylinder). For the 20 faults of all 8 components considered, manifestations come from sensors for FMAN (2 flow-rate and 4 pressure sensors) and for MAN (5 temperature sensors, 8 binary signals: cylinders' left/right ends, open/shut valves, and 10 operator-observed variables: noise and oil-mud). The software architecture comprises 6 backpropagation ANN blocks, 2 per module.
Fig. 2. Simple hydraulic installation under diagnosis (oil tank, pump, pressure valve, control valves 1 and 2, drossel, hydraulic brake and conveyor, and the bond graph junctions)
Fault diagnosis was performed in two approaches: (1) using shallow knowledge from experiments, and (2) using deep and shallow knowledge, i.e. the proposed approach. The recognition rate of faults in each experiment was 58% and 92%, respectively. The results are not surprising, since the second approach involves much more information on the target system and includes the first approach. The qualitative abstraction of structure and behaviour offers premises for rapid prototyping of diagnosis applications, each meant for a particular CFS (each installation in real life). It is of greatest importance to build specific diagnosis applications, since each installation is actually unique; the structure may contain the same modules as in other installations, but the parameters will vary, and the behaviour may depend on place, environment, operators, age or provider of components. A CAKE (Computer Aided Knowledge Elicitation) tool was built to replace the knowledge engineer when assisting the human diagnostician in target system analysis; then, a CASE tool automatically builds the fault diagnosis application for the given real installation. The fault diagnosis system represents and mixes both types of knowledge in a single application (which is built automatically by cascading the CAKE tool with a CASE tool).
6 Conclusion
Fault diagnosis of real complex flow conduction systems often relies on shallow knowledge from experts. It comes from their practice on faults of the functional units (modules), each meant for an end of the process. The faulty behaviour of an entire complex system is quite complicated and hardly known. The paper proposes a way to combine deep knowledge, on the structure of modules and on secondary effects (propagated by conduction), with shallow knowledge, the faults-to-primary-effects mapping (from practice or experiments). In this way, many irrelevant relations from effects to faults disappear and the fault diagnosis becomes manageable and suited to connectionist
implementation. The fuzzy representation changes continuous values into discrete ones, hence all data representation is discrete, the same way the human diagnostician uses information, and is also suited to neural network recognition for fault isolation.
References
[1] Ariton V., Fault diagnosis connectionist approach for multifunctional conductive flow systems, PhD dissertation, Galati, 1999.
[2] Ariton V., Bumbaru S., Fault Diagnosis in Conductive Flow Systems using Productive Neural Networks, The 12th International Conference on Control Systems and Computer Science CSCS12, Bucureşti, 1999, pp. 125-130.
[3] Cellier F.E., Modeling from Physical Principles, The Control Handbook (W.S. Levine, ed.), CRC Press, Boca Raton, pp. 98-108, 1995.
[4] Larsson, J. E., Knowledge-based methods for control systems, PhD Thesis Dissertation, Lund, Sweden, 1992.
[5] Mosterman P. J., Kapadia R., Biswas G., Using bond graphs for diagnosis of dynamical physical systems, Sixth Intl. Conference on Principles of Diagnosis, pp. 81-85, Goslar, Germany, 1995.
[6] Opdahl A. L., Sindre G., A taxonomy for real-world modelling concepts, Information Syst., 19(3), Pergamon, 1994, pp. 229-241.
Learning Translation Templates for Closely Related Languages Kemal Altintas*and Halil Altay Güvenir Department of Computer Engineering, Bilkent University Bilkent, 06800 Ankara Turkey
[email protected] [email protected] Abstract. Many researchers have worked on example-based machine translation and different techniques have been investigated in the area. In the literature, a method of using translation templates learned from bilingual example pairs was proposed. This paper investigates the possibility of applying the same idea to close languages where word order is preserved. In addition to applying the original algorithm to example pairs, we believe that the similarities between the translated sentences may always be learned as atomic translations. Since the word order is almost always preserved, there is no need for any prior knowledge to identify the corresponding differences. The paper concludes that applying this method to close languages may improve the performance of the system.
1 Introduction
Machine translation has been an interesting area of research since the invention of computers. Many researchers have worked on this subject and developed different methods. Currently, there are many commercial and operational systems and the performances of the machine translation systems are best when the languages are close to each other [2]. There are two main approaches in corpus-based machine translation: statistical methods and example based methods. All corpus-based methods require the presence of a bilingual corpus in hand. The necessary translation rules and lexicons are automatically derived from this corpus. Example based methods in machine translation use previously translated examples to form a “translation memory” for the translation process [3]. There are three main components of example-based machine translation (EBMT): matching fragments against a database of real examples, identifying the corresponding translation fragments and recombining these to give the target text [7]. *
Currently affiliated with Information and Computer Science Department, the University of California, Irvine.
A detailed review of example-based machine translation systems can be found in [9]. The idea of learning generalized translation templates for machine translation was investigated by Cicekli and Güvenir [5]. They proposed a method for learning translation templates from bilingual translation examples. Their system is based on analyzing similarities and differences between two translation example pairs. There is no linguistic analysis involved in the method and the system depends entirely on string matching. The authors claim that the method is language independent and they show that it works for Turkish and English, which are two virtually unrelated languages. The principal idea of the translation template learning framework as presented in [5] is based on a heuristic to infer the correspondences between the patterns in the source and target languages from two given translation pairs. The similarities between the source language sentences are identified and assumed to correspond to the similar parts in the target language. Also, the differences in the source language sentences should correspond to the differences in the target language sentence pair. The system they present identifies the similarities and differences between source and target language pairs and learns generalized translation rules from these examples. In this paper, we investigate the possibility of applying the same idea to closely related languages by using the corresponding translated sentences themselves instead of using two examples. We take Turkish and Crimean Tatar as the example closely related language pair and we believe that the idea can be developed and applied to other close language pairs. The rest of the paper is organized as follows: the next section introduces the concept of translation templates and Section 3 gives the details of the learning process, comparing it against the method proposed in [5]. Section 4 discusses some weak points of the approach that we present here and the last section summarizes the ideas and concludes the paper.
2 Translation Templates
A translation template is a generalized translation exemplar pair where some components are generalized by replacing them with variables in both sentences. Consider the following example:

X1 +Verb+Pos+Past+A1sg   ↔   Y1 +Verb+Pos+Past+A1sg
gel   ↔   kel

The left-hand side (first) part in this example, and in the following examples throughout the paper, refers to Turkish and the right-hand side (second) part refers to Crimean Tatar. The first template means that whenever the sequence "+Verb+Pos+Past+A1sg" follows any sequence that can be put in place of the variable X1, it can be translated into "+Verb+Pos+Past+A1sg" provided that it follows another sequence Y1, which is the translation of X1. In other words, after learning this rule, we can translate a sentence ending in "+Verb+Pos+Past+A1sg" provided that the beginning of the sentence can also be translated using the previously
758
Kemal Altintas and Halil Altay Güvenir
learned rules. The second template is an atomic template, which can be read as: "gel" (come) in Turkish always corresponds to "kel" in Crimean Tatar. Since Turkish and all other Turkic languages are agglutinative languages, using the surface form (actual spelling) of the words may not be helpful. For example, the Turkish word "geliyoruz" (we are coming) corresponds to "kelemiz" in Crimean Tatar and they do not show much similarity at first sight. However, if we morphologically analyze the two words we get:

geliyoruz   gel+Verb+Pos+Prog1+A1pl
kelemiz     kel+Verb+Pos+Prog1+A1pl
The two analyses are similar except for the roots. Thus, using the morphological analyses of the two words may help us to learn much more rules. For the morphological analysis of Turkish, we used the analyzer developed by Oflazer [8]. For the Crimean Tatar part, we used the analyzer described in [1].
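To illustrate how such templates could be applied once learned, here is a minimal sketch; the string-based template store and the first-match strategy are assumptions, not the mechanism of [5].

```python
# Templates are stored as (source_pattern, target_pattern) pairs of lexical-form
# strings, with "X1"/"Y1" marking corresponding variables. Illustrative only.
TEMPLATES = [
    ("X1 +Verb+Pos+Past+A1sg", "Y1 +Verb+Pos+Past+A1sg"),   # generalized
    ("gel", "kel"),                                          # atomic
    ("ev+Noun+A3sg+Pnon+Nom", "ev+Noun+A3sg+Pnon+Nom"),      # atomic
]

def translate(source):
    """Translate a lexical-form string by template matching (first match wins)."""
    for src, tgt in TEMPLATES:
        if "X1" in src:
            prefix, suffix = src.split("X1")
            if source.startswith(prefix) and source.endswith(suffix):
                bound = source[len(prefix):len(source) - len(suffix)]
                inner = translate(bound)            # translate the variable part
                if inner is not None:
                    t_prefix, t_suffix = tgt.split("Y1")
                    return t_prefix + inner + t_suffix
        elif source == src:
            return tgt
    return None

print(translate("gel +Verb+Pos+Past+A1sg"))   # -> "kel +Verb+Pos+Past+A1sg"
```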
3 Learning Translation Templates
Close languages such as Turkish and Crimean Tatar share most parts of their grammars and vocabularies. The word order in close languages is most of the time the same and even the ambiguities are preserved [6: p.807]. The first phase of the translation template learning algorithm is identifying the similarities and differences between the two sentences. A similarity is a non-empty sequence of common items in both sentences. Actually, the similarity is an exact matching between sub-strings of the sentences. A difference is the opposite of a similarity: it is a non-common sequence of characters between the two sentences. In other words, a difference is what is not a similarity. The following translation pair gives the similarities as underlined:

geliyoruz   gel+Verb+Pos+Prog1+A1pl
kelemiz     kel+Verb+Pos+Prog1+A1pl
A matching sequence between the sentences is a sequence of similarities and differences with the following properties:
• A similarity is followed by a difference and a difference is followed by a similarity. Two consecutive similarities or two consecutive differences cannot occur in a match sequence.
• If a terminal occurs in a similarity, it cannot occur in a difference.
• If a terminal occurs in a difference in one language, it cannot occur in a difference in the other language.
• A terminal occurring in both sentences must appear exactly n times, where n >= 1.
• If a terminal occurs more than once in both sentences, its i-th occurrence in both sentences must end up in the same similarity of their minimal match sequence.
If these rules are satisfied, then there is a unique match for the sentences, or there is no match. The details of the algorithm that finds the similarities and differences between the two sentences are explained in [4]. Once the similarities and the differences are identified, the system replaces the differences with variables to construct a translation template. If there is no difference between the sentences and the match is composed of only a single similarity, then it is learned as an atomic template. Many times, Turkish words and their Crimean Tatar counterparts are the same. For example, both the surface and lexical forms of the words "ev = ev+Noun+A3sg+Pnon+Nom" (house) and "bildim = bil+Verb+Pos+Past+A1sg" (I knew) are the same in Turkish and Crimean Tatar. For "ev", the following translation template is learned:

ev+Noun+A3sg+Pnon+Nom   ↔   ev+Noun+A3sg+Pnon+Nom

Although [5] does not discuss matching pairs with a single similarity, such pairs exist between close languages and can be learned. It is always possible that a variable in a template may have to be replaced with a noun like the one above. Consider the sentence "ev aldım = ev+Noun+A3sg+Pnon+Nom al+Verb+Pos+Past+A1sg" (I bought a house). If we have a template like

X1 al+Verb+Pos+Past+A1sg   ↔   Y1 al+Verb+Pos+Past+A1sg

we can easily replace X1 with "ev+Noun+A3sg+Pnon+Nom" for the translation. If the matching sequence is composed of a single similarity and a single difference, then the difference is replaced with a variable and the similarity is preserved. Also, the differences and the similarities are learned as separate atomic templates. For the word pair "geldim = gel+Verb+Pos+Past+A1sg" (I came) and "keldim = kel+Verb+Pos+Past+A1sg", the following templates are learned:

X1 +Verb+Pos+Past+A1sg   ↔   Y1 +Verb+Pos+Past+A1sg
+Verb+Pos+Past+A1sg   ↔   +Verb+Pos+Past+A1sg
gel   ↔   kel

When the similarities are at the beginning, the same rule applies. The differences at the end are replaced with variables, and the similarities and differences are learned as separate atomic templates. When there are two similarities surrounding a single difference in the sentences, the difference is replaced with a variable and the differences and the similarities are learned as separate templates. For the sentence pair "eve geldim = ev+Noun+A3sg+Pnon+Dat gel+Verb+Pos+Past+A1sg" (I came home) and "evge keldim = ev+Noun+A3sg+Pnon+Dat kel+Verb+Pos+Past+A1sg", the following rules are learned:

ev+Noun+A3sg+Pnon+Dat X1 +Verb+Pos+Past+A1sg   ↔   ev+Noun+A3sg+Pnon+Dat Y1 +Verb+Pos+Past+A1sg
gel   ↔   kel
ev+Noun+A3sg+Pnon+Dat   ↔   ev+Noun+A3sg+Pnon+Dat
+Verb+Pos+Past+A1sg   ↔   +Verb+Pos+Past+A1sg

For the cases where there is more than one difference, the system should learn templates only if at least all but one of the differences have previously learned correspondences. Consider the following sentence pair:

okula geldim (I came to school)
okul+Noun+A3sg+Pnon+Dat gel+Verb+Pos+Past+A1sg

mektepke keldim
mektep+Noun+A3sg+Pnon+Dat kel+Verb+Pos+Past+A1sg

According to [5], the system should not learn anything if it does not know whether "okul" (school) is really the translation of "mektep" (school) or of "kel" (come). Actually, it is possible to learn rules without requiring that we know the corresponding differences. The algorithm proposed in [5] requires that at least all but one of the difference correspondences are known. That algorithm is a general method for learning and the system is language independent. The experiments were done for Turkish and English, where the word order is clearly different. Thus, for the general system, it might be necessary to verify that all but one of the differences have corresponding translations in hand. However, for close language pairs, such as Turkish and Crimean Tatar, the word order is almost always preserved in the translation. Thus, if we know that our example translations are fully correct, we can learn the following templates without requiring any preconditions:

X1 +Noun+A3sg+Pnon+Dat X2 +Verb+Pos+Past+A1sg   ↔   Y1 +Noun+A3sg+Pnon+Dat Y2 +Verb+Pos+Past+A1sg
okul   ↔   mektep
+Noun+A3sg+Pnon+Dat   ↔   +Noun+A3sg+Pnon+Dat
gel   ↔   kel
+Verb+Pos+Past+A1sg   ↔   +Verb+Pos+Past+A1sg
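A minimal sketch of this learning step for close languages, using Python's difflib to find the similarities; the pre-tokenized input (root and feature groups as separate tokens) and the positional pairing of differences are assumptions consistent with the preserved word order, not a reimplementation of [5].

```python
from difflib import SequenceMatcher

def learn_templates(turkish_tokens, tatar_tokens):
    """Learn a generalized template plus atomic templates from one translation
    pair, assuming word/morpheme order is preserved (close languages)."""
    sm = SequenceMatcher(a=turkish_tokens, b=tatar_tokens, autojunk=False)
    general_src, general_tgt, atomics = [], [], []
    var = 0
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            sim = turkish_tokens[i1:i2]
            general_src += sim
            general_tgt += sim
            atomics.append((sim, sim))          # a similarity -> atomic template
        else:                                    # a difference, paired positionally
            var += 1
            general_src.append(f"X{var}")
            general_tgt.append(f"Y{var}")
            atomics.append((turkish_tokens[i1:i2], tatar_tokens[j1:j2]))
    return (general_src, general_tgt), atomics

pair = ("okul +Noun+A3sg+Pnon+Dat gel +Verb+Pos+Past+A1sg".split(),
        "mektep +Noun+A3sg+Pnon+Dat kel +Verb+Pos+Past+A1sg".split())
print(learn_templates(*pair))
```

On this pair the sketch yields the generalized template with X1/X2 and Y1/Y2 plus the atomic correspondences okul/mektep and gel/kel, mirroring the templates listed above.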
4 Discussions
There are cases where the idea is not applicable. Consider the following phrases:

bildiğim yer (the place where I know)
bil+Verb+Pos^DB+Adj+PastPart+P1sg yer+Noun+A3sg+Pnon+Nom

bilgen yerim
bil+Verb+Pos^DB+Adj+PastPart+Pnon yer+Noun+A3Sg+P1sg+Nom
The difference between the two sentences is that the possessive marker in Turkish follows the past participle morpheme affixed to the verb, whereas the possessive marker in Crimean Tatar follows the noun in this clause. Any translation program in such a case should identify that this is an adjectival clause made with a past participle and should move the possessive marker that comes after the verb to its place after the noun. The current algorithm cannot deal with such a case, regardless of whether we have any prior information or not. Since the differences between the two sentences are only the possessive markers, we cannot have prior information like

P1sg   ↔   Pnon

which is totally wrong. However, the approach that uses example pairs is much safer and can identify a template for this case:

Turkish:
bildiğim yer (the place that I know)
bil+Verb+Pos^DB+Adj+PastPart+P1sg yer+Noun+A3sg+Pnon+Nom
bildiğim ev (the house that I know)
bil+Verb+Pos^DB+Adj+PastPart+P1sg ev+Noun+A3Sg+Pnon+Nom

Crimean Tatar:
bilgen yerim (the place that I know)
bil+Verb+Pos^DB+Adj+PastPart+Pnon yer+Noun+A3sg+P1sg+Nom
bilgen evim (the house that I know)
bil+Verb+Pos^DB+Adj+PastPart+Pnon ev+Noun+A3Sg+P1sg+Nom

From these two examples, we can derive the template:

bil+Verb+Pos^DB+Adj+PastPart+P1sg X1 +Noun+A3sg+Pnon+Nom   ↔   bil+Verb+Pos^DB+Adj+PastPart+Pnon Y1 +Noun+A3Sg+P1sg+Nom

However, this is an exceptional case and the overwhelming majority of cases can be covered with the approach that we presented in this paper.
5 Conclusion
Corpus-based approaches in language processing have attracted increasing interest. Example-based machine translation is also considered as an alternative to traditional rule-based methods, with its capability to learn the necessary linguistic and semantic knowledge from the translation examples. Cicekli and Güvenir [5] proposed a method to learn translation templates from bilingual translation examples. They also showed that the method is applicable to Turkish and English, which are two unrelated languages having completely different
characteristics. Their method requires two similar translation example pairs to derive a template. Further, they require that the similarities and differences be identified and that the corresponding translations for almost all differences be known in order to derive a template from the given example pair. In this paper, we extended their approach to closely related languages and, taking Turkish and Crimean Tatar as an example, we investigated the possibility of using the translated sentences themselves instead of a pair of sentences to derive rules. The first observation for close languages is that it is possible to have cases where the two sentences are exactly the same in both languages, so such a pair can be learned as an atomic template. Secondly, similarities can always be learned as atomic templates regardless of the number of differences between the sentences. Since the word and morpheme order is usually preserved in close languages, it is possible to say that a similarity is always a correspondence between the languages. Finally, we saw that in most cases there is no need to know any explicit correspondences between the differences in order to derive templates. Cicekli and Güvenir require that if there are n > 1 differences between the sentences, at least n-1 of the correspondences must be known. However, for close languages, since the word order is preserved, there is usually no need to enforce any preconditions, provided that the translations are correct.
References
[1] Altintas, K., Cicekli, I., "A Morphological Analyser for Crimean Tatar", In Proceedings of the 10th Turkish Artificial Intelligence and Neural Network Conference (TAINN 2001), North Cyprus, 2001.
[2] Appleby, S., Prol, M. P., "Multilingual World Wide Web", BT Technology Journal Millennium Edition, Vol. 18, No. 1, 1999.
[3] Carl, M., "Recent research in the field of example-based machine translation", Computational Linguistics and Intelligent Text Processing, LNCS 2004, pp. 195-196, 2001.
[4] Cicekli, I., "Similarities and Differences", In Proceedings of SCI 2000, pp. 331-337, Orlando, FL, July 2000.
[5] Cicekli, I., and Güvenir, H. A., "Learning Translation Templates from Bilingual Translation Examples", Applied Intelligence, Vol. 15, No. 1, pp. 57-76, 2001.
[6] Jurafsky, D., Martin, J. H., Speech and Language Processing, Prentice Hall, 2000.
[7] Nagao, M., A Framework of a Mechanical Translation Between Japanese and English by Analogy Principle, In Artificial and Human Intelligence, Amsterdam, 1984.
[8] Oflazer, K., "Two-level Description of Turkish Morphology", Literary and Linguistic Computing, Vol. 9, No. 2, 1994.
[9] Somers, H., "Review Article: Example-based Machine Translation", Machine Translation, Vol. 14, pp. 113-157, 1999.
Implementation of an Arabic Morphological Analyzer within Constraint Logic Programming Framework Hamza Zidoum Department of Computer Science, SQU University PoBox36, Al Khod PC 132 Oman
[email protected] Abstract. This paper presents an Arabic Morphological Analyzer and its implementation in clp(FD), a constraint logic programming language. The Morphological Analyzer (MA) is a component of an architecture that can process unrestricted text from a source such as the Internet. The morphological analyzer uses a constraint-based model to represent the morphological rules for verbs and nouns, a matching algorithm to isolate the affixes and the root of a given word-form, and a linguistic knowledge base consisting of lists of markers. The morphological rules fall into two categories: the regular morphological rules of the Arabic grammar and the exception rules that represent the language exceptions. clp(FD) is particularly suitable for the implementation of our system thanks to its double reasoning: symbolic reasoning expresses the logic properties of the problem and facilitates the implementation of the linguistic knowledge base and heuristics, while constraint satisfaction reasoning on finite domains uses constraint propagation to keep the search space manageable.
1 Introduction
A morphological analysis module is inherent to the architecture of any system that is intended to allow a user to query a collection of documents, process them, and extract salient information, as, for instance, systems that handle Arabic texts and retrieve information expressed in the Arabic language over the Internet [11]. In this paper we present the implementation of an Arabic morphological analyzer that is a component of an architecture that can process unrestricted text, within a CLP framework [12]. Constraint Programming is particularly suitable for the implementation of our system thanks to its double reasoning: symbolic reasoning expresses the logic properties of the problem and facilitates the implementation of the linguistic knowledge base and heuristics, while constraint satisfaction reasoning on finite domains uses constraint propagation to keep the search space manageable. Constraints present an overwhelming advantage: declarativity. Constraints describe what the solution is and leave the answer to the question of how to solve them to the underlying solvers. A typical constraint-based system has a two-level architecture
consisting of (1) a programming module, i.e. a symbolic reasoning module that expresses the logical properties of the problem, and (2) a constraint module that provides a computational domain (reals, booleans, sets, finite domains, etc.) together with reasoning about the properties of the constraints, such as satisfiability and reduction, through algorithms known as solvers. Constraints thus reduce the gap between the high-level description of a problem and the implemented code. The objective is to process a text in order to facilitate its use by a wide range of further applications, e.g. text summarization and translation. The system is intended to process unrestricted text; hence the criteria of robustness and efficiency are critical and highly desirable. To fulfill these criteria, we made the following choices:
1. avoid the use of a dictionary, as is the case in classical morphological analyzers. Indeed, the coverage of such a tool is limited to a given domain and cannot cope with unrestricted texts from dynamic information sources such as the Internet;
2. deal with unvoweled texts, since most Arabic texts available on the Internet are written in modern Arabic, which usually does not use diacritical marks.
In order to implement the morphological analyzer, we used the contextual exploration method [1]. It consists of scanning a given linguistic marker and its context (surrounding tokens in a given text), looking for linguistic clues that guide the system to make the suitable decision. In our case, the method scans an input token and tries to find the required affixes in order to associate the root form and the corresponding morpho-syntactic information. Arabic is known to be a highly inflectional language; its famous pattern model based on the CV (Consonant, Vowel) analysis has widely been used to build computational morphological models [2, 3, 4]. During the last decade, an increasing interest in implementing Arabic morphological analyzers has been noticed [5, 6, 7, 8, 9, 10, 12]. Almost all systems developed, in industry as well as in research, make use of a dictionary to perform morphological analysis. For instance, in France, the Xerox Centre developed an Arabic morphological analyzer [7] using the finite-state technique. The system uses an Arabic dictionary containing 4930 roots that are combined with patterns (an average of 18 patterns for every root). This system analyses words that may include full diacritics, partial diacritics, or no diacritics; if no diacritics are present it returns all the possible analyses of the word. The remaining sections of this paper are organized as follows. Section two describes the system design and its architecture. Section three is dedicated to the description of the regular rules. The matching algorithm is described in section four. Finally, section five concludes the paper with future directions to extend this work.
2
System Design and Implementation
The morphological analyzer finds all word root forms and associates the morpho-syntactic information with the tokens.
Fig. 1. The morphological analyzer components (the text representation feeds tokens to the matching algorithm, which consults the regular rules, the exception rules and the exception lists, selects a candidate rule, and returns the updated token)
Within the text representation, a token includes the following fields (a compact sketch in code follows the list):
1. Name: contains the name of the token. Every token is assigned a name that allows the system to identify it.
2. Value: the word-form as it appears in the text before any processing.
3. Root: stores the root of the word, computed during morphological analysis.
4. Category, Tense, Number, Gender and Person: the same fields as in the regular rules.
5. Rule applied: stores the identifier of the rule that has been fired to analyse the token.
6. Sentence: a reference to the object "Sentence" containing the token in the text. This is a relationship that holds between the token and its corresponding sentence.
7. Order: stores the rank of the token in the sequence from the beginning of the text. It is used to compare the sequential order of tokens in the text.
8. Positions: the offset positions of the token in the text, used to highlight a relevant token when displaying the results to the user.
9. Format: the format (boldface, italics, …) applied to the token in the text.
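For illustration only, the token structure can be pictured as a simple record. The following Python sketch is not part of the clp(FD) implementation described in this paper; the field names simply mirror the list above.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Token:
    # Identification and raw form
    name: str                      # e.g. "T1123"
    value: str                     # word-form as it appears in the text
    # Filled in by morphological analysis
    root: Optional[str] = None
    category: Optional[str] = None         # verb or noun
    tense: Optional[str] = None
    number: Optional[str] = None
    gender: Optional[str] = None
    person: Optional[str] = None
    rule_applied: Optional[str] = None     # e.g. "V28"
    # Position of the token in the text
    sentence: Optional[str] = None         # reference to the containing sentence
    order: Optional[int] = None            # rank of the token in the text
    positions: Optional[Tuple[int, int]] = None  # character offsets
    format: Optional[str] = None           # boldface, italics, ...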
The morphological analyzer (Fig. 1) includes two kinds of rules: regular morphological rules and exception rules that represent the morphological exceptions. Lists of exceptions contain all the markers that do not fall under the regular-rule category. When analysing input tokens, the matching algorithm attempts to match the affixes of the token against a regular rule. If it does not succeed, it attempts to apply an exception rule by looking into the exception lists. In view of a rapid-prototyping software implementation strategy, only the regular rule specifications have been considered for implementation; after a test phase, the implementation of the exceptions is underway. We used the constraint logic programming language clp(FD) [13] to build the Arabic morphological analyzer, in order to benefit from several advantages. CLP is particularly suited to rapid prototyping because it provides a high level of abstraction. The reason lies in the neat separation between the declarative model (how to express the problem to be solved) and the operational model (how the problem is actually solved). A typical constraint-based system has a two-level architecture consisting of a
programming module, i.e. a symbolic reasoning module that expresses the logical properties of the problem and facilitates the implementation of the linguistic knowledge base and heuristics, while constraint satisfaction reasoning on (finite) domains uses constraint propagation to keep the search space manageable [12].
3
Regular Rules
Regular Arabic verb and noun forms have a fixed pattern of the form "prefix+root+suffix"; they can thus be handled by automatic procedures, since the identification of the affixes is enough to extract the root form and associate the morpho-syntactic information. A regular rule models a spelling rule for adding affixes. The structure of a regular rule consists of nine fields that can be grouped into three classes: i) Name and Class identify the object in the system; ii) Prefix and Suffix store the prefix and suffix that are attached to a given token; iii) Category, Tense, Number, Gender, and Person store the morpho-syntactic information inferred from a token. For instance, consider the Arabic word "يكتبون" (in the active mode: 'they write'), which is composed of the three following morphemes: the root of the verb, "كتب" (/ktb/, the notion of writing), the prefix "يـ", which denotes both the present tense and the third person, and the suffix "ون", which denotes the masculine and the plural. The rule that analyses this word is represented in Fig. 2 (a). In Fig. 2 (b) the token is shown before matching, and in Fig. 2 (c) the token attributes are updated:
Fig. 2. Matching regular rules:
(a) Rule — Name: V28, Class: regular-rule, Prefix: يـ, Suffix: ون, Category: verb, Tense: present, Number: plural, Gender: masculine, Person: third.
(b) Token before matching — Name: T1123, Class: Token, Value: يكتبون, Sentence: S052, Order: 1123, Positions: (3382, 3388); Root, Category, Tense, Number, Gender, Person, Rule-applied and Format still empty.
(c) Token after the update — Name: T1123, Class: Token, Value: يكتبون, Root: كتب, Category: verb, Tense: present, Number: plural, Gender: masculine, Person: third, Rule-applied: V28, Sentence: S052, Order: 1123, Positions: (3382, 3388).
The structure of the regular-rule class is detailed below:
1. Name: uniquely identifies a rule.
2. Class: the class of the object.
3. Prefix: a sequence of characters at the beginning of a token.
4. Suffix: a sequence of characters at the end of a token.
5. Category: the part of speech to which the token belongs. It can hold two possible values: verb or noun.
6. Tense: the tense associated with the token in the case of a verb.
7. Number: the cardinality of the token: singular, dual or plural.
8. Gender: the gender associated with the token: either masculine or feminine.
9. Person: valid only for verbs; it represents the first, second or third person.
Thus, the rules are represented as predicates which implement the attributes discussed above:

regularRule(p_refix, s_uffix, c_at, t_ens, n_umber, g_ender, p_erson, n_ame, c_lass) :-
    p_refix = 'يـ',
    s_uffix = 'ون',
    c_at = verb,
    t_ens = present,
    n_umber = plural,
    g_ender = masculin,
    p_erson = third,
    n_ame = v28,
    c_lass = regular_rule.

4
Matching Algorithm
The tokens extracted from the source text are represented through the following fact base, whose distinctive arguments are the value of the token itself and its corresponding root classification inferred by the algorithm:

token(name, value, root, cat, tense, number, gender, person, format, ruleApplied, sentence, order, position).

The aim of the token-to-rule matching algorithm, implemented in the predicate tokenToRule/1 (shown below), is to fetch the rule that extracts the root of a given token t and, consequently, to associate the morpho-syntactic information with the token. The rule is identified if it matches a given pair of prefix and suffix. Thus, the affixes are first extracted from the token's value v through getPrefix/3 and getSuffix/3, and then the matching operation is performed by regularRule/9.

tokenToRule(t) :-
    t = token(_, v, _, _, _, _, _, _, _, r_ule, _, _, _),
    tokenToRule(v, r_ule, p_refix, s_uffix, 3, 1).
tokenToRule(v, r_ule, p_refix, s_uffix, i, j) :-
    i < 0, r_ule = null.
tokenToRule(v, r_ule, p_refix, s_uffix, i, j) :-
    j < 0, r_ule = null.
tokenToRule(v, r_ule, p_refix, s_uffix, i, j) :-
    length(v)-i-j > 1,
    getPrefix(v, i, p_refix),
    getSuffix(v, j, s_uffix),
    regularRule(p_refix, s_uffix, _, _, _, _, _, r_ule, _).

A rule is tried only when length(v)-i-j > 1, due to the fact that for regular words no root is less than two characters long. The affix extraction predicates are defined as follows:

getPrefix(v, i, p_refix) :-
    getPrefix1(v, i, [], p_refix).
getPrefix1(v, 0, p_refix, p_refix).
getPrefix1([c|v1], i, p_refix, p_refix1) :-
    getPrefix1(v1, i-1, [c|p_refix], p_refix1).

getSuffix(v, i, s_uffix) :-
    reverse(v, v1),
    getSuffix1(v1, i, [], s_uffix).
getSuffix1(v, 0, s_uffix, s_uffix1) :-
    reverse(s_uffix, s_uffix1).
getSuffix1([c|v1], i, s_uffix, s_uffix1) :-
    getSuffix1(v1, i-1, [c|s_uffix], s_uffix1).
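To make the search strategy concrete, here is a small Python sketch of the same affix-matching idea. It is an illustration only, not the clp(FD) code: the rule table RULES is a hypothetical stand-in for the regularRule facts, and only the example rule V28 from Fig. 2 is included.

# Hypothetical rule table: (prefix, suffix) -> morpho-syntactic information.
RULES = {
    ("ي", "ون"): {"name": "V28", "category": "verb", "tense": "present",
                  "number": "plural", "gender": "masculine", "person": "third"},
}

def token_to_rule(value, max_prefix=3, max_suffix=3):
    """Return (root, rule) for the first matching (prefix, suffix) pair, or None."""
    for i in range(max_prefix, -1, -1):          # candidate prefix length
        for j in range(max_suffix, -1, -1):      # candidate suffix length
            if len(value) - i - j < 2:           # regular roots keep at least 2 characters
                continue
            prefix = value[:i]
            suffix = value[len(value) - j:] if j else ""
            rule = RULES.get((prefix, suffix))
            if rule is not None:
                root = value[i:len(value) - j] if j else value[i:]
                return root, rule
    return None

# Example: the verb "يكتبون" ('they write') yields the root "كتب" and rule V28.
print(token_to_rule("يكتبون"))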
5
Conclusion
A morphological analyzer is one of the essential components of any natural language processing architecture. The morphological analyzer presented in this paper is implemented within a CLP framework and is composed of three main components: a linguistic knowledge base comprising the regular and irregular morphological rules of the Arabic grammar, a set of linguistic lists of markers containing the exceptions handled by the irregular rules, and a matching algorithm that matches the tokens to the rules. The complete implementation of the system is underway. In a first phase, we have considered only the regular rules for implementation. Defining a strategy to match the regular and irregular rules and extending the linguistic lists of markers are the future directions of this project.
References
[1] J-P. Desclès, Langages applicatifs, Langues naturelles et Cognition, Hermès, Paris, 1990.
[2] A. Arrajihi, The Application of Morphology, Dar Al Maarefa Al Jameeya, Alexandria, 1973. (in Arabic)
[3] F. Qabawah, Morphology of Nouns and Verbs, Al Maaref Edition, 2nd edition, Beyruth, 1994. (in Arabic)
[4] G. A. Kiraz, "Arabic Computational Morphology in the West", In Proceedings of the 6th International Conference and Exhibition on Multi-lingual Computing, Cambridge, 1998.
[5] B. Saliba and A. Al Dannan, "Automatic Morphological Analysis of Arabic: A Study of Content Word Analysis", In Proceedings of the Kuwait Computer Conference, Kuwait, March 3-5, 1989.
[6] M. Smets, "Paradigmatic Treatment of Arabic Morphology", In Workshop on Computational Approaches to Semitic Languages, COLING-ACL'98, August 16, Montreal, 1998.
[7] K. Beesley, "Arabic Morphological Analysis on the Internet", In Proceedings of the International Conference on Multi-Lingual Computing (Arabic & English), Cambridge, G.B., 17-18 April, 1998.
[8] R. Alshalabi and M. Evens, "A Computational Morphology System for Arabic", In Workshop on Computational Approaches to Semitic Languages, COLING-ACL'98, August 16, Montreal, 1998.
[9] R. Zajac and M. Casper, "The Temple Web Translator", 1997. Available at: http://www.crl.nmsu.edu/Research/Projects/tide/papers/twt.aaai97.html
[10] T. A. El-Sadany and M. A. Hashish, "An Arabic Morphological System", IBM Systems Journal, Vol. 28, No. 4, 600-612, 1989.
[11] J. Berri, H. Zidoum, Y. Attif, "Web-based Arabic Morphological Analyser", In A. Gelbukh (Ed.): CICLing 2001, pp. 389-400, Springer-Verlag, 2001.
[12] K. Marriott and P. Stuckey, Programming with Constraints: An Introduction, MIT Press, 1998.
[13] P. Codognet and D. Diaz, "Compiling Constraints in clp(FD)", Journal of Logic Programming, 27:1-199, 1996.
Efficient Automatic Correction of Misspelled Arabic Words Based on Contextual Information Chiraz Ben Othmane Zribi and Mohammed Ben Ahmed RIADI Research Laboratory, University of La Manouba, National School of Computer Sciences, 2010 La Manouba, Tunisia
[email protected] [email protected] Abstract. We address in this paper a new method aiming to reduce the number of proposals given by automatic Arabic spelling correction tools. We suggest using the error's context in order to eliminate some correction candidates. The context consists of nearby words and can be extended to all the words in the text. We present the experiments we performed to validate some of our hypotheses, then detail the method itself, and finally describe the experimental protocol used to evaluate the method's efficiency. Our method was tested on a corpus containing genuine errors and has yielded good results: the average number of proposals has been reduced by about 75% (from 16.8 to 3.98 proposals on average).
1
Introduction
Existing spelling correctors are semi-automatic, i.e. they assist the user by proposing a set of candidates close to the erroneous word. For instance, word processors proceed by asking the user to choose the correct word among the automatically computed proposals. We aim to reduce the number of proposals provided by a spelling corrector. Two major preoccupations have motivated our interest in this problem. First, some applications need a totally automatic spelling corrector [1]. Second, we have noticed that the number of candidates given by correctors for the Arabic language is very large compared to what correctors give for other languages such as English or French. In such conditions, Arabic correctors may even be useless. The idea is to eliminate the less probable candidates rather than to make the corrector generate fewer candidates. The method we propose makes use of the error's context. It considers the nearby words as well as all the words in the text. In this paper we start by studying to what extent the problem is relevant in English, French and Arabic. Then, after a general presentation of the spelling corrector we used, we propose an initial assessment of it based on genuine errors. Finally, we present our method, followed by an evaluation protocol measuring its efficiency.
2
Characteristics of the Arabic Language
The difficulties encountered when automatically correcting misspelled written forms are not the same from one language to another. The correction methods applied to languages such as English cannot be applied to agglutinative languages such as Arabic [2]. As concerns Arabic, apart from vowel marks and the agglutination of affixes (articles, prepositions, conjunctions, pronouns) to the forms, we noticed, thanks to an experiment described in [3], that "Arabic words are lexically very close". According to this experiment, the average number of forms that are lexically close1 is 3 for English and 3.5 for French. For Arabic without vowel marks, this number is 26.5. Arabic words are thus much closer to one another than French and English words. This proximity of Arabic words has a double consequence: firstly on error detection, where words that are recognized as correct can in fact hide an error; secondly on error correction, where the number of proposals for an erroneous form is liable to be excessively high. One could estimate a priori that an average of 27 forms will be proposed for the correction of each error.
3
General Presentation of Our Spelling Checker
Our method of checking and correcting Arabic words is mainly based on the use of a dictionary. This dictionary contains inflected forms with vowel marks (1 600 000 entries)2, such as (pencil), (pencils), …, together with linguistic information. Because of the agglutination of prefixes (articles, prepositions, conjunctions) and suffixes (linked pronouns) to the stems (inflected forms), the dictionary is not sufficient for recognizing the words as they are represented in Arabic texts (e.g. with their pencils). The ambiguities of decomposing words make it difficult to recognize the inflected forms and the affixes. We therefore complemented the dictionary with algorithms allowing the morphological analysis of textual forms. Apart from the dictionary of inflected forms, this morphological analyzer makes use of a small dictionary that contains all affixes (90 entries) and applies a set of rules for the search of all possible partitions into prefix, stem and suffix. These same dictionaries and grammar are also used for detecting and correcting errors. Detection of errors occurs during the morphological analysis. Correction is carried out by means of an improved "tolerant" version of the morphological analyzer.
1 Words that are lexically close: words that differ by one single editing error (substitution, addition, deletion or inversion).
2 The dictionary contains 577 546 forms without vowel marks.
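As a rough illustration of the dictionary-plus-affix decomposition described above, the following Python sketch enumerates the (prefix, stem, suffix) partitions accepted by two dictionaries. It is not the actual system: the toy English dictionaries are assumptions used only to show the ambiguity of decomposition.

# Toy stand-ins for the two dictionaries described above (assumed content).
INFLECTED_FORMS = {"word", "words"}     # ~577,546 entries in the real system
PREFIXES = {"", "the"}                  # the real affix dictionary has ~90 entries
SUFFIXES = {"", "s"}

def decompositions(form):
    """All (prefix, stem, suffix) partitions accepted by the dictionaries."""
    results = []
    for i in range(len(form) + 1):
        for j in range(i, len(form) + 1):
            prefix, stem, suffix = form[:i], form[i:j], form[j:]
            if prefix in PREFIXES and suffix in SUFFIXES and stem in INFLECTED_FORMS:
                results.append((prefix, stem, suffix))
    return results

print(decompositions("thewords"))   # [('the', 'word', 's'), ('the', 'words', '')]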
4
Initial Evaluation of the Spelling Checker
To evaluate our correction system, we took into account the following measures:
• Coverage: the percentage of errors for which the corrector is not silent, i.e. makes proposals.
• Accuracy: the percentage of errors having the correct word among the proposals.
• Ambiguity: the percentage of errors for which the corrector gives more than one proposal.
• Proposal: the average number of correction proposals per erroneous word.
4.1
Experiment
Our first experiment bore on genuine errors. To this aim we took three texts (amounting to approximately 5000 forms) from the same field and containing 151 erroneous forms that come under one of the four editing operations. The invocation of our correction system on these texts gave the following results: Table 1. Initial evaluation of the spelling checker
Coverage  Accuracy  Ambiguity  Proposal
100%      100%      78.8%      12.5 forms

4.2
Comments
• The coverage and accuracy rates are 100%. This can be explained by the fact that we took into account only the errors whose solution was recognized by our morphological analyzer. The other errors (mostly neologisms) were excluded because we were concerned with the evaluation of the corrector, not the analyzer.
• The ambiguity rate is very high: more than 78% of the errors show more than one proposal in their correction.
• Although the average number of proposals is lower than the previously estimated average (27 forms), it remains high compared to other languages. In English, for example, the average number of proposals is 5.55 for artificial errors and 3.4 for real errors [4]. These results are expected, considering that Arabic words are lexically close.

5
Proposed Method
Our correction system assists the user by giving him a set of proposals that are close to the erroneous word.
We use the following notation:
We: an erroneous word
Wc: the correction of We
P = {p1, …, pn}: the set of proposals for the correction of We
Wctxt = {w-k, …, w-1, w1, …, wk}: the set of words surrounding the erroneous word We in the text (considering a window of size k)
In order to develop an entirely automatic correction, we would like to reduce the set P to a singleton corresponding to the correct word Wc, thus giving: Card(P) = 1 with Wc ∈ P. We shall try to minimize the number of proposals as much as possible by undertaking to eliminate the least probable ones. The use of the context, on which our method is based, occurs in two cases: firstly, considering only the words near the error; secondly, considering the whole set of words in the text containing the error. 5.1
Words in Context
The assumption is that each proposal pi has a certain lexical affinity with the words of the context of the erroneous word we wish to correct. Consequently, to classify the proposals and eliminate those that are the least probable, we examine the context and choose the proposals that are closest to the words of the context. To do so, we opted for a statistical method consisting of calculating, for each proposal, the probability of being the good solution with respect to the words surrounding the error in the text. Only the proposals with a probability deemed acceptable are kept; the others are eliminated. For each proposal we calculate p(pi\Wctxt), the probability that pi is the good solution, knowing that the erroneous word We is surrounded by the context Wctxt. Calculating this probability is not easy; it would take far too much data to generate reasonable estimates for such sequences directly from a corpus. We use instead the probability p(Wctxt\pi), by applying Bayes' inversion rule:

p(pi\Wctxt) = p(Wctxt\pi) × p(pi) / p(Wctxt) .    (1)
Since we are searching for the proposals with the highest value of p(pi\Wctxt), we need only calculate the value p(Wctxt\pi) × p(pi). The probability p(Wctxt), being the same for all the proposals (the context is the same), has no effect on the result. Supposing that the presence of a word in a context does not depend on the presence of the other words in this same context, we can carry out the following approximation, as was already shown by [5]:

p(Wctxt\pi) = ∏ j=-k,…,k p(wj\pi) .
All things considered, we calculate for each proposal pi:

∏ j=-k,…,k p(wj\pi) × p(pi)

where:
p(wj\pi) = (number of occurrences of wj with pi) / (number of occurrences of pi)
p(pi) = (number of occurrences of pi) / (total number of words)
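The ranking above can be pictured with a short Python sketch. It is illustrative only: the probability tables are assumed to come from the training stage described next, and the normalization used before thresholding is our own assumption.

from math import prod

def score(proposal, context, p_cooc, p_prior):
    """p(w-k..wk | pi) * p(pi) under the independence assumption above.

    p_cooc[(w, pi)]: probability of seeing word w in the context of proposal pi.
    p_prior[pi]:     probability of the proposal in the training corpus.
    """
    return prod(p_cooc.get((w, proposal), 0.0) for w in context) * p_prior.get(proposal, 0.0)

def keep_proposals(proposals, context, p_cooc, p_prior, threshold=0.3):
    """Keep only the proposals whose normalized score is deemed sufficient."""
    scores = {p: score(p, context, p_cooc, p_prior) for p in proposals}
    total = sum(scores.values()) or 1.0
    return [p for p, s in scores.items() if s / total > threshold]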
Experiment
Our experiment is carried out in two stages: a training stage, during which probabilities are collected for the proposals, and a testing stage, which consists in using these probabilities to choose among the proposals.
Training Stage. This stage consists in creating a dictionary of co-occurrences from a training corpus. The entries in this dictionary are the proposals given by our correction system for the errors that it detected in the text. We store with each entry its probability of appearance in the training corpus, p(pi), and all of its co-occurring words with their probability of co-occurrence, p(wj\pi).
Test Stage. This stage consists in submitting a test text to the correction system and accessing the dictionary of co-occurrences for each proposal in order to calculate the probability p(pi\Wctxt). Only the proposals with a probability deemed sufficient (> 0.3 in our case, the training corpus not being very big) are chosen.
Results. Training text: the corpus previously used containing the real errors (5 000 words). Testing text: a portion of the corpus (1 763 words, 61 of which are erroneous).
Table 2. Evaluation of the spelling checker: context words
               Coverage   Accuracy   Ambiguity   Proposal
Initially      100%       100%       88.52%      16.8 forms
Context words  100%       93.44%     72.13%      10.33 forms
The use of the words in context allowed us to reduce the number of proposals by approximately 40%. However, accuracy decreased: in 6.6% of cases the good solution was not among the proposals.
5.2
Words in the Text
The idea for this experiment was conceived after some counts carried out on the previously used corpus, which contained real errors. These counts showed us the following:
• A stem (textual form without affixes) appears 5.6 times on average.
• A lemma (canonical form) appears 6.3 times on average.
This leads us to the following deductions:
• In a text, words tend to repeat themselves.
• For canonical forms the frequency is higher than for stems. This is easy to understand: when one repeats a word, one can vary the gender, the number, the tense, … according to the context in which it occurs.
Starting from the idea that the words in a text tend to repeat themselves, one could reasonably think that the corrections of the erroneous words in a text can be found in the text itself. Consequently, the search for proposals for the correction of an erroneous word is from now on carried out with the help of dictionaries built from the words of the text containing the errors, instead of the two general Arabic dictionaries that we previously used, i.e. the dictionary of inflected forms and the dictionary of affixes. Two experiments were carried out to this end: the first bore on the use of the dictionary of the text's stems, the second on the use of the dictionary of all the inflected forms of the text's stems.
Experiment 1: Dictionary of the Text's Stems
The construction of the dictionaries required for this experiment went through the following stages:
1. Morphological analysis of the testing text. We obtain the morphological units of the text decomposed into: Prefix / Stem / Suffix.
2. Access, with all the stems of the text, to the general dictionary (577 546 forms without vowel marks), so as to obtain a dictionary of 1 025 forms.
3. Access, with all the affixes of the text, to the general dictionary of affixes (71 forms without vowel marks), so as to obtain a dictionary of 33 forms.
The correction of the text by using these two dictionaries gives the following results:
Table 3. Evaluation of the spelling checker: dictionary of the text's stems
Coverage   Accuracy   Ambiguity   Proposal
73.77%     97.61%     35.55%      2.36 forms
This table shows that the rate of ambiguity decreased by more than half. The average number of proposals also clearly decreased: it went down from 16.8 forms to 2.4 forms. This means that we succeeded in decreasing the number of proposals, but we also lost some coverage and accuracy.
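A compact sketch of the dictionary construction used in Experiment 1 is given below (Python, illustrative only). The helper analyze, which stands for the morphological analyzer returning a (prefix, stem, suffix) decomposition, is an assumption of the sketch.

def build_text_dictionaries(tokens, analyze, general_forms, general_affixes):
    """Restrict the general dictionaries to the stems and affixes found in the text.

    analyze(token) is an assumed helper returning (prefix, stem, suffix).
    """
    stems, affixes = set(), set()
    for token in tokens:
        prefix, stem, suffix = analyze(token)
        stems.add(stem)
        affixes.update({prefix, suffix})
    # Keep only entries known to the general dictionaries
    # (1 025 stems and 33 affixes in the experiment reported above).
    return stems & general_forms, affixes & general_affixes

Proposals for an erroneous word are then searched in these text-derived dictionaries instead of the full ones.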
Experiment 2: Dictionary of the Text's Inflected Forms
The second experiment is similar to the first, except that instead of the dictionary of the text's stems we used a dictionary of all the inflected forms of the text's stems. The correction of the testing text using this dictionary, as well as the dictionary of affixes built in the previous experiment, gives the following results:
Table 4. Evaluation of the spelling checker: dictionary of the text's inflected forms
Coverage   Accuracy   Ambiguity   Proposal
86.75%     92%        58%         4.88 forms
We notice that the dictionary of stems reduces the number of proposals more than the dictionary of inflected forms does. However, the latter gives better results in terms of coverage. As for accuracy, it decreases when the dictionary of inflected forms is used. Indeed, as the average number of proposals supplied by the corrector using the dictionary of inflected forms is higher, one can also expect a higher chance of proposal sets that do not contain the good solution (since there is more choice among the proposals).
6
Final Assessment
As a final experiment we aimed to combine both previous experiments: words in the text and words in context. In the first combination, the search for proposals was carried out in the dictionary of inflected forms of the words in the text. Each proposal was given a probability measuring its proximity to the context of the erroneous word it corrects, and the less plausible proposals were eliminated. We thus obtained an average of 2.68 proposals and a coverage rate of 82%. The second combination, which we prefer, looks for proposals in the general dictionary and then assigns them a contextual probability. The proposals that belong to the dictionary of inflected forms of the words in the text are weighted by 0.8 and the others by 0.2 (according to the experiment with the dictionary of inflected forms, the good solution comes from that dictionary in 80% of cases). We then proceed in the same manner as before, keeping only the most probable proposals (a brief sketch of this weighting follows Table 5). The average number of proposals obtained in this case is 3.98, with a coverage rate of 100% and an accuracy of 88.52%.
Table 5. Final evaluation of the spelling checker
               Coverage   Accuracy   Ambiguity   Proposal
Combination 1  81.97%     86%        46%         2.68 forms
Combination 2  100%       88.52%     62.29%      3.98 forms
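The second combination can be sketched as follows (Python, illustrative only; the function names and the final normalization are assumptions, while the 0.8/0.2 weights are those given above).

def combined_score(proposal, context_prob, text_forms):
    """Weight the contextual probability by membership in the text's inflected forms."""
    weight = 0.8 if proposal in text_forms else 0.2   # ~80% of correct words come from the text
    return weight * context_prob

def rank_proposals(proposals, context_probs, text_forms):
    """Rank proposals by their (normalized) combined score, most probable first."""
    scored = {p: combined_score(p, context_probs.get(p, 0.0), text_forms) for p in proposals}
    total = sum(scored.values()) or 1.0
    return sorted(((s / total, p) for p, s in scored.items()), reverse=True)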
7
Conclusion
In this work, we have been interested in reducing the number of candidates given by an automatic Arabic spelling checker. The method we have developed is based on using the lexical context of the erroneous word. Although we have considerably reduced the number of proposals, we think that our method can be improved, mainly by making use of the syntactic context as well as other contextual information. In fact, we manually measured the role this information could play if it were used, and we observed that syntactic constraints alone would be able to reduce the number of proposals by 40%, and this without taking into account the failures of an automatic syntactic analyzer.
References
[1] Kukich K.: Techniques for automatically correcting words in text. ACM Computing Surveys, Vol. 24, No. 4 (1992) 377-439
[2] Oflazer K.: Spelling correction in agglutinative languages. In: Proceedings of the 4th ACL Conference on Applied Natural Language Processing, Stuttgart, Germany (1994)
[3] Ben Othmane Zribi C., Zribi A.: Algorithmes pour la correction orthographique en arabe. In: TALN'99, 6ème Conférence sur le Traitement Automatique des Langues Naturelles, Corse (1999)
[4] Agirre E., Gojenola K., Sarasola K., Voutilainen A.: Towards a single proposal in spelling correction. In: Proceedings of COLING-ACL'98, Montréal (1998)
[5] Gale W., Church K. W., Yarowsky D.: Discrimination decisions for 100,000-dimensional spaces. In: Current Issues in Computational Linguistics (1994) 429-450
A Socially Supported Knowledge-Belief System and Its Application to a Dialogue Parser Naoko Matsumoto and Akifumi Tokosumi Department of Value and Decision Science Tokyo Institute of Technology 2-12-1 Ookayama, Meguro-ku, Tokyo, Japan 152-8552 {matsun,akt}@valdes.titech.ac.jp http://www.valdes.titech.ac.jp/~matsun/
Abstract. This paper proposes a dynamically-managed knowledge-belief system which can represent continuously-changing beliefs about the world. Each belief in the system has a component reflecting the strength of support from other agents. The system is capable of adapting to a contextual situation due to the continuous revision of belief strengths through interaction with others. As a paradigmatic application of the proposed socially supported knowledge-belief system, a dialogue parser was designed and implemented in Common Lisp CLOS. The parser, equipped with a belief management system, outputs (a) speaker intention, (b) conveyed meaning, and (c) hearer emotion. Implications of our treatment of intention, meaning, and affective content within an integrated framework are discussed.
1
Introduction
Belief, defined in this paper as a knowledge structure with a degree of subjective confidence, can play a crucial role in cognitive processes. The context reference problem in natural language processing is one area where beliefs can dramatically reduce the level of complexity. For instance, in natural settings, contrary to the standard view within pragmatics and computational pragmatics, many utterances can be interpreted with little or no reference to the situations in which the utterances are embedded. The appropriate interpretation in a language user of the utterance “Your paper is excellent” from possible candidate interpretations, such as “The speaker thinks my newspaper is good to read,” or “The speaker thinks my thesis is great and is praising my effort,” will depend on the user's belief states. If, at a given time t, the dominant belief held by the user is that paper = {thesis}, then the lexical problem of disambiguating whether paper = {newspaper, thesis,...} simply disappears. The present paper proposes a belief system that can reflect the beliefs of socially distributed agents, including the speaker and the hearer. Our treatment of the belief system also addresses another important issue; the communication of mental states. In most non-cynical situations, the utterance “Your V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 778-784, 2003. Springer-Verlag Berlin Heidelberg 2003
paper is excellent” will evoke a positive emotion within the hearer. While an important function of the utterance would be to convey the epistemic content of the word excellent (of high quality), another important function would be the transfer of affective content, by eliciting a state of happiness in the hearer. Our belief system treats both of these epistemic and affective functions within an integrated framework. In the first half of the paper, we focus on the design and implementation of a dialogue parser equipped with a dynamically-managed belief system, and, then, turn to discuss how the parser incorporates the affective meaning of utterances.
2
The Proper Treatment of Context
Despite the importance attached to it, concepts of context are generally far too simplistic. One of the most neglected problems in pragmatics-related research is how to identify the proper context. Although most pragmatic theorists regard context as being essential for utterance interpretation (e.g., [1]) it is usually treated as simply the obvious and given information. While computational pragmatics tends to see the problem more realistically and, for instance, deconstructs it into a plan-recognition problem (e.g., [2]) context (sometimes in the form of a goal) is generally defined in terms of data structures. With its cognitive stance, Relevance Theory [3] takes a more satisfactory approach to the problem asserting that assumption is the cue to utterance interpretation. For example, in the right context, on hearing the utterance “It is raining,” by constructing the assumption “if it rains, the party will be postponed,” the hearer can move beyond the mere literal interpretation of the utterance (describing the weather) to an appropriate interpretation of the utterance's deeper significance. The theory claims that assumptions are cognitively constructed to have the maximum relevance. We agree with this view, but would point out that the cognitive computation for the construction of assumptions has yet to be fully described. Sharing a similar motivation with Relevance Theory, we propose a new idea— meaning supported by other language users in the community—to solve the difficulties associated with the context identification problem. In our position, context is not a data structure or existing information, but is rather a mechanism of support that exists between the cognitive systems of members of a linguistic community. With this support mechanism, largely-stable states of language meaning (with some degree of ambiguity) are achieved and the mutual transfer of ideas (as well as emotions) is made possible. We will detail the proposed knowledge-belief system in the following sections.
3
The Socially Supported Knowledge-Belief System
The Socially Supported Knowledge-Belief System (SS-KBS) is a computational model of utterance comprehension which incorporates the dynamically-revisable belief system of a language user. The belief system models the user's linguistic and world knowledge with a component representing the degree of support from other users
in the language community. A key concept in this system is the socially supported knowledge-belief (sskb) — so named to emphasize its social characteristics. As a hearer model, the task of the SS-KBS is to identify the intention of the speaker which is embedded in an utterance, using its sskb database. When the speaker's intention is identified, the SS-KBS incorporates the intention within the knowledge-belief structure for the hearer. Thus, the hearer model builds its own knowledge-belief structure in the form of sskbs. In the SS-KBS, each knowledge-belief (sskb) has a value representing the strength of support from others. The knowledge-belief that has the highest level of support is taken as the most likely interpretation of a message, with the ranking order of a knowledge-belief changing dynamically through interaction with others. The current SS-KBS is designed as an utterance parser. Fig. 1 shows the general architecture of the SS-KBS model.
Fig. 1. General architecture of the SS-KBS (word analysis: get the next word, get its sskb from the socially supported knowledge-belief database, and store each sskb; at the end of the utterance: computation of intention, confirmation with the speaker, and final decision; belief strengths, e.g. for "paper": 1st thesis 7, 2nd newspaper 5, are incremented by +1 when a speaker's belief such as paper = thesis is confirmed)
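To make the support-based ranking concrete, here is a minimal sketch in Python rather than the authors' CLOS implementation. The +1 update per confirming interaction follows Fig. 1; everything else (class and method names) is an assumption of the sketch.

class SSKB:
    """A word's socially supported knowledge-beliefs, ranked by support strength."""

    def __init__(self, word, beliefs):
        self.word = word
        self.beliefs = dict(beliefs)          # belief -> support strength

    def interpret(self):
        """The belief with the highest support is the preferred interpretation."""
        return max(self.beliefs, key=self.beliefs.get)

    def confirm(self, belief, amount=1):
        """Revise the ranking after an interaction supporting `belief`."""
        self.beliefs[belief] = self.beliefs.get(belief, 0) + amount

paper = SSKB("paper", {"thesis": 7, "newspaper": 5})
print(paper.interpret())      # 'thesis'
paper.confirm("thesis")       # another speaker uses "paper" to mean thesis
print(paper.beliefs)          # {'thesis': 8, 'newspaper': 5}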
The SS-KBS has its origins in Web searching research where, in a typical search task, no information about the context of the search or about the searcher is available, yet a search engine is expected to function as if it knows all the contextual information surrounding the target words [4,5,6]. The main similarity between the SS-KBS and search engines lies in the weighted ordering of utterance meanings (likely meaning first) by the system and the presumed ordering of found URLs (useful site first) by search engines. Support from others (other sites in the case of Web search) is the key idea in both cases. 3.1
Implementation of the SS-KBS as a Dialogue Parser
In the SS-KBS, an agent's knowledge-belief reflects the level of support from other agents obtained through verbal interaction. Our first natural choice for an application of this system is a dialogue parser, called the Socially Supported Knowledge-Belief
Parser (SS-KBP), which has been implemented in the Common Lisp Object System (CLOS). By applying the system, it is possible to interpret utterances without context and to produce utterances based on the dynamically revised database of knowledge-beliefs. 3.2
Internal Structure of a Socially Supported Knowledge-Belief (sskb)
We designed the SS-KBP as a word-based parser, because this has an architecture that can seamlessly accommodate various types of data structures. The SS-KBP processes a sequence of input words using its word knowledge, which consists of three types of knowledge: (a) grammatical knowledge, (b) semantic knowledge, and (c) discourse knowledge (Table 1). Each knowledge type is represented as a daemon, which is a unit of program code executable in a condition-satisfaction mechanism [7]. The grammatical knowledge controls the behavior of the current word according to its syntactic category. The semantic knowledge deals with the maintenance of the sskb database derived from the current word. The sskb has a data structure similar to the deep cases developed in generative semantics. Although the original notion of deep case was only applied to verbs, we have extended the notion to nouns, adjectives, and adverbs. For instance, in an sskb, the noun paper has an evaluation slot, with a value of either good or bad. The sskbs for nouns are included with evaluation slots in the SS-KBP, as we believe that evaluations, in addition to epistemic content, are essential for all nominal concepts in ordinary dialogue communication. The discourse knowledge can accommodate information about speakers and the functions of utterances.
Table 1. Knowledge categorization for the SS-KBP
Grammatical knowledge — specifying the category, depending on its category (ex: "paper" --- noun); implemented by the [category daemon] and [predictable category daemon].
Semantic knowledge — extracting the meaning, i.e. extracting the first sskb of the current word (ex: "paper" --- thesis (sskb)), implemented by the [knowledge belief daemon]; changing the belief structure, i.e. controlling the knowledge-belief structure with a new input belief, implemented by the [knowledge belief management daemon]; predicting the deep case of the current word and predicting the following word (ex1: "buy" (verb) --- "Tom" as agent (sskb); ex2: "flower" (noun) --- "beautiful" as evaluation (sskb)), implemented by the [predictable word daemon]; inferring the user of the word (ex: "pavement" --- British (sskb)).
Discourse knowledge — inferring the speaker's intention, implemented by the [speaker information daemon]; inferring the utterance function (ex: Can you help me? --- "request" (sskb)), implemented by the [utterance function daemon].
The SS-KBP parses each input word using the sskbs connected to the word. As the proposed parser is strictly a word-based parser, it can analyze incomplete utterances. An example of a knowledge-belief implementation in the parser can be seen in the following fragment of code:
;;; class object for 'paper'. The word "paper" belongs to "commonnoun".
(defclass paper (commonnoun)
  ;;; the category is commonnoun
  ((Cate :reader Cate :initform 'commonnoun)
   ;;; the word's sskbs: the first sskb is 'thesis', the second sskb is 'newspaper'
   (sskb :accessor sskb :initform '((7 thesis) (5 newspaper)))
   ;;; the predictable verb
   (predAct :accessor predAct :initform '((4 read) (3 write)))
   ;;; the user information of the word
   (userInfo :accessor userInfo :initform '((4 professor) (2 father)))
   ;;; the evaluation; if the word has been evaluated, the slot is filled
   (EvalInducer :accessor EvalInducer :initform '((3 beautiful) (1 useful)))
   ;;; the utterance function
   (uttFunc :accessor uttFunc :initform '((6 order) (5 encourage)))))

Weighting of an sskb in the SS-KBP is carried out through the constant revision of its rank order. The parser adjusts the rank order of the sskbs for each word whenever an interaction occurs with another speaker. 3.3
Inferring the Speaker's Intention
In the SS-KBP, the speaker's intention is inferred by identifying the pragmatic functions of an utterance, which are kept in the sskbs. As utterances usually have two or more words, the parser employs priority rules for this task. For instance, when an utterance has an adjective, its utterance function is normally selected as the final choice among the pragmatic functions for the utterance. In the case of the utterance "Your paper is excellent," the pragmatic functions associated with the words "your" and "paper" are superseded by the pragmatic function "praise" retrieved from an sskb of the word "excellent." Because of these rules, it is always possible for the SS-KBP to infer the speaker's intention without reference to the context. 3.4
Emotion Elicitation
Emotion elicitation is one of the major characteristics of the SS-KBP. Ordinary models of utterance interpretation do not include the emotional responses of the hearer, although many utterances elicit emotional reactions. We believe that emotional reactions, as well as meaning/intention identification, are an inherent function of utterance exchange, and that the process is best captured as socially supported belief-knowledge processing. In the SS-KBP, emotions are evoked by sskbs activated by input words. When the SS-KBP determines the final utterance function as the speaker's intention, it extracts the associated emotions from the utterance function. In Fig. 2, the SS-KBP
obtains the utterance function "praise" from the utterance "Your paper is excellent"; the SS-KBP then searches the emotion knowledge for the word "praise" and extracts the emotion "happy."
Fig. 2. Emotion Elicitation in the SS-KBP (the input utterance "Your paper is excellent" is processed word by word by daemons; their results yield the final intention "praise", from which the emotion production daemon derives the emotions happy, glad, joyful)
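A schematic rendering of Sections 3.3 and 3.4 is sketched below in Python. The priority ordering and the small utterance-function and emotion tables are simplified assumptions built from the example above; they are not the parser's actual data.

# Utterance functions carried by sskbs of individual words (assumed toy data).
UTTERANCE_FUNCTIONS = {"your": "address", "paper": "order", "excellent": "praise"}
# Priority of part-of-speech categories when selecting the final function:
PRIORITY = {"adjective": 3, "verb": 2, "noun": 1, "other": 0}
CATEGORIES = {"your": "other", "paper": "noun", "excellent": "adjective"}
# Emotions associated with utterance functions.
EMOTIONS = {"praise": ["happy", "glad", "joyful"]}

def infer_intention(words):
    """Pick the utterance function of the highest-priority word (e.g. the adjective)."""
    best = max(words, key=lambda w: PRIORITY.get(CATEGORIES.get(w, "other"), 0))
    return UTTERANCE_FUNCTIONS.get(best)

def elicit_emotion(words):
    intention = infer_intention(words)
    return intention, EMOTIONS.get(intention, [])

print(elicit_emotion(["your", "paper", "is", "excellent"]))
# ('praise', ['happy', 'glad', 'joyful'])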
4
Future Directions
The present SS-KBP is able to interpret the speaker's intentions and is also capable of producing the hearer's emotion responses. For a more realistic dialogue model, however, at least two more factors may be necessary: (a) a function to confirm the speaker's intentions, and (b) a function to identify the speaker. The confirmation function would provide the sskbs with a more accurately-tuned rank order, and would, thus, improve the reliability of utterance interpretation. This function may also lead to the generation of emotions toward speakers. When an interpretation of an utterance is confirmed in terms of the level of match to speaker expectations, then the model could express the emotion of “satisfaction” towards the interpretation results and might recognize the speaker as having a similar belief structure to its own. The speaker identification function could add to the interpretation of ironical expressions. Once the model recognizes that speaker A has a very similar belief structure to itself, it would be easier to infer speaker A's intentions. If, however, speaker A's utterance were then to convey a different belief structure, the model could possibly interpret the utterance as either irony or a mistake.
5
Conclusions
The proposed SS-KBS model can explain how the hearer's knowledge-belief system works through interaction with others. The system reflects the intentions of the other
agent at each conversation turn, with each knowledge-belief having a rank based on the support from others. We have designed and examined the SS-KBS model in the following pragmatics-related tasks: (a) a context-free utterance interpretation parser, (b) a synchronizing mechanism between hearer and speaker knowledge-belief systems, and (c) a dynamically changing model of hearer emotion.
References
[1] Grice, H. P.: Logic and Conversation. In: Cole, P., Morgan, J. L. (eds.): Syntax and Semantics, Vol. 3: Speech Acts. Academic Press, New York (1975) 45-58
[2] Cohen, P. R., Levesque, H. J.: Persistence, Intention and Commitment. In: Cohen, P. R., Morgan, J., Pollack, M. (eds.): Intentions in Communication. MIT Press, Cambridge, Mass. (1990) 33-70
[3] Sperber, D., Wilson, D.: Relevance: Communication and Cognition. Harvard Univ. Press, Cambridge, Mass. (1986)
[4] Matsumoto, N., Anbo, T., Uchida, S., Tokosumi, A.: The Model of Interpretation of Utterances using a Socially Supported Belief Structure. Proceedings of the 19th Annual Meeting of the Japanese Cognitive Science Society. (In Japanese). (2002) 94-95
[5] Matsumoto, N., Anbo, T., Tokosumi, A.: Interpretation of Utterances using a Socially Supported Belief System. Proceedings of the 10th Meeting of the Japanese Association of Sociolinguistic Science. (In Japanese). (2002) 185-190
[6] Matsumoto, N., Tokosumi, A.: Pragmatic Disambiguation with a Belief Revising System. 4th International Conference on Cognitive Science ICCS/ASCS-2003 (to appear)
[7] Tokosumi, A.: Integration of Multi-Layered Linguistic Knowledge on a Daemon-Based Parser. In: Masanao Toda (ed.): Cognitive Approaches to Social Interaction Processes. Department of Behavioral Science, Hokkaido University, Sapporo, Japan. (In Japanese with LISP code). (1986) Chapter 10
Knowledge-Based Question Answering
Fabio Rinaldi1, James Dowdall1, Michael Hess1, Diego Mollá2, Rolf Schwitter2, and Kaarel Kaljurand1
1 Institute of Computational Linguistics, University of Zürich, Zürich, Switzerland {rinaldi,dowdall,hess,kalju}@cl.unizh.ch
2 Centre for Language Technology, Macquarie University, Sydney, Australia {diego,rolfs}@ics.mq.edu.au
Abstract. Large amounts of technical documentation are available in machine-readable form; however, there is a lack of effective ways to access them. In this paper we propose an approach based on linguistic techniques, geared towards the creation of a domain-specific Knowledge Base, starting from the available technical documentation. We then discuss an effective way to access the information encoded in the Knowledge Base. Given a user question phrased in natural language, the system is capable of retrieving the encoded semantic information that most closely matches the user input, and of presenting it by highlighting the textual elements that were used to deduce it.
1
Introduction
In this article, we present a real-world Knowledge-Based Question Answering1 system (ExtrAns), specifically designed for technical domains. ExtrAns uses a combination of robust natural language processing technology and dedicated terminology processing to create a domain-specific Knowledge Base, containing a semantic representation for the propositional content of the documents. Knowing what forms the terminology of a domain and understanding the relation between the terms is vital for the answer extraction task. Specific research in the area of Question Answering has been promoted in the last few years in particular by the Question Answering track of the Text REtrieval Conference (TREC-QA) competitions [17]. As these competitions are based on large volumes of text, the competing systems cannot afford to perform resource-consuming tasks and therefore they usually resort to a relatively shallow text analysis. Very few systems tried to do more than skim the surface of the text, and in fact many authors have observed the tendency of the TREC systems to converge to a sort of common architecture (exemplified by [1]). The TREC-QA competitions focus on open-domain systems, i.e. systems that can (potentially) answer any generic question. In contrast a question answering system working on a technical domain can take advantage of the formatting and style conventions 1
The term Answer Extraction is used as equivalent to Question Answering in this paper.
in the text and can make use of the specific domain-dependent terminology, besides it does not need to handle very large volumes of text.2 We found that terminology plays a pivotal role in technical domains and that complex multiword terms quickly become a thorn in the side of computational accuracy and efficiency if not treated in an adequate way. A domain where extensive work has been done using approaches comparable to those that we present here is the medical domain. The Unified Medical Language System (UMLS)3 makes use of hyponymy and lexical synonymy to organize the terms. It collects terminologies from differing sub-domains in a metathesaurus of concepts. The PubMed4 system uses the UMLS to relate metathesaurus concepts against a controlled vocabulary used to index the abstracts. This allows efficient retrieval of abstracts from medical journals, but it requires a complex, predefined semantic network of primitive types and their relations. However, [2] criticizes the UMLS because of the inconsistencies and subjective bias imposed on the relations by manually discovering such links. Our research group has been working in the area of Question Answering for a few years. The domain selected as initial target of our activity was that of Unix man pages [11]. Later, we targeted different domains and larger document sets. In particular we focused on the Aircraft Maintenance Manual (AMM) of the Airbus A320 [15]. However, the size of the SGML-based AMM (120Mb) is still much smaller than the corpus used in TREC-QA and makes the use of sophisticated NLP techniques possible. Recently, we have decided to embark upon a new experiment in Answer Extraction, using the Linux HOWTOs as a new target domain. As the documents are open-source, it will be possible to make our results widely available, using a web interface similar to that created for the Unix manpages.5 The remainder of this paper is organized around section 2, which describes the operations adopted for structuring the terminology and section 3, which describes the role of a Knowledge Base in our Question Answering system.
2
Creating a Terminological Knowledge Base
Ideally, terminology should avoid lexical ambiguity, denoting a single object or concept with a unique term. More often than not, the level of standardization needed to achieve this ideal is impractical. With the authors of complex technical documentation spread across borders and languages, regulating terms becomes increasingly difficult as innovation and change expand the domain with its associated terminology. This type of fluidity of the terminology not only increases the number of terms but also results in multiple ways of referring to the same 2 3 4 5
2 The size of the domains we have dealt with (hundreds of megabytes) is one order of magnitude inferior to that of the TREC collections (a few gigabytes).
3 http://www.nlm.nih.gov/research/umls/
4 http://www.ncbi.nlm.nih.gov/pubmed/
5 http://www.cl.unizh.ch/extrans/
Fig. 1. A sample of the AMM Terminological Knowledge Base. The hierarchy is rooted in TERM, with synsets such as 1 cover strip, 2 cargo door, 3 compartment door, 4 enclosure door, 5 functional test / operational check, 7 electrical cable / electrical line, 8 stainless steel cover strip, 9 cargo compartment door / cargo compartment doors / cargo-compartment door, 10 cockpit door, 11 door functional test, and 12 fastner strip / attachment strip.
domain object. Terminological variation has been well investigated for expanding existing term sets or producing domain representations [3, 6]. Before the domain terminology can be structured, the terms need to be extracted from the documents, details of this process can be found in [4, 14]. We then organize the terminology of the domain in an internal component called the Terminological Knowledge Base (TermKB) [13]. In plainer terms, we could describe it simply as a computational thesaurus for the domain, organized around synonymy and hyponymy and stored in a database. The ExtrAns TermKB identifies strict synonymy as well as three weaker synonymy relations [5] and exploits the endocentric nature of the terms to construct a hyponymy hierarchy, an example of which can be seen in Fig.1. We make use of simple pattern matching techniques to determine lexical hyponymy and some strict synonymy, more complex processing is used to map the immediate hyperonymy and synonymy relations in WordNet onto the terminology. To this end we have adapted the terminology extraction tool FASTR [7]. Using a (PATRII) phrase structure formalism in conjunction with the CELEX morphological database and WordNet semantic relations, variations between two terms are identified. Hyponymy Two types of hyponymy are defined, modifier addition producing lexical hyponymy and WordNet hyponymy translated from WordNet onto the term set. As additional modifiers naturally form a more specific term, lexical hyponymy is easily determined. Term A is a lexical hyponym of term B if: A has more tokens than B; the tokens of B keep their order in A; A and B have the same head.6 The head of a term is the rightmost non-symbol token (i.e. a word) which can be determined from the part-of-speech tags. This relation is exemplified in Fig.1 between nodes 1 and 8 . It permits multiple hyperonyms as 9 is a hyponym of both 2 and 3 . 6
This is simply a reflection of the compounding process involved in creating more specific (longer) terms from more generic (shorter) terms.
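The lexical hyponymy test described above can be written down directly. The following Python sketch is illustrative only and assumes terms are given as lists of word tokens with symbols already removed.

def is_lexical_hyponym(a, b):
    """True if term `a` is a lexical hyponym of term `b`.

    Conditions from the text: `a` has more tokens than `b`, the tokens of `b`
    keep their order in `a`, and both terms share the same head (rightmost word).
    """
    if len(a) <= len(b) or a[-1] != b[-1]:
        return False
    it = iter(a)
    return all(tok in it for tok in b)     # b is an ordered subsequence of a

print(is_lexical_hyponym(["stainless", "steel", "cover", "strip"], ["cover", "strip"]))  # True
print(is_lexical_hyponym(["cargo", "compartment", "door"], ["cargo", "door"]))           # True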
WordNet hyponymy is defined between terms linked through the immediate hyperonym relation in WordNet. The dashed branches in Fig.1 represent a link through modifier hyponymy where the terms share a common head and the modifiers are related as immediate hyperonyms in WordNet. Nodes 3 and 4 are both hyperonyms of 10 . Similarly, “floor covering” is a kind of “surface protection” as “surface” is an immediate hyperonym of “floor” and “protection” is an immediate hyperonym of “covering”. Mapping a hierarchical relation onto terms in this manner is fine when the hyponymy relation exists in the same direction, i.e. from the modifiers of t1 to the modifiers of t2 and the head of t1 to the head of t2. Unfortunately, it is less clear what the relation between t1, “signal tone” and t2, “warning sound” can be characterized as. This type of uncertainty makes the exploitation of such links difficult. Synonymy Four relations make up synsets, the organizing unit of the TermKB. These are gradiated from strict synonymy to the weakest useful relation. Simplistic variations in punctuation 9 , acronym use or orthography produce strict synonymy. Morpho-syntactic processes such as head inversion also identify this relation, cargo compartment doors −→ door of the cargo compartment. Translating WordNets synset onto the terminology defines the three remaining synonymy relations. Head synonymy 7 , modifier synonymy 12 and both 5 . Automatically discovering these relations across 6032 terms from the AMM produces 2770 synsets with 1176 lexical hyponymy links and 643 WordNet hyponymy links. Through manual validation of 500 synsets, 1.2% were determined to contain an inappropriate term. A similar examination of 500 lexical hyponymy links identified them all as valid.7 However, out of 500 WordNet hyponymy links more than 35% were invalid. By excluding the WordNet hyponymy relation we obtain an accurate knowledge base of terms (TermKB), the organizing element of which is the synset and which are also related through lexical hyponymy.
3
Question Answering in Technical Domains
In this section we briefly describe the linguistic processing performed in the ExtrAns systems; extended details can be found in [15]. An initial phase of syntactic analysis, based on the Link Grammar parser [16], is followed by a transformation of the dependency-based syntactic structures generated by the parser into a semantic representation based on Minimal Logical Forms, or MLFs [11]. As the name suggests, the MLF of a sentence does not attempt to encode the full semantics of the sentence. Currently the MLFs encode the semantic dependencies between the open-class words of the sentences (nouns, verbs, adjectives, and adverbs) plus prepositional phrases. The notation used has been designed to incrementally incorporate additional information if needed. Thus, other modules of the NLP system can add new information without having to remove old information.
7 This result might look surprising, but it is probably due to the fact that all the terminology had previously been manually verified; see also footnote 6.
This has been achieved by using flat expressions and using underspecification whenever necessary [10]. An added value of introducing flat logical forms is that it is possible to find approximate answers when no exact answers are found, as we will see below. We have chosen a computationally intensive approach, which allows a deeper linguistic analysis to be performed, at the cost of higher processing time. Such costs are negligible in the case of a single sentence (like a user query) but become rapidly impractical in the case of the analysis of a large document set. The approach we take is to analyse all the documents in an off-line stage (see Fig. 2) and store a representation of their contents (the MLFs) in a Knowledge Base. In an on-line phase (see Fig. 3), the MLF which results from the analysis of the user query is matched in the KB against the stored representations, locating those MLFs that best answer the query. At this point the systems can locate in the original documents the sentences from which the MLFs were generated. One of the most serious problems that we have encountered in processing technical documentation is the syntactic ambiguity generated by multi-word units, in particular technical terms. Any generic parser, unless developed specifically for the domain at hand, will have serious problems dealing with those multi-words. On the one hand, it is likely that they contain tokens that do not correspond to any word in the parser's lexicon; on the other, their syntactic structure is highly ambiguous (alternative internal structures, as well as possible undesired combinations with neighbouring tokens). In fact, it is possible to show that, when all the terminology of the domain is available, a much more efficient approach is to pack the multi-word units into single lexical tokens prior to syntactic analysis [4]. In our case, such an approach brings a reduction in the complexity of parsing of almost 50%.
Fig. 2. Creating the KB (offline)
[Fig. 3 components: QUERY; Query Filtering (against the Term KB); QUERY + Synset; Syntactic & Semantic Analysis; Semantic Matching against the Document KB (document logical forms); Answers in Document.]
Fig. 3. Exploiting the Terminological KB and Document KB (online)
During the analysis of documents and queries, if a term belonging to a synset is identified, it is replaced by its synset identifier, which then allows retrieval using any other term in the same synset. This amounts to an implicit 'terminological normalization' for the domain, where the synset identifier can be taken as a reference to the 'concept' that each of the terms in the synset describes [8]. In this way any term contained in a user query is automatically mapped to all its variants.
When an answer cannot be located with the approach described so far, the system is capable of 'relaxing' the query, gradually expanding the set of acceptable answers. A first step consists of including hyponyms and hyperonyms of the terms in the query. If the query extended with this ontological information fails to find an exact answer, the system returns the sentence (or set of sentences) whose MLF is semantically closest to the MLF of the question. Semantic closeness is measured here in terms of overlap of logical forms; the use of flat expressions for the MLFs allows a quick computation of this overlap after unifying the variables of the question with those of the answer candidate. The current algorithm for approximate matching compares pairs of MLF predicates and returns 0 or 1 depending on whether the predicates unify or not. An alternative worth exploring is the use of ontological information to compute a measure based on the ontological distance between words, i.e. by exploring their shared information content [12].
The expressivity of the MLF can be further expanded through the use of meaning postulates of the type: "If x is included in y, then x is in y". This ensures that the query "Where is the temperature bulb?" will still find the answer "A temperature bulb is included in the auxiliary generator". It should be clear that this approach towards inferences has so far only the scope of a small experiment; a large-scale extension of this approach would mean dealing with problems such as domain-specific inferences, contradictory knowledge, inference cycles and the more general problem of knowledge acquisition. In fact such an approach would require a domain Ontology, or even more general World-Knowledge.8
While the approach described so far is capable of dealing with all variations in terminology previously identified in the offline stage, the user might come up with a new variant of an existing term, not previously seen. The approach that we take to solve this problem is to filter queries (using FASTR, see Fig. 3) for these specific term variations. In this way the need for a query to contain a known term is removed. For example, the subject of the query "Where is the equipment for generating electricity?" is related through synonymy to the synset of electrical generation equipment, providing the vital link into the TermKB.
8 Unfortunately, a comprehensive and easy-to-use repository of World Knowledge is still not available, despite some commendable efforts in that direction [9].
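The overlap-based approximate matching described above can be illustrated with a small sketch. This is not the ExtrAns code: the flat-predicate encoding, the variable-naming convention and the example MLFs are assumptions chosen purely for illustration.

```python
def overlap_score(question_mlf, answer_mlf):
    """Toy flat-MLF overlap scoring.

    Each MLF is modelled as a set of flat predicates such as
    ("object", "x1", "temperature_bulb").  Two predicates 'unify' here when
    they share name and arity and their arguments match, where arguments
    starting with 'x' are treated as variables.  Each unifiable question
    predicate contributes 1, mirroring the 0/1 comparison described above.
    """
    def unify(p, q):
        if p[0] != q[0] or len(p) != len(q):
            return False
        return all(a.startswith("x") or b.startswith("x") or a == b
                   for a, b in zip(p[1:], q[1:]))

    return sum(1 for p in question_mlf if any(unify(p, q) for q in answer_mlf))

question = {("object", "x1", "temperature_bulb"), ("location", "x1", "x2")}
answer   = {("object", "e1", "temperature_bulb"), ("include", "e2", "e1")}
print(overlap_score(question, answer))   # 1: only the 'object' predicate matches
```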
Fig. 4. Example of interaction with the system
4
Conclusion
In this paper we have described the automatic creation and exploitation of a knowledge base in a Question Answering system. Traditional techniques from Natural Language Processing have been combined with novel ways of exploiting the structure inherent in the terminology of a given domain. The resulting Knowledge Base can be used to ease the information access bottleneck of technical manuals.
References
[1] Steven Abney, Michael Collins, and Amit Singhal. Answer extraction. In Sergei Nirenburg, editor, Proc. 6th Applied Natural Language Processing Conference, pages 296-301, Seattle, WA, 2000. Morgan Kaufmann.
[2] J. J. Cimino. Knowledge-based terminology management in medicine. In Didier Bourigault, Christian Jacquemin, and Marie-Claude L'Homme, editors, Recent Advances in Computational Terminology, pages 111-126. John Benjamins Publishing Company, 2001.
[3] B. Daille, B. Habert, C. Jacquemin, and J. Royauté. Empirical observation of term variations and principles for their description. Terminology, 3(2):197-258, 1996.
[4] James Dowdall, Michael Hess, Neeme Kahusk, Kaarel Kaljurand, Mare Koit, Fabio Rinaldi, and Kadri Vider. Technical terminology as a critical resource. In International Conference on Language Resources and Evaluations (LREC-2002), Las Palmas, 29-31 May 2002.8
[5] Thierry Hamon and Adeline Nazarenko. Detection of synonymy links between terms: Experiment and results. In Didier Bourigault, Christian Jacquemin, and Marie-Claude L'Homme, editors, Recent Advances in Computational Terminology, pages 185-208. John Benjamins Publishing Company, 2001.
8 Available at http://www.cl.unizh.ch/CLpublications.html
[6] Fidelia Ibekwe-SanJuan and Cyrille Dubois. Can Syntactic Variations Highlight Semantic Links Between Domain Topics? In Proceedings of the 6th International Conference on Terminology and Knowledge Engineering (TKE02), pages 57-64, Nancy, August 2002.
[7] Christian Jacquemin. Spotting and Discovering Terms through Natural Language Processing. MIT Press, 2001.
[8] Kyo Kageura. The Dynamics of Terminology: A descriptive theory of term formation and terminological growth. Terminology and Lexicography Research and Practice. John Benjamins Publishing, 2002.
[9] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 11, 1995.
[10] Diego Mollá. Ontologically promiscuous flat logical forms for NLP. In Harry Bunt, Ielka van der Sluis, and Elias Thijsse, editors, Proceedings of IWCS-4, pages 249-265, 2001.
[11] Diego Mollá, Rolf Schwitter, Michael Hess, and Rachel Fournier. ExtrAns, an answer extraction system. T.A.L., special issue on Information Retrieval oriented Natural Language Processing, 2000.8
[12] Philip Resnik. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11:95-130, 1998.
[13] Fabio Rinaldi, James Dowdall, Michael Hess, Kaarel Kaljurand, and Magnus Karlsson. The Role of Technical Terminology in Question Answering. In Proceedings of TIA-2003, Terminologie et Intelligence Artificielle, pages 156-165, Strasbourg, April 2003.8
[14] Fabio Rinaldi, James Dowdall, Michael Hess, Kaarel Kaljurand, Mare Koit, Kadri Vider, and Neeme Kahusk. Terminology as Knowledge in Answer Extraction. In Proceedings of the 6th International Conference on Terminology and Knowledge Engineering (TKE02), pages 107-113, Nancy, 28-30 August 2002.8
[15] Fabio Rinaldi, James Dowdall, Michael Hess, Diego Mollá, and Rolf Schwitter. Towards Answer Extraction: an application to Technical Domains. In ECAI2002, European Conference on Artificial Intelligence, Lyon, 21-26 July 2002.8
[16] Daniel D. Sleator and Davy Temperley. Parsing English with a link grammar. In Proc. Third International Workshop on Parsing Technologies, pages 277-292, 1993.
[17] Ellen M. Voorhees. The TREC question answering track. Natural Language Engineering, 7(4):361-378, 2001.
Knowledge-Based System Method for the Unitarization of Meaningful Augmentation in Horizontal Transliteration of Hanman Characters Huaglory Tianfield School of Computing and Mathematical Sciences Glasgow Caledonian University 70 Cowcaddens Road, Glasgow, G4 0BA, UK
[email protected] Abstract. The basic idea for differentiable horizontal transliteration of single Hanman characters is to equate the horizontal transliteration of a Hanman character to its toneless Zhongcentrish phonetic letters appended with a suffixal word which is from the meaningful augmentation. This paper presents a knowledge-based system method to unitarize the choice from all the meanings associative of the Hanman character.
1
Introduction
All the languages on the globe fall into two streams, i.e. plain-Hanman character string based languages and knowledge-condensed Hanman character based languages [1]. Knowledge-condensed Hanman character based languages include Zhongcentrish (used in the mainland, Taiwan, Hong Kong and Macao of Zhongcentre, Singapore, etc.), many other oriental languages (Japanese, Korean, Vietnamese, etc.), Maya used by the Mayans in South America, and ancient Egyptian in Africa. Hanman characters are the main characters forming the languages of Zhongcentrish and many other oriental languages such as Korean, Japanese, Vietnamese, etc. as well. Hanman characters are the only knowledge-condensed characters that are in wide use in modern society; Maya is only used in research at present. Plain-Hanman character string based languages are such as Latin, English, German, Spanish, French, Greek, Portuguese, Russian, Swedish, etc. Their main characters are the letters from the alphabets, and thus these languages are simply called letter string based ones. Horizontal transliteration is a mechanical literal transliteration from Hanman characters into letter strings. Horizontal transliteration of Hanman characters actually represents the most profound issue of the mutual transformation and comprehension between Hanman character based languages (and more generally, cultures) and plain-Hanman character string based languages (cultures). The differentiable horizontal transliteration of knowledge-condensed Hanman characters can serve as such a mechanical bridge.
The conventional way of horizontal transliteration of Hanman characters is by toneless Zhongcentrish phonetic letters (ZPL). The problem with ZPL horizontal transliteration is that it is practically impossible for a viewer to invert the ZPL horizontal transliteration back to the original Hanman character unless the viewer has been told the original Hanman character in advance or has sufficient a priori knowledge about the context around the original Hanman character. Toneless ZPL has only 7% differentiability, which is very unsatisfactory, because of the great homonym rate of Hanman characters. This paper proposes an unconventional approach to the differentiable horizontal transliteration of single Hanman characters under contextless circumstances. The basic idea is to equate the horizontal transliteration of a Hanman character to its toneless ZPL appended with a suffixal word which is taken from the meaningful augmentation of the Hanman character [2]. As a Hanman character normally has more than one associative meaning, the key is how to determine a unitary suffixal word for the horizontal transliteration.
2
Zhongcentrish Phonemics and Phonetics of Hanman Characters
Architectonics of Hanman characters contains some phonemics, e.g. homonyms. Actually, the phonetics of most Hanman characters is conveyed by opposite-syllable forming of a Hanman character. That is, the sound initiator phoneme of the Hanman character is marked by another known / simpler Hanman character which has the same sound initiator phoneme, and the rhyme phoneme of the Hanman character is marked by another known / simpler Hanman character which has the same rhyme phoneme. In this sense, phonetics and phonemics are inherent in Zhongcentrish, independently of the inception and existence of ZPL. Latin transcription of Zhongcentrish phonetics began in the early 16th century and eventually yielded over 50 different systems. On February 21, 1958, Zhongcentre adopted the Hanman character based language Phonetic Scheme, known as ZPL, to replace the Wade-Giles and Lessing transcription systems [3]. ZPL uses a modified Roman alphabet to phonetically spell the proper pronunciation of Hanman characters. Sounds of ZPL strings only roughly correspond to the English pronunciation of the ZPL strings, and thus ZPL is hetero-pronounceable to English. Since its inception, ZPL has become a generally recognized standard for Zhongcentrish phonetics throughout most of the world: it has been recognized by the International Standards Organization (ISO), the United Nations, the U.S. government, much of the world's media, and the region of Taiwan in 1999, and it is the standard prescribed by the Zhongcentrish law of national language and Hanman characters.
The original Zhongcentrish phonetic transcription, including symbols for sound initiator phonemes, rhyme phonemes and tones, is by radicals, as depicted in the first rows of Fig. 1.
(a) Zhongcentrish phonetic sound initiator phonemes (21): symbols versus letters
(b) Zhongcentrish (simple or compound) rhyme phonemes (35): symbols versus letters
(c) Symbols of the five tones (5)
Fig. 1. Zhongcentrish phonetic transcription
In addition to the 21 basic sound initiator phonemes, there are three expanded sound initiator phonemes, i.e. the "nil" sound initiator phoneme, and / w / and / y /. The latter two are phonemically the same as / u / and / i /. In addition to the 35 rhyme phonemes, there are four infrequently used rhyme phonemes, i.e. the "nil" rhyme phoneme, and / ê / (umlauted / e /), / er /, / m /, / ng / (η). Superficially, there are 24 x 39 = 936 mechanical combinations. However, many of the mechanical combinations are either unfeasible or unrealistic. Actually, over the 936 mechanical combinations, only 417 pronunciations are used for Hanman characters. There are 417 strings of ZPL in the Zhongcentrish dictionary. All the Hanman characters can be written in ZPL. There are more than 10 thousand Hanman characters in Zhongcentrish, among which 6 thousand are in frequent usage. Conservatively, accounting for 6,000 Hanman characters, the ratios below can be obtained. The theoretical average toneless homonym ratio (TLHR) of Zhongcentrish is

TLHR = 6,000 / 417 ≈ 15 / 1    (1)
which means that one distinctive string of ZPL has to serve fifteen Hanman characters on average. The theoretical average toned homonym ratio (THR) of Zhongcentrish is

THR = 6,000 / 417 / 4 ≈ 4 / 1    (2)
which means that one distinctive toned string of ZPL has to serve four Hanman characters on average. Although five different tones are defined to refine and differentiate Zhongcentrish phonetics, there still exist too many homonyms in Zhongcentrish.
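As a quick arithmetic check of ratios (1) and (2) (the rounding to 15/1 and 4/1 above is the paper's own):

```python
n_chars, n_zpl_strings, n_tones = 6000, 417, 4
tlhr = n_chars / n_zpl_strings        # toneless homonym ratio
thr = tlhr / n_tones                  # toned homonym ratio
print(round(tlhr, 1), round(thr, 1))  # 14.4 3.6
```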
3
Differentiable Horizontal Transliteration (DHT)
Comprehension needs common context. That is, the ontology of the viewer of a proposition needs to be the same as (or sufficiently intersected with) the ontology of the creator of the proposition. In oral communication (conversation, talk, speech…), under contextless circumstances, after pronunciation the speaker has to augment the saying with some context for the listener, so as to convey the meaning accurately. Such augmented context usually includes the meaning of the Hanman character itself, whose pronunciation has just been presented, or the meaning of the common words composed of the Hanman character. Such context augmentation is also widely used in any letter string based language under contextless circumstances. For instance, the speaker tells the listener, by spelling the word letter by letter, some acronym, post code, name, etc. For example, U…N, U for "university", N for "nature"; R…Z, R for "run", Z for "zero". Here a new method is proposed for horizontal transliteration of single Hanman characters under contextless circumstances, named DHT. It is differentiable and thus the horizontal transliteration can be uniquely and accurately reverted back to the original Hanman character.
3.1
Rationale of Horizontal Transliteration
The meanings of a Hanman character are used to augment the context. The reason a different Hanman character was created is that there is a different meaning that needs to be conveyed. Therefore, there is a one-to-one correspondence between the architectonics of a Hanman character and the meanings expressed by a Hanman character.

Dispersivity set of a Hanman character ::= all meanings the Hanman character expresses or associates to express    (3)

Dispersivity set of a letter-string word ::= all meanings the word denotes    (4)
Usually, dispersivity set has more than one element, and dispersivity set of the Hanman character is not equal to dispersivity set of the translation word. In horizontal transliteration, only one major intersection between the dispersivity set of the Hanman character and the dispersivity set of the translation word is used for horizontal transliteration. This can be depicted in Fig. 2.
Fig. 2. Dispersivity sets of Hanman character and translation words and their intersection. HM: heuristic meaning, LM: letter-string meanings
3.2
Two-Tier Letter String for DHT
Horizontal transliteration is a concatenation of two sections of letter strings, as depicted in Fig. 3, and is written with an initial capital.

DHT ≜ ZPL + English meaning word ≜ prefix concatenated with a suffixal word    (5)

Prefix = ZPL of the Hanman character plus a tone letter    (6)
To be natural, the horizontal transliteration must contain ZPL in the first place, because ZPL is the standard. The prefix gives the DHT its phonetic naturalness.
3.3
Suffixal Word
The suffixal word is taken from the translation words of the meanings of the Hanman character. Normally there is more than one meaning associated with a Hanman character. The problem is how to find a unique suffixal word from all the meaningfully augmented words. Here a knowledge-based system method is presented for the unitarization of the meaningfully augmented words of a Hanman character. Finding a unique suffixal word, i.e. the unitarization of meaningfully augmented words, for a Hanman character can be formulated as a knowledge-based system. A knowledge-based system can be modeled as a triangle pulled at its three angles by the inference engine, the knowledge base and the database, respectively. Problem solving is a process of utilizing knowledge upon data under the guidance of the inference engine, as depicted in Fig. 4.
[Fig. 3 box labels: A Hanman Character; Phonetics; Meanings; TLZPL; Translation Word Selected by a Rule Based System; DHT Letter String ::= …]
Fig. 3. Horizontal transliteration of a Hanman character consists of two sections of letter strings
[Fig. 4 components: Inference Engine; Rule Base; Libraries (Z-E Dictionary & E-Z Dictionary).]
Fig. 4. Knowledge based system for finding a unique suffixal word for a Hanman character. Z-E: Zhongcentrish-English, E-Z: English-Zhongcentrish
The suffixal word is chosen according to the rules and inference rules below, so that it is simple and commonly used and its first syllable remains unmixable even when concatenated to the prefix. It is assumed that for each Hanman character a number of libraries of translation words are available, as depicted in Table 1.

Table 1. Classified libraries of all the meaningful augmentations associated with a Hanman character
Library 1: translation words of the Hanman character itself
Library 2: translation words of the most commonly used Hanman phrases that start with the single Hanman character
Library 3: if there is no translation word, the core word that most frequently appears in the translation phrases of the Hanman character
Library 4: if there is no translation word and the translation phrase of the Hanman character is composed of two words, the phrase concatenated as a single word
Library 5: translation words of the general concepts to which the expressions of the single Hanman character belong
Library 6: translation words of less commonly used Hanman phrases that start with the single Hanman character
All the translation words should be meaningful in themselves and should not be context-dependent and/or prescriptive words such as names of places, countries, people, dynasties, etc.
Rule 01: The first syllable of the suffixal word remains unmixable even when the suffixal word is concatenated to the prefix.
Rule 02: The suffixal word has the most meaningful accuracy with respect to what the Hanman character most commonly expresses.
Rule 03: The suffixal word has the same property as the Hanman character, e.g. both are verbs, nouns, adjectives, adverbs, or prepositions, etc.
Rule 04: The translation word of the Hanman character has fewer interpretations than other translation words.
Rule 05: The suffixal word has the minimum number of syllables, regardless of parts of speech (noun, verb, adjective, adverb, preposition, exclamation) and specific expressions of the word.
Rule 06: The suffixal word is most commonly used.
Rule 07: The suffixal word is a noun.
Rule 08: The suffixal word has the minimum number of letters among its homorooted genealogy, regardless of parts of speech (noun, verb, adjective, adverb, preposition, exclamation) and specific expressions of the word.
Rule 09: The suffixal word is commendatory or neutral rather than derogatory.
Rule 10: In case Rule 1 cannot be satisfied, place a syllable-dividing marker (') (apostrophe) between the prefix and the suffixal word when they are concatenated.
Inference rule 1: No rules are applied to Library i if Library (i-1) is nonempty, i = 1, 2, …
Inference rule 2: Rule 1 is mandatory, has absolute paramount importance and should always be applied first of all.
Inference rule 3: Always apply rules to the least-numbered library of translation words. End if successful. Only if all the lower-numbered libraries fail, turn to other libraries of translation words.
Inference rule 4: The lower-numbered a rule, the higher priority the rule has.
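A minimal sketch of how the library ordering and rule priorities above could be operationalised; the data structures, the "unmixable" criterion and the example candidate libraries are assumptions chosen for illustration, not the paper's implementation.

```python
def first_syllable_unmixable(prefix: str, word: str) -> bool:
    """Stand-in for Rule 01: require that the suffixal word does not start with
    a vowel, so its first syllable cannot merge with the prefix.
    This particular criterion is an assumption, not the paper's."""
    return word[0].lower() not in "aeiou"


def select_suffixal_word(prefix, libraries):
    """Scan libraries in order (Inference rules 1 and 3); Rule 01 is mandatory
    and applied first (Inference rule 2); remaining ties are broken by the
    lowest-numbered rule modelled here, Rule 05 (fewest syllables, approximated
    by word length), per Inference rule 4."""
    for library in libraries:                 # least-numbered library first
        if not library:
            continue
        admissible = [w for w in library if first_syllable_unmixable(prefix, w)]
        if admissible:
            return min(admissible, key=len)   # Rule 05 proxy: shortest word
        return "'" + min(library, key=len)    # Rule 10: apostrophe separator
    return None


# Hypothetical prefix and candidate libraries for one character:
libraries = [["horse"], ["ant", "agate"]]
suffix = select_suffixal_word("mav", libraries)
print(("mav" + suffix).capitalize())          # "Mavhorse"
```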
References
[1] Tianfield, H.: Computational comparative study of English words and Hanman character words. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC'2001), Tucson, Arizona, USA, October 7-10, 2001, 1723-1728
[2] Tianfield, H.: Differentiable horizontal transliteration of single hieroglyphic Hanman characters under contextless circumstance. Proceedings of the 1st International Symposium on Multi-Agents and Mobile Agents in Virtual Organizations and E-Commerce (MAMA'00), December 11-13, 2000, Wollongong, Australia, 7 pages
[3] Wen Zi Gan Ge Chu Ban She, ed.: Han (man) Yu (language) Pin (spell) Yin (sound) Fang'An (scheme). Beijing: Wen Zi Gan Ge Chu Ban She, 1958. It was passed as ISO 7098 in 1982
An Expressive Efficient Representation: Bridging a Gap between NLP and KR Jana Z. Sukkarieh University of Oxford, Oxford, England
[email protected] Abstract. We do not know how humans reason, whether they reason using natural language (NL) or not and we are not interested in proving or disproving such a proposition. Nonetheless, it seems that a very expressive transparent medium humans communicate with, state their problems in and justify how they solve these problems is NL. Hence, we wished to use NL as a Knowledge Representation(KR) in NL knowledgebased (KB) sytems. However, NL is full of ambiguities. In addition, there are syntactic and semantic processing complexities associated with NL. Hence, we consider a quasi-NL KR with a tractable inference relation. We believe that such a representation bridges the gap between an expressive semantic representation (SR) sought by the Natural Language Processing (NLP) community and an efficient KR sought by the KR community. In addition to being a KR, we use the quasi-NL language as a SR for a subset of English that it defines. Also, it is capable of a general-purpose domain-independent inference component which is, according to semanticists, all what it takes to test a semantic theory in any NLP system. This paper gives only a flavour for this quasi-NL KR and its capabilities (for a detailed study see [14]).
1
Introduction
The main objection against a semantic representation or a Knowledge Representation is that they need experts to understand.1 Non-experts communicate via a natural language (usually) and more or less understand each other while performing a lot of reasoning. Nevertheless, for a long time the KR community dismissed the idea that NL can be a KR. That is because NL can be very ambiguous and there are syntactic and semantic processing complexities associated with NL. Recently, researchers have started looking at this issue again. Possibly, it has to do with the NL Processing community making some progress in terms of processing and handling ambiguity, and with the KR community realising that a lot of knowledge is already 'coded' in NL and that the way expressivity and ambiguity are handled should be reconsidered in order to make some advances on this front.
1 Though expert systems tend now to use more friendly representations, the latter still need experts.
We have chosen one of these KRs, namely [8], as a starting point to build a KR system using as input a simplified NL that this KR defines. One of the interesting things about this system is that we are exploring a novel meaning representation for English that no one has incorporated in an NLP system before, and building a system that touches upon most of the areas of NLP (even if it is restricted). We extended the basic notion and its proof theory to allow the deductive inferences that semanticists agree are the best test of any NLP semantic capacity (see the FraCas project2). Before we started our work, no one had built an automatic system that sanctions the FraCas deductive inferences. By extending the KR we extended the subset of English that this KR defines. In the following section we give a description of the NL-like KR, McLogic (related articles [16], [15]). In sections 3 and 4 we give an idea of the inference task and of the controlled English that McLogic defines, respectively (see details in [14] and [13], respectively). We conclude by summarising and emphasising the practical implications of our work.
2
NL-Like KR: McLogic
McLogic is an extension of a Knowledge Representation defined by McAllester et al. [8], [9]. Other NL-like Knowledge Representations are described in [5], [10] and [7], [6], [11], but McAllester et al.'s has a tractable inference relation. For McLogic, NL-like means that it is easy to read, with a natural-looking 'word' order.3 The basic form of McLogic, which we call McLogic0, and some extensions built on top of it are presented next.
2.1
The Original Framework: McLogic0
McLogic0 has two logical notions called class expressions and formulae. In the following, we define the syntax of McLogic0 and along with it we consider examples and their corresponding denotations. For the sake of clarity, we write V(e) to mean the denotation of e. Building Blocks Table 1 summarises the syntax of McLogic0 . The building unit for sentences or formulae in McLogic0 is a class expression denoting a set. First, the constant symbols, Pooh, Piglet , Chris − Robin are class expressions that denote singleton sets consisting of the entities Pooh, Piglet and Chris-Robin, respectively. Second, the monadic predicate symbols like bear , pig, hurt , cry and laugh are all monadic class expressions that denote sets of entities that are bears, pigs or are hurt, that cry or laugh respectively. Third, expressions of the form (R (some s )) and (R (every s)) where R is a binary relation and s is a class expression such as (climb(some tree)) and (sting(every bear )), where 2 3
http://www.cogsci.ed.ac.uk/˜fracas/ with the exception of Lambda expressions which are not very English as you will see below!
V((climb (some tree))) = {y | ∃ x .x ∈ tree ∧ climb(y, x )} and V((sting (every bear ))) = {y | ∀ x .x ∈ bear −→ sting(y, x )}. Fourth, a class expression can be a variable symbol that denotes a singleton set, that is, it varies over entities in the domain. Finally, a class expression can be a lambda expression of the form λ x .φ(x ), where φ(x ) is a formula and x is a variable symbol and V(λ x .φ(x ))= {d | φ(d ) is true}. An example of a lambda expression will be given in the next section.
Table 1. The syntax of McLogic0 . R is a binary relation symbol, s and w are class expressions, x is a variable and φ(x ) is a formula.
Class Expressions (with an example of each):
– a constant symbol: c, Pooh
– a monadic predicate symbol: student
– (R(some s)): (climb(some tree))
– (R(every s)): (climb(every tree))
– a variable symbol: x
– λ x.φ(x): λ x.(x (likes(Pooh)))
Well-Formed Formulae (with an example of each):
– (every s w): (every athlete energetic)
– (some s w): (some athlete energetic)
– negation of a formula: not(some athlete energetic)
– Boolean combinations of formulae: (every athlete energetic) and (every athlete healthy)
Well-Formed Sentences A sentence in McLogic0 is either: – an atomic formula of the form (every s w ) or (some s w ) where s and w are class expressions, for example, (every athlete energetic), – or a negation of these forms, for example, not (every athlete energetic), – or a boolean combination of formulae such as (every athlete healthy) and (some researcher healthy).
The meaning for each sentence is a condition on its truth value: (every s w) is true if and only if (iff) V(s) ⊂ V(w); (some s w) is true iff V(s) ∩ V(w) ≠ ∅. The negation and the Boolean combination of formulae have the usual meaning of logical formulae. Note that since variables and constants denote singleton sets, the formulae (some Pooh cry) and (every Pooh cry) are semantically equivalent. For a more natural representation, we can abbreviate this to (Pooh cry). Now that we have defined formulae, we can give an example of a lambda expression: ζ = λ x.(some/every x (read(some book))), which can be written as ζ = λ x.(x (read(some book))), where V(ζ) = {d | (d (read(some book))) is true}. The above is an informal exposition of the meaning of the constituents. Before we go on to describe the extensions, it is important to note that McLogic0 is "related to a large family of knowledge representation languages known as concept languages or frame description languages (FDLs)... However, there does not appear to be any simple relationship between the expressive power ... [of McLogic0] and previously studied FDLs" [9]. Consider, for example, translating the formula (every W (R(some C))) into a formula involving an expression of the form ∀ R.C:4 it seems there is no way to do that.
2.2
A Richer McLogic
As we did earlier with McLogic0, we provide the syntax of the extension together with some examples of formulae and their associated meanings. The extensions needed in this paper are summarised in Table 2. The main syntactic and semantic innovation over McLogic0 is the cardinal class expression, which allows symbolising "quantifiers" other than 'every' and 'some', as will be made clear later.
4 Recall that an object x is a member of the class expression ∀ R.C if, for every y such that the relation R holds between x and y, the individual y is in the set denoted by C.
Table 2. The syntax of the extensions required. R is a binary relation symbol (R −1 is the inverse of a binary relation), R3 is a 3-ary relation and R3−1 is an inverse of a 3-ary relation, s and t are class expressions, Q1 and Q2 can be symbols like every, some, at most ten, most , and so on, Mod is a function that takes a class expression, s, and returns a subset of s.
Class Expressions (with an example of each):
– s + t: happy + man
– s$t: client$representative
– s#Mod: drive#fast
– ¬(s): ¬(man)
– (R3(Q1 s) (Q2 t)): (give(some student) (some book))
– (R−1(Q1 s)): (borrow−1(some student))
– (R3−1(Q1 s) (Q2 t)): (give−1(some book) (some librarian))
– Q2 ∗ s: more than one ∗ man
Formulae (with an example):
– (N ∗ s t): (more than one ∗ man snore)
2.3
Syntax of McLogic − McLogic0 , with Example Denotations
The two logical notions, namely, class expressions and formulae, are the same as McLogic0 . In addition to these two, we introduce the notion of a function symbol. Definition 1. If f is an n-place function symbol, and t1 , . . . , tn are class expressions then f (t1 , . . . , tn ) is a class expression. There are special function symbols in this language, namely, unary symbols, ¬, 2-ary operators, + , $, and ∗. We introduce first the 3 operators, +, and $ and ¬. The operator, ∗, will be introduced after we define a cardinal class expression. Definition 2. Given two class expressions s and t : – s + t is a class expression. – s$t is a class expression. – ¬(s) is a class expression. s + t is defined to be the intersection of the two sets denoted by s and t . s$t is defined to be the union of the two sets denoted by s and t . Moreover, ¬(s) is defined to be the complement of the set denoted by s.
Example 1. Since man and (eat (some apple)) are class expressions then man + (eat (some apple)), man$(eat (some apple)), and ¬(man) are class expressions. They denote the sets V (man) ∩ V ((eat (some apple))), V (man) ∪ V ((eat (some apple))), and (V (man))c (complement of the set denoting man), respectively. It is important to mention special unary functions that take a class expression and returns a subset of that class expression. Hence, if we let mod func be one of these functions then mod func(s1 ) = s2 , where s1 is a class expression and V(s2 ) is a subset of V(s1 ). We will write s1 #mod func for mod func(s1 ). For example , let f be an example of such a function, then f (drive) ⊆ V (drive) and f (man) ⊆ V (man) and so on. Now, we introduce a cardinal class expression. We group symbols like most , less than two, more than one, one, two, three, four , · · · under the name cardinal class expressions. Definition 3. A cardinal class expression, N , represents a positive integer N and denotes the set V(N ) = {X | X is a set of N objects}. Note that a cardinal class expression does not have a meaning independently of entities (in some specified domain). This is motivated by the way one introduces a (abstract) number for a child, it is always associated with objects. Having defined a cardinal class expression, the operator, ∗ can be defined: Definition 4. Given a cardinal class expression N and a class expression that is not a cardinal class expression s, then N ∗ s is defined to be V(N ) ∩ P (V(s)) where P (V(s)) is the power set of the denotation of s. In other words, N ∗ t is interpreted as {X | X ⊆ t ∧ | X |= N }. The operation, ∗, has the same meaning as +, that is, intersection between sets. However, the introduction of a new operator is to emphasize the fact that ∗ defines an intersection between sets of sets of entities and not sets of entities. Example 2. Given the cardinal class expression ten and the non-cardinal class expression book , we can form ten ∗ book and this denotes the set of all sets that consist of ten books. Introducing the operator, ∗, and the class expression N ∗ s allows us to introduce the class expression (R (N ∗ s)) where R is a binary relation or the inverse of a binary relation. For the sake of presentation, we define an inverse of a binary relation: Definition 5. Given a binary relation R, the inverse, R −1 , of R is defined as such: d R d iff d R −1 d . Examples can be (borrow (ten ∗ book )), (buy(more than one ∗ mug)) or (buy −1 (john)). To stick to NL, we can use boughtby for buy −1 and so on. Being class expressions, the last 3 examples will denote sets: a set of entities that borrow(ed) ten books, a set of entities that buy (bought) more than one mug and a set of entities that were bought by john. Hence, the definition:
Definition 6. A class expression of the form (R(N ∗ s)) denotes the set {x | ∃ y ∈ N ∗ s ∧ ∀ i.(i ∈ y ←→ xRi)}. Moreover, the introduction of ∗ and N ∗s allows the introduction of a formula of the form (N ∗ s t ). Example 3. Given the class expressions two, man, old and (borrow (more than one ∗ book )), some of the formulae we can form are: (two ∗ man old ), (two ∗ old man), (two ∗ man (borrow (more than one ∗ book ))) We want the above formulae to be true if and only if ’two men are old’, ’two old (entities) are men’ and ’two men borrow(ed) more than one book’, respectively. The truth conditions for a formula of the form (N ∗ s t ) are given as follows: Definition 7. (N ∗ s t ) is true if and only if some element in N ∗ s is a subset of t. In other words, there exists a set X ⊆ s that has N elements such that X ⊆ t or for short N s’s are t ’s. The last extension in this paper, namely, a class expression with a 3-ary relation, will be introduced in what follows: Definition 8. A class expression with a 3-ary relation is of the form: – – – –
(R(some (R(some (R(every (R(every
s) (every t )), s) (some t )), s) (every t )), or s) (every t )), . . .
In a more general way : (R(Q1 s) (Q2 t )), (R(Q3 ∗ s) (Q4 ∗ t )), (R(Q1 s) (Q4 ∗ t )), (R(Q3 ∗ s) (Q2 t )), where, Qi for i = 1, 2 is either some or every, Qi for i = 3, 4 are cardinal class expressions, R is a 3-ary relation or an inverse of a 3-ary relation, s and t are non-cardinal class expressions. Examples of such class expressions can be : (give(mary) (some book )), (hand (some student ) (some parcel )), (send (more than one ∗ flower ) (two ∗ teacher )). It is natural that we make these class expressions denote entities that ’give mary some book’, that ’hand(ed) some student some parcel’ and that ’send more than one flower to two teachers’, respectively. In particular, we define the following:
Different quantifiers give different meanings:
V((R(some s) (every t))) = {y | ∃ x ∈ s. ∀ z. z ∈ t −→ ⟨y, z, x⟩ ∈ R}.
V((R(some s) (some t))) = {y | ∃ x ∈ s ∧ ∃ z ∈ t ∧ ⟨y, z, x⟩ ∈ R}.
V((R(every s) (every t))) = {y | ∀ x. x ∈ s −→ ∀ z. z ∈ t −→ ⟨y, z, x⟩ ∈ R}.
V((R(every s) (some t))) = {y | ∀ x. x ∈ s −→ ∃ z. z ∈ t ∧ ⟨y, z, x⟩ ∈ R}.
To allow for the change of quantifiers, we consider the form (R(Q1 s) (Q2 t)), whose denotation is {y | ∃ X. X ⊂ s ∧ ∃ Z. Z ⊂ t such that ∀ x. ∀ z. x ∈ X ∧ z ∈ Z −→ ⟨y, z, x⟩ ∈ R}, where the cardinality of X and Z depends on Q1 and Q2 respectively. Basically, we are just saying that (R(Q1 s) (Q2 t)) is the set of elements that relate, through R, with elements in s and elements in t, and that the number of elements of s and t in question depends on the quantifiers Q1 and Q2, respectively. For completeness, we define the inverse of a 3-ary relation and give an example: Definition 9. Given a 3-ary relation R, the inverse, R−1, of R is defined as: ⟨Arg1, Arg2, Arg3⟩ ∈ R iff ⟨Arg2, Arg3, Arg1⟩ ∈ R−1. For example, V((give−1(some student) (john))) is: {y | ∃ x ∈ student. ⟨y, john, x⟩ ∈ give−1} = {y | ∃ x ∈ student. ⟨x, y, john⟩ ∈ give}. We have described above the syntax of the extension of McLogic0 and gave example denotations. As we said earlier, McLogic0 together with the extensions will be called McLogic in this paper. For a formal semantics of McLogic, consult [14]. Having described the logic, we go on to describe the inferences McLogic sanctions.
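To make the set-theoretic readings above concrete, here is a small, self-contained sketch of a model checker for a fragment of McLogic (binary relations and the every/some forms only); the domain, relation and class names are invented for illustration, and this is not the authors' system.

```python
# A tiny model checker for a fragment of McLogic (illustration only).
# Class expressions denote sets over a finite domain; (every s w) holds iff
# V(s) is a subset of V(w); (some s w) holds iff V(s) and V(w) intersect.

MODEL = {
    "classes": {"bear": {"pooh"}, "pig": {"piglet"}, "tree": {"oak"}},
    "relations": {"climb": {("pooh", "oak")}},          # pairs (subject, object)
    "constants": {"Pooh": {"pooh"}, "Piglet": {"piglet"}},
}

def V(expr):
    """Denotation of a class expression in MODEL."""
    if isinstance(expr, str):
        return MODEL["constants"].get(expr) or MODEL["classes"][expr]
    op = expr[0]
    if op == "R-some":                                  # (R (some s))
        _, rel, s = expr
        return {x for (x, y) in MODEL["relations"][rel] if y in V(s)}
    if op == "R-every":                                 # (R (every s))
        _, rel, s = expr
        pairs = MODEL["relations"][rel]
        return {x for (x, _) in pairs
                if V(s) <= {y for (a, y) in pairs if a == x}}
    raise ValueError(op)

def holds(formula):
    """Truth of (every s w) / (some s w) formulas."""
    q, s, w = formula
    return V(s) <= V(w) if q == "every" else bool(V(s) & V(w))

print(holds(("some", "bear", ("R-some", "climb", "tree"))))   # True
print(holds(("every", "pig", ("R-some", "climb", "tree"))))   # False
```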
3
The Inference Task McLogic Sanctions
We are concerned with NL inferences but not with implicatures nor suppositions. Moreover, we do not deal with defeasible reasoning, abductive or inductive reasoning and so on. In our work [14], we focus on deductive (valid) inferences that depend on the properties of NL constructs. Entailments from an utterance U, or several utterances Ui that seem “natural”, in other words, that people do entail when they hear U. For example, in the following, D1 are deduced from Scenario S1 : – S1 : (1) a. some cat sat on some mat. b. The cat has whiskers.
Jana Z. Sukkarieh ⇓ D1 :
some cat exists, some mat exists, some cat sat on some mat, some cat has whiskers (cat1 has whiskers), some whiskers exist. ⇓ whiskers sat on some mat
We define a structurally-based inference to be one that depends on the specific semantic properties of the syntactic categories of sentences in NL. For example, – S2 : most cats are feline animals. ⇓ D2 : most cats are feline, most cats are animals.
– S3 Smith and Jones signed the con⇓ tract. D3 : Smith signed the contract.
D2 depend on the monotonicity properties of generalised quantifiers and D3 on those of conjoined Noun Phrases - among other classes that the FraCas deal with. 3.1
Inference Set
The proof theory is specified with a set of deductive inference rules with their contrapositives. The original inference set, which corresponds to McLogic0 consists of 32 inference rules. We added rules that correspond to the extensions and that are induced by the structurally-based inferences under consideration. In the following, we list some of the rules 5 . We assume these rules are clear and the soundness of each rule can be easily shown against the formal semantics of the logic (consult [14] for a formal semantics). (7) (every C C ) (20) (every W C $W ) (22) (every C + W W ) C exists) (10) (some (some C C ) (some C W ) (12) (some W C ) C Z ),(some C W ) (14) (every (some Z W) ),(at most one (16) (every C(atWmost one C ) (some C exists) (18) (not (every C W )) Z C ),(every Z W ) (24) (every(every Z C +W ) (not (some S exists)) (26) (every T (R(every S ))) 5
W)
(19) (every C C $W ) (21) (every C + W C ) (38) (every C #W C ) (some C W ) (11) (some C exists) (every C W ),(every W Z ) (13) (every C Z ) W ),(at most one C ) (15) (some C (every C W) (at most one C )) (17) (not(some C exists) C Z ),(every W Z ) (23) (every(every C $W Z ) (some C Z ) (25) (some C +Z exists) $Z ),(not (some C W )) (27) (every C W(every C Z)
The numbers are not in sequence as to give an idea of the rules added.
(every C W ) (29) (every (R(some C )) (R(some W ))) (some C W ) (31) (every (R(every C )) (R(some W ))) (more than one∗C W ) (35) (more than one∗C +W exists) W ),(every C Z ) (37) (N1 ∗C (N 1 ∗Z W ) (N3 ∗C W ),(every C Z ) (40) (N3 ∗C W +Z ) most N ∗C W )) (42) (at(atmost N ∗C W #Z )) (at most N ∗C (R(some W ))) (44) (at most N ∗C (R(some Z +W ))) than one∗C W ) (46) (more(some C W)
In the above, C , W and Z are class expressions. D is a representation for a determiner that is monotonically increasing on 2nd argument. N1 belongs to {at least N , N , more than one} and N3 belongs to {most , more than one, some(sg), at least N } and N4 in {more than one, N , at most N , at least N } where N is a cardinal. In the following, we show a few examples of structurally-based inferences licensed by the properties of monotonicity of GQs that are sanctioned by the above inference rules. It is enough, in these examples, to consider the representation without knowing how the translation is done nor which English constituents is translated to what. I leave showing how McLogic is used as a typed SR to a different occasion. The fact that McLogic is NL-like makes the representation and the proofs in this suite easy to follow. 3.2
Illustrative Proofs
Examples 1, 3, 4 and 5 are selected from The Fracas Test Suite [1]. The Fracas test suite is the best benchmark we could find to develop a general inference component. “The test suite is a basis for a useful and theory/system-independent semantic tool” [1]. From the illustrative proofs, it is seen what we mean by not using higher-order constructs and using what we call combinators, like “sine”, “cosine” instead. ’some’ is monotonically increasing on second argument as it is shown in examples 1 and 2. Argument 1 IF Some mammals are four-legged animals THEN Some mammals are four-legged. (more than one ∗ mammal four − legged + animal) r 39 (more than one ∗ four − legged + animal mammal)
(every four − legged + animal four − legged) r 37
(more than one ∗ four − legged mammal ) r 39 (more than one ∗ mammal four − legged)
Argument 2 IF some man gives Mary an expensive car THEN some man gives Mary a car.
To deal with 3-ary relations, we need to augment rules, like (29), (30) and (31). For this particular case, rule (29) is needed. Hence, we add the rules in table 3. Now, we can provide the proof of argument 2:
Table 3. Augmenting rule 29 to cover 3-ary relations with ’some’ and ’every’ only. C , W , Q are class expressions (not cardinal ones nor time class expressions). (3-ary 1)
(every C W ) (every (R(some Q) (some C )) (R(some Q)(some W )))
(3-ary 2)
(every C W ) (every (R(every Q) (some C )) (R(every Q) (some W )))
(3-ary 3)
(every C W ) (every (R(some C ) (some Q)) (R(some W ) (some Q)))
(3-ary 4)
(every C W ) (every (R(some C ) (every Q)) (R(some W ) (every Q)))
SINCE (every expensive + car car ) THEN (every (give(Mary) (some expensive + car )) (give(Mary) (some car ))) using rule 3-ary 1. MOREOVER, SINCE (some man (give(Mary) (some expensive + car ))) THEN (some (give(Mary) (some expensive + car )) man) using rule 12. THE TWO CONCLUSIONS IMPLY (some (give(Mary) (some car )) man) using rule 14. THE LAST CONCLUSION IMPLIES (some man (give(Mary) (some car ))).
The determiner ’some’ is monotonically increasing on first argument. Using rule 14 together with (every irish + delegate delegate) justify the required conclusion in argument 3. Argument 3 IF Some Irish delegate snores, THEN Some delegate snores. ’every’ is monotonically decreasing on first argument as in the following: Argument 4 IF every resident of the North American continent travels freely within Europe AND every canadian resident is a resident of the North American continent THEN every canadian resident travels freely within Europe. The transitivity rule (rule 13) proves the conclusion of argument 4. ’At most N’ is monotonically decreasing on second argument as it is shown in example 5. Rule 42 proves this property for ’at most ten’:. Argument 5 IF At most ten commissioners drive, THEN At most ten commissioners drive slowly.
The inference rules that McLogic is equipped with sanction all the examples given in the Fracas test suite licensed by the properties of GQs. Here, we have only included some of them. The following examples are given in [4] and we show that their validity is easily shown by our proof system. Argument 6 IF no teacher ran THEN no tall teacher ran yesterday. Alternatively, in McLogic, IF not (some teacher run) THEN not (some tall + teacher run#yesterday). (every tall + teacher teacher ) using rule 22. This together with not(some teacher run) imply not(some run tall + teacher ). Moreover, (every run#yesterday run) using rule 38. Using rule 14, the last two results justify not(some run#yesterday tall + teacher ). The contrapositive of rule 12 gives the required result.
Argument 7 IF every teacher ran yesterday THEN every tall teacher ran. Alternatively, in McLogic, IF (every teacher run#yesterday) THEN (every tall + teacher run). Using the transitivity rule 13 for the premise together with (every run#yesterday run) yields
(every
teacher
run).
Again
the
transitivity
rule
together
with
(every tall + teacher teacher ) justifies the conclusion.
Argument 8 IF every student smiled AND no student who smiled walked THEN no student walked. Since
(every
student
student)
(rule
7)
and
(every
student
smile)
then
(every student student + smile) using rule 24. Using a contrapositive of rule 14, the last result together with not(some student + smile walk) justifies not(some student walk).
Fyodorov et al. use similar rules: rule 7 is what they call the reflexivity rule, and rule 24 is what they call the conjunction rule. We do not have monotonicity rules as such because monotonicity can be proved through other rules. The last example they have is the following: Argument 9 IF exactly four tall boys walked AND at most four boys walked THEN exactly four boys walked. Assuming 'exactly C Nom VP' to be semantically equivalent to 'at least C Nom VP' and 'at most C Nom VP', where C is a cardinal, the argument is reduced to showing that 'at least four boys walked'. Given (at least four ∗ tall + boy walk) and (every tall + boy boy), then (at least four ∗ boy walk) using rule 37. In all the examples above, combining 'determiners' other than 'some' and 'every' is minimal. For example, consider the argument: Argument 10 IF some managers own at least two black cars, THEN some managers own at least two cars.
Table 4. Again rules similar to rule 29 that cover 3-ary relations but with ’some’, ’every’ and cardinal class expressions. C , W , Q are class expressions (not cardinal ones nor time class expressions) and N is a cardinal class expression. (3-ary 5)
(every C W ) (every (R(some Q) (N ∗C )) (R(some Q)(N ∗W )))
(3-ary 6)
(every C W ) (every (R(every Q) (N ∗C )) (R(every Q) (N ∗W )))
(3-ary 7)
(every C W ) (every (R(N ∗Q) (N ∗C )) (R(N ∗Q) (N ∗W )))
(3-ary 8)
(every C W ) (every (R(N ∗C ) (some Q)) (R(N ∗W ) (some Q)))
(3-ary 9)
(every C W ) (every (R(N ∗C ) (every Q)) (R(N ∗W ) (every Q)))
(3-ary 10)
(every C W ) (every (R(N ∗C ) (N ∗Q)) (R(N ∗W ) (N ∗Q)))
is true, but the proof won't go through unless we introduce another rule like (29) but with other 'determiners', namely: from (every P Q) infer (every (R(N ∗ P)) (R(N ∗ Q))). This, as well, makes it necessary to account for different 'determiners' in the case of a 3-ary relation. The rules required are in Table 4 above. Similar augmentations should occur for rules 30 and 31, whether for a binary relation accounting for 'determiners' other than 'some' and 'every' or for a 3-ary relation. It is important to say that these inferences are licensed independently of the different scope possibilities of quantifiers. For instance, a property of 'every' licenses the inference 'every man likes a woman' from the sentence, S, 'every man likes a beautiful woman', independently of any interpretation given to S. We have described the logic and the inferences it allows. In the next section, for the purpose of completeness only, we describe, very briefly, how the logic and the inferences fit into the picture of defining a computer processable controlled English.
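The way such rules can drive a deductive engine can be sketched as follows. This is an illustration under assumptions, not the authors' system: formulas are encoded as tuples, only three rules from the list above are modelled ((12) symmetry of 'some', (13) transitivity of 'every', and (14)), and the class names are invented.

```python
# Minimal forward-chaining over McLogic-style facts (illustration only).
# Formulas: ("every", s, w) or ("some", s, w), with class expressions as strings.

def closure(facts):
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        new = set()
        for f in facts:
            if f[0] == "some":                     # rule 12: (some C W) => (some W C)
                new.add(("some", f[2], f[1]))
        for q1, c, z in facts:
            for q2, c2, w in facts:
                if q1 == "every" and q2 == "every" and z == c2:
                    new.add(("every", c, w))       # rule 13: transitivity of 'every'
                if q1 == "every" and q2 == "some" and c == c2:
                    new.add(("some", z, w))        # rule 14
        if not new <= facts:
            facts |= new
            changed = True
    return facts

# Argument 4 above, with invented class names:
kb = {("every", "canadian_resident", "na_resident"),
      ("every", "na_resident", "travel_freely_within_europe")}
print(("every", "canadian_resident", "travel_freely_within_europe") in closure(kb))  # True
```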
4
CLIP: The Controlled English McLogic Defines
Definition 10. CLIP is a sublanguage of English with the following properties: – It is syntactically and semantically processable by a computer. – Each sentence in CLIP has a well-formed translation in McLogic. – The ambiguities in a sentence are controlled in a way that the interpretation of that sentence allow inferences required in FraCas D16 [1] – The vocabulary is controlled only as far as the syntactic category. The word CLIP implicitly ’clips’ a part of the ’whole’, that is, dialect or sublanguage not full English. Here is an example:
Calvin: Susie understands some comic books. Many comic books deal with serious issues. All superheroes face tough social dilemmas. It is not true that a comic book is an escapist fantasy. Every comic book is a sophisticated social critique. Hobbes: Most comic books are incredibly stupid. Every character conveys a spoken or graphic ethical message to the reader before some evil spirit wins and rules.
McLogic0 is the basic building block for CLIP. To start with, an English sentence belongs to CLIP if, and only if, it has a well-formed translation in McLogic0 . Further, we extended McLogic0 to account for more English constituents motivated by the structurally-based inferences in the FraCas test suite. Inferences with their corresponding properties, premises and conclusion add to the expressivity of the dialect. To emphasize the above idea, we consider some kind of recursive view: Base Case: McLogic0 Recursive Step: McLogicn depends on McLogicn−1 However, it is not an accumulative one-way hierarchy of languages since CLIP and the reasoning task motivates the extension.
5
Conclusion
We believe that the NLP community and the KR community seek common goals, namely, representing knowledge and reasoning about such knowledge. Nonetheless, the two communities have difficult-to-meet desiderata: expressivity and taking information in context into account for a semantic representation, and efficient reasoning for a KR. Finding one representation that is capable of both is still a challenge. We argue that an NL-like KR may help close that gap. Trying to achieve our aim of an NL-based formal KR led to an interesting inquiry into (not in order of importance):
– a novel meaning representation for English;
– a KR in relation to the following: given English utterances U1, ..., Un and an English utterance C, a machine has to decide whether C follows from U1 ∧ ... ∧ Un;
– a way for an NL expert Knowledge-Based System (KBS) to provide a clear justification for the line of reasoning used to draw its conclusions.
An experimental system has been developed. The system has been tested (as a guideline for its development) on two substantial examples: the first is a well-known test case for theorem provers ('Schubert's Steamroller') [12] and the second is a well-known example from the Z programme specification literature
('Wing's library problem' [2]). Further, our aim was a general reasoning component that could handle a test suite consisting of a set of structurally-based deductions; that is, deductions licensed by specific properties of English constituents, made independently of the domain. As mentioned earlier, we have used the FraCas test suite as a guideline for structurally-based inferences. Hence, we extended McLogic0 in a formal manner, and accordingly the KR system, which takes a restricted but still powerful sublanguage of English as input. As far as we know, no one has previously provided a deductive computational engine covering all the types of inference illustrated or listed in the FraCas test suite, or incorporated McLogic0 (let alone an extension of it) into an NLP system. The practical value of the study can be seen, at least, in the following ways:
1. For a knowledge engineer, having an NL-like (i.e. transparent) KR makes it easier to debug any KB reasoning system.
2. Since McLogic's inference set is domain-independent and aims to be rich enough to test the semantic capacity of any NLP system, McLogic could be used in an advanced question-answering system in any domain.
References
[1] Cooper, R., Crouch, D., van Eijck, J., Fox, C., van Genabith, J., Jaspars, J., Kamp, H., Milward, D., Pinkal, M., Poesio, M., Pulman, S.: The FraCas Consortium, Deliverable D16. With additional contributions from Briscoe, T., Maier, H. and Konrad, K. (1996)
[2] Diller, A.: Z: An Introduction to Formal Methods. Wing's Library Problem. John Wiley and Sons Ltd. (1994)
[3] Dowty, D. R., Wall, R. E., Peters, S.: Introduction to Montague Semantics. D. Reidel Publishing Co., Boston (1981)
[4] Fyodorov, Y., Winter, Y., Francez, N.: A Natural Logic Inference System. Inference in Computational Semantics (2000)
[5] Hwang, C. H., Schubert, L. K.: Episodic Logic: A Comprehensive, Natural Representation for Language Understanding. Minds and Machines 3(4) (1993) 381–419
[6] Iwanska, L. M.: Natural Language Processing and Knowledge Representation. Natural Language is a Powerful Knowledge Representation System. MIT Press (2000) 7–64
[7] Iwanska, L. M.: Natural Language Is a Representational Language. Knowledge Representation Systems Based on Natural Language. AAAI Press (1996) 44–70
[8] McAllester, D., Givan, R.: Natural Language Syntax and First-Order Inference. Artificial Intelligence 56 (1992) 1–20
[9] McAllester, D., Givan, R., Shalaby, S.: Natural Language Based Inference Procedures Applied to Schubert's Steamroller. AAAI (1991)
[10] Schubert, L. K., Hwang, C. H.: Natural Language Processing and Knowledge Representation. Episodic Logic Meets Little Red Riding Hood: A Comprehensive Natural Representation for Language Understanding. AAAI Press; Cambridge, Mass.: MIT (2000) 111–174
[11] Shapiro, S. C.: Natural Language Processing and Knowledge Representation. SNePS: A Logic for Natural Language Understanding and Commonsense Reasoning. AAAI Press; Cambridge, Mass.: MIT (2000) 175–195
[12] Stickel, M. E.: Schubert's Steamroller Problem: Formulations and Solutions. Journal of Automated Reasoning 2 (1986) 89–101
[13] Sukkarieh, J. Z.: Mind Your Language! Controlled Language for Inference Purposes. To appear in EAMT/CLAW03, Dublin, Ireland (2003)
[14] Sukkarieh, J. Z.: Natural Language for Knowledge Representation. University of Cambridge. http://www.clp.ox.ac.uk/people/staff/jana/jana.htm (2001)
[15] Sukkarieh, J. Z.: Quasi-NL Knowledge Representation for Structurally-Based Inferences. In: Blackburn, P., Kolhase, M. (eds.): Proceedings of the Third International Workshop on Inference in Computational Semantics, Siena, Italy (2001)
[16] Sukkarieh, J. Z., Pulman, S. G.: Computer Processable English and McLogic. In: Bunt, H. et al. (eds.): Proceedings of the Third International Workshop on Computational Semantics, Tilburg, The Netherlands (1999)
Connecting Word Clusters to Represent Concepts with Application to Web Searching Arnav Khare Dept. of Information Technology, Institute of Engineering & Technology, DAVV, Khandwa Road, Indore, MP, India
[email protected]
Abstract. The need for a search technique that helps a computer understand the user and his requirements has long been felt. This paper proposes a new technique for doing so. It proceeds by first clustering English words into clusters of similar meaning, and then connecting those clusters according to their observed relationships and co-occurrences in web pages. These known relationships between word clusters are used to enhance the user's query and, in effect, 'understand' it. This process results in answers of more value to the user. The procedure does not suffer from the problems faced by many of the presently used techniques. Keywords: Knowledge Representation, Machine Learning.
1 Introduction
As the size of the internet, and with it the information on it, has grown, the task of finding the right place for the information you want has become more and more daunting. Search engines have attempted to solve this problem using a variety of approaches, which are outlined in the Related Research section. In this paper, a new approach to understanding the user is suggested, which results in more appropriate answers to his queries. The technique takes a connectionist approach. There is considerable evidence that similar mechanisms are at work in the human mind: a thing, meaning or concept is linked to another concept based on their co-occurrence. If two things happen together several times, they are linked to each other; this is known in psychology as 'Reinforced Learning'. If two events happen one after another many times, it gives rise to the concept of causation. This connectionist theory has been applied in neural networks, but at the physical level of neurons. The proposed method works at a higher level, in which concepts are connected to each other to represent meaning.
2 Related Research
The initial approach to text & information retrieval, known as ad hoc retrieval, was to consider a document as a bag of words. Each such bag (document) was remembered
for the words that it contained. When a user asked a query (a string of words), the engine matched the words in the query with the words in stored documents. To improve the results, some metrics were applied to rank them in order of relevance, quality, popularity, etc. Brin and Page [3] discuss how they implemented this method in Google. Many improvements were tried in various meta-search engines like MetaCrawler [4], SavvySearch [5], etc. But there were many problems associated with this method:
1. The number of documents found by this method was very large (typically in the thousands for a large-scale search engine).
2. A word generally has multiple meanings, senses and connotations. This technique retrieved documents carrying all the meanings of a word. For example, if a query contains the word 'tiger', the engine would retrieve pages which contain references to both 'the Bengal tiger' and 'Tiger Woods'.
To tackle these problems, approaches from artificial intelligence and machine learning had to be adopted. Sahami [6] gives an in-depth discussion of these approaches. The newer ranking methods, such as TFIDF weighting [8], usually weigh each query term according to its rarity in the collection (often referred to as the inverse document frequency, or IDF) and then multiply this weight by the frequency of the corresponding query term in an individual document (referred to as the term frequency, or TF) to get an overall measure of that document's similarity to the user's query. The premise underlying such measures is that word frequency statistics provide a good measure for capturing a document's relevance to a user's information need (as articulated in his query). A discussion of various ranking methods can be found in [7]. [9] and [12] also give a detailed discussion of building systems with ad hoc retrieval as well as additional information on document ranking measures. The history of the development of SMART, one of the historically most important retrieval systems based on TFIDF, is given in [11]. But this approach too looks for the frequency of a term in a document without considering its meaning. Vector space models ([12] and [8]) have also been used for document retrieval. In this approach, a document is represented by a multi-dimensional vector, in which there is a dimension for each category. The nearness of two documents, or of a document and a query, can be found using the scalar product of their two vectors. This approach, however, has the problem that there can only be a limited number of dimensions, and it is unclear which dimensions should be included. Also, if a new dimension has to be added, forming new vectors for all the previously analyzed documents is an unnecessary overhead. Two sentences may mean the same even if they use different words, but the previously discussed techniques do not address this issue. Some methods, like Latent Semantic Indexing [17], try to solve it. The word category map method can also be used for this purpose. Many attempts have also been made at gathering the contextual knowledge behind a user's query. Lawrence [13] and Brezillon [23] discuss the use of context in web search. McCarthy [14] has tried to formalize context. The Intellizap system [15] and SearchPad [16] try to capture the context of a query from the background document, i.e. the document which the user was reading before he asked the query.
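To make the TF and IDF weighting described above concrete, here is a minimal sketch of how such a scheme could be coded; it is not taken from any of the cited systems, and the toy document collection, the logarithmic IDF and the cosine ranking are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TFIDF vectors: weight = term frequency * inverse document frequency."""
    n_docs = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n_docs / df[t]) for t in df}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

docs = [["tiger", "bengal", "habitat"], ["tiger", "woods", "golf"], ["golf", "course"]]
doc_vecs, idf = tfidf_vectors(docs)
query = ["tiger", "golf"]
q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}
ranking = sorted(range(len(docs)), key=lambda i: cosine(q_vec, doc_vecs[i]), reverse=True)
print(ranking)   # documents ordered by TFIDF similarity to the query
```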
Recent research has focused on the use of Natural Language Processing in text retrieval. The use of lexical databases like WordNet ([1] and [2]) has been under consideration for a while. Richardson and Smeaton [18] and Smeaton and Quigley [19] tried to use WordNet in information retrieval. It has been shown [21] that indexing words with their WordNet synset or sense improves information retrieval performance. Research is underway into using the Universal Networking Language (UNL) to create document vectors. UNL (explained in [20]) represents a document in the form of a semantic graph. Bhattacharya and Choudhary [22] have used UNL for text clustering.
3 Word Clusters
A thought having a meaning may be conveyed by one person to another using any of a number of words which have, if not the same, then similar meanings. Such a set of words (synonyms) refers to approximately the same meaning. Let us refer to this set of similar words as a 'cluster'. A word may fall into a number of clusters, depending on the number of different meanings it has. This clustering can be done using available electronic dictionaries, thesauri, or lexical databases (e.g. WordNet [1]), or by using automated clustering methods. According to its role in a sentence, a word may be classified as a noun, verb or adjective (overlooking other minor syntactic categories). Corresponding to these, there will be three kinds of word clusters. Among them, noun clusters have a special characteristic: inheritance of properties of one class by another. For example, the cluster {dog, canine} will inherit properties from the cluster {mammal}, which will inherit from {animal, creature, ...}. Thus, the clusters of nouns can be organized into hierarchies of clusters. These hierarchies can be viewed in the form of inverted trees.
Fig. 1. A Network representation of semantic relations between nouns. Hyponymy refers to ‘isa' relationship. Antonymy refers to relationship between nouns with opposing meanings. Meronymy refers to containment or ‘has-a' relationship. [2]
Yet some words will still be left unclustered. They may be names, slang and new words that have not yet made their way into universal knowledge or acceptability. A lexicon of these has to be maintained separately.
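As an illustration of how such synonym clusters and noun hierarchies might be obtained in practice, the sketch below queries WordNet through the NLTK interface; it assumes the NLTK WordNet data is installed and is only one possible way of realising the clustering described above.

```python
from nltk.corpus import wordnet as wn   # assumes nltk and its wordnet corpus are installed

def noun_clusters(word):
    """Return each noun sense of `word` as a cluster (synonym set)
    together with its parent clusters in the 'is-a' hierarchy."""
    clusters = []
    for synset in wn.synsets(word, pos=wn.NOUN):
        clusters.append({
            "cluster": synset.lemma_names(),                          # e.g. ['dog', 'domestic_dog', 'Canis_familiaris']
            "parents": [h.lemma_names() for h in synset.hypernyms()]  # 'is-a' parents in the noun hierarchy
        })
    return clusters

for c in noun_clusters("dog"):
    print(c["cluster"], "->", c["parents"])
```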
4 Forming a Network of Clusters
Once the various clusters of words which represent meanings have been defined, we can add knowledge by analyzing how these meanings are related. There is no better place to learn relations between meanings than the internet. When the engine visits a page, it follows this procedure (a sketch of the link-building step appears after the list):
1. Replace those words which have unambiguous meanings with the cluster number of that word.
2. For words that have several meanings, look at the clusters near that word and match them with the clusters near each of the word's meanings. That is, if a word W has three meanings A, B and C, look at the clusters in the neighborhood of W in the document and match them with the clusters in the neighborhood of clusters A, B and C. The first match will be the closest meaning, and thus the correct sense of W is identified. Now, instead of a sequence of words, we have a sequence of meanings represented by clusters.
3. Remember a page for the meanings it contains. A site is represented by a sequence of clusters. We can also build an inverse mapping in which, for every cluster, the sites containing a word in that cluster are stored.
4. The sequence of clusters in a document can be used to determine the relationships of clusters with each other. A simple technique is to take the n clusters in the neighborhood of a cluster to be related to that cluster, for any suitable value of n. A link is created between clusters which are within distance n of each other in a document. If the link is already present, it is strengthened by increasing its weight by 1. Using this, the most probable related cluster for a given cluster can be found. Thus, once a sufficiently large number of occurrences of a cluster have been seen, the m (a suitable value) most probable neighbouring clusters will be the ones that are actually meaningfully related to that cluster.
This linked network resembles a weighted graph and can be represented as a table. Thus a network of related clusters has been created. This network acts as a knowledge base in which knowledge is represented as links between word meanings.
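The link-building step (step 4) might be realised roughly as in the following sketch; the window size n, the cluster identifiers and the dictionary-of-dictionaries representation of the weighted graph are assumptions for illustration, not the paper's implementation.

```python
from collections import defaultdict

def add_document(graph, cluster_sequence, n=3):
    """Strengthen links between clusters that occur within `n` positions
    of each other in a document (represented as a sequence of cluster ids)."""
    for i, c in enumerate(cluster_sequence):
        for j in range(i + 1, min(i + n + 1, len(cluster_sequence))):
            d = cluster_sequence[j]
            if c != d:
                graph[c][d] += 1       # create the link or increase its weight by 1
                graph[d][c] += 1

def related_clusters(graph, cluster, m=3):
    """Most probable related clusters: the top-m neighbours by normalised link weight."""
    neighbours = graph[cluster]
    total = sum(neighbours.values()) or 1
    ranked = sorted(neighbours.items(), key=lambda kv: kv[1], reverse=True)
    return [(d, w / total) for d, w in ranked[:m]]

graph = defaultdict(lambda: defaultdict(int))
add_document(graph, ["C_tiger", "C_forest", "C_habitat", "C_tiger", "C_conservation"], n=2)
print(related_clusters(graph, "C_tiger"))
```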
Fig. 2. An example network of clusters linked to each other. Darker lines indicate links with higher weights
5 Querying the Network of Word Clusters
When a user query is presented, the following procedure can be used to find the sites that are related to the words in the query.
1. Get the clusters corresponding to the words in the query.
2. Find a path between the query clusters in the network. The path will consist of a number of other clusters. This path represents the sequence of meaning that connects the query words together. These additional clusters will enhance the original query; they correspond to the meaning of the user query as the engine has 'understood' it.
3. Get the list of sites that correspond to these clusters. Among these sites, rank the sites which carry the original words of the query higher. Again, the resulting list can be ranked according to any metric mentioned earlier, such as the nearness of query words in the document, the popularity of the document, the prominence of the query words within the document (i.e. heading, bold, italic, underline, etc.), or any other suitable metric.
4. Present the results to the user.
Finding the Path between Clusters
The path-finding procedure depends on the number of word clusters in the query (a sketch of this search appears after the list).
• If the number of clusters in the query is one: give the sites corresponding to that word and its cluster members.
• If the number of clusters is two or more: do the following for a fixed number of iterations. For each cluster in the query, get the list of related clusters, with their
probabilities. If a cluster is a noun cluster in a hierarchy, get the list of clusters connected to its parent as well, because a child noun inherits the properties of its parent. New probabilities for each cluster in the list are calculated with respect to the original query cluster. Compare the lists at each iteration to look for a common cluster between them. If a common cluster is found, then a link between those two query clusters has been found; save that path, but continue until paths between all the clusters in the query have been found or the fixed number of iterations has completed. If no common cluster is found, then for each cluster's list, go to the cluster with the next highest probability. This is similar to 'best-first search', widely applied in artificial intelligence. If, after the fixed number of iterations, only some of the connections have been found, take the intermediate clusters of those connections, consider the rest of the query clusters as single clusters, and treat them as in the one-cluster case above.
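The path-finding procedure above can be approximated by a greedy, best-first expansion over the weighted cluster graph; the sketch below is only one interpretation of that procedure (it expands from one query cluster towards another rather than growing both lists in parallel), and the graph, the iteration limit and the scoring are illustrative assumptions.

```python
import heapq

# Weighted cluster graph in the same dict-of-dicts form as the earlier sketch.
graph = {
    "C_tiger":        {"C_forest": 3, "C_habitat": 1},
    "C_forest":       {"C_tiger": 3, "C_conservation": 2},
    "C_habitat":      {"C_tiger": 1},
    "C_conservation": {"C_forest": 2},
}

def connect_clusters(graph, source, target, max_iterations=10):
    """Best-first expansion from `source` towards `target` following the
    strongest (most probable) links first.  Returns the connecting chain
    of clusters, or None if nothing is found within the iteration limit."""
    frontier = [(0, source, [source])]        # entries: (negative link weight, cluster, path so far)
    visited = set()
    for _ in range(max_iterations):
        if not frontier:
            break
        _, cluster, path = heapq.heappop(frontier)
        if cluster == target:
            return path                       # a common cluster has been reached: path found
        if cluster in visited:
            continue
        visited.add(cluster)
        for neighbour, weight in graph.get(cluster, {}).items():
            if neighbour not in visited:
                heapq.heappush(frontier, (-weight, neighbour, path + [neighbour]))
    return None

print(connect_clusters(graph, "C_tiger", "C_conservation"))
```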
Fig. 3. The path between two query clusters (darker) is found when a cluster common to each other's lists is found. The intermediate clusters are used to 'understand' the query
6 Evaluation
The evaluation of the proposed technique is underway, and results should be available soon. Analysis shows that the worst-case situation for this technique is when no path is found between the query words. Even in this case the meaning of each query word is known (by knowing the cluster to which it belongs, and thus disambiguating it), even if the meaning of the whole query has not been comprehended. Hence, even in its worst case, this procedure will give better results than earlier techniques, which cannot differentiate between different uses of a word.
Though this technique uses quantitative methods to learn links between words, it will give better results because it has background knowledge about the user's query, which is the key to understanding its meaning. It is this huge amount of background knowledge about the world that separates human understanding from computer understanding.
7 Conclusion
As was seen in the last section, the new technique promises to deliver much better results compared to currently used methods. It overcomes many of the deficiencies that afflict the methods outlined earlier. It is also an intuitive approach, which relates concepts that occur together to each other, and it gives a new way to approach computer understanding. Further work can be done on deciding the number of iterations that is suitable while searching the network, and on the manner in which the noun hierarchy may be used more efficiently. Finding better methods of linking words than simply their co-occurrence in a document would also help considerably. Also, finding how this network may be enhanced to approximate human logic is an interesting problem. Such a network could be applied in the future to other tasks such as Natural Language Processing and Image Processing and Recognition, in which the visual features of an object may also be stored.
References
[1] WordNet Lexical Database. http://www.cogsci.princeton.edu/~wn/
[2] George A. Miller, Richard Beckwith, Christiane Fellbaum, et al. 1993. Introduction to WordNet: An On-line Lexical Database.
[3] Sergey Brin, Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proceedings of the 7th International World Wide Web Conference, pages 107–117, Brisbane, Australia, April 1998. Elsevier Science.
[4] Selberg and Etzioni. The MetaCrawler Architecture for Resource Aggregation on the Web. IEEE Expert, 12(1), pages 8–14, 1997.
[5] A.E. Howe and D. Dreilinger. SavvySearch: A Meta-Search Engine that Learns which Search Engines to Query. 1997.
[6] M. Sahami. Using Machine Learning to Improve Information Access. PhD dissertation, Stanford University, December 1998.
[7] D. Harmon. Ranking Algorithms. In Information Retrieval: Data Structures and Algorithms, W.B. Frakes and R. Baeza-Yates, Eds. Prentice Hall, 1992, pages 363–392.
[8] G. Salton and C. Buckley. Term-Weighting Approaches in Automatic Text Retrieval. Information Processing and Management 24(5) (1988), pages 513–523.
[9] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
[10] G. Salton, A. Wong and C.S. Yang. A vector space model for automatic indexing. Communications of the ACM 18 (1975), 613–620.
[11] G. Salton. The SMART Information Retrieval System. Prentice Hall, Englewood Cliffs, NJ, 1975.
[12] C.J. van Rijsbergen. Information Retrieval. Butterworths, 1979.
[13] Steve Lawrence. Context in Web Search. IEEE Data Engineering Bulletin, Volume 23, Number 3, pp. 25–32, 2000.
[14] J. McCarthy, 1993. "Notes on formalizing context". Proceedings of the 13th IJCAI, Vol. 1, pp. 555–560.
[15] L. Finkelstein, et al. Placing Search in Context: The Concept Revisited. 10th World Wide Web Conference, May 2-5, 2001, Hong Kong.
[16] K. Bharat. SearchPad: Explicit Capture of Search Context to Support Web Search. In Proceedings of the 9th International World Wide Web Conference, WWW9, Amsterdam, May 2000.
[17] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society of Information Science, 1990.
[18] R. Richardson and A. Smeaton. Using WordNet in a Knowledge-Based Approach to Information Retrieval. In Proceedings of the BCS-IRSG Colloquium, Crewe.
[19] Smeaton and A. Quigley. 1996. Experiments on using semantic distances between words in image caption retrieval. In Proceedings of the 19th International Conference on Research and Development in IR.
[20] H. Uchida, M. Zhu, T. Della Senta. UNL: A Gift for a Millennium. The United Nations University, 1995.
[21] Julio Gonzalo, Felisa Verdejo, Irina Chugur, Juan Cigarran. Indexing with WordNet synsets can improve Text Retrieval. Proceedings of the COLING/ACL '98 Workshop on Usage of WordNet for NLP, Montreal, 1998.
[22] P. Bhattacharya and B. Choudhary. Text Clustering using Semantics.
[23] P. Brezillon. Context in Problem Solving: A Survey.
[24] Sven Martin, Jorg Liermann, Hermann Ney. Algorithms for Bigram and Trigram Word Clustering.
[25] Ushioda, Akira. 1996. Hierarchical clustering of words and application to NLP tasks. In Proceedings of the Fourth Workshop on Very Large Corpora.
A Framework for Integrating Deep and Shallow Semantic Structures in Text Mining Nigel Collier, Koichi Takeuchi, Ai Kawazoe, Tony Mullen, and Tuangthong Wattarujeekrit National Institute of Informatics 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan {collier,koichi,zoeai,mullen,tuangthong}@nii.ac.jp
Abstract. Recent work in knowledge representation undertaken as part of the Semantic Web initiative has enabled a common infrastructure (Resource Description Framework (RDF) and RDF Schema) for sharing knowledge of ontologies and instances. In this paper we present a framework for combining the shallow levels of semantic description commonly used in MUC-style information extraction with the deeper semantic structures available in such ontologies. The framework is implemented within the PIA project software called Ontology Forge. Ontology Forge offers a server-based hosting environment for ontologies, a server-side information extraction system for reducing the effort of writing annotations and a many-featured ontology/annotation editor. We discuss the knowledge framework, some features of the system and summarize results from extended named entity experiments designed to capture instances in texts using support vector machine software.
1 Introduction
Recent work in knowledge representation undertaken as part of the Semantic Web initiative [1] has enabled Resource Description Framework (RDF) Schema, RDF(S), [2] to become a common metadata standard for sharing knowledge on the World Wide Web. This will allow for the explicit description of concepts, properties and relations in an ontology, which can then be referenced online to determine the validity of conforming documents. The use of ontologies allows for a deep semantic description in each domain where a group of people share a common view on the structure of knowledge. However, it still does not solve the knowledge acquisition problem, i.e. how to acquire ontologies and how to instantiate them with instances of the concepts. Instantiation will be a major effort and needs support tools if the Semantic Web is to expand and fulfill its expected role. This is where we consider that information extraction (IE) has an important part to play. IE systems are now well developed for capturing low-level semantics inside texts, such as named entities and coreference (identity) expressions. IE, however, does not offer a sufficiently formal framework for specifying relations between concepts, assuming for example that named entities are instances of
a disjoint set of concepts with no explicit relations. Without such formalization it is difficult to consider the implications of concept class relations when applying named entity extraction outside of simple domains. To look at this another way, by making the deep semantics explicit we offer a way to sufficiently abstract IE to make it portable between domains. Our objective in Ontology Forge is to join the two technologies so that human experts can create taxonomies and axioms (the ontologies) and, by providing a small set of annotated examples, machine learning can take over the role of instance capturing through information extraction technology. As a motivating example we consider information access to research results contained in scientific journals and papers. The recent proliferation of scientific results available on the Web means that scientists and other experts now have far greater access than ever before to the latest experimental results. While a few of these results are collected and organized in databases, most remain in free text form. Efficient tools such as the commercial search engines have been available for some years to help in the document location task, but we still do not have effective tools to help in locating facts inside documents. Such tools require a level of understanding far beyond simple keyword spotting and for this reason have taken longer to develop. Access to information is therefore now limited by the time the expert spends reading the whole text to find the key result before it can be judged for relevance and synthesized into the expert's understanding of his/her domain of knowledge. By explicitly describing the concepts and relations inside the scientific domain and then annotating a few examples of texts we expect that machine learning will enable new instances to be found in unseen documents, allowing computer software to have an intelligent understanding of the contents, thereby aiding the expert in locating the information he/she requires. The focus of this paper is on three topics: firstly to summarize our system called Ontology Forge, which allows ontologies and annotations to be created cooperatively within a domain community; secondly to present a metadata scheme for annotations that can provide linkage between the ontology and a text; and thirdly to present results from our work on instance capturing as an extended named entity task [5] using machine learning. We motivate our discussion with examples taken from the domain of functional genomics.
2 The Ontology Forge System
The Ontology Forge design encourages collaborative participation on an ontology project by enabling all communication to take place over the Internet using a standard Web-based interface. After an initial domain group is formed we encourage group members to divide labour according to levels of expertise. This is shown in the following list of basic steps required to create and maintain a domain ontology.
Domain Setup: a representative for a domain (the Domain Manager) will apply to set up an ontology project that is hosted on the Ontology Forge server.
Community Setup: the Domain Manager will establish the domain group according to levels of interest and competence, i.e. Experts and Users.
Ontology Engineering: Ontologies are constructed in private by Domain Experts through discussion.
Ontology Publication: When the Domain Manager decides that a version of the ontology is ready to be released for public access he/she copies it into a public area on the server.
Annotation: When a public ontology is available Domain Users can take this and annotate Web-based documents according to the ontology. Annotations can be uploaded to the server and stored in the private server area for sharing and discussion with other domain members.
Annotation Publication: The Domain Manager copies private annotations into the public server area.
Training: When enough annotations are available the Domain Manager can ask the Information Extraction system called PIA-Core to learn how to annotate unseen domain texts.
Improvement Cycle: Annotation and training then proceed in a cycle of improvement so that Domain Users correct output from the Information Extraction system and in turn submit these new documents to become part of the pool of documents used in training. Accuracy gradually improves as the number of annotations increases. The amount of work for annotators is gradually reduced.
The overall system architecture for Ontology Forge on the server side includes a Web server, a database management system and an application server which will primarily be responsible for information extraction. On the client side we have an ontology editor and annotation tool called Open Ontology Forge (OOF!).
2.1 Annotation Meta-data
In this section we will focus on the specific characteristics of our annotation scheme. The root class in an Ontology Forge ontology is the annotation, defined in RDF Schema as a name space and held on the server at a URI. The user does not need to be explicitly aware of this, and all classes that are defined in OOF will inherit linkage and tracking properties from the parent so that when instances are declared as annotations in a base Web document, the instance will have many of the property values entered automatically by OOF. Basically the user is free to create any classes that help define knowledge in their domain according to the limits of RDF Schema. We now briefly describe the properties of the root annotation class:
context: relates an Annotation to the resource to which the Annotation applies and takes on an XPointer value.
ontology id: relates an Annotation to the ontology and class to which the annotation applies.
conventional form: the conventional form of the annotation (if applicable), as described in the PIA Annotation Guidelines.
identity id: used for creating coreference chains between annotations where the annotations have identity of reference. From a linguistic perspective this basically assumes a symmetric view of coreference where all coreference occurrences are considered to be equal [12].
constituents: used to capture constituency relations between annotations, if required.
orphan: this property takes only Boolean values corresponding to 'yes' and 'no'. After the annotation is created, if it is later detected that the annotation can no longer be linked to its correct position in doc id, then this value will be set to 'yes', indicating that the linkage (in context) needs correcting.
author: the name of the person, software or organization most responsible for creating the Annotation. It is defined by the Dublin Core [8] 'creator' element.
created: the date and time on which the Annotation was created. It is defined by the Dublin Core 'date' element.
modified: the date and time on which the Annotation was modified. It is defined by the Dublin Core 'date' element.
sure: this property takes only Boolean values corresponding to 'yes I am sure' and 'no I am not sure' about the assignment of this annotation. Used primarily in post-annotation processing.
comment: a comment that the annotator wishes to add to this annotation, possibly used to explain an unsure annotation. It is defined by the Dublin Core 'description' element.
Due to the lack of defined data types in RDF we make use of several data types in the Dublin Core name space for defining annotation property values. In OOF we also allow users to use a rich set of data type values using both the Dublin Core and XML Schema name spaces for integers, floats etc. Users can also define their own enumerated types using OOF. A selected view of an annotation can be seen in Figure 1, showing linkage into a Web document and part of the ontology hierarchy. One point to note is the implementation of coreference relations between JAK1 and this protein; this helps to maximize information about this object in the text.
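Purely as an illustration, an annotation instance carrying a few of the properties above could be expressed as RDF with rdflib as sketched below; the namespace URI, the XPointer value and all property values are hypothetical and do not reproduce the actual Ontology Forge schema.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, DC

# Hypothetical namespace standing in for the Ontology Forge annotation schema.
PIA = Namespace("http://example.org/ontology-forge/annotation#")

g = Graph()
ann = URIRef("http://example.org/annotations/ann-001")

g.add((ann, RDF.type, PIA.Annotation))
g.add((ann, PIA.context, Literal("http://example.org/doc.html#xpointer(/html/body/p[3])")))
g.add((ann, PIA.ontology_id, Literal("genomics#PROTEIN")))
g.add((ann, PIA.conventional_form, Literal("JAK1")))
g.add((ann, PIA.sure, Literal(True)))
g.add((ann, DC.creator, Literal("annotator-01")))       # 'author' via the Dublin Core creator element
g.add((ann, DC.date, Literal("2003-05-01")))            # 'created' via the Dublin Core date element

print(g.serialize(format="turtle"))
```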
3 Named Entities as Instances
Named entity (NE) extraction is now firmly established as a core technology for understanding the low-level semantics of texts. NE was formalized in MUC-6 [7] as the lowest-level task and since then several methodologies have been widely explored: heuristics-based methods, using rules written by human experts after inspecting examples; supervised learning using labelled training examples; non-supervised learning methods; and combinations of supervised and non-supervised methods.
Fig. 1. Overview of an annotation
NE's main role has been to identify expressions such as the names of people, places and organizations, as well as date and time expressions. Such expressions are hard to analyze using traditional natural language processing (NLP) because they belong to the open class of expressions, i.e. there is an infinite variety and new expressions are constantly being invented. Applying NE to scientific and technical domains requires us to consider two important extensions to the technology. The first is to consider how to capture types, i.e. instances of conceptual classes, as well as individuals. The second is to place those classes in an explicitly defined ontology, i.e. to clarify and define the semantic relations between the classes. To distinguish between traditional NE and extended NE we refer to the latter as NE+. Examples of NE+ classes include a person's name, a protein name, a chemical formula or a computer product code. All of these may be valid candidates for tagging depending on whether they are contained in the ontology. Considering types versus individuals, there are several issues that may make NE+ more challenging than NE. The most important is the number of variants of NE+ expressions due to graphical, morphological, shallow syntactic and discourse variations, for example the use of head sharing combined with embedded abbreviations in unliganded (apo)- and liganded (holo)-LBD. Such expressions will require syntactic analysis beyond simple noun phrase chunking if they are to be successfully captured. NE+ expressions may also require richer
contextual evidence than is needed for regular NEs - for example knowledge of the head noun or the predicate-argument structure.
4 Instance Capturing
In order to evaluate the ability of machine learning to capture NE+ expressions we investigated a model based on support vector machines (SVMs). SVMs, like other inductive-learning approaches, take as input a set of training examples (given as binary-valued feature vectors) and find a classification function that maps them to a class. There are several points about SVM models that are worth summarizing here. The first is that SVMs are known to robustly handle large feature sets and to develop models that maximize their generalizability by sometimes deliberately misclassifying some of the training data so that the margin between other training points is maximized [6]. Secondly, although training is generally slow, the resulting model is usually small and runs quickly, as only the support vectors need to be retained. Formally we can consider the purpose of the SVM to be to estimate a classification function f : χ → {±1} using training examples from χ × {±1} so that error on unseen examples is minimized. The classification function returns +1 if the test data is a member of the class, or −1 if it is not. Although SVMs learn what are essentially linear decision functions, the effectiveness of the strategy is ensured by mapping the input patterns χ to a feature space Γ using a nonlinear mapping function Φ : χ → Γ. Since the algorithm is well described in the literature cited earlier, we will focus our description from now on the features we used for training. Due to the nature of the SVM as a binary classifier, it is necessary in a multi-class task to consider the strategy for combining several classifiers. In our experiments with Tiny SVM the strategy used was one-against-one rather than one-against-the-rest. For example, if we have four classes A, B, C and D, then Tiny SVM builds pairwise classifiers for (1) A against each of B, C, D, (2) B against each of C, D, and (3) C against D. The winning class is the one which obtains the most votes from the pairwise classifiers. The kernel function k : χ × χ → R mentioned above basically defines the feature space Γ by computing the inner product of pairs of data points. For x ∈ χ we explored the simple polynomial function k(xi, xj) = (xi · xj + 1)^d. We implemented our method using the Tiny SVM package from NAIST (available from http://cl.aist-nara.ac.jp/~taku-ku/software/TinySVM/), which is an implementation of Vladimir Vapnik's SVM combined with an optimization algorithm [10]. The multi-class model is built up by combining binary classifiers and then applying majority voting.
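The experiments themselves used Tiny SVM; as an illustration only, an equivalent set-up with a polynomial kernel (xi · xj + 1)^d and pairwise (one-against-one) voting can be sketched with scikit-learn's SVC, a different implementation used here merely as a stand-in; the toy feature vectors and labels are invented.

```python
from sklearn.svm import SVC

# Toy binary feature vectors (e.g. word / orthographic indicator features)
# and NE+ class labels; both are invented for illustration.
X_train = [
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 1],
]
y_train = ["PROTEIN", "DNA", "PROTEIN", "SOURCE.ct"]

# Polynomial kernel (x_i . x_j + 1)^d with d = 2; pairwise (one-against-one) voting.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, decision_function_shape="ovo")
clf.fit(X_train, y_train)

print(clf.predict([[1, 0, 1, 0, 1]]))   # predicted NE+ class for an unseen vector
```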
4.1 Feature Selection
In our implementation each training pattern is given as a vector which represents certain lexical features. All models use a surface word, an orthographic feature [4]
and previous class assignments, but our experiments with part-of-speech (POS) features [3] showed that POS features actually inhibited performance on the molecular biology data set which we present below. This is probably because the POS tagger was trained on news texts. Therefore POS features are used only for the MUC-6 news data set, where we show a comparison with and without these features. The vector form of the window includes information about the position of each word. In the experiments we report below we use feature vectors consisting of differing amounts of 'context' by varying the window around the focus word which is to be classified into one of the NE+ classes. The full window of context considered in these experiments is ±3 about the focus word.
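A rough sketch of how such windowed feature vectors might be assembled is given below; the orthographic categories, the class encoding and the tokens are assumptions rather than the exact features of the original system, and the resulting dictionaries would still need to be binarised before being passed to the SVM.

```python
def orthographic(token):
    """Very coarse orthographic category for a token (an illustrative assumption)."""
    if token.isupper():
        return "ALLCAPS"
    if token[:1].isupper():
        return "INITCAP"
    if any(ch.isdigit() for ch in token):
        return "HASDIGIT"
    return "LOWER"

def window_features(tokens, prev_classes, i, window=2):
    """Positional features for the focus word tokens[i] over a +/-window context:
    surface word, orthographic tag and previously assigned classes."""
    features = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            features[f"word[{offset}]"] = tokens[j].lower()
            features[f"orth[{offset}]"] = orthographic(tokens[j])
    for offset in range(-window, 0):          # class features only for preceding positions
        j = i + offset
        if j >= 0:
            features[f"class[{offset}]"] = prev_classes[j]
    return features

tokens = ["The", "Tat", "protein", "of", "HIV-1"]
prev = ["O", "B-PROTEIN", "I-PROTEIN", "O", "B-SOURCE.vi"]
print(window_features(tokens, prev, i=2))
```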
4.2 Data Sets
The first data set we used in our experiments is representative of the type of data that we expect to be produced by the Ontology Forge system. Bio1 consists of 100 MEDLINE abstracts (23586 words) in the domain of molecular biology annotated for the names of genes and gene products [14]. The taxonomic structure of the classes is almost flat except for the SOURCE class, which denotes a variety of locations where genetic activity can occur. This provides a good first stage for analysis as it allows us to explore a minimum of relational structures and at the same time look at named entities in a technical domain. A breakdown of examples by class is shown in Table 1. An example from one MEDLINE abstract in the Bio1 data set is shown in Figure 2. As a comparison to the NE task we used a second data set (MUC-6) from the collection of 60 executive succession texts (24617 words) in MUC-6. Details are shown in Table 2.
The [Tat protein]protein of [human immunodeficiency virus type 1]source.vi ([HIV1]source.vi ) is essential for productive infection and is a potential target for antiviral therapy. [Tat]protein , a potent activator of [HIV-1]source.vi gene expression, serves to greatly increase the rate of transcription directed by the viral promoter. This induction, which seems to be an important component in the progression of acquired immune deficiency syndrome (AIDS), may be due to increased transcriptional initiation, increased transcriptional elongation, or a combination of these processes. Much attention has been focused on the interaction of [Tat]protein with a specific RNA target termed [TAR]RNA ([transactivation responsive]RNA ) which is present in the leader sequence of all [HIV-1]source.vi mRNAs. This interaction is believed to be an important component of the mechanism of transactivation. In this report we demonstrate that in certain [CNS-derived cells]source.ct [Tat]protein is capable of activating [HIV-1]source.vi through a [TAR]RNA -independent pathway.
Fig. 2. Example MEDLINE abstract marked up for NE+ expressions
Table 1. NE+ classes used in Bio1 with the number of word tokens for each class

Class      #     Example                      Description
PROTEIN    2125  JAK kinase                   proteins, protein groups, families, complexes and substructures
DNA        358   IL-2 promoter                DNAs, DNA groups, regions and genes
RNA        30    TAR                          RNAs, RNA groups, regions and genes
SOURCE.cl  93    leukemic T cell line Kit225  cell line
SOURCE.ct  417   human T lymphocytes          cell type
SOURCE.mo  21    Schizosaccharomyces pombe    mono-organism
SOURCE.mu  64    mice                         multiorganism
SOURCE.vi  90    HIV-1                        viruses
SOURCE.sl  77    membrane                     sublocation
SOURCE.ti  37    central nervous system       tissue

5 Results
Results are given as the commonly used Fβ=1 score [15], which is defined as Fβ=1 = (2PR)/(P + R), where P denotes precision and R recall. P is the ratio of the number of correctly found NE chunks to the number of found NE chunks, and R is the ratio of the number of correctly found NE chunks to the number of true NE chunks. Table 3 shows the overall F-score for the two collections, calculated using 10-fold cross validation on the total test collection. The table highlights the role that the context window plays in achieving performance. From our results it is clear that -2+2 offers the best overall context window for the feature sets that we explored in both the Bio1 and MUC-6 collections. One result that is clear from Table 3 is the effect of POS features: we found that on Bio1 POS features had a negative effect, while in contrast POS features are overall very beneficial in the MUC-6 collection. Basically we think that the POS tagger was trained on news-related texts and that the MUC-6 texts share a strong overlap in vocabulary; consequently POS tagging is very accurate.
Table 2. Markup classes used in MUC-6 with the number of word tokens for each class label

Class          #
DATE           542
LOCATION       390
ORGANIZATION   1783
MONEY          423
PERCENT        108
PERSON         838
TIME           3
Table 3. Overall F-scores for each of the learning methods on the two test sets using 10-fold cross validation on all data. † Results for models using surface word and orthographic features but no part-of-speech features; ‡ Results for models using surface word, orthographic and part-of-speech features

            Window
Data set    -3+3   -3+2   -2+2   -1+1   -1+0
Bio1†       71.78  71.69  72.12  72.13  65.65
MUC-6†      73.21  73.04  74.10  72.96  65.94
MUC-6‡      74.66  75.13  75.66  74.92  68.83
On the other hand, the vocabulary in the Bio1 texts is far removed from anything the POS tagger was trained on, and accuracy drops. In analysing the results we identified several types of error. The first and perhaps most serious type was caused by local syntactic ambiguities such as head sharing in 39-kD SH2, SH3 domain, which should have been classed as a single PROTEIN but which the SVM split into two PROTEIN expressions, SH2 and SH3 domain. In particular, the ambiguous use of hyphens, e.g. 14E1 single-chain (sc) Fv antibody, and parentheses, e.g. scFv (14E1), seemed to cause the SVM some difficulties. It is likely that the limited feature information we gave to the SVM was the cause of this, and it could be improved on using grammatical features such as the head noun or main verb. A second, minor type of error seemed to be the result of inconsistencies in the annotation scheme for Bio1.
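For reference, the chunk-level precision, recall and Fβ=1 score used above can be computed as in this small sketch; the chunk sets are invented.

```python
def f_score(true_chunks, found_chunks):
    """F(beta=1) over NE chunks: F = 2PR / (P + R)."""
    correct = len(true_chunks & found_chunks)
    precision = correct / len(found_chunks) if found_chunks else 0.0
    recall = correct / len(true_chunks) if true_chunks else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Chunks identified by (start offset, end offset, class); the values are invented.
true_chunks = {(0, 2, "PROTEIN"), (10, 12, "DNA"), (20, 21, "SOURCE.vi")}
found_chunks = {(0, 2, "PROTEIN"), (10, 12, "RNA"), (20, 21, "SOURCE.vi")}
print(round(f_score(true_chunks, found_chunks), 3))   # 0.667
```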
6 Related Work
There were several starting points for our work. Perhaps the most influential was the Annotea project [11], which provided an annotation hosting service based on RDF Schema. Annotations in Annotea can be considered as a kind of electronic 'Post-It' for Web pages, i.e. a remark that relates to a URI. Annotea has several features which our design has used, including: (a) use of XLink/XPointer linkage from annotations to the place in the document where a text occurs; (b) use of generic RDF Schema as the knowledge model, enhanced with types from Dublin Core. However, there are several major differences between our work and Annotea, including our focus on aiding annotation through the use of information extraction, various levels of support for cooperative development of annotations and ontologies, the use of domain groups, and explicit roles and access rights for members. Also influential in our development have been Protégé-2000 [13] and other ontology editors such as Ont-O-Mat [9], which provide many of the same basic features as OOF. While Protégé-2000 provides a rich environment for creating ontologies, it provides limited support for large-scale instance capturing and
collaborative development. Although Protégé-2000 can be extended to provide server-based storage, this is not an inherent part of the design. Ont-O-Mat and OntoEdit from the University of Karlsruhe provide similar functionality to OOF but do not link instances back to the text or have the built-in design focus on collaborative development and information extraction.
7 Conclusion and Future Plans
In this paper we have presented an annotation scheme that links concepts in ontologies with instances in texts using a system that conforms to RDF(S). The software that implements this scheme, called Ontology Forge, is now being completed and is undergoing user trials. There are many benefits to linking IE to deep semantic descriptions, including abstraction of knowledge and portability for IE, and automated instance capturing for the Semantic Web. While we have emphasized the use of RDF(S) in Ontology Forge, we recognize that it has inherent weaknesses due to a lack of support for reasoning and inference as well as constraints on consistency checking. The current knowledge model used in OOF offers a subset of RDF(S), e.g. we forbid multi-class membership of instances, and we are now in the process of reviewing this so that OOF can export to other RDF(S)-based languages with stronger type inferencing, such as DAML+OIL, which we aim to support in the future. We have focussed in this paper on linking named entities to deep structures formalized in ontologies. A key point for our investigation from now on is to look at how we can exploit knowledge provided by concept relations between entities and by properties of entities, which will be of most value to end users and downstream applications.
References
[1] T. Berners-Lee, M. Fischetti, and M. Dertouzos. Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web. Harper, San Francisco, September 1999. ISBN 0062515861.
[2] D. Brickley and R. V. Guha. Resource Description Framework (RDF) schema specification 1.0, W3C candidate recommendation. http://www.w3.org/TR/2000/CR-rdf-schema-20000327, 27th March 2000.
[3] E. Brill. A simple rule-based part of speech tagger. In Third Conference on Applied Natural Language Processing – Association for Computational Linguistics, Trento, Italy, pages 152–155, 31st March – 3rd April 1992.
[4] N. Collier, C. Nobata, and J. Tsujii. Extracting the names of genes and gene products with a hidden Markov model. In Proceedings of the 18th International Conference on Computational Linguistics (COLING'2000), Saarbrucken, Germany, July 31st–August 4th 2000.
[5] N. Collier, K. Takeuchi, C. Nobata, J. Fukumoto, and N. Ogata. Progress on multi-lingual named entity annotation guidelines using RDF(S). In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC'2002), Las Palmas, Spain, pages 2074–2081, May 27th – June 2nd 2002.
[6] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, November 1995.
[7] DARPA. Information Extraction Task Definition, Columbia, MD, USA, November 1995. Morgan Kaufmann.
[8] Dublin Core metadata element set, version 1.1: Reference description. Technical Report, Dublin Core Metadata Initiative, http://purl.org/DC/documents/rec-dces-19990702.htm, 1999.
[9] S. Handschuh, S. Staab, and A. Maedche. CREAM - creating relational metadata with a component-based, ontology-driven annotation framework. In First International Conference on Knowledge Capture (K-CAP 2001), Victoria, B.C., Canada, 21–23 October 2001.
[10] T. Joachims. Making large-scale SVM learning practical. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.
[11] J. Kahan, M. R. Koivunen, E. Prud'Hommeaux, and R. R. Swick. Annotea: An open RDF infrastructure for shared web annotations. In the Tenth International World Wide Web Conference (WWW10), pages 623–630, May 1–5 2000.
[12] A. Kawazoe and N. Collier. An ontologically-motivated scheme for coreference. In Proceedings of the International Workshop on Semantic Web Foundations and Application Technologies (SWFAT), Nara, Japan, March 12th 2003.
[13] N. F. Noy, M. Sintek, S. Decker, M. Crubezy, R. W. Fergerson, and M. A. Musen. Creating Semantic Web contents with Protégé-2000. IEEE Intelligent Systems, 16(2):60–71, 2001.
[14] Y. Tateishi, T. Ohta, N. Collier, C. Nobata, K. Ibushi, and J. Tsujii. Building an annotated corpus in the molecular-biology domain. In COLING'2000 Workshop on Semantic Annotation and Intelligent Content, Luxemburg, 5th–6th August 2000.
[15] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 1979.
Fuzzy Methods for Knowledge Discovery from Multilingual Text Rowena Chau and Chung-Hsing Yeh School of Business Systems Faculty of Information Technology Monash University Clayton, Victoria 3800, Australia {Rowena.Chau,ChungHsing.Yeh}@infotech.monash.edu.au
Abstract. Enabling navigation among a network of inter-related concepts associating conceptually relevant multilingual documents constitutes the fundamental support to global knowledge discovery. This requirement of organising multilingual documents by concepts makes the goal of supporting global knowledge discovery a concept-based multilingual text categorization task. In this paper, intelligent methods for enabling concept-based multilingual text categorisation using fuzzy techniques are proposed. First, a universal concept space, encapsulating the semantic knowledge of the relationship between all multilingual terms and concepts, is generated using a fuzzy multilingual term clustering algorithm based on fuzzy c-means. Second, a fuzzy multilingual text classifier that applies the multilingual semantic knowledge for concept-based multilingual text categorization is developed using the fuzzy k-nearest neighbour classification technique. Referring to the multilingual text categorisation result as a browseable document directory, concept navigation among a multilingual document collection is facilitated.
1 Introduction
The rapid expansion of the World Wide Web throughout the globe means electronically accessible information is now available in an ever-increasing number of languages. In a multilingual environment, one important motive of information seeking is global knowledge discovery. Global knowledge discovery is significant when a user wishes to gain an overview of a certain subject area covered by a multilingual document collection before exploiting it. In such a situation, concept navigation is required. The basic idea of concept navigation is to provide the user with a browseable concept space that gives a fair indication of the conceptual distribution of all multilingual documents over the domain. This requirement of organising multilingual documents by concepts makes the goal of supporting global knowledge discovery a concept-based multilingual text categorisation task.
Text categorisation is a classification problem of deciding whether a document belongs to a set of pre-specified classes of documents. In a monolingual environment, text categorisation is carried out within the framework of the vector space model [7]. In the vector space model, documents are represented as feature vectors in a multidimensional space defined by a set of terms occurring in the document collection. To categorise documents, these terms become the features of the classification problem. Documents represented by similar feature vectors belong to the same class. However, multilingual text represents a unique challenge to text categorisation, due to the feature incompatibility problem caused by the vocabulary mismatch across languages. Different languages use different sets of terms to express a set of universal concepts. Hence, documents in different languages are represented by different sets of features in separate feature spaces. This language-specific representation has made multilingual text incomparable. Text categorisation methods that rely on shared terms (features) will not work for multilingual text categorisation. To overcome this problem, a universal feature space in which all multilingual text can be represented in a language-independent way is necessary. Towards this end, a corpus-based strategy aiming at unifying all existing feature spaces by discovering a new set of language-independent semantic features is proposed. The basic idea is: given a new universal feature space defined by a set of language-independent concepts, multilingual text can then be uniformly characterised. Consequently, multilingual text categorisation can also take place in a language-independent way. In what follows, the framework of the corpus-based approach to discovering multilingual semantic knowledge for enabling concept-based multilingual text categorisation is introduced in Section 2. Then, a fuzzy multilingual term clustering method for semantic knowledge discovery using fuzzy c-means is presented in Section 3. This is followed by a discussion of the application of the multilingual semantic knowledge to developing a fuzzy multilingual text classifier using the fuzzy k-nearest neighbour algorithm in Section 4. Finally, concluding remarks are included in Section 5.
2 The Corpus-Based Approach
The framework of the corpus-based approach to discovering multilingual semantic knowledge for enabling concept-based multilingual text categorization is depicted in Figure 1. First, using the co-occurrence statistics of a set of multilingual terms extracted from a parallel corpus, a fuzzy clustering algorithm, known as fuzzy c-means [1], is applied to group semantically related multilingual terms into concepts. By analyzing the co-occurrence statistics, multilingual terms are sorted into multilingual term clusters (concepts) such that terms belonging to any one of the clusters (concepts) should be as similar as possible while terms of different clusters (concepts) are as dissimilar as possible. As such, a concept space acting as a universal feature space for all languages is formed. Second, referring to the concept space as a set of pre-defined concepts, each multilingual document, characterized by a set of index terms, is then mapped to the concepts to which it belongs according to its conceptual content. For this purpose, a fuzzy multilingual text classifier is developed based on a fuzzy classification method known as the fuzzy k-nearest neighbour algorithm [4].
[Figure 1 depicts the framework: a parallel corpus in two languages A and B yields a multilingual term space (termA, termB); multilingual term clustering (fuzzy c-means) produces a concept space; the multilingual document space (documentA, documentB) is then mapped by multilingual text categorization (fuzzy k-nearest neighbour) into a document directory associating each concept with a ranked list of documents.]
Fig. 1. A corpus-based approach to enabling concept-based multilingual text categorization
Finally, the membership values resulting from fuzzy classification are used to produce a ranked list of documents. By associating each concept with a list of multilingual documents ranked in descending order of conceptual relevance, the resulting document classification provides a contextual overview of the document collection. Thus, it can be used as a browseable document directory in user-machine interaction. By enabling navigation among inter-related concepts associated with conceptually relevant multilingual documents, global knowledge discovery is facilitated.
3 Discovering Multilingual Semantic Knowledge
A fuzzy clustering algorithm known as fuzzy c-means is applied to discover the multilingual semantic knowledge by generating a universal concept space from a set of multilingual terms. This concept space is formed by grouping semantically related multilingual terms into concepts, thus revealing the multilingual semantic knowledge of the relationship between all multilingual terms and concepts. As concepts tend to overlap in meaning, crisp clustering algorithms like k-means, which generate a partition in which each term is assigned to exactly one cluster, are inadequate for representing the real data structure. In this respect, fuzzy clustering methods such as fuzzy c-means [1], which allow objects (terms) to be classified into more than one cluster with different membership values, are preferred. With the application of fuzzy c-means, the resulting fuzzy multilingual term clusters (concepts), which are overlapping, provide a more realistic representation of the intrinsic semantic relationships among the multilingual terms. To realize this idea, a set of multilingual terms, which are the objects to be clustered, is first extracted from a parallel corpus of N parallel documents. A parallel corpus is a collection of documents containing identical text written in multiple languages. Statistical analysis of parallel corpora has been suggested as a potential means of providing an effective lexical information basis for multilingual text management [2].
This has already been successfully applied in research on building translation models for multilingual text retrieval [6]. Based on the hypothesis that semantically related multilingual terms representing similar concepts tend to co-occur with similar inter- and intra-document frequencies within a parallel corpus, fuzzy c-means is applied to sort a set of multilingual terms into clusters (concepts) such that terms belonging to any one cluster (concept) are as similar as possible while terms of different clusters (concepts) are as dissimilar as possible in terms of the concepts they represent. To apply FCM to multilingual term clustering, each term is represented as an input vector of N features, where each of the N parallel documents is regarded as an input feature and each feature value is the frequency of that term in the nth parallel document. The fuzzy multilingual term clustering algorithm is as follows:
1. Randomly initialize the membership values $\mu_{ik}$ of the $K$ multilingual terms $x_k$ to each of the $c$ concepts (clusters), for $i = 1,\ldots,c$ and $k = 1,\ldots,K$, such that:

$$\sum_{i=1}^{c} \mu_{ik} = 1, \quad \forall k = 1,\ldots,K \qquad (1)$$

and

$$\mu_{ik} \in [0,1], \quad \forall i = 1,\ldots,c;\ \forall k = 1,\ldots,K \qquad (2)$$

2. Calculate the concept prototypes (cluster centres) $v_i$ using these membership values $\mu_{ik}$:

$$v_i = \frac{\sum_{k=1}^{K} (\mu_{ik})^m \cdot x_k}{\sum_{k=1}^{K} (\mu_{ik})^m}, \quad \forall i = 1,\ldots,c \qquad (3)$$

3. Calculate the new membership values $\mu_{ik}^{new}$ using these cluster centres $v_i$:

$$\mu_{ik}^{new} = \frac{1}{\sum_{j=1}^{c} \left( \dfrac{\|v_i - x_k\|}{\|v_j - x_k\|} \right)^{2/(m-1)}}, \quad \forall i = 1,\ldots,c;\ \forall k = 1,\ldots,K \qquad (4)$$

4. If $\|\mu^{new} - \mu\| > \varepsilon$, let $\mu = \mu^{new}$ and go to step 2. Else, stop.
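For concreteness, the following Python sketch implements the fuzzy c-means iteration above for multilingual term clustering. It is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the term-by-document frequency matrix has already been extracted from the parallel corpus, and all function and variable names are ours.

import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=1e-4, max_iter=300, seed=0):
    # X: (K, N) matrix; row k holds the frequencies of term x_k in the N parallel documents.
    rng = np.random.default_rng(seed)
    K = X.shape[0]
    u = rng.random((c, K))
    u /= u.sum(axis=0, keepdims=True)                       # Eqs. (1)-(2): columns sum to 1
    for _ in range(max_iter):
        um = u ** m
        v = (um @ X) / um.sum(axis=1, keepdims=True)        # Eq. (3): concept prototypes
        d = np.maximum(np.linalg.norm(X[None, :, :] - v[:, None, :], axis=2), 1e-12)
        u_new = 1.0 / ((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))).sum(axis=1)  # Eq. (4)
        if np.abs(u_new - u).max() <= eps:                  # step 4: stop when memberships converge
            return u_new, v
        u = u_new
    return u, v

Each column of the returned membership matrix gives a term's graded assignment to the c concepts, which is the fuzzy partition used as the concept space in the following section.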
As a result of clustering, every multilingual term is now assigned to the various clusters (concepts) with various membership values. As such, a universal concept space, which is a fuzzy partition of the multilingual term space revealing the multilingual semantic knowledge of the relationship between all multilingual terms and concepts, is now available for multilingual text categorization.
4 Enabling Multilingual Text Categorization
A multilingual text categorization system that facilitates concept-based document browsing must be capable of mapping every document in every language to its relevant concepts. Documents in different languages describe similar concepts using significantly diverse terms. As such, semantic knowledge of the relationship between multilingual terms and concepts is essential for effective concept-based multilingual text categorization. In this section, a fuzzy multilingual text classifier, which applies the multilingual semantic knowledge discovered in the previous section to categorizing multilingual documents by concepts, is developed based on the fuzzy k-nearest neighbour classification technique. Text categorization, which has been widely studied in the field of information retrieval, is based on the cluster hypothesis [8], which states that documents having similar contents are relevant to the same concept. To accomplish text categorization, the crisp k-nearest neighbour (k-NN) algorithm [3] has been widely applied [5] [9]. To decide whether an unclassified document d should be classified under category c, the k-nearest neighbour algorithm looks at whether the k pre-classified sample documents most similar to d have also been classified under c. One problem encountered in using the k-NN algorithm for text categorization is that, when concepts overlap, a pre-classified document that actually falls into two concepts with different strengths of membership cannot be given different weights to reflect its unequal importance in deciding the class membership of a new document in the two concepts. Another problem is that the result of document classification using the k-NN algorithm is binary: a document is classified as either belonging or not belonging to a concept, and once a document is assigned to a concept, there is no indication of its 'strength' of membership in that concept. This is too arbitrary, since documents usually relate to different concepts with different relevance weightings. To overcome these problems, the fuzzy k-nearest neighbour algorithm, which gives a class membership degree to a new object instead of assigning it to a specific class, is more appropriate. In the fuzzy k-NN algorithm, the membership degree assigned to a new object depends on the closeness of the new object to its nearest neighbours and on the strength of membership of these neighbours in the corresponding classes. The advantages lie both in the avoidance of an arbitrary assignment and in the provision of a degree of relevance in the resulting classification. Fuzzy multilingual text categorization can be seen as the task of determining a membership value $\mu_i(d_j) \in [0,1]$ for document $d_j$ with respect to concept $c_i$, for each entry of the $C \times D$ decision matrix, where $C = \{c_1,\ldots,c_m\}$ is a set of concepts and $D = \{d_1,\ldots,d_n\}$ is a set of multilingual documents to be classified. However, before the fuzzy k-nearest neighbour decision rule can be applied, decisions regarding the set of pre-classified documents and the value of the parameter k must be made. For many operation-oriented text categorization tasks such as document routing and information filtering, a set of pre-classified documents determined by the user or
the operation is always necessary so that the text classifier can learn from these examples to automate the classification task. However, for multilingual text categorization, a set of pre-classified documents may not be necessary. This is because classification of multilingual text by concepts is a concept-oriented decision: it is made on the basis of a document's conceptual relevance to a concept, not on how a similar document was previously classified during a sample operation. As long as the conceptual content of both concepts and documents is well represented, a decision on the conceptual classification of a multilingual document can be made. In fact, given the result of the fuzzy multilingual term clustering in the previous stage, the concept memberships of all multilingual terms are already known. Interpreting each term as a document containing a single term, a virtual set of pre-classified multilingual documents is readily available. Given the concept membership of every multilingual term, the class membership values of every single-term document in the corresponding concepts are also known. For fuzzy multilingual text categorization, the conceptual specifications provided by the result of fuzzy multilingual term clustering are considered reasonably sufficient and relevant for supporting the decision. Classifying a multilingual document using the fuzzy k-nearest neighbour algorithm also involves determining a threshold k, indicating how many neighbouring documents have to be considered for computing $\mu_i(d_j)$. In our multilingual document classification problem, the nearest neighbours to an unclassified document with k index terms are the k single-term virtual documents, each of which contains one of the unclassified document's k index terms. This is based on the assumption that a single-term document should contain at least one index term of another document to be considered related or conceptually close. As a result, by applying the fuzzy k-nearest neighbour algorithm in the development of our fuzzy multilingual text classifier, the classification decision for an unclassified document with k index terms is implemented as a function of its distances from its k single-term neighbouring documents (each containing one of the k index terms) and the membership degrees of these k neighbouring documents in the corresponding concepts. The fuzzy multilingual text classifier is as follows:
1. Determine the k neighbouring documents for document $d_j$.
2. Compute $\mu_i(d_j)$ using:

$$\mu_i(d_j) = \frac{\sum_{s=1}^{k} \mu_i(d_s)\, \left( 1 / \|d_j - d_s\|^{2/(m-1)} \right)}{\sum_{s=1}^{k} \left( 1 / \|d_j - d_s\|^{2/(m-1)} \right)}, \quad \forall i = 1,\ldots,m \qquad (5)$$
where $\mu_i(d_s)$ is the membership degree of the s-th nearest neighbouring sample document $d_s$ in concept $c_i$, and m is the weight determining each neighbour's contribution to $\mu_i(d_j)$. When m is 2, the contribution of each neighbouring document is weighted by the reciprocal of its distance from the document being classified. As m increases, the neighbours are more evenly weighted, and their relative distances from the document being classified have less effect. As m approaches 1, the closer neighbours are weighted far more heavily than those farther away, which has the effect of reducing the number of documents that contribute to the membership value of the document being classified. Usually, m = 2 is chosen. The results of this computation are used to produce a ranked list of the documents classified to a particular concept, with the most relevant one appearing at the top. When every concept in the concept space is associated with a ranked list of relevant documents, multilingual text categorization is completed. Using the concept space as a browseable document directory for user-machine interaction in multilingual information seeking, the user can now explore the whole multilingual document collection by referring to any concept of interest. As a result, global knowledge discovery, which aims at scanning conceptually correlated documents in multiple languages in order to gain a better understanding of a certain area, is facilitated.
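As an illustration of how the decision rule of Eq. (5) can be wired to the clustering output, here is a hedged Python sketch. The representation of the single-term "virtual documents" as vectors in the term space (and hence the distance between a document and a single-term neighbour) is an assumption of this sketch; the paper does not fix a particular vector encoding.

import numpy as np

def fuzzy_knn_memberships(doc_vec, term_ids, term_memberships, term_vectors, m=2.0):
    # doc_vec          : vector of the unclassified document in the term space
    # term_ids         : indices of the document's k index terms (its k neighbours)
    # term_memberships : (num_concepts, num_terms) matrix produced by fuzzy c-means
    # term_vectors     : (num_terms, dim) vectors of the single-term virtual documents (assumed)
    dists = np.maximum(np.linalg.norm(term_vectors[term_ids] - doc_vec, axis=1), 1e-12)
    w = 1.0 / dists ** (2.0 / (m - 1.0))        # inverse-distance weights of Eq. (5)
    mu = term_memberships[:, term_ids]          # memberships of the k neighbours in every concept
    return (mu * w).sum(axis=1) / w.sum()       # membership of d_j in every concept

Sorting all documents by their membership value for a given concept, in descending order, yields the ranked list that populates the browseable document directory.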
5 Conclusion
In this paper, a corpus-based approach to multilingual semantic knowledge discovery for enabling concept-based multilingual text categorization is proposed. The key to its effectiveness is the discovery of the multilingual semantic knowledge with the formation of a universal concept space that overcomes the feature incompatibility problem by facilitating representation of documents in all languages in a common semantic framework. By enabling navigation among sets of conceptually related multilingual documents, global knowledge discovery is facilitated. This concept-based multilingual information browsing is particularly important to users who need to stay competent by keeping track of the global knowledge development of a certain subject domain regardless of language.
References

[1] Bezdek, J. C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York (1981)
[2] Croft, W. B., Broglio, J., Fujii, H.: Applications of Multilingual Text Retrieval. In: Proceedings of the 29th Annual Hawaii International Conference on System Sciences, Vol. 5 (1996) 98-107
[3] Duda, R. O., Hart, P. E.: Pattern Classification and Scene Analysis. Wiley, New York (1973)
[4] Keller, J. M., Gray, M. R., Givens, J. A.: A Fuzzy k-Nearest Neighbor Algorithm. IEEE Transactions on Systems, Man and Cybernetics SMC-15 (4) (1985) 580-585
[5] Lam, W., Ho, C. Y.: Using a Generalized Instance Set for Automatic Text Categorization. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York (1998) 81-89
[6] Littman, M. L., Dumais, S. T., Landauer, T. K.: Automatic Cross-Language Information Retrieval Using Latent Semantic Indexing. In: Grefenstette, G. (ed.): Cross-Language Information Retrieval. Kluwer Academic Publishers, Boston (1998)
[7] Salton, G.: Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA (1989)
[8] Van Rijsbergen, C. J.: Information Retrieval. Butterworth (1972)
[9] Yang, Y.: Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval. In: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York (1994) 13-22
Automatic Extraction of Keywords from Abstracts Yaakov HaCohen-Kerner Department of Computer Science Jerusalem College of Technology (Machon Lev) 21 Havaad Haleumi St., P.O.B. 16031 91160 Jerusalem, Israel
[email protected] Abstract. The rapid increase of online information is hard to handle. Summaries such as abstracts help us to reduce this problem. Keywords, which can be regarded as very short summaries, may help even more. Filtering documents by using keywords may save precious time while searching. However, most documents do not include keywords. In this paper we present a model that extracts keywords from abstracts and titles. This model has been implemented in a prototype system. We have tested our model on a set of abstracts of academic papers containing keywords composed by their authors. Results show that keywords extracted from abstracts and titles may be a primary tool for researchers.
1 Introduction
With the explosive growth of online information, more and more people depend on summaries. People do not have time to read everything; they prefer to read summaries such as abstracts rather than the entire text before they decide whether they would like to read the whole text or not. Keywords, regarded as very short summaries, can be of great help, and filtering documents by using keywords can save precious time while searching. In this study, we present an implemented model that is capable of extracting keywords from abstracts and titles. We have tested our model on a set of 80 abstracts of academic papers containing keywords composed by the authors. We compare the proposed keywords to the authors' keywords and analyze the results. This paper is arranged as follows. Section 2 gives background concerning text summarization and extraction of keywords. Section 3 describes the proposed model. Section 4 presents the experiments we have carried out. Section 5 discusses the results. Section 6 summarizes the research and proposes future directions. In the Appendix we present statistics concerning the documents that were tested.
2 Background

2.1 Text Summarization
“Text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks)” [3]. Basic and classical articles in text summarization appear in “Advances in automatic text summarization” [3]. A literature survey on information extraction and text summarization is given by Zechner [7]. In general, the process of automatic text summarization is divided into three stages: (1) analysis of the given text, (2) summarization of the text, and (3) presentation of the summary in a suitable output form. Titles, abstracts and keywords are the most common summaries in academic papers. Usually, the title, the abstract and the keywords are the first, second and third parts of an academic paper, respectively. The title usually describes the main issue discussed in the study, and the abstract presents the reader with a short description of the background, the study and its results. A keyword is either a single word (unigram), e.g. 'learning', or a collocation, i.e. a group of two or more words representing an important concept, e.g. 'machine learning', 'natural language processing'. Retrieving collocations from text was examined by Smadja [5], and automatic extraction of collocations was examined by Kita et al. [1]. Many academic conferences require that each paper include a title page containing the paper's title, its abstract and a set of keywords describing the paper's topics. The keywords provide general information about the contents of the document and can be seen as an additional kind of document abstraction. The keyword concept is also widely used throughout the Internet as a common method for searching with various search engines. Due to the simplicity of keywords, text search engines are able to handle huge volumes of free-text documents. Some examples of commercial websites specializing in keywords are: (1) www.wordtracker.com, which provides statistics on the most popular keywords people use and how many competing sites use them; in addition, it helps find all keyword combinations that bear any relation to a business or service; (2) www.singlescan.co.uk, which offers to register a website to one or more unique keywords (such as domain names); these keywords enable the website to be submitted into strategic positions on the search engines.

2.2 Extraction of Keywords
The basic idea of keyword extraction for a given article is to build a list of words and collocations sorted in descending order, according to their frequency, while filtering general terms and normalizing similar terms (e.g. “similar” and “similarity”). The filtering is done by using a stop-list of closed-class words such as articles, prepositions and pronouns. The most frequent terms are selected as keywords since we assume that the author repeats important words as he advances and elaborates.
Examples of systems that applied this method for abstract-creation are described in Luhn [2] and Zechner [6]. Luhn's system is probably the first practical extraction system. Luhn presents a simple technique that uses term frequencies to weight sentences, which are then extracted to form an abstract. Luhn's system does not extract keywords. However, he suggests generating index terms for information retrieval. In his paper, Luhn presents only two running-examples. Zechner's system does not extract keywords. It rather generates text abstracts from newspaper articles by selecting the “most relevant” sentences and combining them in text order. This system relies on a general, purely statistical principle, i.e., on the notion of “relevance”, as it is defined in terms of the combination of tf*idf weights of words in a sentence. Experiments he carried out achieved precision/recall values of 0.46/0.55 for abstracts of six sentences and of 0.41/0.74 for abstracts of ten sentences.
3 The Model
Our plan is composed of the following steps: (1) implementing the basic idea for extraction of keywords mentioned in the previous section, (2) testing it on abstracts that contain keywords composed by the authors, and (3) comparing the extracted keywords with the given keywords and analyzing the results. The implementation is as follows: the system chooses the n most highly weighted words or collocations as its proposed keywords. The value of n has been set to 9 for two reasons: (1) the number 9 is accepted as the maximal number of items that an average person is able to remember without apparent effort, according to the cognitive rule called "7±2", which means that an average person is capable of remembering approximately between 5 and 9 information items over a relatively short term [4]; (2) the number 9 is the largest number of keywords given by the authors of the abstracts we have selected to work with. These 9 extracted keywords are compared to the keywords composed by the authors, and a detailed analysis identifies full matches, partial matches and failures. Fig. 1 describes our algorithm in detail. In order to reduce the size of the word weight matrices mentioned in step 1 of the algorithm, we transform each word to lower case. In step 2, weights are calculated by counting full matches and partial matches. A full match is a repetition of the same word, including changes such as singular/plural, abbreviations, or first letter in lower case/upper case. A partial match between two different words is defined when the first five letters of both words are the same, because in such a case we assume that these words have a common radical. All other pairs of words are regarded as failures. Such a definition, on the one hand, usually identifies close words such as nouns, verbs, adjectives and adverbs; on the other hand, it does not allow most non-similar words to be regarded as partial matches.
For each abstract from the data-base:
1. Create word weight matrices for all unigrams, 2-grams and 3-grams. Approximately 450 high-frequency closed-class words (e.g.: we, this, and, when, in, usually, also, do, near) are excluded via a stop-list.
2. Compute the weights of all unigrams, 2-grams and 3-grams by counting full and partial appearances (definitions in the next paragraph).
3. Sort each of these kinds of x-grams in descending order, merge them and select the n highest-weighted groups of words as the keywords proposed by the system.
4. Analyze the results: compare the proposed keywords to the keywords composed by the author and report the full matches, partial matches and failures of the system.

Fig. 1. The algorithm
A positive example for this definition is as follows: all 8 of the following words are regarded as partial matches because they have the same 5-letter prefix "analy": the nouns "analysis", "analyst", "analyzer", the verb "analyze", the adjectives "analytic", "analytical", "analyzable", and the adverb "analytically". A negative example for this definition is: all 8 of the following words, "confection", "confab", "confectioner", "confidence", "confess", "configure", "confinement", and "confederacy", are regarded as non-partial matches because they share only the 4-letter prefix "conf". As stated in Section 2.2, the basic idea in automatic finding of keywords is that the most frequent terms are selected as keywords, since we assume that the author repeats important words as he advances and elaborates. An additional criterion taken into account in many summarizing programs is the importance of the sentence from which the term is taken. There are various methods to identify the importance of sentences, e.g.: (1) location of sentences (position in text, position in paragraph, titles, ...), (2) sentences which include important cue words and phrases like "problem", "assumption", "our investigation", "results", "conclusions", "in summary", (3) analysis of the text based on relationships between sentences, e.g. logical relations and syntactic relations, and (4) structure and format of the document, e.g. document outline and narrative structure. In our study, we use these two basic features (frequency and importance of sentences) for extracting keywords. In addition, we take into account the length of the terms being tested as potential keywords. A statistical analysis of the distribution of word lengths of authors' keywords in our database shows that 2-grams are the most frequent grams (189 out of 332, about 57%), then unigrams (102 out of 332, about 31%), then 3-grams (34 out of 332, about 10%) and finally 4-grams (7 out of 332, about 2%).
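A compact Python sketch of the frequency-based extraction step follows. It is a simplification under stated assumptions: the stop-list here contains only a few sample words (the system uses roughly 450), and the differentiated weighting of full versus partial appearances and of sentence importance is omitted.

import re
from collections import Counter
from itertools import chain

STOP_WORDS = {"we", "this", "and", "when", "in", "usually", "also", "do", "near"}  # sample only

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def propose_keywords(title, abstract, n_keywords=9):
    # Lower-case the text, drop closed-class words, then count 1-, 2- and 3-grams.
    tokens = [t for t in re.findall(r"[a-z0-9]+", (title + " " + abstract).lower())
              if t not in STOP_WORDS]
    counts = Counter(chain(ngrams(tokens, 1), ngrams(tokens, 2), ngrams(tokens, 3)))
    return [g for g, _ in counts.most_common(n_keywords)]

def partial_match(word_a, word_b):
    # Two different words match partially when their first five letters coincide.
    return len(word_a) >= 5 and len(word_b) >= 5 and word_a[:5] == word_b[:5]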
Table 1. Summarization of general results

              Tested keywords   Full matches   Partial matches   Failures
Number        332               77             128               127
Percentage    100%              23.19%         38.56%            38.25%
4 Experiments
We have constructed a database containing 80 documents that describe academic papers. Each document includes the title of the paper, its abstract, and a list of keywords composed by the author. Most of the documents are taken from publications and technical reports available at http://helen.csi.brandeis.edu/papers/long.html/. Table 1 shows the general results.
5 Discussion
Previous systems use similar techniques not for keyword extraction but rather for abstract creation. Luhn's system does not present general results. Zechner's system achieves precision/recall values of 0.46/0.55 for abstracts of six sentences and values of 0.41/0.74 for abstracts of ten sentences. Our system presents a rate of 0.62 for finding partial and full matches. Our results are slightly better; however, the systems are not comparable because they deal with different tasks and with different kinds of text documents. Full matches were found at a rate of 23.19%; partial matches or better were found at a rate of 61.75%. At first sight, the system's success rate appears unimpressive. However, we claim that these are rather satisfying results due to some interesting findings: (1) 86.25% of the abstracts do not include (in an exact form) at least one of their own keywords; (2) 55.12% of the keywords chosen by the authors appear (in an exact form) neither in the title nor in the abstract. Full statistics are given in the Appendix. The main cause for these findings is the authors! Most of the abstracts do not include (in an exact form) at least one of their own keywords, and most of the keywords chosen by the authors appear (in an exact form) neither in the title nor in the abstract. These results might show that at least some of the authors have problems either in defining their titles and abstracts or in choosing their keywords. In such circumstances our results are quite good.
6 Summary and Future Work
As far as we know, no existing program is able to extract keywords from abstracts and titles. Results show that keywords extracted from abstracts and titles may be a primary tool for researchers. Future directions for research are: (1) using different learning techniques, which might improve the results of the extraction, (2) selecting an optimal set of initial weights, and (3) elaborating the model for extracting and learning keywords from entire articles.
Acknowledgements Thanks to Ari Cirota and two anonymous referees for many valuable comments on earlier versions of this paper.
References

[1] Kita, K., Kato, Y., Omoto, T., Yano, Y.: Automatically extracting collocations from corpora for language learning. In: Proceedings of the International Conference on Teaching and Language Corpora. Reprinted in: Wilson, A., McEnery, A. (eds.): UCREL Technical Papers Volume 4 (Special Issue), Corpora in Language Education and Research, A Selection of Papers from TALC 94. Dept. of Linguistics, Lancaster University, England (1994) 53-64
[2] Luhn, H. P.: The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development (1959) 159-165. Reprinted in: Mani, I., Maybury, M. (eds.): Advances in Automatic Text Summarization. MIT Press, Cambridge, MA (1999)
[3] Mani, I., Maybury, M. T.: Introduction. In: Mani, I., Maybury, M. (eds.): Advances in Automatic Text Summarization. MIT Press, Cambridge, MA (1999)
[4] Miller, G. A.: The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information. Psychological Review 63 (1956) 81-97
[5] Smadja, F.: Retrieving Collocations from Text. Computational Linguistics 19(1) (1993) 143-177
[6] Zechner, K.: Fast Generation of Abstracts from General Domain Text Corpora by Extracting Relevant Sentences. In: Proceedings of the 16th International Conference on Computational Linguistics (1996) 986-989
[7] Zechner, K.: A Literature Survey on Information Extraction and Text Summarization. Term Paper, Carnegie Mellon University (1997)
Appendix: Statistics Concerning the Tested Documents

There are 80 abstracts.
There is 1 abstract with 1 keyword.
There are 12 abstracts with 2 keywords.
There are 19 abstracts with 3 keywords.
There are 22 abstracts with 4 keywords.
There are 12 abstracts with 5 keywords.
There are 4 abstracts with 6 keywords.
There are 4 abstracts with 7 keywords.
There are 4 abstracts with 8 keywords.
There are 2 abstracts with 9 keywords.
There are 332 keywords in all abstracts.
The average number of keywords in a single abstract is 4.15.
There are 102 keywords with a length of 1 word.
There are 189 keywords with a length of 2 words.
There are 34 keywords with a length of 3 words.
There are 7 keywords with a length of 4 words.
The average length in words of a single keyword is 1.84.
There are 0 abstracts with 1 sentence.
There are 0 abstracts with 2 sentences.
There is 1 abstract with 3 sentences.
There are 4 abstracts with 4 sentences.
There are 15 abstracts with 5 sentences.
There are 15 abstracts with 6 sentences.
There are 17 abstracts with 7 sentences.
There are 10 abstracts with 8 sentences.
There are 13 abstracts with 9 sentences.
There are 5 abstracts with 10 sentences.
The average length in sentences of a single abstract is 6.88.
Number of abstracts which do not include 1 of their own keywords: 18
Number of abstracts which do not include 2 of their own keywords: 18
Number of abstracts which do not include 3 of their own keywords: 14
Number of abstracts which do not include 4 of their own keywords: 11
Number of abstracts which do not include 5 of their own keywords: 6
Number of abstracts which do not include 6 of their own keywords: 1
Number of abstracts which do not include 7 of their own keywords: 1
The total number of authors' keywords that do not appear in their own abstracts is 183. That is, 55.12% of the authors' keywords do not appear in their own abstracts.
There are 33 abstracts which do not include one of their own keywords of length 1 word.
There are 59 abstracts which do not include one of their own keywords of length 2 words.
There are 16 abstracts which do not include one of their own keywords of length 3 words.
There are 7 abstracts which do not include one of their own keywords of length 4 words.
There are 69 abstracts which do not include (in an exact form) at least one of their own keywords; that is, 86.25% of the abstracts do not include (in an exact form) at least one of their own keywords.
Term-length Normalization for Centroid-based Text Categorization
Verayuth Lertnattee and Thanaruk Theeramunkong
Information Technology Program, Sirindhorn International Institute of Technology, Thammasat University, Pathumthani, 12121, Thailand
{[email protected], [email protected], [email protected]}
Abstract. Centroid-based categorization is one of the most popular algorithms in text classification. Normalization is an important factor for improving the performance of a centroid-based classifier when documents in a text collection have quite different sizes. In the past, normalization involved only document- or class-length normalization. In this paper, we propose a new type of normalization called term-length normalization which considers term distribution in a class. The performance of this normalization is investigated in three environments of a standard centroid-based classifier (TFIDF): (1) without class-length normalization, (2) with cosine class-length normalization and (3) with summing weight normalization. The results suggest that our term-length normalization is useful for improving classification accuracy in all cases.
1 Introduction
With the fast growth of online text information, there has been an extreme need to organize relevant information in text documents. Automatic text categorization (also known as text classification) has become a significant tool for utilizing text documents efficiently and effectively. In the past, a variety of classification models were developed in different schemes, such as probabilistic models (i.e. Bayesian classification) [1, 2], regression models [3], example-based models (e.g., k-nearest neighbour) [3], linear models [4, 5], support vector machines (SVM) [3, 6] and so on. Among these methods, a variant of linear models called the centroid-based model is attractive since it requires relatively less computation than other methods in both the learning and classification stages. Despite the lower computation time, centroid-based methods have been shown in several works, including [4, 7, 8], to achieve relatively high classification accuracy. The classification performance of the model strongly depends on the weighting method applied in the model. In this paper, a new type of normalization, so-called term-length normalization, is investigated in text classification. Using various data sets, the performance is investigated in three environments of a standard centroid-based classifier (TFIDF): (1) without class-length normalization, (2) with cosine class-length normalization and (3) with summing weight normalization. In the rest of this paper, Section 2 presents centroid-based text categorization. Section 3 describes the concept of normalization in text categorization. The proposed term-length normalization
is given in Section 4. The data sets and experimental settings are described in Section 5. In Section 6, experimental results using four data sets are given. A conclusion and future work are given in Section 7.

2 Centroid-based Text Categorization
In centroid-based text categorization, a document (or a class) is represented by a vector using a vector space model with a bag of words (BOW) [9, 10]. The simplest and most popular representation applies term frequency (tf) and inverse document frequency (idf) in the form of tfidf. In a vector space model, given a set of documents $D = \{d_1, d_2, \ldots, d_N\}$, a document $d_j$ is represented by a vector $\vec{d_j} = \{w_{1j}, w_{2j}, \ldots, w_{mj}\} = \{tf_{1j}\,idf_1, tf_{2j}\,idf_2, \ldots, tf_{mj}\,idf_m\}$, where $w_{ij}$ is the weight assigned to a term $t_i$ in the document. In this definition, $tf_{ij}$ is the term frequency of term $t_i$ in document $d_j$ and $idf_i$ is the inverse document frequency, defined as $\log(N/df_i)$. The idf can be applied to eliminate the impact of frequent terms that exist in almost all documents. Here, N is the total number of documents in a collection and $df_i$ is the number of documents which contain the term $t_i$. Besides term weighting, normalization is another important factor in representing a document or a class. The details of normalization are described later in the next section. A class prototype $\vec{c_k}$ is obtained by summing up all document vectors in $C_k$ and then normalizing the result by its size. The formal description of a class prototype is $\vec{c_k} = \sum_{d_j \in C_k} \vec{d_j} \,/\, |\sum_{d_j \in C_k} \vec{d_j}|$, where $C_k = \{d_j \mid d_j$ is a document belonging to the class $c_k\}$. The simple term weighting is $\overline{tf}\,idf$, where $\overline{tf}$ is the average class term frequency of the term, formally $\overline{tf}_{ik} = \sum_{d_j \in C_k} tf_{ijk} / |C_k|$, where $|C_k|$ is the number of documents in class $c_k$. The term weighting described above can also be applied to a query or a test document; in general, the term weighting for a query is $tf\,idf$. Once a class prototype vector and a query vector have been constructed, the similarity between these two vectors can be calculated. The most popular measure is the cosine distance, which can be calculated by the dot product between the two (normalized) vectors. The test document is then assigned to the class whose prototype vector is most similar to the vector of the test document.
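The following Python sketch shows this baseline tfidf centroid classifier with cosine class-length normalization; it is a minimal illustration of the formulas above, not the authors' experimental code, and the array layout is assumed.

import numpy as np

def train_centroids(tf, labels, num_classes):
    # tf: (num_docs, num_terms) raw term frequencies; labels: class index of each document.
    N = tf.shape[0]
    df = np.maximum((tf > 0).sum(axis=0), 1)          # document frequency of each term
    idf = np.log(N / df)
    docs = tf * idf                                    # tfidf document vectors
    centroids = np.zeros((num_classes, tf.shape[1]))
    for k in range(num_classes):
        summed = docs[labels == k].sum(axis=0)         # merge the documents of class c_k
        centroids[k] = summed / np.linalg.norm(summed) # cosine class-length normalization
    return centroids, idf

def classify(query_tf, centroids, idf):
    q = query_tf * idf                                 # tf idf weighting of the test document
    return int(np.argmax(centroids @ q))               # class with the most similar prototype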
3 Normalization Methods
In the past, most normalization methods were based on document-length normalization, which reduces the advantage of a long document over a short one. A longer document may include terms with higher term frequencies and more unique terms in its representation, which increases the similarity and the chances of longer documents being retrieved in preference to shorter documents. To solve this issue, all relevant documents should normally be treated as equally important for classification or retrieval, especially by way of normalization. As the most common option for this approach, document-length normalization is incorporated into the term weighting formula to equalize the length of document vectors. Although
there have been several types of document-length normalization proposed [10], the cosine normalization [4, 8] is the most commonly used. It can solve the problem of overweighting due to both terms with higher frequency and more unique terms. Cosine normalization is done by dividing all elements in a vector by the length of the vector, that is $\sqrt{\sum_i w_i^2}$, where $w_i$ is the weight of the term $t_i$ before normalization. In the centroid-based method, two types of normalization have been used: (1) normalizing the documents in a certain class and then merging the documents into a class vector, and (2) merging the documents into a class vector and then normalizing the class vector. The latter type of normalization is used in our experiments. We call the process of normalizing the class vector class-length normalization, which strongly depends on the term weighting of the class vector before normalization. In this paper, we propose a novel method, so-called term-length normalization, that adjusts the term weighting to be more effective and powerful before class-length normalization is applied.
4 Term-length Normalization
Our concept is based on the fact that terms with high discriminating power should have some of the following properties. They should occur frequently in a certain class; this corresponds to $\overline{tf}$ in a centroid-based method. For the same $\overline{tf}$, we consider terms that are distributed consistently to be more important than those that occur in only a few documents with high frequency. To achieve this goal, we propose a general term length in the form of the root mean term frequency calculated from all documents in the class, $tfl_{ik}(n)$. The $tfl_{ik}(n)$ of a term $t_i$ in a class $c_k$ at a degree $n$ is defined as:
$$tfl_{ik}(n) = \sqrt[n]{\frac{\sum_{j} tf_{ijk}^{\,n}}{|C_k|}} \qquad (1)$$

Term-length normalization is then performed as $\overline{tf}/tfl(n)$, as shown below:

$$tf'_{ik}(n) = \frac{\overline{tf}_{ik}}{tfl_{ik}(n)} = \frac{\sum_{j} tf_{ijk}}{\sqrt[n]{|C_k|^{\,n-1} \sum_{j} tf_{ijk}^{\,n}}} \qquad (2)$$

The $tf'_{ik}(n)$ can also be viewed as a kind of term distribution. Its value lies in the range $[1/\sqrt[n]{|C_k|^{\,n-1}},\, 1]$. The minimum value is obtained when a term occurs in merely one document of the class; the maximum value is obtained when a term occurs equally in all documents. The value $n$ is the degree of utilization of term distribution: the higher $n$, the more heavily term distribution is utilized. In our concept, the term weighting depends on $\overline{tf}$, $tf'$ and $idf$, so the weighting skeleton can be defined as:

$$w_{ik} = (\overline{tf}_{ik})^x \, (tf'_{ik}(n))^y \, (idf_i)^z \qquad (3)$$
From Equation 3, the values x, y and z are the powers which represent the level of contribution of each factor to the term weighting, respectively. In this paper, we focus on x and y and set z to the standard value (z = 1). When x = 0, the term weighting depends only on $tf'$ and $idf$. When x > 0, $\overline{tf}$ has a positive effect on the term weighting; when x < 0, $\overline{tf}$ has a negative effect. In our preliminary experiments, we found that $\overline{tf}$ should have a positive effect on term weighting, that is, the value of x should be equal to or greater than zero. For example, Figure 1 illustrates three kinds of classes, each of which holds ten documents. Each number shows the occurrence of a certain term in a document. In Figure 1(a), the term appears in only two documents. In Figure 1(b), the term appears in eight documents of the class, but the $\overline{tf}$ of the term is the same as in Figure 1(a), i.e. 1.40. Intuitively, the term in Figure 1(b) should be more important than the term in Figure 1(a). Focusing only on $\overline{tf}$, we cannot grasp this difference. However, if we consider $tf'$ ($\overline{tf}/tfl(n)$), we can observe that case (b) obtains a higher value than case (a). For a more complex situation, the term distribution pattern in Figure 1(c) is similar to the pattern in Figure 1(b), but Figure 1(c) has a higher $\overline{tf}$. In this case, $tf'$ alone is not enough to represent the importance level of the term; we also need to consider $\overline{tf}$. Furthermore, when the value of n is higher, the value of $tf'(n)$ is lower, so the term weighting is more sensitive to the term distribution pattern, i.e. from $tf'(2)$ to $tf'(3)$. In conclusion, we have introduced a kind of term distribution, $\overline{tf}/tfl(n)$, that contributes to the term weighting along with the traditional $\overline{tf}$ and $idf$.
Fig. 1. Three different cases of a term occurring in a class of ten documents.
(a) per-document term frequencies 0, 0, 8, 0, 0, 6, 0, 0, 0, 0: tf = 1.40, tfl(2) = 3.16, tf'(2) = 0.44, tfl(3) = 4.18, tf'(3) = 0.34
(b) per-document term frequencies 2, 2, 1, 0, 0, 1, 2, 2, 2, 2: tf = 1.40, tfl(2) = 1.61, tf'(2) = 0.87, tfl(3) = 1.71, tf'(3) = 0.82
(c) per-document term frequencies 4, 4, 2, 0, 0, 2, 4, 4, 4, 4: tf = 2.80, tfl(2) = 3.22, tf'(2) = 0.87, tfl(3) = 3.42, tf'(3) = 0.82
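To make Eqs. (1)-(3) concrete, the following Python sketch computes tfl(n), tf'(n) and the weighting skeleton for one class; the helper names and the small check against case (a) of Fig. 1 are ours, not the authors'.

import numpy as np

def term_length_stats(tf_class, n=2):
    # tf_class: (|C_k|, num_terms) raw frequencies of the documents in one class c_k.
    Ck = tf_class.shape[0]
    tf_bar = tf_class.mean(axis=0)                              # average class term frequency
    tfl = (np.sum(tf_class ** n, axis=0) / Ck) ** (1.0 / n)     # Eq. (1)
    tf_prime = np.divide(tf_bar, tfl, out=np.zeros_like(tf_bar), where=tfl > 0)  # Eq. (2)
    return tf_bar, tfl, tf_prime

def term_weights(tf_bar, tf_prime, idf, x=1.5, y=1.0, z=1.0):
    return (tf_bar ** x) * (tf_prime ** y) * (idf ** z)          # Eq. (3)

# Case (a) of Fig. 1: tf_bar = 1.40, tfl(2) = 3.16, tf'(2) = 0.44.
case_a = np.array([[0.], [0.], [8.], [0.], [0.], [6.], [0.], [0.], [0.], [0.]])
print(term_length_stats(case_a, n=2))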
5 Data Sets and Experimental Settings
Four data sets are used in the experiments: (1) Drug Information (DI), (2) Newsgroups (News), (3) WebKB1 and (4) WebKB2. The first data set, DI, is a set of web pages collected from www.rxlist.com. It includes 4,480 English web pages
with 7 classes: adverse drug reaction, clinical pharmacology, description, indications, overdose, patient information, and warning. The second data set, Newsgroups, contains 20,000 documents. The articles are grouped into 20 different UseNet discussion groups. The third and fourth data sets are constructed from WebKB. These web pages were collected from the computer science departments of four universities, with some additional pages from other universities. In our experiments, we use the four most popular classes: student, faculty, course and project, as our third data set, called WebKB1. The total number of web pages is 4,199. Alternatively, this collection can be rearranged into five classes by university (WebKB2): cornell, texas, washington, wisconsin and misc. All headers are eliminated from the documents in News, and all HTML tags are omitted from the documents in DI, WebKB1 and WebKB2. For all data sets, a stop word list is applied to remove some common words, such as a, by, he and so on, from the documents. The following two experiments are performed. In the first experiment, term-length normalization is applied with the powers of $\overline{tf}$ and $tfl$ equal. After the simple term weighting is modified by term-length normalization, the final class vectors are constructed without class-length normalization and with two types of class-length normalization (cosine and summing weight), and compared to the same methods without term-length normalization. In the second experiment, we adjust three different combinations of the powers of $\overline{tf}$ and $tfl$. In all experiments, a data set is split into two parts: 90% for the training set and 10% for the test set (10-fold cross validation). As the performance indicator, classification accuracy (%) is applied. It is defined as the ratio of the number of documents assigned their correct classes to the total number of documents in the test set.
6 Experimental Results
6.1 Effect of Term-length Normalization Based on Only tf' and idf

In the first experiment, the powers of $\overline{tf}$ and $tfl$ are equal. Two types of term-length normalization, $tfl(2)$ and $tfl(3)$, were applied to the term weighting. As a baseline, a standard centroid-based classifier ($\overline{tf}\,idf$) without term-length normalization (N) is used. We investigate two kinds of term weighting, $tf'(2)\,idf$ (T2) and $tf'(3)\,idf$ (T3); the final class vectors were constructed without class-length normalization (CN0) and with two types of class-length normalization: cosine normalization (CNC) and summing weight normalization (CNW). The results are shown in Table 1. Bold indicates the highest accuracy on each data set and on the average of the 4 data sets. When the power of $\overline{tf}$ equals that of $tfl$, $tf'$ is constructed and $\overline{tf}$ is not available to express the effect of term frequency. The $tf'$ concerns the term distribution among documents in a class. Using $tf'$ instead of $\overline{tf}$ can improve accuracy on all data sets when no class-length normalization is used, especially on News. It also enhances the effect of cosine normalization on DI, News and WebKB1. For summing weight normalization, it improves performance only on News. The $tf'(3)$ performs better than $tf'(2)$ on average accuracy. In the next experiment, we turn on the effect of $\overline{tf}$ in combination with $tf'$ and $idf$.
Table 1. Effect of term-length normalization when the powers of tf and tfl are equal

Class Norm.       DI                    News                  WebKB1                WebKB2                Average
             N      T2     T3      N      T2     T3      N      T2     T3      N      T2     T3      N      T2     T3
CN0          66.99  74.84  75.71   69.13  82.05  79.93   53.32  66.16  66.73   40.27  51.01  26.41   57.43  68.52  62.20
CNC          91.67  92.08  95.33   74.76  82.46  82.09   77.71  75.30  78.85   88.76  61.59  79.04   83.23  77.86  83.83
CNW          93.06  66.83  78.95   73.63  76.40  75.29   73.02  59.37  70.35   33.36  22.72  22.82   68.27  56.33  61.85
6.2 Effect of Term-length Normalization Based on tf, tf' and idf

In this experiment, three different patterns were applied in which the power of $\overline{tf}$ is higher than that of $tfl$, turning on the effect of $\overline{tf}$ in combination with $tf'$ and $idf$. For $tfl(2)$, the three patterns are: (1) $\overline{tf}\,idf / \sqrt{tfl(2)}$, which equals $(\overline{tf})^{0.5}\,(tf'(2))^{0.5}\,idf$ (TA); (2) $(\overline{tf})^{1.5}\,idf / tfl(2)$, which equals $(\overline{tf})^{0.5}\,tf'(2)\,idf$ (TB); and (3) $(\overline{tf})^{2}\,idf / tfl(2)$, which equals $\overline{tf}\,tf'(2)\,idf$ (TC). The same patterns of term weighting were also applied to $tfl(3)$. The results are shown in Table 2. Bold indicates the highest accuracy on each data set and on the average of the 4 data sets.
Table 2. Effect of term-length normalization when the power of tf is higher than that of tfl(2) and tfl(3)

Term Norm.  Class Norm.      DI                    News                  WebKB1                WebKB2                Average
                        TA     TB     TC      TA     TB     TC      TA     TB     TC      TA     TB     TC      TA     TB     TC
tfl(2)      CN0         70.11  71.03  68.62   78.28  74.79  65.22   61.66  63.68  53.56   44.92  28.01  23.53   63.74  59.38  52.73
            CNC         95.09  96.88  96.45   80.67  79.60  69.83   80.42  82.42  80.45   89.26  89.97  90.45   86.36  87.22  84.29
            CNW         90.76  95.20  96.41   78.72  77.69  71.59   72.23  80.42  80.35   23.60  29.51  70.56   66.33  70.70  79.73
tfl(3)      CN0         70.56  71.70  69.02   77.70  72.55  63.55   62.68  64.90  54.58   30.94  22.98  22.17   60.47  58.03  52.33
            CNC         96.16  97.08  96.43   80.39  78.33  68.36   81.23  82.81  80.23   88.74  89.47  89.28   86.63  86.92  83.58
            CNW         92.77  96.23  96.61   78.55  76.14  70.45   75.54  82.21  81.04   23.74  43.27  70.30   67.65  74.46  79.60
According to the results in Table 2, when the power of $\overline{tf}$ is higher than that of $tfl$, almost all methods outperform those in the previous experiment, except on News. This suggests that term distribution is very valuable for classifying documents in News. Term-length normalization can improve classification accuracy with all types of class-length normalization, especially cosine normalization. From the average classification accuracy, the maximum result is obtained when using a power of $\overline{tf}$ of 1.5 and a power of $tfl(2)$ of 1.0, followed by cosine normalization. The $tfl(2)$ works better on News and WebKB2, while $tfl(3)$ works better on DI and WebKB1. The highest result on each data set is obtained from a different combination of $\overline{tf}$ and $tf'(n)$; the suitable combination of $\overline{tf}$ and $tf'(n)$ depends on the individual characteristics of the data sets.
7 Conclusion and Future Work
This paper has shown that term-length normalization is useful in centroid-based text categorization. It considers the term distribution in a class: terms that appear in several documents of a class are more important than those that appear in few documents with higher term frequencies. These distributions were used to represent the discriminating power of each term and then to weight that term. Adjusting the powers of $\overline{tf}$ and $tfl$ and the level n in a suitable proportion is a key factor in improving accuracy. From the experiments, the results suggest that term-length normalization can improve classification accuracy and works well with all methods of class-length normalization. For future work, we plan to evaluate term-length normalization with other types of class-length normalization.
Acknowledgement. This work has been supported by the National Electronics and Computer Technology Center (NECTEC), project number NT-B-06-4C-13-508.

References

1. McCallum, A., Rosenfeld, R., Mitchell, T., Ng, A.Y.: Improving text classification by shrinkage in a hierarchy of classes. In: Proc. 15th International Conf. on Machine Learning, Morgan Kaufmann, San Francisco, CA (1998) 359-367
2. Nigam, K., McCallum, A.K., Thrun, S., Mitchell, T.M.: Learning to classify text from labeled and unlabeled documents. In: Proceedings of AAAI-98, 15th Conference of the American Association for Artificial Intelligence, Madison, US, AAAI Press, Menlo Park, US (1998) 792-799
3. Yang, Y., Liu, X.: A re-examination of text categorization methods. In: 22nd Annual International SIGIR, Berkeley (1999) 42-49
4. Chuang, W.T., Tiyyagura, A., Yang, J., Giuffrida, G.: A fast algorithm for hierarchical text classification. In: Data Warehousing and Knowledge Discovery (2000) 409-418
5. Joachims, T.: A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In: Fisher, D.H. (ed.): Proceedings of ICML-97, 14th International Conference on Machine Learning, Nashville, US, Morgan Kaufmann Publishers, San Francisco, US (1997) 143-151
6. Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In: Nédellec, C., Rouveirol, C. (eds.): Proceedings of ECML-98, 10th European Conference on Machine Learning. Number 1398, Chemnitz, DE, Springer Verlag, Heidelberg, DE (1998) 137-142
7. Han, E.H., Karypis, G.: Centroid-based document classification: Analysis and experimental results. In: Principles of Data Mining and Knowledge Discovery (2000) 424-431
8. Lertnattee, V., Theeramunkong, T.: Improving centroid-based text classification using term-distribution-based weighting and feature selection. In: Proceedings of INTECH-01, 2nd International Conference on Intelligent Technologies, Bangkok, Thailand (2001) 349-355
9. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing and Management 24 (1988) 513-523
10. Singhal, A., Salton, G., Buckley, C.: Length normalization in degraded text collections. Technical Report TR95-1507 (1995)
Recommendation System Based on the Discovery of Meaningful Categorical Clusters Nicolas Durand1, Luigi Lancieri1, and Bruno Crémilleux2
1 France Telecom R&D, 42 rue des Coutures - 14066 Caen Cédex 4 - France {nicola.durand,luigi.lancieri}@francetelecom.com
2 GREYC CNRS-UMR 6072, Université de Caen, Campus Côte de Nacre - 14032 Caen Cédex - France
[email protected] Abstract. We propose in this paper a recommendation system based on a new method of clusters discovery which allows a user to be present in several clusters in order to capture his different centres of interest. Our system takes advantage of content-based and collaborative recommendation approaches. The system is evaluated by using proxy server logs, and encouraging results were obtained.
1 Introduction
The search for relevant information on the World Wide Web is still a challenge. Even if indexing methods become more efficient, search engines remain passive agents and do not take into account the context of the users. Our approach suggests an active solution based on the recommendation of documents. We propose in this paper a hybrid recommendation system (collaboration via content). Our method provides recommendations based on the content of documents, and also recommendations based on collaboration. Recommendation systems can be installed on the user's computer (like an agent recommending web pages during navigation), or on particular web sites, platforms or portals. Our approach can be applied to systems where users' consultations can be recorded (for example, a proxy server, a restricted web site or a portal). Our system is based on recent KDD (Knowledge Discovery in Databases) methods. In [2], we defined a new method of cluster discovery from frequent closed itemsets. In this paper, we create a recommendation system by taking advantage of this method. We form clusters of users having common centres of interest, using the keywords of the consulted documents. Our method relates to content-based filtering (the identification of common keywords) but uses some techniques of collaborative filtering (i.e. clustering of users). Moreover, we discover a set of clusters and not a strict clustering (i.e. a partition) like the recommendation systems based on clustering. This means that our approach enables a user
to be in several clusters. We can retrieve a user with several kinds of queries corresponding to his different centres of interest. Our system is autonomous; it does not need the intervention of users. Indeed, we use logs (consultations of documents for several users), which are a good source of information about what users want [13]. We can perform both content-based recommendations and collaboration-based recommendations. For a new document arriving in the system, we can recommend it to users by comparing the keywords of the document to the clusters. We can also recommend to a user of a cluster documents that other users of the cluster have consulted (collaborative approach). The rest of the paper is organized as follows: in the next section, we present related work. Then, we detail our method of cluster discovery (called Ecclat). In Section 4, we explain our recommendation system based on the discovered clusters. Then, we describe our experiments and give some results. We conclude in Section 6.
2 Related Work
Recommendation systems are assimilated to information filtering systems because the ideas and the methods are very close. There are two types of filtering: content-based filtering and collaborative filtering. Content-based filtering identifies and provides relevant information to users on the basis of the similarity between the information and the profiles. We can quote the Syskill & Webert system [12], which produces a Bayesian classifier from a learning database containing web pages scored by the user. The classifier is then used to establish whether a page can interest the user. SiteHelper [9] recommends only the documents of a web site; it uses the feedback of the user. Letizia [6] is a client-side agent which searches for web pages similar to the previously consulted or bookmarked ones. WebWatcher [4] uses proxy server logs to make recommendations. Mobasher et al. [7] propose a recommendation system based on the clustering of web pages from web server logs. This system determines the URLs which can interest the user by matching the URLs of the user's current session with the clusters. Collaborative filtering finds relevant users who have similar profiles, and provides the documents they like to each other. Rather than the similarity between documents and profiles, this method measures the similarity between profiles. Tapestry [3] and GroupLens [5] allow users to comment on Netnews documents, and to get the ones recommended by the others. Amalthaea [8] is an agent which allows the user to create and modify the user profile. In these systems, users must specify their profiles. Among the autonomous collaborative approaches, there are some methods based on clustering and associations of items. Wilson et al. [14] use frequent associations containing two items (TV programs) in order to determine the similarity between two user profiles. There are also some hybrid approaches. Pazzani [11] showed through experiments that hybrid systems use more information, and provide more precise recommendations. Pazzani talks about collaboration via content, because
the profile of each user is based on content and is used to detect the similarity among users. Fab [1] implements this idea in a similar way: agents (one per user) collect documents and put them in a central repository (to take advantage of potential overlaps between users' interests) in order to recommend them to users. We can also cite OTS [15], which allows a set of users to consult papers provided by a publication server; the users are grouped according to their profiles, which are predefined and based on the content of the papers. Contrary to OTS, our system can provide recommendations on documents not yet consulted by any user, and our method of cluster discovery does not use predefined profiles.
3 Cluster Discovery with Ecclat
We have developed a clustering method (named Ecclat [2]) for the discovery of interesting clusters in web mining applications, i.e. clusters with possible overlapping of elements. For instance, we would like to retrieve a user (or a page) with several kinds of queries corresponding to several centres of interest (or several points of view). Another characteristic of Ecclat is its ability to tackle large databases described by categorical data. The approach used by Ecclat is quite different from usual clustering techniques: Ecclat does not use a global measure of similarity between elements but is based on an evaluation measure of a cluster, and the number of clusters is not set in advance. In the following discussion, each data record is called a transaction (a user) and is described by items (the consulted keywords). Ecclat discovers the frequent closed itemsets [10] (seen as potential clusters), evaluates them and selects some of them. An itemset X is frequent if the number of transactions which contain X is at least the frequency threshold (called minfr) set by the user. X is a closed itemset if its frequency decreases whenever any item is added. A closed itemset satisfies an important property for clustering: it gathers a maximal set of items shared by a maximal number of transactions. In other words, it captures the maximum amount of similarity. These two points (the capture of the maximum amount of similarity and the frequency) are the basis of our approach to selecting meaningful clusters. Ecclat selects the most interesting clusters using a cluster evaluation measure; all computations and interpretations are detailed in [2]. The evaluation measure is composed of two measures: homogeneity and concentration. With the homogeneity value, we want to favour clusters having many items shared by many transactions (a relevant cluster has to be as homogeneous as possible and should gather “enough” transactions). The concentration measure limits the overlapping of transactions between clusters. Finally, we define the interestingness of a cluster as the average of its homogeneity and concentration. Ecclat uses the interestingness to select clusters. An innovative feature of Ecclat is its ability to produce a clustering with minimum overlapping between clusters (which we call “approximate clustering”) or a set of clusters with a slight overlapping. This functionality depends on the value of a parameter called M.
M is an integer corresponding to the number of not-yet-classified transactions that a newly selected cluster must classify. The algorithm performs as follows. The cluster having the highest interestingness is selected. Then, as long as there are transactions to classify (i.e. transactions which do not belong to any selected cluster) and some candidate clusters are left, we select the cluster having the highest interestingness that contains at least M transactions not classified yet. The number of clusters is established by the selection algorithm and is bound to the value of M. Let n be the number of transactions: if M is equal to 1, we have at worst (n − minfr + 1) clusters, although in practice this does not happen. If we increase the value of M, the number of clusters decreases; with M close to minfr, the result approaches a partition of the transactions.
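To make the selection step concrete, the following Python sketch illustrates the greedy procedure just described. It is not the actual Ecclat implementation: it assumes the frequent closed itemsets and their interestingness values have already been computed, and all identifiers and data are invented for the example.

# Illustrative sketch of the greedy cluster selection described above
# (candidate itemsets and interestingness values are assumed precomputed).

def select_clusters(candidates, all_transactions, M):
    """candidates: list of (itemset, transaction_ids, interestingness) tuples.
    Returns the selected clusters in selection order."""
    remaining = set(all_transactions)          # transactions not yet classified
    pool = sorted(candidates, key=lambda c: c[2], reverse=True)
    selected = []
    while remaining and pool:
        for idx, (items, tids, score) in enumerate(pool):
            # A newly selected cluster must cover at least M unclassified transactions.
            if len(tids & remaining) >= M:
                selected.append((items, tids, score))
                remaining -= tids
                pool.pop(idx)
                break
        else:
            break                              # no candidate covers M new transactions
    return selected

if __name__ == "__main__":
    candidates = [
        (frozenset({"fishing", "hunting"}), {1, 2, 3, 4}, 0.8),
        (frozenset({"internet", "java"}), {3, 4, 5}, 0.7),
        (frozenset({"england"}), {1, 5, 6}, 0.5),
    ]
    print(select_clusters(candidates, all_transactions={1, 2, 3, 4, 5, 6}, M=1))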
4 Recommendation System
In this section, we present the basis of our recommendation system. It is composed of an off-line process (cluster discovery with Ecclat) and an on-line process that produces recommendations. The on-line process computes a score between a new document and each of the discovered clusters; if the score for a document and a cluster is greater than a threshold, the document is recommended to the users of that cluster. We can also exploit collaboration and recommend to any user of a cluster the documents that the other users of the cluster have consulted. For now, we concentrate on the first type of recommendation. The score between a document and a cluster is computed as follows. Let D be a document and KD the set of its keywords. Let Ci be a cluster; Ci is composed of a set of keywords KCi and a set of users UCi. We compute the covering rate:

CR(D, Ci) = (|KD ∩ KCi| / |KD|) × 100
Let mincr be the minimum threshold on the covering rate. If CR(D, Ci) ≥ mincr, then we recommend the document D to the users UCi. Let us take an example: a document with KD = {fishing hunting england nature river rod}, and the following clusters:
– KC1 = {fishing hunting internet java}, CR = 33%.
– KC2 = {fishing england}, CR = 33%.
– KC3 = {fishing hunting england internet java programming}, CR = 50%.
– KC4 = {fishing}, CR = 16%.
– KC5 = {internet java}, CR = 0%.
In this example, we obtain the following order: C3 > C1, C2 > C4, and C5 is discarded. Note that the measure used (CR) is well suited to the problem, because within a cluster the keywords can refer to different topics. For instance, if a set of users is interested in both fishing and programming, it is possible to
have a corresponding cluster like C3. This should not influence the covering rate. For this reason, we chose a measure which depends on the keywords common to the document and the cluster and on the number of keywords of the document; CR does not depend on the number or the composition of the keyword set of the cluster. The classical measures such as Jaccard, Dice and Cosine are not suited to our problem. The possible mixing of topics therefore does not influence the recommendations, but mincr must not be set too high, because the number of keywords of a cluster is unbounded while the number of keywords of a document is fixed. Note also that if a single user is very interested in C++, we do not detect this: we only take into account the common interests shared by a group.
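As an illustration, the covering rate and the recommendation decision can be sketched as follows. This is a simplified example in Python; the keyword sets, user names and the value of mincr are illustrative and are not taken from the actual system.

def covering_rate(doc_keywords, cluster_keywords):
    # CR(D, Ci) = |KD ∩ KCi| / |KD| * 100
    if not doc_keywords:
        return 0.0
    return 100.0 * len(doc_keywords & cluster_keywords) / len(doc_keywords)

def recommend(doc_keywords, clusters, mincr):
    """clusters: dict mapping cluster id -> (keyword set, user set).
    Returns the set of users to whom the document is recommended."""
    users = set()
    for cid, (keywords, cluster_users) in clusters.items():
        if covering_rate(doc_keywords, keywords) >= mincr:
            users |= cluster_users
    return users

if __name__ == "__main__":
    doc = {"fishing", "hunting", "england", "nature", "river", "rod"}
    clusters = {
        "C1": ({"fishing", "hunting", "internet", "java"}, {"u1", "u2"}),
        "C3": ({"fishing", "hunting", "england", "internet", "java", "programming"}, {"u3"}),
        "C5": ({"internet", "java"}, {"u4"}),
    }
    print(recommend(doc, clusters, mincr=20))   # C1 and C3 pass the threshold, C5 does not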
5 Experimentation
In order to evaluate the recommendations, we used proxy server logs coming from France Telecom R&D. The data contain 147 users and 8,727 items. Items are keywords of the HTML pages browsed by the 147 users of a proxy cache over a period of one month, during which 24,278 pages were viewed. For every page, we extracted at most 10 keywords with an extractor (developed at France Telecom R&D) based on the frequency of significant words. Let L be the proxy server log. For a document D in L, we determine the users interested in D (noted UsersR(D)) by using the previously discovered clusters. Then we check, using the logs, whether the users who consulted the document (noted Users(D)) are present in UsersR(D). Note that we do not use a web server, where the sets of documents and keywords are known and relatively stable over time; for a proxy server, the set of documents and especially the set of keywords can be totally different between two periods. So, for a first evaluation without human feedback, we used the same period to discover the clusters and to compute the recommendations. We use the following measures to evaluate the results:

failure(D) = |Users(D) − UsersR(D)| / |Users(D)|

r_hit(D) = |UsersR(D) ∩ Users(D)| / |UsersR(D)|
The failure rate evaluates the percentage of users who consulted a document but to whom it was not recommended. The r_hit value (recommendation hit) measures the percentage of users to whom a document was recommended who really consulted it. We set minfr to 10%, which corresponds to a minimum of 14 users per cluster; the number of frequent closed itemsets is 454,043. We set M to 1 in order to capture the maximum number of different centres of interest (overlapping between clusters) and obtained 45 clusters (the average number of users per cluster is 21).
Fig. 1. Distribution of the documents according to the failure rate, mincr=20%

Fig. 2. r_hit value according to the rank of the documents, mincr=20%
Let us note that our aim here is not to study the impact of the parameters; this has already been done in [2]. The choice of the mincr value is not easy. The mincr value especially influences the number of recommended documents: the higher mincr is, the lower the number of recommended documents. Too many recommendations make the system impractical, so a compromise is needed between the number of recommended documents and, as one would guess, the quality of the system. For this evaluation, we did not actually deliver recommendations to users; we only evaluated the accuracy of our recommendations. We therefore used a relatively low value of mincr in order to obtain many recommendations, setting mincr to 20%. The system recommended 11,948 documents (49.2% of the total). Figure 1 shows that 80% of these 11,948 documents are well recommended, with only 16% of failure. We ranked the documents according to their r_hit values and obtained Figure 2. The r_hit measure shows that some users appear in the recommendations without having consulted the document; they might have been interested, but we cannot verify this without human feedback.
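For reference, the two evaluation measures can be computed directly from the log and the recommendation lists, as in the following sketch (the set names and values are illustrative).

def failure(users_consulted, users_recommended):
    # Fraction of consulting users to whom the document was not recommended.
    if not users_consulted:
        return 0.0
    return len(users_consulted - users_recommended) / len(users_consulted)

def r_hit(users_consulted, users_recommended):
    # Fraction of users who received the recommendation and consulted the document.
    if not users_recommended:
        return 0.0
    return len(users_recommended & users_consulted) / len(users_recommended)

if __name__ == "__main__":
    consulted = {"u1", "u2", "u3"}           # Users(D), taken from the proxy log
    recommended = {"u2", "u3", "u4", "u5"}   # UsersR(D), produced from the clusters
    print(failure(consulted, recommended))   # 1/3
    print(r_hit(consulted, recommended))     # 2/4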
6 Conclusion
We have presented a recommendation system based on the discovery of meaningful clusters of users according to the content of the documents they consult. Our cluster discovery method captures the various centres of interest of the users because a user can belong to several clusters and can therefore be retrieved with several kinds of queries. We provided recommendations of documents using the discovered clusters. We evaluated our method on proxy server logs (not usually done for this kind of application) and obtained good results, which is encouraging for further experiments (with human feedback) and for the development of our system. In future work, we will evaluate the second type
of recommendation, i.e. recommendations based on collaboration. We will also work on an incremental version of Ecclat in order to offer a system that operates in pseudo real-time.
References
[1] M. Balabanovic. An Adaptive Web Page Recommendation Service. In Proceedings of the 1st International Conference on Autonomous Agents, pages 378–385, Marina del Rey, CA, USA, February 1997.
[2] N. Durand and B. Crémilleux. ECCLAT: a New Approach of Clusters Discovery in Categorical Data. In Proceedings of the 22nd Int. Conf. on Knowledge Based Systems and Applied Artificial Intelligence (ES'02), pages 177–190, Cambridge, UK, December 2002.
[3] D. Goldberg, D. Nichols, B.M. Oki, and D. Terry. Using Collaborative Filtering to Weave an Information Tapestry. Communications of the ACM, 35(12):61–70, 1992.
[4] T. Joachims, D. Freitag, and T. Mitchell. WebWatcher: A Tour Guide for the World Wide Web. In Proceedings of the 15th Int. Joint Conference on Artificial Intelligence (IJCAI'97), pages 770–775, Nagoya, Japan, August 1997.
[5] J.A. Konstan, B.N. Miller, D. Maltz, J.L. Herlocker, L.R. Gordon, and J. Riedl. GroupLens: Applying Collaborative Filtering to Usenet News. Communications of the ACM, 40(3):77–87, March 1997.
[6] H. Lieberman. Letizia: An Agent that Assists Web Browsing. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI'95), pages 924–929, Montréal, Québec, Canada, August 1995.
[7] B. Mobasher, R. Cooley, and J. Srivastava. Creating Adaptive Web Sites through Usage-Based Clustering of URLs. In IEEE Knowledge and Data Engineering Exchange Workshop (KDEX'99), Chicago, November 1999.
[8] A. Moukas. Amalthaea: Information Discovery and Filtering Using a Multi-Agent Evolving Ecosystem. International Journal of Applied Artificial Intelligence, 11(5):437–457, 1997.
[9] D.S.W. Ngu and X. Wu. SiteHelper: A Localized Agent that Helps Incremental Exploration of the World Wide Web. In Proceedings of the 6th International World Wide Web Conference, pages 691–700, Santa Clara, CA, 1997.
[10] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Efficient Mining of Association Rules Using Closed Itemset Lattices. Information Systems, 24(1):25–46, Elsevier, 1999.
[11] M. Pazzani. A Framework for Collaborative, Content-Based and Demographic Filtering. Artificial Intelligence Review, 13(5):393–408, 1999.
[12] M. Pazzani, J. Muramatsu, and D. Billsus. Syskill & Webert: Identifying interesting web sites. In Proceedings of the 13th National Conference on Artificial Intelligence, pages 54–61, Portland, Oregon, 1996.
[13] M. Spiliopoulou. Web Usage Mining for Web Site Evaluation. Communications of the ACM, 43(8):127–134, August 2000.
[14] D. Wilson, B. Smyth, and D. O'Sullivan. Improving Collaborative Personalized TV Services. In Proceedings of the 22nd Int. Conf. on Knowledge Based Systems and Applied Artificial Intelligence (ES'02), pages 265–278, Cambridge, UK, December 2002.
[15] Y.H. Wu, Y.C. Chen, and A.L.P. Chen. Enabling Personalized Recommendation on the Web Based on User Interests and Behaviors. In Proceedings of the 11th Int. Workshop on Research Issues in Data Engineering (RIDE-DM 2001), pages 17–24, Heidelberg, Germany, April 2001.
A Formal Framework for Combining Evidence in an Information Retrieval Domain

Josephine Griffith and Colm O'Riordan
Dept. of Information Technology, National University of Ireland, Galway, Ireland
Abstract. This paper presents a formal framework for the combination of multiple sources of evidence in an information retrieval domain. Previous approaches which have included additional information and evidence have primarily done so in an ad-hoc manner. In the proposed framework, collaborative and content information regarding both the document data and the user data is formally specified. Furthermore, the notion of user sessions is included in the framework. A sample instantiation of the framework is provided.
1 Introduction
Information retrieval is a well-established field, wherein a relatively large body of well-accepted methods and models exists. Formal models of information retrieval have been proposed by Baeza-Yates and Ribeiro-Neto [1], Dominich [6] and van Rijsbergen [14]. These are useful in the development, analysis and comparison of approaches and systems. Various changes in the information retrieval paradigm have occurred, including, among others, the move towards browsing and navigation-based interaction rather than the more traditional batch-mode interaction, the use of multiple retrieval models, and the use of multiple representations of the available information. Recent trends, particularly in the web-search arena, indicate that users will usually combine querying with browsing, feedback and query reformulation until the information need is satisfied (or until the user stops searching). Although many sources of evidence are usually available, most information retrieval models do not incorporate them (except for some ad-hoc approaches and several approaches using inference networks [13]). It has been shown that different retrieval models will retrieve different documents and that multiple representations of the same object (e.g. different query representations) can provide a better understanding of notions of relevance [5]. Within the field of information retrieval, the sub-field of collaborative filtering has emerged. Many parallels and similarities exist between the two approaches: in each process the goal is to provide timely and relevant information to users.
However, different evidence is used in these two processes to identify this relevant information. Collaborative filtering models predominantly consider social or collaborative data only [10]. In various situations, collaborative filtering on its own is not adequate, e.g. when no ratings exist for an item, when new users join a system, or when a user is not similar to any other users in the dataset. One way to overcome these limitations is to utilise any available content information. However, formal approaches to integrating content information and collaborative information have not, to date, been fully investigated. In traditional approaches, a single mathematical model is typically used in isolation, and additions that are not expressible in that model are made as ad-hoc extensions. Basu et al. claim that “there are many factors which may influence a person in making choices, and ideally, one would like to model as many of these factors as possible in a recommendation system” [3]. Although the semantics and stages of information retrieval and collaborative filtering have been studied in detail in isolation, we believe there is a need to unify the varying paradigms into a single complete framework. Currently, a myriad of ad-hoc systems exists, which obscures accurate comparisons of systems and approaches. Producing a formal framework to combine different sources of evidence would be beneficial as it would allow the categorisation of existing and future systems; would allow comparison at system design level; and would provide a blue-print from which to build one's own designs and systems. This paper presents a formal framework for information retrieval systems. The framework caters for the incorporation of traditionally disjoint approaches (collaborative filtering and information retrieval) and moves towards accommodating recent trends in the information retrieval paradigm. Section 2 gives a brief overview of information retrieval and collaborative filtering. Section 3 details the proposed framework which combines collaborative and content information. Section 4 presents an instantiation of the framework, and conclusions are presented in Section 5.
2 Related Approaches: Information Retrieval and Collaborative Filtering
The aim of information retrieval and filtering is to return a set of documents to a user based on a ranked similarity between the user's information need (represented by a query) and a set of documents. Baeza-Yates and Ribeiro-Neto [1] proposed the following representation of an information retrieval model:

[D, Q, F, R(qi, dj)]

where D is a set of document representations; Q is a set of query representations; F is a framework for modelling these representations (D and Q) and the relationship between D and Q; and R(qi, dj) is a ranking function which associates a real number with qi and dj such that an ordering is defined among
the documents with respect to the query qi. Well-known instantiations include the Boolean model and the vector space model, and numerous other instantiations have been explored. While useful, the Baeza-Yates and Ribeiro-Neto model does not address other important aspects of the information retrieval cycle, including the notions of feedback, a browsing paradigm, and higher-level relationships (between documents and between users). Collaborative filtering produces recommendations for an active user using the ratings of other users who have preferences similar to the active user's (these ratings can be gathered explicitly or implicitly). Typical or “traditional” collaborative filtering algorithms use standard statistical measures to calculate the similarity between users (e.g. Spearman correlation, Pearson correlation, etc.) [10]; other approaches have also been used [8]. Within the information retrieval and filtering domains, approaches have been developed for combining different types of content, for combining results from different information retrieval systems [5], and recently for combining content and web link information [9]. Approaches combining multiple sources of relevance feedback information have also been investigated [12]. Several authors suggest methods for combining content with collaborative information, including [2], [3], [4], [11].
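As an illustration of the statistical similarity measures mentioned above, the following sketch computes the Pearson correlation between two users over their co-rated items. The ratings, item names and user names are invented for the example; this is one of several possible measures, not a prescribed part of the framework.

from math import sqrt

def pearson(ratings_a, ratings_b):
    """Pearson correlation between two users, computed over co-rated items."""
    common = set(ratings_a) & set(ratings_b)
    n = len(common)
    if n < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / n
    mean_b = sum(ratings_b[i] for i in common) / n
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    den = sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common)) * \
          sqrt(sum((ratings_b[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0

if __name__ == "__main__":
    alice = {"item1": 5, "item2": 3, "item3": 4}
    bob = {"item1": 4, "item2": 2, "item3": 5, "item4": 1}
    print(pearson(alice, bob))   # similarity over the three co-rated items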
3 Proposed Framework
To model extra information, the model proposed by Baeza-Yates and Ribeiro-Neto is first slightly modified and then mapped to the collaborative filtering case. The two models (content and collaborative) are then combined.
3.1 Collaborative Filtering Model
A model for collaborative filtering can be defined by:

<U, I, P, M, R(I, u)>

where U is a set of users; I is a set of items; P is a matrix of dimension |U| × |I| containing ratings by users U for items I, where a value in the matrix is referenced by pui; M is the collaborative filtering model used (e.g. correlation methods, probabilistic models, machine learning approaches); and R(I, u) returns a ranking of a set of items¹ based on P, given user u.
¹ This differs from R as described in the previous section, where R returns a real number for a document/query pair. In our model R returns a ranking of the entire document set with respect to a query.
3.2 An Information Retrieval and Collaborative Filtering Framework
Combining both the information retrieval and collaborative filtering models, a framework is provided where the following can be specified:

<U, I, P, A, V, Q, R(I, u, q), G>

with the terms U, I, P as defined previously in the collaborative filtering model. In addition, the following are required:
– A: a list of attributes for each item in I, i.e. [a1, a2, . . . , an].
– V: a matrix of dimension |I| × |A| containing the associated values of each attribute a for each item in I. These values are not necessarily atomic and may be of arbitrary complexity.
– Q: a set of user queries, where a query q is defined as a list of weighted values for attributes, i.e. [(val1, w1), (val2, w2), . . .].
– R(I, u, q): specifies a ranking of the items I with respect to:
  • the similarity to q, and
  • the evidence in P for user u.
– G: models the components and the relationships between them.
Based on the explicit information (i.e. U, I, P, A, V, Q), a number of functions can be defined. Let b = {1, . . . , u} index the users in U, b = {1, . . . , i} index the items in I, and b = {1, . . . , n} index the attributes in A. Then:
1. h1 : P × b → P(U), a mapping of P, for a user u, onto the power set of U, i.e. a group of users in the same neighbourhood as u. This corresponds to traditional memory-based collaborative filtering approaches [10].
2. h2 : P × b → P(I), a mapping of P, for an item i, onto the power set of I, i.e. a group of items which are similar to item i. This corresponds to item-item collaborative filtering approaches as proposed in [7].
3. h3 : V × b → P(I), a mapping of V, for an item i, onto the power set of I, i.e. a cluster of items which are similar to item i. Standard clustering approaches can be used.
4. h4 : V × b → P(A), a mapping of V, for an attribute a, onto the power set of A, i.e. a cluster of attributes which are similar to some a. Such evidence has not traditionally been used in information retrieval, but has some parallels in data mining.
3.3 Extending the Framework to Include Further Evidence
Other information also needs to be captured, including the concepts of feedback and session histories, giving:

<U, I, P, A, V, Q, R(I, u, q), G, fb, Ses, History>
Without concerning ourselves with the details of how feedback information is gathered (whether explicitly, implicitly or both), we can see feedback as providing a mapping from one state, s, to another, i.e. fb : s → s′, where a state s is defined as an instantiation of the information in the system. This typically involves changing the values and weights of attributes in the query q, any value in P, and any higher-level relationships affected by the feedback². For the purpose of this paper, we view feedback as changing a subset of the values in q and P to give q′ and P′. A session, Ses, can be defined as a sequence of such mappings such that the state following fbt does not radically differ from the state following fbt+1, for all t. This necessitates having some threshold, τ, on the measure of similarity between successive states. A history, History, of sessions is maintained per user such that the end of one session can be distinguished from the start of a new session, i.e. the similarity between the final state of one session and the initial state of the next session is lower than the aforementioned threshold τ. Of course, a user should also be able to indicate that a new session is beginning. Given fb, Ses, and History, other higher-level relationships could be derived. In particular, one could identify user sessions, identify frequent information needs per user, and identify groups of similar users based on queries, behaviour, etc. Again, there exist many approaches to define, implement and use this information (e.g. data mining of past behaviours). Many possible instantiations of the given framework exist, which specify how the components are defined and combined. We will now consider a possible instantiation to deal with a well-understood domain.
4 Sample Instantiation
We provide a possible instantiation of the given framework. Consider, for example, a movie recommender domain with the following components:
– U is a set of users seeking recommendations on movies.
– I is a set of items (movies), represented by some identifier.
– P is the collaborative filtering matrix with ratings, by the system or the user, for items in I for users in U.
– A is a list of attributes [a1, a2, . . . , an] associated with the items in I, for example [title, year, director, actors, description, . . .].
– V is a matrix with the associated values of each attribute a for each item i. The value of attribute a for item i is referred to by via.
– Q is the set of user queries, where a query q is a list of weighted values for attributes, i.e. [(val1, w1), (val2, w2), . . .], where all weights are real numbers in the range [0, 1].
² In many systems less information is used, but one could envision other sources of information being used.
– R(I, u, q) returns a ranking of the items in I with respect to their similarity to some query q and also based on the evidence in P.
– G models the components and the relationships between them.
– Feedback, session information and history are not used in this instantiation.
There exist many ways to calculate R(I, u, q). One reasonably intuitive approach is to find the similarity between the attribute values of I and the attribute values in the user query q, and also to find, if available, the associated collaborative filtering value for each of the items. Constants (α and β) can be used which allow the relative importance of the two approaches to be specified³, giving:

R(I, u, q) = (β / (α + β)) · simcontent(I, q) + (α / (β + α)) · C(P)

where simcontent(I, q) is the content-based approach (information and/or data retrieval) and C(P) is the collaborative approach using P. The function simcontent(I, q) returns a similarity-based ranking where, for each i in I, the similarity to the query q is defined as:

Σj=1..n (sim(vij, valj) × wj) / Σj=1..n wj
where n is the number of attributes and wj is the user-assigned weighting of the jth attribute in q (the default is 1 if the user does not specify a weighting, indicating that all attributes are equally important). For each attribute in the query, sim(vij, valj) returns a number indicating the similarity between vij and valj. This can be calculated using an approach suitable to the domain, e.g. a Boolean match or a Euclidean distance, and the similarities should be normalised to lie in the range [0, 1]. A collaborative filtering module, C(P), produces values for the items in I for a user u based on the prior ratings of that user and the ratings of similar users. Any standard collaborative filtering approach can be used; again, the range of its output should be constrained appropriately.
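A minimal sketch of this instantiation is given below. The similarity function, the collaborative scores standing in for C(P), the attribute names and the item identifiers are all illustrative assumptions, and the two mixing coefficients play the roles of the α and β fractions in the formula above.

def content_similarity(item_values, query, sim):
    """Weighted attribute similarity: sum_j sim(v_ij, val_j)*w_j / sum_j w_j."""
    num = sum(sim(item_values.get(attr), val) * w for attr, (val, w) in query.items())
    den = sum(w for _, w in query.values())
    return num / den if den else 0.0

def rank(items, query, collab_scores, sim, alpha=0.5, beta=0.5):
    """Rank items by the combined score; collab_scores maps item id -> a C(P) value in [0, 1]."""
    w_content = beta / (alpha + beta)    # coefficient on the content part, as in the formula above
    w_collab = alpha / (alpha + beta)    # coefficient on the collaborative part
    scored = []
    for item_id, values in items.items():
        score = (w_content * content_similarity(values, query, sim)
                 + w_collab * collab_scores.get(item_id, 0.0))
        scored.append((score, item_id))
    return sorted(scored, reverse=True)

if __name__ == "__main__":
    exact = lambda a, b: 1.0 if a == b else 0.0   # a simple Boolean attribute match
    movies = {"m1": {"director": "X", "year": 1999}, "m2": {"director": "Y", "year": 2001}}
    query = {"director": ("X", 1.0), "year": (2001, 0.5)}
    print(rank(movies, query, collab_scores={"m1": 0.2, "m2": 0.9}, sim=exact))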
5 Conclusions
Within information retrieval, there exist many approaches to providing relevant information in a timely manner (e.g. content filtering, collaborative filtering, etc.). Recent changes in the information retrieval paradigm indicate that users intersperse query formulation, feedback and browsing in their search for relevant information.
³ Note that if α = β, then the content and collaborative approaches have equal importance; otherwise, one has greater importance than the other.
Although formal models and frameworks exist, there has been a lack of frameworks to formally capture the different approaches and include recent changes in the paradigm. In this paper we provide such a framework which captures content-based information, collaborative-based information, and notions of feedback and user sessions. We argue that such a framework allows for the comparison of various approaches and provides a blue-print for system design and implementation. We also provide a sample instantiation. Future work will involve more detailed modelling of possible instantiations making use of all components in the framework as well as some of the higher level relationships.
References
[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, 1999.
[2] M. Balabanovic and Y. Shoham. Fab: Content-based, collaborative recommendation. Communications of the ACM, 40(3):66–72, 1997.
[3] C. Basu, H. Hirsh, and W. Cohen. Recommendation as classification: Using social and content-based information in recommendation. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), pages 714–721, 1998.
[4] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin. Combining content-based and collaborative filters in an online newspaper. In Online Proceedings of the ACM SIGIR '99 Workshop on Recommender Systems: Algorithms and Evaluation, University of California, Berkeley, 1999.
[5] W. B. Croft. Combining approaches to information retrieval. Kluwer Academic Publishers, 2000.
[6] S. Dominich. Mathematical Foundations of Information Retrieval. Kluwer Academic Publishers, 2001.
[7] D. Fisk. An application of social filtering to movie recommendation. In H.S. Nwana and N. Azarmi, editors, Software Agents and Soft Computing. Springer, 1997.
[8] J. Griffith and C. O'Riordan. Non-traditional collaborative filtering techniques. Technical report, Dept. of Information Technology, NUI, Galway, Ireland, 2002.
[9] R. Jin and S. Dumais. Probabilistic combination of content and links. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 402–403, 2001.
[10] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of the ACM 1994 Conference on CSCW, pages 175–186, Chapel Hill, 1994.
[11] C. O'Riordan and H. Sorensen. Multi-agent based collaborative filtering. In M. Klusch et al., editors, Cooperative Information Agents '99, Lecture Notes in Artificial Intelligence, 1999.
[12] I. Ruthven and M. Lalmas. Selective relevance feedback using term characteristics. In Proceedings of the 3rd International Conference on Conceptions of Library and Information Science (CoLIS 3), 1999.
[13] H.R. Turtle and W.B. Croft. Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 3, 1991.
[14] C.J. van Rijsbergen. Towards an information logic. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 1989.
Managing Articles Sent to the Conference Organizer

Yousef Abuzir
Salfeet Study Center, Al-Quds Open University, Salfeet, Palestine
[email protected],
[email protected]

Abstract. The flood of articles and emails to a conference or session organizer is an interesting problem, and it becomes important to organize these electronic documents correctly by their similarity. A thesaurus-based classification system can be used to simplify this task. Such a system provides functions to index, retrieve and classify a collection of electronic documents based on a thesaurus. Classifying these electronic articles and emails helps the organizer select and send the right articles to the reviewers, taking into account their profiles and interests. By automatically indexing the articles based on a thesaurus, the system can easily select relevant articles according to the user profiles and send the reviewer an email message containing the articles.
1 Introduction
Organizing, classifying and delivering electronic articles by their similarity is subject to constraints. The job has to be done quickly, for instance when managing the flow of articles arriving at a conference organizer or session chair. A thesaurus-based classification system can simplify this task. Such a system provides functions to index, retrieve and classify a collection of electronic documents based on a thesaurus. The thesaurus is used not only for indexing and retrieving messages, but also for classifying them. By automatically indexing the email messages and/or electronic articles using a thesaurus, the conference organizer can easily locate related articles or messages and identify their topics. The assignment of submitted manuscripts to reviewers is a common task in the scientific community and is an important part of the duties of journal editors, conference program chairs, and research councils. For conference submissions, however, the reviews and review assignments must be completed under severe time pressure, with a very large number of submissions arriving near the announced deadline, making it difficult to plan the review assignments much in advance. These dual problems of large volume and limited time make the assignment of submitted manuscripts to reviewers a complicated job that has traditionally been handled by a single person (or at most a few people) under quite stressful conditions. Moreover, manual review assignment is only possible if the person doing the assignment
knows all the members of the review committee and their respective areas of expertise. As some conferences grow in scope with respect to the number of submissions and reviewers, as well as the number of subdomains of their fields, it would be desirable to develop automated means of assigning the submitted manuscripts to appropriate members of the review committee. In this paper we suggest an application involving the classification of the articles and emails sent by authors to the conference organizer. For this application to be realized, it was necessary to develop the Database and Expert Systems Applications (DEXA) thesaurus, which is used for indexing and classifying these electronic documents. We describe an approach to extracting and structuring terms from the web pages of the conference in order to construct a domain-independent thesaurus. The tool ThesWB [1], [2] is used to construct a thesaurus from HTML pages. Web documents contain rich knowledge that describes their content: each page can be viewed as a body of text containing two fundamentally different types of data, the contents and the tags. We analyzed the nature of web content and metadata in relation to the requirements for thesaurus construction. After creating the thesaurus, its performance was tested with our toolkit TDOCS [3], [4]. The system was used to classify a sample of electronic documents and emails from a cache directory containing electronic articles related to the conference. Due to the sheer number of articles submitted, it is a time-consuming task to select the most interesting ones for each reviewer; a method of article categorization is therefore useful for obtaining relevant information quickly. We have been researching methods of automated article classification into predefined categories for the purpose of developing an article delivery service, which sends the right articles to those reviewers who are interested in a particular topic. The contents of this paper are divided into six sections. In Section 2, a general introduction and a description of user profiling are presented. Section 3 describes how the thesaurus has been created to support the classification and user-profiling functions. Section 4 describes the basic functionality of the article classification and delivery system. In Section 5, we test the classification and delivery functions using an example collection of articles. The last section presents the conclusions.
2 General Overview
With this system, conference organizers get support for archiving, indexing and classifying these articles into different classes or topics based on a thesaurus (Fig. 1). The process is initiated by an author writing a paper and submitting it over the Internet by email to the conference organizer. The conference organizer receives the paper and uses the system to index and classify it based on the thesaurus. Once the paper has been classified, it is mapped to the interests of the reviewers (their user profiles) so that it can be sent to an appropriate reviewer. Once a paper is complete, there needs to be a mechanism for submitting it to the system; this is accomplished by sending it as an e-mail attachment. The e-mail should follow a specific structure that contains the abstract and a list of keywords related to the paper being submitted.
Fig. 1. Articles flow in the system
When a paper is received by email, it is tested to ensure that it conforms to the required document structure, and an email is sent to the author indicating that the paper has been received. If the paper is accepted by the system, then email is sent to appropriate reviewers asking whether they can review it (appropriate reviewers may be nominated by the system based on keywords, or perhaps on reference citations). At this stage the review process starts. The system should provide the organizer with the necessary tools to track the papers currently under review, including the paper name and author, the names of the reviewers, when the paper was sent for review, the reviewer's summary opinion when it is available, and links to detailed notes. The organizer receives the review and, based on it, decides either to accept or to reject the paper.
2.1 User Profiles
A user profile is a collection of information that describes a user [5]. It may be defined as a set of keywords which describe the information in which the user is interested. With a user profile, the user can set certain preference criteria and ask for articles on specific topics. The profiling can be based on user-defined criteria; in our case it is done by collecting keywords from the reviewers' email messages sent to the conference. With the use of classification techniques based on a thesaurus, the article selection adapts to the reviewer's needs and interests according to his/her profile.
2.2 User Profile Creation
The user profiles of the reviewers are constructed as follows (Fig. 2). First, the email address of the sender and the other fields are extracted from the email header lines. The body of the email message is then extracted and parsed by the TDOCS system. As a result of the indexing process, all keywords from the message and the concepts derived from the thesaurus are stored in a database that reflects the user's interests. The thesaurus is thus used to create the user profile from the received email messages: these messages contain many terms that appear in the thesaurus, and these terms reflect the interests of the reviewer.
Fig. 2. User profile creation from the email message of a reviewer
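The following sketch illustrates this profile-construction step. It is a simplification, not the TDOCS indexer: the thesaurus is reduced to a toy keyword-to-concept mapping, and the email content and addresses are invented for the example.

from collections import defaultdict
from email import message_from_string

# Toy keyword -> concept mapping standing in for the real thesaurus.
THESAURUS = {"thesaurus": "knowledge representation",
             "ontology": "knowledge representation",
             "clustering": "data mining"}

profiles = defaultdict(lambda: {"keywords": set(), "concepts": set()})

def update_profile(raw_email):
    msg = message_from_string(raw_email)
    sender = msg["From"]                      # taken from the header lines
    body = msg.get_payload().lower()          # body of the message
    for keyword, concept in THESAURUS.items():
        if keyword in body:
            profiles[sender]["keywords"].add(keyword)
            profiles[sender]["concepts"].add(concept)

if __name__ == "__main__":
    update_profile("From: [email protected]\nSubject: review\n\n"
                   "I work on thesaurus construction and clustering of documents.")
    print(dict(profiles))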
3 Constructing the DEXA Thesaurus
In order to turn information retrieval systems into more useful tools for both professional and general users, one usually tends to enrich them with more intelligence by integrating information structures such as thesauri. There are three approaches to constructing a thesaurus. The first approach, designing a thesaurus from a document collection, is a standard one [6], [7]: by applying statistical [6] or linguistic [8], [9] procedures we can identify important terms as well as their significant relationships. The second approach is merging existing thesauri [10]. The third is based on tools from expert systems [11]. Our experiment to construct the Database and Expert Systems Applications (DEXA) Thesaurus is based on selecting web pages that are well representative of the domain. The web pages we selected were a sample of pages related to calls for papers. We start by parsing those web pages using ThesWB [1], [2]. The parsing process generates a list of terms represented in a hierarchical structure. An HTML document can be viewed as a structure of different nodes and can be parsed into a tree. The tree structure is constructed by parsing the tags and the corresponding content; the structural tags of the page are used to create its layout structure. For example, the tree structure consists of a Head node and a Body node. The Head is further divided into TITLE, META NAME="KEYWORD", etc., and the Body node has lower sub-levels corresponding to other tags. During the parsing process, ThesWB applies text extraction rules for each type of tag, and the extraction rule for each tag is applied until all tags have been extracted. The second step is to eliminate noisy terms and noisy relationships between terms from the list. The list in Fig. 3 shows a sample of the new list after removing the noisy terms. Finally, we convert this list into the thesaurus using the ThesWB Converter Tool.
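A minimal sketch of this kind of tag-based term extraction is shown below. It is not the ThesWB rule set: only the title, the meta keywords and the heading tags are handled, with the heading level used as a crude hierarchy indicator, and the sample page is invented.

from html.parser import HTMLParser

class TermExtractor(HTMLParser):
    """Collects (level, term) pairs; level 0 for title/meta keywords, 1-3 for headings."""
    def __init__(self):
        super().__init__()
        self.terms = []
        self._level = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "keywords":
            for kw in attrs.get("content", "").split(","):
                if kw.strip():
                    self.terms.append((0, kw.strip()))
        elif tag == "title":
            self._level = 0
        elif tag in ("h1", "h2", "h3"):
            self._level = int(tag[1])

    def handle_endtag(self, tag):
        if tag in ("title", "h1", "h2", "h3"):
            self._level = None

    def handle_data(self, data):
        if self._level is not None and data.strip():
            self.terms.append((self._level, data.strip()))

if __name__ == "__main__":
    page = ("<html><head><title>Call for Papers</title>"
            "<meta name='keywords' content='databases, expert systems'></head>"
            "<body><h1>Topics</h1><h2>Data Mining</h2></body></html>")
    extractor = TermExtractor()
    extractor.feed(page)
    print(extractor.terms)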
Fig. 3. A sample of terms having hierarchical relationship extracted by ThesWB from links in http://www.dexa.org/
The Database and Expert Systems Applications (DEXA) Thesaurus has been designed primarily to be used for indexing and classifying articles sent by email to a conference for evaluation by reviewers. The thesaurus provides a core terminology in the field of computer science; the draft version contains 141 terms.
4 Article Management System
In this paper we use the thesaurus to classify the articles into concepts (a subject hierarchy). In our application, all articles are classified into concepts, and our classification approach uses the DEXA Thesaurus as the reference thesaurus. Each article is automatically classified into the best matching concept in the DEXA Thesaurus. The TDOCS system gives a weight for each concept, and this weight can be used as the selection criterion for the best matching concepts of an article. The system compares each article to the list of keywords that describes a reviewer in order to classify the article as relevant or irrelevant (Fig. 4), and uses the user profiles to nominate an appropriate reviewer for each article.
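The nomination step can be sketched as a simple overlap score between the weighted concepts of an article and the concept sets stored in the reviewer profiles, as below. The concepts, weights and reviewer names are illustrative, and the real system may rank or combine reviewers differently.

def nominate(article_concepts, profiles, min_score=0.0):
    """article_concepts: dict concept -> weight; profiles: dict reviewer -> set of concepts.
    Returns the reviewer with the highest overlap score, or None if no one qualifies."""
    best_reviewer, best_score = None, min_score
    for reviewer, interests in profiles.items():
        score = sum(w for concept, w in article_concepts.items() if concept in interests)
        if score > best_score:
            best_reviewer, best_score = reviewer, score
    return best_reviewer, best_score

if __name__ == "__main__":
    article = {"knowledge discovery": 0.7, "neural networks": 0.2, "databases": 0.4}
    profiles = {"reviewer_a": {"databases", "query processing"},
                "reviewer_b": {"knowledge discovery", "neural networks"}}
    print(nominate(article, profiles))   # reviewer_b has the larger overlap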
5 An Experiment and Evaluation
A thesaurus is used to classify the articles; it can reflect the interests of a reviewer as well as the main topic of an article. The thesaurus is used not only for indexing and retrieving messages, but also for classifying articles [12]. We evaluated this classification method using the TDOCS toolkit and a collection of articles. The collection includes abstracts and keywords related to the database and expert systems domain in general. TDOCS parses the articles and indexes them using the thesaurus; thereafter the organizer can use the Document Search environment to retrieve related articles.
Fig. 4. An overview of the conference article delivering service
The task of the delivery system is to obtain a collection of articles to be delivered to a user; its ultimate goal is to select the articles that best reflect the reviewer's interests. We used TDOCS as the tool to index these articles. The indexing results can be used to classify the articles according to the main root terms, or according to other concepts that reflect the different topics. Reviewer interests are then matched against these database results to select the articles that reflect each reviewer's interests. We used a VC++ API application to map the interests of each user to the indexing results and to obtain the articles that best reflect his/her interests; the system then delivers these articles by electronic mail to that reviewer. The proposed approach has already been put into practice. A sample of 20 articles was automatically indexed using the DEXA Thesaurus and the results were evaluated manually; the tests showed that a good indexing quality has been achieved. The articles were taken from a cache directory. The batch process to index all the articles takes about 15 seconds, i.e. about 0.75 seconds to classify each article, compared to the 1-2 minutes a human indexer needs. The system provides fully automatic creation of structured user profiles and explores ways of incorporating users' interests into the parsing process to improve the results. The user profiles are structured as a concept hierarchy; they are shown to converge and to reflect the actual interests of the reviewers.
6 Conclusions and Future Work
The increasing number of articles sent to conferences presents a rich area which can benefit immensely from an automatic classification approach. We have presented an approach to automated article classification based on a thesaurus, for the purpose of developing an article delivery service. This paper described an experimental trial to test the feasibility of a thesaurus-based article classification and distribution system. The system seeks to identify keywords
and concepts that characterize the articles and to classify these articles into predefined categories based on the hierarchical structure of the thesaurus. The explicit interests recorded in a reviewer's profile enable the system to predict articles of interest. The experimental results show that the use of a thesaurus contributes to improved accuracy and to the improvements offered by the classification method. The DEXA Thesaurus is useful and effective for indexing and retrieving electronic articles; its concept hierarchies were used to capture user profiles and to classify article content, and our experiment supports this. To summarize, automatic article classification is an important problem nowadays. This paper has proposed a thesaurus-based approach to classifying and distributing articles, and the experimental results indicate that it produces accurate results.
References
[1] Abuzir, Y. and Vandamme, F., ThesWB: Work Bench Tool for Automatic Thesaurus Construction, in Proceedings of the STarting Artificial Intelligence Researchers Symposium (STAIRS 2002), Lyon, France, July 22-23, 2002.
[2] Abuzir, Y. and Vandamme, F., ThesWB: A Tool for Thesaurus Construction from HTML Documents, in Workshop on Text Mining held in conjunction with the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Taipei, Taiwan, May 6, 2002.
[3] Abuzir, Y. and Vandamme, F., Automatic E-mail Classification Based on Thesaurus, in Proceedings of the Twentieth IASTED International Conference on Applied Informatics (AI 2002), Innsbruck, Austria, 2002.
[4] Abuzir, Y., Vervenne, D., Kaczmarski, P. and Vandamme, F., TDOCS Thesauri for Concept-Based Document Retrieval, R&D Report BIKIT, BIKIT - LAE, University of Ghent, 1999.
[5] Jovanovic, D., A Survey of Internet Oriented Information Systems Based on Customer Profile and Customer Behavior, SSGRR 2001, L'Aquila, Italy, August 6-12, 2001.
[6] Salton, G. and McGill, M. J., Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
[7] Crouch, C. J., An approach to the automatic construction of global thesauri, Information Processing & Management, 26(5):629-640, 1990.
[8] Grefenstette, G., Use of syntactic context to produce term association lists for text retrieval, in SIGIR'92, pp. 89-97, 1992.
[9] Ruge, G., Experiments on linguistically based term associations, in RIAO'91, pp. 528-545, 1991.
[10] Sintichakis, M. and Constantopoulos, P., A Method for Monolingual Thesauri Merging, in Proc. 20th International Conference on Research and Development in Information Retrieval, ACM SIGIR, Philadelphia, PA, USA, July 1997.
[11] Güntzer, U., Jüttner, G., Seegmüller, G. and Sarre, F., Automatic Thesaurus Construction by Machine Learning from Retrieval Sessions, Information Processing & Management, 25(3):265-273, 1989.
[12] Abuzir, Y., Van Vosselen, N., Gierts, S., Kaczmarski, P. and Vandamme, F., MARIND Thesaurus for Indexing and Classifying Documents in the Field of Marine Transportation, accepted at MEET/MARIND 2002, Oct. 6-1, Varna, Bulgaria, 2002.
Information Retrieval Using Deep Natural Language Processing

Rossitza Setchi¹, Qiao Tang¹, and Lixin Cheng²
¹ Systems Engineering Division, Cardiff University, Cardiff, UK, CF24 0YF
{Setchi,TangQ}@cf.ac.uk, http://www.mec.cf.ac.uk/i2s/
² Automation Institute, Technology Center, Baosteel, Shanghai, China, 201900
[email protected], http://www.baosteel.com
Abstract. This paper addresses some problems of conventional information retrieval (IR) systems by suggesting an approach to information retrieval that uses deep natural language processing (NLP). The proposed client-side IR system employs the Head-Driven Phrase Structure Grammar (HPSG) formalism and uses Attribute-Value Matrices (AVMs) for information storage, representation and communication. The paper describes the architecture and the main processes of the system. Initial experimental results following the implementation of the HPSG processor show that the extraction of semantic information using the HPSG formalism is feasible.
1 Introduction
Studies indicate that the majority of web users find information by using traditional keyword-based search engines. Despite the enormous success of these search engines, however, they often fail to provide accurate, up-to-date, and personalised information. For example, traditional search engines can help in identifying entities of interest, but they fail in determining the underlying concepts or the relationships between these entities [1]. Personalization is still a challenging task due to the lack of the computational power needed to analyze the query history of individual users, identify frequently used concepts or knowledge domains, and re-rank the query results accordingly. Deep linguistic analysis of document collections is not performed for a similar reason, as it is much slower than conventional crawling and indexing [1]. These and other issues have motivated intensive research in the area of personalized information retrieval (IR) and filtering. A promising solution is offered by the emerging semantic web technologies, which attempt to express the information contained on the web in a formal way. However, the transformation of billions of web
pages written in various natural languages into machine-readable form is a colossal task at present. In addition, the range of metadata formats used and the inconsistent ontologies employed are still major obstacles. The aim of this work is to address these issues by developing a system for personalized information retrieval that uses natural language processing (NLP) techniques. The rest of the paper is organized as follows. Section 2 provides background information on the NLP technique employed in this research. Section 3 introduces the proposed IR system by focusing on its architecture, main processes, implementation and testing. Finally, Section 4 provides conclusions and directions for further research.
2 Background
The NLP formalism used in this research is Head-Driven Phrase Structure Grammar (HPSG). It is a constraint-based, lexicalist grammar formalism for natural language processing with a logical grounding in typed feature structures [2]. Recently, HPSG has been used in building a large-scale English grammar environment [3]. The HPSG formalism comprises principles, grammar rules, lexical rules, and a lexicon. The principles are global rules that all syntactically well-formed sentences obey, while the grammar rules are applied to specific grammar structures. The lexical rules are employed when deriving words from lexemes. Finally, the lexicon contains lexical knowledge. The HPSG formalism is represented using Attribute-Value Matrices (AVMs), which are notations for describing feature structures [4]. In addition, concept maps are used as tools for organizing and representing knowledge; they are effective means for presenting concepts and expressing complex meaning [5].
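A toy illustration of the feature-structure unification that underlies HPSG parsing is given below, with AVMs modelled as nested dictionaries. This is a deliberate simplification of the real formalism (no types, no structure sharing), and the feature names are invented for the example.

def unify(avm1, avm2):
    """Return the unified AVM, or None if the two structures are incompatible."""
    if not isinstance(avm1, dict) or not isinstance(avm2, dict):
        return avm1 if avm1 == avm2 else None   # atomic values must match exactly
    result = dict(avm1)
    for feature, value in avm2.items():
        if feature in result:
            merged = unify(result[feature], value)
            if merged is None:
                return None
            result[feature] = merged
        else:
            result[feature] = value
    return result

if __name__ == "__main__":
    noun = {"HEAD": {"POS": "noun", "AGR": {"NUM": "sg"}}}
    constraint = {"HEAD": {"AGR": {"NUM": "sg", "PER": "3rd"}}}
    print(unify(noun, constraint))                        # succeeds, merging the AGR features
    print(unify(noun, {"HEAD": {"AGR": {"NUM": "pl"}}}))  # None: number clash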
3 An Information Retrieval System Using HPSG
3.1 Technical Approach
The idea underlying this research is to utilize deep NLP techniques for information retrieval. Instead of using keyword-based algorithms for shallow natural language processing, a deep grammar analysis is employed to analyse sentences retrieved from the web and to identify main entities such as noun phrases, verb phrases and modifiers. These entities are revealed in the context of the sentences, which makes it possible to identify and analyze the relationships between them. The noun phrases and verb phrases are then generalized into conceptual knowledge that is indexed and later used to improve the information retrieval process. In addition to analysing the content of the retrieved web documents, it is equally important to accurately identify the purpose of the query made by the end user, i.e. the context behind his/her search for information. The approach adopted in this work is to extract the main concepts contained in the query and combine them with information from a knowledge base (KB). The dynamics of the user's ad-hoc and long-term information needs is encapsulated by adding new concepts and relationships from the queries to a user model. Finally, once the purpose of the query is identified, the content is retrieved from the web and analyzed using AVMs, and the result is represented via a concept map.
Fig. 1. Architecture of the proposed information retrieval system
3.2 System Architecture and Main Processes
The architecture of the proposed system is shown in Fig. 1. The system uses an HPSG processor for parsing user queries and web pages retrieved from the web. The web spiders are agents that retrieve and filter web pages. The intelligent agent is involved in the processing of the user queries and in information personalization. The concept representation module provides a graphical presentation of the retrieved information using concept maps. The concept KB contains background conceptual knowledge. The personalized KB includes previously retrieved and indexed items that might be needed in future queries. The language KB consists of systematised morphological, syntactic, semantic and pragmatic knowledge specific to the English language. The user profile database (DB) stores the details of the user's previous queries. The AVM engine is the AVM format wrapper of the HPSG processor. Two processes are described below. The first concerns the user's interaction with the system, namely the way his/her query is processed; the second relates to the retrieval of web content that is relevant to the user query. Processing user queries (Fig. 2). A user query is first processed by the intelligent agent, which uses the HPSG processor to interpret it. The HPSG processor uses the language KB to analyze the basic grammar and semantics of the query. When the interpretation of the query in AVM form is obtained, the intelligent agent extracts the important features from it by combining information from three sources: the user profile DB, which reports previous user preferences that are relevant to the current query; the personalized KB, which may contain relevant knowledge from previous retrieval tasks; and the language KB. After analyzing this input, the intelligent agent sends a query command to the web spiders. The user profile database is updated by storing the query in AVM form. The user preferences are modified accordingly, by comparing the main concept features in the AVM result with the
concept KB. This assures that the user profile is maintained in a dynamic manner that captures and reflects the user's short-term and long-term interests.
Retrieving web content (Fig. 3). The process starts when the web spiders receive the query from the intelligent agent. The web spiders retrieve related web content and forward it to the HPSG processor to parse the text. The user profile database and the language KB are then employed to analyse the content. When the web spiders receive the analysed result from the HPSG processor, they evaluate it using information from the user profile database and choose whether to store the retrieved information or discard it. If the information is found to be relevant, the result in AVM form is added to the personalised KB. Finally, the concept representation module provides a graphical interpretation of the AVM result for the user.
3.3 Implementation and Testing
The HPSG processor was built in Java using 53 classes, in about 4,000 lines of code. It has two sub-modules: a unification parser and an HPSG formalism module. A part of its UML class diagram is illustrated in Fig. 4. The algorithms employed are adapted from [6-8]. The design and the implementation of the HPSG processor are described in more detail in [9]. The HPSG processor was tested on a P4 1 GHz workstation with 256 MB of memory running Windows NT 4.0. An experiment with 40 English phrases and sentences was conducted. In this experiment, the language KB included about 50 lexemes, and the grammar contained 8 ID schemas (grammar rules). An example of parsing a simple question is illustrated in the Annex. The experimental results [9] showed that the parsing time increases proportionally to the number of words in a sentence. Therefore, the parsing of a text with 50 sentences (400 words) would require approximately 12 seconds. This time will need to be greatly reduced through optimising the parsing algorithm and introducing elements of learning into it.
Fig. 2. Processing a user query
Fig. 3. Retrieving web content
The web spiders are implemented in 10 Java classes, in about 3,700 lines of code, with multithreading support. The module can be configured to start exploring the web from any web address, for example portals or search engines. In the latter case, the spiders automatically submit keywords as instructed by the intelligent agent. The conducted experiment showed that the web spiders could retrieve approximately 1,000 pages per hour. The IR system uses an electronic lexical database, WordNet 1.6 [10], to retrieve lexical knowledge, e.g. part of speech, hyponyms, hypernyms and synonyms. The user interface is implemented in 12 classes, in about 2,400 lines of code; it uses an HTML rendering engine.
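The paper reports only the size and throughput of the spider module, not its internal structure. Purely as an illustration, the following minimal Java sketch shows how a multithreaded spider of this kind could be organised; the class and method names are ours, not taken from the system, and the link-extraction regular expression is deliberately simplistic.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.concurrent.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Minimal multithreaded spider sketch: fetches pages starting from a seed URL,
 *  hands the page text over for analysis and follows outgoing links. */
public class SimpleSpider {
    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");
    private final ExecutorService pool = Executors.newFixedThreadPool(8);
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    public void crawl(String url) {
        // bound the crawl and avoid revisiting pages
        if (visited.size() < 1000 && visited.add(url)) {
            pool.submit(() -> fetch(url));
        }
    }

    private void fetch(String url) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL(url).openStream(), StandardCharsets.UTF_8))) {
            StringBuilder page = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
            process(url, page.toString());      // e.g. forward the text for deeper analysis
            Matcher m = LINK.matcher(page);
            while (m.find()) {
                crawl(m.group(1));              // follow outgoing links
            }
        } catch (Exception e) {
            // unreachable or non-HTML resources are simply skipped in this sketch
        }
    }

    private void process(String url, String html) {
        System.out.println("Retrieved " + html.length() + " characters from " + url);
    }

    /** Stops accepting new pages; fetches already submitted are allowed to finish. */
    public void shutdown() { pool.shutdown(); }

    public static void main(String[] args) {
        SimpleSpider spider = new SimpleSpider();
        spider.crawl("http://example.org/");
        spider.shutdown();
    }
}
```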
Fig. 4. UML class diagram of the HPSG processor's main components
4 Conclusions and Further Work
A client-side information retrieval system using a deep natural language grammar and semantic analyses processor is proposed in this work. It uses HPSG algorithms and AVMs to extract concept relationships and semantic meaning. The initial experiments show the feasibility of the proposed approach. Further work includes a large-scale performance evaluation of this system that will include extending the current language KB and conducting experiments with a number of search engines.
References [1]
Chakrabarti, S., Mining the Web: Analysis of Hypertext and Semi Structured Data. Morgan Kaufmann Publishers (2002). [2] Carpenter, B., The Logic of Typed Feature Structures. Cambridge University Press, New York (1992). [3] Copestake, A., Implementing Typed Feature Structure Grammars. University of Chicago Press, Stanford, California (2002). [4] Sag, I. and A. Wasow, T., Syntactic Theory: A Formal Introduction. Cambridge University Press, Stanford, California (1999). [5] Novak, J.D. and Gowin, D.B., Learning How to Learn. Cambridge University Press, New York (1984). [6] Keselj, V., Java Parser for HPSGs: Why and How. Proceedings of the Conference Pacific Association for Computational Linguistics, PACLING'99, Waterloo, Ontario, Canada (1999). [7] Kasami, T., An Efficient Recognition and Syntax Algorithm for Context-free Languages. Bedford, Massachusetts, US, Air Force Cambridge Research Laboratory (1965). [8] Younger, D. H., Recognition of Context-free Languages in Time n3. Information and Control, Vol. 10 (2), 189-208 (1967). [9] Chen, L. Internet Information Retrieval Using Natural Language Processing Technology, MSc Thesis, Cardiff University, Cardiff (2002). [10] Fellbaum, C., WordNet: An Electronic Lexical Database, MIT Press, Cambridge (1998).
Annex
Fig. A1. Parsing of the sentence “what did Napoleon give Elisa”. (1) and (2) show that the sentence is an interrogative question. (3), (4), (5) and (6) show that “what” needs to be answered and it has to be a nominative component. (7), (8), (9) and (10) illustrate that the subject of the sentence is “Napoleon”, the main verb is “give”, “Elisa” is the indirect object complement of the transitive word “give”, and the nominative component that the question asks is a direct object complement of “give”. Therefore, basic information such as subject, verb, objects, and what the question asked is, can be drawn from the parsing result. Further algorithms can be applied to match this semantic information with the information stored in the knowledge base, e.g. “In 1809, Napoleon gave his sister, Elisa, the Grand Duchy of Tuscany ”
Ontology of Domain Modeling in Web Based Adaptive Learning System Pinde Chen and Kedong Li School of Education Information Technology South China Normal University 510631 Guangdong , China {pinde,likd}@scnu.edu.cn
Abstract: The domain model, which embodies the logical relations of the teaching material and the related instruction strategies, is an important component of an Adaptive Learning System (ALS). It is the basis for user modelling and the inference engine. In this paper, the ontology of the domain model in the Web Based Intelligent Adaptive Learning System (WBIALS) and its formal presentation are discussed in detail.
1 Introduction
Adaptive learning supporting environments are a crucial research issue in distance education. Since the mid-nineties, the number of research papers and reports in this area has been increasing fast, especially on Adaptive Hypermedia Systems [1]. By using AI (Artificial Intelligence) in hypermedia, the system can understand the user and the application domain, and it can then customise the learning material for the user and direct him through the learning process. De Bra [2] gives an AHS reference model named AHAM. In this model, an AHS is composed of four components: the Domain Model (DM), the User Model (UM), the Teaching Model or Adaptation Model (TM or AM), and the Adaptive Engine (AE). The DM is a main component of an AHS. The system should understand the domain and know the status and requirements of the user so that it can adapt to the user. The DM involves knowledge representation and its organisation, while the UM involves the representation and maintenance of user information. Based on the DM and the UM, the AE performs operations and adapts to the user.
Ontology is originally a philosophical concept concerning the essence of existence. In recent years, however, it has been used in computer science and has come to play an increasingly important role in AI, computational linguistics and database theory. So far, ontology still has neither a consistent definition nor a fixed application domain. Gruber [3] at Stanford University pointed out that an ontology is a precise description of a conceptualisation, a view accepted by many researchers. Ontologies are used to describe the essence of things; their main purpose is to represent implied information precisely so that it can be reused and shared.
In this paper, the ontology of domain knowledge in the Web Based Intelligent Adaptive Learning Supporting System (WBIALS) is presented in detail.
2 Related Works
In recent years, different efforts have been made to develop open learning technology specifications, in laboratories, industry consortia, professional associations and standardisation bodies, and there are already some results (e.g. IEEE LTSC LOM, IMS specifications, ADL SCORM). In LOM [9], the concept of a learning object is defined as any entity, digital or non-digital, that can be used, re-used, or referenced during technology-supported learning. Learning objects are entities that may be described with metadata. In practice, learning objects are mostly small objects and do not provide sufficient information to build a learning unit. In IEEE LTSC CMI, a course can be cut into several blocks, and a block can be cut into assignable units. For assignable units, prerequisite units can be set, which makes course sequencing possible. In the IMS Learning Design Specification [10], an information model of a unit of learning is defined, which provides a framework for learning material and the learning process.
3 Ontology of Domain Model
The domain model represents the elements of a domain and their relations. In the WBIALS system, the structure of the domain model is depicted in Fig. 1.
Fig. 1. Ontology of domain model (UML class diagram)
3.1 Curriculum and Curriculum Group
The goal of the WBIALS system is to be a general platform supporting adaptive learning, so it can support many curriculums. In this system, a curriculum group is a set of curriculums. Definition 1: A curriculum group is a set of curriculums, i.e., Curriculum group = {x | x is a curriculum}.
3.2 Concepts in a Curriculum
Concepts in a curriculum have five types: fact, concept, ability, principle, and problem solving. Concepts in a curriculum are used to index the glossary, examples, learning units, entries in the FAQ and so on. Every concept ci can be represented as a 4-tuple <C_id, C_name, C_type, C_des>, where C_id is the identifier of the concept, C_name its name, C_type its type, and C_des its description. Definition 2: A concept is a 4-tuple <C_id, C_name, C_type, C_des>. All concepts in a curriculum form a set named Cs, where Cs = {ci | ci is a concept in the curriculum}.
3.3 Learning Unit
A learning unit is an entity that bundles all kinds of learning material, instruction guidance, related information and so on. A learning unit usually consists of the following elements:
• Instruction hint or introduction. Depending on the instruction theory, the content of this part can be a learning goal, learning guidance or a real question about the unit.
• Unit content. This is the main part of a learning unit and the object of the user's learning. It is often composed of one or a few HTML pages, one of which works as the entrance into the unit. Considering that users with different learning styles or backgrounds may need different unit content, the structure of the unit content should be a 4-tuple <Lu_id, Url, ls_ID, Bg_ID>, where Lu_id is the identifier of the learning unit, Url the address of an HTML file, ls_ID the identifier of a learning style, and Bg_ID the identifier of background knowledge. One Lu_id can map to several HTML files, each of which is adapted to one type of user.
• Lecture. A lecture is another instruction form, often used in the traditional classroom. It can be a PowerPoint file or a video recording, provided to the user as alternative learning material.
• Example
• Demonstration
• Exercise
• Summary
• Expanded content
A learning unit need not contain all the above elements. Generally, only the unit content is necessary, and all the other elements are optional. Definition 3: A learning unit is an atomic learning cell, a set of all kinds of learning material, instruction guidance, related information and so on; it can be represented as an 8-tuple <Intr, Cont, Lect, Exam, Demo, Exer, Summ, Expa>. In practice, every element of a learning unit maps to a paragraph, the address of an HTML file, or a pointer to another object. A concept is an abstract representation of an information item from the application domain, whereas a learning unit is an entity in the system, so concepts must be related to learning units. A learning unit is often related to several concepts. The relation between them can be represented by a 3-tuple <C_id, Lu_id, Ob_tier>, where C_id is the identifier of a concept, Lu_id the identifier of a learning unit, and Ob_tier the objective tier, with Ob_tier ∈ {knowledge, comprehension, application, analysis, synthesis, evaluation}. A learning unit has one exercise group or none, and an exercise group consists of many exercise items.
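The paper defines these tuples only abstractly. Purely as an illustration, the following Java sketch shows one possible encoding of Definition 2 and of the concept/learning-unit relation; all type and field names are ours, not part of WBIALS.

```java
/** Objective tiers used in the concept/learning-unit relation (Bloom-style levels). */
enum ObjectiveTier { KNOWLEDGE, COMPREHENSION, APPLICATION, ANALYSIS, SYNTHESIS, EVALUATION }

/** Concept types distinguished in Section 3.2. */
enum ConceptType { FACT, CONCEPT, ABILITY, PRINCIPLE, PROBLEM_SOLVING }

/** Definition 2: a concept is the 4-tuple <C_id, C_name, C_type, C_des>. */
record Concept(String id, String name, ConceptType type, String description) {}

/** The 3-tuple <C_id, Lu_id, Ob_tier> relating a concept to a learning unit. */
record ConceptUnitRelation(String conceptId, String learningUnitId, ObjectiveTier tier) {}

class DomainModelDemo {
    public static void main(String[] args) {
        Concept loops = new Concept("c01", "while loop", ConceptType.CONCEPT,
                "Repetition of a statement block while a condition holds");
        ConceptUnitRelation r = new ConceptUnitRelation(loops.id(), "lu07",
                ObjectiveTier.APPLICATION);
        System.out.println(loops + " -> " + r);
    }
}
```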
3.4 Curriculum Structure
The curriculum structure includes the following information:
• tree-like directory
• pre-next network
• inference network
3.4.1 Tree-Like Directory
The tree-like directory is constructed from learning units. It helps the user get a book-like overview of the curriculum.
Most nodes in the tree map to a learning unit, but there are also some nodes, called virtual nodes, that have no content attached to them. Besides, some nodes in the directory may be optional learning units. Definition 4: A curriculum directory is a tree-like structure whose nodes can be represented as a 7-tuple <Node_ID, Node_name, Node_layer, Node_sequence, Lu_ID, Node_type, Url>, where Node_ID is the identifier of the node; Node_name is the name of the node as shown in the directory; Node_layer is the tier of the node and Node_sequence its sequence (together they determine the structure and appearance of the directory); Lu_ID is the identifier of the associated learning unit; Node_type states whether the node is optional or compulsory; and Url, when the node is a virtual one, decides which HTML page will be displayed.
3.4.2 Relations among Learning Units in a Curriculum
One learning unit in a curriculum often has a relation to another one; there are inherent logical relations among them. When writing a textbook, the author should analyse the inherent relations of the curriculum and then decide the sequence of chapters. In the process of learning, the student will be more efficient if he can follow a rational route through the textbook. Indeed, during instruction design, the main job of the analysis of instruction tasks is to work out the instruction goal and divide it into several smaller goals, so as to set up a proper sequence that promotes valid learning. There are two types of relations among learning units: the pre-next relation and the inference relation.
3.4.3 Network of Pre-next Relations
According to instruction experience, the following generation rule may exist between learning units vi and vj: if vi is mastered then vj can be learnt next (supporting factor λ). It states that learning unit vi is a precondition of vj, with supporting factor λ. This rule can be represented as a 3-tuple <vi, vj, λ>. The degree of "mastered" can be: very good, good, average, small part, or not at all. Combining all the pre-next generation rules yields a weighted directed acyclic graph (DAG), as shown in Fig. 2. Definition 5: The network of pre-next relations is a weighted directed acyclic graph (DAG) G = <V, R, W>, where V = {v1, v2, ..., vn} is the set of learning units, R = {<vi, vj> | vi is a precondition of vj}, and W = {wij | <vi, vj> ∈ R and wij ∈ [0, 1]}.
Fig. 2. Network of pre-next relations
3.4.4 Inference Network
According to instruction experience, the following generation rule may exist between learning units vi and vj: if vi is "mastered" then vj is "mastered" (threshold λ). It expresses the inference relation between vi and vj: if the system obtains evidence that the degree of mastery of vi is greater than λ, it can infer that vj is mastered. The degree of "mastered" can be: very good, good, average, small part, or not at all. The inference relations in a curriculum also form a weighted directed acyclic graph (DAG). Definition 6: The network of inference relations is a weighted directed acyclic graph (DAG) G = <V, R, W>, where V = {v1, v2, ..., vn} is the set of learning units, R = {<vi, vj> | if vi is mastered then vj is mastered}, and W = {wij | <vi, vj> ∈ R and wij ∈ [0, 1]}. As the above shows, the mastery degree of a learning unit, the pre-next rules and the inference rules are all vague, so it is necessary to use fuzzy sets and fuzzy rules to represent them.
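The paper specifies the two networks only as weighted DAGs with fuzzy rules. The sketch below is a hypothetical Java encoding of such a graph and of the crisp firing of an inference rule; a full treatment would use fuzzy membership degrees, which is not shown, and all names are ours.

```java
import java.util.*;

/** Sketch of the weighted DAGs of Definitions 5 and 6: vertices are learning
 *  units, edges carry a supporting factor or threshold in [0, 1]. */
class UnitGraph {
    // adjacency list: source unit -> (target unit -> weight)
    private final Map<String, Map<String, Double>> edges = new HashMap<>();

    void addRule(String from, String to, double weight) {
        if (weight < 0.0 || weight > 1.0) {
            throw new IllegalArgumentException("weight must be in [0,1]");
        }
        edges.computeIfAbsent(from, k -> new HashMap<>()).put(to, weight);
    }

    /** Inference-network use (Definition 6): if the mastery degree of 'from'
     *  reaches an edge threshold, the target unit is inferred as mastered. */
    Set<String> inferMastered(String from, double masteryDegree) {
        Set<String> result = new HashSet<>();
        for (Map.Entry<String, Double> e : edges.getOrDefault(from, Map.of()).entrySet()) {
            if (masteryDegree >= e.getValue()) result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        UnitGraph inference = new UnitGraph();
        inference.addRule("u1", "u2", 0.8);   // if u1 is mastered to degree >= 0.8, infer u2 mastered
        inference.addRule("u1", "u3", 0.95);
        System.out.println(inference.inferMastered("u1", 0.85));  // prints [u2]
    }
}
```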
3.5 Learning Task
The learning task, which supports a task-based learning style, is another form of organisation of the curriculum content. A learning task, which is a subset of the set of learning units, consists of several learning units and has an identifier. The pre-next relations and inference relations within a task form subgraphs of the pre-next network and the inference network.
3.6 FAQ and Discuss Group
Every entry in the FAQ and the discuss group is indexed by the identifier of a learning unit, which lays the ground for customising help information.
4 Conclusion
In this paper, we have given a precise description of the ontology of the domain model. In the WBIALS system, we use an overlay model as the user model, and fuzzy sets and fuzzy rules are used to represent vague information and to make inferences. The system provides adaptive presentation, adaptive navigation and customised help information. In particular, an adaptive dynamic learning zone was implemented for the task-based learning style. This paper is sponsored by the scientific fund of the Education Department, Guangdong, China (Project No. Z02021).
References [1]
Brusilovsky, P. (1996) Methods and techniques of adaptive hypermedia. User Modeling and User-Adapted Interaction, 6 (2-3), pp. 87-129. [2] De Bra, P., Houben, G.J., Wu, H., AHAM: A Dexter-based Reference Model for Adaptive Hypermedia, Proceedings of the ACM Conference on Hypertext and Hypermedia, pp. 147-156, Darmstadt, Germany, 1999. (Editors K. Tochtermann, J. Westbomke, U.K. Wiil, J. Leggett) [3] Gruber T R.Toward Principles for the Design of Ontologies Used for Knowledge Sharing.Int. Journal of Human and Computer Studies, 1995:907928 [4] Liao Minghong, Ontology and Information Retrieval, Computer Engineering (china),2000,2, p56-p58 [5] Jin Zhi, Ontology-based Requirement s Elicitation, Chinese J. Computers, 2000,5, p486-p492. [6] Brusilovsky, P., Eklund, J., and Schwarz, E. (1998) Web-based education for all: A tool for developing adaptive courseware. Computer Networks and ISDN Systems (Proceedings of Seventh International World Wide Web Conference, 14-18 April 1998, 30 (1-7), 291-300. [7] Gerhard, W., Kuhl, H.-C. & Weibelzahl, S.(2001). Developing Adaptive Internet Based Courses with the Authoring System NetCoach. In: P. De Bra P. and Brusilovsky (eds.) Proceedings of the Third Workshop on Adaptive Hypertext and Hypermedia. Berlin: Springer. [8] De Bra, P., & Calvi, L. (1998). AHA! An open Adaptive Hypermedia Architecture. The New Review of Hypermedia and Multimedia , 4 115-139. [9] IEEE LTSC , http://ltsc.ieee.org [10] IMS Learning Design Specification, http://www.imsglobal.org
Individualizing a Cognitive Model of Students' Memory in Intelligent Tutoring Systems Maria Virvou and Konstantinos Manos Department of Informatics University of Piraeus Piraeus 18534, Greece [email protected];
[email protected]
Abstract: Educational software applications may be more effective if they can adapt their teaching strategies to the needs of individual students. Individualisation may be achieved through student modelling, which is the main practice for Intelligent Tutoring Systems (ITS). In this paper, we show how principles of cognitive psychology have been adapted and incorporated into the student modelling component of a knowledge-based authoring tool for the generation of ITSs. The cognitive model takes into account the time that has passed since the learning of a fact has occurred and gives the system an insight of what is known and remembered and what needs to be revised and when. This model is individualised by using evidence from each individual student's actions.
1 Introduction
The fast and impressive advances of Information Technology have rendered computers very attractive media for the purposes of education. Many presentation advantages may be achieved through multimedia interfaces and easy access can be ensured through the WWW. However, to make use of the full capabilities of computers as compared to other traditional educational media such as books, the educational applications need to be highly interactive and individualised to the particular needs of each student. It is simple logic that response individualized to a particular student must be based on some information about that student; in Intelligent Tutoring Systems this realization led to student modeling, which became a core or even defining issue for the field (Cumming & McDougall 2000). One common concern of student models has been the representation of the knowledge of students in relation to the complete domain knowledge, which should be learnt by the student eventually. Students' knowledge has often been considered as a subset of the domain knowledge, such as in the overlay approach that was first used in (Stansfield, Carr & Goldstein 1976) and has been used in many systems since then (e.g. Matthews et al. 2000). Another approach is to consider the student's knowledge V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 893-897, 2003. Springer-Verlag Berlin Heidelberg 2003
in part as a subset of the complete domain knowledge and in part as a set of misconceptions that the student may have (e.g. Sleeman et al. 1990). Both cases represent an all or nothing approach on what the student knows and they certainly do not take into account the temporal aspects of knowledge, which are associated with how students learn and possibly forget. In view of the above, this paper describes the student modeling module of an educational application. This module can measure-simulate the way students learn and possibly forget through the whole process of a lesson. For this purpose, it uses principles of cognitive psychology concerning human memory. These principles are combined with evidence from the individual students' actions. Such evidence reveals the individual circumstances of how a student learns. Therefore the student model takes into account how long it has been since the student has last seen a part of the theory, how many times s/he has repeated it, how well s/he has answered questions relating to it. As a test-bed for the generality of our approach and its effectiveness within an educational application we have incorporated it in a knowledge based authoring tool. The authoring tool is called Ed-Game Author (Virvou et al. 2002) and can generate ITSs that operate as educational games in many domains.
2 The Cognitive Model of Memory for an Average Student
A classical approach on how people forget is based on research conducted by Herman Ebbinghaus and appears in a reprinted form in (Ebbinghaus, 1998). Ebbinghaus' empirical research led him to the creation of a mathematical formula which calculates an approximation of how much may be remembered by an individual in relation to the time from the end of learning (Formula 1).
b = 100 * k / ((log t)^c + k)    (1)
Where:
• t is the time in minutes, counting from one minute before the end of the learning;
• b is the equivalent of the amount remembered from the first learning;
• c and k are two constants with the calculated values k = 1.84 and c = 1.25.
In the student model of Ed-Game Author the Ebbinghaus calculations have been the basis for finding out how much is remembered by an average student. In particular, there is a database that simulates the mental library of the student. Each fact a student encounters during the game-lesson is stored in this database as a record. In addition to the fact, the database also stores the date and time when the fact was last used, in a field called LastAccessDate. A fact is first added to the memory database when a student is first taught this fact through a lesson. When a fact is inserted in the database, the current date and time is also recorded in the field called TeachDate.
Thus, whenever the system needs to know the current percentage of a student's memory of a fact, the equation (2) is used, which is largely based on the Ebbinghaus' power function. However, equation (2) has been adapted to include one more factor, which is called the Retention Factor (RF). The retention factor is used to individualise this equation to the particular circumstances of each student by taking into account evidence from his/her own actions. If the system does not take into account this evidence from the individual students' actions then the Retention Factor may be set to 100, in which case the result is identical to the generic calculations of Ebbinghaus concerning human memory in general. However, if the system has collected sufficient evidence for a particular student then when a fact is first encountered by this student the Retention Factor is set to 95 and then it is modified accordingly as will be described in detail in the following sections. The RF stored in the “mental” database for each fact is the one representing the student's memory state at the time showed by the TeachDate field.
X% = (b * RF) / 100    (2)
Where b is the Ebbinghaus' power function result, setting t=Now-TeachDate.
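As an illustration of equations (1) and (2), the following hypothetical Java sketch computes the retention percentage of a fact. The logarithm is taken here as base 10, which the paper does not state explicitly, and the class and method names are ours rather than Ed-Game Author's.

```java
/** Sketch of the retention computation of equations (1) and (2). */
class RetentionModel {
    private static final double K = 1.84;
    private static final double C = 1.25;

    /** Equation (1): amount remembered t minutes after (one minute before) the end of learning.
     *  Valid for t >= 1 minute; log is assumed to be base 10. */
    static double ebbinghaus(double tMinutes) {
        return 100.0 * K / (Math.pow(Math.log10(tMinutes), C) + K);
    }

    /** Equation (2): retention percentage of a fact with the given Retention Factor,
     *  where the elapsed time is Now - TeachDate in minutes. */
    static double retention(double minutesSinceTeachDate, double retentionFactor) {
        return ebbinghaus(minutesSinceTeachDate) * retentionFactor / 100.0;
    }

    public static void main(String[] args) {
        // A fact taught 5 minutes ago, with the initial RF of 95:
        System.out.printf("Retention after 5 minutes: %.1f%%%n", retention(5, 95));
    }
}
```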
3 Memorise Ability
One important individual student characteristic that is taken into account is the ability of each student to memorise new facts. Some students have to repeat a fact many times to learn it while others may remember it from the first occurrence with no repetition. To take into account these differences, we have introduced the student's Memorise Ability factor (MA). This factor's values range between 0 and 4. The value 0 corresponds to “very weak memory”, 1 to “weak memory”, 2 to “moderate memory”, 3 to “strong memory” and 4 to “very strong memory”. During the course of a virtual-game there are many different things that can give an insight on what the student's MA is. One important hint can be found in the time interval between a student's having read about a fact and his/her answering a question concerning this fact. For example, if the student has given a wrong answer about a fact that s/he has just read about then s/he is considered to have a weak memory. On the other hand if s/he gives a correct answer concerning something s/he had read about a long time ago then s/he is considered to have a strong memory. Taking into consideration such evidence, the student's MA value may be calculated. Using MA the Retention Factor is modified according to the MA value of the student in the way illustrated in Table 1. As mentioned earlier, every fact inserted in the database has an initial RF of 95.
Table 1. Retention Factor's modification depending on the Memorise Ability
Memorise Ability Value    Retention Factor Modification
0 (very weak memory)      RF' = RF - 5
1 (weak memory)           RF' = RF - 2
2 (moderate memory)       RF' = RF
3 (strong memory)         RF' = RF + 2
4 (very strong memory)    RF' = RF + 5
After these modifications the RF ranges from 90 (very weak memory) to 100 (very strong memory), depending on the student's profile. Taking as a fact that any RF below 70 corresponds to a “forgotten” fact, using the Ebbinghaus' power function, the “lifespan” of any given fact for the above mentioned MA may be calculated. So a student with a very weak memory would remember a fact for 3 minutes while a student with a very strong memory would remember it for 6.
4 Response Quality
During the game, the student may also face a question-riddle (which needs the "recall" of some fact to be correctly answered). In that case the fact's Retention Factor is updated according to the student's answer. For this modification an additional factor, the Response Quality (RQ) factor, is used. This factor ranges from 0 to 3 and reflects the "quality" of the student's answer. In particular, 0 represents "no memory of the fact", 1 an "incorrect response, but the student was close to the answer", 2 a "correct response, but the student hesitated", and 3 a "perfect response". Depending on the Response Quality factor, the formulae for the calculation of the new RF are illustrated in Table 2. When a student gives an incorrect answer, the TeachDate is reset, so that the Ebbinghaus power function is restarted. When a student gives a correct answer, the increase of his/her Retention Factor depends on his/her profile and more specifically on his/her Memorise Ability factor. In particular, if the student's RQ is 2 and s/he has a very weak memory then the RF will be increased by 3 points (extending the lifespan of the memory of a fact by about a minute), while if s/he has a very strong memory the RF will be increased by 15 (extending its lifespan by over 6 minutes). These formulae for the calculation of the RF give a more "personal" aspect to the cognitive model, since they are not generic but based on the student's profile.
Table 2. Response Quality Factor reflecting the quality of the student's answer
RQ    Modification
0     RF' = X - 10, where TeachDate = Now
1     RF' = X - 5, where TeachDate = Now
2     RF' = RF + (MA + 1) * 3
3     RF' = RF + (MA + 1) * 4
In the end of a “virtual lesson”, the final RF of a student for a particular fact is calculated. If this result is above 70 then the student is assumed to have learnt the fact, else s/he needs to revise it.
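The update rules of Tables 1 and 2 and the 70-point threshold can be summarised in a short sketch such as the following. This is an illustration only, with names of our choosing; it does not model the TeachDate reset or any cap on the RF, neither of which is specified in detail by the paper.

```java
/** Sketch of the Retention Factor update rules of Tables 1 and 2.
 *  MA is the Memorise Ability (0-4), RQ the Response Quality (0-3). */
class RetentionFactorRules {

    /** Table 1: adjust the initial RF of a newly taught fact by the student's MA. */
    static double applyMemoriseAbility(double rf, int ma) {
        switch (ma) {
            case 0: return rf - 5;   // very weak memory
            case 1: return rf - 2;   // weak memory
            case 3: return rf + 2;   // strong memory
            case 4: return rf + 5;   // very strong memory
            default: return rf;      // moderate memory
        }
    }

    /** Table 2: update the RF after an answer of quality rq, given the current
     *  retention percentage x (equation 2) and the student's MA. */
    static double applyResponseQuality(double rf, double x, int rq, int ma) {
        switch (rq) {
            case 0: return x - 10;                  // no memory of the fact
            case 1: return x - 5;                   // incorrect, but close
            case 2: return rf + (ma + 1) * 3;       // correct, but hesitated
            default: return rf + (ma + 1) * 4;      // rq == 3, perfect response
        }
    }

    /** Below 70 the fact is considered forgotten and should be revised. */
    static boolean isLearnt(double rf) {
        return rf > 70;
    }

    public static void main(String[] args) {
        double rf = applyMemoriseAbility(95, 0);    // 90 for a very weak memory
        rf = applyResponseQuality(rf, 75, 2, 0);    // correct but hesitant answer: 90 + 3 = 93
        System.out.println(rf + " learnt=" + isLearnt(rf));
    }
}
```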
5 Conclusions
In this paper we described the part of the student modelling process of an ITS authoring tool that keeps track of the students' memory of facts that are taught to him/her. For this reason we have adapted and incorporated principles of cognitive psychology into the system. As a result, the educational application takes into account the time that has passed since the learning of a fact has occurred and combines this information with evidence from each individual student's actions. Such evidence includes how easily a student can memorise new facts and how well she can answer to questions concerning the material taught. In this way the system may know when each individual student may need to revise each part of the theory being taught.
References
[1] Cumming, G. & McDougall, A.: Mainstreaming AIED into Education? International Journal of Artificial Intelligence in Education, Vol. 11 (2000), 197-207.
[2] Ebbinghaus, H. (1998) "Classics in Psychology, 1885: Vol. 20, Memory", R.H. Wozniak (Ed.), Thoemmes Press, 1998.
[3] Matthews, M., Pharr, W., Biswas, G. & Neelakandan (2000). "USCSH: An Active Intelligent Assistance System," Artificial Intelligence Review 14, pp. 121-141.
[4] Sleeman, D., Hirsh, H., Ellery, I. & Kim, I. (1990). "Extending domain theories: Two case studies in student modeling", Machine Learning, 5, pp. 11-37.
[5] Stansfield, J.C., Carr, B., & Goldstein, I.P. (1976) Wumpus advisor I: a first implementation of a program that tutors logical and probabilistic reasoning skills. AI Lab Memo 381, Massachusetts Institute of Technology, Cambridge, Massachusetts.
[6] Virvou, M., Manos, C., Katsionis, G., Tourtoglou, K. (2002), "Incorporating the Culture of Virtual Reality Games into Educational Software via an Authoring Tool", Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC 2002), Tunisia.
An Ontology-Based Approach to Intelligent Instructional Design Support Helmut Meisel¹, Ernesto Compatangelo¹, and Andreas Hörfurter² (¹Department of Computing Science, University of Aberdeen, AB24 3UE, Scotland; ²Kompetenzwerk, Riedering, Germany)
Abstract. We propose a framework for an Intelligent Tutoring System to support instructors in the design of a training. This requires the partial capture of Instructional Design theories, which define processes for the creation of a training and outline methods of instruction. Our architecture relies on an ontology-based knowledge representation. Reasoning based on Description Logics supports the modelling of knowledge, the retrieval of suitable teaching methods, and the validation of a training. A small exemplary ontology is used to demonstrate the kind of Instructional Design knowledge that can be captured in our approach.
1 Background and Motivation
Instructional design (ID) theories support instructors in designing a training. These theories describe processes for the creation of a training, outlining methods of instruction together with situations in which those methods should be used [1]. The design of a training involves the definition and the classification of learning goals, the selection of suitable teaching methods and their assembly into a course [2]. The learning outcome of a training is improved if these theories are applied in practice. This paper proposes an architecture for an Intelligent Tutoring System that promotes the application of ID theories in practice. Such a system should assist an instructor in designing a training while educating him/her in Instructional Design. More specifically, this system should fulfill the following requirements: – Assist its user in the selection of appropriate teaching methods for a training and encourage the application of a wide range of available teaching methods. – Instruct its user about particular Teaching Methods (TMs) – Highlight errors in the design of a training. In our envisaged system (see Figure 1 for an architectural overview) ID experts maintain the system’s knowledge directly. This requires a conceptual representation of ID knowledge that can be understood by non computing experts. Inferences are necessary to perform the validation and verification tasks V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 898–905, 2003. c Springer-Verlag Berlin Heidelberg 2003
Fig. 1. Architectural Overview
as well as the retrieval of suitable teaching methods. We propose ontologies as the knowledge representation mechanism and show how set-theoretic reasoning with ontologies can provide the necessary inferences.
Related work in Artificial Intelligence has contributed to Instructional Design with tools targeting its authoring [3] and with systems for the (partial) automation of ID processes [4]. Early systems were built with heuristic knowledge such as condition-action rules [5]. A recent example of a rule-based approach is the Instructional Material Description Language (IMDL), which provides specifications for the automated generation of alternative course designs. IMDL considers instructional elements (i.e. learners or learning objectives) as pre-conditions and didactic elements (i.e. software components for the display of courses) as post-conditions. An inference engine creates alternative courses based on these specifications [6]. While this approach performs reasoning about ID knowledge, it does not provide a solution for an Intelligent Tutoring System. Rule-based systems create a "conceptual gap" between an expert's knowledge and its representation in the rule-base [5]. Rule-based systems do not highlight relationships between corresponding rules and are therefore difficult to create or to maintain.
In e-learning, Meta-Data (such as the Learning Object Metadata) enable sharing and reuse of educational resources. However, current standards offer only limited support to describe pedagogical knowledge [7, 8]. Our framework subsumes these standards, as Metadata classes can be viewed as ontology concepts. Ontologies are increasingly used to organize ID knowledge in e-learning platforms. In most cases, they facilitate the retrieval of semantically marked-up educational resources [9, 10]. The Educational Modelling Language (EML) aims to provide a framework for the conceptual modelling of learning environments. It consists of four extendable top level ontologies that describe (i) theories about learning and instruction, (ii) units of study, (iii) domains, and (iv) how learners learn. EML ontologies could be reused in our framework. However, we aim to deploy the ontologies directly in an Intelligent Tutoring System rather than using them for design purposes only.
Knowledge representation and reasoning is equally important to meet the requirements for an Intelligent Tutoring System in ID. While rule-based systems could provide the necessary inferences, representational issues prevent their usage. Ontologies, on the other hand, are suitable means to capture ID knowledge. However, there is no standard inference mechanism for reasoning with ontologies.
Fig. 2. Exemplary Ontology
2 Representation of ID Knowledge
Instructional Design knowledge consists of a domain part about “methods of instructions” and their application and a procedural part about ID processes. Procedural knowledge of ID processes is of static nature and can be encoded in the User Interface as sequences of screens. Our framework mainly addresses the declarative part (i.e. the domain knowledge) of instructional design. The architecture of the proposed system relies on ontologies to capture this knowledge. Ontologies are “shared specifications of conceptualizations” [11] and encourage collaborative development by different experts. They capture knowledge at the conceptual level, thus enabling ID experts to directly manipulate it without the involvement of a knowledge engineer. Trainers may explore the ontology either by browsing through its entries or by querying it. Queries can also be assembled by the user interface in accordance with the information provided during the design process.
Fig. 3. Layered knowledge representation
Ontologies represent knowledge in taxonomies, where more specific concepts inherit the properties of those concepts they specialize. This allows knowledge reuse when an ontology needs to be extended. Figure 2 shows a small exemplary ontology to demonstrate, how ID knowledge can be captured. The part about Teaching Programming has been extracted from an existing evaluation of TMs in Computing Science [12]). Teaching methods are represented in the ontology by describing the situations they can be applied to (e.g. learning goals, characteristics of the learners, course domain, etc.). Instructions about the application of a teaching method can also be added to the ontology. In our exemplary ontology a Training is defined as a sequence of at least three different Teaching Methods thus asserting that diverse teaching methods should be applied during a training. Moreover, each Training must have at least one Learning Goal. A Learning Goal is addressed by one or more Teaching Method. Note that it is thus possible to apply more than one teaching method to a learning goal. The selection of a Teaching Method in our example depends on (i) the Domain it is applied to (e.g. programming), (ii) the supported Learning Goal (e.g. application of a programming language) and (iii) the Learner. In our exemplary ontology, Teaching Programming is understood to be the set of all teaching methods that are applicable to Programming. A specific rule like “It is generally agreed that fluency in a new programming language can only be attained by actually programming” [12] can also be included in the specification. This rule is represented by linking the concepts Programming Exercise and Gain Fluency in Programming with the attributes Learning Goal and Addressed by.
Instructional information about concepts must also be included in the ontology. This kind of information can be reviewed by a user in order to learn more about a particular teaching method. In the example, description, strength, and weakness of Teaching Method carry this information. Subclasses inherit this information from its superclasses. For instance, as Lecture about syntax and semantics is a subclass of both Lecture and Teaching Programming, it inherits the instructional information from both of them. A further benefit of multiple inheritance is that multiple views on the ontology can be defined. For instance, Teaching Methods can be classified according to learning strategy (e.g. Collaborative Learning, Discovery Learning, or Problem-based Learning). Our architecture represents knowledge in three layers (see Figure 3): – System Ontology. It defines categories of links between concepts (i.e. attributes). Links either point to instructional information about a TM or represent constraints for the selection of a TM. Only links representing constraints are subject to reasoning. Furthermore, the System Ontology specifies top-level concepts (e.g. Teaching Method, Learner, Learning Goal) which are relevant to model the specific implementation. – Intensional Descriptions. Every class in the System Ontology is specialized by subclasses (e.g. teaching methods can be classified as either receptive or active methods). This level defines generic building blocks for the description of concrete teaching methods in the following layer. – Extensional Descriptions. Individuals populate the ontologies defined in the previous layers. For instance, concrete teaching methods are elements of intensional descriptions (e.g. a Concrete Teaching Method “Writing a bubble sort program in Java” is an instance of the class Programming Exercise). The usage of individuals is essential for validation purposes as the reasoner can identify whether an individual commits to the structure defined in the ontology. This will be explored in the next section.
3 Reasoning Services
We have developed an Intelligent Knowledge Modelling Environment called ConcepTool [13], which is based on a conceptual model that can be emulated in a Description Logic [14]. This enables deductions based on set-theoretic reasoning, where concepts are considered as set specifications and individuals are considered as set elements. Computation of set containment (i.e. whether a set A is included in a set B), inconsistency (i.e. whether a set is necessarily empty), and set membership (i.e. whether an individual x is an element of the set A) enables the introduction of the following reasoning services. – Ontology Verification and Validation. Automated reasoning can provide support to ID experts during ontology creation and maintenance by detecting errors and by highlighting additional hierarchical links. For instance, if the ontology states that a TM must mention its strengths and weaknesses, the system will not accept TMs without this description. If the reasoner derives
An Ontology-Based Approach to Intelligent Instructional Design Support
903
a set containment relationship between two classes (e.g. a TM applicable to any domain and another TM applicable to the Computing domain only), it suggests the explicit introduction of a subclass link. In this case, all the properties of the superclass will be propagated to the subclass. – Retrieval of suitable teaching methods. Teaching Methods are returned as the result of a query, stated as a concept description, which is issued to the reasoner. This retrieves all the individual TMs that are members of this concept. In our exemplary ontology, a query that searches all the TMs for the Computing Domain with the Learning Goal Gain fluency in Programming returns all the elements of the classes Programming Exercise, Write Programs from Scratch, and Adapt Existing Code. Note that individuals are not included in the exemplary ontology shown in Figure 2. The query concept can be generated by the user interface during the creation of a training. As reasoning in Description Logics is sound, only those teaching methods are suggested which satisfy the constraints. – Validation and Verification of a training. Errors in the design of a training can be detected in two ways. The first way is to check whether a training commits to the axioms of the ontology. The training to be validated is considered as an instance of the Training class and thus needs to commit to its structure. Possible violations w.r.t. our exemplary ontology might be (i) to define only two Teaching Methods (whereas at least three are required), (ii) forget to specify a Learning Goal, or (iii) forget to address a learning goal with a Teaching Method. A further possibility to validate a training is to define classes of common errors (e.g. a training with receptive teaching methods only) and check whether the training under validation is an instance of a common error class.
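As an illustration of such a query concept, the retrieval of teaching methods for the Computing domain with the learning goal Gain fluency in Programming could be phrased as a Description Logic concept along the lines of

TeachingMethod ⊓ ∃applicable_domain.Computing ⊓ ∃learning_goal.GainFluencyInProgramming

where the concept and role names follow the exemplary ontology described above, but the concrete syntax is ours rather than ConcepTool's. The reasoner then returns every individual teaching method that is an instance of this concept.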
4 Discussion and Future Research
This paper presents an architecture for an Intelligent Tutoring System in Instructional Design (ID). Such an architecture addresses declarative ID knowledge, which can be directly manipulated by ID experts as this knowledge is represented explicitly using ontologies. Automated reasoning fulfills the requirements stated in Section 1 (i.e. assist users in the selection of appropriate teaching methods and highlight errors in the design of a training). The requirement to instruct a user about particular teaching methods can be achieved by attaching instructional information to each TM. Our framework does not commit to any particular ID theory. In principle, it can be applied to any approach as long as this allows the construction of an ontology. However, as the framework does not include a representation of procedural knowledge, the user interface (most likely) needs to be developed from scratch. Nevertheless, we anticipate that changes of an ID process will rarely occur. As the ConcepTool ontology editor is reasonably complete, we aim to develop ontologies and implement the User Interface part of this architecture. This
will investigate the validity of the assumption that a user who is not an expert in knowledge modelling can represent ontologies without the help of a knowledge engineer.
Acknowledgements The work is partially supported by the EPSRC under grants GR/R10127/01 and GR/N15764.
References [1] C., R.: Instructional design theories and models: A new paradigm of instructional theory. Volume 2. Lawrence Erlbaum Associates (1999) 898 [2] Van Merriˆenboer, J.: Training complex cognitive skills: A four-component instructional design model for technical training. Educational Technology Publications (1997) ISBN: 0877782989. 898 [3] Murray, T.: Authoring intelligent tutoring systems: An analysis of the state of the art. International Journal of Artificial Intelligence in Education 10 (1999) 98 – 129 899 [4] Kasowitz, A.: Tools for automating instructional design. Technical Report EDOIR-98-1, Education Resources Information Center on Information Technology (1998) http://ericit.org/digests/EDO-IR-1998-01.shtml. 899 [5] Mizoguchi, R., Bourdeau, J.: Using ontological engineering to overcome common ai-ed problems. International Journal of Artificial Intelligence in Education 11 (2000) 107–121 899 [6] Gaede, B., H., S.: A generator and a meta specification language for courseware. In: Proc. of the World Conf. on Educational Multimedia, Hypermedia and Telecommunications 2001(1). (2001) 533–540 899 [7] Pawlowski, J.: Reusable models of pedagogical concepts - a framework for pedagogical and content design. In: Proc. of ED-MEDIA 2002, World Conference on Educational Multimedia, Hypermedia and Telecommunications. (2002) 899 [8] M., R., Wiley, D.: A non-authoritative educational metadata ontology for filtering and recommending learning objects. Interactive Learning Environments 9 (2001) 255–271 899 [9] Leidig, T.: L3 - towards an open learning environment. ACM Journal of Educational Resources in Computing 1 (2001) 899 [10] Allert, H., et al.: Instructional models and scenarios for an open learning repository - instructional design and metadata. In: Proc. of E-Learn 2002: World Conference on E-Learning in Corporate, Government, Healthcare, & Higher Education. (2002) 899 [11] Uschold, M.: Knowledge level modelling: concepts and terminology. The Knowledge Engineering Review 13 (1998) 5–29 900 [12] Nicholson, A.E., Fraser, K.M.: Methodologies for teaching new programming languages: a case study teaching lisp. In: Proc. of the 2nd Australasian conference on Computer science education. (1997) 901 [13] Compatangelo, E., Meisel, H.: K—ShaRe: an architecture for sharing heterogeneous conceptualisations. In: Proc. of the 6th Intl. Conf. on Knowledge-Based Intelligent Information & Engineering Systems (KES’2002), IOS Press (2002) 1439–1443 902
[14] Horrocks, I.: FaCT and iFaCT. In: Proc. of the Intl. Workshop on Description Logics (DL’99). (1999) 133–135 902
Javy: Virtual Environment for Case-Based Teaching of Java Virtual Machine Pedro Pablo Gómez-Martín, Marco Antonio Gómez-Martín, and Pedro A. González-Calero Dep. Sistemas Informáticos y Programación, Universidad Complutense de Madrid, Spain {pedrop,marcoa,pedro}@sip.ucm.es
Abstract. Knowledge-based learning environments have become an ideal solution for providing effective learning. These systems base their teaching techniques upon constructivist problem solving in order to supply an engaging learning environment. Students are presented with increasingly challenging exercises, selected from a set of different scenarios depending on their knowledge. This paper presents a new system of this kind, which aims to teach Java compilation with the help of a metaphorical virtual environment that simulates the Java Virtual Machine.
1 Introduction
Knowledge-based learning environments are considered to be a good solution to instruct students in those domains where “learning by doing” is the best methodology of teaching. Students are faced to more and more complex problems, tailored to their needs depending on their increasing knowledge. Nowadays, improvement in computer capacity let us implement multimedia systems and show real-time graphics to the users. New educational software has started to use virtual environments so users are immersed in synthetic microworlds where they can experiment and check their knowledge ([10]). A supplementary enhance is to incorporate animated pedagogical agents who inhabit in these virtual environments. An animated pedagogical agent is a lifelike on-screen character who provides contextualized advice and feedback throughout a learning episode ([6]). These agents track students’ actions and supply guidance and help in order to promote learning when misconceptions are detected. Students are allowed to ask for help at any time, and the agent will offer specific suggestions and concept explanations according to the current exercise. In order to supply user with guidance, agents need to possess a good comprehension of the taught domain and the current scenario, and to hold information about the user knowledge (usually stored in a user profile). These systems also require a pedagogical module that determines what aspects of the domain knowledge the student should practice according to her strengths
Supported by the Spanish Committee of Science & Technology (TIC2002-01961)
and weaknesses. With this purpose, these programs need a big collection of scenarios where different exercises are kept indexed by the concepts they assess. The user profile keeps information about which of such concepts the student already understands, and which she doesn’t know. The pedagogical module uses this data and the indexes in the set of exercises to choose the most suitable scenario ([9]). Animated pedagogical agents are an active research topic and several of them have been implemented, for example Design-A-Plant, Internet Advisor, CPU-City and Steve ([7], [4], [1], [2]).
2 General Description
Our aim is to implement an animated pedagogical agent to teach the Java Virtual Machine (JVM) structure and Java language compilation. Users are supposed to know imperative programming, particularly Java programming, and they will be able to improve their knowledge of object oriented programming and the compilation process. Our program uses a metaphorical 3D virtual environment which simulates the JVM ([5]). The user is symbolized as an avatar which is used to interact with the objects in the virtual world. This avatar owns an inventory where it can keep objects. The virtual world is also inhabited by Javy (JavA taught VirtuallY), an animated pedagogical agent that is able to perform two main functions: (1) monitor the student whilst she is solving a problem with the purpose of detecting the errors she makes in order to give her advice or guidance, and (2) resolve by himself the exercise giving explanation at each step. In the virtual environment there are different objects that the user can manipulate using four basic primitives, borrowed from some entertainment programs: “look at”, “take”, “use” and “use with”. Some of the objects have a human look, although they can’t be considered intelligent agents. Their simple mission is to allow the user to interact with some of the metaphorical JVM structures. Each exercise is designed by tutors using an authoring tool, and it consists in Java source code and its associated compiled code (Java byte codes). The student has to use the different provided structures to execute the supplied Java byte codes. Compilation phase learning is considered a side effect, since the system also shows the Java source code, where the sentence which has generated the compiled instructions been executed is highlighted. The user is supposed to pay attention to both of the codes, and to try to understand how the compilation process is performed. Of course, Javy can also give information about this procedure. A second information source is available in the “look at” user interface operation. For example, if the student ordered “look at the operand stack” her own avatar would answer “Operand stack stores temporal results needed to calculate the current expression. We could ask Javy to get more information.” Auxiliary characters behaviour and phrases said by the avatar are relatively hard-coded. However, Javy is quite more complex, because not only does he give
advice and help, but also he can replace the user and finish the current exercise in order to show the student the correct way. Consequently, Javy actions are based in a rather big amount of knowledge about the domain, and a more detailed description is needed.
3 Conceptual Hierarchy
Our system stores the domain concepts the students have to learn using a conceptual hierarchy. At first glance, the application uses them to perform two tasks: – User model: the system keeps information about the knowledge the user already has about the specific domain being taught. The user model stores the concepts we can assume she knows, and those that she is learning. – Exercise database: pedagogical module uses the user model to retrieve the next scenario to be presented to the student. It uses an exercise database which is indexed using the concepts it tests. The system should try to select exercises which exhibit a degree of complexity that is not too great for the learner but is sufficiently complex to be challenging ([3]). Conceptual hierarchy is also used by Javy to generate explanations. As for our system, we have identified five kinds of concepts with different abstraction levels: – Compilation concepts: they are related to the high-level structures that the student should learn to compile (e.g. arithmetic expressions) – Virtual machine instructions: this group includes the JVM instructions, sorted into a hierarchy. – Microinstructions: they are the different primitive actions that both the student and Javy can perform in order to change de JVM state (e.g. pushing a value in the operand stack). Each concept representing a virtual machine instruction in the previous group is related to all the microinstructions it needs to execute the instruction. – Virtual machine structures: each part in the JVM has a concept in the hierarchy (like “operand stack”, “frame stack” and “heap”). – User interface operations: each microinstruction is executed in the virtual world using a metaphorical element, and interacting with it through one of the three operations (“take”, “use” or “use with”). Concepts in this group symbolize the relation between them. For example, the concept “use operand stack” (with another object) is related to the microinstruction “push value”. User model and indexes of the exercices database only use the first two groups of concepts (compilation and virtual machine instructions concepts). Each concept has a name and a description, which is used by Javy when the user asks for information about such concept. As students are clearly irritated by the repetition of the same explanation ([7]), each concept has also a short description, used when the first one has been presented recently to the user.
The conceptual hierarchy is built by the authors or tutors who provide the specific knowledge to the system. An authoring tool has been developed which allows the tutor to define concepts and to set their properties and relations.
4 Representation of the JVM Instruction Set
Students have to execute the JVM instructions by themselves. The system, meanwhile, monitors them and checks that they perform each step correctly. In addition, Javy also has to be able to execute each instruction. At first sight one might think that the user would only learn how the JVM works. However, we assume that the user will also learn the compilation of Java programs as a side effect, by comparing the source and compiled code provided by the system and by using Javy's explanations.
The system has to store information about the steps (microinstructions) that have to be executed to complete each JVM instruction. The conceptual relations between JVM instructions and microinstructions are not enough, because no ordering information is kept. Therefore, the system also stores an execution graph for each instruction. Each graph is related to the concept of the instruction it represents. Graph nodes are states, and microinstructions are stored on the edges. In addition, each edge is related to the corresponding microinstruction concept in the concept hierarchy. When a primitive action is executed, the state changes. Some of these microinstructions have arguments or parameters, and the avatar uses objects in the inventory to supply them. One example is the primitive action of pushing a constant value onto the operand stack: the user obtained the value in a previous primitive action, such as fetching the JVM instruction.
Graph edges also store explanations that show why their primitive actions are needed to complete the execution of the JVM instruction. This is useful because the description related to the microinstruction concept explains what the action does, but not why it is important in a specific situation. For example, the microinstruction concept "push a value into the operand stack" has a description such as "The selected value in the inventory is stacked on the operand stack. If no more values were piled up, the next extracted value would be the value just pushed". On the other hand, this primitive action is used in the execution of the iload instruction; the explanation related to it in the graph edge could be "Once we have got the local variable, it must be loaded onto the stack".
Graphs also store wrong paths, and associate with them an explanation indicating why they are incorrect. This is used to detect misconceptions. When the user tries to execute one of these primitive actions, Javy stops her and explains why she is wrong. Since Javy does not allow the user to execute an invalid operation, the graph only has to store "one level of error", because in our first approach we never let her reach a wrong execution state. In case the student executes a microinstruction that has not been considered in the graph, Javy also stops her, but he is not able to give her a suitable explanation. When this occurs, the system registers which wrong operation was executed and in which state
the user was. Later, the tutor who built the graph can analyze these data and expand the graph with the most frequently repeated wrong paths, adding explanations about why they are incorrect.
When Javy is performing a task, he executes the microinstructions of a correct path and uses their explanations. Therefore, our knowledge representation is valid both for executing and for monitoring cases. These graphs are also built by the authors who provide the specific domain knowledge. A new authoring tool is provided which allows users to create the graphs. As might be expected, the tool checks the coherence between the conceptual hierarchy and the graphs being created.
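A minimal Java sketch of one possible shape for such an execution graph is given below: states connected by edges that carry a microinstruction name, an explanation of why the step matters for this instruction, and a flag for recorded wrong paths. All identifiers and texts are assumptions for illustration, not the system's real classes.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: execution graph of one JVM instruction.
class ExecState {
    final String label;
    final List<Edge> out = new ArrayList<>();
    ExecState(String label) { this.label = label; }
}

class Edge {
    final String microinstruction;   // concept name in the hierarchy, e.g. "push value"
    final String explanation;        // why this step matters for this instruction
    final boolean wrongPath;         // true for recorded misconceptions
    final ExecState target;
    Edge(String micro, String expl, boolean wrong, ExecState target) {
        this.microinstruction = micro; this.explanation = expl;
        this.wrongPath = wrong; this.target = target;
    }
}

public class InstructionGraphDemo {
    public static void main(String[] args) {
        ExecState start = new ExecState("start");
        ExecState gotVar = new ExecState("local variable in inventory");
        ExecState done = new ExecState("value on operand stack");
        start.out.add(new Edge("get local variable", "First we fetch the variable's value.", false, gotVar));
        gotVar.out.add(new Edge("push value",
                "Once we have got the local variable, it must be loaded onto the stack.", false, done));
        gotVar.out.add(new Edge("pop value",
                "Popping here would discard the value we just fetched.", true, gotVar));

        // Monitoring: check whether a user action matches an outgoing edge of the current state.
        String userAction = "push value";
        for (Edge e : gotVar.out) {
            if (e.microinstruction.equals(userAction)) {
                System.out.println(e.wrongPath ? "Javy stops the user: " + e.explanation
                                               : "Correct step: " + e.explanation);
            }
        }
    }
}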
5 Scenarios
The knowledge added to the system so far is general, and is mostly focused on JVM structures and instruction execution. Using the application consists of Javy or the student executing a Java program using its compiled code. These exercises (frequently called scenarios) are provided by a tutor or course author using a third authoring tool that must simplify exercise construction, because the scenario-based (case-based) instruction paradigm on which Javy is based works better if a large number of scenarios have been defined ([8]).
The tutor creates the Java code of the scenario, which can include one or several Java classes. Each exercise has a description and a list of all the concepts that it tests. These concepts are used for case retrieval when a student starts a new exercise.
The authoring tool compiles the Java code given by the tutor. One of the resulting class files will contain the main method. Each JVM instruction is automatically related to a portion of the source code, and during the resolution of the exercise, when the avatar is performing this JVM instruction, that source code region is highlighted. The course author can modify these relations in order to correct mistakes made by the authoring tool when it creates them automatically.
The explanations in the concept hierarchy and the graphs are used during instruction execution. However, our aim is for the student to learn, as a side effect, how Java source code is compiled. Javy will never check whether the user understands these concepts, but he can provide information about the process. In order for Javy to be able to give explanations about the compiled code, the author divides the Java byte codes into different levels and constructs a tree (usually a subset of the syntax tree). For example, the execution of a while statement can be decomposed into the evaluation of the boolean expression and the loop instructions. The author also gives an explanation of each part and of why each phase is important. That description refers to the relationship between source code and compiled code. When the avatar is carrying out a scenario, the current frame and program counter are related to a region in the Java source code, which is part of a greater region, and so on. If the student asks Javy for an explanation, he uses the text given by the author for the current region. When the user asks again, Javy uses the parent explanation.
In addition, each region can be linked to a concept in the hierarchy described above. Therefore, Javy is able to use the explanations stored in these concepts to give more details.
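The following sketch illustrates, under our own assumptions about names and bytecode index ranges, how such a region tree for the while example might be represented and how repeated "why?" questions climb towards parent explanations.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: a region of compiled code linked to a source-code fragment,
// an author-written explanation, and optional child regions (a subset of the syntax tree).
class Region {
    final String sourceFragment;   // highlighted Java source, e.g. "while (i < 10)"
    final int firstBytecodeIndex;  // bytecode range covered by this region (assumed indices)
    final int lastBytecodeIndex;
    final String explanation;      // why this phase exists, given by the course author
    final Region parent;
    final List<Region> children = new ArrayList<>();

    Region(String src, int first, int last, String expl, Region parent) {
        this.sourceFragment = src; this.firstBytecodeIndex = first;
        this.lastBytecodeIndex = last; this.explanation = expl; this.parent = parent;
        if (parent != null) parent.children.add(this);
    }
}

public class RegionTreeDemo {
    public static void main(String[] args) {
        Region whileLoop = new Region("while (i < 10) { i++; }", 0, 12,
                "A while loop first evaluates its boolean condition and then runs the body.", null);
        new Region("i < 10", 0, 5,
                "The condition is compiled as a comparison that may jump past the body.", whileLoop);
        Region body = new Region("i++;", 6, 11,
                "The body is executed and then control jumps back to the condition.", whileLoop);

        // Asking "why?" repeatedly climbs towards the parent region's explanation.
        Region current = body;
        while (current != null) {
            System.out.println(current.explanation);
            current = current.parent;
        }
    }
}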
6 Architecture
The system has three elements: the virtual world, the agent and the user interface. The virtual environment represents the Java Virtual Machine where Javy and the student perform their actions. We have implemented a simplified JVM with all the capabilities we try to teach.
The virtual world contains static and dynamic objects. The former are stationary objects like terrain, fences and so on, and they are rendered directly by the user interface using OpenGL. Dynamic objects are controlled by the Object Interface Manager (OIM) and can be objects representing a JVM structure (a loaded class appears as a building) or objects which the user or the agent interacts with. In some cases an entity belongs to both categories; for example, the operand stack in the virtual world is part of the JVM and the user interacts with it in order to push and pop values. When the student's avatar is close to one of the interactive elements, the user interface shows its name so the student can select it and perform the desired operation ("look at", "take", "use" or "use with").
Javy's management is divided into three main modules: cognition, perception and movement. The perception layer receives changes in the virtual environment and informs the cognition layer about them. This layer interprets the new state and decides to perform some action, which is communicated to the movement layer in order to move Javy and execute it.
The cognition layer interprets the input received from the perception layer, constructs plans and executes them, sending primitive actions to the movement layer. When Javy is demonstrating a task, these plans execute the correct microinstructions. When he is monitoring, the plans check the user's actions and warn her in case of error.
The cognition layer uses the instruction graphs and the trees of each scenario. These trees can be seen as trees of tasks and subtasks. However, Javy only uses them to explain what he is doing when he is performing a task or the student asks for help. When Javy is executing an exercise (each JVM instruction), the cognition layer knows the leaf where he is. When the student asks for an explanation of an action ("why?"), Javy uses the text stored in that leaf. Once he has explained it, he will use the parent's explanation if the pupil asks again, and so on.
On the other hand, low-level operation (the execution of microinstructions) is managed by the graphs. As described above, the graph's edges contain the primitive actions the avatar must perform to complete each JVM instruction. When Javy is executing the task, the cognition layer chooses the next action using the graph, orders the movement layer to perform the microinstruction, and waits until the perception layer notifies it of its completion. When the perception layer reports that a task has been completed, the layer uses the graph, changes
the state by following the transition which contains the action, and decides on the next primitive action.
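The sketch below condenses this control loop into a few lines of Java: choose the next microinstruction from a (here greatly simplified) graph, hand it to the movement layer, and advance the state when the perception layer reports completion. The maps and strings stand in for the real graph and layers and are assumptions made only for illustration.

import java.util.Map;

// Illustrative sketch only: a simplified cognition-layer loop for demonstrating an instruction.
public class CognitionLoopDemo {
    // next microinstruction per state (a stand-in for the execution graph)
    static final Map<String, String> NEXT_ACTION = Map.of(
            "start", "get local variable",
            "variable in inventory", "push value");
    static final Map<String, String> NEXT_STATE = Map.of(
            "get local variable", "variable in inventory",
            "push value", "done");

    public static void main(String[] args) {
        String state = "start";
        while (!state.equals("done")) {
            String action = NEXT_ACTION.get(state);   // choose next action from the graph
            System.out.println("movement layer: perform '" + action + "'");
            // ...the real system would now wait for the perception layer's notification...
            state = NEXT_STATE.get(action);           // follow the transition
            System.out.println("perception layer: completed, new state = " + state);
        }
    }
}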
7 Conclusions and Future Work
In this paper we have presented a new animated pedagogical agent, Javy, who teaches the structure of the Java Virtual Machine and the compilation process of the Java language using a constructivist learning environment.
Although the system still needs a lot of implementation work, we foresee some further improvements. Currently, learning about the compilation process is a side effect, and the system does not evaluate it. In this sense, one more feature could be added to the application: instead of giving the user both the Java source code and its Java byte codes, future exercises could consist only of the Java source code, and the user would have to compile this code 'on the fly' and execute it in the metaphorical virtual machine. Javy would detect the user's plans while she is executing primitive actions, in order to find out which JVM instruction she is trying to execute, and prevent her from doing it when she is wrong.
Currently, all the explanations given by the agent are fixed by the course author who builds the domain knowledge or the scenarios. A language generation module could be added.
Finally, we have to study whether our system is useful. In order to check its benefits, we must analyze whether students take advantage of the software and whether they are engaged by it.
References
[1] W. Bares, L. Zettlemoyer, and J. C. Lester. Habitable 3D learning environments for situated learning. In Proceedings of the Fourth International Conference on Intelligent Tutoring Systems, pages 76–85, San Antonio, TX, August 1998.
[2] W. L. Johnson, J. Rickel, R. Stiles, and A. Munro. Integrating pedagogical agents into virtual environments. Presence: Teleoperators & Virtual Environments, 7(6):523–546, December 1998.
[3] J. C. Lester, P. J. FitzGerald, and B. A. Stone. The pedagogical design studio: Exploiting artifact-based task models for constructivist learning. In Proceedings of the Third International Conference on Intelligent User Interfaces (IUI'97), pages 155–162, Orlando, FL, January 1997.
[4] J. C. Lester, J. L. Voerman, S. G. Towns, and C. B. Callaway. Cosmo: A life-like animated pedagogical agent with deictic believability. In Working Notes of the IJCAI '97 Workshop on Animated Interface Agents: Making Them Intelligent, pages 61–69, Nagoya, Japan, August 1997.
[5] T. Lindholm and F. Yellin. The Java Virtual Machine Specification. 2nd Edition. Addison-Wesley, 1999.
[6] R. Moreno, R. E. Mayer, and J. C. Lester. Life-like pedagogical agents in constructivist multimedia environments: Cognitive consequences of their interaction. In Conference Proceedings of the World Conference on Educational Multimedia, Hypermedia, and Telecommunications (ED-MEDIA), pages 741–746, Montreal, Canada, June 2000.
[7] B. Stone and J. C. Lester. Dynamically sequencing an animated pedagogical agent. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 424–431, August 1996.
[8] D. Stottler, N. Harmon, and P. Michalak. Transitioning an ITS developed for schoolhouse use to the fleet: TAO ITS, a case study. In Proceedings of the Industry/Interservice, Training, Simulation & Education Conference (I/ITSEC 2001), August 2001.
[9] R. H. Stottler. Tactical Action Officer intelligent tutoring system (TAO ITS). In Proceedings of the Industry/Interservice, Training, Simulation & Education Conference (I/ITSEC 2000), November 2000.
[10] J. F. Trindade, C. Fiolhais, V. Gil, and J. C. Teixeira. Virtual environment of water molecules for learning and teaching science. In Proceedings of Computer Graphics and Visualization Education '99 (GVE'99), pages 153–158, July 1999.
Self-Organization Leads to Hierarchical Modularity in an Internet Community
Jennifer Hallinan
Institute for Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia 4072
[email protected]
Abstract. Many naturally-occurring networks share topological characteristics such as scale-free connectivity and a modular organization. It has recently been suggested that a hierarchically modular organization may be another such ubiquitous characteristic. In this paper we introduce a coherence metric for the quantification of structural modularity, and use this metric to demonstrate that a self-organized social network derived from Internet Relay Chat (IRC) channel interactions exhibits measurable hierarchical modularity, reflecting an underlying hierarchical neighbourhood structure in the social network.
1 Introduction
The social community existing by virtue of the Internet is unique in that it is unconstrained by geography. Social interactions may occur between individuals regardless of gender, race, most forms of physical handicap, or geographical location. While some forms of interaction are subject to external control, in that a moderator may decide what topics will be discussed and who will be allowed to participate in the discussion, many are subject to little or no authority. Such networks self-organize in a consistent, structured manner.
Internet Relay Chat (IRC) is a popular form of online communication. Individuals log onto one of several server networks and join one or more channels for conversation. Channels may be formed and dissolved by the participants at will, making the IRC network a dynamic, self-organizing system. Interactions between individuals occur within channels. However, one individual may join more than one channel at once. The system can therefore be conceptualized as a network in which channels are nodes, and an edge between a pair of nodes represents the presence of at least one user on both channels simultaneously. The IRC network is extremely large; there are dozens of server networks, each of which can support 100,000 or more users.
In the light of previous research into the characteristics of self-organized social networks, the IRC network would be expected to exhibit a scale-free pattern of
connectivity [1], [2]. Social networks have also been shown to have a locally modular structure, in which clusters of relatively tightly-connected nodes exist, having fewer links to the rest of the network [3], [4], [5], and it has been suggested that these networks are potential candidates for hierarchical modularity [6], [7].
2 Methods
2.1 Data Collection
Network data was collected using a script for the IRC client program mIRC. The script starts in a single channel and identifies which other channels participants are currently on. It then visits each of these channels in turn and repeats the process. This data was then converted into a set of channel-channel interaction pairs. Because of the dynamic nature of the network, which changes over time as individuals join and leave different channels, no attempt was made to collect data for the entire network. Instead, the network consists of all the channels for which data could be collected in one hour, starting from a single channel and spidering out so that the data collected comprised a single connected component.
2.2 Modularity Detection
Modularity in the network was detected using the iterated vector diffusion algorithm described by [8]. The algorithm operates on a graph. It is initialized by assigning to each vertex in the graph a binary vector of length n, initialized to
v_{i,j} = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases} \qquad (1)
where i is an index into the vector and j is the unique number assigned to a given vertex. The algorithm proceeds iteratively. At each iteration an edge is selected at random and the vectors associated with each of its vertices are moved towards each other by a small amount, δ. This vector diffusion process is iterated until a stopping criterion is met. We chose to compute a maximum number of iterations as the stopping criterion. This number, n, is dependent upon both the number of connections in the network, c, and the size of δ, such that
n = c \cdot \frac{\alpha}{\delta} \qquad (2)
where α is the average amount by which a vector is changed in the course of the run. A value for α of 0.1 was selected empirically in trials on artificially generated networks. The final set of vectors is then subjected to hierarchical clustering using the algorithm implemented by [9].
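For concreteness, a small Java sketch of this diffusion step is given below. The toy graph, the random seed and the interpretation of δ as a fractional step towards the neighbouring vector are our own assumptions; the algorithm itself is defined in [8].

import java.util.Random;

// Illustrative sketch only: iterated vector diffusion. Each vertex starts with an identity
// vector; at each iteration the endpoints of a randomly chosen edge move their vectors
// towards each other by a small amount delta.
public class VectorDiffusionDemo {
    public static void main(String[] args) {
        int[][] edges = {{0, 1}, {1, 2}, {2, 0}, {3, 4}, {4, 5}, {5, 3}, {2, 3}};
        int numVertices = 6;
        double delta = 0.01;
        double alpha = 0.1;                                     // empirically chosen in the paper
        int iterations = (int) (edges.length * alpha / delta);  // n = c * alpha / delta

        double[][] v = new double[numVertices][numVertices];
        for (int i = 0; i < numVertices; i++) v[i][i] = 1.0;    // v[i][j] = 1 iff i == j

        Random rng = new Random(42);
        for (int t = 0; t < iterations; t++) {
            int[] e = edges[rng.nextInt(edges.length)];
            for (int j = 0; j < numVertices; j++) {
                double diff = v[e[1]][j] - v[e[0]][j];
                v[e[0]][j] += delta * diff;                     // move each endpoint's vector
                v[e[1]][j] -= delta * diff;                     // towards the other by delta
            }
        }
        // The resulting vectors would then be fed to a hierarchical clustering algorithm [9].
        System.out.printf("vector of vertex 0 after diffusion: [%.2f %.2f %.2f ...]%n",
                v[0][0], v[0][1], v[0][2]);
    }
}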
Fig. 1. Thresholding a cluster tree. a) Tree thresholded at parent level 2 produces two clusters b) The same tree thresholded at parent level 3 has four clusters
2.3 Cluster Thresholding
The output of the cluster algorithm is a binary tree, with a single root node giving rise to two child nodes, each of which gives rise to two child nodes of its own, and so on. The tree can therefore be thresholded at various levels (two parents, four parents, eight parents, etc.; see Figure 1) and the modularity of the network at each level can be examined.
2.4 Hierarchical Modularity Detection
The binary hierarchical classification tree produced by the cluster detection algorithm was thresholded at every possible decision level, and the average coherence of the modules detected at each level measured, using a coherence metric χ:
\chi = \frac{2 k_i}{n(n-1)} - \frac{1}{n} \sum_{j=1}^{n} \frac{k_{ji}}{k_{jo} + k_{ji}} \qquad (3)
where ki is the total number of edges between vertices in the module, n is the number of vertices in the network, kji is the number of edges between vertex j and other vertices within the module, and kjo is the number of edges between vertex j and other vertices outside the module. The first term in this equation is simply the proportion of the possible links among the nodes comprising the module which actually exist: a measure of the connectivity within the module. The second term is the average proportion of edges per node which are internal to the module. A highly connected node with few external edges will therefore have a lower value of χ than a highly connected node with many external edges. χ will have a value in the range [-1, +1].
At each level in the hierarchy the number of modules and the average modular coherence of the network were computed. Average coherence was then plotted against threshold level to produce a "coherence profile" summarizing the hierarchical modularity of the network.
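A small Java sketch of this computation is shown below. It follows the reconstruction of Eq. (3) given above, takes n to be the number of vertices in the module, and uses an adjacency-matrix representation; all of these choices are assumptions made for illustration.

import java.util.Set;

// Illustrative sketch only: coherence metric for one module of a graph.
public class CoherenceDemo {
    // adjacency[i][j] == true when an edge exists between vertices i and j
    static double coherence(boolean[][] adjacency, Set<Integer> module) {
        int n = module.size();
        if (n < 2) return 0.0;
        int internalEdges = 0;      // k_i, counted from both endpoints for now
        double sum = 0.0;
        for (int j : module) {
            int kji = 0, kjo = 0;   // internal and external edges of vertex j
            for (int other = 0; other < adjacency.length; other++) {
                if (!adjacency[j][other]) continue;
                if (module.contains(other)) kji++; else kjo++;
            }
            internalEdges += kji;
            if (kji + kjo > 0) sum += (double) kji / (kji + kjo);
        }
        internalEdges /= 2;         // each internal edge was counted twice
        return 2.0 * internalEdges / (n * (n - 1)) - sum / n;
    }

    public static void main(String[] args) {
        boolean[][] adj = new boolean[4][4];
        int[][] edges = {{0, 1}, {1, 2}, {2, 0}, {2, 3}};
        for (int[] e : edges) { adj[e[0]][e[1]] = true; adj[e[1]][e[0]] = true; }
        System.out.printf("chi = %.3f%n", coherence(adj, Set.of(0, 1, 2)));
    }
}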
Fig. 2. The IRC network. Diagram produced using the Pajek software package [10]
3 Results
3.1 Connectivity Distribution
The IRC network consists of 1,955 nodes (channels) and 1,416 edges, giving an average connectivity of 1.07 (Figure 2). Many self-organized networks have been found to have a scale-free connectivity distribution. In the IRC network connectivity ranged from 1 to 211 edges per node, with a highly non-normal distribution (Figure 3). Although the connectivity distribution of the network is heavily skewed, the log-log plot in Figure 3b indicates that it is not completely scale-free. There is a strongly linear area of the plot, but both tails of the distribution depart from linearity. This deviation from the expected scale-free distribution is probably due to practical constraints on the formation of the network. It can be seen from Figure 3b that nodes with low connectivity are under-represented in the network, while nodes of high connectivity are somewhat over-represented.
Fig. 3. Connectivity distribution of the IRC network. a) Connectivity histogram. b) Connectivity plotted on a log-log scale
Under-representation of low-connectivity nodes indicates that very few channels share participants with only a few other channels. This is not unexpected, since a channel may contain up to several hundred participants; the likelihood that most or all of those participants are only on that channel, or on an identical set of channels, is slim. The spread of connectivity values at the high end of the log-log plot indicates that there is a relatively wide range of connectivities with approximately equal probabilities of occurrence. There is a limit to the number of channels to which a user can pay attention, a fact which imposes an artificial limit upon the upper end of the distribution.
3.2 Hierarchical Modularity
The coherence profile of the IRC network is shown in Figure 4a, with the profile for a randomly connected network with the same number of nodes and links shown in Figure 4b. The IRC profile shows evidence of strong modular coherence over almost the entire range of thresholds in the network.
The hierarchical clustering algorithm used will identify "modules" at every level of the hierarchy, but without supporting evidence it cannot be assumed that these modules represent real structural features in the network. The coherence metric provides this evidence. In the IRC network it can be seen that at the highest threshold levels, where the algorithm is identifying a small number of large modules, the coherence of the "modules" found is actually negative; there are more external than internal edges associated with the nodes of the modules. By threshold level 14 a peak in modular coherence is achieved. At this point, the average size of the modules is about 13 nodes (data not shown). As thresholding continues down the cluster tree, detecting more and smaller modules, the modular coherence declines, but remains positive across the entire range of thresholds. It appears that the IRC network has a hierarchically modular structure, with a characteristic module size of about 14, very close to the mode of the connectivity distribution apparent in Figure 3a. The random network, in comparison, has negative coherence over much of its range, with low modular coherence apparent at higher thresholds.
Fig. 4. a) Coherence profile of the IRC network. b) Coherence profile of an equivalent randomly connected network
4 Conclusions
The analysis of large, naturally-occurring networks is a topic of considerable interest to researchers in fields as diverse as sociology, economics, biology and information technology. Investigation of networks in all of these areas has revealed that, despite their dissimilar origins, they tend to have many characteristics in common. A recent candidate for the status of ubiquitous network characteristic is a hierarchically modular topological organization.
We present an algorithm for the quantification of hierarchical modularity, and use this algorithm to demonstrate that a self-organized Internet community, part of the IRC network, is indeed organized in this manner, as has been hypothesized but not, as yet, demonstrated.
Progress in the analysis of network topology requires the development of algorithms which can translate concepts such as "module: a subset of nodes whose members are more tightly connected to each other than they are to the rest of the network" into a numeric measure such as the coherence metric we suggest here. Such metrics permit the objective analysis and comparison of the characteristics of different networks. In this case, application of the algorithms to the IRC network detects significant hierarchical modularity, providing supporting evidence for the contention that this topology may be characteristic of naturally-occurring networks.
References
[1] Albert, R., Jeong, H. & Barabasi, A.-L. Error and attack tolerance of complex networks. Nature 406, 378-382 (2000).
[2] Huberman, B. A. & Adamic, L. A. Internet: Growth dynamics of the world-wide web. Nature 401(6749), 131 (1999).
[3] Girvan, M. & Newman, M. E. J. Community structure in social and biological networks. Santa Fe Institute Working Paper 01-12-077 (2001).
[4] Flake, G. W., Lawrence, S., Giles, C. L. & Coetzee, F. M. Self-organization and identification of web communities. IEEE Computer 35(3), 66-71 (2002).
[5] Vasquez, A., Pastor-Satorras, R. & Vespignani, A. Physics Review E 65, 066130 (2002).
[6] Ravasz, E., Somera, A. L., Oltvai, Z. N. & Barabasi, A.-L. Hierarchical organization of modularity in metabolic networks. Science 297, 1551-1555 (2002).
[7] Ravasz, E. & Barabasi, A.-L. Hierarchical organization in complex networks. LANL E-print Archive (2002).
[8] Hallinan, J. & Smith, G. Iterative vector diffusion for the detection of modularity in large networks. InterJournal Complex Systems B, article 584 (2002).
[9] Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95, 14863-14868 (1998).
[10] Batagelj, V. & Mrvar, A. Pajek - Program for Large Network Analysis. Connections 21, 47-57 (1998).
Rule-Driven Mobile Intelligent Agents for Real-Time Configuration of IP Networks
Kun Yang 1,2, Alex Galis 1, Xin Guo 1, Dayou Liu 3
1 University College London, Department of Electronic and Electrical Engineering, Torrington Place, London, WC1E 7JE, UK, {kyang,agalis,xguo}@ee.ucl.ac.uk
2 University of Essex, Department of Electronic Systems Engineering, Wivenhoe Park, Colchester, Essex, CO4 3SQ, UK
3 Jilin University, School of Computer Science & Technology, Changchun, 130012, China
Abstract. Even though the intelligent agent has proven itself to be a promising branch of artificial intelligence (AI), its mobility has not yet been paid enough attention to match the pervasive trend of networks. This paper proposes to inject intelligence into the mobile agents of the current literature by introducing rule-driven mobile agents, so as to retain both the intelligence and the mobility of current agents. In particular, this methodology is fully exemplified in the context of real-time IP network configuration through an intelligent-mobile-agent-based network management architecture, a policy specification language and a policy information model. A case study on inter-domain IP VPN configuration demonstrates the design and implementation of this management system, based on the test-bed developed in the context of the European Union IST project CONTEXT.
1 Background and Rationale
After years of recession, Artificial Intelligence (AI) regained its vitality partly thanks to the inception of the Intelligent Agent (IA). Agents were even highlighted as another approach to AI by S. Russell et al. [1]. An intelligent agent is usually a kind of software with autonomous, intelligent and social capabilities. Intelligent agents and their related areas have been intensively researched over the last decades, and enormous achievements covering a wide range of research fields are available in the literature. As computers and networks become more pervasive, the requirement for intelligent agents to be more (automatically) movable is becoming a necessity rather than an option. As an active branch of agent technology research, the mobile agent paradigm intends to bring increased performance and flexibility to distributed systems by promoting
"autonomous code migration" (mobile code moving between places) instead of traditional RPC (remote procedure call) such as CORBA, COPS (Common Open Policy Service) [2]. It turns out that more attention has been given to the mobility of mobile agent whereas the intelligence of mobile agent is seldom talked about in the mobile agent research community. Mobile agent technology is very successfully used in the network-related applications, especially network management, where its mobility feature is largely explored [3], but these mobile agents are usually lack of intelligence. We believe mobile agent is first of all an agent that has intelligence. This paper aims to explore the potential use of mobile agent to manage IP network in a more intelligent and automated way. For this purpose, mobile agents should contain certain extent of intelligence to reasonably respond to the possible change in destination elements and perform negotiation. This kind of intelligence should reflect the management strategy of administrator. A straightforward way for network administrator to give network management command or guide is to produce highlevel rules such as if sourceHost is within finance and time is between 9am and 5pm then useSecureTunnel. Then mobile agent can take this rule and enforce it automatically. By using rules to give network management command or strategy, a unique method of managing network can be guaranteed. The use of rule to management network is exactly what so called Policy-based Network Management (PBNM) [4] is about since policies usually appear as rules for network management. Here in this paper, we don't distinguish the difference between rule-based management and policy-based management and in many cases, the term “policybased” is more likely to be used. In order to put this idea into practice, a specific network management task is selected, i.e., IP VPN (Virtual Private Network) configuration. VPN enables an organization to interconnect its distributed sites over a public network with much lower price than the traditional leased-line private network. VPN is a key and typical network application operating in every big telecom operator as a main revenue source. But the lack of real-time and automated configuration and management capabilities of current IP VPN deployment makes the management of growing networks timeconsuming and error-prone. The integration of mobile agents and policy-based network management (as such making a mobile intelligent agent) claims to be a practical solution to this challenge. This paper first discusses an intelligent mobile agent based IP network management architecture with emphasis on IP VPN; then a detailed explanation with respect to policy specification language and IP VPN policy information model is presented. Finally, before the conclusion, a case study for inter-domain IP VPN configuration is demonstrated, aiming to exemplify the design and implementation of this intelligent MA-based IP network management system.
Fig. 1. Intelligent MA-based IP Network Management Architecture
2 An Intelligent MA-Based IP Network Management Architecture
2.1 Architecture Overview
An intelligent MA-based IP network management system architecture and its main components are depicted in Fig. 1; it is organized according to the PBNM concept suggested by the IETF Policy Working Group [5]. Please note that we use IP VPN configuration as an example, but this architecture is generic enough for any other IP network management task provided that the corresponding PDP and its information model are given. The PBNM system mainly includes four components: the policy management tool, the policy repository, the Policy Decision Point (PDP) and the Policy Enforcement Point (PEP).
The policy management tool serves as a policy creation environment in which the administrator defines, edits and views policies in an English-like declarative language. After validation, new or updated policies are translated into an object-oriented representation and stored in the policy repository, which stores policies in the form of an LDAP (Lightweight Directory Access Protocol) directory. Once the new or updated policy is stored, signalling information is sent to the corresponding PDP, which then retrieves the policy through its PDP Manager via the Policy Parser. After passing the Credential Check, the PDP Manager obtains the content of the retrieved policy, based on which it selects the corresponding PDP, in this case the IP VPN PDP. After rule-based reasoning on the retrieved policy, which may involve other related policies stored in the Policy Repository, the PDP decides the action(s) to be taken for the policy. Then the corresponding mobile agents, initiated via the Mobile Agent Initiator, carry the bytecode for the actions, move themselves to the PEP and
enforce the policy on the PEP. The automation of the whole procedure also depends on a proper policy information model that can translate the rule-based policies into element-level actions; this is discussed separately in the next section. Since there is plenty of work presenting rule-based reasoning in the knowledge engineering field, this paper prefers not to repeat it here. Please note that both PDPs and PEPs are in the form of mobile intelligent agents, and the intelligence is embedded inside the bytecode itself.
2.2 IP VPN Components
The IP VPN operational part can be regarded as a type of PDP, since it performs a subset of the policy management functionality. For ease of presentation, all the VPN functional components are placed into one single PDP box in Fig. 1. In the actual implementation they can be separated into different PDPs and coordinated by a VPN PDP manager. Our IP VPN implementation is based on FreeS/WAN IPsec [6], which is a Linux implementation of the IPsec (IP security) protocols. Since an IP VPN is built over the Internet, which is a shared public network with open transmission protocols, VPNs must include measures for packet encapsulation (tunneling), encryption and authentication, so as to prevent sensitive data from being tampered with by unauthorized third parties during transit. Three protocols are used: AH (Authentication Header) provides a packet-level authentication service; ESP (Encapsulating Security Payload) provides encryption plus authentication; and finally, IKE (Internet Key Exchange) negotiates connection parameters, including keys, for the other two. KLIPS (kernel IPsec) from FreeS/WAN implements AH, ESP and packet handling within the kernel [6]. More discussion is given to IKE issues, which are closely related to the policies delivered by the administrator via the policy management tool.
Key Management Component: encryption is usually the starting point of any VPN solution. The encryption algorithms are well known and widely available in cryptographic libraries. The following features need to be taken into consideration for the key management component: key generation, key length, key lifetime, and key exchange mechanism.
IKE Management: the IKE protocol was developed to manage these key exchanges. Using IPsec with IKE, a system can set up security associations (SAs) that include information on the algorithms for authenticating and encrypting data, the lifetime of the keys employed, the key lengths, etc.; this information is usually extracted from rule-based policies. Each pair of communicating computers will use a specific set of SAs to set up a VPN tunnel. The core of the IKE management is an IKE daemon that sits on the nodes for which SAs need to be negotiated; the IKE daemon is distributed on each node that is to be an endpoint of an IKE-negotiated SA. The IKE protocol sets up IPsec connections after negotiating appropriate parameters. This is done by exchanging packets on UDP port 500 between two gateways.
The ability to cohesively monitor all VPN devices is vitally important. It is essential to ensure that policies are being satisfied by determining the level of
performance and knowing what in the network is not working properly, if anything. The monitoring component drawn in the PDP box is actually a monitoring client for enquiring about the status of VPN devices or links. The real monitoring daemons are located next to the monitored elements and are implemented using different technologies depending on the features of the monitored elements.
3 Policy Specification Language and Information Model
Based on the network management system architecture presented above, this section details the design and implementation of this architecture in terms of two critical policy-based management concerns, i.e., the policy specification language and the policy information model. A high-level policy specification language has been designed and implemented to provide the administrator with the ability to add and change policies in the policy repository. A policy takes the following rule-based format:
[PolicyID] IF {condition(s)} THEN {action(s)}
This means the action(s) is/are taken if the condition(s) is/are true. A policy condition can be in either disjunctive normal form (DNF, an ORed set of AND conditions) or conjunctive normal form (CNF, an ANDed set of OR conditions). The PolicyID field defines the name of the policy rule and is also related to the storage of this policy in the policy repository. An example policy, which forces the SA to specify which packets are to be discarded, is given below:
IF (sourceHost == Camden) and (EncryptionAlgorithm == 3DES) THEN IPsecDiscard
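As an illustration of how such rules might be held and evaluated inside an agent, the following minimal Java sketch models a rule with conditions in DNF. The class names, the attribute-map request format and the evaluation method are assumptions for illustration, not the classes actually used in the prototype.

import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative sketch only: "[PolicyID] IF {conditions} THEN {action}" with DNF conditions.
public class PolicyRuleDemo {
    record Condition(String attribute, String expectedValue) implements Predicate<Map<String, String>> {
        public boolean test(Map<String, String> request) {
            return expectedValue.equals(request.get(attribute));
        }
    }

    record PolicyRule(String policyId, List<List<Condition>> dnfConditions, String action) {
        boolean matches(Map<String, String> request) {
            // DNF: at least one AND-group must have all of its conditions satisfied
            return dnfConditions.stream().anyMatch(
                    group -> group.stream().allMatch(c -> c.test(request)));
        }
    }

    public static void main(String[] args) {
        PolicyRule rule = new PolicyRule("discardCamden3DES",
                List.of(List.of(new Condition("sourceHost", "Camden"),
                                new Condition("EncryptionAlgorithm", "3DES"))),
                "IPsecDiscard");
        Map<String, String> request = Map.of("sourceHost", "Camden", "EncryptionAlgorithm", "3DES");
        if (rule.matches(request)) {
            System.out.println(rule.policyId() + " -> " + rule.action());
        }
    }
}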
This rule-based policy is further represented in XML (eXtensible Markup Language), due to XML's built-in syntax checking and its portability across heterogeneous platforms [7].
An object-oriented information model has been designed to represent the IP VPN management policies, based on the IETF PCIM (Policy Core Information Model) [8] and its extensions [9]. The major objective of such information models is to bridge the gap between the human policy administrator who enters the policies and the actual enforcement commands executed at the network elements. The IETF has described an IPsec Configuration Policy Model [10], representing IPsec policies that result in configuring network elements to enforce the policies. Our information model extends the IETF IPsec policy model by adding more functionality at a higher level (the network management level). Fig. 2 depicts a part of the inheritance hierarchy of our information model representing the IP VPN policies. It also indicates its relationships to the IETF PCIM and its extensions. Some of the actions are not shown directly due to space limitations.
Fig. 2. Class Inheritance Hierarchy of VPN Policy Information Model
4 Case Study: Inter-domain IP VPN
Inter-domain communication is also a challenging research field in network management. This paper provides, as a case study, a solution to inter-domain communication by introducing the mobile intelligent agent. The mobile intelligent agent plays a very important role, since the most essential components of PBNM, such as the PDP and PEP, are in the form of mobile intelligent agents. Other non-movable components of the PBNM architecture, such as the policy receiving module, are in the form of stationary agents waiting for communication with incoming mobile agents. Mobile intelligent agents are also responsible for transporting XML-based policies across multiple domains. This case study has been implemented within the context of the EU IST project CONTEXT [11].
Fig. 3. Inter-domain IP VPN based on Intelligent Mobile Agents
The entire scenario is depicted in Fig. 3. The network administrator uses the Policy Management Station to manage the underlying network environment (comprising two domains, each with one physical router and one Linux machine next to a Cisco router) by giving policies, which are translated into XML files and transported to the relevant sub-domain PBNM stations using mobile intelligent agents. In this scenario, two mobile intelligent agents are generated at the same time, each going to one domain. Let us take one mobile agent as an example. After the mobile agent arrives at the sub-domain management station, it communicates with the stationary agent waiting there. Based on this policy, the sub-domain PDP manager can download the proper PDP, which is in the form of a mobile agent, to make the policy decision. After this, the selected and/or generated policies are handed to the PEP manager which, also sitting on the sub-domain PBNM station, requires the availability of the PEP code, e.g., for configuring a new IP tunnel, according to the requirements given in the policy. The PEP, also in the form of a mobile agent, moves itself to the Linux machine, on which it uses SNMP (Simple Network Management Protocol) to configure the physical router so as to set up one end of the IP VPN tunnel. The same process happens in the other domain to bring up the other end of the IP VPN tunnel.
5 Conclusions and Future Work
As shown in the above case study, after the administrator provides the input requirements, the entire configuration procedure proceeds automatically. The administrator does not need to know or analyse the specific sub-domain information, thanks to the mobility and intelligence of mobile agents. The rule-driven mobile agents enable many advantages, such as the automated and rapid deployment of new services, the customisation of existing network features, and scalability and cost reduction in network management. However, this is just a first step in bringing the intelligence and mobility of software agents into the field of IP network management. Defining a full range of rules for IP network management, and studying how they can coexist in a practical network management solution, are future work. Rule conflict checking and resolution mechanisms will also require more work as the number of policies increases dramatically.
Acknowledgements This paper describes part of the work undertaken in the context of the EU IST project CONTEXT (IST-2001-38142-CONTEXT). The IST programme is partially funded by the Commission of the European Union.
References
[1] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, 1995.
[2] K. Yang, A. Galis, T. Mota, and A. Michalas. "Mobile Agent Security Facility for Safe Configuration of IP Networks". Proc. of 2nd Int. Workshop on Security of Mobile Multiagent Systems: 72-77, Bologna, Italy, July 2002.
[3] D. Gavalas, D. Greenwood, M. Ghanbari, M. O'Mahony. "An infrastructure for distributed and dynamic network management based on mobile agent technology". Proc. of Int. Conf. on Communications: 1362-1366, 1999.
[4] M. Sloman. "Policy Driven Management for Distributed Systems". Journal of Network & System Management, 2(4): 333-360, 1994.
[5] IETF Policy workgroup web page: http://www.ietf.org/html.charters/policy-charter.html
[6] FreeS/WAN website: http://www.freeswan.org/
[7] K. Yang, A. Galis, T. Mota and S. Gouveris. "Automated Management of IP Networks through Policy and Mobile Agents". Proc. of Fourth International Workshop on Mobile Agents for Telecommunication Applications (MATA 2002): 249-258. LNCS 2521, Springer. Barcelona, Spain, October 2002.
[8] J. Strassner, E. Ellesson, and B. Moore. "Policy Framework Core Information Model". IETF Policy WG, Internet Draft, May 1999.
[9] B. Moore. "Policy Core Information Model Extensions". IETF Draft, IETF Policy Working Group, 2002.
[10] J. Jason. IPsec Configuration Policy Model. IETF draft.
[11] European Union IST Project CONTEXT web site: http://context.upc.es/
Neighborhood Matchmaker Method: A Decentralized Optimization Algorithm for Personal Human Network
Masahiro Hamasaki 1 and Hideaki Takeda 1,2
1 The Graduate University for Advanced Studies, Shonan Village, Hayama, Kanagawa 240-0193, Japan
2 National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
Abstract. In this paper, we propose an algorithm called the Neighborhood Matchmaker Method to optimize personal human networks. A personal human network is useful for various uses of information, such as information gathering, but it is usually formed locally and often independently. In order to adapt it to various needs for information utilization, it is necessary to extend and optimize it. Using the neighborhood matchmaker method, we can gain new acquaintances who are expected to share our interests, via all of our own neighbors on the personal human network. Iteration of matchmaking is used to optimize personal human networks. We simulate the neighborhood matchmaker method with practical data and random data, and compare the results of our method with those of a central server model. The neighborhood matchmaker method reaches almost the same results as those obtained by the server model with each type of data.
1 Introduction
Exchanging information among people is one of the most powerful and practical ways to cope with the information flood, because people can act as intelligent agents for each other to collect, filter and associate necessary information. This power stems from the personal human network. If we need varied information to exchange, we must have a good human network. A personal human network is useful for various uses of information, such as information gathering, but it is usually formed locally and often independently. In order to adapt it to various needs for information utilization, it is necessary to extend and optimize it. In this paper, we propose a network optimization method called the "Neighborhood Matchmaker Method", which can optimize networks in a distributed manner starting from arbitrarily given networks.
2 Related Work
There are some systems that capture and utilize personal human networks in the computer. Kautz et al. [1] emphasized the importance of human relations for the WWW and
did pioneering work on finding human relations: their system, called ReferralWeb, can find people by analyzing bibliography databases. Sumi et al. [2] helped people to meet persons who have the same interests and to share information, using mobile computers and web applications. Kamei et al. supported community formation by visualizing the relationships among participants [3].
In these systems, a group is assumed as the target, either explicitly or implicitly. The first problem is how to form such groups, and especially how we can find people as members of groups. We call it the "meet problem". The second problem is how to find suitable people in groups for specific topics and persons. We call this the "select problem". The bigger a group is, the more likely it is to contain valuable persons with whom to exchange information. However, with these systems we have to make more effort to select such persons from the many candidates in the group, and it is difficult for us to organize and manage such a large group. Therefore, information exchanging systems should support methods that realize the above two requirements, i.e., to meet and to select new partners.
3 Neighborhood Matchmaker Method
As we mentioned in the previous section, if we need better relationships for information exchange, we must meet and select more and more partners. This is a big burden for us, because we would have to meet all the candidates before selecting among them. Since we do not know new friends before meeting them, we have no way to select them. How can we solve this problem in our daily life? The practical way is the introduction of new friends by current friends. It is realistic and efficient because the person who knows both can judge whether the combination is suitable or not. Friends work as matchmakers for new friends. We formalize this "friends as matchmakers" idea as an algorithm to extend and optimize networks.
The key feature of this approach is that it needs no central servers. The benefits of this approach are three-fold. The first is to keep the spread of information minimal: information about a person is transferred only to persons connected to her/him directly, which is desirable for keeping personal information secure. The second is distributed computation: the computation to figure out better relationships is done by each node, i.e., the computers used by the participants do the work. This is appropriate for a personal human network because we do not have to care about the size of the network. The third is gradual computation: the network converges gradually, so we can obtain a network that is optimized to some extent even if we stop the computation at any time.
4 Formalization
In this section, we introduce a model that can optimize networks by formalizing the method we use in real life. We call this method the "Neighborhood Matchmaker Method (NMM)" hereafter. Before explaining NMM, we define the network model for this problem. First we define a person as a node, and a connection
for information exchange between people as a path. Here we assume that we can measure the degree of the connection between two nodes (hereinafter referred to as the "connection value"). Then, we can define making a good environment for information exchange as optimizing this network. In NMM, the network is optimized by the matchmaking of neighboring nodes. We need the following two conditions to apply NMM:
– All nodes can potentially connect to each other.
– All nodes can calculate the relationship between the nodes connected to them.
To summarize these conditions, all nodes can act as matchmakers for their connected nodes in order to improve the connection network. The behavior of a node as a matchmaker is as follows:
1. Each node calculates the connection values between its neighboring nodes. (We call this node the "matchmaker".)
2. If, by computing connection values, it finds pairs of nodes which have good enough connection values, it recommends them, i.e., it tells each element of the recommended pair that the pair is a good candidate for connection.
3. The node that receives a recommendation decides whether to accept it or not.
We can optimize the personal human network by iterating this behavior. Figure 1 shows these behaviors.
Fig. 1. Behavior of nodes
In the next section, we test this method with simulations.
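A minimal Java sketch of the matchmaking step is given below. The use of a cosine-like similarity over interest vectors as the connection value and the recommendation threshold are assumptions for illustration; any connection-value measure could be plugged in.

// Illustrative sketch only: one matchmaking step performed by a single node.
public class MatchmakerDemo {
    static double connectionValue(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return dot / Math.sqrt(na * nb);
    }

    public static void main(String[] args) {
        // interest profiles of the matchmaker's neighbours (indices 0..2)
        double[][] neighbours = {{1, 0, 1}, {1, 0, 0.9}, {0, 1, 0}};
        double threshold = 0.8;   // "good enough" connection value (assumed)

        // The matchmaker computes connection values between every pair of its neighbours
        // and recommends the pairs that exceed the threshold.
        for (int i = 0; i < neighbours.length; i++) {
            for (int j = i + 1; j < neighbours.length; j++) {
                double value = connectionValue(neighbours[i], neighbours[j]);
                if (value > threshold) {
                    System.out.printf("recommend neighbours %d and %d (value %.2f)%n", i, j, value);
                }
            }
        }
        // Each recommended node then decides for itself whether to accept the new connection.
    }
}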
5 Experiments
Since NMM only ensures local optimization, we should investigate the global behavior when applying this method. We test the method by simulation. We simulate optimization with NMM using random data and practical data.
5.1 The Procedure of the Simulation
In the previous section, we introduced NMM as three steps, but the third step, the decision, is free to use any tactics for handling recommendations.
Fig. 2. Flow chart of the simulation
In this simulation, we choose a simple tactic: each node wants to connect to other nodes that have better connection values, i.e., if a new node is better in connection value than the worst existing node, the former replaces the latter. Figure 2 shows the flow chart of this simulation.
At first, we create nodes, each of which has some data to represent a person. In this experiment, the data is a 10-dimensional vector or a WWW bookmark taken from users. We initially put paths between nodes randomly. We fix the number of paths during the simulation, which means that the addition of a path requires the deletion of a path. One node is selected randomly and exchanges paths in every turn. In this simulation, all nodes follow the same tactics for exchanging paths: a node must add the best path recommended by matchmakers; if a node adds a path, it must remove the worst path instead, so that the number of paths in the network is fixed; and the path to be added must be better than the worst path it already has. If no node can get a new path using matchmakers, the network has converged. At that time, the simulation is concluded.
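The path-exchange tactic itself can be sketched in a few lines of Java, as below. The identifiers and the map-based representation of paths and recommendations are assumptions for illustration only.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: the simple tactic used in the simulation. A node keeps a fixed
// number of paths; it adds the best recommended candidate and drops its worst existing path,
// but only if the candidate is better than that worst path.
public class PathExchangeDemo {
    public static void main(String[] args) {
        // current paths of one node: partner id -> connection value
        Map<Integer, Double> paths = new HashMap<>(Map.of(7, 0.40, 12, 0.55, 31, 0.72));
        Map<Integer, Double> recommended = Map.of(44, 0.80, 50, 0.35);

        Map.Entry<Integer, Double> best = Collections.max(recommended.entrySet(), Map.Entry.comparingByValue());
        Map.Entry<Integer, Double> worst = Collections.min(paths.entrySet(), Map.Entry.comparingByValue());

        if (best.getValue() > worst.getValue()) {
            paths.remove(worst.getKey());            // the number of paths stays fixed
            paths.put(best.getKey(), best.getValue());
        }
        System.out.println("paths after exchange: " + paths);
    }
}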
5.2 The Measurement
Since the purpose of the simulation is to see how well our method achieves optimization of the network, we should define what the optimized network is. We adopt a simple criterion: the best network for n paths is the network that includes the n best paths in terms of connection values. (This criterion may not be "best" for individual nodes, because some nodes may not have any connections; we can adopt other criteria if needed.) The good news is that this best network can easily be calculated by collecting and computing information for all nodes, so we can compare it with the networks generated by our method. Of course, this computation requires a central server, whereas our method can be performed in a distributed manner. We compare two networks in the following two ways. One is the cover rate, i.e., how many of the paths in the best network are found in the generated network. It indicates
how similar in structure the two networks are. The other is the reach rate, which compares the average connection value of the best and generated networks; it indicates how similar in effectiveness the two networks are. These measures are defined by the following formulas:

\text{cover rate} = \frac{|\{P_{current}\} \cap \{P_{best}\}|}{N}

\text{reach rate} = \frac{\sum_{l=1}^{N} f(p_l \mid p_l \in \{P_{current}\})}{\sum_{m=1}^{N} f(p_m \mid p_m \in \{P_{best}\})}

where p is a path, N is the number of paths, {P} is a set of paths, {P_best} is the best set of paths, {P_current} is the current set of paths, and f(p) is the value of a path.
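A direct transcription of these two measures into Java might look as follows; the string identifiers for paths and the value map are assumptions for illustration.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: cover rate and reach rate of a generated network vs. the best one.
public class NetworkMeasuresDemo {
    public static void main(String[] args) {
        Map<String, Double> pathValues = Map.of("a-b", 0.9, "b-c", 0.8, "c-d", 0.7, "a-d", 0.4, "b-d", 0.3);
        Set<String> best = Set.of("a-b", "b-c", "c-d");      // the N best paths (N = 3)
        Set<String> current = Set.of("a-b", "c-d", "b-d");   // paths found by the matchmaker method

        Set<String> common = new HashSet<>(current);
        common.retainAll(best);
        double coverRate = (double) common.size() / best.size();

        double currentSum = current.stream().mapToDouble(pathValues::get).sum();
        double bestSum = best.stream().mapToDouble(pathValues::get).sum();
        double reachRate = currentSum / bestSum;

        System.out.printf("cover rate = %.2f, reach rate = %.2f%n", coverRate, reachRate);
    }
}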
6 Simulation Results
There are two parameters to control the experiments: one is the number of nodes and the other is the number of paths. In this experiment, we set the number of nodes from 10 to 100 and the number of paths from 1 to 5 times the number of nodes. The simulation is performed 10 times for each set of parameters, and we use the average as the result.
The graphs in Figure 3 plot the average cover-rate against the turn number.
Fig. 3. Cover-Rate in the random data
Figure 3-a shows the results when the number of paths is fixed at three times the number of nodes, and Figure 3-b shows the results when the number of nodes is fixed at 60. In our formalization, we cannot know in advance whether the network will converge. However, we can see that all graphs become horizontal, which implies that all networks converged using matchmaking. We can also see that the averages of the measurements and the turn of convergence are affected by the numbers of nodes and paths. We observed similar results for the reach-rate; the difference is that the reach-rate is less dependent on the numbers of paths and nodes.
We also examine the relation between the size of the network and the turn of convergence. After iterating the simulation while varying the numbers of nodes and paths, we obtain the graph in Figure 4, which plots the average of the convergence turns against the number of nodes. This graph indicates that the turn of convergence increases linearly as the number of nodes increases. In this simulation, only a single node can exchange paths in a turn, so the number of exchanges per node does not become very large.
Let us roughly estimate the computational complexity of the algorithm. When the average number of neighboring nodes is r, this algorithm calculates connection values 2r times in every turn. When the number of nodes is N and the number of turns to convergence is kN, according to Figure 4, the number of calculations
needed to converge is 2rkN using NMM. In the centralized model the number of calculations is N^2, because we have to calculate connection values among all nodes. Since r and k are fixed values, the order is O(N) using NMM, which is less than the O(N^2) of the centralized model.
We also used practical data generated by people. We use WWW bookmarks to measure connection values among people. Users add web pages in which they are interested and organize topics as folders in their WWW bookmarks, so it can be said that a WWW bookmark represents the user's profile. In this simulation we need to calculate the relationship between nodes; we use a parameter called "category resemblance" as the value of the relationship between nodes [4]. This parameter is based on the resemblance of the folder structures of WWW bookmarks. We examined the averages of the measurements and the convergence turns, and found a similar tendency to that of the random data. These results indicate that the network can also be optimized with the practical data.
Fig. 4. Average of Convergence Turn
7 Conclusion
In this paper, we proposed a way to find new partners for exchanging information, using a method called the "Neighborhood Matchmaker Method" (NMM). Our method uses collaborative and autonomous matchmaking and does not need any central servers. Nevertheless, our experimental results show that an optimized personal human network can be obtained. In this simulation, we needed a number of paths equal to 2 to 3 times the number of nodes and a number of turns equal to 1.5 to 2 times the number of nodes in order to optimize the network sufficiently. The method is applicable to communities of any size, because it calculates relationships among people without collecting all data at a single server. It can therefore assist bigger groups, which are more likely to contain valuable persons with whom to exchange information, and it has a lower computational cost. Furthermore, it is an easy and quick method because it can start up anytime and anywhere without registration at servers. It can assist the formation of the dynamic and emergent communities that are typical of the Internet. We are now developing a system, for sharing hyper-links and comments, that uses the proposed method. In the real world, personal networks change dynamically through the exchange of information among people. A further direction of this study will be to experiment with this system and investigate its effectiveness in the real world.
References
[1] H. Kautz, B. Selman, M. Shah: ReferralWeb: Combining Social Networks and Collaborative Filtering. Communications of the ACM, Vol. 40, No. 3, 1997.
[2] Y. Sumi, K. Mase: Collecting, Visualizing, and Exchanging Personal Interests and Experiences in Communities. In Proceedings of the 2001 International Conference on Web Intelligence (WI-01), 2001.
[3] K. Kamei, et al.: Community Organizer: Supporting the Formation of Network Communities through Spatial Representation. In Proceedings of the 2001 Symposium on Applications and the Internet (SAINT'01), 2001.
[4] M. Hamasaki, H. Takeda: Experimental Results for a Method to Discover Human Relationship Based on WWW Bookmarks. In Proceedings of Knowledge-Based Intelligent Information & Engineering Systems (KES'01), 2001.
Design and Implementation of an Automatic Installation System for Application Program in PDA
Seungwon Na and Seman Oh
Dept. of Computer Engineering, Dongguk University, 263-Ga Phil-Dong, Jung-Gu, Seoul 100-715, Korea
{nasw,smoh}@dgu.ac.kr
Abstract. The development of Internet technology on top of mobile communication technology has brought us the wireless Internet, which is growing into a popular service because of its added convenience of mobility. Wireless Internet access was first provided through cellular phones, but the current trend is moving toward PDAs (Personal Digital Assistants), which have extended functionality. Applications that increase the functionality of PDAs are constantly being developed, and application software must occasionally be installed. Also, when a PDA's power supply becomes fully discharged, all data stored in RAM (Random Access Memory) is lost, and programs must be reinstalled. This paper presents an automated application program installation system, PAIS (PDA Auto Installing System), designed as a solution to the problem of PDA users having to install application programs on their PDAs themselves. When this system is applied, PDA users can save the time and effort required for reinstallations, an added convenience, and application software companies can save on the costs previously needed to create materials explaining the installation process.
1 Introduction
Wireless Internet access was first provided through cellular phones in Korea in 1999. But because of their many restrictions, including limited CPU processing power, narrow memory space, and small display area, much needed to be improved before cellular phones could be used as mobile Internet devices [9]. Therefore, PDAs, which have extended functionality compared to cellular phones, are rising as the new medium for mobile Internet access. PDAs were originally used mainly as personal schedule planners, but with the addition of wireless connection modules, they are evolving into next-generation mobile devices which provide not only telephony services but also wireless Internet services. To provide various services, many new technologies are being developed, and each time, users must install new software. Also, when a PDA's power supply becomes fully discharged, all data stored in RAM is lost, and in this case
users must reinstall the application software [10]. Added to this, PDAs are equipped with one of several different operating systems, such as PPC2002, P/B, Palm, or Linux. This means that there are several programs to choose from, and the installation process becomes complicated. These problems must be addressed before the use of wireless Internet on PDAs can be accelerated. In this paper we present the design and implementation of a system that automates the installation process of application programs on PDAs. This system is named PAIS (PDA Auto Installing System). The main function of the system is explained as follows. The PDA agent sends an installation data file to the server, which the server compares to its file management table to figure out which installation files to send, and then sends the right installation file to each PDA. The agent then installs the downloaded file into the proper directory.
2 Related Works
In this section, the PDA's structure and SyncML, the standard for data synchronization technology, are discussed.

2.1 Current Status of PDA
A PDA device is composed of a network communication unit, an input/output unit, and a memory storage unit. Table 1 shows the components of each unit. A PDA's memory unit is made up of ROM (Read Only Memory) and RAM (Random Access Memory). ROM is divided again into EPROM and Flash ROM; Flash ROM supports both read and write access [3]. A PDA does not have a separate auxiliary storage device, and uses the object storage technique, storing most executable programs and data in the RAM component. Using RAM as a memory device provides faster execution speed compared to ROM, but has the problem of being erased when the power supply is fully discharged.

2.2 SyncML (Synchronization Markup Language)
SyncML is a standard language for synchronizing all devices and applications over any network. SyncML defines three basic protocols: the Representation protocol, which defines the message data format; the Synchronization protocol, which defines synchronization rules; and the Sync protocol, which defines the binding methodology for transmitting messages [11].

Table 1. PDA Components
Device unit              Components
Network Communication    Modem, TCP/IP, IrDA
Input/Output             Touch Panel; MIC, Phone; LCD, Display
Memory Storage           ROM, RAM, Memory Card
Fig. 1. SyncML Framework
The SyncML Representation protocol is a syntax for SyncML messages, which defines an XML DTD for expressing all forms of data, metadata, and synchronization instructions needed to accomplish synchronization [6]. Transfer protocols include HTTP, WSP (Wireless Session Protocol), and OBEX (Object Exchange Protocol). These protocols are not dependent on the Representation protocol or Sync protocol and, therefore, other transfer protocols can be bound later on. The Sync protocol defines the rules for exchanging messages between client and server to add, delete, and modify data, defines the operational behaviour for actual synchronization, and also defines synchronization types. Through these steps, SyncML supports the following features [4]:

a) Efficiency of operation in both wired and wireless environments
b) Support for several different transport protocols
c) Support for any data type
d) Access to data from several applications
e) Consideration of limited mobile device resources
f) Compatibility with the existing Internet and web environment

The structure of the SyncML framework is shown in Figure 1.
Fig. 2. PAIS Operation Concept
Fig. 3. PDA Agent Operation
3 PAIS Design
PAIS (PDA Auto Installing System) is a system which sends data about the executable files stored on a PDA to the server, receives the selected installation files, and automatically installs the downloaded files. The basic operation concept of PAIS is as follows. Each PDA (A, B) sends its data (60, 70) to the PAIS server. The server compares this to its own data (100), selects the installation files (40, 30), and creates an installation package. These installation packages are then uploaded to the proper device. PAIS consists of two components: the embedded PDA agent and the PAIS server, which manages setup files. The PDA agent sends and receives file information, and the PAIS server manages installation files [8].

3.1 PDA Agent Design
The main function of the PDA agent is to collect internal file information and transmit it to the PAIS server, and then to download the final installation file and automatically install it into the appropriate directory. The PDA agent collects executable file information from the registry and sends it to the PAIS server. The collected information is as shown in Table 2 and is stored as a binary file together with the ID. The detailed information collected from the PDA's registry serves as the basis for identifying the application programs. The collected data are packed into a file and sent as shown in Figure 4; a small illustrative sketch of such a record follows Table 2.

Table 2. PDA Gathering Information Examples
Category                Detailed Information                        ID
Customer Information    Customer ID                                 10~
PDA Device              HP, Samsung, SONY                           20~
PDA O/S                 PPC 2002, Palm, Linux                       30~
Application             Internet Browser, e-Book Viewer, VM, etc.   40~
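The following sketch illustrates what such an ID-tagged record could look like before transmission. The field names, concrete ID values, and binary packing are assumptions for illustration only; the paper specifies just the ID ranges of Table 2.

# Hypothetical record built by the agent from the registry information.
import struct

pda_info = {
    "customer_id": 11,             # 10~ : customer information
    "device_id": 21,               # 20~ : PDA device (e.g. HP, Samsung, SONY)
    "os_id": 31,                   # 30~ : PDA O/S (e.g. PPC 2002, Palm, Linux)
    "application_ids": [41, 43],   # 40~ : installed applications
}

# Stored as a small binary file before being sent to the PAIS server.
payload = struct.pack(
    "<3H" + f"{len(pda_info['application_ids'])}H",
    pda_info["customer_id"], pda_info["device_id"], pda_info["os_id"],
    *pda_info["application_ids"],
)
print(len(payload), "bytes")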
Fig. 4. PAIS Agent File Processing Flow
3.2 PAIS Server Design
The PAIS server processes the PDA's connection authentication using the DB server and manages application programs, creating packages of installation files and sending them to each PDA. It is also connected to an SMS (Short Message Service) server and alerts the PDA user to new or upgraded files to promote installation. As additional functions, statistical data and bulletin board support are provided. The SMS service informs the user about new updates, but the actual installation through the PDA agent is controlled by the user's settings. PDA device types and application program information are managed through specified ID numbers, and the agent manages each application program with an already specified ID number table. The PAIS server compares the transmitted PDA file information to the installation file comparison table. The basis for this comparison is the version information, and only when the server's version is higher is the corresponding installation file selected. Figure 6, shown below, outlines a case where the first and third files are selected. In this case, the selected executable files are packaged and uploaded to the PDA through the wireless connection. The package file contains three URLs: the basic file catalog URL, the detailed file information URL, and the download URL.
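The following sketch illustrates the selection rule described above; the data structures and version encoding are assumptions for illustration, since the paper does not specify the server's internal format.

# Select installation files whose server-side version is newer than the
# version reported by the PDA (the basis of the comparison in Figure 6).
def select_install_files(pda_versions: dict, server_table: dict) -> list:
    selected = []
    for app_id, server_version in server_table.items():
        pda_version = pda_versions.get(app_id, (0, 0))
        if server_version > pda_version:   # tuple comparison, e.g. (1, 2) < (1, 3)
            selected.append(app_id)
    return selected

# Toy example mirroring Figure 6, where the first and third entries are selected.
pda = {40: (1, 0), 41: (2, 0), 42: (1, 5)}
server = {40: (1, 1), 41: (2, 0), 42: (2, 0)}
print(select_install_files(pda, server))   # [40, 42]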
Fig. 5. PAIS Agent File Processing Flow
Fig. 6. Install File Comparison Example
Transmission mode: Server address?UID=serviceID_platformID_applicationID
[BASIC_URL]    = http://www.PAIS.com/basic.asp
[DETAIL_URL]   = http://www.PAIS.com/detail.asp
[DOWNLOAD_URL] = http://www.PAIS.com/download.asp
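As a small illustration, the request string and package URLs above could be assembled as follows; the concrete ID values are placeholders, and only the UID layout and the three URL keys come from the text.

SERVER = "http://www.PAIS.com"

def build_request(service_id: int, platform_id: int, application_id: int) -> str:
    uid = f"{service_id}_{platform_id}_{application_id}"
    return f"{SERVER}?UID={uid}"

package = {
    "BASIC_URL": f"{SERVER}/basic.asp",
    "DETAIL_URL": f"{SERVER}/detail.asp",
    "DOWNLOAD_URL": f"{SERVER}/download.asp",
}

print(build_request(10, 20, 40), package["DOWNLOAD_URL"])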
4 PAIS Implementation
The development environment and software tools used to implement the PAIS system proposed in this paper are as follows:

• Embedded Visual C++ 3.0 (language)
• Pocket PC PDA
• PocketPC 2002 (O/S)
The download and automatic installation process of an application program were implemented through an emulator as shown in Figure 7.
Fig. 7. Implement Result
The screen capture on the left shows the PDA uploading the application's information to the server. The screen capture in the center shows the PDA downloading the package file from the server; here the PDA is downloading a package for installing the service. The screen capture on the right shows the automatic installation of the received install package file.
5 Conclusion
When the power supply of a PDA becomes fully discharged, all data stored in RAM is erased. Each time this happens, users must reinstall all programs. Also, because PDAs are constantly being upgraded, not only do users have to install new programs, but updates also occur frequently. When this happens, users must do the actual installation themselves. To solve this drawback of PDAs, we have proposed, in this paper, a system named PAIS, which automatically installs the desired application programs on a PDA. When PAIS is applied, PDA users can save the time and effort required for installations, an added convenience, and application software companies can save on the costs previously needed to create materials explaining the installation process. This improves the limited PDA environment to provide better wireless Internet services. For future research, we plan to look into ways of expanding the use of this automatic installation technology to cellular phones and other mobile devices.
References
[1] Byeongyun Lee, "Design and Implementation of SyncML Data Synchronization System on Session Manager", KISS Journal, Vol. 8, No. 6, pp. 647-656, Dec. 2002.
[2] Insuk Ha, "All Synchronization Suite of Programs SDA Analysis", Micro Software, pp. 309-313, Aug. 2001.
[3] James Y. Wilson, Building Powerful Platforms with Windows CE, 2nd Edition, Addison Wesley, March 2000.
[4] Jiyeon Lee, "The Design and Implementation of Database for Sync Data Synchronization", KIPS Conference, Seoul, Vol. 8, No. 2, pp. 1343-1346, Oct. 2001.
[5] Suhui Ryu, "Design and Implementation of Data Synchronization Server's Agent that Uses the Sync Protocol", KISS Conference, Seoul, Vol. 8, No. 2, pp. 1347-1350, Oct. 2001.
[6] SyncML Initiative, Sync Architecture version 0.2, May 2000.
[7] Taegyune An, Mobile Programming with Pocket PC, Inforgate, July 2002.
[8] Uwe Hansman, Synchronizing and Managing Your Mobile Data, Prentice Hall, Aug. 2002.
[9] Intromobile, Mobile Multimedia Technology Trend, http://www.intromobile.co.kr/solution/
[10] Microsoft, PocketPC Official Site, http://www.microsoft.com/mobile/pocketpc/
[11] SyncML Initiative, http://www.syncml.org/
Combination of a Cognitive Theory with the Multi-attribute Utility Theory Katerina Kabassi and Maria Virvou Department of Informatics University of Piraeus 80 Karaoli & Dimitriou Str. 18534 Piraeus, Greece {kkabassi,mvirvou}@unipi.gr
Abstract. This paper presents how a novel combination of a cognitive theory with the Multi-Attribute Utility Theory (MAUT) can be incorporated in a Graphical User Interface (GUI) in order to provide adaptive and intelligent help to users. The GUI is called I-Mailer and is meant to help users achieve their goals and plans during their interaction with an e-mailing system. I-Mailer constantly observes the user and in case it suspects that s/he is involved in a problematic situation it provides spontaneous advice. In particular, I-Mailer suggests alternative commands that the user could have given instead of the problematic one that s/he gave. The reasoning about which alternative command could have been best is largely based on the combination of the cognitive theory and MAUT.
1 Introduction
In real-world situations humans must make a great number of decisions that usually involve several objectives, viewpoints or criteria. The representation of different points of view (aspects, factors, characteristics) with the help of a family of criteria is undoubtedly the most delicate part of the formulation of a decision problem (Bouyssou 1990). Multi-criteria decision aid is characterised by methods that support planning and decision processes through collecting, storing and processing different kinds of information in order to solve a multi-criteria decision problem (Lahdelma 2000). Although research in Artificial Intelligence has tried to model the reasoning of users, it has not considered multi-criteria decision aid as much as it could. This paper shows that the combination of a cognitive theory with a theory of multi-criteria analysis can be incorporated into a user interface to improve its reasoning. This reasoning is used by the system to provide automatic intelligent assistance to users who are involved in problematic situations. For this purpose, a graphical user interface has been developed. The user interface is called Intelligent Mailer (I-Mailer) and is meant to operate as a standard e-mail client. I-Mailer monitors all users' actions and reasons about them. In case it diagnoses a problem, the system provides spontaneous assistance. The system uses Human Plausible Reasoning (Collins & Michalski
1989) (henceforth referred to as HPR) and its certainty parameters in order to make inferences about possible users' errors based on evidence from the users' interaction with the system. HPR is a cognitive theory about human reasoning and has been used in I-Mailer to simulate the users' reasoning, which may be correct or incorrect (but still plausible) and thus may lead to "plausible" user errors. HPR has previously been adapted in two other Intelligent Help Systems (IHS), namely RESCUER (Virvou & Du Boulay, 1999), which was an IHS for UNIX users, and IFM (Virvou & Kabassi, 2002), which was an IHS for users of graphical file manipulation programs such as the Microsoft Windows Explorer (Microsoft Corporation 1998). However, none of these systems incorporated the combination of MAUT and HPR. In case of a user's error, I-Mailer uses the statement transforms, the simplest class of inference patterns of HPR, to find the action that the user might have meant to issue instead of the one causing the problem. However, the main problem with this approach is the generation of many alternative actions. In order to select the most appropriate one, I-Mailer uses HPR's certainty parameters. Each certainty parameter represents a criterion that a user takes into account in order to select the action that s/he means to issue. Therefore, each time the system generates an alternative action, a problem is defined that has to be solved taking into account multiple, often conflicting, criteria. The aim of multi-criteria decision analysis is to recommend an action, while several alternatives have to be evaluated in terms of many criteria. One important theory in multi-criteria analysis is the Multi-Attribute Utility Theory (MAUT). The theory is based on aggregating the different criteria into a function, which has to be maximised. In I-Mailer, we have applied MAUT to combine the values of the certainty parameters for a given action of a user, and then to rank the set of actions and thus select the best alternative action to suggest to the user.
2 Background

2.1 Multi-attribute Utility Theory
As Vincke (1992) points out, a multi-criteria decision problem is a situation in which, having defined a set A of actions and a consistent family F of n criteria g_1, g_2, ..., g_n (n ≥ 3) on A, one wishes to rank the actions of A from best to worst and determine a subset of actions considered to be the best with respect to F. The preferences of the Decision Maker (DM) concerning the alternatives of A are formed and argued by reference to the n points of view adequately reflected by the criteria contained in F. When the DM must compare two actions a and b, there are three cases that can describe the outcome of the comparison: the DM prefers a to b, the DM is indifferent between the two, or the two actions are incomparable. The traditional approach is to translate a decision problem into the optimisation of some function g defined on A. If g(a) > g(b) then the DM prefers a to b, whereas if g(a) = g(b) then the DM is indifferent between the two. The theory
defines a criterion as a function g, defined on A, taking its values in a totally ordered set, and representing the DM's preferences according to some point of view. Therefore, the evaluation of action a according to criterion j is written g_j(a). MAUT is based on a fundamental axiom: any decision-maker attempts unconsciously (or implicitly) to maximise some function U = U(g_1, g_2, ..., g_n) aggregating all the different points of view which are taken into account. In other words, if the DM is asked about his/her preferences, his/her answers will be coherent with a certain unknown function U. The role of the researcher is to try to estimate that function by asking the DM some well-chosen questions. The simplest (and most commonly used) analytical form is, of course, the additive form:

U(a) = Σ_{j=1}^{n} k_j U_j(x_j^a)    (1)

where ∀j: x_j^a = g_j(a), Σ_{j=1}^{n} k_j = 1, U_j(x_j) = 0 and U_j(y_j) = 1.
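As a brief illustration of the additive form, the following sketch scores and ranks alternatives by a weighted sum of per-criterion values; the weights and values are arbitrary examples, not those used later in the paper.

# Additive utility U(a) = sum_j k_j * U_j(x_j^a), with the weights summing to 1.
def additive_utility(values, weights):
    """values[j] = U_j(x_j^a), weights[j] = k_j."""
    return sum(k * u for k, u in zip(weights, values))

weights = [0.4, 0.3, 0.2, 0.1]
alternatives = {
    "a": [0.65, 0.50, 0.31, 0.35],
    "b": [0.55, 0.60, 0.39, 0.03],
}
ranked = sorted(alternatives,
                key=lambda name: additive_utility(alternatives[name], weights),
                reverse=True)
print(ranked)   # alternatives ordered from best to worst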
2.2 Human Plausible Reasoning
Human Plausible Reasoning (HPR) theory (Collins & Michalski 1989, Burstein et al. 1991) is a cognitive theory that attempts to formalise the plausible inferences that occur in people's responses to different questions when the answers are not directly known. The theory is grounded on an analysis of people's answers to everyday questions about the world, a set of parameters that affect the certainty of such answers, and a system relating the different plausible inference patterns and the different certainty parameters. For example, if the question asked was whether coffee is grown in the Llanos region in Colombia, the answer would depend on the knowledge retrieved from memory. If the subject knew that the Llanos was in a savanna region similar to that where coffee grows, this would trigger an inductive, analogical inference, and generate the answer yes (Carbonel & Collins 1973). According to the theory, a large part of human knowledge is represented in "dynamic hierarchies", which are always being updated, modified or expanded. In this way, the reasoning of people with patchy knowledge can be modelled. Statement transforms, the simplest class of inference patterns, exploit the 4 possible relations among arguments and among referents to yield 8 types of statement transform. Statement transforms can be affected by certainty parameters. The degree of similarity (σ) signifies the degree of resemblance of one set to another. The degree of typicality (τ) represents how typical a subset is within a set (for example, a cow is a typical mammal). The degree of frequency (φ) counts how frequent a referent is in the domain of the descriptor. Dominance (δ) indicates how dominant a subset is in a set (for example, elephants are not a large percentage of mammals). Finally, the only certainty parameter applicable to any expression is the degree of certainty (γ). The degree of
certainty or belief that an expression is true is defined by HPR as γ = f (σ , δ , φ ,τ ) . However, the exact formula for the calculation of γ is not specified.
3 Intelligent Mailer
Intelligent Mailer (I-Mailer) is an intelligent graphical user interface that works in a similar way to a standard e-mail client but also incorporates intelligence. The system's main aim is to provide spontaneous help and advice to users who have made an error with respect to their hypothesised intentions. Every time a user issues a command, the system reasons about it using a limited goal recognition mechanism. If the command is considered problematic, the system uses the principles of HPR to make plausible guesses about the user's errors. In particular, it implements some statement transforms in order to generate alternative commands that the user could have issued instead of the problematic one. If an alternative command is found that would be compatible with the user's hypothesised goals and would not be problematic like the command issued by the user, then this command is suggested to the user. However, the main problem with this approach is the generation of many alternative commands. In order to select the most appropriate one, I-Mailer uses an adaptation of five of the certainty parameters introduced in HPR. The values of the certainty parameters for every alternative action are supplied by the user model of the particular user (Kabassi & Virvou 2003). In I-Mailer, the degree of similarity (σ) is used to calculate the resemblance of two commands or two objects. It is mainly used to capture possible confusions that a user may have made between commands. The typicality (τ) of a command represents the estimated frequency of execution of the command by the particular user. The degree of frequency (φ) of an error represents how often a specific error is made by a particular user. The dominance (δ) of an error in the set of all errors shows which kind of error is the most frequent for a particular user. The values of these certainty parameters are calculated based on the information stored in the individual user models and the domain representation. Finally, all the parameters presented above are combined to calculate a degree of certainty for every alternative command generated by I-Mailer. This degree of certainty (γ) represents the system's certainty that the user intended the alternative action generated. However, as mentioned earlier, the exact formula for the calculation of the degree of certainty has not been specified by HPR. Therefore, we have used MAUT to calculate the degree of certainty, as presented in detail in the next section. An example of how I-Mailer works is presented below: A user, in an attempt to organise his mailbox, intends to delete the folder 'Inbox\conference1\'. However, he accidentally attempts to delete 'Inbox\conference2\'. In this case, he runs the risk of losing all e-mail messages stored in the folder 'Inbox\conference2\'. I-Mailer would suggest that the user delete the folder 'Inbox\conference1\', because the folder 'Inbox\conference1\' is empty whereas the folder 'Inbox\conference2\' is not, and the two folders have very similar names, so one could have been mistaken for the other.
Fig. 1. A graphical representation of the user's mailbox
4 Specification of the Degree of Certainty Based on MAUT
The decision problem in I-Mailer is to find the best alternative command to be suggested to the user instead of the problematic one that s/he has issued. This problem involves the calculation of the degree of certainty γ for each alternative command generated through the statement transforms. Each of the certainty parameters of HPR is considered as a criterion; that is, the criteria are the degree of similarity, the degree of frequency, the degree of typicality and dominance. Thus, the main goal of MAUT in I-Mailer is to try to optimise the function U(a) = Σ_{j=1}^{n} k_j U_j(x_j^a) described in Section 2.1, where a is an alternative command to be suggested, U_j is the estimate of the value of each certainty parameter (j = 1, 2, 3, 4) and k_j is the weight of the corresponding certainty parameter. For each alternative command, each certainty parameter is given a value based on the knowledge representation and long-term observations about the user. For example, in a case where the user is very prone to accidental slips, the degree of frequency would also be very high. As another example, the degree of similarity between two mails with subjects 'Confirmation1' and 'Confirmation2' is 0.90 because they have very similar names and are neighbouring objects in the graphical representation of the electronic mailbox. The theory states that in order to estimate the values of the weights k_j, one must obtain (n-1) pairs of indifferent actions, where n is the number of criteria. In the case of I-Mailer we had to find examples of 3 pairs of indifferent actions, since n = 4 (we have 4 criteria). A pair of indifferent actions in I-Mailer is a pair of alternative commands for which a human advisor would not have any preference for one or the other of the two. Therefore, we conducted an empirical study in order to find some pairs of alternative actions that human experts would count as indifferent. The empirical study involved 30 users of different levels of expertise in the use of an e-mailing system and 10 human experts of the domain. All users were asked to interact with a standard e-mailing system, as they would normally do. During their interaction
with the system, their actions were video captured and the human experts were asked to comment on the protocols collected. The analysis of the comments of the human experts revealed that in some cases the human experts thought that two alternatives were equally likely to have been intended by the user. From the actions that the majority of human experts counted as indifferent, we selected the 3 pairs that had the greatest acceptability. For each alternative action, we calculated the values of the certainty parameters and substituted them into the function U(a) = Σ_{j=1}^{n} k_j U_j(x_j^a), letting the weights of the certainty parameters be the unknown quantities k_j. As the value of U(a) is the same for two indifferent actions, we equate the values of the function U(a) for each pair of indifferent actions. This process resulted in the following 3 equations:

a1 I a2 ⇔ 0.65 k1 + 0.50 k2 + 0.31 k3 + 0.35 k4 = 0.55 k1 + 0.60 k2 + 0.39 k3 + 0.03 k4
a3 I a4 ⇔ 0.30 k1 + 0.40 k2 + 0.23 k3 + 0.65 k4 = 0.25 k1 + 0.55 k2 + 0.11 k3 + 0.13 k4
a5 I a6 ⇔ 0.75 k1 + 0.70 k2 + 0.56 k3 + 0.43 k4 = 0.90 k1 + 0.55 k2 + 0.48 k3 + 0.55 k4

These equations, together with the equation Σ_{j=1}^{4} k_j = 1, form a system of 4 equations with 4 unknown quantities, which is easy to solve. After this process, the values of the weights of the certainty parameters were found to be k1 = 0.44, k2 = 0.36, k3 = 0.18 and k4 = 0.02. Therefore, the final formula for the calculation of the degree of certainty was found to be the following:
γ = 0.44σ + 0.36δ + 0.18φ + 0.02τ    (2)
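The following numpy sketch reproduces this computation: the three indifference equations, rewritten as left-hand side minus right-hand side equal to zero, are combined with the normalisation constraint and solved as a 4x4 linear system. The coefficients are taken directly from the equations above.

import numpy as np

A = np.array([
    [0.65 - 0.55, 0.50 - 0.60, 0.31 - 0.39, 0.35 - 0.03],  # a1 I a2
    [0.30 - 0.25, 0.40 - 0.55, 0.23 - 0.11, 0.65 - 0.13],  # a3 I a4
    [0.75 - 0.90, 0.70 - 0.55, 0.56 - 0.48, 0.43 - 0.55],  # a5 I a6
    [1.0, 1.0, 1.0, 1.0],                                  # k1 + k2 + k3 + k4 = 1
])
b = np.array([0.0, 0.0, 0.0, 1.0])

k = np.linalg.solve(A, b)
print(np.round(k, 2))   # approximately [0.44, 0.36, 0.18, 0.02]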
5 Conclusions
In this paper, we have described an intelligent e-mailing system, I-Mailer, that helps users achieve their goals and plans. In order to provide intelligent and individualised help, I-Mailer uses a combination of the principles of Human Plausible Reasoning with the Multi-Attribute Utility Theory. HPR provides a domain-independent, formal framework for generating hypotheses about the users' beliefs and intentions from the point of view of a human advisor who watches the user over his/her shoulder and reasons about his/her actions. In case the system suspects that the user is involved in a problematic situation, it uses the adaptation of HPR in order to find possible alternatives that the user might have meant to issue instead of the unintended one. However, this process usually results in the generation of many alternatives. Therefore, the system uses the certainty parameters of HPR and MAUT in order to find the most "predominant" alternative action, which is the action that is most likely to have been intended by the user.
References
[1] Bouyssou, D.: Building criteria: a prerequisite for MCDA. In Bana e Costa, C. (ed.): Readings in MCDA, Springer-Verlag (1990).
[2] Burstein, M.H., Collins, A. & Baker, M.: Plausible Generalisation: Extending a Model of Human Plausible Reasoning. Journal of the Learning Sciences, (1991) Vol. 3 and 4, 319-359.
[3] Carbonel, J.R. & Collins, A.: Natural Semantics in Artificial Intelligence. In Proceedings of the Third International Joint Conference on Artificial Intelligence, Stanford, California, (1973) 344-351.
[4] Collins, A. & Michalski, R.: The Logic of Plausible Reasoning: A Core Theory. Cognitive Science, (1989) Vol. 13, 1-49.
[5] Kabassi, K. & Virvou, M.: Adaptive Help for e-mail Users. In Proceedings of the 10th International Conference on Human Computer Interaction (HCII'2003), to appear.
[6] Lahdelma, R., Salminen, P. & Hokkanen, J.: Using Multicriteria Methods in Environmental Planning and Management. Environmental Management, Springer-Verlag, New York, (2000) Vol. 26, No. 6, 565-605.
[7] Vincke, P.: Multicriteria Decision-Aid. Wiley (1992).
[8] Virvou, M. & Du Boulay, B.: Human Plausible Reasoning for Intelligent Help. User Modeling and User-Adapted Interaction, (1999) Vol. 9, 321-375.
[9] Virvou, M. & Kabassi, K.: Reasoning about Users' Actions in a Graphical User Interface. Human-Computer Interaction, (2002) Vol. 17, No. 4, 369-399.
Using Self Organizing Feature Maps to Acquire Knowledge about Visitor Behavior in a Web Site
Juan D. Velásquez 1, Hiroshi Yasuda 1, Terumasa Aoki 1, Richard Weber 2, and Eduardo Vera 3
1 Research Center for Advanced Science and Technology, University of Tokyo {jvelasqu,yasuda,aoki}@mpeg.rcast.u-tokyo.ac.jp
2 Department of Industrial Engineering, University of Chile [email protected]
3 AccessNova Program, Department of Computer Science, University of Chile [email protected]
Abstract. When a user visits a web site, important information concerning his/her preferences and behavior is stored implicitly in the associated log files. This information can be revealed by using data mining techniques and can be used to improve both the content and the structure of the respective web site. From the set of possible variables that define a visitor's behavior, two have been selected: the visited pages and the time spent on each one of them. With this information, a new distance was defined and used in a self-organizing map which identifies clusters of similar sessions, allowing the analysis of visitor behavior. The proposed methodology has been applied to the log files of a certain web site. The respective results gave very important insights regarding visitor behavior and preferences and prompted the reconfiguration of the web site.
1 Introduction
When a visitor enters a web site, the selected pages have a direct relation to the information he/she is looking for. The ideal structure of a web site should support visitors in finding such information. However, reality is quite different. In many cases, the structure of a web site does not help to find the desired information, although a page that contains it does exist [3]. Studying visitors' behavior is important in order to create more attractive content, to predict their preferences and to prepare links with suggestions, among other goals [9]. These research initiatives aim at facilitating web site navigation and, in the case of commercial sites, at increasing market share [1], transforming visitors into customers, increasing customer loyalty and predicting their preferences.
Each click of a web site visitor is stored in files known as web logs [7]. The knowledge about visitors' behavior contained in these files can be extracted using data mining techniques such as self-organizing feature maps (SOFM). In this work, a new distance measure between web pages is introduced, which is used as input for a specially developed self-organizing feature map that identifies clusters of different sessions. This way, the behavior of a web site's visitors can be analyzed and employed for web site improvement. The special characteristic of the SOFM is its toroidal topology, which has already shown its advantages when it comes to maintaining the continuity of clusters [10]. In Sect. 2, a technique to compare user sessions in a web site is introduced. Section 3 shows how the user behavior vector and the distance between web pages are used as input for self-organizing feature maps in order to cluster sessions. Section 4 presents the application of the suggested methodology to a particular web site. Finally, Sect. 5 concludes the present work and points at extensions.
2 Comparing User Sessions in a Web Site

2.1 User Behavior Vector Based on Web Site Visits
We define two variables of interest: the set of pages visited by the user and the time spent on each one of them. This information can be obtained from the log files of the web site, which are preprocessed using the sessionization process [5, 6]. Definition 1. User Behavior Vector. U = {u(1), . . . , u(V )} where u(i) = (up (i), ut (i)), and up (i) is the web page that the user visits in the event i of his session. ut (i) is the time the user spent visiting the web page. V is the number of pages visited in a certain session. Figure 1 shows a common structure of a web site. If we have a user visiting the pages 1,3,6,11 and spending 3, 40, 5, 16 seconds, respectively, the corresponding user vector is: U = ((1,3),(3,40),(6,5),(11,16)).
Fig. 1. A common structure of a web site and the representation of user behavior vectors
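A minimal sketch of Definition 1 for the example session above is given below. The padding to a fixed length L with (0, 0) pairs follows the modification introduced after Figure 1; the chosen L is an arbitrary illustrative value.

def user_behavior_vector(pages, times, max_len):
    """Build U = ((u_p(1), u_t(1)), ...) and pad it up to L components."""
    u = list(zip(pages, times))            # (visited page, time spent) pairs
    u += [(0, 0)] * (max_len - len(u))     # padding with (0, 0)
    return u

U = user_behavior_vector([1, 3, 6, 11], [3, 40, 5, 16], max_len=6)
print(U)   # [(1, 3), (3, 40), (6, 5), (11, 16), (0, 0), (0, 0)]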
After completing the sessionization step, we have the pages visited and the time spent on each one of them, except for the last visited page, of which we only know when its visit began. An approximation for this value is to take the average of the time spent on the other pages visited in the same session. Additionally, it is necessary to consider that the number of web pages visited by different users varies. Thus the numbers of components in the respective user behavior vectors differ. However, we can introduce a modification in order to create vectors with the same number of components. Let L be the maximum number of components in a vector, and U a vector with S components so that S ≤ L. Then the modified vector is:

U'(k) = (u_p(k), u_t(k)) for 1 ≤ k ≤ S, and U'(k) = (0, 0) for S < k ≤ L.    (1)

{<α1, β1>, <α2, β2>, ..., <αn, βn>}    (9)
where α_i and β_i are the numbers of coherent and incoherent pixels of color i, and h_i = α_i + β_i. We separately perform ICA on the coherent part {α_1, ..., α_n} and the incoherent part {β_1, ..., β_n}. The new color index contains the ICA features of the two vectors. Fig. 3 compares the retrieval performance of ICA and ICA+CCV. It can be seen that ICA+CCV outperforms ICA when the number of total components (coherent and incoherent) k > 17.
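A possible implementation of this index, sketched with scikit-learn's FastICA, is shown below; the array shapes and the number of components per part are illustrative assumptions rather than the settings used in the experiments.

import numpy as np
from sklearn.decomposition import FastICA

def ccv_ica_index(coherent, incoherent, n_components=9):
    """coherent, incoherent: arrays of shape (n_images, n_colors)."""
    ica_c = FastICA(n_components=n_components, random_state=0)
    ica_i = FastICA(n_components=n_components, random_state=0)
    feat_c = ica_c.fit_transform(coherent)     # ICA features of the coherent part
    feat_i = ica_i.fit_transform(incoherent)   # ICA features of the incoherent part
    return np.hstack([feat_c, feat_i])         # combined color index

index = ccv_ica_index(np.random.rand(100, 64), np.random.rand(100, 64))
print(index.shape)   # (100, 18)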
6 Conclusion
In this paper, we apply ICA to extract a new index from the color histogram. ICA generalizes the technique of PCA and has proven to be an effective tool for finding
Fig. 3. The retrieval performance of ICA + CCV
structure in data. Differing from PCA, ICA is not restricted to an orthogonal transformation and can find structure in non-Gaussian data. We first use PCA to reduce the dimensionality of the color histogram, and then apply ICA to the low-dimensional subspace. The effectiveness of the proposed color index has been demonstrated by experiments. Comparisons with PCA and SVD show that the ICA feature outperforms these low-dimensional color indices in terms of retrieval accuracy. Experiments were also done to incorporate spatial information into the new index. We perform ICA on the two parts of the CCV and obtain improved accuracy with slightly more components.
Acknowledgements This research was partly supported by the Outstanding Overseas Chinese Scholars Fund of Chinese Academy of Science.
Face Recognition Using Overcomplete Independent Component Analysis
Jian Cheng 1, Hanqing Lu 1, Yen-Wei Chen 2, and Xiang-Yan Zeng 2
1 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, P.O.Box 2728, 100080, P.R.China {jcheng,luhq}@nlpr.ia.ac.cn
2 Department of EEE, Faculty of Engineering, University of the Ryukyus, Japan [email protected], [email protected]
Abstract. Most current face recognition algorithms find a set of basis functions in a subspace by training on the input data. However, in many applications the training data is limited or only a few training samples are available. In this case, these classic algorithms degrade rapidly. Overcomplete independent component analysis (overcomplete ICA) can separate out more source signals than there are observed signals. In this paper, we use overcomplete ICA for face recognition with limited training data. The experimental results show that overcomplete ICA can effectively improve the recognition rate.
1 Introduction
The face recognition problem has attracted much research effort in the last 10 years. Among the popular face recognition techniques, the eigenfaces method, proposed by Turk and Pentland [1], is one of the most successful. The eigenfaces method is based on Principal Component Analysis (PCA), which decreases the dimensionality of the input data by decorrelating its second-order statistical dependencies. However, PCA cannot represent higher-order statistical dependencies such as relationships among three or more pixels. Independent Component Analysis (ICA) [2,3] is a relatively recent technique that can be considered a generalization of PCA. The ICA technique finds a linear transform of the input data using a set of basis functions that are not only decorrelated but also as mutually independent as possible. ICA has been successfully applied to face recognition by Bartlett [4], whose results show that ICA representations are superior to PCA for face recognition. However, both the PCA and ICA techniques for face recognition have a serious drawback: they need a large-scale database to learn the basis functions. In general, a sufficiently large face database cannot always be obtained; in other words, the face database is often not large enough to obtain a high recognition precision rate. Recently, an extension of the ICA model, overcomplete independent component analysis, has received more and more attention. In a distinct difference from standard ICA, overcomplete ICA assumes more sources than observations. In this paper, we propose a new
face recognition approach using overcomplete ICA with a small training face database. The paper is organized as follows: in Section 2, we give a brief introduction to overcomplete ICA. The experimental results are shown in Section 3. The concluding remarks are given in the final section.
2 Overcomplete Independent Component Analysis
The overcomplete independent component analysis model can be denoted as

x = As + ε    (1)

where x = (x_1, x_2, ..., x_m)^T is the m-dimensional vector of observed variables, s = (s_1, s_2, ..., s_n)^T is the n-dimensional vector of latent variables or source signals, A is an m × n unknown mixing matrix, and ε is assumed to be white Gaussian noise. In standard independent component analysis, the mixing matrix A is a square matrix, i.e. the dimension of the observed variables equals the dimension of the source signals, whereas m < n in overcomplete ICA. In general, this is an ill-posed problem. Several methods have been proposed for estimating the mixing matrix A. For example, Lewicki and Sejnowski [5,6,7] use a Bayesian method for inferring an optimal basis to find efficient codes. Their method comprises two main steps: inferring the sources s and learning the basis matrix A.

2.1 Inferring the Sources
Given the model (1), the noise ε is assumed to be white Gaussian noise with variance σ², so that

log P(x | A, s) ∝ − (1 / (2σ²)) |x − As|²    (2)

Using Bayes' rule, s can be inferred from x:

P(s | x, A) ∝ P(x | A, s) P(s)    (3)
We assume the prior distribution of s is the Laplacian distribution P(s) ∝ exp(−θ|s|), which constrains the components of s to be sparse and statistically independent. Maximizing the posterior distribution P(s | x, A), s can be approximated as:

ŝ = argmax_s P(s | x, A)
  = argmax_s [ log P(x | A, s) + log P(s) ]
  = argmin_s (1 / (2σ²)) |x − As|² + θ^T |s|    (4)
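The following numpy sketch solves the optimisation in Eq. (4) with a simple iterative soft-thresholding loop. This is only one convenient solver, not the inference procedure of Lewicki and Sejnowski, and all numerical settings below are illustrative.

import numpy as np

def infer_sources(x, A, sigma=0.1, theta=1.0, n_iter=200):
    """Minimize 1/(2*sigma^2)*||x - A s||^2 + theta*||s||_1 by ISTA."""
    m, n = A.shape
    s = np.zeros(n)
    # Lipschitz constant of the smooth term's gradient: ||A||_2^2 / sigma^2
    L = np.linalg.norm(A, 2) ** 2 / sigma ** 2
    for _ in range(n_iter):
        grad = -A.T @ (x - A @ s) / sigma ** 2
        z = s - grad / L
        s = np.sign(z) * np.maximum(np.abs(z) - theta / L, 0.0)  # soft threshold
    return s

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 30))                        # m = 20 observations, n = 30 sources
x = A @ (rng.random(30) * (rng.random(30) < 0.2))    # sparse toy ground truth
print(infer_sources(x, A).round(2))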
2.2 Learning the Basis Vectors
The objective for learning the basis vectors is to obtain a good model of the observed data. The expectation of the log-probability of the observed data under the model can assess the goodness of fit:

L = E{ log P(x | A) }    (5)

where

P(x | A) = ∫ ds P(x | A, s) P(s)    (6)

An approximation of L can be obtained using the Gaussian approximation to the posterior:

L ≈ const. + E{ − (1 / (2σ²)) |x − Aŝ|² + log P(ŝ) − (1/2) log det H }    (7)

where H is the Hessian matrix of the log posterior at ŝ. Performing gradient ascent on L with respect to A and multiplying by AA^T yields the iteration equation for learning the basis vectors:

ΔA ∝ − A ( z s^T + A^T A H^{-1} )    (8)

where z = d log P(s) / ds.
3 Experimental Results
Most classic face recognition algorithms are subspace analysis methods such as PCA (eigenfaces), LDA, etc. These algorithms represent face images by mapping the input data from a high-dimensional space to a low-dimensional subspace. Training the subspace usually needs a large-scale face database. However, in many applications the training data is limited or only a few training images are available. In this instance, in order to improve the recognition rate, we must obtain as much information as possible from the limited input data. As shown in Section 2, overcomplete ICA can obtain more source signals than observed signals. We performed face recognition experiments using overcomplete ICA to extract the face features.
Our experiments were performed on face images from a subset of the FERET database. The training data contained 20 face images selected randomly from the test database. The test database includes 70 individuals, each with 6 face images of different luminance and expression, 420 face images in all. Each face image has 112 × 92 pixels. We use the model described by Eq. (1): x is a 20 × 10304 matrix, and each row is a training face. A is set to 20 rows and 30 columns. The overcomplete ICA algorithm introduced in Section 2 is used to produce the source matrix s with 30 rows and 10304 columns. Each row of s is a source face image. The 30 source faces are shown in Fig. 1.
Fig.1. 30 source faces separated from the 20 training faces using the overcomplete ICA
In order to compare the overcomplete ICA algorithm with other algorithms, face recognition experiments are performed on the test face database. We compare overcomplete ICA with PCA [1] and standard ICA [4] based on the same training and test databases described above. In [4], there are two ICA models for face recognition; we selected the first model, which is superior to the second. First, the source face images in overcomplete ICA, the eigenfaces in PCA and the independent components in standard ICA are inferred by each algorithm as a set of basis functions B. Second, each face image a from the test database is projected onto the set of basis functions B. The coefficients f are considered as the feature vector of this face for recognition.
f = B * a    (9)

The face recognition performance is evaluated on the feature vector f using the cosine as the similarity measure; the nearest face is the most similar:

d_ij = (f_i · f_j) / (||f_i|| ||f_j||)    (10)
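A small numpy sketch of Eqs. (9) and (10) is given below. The basis and face vectors are random stand-ins, and the projection is written as a plain matrix product.

import numpy as np

def extract_feature(B, a):
    """B: (n_basis, n_pixels) basis images, a: (n_pixels,) face vector."""
    return B @ a                      # f = B * a

def cosine_similarity(f_i, f_j):
    return f_i @ f_j / (np.linalg.norm(f_i) * np.linalg.norm(f_j))

rng = np.random.default_rng(0)
B = rng.normal(size=(30, 112 * 92))    # 30 source faces used as basis images
gallery = rng.normal(size=(5, 112 * 92))
query = gallery[2] + 0.1 * rng.normal(size=112 * 92)

feats = np.array([extract_feature(B, g) for g in gallery])
q = extract_feature(B, query)
best = int(np.argmax([cosine_similarity(q, f) for f in feats]))
print(best)   # index of the most similar gallery face (2 in this toy example)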
Fig. 2 compares the face recognition performance of overcomplete ICA, standard ICA and PCA. The horizontal axis is the number n of most similar face images to the query face image; the vertical axis is the average precision over the test database. Fig. 2 shows a trend for overcomplete ICA and standard ICA to outperform the PCA algorithm. Overcomplete ICA is a slight improvement upon standard ICA.
Fig. 2. The horizontal axis is the first n most similar face images to the query face image. The vertical axis is the average precision over all test face images. The dashed line represents overcomplete ICA, the solid line represents standard ICA and the dash-dotted line represents PCA
4 Conclusions
In this paper, we applied overcomplete ICA to learn efficient basis functions for face recognition. Three different algorithms for face recognition have been compared. The experimental results demonstrated that overcomplete ICA can improve the precision rate of face recognition.
Acknowledgement This research was partly supported by the Outstanding Overseas Chinese Scholars Fund of Chinese Academy of Science.
References
[1] M. Turk and A. Pentland: Eigenfaces for Recognition. Journal of Cognitive Neuroscience, 3(1): 77-86, 1991.
[2] A. Bell and T. Sejnowski: An Information Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation, Vol. 7, 1129-1159, 1995.
[3] A. Hyvärinen: Fast and Robust Fixed-Point Algorithms for Independent Component Analysis. IEEE Trans. on Neural Networks, Vol. 10, No. 3, 626-634, 1999.
[4] M.S. Bartlett, J.B. Movellan and T.J. Sejnowski: Face Recognition by Independent Component Analysis. IEEE Trans. on Neural Networks, Vol. 13(6), 1450-1464, 2002.
[5] M.S. Lewicki and T.J. Sejnowski: Learning Overcomplete Representations. Neural Computation, Vol. 12, 337-365, 2000.
[6] M.S. Lewicki, B.A. Olshausen: A Probabilistic Framework for the Adaptation and Comparison of Image Codes. J. Opt. Soc. Am. A: Optics, Image Science and Vision, Vol. 16(7), 1587-1601, 1999.
[7] T.-W. Lee, M.S. Lewicki, M. Girolami and T.J. Sejnowski: Blind Source Separation of More Sources Than Mixtures Using Overcomplete Representations. IEEE Sig. Proc. Lett., Vol. 6(4), 87-90, 1999.
An ICA-Based Method for Poisson Noise Reduction Xian-Hua Han, Yen-Wei Chen, and Zensho Nakao Department of EEE, Faculty of Engineering, University of the Ryukyus, Japan [email protected] [email protected]
Abstract. Many imaging systems rely on photon detection as the basis of image formation. One of the major sources of error in these systems is Poisson noise due to the quantum nature of the photon detection process. Unlike additive Gaussian noise, Poisson noise is signal-dependent, and consequently separating signal from noise is a very difficult task. In most current Poisson noise reduction algorithms, the noisy signal is first pre-processed so that the noise approximates Gaussian noise, and is then denoised by a conventional Gaussian denoising algorithm. In this paper, based on the property that Poisson noise depends on the intensity of the signal, we develop and analyze a new method using an optimal ICA-domain filter for Poisson noise removal. The performance of this algorithm is assessed with simulated data experiments, and the experimental results demonstrate that this algorithm greatly improves image denoising performance.
1 Introduction
In medical and astronomical imaging systems, the images obtained are often contaminated by noise, and the noise usually obeys a Poisson law and hence is highly dependent on the underlying intensity pattern being imaged. The contaminated image can thus be decomposed into the true mean intensity and Poisson noise, where the noise represents the variability of pixel amplitude about the true mean intensity. It is well known that the variance of a Poisson random variable is equal to its mean. Therefore, the variability of the noise is proportional to the intensity of the image and hence signal-dependent [1]. This signal dependence makes it much more difficult to separate signal from noise. Current methods for Poisson noise reduction mainly follow two strategies. One is to work with the square root of the noisy image, since the square-root operation is a variance-stabilizing transformation [8]. However, after this preprocessing the Poisson noise does not tend to white Gaussian noise when the number of photon counts is small, so it is not completely suitable to adopt a Gaussian noise reduction algorithm. The other strategy is the method of wavelet shrinkage. The basis functions of the wavelet transform, however, are fixed and cannot adapt to different kinds of data sets. Recently, an ICA-based denoising method has been developed by Hyvarinen and his co-workers [2][3][4]. The basic motivation behind this method is that the ICA
components of many signals are often very sparse, so that one can remove noise in the ICA domain. It has been proven that ICA-domain filtering for denoising non-Gaussian signals corrupted by Gaussian noise can perform well by applying a soft-threshold (shrinkage) operator to the components of the sparse code [5]. But for data sets contaminated by Poisson noise, it is necessary to develop a new ICA-domain filter that adapts to the signal-dependent nature of the noise. In this paper, we develop a novel ICA-domain shrinkage procedure for noise removal in images contaminated by Poisson noise. The shrinkage scheme (filter) adapts to both the signal and the noise, and balances the trade-off between noise removal and excessive smoothing of image details. The filtering procedure has a simple interpretation as a joint edge detection/estimation process. Our method is closely related to the method of wavelet shrinkage, but it has the important benefit over wavelet methods that the representation is determined solely by the statistical properties of the data sets. Therefore, ICA-based methods may perform better than wavelet-based methods in denoising applications. The paper is organized as follows: in the second section, we review the ICA-based denoising algorithm. Section 3 gives a new ICA-domain shrinkage scheme for Poisson noise. The experimental results are shown in Section 4. The concluding remarks are given in the final section.
2 ICA Based Denoising Algorithm
To use the method of Hyvarinen et al. to denoise a signal corrupted by Gaussian noise, one first needs to employ the fixed-point algorithm on noise-free data to get the transformation matrix (basis functions), and then to use maximum likelihood to estimate the parameters of the shrinkage scheme. Assume that we observe an n-dimensional vector contaminated by Gaussian noise. We denote by x the observed noisy vector, by s the original non-Gaussian vector and by u the noise signal. Then we obtain

x = s + u    (1)
In the method of Hyvarinen et al., u is Gaussian white noise. The goal of signal denoising is to find š = g(x) such that š is close to s in some well-defined sense. The ICA-based denoising method of Hyvarinen et al. works as follows [5]:

Step 1. Estimate an orthogonal ICA transformation matrix W using a set of noise-free representative data z.

Step 2. For i = 1, ..., n, estimate a density which approximates the actual distribution of the variable s_i = w_i^T z (where w_i is the ith column of W). Based on the estimated model and the variance of u (assumed to be known), determine the nonlinear function g_i.
model and the variance of u (assumed to be known), determine the nonlinear function g i .
Step 3. For each observed x, the final denoising procedure is: (1) ICA transform, (2) nonlinear shrinkage, (3) reverse transform:

    y = Wx                                                     (2)
    š_i = g_i(y_i)                                             (3)
    š = W^T [š_1, …, š_n]^T                                    (4)
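As a rough illustration only (not the authors' implementation), Steps 1-3 can be sketched in a few lines of NumPy. Here W is assumed to have been estimated and orthogonalized beforehand from noise-free data, and a simple soft-threshold is used as a stand-in for the maximum-likelihood shrinkage nonlinearities g_i; all names are illustrative.

    import numpy as np

    def ica_domain_denoise(x, W, shrink_fns):
        """Steps 1-3: ICA transform, per-component shrinkage, reverse transform.
        x: (n, m) array of m observed patch vectors; W: (n, n) orthogonal ICA matrix."""
        y = W @ x                                                       # eq. (2)
        s_hat = np.vstack([g(y[i]) for i, g in enumerate(shrink_fns)])  # eq. (3)
        return W.T @ s_hat                                              # eq. (4)

    def soft_threshold(t):
        """Stand-in shrinkage nonlinearity; the paper's g_i come from a density fit."""
        return lambda u: np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

    # Toy usage with random stand-in data (64-dimensional patches).
    W = np.linalg.qr(np.random.randn(64, 64))[0]   # any orthogonal matrix, for illustration
    x = np.random.randn(64, 1000)
    x_denoised = ica_domain_denoise(x, W, [soft_threshold(0.5)] * 64)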
The ICA-based method needs additional noise-free data to estimate the transformation matrix W and the shrinkage nonlinearities. Although we cannot obtain the exact original of the noisy image, we can use images with a structure similar to that of the noisy image. For example, if the noisy image is a man-made scene, we choose a similar noise-free man-made image to estimate the transformation matrix W. In the method of Hyvarinen et al., the additive noise is Gaussian white noise, whereas our goal is to reduce Poisson noise. In the next section, based on the special properties of Poisson noise, we give an efficient shrinkage scheme that can be obtained directly from the noisy data.
3 New Shrinkage Scheme Adapted to Poisson Noise
In the case of Poisson noise, after the ICA transformation of the image the noise power differs between ICA-domain coefficients depending on the image intensity. This spatial variation of the noise must be accounted for in the design of the ICA-domain shrinkage function; the shrinkage function of Hyvarinen et al. does not adjust to these differences. Given the signal and noise power, a natural choice for an ICA-domain shrinkage function is

    š = s, if the SNR in s is high
    š = 0, if the SNR in s is low                              (5)
According to R. D. Nowak et al. [6], a cross-validation algorithm can be used to design an optimal shrinkage function of this form. Since the ICA transform matrix (basis functions) has properties similar to wavelet basis functions, we can directly adopt the nonlinear shrinkage function of wavelet-domain denoising. The optimal shrinkage function in [6] takes the form

    š = s (s² − δ²) / s²                                       (6)

where š is the ICA-domain denoised coefficient and s is the ICA-domain noisy coefficient; š² and δ² denote, respectively, the power of the noise-free signal and of the Poisson noise, and the signal power is estimated as

    š² = s² − δ²                                               (7)
Here s is obtained directly from the ICA transformation, while the noise power of the i-th component, δ_i², is obtained from
    δ_i² = (w_i ∘ w_i) x                                       (8)

where ∘ denotes the element-wise product.
In this way we obtain the noise power of each sample of the noisy data in the ICA domain and can then denoise each sample according to its SNR. The shrinkage function can be interpreted as follows. Because the ICA transform matrix W can be viewed as a set of local directional filters, the ICA-domain coefficients can be thought of as projections of the image onto localized "details". For the noise-power estimate, we project the image onto the square of the corresponding filter, which effectively computes a weighted average of the local intensity in the image; by the properties of Poisson noise, this is an approximation of the noise power. The estimate of the noise power therefore adapts to local variations in the signal and the noise. The shrinkage function simply weights each noisy ICA coefficient s by a factor equal to the estimated signal power divided by the estimated signal-plus-noise power. If the estimated signal power is negative, the shrinkage function thresholds the ICA-domain coefficient to zero. Hence, this optimal shrinkage function has a very simple interpretation as a data-adaptive ICA-domain Wiener filter.
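As a minimal sketch of equations (6)-(8) (assuming x is a vector of raw, non-negative photon counts and W is the orthogonalized ICA matrix; function and variable names are illustrative, not the authors' code):

    import numpy as np

    def poisson_ica_shrinkage(x, W):
        """Wiener-like ICA-domain shrinkage for one patch vector x of Poisson counts."""
        x = np.asarray(x, dtype=float)
        s = W @ x                                  # ICA-domain noisy coefficients
        noise_power = (W * W) @ x                  # eq. (8): delta_i^2 = (w_i o w_i) x
        signal_power = np.maximum(s ** 2 - noise_power, 0.0)      # eq. (7), floored at zero
        gain = np.divide(signal_power, s ** 2,
                         out=np.zeros_like(s), where=s ** 2 > 0)  # eq. (6)
        return W.T @ (gain * s)                    # back to the patch domain

Negative signal-power estimates are floored at zero, which reproduces the thresholding behaviour described above.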
4 Experimental Results
In this section, we compare the performance of the proposed ICA-domain shrinkage scheme, the modified Wiener filter, and R. D. Nowak's wavelet denoising method [1]. In our experiment, we use the standard Lena image and simulate a Poisson process on it to obtain the Poisson-noisy image. The original and noisy images are shown in Fig. 1(a) and (b), respectively. We used Tony Bell and T. J. Sejnowski's infomax algorithm to learn the ICA transformation matrix W [7][8]: 8x8 sub-windows were randomly sampled from noise-free images and presented as 64-dimensional vectors, the DC value was removed from each vector as a preprocessing step, and the infomax algorithm was run on these vectors to obtain W. For the reason given in [5], we orthogonalize W by

    W = W (W^T W)^(-1/2)                                       (9)
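Equation (9) (symmetric orthogonalization) can be computed, for instance, via an eigendecomposition of W^T W; the following is a generic sketch, not the code used in the experiments.

    import numpy as np

    def symmetric_orthogonalize(W):
        """Return W (W^T W)^(-1/2), the orthogonal matrix closest to W (eq. (9))."""
        eigvals, eigvecs = np.linalg.eigh(W.T @ W)   # W^T W is symmetric positive definite
        inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
        return W @ inv_sqrt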
After the ICA transformation, the denoising algorithm was applied to every 8x8 sub-window of the image, so that 64 reconstructions were obtained for each pixel; the final result is the mean of these reconstructions, as sketched below. Experimental results are shown in Fig. 1. We denoise the Lena image contaminated by Poisson noise using the modified Wiener filter (Wiener filtering on the square root of the noisy image), the wavelet-domain optimal filter, and our method. It is evident that our method produces better noise-removal results than the other two. Table 1 gives the SNR and M.S.E. of the three algorithms; for the comparison, the intensities of all images are normalized to [0, 255]. From Table 1, we see that our method obtains a higher SNR and a smaller M.S.E.
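One way to realize this overlapping-window reconstruction is a dense sliding-window loop such as the sketch below (illustrative only; denoise_patch stands for any patch-domain filter, for example the Poisson shrinkage sketched above, and border handling is simplified).

    import numpy as np

    def denoise_image(image, denoise_patch, patch=8):
        """Denoise every overlapping patch and average the reconstructions per pixel."""
        H, W = image.shape
        acc = np.zeros((H, W), dtype=float)
        count = np.zeros((H, W), dtype=float)
        for r in range(H - patch + 1):
            for c in range(W - patch + 1):
                block = image[r:r + patch, c:c + patch].astype(float)
                rec = denoise_patch(block.ravel()).reshape(patch, patch)
                acc[r:r + patch, c:c + patch] += rec
                count[r:r + patch, c:c + patch] += 1.0
        return acc / count

For instance, it could be called as denoise_image(noisy, lambda p: poisson_ica_shrinkage(p, W)) with the shrinkage sketch above.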
Fig. 1. (a) Original image; (b) Poisson-noisy image; (c) result of the Wiener filter applied to the square root of the noisy image; (d) result of the wavelet-domain filter; (e) result of our method
Table 1. SNR and M.S.E. comparison

    Method                       S/N (dB)    M.S.E.
    Noisy image                  10.4249     5.7885
    Modified Wiener filtering    15.6969     1.7085
    Wavelet transformation       15.7572     1.6830
    Our denoising algorithm      19.9366     0.6375

5 Conclusion
The usual Poisson denoising methods are mainly the modified Wiener filter and wavelet shrinkage. However, the preprocessing of the modified Wiener filter cannot turn Poisson noise into approximately Gaussian noise at low counts, and for wavelet shrinkage the basis functions are fixed. The ICA-based method adapts the transform matrix to the data and thus provides a new way to improve denoising performance. However, it needs additional noise-free data to estimate the transformation matrix W. Future work will focus on how to obtain the ICA transformation matrix directly from noisy data.
References

[1] R. D. Nowak and R. Baraniuk, "Wavelet Domain Filtering for Photon Imaging Systems," IEEE Transactions on Image Processing, May 1999.
[2] A. Hyvarinen, E. Oja, and P. Hoyer, "Image Denoising by Sparse Code Shrinkage," in S. Haykin and B. Kosko (eds.), Intelligent Signal Processing, IEEE Press, 2000.
[3] P. Hoyer, Independent Component Analysis in Image Denoising, Master's Thesis, Helsinki University of Technology, 1999.
[4] R. Oktem et al., Transform Based Denoising Algorithms: Comparative Study, Tampere University of Technology, 1999.
[5] A. Hyvarinen, "Sparse Code Shrinkage: Denoising of Nongaussian Data by Maximum Likelihood Estimation," Neural Computation, 11(7): 1739-1768, 1999.
[6] R. D. Nowak, "Optimal signal estimation using cross-validation," IEEE Signal Processing Letters, vol. 3, no. 2, pp. 23-25, 1996.
[7] T.-W. Lee, M. Girolami, A. J. Bell, and T. J. Sejnowski, "A Unifying Information-theoretic Framework for Independent Component Analysis," Computers & Mathematics with Applications, Vol. 31(11), pp. 1-21, March 2000.
[8] T.-W. Lee, M. Girolami, and T. J. Sejnowski, "Independent Component Analysis using an Extended Infomax Algorithm for Mixed Sub-Gaussian and Super-Gaussian Sources," Neural Computation, Vol. 11(2): 417-441, 1999.
Recursive Approach for Real-Time Blind Source Separation of Acoustic Signals

Shuxue Ding and Jie Huang
School of Computer Science and Engineering, The University of Aizu
Tsuruga, Ikki-machi, Aizu-Wakamatsu City, Fukushima, 965-8580 Japan
[email protected]  [email protected]
Abstract. In this paper we propose and investigate a recursive approach to blind source separation (BSS) and independent component analysis (ICA). Based on this approach we present a deterministic (i.e., without stochastic learning) algorithm for real-time blind source separation of convolutive mixtures. When applied to acoustic signals, the algorithm shows, in our simulations, a superior rate of convergence over its gradient-based counterpart. By applying the algorithm in a real-time BSS system for realistic acoustic signals, we also give experiments that illustrate the effectiveness and validity of the algorithm.
1 Introduction
Separating acoustic sources blindly in a real-world environment has proven to be a very challenging problem [1]. A usual way to approach it is to model the real-world superposition of audio sources as a mixture of delayed and filtered versions of the sources. To separate the sources successfully, one needs to estimate the relative delays between channels and the weights of the filter taps as accurately as possible. Estimating the weights accurately is difficult because the filters required to model a realistic environment are very long. In [2], the authors studied blind separation of acoustic sources in real-world environments, especially inside vehicles. In that approach, the sources are first separated in the time-frequency domain, where the mixture becomes instantaneous, and the separated output signals are then reconstructed in the time domain. A similar consideration can be found in [3], though a different BSS criterion is used. Although this approach is quite effective, it can only work in batch mode or in a semi-real-time mode with a large buffer. Other approaches are therefore needed to realize truly real-time blind source separation. One possibility is to adapt the filter weights in the time domain by a gradient search for the minimum of a corresponding cost function; indeed, there is a large body of work on such methods (e.g., [4, 5, 6, 7]).
However, a major problem, namely convergence speed, arises when such an approach is applied to acoustic signals. The gradient search for the minimum still does not converge fast enough for usual realistic applications, even though the so-called natural gradient approach has greatly improved the convergence of the ordinary gradient approach [4]. Since the transmission channels are usually time-variant in real-world environments, the learning process must converge as fast as possible to track these variations. The convergence becomes slower as the eigenvalue spread of the correlation matrix of the input signals grows, so the slow convergence of gradient-based approaches appears to be deeply related to the non-stationarity of acoustic sources. The local minima of the cost function also contribute to the slow convergence: in our experience, a gradient learning process rarely converges to the true minimum in realistic situations, since there are too many local minima. It is helpful to recall the situation of adaptive filtering with supervised learning. Adaptive processing is usually implemented with the least mean square (LMS) algorithm, which also converges slowly for signals with large eigenvalue spreads; the recursive least squares (RLS) algorithm, however, can remarkably improve the situation [8]. The least-squares problem is formulated as a normal equation with respect to the samples observed so far, which corresponds deterministically to the true minimum of the weighted mean square error. By solving the normal equation, one reaches the true minimum directly, avoiding the traps of local minima. The motivation of this paper is to investigate the possibility of developing a recursive type of BSS approach that likewise improves the convergence speed of conventional gradient-based approaches. In this approach, we address the problems that arise from the non-stationarity of the source signals and, at the same time, the local-minimum problem of the cost function. A deterministic algorithm (without a stochastic learning process) is presented for real-time blind separation of convolutively mixed sources. When applied to acoustic signals, the algorithm shows, in our simulations, a superior rate of convergence and a lower cost-function floor than its gradient-based counterpart. By applying the algorithm in a real-time BSS system for realistic acoustic signals, we also give experiments that illustrate its effectiveness and validity.
2 Problem Formulation
We assume M statistically independent sources s(t) = [s_1(t), ..., s_M(t)]^T, where t is the sample index. These sources are convolved and mixed in a linear medium, leading to N sensor signals x(t) = [x_1(t), ..., x_N(t)]^T,

    x(t) = A ∗ s(t)                                            (1)

where ∗ denotes the convolution operator and A is an N × M matrix of filters describing the transmission channels. At this stage of the discussion we ignore the sensor-noise term for simplicity.
The purpose of BSS is to find an inverse model W of A such that

    y(t) = W ∗ x(t)                                            (2)

and the components of y(t) become as independent as possible. We can transform equations (1) and (2) into the frequency domain,

    X(ω, t) = A(ω)S(ω, t)                                      (3)

and

    Y(ω, t) = W(ω)X(ω, t)                                      (4)

where X(ω, t) = DFT([x(t), ..., x(t + L − 1)]), Y(ω, t) = DFT([y(t), ..., y(t + L − 1)]), and W(ω) = DFT(W). Here DFT denotes the discrete Fourier transform and L is its length. In recursive implementations of the BSS approach, we start the computation from known initial conditions and use the information contained in new samples to update the old estimate of the optimal solution. The length of the observable samples is therefore variable. Moreover, we separate the sources in the frequency domain, since this is more efficient than working in the time domain. Accordingly, we express the cost function to be minimized as l(ω, n), where ω is the frequency and n is the variable length of the observable sample blocks. Conveniently, if we set n = 1 for the initial sample block, n equals the index of the current sample block. Similarly, the separation matrix becomes W(ω, n), which is also n-dependent. As a first step toward a recursive approach, in this paper we use a cost function based on second-order moments of the signals; there have already been many discussions of convolutive BSS with such cost functions [3, 5, 6, 7, 9]. Since the subband signals on different frequency bins are approximately orthogonal to each other, we can realize source separation by separating each bin independently. A difference from the previous discussions, however, is that we introduce a weighting factor into the cost function, as is customary in recursive approaches. We thus write the cost function as

    l(ω, n) = Σ_{i≠j} |(R_Y(ω, n))_{ij}|²                      (5)
where

    (R_Y(ω, n))_{ij} ≡ Σ_{k=0}^{n} β(n, k) Y_i(ω, kδ) Y_j^H(ω, kδ) / Λ_i        (6)

is the weighted correlation matrix of the outputs. Here β(n, k) is the weighting factor, δ is the shift in samples between neighboring blocks, and Λ_i is a normalization factor (the variance of the i-th output) for the covariance of the i-th output. The use of the weighting factor β(n, k) is intended to ensure that samples in the distant past are "forgotten", in order to afford the possibility of following the
statistical variations of the observable samples when the BSS operates in a non-stationary environment. A special form of weighting that is commonly used is the exponential weighting, or forgetting, factor defined by β(n, k) = λ^{n−k} for k = 1, 2, ..., n, where λ is a positive constant close to, but less than, 1. By equations (4) and (6), the weighted correlation matrix of the outputs can be written as

    R_Y(ω, n) = W(ω, n) R_X(ω, n) W^H(ω, n)                    (7)

where

    (R_X(ω, n))_{ij} ≡ Σ_{k=0}^{n} λ^{n−k} X_i(ω, kδ) X_j^H(ω, kδ)              (8)
is the weighted correlation matrix of the normalized observations in the frequency domain. The problem of blind source separation can now be formulated as finding W(ω, n) such that the cost function l(ω, n) attains its minimum value. In BSS and ICA, the traditional way to find the minimum of the cost function is to exploit stochastic gradient optimization [4, 1]. Instead, we give a different, recursive approach. The idea is that when the outputs of the BSS become independent of each other, the weighted cross-correlations of the outputs in equation (7) should become approximately zero, i.e.,

    R_Y(ω, n) = I,  or  W(ω, n) R_X(ω, n) W^H(ω, n) = I        (9)

We might call equation (9) the "normal equation" of BSS, in analogy with the normal equation of the RLS algorithm in adaptive filtering [8]. The problem of BSS can now be recast as the problem of finding solutions of equation (9).
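For a single frequency bin, the cost (5) and the departure from the normal equation (9) can be checked with a few lines of NumPy; this is only an illustrative sketch with generic complex matrices, ignoring the Λ_i normalization, and is not part of the authors' system.

    import numpy as np

    def separation_cost(W, R_X):
        """Sum of squared off-diagonal entries of R_Y = W R_X W^H (cf. eq. (5));
        it vanishes when the normal equation (9) is satisfied."""
        R_Y = W @ R_X @ W.conj().T
        off_diag = R_Y - np.diag(np.diag(R_Y))
        return float(np.sum(np.abs(off_diag) ** 2))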
3 A Recursive Algorithm for BSS/ICA
We now give a recursive approach for estimating R_X(ω, n) and a deterministic method for solving equation (9), which makes online processing quite easy. First, equation (9) can be written as

    W(ω, n) W^H(ω, n) = (R_X(ω, n))^{−1}                       (10)

It is easy to show that the n-th correlation matrix R_X(ω, n) is related to the (n−1)-th correlation matrix R_X(ω, n − 1) by

    (R_X(ω, n))_{ij} = λ (R_X(ω, n − 1))_{ij} + X_i(ω, nδ) X_j^H(ω, nδ)         (11)

where R_X(ω, n − 1) is the previous value of the correlation matrix, and the product X_i(ω, nδ) X_j^H(ω, nδ) plays the role of a "correction" term in the updating operation.
According to the matrix inversion lemma [4], we obtain

    (R_X^{−1}(ω, n))_{ij} = λ^{−1} (R_X^{−1}(ω, n − 1))_{ij}
        − ( λ^{−2} R_X^{−1}(ω, n − 1) X(ω, n) X^H(ω, n) R_X^{−1}(ω, n − 1)
            / (1 + λ^{−1} X^H(ω, n) R_X^{−1}(ω, n − 1) X(ω, n)) )_{ij}          (12)
Fig. 1 shows a block diagram of our recursive algorithm; the DFT of the input signals and the IDFT of the output signals are omitted from the figure. We adopt the overlap-and-save method [8] for the real-time DFT-IDFT processing. This method is needed because (1) in order for the separation filters to perform linear rather than cyclic convolutions, part of the tap weights must be set to zero [8]; and (2) the so-called permutation problem [3, 2] can be solved by constraining the solutions W(ω, n) to filters whose time response vanishes beyond a fixed length [6]. The initial condition for the recursive processing is R_X^{−1}(ω, n) = I for n ≤ 0. In conventional gradient-type algorithms, some parameters related to the output signals, for example the score function, usually need to be estimated, and the results of these estimations are fed back to update the separation filter matrix. From Fig. 1, however, we can see that the recursive BSS algorithm involves no such feedback at all: all estimates used to update the separation matrix are computed from the input signals only.
Fig. 1. Block diagram for recursive BSS
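A per-bin sketch of the recursive core, equations (11)-(12) together with one way of solving W W^H = R_X^{−1} from equation (10), might look as follows. The Hermitian square root used here is only one of the valid factorizations (W is determined only up to a unitary factor), the names are illustrative, and the overlap-and-save filtering and permutation constraints of the full system are omitted.

    import numpy as np

    def rls_bss_update(R_X_inv, X, lam=0.95):
        """One recursive update for a single frequency bin.
        R_X_inv: (N, N) complex inverse weighted correlation matrix (previous block).
        X: (N, 1) complex frequency-domain observation for the current block."""
        denom = 1.0 + (X.conj().T @ R_X_inv @ X) / lam       # scalar denominator in eq. (12)
        k = (R_X_inv @ X) / lam
        R_X_inv = R_X_inv / lam - (k @ k.conj().T) / denom   # eq. (12), via eq. (11)
        # One separation matrix with W W^H = R_X_inv (eq. (10)): the Hermitian square root.
        vals, vecs = np.linalg.eigh((R_X_inv + R_X_inv.conj().T) / 2)
        W = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.conj().T
        return R_X_inv, W

    # Initial condition, as in the text: R_X_inv = I for each bin when n <= 0.

In a complete system this update would run independently for every frequency bin ω of each new sample block, with the resulting W(ω, n) constrained in the time domain as described above before filtering the signals.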
4 Simulations and Experiments
For a blind separation system, since no reference signal is available and the parameters of the mixing model are unknown, it is not straightforward to define a performance measure. Because the cost function defined by equation (5) is related to the cross-correlations between the outputs, and the cross-correlation between independent sources can be shown to be very small, it can be taken as a quantitative measure of separation performance. In this paper, we consider only the case M = 2 and N = 2; some additional background noise is, however, present.

4.1 Simulation Results for Separation of Real-World Benchmarks
In this subsection, a real-world benchmark recording downloaded from the web [10] is used to evaluate the recursive BSS algorithm. Figure 2 shows the learning curves of the conventional gradient-based BSS and of the recursive BSS proposed in this paper; the results clearly show the superior rate of convergence of the recursive BSS over the gradient-based algorithm. In this figure, the vertical axis shows the value of the cost function and the horizontal axis the iteration number. Since a single iteration processes one signal sample block, the horizontal axis also corresponds to the sample block number n. The wild fluctuations of the learning curves in Fig. 2 and the following figures are due to the non-stationarity of the sources.
Fig. 2. Learning curves for recursive BSS (λ = 0.95) and gradient BSS algorithms (µ = 0.01, optimized). Signals: rss mA and rss mB (Lee [10]); Size of filter taps=1024; Length of FFT=4096
Fig. 3. Learning curves for the recursive BSS (λ = 0.95) and gradient BSS algorithms (µ = 0.01, optimized). Signals: real-world recordings in vehicle environment; Size of filter taps = 2048; Length of FFT=8192
4.2 Experimental Results for Separation of Real-World Recordings
Real-time experiments were implemented both as a Simulink model and as a model on a TMS320C6701 Evaluation Module board from Texas Instruments. The experiments were done with audio recorded in a real acoustic environment (an automobile). The automobile interior used for the recordings measured 114.0 cm × 136.5 cm × 233.0 cm (height × width × depth). Two persons read sentences, and the resulting sound was recorded by two microphones spaced 10.0 cm apart. The recordings were digitized at 16 bits per sample with a sampling rate of 44.1 kHz. The acoustic environment was corrupted by engine noise and other directionless noise. The learning curves in Fig. 3 again clearly show the superior rate of convergence of the recursive BSS over the gradient-based algorithm. The meanings of the vertical and horizontal axes are the same as in Fig. 2.
5 Conclusions and Discussions
In this paper we have proposed and investigated a recursive approach to the real-time implementation of BSS/ICA. Based on this approach we have presented a deterministic algorithm for real-time blind separation of convolutively mixed sources. When applied to acoustic signals, the algorithm has shown a superior rate of convergence over its gradient-based counterpart in
our simulations. By applying the algorithm in a real-time BSS system for realistic acoustic signals, we have also given experiments that illustrate its effectiveness and validity. At the present stage we have only realized a recursive BSS algorithm with a cost function based on second-order statistics of the signals. The approach, however, is general enough to be applied to cost functions based on higher-order statistics as well; we intend to present such considerations and investigations in the future.
References

[1] Torkkola, K.: Blind separation of convolved sources based on information maximization. In S. Usui, Y. Tohkura, S. Katagiri, and E. Wilson (eds.), Proc. NNSP96, pp. 423-432, New York, NY, IEEE Press, 1996
[2] Ding, S., Otsuka, M., Ashizawa, M., Niitsuma, T., and Sugai, K.: Blind source separation of real-world acoustic signals based on ICA in time-frequency domain. Technical Report of IEICE, Vol. EA2001-1, pp. 1-8, 2001
[3] Murata, N., Ikeda, S., and Ziehe, A.: An approach to blind source separation based on temporal structure of speech signals. BSIS Technical Reports, 98-2, 1998
[4] Cichocki, A. and Amari, S.: Adaptive Blind Signal and Image Processing. John Wiley & Sons, Ltd., 2002
[5] Kawamoto, M., Matsuoka, K., and Ohnishi, N.: A method of blind separation for convolved non-stationary signals. Neurocomputing, Vol. 22, pp. 157-171, 1998
[6] Parra, L. and Spence, C.: Convolutive blind separation of non-stationary sources. IEEE Trans. Speech Audio Proc., Vol. 8, No. 3, pp. 320-327, May 2000
[7] van de Laar, J., Habets, E., Peters, J., and Lokkart, P.: Adaptive Blind Audio Signal Separation on a DSP. Proc. ProRISC 2001, pp. 475-479, 2002
[8] Haykin, S.: Adaptive Filter Theory. 3rd Edition, Prentice-Hall, Inc., 1996
[9] Schobben, D. W. E. and Sommen, P. C. W.: A new blind signal separation algorithm based on second order statistics. Proc. IASTED, pp. 564-569, 1998
[10] Lee, T.: http://www.cnl.salk.edu/~tewon/